After realising I have content dating back to the year 2015, I took it upon myself to investigate how I could run a link check across my website. The requirements were:
- it must check every link on every page of my website.
- it must check external links.
- it must not follow links found on external pages.
From my brief research session, I found the following solutions:
- Wget;
- LinkChecker;
- the W3C Link Checker; and
- my personal favourite, linkcheck.
Wget
Wget is a package available on most Linux distributions. The following command achieved most of my requirements:
$ wget --spider --recursive -o www.danielfranklin.id.au.log https://www.danielfranklin.id.au
The --spider flag instructs Wget to only crawl the website and not download any of the files or assets it comes across.
The --recursive flag defaults to five levels deep, which covered the majority of my website but extended to external links too: if a link at the second level was external, every link on that external page counted as the third level, and so on. This meant not every link on my website was covered, and increasing the depth would also increase the time taken to complete the crawl.
The -o flag instructs Wget to write the crawl log to the file specified. In this case, the log would be written to www.danielfranklin.id.au.log.
Finally, my website address is passed as the website that Wget will crawl.
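Once the crawl finishes, the log can be mined for failures with standard tools. The snippet below is only a sketch: the log excerpt is fabricated for illustration (Wget's exact log layout varies by version and options) and stands in for the real www.danielfranklin.id.au.log.

```shell
# Fabricated excerpt standing in for the real crawl log; Wget writes the
# request URL on one line and the HTTP response on a following line.
cat > sample.log <<'EOF'
--2024-01-01 10:00:00--  https://www.example.com/about
HTTP request sent, awaiting response... 200 OK
--2024-01-01 10:00:01--  https://www.example.com/old-post
HTTP request sent, awaiting response... 404 Not Found
EOF

# Pull out the URL that precedes each 404 response.
grep -B1 '404 Not Found' sample.log | grep -o 'https://[^ ]*'
```

On the sample above this prints the single broken URL; on a real log you may need to widen the `-B` window to account for intermediate "Resolving"/"Connecting" lines.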
LinkChecker
LinkChecker is an open source crawler that has been around since 2000. It improves upon Wget by recognising internal and external links and only following the former, which means there is no need for a recursion depth limit:
$ linkchecker -F text https://www.danielfranklin.id.au
The -F flag tells LinkChecker to output a log file in the format specified. There are several output formats available in LinkChecker, such as HTML, CSV or XML, but I opted for the text format.
As far as my requirements go, LinkChecker met all of them, and I found it reliable and easy to set up.
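A nice side effect of the text format is that it is easy to post-process. The snippet below is a sketch using a fabricated excerpt that only approximates LinkChecker's report layout (the real layout varies by version), counting the failing entries and listing their URLs.

```shell
# Fabricated excerpt approximating LinkChecker's text report; each entry
# has a URL line followed by a Result line.
cat > report.txt <<'EOF'
URL        `https://www.example.com/ok'
Result     Valid: 200 OK

URL        `https://www.example.com/broken'
Result     Error: 404 Not Found
EOF

# Count failing entries, then list the URL above each error line.
grep -c 'Error:' report.txt
grep -B1 'Error:' report.txt | grep -o "https://[^']*"
```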
W3C Link Checker
The W3C Link Checker is best described as an online variant of Wget. It does much the same without the need to install or set up anything locally, but it suffers from the same recursion issue and so will not automatically check every single link.
linkcheck
linkcheck, not to be confused with LinkChecker, also meets all of my requirements and is very speedy. It also checks anchors and reports any invalid ones, which is immensely useful.
I opted to use Docker to run linkcheck. There are other options such as installing an executable or installing locally (which requires installing Google Dart).
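For reference, the Docker route can be sketched as below. The image name tennox/linkcheck is an assumption on my part (check the linkcheck README for the currently published image), and the guard simply keeps the sketch harmless on machines without a working Docker daemon.

```shell
# Sketch: run linkcheck from its Docker image against the site.
# The image name (tennox/linkcheck) is an assumption; verify it first.
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  docker run --rm tennox/linkcheck https://www.danielfranklin.id.au
else
  echo "Docker unavailable; use the standalone linkcheck executable instead"
fi
```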
Between linkcheck and LinkChecker, linkcheck feels that little bit faster and has a very streamlined user experience, from the crawling to the output logs.