Linux Home Automation - Linkcheck.pl

Introduction

A while back I found that I needed a way to check the links on my various web pages. I searched around and looked in all the usual places: Google Search (linux "url checking" OR "link checking") and Freshmeat (link check). I probably used other search criteria too, but that was back in 2000. I found various programs, but all were really complicated to set up and use. All I wanted to do was put my URL in a file and tell the program to go check the URLs on that page. So, not finding one that was simple to use, I wrote one. I won't pretend to be a programmer, but the program does work. I need to add more options and allow you to turn off others, but I'm not that far along yet. So I'll be updating it from time to time, adding the necessary options and probably a few new features.

If you're looking for a better link checker, check out the W3C Link Checker. The W3C creates a lot of excellent tools, and not only can you use their link checker directly online, you can also download their code. I'll still work on my code, but from time to time I'll use theirs to catch any other mistakes.

Right now linkcheck can't detect sites that are gone but have been picked up by link farms (grrr!). I need to figure out a way to check for that. Yes, I have an idea, but I haven't tried to implement it yet.

System requirements

Some kind of Unix - I've tested under Linux but other Unix systems should work.
Sendmail - I've set this up so you can pipe the output directly to sendmail (see below)
Perl - I'm using 5.8 but I would expect that 5.6 would have no problems running this code.
The HTML::TokeParser module - earlier versions of my code didn't need this module. I'm glad someone else took care of this as the regular expressions were beginning to get quite hairy.
The LWP module
The HTTP::Request module
The HTTP::Response module
The Getopt::Long module
The MIME::Base64 module
The Encode module (Encode::decode)

Download

There are several versions. The oldest (V0.1) outputs plain text. The color version (V0.2) outputs HTML, V0.3 adds a base64 encoding option, and the latest (V1.2) has HTML output and the base64 encoding option.

linkcheck.pl.gz - V0.1 - Outputs plain text.
linkcheck-color.pl.gz - V0.2 - Outputs HTML. This allows me to see color when viewing the email. Red links are the bad ones, blue links are normal links, and black links are ones that have not been checked (such as mailto: or news:).
linkcheck-color3.pl - V0.3 - Like V0.2 it outputs HTML, but I've added an option to encode the HTML output in base64.
V1.2 is in development and moves the To: and From: default strings into the rc file. If you don't change the defaults or set up To:/From: in the rc file, you'll see a warning. I've also changed the revision/version numbering. I consider this one stable and I've moved it under RCS control. I've added an array of User Agents (libwww/Perl tends to get a 4xx error on some sites), corrected a few errors, and made other odds-and-ends changes that I just can't remember (I've been running this software for 8 years, I can't remember everything).
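
The User Agent array mentioned for V1.2 might work roughly like the sketch below. This is a Python illustration, not linkcheck's Perl code. The agent strings, the function names, and the rule "move on to the next agent on any 4xx" are all assumptions made for the example.

```python
import urllib.error
import urllib.request

# Hypothetical User-Agent strings; the real list used by linkcheck is not shown.
USER_AGENTS = [
    "libwww-perl/5.8",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


def http_status(url, agent):
    """Fetch url with the given User-Agent header and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"User-Agent": agent})
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still available.
        return err.code


def check_with_agents(url, fetch_status=http_status, agents=USER_AGENTS):
    """Try each agent in turn and stop at the first reply that is not a 4xx.

    Returns (status, agent) for the reply that was accepted, or for the
    last attempt if every agent got a 4xx. Returns None if agents is empty.
    """
    last = None
    for agent in agents:
        status = fetch_status(url, agent)
        last = (status, agent)
        if not 400 <= status < 500:
            break
    return last
```

Passing the fetch function in as a parameter keeps the retry logic testable without touching the network.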

How to set up

The rc file is called .linkcheckrc and it resides in your home directory. It can be made up of blank lines, lines that start with a # (a comment) and the URL. Make sure that the URL (http://...) starts in column 0 otherwise the URL will be ignored. Here's a sample linkcheckrc file:

# Comments
To: root@example.com
From: webadmin@example.com
Subject: Linkcheck - `date "+%b %d,%Y"` # You can add Unix commands

#
http://www.cookie.uucp/~njc/Personal/athome/other/Coffee/Coffee.html

#http://www.wolfgang.uucp/~njc/Personal/athome/ha_blogs.html
http://www.cookie.uucp/~njc/linkcheck_test.html

http://www.linuxha.com/other/linkcheck/index.html

Really nothing too complicated. Each URL is a page that linkcheck will check. The current versions (0.1 through 0.3.1c) don't follow links to other pages, so if you want to check other pages you will need to list them in your rc file.

Version V1.2 supports To: and From:. The tag line Subject: is also supported but it won't add the date or anything else to the message (a later version maybe). The new default for To: is root@example.com and From: is webadmin@example.com.
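
To make the rc file rules concrete, here is a minimal Python sketch of a parser that follows them: blank lines and # comments are skipped, To:/From:/Subject: lines set mail headers, and a URL only counts if it starts in the first column. It is an illustration only and not the parsing code in linkcheck.pl. The function name is hypothetical, accepting https:// is an assumption (the text above only mentions http://), and the sketch neither runs backtick commands nor strips trailing comments from a Subject: line.

```python
import re

# Defaults described above for V1.2.
DEFAULTS = {"To": "root@example.com", "From": "webadmin@example.com"}

# Header lines must also start in the first column.
HEADER = re.compile(r"^(To|From|Subject):\s*(.*)$")


def parse_linkcheckrc(text):
    """Return (headers, urls) parsed from the contents of a .linkcheckrc file."""
    headers = dict(DEFAULTS)
    urls = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        match = HEADER.match(line)
        if match:
            headers[match.group(1)] = match.group(2).strip()
        elif line.startswith(("http://", "https://")):  # https is an assumption
            urls.append(line.strip())
        # Anything else, including an indented URL, is ignored.
    return headers, urls
```

With this sketch, a commented-out URL and a URL preceded by a space are both skipped, and a file without To: or From: lines falls back to the defaults.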

Some of the options

As with most Unix programs there's the cryptic command line help. :-)

$ linkcheck.pl -?
linkcheck - checks links on a given web page
        --base64 - encode html in base64
        --linkrc <file> - alternate linkrc file
        --to <recipient> - alternate recipient
        --from <sender> - alternate sender
        --subject <subject> - alternate mail subject
        --debug - turn on the debug messages
        --help - this message

--base64 - encode html in base64. This option wraps the HTML output in base64 encoding. It is new in V0.3 and later.
--linkrc <file> - alternate linkrc file, the default is $HOME/.linkcheckrc
--to <recipient> - alternate recipient
--from <sender> - alternate sender
--subject <subject> - alternate mail subject
--debug - turn on the debug messages
--help - this message

How to run linkcheck

To run linkcheck at the command line type:

$ (perl linkcheck.pl | /usr/sbin/sendmail -t) &

You could just as easily add the command to cron. You would add a few lines that look like this:

##[ Link Check ]#########################
# This is my Link checking program for my web pages
# the links to check can be found in the ~/.linkcheckrc file
# Run on Sunday 1AM
00 1 * * 7 perl ~/bin/linkcheck-color.pl |/usr/sbin/sendmail -t -n

Or like this if you need base64 encoding:

##[ Link Check ]#########################
# This is my Link checking program for my web pages
# the links to check can be found in the ~/.linkcheckrc file
# Run on Sunday 1AM
00 1 * * 7 perl ~/bin/linkcheck-color.pl -b |/usr/sbin/sendmail -t -n
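
Because the output is piped to sendmail -t, which reads the recipients from the message headers, the program has to print a complete message: headers, a blank line, then the body. The Python sketch below shows what such a message could look like with and without base64 encoding. It is not taken from linkcheck.pl. The function name and the exact set of MIME headers are assumptions for illustration.

```python
import base64


def build_message(to, sender, subject, html, use_base64=False):
    """Return a mail message (headers, blank line, body) as one string."""
    headers = [
        f"To: {to}",
        f"From: {sender}",
        f"Subject: {subject}",
        "MIME-Version: 1.0",
        "Content-Type: text/html; charset=utf-8",
    ]
    body = html
    if use_base64:
        headers.append("Content-Transfer-Encoding: base64")
        # encodebytes wraps the output in short lines, which suits mail bodies.
        body = base64.encodebytes(html.encode("utf-8")).decode("ascii")
    return "\n".join(headers) + "\n\n" + body


if __name__ == "__main__":
    print(build_message("root@example.com", "webadmin@example.com",
                        "Linkcheck", "<p>All links OK</p>\n", use_base64=True))
```

Base64 keeps every line of the body short and 7-bit clean, so long HTML lines or non-ASCII characters in the report should pass through mail servers unchanged.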

Here's a sample output that was created with linkcheck-color.pl. This is what I saw in Thunderbird when I opened the email message.

A sample page to check

Here's the sample page I used to perform the initial tests of linkcheck:

Sample test page HTML Source

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
            "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd">
<html>
<head>

<!-- meta -->

   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta http-equiv="expires"  content="Sun, 15 jan 1999 17:00:00 gmt">
   <meta http-equiv="Cache-Control" content="must-revalidate">
   <meta name="keywords" content="Linux home automation, x10, hcs ii, linux,
               source code, neil cherry, software, hardware, weather
               station, wm918, wx200">
   <meta name="description" content="comments ...">
   <link href="style.css" type="text/css" rel="stylesheet">
<!-- athome.title -->

   <title>Linkcheck test page</title>
</head>
<!--<body bgcolor="#FFA5CC" vlink="#FF0000">-->
<body bgcolor="#FAEBD7">


<!-- athome.top -->
<a name="myTop"> </a>

<h1>Linkcheck test page</h1>

<blockquote>

<p> Home  automation and home  control (HA) Software (source  code and
    links  mostly)  for  the   home  automation  devices  CM11A,  CM17
    (Firecracker), LynX10,  WM918, HCS II  and CPUXA.  Links  to other
    hardware/software packages  can be found  on these pages  also. If
    you  know  of  any  additional  links please  contact  me  at:  <a
    href="mailto:&#110;&#099;&#104;&#101;&#114;&#114;&#121;&#064;&#099;&#111;&#109;&#099;&#097;&#115;&#116;&#046;&#110;&#101;&#116;?subject=RE:Linux%20HA%20Pages"&gt;&#110;&#099;&#104;&#101;&#114;&#114;&#121;&#064;&#099;&#111;&#109;&#099;&#097;&#115;&#116;&#046;&#110;&#101;&#116;
</a>
</p>

</blockquote>

<p>

<!-- disclaimer -->

<blockquote>

  <b>Disclaimer: </b> None of the opinions expressed on these pages are
  paid for . They are strictly my own and
  may not represent an endorsement of someone's project, product or
  service (unless otherwise stated so).

</blockquote>
<hr> <!-- ************************************************************** -->

<h1> Index (Last updated: Saturday September 11, 2004)</h1>
<!-- index -->
<a name="Index"></a>


<ul>
  <li><a href="index.html">Linkcheck</a>Good link</li>
  <li><a href="nonexistant_page1.html">Bad link I</a></li>
  <li><a href="nonexistant_page2.html">Bad link II</a></li>
  <li><a href="index.html#Software">Forward link to this same page</a></li>
  <li><a href="nonexistant_page3.html">Bad link III</a></li>
</ul>

<!-- update -->

<hr/><!-- ************************************************************** -->
<p class="invisible"><a href="http://home.comcast.net/~ncherry/">Linux Home Automation</a></p>
<br/>

If you have questions about Home Automation and/or Linux you may email
me at <a
href="mailto:ncherry@comcast.net">ncherry@comcast.net</a>. This email
address is not for unsolicated email (if <b><i>I</i></b> didn't opt-in then it's
unsolicated). <br/>

<p/>Please come back and visit my page again (hopefully this is worth reading).

<p>Last updated: Saturday September 11, 2004 
<!-- athome.bottom -->

<a name="bottom"></a>

<hr> 

<br> 

<br> 
</body>
</html>

You'll notice all sorts of gibberish in the source, such as strings of numeric character references like &#110;. Those are valid in URLs, so I needed to test for them. Visit the sample page to see what it looks like in your browser.

What linkcheck doesn't check is the fragment part of a link, such as the "#cli" in "index.html#cli". If index.html is valid, that's good enough for linkcheck.
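
The behavior described here and in the color legend (mailto: and news: links are listed but not checked, and only the page part of a link like "index.html#cli" matters) can be sketched in Python as follows. linkcheck itself uses Perl and HTML::TokeParser, so this is only an illustration. The class and function names are hypothetical, and treating only http and https links as checkable is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit

# Assumption: only these schemes are fetched; everything else is just listed.
CHECKED_SCHEMES = {"http", "https"}


class LinkCollector(HTMLParser):
    """Collect <a href="..."> targets, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.to_check = []  # absolute URLs to fetch, without fragments
        self.skipped = []   # links that are reported but not fetched

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # Character references such as &#110; are already decoded here.
        href = dict(attrs).get("href")
        if not href:
            return  # e.g. <a name="bottom"></a>
        absolute = urljoin(self.base_url, href)
        if urlsplit(absolute).scheme in CHECKED_SCHEMES:
            url, _fragment = urldefrag(absolute)  # drop "#cli" and the like
            if url not in self.to_check:
                self.to_check.append(url)
        else:
            self.skipped.append(absolute)


def collect_links(html, base_url):
    """Return (to_check, skipped) for one page of HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.to_check, parser.skipped
```

Run against the index list in the sample page, this sketch would yield index.html once (the "#Software" variant collapses into it) plus the three nonexistent pages, while the mailto: links end up in the skipped list.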

Bugs

I'm not really sure what to make of this one. It may not be a bug in linkcheck itself but in the Perl modules. Don't worry if you see:


Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/vendor_perl/5.8.8/LWP/Protocol.pm line 114, <IniFile> line 20.

Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/vendor_perl/5.8.8/LWP/Protocol.pm line 114, <IniFile> line 24.

From what I've read, this means that the library modules received data that they think doesn't match the page encoding. At the moment I'm not certain how to resolve this, but it doesn't seem to cause any problems, so just accept it as a warning message that can be ignored.

If you have questions about linkcheck, Home Automation and/or Linux you may email me at: ncherry@linuxha.com. This email address is not for unsolicited email (if I didn't opt-in then it's unsolicited).