6. Handling javascript links

Update October 2004: Automatic javascript links checking
Update November 2004: Regex script refinement
Update January 2005: Regex script refinement 2

Most websites have navbars with javascript links, which give access to the deeper levels of the site. Typically linkcheckers can't read these links, Xenu included.

Commercial linkcheckers claim to be able to read javascript embedded links, either by "executing" or "parsing" these links.

Parsing means analysing the links found in the javascript code and resolving them against the root URL of the scan. This sometimes goes wrong: when a relative link path is instead appended to the URL of the page on which the link was found, the result can be an incorrect URL.

Executing avoids this possibility of misconstructed URLs, but takes more time.

Javascript links pose two problems for linkchecking:

When they occur in the navbar
When they occur in the web pages

There is a workaround for both of these issues!

Javascript links in the navbar

My solution for this problem has always been to create a simple htm page, to which the navbar links are added as straightforward clickable links. By pushing this file live on your website and using it as the startpage for the Xenu linkscan, all navbar links are included in the scan.

It is important that this new startpage is posted on the site, and not kept on your local computer; otherwise your site will be seen by Xenu as "external".

Let's look at an example of a navbar with javascript links, as I have used it on my site integralworld.net:

The links in this navbar are coded as:

Menu2_2=new Array("My Visit Reports","http://207.44.196.94/~wilber/visits.html","",0,20,150,"","","","","","",-1,-1,-1,"","");

...which is impossible for Xenu to recognize as a link. However, when it is recoded like this, Xenu can spider this link without any difficulty:

<a href="http://207.44.196.94/~wilber/visits.html">LINK</a>

This is what a new start page with clickable navbar links would look like:

<html>
<head>
<title>Xenu linkcheck startpage</title>
</head>
<body>

<a href="welcome2.html">welcome2.html</a> 
<a href="biography.html">biography.html</a> 
<a href="books.html">books.html</a> 
<a href="readingroom.html">readingroom.html</a> 
<a href="institute.html">institute.html</a> 
<a href="kwdp.html">kwdp.html</a> 

</body>
</html>

or posted on the website, it would look like this:

Note: post this linkcheck start page in the same location as your website and start your scan from there, otherwise the scan will fail -- your site will be regarded as external if you have hosted your start page on a different server.

Feel free to add as many links to this start page as you think are necessary to reach the deeper levels of your website.
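If the navbar has many entries, the start page can be generated rather than typed by hand. The following is only a sketch in Python: the function name build_start_page and the list of page names are my own assumptions for illustration, not something Xenu provides.

```python
from html import escape


def build_start_page(pages, title="Xenu linkcheck startpage"):
    """Return a minimal HTML page with one plain clickable link per page."""
    links = "\n".join(
        f'<a href="{escape(p, quote=True)}">{escape(p)}</a>' for p in pages
    )
    return (
        "<html>\n<head>\n"
        f"<title>{escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{links}\n"
        "</body>\n</html>\n"
    )


if __name__ == "__main__":
    # Hypothetical list of navbar targets, copied from the example above.
    navbar_pages = ["welcome2.html", "biography.html", "books.html"]
    print(build_start_page(navbar_pages))
```

Save the output as an .htm file and upload it to the same server as the site before starting the scan from it.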

Javascript links on the web pages

The second hurdle is javascript links that occur in the web pages themselves. Xenu automatically skips these links when scanning a site:

They all get a "skip type" status message.

My solution to this problem has been as follows:

Export the Xenu report to a text file, and import this file into Excel. Delete all unnecessary columns, and sort according to Status, so you can find the "skip type" links. Copy these links to a separate worksheet, and sort them according to link name (column A):

Now copy these URLs (from column A) to a text file in Notepad:

Extract the URLs from this file by stripping the links from the javascript in which they are embedded.

You can write a script that does this for you:

Remove all occurrences of '); at the end of each URL
Remove all occurrences of javascript:function(' at the start of each URL

A good program to do that with is Search and Replace from Funduc: http://www.funduc.com/search_replace.htm
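For those who prefer a script to a search-and-replace tool, the two removal steps above could be sketched in Python as follows. The function name strip_wrapper is hypothetical, and the sketch assumes one javascript link per line with a single quoted argument; links with extra arguments (window size and so on) keep those arguments, which is why the regular expression approach in the next section is more robust.

```python
import re

# Matches the leading  javascript:someFunction('  part of a link.
PREFIX = re.compile(r"^javascript:\s*\w+\s*\(\s*'")
# Matches the trailing  ');  (the semicolon is optional).
SUFFIX = re.compile(r"'\s*\)\s*;?\s*$")


def strip_wrapper(line):
    """Remove the javascript wrapper around a single-argument link."""
    line = line.strip()
    line = PREFIX.sub("", line)
    line = SUFFIX.sub("", line)
    return line


if __name__ == "__main__":
    for link in [
        "javascript:popup('http://www.wi-fi.org/');",
        "javascript:openJump('/business/enterprise/emea/deu/pdf/wp030301.pdf')",
    ]:
        print(strip_wrapper(link))
```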

Then check the javascript-less URLs as described in Chapter 4, "Checking a set of links".

Using Regular Expressions

Instead of stripping the URLs from all code except the URL proper, it is quicker to use regular expressions to catch the URLs directly.

For example, this regular expression (using Funduc's Search and Replace program) does exactly that:

javascript:*\('(f|h|/)*'*

This regular expression does the following:

find all strings starting with "javascript:"
ignore whatever comes next (the name of the function), until:
you encounter a "(" followed by a single quote (the "(" gets escaped by a backslash: \)
this should be followed by either an "f" or an "h" or a "/" (to catch absolute http/ftp URLs or relative URLs)
ignore whatever comes next until you reach the first single quote
ignore everything that comes after that single quote

The replacement string is:

%2%3

meaning:

%1 stands for the name of the function (we are not interested in that here)
%2 stands for either "h" or "f" or "/"
%3 stands for the rest of the URL
adding %2 and %3 gives you the full URL

When the URL in question is:

javascript:openWin ('http://www.intel.com/personal/computing/emea/fra/glossary/index.htm? enh_int_spe','540','420','no','no','no','no','200','200','glossary');

the search and replace leaves exactly:

http://www.intel.com/personal/computing/emea/fra/glossary/index.htm? enh_int_spe

When the list of javascript links in the text file is cleaned up with this regular expression, it can easily be checked with Xenu.
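The same search and replace can be approximated outside Funduc's program. The sketch below is my own rough translation into Python's regular expression syntax, which differs from Funduc's (the wildcards become groups and %2%3 becomes \2\3); the function name extract_url is hypothetical.

```python
import re

# Approximate Python equivalent of  javascript:*\('(f|h|/)*'*
# group 1 = function name, group 2 = first character of the URL,
# group 3 = rest of the URL up to the first closing single quote.
PATTERN = re.compile(r"javascript:(.*?)\('(f|h|/)(.*?)'.*")


def extract_url(line):
    """Replace a whole javascript link by the URL it contains.

    Lines that do not match are returned unchanged.
    """
    return PATTERN.sub(r"\2\3", line)


if __name__ == "__main__":
    link = ("javascript:popup('/products/services/emea/deu/mobiletechnology/"
            "demo/intro.htm','unwire','700','570');")
    print(extract_url(link))
```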

Hopefully future versions of Xenu will be able to read these javascript links directly, so the connection with the source pages is retained.

Added October 2004

As of version 1.2g (now in its beta phase) Xenu will be able to handle javascript in links of the type:

javascript:popup('http://www.wi-fi.org/');
javascript:popup('http://www.wi-fi.org');
javascript:popup('http://www.crn.com/sections/testcenter/tcinside.asp?sectionid=6&articleid=37086');
javascript:popup('/products/services/emea/deu/mobiletechnology/demo/intro.htm','unwire','700','570');
javascript:openSite('http://www.microsoft.com/windowsxp/home/using/howto/homenet/default.asp');
javascript:openJump('/business/enterprise/emea/deu/pdf/wp030301.pdf')
etc, etc.

It does not matter if the URLs

refer to top domains (xxx.com, xxx.net, xxx.de, xxx.nl, etc.), either with or without /
end in extensions (xxx.htm, xxx.shtml, xxx.pdf, xxx.cgi, xxx.gif etc.)
are absolute (http://www.xxx.com/yyy/zzz.htm)
or relative ('/yyy/zzz.htm')
have parameters of the kind "?xxxx=yyyy"
have parameters following the URL, for example height and width of popup windows
etc, etc.

The only thing you need to do is add this line to the .ini file, under [Options]. The .ini file should be in the same folder as the xenu.exe program.

Javascript=javascript:[_a-zA-Z0-9]+ *\( *['"]([^'"]+)['"]

This regular expression does the following:

match the string "javascript:"
match one or more alphanumeric characters or underscores (= the name of the function)
match any spaces following that function name
match an opening (
match any spaces following it
match either a single ' or a double "
match one or more characters that are not a single ' or a double " (the URL you want to catch!)
match either a single ' or a double "
ignore everything that follows
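Before editing the .ini file, the expression can be tried out on a few links. The sketch below uses Python's re module, which happens to accept this pattern unchanged; that Xenu's own regex engine behaves identically in every case is an assumption on my part. The function name extract is hypothetical.

```python
import re

# The value of the Javascript= line, without the "Javascript=" key itself.
JS_LINK = re.compile(r"""javascript:[_a-zA-Z0-9]+ *\( *['"]([^'"]+)['"]""")


def extract(link):
    """Return the first quoted argument of a javascript link, or None."""
    match = JS_LINK.search(link)
    return match.group(1) if match else None


if __name__ == "__main__":
    samples = [
        "javascript:popup('http://www.wi-fi.org/');",
        "javascript:popup('/products/services/emea/deu/mobiletechnology/"
        "demo/intro.htm','unwire','700','570');",
        "javascript:openJump('/business/enterprise/emea/deu/pdf/wp030301.pdf')",
    ]
    for sample in samples:
        print(extract(sample))
```

Only the first quoted argument is captured; the popup dimensions and window names that follow are ignored.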

The results are really fantastic: all javascript embedded links are now extracted from the javascript, treated as "normal" links, and scanned.

Added November 2004

To ensure that only those javascript links are checked that yield a correct URL, it is wise to upgrade your regular expression to:

Javascript=javascript: *[_a-zA-Z0-9]+ *\( *['"]((/|ftp://|https?://)[^'"]+)['"]

This makes the original regular expression more precise. A link is now scanned only if it starts with one of:

/ (a relative URL)
ftp://
http://
https://

All other javascript links, such as these, are skipped:

javascript:launchRemote();
javascript:email();
javascript:tttrPop(8);
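The effect of the refinement can be checked in the same way as before. This is again a Python sketch, under the assumption that Xenu's regex engine treats the pattern the way Python's re module does; the function name extract_url_only and the link javascript:show('glossary') are hypothetical, added to show a quoted argument that is not a URL.

```python
import re

REFINED = re.compile(
    r"""javascript: *[_a-zA-Z0-9]+ *\( *['"]((/|ftp://|https?://)[^'"]+)['"]"""
)


def extract_url_only(link):
    """Return the URL in a javascript link, or None if it should be skipped."""
    match = REFINED.search(link)
    return match.group(1) if match else None


if __name__ == "__main__":
    for link in [
        "javascript:openJump('/business/enterprise/emea/deu/pdf/wp030301.pdf')",
        "javascript:launchRemote();",
        "javascript:email();",
        "javascript:tttrPop(8);",
        "javascript:show('glossary');",  # hypothetical non-URL argument
    ]:
        print(link, "->", extract_url_only(link))
```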

Added January 2005

Please note there can be slight variations in the way "javascript:" is written:

javascript:openJump()
Javascript:openJump()
JavaScript:openJump()
etc.

To catch all these variations, refine your regex with: [Jj]ava[Ss]cript:

Javascript=[Jj]ava[Ss]cript: *[_a-zA-Z0-9]+ *\( *['"]((/|ftp://|https?://)[^'"]+)['"]
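A quick check of the three spellings, sketched in Python under the same assumption as elsewhere in this chapter (that Python's re module and Xenu's regex engine agree on this pattern). The function name matches_variant is hypothetical.

```python
import re

CASE_TOLERANT = re.compile(
    r"""[Jj]ava[Ss]cript: *[_a-zA-Z0-9]+ *\( *['"]((/|ftp://|https?://)[^'"]+)['"]"""
)


def matches_variant(link):
    """Return True if the link is caught by the case-tolerant expression."""
    return CASE_TOLERANT.search(link) is not None


if __name__ == "__main__":
    url_part = "openJump('/business/enterprise/emea/deu/pdf/wp030301.pdf')"
    for prefix in ["javascript:", "Javascript:", "JavaScript:"]:
        print(prefix, matches_variant(prefix + url_part))
```

Note that the character classes cover only the J and the S; a spelling in full capitals such as JAVASCRIPT: would still not be caught.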

A highly sophisticated regular expression, written by Eugeny Sattler but currently not supported by Xenu, is:

javascript:\w+\s*\(\s*['\"]((?:ftp|https?)://[^'\"]+?)['\"](?:\s*,[^,]+?\s*)*\s*\);

See: http://groups.yahoo.com/group/xenu-usergroup/message/102
