Finding Broken Links Using Selenium and Java

Problem statement

Finding broken links on a web page can be broken down into two steps:

1. Finding all the links on the page.

2. Iterating over the links and checking whether each one is broken.

Links on a web page usually live in anchor tags <a> and image tags <img />. An anchor tag carries its URL in the href attribute; note that an image tag normally uses src instead, so a check based only on href mostly covers anchors.

For example: <a href="http://toolsqa.wpengine.com/selenium-introduction/">Introduction</a>. This will appear as a link with the text “Introduction”.

 

Let’s get into the details. Step 1 is to find all image and anchor elements and then filter out those that don’t have an href attribute. Below is the code that shows how to do this.
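A minimal sketch of this step could look like the following. The class name LinkCollector and the helper keepElementsWithAttribute are illustrative names, not part of Selenium. The attribute lookup is passed in as a function so that the sketch compiles without the Selenium jars; the Selenium calls appear only in the closing comment as assumed usage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class LinkCollector {

    // Keeps only the elements for which the attribute lookup returns a non-null value.
    // With Selenium, the lookup would be: element -> element.getAttribute("href").
    public static <T> List<T> keepElementsWithAttribute(List<T> elementList,
                                                        Function<T, String> attributeLookup) {
        List<T> finalList = new ArrayList<>();
        for (T element : elementList) {
            if (attributeLookup.apply(element) != null) {
                finalList.add(element);
            }
        }
        return finalList;
    }

    // Assumed Selenium usage (needs the Selenium jars and a WebDriver named driver):
    //
    //   List<WebElement> elementList = new ArrayList<>(driver.findElements(By.tagName("a")));
    //   elementList.addAll(driver.findElements(By.tagName("img")));
    //   List<WebElement> links =
    //       LinkCollector.keepElementsWithAttribute(elementList, e -> e.getAttribute("href"));
}
```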

As you can see in the above function, we first gather all the anchor and image tags in a List<WebElement> named elementList. Then, in the for loop, we filter out all the elements that don’t have an href attribute. This leaves us with only the elements that carry an href attribute.

Now the most important part is to check whether the links are working. This is step 2. Here I will introduce you to a Java class called HttpURLConnection. This class is used to make HTTP requests to the web server hosting the links extracted in step 1.

You can read more about this class in the Java API documentation for java.net.HttpURLConnection.

The basic idea is to make an HTTP request to each URL extracted in step 1 and look at the response returned by the server. Based on the response, we can figure out whether the link is broken. I will present a very minimal implementation of the HTTP request method. You can modify it to suit your needs, but this should do the trick.
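A minimal sketch of such a method could look like this. The class name LinkStatus, the String parameter, and the 5-second timeouts are illustrative assumptions; adjust them to your project.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class LinkStatus {

    // Makes one HTTP request to the given link and returns the server's
    // response message (for example "OK" or "Not Found").
    // If anything goes wrong, the exception message is returned instead.
    public static String isLinkBroken(String link) {
        HttpURLConnection connection = null;
        try {
            URL url = new URL(link);
            connection = (HttpURLConnection) url.openConnection();
            connection.setConnectTimeout(5000); // assumed timeout values
            connection.setReadTimeout(5000);
            connection.connect();
            return connection.getResponseMessage();
        } catch (Exception exp) {
            return exp.getMessage();
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }
}
```

With Selenium, the link passed in would typically be the value of element.getAttribute("href") for each element kept in step 1.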

This piece of code makes an HTTP call to the server for the link we extracted and returns the server’s response message as a string (see the return statements in the function).

If an exception is thrown, we get the exception message back instead. It contains more details about why the request failed.

Complete code is listed here

Note: don’t mind the many imports; they are left over from my previous project settings. You can remove the unnecessary ones.

I have printed the results to the console. Once you run this code, you will get output like this:

[Image: ResultsLinksBroken (console output of the results)]

Results that you may get

Instead of OK, you may also get a redirect response. This means that the user was redirected to a different URL after the request. This is a tricky situation, because many broken links redirect you to a static error page. Ideally, you should trust only those URLs that returned OK, as shown in the results image above; the rest should be verified manually.
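One way to act on this is to sort responses by status code rather than by message. The sketch below is only an illustration: the class ResponseTriage, the Verdict enum, and the exact grouping of codes are assumptions. It also turns off automatic redirect following on the connection, because HttpURLConnection otherwise follows most redirects itself and the 3xx response would stay hidden.

```java
import java.io.IOException;
import java.net.HttpURLConnection;

public class ResponseTriage {

    public enum Verdict { OK, VERIFY_MANUALLY, BROKEN }

    // 200 is trusted, 4xx/5xx (or no valid response, reported as -1) is broken,
    // and everything else, including 3xx redirects, is flagged for a manual check.
    public static Verdict triage(int responseCode) {
        if (responseCode == HttpURLConnection.HTTP_OK) {
            return Verdict.OK;
        }
        if (responseCode >= 400 || responseCode < 0) {
            return Verdict.BROKEN;
        }
        return Verdict.VERIFY_MANUALLY;
    }

    // Must be called before the connection is opened, so that a redirect
    // is reported as 3xx instead of being followed silently.
    public static Verdict check(HttpURLConnection connection) throws IOException {
        connection.setInstanceFollowRedirects(false);
        return triage(connection.getResponseCode());
    }
}
```

For a response flagged VERIFY_MANUALLY, connection.getHeaderField("Location") usually shows where the redirect points, which helps when deciding whether it leads to an error page.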

 

This is a very simple implementation of finding broken links on a web page. The idea of this post is to cover the basics so that you can build on them to create a more complex structure suited to your project.

Thanks

Virender