Saturday, November 11, 2006

Hello HTML Scraper

Here is a little Java application that shows one way to scrape data from an HTML page. I wrote a small utility class that I can use to scrape various HTML elements from any web page.

I used the nekohtml project (which is an incubated Apache project) to parse the HTML page. The library is pretty simple to use: you give it the URL of an HTML page, and it parses the page and turns it into an XML DOM (Document Object Model) object. Each element in the HTML page is represented as a node in the DOM.
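To make the DOM idea concrete, here is a minimal sketch of walking such a tree. It is not the code from the hello project. To keep it runnable on a plain JDK, it assumes a small, well-formed page held in a string and parses it with the JDK's built-in DocumentBuilder instead of nekohtml. The class name DomWalk and its methods are made up for illustration. The walking part uses only the standard org.w3c.dom interfaces, so it should apply to any parser that hands back a W3C DOM.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the hello project or of nekohtml.
class DomWalk {

    // Parse well-formed markup into a W3C DOM Document with the JDK parser.
    // Assumption: the input is valid XML; real-world HTML usually is not.
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    // Depth-first walk that records the name of every element node.
    static void collectElementNames(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(node.getNodeName());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectElementNames(child, names);
        }
    }

    // Convenience wrapper: parse the markup and list its element names.
    static List<String> elementNames(String markup) {
        List<String> names = new ArrayList<>();
        collectElementNames(parse(markup), names);
        return names;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Hi</title></head>"
                + "<body><p>Hello <a href=\"http://example.com/\">link</a></p></body></html>";
        System.out.println(elementNames(page));
    }
}
```

Running main prints the element names in document order. For a page fetched from the web you would replace the parse method with an HTML-tolerant parser, since the JDK parser rejects markup that is not well-formed XML.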

Whenever I encounter some new technology I try to make a habit of creating an example in the 'hello' project on SourceForge. I start my project with Maven and then work on it in the Eclipse IDE. Exactly how I do that is described in the README.txt file in the project.

The CVS repository information for checking out the example into your Eclipse IDE is:

host = hello.cvs.sourceforge.net
repository path = /cvsroot/hello
user = anonymous
module = /development/examples/hello-html

A unit test, which scrapes all the hyperlinks from the Google home page, is here. It tests a little utility class, HTMLPage, that I wrote using the nekohtml library.

My utility class can be used for scraping anything from a page. The code for the utility class implementation is here.
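The real HTMLPage source lives in the repository and is not reproduced here. As a rough illustration of the link-scraping idea, the sketch below pulls the href of every anchor out of a DOM. It rests on a few assumptions: the page is a small, well-formed string rather than a live URL, the JDK's DocumentBuilder stands in for nekohtml so that it runs without extra jars, and the class LinkScraper and its method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for a link scraper; not the author's HTMLPage class.
class LinkScraper {

    // Return the href value of every <a> element, in document order.
    // Assumption: the markup is well-formed XML and uses lower-case tag names.
    static List<String> extractLinks(String markup) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }

        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            // Skip named anchors that carry no href attribute.
            if (anchor.hasAttribute("href")) {
                links.add(anchor.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"/about\">About</a>"
                + "<a name=\"top\">no link here</a>"
                + "<a href=\"/news\">News</a>"
                + "</body></html>";
        for (String link : extractLinks(page)) {
            System.out.println(link);
        }
    }
}
```

With nekohtml the Document would come from its own parser rather than from DocumentBuilder. As far as I know, nekohtml reports element names in upper case by default, so the tag name to look up would probably be "A" rather than "a"; check that against the nekohtml documentation before relying on it.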

This example shows the basics of how to do HTML scraping.

On reflection, though, it is perhaps a better example of how not to write a teaching example. I do not think it is a very good one, for the following reasons:

  • It tries to be too general when it should have just shown how to use the nekohtml API.
  • It unnecessarily introduces a complicated language feature, Java generics, that is not important to the teaching example.
  • And if you were looking carefully, you will have noticed that I mistakenly called my test a unit test when it is really a system test.

I would write the example over again, except I think you guys probably still got the basics from it.

That's it for this week.

Catch you later,
Tony

