Parsing scraped resources

Most of the time you will want to use a customised Parser plugin. This is where the real application logic of scraping a resource tends to live, and the whole purpose of Scraping-Engine is to let you focus on this part without constantly having to deal with the other parts.

Configuration

Parsers have only one configuration option, the parser config.

Xxx.parser=<classname>
This contains the classname of your customised Parser plugin.

Generic parsers

The org.osjava.scraping.parser package contains generic re-usable parsers. At the time of writing, there are two reusable parsers.

  1. PassThroughParser - A null-object-style parser that still does something useful: it simply places the entire page in the results. Useful when you just want to download a resource, or want to test whether your Fetcher is working correctly.
  2. UrlScraper - An abstract superclass which makes life easier if the aim of parsing the HTML page is to come up with a file to download; for example downloading today's Dilbert.
Technically another parser exists, org.osjava.scraping.NullParser, which does absolutely nothing and is what you get if you fail to specify a parser in the configuration. It's possible that PassThroughParser should replace this.

Implementing your own Parser

This involves implementing the org.osjava.scraping.Parser interface or extending the org.osjava.scraping.AbstractParser abstract class, putting the compiled class on the classpath, and adding its classname to the configuration using the parser config.

Xxx.parser=com.example.scraping.SlashdotParser
The Parser interface has three methods:
 void bringDown(Config cfg)          
 Result parse(Page page, Config cfg, Session session)
 void startUp(Config cfg) 
The AbstractParser provides no-op implementations for bringDown and startUp. In short, a Parser's job is to convert a Page object into a Result object. It will usually look something like this:
 public Result parse(Page page, Config cfg, Session session) throws ParsingException {
   String txt = page.readAsString();

   ... find desired data ...

   return new SingularResult( desiredData );
 }
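As an illustration of the "find desired data" step, here is a minimal sketch that pulls the first image URL out of the page text using only java.util.regex. The class FirstImageFinder and its method are hypothetical names invented for this example; they are not part of Scraping-Engine, and a regular expression is only one of many ways to do this.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Scraping-Engine.
final class FirstImageFinder {
    // Matches src="..." or src='...' inside an <img> tag.
    // Deliberately simple; this is not a full HTML parser.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the src of the first <img> tag, or empty if there is none.
    static Optional<String> firstImageSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String txt = "<html><body><img alt=\"strip\" src=\"/comics/today.gif\"></body></html>";
        System.out.println(firstImageSrc(txt).orElse("no image found"));
    }
}
```

Inside a parse method, the string returned by firstImageSrc would play the role of desiredData, or of the relative url to fetch next.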
Usually you're interested in more than a single piece of data and would do something like this:
 public Result parse(Page page, Config cfg, Session session) throws ParsingException {
   String txt = page.readAsString();

   ... build Object[] of desired data in a java.util.List ...

   return new TabularResult( listOfDesiredData.iterator() );
 }
Note that in this second case you are passing in an iterator. It's a very arguable piece of design, so make sure you don't use your listOfDesiredData instance for anything else; it now exists solely for the purposes of the Result. (NOTE: As it's so arguable, it might have to change :) )
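To see why the list should be left alone once its iterator has been handed over, here is a small sketch using only java.util. The class RowIteratorSketch and the "name,value" input format are assumptions made up for this example; the point is simply that a standard ArrayList iterator fails as soon as the list is modified behind its back.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration, not part of Scraping-Engine.
final class RowIteratorSketch {
    // Splits "name,value" lines into Object[] rows held in a java.util.List.
    static List<Object[]> buildRows(String txt) {
        List<Object[]> rows = new ArrayList<>();
        for (String line : txt.split("\n")) {
            if (!line.trim().isEmpty()) {
                rows.add(line.trim().split(","));
            }
        }
        return rows;
    }

    // Returns true if changing the list after handing out its iterator
    // makes the iterator unusable.
    static boolean iteratorBreaksAfterListChange(List<Object[]> rows) {
        Iterator<Object[]> handedOver = rows.iterator(); // what a Result would receive
        rows.add(new Object[] {"late", "row"});          // the list gets reused afterwards
        try {
            handedOver.next();
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Object[]> rows = buildRows("alpha,1\nbeta,2\n");
        System.out.println("rows built: " + rows.size());
        System.out.println("iterator broken by reuse: " + iteratorBreaksAfterListChange(rows));
    }
}
```

If you do need the data elsewhere, building a separate copy of the list before creating the iterator avoids the problem.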

Walking a website

When scraping from a website, it is common to have to walk a few pages first to get that site's session into the right state, or quite simply to figure out the url to use today. Rather than having to implement this code yourself, or somehow re-route things back to the start of the Scraping-Engine, the Page class has a method to fetch the next url for you.

 Page fetch(String uri, Config cfg, Session session) throws FetchingException
As pages know about their document-base, you can also obtain this next url using a relative url. For example, while scraping "http://www.osjava.org/scraping-engine/index.html", you could use the page to get "apidocs/index.html" and it would know that you mean "http://www.osjava.org/scraping-engine/apidocs/index.html".
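The relative-url behaviour described above follows the standard resolution rules, which can be reproduced with java.net.URI. This is only a sketch of the concept: the class RelativeUrlSketch is a hypothetical name, and whether Page.fetch uses java.net.URI internally is an assumption, not something this page states.

```java
import java.net.URI;

// Hypothetical illustration, not part of Scraping-Engine.
final class RelativeUrlSketch {
    // Resolves a relative url against the url of the page it was found on.
    static String resolve(String documentBase, String relative) {
        return URI.create(documentBase).resolve(relative).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.osjava.org/scraping-engine/index.html";
        // Relative to the current directory of the page.
        System.out.println(resolve(base, "apidocs/index.html"));
        // A leading slash means relative to the site root.
        System.out.println(resolve(base, "/index.html"));
    }
}
```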

Parsing the data

You may have noticed something about this page on writing your own parser. It goes on and on, but at no point mentions how to actually parse the data you've downloaded. This is entirely intentional, for two reasons:

  • Scraping-Engine is parser-technology neutral.
  • There are many types of file out there to parse. HTML is the most common, but RSS, iCalendar and other XML formats run it a close second; CSV, Excel, mbox and even images jump out as potential parse targets.
Given that you've made it this far, there are three parsing libraries that I'll quickly advertise.
  1. HTML parsing with gj-scrape.
  2. CSV parsing with gj-csv.
  3. XML parsing with gj-xml.