Most of the time you will want to use a customised Parser plugin. This is where the real application logic of scraping a resource tends to go, and the whole purpose of Scraping-Engine is to let you focus on this part without constantly having to deal with the others.
Parsers have only one configuration option, the parser config.
    Xxx.parser=<classname>
This contains the classname of your customised Parser plugin.
The org.osjava.scraping.parser package contains generic re-usable parsers. At the time of writing, there are two reusable parsers.
PassThroughParser - A positive null pattern parser; it merely places the entire page in the results. Useful when you just want to download a resource, or want to test whether your Fetcher is working correctly.
UrlScraper - An abstract superclass which makes life easier if the aim of parsing the HTML page is to come up with a file to download; for example, downloading today's Dilbert.
There is also org.osjava.scraping.NullParser, which does absolutely nothing and is what you get if you fail to specify a parser in the configuration. It's possible that PassThroughParser should replace this.
Writing your own parser involves implementing the org.osjava.scraping.Parser interface or extending the org.osjava.scraping.AbstractParser abstract class, getting it on the classpath, and adding its classname to the configuration using the parser config.
    Xxx.parser=com.example.scraping.SlashdotParser
The Parser interface has three methods:
    void bringDown(Config cfg)
    Result parse(Page page, Config cfg, Session session)
    void startUp(Config cfg)
The AbstractParser provides no-op implementations for bringDown and startUp.
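To make the shape of the interface concrete, here is a minimal, self-contained sketch. Every type in it is a stand-in written for this example, reduced to the signatures mentioned on this page; they are not the real Scraping-Engine classes. WholePageParser is a hypothetical parser, and the startUp, parse, bringDown call order in run() is an assumption suggested by the method names rather than something this page states.

```java
public class ParserSkeletonDemo {

    // Stand-ins (assumptions) for the Scraping-Engine types named on this page.
    interface Config {}
    interface Session {}
    interface Result {}
    interface Page { String readAsString(); }
    static class ParsingException extends Exception {}

    interface Parser {
        void startUp(Config cfg);
        Result parse(Page page, Config cfg, Session session) throws ParsingException;
        void bringDown(Config cfg);
    }

    // Mirrors the described AbstractParser: no-op startUp and bringDown.
    static abstract class AbstractParser implements Parser {
        public void startUp(Config cfg) {}
        public void bringDown(Config cfg) {}
    }

    static class SingularResult implements Result {
        final Object value;
        SingularResult(Object value) { this.value = value; }
    }

    // Hypothetical parser: by extending AbstractParser, only parse() is written.
    static class WholePageParser extends AbstractParser {
        public Result parse(Page page, Config cfg, Session session) throws ParsingException {
            return new SingularResult(page.readAsString());
        }
    }

    // Drives the parser the way an engine might (assumed call order).
    static Object run(String pageText) {
        Parser parser = new WholePageParser();
        Page page = () -> pageText;
        try {
            parser.startUp(null);
            Result result = parser.parse(page, null, null);
            parser.bringDown(null);
            return ((SingularResult) result).value;
        } catch (ParsingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<html>hello</html>"));
    }
}
```

The point of the sketch is only that extending the abstract class leaves parse as the single method you have to supply.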
In short, a Parser's job is to convert a Page object into a Result object. It will usually look something like this:
    public Result parse(Page page, Config cfg, Session session) throws ParsingException {
        String txt = page.readAsString();
        // ... find desired data ...
        return new SingularResult( desiredData );
    }
Usually you're interested in more than a single piece of data and would do something like this:
    public Result parse(Page page, Config cfg, Session session) throws ParsingException {
        String txt = page.readAsString();
        // ... build Object[] rows of desired data in a java.util.List (listOfDesiredData) ...
        return new TabularResult( listOfDesiredData.iterator() );
    }
Note that in this second case, you are passing an iterator in. It's a very arguable piece of design, but make sure you don't use your listOfDesiredData instance for anything else; it now exists solely for the purposes of the Result.
(NOTE: As it's so arguable, it might have to change :) )
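The warning about not reusing the list can be illustrated with plain java.util classes, independent of Scraping-Engine. The sketch below assumes the rows live in an ArrayList; its iterators are fail-fast, so changing the list after handing the iterator over breaks the iterator. The class and method names are made up for this example, and the defensive copy in the second method is one possible workaround, not something the engine requires.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

public class IteratorOwnershipDemo {

    /** Returns true if the handed-over iterator fails once the list is reused. */
    static boolean breaksWhenListIsReused() {
        List<Object[]> rows = new ArrayList<Object[]>();
        rows.add(new Object[] { "first", 1 });
        Iterator<Object[]> handedOver = rows.iterator();
        // Reusing the list after its iterator has been given away.
        rows.add(new Object[] { "second", 2 });
        try {
            handedOver.next();
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Iterates over a copy, so later changes to the original list are harmless. */
    static int rowsSeenWithDefensiveCopy() {
        List<Object[]> rows = new ArrayList<Object[]>();
        rows.add(new Object[] { "first", 1 });
        Iterator<Object[]> handedOver = new ArrayList<Object[]>(rows).iterator();
        rows.add(new Object[] { "second", 2 });
        int count = 0;
        while (handedOver.hasNext()) {
            handedOver.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(breaksWhenListIsReused());
        System.out.println(rowsSeenWithDefensiveCopy());
    }
}
```

If you do need the data elsewhere, copying it before creating the iterator keeps the two uses apart, at the cost of one extra list.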
When scraping from a website, it is common to have to walk a few pages first to get that site's session into the right state, or quite simply to figure out the URL to use today. Rather than having to implement this code yourself, or somehow re-route things back to the start of the Scraping-Engine, the Page class has a method to get the next URL for you.
    Page fetch(String uri, Config cfg, Session session) throws FetchingException
As pages know about their document-base, you can also obtain this next URL using a relative URL. For example, while scraping "http://www.osjava.org/scraping-engine/index.html", you could use the page to get "apidocs/index.html" and it would know that you mean "http://www.osjava.org/scraping-engine/apidocs/index.html".
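The relative lookup described above follows ordinary URL resolution rules, which can be checked with the standard java.net.URI class. This is only a sketch of the resolution step; how Page implements it internally is not documented here, and the class and method names below are invented for the example.

```java
import java.net.URI;

public class RelativeUrlDemo {

    /** Resolves a possibly relative reference against the URL of the current page. */
    static String resolve(String documentBase, String reference) {
        return URI.create(documentBase).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.osjava.org/scraping-engine/index.html";
        // A relative reference is resolved against the directory of the base page.
        System.out.println(resolve(base, "apidocs/index.html"));
        // An absolute reference is returned unchanged.
        System.out.println(resolve(base, "http://www.example.com/other.html"));
    }
}
```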
You may have noticed something about this page on writing your own parser. It goes on and on, but at no point mentions how to actually parse the data you've downloaded. This is purely intentional.