Fetching resources

Fetching is the highly imaginative name for downloading a resource. The following configuration options are important to fetching:

Xxx.uri=<string>
Xxx.username=<string>
Xxx.password=<string>
Xxx.header={<string>=<string>,...}
Xxx.norobots.override=true
Xxx.method=POST
Xxx.fetcher=<classname>

uri

The uri configuration specifies where to fetch the resource from. The first part of the uri usually contains the protocol. Standard protocols are already handled by Scraping-Engine, so there is no need to implement your own plugins for them. The currently supported protocols are http, https and ftp.

Xxx.uri=http://www.slashdot.org/
Xxx.uri=https://some.site.com/foo.html
Xxx.uri=ftp://ftp.kernel.org/public/README

username and password

Unsurprisingly, these are the login credentials needed to get to a resource. For the http and https protocols, BASIC authentication is used; for ftp, credentials are part of the general ftp specification. If no credentials are supplied for ftp, a username of "anonymous" and an empty password ("") are used.

Xxx.username=fred
Xxx.password=fR3d
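
As a rough illustration of what BASIC authentication means on the wire for http and https, the sketch below builds the Authorization header value from a username and password. It uses only the Java standard library; the class name BasicAuthSketch is hypothetical and is not part of Scraping-Engine, whose internal handling is not described here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration only; not a Scraping-Engine class.
class BasicAuthSketch {

    // BASIC authentication sends "username:password", base64-encoded,
    // in an "Authorization: Basic ..." request header.
    static String authorizationValue(String username, String password) {
        String pair = username + ":" + password;
        byte[] raw = pair.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // Uses the example credentials from the configuration above.
        System.out.println("Authorization: " + authorizationValue("fred", "fR3d"));
    }
}
```

Base64 is an encoding, not encryption, so BASIC credentials are only protected in transit when the uri uses https.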

header

Applies to the http and https protocols. HTTP headers may be specified as a comma-separated list of key=value pairs. For example, the browser in use and the page referer could be set using:

Xxx.header=User-Agent=Mozilla/5.0,Referer=http://www.osjava.org/index.html
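
To make the key=value format concrete, here is a minimal sketch of how such a header value could be split into individual headers. It is an assumption about the format, not Scraping-Engine's actual parser; HeaderConfigSketch is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a Scraping-Engine class.
class HeaderConfigSketch {

    // Splits "k1=v1,k2=v2" into an ordered map of header name to value.
    // Each pair is split on its first '=' only, so a value may contain '='.
    static Map<String, String> parse(String value) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String pair : value.split(",")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty or malformed entries with no header name
            }
            String name = pair.substring(0, eq).trim();
            String headerValue = pair.substring(eq + 1).trim();
            headers.put(name, headerValue);
        }
        return headers;
    }

    public static void main(String[] args) {
        String configured = "User-Agent=Mozilla/5.0,Referer=http://www.osjava.org/index.html";
        parse(configured).forEach((name, v) -> System.out.println(name + ": " + v));
    }
}
```

Note that a naive split like this cannot represent a header value that itself contains a comma; whether Scraping-Engine supports such values is not stated here.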

norobots.override

Applies to the http and https protocols. By default, Scraping-Engine uses OSJava's Norbert, a robots.txt parser, to adhere to the Web Robots Exclusion Standard. If you need to fetch a resource regardless of what robots.txt says, you can turn this check off:

Xxx.norobots.override=true

method

Applies to the http and https protocols. The two common HTTP operations are GET and POST. Scraping-Engine GETs by default, but you can tell it to POST instead for sites that don't allow GETs to the url you want. The parameters to a POST are put in the query string of the uri, just as with a GET; Scraping-Engine takes care of the difference in communication needed.

Xxx.uri=http://www.example.com/form.html?key=value
Xxx.method=POST
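
The "difference in communication" is roughly this: for a GET the parameters travel in the url, while for a form-style POST they travel in the request body. The sketch below shows one way to split a configured uri into a POST target and body using the Java standard library. PostRequestSketch is a hypothetical name, and how Scraping-Engine does this internally is an assumption, not something documented here.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not a Scraping-Engine class.
class PostRequestSketch {
    final String target; // uri with the query string (and any fragment) removed
    final String body;   // raw query string, to be sent as the POST body

    PostRequestSketch(String target, String body) {
        this.target = target;
        this.body = body;
    }

    static PostRequestSketch fromUri(String uri) {
        URI parsed = URI.create(uri); // throws IllegalArgumentException if malformed
        String query = parsed.getRawQuery();
        String body = (query == null) ? "" : query;

        // Cut the uri at the first '?' or '#', whichever comes first.
        int cut = uri.length();
        int q = uri.indexOf('?');
        if (q >= 0) cut = Math.min(cut, q);
        int h = uri.indexOf('#');
        if (h >= 0) cut = Math.min(cut, h);

        return new PostRequestSketch(uri.substring(0, cut), body);
    }

    public static void main(String[] args) {
        PostRequestSketch post = fromUri("http://www.example.com/form.html?key=value");
        System.out.println("POST " + post.target);
        System.out.println("body: " + post.body);
    }
}
```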

fetcher

There will be times when the resource you want is not available via a built-in Fetcher and you will need to plug in your own Fetcher implementation. This involves implementing the org.osjava.scraping.Fetcher interface, putting it on the classpath and adding its classname to the configuration using the fetcher config.

Xxx.fetcher=com.example.scraping.GopherFetcher