Fetching is the highly imaginative name for downloading a resource. The following configuration options are important to fetching:
Xxx.uri=<string>
Xxx.username=<string>
Xxx.password=<string>
Xxx.header={<string>=<string>,...}
Xxx.norobots.override=true
Xxx.method=POST
Xxx.fetcher=<classname>
The uri configuration specifies where to fetch the resource from. The first part of the uri usually names the protocol. Standard protocols are already handled by Scraping-Engine, so there is no need to implement your own plugins for them.
The currently supported protocols are http, https and ftp.
Xxx.uri=http://www.slashdot.org/
Xxx.uri=https://some.site.com/foo.html
Xxx.uri=ftp://ftp.kernel.org/public/README
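To make the protocol rule concrete, here is a minimal sketch of checking whether a uri value falls under one of the three built-in protocols. The class and method names (UriProtocolCheck, protocolOf, hasBuiltInFetcher) are hypothetical and are not part of Scraping-Engine; how the engine itself selects a fetcher is not described here.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Scraping-Engine: inspects the protocol of a uri value. */
final class UriProtocolCheck {

    // The three protocols the text lists as supported out of the box.
    private static final Set<String> BUILT_IN = Set.of("http", "https", "ftp");

    /** Returns the lower-cased scheme of the uri, or null if the uri has none. */
    static String protocolOf(String uri) {
        String scheme = URI.create(uri).getScheme();
        return scheme == null ? null : scheme.toLowerCase(Locale.ROOT);
    }

    /** True when the uri uses one of the built-in protocols. */
    static boolean hasBuiltInFetcher(String uri) {
        String protocol = protocolOf(uri);
        return protocol != null && BUILT_IN.contains(protocol);
    }

    public static void main(String[] args) {
        System.out.println(hasBuiltInFetcher("http://www.slashdot.org/"));
        System.out.println(hasBuiltInFetcher("gopher://gopher.example.com/"));
    }
}
```

A uri that fails this kind of check is the case where a custom fetcher would be needed.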
The username and password configurations are, unsurprisingly, the login credentials needed to get to a resource. For the http and https protocols, BASIC authentication is used; for ftp, the credentials are part of the general ftp specification. If no credentials are supplied for ftp, a username of "anonymous" and a password of "" are used.
Xxx.username=fred
Xxx.password=fR3d
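For readers unfamiliar with BASIC authentication, the following sketch shows how a username and password are conventionally turned into an HTTP Authorization header, and how the ftp defaults described above could be applied. The class CredentialsSketch and its methods are hypothetical illustrations, not Scraping-Engine code; in particular, treating "no credentials" as "both values absent" is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical illustration, not Scraping-Engine code. */
final class CredentialsSketch {

    /** Builds the Authorization header value used by HTTP BASIC authentication. */
    static String basicAuthHeader(String username, String password) {
        String pair = username + ":" + password;
        byte[] bytes = pair.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    /**
     * Returns {username, password} for an ftp login.
     * Assumption: the anonymous defaults apply only when neither value is configured.
     */
    static String[] ftpLogin(String username, String password) {
        if (username == null && password == null) {
            return new String[] {"anonymous", ""};
        }
        return new String[] {username, password};
    }

    public static void main(String[] args) {
        System.out.println(basicAuthHeader("fred", "fR3d"));
        System.out.println(ftpLogin(null, null)[0]);
    }
}
```

With the example credentials above, the header value is "Basic ZnJlZDpmUjNk". BASIC authentication only encodes the credentials and does not encrypt them, so https is the safer choice when a password is configured.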
Applies to the http and https protocols.
The header configuration sets HTTP headers, given as a comma-separated list of key=value pairs. For example, the browser in use and the referring page could be set using:
Xxx.header=User-Agent=Mozilla/5.0,Referer=http://www.osjava.org/index.html
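The sketch below shows one way such a header value could be split into individual headers: break on commas, then on the first "=" of each entry. HeaderConfigSketch is a hypothetical name, and the real parsing rules inside Scraping-Engine may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of parsing a header configuration value. */
final class HeaderConfigSketch {

    /** Splits "Name=value,Name=value" into an ordered map of header names to values. */
    static Map<String, String> parse(String value) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (value == null || value.trim().isEmpty()) {
            return headers;
        }
        for (String pair : value.split(",")) {
            // Split on the first '=' only, so values may themselves contain '='.
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip entries with no header name or no '='
            }
            headers.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return headers;
    }

    public static void main(String[] args) {
        System.out.println(parse("User-Agent=Mozilla/5.0,Referer=http://www.osjava.org/index.html"));
    }
}
```

A naive comma split like this cannot represent a header value that itself contains a comma, which is worth keeping in mind when choosing a User-Agent string.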
Applies to the http and https protocols.
By default, Scraping-Engine uses OSJava's Norbert, a robots.txt parser, to adhere to the Web Robots Exclusion Standard. To turn this check off for a resource, set the norobots.override configuration:
Xxx.norobots.override=true
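To illustrate what the override bypasses, here is a deliberately simplified robots.txt check. It does not use Norbert's API, which is not shown here; RobotsSketch and its methods are hypothetical, and the sketch only honours Disallow rules in groups that include "User-agent: *".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified robots.txt check; not Norbert and not Scraping-Engine code. */
final class RobotsSketch {

    /** Collects the Disallow path prefixes from groups that apply to every robot. */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*$", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasUserAgent) {
                    inWildcardGroup = false; // a new group starts here
                }
                inWildcardGroup |= value.equals("*");
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** True if the path may be fetched; the override flag skips the robots.txt check entirely. */
    static boolean mayFetch(String path, String robotsTxt, boolean norobotsOverride) {
        if (norobotsOverride) {
            return true;
        }
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(mayFetch("/private/a.html", robots, false));
        System.out.println(mayFetch("/private/a.html", robots, true));
    }
}
```

The override makes the fetch ignore the site owner's stated wishes, so it is best reserved for sites you control or have permission to scrape.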
Applies to the http and https protocols.
The two common HTTP operations are GET and POST. Scraping-Engine uses GET by default, but the method configuration can switch it to POST for sites that don't allow a GET of the URL you want.
The parameters to a POST are put in the query string of the uri, just as with a GET; Scraping-Engine takes care of the difference in how they are sent.
Xxx.uri=http://www.example.com/form.html?key=value
Xxx.method=POST
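The difference in communication can be sketched as follows: for a POST, the query string is taken off the configured uri and sent as the request body instead. PostRequestSketch is a hypothetical class that only shows the split; it is an assumption that Scraping-Engine does something equivalent internally.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical illustration of turning a configured uri into a POST target and body. */
final class PostRequestSketch {

    final URI target;  // the uri without its query string
    final String body; // the former query string, sent as the request body

    private PostRequestSketch(URI target, String body) {
        this.target = target;
        this.body = body;
    }

    /** Splits a configured uri such as ".../form.html?key=value" into target and body. */
    static PostRequestSketch fromConfiguredUri(String configuredUri) {
        try {
            URI uri = new URI(configuredUri);
            URI target = new URI(uri.getScheme(), uri.getAuthority(), uri.getPath(), null, null);
            String query = uri.getRawQuery();
            return new PostRequestSketch(target, query == null ? "" : query);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid uri: " + configuredUri, e);
        }
    }

    public static void main(String[] args) {
        PostRequestSketch request = fromConfiguredUri("http://www.example.com/form.html?key=value");
        System.out.println(request.target);
        System.out.println(request.body);
    }
}
```

In a typical form submission the body would then be sent with a Content-Type of application/x-www-form-urlencoded.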
There will be times when the resource you want is not available via a built-in Fetcher and you will need to plug in your own Fetcher implementation. This involves implementing the org.osjava.scraping.Fetcher interface, putting the class on the classpath and adding its classname to the configuration using the fetcher config.
Xxx.fetcher=com.example.scraping.GopherFetcher
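The plugin mechanism rests on loading a class by name, which the sketch below demonstrates with plain Java reflection. The methods of org.osjava.scraping.Fetcher are not given here, so the SketchFetcher interface and its single fetch method are hypothetical stand-ins; consult the real interface before writing a plugin.

```java
/** Hypothetical stand-in for org.osjava.scraping.Fetcher; the real methods are not shown in this text. */
interface SketchFetcher {
    String fetch(String uri);
}

/** Example plugin with a public no-argument constructor, as reflective loading requires. */
class GopherSketchFetcher implements SketchFetcher {
    @Override
    public String fetch(String uri) {
        return "fetched " + uri; // placeholder; a real fetcher would download the resource
    }
}

/** Hypothetical loader showing how a configured classname can become a fetcher instance. */
final class FetcherLoaderSketch {

    /** Instantiates the named class and checks that it implements the fetcher interface. */
    static SketchFetcher load(String className) {
        try {
            Class<?> type = Class.forName(className);
            Object instance = type.getDeclaredConstructor().newInstance();
            return (SketchFetcher) instance;
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot use " + className + " as a fetcher", e);
        }
    }

    public static void main(String[] args) {
        SketchFetcher fetcher = load(GopherSketchFetcher.class.getName());
        System.out.println(fetcher.fetch("gopher://gopher.example.com/"));
    }
}
```

This is why the class must be on the classpath: if the name in the fetcher config cannot be resolved, or the class does not implement the expected interface, loading fails before any fetch is attempted.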