RISE to Bloome Software

The Marshal Site Router


The Marshal Site Router is a multithreaded Windows service for harvesting web sites or individual web pages. The router can follow links to harvest entire web sites. It does not follow external links, i.e. it will not harvest pages that are located on a different domain from the start page.

Technically, the router binds a connection socket to a TCP/IP address and port, listens for requests, and returns the results as JSON.

By default, the Site Router binds a connection socket to 127.0.0.1, port 8080. If you want to change the address and/or port, you can add start parameters. Using the parameters in image 1 below, the router will bind to address 192.168.0.123 and port 8081. Enter the desired parameters and click the Start button.

Image 1, Starting the Site Router
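
After starting the service, it can be useful to confirm that something is actually listening on the configured address and port. The following is a minimal sketch that only tests whether a TCP connection can be opened; it does not speak the router's request protocol, which is not described here. The function name `is_listening` and the timeout value are assumptions made for illustration.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 8080, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    The defaults mirror the Site Router defaults stated above
    (127.0.0.1, port 8080). This is a reachability check only.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Example with the parameters shown in image 1.
    print(is_listening("192.168.0.123", 8081))
```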


Using the Site Router

When harvesting using Routers, the JSON TCP Harvester is used. Please see the Routers article for more information.


Configuring the JSON TCP Harvester


Image 2, JSON TCP Harvester configuration


Harvester
Name: The name of the harvester used by the selected query node. To change the harvester, select the Name property and click the ellipsis button, then select the appropriate harvester in the form that is displayed.
Router
Connection String: The connection string instructs the Site Router which pages to harvest, and how to harvest them. It must start with the URL of the web page you want to harvest.
Connection string syntax: url;key1=value1;key2=value2;...
Connection string sample: http://www.mydomain.com;maxdepth=2;keys=id,page,content;image=jpg
For name: Not used
Host: The TCP/IP address on which the Site Router listens for connections.
Port: The TCP/IP port on which the Site Router listens for connections.
Settings
Group by: Not used
Order by: Not used
Query Timeout: Not used
Table or View: The name of the table to harvest data from.
Where:
User authentication
Password: Not used
User ID: Not used
Table 1, JSON TCP Harvester configuration
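
The connection string syntax in the table above can be illustrated with a small parser. This is a sketch only, not the router's own parsing code. It assumes that the URL itself contains no semicolon and that key names are case-insensitive (the sample uses `maxdepth`, while the key list spells it `maxDepth`). The function name `parse_connection_string` is hypothetical.

```python
def parse_connection_string(conn: str) -> tuple[str, dict[str, str]]:
    """Split 'url;key1=value1;key2=value2' into (url, options)."""
    url, *pairs = [part.strip() for part in conn.split(";")]
    if not url:
        raise ValueError("the connection string must start with a URL")

    options: dict[str, str] = {}
    for pair in pairs:
        if not pair:
            continue  # tolerate a trailing ';'
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        # Assumption: keys are case-insensitive, so store them lower-cased.
        options[key.strip().lower()] = value.strip()
    return url, options


if __name__ == "__main__":
    sample = "http://www.mydomain.com;maxdepth=2;keys=id,page,content;image=jpg"
    print(parse_connection_string(sample))
```

Values are kept as strings; a caller would still need to split `keys` on commas and convert `maxdepth` to an integer.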

Connection string keys
maxDepth This key is used when harvesting from the Page table (described below), where the Site Router is able to follow links. The router will not follow links to pages that are not on the same domain as the initial page. maxDepth specifies the maximum depth to which links are followed. With maxDepth=1, no links are followed. With maxDepth=2, the first page and the pages it links to are harvested. The default value is 25.
keys This key is used when following links. It is a comma-separated list of the URL query parameters that are valid. If a URL contains a query string, parameters that are not listed as valid are removed from the URL.
Example: If keys=id,page,content and the URL is http://www.mydomain.com?id=32&q=test&content=x, the router will navigate to http://www.mydomain.com?id=32&content=x. That is, q=test is removed from the URL since q is not a valid parameter.
image This key is used when saving a screenshot of a web page as an image. It specifies the file format of the image capture. Possible values are: bmp, gif, jpg, jpeg, png, tif and tiff. The default value is png.
imageSizeMode This key is used when saving a screenshot of a web page as an image. Normally, the web browser component is able to determine the dimensions of the web page. If this is not possible, the router can attempt to parse the DOM tree to determine the dimensions. Possible values for imageSizeMode are default and traverse.
sitewidth This key is used when saving a screenshot of a web page as an image. Some web pages change their layout depending on the dimensions of the web browser. In these cases you may set sitewidth to the width, in pixels, that you want the screenshots of the site's pages to have.
captureDelay When pages contain dynamic content, there is no way for a web browser to know when the content has been completely loaded, or even if it ever will be. If the capture is performed immediately after the static content has been loaded, data will most likely be missing. captureDelay is the time in milliseconds to wait after the static content has been loaded before the capture is taken. The default value is 200 ms.
Table 2, Connection string keys
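
The effect of the keys setting on a URL can be reproduced with Python's standard library. The sketch below is an illustration of the rule described in the table, not the router's implementation. It assumes parameter names are matched exactly (case-sensitively), and the function name `filter_query` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def filter_query(url: str, valid_keys: list[str]) -> str:
    """Remove every query parameter whose name is not in valid_keys."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in valid_keys  # assumption: exact, case-sensitive match
    ]
    # Rebuild the URL with only the accepted parameters, in original order.
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    url = "http://www.mydomain.com?id=32&q=test&content=x"
    print(filter_query(url, ["id", "page", "content"]))
    # prints http://www.mydomain.com?id=32&content=x
```

Note that `urlencode` re-encodes the retained values, so percent-encoding in the result may differ cosmetically from the input.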


Harvest

Adding tables, retrieving columns, adding relations, etc. is done in the same way as when using the ODBC Harvester.


Image 3, Marshal model for harvesting a site

Every query node (except for the root node), independent of which harvester it uses, has a parent relation property in the XML section. To add or edit a parent relation, select the property and click on the ellipsis button.

All leaves of the selected node, having Column Name specified in the Source section, are listed in the Column combo box, and all parent leaves, having Column Name specified in the Source section, are listed in the Parent column combo box. 

When using the Site Router, sub-queries are related to their parents by the Url column. In the first column of the relation form, type Url, and in the second column, select the Url column. The Url of the parent node is then passed to the child nodes, allowing them to harvest additional information using that URL.


The Site Router tables

Routers mimic relational databases in representing data as tables with columns. See the Routers article for more information.

The Site Router implements seven tables: Page, Page-AHref, Page-Header, Page-Image, Page-Link, Page-Meta and Page-Script.


The Page table

The Page table contains information about the web pages. Each page is represented as a table row. The Page table automatically follows links until maxDepth has been reached, or all pages have been retrieved. Only pages from the same domain will be harvested, i.e. the router will not follow external links.
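
The interaction between maxDepth and the same-domain rule can be sketched as a depth-limited traversal. The example below works on an in-memory link map instead of fetching real pages. The breadth-first order and the exact host-name comparison are assumptions; the traversal order and the precise definition of "same domain" used by the router are not stated here. The function name `crawl_order` is hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl_order(start_url: str, links: dict[str, list[str]], max_depth: int = 25) -> list[str]:
    """Return the URLs that would be harvested, starting at start_url.

    links maps each URL to the URLs it links to. The start page counts
    as depth 1, so max_depth=1 harvests only the start page.
    """
    domain = urlsplit(start_url).hostname
    seen = {start_url}
    queue = deque([(start_url, 1)])
    harvested = []
    while queue:
        url, depth = queue.popleft()
        harvested.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow links from this page
        for target in links.get(url, []):
            if urlsplit(target).hostname != domain:
                continue  # external link: never followed
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return harvested


if __name__ == "__main__":
    site = {
        "http://a.example/": ["http://a.example/x", "http://b.example/"],
        "http://a.example/x": ["http://a.example/y"],
    }
    print(crawl_order("http://a.example/", site, max_depth=2))
    # prints ['http://a.example/', 'http://a.example/x']
```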

Page
CharacterSet The CharacterSet column contains a value that describes the character set of the response. This character set information is taken from the header returned with the response.
Content The Content column contains the response content.
ContentEncoding The ContentEncoding column contains the value of the Content-Encoding header returned with the response.
ContentLength The ContentLength column contains the value of the Content-Length header returned with the response. If the Content-Length header is not set in the response, ContentLength is set to the value -1.
ContentType The ContentType column contains the value of the Content-Type header returned with the response.
Image If the Image column is harvested, an image snapshot is taken of the web page, and the image binary is returned in this column.
LastModified The LastModified column contains the value of the Last-Modified header received with the response. The date and time are assumed to be local time.
LoadTime The LoadTime column contains the time in milliseconds to execute the request.
MD5 The MD5 column contains the MD5 sum of the response byte array.
Method This column contains the method that is used to return the response. Common HTTP methods are GET, HEAD, POST, PUT, and DELETE.
Pdfa If the Pdfa column is harvested, the web page is printed to pdf, and the document binary is returned in this column.
ProtocolVersion The ProtocolVersion column contains the HTTP protocol version number of the response sent by the Internet resource.
ResponseUrl The ResponseUrl column contains the URI of the Internet resource that actually responded to the request. This URI might not be the same as the originally requested URI, if the original server redirected the request. The ResponseUrl column will use the Content-Location header if present.
Server The Server column contains the value of the Server header returned with the response.
StatusCode The StatusCode column contains a number that indicates the status of the HTTP response.
StatusDescription A string that describes the status of the response. A common status message is OK.
SuggestedName The SuggestedName column contains a fuzzy logic suggested name.
Title The Title column contains the page title.
TitlePath The titles of the pages that the harvester has passed on the way to this page, including the title of the current page, separated by a slash '/'.
TitlePathRoot The titles of the pages that the harvester has passed on the way to this page, excluding the title of the current page, separated by a slash '/'.
Url The Url column contains the request url.
Table 3, the Page table columns
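
The relationship between the TitlePath and TitlePathRoot columns can be shown with a short sketch. It assumes the titles are available as an ordered list from the start page to the current page; what the router returns as TitlePathRoot for the start page itself is not documented here, so the empty string below is an assumption. The function name `title_paths` is hypothetical.

```python
def title_paths(titles: list[str]) -> tuple[str, str]:
    """Return (TitlePath, TitlePathRoot) for a chain of page titles.

    titles lists the page titles in the order the harvester passed
    them, ending with the title of the current page.
    """
    title_path = "/".join(titles)            # includes the current page
    title_path_root = "/".join(titles[:-1])  # excludes the current page
    return title_path, title_path_root


if __name__ == "__main__":
    print(title_paths(["Home", "Products", "Widget"]))
    # prints ('Home/Products/Widget', 'Home/Products')
```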


The Page-AHref table

The Page-AHref table returns all <a href></a> elements of the specified web page. 

Page-AHref
Content The Content column contains the inner text of the element.
Href The Href column contains the value of the href attribute of the element.
Raw The Raw column contains the raw <a href=""></a>-element as it appears in the html document.
Title The Title column contains the value of the title attribute of the element.
Table 4, the Page-AHref columns
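
To make the Page-AHref columns concrete, the sketch below extracts the Href, Title and Content values from an HTML fragment using Python's standard `html.parser` module. It is an illustration of what the columns hold, not the router's extraction code; the Raw column is omitted, nested anchors are not handled, and the names `AHrefCollector` and `ahref_rows` are hypothetical.

```python
from html.parser import HTMLParser


class AHrefCollector(HTMLParser):
    """Collect one row per <a> element that has an href attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[dict] = []
        self._current: dict | None = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attributes = dict(attrs)
            if "href" in attributes:
                self._current = {
                    "Href": attributes["href"],
                    "Title": attributes.get("title"),
                    "Content": "",
                }

    def handle_data(self, data):
        # Text inside the open <a> element, including text of child elements.
        if self._current is not None:
            self._current["Content"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.rows.append(self._current)
            self._current = None


def ahref_rows(html: str) -> list[dict]:
    parser = AHrefCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    print(ahref_rows('<a href="/about" title="About us">About <b>us</b></a>'))
```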


The Page-Header table

All headers for the specified web page are returned.

Page-Header
Name The name part of the header name value pair.
Value The value part of the header name value pair.
Table 5, the Page-Header columns


The Page-Image table

All <img />-tags for the specified web page are returned.

Page-Image
Alt The value of the alt attribute of the image tag.
CharacterSet The CharacterSet column contains a value that describes the character set of the response. This character set information is taken from the header returned with the response.
ContentEncoding The ContentEncoding column contains the value of the Content-Encoding header returned with the response.
ContentLength The ContentLength column contains the value of the Content-Length header returned with the response. If the Content-Length header is not set in the response, ContentLength is set to the value -1.
ContentType The ContentType column contains the value of the Content-Type header returned with the response.
LastModified The LastModified column contains the value of the Last-Modified header received with the response. The date and time are assumed to be local time.
LoadTime The time in milliseconds to execute the request.
MD5 The MD5 sum of the response byte array.
Method This column contains the method that is used to return the response. Common HTTP methods are GET, HEAD, POST, PUT, and DELETE.
Original This column contains the image binary.
ProtocolVersion The ProtocolVersion column contains the HTTP protocol version number of the response sent by the Internet resource.
Raw The raw image tag as it appears in the document.
ResponseUrl The ResponseUrl column contains the URI of the Internet resource that actually responded to the request. This URI might not be the same as the originally requested URI, if the original server redirected the request. The ResponseUrl column will use the Content-Location header if present.
Server The Server column contains the value of the Server header returned with the response.
Src The value of the src attribute of the image tag.
StatusCode The StatusCode column contains a number that indicates the status of the HTTP response.
StatusDescription A string that describes the status of the response. A common status message is OK.
SuggestedName The SuggestedName column contains a fuzzy logic suggested name.
Title The value of the title attribute of the image tag.
Table 6, the Page-Image columns
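
The MD5 column can be used to check that a harvested binary, for example the content of the Original column, arrived intact. The sketch below assumes that the MD5 value is delivered as a hexadecimal string and that it was computed over the same bytes as the harvested binary; neither the encoding nor this correspondence is stated here. The function names are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 sum of a byte array as a lower-case hex string."""
    return hashlib.md5(data).hexdigest()


def matches_md5(data: bytes, md5_column: str) -> bool:
    """Compare harvested bytes with the value of the MD5 column.

    Assumption: the column holds a hexadecimal digest; case and
    surrounding whitespace are ignored.
    """
    return md5_hex(data) == md5_column.strip().lower()


if __name__ == "__main__":
    print(md5_hex(b"hello"))
    # prints 5d41402abc4b2a76b9719d911017c592
```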


The Page-Link table

The Page-Link table is used for harvesting linked resources, such as style sheets, for the specified web page.

Page-Link
CharacterSet The CharacterSet column contains a value that describes the character set of the response. This character set information is taken from the header returned with the response.
ContentEncoding The ContentEncoding column contains the value of the Content-Encoding header returned with the response.
ContentLength The ContentLength column contains the value of the Content-Length header returned with the response. If the Content-Length header is not set in the response, ContentLength is set to the value -1.
ContentType The ContentType column contains the value of the Content-Type header returned with the response.
File The File column contains the response content.
LastModified The LastModified column contains the value of the Last-Modified header received with the response. The date and time are assumed to be local time.
LoadTime The time in milliseconds to execute the request.
MD5 The MD5 sum of the response byte array.
Method This column contains the method that is used to return the response. Common HTTP methods are GET, HEAD, POST, PUT, and DELETE.
Name The Name column contains a fuzzy logic suggested name.
ProtocolVersion The ProtocolVersion column contains the HTTP protocol version number of the response sent by the Internet resource.
ResponseUrl The ResponseUrl column contains the URI of the Internet resource that actually responded to the request. This URI might not be the same as the originally requested URI, if the original server redirected the request. The ResponseUrl column will use the Content-Location header if present.
Server The Server column contains the value of the Server header returned with the response.
StatusCode The StatusCode column contains a number that indicates the status of the HTTP response.
StatusDescription A string that describes the status of the response. A common status message is OK.
Table 7, the Page-Link columns


The Page-Meta table

The Page-Meta table is used for harvesting meta tags for the specified web page.

Page-Meta
Charset Specifies the character encoding for the HTML document.
Content Gives the value associated with the http-equiv or name attribute.
Http-equiv Provides an HTTP header for the information/value of the content attribute.
Name Specifies a name for the metadata.
Property The property in meta tags allows web pages to specify values to property fields which come from a property library. The property library (RDFa format) is specified in the head tag.
Raw The raw meta tag as it appears in the document.
Scheme Specifies a scheme to be used to interpret the value of the content attribute.
Table 8, the Page-Meta columns


The Page-Script table

The Page-Script table is used for harvesting external script files for the specified web page.

Page-Script
CharacterSet The CharacterSet column contains a value that describes the character set of the response. This character set information is taken from the header returned with the response.
ContentEncoding The ContentEncoding column contains the value of the Content-Encoding header returned with the response.
ContentLength The ContentLength column contains the value of the Content-Length header returned with the response. If the Content-Length header is not set in the response, ContentLength is set to the value -1.
ContentType The ContentType column contains the value of the Content-Type header returned with the response.
File The File column contains the response content.
LastModified The LastModified column contains the value of the Last-Modified header received with the response. The date and time are assumed to be local time.
LoadTime The time in milliseconds to execute the request.
MD5 The MD5 sum of the response byte array.
Method This column contains the method that is used to return the response. Common HTTP methods are GET, HEAD, POST, PUT, and DELETE.
Name The Name column contains a fuzzy logic suggested name.
ProtocolVersion The ProtocolVersion column contains the HTTP protocol version number of the response sent by the Internet resource.
ResponseUrl The ResponseUrl column contains the URI of the Internet resource that actually responded to the request. This URI might not be the same as the originally requested URI, if the original server redirected the request. The ResponseUrl column will use the Content-Location header if present.
Server The Server column contains the value of the Server header returned with the response.
StatusCode The StatusCode column contains a number that indicates the status of the HTTP response.
StatusDescription A string that describes the status of the response. A common status message is OK.
Table 9, the Page-Script columns