scone.robot
Class RobotTask

java.lang.Object
  extended by scone.robot.RobotTask

public class RobotTask
extends java.lang.Object

RobotTask objects define tasks for the robot. Use the constructor to create a task and set its basic properties.

Author:
Frank Wollenweber

Field Summary
static int ALL
          Follow all links
static int EXTERNAL
          Follow only external links
static int INTERNAL
          Follow only internal links
static int SUBDIRECTORIES
          Follow only links that point to files in the same subdirectory
 
Constructor Summary
RobotTask(SimpleUri startURI, int depth, int restriction, RobotUser robotUser)
          constructor
 
Method Summary
 void addLinkClassifier(LinkClassifier linkClassifier)
          Adds a LinkClassifier to this task.
 void addLinkFilter(LinkFilter linkFilter)
          Filters decide whether to follow a link or not.
 void addPageClassifier(PageClassifier pageClassifier)
          Adds a PageClassifier to this task.
 void addPageFilter(PageFilter pageFilter)
          Filters decide whether to stop the crawling at the current document or to continue with the links.
 void addResultNode(RobotHtmlNode robotHtmlNode)
          Adds an element to the result set
 long getArrivalTime()
          Get the arrival time of this task at the robot
 int getCacheHits()
          Get the number of cache hits: queuedUris = downloadedUris + cacheHits
 boolean getCheckDatabase()
          Checks if the robot checks the database
 int getCheckedUris()
          Get the number of checked URIs.
 int getDepth()
          Get the crawling depth
 boolean getDoContenSeenTest()
          Checks if the robot does a content-seen-test.
 int getDownloadedUris()
          Get the number of downloaded URIs
 long getEndTime()
          Get the end time for this task
 long getExpiry()
          Get the expiry time
 int getFilteredUris()
          Get the number of filtered URIs.
 boolean getHeadOnly()
          Checks if the robot is in headOnly mode
 int getId()
          Get the task's unique id
 java.util.Enumeration getLinkClassifier()
          Get an Enumeration of all LinkClassifiers
 java.util.Enumeration getLinkFilter()
          Get an Enumeration of all LinkFilter
 long getMaxDownloadTime()
          Gets the maximum download time
 int getMaxDownloadUris()
          Get the maximum number of documents the robot will download
 int getMaxPageSize()
          Gets the download size limit
 int getNumberOfOpenUris()
          Get the number of open URIs for this task.
 int getNumberOfResultNodes()
          Get the number of result nodes
 boolean getObeyRobotExclusion()
          Checks if the robot is in obeyRobotExclusion mode
 QueueEntry getOpenUri(SimpleUri uri)
          Checks if there's an element in this task's list of open URIs which is equal to uri
 java.util.Enumeration getOpenUris()
          Get the URIs of this task the robot is currently working on.
 java.util.Enumeration getPageClassifier()
          Get an Enumeration of all PageClassifiers
 java.util.Enumeration getPageFilter()
          Get an Enumeration of all PageFilter
 int getQueuedUris()
          Get the number of queued URIS
 boolean getRequireSourceCode()
          Checks if the robot requires the source code
 RobotHtmlNode getResultNode(SimpleUri uri)
          Get the result node with the URI equal to the parameter uri
 java.util.Enumeration getResultNodes()
          Get all result nodes
 long getStartTime()
          Get the start time of this task
 SimpleUri getStartURI()
          Get the start URI of this task
 long getUpdateDate()
          Gets the update date
 boolean isOpenUri(SimpleUri uri)
          Checks if there's an element in this task's list of open URIs which is equal to uri
 boolean isResultUri(SimpleUri uri)
          Checks if this URI is in the result
 void removeLinkClassifier(LinkClassifier linkClassifier)
          Removes a Classifier
 void removeLinkFilter(LinkFilter linkFilter)
          Removes a Filter
 void removePageClassifier(PageClassifier pageClassifier)
          Removes a Classifier
 void removePageFilter(PageFilter pageFilter)
          Removes a Filter
 void setCheckDatabase(boolean checkDatabase)
          Sets whether the robot checks the database before trying to download a document from the web.
 void setDoContentSeenTest()
          Enables the content-seen-test.
 void setExpiry(long time)
          Sets when this task expires.
 void setHeadOnly(boolean headOnly)
          If this flag is set, HEAD instead of GET is used to contact the server
 void setMaxDownloadTime(long time)
          The robot will only download a document for the specified time
 void setMaxDownloadUris(int max)
          At most max documents are downloaded from the web.
 void setMaxPageSize(int size)
          Only the specified amount of bytes are downloaded from each document
 void setObeyRobotExclusion(boolean obeyRobotExclusion)
          Sets whether the robot obeys the robot exclusion protocol.
 void setRequireSourceCode(boolean requireSourceCode)
          If this is set to true, the robot saves the source code of every document.
 void setUpdateDate(long date)
          Pages that were accessed (by the robot or the user) before date are downloaded again.
 boolean wasStopped()
          Get the value of the stop flag
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INTERNAL

public static final int INTERNAL
Follow only internal links

See Also:
Constant Field Values

SUBDIRECTORIES

public static final int SUBDIRECTORIES
Follow only links that point to files in the same subdirectory

See Also:
Constant Field Values

EXTERNAL

public static final int EXTERNAL
Follow only external links

See Also:
Constant Field Values

ALL

public static final int ALL
Follow all links

See Also:
Constant Field Values
Constructor Detail

RobotTask

public RobotTask(SimpleUri startURI,
                 int depth,
                 int restriction,
                 RobotUser robotUser)
constructor

Parameters:
startURI - start the crawl at this URI
depth - follow the links with this depth
restriction - use the constants defined in this class to restrict the crawling process
robotUser - the robotUser will be called for every found document and at the end of the crawling
Method Detail

getId

public int getId()
Get the task's unique id

Returns:
id

getStartURI

public SimpleUri getStartURI()
Get the start URI of this task

Returns:
start URI

setHeadOnly

public void setHeadOnly(boolean headOnly)
If this flag is set, HEAD instead of GET is used to contact the server

Parameters:
headOnly - if true, only the head of startUri will be loaded

getHeadOnly

public boolean getHeadOnly()
Checks if the robot is in headOnly mode


getDepth

public int getDepth()
Get the crawling depth


setObeyRobotExclusion

public void setObeyRobotExclusion(boolean obeyRobotExclusion)
Sets whether the robot obeys the robot exclusion protocol. For details see http://www.robotstxt.org/wc/exclusion.html

Parameters:
obeyRobotExclusion - if true, the robot will obey the robot exclusion protocol

getObeyRobotExclusion

public boolean getObeyRobotExclusion()
Checks if the robot is in obeyRobotExclusion mode

Returns:
true, if the robot obeys the robot exclusion protocol

setExpiry

public void setExpiry(long time)
Sets when this task expires. After this time the robot will stop this robot task, even if its processing has not started yet.

Parameters:
time - time period in milliseconds from the arrival of the task at the robot to the task's expiry.
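Since the expiry is a relative period, the absolute deadline follows from the task's arrival time (see getArrivalTime). A minimal sketch of that arithmetic; the helper name is ours, not part of the RobotTask API:

```java
// Computes the absolute expiry deadline from the task's arrival time
// and the relative expiry period passed to setExpiry(long).
// Illustrative helper, not part of the scone API.
public class ExpiryMath {
    static long deadline(long arrivalTimeMillis, long expiryPeriodMillis) {
        return arrivalTimeMillis + expiryPeriodMillis;
    }

    public static void main(String[] args) {
        long arrival = 1_000_000L; // task arrived at the robot
        long expiry = 60_000L;     // expire one minute after arrival
        System.out.println(deadline(arrival, expiry)); // 1060000
    }
}
```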

getExpiry

public long getExpiry()
Get the expiry time

Returns:
expiry time

setMaxDownloadUris

public void setMaxDownloadUris(int max)
At most max documents are downloaded from the web. After the robot has downloaded max documents from the web, the task is stopped. Running PageLoaderThreads are not interrupted, so the actual number of downloaded documents may be higher.

Parameters:
max - download max documents

getMaxDownloadUris

public int getMaxDownloadUris()
Get the maximum number of documents the robot will download

Returns:
max downloaded URIs

setCheckDatabase

public void setCheckDatabase(boolean checkDatabase)
Sets whether the robot checks the database before trying to download a document from the web.

Parameters:
checkDatabase - if true, the robot always tries to find linked documents in the database.

getCheckDatabase

public boolean getCheckDatabase()
Checks if the robot checks the database

Returns:
true, if the robot checks the database

setUpdateDate

public void setUpdateDate(long date)
Pages that were accessed (by the robot or the user) before date are downloaded again.

Parameters:
date - date in milliseconds after January 1, 1970 00:00:00 GMT
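Since the update date is epoch milliseconds, a common way to build it is to subtract an age threshold from the current time, e.g. "re-download anything last accessed more than a week ago". A small sketch of that computation (the helper names are ours):

```java
// Builds an update date meaning "pages last accessed more than seven
// days ago should be downloaded again". The value is milliseconds
// after January 1, 1970 00:00:00 GMT, as documented for
// setUpdateDate(long). Illustrative helper, not part of the scone API.
public class UpdateDateExample {
    static final long WEEK_MILLIS = 7L * 24 * 60 * 60 * 1000;

    static long oneWeekBefore(long nowMillis) {
        return nowMillis - WEEK_MILLIS;
    }

    public static void main(String[] args) {
        long cutoff = oneWeekBefore(System.currentTimeMillis());
        // task.setUpdateDate(cutoff);  // pages older than this are refreshed
        System.out.println(cutoff < System.currentTimeMillis()); // true
    }
}
```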

getUpdateDate

public long getUpdateDate()
Gets the update date

Returns:
date in milliseconds after January 1, 1970 00:00:00 GMT

setMaxPageSize

public void setMaxPageSize(int size)
Only the specified amount of bytes are downloaded from each document

Parameters:
size - download only size bytes

getMaxPageSize

public int getMaxPageSize()
Gets the download size limit

Returns:
the maximum amount of bytes the robot will download for each page

setMaxDownloadTime

public void setMaxDownloadTime(long time)
The robot will only download a document for the specified time

Parameters:
time - maximum download time for each document

getMaxDownloadTime

public long getMaxDownloadTime()
Gets the maximum download time

Returns:
maximum download time for each page

addPageClassifier

public void addPageClassifier(PageClassifier pageClassifier)
Adds a PageClassifier to this task. The classifier can add attributes to the page. All classifiers are executed serially.

Parameters:
pageClassifier - add this PageClassifier

removePageClassifier

public void removePageClassifier(PageClassifier pageClassifier)
Removes a Classifier

Parameters:
pageClassifier - remove this one

addLinkClassifier

public void addLinkClassifier(LinkClassifier linkClassifier)
Adds a LinkClassifier to this task. The classifier can add attributes to the link. All classifiers are executed serially.

Parameters:
linkClassifier - add this LinkClassifier

removeLinkClassifier

public void removeLinkClassifier(LinkClassifier linkClassifier)
Removes a Classifier

Parameters:
linkClassifier - remove this one

addPageFilter

public void addPageFilter(PageFilter pageFilter)
Adds a PageFilter to this task. Filters decide whether to stop the crawling at the current document or to continue with the links. The filters are executed serially and a boolean AND operation is used for the decision.

Parameters:
pageFilter - add this PageFilter

removePageFilter

public void removePageFilter(PageFilter pageFilter)
Removes a Filter

Parameters:
pageFilter - remove this one

addLinkFilter

public void addLinkFilter(LinkFilter linkFilter)
Adds a LinkFilter to this task. Filters decide whether to follow a link or not. The filters are executed serially and a boolean AND operation is used for the decision.

Parameters:
linkFilter - add this LinkFilter
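The serial execution with a boolean AND, as described for both PageFilters and LinkFilters, means a single rejecting filter vetoes the link. A sketch of that combination logic; the accept(String) signature below is a hypothetical stand-in, since this page does not show scone's actual LinkFilter interface:

```java
import java.util.List;

// Sketch of combining serially executed filters with a boolean AND,
// as the addLinkFilter / addPageFilter documentation describes.
// The Filter interface here is hypothetical; scone's LinkFilter
// and PageFilter interfaces may differ.
public class FilterChain {
    interface Filter {
        boolean accept(String uri);
    }

    // A link is followed only if every registered filter accepts it.
    static boolean follow(List<Filter> filters, String uri) {
        for (Filter f : filters) {
            if (!f.accept(uri)) {
                return false; // one veto rejects the link
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Filter sameHost = uri -> uri.startsWith("http://example.org/");
        Filter noQuery  = uri -> !uri.contains("?");
        List<Filter> chain = List.of(sameHost, noQuery);
        System.out.println(follow(chain, "http://example.org/a.html")); // true
        System.out.println(follow(chain, "http://example.org/a?x=1"));  // false
    }
}
```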

removeLinkFilter

public void removeLinkFilter(LinkFilter linkFilter)
Removes a Filter

Parameters:
linkFilter - remove this one

getPageClassifier

public java.util.Enumeration getPageClassifier()
Get an Enumeration of all PageClassifiers

Returns:
Enumeration of all PageClassifiers

getLinkClassifier

public java.util.Enumeration getLinkClassifier()
Get an Enumeration of all LinkClassifiers

Returns:
Enumeration of all LinkClassifiers

getPageFilter

public java.util.Enumeration getPageFilter()
Get an Enumeration of all PageFilter

Returns:
Enumeration of all PageFilter

getLinkFilter

public java.util.Enumeration getLinkFilter()
Get an Enumeration of all LinkFilter

Returns:
Enumeration of all LinkFilter

setDoContentSeenTest

public void setDoContentSeenTest()
Enables the content-seen-test. If the robot does a content-seen-test, the crawling stops at pages that have been seen before under a different URL.


getDoContenSeenTest

public boolean getDoContenSeenTest()
Checks if the robot does a content-seen-test.

Returns:
true, if the robot does a content-seen-test
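A content-seen-test is typically implemented by remembering a digest of every downloaded body and stopping when the same content reappears under a different URL. The following is a generic illustration of that idea, not scone's implementation:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Generic sketch of a content-seen test: keep a set of content
// digests and report whether a body has been seen before.
// Illustrative only; scone's internals may differ.
public class ContentSeen {
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time this content is encountered,
    // false on every later sighting (e.g. under a different URL).
    boolean firstSighting(String body) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return seen.add(hex.toString());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    public static void main(String[] args) {
        ContentSeen cs = new ContentSeen();
        System.out.println(cs.firstSighting("<html>same page</html>")); // true
        System.out.println(cs.firstSighting("<html>same page</html>")); // false: crawling stops
    }
}
```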

setRequireSourceCode

public void setRequireSourceCode(boolean requireSourceCode)
If this is set to true, the robot saves the source code of every document. Documents that are in the database without source are downloaded again.

Parameters:
requireSourceCode - if true, the robot requires the source code of every document

getRequireSourceCode

public boolean getRequireSourceCode()
Checks if the robot requires the source code

Returns:
true, if the robot requires the source code

getArrivalTime

public long getArrivalTime()
Get the arrival time of this task at the robot

Returns:
arrival time

getStartTime

public long getStartTime()
Get the start time of this task

Returns:
start time

getEndTime

public long getEndTime()
Get the end time for this task

Returns:
end time

getCheckedUris

public int getCheckedUris()
Get the number of checked URIs. Every link and frame is counted, even if the URI has been checked before.

Returns:
checked URIs

getQueuedUris

public int getQueuedUris()
Get the number of queued URIS

Returns:
queued URIs

getFilteredUris

public int getFilteredUris()
Get the number of filtered URIs. The robot counts all URIs that were filtered by the DefaultFilter (wrong file-extension, restriction) or by the LinkFilters of this task. checkedUris = filteredUris + queuedUris + URIs that have been processed before.

Returns:
filtered Uris

getDownloadedUris

public int getDownloadedUris()
Get the number of downloaded URIs

Returns:
downloaded URIs

getCacheHits

public int getCacheHits()
Get the number of cache hits: queuedUris = downloadedUris + cacheHits

Returns:
cache hits
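The accounting identities stated for getFilteredUris and getCacheHits can be checked directly. The field names below are ours, chosen to match the documented relations:

```java
// Checks the two accounting identities documented for this class:
//   checkedUris = filteredUris + queuedUris + urisProcessedBefore
//   queuedUris  = downloadedUris + cacheHits
// Plain arithmetic for illustration; not part of the scone API.
public class CrawlStats {
    static boolean consistent(int checked, int filtered, int queued,
                              int processedBefore, int downloaded, int cacheHits) {
        return checked == filtered + queued + processedBefore
            && queued == downloaded + cacheHits;
    }

    public static void main(String[] args) {
        // Example: 100 links checked; 30 filtered, 50 queued, 20 seen before.
        // Of the 50 queued, 35 were downloaded and 15 came from the cache.
        System.out.println(consistent(100, 30, 50, 20, 35, 15)); // true
    }
}
```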

isOpenUri

public boolean isOpenUri(SimpleUri uri)
Checks if there's an element in this task's list of open URIs which is equal to uri

Parameters:
uri - look for this uri
Returns:
true, if an equal URI is open

getOpenUri

public QueueEntry getOpenUri(SimpleUri uri)
Checks if there's an element in this task's list of open URIs which is equal to uri

Parameters:
uri - look for this uri
Returns:
QueueEntry with an URI equal to the parameter uri

getNumberOfOpenUris

public int getNumberOfOpenUris()
Get the number of open URIs for this task.

Returns:
number of open URIs

getOpenUris

public java.util.Enumeration getOpenUris()
Get the URIs of this task the robot is currently working on.

Returns:
Enumeration of the URIs

addResultNode

public void addResultNode(RobotHtmlNode robotHtmlNode)
Adds an element to the result set

Parameters:
robotHtmlNode - add this node

isResultUri

public boolean isResultUri(SimpleUri uri)
Checks if this URI is in the result

Parameters:
uri - check this URI
Returns:
true, if uri is in the result

getNumberOfResultNodes

public int getNumberOfResultNodes()
Get the number of result nodes

Returns:
number of result nodes

getResultNodes

public java.util.Enumeration getResultNodes()
Get all result nodes

Returns:
Enumeration of the result nodes

getResultNode

public RobotHtmlNode getResultNode(SimpleUri uri)
Get the result node with the URI equal to the parameter uri

Parameters:
uri - get the result node for this URI
Returns:
RobotHtmlNode with URI equal to the parameter uri or null

wasStopped

public boolean wasStopped()
Get the value of the stop flag

Returns:
true, if the task was stopped before it finished