scone.robot
Class RobotTask

java.lang.Object
  extended by scone.robot.RobotTask

public class RobotTask
extends java.lang.Object

RobotTask objects define tasks for the robot. Use the constructor to create a task and set its basic properties.

Author:
Frank Wollenweber

Field Summary
static int ALL
          Follow all links
static int EXTERNAL
          Follow only external links
static int INTERNAL
          Follow only internal links
static int SUBDIRECTORIES
          Follow only links that point to files in the same subdirectory
 
Constructor Summary
RobotTask(SimpleUri startURI, int depth, int restriction, RobotUser robotUser)
          constructor
 
Method Summary
 void addLinkClassifier(LinkClassifier linkClassifier)
          Adds a LinkClassifier to this task.
 void addLinkFilter(LinkFilter linkFilter)
          Filters decide whether to follow a link or not.
 void addPageClassifier(PageClassifier pageClassifier)
          Adds a PageClassifier to this task.
 void addPageFilter(PageFilter pageFilter)
          Filters decide whether to stop the crawling at the current document or to continue with the links.
 void addResultNode(RobotHtmlNode robotHtmlNode)
          Adds an element to the result set
 long getArrivalTime()
          Get the arrival time of this task at the robot
 int getCacheHits()
          Get the number of cache hits: queuedUris = downloadedUris + cacheHits
 boolean getCheckDatabase()
          Checks if the robot checks the database
 int getCheckedUris()
          Get the number of checked URIs.
 int getDepth()
          Get the crawling depth
 boolean getDoContenSeenTest()
          Checks if the robot does a content-seen-test.
 int getDownloadedUris()
          Get the number of downloaded URIs
 long getEndTime()
          Get the end time for this task
 long getExpiry()
          Get the expiry time
 int getFilteredUris()
          Get the number of filtered URIs.
 boolean getHeadOnly()
          Checks if the robot is in headOnly mode
 int getId()
          Get the task's unique id
 java.util.Enumeration getLinkClassifier()
          Get an Enumeration of all LinkClassifiers
 java.util.Enumeration getLinkFilter()
          Get an Enumeration of all LinkFilter
 long getMaxDownloadTime()
          Gets the maximum download time
 int getMaxDownloadUris()
          Get the maximum number of documents the robot will download
 int getMaxPageSize()
          Gets the download size limit
 int getNumberOfOpenUris()
          Get the number of open URIs for this task.
 int getNumberOfResultNodes()
          Get the number of result nodes
 boolean getObeyRobotExclusion()
          Checks if the robot is in obeyRobotExclusion mode
 QueueEntry getOpenUri(SimpleUri uri)
          Checks if there's an element in this task's list of open URIs which is equal to uri
 java.util.Enumeration getOpenUris()
          Get the URIs of this task the robot is currently working on.
 java.util.Enumeration getPageClassifier()
          Get an Enumeration of all PageClassifiers
 java.util.Enumeration getPageFilter()
          Get an Enumeration of all PageFilter
 int getQueuedUris()
          Get the number of queued URIS
 boolean getRequireSourceCode()
          Checks if the robot requires the source code
 RobotHtmlNode getResultNode(SimpleUri uri)
          Get the result node with the URI equal to the parameter uri
 java.util.Enumeration getResultNodes()
          Get all result nodes
 long getStartTime()
          Get the start time of this task
 SimpleUri getStartURI()
          Get the start URI of this task
 long getUpdateDate()
          Gets the update date
 boolean isOpenUri(SimpleUri uri)
          Checks if there's an element in this task's list of open URIs which is equal to uri
 boolean isResultUri(SimpleUri uri)
          Checks if this URI is in the result
 void removeLinkClassifier(LinkClassifier linkClassifier)
          Removes a Classifier
 void removeLinkFilter(LinkFilter linkFilter)
          Removes a Filter
 void removePageClassifier(PageClassifier pageClassifier)
          Removes a Classifier
 void removePageFilter(PageFilter pageFilter)
          Removes a Filter
 void setCheckDatabase(boolean checkDatabase)
          Sets whether the robot checks the database before trying to download a document from the web.
 void setDoContentSeenTest()
          Enables the content-seen-test.
 void setExpiry(long time)
          Sets when this task expires.
 void setHeadOnly(boolean headOnly)
          If this flag is set, HEAD instead of GET is used to contact the server
 void setMaxDownloadTime(long time)
          The robot will only download a document for the specified time
 void setMaxDownloadUris(int max)
          At most max documents are downloaded from the web.
 void setMaxPageSize(int size)
          Only the specified amount of bytes are downloaded from each document
 void setObeyRobotExclusion(boolean obeyRobotExclusion)
          Sets whether the robot obeys the robot exclusion protocol.
 void setRequireSourceCode(boolean requireSourceCode)
          If this is set to true, the robot saves the source code of every document.
 void setUpdateDate(long date)
          Pages that were accessed (by the robot or the user) before date are downloaded again.
 boolean wasStopped()
          Get the value of the stop flag
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INTERNAL

public static final int INTERNAL
Follow only internal links

See Also:
Constant Field Values

SUBDIRECTORIES

public static final int SUBDIRECTORIES
Follow only links that point to files in the same subdirectory

See Also:
Constant Field Values

EXTERNAL

public static final int EXTERNAL
Follow only external links

See Also:
Constant Field Values

ALL

public static final int ALL
Follow all links

See Also:
Constant Field Values
Constructor Detail

RobotTask

public RobotTask(SimpleUri startURI,
                 int depth,
                 int restriction,
                 RobotUser robotUser)
constructor

Parameters:
startURI - start the crawl at this URI
depth - follow the links with this depth
restriction - use the constants defined in this class to restrict the crawling process
robotUser - the robotUser will be called for every found document and at the end of the crawling
Method Detail

getId

public int getId()
Get the task's unique id

Returns:
id

getStartURI

public SimpleUri getStartURI()
Get the start URI of this task

Returns:
start URI

setHeadOnly

public void setHeadOnly(boolean headOnly)
If this flag is set, HEAD instead of GET is used to contact the server

Parameters:
headOnly - if true, only the head of startUri will be loaded

getHeadOnly

public boolean getHeadOnly()
Checks if the robot is in headOnly mode


getDepth

public int getDepth()
Get the crawling depth


setObeyRobotExclusion

public void setObeyRobotExclusion(boolean obeyRobotExclusion)
Sets whether the robot obeys the robot exclusion protocol. For details see http://www.robotstxt.org/wc/exclusion.html

Parameters:
obeyRobotExclusion - if true, the robot will obey the robot exclusion protocol

getObeyRobotExclusion

public boolean getObeyRobotExclusion()
Checks if the robot is in obeyRobotExclusion mode

Returns:
true, if the robot obeys the robot exclusion protocol

setExpiry

public void setExpiry(long time)
Sets when this task expires. After this time the robot will stop this robot task, even if its processing has not started yet.

Parameters:
time - time period in milliseconds from the arrival of the task at the robot to the task's expiry.
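Since the expiry is a relative period, the absolute deadline follows from the task's arrival time (see getArrivalTime). A minimal sketch of that arithmetic; the helper name is ours, not part of the RobotTask API:

```java
// Computes the absolute expiry deadline from the task's arrival time
// and the relative expiry period passed to setExpiry(long).
// Illustrative helper, not part of the scone API.
public class ExpiryMath {
    static long deadline(long arrivalTimeMillis, long expiryPeriodMillis) {
        return arrivalTimeMillis + expiryPeriodMillis;
    }

    public static void main(String[] args) {
        long arrival = 1_000_000L; // task arrived at the robot
        long expiry = 60_000L;     // expire one minute after arrival
        System.out.println(deadline(arrival, expiry)); // 1060000
    }
}
```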

getExpiry

public long getExpiry()
Get the expiry time

Returns:
expiry time

setMaxDownloadUris

public void setMaxDownloadUris(int max)
At most max documents are downloaded from the web. After the robot has downloaded max documents from the web, the task is stopped. Running PageLoaderThreads are not interrupted, so the actual number of downloaded documents may be higher.

Parameters:
max - download max documents

getMaxDownloadUris

public int getMaxDownloadUris()
Get the maximum number of documents the robot will download

Returns:
max downloaded URIs

setCheckDatabase

public void setCheckDatabase(boolean checkDatabase)
Sets whether the robot checks the database before trying to download a document from the web.

Parameters:
checkDatabase - if true, the robot always tries to find linked documents in the database.

getCheckDatabase

public boolean getCheckDatabase()
Checks if the robot checks the database

Returns:
true, if the robot checks the database

setUpdateDate

public void setUpdateDate(long date)
Pages that were accessed (by the robot or the user) before date are downloaded again.

Parameters:
date - date in milliseconds after January 1, 1970 00:00:00 GMT
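Since the update date is epoch milliseconds, a common way to build it is to subtract an age threshold from the current time, e.g. "re-download anything last accessed more than a week ago". A small sketch of that computation (the helper names are ours):

```java
// Builds an update date meaning "pages last accessed more than seven
// days ago should be downloaded again". The value is milliseconds
// after January 1, 1970 00:00:00 GMT, as documented for
// setUpdateDate(long). Illustrative helper, not part of the scone API.
public class UpdateDateExample {
    static final long WEEK_MILLIS = 7L * 24 * 60 * 60 * 1000;

    static long oneWeekBefore(long nowMillis) {
        return nowMillis - WEEK_MILLIS;
    }

    public static void main(String[] args) {
        long cutoff = oneWeekBefore(System.currentTimeMillis());
        // task.setUpdateDate(cutoff);  // pages older than this are refreshed
        System.out.println(cutoff < System.currentTimeMillis()); // true
    }
}
```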

getUpdateDate

public long getUpdateDate()
Gets the update date

Returns:
date in milliseconds after January 1, 1970 00:00:00 GMT

setMaxPageSize

public void setMaxPageSize(int size)
Only the specified amount of bytes are downloaded from each document

Parameters:
size - download only size bytes

getMaxPageSize

public int getMaxPageSize()
Gets the download size limit

Returns:
the maximum amount of bytes the robot will download for each page

setMaxDownloadTime

public void setMaxDownloadTime(long time)
The robot will only download a document for the specified time

Parameters:
time - maximum download time for each document

getMaxDownloadTime

public long getMaxDownloadTime()
Gets the maximum download time

Returns:
maximum download time for each page

addPageClassifier

public void addPageClassifier(PageClassifier pageClassifier)
Adds a PageClassifier to this task. The classifier can add attributes to the page. All classifiers are executed serially.

Parameters:
pageClassifier - add this PageClassifier

removePageClassifier

public void removePageClassifier(PageClassifier pageClassifier)
Removes a Classifier

Parameters:
pageClassifier - remove this one

addLinkClassifier

public void addLinkClassifier(LinkClassifier linkClassifier)
Adds a LinkClassifier to this task. The classifier can add attributes to the link. All classifiers are executed serially.

Parameters:
linkClassifier - add this LinkClassifier

removeLinkClassifier

public void removeLinkClassifier(LinkClassifier linkClassifier)
Removes a Classifier

Parameters:
linkClassifier - remove this one

addPageFilter

public void addPageFilter(PageFilter pageFilter)
Adds a PageFilter to this task. Filters decide whether to stop the crawling at the current document or to continue with the links. The filters are executed serially and a boolean AND operation is used for the decision.

Parameters:
pageFilter - add this PageFilter

removePageFilter

public void removePageFilter(PageFilter pageFilter)
Removes a Filter

Parameters:
pageFilter - remove this one

addLinkFilter

public void addLinkFilter(LinkFilter linkFilter)
Adds a LinkFilter to this task. Filters decide whether to follow a link or not. The filters are executed serially and a boolean AND operation is used for the decision.

Parameters:
linkFilter - add this LinkFilter
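The serial execution with a boolean AND, as described for both PageFilters and LinkFilters, means a single rejecting filter vetoes the link. A sketch of that combination logic; the accept(String) signature below is a hypothetical stand-in, since this page does not show scone's actual LinkFilter interface:

```java
import java.util.List;

// Sketch of combining serially executed filters with a boolean AND,
// as the addLinkFilter / addPageFilter documentation describes.
// The Filter interface here is hypothetical; scone's LinkFilter
// and PageFilter interfaces may differ.
public class FilterChain {
    interface Filter {
        boolean accept(String uri);
    }

    // A link is followed only if every registered filter accepts it.
    static boolean follow(List<Filter> filters, String uri) {
        for (Filter f : filters) {
            if (!f.accept(uri)) {
                return false; // one veto rejects the link
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Filter sameHost = uri -> uri.startsWith("http://example.org/");
        Filter noQuery  = uri -> !uri.contains("?");
        List<Filter> chain = List.of(sameHost, noQuery);
        System.out.println(follow(chain, "http://example.org/a.html")); // true
        System.out.println(follow(chain, "http://example.org/a?x=1"));  // false
    }
}
```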

removeLinkFilter

public void removeLinkFilter(LinkFilter linkFilter)
Removes a Filter

Parameters:
linkFilter - remove this one

getPageClassifier

public java.util.Enumeration getPageClassifier()
Get an Enumeration of all PageClassifiers

Returns:
Enumeration of all PageClassifiers

getLinkClassifier

public java.util.Enumeration getLinkClassifier()
Get an Enumeration of all LinkClassifiers

Returns:
Enumeration of all LinkClassifiers

getPageFilter

public java.util.Enumeration getPageFilter()
Get an Enumeration of all PageFilter

Returns:
Enumeration of all PageFilter

getLinkFilter

public java.util.Enumeration getLinkFilter()
Get an Enumeration of all LinkFilter

Returns:
Enumeration of all LinkFilter

setDoContentSeenTest

public void setDoContentSeenTest()
Enables the content-seen-test. If the robot does a content-seen-test, the crawling stops at pages that have been seen before under a different URL.


getDoContenSeenTest

public boolean getDoContenSeenTest()
Checks if the robot does a content-seen-test.

Returns:
true, if the robot does a content-seen-test
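A content-seen-test is typically implemented by remembering a digest of every downloaded body and stopping when the same content reappears under a different URL. The following is a generic illustration of that idea, not scone's implementation:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Generic sketch of a content-seen test: keep a set of content
// digests and report whether a body has been seen before.
// Illustrative only; scone's internals may differ.
public class ContentSeen {
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time this content is encountered,
    // false on every later sighting (e.g. under a different URL).
    boolean firstSighting(String body) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return seen.add(hex.toString());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    public static void main(String[] args) {
        ContentSeen cs = new ContentSeen();
        System.out.println(cs.firstSighting("<html>same page</html>")); // true
        System.out.println(cs.firstSighting("<html>same page</html>")); // false: crawling stops
    }
}
```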

setRequireSourceCode

public void setRequireSourceCode(boolean requireSourceCode)
If this is set to true, the robot saves the source code of every document. Documents that are in the database without source are downloaded again.

Parameters:
requireSourceCode - if true, the robot requires the source code of every document

getRequireSourceCode

public boolean getRequireSourceCode()
Checks if the robot requires the source code

Returns:
true, if the robot requires the source code

getArrivalTime

public long getArrivalTime()
Get the arrival time of this task at the robot

Returns:
arrival time

getStartTime

public long getStartTime()
Get the start time of this task

Returns:
start time

getEndTime

public long getEndTime()
Get the end time for this task

Returns:
end time

getCheckedUris

public int getCheckedUris()
Get the number of checked URIs. Every link and frame is counted, even if the URI has been checked before.

Returns:
checked URIs

getQueuedUris

public int getQueuedUris()
Get the number of queued URIS

Returns:
queued URIs

getFilteredUris

public int getFilteredUris()
Get the number of filtered URIs. The robot counts all URIs that were filtered by the DefaultFilter (wrong file-extension, restriction) or by the LinkFilters of this task. checkedUris = filteredUris + queuedUris + URIs that have been processed before.

Returns:
filtered Uris

getDownloadedUris

public int getDownloadedUris()
Get the number of downloaded URIs

Returns:
downloaded URIs

getCacheHits

public int getCacheHits()
Get the number of cache hits: queuedUris = downloadedUris + cacheHits

Returns:
cache hits
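The accounting identities stated for getFilteredUris and getCacheHits can be checked directly. The field names below are ours, chosen to match the documented relations:

```java
// Checks the two accounting identities documented for this class:
//   checkedUris = filteredUris + queuedUris + urisProcessedBefore
//   queuedUris  = downloadedUris + cacheHits
// Plain arithmetic for illustration; not part of the scone API.
public class CrawlStats {
    static boolean consistent(int checked, int filtered, int queued,
                              int processedBefore, int downloaded, int cacheHits) {
        return checked == filtered + queued + processedBefore
            && queued == downloaded + cacheHits;
    }

    public static void main(String[] args) {
        // Example: 100 links checked; 30 filtered, 50 queued, 20 seen before.
        // Of the 50 queued, 35 were downloaded and 15 came from the cache.
        System.out.println(consistent(100, 30, 50, 20, 35, 15)); // true
    }
}
```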

isOpenUri

public boolean isOpenUri(SimpleUri uri)
Checks if there's an element in this task's list of open URIs which is equal to uri

Parameters:
uri - look for this uri
Returns:
true, if an equal URI is open

getOpenUri

public QueueEntry getOpenUri(SimpleUri uri)
Checks if there's an element in this task's list of open URIs which is equal to uri

Parameters:
uri - look for this uri
Returns:
QueueEntry with an URI equal to the parameter uri

getNumberOfOpenUris

public int getNumberOfOpenUris()
Get the number of open URIs for this task.

Returns:
number of open URIs

getOpenUris

public java.util.Enumeration getOpenUris()
Get the URIs of this task the robot is currently working on.

Returns:
Enumeration of the URIs

addResultNode

public void addResultNode(RobotHtmlNode robotHtmlNode)
Adds an element to the result set

Parameters:
robotHtmlNode - add this node

isResultUri

public boolean isResultUri(SimpleUri uri)
Checks if this URI is in the result

Parameters:
uri - check this URI
Returns:
true, if uri is in the result

getNumberOfResultNodes

public int getNumberOfResultNodes()
Get the number of result nodes

Returns:
number of result nodes

getResultNodes

public java.util.Enumeration getResultNodes()
Get all result nodes

Returns:
Enumeration of the result nodes

getResultNode

public RobotHtmlNode getResultNode(SimpleUri uri)
Get the result node with the URI equal to the parameter uri

Parameters:
uri - get the result node for this URI
Returns:
RobotHtmlNode with URI equal to the parameter uri or null

wasStopped

public boolean wasStopped()
Get the value of the stop flag

Returns:
true, if the task was stopped before it finished