SMILA/Documentation/Importing/VisitedLinks
VisitedLinks: An auxiliary Service for crawler workers
The VisitedLinks service helps crawlers track which links in the data source have already been visited. This is necessary for sources where the link graph is not a simple tree (as it usually is in a filesystem) but can contain meshes or even cycles. For example, when crawling a web site, some pages will be linked from many (or even all) other pages, and pages can refer to each other, but they must not be crawled over and over again, in order to prevent duplicates or endless loops. The VisitedLinks service can keep track of this information even when the crawler worker is running on multiple nodes in a SMILA cluster.
Usage is relatively simple: the crawler just needs to call a single method:
boolean isVisited(String sourceId, String link, String jobRunId, String inputBulkId)
The method returns false if
- the link was not yet visited in this job run.
- the link was visited in this job run while processing an input bulk with the same ID. Usually an input bulk is processed a second time only if the first attempt failed for some reason (e.g. the process or machine crashed), so if a worker processes the same input bulk again, it is very likely that the first attempt failed and the link was not actually crawled.
In both cases the service updates the entry for the link and the crawler should continue to crawl the link. However, because checking and updating the entry in the service may not be completely atomic, the crawler should check again a bit later (before actually writing records to output bulks) whether the link has still not been visited by another task. To do so, just repeat the same isVisited call as before.
Otherwise, the method returns true, i.e. the link was already visited in the same job run, but was read from a different input bulk. In this case the crawler should just drop the link.
See the WebCrawlerWorker for an example of how to use this service.
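The call pattern described above (check, crawl, check again before writing) can be sketched roughly as follows. This is only an illustration, not the WebCrawlerWorker code: the VisitedLinks interface below is a local stand-in that declares nothing but the method signature quoted above (any exceptions the real service may declare are left out), and InMemoryVisitedLinks, CrawlSketch and crawlIfNew are hypothetical names that exist only to make the sketch runnable without a SMILA installation.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Local stand-in for the service; only the signature is taken from the text above. */
interface VisitedLinks {
  boolean isVisited(String sourceId, String link, String jobRunId, String inputBulkId);
}

/** Hypothetical in-memory variant that mimics the documented return values. */
final class InMemoryVisitedLinks implements VisitedLinks {
  // key: (sourceId, jobRunId, link), value: ID of the input bulk that claimed the link
  private final Map<List<String>, String> bulkByLink = new ConcurrentHashMap<>();

  @Override
  public boolean isVisited(String sourceId, String link, String jobRunId, String inputBulkId) {
    String previousBulk =
        bulkByLink.putIfAbsent(List.of(sourceId, jobRunId, link), inputBulkId);
    // false: not yet seen in this job run, or seen while processing the same input bulk
    // true: seen in this job run while processing a different input bulk
    return previousBulk != null && !previousBulk.equals(inputBulkId);
  }
}

/** Hypothetical crawler-side helper showing the double check. */
final class CrawlSketch {
  /** Returns true if this task crawled the link, false if the link was dropped. */
  static boolean crawlIfNew(VisitedLinks visitedLinks, String sourceId, String link,
      String jobRunId, String inputBulkId, Runnable fetch, Runnable writeRecord) {
    if (visitedLinks.isVisited(sourceId, link, jobRunId, inputBulkId)) {
      return false; // already handled via another input bulk: drop the link
    }
    fetch.run();
    // Repeat the same call before writing to the output bulk, because the
    // first check-and-update may not have been completely atomic.
    if (visitedLinks.isVisited(sourceId, link, jobRunId, inputBulkId)) {
      return false;
    }
    writeRecord.run();
    return true;
  }
}
```

In a real worker, the service instance would come from SMILA rather than from an in-memory map, and the two Runnable arguments would be replaced by the worker's own fetching and record-writing logic.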
ObjectStoreVisitedLinks service implementation
The bundle org.eclipse.smila.importing.state.objectstore provides an implementation of the VisitedLinks service that uses the ObjectStore service, in a way similar to the ObjectStoreDeltaService, to keep track of the visited state of links.
The service uses the store visitedlinks.
Configuration
As the ObjectStoreVisitedLinks service shares most of its code with the ObjectStoreDeltaService, it also has the same configuration properties as the delta service. The only difference is that they are read from org.eclipse.smila.importing.state.objectstore/visitedlinksstore.properties.
VisitedLinks REST API
Currently there is only a simple REST API for VisitedLinks. It shows how many entries have been stored for each data source, and it allows deleting all entries of a single source or the entries of all sources.
Show active sources
- URL: /smila/importing/visitedlinks
- Method: GET
- Response Code: 200 OK, if successful
- Response JSON:
{"sources": [ { "id": "web", "url": "http://localhost:8080/smila/importing/visitedlinks/web" } ]}
Clear all sources
- URL: /smila/importing/visitedlinks
- Method: DELETE
- Response Code: 200 OK, if successful
- Response JSON: none
Get info about a source
- URL: /smila/importing/visitedlinks/<sourcename>
- Method: GET
- Response Code:
- 200 OK, if successful
- 404 NOT FOUND, if the source currently has no entries
- Response JSON:
Contains the ID of the source and the number of entries. If there are more than 10000 entries, the number is only estimated because exact counting could take a long time. To force an exact count, add ?countExact=true to the request URL.
{ "id": "web", "count": "123456" }
Clear a single source
- URL: /smila/importing/visitedlinks/<sourcename>
- Method: DELETE
- Response Code: 200 OK, if successful
- Response JSON: none
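As a rough illustration of the four requests described above, the following sketch builds them with the JDK's java.net.http API (Java 11 or later). The host and port are taken from the example response shown earlier and may differ in an actual installation; the class and method names are hypothetical, and source IDs are assumed to be URL-safe so that no escaping is needed.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that only builds the requests; it does not send them. */
final class VisitedLinksRequests {
  // Assumption: SMILA is reachable at localhost:8080, as in the example response.
  static final String BASE = "http://localhost:8080/smila/importing/visitedlinks";

  /** GET on the base URL: show active sources. */
  static HttpRequest listSources() {
    return HttpRequest.newBuilder(URI.create(BASE)).GET().build();
  }

  /** DELETE on the base URL: clear all sources. */
  static HttpRequest clearAllSources() {
    return HttpRequest.newBuilder(URI.create(BASE)).DELETE().build();
  }

  /** GET on a source URL: entry count, optionally forcing an exact count. */
  static HttpRequest sourceInfo(String sourceId, boolean countExact) {
    String url = BASE + "/" + sourceId + (countExact ? "?countExact=true" : "");
    return HttpRequest.newBuilder(URI.create(url)).GET().build();
  }

  /** DELETE on a source URL: clear a single source. */
  static HttpRequest clearSource(String sourceId) {
    return HttpRequest.newBuilder(URI.create(BASE + "/" + sourceId)).DELETE().build();
  }
}
```

A request built this way can be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()); the caller would then check statusCode() for 200 or, for a source without entries, 404.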