Crawling multiple start URLs in one job run

This page describes an alternative way of using the WebCrawler worker that allows you to crawl multiple start URLs in a single job run (but in multiple workflow runs) instead of only a single start URL. The main idea is to send each start URL as a simple record to the Bulkbuilder Push API at /smila/job/<crawlJobName>/record instead of specifying a single start URL as a parameter value in the job definition.

It would also be easy to define further variants that crawl start URLs produced by some other worker, or to cover similar use cases.

Though we have tested the following workflow and settings with the WebCrawler worker only, similar workflows using other crawler workers should work as well, provided that the crawler worker is able to crawl follow-up links in its input slot that were produced by the very same worker in a previous task. An example of such a worker is the FileCrawler worker. Some workers might expect internal attributes to be set in these follow-up link records, which could cause problems. Please notify us if you observe such issues so that we can extend the respective worker accordingly.



Workflow Definition

Add a workflow definition like this:

POST /smila/jobmanager/workflows/
{
  "name":"multiWebCrawling",
  "modes":[
    "standard"
  ],
  "parameters": {
    "startUrl":"<send start urls via bulkbuilder>",
    "bulkLimitSize":1
  },
  "startAction":{
    "worker":"bulkbuilder",
    "output":{
      "insertedRecords":"linksToCrawlBucket"
    }
  },
  "actions":[
    {
      "worker":"webCrawler",
      "input":{
        "linksToCrawl":"linksToCrawlBucket"
      },
      "output":{
        "linksToCrawl":"linksToCrawlBucket",
        "crawledRecords":"crawledLinksBucket"
      }
    },
    {
      "worker":"deltaChecker",
      "input":{
        "recordsToCheck":"crawledLinksBucket"
      },
      "output":{
        "updatedRecords":"updatedLinksBucket",
        "updatedCompounds":"compoundLinksBucket"
      }
    },
    {
      "worker":"webExtractor",
      "input":{
        "compounds":"compoundLinksBucket"
      },
      "output":{
        "files":"fetchedLinksBucket"
      }
    },
    {
      "worker":"webFetcher",
      "input":{
        "linksToFetch":"updatedLinksBucket"
      },
      "output":{
        "fetchedLinks":"fetchedLinksBucket"
      }
    },
    {
      "worker":"updatePusher",
      "input":{
        "recordsToPush":"fetchedLinksBucket"
      }
    }
  ]
}
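If you want to add the workflow from the command line, a curl call like the following should do. This is just a sketch: it assumes a local SMILA instance listening on the default port 8080 and the JSON body saved to a file named multiWebCrawling.json (the file name is arbitrary).

curl -X POST -H "Content-Type: application/json" --data-binary @multiWebCrawling.json http://localhost:8080/smila/jobmanager/workflows/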

The differences from the standard "webCrawling" workflow are:

  • The start action is the bulkbuilder, not the webCrawler itself, so you can send records to this job via the document push API while it is running. We use this to send the records containing the start URLs.
  • The bulkbuilder parameter bulkLimitSize is set to 1 (byte), so each inserted record is written to a bulk of its own and crawled in its own workflow run. This way, a fatal error caused by one start URL does not abort the crawl of the other start URLs.
  • The webCrawler parameter startUrl is fixed to a dummy value here: the parameter is required, but we do not actually need it, and setting it in the workflow means it does not have to be included in the job definition.
  • Jobs using this workflow run in "standard" mode instead of "runOnce" mode. This means that you have to finish the job run yourself after providing the start URLs.

Job Definition

This could be the job definition:

POST /smila/jobmanager/jobs/
{
  "name":"crawlMultipleStartUrls",
  "workflow":"multiWebCrawling",
  "parameters":{
    "tempStore":"temp",
    "dataSource":"multiweb",
    "jobToPushTo":"indexUpdate",
    "linksPerBulk": 100,
    "filters":{
      "maxCrawlDepth": 3,
      "urlPatterns": {
         "include": [
           "http://.*eclipse\\.org/.*SMILA.*",
           "http://.*eclipse\\.org/.*smila.*"],
         "exclude": [".*\\?.*" ]
      }
    },
    "mapping":{
          "httpCharset": "Charset",
          "httpContenttype": "ContentType",
          "httpLastModified": "LastModifiedDate",
          "httpMimetype": "MimeType",
          "httpSize": "Size",
          "httpUrl": "Url",
          "httpContent": "Content"
    }
  }
}
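The job definition can be posted in the same way as the workflow definition above, again only as a sketch assuming the default port 8080 and the JSON body saved to a file named crawlMultipleStartUrls.json:

curl -X POST -H "Content-Type: application/json" --data-binary @crawlMultipleStartUrls.json http://localhost:8080/smila/jobmanager/jobs/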

The definition is very similar to a standard crawl job definition; it just does not include the start URL (which was already fixed to a dummy value in the workflow definition). Note that the urlPatterns are applied to the links found for every start URL, so the include patterns must be valid for all start URLs you are planning to crawl, or possibly nothing will be crawled at all. In this example we only want to crawl different parts of eclipse.org hosts, so the include patterns will work.

You can also use the "stayOn" parameter for such use cases. It causes the crawler to ignore all links on a web page that do not point to the same host or domain as the URL of the web page itself. See the WebCrawlerWorker parameters documentation for details.
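As a purely illustrative sketch: assuming your SMILA version accepts "host" as a value for the "stayOn" parameter (please check the WebCrawlerWorker parameters documentation for the actually supported values), restricting the crawl to the host of each start URL would be done in the job parameters, e.g.

  "parameters":{
    ...
    "stayOn": "host",
    ...
  }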



Running the Job

Start the target job and the crawl job:

POST /smila/jobmanager/jobs/indexUpdate/
POST /smila/jobmanager/jobs/crawlMultipleStartUrls/

Both jobs are now in state RUNNING, but nothing else happens yet.
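With curl, starting a job run is a plain POST without a request body (again a sketch assuming the default port 8080):

curl -X POST http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
curl -X POST http://localhost:8080/smila/jobmanager/jobs/crawlMultipleStartUrls/

The response to the second call should contain the ID of the started job run; keep it at hand, you will need it to finish the crawl job later.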

Push start URLs:

POST /smila/job/crawlMultipleStartUrls/record/
{
  "_recordid": "startUrl",
  "httpUrl":"http://www.eclipse.org/smila",
  "crawlDepth": 4
}
POST /smila/job/crawlMultipleStartUrls/record/
{
  "_recordid": "startUrl",
  "httpUrl":"http://wiki.eclipse.org/SMILA",
}

Things to note:

  • The value of the _recordid attribute is irrelevant, but the bulkbuilder requires it to be set.
  • The start URL must be provided as attribute httpUrl, regardless of the attribute mapping specified in the job.
  • The optional crawlDepth attribute can be used to specify an individual crawl depth for the given start URL. If it is not set, the WebCrawler worker's maxCrawlDepth parameter is used as the default. If maxCrawlDepth is not set either, the crawl depth is unlimited.
    • In the first record pushed above, "crawlDepth": 4 is set, so this value is used as the limit when following links.
    • In the second record, no crawlDepth is set, so "maxCrawlDepth": 3 (set in the crawl job above) is used.
    • Hint: You can use "crawlDepth": -1 to make the crawl depth unlimited, regardless of the maxCrawlDepth setting.
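If you prefer the command line, pushing a start URL record could look like this with curl (a sketch, assuming the default port 8080):

curl -X POST -H "Content-Type: application/json" -d '{"_recordid":"startUrl","httpUrl":"http://www.eclipse.org/smila","crawlDepth":4}' http://localhost:8080/smila/job/crawlMultipleStartUrls/record/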

Finish the job:

POST /smila/jobmanager/jobs/crawlMultipleStartUrls/20120823-164700474635/finish/

(of course, you have to adapt the job run id)
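If you did not note down the job run ID when starting the job, you should be able to look it up via the job monitoring API, e.g. (a sketch, assuming the default port 8080; the exact layout of the response may differ between SMILA versions):

curl http://localhost:8080/smila/jobmanager/jobs/crawlMultipleStartUrls/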

This will also cause delta-delete to be triggered when the crawling is done. Note that you should disable delta-delete if you do not crawl all start URLs in each job run; otherwise the documents from the start URLs that were not crawled in the latest job run will be removed from the index.
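A sketch of how delta-delete could be disabled, assuming your SMILA version supports the deltaImportStrategy parameter of the deltaChecker worker (please check the DeltaCheck documentation for the parameter name and its supported values): add it to the crawl job parameters with a value that skips the delta-delete phase, e.g.

  "parameters":{
    ...
    "deltaImportStrategy": "additive",
    ...
  }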
