SMILA/Documentation/JobDefinitions
Job definitions
To run a workflow in SMILA, you first have to create a job definition that references the workflow and sets the desired parameters. A job definition alone does not make the system do anything: the job must be started first, which creates a so-called job run. For jobs running in "standard" mode, the actual job processing is triggered by every new object that is dropped into the bucket connected to the start action of the respective workflow. The triggering continues with new objects until the job run is finished manually. Jobs running in "runOnce" mode, by contrast, do not react to new objects but process all objects currently contained in the respective input bucket and then finish automatically.
Jobs can also be predefined in the system, in which case they are located at configuration/org.eclipse.smila.jobmanager/jobs.json and are marked with "readOnly": true. Predefined jobs cannot be modified and, in contrast to user-defined jobs, do not contain a timestamp element.
Please note that job names must conform to the job manager naming convention:
- names must consist only of the following characters: a-zA-Z0-9._-
If they do not conform, they won't be accessible in SMILA.
- Pushing elements with invalid names will result in a 400 Bad Request.
- Predefined elements with invalid names won't be loaded; instead, a warning will be logged in the SMILA.log file.
E.g.
... WARN ... internal.DefinitionPersistenceImpl - Error parsing predefined jobs from configuration area org.eclipse.smila.common.exceptions.InvalidDefinitionException: Value 'indexInsert/Update' in field 'name' is not valid: A name must match pattern ^[a-zA-Z0-9-_\.]+$.
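To avoid a 400 Bad Request, a client can check names before pushing a definition. The following is a minimal sketch in Python, not part of SMILA: the function name is_valid_job_name is hypothetical, and the pattern is assumed to be equivalent to the one quoted in the log message above (the hyphen is moved to the end of the character class for readability).

```python
import re

# Assumed equivalent of the pattern from the log message:
# ^[a-zA-Z0-9-_\.]+$
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9._-]+")


def is_valid_job_name(name: str) -> bool:
    """Return True if the whole name consists only of allowed characters."""
    # fullmatch requires the entire string to match, so an empty name
    # or a name with a trailing newline is rejected as well.
    return JOB_NAME_PATTERN.fullmatch(name) is not None
```

With this check, "myJob" is accepted, while "indexInsert/Update" from the log example is rejected because of the slash.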
Job properties in detail
- name: Required. Defines the name of the job.
- workflow: Required. Gives the name of the desired workflow.
- modes: Optional. Restricts the modes the job can be started in and defines the default mode. Possible modes are standard and runOnce.
- The first mode in this list is used as the default job run mode (i.e. if no mode is provided when the job is started).
- If the workflow definition already contains a modes section, the modes section in the job definition cannot add modes to the list of available modes defined there. Adding a job definition that violates the workflow definition concerning modes will result in a 400 Bad Request error. See Job modes for more information.
- parameters: Optional. Defines the job parameters that will be resolved in the workflow to configure the participating workers and to instantiate the buckets. All parameters (variables) that are declared in the used data object types and workers and that have not yet been set in the workflow or bucket definitions must be set here at the latest. Otherwise an error will occur when trying to create the job.
- taskControl: Optional. Additional configuration of task delivery to the workers that are part of this job's workflow. Currently, this section can specify that tasks for some workers should not be delivered as fast as possible but with a delay: a delayed worker will only receive a new task if no other task for this worker in this job has been started in the configured time period. E.g., if the delay is set to 10 seconds, the worker will not get a new task if another task for this worker in the same job has been started during the last 10 seconds. This feature can be useful during crawling, for example, when you want to reduce the number of requests to a data source (typically web sources) to avoid causing a denial of service.
- workers: A list of names of workers that should be delayed.
- delay: a long value specifying the delay in milliseconds.
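The rules for the modes property can be illustrated with a short Python sketch. It is not SMILA code: the function names are hypothetical, and the behavior for a missing modes section (the job inherits the workflow's list, and a workflow without a modes section allows both modes) is an assumption made for this illustration; see Job modes for the actual rules.

```python
ALL_MODES = ["standard", "runOnce"]


def effective_modes(job_modes=None, workflow_modes=None):
    """Return the modes a job may be started in; the first one is the default."""
    # Assumption: a workflow without a modes section allows every mode.
    allowed = list(workflow_modes) if workflow_modes else list(ALL_MODES)
    if not job_modes:
        # Assumption: a job without a modes section inherits the workflow's list.
        return allowed
    # A job definition must not add modes the workflow does not allow;
    # SMILA answers such a definition with 400 Bad Request.
    extra = [mode for mode in job_modes if mode not in allowed]
    if extra:
        raise ValueError(f"modes not allowed by workflow: {extra}")
    return list(job_modes)


def default_mode(job_modes=None, workflow_modes=None):
    """Return the mode used when no mode is given at job start."""
    return effective_modes(job_modes, workflow_modes)[0]
```

For example, a job with "modes": ["runOnce", "standard"] would start in runOnce mode by default, while a job asking for runOnce on a workflow restricted to standard would be rejected.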
Job definitions can include additional information (e.g. comments or additional data for external tools), but a GET request will return only the relevant information (i.e. the attributes above). If you want to retrieve the additional information that is present in the JSON file or has been posted with the definition, add returnDetails=true as a request parameter.
Example
An exemplary job definition:
{
  "name": "myJob",
  "workflow": "myWorkflow",
  "modes": ["standard"],
  "parameters": {
    "index": "wikipedia",
    "store": "wikidocs"
  }
}
An exemplary job definition including a task control section:
{
  "name": "crawlSmilaWiki",
  "workflow": "crawlWeb",
  "parameters": {
    "startUrl": "http://wiki.eclipse.org/SMILA",
    "dataSource": "web",
    "tempStore": "wikidocs",
    "jobToPushTo": "indexUpdate"
  },
  "taskControl": {
    "workers": ["webCrawler"],
    "delay": 2000
  }
}
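The effect of the taskControl delay can be sketched as a simple time check. This Python snippet only illustrates the rule described for taskControl and is not how SMILA implements it; the function name may_start_task is hypothetical, and treating a task as deliverable exactly when the delay has elapsed is an assumption.

```python
def may_start_task(last_start_ms, now_ms, delay_ms):
    """Decide whether a delayed worker may receive a new task.

    last_start_ms: start time of the previous task for this worker in this
                   job, or None if no task has been started yet.
    now_ms:        current time in milliseconds.
    delay_ms:      the "delay" value from the taskControl section.
    """
    if last_start_ms is None:
        return True
    # Assumption: the boundary case (exactly delay_ms elapsed) is allowed.
    return now_ms - last_start_ms >= delay_ms
```

With "delay": 2000 as in the example above, a task started at t = 1000 ms would block further tasks for the webCrawler worker in this job until t = 3000 ms.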
List, create, modify jobs
All jobs
Use a GET request to retrieve a list of all job definitions. Use POST for adding or updating a job definition.
Supported operations:
- GET: Get a list of all job definitions and details about latest job run. Switch off details with returnDetails=false as a URL parameter. If there are no jobs defined, you will get an empty list.
- POST: Create a new job definition or update an existing one. If the job already exists, it will be updated after successful validation. However, the changes will not apply until the next job run, i.e. the current job run is not influenced by the changes.
Usage:
- URL: http://<hostname>:8080/smila/jobmanager/jobs/
- Allowed methods:
- GET
- POST
- Response status codes:
- 200 OK: Upon successful execution (GET).
- 201 CREATED: Upon successful execution (POST).
- 400 Bad Request: If the job references an undefined workflow, if not all parameters could be resolved, if mandatory fields are missing, if the job name is invalid, or if validation finds other errors (POST).
Examples:
To get all job definitions:
GET /smila/jobmanager/jobs/
The result would be:
HTTP/1.x 200 OK
{
  "jobs": [
    {
      "name": "myJob",
      "url": "http://localhost:8080/smila/jobmanager/jobs/myJob/"
    }
  ]
}
To create a job:
POST /smila/jobmanager/jobs/
{
  "name": "myJob",
  "parameters": {
    "index": "wikipedia",
    "store": "wikidocs"
  },
  "workflow": "myWorkflow"
}
The result would be:
HTTP/1.x 201 CREATED
{
  "name": "myJob",
  "timestamp": "2011-08-12T14:49:48.862+0200",
  "url": "http://localhost:8080/smila/jobmanager/jobs/myJob/"
}
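The POST request above could be prepared from a script roughly as follows. This is a sketch using only the Python standard library: the helper name build_create_job_request is hypothetical, the host localhost is taken from the example responses, and the Content-Type header application/json is an assumption, since this page does not state which header SMILA expects.

```python
import json
import urllib.request

# Assumed default, based on the URLs shown in the example responses.
JOBS_URL = "http://localhost:8080/smila/jobmanager/jobs/"


def build_create_job_request(job_definition, url=JOBS_URL):
    """Build (but do not send) a POST request that creates or updates a job."""
    body = json.dumps(job_definition).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # Assumption: the content type is not specified on this page.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request can then be sent with urllib.request.urlopen(request). A successful call is expected to return 201 CREATED, while an invalid definition should surface as a urllib.error.HTTPError with code 400.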
Specific job
Use a GET request to retrieve the definition of a specific job and an overview of job runs (current job run, if running, and a history of any previous job runs). Use DELETE to delete a job.
Supported operations:
- GET: get the definition of the given job.
- You can set the URL parameter returnDetails to true to return additional information that might have been provided when creating the job. If the parameter is omitted or set to false, only the relevant information (see above) is returned.
- You can set the URL parameter returnRuns to false to hide the job run lists. This should make the request faster if you don't need this information. If the parameter is omitted or set to true, the job run lists are included.
- DELETE: delete the given job definition.
- POST: start a job run, see SMILA/Documentation/JobRuns#Start_job_run
Usage:
- URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/
- Allowed methods:
- GET
- DELETE
- POST: see SMILA/Documentation/JobRuns#Start_job_run
- Response status codes:
- 200 OK: Upon successful execution (GET, DELETE). If the job definition to be deleted does not exist you will get a 200 anyway.
- 400 Bad Request: If a job run mode has been requested that is not allowed by workflow or job definition.
- 404 Not Found: If an undefined job name is used, an HTTP 404 Not Found including an error message in the response body will be returned.
Examples:
To get a specific job definition:
GET /smila/jobmanager/jobs/myJob/
The result would be:
HTTP/1.x 200 OK
{
  "definition": {
    "name": "myJob",
    "timestamp": "2011-08-12T14:49:48.862+0200",
    "parameters": {
      "index": "wikipedia",
      "store": "wikidocs"
    },
    "workflow": "myWorkflow"
  },
  "runs": {
    "current": {
      "jobId": "20110930-121108155498",
      "state": "RUNNING",
      "url": "http://localhost:8080/smila/jobmanager/jobs/myJob/20110930-121108155498/"
    },
    "history": [
      {
        "jobId": "20110930-121054946584",
        "url": "http://localhost:8080/smila/jobmanager/jobs/myJob/20110930-121054946584/"
      },
      {
        "jobId": "20110930-121042898813",
        "url": "http://localhost:8080/smila/jobmanager/jobs/myJob/20110930-121042898813/"
      }
    ]
  }
}
If the job is currently not running, the "current" map will be empty. If there are no finished job runs, the "history" sequence will be empty. If "returnRuns=false" has been set in the URL, the complete "runs" part will be missing:
GET /smila/jobmanager/jobs/myJob/?returnRuns=false
-->
HTTP/1.x 200 OK
{
  "definition": {
    "name": "myJob",
    "timestamp": "2011-08-12T14:49:48.862+0200",
    "parameters": {
      "index": "wikipedia",
      "store": "wikidocs"
    },
    "workflow": "myWorkflow"
  }
}
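A client that reads such responses has to cope with an empty "current" map and a missing "runs" part. The following Python sketch shows one way to build the request URL and to extract the state of the current job run; the helper names job_url and current_run_state are hypothetical, and the default host localhost is an assumption taken from the examples.

```python
from urllib.parse import quote, urlencode


def job_url(job_name, host="localhost", return_runs=None, return_details=None):
    """Build the URL of a specific job, adding optional URL parameters."""
    url = f"http://{host}:8080/smila/jobmanager/jobs/{quote(job_name, safe='')}/"
    params = {}
    if return_runs is not None:
        params["returnRuns"] = "true" if return_runs else "false"
    if return_details is not None:
        params["returnDetails"] = "true" if return_details else "false"
    if params:
        url += "?" + urlencode(params)
    return url


def current_run_state(job_response):
    """Return the state of the current job run, or None if there is none."""
    # "runs" is missing with returnRuns=false; "current" is empty
    # when the job is not running.
    current = job_response.get("runs", {}).get("current") or {}
    return current.get("state")
```

For the first response above, current_run_state would return "RUNNING"; for the response to the returnRuns=false request it would return None.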