


SMILA/Documentation/JobRuns

Job runs

A job definition alone does not make the system do anything. First, the job must be started to get a so-called job run. How the actual processing is then triggered depends on the mode of the job run.


Job run life cycle

[Figure: job run life cycle diagram (JobLifecycle-1.1.png)]

A job run starts in the PREPARING state. In this phase the job run is instantiated, necessary structures in the runtime storage (ZooKeeper) are created and the initial tasks for runOnce jobs are computed. If this has been done successfully, the job run goes to the RUNNING state and the real work of the job begins. When a finish command is sent (either by the user or the jobmanager itself for runOnce jobs) the job run is moved to the FINISHING state, in which no new workflow runs can be created, but only existing tasks and their follow-up tasks are completed. If all workflow runs have been done, a completion phase can follow (state COMPLETING, see below) and finally there is a clean-up phase in which statistics are persisted and the job run structures are removed from the runtime storage. If everything is OK, the job run ends in state SUCCEEDED.

A job run can be canceled if it is in the state PREPARING, RUNNING or FINISHING. This tries to remove or abort all current tasks of this job and immediately finishes the job run. No further tasks will be created in this job, no data in transient buckets will be removed. Also it will often cause lots of log entries complaining about tasks that could not be finished anymore. For that reason, it should be used as an emergency exit only, for example if some job is running wild or has gone dead. The state of such a job is CANCELED.

If fatal errors occur in any of the non-final states (e.g. PREPARING, COMPLETING etc), the job will be finished and stored with state FAILED. This happens, in particular, when none of the started workflow runs could be completed successfully.
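
The life cycle described above can be summarized as a small state table. The following Python sketch is only an illustration of the transitions named in this section, not SMILA's implementation. The names JobRunState, NEXT_STATE and allowed_targets are hypothetical, and the model is simplified: it lets every non-final state except CANCELING fail directly and treats CANCELING as leading only to CANCELED.

```python
from enum import Enum


class JobRunState(Enum):
    PREPARING = "PREPARING"
    RUNNING = "RUNNING"
    FINISHING = "FINISHING"
    COMPLETING = "COMPLETING"
    CLEANINGUP = "CLEANINGUP"
    CANCELING = "CANCELING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


FINAL_STATES = {JobRunState.SUCCEEDED, JobRunState.FAILED, JobRunState.CANCELED}

# A job run can be canceled only while it is in one of these states.
CANCELABLE_STATES = {
    JobRunState.PREPARING,
    JobRunState.RUNNING,
    JobRunState.FINISHING,
}

# Regular forward path of a job run when nothing goes wrong.
NEXT_STATE = {
    JobRunState.PREPARING: JobRunState.RUNNING,
    JobRunState.RUNNING: JobRunState.FINISHING,
    JobRunState.FINISHING: JobRunState.COMPLETING,
    JobRunState.COMPLETING: JobRunState.CLEANINGUP,
    JobRunState.CLEANINGUP: JobRunState.SUCCEEDED,
}


def allowed_targets(state):
    """Return the set of states reachable in one step from `state`."""
    if state in FINAL_STATES:
        return set()
    if state is JobRunState.CANCELING:
        return {JobRunState.CANCELED}
    # Assumption of this sketch: a fatal error moves the run straight to FAILED.
    targets = {NEXT_STATE[state], JobRunState.FAILED}
    if state in CANCELABLE_STATES:
        targets.add(JobRunState.CANCELING)
    return targets
```

For example, allowed_targets(JobRunState.COMPLETING) contains CLEANINGUP and FAILED but not CANCELING, which mirrors the rule that only PREPARING, RUNNING and FINISHING job runs can be canceled.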

Job modes

There are two different modes in which job runs can be operated:

  • Job runs in "standard" mode are triggered by every new object that is dropped into the bucket connected to the start action of the respective workflow (or, if the start action has no input bucket, by every object that the start worker produces from other sources, as the Bulkbuilder does). They continue until they are finished manually. Once in the FINISHING state, no new workflow runs are accepted, but the active ones continue until completed.
  • Job runs in "runOnce" mode start, create some initial tasks and go immediately to the FINISHING state. They do not react to further input. If something goes wrong while creating the initial tasks, the job run goes to state FAILED immediately and no task will be processed at all. All tasks are executed in a single workflow run; all follow-up tasks of the initial tasks are also part of this workflow run. A consequence of this is that the complete job run will fail if one task fails fatally, so workers in runOnce jobs should finish tasks with fatal errors only if this is really a critical problem that should cancel the job run. The details of the task creation depend on the start action worker, because the task creation is actually done by the worker's task generator. Currently there are two variants:
    • Either the workflow has exactly one persistent input bucket in its start action, and for each existing object in this bucket, one task for the start action worker is generated. Unlike a job run in standard mode, such a job run does not react to new objects but processes all objects that are currently contained in the respective input bucket and then finishes automatically.
    • Or the workflow starts with a worker using the runOnceTrigger task generator. In runOnce mode, this task generator creates one initial task without input data objects; afterwards it behaves just like the default task generator. The worker can then create output bulks at will. An application of this is the importing use case, where the starting crawler worker creates the first bulks from data extracted from an external data source. The input bucket of the start action worker does not need to be persistent in this use case.

Job modes can be restricted in the workflow and job definitions.

  • If no modes section is present in either definition, all modes are allowed when starting the job run, and the default mode (used if no mode is provided when starting the job) is standard.
  • The default mode is the first mode in the list of available modes.
  • If modes sections are present in both the workflow and the job definition, the job definition's modes section overrides the workflow definition's modes section.
  • The job definition's modes section can only restrict the workflow definition's modes section or define another default mode; it cannot expand the list of allowed modes.
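
The rules above can be read as a small resolution procedure. The following Python sketch shows one possible reading; it is not the jobmanager's code. The function names allowed_modes and resolve_mode are hypothetical, the modes sections are assumed to be plain non-empty lists of mode names, and raising an error when a job definition tries to add a mode is an assumption about how the "cannot expand" rule is enforced.

```python
ALL_MODES = ["standard", "runOnce"]


def allowed_modes(workflow_modes=None, job_modes=None):
    """Return the allowed modes; the first entry is the default mode."""
    if workflow_modes is None and job_modes is None:
        # No restriction anywhere: everything is allowed, standard is the default.
        return list(ALL_MODES)
    if job_modes is None:
        return list(workflow_modes)
    base = ALL_MODES if workflow_modes is None else workflow_modes
    extra = [mode for mode in job_modes if mode not in base]
    if extra:
        # Assumption: expanding the workflow's list is rejected as an error.
        raise ValueError(f"job definition cannot add modes: {extra}")
    # The job definition's section overrides the workflow's (order included).
    return list(job_modes)


def resolve_mode(requested=None, workflow_modes=None, job_modes=None):
    """Return the mode a job run would be started in, or raise ValueError."""
    modes = allowed_modes(workflow_modes, job_modes)
    if requested is None:
        return modes[0]
    if requested not in modes:
        raise ValueError(f"mode {requested!r} not allowed, expected one of {modes}")
    return requested
```

With a workflow that lists runOnce first, resolve_mode(workflow_modes=["runOnce", "standard"]) yields "runOnce"; a job definition with ["standard"] narrows this to "standard" only.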

Why should I restrict a workflow or job?

There are many reasons why one should want to restrict modes for a workflow or a job or define default modes, so a job is started in the desired mode if the mode is not explicitly given during job start.

For example, (most) crawler jobs only make sense when started in runOnce mode, so the modes should be restricted to this mode. Likewise, a job that uses the Bulkbuilder as start action only makes sense in standard mode, so the workflow should be restricted to this mode.

Job Completion Phase

Workers can request to get extra tasks after all the "standard" tasks of a job have been done. The purpose of this is to clean up or consolidate work that cannot be triggered before the actual work of the job run is done. An example is the Delta Delete part of the importing jobs: Only after a data source has been crawled completely, it can be determined which elements of the data source have been deleted since a previous import and must be deleted in the import target, too.

A completion run is initiated if both of the following conditions hold:

  • At least one worker used in the workflow requests it. This is done by setting the mode requestsCompletion in the worker definition.
  • At least one of the standard workflow runs has been completed successfully, i.e. without the completion run, the job run would have been finished as SUCCEEDED.

Then the taskgenerator of every worker that requests completion is invoked to create completion tasks. The default task generator creates one such task without any input or output bulks attached. Such tasks can be recognized by the worker because the task property isCompletingTask is present and set to true. These tasks are added to the worker task queue in the TaskManager just like normal tasks. No follow-up tasks are created after these completion tasks have finished successfully or with a fatal error, the retry behaviour on recoverable errors is just like that for normal tasks, too.

During the completion workflow run, the job run is in state COMPLETING. When all completion tasks are done, the job run is cleaned up (in a short intermediate state named CLEANINGUP) and finally finished (as SUCCEEDED or FAILED). If no worker requests completion or no task generator creates a completion task for the workers that request a completion run, the job leaves the COMPLETING state immediately and (via CLEANINGUP) moves to SUCCEEDED or FAILED.
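
The two conditions for a completion run and the way a worker recognizes a completion task can be sketched in Python as follows. This is a hedged illustration only: the dictionary shapes (a "modes" list in each worker definition, a flat task properties mapping) and the function names are assumptions, and the property value is accepted both as a boolean and as the string "true" because the exact encoding is not stated here.

```python
def completion_run_needed(worker_definitions, successful_workflow_runs):
    """Return True if a completion run would be initiated.

    worker_definitions: list of dicts, each with an optional "modes" list
    (assumed shape). successful_workflow_runs: number of standard workflow
    runs that completed successfully.
    """
    requested = any(
        "requestsCompletion" in worker.get("modes", [])
        for worker in worker_definitions
    )
    return requested and successful_workflow_runs > 0


def is_completing_task(task_properties):
    """Return True if the task carries isCompletingTask set to true."""
    value = task_properties.get("isCompletingTask")
    return value is True or str(value).lower() == "true"
```

A worker could call is_completing_task on each task it receives and branch into its clean-up logic (for example delta delete) only when the result is True.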

Resuming jobs after restart of SMILA process

By default, when restarting SMILA after it has been shut down or crashed (hopefully not), no jobs will be running any more and all tasks that were "todo" or "in-progress" in the previous session will be gone. However, it is possible to restart SMILA in such a way that all jobs that were active at the time of shutdown will immediately be active again, and all tasks not yet done will still be available, so that all job runs should continue without losing the information they were processing.

Of course, tasks that were in progress at the time of shutdown will not be continued immediately but retried after their time-to-live has expired. Such tasks are counted in retriedAfterTimeoutTaskCount in the job run data.

Resuming of jobs can be enabled or disabled in two ways:

  • By configuration: see TaskManager Configuration
  • Override only for the current start: You can use one of two command line parameters to override the configuration setting:
    • -resumeJobs: resume jobs, don't clear job run data and tasks.
    • -discardJobs: discard jobs, remove all active job run data and tasks.
Implementation details

Technically, the removal of active job run data and tasks is done in the Zookeeper service: When resuming jobs is disabled, the service removes all files in its data directories before starting the internal Zookeeper service. Otherwise it just starts with the data left over from the last run, which is a persisted version of Zookeeper's in-memory tree.

Additionally, this means that you should not configure Zookeeper to write its data to a RAM disk if you need the active job run data and tasks to survive even a crash of the machine SMILA is running on (e.g. due to power failure). However, it is strongly recommended to use at least a different hard disk for the Zookeeper data dir than the disk SMILA stores its own data on. See Zookeeper service for details on this configuration.



Start job run

You can use a POST request without a request body to start a job run in default mode (see Job modes to find out which mode is the default mode of the job). To start a job run in a different mode than the default mode, add the following simple JSON object to the request body (in this example the job should be started in runOnce mode):

{
  "mode": "runOnce"
}

The job cannot be started if it requires another job that is currently not running. A required job is a job that is referenced by a parameter that is marked with range="jobName" in one of the workers used in this job.

Supported operations:

  • POST: Start job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/
  • Allowed methods:
    • POST (with optional request body specifying the mode)
  • Response status codes:
    • 200 OK: Upon successful execution. A JSON object with jobId and url will be returned.
    • 400 BAD REQUEST:
      • The taskgenerator, job or workflow definition do not allow the requested mode.
      • The job uses a worker that requires another job to be running, and that job is not running.

Example:

To start the job named "myJob" in "default" mode (see above) you don't have to explicitly set a job run mode:

POST /smila/jobmanager/jobs/myJob/

The result would be:

HTTP/1.x 200 OK

{
  "jobId" : "20110712-184509666721",
  "url" : "http://localhost:8080/smila/jobmanager/jobs/myJob/20110712-184509666721/"
}


To start the job named "myJob" in "standard" mode, if that mode is allowed by job and workflow definition:

POST /smila/jobmanager/jobs/myJob/

{
  "mode": "standard"
}

To start the job named "myJob" in "runOnce" mode, if that mode is allowed by job and workflow definition:

POST /smila/jobmanager/jobs/myJob/

{
  "mode": "runOnce"
}

The result object has the same structure as in "standard" mode.
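
The start requests shown above could be issued from a script, for example with Python's standard library. The sketch below is an assumption-laden illustration: BASE_URL points at a local SMILA instance, the function names are made up for this example, and sending a Content-Type header of application/json with the body is an assumption that is not stated on this page.

```python
import json
import urllib.request

# Assumed location of a locally running SMILA instance.
BASE_URL = "http://localhost:8080/smila/jobmanager/jobs"


def build_start_request(job_name, mode=None, base_url=BASE_URL):
    """Build the POST request that starts a job run, optionally with a mode."""
    url = f"{base_url}/{job_name}/"
    if mode is None:
        # No body: the job run is started in the job's default mode.
        data, headers = None, {}
    else:
        data = json.dumps({"mode": mode}).encode("utf-8")
        headers = {"Content-Type": "application/json"}  # assumption
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def start_job_run(job_name, mode=None):
    """Send the request and return the parsed JSON object (jobId and url)."""
    with urllib.request.urlopen(build_start_request(job_name, mode)) as response:
        return json.load(response)
```

Calling start_job_run("myJob", "runOnce") against a running server would be expected to return an object with the jobId and url fields described above; a 400 response surfaces as urllib.error.HTTPError.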

Monitor a job run or delete job run data

Use a GET request to view job run data of a specific job run. Use DELETE to delete the data of a specific job run.

Job run data:

The following parameters are contained in the job run data:

  • jobId: The ID of the job run.
  • mode: The mode of the job run, i.e. either STANDARD or RUNONCE.
  • state: The current status of the job run, see Job run life cycle. May be one of the following:
    • PREPARING: started but not running yet
    • RUNNING: running
    • FINISHING: finished but not all tasks processed yet
    • COMPLETING: finished, all tasks processed but job run not completed (e.g. not persisted) yet
    • SUCCEEDED: successfully completed
    • FAILED: failed
    • CANCELING: canceled, but clean-up is not yet completed.
    • CANCELED: canceling done.
  • workflowRuns: Describes the workflow runs which are part of this job run. Note: startedWorkflowRunCount == activeWorkflowRunCount + successfulWorkflowRunCount + failedWorkflowRunCount + canceledWorkflowRunCount
    • startedWorkflowRunCount: The number of started workflow runs.
    • startedCompletionWorkflowRunCount: The number of started completion workflow runs.
    • activeWorkflowRunCount: The number of active workflow runs.
    • successfulWorkflowRunCount: The number of successfully finished workflow runs (incl. a completion workflow run, if started).
    • failedWorkflowRunCount: The number of failed workflow runs.
    • canceledWorkflowRunCount: The number of canceled workflow runs.
  • tasks: Describes the tasks which are part of this job run. After the job has finished it should be createdTaskCount == successfulTaskCount + retriedAfterErrorTaskCount + retriedAfterTimeoutTaskCount + failedAfterRetryTaskCount + failedWithoutRetryTaskCount + canceledTaskCount + obsoleteTaskCount. However, we cannot strictly guarantee this, under very high load it's possible that a task is not counted correctly.
    • createdTaskCount: The number of tasks created in this run. This includes tasks created due to retry.
    • successfulTaskCount: The number of tasks that were finished successfully by a worker.
    • retriedAfterErrorTaskCount: The number of tasks that were retried because a worker finished the task with a recoverable error (e.g. IOError while reading the input or writing the output).
    • retriedAfterTimeoutTaskCount: The number of tasks that were retried because a worker did not send the "keepAlive" signal anymore.
    • failedAfterRetryTaskCount: The number of tasks that finally failed after reaching the configured maximum number of retries.
    • failedWithoutRetryTaskCount: The number of tasks that finally failed because the worker finished the task with a fatal error (e.g. due to corrupt input data).
    • canceledTaskCount: The number of tasks that were canceled because a workflow run was canceled or failed due to another task in the workflow run having finally failed. They may have produced their result successfully, but they did not trigger follow-up tasks.
    • obsoleteTaskCount: The number of tasks that became obsolete for some reason. The difference to 'canceledTaskCount' is that becoming obsolete is not triggered by an error on the workflow run resp. another task. It's just that the cause / precondition to process this task is gone.
  • startTime: The timestamp when the job run was started (DateTime format ISO).
  • finishTime: The timestamp when the finish command was called for this job run (DateTime format ISO).
  • endTime: The timestamp when the job status changed to SUCCEEDED, FAILED or CANCELED.
  • worker: Contains accumulated job run data for all workers that have contributed to this job run. Each entry starts with a counter followed by "_" and the Worker Name (e.g. "0_bulkbuilder", "1_pipeletProcessor", "2_pipeletProcessor"), so that if a workflow contains the same Worker multiple times we have separate counters for each Worker instance. The counter matches the order of execution of the workers, starting with 0 for the first Worker (the start action). Each worker entry contains:
    • The number of successful, failed, and retried tasks for each worker in this job run (same counter names and meanings as in the global section above).
    • startTime: The timestamp when the first task for a worker of this type was started in the job run (DateTime format ISO).
    • endTime: The timestamp when the latest task for a worker of this type was finished in the job run (DateTime format ISO). This timestamp is updated with every finished task.
    • The accumulated counters as reported by the workers in their result descriptions.
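
The two counter relations stated above (for workflowRuns and for tasks) can be checked on a parsed job run data object. The following Python sketch assumes that the job run data has been loaded into a dictionary with "workflowRuns" and "tasks" sections keyed by the counter names listed above; the helper names are hypothetical, and treating a missing counter as 0 is an assumption.

```python
WORKFLOW_RUN_PARTS = (
    "activeWorkflowRunCount",
    "successfulWorkflowRunCount",
    "failedWorkflowRunCount",
    "canceledWorkflowRunCount",
)
TASK_PARTS = (
    "successfulTaskCount",
    "retriedAfterErrorTaskCount",
    "retriedAfterTimeoutTaskCount",
    "failedAfterRetryTaskCount",
    "failedWithoutRetryTaskCount",
    "canceledTaskCount",
    "obsoleteTaskCount",
)


def counter_gap(section, total_key, part_keys):
    """Return total minus the sum of its parts; 0 means the counters add up."""
    # Assumption: a counter that is absent from the data counts as 0.
    return section.get(total_key, 0) - sum(section.get(key, 0) for key in part_keys)


def check_job_run(job_run):
    """Return the gap for both relations of a parsed job run data object."""
    return {
        "workflowRuns": counter_gap(
            job_run.get("workflowRuns", {}), "startedWorkflowRunCount", WORKFLOW_RUN_PARTS
        ),
        "tasks": counter_gap(job_run.get("tasks", {}), "createdTaskCount", TASK_PARTS),
    }
```

A non-zero gap in the tasks section of a finished job run is not necessarily a bug, since the task relation is not strictly guaranteed under very high load.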

Supported operations:

  • GET: To monitor the job run.
  • DELETE: To delete job run data.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/
  • Allowed methods:
    • GET
    • DELETE
  • Response status codes:
    • 200 OK: Upon successful execution (GET/DELETE). If the job run with the given job name and job id does not exist, no error will occur during DELETE.
    • 500 Server Error: If the job run is still running (DELETE).
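
A common use of the GET operation is to poll a job run until it reaches a final state. The Python sketch below illustrates this under several assumptions: the job run URL has the form .../jobs/<job-name>/<job-id>/ (as in the url field returned when starting a job run), the server runs on localhost, and the function names, polling interval and poll limit are arbitrary choices for this example. The fetch function is passed in as a parameter so that the loop can be tried without a running server.

```python
import json
import time
import urllib.request

FINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def job_run_url(job_name, job_id, base_url="http://localhost:8080/smila/jobmanager/jobs"):
    """Build the URL of a job run (base_url is an assumed local instance)."""
    return f"{base_url}/{job_name}/{job_id}/"


def fetch_job_run(url):
    """GET the job run data and return it as a dict."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_job_run(url, fetch=fetch_job_run, interval=2.0, max_polls=100):
    """Poll until the job run state is final, then return the job run data."""
    for _ in range(max_polls):
        data = fetch(url)
        if data["state"] in FINAL_STATES:
            return data
        time.sleep(interval)
    raise TimeoutError(f"job run at {url} did not end after {max_polls} polls")
```

Once wait_for_job_run has returned, the job run is no longer running, so a DELETE on the same URL should not run into the 500 response described above.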

Monitor a job run with details

It is possible to update existing jobs: you can update the job definition, the workflow definition and the bucket definition. To see which definitions were used during a job run, request additional information with returnDetails=true.

GET /smila/jobmanager/jobs/<job-name>/<job-id>/?returnDetails=true

Finish job run

Use a POST request to finish a job run.

A job cannot be finished while a dependent job is running. A dependent job is one using a worker that has a parameter marked with range="jobName" referencing this job.

Supported operations:

  • POST: finish job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/finish/
  • Allowed methods:
    • POST
  • Response status codes:
    • 202 ACCEPTED: Finishes the job run (asynchronous call)
    • 400 BAD REQUEST: wrong URL pattern, or another running job requires this job to be running.
    • 404 NOT FOUND: job run not found
    • 405 METHOD NOT ALLOWED: wrong HTTP method used, only POST is accepted here
    • 410 GONE: job run was finished before and has already been moved to the history of job runs
    • 500 INTERNAL SERVER ERROR: other errors

Cancel job run

Use a POST request to cancel a job run.

Supported operations:

  • POST: cancel job run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/cancel/
  • Allowed methods:
    • POST
  • Response status codes:
    • 200 OK: Upon successful execution. The job run is canceled.
    • 400 BAD REQUEST: wrong URL pattern.
    • 404 NOT FOUND: job run not found
    • 405 METHOD NOT ALLOWED: wrong HTTP method used, only POST is accepted here
    • 410 GONE: job run was finished before and has already been moved to the history of job runs
    • 500 INTERNAL SERVER ERROR: other errors
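
The finish and cancel calls differ only in the last path segment and in the success status (202 for finish, 200 for cancel), so a client can handle both with one helper. The following Python sketch is a hedged illustration: BASE_URL, the function names and the wording of the error messages are assumptions made for this example.

```python
import urllib.error
import urllib.request

# Assumed location of a locally running SMILA instance.
BASE_URL = "http://localhost:8080/smila/jobmanager/jobs"

SUCCESS_STATUS = {"finish": 202, "cancel": 200}
STATUS_MEANING = {
    400: "wrong URL pattern, or (finish only) a running job still requires this job",
    404: "job run not found",
    405: "wrong HTTP method, only POST is accepted",
    410: "job run was finished before and has been moved to the job run history",
    500: "other server error",
}


def build_command_request(job_name, job_id, command, base_url=BASE_URL):
    """Build the POST request for the 'finish' or 'cancel' command."""
    if command not in SUCCESS_STATUS:
        raise ValueError(f"unknown command {command!r}")
    url = f"{base_url}/{job_name}/{job_id}/{command}/"
    return urllib.request.Request(url, method="POST")


def send_command(job_name, job_id, command):
    """Send the command; raise RuntimeError with a readable reason on failure."""
    request = build_command_request(job_name, job_id, command)
    try:
        with urllib.request.urlopen(request) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        reason = STATUS_MEANING.get(error.code, f"unexpected status {error.code}")
        raise RuntimeError(f"{command} failed: {reason}") from error
    if status != SUCCESS_STATUS[command]:
        raise RuntimeError(f"{command} returned unexpected status {status}")
    return status
```

Because finish is asynchronous (202 ACCEPTED), a caller that needs the final result still has to monitor the job run afterwards.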

Cancel workflow run

Not only a whole job run can be canceled, but also a single workflow run inside a job run. A workflow run may be canceled if the job run is in state RUNNING or FINISHING.

To cancel a workflow run you need the job run ID and the workflow run ID. This information is contained, for example, in the JSON response body when pushing records to the record API.

If the canceled workflow run does not exist, the call is ignored.

Supported operations:

  • POST: cancel workflow run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/workflowrun/<workflowRun-id>/cancel/
  • Allowed methods:
    • POST
  • Response status codes:
    • 200 OK: Upon successful execution.
    • 400 BAD REQUEST: wrong URL pattern.
    • 404 NOT FOUND: job run not found (Hint: if the workflow run is not found, the call is ignored and the response is 200 OK)
    • 405 METHOD NOT ALLOWED: wrong HTTP method used, only POST is accepted here
    • 410 GONE: job run was finished before and has already been moved to the history of job runs
    • 500 INTERNAL SERVER ERROR: other errors

Monitor a workflow run

Use a GET request to monitor a workflow run.

Supported operations:

  • GET: monitor workflow run.

Usage:

  • URL: http://<hostname>:8080/smila/jobmanager/jobs/<job-name>/<job-id>/workflowrun/<workflowRun-id>/
  • Allowed methods:
    • GET
  • Response status codes:
    • 200 OK: Upon successful execution.
    • 404 NOT FOUND: If the specified workflow run does not exist. This can mean either that the workflow run existed but has already finished, or that it never existed at all. You cannot distinguish the two cases unless you know from other information that the ID existed before.

Examples:

To monitor a workflow run:

GET /smila/jobmanager/jobs/myJob/20110527_175314695579/workflowrun/1/

If it is still running, the result would be:

HTTP/1.x 200 OK

{
  "activeTaskCount": 1,
  "transientBulkCount": 1
}

If not, the result would be:

HTTP/1.x 404 NOT FOUND
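
A client that monitors a workflow run has to treat 404 as a regular outcome rather than an error, since it is the only signal that the workflow run is no longer active. The Python sketch below illustrates this; the function names, the shape of the returned dictionary and the localhost base URL are assumptions made for this example.

```python
import json
import urllib.error
import urllib.request


def workflow_run_url(job_name, job_id, workflow_run_id,
                     base_url="http://localhost:8080/smila/jobmanager/jobs"):
    """Build the monitoring URL of a workflow run (base_url is assumed)."""
    return f"{base_url}/{job_name}/{job_id}/workflowrun/{workflow_run_id}/"


def interpret_workflow_run(status, body=None):
    """Translate status code and response body into a small summary dict."""
    if status == 200:
        data = json.loads(body)
        return {
            "active": True,
            "activeTaskCount": data.get("activeTaskCount"),
            "transientBulkCount": data.get("transientBulkCount"),
        }
    if status == 404:
        # Either already finished or never existed; the response cannot tell.
        return {"active": False}
    raise ValueError(f"unexpected status {status}")


def monitor_workflow_run(job_name, job_id, workflow_run_id):
    """GET the workflow run and return the summary dict."""
    url = workflow_run_url(job_name, job_id, workflow_run_id)
    try:
        with urllib.request.urlopen(url) as response:
            return interpret_workflow_run(response.status, response.read())
    except urllib.error.HTTPError as error:
        return interpret_workflow_run(error.code)
```

Appending cancel/ to the URL returned by workflow_run_url and sending it as a POST gives the cancel request for the same workflow run.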
