SMILA/Documentation/Agent
Overview
An Agent monitors a data source for changes, sending both content and metadata of interest about new/modified resources and Ids of deleted resources.
SMILA currently comes with two types of Agents, each for a different data source type: MockAgent (a sample implementation of an Agent) and FeedAgent, which monitors RSS and Atom feeds. Furthermore, the Connectivity Framework provides an API for developers to create their own Agents.
API
An Agent has to implement the interface Agent, which extends the interface Runnable. The easiest way to achieve this is to extend the abstract base class AbstractAgent located in bundle org.eclipse.smila.connectivity.framework. This class already handles the Agent's Id, provides an OSGi service activate method, and supplies default implementations of the start() and stop() methods, with start() creating a new Thread for the Agent to run in. So the only method that has to be implemented is run() of the Runnable interface, which contains the processing logic of the Agent.
Javadoc org.eclipse.smila.connectivity.framework.Agent
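To make the division of labour between the base class and run() more concrete, here is a minimal, self-contained sketch of the thread handling described above. It does not use the real SMILA classes: SketchAgentBase, CountingAgent, awaitTermination() and the polls counter are hypothetical names invented for this illustration, and the actual AbstractAgent may differ in its details.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical stand-in for the thread handling the text attributes to
 * AbstractAgent. All names are illustrative, not the SMILA API.
 */
abstract class SketchAgentBase implements Runnable {
    private volatile boolean stopThread; // plays the role of _stopThread
    private Thread agentThread;

    /** Creates and starts a new thread that executes run(). */
    public synchronized void start() {
        stopThread = false;
        agentThread = new Thread(this, "sketch-agent");
        agentThread.start();
    }

    /** Asks the loop in run() to end; does not block. */
    public void stop() {
        stopThread = true;
    }

    protected boolean isStopThread() {
        return stopThread;
    }

    /** Waits up to millis for the agent thread; true if it has ended. */
    public boolean awaitTermination(long millis) {
        Thread t;
        synchronized (this) {
            t = agentThread;
        }
        if (t == null) {
            return true; // never started
        }
        try {
            t.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }
}

/** Only run() is implemented here; it polls until the stop flag is set. */
class CountingAgent extends SketchAgentBase {
    final AtomicInteger polls = new AtomicInteger();

    @Override
    public void run() {
        while (!isStopThread()) {
            polls.incrementAndGet(); // placeholder for checking the data source
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

In this sketch a caller would invoke start(), later stop(), and then awaitTermination() to confirm that the loop has observed the flag. The flag is declared volatile so that the change made by the calling thread is visible to the agent thread.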
Architecture
Agents are managed and instantiated by the AgentController. The AgentController communicates with the Agent via the interface Agent to start or stop it. While the Agent is running, it communicates with the AgentController via the callback interface AgentControllerCallback to send add and delete events. The Agent itself has no reference to the DeltaIndexingManager; only the AgentController, which initializes the delta indexing session, has one. To identify the session, the parameter sessionId is passed in method start(final AgentControllerCallback controllerCallback, final AgentState agentState, final DataSourceConnectionConfig config, final String sessionId) so that the Agent can send it back to the AgentController via the interface AgentControllerCallback.
Agents extend the Runnable interface and must implement method run(). The abstract base class AbstractAgent already includes some thread handling: the start() method creates a new Thread for the Agent and stores it in a private member variable. AbstractAgent also contains a private boolean flag _stopThread. The run() method should watch this flag using method isStopThread() to check when processing should end. Here is some skeleton code showing what the implementation could look like:
/**
 * Skeleton code for the run() method.
 *
 * @see java.lang.Runnable#run()
 */
public void run() {
  try {
    while (!isStopThread()) {
      try {
        // here goes the agent business logic
        // (e.g. blocking calls that may throw InterruptedException)
      } catch (InterruptedException e) {
        if (_log.isTraceEnabled()) {
          _log.trace("agent thread was interrupted ", e);
        }
      }
    }
  } catch (Exception e) {
    throw new RuntimeException(e);
  } catch (Throwable t) {
    throw new RuntimeException(t);
  } finally {
    // always release the agent, even if the loop ended with an error
    try {
      stop();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
Package org.eclipse.smila.connectivity.framework.util provides some factory classes for Agents to create Ids, hashes and DataReference objects.
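The callback flow described in this section (the Agent reports add and delete events and passes the sessionId back with each one) can be sketched in plain Java. The interface below is a simplified, hypothetical stand-in: SketchCallback, RecordingCallback and ChangeReporter are invented for this illustration and do not reproduce the method signatures of the real AgentControllerCallback, which this page does not list. The sketch also assumes that a data source can be represented as a snapshot map from Id to content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

/** Hypothetical, simplified callback; not the SMILA AgentControllerCallback. */
interface SketchCallback {
    void add(String sessionId, String id, String content);

    void delete(String sessionId, String id);
}

/** Records every event as a string so the flow can be inspected. */
class RecordingCallback implements SketchCallback {
    final List<String> events = new ArrayList<>();

    @Override
    public void add(String sessionId, String id, String content) {
        events.add("add:" + sessionId + ":" + id);
    }

    @Override
    public void delete(String sessionId, String id) {
        events.add("delete:" + sessionId + ":" + id);
    }
}

class ChangeReporter {
    /**
     * Compares two snapshots (Id -> content) and reports the differences:
     * add for new or modified entries, delete for entries that disappeared.
     * Ids are visited in sorted order so the event order is deterministic.
     */
    static void report(Map<String, String> before, Map<String, String> after,
                       String sessionId, SketchCallback callback) {
        for (Map.Entry<String, String> entry : new TreeMap<>(after).entrySet()) {
            String id = entry.getKey();
            boolean isNew = !before.containsKey(id);
            boolean isModified = !Objects.equals(before.get(id), entry.getValue());
            if (isNew || isModified) {
                callback.add(sessionId, id, entry.getValue());
            }
        }
        for (String id : new TreeMap<>(before).keySet()) {
            if (!after.containsKey(id)) {
                callback.delete(sessionId, id); // only the Id is sent for deletions
            }
        }
    }
}
```

Note that the sketch never touches any delta indexing state itself; it only forwards events together with the session identifier, which mirrors the statement that only the AgentController holds a reference to the DeltaIndexingManager.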
Configuration
An Agent is started with a specific, named configuration that defines what information is to be sent (e.g. content, kinds of metadata) and where to find that data (e.g. file system path, JDBC connection string). See each Agent's documentation for details on configuration options.
Each Agent can define its own configuration because Agents need different information to monitor different data sources. For example, a JDBC Agent needs information about which database and which table should be monitored and which columns should be returned.
Therefore the Agent developer defines a schema that contains all relevant information. This schema is based on a root schema that is shared between Agents and Crawlers. It declares the generic frame that has to be used to send DataSourceConnectionConfigs to the SMILA framework. The root schema can be found in: configuration\org.eclipse.smila.connectivity.framework.schema/schemas/RootDataSourceConnectionConfigSchema.xsd.
The root schema is structured as follows:
- DataSourceID
  - A description string that is used throughout the framework to separate and address information that applies to the same Agent.
- SchemaID
  - The SchemaID contains the full bundle name of the Agent (e.g. FeedAgent: org.eclipse.smila.connectivity.framework.agent.feed). The SMILA framework uses this information to look up the schema for validating the DataSourceConnectionConfig that is to be executed.
- DataConnectionID
  - This tag describes whether an Agent or a Crawler should be used. It contains one of the following tags:
    - Agent
    - Crawler
  - The name used in these tags is the service name of the Agent/Crawler.
- RecordBuffer
  - Here you can specify settings to optimize record transfer to the ConnectivityManager. These settings are not applicable to Agents!
    - Size - the number of records to be sent to the ConnectivityManager in one block. Default is 1.
    - FlushInterval - a time interval in milliseconds after which the current elements of the RecordBuffer are sent to the ConnectivityManager. Default is 1000.
- DeltaIndexing
  - Configuration options for delta indexing that are interpreted by the AgentController. The following values are supported:
    - full - delta indexing is fully activated. Records are checked to see whether they need to be updated, entries for new/updated records are added to the DeltaIndexingManager, and delta-delete is executed if no error occurred.
    - additive - same as full, but delta-delete is not executed.
    - initial - for an initial import into an empty index, or for a new source in an existing index, performance can be optimized by NOT checking whether a record needs to be updated (all records are known to be new) while still adding an entry to the DeltaIndexingManager for each record. This allows later runs using full or additive to make use of the delta indexing information.
    - disabled - delta indexing is fully disabled. No checks are done, no entries are created/updated, and no delta-delete is executed. Later runs cannot benefit from delta indexing.
- CompoundHandling
  - Configuration options for CompoundHandling. See CompoundManagement for details.
- Attributes
  - Placeholder for each Agent's attribute definition. Here the Agent defines which attributes it returns. An attribute is a specific piece of information about an entry in the data source that is monitored by the Agent (e.g. in a file system an entry is a file, and the attributes of a file are Size, Content, etc.).
- Process
  - This element is meant to be extended by the Agent developer in a derived schema and may be used to define anything that is pertinent to getting the Agent's job done. This information may include connection information for the data source to monitor, or filters such as queries, wildcards, includes, excludes, etc.
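The four DeltaIndexing values listed above differ only in which steps are performed, which can be summarised as a small lookup type. The following is a sketch, not SMILA code: the enum DeltaIndexingMode, its field names and the fromConfig() helper are hypothetical. One entry is an assumption rather than something the list states: the description of initial does not say whether delta-delete is executed, and the sketch assumes it is not.

```java
import java.util.Locale;

/** Hypothetical summary of the DeltaIndexing values; not part of SMILA. */
enum DeltaIndexingMode {
    //       check for update, write entries, run delta-delete
    FULL(true, true, true),
    ADDITIVE(true, true, false),
    // ASSUMPTION: the text does not state whether "initial" runs delta-delete;
    // this sketch assumes it does not.
    INITIAL(false, true, false),
    DISABLED(false, false, false);

    final boolean checksForUpdate;
    final boolean writesEntries;
    final boolean runsDeltaDelete;

    DeltaIndexingMode(boolean checksForUpdate, boolean writesEntries, boolean runsDeltaDelete) {
        this.checksForUpdate = checksForUpdate;
        this.writesEntries = writesEntries;
        this.runsDeltaDelete = runsDeltaDelete;
    }

    /** Maps a configuration value such as "additive" to a mode. */
    static DeltaIndexingMode fromConfig(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}
```

For example, DeltaIndexingMode.fromConfig("additive") yields a mode that checks records and writes entries but skips delta-delete. An unknown value makes valueOf() throw an IllegalArgumentException.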
Further Information:
- See each Agent's documentation for its Attributes and Process tags
- How to implement an Agent
Agent lifecycle
The AgentController manages the life cycle of the Agent (e.g. start, stop, abort) and may instantiate multiple Agents concurrently, even of the same type. This is realised by using OSGi ComponentFactories. An Agent does not automatically start an OSGi service; it only registers an Agent ComponentFactory with the AgentController. Via the ComponentFactory, the AgentController can instantiate Agents on demand.
Here is a template for an Agent OSGi component definition:
<component name="%AGENT_TYPE%" immediate="false" factory="AgentFactory">
  <implementation class="%AGENT_IMPLEMENTATION_CLASS%" />
  <service>
    <provide interface="org.eclipse.smila.connectivity.framework.Agent"/>
  </service>
</component>
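The idea behind the factory-based life cycle (one registered factory per Agent type, from which the controller creates as many independent instances as it needs) can be shown without OSGi. The sketch below uses a plain java.util.function.Supplier in place of an OSGi ComponentFactory. SketchFactoryAgent, FeedLikeAgent and SketchController are hypothetical names, and the real AgentController is certainly more involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical minimal agent contract for this sketch. */
interface SketchFactoryAgent {
    String type();
}

/** Hypothetical agent type; stands in for a concrete Agent implementation. */
class FeedLikeAgent implements SketchFactoryAgent {
    @Override
    public String type() {
        return "FeedLikeAgent";
    }
}

/** Stand-in controller: keeps one factory per agent type, creates on demand. */
class SketchController {
    private final Map<String, Supplier<SketchFactoryAgent>> factories = new HashMap<>();

    /** Corresponds to an agent type making its factory known to the controller. */
    void registerFactory(String agentType, Supplier<SketchFactoryAgent> factory) {
        factories.put(agentType, factory);
    }

    /** Creates a fresh, independent instance each time it is called. */
    SketchFactoryAgent createAgent(String agentType) {
        Supplier<SketchFactoryAgent> factory = factories.get(agentType);
        if (factory == null) {
            throw new IllegalArgumentException("no factory registered for " + agentType);
        }
        return factory.get();
    }
}
```

Registering FeedLikeAgent::new once and calling createAgent("FeedLikeAgent") twice yields two distinct objects, which is the property that allows several Agents of the same type to run concurrently.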
See also
More information about the different Agents can be found here: