SMILA/Documentation/DeltaIndexingManager
Overview
The DeltaIndexingManager stores information about the last modification of each record and can determine whether a record has changed since it was last processed. This decision is based on a hash value provided by a crawler. How the hash is computed depends on the crawler and its configuration; for example, the filesystem crawler usually computes the hash from the file's last modification date. The DeltaIndexingManager provides functionality to manage this information, to determine whether already processed documents have changed, to mark documents that have not changed (visited flag), and to determine documents that are indexed but no longer exist in the data source.
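The idea of hash-based change detection can be illustrated with a minimal in-memory sketch. It is not the SMILA API: the class DeltaIndexSketch and its methods startRun, needsUpdate and removeObsolete are hypothetical names chosen for this illustration, and the sketch simplifies matters by marking a record as visited and comparing its hash in a single call.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory model of delta indexing; not the SMILA API. */
class DeltaIndexSketch {
  /** Last known hash per record id. */
  private final Map<String, String> hashes = new HashMap<>();
  /** Ids that were seen during the current crawl run. */
  private final Set<String> visited = new HashSet<>();

  /** Starts a new crawl run by clearing all visited flags. */
  void startRun() {
    visited.clear();
  }

  /** Marks the record as visited, stores its hash, and reports whether it is new or changed. */
  boolean needsUpdate(String id, String hash) {
    visited.add(id);
    String previous = hashes.put(id, hash);
    // previous is null for unknown records, so new records also need an update
    return !hash.equals(previous);
  }

  /** Returns the ids that are stored but were not visited in this run, and forgets them. */
  Set<String> removeObsolete() {
    Set<String> obsolete = new TreeSet<>(hashes.keySet());
    obsolete.removeAll(visited);
    hashes.keySet().removeAll(obsolete);
    return obsolete;
  }
}
```

In this sketch a crawler would call startRun() once, needsUpdate(id, hash) for every record it encounters, and removeObsolete() at the end of the run to learn which entries should be deleted from the index.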
Before you can use delta indexing, you have to create a working session with the DeltaIndexingManager by calling init(final String dataSourceID). This generates a new session, locks the given data source (if it is not already locked by another session), and returns the session ID. This session ID has to be used for all upcoming calls to the DeltaIndexingManager. Calling finish(final String sessionId) releases the lock and destroys the session.
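The session and lock handling described above could be modelled roughly as follows. This is a sketch under assumptions, not the SMILA implementation: the class SessionLockSketch and its isLocked helper are hypothetical, and throwing an IllegalStateException when the data source is already locked is an assumption, since the text does not say how a failed lock attempt is reported.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical model of the session and lock handling; not the SMILA implementation. */
class SessionLockSketch {
  /** Maps each open session id to the data source it has locked. */
  private final Map<String, String> sessions = new HashMap<>();

  /** Locks the data source and returns a new session id. */
  synchronized String init(String dataSourceId) {
    if (sessions.containsValue(dataSourceId)) {
      // Assumption: a second session on a locked data source is rejected with an exception.
      throw new IllegalStateException("data source is locked: " + dataSourceId);
    }
    String sessionId = UUID.randomUUID().toString();
    sessions.put(sessionId, dataSourceId);
    return sessionId;
  }

  /** Releases the lock held by the session and destroys the session. */
  synchronized void finish(String sessionId) {
    if (sessions.remove(sessionId) == null) {
      throw new IllegalArgumentException("unknown session: " + sessionId);
    }
  }

  /** Reports whether some open session currently holds the lock on the data source. */
  synchronized boolean isLocked(String dataSourceId) {
    return sessions.containsValue(dataSourceId);
  }
}
```

A client of such an API would typically call finish in a finally block, so that the lock is released even if processing fails between init and finish.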
API
For the current definition of the interface in trunk, see the Javadoc: org.eclipse.smila.connectivity.deltaindexing.DeltaIndexingManager
Implementations
SMILA currently comes with two implementations, a memory-backed one and a database-backed one. Others may provide further implementations of the DeltaIndexingManager interface.
In general it makes sense to activate only one DeltaIndexingManager implementation at a time. This is achieved by starting only the desired implementation bundle. If multiple implementations are started, a client using the DeltaIndexingManager has to provide an OSGi filter when requesting the service; otherwise it gets a reference to an arbitrary implementation. Each component description includes a property named smila.connectivity.deltaindexing.impl that can be used for filtering. At the moment the only component that has a reference to the DeltaIndexingManager is the ConnectivityManager.
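As an illustration of such a filter, the following sketch builds the LDAP-style filter string for the property named above. The helper class DeltaIndexingFilterSketch is hypothetical; the resulting string is what a client could pass as the filter argument of the standard OSGi method BundleContext.getServiceReferences(String clazz, String filter). The sketch does not escape special characters in the value, which is sufficient for the values memory and jpa listed below.

```java
/** Hypothetical helper that builds the OSGi filter for one DeltaIndexingManager implementation. */
class DeltaIndexingFilterSketch {
  /** Property name used in the component descriptions. */
  static final String IMPL_PROPERTY = "smila.connectivity.deltaindexing.impl";

  /** Returns an LDAP-style equality filter such as (smila.connectivity.deltaindexing.impl=jpa). */
  static String implFilter(String implValue) {
    return "(" + IMPL_PROPERTY + "=" + implValue + ")";
  }
}
```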
Below is a list of the currently available implementations.
org.eclipse.smila.connectivity.deltaindexing.impl
This implementation stores the delta indexing information in memory. When the DeltaIndexingManager is stopped or started, the current state is written to or read from files located at workspace\.metadata\.plugins\org.eclipse.smila.connectivity.deltaindexing. These files are named according to the dataSourceId. This implementation is only useful during development, because the in-memory storage will lead to an OutOfMemoryError when used with a high data load.
Filter Property
<property name="smila.connectivity.deltaindexing.impl" value="memory"/>
Configuration
There are no configuration options available for this bundle.
org.eclipse.smila.connectivity.deltaindexing.jpa.impl
This implementation uses EclipseLink JPA to store the delta indexing information in an Apache Derby database. The data is stored in the two tables DATA_SOURCES and DELTA_INDEXING, described below.
DATA_SOURCES:

Column | Type | Description |
---|---|---|
SOURCE_ID | VARCHAR | a hashed value of the Id object of the record |
LOCKED | BOOLEAN | a flag if this data source was locked |
LOCKED_BY | VARCHAR | the id of the thread that locked this data source |

DELTA_INDEXING:

Column | Type | Description |
---|---|---|
ID_HASH | VARCHAR | the hashed value of the Id object of the record |
HASH | VARCHAR | the delta indexing hash value |
SOURCE_ID | VARCHAR | the data source Id |
IS_COMPOUND | BOOLEAN | flag if this entry is a compound object |
PARENT_ID_HASH | VARCHAR | the hashed value of the parent Id object. This is only set if this Id is an element of a compound object, otherwise it is NULL |
VISITED | BOOLEAN | flag if this entry was already visited |
MODIFIED | BOOLEAN | flag if this entry was modified |
ID | BLOB | the serialized Id object. This is needed to reconstruct the Id objects for method obsoleteIdIterator() |
Filter Property
<property name="smila.connectivity.deltaindexing.impl" value="jpa"/>
Configuration
The only configuration needed is a typical EclipseLink configuration property file. In it you can specify logging and database connection settings. For more information please refer to the EclipseLink documentation [1].
The configuration is located at configuration/org.eclipse.smila.connectivity.deltaindexing.jpa.impl/persistence.properties.
# EclipseLink properties
eclipselink.logging.level=INFO
eclipselink.target-server=None
eclipselink.target-database=org.eclipse.persistence.platform.database.DerbyPlatform
eclipselink.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
eclipselink.jdbc.url=jdbc:derby:workspace/.metadata/.plugins/org.eclipse.smila.connectivity.deltaindexing.jpa.impl/deltaindexingstorage;create=true
eclipselink.jdbc.password=smila
eclipselink.jdbc.user=smila
eclipselink.ddl-generation=drop-and-create-tables
When SMILA is started for the first time, the DDL generation setting prints some warnings complaining that it cannot create some tables. These warnings are not critical. You can get rid of them by setting eclipselink.ddl-generation=none, but only after SMILA has been started at least once (so that the tables have been created).
Limitations
At the moment it is necessary to import all packages containing JDBC driver classes in org.eclipse.smila.connectivity.deltaindexing.jpa.impl. To change from Derby to another database it is therefore not sufficient to change the configuration in persistence.properties; you also have to add Import-Package statements for the JDBC driver you want to use to the bundle's manifest. This will hopefully change with the next release of EclipseLink.