SMILA/Project Concepts/ID Concept

Description

The purpose of an ID is to identify an object in the system. What is an object in SMILA?

  • simple case: a single document
  • what about compounds?
    • archive files, e.g. ZIPs
    • Big documents that should be indexed by page or by section
  • SMILA objects have a life cycle
    • creation in crawler or agent
    • enrichment, splitting, merging (possible?) during processing in SMILA
    • persisting in storages (possibly in different states of processing) or indexes (usually at the end, but possibly multiple times).
    • the process is repeated when the source object changes (index update) -> the new object must have the same object ID.
    • using the ID, it must be possible to refer to the source object.


Discussion

Technical proposal

Definition of concepts:

  • data source: a single location providing access to a collection of data (web server, file system, database, CMS, ...). Data is read from a data source using crawlers/agents. A data source must have a unique source ID within SMILA so that it can be referred to without dealing with the technical details of access.
  • source object: an entity in a data source. A crawler/agent can create multiple SMILA objects from a single source object (e.g. by extracting files from a ZIP archive). A source object can be identified with respect to its data source using a relatively simple key (URL, path, primary key, ...).
  • record: an entity representing a complete source object or a part of a source object to be processed by SMILA.
    • Can be split into multiple records.
    • Multiple records referring to different parts of the same source object can be merged again? Could be useful to split really large documents, process them section by section and merge the results again.
    • Can be written to storages or indexes.
    • Can be read from a storage in order to redo the rest of the processing (e.g. to rebuild an index after ontology changes).


Record ID design

A record ID must contain the following parts, and it must be possible to extract them from the ID again:

  • data source ID
  • key of source object in data source, relative to the definitions of the data source

These must be provided by the crawler/agent.

Source objects can have multiple key values, e.g. in database tables with a primary key consisting of multiple columns.

During processing, the record ID can be extended:

  • Part specification after splitting a compound
    • Element: part of a container, e.g. path in archive (what about recursion: part of part of part...), attachment index in mails, etc. The element is identified by another key which is relative to the container element.
    • Fragment: identified by page number, section number, section name, etc.

If merging is supported, multiple records belonging to the same source object can be merged into a single record. The merged ID must reflect this.

Do we want to pack all this into a single ID string (URL, whatever)? All kinds of quoting problems may arise (remember that the source object key could be a complex URL itself already). Thus, we probably want to use a structured ID object. Something like this:

<rec:Record>
	<id:ID>
		<id:Source><!-- String: ID of data source --></id:Source>
		<id:Key><!-- String: key of source object relative to data source --></id:Key>
 
		<!-- the elements above are mandatory, the following is optional -->
 
		<id:Element>
			<id:Key><!-- String: path in archive, attachment index --></id:Key>
			<!-- id:Element can be repeated for recursive archives -->
		</id:Element>
 
		<id:Fragment><!-- page number, section name/number --></id:Fragment>
		<!-- maybe repeated e.g. for books: Part, Chapter, Section, Subsection ... -->
	</id:ID>
 
	<!-- other metadata and non-binary content -->
 
</rec:Record>
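
To illustrate the quoting problem mentioned above, here is a minimal sketch of what can go wrong when everything is packed into one string. The format `<source>:<key>#<fragment>`, the class `NaiveStringId`, and the example URL are made up for this illustration and are not part of SMILA.

```java
/** Hypothetical illustration only: packs source, key and fragment into one string. */
final class NaiveStringId {
    // Assumed ad-hoc format: <source>:<key>#<fragment>
    static String pack(String source, String key, String fragment) {
        return source + ":" + key + "#" + fragment;
    }

    // Splits at the first ':' and the first '#', as a naive parser might do.
    static String[] unpack(String packed) {
        int colon = packed.indexOf(':');
        int hash = packed.indexOf('#');
        return new String[] {
            packed.substring(0, colon),
            packed.substring(colon + 1, hash),
            packed.substring(hash + 1)
        };
    }
}

final class NaiveStringIdDemo {
    public static void main(String[] args) {
        // The source object key is itself a URL that contains ':' and '#'.
        String packed = NaiveStringId.pack("web", "http://example.org/doc.html#intro", "0");
        String[] parts = NaiveStringId.unpack(packed);
        // The key loses its "#intro" part and the fragment becomes "intro#0".
        System.out.println("key      = " + parts[1]);
        System.out.println("fragment = " + parts[2]);
    }
}
```

Escaping the separators would repair this particular case, but every producer and consumer of IDs would have to apply the same escaping rules. A structured ID avoids that by keeping the parts separate.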

For a source object with multiple key values it must be distinguishable which key value belongs to which key "column". Therefore id:Key can be optionally annotated with a name attribute:

<rec:Record>
	<id:ID>
		<id:Source><!-- String: ID of data source --></id:Source>
		<id:Key name="column1"><!-- key value in named column --></id:Key>
		<id:Key name="column2"><!-- key value in named column --></id:Key>
		...
	</id:ID>
</rec:Record>

Because id:Element uses the id:Key element to identify the element inside a compound, it would be technically possible to support compounds that need multiple key values to identify an element. We cannot think of an actual use case currently, though (-;

In Java:

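import java.io.Serializable;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Note: each public interface goes into its own source file.

public interface ID extends Serializable
{
    String getSource();
    Key getKey();

    // parts added after splitting a compound; empty if the ID refers to a whole source object
    List<Key> getElements();
    List<String> getFragments();

    // derive the ID of a part of the object identified by this ID
    ID createElementID(String elementName);
    ID createElementID(Key elementKey);
    ID createFragmentID(String fragmentName);

    ID mergeWith(Collection<ID> otherParts);
}

public interface Key extends Serializable
{
    // name used for the value of a key that has no named "columns"
    static final String NONAME = "__SMILA:unnamedkey__";

    Iterator<String> getKeyNames();
    String getKey(String name);
    String getKey(); // shortcut for getKey(NONAME)
}

public interface IDFactory
{
    ID createID(String source, Key key);
    Key createKey(Map<String, String> keyValues);

    // convenience methods:
    ID createID(String source, String key);
    ID createID(String source, Map<String, String> keyValues);
    Key createKey(String key);
}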

IDs should be usable as hash keys:

  • IDs are immutable objects
  • Provide appropriate equals() and hashCode() implementations
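
The two requirements above could be met roughly as in the following sketch. `SimpleId` is a hypothetical, simplified value class (plain strings instead of `Key` objects) and not the actual SMILA implementation. In Java, a usable hash key needs a consistent pair of `equals()` and `hashCode()`, so both are shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical immutable ID value class; a sketch, not the SMILA implementation. */
final class SimpleId {
    private final String source;
    private final String key;
    private final List<String> elements;   // element keys, outermost container first
    private final List<String> fragments;

    SimpleId(String source, String key, List<String> elements, List<String> fragments) {
        this.source = Objects.requireNonNull(source);
        this.key = Objects.requireNonNull(key);
        // Defensive copies: later changes to the caller's lists cannot affect this ID.
        this.elements = Collections.unmodifiableList(new ArrayList<String>(elements));
        this.fragments = Collections.unmodifiableList(new ArrayList<String>(fragments));
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof SimpleId)) return false;
        SimpleId that = (SimpleId) other;
        return source.equals(that.source) && key.equals(that.key)
            && elements.equals(that.elements) && fragments.equals(that.fragments);
    }

    @Override
    public int hashCode() {
        // Uses exactly the fields compared in equals().
        return Objects.hash(source, key, elements, fragments);
    }
}

final class SimpleIdDemo {
    public static void main(String[] args) {
        List<String> none = Collections.<String>emptyList();
        Map<SimpleId, String> store = new HashMap<SimpleId, String>();
        store.put(new SimpleId("share", "PDF/big.pdf", none, none), "record for big.pdf");
        // A second, equal instance finds the same entry.
        System.out.println(store.get(new SimpleId("share", "PDF/big.pdf", none, none)));
    }
}
```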

Examples

Assume a file system data source named "share", referring to a shared directory on a file server (e.g. "\\fileserv\share"). It looks like this:

\\fileserv\share
    |- PDF
    |   \- big.pdf
    \- Archive
        \- oldstuff.zip
            \- PDF
                |- old.pdf
                \- another.zip
                    \- another.pdf

"big.pdf" initially gets this ID:

<id:ID>
    <id:Source>share</id:Source>
    <id:Key>PDF/big.pdf</id:Key>
</id:ID>

After splitting it by pages, the following ID refers to the first page of the document:

<id:ID>
    <id:Source>share</id:Source>
    <id:Key>PDF/big.pdf</id:Key>
    <id:Fragment>0</id:Fragment> <!-- or start counting at 1? -->
</id:ID>

Similarly for the ZIP archive. It starts as:

<id:ID>
    <id:Source>share</id:Source>
    <id:Key>Archive/oldstuff.zip</id:Key>
</id:ID>

When it is expanded, the contained file is referred to as

<id:ID>
    <id:Source>share</id:Source>
    <id:Key>Archive/oldstuff.zip</id:Key>
    <id:Element>
        <id:Key>PDF/old.pdf</id:Key>
    </id:Element>
</id:ID>

which in turn can be split into pages to become:

<id:ID>
    <id:Source>share</id:Source>
    <id:Key>Archive/oldstuff.zip</id:Key>
    <id:Element>
        <id:Key>PDF/old.pdf</id:Key>
    </id:Element>
    <id:Fragment>0</id:Fragment>
</id:ID>

And finally, the first page of the PDF in the nested archive another.zip would have this ID:

<id:ID>
    <id:Source>share</id:Source>
    <id:Key>Archive/oldstuff.zip</id:Key>
    <id:Element>
        <id:Key>PDF/another.zip</id:Key>
        <id:Element>
            <id:Key>another.pdf</id:Key>
        </id:Element>
    </id:Element>
    <id:Fragment>0</id:Fragment>
</id:ID>
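
The file system examples above can be reproduced in Java along the following lines. This is only a sketch: `PathId` is a hypothetical class that uses plain strings instead of `Key` objects and keeps elements in a flat list (outermost container first), as the `getElements()` method of the proposed interface suggests. Each derive method returns a new object, so the ID of the container stays valid.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: deriving element and fragment IDs without modifying the original. */
final class PathId {
    private final String source;
    private final String key;
    private final List<String> elements;   // outermost container first
    private final List<String> fragments;

    PathId(String source, String key) {
        this(source, key, Collections.<String>emptyList(), Collections.<String>emptyList());
    }

    private PathId(String source, String key, List<String> elements, List<String> fragments) {
        this.source = source;
        this.key = key;
        this.elements = Collections.unmodifiableList(elements);
        this.fragments = Collections.unmodifiableList(fragments);
    }

    /** Returns a new ID for an element inside the object identified by this ID. */
    PathId createElementID(String elementKey) {
        List<String> extended = new ArrayList<String>(elements);
        extended.add(elementKey);
        return new PathId(source, key, extended, fragments);
    }

    /** Returns a new ID for a fragment (page, section, ...) of this object. */
    PathId createFragmentID(String fragmentName) {
        List<String> extended = new ArrayList<String>(fragments);
        extended.add(fragmentName);
        return new PathId(source, key, elements, extended);
    }

    List<String> getElements() { return elements; }
    List<String> getFragments() { return fragments; }

    @Override
    public String toString() {
        return source + " | " + key + " | elements=" + elements + " | fragments=" + fragments;
    }
}

final class PathIdDemo {
    public static void main(String[] args) {
        PathId zip = new PathId("share", "Archive/oldstuff.zip");
        PathId oldPdf = zip.createElementID("PDF/old.pdf");
        PathId firstPage = oldPdf.createFragmentID("0");
        System.out.println(zip);
        System.out.println(oldPdf);
        System.out.println(firstPage);
    }
}
```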

Similarly, for a mail server as data source "mail" we could have the following ID to refer to an attachment of a mail in folder INBOX. Here, the element key is the index of the MIME part within the message.

<id:ID>
    <id:Source>mail</id:Source>
    <id:Key>INBOX/42</id:Key>
    <id:Element>
      <id:Key>2</id:Key>
    </id:Element>
</id:ID>

A row in a database table with a primary key consisting of columns x and y would be identified like this:

<id:ID>
    <id:Source>db</id:Source>
    <id:Key name="x">0815</id:Key>
    <id:Key name="y">4711</id:Key>
</id:ID>
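
A key with named values, as in the database example above, might be modelled as in the following sketch. `CompositeKey` is a hypothetical class. Only the `NONAME` constant and the `getKey` methods follow the proposed `Key` interface. Storing the values sorted by name is an assumption made here so that two keys with the same name/value pairs are equal regardless of the order in which the values were supplied.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical named-key sketch; values are stored sorted by key name. */
final class CompositeKey {
    static final String NONAME = "__SMILA:unnamedkey__";

    private final SortedMap<String, String> values;

    CompositeKey(Map<String, String> keyValues) {
        // Copy into a sorted map: equality no longer depends on insertion order.
        this.values = Collections.unmodifiableSortedMap(new TreeMap<String, String>(keyValues));
    }

    /** Convenience constructor for a simple key with a single unnamed value. */
    CompositeKey(String singleValue) {
        this(Collections.singletonMap(NONAME, singleValue));
    }

    String getKey(String name) { return values.get(name); }

    String getKey() { return getKey(NONAME); }

    @Override
    public boolean equals(Object other) {
        return other instanceof CompositeKey && values.equals(((CompositeKey) other).values);
    }

    @Override
    public int hashCode() { return values.hashCode(); }
}

final class CompositeKeyDemo {
    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<String, String>();
        row.put("x", "0815");
        row.put("y", "4711");
        CompositeKey key = new CompositeKey(row);
        System.out.println("x = " + key.getKey("x") + ", y = " + key.getKey("y"));
    }
}
```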
