TPTP DMS
Requirements Overview
High Level Structure
This image shows a very high level rendering of the runtime components involved in the data flow through the data management services of TPTP.
The following gives an overview of each component in this block architecture.
Event Source
An event source is nothing more than a data provider to the loader. In TPTP this is often thought to be an agent, but even though that may be where an event originated, the event source in the context of this discussion is the input stream to the loader.
There is an implicit contract between the event source and the loader that the data provided is consumable by the loader. However, since the transport layers available in TPTP are nothing but a transport service, this contract is delegated to the data source pushing data into the transport.
TPTP maintains a set of specifications for event formats it can create, transport and consume. These events will continue to be extended in TPTP as the need arises. Prior to 4.4 events were small and incremental. In 4.4 bulk events will be introduced. In effect, these already exist in the form of a "large" XML stream of already known events, wrapped in some sort of envelope. This is how the current trace file import works. In 4.4 we will make this more scalable from a transport and loader perspective.
The design is not complete but may include the option of simply passing a URI to a batch resource of a known format. Also under consideration is embedding a CSV stream. This is not really an event source specific issue; however, it is mentioned here to prevent any assumption about the structure or format of the data that may arrive through the stream.
Event Parser
The event loader consists of a few simple processing steps. The first is the event parser, the second is the store specific loader. The contract between these two components is a strongly typed data object. In TPTP this object is a simple Java object (pojo). The binding between these two steps is directly controlled by the object created. Once the parser has populated the members of the object, it will invoke the "add yourself" method, which will trigger the loader behaviour. This may migrate to an asynchronous event model in the future, but in 4.4 parsing and loading will be synchronous operations.
The role of the event parser is to instantiate the object(s) needed by the loader and set the members as specified by the input stream. All default content must be provided by the pojo. In the current TPTP implementation, the specific event parser is selected based on the initial part of the event stream. At this time, events are formatted as XML, so the element name is the key used to select the parser. Alternate parsers can be easily registered for XML formatted events, and with some additional work the same could be done for other stream formats.
This framework allows for a complete decoupling of the input stream format from the loader by using the intermediate pojo.
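The parser-to-pojo hand-off described above can be sketched as follows. This is only an illustration under stated assumptions, not the TPTP implementation: the names `LoadableEvent`, `LogEvent`, `LogEventParser` and `EventParserRegistry` are hypothetical, and the XML layer is assumed to have already extracted the element name and its attributes into a map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical contract between parser and loader: a plain object that can add itself.
interface LoadableEvent {
    void addYourself();
}

// Hypothetical event pojo. Default content is supplied here, not by the parser.
class LogEvent implements LoadableEvent {
    // Stand-in for a store specific loader: the events that have added themselves.
    static final List<LogEvent> LOADED = new ArrayList<>();

    String message = "";
    int severity = 0; // illustrative default only

    @Override
    public void addYourself() {
        LOADED.add(this);
    }
}

// Hypothetical parser for LogEvent elements; it only sets what the stream provides.
class LogEventParser {
    static LoadableEvent parse(Map<String, String> attributes) {
        LogEvent event = new LogEvent();
        if (attributes.containsKey("message")) {
            event.message = attributes.get("message");
        }
        if (attributes.containsKey("severity")) {
            event.severity = Integer.parseInt(attributes.get("severity"));
        }
        return event;
    }
}

// Selects a parser by XML element name, then triggers loading synchronously.
class EventParserRegistry {
    private final Map<String, Function<Map<String, String>, LoadableEvent>> parsers = new HashMap<>();

    void register(String elementName, Function<Map<String, String>, LoadableEvent> parser) {
        parsers.put(elementName, parser);
    }

    LoadableEvent parseAndLoad(String elementName, Map<String, String> attributes) {
        Function<Map<String, String>, LoadableEvent> parser = parsers.get(elementName);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for element: " + elementName);
        }
        LoadableEvent event = parser.apply(attributes);
        event.addYourself(); // parsing and loading happen in the same call
        return event;
    }
}
```

A caller would register `LogEventParser::parse` under the element name `"LogEvent"` and pass each incoming element to `parseAndLoad`. Supporting another stream format would mean replacing only the registry key and the parser, since the loader sees nothing but the pojo.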
Store specific loader
The store specific loader takes an input pojo and does whatever is needed to process and, most likely, persist it. The historical TPTP implementation instantiated an EMF model and deferred persistence to the EMF resource serialization. Although this is still of interest for quick in-memory models, it is not by default completely scalable for large volumes of related data.
Going forward, the loader is clearly bound to the storage system and should be optimized for the data store of choice. This implies that the loader needs to be registered independently of the event parser implementation, but still needs to be coupled to it at run time. This registration will be configurable via plug-in extensions, a configuration file or dynamically. The "add yourself" implementation of the pojo will deal with this aspect of the configuration. This approach will maintain compatibility for current extenders of the loader infrastructure.
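One way to picture a loader that is registered independently but resolved by "add yourself" at run time is the sketch below. It is a minimal illustration, assuming a simple in-process registry; `StoreLoader`, `LoaderRegistry`, `TraceEvent` and `InMemoryTraceLoader` are hypothetical names, and the plug-in extension or configuration file mechanism that would populate the registry is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical loader contract: one implementation per data store.
interface StoreLoader<T> {
    void load(T event);
}

// Hypothetical registry. In a real system it would be filled from plug-in
// extensions or a configuration file rather than by direct calls.
final class LoaderRegistry {
    private static final Map<Class<?>, StoreLoader<?>> LOADERS = new ConcurrentHashMap<>();

    private LoaderRegistry() {
    }

    static <T> void register(Class<T> eventType, StoreLoader<? super T> loader) {
        LOADERS.put(eventType, loader);
    }

    @SuppressWarnings("unchecked")
    static <T> StoreLoader<? super T> resolve(Class<T> eventType) {
        StoreLoader<? super T> loader = (StoreLoader<? super T>) LOADERS.get(eventType);
        if (loader == null) {
            throw new IllegalStateException("No loader registered for " + eventType.getName());
        }
        return loader;
    }
}

// Hypothetical event pojo: it knows nothing about the store, only how to find its loader.
class TraceEvent {
    String methodName = "";

    void addYourself() {
        LoaderRegistry.resolve(TraceEvent.class).load(this);
    }
}

// Stand-in for a store specific loader; a relational loader would issue inserts instead.
class InMemoryTraceLoader implements StoreLoader<TraceEvent> {
    final List<TraceEvent> stored = new ArrayList<>();

    @Override
    public void load(TraceEvent event) {
        stored.add(event);
    }
}
```

Because the pojo looks up its loader on every call, swapping the registered loader changes where events go without touching the parser or recompiling the pojo.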
Data store
The data store is just that: a service that will persist and retrieve data on demand and manage a consistent state of that data. An example data store is EMF, which will manage an in-memory data representation and, in the case of TPTP, will also serialize the data into a zipped XML serialization. This format will in fact be very similar, if not identical, to the structures historically used in TPTP. A more scalable example will be a relational database. TPTP will provide a relational database implementation for its models. This includes a set of DDL and contracts of behavior with the store specific loaders and the store specific query system. This implementation is intended to be portable to various RDB implementations; however, as long as the contracts are maintained, alternate schemas can be used that may be more optimal in certain configurations.
The loader contract includes a specific schema for storage as well as a consistent state or context for the loaders to operate in. In the case of EMF this is the object graph itself. In this case, due to the fact that the graph is considered to be all in memory, there is no support for commit scopes or data stability beyond basic synchronization locking done in the object graph itself.
In the case of the relational store, this is the RDB schema. Where possible, updatable views will be provided to isolate the actual tables for insert and update actions. Views will be provided for all supported read patterns. This will allow for some schema evolution and implementation optimizations. It is up to the loader to declare and use commit scopes.
The query and access contract is provided through views in the case of the relational implementation. In the case of EMF, the object graph itself is exposed.
Client
The client is not a part of the DMS but is included here for completeness and to facilitate various use case discussions.
In TPTP the primary client is the Eclipse user interface; however, this will not be supported any differently than a thick client, perhaps implemented in SWT, or a web client that may or may not be an AJAX based implementation. TPTP will extend its current user interface to exploit the DMS, and this includes the Eclipse workbench, RCP applications and BIRT reports.
Client object model
The client object model is nothing more than the name implies, and is not directly affected by DMS. In use cases where DMS is involved, it is assumed that the client has a need to hold some data that has been retrieved via the DMS in memory.
The typical role of the client object model is to hold data of value to the client in a format that is optimal for that client. Since this is obviously not predictable by TPTP in advance, a flexible contract is needed. TPTP will provide an implementation of what are expected to be very common use cases and, based on that experience, will evolve an api for extension. This has been referred to as a registered formatting function but requires further investigation before exposure. In the meantime TPTP will provide a simple pojo and EMF contract between the client and the access api. This means a specific object graph for simple tables, trees and graphs will be used to access returned results from the access apis. At a minimum this will be pojo based, and an EMF format may also be provided. Formatting to other object models or serializations such as XML is conceivable, but not currently planned.
Note that it is not assumed that the client is user based. It is expected to be very normal for a data warehouse to be a consumer of data managed by the TPTP DMS. The contract to the result specific api and optionally the store specific query and access component is unchanged.
Result specific access api
The role of the result specific api is to bind the client object model to the results of storage system specific queries and access. This will be done without exposing the storage specific api itself.
For example, the client may invoke a request for the most recent CPU utilization of a specific CPU, separated by process. The result of such a request, when bound to the TPTP simple pojo client model, would be a simple table of process names and the related percentage utilization values. The signature may look like "Object retrieveCpuUtilization(OutputFormater, MachineFilter, CpuFilter, ProcessFilter)". An alternate signature may be in the context of a previous request that returned a collection of CPUs. For example, "Object retrieveCpuUtilization(OutputFormater, ProcessFilter)" may be an operation on a CPU object. The significance is that the client is only exposed to the concepts it understands, but not to the actual storage or query mechanism. Note that the contract with the client is managed via the implementation of the mapping in the formatter.
The implementation of this example function will in turn use the storage specific data query and access layer. This function understands the concepts of the data being managed and exploits the appropriate access api. Yet the most important role of this implementation is to bind the results of the access request to the formatter.
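The first signature quoted above could be fleshed out roughly as follows. This is a sketch under assumptions, not a TPTP api: only the method name and the four parameter type names come from the text (including the `OutputFormater` spelling), while the methods on those types, `UtilizationRow`, `UtilizationStore`, `CpuUtilizationAccess` and `SimpleTableFormater` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical primitive row as it might come back from the store specific layer.
class UtilizationRow {
    final String machine;
    final int cpu;
    final String process;
    final double percent;

    UtilizationRow(String machine, int cpu, String process, double percent) {
        this.machine = machine;
        this.cpu = cpu;
        this.process = process;
        this.percent = percent;
    }
}

// Type names follow the signature in the text; their methods are assumptions.
interface OutputFormater {
    Object format(List<UtilizationRow> rows);
}

interface MachineFilter {
    boolean accept(String machine);
}

interface CpuFilter {
    boolean accept(int cpu);
}

interface ProcessFilter {
    boolean accept(String process);
}

// Hypothetical stand-in for the store specific query and access layer.
interface UtilizationStore {
    List<UtilizationRow> latestUtilization();
}

// Result specific api: the client sees machines, CPUs and processes, never the store.
class CpuUtilizationAccess {
    private final UtilizationStore store;

    CpuUtilizationAccess(UtilizationStore store) {
        this.store = store;
    }

    Object retrieveCpuUtilization(OutputFormater formater, MachineFilter machines,
                                  CpuFilter cpus, ProcessFilter processes) {
        List<UtilizationRow> matched = new ArrayList<>();
        for (UtilizationRow row : store.latestUtilization()) {
            if (machines.accept(row.machine) && cpus.accept(row.cpu) && processes.accept(row.process)) {
                matched.add(row);
            }
        }
        // The formater decides what object model the client receives.
        return formater.format(matched);
    }
}

// Simple pojo formater: a table of process name to percentage utilization.
class SimpleTableFormater implements OutputFormater {
    @Override
    public Object format(List<UtilizationRow> rows) {
        Map<String, Double> table = new LinkedHashMap<>();
        for (UtilizationRow row : rows) {
            table.put(row.process, row.percent);
        }
        return table;
    }
}
```

In this sketch the filtering is done in memory for brevity; a store-optimized implementation would more plausibly push the filters down into the store specific query.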
Store specific data query/access
The query and access apis are by definition storage system agnostic and are also storage content agnostic. Therefore the request will have the form of a generic query and will return primitive constructs in the initial implementations. This is similar to the popular usage of JDBC, for example; however, since the storage system may not have a direct or even possible JDBC mapping, this api will be more general.
The implementation, however, will be optimized for the storage system. In the initial release we will likely provide a JDBC implementation to target a relational database, and perhaps a solution to support an EMF based storage system.
The results from this component will initially be constrained to simple tables and trees.
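To make the idea of a generic request returning primitive constructs more concrete, here is a minimal sketch. Everything in it is assumed for illustration: `SimpleTable`, `StoreQuery` and `InMemoryStoreQuery` are hypothetical names, and the single equality filter stands in for whatever generic query form is eventually chosen.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical primitive result: column names plus rows of simple values.
final class SimpleTable {
    final List<String> columns;
    final List<List<Object>> rows;

    SimpleTable(List<String> columns, List<List<Object>> rows) {
        this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
        this.rows = Collections.unmodifiableList(new ArrayList<>(rows));
    }
}

// Hypothetical generic query contract: no domain types and no store types in the signature.
interface StoreQuery {
    SimpleTable select(String dataSet, String filterColumn, Object filterValue);
}

// In-memory stand-in for a store specific implementation. A relational
// implementation would translate the same request into a query against a view.
final class InMemoryStoreQuery implements StoreQuery {
    private final Map<String, SimpleTable> dataSets = new HashMap<>();

    void put(String name, SimpleTable table) {
        dataSets.put(name, table);
    }

    @Override
    public SimpleTable select(String dataSet, String filterColumn, Object filterValue) {
        SimpleTable source = dataSets.get(dataSet);
        if (source == null) {
            throw new IllegalArgumentException("Unknown data set: " + dataSet);
        }
        int column = source.columns.indexOf(filterColumn);
        if (column < 0) {
            throw new IllegalArgumentException("Unknown column: " + filterColumn);
        }
        List<List<Object>> matched = new ArrayList<>();
        for (List<Object> row : source.rows) {
            if (Objects.equals(row.get(column), filterValue)) {
                matched.add(row);
            }
        }
        return new SimpleTable(source.columns, matched);
    }
}
```

The point of the sketch is the shape of the contract: callers name a data set and a filter and get back a plain table, so the same calling code could sit on top of a JDBC or an EMF backed implementation.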
Event and schema driven contracts
As can be seen from the previous information, the bulk of DMS is a framework for building domain specific solutions. The domains that TPTP has captured in the past are as follows:
- Trace - stack and heap information captured as graphs and/or statistical counters
- Test - test definition and behavior as well as related execution logs
- Log - logs that have been transformed into the generic common base event format
- Symptoms - pattern matching database used to analyse the log data
- Statistics - generic hierarchy of snapshots of statistical data over time
Each of these domains has a related set of event specifications and loaders, as well as an EMF based model. The EMF model was common to the client model and the storage system. This simple structure works well, but does not enforce a complete separation of concerns or provide a way to scale to large volumes of data. However, much of this structure can and will be carried forward in spirit, if not in implementation.
If we take a simple log event through DMS we can highlight each of the domain specific contracts that will be in place.
- (Event producer) The XML based event can continue to be used.
- (Event parser) The event parser that exists can also continue more or less as it is today. However, the bean used to hold the event data must have an extensible "add yourself" implementation that will more dynamically resolve an implementation.
- (Store specific loader) The current implementation of "add yourself" for the EMF model can continue to be used to populate an EMF store; however, it needs to be configured rather than being compiled in as part of the bean.
  - An RDB implementation will need to be written that leverages JDBC and other standard RDB infrastructure.
- (Data store) The current EMF CBE model can be used for storage in zipped XMI resources.
  - An RDB will have to be put in place to support large scale persistence. Views will be provided for all read access.
- (Store specific query/access api) The current EMF query infrastructure used in TPTP can provide the implementation of the EMF storage specific component.
  - A JDBC based implementation will be provided for the RDB based store.
- (Client) The current TPTP Eclipse workbench can be reused as a client once adjusted to use DMS rather than the current direct leveraging of paging lists etc. Note that this would, at least logically, be a separate instance of the EMF model from the storage model.
- (Client object model) Although not the simplest implementation, the current EMF model can be used as the client object model.
  - A simple pojo model can be used for more direct application clients.
- (Result specific access api) Logically, a new layer of mapping has to be introduced to separate the storage model from the client model. This would, for example, access a list of common base events and provide a formatter to the client types. A formatter for the EMF model would be needed, but when the storage model is the same this may be no-op code.
  - It is likely that the pojo format will be the default minimal structure used to return data from a store to the client formatter. This can facilitate a fast path for the most common use cases.
  - A pojo formatter is most likely needed.
This is a simplistic example, but it shows how the current infrastructure may be reused without constraining alternatives.
Control mechanisms
The control mechanism is not shown in the image; however, it is needed to make the system work and is therefore a critical part of DMS. At run time there is a requirement to communicate asynchronously with the event source. This communication is needed to provide control of the data producer. In TPTP a single object is used to provide both this control and the data stream being fed to the event parsers. This is often referred to as the client binding to the agent, or an agent proxy, since it has the role of encapsulating all transport and other issues involved in dealing with a given data producer (aka agent).
At a minimum this proxy exposes a command channel and a data stream that can be provided to a data processor that controls the various event parsers. There is a basic set of life cycle commands that all agents must be able to handle, such as start, stop, apply filter and get/set property. Although there is no intention to enumerate them here, it is important to recognize that there are minimal requirements.
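The minimal proxy surface described above (a data stream plus the life cycle commands start, stop, apply filter and get/set property) might look like the following sketch. The interface and class names are hypothetical, and the in-memory implementation is an assumption made so the example is self-contained; real transport and asynchronous delivery are deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical agent proxy: one object for both the command channel and the data stream.
interface AgentProxy {
    InputStream getDataStream();

    void start();

    void stop();

    void applyFilter(String filter);

    String getProperty(String name);

    void setProperty(String name, String value);
}

// In-memory stand-in for an agent; a real proxy would hide the transport behind these calls.
class InMemoryAgentProxy implements AgentProxy {
    private final byte[] events;
    private final Map<String, String> properties = new HashMap<>();
    private boolean running;
    private String filter = "";

    InMemoryAgentProxy(String events) {
        this.events = events.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public InputStream getDataStream() {
        // A stopped agent produces no data.
        return new ByteArrayInputStream(running ? events : new byte[0]);
    }

    @Override
    public void start() {
        running = true;
    }

    @Override
    public void stop() {
        running = false;
    }

    @Override
    public void applyFilter(String filter) {
        this.filter = filter;
    }

    @Override
    public String getProperty(String name) {
        return properties.get(name);
    }

    @Override
    public void setProperty(String name, String value) {
        properties.put(name, value);
    }

    boolean isRunning() {
        return running;
    }

    String getFilter() {
        return filter;
    }
}
```

A data processor would call `getDataStream()` and hand the stream to the event parsers, while the client uses the remaining methods to control the producer.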
API issues
local versus remote
Although it is conceivable that each of the components in this structure could be remote, it is not necessary or practical. In TPTP we consider anything not in the same process as being remote; co-existence on the same operating system instance is something that can be optimized but is still remote. A balance has to be maintained, and to that end TPTP will cluster the components in the following groups.
- Monitored system: The system being monitored is of no concern to DMS and is included here for completeness. It is totally up to the data source tier to deal with this space.
- Data source tier: In TPTP this is where the agents or data producers live. These are almost always remote from their main client, which is the data processing tier. Other than the fact that the data source needs to provide the implementation of the "agent proxy", this tier is invisible to DMS.
- Data processing tier: The data processing tier consists of the agent proxy and a data processor that binds the event source data stream to event parsers and, in turn, the storage specific loader.
- Data storage tier: This is the classic server tier of a client-server or 3 tier system and holds nothing but the data store. For example, it contains the running relational database. In the case of an RDB store this is exposed to the store specific loader and query/access components via JDBC. If an EMF store were being used, this tier would normally have to be in process (aka local) to the loader and access components, unless someone provided a remote access library for EMF or the specific models, which is not in scope for TPTP.
- Data consumption tier: The data consumption tier contains the store specific access implementations as well as the result specific apis. As already noted, if the store is an RDB, then access to the store would be via a standard network JDBC connection. On the client model side the interface could well be remote, and this is a motivation for a simple pojo interface that can be consumed in an optimal fashion locally, or easily bound as a server side implementation to a web service.
- Client object model tier: The role of the client model is to hold/cache a transient copy of data for the client to consume as appropriate. The only consideration for DMS is that, particularly in a web application environment, this memory is typically managed in a separate container from any data source, and thus DMS must not assume any locality.
- Client tier: This tier is typically a browser or embedded application environment. As such, it may or may not be remote to the client object model; such decisions are independent of DMS and simply part of the application design.
The interfaces to each of the tiers will have local implementations that can be bound together if they are co-located. One extreme example could see all the tiers coexist in a single Eclipse workbench, where the workbench is monitoring itself and using an embedded RDB or EMF data store. Another example may see a more classic 3 tier system enabled with a browser client, a network RDB data store and the other components embedded in a web application. This may in fact be a typical out of the box configuration for COSMOS.
pojo, JPA, JDO
There is interest in the community in exploiting the newer apis and features of Java 5. The Java Persistence API (JPA) is of particular interest. The various persistence frameworks that are discussed all claim to provide a more or less transparent mapping between relational data and Java object models. In the simple demo scenario there is basically a mapping of a class to a table, and variations of object life cycle support similar to the referential integrity rules of relational technologies. There is also often support for relationships between objects and, in some cases, even bi-directional association consistency.
These frameworks generally operate in the context of a transaction but are basically client models, and leave the management of a connection and transaction scope, as well as synchronization, to the client. Although there is great value in the automation of such mapping and support, there are also some shortcomings that have been the root function of commercial object-relational frameworks, in particular the management of large collections. There is no built-in support for managing large collections or for working within windows of large lists.
One of the core reasons TPTP is investing in DMS is to make a solid move towards the ability to deal with gigabytes of data. The client object model is where what has been called paging or virtual lists have to be managed; otherwise clients simply acknowledge that they are dealing with a subset of the total data and have what is commonly called a view model.
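A paging or virtual list of the kind mentioned above can be sketched as a read-only list that fetches fixed-size windows on demand and keeps only a few of them in memory. This is an illustrative sketch, not TPTP's paging list implementation; `PageSource` and `WindowedList` are hypothetical names, and the least-recently-used eviction policy is an assumption.

```java
import java.util.AbstractList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical access contract: the store can report a size and return one window of rows.
interface PageSource<T> {
    int totalSize();

    List<T> fetch(int offset, int limit);
}

// Read-only virtual list that holds at most maxCachedPages windows in memory.
class WindowedList<T> extends AbstractList<T> {
    private final PageSource<T> source;
    private final int pageSize;
    private final Map<Integer, List<T>> cache;

    WindowedList(PageSource<T> source, int pageSize, int maxCachedPages) {
        if (pageSize <= 0 || maxCachedPages <= 0) {
            throw new IllegalArgumentException("pageSize and maxCachedPages must be positive");
        }
        this.source = source;
        this.pageSize = pageSize;
        // An access-ordered LinkedHashMap evicts the least recently used page.
        this.cache = new LinkedHashMap<Integer, List<T>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, List<T>> eldest) {
                return size() > maxCachedPages;
            }
        };
    }

    @Override
    public T get(int index) {
        if (index < 0 || index >= size()) {
            throw new IndexOutOfBoundsException("Index: " + index);
        }
        int page = index / pageSize;
        List<T> rows = cache.get(page);
        if (rows == null) {
            rows = source.fetch(page * pageSize, pageSize);
            cache.put(page, rows);
        }
        return rows.get(index % pageSize);
    }

    @Override
    public int size() {
        return source.totalSize();
    }
}
```

With a relational store, `fetch` would typically map to a query with an offset and a row limit, so the client can scroll through a very large result while memory use stays bounded by `pageSize * maxCachedPages` rows.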
Yet there is a place for JPA and similar frameworks in the context of DMS. For example, a JPA object graph could be used to wrap a data store and be exposed as the api to be used by the store specific loaders or the store specific access api. If the actual store is an RDB, one of the current implementations may possibly be used. However, in all other cases an implementation would have to be written. The same would be true if JPA were used to develop the client object model.
If an object graph is passed through a remote interface, any framework behavior would normally be difficult to carry forward or replicate. This is why, at this time, TPTP will expose only pojo data graphs at the interfaces that may be exposed remotely.