COSMOS Design 197525
Design Discussion for 197525: Buffering data in the data assembly pipeline
Data Buffering in COSMOS DC
This design document addresses COSMOS Bugzilla enhancement request 197525.
Change History:
Joel Hawkins 9/05/2007 Initial version
Overview
For import scenarios like Log Shredding, it is natural to take a batch-oriented approach to managing data assemblies. Currently, the framework has no support for handling data in batches. The framework must support, and indeed prefer, handling data in bulk where possible. Furthermore, the size of any internal buffers used to implement batch behaviors should be tunable at the assembly level, and a facility must be provided to ensure that no data is left stranded in those buffers.
Related Issues
Link to 521 - insertion of proxy nodes (vectorization vs optimization)
Link to 833 - schema - attribute on context (buffersize)
Implementation Stages and Corporate Use Cases
Support for buffering will be delivered in iteration 6. Timer-based flushing of collecting assemblies will be added in iteration 7 (unless I get enough free time to implement it for iteration 6).
Terminologies/Acronyms
The following terms and acronyms are used throughout this document:
ASSEMBLY: A collection of components that are used by the framework to either route data for storage or perform queries on stored data.
COMPONENT: A discrete element in an assembly. Components are external to the framework.
VECTOR: In this case, a general term to describe data provided either as an array or as some implementation of a Java Collection.
VECTOR PROXY: A "component" injected by the framework in order to support converting to and from a VECTOR or to and from mismatched VECTOR implementations.
Use Cases
Use Case 1. Batch Insertion
The current End 2 End demo of COSMOS involves a component that reads Apache Tomcat logs and converts them into CBEs. The CBEs are then persisted into a Derby database. The CBE source component produces a single CBE at a time. The CBE sink component is modified to consume a collection of CBEs, in addition to a single CBE at a time. The assembly specifies that optimization for the CBESink component is allowed, so the framework builds the assembly such that CBEs are collected into batches and the collection-based store method on the CBESink is invoked.
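The collecting behavior described in this use case can be sketched as follows. This is only an illustration under stated assumptions: BatchCollector and its methods are hypothetical names, not part of the COSMOS framework, and the sketch uses modern Java (java.util.function) for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical scalar-to-vector collector; not part of the COSMOS API.
class BatchCollector<T> {
    private final int bufferSize;
    private final Consumer<List<T>> batchSink;
    private final List<T> buffer = new ArrayList<>();

    BatchCollector(int bufferSize, Consumer<List<T>> batchSink) {
        if (bufferSize < 1) {
            throw new IllegalArgumentException("bufferSize must be >= 1");
        }
        this.bufferSize = bufferSize;
        this.batchSink = batchSink;
    }

    // Called once per item by a source that produces one item at a time.
    void accept(T item) {
        buffer.add(item);
        if (buffer.size() >= bufferSize) {
            flush();
        }
    }

    // Hands any buffered items to the collection-based sink so that
    // nothing is left stranded in the buffer.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(buffer);
        buffer.clear();
        batchSink.accept(batch);
    }
}
```

With a buffer size of 2, three accepted items produce one batch of two immediately, and the third item reaches the sink only when flush() is called.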
Use Case 2. Batch results
Query responses are inherently batch-oriented. However, when intervening filters and/or transformations are involved that may not support batch invocation semantics, the framework will create an assembly such that the batch results are passed serially through the intervening components and re-batched for return. The pipeline will be flushed prior to return so that any objects still held in buffers because the buffersize threshold was not reached are included in the final response.
External Interfaces
The 'flush()' method will be added to the DataCollection interface. This method will be exposed externally as an operation on the DataCollection capability, and will be available internally to any component within the assembly that has access to the context.
@ManagedResourceCapability(namespace=IDataCollectionContext.NAMESPACE)
public abstract interface IDataCollectionContext {
    // ... code removed for clarity
    @ManagedOperation
    void flush();
    // ... code removed for clarity
}
The "buffersize" attribute will be added to the "context" assembly element. This attribute overrides the default buffer size for the assembly. The default buffer size can be set with the System property org.eclipse.cosmos.dc.buffersize; if the property is not set, it is 20.
<complexType name="ContextType">
    <choice>
        <element ref="tns:query"></element>
        <element ref="tns:source"></element>
        <element ref="tns:channel"></element>
        <element ref="tns:multiplex"></element>
    </choice>
    <attribute name="name" type="string" use="required"></attribute>
    <attribute name="id" type="anyURI" use="required"></attribute>
    <attribute name="buffersize" type="int"></attribute>
</complexType>
Here is a sample assembly descriptor with buffersize specified.
<?xml version="1.0" encoding="UTF-8"?>
<!-- * Test configuration for outbound pipeline -->
<cosmos:context
    xmlns:cosmos="http://www.eclipse.org/xmlns/cosmos/1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:test="http://www.eclipse.org/xmlns/cosmos/test/1.0 org/eclipse/cosmos/dc/tests/xml/configs/CosmosTestBinding.xsd "
    xsi:schemaLocation="http://www.eclipse.org/xmlns/cosmos/1.0 CosmosDataCollectionAssembly.xsd "
    name="TestSource"
    id="http://MySpecialURI/TestSource"
    buffersize="100">
    <cosmos:source factory="TestSourceFactory" optimizable="true">
        <test:binding/>
        <cosmos:transform factory="TestTransformFactory" optimizable="false">
            <test:binding/>
        </cosmos:transform>
    </cosmos:source>
</cosmos:context>
Conventions
Framework Implementation Details
Vector Proxy insertion rules:
- vector (array or collection) to scalar
- scalar to vector (array or collection)
- array to collection
- collection to array
Costing of proxies during assembly compilation
- vector (array or collection) to scalar - cost=2 (discourage)
- scalar to vector (array or collection) - cost=-1 (prefer)
- array to collection - cost=1
- collection to array - cost=1
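The cost table above can be expressed as a small lookup. The sketch below is hypothetical: ProxyCost, Shape, and totalCost are illustrative names, and two points are assumptions not stated in this document: that a connection needing no proxy costs 0, and that the costs of individual proxies are summed when comparing candidate assemblies.

```java
// Hypothetical model of the proxy costing rules; not part of the COSMOS API.
class ProxyCost {
    enum Shape { SCALAR, ARRAY, COLLECTION }

    // Cost of the proxy needed to connect a producer emitting `from`
    // to a consumer accepting `to`.
    static int cost(Shape from, Shape to) {
        if (from == to) {
            return 0;   // assumption: no proxy is inserted, so no cost
        }
        if (to == Shape.SCALAR) {
            return 2;   // vector to scalar: discouraged
        }
        if (from == Shape.SCALAR) {
            return -1;  // scalar to vector: preferred
        }
        return 1;       // array to collection, or collection to array
    }

    // Assumption: an assembly's cost is the sum over its connections,
    // where each connection is a {producer output, consumer input} pair.
    static int totalCost(Shape[][] links) {
        int total = 0;
        for (Shape[] link : links) {
            total += cost(link[0], link[1]);
        }
        return total;
    }
}
```

Under these assumptions, connecting a single-item source to a collection-based sink costs -1, so a compiler that minimizes total cost would favor the batched form.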
Batch trigger size
- defaults to 20
- global override via the system property org.eclipse.cosmos.dc.buffersize
- per-assembly override via the buffersize attribute on the context element of the schema
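The precedence among these three settings could be resolved along the following lines. BufferSizeResolver is a hypothetical helper, not framework code; only the property name and the default of 20 come from this document. Integer.getInteger is the standard JDK call for reading an integer system property with a fallback.

```java
// Hypothetical helper showing the lookup order; not part of the COSMOS API.
class BufferSizeResolver {
    static final String PROPERTY = "org.eclipse.cosmos.dc.buffersize";
    static final int DEFAULT_BUFFER_SIZE = 20;

    // contextBufferSize is the value of the buffersize attribute on the
    // context element, or null when the attribute is absent.
    static int resolve(Integer contextBufferSize) {
        if (contextBufferSize != null) {
            return contextBufferSize;   // per-assembly override wins
        }
        // Falls back to the system property, then to the built-in default.
        return Integer.getInteger(PROPERTY, DEFAULT_BUFFER_SIZE);
    }
}
```

For example, starting the JVM with -Dorg.eclipse.cosmos.dc.buffersize=50 would make resolve(null) return 50, while an assembly declaring buffersize="100" would still resolve to 100.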
Flushing the pipeline
- for inbound cases, the pipeline is flushed when the batch limit is hit
- for outbound cases, the pipeline is flushed once the page size has been hit or the result set has been exhausted
- programmatic control of pipeline flushing is exposed in the context interface, annotated as manageable
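For the outbound case, the rule of flushing on a full page or on an exhausted result set can be sketched as below. OutboundRebatcher is a hypothetical name, and the sketch assumes the intervening component is a simple one-item-at-a-time transform; it is not the framework's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical outbound re-batching step; not part of the COSMOS API.
class OutboundRebatcher {
    // Passes each result serially through a scalar-only transform and
    // regroups the outputs into pages of at most pageSize items.
    static <A, B> List<List<B>> rebatch(List<A> results,
                                        Function<A, B> transform,
                                        int pageSize) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be >= 1");
        }
        List<List<B>> pages = new ArrayList<>();
        List<B> page = new ArrayList<>();
        for (A item : results) {
            page.add(transform.apply(item));
            if (page.size() == pageSize) {   // flush: page size hit
                pages.add(page);
                page = new ArrayList<>();
            }
        }
        if (!page.isEmpty()) {               // flush: result set exhausted
            pages.add(page);
        }
        return pages;
    }
}
```

The final flush after the loop is what keeps a partial last page from being dropped when the result count is not a multiple of the page size.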