SMILA/Documentation/Bundle org.eclipse.smila.processing.pipelets.boilerpipe
This page describes the SMILA pipelets provided by bundle org.eclipse.smila.processing.pipelets.boilerpipe.
General
All pipelets in this bundle support the configurable error handling as described in SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation. When used in jobmanager workflows, records causing errors are dropped.
Read Type
- runtime: Parameters are read when processing records. The parameter value can be set per record.
- init: Parameters are read once from the pipelet configuration when the pipelet is initialized. The parameter value cannot be overridden in a record.
org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet
Extracts text from an HTML input using the Boilerpipe library. In contrast to the HtmlToTextPipelet it offers different algorithms for textual content extraction but does not extract HTML metadata.
Configuration
Property | Type | Read Type | Description |
---|---|---|---|
inputType | String : ATTACHMENT, ATTRIBUTE | runtime | Defines whether the HTML input is found in an attachment or in an attribute of the record |
outputType | String : ATTACHMENT, ATTRIBUTE | runtime | Defines whether the plain text should be stored in an attachment or in an attribute of the record |
inputName | String | runtime | Name of attachment or attribute that contains the HTML input |
outputName | String | runtime | Name of attachment or attribute for plain text output |
encodingAttribute | String | runtime | Optional name of the attribute with the encoding of the input attachment. |
defaultEncoding | String | runtime | Optional fallback encoding, if anything else fails. |
maxParserBlocks | integer | runtime | Limits the number of blocks Boilerpipe creates while parsing. The default value is 20,000. Negative values disable the limit. See the note below for additional information. |
filter | Sequence of String | init | A list of Boilerpipe filters to use. Entries may be class names or references to static methods or static variables. The default is de.l3s.boilerpipe.extractors.ArticleExtractor.INSTANCE. Note that BoilerpipeExtractors implement the BoilerpipeFilter interface and are themselves pipelines of BoilerpipeFilters, so you should not combine multiple BoilerpipeExtractors. Also note that some extractors and filters have no default constructor and therefore cannot be referenced by class name in this pipelet. Others have no public constructor but expose a public static instance member, which can be referenced instead. |
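Because filter is a sequence, more than one filter can be listed. The fragment below is a minimal sketch of such a configuration, not a verified one: it assumes that sequences are written with rec:Seq in the pipelet configuration and that the listed filters are applied in the order given. The three filter references are chosen for illustration only and should be checked against the Boilerpipe version bundled with SMILA.

```xml
<proc:configuration>
  <rec:Val key="inputType">ATTACHMENT</rec:Val>
  <rec:Val key="inputName">html</rec:Val>
  <rec:Val key="outputType">ATTRIBUTE</rec:Val>
  <rec:Val key="outputName">text</rec:Val>
  <!-- Assumption: a sequence-valued property is written as rec:Seq with one rec:Val per entry. -->
  <rec:Seq key="filter">
    <!-- Each entry references a public static member, so no default constructor is needed. -->
    <rec:Val>de.l3s.boilerpipe.filters.simple.SimpleBlockFusionProcessor.INSTANCE</rec:Val>
    <rec:Val>de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion.MAX_DISTANCE_1</rec:Val>
    <rec:Val>de.l3s.boilerpipe.filters.english.DensityRulesClassifier.INSTANCE</rec:Val>
  </rec:Seq>
</proc:configuration>
```

Since extractors are already complete filter pipelines, a list like this would normally contain individual filters only, or a single extractor on its own.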
Note on Boilerpipe's extraction process and memory consumption
Boilerpipe is not a generic HTML-to-text converter that works well with every HTML page. It specializes in extracting the relevant portion of text from HTML pages with "typical" layouts. For example, news sites usually have a content area in the middle that holds the actual article, surrounded by areas with navigation links, advertising, header and footer content, etc. On such pages Boilerpipe is quite good at extracting only the actual article text.
The cost of this is that Boilerpipe needs more memory than other HTML parsers, because it builds a special document object model of the HTML page in order to identify the content area. This can cause problems on "non-typical" pages, for example large pages that consist only of a list of links (a quite extreme example is [1]). Such pages often lead to OutOfMemory errors when an unpatched Boilerpipe is used to parse them (8 GB of heap space is not enough to parse the "extreme" example).
To prevent this, we have patched Boilerpipe for SMILA so that an upper limit can be set for the extraction process. The configuration property for this is named "maxParserBlocks". It limits the number of "TextBlock" objects that Boilerpipe creates for its document model. Roughly speaking, one TextBlock represents the content between two consecutive HTML tags. The default value is 20,000, which in our tests was high enough to handle typical news sites or Wikipedia (with a lot of reserve left). If you get incomplete results from processing your HTML pages with the default setting, you may want to increase the value of this property, and you should probably also increase the -Xmx setting for your SMILA VM. To disable the limit (proceed at your own risk), set the property to a negative value.
Patch submitted to Boilerpipe project: [2].
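As an illustration of the setting described above, the following configuration fragment raises the limit for pages that come out truncated with the default. It is only a sketch: the value 50000 is an arbitrary example rather than a recommendation, and the type="long" attribute is an assumption about how integer values are written in the pipelet configuration.

```xml
<proc:configuration>
  <rec:Val key="inputType">ATTACHMENT</rec:Val>
  <rec:Val key="inputName">html</rec:Val>
  <rec:Val key="outputType">ATTRIBUTE</rec:Val>
  <rec:Val key="outputName">text</rec:Val>
  <!-- Example value only; a negative value such as -1 would disable the limit entirely. -->
  <!-- Assumption: numeric values carry a type attribute; check your SMILA version. -->
  <rec:Val key="maxParserBlocks" type="long">50000</rec:Val>
</proc:configuration>
```

Because maxParserBlocks has read type runtime, the value could also be supplied per record instead of in the pipelet configuration. A higher limit usually needs a correspondingly larger -Xmx setting.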
Example
Extract text from the HTML input in attachment "html" into the attribute "text" using the encoding given in attribute "http.encoding" and using the extractor ArticleSentencesExtractor:
    <proc:invokePipelet name="extractText">
      <proc:pipelet class="org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet" />
      <proc:variables input="request" />
      <proc:configuration>
        <rec:Val key="inputType">ATTACHMENT</rec:Val>
        <rec:Val key="inputName">html</rec:Val>
        <rec:Val key="outputType">ATTRIBUTE</rec:Val>
        <rec:Val key="outputName">text</rec:Val>
        <rec:Val key="encodingAttribute">http.encoding</rec:Val>
        <rec:Val key="filter">de.l3s.boilerpipe.extractors.ArticleSentencesExtractor.INSTANCE</rec:Val>
      </proc:configuration>
    </proc:invokePipelet>
The same example but using the simple filter MarkEverythingContentFilter:
    <proc:invokePipelet name="extractText">
      <proc:pipelet class="org.eclipse.smila.processing.pipelets.boilerpipe.BoilerpipePipelet" />
      <proc:variables input="request" />
      <proc:configuration>
        <rec:Val key="inputType">ATTACHMENT</rec:Val>
        <rec:Val key="inputName">html</rec:Val>
        <rec:Val key="outputType">ATTRIBUTE</rec:Val>
        <rec:Val key="outputName">text</rec:Val>
        <rec:Val key="encodingAttribute">http.encoding</rec:Val>
        <rec:Val key="filter">de.l3s.boilerpipe.filters.simple.MarkEverythingContentFilter.INSTANCE</rec:Val>
      </proc:configuration>
    </proc:invokePipelet>