This processor parses a given document and extracts its content in one of several formats.
Configuration
Example configuration in a processor:
{
  "parser": {
    "metadata": true,
    "key": "/input",
    "contentTypeField": "/metadata/content-type",
    "defaultEncoding": "UTF-8",
    "timeout": "PT1M",
    "output": {
      "field": "outputFieldName",
      "toStorage": true
    },
    "extraction": {
      "type": "xpath",
      "xpathQuery": "/xhtml:html/xhtml:body//node()"
    }
  },
  "name": "Tika Processor",
  "active": true,
  "id": "b25f9a02-a8ca-471c-858e-51853c9e76a6",
  "type": "tika-processor"
}
Configuration parameters:
parser.metadata
(Optional, boolean) Whether extraction metadata should be added to the record.
parser.key
(Required, String) Record field where the document can be found.
parser.contentTypeField
(Optional, String) Record field with the content type to use during parsing.
parser.timeout
(Optional, String) The timeout for parsing each record, expressed as an ISO 8601 duration. Defaults to "PT1M", which is 1 minute.
Warning: When a record times out, aborting the parsing operation can take up to 15 additional seconds. Take this into account when choosing the value for this parameter.
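As a rough illustration of the warning above, a parser block meant to finish or abort each record within about one minute could set the timeout to 45 seconds, which leaves room for the up to 15 seconds of abort overhead. The 45-second value is an assumed budget chosen for this sketch, not a recommended setting, and only the parser block is shown.

```json
{
  "parser": {
    "key": "/input",
    "timeout": "PT45S"
  }
}
```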
parser.defaultEncoding
(Optional, String) Encoding to use for the extracted text. Default is "UTF-8".
parser.output.field
(Optional, String) Field where the extracted content should be placed. Default is "tika".
parser.output.toStorage
(Optional, boolean) Whether the output should be written to the binary storage. Default is false.
parser.extraction.type
(Optional, String) Type of extraction to perform. Default is xhtml.
Options are:
xhtml - Extracts document content as xhtml
plain - Extracts content as plain text
xpath - Extracts document content as xhtml and then filters based on the xpath query provided.
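A minimal sketch of a parser block that switches to plain-text extraction might look like the following. It sets only the required parser.key and the extraction type, so every other parameter is assumed to fall back to its documented default (output field "tika", timeout "PT1M", encoding "UTF-8"). The remaining processor fields (name, active, id, type) are omitted here.

```json
{
  "parser": {
    "key": "/input",
    "extraction": {
      "type": "plain"
    }
  }
}
```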
parser.extraction.xpathQuery
(Optional, String) Required if type is set to xpath. Tika will only return sections of html that match this query.
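To illustrate how the query narrows the output, the sketch below restricts extraction to the content of div elements directly under the body. It assumes the processor passes the query to Tika's own XPath matcher, which supports only a small subset of XPath, and the xhtml:div element is chosen purely for illustration. Whether a given document produces such elements depends on the parser Tika selects for it.

```json
{
  "parser": {
    "key": "/input",
    "extraction": {
      "type": "xpath",
      "xpathQuery": "/xhtml:html/xhtml:body/xhtml:div//node()"
    }
  }
}
```

If the query matches nothing in a document, the extracted content for that record would be expected to be empty, so it is worth testing a new query against a representative sample first.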