This processor parses a given document and extracts its content in one of several formats.

Configuration

Example configuration in a processor:

{
  "parser": {
    "metadata": true,
    "key": "/input",
    "contentTypeField": "/metadata/content-type",
    "defaultEncoding": "UTF-8",
    "output": {
      "field": "outputFieldName",
      "toStorage": true
    },
    "extraction": {
      "type": "xpath",
      "xpathQuery": "/xhtml:html/xhtml:body//node()"
    }
  },
  "name": "Tika Processor",
  "active": true,
  "id": "b25f9a02-a8ca-471c-858e-51853c9e76a6",
  "type": "tika-processor"
}

Configuration parameters:

parser.metadata

(Optional, boolean) Whether extraction metadata should be added to the record.

parser.key

(Required, String) Record field where the document to parse can be found.

parser.contentTypeField

(Optional, String) Record field containing the content type to use during parsing.

parser.defaultEncoding

(Optional, String) Encoding to use for the extracted text. Default is "UTF-8".

parser.output.field

(Optional, String) Field where the extracted content should be placed. Default is "tika".

parser.output.toStorage

(Optional, boolean) Whether the output should be written to the binary storage. Default is false.
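
As an illustration of the two output options, the fragment below is a minimal sketch that places the extracted content in a custom record field and leaves binary storage disabled. The field name "extractedContent" is a hypothetical choice, and the remaining processor fields (name, type, and so on) from the full example are omitted for brevity.

```json
{
  "parser": {
    "key": "/input",
    "output": {
      "field": "extractedContent",
      "toStorage": false
    }
  }
}
```

Since false is the documented default for toStorage, that line could presumably be left out with the same effect.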

parser.extraction.type

(Optional, String) Type of extraction to perform. Default is "xhtml".

Options are:

  • xhtml - Extracts the document content as XHTML.
  • plain - Extracts the document content as plain text.
  • xpath - Extracts the document content as XHTML, then keeps only the parts that match the provided XPath query.
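
To show how an extraction type is selected, here is a minimal sketch of a parser block that requests plain text. It assumes every omitted option falls back to the defaults listed on this page (output field "tika", no binary storage), and it leaves out the surrounding processor fields from the full example.

```json
{
  "parser": {
    "key": "/input",
    "extraction": {
      "type": "plain"
    }
  }
}
```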

parser.extraction.xpathQuery

(Optional, String) XPath query used to filter the extracted XHTML. Required if parser.extraction.type is set to xpath. Only the sections of the XHTML that match this query are returned.
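
The sketch below narrows the query from the full example so that only content under paragraph elements is kept. The query itself is an illustrative assumption: this page does not state which XPath subset the processor supports, so an expression like this one should be verified against a sample document before use.

```json
{
  "parser": {
    "key": "/input",
    "extraction": {
      "type": "xpath",
      "xpathQuery": "/xhtml:html/xhtml:body//xhtml:p//node()"
    }
  }
}
```

As in the full example, element names carry the xhtml: prefix because the extracted content is XHTML.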
