...
Example configuration in a processor:
{
  "single": true,
  "sourceField": "text",
  "output": "chunks",
  "chunkerType": "SIMPLE",
  "minChunkSize": 200,
  "chunkExpansion": {
    "prepend": "fieldAtitle",
    "append": "fieldBsuffix",
    "separator": " && "
  },
  "removePunctuation": false,
  "processBlankText": false
}
Configuration parameters:
sourceField
(Optional, String | List) field(s) with the text. Default is "cleanContent". If multiple fields are provided, their values are concatenated, separated by a single space, before chunking.
...
simple
- A simple chunker that splits text at line breaks. Optionally limits the length of the chunk, but still chunks on a line break (thereby "losing" text).
punctuation_paragraph
- A chunker that splits text at line breaks, but only when they are preceded by certain characters. Output from TIKA often uses carriage returns for formatting in the middle of sentences, so this chunker only breaks a chunk when the line break is preceded by something that suggests it is not the middle of a sentence. Optionally limits the length of the chunk, but still chunks on a line break (thereby "losing" text).
tagged_paragraph
- A chunker that finds elements that match a Selector CSS query. For more information on selectors, see the official Jsoup documentation.
overlapping_sentences
- A chunker that splits text at certain HTML tags. Output from TIKA often uses HTML tags for formatting in the middle of sentences. Tags within the chunk are stripped off as well. Splits the text into contiguous, overlapping sentences. Optionally limits the length of the chunk, but still chunks on an HTML tag (thereby "losing" text).
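The punctuation_paragraph behaviour described above can be illustrated with a minimal Python sketch. It is not the processor's implementation: the function name and the set of characters treated as sentence endings are assumptions made for illustration, since the documentation only says "certain characters".

```python
# Hypothetical sketch; SENTENCE_END and the function name are assumptions.
SENTENCE_END = (".", "!", "?", ":")


def punctuation_paragraph_chunks(text, terminators=SENTENCE_END):
    """Break at a line break only when the line before it ends a sentence."""
    chunks, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are simply skipped in this simplified version
        current.append(stripped)
        if stripped.endswith(terminators):
            # The line looks like the end of a sentence: close the chunk here.
            chunks.append(" ".join(current))
            current = []
    if current:
        # Trailing text that never reached a terminator becomes the last chunk.
        chunks.append(" ".join(current))
    return chunks


sample = "Output from TIKA often wraps\nlines mid-sentence.\nA second paragraph!"
chunks = punctuation_paragraph_chunks(sample)
print(chunks)
```

The first line break in the sample is ignored because "wraps" does not end a sentence, so the two wrapped lines stay in a single chunk.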
single
...
(Optional, Boolean) replace punctuation with spaces in the chunked text. Default is true. Not available for tagged_paragraph.
breakOnBlankLine
(Optional, Boolean) break on blank line (only on the punctuation_paragraph chunker). Default is true. Not available for tagged_paragraph.
lineLengthThreshold
(Optional, Int) threshold for the number of characters in the line (only on the punctuation_paragraph chunker). Default is 100. Not available for tagged_paragraph.
htmlTags
(Optional, List, String) HTML tags to use for splitting chunks. Default is empty. Only available for tagged_paragraph.
sentenceWindow
(Optional, Int) Number of sentences per chunk. Only available for overlapping_sentences. Default is 3.
numberOverlappingSentences
(Optional, Int) Number of sentences from the previous chunk that are reused at the start of the next one. Its value must be smaller than the sentenceWindow value. Only available for overlapping_sentences. Default is sentenceWindow-1.
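To make the relationship between sentenceWindow and numberOverlappingSentences concrete, here is a minimal Python sketch of a sliding window over sentences. The function name and the handling of a shorter final window are assumptions; the real processor may treat trailing sentences differently.

```python
def overlapping_windows(sentences, sentence_window=3, overlap=None):
    """Group sentences into windows that share `overlap` sentences."""
    if overlap is None:
        overlap = sentence_window - 1  # mirrors the documented default
    if not 0 <= overlap < sentence_window:
        raise ValueError("overlap must be smaller than sentence_window")
    if len(sentences) <= sentence_window:
        # Assumption: short inputs yield one chunk with everything in it.
        return [list(sentences)] if sentences else []
    step = sentence_window - overlap  # how far each new window advances
    windows = []
    for start in range(0, len(sentences) - overlap, step):
        windows.append(sentences[start:start + sentence_window])
    return windows


sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
default_windows = overlapping_windows(sentences)    # window 3, overlap 2
less_overlap = overlapping_windows(sentences, 3, 1)  # window 3, overlap 1
print(default_windows)
print(less_overlap)
```

With the default overlap of sentenceWindow-1, each window advances by a single sentence, so five sentences produce three chunks. Reducing the overlap to 1 advances two sentences at a time and produces two chunks.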
chunkBySentence
(Optional, Boolean) chunks text by sentence instead of paragraph. A naive approach is used to determine what a sentence is: a sentence ends wherever a period (.), an exclamation mark (!) or a question mark (?) is found. Default is false. Not available for tagged_paragraph.
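The naive sentence detection described for chunkBySentence can be approximated with the following Python sketch. The regular expression and function name are assumptions, not the processor's code; the example also shows why the approach is called naive.

```python
import re


def naive_sentences(text):
    """Cut the text after every run of '.', '!' or '?' characters."""
    # Each match is a run of non-terminators followed by optional terminators.
    pieces = re.findall(r"[^.!?]+[.!?]*", text)
    return [piece.strip() for piece in pieces if piece.strip()]


result = naive_sentences("Dr. Smith arrived. Was he late? No!")
print(result)
```

The abbreviation "Dr." is treated as a complete sentence, which is the kind of false split a purely punctuation-based rule produces.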
processBlankText
(Optional, Boolean) Process blank texts for chunks. Defaults to false. When this flag is set to true, blank texts will produce empty chunks for a given record.
chunkExpansion.prepend
(Optional, String) the field to prepend at the start of every chunk. The text will not be considered for the maxChunkSize
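As an illustration of chunk expansion, the Python sketch below applies the prepend, append and separator settings from the example configuration to a single chunk. The function name, the record layout, and the way the separator joins the parts are assumptions. Expansion is applied after the chunk has been produced, which is one way the added text could stay out of any size limit.

```python
def expand_chunk(chunk, record, prepend=None, append=None, separator=" "):
    """Wrap a finished chunk with values taken from other record fields."""
    parts = []
    if prepend and record.get(prepend):
        parts.append(str(record[prepend]))  # value of the field named by `prepend`
    parts.append(chunk)
    if append and record.get(append):
        parts.append(str(record[append]))  # value of the field named by `append`
    # Assumption: the separator is placed between each of the parts.
    return separator.join(parts)


# Hypothetical record using the field names from the example configuration.
record = {"fieldAtitle": "User Guide", "fieldBsuffix": "v2"}
expanded = expand_chunk(
    "Install the package.",
    record,
    prepend="fieldAtitle",
    append="fieldBsuffix",
    separator=" && ",
)
print(expanded)
```

If a configured field is missing from the record, this sketch leaves the chunk unchanged on that side instead of inserting an empty value.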
...