...
Example configuration in a processor:
{
  "single": true,
  "sourceField": "text",
  "output": "chunks",
  "chunkerType": "SIMPLE",
  "minChunkSize": 200,
  "chunkExpansion": {
    "prepend": "fieldAtitle",
    "append": "fieldBsuffix",
    "separator": " && "
  },
  "removePunctuation": false,
  "processBlankText": false
}
Configuration parameters:
sourceField
(Optional, String | List) Field(s) containing the text. Default is "cleanContent". If multiple fields are provided, their values are concatenated, separated by a space, before chunking.
...
simple
- A simple chunker that splits text at line breaks. Optionally limits the length of the chunk, but still chunks on a line break (thereby "losing" text).
punctuation_paragraph
- A chunker that splits text at line breaks, but only when they are preceded by certain characters. Output from TIKA often uses carriage returns for formatting in the middle of sentences, so this chunker only breaks a chunk when the line break is preceded by a character suggesting that it is not in the middle of a sentence. Optionally limits the length of the chunk, but still chunks on a line break (thereby "losing" text).
tagged_paragraph
- A chunker that splits text at certain HTML tags. Output from TIKA often uses HTML tags for formatting in the middle of sentences. Tags within the chunk are stripped as well. Optionally limits the length of the chunk, but still chunks on an HTML tag (thereby "losing" text).
- Finds elements that match a Selector CSS query. For more information on selectors, see the official Jsoup documentation.
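The difference between the simple and punctuation_paragraph chunkers can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the processor's implementation: the helper names are hypothetical, the set of "end of sentence" characters is assumed, and options such as breakOnBlankLine and lineLengthThreshold are ignored.

```python
# Illustrative sketch only: these helpers are hypothetical and do not
# reproduce the processor's actual code.

# Assumed set of characters that make a line look like the end of a sentence.
SENTENCE_END = (".", "!", "?", ":")


def simple_chunks(text):
    """Split at every line break, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def punctuation_paragraph_chunks(text):
    """Split at a line break only when the line ends like a sentence."""
    chunks, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are ignored in this sketch
        current.append(line)
        if line.endswith(SENTENCE_END):
            chunks.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        chunks.append(" ".join(current))
    return chunks


text = "Tika often wraps lines\nin the middle of a sentence.\nNext paragraph."
print(simple_chunks(text))
print(punctuation_paragraph_chunks(text))
```

With this input, the simple variant yields three chunks (one per line), while the punctuation-aware variant joins the first two lines into a single chunk because the first line does not end with sentence-like punctuation.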
single
(Optional, Boolean) Whether only the first chunk should be processed. Default is false.
...
(Optional, Boolean) Replace punctuation with spaces in the chunked text. Default is true. Not available for tagged_paragraph.
breakOnBlankLine
(Optional, Boolean) Break on a blank line (only for the punctuation_paragraph chunker). Default is true. Not available for tagged_paragraph.
lineLengthThreshold
(Optional, Int) Threshold for the number of characters in a line (only for the punctuation_paragraph chunker). Default is 100. Not available for tagged_paragraph.
htmlTags
(Optional, List of String) HTML tags to use for splitting chunks. Default is empty. Only available for tagged_paragraph.
chunkBySentence
(Optional, Boolean) Chunk text by sentence instead of by paragraph. A naive approach determines sentence boundaries: a sentence ends wherever a period (.), an exclamation mark (!), or a question mark (?) is found. Default is false. Not available for tagged_paragraph.
processBlankText
(Optional, Boolean) Process blank texts for chunks. Default is false. When this flag is set to true, blank texts produce empty chunks for a given record.
chunkExpansion.prepend
(Optional, String) The field whose value is prepended to the start of every chunk. The prepended text is not counted toward maxChunkSize.
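As a rough illustration of chunk expansion, the Python sketch below applies the chunkExpansion block from the example configuration to one chunk. It rests on assumptions not confirmed here: that append names a field in the same way prepend does, that separator is placed between each added value and the chunk, and that missing fields are skipped. The function name and the sample record are hypothetical.

```python
def expand_chunk(chunk, record, expansion):
    """Hypothetical helper: wrap a chunk with values from other fields."""
    # Assumption: the separator joins the prepended value, the chunk, and
    # the appended value; a single space is used if none is configured.
    separator = expansion.get("separator", " ")
    parts = []

    prepend_field = expansion.get("prepend")
    if prepend_field and record.get(prepend_field):
        parts.append(record[prepend_field])

    parts.append(chunk)

    append_field = expansion.get("append")
    if append_field and record.get(append_field):
        parts.append(record[append_field])

    return separator.join(parts)


# Sample record invented for illustration; field names come from the
# example configuration.
record = {"fieldAtitle": "User Guide", "fieldBsuffix": "v2"}
expansion = {"prepend": "fieldAtitle", "append": "fieldBsuffix", "separator": " && "}
print(expand_chunk("Install the package.", record, expansion))
```

Because the expansion is applied after the chunk has been produced, it does not affect how the chunk's own length is measured in this sketch.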
...