...
simple
- A simple chunker that splits text at line breaks. Optionally limits the length of the chunk, but still chunks on a line break (thereby "losing" text).
punctuation_paragraph
- A chunker that splits text at line breaks, but only when they are preceded by certain characters. Output from Tika often contains line breaks used for formatting in the middle of sentences. This chunker only ends a chunk when the line break is preceded by a character that suggests it is not in the middle of a sentence. Optionally limits the length of the chunk, but still chunks on a line break (thereby "losing" text).
tagged_paragraph
- A chunker that finds elements matching a CSS selector query. For more information on selectors, see the official Jsoup documentation.
overlapping_sentences
- A chunker that splits the text into contiguous, overlapping groups of sentences. Optionally limits the length of the chunk (thereby "losing" text).
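The difference between simple and punctuation_paragraph can be illustrated with a minimal Python sketch. This is not the component's implementation: the function names and the set of sentence-ending characters in SENTENCE_END are assumptions made for illustration (the document does not list the actual characters), and the optional length limit is left out.

```python
# Assumption: characters that suggest a sentence has ended. The real
# chunker's character set is not stated in this document.
SENTENCE_END = ".!?:"


def simple_chunks(text):
    """Split at every line break, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def punctuation_paragraph_chunks(text):
    """Split at a line break only when the line ends with a character
    from SENTENCE_END; otherwise join the line with the next one."""
    chunks = []
    pending = []  # lines collected for the chunk being built
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        pending.append(stripped)
        if stripped[-1] in SENTENCE_END:
            chunks.append(" ".join(pending))
            pending = []
    if pending:  # trailing text without closing punctuation
        chunks.append(" ".join(pending))
    return chunks


sample = "This sentence was wrapped\nby the extractor.\nSecond paragraph here."
print(simple_chunks(sample))
print(punctuation_paragraph_chunks(sample))
```

With this input, the simple variant yields three chunks, one per line, while the punctuation-aware variant rejoins the wrapped sentence and yields two.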
single
(Optional, Boolean) Whether only the first chunk should be processed. Default is false.
...
(Optional, List, String) HTML tags to use for splitting chunks. Default is empty. Only available for tagged_paragraph.
sentenceWindow
(Optional, Int) Number of sentences per chunk. Only available for overlapping_sentences. Default is 3.
numberOverlappingSentences
(Optional, Int) Number of sentences from the previous chunk that are reused at the start of the next one. Must be smaller than sentenceWindow. Only available for overlapping_sentences. Default is sentenceWindow-1.
chunkBySentence
(Optional, Boolean) Chunks text by sentence instead of by paragraph. Sentence boundaries are detected naively: a sentence ends wherever a period (.), an exclamation mark (!), or a question mark (?) is found. Default is false. Not available for tagged_paragraph.
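How sentenceWindow and numberOverlappingSentences interact can be sketched in Python as follows. This is an illustration under stated assumptions, not the component's code: the function names are hypothetical, the sentence splitter applies the naive rule described for chunkBySentence (whether overlapping_sentences uses the same splitter is an assumption), and the window is assumed to advance by sentenceWindow minus numberOverlappingSentences sentences at each step.

```python
import re


def split_sentences(text):
    """Naive split: a sentence ends at '.', '!' or '?'."""
    parts = re.findall(r"[^.!?]+[.!?]*", text)
    return [p.strip() for p in parts if p.strip()]


def overlapping_sentence_chunks(text, sentence_window=3, overlap=None):
    """Group sentences into windows that share `overlap` sentences."""
    if overlap is None:
        overlap = sentence_window - 1  # documented default
    if not 0 <= overlap < sentence_window:
        raise ValueError("overlap must be smaller than sentence_window")

    sentences = split_sentences(text)
    step = sentence_window - overlap  # assumed stride between windows
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentence_window]))
        if start + sentence_window >= len(sentences):
            break  # the last sentence is already covered
    return chunks


text = "One. Two! Three? Four. Five."
print(overlapping_sentence_chunks(text))             # default overlap of 2
print(overlapping_sentence_chunks(text, overlap=1))  # windows share 1 sentence
```

With the defaults, five sentences produce three chunks, each shifted by one sentence. With an overlap of 1, the same text produces two chunks that share only the sentence "Three?".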
...