Chunk Processor

This processor takes the input text from a field and creates a multi-valued output field containing chunks of that text, split according to the chunker type and the configuration parameters. It is useful when you need to send batches and smaller texts to processors such as BERT.

Configuration

Example configuration in a processor:

{
  "single": true,
  "sourceField": "text",
  "output": "chunks",
  "chunkerType": "SIMPLE",
  "minChunkSize": 200,
  "chunkExpansion": {
    "prepend": "title",
    "append": "suffix",
    "separator": " && "
  },
  "removePunctuation": false,
  "processBlankText": false
}

Configuration parameters:

sourceField

(Optional, String | List) field(s) containing the text. Default is "cleanContent". If multiple fields are provided, they are concatenated, using multiSourceFieldSeparator (by default an empty space), before chunking.

multiSourceFieldSeparator

(Optional, String) separator to join multiple source fields. Default is an empty space.
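
As a rough illustration of how several source fields might be combined before chunking, the Python sketch below joins the configured fields with a separator. The function name join_source_fields and the record layout are hypothetical, and skipping fields that are missing from the record is an assumption, not documented behavior.

```python
def join_source_fields(record, source_fields, separator=" "):
    """Join the text of one or more fields into one string (illustrative only)."""
    # Accept a single field name as well as a list, mirroring "String | List".
    if isinstance(source_fields, str):
        source_fields = [source_fields]
    # Assumption: fields that are missing from the record are skipped.
    texts = [str(record[name]) for name in source_fields if name in record]
    return separator.join(texts)


record = {"title": "Chunking", "body": "Split long text into pieces."}
joined = join_source_fields(record, ["title", "body"])
```

Here joined holds the title and body separated by a single space; passing a different separator argument plays the role of multiSourceFieldSeparator.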

output

(Optional, String) field where extracted content should be placed. Default is "bertService".

chunkerType

(Optional, String) the chunker to apply to the text. Default is "SIMPLE".

Options are:

  • simple - A simple chunker that splits text at line breaks. Optionally limits the length of the chunk, but still chunks on a line break (thereby "losing" text).
  • punctuation_paragraph - A chunker that splits text at line breaks, but only when they are preceded by certain characters. Output from TIKA often uses carriage returns for formatting in the middle of sentences. This chunker will only break a chunk if the line break is preceded by something that makes it look like it is not the middle of a sentence. Optionally limits the length of the chunk, but still chunks on a line break (thereby "losing" text).
  • tagged_paragraph - A chunker that finds elements that match a CSS selector query. For more information on selectors, see the official Jsoup documentation.
  • overlapping_sentences - A chunker that splits the text in contiguous overlapped sentences. Optionally limits the length of the chunk (thereby "losing" text).
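
To make the "losing text" remark more concrete, here is a minimal Python sketch of a line-break chunker with an optional length limit. It is not the processor's implementation: the function name simple_chunks is hypothetical, and both the truncation of over-long chunks and the skipping of blank lines are assumptions about how the limit is applied.

```python
def simple_chunks(text, max_chunk_size=0):
    """Split text at line breaks, optionally truncating each chunk (illustrative only)."""
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # Assumption: blank lines produce no chunk.
        if max_chunk_size > 0:
            # Assumption: text beyond the limit is dropped (the "lost" text).
            line = line[:max_chunk_size]
        chunks.append(line)
    return chunks


chunks = simple_chunks("First paragraph.\nSecond, much longer paragraph.", max_chunk_size=16)
```

Under these assumptions the first line fits within the limit and is kept whole, while the second line is cut off after 16 characters and the remainder is discarded.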

single

(Optional, Boolean) if only the first chunk should be processed. Default is false.

maxChunks

(Optional, Int) the maximum number of chunks in a call to the service. Default is 512.

minChunkSize

(Optional, Int) the minimum size of a chunk to send to BERT. Default is 0 (no minimum).

maxChunkSize

(Optional, Int) the maximum size of a chunk to send to BERT. Default is 0 (no maximum).

removePunctuation

(Optional, Boolean) replace punctuation with spaces in the chunked text. Default is true. Not available for tagged_paragraph.

breakOnBlankLine

(Optional, Boolean) break on blank line (only on punctuation_paragraph chunker). Default is true. Not available for tagged_paragraph.

lineLengthThreshold

(Optional, Int) threshold for the characters in the line (only on punctuation_paragraph chunker). Default is 100. Not available for tagged_paragraph.

htmlTags

(Optional, List, String) HTML tags to use for splitting chunks. Default is empty. Only available for tagged_paragraph.

sentenceWindow

(Optional, Int) Number of sentences per chunk. Only available for overlapping_sentences. Default is 3.

numberOverlappingSentences

(Optional, Int) Number of sentences from the previous chunk that are repeated at the start of the next one. Its value must be smaller than the sentenceWindow value. Only available for overlapping_sentences. Default is sentenceWindow-1.
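
The interaction between sentenceWindow and numberOverlappingSentences can be sketched as a sliding window over a list of sentences. The Python code below is only an illustration: the function name overlapping_windows is hypothetical, and the handling of a shorter final window is an assumption.

```python
def overlapping_windows(sentences, sentence_window=3, overlap=None):
    """Group sentences into windows that share `overlap` sentences (illustrative only)."""
    if overlap is None:
        overlap = sentence_window - 1  # Mirrors the documented default.
    if not 0 <= overlap < sentence_window:
        raise ValueError("overlap must be smaller than sentence_window")
    step = sentence_window - overlap  # How far the window advances each time.
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(sentences[start:start + sentence_window])
        # Stop once the current window already reaches the last sentence.
        if start + sentence_window >= len(sentences):
            break
    return windows


windows = overlapping_windows(["S1", "S2", "S3", "S4", "S5"], sentence_window=3, overlap=1)
```

With a window of 3 and an overlap of 1, the five sentences yield two chunks, S1 to S3 and S3 to S5. With the default overlap of sentenceWindow-1 the window advances one sentence at a time, giving three chunks.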

chunkBySentence

(Optional, Boolean) chunks text by sentence instead of by paragraph. A naive approach determines sentence boundaries: a sentence ends whenever a dot (.), an exclamation mark (!), or a question mark (?) is found. Default is false. Not available for tagged_paragraph.
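
A naive splitter of this kind can be approximated with a regular expression. The Python sketch below is an assumption about the general idea, not the processor's actual rule: the function name naive_sentences is hypothetical, and it only splits when the punctuation mark is followed by whitespace.

```python
import re


def naive_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace (illustrative only)."""
    # The lookbehind keeps the punctuation mark attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sentences = naive_sentences("Dr. Smith arrived. Was he late? No!")
```

The example shows why the approach is called naive: the abbreviation "Dr." is treated as a sentence of its own, so texts with many abbreviations or decimal numbers may produce unexpectedly short chunks.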

processBlankText

(Optional, Boolean) Process blank texts for chunks. Defaults to false. When this flag is set to true, blank texts will produce empty chunks for a given record.

chunkExpansion.prepend

(Optional, String) the field whose value is prepended to the start of every chunk. The prepended text does not count toward maxChunkSize.

chunkExpansion.append

(Optional, String) the field whose value is appended to the end of every chunk. The appended text does not count toward maxChunkSize.

chunkExpansion.separator

(Optional, String) the separator used to join any prepended or appended field to the chunk. The separator does not count toward maxChunkSize.
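
The three chunkExpansion settings can be pictured with the short Python sketch below, which reuses the "title", "suffix" and " && " values from the example configuration. The function name expand_chunk and the record contents are hypothetical. The default separator of a single space and the skipping of missing or empty fields are assumptions, since the documentation does not state them.

```python
def expand_chunk(chunk, record, prepend=None, append=None, separator=" "):
    """Surround a chunk with the values of two record fields (illustrative only)."""
    parts = []
    # Assumption: a missing or empty field is skipped together with its separator.
    if prepend and record.get(prepend):
        parts.append(str(record[prepend]))
    parts.append(chunk)
    if append and record.get(append):
        parts.append(str(record[append]))
    return separator.join(parts)


record = {"title": "User Guide", "suffix": "v2"}
expanded = expand_chunk(
    "Install the package.", record, prepend="title", append="suffix", separator=" && "
)
```

In this sketch expanded becomes "User Guide && Install the package. && v2". Because the expansion is applied after chunking, the added text and separators are not part of the length that maxChunkSize limits.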

Input/Output examples

Input

{
  "cleanContent": "Text to process"
}

Output

{
  "bertService": {
    "chunk": "0##0",
    "text": "Text to process"
  }
}

©2024 Pureinsights Technology Corporation