This processor sends chunks of text to BERT as a Service and stores the vector calculated for each chunk.

Because BERT is limited to 512 tokens per input, the processor also provides a chunker that splits the text into multiple parts.

Configuration

Example configuration in a processor:

{
  "servers": [
    {
      "host": "localhost",
      "port": 8125
    }
  ],
  "connectTimeout": 1000,
  "readTimeout": 1000,
  "sourceField": [
    "tika",
    "other"
  ],
  "model": "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
  "multiSourceFieldSeparator": " ",
  "output": "bertService",
  "chunkExpansion": {
    "append": "fieldA",
    "prepend": "fieldB",
    "separator": " - "
  },
  "chunkerType": "SIMPLE",
  "single": true,
  "maxChunks": 10,
  "minChunkSize": 25,
  "maxChunkSize": 100,
  "removePunctuation": true,
  "breakOnBlankLine": true,
  "lineLengthThreshold": 100,
  "htmlTags": [
    "p"
  ],
  "name": "BERT Service Processor",
  "active": true,
  "id": "b25f9a02-a8ca-471c-858e-51853c9e76a6",
  "type": "tika-processor"
}

Configuration parameters:

servers.host

(Required, String) host where BERT as a Service is located.

servers.port

(Required, Int) host port where BERT as a Service is located.

connectTimeout

(Optional, Int) timeout to connect to the server, in milliseconds. Default is 60000 (1 minute).

readTimeout

(Optional, Int) timeout to read from the server, in milliseconds. Default is 60000 (1 minute).

sourceField

(Optional, String/List) field, or list of fields, containing the text. Default is "cleanContent". If multiple fields are provided, their values are joined with multiSourceFieldSeparator before chunking.

model

(Optional, String) name of Hugging Face model used to encode chunks. If not provided, Hugging Face Service uses the default one. If using bert-service as underlying model, this parameter can be ignored.

multiSourceFieldSeparator

(Optional, String) separator used to join multiple source fields. Default is a single space.
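
As an illustration of how sourceField and multiSourceFieldSeparator interact, the following Python sketch joins the configured fields of a document before chunking. The function name join_source_fields is hypothetical, and skipping missing or empty fields is an assumption; the processor's actual handling of such fields is not described here.

```python
def join_source_fields(document, source_fields, separator=" "):
    """Join the values of the configured source fields into one text.

    Assumption: fields that are missing or empty are skipped.
    """
    # sourceField may be a single name or a list of names.
    if isinstance(source_fields, str):
        source_fields = [source_fields]
    values = [str(document[name]) for name in source_fields if document.get(name)]
    return separator.join(values)


doc = {"tika": "First part.", "other": "Second part."}
print(join_source_fields(doc, ["tika", "other"], " "))
```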

output

(Optional, String) field where the processed chunks and their vectors are placed. Default is "bertService".

chunkerType

(Optional, String) the chunker to apply to the text. Default is "SIMPLE".

Options are:

simple - A simple chunker that splits text at line breaks. Optionally limits the length of the chunk, but still chunks on a line break (thereby "losing" text).
punctuation_paragraph - A chunker that splits text at line breaks, but only when they are preceded by certain characters. Output from TIKA often uses carriage returns for formatting in the middle of sentences. This chunker will only break a chunk if the line break is preceded by something that makes it look like it is not the middle of a sentence. Optionally limits the length of the chunk, but still chunks on a line break (thereby "losing" text).
tagged_paragraph - A chunker that splits text at certain HTML tags. Output from TIKA often uses HTML tags for formatting in the middle of sentences. Tags within the chunk are stripped as well. Optionally limits the length of the chunk, but still chunks on an HTML tag (thereby "losing" text).

single

(Optional, Boolean) whether only the first chunk should be processed. Default is false.

chunkerEnabled

(Optional, Boolean) whether the chunker is applied. If set to false, all the input text is processed at once as a single chunk. Default is true.

maxChunks

(Optional, Int) the maximum number of chunks in a call to the service. Default is 512.

minChunkSize

(Optional, Int) the minimum size of a chunk to send to BERT. Default is 0 (no minimum).

maxChunkSize

(Optional, Int) the maximum size of a chunk to send to BERT. Default is 0 (no maximum).
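
To make the effect of minChunkSize and maxChunkSize concrete, here is a minimal Python sketch of a line-break chunker with size limits. It is not the processor's implementation. It assumes that sizes are measured in characters, that a chunk longer than the maximum is truncated (the "lost" text mentioned for the chunkers), and that a chunk shorter than the minimum is dropped. The function name simple_chunks is hypothetical.

```python
def simple_chunks(text, min_chunk_size=0, max_chunk_size=0):
    """Split text at line breaks and apply optional size limits.

    Assumptions: sizes are character counts, 0 disables a limit,
    over-long chunks are truncated and too-short chunks are dropped.
    """
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if max_chunk_size > 0:
            # Text beyond the limit is discarded, not moved to a new chunk.
            line = line[:max_chunk_size]
        if len(line) < min_chunk_size:
            continue
        chunks.append(line)
    return chunks


print(simple_chunks("abc\n\nabcdefghij\nxy", min_chunk_size=3, max_chunk_size=5))
```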

removePunctuation

(Optional, Boolean) replace punctuation with spaces in the chunked text. Default is true.
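
The replacement described for removePunctuation can be sketched in Python as follows. Which characters the processor treats as punctuation is not stated, so this example assumes the ASCII set in string.punctuation; the function name remove_punctuation is hypothetical.

```python
import string

# Assumption: "punctuation" means the ASCII punctuation characters.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def remove_punctuation(chunk):
    """Replace every punctuation character with a single space."""
    return chunk.translate(_PUNCT_TO_SPACE)


print(repr(remove_punctuation("Hello, world!")))
```

Because each character is replaced rather than deleted, the chunk keeps its original length.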

breakOnBlankLine

(Optional, Boolean) break on blank line (only on punctuation_paragraph chunker). Default is true.

lineLengthThreshold

(Optional, Int) threshold for the characters in the line (only on punctuation_paragraph chunker). Default is 100.
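
The idea behind the punctuation_paragraph chunker can be illustrated with a short Python sketch. It is an approximation under stated assumptions: the set of characters that end a chunk (here ".", "!", "?" and ":") is a guess, lineLengthThreshold is not modeled, and the function name punctuation_paragraph_chunks is hypothetical.

```python
def punctuation_paragraph_chunks(text, break_on_blank_line=True, terminators=".!?:"):
    """Merge wrapped lines and break only where a line looks finished.

    Assumption: a line "looks finished" when its last character is in
    `terminators`. The real chunker may use other rules.
    """
    chunks, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line closes the current chunk only if enabled.
            if break_on_blank_line and current:
                chunks.append(" ".join(current))
                current = []
            continue
        current.append(stripped)
        if stripped[-1] in terminators:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "This sentence was wrapped\nby the extractor.\nNext paragraph\n\nTitle"
print(punctuation_paragraph_chunks(sample))
```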

htmlTags

(Optional, List of Strings) HTML tags to use for splitting chunks (only on tagged_paragraph chunker). Default is empty.
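
A rough Python sketch of splitting on the configured htmlTags is shown below. It uses regular expressions rather than a real HTML parser, treats both opening and closing tags as boundaries, and strips any remaining tags from each chunk. These choices and the function name tagged_paragraph_chunks are assumptions made for illustration.

```python
import re

_ANY_TAG = re.compile(r"<[^>]+>")


def tagged_paragraph_chunks(html, html_tags):
    """Split text at the given HTML tags and strip all other tags."""
    if not html_tags:
        text = _ANY_TAG.sub("", html).strip()
        return [text] if text else []
    names = "|".join(re.escape(tag) for tag in html_tags)
    # Matches opening and closing forms, e.g. <p>, <p class="x">, </p>.
    boundary = re.compile(rf"</?(?:{names})\b[^>]*>", re.IGNORECASE)
    chunks = []
    for piece in boundary.split(html):
        text = _ANY_TAG.sub("", piece).strip()
        if text:
            chunks.append(text)
    return chunks


print(tagged_paragraph_chunks("<p>First <b>bold</b> part.</p><p>Second part.</p>", ["p"]))
```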

chunkBySentence

(Optional, Boolean) chunks text by sentence instead of by paragraph. A naive approach is used to detect sentences: a sentence ends wherever a period (.), an exclamation mark (!) or a question mark (?) is found. Default is false.
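
The naive sentence detection described for chunkBySentence can be approximated in Python as follows. This is a sketch of the general technique, not the processor's code, and the function name split_sentences is hypothetical. The second call shows why the approach is called naive: a decimal point is treated as the end of a sentence.

```python
import re

# A run of non-terminator characters, optionally followed by one terminator.
_SENTENCE = re.compile(r"[^.!?]+[.!?]?")


def split_sentences(text):
    """Split text after every '.', '!' or '?' with no further analysis."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


print(split_sentences("Hi there. How are you? Fine!"))
print(split_sentences("Version 3.14 is out."))
```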

chunkExpansion.prepend

(Optional, String) the field whose value is prepended to the start of every chunk. The prepended text does not count toward maxChunkSize.

chunkExpansion.append

(Optional, String) the field whose value is appended to the end of every chunk. The appended text does not count toward maxChunkSize.

chunkExpansion.separator

(Optional, String) the separator used to join any prepended or appended field to the chunk. The separator does not count toward maxChunkSize.
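
The three chunkExpansion settings can be pictured with the following Python sketch, which expands a chunk that has already been cut to size. The function name expand_chunk is hypothetical, and skipping a missing or empty field is an assumption.

```python
def expand_chunk(chunk, document, separator, prepend=None, append=None):
    """Wrap a chunk with the values of the prepend/append fields.

    The chunk is expected to have been sized already, so the added
    text and separators do not count toward maxChunkSize.
    Assumption: a missing or empty field is skipped.
    """
    parts = []
    if prepend and document.get(prepend):
        parts.append(str(document[prepend]))
    parts.append(chunk)
    if append and document.get(append):
        parts.append(str(document[append]))
    return separator.join(parts)


doc = {"fieldA": "tail", "fieldB": "head"}
print(expand_chunk("body", doc, " - ", prepend="fieldB", append="fieldA"))
```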

Input/Output examples

Input

{
  "cleanContent": "Text to process"
}

Output

{
  "bertService": {
    "chunk": "0##0",
    "text": "Text to process",
    "vector": [
      -0.52814084,
      0.21691051,
      -0.13383421,
      ...,
      0.65233213
    ]
  }
}
