This processor sends chunks of text to BERT as a Service and retrieves the corresponding vector for each chunk.
Because BERT is limited to 512 tokens per input, the processor also provides a chunker to split the text into multiple parts.
Configuration
Example configuration in a processor:
{
  "servers": [
    {
      "host": "localhost",
      "port": 8125
    }
  ],
  "connectTimeout": 1000,
  "readTimeout": 1000,
  "sourceField": [
    "tika",
    "other"
  ],
  "multiSourceFieldSeparator": " ",
  "output": "bertService",
  "chunkExpansion": {
    "append": "fieldA",
    "prepend": "fieldB",
    "separator": " - "
  },
  "chunkerType": "SIMPLE",
  "single": true,
  "maxChunks": 10,
  "minChunkSize": 25,
  "maxChunkSize": 100,
  "removePunctuation": true,
  "breakOnBlankLine": true,
  "lineLengthThreshold": 100,
  "htmlTags": [
    "p"
  ],
  "name": "BERT Service Processor",
  "active": true,
  "id": "b25f9a02-a8ca-471c-858e-51853c9e76a6",
  "type": "tika-processor"
}
Configuration parameters:
servers.host
(Required, String) host where BERT as a Service is located.
servers.port
(Required, Int) port on which BERT as a Service listens.
connectTimeout
(Optional, Int) timeout, in milliseconds, for connecting to the server. Default is 60000 (1 minute).
readTimeout
(Optional, Int) timeout, in milliseconds, for reading from the server. Default is 60000 (1 minute).
sourceField
(Optional, String/List) field with the text. Default is "cleanContent". If multiple fields are provided, they are joined with multiSourceFieldSeparator before chunking.
multiSourceFieldSeparator
(Optional, String) separator used to join multiple source fields. Default is a single space.
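As a rough illustration of how sourceField and multiSourceFieldSeparator interact, the following Python sketch joins the listed fields before chunking. The function name join_source_fields is hypothetical, and skipping missing or empty fields is an assumption, not documented behavior of the processor.

```python
def join_source_fields(document, source_fields, separator=" "):
    """Join the text of one or more source fields into a single string."""
    # sourceField may be a single name or a list of names.
    if isinstance(source_fields, str):
        source_fields = [source_fields]
    # Assumption: fields that are missing or empty are skipped.
    parts = [str(document[name]) for name in source_fields if document.get(name)]
    return separator.join(parts)


doc = {"tika": "Body text", "other": "Extra text"}
joined = join_source_fields(doc, ["tika", "other"], " ")
print(joined)
```

With the example configuration above, the chunker would then receive the joined string rather than each field separately.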
output
(Optional, String) field where extracted content should be placed. Default is "bertService".
chunkerType
(Optional, String) the chunker to apply to the text. Default is "SIMPLE".
Options are:
simple
- A simple chunker that splits text at line breaks. Optionally limits the length of the chunk, but still chunks on a line break (thereby "losing" text).
punctuation_paragraph
- A chunker that splits text at line breaks, but only when they are preceded by certain characters. Output from Tika often uses carriage returns for formatting in the middle of sentences. This chunker only breaks a chunk if the line break is preceded by something that suggests it is not in the middle of a sentence. Optionally limits the length of the chunk, but still chunks on a line break (thereby "losing" text).
tagged_paragraph
- A chunker that splits text at certain HTML tags. Output from Tika often uses HTML tags for formatting in the middle of sentences. Tags within the chunk are stripped as well. Optionally limits the length of the chunk, but still chunks on an HTML tag (thereby "losing" text).
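The three chunking strategies described above can be approximated with the Python sketch below. It is an illustration only, not the processor's implementation: the function names, the set of terminator characters, the interpretation of the size limit as a character count, and the regex-based tag handling are all assumptions. Options such as breakOnBlankLine and lineLengthThreshold are not modeled.

```python
import re


def simple_chunks(text, max_chunk_size=0):
    """Split at line breaks; cut each chunk to max_chunk_size characters (0 = no limit)."""
    chunks = [line.strip() for line in text.splitlines() if line.strip()]
    if max_chunk_size > 0:
        # Text past the limit is dropped, not carried into a new chunk.
        chunks = [chunk[:max_chunk_size] for chunk in chunks]
    return chunks


def punctuation_paragraph_chunks(text, terminators=".!?:"):
    """Close a chunk at a line break only if the line ends with a terminator."""
    chunks, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        current.append(line)
        if line[-1] in terminators:
            chunks.append(" ".join(current))
            current = []
    if current:  # trailing text without a terminator
        chunks.append(" ".join(current))
    return chunks


def tagged_paragraph_chunks(html, tags=("p",)):
    """Use the content of each listed tag as a chunk and strip nested tags."""
    names = "|".join(re.escape(tag) for tag in tags)
    pattern = re.compile(rf"<({names})\b[^>]*>(.*?)</\1\s*>", re.DOTALL | re.IGNORECASE)
    chunks = []
    for match in pattern.finditer(html):
        inner = re.sub(r"<[^>]+>", "", match.group(2)).strip()
        if inner:
            chunks.append(inner)
    return chunks


wrapped = "This sentence is wrapped\nacross two lines.\nNext paragraph."
print(simple_chunks(wrapped))
print(punctuation_paragraph_chunks(wrapped))
print(tagged_paragraph_chunks("<p>One <b>bold</b> word</p><p>Two</p>"))
```

The comparison on the wrapped sample shows why punctuation_paragraph exists: the simple strategy treats the mid-sentence line break as a chunk boundary, while the punctuation-aware one keeps the sentence together.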
single
(Optional, Boolean) if only the first chunk should be processed. Default is false.
maxChunks
(Optional, Int) the maximum number of chunks in a call to the service. Default is 512.
minChunkSize
(Optional, Int) the minimum size of a chunk to send to BERT. Default is 0 (no minimum).
maxChunkSize
(Optional, Int) the maximum size of a chunk to send to BERT. Default is 0 (no maximum).
removePunctuation
(Optional, Boolean) replace punctuation with spaces in the chunked text. Default is true.
breakOnBlankLine
(Optional, Boolean) break on blank line (only on punctuation_paragraph chunker). Default is true.
lineLengthThreshold
(Optional, Int) threshold for the characters in the line (only on punctuation_paragraph chunker). Default is 100.
htmlTags
(Optional, List, String) HTML tags to use for splitting chunks (only on tagged_paragraph). Default is empty.
chunkBySentence
(Optional, Boolean) chunks text by sentence instead of paragraph. Sentence boundaries are detected naively: a sentence ends wherever a dot (.), an exclamation mark (!) or a question mark (?) is found. Default is false.
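To make the naive sentence detection concrete, here is a minimal Python sketch. The function name split_sentences is hypothetical, and requiring whitespace after the punctuation mark is an assumption added so that numbers such as 1.5 stay intact; the processor's actual rule may be simpler.

```python
import re


def split_sentences(text):
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


print(split_sentences("Is it done? Yes. Version 1.5 shipped!"))
# Abbreviations show the limits of the approach: "Dr." is treated as a sentence.
print(split_sentences("Dr. Smith arrived."))
```

Because abbreviations and similar cases are split incorrectly, sentence chunking of this kind works best on clean running prose.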
chunkExpansion.prepend
(Optional, String) the field to prepend at the start of every chunk. The prepended text does not count toward maxChunkSize.
chunkExpansion.append
(Optional, String) the field to append at the end of every chunk. The appended text does not count toward maxChunkSize.
chunkExpansion.separator
(Optional, String) the separator used to join any prepended/appended field to the chunk. The separator does not count toward maxChunkSize.
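The effect of chunkExpansion can be pictured with the short Python sketch below. It is a hedged illustration: expand_chunk is a hypothetical helper, the default separator and the handling of missing fields are assumptions, and the expansion is applied after chunking so that it stays outside the maxChunkSize limit, as the parameter descriptions state.

```python
def expand_chunk(chunk, document, prepend=None, append=None, separator=" "):
    """Surround a chunk with the values of the prepend/append fields."""
    parts = []
    # Assumption: a configured field that is missing or empty is ignored.
    if prepend and document.get(prepend):
        parts.append(str(document[prepend]))
    parts.append(chunk)
    if append and document.get(append):
        parts.append(str(document[append]))
    return separator.join(parts)


doc = {"fieldA": "Appendix", "fieldB": "Title"}
expanded = expand_chunk("body", doc, prepend="fieldB", append="fieldA", separator=" - ")
print(expanded)
```

With the example configuration (prepend "fieldB", append "fieldA", separator " - "), every chunk sent to the service would carry the same leading and trailing context.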
Input/Output examples
Input
{
  "cleanContent": "Text to process"
}
Output
{
  "bertService": {
    "chunk": "0##0",
    "text": "Text to process",
    "vector": [
      -0.52814084,
      0.21691051,
      -0.13383421,
      ...,
      0.65233213
    ]
  }
}