Bert Train Processor

This processor uses the text it receives to add new vocabulary and train the designated model. To do so, it relies on the HuggingFace Service.

Training a model takes a considerable amount of time, so expect this processor to take a long time to execute.

Configuration

Example configuration in a processor:

{
  "trainingBatch": 2,
  "trainingBatchSize": 1000,
  "model": "C:\\dev\\model",
  "active": true,
  "type": "bert-train-processor",
  "sourceField": "text",
  "servers": [
    {
      "port": 8888,
      "host": "localhost"
    }
  ],
  "trainConnectTimeout": "PT30s",
  "trainReadTimeout": "PT5m",
  "timeInterval": "PT10s",
  "name": "BERT trainer",
  "id": "50dcc5e2-1fcd-40bc-82b7-a257b0ec38ed"
}

Configuration parameters:

trainingBatch - Required, Int

Batch to be used while training the model.

trainingBatchSize - Required, Int

Maximum amount of data to send to the HuggingFace Service.

model - Required, String

Path to the model to train. Note that the model at the specified path will be overwritten by the trained model.

sourceField - Required, String

Field containing the text that will be used to add vocabulary and train.

servers.host - Required, String

Host where the HuggingFace Service is located.

servers.port - Required, Int

Port on which the HuggingFace Service listens.

trainConnectTimeout - Optional, String

Timeout for connecting to the server, expressed in Duration format (for example, PT30s).

trainReadTimeout - Required, String

Timeout for reading from the server while waiting for training to finish, expressed in Duration format (for example, PT5m).

timeInterval - Required, String

Time the component waits between checks of whether training has finished, expressed in Duration format (for example, PT10s).
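
The required parameters and duration values described above can be checked before a configuration is deployed. The following is a minimal Python sketch and not part of the processor itself. It assumes the duration strings follow the ISO-8601 style seen in the example configuration (PT30s, PT5m) and handles only hours, minutes, and seconds. The names duration_seconds and check_config are hypothetical helpers introduced for illustration.

```python
import re

# Assumption: durations use the ISO-8601 style shown in the example
# configuration ("PT30s", "PT5m"). Only hours, minutes and seconds are handled.
_DURATION = re.compile(
    r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$",
    re.IGNORECASE,
)

# Parameters documented as Required in the parameter list.
REQUIRED_KEYS = (
    "trainingBatch", "trainingBatchSize", "model", "sourceField",
    "servers", "trainReadTimeout", "timeInterval",
)
DURATION_KEYS = ("trainConnectTimeout", "trainReadTimeout", "timeInterval")


def duration_seconds(text):
    """Convert a duration such as 'PT5m' to seconds (hypothetical helper)."""
    match = _DURATION.match(text)
    # Reject strings that do not match, and a bare "PT" with no components.
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def check_config(config):
    """Return a list of problems found in a processor configuration dict."""
    problems = [
        f"missing required key: {key}"
        for key in REQUIRED_KEYS
        if key not in config
    ]
    # Every server entry needs both a host and a port.
    for server in config.get("servers", []):
        for key in ("host", "port"):
            if key not in server:
                problems.append(f"server entry missing: {key}")
    # Duration fields are validated only when present.
    for key in DURATION_KEYS:
        if key in config:
            try:
                duration_seconds(config[key])
            except ValueError as err:
                problems.append(f"{key}: {err}")
    return problems
```

Loading the example configuration with json.load and passing the resulting dict to check_config returns an empty list, since every required key is present and all three durations parse.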

Example Output

After training, the following files in the specified model path should be updated:

added_tokens.json
config.json
pytorch_model.bin
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.txt
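
One way to confirm that a training run actually refreshed these files is to compare their modification times against a timestamp recorded before the processor started. The following is a minimal Python sketch under that assumption. The function name stale_or_missing and the started_at parameter are hypothetical and not part of the processor.

```python
from pathlib import Path

# File names taken from the list of expected outputs.
EXPECTED_FILES = (
    "added_tokens.json",
    "config.json",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.txt",
)


def stale_or_missing(model_dir, started_at):
    """Return expected files that are absent or older than started_at.

    started_at is a POSIX timestamp (for example from time.time())
    recorded before the processor ran.
    """
    root = Path(model_dir)
    result = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file() or path.stat().st_mtime < started_at:
            result.append(name)
    return result
```

An empty result means every listed file exists and was modified at or after the recorded start time. Any name in the result points to a file that was not rewritten.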