
OCR Processor

This processor uses Tesseract to perform Optical Character Recognition (OCR) on file content stored in a record.

The underlying library provides OCR support for:

  • TIFF, JPEG, GIF, PNG, and BMP image formats
  • Multi-page TIFF images
  • PDF document format

Note: Of these formats, this processor supports only PDF files.

Configuration

Example configuration in a processor:

{
  "config": {
    "parser": {
      "key": "/file",
      "runOcrKey": "/shouldOcrRun",
      "timeout": "10s",
      "output": {
        "field": "otherField"
      }
    }
  },
  "name": "OCR Processor",
  "active": true,
  "id": "b25f9a02-a8ca-471c-858e-51853c9e76a6",
  "type": "ocr-processor"
}

Configuration parameters:

parser.key

(Required, String) Record field that contains the binary data to process.

parser.runOcrKey

(Optional, String) Record field that indicates whether OCR processing should run. If the field is not present, OCR processing runs for the record.

parser.timeout

(Optional, String) Maximum time to wait for a request to Tesseract before it times out. Defaults to 30s.

parser.output.field

(Optional, String) Record field in which the output of the OCR process is stored.
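
Because parser.key is the only required parameter, a minimal configuration could look like the sketch below. It assumes that the processor accepts a configuration with the optional parameters left out. In that case, per the descriptions above, OCR runs for every record and the 30s default timeout applies. The name, id, and type values are copied from the example configuration and are placeholders here.

```json
{
  "config": {
    "parser": {
      "key": "/file"
    }
  },
  "name": "OCR Processor",
  "active": true,
  "id": "b25f9a02-a8ca-471c-858e-51853c9e76a6",
  "type": "ocr-processor"
}
```

Add parser.runOcrKey, parser.timeout, or parser.output.field back only when the defaults do not fit the pipeline.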

Input/Output examples

Input

{
  "key": "file",
  "runOcrKey": "runOcr",
  "timeout": "40s",
  "output": {
    "field": "otherField"
  }
}

Output

{
  "ocr": "extracted text"
}