This processor runs OCR on documents after a Tika processor has processed them.

OCR Processor

This processor uses Tesseract to perform optical character recognition (OCR) on different file types.

The library provides optical character recognition (OCR) support for:

TIFF, JPEG, GIF, PNG, and BMP image formats
Multi-page TIFF images
PDF document format

Note: Although the library handles all of the formats above, this content processor supports PDF files only.

Configuration

Example configuration in a processor:

{
  "config": {
    "parser": {
      "key": "/file",
      "runOcrKey": "/shouldOcrRun",
      "timeout": "10s",
      "output": {
        "field": "otherField"
      }
    }
  },
  "name": "OCR Processor",
  "active": true,
  "id": "b25f9a02-a8ca-471c-858e-51853c9e76a6",
  "type": "ocr-processor"
}

...

(Required, String) Record field where the document's binary data can be found.

parser.runOcrKey

(Optional, String) Record field that indicates whether OCR processing should run. If the field is not present, OCR processing runs for the record.
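
The decision rule described above can be sketched in a few lines of Python. This is only an illustration, not the processor's actual implementation: the helper name `should_run_ocr`, the plain-dict record, the stripping of the leading slash, and the interpretation of the field value as a boolean are all assumptions.

```python
from typing import Optional


def should_run_ocr(record: dict, run_ocr_key: Optional[str]) -> bool:
    """Hypothetical helper: decide whether OCR should run for a record."""
    if run_ocr_key is None:
        # runOcrKey is optional; with no key configured, OCR always runs.
        return True
    # Assumption: a key such as "/shouldOcrRun" names a top-level record field.
    field = run_ocr_key.lstrip("/")
    if field not in record:
        # Documented behavior: if the field is not present, OCR runs.
        return True
    # Assumption: a truthy value means "run OCR", a falsy value means "skip".
    return bool(record[field])
```

With this sketch, a record that lacks the field is processed, while a record that sets the field to `False` is skipped.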

parser.timeout

(Optional, String) Duration after which a request to Tesseract times out. Defaults to 30s.
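
As a rough illustration of how a duration string such as "10s" might be interpreted, the Python sketch below converts it to seconds and falls back to the documented default of 30s. The function name `parse_timeout` and the set of accepted units (`ms`, `s`, `m`, `h`) are assumptions; the examples on this page only show values in seconds.

```python
import re

DEFAULT_TIMEOUT_SECONDS = 30.0  # documented default: 30s

# Assumption: these unit suffixes are accepted; only "s" appears in the docs.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_timeout(value=None):
    """Hypothetical helper: convert a duration string to seconds."""
    if value is None:
        # parser.timeout is optional, so fall back to the default.
        return DEFAULT_TIMEOUT_SECONDS
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", value.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]
```

Rejecting unrecognized strings with an error, rather than silently using the default, is a design choice of this sketch and may differ from what the processor does.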

parser.output.field

(Optional, String) Record field in which the output of the OCR process is stored.
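
The effect of this setting can be pictured with a short Python sketch. The helper name `attach_ocr_output` and the plain-dict record are hypothetical, and the fallback field name `ocr` is an assumption inferred from the Output example on this page, not a documented default.

```python
def attach_ocr_output(record: dict, text: str, output_field: str = "ocr") -> dict:
    """Hypothetical helper: return a copy of the record with the OCR text added."""
    # Shallow copy so the caller's record is not mutated.
    updated = dict(record)
    # Assumption: the configured field names a top-level key in the record.
    updated[output_field] = text
    return updated
```

For example, calling the helper with `output_field="otherField"` stores the extracted text under `otherField` and leaves the other record fields unchanged.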

Input/Output examples

Input

{
  "key": "file",
  "runOcrKey": "runOcr",
  "timeout": "40s",
  "output": {
    "field": "otherField"
  }
}

Output

{
  "ocr": "extracted text"
}

Version	Old Version 15	New Version Current
Changes made by	Continuous Integration [bot]	Continuous Integration [bot]
Saved on	Aug 30, 2023	Feb 16, 2024
