5. Ingestion Processors
After data is extracted from the various content sources, a range of services is available to transform it. The list of processors will grow and evolve as Discovery leverages new and additional data transformation services. Discovery makes significant use of dependable open-source tools, which contributes to its cost effectiveness; the list also gives developers transparency and insight into Discovery's data transformation capabilities. The most notable current ingestion / transformation processors include:
Basic Content Processors
CSV Processor – splits a CSV file into multiple records so that each can be processed individually
Field Mapper – allows simple transformations of processed data (copy fields, join fields, lowercase, uppercase…)
HTML Processor – parses an HTML file and extracts selected subsections by class
JSON Processor – converts a byte-array into its corresponding JSON representation
Keyword Extraction Processor – extracts keywords and key phrases from text
Language Detector – detects the language of a text
NLP Service Processor – performs Entity Recognition, Sentiment Extraction, and Dependency Parsing
OCR Processor – uses Tesseract to extract text from images (see the OCR sketch after this list)
Script Processor – allows custom processing scripts to be configured. Currently supports:
Taxonomy Tagger – probabilistically tags text based on a dictionary or taxonomy (see the tagging sketch after this list)
Tika Processor – uses Apache Tika for content detection and extraction (see the Tika sketch after this list)
Chunk Processor – parses text into chunks of overlapping sentences (see the chunking sketch after this list)
Engine Score Processor – calculates the engine score for a specific query
Split Processor – splits a document into multiple documents
Chemical Tagger Processor – takes input text from a field and extracts chemical entities using Oscar4
BERT Service Processor – uses BERT to vectorize chunks of text
OpenAI (GPT) Processor – uses OpenAI to vectorize chunks of text
Hugging Face Processor – uses Hugging Face models to vectorize chunks of text (see the embedding sketch after this list)
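To make a few of these capabilities more concrete, the short Python sketches below illustrate the kind of transformation some representative processors perform. They are approximations built on common open-source libraries, not Discovery's internal implementations, and all file names, model names, and parameters are assumptions.

OCR. A minimal sketch of Tesseract-based text extraction, assuming the pytesseract and Pillow packages and a local Tesseract installation:

```python
from PIL import Image
import pytesseract

def extract_text_from_image(path, language="eng"):
    """Run Tesseract OCR over an image file and return the recognized text."""
    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=language)

# Hypothetical input file; in Discovery the image would arrive from a connector.
print(extract_text_from_image("scanned_page.png"))
```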
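Taxonomy tagging. A minimal dictionary-lookup sketch of what the Taxonomy Tagger does; the probabilistic weighting of matches applied by the real processor is omitted, and the sample taxonomy is invented:

```python
def tag_text(text, taxonomy):
    """Return the taxonomy labels whose dictionary terms appear in the text."""
    lowered = text.lower()
    return {label for label, terms in taxonomy.items()
            if any(term in lowered for term in terms)}

# Invented two-branch taxonomy for illustration only.
taxonomy = {
    "oncology": ["tumor", "chemotherapy", "metastasis"],
    "cardiology": ["arrhythmia", "stent", "myocardial"],
}
print(tag_text("The patient began chemotherapy after the tumor was found.", taxonomy))
# -> {'oncology'}
```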
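Content extraction. A minimal sketch of content detection and extraction with Apache Tika, using the tika-python client (which calls a local Tika server); the input file name is an assumption:

```python
from tika import parser

parsed = parser.from_file("report.pdf")
print(parsed["metadata"].get("Content-Type"))   # detected MIME type
print((parsed["content"] or "")[:500])          # first 500 characters of extracted text
```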
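Chunking. A minimal sketch of splitting text into chunks of overlapping sentences so that adjacent chunks share context; the sentence splitter and window sizes are simplified assumptions:

```python
import re

def chunk_sentences(text, sentences_per_chunk=5, overlap=2):
    """Group consecutive sentences into chunks, with adjacent chunks sharing
    `overlap` sentences so context carries across chunk boundaries."""
    # Naive sentence splitter; a production pipeline would use a real tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = max(1, sentences_per_chunk - overlap)
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break
    return chunks
```

With sentences_per_chunk=5 and overlap=2, each chunk advances three sentences and repeats the last two sentences of the previous chunk.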
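Vectorization. A minimal sketch of vectorizing text chunks with a Hugging Face model via the sentence-transformers package; the model name is an assumption rather than the one Discovery ships with, and the same pattern applies to the BERT Service and OpenAI processors with their respective back ends:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunks = [
    "Discovery splits long documents into overlapping chunks.",
    "Each chunk is vectorized and stored for semantic search.",
]
embeddings = model.encode(chunks)   # one dense vector per chunk
print(embeddings.shape)             # (2, 384) for this model
```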