June 17, 2022

Core v1.0.0

The Core is the root of all products. It centralizes the configuration of the platform so that it can be adapted to any major cloud provider (AWS, Azure, GCP).

Admin API
File storage service
- Allows storage of big files that can be used as input in the configuration of any product
Binary data server
- Intermediate storage to handle the processing of large files
- Can be configured with different providers:
  - Local file system (or NFS)
  - SFTP
Asynchronous message delivery with RabbitMQ

Ingestion v1.0.0

Main controller of the ingestion process. Allows configuration and execution of seeds, pipelines and processors.

Admin API
Pipeline Manager
- Controls the steps to follow during processing
Workflow Manager
- Controls the start and stop of the seed execution
- Allows the execution of scheduled jobs

Connectors v1.0.0

Retrieves data from a given source. Some connectors can work in “scanner mode”, where all documents from the source are analyzed (e.g. all records from a table in an RDBMS). Other connectors can work in “processor mode”, where individual items within a pipeline can be downloaded (e.g. a single document from S3).

Azure Blob Connector: Data from Microsoft Azure Blob Storage
Random Generator Connector: Creates random data for testing
RDBMS Connector: Records from relational databases. Currently supports:
- MySQL
S3 Connector: Data from Amazon S3
Staging Connector: Records from the Discovery Staging Repository
Udemy Connector: Courses from Udemy
URL Connector: Downloads data from specific URLs
Website Connector: Crawls websites based on a starting URL
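
The difference between the two modes can be pictured with a small, purely illustrative Python sketch. The class and method names (`InMemoryConnector`, `scan`, `fetch`) are hypothetical and are not taken from the Discovery API; a dictionary stands in for a real source such as a database table or an S3 bucket.

```python
from typing import Dict, Iterator, Tuple


class InMemoryConnector:
    """Hypothetical connector; a dict stands in for the real data source."""

    def __init__(self, source: Dict[str, bytes]) -> None:
        self._source = source

    def scan(self) -> Iterator[Tuple[str, bytes]]:
        """Scanner mode: walk every item available in the source."""
        for key in sorted(self._source):
            yield key, self._source[key]

    def fetch(self, key: str) -> bytes:
        """Processor mode: download a single item requested by a pipeline."""
        return self._source[key]


connector = InMemoryConnector({"a.txt": b"alpha", "b.txt": b"beta"})
scanned = list(connector.scan())   # every item in the source
single = connector.fetch("b.txt")  # one item, on demand
```

In this sketch a scanner-mode run produces one record per item in the source, while a processor-mode call is made from inside a pipeline for an item that is already known.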

Content Processors v1.0.0

Transforms the data obtained in a previous step.

BERT Service Processor: Uses BERT to vectorize chunks of text.
CSV Processor: Splits a CSV file into multiple records
Field Mapper: Applies simple transformations to the processed data (copy fields, join fields, lowercase, uppercase…)
HTML Processor: Parses an HTML file and extracts selected subsections
JSON Processor: Converts a byte-array into its corresponding JSON representation
Keyword Extractor: Uses YAKE to extract keywords from a text
Language Detector: Detects the language of a text
NLP Service Processor: Uses spaCy to extract the parts-of-speech of a text
OCR Processor: Uses Tesseract to extract text from images
Script Processor: Allows configuring custom processing scripts. Currently supports:
- Groovy
- Python (through Jython)
- JavaScript (through Rhino and Nashorn)
Taxonomy Tagger: Tags a text based on a dictionary
Tika Processor: Uses Apache Tika for content detection
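
As an illustration of the kind of transformations listed for the Field Mapper (copy fields, join fields, lowercase, uppercase), here is a minimal Python sketch. The function names and the dictionary-based record shape are assumptions made for this example; the actual processor is driven by configuration whose format is not described in these notes.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def copy_field(record: Record, source: str, target: str) -> Record:
    """Return a new record with the value of `source` copied into `target`."""
    return {**record, target: record[source]}


def join_fields(record: Record, sources: List[str], target: str,
                separator: str = " ") -> Record:
    """Return a new record where `target` holds the joined source values."""
    joined = separator.join(str(record[name]) for name in sources)
    return {**record, target: joined}


def lowercase(record: Record, field: str) -> Record:
    """Return a new record with `field` converted to lowercase."""
    return {**record, field: str(record[field]).lower()}


def uppercase(record: Record, field: str) -> Record:
    """Return a new record with `field` converted to uppercase."""
    return {**record, field: str(record[field]).upper()}


# Each step takes the output of the previous one, as processors do in a pipeline.
record: Record = {"first": "Ada", "last": "Lovelace"}
record = join_fields(record, ["first", "last"], "full_name")
record = copy_field(record, "full_name", "display_name")
record = lowercase(record, "full_name")
record = uppercase(record, "display_name")
```

Every function returns a new dictionary instead of mutating its input, so a step can be retried or reordered without side effects on earlier records.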

Hydrators v1.0.0

Publishes the processed data.

Elasticsearch Hydrator: Sends the data to an Elasticsearch index
Neo4j Hydrator: Creates the corresponding entities in Neo4j
Staging Hydrator: Sends the data to a bucket in the Discovery Staging Repository

Staging Repository v1.0.0

Abstract representation of an intermediate storage for processed data.

Exposed through HTTP
Supported providers:
- MongoDB
CRUD into specific buckets
Query/filtering
Aggregations for deduplication
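
To make the deduplication idea concrete, the sketch below groups the records of a bucket by a chosen field and reports the groups that contain more than one record. It runs on a plain in-memory list; the record layout, the `checksum` field and the `find_duplicates` helper are hypothetical and do not reflect the repository's HTTP API or how it queries MongoDB.

```python
from collections import defaultdict
from typing import Any, Dict, List


def find_duplicates(bucket: List[Dict[str, Any]], key: str) -> Dict[Any, List[str]]:
    """Group record ids by `key` and keep only groups with more than one record."""
    groups: Dict[Any, List[str]] = defaultdict(list)
    for record in bucket:
        groups[record[key]].append(record["id"])
    return {value: ids for value, ids in groups.items() if len(ids) > 1}


# Hypothetical bucket content: records 1 and 3 share the same checksum.
bucket = [
    {"id": "1", "checksum": "abc"},
    {"id": "2", "checksum": "def"},
    {"id": "3", "checksum": "abc"},
]
duplicates = find_duplicates(bucket, "checksum")
```

Grouping by a content-derived value such as a checksum is only one possible choice; any field that identifies "the same" document could serve as the key.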

Discovery API v1.0.0

Supports the dynamic creation of endpoints by combining different components.

Default Search and Autocomplete endpoints for Elasticsearch
Different configuration modes for the components of an endpoint:
- Sequential
- Parallel endpoints
- Finite-state machine
Query fallbacks
Supported components:
- Elasticsearch requests
- Faceting
- Featured snippets with DistilBERT
- HTTP requests
- Knowledge graph queries with Neo4j
- Language detector
- Parts-of-speech identification with spaCy
- Query snapping for Elasticsearch
- Query vectorization with BERT
- Question detector
- Request logger
- Redirect requests
- Script processor for custom transformations. Currently supports:
  - Groovy
  - Python (through Jython)
  - JavaScript (through Rhino and Nashorn)
- Security filtering for Elasticsearch
- Template-based requests
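
The sequential and parallel ways of combining components can be sketched as follows. This is an assumption-laden illustration, not the Discovery API itself: components are modelled as plain Python functions that read a context dictionary and return the keys they want to add, and the three sample components are simplified stand-ins. The finite-state machine mode is not covered here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Component = Callable[[Context], Context]


def run_sequential(components: List[Component], context: Context) -> Context:
    """Each component receives the context produced by the previous one."""
    for component in components:
        context = {**context, **component(context)}
    return context


def run_parallel(components: List[Component], context: Context) -> Context:
    """All components receive the same input; outputs are merged in list order."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda component: component(context), components))
    merged = dict(context)
    for result in results:
        merged.update(result)
    return merged


# Hypothetical, simplified components.
def normalize_query(context: Context) -> Context:
    return {"query": context["query"].strip().lower()}


def detect_question(context: Context) -> Context:
    return {"question": context["query"].endswith("?")}


def detect_language(context: Context) -> Context:
    return {"language": "en"}  # placeholder: always reports English


sequential_result = run_sequential(
    [normalize_query, detect_question], {"query": "  What is BERT?  "}
)
parallel_result = run_parallel(
    [detect_language, detect_question], {"query": "What is BERT?"}
)
```

Sequential composition suits components that depend on each other's output, such as normalizing a query before analyzing it. Parallel composition suits independent components, at the cost of needing a rule for merging their results; this sketch simply lets later components overwrite earlier ones.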

Search UI v1.0.0

User interface that provides a full search experience with all the features of the platform.

Search
Autocomplete
Pagination
Did you mean?
Query fallbacks
Featured snippet answers
Knowledge graph answers
Details page

Discovery Documentation

Version 1.0.0