4. Ingestion Connectors

Connectors allow Discovery to retrieve data from a given source before content processing begins. A connector first runs in scan mode, detecting updates and changes against the known set of records. Scanning lets the framework keep track of every added, updated and deleted document while keeping resource usage low. Once the scan completes, the records are processed: each “ingestion processor” in the pipeline associated with the content source examines the documents or records in batches. Failed documents or batches of documents are retried automatically to maximize completeness.
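The scan-then-process flow described above can be illustrated with a minimal sketch. It is not Discovery's actual connector API: the names `fingerprint`, `scan` and `process_with_retry`, the use of content hashes to detect updates, and the batch size and retry limit are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict, Iterator, List


def fingerprint(content: str) -> str:
    """Hash a record's content so updates can be detected cheaply (assumed technique)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def scan(source: Dict[str, str], known: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare the source's records (id -> content) with the known fingerprints
    (id -> hash) and report which record ids were added, updated or deleted."""
    added = sorted(rid for rid in source if rid not in known)
    updated = sorted(
        rid for rid in source
        if rid in known and fingerprint(source[rid]) != known[rid]
    )
    deleted = sorted(rid for rid in known if rid not in source)
    return {"added": added, "updated": updated, "deleted": deleted}


def batches(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` record ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def process_with_retry(
    ids: List[str],
    processor: Callable[[List[str]], None],
    batch_size: int = 2,
    max_attempts: int = 3,
) -> List[str]:
    """Run a hypothetical ingestion processor over the ids in batches.
    A failing batch is retried up to `max_attempts` times; ids of batches
    that never succeed are returned so the caller can report them."""
    failed: List[str] = []
    for batch in batches(ids, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                processor(batch)
                break  # batch succeeded, move on to the next one
            except Exception:
                if attempt == max_attempts:
                    failed.extend(batch)
    return failed
```

In this sketch, only the ids reported by `scan` under "added" and "updated" would be passed to `process_with_retry`, while the "deleted" ids would be removed from the target index. A real connector would also persist the fingerprints between runs.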

Discovery currently has connectors for the content sources listed below. The list will continue to grow with future versions of the software, and we expect to support ingestion from the most popular data sources in the most common formats. If a connector is not yet available for a given project, a custom connector can be developed as a service engagement using Discovery’s connector framework.

The current connectors for Discovery represent a wide variety of sources, including:

MongoDB Atlas Connector – data from an existing MongoDB Atlas database
URL Connector – to download content from specific URLs
Website Connector – to crawl a website or group of websites based on a starting URL
Azure Blob Connector – data from Microsoft Azure Blob Storage
RDBMS Connector – data from relational databases (via JDBC)
S3 Connector – data from Amazon S3
Udemy Connector – data from the Udemy online course directory
Elasticsearch Connector – data from one or more indices stored in Elasticsearch
OpenSearch Connector – data from an existing OpenSearch index
LinkedIn Learning Connector – via a third party

In addition, Discovery has special connectors for development purposes:

Random Generator Connector – to create random data for scalability and performance testing purposes
Apache Solr Connector – on customer request

Connectors that ingest data from existing search engine indices are useful when migrating from one search engine to another, or when enriching an existing index without recreating it from scratch.