Ingestion Framework
Discovery was designed with the cloud in mind: it is easy to scale, manage, and integrate with other cloud services. Discovery’s content processing architecture includes connectors that scan and process content from diverse sources during ingestion. Discovery also includes the messaging, pipeline management, orchestration, traceability, and publishing services needed to manage large volumes of data. Because its components run as containerized Kubernetes pods, Discovery can scale up or down as processing workloads change.
The Ingestion Framework is a custom ETL (Extract, Transform, Load) implementation that facilitates the ingestion, cleansing, normalization, augmentation and digital stitching of content from different data sources so this content can be made available through the Discovery API for use in search applications.
The Framework consists of:
Core Components
Admin API
Workflow Manager and Orchestrator
Pipeline Manager
Core Components
The core components of Discovery provide the foundational functions for communication, configuration and administration of the platform. They include:
Binary data server – intermediate storage for handling large files before they are uploaded to the staging repository.
Asynchronous message delivery – handled by RabbitMQ, an open-source, high-performance message broker.
Distributed configuration store – stores pipeline configurations in a distributed data store for performance, failover and fault tolerance.
Distributed traceability store – tracks and stores all actions in Discovery for detailed visibility and analytics.
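The asynchronous hand-off and traceability described above can be sketched in a few lines. This is only an illustration: it uses Python's standard-library queue as an in-process stand-in for a broker such as RabbitMQ, and every name in it (messages, trace_log, record_trace, run_demo) is hypothetical rather than part of Discovery.

```python
import queue
import threading

# Hypothetical in-process stand-in for a message broker such as RabbitMQ.
# It only illustrates the asynchronous hand-off; it is not Discovery's code.
messages = queue.Queue()
trace_log = []                 # stand-in for the distributed traceability store
trace_lock = threading.Lock()
STOP = object()                # sentinel telling the consumer to shut down


def record_trace(action, record_id):
    """Append one traceability entry; a real store would persist it."""
    with trace_lock:
        trace_log.append({"action": action, "record": record_id})


def producer(record_ids):
    """Publish records without waiting for them to be processed."""
    for record_id in record_ids:
        messages.put(record_id)
        record_trace("published", record_id)
    messages.put(STOP)


def consumer(processed):
    """Consume records as they arrive, independently of the producer."""
    while True:
        item = messages.get()
        if item is STOP:
            break
        processed.append(item)
        record_trace("processed", item)


def run_demo(record_ids):
    """Run one producer and one consumer thread and return what was processed."""
    processed = []
    worker = threading.Thread(target=consumer, args=(processed,))
    worker.start()
    producer(record_ids)
    worker.join()
    return processed
```

The producer never blocks on the consumer, which is the property a message broker provides between services. Each published and processed record leaves one entry in the trace log.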
Admin API
A RESTful JSON API that provides configuration of, and complete control over, all Discovery features. Through this API, you can create ingestion entities (pipelines, processors, seeds, etc.) and start, stop, and schedule ingestion processes.
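As a rough sketch of what creating an ingestion entity through a RESTful JSON API might look like, the example below builds a JSON POST request without sending it. The base URL, the resource path, and the payload fields are all assumptions made for illustration; consult the Admin API reference for the real endpoints and schemas.

```python
import json
import urllib.request

# Hypothetical base URL and resource paths; the real Admin API may differ.
BASE_URL = "http://localhost:8080/admin"


def build_create_request(entity_type, body):
    """Build (but do not send) a JSON POST that creates an ingestion entity."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{entity_type}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload: the field names are illustrative only.
pipeline_request = build_create_request(
    "pipeline",
    {"name": "example-pipeline", "processors": ["normalize-fields"]},
)
```

Passing the request to urllib.request.urlopen would send it. That step is left out here because it needs a running server.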
Workflow Manager (WFM) and Orchestrator
This component acts as the “brain” of the ingestion platform. It is responsible for triggering seed executions, monitoring the progress of existing executions, and cleaning up after completed work. The WFM also elastically optimizes the distribution of jobs throughout the cluster. For effective content ingestion management, the Workflow Manager must run at all times in active-active mode.
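One simple way to picture job distribution across a cluster is a greedy "least-loaded worker" assignment, sketched below. This is an assumed strategy chosen for illustration and is not a description of the WFM's actual algorithm. The function name, the job costs, and the worker names are hypothetical.

```python
import heapq


def distribute_jobs(jobs, workers):
    """Assign each job to the currently least-loaded worker.

    `jobs` is a list of (job_id, cost) pairs; `workers` is a list of names.
    Returns a dict mapping each worker name to its list of job ids.
    """
    assignments = {worker: [] for worker in workers}
    # Heap entries are (current_load, worker_name); ties break by name.
    heap = [(0, worker) for worker in workers]
    heapq.heapify(heap)
    for job_id, cost in jobs:
        load, worker = heapq.heappop(heap)
        assignments[worker].append(job_id)
        heapq.heappush(heap, (load + cost, worker))
    return assignments


# One expensive job and three cheap ones spread over two workers.
plan = distribute_jobs([("j1", 5), ("j2", 1), ("j3", 1), ("j4", 1)], ["w1", "w2"])
```

Here the expensive job occupies one worker while the cheap jobs all go to the other, which keeps the total load per worker roughly balanced.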
Pipeline Manager
The Pipeline Manager controls the steps to follow during content processing. It works in cooperation with the Workflow Manager (WFM) to ensure ingestion executions are healthy. The Workflow Manager assigns tasks to the Pipeline Manager, and these tasks are distributed across the cluster for maximum scalability. The Pipeline Manager also performs housekeeping tasks after each job or record in an execution.
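The idea of running ordered steps over each record, followed by housekeeping, can be sketched as below. This is a minimal illustration under assumed names: run_pipeline, the two processors, and the housekeeping callback are hypothetical and do not reflect the Pipeline Manager's real interfaces.

```python
def run_pipeline(records, steps, housekeeping):
    """Apply every step, in order, to each record, then run housekeeping."""
    results = []
    for record in records:
        for step in steps:
            record = step(record)
        results.append(record)
        housekeeping(record)   # runs once after each record is processed
    return results


# Hypothetical processors standing in for cleansing and normalization steps.
def strip_whitespace(record):
    return {key: value.strip() for key, value in record.items()}


def lowercase_title(record):
    return {**record, "title": record["title"].lower()}


cleaned_up = []   # records which titles the housekeeping callback has seen
output = run_pipeline(
    [{"title": "  Hello World "}],
    [strip_whitespace, lowercase_title],
    lambda record: cleaned_up.append(record["title"]),
)
```

Keeping the steps as plain functions in a list makes their order explicit, and the housekeeping callback runs exactly once per record after its last step.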