Ingestion Framework
A custom ETL (Extract, Transform, Load) implementation that ingests and processes content from different data sources so that it can be made available through the Discovery API.
Terms
Record
A simple unit of content. A record usually references one document, one user, or one row in a table.
Seed
A seed is the origin of records. For example, records can come from a relational database, a website, or Amazon S3, among others.
Seed clean up
Seed clean up is a maintenance job. Its purpose is to remove data related to a seed that is no longer needed. The job is triggered when a seed is deleted, and it sends a message to all processors and connectors.
Examples of clean up implementations:
Delete seed scan/failure indexes
Delete the remaining cache on disk
The default clean up implementation does nothing; the class that processes the job exists so that it can be replaced.
The default implementation is located in SeedCleanUpProcessor.
You can replace this default behavior using Micronaut's bean replacement, for example:
Default class
@Singleton
@Named(SeedCleanUpRequest.TASK_NAME + MaintenanceProcessor.BEAN_SUFFIX)
public class SeedCleanUpProcessor implements MaintenanceProcessor<SeedCleanUpRequest> {

    @Override
    public void execute(SeedCleanUpRequest config) {
        // Implement the necessary seed clean up here
    }
}
New class implementing indexes delete
@Replaces(SeedCleanUpProcessor.class)
@Singleton
@Named(SeedCleanUpRequest.TASK_NAME + MaintenanceProcessor.BEAN_SUFFIX)
public class SeedCleanUpIndexesProcessor implements MaintenanceProcessor<SeedCleanUpRequest> {

    @Inject
    protected ElasticsearchSimpleClient elasticsearchSimpleClient;

    @Override
    public void execute(SeedCleanUpRequest config) {
        elasticsearchSimpleClient.deleteIndex(String.format("scan.%s", config.getSeedId()));
        elasticsearchSimpleClient.deleteIndex(String.format("failures.%s", config.getSeedId()));
    }
}
The example above shows how to replace the class SeedCleanUpProcessor with SeedCleanUpIndexesProcessor.
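The other clean up listed earlier, deleting the remaining cache on disk, could follow the same pattern. The sketch below is framework-free so that it runs on its own: the class name SeedCacheCleaner and the directory layout <cacheRoot>/<seedId> are assumptions made for illustration and are not part of the Ingestion Framework. In practice, logic like this would presumably be called from the execute method of a class that replaces SeedCleanUpProcessor.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: removes the on-disk cache of a deleted seed. Not a framework class. */
public class SeedCacheCleaner {

    private final Path cacheRoot;

    public SeedCacheCleaner(Path cacheRoot) {
        this.cacheRoot = cacheRoot.toAbsolutePath().normalize();
    }

    /**
     * Deletes the directory <cacheRoot>/<seedId> (an assumed layout) and everything in it.
     * Returns the number of files and directories removed, or 0 if nothing was cached.
     */
    public int deleteSeedCache(String seedId) {
        Path seedDir = cacheRoot.resolve(seedId).normalize();
        // Refuse seed IDs such as "../other" that would point outside the cache root.
        if (!seedDir.startsWith(cacheRoot) || seedDir.equals(cacheRoot)) {
            throw new IllegalArgumentException("Invalid seed ID: " + seedId);
        }
        if (!Files.exists(seedDir)) {
            return 0;
        }
        try (Stream<Path> paths = Files.walk(seedDir)) {
            // Reverse order puts children before their parent directories.
            List<Path> toDelete = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            for (Path path : toDelete) {
                Files.delete(path);
            }
            return toDelete.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Returning the number of deleted entries makes the job easy to log, and the path check keeps a malformed seed ID from deleting data that belongs to another seed.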
Pipeline
A finite state machine that defines the order in which processing steps are executed on records.
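To make the state machine idea concrete, here is a minimal sketch in plain Java. Every name in it (PipelineSketch, addState, the state names) is hypothetical and is not the framework's API or configuration format; it only illustrates how named states and their transitions can fix the order in which processing steps run on a record.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical sketch of a pipeline as a finite state machine. Not the framework's API. */
public class PipelineSketch {

    /** A state: the processing step to apply, and the state to move to next (null ends the run). */
    private static final class State {
        final Consumer<Map<String, Object>> step;
        final String next;

        State(Consumer<Map<String, Object>> step, String next) {
            this.step = step;
            this.next = next;
        }
    }

    private final Map<String, State> states = new HashMap<>();
    private final String initialState;

    public PipelineSketch(String initialState) {
        this.initialState = initialState;
    }

    public PipelineSketch addState(String name, Consumer<Map<String, Object>> step, String next) {
        states.put(name, new State(step, next));
        return this;
    }

    /** Runs one record through the states in transition order and returns the visited state names. */
    public List<String> run(Map<String, Object> recordData) {
        List<String> visited = new ArrayList<>();
        String current = initialState;
        while (current != null) {
            State state = states.get(current);
            if (state == null) {
                throw new IllegalStateException("Unknown state: " + current);
            }
            if (visited.contains(current)) {
                // This simple sketch only supports linear flows, so a repeated state is an error.
                throw new IllegalStateException("Cycle detected at state: " + current);
            }
            state.step.accept(recordData);
            visited.add(current);
            current = state.next;
        }
        return visited;
    }

    public static void main(String[] args) {
        PipelineSketch pipeline = new PipelineSketch("lowercase")
                .addState("lowercase",
                        r -> r.put("title", String.valueOf(r.get("title")).toLowerCase(Locale.ROOT)),
                        "tag")
                .addState("tag", r -> r.put("tagged", true), null);

        Map<String, Object> data = new HashMap<>();
        data.put("title", "Hello World");
        System.out.println(pipeline.run(data)); // [lowercase, tag]
        System.out.println(data.get("title"));  // hello world
    }
}
```

The order of execution comes from the transitions rather than from the order in which the states were registered, which is what distinguishes a state machine from a plain list of steps.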
Processor
A unit of work that is performed on a record.
Cronjob
Defines a schedule to run a given list of seeds.
Job
A batch of records and unit of work in the PDP.