Ingestion Framework

A custom ETL (Extract, Transform, Load) implementation that ingests and processes content from different data sources so that the content can be made available through the Discovery API.

Terms

Record

The basic unit of content. A record usually represents one document, one user, or one row in a table.

Seed

A seed is the origin of records. It can be, among others, a relational database, a website, or Amazon S3.

Seed clean up

The seed clean up is a maintenance job. Its purpose is to remove data related to a seed that is no longer needed. The job is triggered when a seed is deleted, and it sends a message to all processors and connectors.

Examples of clean up implementations:

  • Delete seed scan/failure indexes
  • Delete any remaining cache on disk

The default clean up implementation does nothing; the class that processes the job exists so that it can be replaced. The default implementation is SeedCleanUpProcessor.

You can replace this default behavior using Micronaut's bean replacement, for example:

Default class

@Singleton
@Named(SeedCleanUpRequest.TASK_NAME + MaintenanceProcessor.BEAN_SUFFIX)
public class SeedCleanUpProcessor implements MaintenanceProcessor<SeedCleanUpRequest> {

  @Override
  public void execute(SeedCleanUpRequest config) {
    // No-op by default; replace this bean to implement the seed clean up
  }
}

New class that deletes the seed indexes

@Replaces(SeedCleanUpProcessor.class)
@Singleton
@Named(SeedCleanUpRequest.TASK_NAME + MaintenanceProcessor.BEAN_SUFFIX)
public class SeedCleanUpIndexesProcessor implements MaintenanceProcessor<SeedCleanUpRequest> {

  @Inject
  protected ElasticsearchSimpleClient elasticsearchSimpleClient;

  @Override
  public void execute(SeedCleanUpRequest config) {
    // Delete the scan and failure indexes that belong to the deleted seed
    elasticsearchSimpleClient.deleteIndex(String.format("scan.%s", config.getSeedId()));
    elasticsearchSimpleClient.deleteIndex(String.format("failures.%s", config.getSeedId()));
  }
}

The example above shows how to replace the class SeedCleanUpProcessor with SeedCleanUpIndexesProcessor.
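
The other clean up example listed earlier, deleting any remaining cache on disk, could be sketched in a similar way. The following is only an illustration in plain Java: the SeedCacheCleaner class and the cache layout (one directory per seed id under a cache root) are assumptions, not part of the framework. A replacement processor could call such a helper from its execute method.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: the layout <cacheRoot>/<seedId> is an assumption for illustration.
class SeedCacheCleaner {

  private final Path cacheRoot;

  SeedCacheCleaner(Path cacheRoot) {
    this.cacheRoot = cacheRoot.normalize();
  }

  /** Deletes the cache directory of the given seed and returns the number of deleted paths. */
  int deleteCache(String seedId) {
    Path seedDir = cacheRoot.resolve(seedId).normalize();
    // Reject seed ids that resolve to the cache root itself or to a path outside of it.
    if (seedDir.equals(cacheRoot) || !seedDir.startsWith(cacheRoot)) {
      throw new IllegalArgumentException("Invalid seed id: " + seedId);
    }
    if (!Files.exists(seedDir)) {
      return 0; // Nothing is cached for this seed.
    }
    try (Stream<Path> paths = Files.walk(seedDir)) {
      // Reverse order puts files and subdirectories before their parent directory.
      List<Path> ordered = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
      for (Path path : ordered) {
        Files.delete(path);
      }
      return ordered.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling deleteCache a second time for the same seed returns 0, so the clean up can be repeated safely if the job is retried.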

Pipeline

A finite state machine that defines the order in which processing steps are executed on records.
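
To make the state machine idea concrete, here is a minimal sketch in plain Java. It is not the framework's API: the PipelineSketch class, the state names, and the use of a Map as the record are all assumptions made for illustration. Each state names the processing step to run and the state to move to next; a null next state ends the pipeline.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: class, state names, and record shape are illustrative assumptions.
class PipelineSketch {

  /** One state: the processing step to run and the state to move to afterwards. */
  static final class Step {
    final UnaryOperator<Map<String, Object>> processor;
    final String nextState; // null marks the final state

    Step(UnaryOperator<Map<String, Object>> processor, String nextState) {
      this.processor = processor;
      this.nextState = nextState;
    }
  }

  private final Map<String, Step> steps = new HashMap<>();
  private final String initialState;

  PipelineSketch(String initialState) {
    this.initialState = initialState;
  }

  PipelineSketch addStep(String state, UnaryOperator<Map<String, Object>> processor, String nextState) {
    steps.put(state, new Step(processor, nextState));
    return this;
  }

  /** Runs the record through the states, starting at the initial state. */
  Map<String, Object> run(Map<String, Object> record) {
    Map<String, Object> current = record;
    String state = initialState;
    int transitions = 0;
    while (state != null) {
      Step step = steps.get(state);
      if (step == null) {
        throw new IllegalStateException("Unknown state: " + state);
      }
      // Each state has a single successor, so more transitions than states means a cycle.
      if (++transitions > steps.size()) {
        throw new IllegalStateException("Cycle detected at state: " + state);
      }
      current = step.processor.apply(current);
      state = step.nextState;
    }
    return current;
  }
}
```

For example, a pipeline with an "extract" state followed by a "trim" state would apply the two steps to each record in that order, regardless of the order in which the steps were registered.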

Processor

A unit of work that is performed on a record.

Cronjob

Defines a schedule to run a given list of seeds.

Job

A batch of records, and the unit of work in the PDP.