Version 1.0.0

June 17, 2022

Core v1.0.0

The Core is the root of all products. It centralizes the configuration of the platform so that it can be adapted to any major cloud provider (AWS, Azure, GCP).

  • Admin API

  • File storage service

    • Allows storage of big files that can be used as input in the configuration of any product

  • Binary data server

    • Intermediate storage to handle the processing of large files

    • Can be configured with different providers:

      • Local file system (or NFS)

      • SFTP

  • Asynchronous message delivery with RabbitMQ

Ingestion v1.0.0

Main controller of the ingestion process. Allows configuration and execution of seeds, pipelines and processors.

  • Admin API

  • Pipeline Manager

    • Controls the steps to follow during processing

  • Workflow Manager

    • Controls the start and stop of the seed execution

    • Allows the execution of scheduled jobs

Connectors v1.0.0

Retrieves data from a given source. Some connectors can work in “scanner mode”, where all documents from the source are analyzed (e.g. all records from a table in an RDBMS). Other connectors can work in “processor mode”, where individual items within a pipeline can be downloaded (e.g. a single document from S3).

  • Azure Blob Connector: Data from Microsoft Azure Blob Storage

  • Random Generator Connector: Creates random data for testing

  • RDBMS Connector: Records from relational databases. Currently supports:

    • MySQL

  • S3 Connector: Data from Amazon S3

  • Staging Connector: Records from the Discovery Staging Repository

  • Udemy Connector: Courses from Udemy

  • URL Connector: Downloads data from specific URLs

  • Website Connector: Crawls websites based on a starting URL
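The difference between scanner mode and processor mode can be illustrated with a minimal sketch. Everything here is hypothetical: `InMemoryConnector`, `scan` and `fetch` are illustrative names, not part of the platform's API, and a plain dictionary stands in for a real source such as a table or a bucket.

```python
from typing import Dict, Iterator


class InMemoryConnector:
    """Hypothetical connector; a dict stands in for the real data source."""

    def __init__(self, source: Dict[str, bytes]) -> None:
        self._source = source

    def scan(self) -> Iterator[Dict[str, object]]:
        """Scanner mode: walk every item available in the source."""
        for item_id in sorted(self._source):
            yield {"id": item_id, "content": self._source[item_id]}

    def fetch(self, item_id: str) -> Dict[str, object]:
        """Processor mode: download one specific item on demand."""
        return {"id": item_id, "content": self._source[item_id]}


connector = InMemoryConnector({"a.txt": b"alpha", "b.txt": b"beta"})

# Scanner mode visits the whole source.
scanned_ids = [record["id"] for record in connector.scan()]

# Processor mode retrieves a single item from within a pipeline step.
single = connector.fetch("b.txt")
```

In this sketch, a seed would call `scan` once per execution, while a pipeline step would call `fetch` for each record that needs its content downloaded.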

Content Processors v1.0.0

Transforms the data obtained in a previous step.

  • BERT Service Processor: Uses BERT to vectorize chunks of text.

  • CSV Processor: Splits a CSV file into multiple records

  • Field Mapper: Applies simple transformations to the processed data (copy fields, join fields, lowercase, uppercase…)

  • HTML Processor: Parses an HTML file and extracts selected subsections

  • JSON Processor: Converts a byte-array into its corresponding JSON representation

  • Keyword Extractor: Uses YAKE to extract keywords from a text

  • Language Detector: Detects the language of a text

  • NLP Service Processor: Uses spaCy to extract the parts-of-speech of a text

  • OCR Processor: Uses Tesseract to extract text from images

  • Script Processor: Allows the configuration of custom processing scripts. Currently supports:

  • Taxonomy Tagger: Tags a text based on a dictionary

  • Tika Processor: Uses Apache Tika for content detection
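As a rough illustration of what the CSV Processor, JSON Processor and Field Mapper do to a record, the sketch below uses only the Python standard library. The function names (`split_csv`, `parse_json`, `map_fields`) and the field names are assumptions made for this example; they do not reflect the processors' actual configuration or API.

```python
import csv
import io
import json
from typing import Any, Dict, List


def split_csv(data: bytes) -> List[Dict[str, str]]:
    """Split a CSV file into one record per row, keyed by the header."""
    text = data.decode("utf-8")
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return [dict(row) for row in reader]


def parse_json(data: bytes) -> Any:
    """Convert a byte array into its JSON representation."""
    return json.loads(data.decode("utf-8"))


def map_fields(record: Dict[str, str]) -> Dict[str, str]:
    """Apply simple transformations: copy, lowercase and join fields."""
    mapped = dict(record)
    mapped["name"] = record["title"]                  # copy a field
    mapped["title_lower"] = record["title"].lower()   # lowercase a field
    mapped["label"] = " - ".join([record["title"], record["author"]])  # join fields
    return mapped


csv_bytes = b"title,author\nIntro to Search,Ada\nGraphs 101,Alan\n"
records = [map_fields(row) for row in split_csv(csv_bytes)]
document = parse_json(b'{"id": 1, "tags": ["search"]}')
```

Each function takes the output of a previous step and returns new data, which mirrors how processors are chained inside a pipeline.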

Hydrators v1.0.0

Publishes the processed data.

  • Elasticsearch Hydrator: Sends the data to an Elasticsearch index

  • Neo4J Hydrator: Creates the corresponding entities in Neo4j

  • Staging Hydrator: Sends the data to a bucket in the Discovery Staging Repository

Staging Repository v1.0.0

Abstract representation of an intermediate storage for processed data.

  • Exposed through HTTP

  • Supported providers:

    • MongoDB

  • CRUD into specific buckets

  • Query/filtering

  • Aggregations for deduplication
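The idea behind using aggregations for deduplication can be sketched in plain Python: group the records of a bucket by some key and keep the groups with more than one member. This is only an in-memory illustration; the `checksum` field and the `find_duplicates` helper are hypothetical, and the real service would run the grouping in the storage provider rather than in application code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_duplicates(records: Iterable[Dict[str, str]], key: str) -> Dict[str, List[str]]:
    """Group record ids by the value of `key` and keep groups larger than one."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["id"])
    return {value: ids for value, ids in groups.items() if len(ids) > 1}


# Hypothetical bucket content; "checksum" is an assumed deduplication key.
bucket = [
    {"id": "1", "checksum": "abc"},
    {"id": "2", "checksum": "def"},
    {"id": "3", "checksum": "abc"},
]

duplicates = find_duplicates(bucket, "checksum")
```

Here records `1` and `3` share a checksum, so a later step could keep one of them and discard the other.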

Discovery API v1.0.0

Supports the dynamic creation of endpoints by combining different components.

  • Default Search and Autocomplete endpoints for Elasticsearch

  • Different configuration for the components of an endpoint:

  • Query fallbacks

  • Supported components:

    • Elasticsearch requests

    • Faceting

    • Featured snippets with DistilBERT

    • HTTP requests

    • Knowledge graph queries with Neo4j

    • Language detector

    • Parts-of-speech identification with spaCy

    • Query snapping for Elasticsearch

    • Query vectorization with BERT

    • Question detector

    • Request logger

    • Redirect requests

    • Script processor for custom transformation. Currently supports:

    • Security filtering for Elasticsearch

    • Template-based requests
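One way to picture an endpoint built by combining components, including a query fallback, is as a chain of functions that each receive and return a request context. The sketch below is an assumption-laden illustration: `build_endpoint`, the component functions and the in-memory `DOCUMENTS` list are invented for this example and do not describe how the Discovery API is configured.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Component = Callable[[Context], Context]

# Hypothetical in-memory corpus standing in for a search index.
DOCUMENTS = ["Elasticsearch basics", "Neo4j knowledge graphs"]


def build_endpoint(components: List[Component]) -> Component:
    """Combine components into one endpoint that runs them in order."""
    def endpoint(context: Context) -> Context:
        for component in components:
            context = component(context)
        return context
    return endpoint


def lowercase_query(context: Context) -> Context:
    return {**context, "query": context["query"].lower()}


def phrase_search(context: Context) -> Context:
    """Primary query: the whole phrase must appear in the document."""
    hits = [doc for doc in DOCUMENTS if context["query"] in doc.lower()]
    return {**context, "hits": hits}


def term_fallback(context: Context) -> Context:
    """Fallback query: runs only if the primary query found nothing."""
    if context["hits"]:
        return context
    terms = context["query"].split()
    hits = [doc for doc in DOCUMENTS if any(term in doc.lower() for term in terms)]
    return {**context, "hits": hits, "used_fallback": True}


search = build_endpoint([lowercase_query, phrase_search, term_fallback])
result = search({"query": "Neo4j Tutorial"})
```

The phrase search finds nothing for "neo4j tutorial", so the fallback relaxes the query to individual terms and returns the Neo4j document.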

Search UI v1.0.0

User interface that provides a full search experience with all the features of the platform.

  • Search

  • Autocomplete

  • Pagination

  • Did you mean?

  • Query fallbacks

  • Featured snippet answers

  • Knowledge graph answers

  • Details page

©2024 Pureinsights Technology Corporation