Version 1.0.0
June 17, 2022
Core v1.0.0
The Core is the root of all products. It centralizes the configuration of the platform so that it can be adapted to any major cloud provider (AWS, Azure, GCP).
Admin API
File storage service
Allows storage of large files that can be used as input in the configuration of any product
Binary data server
Intermediate storage to handle the processing of large files
Can be configured with different providers:
Local file system (or NFS)
SFTP
Asynchronous message delivery with RabbitMQ
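The provider-based design of the binary data server can be pictured with a small sketch. It assumes a common put/get interface and a configuration dictionary that selects the backend. The names `BinaryStore`, `LocalFileStore`, `make_store`, `provider` and `root` are hypothetical and are not part of the Core API. Only the local file system provider is shown.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class BinaryStore(ABC):
    """Hypothetical interface for an intermediate binary storage provider."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class LocalFileStore(BinaryStore):
    """Stores each object as a file under a root directory (local disk or an NFS mount)."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        (self.root / key).write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def make_store(config: dict) -> BinaryStore:
    """Pick a provider from a config dict (assumed shape, for illustration only)."""
    if config["provider"] == "local":
        return LocalFileStore(config["root"])
    raise ValueError(f"unsupported provider: {config['provider']}")


# Usage: write a payload to a temporary directory through the interface.
store = make_store({"provider": "local", "root": tempfile.mkdtemp()})
store.put("sample.bin", b"payload")
```

An SFTP provider would implement the same two methods, so callers would not need to know which backend is configured.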
Ingestion v1.0.0
Main controller of the ingestion process. Allows configuration and execution of seeds, pipelines and processors.
Admin API
Pipeline Manager
Controls the steps to follow during processing
Workflow Manager
Controls the start and stop of the seed execution
Allows the execution of scheduled jobs
Connectors v1.0.0
Connectors retrieve data from a given source. Some connectors can work in “scanner mode”, where all documents from the source are analyzed (e.g. all records from a table in an RDBMS). Other connectors can work in “processor mode”, where individual items within a pipeline can be downloaded (e.g. a single document from S3).
Azure Blob Connector: Data from Microsoft Azure Blob Storage
Random Generator Connector: Creates random data for testing
RDBMS Connector: Records from relational databases. Currently supports:
MySQL
S3 Connector: Data from Amazon S3
Staging Connector: Records from the Discovery Staging Repository
Udemy Connector: Courses from Udemy
URL Connector: Downloads data from specific URLs
Website Connector: Crawls websites based on a starting URL
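The difference between scanner mode and processor mode can be illustrated with a minimal sketch. It uses an in-memory dictionary in place of a real source. The class `ExampleConnector` and its methods `scan` and `fetch` are hypothetical names chosen for illustration, not the connector interface shipped with the product.

```python
from typing import Dict, Iterator

# Hypothetical in-memory "source" standing in for a table or a bucket.
SOURCE: Dict[str, dict] = {
    "doc-1": {"title": "First"},
    "doc-2": {"title": "Second"},
    "doc-3": {"title": "Third"},
}


class ExampleConnector:
    """Illustrative connector supporting both modes (not a Discovery class)."""

    def __init__(self, source: Dict[str, dict]) -> None:
        self.source = source

    def scan(self) -> Iterator[dict]:
        """Scanner mode: walk the whole source and emit every record."""
        for record_id, body in self.source.items():
            yield {"id": record_id, **body}

    def fetch(self, record_id: str) -> dict:
        """Processor mode: download one item requested by a pipeline step."""
        return {"id": record_id, **self.source[record_id]}


connector = ExampleConnector(SOURCE)
all_records = list(connector.scan())   # every record in the source
single = connector.fetch("doc-2")      # one item, looked up by identifier
```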
Content Processors v1.0.0
Content processors transform the data obtained in a previous step.
BERT Service Processor: Uses BERT to vectorize chunks of text.
CSV Processor: Splits a CSV file into multiple records
Field Mapper: Applies simple transformations to the processed data (copy fields, join fields, lowercase, uppercase…)
HTML Processor: Parses an HTML file and extracts selected subsections
JSON Processor: Converts a byte-array into its corresponding JSON representation
Keyword Extractor: Uses YAKE to extract keywords from a text
Language Detector: Detects the language of a text
NLP Service Processor: Uses spaCy to extract the parts of speech of a text
OCR Processor: Uses Tesseract to extract text from images
Script Processor: Allows configuring custom processing scripts. Currently supports:
Taxonomy Tagger: Tags a text based on a dictionary
Tika Processor: Uses Apache Tika for content detection
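As an illustration of the kind of simple transformations the Field Mapper applies (copy fields, join fields, lowercase, uppercase), here is a minimal sketch. The function `apply_mappings` and the mapping keys `op`, `from`, `to`, `field` and `separator` are assumptions made for this example. The actual Field Mapper configuration format is not described in these notes.

```python
from typing import Any, Dict, List


def apply_mappings(record: Dict[str, Any], mappings: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply a list of simple field transformations, in order, to a copy of the record."""
    out = dict(record)
    for mapping in mappings:
        op = mapping["op"]
        if op == "copy":
            out[mapping["to"]] = out[mapping["from"]]
        elif op == "join":
            separator = mapping.get("separator", " ")
            out[mapping["to"]] = separator.join(str(out[name]) for name in mapping["from"])
        elif op == "lowercase":
            out[mapping["field"]] = str(out[mapping["field"]]).lower()
        elif op == "uppercase":
            out[mapping["field"]] = str(out[mapping["field"]]).upper()
        else:
            raise ValueError(f"unknown operation: {op}")
    return out


record = {"first": "Ada", "last": "Lovelace"}
mapped = apply_mappings(
    record,
    [
        {"op": "join", "from": ["first", "last"], "to": "full_name"},
        {"op": "copy", "from": "full_name", "to": "sort_key"},
        {"op": "lowercase", "field": "sort_key"},
    ],
)
```

Each mapping sees the output of the previous one, which is why `sort_key` can be copied from the freshly joined `full_name` and then lowercased.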
Hydrators v1.0.0
Hydrators publish the processed data.
Elasticsearch Hydrator: Sends the data to an Elasticsearch index
Neo4j Hydrator: Creates the corresponding entities in Neo4j
Staging Hydrator: Sends the data to a bucket in the Discovery Staging Repository
Staging Repository v1.0.0
Abstract representation of an intermediate storage for processed data.
Exposed through HTTP
Supported providers:
MongoDB
CRUD into specific buckets
Query/filtering
Aggregations for deduplication
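One way an aggregation can support deduplication on a MongoDB provider is sketched below. It assumes that records carry a field that identifies duplicates, here a hypothetical `checksum` field. The helper `duplicate_groups_pipeline` is illustrative only and does not describe how the Staging Repository builds its queries. The stages use the standard MongoDB operators `$group`, `$push`, `$sum` and `$match`.

```python
from typing import Any, Dict, List


def duplicate_groups_pipeline(key_field: str) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that lists groups of records sharing the same key."""
    return [
        # Group records by the chosen key, collecting their ids and counting them.
        {
            "$group": {
                "_id": f"${key_field}",
                "ids": {"$push": "$_id"},
                "count": {"$sum": 1},
            }
        },
        # Keep only the keys that occur more than once, i.e. the duplicates.
        {"$match": {"count": {"$gt": 1}}},
    ]


# "checksum" is an assumed field name used only for this example.
pipeline = duplicate_groups_pipeline("checksum")
```

With a driver such as PyMongo, a pipeline of this shape would be passed to `collection.aggregate(pipeline)`.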
Discovery API v1.0.0
Supports the dynamic creation of endpoints by combining different components.
Default Search and Autocomplete endpoints for Elasticsearch
Different configurations for the components of an endpoint:
Sequential
Parallel endpoints
Query fallbacks
Supported components:
Elasticsearch requests
Faceting
Featured snippets with DistilBERT
HTTP requests
Knowledge graph queries with Neo4j
Language detector
Parts-of-speech identification with spaCy
Query snapping for Elasticsearch
Query vectorization with BERT
Question detector
Request logger
Redirect requests
Script processor for custom transformations. Currently supports:
Security filtering for Elasticsearch
Template-based requests
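The sequential, parallel and query fallback configurations listed above can be pictured as ways of composing components into an endpoint. In this sketch a component is assumed to be a plain function that takes and returns a context dictionary. The helpers `sequential`, `parallel` and `with_fallback` and the stub components are hypothetical and do not reflect the Discovery API configuration syntax.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Context = Dict[str, object]
Component = Callable[[Context], Context]


def sequential(components: List[Component]) -> Component:
    """Run components one after another, each seeing the previous output."""
    def run(ctx: Context) -> Context:
        for component in components:
            ctx = component(ctx)
        return ctx
    return run


def parallel(components: List[Component]) -> Component:
    """Run components concurrently on copies of the context and merge their results."""
    def run(ctx: Context) -> Context:
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda component: component(dict(ctx)), components))
        merged = dict(ctx)
        for result in results:
            merged.update(result)
        return merged
    return run


def with_fallback(primary: Component, fallback: Component) -> Component:
    """Query fallback: use a second component when the first returns no hits."""
    def run(ctx: Context) -> Context:
        result = primary(dict(ctx))
        return result if result.get("hits") else fallback(dict(ctx))
    return run


# Stub components standing in for real ones (language detector, search request, ...).
def detect_language(ctx: Context) -> Context:
    return {**ctx, "language": "en"}

def detect_question(ctx: Context) -> Context:
    return {**ctx, "is_question": str(ctx["query"]).endswith("?")}

def strict_search(ctx: Context) -> Context:
    return {**ctx, "hits": []}

def relaxed_search(ctx: Context) -> Context:
    return {**ctx, "hits": ["doc-1"], "fallback_used": True}


endpoint = sequential([
    parallel([detect_language, detect_question]),
    with_fallback(strict_search, relaxed_search),
])
response = endpoint({"query": "what is discovery?"})
```

Here the two detectors run side by side, and the relaxed search is only executed because the strict search returns an empty hit list.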
Search UI v1.0.0
User interface that provides a full search experience with all the features of the platform.
Search
Autocomplete
Pagination
Did you mean?
Query fallbacks
Featured snippet answers
Knowledge graph answers
Details page
©2024 Pureinsights Technology Corporation