
Introduction 

Pureinsights Discovery is a search application platform that integrates search technologies such as content ingestion, processing and indexing with technologies like Generative AI and Vector Search to deliver the search experience users expect today in an extremely cost-effective manner. 

The platform is cloud-friendly, scalable and secure, and allows you to build a customized intelligent search application for your website, workplace or business application. Pureinsights offers support for your search application in the form of consulting and implementation services, as well as a unique, fully managed services model we call SearchOps™. 

This document provides you with an overview of Discovery and its varied use cases as well as more technical information on the overall architecture, components and capabilities of the platform. 

Pureinsights Discovery 

Discovery is a cutting-edge technology platform used to craft exceptional search experiences for our clients. It consists of modular components, akin to building blocks, that can be customized for specific needs. These components include ingestion and processing, a search engine, a knowledge graph, AI services, and an API that enables features beyond those of standard search engines, plus an extensive search UI. 

At the core of Discovery lies a robust, modern cloud-based architecture, boasting scalability, reliability, and flexibility. This architecture seamlessly integrates with existing systems and harnesses best-in-class AI services, enabling organizations to fully leverage AI-powered search while optimizing resource utilization and operational efficiency. 

With Discovery's comprehensive content processing capabilities, organizations efficiently ingest, clean, normalize, and enrich vast data volumes from diverse sources. This ensures access to rich, well-structured data, the foundation of robust search. 

Unique to Discovery is a multifaceted search offering catering to diverse user needs and preferences. Integrating advanced Generative AI and Vector Search technologies with traditional keyword search and knowledge graph functionalities, our platform provides an unprecedented search experience. 

Discovery leverages Generative AI to enable features such as automatic summaries, language translation and creative text generation. Additionally, Vector Search unlocks hidden connections and insights, uncovering valuable information previously out of reach. Alongside these advanced features, Discovery retains support for traditional keyword search, offering users familiar tools for seamless data navigation and exploration. 

In addition, Discovery employs Retrieval Augmented Generation (RAG), an advanced search technique that merges traditional information retrieval with Generative models to enhance search capabilities. By harnessing RAG, Discovery can address user queries by blending authenticated content with dynamically generated responses, thereby minimizing the risk of presenting inaccurate or misleading search results. 

Overall, Discovery’s powerful fusion of search capabilities leads to a more intuitive and insightful search experience. 

Discovery also includes a powerful API that developers can utilize to create fully personalized search solutions. With sophisticated query parsing, powered by AI services, the search experience is enhanced by understanding queries at a deeper level and deciphering user intent. Furthermore, Discovery's intuitive user interface and advanced filtering options enable effortless navigation of complex data. 

Discovery isn't just search, it's knowledge exploration reimagined. It empowers you to extract deeper insights, make informed decisions and fuel action like never before. 

...

Ingestion & Processing 

Discovery ingestion orchestrates the gathering and importing of data from various data sources. Connectors to common data repositories are available as standard, enabling scalable and efficient ingestion of raw data, documents and metadata while upholding access controls. Moreover, these connectors monitor data sources in real-time, processing additions, updates, and deletions as they happen. Bespoke connectors can be built using our developer-friendly connector framework.  

A Staging Repository serves as a transitional storage hub for content extracted from its source. This improves application performance by allowing for content reprocessing without having to reach back to the original content repository for every processing iteration. Built on a NoSQL database, the Staging Repository is equipped with a comprehensive REST API, facilitating seamless management, storage, access, and processing of the stored content. 

Discovery’s content processing pipelines streamline various tasks to optimize search. Processors within the pipeline clean, normalize, and enrich data, while specialized components handle tasks like generating embeddings (using a Large Language Model) for vector search and content tagging. By ensuring efficient processing and indexing, these pipelines empower powerful search functionalities and a seamless user experience. 
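
To make the pipeline idea concrete, the sketch below models processors as functions applied in order to each record. The stage names and fields are illustrative assumptions, not Discovery’s actual schema:

    # Minimal sketch of a content-processing pipeline: each processor takes
    # a record (a dict) and returns a modified record. Stage names and
    # fields are illustrative, not Discovery's actual schema.
    def clean(record):
        record["text"] = " ".join(record["text"].split())   # collapse whitespace
        return record

    def normalize(record):
        record["title"] = record.get("title", "").strip().lower()
        return record

    def enrich(record):
        record["word_count"] = len(record["text"].split())  # simple enrichment
        return record

    PIPELINE = [clean, normalize, enrich]

    def process(record):
        for processor in PIPELINE:
            record = processor(record)
        return record

    print(process({"title": "  Annual Report ", "text": "Revenue   grew  in 2023."}))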

Hydrate 

Processed, cleansed, and enhanced data is published to an enterprise search engine and/or knowledge graph. We call this hydration. Discovery is independent of search and knowledge graph technology, and we have built hydrators for industry leading products using our toolkit. The enriched data enables advanced search features such as vector search, extractive answers and generative answers. 

Discovery API 

Discovery's API empowers developers to create personalized search experiences effortlessly via configuration. Unlike traditional API adjustments requiring code changes, building, versioning, and deploying, Discovery API allows for on-the-fly tweaking, ensuring quick and seamless updates. Utilizing advanced query parsing, Natural Language Processing (NLP), and other AI services, the API effectively discerns user intent. Security measures are integrated within the API to guarantee that users can only access authorized results, a vital aspect within enterprise environments.  

Security 

Data security and access control are critical to any application platform. Discovery will have the capability to encrypt and secure data throughout its lifecycle, from ingestion, to processing, to consumption by end users or applications. 

Search Relevance Scoring and Tuning 

Search applications are like automobiles – they perform better with regular maintenance and tuning. Discovery includes optional tools and methodologies to monitor and score the relevance of search results to end users. This is the first step in diagnosing and fixing search application issues, or to determine the impact of a change or enhancement to the search application. 

Ingestion Framework 

Discovery was designed with the cloud in mind: easy to scale and manage, and easy to integrate with other cloud services. Discovery’s content processing architecture includes connectors which scan and process content from diverse sources during ingestion. Discovery also includes the necessary messaging, pipeline management, orchestration, traceability, and publishing services to manage large volumes of data. Kubernetes-containerized pods leverage the latest cloud platform technologies and enable Discovery to scale up and down easily as processing workloads change. 

 [Image here]

The Ingestion Framework is a custom ETL (Extract, Transform, Load) implementation that facilitates the ingestion, cleansing, normalization, augmentation and digital stitching of content from different data sources so this content can be made available through the Discovery API for use in search applications. 

The Framework consists of: 

  • Core Components 

  • Admin API 

  • Workflow Manager and Orchestrator 

  • Pipeline Manager 

 Core Components 

The core components of Discovery provide the basic root functions for communications, configuration and administration of the platform. This includes: 

...

Binary data server – intermediate storage used to manage the processing of large files prior to uploading them to the staging repository. 

...

Asynchronous message delivery with RabbitMQ, an open-source, high-performance message broker. 

...

 Admin API 

A RESTful JSON API to allow for configuration and complete control over all Discovery features. Through this API, you can create ingestion entities (pipelines, processors, seeds, etc.) and also start, stop and schedule ingestion processes. 
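
As an illustration only, driving a RESTful admin API from a script might look like the following sketch. The endpoint paths and payload fields here are hypothetical placeholders, not Discovery’s documented API:

    # Hypothetical sketch of driving a RESTful admin API with the
    # "requests" library. Endpoint paths and JSON fields are assumptions
    # made for illustration; consult the actual Admin API reference.
    import requests

    BASE = "https://discovery.example.com/admin/api"  # placeholder host

    # Create a pipeline entity from a JSON definition
    pipeline = {"name": "web-content", "processors": ["tika", "language-detector"]}
    requests.post(f"{BASE}/pipelines", json=pipeline, timeout=30).raise_for_status()

    # Start an ingestion execution for a previously configured seed
    resp = requests.post(f"{BASE}/seeds/website-seed/start", timeout=30)
    print(resp.status_code)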

Workflow Manager (WFM) and Orchestrator 

This component acts as the “brain” of the ingestion platform. It is responsible for triggering seed executions, monitoring the progress of existing executions and cleaning up after any work that is done. WFM also optimizes the distribution of jobs throughout the cluster elastically. The workflow manager must be running at all times in active-active mode for effective content ingestion management. 

Pipeline Manager 

The Pipeline Manager controls the steps to follow during content processing. It works in cooperation with the Workflow Manager (WFM) to ensure ingestion executions are healthy. The Workflow Manager will assign tasks to the Pipeline Manager which get distributed across the cluster for maximum scalability. The Pipeline Manager also performs housekeeping tasks after each job / record in an execution. 

Ingestion Connectors 

Connectors allow Discovery to retrieve data from a given source prior to content processing. Connectors first work in scan mode to detect updates and changes to the known set of records to process. This enables the framework to keep track of all added, updated and deleted documents to be processed, and to use resources efficiently while scanning. Once scanned, records are processed: documents or records are examined in batches by the “ingestion processors” in the pipeline associated with the content source. Failed documents or batches of documents are retried automatically to ensure maximum completeness.  
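
The bookkeeping a scan performs can be illustrated with a simple checksum comparison. This is a generic sketch of the idea, not the actual connector implementation:

    # Generic sketch of scan-mode change detection: compare a snapshot of
    # the source against the last known state using content checksums.
    # This illustrates the idea only, not the connectors' actual code.
    import hashlib

    def checksum(content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()

    def diff(known: dict, current: dict):
        added = [k for k in current if k not in known]
        deleted = [k for k in known if k not in current]
        updated = [k for k in current if k in known and current[k] != known[k]]
        return added, updated, deleted

    known = {"a.txt": checksum(b"old"), "b.txt": checksum(b"same")}
    current = {"a.txt": checksum(b"new"), "b.txt": checksum(b"same"),
               "c.txt": checksum(b"brand new")}
    print(diff(known, current))  # (['c.txt'], ['a.txt'], [])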

 [Image here]

Discovery currently has connectors to the content sources listed below. The list will continue to grow with future versions of the software, and we expect to be able to support ingestion from all of the most popular data sources in all the most common formats. If a connector is not yet available for a given project, a custom connector can be easily developed as a service engagement using Discovery’s connector framework. 

The current connectors for Discovery represent a wide variety of sources, including: 

  • MongoDB Atlas Connector – data from an existing MongoDB Atlas database 

  • URL Connector – to download content from specific URLs 

  • Website Connector – to crawl a website or group of websites based on a starting URL 

  • Azure Blob Connector – data from Microsoft Azure Blob Storage 

  • RDBMS Connector – data from relational databases (via JDBC) 

  • S3 Connector – data from Amazon S3 

  • Udemy Connector – data from the Udemy online course directory 

  • Elasticsearch Connector – data from any index or indexes stored in Elasticsearch 

  • OpenSearch Connector – data from an existing OpenSearch index 

  • LinkedIn Learning connector – via 3rd Party 

In addition, Discovery has special connectors for development purposes: 

  • Random Generator Connector – to create random data for scalability and performance testing purposes 

  • Apache Solr Connector – on customer request 

 Connectors for ingesting data from existing search engine indices are useful in cases of migration from one search engine to another, or for enriching an existing index without having to recreate the index from scratch. 

Ingestion Processors 

After data is extracted from different content sources, a variety of services are available to transform the data obtained. The list of processors will grow and evolve as Discovery leverages newer or additional data transformation services. There is significant use of dependable open-source tools, which contributes to Discovery’s cost effectiveness. This list also provides developers with transparency and insight into the data transformation capabilities of Discovery. The most notable current ingestion / transformation processors include: 

Basic Content Processors 

  • CSV Processor – splits a CSV file into multiple records for processing individually. 

  • Field Mapper – allows for simple transformations to processed data (copy fields, join fields, lowercase, uppercase…) 

  • HTML Processor – parses an HTML file and extracts selected subsections by class 

  • JSON Processor – converts a byte-array into its corresponding JSON representation 

  • Keyword Extraction Processor – to extract keywords and key phrases from text 

  • Language Detector – detects the language of a text 

  • NLP Service Processor – to perform Entity Recognition, Sentiment Extraction and Dependency Parse Trees. 

  • OCR Processor – uses Tesseract to extract text from images 

  • Script Processor – allows configuring custom processing scripts. Currently supports: 

    • Groovy 

    • Python (through Jython) 

    • JavaScript (through Rhino and Nashorn) 

  • Taxonomy Tagger – probabilistically tags a text based on a dictionary or taxonomy. 

  • Tika Processor – uses Apache Tika for content detection and extraction 

  • Chunk Processor – parses text into chunks of overlapping sentences (a sketch follows this list) 

  • Engine Score Processor – calculates the engine score for a specific query 

  • Split Processor – splits a document into multiple documents 

  • Chemical Tagger Processor – takes input text from a field and extracts chemical elements using Oscar4 

  • BERT Service Processor – uses BERT to vectorize chunks of text 

  • OpenAI (GPT) Processor – uses OpenAI to vectorize chunks of text 

  • Hugging Face Processor – uses Hugging Face models to vectorize chunks of text 
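
To make the Chunk Processor’s approach concrete, here is a minimal sketch of overlapping-sentence chunking. The naive sentence splitting and the window and overlap sizes are illustrative choices:

    # Minimal sketch of chunking text into overlapping sentence windows,
    # as described for the Chunk Processor above. The naive sentence split
    # and the window/overlap sizes are illustrative choices.
    def chunk_sentences(text, window=4, overlap=1):
        sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
        step = window - overlap
        return [" ".join(sentences[i:i + window])
                for i in range(0, max(len(sentences) - overlap, 1), step)]

    text = "One. Two. Three. Four. Five. Six. Seven."
    for chunk in chunk_sentences(text):
        print(chunk)  # consecutive chunks share one sentence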

 AI Services 

 Large Language Models 

Embeddings refers to the representation of a word or some text as a series of floating-point numbers, or a vector. The means for generating embeddings has evolved rapidly in the two decades since they were invented, with Word2Vec and GloVe coming to the fore around 10 years ago, at around the same time that pre-trained embedding models became available. But the real potential became clear when Google open-sourced its BERT model in 2018, arguably the first Large Language Model (LLM). In the 6 years since, there has been an explosion of activity, with around 50,000 NLP models available on the popular Hugging Face repository. Many of these are trained or tuned for a specific language domain such as finance, pharmacology or toxic language, so they can do a better job at whichever task they have been applied to. 
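
As an example, embeddings can be generated with an open model from the Hugging Face ecosystem (shown here via the sentence-transformers library; the model name is a popular general-purpose choice, not a Discovery requirement):

    # Example of producing embedding vectors with the open-source
    # sentence-transformers library. The model is a popular general-purpose
    # choice; any suitable Hugging Face model could be substituted.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode(["How old is the moon?",
                            "The moon formed about 4.5 billion years ago."])
    print(vectors.shape)  # (2, 384): one 384-dimensional vector per text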

 [Image here]

 *Integration of Large Language Models on customer demand 

 

Now that extremely powerful computing is available, it is possible to build language models with more and more data and parameters, and the models are getting larger and larger, as illustrated below:  

  [Image here]

 In fact, in the last 12 months, with the launch of ChatGPT and, more recently, GPT-3.5 Turbo and GPT-4, the power of these models has reached the mainstream and the popular imagination, and now every business we speak to is interested in how it might benefit. Any evaluation of a new search system will have to consider how it will use Large Language Models. This can be a difficult proposition, because the dust has not yet settled and it is not entirely clear which capabilities will realize the most productivity gains, or which will be most popular and see the highest uptake with users. As a general rule of thumb, if Google or Microsoft are doing it, it will have been A/B tested heavily and the majority of people will be familiar with it. 

 

Vector Search 

Alongside the rise of LLMs there has been an effort to make use of vectors in search, with Elasticsearch, OpenSearch, MongoDB and now Solr all providing a vector field type. This allows similarity search (often approximate k-nearest-neighbor search, using a metric such as cosine similarity or Euclidean distance). This is all still relatively new, however, and so there are some shortcomings that are still being ironed out:  

  1. Size limitations: There is a limit on the number of input tokens. This varies per model but is usually somewhere between 500 and 8,000, though GPT-4 can optionally accept a 32k prompt.    

  2. Pricing: Using an API-based model such as GPT and its relatives can become expensive at search-engine scale, as it is metered per token.  

  3. Combination with keywords: Vector search optimizes for recall rather than precision, which is not always preferable, so ways of combining it with keyword search are still being rolled out. Note: at this time, Pureinsights recommends that any vector search deployment be based on a search engine implementation in which keyword search and vector search are combined. Several dedicated vector search systems are now appearing on the market. While these may have compelling demonstrations and features, their keyword search capabilities have not received the same level of diligence, leaving them sub-par when it comes to blended vector and keyword search operations. We strongly believe that the search engine teams will ultimately win this battle technically. It is also suboptimal (in terms of synchronization and storage costs) to store the same content twice in two distinct locations, passing it to a search engine for keyword searches and to a separate vector database for vector operations. 
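
To ground the similarity-search idea, the sketch below performs brute-force k-nearest-neighbor search by cosine similarity with NumPy. Production engines use approximate indexes rather than exhaustive scans; this only illustrates what a vector field type enables:

    # Brute-force k-nearest-neighbor search by cosine similarity using
    # NumPy. Real engines use approximate indexes at scale; this sketch
    # only illustrates the underlying similarity computation.
    import numpy as np

    def top_k(query: np.ndarray, docs: np.ndarray, k: int = 3):
        q = query / np.linalg.norm(query)
        d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        scores = d @ q                     # cosine similarity per document
        idx = np.argsort(-scores)[:k]      # indices of the k best scores
        return list(zip(idx.tolist(), scores[idx].tolist()))

    rng = np.random.default_rng(0)
    docs = rng.normal(size=(100, 384))     # stand-in document embeddings
    query = rng.normal(size=384)
    print(top_k(query, docs))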

 Using Vector Search to answer questions 

To use vector search to answer questions, it is first necessary to chunk and ingest the data into a vector store. This could be a search engine or a dedicated vector database.  

  [Image here]

 Then there are three ways to answer questions driven by vector search:  

  1. Extractive Answers:  

    1. Perform a vector-based similarity search to find candidate text chunks  

    2. Use a model such as DistilBERT to identify the best answer to the question in some candidate chunks  

    3. Evaluate confidence that the question has been answered correctly  

    4. Show the snippet to the end user  

  2. Knowledge Graph questions and answers:  

    1. Identify entities in the question  

    2. Vectorize the query and match it to known ways of querying a database or Knowledge graph  

    3. Insert the entities as parameters to the known query  

    4. Return the answer  

  3. Retrieval Augmented Generation:  

    1. Perform a vector-based similarity search to find candidate text chunks  

    2. Prompt a model such as GPT to find the best answer to the question in the candidate chunks.  

    3. Evaluate confidence that the question has been answered correctly.  

    4. Show the model response to the end user. 
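
A minimal RAG sketch under stated assumptions: the retrieval step is stubbed out, and the generation step uses the OpenAI Python client, with the prompt wording and model name chosen purely for illustration:

    # Minimal Retrieval Augmented Generation sketch. The retrieve() stub
    # stands in for a vector-store similarity search; the prompt wording
    # and model name are illustrative assumptions.
    from openai import OpenAI

    def retrieve(question: str) -> list[str]:
        # Placeholder: in practice, embed the question and run a vector
        # similarity search against the ingested chunks.
        return ["The moon formed roughly 4.5 billion years ago."]

    def answer(question: str) -> str:
        context = "\n".join(retrieve(question))
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[
                {"role": "system",
                 "content": "Answer using only the provided context."},
                {"role": "user",
                 "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return resp.choices[0].message.content

    print(answer("How old is the moon?"))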

 [Image here]  

  Discovery AI Processors 

  • BERT Service Processor – uses BERT to vectorize chunks of text. BERT is Google’s open-source, transformer-based machine learning technique for natural language processing (NLP) 

  • Hugging Face Model Runner – exposes a variety of models to perform AI and NLP tasks such as question answering (summarization and sentiment analysis are also possible). 

  • OpenAI – uses OpenAI to vectorize chunks of text. 

  • Google Vertex AI – On customer demand 

  • AWS AI Services – On customer demand 

 Discovery can easily incorporate more basic and advanced AI-driven content processors as the state of the art improves.  

Hydrators 

Hydrators enable Discovery to write content processing results to a search index, knowledge graph, or NoSQL database. Discovery supports a wide variety of hydrators for these different data storage technologies. Hydrators publish processed data to different repositories or indices that are accessed to respond to search queries, or for other intermediate staging purposes.  

 [Image here]

 

Search Engines 

  • Elasticsearch Hydrator – sends the data to an Elasticsearch index 

  • OpenSearch Hydrator – sends the data to an OpenSearch index 

  • Apache Solr Hydrator – sends the data to an Apache Solr index 

Knowledge Graphs

  • Neo4j Hydrator – creates the corresponding entities in Neo4j 

NoSQL Databases 

  • MongoDB and MongoDB Atlas 

Other

  • Staging Hydrator – sends the data to the Discovery Staging Repository (more info below) 

 

This is not a definitive list of all the hydrators possible. Pureinsights will support other search engines, knowledge graphs and special repositories (open-source or commercial) depending on customer demand.  

Staging Repository 

The Staging Repository is an intermediate repository where content is placed after it has been extracted from a content source. This improves application performance by allowing for content reprocessing without having to reach back to the original content repository for every processing iteration. 

Each Staging Repository is a storage unit, and each storage unit consists of buckets (like folders in a document system). Content is stored in the buckets, and there is a transaction log for each record stored in a bucket. The Staging Repository leverages a NoSQL database, and includes a REST API and REST client to manage, store, access and process the content stored in the repository. 

 Other features of the Staging Repository: 

  • Exposed through HTTP 

  • Supported NoSQL databases: 

  • MongoDB, MongoDB Atlas and DocumentDB 

  • Create, Read, Update and Delete (CRUD) into specific buckets 

  • Query/filtering 

  • Aggregations for deduplication 

  • External application subscription 
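
By way of illustration, CRUD operations against a bucket through a REST interface might look like the sketch below. The paths, payloads and authentication are hypothetical placeholders, not the documented interface:

    # Hypothetical sketch of bucket CRUD against a staging REST API using
    # "requests". Paths, payloads and authentication are assumptions; see
    # the repository's actual REST API documentation.
    import requests

    BASE = "https://discovery.example.com/staging"  # placeholder host

    # Create (or update) a record in a bucket
    doc = {"id": "doc-1", "title": "Annual Report", "text": "..."}
    requests.put(f"{BASE}/buckets/web-content/records/doc-1",
                 json=doc, timeout=30).raise_for_status()

    # Read the record back
    rec = requests.get(f"{BASE}/buckets/web-content/records/doc-1",
                       timeout=30).json()
    print(rec["title"])

    # Delete the record
    requests.delete(f"{BASE}/buckets/web-content/records/doc-1",
                    timeout=30).raise_for_status()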

 [Image here]

Discovery API 

The Discovery API is designed to help an application User Interface (UI) access the data and features of underlying search engines, knowledge graphs or other special repositories. Discovery has a REST API that can be used by the UI components and called directly to configure and access search engine features (such as those of Elasticsearch, OpenSearch or Solr).  

[Image here]  

 The API supports the creation of endpoints through configuration by combining different components; a request sketch follows the list below. 

  •  Default Search and Autocomplete endpoints for supported search engines (e.g. Elasticsearch, OpenSearch, Solr)  

  • Different configuration for the components of an endpoint: 

  • Query fallbacks 

  • Supported components: 

    • Search engine requests for supported search engines (e.g. Elasticsearch, OpenSearch, Solr) 

    • Faceting 

    • Featured snippets with DistilBERT 

    • HTTP requests 

    • Knowledge graph queries with Neo4j 

    • Language detector 

    • Parts-of-speech identification with spaCy 

    • Query snapping for supported search engines (e.g. Elasticsearch, OpenSearch, Solr)  

    • Query vectorization with BERT 

    • Question detector 

    • Request logger 

    • Redirect requests 

    • Script processor for custom transformations 

    • MongoDB/Atlas post-processor 

    • NLP Service 

    • OpenAI API 

    • Vector Search 

    • Query feedback 

    • Engine scoring 
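
For illustration, calling a configured search endpoint might look like the snippet below; the endpoint path, parameters and response shape are hypothetical assumptions, not the documented contract:

    # Hypothetical sketch of querying a configured search endpoint. The
    # path, parameters and response shape are illustrative assumptions,
    # not the documented contract.
    import requests

    resp = requests.get(
        "https://discovery.example.com/api/search",    # placeholder endpoint
        params={"q": "how old is the moon", "page": 1, "pageSize": 10},
        headers={"Authorization": "Bearer <token>"},   # results are filtered per user
        timeout=30,
    )
    resp.raise_for_status()
    for hit in resp.json().get("results", []):
        print(hit.get("title"))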

 Search UI 

Discovery includes components to help developers create a customized search application User Interface that provides an advanced search experience for users. The UI is adaptable to different branding and functional requirements. 

[Image here]

Example of a Search UI for E-commerce 

[Image here]

Example of a Search UI for a Government Publications Portal 

The basic functionality supported includes: 

  • Natural language queries (question) processing with 

    • Knowledge Graph answers  

    • Featured snippets / extractive answers 

    • Details page / answer cards  

    • Generative answers 

    • Vector search 

  •  Traditional keyword search queries with 

    • Highly relevant traditional keyword search results 

    • Autocomplete 

    • Pagination 

    • Did you mean? 

    • Query fallbacks 

    • Facet Snapping 

    • Results Feedback 

 

Traditional keyword search is familiar to most search application users, who may type things like “men’s leather jackets.” Natural language queries are when users ask the search application a full question like “How old is the moon?”, expecting a direct answer back.  

Admin UI 

The Discovery product plans include a full, user-friendly Admin UI. In the interim, a command-line administration interface (AdminCLI) and a JSON API are available. 

Search Relevance Dashboard 

The Discovery Search Relevance Scoring dashboard is a tool that provides search engine diagnostic capabilities complementary to the rest of the platform. 

One of the goals of a search application is to deliver results that are as relevant as possible to queries submitted by users. And while search relevancy tuning requires significant expertise and a good methodology, the process starts with being able to objectively measure the quality or “score” of a search engine at any given time. 

An analogy to this in medicine is the EKG or ECG (electrocardiogram). This diagnostic tool measures certain mechanical aspects of how the heart is performing, and the results are compared to an expected “norm.” Often the EKG is the first step in diagnosing a major issue that might lead to major heart surgery. But no surgeon would even consider such drastic measures without first doing an EKG. 

The Discovery Search Relevance Dashboard measures the “health” of search application results. It can objectively determine when new software modifications, content changes, or changing user behavior over time lead to a deterioration in the quality of search results.  

 [Image here]

 The Discovery Search Relevance Dashboard leverages user activity log files to analyze submitted queries and how far down the presented results users have to scroll to click on an answer, link or document they deemed most relevant. But engine scoring is a diagnostic and not a prescriptive tool. If the score is low, search engineers still have to determine the problem and introduce changes or software fixes. Regular or continuous monitoring does, however, provide an indication of whether or not the fixes have improved (or degraded) results. 
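
One common way to turn logged click positions into an objective score is mean reciprocal rank (MRR). The sketch below is a generic illustration of the idea, not necessarily the dashboard’s exact formula:

    # Generic sketch of scoring relevance from click logs with mean
    # reciprocal rank (MRR): the reciprocal of the 1-based rank of the
    # first clicked result, averaged over queries. The dashboard's actual
    # metric may differ.
    def mrr(click_positions: list[int]) -> float:
        # 0 means the user clicked nothing for that query
        scores = [1.0 / p if p > 0 else 0.0 for p in click_positions]
        return sum(scores) / len(scores)

    print(mrr([1, 3, 0, 2]))  # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458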

 [Image here]

More detailed information is available on how search relevance scoring works, and how it is used for relevancy tuning and continuous improvements of search results. 

Cloud Providers 

Discovery is available on AWS, Azure and Google Cloud Platform (GCP). 

Services  

Service Level Agreements 

Pureinsights can provide Service Level Agreements for 9×5 or 24×7 support of your search applications. Depending on your needs and contractual requirements, support activities would fall into one of the following categories, based on their complexity, criticality, and the availability of appropriate Client resources during business hours:  

  • Full 24×7 – Activities that may be attended to at any time, following the SLA.  

  • Arranged Prior (24×7) – Activities that can be attended to or performed at any time, but with planned, advance coordination.  

  • Business Hours (8×5) – Activities that will be attended to or performed during business hours.  

Service Level Agreements are agreed with each customer individually, depending on their business requirements. 

 

Service Management - SearchOps 

Business-as-usual managed services for a search application include the following ongoing tasks: 

  •  Deployment of all new versions to each environment and validation of proper deployment and functionality. Monitoring for regression issues and redeployment of prior versions if necessary. Applying runtime deployment changes when possible.  

  • Outlining the implications of updated versions for system load and expected performance.  

  • Coordinating and communicating with the business prior to performing scheduled upgrades, OS patches, service packs, security patches, software patches, backup and log rotation.  

  • Monitoring of the complete application and generation of reports on system health, performance, top queries and other functional areas.  

  • Starting, stopping, suspending and resuming services – subject to approval from Client.  

  • Monitoring the document processing, indexing and search performance and making adjustments to improve.  

  • Continuously improving the monitoring systems.  

  • Maintaining dictionaries (spell-checking and synonym dictionaries).  

  • Performing index maintenance, identifying and solving indexing issues.  

  • Performing minor changes to the document processing pipeline.  

  • Executing and reporting on the Relevance Scoring system.  

  • Providing Monthly Status Reports.  

  • Providing Quarterly service reports and attending Quarterly meetings.  

  • Attending other meetings as requested by the Client.  

  • Producing and maintaining current and future documentation and procedures.  

  • Performing regular back-ups and restores of configuration and indexes.  

  • Scheduling and communication associated with maintenance windows.  

  This list can be adjusted as needed.  

 Pureinsights has standardized on Jira Service Management and OpsGenie for incident management and general support services, and Datadog for monitoring.  These are inexpensive but are a monthly expense to the project.  

 Use Cases  

There are many ways to classify how Discovery can be deployed and used to add AI and knowledge graphs to search applications. A variety of common business and functional use cases are described here to illustrate the flexibility and usefulness of the platform. 

[Image here]

 Business Use Cases 

Intelligent Enterprise Search 

According to a McKinsey report, employees spend 1.8 hours every day – 9.3 hours per week, or 23% of their time, on average – searching and gathering information.  And in today’s economy, with workers increasingly working from home, good corporate information resources have become even more critical.  

But if that is the case, why are corporate intranets (traditional enterprise search) so often maligned? Why do CIOs think they have a low return on investment? Why do employees always complain, “our search stinks”? In a chicken-or-egg situation, it is because corporate search applications are so poorly executed.  

But employees rely on internet search every day. The answer to maximizing worker productivity in information search is to deliver a Google-like search experience for corporate intranets. To do this, Google leverages AI and knowledge graphs, and so should your intelligent enterprise search applications. 

E-Commerce Search 

E-Commerce is booming in the post-pandemic economy. And E-Commerce search represents a business use case with indisputable ROI. Research indicates that customers who use search are 2.4 times more likely to buy. They also spend 2.6 times more than other customers. Additionally, 34% of search queries are non-product queries. This could include searches on shipping options, credit options, or return policies. 

The major online retailers may have addressed their customers’ search needs, but the Baymard Institute declares that the state of e-commerce search is “broken”, with only a handful of sites delivering a decent search experience.   

One cause is that e-commerce platforms used by small and medium online retailers do not offer the search functionality necessary to deliver a good search experience. Discovery can complement those platforms to cost-effectively deliver that search experience while your e-commerce platform manages the warehousing and back-office integration aspects of your website. 

Customer Portals 

Whether you are an online retailer, or a company that provides B2B products and services, customer portals play a key role in customer self-service strategies. According to the Service Desk Institute, a good portal can deliver business benefits such as reduced customer support costs, increased customer satisfaction, and the ability to offer round-the-clock support. 

Discovery can help deliver a customer portal experience that helps you realize all these benefits by delivering a search experience that can understand customers’ natural language queries, and deliver the correct answer through direct answers from a knowledge graph, or extractive answers from FAQs, knowledge bases, PDF documentation, and other information sources. 

Content Portals 

Good search is critical for content portals provided by information and media publishers, whether an online subscription service like Bloomberg Finance, a free public portal like the US National Archives, or a streaming service like Netflix. In this instance, the content IS the product – and the portal is the customer’s or user’s access to the content. 

Discovery can be a critical component of a content platform that delivers relevant search results to natural language queries about content. Discovery can also be used to power content recommendation engines, resulting in an optimal experience for content consumers. 

Search & Match 

Search and match applications are special search use cases where the submitted query may be an entire paragraph or document, to search for documents or content in a repository that meet certain matching criteria. Examples include matching resumes to job openings in recruiting; patent searches for potential new patents; similar research in academic repositories; or even molecular formulations in scientific databases. 

In each of these (and similar) use cases, a Discovery-powered search application can automate manual processes or increase the efficiencies of high-salaried knowledge workers. 

 Functional Use Cases 

 Question Answering Systems 

In 2022, Google search statistics indicate that more than 21% of queries use 5 or more words – meaning users are likely typing in full questions.  That figure is likely to grow significantly as people become accustomed to asking full natural language questions on search applications. 

Question answering systems are viewed as the future of search. Google and Bing have trained legions of consumers to type full, natural language questions into a search bar, or to ask full questions of digital assistants. These online search engines use AI technologies like machine learning and natural language processing, along with knowledge graphs, to deliver the search experience users expect today. 

Discovery is a platform that can integrate and orchestrate the different cloud technologies available today to complement and enhance existing search applications with full Question Answering capabilities.  

Knowledge Management 

Knowledge management (KM) is the process by which an enterprise gathers, organizes, shares and analyzes its knowledge in a way that is easily accessible to employees. This knowledge includes technical resources, frequently asked questions, training documents and people skills. 

KM is a complementary business process that can ensure the success of enterprise search or corporate intranet deployments. Discovery can leverage the taxonomies, vocabularies and content management processes developed in KM to ensure that content is properly ingested, processed and indexed to ensure complete and relevant results in enterprise search applications. 

Document Understanding 

Document Understanding helps search applications “know” what documents are about. To answer a search query like “find all construction and renovation contracts in Saudi Arabia,” a search application would have to deconstruct or “understand” the query, and then submit a query to a search engine for the answer.  

The result might come from a search index or knowledge graph; but to populate those databases, platforms like Discovery first have to “read” or deconstruct all relevant documents (contracts) to extract key features from the document and hydrate the databases. This is the rough equivalent of “understanding” the document. 

Discovery leverages advanced cloud-based natural language and machine learning services like Google BERT, Amazon Comprehend, or Azure Cognitive Services to process and “understand” large-scale document repositories to support various document search applications. 

Content Tagging and Processing 

Users can build entire intelligent search applications around Discovery – from content ingestion and processing to index and knowledge graph hydration, to comprehensive UIs. However, many information publishers and independent software / SaaS vendors (ISVs) may already have significant investments in their applications. 

In this case, they can still leverage Discovery to do the important job of content processing and tagging. This is the process by which content metadata is created or enriched so that search indices and knowledge graphs are hydrated with the information needed to improve search results and relevancy in the application. Even this seemingly mundane function is enhanced in Discovery by the incorporation of AI and advanced NLP services. 

Embedded Search Applications 

Sometimes search applications do not take the form of a traditional search bar. This could include highly faceted search applications for travel reservations, GIS-based search applications, or even ride-share applications like Lyft or Uber. In these use cases, search is just one (albeit complicated) feature in a more complex application platform. 

Rather than having to develop an entire search platform from scratch, ISVs can leverage today’s API-driven cloud architectures to have just the search portion of their application powered by Discovery on its own cloud infrastructure. This would allow the ISV’s developers and support teams to focus on elements of the application platform that represent the core competencies of their business. Discovery and the search functionality could be managed by a specialized team, or the entire search function could be delivered as a special managed service like Pureinsights’ SearchOps™. 

Summary 

 Influenced by internet search, people are no longer satisfied with ranked search results from keyword queries. They want to type in full questions and get answers. They expect search applications to support natural language queries, featured snippets and even generative answers. Traditional search is not enough. Discovery brings together all the components you need to provide your users with the advanced search experience they now expect.  

Modern cloud-native architecture 

Discovery was born in the cloud. We are building a cloud-native architecture that exploits the flexibility and resilience of cloud computing, and enables clients to run scalable, efficient, and secure search applications in modern, dynamic environments such as public, private and hybrid clouds. 

 Dynamic data connectors 

Data connectors are required to gain access to content sources such as file systems, databases and websites and feed data into Discovery. Discovery’s data connectors ingest data in a scalable and efficient manner while honoring access controls. Plus, they monitor the data source for additions and deletions and process them as they occur. 

 Intelligent content processing 

Poor quality data, especially metadata, can have a detrimental impact on search performance. Discovery uses intelligent content processing pipelines as it ingests data to refine and optimize it for retrieval. Content is cleansed, enhanced, and normalized and can be further enriched via services for entity extraction, metadata augmentation, tagging and classification. 

 AI-Powered 

At its core, search is about understanding language, and Discovery takes full advantage of AI technologies such as Generative AI and Vector Search. These AI technologies enable users to search in a way that feels natural and surface relevant results. For example, vector similarity search uses Machine Learning to provide a much more refined way to find content with subtle nuances and meanings. And featured snippets use a combination of AI technologies to extract a specific piece of text from a document that best answers a user’s search request.  

 Integrated Knowledge Graph 

A knowledge graph is a database of entities – i.e., people, places, and events – and the relationships between them. As such, they have proved enormously powerful in question-answer systems. Discovery uses knowledge graph technology to provide direct answers to factual questions. 

 API and Versatile UI 

The goal of any search application is to serve the users’ needs quickly and efficiently. Discovery includes a powerful API that developers can use to create a fully personalized search experience. Sophisticated query parsing, NLP and other AI services are deployed to help understand the intent of a user’s search request. Security is included in this API to ensure users are served only results they are allowed to see. Discovery also includes a complete React-based Search User Interface that clients can deploy with minimal development effort.  

 Elevate Open-Source Search 

Discovery has been designed to enhance traditional open-source search engines by integrating them with knowledge graph and AI technologies to provide advanced search features. You can choose to augment your existing open-source search engine with Discovery or leverage the complete platform to meet the requirements of your organization.  

 Achieve Cost Efficiencies 

Discovery can help you build a better search application for your workplace, website, or support portal. It can improve the embedded search experience for your information service or software product. Combining Discovery with top open-source technologies results in a cost-efficient means to achieve functionality on par with top commercial cognitive search solutions. 

 Whichever path you choose – augmenting your existing open-source search engine with Discovery or leveraging the complete platform – Pureinsights can help you assess, design, and develop your search application and provide ongoing support and maintenance. 

About Pureinsights 

Pureinsights has deep expertise building search applications with conventional search engines. 
Now we can take you "Beyond Search", using Generative AI models like ChatGPT and Google Gemini together with Vector Search, Knowledge Graphs, and Natural Language Processing to modernize your organization's search capabilities and deliver the intuitive search experience users want. "Just make it work like Google." 

©2024 Pureinsights Technology Corporation. Pureinsights™, Pureinsights Discovery™ and SearchOps™ are trademarks of Pureinsights Technology Corporation. 

...
