Vector search has recently become a popular approach to search. Using ReactiveSearch pipelines, we can add stages that rearrange results using kNN with just a few lines of code.

Before we start: why build this pipeline?

This pipeline indexes vector data without asking the user to provide it. Imagine a case where the index data consists of various fields like Name, Age etc. Our requirement is that when an indexing request comes in, we want to convert the Name to a vector and store it in the index request as vector_data.

The question is: why is a vector even necessary? Well, a vector lets us build pipelines like the kNN search one, where search requests use the vector data to find results.

Index Requirements

This how-to guide uses OpenSearch for the demo. In order for the data to be stored in the index, the index has to know that the vector_data field will hold a vector. Not just that, the dimension of the vector field also needs to be specified.

The dimension of the vector field can differ: it depends on the utility that converts the string (or any other type of) data to a vector. In this example, we will use OpenAI's Embeddings, whose dimension is 1536, so we need to set the dimension of the vector field to that value.

It can be set by sending the following request to OpenSearch when creating the index:

PUT /{index_name}

with the following body

{
    "settings": {
        "knn": true,
        "knn.algo_param.ef_search": 100
    },
    "mappings": {
        "properties": {
            "vector_data": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib"
                }
            }
        }
    }
}

NOTE that the opensearch-knn plugin has to be installed in the OpenSearch cluster. This plugin is installed by default in all complete OpenSearch distributions, but it is not part of the minimal OpenSearch package. Read more about the plugin here

Assumptions

There are various algorithms that can be run on top of data to get a vector representation of it. In this case, for the sake of example, we will use OpenAI's embedding algorithm to find the vector representation of the data. It is important that we use the same algorithm while indexing the data as well as while searching it in order to get correct results.

This means that, while indexing, we will have to run the fields we want to store as vectors through this algorithm. We will also need to run the search query through the same algorithm to get the vector representation of the query.

Data Set

In order to show this pipeline in action, we are going to use the Amazon Review Dataset. This dataset contains reviews of dog food products on Amazon. Out of all the fields present in the dataset, we will use the Summary and Text fields and index them as vector data. What this means is that our vector will be a vector representation of the Summary and Text field strings joined using a comma ,.

NOTE that the comma will not change the meaning of the embeddings since it's a special character and will not be converted to vector data.

Using OpenAI Embeddings

The OpenAI API requires an API key for access. This API key can be generated by signing up at https://platform.openai.com/signup. Once signed up, click on Personal in the top-right corner and click View API keys.

This API key has to be passed to the pipeline so that it can call the OpenAI API to fetch the data embeddings.

Pre Setups

Now that we know how we are going to implement the kNN index, let's start with the basic setup. We will override the _doc endpoint for the amazon_reviews index.

The _doc endpoint is the endpoint at which ElasticSearch/OpenSearch accepts indexing requests.

The file will be defined in the following way:

enabled: true
description: Index pipeline to store vectorized data

routes:
  - path: /amazon_reviews/_doc
    method: POST
    classify:
      category: elasticsearch
      acl: index

envs:
  openAIApiKey: <your-api-key>
  method: POST

Environment Variables

We are passing the OpenAI API key through envs so that it can be used in any stage that needs it. This is the openAIApiKey variable.

Stages

Now that we have the basic pipeline defined, let's get started with the stages. We will have a few pre-built stages and some custom stages in this pipeline.

Pre-built stages are provided by ReactiveSearch to utilize functions from the ReactiveSearch API, like hitting ElasticSearch or translating an RS query to an ES query.

We will have the following stages defined:

  1. authorization
  2. fetch embeddings
  3. index data

Authorization

This is one of the most important steps in the pipeline. This stage makes sure the user is passing proper credentials for the endpoint they are trying to access.

This is a pre-built stage provided by ReactiveSearch and can be leveraged in the following way:

- id: "authorize user"
  use: "authorization"

Fetch Embeddings

Now that we have authorized the user making the request, we can fetch the embeddings for the request body that was passed and update the body with them. This can be done simply by using the pre-built stage openAIEmbeddingsIndex.

- id: fetch embeddings
  use: openAIEmbeddingsIndex
  inputs:
    apiKey: "{{openAIApiKey}}"
    inputKeys:
    - Summary
    - Text
    outputKey: vector_data
  continueOnError: false

This is a stage provided by ReactiveSearch for OpenAI-specific usage. It's very easy to use and takes care of reading from the request body, fetching the embeddings using the OpenAI API, and updating the request body accordingly.

Read more about this stage here

In the above stage, we pass the apiKey input by reading it dynamically from the envs defined at the top of the pipeline.

Besides that, there are two more inputs specified.

inputKeys indicates which keys from the request body the embeddings should be fetched for. In our example, as stated above, we use the Summary and Text keys, so the inputKeys array contains those two. These two keys will be extracted, joined using a comma ,, and then passed to the OpenAI API in order to get the vector embedding for them.

outputKey indicates the key where the output will be written. In simple words, this is the key that will be injected into the request body with the vector data fetched from OpenAI.

In this example, it is set to vector_data, since in the mappings we have defined the vector field as vector_data. This can be found in the Index Requirements section of this how-to doc.

Index Data

Now that we have the vector data ready and merged into the request body, we can send the index request to OpenSearch. This can be done by using the pre-built stage elasticsearchQuery.

- id: index data
  use: elasticsearchQuery
  needs:
    - fetch embeddings

Complete Pipeline

The complete pipeline is defined as follows:

enabled: true
description: Index pipeline to store vectorized data

routes:
  - path: /amazon_reviews/_doc
    method: POST
    classify:
      category: elasticsearch
      acl: index

envs:
  openAIApiKey: <your-api-key>
  method: POST

stages:
- id: authorize user
  use: authorization
- id: fetch embeddings
  use: openAIEmbeddingsIndex
  inputs:
    apiKey: "{{openAIApiKey}}"
    inputKeys:
    - Summary
    - Text
    outputKey: vector_data
  continueOnError: false
- id: index data
  use: elasticsearchQuery
  needs:
  - fetch embeddings

Create the pipeline

Now that we have the whole pipeline defined, we can create it by hitting the ReactiveSearch instance.

The URL we will hit is: /_pipeline with a POST request.

The above endpoint expects a multipart/form-data body with the pipeline key containing the pipeline file. All scriptRef files can be passed as separate keys in the form data and will be parsed by the API automatically. Read more about this endpoint here

We can create the pipeline with the following request:

The below request assumes that all the files mentioned in this guide are present in the current directory.

curl -X POST 'CLUSTER_ID/_pipeline' -H "Content-Type: multipart/form-data" --form "pipeline=@pipeline.yaml"

Testing the Pipeline

We can now hit the index endpoint for amazon_reviews and see if the data is getting converted to a vector.

curl -X POST CLUSTER_ID/amazon_reviews/_doc -H "Content-Type: application/json" -d '{"Summary": "dog food", "Text": "good food for my dog"}'