How to integrate pgvector’s Docker image with Langchain?

Introduction

What's up everyone! This blog is a tutorial on how to integrate pgvector's Docker image with a LangChain project and use it as a vector database. For this tutorial, I am using Google's embedding model to embed the data and the Gemini-1.5-flash model to generate the response. The blog walks you through all the important files required for this purpose.



Step 1: Set up pgvector’s docker image

Create a docker-compose.yml file that declares pgvector's Docker image and passes all the parameters required to set it up.

services:
  db:
    image: pgvector/pgvector:pg16
    restart: always
    env_file:
      - pgvector.env
    ports:
      - "5432:5432"
    volumes:
      - pg_data:/var/lib/postgresql/data

volumes:
  pg_data:

By default, Postgres (and therefore pgvector) listens on port 5432 inside the container, and the compose file maps it to the same port on the local machine. You can change the host side of the ports mapping if 5432 is already taken. Similarly, the name of the volume can be changed as required.

Next, create a pgvector.env file listing all the environment variables required by the Docker image.

POSTGRES_USER=pgvector_user
POSTGRES_PASSWORD=pgvector_passwd
POSTGRES_DB=pgvector_db

Again, you can give these variables any values you like. Keep in mind that once the volume is created, the database can only be accessed with these credentials, unless you delete the existing volume and create a new one.
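If you want to confirm the container is reachable before wiring it into LangChain, start it with docker compose up -d and run a quick connection test. Below is a minimal sketch using psycopg (which we install in the next step anyway); the host, port, and credentials are the ones assumed from the files above, so adjust them if you changed anything.

# A minimal connectivity check (optional), assuming the credentials
# from pgvector.env and the default 5432:5432 port mapping above.
import psycopg

conn_info = "host=localhost port=5432 dbname=pgvector_db user=pgvector_user password=pgvector_passwd"

with psycopg.connect(conn_info) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone()[0])  # Prints the Postgres version if the container is up.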

This brings us to the end of the first step, setting up pgvector's Docker image.



Step 2: Function to get Vector DB’s instance

Create a db.py file. It will contain a function that declares the pgvector instance used to work with the database.

Following are the required dependencies, each of which can be installed using pip install [dependency]:

python-dotenv
langchain_google_genai
langchain_postgres
psycopg
psycopg[binary]

Once the dependencies are installed, here is the db.py file.

import os
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_postgres import PGVector

load_dotenv()

def get_vector_store():
    # Get Gemini API Keys from environment variables.
    gemini_api_key = os.getenv("GEMINI_API_KEY")
    # Get DB credentials from environment variables.
    postgres_collection = os.getenv("POSTGRES_COLLECTION")
    postgres_connection_string = os.getenv("POSTGRES_CONNECTION_STRING")

    # Initiate Gemini's embedding model.
    embedding_model = GoogleGenerativeAIEmbeddings(
                        model="models/embedding-001", 
                        google_api_key=gemini_api_key
                    )

    # Initiate pgvector by passing the environment variables and embeddings model.
    vector_store = PGVector(
                    embeddings=embedding_model,
                    collection_name=postgres_collection,
                    connection=postgres_connection_string,
                    use_jsonb=True,
                )

    return vector_store

Before explaining this file, we need one more file, .env, to store the environment variables for the project. Therefore, create a .env file.

GEMINI_API_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXX

POSTGRES_COLLECTION=pgvector_documents
POSTGRES_CONNECTION_STRING=postgresql+psycopg://pgvector_user:pgvector_passwd@localhost:5432/pgvector_db

This file contains the API key for your LLM. For POSTGRES_COLLECTION, you can again use any value you like. The POSTGRES_CONNECTION_STRING, however, is built from the credentials listed in pgvector.env. It is structured as postgresql+psycopg://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{host on which the DB is serving}/{POSTGRES_DB}. For the host, we use localhost:5432, since we are running the Docker image and have mapped port 5432 of the local machine to port 5432 of the container.
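As an illustration of how the connection string maps to the values in pgvector.env, here is a small sketch that assembles it in Python. The variable values below are simply the ones assumed in this tutorial; db.py does not need this, it reads the ready-made string from .env.

# Illustration only: how the connection string is assembled from
# the credentials declared in pgvector.env (values assumed above).
user = "pgvector_user"
password = "pgvector_passwd"
host = "localhost"
port = 5432
db = "pgvector_db"

connection_string = f"postgresql+psycopg://{user}:{password}@{host}:{port}/{db}"
print(connection_string)
# postgresql+psycopg://pgvector_user:pgvector_passwd@localhost:5432/pgvector_db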

Once the .env file is set up, let me walk you through the logic of db.py's get_vector_store function. First, we read the values of the variables declared in the .env file. Second, we declare an instance of the embedding model; here I am using Google's embedding model, but you can use any one. Lastly, we declare an instance of PGVector by passing it the embedding model instance, the collection name, and the connection string, and we return this PGVector instance. This brings us to the end of step 2.



Step 3: Main file

In this step we create the main file, app.py. It reads content from the document and stores it in the vector DB; then, when the user passes a query, it fetches the relevant chunks of data from the vector DB, passes the query along with those chunks to the LLM, and prints the generated response. Following are the imports required for this file.

# app.py
import PyPDF2
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from db import get_vector_store
from langchain.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
import os
from dotenv import load_dotenv

load_dotenv()

Following are the dependencies required for this file, in addition to those required for db.py.

PyPDF2  # Only if you plan to extract data from a PDF.
langchain
langchain_text_splitters
langchain_google_genai

This file has 4 functions, 2 of which are the most important for this tutorial: store_data and get_relevant_chunk. Here is a detailed explanation of those two functions, followed by a brief explanation of the other 2.



store_data

# app.py
def store_data(data):
    # Step 1: Converting the data into Document type.
    document_data = Document(page_content=data)

    # Step 2: Splitting the data into chunks.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

    documents = text_splitter.split_documents([document_data])

    # Step 3: Get the instance of vector db and store the data.
    vector_db = get_vector_store()

    if not vector_db: return

    vector_db.add_documents(documents)

Purpose
The purpose of this function is to store the received data in the vector DB. Let's understand the steps involved in doing so.

Logic

  • Step 1. The data it receives is a string, so it is converted into a Document. This Document class is imported from langchain.schema.

  • Step 2. Once the data is cast into the required type, the next step is to split it into several chunks. To split the data, we use RecursiveCharacterTextSplitter from langchain_text_splitters. The splitter is set with a chunk_size of 1000, which means each chunk will contain at most 1000 characters. The chunk_overlap of 100 means that the last 100 characters of one chunk are repeated at the start of the next, which makes sure each chunk keeps proper context when passed to the LLM. Using this splitter, the data is split into chunks (see the short sketch after this list).

  • Step 3. The next step is to get the vector DB instance using the get_vector_store function from db.py. Finally, the vector DB's add_documents method is called with the split data to store it in the vector DB.
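To make the chunk_size and chunk_overlap behaviour easier to picture, here is a tiny sketch with made-up numbers (a chunk size of 20 and an overlap of 5) so the effect is visible on a short string; it is for illustration only and is not part of app.py.

# Illustration only: a tiny splitter so the chunking is easy to observe.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=5)
chunks = splitter.split_text("pgvector stores embeddings inside PostgreSQL tables.")

for chunk in chunks:
    print(repr(chunk))
# Each printed chunk is at most 20 characters long, and consecutive
# chunks may share up to 5 characters of overlapping text.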



get_relevant_chunk

def get_relevant_chunk(user_query):
    # Step 1: Get the instance of vector db.
    vector_db = get_vector_store()

    # Step 2: Get the relevant chunk of data.
    if not vector_db: return

    documents = vector_db.similarity_search(user_query, k=2)

    # Step 3: Convert the data from array type to string type and return it.
    relevant_chunk = " ".join([d.page_content for d in documents])

    return relevant_chunk

Purpose
The purpose of this function is to get the relevant chunks of data from the data stored in the vector DB, using the user's query.

Logic

  • Step 1. Get the instance of vector db using get_vector_store function of db.py.

  • Step 2. Call the vector DB's similarity_search method, passing the user_query and setting k. Here k stands for the number of chunks required; if it is set to 2, the method returns the 2 most relevant chunks of data based on the user's query. It can be set according to the project's requirements (see the sketch after this list for a variant that also returns scores).

  • Step 3. The relevant chunks received from the vector DB come back as a list of Document objects. To combine them into one string, Python's .join method is used. Finally, the relevant data is returned.
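If you also want to see how closely each chunk matched the query, PGVector exposes similarity_search_with_score in addition to similarity_search. This short sketch is optional and not part of app.py; the query string is just an example.

# Optional: inspect the distance score of each retrieved chunk.
from db import get_vector_store

vector_db = get_vector_store()
results = vector_db.similarity_search_with_score("Where does Dev currently work?", k=2)

for document, score in results:
    # With the default distance strategy, a lower score means a closer match.
    print(score, document.page_content[:80])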

Perfect. Those were the two most important functions in this file for this tutorial. Following are the other two functions; since this tutorial is about pgvector and the vector DB, I will only walk you through their logic briefly.

def get_document_content(document_path):
    pdf_text = ""

    # Load the document.
    with open(document_path, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)

        # Read and return the document content.
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            pdf_text += page.extract_text()

    return pdf_text 

def prompt_llm(user_query, relevant_data):
    # Initiate a prompt template.
    prompt_template = PromptTemplate(
        input_variables=["user_query", "relevant_data"], 
        template= """
        You are a knowledgeable assistant trained to answer questions based on specific content provided to you. Below is the content you should use to respond, followed by a user's question. Do not include information outside the given content. If the question cannot be answered based on the provided content, respond with "I am not trained to answer this."

        Content: {relevant_data}

        User's Question: {user_query}
        """
        )

    # Initiate LLM instance.
    gemini_api_key = os.getenv("GEMINI_API_KEY")

    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=gemini_api_key)

    # Chain the template and LLM
    chain = prompt_template | llm

    # Invoke the chain by passing the input variables of prompt template.
    response = chain.invoke({
        "user_query":user_query,
        "relevant_data": relevant_data
        })

    # Return the generated response.
    return response.content

The first function, get_document_content, takes the path to a PDF, opens it, reads it, and returns its content as a string.

The second function, prompt_llm, accepts the user's query and the relevant chunk of data. It initiates a prompt template that lists the instructions for the LLM and takes the user's query and the relevant chunk as input variables. It then initiates an instance of the LLM by passing the required parameters, chains the LLM with the prompt template, invokes the chain with the values of the prompt template's input variables, and finally returns the generated response.

Finally, once these 4 utility functions are declared, we add the main block of this file, which calls the utility functions to perform the required operations.

if __name__ == "__main__":
    # Get document content.
    document_content = get_document_content("resume.pdf")

    # Store the data in vector db.
    store_data(document_content)

    # Declare a variable having user's query.
    user_query = "Where does Dev currently works at?"

    # Get relevant chunk of data for solving the query.
    relevant_chunk = get_relevant_chunk(user_query)

    # Prompt LLM to generate the response.
    generated_response = prompt_llm(user_query, relevant_chunk)

    # Print the generated response.
    print(generated_response)

For this tutorial, I am using my resume's PDF as the data. The main block first reads this PDF into a string, stores the data in the vector DB, declares a variable containing the user's query, gets the relevant chunk of data by passing that query, gets the generated response by passing the user's query and the relevant chunk, and finally prints the LLM's response.

To get a more detailed walkthrough of this tutorial, check out this video.




Final Words

This was the tutorial on how to integrate pgvector's Docker image with a LangChain project and use it as a vector DB. Please let me know your feedback or any questions you have; I will be happy to answer.


