Chatbots and Retrieval Augmented Generation (RAG)

In this module, we provide an introduction to chatbots and Retrieval Augmented Generation (RAG), a method for improving a model’s ability to provide more context-specific and accurate information about topics it may not have seen in its training distribution.

Inference Servers for Transformers

Before we begin our discussion of chatbots, we first revisit the concept of an inference server. Recall that an inference server wraps a model in an HTTP interface to enable modularity. Inference servers have become very popular with transformers, especially with very large models that require specialized hardware. Inference servers also provide a mechanism for companies to sell products built around their models without having to release the models themselves.

The “OpenAI-Compatible” HTTP API

OpenAI, the maker of ChatGPT, was one of the first companies to offer inference as an HTTP API, and the popularity of its products has led the industry to adopt that API as a de facto standard for LLM inference. In other words, an “OpenAI-compatible” API is an HTTP API that mimics the URL paths, POST body parameters, query parameters, and structured responses of the official OpenAI API.

The primary endpoints of an OpenAI-compatible HTTP API are as follows:

/v1/chat/completions: For conversational AI interactions. To send a new chat, POST a JSON request body with parameters including:
1. messages, as an array, where each message is a JSON object with a role and a content attribute. Note that, according to the OpenAI Documentation (see Message roles and instruction following), the role attribute is used to provide instructions to the model with different levels of authority.
2. model, as a string, representing the id of the model to use.
3. (Optional) temperature, as a float, controlling the randomness of the generated response; typically between 0 and 2, with lower values producing more deterministic output.
4. Additional optional arguments.
The server replies with a JSON object containing the following fields:
1. id, as a string, a unique identifier for the request.
2. object, as a string, representing the type of object returned (usually chat.completion).
3. created, as an int, the Unix epoch time at which the response was generated.
4. choices, as an array, where each object represents a possible response generated by the model. Each object in choices contains a message object, whose content attribute holds the actual text of the reply.
5. usage, as an object, containing information about the tokens used in the request.
/v1/embeddings: For generating vector representations of text.
/v1/models: For listing available models.
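
To make the request and response shapes above concrete, the following minimal sketch builds a chat completions request body and pulls the reply text and token count out of a response. The helper names build_chat_request and extract_reply are hypothetical (they are not part of any library), and the sample response is hand-written for illustration rather than captured from a server.

```python
def build_chat_request(model, user_text, system_text="You are a helpful assistant.", temperature=None):
    """Build the JSON body for POST /v1/chat/completions (hypothetical helper)."""
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
    }
    # temperature is optional, so only include it when the caller sets it
    if temperature is not None:
        body["temperature"] = temperature
    return body


def extract_reply(response):
    """Return (reply_text, total_tokens) from a chat completion response dict."""
    first_choice = response["choices"][0]
    reply_text = first_choice["message"]["content"]
    # Treat usage as optional here and fall back to None if it is missing
    total_tokens = response.get("usage", {}).get("total_tokens")
    return reply_text, total_tokens


# Hand-written sample response containing the fields described above
sample_response = {
    "id": "chatcmpl-1",
    "object": "chat.completion",
    "created": 1700000000,
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Hello!"}, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7},
}

request_body = build_chat_request("llama3", "Say hello.", temperature=0.2)
reply, tokens = extract_reply(sample_response)
print(request_body["messages"][1]["content"])  # Say hello.
print(reply, tokens)  # Hello! 7
```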

The Ollama Project

The Ollama project is an open source tool that simplifies running large language models (LLMs) like Llama 3 and Mistral directly on a local machine. It provides a user-friendly interface, a command-line interface, and an HTTP API for downloading, managing, and interacting with models. Moreover, its HTTP API is OpenAI-compatible in the sense above, meaning that if you write code to interact with models deployed with Ollama, it should be relatively easy to swap them out for other models.

Deploying Ollama Locally

It is easy to deploy an Ollama instance on a local machine with Docker. The following docker-compose.yml file can be used to start Ollama using the official image and mount a volume to persist the model cache files it downloads.

# docker-compose.yml

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    restart: unless-stopped
volumes:
  ollama:

With that file in place, simply issue the command docker compose up -d to run the Ollama container in the background.

Downloading Models with Ollama

Before you can use a model via Ollama, you need to download its assets. The easiest way to do this is to exec into the ollama container (for example, with docker exec -it ollama bash) and issue the ollama pull <model_id> command. In the commands below, we pull the Llama 3 model, a popular, small open-weight model from Meta, and the nomic-embed-text model, a good choice for language embeddings, which we will use later. To explore the models supported by Ollama, see the Ollama Library.

$ ollama pull llama3
  pulling manifest
  pulling 6a0746a1ec1a: 100% ▕█████████████████████████████████████▏ 4.7 GB
  pulling 4fa551d4f938: 100% ▕█████████████████████████████████████▏  12 KB
  pulling 8ab4849b038c: 100% ▕█████████████████████████████████████▏  254 B
  pulling 577073ffcc6c: 100% ▕█████████████████████████████████████▏  110 B
  pulling 3f8eb4da87fa: 100% ▕█████████████████████████████████████▏  485 B
  verifying sha256 digest
  writing manifest
  success

$ ollama pull nomic-embed-text
  . . .

With the models pulled, we can now make direct requests to the running Ollama instance.
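
As a first direct request, one option is to ask the server which models it has available and confirm that the pulled models appear. The sketch below assumes that the /v1/models response follows the OpenAI list format (a data array of objects with an id field), that ids may carry a tag suffix such as llama3:latest, and that Ollama is reachable at http://localhost:11434, the port published in the docker-compose.yml file. The helper names are hypothetical, and urllib from the standard library is used so that the sketch has no third-party dependencies.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # Assumption: port published by docker-compose.yml


def fetch_models(base_url=BASE_URL):
    """GET /v1/models and return the parsed JSON payload."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return json.load(resp)


def missing_models(models_payload, required):
    """Return the names in `required` that the server did not list.

    Ids may carry a tag suffix (for example "llama3:latest"), so only the
    part before the first colon is compared.
    """
    available = {m["id"].split(":")[0] for m in models_payload.get("data", [])}
    return [name for name in required if name not in available]


# Offline illustration with a hand-written payload in the assumed format
example_payload = {
    "object": "list",
    "data": [{"id": "llama3:latest", "object": "model"}],
}
print(missing_models(example_payload, ["llama3", "nomic-embed-text"]))  # ['nomic-embed-text']
```

Against a running server, missing_models(fetch_models(), ["llama3", "nomic-embed-text"]) should return an empty list once both models have been pulled.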

A Simple Chatbot with Ollama

In this section we build a very simple chatbot based on a local Ollama instance. Note that the code for our chatbot would be almost identical if we wanted to replace Ollama with one of OpenAI’s GPT models or any other OpenAI-compatible inference server.
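
One way to keep the server swappable is to read the base URL, API key, and model name from configuration instead of hard-coding them. The sketch below is an assumption about how this could be organized, not part of the chatbot code in this module: the environment variable names (CHAT_BASE_URL, CHAT_API_KEY, CHAT_MODEL), the helper names, and the sk-example key are all hypothetical.

```python
import os


def load_chat_config(env=None):
    """Read connection settings, falling back to a local Ollama instance."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("CHAT_BASE_URL", "http://localhost:11434").rstrip("/"),
        # Ollama does not check the key; hosted services need a real one
        "api_key": env.get("CHAT_API_KEY", "ollama"),
        "model": env.get("CHAT_MODEL", "llama3"),
    }


def chat_endpoint(config):
    """URL of the chat completions endpoint for the configured server."""
    return f"{config['base_url']}/v1/chat/completions"


def auth_headers(config):
    """Headers carrying the API key as a bearer token."""
    return {"Authorization": f"Bearer {config['api_key']}"}


# Pointing the same code at a different server is then a configuration change
local = load_chat_config({})
hosted = load_chat_config({"CHAT_BASE_URL": "https://api.openai.com/", "CHAT_API_KEY": "sk-example"})
print(chat_endpoint(local))   # http://localhost:11434/v1/chat/completions
print(chat_endpoint(hosted))  # https://api.openai.com/v1/chat/completions
```

When switching servers, the model name usually has to change as well, since each server exposes its own set of model ids.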

The basic architecture of our chatbot is very simple:

while not done:
  query = get_user_input()
  reply = get_model_reply(query)
  print(reply)

The key part is calling the model with the user’s input and getting the reply. For that we will need to make an HTTP request to our Ollama server running locally and get the message content out of the reply. We will use the Python requests library. Here is what an example function looks like:

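import requests

BASE_URL = "http://172.17.0.1:11434"  # Point to a local Ollama instance
CHAT_MODEL = "llama3"


def generate_answer(query):
    """
    Call the chat completions endpoint using only requests and
    return the full JSON response as a dictionary.
    """
    url = f"{BASE_URL}/v1/chat/completions"
    data = {
        "model": CHAT_MODEL,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": f"Answer the following question. Question: {query}",
            },
        ],
    }
    # json=data serializes the body and sets the Content-Type header
    r = requests.post(url, json=data)
    # Raise an exception on 4xx/5xx responses
    r.raise_for_status()
    return r.json()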

A typical response to an input such as What is tapis? will look like:

{
  "id": "chatcmpl-644",
  "object": "chat.completion",
  "created": 1764646602,
  "model": "llama3",
  "system_fingerprint": "fp_ollama",
  "choices": [
      {
      "index": 0,
      "message": {
          "role": "assistant",
          "content": "A interesting question!\n\nTapis is an old French word that refers to a type of carpet or tapestry, typically made of wool or silk. The term \"tapis\" is often used interchangeably with \"tapestry\", although some historians and art enthusiasts make a distinction between the two.\n\nIn modern times, the term \"tapis\" might evoke images of luxurious Oriental rugs or intricate wall hangings adorning fine homes. Historically speaking, tapis have been crafted for centuries across various cultures to serve as adornments for palaces, churches, and other grand spaces.\n\nAre you looking to learn more about textiles, art, or perhaps interior design? I'm here to help!"
      },
      "finish_reason": "stop"
      }
  ],
  "usage": {
      "prompt_tokens": 33,
      "completion_tokens": 138,
      "total_tokens": 171
  }
}

We see that the response["choices"][0]["message"]["content"] contains the reply we want to send to the user.

And with that we can now complete our first chatbot:

import requests

BASE_URL = "http://172.17.0.1:11434"  # Point to a local Ollama instance
CHAT_MODEL = "llama3"
API_KEY = "ollama"  # Placeholder: Ollama ignores the key, hosted services need a real one


def generate_answer(query):
    """
    Call the chat completions endpoint using only requests and
    return the text of the model's reply.
    """
    url = f"{BASE_URL}/v1/chat/completions"
    headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
    data = {
        "model": CHAT_MODEL,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": f"Answer the following question. Question: {query}",
            },
        ],
    }
    r = requests.post(url, headers=headers, json=data)
    r.raise_for_status()
    # The reply text lives in the message of the first choice
    return r.json()["choices"][0]["message"]["content"]


def main():
    while True:
        query = input("\nQuery (or 'quit'): ")
        if query.strip().lower() == "quit":
            break
        answer = generate_answer(query)
        print("\nAnswer:")
        print(answer)


if __name__ == "__main__":
    main()

Retrieval-Augmented Generation

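We now have a completely functional chatbot that can respond to questions with answers. However, the chatbot will typically perform poorly when asked about topics the model didn’t see in training. The reply to What is tapis? above is an example: the model describes carpets and tapestries rather than the Tapis API project. Retrieval-Augmented Generation addresses this by retrieving documents relevant to the user’s query and passing them to the model as additional context alongside the question.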

High-Level Algorithm

The high-level implementation for a RAG application consists of two separate processes:

1. A process for embedding documents related to the topics of interest for your application.
2. The actual chatbot, which will utilize the document embeddings in addition to the AI model.

Typically, processes 1 and 2 execute independently. For instance, you could pull documents from a private company database and embed them on a schedule (say, every night or every hour) while the chatbot runs continuously using the most recent document embeddings.
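
Because the two processes run independently, they need a shared place to exchange the embeddings. A minimal sketch, assuming a single JSON file is an acceptable store for a small document set: the embedding process writes the file, and the chatbot reads it. The function names and the file layout are hypothetical, and larger applications would more likely use a database.

```python
import json
import os
import tempfile


def save_index(path, documents, embeddings):
    """Write documents and their embeddings to `path` as JSON.

    The data is written to a temporary file first and then renamed, so a
    chatbot reading `path` does not see a half-written file.
    """
    records = [
        {"text": doc, "embedding": [float(x) for x in emb]}
        for doc, emb in zip(documents, embeddings)
    ]
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(records, f)
    os.replace(tmp_path, path)  # the rename is atomic on POSIX systems


def load_index(path):
    """Return (documents, embeddings) previously written by save_index."""
    with open(path) as f:
        records = json.load(f)
    documents = [r["text"] for r in records]
    embeddings = [r["embedding"] for r in records]
    return documents, embeddings


# Round trip with toy two-dimensional "embeddings"
with tempfile.TemporaryDirectory() as workdir:
    index_path = os.path.join(workdir, "index.json")
    save_index(index_path, ["fact 1", "fact 2"], [[0.1, 0.2], [0.3, 0.4]])
    docs, embs = load_index(index_path)
    print(docs, embs)
```

The embeddings are converted to plain floats before saving, so the same function accepts either Python lists or NumPy arrays.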

Each process is straightforward to implement:

First the embeddings:

my_documents = [
  "some interesting fact 1",
  "some interesting fact 2",
  # . . .
]

doc_embeddings = [compute_embedding(doc) for doc in my_documents]

And the chatbot:

while not done:
    query = get_user_input()
    query_embedding = compute_embedding(query)
    best_docs = get_most_similar_docs(query_embedding)
    reply = get_model_reply(best_docs, query)
    print(reply)

Computing the Embedding and Similarity

To compute the embedding, we’ll use the /v1/embeddings endpoint and an embedding model, such as "nomic-embed-text".

import numpy as np
import requests

BASE_URL = "http://172.17.0.1:11434"  # Point to a local Ollama instance
EMBEDDING_MODEL = "nomic-embed-text"

def compute_embedding(text):
    """Get the embedding of a text as a NumPy array, using only requests."""
    url = f"{BASE_URL}/v1/embeddings"
    data = {"input": text, "model": EMBEDDING_MODEL}
    r = requests.post(url, json=data)
    r.raise_for_status()
    # We sent a single input, so the embedding is in the first entry of "data"
    return np.array(r.json()["data"][0]["embedding"])

For similarity, we’ll use cosine similarity, which measures how similar two vectors are via the cosine of the angle between them: vectors pointing in the same direction have a cosine of 1, orthogonal vectors have a cosine of 0, and vectors pointing in opposite directions have a cosine of -1. The formula is given by:

\[\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}\]

This can be easily implemented with NumPy:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
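
As a quick check of the formula, consider three pairs of two-dimensional vectors, chosen here purely for illustration: a pair pointing in the same direction, an orthogonal pair, and a pair pointing in opposite directions.

\[\frac{(1,0) \cdot (2,0)}{\|(1,0)\| \, \|(2,0)\|} = \frac{2}{1 \cdot 2} = 1, \qquad \frac{(1,0) \cdot (0,3)}{\|(1,0)\| \, \|(0,3)\|} = \frac{0}{1 \cdot 3} = 0, \qquad \frac{(1,0) \cdot (-1,0)}{\|(1,0)\| \, \|(-1,0)\|} = \frac{-1}{1 \cdot 1} = -1\]

Note that the result depends only on the direction of the vectors, not on their lengths: (1,0) and (2,0) have different norms but a similarity of exactly 1.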

Now, we can easily compute the similarity between the user’s query and our documents:

scores = [cosine_similarity(query_embedding, d) for d in doc_embeddings]
best_index = int(np.argmax(scores))
best_doc = my_documents[best_index]
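
The pseudocode earlier calls a get_most_similar_docs function that returns several documents, while the snippet above keeps only the single best one. A minimal sketch of a top-k variant follows. It is written in plain Python (no NumPy) so that it is self-contained, and its signature, including the k parameter, is an assumption, since the pseudocode does not specify one. It does not handle zero-length vectors.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (plain sequences)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def get_most_similar_docs(query_embedding, doc_embeddings, documents, k=2):
    """Return the k documents most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_embedding, emb), doc)
        for emb, doc in zip(doc_embeddings, documents)
    ]
    # nlargest returns the k highest-scoring pairs in descending order
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [doc for _, doc in top]


# Toy two-dimensional embeddings, invented for illustration
toy_documents = ["doc about tokens", "doc about systems", "doc about jobs"]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]
print(get_most_similar_docs(toy_query, toy_embeddings, toy_documents, k=2))
# ['doc about tokens', 'doc about jobs']
```

The selected documents could then be joined into a single string and passed to the model as context. Passing more documents gives the model more material to work with at the cost of a longer prompt.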

Putting It All Together: A First RAG App

Let’s pull everything together and create a chatbot that knows about our Tapis API project.

First, we define some documents and generate their embeddings:

tapis_documents = [
    "Python code for generating a Tapis token: ```python\nfrom tapipy.tapis import Tapis\n\n# Replace with your Tapis tenant base URL, username, and password\nbase_url = 'https://your.tapis.io'\nusername = 'your_username'\npassword = 'your_password'\n\ntry:\n    # Initialize the Tapis client\n    t = Tapis(base_url=base_url, username=username, password=password)\n\n    # Get the Tapis tokens\n    t.get_tokens()\n\n    # Print the access token\n    print(\"Access Token:\", t.access_token.access_token)\n\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\n",
    "Python code for listing Tapis systems: ```python\nfrom tapipy.tapis import Tapis\n\n t.systems.getSystems()",
    "Python code for listing Tapis apps: ```python\nfrom tapipy.tapis import Tapis\n\n t.apps.getApps()",
    "Python code for listing Tapis jobs: ```python\nfrom tapipy.tapis import Tapis\n\n t.jobs.getJobList()",
    "Python code for listing Tapis pods: ```python\nfrom tapipy.tapis import Tapis\n\n t.pods.list_pods()",
]

doc_embeddings = [compute_embedding(doc) for doc in tapis_documents]

Next we implement the chatbot. The main chatbot loop becomes:

def main():
    while True:
        query = input("\nQuery (or 'quit'): ")
        if query.lower() == "quit":
            break

        # compute the embedding of the user-provided query
        query_embedding = compute_embedding(query)

        # compute the similarity scores comparing the embedding of the user's query to the embeddings of the documents
        scores = [cosine_similarity(query_embedding, d) for d in doc_embeddings]

        # get the document with the greatest similarity
        best_index = int(np.argmax(scores))
        best_doc = tapis_documents[best_index]

        print(f"-->Retrieved doc:{best_doc}")
        answer = generate_answer(best_doc, query)
        print("\nAnswer:")
        print(answer)


if __name__ == "__main__":
    main()

Finally, generate_answer now takes the best document as context and includes it in the prompt sent to the LLM:

import requests

CHAT_MODEL = "llama3"
BASE_URL = "http://172.17.0.1:11434"  # Point to a local Ollama instance
API_KEY = "ollama"  # Placeholder: Ollama ignores the key, hosted services need a real one

def generate_answer(context, question):
    """
    Call the chat completions endpoint using only requests, passing the
    retrieved document as context, and return the text of the model's reply.
    """
    url = f"{BASE_URL}/v1/chat/completions"
    headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
    data = {
        "model": CHAT_MODEL,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                # The retrieved document is placed in the prompt ahead of the question
                "content": f"Use the provided context to answer.\n\nContext:\n{context}\n\nQuestion: {question}",
            },
        ],
    }
    r = requests.post(url, headers=headers, json=data)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]