Chatbots and Retrieval Augmented Generation (RAG)
=================================================

In this module, we provide an introduction to chatbots and Retrieval Augmented Generation (RAG),
a method for improving a model's ability to provide context-specific and accurate information
about topics it may not have seen in its training distribution.

Inference Servers for Transformers
-----------------------------------

Before we begin our discussion of chatbots, we first revisit the concept of an inference server.
Recall that an inference server wraps a model in an HTTP interface to enable modularity.
Inference servers have become very popular with transformers, especially with very large models
that require specialized hardware. Inference servers also provide a mechanism for companies to
sell products built around their models without having to release the models themselves.

The "OpenAI-Compatible" HTTP API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The company OpenAI, makers of ChatGPT, was one of the first to release inference services as an
HTTP API, and the popularity of its products has led the industry to adopt that API as a de facto
standard for LLM inference. In other words, an "OpenAI-compatible" API is an HTTP API that mimics
the URL paths, POST body parameters, query parameters, and structured responses of the official
OpenAI API.

The primary endpoints of an OpenAI-compatible HTTP API are as follows:

* ``/v1/chat/completions``: For conversational AI interactions. To send a new chat, POST a JSON
  request with parameters including:

  1. ``messages``, as an array, with each message a JSON dictionary containing a ``content`` and
     a ``role`` attribute. Note that, according to the OpenAI Documentation
     (see `Message roles and instruction following `_) the role attribute is used to provide
     instructions to the model with different levels of authority.
  2. ``model``, as a string, representing the id of the model to use.
  3. (Optional) ``temperature``, as a float, controlling the randomness of the generated response;
     typically between 0 and 2, with lower values giving more deterministic output.
  4. Additional optional arguments.

  The server will send a JSON reply with the following fields:

  1. ``id``, as a string, a unique identifier for the request.
  2. ``object``, as a string, representing the type of object returned (usually ``chat.completion``).
  3. ``created``, as an int (Unix epoch), for the time when the response was generated.
  4. ``choices``, as an array, where each object represents a possible response generated by the
     model. Each object in ``choices`` has a ``message`` object whose ``content`` field holds the
     actual text of the reply.
  5. ``usage``, as an object, containing information about the tokens used in the request.

* ``/v1/embeddings``: For generating vector representations of text.
* ``/v1/models``: For listing available models.

The Ollama Project
~~~~~~~~~~~~~~~~~~

The `Ollama `_ project is an open source tool that simplifies running large language models (LLMs)
like Llama 3 and Mistral directly on a local machine. It provides a user-friendly interface, a
command-line interface, and an HTTP API for downloading, managing, and interacting with models.
Moreover, its HTTP API is OpenAI-compatible in the sense above, meaning that if you write code to
interact with models deployed with Ollama, it should be relatively easy to swap them out for other
models.

Deploying Ollama Locally
~~~~~~~~~~~~~~~~~~~~~~~~~

It is easy to deploy an Ollama instance on a local machine with Docker. The following
``docker-compose.yml`` file can be used to start Ollama using the official image and mount a
volume to persist the model files it downloads.

.. code-block:: yaml

   # docker-compose.yml
   services:
     ollama:
       image: ollama/ollama
       container_name: ollama
       ports:
         - "11434:11434"
       volumes:
         - ollama:/root/.ollama
       restart: unless-stopped

   volumes:
     ollama:

With that file in place, simply issue the command ``docker compose up -d`` to run the Ollama
container in the background.

Downloading Models with Ollama
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before you can use a model via Ollama, you need to download its assets. The easiest way to do
this is to exec into the ``ollama`` container and issue the ``ollama pull <model>`` command. In
the commands below, we pull the Llama 3 model, a popular, small open-weight model from Meta, and
the ``nomic-embed-text`` model, a good choice for language embeddings, which we will use later.
To explore the models supported by Ollama, see the `Ollama Library `_.

.. code-block:: bash

   $ ollama pull llama3
   pulling manifest
   pulling 6a0746a1ec1a: 100% ▕█████████████████████████████████████▏ 4.7 GB
   pulling 4fa551d4f938: 100% ▕█████████████████████████████████████▏  12 KB
   pulling 8ab4849b038c: 100% ▕█████████████████████████████████████▏  254 B
   pulling 577073ffcc6c: 100% ▕█████████████████████████████████████▏  110 B
   pulling 3f8eb4da87fa: 100% ▕█████████████████████████████████████▏  485 B
   verifying sha256 digest
   writing manifest
   success

   $ ollama pull nomic-embed-text
   . . .

With the models pulled, we can now make direct requests to the running Ollama instance.

A Simple Chatbot with Ollama
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this section we build a very simple chatbot based on a local Ollama instance. Note that the
code for our chatbot would be almost identical if we wanted to replace Ollama with one of OpenAI's
GPT models or any other inference server that is OpenAI-compatible.

The basic architecture of our chatbot is very simple:

.. code-block:: python3

   while not done:
       query = get_user_input()
       reply = get_model_reply(query)
       print(reply)

The key part is calling the model with the user's input and getting the reply.
For that, we need to make an HTTP request to our Ollama server running locally and get the
message content out of the reply. We will use the Python ``requests`` library. Here is what an
example function looks like:

.. code-block:: python

   import requests

   BASE_URL = "http://172.17.0.1:11434"  # Point to a local Ollama instance
   CHAT_MODEL = "llama3"

   def generate_answer(query):
       """
       Call the chat completion endpoint using only requests and
       return the full JSON response.
       """
       url = f"{BASE_URL}/v1/chat/completions"
       data = {
           "model": CHAT_MODEL,
           "messages": [
               {"role": "system", "content": "You are a helpful assistant."},
               {
                   "role": "user",
                   "content": f"Answer the following question. Question: {query}",
               },
           ],
       }
       r = requests.post(url, json=data)
       r.raise_for_status()
       return r.json()

A typical response to an input such as *What is tapis?* will look like:

.. code-block:: python3

   {
       "id": "chatcmpl-644",
       "object": "chat.completion",
       "created": 1764646602,
       "model": "llama3",
       "system_fingerprint": "fp_ollama",
       "choices": [
           {
               "index": 0,
               "message": {
                   "role": "assistant",
                   "content": "A interesting question!\n\nTapis is an old French word that refers to a type of carpet or tapestry, typically made of wool or silk. The term \"tapis\" is often used interchangeably with \"tapestry\", although some historians and art enthusiasts make a distinction between the two.\n\nIn modern times, the term \"tapis\" might evoke images of luxurious Oriental rugs or intricate wall hangings adorning fine homes. Historically speaking, tapis have been crafted for centuries across various cultures to serve as adornments for palaces, churches, and other grand spaces.\n\nAre you looking to learn more about textiles, art, or perhaps interior design? I'm here to help!"
               },
               "finish_reason": "stop"
           }
       ],
       "usage": {
           "prompt_tokens": 33,
           "completion_tokens": 138,
           "total_tokens": 171
       }
   }

We see that ``response["choices"][0]["message"]["content"]`` contains the reply we want to send
to the user. And with that we can now complete our first chatbot:

.. code-block:: python

   import requests

   BASE_URL = "http://172.17.0.1:11434"  # Point to a local Ollama instance
   CHAT_MODEL = "llama3"
   API_KEY = "ollama"  # Placeholder: Ollama ignores the key; hosted services need a real one

   def generate_answer(query):
       """
       Call the chat completion endpoint using only requests and
       return the text of the model's reply.
       """
       url = f"{BASE_URL}/v1/chat/completions"
       headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
       data = {
           "model": CHAT_MODEL,
           "messages": [
               {"role": "system", "content": "You are a helpful assistant."},
               {
                   "role": "user",
                   "content": f"Answer the following question. Question: {query}",
               },
           ],
       }
       r = requests.post(url, headers=headers, json=data)
       r.raise_for_status()
       return r.json()["choices"][0]["message"]["content"]

   def main():
       while True:
           query = input("\nQuery (or 'quit'): ")
           if query.lower() == "quit":
               break
           answer = generate_answer(query)
           print("\nAnswer:")
           print(answer)

   if __name__ == "__main__":
       main()

Retrieval-Augmented Generation
------------------------------

We now have a completely functional chatbot that can respond to questions with answers. However,
the performance of our chatbot will typically not be good when asking about topics the model
didn't see in training. RAG addresses this by retrieving documents relevant to the user's question
and passing them to the model as context alongside the question.

High-Level Algorithm
~~~~~~~~~~~~~~~~~~~~

The high-level implementation for a RAG application consists of two separate processes:

1. A process for embedding documents related to the topics of interest for your application.
2. The actual chatbot, which will utilize the document embeddings in addition to the AI model.

Typically, processes 1 and 2 execute independently. For instance, you could be pulling documents
from a private company database and embedding them on a schedule (say, every night or every hour)
while the chatbot runs continuously with the most recent versions of the document embeddings.

Each process is straightforward to implement. First, the embeddings:

.. code-block:: python3

   my_documents = [
       "some interesting fact 1",
       "some interesting fact 2",
       # . . .
   ]

   doc_embeddings = [compute_embedding(doc) for doc in my_documents]

And the chatbot:

.. code-block:: python3

   while not done:
       query = get_user_input()
       query_embedding = compute_embedding(query)
       best_docs = get_most_similar_docs(query_embedding)
       reply = get_model_reply(best_docs, query)
       print(reply)

Computing the Embedding and Similarity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To compute the embedding, we'll use the ``/v1/embeddings`` endpoint and an embedding model, such
as ``"nomic-embed-text"``.

.. code-block:: python3

   import numpy as np
   import requests

   BASE_URL = "http://172.17.0.1:11434"  # Point to a local Ollama instance
   EMBEDDING_MODEL = "nomic-embed-text"

   def compute_embedding(text):
       """Get embeddings using only requests."""
       url = f"{BASE_URL}/v1/embeddings"
       data = {"input": text, "model": EMBEDDING_MODEL}
       r = requests.post(url, json=data)
       r.raise_for_status()
       # The response holds one entry in "data" per input string
       return np.array(r.json()["data"][0]["embedding"])

For similarity, we'll use cosine similarity, which measures how similar two vectors are via the
cosine of the angle between them: vectors pointing in the same direction have a cosine of 1,
orthogonal vectors have a cosine of 0, and vectors pointing in opposite directions have a cosine
of -1. The formula is given by:

.. math::

   \text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \, \|B\|}

This can be easily implemented with numpy:

.. code-block:: python3

   import numpy as np

   def cosine_similarity(a, b):
       return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Now, we can easily compute the similarity between the user's query and our documents:

.. code-block:: python3

   scores = [cosine_similarity(query_embedding, d) for d in doc_embeddings]
   best_index = int(np.argmax(scores))
   best_doc = my_documents[best_index]

Putting it all Together: A first RAG App
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's pull everything together and create a chatbot that knows about our `Tapis API `_ project.

First, we define some documents and generate their embeddings:

.. code-block:: python3

   tapis_documents = [
       "Python code for generating a Tapis token: ```python\nfrom tapipy.tapis import Tapis\n\n# Replace with your Tapis tenant base URL, username, and password\nbase_url = 'https://your.tapis.io'\nusername = 'your_username'\npassword = 'your_password'\n\ntry:\n # Initialize the Tapis client\n t = Tapis(base_url=base_url, username=username, password=password)\n\n # Get the Tapis tokens\n t.get_tokens()\n\n # Print the access token\n print(\"Access Token:\", t.access_token.access_token)\n\nexcept Exception as e:\n print(f\"An error occurred: {e}\")\n",
       "Python code for listing Tapis systems: ```python\nfrom tapipy.tapis import Tapis\n\n t.systems.getSystems()",
       "Python code for listing Tapis apps: ```python\nfrom tapipy.tapis import Tapis\n\n t.apps.getApps()",
       "Python code for listing Tapis jobs: ```python\nfrom tapipy.tapis import Tapis\n\n t.jobs.getJobList()",
       "Python code for listing Tapis pods: ```python\nfrom tapipy.tapis import Tapis\n\n t.pods.list_pods()",
   ]

   doc_embeddings = [compute_embedding(doc) for doc in tapis_documents]

Next we implement the chatbot. The main chatbot loop becomes:

.. code-block:: python3

   def main():
       while True:
           query = input("\nQuery (or 'quit'): ")
           if query.lower() == "quit":
               break
           # compute the embedding of the user-provided query
           query_embedding = compute_embedding(query)

           # compute the similarity scores comparing the embedding of the
           # user's query to the embeddings of the documents
           scores = [cosine_similarity(query_embedding, d) for d in doc_embeddings]

           # get the document with the greatest similarity
           best_index = int(np.argmax(scores))
           best_doc = tapis_documents[best_index]
           print(f"-->Retrieved doc:{best_doc}")

           answer = generate_answer(best_doc, query)
           print("\nAnswer:")
           print(answer)

   if __name__ == "__main__":
       main()

Finally, we generate the LLM response using the best document as context. In a single script,
this function must be defined before ``main()`` is called:

.. code-block:: python

   import requests

   CHAT_MODEL = "llama3"
   BASE_URL = "http://172.17.0.1:11434"  # Point to a local Ollama instance
   API_KEY = "ollama"  # Placeholder: Ollama ignores the key; hosted services need a real one

   def generate_answer(context, question):
       """
       Call the chat completion endpoint using only requests, passing the
       retrieved document as context, and return the text of the reply.
       """
       url = f"{BASE_URL}/v1/chat/completions"
       headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
       data = {
           "model": CHAT_MODEL,
           "messages": [
               {"role": "system", "content": "You are a helpful assistant."},
               {
                   "role": "user",
                   "content": f"Use the provided context to answer.\n\nContext:\n{context}\n\nQuestion: {question}",
               },
           ],
       }
       r = requests.post(url, headers=headers, json=data)
       r.raise_for_status()
       return r.json()["choices"][0]["message"]["content"]
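
The high-level pseudocode earlier refers to ``best_docs`` and a ``get_most_similar_docs`` helper,
while the app above keeps only the single best match. The following is a minimal sketch of one
way such a helper could return the ``k`` most similar documents. The function signature, the
``k`` parameter, and the toy two-dimensional vectors (standing in for real embeddings so the
example runs without an Ollama server) are all assumptions made for illustration.

.. code-block:: python3

   import numpy as np

   def cosine_similarity(a, b):
       return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

   def get_most_similar_docs(query_embedding, doc_embeddings, documents, k=2):
       """Return the k documents most similar to the query, best match first."""
       scores = np.array([cosine_similarity(query_embedding, d) for d in doc_embeddings])
       # argsort sorts ascending, so reverse it and keep the first k indices
       top_indices = np.argsort(scores)[::-1][:k]
       return [documents[i] for i in top_indices]

   # Toy vectors standing in for real embeddings (hypothetical data)
   toy_documents = ["doc about systems", "doc about apps", "doc about jobs"]
   toy_embeddings = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
   toy_query = np.array([0.9, 0.1])

   best_docs = get_most_similar_docs(toy_query, toy_embeddings, toy_documents, k=2)
   # Join the retrieved documents into a single context string
   context = "\n\n".join(best_docs)
   print(best_docs)

The joined ``context`` string could then be passed to ``generate_answer(context, query)`` in
place of ``best_doc``. Passing more documents gives the model more material to draw on, at the
cost of a longer prompt.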