Unveiling RAG: “Simple Local RAG”


Have you ever wondered how the chatbots you chat with online manage to sound so knowledgeable? It’s all thanks to large language models (LLMs), whiz-kid AI programs that can churn out text, translate languages, and even write different kinds of creative content. But here’s the thing: LLMs rely on the information they were trained on, which might not always be the freshest. To overcome this, you would normally have to fine-tune your model to update the LLM, which is hard work. That’s where RAG comes in!

RAG (Retrieval-Augmented Generation)

Think of RAG as a superpowered fact-checker for your friendly neighborhood LLM. RAG takes things a step further by allowing LLMs to access up-to-date information from trustworthy sources. This means you get the best of both worlds: the LLM’s ability to understand and respond to your questions in a natural way, combined with the accuracy and freshness of real-world knowledge.

Here’s a breakdown of the magic behind RAG:

  • Retrieval: When you ask a question, RAG first hunts for relevant information from reliable sources like databases or online articles.
  • Augmentation: Once RAG has a handful of facts, it uses them to guide the LLM in crafting its response. This ensures the answer is grounded in real information, not just what the LLM remembers from its training data.
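
To make these two steps concrete, here is a minimal conceptual sketch in Python. The retrieve and generate callables are hypothetical placeholders standing in for a vector-store search and an LLM call; the real implementation with ChromaDB and LangChain follows later in this post.

def answer_with_rag(question, retrieve, generate):
    # Conceptual RAG flow; retrieve and generate are caller-supplied stand-ins
    # for a vector-store search and an LLM call (hypothetical placeholders).

    # Retrieval: look up the chunks most relevant to the question
    relevant_chunks = retrieve(question, top_k=3)

    # Augmentation: put the retrieved facts into the prompt
    context = "\n".join(relevant_chunks)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # Generation: the LLM answers, grounded in the retrieved context
    return generate(prompt)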

So, what are the benefits of RAG?

  • Factual Accuracy: No more worrying about outdated or misleading information. RAG ensures your LLM stays on top of its game with the latest knowledge.
  • Improved Context: RAG helps LLMs understand the context of your questions better, leading to more relevant and informative answers.
  • Wider Applications: RAG opens doors for LLMs to be used in more areas that require factual accuracy, like chatbots for customer service or educational tools.

OK, but how do I implement such a tool in my company, and locally?

Great question! Here’s a simple RAG implementation you can run on your local machine with some basic Python knowledge.

The plan is to have three Python scripts: one for ingesting your documents, another for interacting with your chatbot, and a last one for tuning your RAG.

For this project, we’re going to use some Python libraries and frameworks, such as:

  • ChromaDB (vector store database): This stores document representations for efficient retrieval.
  • Langchain: This framework simplifies building retrieval-based NLP pipelines.
  • Transformers: This library provides pre-trained models for text processing and embeddings.
  • Embeddings: These convert text into numerical representations for efficient search in the vector store.
  • CharacterTextSplitter: This splits text into smaller character-based chunks that fit the embedding model’s input size.
  • Document loaders: These libraries help load and pre-process your documents.

First, the ingestion

To get your LLM to work with updated information, you would normally need to fine-tune it. In this case we take a shortcut instead: we “trick” the model into reading some of our data with an ingestion script, which converts chunks of text into numerical representations (vectors). This allows the LLM to find and use the information effectively.

import chromadb  # Import the ChromaDB vector store database
from langchain.embeddings import HuggingFaceEmbeddings  # Import HuggingFace embeddings
from langchain.document_loaders import TextLoader, PyPDFLoader  # Import document loaders
from langchain.text_splitter import CharacterTextSplitter  # Import text splitter
from langchain.vectorstores import Chroma  # Import vector store from Langchain
from llmsettings import data_path, embedding_model, llm_model, llm_config, persist_directory  # Import the settings from llmsettings.py (shown at the end of this post)

# Load a PDF document using PyPDFLoader
loader = PyPDFLoader(r"the_path_to_your_file")
documents = loader.load()

# Print basic information about the loaded documents (the page prints assume the PDF has at least two pages)
print(len(documents))
print(documents[0].page_content)
print(documents[1].page_content)

# Split the documents into smaller chunks (characters in this case)
text_splitter = CharacterTextSplitter(chunk_size=128, chunk_overlap=50)
texts = text_splitter.split_documents(documents)

# Create HuggingFace embeddings for the chunks
embeddings = HuggingFaceEmbeddings(model_name=embedding_model, model_kwargs={'device': 'cpu'})

# Create a Chroma vector database for persistent storage
# (a separate in-memory search instance is unnecessary; one indexed store is enough)
vectordb = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory=persist_directory,
)

# Write the database to disk (needed on older LangChain/Chroma versions; newer ones persist automatically)
vectordb.persist()

# Print the number of chunks stored in the vector database
print(f"Number of chunks stored in the VectorDB: {vectordb._collection.count()}")

Now let’s move on to the file we use to interact with our LLM

from langchain.vectorstores import Chroma  # Import Chroma vector store
from langchain.embeddings import HuggingFaceEmbeddings  # Import HuggingFace embeddings
from langchain.chains import RetrievalQA  # Import RetrievalQA chain
from langchain.llms import CTransformers  # Import CTransformers LLM
from llmsettings import data_path, embedding_model, llm_model, llm_config, persist_directory  # Import the settings from llmsettings.py (shown at the end of this post)


# Create HuggingFace embeddings for text processing
embedding = HuggingFaceEmbeddings(model_name=embedding_model, model_kwargs={'device': 'cpu'})

# Load Chroma vector database from persistent storage
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

# Optional debug line: uncomment to print how many chunks were loaded from the database
# print(f"Loaded {vectordb._collection.count()} chunks")

# Build a RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=CTransformers(  # Initialize LLM using CTransformers
        model=llm_model,
        model_type="llama",  # Assuming "llama" is the correct model type
        config=llm_config,
    ),
    chain_type="stuff",  # Placeholder chain type (might need adjustment)
    retriever=vectordb.as_retriever(search_kwargs={'k': 1}),  # Retriever using ChromaDB
)

# User input for the question
question = input("Ask the BOT: ")

# Pass the question to the RetrievalQA chain for processing
result = qa_chain({"query": question})

# Access and print the answer from the result
print(result["result"])

Just one more thing and you can use the LLM, but first let me explain the code:

  1. Import libraries: Import necessary modules for vector stores, embeddings, retrieval chain, LLM integration, and settings.
  2. Create embeddings: Create HuggingFace embeddings similar to the previous code.
  3. Load Chroma vector database: Load the Chroma vector database from the persistence directory using the previously created embeddings.
  4. Commented-out line: The commented line (# print(f"Loaded {vectordb._collection.count()} chunks")) is a debugging aid that prints the number of loaded chunks. You can uncomment it if needed.
  5. Build RetrievalQA chain: Construct a RetrievalQA chain. This chain combines the following elements:
    • LLM: This uses the CTransformers class to integrate the large language model specified by llm_model, along with the model_type and the configuration from llm_config.
    • Chain type: "stuff" is the simplest chain type; it stuffs the retrieved chunks directly into the LLM’s prompt.
    • Retriever: This part uses the vectordb.as_retriever method to create a retriever component from the Chroma vector database. The search_kwargs={'k': 1} argument specifies retrieving only the single most relevant document during the search.
  6. User input: Prompt the user to enter a question for the bot using input("Ask the BOT: ").
  7. Process question: Pass the user’s question wrapped in a dictionary ({"query": question}) to the qa_chain for processing.
  8. Print answer: Extract and print the answer from the processing result using result["result"] (a variation that also prints the retrieved source chunk is sketched below).
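
As an optional variation on the script above (not part of the original code), RetrievalQA can also return the chunks it retrieved, which is handy for checking what the answer was grounded in. The sketch below reuses the imports and settings from the interaction script and only adds the return_source_documents flag:

# Variation: build the chain so it also returns the retrieved source chunks
qa_chain = RetrievalQA.from_chain_type(
    llm=CTransformers(model=llm_model, model_type="llama", config=llm_config),
    chain_type="stuff",
    retriever=vectordb.as_retriever(search_kwargs={'k': 1}),
    return_source_documents=True,
)

result = qa_chain({"query": question})
print(result["result"])
for doc in result["source_documents"]:
    print("Source chunk:", doc.page_content[:200])  # preview the chunk the answer was grounded in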

And finally, the file where we can tweak the RAG:

data_path = r"path_to_your_dir"  # Directory containing the documents you want to ingest
embedding_model = "sentence-transformers/distiluse-base-multilingual-cased-v1"  # Embedding model used for the chunks
llm_model = "TheBloke/zephyr-7B-beta-GGUF"  # Quantized LLM loaded through CTransformers
llm_config = {'max_new_tokens': 128, 'repetition_penalty': 1.1, 'temperature': 0.2}  # Generation parameters
persist_directory = 'your_db_dir'  # Where the Chroma database is stored on disk

As you can see in the first and second Python scripts, I called this configuration file llmsettings.py. You can rename it, but then you also need to make that change in every file that imports it.
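
Since this is the file where the RAG gets tuned, you could also centralize the other knobs that are currently hard-coded in the two scripts. The names below (chunk_size, chunk_overlap, top_k) are my own additions, not part of the original scripts, so you would have to import and use them yourself:

# Hypothetical extra settings for tuning the RAG in one place
chunk_size = 128      # passed to CharacterTextSplitter in the ingestion script
chunk_overlap = 50    # overlap between consecutive chunks
top_k = 1             # how many chunks the retriever returns (search_kwargs={'k': top_k})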

In your terminal, run the ingestion script first; it will process your documents and store their vector representations in the database.
Once the ingestion script finishes, run the second script; it will start the prompt where you can interact with the BOT.
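
For example, assuming you saved the three files as ingest.py, chat.py, and llmsettings.py (the filenames are my assumption; use whatever names you gave them):

python ingest.py   # builds the Chroma database from your documents
python chat.py     # starts the prompt where you can ask the BOT questions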

RAG is a powerful tool that bridges the gap between LLMs and the real world. By providing a constant stream of fresh facts, RAG paves the way for a new generation of AI that’s not only intelligent but also reliable.
