From Retrieval to Intelligence: Exploring RAG, Agent+RAG, and Evaluation with TruLens
Introduction to Enhancing LLM Capabilities
The current landscape of artificial intelligence is rich with powerful foundation models like GPT-4o, Sonnet, Gemini, Llama 3.2, Gemma, and Mistral. These models possess extensive general knowledge, covering history, geography, and vast amounts of information from sources like Wikipedia. However, they exhibit two primary weaknesses: a lack of granular detail and outdated knowledge due to training data cutoffs. This article addresses these limitations using imaginary companies as a test case: firms notionally founded before the models' knowledge cutoffs, whose information has since changed.
To tackle these challenges, we employ the Retrieval Augmented Generation (RAG) technique, leveraging the LlamaIndex framework. RAG enhances LLMs by providing them with the most relevant information during the answer generation process. This involves maintaining a database of custom data that the model can access. To rigorously assess the system's performance, we integrate the TruLens library and the RAG Triad metrics.
While search tools can partially mitigate knowledge cutoffs, they are not a full substitute for knowledge the model already holds. Consider two ML specialists: one deeply versed in current GenAI trends, and another who transitioned from GenAI to computer vision six months ago. When asked about recent GenAI models, the first specialist would likely provide a quick, informed answer, perhaps with minor checks. The second, however, would need significant time to research the topic, understand the underlying mechanisms, and only then formulate a response. The analogy highlights that extensive searching eventually yields an answer, but takes considerably longer. For interactive applications, users expect rapid responses, not minutes spent on simulated "googling." Furthermore, not all necessary information is publicly accessible or searchable.
Generating a Custom Data Corpus
Finding a dataset not already included in the training data of foundation models can be challenging, as most public data is indexed during the pre-training phase. To overcome this, we generate a custom private corpus. Using ChatGPT-4o via the OpenAI UI, we crafted several prompts to create data for four distinct companies. Examples include requests for details on "Ukraine Boats Inc.," including products, prices, and staff, as well as information on partnerships, legal policies, manufacturing locations, and client case studies.
The generated corpus includes text files for companies like Nova Drive Motors, Aero Vance Aviation, Ukraine Boats Inc., and City Solve. Token counts for these files are approximately: Nova Drive Motors (2757 tokens), Aero Vance Aviation (1860 tokens), Ukraine Boats Inc. (3793 tokens), and City Solve (3826 tokens), totaling around 12,236 tokens. Below is an excerpt from the Ukraine Boats Inc. description:
Ukraine Boats Inc.
Corporate Overview: Ukraine Boats Inc. is a premier manufacturer and supplier of high-quality boats and maritime solutions based in Odessa, Ukraine. The company prides itself on blending traditional craftsmanship with modern technology to serve clients worldwide. Founded in 2005, the company has grown to be a leader in the boating industry, specializing in recreational, commercial, and luxury vessels.
Product Lineup
Recreational Boats:
1. WaveRunner X200
- Description: A sleek speedboat designed for water sports enthusiasts. Equipped with advanced navigation and safety features.
- Price: $32,000
- Target Market: Young adventurers and watersport lovers.
- Features: Top speed of 85 mph, built-in GPS with autopilot mode, seating capacity: 4, lightweight carbon-fiber hull.
2. AquaCruise 350
- Description: A versatile motorboat ideal for fishing, family trips, and casual cruising.
- Price: $45,000
- Features: 12-person capacity, dual 300HP engines, modular interiors with customizable seating and storage, optional fishing equipment upgrades.
3. SolarGlide EcoBoat
- Description: A solar-powered boat for environmentally conscious customers.
- Price: $55,000
- Features: Solar panel roof with 12-hour charge life, zero emissions, maximum speed: 50 mph, silent motor technology...
For evaluation purposes, we also generated 10 questions and answers specifically about Ukraine Boats Inc. based on the corpus. An example question and answer pair is: "What is the primary focus of Ukraine Boats Inc.?" with the answer: "Ukraine Boats Inc. specializes in manufacturing high-quality recreational, luxury, and commercial boats, blending traditional craftsmanship with modern technology."
Data Propagation to Neo4j Database
For storing our custom data for the RAG use case, we utilize the Neo4j graph database, which offers a free instance upon registration and handles future data relations well. We begin by instantiating an embedding model, "text-embedding-3-small," configured with 256-dimensional vectors, since our tests showed less variance in similarity scores at this dimensionality.
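A minimal sketch of the embedding setup, assuming a recent LlamaIndex with the `llama-index-embeddings-openai` package and an `OPENAI_API_KEY` environment variable:

```python
from llama_index.embeddings.openai import OpenAIEmbedding

# "text-embedding-3-small" supports shortened output vectors via the
# `dimensions` parameter; 256 showed less score variance in our tests.
embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    dimensions=256,
)
```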
The corpus documents are then read using LlamaIndex's `SimpleDirectoryReader`. Subsequently, the `SentenceSplitter` is employed to segment the documents into distinct nodes, which are then stored in the Neo4j database. The `Neo4jVectorStore` is configured with connection details, embedding dimension, and hybrid search settings (initially turned off to focus on vector search performance). Finally, a `VectorStoreIndex` is created, populating the database with the nodes.
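A sketch of this ingestion step, assuming the `llama-index-vector-stores-neo4jvector` integration; the corpus path, Neo4j credentials, and chunk sizes are placeholders, and `embed_model` comes from the earlier snippet:

```python
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.neo4jvector import Neo4jVectorStore

# Read the generated corpus from a local folder (path is a placeholder).
documents = SimpleDirectoryReader("./data/companies").load_data()

# Split documents into sentence-aware chunks (sizes are illustrative).
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)

# Connect to the Neo4j instance; hybrid search stays off for pure vector search.
vector_store = Neo4jVectorStore(
    username="neo4j",
    password="<password>",
    url="neo4j+s://<instance>.databases.neo4j.io",
    embedding_dimension=256,
    hybrid_search=False,
)

# Build the index, which embeds the nodes and writes them to Neo4j.
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context, embed_model=embed_model)
```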
Implementing the RAG Query Pipeline
The RAG technique can be implemented as a standalone solution or as part of an agent. An agent typically manages chat history, tool handling, reasoning, and output generation. This section details the implementation of query engines (standalone RAG) and the agent approach, where the agent can utilize RAG as one of its tools. We will explore RAG with both OpenAI models and Meta Llama 3.2 models to benchmark their performance.
Configuration parameters for these implementations are centralized in a `pyproject.toml` file, specifying settings for similarity, vector store queries, response modes, distance strategies, embedding dimensions, chunking, and model details (LLM, embedding model, HuggingFace models, context windows, API keys).
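For illustration, a sketch of reading such settings with Python's built-in `tomllib` (Python 3.11+); the table and key names here are hypothetical, not the article's exact schema:

```python
import tomllib

# Parse the project config; tomllib requires binary mode.
with open("pyproject.toml", "rb") as f:
    config = tomllib.load(f)

# Hypothetical custom table holding the RAG settings.
rag_cfg = config["tool"]["rag"]
similarity_top_k = rag_cfg["similarity_top_k"]
embedding_dimension = rag_cfg["embedding_dimension"]
llm_name = rag_cfg["llm"]
```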
Standalone RAG with OpenAI Models
The initial step involves connecting to the existing vector index in Neo4j. We instantiate the OpenAI models, using "gpt-4o-mini" as the language model together with the same embedding model. These models are configured within LlamaIndex's `Settings` object for easy access.

A default query engine is then created from the vector index using `index.as_query_engine()`. Executing a query, such as "What is the primary focus of Ukraine Boats Inc.?", retrieves source nodes from the database and generates a response. The output includes the source node IDs, their similarity scores, and the predicted answer. For instance, a query might return nodes with high scores and an answer like: "The primary focus of Ukraine Boats Inc. is designing, manufacturing, and selling luxury and eco-friendly boats, with a strong emphasis on customer satisfaction and environmental sustainability."
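A sketch of connecting to the already-populated index and running the default engine; `vector_store` and `embed_model` come from the earlier snippets:

```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI

# Register the models globally so downstream components pick them up.
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = embed_model

# Re-attach to the existing Neo4j vector store instead of re-ingesting.
index = VectorStoreIndex.from_vector_store(vector_store)

query_engine = index.as_query_engine()
response = query_engine.query("What is the primary focus of Ukraine Boats Inc.?")

# Inspect retrieved node IDs and their similarity scores.
for node in response.source_nodes:
    print(node.node_id, node.score)
print(response)
```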
For more granular control, LlamaIndex's low-level API allows for custom query engines. This involves configuring a custom retriever (e.g., `VectorIndexRetriever` with a specified `similarity_top_k`), a similarity postprocessor (e.g., `SimilarityPostprocessor` with a `similarity_cutoff`), and a response synthesizer (e.g., via `get_response_synthesizer`). These components are then combined into a `RetrieverQueryEngine`.
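A sketch of the low-level composition; the top-k and cutoff values are illustrative, and `index` is the one built above:

```python
from llama_index.core import get_response_synthesizer
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

# Retrieve more candidates than the default engine would.
retriever = VectorIndexRetriever(index=index, similarity_top_k=10)

# Drop weakly matching chunks below the similarity cutoff.
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.7)

# Control how the final answer is synthesized from surviving chunks.
synthesizer = get_response_synthesizer(response_mode="compact")

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
    node_postprocessors=[postprocessor],
)
```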
RAG as an Agent Tool (OpenAI)
To integrate RAG into an agent, we use LlamaIndex's predefined `OpenAIAgentWorker`. The previously created query engine is wrapped into a `QueryEngineTool`, which the agent can select based on its description. The agent is configured with a system prompt instructing it to always use the retrieval tool before answering and to respond with "Didn't find relevant information" if the answer cannot be found. The `QueryEngineTool` is defined with a name, description, and the query engine itself; an `AgentRunner` then orchestrates the agent's interactions.
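A sketch of the agent wiring, assuming the `llama-index-agent-openai` package; the tool name and description are illustrative:

```python
from llama_index.agent.openai import OpenAIAgentWorker
from llama_index.core.agent import AgentRunner
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Wrap the query engine so the agent can pick it by description.
rag_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="company_knowledge_base",
        description=(
            "Retrieves information about Nova Drive Motors, Aero Vance Aviation, "
            "Ukraine Boats Inc., and City Solve from the private corpus."
        ),
    ),
)

agent_worker = OpenAIAgentWorker.from_tools(
    tools=[rag_tool],
    llm=Settings.llm,
    system_prompt=(
        "Always use the retrieval tool before answering. If the answer cannot "
        "be found in the retrieved context, respond with "
        "'Didn't find relevant information.'"
    ),
    verbose=True,
)
agent = AgentRunner(agent_worker)
```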
A chat-like interface allows for user-agent interaction. Sample conversations demonstrate the agent's ability to use the RAG tool to answer questions about companies like "Ukraine Boats Inc." and "City Solve." Note that for effective vector search, input questions must be detailed enough for semantic matching.
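A minimal interactive loop on top of the agent, purely for illustration:

```python
# Simple REPL-style chat; type "exit" to stop.
while True:
    user_input = input("User: ")
    if user_input.strip().lower() == "exit":
        break
    response = agent.chat(user_input)
    print(f"Agent: {response}")
```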
Standalone RAG with Open-Source Models (Llama 3.2)
As an open-source alternative, we utilize the `meta-llama/Llama-3.2-3B-Instruct` model, chosen for its balance of latency and performance. Authentication with HuggingFace is required via an access token. A model wrapper is created for Llama, served on a single NVIDIA GeForce RTX 3090. A system prompt defines the AI's behavior, emphasizing friendly, direct, and professional responses based on source documents.
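A sketch of the open-source setup, assuming the `llama-index-llms-huggingface` integration and a HuggingFace token with access to the gated Llama weights; context window and generation parameters are illustrative:

```python
from huggingface_hub import login
from llama_index.core import Settings
from llama_index.llms.huggingface import HuggingFaceLLM

# Authenticate for the gated meta-llama repository (token is a placeholder).
login(token="<HF_ACCESS_TOKEN>")

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-3.2-3B-Instruct",
    tokenizer_name="meta-llama/Llama-3.2-3B-Instruct",
    context_window=8192,
    max_new_tokens=512,
    generate_kwargs={"temperature": 0.1, "do_sample": True},
    system_prompt=(
        "You are a friendly, direct, and professional assistant. Answer "
        "strictly based on the provided source documents."
    ),
    device_map="cuda:0",  # serve on a single RTX 3090
)
Settings.llm = llm
```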
The query engine interface remains the same as with OpenAI models, producing similar example outputs. For the agent implementation with open-source models, the `ReActAgentWorker` is used, which employs an iterative reasoning-and-acting process. This agent is also configured with the `QueryEngineTool`, and the agent runner orchestrates the interactions.
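A sketch of the ReAct variant, reusing the same tool; `ReActAgentWorker` ships with LlamaIndex core:

```python
from llama_index.core.agent import AgentRunner, ReActAgentWorker

react_worker = ReActAgentWorker.from_tools(
    tools=[rag_tool],
    llm=llm,
    verbose=True,  # print the thought/action/observation loop
)
react_agent = AgentRunner(react_worker)

print(react_agent.chat("What is the primary focus of Ukraine Boats Inc.?"))
```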
Sample chat interactions with the Llama 3.2 agent show a different reasoning process compared to the OpenAI agent. While both agents successfully retrieve information and provide answers, the open-source agent occasionally struggles with tool input formatting, highlighting the importance of clear tool descriptions.
Evaluating RAG Systems with TruLens
Evaluating RAG systems is crucial, and TruLens provides a robust framework for this. The RAG Triad metrics—answer relevance, context relevance, and groundedness—are used to assess performance. These metrics are evaluated using LLMs as judges, scoring answers based on the provided information.
TruLens offers a leaderboard UI and detailed per-record tables to analyze system performance and internal processes. To implement the RAG Triad evaluation, an experiment name and a model provider (e.g., an `OpenAIProvider` using "gpt-4o-mini") are defined. Feedback functions are set up for each metric, as sketched after the list below:
- Context Relevance: Assesses the relevance of each retrieved context chunk to the query.
- Groundedness: Measures how well the generated answer is supported by the retrieved context, often using a chain-of-thought approach.
- Answer Relevance: Evaluates the overall relevance of the generated answer to the user's question.
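A sketch of the triad setup using the trulens-eval-era API; import paths and method names differ across TruLens versions, so treat this as indicative:

```python
import numpy as np
from trulens_eval import Feedback, TruLlama
from trulens_eval.feedback.provider.openai import OpenAI as OpenAIProvider

# LLM judge used to score the metrics.
provider = OpenAIProvider(model_engine="gpt-4o-mini")

# Relevance of each retrieved chunk to the query, averaged over chunks.
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)
    .aggregate(np.mean)
)

# How well the answer is supported by the retrieved context (chain of thought).
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(TruLlama.select_source_nodes().node.text.collect())
    .on_output()
)

# Relevance of the final answer to the original question.
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()
```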
A `TruLlama` object is instantiated to manage feedback calculations during agent calls. The evaluation pipeline iterates through a dataset, resetting the agent and using `tru_agent` to record interactions and feedback results. The process logs feedback scores, handling potential exceptions.
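A sketch of the recording loop over the QA dataset; the experiment name, the `qa_dataset` variable, and its record layout are assumptions:

```python
from trulens_eval import TruLlama

tru_agent = TruLlama(
    agent,
    app_id="openai-agent-rag-v1",  # experiment name, illustrative
    feedbacks=[f_context_relevance, f_groundedness, f_answer_relevance],
)

for item in qa_dataset:  # assumed list of {"question": ..., "answer": ...} dicts
    agent.reset()  # clear chat history between questions
    try:
        # Recording context captures the call plus feedback results.
        with tru_agent as recording:
            agent.chat(item["question"])
    except Exception as exc:
        print(f"Skipping question due to error: {exc}")
```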
Experiments are conducted using different models, query engines, and agent configurations. The results can be reviewed as a DataFrame using the `get_leaderboard()` method, providing insights into system performance across various configurations.
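Retrieving the aggregate results as a DataFrame (again using the trulens-eval-era API):

```python
from trulens_eval import Tru

tru = Tru()
# One row per app_id with mean feedback scores, latency, and cost;
# an empty app_ids list returns all recorded experiments.
leaderboard = tru.get_leaderboard(app_ids=[])
print(leaderboard)
```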
Conclusion and Future Directions
This exploration successfully generated a private corpus using GPT models, demonstrating the diversity and utility of AI-generated content for fine-tuning. The Neo4j database proved effective for storing and managing this data, especially for projects involving data relations.
We implemented various RAG approaches, both standalone and agent-based, and evaluated them using the RAG Triad metrics. The OpenAI-based agent performed exceptionally well, while a well-prompted ReAct agent showed comparable results. The use of a custom query engine significantly impacted performance, aligning better with specific data requirements. Both solutions exhibited high groundedness, a critical factor for RAG applications.
Interestingly, the agent call latency between the Llama 3.2 3B model and the GPT-4o-mini API was similar, with database calls being the primary time consumer. While the current system is robust, future improvements can include keyword search, rerankers, neighbor chunking selection, and comparison with ground truth labels. These advanced topics will be covered in subsequent articles on RAG applications.
P.S.
The journey from basic retrieval to intelligent, agent-driven systems is ongoing. By combining techniques like RAG with sophisticated evaluation frameworks like TruLens, we can build more capable and reliable AI applications. The continuous evolution of LLMs and RAG methodologies promises even more powerful solutions in the future.
AI Summary
This article delves into advanced RAG techniques, focusing on enhancing Large Language Models (LLMs) with custom data. It addresses the limitations of LLMs, such as a lack of specific details and knowledge cutoffs, by implementing Retrieval Augmented Generation (RAG) with the LlamaIndex framework. The process begins with generating a private corpus using GPT-4o and storing it in a Neo4j database. The article details the implementation of both standalone RAG query engines and RAG-integrated agents, comparing the performance of OpenAI models and Llama 3.2. A significant portion is dedicated to evaluating these systems using TruLens and the RAG Triad metrics (answer relevance, context relevance, and groundedness). The evaluation process uses LLMs as judges to score the system's responses across the different model and engine configurations.