How do you know if your RAG/fine-tuned LLM implementation is good?
A quick primer on LLM evaluation
So you’ve designed and deployed an LLM that uses RAG or fine-tuning on private data at your company. Now what? How do you know whether to spend more time integrating the next embeddings model, swapping in the latest foundation model, or refining your chunk-size strategy and metadata definitions for the data itself? How do you know whether references to that data fit sensibly within your token limits and context windows?
More importantly, what metric and framework do you use to A/B test one implementation against another and decide whether to keep optimizing (or not)?
Setting up custom test cases that speak to your use case is key to answering these questions, and you may have already seen a few tools for this.
One powerful tool in the evaluator’s toolkit is DeepEval, a framework that lets developers define custom test cases and metrics to assess an AI model’s strengths and weaknesses in one shot.
Simply define your own test case and iterate against the metrics that matter to you (sample adapted from the documentation):
import openai  # testing the output of gpt-3.5-turbo
from deepeval.metrics.factual_consistency import FactualConsistencyMetric
from deepeval.test_case import LLMTestCase
from deepeval.run_test import assert_test

# Write a sample ChatGPT function
def generate_chatgpt_output(query: str):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "assistant", "content": "The customer success phone line is 1200-231-231 and the customer success state is in Austin."},
            {"role": "user", "content": query}
        ]
    )
    # The model's actual answer, to be checked against the expected output
    output = response.choices[0].message.content
    return output

def test_llm_output():
    query = "What is the customer success phone line?"
    expected_output = "Our customer success phone line is 1200-231-231."
    output = generate_chatgpt_output(query)
    test_case = LLMTestCase(query=query, output=output, expected_output=expected_output)
    metric = FactualConsistencyMetric()
    assert_test(test_case, metrics=[metric])
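From there, you can scale coverage by parametrizing the same assertion over a batch of queries with pytest. Here is a minimal sketch, assuming it lives in the same test file as generate_chatgpt_output above; the (query, expected answer) pairs are illustrative stand-ins for cases drawn from your own data:

import pytest

# Illustrative (query, expected_output) pairs; replace with cases from your own domain
CASES = [
    ("What is the customer success phone line?", "Our customer success phone line is 1200-231-231."),
    ("Where is the customer success team located?", "The customer success team is in Austin."),
]

@pytest.mark.parametrize("query,expected_output", CASES)
def test_llm_output_batch(query: str, expected_output: str):
    # Reuses generate_chatgpt_output and the deepeval imports from the sample above
    output = generate_chatgpt_output(query)
    test_case = LLMTestCase(query=query, output=output, expected_output=expected_output)
    assert_test(test_case, metrics=[FactualConsistencyMetric()])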
Or you can implement your own custom metric:
from deepeval.test_case import LLMTestCase
from deepeval.metrics.metric import Metric
from deepeval.run_test import assert_test

class LengthMetric(Metric):
    """This metric checks that the output is longer than a minimum number of characters"""

    def __init__(self, minimum_length: int = 3):
        self.minimum_length = minimum_length

    def measure(self, test_case: LLMTestCase):
        # Score is simply the character count of the model output
        text = test_case.output
        score = len(text)
        self.success = bool(score > self.minimum_length)
        return score

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Length"

# Run this test
def test_length_metric():
    metric = LengthMetric()
    test_case = LLMTestCase(
        output="This is a long sentence that is more than 3 letters"
    )
    assert_test(test_case, [metric])
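The same pattern extends to checks that are specific to a RAG setup. As a rough sketch (KeywordRecallMetric and its required_keywords parameter are illustrative, not part of DeepEval), a metric that verifies the answer surfaces key facts pulled from your source documents might look like:

from deepeval.test_case import LLMTestCase
from deepeval.metrics.metric import Metric

class KeywordRecallMetric(Metric):
    """Illustrative metric: fraction of required keywords that appear in the output"""

    def __init__(self, required_keywords: list, minimum_score: float = 0.5):
        self.required_keywords = required_keywords
        self.minimum_score = minimum_score

    def measure(self, test_case: LLMTestCase):
        # Count how many of the required keywords show up in the model output
        output = test_case.output.lower()
        hits = sum(1 for keyword in self.required_keywords if keyword.lower() in output)
        score = hits / len(self.required_keywords)
        self.success = score >= self.minimum_score
        return score

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "KeywordRecall"

A metric like this can sit alongside the built-in ones in the same assert_test call, so one run flags both factual drift and missing domain facts.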
Current built-in metrics cover BERTScore, bias, factual consistency, toxicity, and similarity ranking. To run the tests, just save the file and run:
deepeval test run tests/test_sample.py
You’ll need your free API key, which you can get here.
The key is to rigorously interrogate the AI with a diverse battery of tests before ever letting it touch production data, and then to keep evaluating it in production against the data behind your implementation.
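As for the A/B question at the top of the post, one workable pattern is to run the same battery of test cases against each candidate implementation and compare aggregate scores. The sketch below is illustrative: generate_output_a and generate_output_b are hypothetical stand-ins for two competing RAG configurations (say, different chunk sizes or embeddings models), and it assumes, as with the custom metric above, that measure() returns a numeric score:

from deepeval.metrics.factual_consistency import FactualConsistencyMetric
from deepeval.test_case import LLMTestCase

# Hypothetical stand-ins for two competing RAG configurations
def generate_output_a(query: str) -> str:
    raise NotImplementedError("call configuration A of your pipeline here")

def generate_output_b(query: str) -> str:
    raise NotImplementedError("call configuration B of your pipeline here")

# Shared evaluation set of (query, expected_output) pairs
CASES = [
    ("What is the customer success phone line?", "Our customer success phone line is 1200-231-231."),
]

def average_score(generate) -> float:
    metric = FactualConsistencyMetric()
    scores = []
    for query, expected_output in CASES:
        test_case = LLMTestCase(query=query, output=generate(query), expected_output=expected_output)
        scores.append(metric.measure(test_case))
    return sum(scores) / len(scores)

print("Config A:", average_score(generate_output_a))
print("Config B:", average_score(generate_output_b))

Whichever configuration scores higher on the suite is the one worth iterating on, and the suite itself becomes the regression test for the next change.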