Evaluating AI with Haystack
Last Updated: October 30, 2024
In this cookbook, we walk through the Evaluators in Haystack, create an evaluation pipeline, streamline the evaluation with EvaluationHarness, and try different evaluation frameworks like Ragas and FlowJudge.
Useful Resources:
- Article: Benchmarking Haystack Pipelines for Optimal Performance
- Evaluation Walkthrough
- haystack-evaluation repo
- EvaluationHarness (haystack-experimental)
- Evaluation tutorial
- Evaluation Docs
Watch Along
!pip install haystack-ai "sentence-transformers>=3.0.0" pypdf
!pip install ragas-haystack "flow-judge[hf]"  # evaluation frameworks
1. Building your pipeline
ARAGOG
This dataset is based on the paper Advanced Retrieval Augmented Generation Output Grading (ARAGOG). It’s a collection of papers from arXiv covering topics around Transformers and Large Language Models, all in PDF format.
The dataset contains:
- 13 PDF papers.
- 107 questions and answers generated with the assistance of GPT-4, and validated/corrected by humans.
We have:
- ground-truth answers
- questions
Get the dataset here
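If you want to sanity-check the download before building anything, the snippet below counts the PDFs and the question/answer pairs. It uses the same paths as the rest of this cookbook; adjust them to wherever you extracted the files.
import json
import os

files_path = "/content/papers_for_questions"     # folder containing the 13 PDF papers
questions_path = "/content/eval_questions.json"  # questions and ground-truth answers

pdf_files = [f for f in os.listdir(files_path) if f.endswith(".pdf")]
print(f"{len(pdf_files)} PDFs found")  # expected: 13

with open(questions_path, "r") as f:
    data = json.load(f)
print(f"{len(data['questions'])} questions, {len(data['ground_truths'])} ground-truth answers")  # expected: 107 each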
Indexing Pipeline
import os
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import PyPDFToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
embedding_model="sentence-transformers/all-MiniLM-L6-v2"
document_store = InMemoryDocumentStore()
files_path = "/content/papers_for_questions" # <ENTER YOUR PATH HERE>
pipeline = Pipeline()
pipeline.add_component("converter", PyPDFToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_length=250, split_by="word")) # default splitting by word
pipeline.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP))
pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(embedding_model))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "embedder")
pipeline.connect("embedder", "writer")
pdf_files = [files_path+"/"+f_name for f_name in os.listdir(files_path)]
pipeline.run({"converter": {"sources": pdf_files}})
document_store.count_documents()
690
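As a quick check that splitting and embedding worked, you can pull one document back out of the store and look at its content and embedding, for example:
sample = document_store.filter_documents()[0]
print(sample.content[:200])   # first 200 characters of the chunk
print(len(sample.embedding))  # embedding dimension, 384 for all-MiniLM-L6-v2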
RAG
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('OPENAI_API_KEY: ')
from haystack import Pipeline
from haystack.components.builders import PromptBuilder, AnswerBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers import InMemoryEmbeddingRetriever
template = """
You have to answer the following question based on the given context information only.
If the context is empty or just a '\\n' answer with None, example: "None".
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
Question: {{question}}
Answer:
"""
basic_rag = Pipeline()
basic_rag.add_component("query_embedder", SentenceTransformersTextEmbedder(
model=embedding_model, progress_bar=False
))
basic_rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store))
basic_rag.add_component("prompt_builder", PromptBuilder(template=template))
basic_rag.add_component("generator", OpenAIGenerator(model="gpt-4o-mini"))
basic_rag.connect("query_embedder", "retriever.query_embedding")
basic_rag.connect("retriever", "prompt_builder.documents")
basic_rag.connect("prompt_builder", "generator")
<haystack.core.pipeline.pipeline.Pipeline object at 0x7a05787d5cf0>
Components
- query_embedder: SentenceTransformersTextEmbedder
- retriever: InMemoryEmbeddingRetriever
- prompt_builder: PromptBuilder
- generator: OpenAIGenerator
Connections
- query_embedder.embedding -> retriever.query_embedding (List[float])
- retriever.documents -> prompt_builder.documents (List[Document])
- prompt_builder.prompt -> generator.prompt (str)
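If a diagram is easier to read than the text representation above, Pipeline objects can also render themselves to an image. A minimal sketch: Pipeline.draw uses an external Mermaid rendering service by default, so it needs internet access, and the exact signature may vary across Haystack versions.
basic_rag.draw(path="basic_rag.png")  # writes a Mermaid-rendered diagram of the pipeline to disk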
2. Human Evaluation
from typing import List, Tuple
import json
def read_question_answers() -> Tuple[List[str], List[str]]:
with open("/content/eval_questions.json", "r") as f:
data = json.load(f)
questions = data["questions"]
answers = data["ground_truths"]
return questions, answers
all_questions, all_answers = read_question_answers()
print(len(all_questions))
print(len(all_answers))
107
107
questions = all_questions[:15]
answers = all_answers[:15]
index = 5
print(questions[index])
print(answers[index])
question = questions[index]
How were the questions for the multitask test sourced, and what was the criteria for their inclusion?
Questions were manually collected by graduate and undergraduate students from freely available online sources, including practice questions for standardized tests and undergraduate courses, ensuring a wide representation of difficulty levels and subjects.
basic_rag.run({"query_embedder":{"text":question}, "prompt_builder":{"question": question}})
{'generator': {'replies': ['The questions for the multitask test were manually collected by graduate and undergraduate students from freely available sources online. These sources included practice questions for tests such as the Graduate Record Examination and the United States Medical Licensing Examination, as well as questions designed for undergraduate courses and those for readers of Oxford University Press books. The criteria for inclusion were based on ensuring that the questions covered a range of subjects and difficulty levels, including specific tasks like "Elementary," "High School," "College," or "Professional," with each subject containing a minimum of 100 test examples.'],
'meta': [{'model': 'gpt-4o-mini-2024-07-18',
'index': 0,
'finish_reason': 'stop',
'usage': {'completion_tokens': 110,
'prompt_tokens': 4559,
'total_tokens': 4669,
'completion_tokens_details': CompletionTokensDetails(audio_tokens=None, reasoning_tokens=0),
'prompt_tokens_details': PromptTokensDetails(audio_tokens=None, cached_tokens=0)}}]}}
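Looking at one answer at a time gets tedious quickly. A small loop like the sketch below prints a few question / ground-truth / generated-answer triples so you can eyeball the output before committing to automated metrics (it re-runs the RAG pipeline per question, so keep the sample small):
for q, gt in zip(questions[:3], answers[:3]):
    result = basic_rag.run({"query_embedder": {"text": q}, "prompt_builder": {"question": q}})
    print("Question:    ", q)
    print("Ground truth:", gt)
    print("Generated:   ", result["generator"]["replies"][0])
    print("-" * 80)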
3. Deciding on Metrics
- Semantic Answer Similarity: SASEvaluator compares the embedding of a generated answer against the embedding of a ground-truth answer, using a common embedding model (see the standalone sketch after this list).
- Context Relevance: ContextRelevanceEvaluator assesses how relevant the retrieved context is for answering the query.
- Faithfulness: FaithfulnessEvaluator checks whether the generated answer can be derived from the retrieved context.
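To get a feel for what these evaluators return before wiring them into a pipeline, here is a minimal standalone sketch with SASEvaluator (the two answer strings are made up for illustration). Used outside a pipeline, the evaluator needs an explicit warm_up() call to load its embedding model; it returns an aggregate score plus one individual_scores entry per answer pair.
from haystack.components.evaluators import SASEvaluator

sas_evaluator = SASEvaluator(model=embedding_model)
sas_evaluator.warm_up()  # load the embedding model; Pipeline.run() would do this for us
result = sas_evaluator.run(
    ground_truth_answers=["BERT is pre-trained on masked language modeling and next sentence prediction."],
    predicted_answers=["BERT uses two pre-training tasks: MLM and NSP."],
)
print(result["score"])              # aggregate over all pairs
print(result["individual_scores"])  # one similarity value per pair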
4. Building an Evaluation Pipeline
from haystack import Pipeline
from haystack.components.evaluators import ContextRelevanceEvaluator, FaithfulnessEvaluator, SASEvaluator
eval_pipeline = Pipeline()
eval_pipeline.add_component("context_relevance", ContextRelevanceEvaluator(raise_on_failure=False))
eval_pipeline.add_component("faithfulness", FaithfulnessEvaluator(raise_on_failure=False))
eval_pipeline.add_component("sas", SASEvaluator(model=embedding_model))
5. Running Evaluation
Run the RAG Pipeline
predicted_answers = []
retrieved_context = []
for question in questions: # loops over 15 questions
result = basic_rag.run({"query_embedder":{"text":question}, "prompt_builder":{"question": question}}, include_outputs_from={"retriever"})
predicted_answers.append(result["generator"]["replies"][0])
retrieved_context.append(result["retriever"]["documents"])
Run the Evaluation
eval_pipeline_results = eval_pipeline.run(
{
"context_relevance": {"questions": questions, "contexts": retrieved_context},
"faithfulness": {"questions": questions, "contexts": retrieved_context, "predicted_answers": predicted_answers},
"sas": {"predicted_answers": predicted_answers, "ground_truth_answers": answers},
}
)
results = {
"context_relevance": eval_pipeline_results['context_relevance'],
"faithfulness": eval_pipeline_results['faithfulness'],
"sas": eval_pipeline_results['sas']
}
100%|██████████| 15/15 [00:11<00:00, 1.26it/s]
100%|██████████| 15/15 [00:35<00:00, 2.37s/it]
6. Analyzing Results
from haystack.evaluation import EvaluationRunResult
inputs = {
'questions': questions,
'contexts': retrieved_context,
'true_answers': answers,
'predicted_answers': predicted_answers
}
run_name="rag_eval"
eval_results = EvaluationRunResult(run_name=run_name, inputs=inputs, results=results)
eval_results.score_report()
metrics score
0 context_relevance 0.200000
1 faithfulness 0.611111
2 sas 0.546086
eval_results.to_pandas()
index = 2
print(eval_pipeline_results['context_relevance']["individual_scores"][index], "\nQuestion:", questions[index],"\nTrue Answer:", answers[index], "\nAnswer:", predicted_answers[index])
print("".join([doc.content for doc in retrieved_context[index]]))
Evaluation Harness (Steps 4, 5, and 6)
In a single call, EvaluationHarness:
- Runs the RAG pipeline
- Runs the evaluation
Try EvaluationHarness and give us feedback on GitHub.
from haystack_experimental.evaluation.harness.rag import (
DefaultRAGArchitecture,
RAGEvaluationHarness,
RAGEvaluationMetric,
RAGEvaluationInput
)
pipeline_eval_harness = RAGEvaluationHarness(
rag_pipeline=basic_rag,
rag_components=DefaultRAGArchitecture.GENERATION_WITH_EMBEDDING_RETRIEVAL, # query_embedder, retriever, prompt_builder, generator
metrics={
RAGEvaluationMetric.SEMANTIC_ANSWER_SIMILARITY,
RAGEvaluationMetric.FAITHFULNESS,
RAGEvaluationMetric.CONTEXT_RELEVANCE,
}
)
eval_harness_input = RAGEvaluationInput(
queries=questions,
ground_truth_answers=answers,
rag_pipeline_inputs={
"prompt_builder": {"question": list(questions)},
},
)
harness_eval_run = pipeline_eval_harness.run(inputs=eval_harness_input, run_name=run_name)
harness_eval_run.results.score_report()
metrics score
0 metric_context_relevance 0.266667
1 metric_sas 0.537721
2 metric_faithfulness 0.747778
Override some parameters
from haystack_experimental.evaluation.harness.rag import RAGEvaluationOverrides
overrides = RAGEvaluationOverrides(rag_pipeline={
"generator": {"model": "gpt-4"},
})
harness_eval_run_gpt4 = pipeline_eval_harness.run(inputs=eval_harness_input, run_name="harness_eval_run_gpt4", overrides=overrides)
harness_eval_run_gpt4.results.score_report()
metrics score
0 metric_context_relevance 0.266667
1 metric_sas 0.654073
2 metric_faithfulness 0.796429
harness_eval_run.results.comparative_individual_scores_report(harness_eval_run_gpt4.results)
overrides = RAGEvaluationOverrides(rag_pipeline={
"retriever": {"top_k": 2},
})
harness_eval_run_topk2 = pipeline_eval_harness.run(inputs=eval_harness_input, run_name="harness_eval_run_topk2", overrides=overrides)
Executing RAG pipeline...
100%|██████████| 30/30 [01:50<00:00, 3.67s/it]
Executing evaluation pipeline...
100%|██████████| 30/30 [01:05<00:00, 2.18s/it]
100%|██████████| 30/30 [00:26<00:00, 1.12it/s]
harness_eval_run_topk2.results.score_report()
metrics score
0 metric_sas 0.574303
1 metric_faithfulness 0.780000
2 metric_context_relevance 0.400000
Evaluation Frameworks
Haystack also integrates with external evaluation frameworks. Below, we evaluate faithfulness with both FlowJudge (via its RESPONSE_FAITHFULNESS_5POINT preset) and Ragas (via RagasMetric.FAITHFULNESS).
from flow_judge.integrations.haystack import HaystackFlowJudge
from flow_judge.metrics.presets import RESPONSE_FAITHFULNESS_5POINT
from flow_judge import Hf
model = Hf(flash_attn=False)
flow_judge_evaluator = HaystackFlowJudge(
metric=RESPONSE_FAITHFULNESS_5POINT,
model=model,
progress_bar=True,
raise_on_failure=True,
save_results=True,
fail_on_parse_error=False
)
from haystack_integrations.components.evaluators.ragas import RagasEvaluator, RagasMetric
ragas_evaluator = RagasEvaluator(
metric=RagasMetric.FAITHFULNESS
)
# FlowJudge expects a single context string per query
str_fj_retrieved_context = []
for context in retrieved_context:
    str_context = [doc.content for doc in context]
    str_fj_retrieved_context.append(" ".join(str_context))

# Ragas expects a list of context strings per query
str_retrieved_context = []
for context in retrieved_context:
    str_context = [doc.content for doc in context]
    str_retrieved_context.append(str_context)
from haystack import Pipeline
integration_eval_pipeline = Pipeline()
integration_eval_pipeline.add_component("ragas_evaluator", ragas_evaluator)
integration_eval_pipeline.add_component("flow_judge_evaluator", flow_judge_evaluator)
eval_framework_pipeline_results = integration_eval_pipeline.run(
{
"ragas_evaluator": {"questions": questions, "contexts": str_retrieved_context, "responses":predicted_answers},
"flow_judge_evaluator": {"query": questions, "context": str_fj_retrieved_context, "response": predicted_answers},
}
)
Evaluating:   0%|          | 0/10 [00:00<?, ?it/s]
WARNING:ragas.metrics._faithfulness:No statements were generated from the answer.
WARNING:ragas.metrics._faithfulness:No statements were generated from the answer.
WARNING:ragas.metrics._faithfulness:No statements were generated from the answer.
Processing batches: 100%|██████████| 10/10 [03:32<00:00, 21.23s/it]
eval_framework_pipeline_results
{'ragas_evaluator': {'results': [[{'name': 'faithfulness', 'score': 0.5}],
[{'name': 'faithfulness', 'score': 1.0}],
[{'name': 'faithfulness', 'score': 1.0}],
[{'name': 'faithfulness', 'score': nan}],
[{'name': 'faithfulness', 'score': nan}],
[{'name': 'faithfulness', 'score': 0.9090909090909091}],
[{'name': 'faithfulness', 'score': 1.0}],
[{'name': 'faithfulness', 'score': 1.0}],
[{'name': 'faithfulness', 'score': 1.0}],
[{'name': 'faithfulness', 'score': nan}]]},
'flow_judge_evaluator': {'results': [{'feedback': "The response provided is highly consistent with the given context. The context explicitly mentions that BERT is pre-trained on two tasks: the masked language model (MLM) task and the next sentence prediction (NSP) task. The response accurately identifies these two tasks as the main pre-training tasks for BERT, directly reflecting the information provided in the context. There are no hallucinated or fabricated details in the response, and all the information presented is supported by the context. Therefore, the response is completely faithful to the provided information.\n\nThe context describes the pre-training process, mentioning the MLM task and NSP task explicitly, and also discusses the implications of these tasks on BERT's performance. The response does not deviate from this information in any way, making it entirely consistent with the provided context.\n\nGiven that the response fully aligns with the context without introducing any unsupported information, it meets the highest standard of the evaluation criteria.",
'score': 5},
{'feedback': 'The response provided is completely consistent with and faithful to the given context. It accurately reports the model sizes for BERT as BERTBASE and BERTLARGE, including all the specifications such as the number of layers, hidden size, number of self-attention heads, and total parameters for each model. These details are directly supported by the context provided. There are no hallucinated or fabricated details in the response. Therefore, the response meets the highest standard of the scoring rubric.',
'score': 5},
{'feedback': "The response provided is highly consistent with the given context. It accurately captures the essence of BERT's architecture and its ability to facilitate the use of a unified model across diverse NLP tasks as described in the context. The response mentions the multi-layer bidirectional Transformer encoder, the joint consideration of left and right context, and the minimal differences between the pre-trained and downstream architectures, all of which are directly supported by the context. Additionally, it correctly highlights the ability to adapt the same underlying model to various tasks with just one additional output layer, which is also mentioned in the context. There are no hallucinated or fabricated details in the response, and all the information presented is directly supported by the provided context.\n\nTherefore, the response is completely consistent with and faithful to the provided context.",
'score': 5},
{'feedback': "The output response is completely empty, providing no information about the modifications LLaMA makes to the transformer architecture for improved performance. Given the context provided, which extensively discusses the training approach, data sources, and performance comparisons of LLaMA models, the response fails to address any aspect of the query. There is no mention of specific architectural changes, optimizations, or improvements made to the transformer model by LLaMA. This complete lack of content directly contradicts the context, which is rich with information about LLaMA's methodology and performance. Therefore, the response is entirely inconsistent with the provided context.\n\nThe context describes a sophisticated approach to training large language models, including the use of model parallelism, specific data preprocessing techniques, and the results achieved with LLaMA models. It also highlights the use of publicly available data and the competitive performance of LLaMA models compared to other large language models. None of these details are reflected in the empty response.\n\nGiven that the response contains no information at all, it can be considered as introducing a significant amount of hallucinated or fabricated information that is not supported by the context. The response fails to meet any of the criteria for consistency or faithfulness to the provided information.",
'score': 1},
{'feedback': 'The output response is "None," which indicates that no information was provided in response to the query. Given the context provided, which discusses the approach to embedding layer optimization in LLaMA compared to traditional transformer models, this response is completely inconsistent with the context. The context describes various modifications to the transformer architecture, such as pre-normalization, SwiGLU activation functions, and rotary embeddings, as well as the use of model parallelism. However, the output does not contain any information from or reference to this context at all. Therefore, the response contains no information that is supported by the provided context, making it entirely inconsistent.\n\nBased on the evaluation criteria and scoring rubric, the response should be scored as a 1 because it is completely inconsistent with the provided context and contains no information supported by it.',
'score': 1},
{'feedback': 'The generated response is mostly consistent with the provided context, but it omits some important details that are present in the context. Specifically, the response does not mention the manual collection of questions by graduate and undergraduate students, the specific sources of questions (such as the Graduate Record Examination, United States Medical Licensing Examination, undergraduate courses, and Oxford University Press books), or the requirement that each subject contains at least 100 test examples. These omissions are significant because they are key pieces of information provided in the context. However, the response does include some correct information, such as the manual collection of questions from freely available sources and the inclusion criteria of having at least 100 test examples.\n\nThe response does not introduce any hallucinated or fabricated information. It sticks closely to the information that is directly supported by the context, but it misses some important details that would have made it fully consistent with the context. Therefore, while the response is mostly accurate, it is not completely faithful to the provided context.\n\nGiven these considerations, the response fits best with a score of 3, as it is somewhat consistent with the provided context but includes some omissions that prevent it from being fully consistent.',
'score': 3},
{'feedback': 'The response is mostly consistent with the provided context, but it omits some specific details that are present in the context. It correctly states that BERT outperforms previous state-of-the-art models on the GLUE benchmark, achieving a score of 80.5 with BERTLARGE, which is a significant improvement over previous models like OpenAI GPT (72.8) and an ELMo-based model (66.5). However, the response does not mention the specific 4.5% and 7.0% average accuracy improvements over prior models, which are crucial details provided in the context. Additionally, it does not mention the specific improvements in various tasks such as MNLI, SQuAD v1.1, or the overall 25-point increase in the average SuperGLUE score. These omissions mean that while the response is generally accurate, it does not fully capture all the important details from the context.\n\nGiven these points, the response is mostly consistent with the context but misses some key details, which places it in the category of being mostly consistent with some minor omissions.',
'score': 4},
{'feedback': "The response is mostly consistent with the provided context, but it contains some minor inconsistencies and omissions. \n\nThe response correctly mentions that BERT brings significant improvements to the SQuAD v1.1 and v2.0 tasks, achieving state-of-the-art results, which aligns with the context. It accurately states that BERTLARGE achieves an F1 score of 90.9 for SQuAD v1.1 and RoBERTa achieves an F1 score of 94.6, which is consistent with the context. However, the response does not mention the exact F1 score for SQuAD v2.0, which is 89.8 for RoBERTa, nor does it mention the exact EM score of 86.8. These omissions are minor and do not significantly contradict the context.\n\nThe response does not include specific details about BERTBASE's performance, which is present in the context (F1 score of 84.1 for SQuAD v1.1). While this omission is not a major inconsistency, it does mean that the response is not entirely comprehensive.\n\nOverall, the response is mostly consistent with the provided context, with only minor and inconsequential omissions. Therefore, it aligns most closely with a score of 4.\n\nThe response does not introduce any hallucinated or fabricated information, and all the information provided is supported by the context, albeit not exhaustively.",
'score': 4},
{'feedback': 'The generated response is completely consistent with and faithful to the provided context. The context explicitly states that LLaMA is trained exclusively on publicly available datasets, while models like GPT-3, Chinchilla, and PaLM rely on data that is either not publicly available or undocumented. The response accurately reflects this information without introducing any hallucinated or fabricated details. Therefore, the response is fully supported by the context.',
'score': 5},
{'feedback': "The output response is completely empty, providing no information or content related to the query about LLaMA's methodology for ensuring diversity in pre-training data or its filtering and language identification processes. Given the extensive context provided, which includes detailed information about LLaMA's performance, training data, and comparisons with other models, the expected response should have contained specific details about the methodologies used by LLaMA. However, the output fails to address any part of the query or utilize any information from the context. This complete lack of content directly contradicts the context and introduces no information from it, making the response entirely inconsistent with the provided context.\n\nTherefore, based on the evaluation criteria and scoring rubric, the response is completely inconsistent with the provided context.",
'score': 1}],
'metadata': {'model_id': 'flowaicom/Flow-Judge-v0.1',
'model_type': 'transformers',
'generation_params': {'temperature': 0.1,
'top_p': 0.95,
'max_new_tokens': 1000,
'do_sample': True},
'kwargs': {}},
'score': 3.4,
'individual_scores': [5.0, 5.0, 5.0, 1.0, 1.0, 3.0, 4.0, 4.0, 5.0, 1.0]}}
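As the warnings above show, Ragas returns nan for answers where it could not extract any statements (the "None" answers in this run). A minimal sketch to aggregate both frameworks' faithfulness results, skipping the nan entries on the Ragas side:
import numpy as np

ragas_scores = [r[0]["score"] for r in eval_framework_pipeline_results["ragas_evaluator"]["results"]]
print("Ragas faithfulness (nan-aware mean):", np.nanmean(ragas_scores))
print("Flow Judge faithfulness (1-5 scale):", eval_framework_pipeline_results["flow_judge_evaluator"]["score"])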