Evaluation

Creating Evals: How to build a "Golden Dataset."

Common Metrics: Precision, Recall, MRR (Mean Reciprocal Rank), NDCG.

LLM as a Judge: Using GPT-4 to grade RAG outputs (RAGAS framework).

Unit Testing for RAG: CI/CD for knowledge pipelines.