## Installation

To use Weave’s predefined scorers you need to install some additional dependencies:
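For example (the `scorers` extra shown here is an assumption; the exact package extra may differ across Weave releases):

```bash
pip install "weave[scorers]"
```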
## LLM-evaluators

Update Feb 2025: The pre-defined scorers that leverage LLMs now automatically integrate with litellm. You no longer need to pass an LLM client; just set the
`model_id`.
See the supported models here.

## HallucinationFreeScorer
This scorer checks if your AI system’s output includes any hallucinations based on the input data.

Customization:

- Customize the `system_prompt` and `user_prompt` fields of the scorer to define what “hallucination” means for you.
- The `score` method expects an input column named `context`. If your dataset uses a different name, use the `column_map` attribute to map `context` to the dataset column.
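Here is a minimal sketch of the scorer in the context of an evaluation. The import path, the argument-free constructor, and the project name are assumptions based on recent Weave versions; adjust them to your setup.

```python
import asyncio

import weave
from weave.scorers import HallucinationFreeScorer

weave.init("hallucination-demo")  # hypothetical project name

# Each row provides the `context` column the scorer expects.
dataset = [
    {"context": "Harry Potter was written by J.K. Rowling."},
    {"context": "The Eiffel Tower is located in Paris, France."},
]

@weave.op()
def answer(context: str) -> str:
    # Stand-in for your AI system; it simply echoes part of the context.
    return context.split(",")[0]

# Uses the litellm default model; pass model_id="..." to override it.
scorer = HallucinationFreeScorer()

evaluation = weave.Evaluation(dataset=dataset, scorers=[scorer])
print(asyncio.run(evaluation.evaluate(answer)))
```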
## SummarizationScorer

Use an LLM to compare a summary to the original text and evaluate the quality of the summary.

How It Works:

This scorer evaluates summaries in two ways:

- Entity Density: Checks the ratio of unique entities (like names, places, or things) mentioned in the summary to the total word count in the summary in order to estimate the “information density” of the summary. Uses an LLM to extract the entities. Similar to how entity density is used in the Chain of Density paper, https://arxiv.org/abs/2309.04269
- Quality Grading: An LLM evaluator grades the summary as `poor`, `ok`, or `excellent`. These grades are then mapped to scores (0.0 for poor, 0.5 for ok, and 1.0 for excellent) for aggregate performance evaluation.

Customization:

- Adjust `summarization_evaluation_system_prompt` and `summarization_evaluation_prompt` to tailor the evaluation process.

Notes:

- The scorer uses litellm internally.
- The `score` method expects the original text (the one being summarized) to be present in the `input` column. Use `column_map` if your dataset uses a different name.
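A sketch of running this scorer inside an evaluation follows. The dataset supplies the original text in the `input` column; constructor details such as `model_id` are assumptions to verify against your Weave version.

```python
import asyncio

import weave
from weave.scorers import SummarizationScorer

weave.init("summarization-demo")  # hypothetical project name

# The original text to be summarized lives in the `input` column.
dataset = [
    {"input": "The quick brown fox jumps over the lazy dog while the farmer watches from the porch."},
]

@weave.op()
def summarize(input: str) -> str:
    # Placeholder summarizer; in practice this would call your model.
    return input.split(" while ")[0] + "."

# Uses litellm defaults; set model_id="..." to choose the judge model.
scorer = SummarizationScorer()

evaluation = weave.Evaluation(dataset=dataset, scorers=[scorer])
print(asyncio.run(evaluation.evaluate(summarize)))
```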
## OpenAIModerationScorer

The OpenAIModerationScorer uses OpenAI’s Moderation API to check if the AI system’s output contains disallowed content, such as hate speech or explicit material.

How It Works:

- Sends the AI’s output to the OpenAI Moderation endpoint and returns a structured response indicating if the content is flagged.
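A minimal sketch, assuming the scorer can be constructed without arguments and picks up `OPENAI_API_KEY` from the environment; check your Weave version for any required client or key parameters.

```python
import asyncio

import weave
from weave.scorers import OpenAIModerationScorer

weave.init("moderation-demo")  # hypothetical project name

dataset = [{"prompt": "Say something kind about cats."}]

@weave.op()
def respond(prompt: str) -> str:
    return "Cats are wonderful companions."

# Assumes OPENAI_API_KEY is set in the environment.
scorer = OpenAIModerationScorer()

evaluation = weave.Evaluation(dataset=dataset, scorers=[scorer])
print(asyncio.run(evaluation.evaluate(respond)))
```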
## EmbeddingSimilarityScorer

The EmbeddingSimilarityScorer computes the cosine similarity between the embeddings of the AI system’s output and a target text from your dataset. It is useful for measuring how similar the AI’s output is to a reference text. Use `column_map` to map the target column to a different name.

Parameters:

- `threshold` (float): The minimum cosine similarity score (between -1 and 1) needed to consider the two texts similar (defaults to `0.5`).
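A sketch with a hypothetical dataset whose reference text lives under `reference`, mapped onto the scorer’s `target` column via `column_map`. The mapping direction shown (scorer argument to dataset column) is my reading of the API and worth verifying for your Weave version.

```python
import asyncio

import weave
from weave.scorers import EmbeddingSimilarityScorer

weave.init("similarity-demo")  # hypothetical project name

dataset = [
    {"question": "What is the capital of France?",
     "reference": "Paris is the capital of France."},
]

@weave.op()
def answer(question: str) -> str:
    return "The capital of France is Paris."

scorer = EmbeddingSimilarityScorer(
    threshold=0.7,  # require a stricter match than the 0.5 default
    column_map={"target": "reference"},  # scorer argument -> dataset column
)

evaluation = weave.Evaluation(dataset=dataset, scorers=[scorer])
print(asyncio.run(evaluation.evaluate(answer)))
```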
## ValidJSONScorer

The ValidJSONScorer checks whether the AI system’s output is valid JSON. This scorer is useful when you expect the output to be in JSON format and need to verify its validity.
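A short sketch; no LLM is involved, so the scorer simply parses the output string. The import path and argument-free constructor are assumptions based on recent Weave versions.

```python
import asyncio

import weave
from weave.scorers import ValidJSONScorer

weave.init("valid-json-demo")  # hypothetical project name

dataset = [{"prompt": "Return a JSON object with a single key named ok."}]

@weave.op()
def generate(prompt: str) -> str:
    return '{"ok": true}'  # a well-formed JSON string

scorer = ValidJSONScorer()

evaluation = weave.Evaluation(dataset=dataset, scorers=[scorer])
print(asyncio.run(evaluation.evaluate(generate)))
```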
## ValidXMLScorer

The ValidXMLScorer checks whether the AI system’s output is valid XML. It is useful when expecting XML-formatted outputs.
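Usage mirrors the JSON scorer above; this sketch makes the same assumptions about the import path and constructor.

```python
import asyncio

import weave
from weave.scorers import ValidXMLScorer

weave.init("valid-xml-demo")  # hypothetical project name

dataset = [{"prompt": "Return a small XML note."}]

@weave.op()
def generate(prompt: str) -> str:
    return "<note><to>Ada</to><body>Hello</body></note>"

scorer = ValidXMLScorer()

evaluation = weave.Evaluation(dataset=dataset, scorers=[scorer])
print(asyncio.run(evaluation.evaluate(generate)))
```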
## PydanticScorer

The PydanticScorer validates the AI system’s output against a Pydantic model to ensure it adheres to a specified schema or data structure.
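A sketch that validates JSON output against a small Pydantic model. Passing the class via a `model` keyword argument is an assumption; consult your Weave version for the exact constructor.

```python
import asyncio

import weave
from pydantic import BaseModel
from weave.scorers import PydanticScorer

weave.init("pydantic-demo")  # hypothetical project name

class Person(BaseModel):
    name: str
    age: int

dataset = [{"prompt": "Return a JSON person with a name and an age."}]

@weave.op()
def generate(prompt: str) -> str:
    return '{"name": "Ada", "age": 36}'

# The `model` keyword is assumed; it names the schema to validate against.
scorer = PydanticScorer(model=Person)

evaluation = weave.Evaluation(dataset=dataset, scorers=[scorer])
print(asyncio.run(evaluation.evaluate(generate)))
```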
## RAGAS - ContextEntityRecallScorer

The ContextEntityRecallScorer estimates context recall by extracting entities from both the AI system’s output and the provided context, then computing the recall score. It is based on the RAGAS evaluation library.

How It Works:

- Uses an LLM to extract unique entities from the output and context and calculates recall.
- Recall indicates the proportion of important entities from the context that are captured in the output.
- Returns a dictionary with the recall score.
- Expects a `context` column in your dataset. Use the `column_map` attribute if the column name is different.
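A sketch in the context of an evaluation; the dataset provides the `context` column and the model produces the output whose entities are checked. Constructor arguments are assumptions for recent Weave versions.

```python
import asyncio

import weave
from weave.scorers import ContextEntityRecallScorer

weave.init("entity-recall-demo")  # hypothetical project name

dataset = [
    {"question": "When was the Eiffel Tower completed?",
     "context": "The Eiffel Tower, in Paris, France, was completed in 1889."},
]

@weave.op()
def answer(question: str, context: str) -> str:
    return "The Eiffel Tower in Paris was completed in 1889."

# Uses litellm defaults; pass model_id="..." to pick the extraction model.
scorer = ContextEntityRecallScorer()

evaluation = weave.Evaluation(dataset=dataset, scorers=[scorer])
print(asyncio.run(evaluation.evaluate(answer)))
```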
## RAGAS - ContextRelevancyScorer

The ContextRelevancyScorer evaluates the relevancy of the provided context to the AI system’s output. It is based on the RAGAS evaluation library.

How It Works:

- Uses an LLM to rate the relevancy of the context to the output on a scale from 0 to 1.
- Returns a dictionary with the `relevancy_score`.
- Expects a `context` column in your dataset. Use `column_map` if a different name is used.
- Customize the `relevancy_prompt` to define how relevancy is assessed.
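A sketch mirroring the recall example above; whether a custom `relevancy_prompt` is required or a default exists depends on your Weave version.

```python
import asyncio

import weave
from weave.scorers import ContextRelevancyScorer

weave.init("relevancy-demo")  # hypothetical project name

dataset = [
    {"question": "Who wrote Harry Potter?",
     "context": "Harry Potter is a fantasy series written by J.K. Rowling."},
]

@weave.op()
def answer(question: str, context: str) -> str:
    return "Harry Potter was written by J.K. Rowling."

# Uses litellm defaults; pass model_id="..." or relevancy_prompt="..." to customize.
scorer = ContextRelevancyScorer()

evaluation = weave.Evaluation(dataset=dataset, scorers=[scorer])
print(asyncio.run(evaluation.evaluate(answer)))
```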
All of the LLM-powered scorers above default to sensible litellm model identifiers (e.g., `openai/gpt-4o`, `openai/text-embedding-3-small`). If you wish to experiment with other providers, you can simply update the `model_id`; the exact model string depends on your provider and litellm setup. For example, to use an Anthropic model:
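```python
from weave.scorers import SummarizationScorer

# Any litellm-style identifier should work here; this particular Claude model
# string is an example and may need updating for your account.
summarization_scorer = SummarizationScorer(model_id="anthropic/claude-3-5-sonnet-20240620")
```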