Use language models to evaluate AI model outputs in Datawizz
The LLM as Judge evaluation method uses a language model to compare the outputs of your models against the original log output. This is particularly useful for tasks like conversation and creative writing, or other scenarios where traditional metrics such as exact string matching or ROUGE may not capture the nuanced quality of the output.
LLM as Judge treats the original log output as the baseline and compares each model's output against it using a language model's understanding of quality, relevance, and correctness. All scores range from 0 to 1, where 1 indicates high similarity or accuracy and 0 indicates poor performance.
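For intuition, here is a minimal sketch of the underlying pattern. It is not Datawizz's implementation, and `call_judge_model` is a hypothetical stand-in for whatever LLM client you use: the judge sees the input, the candidate output, and the baseline output, and its free-form reply is parsed into a score clamped to the 0-1 range.

```python
import re

def call_judge_model(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM call; swap in your own client."""
    return "0.85"  # canned reply so the sketch runs end to end

def judge_score(input_text: str, candidate_output: str, baseline_output: str) -> float:
    """Ask a judge model to grade the candidate against the baseline on a 0-1 scale."""
    prompt = (
        "Evaluate how well the candidate answered the input question/instruction "
        "given the baseline answer. Reply with a single number between 0 and 1.\n"
        f"Input: {input_text}\n"
        f"Candidate: {candidate_output}\n"
        f"Baseline: {baseline_output}"
    )
    reply = call_judge_model(prompt)
    match = re.search(r"\d*\.?\d+", reply)  # pull the first number out of the reply
    score = float(match.group()) if match else 0.0
    return max(0.0, min(1.0, score))        # clamp to the 0-1 range

print(judge_score("What is the capital of France?", "Paris.", "The capital of France is Paris."))
```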
When configuring LLM as Judge, you can select a task to get task-relevant prompt templates for evaluation. You can further refine the prompt by providing examples, specifying category names, or adding other specifications. The provided prompt templates are:
Question answering / instruction following:
Evaluate how well the candidate answered the input question/instruction given the baseline answer.
Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Classification:
Given the user input below, evaluate if the candidate output is a correct classification according to the baseline output classification.
Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Entity extraction:
Evaluate how well the candidate extracted the requested elements (as requested in the input) given the baseline extracted entities.
Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Code generation:
Evaluate how well the candidate generated the requested code (as requested in the input) given the baseline code.
Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}
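All templates use the `{{input}}`, `{{output}}`, and `{{baseline_output}}` placeholders, which are filled from each log being evaluated. As an illustrative sketch (the helper function is ours, not a Datawizz API), here is how the classification template could be rendered:

```python
CLASSIFICATION_TEMPLATE = (
    "Given the user input below, evaluate if the candidate output is a correct "
    "classification according to the baseline output classification.\n"
    "Input: {{input}}\n"
    "Candidate: {{output}}\n"
    "Baseline: {{baseline_output}}"
)

def render_template(template: str, input_text: str, output: str, baseline_output: str) -> str:
    """Substitute the log fields into an evaluation prompt template."""
    return (
        template
        .replace("{{input}}", input_text)
        .replace("{{output}}", output)
        .replace("{{baseline_output}}", baseline_output)
    )

print(render_template(
    CLASSIFICATION_TEMPLATE,
    input_text="I was double-charged for my subscription this month.",
    output="billing",
    baseline_output="billing",
))
```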
For more advanced evaluation scenarios, Datawizz offers Custom Metrics functionality that allows you to define multiple evaluation criteria for a single LLM as Judge evaluation. Instead of using a single prompt template, you can create multiple custom metrics, each with its own specific evaluation prompt.
Define multiple evaluation dimensions: Create separate metrics for different aspects of model performance (e.g., tone, truthfulness, brevity, comprehensiveness)
Get granular insights: Understand how models perform across various criteria rather than just an overall score
Visualize trade-offs: Use radar charts to compare model performance across all defined metrics
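Conceptually, a set of custom metrics is a mapping from metric name to its own judge prompt, and an evaluation run produces one 0-1 score per metric. The sketch below is illustrative only; the function and its signature are assumptions, not the Datawizz API:

```python
def evaluate_custom_metrics(
    metrics: dict[str, str],  # metric name -> judge prompt template
    input_text: str,
    output: str,
    judge,                    # callable: prompt -> float score in [0, 1]
) -> dict[str, float]:
    """Score one candidate output on every custom metric independently."""
    scores = {}
    for name, template in metrics.items():
        prompt = template.replace("{{input}}", input_text).replace("{{output}}", output)
        scores[name] = judge(prompt)  # one judge call per metric
    return scores
```

Because each metric gets its own judge call, a model can score well on one dimension and poorly on another.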
For example, a summarization evaluation might define three custom metrics:

Comprehensiveness:
Evaluate the provided summary and rank how comprehensive it is. Does it contain all the key facts pertaining to the conversation?
Conversation: {{input}}
Summary: {{output}}
Brevity:
Evaluate whether the summary is concise and avoids unnecessary details while maintaining clarity.
Original Conversation: {{input}}
Summary: {{output}}
Truthfulness:
Assess the factual accuracy of the summary. Are all statements in the summary supported by the original conversation?
Conversation: {{input}}
Summary: {{output}}
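The resulting per-metric scores can then be compared across models on a radar chart. Datawizz renders these charts for you; the snippet below is only an illustrative matplotlib sketch with made-up scores:

```python
import math
import matplotlib.pyplot as plt

# Made-up scores for illustration; in practice they come from the judge evaluation.
scores_by_model = {
    "model-a": {"comprehensiveness": 0.9, "brevity": 0.6, "truthfulness": 0.85},
    "model-b": {"comprehensiveness": 0.7, "brevity": 0.9, "truthfulness": 0.8},
}

metrics = list(next(iter(scores_by_model.values())))
angles = [2 * math.pi * i / len(metrics) for i in range(len(metrics))]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for model, scores in scores_by_model.items():
    values = [scores[m] for m in metrics]
    # Repeat the first point so each polygon closes.
    ax.plot(angles + angles[:1], values + values[:1], label=model)
    ax.fill(angles + angles[:1], values + values[:1], alpha=0.1)

ax.set_xticks(angles)
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)
ax.legend()
plt.show()
```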