The LLM as Judge evaluation method uses a language model to compare your models' outputs against the original logged output. This is particularly useful for conversation, creative writing, and other scenarios where traditional metrics such as exact string matching or ROUGE may not capture the nuanced quality of the output.

How LLM as Judge Works

LLM as Judge treats the original log output as the baseline and compares each model's output against it, drawing on the judge model's understanding of quality, relevance, and correctness. All scores range from 0 to 1, where 1 indicates high similarity or accuracy and 0 indicates poor performance.
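
To make the mechanics concrete, here is a minimal sketch of the judge-call pattern using the OpenAI Python SDK. The prompt wording, model name, and score parsing are illustrative assumptions, not the exact prompts or models Datawizz uses.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Compare the candidate output to the baseline output for the given input.
Reply with a single quality score between 0 and 1.

Input: {input}
Candidate: {output}
Baseline: {baseline_output}

Score:"""

def judge_score(input_text: str, candidate: str, baseline: str, model: str = "gpt-4o-mini") -> float:
    """Ask a judge model for a 0-1 score comparing a candidate output to the baseline."""
    prompt = JUDGE_PROMPT.format(input=input_text, output=candidate, baseline_output=baseline)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content
    match = re.search(r"\d*\.?\d+", text)  # pull the first number out of the reply
    return min(max(float(match.group()), 0.0), 1.0) if match else 0.0

# Example: score one candidate against the logged baseline
print(judge_score("Summarize the ticket", "User cannot log in.", "Customer reports login failures."))
```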

Task-Specific Prompt Templates

When configuring LLM as Judge, you can select a task to get a task-relevant prompt template for evaluation. You can further refine the prompt by providing examples, naming categories, or adding other task-specific instructions. The provided prompt templates are:

Question Answering

Evaluate how well the candidate answered the input question/instruction given the baseline answer.

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Classification

Given the user input below, evaluate if the candidate output is a correct classification according to the baseline output classification.

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Summarization

Evaluate how well the candidate summarized the input given the baseline summary.

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Entity Extraction

Evaluate how well the candidate extracted the requested elements (as requested in the input) given the baseline extracted entities.

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Code Generation

Evaluate how well the candidate generated the requested code (as requested in the input) given the baseline code.

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Generic

Evaluate how similar the candidate output is to the baseline output.

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}
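
All of the templates above share the same three placeholders. As a rough sketch (assuming plain string substitution; the rendering Datawizz performs may differ), filling a template for a single record looks like this:

```python
# The built-in templates all use the same three placeholders.
TEMPLATES = {
    "question_answering": "Evaluate how well the candidate answered the input question/instruction "
                          "given the baseline answer.\n\n"
                          "Input: {{input}}\nCandidate: {{output}}\nBaseline: {{baseline_output}}",
    "generic": "Evaluate how similar the candidate output is to the baseline output.\n\n"
               "Input: {{input}}\nCandidate: {{output}}\nBaseline: {{baseline_output}}",
}

def fill_template(template: str, input_text: str, output: str, baseline_output: str) -> str:
    """Substitute the {{...}} placeholders with the values from one evaluation record."""
    return (template
            .replace("{{input}}", input_text)
            .replace("{{output}}", output)
            .replace("{{baseline_output}}", baseline_output))

prompt = fill_template(TEMPLATES["generic"], "What is 2+2?", "4", "Four")
print(prompt)
```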

Custom Metrics

For more advanced evaluation scenarios, Datawizz offers Custom Metrics functionality that allows you to define multiple evaluation criteria for a single LLM as Judge evaluation. Instead of using a single prompt template, you can create multiple custom metrics, each with its own specific evaluation prompt.

How Custom Metrics Work

With Custom Metrics, you can:

  • Define multiple evaluation dimensions: Create separate metrics for different aspects of model performance (e.g., tone, truthfulness, brevity, comprehensiveness)
  • Get granular insights: Understand how models perform across various criteria rather than just an overall score
  • Visualize trade-offs: Use radar charts to compare model performance across all defined metrics

Example Use Case: Conversation Summary Evaluation

Consider evaluating conversation summarization models across multiple dimensions:

  • Comprehensiveness: Does the summary contain all key facts from the conversation?
  • Truthfulness: Is the summary factually accurate?
  • Brevity: Is the summary concise and to the point?
  • Tone: Does the summary maintain an appropriate tone?
  • Clarity: Is the summary easy to understand?

Setting Up Custom Metrics

  1. Select LLM as Judge as your evaluation function
  2. Choose a judge model (preferably a different model from those being evaluated)
  3. Enable Custom Metrics in the advanced options
  4. Add metrics by defining:
    • Metric Name: A descriptive name for the evaluation dimension
    • Evaluation Prompt: Specific instructions for the judge model on how to evaluate this metric
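
Conceptually, these steps produce a configuration like the sketch below. The field names and metric prompts here are hypothetical and only illustrate the shape of the setup; they are not the Datawizz configuration format.

```python
# Hypothetical representation of an LLM-as-Judge evaluation with custom metrics.
# Field names are illustrative, not the Datawizz API.
evaluation_config = {
    "evaluation_function": "llm_as_judge",
    "judge_model": "gpt-4o",  # ideally different from the models being evaluated
    "custom_metrics": [
        {
            "name": "Tone",
            "prompt": "Does the summary maintain an appropriate tone?\n\n"
                      "Conversation: {{input}}\nSummary: {{output}}",
        },
        {
            "name": "Clarity",
            "prompt": "Is the summary easy to understand?\n\n"
                      "Conversation: {{input}}\nSummary: {{output}}",
        },
    ],
}
```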

Example Custom Metric Prompts

Comprehensiveness:

Evaluate the provided summary and rate how comprehensive it is. Does it contain all the key facts pertaining to the conversation?

Conversation: {{input}}
Summary: {{output}}

Brevity:

Evaluate whether the summary is concise and avoids unnecessary details while maintaining clarity.

Original Conversation: {{input}}
Summary: {{output}}

Truthfulness:

Assess the factual accuracy of the summary. Are all statements in the summary supported by the original conversation?

Conversation: {{input}}
Summary: {{output}}
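
Putting the example prompts above to work, a per-metric evaluation loop looks roughly like the sketch below. The score_with_judge helper is a stand-in for a real judge-model call (see the sketch under "How LLM as Judge Works"); it is stubbed here so the loop runs offline.

```python
METRICS = {
    "comprehensiveness": "Evaluate the provided summary and rate how comprehensive it is. "
                         "Does it contain all the key facts pertaining to the conversation?\n\n"
                         "Conversation: {{input}}\nSummary: {{output}}",
    "brevity": "Evaluate whether the summary is concise and avoids unnecessary details while "
               "maintaining clarity.\n\nOriginal Conversation: {{input}}\nSummary: {{output}}",
    "truthfulness": "Assess the factual accuracy of the summary. Are all statements in the summary "
                    "supported by the original conversation?\n\nConversation: {{input}}\nSummary: {{output}}",
}

def score_with_judge(prompt: str) -> float:
    """Stand-in for a real judge-model call that returns a 0-1 score."""
    return 1.0  # replace with an actual judge call

def evaluate_sample(conversation: str, summary: str) -> dict[str, float]:
    """Return one 0-1 score per custom metric for a single sample."""
    scores = {}
    for name, template in METRICS.items():
        prompt = template.replace("{{input}}", conversation).replace("{{output}}", summary)
        scores[name] = score_with_judge(prompt)
    return scores

print(evaluate_sample("Agent: ... Customer: ...", "The customer asked about a refund."))
```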

Cost Considerations

Custom Metrics multiply the number of judge model calls:

  • Formula: Number of Metrics × Number of Models × Number of Evaluation Samples
  • Example: 5 metrics × 3 models × 20 samples = 300 judge model API calls

Plan accordingly for API costs and rate limits when using multiple custom metrics.
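
The call count is easy to sanity-check before launching an evaluation; the per-call cost below is a placeholder, so substitute your judge model's actual pricing:

```python
num_metrics = 5
num_models = 3
num_samples = 20

judge_calls = num_metrics * num_models * num_samples
print(judge_calls)  # 300

# Rough budget check (example rate only; use your judge model's real per-call cost)
est_cost_per_call_usd = 0.002
print(f"~${judge_calls * est_cost_per_call_usd:.2f} in judge-model calls")
```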

Visualization and Analysis

Custom Metrics results are displayed as:

  • Individual metric scores for each model
  • Radar charts showing performance across all metrics
  • Comparative analysis highlighting model strengths and weaknesses

This visualization helps identify which models excel in specific areas and understand trade-offs between different performance dimensions.
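
If you want to reproduce a similar radar view outside the platform, a minimal matplotlib sketch with made-up scores looks like this:

```python
import numpy as np
import matplotlib.pyplot as plt

metrics = ["Comprehensiveness", "Truthfulness", "Brevity", "Tone", "Clarity"]
# Hypothetical per-metric scores for two models
scores = {
    "model-a": [0.82, 0.91, 0.60, 0.75, 0.88],
    "model-b": [0.70, 0.85, 0.90, 0.80, 0.72],
}

angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, values in scores.items():
    vals = values + values[:1]
    ax.plot(angles, vals, label=name)
    ax.fill(angles, vals, alpha=0.15)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)
ax.legend(loc="upper right")
plt.show()
```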

Best Practices

  • Choose the right task template: Select the prompt template that best matches your use case for more accurate evaluations
  • Provide examples: When possible, include examples in your custom prompts to guide the judge model
  • Be specific: Add detailed specifications about what constitutes good performance for your particular task
  • Consider the judge model: The quality of evaluation depends on the capabilities of the language model used as the judge
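
For example, a classification prompt refined with explicit category names and a worked example might look like the sketch below (the category names are hypothetical):

```python
refined_prompt = """Given the user input below, evaluate if the candidate output is a correct
classification according to the baseline output classification.
Valid categories are: billing, technical_issue, account, other.

Example of a correct match:
Input: "I was charged twice this month."
Candidate: billing
Baseline: billing

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}"""

print(refined_prompt)
```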

When to Use LLM as Judge

LLM as Judge is particularly effective for:

  • Conversational AI: Evaluating chatbot responses for helpfulness, relevance, and appropriateness
  • Creative tasks: Assessing creative writing, storytelling, or content generation
  • Complex reasoning: Evaluating multi-step reasoning or explanation quality
  • Subjective quality: When traditional metrics can’t capture the nuanced aspects of good output
  • Open-ended tasks: Where there may be multiple correct answers or approaches

Limitations

  • Cost: LLM as Judge requires additional API calls to the judge model, which can increase evaluation costs
  • Consistency: Results may vary between different judge models or even between runs with the same model
  • Bias: The judge model may have inherent biases that affect evaluation scores
  • Speed: Evaluation takes longer compared to traditional metrics due to the need for model inference