Datawizz provides a comprehensive set of built-in evaluators for common use cases, but sometimes you need more specialized evaluation logic. Custom evaluators allow you to implement domain-specific evaluation criteria, complex scoring algorithms, or novel assessment methods tailored to your unique requirements.

Custom evaluators bridge the gap between generic metrics and your specific business needs. Whether you’re evaluating specialized technical content, implementing proprietary scoring methods, or assessing outputs against custom quality criteria, custom evaluators give you the flexibility to define exactly what “good” means for your use case.

There are two types of custom evaluators you can create in Datawizz:

  1. Code Evaluators: Python-based evaluators that give you full programmatic control
  2. LLM as Judge Evaluators: Prompt-based evaluators that leverage language models for nuanced assessment

Code Evaluators

Code evaluators provide the ultimate flexibility for custom evaluation logic. They run in a secure Python environment and can implement any evaluation algorithm you need, from simple rule-based checks to complex machine learning models.

Runtime Environment

Your Python code runs in a sandboxed environment with the following specifications:

  • Python runtime: 3.13
  • Execution time limit: 5 seconds per evaluation
  • Memory limit: 512 MB per evaluation
  • Pre-installed libraries:
    • transformers==4.39.3 - For working with transformer models and tokenizers
    • datasets==3.2.0 - Dataset manipulation and loading
    • importlib-metadata==8.6.1 - Package metadata utilities
    • chevron==0.14.0 - Logic-less templating
    • evaluate==0.4.0 - Evaluation metrics library
    • rouge-score==0.1.2 - ROUGE metrics for text summarization
    • jiwer==3.0.5 - Word Error Rate and other speech recognition metrics
    • sacrebleu==2.5.1 - BLEU score for machine translation
    • bert_score==0.3.13 - Semantic similarity using BERT embeddings
    • scikit-learn==1.6.1 - Machine learning algorithms and metrics
    • numpy==1.26.4 - Numerical computing
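
These packages can be imported directly in your evaluator code. As a quick illustration, here is a minimal sketch using the pre-installed rouge-score package (shown standalone here; in practice the call would sit inside your evaluator function):

from rouge_score import rouge_scorer

# Compare a candidate summary against a reference using ROUGE-L
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
scores = scorer.score('the cat sat on the mat', 'a cat was sitting on the mat')
rouge_l_f1 = scores['rougeL'].fmeasure  # F1-style score between 0 and 1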

Function Signature

All code evaluators must implement the following function signature:

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    # Your evaluation logic here
    pass

Parameters:

  • inputs: List of input data that was sent to the model
  • outputs: Dictionary containing the model’s response (typically includes a ‘content’ key)
  • reference_outputs: Dictionary containing the expected/reference output for comparison
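
For a concrete sense of the shape of these arguments, here is an illustrative invocation. The exact structure of inputs depends on how your logs were captured, so treat the chat-message layout below as an assumption; outputs and reference_outputs follow the 'content' convention used by the examples on this page:

# Hypothetical example data, for illustration only
inputs = [{"role": "user", "content": "What is the capital of France?"}]
outputs = {"content": "The capital of France is Paris."}
reference_outputs = {"content": "Paris"}

# The platform calls your function roughly like this
result = evaluator(inputs, outputs, reference_outputs)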

Return Types

Your evaluator function can return different types depending on your evaluation needs:

Boolean Returns

Perfect for pass/fail evaluations:

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    # Check if output contains required keywords
    required_keywords = ["safety", "compliance"]
    output_text = outputs.get('content', '').lower()
    return all(keyword in output_text for keyword in required_keywords)

Numeric Scores (0-1)

Ideal for continuous scoring where 1 represents perfect performance:

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    # Calculate lexical similarity (Jaccard index) using word overlap
    output_words = set(outputs.get('content', '').lower().split())
    reference_words = set(reference_outputs.get('content', '').lower().split())
    
    if not reference_words:
        return 0.0
    
    intersection = len(output_words.intersection(reference_words))
    union = len(output_words.union(reference_words))
    
    return intersection / union if union > 0 else 0.0

String Returns

Useful for classification or categorical evaluation:

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    # Classify response tone
    content = outputs.get('content', '').lower()
    
    if any(word in content for word in ['excellent', 'amazing', 'fantastic']):
        return 'positive'
    elif any(word in content for word in ['terrible', 'awful', 'horrible']):
        return 'negative'
    else:
        return 'neutral'

Dictionary Returns

Perfect for multi-dimensional evaluation:

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    content = outputs.get('content', '')

    # Evaluate multiple aspects; the calculate_* and assess_* helpers are
    # placeholder scoring functions that you define alongside this evaluator
    return {
        'clarity': calculate_clarity_score(content),
        'completeness': calculate_completeness_score(content, reference_outputs.get('content', '')),
        'accuracy': calculate_factual_accuracy(content),
        'tone': assess_tone_appropriateness(content)
    }

Example: checking single-letter answers from multiple-choice questions:

import re

def extract_answer_letter(s):
    """
    Extract a single letter (A-Z) answer from a noisy string.
    Returns the uppercase letter if found, else None.
    """
    if not isinstance(s, str):
        return None

    match = re.match(r"""
        [\s"'(\[]*
        ([A-Za-z])
        [)\]'"\.]*
        (\s|$)
        """, s, re.VERBOSE)
    if match:
        return match.group(1).upper()
    return None

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    return extract_answer_letter(outputs['content']) == extract_answer_letter(reference_outputs['content'])

This evaluator handles the common scenario where a model wraps its answer to a multiple-choice question in extra punctuation or a short continuation, such as "(B) because…" or "B) Machine Learning", and compares just the extracted letter against the reference answer.

Best Practices for Code Evaluators

Keep Evaluators Focused and Reusable

Design small, single-purpose evaluators that can be combined for comprehensive evaluation:

# Good: Focused evaluator for length checking
def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    content = outputs.get('content', '')
    min_length = 50  # characters
    max_length = 500
    
    length = len(content)
    if length < min_length:
        return 0.0
    elif length > max_length:
        return max(0.0, 1.0 - (length - max_length) / max_length)
    else:
        return 1.0

Handle Edge Cases Gracefully

Always account for missing data, empty responses, or unexpected formats:

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    # Robust handling of potential issues
    content = outputs.get('content')
    if not content or not isinstance(content, str):
        return 0.0
    
    reference_content = reference_outputs.get('content')
    if not reference_content or not isinstance(reference_content, str):
        return 0.0
    
    # Your evaluation logic here; calculate_similarity is a placeholder for
    # whatever comparison function you implement
    return calculate_similarity(content, reference_content)

Use Multiple Evaluators for Complex Assessment

Rather than creating one monolithic evaluator, define several focused evaluators and run them together, for example one for response length, one for output format, and one for factual accuracy. Individual scores are easier to interpret and debug than a single combined number.

Performance Note

Testing your evaluator in the Datawizz interface may be slower than production execution due to the additional overhead of the testing environment. The actual evaluation runs will be significantly faster.

Common Use Cases for Code Evaluators

Format Validation

Ensure outputs follow specific formatting requirements:

import json

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    content = outputs.get('content', '')
    
    try:
        # Check if output is valid JSON
        parsed = json.loads(content)
        
        # Check for required fields
        required_fields = ['name', 'age', 'email']
        has_all_fields = all(field in parsed for field in required_fields)
        
        return has_all_fields
    except json.JSONDecodeError:
        return False

Domain-Specific Validation

Implement business logic specific to your domain:

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    content = outputs.get('content', '').lower()
    
    # Medical safety check - ensure no dangerous advice
    dangerous_phrases = [
        'stop taking medication',
        'ignore doctor advice',
        'medical emergency can wait'
    ]
    
    contains_dangerous_advice = any(phrase in content for phrase in dangerous_phrases)
    
    return not contains_dangerous_advice  # Return False if dangerous content found

Advanced Similarity Metrics

Implement sophisticated comparison algorithms:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    output_text = outputs.get('content', '')
    reference_text = reference_outputs.get('content', '')
    
    if not output_text or not reference_text:
        return 0.0
    
    # Create TF-IDF vectors
    vectorizer = TfidfVectorizer()
    texts = [output_text, reference_text]
    
    try:
        tfidf_matrix = vectorizer.fit_transform(texts)
        similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
        return float(similarity)
    except Exception:  # e.g. TfidfVectorizer raises ValueError on an empty vocabulary
        return 0.0

LLM as Judge Evaluators

The LLM as Judge evaluation method represents a paradigm shift in how we assess AI-generated content. Instead of relying solely on traditional metrics like BLEU scores or exact string matching, this approach leverages the sophisticated understanding of language models to evaluate outputs in a more nuanced, human-like manner.

This method is particularly powerful for tasks where the quality of output can’t be easily quantified through simple rules or mathematical formulas. Creative writing, conversational responses, complex reasoning, and subjective assessments all benefit from the contextual understanding that LLM judges provide.

How LLM as Judge Works

The evaluation process follows these steps:

  1. Baseline Establishment: The original log output serves as the reference standard
  2. Prompt Construction: A carefully crafted prompt instructs the judge model on evaluation criteria
  3. Contextual Assessment: The judge model receives the input, candidate output, and baseline output
  4. Scoring: The judge returns a score between 0 and 1, often with reasoning for the judgment
  5. Aggregation: Multiple evaluations can be combined for comprehensive assessment

Key Principles:

  • Scores range from 0 to 1: 1 indicates perfect alignment with the baseline, 0 indicates poor performance
  • Contextual Understanding: The judge considers nuance, intent, and quality beyond surface-level matching
  • Consistency: While individual judgments may vary slightly, the overall evaluation trends remain stable
  • Transparency: Many judge models can provide reasoning for their scores

Task-Specific Prompt Templates

Datawizz provides pre-configured prompt templates optimized for different types of evaluation tasks. Each template is designed to elicit the most accurate and relevant judgments from the LLM judge. You can use these templates as starting points and customize them with specific criteria, examples, or domain knowledge.

Question Answering

Evaluates how well the model answered a question compared to a reference answer. This template considers correctness, completeness, and relevance.

Evaluate how well the candidate answered the input question/instruction given the baseline answer.

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Please provide a score from 0 to 1 where:
- 1.0: Perfect answer that matches or exceeds the baseline quality
- 0.8-0.9: Very good answer with minor differences from baseline
- 0.6-0.7: Good answer but missing some important details
- 0.4-0.5: Partially correct but significant gaps or errors
- 0.2-0.3: Poor answer with major issues
- 0.0-0.1: Completely incorrect or irrelevant answer
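
For intuition, here is a rough sketch of how mustache-style placeholders such as {{input}}, {{output}}, and {{baseline_output}} get filled with a log entry's data before the prompt reaches the judge model. The sketch uses the chevron library (the same Mustache renderer that ships in the code-evaluator runtime); the judge pipeline's actual rendering is an internal detail, and the field values below are illustrative:

import chevron

template = """Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}"""

# Fill the placeholders with one evaluation example (values are illustrative)
rendered = chevron.render(template, {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "baseline_output": "Paris",
})
print(rendered)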

Classification

Assesses whether the model correctly classified the input according to the expected category or label.

Given the user input below, evaluate if the candidate output is a correct classification according to the baseline output classification.

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Score 1.0 if the classifications match exactly, 0.5 for partially correct classifications, and 0.0 for completely incorrect classifications.

Summarization

Evaluates the quality of summaries by comparing key information retention, conciseness, and accuracy.

Evaluate how well the candidate summarized the input given the baseline summary.

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Consider the following criteria:
- Information retention: Are key points preserved?
- Conciseness: Is the summary appropriately brief?
- Accuracy: Are facts correctly represented?
- Coherence: Is the summary well-structured and readable?

Provide a score from 0 to 1 reflecting overall summary quality.

Entity Extraction

Assesses how accurately the model identified and extracted requested entities from the input.

Evaluate how well the candidate extracted the requested elements (as requested in the input) given the baseline extracted entities.

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Score based on:
- Completeness: Were all entities found?
- Accuracy: Are the extracted entities correct?
- Format: Is the output properly formatted?

Code Generation

Evaluates generated code for correctness, efficiency, and adherence to requirements.

Evaluate how well the candidate generated the requested code (as requested in the input) given the baseline code.

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Consider:
- Functional correctness: Does the code work as intended?
- Code quality: Is it well-structured and readable?
- Requirement adherence: Does it meet the specified requirements?
- Best practices: Does it follow coding conventions?

Generic

A flexible template for general similarity assessment when specific task templates don’t apply.

Evaluate how similar the candidate output is to the baseline output.

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Assess overall similarity considering meaning, structure, and quality. Provide a score from 0 to 1.

Best Practices for LLM as Judge

Choose the Right Task Template

Selecting the appropriate template significantly impacts evaluation quality. Match your use case to the most relevant template:

  • Question Answering: For factual Q&A, reasoning tasks, or instructional responses
  • Classification: For categorization, sentiment analysis, or labeling tasks
  • Summarization: For content condensation or key point extraction
  • Entity Extraction: For named entity recognition or information extraction
  • Code Generation: For programming tasks or technical documentation
  • Generic: When no specific template fits or for novel evaluation criteria

Provide Clear Examples and Context

Enhance your prompts with specific examples to guide the judge model:

Evaluate the candidate's explanation of machine learning concepts.

Example of excellent explanation (score 1.0):
"Machine learning uses algorithms to find patterns in data and make predictions..."

Example of poor explanation (score 0.2): 
"ML is when computers learn stuff from data and do things..."

Input: {{input}}
Candidate: {{output}}
Baseline: {{baseline_output}}

Be Specific About Evaluation Criteria

Define exactly what constitutes good performance for your domain:

Evaluate the customer service response quality. Consider:

1. Empathy (30%): Shows understanding of customer's situation
2. Accuracy (40%): Provides correct information and solutions  
3. Professionalism (20%): Maintains appropriate tone and language
4. Completeness (10%): Addresses all aspects of the customer's query

Score from 0 to 1 based on these weighted criteria.

Consider the Judge Model Capabilities

The quality of evaluation depends heavily on the judge model’s capabilities:

  • Use capable models: More advanced models generally provide better judgments
  • Match model to task: Some models excel at specific types of evaluation
  • Test your prompts: Validate that the judge model understands your criteria
  • Monitor consistency: Check that similar inputs receive similar scores

Account for Subjectivity

LLM judges may have different perspectives on subjective matters:

Evaluate the creativity of this story opening. Note that creativity can be subjective, 
so focus on concrete elements:

- Originality of concept or approach
- Unexpected or innovative elements  
- Engaging narrative voice
- Vivid and imaginative details

Provide reasoning for your score along with the numerical rating.

When to Use LLM as Judge

LLM as Judge evaluation shines in scenarios where traditional metrics fall short. Here are the key use cases where this approach provides the most value:

Conversational AI and Chatbots

Traditional metrics can’t capture the nuances of natural conversation:

  • Helpfulness: Does the response actually help the user?
  • Appropriateness: Is the tone and content suitable for the context?
  • Engagement: Does the response encourage continued interaction?
  • Empathy: Does the bot show understanding of user emotions?

Example: A customer asking “I’m frustrated with my order delay” needs empathetic acknowledgment, not just factual shipping information.

Creative and Subjective Content

When creativity and style matter more than factual accuracy:

  • Creative writing: Stories, poems, marketing copy
  • Content generation: Blog posts, social media content, product descriptions
  • Artistic critique: Evaluating creative elements in generated content
  • Style adaptation: Matching specific writing styles or brand voices

Example: A marketing tagline’s effectiveness depends on creativity, memorability, and brand alignment—qualities only human-like judgment can assess.

Complex Reasoning and Explanation

For tasks requiring multi-step thinking and clear communication:

  • Educational content: Explanations of complex concepts
  • Problem-solving: Step-by-step reasoning processes
  • Technical documentation: Clarity and completeness of instructions
  • Analytical reports: Quality of insights and recommendations

Example: Explaining quantum physics requires not just accuracy but also appropriate analogies and progressive concept building.

Open-Ended and Subjective Tasks

When there are multiple valid approaches or subjective quality criteria:

  • Advisory responses: Personal recommendations or guidance
  • Opinion pieces: Balanced argumentation and perspective
  • Design critiques: Aesthetic and functional evaluation
  • Strategic planning: Quality of strategic thinking and recommendations

Example: Career advice must be personalized, contextually appropriate, and consider multiple factors—qualities that traditional metrics can’t capture.

Quality Beyond Correctness

When correctness alone isn’t sufficient:

  • Professional communication: Email drafting, formal correspondence
  • Customer service: Response quality beyond just solving the problem
  • Content moderation: Nuanced judgment about appropriateness
  • Translation quality: Cultural adaptation beyond literal accuracy

Example: A technically correct but cold customer service response may score high on accuracy but low on customer satisfaction.

Limitations and Considerations

While LLM as Judge offers powerful evaluation capabilities, it’s important to understand its limitations and plan accordingly:

Cost Implications

  • API Usage: Each evaluation requires a call to the judge model, increasing operational costs
  • Volume Scaling: Large-scale evaluations can become expensive quickly
  • Model Selection: More capable judge models typically cost more per evaluation
  • Mitigation: Use LLM as Judge selectively for high-value evaluations; combine with cheaper automated metrics where appropriate

Consistency Challenges

  • Inter-run Variability: The same input may receive slightly different scores across evaluations
  • Model Dependencies: Different judge models may have varying scoring patterns
  • Prompt Sensitivity: Small changes in prompts can significantly affect evaluation outcomes
  • Mitigation: Use multiple evaluations and statistical averaging (see the sketch below); establish baseline consistency metrics; test prompt variations thoroughly
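
A minimal sketch of such a consistency check, assuming you have collected repeated judge scores for the same log (all values below are illustrative):

import numpy as np

# Scores from running the same judge prompt on the same log several times
repeat_scores = [0.82, 0.78, 0.85, 0.80, 0.83]

mean_score = float(np.mean(repeat_scores))
score_std = float(np.std(repeat_scores))

# Flag the configuration if run-to-run spread exceeds your tolerance
CONSISTENCY_TOLERANCE = 0.1  # assumed threshold; tune for your use case
is_consistent = score_std <= CONSISTENCY_TOLERANCE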

Inherent Biases

  • Training Data Bias: Judge models reflect biases present in their training data
  • Cultural Perspectives: Models may favor certain cultural or linguistic patterns
  • Domain Limitations: Judges may perform poorly in specialized domains they weren’t trained on
  • Subjective Interpretations: What constitutes “quality” may vary between different judge models
  • Mitigation: Use diverse judge models; regularly audit evaluation results for bias; incorporate human validation for critical assessments

Performance Considerations

  • Evaluation Speed: LLM calls are slower than traditional metrics, affecting evaluation throughput
  • Network Dependencies: Requires stable internet connection and API availability
  • Rate Limiting: API rate limits may constrain evaluation speed
  • Latency Variability: Response times can vary based on model load and complexity
  • Mitigation: Implement async evaluation pipelines; use local models where possible; design evaluation workflows to handle latency

Quality Dependencies

  • Judge Model Capability: Evaluation quality is fundamentally limited by the judge model’s abilities
  • Prompt Engineering: Poor prompts lead to poor evaluations regardless of judge model quality
  • Context Limitations: Very long inputs may exceed model context windows
  • Domain Expertise: Judges may lack specialized knowledge for technical domains
  • Mitigation: Select appropriate judge models for your domain; invest in prompt engineering; break down complex evaluations into smaller components

Hybrid Evaluation Strategies

For optimal results, consider combining LLM as Judge with other evaluation methods:

Multi-Tier Evaluation

  1. Automated Metrics: Fast, cheap screening (BLEU, ROUGE, exact match)
  2. LLM as Judge: Nuanced quality assessment for promising candidates
  3. Human Review: Final validation for critical decisions

Complementary Metrics

  • Use traditional metrics for factual accuracy
  • Use LLM as Judge for style, tone, and appropriateness
  • Use human evaluation for final quality assurance

Adaptive Evaluation

  • Start with cheap automated metrics
  • Escalate to LLM as Judge based on automated metric thresholds
  • Reserve human evaluation for edge cases and high-stakes decisions

This approach balances cost, speed, and evaluation quality while maximizing the benefits of each evaluation method.
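
As a rough illustration of the multi-tier and adaptive patterns above, here is a sketch of an escalation flow. The helper names run_automated_metrics, run_llm_judge, and queue_for_human_review are hypothetical placeholders for whatever metric, judge, and review tooling you use:

def evaluate_candidate(candidate, reference):
    # Tier 1: fast, cheap screening (e.g. exact match, ROUGE)
    auto_score = run_automated_metrics(candidate, reference)  # hypothetical helper
    if auto_score >= 0.9 or auto_score <= 0.1:
        # Clear pass or clear fail: no need for a more expensive judge
        return {"score": auto_score, "tier": "automated"}

    # Tier 2: nuanced assessment for the ambiguous middle
    judge_score = run_llm_judge(candidate, reference)  # hypothetical helper
    if 0.4 <= judge_score <= 0.6:
        # Tier 3: borderline cases are reserved for human review
        queue_for_human_review(candidate, reference)  # hypothetical helper
        return {"score": judge_score, "tier": "pending_human_review"}

    return {"score": judge_score, "tier": "llm_judge"}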

Choosing the Right Evaluation Approach

The choice between Code Evaluators and LLM as Judge depends on your specific needs, resources, and evaluation requirements:

Use Code Evaluators When:

  • Deterministic Logic: You need consistent, repeatable results
  • Performance Critical: Speed and cost efficiency are primary concerns
  • Clear Criteria: Success can be defined through explicit rules or calculations
  • Domain Expertise: You have specific knowledge that can be encoded programmatically
  • Integration Needs: You need to incorporate external APIs, databases, or complex algorithms

Use LLM as Judge When:

  • Subjective Assessment: Quality depends on nuanced human-like judgment
  • Complex Context: Evaluation requires understanding of implicit meaning or context
  • Creative Content: Assessing originality, style, or creative quality
  • Natural Language: Evaluating conversational or explanatory content
  • Rapid Prototyping: You need to quickly test evaluation concepts without coding

Hybrid Approaches:

Many successful evaluation strategies combine both methods:

  • Use code evaluators for objective criteria (format, length, factual accuracy)
  • Use LLM as Judge for subjective quality (tone, creativity, helpfulness)
  • Implement progressive evaluation: fast automated screening followed by detailed LLM assessment

Getting Started

  1. Define Your Evaluation Goals: Clearly articulate what “good” means for your specific use case
  2. Start Simple: Begin with basic evaluators and iterate based on results
  3. Test Thoroughly: Validate your evaluators against known good and bad examples
  4. Monitor Performance: Track evaluation consistency and adjust as needed
  5. Scale Thoughtfully: Consider cost and performance implications as you scale

Custom evaluators are powerful tools that let you align AI evaluation with your specific quality standards and business requirements. Whether you choose code-based logic or LLM-based judgment, the key is to match your evaluation method to your specific needs and constraints.