Custom Evaluators
Create and use custom evaluators in Datawizz
Datawizz provides a comprehensive set of built-in evaluators for common use cases, but sometimes you need more specialized evaluation logic. Custom evaluators allow you to implement domain-specific evaluation criteria, complex scoring algorithms, or novel assessment methods tailored to your unique requirements.
Custom evaluators bridge the gap between generic metrics and your specific business needs. Whether you’re evaluating specialized technical content, implementing proprietary scoring methods, or assessing outputs against custom quality criteria, custom evaluators give you the flexibility to define exactly what “good” means for your use case.
There are two types of custom evaluators you can create in Datawizz:
- Code Evaluators: Python-based evaluators that give you full programmatic control
- LLM as Judge Evaluators: Prompt-based evaluators that leverage language models for nuanced assessment
Code Evaluators
Code evaluators provide the ultimate flexibility for custom evaluation logic. They run in a secure Python environment and can implement any evaluation algorithm you need, from simple rule-based checks to complex machine learning models.
Runtime Environment
Your Python code runs in a sandboxed environment with the following specifications:
- Python runtime: 3.13
- Execution time limit: 5 seconds per evaluation
- Memory limit: 512 MB per evaluation
- Pre-installed libraries:
  - transformers==4.39.3: For working with transformer models and tokenizers
  - datasets==3.2.0: Dataset manipulation and loading
  - importlib-metadata==8.6.1: Package metadata utilities
  - chevron==0.14.0: Logic-less templating
  - evaluate==0.4.0: Evaluation metrics library
  - rouge-score==0.1.2: ROUGE metrics for text summarization
  - jiwer==3.0.5: Word Error Rate and other speech recognition metrics
  - sacrebleu==2.5.1: BLEU score for machine translation
  - bert_score==0.3.13: Semantic similarity using BERT embeddings
  - scikit-learn==1.6.1: Machine learning algorithms and metrics
  - numpy==1.26.4: Numerical computing
Function Signature
All code evaluators must implement the following function signature:
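A minimal sketch of the expected shape is shown below. The function name, type hints, and exact return conventions are assumptions based on the parameter and return-type descriptions in this page; treat it as a starting point rather than the definitive signature.

```python
def evaluate(inputs: list, outputs: dict, reference_outputs: dict):
    """Return a bool, a float in [0, 1], a string label, or a dict of named scores."""
    # Simplest possible check: exact match on the 'content' field.
    return outputs.get("content", "") == reference_outputs.get("content", "")
```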
Parameters:
- inputs: List of input data that was sent to the model
- outputs: Dictionary containing the model's response (typically includes a 'content' key)
- reference_outputs: Dictionary containing the expected/reference output for comparison
Return Types
Your evaluator function can return different types depending on your evaluation needs:
Boolean Returns
Perfect for pass/fail evaluations:
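For example, a minimal pass/fail check (assuming the signature sketched above and a 'content' key on both dictionaries) might look like:

```python
def evaluate(inputs: list, outputs: dict, reference_outputs: dict) -> bool:
    # Pass only when the model's answer matches the reference exactly,
    # ignoring surrounding whitespace.
    predicted = outputs.get("content", "").strip()
    expected = reference_outputs.get("content", "").strip()
    return predicted == expected
```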
Numeric Scores (0-1)
Ideal for continuous scoring where 1 represents perfect performance:
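A sketch of a continuous score, here a simple token-overlap F1 between output and reference (field names assumed as above; swap in whatever similarity measure fits your task):

```python
def evaluate(inputs: list, outputs: dict, reference_outputs: dict) -> float:
    # Token-overlap F1 between output and reference, naturally in [0, 1].
    predicted = set(outputs.get("content", "").lower().split())
    expected = set(reference_outputs.get("content", "").lower().split())
    if not predicted or not expected:
        return 0.0
    overlap = len(predicted & expected)
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```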
String Returns
Useful for classification or categorical evaluation:
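For instance, an evaluator can return a categorical verdict instead of a number (the labels here are illustrative):

```python
def evaluate(inputs: list, outputs: dict, reference_outputs: dict) -> str:
    # Bucket the response into a category rather than scoring it.
    content = outputs.get("content", "").strip()
    reference = reference_outputs.get("content", "").strip()
    if not content:
        return "empty"
    if content == reference:
        return "exact_match"
    return "mismatch"
```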
Dictionary Returns
Perfect for multi-dimensional evaluation:
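A sketch that scores several dimensions at once and returns them as named metrics (the metric names are illustrative):

```python
def evaluate(inputs: list, outputs: dict, reference_outputs: dict) -> dict:
    content = outputs.get("content", "")
    reference = reference_outputs.get("content", "")
    return {
        "exact_match": content.strip() == reference.strip(),
        "length_ratio": min(len(content) / max(len(reference), 1), 1.0),
        "non_empty": bool(content.strip()),
    }
```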
Example — checking single letter answers from multiple choice questions:
This evaluator handles the common scenario where models might return verbose responses to multiple choice questions like “The answer is (B) because…” or “B) Machine Learning” and extracts just the letter for comparison.
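One possible implementation, assuming answers use the letters A-E and the reference output contains the expected letter:

```python
import re

def evaluate(inputs: list, outputs: dict, reference_outputs: dict) -> bool:
    # Pull the first standalone choice letter (A-E) out of a possibly verbose
    # answer such as "The answer is (B) because..." and compare it to the
    # letter extracted from the reference output.
    def extract_letter(text: str) -> str | None:
        match = re.search(r"\b([A-E])\b", text.upper())
        return match.group(1) if match else None

    predicted = extract_letter(outputs.get("content", ""))
    expected = extract_letter(reference_outputs.get("content", ""))
    return predicted is not None and predicted == expected
```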
Best Practices for Code Evaluators
Keep Evaluators Focused and Reusable
Design small, single-purpose evaluators that can be combined for comprehensive evaluation:
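For example, a single-purpose length check that can sit alongside separate accuracy or format evaluators (the word budget is a hypothetical value):

```python
def evaluate(inputs: list, outputs: dict, reference_outputs: dict) -> bool:
    # This evaluator checks only the length budget; accuracy and formatting
    # are left to other, equally focused evaluators.
    MAX_WORDS = 150  # hypothetical budget; adjust per use case
    return len(outputs.get("content", "").split()) <= MAX_WORDS
```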
Handle Edge Cases Gracefully
Always account for missing data, empty responses, or unexpected formats:
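A defensive sketch that returns a zero score instead of raising when data is missing or malformed:

```python
def evaluate(inputs: list, outputs: dict, reference_outputs: dict) -> float:
    # Missing keys, None values, and empty strings all yield 0.0 rather than
    # an exception inside the sandbox.
    try:
        content = (outputs or {}).get("content") or ""
        reference = (reference_outputs or {}).get("content") or ""
        if not content.strip() or not reference.strip():
            return 0.0
        return 1.0 if content.strip().lower() == reference.strip().lower() else 0.0
    except Exception:
        return 0.0
```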
Use Multiple Evaluators for Complex Assessment
Rather than creating one monolithic evaluator, use multiple focused evaluators:
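As an illustration, each of the small checks below would be registered as its own evaluator rather than merged into one function. The function names are only for readability here; each evaluator still exposes whatever entry point your Datawizz workspace expects.

```python
import json

def is_valid_json(inputs: list, outputs: dict, reference_outputs: dict) -> bool:
    # Structural check only: does the output parse as JSON?
    try:
        json.loads(outputs.get("content", ""))
        return True
    except (json.JSONDecodeError, TypeError):
        return False

def within_length_budget(inputs: list, outputs: dict, reference_outputs: dict) -> bool:
    # Length check only; 2000 characters is a hypothetical budget.
    return 0 < len(outputs.get("content", "")) <= 2000

def matches_reference(inputs: list, outputs: dict, reference_outputs: dict) -> bool:
    # Accuracy check only: exact match against the reference.
    return outputs.get("content", "").strip() == reference_outputs.get("content", "").strip()
```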
Performance Note
Testing your evaluator in the Datawizz interface may be slower than production execution due to the additional overhead of the testing environment. The actual evaluation runs will be significantly faster.
Common Use Cases for Code Evaluators
Format Validation
Ensure outputs follow specific formatting requirements:
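A sketch that validates JSON structure and the presence of required fields (the field names "title" and "summary" are hypothetical):

```python
import json

def evaluate(inputs: list, outputs: dict, reference_outputs: dict) -> dict:
    # Check that the output parses as JSON and contains the expected fields.
    try:
        parsed = json.loads(outputs.get("content", ""))
    except (json.JSONDecodeError, TypeError):
        return {"valid_json": False, "has_required_fields": False}
    has_fields = isinstance(parsed, dict) and {"title", "summary"}.issubset(parsed)
    return {"valid_json": True, "has_required_fields": has_fields}
```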
Domain-Specific Validation
Implement business logic specific to your domain:
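An illustrative compliance check for a hypothetical customer-support domain (the specific rules are assumptions, not built-in Datawizz behavior):

```python
import re

def evaluate(inputs: list, outputs: dict, reference_outputs: dict) -> dict:
    # Hypothetical rules: never promise a guaranteed refund and never echo
    # an email address back to the user.
    content = outputs.get("content", "")
    return {
        "no_refund_promise": "guaranteed refund" not in content.lower(),
        "no_email_leak": re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", content) is None,
    }
```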
Advanced Similarity Metrics
Implement sophisticated comparison algorithms:
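For example, the pre-installed rouge-score package can provide a ROUGE-L F-measure, which already falls in the 0-1 range:

```python
from rouge_score import rouge_scorer  # pre-installed as rouge-score==0.1.2

def evaluate(inputs: list, outputs: dict, reference_outputs: dict) -> float:
    # ROUGE-L F-measure between the candidate output and the reference.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    result = scorer.score(
        reference_outputs.get("content", ""),  # target text
        outputs.get("content", ""),            # prediction
    )
    return result["rougeL"].fmeasure
```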
LLM as Judge Evaluators
The LLM as Judge evaluation method represents a paradigm shift in how we assess AI-generated content. Instead of relying solely on traditional metrics like BLEU scores or exact string matching, this approach leverages the sophisticated understanding of language models to evaluate outputs in a more nuanced, human-like manner.
This method is particularly powerful for tasks where the quality of output can’t be easily quantified through simple rules or mathematical formulas. Creative writing, conversational responses, complex reasoning, and subjective assessments all benefit from the contextual understanding that LLM judges provide.
How LLM as Judge Works
The evaluation process follows these steps:
- Baseline Establishment: The original log output serves as the reference standard
- Prompt Construction: A carefully crafted prompt instructs the judge model on evaluation criteria
- Contextual Assessment: The judge model receives the input, candidate output, and baseline output
- Scoring: The judge returns a score between 0 and 1, with detailed reasoning
- Aggregation: Multiple evaluations can be combined for comprehensive assessment
Key Principles:
- Scores range from 0 to 1: 1 indicates perfect alignment with the baseline, 0 indicates poor performance
- Contextual Understanding: The judge considers nuance, intent, and quality beyond surface-level matching
- Consistency: While individual judgments may vary slightly, the overall evaluation trends remain stable
- Transparency: Many judge models can provide reasoning for their scores
Task-Specific Prompt Templates
Datawizz provides pre-configured prompt templates optimized for different types of evaluation tasks. Each template is designed to elicit the most accurate and relevant judgments from the LLM judge. You can use these templates as starting points and customize them with specific criteria, examples, or domain knowledge.
Question Answering
Evaluates how well the model answered a question compared to a reference answer. This template considers correctness, completeness, and relevance.
Classification
Assesses whether the model correctly classified the input according to the expected category or label.
Summarization
Evaluates the quality of summaries by comparing key information retention, conciseness, and accuracy.
Entity Extraction
Assesses how accurately the model identified and extracted requested entities from the input.
Code Generation
Evaluates generated code for correctness, efficiency, and adherence to requirements.
Generic
A flexible template for general similarity assessment when specific task templates don’t apply.
Best Practices for LLM as Judge
Choose the Right Task Template
Selecting the appropriate template significantly impacts evaluation quality. Match your use case to the most relevant template:
- Question Answering: For factual Q&A, reasoning tasks, or instructional responses
- Classification: For categorization, sentiment analysis, or labeling tasks
- Summarization: For content condensation or key point extraction
- Entity Extraction: For named entity recognition or information extraction
- Code Generation: For programming tasks or technical documentation
- Generic: When no specific template fits or for novel evaluation criteria
Provide Clear Examples and Context
Enhance your prompts with specific examples to guide the judge model:
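For example, an illustrative addition to a question-answering judge prompt (not a built-in template) might spell out what full, partial, and zero credit look like:

```text
Example of a 1.0 response: restates every key fact from the reference, even if worded differently.
Example of a 0.5 response: covers the main fact but omits supporting details.
Example of a 0.0 response: contradicts the reference or answers a different question.
```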
Be Specific About Evaluation Criteria
Define exactly what constitutes good performance for your domain:
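An illustrative set of criteria for a hypothetical policy-QA task (thresholds and rules are examples, not defaults):

```text
Score 1.0 only if the response (1) cites the correct policy section, (2) uses plain,
non-legal language, and (3) stays under 150 words. Subtract 0.25 for each criterion missed.
```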
Consider the Judge Model Capabilities
The quality of evaluation depends heavily on the judge model’s capabilities:
- Use capable models: More advanced models generally provide better judgments
- Match model to task: Some models excel at specific types of evaluation
- Test your prompts: Validate that the judge model understands your criteria
- Monitor consistency: Check that similar inputs receive similar scores
Account for Subjectivity
LLM judges may have different perspectives on subjective matters. When criteria are subjective, average multiple judgments and spot-check scores against human ratings rather than relying on a single verdict.
When to Use LLM as Judge
LLM as Judge evaluation shines in scenarios where traditional metrics fall short. Here are the key use cases where this approach provides the most value:
Conversational AI and Chatbots
Traditional metrics can’t capture the nuances of natural conversation:
- Helpfulness: Does the response actually help the user?
- Appropriateness: Is the tone and content suitable for the context?
- Engagement: Does the response encourage continued interaction?
- Empathy: Does the bot show understanding of user emotions?
Example: A customer asking “I’m frustrated with my order delay” needs empathetic acknowledgment, not just factual shipping information.
Creative and Subjective Content
When creativity and style matter more than factual accuracy:
- Creative writing: Stories, poems, marketing copy
- Content generation: Blog posts, social media content, product descriptions
- Artistic critique: Evaluating creative elements in generated content
- Style adaptation: Matching specific writing styles or brand voices
Example: A marketing tagline’s effectiveness depends on creativity, memorability, and brand alignment—qualities only human-like judgment can assess.
Complex Reasoning and Explanation
For tasks requiring multi-step thinking and clear communication:
- Educational content: Explanations of complex concepts
- Problem-solving: Step-by-step reasoning processes
- Technical documentation: Clarity and completeness of instructions
- Analytical reports: Quality of insights and recommendations
Example: Explaining quantum physics requires not just accuracy but also appropriate analogies and progressive concept building.
Open-Ended and Subjective Tasks
When there are multiple valid approaches or subjective quality criteria:
- Advisory responses: Personal recommendations or guidance
- Opinion pieces: Balanced argumentation and perspective
- Design critiques: Aesthetic and functional evaluation
- Strategic planning: Quality of strategic thinking and recommendations
Example: Career advice must be personalized, contextually appropriate, and consider multiple factors—qualities that traditional metrics can’t capture.
Quality Beyond Correctness
When correctness alone isn’t sufficient:
- Professional communication: Email drafting, formal correspondence
- Customer service: Response quality beyond just solving the problem
- Content moderation: Nuanced judgment about appropriateness
- Translation quality: Cultural adaptation beyond literal accuracy
Example: A technically correct but cold customer service response may score high on accuracy but low on customer satisfaction.
Limitations and Considerations
While LLM as Judge offers powerful evaluation capabilities, it’s important to understand its limitations and plan accordingly:
Cost Implications
- API Usage: Each evaluation requires a call to the judge model, increasing operational costs
- Volume Scaling: Large-scale evaluations can become expensive quickly
- Model Selection: More capable judge models typically cost more per evaluation
- Mitigation: Use LLM as Judge selectively for high-value evaluations; combine with cheaper automated metrics where appropriate
Consistency Challenges
- Inter-run Variability: The same input may receive slightly different scores across evaluations
- Model Dependencies: Different judge models may have varying scoring patterns
- Prompt Sensitivity: Small changes in prompts can significantly affect evaluation outcomes
- Mitigation: Use multiple evaluations and statistical averaging; establish baseline consistency metrics; test prompt variations thoroughly
Inherent Biases
- Training Data Bias: Judge models reflect biases present in their training data
- Cultural Perspectives: Models may favor certain cultural or linguistic patterns
- Domain Limitations: Judges may perform poorly in specialized domains they weren’t trained on
- Subjective Interpretations: What constitutes “quality” may vary between different judge models
- Mitigation: Use diverse judge models; regularly audit evaluation results for bias; incorporate human validation for critical assessments
Performance Considerations
- Evaluation Speed: LLM calls are slower than traditional metrics, affecting evaluation throughput
- Network Dependencies: Requires stable internet connection and API availability
- Rate Limiting: API rate limits may constrain evaluation speed
- Latency Variability: Response times can vary based on model load and complexity
- Mitigation: Implement async evaluation pipelines; use local models where possible; design evaluation workflows to handle latency
Quality Dependencies
- Judge Model Capability: Evaluation quality is fundamentally limited by the judge model’s abilities
- Prompt Engineering: Poor prompts lead to poor evaluations regardless of judge model quality
- Context Limitations: Very long inputs may exceed model context windows
- Domain Expertise: Judges may lack specialized knowledge for technical domains
- Mitigation: Select appropriate judge models for your domain; invest in prompt engineering; break down complex evaluations into smaller components
Hybrid Evaluation Strategies
For optimal results, consider combining LLM as Judge with other evaluation methods:
Multi-Tier Evaluation
- Automated Metrics: Fast, cheap screening (BLEU, ROUGE, exact match)
- LLM as Judge: Nuanced quality assessment for promising candidates
- Human Review: Final validation for critical decisions
Complementary Metrics
- Use traditional metrics for factual accuracy
- Use LLM as Judge for style, tone, and appropriateness
- Use human evaluation for final quality assurance
Adaptive Evaluation
- Start with cheap automated metrics
- Escalate to LLM as Judge based on automated metric thresholds
- Reserve human evaluation for edge cases and high-stakes decisions
This approach balances cost, speed, and evaluation quality while maximizing the benefits of each evaluation method.
Choosing the Right Evaluation Approach
The choice between Code Evaluators and LLM as Judge depends on your specific needs, resources, and evaluation requirements:
Use Code Evaluators When:
- Deterministic Logic: You need consistent, repeatable results
- Performance Critical: Speed and cost efficiency are primary concerns
- Clear Criteria: Success can be defined through explicit rules or calculations
- Domain Expertise: You have specific knowledge that can be encoded programmatically
- Integration Needs: You need to incorporate external APIs, databases, or complex algorithms
Use LLM as Judge When:
- Subjective Assessment: Quality depends on nuanced human-like judgment
- Complex Context: Evaluation requires understanding of implicit meaning or context
- Creative Content: Assessing originality, style, or creative quality
- Natural Language: Evaluating conversational or explanatory content
- Rapid Prototyping: You need to quickly test evaluation concepts without coding
Hybrid Approaches:
Many successful evaluation strategies combine both methods:
- Use code evaluators for objective criteria (format, length, factual accuracy)
- Use LLM as Judge for subjective quality (tone, creativity, helpfulness)
- Implement progressive evaluation: fast automated screening followed by detailed LLM assessment
Getting Started
- Define Your Evaluation Goals: Clearly articulate what “good” means for your specific use case
- Start Simple: Begin with basic evaluators and iterate based on results
- Test Thoroughly: Validate your evaluators against known good and bad examples
- Monitor Performance: Track evaluation consistency and adjust as needed
- Scale Thoughtfully: Consider cost and performance implications as you scale
Custom evaluators are powerful tools that let you align AI evaluation with your specific quality standards and business requirements. Whether you choose code-based logic or LLM-based judgment, the key is to match your evaluation method to your specific needs and constraints.