Datawizz has a powerful evaluations feature that lets you evaluate the performance of multiple models using your real data and several evaluation metrics. This can help you identify the best-performing model and make informed decisions about which model to deploy.

Evaluations are critical not only for assessing custom trained models, but also for understanding the performance of public models from providers like OpenAI. When newer models are released, it’s crucial to evaluate them on your data to understand how they perform compared to your existing models. New models are not always better for every use case.
To create an evaluation, head over to the evaluations section in the Datawizz dashboard. Here you can create a new evaluation by clicking the “New Evaluation” button.

You’ll need to configure your evaluation with the following settings:
Name: A name for your evaluation. This is for your reference and will be displayed in the dashboard.
Models: The models you want to evaluate. You can select multiple models from the same provider or different providers. (note that models you train must be deployed before they can be evaluated)
Data: Select the logs you want to use for evaluation. You can use tags and other filters to select the logs you want to use.
If you are training custom models, you may want to reserve some logs for evaluation to ensure you are not testing on the same data you trained on. You can do this by tagging some logs with a “test” tag and excluding them from the training data when you initiate a model training.
If you have feedback for your logs, you should filter for positive (👍) logs to evaluate against.
Maximum Sample Count: The maximum number of samples to use for evaluation. This is useful if you have a large dataset and want to limit the number of samples used for evaluation.
When evaluating different models, a request is made to each model for each log in the evaluation data. You bear the cost for these requests, so be sure to set a reasonable maximum sample count to avoid excessive usage.
Evaluators: Select the evaluation functions you want to use. You can select multiple evaluation functions to get a comprehensive view of model performance. You can easily use our built-in evaluation functions, or you can create your own custom evaluation functions using code or LLM prompts (LLM as Judge). See the full list of available evaluation functions below.
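For illustration, here is a minimal sketch of what a custom code evaluator can look like. It uses the same evaluator(inputs, outputs, reference_outputs) signature as the built-in examples shown later on this page; the length-ratio metric itself is a hypothetical example, not a built-in function.

```python
## Hypothetical custom evaluator sketch: scores a response by how closely its
## length matches the reference response. The signature mirrors the built-in
## examples below; the metric itself is just an illustration.
def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    generated = outputs.get('content') or ""
    reference = reference_outputs.get('content') or ""
    if not generated or not reference:
        return 0.0
    ## Ratio of the shorter length to the longer length, in [0, 1]
    return min(len(generated), len(reference)) / max(len(generated), len(reference))
```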
Once you’ve created an evaluation, you can view the results in the evaluation details page. Here you’ll see a summary of the evaluation, including the evaluation function used, the number of samples evaluated, and the evaluation metrics. You will also be able to see individual responses from each model and compare them side-by-side.
Datawizz provides a comprehensive set of built-in evaluation functions to help you assess model performance across different dimensions. These evaluators cover various aspects of text generation quality, from semantic similarity to exact matching.
BERT Score is a semantic similarity metric that leverages contextual embeddings from BERT and its variants to measure how well generated text matches reference text. Unlike traditional n-gram based metrics, BERT Score captures semantic meaning and can recognize paraphrases and synonymous expressions. It returns precision, recall, and F1 scores based on token-level similarity in the embedding space.
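If you want to reproduce a BERT Score check locally, a common option is the open-source bert-score package; this sketch is illustrative and may not match Datawizz’s internal implementation exactly.

```python
## Sketch: BERTScore via the open-source bert-score package (pip install bert-score).
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

## Returns per-example precision, recall, and F1 tensors based on token embeddings
P, R, F1 = score(candidates, references, lang="en")
print(f"precision={P[0].item():.3f} recall={R[0].item():.3f} f1={F1[0].item():.3f}")
```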
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the overlap between generated text and reference text by comparing shared words or n-gram sequences. It’s particularly useful for evaluating summarization tasks and provides multiple variants (ROUGE-1, ROUGE-2, ROUGE-L) that focus on different aspects of text overlap, from individual words to longest common subsequences.
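As a reference point, ROUGE-1, ROUGE-2, and ROUGE-L can be computed locally with the rouge-score package; this is an illustrative sketch rather than the exact implementation used by the built-in evaluator.

```python
## Sketch: ROUGE variants via the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the quick brown fox jumps over the lazy dog"
generated = "a quick brown fox jumped over a lazy dog"

## Each entry exposes precision, recall, and fmeasure
for name, s in scorer.score(reference, generated).items():
    print(f"{name}: precision={s.precision:.3f} recall={s.recall:.3f} f1={s.fmeasure:.3f}")
```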
SacreBLEU is a standardized implementation of the BLEU metric specifically designed for machine translation evaluation. It provides consistent and reproducible results by using standardized tokenization and normalization procedures. The metric measures the overlap between generated text and reference text using n-gram precision with a brevity penalty to discourage overly short outputs.
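For a local sanity check, the sacrebleu package exposes the same standardized metric; the snippet below is an illustrative sketch.

```python
## Sketch: corpus-level SacreBLEU (pip install sacrebleu).
import sacrebleu

hypotheses = ["the cat is on the mat", "there is a dog in the garden"]
## One reference stream, parallel to the hypotheses (multiple streams are supported)
references = [["the cat sat on the mat", "a dog is in the garden"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)       ## overall BLEU score on a 0-100 scale
print(bleu.precisions)  ## 1-gram through 4-gram precisions
```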
ChrF (CHaRacter-level F-score) is a character-based evaluation metric that calculates similarity between machine translation output and reference translation using character n-grams rather than word n-grams. This approach makes it particularly effective for morphologically rich languages and helps capture fine-grained differences in text generation quality, including spelling variations and morphological changes.
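ChrF is also available in the sacrebleu package, which can be useful for spot-checking results; the snippet below is an illustrative sketch.

```python
## Sketch: character n-gram ChrF via sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["ein guter Tag"]
references = [["ein schöner Tag"]]

## ChrF compares character n-grams, making it robust to morphological variation
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(chrf.score)
```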
Character Error Rate (CER) evaluates text generation accuracy by calculating the minimum number of character-level insertions, deletions, and substitutions needed to transform the generated text into the reference text. The result is normalized by the total number of characters in the reference text. This metric is particularly useful for tasks requiring high precision, such as transcription or data extraction, where character-level accuracy is critical.
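To see how CER behaves on a concrete example, the jiwer package provides a straightforward implementation; this is an illustrative sketch.

```python
## Sketch: character error rate via the jiwer package (pip install jiwer).
from jiwer import cer

reference = "invoice number 10423"
generated = "invoice number 10432"

## CER = (character insertions + deletions + substitutions) / characters in reference
print(cer(reference, generated))
```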
Word Error Rate (WER) measures the accuracy of generated text by calculating the minimum number of word-level insertions, deletions, and substitutions required to match the reference text. Originally developed for speech recognition evaluation, WER is also valuable for assessing LLM outputs where word-level accuracy is important. The metric is normalized by the total number of words in the reference text.
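WER can likewise be spot-checked with jiwer; the sketch below shows how a single substituted word affects the score.

```python
## Sketch: word error rate via the jiwer package (pip install jiwer).
from jiwer import wer

reference = "the meeting starts at ten am"
generated = "the meeting starts at ten pm"

## WER = (word insertions + deletions + substitutions) / words in reference
print(wer(reference, generated))  ## 1 substitution out of 6 words ≈ 0.167
```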
This evaluator performs exact string matching while ignoring case differences and normalizing whitespace. It converts both texts to lowercase, trims leading and trailing whitespace, and collapses multiple consecutive spaces into single spaces. This is useful for evaluating tasks where the exact content matters but formatting variations should be ignored, such as classification labels or standardized responses.
```python
import json

def preprocess_string(text):
    if not text:
        return ""
    ## LOWERCASE
    text = text.lower()
    ## TRIM
    text = text.strip()
    ## EXTRA WHITESPACE
    text = " ".join(text.split())
    return text

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    return preprocess_string(outputs['content']) == preprocess_string(reference_outputs['content'])
```
This evaluator performs exact string matching while preserving case sensitivity but still normalizing whitespace. It trims leading and trailing whitespace and collapses multiple consecutive spaces into single spaces. This is ideal for tasks where case matters (such as proper nouns or code generation) but minor formatting inconsistencies should be overlooked.
```python
import json

def preprocess_string(text):
    if not text:
        return ""
    ## TRIM
    text = text.strip()
    ## EXTRA WHITESPACE
    text = " ".join(text.split())
    return text

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    return preprocess_string(outputs['content']) == preprocess_string(reference_outputs['content'])
```
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is an advanced text generation evaluation metric that goes beyond simple n-gram matching. It incorporates exact matches, stemming, synonyms, and paraphrases to provide a more nuanced comparison between generated and reference text. METEOR also includes a penalty for incorrect word order, making it better aligned with human judgment compared to simpler metrics like BLEU.
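A local METEOR computation is available in NLTK (it needs the WordNet data for synonym matching); the snippet below is an illustrative sketch.

```python
## Sketch: METEOR via NLTK (pip install nltk); WordNet data is needed for synonym matching.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

reference = "the cat sat on the mat".split()
generated = "a cat was sitting on the mat".split()

## meteor_score expects tokenized input: a list of reference token lists plus a hypothesis token list
print(meteor_score([reference], generated))
```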
BLEU (Bilingual Evaluation Understudy) is a widely-used metric for evaluating machine-generated text, particularly translations. It measures the overlap of n-gram sequences between the candidate output and reference texts, providing scores for 1-gram through 4-gram precision. BLEU applies a brevity penalty to discourage overly short generations and returns both an overall BLEU score and individual n-gram precision scores.
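For a quick sentence-level look at BLEU’s n-gram weighting, NLTK’s implementation can be used; this is an illustrative sketch, and the built-in evaluator may differ in tokenization and smoothing.

```python
## Sketch: sentence-level BLEU via NLTK (pip install nltk), with equal 1- to 4-gram weights.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
generated = "the cat is on the mat".split()

## Smoothing avoids zero scores when higher-order n-grams have no matches
smooth = SmoothingFunction().method1
print(sentence_bleu([reference], generated, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth))
```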
This specialized evaluator is designed for tasks that generate JSON output. It compares the structure and content of JSON responses by calculating key overlap and value overlap scores. The evaluator handles common LLM formatting issues by automatically removing markdown code blocks (```json) and validates that both outputs are valid JSON. It provides detailed metrics including key overlap, value overlap, JSON validity, and boolean flags for complete key/value matching.
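To illustrate the kind of logic this evaluator applies, here is a simplified sketch written in the same custom-evaluator style as the examples above; the built-in implementation and its exact return format may differ.

```python
import json

def clean_json_text(text):
    ## Strip markdown code fences that LLMs often wrap around JSON output
    text = (text or "").strip()
    if text.startswith("```"):
        text = text.strip("`")
        if text.lower().startswith("json"):
            text = text[4:]
    return text.strip()

def evaluator(inputs: list, outputs: dict, reference_outputs: dict):
    ## Simplified key/value overlap for flat JSON objects (illustration only)
    try:
        generated = json.loads(clean_json_text(outputs['content']))
        reference = json.loads(clean_json_text(reference_outputs['content']))
    except (json.JSONDecodeError, TypeError):
        return {"valid_json": False, "key_overlap": 0.0, "value_overlap": 0.0}

    ref_keys, gen_keys = set(reference), set(generated)
    shared = ref_keys & gen_keys
    key_overlap = len(shared) / len(ref_keys) if ref_keys else 1.0
    value_overlap = (sum(1 for k in shared if generated[k] == reference[k]) / len(ref_keys)) if ref_keys else 1.0

    return {
        "valid_json": True,
        "key_overlap": key_overlap,
        "value_overlap": value_overlap,
        "keys_match": gen_keys == ref_keys,
        "values_match": value_overlap == 1.0,
    }
```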