Datawizz has a powerful evaluations feature that lets you evaluate the performance of multiple models on your real data, using several evaluation metrics. This can help you identify the best-performing model and make informed decisions about which model to deploy.

Evaluations are critical not only for assessing custom-trained models, but also for understanding the performance of public models from providers like OpenAI. When newer models are released, it’s crucial to evaluate them on your data to understand how they perform compared to your existing models. New models are not always better for every use case.

Creating an evaluation

To create an evaluation, head over to the evaluations section in the Datawizz dashboard. Here you can create a new evaluation by clicking the “New Evaluation” button.

Here you’ll need to configure your evaluation with the following settings:

  • Name: A name for your evaluation. This is for your reference and will be displayed in the dashboard.
  • Models: The models you want to evaluate. You can select multiple models from the same provider or different providers. (note that models you train must be deployed before they can be evaluated)
  • Data: Select the logs you want to use for evaluation. You can use tags and other filters to select the logs you want to use.
    • If you are training custom models, you may want to reserve some logs for evaluation to ensure you are not testing on the same data you trained on. You can do this by tagging some logs with test and excluding them from the training data when you initiate a model training.
    • If you have feedback for your logs, you should filter for positive (👍) logs to evaluate against.
  • Maximum Sample Count: The maximum number of samples to use for evaluation. This is useful if you have a large dataset and want to limit the number of samples used for evaluation.
    When evaluating different models, a request is made to each model for each log in the evaluation data. You bear the cost for these requests, so be sure to set a reasonable maximum sample count to avoid excessive usage.
  • Evaluation Function: How to compare results. All evaluations treat the original log’s output as the baseline and compare each model’s output against it. All scores range from 0 to 1, where 1 indicates high similarity or accuracy and 0 indicates poor performance. You can choose from the following evaluation functions:
    • String Equality: Compares whether the model output exactly matches the baseline. This is useful for simple classification tasks where exact string matches are important.
      • 1 = Perfect match
      • 0 = No match
      • Supports case-insensitive comparison and prefix matching.
    • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Compares the outputs of the models against the original log output using the ROUGE metric.
      • Returns the score and F-measures for various ROUGE types (e.g., ROUGE-1, ROUGE-2, ROUGE-L).
      • Useful for summarization and text generation.
    • WER (Word Error Rate): Measures the word-level edit distance between the baseline and the new output.
      • Similarity Formula: 1 - WER (higher is better); see the sketch after this list.
      • Useful for speech recognition and transcription evaluation.
    • CER (Character Error Rate): Measures the character-level edit distance between the baseline and the new output.
      • Similarity Formula: 1 - CER (higher is better).
      • Useful for OCR (Optical Character Recognition) evaluation.
    • BLEU (Bilingual Evaluation Understudy): Evaluates n-gram precision between the model output and the baseline.
      • Returns the BLEU score along with precision values for various n-grams (1 to 4).
      • Commonly used for machine translation evaluation. Higher scores indicate better alignment with the reference.
    • METEOR (Metric for Evaluation of Translation with Explicit ORdering): Improves on BLEU by considering synonyms, stemming, and word order.
      • Useful for translation quality assessment with more flexibility.
    • BERTScore: Computes semantic similarity between model output and baseline using BERT embeddings.
      • Useful for paraphrase detection and meaning-based comparison.
      • Returns the F1 score computed using BERT-based embeddings.
    • SacreBLEU: A standardized BLEU implementation with improved reproducibility.
      • Ideal for benchmarking translation models.
      • Computes the BLEU score with a specific reference corpus using a standard, reproducible method.
    • CHRF (Character-level F-score): Computes the character-level F-score between the baseline and the new output.
      • Effective for morphologically rich languages and short text evaluation.
    • JSON Structure Comparison: Checks whether the JSON structure and values match the baseline (a simplified sketch follows this list). It evaluates:
      • Key Overlap Score
      • Value Overlap Score
      • Overall Match Score
      • Whether the output is a valid JSON object
      • JSON completeness (whether the JSON objects have the same set of keys)
    • JSON List Comparison: Compares lists of JSON objects by matching items using one or more unique identifiers (id_keys). This evaluation is especially useful for tasks like multi-label classification, structured entity extraction, and other scenarios where outputs are arrays or dictionaries of identifiable elements (see the sketch after this list).
      • 🔍 What it Measures:

        • Weighted Precision / Recall: Measures overall accuracy while accounting for how frequently each item or label appears in the data.
          • Weighted Precision rewards correct predictions proportionally to how common a label is.
          • Weighted Recall measures how well the model recovers important/common labels.
            This makes it ideal for imbalanced datasets where some labels appear more often than others.
        • Exact Match Ratio: Proportion of objects that match exactly (based on the fields in id_keys).
        • Hamming Distance: Average difference across key-value pairs in matched objects. Lower is better.
        • Score: The final score is calculated as the average of weighted precision and weighted recall.
      • 📥 Supported Input Formats:

        • The JSON List Comparison metric supports two types of structured outputs:
          • Flat JSON List

            A simple list of objects where each object represents a classification, entity, or data point:

            [
              { "label": "sports", "confidence": 0.92 },
              { "label": "health", "confidence": 0.81 }
            ]
            
          • Wrapped Dictionary with Keyed List

            A dictionary where the list of objects is nested under a specific key. This is often used when the model returns structured fields in a top-level wrapper:

            {
              "labels": [
                { "label": "sports" },
                { "label": "finance" }
              ]
            }
            
      • ⚙️ Configuration Options

        • id_keys (required):
          Comma-separated list of key names used to uniquely identify and match items between the baseline and model outputs.
          Example: "label" or "id,name"
        • remove_json_code_blocks (optional, default: true):
          If enabled, automatically removes markdown-style code block wrappers (like ```json) from the model’s output before parsing it as JSON.
        • Valid JSON required:
          The output must be syntactically valid JSON. Malformed outputs will cause the metric to return a score of 0 with "json_valid": false.
      • ✅ Best Practice: Use a JSON Schema. To ensure model outputs are consistently structured, it’s strongly recommended to set the json_schema parameter for each model being evaluated.

        Providing a schema helps:

        • Ensure the model returns valid JSON.
        • Align output structure with evaluation expectations.
        • Reduce inconsistencies or formatting issues.

        Example schema:

        {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "label": { "type": "string" },
              "confidence": { "type": "number" }
            },
            "required": ["label"]
          }
        }
        
    • LLM as Judge: Compares the outputs of the models against the original log output using a language model as a judge. This is useful for open-ended tasks like conversation, where the output is a free-form response to a user query. Learn more about LLM as Judge evaluation →
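To make the 0-to-1 scoring concrete, here is a minimal sketch of how a String Equality score and the 1 - WER similarity could be computed. It is illustrative only, not Datawizz’s implementation; the function names and the case_insensitive option are assumptions.

```python
# Minimal sketch of two of the text scores above (String Equality and 1 - WER).
# Illustrative only; function names and the case_insensitive option are assumptions.

def string_equality(baseline: str, output: str, case_insensitive: bool = True) -> float:
    """1.0 for an exact match, 0.0 otherwise (optionally case-insensitive)."""
    if case_insensitive:
        baseline, output = baseline.lower(), output.lower()
    return 1.0 if baseline == output else 0.0


def word_error_rate(baseline: str, output: str) -> float:
    """Word-level edit distance divided by the number of baseline words."""
    ref, hyp = baseline.split(), output.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def wer_similarity(baseline: str, output: str) -> float:
    """Similarity formula from the list above: 1 - WER, clamped to [0, 1]."""
    return max(0.0, 1.0 - word_error_rate(baseline, output))


print(string_equality("sports", "Sports"))                           # 1.0 (case-insensitive)
print(round(wer_similarity("the cat sat", "the cat sat down"), 2))   # 0.67
```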
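For JSON Structure Comparison, a simplified sketch of key and value overlap scoring on flat JSON objects might look like the following; the exact formulas Datawizz uses may differ.

```python
import json

def json_structure_scores(baseline_text: str, output_text: str) -> dict:
    """Simplified sketch of key/value overlap scoring for flat JSON objects.
    The formulas here are illustrative, not Datawizz's implementation."""
    try:
        baseline, output = json.loads(baseline_text), json.loads(output_text)
    except json.JSONDecodeError:
        # Malformed output: report invalid JSON and a score of 0.
        return {"json_valid": False, "score": 0.0}

    base_keys, out_keys = set(baseline), set(output)
    shared = base_keys & out_keys
    key_overlap = len(shared) / max(len(base_keys | out_keys), 1)
    value_overlap = sum(1 for k in shared if baseline[k] == output[k]) / max(len(shared), 1)
    return {
        "json_valid": True,
        "key_overlap": key_overlap,
        "value_overlap": value_overlap,
        "complete": base_keys == out_keys,           # same set of keys
        "score": (key_overlap + value_overlap) / 2,  # illustrative overall match
    }

print(json_structure_scores('{"label": "sports", "lang": "en"}',
                            '{"label": "sports", "topic": "nba"}'))
```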
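For JSON List Comparison, the sketch below shows the kind of id_keys-based matching the metric performs, including the code-fence stripping done by remove_json_code_blocks. It is a simplification: plain precision/recall stand in for the weighted versions, the Hamming distance over matched objects is omitted, and the exact-match denominator is an assumption.

```python
import json
import re

def parse_json_list(text: str, list_key=None) -> list:
    """Handle both supported input formats: strip markdown code fences
    (the remove_json_code_blocks behaviour) and unwrap a keyed list if needed."""
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    data = json.loads(text)
    if isinstance(data, dict) and list_key is not None:
        data = data[list_key]
    return data


def json_list_scores(baseline: list, output: list, id_keys: list) -> dict:
    """Simplified sketch of id_keys matching; weighting by label frequency
    and the Hamming distance over matched objects are omitted."""
    def identity(item):  # e.g. ("sports",) when id_keys = ["label"]
        return tuple(item.get(k) for k in id_keys)

    base_ids = {identity(x) for x in baseline}
    out_ids = {identity(x) for x in output}
    matched = base_ids & out_ids

    precision = len(matched) / max(len(out_ids), 1)   # correct / predicted
    recall = len(matched) / max(len(base_ids), 1)     # correct / expected
    return {
        "precision": precision,
        "recall": recall,
        "exact_match_ratio": len(matched) / max(len(base_ids | out_ids), 1),
        "score": (precision + recall) / 2,            # average, as described above
    }


baseline = parse_json_list('[{"label": "sports"}, {"label": "health"}]')
output = parse_json_list('```json\n[{"label": "sports"}, {"label": "finance"}]\n```')
print(json_list_scores(baseline, output, id_keys=["label"]))
```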

Most evaluation functions support preprocessing of texts based on user-specified configurations (e.g., case-insensitivity, trimming, removal of extra spaces), which helps normalize inputs before comparison. Where a configuration isn’t relevant to a given evaluation metric, it does not affect the results, since the metric already handles that normalization by default.
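As a rough illustration of that normalization step, the sketch below applies case folding, trimming, and whitespace collapsing before two texts are compared; the option names are assumptions, not Datawizz configuration keys.

```python
import re

def normalize(text: str, lowercase: bool = True, trim: bool = True,
              collapse_spaces: bool = True) -> str:
    """Typical preprocessing of the kind described above. The option names
    are illustrative, not Datawizz configuration keys."""
    if lowercase:
        text = text.lower()
    if trim:
        text = text.strip()
    if collapse_spaces:
        text = re.sub(r"\s+", " ", text)
    return text

print(normalize("  The   Quick  Brown Fox "))  # "the quick brown fox"
```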

Viewing evaluation results

Once you’ve created an evaluation, you can view the results in the evaluation details page. Here you’ll see a summary of the evaluation, including the evaluation function used, the number of samples evaluated, and the evaluation metrics. You will also be able to see individual responses from each model and compare them side-by-side.