Evaluation

Synaptic provides an evaluation framework for measuring the quality of AI outputs. The Evaluator trait defines a standard interface for scoring predictions against references, and the Dataset + evaluate() pipeline makes it easy to run batch evaluations across many test cases.

The Evaluator Trait

All evaluators implement the Evaluator trait from synaptic_eval:

#[async_trait]
pub trait Evaluator: Send + Sync {
    async fn evaluate(
        &self,
        prediction: &str,
        reference: &str,
        input: &str,
    ) -> Result<EvalResult, SynapticError>;
}

  • prediction -- the AI's output to evaluate.
  • reference -- the expected or ground-truth answer.
  • input -- the original input that produced the prediction.
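
For orientation, here is a minimal custom evaluator against this trait. It is a hedged sketch: the trait and error type are shown above, but the exact import paths (and the EvalResult constructors, covered in the next section) are assumptions.

use async_trait::async_trait;
use synaptic_eval::{EvalResult, Evaluator, SynapticError};

/// Toy evaluator: passes when the prediction contains the reference string.
struct ContainsEvaluator;

#[async_trait]
impl Evaluator for ContainsEvaluator {
    async fn evaluate(
        &self,
        prediction: &str,
        reference: &str,
        _input: &str, // unused here, but part of the trait signature
    ) -> Result<EvalResult, SynapticError> {
        if prediction.contains(reference) {
            Ok(EvalResult::pass())
        } else {
            Ok(EvalResult::fail())
        }
    }
}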

EvalResult

Every evaluator returns an EvalResult:

pub struct EvalResult {
    pub score: f64,       // Between 0.0 and 1.0
    pub passed: bool,     // true if score >= 0.5
    pub reasoning: Option<String>,  // Optional explanation
}

Helper constructors:

  • EvalResult::pass() -- score 1.0, passed true
  • EvalResult::fail() -- score 0.0, passed false
  • EvalResult::with_score(0.75) -- score 0.75, passed true (>= 0.5)

You can attach reasoning with .with_reasoning("explanation").
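
For example, building a partial-credit result with an explanation might look like this (assuming with_reasoning is a chainable builder method, which is an assumption here):

let result = EvalResult::with_score(0.75)
    .with_reasoning("correct answer, but one supporting detail was missing");

assert_eq!(result.score, 0.75);
assert!(result.passed); // 0.75 >= 0.5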

Built-in Evaluators

Synaptic provides five evaluators out of the box:

  • ExactMatchEvaluator -- exact string equality (with an optional case-insensitive mode)
  • JsonValidityEvaluator -- whether the prediction is valid JSON
  • RegexMatchEvaluator -- whether the prediction matches a regex pattern
  • EmbeddingDistanceEvaluator -- cosine similarity between prediction and reference embeddings
  • LLMJudgeEvaluator -- uses an LLM to score prediction quality on a 0-10 scale

See Evaluators for detailed usage of each.
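
As a quick taste, a single check with ExactMatchEvaluator might look like the sketch below; the constructor shown is an assumption, and the Evaluators page documents the real configuration options.

use synaptic_eval::{Evaluator, ExactMatchEvaluator};

// Hypothetical constructor -- see the Evaluators guide for the actual API.
let evaluator = ExactMatchEvaluator::new();

let result = evaluator
    .evaluate("Paris", "Paris", "What is the capital of France?")
    .await?;

assert!(result.passed); // exact match scores 1.0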

Batch Evaluation

The evaluate() function runs an evaluator across a Dataset of test cases, producing an EvalReport with aggregate statistics. See Datasets for details.
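
To sketch the shape of that pipeline (the Dataset constructor, case layout, and EvalReport fields below are all assumptions; the Datasets guide has the real API):

use synaptic_eval::{evaluate, Dataset, ExactMatchEvaluator};

// Hypothetical case shape: (input, prediction, reference).
let dataset = Dataset::from_cases(vec![
    ("What is 2 + 2?", "4", "4"),
    ("Capital of France?", "paris", "Paris"),
]);

let report = evaluate(&dataset, &ExactMatchEvaluator::new()).await?;

// Field names here are illustrative, not the confirmed EvalReport API.
println!("passed {} of {} cases", report.passed_count, report.total_count);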

Guides

  • Evaluators -- usage and configuration for each built-in evaluator
  • Datasets -- batch evaluation with Dataset and evaluate()