- Inline in SDK code: Define scorers directly in your evaluation scripts for local development or application-specific logic.
- Pushed via CLI: Define scorers in TypeScript or Python files and push them to Braintrust for team-wide sharing and automatic evaluation of production logs.
- Created in UI: Build scorers in the Braintrust web interface for rapid prototyping and simple configurations.
Score spans
Span-level scorers evaluate individual operations or outputs. Use them for measuring single LLM responses, checking specific tool calls, or validating individual outputs. Each matching span receives an independent score. Your prompt template can reference these variables:
- {{input}}: The input to your task
- {{output}}: The output from your task
- {{expected}}: The expected output (optional)
- {{metadata}}: Custom metadata from the test case
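To make the variable substitution concrete, here is a minimal sketch of how a prompt template with these four variables could be filled in. The `renderTemplate` helper and the `TemplateVars` type are hypothetical names for illustration only. This is not Braintrust's renderer, and serializing non-string values as JSON is an assumption.

```typescript
// Hypothetical stand-in for template substitution; not Braintrust's renderer.
type TemplateVars = {
  input: unknown;
  output: unknown;
  expected?: unknown; // optional, as in the variable list above
  metadata?: Record<string, unknown>;
};

const KNOWN_VARS = ["input", "output", "expected", "metadata"];

function renderTemplate(template: string, vars: TemplateVars): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, name: string) => {
    // Leave placeholders we do not recognize untouched.
    if (KNOWN_VARS.indexOf(name) === -1) return match;
    const value = (vars as Record<string, unknown>)[name];
    // A known but missing variable (for example, no expected output) renders as empty.
    if (value === undefined) return "";
    // Assumption: strings are inserted as-is, everything else as JSON.
    return typeof value === "string" ? value : JSON.stringify(value);
  });
}

const renderedPrompt = renderTemplate(
  "Question: {{input}}\nAnswer: {{output}}\nReference: {{expected}}",
  { input: "What is 2 + 2?", output: "4", expected: "4" },
);
console.log(renderedPrompt);
```

A judge prompt written this way stays readable as plain text, and the optional variables simply drop out when a test case does not supply them.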
Use scorers inline in your evaluation code:
llm_scorer.eval.ts
Score traces
Trace-level scorers evaluate entire execution traces including all spans and conversation history. Use these for assessing multi-turn conversation quality, overall workflow completion, or when your scorer needs access to the full execution context. The scorer runs once per trace. In an experiment, a scorer evaluates the trace whenever its prompt uses thread variables such as {{thread}}, since the full trace is available to every scorer. For online scoring, you also set the rule’s Scope field to Trace, which controls whether the scorer runs once per trace or per matching span.
Prompt templates for trace-level scorers support the following reserved variables:
| Variable | Type | Description |
|---|---|---|
| {{input}} | any | Input from the root span |
| {{output}} | any | Output from the root span |
| {{expected}} | any | Expected output from the root span (optional) |
| {{metadata}} | object | Metadata from the root span |
| {{thread}} | text | Full conversation rendered as human-readable text |
| {{thread_count}} | number | Total number of messages in the thread |
| {{first_message}} | object | First message in the thread |
| {{last_message}} | object | Last message in the thread |
| {{user_messages}} | array | All user/human messages only |
| {{assistant_messages}} | array | All assistant messages only |
| {{human_ai_pairs}} | array | Turn pairs: each item has {human, assistant} |
Use {{thread}} to pass the full conversation to a judge model as formatted text. {{input}}, {{output}}, {{expected}}, and {{metadata}} come from the root span of the trace.
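As an illustration of what the thread variables contain, the sketch below derives them from a plain list of chat messages. The `Message` shape, the `buildThreadVars` helper, the `role: content` text format, and the pairing rule (each assistant message is paired with the most recent unpaired user message) are all assumptions made for this example. The platform computes these values for you and may do so differently.

```typescript
// Hypothetical message shape; real spans may carry richer content.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ThreadVars {
  thread: string;
  thread_count: number;
  first_message: Message | undefined;
  last_message: Message | undefined;
  user_messages: Message[];
  assistant_messages: Message[];
  human_ai_pairs: { human: string; assistant: string }[];
}

function buildThreadVars(messages: Message[]): ThreadVars {
  const pairs: { human: string; assistant: string }[] = [];
  let pendingHuman: string | undefined;
  for (const m of messages) {
    if (m.role === "user") {
      pendingHuman = m.content;
    } else if (m.role === "assistant" && pendingHuman !== undefined) {
      // Assumed pairing rule: close the turn opened by the latest user message.
      pairs.push({ human: pendingHuman, assistant: m.content });
      pendingHuman = undefined;
    }
  }
  return {
    // Assumed text format for the human-readable rendering.
    thread: messages.map((m) => `${m.role}: ${m.content}`).join("\n"),
    thread_count: messages.length,
    first_message: messages[0],
    last_message: messages[messages.length - 1],
    user_messages: messages.filter((m) => m.role === "user"),
    assistant_messages: messages.filter((m) => m.role === "assistant"),
    human_ai_pairs: pairs,
  };
}

const exampleThread = buildThreadVars([
  { role: "user", content: "Where is my order?" },
  { role: "assistant", content: "Can you share the order number?" },
  { role: "user", content: "It is 1234." },
  { role: "assistant", content: "It ships tomorrow." },
]);
console.log(exampleThread.thread);
```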
Trace-level scoring requires TypeScript SDK v2.2.1+, Python SDK v0.5.6+, or Ruby SDK v0.2.1+.
Use scorers inline in your evaluation code:
trace_llm_scorer.eval.ts
Set pass thresholds
Define minimum acceptable scores to automatically mark results as passing or failing. When configured, scores that meet or exceed the threshold are marked as passing (green highlighting with checkmark), while scores below are marked as failing (red highlighting). Pass thresholds apply only to scorers that output numeric scores. Classifiers, which output labels, don’t use them.
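The threshold rule described above amounts to a simple comparison. The following sketch shows it under stated assumptions: `passStatus` and `PassStatus` are hypothetical names, and the `not_applicable` result is an illustrative way to represent a missing score or an unset threshold. Only the `__pass_threshold` metadata key and its range of 0 to 1 come from this page.

```typescript
type PassStatus = "pass" | "fail" | "not_applicable";

// Scorer metadata carrying the threshold; the key name is the one this page documents.
const scorerMetadata: { __pass_threshold?: number } = { __pass_threshold: 0.7 };

function passStatus(score: number | null, threshold?: number): PassStatus {
  // No numeric score or no threshold configured: nothing to mark.
  if (score === null || threshold === undefined) return "not_applicable";
  if (threshold < 0 || threshold > 1) {
    throw new RangeError("threshold must be between 0 and 1");
  }
  // Scores that meet or exceed the threshold pass.
  return score >= threshold ? "pass" : "fail";
}

console.log(passStatus(0.82, scorerMetadata.__pass_threshold));
console.log(passStatus(0.55, scorerMetadata.__pass_threshold));
```

Note that a score exactly equal to the threshold counts as passing.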
Add `__pass_threshold` to the scorer’s metadata (value between 0 and 1):
Apply classification labels
An LLM judge can also power a classifier. The difference is the output: a numeric judge maps the model’s choices to scores, while a classifier keeps each choice as a label. The model selects one choice from the fixed set you define. That choice becomes both the id and label of the resulting classification, the scorer’s name becomes the name, and the model’s reasoning is stored in metadata. Because the model picks a single choice, an LLM-as-a-judge classifier always returns one label.
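The mapping from a model's choice to a classification can be pictured with the sketch below. The field names `id`, `label`, `name`, and `metadata` follow the description above. The `toClassification` helper and the `reasoning` key inside `metadata` are hypothetical, since this page does not specify how the reasoning is keyed.

```typescript
interface Classification {
  id: string;
  label: string;
  name: string;
  metadata: Record<string, unknown>;
}

// Illustrative only: shows the shape of the result, not how the platform builds it.
function toClassification(
  scorerName: string,
  choices: readonly string[],
  chosen: string,
  reasoning?: string,
): Classification {
  // The model must select one choice from the fixed set.
  if (choices.indexOf(chosen) === -1) {
    throw new Error(`"${chosen}" is not one of the defined choices`);
  }
  return {
    id: chosen, // the choice is used as the id...
    label: chosen, // ...and as the label
    name: scorerName, // the scorer's name becomes the classification name
    metadata: reasoning === undefined ? {} : { reasoning }, // hypothetical key
  };
}

const sentiment = toClassification(
  "Sentiment",
  ["positive", "neutral", "negative"],
  "negative",
  "The user reports a failed refund.",
);
console.log(sentiment);
```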
LLM-as-a-judge classifiers can only be created in the UI. Unlike LLM-as-a-judge scorers, they can’t be defined in code. To classify with a model in code instead, write a custom code classifier that calls an LLM as needed.
- Go to Scorers and create a scorer. Under Type, choose LLM judge.
- Select a Model to run the judge. An LLM-as-a-judge classifier requires a model that supports both streaming and tool use.
- Write the Messages: a user message that passes in the content to evaluate (for example, {{input}}), and a system message with the rubric that describes each label and when to choose it.
- Set Output type to Classification.
- Under Classifications, add each label the model can choose. Labels must be unique, and the model is forced to pick exactly one through a tool schema. Optionally enable Allow “No match” to let the model return no label when none fits.
- Optionally enable Use chain of thought (CoT) so the model reasons before choosing. Its reasoning is saved to the classification’s metadata.
- Click Save as custom scorer.
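To show how a tool schema can force a single choice, the sketch below builds a JSON Schema object whose `choice` property is an `enum` of the labels. This is a guess at the general technique, not the schema Braintrust generates. The `buildChoiceSchema` helper, the `NO_MATCH` sentinel value, and the `reasoning` property used for chain of thought are all hypothetical.

```typescript
// Hypothetical sentinel for the "No match" option.
const NO_MATCH = "__no_match__";

interface ChoiceSchema {
  type: "object";
  properties: {
    reasoning?: { type: "string" };
    choice: { type: "string"; enum: string[] };
  };
  required: string[];
  additionalProperties: false;
}

function buildChoiceSchema(
  labels: string[],
  options: { allowNoMatch?: boolean; chainOfThought?: boolean } = {},
): ChoiceSchema {
  if (labels.length === 0) throw new Error("At least one label is required");
  const choices = options.allowNoMatch ? labels.concat(NO_MATCH) : labels.slice();
  // Labels must be unique, so reject any repeated entry.
  if (choices.some((label, i) => choices.indexOf(label) !== i)) {
    throw new Error("Labels must be unique");
  }
  const schema: ChoiceSchema = {
    type: "object",
    // A string enum admits exactly one of the listed values.
    properties: { choice: { type: "string", enum: choices } },
    required: ["choice"],
    additionalProperties: false,
  };
  if (options.chainOfThought) {
    // Listing reasoning first nudges the model to reason before choosing.
    schema.properties.reasoning = { type: "string" };
    schema.required.unshift("reasoning");
  }
  return schema;
}

const ticketSchema = buildChoiceSchema(["bug", "feature", "question"], {
  allowNoMatch: true,
  chainOfThought: true,
});
console.log(JSON.stringify(ticketSchema, null, 2));
```

Because the model's answer has to validate against the `enum`, free-form or multi-label answers are ruled out by construction, which matches the one-label guarantee described earlier.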
On self-hosted deployments, classifiers require data plane v2.0 or later.
Next steps
- Autoevals for pre-built scorers you can drop in without writing a prompt
- Custom code for deterministic logic or when you need full control
- Run evaluations using your scorers
- Score production logs with online scoring rules