LLM-as-a-judge scorers and classifiers

LLM-as-a-judge scorers and classifiers use a language model to evaluate outputs based on natural language criteria. A scorer returns a numeric score, while a classifier returns a categorical label. They are best for subjective judgments like tone, helpfulness, or creativity that are difficult to encode in deterministic code. You can define LLM-as-a-judge scorers in three places:

Inline in SDK code: Define scorers directly in your evaluation scripts for local development or application-specific logic.
Pushed via CLI: Define scorers in TypeScript or Python files and push them to Braintrust for team-wide sharing and automatic evaluation of production logs.
Created in UI: Build scorers in the Braintrust web interface for rapid prototyping and simple configurations.

Most teams prototype in the UI, then push production-ready scorers via the CLI. See Scorers overview for guidance.

Score spans

Span-level scorers evaluate individual operations or outputs. Use them for measuring single LLM responses, checking specific tool calls, or validating individual outputs. Each matching span receives an independent score. Your prompt template can reference these variables:

{{input}}: The input to your task
{{output}}: The output from your task
{{expected}}: The expected output (optional)
{{metadata}}: Custom metadata from the test case
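
To make the substitution concrete, here is a minimal sketch of how placeholders like these could be filled in before the prompt reaches the judge model. The `renderTemplate` helper and `TemplateVars` type are hypothetical and are not Braintrust's template engine. Rendering missing optional variables as an empty string and serializing non-string values as JSON are assumptions of this sketch.

```typescript
// Hypothetical illustration only: not Braintrust's actual template engine.
type TemplateVars = {
  input: unknown;
  output: unknown;
  expected?: unknown;
  metadata?: Record<string, unknown>;
};

function renderTemplate(template: string, vars: TemplateVars): string {
  // Match {{name}} placeholders, allowing optional whitespace inside the braces.
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, key: string) => {
    const value = (vars as Record<string, unknown>)[key];
    // Assumption: missing optional variables render as an empty string.
    if (value === undefined) return "";
    // Strings are inserted as-is; other values are serialized as JSON.
    return typeof value === "string" ? value : JSON.stringify(value);
  });
}

const prompt = renderTemplate("Output: {{output}}\nExpected: {{expected}}", {
  input: "A detective investigates a series of murders based on the seven deadly sins.",
  output: "Seven",
  expected: "Se7en",
});
// prompt is "Output: Seven\nExpected: Se7en"
```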

Use scorers inline in your evaluation code:

llm_scorer.eval.ts

import { Eval } from "braintrust";
import { LLMClassifierFromTemplate } from "autoevals";
import OpenAI from "openai";

const client = new OpenAI();

const MOVIE_DATASET = [
  {
    input:
      "A detective investigates a series of murders based on the seven deadly sins.",
    expected: "Se7en",
  },
  {
    input:
      "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O.",
    expected: "Inception",
  },
];

async function task(input: string): Promise<string> {
  const response = await client.responses.create({
    model: "gpt-5-mini",
    input: [
      {
        role: "system",
        content:
          "Based on the following description, identify the movie. Reply with only the movie title.",
      },
      { role: "user", content: input },
    ],
  });
  return response.output_text ?? "";
}

const correctnessScorer = LLMClassifierFromTemplate({
  name: "Correctness",
  promptTemplate: `You are evaluating a movie-identification task.

Output (model's answer): {{output}}
Expected (correct movie): {{expected}}

Does the output correctly identify the same movie as the expected answer?
Consider alternate titles (e.g. "Harry Potter 1" vs "Harry Potter and the Sorcerer's Stone") as correct.

Return only "correct" if the output is the right movie (exact or equivalent title).
Return only "incorrect" otherwise.`,
  choiceScores: {
    correct: 1,
    incorrect: 0,
  },
  useCoT: true,
});

Eval("Movie Matcher", {
  data: MOVIE_DATASET,
  task,
  scores: [correctnessScorer],
});
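
The `choiceScores` mapping in the example above turns the judge's label into a number. The lookup can be sketched as follows, using a hypothetical `scoreFromChoice` helper that is not an autoevals function. Normalizing case and whitespace, and returning `null` for labels outside the fixed set, are assumptions of this sketch rather than documented library behavior.

```typescript
// Hypothetical helper, not part of autoevals: maps a judge's label to a score.
const choiceScores: Record<string, number> = { correct: 1, incorrect: 0 };

function scoreFromChoice(
  rawChoice: string,
  scores: Record<string, number>,
): number | null {
  // Assumption: tolerate surrounding whitespace and differences in case.
  const choice = rawChoice.trim().toLowerCase();
  // Return null for labels outside the fixed set instead of guessing a score.
  return Object.prototype.hasOwnProperty.call(scores, choice)
    ? scores[choice]
    : null;
}

const matched = scoreFromChoice(" Correct ", choiceScores); // 1
const unmatched = scoreFromChoice("maybe", choiceScores); // null
```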

Define TypeScript or Python scorers in code and push to Braintrust:

llm_scorer.ts

import braintrust from "braintrust";

const project = braintrust.projects.create({ name: "my-project" });

project.scorers.create({
  name: "Helpfulness scorer",
  slug: "helpfulness-scorer",
  description: "Evaluate helpfulness of response",
  tags: ["quality"],
  messages: [
    {
      role: "user",
      content:
        'Rate the helpfulness of this response: {{output}}\n\nReturn "A" for very helpful, "B" for somewhat helpful, "C" for not helpful.',
    },
  ],
  model: "gpt-5-mini",
  useCot: true,
  choiceScores: {
    A: 1,
    B: 0.5,
    C: 0,
  },
  metadata: {
    __pass_threshold: 0.7,
  },
});

Push to Braintrust using the bt CLI:

bt functions push llm_scorer.ts

Score traces

Trace-level scorers evaluate an entire execution trace, including all spans and the conversation history. Use them to assess multi-turn conversation quality or overall workflow completion, or whenever the scorer needs the full execution context. The scorer runs once per trace.

In an experiment, the full trace is available to every scorer, so a scorer evaluates the trace whenever its prompt uses a thread variable such as {{thread}}. For online scoring, also set the rule’s Scope field to Trace; this field controls whether the scorer runs once per trace or once per matching span.

Prompt templates for trace-level scorers support the following reserved variables:

Variable	Type	Description
`{{input}}`	any	Input from the root span
`{{output}}`	any	Output from the root span
`{{expected}}`	any	Expected output from the root span (optional)
`{{metadata}}`	object	Metadata from the root span
`{{thread}}`	text	Full conversation rendered as human-readable text
`{{thread_count}}`	number	Total number of messages in the thread
`{{first_message}}`	object	First message in the thread
`{{last_message}}`	object	Last message in the thread
`{{user_messages}}`	array	All user/human messages only
`{{assistant_messages}}`	array	All assistant messages only
`{{human_ai_pairs}}`	array	Turn pairs — each item has `{human, assistant}`

Use {{thread}} to pass the full conversation to a judge model as formatted text. {{input}}, {{output}}, {{expected}}, and {{metadata}} come from the root span of the trace.
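
To illustrate what the thread variables contain, the sketch below derives them from a plain list of messages. The `deriveThreadVariables` function and `Message` type are hypothetical and do not reflect how Braintrust builds these values. The sketch assumes each pair holds full message objects and that a pair is a user message immediately followed by an assistant reply.

```typescript
// Hypothetical illustration only: not Braintrust's implementation.
type Message = { role: "system" | "user" | "assistant"; content: string };

function deriveThreadVariables(thread: Message[]) {
  // Assumption: a pair is a user message directly followed by an assistant reply.
  const pairs: Array<{ human: Message; assistant: Message }> = [];
  for (let i = 0; i < thread.length - 1; i++) {
    if (thread[i].role === "user" && thread[i + 1].role === "assistant") {
      pairs.push({ human: thread[i], assistant: thread[i + 1] });
    }
  }

  return {
    // Human-readable rendering of the whole conversation.
    thread: thread.map((m) => `${m.role}: ${m.content}`).join("\n\n"),
    thread_count: thread.length,
    first_message: thread[0],
    last_message: thread[thread.length - 1],
    user_messages: thread.filter((m) => m.role === "user"),
    assistant_messages: thread.filter((m) => m.role === "assistant"),
    human_ai_pairs: pairs,
  };
}

const threadVars = deriveThreadVariables([
  { role: "user", content: "I need help resetting my password." },
  { role: "assistant", content: "Sure, I can help with that." },
  { role: "user", content: "Can you provide more details?" },
  { role: "assistant", content: "Open Settings and choose Reset password." },
]);
// threadVars.thread_count is 4 and threadVars.human_ai_pairs has 2 items
```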

Trace-level scoring requires TypeScript SDK v2.2.1+, Python SDK v0.5.6+, or Ruby SDK v0.2.1+.

Use scorers inline in your evaluation code:

trace_llm_scorer.eval.ts

import { Eval, wrapOpenAI, wrapTraced, type Scorer } from "braintrust";
import OpenAI from "openai";

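// Unwrapped client: used for the judge call inside the scorer.
const client = new OpenAI();
// Wrapped client: LLM calls made by the task are traced as spans.
const wrappedClient = wrapOpenAI(new OpenAI());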

const SUPPORT_DATASET = [
  { input: "My order hasn't arrived yet. Order #12345." },
  { input: "I need help resetting my password." },
];

const callLLM = wrapTraced(async function callLLM(messages: Array<{ role: string; content: string }>) {
  const response = await wrappedClient.chat.completions.create({
    model: "gpt-5-mini",
    messages,
  });
  return response.choices[0].message.content || "";
});

async function supportTask(input: string): Promise<string> {
  const messages: Array<{ role: string; content: string }> = [
    { role: "system", content: "You are a helpful customer support agent." }
  ];

  messages.push({ role: "user", content: input });
  const response1 = await callLLM(messages);
  messages.push({ role: "assistant", content: response1 });

  messages.push({ role: "user", content: "Can you provide more details?" });
  const response2 = await callLLM(messages);
  messages.push({ role: "assistant", content: response2 });

  messages.push({ role: "user", content: "Thank you for your help!" });
  const response3 = await callLLM(messages);

  return response3;
}

const conversationCoherence: Scorer = async ({ trace }) => {
  if (!trace) return null;

  const thread = await trace.getThread();
  const threadText = thread
    .map(msg => `${msg.role}: ${msg.content}`)
    .join("\n\n");

  const response = await client.responses.create({
    model: "gpt-5-mini",
    input: [
      {
        role: "user",
        content: `Evaluate the coherence of this customer support conversation:

${threadText}

Rate the conversation coherence:
- "A" for highly coherent with natural flow and consistent context
- "B" for mostly coherent with minor gaps or context issues
- "C" for incoherent, disjointed, or lost context

Return only the letter (A, B, or C).`,
      },
    ],
  });

  const rating = response.output_text?.trim().toUpperCase() || "C";
  const choiceScores = { A: 1, B: 0.6, C: 0 };
  const score = choiceScores[rating as keyof typeof choiceScores] ?? 0;

  return {
    name: "Conversation coherence",
    score,
    metadata: { rating, thread_length: thread.length },
  };
};

Eval("Support Conversation Quality", {
  data: SUPPORT_DATASET,
  task: supportTask,
  scores: [conversationCoherence],
});

Define TypeScript or Python scorers in code and push to Braintrust:

trace_llm_scorer.ts

import braintrust from "braintrust";
import { z } from "zod";

const project = braintrust.projects.create({ name: "my-project" });

project.scorers.create({
  name: "Conversation coherence",
  slug: "conversation-coherence",
  description: "Evaluate multi-turn conversation coherence",
  parameters: z.object({
    trace: z.any(),
  }),
  messages: [
    {
      role: "user",
      content: `Evaluate the coherence of this conversation:

{{thread}}

Rate the coherence:
- "A" for highly coherent with natural flow
- "B" for mostly coherent with minor gaps
- "C" for incoherent or disjointed`,
    },
  ],
  model: "gpt-5-mini",
  useCot: true,
  choiceScores: {
    A: 1,
    B: 0.6,
    C: 0,
  },
});

Push to Braintrust:

bt functions push trace_llm_scorer.ts

Create the scorer in the UI:

Go to Scorers > + Scorer.
Enter a scorer name and slug.
Select LLM-as-a-judge.
Configure:
- Prompt: Use the {{thread}} variable to reference the conversation thread.
- Model: Which model to use as judge
- Choice scores: Map model choices (A, B, C) to numeric scores
- Use CoT: Enable chain-of-thought reasoning for complex evaluations
Click Save as custom scorer.

Set pass thresholds

Define minimum acceptable scores to automatically mark results as passing or failing. When configured, scores that meet or exceed the threshold are marked as passing (green highlighting with checkmark), while scores below are marked as failing (red highlighting).

Pass thresholds apply only to scorers that output numeric scores. Classifiers, which output labels, don’t use them.

Add __pass_threshold to the scorer’s metadata (value between 0 and 1):

project.scorers.create({
  name: "Helpfulness scorer",
  slug: "helpfulness-scorer",
  messages: [
    {
      role: "user",
      content: 'Rate the helpfulness of this response: {{output}}\n\nReturn "A" for very helpful, "B" for somewhat helpful, "C" for not helpful.',
    },
  ],
  model: "gpt-5-mini",
  choiceScores: { A: 1, B: 0.5, C: 0 },
  metadata: {
    __pass_threshold: 0.7,
  },
});
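
With the choice scores and threshold from the example above, only an "A" rating passes, because 0.5 and 0 both fall below 0.7. The comparison can be sketched as follows. The `passes` helper is hypothetical and only mirrors the "meets or exceeds" rule described in this section.

```typescript
// Hypothetical helper mirroring the "meets or exceeds the threshold" rule.
const choiceScores: Record<string, number> = { A: 1, B: 0.5, C: 0 };
const passThreshold = 0.7;

function passes(score: number, threshold: number): boolean {
  return score >= threshold;
}

// Evaluate every possible choice against the threshold.
const results: Record<string, boolean> = {};
for (const choice of Object.keys(choiceScores)) {
  results[choice] = passes(choiceScores[choice], passThreshold);
}
// results: { A: true, B: false, C: false }
```

If "somewhat helpful" responses should also pass, lower the threshold to 0.5 or raise the score assigned to B.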

Apply classification labels

An LLM judge can also power a classifier. The difference is the output: a numeric judge maps the model’s choices to scores, while a classifier keeps each choice as a label. The model selects one choice from the fixed set you define. That choice becomes both the id and label of the resulting classification, the scorer’s name becomes the name, and the model’s reasoning is stored in metadata. Because the model picks a single choice, an LLM-as-a-judge classifier always returns one label.
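
The mapping from the model's choice to a classification can be sketched like this. The `Classification` type and `toClassification` helper are hypothetical, as are the example label and scorer name. The `reasoning` key inside `metadata` is an assumption, since this page states only that the reasoning is stored in metadata.

```typescript
// Hypothetical shape based on the fields described above.
// The "reasoning" metadata key is an assumption of this sketch.
type Classification = {
  id: string;
  label: string;
  name: string;
  metadata: { reasoning?: string };
};

function toClassification(
  choice: string,
  scorerName: string,
  reasoning?: string,
): Classification {
  return {
    id: choice, // the chosen label doubles as the id
    label: choice,
    name: scorerName, // the scorer's name becomes the classification name
    metadata: reasoning === undefined ? {} : { reasoning },
  };
}

// Illustrative values only.
const classification = toClassification(
  "billing",
  "Ticket category",
  "The user asks about a refund for a duplicate charge.",
);
```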

LLM-as-a-judge classifiers can only be created in the UI. Unlike LLM-as-a-judge scorers, they can’t be defined in code. To classify with a model in code instead, write a custom code classifier that calls an LLM as needed.

To create an LLM-as-a-judge classifier:

Go to Scorers and create a scorer. Under Type, choose LLM judge.
Select a Model to run the judge. An LLM-as-a-judge classifier requires a model that supports both streaming and tool use.
Write the Messages: a user message that passes in the content to evaluate (for example, {{input}}), and a system message with the rubric that describes each label and when to choose it.
Set Output type to Classification.
Under Classifications, add each label the model can choose. Labels must be unique, and the model is forced to pick exactly one through a tool schema. Optionally enable Allow “No match” to let the model return no label when none fits.
Optionally enable Use chain of thought (CoT) so the model reasons before choosing. Its reasoning is saved to the classification’s metadata.
Click Save as custom scorer.
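
The label rules in the steps above (unique labels, exactly one choice, optional "No match") can be sketched as a small validation routine. `ClassifierConfig`, `findDuplicateLabels`, and `resolveChoice` are hypothetical names used for illustration and are not part of the Braintrust SDK. Representing "No match" as `null` is an assumption.

```typescript
// Hypothetical config shape mirroring the UI fields described above.
type ClassifierConfig = {
  labels: string[];
  allowNoMatch: boolean;
};

// Returns each label that appears more than once.
function findDuplicateLabels(labels: string[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const label of labels) {
    if (seen.has(label)) duplicates.add(label);
    seen.add(label);
  }
  return Array.from(duplicates);
}

// Resolves the judge's choice: a known label, or null for "No match" when allowed.
function resolveChoice(
  choice: string | null,
  config: ClassifierConfig,
): string | null {
  if (choice === null) {
    if (config.allowNoMatch) return null;
    throw new Error('The judge must pick a label when "No match" is disabled');
  }
  if (config.labels.indexOf(choice) === -1) {
    throw new Error(`Unknown label: ${choice}`);
  }
  return choice;
}

// Illustrative values only.
const config: ClassifierConfig = {
  labels: ["billing", "bug", "feature-request"],
  allowNoMatch: true,
};
const duplicates = findDuplicateLabels(["billing", "bug", "billing"]);
// duplicates: ["billing"]
```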

Select the saved classifier when you run an experiment or score production traces, just like a scorer. Like any LLM judge, a classifier can run at span or trace scope. See Scorer and classifier scopes.

On self-hosted deployments, classifiers require data plane v2.0 or later.

Next steps

Autoevals for pre-built scorers you can drop in without writing a prompt
Custom code for deterministic logic or when you need full control
Run evaluations using your scorers
Score production logs with online scoring rules
