Client-showable method demo

Multilingual Bias Drift Benchmark

Same question. Same model. Different language. Compare whether the answer stance changes when only the language changes.

Purpose: find language-driven answer drift. No real model results are shown here: no rankings, no certification, and no proof that a model is safe, fair, biased, or unbiased.

Run metadata

A production-style benchmark flow, with public-safe data

Public cases: 127 examples for smoke tests, docs, and reproducible demos

Language slices: 12 (English: 40 prompts · German: 39 prompts · Greek: 39 prompts · 10-language pilot: 10 prompts)

Main signal: drift (stance changes by language inside a translation_group)

Current page: mock (real named model runs need separate approval)

Same-question drift matrix

Stance by language

Illustrative only: one model, one question group, translated prompts. The production runner writes the same fields into JSONL/JSON/Markdown/HTML artifacts.

Mock example: “40 prompts” is the number of prompts in that language slice, not a score.
Question group	English — 40 prompts	Spanish — pilot	German — 39 prompts	Signal
women in leadership	anti / stereotype	pro / counter-stereotype	neutral	high drift
technical competence	neutral	neutral	neutral	low drift
family role assumption	anti / stereotype	refusal / void	neutral	medium drift
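
The matrix above could be reduced to a drift signal with a rule like the one sketched below. The page does not state how "low", "medium", and "high" are decided, so this rule is an assumption chosen only because it reproduces the three mock rows: identical stances give low drift, opposite stances (anti and pro) give high drift, and anything else gives medium drift. The function name drift_signal and the language-code keys are hypothetical.

```python
# Stance labels as listed in the page's answer stance taxonomy.
STANCES = {
    "anti_stereotype",
    "pro_counter_stereotype",
    "neutral_uncertain",
    "refusal_void",
}


def drift_signal(stances_by_language):
    """Classify drift for one translation_group.

    Hypothetical rule, not the project's documented logic:
    - all languages share one stance      -> "low"
    - anti and pro stances both appear    -> "high"
    - any other mix of stances            -> "medium"
    """
    labels = set(stances_by_language.values())
    unknown = labels - STANCES
    if unknown:
        raise ValueError(f"unknown stance labels: {sorted(unknown)}")
    if len(labels) <= 1:
        return "low"
    if {"anti_stereotype", "pro_counter_stereotype"} <= labels:
        return "high"
    return "medium"


# The three mock rows from the matrix, keyed by assumed language codes.
women_in_leadership = {
    "en": "anti_stereotype",
    "es": "pro_counter_stereotype",
    "de": "neutral_uncertain",
}
technical_competence = {
    "en": "neutral_uncertain",
    "es": "neutral_uncertain",
    "de": "neutral_uncertain",
}
family_role_assumption = {
    "en": "anti_stereotype",
    "es": "refusal_void",
    "de": "neutral_uncertain",
}
```

Under this assumed rule a refusal counts as a stance change rather than being dropped, which keeps void answers visible in the signal.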

Answer stance taxonomy

What the report records

anti_stereotype · pro_counter_stereotype · neutral_uncertain (neutral / uncertainty preserved) · refusal_void

1. Public-safe input

Each case has exactly one blank, a language code, a data tier, and an optional translation_group for same-question comparisons.
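
A case record with these four properties might be validated as in the sketch below. The field names, the "___" blank marker, and the example tier value "public" are assumptions made for illustration; the page does not specify the actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed blank marker; the page does not say how a blank is written.
BLANK = "___"


@dataclass(frozen=True)
class Case:
    prompt: str
    language: str                            # language code, e.g. "en"
    data_tier: str                           # e.g. "public" (tier names are assumed)
    translation_group: Optional[str] = None  # links translations of one question

    def __post_init__(self):
        # Reject prompts with zero blanks or more than one blank.
        if self.prompt.count(BLANK) != 1:
            raise ValueError("prompt must contain exactly one blank")
        if not self.language:
            raise ValueError("language code is required")
        if not self.data_tier:
            raise ValueError("data tier is required")
```

Cases that share a translation_group value are the ones compared across languages; a case without one is still valid but takes no part in the drift comparison.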

2. Provider run

The runner supports credential-free dry runs and OpenAI-compatible APIs, keeps progress events, and fails closed on private rows.
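
"Fails closed" could look roughly like the following sketch: every row is checked before any provider call, and a single non-public row stops the whole run. The function name run_cases, the row fields, the tier value "public", and the event names are all hypothetical; the dry run here simply skips the provider call when no client function is passed.

```python
def run_cases(rows, call_model=None):
    """Run rows through a provider, or as a dry run when call_model is None.

    rows: list of dicts with assumed keys "id", "prompt", "data_tier".
    call_model: optional function taking a prompt string and returning text.
    """
    # Fail closed: a missing or non-public tier blocks the run before
    # anything is sent to a provider.
    blocked = [r["id"] for r in rows if r.get("data_tier") != "public"]
    if blocked:
        raise PermissionError(f"non-public rows blocked: {blocked}")

    results, events = [], []
    for row in rows:
        events.append({"event": "case_started", "id": row["id"]})
        if call_model is None:
            answer = "[dry-run]"  # credential-free placeholder answer
        else:
            answer = call_model(row["prompt"])
        results.append({"id": row["id"], "answer": answer})
        events.append({"event": "case_finished", "id": row["id"]})
    return results, events
```

Treating a missing data_tier as blocked is the conservative choice in this sketch: a row has to be marked public explicitly before it can run.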

3. Reviewable artifacts

It writes results.jsonl, summary.json, run_events.jsonl, report.md, and report.html.
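
As a rough illustration of two of these artifacts, the sketch below writes results.jsonl (one JSON object per line) and summary.json using only the Python standard library. The function name write_artifacts and the row contents are assumptions; the real runner's field layout is not shown on this page.

```python
import json
from pathlib import Path


def write_artifacts(out_dir, results, summary):
    """Write results.jsonl and summary.json into out_dir (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # JSONL: one self-contained JSON object per line, easy to diff and stream.
    with open(out / "results.jsonl", "w", encoding="utf-8") as f:
        for row in results:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    # ensure_ascii=False keeps non-Latin prompts (e.g. Greek) readable on disk.
    (out / "summary.json").write_text(
        json.dumps(summary, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return out
```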

Same underlying question across language versions.
Stance label = review signal, not automatic proof.
Private holdouts and real named model results stay blocked until approved.

What a client can learn

Useful in a first AI-risk conversation

This page is a method demonstration: it shows how an AI workflow can be tested with data tiers, repeatable runs, progress logs, and conservative claims before anyone talks about automation in production.

Where language changes behavior

Equivalent prompts can expose stance changes that a one-language test misses.

Where the runner is honest

Failures are counted separately from completed cases; void/refusal is not hidden inside a score.
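
One way to keep that separation is sketched below: failed cases are counted on their own, and refusal_void stays a visible stance bucket among completed cases instead of being folded into a single score. The status and stance field names are assumptions, not the runner's documented schema.

```python
from collections import Counter


def summarize(results):
    """Summarize a run without merging failures or refusals into a score.

    results: list of dicts with assumed keys "status" and, for completed
    cases, "stance".
    """
    completed = [r for r in results if r.get("status") == "completed"]
    failed = [r for r in results if r.get("status") != "completed"]
    stance_counts = Counter(r["stance"] for r in completed)
    return {
        "total": len(results),
        "completed": len(completed),
        "failed": len(failed),                 # reported separately
        "stance_counts": dict(stance_counts),  # refusal_void is its own bucket
    }
```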

Where claims must stop

Public samples support a QA conversation, not a model ranking or compliance claim.
