TL;DR
Steganography hides a message inside innocuous text, so an observer watching the channel cannot tell a secret is being sent. For AI safety this threatens output monitoring: if a model or its users can coordinate through covert signals in normal-looking text, inspecting outputs may miss the communication entirely.
Prior evaluations probe one scheme or one elicitation method at a time. The field has had no shared testbed for the broader question of which steganographic schemes language models can be trained to use.
We measure it, across 18 schemes.
The benchmark
18 schemes in 5 families, drawn from two decades of linguistic steganography:
- Lexical substitution: Chang & Clark, Shirali, Synonym (word choice carries bits)
- Syntactic transformation: Atallah, Meral (parse-tree rewrites, reordering)
- Structural: Acrostics, Length Parity
- Generative: Stegasuras, Meteor, MEC, DISCOP, ADG, RNN-Stega, and more (sampling-distribution coupling)
- Paraphrase: PPDB phrase choices
3 primary models: Gemma-3-4B-IT, Qwen2.5-3B-Instruct, and GPT-OSS-20B, plus a Qwen 0.5B→72B scaling sweep.
3 training regimes: full SFT (15 epochs, early stopping), in-context learning, and in-context RL (Reflexion, RSQ).
Metrics: symmetric bit error rate (sBER, 0.5 = chance) for recoverability, and AUROC of a linguistic-feature classifier for detectability.
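As a rough illustration of the recoverability metric, here is a minimal Python sketch. The exact definition of sBER is not spelled out in this summary, so the folding step in `symmetric_ber` is an assumption: it treats a consistently inverted decode as recoverable, which makes 0.5 the chance value. The function names and the example payload are hypothetical.

```python
from typing import Sequence


def bit_error_rate(sent: Sequence[int], decoded: Sequence[int]) -> float:
    """Fraction of payload bits that the decoder got wrong."""
    if len(sent) != len(decoded) or not sent:
        raise ValueError("payloads must be non-empty and of equal length")
    errors = sum(s != d for s, d in zip(sent, decoded))
    return errors / len(sent)


def symmetric_ber(sent: Sequence[int], decoded: Sequence[int]) -> float:
    """ASSUMPTION: fold BER around 0.5, so a fully inverted decode scores
    like a correct one and 0.5 is the worst (chance-level) value."""
    ber = bit_error_rate(sent, decoded)
    return min(ber, 1.0 - ber)


payload = [1, 0, 1, 1, 0, 0, 1, 0]
decoded = [1, 0, 0, 1, 0, 0, 1, 0]  # one of eight bits flipped

print(bit_error_rate(payload, decoded))                  # 0.125
print(symmetric_ber(payload, [1 - b for b in payload]))  # 0.0
```

Under this reading, a decoder that guesses at random lands near 0.5, while both a perfect decode and a perfectly inverted one land at 0.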
Each scheme’s reference encoder runs over cover sentences sampled from C4, producing (cover, payload, stego) triples used as standard SFT data, 5K–50K examples per scheme.
stego (bit = 1): “The colour of the painting is vivid.”
Shirali’s US/UK spelling channel: one bit per sentence, invisible without the codebook.
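A spelling channel of this kind can be sketched in a few lines of Python. This is a toy under stated assumptions, not the benchmark's reference encoder: the three-word codebook, the function names, and the US-spelled cover sentence are all hypothetical. The only detail taken from the example above is the bit convention (UK spelling carries bit 1).

```python
import re

# Hypothetical mini codebook (US spelling -> UK spelling). The benchmark's
# actual codebook is not shown in this summary.
US_TO_UK = {"color": "colour", "flavor": "flavour", "center": "centre"}
UK_TO_US = {uk: us for us, uk in US_TO_UK.items()}


def encode(cover: str, bit: int) -> str:
    """Rewrite the first codebook word: US spelling for 0, UK spelling for 1."""
    for match in re.finditer(r"[A-Za-z]+", cover):
        word = match.group()
        key = word.lower()
        us = key if key in US_TO_UK else UK_TO_US.get(key)
        if us is None:
            continue  # not a codebook word, keep scanning
        target = US_TO_UK[us] if bit else us
        if word[0].isupper():
            target = target.capitalize()
        return cover[:match.start()] + target + cover[match.end():]
    raise ValueError("cover sentence contains no codebook word")


def decode(stego: str) -> int:
    """Read the bit back from the first codebook word in the sentence."""
    for word in re.findall(r"[A-Za-z]+", stego):
        key = word.lower()
        if key in UK_TO_US:
            return 1
        if key in US_TO_UK:
            return 0
    raise ValueError("no codebook word found")


cover = "The color of the painting is vivid."  # assumed US-spelled cover
stego = encode(cover, 1)
triple = (cover, 1, stego)  # one (cover, payload, stego) training example

print(stego)          # The colour of the painting is vivid.
print(decode(stego))  # 1
```

Both spellings are ordinary English, so a reader without the codebook has no reason to treat either sentence as marked.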
The study covers roughly 1,600 GPU-hours and 1,500+ training runs, with a final grid of about 250 runs on L40S, A100, and H200 hardware.
Examples from the benchmark
[Interactive examples: real (cover, payload, stego) triples from the released test splits, with the cover wording highlighted in green and the bit-carrying stego rewrite in red.]
Results: the trainability cliff
Model-and-training behavior falls into four qualitatively distinct tiers:
- Trainable (sBER < 0.10): encoding reduces to local token-level choices. “Given context, choose variant a or b” aligns with next-token prediction.
- Partially trainable (0.10 ≤ sBER < 0.40): the model learns part of the scheme, but multi-token structural coordination caps accuracy (both syntactic schemes plateau here).
- Not trainable (sBER ≥ 0.40): fluent text, no recoverable payload (Moraldo near chance everywhere).
- Collapse: the model abandons coherent text and spams EOS/special tokens, a training failure rather than a representational ceiling.
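The tier boundaries above can be written down as a small helper. This is a sketch for reading the results grid, not code from the release: the function name and the `collapsed` flag are hypothetical, and how a run is judged to have collapsed (for example, by inspecting outputs for EOS spam) is left outside the function. The sBER values in the usage lines are illustrative.

```python
def tier(sber: float, collapsed: bool = False) -> str:
    """Map a run's sBER to one of the four tiers described above.

    `collapsed` is a hypothetical flag set by whatever check detects
    degenerate output; it takes priority over the sBER thresholds.
    """
    if collapsed:
        return "collapse"
    if sber < 0.10:
        return "trainable"
    if sber < 0.40:
        return "partially trainable"
    return "not trainable"


# Illustrative values only, not results from the benchmark.
print(tier(0.05))                  # trainable
print(tier(0.25))                  # partially trainable
print(tier(0.48))                  # not trainable
print(tier(0.48, collapsed=True))  # collapse
```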
The split reduces to one property: whether the encoding target is stationary under training. Generative schemes encode bits by partitioning the model’s per-token probability distribution, so their targets move with the model itself. Offline data goes stale as the trained model drifts, while online data turns the objective non-stationary. Either way, the easiest path through the loss landscape is the EOS shortcut, and all ten generative schemes collapse this way across three model families and two prompt designs.
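The moving-target problem can be seen in a toy Python sketch. It is a deliberately simplified bin-partition scheme, not an implementation of Meteor, DISCOP, or any other scheme in the benchmark: real generative schemes sample from the distribution rather than taking the top token. The function names and both three-token distributions are hypothetical.

```python
def partition(probs):
    """Greedily split tokens into two bins of roughly equal probability mass.

    Bin 0 encodes bit 0 and bin 1 encodes bit 1.
    """
    bins = ([], [])
    mass = [0.0, 0.0]
    for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        i = 0 if mass[0] <= mass[1] else 1  # add to the lighter bin
        bins[i].append(token)
        mass[i] += p
    return bins


def encode_bit(probs, bit):
    """Emit the most likely token from the bin that represents `bit`."""
    return partition(probs)[bit][0]


def decode_bit(probs, token):
    """Recover the bit by checking which bin the token falls into."""
    return 0 if token in partition(probs)[0] else 1


# Hypothetical next-token distributions before and after some fine-tuning.
before = {"the": 0.5, "a": 0.3, "this": 0.2}
after = {"the": 0.3, "a": 0.5, "this": 0.2}

target = encode_bit(before, 1)    # training target made with the old model
print(target)                     # a
print(decode_bit(before, target)) # 1
print(decode_bit(after, target))  # 0: the same token now means the other bit
```

A training example generated against `before` teaches the model to emit "a" for bit 1, but once the model's distribution has shifted to `after`, that same token decodes as bit 0. This is the sense in which offline data goes stale.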
Scale does not save you
- Across Qwen 0.5B to 14B, BER on Chang & Clark varies by under 2 percentage points. The best run is the mid-sized 14B (0.007), not the largest.
- 32B and 72B regress (0.125 and 0.254). The fixed training set is small relative to model size.
- Improving the training data (n-gram-verified synonyms, mixed capacity) cut BER 5× on the same model. Training-data fidelity, not scale, is the primary determinant.
Prompting does not either
The detectability tradeoff
A logistic-regression classifier on linguistic features gives a clear tradeoff across families:
- Lexical: easy to train, more detectable (AUROC 0.54–0.69)
- Syntactic: hard to train, near-undetectable (AUROC 0.52–0.59)
Neither is reliably caught by output inspection alone.
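For readers unfamiliar with the metric, AUROC is the probability that the classifier scores a randomly chosen stego sentence above a randomly chosen cover sentence, with ties counted as half. The dependency-free Python sketch below computes it directly from that definition. The scores are made up for illustration and the function name is hypothetical; in practice they would be the logistic-regression outputs on held-out text.

```python
def auroc(cover_scores, stego_scores):
    """P(stego score > cover score) over all pairs, counting ties as half."""
    wins = 0.0
    for s in stego_scores:
        for c in cover_scores:
            if s > c:
                wins += 1.0
            elif s == c:
                wins += 0.5
    return wins / (len(stego_scores) * len(cover_scores))


# Hypothetical classifier scores (higher = "looks like stego").
cover_scores = [0.2, 0.4, 0.5, 0.7]
stego_scores = [0.3, 0.6, 0.5, 0.8]

print(auroc(cover_scores, stego_scores))  # 0.65625
print(auroc([0.1, 0.2], [0.8, 0.9]))      # 1.0, perfectly separable
```

An AUROC of 0.5 means the detector does no better than a coin flip, so the ranges reported above sit between weak signal and no signal.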
We release everything
The release includes fine-tuned checkpoints for every trainable scheme–model pair, the synthetic training data for all 18 schemes, and the full results grid. All of it is gated on Hugging Face for defensive AI-safety research.
Citation
@article{meyer2026stegobench,
  title  = {StegoBench: Evaluating steganography potential in language models through supervised learning},
  author = {Meyer, Bryce and Krzyzanowski, Robert and Arcuschin, Iv\'an and Lange, Georg and Lee, Matthew},
  year   = {2026},
  note   = {Poseidon Research}
}