Verification & correctness
A subtly wrong chat template doesn't crash — it quietly degrades the model and you blame the data. Chat Bricks treats correctness as something to prove, not assume.
There are two correctness claims to verify, and you should check both before training on any non-trivial template:
- Rendering parity: the prompt string Chat Bricks produces is byte-identical to the model's official apply_chat_template output.
- Tokenization parity: chat.tokenize(...) produces the same input_ids you would get by tokenizing the official prompt with add_special_tokens=False (i.e. the rendered string already contains any BOS markers the template emits).
Mask alignment follows from these two: if rendering and tokenization both match, the per-token labels and action_mask Chat Bricks emits correspond to exactly the assistant spans you would expect.
One-shot check with compare_hf_template
from transformers import AutoTokenizer
from chat_bricks.utils import compare_hf_template

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 3 times 5?"},
    {"role": "assistant", "content": "15"},
]

(is_equal,
 is_equal_between_implemented_prompts,
 is_equal_between_jinja_prompts,
 official_prompt,
 implemented_prompt,
 implemented_jinja_prompt,
 highlighted_prompt) = compare_hf_template(
    tokenizer, "qwen2.5", messages=messages, add_generation_prompt=True,
)

assert is_equal, "Python render diverges from official template"
assert is_equal_between_jinja_prompts, "Exported Jinja diverges from Python render"
assert is_equal_between_implemented_prompts, "Mask-highlighted render diverges"
The three booleans cover three correctness contracts:
- is_equal: Python render vs. the official apply_chat_template output.
- is_equal_between_jinja_prompts: Chat Bricks's exported Jinja template, when fed back through tokenizer.apply_chat_template, produces the same string as the Python render. This is what lets you ship the same template to inference servers that only consume Jinja.
- is_equal_between_implemented_prompts: the prompt with mask annotations (used internally to compute action_mask) round-trips to the same string when stripped.
If any of the three fails, highlighted_prompt shows the assistant spans colored in the terminal; use it to find where the divergence is.
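When a check fails, it can also help to locate the first differing character directly. The following is a minimal sketch in plain Python. first_divergence is a hypothetical helper written for this page, not part of Chat Bricks, and it assumes only that you have two prompt strings to compare (for example official_prompt and implemented_prompt from the call above).

```python
def first_divergence(expected: str, actual: str, context: int = 20):
    """Return None if the strings match, else (index, expected_snippet, actual_snippet)."""
    limit = min(len(expected), len(actual))
    # Walk to the first position where the two strings disagree.
    index = next((i for i in range(limit) if expected[i] != actual[i]), limit)
    if index == limit and len(expected) == len(actual):
        return None  # identical strings
    start = max(0, index - context)
    return index, expected[start:index + context], actual[start:index + context]


# Toy strings standing in for official_prompt / implemented_prompt.
official = "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n"
implemented = "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant"

result = first_divergence(official, implemented)
print(repr(result))  # repr makes a missing trailing newline visible
```

Printing with repr matters here: many template bugs are whitespace-only and invisible in a normal print.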
Verifying tokenization
For Llama-style templates that emit BOS markers in the rendered string (<|begin_of_text|>), tokenizing the result with add_special_tokens=True would double-prepend the BOS. The right comparison is:
import torch

# chat is an existing Chat instance; tokenizer is the model's HF tokenizer
prompt = chat.prompt()
hf_inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
implemented = chat.tokenize(tokenizer)
assert torch.equal(hf_inputs["input_ids"], implemented["input_ids"])
For models whose tokenizer does not add a BOS token on its own (most Qwen variants), add_special_tokens=True is also safe because there's nothing to add.
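A doubled BOS is easy to miss by eye but simple to detect on the token ids. The sketch below is plain Python over a list of ids. count_leading_bos is a hypothetical helper, and the ids used are illustrative values only; in practice, read the real BOS id from tokenizer.bos_token_id.

```python
def count_leading_bos(input_ids, bos_token_id):
    """Count how many times bos_token_id repeats at the start of input_ids."""
    count = 0
    for token_id in input_ids:
        if token_id != bos_token_id:
            break
        count += 1
    return count


BOS = 128000  # illustrative id; use tokenizer.bos_token_id in practice

ok_ids = [BOS, 882, 271, 9906]            # BOS from the rendered string only
doubled_ids = [BOS, BOS, 882, 271, 9906]  # rendered BOS plus a tokenizer-added BOS

print(count_leading_bos(ok_ids, BOS))       # 1
print(count_leading_bos(doubled_ids, BOS))  # 2
```

A count above 1 suggests the prompt was tokenized with add_special_tokens=True on top of a template that already renders its own BOS. If the tokenizer has no BOS token (bos_token_id is None), the helper returns 0.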
What the test suite checks
Every built-in template is checked against the model's official template in CI on every push and PR. The relevant suites:
| File | What it asserts |
|---|---|
| tests/test_builtin_templates/test_*_full_align.py | compare_hf_template byte-equality across a matrix of system/no-system × tools/no-tools × multi-turn × generation-prompt cases |
| tests/test_builtin_templates/test_text_templates_tokenize.py | chat.tokenize produces the same input_ids as the official tokenizer (with add_special_tokens=False) |
| tests/test_builtin_templates/test_skills_jinja_parity.py | Python render and exported-Jinja render match across (tools, skills) ∈ {neither, tools-only, skills-only, both} |
When you add a new template or change a policy, run those suites locally before merging:
pytest tests/test_builtin_templates/ -k "not llama and not kimi"
The -k filter skips gated-model tests that can't run in CI.
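If you want a similar system/tools/multi-turn/generation-prompt matrix for your own template, the combinations can be enumerated with itertools.product. This is only a sketch of how such a matrix might be built, not the project's actual test code. build_cases and the flag names are hypothetical.

```python
from itertools import product

# Hypothetical flag names, one per axis of the matrix.
FLAGS = ("with_system", "with_tools", "multi_turn", "add_generation_prompt")


def build_cases():
    """Return one dict per combination of the four boolean flags."""
    return [dict(zip(FLAGS, combo)) for combo in product((False, True), repeat=len(FLAGS))]


cases = build_cases()
print(len(cases))  # 16 combinations (2 ** 4)
print(cases[0])
```

Each dict can then drive one parametrized test case that builds the matching conversation and calls compare_hf_template on it.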
When diff-based masking is approximate
The HFTemplate path (using a HuggingFace repo as the template) reconstructs masks by diffing incremental renders of the conversation. This works on append-only templates — i.e. each new turn adds text without modifying earlier turns.
It silently produces wrong masks on templates that mutate prior content. Known case: Qwen3 drops previous thinking blocks from the history when rendering a new turn, so the diff misaligns. Chat Bricks ships a hand-written Qwen3Template/Qwen3Renderer for this reason.
If you're using Chat(template="some/repo-id", ...) for a new family:
- Run compare_hf_template on a multi-turn conversation with assistant turns of varying length.
- If is_equal holds but the model still produces odd loss values, the template may be non-append-only. Open an issue or write a *Renderer subclass following the Qwen3Renderer pattern.
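Whether a template is append-only can be probed directly: render each prefix of the conversation and check that every render is a string prefix of the next one. The sketch below uses two toy renderers written for this example; they are not Chat Bricks or HuggingFace code. In practice, render would wrap something like tokenizer.apply_chat_template(messages, tokenize=False).

```python
import re


def is_append_only(render, messages):
    """True if render(messages[:k]) is a prefix of render(messages[:k+1]) for every k."""
    previous = ""
    for k in range(1, len(messages) + 1):
        current = render(messages[:k])
        if not current.startswith(previous):
            return False
        previous = current
    return True


def render_chatml(messages):
    # Toy append-only renderer: every turn is emitted verbatim.
    return "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages)


def render_strip_old_thinking(messages):
    # Toy history-mutating renderer: <think>...</think> survives only in the last message.
    parts = []
    for i, m in enumerate(messages):
        content = m["content"]
        if i < len(messages) - 1:
            content = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "What is 3 times 5?"},
    {"role": "assistant", "content": "<think>3 * 5</think>15"},
    {"role": "user", "content": "And times 2?"},
]

print(is_append_only(render_chatml, conversation))              # True
print(is_append_only(render_strip_old_thinking, conversation))  # False
```

A False result means an incremental diff cannot be trusted to recover assistant spans for that template, even when the final rendered string matches the official one.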
Pre-flight checklist for new models
Before kicking off a training run with a model you haven't used before:
1. Render check: compare_hf_template(tokenizer, template_name, messages, tools, add_generation_prompt=True) returns is_equal=True.
2. Tokenization check: chat.tokenize matches tokenizer(prompt, add_special_tokens=False).
3. Mask sanity: inputs["action_mask"].sum() matches roughly the assistant content length you'd expect (decode a few inputs["input_ids"][mask == 1] spans to eyeball).
4. Multi-turn check: repeat (1)–(3) with a 4-turn conversation including a tool call.
If any step fails, fix it before you train, not after.
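For the mask sanity step, grouping masked positions into contiguous spans makes them easier to decode and eyeball. This is a minimal sketch in plain Python. masked_spans is a hypothetical helper, and it assumes input_ids and action_mask are flat sequences of equal length (convert tensors with .tolist() first).

```python
def masked_spans(input_ids, action_mask):
    """Return a list of (start, end, ids) for each contiguous run where the mask is 1."""
    if len(input_ids) != len(action_mask):
        raise ValueError("input_ids and action_mask must have the same length")
    spans, start = [], None
    for i, flag in enumerate(action_mask):
        if flag and start is None:
            start = i  # a masked run begins here
        elif not flag and start is not None:
            spans.append((start, i, list(input_ids[start:i])))
            start = None  # the run ended at i (exclusive)
    if start is not None:  # the run extends to the end of the sequence
        spans.append((start, len(input_ids), list(input_ids[start:])))
    return spans


ids = [1, 10, 11, 2, 20, 21, 22, 3]
mask = [0, 0, 0, 0, 1, 1, 1, 0]
print(masked_spans(ids, mask))  # [(4, 7, [20, 21, 22])]
```

Passing each span's ids to tokenizer.decode should give back exactly one assistant turn per span. A span that starts mid-word or includes user text points to a misaligned mask.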
Where to go next
- Use any HuggingFace model: how HFTemplate and the diff-based mask path work.
- Tools and tool-call variants: verify a custom tool format produces correct masks.