CantHarm QC And Audit

Auditable evidence

QC summary

Release aspect	Evidence in bundle	Count or status	How to interpret it
Bundle identity	`cantharm_release_manifest.json`, `cantharm_release_readme.md`	`2026-04-02 public release`	Defines the single authoritative public release line used by the paper and this site.
Form-level manual label review	`label_review_flag`	1376 form rows	Shows that a substantial subset of released forms was manually reviewed rather than carried over untouched from the source.
Variant / normalization review	`variant_rule_manual_review`	67 sense rows	Confirms that some variant handling decisions were explicitly reviewed.
Multi-source provenance fields	`source_dictionary`, `source_status`	Present in released sense rows	Source information remains visible in the public workbook for audit and interpretation.
Form and sense layers	Finalized form targets and linked sense fields	Both released	The benchmark provides a finalized form layer for scoring and a linked sense layer for interpretation.
Pronunciation normalization	`jyutping`, `jyutping_status`, `ipa`	Present in released form rows	Pronunciation is part of the released benchmark object.
Documentation pages	Site pages, README, reproducibility note	Included	The public bundle includes written guidance for scope, protocol, access, ethics, and release interpretation.
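The flag counts in the table can in principle be re-derived from the released rows. The sketch below is only an illustration: it assumes the rows have been loaded as dictionaries keyed by column name and that flag cells use spellings such as `1`, `true`, or `yes`. The bundle does not document that encoding here, so the `TRUTHY` set, the `count_flagged` helper, and the toy rows are all hypothetical.

```python
from typing import Iterable, Mapping

# Assumption: flag cells use one of these truthy spellings; the released
# workbook may encode flags differently, so adjust before relying on this.
TRUTHY = {"1", "true", "yes", "y"}


def count_flagged(rows: Iterable[Mapping[str, object]], flag_column: str) -> int:
    """Count rows whose flag column holds a truthy value."""
    total = 0
    for row in rows:
        # Missing columns and empty cells are treated as "not flagged".
        value = str(row.get(flag_column, "")).strip().lower()
        if value in TRUTHY:
            total += 1
    return total


# Toy rows standing in for exported form rows (not real benchmark data).
toy_form_rows = [
    {"form": "a", "label_review_flag": "1"},
    {"form": "b", "label_review_flag": ""},
    {"form": "c", "label_review_flag": "TRUE"},
]
flagged_forms = count_flagged(toy_form_rows, "label_review_flag")
print(flagged_forms)
```

If the workbook is exported to CSV, rows from the standard library's `csv.DictReader` could be passed in directly. The same helper would apply to `variant_rule_manual_review` on the sense rows, under the same encoding assumption.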

Severity note

Rubric-driven scalar score

Severity is released as a scalar target for benchmark evaluation, but it should not be read as a universal or context-independent measure of harm intensity. It is a rubric-driven score aligned to this benchmark object.

Severity equal to 0: 358 forms
Severity of at least 95: 924 forms
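The two counts above are simple tallies over the scalar severity target. As a rough sketch of how such tallies could be checked, the function below counts scores equal to 0 and scores at or above a cutoff. It assumes severity has already been read as numbers; the function name, the result keys, and the toy scores are illustrative and not part of the release.

```python
from typing import Dict, Iterable


def severity_counts(scores: Iterable[float], high_cutoff: float = 95) -> Dict[str, int]:
    """Tally scores equal to 0 and scores at or above high_cutoff."""
    zero = 0
    high = 0
    for score in scores:
        if score == 0:
            zero += 1
        if score >= high_cutoff:
            high += 1
    return {"severity_zero": zero, "severity_at_least_cutoff": high}


# Toy scores for illustration only (not real benchmark data).
toy_scores = [0, 0, 40, 95, 99.5]
summary = severity_counts(toy_scores)
print(summary)
```

Run against the released form rows, a tally of this kind would be expected to match the counts listed above, provided the severity column is parsed as numeric.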

What is intentionally not claimed

Audit boundary

No standalone released agreement-rate table
No released contextual-utterance annotation study
No claim that one numeric QC threshold alone captures audio quality
No claim that source disagreement is fully resolved at the sense layer