CantHarm Schema And Ontology

Release structure

Sheet roles and benchmark meaning

Release object	Key fields	Role in the benchmark
Form rows	`class_1_final`, `class_2_final`, `severity_score_final`, `jyutping`, `ipa`, `primary_filename`	Define the adjudicated benchmark gold for form-level tasks and the form side of the speech-conditioned tasks.
Sense rows	`class_1_source`, `class_2_source`, `severity_score_source`, `source_dictionary`, `source_status`, definitions	Preserve source-linked ambiguity, provenance, and cross-source disagreement for sense-level evaluation and audit.
Canonical audio	`Fxxxxx_SPKyy.wav` naming pattern, one waveform per form	Provides the controlled speech grounding for audio-only, fusion, ASR-cascade, and retrieval tasks.
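
As an illustration of the audio naming pattern in the table above, the following minimal sketch validates a filename and splits it into its form and speaker parts. It assumes that `xxxxx` stands for five decimal digits and `yy` for two decimal digits, which the release description does not state explicitly; `parse_audio_name` and `AudioName` are hypothetical helper names, not part of the release.

```python
import re
from typing import NamedTuple, Optional

# Assumption: "xxxxx" and "yy" in `Fxxxxx_SPKyy.wav` are decimal digits.
AUDIO_NAME = re.compile(r"F(\d{5})_SPK(\d{2})\.wav")


class AudioName(NamedTuple):
    form_id: str      # e.g. "F00012"
    speaker_id: str   # e.g. "SPK03"


def parse_audio_name(filename: str) -> Optional[AudioName]:
    """Return the form and speaker parts, or None if the name does not match."""
    match = AUDIO_NAME.fullmatch(filename)
    if match is None:
        return None
    return AudioName(form_id=f"F{match.group(1)}",
                     speaker_id=f"SPK{match.group(2)}")


if __name__ == "__main__":
    print(parse_audio_name("F00012_SPK03.wav"))
    print(parse_audio_name("notes.txt"))
```

A check of this kind could be run over the `primary_filename` column to confirm that every form row points at a well-formed audio name, under the digit assumption above.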

Supervision contract

Form gold versus sense evidence

Form-level evaluation uses adjudicated gold targets. Sense-level evaluation uses source-linked targets. The form gold should not be reconstructed by mechanically unioning or voting over linked sense rows.

507 polysemous forms keep one coarse and one fine label across senses.
199 polysemous forms keep one coarse label but multiple fine labels.
408 polysemous forms remain coarse-diverse: their linked senses carry more than one coarse label.
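
The three agreement patterns above can be recomputed from the sense rows with a short grouping pass. The sketch below is illustrative only: it assumes that `class_1_source` holds the coarse label and `class_2_source` the fine label, and that each sense row carries a key linking it to its form (called `form_id` here, a hypothetical name). The fine labels in the toy rows are placeholders, not release labels.

```python
from collections import defaultdict


def agreement_profile(sense_rows):
    """Bucket each polysemous form by how its linked senses agree.

    Returns {form_id: "uniform" | "fine-diverse" | "coarse-diverse"}.
    Forms with a single linked sense are skipped.
    """
    coarse = defaultdict(set)
    fine = defaultdict(set)
    n_senses = defaultdict(int)
    for row in sense_rows:
        fid = row["form_id"]  # hypothetical linking key
        coarse[fid].add(row["class_1_source"])  # assumed coarse label
        fine[fid].add(row["class_2_source"])    # assumed fine label
        n_senses[fid] += 1

    profile = {}
    for fid, n in n_senses.items():
        if n < 2:
            continue  # not polysemous
        if len(coarse[fid]) > 1:
            profile[fid] = "coarse-diverse"
        elif len(fine[fid]) > 1:
            profile[fid] = "fine-diverse"
        else:
            profile[fid] = "uniform"
    return profile


# Toy rows for illustration; fine labels are placeholders.
toy_rows = [
    {"form_id": "F00001", "class_1_source": "Benign", "class_2_source": "fine-a"},
    {"form_id": "F00001", "class_1_source": "Benign", "class_2_source": "fine-a"},
    {"form_id": "F00002", "class_1_source": "Political", "class_2_source": "fine-b"},
    {"form_id": "F00002", "class_1_source": "Political", "class_2_source": "fine-c"},
    {"form_id": "F00003", "class_1_source": "Benign", "class_2_source": "fine-a"},
    {"form_id": "F00003", "class_1_source": "Sexual-obscene", "class_2_source": "fine-d"},
    {"form_id": "F00004", "class_1_source": "Benign", "class_2_source": "fine-a"},
]

if __name__ == "__main__":
    print(agreement_profile(toy_rows))
```

For coarse-diverse forms in particular, a vote or union over senses has no single obvious winner, which is one reason the adjudicated form gold should be read from the form rows rather than derived from a profile like this one.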

Layer relationship

Why both layers are released

One lexical form can link to more than one sense.
Source dictionaries can differ in granularity and wording.
Form-level targets are finalized for scoring, while sense rows preserve ambiguity and provenance.

The form layer and sense layer are therefore complementary views of the same release rather than duplicate tables.

Label ontology

Coarse categories in the locked release

Coarse label	Forms	Share
Insult-discrimination	2470	51.21%
Illicit-illegal	1206	25.01%
Sexual-obscene	602	12.48%
Benign	358	7.42%
Political	181	3.75%
Terror-extremism	6	0.12%
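
The shares in the table follow directly from the form counts. As a quick consistency check, the sketch below recomputes them from the counts listed above; the variable and function names are illustrative.

```python
# Form counts per coarse label, copied from the table above.
coarse_counts = {
    "Insult-discrimination": 2470,
    "Illicit-illegal": 1206,
    "Sexual-obscene": 602,
    "Benign": 358,
    "Political": 181,
    "Terror-extremism": 6,
}


def coarse_shares(counts):
    """Return each label's share of all forms as a percentage string."""
    total = sum(counts.values())
    return {label: f"{100 * n / total:.2f}%" for label, n in counts.items()}


if __name__ == "__main__":
    print("total forms:", sum(coarse_counts.values()))
    for label, share in coarse_shares(coarse_counts).items():
        print(f"{label}\t{coarse_counts[label]}\t{share}")
```

Because each share is rounded to two decimals independently, the printed percentages need not sum to exactly 100%.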

Schema note. Source provenance, review flags, missingness indicators, and other audit fields are retained so that ambiguity and curation history remain inspectable. They are not part of the benchmark gold definition, and their use as model inputs is restricted.