CantHarm Datasheet

Motivation

CantHarm supports evaluation on harmful Cantonese lexical forms together with their linked senses, coarse and fine labels, scalar severity scores, and canonical audio. It is designed for controlled benchmark analysis, not for conversational deployment.

Composition

Component	Count or description
Forms	4,823 retained lexical forms.
Senses	6,365 linked sense rows.
Audio	4,823 canonical clips, one per retained form.
Speakers	11 pseudonymous speaker IDs.
Labels	Coarse labels, fine labels, and scalar severity scores.
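
The counts above suggest a simple integrity check after download. The sketch below is illustrative only: the row layout (dictionaries keyed by a `form_id` field) and the function name `check_release` are assumptions, not part of the documented release format.

```python
from collections import Counter

def check_release(forms, senses, clips, expected_forms=4823, expected_senses=6365):
    """Return a list of problems; an empty list means the counts are consistent.

    Each argument is a list of dicts carrying a hypothetical "form_id" key.
    """
    problems = []
    form_ids = {row["form_id"] for row in forms}
    if len(form_ids) != expected_forms:
        problems.append(f"expected {expected_forms} forms, found {len(form_ids)}")
    if len(senses) != expected_senses:
        problems.append(f"expected {expected_senses} sense rows, found {len(senses)}")
    # Every sense row should link back to a retained form.
    orphans = sum(1 for row in senses if row["form_id"] not in form_ids)
    if orphans:
        problems.append(f"{orphans} sense rows link to no retained form")
    # Each retained form should have exactly one canonical clip.
    clips_per_form = Counter(row["form_id"] for row in clips)
    bad = [fid for fid in form_ids if clips_per_form[fid] != 1]
    if bad:
        problems.append(f"{len(bad)} forms do not have exactly one clip")
    return problems
```

Returning a list of messages rather than raising on the first failure makes it easier to report every mismatch in one pass.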

Annotation workflow

The documented workflow used 12 annotators organized into 4 groups of 3. Each item received two independent annotations, made in isolation, covering coarse label, fine label, and severity. Disagreements were escalated in stages as needed: first to a third reviewer within the group, then to cross-group review, and finally to full-team adjudication.
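
The escalation path can be sketched as a chain of review stages. This is a minimal illustration, not the project's tooling: the `Annotation` fields, the `resolve_item` function, the stage names, and the default rule that two annotations agree only when all three fields match exactly are all assumptions, since the datasheet does not state the agreement criterion.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

@dataclass(frozen=True)
class Annotation:
    coarse: str
    fine: str
    severity: int

# A stage is a name plus a reviewer callback that returns a decision,
# or None when the stage cannot settle the item and it must escalate.
Stage = Tuple[str, Callable[[Annotation, Annotation], Optional[Annotation]]]

def resolve_item(
    first: Annotation,
    second: Annotation,
    stages: Sequence[Stage],
    agree: Callable[[Annotation, Annotation], bool] = lambda a, b: a == b,
) -> Tuple[str, Annotation]:
    """Return (stage name, final annotation) for one doubly annotated item."""
    if agree(first, second):  # assumed criterion: exact match on all fields
        return "double_annotation", first
    for name, review in stages:  # tried in escalation order
        decision = review(first, second)
        if decision is not None:
            return name, decision
    raise ValueError("no stage resolved the disagreement")
```

Under these assumptions, `stages` would list the in-group third reviewer, the cross-group review, and the full-team adjudication in that order, with the last stage expected always to return a decision.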

Agreement item	Audited value
Coarse label alpha, nominal	0.86
Fine label alpha, nominal	0.73
Severity alpha, ordinal	0.82
Severity MAE	6 points
Severity same-band agreement	85%
Severity same-or-adjacent-band agreement	91%
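
For readers who want to compute statistics of the same kind on their own annotation pairs, the following sketch may help. It rests on assumptions the datasheet does not confirm: that "alpha" is Krippendorff's alpha, that every item has exactly two annotations, and that severity bands are equal-width bins whose width is given by the hypothetical `band_width` parameter. Only the nominal alpha is shown; the ordinal variant uses a different distance function and is omitted.

```python
from collections import Counter

def nominal_alpha(pairs):
    """Krippendorff's alpha (nominal) for two annotations per item, no missing data."""
    n_values = 2 * len(pairs)
    # Observed disagreement: share of items whose two labels differ.
    d_o = sum(a != b for a, b in pairs) / len(pairs)
    # Expected disagreement from the pooled label frequencies.
    freq = Counter(label for pair in pairs for label in pair)
    d_e = (n_values**2 - sum(c * c for c in freq.values())) / (n_values * (n_values - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1 - d_o / d_e

def severity_agreement(pairs, band_width=20):
    """MAE plus same-band and same-or-adjacent-band rates for score pairs.

    band_width is a hypothetical bin size, not a documented CantHarm value.
    """
    n = len(pairs)
    mae = sum(abs(a - b) for a, b in pairs) / n
    bands = [(a // band_width, b // band_width) for a, b in pairs]
    same = sum(x == y for x, y in bands) / n
    adjacent = sum(abs(x - y) <= 1 for x, y in bands) / n
    return {"mae": mae, "same_band": same, "same_or_adjacent_band": adjacent}
```

Because the band edges used for the audited values are not given here, results from `severity_agreement` are comparable to the table only if the same banding is applied.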

Recommended uses

Benchmark reproduction on the current release.
Analysis of lexical ambiguity and sense-aware harmfulness labels.
Research on canonical audio cues and multimodal baselines within the documented task scope.
Release documentation, citation, and archival review.

Limitations

One canonical audio clip is provided per retained form.
The release is not a spontaneous conversational speech corpus.
The primary split is based on lexical units; no speaker-generalization claim is made.
Audio and fusion evidence in the locked results should be interpreted as modest and complementary.
No speaker demographic distribution claim is made.
Dictionary-derived definitions or source text have source-specific redistribution caveats.
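
The lexical-unit split noted in the list above means that all rows belonging to one form should land in the same partition. The sketch below shows one way to get that property, by hashing the form identifier. It is not the release's split procedure: the `form_id` field, the partition names, the `test_fraction` value, and the `salt` string are hypothetical, and benchmark reproduction should use the split shipped with the release.

```python
import hashlib

def assign_split(form_id, test_fraction=0.2, salt="hypothetical-salt"):
    """Map a form identifier deterministically to 'train' or 'test'."""
    digest = hashlib.sha256(f"{salt}:{form_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"

def split_by_form(rows, **kwargs):
    """Partition sense rows so that every row of a form shares one split."""
    out = {"train": [], "test": []}
    for row in rows:
        out[assign_split(row["form_id"], **kwargs)].append(row)
    return out
```

Keying the assignment on the form rather than on the individual row prevents senses of the same form from appearing on both sides of the split.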

Maintenance

Primary contact: yueyu_dimsum@163.com; backup contact: qijiayin@139.com. This datasheet describes the locked public release of 2026-04-02.