Dataset documentation summary
This page summarizes the 2026-04-02 locked public release in datasheet form.
Motivation
CantHarm supports evaluation over harmful Cantonese lexical forms together with their linked senses, coarse and fine labels, scalar severity scores, and canonical audio. It is designed for controlled benchmark analysis, not for conversational deployment.
Composition
| Component | Count or description |
|---|---|
| Forms | 4,823 retained lexical forms. |
| Senses | 6,365 linked sense rows. |
| Audio | 4,823 canonical clips, one per retained form. |
| Speakers | 11 pseudonymous speaker IDs. |
| Labels | Coarse labels, fine labels, and scalar severity scores. |
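The counts above imply a few structural invariants: every sense row links to a retained form, and every retained form has exactly one canonical clip. The following is a minimal sketch of how a consumer might check these after loading the release. The field names (`form_id`, `sense_id`, `clip_id`, `speaker_id`) and the list-of-dicts layout are assumptions for illustration, not the release's documented schema.

```python
from collections import Counter


def check_release_invariants(forms, senses, clips):
    """Return a list of problems; an empty list means every check passed.

    All field names here are hypothetical and may differ from the release.
    """
    problems = []

    form_ids = [row["form_id"] for row in forms]
    known = set(form_ids)
    if len(known) != len(form_ids):
        problems.append("duplicate form_id values in forms")

    # Every sense row should link to a retained form.
    orphans = [row for row in senses if row["form_id"] not in known]
    if orphans:
        problems.append(f"{len(orphans)} sense row(s) link to an unknown form")

    # Exactly one canonical clip per retained form.
    clips_per_form = Counter(row["form_id"] for row in clips)
    bad = [fid for fid in known if clips_per_form[fid] != 1]
    if bad:
        problems.append(f"{len(bad)} form(s) without exactly one clip")

    # Clips should not point at forms outside the retained set.
    stray = [fid for fid in clips_per_form if fid not in known]
    if stray:
        problems.append(f"{len(stray)} clip form_id(s) not in forms")

    return problems


# Toy rows standing in for the real tables.
toy_forms = [{"form_id": "f1"}, {"form_id": "f2"}]
toy_senses = [
    {"sense_id": "s1", "form_id": "f1"},
    {"sense_id": "s2", "form_id": "f1"},
    {"sense_id": "s3", "form_id": "f2"},
]
toy_clips = [
    {"clip_id": "c1", "form_id": "f1", "speaker_id": "spk01"},
    {"clip_id": "c2", "form_id": "f2", "speaker_id": "spk02"},
]

problems = check_release_invariants(toy_forms, toy_senses, toy_clips)
```

On the full release, the same check would be expected to pass with 4,823 forms, 6,365 sense rows, and 4,823 clips.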
Annotation workflow
The documented workflow used 12 annotators in 4 groups of 3. Each item received two independent annotations, made in isolation, for coarse label, fine label, and severity. Disagreements were escalated in stages as needed: first to an in-group third reviewer, then to cross-group review, and finally to full-team adjudication.
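The escalation order can be sketched as a small resolver that stops at the first stage able to settle an item. This is an illustrative sketch only: the datasheet does not specify the decision rule used at each stage, so each stage is modeled here as a hypothetical callable that returns a label or `None` to pass the item on. The label strings are also placeholders.

```python
from typing import Callable, Optional, Sequence, Tuple

# A stage is (name, reviewer). The reviewer sees both annotations and returns
# a final label, or None if this stage cannot settle the item. (Assumed model.)
Stage = Tuple[str, Callable[[str, str], Optional[str]]]


def resolve(first: str, second: str, stages: Sequence[Stage]) -> Tuple[Optional[str], str]:
    """Return (final_label, name_of_stage_that_settled_it)."""
    if first == second:
        return first, "independent agreement"
    for name, review in stages:
        decision = review(first, second)
        if decision is not None:
            return decision, name
    return None, "unresolved"


# Hypothetical reviewers, listed in the documented escalation order.
stages = [
    ("in-group third reviewer", lambda a, b: None),       # cannot settle
    ("cross-group review", lambda a, b: "label_x"),       # settles the item
    ("full-team adjudication", lambda a, b: a),           # never reached here
]

label, settled_by = resolve("label_x", "label_y", stages)
```

Recording the settling stage alongside the label makes it possible to report how often each escalation level was needed.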
| Agreement item | Audited value |
|---|---|
| Coarse label alpha, nominal | 0.86 |
| Fine label alpha, nominal | 0.73 |
| Severity alpha, ordinal | 0.82 |
| Severity MAE | 6 points |
| Severity same-band agreement | 85% |
| Severity same-or-adjacent-band agreement | 91% |
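For readers who want to compute comparable statistics on their own annotations, the sketch below shows one way to derive severity MAE, same-band agreement, same-or-adjacent-band agreement, and nominal Krippendorff's alpha for two annotators with no missing values. The 0 to 100 severity scale and the equal-width 20-point bands are assumptions made for illustration; the datasheet does not state the band boundaries, and the audited values above may have been computed differently. Ordinal alpha is not covered here.

```python
import math
from collections import Counter


def severity_band(score, band_width=20, max_score=100):
    """Map a score to a band index. ASSUMED: 0-100 scale, equal-width bands."""
    if not 0 <= score <= max_score:
        raise ValueError(f"score out of range: {score}")
    top_band = max_score // band_width - 1
    return min(int(score // band_width), top_band)


def severity_agreement(scores_a, scores_b, band_width=20, max_score=100):
    """MAE and band-level agreement between two annotators' severity scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(scores_a)
    mae = sum(abs(x - y) for x, y in zip(scores_a, scores_b)) / n
    gaps = [
        abs(severity_band(x, band_width, max_score) - severity_band(y, band_width, max_score))
        for x, y in zip(scores_a, scores_b)
    ]
    return {
        "mae": mae,
        "same_band": sum(g == 0 for g in gaps) / n,
        "same_or_adjacent_band": sum(g <= 1 for g in gaps) / n,
    }


def nominal_alpha_two_coders(labels_a, labels_b):
    """Krippendorff's alpha (nominal), two coders, no missing values."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n_values = 2 * len(labels_a)
    disagreements = sum(x != y for x, y in zip(labels_a, labels_b))
    counts = Counter(labels_a) + Counter(labels_b)
    # Sum over c != k of n_c * n_k, i.e. n^2 minus the sum of squared counts.
    expected = n_values ** 2 - sum(c * c for c in counts.values())
    if expected == 0:
        return math.nan  # only one category observed; alpha is undefined
    # Each disagreeing unit contributes 2 to the off-diagonal coincidences.
    return 1.0 - (n_values - 1) * (2 * disagreements) / expected


# Toy data, not drawn from the release.
stats = severity_agreement([10, 35, 60, 85], [15, 45, 58, 40])
alpha = nominal_alpha_two_coders(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

With these toy inputs, the band gaps are 0, 1, 1, and 2, so same-band agreement is 0.25 and same-or-adjacent-band agreement is 0.75.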
Recommended uses
- Benchmark reproduction on the current release.
- Analysis of lexical ambiguity and sense-aware harmfulness labels.
- Research on canonical audio cues and multimodal baselines within the documented task scope.
- Release documentation, citation, and archival review.
Limitations
- One canonical audio clip is provided per retained form.
- The release is not a spontaneous conversational speech corpus.
- The primary split is based on lexical units; no speaker-generalization claim is made.
- Audio and fusion evidence in the locked results should be interpreted as modest and complementary.
- No speaker demographic distribution claim is made.
- Dictionary-derived definitions or source text have source-specific redistribution caveats.
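Because the primary split is based on lexical units, any re-split for auxiliary experiments should keep all rows of a lexical unit in the same partition. The sketch below shows one common way to do that with a deterministic hash. It is not the release's actual split procedure: the `lexical_unit_id` field, the salt, and the 80/10/10 proportions are all assumptions for illustration.

```python
import hashlib


def assign_split(lexical_unit_id, dev_pct=10, test_pct=10, salt="demo"):
    """Deterministically map a lexical unit to train, dev, or test.

    The proportions and salt are hypothetical; the partition depends only on
    the unit ID, so every row sharing that ID lands in the same split.
    """
    digest = hashlib.sha256(f"{salt}:{lexical_unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


# Toy rows: two senses of unit "u1" and one sense of unit "u2".
rows = [
    {"sense_id": "s1", "lexical_unit_id": "u1"},
    {"sense_id": "s2", "lexical_unit_id": "u1"},
    {"sense_id": "s3", "lexical_unit_id": "u2"},
]
splits = {row["sense_id"]: assign_split(row["lexical_unit_id"]) for row in rows}
```

A split of this kind controls lexical leakage only. It does not separate speakers, which is consistent with the limitation that no speaker-generalization claim is made.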
Maintenance
Primary contact: yueyu_dimsum@163.com; backup contact: qijiayin@139.com. This datasheet describes the 2026-04-02 locked public release.