Dataset documentation summary

This page summarizes the 2026-04-02 locked public release in datasheet form.

Motivation

CantHarm supports evaluation of Cantonese harmful lexical forms, linked senses, coarse and fine labels, scalar severity, and canonical audio. It is designed for controlled benchmark analysis rather than conversational deployment.

Composition

Component    Count or description
Forms        4,823 retained lexical forms.
Senses       6,365 linked sense rows.
Audio        4,823 canonical clips, one per retained form.
Speakers     11 pseudonymous speaker IDs.
Labels       Coarse labels, fine labels, and scalar severity scores.
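
The counts above imply two structural invariants: every retained form has exactly one canonical clip, and every sense row links to a retained form. The sketch below shows one way a user might check a local copy against them. The function name and the idea of keying all three tables by a shared form ID are assumptions made for illustration; the release's actual file layout and column names are not described on this page.

```python
from collections import Counter
from typing import List, Sequence


def check_composition(
    form_ids: Sequence[str],
    sense_form_ids: Sequence[str],
    clip_form_ids: Sequence[str],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means consistent.

    form_ids:       one ID per retained form (hypothetical key)
    sense_form_ids: for each sense row, the ID of the form it links to
    clip_form_ids:  for each audio clip, the ID of the form it belongs to
    """
    problems: List[str] = []
    forms = set(form_ids)

    if len(forms) != len(form_ids):
        problems.append("duplicate form IDs")

    # Each retained form should have exactly one canonical clip.
    clip_counts = Counter(clip_form_ids)
    bad_clips = sorted(f for f in forms if clip_counts[f] != 1)
    if bad_clips:
        problems.append(f"forms without exactly one clip: {bad_clips}")

    # Clips and senses should only point at retained forms.
    orphan_clips = sorted(set(clip_counts) - forms)
    if orphan_clips:
        problems.append(f"clips for unknown forms: {orphan_clips}")

    orphan_senses = sorted(set(sense_form_ids) - forms)
    if orphan_senses:
        problems.append(f"senses for unknown forms: {orphan_senses}")

    return problems
```

On the full release, a consistent copy would also be expected to have 4,823 entries in both the form and clip lists and 6,365 sense rows.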

Annotation workflow

The documented workflow used 12 annotators in 4 groups of 3. Each item received two independent annotations, made in isolation, for coarse label, fine label, and severity. Disagreements were escalated in stages as needed: first to an in-group third reviewer, then to cross-group review, and finally to full-team adjudication.
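
The escalation path can be sketched as a chain of review stages tried in order. This is a minimal illustration and not the project's actual tooling: the `Annotation` fields, the `resolve` function, and the rule that any difference in any field counts as a disagreement are all assumptions, since this page does not state the disagreement threshold.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Annotation:
    coarse: str
    fine: str
    severity: int


# A stage is (name, reviewer). The reviewer returns a resolved annotation,
# or None to pass the item on to the next stage.
Reviewer = Callable[[Annotation, Annotation], Optional[Annotation]]
Stage = Tuple[str, Reviewer]


def resolve(
    first: Annotation, second: Annotation, stages: Sequence[Stage]
) -> Tuple[Annotation, str]:
    """Return the final annotation and the name of the stage that produced it."""
    # Assumption: exact match on all three fields means no escalation.
    if first == second:
        return first, "initial_agreement"
    for name, review in stages:
        decision = review(first, second)
        if decision is not None:
            return decision, name
    raise ValueError("item still unresolved after the final stage")
```

In use, `stages` would list the in-group third reviewer, cross-group review, and full-team adjudication in that order, so an item only reaches a later stage when every earlier one returns `None`.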

Agreement item                              Audited value
Coarse label alpha, nominal                 0.86
Fine label alpha, nominal                   0.73
Severity alpha, ordinal                     0.82
Severity MAE                                6 points
Severity same-band agreement                85%
Severity same-or-adjacent-band agreement    91%
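
For readers who want to compute comparable severity statistics on their own paired annotations, the sketch below shows the three non-alpha quantities: mean absolute error, same-band agreement, and same-or-adjacent-band agreement. The band edges are a required parameter because this page does not define the severity scale or its bands; the function names are hypothetical, and the alpha coefficients are not computed here.

```python
from bisect import bisect_right
from typing import Dict, Sequence


def band_index(score: float, edges: Sequence[float]) -> int:
    """Map a score to a band number; edges are ascending lower bounds of bands 1..k."""
    return bisect_right(edges, score)


def severity_agreement(
    first: Sequence[float], second: Sequence[float], edges: Sequence[float]
) -> Dict[str, float]:
    """Compare two annotators' severity scores for the same items."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(first)

    # Mean absolute difference in raw severity points.
    mae = sum(abs(x - y) for x, y in zip(first, second)) / n

    # Distance between the two annotators' bands for each item.
    gaps = [
        abs(band_index(x, edges) - band_index(y, edges))
        for x, y in zip(first, second)
    ]
    return {
        "mae": mae,
        "same_band": sum(g == 0 for g in gaps) / n,
        "same_or_adjacent_band": sum(g <= 1 for g in gaps) / n,
    }
```

Same-or-adjacent-band agreement is never lower than same-band agreement, because every same-band pair also counts as adjacent.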

Recommended uses

  • Benchmark reproduction on the current release.
  • Analysis of lexical ambiguity and sense-aware harmfulness labels.
  • Research on canonical audio cues and multimodal baselines within the documented task scope.
  • Release documentation, citation, and archival review.

Limitations

  • One canonical audio clip is provided per retained form.
  • The release is not a spontaneous conversational speech corpus.
  • The primary split is based on lexical units; no speaker-generalization claim is made.
  • Audio and fusion evidence in the locked results should be interpreted as modest and complementary.
  • No speaker demographic distribution claim is made.
  • Dictionary-derived definitions or source text have source-specific redistribution caveats.

Maintenance

Primary contact: yueyu_dimsum@163.com. Backup contact: qijiayin@139.com. This datasheet describes the 2026-04-02 locked public release.