CantHarm dataset release
CantHarm is a controlled Cantonese benchmark for harmful lexical forms, linked senses, severity labels, and canonical speech grounding. The site documents the locked public release and its access, licensing, citation, and archive plan.
Current release facts
| Field | Value |
|---|---|
| Release version | 2026-04-02 locked public release |
| Official website | https://cantharm.dataset.aidimsum.com/ |
| Mirrors | GitHub, HuggingFace, Zenodo/DOI, and OSF. |
| Primary materials | Workbook, metadata bundle, statistics, benchmark highlights, audio inventory, and speaker-packaged canonical audio archives. |
| Current boundary | These materials describe the 2026-04-02 locked public release. |
v1.1-gb extends the 2026-04-02 locked release (v1.0) with a complete second canonical recording line for the same 4,823 forms. It preserves v1.0 as the stable archival baseline while documenting a revised-audio candidate, a candidate DOI, and revised-candidate result tables.
Versions · v1.1-gb candidate page · candidate results · GitHub pre-release · candidate DOI
Versioning note: v1.0 remains the current public release and archival DOI; v1.1-gb is a separately versioned candidate line.
What this is and is not
Included scope
The release is intended for dataset inspection, benchmark reproduction, and research on Cantonese lexical ambiguity, sense-aware labels, harmfulness categories, severity scoring, and canonical audio cues.
Out-of-scope uses
It is not a conversational moderation corpus, a spontaneous-speech corpus, or a production moderation system, and it does not provide evidence of demographic or speaker generalization.
Documentation
Use the pages below as a compact index for the release files, public mirrors, citation record, and access policy. Each page refers to the same 2026-04-02 locked public release.
| Page | What to check |
|---|---|
| Downloads | Release file list, public mirrors, versioned candidate materials, and release documentation links. |
| Archive / Versions | GitHub, HuggingFace, Zenodo/DOI, and OSF public records. |
| Citation | Version citation, BibTeX, CFF, and Zenodo DOI. |
| Access / License / Ethics | Use boundaries, license scope, source-text caveat, and contact/takedown channels. |
The public release website documents the current release boundary, mirrors, license terms, citation, and contact/takedown channels.