CantHarm dataset release

CantHarm is a controlled Cantonese benchmark for harmful lexical forms, linked senses, severity labels, and canonical speech grounding. The site documents the locked public release and its access, licensing, citation, and archive plan.

Current release facts

4,823 retained lexical forms
6,365 linked sense rows
4,823 canonical audio clips
11 pseudonymous speaker IDs

Field: Value
Release version: 2026-04-02 locked public release
Official website: https://cantharm.dataset.aidimsum.com/
Mirrors: GitHub, HuggingFace, Zenodo/DOI, OSF.
Primary materials: Workbook, metadata bundle, statistics, benchmark highlights, audio inventory, and speaker-packaged canonical audio archives.
Current boundary: These materials describe the 2026-04-02 locked public release.
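
A local copy of the release can be checked against the headline counts listed above. The following is a minimal sketch, not part of the release tooling: it assumes the forms, sense rows, and audio inventory have been exported to CSV, and the column names `form_id` and `speaker_id` are hypothetical placeholders that should be replaced with the actual column names in the released files.

```python
import csv
from pathlib import Path

# Headline counts published for the 2026-04-02 locked public release.
EXPECTED = {
    "forms": 4823,
    "sense_rows": 6365,
    "audio_clips": 4823,
    "speakers": 11,
}


def read_rows(path):
    """Load a CSV file into a list of dicts, one per row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def summarize(forms, senses, audio):
    """Count forms, sense rows, clips, and speakers.

    The keys "form_id" and "speaker_id" are assumed column names,
    not names confirmed by the release documentation.
    """
    form_ids = {row["form_id"] for row in forms}
    return {
        "forms": len(form_ids),
        "sense_rows": len(senses),
        "audio_clips": len(audio),
        "speakers": len({row["speaker_id"] for row in audio}),
        # Sense rows whose form_id does not appear in the forms table.
        "orphan_senses": sum(1 for row in senses if row["form_id"] not in form_ids),
    }


def mismatches(summary, expected=EXPECTED):
    """Return {name: (found, expected)} for every count that differs."""
    return {
        name: (summary[name], want)
        for name, want in expected.items()
        if summary[name] != want
    }
```

Calling `mismatches(summarize(forms, senses, audio))` on rows loaded with `read_rows` returns an empty dict when the local copy matches the published counts. A non-zero `orphan_senses` value in the summary would suggest that the sense table and the forms table come from different versions.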

Versioned candidate: v1.1-gb

v1.1-gb extends the 2026-04-02 locked release with a complete second canonical recording line for the same 4,823 forms. It preserves v1.0 as the stable archival baseline while documenting a revised-audio candidate, candidate DOI, and revised-candidate result tables.

Versions · v1.1-gb candidate page · candidate results · GitHub pre-release · candidate DOI

Versioning note: v1.0 remains the current public release and keeps the archival DOI; v1.1-gb is a separately versioned candidate line.

What this is and is not

Included scope

The release is intended for dataset inspection, benchmark reproduction, and research on Cantonese lexical ambiguity, sense-aware labels, harmfulness categories, severity scoring, and canonical audio cues.

Out-of-scope uses

It is not a conversational moderation corpus, a spontaneous-speech corpus, or a production moderation system, and it does not provide evidence of generalization across demographic groups or speakers.

Documentation

Use the pages below as a compact index for the release files, public mirrors, citation record, and access policy. Each page refers to the same 2026-04-02 locked public release.

Page: What to check
Downloads: Release file list, public mirrors, versioned candidate materials, and release documentation links.
Archive / Versions: GitHub, HuggingFace, Zenodo/DOI, and OSF public records.
Citation: Version citation, BibTeX, CFF, and Zenodo DOI.
Access / License / Ethics: Use boundaries, license scope, source-text caveat, and contact/takedown channels.

Public release status

The public release website documents the current release boundary, mirrors, license terms, citation, and contact/takedown channels.