CantHarm dataset release

CantHarm is a controlled Cantonese benchmark for harmful lexical forms, linked senses, severity labels, and canonical speech grounding. The site documents the locked public release and its access, licensing, citation, and archive plan.

Current release facts

4,823 retained lexical forms

6,365 linked sense rows

4,823 canonical audio clips

11 pseudonymous speaker IDs
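The counts above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the release tooling: the dictionary keys and the function name are hypothetical, and the two consistency rules (one canonical clip per retained form, at least one sense row per form) are assumptions inferred from the matching counts, not documented guarantees.

```python
# Release facts copied from the figures above.
# The key names are hypothetical; they are not field names from the release files.
RELEASE_FACTS = {
    "lexical_forms": 4823,
    "sense_rows": 6365,
    "audio_clips": 4823,
    "speaker_ids": 11,
}


def check_release_facts(facts):
    """Check the counts against two assumed rules and return derived averages."""
    # Assumption: each retained form has exactly one canonical audio clip.
    if facts["audio_clips"] != facts["lexical_forms"]:
        raise ValueError("audio clip count does not match lexical form count")
    # Assumption: each retained form is linked to at least one sense row.
    if facts["sense_rows"] < facts["lexical_forms"]:
        raise ValueError("fewer sense rows than lexical forms")
    return {
        # Mean number of linked sense rows per retained form.
        "senses_per_form": facts["sense_rows"] / facts["lexical_forms"],
        # Mean number of clips per speaker ID; the real split may be uneven.
        "clips_per_speaker": facts["audio_clips"] / facts["speaker_ids"],
    }


summary = check_release_facts(RELEASE_FACTS)
print(f"senses per form: {summary['senses_per_form']:.2f}")
print(f"clips per speaker (mean): {summary['clips_per_speaker']:.1f}")
```

The same check could be pointed at counts computed from the downloaded workbook and audio inventory, which would flag a partial or mismatched download before any benchmark reproduction.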

Field	Value
Release version	2026-04-02 locked public release
Official website	https://cantharm.dataset.aidimsum.com/
Mirrors	GitHub, Hugging Face, Zenodo/DOI, OSF
Primary materials	Workbook, metadata bundle, statistics, benchmark highlights, audio inventory, and speaker-packaged canonical audio archives.
Current boundary	These materials describe the 2026-04-02 locked public release.

Versioned candidate: v1.1-gb

v1.1-gb extends the 2026-04-02 locked release (v1.0) with a complete second canonical recording line for the same 4,823 forms. It keeps v1.0 as the stable archival baseline and separately documents a revised-audio candidate, a candidate DOI, and revised-candidate result tables.

Versions · v1.1-gb candidate page · candidate results · GitHub pre-release · candidate DOI

Versioning note: v1.0 remains the current public release and keeps the archival DOI; v1.1-gb is a separately versioned candidate line.

What this is and is not

Included scope

The release is intended for dataset inspection, benchmark reproduction, and research on Cantonese lexical ambiguity, sense-aware labels, harmfulness categories, severity scoring, and canonical audio cues.

Out-of-scope uses

It is not a conversational moderation corpus, a spontaneous-speech corpus, or a production moderation system, and it is not evidence of generalization across demographics or speakers.

Documentation

Use the pages below as a compact index for the release files, public mirrors, citation record, and access policy. Each page refers to the same 2026-04-02 locked public release.

Page	What to check
Downloads	Release file list, public mirrors, versioned candidate materials, and release documentation links.
Archive Versions	GitHub, Hugging Face, Zenodo/DOI, and OSF public records.
Citation	Version citation, BibTeX, CFF, and Zenodo DOI.
Access / License / Ethics	Use boundaries, license scope, source-text caveat, and contact/takedown channels.

Public release status

The public release website documents the current release boundary, mirrors, license terms, citation, and contact/takedown channels.