Benchmark object

What the release contains

  • 4,823 form rows with adjudicated coarse, fine, and severity targets
  • 6,365 linked sense rows that preserve source-level evidence
  • 4,823 canonical audio files, one per retained form
  • Manifest, statistics, result summary, and reproducibility documentation
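As a sketch of how the inventory above could be sanity-checked against the manifest, the snippet below tallies rows by type in a small in-memory CSV. The filename-free setup, the `kind` column, and the row IDs are illustrative assumptions; the release's actual manifest schema may differ.

```python
import csv
import io

# Illustrative manifest rows; the real release's manifest schema may differ.
SAMPLE_MANIFEST = io.StringIO(
    "row_id,kind\n"
    "f001,form\n"
    "f002,form\n"
    "s001,sense\n"
    "s002,sense\n"
    "s003,sense\n"
)

def count_rows_by_kind(handle):
    """Tally manifest rows by their 'kind' column (hypothetical schema)."""
    counts = {}
    for row in csv.DictReader(handle):
        counts[row["kind"]] = counts.get(row["kind"], 0) + 1
    return counts

counts = count_rows_by_kind(SAMPLE_MANIFEST)
# Against the real manifest, one would expect 4,823 form rows
# and 6,365 sense rows.
print(counts)
```

Run against the released manifest, the same tally should reproduce the 4,823 / 6,365 split listed above.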

Non-claims

What this site does not ask reviewers to assume

  • No claim of in-the-wild conversational coverage
  • No claim of speaker-transfer or speaker-generalization evaluation
  • No claim that the released severity score is a universal harm scale
  • No claim that the benchmark is already sufficient for real moderation deployment

Why the narrow scope still matters

A controlled benchmark can still answer a real research question

This benchmark asks a focused question: can current models recover harmful meaning when each item is a lexical form with linked senses and one controlled speech clip? That focused setting is still useful because it keeps ambiguity, label structure, and speech grounding visible inside one auditable release.

Interpretation guide. Speech-related tasks on this site should be read within this controlled lexical setting. They are not meant to stand in for open conversational moderation or broad speaker-variation studies.

Speech validity boundary

Controlled audio facts

  • All 4,823 audio files are stereo 16-bit PCM waveforms.
  • 3,222 clips are stored at 44.1 kHz and 1,601 at 48 kHz, together covering all 4,823 files.
  • Median duration is 2.0647 s and the 95th percentile is 3.1096 s.
  • Per-speaker clip counts range from 108 to 637 across speakers SPK01 through SPK11.
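The format facts above can be verified with Python's standard `wave` module. The sketch below synthesizes a short stereo 16-bit PCM clip at 44.1 kHz in memory and reads its header back; the in-memory buffer is a stand-in assumption, and an actual audit would open the released files from disk instead.

```python
import io
import struct
import wave

# Write a 0.5 s stereo 16-bit PCM clip at 44.1 kHz into an in-memory buffer
# (a stand-in for one of the release's audio files).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)        # stereo
    w.setsampwidth(2)        # 16-bit PCM (2 bytes per sample)
    w.setframerate(44100)    # one of the two sample rates in the release
    n_frames = 22050         # 0.5 s of silence
    w.writeframes(struct.pack("<" + "h" * (2 * n_frames), *([0] * 2 * n_frames)))

# Read the header back, as one would when auditing the released clips.
buf.seek(0)
with wave.open(buf, "rb") as r:
    channels = r.getnchannels()
    sampwidth = r.getsampwidth()
    rate = r.getframerate()
    duration = r.getnframes() / rate

# The release's stated invariants: stereo, 16-bit, 44.1 or 48 kHz.
assert channels == 2 and sampwidth == 2
assert rate in (44100, 48000)
print(f"{channels} ch, {8 * sampwidth}-bit, {rate} Hz, {duration:.2f} s")
```

Looping the read-back step over all 4,823 files would also let a reviewer recompute the median and 95th-percentile durations reported above.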

Reviewer-safe framing

Recommended reading of the benchmark claim

Use the benchmark as a testbed for lexicon-grounded harmful-semantic reasoning under controlled Cantonese speech grounding. Do not read it as a substitute for contextual utterance datasets, spontaneous speech corpora, or real-world moderation logs.