Revised-candidate final-test-once results

The v1.1-gb candidate result table compares the submitted v1.0 benchmark scores with validation-frozen, final-test-once candidate scores from the revised-audio workspace. It shows which rows the revised candidate improves and which rows keep the v1.0 score as the stronger reference.

Versioning note. These results are published as candidate evidence and reported separately from the v1.0 benchmark table.

Main result table

Task	Metric	Submitted v1.0 score	v1.1 candidate score	Display decision
form_coarse	macro-F1	0.535307	0.566938	v1.1 candidate stronger; report separately.
form_fine	macro-F1	0.459412	0.435410	v1.0 retained as stronger reference.
severity_form	Spearman	0.603658	0.616618	v1.1 candidate stronger; report separately.
fusion_coarse	macro-F1	0.541452	n/a	No v1.1 candidate run; v1.0 retained.
sense_coarse	macro-F1	0.503278	0.475998	v1.0 retained as stronger reference.
sense_fine	macro-F1	0.405704	0.517665	v1.1 candidate stronger; report separately.

Interpretation

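The v1.1 candidate improves three rows: form_coarse (+0.031631 macro-F1), severity_form (+0.012960 Spearman), and sense_fine (+0.111961 macro-F1).
Two rows score lower under the candidate, form_fine (-0.024002 macro-F1) and sense_coarse (-0.027280 macro-F1), so they retain their v1.0 values as the reference.
fusion_coarse is intentionally left unchanged: there is no v1.1 candidate run, and no separate audio-only row is promoted.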
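
The display decisions in the main table can be reproduced with a short script. This is a minimal sketch, not the project's actual tooling: the function names `display_decision` and `delta` are hypothetical, the rule "the higher score wins" is inferred from the table rather than stated by the authors, and the tie case (equal scores) is an assumption because no tie occurs in the table. Both metrics used here (macro-F1 and Spearman) are higher-is-better.

```python
# Scores copied from the main result table; None marks the missing v1.1 run.
ROWS = [
    ("form_coarse", "macro-F1", 0.535307, 0.566938),
    ("form_fine", "macro-F1", 0.459412, 0.435410),
    ("severity_form", "Spearman", 0.603658, 0.616618),
    ("fusion_coarse", "macro-F1", 0.541452, None),
    ("sense_coarse", "macro-F1", 0.503278, 0.475998),
    ("sense_fine", "macro-F1", 0.405704, 0.517665),
]


def display_decision(v10, v11):
    """Hypothetical rule inferred from the table: the higher score wins."""
    if v11 is None:
        return "No v1.1 candidate run; v1.0 retained."
    if v11 > v10:
        return "v1.1 candidate stronger; report separately."
    # Assumption: a tie keeps v1.0 (no tie occurs in the table).
    return "v1.0 retained as stronger reference."


def delta(v10, v11):
    """Candidate score minus v1.0 score, or None when no candidate exists."""
    return None if v11 is None else round(v11 - v10, 6)


for task, metric, v10, v11 in ROWS:
    d = delta(v10, v11)
    shown = "n/a" if d is None else f"{d:+.6f}"
    print(f"{task}\t{metric}\t{shown}\t{display_decision(v10, v11)}")
```

Keeping the missing run as `None` rather than as a zero score keeps fusion_coarse from being read as a regression.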

Additional task-coverage results

These rows cover additional usage scenarios for the benchmark and are best treated as appendix/task-coverage evidence.

Task	Metric	Validation	Final test	Display policy
form_binary_label_retrieval	MRR	0.930379	0.930855	appendix/task-coverage table
sense_binary_label_retrieval	MRR	0.929527	0.919798	appendix/task-coverage table
form_coarse_label_retrieval	MRR	0.743787	0.746519	appendix table
sense_coarse_label_retrieval	MRR	0.721390	0.733313	appendix table
form_binary_harm	macro-F1	0.673161	0.642728	appendix/task-coverage table
sense_binary_harm	macro-F1	0.704624	0.667597	appendix/task-coverage table
form_severity_bin4	macro-F1	0.661919	0.627395	appendix/task-coverage table
sense_severity_bin4	macro-F1	0.669793	0.651957	appendix/task-coverage table
form_pairwise_severity	Spearman	0.620881	0.572182	appendix with caveat
sense_pairwise_severity	Spearman	0.574755	0.563883	appendix with caveat
form_polysemy	macro-F1	0.599712	0.586502	appendix with caveat
form_source_coarse_diversity	macro-F1	0.588994	0.587166	appendix with caveat
form_source_fine_diversity	macro-F1	0.571627	0.510565	appendix with caveat
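
One way to read the additional table is to look at the gap between the validation and final-test scores in each row. The sketch below computes that gap from the values listed above. The helper name `gap` and the choice to sort by the gap are illustrative assumptions, and the script says nothing about which caveats the authors attach to the "appendix with caveat" rows.

```python
# (task, metric, validation, final test) copied from the additional table.
ROWS = [
    ("form_binary_label_retrieval", "MRR", 0.930379, 0.930855),
    ("sense_binary_label_retrieval", "MRR", 0.929527, 0.919798),
    ("form_coarse_label_retrieval", "MRR", 0.743787, 0.746519),
    ("sense_coarse_label_retrieval", "MRR", 0.721390, 0.733313),
    ("form_binary_harm", "macro-F1", 0.673161, 0.642728),
    ("sense_binary_harm", "macro-F1", 0.704624, 0.667597),
    ("form_severity_bin4", "macro-F1", 0.661919, 0.627395),
    ("sense_severity_bin4", "macro-F1", 0.669793, 0.651957),
    ("form_pairwise_severity", "Spearman", 0.620881, 0.572182),
    ("sense_pairwise_severity", "Spearman", 0.574755, 0.563883),
    ("form_polysemy", "macro-F1", 0.599712, 0.586502),
    ("form_source_coarse_diversity", "macro-F1", 0.588994, 0.587166),
    ("form_source_fine_diversity", "macro-F1", 0.571627, 0.510565),
]


def gap(validation, final_test):
    """Final-test score minus validation score (negative means a drop)."""
    return round(final_test - validation, 6)


GAPS = {task: gap(val, test) for task, _metric, val, test in ROWS}

# Print rows from the largest drop to the largest gain.
for task, g in sorted(GAPS.items(), key=lambda kv: kv[1]):
    print(f"{task}\t{g:+.6f}")
```

On these values, three of the four label-retrieval rows score slightly higher on the final test than on validation, and the largest drop is form_source_fine_diversity at -0.061062.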