The v1.1-gb candidate result table compares the submitted v1.0 benchmark scores with validation-frozen, final-test-once candidate scores from the revised-audio workspace. It highlights where the revised candidate strengthens the table and where v1.0 remains the stronger retained reference.
Versioning note. These results are published as candidate evidence and reported separately from the v1.0 benchmark table.
Main result table
| Task | Metric | Submitted v1.0 score | v1.1 candidate score | Display decision |
|---|---|---|---|---|
| form_coarse | macro-F1 | 0.535307 | 0.566938 | v1.1 candidate stronger; report separately. |
| form_fine | macro-F1 | 0.459412 | 0.435410 | v1.0 retained as stronger reference. |
| severity_form | Spearman | 0.603658 | 0.616618 | v1.1 candidate stronger; report separately. |
| fusion_coarse | macro-F1 | 0.541452 | n/a | No v1.1 candidate run; v1.0 retained. |
| sense_coarse | macro-F1 | 0.503278 | 0.475998 | v1.0 retained as stronger reference. |
| sense_fine | macro-F1 | 0.405704 | 0.517665 | v1.1 candidate stronger; report separately. |
Interpretation
- The v1.1 candidate is stronger on three rows: form_coarse, severity_form, and sense_fine.
- Rows where the candidate is not stronger (form_fine and sense_coarse) retain the v1.0 values as the reference.
- fusion_coarse is intentionally left unchanged: no v1.1 candidate was run, and no separate audio-only row is promoted.
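
The display decisions above can be reproduced from the two score columns. The following is a minimal sketch, not the pipeline actually used: the names `ROWS` and `display_decision` are hypothetical, both metrics are assumed to be higher-is-better, and a tie is assumed to fall back to v1.0 (no tie occurs in the table).

```python
from typing import Optional

# Scores copied from the main result table.
# None marks a task with no v1.1 candidate run.
ROWS = {
    "form_coarse": (0.535307, 0.566938),
    "form_fine": (0.459412, 0.435410),
    "severity_form": (0.603658, 0.616618),
    "fusion_coarse": (0.541452, None),
    "sense_coarse": (0.503278, 0.475998),
    "sense_fine": (0.405704, 0.517665),
}


def display_decision(v10: float, v11: Optional[float]) -> str:
    """Return the display decision for one row (assumed rule, higher is better)."""
    if v11 is None:
        return "No v1.1 candidate run; v1.0 retained."
    if v11 > v10:
        return "v1.1 candidate stronger; report separately."
    # Assumption: a tie or a lower candidate score keeps v1.0 as the reference.
    return "v1.0 retained as stronger reference."


decisions = {task: display_decision(v10, v11) for task, (v10, v11) in ROWS.items()}

for task, decision in decisions.items():
    print(f"{task}: {decision}")
```

Keeping the rule in one function makes it easy to check that every row of the table follows the same comparison.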
Additional task-coverage results
These rows broaden benchmark usage scenarios and are best treated as appendix/task-coverage evidence.
| Task | Metric | Validation | Final test | Display policy |
|---|---|---|---|---|
| form_binary_label_retrieval | MRR | 0.930379 | 0.930855 | appendix/task-coverage table |
| sense_binary_label_retrieval | MRR | 0.929527 | 0.919798 | appendix/task-coverage table |
| form_coarse_label_retrieval | MRR | 0.743787 | 0.746519 | appendix table |
| sense_coarse_label_retrieval | MRR | 0.721390 | 0.733313 | appendix table |
| form_binary_harm | macro-F1 | 0.673161 | 0.642728 | appendix/task-coverage table |
| sense_binary_harm | macro-F1 | 0.704624 | 0.667597 | appendix/task-coverage table |
| form_severity_bin4 | macro-F1 | 0.661919 | 0.627395 | appendix/task-coverage table |
| sense_severity_bin4 | macro-F1 | 0.669793 | 0.651957 | appendix/task-coverage table |
| form_pairwise_severity | Spearman | 0.620881 | 0.572182 | appendix with caveat |
| sense_pairwise_severity | Spearman | 0.574755 | 0.563883 | appendix with caveat |
| form_polysemy | macro-F1 | 0.599712 | 0.586502 | appendix with caveat |
| form_source_coarse_diversity | macro-F1 | 0.588994 | 0.587166 | appendix with caveat |
| form_source_fine_diversity | macro-F1 | 0.571627 | 0.510565 | appendix with caveat |
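
One way to read the task-coverage table is to look at the gap between the validation and final-test columns. The sketch below computes that gap per task. It is only an illustrative reading aid: the names `COVERAGE_ROWS`, `gaps`, and `largest_drop` are hypothetical, and the gap is not claimed to be the basis of the display policy column.

```python
# (task, validation score, final-test score), copied from the table above.
COVERAGE_ROWS = [
    ("form_binary_label_retrieval", 0.930379, 0.930855),
    ("sense_binary_label_retrieval", 0.929527, 0.919798),
    ("form_coarse_label_retrieval", 0.743787, 0.746519),
    ("sense_coarse_label_retrieval", 0.721390, 0.733313),
    ("form_binary_harm", 0.673161, 0.642728),
    ("sense_binary_harm", 0.704624, 0.667597),
    ("form_severity_bin4", 0.661919, 0.627395),
    ("sense_severity_bin4", 0.669793, 0.651957),
    ("form_pairwise_severity", 0.620881, 0.572182),
    ("sense_pairwise_severity", 0.574755, 0.563883),
    ("form_polysemy", 0.599712, 0.586502),
    ("form_source_coarse_diversity", 0.588994, 0.587166),
    ("form_source_fine_diversity", 0.571627, 0.510565),
]

# Final test minus validation; a negative value means the score dropped on the final test.
gaps = {task: round(test - val, 6) for task, val, test in COVERAGE_ROWS}

# Task whose score fell the most between validation and final test.
largest_drop = min(gaps, key=gaps.get)

# Tasks whose final-test score is at or above the validation score.
held_or_improved = [task for task, gap in gaps.items() if gap >= 0]

for task, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{task}: {gap:+.6f}")
```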