fix(gaia): extract answers from FinishAction messages by neubig · Pull Request #409 · OpenHands/benchmarks

neubig · 2026-02-11T07:39:58Z

Problem

The GAIA evaluation was reporting very low resolution rates (~23.6%) for some models like jade-spark, significantly lower than expected. After investigation, the root cause was identified as an answer extraction failure, not a model capability issue.

Analysis

Analyzing the results from litellm_proxy-jade-spark-2862 showed:

Total instances: 165
Instances with NO answer extracted: 109 (66%)

The evaluation log shows repeated failures:

ERROR - Could not find agent output after 30 attempts and 1181.9s total wait time

The issue is that _extract_answer_from_history only looks for MessageEvent from the agent, but the agent frequently uses FinishAction directly with the solution in the message parameter, without first producing a separate MessageEvent.

Looking at actual conversation events:

Successful instances: Have an agent MessageEvent containing <solution> tags.
Failing instances: The agent issues FinishAction with <solution>142</solution> in the message but no separate MessageEvent, so the answer is never extracted.
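
To illustrate what "extracting" the answer means here, the following is a minimal sketch of pulling the contents of a <solution> tag out of message text. The helper name parse_solution and the choice to take the last tag pair are assumptions for illustration; the benchmark's actual parsing code is not shown in this PR.

```python
import re
from typing import Optional

# Hypothetical helper: not taken from the benchmark code.
_SOLUTION_RE = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)


def parse_solution(text: str) -> Optional[str]:
    """Return the contents of the last <solution>...</solution> pair, or None."""
    matches = _SOLUTION_RE.findall(text)
    # Assumption: if several tag pairs appear, the last one is the final answer.
    return matches[-1].strip() if matches else None


print(parse_solution("Done. <solution>142</solution>"))
```

The same text can arrive through either event type, which is why the failure was in locating the text rather than in parsing it.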

Solution

Update _extract_answer_from_history to check both:

Agent MessageEvent with llm_message content (existing behavior, kept as priority)
ActionEvent with FinishAction containing a message (new fallback)
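
A rough sketch of this priority order, assuming simplified stand-in classes rather than the real SDK types: the dataclasses below, the text field (standing in for the llm_message content), the function name extract_answer_text, and the newest-first iteration are all assumptions made for illustration, not the PR's actual code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


# Stand-in types for illustration only; the real SDK classes have more fields.
@dataclass
class MessageEvent:
    source: str
    text: str  # stands in for the llm_message content


@dataclass
class FinishAction:
    message: str


@dataclass
class ActionEvent:
    source: str
    action: object


Event = Union[MessageEvent, ActionEvent]


def extract_answer_text(history: Sequence[Event]) -> Optional[str]:
    """Prefer an agent MessageEvent; fall back to a FinishAction message."""
    # Pass 1 (existing behavior): most recent agent message with text.
    for event in reversed(history):
        if isinstance(event, MessageEvent) and event.source == "agent" and event.text:
            return event.text
    # Pass 2 (new fallback): most recent agent FinishAction carrying a message.
    for event in reversed(history):
        if (
            isinstance(event, ActionEvent)
            and event.source == "agent"
            and isinstance(event.action, FinishAction)
            and event.action.message
        ):
            return event.action.message
    return None
```

With this shape, a history that contains only a FinishAction now yields its message instead of nothing, while a history that contains an agent MessageEvent behaves as before.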

Changes

Added imports for ActionEvent and FinishAction
Updated _extract_answer_from_history to also extract answers from FinishAction.message
Maintained backward compatibility: MessageEvent is still preferred when present

Expected Impact

This fix should significantly improve GAIA resolution rates for models that use FinishAction directly instead of producing a separate MessageEvent. The actual model capability is likely much higher than reported, possibly closer to the 50%+ seen with other models on GAIA.

Testing

Verified Python syntax compiles correctly
Logic follows the same pattern used in fake_user_response.py for detecting FinishAction

Previously, the GAIA evaluation only extracted answers from agent MessageEvents. However, the agent frequently uses FinishAction directly with the solution in the message parameter, without first producing a separate MessageEvent. This caused ~66% of GAIA instances to report no answer found, even when the agent had correctly computed and submitted the answer via FinishAction.

This fix updates _extract_answer_from_history to check both:
1. Agent MessageEvent with llm_message content (existing behavior)
2. ActionEvent with FinishAction containing a message (new)

The extraction priority ensures MessageEvent is still preferred when present, but FinishAction messages are now properly captured as a fallback.

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-02-11T08:05:54Z

@OpenHands the checking logic has become very complex. Could we simplify it to a single branch that catches both FinishAction and MessageEvent?

openhands-ai · 2026-02-11T08:06:02Z

I'm on it! neubig can track my progress at all-hands.dev

refactor(gaia): simplify answer extraction to single branch for MessageEvent and FinishAction

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-02-11T08:09:30Z

I've simplified the answer extraction logic as requested. The changes have been pushed to the PR branch.

Summary of Changes

Simplified the checking logic to a single branch that handles both MessageEvent and FinishAction:

Before: Two separate if blocks checking for MessageEvent and FinishAction independently
After: A single unified loop that:
1. First filters for agent-sourced events (event.source == "agent")
2. Then extracts text from either MessageEvent or FinishAction in one conditional block
3. Returns with unified logging when text is found

The refactored code is cleaner and more maintainable:

Single source check at the start of the loop
Single extraction logic handling both event types
Single return point with consistent logging
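
The single-loop shape described above might look roughly like the sketch below. The stand-in dataclasses, the text field, the function name extract_text_single_loop, the logger name, and the newest-first iteration are assumptions for illustration; only the event.source == "agent" check is taken from the summary.

```python
import logging
from dataclasses import dataclass
from typing import Optional, Sequence, Union

logger = logging.getLogger("gaia_sketch")  # hypothetical logger name


# Stand-in types for illustration only; not the real SDK classes.
@dataclass
class MessageEvent:
    source: str
    text: str  # stands in for the llm_message content


@dataclass
class FinishAction:
    message: str


@dataclass
class ActionEvent:
    source: str
    action: object


Event = Union[MessageEvent, ActionEvent]


def extract_text_single_loop(history: Sequence[Event]) -> Optional[str]:
    """One loop: filter by source, extract from either event type, return once."""
    for event in reversed(history):  # assumption: newest events are checked first
        # 1. Single source check at the start of the loop.
        if event.source != "agent":
            continue
        # 2. Single extraction step covering both event types.
        text = None
        if isinstance(event, MessageEvent):
            text = event.text
        elif isinstance(event, ActionEvent) and isinstance(event.action, FinishAction):
            text = event.action.message
        # 3. Single return point with consistent logging.
        if text:
            logger.info("Found agent output in %s", type(event).__name__)
            return text
    return None
```

A single loop like this returns whichever qualifying event it reaches first rather than always preferring MessageEvent. Whether the merged code iterates newest-first is an assumption here, not something the thread states.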

Checklist

Simplified to a single branch catching both FinishAction and MessageEvent
Maintained the same functionality (MessageEvent and FinishAction both handled)
Updated docstring to reflect simplified logic
Verified syntax compiles correctly
Changes pushed to PR branch fix-gaia-finish-action-extraction

The legacy check is no longer needed since we now properly handle FinishAction in the main extraction loop.

Co-authored-by: openhands <openhands@all-hands.dev>

juanmichelini · 2026-02-11T11:37:10Z

testing it!

neubig · 2026-02-11T23:33:23Z

We tested this and I think it looks good.

juanmichelini

LGTM

refactor(gaia): simplify answer extraction to single branch for MessageEvent and FinishAction

fc0e84f

refactor(gaia): remove legacy finish event check

fc293ac

neubig marked this pull request as ready for review February 11, 2026 09:51

juanmichelini self-requested a review February 11, 2026 11:37

neubig enabled auto-merge (squash) February 11, 2026 23:33

neubig added the needs-review label Feb 12, 2026 — with OpenHands Staging

neubig added the review-this label Feb 12, 2026 — with OpenHands AI

juanmichelini approved these changes Feb 13, 2026

neubig merged commit 013643c into main Feb 13, 2026
19 checks passed

neubig merged 3 commits into main from fix-gaia-finish-action-extraction
