
fix(gaia): extract answers from FinishAction messages#409

Merged
neubig merged 3 commits into main from fix-gaia-finish-action-extraction on Feb 13, 2026

@neubig neubig commented Feb 11, 2026

Problem

The GAIA evaluation was reporting very low resolution rates (~23.6%) for some models like jade-spark, significantly lower than expected. After investigation, the root cause was identified as an answer extraction failure, not a model capability issue.

Analysis

When analyzing the results from litellm_proxy-jade-spark-2862 (results):

  • Total instances: 165
  • Instances with NO answer extracted: 109 (66%)
  • The evaluation log shows repeated failures:
    ERROR - Could not find agent output after 30 attempts and 1181.9s total wait time
    

The issue is that _extract_answer_from_history only looks for MessageEvent from the agent, but the agent frequently uses FinishAction directly with the solution in the message parameter, without first producing a separate MessageEvent.

Looking at actual conversation events:

  • Successful instances: Have an agent MessageEvent containing <solution> tags
  • Failing instances: Agent issues FinishAction with <solution>142</solution> in the message, but no separate MessageEvent - so the answer is never extracted
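
A minimal sketch of the tag extraction both cases above depend on, assuming the answer is wrapped in `<solution>` tags as in the example. The helper name `extract_solution` and the regular expression are illustrative assumptions, not the benchmark's actual code.

```python
import re
from typing import Optional

# Hypothetical pattern: non-greedy so that separate tags are not merged,
# DOTALL so that a multi-line answer is still captured.
_SOLUTION_RE = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)


def extract_solution(text: str) -> Optional[str]:
    """Return the contents of the first <solution> tag, or None if absent."""
    match = _SOLUTION_RE.search(text)
    return match.group(1).strip() if match else None
```

A helper like this works the same way on text from a MessageEvent or from FinishAction.message. The failure described above happens one step earlier, when the FinishAction text is never passed to the extraction step at all.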

Solution

Update _extract_answer_from_history to check both:

  1. Agent MessageEvent with llm_message content (existing behavior, kept as priority)
  2. ActionEvent with FinishAction containing a message (new fallback)
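
The two checks above could be sketched roughly as follows. The dataclasses are simplified, hypothetical stand-ins for the SDK's MessageEvent, ActionEvent, and FinishAction (the real classes carry more fields, and `text` here stands in for the text held in llm_message). The function name and the newest-first scan order are assumptions as well.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


# Hypothetical stand-ins for the SDK event types, reduced to what the sketch needs.
@dataclass
class MessageEvent:
    source: str
    text: str  # stands in for the text carried by llm_message


@dataclass
class FinishAction:
    message: str


@dataclass
class ActionEvent:
    source: str
    action: object


Event = Union[MessageEvent, ActionEvent]


def extract_answer_with_fallback(history: List[Event]) -> Optional[str]:
    """Prefer the latest agent MessageEvent; fall back to FinishAction.message."""
    # Pass 1 (existing behavior): newest non-empty agent message wins.
    for event in reversed(history):
        if isinstance(event, MessageEvent) and event.source == "agent" and event.text:
            return event.text
    # Pass 2 (new fallback): newest agent FinishAction that carries a message.
    for event in reversed(history):
        if (
            isinstance(event, ActionEvent)
            and event.source == "agent"
            and isinstance(event.action, FinishAction)
            and event.action.message
        ):
            return event.action.message
    return None
```

Running the message pass to completion before the fallback pass is what keeps MessageEvent as the priority, regardless of where the FinishAction sits in the history.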

Changes

  • Added imports for ActionEvent and FinishAction
  • Updated _extract_answer_from_history to also extract answers from FinishAction.message
  • Maintained backward compatibility: MessageEvent is still preferred when present

Expected Impact

This fix should significantly improve GAIA resolution rates for models that call FinishAction directly instead of producing a separate MessageEvent. The model's actual capability is likely much higher than reported, possibly closer to the 50% or more seen with other models on GAIA.

Testing

  • Verified Python syntax compiles correctly
  • Logic follows the same pattern used in fake_user_response.py for detecting FinishAction

Previously, the GAIA evaluation only extracted answers from agent
MessageEvents. However, the agent frequently uses FinishAction directly
with the solution in the message parameter, without first producing a
separate MessageEvent.

This caused ~66% of GAIA instances to report no answer found, even when
the agent had correctly computed and submitted the answer via FinishAction.

This fix updates _extract_answer_from_history to check both:
1. Agent MessageEvent with llm_message content (existing behavior)
2. ActionEvent with FinishAction containing a message (new)

The extraction priority ensures MessageEvent is still preferred when
present, but FinishAction messages are now properly captured as a fallback.

Co-authored-by: openhands <openhands@all-hands.dev>

neubig commented Feb 11, 2026

@OpenHands the checking logic has become very complex, could we simplify it to a single branch that catches both FinishAction and MessageAction?

openhands-ai bot commented Feb 11, 2026

I'm on it! neubig can track my progress at all-hands.dev

…geEvent and FinishAction

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot commented Feb 11, 2026

I've simplified the answer extraction logic as requested. The changes have been pushed to the PR branch.

Summary of Changes

Simplified the checking logic to a single branch that handles both MessageEvent and FinishAction:

  • Before: Two separate if blocks checking for MessageEvent and FinishAction independently
  • After: A single unified loop that:
    1. First filters for agent-sourced events (event.source == "agent")
    2. Then extracts text from either MessageEvent or FinishAction in one conditional block
    3. Returns with unified logging when text is found

The refactored code is cleaner and more maintainable:

  • Single source check at the start of the loop
  • Single extraction logic handling both event types
  • Single return point with consistent logging
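
As an illustration of the single-loop shape described above, here is a hedged sketch. The dataclasses are hypothetical, simplified stand-ins for the SDK event types. The function name, the newest-first iteration, and the log message are assumptions rather than the merged code.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

logger = logging.getLogger(__name__)


# Hypothetical stand-ins for the SDK event types, reduced to what the sketch needs.
@dataclass
class MessageEvent:
    source: str
    text: str  # stands in for the text carried by llm_message


@dataclass
class FinishAction:
    message: str


@dataclass
class ActionEvent:
    source: str
    action: object


def extract_answer(history: List[object]) -> Optional[str]:
    """Return text from the newest agent MessageEvent or FinishAction, if any."""
    for event in reversed(history):
        # Single source check at the start of the loop.
        if getattr(event, "source", None) != "agent":
            continue
        # Single extraction step covering both event types.
        text = None
        if isinstance(event, MessageEvent):
            text = event.text
        elif isinstance(event, ActionEvent) and isinstance(event.action, FinishAction):
            text = event.action.message
        if text:
            logger.info("Extracted answer from %s", type(event).__name__)
            return text
    return None
```

In this sketch, precedence follows iteration order rather than event type: whichever qualifying agent event is reached first is returned.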

Checklist

  • Simplified to a single branch catching both FinishAction and MessageEvent
  • Maintained the same functionality (MessageEvent and FinishAction both handled)
  • Updated docstring to reflect simplified logic
  • Verified syntax compiles correctly
  • Changes pushed to PR branch fix-gaia-finish-action-extraction

The legacy check is no longer needed since we now properly handle
FinishAction in the main extraction loop.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review February 11, 2026 09:51
@juanmichelini (Collaborator)

testing it!

@juanmichelini juanmichelini self-requested a review February 11, 2026 11:37
@neubig neubig enabled auto-merge (squash) February 11, 2026 23:33

neubig commented Feb 11, 2026

We tested this and I think it looks good.

@juanmichelini juanmichelini left a comment

LGTM

@neubig neubig merged commit 013643c into main Feb 13, 2026
19 checks passed
KTanmay1 pushed a commit to Ethara-Ai/benchmarks that referenced this pull request Mar 3, 2026
* fix(gaia): extract answers from FinishAction messages

Previously, the GAIA evaluation only extracted answers from agent
MessageEvents. However, the agent frequently uses FinishAction directly
with the solution in the message parameter, without first producing a
separate MessageEvent.

This caused ~66% of GAIA instances to report no answer found, even when
the agent had correctly computed and submitted the answer via FinishAction.

This fix updates _extract_answer_from_history to check both:
1. Agent MessageEvent with llm_message content (existing behavior)
2. ActionEvent with FinishAction containing a message (new)

The extraction priority ensures MessageEvent is still preferred when
present, but FinishAction messages are now properly captured as a fallback.

Co-authored-by: openhands <openhands@all-hands.dev>

* refactor(gaia): simplify answer extraction to single branch for MessageEvent and FinishAction

Co-authored-by: openhands <openhands@all-hands.dev>

* refactor(gaia): remove legacy finish event check

The legacy check is no longer needed since we now properly handle
FinishAction in the main extraction loop.

Co-authored-by: openhands <openhands@all-hands.dev>

---------

Co-authored-by: openhands <openhands@all-hands.dev>