fix(gaia): extract answers from FinishAction messages #409
Merged
Previously, the GAIA evaluation only extracted answers from agent MessageEvents. However, the agent frequently uses FinishAction directly with the solution in the message parameter, without first producing a separate MessageEvent. This caused ~66% of GAIA instances to report no answer found, even when the agent had correctly computed and submitted the answer via FinishAction.

This fix updates _extract_answer_from_history to check both:
1. Agent MessageEvent with llm_message content (existing behavior)
2. ActionEvent with FinishAction containing a message (new)

The extraction priority ensures MessageEvent is still preferred when present, but FinishAction messages are now properly captured as a fallback.

Co-authored-by: openhands <openhands@all-hands.dev>
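The priority order described in this commit message can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the dataclasses below are hypothetical stand-ins for the real event types, and the field names `source`, `text`, and `action` are assumptions made so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass
class MessageEvent:
    """Hypothetical stand-in for an agent or user message event."""
    source: str  # "agent" or "user"
    text: str    # stands in for the llm_message content


@dataclass
class FinishAction:
    """Hypothetical stand-in for the finish action and its message parameter."""
    message: str


@dataclass
class ActionEvent:
    """Hypothetical stand-in for an event that wraps an action."""
    source: str
    action: object


Event = Union[MessageEvent, ActionEvent]


def extract_answer_from_history(history: Sequence[Event]) -> Optional[str]:
    # 1. Preferred: the most recent agent MessageEvent with non-empty text.
    for event in reversed(history):
        if isinstance(event, MessageEvent) and event.source == "agent" and event.text:
            return event.text

    # 2. Fallback: the most recent FinishAction that carries a message.
    for event in reversed(history):
        if isinstance(event, ActionEvent) and isinstance(event.action, FinishAction):
            if event.action.message:
                return event.action.message

    # Neither kind of event produced an answer.
    return None
```

With a history that contains only a FinishAction, the fallback returns its message. With both kinds of event present, the agent MessageEvent wins.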
@OpenHands the checking logic has become very complex. Could we simplify it to a single branch that catches both FinishAction and MessageAction?

I'm on it! neubig can track my progress at all-hands.dev
…geEvent and FinishAction Co-authored-by: openhands <openhands@all-hands.dev>
I've simplified the answer extraction logic as requested. The changes have been pushed to the PR branch.

Summary of Changes
Simplified the checking logic to a single branch that handles both

The refactored code is cleaner and more maintainable:

Checklist
The legacy check is no longer needed since we now properly handle FinishAction in the main extraction loop. Co-authored-by: openhands <openhands@all-hands.dev>
testing it!

We tested this and I think it looks good.
KTanmay1 pushed a commit to Ethara-Ai/benchmarks that referenced this pull request Mar 3, 2026

* fix(gaia): extract answers from FinishAction messages

  Previously, the GAIA evaluation only extracted answers from agent MessageEvents. However, the agent frequently uses FinishAction directly with the solution in the message parameter, without first producing a separate MessageEvent. This caused ~66% of GAIA instances to report no answer found, even when the agent had correctly computed and submitted the answer via FinishAction.

  This fix updates _extract_answer_from_history to check both:
  1. Agent MessageEvent with llm_message content (existing behavior)
  2. ActionEvent with FinishAction containing a message (new)

  The extraction priority ensures MessageEvent is still preferred when present, but FinishAction messages are now properly captured as a fallback.

  Co-authored-by: openhands <openhands@all-hands.dev>

* refactor(gaia): simplify answer extraction to single branch for MessageEvent and FinishAction

  Co-authored-by: openhands <openhands@all-hands.dev>

* refactor(gaia): remove legacy finish event check

  The legacy check is no longer needed since we now properly handle FinishAction in the main extraction loop.

  Co-authored-by: openhands <openhands@all-hands.dev>

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Problem
The GAIA evaluation was reporting very low resolution rates (~23.6%) for some models like jade-spark, significantly lower than expected. After investigation, the root cause was identified as an answer extraction failure, not a model capability issue.

Analysis
When analyzing the results from litellm_proxy-jade-spark-2862 (results):

The issue is that _extract_answer_from_history only looks for MessageEvent from the agent, but the agent frequently uses FinishAction directly with the solution in the message parameter, without first producing a separate MessageEvent.

Looking at actual conversation events:
- MessageEvent containing <solution> tags
- FinishAction with <solution>142</solution> in the message, but no separate MessageEvent, so the answer is never extracted

Solution
Update _extract_answer_from_history to check both:
1. MessageEvent with llm_message content (existing behavior, kept as priority)
2. ActionEvent with FinishAction containing a message (new fallback)

Changes
- ActionEvent and FinishAction
- _extract_answer_from_history to also extract answers from FinishAction.message
- MessageEvent is still preferred when present

Expected Impact
This fix should significantly improve GAIA resolution rates for models that use FinishAction directly instead of producing a separate MessageEvent. The actual model capability is likely much higher than reported, possibly closer to the ~50%+ seen with other models on GAIA.

Testing
- fake_user_response.py for detecting FinishAction
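To make the failing case from the Analysis concrete, here is a small sketch of pulling the answer out of a FinishAction message such as `<solution>142</solution>`. The PR does not show how the tags are parsed, so the helper name `parse_solution`, the regular expression, and the choice to return the last tag when several are present are all assumptions made for illustration.

```python
import re
from typing import Optional

# Non-greedy match so that several <solution> tags in one text stay separate.
# DOTALL lets an answer span multiple lines.
_SOLUTION_RE = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)


def parse_solution(text: str) -> Optional[str]:
    """Return the contents of the last <solution> tag, or None if there is none."""
    matches = _SOLUTION_RE.findall(text)
    if not matches:
        return None
    # Assumption: the final tag reflects the agent's final answer.
    return matches[-1].strip()
```

Applied to the message text recovered from either a MessageEvent or a FinishAction, the same helper works unchanged, so only the step that locates the text depends on the event type.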