
Conversation


@ChenZiHong-Gavin ChenZiHong-Gavin commented Jan 13, 2026

This PR made the following changes to vllm_wrapper:

  1. Introduced Repetition Penalty: a repetition_penalty option is now passed to the sampling parameters to curb repetitive output (see the sketch after this list).
  2. Enhanced Prompt Formatting: prompts are now built with self.tokenizer.apply_chat_template.
  3. Increased Default Timeout: The default timeout value in the vllm_wrapper constructor has been increased from 300 to 600 seconds, allowing more time for generation requests to complete.
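
A minimal sketch of how the repetition penalty is expected to reach vLLM's sampler (only the 1.05 default comes from this PR; the helper name and the call sites below are illustrative, not the actual diff):

    from vllm import SamplingParams

    def build_sampling_params(extra: dict) -> SamplingParams:
        # Values above 1.0 penalise tokens that already appear in the output,
        # which is what keeps the model from looping on the same phrase.
        return SamplingParams(repetition_penalty=extra.get("repetition_penalty", 1.05))

    build_sampling_params({})                           # repetition_penalty = 1.05 (default)
    build_sampling_params({"repetition_penalty": 1.2})  # caller-supplied override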

@github-actions github-actions bot added the core label Jan 13, 2026
@gemini-code-assist gemini-code-assist bot commented

Summary of Changes

Hello @ChenZiHong-Gavin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily enhances the vllm_wrapper by integrating a repetition_penalty parameter into the sampling process, which helps in generating more diverse and less repetitive output from the language model. Additionally, it includes a substantial refactor of the prompt building mechanism to use the model's tokenizer for more accurate and standardized chat template application. These changes aim to improve the quality and control over the model's text generation capabilities.

Highlights

  • Repetition Penalty: The repetition_penalty parameter has been introduced to the SamplingParams in the generate_answer method, allowing for better control over generated text to prevent repetitive phrases. It defaults to 1.05 if not provided.
  • Prompt Formatting Refactor: The _build_inputs method has been significantly refactored to utilize the tokenizer's apply_chat_template for constructing prompts. This change ensures more consistent and robust prompt formatting, especially for chat-based models, by leveraging the model's native tokenizer capabilities.
  • Parameter Validation Adjustment: The validation logic for temperature and top_p in SamplingParams has been updated to allow a value of 0 (i.e., >= 0 instead of > 0), which is common for deterministic generation settings.
  • Minor Code Refactoring: The assignment of result_generator has been moved inside the try blocks in both generate_answer and generate_topk_per_token methods for improved error handling and scope management (this and the relaxed validation above are sketched after this list).
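
Taken together, the last two highlights give the method roughly the following shape (an illustrative sketch of a wrapper method, not the actual diff; the request-id handling, timeout wrapping, exception translation, and the asyncio/uuid/SamplingParams imports are assumptions):

    async def generate_answer(self, text, history=None, **extra):
        kwargs = {"repetition_penalty": extra.get("repetition_penalty", 1.05)}
        temperature = extra.get("temperature")
        if temperature is not None and temperature >= 0:  # was "> 0"; 0 now allowed for greedy decoding
            kwargs["temperature"] = temperature
        top_p = extra.get("top_p")
        if top_p is not None and top_p >= 0:              # likewise relaxed from "> 0" to ">= 0"
            kwargs["top_p"] = top_p
        sampling_params = SamplingParams(**kwargs)

        try:
            # Creating the generator inside the try means its failures share the same handler.
            result_generator = self.engine.generate(
                self._build_inputs(text, history), sampling_params, str(uuid.uuid4())
            )
            return await asyncio.wait_for(
                self._consume_generator(result_generator), timeout=self.timeout
            )
        except asyncio.TimeoutError as exc:
            raise RuntimeError(f"vLLM generation timed out after {self.timeout}s") from exc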


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces repetition_penalty to the sampling parameters and refactors the _build_inputs method to use tokenizer chat templates. While adding the new parameter is a good enhancement, the refactoring of _build_inputs introduces a critical issue: it changes the expected type for the history parameter, which breaks the contract with the BaseLLMWrapper base class and will cause runtime errors. Additionally, the new implementation modifies the history list in-place, which is an undesirable side effect. I've also noted a minor issue with an unused import. My review includes suggestions to address these problems.

I am having trouble creating individual review comments, so my feedback is included inline below.

graphgen/models/llm/local/vllm_wrapper.py (48-56)

critical

This refactoring has several issues:

  1. Critical Bug (Contract Violation): The history parameter's type is now Optional[List[dict]], which is incompatible with the Optional[List[str]] defined in the BaseLLMWrapper base class methods. This violates the Liskov Substitution Principle and will cause runtime errors. The method signatures must be consistent across the class hierarchy.
  2. Side Effect: The input history list is modified in-place. A copy should be used to avoid unexpected behavior for the caller.
  3. Vague Type Hint: The return type should be str instead of Any for better type safety.
    def _build_inputs(self, prompt: str, history: Optional[List[dict]] = None) -> str:
        # Copy so the caller's history list is never mutated by the append below.
        messages = (history or []).copy()
        messages.append({"role": "user", "content": prompt})

        return self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,              # return the rendered prompt string, not token IDs
            add_generation_prompt=True,  # append the assistant-turn marker so the model continues from it
        )
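
For context, a call site under this suggestion would pass chat-style message dicts rather than plain strings (hypothetical wrapper instance and message values):

    history = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "assistant", "content": "How can I help?"},
    ]
    prompt_text = wrapper._build_inputs("Summarise the document.", history)
    # history is left untouched because _build_inputs copies it before appending the user turn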

graphgen/models/llm/local/vllm_wrapper.py (2)

medium

The stop import from tracemalloc is not used in this file and should be removed to maintain code cleanliness.

@ChenZiHong-Gavin ChenZiHong-Gavin (Collaborator, PR author) commented

/gemini summary

@gemini-code-assist gemini-code-assist bot commented

Summary of Changes

This pull request significantly improves the vllm_wrapper's text generation capabilities by integrating a repetition_penalty to enhance output diversity and by overhauling the prompt formatting mechanism to leverage the tokenizer's native chat template application for greater consistency. These changes, alongside an extended default timeout and refined error handling, aim to make the language model interaction more robust and controllable.

Highlights

  • Introduced Repetition Penalty: A repetition_penalty parameter has been added to SamplingParams in the generate_answer method, defaulting to 1.05, to help prevent the generation of repetitive text.
  • Enhanced Prompt Formatting: The _build_inputs method has been significantly refactored to utilize the model's tokenizer (self.tokenizer.apply_chat_template) for constructing prompts. This ensures more consistent and robust prompt formatting, especially for chat-based models, by leveraging the tokenizer's native capabilities.
  • Increased Default Timeout: The default timeout value in the vllm_wrapper constructor has been increased from 300 to 600 seconds, allowing more time for generation requests to complete.
  • Improved Parameter Validation: The validation logic for temperature and top_p in SamplingParams has been adjusted to allow values greater than or equal to zero (>= 0), accommodating scenarios where zero values are valid (e.g., for deterministic generation).
  • Robust Error Handling: asyncio.TimeoutError has been added to the exception handling blocks in both generate_answer and generate_topk_per_token methods, improving the resilience of the generation process against timeouts.
  • Optimized Generator Consumption: The _consume_generator method now explicitly breaks its loop once request_output.finished is True, ensuring that only the final output is processed and improving efficiency (sketched below).
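
The generator-consumption change corresponds to roughly this loop (a method sketch; everything except the break on request_output.finished is an assumption about the wrapper):

    async def _consume_generator(self, result_generator):
        final_output = None
        async for request_output in result_generator:
            final_output = request_output
            if request_output.finished:
                break  # vLLM has marked the request finished; only this final output is needed
        return final_output
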
Changelog
  • graphgen/models/llm/local/vllm_wrapper.py
    • Increased the default timeout parameter in the __init__ method from 300 to 600.
    • Initialized self.tokenizer using self.engine.engine.tokenizer.tokenizer in the __init__ method (this and the new timeout default are sketched in a constructor example after this changelog).
    • Refactored the _build_inputs method: changed it from a static method to an instance method, updated its signature to accept List[dict], and implemented self.tokenizer.apply_chat_template for prompt construction.
    • Modified _consume_generator to explicitly break the loop once request_output.finished is True.
    • Added repetition_penalty=extra.get("repetition_penalty", 1.05) to SamplingParams in generate_answer.
    • Updated temperature and top_p conditions in SamplingParams from > 0 to >= 0.
    • Moved the result_generator = self.engine.generate(...) call inside the try block in both generate_answer and generate_topk_per_token.
    • Added asyncio.TimeoutError to the exception handling tuple in generate_answer and generate_topk_per_token.
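
The first two changelog entries amount to a constructor along these lines (a sketch: the engine construction and argument names are assumptions; only the 600-second default and the tokenizer attribute path come from the changelog):

    from vllm import AsyncEngineArgs, AsyncLLMEngine

    def __init__(self, model: str, timeout: int = 600, **engine_kwargs):  # default raised from 300
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(model=model, **engine_kwargs)
        )
        # Expose the underlying Hugging Face tokenizer so _build_inputs can call
        # apply_chat_template on it directly.
        self.tokenizer = self.engine.engine.tokenizer.tokenizer
        self.timeout = timeout
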
Activity
  • gemini-code-assist[bot] provided an initial summary of the pull request, highlighting the repetition penalty, prompt formatting refactor, parameter validation adjustment, and minor code refactoring.
  • github-code-quality[bot] identified an unused import of 'stop' in graphgen/models/llm/local/vllm_wrapper.py and suggested its removal.
  • ChenZiHong-Gavin requested a new summary from Gemini Code Assist.

@ChenZiHong-Gavin ChenZiHong-Gavin merged commit d1b3b6e into main Jan 14, 2026
6 checks passed
@ChenZiHong-Gavin ChenZiHong-Gavin deleted the fix/fix-vllm-repeating branch January 14, 2026 04:31