This repository contains the code for generating the GQVis dataset available on Hugging Face.
The code generates a collection of natural language queries on genomics data, each paired with a visualization specification written in the Gosling grammar.
📂 Dataset on Hugging Face: HIDIVE/GQVis
- Template Generation creates abstract questions and specifications with placeholders for samples, entities, and locations, as well as constraints on those samples and entities.
- Data-schema/All-schema are our defined dataset schemas retrieved from 4DN, ENCODE, and Chromoscope.
- Template Expansion reifies the template questions/specifications against the provided schemas, producing every combination that satisfies the constraints.
- Paraphraser uses an LLM framework to paraphrase the questions so that they cover different levels of expertise and formality.
- Multi-step defines links, chains, and scripts to generate multi-step queries.
- Alt-Gosling exports bulk Alt-Gosling text based on the resulting .csv file.
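
As a rough illustration of the template generation and expansion steps described above, the sketch below fills a template's placeholders with every schema combination that satisfies its constraint. The schema layout, the `TEMPLATE` structure, the placeholder names, and the `fill` and `expand` helpers are all hypothetical and are not taken from `template_generation.py` or `template_expansion.py`; the `spec` dictionary is a stand-in, not a complete Gosling specification.

```python
from itertools import product

# Hypothetical schema: each sample lists its entities and their data types.
SCHEMA = {
    "samples": {
        "sample_A": {"entities": {"peaks": "genomic", "expression": "quantitative"}},
        "sample_B": {"entities": {"contacts": "matrix"}},
    },
    "locations": ["chr1", "chr2"],
}

# Hypothetical template: placeholders in braces plus one constraint on the entity type.
TEMPLATE = {
    "question": "Show {entity} from {sample} on {location}.",
    "spec": {"title": "{entity} in {sample}", "xDomain": {"chromosome": "{location}"}},
    "constraints": {"entity_type": "quantitative"},
}


def fill(node, values):
    """Recursively substitute placeholders in strings nested inside dicts and lists."""
    if isinstance(node, str):
        return node.format(**values)
    if isinstance(node, dict):
        return {key: fill(value, values) for key, value in node.items()}
    if isinstance(node, list):
        return [fill(value, values) for value in node]
    return node


def expand(template, schema):
    """Return one question/spec pair per combination that satisfies the constraint."""
    wanted = template["constraints"]["entity_type"]
    rows = []
    for (sample, info), location in product(schema["samples"].items(), schema["locations"]):
        for entity, entity_type in info["entities"].items():
            if entity_type != wanted:
                continue  # skip entities that violate the template constraint
            values = {"sample": sample, "entity": entity, "location": location}
            rows.append({
                "question": fill(template["question"], values),
                "spec": fill(template["spec"], values),
            })
    return rows


expanded = expand(TEMPLATE, SCHEMA)
for row in expanded:
    print(row["question"])
```

Only `expression` in `sample_A` has the required type here, so the template yields one question/specification pair per location. The actual templates and schemas in this repository may encode placeholders and constraints differently.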
.
├── datasets/ # Source structured data files
├── ideogram_data/ # Ideogram data for template expansion
├── location_data/ # Retrieve location for genomic intervals
├── misc/ # Helper code for our paper
├── multi-step/ # Contains code for multi-step generation and linking
├── paraphraser.py # LLM code to paraphrase questions
├── template_expansion.py # Code to reify template questions
├── template_generation.py # Code to create abstract questions and Gosling specifications
└── README.md # This file
