Skip to content

Fetch over entire genes... #13

@carloartieri

Description

@carloartieri

I was thinking about it a little more and the idea of 'fetching' reads by SNP isn't going to work. The reason is that SNPs are frequently near enough to one another that reads can span more than one. While iterating over SNPs, this will re-fetch the same reads and lead to double-counting unless you store a list of already-counted SNPs in memory. Regardless, it will be relatively inefficient.

Therefore it makes more sense to accept the list of SNPs and a GTF file, then parse the GTF to fetch only over whole genic regions at a time. Going exon-by-exon would work, as long as each fetch stipulated that reads overlapping the previous exon would be excluded. For example, it's straighforward to create a column in a pandas dataframe, sorted by chrom+position, that notes the end position of the previous element in the table.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions