Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add vcf module #47

Merged
merged 1 commit into from
Sep 12, 2023
Merged

Add vcf module #47

merged 1 commit into from
Sep 12, 2023

Conversation

TedBrookings
Copy link
Contributor

  • Based on pysam
  • Adds helper reader and writer utilties
    • reader blocks stderr messages for index older than vcf
  • Adds PysamVariantBuilder builder class for relatively easy construction of variants
    • builder has utilty methods for adding necessary headers

@TedBrookings
Copy link
Contributor Author

Notes on changes from Nils' original code

  • I altered the input values for the builder to be the same names as the fields in the produced VariantRecord. This way it will not be necessary to translate (e.g. builder.add(chr="chr3"), but test that record.contig == "chr3")
  • I made the conversion from 1-based coordinates to 0-based coordinates in PysamVariantBuilder.add explicit and overridable, because if you are adding variants somehow generated from a 0-based program, you then don't have to add 1 to subtract 1 later. I also set the external interface to use pos instead of start. pos remains 1-based, so there should be less head-scratching as numbers stay the same and just work. i.e. if you ignore all this and pass a value for pos, that will be the value in the final VCF and it will be the value in the record.pos.
  • For read/write options, I changed TextIOWrapper to TextIO because a) it made mypy slightly happier, and b) it seemed more general and did work in my tests.
  • I added the stderr redirect to avoid the warning about out-of-date indices. This can get brutal when working in parallel.
  • I altered the default contig to be the first contig in the builder's sequence dictionary, rather than chr1.
  • I altered the default alts to be ('.',) instead of None, because pysam threw errors.
  • I made multiple (trivial) changes to types for several general reasons:
    • mypy wanted it that way
    • I wanted something immutable in a default or a test value
    • I wanted something general as an input (prefer Iterable over List or Tuple)
    • I got rid of OrderedDict and replaced with dict-comprehensions because dicts are insertion-ordered as of python 3.6+

Copy link
Member

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TedBrookings I didn't go through test_builder.py yet - I think if you could take a crack and addressing most of my comments first (especially merging PysamVariantBuilder into VariantBuilder and eliminating duplicate or unnecessary things) I can then make another pass.


The module contains the following public classes:

- :class:`~variantbuilder.VariantBuilder` -- A builder class that allows the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think variantbuilder is no longer a module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely re-worked the docstrings for the module.

fgpyo/vcf/__init__.py Outdated Show resolved Hide resolved
fgpyo/vcf/__init__.py Outdated Show resolved Hide resolved


class Defaults:
Position: ClassVar[int] = 1000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Class level docstring please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this Defaults class and moved the default values into the VariantBuilder.add function.

Comment on lines 92 to 93
@staticmethod
def init(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The need for this init() method is confusing to me. Perhaps it will make sense when I see how it's used, but it seems weird to set basically static defaults like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it. I think the original intention was to provide a defaults system that could work with multiple sub-class add functions, but that is no longer necessary.

Comment on lines 108 to 248
if self.samples is None or len(self.samples) == 0:
formats = None
elif samples is None or len(samples) == 0:
formats = [{"GT": "./."} for _ in self.samples]
else:
format_keys = sorted({key for sample in self.samples for key in samples[sample]})
formats = [
None
if sample not in samples
else {key: samples[sample].get(key, ".") for key in format_keys}
for sample in self.samples
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super confusing due to self.samples vs. samples (the parameter) and then using sample in various comprehensions. Could we perhaps alias things locally as:

vcf_samples = self.samples
genotyped_samples = samples

(or similar) and then use those terms in comprehensions too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworked some things:

  1. I renamed VariantBuilder.samples -> VariantBuilder.sample_ids
  2. I aliased the input samples variable to sample_formats
  3. I renamed the value passed to the new_record() function to record_samples
  4. I was able to substantially simplify the construction of record_samples, because pysam actually does most of that stuff for you. The only necessary items was re-ordering by sample_id. It should be much easier to understand now.
    (and I added comments explaining why everything was named the way it was)

fgpyo/vcf/builder.py Outdated Show resolved Hide resolved
Comment on lines 110 to 111
elif samples is None or len(samples) == 0:
formats = [{"GT": "./."} for _ in self.samples]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This elif leads to different behavior if a sample is not provided to this method and i) other samples are provided vs. ii) no other samples are provided.

Why? Why not just have the formats by None here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current behavior is:

  • if the VCF is sites-only, don't supply samples/formats
  • if the VCF has samples and none are provided jam in a bunch of no-call GTs
  • if the VCF has samples and some are provided, set the missing samples to no-call for every field in all_format_keys

So I guess the intention is to distinguish a sites-only VCF (no format data) from a VCF with calls where for whatever reason we've provided no calls for a variant. My thinking is that it would be slightly preferable to
a) use the format header fields and set all of those to no-call, or b) just throw an error. Unless there's objections, I'm planning to work up option a).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After trying a bunch of things, some more complicated than others, I settled on just setting the format values to None (i.e. provide nothing). This does not produce an error, nor does it cause GT to appear when it otherwise might not be added. It's easy enough to create an all no-call GT that the user should just do so if that's what they want:

sample_formats = {sample_id: {"GT": "."} for sample_id in builder.sample_ids})

and if they don't care about FORMAT values, the least surprising thing is for them to be empty. Happy to change this if there are strong opposing feelings or it becomes inconvenient down the line.

filter: Optional[Iterable[str]] = None,
info: Optional[Dict[str, Any]] = None,
samples: Optional[Dict[str, Dict[str, Any]]] = None,
input_start_index: int = 1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 91 to 92
input_start_index: the genomic start index for input data (e.g. 0 for 0-based, 1-for
1-based). VCF files are 1-based, but pysam is 0-based.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has not been my experience. If I have a variant at "chr1:1000" in a VCF (i.e. the POS column is the string literal "1000"), then when I read that with VariantFile and access the pos field, I get 1000 back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know pysam's SAM/BAM code is 0-based, but I think VCF is 1-based though I guess I have only tested .vcf and .vcf.gz files, and not .bcf files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a little more poking I see the confusion. VariantRecord has both pos which is 1-based like the VCF, but then also exposes attributes of start and stop (which are NOT vcf fields) which are 0-based open ended. This is super confusing and honestly a really stupid decision in pysam to have.

Since I'd like to flatten this class into VariantBuilder, I think we can hide this somewhat and just accept 1-based positions and do this translation internally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Altered as per our conversation.

@TedBrookings TedBrookings requested a review from tfenne July 31, 2023 16:50
Copy link
Member

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nh13 do you want to look at this or is @TedBrookings good to merge based on my approval?

pos: the 1-based position of the variant
id: the variant id
ref: the reference allele
alts: the list of alternate alleles, None if no alternates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the type could you add a little more. If str is that considered to be a singular ALT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's it exactly. It's important because str are iterable, and you can wind up with a bunch of single-character alts if single str is not checked for. I updated the documentation to be explicit. I also gave filter the same treatment, and updated its documentation as well.

@codecov-commenter
Copy link

codecov-commenter commented Sep 12, 2023

Codecov Report

Patch coverage: 89.78% and project coverage change: -0.33% ⚠️

Comparison is base (0c17fe0) 92.57% compared to head (e26ffa1) 92.24%.
Report is 5 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
- Coverage   92.57%   92.24%   -0.33%     
==========================================
  Files          26       29       +3     
  Lines        2517     2850     +333     
  Branches      452      520      +68     
==========================================
+ Hits         2330     2629     +299     
- Misses        136      156      +20     
- Partials       51       65      +14     
Files Changed Coverage Δ
fgpyo/vcf/builder.py 77.55% <77.55%> (ø)
fgpyo/vcf/__init__.py 97.05% <97.05%> (ø)
fgpyo/vcf/tests/test_builder.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

* Based on pysam
* Adds helper reader and writer utilties
* * reader blocks stderr messages for index older than vcf
* Adds PysamVariantBuilder builder class for relatively easy
  construction of variants
* * builder has utilty methods for adding necessary headers

Co-authored-by: Nils Homer <[email protected]>
@TedBrookings TedBrookings merged commit 07182fa into main Sep 12, 2023
5 checks passed
@nh13 nh13 deleted the feature/add-vcf branch September 12, 2023 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants