-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add vcf module #47
Add vcf module #47
Conversation
TedBrookings
commented
Jul 12, 2023
- Based on pysam
- Adds helper reader and writer utilties
-
- reader blocks stderr messages for index older than vcf
- Adds PysamVariantBuilder builder class for relatively easy construction of variants
-
- builder has utilty methods for adding necessary headers
Notes on changes from Nils' original code
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TedBrookings I didn't go through test_builder.py
yet - I think if you could take a crack and addressing most of my comments first (especially merging PysamVariantBuilder into VariantBuilder and eliminating duplicate or unnecessary things) I can then make another pass.
fgpyo/vcf/__init__.py
Outdated
|
||
The module contains the following public classes: | ||
|
||
- :class:`~variantbuilder.VariantBuilder` -- A builder class that allows the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think variantbuilder
is no longer a module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I completely re-worked the docstrings for the module.
fgpyo/vcf/__init__.py
Outdated
|
||
|
||
class Defaults: | ||
Position: ClassVar[int] = 1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Class level docstring please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this Defaults
class and moved the default values into the VariantBuilder.add
function.
fgpyo/vcf/__init__.py
Outdated
@staticmethod | ||
def init( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The need for this init() method is confusing to me. Perhaps it will make sense when I see how it's used, but it seems weird to set basically static defaults like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed it. I think the original intention was to provide a defaults system that could work with multiple sub-class add
functions, but that is no longer necessary.
fgpyo/vcf/builder.py
Outdated
if self.samples is None or len(self.samples) == 0: | ||
formats = None | ||
elif samples is None or len(samples) == 0: | ||
formats = [{"GT": "./."} for _ in self.samples] | ||
else: | ||
format_keys = sorted({key for sample in self.samples for key in samples[sample]}) | ||
formats = [ | ||
None | ||
if sample not in samples | ||
else {key: samples[sample].get(key, ".") for key in format_keys} | ||
for sample in self.samples | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is super confusing due to self.samples
vs. samples
(the parameter) and then using sample
in various comprehensions. Could we perhaps alias things locally as:
vcf_samples = self.samples
genotyped_samples = samples
(or similar) and then use those terms in comprehensions too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reworked some things:
- I renamed
VariantBuilder.samples
->VariantBuilder.sample_ids
- I aliased the input
samples
variable tosample_formats
- I renamed the value passed to the
new_record()
function torecord_samples
- I was able to substantially simplify the construction of
record_samples
, because pysam actually does most of that stuff for you. The only necessary items was re-ordering by sample_id. It should be much easier to understand now.
(and I added comments explaining why everything was named the way it was)
fgpyo/vcf/builder.py
Outdated
elif samples is None or len(samples) == 0: | ||
formats = [{"GT": "./."} for _ in self.samples] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This elif leads to different behavior if a sample is not provided to this method and i) other samples are provided vs. ii) no other samples are provided.
Why? Why not just have the formats by None
here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current behavior is:
- if the VCF is sites-only, don't supply samples/formats
- if the VCF has samples and none are provided jam in a bunch of no-call GTs
- if the VCF has samples and some are provided, set the missing samples to no-call for every field in
all_format_keys
So I guess the intention is to distinguish a sites-only VCF (no format data) from a VCF with calls where for whatever reason we've provided no calls for a variant. My thinking is that it would be slightly preferable to
a) use the format header fields and set all of those to no-call, or b) just throw an error. Unless there's objections, I'm planning to work up option a).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After trying a bunch of things, some more complicated than others, I settled on just setting the format values to None (i.e. provide nothing). This does not produce an error, nor does it cause GT
to appear when it otherwise might not be added. It's easy enough to create an all no-call GT that the user should just do so if that's what they want:
sample_formats = {sample_id: {"GT": "."} for sample_id in builder.sample_ids})
and if they don't care about FORMAT values, the least surprising thing is for them to be empty. Happy to change this if there are strong opposing feelings or it becomes inconvenient down the line.
fgpyo/vcf/builder.py
Outdated
filter: Optional[Iterable[str]] = None, | ||
info: Optional[Dict[str, Any]] = None, | ||
samples: Optional[Dict[str, Dict[str, Any]]] = None, | ||
input_start_index: int = 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
fgpyo/vcf/builder.py
Outdated
input_start_index: the genomic start index for input data (e.g. 0 for 0-based, 1-for | ||
1-based). VCF files are 1-based, but pysam is 0-based. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has not been my experience. If I have a variant at "chr1:1000" in a VCF (i.e. the POS column is the string literal "1000"), then when I read that with VariantFile and access the pos
field, I get 1000
back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know pysam's SAM/BAM code is 0-based, but I think VCF is 1-based though I guess I have only tested .vcf and .vcf.gz files, and not .bcf files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a little more poking I see the confusion. VariantRecord
has both pos
which is 1-based like the VCF, but then also exposes attributes of start
and stop
(which are NOT vcf fields) which are 0-based open ended. This is super confusing and honestly a really stupid decision in pysam to have.
Since I'd like to flatten this class into VariantBuilder, I think we can hide this somewhat and just accept 1-based positions and do this translation internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Altered as per our conversation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nh13 do you want to look at this or is @TedBrookings good to merge based on my approval?
fgpyo/vcf/builder.py
Outdated
pos: the 1-based position of the variant | ||
id: the variant id | ||
ref: the reference allele | ||
alts: the list of alternate alleles, None if no alternates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the type could you add a little more. If str
is that considered to be a singular ALT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's it exactly. It's important because str
are iterable, and you can wind up with a bunch of single-character alts if single str
is not checked for. I updated the documentation to be explicit. I also gave filter
the same treatment, and updated its documentation as well.
Codecov ReportPatch coverage:
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #47 +/- ##
==========================================
- Coverage 92.57% 92.24% -0.33%
==========================================
Files 26 29 +3
Lines 2517 2850 +333
Branches 452 520 +68
==========================================
+ Hits 2330 2629 +299
- Misses 136 156 +20
- Partials 51 65 +14
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
* Based on pysam * Adds helper reader and writer utilties * * reader blocks stderr messages for index older than vcf * Adds PysamVariantBuilder builder class for relatively easy construction of variants * * builder has utilty methods for adding necessary headers Co-authored-by: Nils Homer <[email protected]>
e186cee
to
e26ffa1
Compare