-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
genomecov function #410
Comments
This is a good enough approximation: library(valr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
genome <- read_genome("https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes")
bed_random(
genome,
length = 100,
n = 1e5
) |>
bed_merge(cov = n())
#> # A tibble: 43,672 × 4
#> chrom start end cov
#> <chr> <int> <int> <int>
#> 1 chrI 21 177 2
#> 2 chrI 237 437 2
#> 3 chrI 762 862 1
#> 4 chrI 884 1063 2
#> 5 chrI 1152 1252 1
#> 6 chrI 1259 1369 2
#> 7 chrI 1465 1565 1
#> 8 chrI 1615 1739 2
#> 9 chrI 1805 1975 3
#> 10 chrI 1985 2178 2
#> # ℹ 43,662 more rows Created on 2023-09-28 with reprex v2.0.2 |
I would suggest instead using e.g. library(valr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
genome <- read_genome("https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes")
ivls <- bed_random(
genome,
length = 100,
n = 1e5,
seed = 42
)
bed_merge(ivls, cov = n())
#> # A tibble: 43,692 × 4
#> chrom start end cov
#> <chr> <int> <int> <int>
#> 1 chrI 59 159 1
#> 2 chrI 165 307 2
#> 3 chrI 456 755 5
#> 4 chrI 1347 1504 2
#> 5 chrI 1587 1750 3
#> 6 chrI 2014 2482 9
#> 7 chrI 2657 2757 1
#> 8 chrI 3021 3226 3
#> 9 chrI 3282 3413 2
#> 10 chrI 3594 3694 1
#> # ℹ 43,682 more rows
bed_partition(ivls, cov = n())
#> # A tibble: 154,751 × 4
#> chrom start end cov
#> <chr> <int> <int> <int>
#> 1 chrI 59 159 1
#> 2 chrI 165 207 1
#> 3 chrI 207 265 2
#> 4 chrI 265 307 1
#> 5 chrI 456 527 1
#> 6 chrI 527 534 2
#> 7 chrI 534 556 3
#> 8 chrI 556 617 2
#> 9 chrI 617 627 3
#> 10 chrI 627 634 2
#> # ℹ 154,741 more rows Created on 2023-09-28 with reprex v2.0.2 |
i get "vector out of memory" errors for large numbers of intervals (>10e6)with |
I'm going to reopen as I agree the genomecov functionality would be good to have and it doesn't seem that we have a scalable alternative. Regarding the memory issue that you are seeing, I'm not sure of the best way to improve the current approach. The partitioning step is fast but it does not return a datastructure with information about every interval that overlapped each partition. To get this information there is a second pass through the data using bed_map() to compare the partitioned intervals to the original intervals, and apply the supplied custom summary function. The bed_map() function consumes a lot of memory because it generates a data.frame with a row for every overlapping interval (using intersect_impl). This is necessary to make it compatible with a dplyr group_by -> summarize() call but can generate a very large data.frame when there are many overlapping intervals. With the yeast genome, 10e6 100bp random intervals leads to ~80 overlaps per input interval due to the small genome size, so effectively there is an intermediate data.frame of 800 million rows being generated. With a larger genome, e.g. human, 10e6 random 100bp intervals can be processed in ~ 1 min as there are fewer overlapping intervals. We could change the |
At least while I'm teaching right now, I have found myself wanting / needing a
genomecov()
like function that can go from BED records to bedGraph coverage.One application of this is to generate a random distribution of BED fragments across a genome (with
bed_random()
) and then generating coverage from those.The text was updated successfully, but these errors were encountered: