genomecov function #410

Closed
jayhesselberth opened this issue Sep 28, 2023 · 4 comments

Comments

@jayhesselberth
Member

At least while I'm teaching right now, I have found myself wanting / needing a genomecov()-like function that can go from BED records to bedGraph coverage.

One application of this is to generate a random distribution of BED fragments across a genome (with bed_random()) and then generate coverage from those.

@jayhesselberth
Member Author

This is a good enough approximation:

library(valr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

genome <- read_genome("https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes")

bed_random(
  genome,
  length = 100,
  n = 1e5
) |>
  bed_merge(cov = n())
#> # A tibble: 43,672 × 4
#>    chrom start   end   cov
#>    <chr> <int> <int> <int>
#>  1 chrI     21   177     2
#>  2 chrI    237   437     2
#>  3 chrI    762   862     1
#>  4 chrI    884  1063     2
#>  5 chrI   1152  1252     1
#>  6 chrI   1259  1369     2
#>  7 chrI   1465  1565     1
#>  8 chrI   1615  1739     2
#>  9 chrI   1805  1975     3
#> 10 chrI   1985  2178     2
#> # ℹ 43,662 more rows

Created on 2023-09-28 with reprex v2.0.2

@kriemo
Member

kriemo commented Sep 28, 2023

I would suggest using bed_partition() instead, as it preserves the locations where the coverage changes rather than merging overlapping regions into a single region.

e.g.

library(valr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
genome <- read_genome("https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes")

ivls <- bed_random(
  genome,
  length = 100,
  n = 1e5, 
  seed = 42
) 

bed_merge(ivls, cov = n())
#> # A tibble: 43,692 × 4
#>    chrom start   end   cov
#>    <chr> <int> <int> <int>
#>  1 chrI     59   159     1
#>  2 chrI    165   307     2
#>  3 chrI    456   755     5
#>  4 chrI   1347  1504     2
#>  5 chrI   1587  1750     3
#>  6 chrI   2014  2482     9
#>  7 chrI   2657  2757     1
#>  8 chrI   3021  3226     3
#>  9 chrI   3282  3413     2
#> 10 chrI   3594  3694     1
#> # ℹ 43,682 more rows
bed_partition(ivls, cov = n())
#> # A tibble: 154,751 × 4
#>    chrom start   end   cov
#>    <chr> <int> <int> <int>
#>  1 chrI     59   159     1
#>  2 chrI    165   207     1
#>  3 chrI    207   265     2
#>  4 chrI    265   307     1
#>  5 chrI    456   527     1
#>  6 chrI    527   534     2
#>  7 chrI    534   556     3
#>  8 chrI    556   617     2
#>  9 chrI    617   627     3
#> 10 chrI    627   634     2
#> # ℹ 154,741 more rows

Created on 2023-09-28 with reprex v2.0.2

@jayhesselberth
Member Author

I get "vector out of memory" errors for large numbers of intervals (>10e6) with bed_partition(). I may eventually implement a dedicated Rcpp function, which would definitely be faster and more memory-efficient than the above.

@kriemo
Member

kriemo commented Sep 28, 2023

I'm going to reopen as I agree the genomecov functionality would be good to have and it doesn't seem that we have a scalable alternative.

Regarding the memory issue that you are seeing, I'm not sure of the best way to improve the current approach. The partitioning step is fast, but it does not return a data structure with information about every interval that overlapped each partition. To get this information there is a second pass through the data using bed_map(), which compares the partitioned intervals to the original intervals and applies the supplied custom summary function.

The bed_map() function consumes a lot of memory because it generates a data.frame with a row for every overlapping interval (using intersect_impl). This is necessary to make it compatible with a dplyr group_by -> summarize() call but can generate a very large data.frame when there are many overlapping intervals.

With the yeast genome, 10e6 random 100bp intervals lead to ~80 overlaps per input interval due to the small genome size, so effectively an intermediate data.frame of 800 million rows is generated. With a larger genome, e.g. human, 10e6 random 100bp intervals can be processed in ~1 min because there are fewer overlapping intervals.

We could change the partition_impl Rcpp function to also return the row index of every interval that contributed to each partitioned interval. However, that data structure would need to be a list of integer vectors, one vector per output partitioned interval. I am not quite sure how to integrate that kind of data structure with dplyr's group_by() and summarize() without expanding the data.frame so that there is one row per entry in each group.
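To illustrate, here is a minimal sketch of what that list-of-indices result could look like on the R side, assuming partition_impl returned a list-column of input row indices (the `idx` column and the values in it are hypothetical, not the current valr output). A summary over the list-column via purrr avoids materializing the one-row-per-overlap intermediate data.frame:

```r
library(dplyr)
library(purrr)

# Hypothetical output shape: one row per partitioned interval, plus a
# list-column of the input interval row indices that overlap it.
parts <- tibble(
  chrom = c("chrI", "chrI", "chrI"),
  start = c(165L, 207L, 265L),
  end   = c(207L, 265L, 307L),
  idx   = list(1L, c(1L, 2L), 2L)
)

# Apply the summary per partition directly on the list-column,
# e.g. coverage as the number of contributing intervals:
parts |>
  mutate(cov = map_int(idx, length)) |>
  select(-idx)
```

This sidesteps group_by() -> summarize() entirely, but it only works for summary functions that can be expressed over the index vectors; it is a sketch of the idea, not a drop-in replacement for bed_map().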

@kriemo kriemo reopened this Sep 28, 2023
kriemo added a commit that referenced this issue Oct 11, 2023
@kriemo kriemo closed this as completed in a4093ef Oct 11, 2023