genomecov function #410

Closed
jayhesselberth opened this issue Sep 28, 2023 · 4 comments

Comments

@jayhesselberth
Member

At least while I'm teaching right now, I have found myself wanting / needing a genomecov()-like function that can go from BED records to bedGraph coverage.

One application of this is to generate a random distribution of BED fragments across a genome (with bed_random()) and then generate coverage from those.

@jayhesselberth
Member Author

This is a good enough approximation:

library(valr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

genome <- read_genome("https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes")

bed_random(
  genome,
  length = 100,
  n = 1e5
) |>
  bed_merge(cov = n())
#> # A tibble: 43,672 × 4
#>    chrom start   end   cov
#>    <chr> <int> <int> <int>
#>  1 chrI     21   177     2
#>  2 chrI    237   437     2
#>  3 chrI    762   862     1
#>  4 chrI    884  1063     2
#>  5 chrI   1152  1252     1
#>  6 chrI   1259  1369     2
#>  7 chrI   1465  1565     1
#>  8 chrI   1615  1739     2
#>  9 chrI   1805  1975     3
#> 10 chrI   1985  2178     2
#> # ℹ 43,662 more rows

Created on 2023-09-28 with reprex v2.0.2

@kriemo
Member

kriemo commented Sep 28, 2023

I would suggest using bed_partition() instead, as it preserves the locations where the coverage changes rather than merging overlapping regions into a single region.

e.g.

library(valr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
genome <- read_genome("https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes")

ivls <- bed_random(
  genome,
  length = 100,
  n = 1e5, 
  seed = 42
) 

bed_merge(ivls, cov = n())
#> # A tibble: 43,692 × 4
#>    chrom start   end   cov
#>    <chr> <int> <int> <int>
#>  1 chrI     59   159     1
#>  2 chrI    165   307     2
#>  3 chrI    456   755     5
#>  4 chrI   1347  1504     2
#>  5 chrI   1587  1750     3
#>  6 chrI   2014  2482     9
#>  7 chrI   2657  2757     1
#>  8 chrI   3021  3226     3
#>  9 chrI   3282  3413     2
#> 10 chrI   3594  3694     1
#> # ℹ 43,682 more rows
bed_partition(ivls, cov = n())
#> # A tibble: 154,751 × 4
#>    chrom start   end   cov
#>    <chr> <int> <int> <int>
#>  1 chrI     59   159     1
#>  2 chrI    165   207     1
#>  3 chrI    207   265     2
#>  4 chrI    265   307     1
#>  5 chrI    456   527     1
#>  6 chrI    527   534     2
#>  7 chrI    534   556     3
#>  8 chrI    556   617     2
#>  9 chrI    617   627     3
#> 10 chrI    627   634     2
#> # ℹ 154,741 more rows

Created on 2023-09-28 with reprex v2.0.2

@jayhesselberth
Member Author

I get "vector out of memory" errors for large numbers of intervals (>10e6) with bed_partition(). I may eventually implement a dedicated Rcpp function, which would definitely be faster and more memory-efficient than the above.

@kriemo
Member

kriemo commented Sep 28, 2023

I'm going to reopen as I agree the genomecov functionality would be good to have and it doesn't seem that we have a scalable alternative.

Regarding the memory issue that you are seeing, I'm not sure of the best way to improve the current approach. The partitioning step is fast, but it does not return a data structure with information about every interval that overlapped each partition. To get this information there is a second pass through the data using bed_map(), which compares the partitioned intervals to the original intervals and applies the supplied custom summary function.

The bed_map() function consumes a lot of memory because it generates a data.frame with a row for every overlapping interval (using intersect_impl). This is necessary to make it compatible with a dplyr group_by -> summarize() call but can generate a very large data.frame when there are many overlapping intervals.

With the yeast genome, 10e6 random 100bp intervals lead to ~80 overlaps per input interval due to the small genome size, so effectively an intermediate data.frame of 800 million rows is generated. With a larger genome, e.g. human, 10e6 random 100bp intervals can be processed in ~1 min because there are fewer overlapping intervals.

We could change the partition_impl Rcpp function to also return the row index of every interval that contributed to each partitioned interval. However, that data structure would need to be a list of integer vectors, one vector per output partitioned interval. I am not quite sure how to integrate that kind of data structure with dplyr's group_by() and summarize() without expanding the data.frame so that there is one row per entry in each group.
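To illustrate, here is a minimal sketch of what that list-of-indices result could look like on the R side, assuming partition_impl returned a list-column of input row indices (the `idx` column and the values in it are hypothetical, not the current valr output). A summary over the list-column via purrr avoids materializing the one-row-per-overlap intermediate data.frame:

```r
library(dplyr)
library(purrr)

# Hypothetical output shape: one row per partitioned interval, plus a
# list-column of the input interval row indices that overlap it.
parts <- tibble(
  chrom = c("chrI", "chrI", "chrI"),
  start = c(165L, 207L, 265L),
  end   = c(207L, 265L, 307L),
  idx   = list(1L, c(1L, 2L), 2L)
)

# Apply the summary per partition directly on the list-column,
# e.g. coverage as the number of contributing intervals:
parts |>
  mutate(cov = map_int(idx, length)) |>
  select(-idx)
```

This sidesteps group_by() -> summarize() entirely, but it only works for summary functions that can be expressed over the index vectors; it is a sketch of the idea, not a drop-in replacement for bed_map().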

@kriemo kriemo reopened this Sep 28, 2023
kriemo added a commit that referenced this issue Oct 11, 2023
@kriemo kriemo closed this as completed in a4093ef Oct 11, 2023