First head-to-head against HDF5Array #12

Open
LTLA opened this issue Jun 22, 2020 · 6 comments

LTLA commented Jun 22, 2020

This is a condensed version of a real application involving PCA on sparse log-transformed expression values:

sce <- scRNAseq::MacoskoRetinaData() 
y <- scuttle::normalizeCounts(counts(sce))
dim(y)
## [1] 24658 49300

library(BiocSingular)
library(HDF5Array)
system.time(hdf.mat <- writeHDF5Array(y, filepath="macosko.h5", name="logcounts"))
##    user  system elapsed 
## 144.265   3.220 147.627 
system.time(hdf.pcs <- runPCA(t(hdf.mat), 10, BSPARAM=RandomParam(deferred=TRUE)))
##    user  system elapsed 
## 861.133  57.775 918.967 

library(TileDBArray)
system.time(tdb.mat <- writeTileDBArray(y, path="macosko_tdb", attr="logcounts"))
##    user  system elapsed 
##  66.415   1.717  20.009 
system.time(tdb.pcs <- runPCA(t(tdb.mat), 10, BSPARAM=RandomParam(deferred=TRUE)))
##    user  system elapsed 
## 888.668 167.635 347.845 
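
(A sanity check that wasn't run here, assuming runPCA() returns its usual prcomp-like list: the two backends should give essentially the same PCA, modulo the small stochastic error of the randomized SVD.)

# Hypothetical check: standard deviations of the PCs should match closely
# across backends, up to randomized-SVD noise.
all.equal(hdf.pcs$sdev, tdb.pcs$sdev, tolerance=1e-6)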

Note that this is not quite a fair comparison.

Nonetheless, these results are encouraging, given that no effort has been made to optimize the TileDB calls either. For starters, I suspect the tile extents are too small (they default to 100 in each dimension).
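
For reference, a hedged sketch of what re-writing with larger tiles might look like; the extent= argument is the same one used later in this thread, and the path and the value of 1000 are purely illustrative.

# Hypothetical re-run with larger tiles; "macosko_tdb_big" and extent=1000 are
# illustrative values, not anything that was actually benchmarked here.
system.time(tdb.mat2 <- writeTileDBArray(y, path="macosko_tdb_big", attr="logcounts", extent=1000))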

Session information
R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] TileDBArray_0.0.1           HDF5Array_1.17.2           
 [3] rhdf5_2.33.3                BiocSingular_1.5.0         
 [5] scRNAseq_2.3.6              SingleCellExperiment_1.11.5
 [7] SummarizedExperiment_1.19.5 DelayedArray_0.15.5        
 [9] matrixStats_0.56.0          Matrix_1.2-18              
[11] Biobase_2.49.0              GenomicRanges_1.41.5       
[13] GenomeInfoDb_1.25.2         IRanges_2.23.10            
[15] S4Vectors_0.27.12           BiocGenerics_0.35.4        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6                  rsvd_1.0.3                   
 [3] lattice_0.20-41               zoo_1.8-8                    
 [5] assertthat_0.2.1              digest_0.6.25                
 [7] mime_0.9                      BiocFileCache_1.13.0         
 [9] R6_2.4.1                      RSQLite_2.2.0                
[11] httr_1.4.1                    pillar_1.4.4                 
[13] zlibbioc_1.35.0               rlang_0.4.6                  
[15] curl_4.3                      irlba_2.3.3                  
[17] tiledb_0.7.0                  blob_1.2.1                   
[19] BiocParallel_1.23.0           RcppCCTZ_0.2.7               
[21] AnnotationHub_2.21.1          RCurl_1.98-1.2               
[23] bit_1.1-15.2                  shiny_1.4.0.2                
[25] compiler_4.0.0                httpuv_1.5.4                 
[27] base64enc_0.1-3               pkgconfig_2.0.3              
[29] htmltools_0.5.0               tidyselect_1.1.0             
[31] tibble_3.0.1                  GenomeInfoDbData_1.2.3       
[33] interactiveDisplayBase_1.27.5 crayon_1.3.4                 
[35] dplyr_1.0.0                   dbplyr_1.4.4                 
[37] later_1.1.0.1                 rhdf5filters_1.1.0           
[39] bitops_1.0-6                  rappdirs_0.3.1               
[41] grid_4.0.0                    xtable_1.8-4                 
[43] lifecycle_0.2.0               DBI_1.1.0                    
[45] magrittr_1.5                  scuttle_0.99.9               
[47] XVector_0.29.2                promises_1.1.1               
[49] DelayedMatrixStats_1.11.0     ellipsis_0.3.1               
[51] generics_0.0.2                vctrs_0.3.1                  
[53] Rhdf5lib_1.11.2               tools_4.0.0                  
[55] bit64_0.9-7                   nanotime_0.2.4               
[57] glue_1.4.1                    purrr_0.3.4                  
[59] BiocVersion_3.12.0            fastmap_1.0.1                
[61] yaml_2.2.1                    AnnotationDbi_1.51.0         
[63] BiocManager_1.30.10           ExperimentHub_1.15.0         
[65] memoise_1.1.0                

LTLA commented Sep 6, 2020

Column sums timing assessment

It occurred to me that runPCA() was doing too many things internally to be a straightforward evaluation,
so let's step back and look at a simpler metric: computing the column sums.

Setup

Here's the setup, using the Macosko data again. Note that I have forced both matrices to use the same "chunk" sizes.

sce <- scRNAseq::MacoskoRetinaData() # 24658 x 49300
y <- scuttle::normalizeCounts(counts(sce))

library(HDF5Array)
system.time(hdf.mat <- writeHDF5Array(y, filepath="macosko.h5", name="logcounts", chunkdim=c(1000, 1000)))
##    user  system elapsed 
## 150.084   0.900 151.245 

library(TileDBArray)
tiledb::limitTileDBCores(1)
system.time(tdb.mat <- writeTileDBArray(y, path="macosko_tdb", attr="logcounts", extent=1000))
##    user  system elapsed
##  40.917   0.652  45.131

TileDB is faster, consistent with what we saw before.
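
A quick way to double-check that both backends really ended up with the same 1000 x 1000 chunk geometry (a hedged check; it assumes the TileDBArray seed implements the usual DelayedArray chunkdim() generic):

# Both should report c(1000, 1000); chunkdim() is the DelayedArray generic for
# the chunk/tile dimensions, assuming TileDBArray provides a method for it.
chunkdim(hdf.mat)
chunkdim(tdb.mat)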

Timings

This is where it gets interesting.

# # If the files have already been constructed:
# library(HDF5Array)
# hdf.mat <- HDF5Array("macosko.h5", "logcounts")
# 
# library(TileDBArray)
# tdb.mat <- TileDBArray("macosko_tdb", "logcounts")

system.time(colSums(hdf.mat))
##   user  system elapsed 
## 24.750   0.832  25.583 


system.time(colSums(tdb.mat))
##    user  system elapsed
##  67.933   1.129  68.522

We see that HDF5 is considerably faster than TileDB on this ostensibly sparse matrix. What's going on here?
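
(As an aside, the sparsity itself is easy to check on the in-memory matrix from the setup, assuming y is still around as a sparse Matrix object; the diagnosis below quotes a density of less than 5%.)

# Fraction of non-zero entries in the original sparse matrix;
# nnzero() comes from the Matrix package backing the sparse representation.
Matrix::nnzero(y) / prod(dim(y))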

Diagnosis

Simplifying right down, we can take a look at the underlying read calls.

start <- c(1, 1)
count <- c(10000, 10000)

system.time(out <- h5read("macosko.h5", "logcounts", start=start, count=count))
##    user  system elapsed 
##   2.115   0.220   2.335 

obj <- tiledb::tiledb_array("macosko_tdb", attrs="logcounts")
tiledb::selected_ranges(obj) <- list(start[1] - 1 + cbind(1, count[1]), start[2] - 1 + cbind(1, count[2]))
system.time(out2 <- obj[])
##    user  system elapsed
##   9.333   0.504   9.700

Fiddling with start and count gives different fold changes, but TileDB is usually slower. In some cases TileDB is faster, but the associated times are so low that they don't contribute much to the colSums() timings. I would add that the density of non-zero elements is not particularly high here, less than 5%. We're also extracting data along chunk/tile boundaries, so it's not as though one format is being handicapped by having to read more chunks/tiles than necessary. We can simplify further by extracting just the first chunk/tile:

start <- c(1, 1)
count <- c(1000, 1000)

system.time(out <- h5read("macosko.h5", "logcounts", start=start, count=count))
##    user  system elapsed 
##   0.027   0.000   0.028 

obj <- tiledb::tiledb_array("macosko_tdb", attrs="logcounts")
tiledb::selected_ranges(obj) <- list(start[1] - 1 + cbind(1, count[1]), start[2] - 1 + cbind(1, count[2]))
system.time(out2 <- obj[])
##    user  system elapsed
##    0.16    0.00    0.16

Same kind of difference. Interestingly, the read time decreases (see below) when you request a smaller part of the same tile, which is not what I expected: I thought that each tile would be read into memory in its entirety, much like how HDF5 reads in atomic chunks for decompression and data extraction. For example, we see a jump in TileDB's elapsed time when extracting the top-left 600-by-600 block.

start <- c(1,1)
for (i in 1:10*100) {
    print(i)
    print(system.time(out <- h5read("macosko.h5", "logcounts", start=start, count=c(i, i))))
    obj <- tiledb::tiledb_array("macosko_tdb", attrs="logcounts")
    tiledb::selected_ranges(obj) <- list(start[1] - 1 + cbind(1, i), start[2] - 1 + cbind(1, i))
    print(system.time(out2 <- obj[]))
}

## [1] 100 # dimension of extracted square
##    user  system elapsed # HDF5 time
##   0.025   0.000   0.026 
##    user  system elapsed # TileDB time
##   0.008   0.000   0.008 
## [1] 200
##    user  system elapsed 
##   0.025   0.000   0.024 
##    user  system elapsed 
##   0.007   0.004   0.011 
## [1] 300
##    user  system elapsed 
##   0.024   0.000   0.024 
##    user  system elapsed 
##   0.012   0.004   0.015 
## [1] 400
##    user  system elapsed 
##   0.024   0.000   0.024 
##    user  system elapsed 
##   0.014   0.008   0.021 
## [1] 500
##    user  system elapsed 
##   0.024   0.000   0.025 
##    user  system elapsed 
##   0.029   0.000   0.029 
## [1] 600
##    user  system elapsed 
##   0.024   0.000   0.024 
##    user  system elapsed 
##   0.058   0.000   0.058 
## [1] 700
##    user  system elapsed 
##   0.024   0.000   0.024 
##    user  system elapsed 
##   0.076   0.000   0.076 
## [1] 800
##    user  system elapsed 
##   0.025   0.000   0.025 
##    user  system elapsed 
##   0.098   0.000   0.098 
## [1] 900
##    user  system elapsed 
##   0.026   0.000   0.025 
##    user  system elapsed 
##   0.121   0.000   0.121 
## [1] 1000
##    user  system elapsed 
##   0.025   0.000   0.025 
##    user  system elapsed 
##   0.145   0.004   0.149 

I guess the questions are:

  • Is the slowdown due to the R wrapper or something deeper?
  • Are there any flags I can toggle to improve performance? (I would prefer not to have to change the number of cores, but it seems like setting tiledb::limitTileDBCores doesn't really help anyway.)

Session information

This was generated using the tidying branch of TileDBArray and tiledb version 0.8.0.

R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] TileDBArray_0.99.0  HDF5Array_1.17.3    rhdf5_2.33.7       
[4] DelayedArray_0.15.7 IRanges_2.23.10     S4Vectors_0.27.12  
[7] BiocGenerics_0.35.4 matrixStats_0.56.0  Matrix_1.2-18      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5         lattice_0.20-41    zoo_1.8-8          rhdf5filters_1.1.2
 [5] grid_4.0.0         tiledb_0.8.0       Rhdf5lib_1.11.3    tools_4.0.0       
 [9] RcppCCTZ_0.2.9     bit64_4.0.5        nanotime_0.3.2     bit_4.0.4         
[13] compiler_4.0.0    

@eddelbuettel (Collaborator) commented

There is quite a bit to chew over here, and I will try to take a closer look.

limitTileDBCores() is a very blunt instrument to comply with CRAN requests. If you look at its code and documentation, you'll see that setting options("Ncpus") or the environment variable OMP_THREAD_LIMIT before loading the tiledb package should leave you at full "pedal to the metal" speed (but sadly we have to throttle by default for CRAN).
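
For instance, something along these lines before tiledb gets loaded (a sketch with illustrative values; the exact behaviour is whatever the package's startup code implements):

# Declare the desired core count before the tiledb package is loaded;
# parallel::detectCores() is just an illustrative choice.
options(Ncpus = parallel::detectCores())
# or: Sys.setenv(OMP_THREAD_LIMIT = parallel::detectCores())
library(tiledb)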

I'll circulate your results at our end and I am sure we'll come back with another suggestion or two.

@kasperdanielhansen commented

That cell example is interesting. There must be a bottleneck somewhere:

userTime <- c(0.008, 0.007, 0.012, 0.014, 0.029, 0.058, 0.076, 0.098, 0.121, 0.145) # TileDB user times from the loop above
plot(userTime)
idx <- 1:10
abline(lm(userTime[4:10] ~ idx[4:10]))

shows the first 4 user times being roughly constant, followed by linear growth.

@stavrospapadopoulos commented

@kasperdanielhansen @LTLA

This loop:

start <- c(1,1)
for (i in 1:10*100) {
    print(i)
    print(system.time(out <- h5read("macosko.h5", "logcounts", start=start, count=c(i, i))))
    obj <- tiledb::tiledb_array("macosko_tdb", attrs="logcounts")
    tiledb::selected_ranges(obj) <- list(start[1] - 1 + cbind(1, i), start[2] - 1 + cbind(1, i))
    print(system.time(out2 <- obj[]))
}

progressively asks for a larger range in each iteration, so it is normal for the time to increase. Initially the times are fairly flat because the requested range is so small that a fixed number of tiles is retrieved, incurring a relatively fixed cost for the given tile size configuration.

We will be actively working on comparisons with HDF5, so we will certainly share code and results with you very soon.


LTLA commented Oct 8, 2020

Just to be clear, the increase in the range is intentional. My surprise was from the fact that we were still operating within a single tile (1000 x 1000). I had been under the impression that the TileDB tiles were analogous to HDF5 chunks, in that each chunk/tile would be read into memory in its entirety before further processing. Under such a model, the constant-time I/O for the entire tile/chunk (plus its decompression) should be the major factor in the timings, with a relatively small scaling due to the organization and subsetting of the data to return the requested subset.
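
To spell out that expectation (a purely hypothetical cost model, not anything measured here): extracting an i-by-i block that lies within a single 1000-by-1000 tile would look something like

# Hypothetical cost model under the "read the whole tile, then subset" view;
# tile_io and per_cell are made-up constants for illustration only.
expected_time <- function(i, tile_io = 0.1, per_cell = 1e-8) {
    tile_io +           # fixed: read and decompress the entire tile
        per_cell * i^2  # small scaling: subset out the requested i x i block
}

Under such a model the i-dependence would be minor relative to the fixed tile cost, which is why the steady growth in the timings above was surprising.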

Apparently this is not the case, which is fine, I was just expecting something different.

@stavrospapadopoulos commented

Oh, I did not realize that this was inside a single tile. TileDB should function exactly like HDF5 in that case, so it's probably a (performance) bug on our end. We'll investigate. That also explains why the HDF5 times are flat.
