First head-to-head against HDF5Array #12

Open
LTLA opened this issue Jun 22, 2020 · 6 comments

LTLA commented Jun 22, 2020

This is a condensed version of a real application involving PCA on sparse log-transformed expression values:

sce <- scRNAseq::MacoskoRetinaData() 
y <- scuttle::normalizeCounts(counts(sce))
dim(y)
## [1] 24658 49300

library(BiocSingular)
library(HDF5Array)
system.time(hdf.mat <- writeHDF5Array(y, filepath="macosko.h5", name="logcounts"))
##    user  system elapsed 
## 144.265   3.220 147.627 
system.time(hdf.pcs <- runPCA(t(hdf.mat), 10, BSPARAM=RandomParam(deferred=TRUE)))
##    user  system elapsed 
## 861.133  57.775 918.967 

library(TileDBArray)
system.time(tdb.mat <- writeTileDBArray(y, path="macosko_tdb", attr="logcounts"))
##    user  system elapsed 
##  66.415   1.717  20.009 
system.time(tdb.pcs <- runPCA(t(tdb.mat), 10, BSPARAM=RandomParam(deferred=TRUE)))
##    user  system elapsed 
## 888.668 167.635 347.845 
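
(A sanity check that wasn't run here, assuming runPCA() returns its usual prcomp-like list: the two backends should give essentially the same PCA, modulo the small stochastic error of the randomized SVD.)

# Hypothetical check: standard deviations of the PCs should match closely
# across backends, up to randomized-SVD noise.
all.equal(hdf.pcs$sdev, tdb.pcs$sdev, tolerance=1e-6)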

Note that this is not quite a fair comparison.

Nonetheless, these results are encouraging, given that no effort has been made to optimize the TileDB calls either. For starters, I suspect the tile extents are too small (they default to 100 in each dimension).
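
For reference, a hedged sketch of what re-writing with larger tiles might look like; the extent= argument is the same one used later in this thread, and the path and the value of 1000 are purely illustrative.

# Hypothetical re-run with larger tiles; "macosko_tdb_big" and extent=1000 are
# illustrative values, not anything that was actually benchmarked here.
system.time(tdb.mat2 <- writeTileDBArray(y, path="macosko_tdb_big", attr="logcounts", extent=1000))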

Session information
R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] TileDBArray_0.0.1           HDF5Array_1.17.2           
 [3] rhdf5_2.33.3                BiocSingular_1.5.0         
 [5] scRNAseq_2.3.6              SingleCellExperiment_1.11.5
 [7] SummarizedExperiment_1.19.5 DelayedArray_0.15.5        
 [9] matrixStats_0.56.0          Matrix_1.2-18              
[11] Biobase_2.49.0              GenomicRanges_1.41.5       
[13] GenomeInfoDb_1.25.2         IRanges_2.23.10            
[15] S4Vectors_0.27.12           BiocGenerics_0.35.4        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6                  rsvd_1.0.3                   
 [3] lattice_0.20-41               zoo_1.8-8                    
 [5] assertthat_0.2.1              digest_0.6.25                
 [7] mime_0.9                      BiocFileCache_1.13.0         
 [9] R6_2.4.1                      RSQLite_2.2.0                
[11] httr_1.4.1                    pillar_1.4.4                 
[13] zlibbioc_1.35.0               rlang_0.4.6                  
[15] curl_4.3                      irlba_2.3.3                  
[17] tiledb_0.7.0                  blob_1.2.1                   
[19] BiocParallel_1.23.0           RcppCCTZ_0.2.7               
[21] AnnotationHub_2.21.1          RCurl_1.98-1.2               
[23] bit_1.1-15.2                  shiny_1.4.0.2                
[25] compiler_4.0.0                httpuv_1.5.4                 
[27] base64enc_0.1-3               pkgconfig_2.0.3              
[29] htmltools_0.5.0               tidyselect_1.1.0             
[31] tibble_3.0.1                  GenomeInfoDbData_1.2.3       
[33] interactiveDisplayBase_1.27.5 crayon_1.3.4                 
[35] dplyr_1.0.0                   dbplyr_1.4.4                 
[37] later_1.1.0.1                 rhdf5filters_1.1.0           
[39] bitops_1.0-6                  rappdirs_0.3.1               
[41] grid_4.0.0                    xtable_1.8-4                 
[43] lifecycle_0.2.0               DBI_1.1.0                    
[45] magrittr_1.5                  scuttle_0.99.9               
[47] XVector_0.29.2                promises_1.1.1               
[49] DelayedMatrixStats_1.11.0     ellipsis_0.3.1               
[51] generics_0.0.2                vctrs_0.3.1                  
[53] Rhdf5lib_1.11.2               tools_4.0.0                  
[55] bit64_0.9-7                   nanotime_0.2.4               
[57] glue_1.4.1                    purrr_0.3.4                  
[59] BiocVersion_3.12.0            fastmap_1.0.1                
[61] yaml_2.2.1                    AnnotationDbi_1.51.0         
[63] BiocManager_1.30.10           ExperimentHub_1.15.0         
[65] memoise_1.1.0                

LTLA commented Sep 6, 2020

Column sums timing assessment

It occurred to me that runPCA() was doing too many things internally to be a straightforward evaluation,
so let's step back and look at a simpler metric: computing the column sums.

Setup

Here's the setup, using the Macosko data again. Note that I have forced both matrices to use the same "chunk" sizes.

sce <- scRNAseq::MacoskoRetinaData() # 24658 x 49300
y <- scuttle::normalizeCounts(counts(sce))

library(HDF5Array)
system.time(hdf.mat <- writeHDF5Array(y, filepath="macosko.h5", name="logcounts", chunkdim=c(1000, 1000)))
##    user  system elapsed 
## 150.084   0.900 151.245 

library(TileDBArray)
tiledb::limitTileDBCores(1)
system.time(tdb.mat <- writeTileDBArray(y, path="macosko_tdb", attr="logcounts", extent=1000))
##    user  system elapsed
##  40.917   0.652  45.131

TileDB is faster, consistent with what we saw before.
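
A quick way to double-check that both backends really ended up with the same 1000 x 1000 chunk geometry (a hedged check; it assumes the TileDBArray seed implements the usual DelayedArray chunkdim() generic):

# Both should report c(1000, 1000); chunkdim() is the DelayedArray generic for
# the chunk/tile dimensions, assuming TileDBArray provides a method for it.
chunkdim(hdf.mat)
chunkdim(tdb.mat)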

Timings

This is where it gets interesting.

# # If the files have already been constructed:
# library(HDF5Array)
# hdf.mat <- HDF5Array("macosko.h5", "logcounts")
# 
# library(TileDBArray)
# tdb.mat <- TileDBArray("macosko_tdb", "logcounts")

system.time(colSums(hdf.mat))
##   user  system elapsed 
## 24.750   0.832  25.583 


system.time(colSums(tdb.mat))
##    user  system elapsed
##  67.933   1.129  68.522

We see that HDF5 is considerably faster than TileDB on this ostensibly sparse matrix. What's going on here?
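
(As an aside, the sparsity itself is easy to check on the in-memory matrix from the setup, assuming y is still around as a sparse Matrix object; the diagnosis below quotes a density of less than 5%.)

# Fraction of non-zero entries in the original sparse matrix;
# nnzero() comes from the Matrix package backing the sparse representation.
Matrix::nnzero(y) / prod(dim(y))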

Diagnosis

Simplifying right down, we can take a look at the underlying read calls.

start <- c(1, 1)
count <- c(10000, 10000)

system.time(out <- h5read("macosko.h5", "logcounts", start=start, count=count))
##    user  system elapsed 
##   2.115   0.220   2.335 

obj <- tiledb::tiledb_array("macosko_tdb", attrs="logcounts")
tiledb::selected_ranges(obj) <- list(start[1] - 1 + cbind(1, count[1]), start[2] - 1 + cbind(1, count[2]))
system.time(out2 <- obj[])
##    user  system elapsed
##   9.333   0.504   9.700

Fiddling with start and count gives different fold changes, but TileDB is usually slower. In some cases TileDB is faster, but the associated times are so low that they don't contribute much to the colSums() timings. I would add that the density of non-zero elements is not particularly high here, less than 5%. We're also extracting data along chunk/tile boundaries, so it's not as though one format is being handicapped by having to read more chunks/tiles than necessary. We can simplify further by extracting just the first chunk/tile:

start <- c(1, 1)
count <- c(1000, 1000)

system.time(out <- h5read("macosko.h5", "logcounts", start=start, count=count))
##    user  system elapsed 
##   0.027   0.000   0.028 

obj <- tiledb::tiledb_array("macosko_tdb", attrs="logcounts")
tiledb::selected_ranges(obj) <- list(start[1] - 1 + cbind(1, count[1]), start[2] - 1 + cbind(1, count[2]))
system.time(out2 <- obj[])
##    user  system elapsed
##    0.16    0.00    0.16

Same kind of difference. Interestingly, the read time decreases (see below) when you request a smaller part of the same tile, which is not what I expected: I thought that each tile would be read into memory in its entirety, much like how HDF5 reads in atomic chunks for decompression and data extraction. For example, we see a jump in TileDB's elapsed time when extracting the top-left 600-by-600 block.

start <- c(1,1)
for (i in 1:10*100) {
    print(i)
    print(system.time(out <- h5read("macosko.h5", "logcounts", start=start, count=c(i, i))))
    obj <- tiledb::tiledb_array("macosko_tdb", attrs="logcounts")
    tiledb::selected_ranges(obj) <- list(start[1] - 1 + cbind(1, i), start[2] - 1 + cbind(1, i))
    print(system.time(out2 <- obj[]))
}

## [1] 100 # dimension of extracted square
##    user  system elapsed # HDF5 time
##   0.025   0.000   0.026 
##    user  system elapsed # TileDB time
##   0.008   0.000   0.008 
## [1] 200
##    user  system elapsed 
##   0.025   0.000   0.024 
##    user  system elapsed 
##   0.007   0.004   0.011 
## [1] 300
##    user  system elapsed 
##   0.024   0.000   0.024 
##    user  system elapsed 
##   0.012   0.004   0.015 
## [1] 400
##    user  system elapsed 
##   0.024   0.000   0.024 
##    user  system elapsed 
##   0.014   0.008   0.021 
## [1] 500
##    user  system elapsed 
##   0.024   0.000   0.025 
##    user  system elapsed 
##   0.029   0.000   0.029 
## [1] 600
##    user  system elapsed 
##   0.024   0.000   0.024 
##    user  system elapsed 
##   0.058   0.000   0.058 
## [1] 700
##    user  system elapsed 
##   0.024   0.000   0.024 
##    user  system elapsed 
##   0.076   0.000   0.076 
## [1] 800
##    user  system elapsed 
##   0.025   0.000   0.025 
##    user  system elapsed 
##   0.098   0.000   0.098 
## [1] 900
##    user  system elapsed 
##   0.026   0.000   0.025 
##    user  system elapsed 
##   0.121   0.000   0.121 
## [1] 1000
##    user  system elapsed 
##   0.025   0.000   0.025 
##    user  system elapsed 
##   0.145   0.004   0.149 

I guess the questions are:

  • Is the slowdown due to the R wrapper or something deeper?
  • Are there any flags I can toggle to improve performance? (I would prefer not to have to change the number of cores, but it seems like setting tiledb::limitTileDBCores doesn't really help anyway.)

Session information

This was generated using the tidying branch of TileDBArray and tiledb version 0.8.0.

R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] TileDBArray_0.99.0  HDF5Array_1.17.3    rhdf5_2.33.7       
[4] DelayedArray_0.15.7 IRanges_2.23.10     S4Vectors_0.27.12  
[7] BiocGenerics_0.35.4 matrixStats_0.56.0  Matrix_1.2-18      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5         lattice_0.20-41    zoo_1.8-8          rhdf5filters_1.1.2
 [5] grid_4.0.0         tiledb_0.8.0       Rhdf5lib_1.11.3    tools_4.0.0       
 [9] RcppCCTZ_0.2.9     bit64_4.0.5        nanotime_0.3.2     bit_4.0.4         
[13] compiler_4.0.0    

@eddelbuettel (Collaborator) commented

There is quite a bit to chew over here, and I will try to take a closer look.

limitTileDBCores() is a very blunt instrument to comply with CRAN requests. If you look at its code and documentation, you'll see that setting options("Ncpus") or the environment variable OMP_THREAD_LIMIT before loading the tiledb package should leave you at full "pedal to the metal" speed (but sadly we have to throttle by default for CRAN).
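
For instance, something along these lines before tiledb gets loaded (a sketch with illustrative values; the exact behaviour is whatever the package's startup code implements):

# Declare the desired core count before the tiledb package is loaded;
# parallel::detectCores() is just an illustrative choice.
options(Ncpus = parallel::detectCores())
# or: Sys.setenv(OMP_THREAD_LIMIT = parallel::detectCores())
library(tiledb)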

I'll circulate your results at our end and I am sure we'll come back with another suggestion or two.

@kasperdanielhansen commented

That cell example is interesting. There must be a bottleneck somewhere:

userTime <- c(0.008, 0.007, 0.012, 0.014, 0.029, 0.058, 0.076, 0.098, 0.121, 0.145) # TileDB user times from the loop above
plot(userTime)
idx <- 1:10
abline(lm(userTime[4:10] ~ idx[4:10]))

shows the first 4 user times being roughly constant, followed by linear growth.

@stavrospapadopoulos commented

@kasperdanielhansen @LTLA

This loop:

start <- c(1,1)
for (i in 1:10*100) {
    print(i)
    print(system.time(out <- h5read("macosko.h5", "logcounts", start=start, count=c(i, i))))
    obj <- tiledb::tiledb_array("macosko_tdb", attrs="logcounts")
    tiledb::selected_ranges(obj) <- list(start[1] - 1 + cbind(1, i), start[2] - 1 + cbind(1, i))
    print(system.time(out2 <- obj[]))
}

progressively asks for a larger range in each iteration, so it is normal for the time to increase. Initially the times are fairly flat because the requested range is so small that a fixed number of tiles is retrieved, incurring a relatively fixed cost for the given tile size configuration.

We will be actively working on comparisons with HDF5, so we will certainly share code and results with you very soon.


LTLA commented Oct 8, 2020

Just to be clear, the increase in the range is intentional. My surprise was from the fact that we were still operating within a single tile (1000 x 1000). I had been under the impression that the TileDB tiles were analogous to HDF5 chunks, in that each chunk/tile would be read into memory in its entirety before further processing. Under such a model, the constant-time I/O for the entire tile/chunk (plus its decompression) should be the major factor in the timings, with a relatively small scaling due to the organization and subsetting of the data to return the requested subset.
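
To spell out that expectation (a purely hypothetical cost model, not anything measured here): extracting an i-by-i block that lies within a single 1000-by-1000 tile would look something like

# Hypothetical cost model under the "read the whole tile, then subset" view;
# tile_io and per_cell are made-up constants for illustration only.
expected_time <- function(i, tile_io = 0.1, per_cell = 1e-8) {
    tile_io +           # fixed: read and decompress the entire tile
        per_cell * i^2  # small scaling: subset out the requested i x i block
}

Under such a model the i-dependence would be minor relative to the fixed tile cost, which is why the steady growth in the timings above was surprising.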

Apparently this is not the case, which is fine, I was just expecting something different.

@stavrospapadopoulos commented

Oh, I did not realize that this was inside a single tile. TileDB should function exactly like HDF5 in that case, so it's probably a (performance) bug on our end. We'll investigate. That also explains why the HDF5 times are flat.
