# First head-to-head against HDF5Array #12
## Column sums timing assessment

### Setup

Here's the set-up, Macosko data again. Note how I have forced both matrices to use the same "chunk" sizes.

```r
sce <- scRNAseq::MacoskoRetinaData() # 24658 x 49300
y <- scuttle::normalizeCounts(counts(sce))
library(HDF5Array)
system.time(hdf.mat <- writeHDF5Array(y, filepath="macosko.h5", name="logcounts", chunkdim=c(1000, 1000)))
## user system elapsed
## 150.084 0.900 151.245
library(TileDBArray)
tiledb::limitTileDBCores(1)
system.time(tdb.mat <- writeTileDBArray(y, path="macosko_tdb", attr="logcounts", extent=1000))
## user system elapsed
## 40.917 0.652 45.131
```

TileDB is faster, consistent with what we saw before.

### Timings

This is where it gets interesting.

```r
# If the files have already been constructed:
# library(HDF5Array)
# hdf.mat <- HDF5Array("macosko.h5", "logcounts")
#
# library(TileDBArray)
# tdb.mat <- TileDBArray("macosko_tdb", "logcounts")
system.time(colSums(hdf.mat))
## user system elapsed
## 24.750 0.832 25.583
system.time(colSums(tdb.mat))
## user system elapsed
## 67.933 1.129 68.522
```

We see that HDF5 is considerably faster than TileDB on this ostensibly sparse matrix. What's going on here?

### Diagnosis

Simplifying it right down, we can have a look at the calls.

```r
start <- c(1, 1)
count <- c(10000, 10000)
system.time(out <- h5read("macosko.h5", "logcounts", start=start, count=count))
## user system elapsed
## 2.115 0.220 2.335
obj <- tiledb::tiledb_array("macosko_tdb", attrs="logcounts")
tiledb::selected_ranges(obj) <- list(start[1] - 1 + cbind(1, count[1]), start[2] - 1 + cbind(1, count[2]))
system.time(out2 <- obj[])
## user system elapsed
## 9.333 0.504 9.700
```

Fiddling with `count`:

```r
start <- c(1, 1)
count <- c(1000, 1000)
system.time(out <- h5read("macosko.h5", "logcounts", start=start, count=count))
## user system elapsed
## 0.027 0.000 0.028
obj <- tiledb::tiledb_array("macosko_tdb", attrs="logcounts")
tiledb::selected_ranges(obj) <- list(start[1] - 1 + cbind(1, count[1]), start[2] - 1 + cbind(1, count[2]))
system.time(out2 <- obj[])
## user system elapsed
## 0.16 0.00 0.16
```

Same kind of difference. Interestingly, the read time decreases (see below) when you request a smaller part of the same tile, which is not what I expected: I thought that each tile would be read into memory in its entirety, much like how HDF5 reads in atomic chunks for decompression and data extraction. For example, we see a spike in TileDB's elapsed time when extracting the top-left 600-by-600 block.

```r
start <- c(1, 1)
for (i in 1:10 * 100) { # block sizes of 100, 200, ..., 1000
    print(i)
    print(system.time(out <- h5read("macosko.h5", "logcounts", start=start, count=c(i, i))))
    obj <- tiledb::tiledb_array("macosko_tdb", attrs="logcounts")
    tiledb::selected_ranges(obj) <- list(start[1] - 1 + cbind(1, i), start[2] - 1 + cbind(1, i))
    print(system.time(out2 <- obj[]))
}
## [1] 100 # dimension of extracted square
## user system elapsed # HDF5 time
## 0.025 0.000 0.026
## user system elapsed # TileDB time
## 0.008 0.000 0.008
## [1] 200
## user system elapsed
## 0.025 0.000 0.024
## user system elapsed
## 0.007 0.004 0.011
## [1] 300
## user system elapsed
## 0.024 0.000 0.024
## user system elapsed
## 0.012 0.004 0.015
## [1] 400
## user system elapsed
## 0.024 0.000 0.024
## user system elapsed
## 0.014 0.008 0.021
## [1] 500
## user system elapsed
## 0.024 0.000 0.025
## user system elapsed
## 0.029 0.000 0.029
## [1] 600
## user system elapsed
## 0.024 0.000 0.024
## user system elapsed
## 0.058 0.000 0.058
## [1] 700
## user system elapsed
## 0.024 0.000 0.024
## user system elapsed
## 0.076 0.000 0.076
## [1] 800
## user system elapsed
## 0.025 0.000 0.025
## user system elapsed
## 0.098 0.000 0.098
## [1] 900
## user system elapsed
## 0.026 0.000 0.025
## user system elapsed
## 0.121 0.000 0.121
## [1] 1000
## user system elapsed
## 0.025 0.000 0.025
## user system elapsed
## 0.145 0.004 0.149
```

I guess the questions are:
### Session information

```
R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] TileDBArray_0.99.0 HDF5Array_1.17.3 rhdf5_2.33.7
[4] DelayedArray_0.15.7 IRanges_2.23.10 S4Vectors_0.27.12
[7] BiocGenerics_0.35.4 matrixStats_0.56.0 Matrix_1.2-18
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 lattice_0.20-41 zoo_1.8-8 rhdf5filters_1.1.2
[5] grid_4.0.0 tiledb_0.8.0 Rhdf5lib_1.11.3 tools_4.0.0
[9] RcppCCTZ_0.2.9 bit64_4.0.5 nanotime_0.3.2 bit_4.0.4
[13] compiler_4.0.0
```
---

There is quite a bit to chew over here, and I will try to take a closer look. I'll circulate your results at our end, and I am sure we'll come back with another suggestion or two.
---

That cell example is interesting. There must be a bottleneck somewhere: the output above shows the first four user times being roughly constant, and then we see linear growth.
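To make that trend concrete, the timings can be transcribed and compared against the requested area; a minimal sketch, with the numbers copied from the loop output above (base R only, nothing here is from the original thread):

```r
# Elapsed times transcribed from the loop output above.
sizes <- 1:10 * 100
hdf5 <- c(0.026, 0.024, 0.024, 0.024, 0.025, 0.024, 0.024, 0.025, 0.025, 0.025)
tdb  <- c(0.008, 0.011, 0.015, 0.021, 0.029, 0.058, 0.076, 0.098, 0.121, 0.149)

# Under a chunk-at-a-time model both curves should be flat within one tile;
# instead, the TileDB times grow with the requested area.
plot(sizes^2, tdb, type="b", xlab="elements requested", ylab="elapsed (s)")
lines(sizes^2, hdf5, type="b", lty=2)   # HDF5 baseline stays essentially flat
summary(lm(tdb ~ I(sizes^2)))$r.squared # quantifies how linear the growth is in area
```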
---

This loop progressively asks for a larger range in each iteration, so it is normal for the time to increase from one iteration to the next. Initially it is fairly flat because the range is so small in the first few iterations that a fixed number of tiles is retrieved, so there is a relatively fixed cost incurred for the given tile size configuration. We will be actively working on comparisons with HDF5, so we will certainly share code and results with you very soon.
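To illustrate the "fixed number of tiles" point, here is a hypothetical helper (`tiles_touched` is my own illustration, not part of the tiledb API) that counts how many tiles a dense query overlaps; with an extent of 1000, every block in the loop above touches exactly one tile:

```r
# Count the tiles overlapped by a dense query, given a tile extent per
# dimension; starts and counts are 1-based, as in the h5read() calls above.
tiles_touched <- function(start, count, extent) {
    first <- (start - 1) %/% extent         # zero-based index of the first tile hit
    last <- (start + count - 2) %/% extent  # zero-based index of the last tile hit
    prod(last - first + 1)                  # total tiles across all dimensions
}
sapply(1:10 * 100, function(i) tiles_touched(c(1, 1), c(i, i), 1000))
## [1] 1 1 1 1 1 1 1 1 1 1
```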
---

Just to be clear, the increase in the range is intentional. My surprise comes from the fact that we were still operating within a single tile (1000 x 1000). I had been under the impression that TileDB tiles were analogous to HDF5 chunks, in that each chunk/tile would be read into memory in its entirety before further processing. Under such a model, the constant-time I/O for the entire tile/chunk (plus its decompression) should be the major factor in the timings, with a relatively small scaling cost due to the organization and subsetting of the data to return the requested subset. Apparently this is not the case, which is fine; I was just expecting something different.
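For what it's worth, that expectation can be written down directly; this is an illustrative sketch only, with made-up placeholder coefficients rather than measured values:

```r
# Hypothetical chunk-at-a-time cost model: any block inside a single tile
# pays a fixed price to fetch/decompress the whole tile, plus a tiny
# per-element price to copy out the requested subset.
tile_cost <- 0.15    # placeholder: read + decompress one full 1000 x 1000 tile
element_cost <- 1e-8 # placeholder: per-element subsetting cost
expected_time <- function(i) tile_cost + element_cost * i^2
expected_time(c(100, 500, 1000)) # nearly flat within one tile, unlike the observations
```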
---

Oh, I did not realize that this was inside a single tile. TileDB should function exactly like HDF5 in that case, so it's probably a (perf) bug on our end. We'll investigate. That model also explains why the HDF5 times are flat.
---

This is a condensed version of a real application involving PCA on sparse log-transformed expression values:

Note that this is not quite a fair comparison:

Nonetheless, these results are encouraging, given that no effort has been made to optimize the TileDB calls either. For starters, I suspect the tile extents are too small (the default is 100 in each dimension); see the sketch below.
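For reference, the tile extents can be raised when the array is written, via the same `extent=` argument used in the setup above; a minimal sketch, assuming `y` from earlier and a hypothetical output path:

```r
library(TileDBArray)
# Rebuild with 1000 x 1000 tiles instead of the 100 x 100 default;
# "macosko_tdb_big" is a hypothetical path for the new array.
tdb.big <- writeTileDBArray(y, path="macosko_tdb_big", attr="logcounts",
                            extent=1000)
```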