FlatfieldCorrection fails on GCP if solution images larger than 2GB #41

Open
carshadi opened this issue May 24, 2022 · 0 comments
carshadi commented May 24, 2022

My input tiles are 16-bit, shape (2000, 1600, 420), about 2.563 GB each. The job runs all the way to the end, where it exports the solution TIFFs to the Google Cloud Storage bucket, and then:

22/05/24 06:23:05 INFO TaskSchedulerImpl: Removed TaskSet 55.0, whose tasks have all completed, from pool 
22/05/24 06:23:05 INFO DAGScheduler: ResultStage 55 (foreach at FlatfieldCorrectionSolver.java:199) finished in 12601.346 s
22/05/24 06:23:05 INFO DAGScheduler: Job 34 is finished. Cancelling potential speculative or zombie tasks for this job
22/05/24 06:23:05 INFO TaskSchedulerImpl: Killing all running tasks in stage 55: Stage finished
22/05/24 06:23:05 INFO DAGScheduler: Job 34 finished: foreach at FlatfieldCorrectionSolver.java:199, took 12601.395705 s
22/05/24 06:23:05 INFO TorrentBroadcast: Destroying Broadcast(62) (from destroy at FlatfieldCorrectionSolver.java:394)
Stack is larger than 4GB. Most TIFF readers will only open the first image. Use this information to open as raw:
name=Untitled, dir=, width=2000, height=1600, nImages=420, offset=4233, gap=0, type=float, byteOrder=big, format=0, url=, whiteIsZero=f, lutSize=0, comp=1, ranges=null, samples=1
22/05/24 08:14:32 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/05/24 08:14:33 INFO MemoryStore: MemoryStore cleared
22/05/24 08:14:33 INFO BlockManager: BlockManager stopped
22/05/24 08:14:33 INFO BlockManagerMaster: BlockManagerMaster stopped
22/05/24 08:14:33 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/05/24 08:14:33 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.OutOfMemoryError: Required array size too large
        at java.nio.file.Files.readAllBytes(Files.java:3156)
        at org.janelia.dataaccess.googlecloud.GoogleCloudDataProvider.saveImage(GoogleCloudDataProvider.java:246)
        at org.janelia.flatfield.FlatfieldCorrection.saveSolutionComponent(FlatfieldCorrection.java:487)
        at org.janelia.flatfield.FlatfieldCorrection.run(FlatfieldCorrection.java:394)
        at org.janelia.flatfield.FlatfieldCorrection.run(FlatfieldCorrection.java:195)
        at org.janelia.flatfield.FlatfieldCorrection.main(FlatfieldCorrection.java:80)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/05/24 08:14:33 INFO ShutdownHookManager: Shutdown hook called
22/05/24 08:14:33 INFO ShutdownHookManager: Deleting directory /home/jupyter/spark/local/spark-86ed0dcc-292c-48d0-b42a-6ec045286b6a
22/05/24 08:14:33 INFO ShutdownHookManager: Deleting directory /tmp/spark-40ada513-d5d3-46fe-90bf-83df8ec8b5f7

We run into that pesky 2 GB limit on the size of a single byte[]:

final byte[] bytes = Files.readAllBytes( tempPath );
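One possible workaround, which should work even with the pinned google-cloud-storage 1.106.0, is to stream the upload through a WriteChannel in fixed-size chunks instead of materializing the whole TIFF in memory. A minimal sketch (the bucket name, object key, and temp path below are placeholders, not the actual values used by the pipeline):

import com.google.cloud.WriteChannel;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StreamingUploadSketch
{
    public static void main( final String[] args ) throws IOException
    {
        final Storage storage = StorageOptions.getDefaultInstance().getService();
        // Hypothetical bucket and object key, for illustration only
        final BlobInfo blobInfo = BlobInfo.newBuilder( "my-bucket", "flatfield/S.tif" ).build();
        final Path tempPath = Paths.get( "/tmp/S.tif" );

        // Copy the file into GCS in 64 MB chunks so that no single
        // byte[] ever has to hold the entire (>2 GB) image
        final byte[] buffer = new byte[ 64 * 1024 * 1024 ];
        try ( final InputStream in = Files.newInputStream( tempPath );
              final WriteChannel writer = storage.writer( blobInfo ) )
        {
            int read;
            while ( ( read = in.read( buffer ) ) > 0 )
                writer.write( ByteBuffer.wrap( buffer, 0, read ) );
        }
    }
}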

It would be useful to document this limitation so that others might avoid having to re-run the entire pipeline with smaller chunks.

Alternatively, there are methods to create blobs from Paths in newer versions of the google-cloud-storage Java API that could come in handy here.
See https://github.com/googleapis/java-storage/blob/854d7a3edcab88c410ccf7947dbec36bd5ba4585/google-cloud-storage/src/main/java/com/google/cloud/storage/StorageImpl.java#L209-L213
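With a newer google-cloud-storage, the fix could look something like the sketch below (again, bucket/object names and the temp path are placeholders). createFrom(BlobInfo, Path) streams the file in chunks internally, so the 2 GB byte[] limit never comes into play:

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.nio.file.Paths;

public class CreateFromPathSketch
{
    public static void main( final String[] args ) throws IOException
    {
        final Storage storage = StorageOptions.getDefaultInstance().getService();
        // Hypothetical bucket and object key, for illustration only
        final BlobInfo blobInfo = BlobInfo.newBuilder( "my-bucket", "flatfield/S.tif" ).build();

        // Upload directly from the temp file on disk, no full in-memory copy
        final Blob blob = storage.createFrom( blobInfo, Paths.get( "/tmp/S.tif" ) );
        System.out.println( "Uploaded " + blob.getName() + " (" + blob.getSize() + " bytes)" );
    }
}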

This repo pulls in google-cloud-storage 1.106.0 through n5-google-cloud 3.2.1, which does not include those methods. Maybe this is a good time to update those dependencies?
