
Running on Google Cloud


Prerequisites

Create a Google Cloud Platform account.

You will be asked to fill in billing details, but you get $300 of credit when signing up, so you will not have to pay anything at registration time. This credit is valid for your first year of usage and is automatically applied towards all used resources (including running the stitching pipeline following these instructions).

All resources on Google Cloud are grouped into projects, so you will also be prompted to create your first project.

Building the package

Clone the repository with submodules:

git clone --recursive https://github.com/saalfeldlab/stitching-spark.git 

Build the package:

python build.py

This will generate a binary file target/stitching-spark-<version>-SNAPSHOT.jar. Upload this file to a Google Cloud Storage bucket.
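
The upload can also be done from the command line with gsutil from the Google Cloud SDK; a minimal sketch, where the bucket name is a placeholder:

# create a bucket if you do not have one yet (bucket name is a placeholder)
gsutil mb gs://<your-bucket>
# upload the built jar
gsutil cp target/stitching-spark-<version>-SNAPSHOT.jar gs://<your-bucket>/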

Creating a cluster

In the Dataproc console select Create cluster. Set the geographical region and configure the master and worker nodes.
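
The same can be done from the command line with the Cloud SDK; a minimal sketch, assuming gcloud is installed and authorized (the cluster name, region, machine types, and worker count are illustrative placeholders, not recommendations):

# create a Dataproc cluster; adjust machine types and worker count to your data
gcloud dataproc clusters create stitching-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-highmem-8 \
    --num-workers=4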

Submitting a job

In the Jobs console select Submit job (an equivalent gcloud command is sketched after this list):

  • Set Job type to Spark
  • In the Main class or jar field, provide the full class name of your application (including the package)
  • In the Jar files field, provide the link to your uploaded .jar file in the form gs://<bucket-name>/<jar-path>
  • Specify the command-line arguments for your application
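
Jobs can also be submitted with the Cloud SDK instead of the console; a sketch, where the cluster name and region match the placeholders above and everything after the -- separator is passed to the application as its arguments:

gcloud dataproc jobs submit spark \
    --cluster=stitching-cluster \
    --region=us-central1 \
    --class=<main-class> \
    --jars=gs://<bucket-name>/<jar-path> \
    -- <application-arguments>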

Running the stitching pipeline

1. Preparing input tile configuration files

The application requires an input file for each channel containing the registered tile configuration. It should be a JSON file formatted as follows:

[
  {
    "index" : 0,
    "file" : "FCF_CSMH__54383_20121206_35_C3_zb15_zt01_63X_0-0-0_R1_L086_20130108192758780.lsm.tif",
    "position" : [0.0, 0.0, 0.0],
    "size" : [991, 992, 880],
    "pixelResolution" : [0.097, 0.097, 0.18],
    "type" : "GRAY16"
  },
  {
    "index" : 1,
    "file" : "FCF_CSMH__54383_20121206_35_C3_zb15_zt01_63X_0-0-0_R1_L087_20130108192825183.lsm.tif",
    "position" : [716.932762003862, -694.0887500300357, -77.41783189603937],
    "size" : [991, 992, 953],
    "pixelResolution" : [0.097, 0.097, 0.18],
    "type" : "GRAY16"
  }
]

2. Uploading tiles

The tile images need to be uploaded to Google Cloud Storage to be accessible from the Google Cloud cluster.
Run the provided script that uploads the tiles and corresponding configuration files to your bucket:

python startup-scripts/cloud/upload-tiles-n5.py -i ch0.json -i ch1.json -o gs://target-bucket/

3. Flatfield estimation

Create a cluster if you have not already done so (following the instructions above).
Submit a job with the following parameters (a command-line equivalent is sketched after the list):

  • Main class: org.janelia.flatfield.FlatfieldCorrection
  • Jar file: gs://<your-bucket>/<path>/stitching-spark-<version>-SNAPSHOT.jar
  • Arguments:
-i
gs://<your-bucket>/ch0-converted-n5.json
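
Submitted from the command line, this step would look roughly as follows (cluster name and region are the placeholders from the cluster creation sketch):

gcloud dataproc jobs submit spark \
    --cluster=stitching-cluster \
    --region=us-central1 \
    --class=org.janelia.flatfield.FlatfieldCorrection \
    --jars=gs://<your-bucket>/<path>/stitching-spark-<version>-SNAPSHOT.jar \
    -- -i gs://<your-bucket>/ch0-converted-n5.json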

This will create a folder named ch0-converted-n5-flatfield/ in the same bucket. When the application finishes, this folder will contain two files, S.tif and T.tif (the brightfield and the offset, respectively). The next steps will detect the flatfield folder and automatically use the estimated flatfields for on-the-fly correction.

The full list of parameters for the flatfield script is available here.

4. Stitching

Submit a job with the following parameters (a command-line equivalent is sketched after the list):

  • Main class: org.janelia.stitching.StitchingSpark
  • Jar file: gs://<your-bucket>/<path>/stitching-spark-<version>-SNAPSHOT.jar
  • Arguments:
--stitch
-i
gs://<your-bucket>/ch0-converted-n5.json
-i
gs://<your-bucket>/ch1-converted-n5.json

This will run the stitching, performing iterations until the solution can no longer be improved. Multichannel data will be averaged on the fly before computing pairwise shifts, in order to obtain higher correlations from the denser signal.

As a result, it will create the files ch0-converted-n5-final.json and ch1-converted-n5-final.json next to the input tile configuration files. It will also store a file named optimizer.txt containing statistics on the average and maximum errors, the number of retained tiles and edges in the final graph, and the cross-correlation and variance threshold values used to obtain the final solution.

The current stitching method is iterative and translation-based (improving the solution by building a prediction model). A pipeline incorporating a higher-order model is currently under development in the split-tiles branch.

The full list of parameters for the stitch script is available here.

5. Export

Submit a job with the following parameters (a command-line equivalent is sketched after the list):

  • Main class: org.janelia.stitching.StitchingSpark
  • Jar file: gs://<your-bucket>/<path>/stitching-spark-<version>-SNAPSHOT.jar
  • Arguments:
--fuse
-i
gs://<your-bucket>/ch0-converted-n5-final.json
-i
gs://<your-bucket>/ch1-converted-n5-final.json
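
From the command line (same placeholders as before):

gcloud dataproc jobs submit spark \
    --cluster=stitching-cluster \
    --region=us-central1 \
    --class=org.janelia.stitching.StitchingSpark \
    --jars=gs://<your-bucket>/<path>/stitching-spark-<version>-SNAPSHOT.jar \
    -- --fuse -i gs://<your-bucket>/ch0-converted-n5-final.json -i gs://<your-bucket>/ch1-converted-n5-final.json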

This will generate an N5 export in the export.n5/ folder. The export is fully compatible with N5 Viewer for browsing.

The full list of parameters for the export script is available here.

6. Conversion to slice TIFF

Submit a job with the following parameters (a command-line equivalent is sketched after the list):

  • Main class: org.janelia.stitching.N5ToSliceTiffSpark
  • Jar file: gs://<your-bucket>/<path>/stitching-spark-<version>-SNAPSHOT.jar
  • Arguments:
-i
gs://<your-bucket>/export.n5
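
The command-line form of this last job (same placeholders as before):

gcloud dataproc jobs submit spark \
    --cluster=stitching-cluster \
    --region=us-central1 \
    --class=org.janelia.stitching.N5ToSliceTiffSpark \
    --jars=gs://<your-bucket>/<path>/stitching-spark-<version>-SNAPSHOT.jar \
    -- -i gs://<your-bucket>/export.n5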

This will output a set of XY slices as TIFF images for each channel of the N5 export.

The full list of parameters for this step is available here.