diff --git a/docs/conf.py b/docs/conf.py index 9950659b..11f16991 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -20,13 +20,13 @@ # -- Project information ----------------------------------------------------- project = u'odgi' -copyright = '2020-2023, *Guarracino A., *Heumos S., Nahnsen S., Prins P., Garrison E. Revision v0.8.2-1fa78aa' +copyright = '2020-2024, *Guarracino A., *Heumos S., Nahnsen S., Prins P., Garrison E. Revision v0.8.4-a19163ea' author = u'*Andrea Guarracino, *Simon Heumos, Sven Nahnsen, Pjotr Prins, Erik Garrison' # The short X.Y version -version = 'v0.8.2' +version = 'v0.8.4' # The full version, including alpha/beta/rc tags -release = '1fa78aa' +release = 'a19163ea' # -- General configuration --------------------------------------------------- diff --git a/docs/img/DRB1-3123_sorted.U1000.png b/docs/img/DRB1-3123_sorted.U1000.png new file mode 100644 index 00000000..73efc364 Binary files /dev/null and b/docs/img/DRB1-3123_sorted.U1000.png differ diff --git a/docs/img/DRB1-3123_sorted.j10000.png b/docs/img/DRB1-3123_sorted.j10000.png new file mode 100644 index 00000000..d3b2e39e Binary files /dev/null and b/docs/img/DRB1-3123_sorted.j10000.png differ diff --git a/docs/img/DRB1-3123_sorted.x2.png b/docs/img/DRB1-3123_sorted.x2.png new file mode 100644 index 00000000..d6be3879 Binary files /dev/null and b/docs/img/DRB1-3123_sorted.x2.png differ diff --git a/docs/img/DRB1-3123_sorting_layouting.png b/docs/img/DRB1-3123_sorting_layouting.png index f001bd45..45c2192c 100644 Binary files a/docs/img/DRB1-3123_sorting_layouting.png and b/docs/img/DRB1-3123_sorting_layouting.png differ diff --git a/docs/index.rst b/docs/index.rst index 4615dbba..1e7adf14 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -79,7 +79,7 @@ Core Functionalities :target: rst/tutorials/extract_selected_loci.html .. |sorting_layouting| image:: img/DRB1-3123_sorting_layouting.png - :target: rst/tutorials/sorting_layouting.html + :target: rst/tutorials/sort_layout.html .. 
|navigating_and_annotating_graphs| image:: img/nav_welcome.png :target: rst/tutorials/navigating_and_annotating_graphs.html diff --git a/docs/rst/commands/odgi_sort.rst b/docs/rst/commands/odgi_sort.rst index 274664a9..5f6c8a17 100644 --- a/docs/rst/commands/odgi_sort.rst +++ b/docs/rst/commands/odgi_sort.rst @@ -49,9 +49,8 @@ order: force-directed graph drawing algorithm minimizes the graph’s energy function or stress level. It applies stochastic gradient descent (SGD) to move a single pair of nodes at a time. The path index is - used to pick the terms to move stochastically. If ran with 1 thread - only, the resulting order of the graph is deterministic. The seed is - adjustable. + used to pick the terms to move stochastically. For more details about + the algorithm, please take a look at https://www.biorxiv.org/content/10.1101/2023.09.22.558964v2. Sorting the paths in a graph my refine the sorting process. For the users’ convenience, it is possible to specify a whole pipeline of sorts diff --git a/docs/rst/multiqc.rst b/docs/rst/multiqc.rst index 4ac7e04e..ed17cdc1 100644 --- a/docs/rst/multiqc.rst +++ b/docs/rst/multiqc.rst @@ -38,7 +38,7 @@ To see the full statistics in YAML format of the graph, execute: .. code-block:: bash - odgi stats -i DRB1-3123.gfa.og -m + odgi stats -i DRB1-3123.gfa.og -m -sgdl This prints the following YAML to stdout: @@ -89,7 +89,7 @@ Let's save the statistics this time: .. code-block:: bash - odgi stats -i DRB1-3123.gfa.og -m > DRB1-3123.gfa.og.stats.yaml + odgi stats -i DRB1-3123.gfa.og -m -sgdl > DRB1-3123.gfa.og.stats.yaml .. note:: @@ -167,7 +167,7 @@ Assuming, we have several graphs, of which we want to compare the statistics fro .. 
code-bock:: bash odgi build -g LPA.gfa -o LPA.gfa.og - odgi stats -i LPA.gfa.og -y > LPA.gfa.og.stats.yaml + odgi stats -i LPA.gfa.og -m -sgdl > LPA.gfa.og.stats.yaml odgi viz -i LPA.gfa.og -o LPA.gfa.og.viz_mqc.png odgi layout -i LPA.gfa.og -o LPA.gfa.og.lay odgi draw -i LPA.gfa.og -c LPA.gfa.og.lay -p LPA.gfa.og.lay.draw_mqc.png -w 10 -C diff --git a/docs/rst/quick_start.rst b/docs/rst/quick_start.rst index c61a171c..c399191b 100644 --- a/docs/rst/quick_start.rst +++ b/docs/rst/quick_start.rst @@ -18,12 +18,12 @@ version 1 (`GFAv1 `_) Build graph from GFA ---------------------------- -Assuming that your current working directory is the root of the ``odgi`` project, to construct an ``odgi`` file from a -``GFA`` file, execute: +To construct an ``odgi`` file from a ``GFA`` file, execute: .. code-block:: bash - odgi build -g test/DRB1-3123.gfa -o DRB1-3123.og + wget https://raw.githubusercontent.com/pangenome/odgi/master/test/DRB1-3123.gfa + odgi build -g DRB1-3123.gfa -o DRB1-3123.og The command creates a file called ``DRB1-3123.og``, which contains the input graph in ``odgi`` format. diff --git a/docs/rst/tutorials/exploratory_analysis.rst b/docs/rst/tutorials/exploratory_analysis.rst index 7db08d80..35cd38b1 100644 --- a/docs/rst/tutorials/exploratory_analysis.rst +++ b/docs/rst/tutorials/exploratory_analysis.rst @@ -76,7 +76,7 @@ Color with respect to the node position This is a linearized visualization, but the pangenome graphs are not linear when the embedded genomes present structural variation. However, a graph can be optimized for being better visualized in 1-Dimension by sorting its nodes properly -(see the :ref:`sorting-layouting` tutorial for more information). +(see the :ref:`sort-layout` tutorial for more information). 
To color the bars with respect to the node position in each path, execute: diff --git a/docs/rst/tutorials/sort_layout.rst b/docs/rst/tutorials/sort_layout.rst index 8cbf5dd6..517a3fd4 100644 --- a/docs/rst/tutorials/sort_layout.rst +++ b/docs/rst/tutorials/sort_layout.rst @@ -1,4 +1,4 @@ -.. _sorting-layouting: +.. _sort-layout: ############### Sort and Layout @@ -16,6 +16,8 @@ a 1D and 2D layout to simplify these complex regions. This tutorial shows how to sort and visualize a graph in 1D. It explains how to generate a 2D layout of a graph, and how to take a look at the calculated layout using static and interactive tools. +For more details about the applied algorithm, please take a look at https://www.biorxiv.org/content/10.1101/2023.09.22.558964v2. + .. Pangenome graphs embed linear pangenomic sequences as paths in .. the graph, but to our knowledge, no algorithm takes into account this biological information in the sorting. Moreover, .. existing 2D layout methods struggle to deal with large graphs. ``odgi`` implements a new layout algorithm to simplify a pangenome @@ -39,12 +41,12 @@ to take a look at the calculated layout using static and interactive tools. Build the unsorted DRB1-3123 graph ---------------------------------- -Assuming that your current working directory is the root of the ``odgi`` project, to construct an ``odgi`` graph from the -``DRB1-3123`` dataset in ``GFA`` format, execute: +To construct an ``odgi`` graph from the ``DRB1-3123`` dataset in ``GFA`` format, execute: .. code-block:: bash - odgi build -g test/DRB1-3123_unsorted.gfa -o DRB1-3123_unsorted.og + wget https://raw.githubusercontent.com/pangenome/odgi/master/test/DRB1-3123_unsorted.gfa + odgi build -g DRB1-3123_unsorted.gfa -o DRB1-3123_unsorted.og The command creates a file called ``DRB1-3123_unsorted.og``, which contains the input graph in ``odgi`` format. This graph contains 12 ALT sequences of the `HLA-DRB1 gene `_ from the GRCh38 reference genome. @@ -129,6 +131,22 @@ nodes. 
.. note:: The PG-SGD is not deterministic, because of its `Hogwild! `_ approach. + For more details about the applied algorithm, please take a look at https://www.biorxiv.org/content/10.1101/2023.09.22.558964v2. + +.. note:: + The 1D PG-SGD implementation comes with a large number of tunable parameters. Based on our experience applying it to hundreds of graphs, the current + defaults usually work well for most graphs. However, if you feel the sorting did not work well enough, there are two key parameters one can tune: + + | **-G, --path-sgd-min-term-updates-paths**\ =\ *N*: The minimum number of terms to be + updated before a new path-guided + linear 1D SGD iteration with adjusted + learning rate eta starts, expressed as + a multiple of total path steps (default: 1.0). + | **-x, --path-sgd-iter-max**\ =\ *N*: The maximum number of iterations for path-guided linear 1D SGD model (default: 100). + + Increasing both can lead to a better sorted graph. For example, one can start optimizing by setting **-x, --path-sgd-iter-max**\ =\ *200*. + For more parameter details, please take + a look at :ref:`odgi sort`. .. To reproduce the visualization below, the sorted graph can be found under ``test/DRB1-3123_sorted.og``. @@ -169,6 +187,73 @@ This prints to stdout: Compared to before, these metrics show that the goodness of the sorting of the graph improved significantly. +-------------------------------------------- +Playing around with the 1D PG-SGD parameters +-------------------------------------------- + +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +What happens if the maximum number of iterations is very low? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. code-block:: bash + + odgi sort -i DRB1-3123_unsorted.og --threads 2 -P -Y -x 2 -o DRB1-3123_sorted.x2.og + odgi viz -i DRB1-3123_sorted.x2.og -o DRB1-3123_sorted.x2.png + +.. image:: /img/DRB1-3123_sorted.x2.png + +The graph appears very complex and not quite human readable.
That's because in total the number of node position updates was only two times the number +of total path steps, instead of one hundred times the number of total path steps, which is the current default. +For very complex graphs, one may have to increase this number even further. + +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +What happens if the minimum number of term updates is very high? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. code-block:: bash + + odgi sort -i DRB1-3123_unsorted.og --threads 2 -P -Y -U 1000 -o DRB1-3123_sorted.U1000.og + odgi viz -i DRB1-3123_sorted.U1000.og -o DRB1-3123_sorted.U1000.png + +.. image:: /img/DRB1-3123_sorted.U1000.png + +The graph lost its complexity and is now linear. Compared to the 1D visualization using the default parameters, it is hard +to spot any differences. So let's take a look at the metrics: + +.. code-block:: bash + + odgi stats -i DRB1-3123_sorted.U1000.og -s -d -l -g + +This prints to stdout: + +.. code-block:: bash + + #mean_links_length + path in_node_space in_nucleotide_space num_links_considered num_gap_links_not_penalized + all_paths 1.00361 8.30677 21870 15195 + #sum_of_path_node_distances + path in_node_space in_nucleotide_space nodes nucleotides num_penalties num_penalties_different_orientation + all_paths 3.23238 3.73489 21882 163416 3750 1 + +We were actually able to improve the metrics compared to using default parameters. However, the runtime increased from under 1 second to ~30 seconds. +So one needs to be careful with such a parameter. Relative to the gains in linearity, this additional runtime would be a huge +waste with very large graphs. + +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +What happens if the threshold of the maximum distance of two nodes is very high? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. 
code-block:: bash + + odgi sort -i DRB1-3123_unsorted.og --threads 2 -P -Y -j 10000 -o DRB1-3123_sorted.j10000.og + odgi viz -i DRB1-3123_sorted.j10000.og -o DRB1-3123_sorted.j10000.png + +.. image:: /img/DRB1-3123_sorted.j10000.png + +The graph appears very complex and not quite human readable. That's because the iterations are terminated as soon as the +expected distance of two nodes, the nucleotide distance given by two randomly chosen path steps, is as close as 10000. +Naturally, this happens very soon. + ========================================================= 1D reference-guided grooming and reference-guided sorting ========================================================= @@ -267,6 +352,8 @@ We can clearly observe, that the path positions of the two reference now define 2D layout ========= +The 2D PG-SGD layout algorithm is described in https://www.biorxiv.org/content/10.1101/2023.09.22.558964v2. + ----------------------------------------- 2D layout of the unsorted DRB1-3123 graph ----------------------------------------- @@ -277,6 +364,23 @@ We want to have a 2D layout of our DRB1-3123 graph: odgi layout -i DRB1-3123_unsorted.og -o DRB1-3123_unsorted.og.lay -P --threads 2 +.. note:: + The 2D PG-SGD implementation comes with a large number of tunable parameters. Based on our experience applying it to hundreds of graphs, the current + defaults usually work well for most graphs. However, if you feel the resulting 2D layout is not of a good enough quality, there are two key parameters one can tune: + + | **-G, --path-sgd-min-term-updates-paths**\ =\ *N*: Minimum number of terms N to be + updated before a new path-guided 2D + SGD iteration with adjusted learning + rate eta starts, expressed as a + multiple of total path length + (default: 10). + | **-x, --path-sgd-iter-max**\ =\ *N*: The maximum number of iterations N for + the path-guided 2D SGD model (default: + 30). + + Increasing both can lead to a better graph layout. 
For example, one can start optimizing by setting **-x, --path-sgd-iter-max**\ =\ *100*. + For more parameter details, please take a look at :ref:`odgi layout`. + -------------------------------------------- Drawing the 2D layout of the DRB1-3123 graph --------------------------------------------