♻️ move handling of binary files with git to advanced/ and merge jupyter tools into one file
krother committed Oct 6, 2023
1 parent 0bb606f commit 93fdfe7
Showing 6 changed files with 240 additions and 240 deletions.
120 changes: 120 additions & 0 deletions docs/productive/git/advanced/binary-files.rst
@@ -0,0 +1,120 @@

Git for binary files
====================

``git diff`` can be configured so that it can also display meaningful diffs for
binary files.

… for Excel files
-----------------

For this we need `openpyxl <https://openpyxl.readthedocs.io/en/stable/>`_
and `pandas <https://pandas.pydata.org>`_:

.. code-block:: console

   $ pipenv install openpyxl pandas

Then we can use :doc:`pandas:reference/api/pandas.DataFrame.to_csv` in
:file:`exceltocsv.py` to convert the Excel files:

.. literalinclude:: exceltocsv.py
   :caption: exceltocsv.py
   :name: exceltocsv.py
   :language: python
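
The included :file:`exceltocsv.py` is not reproduced here. The following is only
a minimal sketch of what such a converter might look like. It assumes that Git
calls the ``textconv`` program with the path of the file as its single argument
and diffs whatever is written to standard output. The function name
``excel_to_csv`` and the ``# sheet:`` header lines are illustrative choices and
not taken from the original script:

```python
"""Hypothetical sketch of exceltocsv.py: print every sheet of a workbook as CSV."""
import sys

import pandas as pd


def excel_to_csv(path, out=sys.stdout):
    """Write all sheets of the workbook at ``path`` to ``out`` as CSV."""
    # sheet_name=None makes read_excel return a dict {sheet name: DataFrame}
    sheets = pd.read_excel(path, sheet_name=None, engine="openpyxl")
    for name, frame in sheets.items():
        # Illustrative header line so that multi-sheet diffs stay readable
        out.write(f"# sheet: {name}\n")
        frame.to_csv(out, index=False)


# Git passes the file to convert as the only command line argument
if __name__ == "__main__" and len(sys.argv) == 2:
    excel_to_csv(sys.argv[1])
```

Writing one header line per sheet is a design choice of this sketch: it keeps
changes in different sheets apart in the resulting text diff.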

Now add the following section to your global Git configuration
:file:`~/.gitconfig`:

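.. code-block:: ini

   [diff "excel"]
   textconv=python3 /PATH/TO/exceltocsv.py
   binary=true

Finally, in the global Git attributes file, our ``excel`` converter is linked to
:file:`*.xlsx` files. Git only reads :file:`~/.gitattributes` if
``core.attributesFile`` points to it; by default it reads
:file:`$XDG_CONFIG_HOME/git/attributes`, usually
:file:`~/.config/git/attributes`: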

.. code-block:: ini

   *.xlsx diff=excel

… for PDF files
---------------

For this, ``pdftohtml`` is additionally required. It can be installed with

.. tab:: Debian/Ubuntu

   .. code-block:: console

      $ sudo apt install poppler-utils

.. tab:: macOS

   .. code-block:: console

      $ brew install pdftohtml

Add the following section to the global Git configuration :file:`~/.gitconfig`:

.. code-block:: ini

   [diff "pdf"]
   textconv=pdftohtml -stdout

Finally, in the global :file:`~/.gitattributes` file, our ``pdf`` converter is
linked to :file:`*.pdf` files:

.. code-block:: ini

   *.pdf diff=pdf

Now, when ``git diff`` is called, the PDF files are first converted and then a
diff is performed over the outputs of the converter.

… for Word documents
--------------------

Differences in Word documents can also be displayed. For this purpose `Pandoc
<https://pandoc.org/>`_ can be used, which can be easily installed with

.. tab:: Debian/Ubuntu

   .. code-block:: console

      $ sudo apt install pandoc

.. tab:: macOS

   .. code-block:: console

      $ brew install pandoc

.. tab:: Windows

   Download and install the :file:`*.msi` file from `GitHub
   <https://github.com/jgm/pandoc/releases/>`_.

Then add the following section to your global Git configuration
:file:`~/.gitconfig`:

.. code-block:: ini

   [diff "word"]
   textconv=pandoc --to=markdown
   binary=true
   prompt=false

Finally, in the global :file:`~/.gitattributes` file, our ``word`` converter is
linked to :file:`*.docx` files:

.. code-block:: ini

   *.docx diff=word

The same procedure can be used to obtain useful diffs from other binaries, for
example from ``*.zip``, ``*.jar`` and other archives with ``unzip``, or for
changes in the meta information of images with ``exiv2``. There are also
conversion tools for converting ``*.odt``, ``*.doc`` and other document formats
into plain text. For binary files for which there is no converter, ``strings``
is often sufficient.
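
As an illustration of the archive case, the following sketch is a stand-in for
``unzip`` written with Python's standard ``zipfile`` module; it is not part of
the original setup. The function name ``describe_archive`` and the tab-separated
output format are assumptions made for this example:

```python
"""Hypothetical textconv helper: list the members of a ZIP or JAR archive."""
import sys
import zipfile


def describe_archive(path):
    """Return one line per archive member: name, size in bytes and CRC-32."""
    with zipfile.ZipFile(path) as archive:
        # Sort by name so that a changed storage order alone produces no diff
        members = sorted(archive.infolist(), key=lambda info: info.filename)
        return [f"{m.filename}\t{m.file_size}\t{m.CRC:08x}" for m in members]


# Git passes the file to convert as the only command line argument
if __name__ == "__main__" and len(sys.argv) == 2:
    print("\n".join(describe_archive(sys.argv[1])))
```

Such a script could be registered like the converters above, under its own
``[diff "…"]`` section and a matching attributes entry for ``*.zip`` files. The
diff then shows which members were added, removed or changed, but not what
changed inside them.
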
File renamed without changes.
4 changes: 2 additions & 2 deletions docs/productive/git/advanced/index.rst
@@ -11,8 +11,8 @@ Advanced Git
cherry-pick
bisect
hooks/index
-tools
-jupyter-config
+jupyter-notebooks
+binary-files
vs-code/index
gitlab/index
git-big-picture
@@ -2,8 +2,84 @@
..
.. SPDX-License-Identifier: BSD-3-Clause
-Configuring Git for Jupyter Notebooks
-=====================================
+Jupyter Notebooks with Git
+==========================

Problems with version control of Jupyter Notebooks
--------------------------------------------------

Managing Jupyter Notebooks with Git raises several issues:

* The cell metadata of Jupyter Notebooks changes even when the content of the
  cells has not changed. This makes Git diffs unnecessarily complicated.
* The lines that Git writes to the ``*.ipynb`` files in case of :ref:`merge
  conflicts <merge-conflicts>` mean that the notebooks are no longer valid JSON
  and therefore cannot be opened by Jupyter: you will then get the *Error
  loading notebook* message when opening them.

  Conflicts are especially common in notebooks because Jupyter changes the
  following each time a notebook is run:

  * Each cell contains a number that indicates the order in which it was
    executed. If team members execute the cells in a different order, every
    single cell has a conflict! Fixing this manually would take a very long
    time.
  * For each image, such as a plot, Jupyter records not only the image itself in
    the notebook, but also a simple text description containing the ID of the
    object, for example :samp:`{<matplotlib.axes._subplots.AxesSubplot at
    0x7fbc113dbe90>}`. This changes every time you run a notebook, and
    therefore conflicts every time two people run that cell.
  * Some output can be non-deterministic, such as a notebook that uses random
    numbers or interacts with a service that provides different output over
    time.
  * Jupyter adds metadata to the notebook that describes the environment in
    which it was last run, such as the name of the kernel. This often varies
    between different installations, and so two people saving a notebook (even
    without other changes) will often have a conflict in the metadata.
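
The second issue above can be reproduced with a minimal sketch that uses only
Python's standard ``json`` module. The notebook fragment in ``CONFLICTED`` and
the function name ``is_valid_notebook_json`` are made up for this illustration:

```python
import json

# A tiny, made-up notebook file into which Git has written conflict markers
CONFLICTED = """{
 "cells": [],
<<<<<<< HEAD
 "nbformat_minor": 5,
=======
 "nbformat_minor": 4,
>>>>>>> feature
 "nbformat": 4,
 "metadata": {}
}"""


def is_valid_notebook_json(text):
    """Return True if the text parses as JSON, which Jupyter requires."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

For the fragment above, ``is_valid_notebook_json(CONFLICTED)`` is false, because
the marker lines are not JSON. Once the markers and one of the two variants have
been removed, the text parses again.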

``nbdev2``
----------

`nbdev2 <https://nbdev.fast.ai>`_ has a set of Git hooks that provide clean Git
diffs, automatically resolve most Git conflicts and ensure that any remaining
conflicts can be fully resolved within the standard Jupyter Notebook
environment:

* A new ``git merge`` driver provides notebook-native conflict markers that
  result in notebooks opening directly in Jupyter, even if there are Git
  conflicts. Local and remote changes are each shown as separate cells in the
  notebook, so you can simply delete the version you don’t want to keep or
  combine the two cells as needed.

  .. seealso::
     `nbdev.merge docs <https://nbdev.fast.ai/api/merge.html>`_

* Resolving Git merges locally is extremely helpful, but we also need to resolve
  them remotely. For example, if a :doc:`merge request <gitlab/merge-requests>`
  is submitted and then someone else commits to the same notebook before the
  merge request is merged, it could cause a conflict:

  .. code-block:: javascript

     "outputs": [
      {
     <<<<<<< HEAD
       "execution_count": 8,
     =======
       "execution_count": 5,
     >>>>>>> 83e94d58314ea43ccd136e6d53b8989ccf9aab1b
       "metadata": {},

  The *save hook* of nbdev2 automatically removes all unnecessary metadata
  (including :samp:`execution_count`) and non-deterministic cell output; this
  means that there are no pointless conflicts like the one above, since this
  information is not stored in the commits in the first place.

To get started, follow the instructions in `Git-Friendly Jupyter
<https://nbdev.fast.ai/tutorials/git_friendly_jupyter.html>`_.

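
The idea of not storing volatile information in the first place can be sketched
in a few lines of Python. This is not the nbdev2 implementation: the function
name ``strip_notebook`` is hypothetical, and the sketch is deliberately cruder
than the save hook, since it drops all outputs and all metadata, including the
kernel name, instead of only the unnecessary parts:

```python
import copy
import json


def strip_notebook(notebook):
    """Return a copy of a notebook dict without outputs, counts and metadata."""
    cleaned = copy.deepcopy(notebook)
    # Notebook-level metadata describes the environment of the last run
    cleaned["metadata"] = {}
    for cell in cleaned.get("cells", []):
        cell["metadata"] = {}
        if cell.get("cell_type") == "code":
            # Outputs and execution counts change with every run
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def strip_file(path):
    """Rewrite the notebook file at ``path`` in place in stripped form."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(strip_notebook(notebook), handle, indent=1)
        handle.write("\n")
```

Two copies of a notebook that differ only in ``execution_count``, outputs or
kernel name become identical after stripping, so they can no longer conflict on
these fields.
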
``jq``
------
The results of the calculations can also be saved in the notebook file format
:ref:`nbformat <whats-an-ipynb-file>`. These can also be Base-64-coded blobs
@@ -20,7 +96,7 @@ JSON processor. It takes some time to set up ``jq`` because it has its own
query/filter language, but the default settings are usually well chosen.
Installation
-------------
+~~~~~~~~~~~~
``jq`` can be installed with:
@@ -37,7 +113,7 @@ Installation
$ brew install jq
Example
--------
+~~~~~~~
A typical call is:
@@ -59,7 +135,7 @@ information. If you want to keep certain meta information, you can indicate this
here.
Set up
-------
+~~~~~~
#. To make your work easier, you can create an alias in the ``~/.bashrc`` file:
@@ -141,3 +217,40 @@ Set up
done
unset nbfile
}

ReviewNB
--------

`ReviewNB <https://www.reviewnb.com>`_ solves the problem of doing
:doc:`gitlab/merge-requests` with notebooks. GitLab’s code review GUI only works
with line-based file formats, such as Python scripts. Most of the time, however,
I prefer to review the source notebooks because:

* I want to check the documentation and the tests, not just the implementation
* I want to see the changes to the cell output, like charts and tables, not just
  the code.

ReviewNB is well suited for this purpose.

``nbdime``
----------

`nbdime <https://nbdime.readthedocs.io/>`_ is a GUI for `nbformat
<https://nbformat.readthedocs.io/>`_ diffs and replaces `nbdiff
<https://github.com/tarmstrong/nbdiff>`_. It provides content-aware diffing and
merging of notebooks locally. It is not limited to displaying diffs; it also
prevents unnecessary changes from being checked in. However, it is not
compatible with ``nbdev2``.

.. _nbstripout_label:

``nbstripout``
--------------

`nbstripout <https://github.com/kynan/nbstripout>`_ automates *Clear all
outputs*. It uses `nbformat <https://nbformat.readthedocs.io/>`_ and some
automagic to set up ``.git/config``. In my opinion, however, it has two
drawbacks:

* it is limited to the problematic metadata section
* it is slow.