Using IPython notebooks under version control
What is a good strategy for keeping IPython notebooks under version control?
The notebook format is quite amenable to version control: if one wants to version control both the notebook and its outputs, this works quite well. The annoyance comes when one wants to version control only the input, excluding the cell outputs (a.k.a. "build products"), which can be large binary blobs, especially for movies and plots. In particular, I am trying to find a good workflow that:
As mentioned, if I choose to include the outputs (which is desirable when using nbviewer, for example), then everything is fine. The problem arises when I do not want to version control the output. There are some tools and scripts for stripping the output of the notebook, but I frequently encounter the following issues:
Cell/All Output/Clear menu option, thereby creating unwanted noise in the diffs. This is resolved by some of the answers.
I have considered several options that I shall discuss below, but have yet to find a good comprehensive solution. A full solution might require some changes to IPython, or may rely on some simple external scripts. I currently use mercurial, but would like a solution that also works with git: an ideal solution would be version-control agnostic.
This issue has been discussed many times, but there is no definitive or clear solution from the user's perspective. The answer to this question should provide the definitive strategy. It is fine if it requires a recent (even development) version of IPython or an easily installed extension.
Update: I have been playing with my modified notebook version which optionally saves a .clean version with every save using Gregory Crosswhite's suggestions. This satisfies most of my constraints but leaves the following unresolved:
.clean file, and then need to be integrated somehow into my working version. (Of course, I can always re-execute the notebook, but this can be a pain, especially if some of the results depend on long calculations, parallel computations, etc.) I do not have a good idea about how to resolve this yet. Perhaps a workflow involving an extension like ipycache might work, but that seems a little too complicated.
Notes
Removing (stripping) Output
Cell/All Output/Clear menu option for removing the output.
Newsgroups
Issues
Pull Requests
Here is my solution with git. It allows you to just add and commit (and diff) as usual: those operations will not alter your working tree, and at the same time (re)running a notebook will not alter your git history.
Although this can probably be adapted to other VCSs, I know it doesn't satisfy your requirements (at least not the VCS agnosticism). Still, it is perfect for me, and although it's nothing particularly brilliant, and many people probably already use it, I didn't find clear instructions on how to implement it by googling around. So it may be useful to other people.
Save the filter script somewhere (for the following, let us assume ~/bin/ipynb_output_filter.py)
Make it executable (chmod +x ~/bin/ipynb_output_filter.py)
Create the file ~/.gitattributes, with the following content
*.ipynb filter=dropoutput_ipynb
Run the following commands:
git config --global core.attributesfile ~/.gitattributes
git config --global filter.dropoutput_ipynb.clean ~/bin/ipynb_output_filter.py
git config --global filter.dropoutput_ipynb.smudge cat
Done!
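The filter script itself is not reproduced above. As a rough, dependency-free sketch of what such a clean filter might look like (this is not the author's actual script; the names strip_outputs and run_filter are hypothetical, and the notebook is assumed to be JSON in nbformat 3 or 4):

```python
import json


def strip_outputs(nb):
    """Remove outputs and execution counters from a notebook dict, in place."""
    # nbformat 4 keeps cells at the top level; nbformat 3 nests them in worksheets.
    cell_lists = [nb.get("cells", [])]
    cell_lists += [ws.get("cells", []) for ws in nb.get("worksheets", [])]
    for cells in cell_lists:
        for cell in cells:
            if cell.get("cell_type") != "code":
                continue  # markdown and raw cells carry no output
            cell["outputs"] = []
            if "execution_count" in cell:    # nbformat 4 counter
                cell["execution_count"] = None
            cell.pop("prompt_number", None)  # nbformat 3 counter
    return nb


def run_filter(src, dst):
    """Read notebook JSON from src and write the stripped notebook to dst."""
    nb = json.load(src)
    # sort_keys keeps the key order stable, so diffs only show real changes.
    json.dump(strip_outputs(nb), dst, indent=1, sort_keys=True)
    dst.write("\n")
```

Used as the clean filter configured above, the script would finish by calling run_filter(sys.stdin, sys.stdout), since git passes the file content through the filter's standard input and output.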
Limitations:
If you are on branch somebranch and you do git checkout otherbranch; git checkout somebranch, you usually expect the working tree to be unchanged. Here instead you will have lost the output and cell numbering of notebooks whose source differs between the two branches.
git commit notebook_file.ipynb, although it would at least keep git diff notebook_file.ipynb free from base64 garbage).
My solution reflects the fact that I personally don't like to keep generated stuff versioned. Note that merges involving the output are almost guaranteed to invalidate the output, your productivity, or both.
EDIT:
If you do adopt the solution as I suggested it, that is, globally, you will have trouble if you want to version output in some particular git repo. So if you want to disable the output filtering for a specific git repository, simply create inside it a file .git/info/attributes, with
**.ipynb filter=
as content. Clearly, in the same way it is possible to do the opposite: enable the filtering only for a specific repository.
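For those who prefer to script this per-repository switch, here is a small sketch. The helper name set_notebook_filter is hypothetical; it only writes the attribute line described above and assumes repo_dir contains an ordinary .git directory:

```python
from pathlib import Path


def set_notebook_filter(repo_dir, filter_name=""):
    """Set the per-repository *.ipynb filter rule in .git/info/attributes.

    An empty filter_name writes the '**.ipynb filter=' line from the text
    (no output filtering in this repository); a name such as
    'dropoutput_ipynb' selects that filter instead.
    """
    git_dir = Path(repo_dir) / ".git"
    if not git_dir.is_dir():
        raise FileNotFoundError(f"{repo_dir} does not look like a git repository")
    info_dir = git_dir / "info"
    info_dir.mkdir(exist_ok=True)  # .git/info may be missing in some repositories
    attributes = info_dir / "attributes"

    prefix = "**.ipynb filter="
    lines = attributes.read_text().splitlines() if attributes.exists() else []
    # Drop any earlier rule for the same pattern, keep unrelated rules.
    lines = [line for line in lines if not line.startswith(prefix)]
    lines.append(prefix + filter_name)
    attributes.write_text("\n".join(lines) + "\n")
    return attributes
```

Calling set_notebook_filter(".") would disable the filter for the current repository, and set_notebook_filter(".", "dropoutput_ipynb") would enable it there.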
the code is now maintained in its own git repo
if the instructions above result in ImportErrors, try adding "ipython" before the path of the script:
git config --global filter.dropoutput_ipynb.clean ipython ~/bin/ipynb_output_filter.py
EDIT : May 2016 (updated February 2017): there are several alternatives to my script - for completeness, here is a list of those I know: nbstripout (other variants), nbstrip, jq.
We have a collaborative project where the product is Jupyter Notebooks, and for the last six months we've used an approach that is working great: we activate saving the .py files automatically and track both the .ipynb files and the .py files.
That way, if someone wants to view or download the latest notebook they can do that via GitHub or nbviewer, and if someone wants to see how the notebook code has changed, they can just look at the changes to the .py files.
For Jupyter notebook servers, this can be accomplished by adding the lines
import os
from subprocess import check_call

def post_save(model, os_path, contents_manager):
    """post-save hook for converting notebooks to .py scripts"""
    if model['type'] != 'notebook':
        return  # only do this for notebooks
    d, fname = os.path.split(os_path)
    check_call(['jupyter', 'nbconvert', '--to', 'script', fname], cwd=d)

c.FileContentsManager.post_save_hook = post_save
to the jupyter_notebook_config.py file and restarting the notebook server.
If you aren't sure in which directory to find your jupyter_notebook_config.py file, you can type jupyter --config-dir, and if you don't find the file there, you can create it by typing jupyter notebook --generate-config.
For IPython 3 notebook servers, this can be accomplished by adding the lines
import os
from subprocess import check_call

def post_save(model, os_path, contents_manager):
    """post-save hook for converting notebooks to .py scripts"""
    if model['type'] != 'notebook':
        return  # only do this for notebooks
    d, fname = os.path.split(os_path)
    check_call(['ipython', 'nbconvert', '--to', 'script', fname], cwd=d)

c.FileContentsManager.post_save_hook = post_save
to the ipython_notebook_config.py file and restarting the notebook server. These lines are from a GitHub issue answer that @minrk provided, and @dror includes them in his SO answer as well.
For IPython 2 notebook servers, this can be accomplished by starting the server using:
ipython notebook --script
or by adding the line
c.FileNotebookManager.save_script = True
to the ipython_notebook_config.py file and restarting the notebook server.
If you aren't sure in which directory to find your ipython_notebook_config.py file, you can type ipython locate profile default, and if you don't find the file there, you can create it by typing ipython profile create.
Here's our project on github that is using this approach: and here's a github example of exploring recent changes to a notebook.
We've been very happy with this.
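To illustrate why tracking the .py files gives readable diffs, here is a small dependency-free sketch. It is not how nbconvert works internally; code_lines and script_diff are hypothetical helpers, and the notebooks are assumed to be nbformat 4 JSON. It keeps only the code cell sources, so changes in outputs never show up in the diff:

```python
import difflib
import json


def code_lines(nb_json):
    """Return the source lines of all code cells in an nbformat 4 notebook."""
    nb = json.loads(nb_json)
    lines = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        lines.extend(source.splitlines())
        lines.append("")  # blank line between cells
    return lines


def script_diff(old_nb_json, new_nb_json):
    """Unified diff of the code in two notebook versions, ignoring outputs."""
    return list(difflib.unified_diff(
        code_lines(old_nb_json), code_lines(new_nb_json),
        "old.py", "new.py", lineterm=""))
```

Printing "\n".join(script_diff(old, new)) for two saved versions of a notebook shows only the changed code lines, which is roughly what reviewing the tracked .py files amounts to.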
I have created nbstripout, based on MinRK's gist, which supports both Git and Mercurial (thanks to mforbes). It is intended to be used either standalone on the command line or as a filter, which is easily (un)installed in the current repository via nbstripout install / nbstripout uninstall.
Get it from PyPI or simply
pip install nbstripout