Reviewing Zeppelin and Jupyter notebooks

Posted on March 16, 2019 in datascience • 2 min read

Recently, I started a new project using Scala/Spark. The project runs on AWS EMR, and data science investigations are performed in Zeppelin notebooks hosted on S3. We review all our data science deliverables, and it quickly became clear that reviewing notebooks is not as easy as reviewing regular code. To give an example, this is what a Zeppelin notebook looks like in a BitBucket pull request:

img1

The same applies to Jupyter notebooks since their raw format is also JSON.
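Since both formats are plain JSON, a few lines of Python are enough to tell them apart. The sketch below is only an illustration, not code taken from any tool: it assumes that Jupyter notebooks carry top-level "cells" and "nbformat" keys and that Zeppelin notes carry a top-level "paragraphs" key, and the function name is made up for this example.

```python
import json


def detect_notebook_format(raw: str) -> str:
    """Guess the notebook flavour from its top-level JSON keys (assumed layout)."""
    doc = json.loads(raw)
    if "cells" in doc and "nbformat" in doc:
        return "jupyter"
    if "paragraphs" in doc:
        return "zeppelin"
    return "unknown"


# Two tiny hand-written documents standing in for real notebooks.
jupyter_raw = json.dumps({"nbformat": 4, "nbformat_minor": 2, "metadata": {}, "cells": []})
zeppelin_raw = json.dumps({"name": "demo", "paragraphs": []})

print(detect_notebook_format(jupyter_raw))
print(detect_notebook_format(zeppelin_raw))
```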

Further, since notebooks are primarily code, it would be nice to handle them like any other source code, e.g., “all source code and notebooks are reviewed with pull requests and versioned on BitBucket”. In recent years, I have seen notebooks attached as comments to JIRA tickets, sent around by email, or even worse, lost somewhere in chaotic S3 buckets. This is definitely something you don’t want to experience, trust me!

So, how can we handle notebooks as close as possible to regular source code?

The solution: nb2md

Markdown documents are rendered nicely by web-based version control tools, and they are also easy to review since they are plain, readable text, similar to LaTeX (but waaay simpler). I built a small command-line tool in Python 3 that does exactly this: it converts both Zeppelin and Jupyter notebooks to readable, reviewer-friendly Markdown documents. An example of a Zeppelin notebook converted to Markdown and being reviewed on BitBucket looks like this:

img2

As you can see, cell code and outputs are rendered. Visual outputs are not rendered, but the reviewer can still comment on the corresponding code cells while rendering the notebook in another tab. After the review, the notebook remains nicely accessible as a Markdown document. E.g.:

img3
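To give an idea of what such a conversion involves, here is a minimal sketch for the Jupyter case. It is not the nb2md implementation: the function names are hypothetical, only text outputs are handled, and it assumes the version 4 notebook layout in which each cell has a "cell_type", a "source", and (for code cells) a list of "outputs".

```python
import json

FENCE = "`" * 3  # Markdown code fence, built here to keep this example readable


def _text(value):
    # Notebook JSON stores multi-line text either as one string or as a list of lines.
    return "".join(value) if isinstance(value, list) else value


def jupyter_to_markdown(raw: str, language: str = "python") -> str:
    """Render Markdown cells as-is and code cells (plus text outputs) as fenced blocks."""
    notebook = json.loads(raw)
    parts = []
    for cell in notebook.get("cells", []):
        source = _text(cell.get("source", ""))
        if cell.get("cell_type") == "markdown":
            parts.append(source)
        elif cell.get("cell_type") == "code":
            parts.append(f"{FENCE}{language}\n{source}\n{FENCE}")
            for output in cell.get("outputs", []):
                if output.get("output_type") == "stream":
                    text = _text(output.get("text", ""))
                elif output.get("output_type") in ("execute_result", "display_data"):
                    text = _text(output.get("data", {}).get("text/plain", ""))
                else:
                    text = ""  # errors and other output types are skipped in this sketch
                if text:
                    parts.append(f"{FENCE}\n{text.rstrip()}\n{FENCE}")
    return "\n\n".join(parts) + "\n"


# A tiny hand-written notebook: one Markdown cell and one code cell with stdout.
example = json.dumps({
    "nbformat": 4, "nbformat_minor": 2, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": ["print(1 + 1)"],
         "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}]},
    ],
})
markdown = jupyter_to_markdown(example)
print(markdown)
```

Images and other rich outputs are dropped here on purpose, which mirrors the point above that visual outputs are not part of the reviewed Markdown.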

The nb2md tool can read both Zeppelin and Jupyter notebooks from S3, HTTP and local paths. Try it out! pip install nb2md (requires Python 3).
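Reading from several kinds of locations mostly comes down to dispatching on the URL scheme. The following is a rough sketch of that idea and not how nb2md does it internally: read_notebook is a hypothetical helper, and the S3 branch assumes the third-party boto3 package is installed and AWS credentials are configured.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def read_notebook(location: str) -> dict:
    """Load notebook JSON from an http(s) URL, an s3:// URL, or a local path."""
    parsed = urlparse(location)
    if parsed.scheme in ("http", "https"):
        with urlopen(location) as response:
            raw = response.read().decode("utf-8")
    elif parsed.scheme == "s3":
        import boto3  # third-party; imported lazily so local use needs no AWS setup

        obj = boto3.client("s3").get_object(
            Bucket=parsed.netloc, Key=parsed.path.lstrip("/")
        )
        raw = obj["Body"].read().decode("utf-8")
    else:
        # Anything else is treated as a local file path.
        raw = Path(location).read_text(encoding="utf-8")
    return json.loads(raw)


# Local usage example with a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.ipynb"
    path.write_text(json.dumps({"nbformat": 4, "cells": []}), encoding="utf-8")
    loaded = read_notebook(str(path))

print(loaded)
```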

The official documentation and code are available at https://github.com/elehcimd/nb2md. It is fewer than 300 lines of code (less than the code required for packaging), so it is also easy to modify or extend.

[Thread on hacker news]