Continuous Delivery in Data Science

As I discussed in a previous article, Data Science is in desperate need of Devops. Fortunately, there are finally some emerging devops patterns to support Data Science development. DataBricks themselves are providing much of it.

Two concepts keep popping up in the devops patterns: “Continuous Integration / Continuous Deployment” and “Test Driven Design” (Moving toward “Behavioral Driven Design” but that’s not a widely used term).

Databricks has a great article on how to architect dev/prod CI/CD: https://databricks.com/blog/2017/10/30/continuous-integration-continuous... This shows details such how each developer has their own ‘development’ environment but with managed configurations and plugins to ensure all development occurs in the same configuration.

I would personally love to see data science engineering development patterns move into Test / Behavioral Driven Design – mostly because it makes things a lot easier on data end users; but also because it forces strict requirement definitions: https://databricks.com/blog/2017/06/02/integrating-apache-spark-cucumber...

It's a little late to pick up on HumbleBundle, but Continuous Delivery with Docker and Jenkins by Rafal Lesko is a great read on the topic of Continuous Delivery, even if you are unfamiliar with the technologies.