Docker and Python: making them play nicely and securely for Data Science and ML

Speaker

Tania Allard

common pain points in DS and ML
- complex setup / deps
- reliance on data / database
- fast evolving projects
- are containers secure enough?
how is it different from web apps?
- not every deliverable is an app or a model
- relies on data
- Mixture of wheels and compiled packages
- Security access levels - for data and software
- Mixture of stakeholders:
  - data scientists
  - software engineers
  - ML engineers
best practices
- Split complex RUN statements and sort them
- Prefer COPY over ADD for adding files
- Install only necessary packages
- Explicitly ignore files (.dockerignore)
  - documentation
  - data (never add data to the image)
  - secrets
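
A minimal sketch of the "explicitly ignore files" point, assuming the three categories above map onto a handful of path patterns. The pattern lists and the helper names (render_dockerignore, write_dockerignore) are hypothetical and would need adapting to a real project layout.

```python
from pathlib import Path

# Hypothetical patterns, grouped by the three categories in the notes.
# "**" is the .dockerignore wildcard for "any number of directories".
IGNORE_PATTERNS = {
    "documentation": ["docs"],
    "data": ["data", "**/*.csv", "**/*.parquet"],
    "secrets": [".env", "**/*.pem"],
}


def render_dockerignore(patterns):
    """Return .dockerignore text with one commented section per category."""
    lines = []
    for category, globs in patterns.items():
        lines.append(f"# {category}")  # lines starting with '#' are comments
        lines.extend(globs)
        lines.append("")  # blank line between sections
    return "\n".join(lines)


def write_dockerignore(project_dir, patterns=IGNORE_PATTERNS):
    """Write the rendered file into project_dir (next to the Dockerfile)."""
    target = Path(project_dir) / ".dockerignore"
    target.write_text(render_dockerignore(patterns))
    return target
```

Calling write_dockerignore(".") from the project root would create the file; keeping the patterns in one place makes it easier to review what never reaches the build context.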
cookiecutter template
- docker-science/cookiecutter-docker-science

Rebuild your images frequently - get security updates for system packages
Never work as root / minimize the privileges
- run as non-root user
- minimize capabilities
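
One way to back up the non-root rule from inside the container is a guard at the top of the Python entrypoint. This is a sketch only: the function name check_not_root is hypothetical, and it assumes a Linux container where root has effective UID 0.

```python
import os


def check_not_root(euid=None):
    """Raise PermissionError if the (effective) user is root, i.e. UID 0.

    euid can be passed explicitly for testing; by default the effective
    UID of the current process is used (os.geteuid is Unix-only).
    """
    if euid is None:
        euid = os.geteuid()
    if euid == 0:
        raise PermissionError(
            "Refusing to run as root; set a non-root USER in the Dockerfile."
        )
    return euid
```

Calling check_not_root() as the first statement of the entrypoint script makes a missing USER instruction fail loudly instead of silently running with full privileges.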
Avoid Alpine Linux as a base image (go for Debian-based images such as buster or stretch, or the Jupyter Docker Stacks)
Pin / version EVERYTHING (use pip-tools, conda, poetry or pipenv)
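
A rough check for the pinning rule, assuming a pip-style requirements file where "pinned" means an exact "==" version. The helper unpinned_requirements is hypothetical and deliberately simple: it skips pip option lines (those starting with "-") and does not handle URL requirements or environment markers.

```python
import re

# A requirement counts as pinned here only if it uses an exact "==" version.
# The name part allows extras such as "requests[security]".
PINNED = re.compile(r"^[A-Za-z0-9._\-\[\],]+==[^\s;#]+")


def unpinned_requirements(text):
    """Return the requirement lines in text that lack an exact '==' pin."""
    offenders = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            offenders.append(line)
    return offenders
```

Run against the contents of requirements.txt in CI, an empty result means every listed package carries an exact version; tools such as pip-tools produce files that already satisfy this.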
Leverage build cache
Use one Dockerfile per project
Use multi-stage builds
- fetch and manage secrets in an intermediate layer
- creates smaller images
Make your images identifiable (test, production, R&D); also be careful when accessing databases and using ENV variables / build variables
- Provide context with LABEL instructions
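
A sketch of how tagging and labelling could be scripted, assuming images are built with the docker CLI and its --tag and --label options. The function build_command, the stage names, and the example image name are hypothetical; the label key in the usage note follows the OCI "org.opencontainers.image.*" naming convention.

```python
# Hypothetical stage names standing in for test, production and R&D.
STAGES = {"test", "production", "rnd"}


def build_command(image, stage, labels, context="."):
    """Assemble a `docker build` argument list with a stage tag and labels."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    cmd = ["docker", "build", "--tag", f"{image}:{stage}"]
    # Sort labels so the generated command is stable across runs.
    for key, value in sorted(labels.items()):
        cmd += ["--label", f"{key}={value}"]
    cmd.append(context)
    return cmd
```

For example, build_command("churn-model", "test", {"org.opencontainers.image.revision": "abc123"}) returns a list that can be passed to subprocess.run, so the environment an image belongs to is visible in both its tag and its metadata.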
Do not reinvent the wheel! Use repo2docker
Automate - no need to build and push manually
Use a linter