Atanu Maity
Data Science Ideal Generalized Project Template
From my professional experience, I have seen that the data science ecosystem is rarely well organized or consistently maintained across teams and team members. Every member's work therefore needs a standardized template. Let's see how we can build one.
Virtual Environment
While working on a DS project, it is a must to create a separate virtual environment built solely for that particular project. But why do we need one? In particular, virtual environments help you to:
Resolve dependency issues by allowing you to use different versions of a package for different projects. For example, you could use Package A v2.7 for Project X and Package A v1.3 for Project Y.
Make your project self-contained and reproducible by capturing all package dependencies in a requirements.txt/environment.yml file.
Install packages on a host on which you do not have admin privileges.
Keep your global site-packages/ directory tidy by removing the need to install packages system-wide which you might only need for one project.
So as you can see, when the volume of work increases, keeping a separate environment for each project becomes all the more important.
So, first things first!
Create a virtual environment: In industry practice, Conda is used more often than Python's built-in virtual environments, though plain Python environments can sometimes be more flexible and easier to operate across different teams/team members. We will use Conda by default to create any virtual environment.
# Create a virtual environment with name Comp_Project1
conda create --prefix <path/to/the/folder/Comp_Project1>
# To create an environment with a specific version of Python
conda create --prefix <path/to/the/folder/Comp_Project1> python=3.7
# Activate the environment (on older Conda versions: source activate <path>)
conda activate <path/to/the/folder/Comp_Project1>
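Once activated, you can verify that the right environment and interpreter are in use (the exact output will vary by machine):
# The active environment is marked with an asterisk
$ conda env list
# The interpreter should now resolve inside the environment folder
$ which python
$ python --version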
Install packages in your Comp_Project1 Conda environment:
# Using Conda
(Comp_Project1)$ conda install <your_package> # you can also specify a particular package version
# Using pip (if a package is not available through Conda)
(Comp_Project1)$ pip install <your_package>
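For example, pinning exact versions keeps the environment reproducible (the package names and versions below are purely illustrative):
# Pin an exact version with Conda
(Comp_Project1)$ conda install pandas=1.0.3
# Pin an exact version with pip (note the double ==)
(Comp_Project1)$ pip install requests==2.23.0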
Creating an Environment file:
The easiest way to make your work reproducible by others is to include a file in your project’s root directory listing all the packages, along with their version numbers, that are installed in your project’s environment. Conda calls these environment files. They are the exact analog of requirements files for Python’s virtual environments.
(Comp_Project1)$ conda env export --file environment.yml
The environment file looks something like this:
name: null # as we have created the env using --prefix
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - ca-certificates=2020.1.1=0
  - certifi=2020.4.5.1=py37_0
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=9.1.0=hdf63c60_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - ncurses=6.2=he6710b0_0
  - openssl=1.0.2u=h7b6447c_0
  - pip=20.0.2=py37_1
  - python=3.7.0=h6e4f718_3
  - readline=7.0=h7b6447c_5
  - setuptools=46.1.3=py37_0
  - sqlite=3.31.1=h7b6447c_0
  - tk=8.6.8=hbc83047_0
  - wheel=0.34.2=py37_0
  - xz=5.2.4=h14c3975_4
  - zlib=1.2.11=h7b6447c_3
prefix: /Comp/Comp_Project1
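One caveat worth knowing: the build strings (e.g. h7b6447c_0) are platform-specific, so a file exported this way may not resolve on a different OS. If you need a more portable file, Conda can export a leaner one (the --from-history flag requires a reasonably recent Conda):
# Drop platform-specific build strings
(Comp_Project1)$ conda env export --no-builds --file environment.yml
# Export only the packages you explicitly asked for
(Comp_Project1)$ conda env export --from-history --file environment.yml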
Recreating the environment with the same configuration:
We can recreate the same environment anywhere using the environment.yml file:
$ conda env create -n conda-env -f /path/to/environment.yml
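Since we created the original environment with --prefix (hence name: null in the file), you can equally recreate it at a path of your choosing:
# Recreate the environment at a specific path instead of under a name
$ conda env create --prefix <path/to/the/folder/Comp_Project1> -f /path/to/environment.yml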
Creating Project Template
As an organization, we have seen over time that DS work across team members is highly scattered and not maintained in a proper hierarchy, which ultimately results in untrackable outcomes. So we need a properly standardized and generalized format that any DS project can easily fit into, i.e. a templated folder and file structure.
Solution: Cookiecutter. It's a command-line utility that creates projects from cookiecutters (project templates). For the DS template, there is a dedicated project repo named cookiecutter-data-science (a logical, reasonably standardized, but flexible project structure for doing and sharing data science work).
Installation:
$ pip install cookiecutter
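You can confirm the installation before moving on:
$ cookiecutter --version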
What it will do:
After installing it, go to your project root folder (say Comp_Project1) and generate the project structure from the cookiecutter-data-science template with the following command:
$ cookiecutter https://github.com/drivendata/cookiecutter-data-science
Once you run this command, a prompt will ask for your inputs, like project name, repo name, company name, Python version, etc., to create the project, and that's it. It will automatically set up a project structure with all your given inputs, where all your data folders, script folders, scripts, etc. follow a templated format. The inside of your project folder will now look something like this:
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
Note: After generating the template, you can modify the scripts and folder contents to fit your project without changing its inherent structure.
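One handy consequence of the bundled setup.py: installing the project in editable mode makes the src package importable from notebooks and scripts anywhere in the environment. A minimal sketch:
# From the project root (where setup.py lives)
(Comp_Project1)$ pip install -e .
# The src modules are now importable, e.g.:
(Comp_Project1)$ python -c "from src.data import make_dataset"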
Isn’t it nice!? 🙂
Points to follow or Mandates in a nutshell:
Create a project folder
Activate Conda virtual environment inside the project folder
Install all your necessary packages inside the environment
Create an environment.yml file holding all the configuration info of your environment
Create a project template using cookiecutter, with all the relevant information provided
Start your project
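Putting it all together, here is a condensed sketch of the whole workflow (paths and package names are placeholders):
# 1. Create and activate a project-specific environment
$ conda create --prefix <path/to/the/folder/Comp_Project1> python=3.7
$ conda activate <path/to/the/folder/Comp_Project1>
# 2. Install what you need, then snapshot the configuration
(Comp_Project1)$ conda install <your_packages>
(Comp_Project1)$ conda env export --file environment.yml
# 3. Lay down the standardized project skeleton
(Comp_Project1)$ pip install cookiecutter
(Comp_Project1)$ cookiecutter https://github.com/drivendata/cookiecutter-data-science
# 4. Start your project!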