Starting a Project
Structure
All successful data science projects begin with a semi-rigorous file structure.[1]
Recommended reading:
Note: these are opinions. They’re best-practice-informed opinions, but still opinions.
Language-agnostic project elements (essential):
.
├── data
│   ├── processed
│   └── raw
├── LICENSE.md
├── models
├── notebooks
├── README.md
└── reports
    └── figures
/data contains (1) actual data files, or (2) symbolic links to data files (more on that later). /data/raw holds data your code hasn’t touched. /data/processed holds derivatives created within the current project.
/models contains fitted or trained model objects in serialized form (e.g., Python .pkl or R .rds files).
/notebooks contains exploratory and prototyping code written in Jupyter or R Markdown/Quarto notebooks. This is generally not where the analysis “actually happens.”
/reports contains presentation materials you intend to share (e.g., LaTeX, Quarto, .pdf, .docx). Figures go in /reports/figures.
README.md describes essential information about the project and its data repository.
LICENSE.md is a standardized file describing how your code may be used.
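The language-agnostic layout above can be created by hand, but a small script makes it repeatable. What follows is a minimal sketch, assuming Python 3 and only the standard library. The function name `scaffold_project` and the two constants are illustrative choices, not part of any tool mentioned here.

```python
from pathlib import Path

# Directories from the language-agnostic layout described above.
PROJECT_DIRS = [
    "data/raw",
    "data/processed",
    "models",
    "notebooks",
    "reports/figures",
]

# Top-level files from the same layout; created empty if missing.
PROJECT_FILES = ["README.md", "LICENSE.md"]


def scaffold_project(root):
    """Create the project skeleton under `root` (hypothetical helper).

    Existing directories are left alone, and the contents of existing
    files are not modified.
    """
    root = Path(root)
    for rel in PROJECT_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    for name in PROJECT_FILES:
        (root / name).touch(exist_ok=True)
    return root
```

Calling `scaffold_project("my-project")` (a placeholder name) creates the tree in the current working directory. Running it a second time is harmless, so it can also be used to repair a partially built layout.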
R-specific project elements:
.
├── {project-name}.Rproj
├── R
│   └── {load_data}.R
├── renv
└── renv.lock
/{project-name}.Rproj is a basic collection of settings interpreted by RStudio. Not strictly necessary, but handy.
/renv and renv.lock are environment files created by the package renv. More on virtual environments later.
/R contains R scripts where the analysis “actually happens.”
Python-specific project elements:
.
├── environment.yml
└── src
    └── {load_data}.py
/environment.yml is a “frozen” environment specification exported by conda, a package and environment manager. More on virtual environments later.
/src contains Python scripts where the analysis “actually happens.”
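To make the raw/processed split concrete, here is a hedged sketch of what a script such as `{load_data}.py` might contain. It assumes the raw data is a CSV file and uses only the standard library. The function names `load_raw` and `save_processed`, and the choice to pass the project root explicitly, are assumptions for illustration.

```python
import csv
from pathlib import Path


def load_raw(project_root, filename):
    """Read a CSV from data/raw and return its rows as a list of dicts."""
    raw_path = Path(project_root) / "data" / "raw" / filename
    with raw_path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def save_processed(project_root, filename, rows):
    """Write rows (a non-empty list of dicts) to data/processed as CSV."""
    out_dir = Path(project_root) / "data" / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    with out_path.open("w", newline="", encoding="utf-8") as f:
        # Column order follows the keys of the first row.
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

The point of the design is that nothing is ever written back into data/raw: scripts read from it and write derivatives to data/processed, so the original files stay untouched.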
Virtual Environment
Note: to register the active conda environment as a Jupyter kernel, run python -m ipykernel install --user --name $CONDA_DEFAULT_ENV
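The command in the note above registers the active conda environment as a Jupyter kernel, reading the environment name from the CONDA_DEFAULT_ENV variable that conda sets on activation. The same step can be scripted. The following is a rough sketch, assuming ipykernel is installed in the environment; the helper name `kernel_install_command` is hypothetical.

```python
import os
import sys


def kernel_install_command(env_name=None):
    """Build the ipykernel registration command (hypothetical helper)."""
    # conda sets CONDA_DEFAULT_ENV when an environment is activated.
    name = env_name or os.environ.get("CONDA_DEFAULT_ENV")
    if not name:
        raise RuntimeError("No conda environment appears to be active.")
    # sys.executable is the interpreter running this script, so the
    # registered kernel points at the same environment.
    return [sys.executable, "-m", "ipykernel", "install",
            "--user", "--name", name]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to perform the registration. Building the command separately from running it makes it easy to print and inspect first.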
Footnotes
[1] Many doomed projects also begin with semi-rigorous data structures.