Git and GitHub
Version Control for Analysis Projects
My LLM friend Claude and I made this together.
What is Git?
Git is a version control system that tracks changes in your files. It works locally on your computer and creates snapshots (called commits) of your work. Think of it like “track changes” for code and data projects, but much more powerful.
What is GitHub?
GitHub is a remote hosting platform for Git repositories. It provides:
- Backup in the cloud
- Tools to share and collaborate
- A portfolio of your work
- Project management features
Key distinction: Git runs locally on your machine, while GitHub is the remote service where you store your repositories.
Git Basics
Core Concepts
Repository (repo): A project folder that Git is tracking. Contains all your files plus a hidden .git directory with the version history.
Commit: A snapshot of your work at a specific point in time. Each commit has a unique ID and a message describing what changed.
Branch: A separate line of development. We’ll keep things simple with just the main branch for now.
What Git Tracks
Git is great for tracking certain types of files:
✅ DO track:
- Code files (`.py`, `.R`, `.jl`)
- Notebooks (`.ipynb`, `.qmd`, `.Rmd`)
- Documentation (`README.md`, notes)
- Configuration files (`requirements.txt`, `config.yaml`)
❌ DON’T track:
- Large datasets (use data storage solutions instead)
- API keys and credentials (use environment variables)
- Generated outputs (plots, HTML reports)
- Binary files that change frequently
Basic Workflow
The fundamental Git workflow follows these steps:
- Modify files in your project
- Stage changes with `git add`
- Commit changes with `git commit`
- Repeat!
This pattern becomes second nature once you practice it a few times.
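The loop above can be sketched as a minimal shell session (the repository name `my-analysis` and file `analysis.py` are placeholders):

```shell
# One-time setup: create a repository (skip if you already have one)
git init my-analysis
cd my-analysis

# The edit -> stage -> commit loop
echo "print('hello')" > analysis.py          # 1. modify a file
git add analysis.py                          # 2. stage the change
git commit -m "Add first analysis script"    # 3. save a snapshot
```

Running `git status` between steps shows which files are modified, staged, or untracked.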
Essential Commands
Here are the 5 commands you need to get started with Git:
# Start tracking a project
git init
# Stage changes (prepare files for commit)
git add filename.py
# Save a snapshot with a message
git commit -m "Clear message about what changed"
# Check what's changed
git status
# View commit history
git log
Example: Creating Your First Repository
# Create and navigate to a new project
git init my-analysis
cd my-analysis
# Create a simple analysis file
# (create your notebook or script here)
# Stage and commit
git add analysis.ipynb
git commit -m "Initial analysis of dataset X"
GitHub for Collaboration
Connecting Local and Remote
Once you have a local Git repository, you can connect it to GitHub to back it up and share it with others.
Two new commands:
# Send your commits to GitHub
git push
# Get the latest changes from GitHub
git pull
Typical Workflow with GitHub
- Create a repository on GitHub
- Connect your local repository to GitHub
- Push your work to the remote
- Collaborate with others who can clone or contribute
- Pull their changes to stay up to date
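On the collaboration side, cloning and pulling look like this (the URL is a placeholder for a real GitHub repository):

```shell
# Get a full copy of a collaborator's repository (placeholder URL)
git clone https://github.com/username/repo.git
cd repo

# Later: fetch and merge the latest commits from GitHub
git pull
```

`git clone` also sets up the remote connection for you, so `git pull` and `git push` work immediately inside the cloned folder.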
Pushing to GitHub
# Connect to GitHub (one time setup)
git remote add origin https://github.com/username/repo.git
# Push your commits
git push -u origin main
After pushing, anyone with access can view and use your analysis through GitHub’s web interface.
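To confirm the connection, `git remote -v` lists the remotes your repository knows about (the URL below is the placeholder from the setup step):

```shell
# Check which remotes are configured (run inside the repository)
git remote -v
# origin  https://github.com/username/repo.git (fetch)
# origin  https://github.com/username/repo.git (push)
```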
Data Science Specific Considerations
What to Commit
For data science projects, make sure to include:
- Analysis scripts (`.py`, `.R`, `.jl`)
- Jupyter notebooks or Quarto documents
- Dependency files (`requirements.txt`, `renv.lock`, `environment.yml`)
- `README.md` with project description and setup instructions
- Documentation and notes
What NOT to Commit
Use a .gitignore file to exclude:
- Data files (`.csv`, `.parquet`, `.xlsx`), which are usually too large for Git
- Credentials (`.env` files, API keys, passwords)
- Output files (plots, HTML reports, model files)
The .gitignore File
Create a .gitignore file in your repository root:
# Data
*.csv
*.parquet
*.xlsx
data/
raw_data/
# Credentials
.env
secrets.json
.Renviron
# Outputs
*.html
*.png
*.pdf
figures/
output/
# Python
__pycache__/
venv/
*.pyc
# R
.Rproj.user/
.Rhistory
.RData
# Jupyter
.ipynb_checkpoints/
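You can verify the rules before committing; `git check-ignore -v` reports which .gitignore line matches a given file (`results.csv` here is a hypothetical data file):

```shell
# Which .gitignore rule matches this file? (prints the rule and the path)
git check-ignore -v results.csv

# Ignored files no longer appear as untracked in git status
git status --short
```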
Working with Jupyter Notebooks
Jupyter notebooks can be tricky with version control because they contain outputs and metadata that change frequently.
Best practices:
- Clear outputs before committing: Cell → All Output → Clear
- Use tools like `nbstripout` to strip outputs automatically
- Consider using Quarto (`.qmd`) files instead; they’re plain text and version-control friendly
- Review diffs carefully on GitHub to see what actually changed
Hands-On Practice
To solidify these concepts, try this exercise:
- Create a new directory for a simple analysis project
- Initialize Git with `git init`
- Create a basic notebook or script with some analysis
- Make your first commit
- Make a change to your code and commit again
- Create a repository on GitHub
- Push your local work to GitHub
Bonus: Add a .gitignore file appropriate for your project.
Best Practices
Commit Messages
Write clear, descriptive commit messages that explain what changed and why.
Good examples:
- "Add data cleaning function for missing values"
- "Fix bug in correlation calculation"
- "Update visualization to use seaborn style"
Bad examples:
- "updates"
- "fixed stuff"
- "asdfasdf"
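One way to keep messages informative is to pair a short subject line with an explanatory body; each `-m` flag adds a paragraph (the message text here is illustrative):

```shell
# First -m is the subject, second -m becomes the body of the message
git commit -m "Fix bug in correlation calculation" \
           -m "Rows with missing values were dropped inconsistently across columns."
```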
Commit Frequency
Commit early, commit often
Make small, focused commits rather than large ones that change many things. This makes it easier to:
- Understand what changed
- Find bugs by reviewing specific commits
- Revert changes if something breaks
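Small commits pay off when something breaks: `git revert` creates a new commit that undoes exactly one earlier commit (here the most recent, `HEAD`):

```shell
# Review recent history to find the offending commit
git log --oneline -5

# Undo the most recent commit with an inverse commit
# (swap HEAD for an older commit's ID to undo that one instead)
git revert --no-edit HEAD
```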
Branching (Advanced)
As you get more comfortable, explore branches for:
- Testing new features without breaking your main analysis
- Collaborating with others on different aspects
- Keeping stable and experimental work separate
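A minimal branch workflow looks like this, assuming your default branch is named main (`git switch` needs Git 2.23+; on older versions use `git checkout -b`):

```shell
# Create and switch to an experimental branch
git switch -c experiment

# ...edit files and commit as usual...

# Return to main and merge the experiment once it works
git switch main
git merge experiment
```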
Resources for Learning More
- Git Documentation - Official Git documentation
- GitHub Skills - Interactive tutorials
Key Takeaways
- Version control is essential for reproducible data science
- Start simple with the basic commands: init, add, commit, push, pull
- Commit frequently with clear messages