git for Scientific Software Development

Jack Atkinson

Senior Research Software Engineer
ICCS - University of Cambridge

2025-10-01

Precursors

Slides and Materials

To access links or follow on your own device these slides can be found at:
jatkinson1000.github.io/git-for-science


All materials are available at:

Licensing

Except where otherwise noted, these presentation materials are licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.

Vectors and icons by SVG Repo used under CC0(1.0)

Precursors

  • Be nice (Python code of conduct)
  • Ask questions whenever they arise.
    • Someone else is probably wondering the same thing.
    • For troubleshooting please write a message in the Zoom chat and someone will assist you in a thread.
    • For more general queries please raise a hand.
  • I will make mistakes.
    • Not all of them will be intentional.

Learning Objectives

  • Recap the basic git commands
  • Improve contextual understanding of git:
    • through mental models
    • review landscape

  • Understand key components of git repositories to aid in collaboration
  • Learn how to use git and GitHub/GitLab to better manage development and collaboration
    • branches
    • issues
    • merge/pull requests
    • code review

Structure & Premise

  • This workshop has teaching interwoven with practical exercises
  • We will be cloning a basic git repository and improving it over the session
  • After learning new concepts we will immediately put them into practice with a coding exercise before returning to learn more.

I suggest you have open:

  • A text editor or IDE
  • A terminal window
  • A browser window

Structure & Premise

You are doing some work on pendula and your colleague says they have written some code that solves the equations, which they can share with you.
This is made easy by the fact that it is on git!
Let’s see how we get on…

Go to the workshop repository:

If you have an account then fork the repository and clone your fork.

git 101

Installation and setup

Git comes preinstalled on most Linux distributions and macOS.
You can check it is on your system by running which git.


If you are on Windows, or do not have git, check the git docs or the GitHub guide to installing git: https://github.com/git-guides/install-git


Setting up a new git repository is beyond the scope of this talk but involves using the
git init command.


We will assume that you have created a repository using an online hosting service (GitLab, GitHub etc.) that provides a nice UI wrapper around the process.
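
For completeness, creating a repository locally might look like the sketch below. It assumes git 2.28 or newer for the -b option, and the directory name my-project is a placeholder rather than anything used in this workshop.

```shell
# Work inside a throwaway directory so the sketch does not touch real work.
workdir=$(mktemp -d)
cd "$workdir"

# Initialise an empty repository in a new directory (hypothetical name),
# with 'main' as the initial branch (-b needs git 2.28+).
git init --quiet -b main my-project
cd my-project

# The .git directory now holds all of the repository data.
ls -d .git
git status --short --branch
```

Hosting services run the equivalent of this for you and then give you a URL to clone.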

What is git?

  • a version control system developed by Linus Torvalds.
  • tracks changes made to files over time.
  • not Dropbox!

Rabbit hole from Disney’s Alice in Wonderland under fair use

git Geography

Locations

  • Local
    • Workspace
    • Staging area or index
    • Local repo
    • Stash
  • Remote
    • Remote repo

These (and more) can be explored in Andrew Peterson’s Interactive git cheat sheet
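
As a rough illustration of how work moves between the local locations, the sketch below uses a throwaway repository. The file name, user name, and email are placeholders, and -b assumes git 2.28 or newer. The remote is left out here because it needs a hosting service or a second repository.

```shell
# Set up a throwaway repository (all names here are placeholders).
repo=$(mktemp -d)
cd "$repo"
git init --quiet -b main
git config user.name "Example User"
git config user.email "user@example.com"

echo "first line" > notes.txt       # exists in the workspace only
git add notes.txt                   # workspace -> staging area (index)
git commit --quiet -m "Add notes"   # index -> local repo

echo "second line" >> notes.txt     # a new change, in the workspace only
git stash push --quiet              # workspace -> stash; notes.txt returns to its committed state
git stash list                      # the stash now holds one entry
git stash pop --quiet               # stash -> workspace; the change is back
git status --short                  # notes.txt is listed as modified
```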

A Warning

The basic commands

  • git clone <repo> [<dir>]
    • Clone a repository into a new directory
$ git clone git@github.com:jatkinson1000/git-for-science.git git4sci
  Cloning into 'git4sci'...
  remote: Enumerating objects: 42, done.
  remote: Counting objects: 100% (42/42), done.
  remote: Compressing objects: 100% (39/39), done.
  remote: Total 42 (delta 26), reused 31 (delta 15), pack-reused 0
  Receiving objects: 100% (42/42), 69.62 MiB | 5.64 MiB/s, done.
  Resolving deltas: 100% (26/26), done.
$
$ cd git4sci/
$
$ echo "This is a new file." > newfile.txt
$

The basic commands

  • git status
    • Check the state of the directory
  • git add <filepath>
    • Update the index with any changes
$ git status
  On branch main
  Your branch is up to date with 'origin/main'.
  
  Untracked files:
    (use "git add <file>..." to include in what will be committed)
          newfile.txt
  
  nothing added to commit but untracked files present (use "git add" to track)
$
$ git add newfile.txt
$
$ git status
  On branch main
  Your branch is up to date with 'origin/main'.
  
  Changes to be committed:
    (use "git restore --staged <file>..." to unstage)
          new file:   newfile.txt
$

The basic commands

  • git commit
    • git commit -m <message>
    • Commit changes in the index to (local) record
$ git status
  On branch main
  Your branch is up to date with 'origin/main'.
  
  Changes to be committed:
    (use "git restore --staged <file>..." to unstage)
          new file:   newfile.txt
  
$
$ git commit -m "Add newfile with placeholder text."
  1 file changed, 1 insertion(+)
  create mode 100644 newfile.txt
$
$ git status
  On branch main
  Your branch is ahead of 'origin/main' by 1 commit.
    (use "git push" to publish your local commits)
  
  nothing to commit, working tree clean
$

The basic commands

  • git push <remote> <branch>
    • Send your locally committed changes to the remote repo
$ git status
  On branch main
  Your branch is ahead of 'origin/main' by 1 commit.
    (use "git push" to publish your local commits)

  nothing to commit, working tree clean
$
$ git push origin main
  Enumerating objects: 3, done.
  Counting objects: 100% (3/3), done.
  Delta compression using up to 8 threads
  Compressing objects: 100% (8/8), done.
  Writing objects: 100% (8/8), 1.89 KiB | 1.89 MiB/s, done.
  Total 8 (delta 7), reused 0 (delta 0), pack-reused 0
  remote: Resolving deltas: 100% (7/7), completed with 7 local objects.
  remote:
  To github.com:jatkinson1000/git-for-science.git
     7647d3a..7ab12ff  main -> main
$

Exercise

Obtain a copy of the repository from the online remote using git clone.


If you made a fork clone this, otherwise clone the main workshop version.


Take a look around, how useful is this?

Before we get started properly we’ll do a quick practical refresher of the basic git commands.

How does git work?

A mental model:

  • Each time you commit work git stores it as a diff.
    • This shows specific lines of a file and how they changed (+/-).
    • This is what you see with the git diff command.
  • diffs are stored in a tree.
    • By applying each diff one at a time we can reconstruct files.
    • We do not need to do this in order (see cherry-picking and merge conflicts…).
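
To see a diff for yourself, something like the following sketch could be run in a scratch directory. The file words.txt and the user details are made up for illustration, and -b assumes git 2.28 or newer.

```shell
# Set up a throwaway repository with one committed file (placeholder names).
repo=$(mktemp -d)
cd "$repo"
git init --quiet -b main
git config user.name "Example User"
git config user.email "user@example.com"
printf 'alpha\nbeta\n' > words.txt
git add words.txt
git commit --quiet -m "Add words"

printf 'alpha\ngamma\n' > words.txt   # change the second line
git diff                               # workspace vs index: shows -beta and +gamma
git add words.txt
git diff --staged                      # index vs last commit: the same change, now staged
```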

How does git work?

diff --git a/mycode/functions.py b/mycode/functions.py
index b784b07..d08024a 100644
--- a/mycode/functions.py
+++ b/mycode/functions.py
@@ -340,11 +341,10 @@ def rootfind_score(
         fpre = fcur
         if abs(scur) > delta:
             xcur += scur
+        elif sbis > 0:
+            xcur += delta
         else:
-            if sbis > 0:
-                xcur += delta
-            else:
-                xcur -= delta
+            xcur -= delta

         fcur = f_root(xcur, score, rnd)
         val = xcur

Evans (2024), Mukerjee (2024)

How does git work?

Actually:

  • Each time you commit work git creates a snapshot
    • Contains the commit message and a hash to a tree.
  • The tree is a list of files in the repo at this commit.
    • In reality it is a tree of trees for efficiency!
    • The leaves of the tree are packed files at time of commit.
  • packed files are efficiently compressed.
    • And may use deltas which are a bit like diffs.
  • By tracing the tree and then unpacking we can reconstruct the repo at a state in time given by the commit hash.

Evans (2024)
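
The snapshot structure can be inspected directly with git cat-file. The sketch below builds a one-commit scratch repository (user details are placeholders, -b assumes git 2.28 or newer) and then walks from the commit to its tree to a file.

```shell
# Set up a throwaway repository with a single commit.
repo=$(mktemp -d)
cd "$repo"
git init --quiet -b main
git config user.name "Example User"
git config user.email "user@example.com"
echo "This is a new file." > newfile.txt
git add newfile.txt
git commit --quiet -m "Add newfile with placeholder text."

# The commit object records a tree hash, author details, and the message.
git cat-file -p HEAD

# The tree lists each entry with its mode, type, hash, and name.
git cat-file -p 'HEAD^{tree}'

# The blob holds the file contents as they were at commit time.
git cat-file -p HEAD:newfile.txt
```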

Repository Files

README

  • A file in the main directory of your repository.
  • The entry point for new users.
  • Helps to ensure your code:
    • is accessible
    • is used properly
    • has longevity
  • Nowadays it is encouraged to write it in Markdown as README.md.

README - examples

README

Essential:

  • Name
  • Short summary
  • Install instructions
  • Usage/getting-started instructions
  • Information about contributing
  • Authors and Acknowledgment
  • License information

Nice to have:

  • References to key papers/materials
  • Badges
  • Examples
  • Link to docs
  • List of users
  • FAQ
  • See readme.so/ for a longer list

makeareadme.com and readme.so are great tools to help.

Add as soon as you can in a project and update as you go along.

README - good examples

Exercise - README

How can we improve the README in the pyndulum code?

Edit the README.md file to improve it. In particular think about:

  • A description of what the code is
  • How it can be installed
  • How to use it or get started
  • Information about the authors and how to contribute

Add and commit your changes.

If you are working from a fork then push to see these changes reflected on the online remote repository.

License

All public codes should have a license attached!

  • As a LICENSE file in the main directory
    • Recommended to choose an OSI-approved license without modification.
  • Protects ownership and limits liability
  • Enables collaboration
  • Clarifies what can be done with the code and its derivatives
    • Public Domain ↔︎ Permissive ↔︎ Copyleft
    • The options may depend on your organisation and/or funder.

See choosealicense.com and the OSI list of licenses for more information.

GitHub and GitLab contain helpers to easily create popular licenses.

Adding a License - GitLab

1) From the main repo select the “+” dropdown menu and “New file”.

2) In the filename type “LICENSE” and GitLab will detect and offer you a dropdown to choose a LICENSE template.

3) Once you have chosen you can “Commit Changes” to add the file. It will appear as a LICENSE file at the top of the repository and will be detected by GitLab in the right hand side metadata.

Adding a License - GitHub

From the main repo select “Add file” and “+ Create new file”.

In the filename type “LICENSE” and GitHub will detect and offer you the option to choose a LICENSE template.

Adding a License - GitHub

Select your desired license and follow the instructions to apply it to your repository. Once complete it will appear as a LICENSE file at the top of the repository and will be detected by GitHub in the right hand side metadata.

Exercise - License

Add a license to our pyndulum code.


We will use the online helper features of GitHub or GitLab to choose and add a License.


Once you have done this don’t forget to run:

git pull <repo> <branch>

from your local copy to get these changes locally before you make further updates.
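
The sketch below imitates this situation without a hosting service. It is only an illustration under some assumptions: a bare repository stands in for the online remote, a second clone (web) stands in for the edit made through the web interface, the file contents and user details are placeholders, and -b assumes git 2.28 or newer.

```shell
# Identity for the example commits (placeholder values).
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

base=$(mktemp -d)
git init --quiet --bare -b main "$base/remote.git"   # stands in for GitHub/GitLab

# Your local copy, with one commit pushed to the remote.
git init --quiet -b main "$base/local"
cd "$base/local"
git remote add origin "$base/remote.git"
echo "# pyndulum" > README.md
git add README.md
git commit --quiet -m "Add README"
git push --quiet origin main

# Simulate adding a LICENSE online by committing from a second clone.
git clone --quiet "$base/remote.git" "$base/web"
cd "$base/web"
echo "placeholder license text" > LICENSE
git add LICENSE
git commit --quiet -m "Add LICENSE"
git push --quiet origin main

# Back in the local copy the LICENSE is missing until we pull.
cd "$base/local"
git pull --quiet origin main
ls
```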

.gitignore

It is a good idea to add a .gitignore file to your projects:

  • A list of file patterns that will be skipped over by git.
  • Makes it easier for us to see through to what is important.
  • Used for:
    • junk that shouldn’t be in there - the infamous .DS_Store
    • build files - mycode.o, mymodule.mod, out.a etc.
    • local environments - .venv/
    • large files - 50_year_run.nc or my_thesis.pdf etc.
    • keeping sensitive information out of public view
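
A small .gitignore covering some of these categories might look like the one written in the sketch below. The patterns and file names are illustrative only. git check-ignore then reports which pattern matches each path.

```shell
# Set up a throwaway repository.
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Write an illustrative .gitignore (lines starting with # are comments).
cat > .gitignore <<'EOF'
# OS junk
.DS_Store
# build files
*.o
*.mod
*.a
# local environments
.venv/
# large data files
*.nc
EOF

# Create some files: three that should be ignored and one that should not.
touch .DS_Store mycode.o 50_year_run.nc pendulum.py

# Report the .gitignore line and pattern that matches each ignored path.
git check-ignore -v .DS_Store mycode.o 50_year_run.nc

# Only .gitignore and pendulum.py are listed as untracked.
git status --short
```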

Generating a gitignore

  • Again, GitHub and GitLab contain helpers and templates to create .gitignore.

    • Add a file and this time enter “.gitignore” to get templates.
  • gitignore.io also provides support for generating multi-language .gitignores.

  • You can always edit the file later to add more things to it.


Aside:

While it can be tempting to use git add -A it should be avoided to prevent detritus and unclear commits. For the terminally lazy a better alternative is git add -u.
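
The behaviour of git add -u can be seen in a scratch repository such as the sketch below. The file names and user details are placeholders, and -b assumes git 2.28 or newer.

```shell
# Set up a throwaway repository with one tracked file.
repo=$(mktemp -d)
cd "$repo"
git init --quiet -b main
git config user.name "Example User"
git config user.email "user@example.com"
echo "tracked" > tracked.txt
git add tracked.txt
git commit --quiet -m "Add tracked file"

echo "an edit" >> tracked.txt     # modify a file git already tracks
echo "junk" > scratch.tmp         # create a new, untracked file

git add -u                        # stages changes to tracked files only
git status --short                # tracked.txt is staged; scratch.tmp is still untracked
```

Staging everything at once would have pulled scratch.tmp into the commit as well.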

Exercise - .gitignore

Add a .gitignore to the pyndulum code.


We will use the online helper features of GitHub or GitLab to set up a basic gitignore file for a Python code.


Again, once you have done this don’t forget to run:

git pull <repo> <branch>

from your local copy to get these changes locally before you make further updates.

Git Workflow

Issues

Both GitHub and GitLab have methods for tracking issues.

These are useful for organising work.

  • managing separate tasks
  • logging new issues as they arise
  • tracking and scheduling development

Example issues:

GitHub

GitLab

Exercise - Issues

It would enhance the pyndulum code if we added functions to calculate:

  • pendulum total energy, and
  • pendulum length from desired period.


We will open issues for these on the online repository.

I will open one for pendulum energy, but you should open one for pendulum length.


Note: Issues and Forks

By default issues are off for forks, so you should open the issue on my copy of the repository on GitLab or GitHub.

Branches

So far we have been using the main branch in everything we do.


Our commits look something like this:

    %%{init: {'theme': 'dark',
              'gitGraph': {'rotateCommitLabel': true},
              'themeVariables': {
                  'commitLabelBackground': '#bbbbbb',
                  'commitLabelColor': '#ffffff'
    } } }%%
    gitGraph
       commit id: "1-ad4e"
       commit id: "4-ff6b"
       commit id: "0-fd7f"
       commit id: "1-2y4f Improve README"
       commit id: "4-664e Add a LICENSE file"
       commit id: "6-d3et Add a .gitignore from template"

But what if:

  • Someone else is modifying the same files as us?
  • We are working on different aspects/features of the project in parallel?
  • We find a bug and need to quickly fix it?

Branches

Branches help with all of the aforementioned situations, but are a sensible way to organise your work even if you are the only contributor.

    %%{init: {'theme': 'base',
              'gitGraph': {'rotateCommitLabel': true},
              'themeVariables': {
                  'commitLabelBackground': '#bbbbbb',
                  'commitLabelColor': '#ffffff'
    } } }%%
    gitGraph
       commit id: "4-ff6b"
       commit id: "0-fd7f"
       commit id: "fea 1.a"
       commit id: "fea 1.b"
       commit id: "fea 1.c"
       commit id: "fea 1.d"
       commit id: "5-af6f"

Development is conducted in branches and merged into main when completed:

    %%{init: {'theme': 'base',
              'gitGraph': {'rotateCommitLabel': true},
              'themeVariables': {
                  'commitLabelBackground': '#bbbbbb',
                  'commitLabelColor': '#ffffff'
    } } }%%
    gitGraph
       commit id: "4-ff6b"
       commit id: "0-fd7f"
       branch feature
       commit id: "fea 1.a"
       commit id: "fea 1.b"
       commit id: "fea 1.c"
       commit id: "fea 1.d"
       checkout main
       merge feature
       commit id: "5-af6f"

  • git branch <branchname>
    • Creates new branch branchname from current point
  • git checkout <branchname>
    • move to branch branchname
    • Updates local files - beware
  • git merge <branchname>
    • Tie the branchname branch into the current checked out branch with a merge commit.
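
Put together, these three commands might be used as in the sketch below. The branch name feature, the file names, and the user details are placeholders, and -b assumes git 2.28 or newer. The --no-ff flag is an addition here so that a merge commit is created even where git could simply fast-forward.

```shell
# Set up a throwaway repository with one commit on main.
repo=$(mktemp -d)
cd "$repo"
git init --quiet -b main
git config user.name "Example User"
git config user.email "user@example.com"
echo "base" > base.txt
git add base.txt
git commit --quiet -m "Initial commit"

git branch feature                 # create the branch at the current commit
git checkout --quiet feature       # switch to it; workspace files follow the branch
echo "new work" > feature.txt
git add feature.txt
git commit --quiet -m "Add feature work"

git checkout --quiet main          # feature.txt disappears from the workspace
git merge --quiet --no-ff -m "Merge branch 'feature'" feature

git log --oneline --graph          # shows the branch and the merge commit
```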

Branches

This comes into its own when working concurrently on different features.
git is not just about backups – it is about project organisation.

This way danger and obscurity lie:

    %%{init: {'theme': 'base',
              'gitGraph': {'rotateCommitLabel': true},
              'themeVariables': {
                  'commitLabelBackground': '#bbbbbb',
                  'commitLabelColor': '#ffffff'
    } } }%%
    gitGraph
       commit id: "4-ff6b"
       commit id: "0-fd7f"
       commit id: "fea 1.a"
       commit id: "fea 1.b"
       commit id: "fea 2.a"
       commit id: "fea 1.c"
       commit id: "fea 2.b"
       commit id: "5-af6f"
       commit id: "1-ad4e"

This is manageable and understandable:

    %%{init: {'theme': 'base',
              'gitGraph': {'rotateCommitLabel': true},
              'themeVariables': {
                  'commitLabelBackground': '#bbbbbb',
                  'commitLabelColor': '#ffffff'
    } } }%%
    gitGraph
       commit id: "4-ff6b"
       commit id: "0-fd7f"
       branch feature_1
       commit id: "fea 1.a"
       commit id: "fea 1.b"
       checkout main
       branch feature_2
       commit id: "fea 2.a"
       checkout feature_1
       commit id: "fea 1.c"
       checkout main
       merge feature_1
       checkout feature_2
       commit id: "fea 2.b"
       checkout main
       merge feature_2
       commit id: "5-af6f"
       commit id: "1-ad4e"

Branches

The examples so far have been quite simple, but this gives a good audiovisual example of the power of branches:

Exercise - Branches

We want to add functions to calculate pendulum energy and pendulum length from a desired period.


Together we will create a local branch and add the energy equation to pendulum_equations.py, add, commit, and push those changes.


You should then return to main and create another local branch to add the length equation to pendulum_equations.py. Make sure you add and commit your changes!

Once you have done this use

git push <remote> <branch>

to push your work up to a remote feature branch.
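
A first push of a new branch might look like the sketch below. It is an illustration under assumptions: a bare repository stands in for the online remote, the branch name pendulum-length and the file content are placeholders, and -b assumes git 2.28 or newer. The -u flag is optional but records the remote branch as the upstream, so later pushes and pulls on that branch need no arguments.

```shell
# A bare repository stands in for the online remote in this sketch.
base=$(mktemp -d)
git init --quiet --bare -b main "$base/remote.git"
git init --quiet -b main "$base/local"
cd "$base/local"
git config user.name "Example User"
git config user.email "user@example.com"
git remote add origin "$base/remote.git"

echo "# pendulum equations" > pendulum_equations.py
git add pendulum_equations.py
git commit --quiet -m "Initial commit"
git push --quiet origin main

# Create and switch to a feature branch (hypothetical name), then commit work.
git checkout --quiet -b pendulum-length
echo "# length equation goes here" >> pendulum_equations.py
git add pendulum_equations.py
git commit --quiet -m "Add pendulum length placeholder"

# First push of the new branch; -u sets origin/pendulum-length as its upstream.
git push --quiet -u origin pendulum-length

git branch -vv                     # lists each branch with its upstream, if any
```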

Merge/Pull Requests

  • Another feature of GitHub/GitLab.

  • A friendlier, graphical way of merging branches

  • Can be linked to GitHub/GitLab issues

  • A method of tracking progress

    • can be opened after the first push
    • a place for collaborative discussion.

Merge/Pull Requests

When opening a request you should include:

  • A description of what you have done
  • Any points to be particularly aware of
  • Checkboxes for required/ongoing tasks

Exercise - Merge/Pull requests

From the branches you pushed up in the previous exercise open pull requests either:

  • into the main branch of your fork, or
  • back to the main branch of my repository.

Use additional features of GitHub/GitLab:

  • write a clear description of the PR,
  • use keywords to tie back to your issue
    e.g. “closes #6”,
  • add a “label” to categorise the work,
  • Assign yourself

Code Review

Code review is not:

  • just for ‘real’ software
  • a chance to feel bad about your code

Code review is:

  • a chance to reflect on what you wrote,
  • a chance to spot bugs - we all make them!
  • testing that someone else can understand your code,
  • guarding against laziness,
  • a method to improve quality and reusability,
  • a chance to learn.

Code Review

Again, GitHub and GitLab have nice infrastructure to make this an effective and visual process.

Anyone can conduct a code review on a public repository.
If working alone ask colleagues for help and return the favour.


Do:

  • remember who the person you are reviewing is
  • explain your reasons for requests
  • praise good code, not just point out errors

Do not:

  • impose preferences
  • nitpick excessively

Exercise - Code Review

We will work through the length equation pull request and perform a code review before merging the work.


If anyone in the audience opened a pull request back to my repository and would like to volunteer we can review your code!

Please raise your hand and let me know the “number” of your Merge/Pull request.

Aside - Commit Messages

Aside - Commit Frequency

There are differing thoughts on commit frequency and style. I suggest:

  • Synoptic Scale: Project or Meta-issues
  • Mesoscale: Pull requests
  • Microscale: Commits

Some useful examples:

  • CAM-ML #23 - Adding number concentration calculations
  • CAM-ML #32 - restoring a previously removed variable
    We can look directly at 8bdb319 to see what needs doing.
  • FTorch #230 - Adding optimizers
    A little long, but shows the development process and discussions.
  • TCTrack #69 - Refactoring code structure
    A big refactor, but broken up into logical chunks in case one of them has an issue.

Closing

Summary

git is not just a series of backups, it is a project management system.


  • Improve your repositories:
    • README.md
    • LICENSE
    • .gitignore
  • Use branches:
    • Separate workflows
    • Organise project
  • Make full use of GitLab/GitHub features:
    • Helper tools
    • Issues
    • Pull Requests
    • Code review:
      • Learn, spot bugs, improve reusability

Beyond today

As we said at the opening, git is a rabbit hole.

  • Project Boards for project management
  • Merge commit, squash, and rebase options for merging
  • git rebase - to shuffle history and keep things clean
  • git add -p - patched commits, add a section of a file
  • git commit --fixup - for when you missed something out of a commit
  • git worktrees to help organise working on branches in parallel
  • Continuous Integration Workflows
  • pre-commit
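
As a taster of one item from this list, the sketch below shows git commit --fixup followed by an autosquash rebase. File names and user details are placeholders, -b assumes git 2.28 or newer, and GIT_SEQUENCE_EDITOR=true is used here only to accept the generated rebase plan without opening an editor.

```shell
# Set up a throwaway repository with three commits.
repo=$(mktemp -d)
cd "$repo"
git init --quiet -b main
git config user.name "Example User"
git config user.email "user@example.com"
echo "readme" > README.md
git add README.md
git commit --quiet -m "Initial commit"
echo "a" > a.txt
git add a.txt
git commit --quiet -m "Add a"
echo "b" > b.txt
git add b.txt
git commit --quiet -m "Add b"

# We forgot something that belongs in the "Add a" commit.
echo "more a" >> a.txt
git add a.txt
git commit --quiet --fixup HEAD~1      # creates a commit titled "fixup! Add a"

# Fold the fixup into its target commit, accepting the rebase plan as generated.
GIT_SEQUENCE_EDITOR=true git rebase --quiet -i --autosquash HEAD~3

git log --oneline                      # three commits again, with the fix inside "Add a"
```

This rewrites history, so it is best kept to commits that have not yet been shared.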

Where can I learn more?

  • GitButler’s 2024 FOSDEM talk “so you think you know git?”:

Thanks

References

Evans, J. 2024. “Do We Think of Git Commits as Diffs, Snapshots, and/or Histories?” https://jvns.ca/blog/2024/01/05/do-we-think-of-git-commits-as-diffs--snapshots--or-histories/.
Mukerjee, A. 2024. “Unpacking Git Packfiles.” https://codewords.recurse.com/issues/three/unpacking-git-packfiles.

Plus other links throughout the slides.