git for Scientific Software Development

Jack Atkinson

Senior Research Software Engineer
ICCS - University of Cambridge

Adeleke Bankole

Research Software Engineer
ICCS - University of Cambridge

2025-06-23

Precursors

Slides and Materials

To access links or follow on your own device these slides can be found at:
jatkinson1000.github.io/git-for-science


All materials are available at:

Licensing

Except where otherwise noted, these presentation materials are licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.

Vectors and icons by SVG Repo used under CC0(1.0)

Precursors

  • Be nice (Python code of conduct)
  • Ask questions whenever they arise.
    • Someone else is probably wondering the same thing.
  • I will make mistakes.
    • Not all of them will be intentional.

git 101

What is git

git is a version control system developed by Linus Torvalds.

It tracks changes made to files over time.

Installation and setup

Git comes preinstalled on most Linux distributions and macOS.
You can check it is on your system by running which git.


If you are on Windows, or do not have git, check the git docs or the GitHub guide to installing git (https://github.com/git-guides/install-git).


Setting up a new git repository is beyond the scope of this talk but involves using the
git init command.


We will assume that you have created a repository using an online hosting service (GitLab, GitHub etc.) that provides a nice UI wrapper around the process.

How does it work?

A mental model:

  • Each time you commit work git stores it as a diff.
    • This shows specific lines of a file and how they changed (+/-).
    • This is what you see with the git diff command.
  • diffs are stored in a tree.
    • By applying each diff one at a time we can reconstruct files.
    • We do not need to do this in order:
      see cherry-picking and merge conflicts…
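
As a rough illustration of the diff idea, the sketch below uses Python's standard difflib module to produce a unified diff between two versions of a small snippet. The snippet contents and the file names a/functions.py and b/functions.py are made up for illustration; git uses its own diff machinery, so this only mimics the output format.

```python
import difflib

# Two hypothetical versions of the same file, as lists of lines.
old = [
    "if abs(scur) > delta:\n",
    "    xcur += scur\n",
    "else:\n",
    "    xcur -= delta\n",
]
new = [
    "if abs(scur) > delta:\n",
    "    xcur += scur\n",
    "elif sbis > 0:\n",
    "    xcur += delta\n",
    "else:\n",
    "    xcur -= delta\n",
]

# unified_diff yields header lines, a hunk marker (@@ ... @@),
# and then context lines with +/- prefixes for added/removed lines.
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="a/functions.py", tofile="b/functions.py")
)
print("".join(diff_lines))
```

Lines starting with + exist only in the new version and lines starting with - exist only in the old one, which is the same convention as git diff.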

How does it work?

diff --git a/mycode/functions.py b/mycode/functions.py
index b784b07..d08024a 100644
--- a/mycode/functions.py
+++ b/mycode/functions.py
@@ -340,11 +341,10 @@ def rootfind_score(
         fpre = fcur
         if abs(scur) > delta:
             xcur += scur
+        elif sbis > 0:
+            xcur += delta
         else:
-            if sbis > 0:
-                xcur += delta
-            else:
-                xcur -= delta
+            xcur -= delta

         fcur = f_root(xcur, score, rnd)
         val = xcur

Evans (2024), Mukerjee (2024)

How does it work?

Actually:

  • Each time you commit work git creates a snapshot
    • Contains the commit message and a hash to a tree.
  • The tree is a list of files in the repo at this commit.
    • In reality it is a tree of trees for efficiency!
    • The leaves of the tree are packed files at time of commit.
  • packed files are efficiently compressed.
    • And may use deltas which are a bit like diffs.
  • By tracing the tree and then unpacking we can reconstruct the repo at a state in time given by the commit hash.

Evans (2024)
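
To make the snapshot model a little more concrete, here is a minimal sketch of how git names a stored file (a "blob"): it hashes a short header plus the file contents. This assumes the default SHA-1 object format; the function name git_blob_id is ours, not part of git. It should match what git hash-object reports for the same bytes.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 id git would give a blob with these contents."""
    # A blob is stored as: "blob <size in bytes>\0<contents>"
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The empty file always gets the same well-known id.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Identical contents give identical ids, so unchanged files are stored once.
print(git_blob_id(b"This is a new file.\n"))
```

Because the id depends only on the contents, a tree can refer to an unchanged file by the same hash in every commit without storing it again.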

The basic commands

  • git clone <repo> [<dir>]
    • Clone a repository into a new directory
$ git clone git@github.com:jatkinson1000/git-for-science.git git4sci
Cloning into 'git4sci'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 42 (delta 26), reused 31 (delta 15), pack-reused 0
Receiving objects: 100% (42/42), 69.62 MiB | 5.64 MiB/s, done.
Resolving deltas: 100% (26/26), done.
$
$ cd git4sci/
$
$ echo "This is a new file." > newfile.txt
$

The basic commands

  • git clone <repo> [<dir>]
    • Clone a repository into a new directory
  • git status
    • Check the state of the directory
  • git add <filepath>
    • Update the index with any changes
$ git status
On branch main
Your branch is up to date with 'origin/main'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        newfile.txt

nothing added to commit but untracked files present (use "git add" to track)
$
$ git add newfile.txt
$
$ git status
On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   newfile.txt
$

The basic commands

  • git clone <repo> [<dir>]
    • Clone a repository into a new directory
  • git status
    • Check the state of the directory
  • git add <filepath>
    • Update the index with any changes
  • git commit
    • git commit -m <message>
    • Commit to record changes in the index
$ git status
On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   newfile.txt

$
$ git commit -m "Add newfile with placeholder text."
 1 file changed, 1 insertion(+)
 create mode 100644 newfile.txt
$
$ git status
On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
$

The basic commands

  • git clone <repo> [<dir>]
    • Clone a repository into a new directory
  • git status
    • Check the state of the directory
  • git add <filepath>
    • Update the index with any changes
  • git commit
    • git commit -m <message>
    • Commit to record changes in the index
  • git push <remote> <branch>
    • Send your locally committed changes to the remote repo
$ git status
On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
$
$ git push origin main
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 8 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 1.89 KiB | 1.89 MiB/s, done.
Total 8 (delta 7), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (7/7), completed with 7 local objects.
remote:
To github.com:jatkinson1000/git-for-science.git
   7647d3a..7ab12ff  main -> main
$

The git atlas

Locations

  • Workspace
  • Staging area or index
  • Local repo
  • Remote repo
  • Stash

These (and more) can be explored in Andrew Peterson’s Interactive git cheat sheet

A Warning

How does this help in science?

Exercise

You are doing some work on pendula, and a colleague says they have written some code that solves the equations, which they can share with you.
This is made easy by the fact that it is on git!
Let’s see how we get on…

Go to the workshop repository:

If you have an account then fork the repository and clone your fork.
If you do not have an account clone my repository.

Take a look around. How useful is this?

Repository Files

README

  • A file in the main directory of your repository.
  • The entry point for new users.
  • Helps to ensure your code is used properly.
  • Nowadays usually written in Markdown as README.md.

README - examples

README

Essential

  • Name
  • Short summary
  • Install instructions
  • Usage/getting-started instructions
  • Information about contributing
  • Authors and Acknowledgment
  • License information

Nice to have:

  • References to key papers/materials
  • Badges
  • Examples
  • Link to docs
  • List of users
  • FAQ
  • See readme.so for a longer list

makeareadme.com and readme.so are great tools to help.

Add as soon as you can in a project and update as you go along.

Exercise - README

How can we improve the README in the pyndulum code?

Edit the README.md file to improve it.

Add and commit your changes.

If you are working from a fork then push to see these changes reflected on the online remote.

License

All public code should have a license attached!

  • LICENSE file in the main directory
  • Protect ownership
  • Limit liability
  • Clarify what can be done with the code
    • Public Domain, Permissive, Copyleft

The right selection may depend on your organisation and/or funder.

See choosealicense.com and the OSI list of licenses for more information.

GitHub and GitLab contain helpers to easily create popular licenses.

Exercise - License

Add a license to our pyndulum code.


We will use the online helper features of GitHub or GitLab to choose and add a License.


Once you have done this don’t forget to run:

git pull <repo> <branch>

from your local copy to get these changes locally before you make further updates.

.gitignore

It is a good idea to add a .gitignore file to your projects:

  • A list of file patterns that will be skipped over by git.
  • Makes it easier for us to see through to what is important.
  • Used for:
    • junk that shouldn’t be in there - the infamous .DS_Store
    • build files - mycode.o or mymodule.mod etc.
    • large files - 50_year_run.nc or my_thesis.pdf etc.
    • keeping sensitive information out of public view

Again, GitHub and GitLab contain helpers and templates to create .gitignore.
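
To give a feel for how ignore patterns pick out files, the sketch below approximates the simplest kind of .gitignore rule (a pattern with no slash, matched against the file name at any depth) using Python's fnmatch. The pattern list and the is_ignored helper are hypothetical; real .gitignore matching has more rules (negation with !, directory-only patterns, ** and so on), so treat this as a rough model rather than git's behaviour.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical patterns, one per line as they would appear in a .gitignore.
IGNORE_PATTERNS = [".DS_Store", "*.o", "*.mod", "*.nc"]


def is_ignored(path, patterns=IGNORE_PATTERNS):
    """Approximate check: does the file name match any slash-free pattern?"""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


for path in ["build/mycode.o", "data/50_year_run.nc", ".DS_Store", "pendulum_equations.py"]:
    print(path, "ignored" if is_ignored(path) else "tracked")
```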

Exercise - .gitignore

Add a .gitignore to the pyndulum code.


We will use the online helper features of GitHub or GitLab to set up a basic gitignore file for a Python code.


Again, once you have done this don’t forget to run:

git pull <repo> <branch>

from your local copy to get these changes locally before you make further updates.

Git Workflow

Issues

Both GitHub and GitLab have methods for tracking issues.

These are useful for organising work.

  • managing separate tasks
  • logging new issues as they arise
  • tracking and scheduling development

Example issue log on GitHub: Cambridge-ICCS/FTorch

Exercise - Issues

It would enhance the pyndulum code if we added functions to calculate:

  • pendulum total energy, and
  • pendulum length from desired period.


We will open issues for these on the online repository.

Branches

So far we have been using the main branch in everything we do.


Our commits look something like this:

    %%{init: {'theme': 'dark',
              'gitGraph': {'rotateCommitLabel': true},
              'themeVariables': {
                  'commitLabelBackground': '#bbbbbb',
                  'commitLabelColor': '#ffffff'
    } } }%%
    gitGraph
       commit id: "1-ad4e"
       commit id: "4-ff6b"
       commit id: "0-fd7f"
       commit id: "1-2y4f Improve README"
       commit id: "4-664e Add a LICENSE file"
       commit id: "6-d3et Add a .gitignore from template"

But what if:

  • Someone else is modifying the same files as us?
  • We are working on different aspects/features of the project in parallel?
  • We find a bug and need to quickly fix it?

Branches

Branches help with all of the aforementioned situations, and are a sensible way to organise your work even if you are the only contributor.

    %%{init: {'theme': 'dark',
              'gitGraph': {'rotateCommitLabel': true},
              'themeVariables': {
                  'commitLabelBackground': '#bbbbbb',
                  'commitLabelColor': '#ffffff'
    } } }%%
    gitGraph
       commit id: "4-ff6b"
       commit id: "0-fd7f"
       commit id: "fea 1.a"
       commit id: "fea 1.b"
       commit id: "fea 1.c"
       commit id: "fea 1.d"
       commit id: "5-af6f"

Conduct development in branches and merge into main when completed:

    %%{init: {'theme': 'dark',
              'gitGraph': {'rotateCommitLabel': true},
              'themeVariables': {
                  'commitLabelBackground': '#bbbbbb',
                  'commitLabelColor': '#ffffff'
    } } }%%
    gitGraph
       commit id: "4-ff6b"
       commit id: "0-fd7f"
       branch feature
       commit id: "fea 1.a"
       commit id: "fea 1.b"
       commit id: "fea 1.c"
       commit id: "fea 1.d"
       checkout main
       merge feature
       commit id: "5-af6f"

  • git branch <branchname>
    • Creates new branch branchname from current point
  • git checkout <branchname>
    • Move to branch branchname
    • Updates local files - beware
  • git merge <branchname>
    • Tie the branchname branch into the current checked out branch with a merge commit.
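
One way to picture these three commands is as operations on a graph of commits, where a branch is just a named pointer to a commit. The toy model below is our own illustration (ToyRepo and its methods are not git's API or data structures). It always records a merge commit with two parents, whereas real git may instead fast-forward when the current branch has no new commits of its own.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    commits: dict = field(default_factory=dict)   # commit id -> tuple of parent ids
    branches: dict = field(default_factory=dict)  # branch name -> commit id
    head: str = "main"                            # currently checked out branch

    def commit(self, cid):
        # A new commit points back at the current tip, then the branch moves on.
        parent = self.branches.get(self.head)
        self.commits[cid] = (parent,) if parent else ()
        self.branches[self.head] = cid

    def branch(self, name):
        # A new branch is a second pointer to the current commit.
        self.branches[name] = self.branches[self.head]

    def checkout(self, name):
        self.head = name

    def merge(self, name, cid):
        # A merge commit has two parents: the current tip and the other branch's tip.
        self.commits[cid] = (self.branches[self.head], self.branches[name])
        self.branches[self.head] = cid


repo = ToyRepo()
repo.commit("4-ff6b")
repo.commit("0-fd7f")
repo.branch("feature")
repo.checkout("feature")
for cid in ["fea 1.a", "fea 1.b", "fea 1.c", "fea 1.d"]:
    repo.commit(cid)
repo.checkout("main")
repo.merge("feature", "merge")
print(repo.commits["merge"])
print(repo.branches)
```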

Branches

This comes into its own when working concurrently on different features.
git is not just about backups – it is about project organisation.

This way danger and obscurity lie:

    %%{init: {'theme': 'dark',
              'gitGraph': {'rotateCommitLabel': true},
              'themeVariables': {
                  'commitLabelBackground': '#bbbbbb',
                  'commitLabelColor': '#ffffff'
    } } }%%
    gitGraph
       commit id: "4-ff6b"
       commit id: "0-fd7f"
       commit id: "fea 1.a"
       commit id: "fea 1.b"
       commit id: "fea 2.a"
       commit id: "fea 1.c"
       commit id: "fea 2.b"
       commit id: "5-af6f"
       commit id: "1-ad4e"

This is manageable and understandable:

    %%{init: {'theme': 'dark',
              'gitGraph': {'rotateCommitLabel': true},
              'themeVariables': {
                  'commitLabelBackground': '#bbbbbb',
                  'commitLabelColor': '#ffffff'
    } } }%%
    gitGraph
       commit id: "4-ff6b"
       commit id: "0-fd7f"
       branch feature_1
       commit id: "fea 1.a"
       commit id: "fea 1.b"
       checkout main
       branch feature_2
       commit id: "fea 2.a"
       checkout feature_1
       commit id: "fea 1.c"
       checkout main
       merge feature_1
       checkout feature_2
       commit id: "fea 2.b"
       checkout main
       merge feature_2
       commit id: "5-af6f"
       commit id: "1-ad4e"

Branches

The examples so far have been quite simple, but this gives a good audiovisual example of the power of branches:

Exercise - Branches

We want to add functions to calculate pendulum length from a desired period, and pendulum total energy.


Create a branch locally and add the new length equation to pendulum_equations.py. Make sure you add and commit your changes!

Return to main and create another new branch to add the energy calculation. Again, commit your work.


Once you have done this use

git push <remote> <branch>

to push your work up to a remote feature branch.
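
If you are unsure what the two new functions might look like, here is one possible sketch for an ideal simple pendulum. The function names, argument order, and the value of g are our assumptions and may not match the conventions already used in pendulum_equations.py. The length formula comes from the small-angle period T = 2π√(L/g).

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def length_from_period(period, g=G):
    """Length (m) of a simple pendulum with the given small-angle period (s)."""
    # Rearranging T = 2*pi*sqrt(L/g) gives L = g * (T / (2*pi))**2
    return g * (period / (2.0 * math.pi)) ** 2


def total_energy(theta, omega, length, mass=1.0, g=G):
    """Total energy (J): kinetic plus potential, measured from the lowest point.

    theta is the angle from vertical (rad), omega the angular velocity (rad/s).
    """
    kinetic = 0.5 * mass * (length * omega) ** 2
    potential = mass * g * length * (1.0 - math.cos(theta))
    return kinetic + potential


print(length_from_period(2.0))
print(total_energy(theta=0.1, omega=0.0, length=1.0))
```

Putting each function on its own branch, as in the exercise, means either one can be reviewed and merged independently of the other.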

Aside - Commit Messages

Merge/Pull Requests

  • Another feature of GitHub/GitLab.

  • A friendlier, graphical way of merging branches

  • Can be linked to GitHub/GitLab issues

  • A method of tracking progress

Merge/Pull Requests

When opening a request you should include:

  • A description of what you have done
  • Any points to be particularly aware of
  • Checkboxes for required/ongoing tasks

Exercise - Merge/Pull requests

From the branches you pushed up in the previous exercise open pull requests either:

  • into the main branch of your fork, or
  • back to the main branch of my repository.

Code Review

Code review is not:

  • just for ‘real’ software
  • a chance to feel bad about your code

Code review is:

  • a chance to reflect on what you wrote,
  • a chance to spot bugs - we all make them!
  • testing that someone else can understand your code,
  • guarding against laziness,
  • a method to improve quality and reusability,
  • a chance to learn.

Code Review

Again, GitHub and GitLab have nice infrastructure to make this an effective and visual process.

Anyone can conduct a code review on a public repository.
If working alone ask colleagues for help and return the favour.


Do:

  • remember who the person you are reviewing is
  • explain your reasons for requests
  • praise good code, not just point out errors

Do not:

  • impose preferences
  • nitpick excessively

Exercise - Code Review

We will work through the two pull requests we opened and perform a code review before merging the work.


If anyone in the audience opened a pull request back to my repository and would like to volunteer we can review their code!

Closing

Summary

git is not just a series of backups, it is a project management system.


  • Improve your repositories:
    • README.md
    • LICENSE
    • .gitignore
  • Use branches:
    • Separate workflows
    • Organise project
  • Make full use of GitLab/GitHub features:
    • Helper tools
    • Issues
    • Pull Requests
    • Code review:
      • Learn, spot bugs, improve reusability

Where can I learn more?

  • GitButler’s 2024 FOSDEM talk “so you think you know git?”:

Thanks

References

Evans, J. 2024. “Do We Think of Git Commits as Diffs, Snapshots, and/or Histories?” https://jvns.ca/blog/2024/01/05/do-we-think-of-git-commits-as-diffs--snapshots--or-histories/.
Mukerjee, A. 2024. “Unpacking Git Packfiles.” https://codewords.recurse.com/issues/three/unpacking-git-packfiles.

Plus other links throughout the slides.