Git and libgit2
Mike Slinn

Partial Clone With Sparse Checkout

Published 2023-03-30. Last modified 2023-06-02.
Time to read: 4 minutes.

This page is part of the git collection.

If you have ever needed to work on a relatively small portion of a large git repository, you know how slow things can get, and how problems arise with large files and directories. Two new features, partial clone and sparse checkout, can be used together to dramatically speed things up. Also, significantly less storage will be required on your computing device!

Git added a partial clone feature in version 2.24, via git clone --filter. Git’s sparse checkout feature became user-friendly in version 2.25 with the addition of the git sparse-checkout and git clone --sparse porcelain commands.

Definitions

By default, git repositories have up to 3 copies of every file. Copies can exist in git’s:

  1. Working tree (also known as the working directory) – this is where you edit files that you are currently working on. The working tree consists of the contents of .git/.., which is the parent directory of the .git directory. The contents of the .git directory are not part of the working tree.
  2. Index (also known as the staging area, or the cache in older documentation) – stored in .git/index. When you run git add or git commit -a, a new snapshot of your working tree is saved to the index.
  3. Object database – stored in .git/objects. When you run git commit, the contents of the snapshots in the index are saved to the object database.
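Here is a minimal sketch that follows one file through those three places. It runs in a scratch repository; the names demo and notes.txt, and the Demo identity passed to git commit, are placeholders that I made up for illustration.

```shell
# Scratch repository; demo and notes.txt are placeholder names.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"

# 1. Working tree: an ordinary file next to the .git directory.
echo 'hello' > notes.txt

# 2. Index: git add records the file in .git/index.
git add notes.txt
staged=$(git ls-files --stage)   # mode, object ID, stage number, path

# 3. Object database: after git commit, the snapshot is reachable from HEAD.
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'Add notes'
committed=$(git cat-file -p HEAD:notes.txt)

echo "index entry:       $staged"
echo "committed content: $committed"
```

git ls-files --stage shows what the index holds, and git cat-file -p reads the file back out of the object database without touching the working tree.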

If you want to work on a subdirectory of a large git project, you may not want to have the entire project’s repository on your device. A partial clone, combined with the git sparse checkout feature, allows you to work on just the subdirectory of interest in your repository.

Standalone Sparse Checkout

By itself, sparse checkout only affects the working tree, and hence the index. In contrast, git’s object database is by default complete.

Sparse checkout means that for this local repository, only selected portions of the repository’s object database are instantiated in the working tree.

When you git push from a sparse clone to a remote repository such as origin, the snapshots contained in the local repository’s entire object database are copied to the remote repository.

The integrity of the entire original repo is maintained. If someone else checks out the new repository without performing the sparse checkout procedure, their working tree will be populated from the complete contents of the original repository’s object database.
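The following sketch shows that behavior on a small local repository, so it needs no network access. The repository and directory names (origin-repo, full-clone, docs, src) are invented for the demonstration, and it assumes a git version that has the git sparse-checkout command (2.25 or later).

```shell
work=$(mktemp -d)
cd "$work"

# Source repository with two directories, docs/ and src/.
git init -q origin-repo
mkdir origin-repo/docs origin-repo/src
echo 'manual' > origin-repo/docs/guide.txt
echo 'code'   > origin-repo/src/main.txt
git -C origin-repo add .
git -C origin-repo -c user.name=Demo -c user.email=demo@example.com commit -q -m 'Initial commit'

# Ordinary (complete) clone, then restrict the working tree to docs/.
git clone -q origin-repo full-clone
cd full-clone
git sparse-checkout set docs

# The working tree no longer contains src/main.txt ...
if [ -e src/main.txt ]; then in_tree=yes; else in_tree=no; fi

# ... but its blob is still present in the local object database.
if git cat-file -e HEAD:src/main.txt; then in_odb=yes; else in_odb=no; fi

echo "src/main.txt in working tree: $in_tree, in object database: $in_odb"
```

git cat-file -e only tests whether an object exists, which makes it a handy way to check that sparse checkout by itself did not remove anything from the object database.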


As of the date this was written (2023-06-02), the git-sparse-checkout command was still marked experimental. The features and syntax have changed significantly since it was first proposed.

The git sparse-checkout init subcommand is now deprecated and no longer recommended. Non-cone mode is also deprecated. Cones are described in the git-sparse-checkout documentation.

Partial Clones

Partial clones work by specifying a filter that limits which objects are fetched. In the following examples, <repo> stands for the URL of a remote repository:

Shell
$ # omit all blobs
$ git clone --filter=blob:none <repo>

$ # omit blobs larger than 1 MB
$ git clone --filter=blob:limit=1m <repo>

By default, partial clones retrieve missing objects when the user attempts to access them. Thus, a partial clone will grow larger over time unless sparse checkout is used in conjunction with a partial clone.
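To watch a partial clone grow, here is a sketch that uses a local repository as a stand-in for <repo>. The names server, partial, a.txt and b.txt are made up. The two uploadpack settings are my assumption about what a local stand-in needs in order to honor --filter and serve single objects; a hosting service such as GitHub has its own configuration.

```shell
work=$(mktemp -d)
cd "$work"

# Local repository standing in for <repo>.
git init -q server
echo 'one' > server/a.txt
echo 'two' > server/b.txt
git -C server add .
git -C server -c user.name=Demo -c user.email=demo@example.com commit -q -m 'Initial commit'
# Assumption: let this stand-in honor --filter and serve objects on demand.
git -C server config uploadpack.allowFilter true
git -C server config uploadpack.allowAnySHA1InWant true

# --no-checkout keeps git from fetching blobs to populate the working tree.
git clone -q --no-checkout --filter=blob:none "file://$work/server" partial
cd partial

# Objects that are referenced but not stored locally are printed with a leading '?'.
missing_before=$(git rev-list --objects --missing=print HEAD | grep -c '^?')

# Reading a file's content makes git fetch that blob from the remote.
content=$(git cat-file -p HEAD:a.txt)
missing_after=$(git rev-list --objects --missing=print HEAD | grep -c '^?')

echo "missing objects: $missing_before before, $missing_after after reading a.txt"
```

The count of missing objects should drop after a.txt is read, which is the growth described above: every blob you touch gets downloaded and stays in the local object database.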

Sparse checkout lets you restrict which files and directories git populates in the working tree. When sparse checkout is used with a partial clone, the two features work together: the working tree is smaller, and the git object database is smaller too, because only the objects needed for the requested paths are fetched from the remote repository, on demand.
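The combination can be tried out locally with the sketch below. As before, server, combined, docs and src are placeholder names, and the uploadpack settings are an assumption about what a local stand-in for a remote needs.

```shell
work=$(mktemp -d)
cd "$work"

# Local repository standing in for the remote.
git init -q server
mkdir server/docs server/src
echo 'manual' > server/docs/guide.txt
echo 'code'   > server/src/main.txt
echo 'readme' > server/README.txt
git -C server add .
git -C server -c user.name=Demo -c user.email=demo@example.com commit -q -m 'Initial commit'
# Assumption: let this stand-in honor --filter and serve objects on demand.
git -C server config uploadpack.allowFilter true
git -C server config uploadpack.allowAnySHA1InWant true

# Partial clone plus sparse checkout, then ask for docs/ only.
git clone -q --filter=blob:none --sparse "file://$work/server" combined
cd combined
git sparse-checkout set docs

# src/ is absent from the working tree ...
if [ -e src ]; then src_in_tree=yes; else src_in_tree=no; fi

# ... and its blob was never downloaded, so it is listed as missing ('?').
missing=$(git rev-list --objects --missing=print HEAD | grep -c '^?')

echo "src in working tree: $src_in_tree, objects still missing: $missing"
```

Only the blobs for the paths that were checked out are fetched; the blob for src/main.txt stays on the remote until something asks for it.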


Case Study

The project I wanted to work on was Sinatra-ActiveRecord and I wanted to play with the sample project for sqlite. The sample project was very small (too small to be useful, actually!), so it made no sense to fill my computing device with an overly large repository.

I wanted to eventually create two git remotes:

  • upstream – pointing to the original git repo, sinatra-activerecord/sinatra-activerecord.
  • origin – pointing to a new repo in my GitHub account that will contain the complete original repo's contents and history, plus my changes. This repo will be called mslinn/sinatra-activerecord-sqlite.

In the following command, notice how I used the --origin option to name the remote upstream, instead of accepting the default name, origin.

Shell
$ git clone \
  --filter=blob:none \
  --origin upstream \
  --sparse \
  https://github.com/sinatra-activerecord/sinatra-activerecord/
Cloning into 'sinatra-activerecord'...
remote: Enumerating objects: 1020, done.
remote: Counting objects: 100% (145/145), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 1020 (delta 41), reused 123 (delta 38), pack-reused 875
Receiving objects: 100% (1020/1020), 131.40 KiB | 3.86 MiB/s, done.
Resolving deltas: 100% (245/245), done.
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 9 (delta 0), reused 0 (delta 0), pack-reused 3
Receiving objects: 100% (9/9), 6.15 KiB | 6.15 MiB/s, done. 

$ cd sinatra-activerecord/

The --sparse option in the above git clone command limited the initial working tree to the top-level files, and the --filter=blob:none option meant that only the blobs needed for those files were downloaded; that is the second, smaller fetch in the output above. The working tree would have looked the same if --filter=tree:0 had been used instead of --filter=blob:none. At this point, the only items in the working tree are the top-level files:

Shell
$ ls -aF1
./
../
.git/
.gitignore
Appraisals
CHANGELOG.md
CONTRIBUTING.md
Gemfile
LICENSE
README.md
Rakefile
sinatra-activerecord.gemspec 

Now we can ask for just the portions of the repository that interest us. Notice that a checkout happens right after the git-sparse-checkout set command. Directories specified by git-sparse-checkout must not have a leading slash.

Shell
$ git sparse-checkout set example/sqlite
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 14 (delta 0), reused 0 (delta 0), pack-reused 13
Receiving objects: 100% (14/14), 2.36 KiB | 2.36 MiB/s, done.
Resolving deltas: 100% (1/1), done. 

$ git sparse-checkout list
example/sqlite 
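The selection is not fixed once it has been set. The sketch below, which runs against a throwaway local repository with made-up names (source-repo, clone, docs, src, tests), uses git sparse-checkout add to widen the selection and git sparse-checkout disable to go back to a full working tree.

```shell
work=$(mktemp -d)
cd "$work"

# Throwaway repository with three directories.
git init -q source-repo
mkdir source-repo/docs source-repo/src source-repo/tests
echo 'manual' > source-repo/docs/guide.txt
echo 'code'   > source-repo/src/main.txt
echo 'checks' > source-repo/tests/run.txt
git -C source-repo add .
git -C source-repo -c user.name=Demo -c user.email=demo@example.com commit -q -m 'Initial commit'

git clone -q source-repo clone
cd clone

git sparse-checkout set docs     # working tree: docs/
git sparse-checkout add src      # working tree: docs/ and src/
listed=$(git sparse-checkout list)
if [ -e tests ]; then tests_before=present; else tests_before=absent; fi

git sparse-checkout disable      # working tree: every directory again
if [ -e tests/run.txt ]; then tests_after=present; else tests_after=absent; fi

echo "selected: $(echo "$listed" | tr '\n' ' ')"
echo "tests/ before disable: $tests_before, after disable: $tests_after"
```

Note that set replaces the whole selection, while add extends it.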

Here are the files and directories that I just sparsely checked out from the repo:

Shell
$ ls -af example/sqlite/
README.md  ./  config/  app.rb  Gemfile  config.ru  bin/  ../  Rakefile  db/ 

Next I used the GitHub CLI to create a repo in my GitHub account for containing the complete repo, along with my modifications. This command created a remote called origin, which points at the GitHub repo that was just created.

Shell
$ gh repo create --public --source=. --remote=origin
✓ Created repository mslinn/sinatra-activerecord-sqlite on GitHub
✓ Added remote git@github.com:mslinn/sinatra-activerecord-sqlite.git 
😁

The above gh repo create command automatically names the repo from the current directory name.
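At this point the clone should have two remotes, upstream and origin. In the real clone, git remote -v is all that is needed to check. The sketch below recreates the same layout by hand with git remote add in a scratch repository, purely so that it can run without network access or a GitHub account; the directory name demo is a placeholder.

```shell
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"

# Same two remotes that git clone --origin upstream and gh repo create set up.
git remote add upstream https://github.com/sinatra-activerecord/sinatra-activerecord/
git remote add origin git@github.com:mslinn/sinatra-activerecord-sqlite.git

# List every remote with its fetch and push URLs.
remotes=$(git remote -v)
origin_url=$(git remote get-url origin)

echo "$remotes"
```

With this layout, changes from the original project come in from upstream, and my own work is pushed to origin.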

I do this so often that I defined 2 bash aliases in ~/.bash_aliases:

Shell
alias gh_new_private='gh repo create --private --source=. --remote=origin'
alias gh_new_public='gh repo create --public --source=. --remote=origin'
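If you would rather have one command than two, a shell function can take the visibility as an argument. This is only a sketch: gh_new is a name I made up, and it passes along the same gh repo create options as the aliases above.

```shell
# gh_new is a hypothetical helper. Usage: gh_new public|private
gh_new() {
  case "${1:-}" in
    public|private)
      # Same options as the aliases; only the visibility flag varies.
      gh repo create "--$1" --source=. --remote=origin
      ;;
    *)
      echo "usage: gh_new public|private" >&2
      return 2
      ;;
  esac
}
```

For example, gh_new private would behave like the gh_new_private alias. Any other argument prints the usage message and returns without calling gh.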



