Published 2023-03-30.
Last modified 2023-06-02.
Time to read: 4 minutes.
If you have ever needed to work on a relatively small portion of a large git repository, you know how slow things can get, and how problems arise with large files and directories. Two new features, partial clone and sparse checkout, can be used together to dramatically speed things up. Also, significantly less storage will be required on your computing device!
Git added a partial clone feature in version 2.24, via git clone --filter.
Git’s sparse checkout feature became user-friendly in version 2.25 with the addition of the git sparse-checkout and git clone --sparse porcelain commands.
Definitions
By default, git repositories have up to 3 copies of every file. Copies can exist in git’s:
- Working tree (also known as the working directory): this is where you edit files that you are currently working on. The working tree consists of the contents of .git/.., which is the parent directory of the .git directory. The contents of the .git directory are not part of the working tree.
- Index (also known as the staging area, or the cache in older documentation): stored in .git/index. When you run git add or git commit -a, a new snapshot of your working tree is saved to the index.
- Object database: stored in .git/objects. When you run git commit, the contents of the snapshots in the index are saved to the object database.
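The three locations can be observed directly. The following is a minimal sketch, assuming a throwaway repository in a temporary directory; the file name greeting.txt and the Demo identity are made up for illustration.

```shell
#!/bin/sh
# Sketch: follow one file through the working tree, the index and the object database.
set -e
demo=$(mktemp -d)              # throwaway directory for a hypothetical demo repo
cd "$demo"
git init -q .

echo 'hello' > greeting.txt    # 1. the file exists only in the working tree
git add greeting.txt           # 2. the index (.git/index) now lists the file
staged=$(git ls-files --stage greeting.txt)
echo "index entry: $staged"

# 3. the commit records the staged snapshot as tree and commit objects in .git/objects
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'Add greeting'
committed=$(git cat-file -p HEAD:greeting.txt)
echo "object database copy: $committed"
```

git ls-files --stage reads the index, while git cat-file -p reads the object database, so neither command depends on the file still being present in the working tree.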
If you want to work on a subdirectory of a large git project, you may not want to have the entire project’s repository on your device. A partial clone, combined with git's sparse checkout feature, allows you to work on just the subdirectory of interest in your repository.
Standalone Sparse Checkout
By itself, sparse checkout only affects the working tree, and hence the index.
In contrast, git’s object database is by default complete.
Sparse checkout means that for this local repository,
only selected portions of the repository’s object database are instantiated in the working tree.
When you git push from a sparse clone to a remote repository such as origin, the snapshots contained in the local repository’s entire object database are copied to the remote repository.
The integrity of the entire original repo is maintained. If someone else clones the new repository without performing the sparse checkout procedure, their working tree will be populated from the complete contents of the original repository’s object database.
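To see that sparse checkout by itself leaves the object database complete, here is a small sketch that runs entirely on the local disk. The directory names docs and app, the temporary paths and the Demo identity are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: sparse checkout on an ordinary (full) clone narrows only the working tree.
set -e
work=$(mktemp -d)

# Build a tiny source repository with two directories (hypothetical names).
git init -q "$work/src"
cd "$work/src"
mkdir docs app
echo 'docs' > docs/readme.txt
echo 'app' > app/main.txt
git add .
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'Initial commit'

# Ordinary full clone, then restrict the working tree to docs/.
git clone -q "$work/src" "$work/full"
cd "$work/full"
before=$(git rev-list --objects --all | wc -l | tr -d ' ')
git sparse-checkout set docs
after=$(git rev-list --objects --all | wc -l | tr -d ' ')

ls    # app/ is no longer present in the working tree
echo "reachable objects before: $before, after: $after"
```

The object counts should match, because git sparse-checkout set only rewrites the working tree and the index; nothing is removed from .git/objects.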
As of the date this was written (2023-06-02), the git-sparse-checkout command was still marked experimental.
The features and syntax have changed significantly since it was first proposed.
The git sparse-checkout init subcommand is now deprecated and no longer recommended.
Non-cone mode is also deprecated.
Read about cones here.
Partial Clones
Partial clones work by specifying a filter that limits which objects are fetched. In the following examples, <repo> stands for the URL of a remote repository:
$ # omit all blobs
$ git clone --filter=blob:none <repo>
$ # omit blobs larger than 1 MB
$ git clone --filter=blob:limit=1m <repo>
By default, partial clones retrieve missing objects when the user attempts to access them. Thus, a partial clone will grow larger over time unless sparse checkout is used in conjunction with a partial clone.
Sparse checkouts allow you to restrict the files and directories that git retrieves from the remote repository. When sparse checkout is used with partial cloning, the two features work together: not only is the working tree smaller, but the git object database is also smaller, because only the requested objects are fetched from the remote repository, on demand.
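The combined effect can be tried without network access. This is a hedged sketch, not a benchmark: it clones a tiny local repository through a file:// URL so that the filter is honored, and the names docs, app and top.txt are invented for the demonstration. The two uploadpack settings are something the serving side must allow; hosted services such as GitHub normally permit filtering already.

```shell
#!/bin/sh
# Sketch: partial clone plus sparse checkout leaves unneeded blobs on the remote.
set -e
work=$(mktemp -d)

# Hypothetical source repository standing in for a large remote.
git init -q "$work/src"
cd "$work/src"
mkdir docs app
echo 'docs' > docs/readme.txt
echo 'app' > app/main.txt
echo 'top' > top.txt
git add .
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'Initial commit'
# The serving repository has to opt in to object filtering and on-demand fetches.
git config uploadpack.allowFilter true
git config uploadpack.allowAnySHA1InWant true

# A file:// URL uses the regular transport, unlike a plain local path.
git clone -q --filter=blob:none --sparse "file://$work/src" "$work/partial"
cd "$work/partial"
git sparse-checkout set docs

# rev-list prints objects that are referenced but not downloaded with a leading '?'.
missing=$(git rev-list --objects --all --missing=print | grep -c '^?' || true)
echo "blobs not downloaded: $missing"
```

Any blob outside the sparse checkout definition, such as app/main.txt here, stays on the remote until a command actually needs its contents.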
Case Study
The project I wanted to work on was Sinatra-ActiveRecord, and I wanted to play with the sample project for sqlite.
The sample project was very small (too small to be useful, actually!), so it made no sense to fill my computing device with an overly large repository.
I wanted to eventually create two git remotes:
- upstream: pointing to the original git repo, sinatra-activerecord/sinatra-activerecord.
- origin: pointing to a new repo in my GitHub account that will contain the complete original repo's contents and history, plus my changes. This repo will be called mslinn/sinatra-activerecord-sqlite.
In the following command, notice how I used the --origin option to name the upstream remote, instead of using the default name, origin.
$ git clone \
    --filter=blob:none \
    --origin upstream \
    --sparse \
    https://github.com/sinatra-activerecord/sinatra-activerecord/
Cloning into 'sinatra-activerecord'...
remote: Enumerating objects: 1020, done.
remote: Counting objects: 100% (145/145), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 1020 (delta 41), reused 123 (delta 38), pack-reused 875
Receiving objects: 100% (1020/1020), 131.40 KiB | 3.86 MiB/s, done.
Resolving deltas: 100% (245/245), done.
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 9 (delta 0), reused 0 (delta 0), pack-reused 3
Receiving objects: 100% (9/9), 6.15 KiB | 6.15 MiB/s, done.
$ cd sinatra-activerecord/
The --sparse option in the above git clone command limited the population of the working tree to the top-level files, and the --filter=blob:none option meant that only the blobs needed for those files were downloaded.
The working tree would have looked the same if --filter=tree:0 had been used instead of --filter=blob:none.
The only items in the working tree are the top-level files at this point:
$ ls -aF1
./
../
.git/
.gitignore
Appraisals
CHANGELOG.md
CONTRIBUTING.md
Gemfile
LICENSE
README.md
Rakefile
sinatra-activerecord.gemspec
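One way to confirm what the clone options did is to read back the configuration that git stored. The sketch below reproduces the same three options against a tiny local repository, so it runs offline; the temporary paths, the file top.txt and the Demo identity are assumptions, and in the real clone above the same git remote and git config commands can be run directly.

```shell
#!/bin/sh
# Sketch: inspect what --filter, --origin and --sparse recorded in a fresh clone.
set -e
work=$(mktemp -d)

# Hypothetical stand-in for the remote repository.
git init -q "$work/src"
cd "$work/src"
echo 'top' > top.txt
git add .
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'Initial commit'
git config uploadpack.allowFilter true
git config uploadpack.allowAnySHA1InWant true

git clone -q --filter=blob:none --origin upstream --sparse "file://$work/src" "$work/clone"
cd "$work/clone"

remotes=$(git remote)                                            # remote names; --origin chose this one
filter=$(git config --get remote.upstream.partialclonefilter)    # filter reused by later fetches
sparse=$(git config --get core.sparseCheckout)                   # enabled by --sparse
echo "remote: $remotes, filter: $filter, sparse checkout: $sparse"
```

Because the filter is stored per remote, later fetches from upstream keep omitting blobs without the option having to be repeated.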
Now we can ask for just the portions of the repository that interest us.
Notice that a checkout happens right after the git-sparse-checkout set command.
Directories specified by git-sparse-checkout must not have a leading slash.
$ git sparse-checkout set example/sqlite
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 14 (delta 0), reused 0 (delta 0), pack-reused 13
Receiving objects: 100% (14/14), 2.36 KiB | 2.36 MiB/s, done.
Resolving deltas: 100% (1/1), done.
$ git sparse-checkout list
example/sqlite
Here are the files and directories that the sparse checkout just added to my working tree:
$ ls -af example/sqlite/
README.md  ./  config/  app.rb  Gemfile  config.ru  bin/  ../  Rakefile  db/
Next, I used the GitHub CLI to create a repo in my GitHub account to hold the complete repo, along with my modifications.
This command created a remote called origin, which points at the GitHub repo that was just created.
$ gh repo create --public --source=. --remote=origin
✓ Created repository mslinn/sinatra-activerecord-sqlite on GitHub
✓ Added remote git@github.com:mslinn/sinatra-activerecord-sqlite.git
The above gh repo create command automatically names the repo from the current directory name.
I do this so often that I defined 2 bash aliases in ~/.bash_aliases:
alias gh_new_private='gh repo create --private --source=. --remote=origin'
alias gh_new_public='gh repo create --public --source=. --remote=origin'