Published 2023-03-11.
Last modified 2023-03-20.
When working as a software expert witness,
I normally use several programs to analyse git repositories,
including git itself,
and enhancement programs such as git-fame.
Sometimes I write specialized programs for analysing certain aspects of git repos.
For simple tasks, bash might be a good choice.
More complex tasks are better implemented in Python (using pygit2),
or Ruby (using rugged).
libgit2
Both pygit2 and rugged are built on the libgit2 API.
Because libgit2 is implemented in the C language,
it is compatible with C++.
Other languages also have libraries that provide bindings for libgit2,
for example .NET, Node, and Julia.
Java has several libgit2 wrapper libraries, including jagged and Git24J;
they use Java Native Interface (JNI) to invoke libgit2.
JGit, by contrast, is a pure-Java implementation of git rather than a libgit2 wrapper.
Java is notorious for poor performance and memory safety issues when invoking C libraries via JNI.
As a result, Java wrappers for libgit2 are not widely used.
Java’s Project Panama is still evolving;
perhaps one day a better Java wrapper for libgit2 will emerge.
GitHub, GitLab and Azure DevOps are all built on libgit2.
You can examine the GitLab source code to see for yourself.
Low- to High-Level User Interfaces
libgit2 exposes git’s low-level interface,
which I discussed in Low Level Git Commands (‘Plumbing Internals’).
If you are unfamiliar with git’s low-level plumbing,
working with libgit2 and its language bindings will probably be confusing and frustrating.
Terminology
diff delta patch hunk line
The libgit2 API defines a hierarchy of terms.
Knowing these definitions greatly helps one understand how to work with libgit2,
and language bindings to that API.
A diff consists of deltas,
which contain patches,
which contain hunks,
which contain lines.
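The nesting described above can be pictured with a small, self-contained Ruby sketch. The Struct definitions, the DiffSketch module and the each_line_with_path method are hypothetical stand-ins invented for illustration; they are not the Rugged classes and do not call libgit2. They only show how a program might walk from a diff down to its lines.

```ruby
# Hypothetical model of the diff > delta > patch > hunk > line nesting.
# These are NOT the Rugged classes; they only illustrate the shape of the data.
module DiffSketch
  Line  = Struct.new(:content)
  Hunk  = Struct.new(:header, :lines)
  Patch = Struct.new(:hunks)
  Delta = Struct.new(:path, :patches)
  Diff  = Struct.new(:deltas)

  # Walk the hierarchy from the outside in, and collect every line
  # together with the path of the delta that it belongs to.
  def self.each_line_with_path(diff)
    result = []
    diff.deltas.each do |delta|
      delta.patches.each do |patch|
        patch.hunks.each do |hunk|
          hunk.lines.each { |line| result << [delta.path, line.content] }
        end
      end
    end
    result
  end

  # Made-up sample data: one file, one patch, one hunk, three lines.
  SAMPLE = Diff.new([
    Delta.new('README.md', [
      Patch.new([
        Hunk.new("@@ -1,2 +1,2 @@\n", [
          Line.new("# Title\n"),
          Line.new("old text\n"),
          Line.new("new text\n")
        ])
      ])
    ])
  ])
end

DiffSketch.each_line_with_path(DiffSketch::SAMPLE).each do |path, content|
  puts "#{path}: #{content}"
end
```

A real program would obtain these objects from a language binding instead of building them by hand, but the loop structure would probably look similar.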
The terms are defined below, along with the names of the Ruby classes that implement them. I have paraphrased the documentation where appropriate.
- diff (Rugged::Diff) -
  A diff represents the cumulative list of differences between two snapshots of a repository, possibly filtered by a set of file name patterns.
  A diff contains a list of deltas.
- delta (Rugged::Diff::Delta) -
  A delta contains a description of changes to one file or rename operation.
  It might also contain helpful information about the entry if you request it.
  This optional information includes a similarity score and a binary flag.
  A delta contains one or two hashes for a changed file, defining the old_file and new_file characteristics. Although the two sides of the delta are named old_file and new_file, they may actually correspond to entries that represent a file, a symbolic link, a submodule commit id, or a tree if you are tracking type changes or ignored/untracked directories.
  The primary accessors are:
  - new_file (absent if the file was deleted).
  - old_file (absent if the file was just created).
  - status is a symbol, like :modified.
  - git_diff_find_similar(). See the libgit2 documentation for more information.
  - binary indicates if this is a binary file.
  - similarity score.
- patch (Rugged::Patch) -
  A patch contains a list of hunks.
- hunk (Rugged::Diff::Hunk) -
  A hunk contains a list of modified lines in a diff, along with context, resulting from a single change in a file. You can configure the amount of context and other properties of how hunks are generated. Each hunk includes a header that describes where it starts and ends in both the old and new versions in the delta.
  A hunk.header is a String that summarizes other hunk properties, and might look like "@@ -1,16 +1,8 @@\n".
  See the libgit2 documentation for information about hunk properties.
- line (Rugged::Diff::Line) -
  A line is a portion of the data within a hunk. For text files, a line is simply a line of the hunk text; for binary files, a line is a data span.
  The encoding of data in the file being diffed is not known, so line content can only be parsed after first examining the actual file.
  Also, line data will not be NUL-byte terminated, because it just consists of a span of bytes inside a file.
  The line_origin property has type Symbol, and can have the following values:
  - :context - context lines exist in both the old and new versions. The context? method returns true if line_origin has value :context.
  - :addition - lines only exist in the new version. The addition? method returns true if line_origin has value :addition.
  - :deletion - lines only exist in the old version. The deletion? method returns true if line_origin has value :deletion.
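To make the hunk header format more concrete, here is a minimal Ruby sketch that parses a header such as "@@ -1,16 +1,8 @@\n" and checks a hunk body against it. It uses only the Ruby standard library and does not call Rugged or libgit2. The names HUNK_HEADER, parse_hunk_header and count_hunk_lines are hypothetical helpers invented for this illustration; the hash keys were chosen to mirror the old_start, old_lines, new_start and new_lines fields of libgit2's git_diff_hunk struct. The sketch assumes standard unified diff text, where a leading space marks context, '-' marks a line only in the old version, and '+' marks a line only in the new version.

```ruby
# Hypothetical helpers for illustration; not part of Rugged or libgit2.

# Matches headers such as "@@ -1,16 +1,8 @@\n".
# Each line count is optional in unified diff format and defaults to 1.
HUNK_HEADER = /\A@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@/

# Returns the start line and line count for the old and new versions.
def parse_hunk_header(header)
  match = HUNK_HEADER.match(header)
  raise ArgumentError, "not a hunk header: #{header.inspect}" unless match

  {
    old_start: match[1].to_i,
    old_lines: (match[2] || '1').to_i,
    new_start: match[3].to_i,
    new_lines: (match[4] || '1').to_i
  }
end

# Counts how many lines of a hunk body belong to the old and new versions.
# Context lines count on both sides; other prefixes are ignored.
def count_hunk_lines(body)
  counts = { old_lines: 0, new_lines: 0 }
  body.each_line do |text|
    case text[0]
    when ' '
      counts[:old_lines] += 1
      counts[:new_lines] += 1
    when '-'
      counts[:old_lines] += 1
    when '+'
      counts[:new_lines] += 1
    end
  end
  counts
end

p parse_hunk_header("@@ -1,16 +1,8 @@\n")
p count_hunk_lines(" kept\n-dropped\n kept too\n")
```

For the example header, the old version of the hunk starts at line 1 and spans 16 lines, while the new version starts at line 1 and spans 8 lines. When working through a binding such as Rugged, these values would normally be read from the hunk object rather than parsed from the header string.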