Reading a reddit comment a while ago, I learned about a somewhat “mysterious” git object which can be both present and absent at the same time! Well… not by the same definition of “presence”, but that spoils the fun, right? :P Let’s see this object in more detail.
If you want to check whether a given object exists in a git repository, you
can use git rev-parse. As the man page says:
--verify
Verify that exactly one parameter is provided, and that it can be
turned into a raw 20-byte SHA-1 that can be used to access the
object database. If so, emit it to the standard output; otherwise,
error out.
[...]
To make sure that $VAR names an existing object of any
type, git rev-parse "$VAR^{object}" can be used.
Let’s run an example inside the git.git repository:
$ git rev-parse --verify e83c5163316f89bfbde7d9ab23ca2e25604af290^{object}
e83c5163316f89bfbde7d9ab23ca2e25604af290
$ echo $?
0
Ok. What about a nonexistent object?
$ git rev-parse --verify 0000000000000000000000000000000000000000^{object}
fatal: Needed a single revision
$ echo $?
128
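For scripting, this check can be wrapped in a small function that only reports the exit status. This is a minimal sketch: object_exists is a name I made up, the throwaway repository is only there so the example runs anywhere, and git is assumed to be installed.

```shell
# object_exists is a hypothetical helper name; it wraps the check from the man page.
# Usage: object_exists <repo-dir> <object-name>
object_exists() {
    # --quiet silences the error message, so only the exit status matters.
    git -C "$1" rev-parse --verify --quiet "$2^{object}" >/dev/null
}

repo=$(mktemp -d)            # throwaway repository, just for the demo
git init --quiet "$repo"

if object_exists "$repo" 0000000000000000000000000000000000000000; then
    echo "present"
else
    echo "absent"
fi
```

In a freshly created repository the all-zeros hash is reported as absent, which matches the exit code 128 seen above.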
Hm, nothing out of the ordinary here. But the fun starts when we look at the
“mysterious” SHA1 hash 4b825dc642cb6eb9a060e54bf8d69288fbee4904. If we list all
objects in the git.git repository using git cat-file and grep for this
particular hash, we get nothing:
$ git cat-file --batch-check="%(objectname)" --batch-all-objects | \
grep 4b825dc642cb6eb9a060e54bf8d69288fbee4904
$ echo $?
1
However… rev-parse seems to disagree about the object’s presence:
$ git rev-parse --verify 4b825dc642cb6eb9a060e54bf8d69288fbee4904^{object}
4b825dc642cb6eb9a060e54bf8d69288fbee4904
$ echo $?
0
Hmm, what is happening here? Is the object there or not? Let’s go a bit further with a reduced test case:
$ git init /tmp/repo
Initialized empty Git repository in /tmp/repo/.git/
$ git -C /tmp/repo rev-parse --verify 4b825dc642cb6eb9a060e54bf8d69288fbee4904^{object}
4b825dc642cb6eb9a060e54bf8d69288fbee4904
$ echo $?
0
Wait, what? The just-created /tmp/repo repository clearly has no objects:
$ tree /tmp/repo/.git/objects
/tmp/repo/.git/objects
├── info
└── pack
2 directories, 0 files
Is rev-parse broken? Let’s try something else… Running git cat-file to print
all object hashes and grep for our target did not produce any result. But
cat-file can also be used to print metadata about a given list of hashes.
Let’s try that with our hash:
$ echo 4b825dc642cb6eb9a060e54bf8d69288fbee4904 | \
git -C /tmp/repo cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(objectsize:disk)'
4b825dc642cb6eb9a060e54bf8d69288fbee4904 tree 0 0
Hmmmm, so 4b825dc642 is a tree object of size 0, both on disk and
decompressed. Nevertheless, we saw that there are no objects on disk… I was
intrigued. Is this hardcoded somewhere in git? And if so, why?
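A side note on where the hash itself comes from: a git object ID is the hash of a small header (the object type, a space, the content size, and a NUL byte) followed by the content. For a tree with no entries the content is empty, so the ID should be just the SHA-1 of the header "tree 0\0". Here is a minimal sketch to check that by hand, assuming sha1sum from GNU coreutils is available (empty_tree_oid is a name I made up for the example):

```shell
# Sketch: recompute the empty tree's object ID without touching any repository.
# Assumption: sha1sum (GNU coreutils) is installed; on macOS, "shasum" is the usual substitute.
empty_tree_oid() {
    # A git object ID is the hash of "<type> <size>\0" followed by the content.
    # An empty tree has no content, so only the 7-byte header gets hashed.
    printf 'tree 0\0' | sha1sum | cut -d' ' -f1
}

empty_tree_oid
```

This prints 4b825dc642cb6eb9a060e54bf8d69288fbee4904, the same value that git hash-object -t tree /dev/null reports. So the hash is not arbitrary, but that still does not explain why git answers for an object that is not on disk.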
My first attempt to “uncover the mystery” was:
$ git -C git.git grep 4b825dc642
git-rebase--preserve-merges.sh:288: ptree=4b825dc642cb6eb9a060e54bf8d69288fbee4904
t/oid-info/hash-info:16:empty_tree sha1:4b825dc642cb6eb9a060e54bf8d69288fbee4904
t/t0015-hash.sh:26: grep 4b825dc642cb6eb9a060e54bf8d69288fbee4904 actual
The first match comes from a shell script which uses our mysterious hash
(a.k.a. the empty tree hash) as a fallback if a commit does not have a parent.
This is done to compare the hashes of a commit’s tree and its parent commit’s
tree to decide whether the commit is considered “empty” (i.e. its tree is the
same as the parent). The other two matches come from the test suite. Hmm, so no
hardcoded value in the actual object reading code? That’s curious… Let’s see
what gdb has to show us! Running
git cat-file -e 4b825dc642cb6eb9a060e54bf8d69288fbee4904 through the debugger,
we can see the following call chain:
cmd_cat_file()
cat_one_file()
repo_has_object_file()
...
do_oid_object_info_extended()
find_cached_object()
And at the end of find_cached_object() we have this code:
if (oideq(oid, the_hash_algo->empty_tree))
    return &empty_tree;
return NULL;
Aha! So we have this “empty tree” hash saved somewhere… Well, it turns out that
my git grep search did not find it because it is not defined in hex format!
See:
#define EMPTY_TREE_SHA1_BIN_LITERAL \
"\x4b\x82\x5d\xc6\x42\xcb\x6e\xb9\xa0\x60" \
"\xe5\x4b\xf8\xd6\x92\x88\xfb\xee\x49\x04"
#define EMPTY_TREE_SHA256_BIN_LITERAL \
"\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1" \
"\x04\xd4\x5d\x8d\x85\xef\xa9\xb0\x57\xb5" \
"\x3b\x14\xb4\xb9\xb9\x39\xdd\x74\xde\xcc" \
"\x53\x21"
The last thing to “uncover” is: why is this hash value hardcoded? Well, we can
find the explanation using git blame (or tig blame). The code at
find_cached_object() comes from the commit 346245a1bb ("hard-code the empty
tree object", 2008-02-13), which says:
commit 346245a1bb6272dd370ba2f7b9bf86d3df5fed9a
Author: Jeff King <peff@peff.net>
Date: Wed Feb 13 06:25:04 2008 -0500
hard-code the empty tree object
Now any commands may reference the empty tree object by its
sha1 (4b825dc642cb6eb9a060e54bf8d69288fbee4904). This is
useful for showing some diffs, especially for initial
commits.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
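To see what the commit message means by diffs for initial commits: a root commit has no parent to compare against, but it can be compared against the empty tree. Below is a minimal sketch, assuming git is installed and the repository uses SHA-1; the temporary repository, the README file, and the committer identity are all made up for the demo.

```shell
# Sketch: show a root commit as a diff against the empty tree.
# Assumptions: git is installed and creates SHA-1 repositories by default;
# the repository, file name, and identity below are made up for the demo.
EMPTY_TREE=4b825dc642cb6eb9a060e54bf8d69288fbee4904

repo=$(mktemp -d)
git init --quiet "$repo"
echo 'hello' > "$repo/README"
git -C "$repo" add README
git -C "$repo" -c user.name='Demo' -c user.email='demo@example.com' \
    commit --quiet -m 'initial commit'

# The root commit has no parent to diff against, so use the empty tree instead.
git -C "$repo" diff --stat "$EMPTY_TREE" HEAD
```

The diff lists README as a newly added file, even though the empty tree was never written to this repository’s object database.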
There we have it, mystery uncovered!
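In hindsight, the git grep search could have found the definition if it had looked for the escaped bytes of EMPTY_TREE_SHA1_BIN_LITERAL instead of the hex string. Here is a small sketch that rewrites a hex hash in that form; hex_to_c_literal is a hypothetical helper name, not something that exists in git.

```shell
# Sketch: rewrite a hex object ID the way it appears in a C byte-string literal.
# hex_to_c_literal is a hypothetical helper name, not part of git.
hex_to_c_literal() {
    # Prefix every pair of hex digits with a literal \x
    printf '%s\n' "$1" | sed 's/../\\x&/g'
}

hex_to_c_literal 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Since the macro splits the bytes over two source lines, searching for only the first few of them is more likely to match, for example with git -C git.git grep -F '\x4b\x82\x5d\xc6'.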
Ok, Ok… I definitely over-dramatized this process… But I find it quite interesting to run this kind of analysis! It helps better understand parts of a code base, reproduce bugs, or even find the reason why a certain function (or line of code) was written in a given way. So I decided to document this particular small adventure. I hope you also enjoyed the “Schrödinger’s object” :)
Till next time,
Matheus