We are approaching the end of GSoC. In fact, this is the final week! A working version of git-grep with parallel inflation was already sent and merged into pu. But threading is still disabled when --recurse-submodules or --textconv are given. So I've been working on a v2 to support grepping submodules and to try some more optimizations along the way. It didn't seem so complex at the beginning, but I quickly realized I was just at the tip of the iceberg. This was probably the most intense week in my GSoC. It went like this:
1) Submodule initializations
At first, I thought that to efficiently grep submodules in parallel I just had to internally protect these four functions called at the beginning of grep_submodule():
- is_submodule_active()
- repo_submodule_init()
- repo_read_gitmodules()
- add_to_alternates_memory()
Doing so, I would be able to remove the grep_read_mutex around them in favor of the internal object reading locks I added. This would let git grep --recurse-submodules --cached take full advantage of the parallel inflation. (The internal locks are released during inflation.)
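For context, those four calls currently sit inside grep_submodule()'s critical section roughly like this (simplified from builtin/grep.c; the exact code in pu may differ slightly):

    	grep_read_lock();
    	if (!is_submodule_active(superproject, path)) {
    		grep_read_unlock();
    		return 0;
    	}

    	if (repo_submodule_init(&subrepo, superproject, path)) {
    		grep_read_unlock();
    		return 0;
    	}

    	repo_read_gitmodules(&subrepo);

    	/* makes the subrepo's objects reachable through the_repository */
    	add_to_alternates_memory(subrepo.objects->odb->path);
    	grep_read_unlock();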
1.1) Finding the thread-unsafe spots
The first thing to do was to discover why those functions needed to be in a critical section. A comment right above them gives the answer:
    /*
     * NEEDSWORK: submodules functions need to be protected because they
     * access the object store via config_from_gitmodules(): the latter
     * uses get_oid() which, for now, relies on the global the_repository
     * object.
     */
But inspecting config_from_gitmodules() I discovered the comment was outdated and the function no longer used get_oid(). However, it now called add_to_alternates_memory(), which adds the subrepo's object directory as an in-memory alternate to the global the_repository. As this operation would be performed concurrently with the object reading operations by the worker threads, it could cause data races. (Because it writes to the r->objects->odb list, while oid_object_info_extended()'s call chain may need to iterate through it.)
1.2) Removing the 1st add_to_alternates_memory()
config_from_gitmodules() was adding the subrepo to the alternates memory in order to call config_with_options(), which needs access to the subrepo's objects. The call chains I found that read the subrepo's objects are:
- config_with_options() > git_config_from_blob_ref() > get_oid()
- config_with_options() > git_config_from_blob_ref() > get_oid() > git_config_from_blob_oid() > read_object_file()

So, in order to remove the addition to the alternates list, I added a repo_config_with_options(), which takes an extra repository parameter and passes it down those chains. Then I made config_from_gitmodules() use this new function.
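A minimal sketch of the idea (the exact signature and parameter order in my series may differ; the snippet below is illustrative, not the actual patch):

    /*
     * Illustrative only: a repository-aware variant of config_with_options(),
     * so that reading .gitmodules from a blob can use the submodule's own
     * object store instead of requiring the subrepo to be registered as an
     * in-memory alternate of the_repository.
     */
    int repo_config_with_options(config_fn_t fn, void *data,
    			     struct git_config_source *config_source,
    			     struct repository *repo,
    			     const struct config_options *opts);

    /* In config_from_gitmodules(), roughly: */

    	/* before: make the subrepo's objects reachable, then read the blob */
    	add_to_alternates_memory(repo->objects->odb->path);
    	config_with_options(fn, data, &config_source, &opts);

    	/* after: no alternate needed, the repo travels down the call chain */
    	repo_config_with_options(fn, data, &config_source, repo, &opts);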
1.3) Thread-safe initializations?
At this point, I was pretty sure the first three submodule operations were already thread-safe, without the need for grep_read_mutex or even my internal object reading locks. But there was no way to be sure other than going through each and every function in the call chains. I tried that, but as one would imagine, the call chains were too big for me to manually examine line by line…
1.4) Removing the 2nd add_to_alternates_memory()
I still had to take care of one more thread-unsafe function: the add_to_alternates_memory() call at grep_submodule(). Removing it would not only help in the parallelization but would also solve another NEEDSWORK comment:
    /*
     * NEEDSWORK: This adds the submodule's object directory to the list of
     * alternates for the single in-memory object store. This has some bad
     * consequences for memory (processed objects will never be freed) and
     * performance (this increases the number of pack files git has to pay
     * attention to, to the sum of the number of pack files in all the
     * repositories processed so far). This can be removed once the object
     * store is no longer global and instead is a member of the repository
     * object.
     */
The call was there so that worker threads could access the subrepo's objects through the alternates list. Thus, to remove it, I had to pass the subrepository reference down to the threads and make them use it explicitly. Unfortunately, it seems that this patch introduced some data races that I haven't been able to track down yet :(
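To give an idea of what "passing the subrepository down" might look like, here is a rough sketch (the repo field and the simplified loader below are my own illustration of the shape, not necessarily the code in the patch):

    /*
     * Sketch only: each queued grep_source remembers the repository its
     * object belongs to, so workers read from it explicitly instead of
     * relying on the_repository plus the alternates list.
     */
    struct grep_source {
    	char *name;
    	enum grep_source_type type;
    	void *identifier;
    	struct repository *repo;	/* new: where to read the object from */
    	char *buf;
    	unsigned long size;
    	/* ... userdiff driver, path, etc. ... */
    };

    static int grep_source_load_oid(struct grep_source *gs)
    {
    	enum object_type type;

    	/* was read_object_file(), which always uses the_repository */
    	gs->buf = repo_read_object_file(gs->repo, gs->identifier, &type,
    					&gs->size);
    	return gs->buf ? 0 : -1;
    }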
2) Other submodule data races
I also noticed some other operations in grep_submodule() which weren't thread-safe:
- The call to parse_object_or_die(), when grepping trees, is outside a lock. This function may load objects internally, write to r->parsed_objects->obj_hash and also call lookup_replace_object(), which is thread-unsafe. So calls to it must be done in a critical section (see the sketch after this list).
- To clean the memory allocated for the subrepository, repo_clear(&subrepo) was being called right before returning from grep_submodule(). However, by the time this function is called, there's no guarantee that the worker threads have already finished working on the subrepo. To properly avoid that, we would need to implement a kind of mapping which tells, for each repo, how many objects are yet to be processed. As this would be somewhat laborious, I just skipped the memory cleaning for now, to implement a testing version.
3) Thread-safe code inside critical section
I noticed what seemed to be already thread-safe code inside add_work()'s critical section. So I moved it out, which led to a minor but noticeable speedup:
    -static void add_work(struct grep_opt *opt, const struct grep_source *gs)
    +static void add_work(struct grep_opt *opt, struct grep_source *gs)
     {
    +        if (opt->binary != GREP_BINARY_TEXT)
    +                grep_source_load_driver(gs, opt->repo->index);
    +
             grep_lock();
             while ((todo_end+1) % ARRAY_SIZE(todo) == todo_done) {
                     pthread_cond_wait(&cond_write, &grep_mutex);
             }
             todo[todo_end].source = *gs;
    -        if (opt->binary != GREP_BINARY_TEXT)
    -                grep_source_load_driver(&todo[todo_end].source,
    -                                        opt->repo->index);
4) Move the obj_read_mutex down
I didn't mention it before, but some of those four submodule operations use object reading functions such as oid_object_info_extended(), which is not thread-safe. To overcome this, I decided to move v1's obj_read_mutex from read_object_file_extended() down to oid_object_info_extended(). This allowed even more functions to be called in parallel, but some extra locks were needed.
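In rough terms, the idea looks like the sketch below (simplified; in the real series the lock also has to be released around the expensive inflation step, and the naming may differ):

    /*
     * Sketch: the old implementation becomes do_oid_object_info_extended()
     * and the public entry point serializes callers with the new
     * obj_read_lock()/obj_read_unlock() wrappers around obj_read_mutex.
     */
    int oid_object_info_extended(struct repository *r,
    			     const struct object_id *oid,
    			     struct object_info *oi, unsigned flags)
    {
    	int ret;

    	obj_read_lock();
    	ret = do_oid_object_info_extended(r, oid, oi, flags);
    	obj_read_unlock();
    	return ret;
    }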
5) Replace grep_read_mutex with internal locks
With all the above done, it was time to remove the grep_read_mutex and enable the internal object reading locks. At this point, I also re-enabled threads in the cached grep with --recurse-submodules. Unfortunately, the results weren't the best. I tried a lot of variations on this main idea, but all of them ended up in a data race, a segmentation fault, a failure to read the subrepos' objects, or a performance drop. Each of these alternative versions lives in a local branch, which I didn't think would be worth uploading to share, but here is the "main" one.
Next steps
I believe my idea for v2 turned out to be a very big step and I just got lost in it. So my current plan is to take a step back and go slower. With that in mind, I might end up not finishing this version during GSoC. But I think that's OK, as we already have v1 sent and it brings a good speedup. And I can keep working on this after GSoC, anyway :)
So, instead of making more tweaks to my v2 series, I sat down and analysed the code, trying to list the real open issues (and how we might solve them):
- add_to_alternates_memory(): although this function cannot be called in parallel, the case in which git-grep uses it may not be problematic, if:
  - prepare_alt_odb() is called at the repositories' initialization (at cmd_grep() for the_repository and grep_submodule() for subrepos). This way, we force eager initialization and avoid data races with the worker threads during initialization. (See the sketch after this list.)
  - The ent->next = NULL line is moved up in link_alt_odb_entry(). Then the worker threads may iterate through r->objects->odb without possibly falling on an unmapped memory region. (That's hacky, but it should work.)
- parse_object_or_die(): again, this should not be a problem in our case, if:
  - The object reading operations are properly protected internally.
  - prepare_replace_object() is called at the repositories' initialization (as before).
- Submodule initialization functions (at grep_submodule()): this is the critical part! I re-investigated them and found out that is_submodule_active(), for example, has reprepare_packed_git() in its call chain. This could clearly generate data races with the workers. And as we are working with "intermediate level" locks, I really don't know yet how to protect these functions without going lower or higher in our "protection layer". On the other hand, submodule_from_path() has the same problem and it is outside the critical section… So maybe we can ignore those data races which are too improbable?
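Putting the first two items together, this is roughly what I have in mind (a sketch under my own assumptions: prepare_alt_odb() and prepare_replace_object() exist in git.git, but the latter may need to be exposed, and the helper name below is hypothetical):

    /*
     * 1) Eager initialization: run the lazy initializers in the main
     *    thread, right after each repository is set up, so workers never
     *    race to populate the odb list or the replace map.
     */
    static void grep_prepare_repo_for_threads(struct repository *repo)
    {
    	prepare_alt_odb(repo);		/* populate repo->objects->odb now */
    	prepare_replace_object(repo);	/* load the replace map now */
    }

    /*
     * 2) In link_alt_odb_entry(): publish a fully initialized entry, so a
     *    worker walking r->objects->odb concurrently never follows an
     *    uninitialized next pointer.
     */
    	ent->next = NULL;		/* moved up, before the append below */
    	*r->objects->odb_tail = ent;
    	r->objects->odb_tail = &(ent->next);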
Regarding the commits I already have, I plan to:
- Remove "submodule-config: don't add subrepo to alternates" from the series. It is no longer necessary here, so it might be sent separately.
- Remove "grep: don't add submodules to the alternates list" and hold it in standby. It currently brings data races that must be solved if I'm going to send it in the future.
- Refactor the patch adding the object reading locks. I plan to protect only oid_object_info_extended() and not read_object_file_extended(). This won't work for the general case (which I'll make explicit), but it should work in our case, as we'll be forcing eager initialization for alternates and replace maps in all repos.
- "grep: move driver pre-load out of critical section" should be OK if we have the object reading locks.
- "WIP: grep: remove racy call to repo_clear()" must be properly finished.
And that's the plan! Well, I really hope I'm not getting ahead of myself again and proposing bigger steps than I can actually take…
Main difficulties
Main difficulties this week were:
- Debugging data races: I used a combination of GDB, ThreadSanitizer and valgrind (helgrind and memcheck). But sometimes the racy spot is too occasional to be caught.
- Investigating deep call chains: making a low-level "thread-protection layer" is hard. I needed to go deep into git-grep's call stacks to make sure some functions were safe to call unlocked. And a couple of times I was still unsure.
I also spent some time thinking about how I should properly deal with:
- Lazy initializations: they are great in big codebases as they mitigate the need to keep track of all initializations at the beginning of an execution. But unfortunately, they are difficult to handle when threaded, as two threads can try to initialize a resource at the same time. (For example, take a look at prepare_replace_object() and prepare_alt_odb(), which are called by many functions.)
- Static function-scope variables: some functions, such as oid_to_hex(), return a static buffer. This is a good convenience for the callers and it avoids memory leaks, but it makes the function thread-unsafe. (See the snippet below.)
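As a small illustration of the second point (oid_to_hex() and its reentrant sibling oid_to_hex_r() are real functions in hex.c; the scenario itself is just an assumed example):

    void report(const struct object_id *a, const struct object_id *b)
    {
    	char hex_a[GIT_MAX_HEXSZ + 1], hex_b[GIT_MAX_HEXSZ + 1];

    	/*
    	 * Thread-unsafe: oid_to_hex() hands out slots from a shared ring
    	 * of static buffers, so two threads calling it concurrently can
    	 * end up writing to and reading from the same slot.
    	 */
    	printf("comparing %s and %s\n", oid_to_hex(a), oid_to_hex(b));

    	/* Reentrant variant: the caller provides its own buffers. */
    	printf("comparing %s and %s\n",
    	       oid_to_hex_r(hex_a, a), oid_to_hex_r(hex_b, b));
    }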
Til next time,
Matheus