This has been quite a busy college week, so I unfortunately had to delay this post (hence the “week 5.5” in the title). But let’s get to it:
Some leftover investigations
In the last post, I left some investigations pending on a few time-consuming
git-grep operations. As we saw, kwsexec()
and grep_source_is_binary()
were two of the functions that needed further examination. So, this week, I
proceeded with that, trying to understand why they are expensive and how we
could optimize them, if possible.
kwsexec()
kwsexec()
is part of the “kwset” (keyword set searching) code. This section
comes from GNU grep and uses the Boyer-Moore algorithm for single-word searches
or the Commentz-Walter algorithm for multiple-word searches. But since this
code was incorporated, GNU grep’s version seems to have evolved to be much
more performant. For example, a3d42de (“grep: use Aho-Corasick algorithm to
search multiple fixed words”, 2016-06-02) reports a 7x speedup for a typical
case:
Using best-of-5 trials for the benchmark:
find /usr/share/doc/ -type f | LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words >/dev/null
The results were:
real 11.37  user 11.03  sys 0.24  [without the change]
real  1.49  user  1.31  sys 0.15  [with the change]
So one thing we should consider doing is updating Git’s kwset. I don’t know about the license issues, though, since GNU grep is GPLv3 and Git is GPLv2. Anyway, this probably goes beyond my project…
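To give an idea of how this kind of fixed-string matching works, here is a minimal Boyer-Moore-Horspool search in C. It only illustrates the core “bad character” skipping idea; the real kwset code is far more elaborate and handles multiple keywords at once:

```c
#include <stdio.h>
#include <string.h>

/*
 * Minimal Boyer-Moore-Horspool search: illustrative only, not the
 * actual kwset implementation. Returns the offset of the first match
 * of `pat` in `text`, or -1 if there is none.
 */
static long bmh_search(const unsigned char *text, size_t text_len,
		       const unsigned char *pat, size_t pat_len)
{
	size_t skip[256], i, pos = 0;

	if (!pat_len || pat_len > text_len)
		return -1;

	/* By default, a mismatch lets us skip the whole pattern... */
	for (i = 0; i < 256; i++)
		skip[i] = pat_len;
	/* ...but characters present in the pattern allow smaller skips. */
	for (i = 0; i < pat_len - 1; i++)
		skip[pat[i]] = pat_len - 1 - i;

	while (pos <= text_len - pat_len) {
		if (!memcmp(text + pos, pat, pat_len))
			return (long)pos;
		/* Skip based on the character aligned with the pattern's end. */
		pos += skip[text[pos + pat_len - 1]];
	}
	return -1;
}

int main(void)
{
	const char *text = "grep_source_is_binary checks whether a file is binary";
	long off = bmh_search((const unsigned char *)text, strlen(text),
			      (const unsigned char *)"binary", 6);
	printf("first match at offset %ld\n", off);
	return 0;
}
```

The skip table is what lets the search jump over several characters at a time instead of examining every position. Commentz-Walter generalizes this skipping to a set of keywords, while Aho-Corasick (the algorithm GNU grep moved to) trades the skipping for guaranteed linear behavior, which apparently pays off for large keyword sets.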
grep_source_is_binary()
grep_source_is_binary()
checks whether a file is binary so that git-grep may handle it properly.
When neither --textconv nor --text is given, git-grep has to call this
function for each and every file in the worktree (or each path in a
specified revision). To decide whether a path refers to a binary file,
the function first tries to read the diff attribute for the given path.
If that fails (or the attribute is unspecified), it falls back to
auto-detection, which involves reading the whole file (or blob) and
searching its first 8000 bytes for a NUL character (a 0 unsigned char).
Both the attribute lookup and the file/blob reading can be really
time-consuming. It doesn’t seem there’s much we can do here without
getting too deep into the attributes code. However, making the blob
reading thread-safe may help, since grep_source_is_binary() currently
relies on a lock to perform this operation.
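The auto-detection heuristic itself is simple. It boils down to what Git’s buffer_is_binary() (in xdiff-interface.c) does, which is essentially the following (paraphrased; looks_binary is my name for the sketch):

```c
#include <string.h>

/*
 * Essentially what Git's buffer_is_binary() (xdiff-interface.c) does,
 * paraphrased: look for a NUL byte among the first 8000 bytes.
 */
#define FIRST_FEW_BYTES 8000

static int looks_binary(const char *ptr, unsigned long size)
{
	if (FIRST_FEW_BYTES < size)
		size = FIRST_FEW_BYTES;
	return !!memchr(ptr, 0, size);
}
```

So the check itself is cheap; what makes grep_source_is_binary() expensive is obtaining those bytes in the first place, i.e. the attribute lookup and the file/blob read that precede it.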
Userdiff driver pre-loading
Reading builtin/grep.c, I was curious why add_work()
, which is responsible for handing
tasks to the worker threads, called grep_source_load_driver()
, which loads
the userdiff driver for a path if it’s not already loaded. This function is
already called by the worker threads, in parallel (with locks where needed),
so wouldn’t it be faster to let the workers handle it on their own? Well,
9dd5245 (“grep: pre-load userdiff drivers when threaded”, 2012-02-02) gave me
the answer. It says:
The low-level grep_source code will automatically load the userdiff driver to see whether a file is binary. However, when we are threaded, it will load the drivers in a non-deterministic order […]
Meanwhile, the attribute lookup code (which underlies the userdiff driver lookup) is optimized to handle paths in sequential order […] Multi-threading the lookups destroys the locality and makes this optimization less effective. […]
We can fix this by pre-loading the userdiff driver in the main thread […] My best-of-five for “git grep foo” on the linux-2.6 repository went from:
real 0m0.391s user 0m1.708s sys 0m0.584s
to:
real 0m0.360s user 0m1.576s sys 0m0.572s
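In code, the fix looks roughly like the sketch below, a paraphrase of what I understood from add_work() in builtin/grep.c, with most of the queueing details elided (see the actual file for the real thing):

```c
/*
 * Rough paraphrase of add_work() in builtin/grep.c; many details
 * elided, not the literal code.
 */
static void add_work(struct grep_opt *opt, const struct grep_source *gs)
{
	grep_lock();
	/* ... wait until there is a free slot in the todo ring ... */

	todo[todo_end].source = *gs;

	/*
	 * We are still in the main thread, which sees paths in
	 * sequential order, so loading the driver here keeps the
	 * attribute code's sequential-lookup optimization effective.
	 * The workers would do these lookups in a non-deterministic
	 * order.
	 */
	if (opt->binary != GREP_BINARY_TEXT)
		grep_source_load_driver(&todo[todo_end].source);

	/* ... advance todo_end and signal a waiting worker ... */
	grep_unlock();
}
```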
But since this change is from 2012, I was wondering whether this optimization still pays off nowadays. So I removed the pre-loading and tested git-grep with two regexes of different complexities on chromium’s repository:
git grep "abcd[02]"
git grep -E '(static|extern) \b(int|double)\b \*[A-Za-z]+'
I also ran the tests with 2 and 8 threads, to see if that makes any difference, and for both the worktree and the object store[1], totaling 16 test cases. Each test case was repeated 30 times after one warmup execution, to mitigate fluctuations. Here are the resulting means for each of them[2]:
WORKTREE GIT-GREP
==============================
PRE-LOAD NO PRE-LOAD
Regex 1
Threads 2: 4.1529s 4.0146s
Threads 8: 2.9638s 2.9051s
Regex 2
Threads 2: 31.5089s 31.8494s
Threads 8: 19.6766s 19.2586s
CACHED GIT-GREP
==============================
PRE-LOAD NO PRE-LOAD
Regex 1
Threads 2: 14.4761s 14.1820s
Threads 8: 15.5850s 13.9691s
Regex 2
Threads 2: 38.3241s 38.7090s
Threads 8: 25.5049s 25.3535s
From these measurements, it seems worth removing the pre-loading, especially for simpler regexes and more threads, where the performance gain is higher. For more complex regexes and fewer threads there may be a slowdown, but it is very small. I’ll check with my mentors what they think about these tests and about a possible patch removing the pre-load.
Measuring time inside lock operations
As the main plan is to parallelize pack access and refine git-grep’s locks, I
also made a test to see how much time the threads spend inside each of those
locks. This should complement the plots we saw in the last post and/or help us
visualize details we didn’t previously see. I was curious, for example, about
grep_attr_mutex
, since it protects the attribute machinery, which seems to be
a rather expensive section, as we previously saw.
It’s important to note that these tests don’t account for the time spent in lock contention (which would be another really nice measurement to take). For now, we’ll focus on the time spent by the threads working inside the critical sections. As in the previous test, each test case was repeated 30 times after one warmup execution. Here are the resulting means, together with the percentage of the total time spent in each lock:
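The measurement itself can be done with a thin wrapper around the mutex operations. Here is a simplified, self-contained sketch of the approach (the timed_mutex type and helper names are made up for this illustration; this is not the exact instrumentation I used):

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/*
 * Illustrative sketch: accumulate the wall-clock time spent inside a
 * critical section. Note that this measures only the time between
 * lock acquisition and release, not the time spent waiting on a
 * contended lock.
 */
struct timed_mutex {
	pthread_mutex_t mutex;
	struct timespec locked_at;
	double in_section; /* seconds spent holding the lock */
};

static void timed_lock(struct timed_mutex *tm)
{
	pthread_mutex_lock(&tm->mutex);
	clock_gettime(CLOCK_MONOTONIC, &tm->locked_at);
}

static void timed_unlock(struct timed_mutex *tm)
{
	struct timespec now;
	clock_gettime(CLOCK_MONOTONIC, &now);
	/* Safe to update here: we still hold the lock. */
	tm->in_section += (now.tv_sec - tm->locked_at.tv_sec) +
			  (now.tv_nsec - tm->locked_at.tv_nsec) / 1e9;
	pthread_mutex_unlock(&tm->mutex);
}

static struct timed_mutex demo_mutex = { PTHREAD_MUTEX_INITIALIZER };
static long counter;

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000000; i++) {
		timed_lock(&demo_mutex);
		counter++; /* stand-in for the real critical section */
		timed_unlock(&demo_mutex);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[2];
	for (int i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	printf("time inside critical section: %.4fs\n", demo_mutex.in_section);
	return 0;
}
```

Accumulating in_section per lock (grep_mutex, grep_read_mutex and grep_attr_mutex) and comparing it against the total execution time gives percentages like the ones below.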
WORKTREE GIT-GREP
==============================
Threads 2
Regex 1 Regex 2
Total: 4.2326s 31.5687s
grep_mutex: 0s (0%) 0s (0%)
grep_read_mutex: 0s (0%) 0s (0%)
grep_attr_mutex: 0.8809s (20.81%) 1.0706s (3.39%)
Threads 8
Regex 1 Regex 2
Total: 3.0244s 19.7046s
grep_mutex: 0s (0%) 0s (0%)
grep_read_mutex: 0s (0%) 0s (0%)
grep_attr_mutex: 0.98s (32.40%) 1.1041s (5.60%)
CACHED GIT-GREP
==============================
Threads 2
Regex 1 Regex 2
Total: 14.3626s 38.3953s
grep_mutex: 0s (0%) 0s (0%)
grep_read_mutex: 12.2364s (85.20%) 12.2781s (31.98%)
grep_attr_mutex: 1.0041s (6.99%) 1.1283s (2.94%)
Threads 8
Regex 1 Regex 2
Total: 15.5181s 25.4049s
grep_mutex: 0s (0%) 0s (0%)
grep_read_mutex: 12.927s (83.30%) 14.7175s (57.93%)
grep_attr_mutex: 1.0536s (6.79%) 1.2476s (4.91%)
Some conclusions we can take from this table are:
- The userdiff and attribute handling take a significant amount of time, but it gets masked by the regex processing time when using more complex regexes.
- The time spent adding and retrieving work from the producer-consumer FIFO is pretty small. We can see that from the grep_mutex lock, which protects these operations. (But again, we are not measuring contention time here.)
- When performing a cached grep, reading objects from the object store can take a massive amount of time. Again, this time is masked by regex processing when we use more complex regexes, but it is very visible with simpler ones, taking as much as 85% of the total time in some cases!
- Cached grep showed a curious behavior when more threads were added: even though the total execution time decreased, the time spent on object reading was a little higher. I can’t explain why right now.
- Cached grep takes longer than worktree grep in every test case. But if we subtract the time spent on object reading, the resulting time is smaller than the worktree one. So worktree grep must be doing some additional work that cached grep does not. This may be because reading the uncompressed files is more expensive than reading the compressed blobs.[3]
I think the most important takeaway from these tests and conclusions is that parallelizing the object reading is promising. It accounts for much of the execution time and should yield good speedups if threaded. But we can’t be sure, as we don’t know how much of its internals is I/O and how much is CPU processing. If the second slice is big (because of inflation, for example), then we are good to go. But we will only know for sure when we get there…
Others: re-sending a patchset
This week I also re-sent my patchset on dir-iterator and local clone. This series adds tests for the dir-iterator API, simplifies its mechanics, and uses it in copy_or_link_directory() (builtin/clone.c). There are also some behavior changes and new features, together with new tests. I’m looking forward to seeing it upstream :)
Difficulties and next steps
Writing this post, I realized I’ve learned a lot about git-grep’s internals so far, which is great. But I haven’t made the same progress on coding yet :( I think the greatest difficulty so far has been dividing the ultimate goal into smaller tasks. I still cannot visualize a very clear pathway…
For my next steps, although we have deviated from the original schedule, I plan
to revisit it and check the global states I’ve already listed. But I don’t think
it’s a good approach to just go around protecting them. I’m more inclined to do
a top-down conversion starting from git-grep’s call to
read_object_with_reference()
. This way we can be more certain that nothing is
left behind.
The bad thing is that we cannot really test whether a function is thread-safe
or not. So, for my next steps, I’m thinking about using some
empirical/probabilistic tests as I convert the code: starting at the big
lock around read_object_with_reference()
, I could move it further down the
call stack (to read_object_file()
, for example), execute threaded git-grep
a number of times, and check the output. It’s not the safest check and it has
some flaws: for example, we won’t be able to exercise every code branch without
running through all of git-grep’s CLI options… But it will, at least, give us
some confidence during the process.
Duy’s suggestion on a hacky parallel git-grep should help me in this process as well.
Footnotes
1. I had to re-enable threading for cached git-grep, which was disabled in 53b8d93 (“grep: disable threading in non-worktree case”, 2011-12-12). ↩
2. Tests were performed on an i7-7700HQ with 16GB of RAM and an SSD, running Manjaro Linux. ↩
3. One could think worktree grep takes longer because it also works on git-ignored files. That is not the case, though, since running git status --ignored in chromium’s directory returns nothing. ↩
Til next time,
Matheus