This has been quite a busy college week, so I unfortunately had to delay this post (hence the “week 5.5” in the title). But let’s get to it:
Some leftover investigations
In the last post, I left some investigations pending on a few time-consuming
git-grep operations. As we saw, kwsexec()
and grep_source_is_binary()
were two of the functions that needed further examination. So, this week, I
proceeded with that, trying to understand why they are expensive and how we
could optimize them, if possible.
kwsexec()
kwsexec()
is part of the “kwset” (keyword set searching) code. This section
comes from GNU grep and uses the Boyer-Moore algorithm for single-word searches
or the Commentz-Walter algorithm for multiple-word searches. But since this
code was incorporated, GNU grep’s version seems to have evolved to be much
more performant. For example, a3d42de (“grep: use Aho-Corasick algorithm to
search multiple fixed words”, 2016-06-02) reports a 7x speedup for a typical
case:
Using best-of-5 trials for the benchmark:
find /usr/share/doc/ -type f | LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words >/dev/null
The results were:
real 11.37  user 11.03  sys 0.24  [without the change]
real  1.49  user  1.31  sys 0.15  [with the change]
So one thing we should consider doing is updating Git’s kwset. I don’t know about the license issues, though, since GNU grep is GPLv3 and Git is GPLv2. Anyway, this probably goes beyond my project…
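To give an idea of how this kind of fixed-string matching works, here is a minimal Boyer-Moore-Horspool search in C. It only illustrates the core “bad character” skipping idea; the real kwset code is far more elaborate and handles multiple keywords at once:

```c
#include <stdio.h>
#include <string.h>

/*
 * Minimal Boyer-Moore-Horspool search: illustrative only, not the
 * actual kwset implementation. Returns the offset of the first match
 * of `pat` in `text`, or -1 if there is none.
 */
static long bmh_search(const unsigned char *text, size_t text_len,
		       const unsigned char *pat, size_t pat_len)
{
	size_t skip[256], i, pos = 0;

	if (!pat_len || pat_len > text_len)
		return -1;

	/* By default, a mismatch lets us skip the whole pattern... */
	for (i = 0; i < 256; i++)
		skip[i] = pat_len;
	/* ...but characters present in the pattern allow smaller skips. */
	for (i = 0; i < pat_len - 1; i++)
		skip[pat[i]] = pat_len - 1 - i;

	while (pos <= text_len - pat_len) {
		if (!memcmp(text + pos, pat, pat_len))
			return (long)pos;
		/* Skip based on the character aligned with the pattern's end. */
		pos += skip[text[pos + pat_len - 1]];
	}
	return -1;
}

int main(void)
{
	const char *text = "grep_source_is_binary checks whether a file is binary";
	long off = bmh_search((const unsigned char *)text, strlen(text),
			      (const unsigned char *)"binary", 6);
	printf("first match at offset %ld\n", off);
	return 0;
}
```

The skip table is what lets the search jump over several characters at a time instead of examining every position. Commentz-Walter generalizes this skipping to a set of keywords, while Aho-Corasick (the algorithm GNU grep moved to) trades the skipping for guaranteed linear behavior, which apparently pays off for large keyword sets.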
grep_source_is_binary()
grep_source_is_binary()
checks whether a file is binary so that git-grep may handle it properly.
When neither --textconv nor --text is given, git-grep has to call this
function for each and every file in the worktree (or each path in a
specified revision). To decide whether a path refers to a binary file,
the function first tries to read the diff attribute for the given path.
If that fails (or the attribute is unspecified), it falls back to
auto-detection, which involves reading the whole file (or blob) and
searching its first 8000 bytes for a NUL character (a 0 unsigned char).
Both the attribute lookup and the file/blob reading can be really
time-consuming. It doesn’t seem there’s much we can do here without
getting too deep into the attributes code. However, making the blob
reading thread-safe may help, since grep_source_is_binary() currently
relies on a lock to perform this operation.
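The auto-detection heuristic itself is simple. It boils down to what Git’s buffer_is_binary() (in xdiff-interface.c) does, which is essentially the following (paraphrased; looks_binary is my name for the sketch):

```c
#include <string.h>

/*
 * Essentially what Git's buffer_is_binary() (xdiff-interface.c) does,
 * paraphrased: look for a NUL byte among the first 8000 bytes.
 */
#define FIRST_FEW_BYTES 8000

static int looks_binary(const char *ptr, unsigned long size)
{
	if (FIRST_FEW_BYTES < size)
		size = FIRST_FEW_BYTES;
	return !!memchr(ptr, 0, size);
}
```

So the check itself is cheap; what makes grep_source_is_binary() expensive is obtaining those bytes in the first place, i.e. the attribute lookup and the file/blob read that precede it.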
Userdiff driver pre-loading
Reading builtin/grep.c, I was curious why add_work()
, which is responsible for handing
tasks to the worker threads, called grep_source_load_driver()
, which loads
the userdiff driver for a path if it’s not already loaded. This function is
already called by the worker threads, in parallel (with locks where needed),
so wouldn’t it be faster to let the workers handle it on their own? Well,
9dd5245 (“grep: pre-load userdiff drivers when threaded”, 2012-02-02) gave me
the answer. It says:
The low-level grep_source code will automatically load the userdiff driver to see whether a file is binary. However, when we are threaded, it will load the drivers in a non-deterministic order […]
Meanwhile, the attribute lookup code (which underlies the userdiff driver lookup) is optimized to handle paths in sequential order […] Multi-threading the lookups destroys the locality and makes this optimization less effective. […]
We can fix this by pre-loading the userdiff driver in the main thread […] My best-of-five for “git grep foo” on the linux-2.6 repository went from:
real 0m0.391s user 0m1.708s sys 0m0.584s
to:
real 0m0.360s user 0m1.576s sys 0m0.572s
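In code, the fix looks roughly like the sketch below, a paraphrase of what I understood from add_work() in builtin/grep.c, with most of the queueing details elided (see the actual file for the real thing):

```c
/*
 * Rough paraphrase of add_work() in builtin/grep.c; many details
 * elided, not the literal code.
 */
static void add_work(struct grep_opt *opt, const struct grep_source *gs)
{
	grep_lock();
	/* ... wait until there is a free slot in the todo ring ... */

	todo[todo_end].source = *gs;

	/*
	 * We are still in the main thread, which sees paths in
	 * sequential order, so loading the driver here keeps the
	 * attribute code's sequential-lookup optimization effective.
	 * The workers would do these lookups in a non-deterministic
	 * order.
	 */
	if (opt->binary != GREP_BINARY_TEXT)
		grep_source_load_driver(&todo[todo_end].source);

	/* ... advance todo_end and signal a waiting worker ... */
	grep_unlock();
}
```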
But since this change is from 2012, I was wondering whether this optimization still pays off nowadays. So I removed the pre-loading and tested git-grep with two regexes of different complexities on chromium’s repository:
git grep "abcd[02]"
git grep -E '(static|extern) \b(int|double)\b \*[A-Za-z]+'
I also ran the tests with 2 and 8 threads, to see if that makes any difference, and for both the worktree and the object store[1], totaling 16 test cases. Each test case was repeated 30 times after one warmup execution, to mitigate fluctuations. Here are the resulting means for each of them[2]:
WORKTREE GIT-GREP
==============================
PRE-LOAD NO PRE-LOAD
Regex 1
Threads 2: 4.1529s 4.0146s
Threads 8: 2.9638s 2.9051s
Regex 2
Threads 2: 31.5089s 31.8494s
Threads 8: 19.6766s 19.2586s
CACHED GIT-GREP
==============================
PRE-LOAD NO PRE-LOAD
Regex 1
Threads 2: 14.4761s 14.1820s
Threads 8: 15.5850s 13.9691s
Regex 2
Threads 2: 38.3241s 38.7090s
Threads 8: 25.5049s 25.3535s
From these measurements, it seems worth removing the pre-loading, especially for simpler regexes and more threads, where the performance gain is higher. For more complex regexes and fewer threads there may be a slowdown, but it is very small. I’ll check with my mentors what they think about these tests and about a possible patch removing the pre-load.
Measuring time inside lock operations
As the main plan is to parallelize pack access and refine git-grep’s locks, I
also made a test to see how much time the threads spend inside each of those
locks. This should complement the plots we saw in the last post and/or help us
visualize details we didn’t previously see. I was curious, for example, about
grep_attr_mutex
, since it protects the attribute machinery, which seems to be
a rather expensive section, as we previously saw.
It’s important to note that these tests don’t account for the time spent in lock contention (which would be another really nice measurement to take). For now, we’ll focus on the time spent by the threads working inside the critical sections. As in the previous test, each test case was repeated 30 times after one warmup execution. Here are the resulting means, together with the percentage of the total time spent in each lock:
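The measurement itself can be done with a thin wrapper around the mutex operations. Here is a simplified, self-contained sketch of the approach (the timed_mutex type and helper names are made up for this illustration; this is not the exact instrumentation I used):

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/*
 * Illustrative sketch: accumulate the wall-clock time spent inside a
 * critical section. Note that this measures only the time between
 * lock acquisition and release, not the time spent waiting on a
 * contended lock.
 */
struct timed_mutex {
	pthread_mutex_t mutex;
	struct timespec locked_at;
	double in_section; /* seconds spent holding the lock */
};

static void timed_lock(struct timed_mutex *tm)
{
	pthread_mutex_lock(&tm->mutex);
	clock_gettime(CLOCK_MONOTONIC, &tm->locked_at);
}

static void timed_unlock(struct timed_mutex *tm)
{
	struct timespec now;
	clock_gettime(CLOCK_MONOTONIC, &now);
	/* Safe to update here: we still hold the lock. */
	tm->in_section += (now.tv_sec - tm->locked_at.tv_sec) +
			  (now.tv_nsec - tm->locked_at.tv_nsec) / 1e9;
	pthread_mutex_unlock(&tm->mutex);
}

static struct timed_mutex demo_mutex = { PTHREAD_MUTEX_INITIALIZER };
static long counter;

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000000; i++) {
		timed_lock(&demo_mutex);
		counter++; /* stand-in for the real critical section */
		timed_unlock(&demo_mutex);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[2];
	for (int i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	printf("time inside critical section: %.4fs\n", demo_mutex.in_section);
	return 0;
}
```

Accumulating in_section per lock (grep_mutex, grep_read_mutex and grep_attr_mutex) and comparing it against the total execution time gives percentages like the ones below.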
WORKTREE GIT-GREP
==============================
Threads 2
Regex 1 Regex 2
Total: 4.2326s 31.5687s
grep_mutex: 0s (0%) 0s (0%)
grep_read_mutex: 0s (0%) 0s (0%)
grep_attr_mutex: 0.8809s (20.81%) 1.0706s (3.39%)
Threads 8
Regex 1 Regex 2
Total: 3.0244s 19.7046s
grep_mutex: 0s (0%) 0s (0%)
grep_read_mutex: 0s (0%) 0s (0%)
grep_attr_mutex: 0.98s (32.40%) 1.1041s (5.60%)
CACHED GIT-GREP
==============================
Threads 2
Regex 1 Regex 2
Total: 14.3626s 38.3953s
grep_mutex: 0s (0%) 0s (0%)
grep_read_mutex: 12.2364s (85.20%) 12.2781s (31.98%)
grep_attr_mutex: 1.0041s (6.99%) 1.1283s (2.94%)
Threads 8
Regex 1 Regex 2
Total: 15.5181s 25.4049s
grep_mutex: 0s (0%) 0s (0%)
grep_read_mutex: 12.927s (83.30%) 14.7175s (57.93%)
grep_attr_mutex: 1.0536s (6.79%) 1.2476s (4.91%)
Some conclusions we can take from this table are:
- The userdiff and attribute handling take a significant amount of time, but it gets masked by the regex processing time when using more complex regexes.
- The time spent adding and retrieving work from the producer-consumer FIFO is pretty small. We can see that from the grep_mutex lock, which protects these operations. (But again, we are not measuring contention time here.)
- When performing a cached grep, reading objects from the object store can take a massive amount of time. Again, this time is masked by regex processing when we use more complex regexes, but it is very visible with simpler ones, taking as much as 85% of the total time in some cases!
- Cached grep showed a curious behavior when more threads were added: even though the total execution time decreased, the time spent on object reading was a little higher. I can’t explain why right now.
- Cached grep takes longer than worktree grep in every test case. But if we subtract the time spent on object reading, the resulting time is smaller than the worktree one. So worktree grep must be doing some additional work that cached grep does not. This may be because reading the uncompressed files is more expensive than reading the compressed blobs.[3]
I think the most important takeaway from these tests and conclusions is that parallelizing the object reading is promising. It accounts for much of the execution time and should yield good speedups if threaded. But we can’t be sure, as we don’t know how much of its internals is I/O and how much is CPU processing. If the second slice is big (because of inflation, for example), then we are good to go. But we will only know for sure when we get there…
Others: re-sending a patchset
This week I also re-sent my patchset on dir-iterator and local clone. This series adds tests for the dir-iterator API, simplifies its mechanics, and uses it in copy_or_link_directory() (builtin/clone.c). There are also some behavior changes and new features, together with new tests. I’m looking forward to seeing it upstream :)
Difficulties and next steps
Writing this post, I realized I’ve learned a lot about git-grep’s internals so far, which is great. But I haven’t made the same progress on coding yet :( I think the greatest difficulty so far has been dividing the ultimate goal into smaller tasks. I still cannot visualize a very clear pathway…
For my next steps, although we have deviated from the original schedule, I plan
to revisit it and check the global states I’ve already listed. But I don’t think
it’s a good approach to just go around protecting them. I’m more inclined to do
a top-down conversion starting from git-grep’s call to
read_object_with_reference()
. This way we can be more certain that nothing is
left behind.
The bad thing is that we cannot really test whether a function is thread-safe
or not. So, for my next steps, I’m thinking about using some
empirical/probabilistic tests as I convert the code: starting at the big
lock around read_object_with_reference()
, I could move it further down the
call stack (to read_object_file()
, for example), execute threaded git-grep
a number of times, and check the output. It’s not the safest check and it has
some flaws: for example, we won’t be able to exercise every code branch without
running through all of git-grep’s CLI options… But it will, at least, give us
some confidence during the process.
Duy’s suggestion on a hacky parallel git-grep should help me in this process as well.
Footnotes
1. I had to re-enable threading for cached git-grep, which was disabled in 53b8d93 (“grep: disable threading in non-worktree case”, 2011-12-12). ↩
2. Tests were performed on an i7-7700HQ with 16GB of RAM and an SSD, running Manjaro Linux. ↩
3. One could think worktree grep takes longer because it also works on git-ignored files. That is not the case, though, since running git status --ignored in chromium’s directory returns nothing. ↩
Til next time,
Matheus