Some fixes to the dir-iterator patches
Before I talk about my main project, I think I should briefly mention how my dir-iterator patchset has evolved since the last report on it.
Last week I wrote that the problem regarding the st_ino usage was solved. But, in fact, there was still a way to go. Dscho and I exchanged some emails, trying to figure out the best way to overcome the problems that were still open. After considering some alternatives, we ended up deciding to remove the snippet that detects circular symlinks from the series. This code was still far from portable and it was blocking the patchset from being merged. Also, we may work on a more portable implementation in the future, after the series gets merged. So, here you can find version 8 of this patchset, without the circular symlink finder and with the respective adjustments made. Junio has already added it to the respective topic branch.
Recap: where were we?
OK, back to the main project: Last week we saw that parallel inflation could lead to some good time reductions on git-grep. In fact, we even observed a 46% reduction on gentoo’s repo, for one of the test cases!
But since the targets are bigger repos, I was looking forward to seeing the impact of those changes on a linux or chromium repository. Unfortunately, grep'ing these repos with my modified git-grep only led to a Segmentation Fault :( The error was due to some race condition on the delta_base_cache, but I couldn't find the problematic spot. And why did it only happen on larger repositories, in the first place? That's when last week ended and debugging started!
The race condition on delta_base_cache
My first hypothesis was that one thread could have free'd a cache entry which a second thread was expecting to be present in the cache. Through GDB I found out that this was not the case. The call stack showed that the error occurred when trying to remove an entry from the cache. OK, so why wasn't it able to remove the entry? GDB was really helpful, but Valgrind's memcheck gave me the answer:
Invalid read of size 4
at: entry_equals (hashmap.c:99)
by: find_entry_ptr (hashmap.c:137)
by: hashmap_remove (hashmap.c:222)
by: detach_delta_base_cache_entry (packfile.c:1395)
by: release_delta_base_cache (packfile.c:1421)
by: add_delta_base_cache (packfile.c:1447)
by: unpack_entry (packfile.c:1730)
by: ...
Address 0x556c9c18 is 8 bytes inside a block of size 104 free'd
at: free (vg_replace_malloc.c:530)
by: detach_delta_base_cache_entry (packfile.c:1398)
by: release_delta_base_cache (packfile.c:1421)
by: add_delta_base_cache (packfile.c:1447)
by: unpack_entry (packfile.c:1730)
by: ...
Block was alloc'd at
at: malloc (vg_replace_malloc.c:299)
by: do_xmalloc (wrapper.c:60)
by: xmalloc (wrapper.c:87)
by: add_delta_base_cache (packfile.c:1437)
by: unpack_entry (packfile.c:1730)
by: ...
Through this output and code examination, we can conclude the program was trying to access a delta_base_cache entry that was already released and free'd. And, as we had already seen through GDB, the invalid access was itself trying to release the entry. The funniest thing is that both of these conflicting releases happen at the same spot of unpack_entry()'s code. As I knew this spot was protected by the added obj_read_mutex, there was no possibility of two threads executing it "simultaneously". So, how could this error be happening?
I studied the unpack_entry() function, in its current form and in past revisions, to learn as much as I could about how object reading from packfiles works. (It was very interesting!) This helped me understand how the cache works in more detail.
Before really adding a new entry, add_delta_base_cache() may need to remove some of the least recently used ones if the cache is full. It keeps control of that through a list from the list.h API (so the cache is composed of a hashmap and a list). The error I got was exactly at the code that iterates through this list removing old entries.
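To make that more concrete, here is a rough sketch of how each cache entry ties the hashmap and the LRU list together (the field names are my approximation from reading packfile.c, not a verbatim copy):

struct delta_base_cache_key {
	struct packed_git *p;    /* packfile holding the object */
	off_t base_offset;       /* offset of the base object in that pack */
};

struct delta_base_cache_entry {
	struct hashmap_entry ent;        /* links the entry into the hashmap */
	struct delta_base_cache_key key; /* what lookups compare against */
	struct list_head lru;            /* links the entry into the LRU list */
	void *data;                      /* the inflated object contents */
	unsigned long size;
	enum object_type type;
};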
I thought maybe the list was not being properly updated in my threaded code, keeping wrong references around. So I ran some tests trying to understand a few points. For example, what happens with just one thread? What happens if I remove the unlock-and-re-lock pattern from git_inflate()? And what if I add it somewhere else? Could it be the same thread trying to double release an entry? Could a thread be holding a reference to an entry already released, and re-insert it into the cache later?
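For context, the unlock-and-re-lock pattern mentioned above boils down to something like the following. This is only a simplified sketch: it leaves out the error handling and zlib bookkeeping the real function does, and it assumes obj_read_lock()/obj_read_unlock() wrappers around the obj_read_mutex mentioned earlier:

int git_inflate(git_zstream *strm, int flush)
{
	int status;

	/*
	 * inflate() is the CPU-heavy part and doesn't touch the shared
	 * object reading state, so drop the lock while it runs to let
	 * other threads read objects in parallel.
	 */
	obj_read_unlock();
	status = inflate(&strm->z, flush);
	obj_read_lock();

	return status;
}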
Long story short, I finally found the problem: what if there were two entries for the same data in the lru list? Then, maybe, a thread was trying to release an entry whose data was also in another entry, released by another thread… This idea matched the information Valgrind was displaying.
My first solution to this problem was to check whether the entry was already present in the lru list when trying to add a new one. To do that, I added the following snippet to the beginning of add_delta_base_cache():
struct list_head *lru; /* iterator over the LRU list */

list_for_each(lru, &delta_base_cache_lru) {
	struct delta_base_cache_entry *f =
		list_entry(lru, struct delta_base_cache_entry, lru);
	if (f->key.base_offset == base_offset && f->key.p == p)
		return;
}
And, with that, we were free from errors!! But… the execution time got really slow :( It's easy to see why: every time we add a new entry, we need to iterate through the whole list!
I scratched my head for a while until I got to a better solution: oh well, if we want to check for an entry's existence, why don't we use the hashmap instead of the list? Replacing the above code with a simple if (get_delta_base_cache_entry(p, base_offset)) return; solved the issue while keeping up a good performance! I'm very happy that we got to this point :)
You can see this working version here.
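For reference, the check sits at the very top of add_delta_base_cache(), conceptually like this (a sketch of the approach with the insertion code omitted, and the signature abbreviated from my reading of packfile.c):

static void add_delta_base_cache(struct packed_git *p, off_t base_offset,
				 void *base, unsigned long base_size,
				 enum object_type type)
{
	/*
	 * With threaded inflation, another thread may have already cached
	 * this base while we were inflating it. The hashmap lookup makes
	 * this check O(1), instead of scanning the whole LRU list.
	 */
	if (get_delta_base_cache_entry(p, base_offset))
		return;

	/* the original hashmap and LRU list insertion follows */
}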
Results
I tested the modified git-grep at chromium's repo with these test cases. The original sequential performance was also measured as a reference. Here are the means of 30 executions, after 2 warmups, for each test case:
Times of git-grep at chromium's repo
====================================

Threads: 1            Regex 1     Regex 2
  Original:           17.5815s    21.7217s

Threads: 2            Regex 1     Regex 2
  Original:           14.6363s    15.6117s
  Parallel inflate:    9.7983s    11.3965s
  Reduction:          33.05%      27.00%

Threads: 8            Regex 1     Regex 2
  Original:           13.5643s    16.4347s
  Parallel inflate:    6.3097s     6.9667s
  Reduction:          53.48%      57.61%
Some very good execution time reductions!
Difficulties
Debugging threaded code can be quite a challenge! Even with the help of GDB and Valgrind (memcheck and helgrind) I spent a lot of time trying to understand what was happening and why it was happening. It’s been a long week, but I’m so happy that this barrier has finally been overcome!
Next steps
From last week, the following topics were completed:
- Continue investigating the delta_base_cache SegFaults, and how to avoid the problems with the execution order of critical sections (the intercalations).
- Measure speedup on Chromium.
Now that I have a working version with verified improvements, I have to:
- Make the patch less hacky:
  - Split the changes into logical patches.
  - Improve how the object reading mutex is being handled at sha1-file.c and zlib.c.
  - Possibly add an obj_read_use_locks variable for the API users to set, when needed (a rough sketch of this idea follows at the end of this list).
- Ensure there's no other possible race condition introduced with these changes.
- Protect the submodule and textconv operations on git-grep (they were left behind when I removed the grep_read_mutex).
- Maybe check the improvements on a HDD machine.
- Take a look at Duy’s patch again to see if I can improve my version based on it.
- Try to refine the object reading locks to have even more parallel work.
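About the obj_read_use_locks idea in the list above, the rough shape I have in mind is something like this (a hypothetical sketch using plain pthreads just to illustrate; none of this is in the patch yet):

#include <pthread.h>

/* Hypothetical: a switch for callers that need thread-safe object
 * reading, so that single-threaded users pay no locking cost. */
int obj_read_use_locks = 0;
static pthread_mutex_t obj_read_mutex = PTHREAD_MUTEX_INITIALIZER;

void obj_read_lock(void)
{
	if (obj_read_use_locks)
		pthread_mutex_lock(&obj_read_mutex);
}

void obj_read_unlock(void)
{
	if (obj_read_use_locks)
		pthread_mutex_unlock(&obj_read_mutex);
}

A threaded caller like git-grep would then set obj_read_use_locks to 1 before spawning its worker threads.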
I'm also considering whether I should investigate further why the threaded inflation version tries to add duplicates to the delta_base_cache. Perhaps, if I find the cause, I could refactor the code so that checking for the entry at add_delta_base_cache() becomes unnecessary. But I think this may be really time-consuming, so I need to check with my mentors about it.
Til next time,
Matheus