edited: This PR has been squashed with a few revisions from the initial PR. The unsquashed version is still available here https://github.com/bitcoin/bitcoin/compare/master...JeremyRubin:cuckoocache-pull-request-not-squashed. The PR message has also been updated to reflect the current squashed version more closely.
Summary
This PR replaces the boost::unordered_set in sigcache with CuckooCache::cache.
Benefits
- Better memory utilization (no pointers/linked lists/stored hashes)
- Lock-free concurrent reads and erases
- Lazy Deletion of entries (potentially beneficial in case of re-orgs)
- Low cache thrashing -- a standard lookup and erase should only touch ~9 cache lines (the 8 hash locations h0..h7, plus the erase flag)
- Allocation-free design
- Insert guaranteed to terminate
- Generations smartly delete older entries before newer ones
Design
The CuckooCache uses a design similar to a cuckoo hash table. Each inserted key is hashed with eight hash functions, h0 through h7, giving it eight candidate locations.
- Lookup(k) examines locations h0(k)..h7(k).
- Insert(k) first checks whether any of h0(k)..h7(k) is empty, and inserts into one if so. If not, insert swaps k with the key at one past the last hash location used (so it cycles through h0..h7) and attempts to insert the swapped-out key at one of its other locations, repeating up to a fixed limit if necessary. At that limit, the last displaced entry is dropped. Insert also checks a heuristic (described below) to decide whether to age a generation before trying to insert.
- Delete(k) first does a Lookup(k), then sets the corresponding bit in a std::atomic<uint8_t> using a relaxed atomic operation.
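The insert/lookup walk above can be sketched roughly as follows. This is an illustrative toy, not the PR's actual code: the names (Table, NUM_HASHES, MAX_DEPTH), the zero-means-empty convention, and the hash mixing are all assumptions made for a self-contained example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

constexpr size_t NUM_HASHES = 8;  // h0..h7
constexpr size_t MAX_DEPTH = 32;  // displacement bound guarantees termination

struct Table {
    std::vector<uint32_t> slots;  // 0 means "empty" in this toy example
    explicit Table(size_t n) : slots(n, 0) {}

    // h_i(k): the i-th candidate location for key k (toy mix, not the
    // hash used in the PR)
    size_t loc(uint32_t k, size_t i) const {
        return std::hash<uint64_t>{}((uint64_t(k) << 8) | i) % slots.size();
    }

    // Lookup(k): examine all eight candidate locations.
    bool contains(uint32_t k) const {
        for (size_t i = 0; i < NUM_HASHES; ++i)
            if (slots[loc(k, i)] == k) return true;
        return false;
    }

    // Insert(k): place into an empty candidate slot if possible; otherwise
    // displace a resident key (cycling through h0..h7) and retry with it.
    void insert(uint32_t k) {
        size_t last = 0;
        for (size_t depth = 0; depth < MAX_DEPTH; ++depth) {
            for (size_t i = 0; i < NUM_HASHES; ++i) {
                size_t s = loc(k, i);
                if (slots[s] == 0) { slots[s] = k; return; }
            }
            // Evict the key one past the last location used and carry it on.
            last = (last + 1) % NUM_HASHES;
            std::swap(k, slots[loc(k, last)]);
        }
        // At the depth limit the remaining key is simply dropped -- acceptable
        // because this is a cache, not a map.
    }
};
```

The bounded displacement loop is what makes insert guaranteed to terminate: a conventional cuckoo table would rehash or grow at this point, but a cache can just discard the loser.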
More information about the synchronization specification is in the code documentation.
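As a rough illustration of the lock-free lazy-delete idea (again an assumption-laden sketch, not the PR's code: ErasedFlags and its methods are invented names), each entry's erase bit can live in a std::atomic<uint8_t> and be set with a relaxed read-modify-write, so concurrent readers and erasers never take a lock:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ErasedFlags {
    std::vector<std::atomic<uint8_t>> bits;  // 8 erase flags per byte

    explicit ErasedFlags(size_t n) : bits((n + 7) / 8) {
        for (auto& b : bits) b.store(0, std::memory_order_relaxed);
    }

    // Mark entry i erased; relaxed ordering suffices because the flag only
    // ever transitions 0 -> 1 and carries no data dependencies.
    void mark_erased(size_t i) {
        bits[i / 8].fetch_or(uint8_t(1u << (i % 8)), std::memory_order_relaxed);
    }

    bool is_erased(size_t i) const {
        return bits[i / 8].load(std::memory_order_relaxed) & (1u << (i % 8));
    }
};
```

A stale read of the flag is harmless here: the worst case is treating a just-erased entry as live (a spurious cache hit on an entry that was valid moments ago), which is acceptable for a signature cache.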
Generation Heuristic
A bit vector tracks whether each element belongs to the current or the prior generation. Once the current generation's non-deleted elements exceed ~45% of the cache's capacity, the generation is retired: the previously retired generation is marked deleted, and the current generation becomes the prior one. Because the cache deletes lazily, those entries remain present but are more likely to be overwritten than elements in the newly retired generation. Rather than checking on every insert, the cache decides when to scan for retirement using a heuristic: it scans only once enough inserts have occurred since the last scan to possibly exceed the generation size.
This approach has a slight memory overhead (1 bit per entry), a worthwhile trade-off for intelligently deleting older entries before newer ones. A single bit is used for simplicity, since it already gives three representable states per entry (deleted, old, current). Using more generations (say, one more bit) would allow more granular freeing -- only the oldest quarter rather than half per rotation -- but at an additional memory cost, and two generations were sufficient for this use case.
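The two-generation rotation can be sketched as below. This is a hedged illustration under assumptions, not the PR's implementation: the names (EpochTracker, maybe_rotate, mark_inserted) are invented, and the real cache folds these flags into its entries and uses the insert-count heuristic rather than checking on every call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct EpochTracker {
    std::vector<bool> epoch_flags;  // 1 bit per entry: current vs prior generation
    std::vector<bool> deleted;      // lazy-delete markers
    size_t epoch_count = 0;         // live elements in the current generation
    size_t epoch_size;              // retire threshold, ~45% of capacity

    explicit EpochTracker(size_t capacity)
        : epoch_flags(capacity, false), deleted(capacity, true),
          epoch_size(capacity * 45 / 100) {}

    // Before an insert: if the current generation is full enough, rotate.
    // Prior-generation entries become (lazily) deleted; current entries
    // are demoted to the prior generation.
    void maybe_rotate() {
        if (epoch_count < epoch_size) return;
        for (size_t i = 0; i < epoch_flags.size(); ++i) {
            if (deleted[i]) continue;
            if (epoch_flags[i]) {
                epoch_flags[i] = false;  // current -> prior
            } else {
                deleted[i] = true;       // prior -> deleted (slot stays usable)
            }
        }
        epoch_count = 0;
    }

    // Record a fresh insert into the current generation.
    void mark_inserted(size_t slot) {
        deleted[slot] = false;
        epoch_flags[slot] = true;
        ++epoch_count;
    }
};
```

The key property is that an entry survives at least one full rotation after insertion before becoming eligible for lazy deletion, so older entries are always freed before newer ones.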
Simulation
These simulation results are slightly outdated; @morcos will chime in below with updated figures, which should be the same as or better than those below.
Overall, this is roughly a 40% improvement in validation time over master on a 16-core machine, simulated over a month of history. The simulation methodology sends relayed transactions and blocks to the node in the order they were received historically. On 4- and 1-core simulations, the improvement is not significant (but no worse). The tests did not really hit hard saturation of the cache, which suggests that the number of megabytes allocated to the cache could be reduced.
Testing
The PR includes a test suite which checks a few key properties:
- Check that values not inserted are never readable
- Check that a reasonable hit rate is achieved
- Check that erased elements are preferentially overwritten by new inserts rather than evicting fresh contents
- Check that the previous property holds when erases happen in parallel.
Future Work
There are future optimizations that may bring some of the advantages seen with 16 cores to the 4- and 1-core cases, but this PR is focused on a minimal-complexity patch with a demonstrable improvement over master, as a base for future work in this direction.
There are also a few cleanup-related items (such as cache initialization) which are not done in this patch, to keep it a minimal code change. These can be addressed in a later PR if this is accepted.
Acknowledgements
Thanks to @morcos, @sdaftuar, and @TheBlueMatt for their assistance and review on this patch.