edited: This PR has been squashed with a few revisions from the initial PR. The unsquashed version is still available here https://github.com/bitcoin/bitcoin/compare/master...JeremyRubin:cuckoocache-pull-request-not-squashed. The PR message has also been updated to reflect the current squashed version more closely.
Summary
This PR replaces the boost::unordered_set in sigcache with CuckooCache::cache.
Benefits
- Better memory utilization (no pointers/linked lists/stored hashes)
- Lock-free concurrent reads and erases
- Lazy Deletion of entries (potentially beneficial in case of re-orgs)
- Low cache thrashing -- a standard lookup and erase should only touch ~9 cache lines (the 8 hash locations h0..h7, plus the erase flag)
- Allocation-free design
- Insert guaranteed to terminate
- Generations smartly delete older entries before newer ones
Design
The CuckooCache uses a design similar to a cuckoo hash table. Each inserted key is hashed with eight hash functions, h0 through h7, giving it eight candidate locations.
- Lookup(k) examines locations h0(k)..h7(k).
- Insert(k) first checks whether any of h0(k)..h7(k) is empty, and inserts into one if so. If not, insert swaps k with the key at one past the last hash location used (so it cycles through h0..h7) and attempts to insert the swapped-out key at one of its other locations, repeating up to a fixed limit if necessary. At that limit, the last displaced entry is dropped. Insert also checks a heuristic (described below) to decide whether to age a generation before trying to insert.
- Delete(k) first does a Lookup(k), then sets the corresponding bit in a std::atomic<uint8_t> using a relaxed atomic operation.
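The insert/lookup walk above can be sketched roughly as follows. This is an illustrative toy, not the PR's actual code: the names (Table, NUM_HASHES, MAX_DEPTH), the zero-means-empty convention, and the hash mixing are all assumptions made for a self-contained example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

constexpr size_t NUM_HASHES = 8;  // h0..h7
constexpr size_t MAX_DEPTH = 32;  // displacement bound guarantees termination

struct Table {
    std::vector<uint32_t> slots;  // 0 means "empty" in this toy example
    explicit Table(size_t n) : slots(n, 0) {}

    // h_i(k): the i-th candidate location for key k (toy mix, not the
    // hash used in the PR)
    size_t loc(uint32_t k, size_t i) const {
        return std::hash<uint64_t>{}((uint64_t(k) << 8) | i) % slots.size();
    }

    // Lookup(k): examine all eight candidate locations.
    bool contains(uint32_t k) const {
        for (size_t i = 0; i < NUM_HASHES; ++i)
            if (slots[loc(k, i)] == k) return true;
        return false;
    }

    // Insert(k): place into an empty candidate slot if possible; otherwise
    // displace a resident key (cycling through h0..h7) and retry with it.
    void insert(uint32_t k) {
        size_t last = 0;
        for (size_t depth = 0; depth < MAX_DEPTH; ++depth) {
            for (size_t i = 0; i < NUM_HASHES; ++i) {
                size_t s = loc(k, i);
                if (slots[s] == 0) { slots[s] = k; return; }
            }
            // Evict the key one past the last location used and carry it on.
            last = (last + 1) % NUM_HASHES;
            std::swap(k, slots[loc(k, last)]);
        }
        // At the depth limit the remaining key is simply dropped -- acceptable
        // because this is a cache, not a map.
    }
};
```

The bounded displacement loop is what makes insert guaranteed to terminate: a conventional cuckoo table would rehash or grow at this point, but a cache can just discard the loser.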
More information about the synchronization specification is in the code documentation.
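As a rough illustration of the lock-free lazy-delete idea (again an assumption-laden sketch, not the PR's code: ErasedFlags and its methods are invented names), each entry's erase bit can live in a std::atomic<uint8_t> and be set with a relaxed read-modify-write, so concurrent readers and erasers never take a lock:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ErasedFlags {
    std::vector<std::atomic<uint8_t>> bits;  // 8 erase flags per byte

    explicit ErasedFlags(size_t n) : bits((n + 7) / 8) {
        for (auto& b : bits) b.store(0, std::memory_order_relaxed);
    }

    // Mark entry i erased; relaxed ordering suffices because the flag only
    // ever transitions 0 -> 1 and carries no data dependencies.
    void mark_erased(size_t i) {
        bits[i / 8].fetch_or(uint8_t(1u << (i % 8)), std::memory_order_relaxed);
    }

    bool is_erased(size_t i) const {
        return bits[i / 8].load(std::memory_order_relaxed) & (1u << (i % 8));
    }
};
```

A stale read of the flag is harmless here: the worst case is treating a just-erased entry as live (a spurious cache hit on an entry that was valid moments ago), which is acceptable for a signature cache.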
Generation Heuristic
A bit vector tracks whether each element belongs to the current or the prior generation. Once the current generation's non-deleted elements exceed ~45% of the cache's capacity, the generation is retired: the previously retired generation is marked deleted, and the current generation becomes the prior one. Because the cache deletes lazily, those entries remain present but are more likely to be overwritten than elements in the newly retired generation. Rather than checking on every insert, the cache decides when to scan for retirement using a heuristic: it scans only once enough inserts have occurred since the last scan to possibly exceed the generation size.
This approach has a slight memory overhead (1 bit per entry), a worthwhile trade-off for intelligently deleting older entries before newer ones. A single bit is used for simplicity, since it already gives three representable states per entry (deleted, old, current). Using more generations (say, one more bit) would allow more granular freeing -- only the oldest quarter rather than half per rotation -- but at an additional memory cost, and two generations were sufficient for this use case.
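The two-generation rotation can be sketched as below. This is a hedged illustration under assumptions, not the PR's implementation: the names (EpochTracker, maybe_rotate, mark_inserted) are invented, and the real cache folds these flags into its entries and uses the insert-count heuristic rather than checking on every call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct EpochTracker {
    std::vector<bool> epoch_flags;  // 1 bit per entry: current vs prior generation
    std::vector<bool> deleted;      // lazy-delete markers
    size_t epoch_count = 0;         // live elements in the current generation
    size_t epoch_size;              // retire threshold, ~45% of capacity

    explicit EpochTracker(size_t capacity)
        : epoch_flags(capacity, false), deleted(capacity, true),
          epoch_size(capacity * 45 / 100) {}

    // Before an insert: if the current generation is full enough, rotate.
    // Prior-generation entries become (lazily) deleted; current entries
    // are demoted to the prior generation.
    void maybe_rotate() {
        if (epoch_count < epoch_size) return;
        for (size_t i = 0; i < epoch_flags.size(); ++i) {
            if (deleted[i]) continue;
            if (epoch_flags[i]) {
                epoch_flags[i] = false;  // current -> prior
            } else {
                deleted[i] = true;       // prior -> deleted (slot stays usable)
            }
        }
        epoch_count = 0;
    }

    // Record a fresh insert into the current generation.
    void mark_inserted(size_t slot) {
        deleted[slot] = false;
        epoch_flags[slot] = true;
        ++epoch_count;
    }
};
```

The key property is that an entry survives at least one full rotation after insertion before becoming eligible for lazy deletion, so older entries are always freed before newer ones.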
Simulation
These simulation results are slightly outdated; @morcos will chime in below with updated figures, which should be the same as or better than those below.
Overall, this is roughly a 40% improvement in validation time over master on a 16-core machine, simulated over a month of history. The simulation methodology sends relayed transactions and blocks to the node in the order they were received historically. On 4- and 1-core simulations, the improvement is not significant (but no worse). The tests did not really hit hard saturation of the cache, which suggests that the number of megabytes allocated to the cache could be reduced.
Testing
The PR includes a test suite which checks a few key properties:
- Check that values not inserted are never readable
- Check that a reasonable hit rate is achieved
- Check that erased elements are preferentially overwritten by new inserts rather than evicting fresh contents
- Check that the previous property holds when erases happen in parallel.
Future Work
There are future optimizations that may bring some of the advantages seen with 16 cores to the 4- and 1-core cases, but this PR is focused on a minimal-complexity patch with a demonstrable improvement over master, as a base for future work in this direction.
There are also a few cleanup-related items (such as cache initialization) which are not done in this patch, to keep it a minimal code change. These can be addressed in a later PR if this is accepted.
Acknowledgements
Thanks to @morcos, @sdaftuar, and @TheBlueMatt for their assistance and review on this patch.