Fix horrific performance found by gmaxwell. #740

pull TheBlueMatt wants to merge 1 commit into bitcoin:master from TheBlueMatt:perf changing 1 file +33 −1
  1. TheBlueMatt commented at 7:58 am on January 3, 2012: member
    I hereby certify that I have grepped for every use of CDataStream and found no case where secure_allocator is necessary. Note that this fix makes blockchain download, by gmaxwell’s estimate, some “1000x faster”.
  2. luke-jr commented at 8:00 am on January 3, 2012: member
    would be much nicer if you just typedef std::vector<char> vector_type; :P
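    A minimal sketch of the shape that change takes inside CDataStream (illustrative rather than the exact diff; the typedef and member names follow the serialize.h of that era):

        class CDataStream
        {
        protected:
            // before: every serialization buffer was mlock'd and zeroed on free
            // typedef std::vector<char, secure_allocator<char> > vector_type;

            // after: plain allocator for network/disk buffers that hold no secrets
            typedef std::vector<char> vector_type;

            vector_type vch;
            // ...
        };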
  3. TheBlueMatt commented at 8:04 am on January 3, 2012: member
    I prefer it the other way, but meh, I don’t care; changed.
  4. gmaxwell commented at 8:07 am on January 3, 2012: contributor

    Well… 1000x is perhaps an exaggeration; more like 40-50x or so. A complete chain sync-up here took 29 minutes on code with all the mlock removed. On the same system, syncing to about height 37k took three and a half hours earlier today. I’m not patient enough to benchmark a complete chain sync-up without the fix, though perhaps someone else will.

    This problem was really tricky to find: oprofile cycle sampling was completely blind to the slowdown caused by mlock. I guess if I had hooked GLOBAL_TLB_FLUSHES I would have seen it, but who would guess to do that? Valgrind’s callgrind couldn’t see it either (unsurprisingly). I eventually found it with ltrace.

    In general we shouldn’t be calling mlock per-allocation for anything. Even on private key data it’s really too slow to use more than a couple of times during execution, though I guess it’s not completely horrific if just used for that. Instead we should pre-allocate an mlocked arena and allocate from that for private keys. Perhaps boost has something to make this easier?

    Independent of that, CDataStream’s wide usage and allocation behavior make me cringe. It’s creating a LOT of tiny indirect heap allocation calls all over the code. That can’t be good for performance (though obviously not on the same scale as the mlock overuse).

  5. gavinandresen commented at 3:42 pm on January 3, 2012: contributor

    Error compiling, compiler version problem?

    i686-apple-darwin10-llvm-g++-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.6)

    serialize.h:872: error: ‘CDataStream::CDataStream(const std::vector<char, std::allocator<char> >&, int, int)’ cannot be overloaded
    serialize.h:867: error: with ‘CDataStream::CDataStream(const std::vector<char, std::allocator<char> >&, int, int)’

    Also, not zeroing freed memory any more makes me nervous, because memory containing leftover garbage is a good building block for remote code exploits. Maybe a free_after_delete_allocator… (does boost already define one?)
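    A plausible reconstruction of that overload clash, assuming the convenience constructors in serialize.h looked roughly like the following: once vector_type becomes plain std::vector<char>, the two declarations reduce to the same signature and can no longer be overloaded.

        // hypothetical reconstruction of serialize.h around lines 867/872
        CDataStream(const std::vector<char>& vchIn, int nTypeIn, int nVersionIn);  // pre-existing overload
        CDataStream(const vector_type& vchIn, int nTypeIn, int nVersionIn);        // now the same type: std::vector<char>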

  6. sipa commented at 3:54 pm on January 3, 2012: member
    Same error here, on Ubuntu. I’m also in favor of keeping the zeroing functionality, even for non-mlocked allocations.
  7. gmaxwell commented at 4:01 pm on January 3, 2012: contributor

    Boost apparently does have an allocator for which you can override how it gets memory: http://www.boost.org/doc/libs/1_47_0/libs/pool/doc/interfaces/user_allocator.html If we want to be properly security-paranoid, and we’re using a specialized allocator, we could potentially add canary functionality, just like the stack protection, in addition to the zeroizing. I’d have to dump the allocator usage data to see how much overhead that might have (e.g. add a word after, and perhaps before, every allocation which is checked on free).

    I think the mlock security improvement is pretty inconsequential; the zeroizing matters more, since it makes use-after-free exploits hard. While we’re discussing this, we should note that if we use our own allocator we should add IFDEFed-out valgrind macros (see memcheck.h, IIRC) so that we don’t reduce valgrind’s sensitivity.
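    A standalone sketch of the canary idea described above (the names canary_alloc and canary_free are made up for illustration, not an existing Bitcoin or Boost API): reserve one extra word past each allocation, fill it with a known value, and verify it when the block is released; the same release hook is where zeroing and the #ifdef'd valgrind client requests would go.

        #include <cstdlib>
        #include <cstring>
        #include <stdint.h>

        static const uint64_t CANARY = 0xdeadbeefcafef00dULL;

        void* canary_alloc(size_t n)
        {
            // allocate the requested bytes plus room for a trailing canary word
            char* p = static_cast<char*>(std::malloc(n + sizeof(CANARY)));
            if (!p) return NULL;
            std::memcpy(p + n, &CANARY, sizeof(CANARY));
            return p;
        }

        void canary_free(void* ptr, size_t n)
        {
            if (!ptr) return;
            char* p = static_cast<char*>(ptr);
            uint64_t check;
            std::memcpy(&check, p + n, sizeof(check));
            if (check != CANARY)
                std::abort();              // something wrote past the allocation
            std::memset(p, 0, n);          // zeroize before handing the block back to the heap
            std::free(p);
            // a fuller version would also emit VALGRIND_MALLOCLIKE_BLOCK /
            // VALGRIND_FREELIKE_BLOCK client requests behind an #ifdef
        }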

  8. TheBlueMatt commented at 6:43 pm on January 3, 2012: member
    Fixed (I need to stop coding so late; that’s probably also why I didn’t catch this bug before I included it in my encryption pull…). My 2 cents (well, OK, more like 1): use-after-free exploits are so rare it’s almost not worth doing the zeroing. When you combine that with the fact that the item in question is a std::vector, not a class, I would argue it’s even less of an issue. If someone wants to use the stuff in the vector to call a nasty function, they would be 100x more likely to be able to while it’s still allocated and in memory, not once it has been freed/reallocated.
  9. TheBlueMatt commented at 7:07 pm on January 3, 2012: member

    Oops, more to fix (I had to board and didn’t have time to test it). I’ll fix it when I’m in the air and have wifi.

  10. TheBlueMatt commented at 7:50 am on January 4, 2012: member
    Actually fixed in that commit.
  11. laanwj commented at 10:53 am on January 4, 2012: member
    ACK, this is great :)
  12. gavinandresen commented at 3:49 pm on January 4, 2012: contributor

    I love the performance improvement, but I still don’t like the elimination of zero-after-free. Security in depth is important.

    Here’s the danger:

    Attacker finds a remotely-exploitable buffer overrun somewhere in the networking code that crashes the process. They turn the crash into a full remote exploit by sending carefully constructed packets before the crash packet, to initialize used-but-then-freed memory to a known state.

    Unlikely? Sure.

    Is it ugly to define a zero_after_free_allocator for CDataStream? Sure. (simplest implementation: copy secure_allocator, remove the mlock/munlock calls).

    But given that CDataStream is the primary interface between bitcoin and the network, I think being extra paranoid here is a very good idea.
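    A sketch of what that could look like, following the recipe above (secure_allocator’s shape minus the mlock/munlock calls); illustrative rather than the code that was eventually merged:

        #include <memory>
        #include <cstring>

        template <typename T>
        struct zero_after_free_allocator : public std::allocator<T>
        {
            typedef std::allocator<T> base;
            typedef typename base::size_type size_type;
            typedef typename base::pointer pointer;

            zero_after_free_allocator() throw() {}
            zero_after_free_allocator(const zero_after_free_allocator& a) throw() : base(a) {}
            template <typename U>
            zero_after_free_allocator(const zero_after_free_allocator<U>& a) throw() : base(a) {}
            template <typename U>
            struct rebind { typedef zero_after_free_allocator<U> other; };

            void deallocate(pointer p, size_type n)
            {
                // wipe the buffer before returning it to the heap
                // (a plain memset can in principle be optimized away;
                // production code would use a cleanse helper)
                if (p != NULL)
                    std::memset(p, 0, sizeof(T) * n);
                base::deallocate(p, n);
            }
        };

        // CDataStream could then use:
        //   typedef std::vector<char, zero_after_free_allocator<char> > vector_type;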

  13. gmaxwell commented at 4:09 pm on January 4, 2012: contributor

    The performance cost of the zeroization doesn’t appear to be huge: it adds about 30 seconds to a 30-minute full sync (compared to not zeroizing).

    Chart showing all mlocks gone vs. bluematt’s patch: http://people.xiph.org/~greg/bitcoin-sync.png The bluematt patch gets a late start due to the time spent filling the keypool, but because it’s a bit faster it eventually catches up.

    The mlocks on the keying stuff are still problematic: the keypool refill takes 17 seconds with the mlocks in, <1 second with them out. Users with encrypted wallets will often do a mass refill when they unlock. To resolve that, the frequency of mlocking should be reduced. The pool allocator I linked to above should make that easy (make secure_allocator use the pool, make the pool use malloc + mlock).

    The frequent mlock usage may also be creating security weaknesses in the form of mlock ineffectiveness. If the process leaks mlocked pages then eventually it will hit the limit (e.g. I think most linux desktops have a 1MB/process limit or something like that), and past the limit pages will no longer be mlocked.
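    A sketch of that arrangement, assuming Boost’s pool UserAllocator concept from the link above (the names mlocked_user_allocator and secure_pool are made up here): the pool grabs large chunks through a hook that mlocks each chunk once, and small private-key buffers are then carved out of those chunks instead of each triggering their own mlock.

        #include <boost/pool/pool.hpp>
        #include <cstddef>
        #include <cstdlib>
        #include <sys/mman.h>   // mlock/munlock (POSIX)

        struct mlocked_user_allocator
        {
            typedef std::size_t size_type;
            typedef std::ptrdiff_t difference_type;

            static char* malloc(const size_type bytes)
            {
                char* p = static_cast<char*>(std::malloc(bytes));
                if (p)
                    ::mlock(p, bytes);   // lock the whole chunk once
                return p;
            }
            static void free(char* const block)
            {
                // the pool does not pass the size back here, so a real version
                // would track chunk sizes in order to munlock before freeing
                std::free(block);
            }
        };

        // secure_allocator would then draw small key buffers from this pool
        // rather than calling mlock on every allocation:
        static boost::pool<mlocked_user_allocator> secure_pool(32 /* bytes per element, illustrative */);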

  14. TheBlueMatt commented at 7:04 pm on January 4, 2012: member
    I was thinking it would make more of a performance difference, but as long as it’s very minor, I just made it zero_after_free.
  15. gmaxwell commented at 7:10 pm on January 4, 2012: contributor
    BlueMatt: I don’t think you should have to provide an allocate(). Though maybe it’s fine to leave it in case we convert it into a canary allocator.
  16. sipa commented at 10:49 pm on January 4, 2012: member
    ACK
  17. Fix horrific performance found by gmaxwell. f7a9a11391
  18. TheBlueMatt commented at 0:46 am on January 5, 2012: member
    @gmaxwell heh, of course…maybe I just need to give up programming at this point…
  19. gmaxwell commented at 4:01 am on January 5, 2012: contributor
    ACK
  20. gavinandresen referenced this in commit 7486c64dd8 on Jan 5, 2012
  21. gavinandresen merged this on Jan 5, 2012
  22. gavinandresen closed this on Jan 5, 2012

  23. coblee referenced this in commit 4d52355e5d on Jul 17, 2012
  24. ptschip referenced this in commit 3199704967 on Aug 14, 2017
  25. sipa referenced this in commit ef7117193c on May 22, 2020
  26. sipa referenced this in commit 67f232b5d8 on Jun 9, 2020
  27. stackman27 referenced this in commit 78cde6f8c7 on Jun 26, 2020
  28. rajarshimaitra referenced this in commit ea333901cc on Aug 5, 2021
  29. DrahtBot locked this on Sep 8, 2021
