CPU DoS on mainnet in debug mode #27586

issue opened by ajtowns on May 6, 2023
  1. ajtowns commented at 2:00 am on May 6, 2023: contributor

    I’ve been seeing 100% cpu usage on one of my bitcoind nodes recently, with transactions taking >60s to be accepted into the mempool. Digging into it, the problem seems to be BOOST_MULTI_INDEX_ENABLE_SAFE_MODE which was introduced in #24395. That results in iterator destruction taking O(n) time (in the number of other active iterators on the same multi-index) in the worst case, rather than O(1) time, due to keeping the iterators in a singly linked list:

     * Iterators are chained in a single attached list, whose header is
     * kept by the container. More elaborate data structures would yield better
     * performance, but I decided to keep complexity to a minimum since
     * speed is not an issue here.
    

    Speed does become an issue here though, as we keep an iterator around for every mempool entry (in vTxHashes).

    This isn’t always a problem: presumably in the normal case new iterators are added to the head of the list then removed while they’re still roughly at the head of the list, giving O(1) timing; but somehow having a full mempool (and hence TrimToSize actually doing things) triggers worst case behaviour?
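
    To make the pattern concrete, here is a minimal standalone sketch (not code from Bitcoin Core; the container, element type and count are purely illustrative) of what holding one long-lived iterator per element, the way vTxHashes holds one per mempool entry, does under BOOST_MULTI_INDEX_ENABLE_SAFE_MODE when the oldest iterators are dropped first:

     // sketch.cpp -- illustrative only; Boost.MultiIndex is header-only: g++ -O2 sketch.cpp
     #define BOOST_MULTI_INDEX_ENABLE_SAFE_MODE
     #include <boost/multi_index_container.hpp>
     #include <boost/multi_index/ordered_index.hpp>
     #include <boost/multi_index/identity.hpp>
     #include <deque>

     using Index = boost::multi_index_container<
         int,
         boost::multi_index::indexed_by<
             boost::multi_index::ordered_unique<boost::multi_index::identity<int>>>>;

     int main()
     {
         Index idx;
         std::deque<Index::iterator> held; // like vTxHashes: one live iterator per entry
         for (int i = 0; i < 50000; ++i) held.push_back(idx.insert(i).first);

         // Safe mode chains every live iterator into a singly linked list kept by the
         // container. If new iterators sit near the head of that list, destroying the
         // oldest ones first forces each destructor to walk nearly the whole list to
         // unlink itself: O(n) per iterator, roughly O(n^2) in total. Destroying
         // newest-first stays close to O(1) per iterator, and without the define both
         // orders are O(1) per iterator.
         while (!held.empty()) held.pop_front(); // drop oldest-first (worst case)
     }

    Whether mempool trimming actually produces this oldest-first destruction order is exactly the open question above; the sketch only shows why the order would matter once the iterator list gets long.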

  2. maflcko added the label Brainstorming on May 6, 2023
  3. maflcko added the label Tests on May 6, 2023
  4. maflcko added the label P2P on May 6, 2023
  5. darosior commented at 9:14 am on May 7, 2023: member
    I experienced this as well.
  6. ArmchairCryptologist commented at 6:39 pm on May 7, 2023: none

    Been seeing this on my 24.0.1 public nodes as well; it’s not 100% CPU usage as such, but a single CPU core is spinning at 100%. One 24.0.1 node with listen=0 did not seem affected.

    I noticed that after a node restart, the CPU usage went back to normal (and is still normal after a day) even though the mempool was loaded from disk - though it might be because it’s no longer entirely full due to mempoolminfee.

    PS: These nodes do not run in debug mode.

  7. darosior commented at 8:17 pm on May 7, 2023: member

    Just to confirm, yours was in debug mode as well? Otherwise it’d be concerning.

  8. ArmchairCryptologist commented at 8:29 pm on May 7, 2023: none

    Just to confirm, yours was in debug mode as well? Otherwise it’d be concerning.

    My nodes are NOT in debug mode, no. I run four public nodes (24.0.1) with different server hosts, and all four of them were spinning a CPU core at 100% and experiencing slowness with block/transaction handling when I checked them yesterday, two to the point where they started occasionally dropping connectivity to my private nodes. Like I mentioned, one 24.0.1 node which does not accept incoming connections (but is connected to untrusted nodes) did not have this issue. The other private nodes still run 23.0.0 and also do not have any issues, though these only connect through my own public nodes so they aren’t directly exposed to untrusted nodes.

  9. ArmchairCryptologist commented at 5:43 am on May 8, 2023: none

    Of the four nodes, the one that was restarted on May 6th is now back to spinning 100% of a CPU core (60% average CPU usage on a two-core VM the last hour) while the three that were restarted yesterday are also seeing high CPU usage but not 100% all the time (between 30% and 38% average CPU usage on a two-core VM), so it’s building back up relatively quickly. I also verified that the mempool is not fully committed (usage is at 251217552).

    There were around 15 peers with bytessent_per_msg.inv over 100000 and no received invs at all; even after doing setban on them and verifying that they had been disconnected, the CPU usage did not drop noticeably, if at all. This is probably because the node has 180 connected peers, and 140+ of those have both bytessent_per_msg.inv and bytesrecv_per_msg.inv over 100000, so disconnecting 15 of them for not sending any invs probably doesn’t do much.

    Based on what you say about it being a different bug I assume that it is not an issue introduced with 24.0, but I reverted one of the nodes to 23.0 just to verify.

  10. localbool commented at 7:27 am on May 8, 2023: none

    Just to confirm, yours was in debug mode as well? Otherwise it’d be concerning.

    What is debug mode? A binary compiled with --enable-debug option or debug=1 saved in bitcoin.conf?

  11. willcl-ark commented at 7:28 am on May 8, 2023: member

    What is debug mode? A binary compiled with --enable-debug option or debug=1 saved in bitcoin.conf?

    Compiled with --enable-debug

  12. disappointed72 commented at 7:07 pm on May 8, 2023: none
    Same issue with a 24.0.1 node today; it went back to normal after a daemon reload. My node does NOT run in debug mode either.
  13. poiuty commented at 8:09 pm on May 8, 2023: contributor

    I have the same problem. bitcoind started using 100% cpu (single thread) on my servers. I used the binary from https://bitcoincore.org/bin/bitcoin-core-24.0.1/bitcoin-24.0.1-x86_64-linux-gnu.tar.gz

    Built bitcoind from source (v24.0.1.tar.gz). Works fine for an hour now.

    ./configure --without-gui --disable-zmq --disable-wallet
    
  14. golden-guy commented at 8:29 pm on May 8, 2023: none

    Same issue with a 24.0.1 node today; it went back to normal after a daemon reload. My node does NOT run in debug mode either.

    Same here since yesterday, using the 24.0.1 binaries NOT running in debug mode. (Binaries from https://bitcoincore.org/bin/bitcoin-core-24.0.1/bitcoin-24.0.1-x86_64-linux-gnu.tar.gz). I didn’t restart the node, but used the setban approach as pointed out here #27586 (comment)

    Since then, the single-core load has decreased to 45% without a restart, and the node’s CLI and RPC calls are responsive again, even with mempool allocation peaking at 300 MB. So this has brought relief to some extent.

    Using this command to identify those peers: bitcoin-cli getpeerinfo | jq -r '.[] | if (.bytessent_per_msg.inv > 100000 and .bytesrecv_per_msg.inv == null) then .addr else empty end'

  15. Sjors commented at 8:28 am on May 9, 2023: member

    The above command on my listening node (running a mess of commits on top of master from ~March) returns localhost (probably my mempool instance) plus 10 other peers. Five of those are universities (e.g. DNS Bitcoin Monitoring), 3 are Digital Ocean. Some of them are bitcoinj wallets, i.e. you wouldn’t expect them to relay anything to us (except perhaps if they make a transaction).

    CPU use looks normal to me; it’s not compiled with --enable-debug. Last restart was 3 days ago. I also have peerbloomfilters=0 (default since v0.19).

  16. ArmchairCryptologist commented at 9:27 am on May 9, 2023: none

    Just to confirm, after a day or so I’m seeing the same kind of CPU load increasing over time with the node I downgraded to 23.0.0, so this other issue probably was not introduced with 24.0.1.

    Reducing maxconnections to a low-ish number (100 or less) seems to at least significantly delay the problem, though I can’t say yet whether it’ll mitigate it entirely. 150-200 is definitely causing issues on a 4 GHz Xeon.

    I didn’t restart the node, but used the setban approach as pointed out here #27586 (comment)

    That comment was actually a reply to a comment by @ajtowns who pointed that out as a possible mitigation to a potential issue he identified, but said comment seems to have been removed (too much information?), so it probably seems somewhat out-of-context.

  17. Sjors commented at 11:56 am on May 9, 2023: member
    @ArmchairCryptologist do you have peerbloomfilters on?
  18. ArmchairCryptologist commented at 12:16 pm on May 9, 2023: none

    @ArmchairCryptologist do you have peerbloomfilters on?

    Nope. The nodes are pretty much running with the default settings except for maxconnections. But like I mentioned earlier, I’m only seeing this on well-connected public nodes, so if you get <100 incoming connections you might not be having this issue.

  19. gits7r commented at 3:09 pm on May 9, 2023: none

    This is true for 2 (two) full nodes that run with pruning off, peerbloomfilters off, maxmempool set to 2048 and maxconnections set to 500, where both have an average between 170 and 225 active peers. They are also NOT in debug mode, but still affected by abnormally high CPU usage and RAM consumption.

    Both use fairly reasonable CPUs, yet are at 100% CPU usage and consuming a lot of RAM (around 8 GB for bitcoind alone). For reference:

    “connections”: 222, “connections_in”: 212, “connections_out”: 10,

    [screenshot: bitcoind resource usage]

  20. glozow commented at 3:43 pm on May 9, 2023: member

    This is true for 2 (two) full nodes that run with pruning off, peerbloomfilters off, maxmempool set to 2048 and maxconnections set to 500, where both have an average between 170 and 225 active peers.

    This particular case seems like you have configured bitcoind to do more than your CPU can handle during higher-traffic times. Consider reducing maxconnections for now. And perhaps shrink maxmempool; may I ask why you are using 2048?

  21. theStack commented at 1:02 am on May 10, 2023: contributor

    For the reporters using htop to analyze CPU usage (which is the case for at least @poiuty @gits7r, according to the posted screenshots): could you go to the display settings (reachable via key F2) and activate “tree view” and the global option “show custom thread names”? This would show the CPU usage per thread, which could be interesting for further investigation, e.g.:

    [screenshot: htop per-thread CPU usage]

    (FWIW I’m running a listening pruned node on a quite lame dual-core virtual server and have rarely seen more than 10-15% CPU usage over the last few days; I will tinker around with higher maxmempool/maxconnections settings to see if I can reproduce the issue reported above.)

  22. ajtowns commented at 2:23 am on May 10, 2023: contributor
    I think #27610 should improve the non-debug CPU usage in b-msghand. Both commits should cleanly cherry-pick on top of v24.x; and the first commit (the “CompareDepthAndScore” one) should cleanly cherry-pick for v23.x.
  23. ayeowch commented at 5:29 am on May 10, 2023: contributor

    Using this command to identify those peers: bitcoin-cli getpeerinfo | jq -r '.[] | if (.bytessent_per_msg.inv > 100000 and .bytesrecv_per_msg.inv == null) then .addr else empty end'

    Someone posted https://news.ycombinator.com/item?id=35824998 linking to a DoS script https://github.com/visualbasic6/drain/blob/c8373b55d83832f95fb0a77aca56436adeffb58d/drain-btc.py a day before this issue. It’s possible someone may be using the script targeting public nodes.

    Check your bytessent_per_msg.headers for anomalous hits. I’m seeing bytessent_per_msg.headers=2542689711 and 100% CPU util on b-msghand thread when testing the DoS script for 10s with single thread on my localhost node.

  24. ArmchairCryptologist commented at 5:40 am on May 10, 2023: none

    It’s possible someone may be using the script targeting public nodes.

    FWIW, I was not seeing any ridiculous bytessent_per_msg.inv numbers on any of my nodes. In fact, the highest counts on all of my public nodes belong to my private nodes, but not by much.

    A notable change happened yesterday (May 9th) around 14:25 UTC, when the CPU load on all the public nodes dropped drastically at the same time, and they all look normal at present. The high-fee BRC-20 token spam we were seeing also appears to have quieted down a bit, so it seems likely that the issues were simply related to that. Unless it starts back up, it might be hard to diagnose this further without artificially generating a similar traffic pattern.

    Incidentally, the command originally provided by @ajtowns which lists the peer addresses with the inv numbers was:

    bitcoin-cli getpeerinfo | jq -j '.[] | (.bytessent_per_msg.inv, " ", .bytesrecv_per_msg.inv, " ", .addr, "\n")' | grep ' null ' | sort -n

    (Remove the | grep ' null ' part to show nodes that are also sending invs.)

    Edit: I misread your post; I see you are referring to the headers field. I’m not seeing any anomalous hits at present, and it doesn’t seem consistent with the behavior I was seeing (a gradual increase of CPU usage over time), but I’ll check it to rule it out if the problem reappears.

  25. Sjors commented at 7:33 am on May 10, 2023: member

    Using: bitcoin-cli getpeerinfo | jq -j '.[] | (.bytessent_per_msg.headers, " ", .addr, "\n")' | sort -n | grep -v 127.0.0.1

    My highest bytessent_per_msg.headers is 35 KB (for non-localhost peers), i.e. a lot less than the 2.5 GB @ayeowch is seeing. CPU still calm. The node has 114 inbound connections (and 10 out).

    I guess that means none of my peers have asked me for the full 64 MB headers chain during this session.

  26. Sjors commented at 9:09 am on May 10, 2023: member
    @ayeowch running that particular script against a modest Intel i5-8400 with 6 cores, either from another computer on the LAN or from the machine itself, I don’t see any noticeable CPU difference (compared to all the other stuff it’s running). Even when downloading 10 GB of headers in a few minutes, which is probably a faster download rate than you experienced given the total (though you tally disconnected peers too, I suppose).
  27. willcl-ark commented at 9:52 am on May 10, 2023: member

    FWIW I am also seeing some sustained high CPU usage on a listening (non debug mode) 24.0.1 node from the b-msghand thread:

    [screenshot: htop showing the b-msghand thread at high CPU usage]

    I am used to seeing this thread at ~1% on this machine:

    [screenshot: htop with the b-msghand thread at ~1%]

    I also have a number of these “hungry” peers:

     bitcoin in ~ took 53s ➜ bitcoin-cli getpeerinfo | jq -j '.[] | (.bytessent_per_msg.inv, " ", .bytesrecv_per_msg.inv, " ", .addr, "\n")' | grep ' null ' | sort -n
     2501 null xx.xx.xx.xx:xxxx
     6379589 null xx.xx.xx.xx:xxxx
     19839661 null xx.xx.xx.xx:xxxx
     24468696 null xx.xx.xx.xx:xxxx
     38336458 null xx.xx.xx.xx:xxxx
     62889263 null xx.xx.xx.xx:xxxx
     79513965 null xx.xx.xx.xx:xxxx
     80910412 null xx.xx.xx.xx:xxxx
     80972609 null xx.xx.xx.xx:xxxx
     82147824 null xx.xx.xx.xx:xxxx
     343710563 null xx.xx.xx.xx:xxxx
     484505648 null xx.xx.xx.xx:xxxx
     484552023 null xx.xx.xx.xx:xxxx
     484575864 null xx.xx.xx.xx:xxxx
    

    This hardware usually deals with bitcoind pretty easily and has been running with maxconnections for a while:

     bitcoin in ~ ➜ bitcoin-cli getnettotals
     {
       "totalbytesrecv": 49382137900,
       "totalbytessent": 1290284641863,
       "timemillis": 1683631908883,
       "uploadtarget": {
         "timeframe": 86400,
         "target": 0,
         "target_reached": false,
         "serve_historical_blocks": true,
         "bytes_left_in_cycle": 0,
         "time_left_in_cycle": 0
       }
     }
    

    I will cherry-pick #27610 on top of 24.0.1 and see how it fares.

  28. poiuty commented at 11:19 am on May 10, 2023: contributor

    @ayeowch, I ran the script on several VPSes. It generated a lot of traffic and the CPU load increased. But at the time bitcoind was using 100% CPU (single thread), I didn’t see much network traffic.

  29. josibake commented at 12:50 pm on May 10, 2023: member

    Using this command to identify those peers: bitcoin-cli getpeerinfo | jq -r '.[] | if (.bytessent_per_msg.inv > 100000 and .bytesrecv_per_msg.inv == null) then .addr else empty end'

    Someone posted https://news.ycombinator.com/item?id=35824998 linking to a DoS script https://github.com/visualbasic6/drain/blob/c8373b55d83832f95fb0a77aca56436adeffb58d/drain-btc.py a day before this issue. It’s possible someone may be using the script targeting public nodes.

    Check your bytessent_per_msg.headers for anomalous hits. I’m seeing bytessent_per_msg.headers=2542689711 and 100% CPU util on b-msghand thread when testing the DoS script for 10s with single thread on my localhost node. @ayeowch I haven’t tested this, but can’t you mitigate the damage done by the script by setting -maxuploadtarget?

  30. maflcko commented at 6:35 pm on May 11, 2023: member
    I presume this is still an issue? If yes, what about filing a new one, given that this issue is about --enable-debug (not -debug), and most comments here may be about a recently fixed issue (https://github.com/bitcoin/bitcoin/issues/27623) rather than about --enable-debug performance?
  31. kroese commented at 7:56 pm on May 11, 2023: none

    I’m also seeing the same issue. The load average suddenly increased from 0.25 to 1+. From the graph it’s clear it started on the 6th of May:

    [graph: system load average over the past week]

  32. ArmchairCryptologist commented at 5:28 am on May 12, 2023: none

    I presume this is still an issue?

    I haven’t seen any notable issues since the BRC-20 spam died down a few days ago; one node with 412 incoming connections is currently sitting at ~16% average CPU usage.

    I’m also seeing the same issue. The load average suddenly increased from 0.25 to 1+.

    You probably need to restart the node to clear it up after it’s started spinning; a few days ago I had one node still spinning even after dropping to <80 connections due to timeouts/disconnects.

  33. kroese commented at 10:12 am on May 12, 2023: none

    You probably need to restart the node to clear it up after it’s started spinning

    You were right: restarting it dropped CPU usage from 100 percent to about 3 percent immediately. I hope it will stay this way, as rc3 (which should fix it) still hasn’t been released.

  34. ajtowns commented at 4:19 am on May 19, 2023: contributor

    Maybe we can move the boost option into a new --enable-slow-debug option, and just enable that for fuzzing and perhaps one of the CI tasks?

     diff --git a/configure.ac b/configure.ac
     index cbbf8d6172..fbf7a72e7b 100644
     --- a/configure.ac
     +++ b/configure.ac
     @@ -306,6 +306,13 @@ AC_ARG_ENABLE([debug],
          [enable_debug=$enableval],
          [enable_debug=no])
      
     +dnl Enable slow debug
     +AC_ARG_ENABLE([slow-debug],
     +    [AS_HELP_STRING([--enable-slow-debug],
     +                    [use compiler flags and macros suited for debugging that may cause slowdown (default is no)])],
     +    [enable_slow_debug=$enableval],
     +    [enable_slow_debug=no])
     +
      dnl Enable different -fsanitize options
      AC_ARG_WITH([sanitizers],
          [AS_HELP_STRING([--with-sanitizers],
     @@ -1482,7 +1489,7 @@ if test "$use_boost" = "yes"; then
        AX_CHECK_PREPROC_FLAG([-DBOOST_NO_CXX98_FUNCTION_BASE], [BOOST_CPPFLAGS="$BOOST_CPPFLAGS -DBOOST_NO_CXX98_FUNCTION_BASE"], [], [$CXXFLAG_WERROR],
                              [AC_LANG_PROGRAM([[#include <boost/config.hpp>]])])
      
     -  if test "$enable_debug" = "yes" || test "$enable_fuzz" = "yes"; then
     +  if (test "$enable_debug" = "yes" && test "$enable_slow_debug" = "yes") || test "$enable_fuzz" = "yes"; then
          BOOST_CPPFLAGS="$BOOST_CPPFLAGS -DBOOST_MULTI_INDEX_ENABLE_SAFE_MODE"
        fi
    
  35. maflcko commented at 6:15 am on May 19, 2023: member
    See also #27300#pullrequestreview-1352821536
  36. fanquake closed this on May 29, 2023

  37. BlackcoinDev referenced this in commit fb4f047686 on Feb 5, 2024
  38. bitcoin locked this on May 28, 2024
