log: Mitigate disk filling attacks by temporarily rate limiting LogPrintf(…)

practicalswift commented at 5:47 pm on September 22, 2020: contributor

Mitigate disk filling attacks by temporarily rate limiting LogPrintf(…).

A disk fill attack is an attack where an untrusted party (such as a peer) is able to cheaply make your node log to disk excessively. The excessive logging may fill your disk and thus make your node crash either cleanly (best case: if disk fill rate is relatively slow) or uncleanly (worst case: if disk fill rate is relatively fast).

The hourly logging quota is set per source location. Every single source location (say net_processing.cpp:418) gets a quota of 1 MB of logging per hour.

Notes:

Only logging to disk is rate limited. Logging to console is not rate limited.
Only LogPrintf(…) is rate limited. LogPrint(category, …) (-debug) is not rate limited.
UpdateTip: new best=[…] is logged using LogPrintfWithoutRateLimiting(…) to avoid rate limiting. High log volume is expected for that source location during IBD.

Live demo:

 0$ git diff
 1diff --git a/src/init.cpp b/src/init.cpp
 2index 023aa9aba..56a250876 100644
 3--- a/src/init.cpp
 4+++ b/src/init.cpp
 5@@ -1232,6 +1232,13 @@ bool AppInitInterfaces(NodeContext& node)
 6     return true;
 7 }
 8 
 9+void SimulateDiskFillingAttack() {
10+    uint64_t n = 0;
11+    while (true) {
12+        LogPrintf("Chancellor on brink of %d:th bailout for banks. LogPrintf on brink of first disk fill for node.\n", ++n);
13+    }
14+}
15+
16 bool AppInitMain(const util::Ref& context, NodeContext& node, interfaces::BlockAndHeaderTipInfo* tip_info)
17 {
18     const ArgsManager& args = *Assert(node.args);
19@@ -1845,6 +1852,8 @@ bool AppInitMain(const util::Ref& context, NodeContext& node, interfaces::BlockA
20     }
21 #endif
22 
23+    std::thread{SimulateDiskFillingAttack}.detach();
24+
25     std::vector<fs::path> vImportFiles;
26     for (const std::string& strFile : args.GetArgs("-loadblock")) {
27         vImportFiles.push_back(strFile);
28$ make -C src/ bitcoind
29$ src/bitcoind
30$ tail debug.log
312020-09-22T16:11:36Z Chancellor on brink of 8813:th bailout for banks. LogPrintf on brink of first disk fill for node.
322020-09-22T16:11:36Z Chancellor on brink of 8814:th bailout for banks. LogPrintf on brink of first disk fill for node.
332020-09-22T16:11:36Z Chancellor on brink of 8815:th bailout for banks. LogPrintf on brink of first disk fill for node.
342020-09-22T16:11:36Z Chancellor on brink of 8816:th bailout for banks. LogPrintf on brink of first disk fill for node.
352020-09-22T16:11:36Z Chancellor on brink of 8817:th bailout for banks. LogPrintf on brink of first disk fill for node.
362020-09-22T16:11:36Z Chancellor on brink of 8818:th bailout for banks. LogPrintf on brink of first disk fill for node.
372020-09-22T16:11:36Z Chancellor on brink of 8819:th bailout for banks. LogPrintf on brink of first disk fill for node.
382020-09-22T16:11:36Z Chancellor on brink of 8820:th bailout for banks. LogPrintf on brink of first disk fill for node.
392020-09-22T16:11:36Z Excessive logging detected from init.cpp:1238 (SimulateDiskFillingAttack): >1 MiB logged during the last hour. Suppressing logging to disk from this source location for up to one hour. Console logging unaffected. Last log entry: 2020-09-22T16:11:36Z Chancellor on brink of 8821:th bailout for banks. LogPrintf on brink of first disk fill for node.
402020-09-22T16:12:37Z Adding fixed seed nodes as DNS doesn't seem to be available.
41$ ls -hl debug.log
42–rw------- 1     1.1M Sep 22 16:12 debug.log

DrahtBot commented at 6:55 pm on September 22, 2020: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#19809 (log: Prefix log messages with function name if -logfunctionnames is set by practicalswift)
#16673 (Relog configuration args on debug.log rotation by LarryRuane)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot added the label Utils/log/libs on Sep 22, 2020

DrahtBot added the label Validation on Sep 22, 2020

fanquake removed the label Validation on Sep 22, 2020

naumenkogs commented at 7:25 am on September 23, 2020: member

Concept ACK. I was wondering about this threat several times over last year, but never managed to look into it for real.

laanwj commented at 12:04 pm on September 23, 2020: member

I’m not sure about this. This kind of “drop everything” rate limiting can be used for a different kind of DoS that might be even more serious than just shutting down: prevent important things from being logged. Edit: I misunderstood the code here, it tracks per source location. Concept ACK.

practicalswift renamed this:
~~log: Mitigate disk filling attacks by rate limiting LogPrintf(…)~~
log: Mitigate disk filling attacks by temporarily rate limiting LogPrintf(…)
on Sep 23, 2020

practicalswift commented at 1:14 pm on September 23, 2020: contributor

@laanwj

Have you looked at the implementation? It certainly doesn’t drop everything :)

The implementation was carefully written to minimise the risk of suppressing something important from being logged while at the same time killing the disk-fill-via-peer-triggered-logging bug class for good.

Do you have any ideas on how to improve the implementation in a way to reduce the “prevent important things from being logged” risk?

Perhaps this implementation isn’t the way to go but I really think we need some kind of general mitigation for disk fill attacks. This is a non-theoretical DoS vector which has been problematic for Bitcoin Core historically.

naumenkogs commented at 1:56 pm on September 23, 2020: member

I don’t know how hard of a task that would be, but it would be great to see how easy it is to fill disk space depending on the protocol piece (message but also context). For example, tx INV: X bytes per second ADDR: Y bytes per second GETADDR: Z bytes per second

From there we could make sure our rate-limiting is effective?

practicalswift commented at 2:21 pm on September 23, 2020: contributor

@naumenkogs

When testing this patch I reverted the fix for an historic disk fill bug to see if this mitigation would have been successful.

In that case an untrusted peer could achieve a disk fill rate of 15 GB/hour without this mitigation, and the expected 1 MB/hour with this mitigation. (The 15 GB/hour was achieved on a really old machine with mechanical disks so YMMV.)

In other words: the mitigation slowed that specific attack by a factor of 15 000 :)

Note that the quota is set per source location. So every single source location (say net_processing.cpp:418) gets a quota of 1 MB of logging per hour. I think we could bump that to 10 MB/hour or even 100 MB/hour and still largely kill this bug class.

Perhaps I misunderstood what you meant by the INV/ADDR/GETADDR bytes per second thing, but we only log per such message via LogPrint(CATEGORY, …) (-debug). LogPrint(CATEGORY, …) is intentionally not covered by the rate limiting, since end-users opt-ing in to -debug are assumed to know that doing so may cause excessive disk usage :)

The goal of this PR is to kill this bug class for a bitcoind node running with default settings (no -debug).

in src/logging.cpp:266 in 88995e53bd outdated

261+        if (m_bytes_logged_per_source_location[source_location] > HOURLY_LOG_QUOTA_IN_BYTES_PER_SOURCE_LOCATION) {
262+            const bool print_quota_state_change_message = (m_bytes_logged_per_source_location[source_location] - str_prefixed.size()) <= HOURLY_LOG_QUOTA_IN_BYTES_PER_SOURCE_LOCATION;
263+            if (print_quota_state_change_message) {
264+                str_prefixed = LogTimestampStr(strprintf("Excessive logging detected from %s (%s): >%d MiB logged during the last hour. Suppressing logging to disk from this source location for up to one hour. Console logging unaffected. Last log entry: %s", source_location, logging_function, HOURLY_LOG_QUOTA_IN_BYTES_PER_SOURCE_LOCATION / (1024 * 1024), str_prefixed));
265+            } else {
266+                skip_writing_to_disk_due_to_rate_limiting = true;

laanwj commented at 1:47 pm on September 24, 2020:

Oh, sorry, I thought this was a global flag.

in src/logging.h:182 in 88995e53bd outdated

178@@ -173,17 +179,31 @@ static inline void LogPrintf(const char* fmt, const Args&... args)
179             /* Original format string will have newline so don't add one here */
180             log_msg = "Error \"" + std::string(fmterr.what()) + "\" while formatting log message: " + fmt;
181         }
182-        LogInstance().LogPrintStr(log_msg);
183+        const std::string source_location = source_file + ":" + ToString(source_line);

laanwj commented at 1:49 pm on September 24, 2020:

We could save quite some memory and avoid having to build a string here by indexing with a (const char*, int) tuple. __FUNC__ names end up in the .text segment after all so a pointer will do.

practicalswift commented at 8:43 pm on September 24, 2020:

Fixed!

in src/logging.h:72 in 88995e53bd outdated

68@@ -67,6 +69,8 @@ namespace BCLog {
69         FILE* m_fileout GUARDED_BY(m_cs) = nullptr;
70         std::list<std::string> m_msgs_before_open GUARDED_BY(m_cs);
71         bool m_buffering GUARDED_BY(m_cs) = true; //!< Buffer messages before logging can be started.
72+        std::map<std::string, uint64_t> m_bytes_logged_per_source_location GUARDED_BY(m_cs);

laanwj commented at 1:50 pm on September 24, 2020:

Maybe unordered_map? we have no ordering constraing here.

practicalswift commented at 8:43 pm on September 24, 2020:

Fixed!

practicalswift force-pushed on Sep 24, 2020

practicalswift commented at 8:44 pm on September 24, 2020: contributor

@laanwj Thanks for the Concept ACK! :)

Feedback addressed.

practicalswift force-pushed on Sep 24, 2020

practicalswift force-pushed on Sep 25, 2020

log: Mitigate disk filling attacks by rate limiting LogPrintf a86ab57110

practicalswift force-pushed on Sep 25, 2020

in src/logging.h:207 in a86ab57110

203+
204+// Unconditional logging WITHOUT rate limiting. Use only for log messages that
205+// MUST NOT be rate limited no matter how often they are logged. That requirement
206+// should be extremely rare, so please use with care. Prefer LogPrintf(...) if
207+// possible.
208+#define LogPrintfWithoutRateLimiting(...) LogPrintf_(__func__, __FILE__, __LINE__, /* skip_disk_usage_rate_limiting */ true, __VA_ARGS__)

laanwj commented at 12:53 pm on September 30, 2020:

This name is really awkward. I would prefer a more compact name, but don’t really have any good suggestions.

practicalswift commented at 6:18 pm on September 30, 2020:

Heh, I agree :) Naming is hard! I’m happy to change the name of course.

I expect LogPrintfWithoutRateLimiting to be used once: for UpdateTip which is the only non-category log message which we ever expect to legitimately log at a rate of >1 MB/hour (or whatever hourly source-location threshold we choose).

One alternative would be to call LogPrintf_ directly from UpdateTip as it is a one-off. That would solve the naming problem too :)

laanwj commented at 1:30 pm on September 30, 2020: member

0$ size -G src/bitcoind*
1      text       data        bss      total filename
2   6454125    1929533      38929    8422587 src/bitcoind
3   6564925    1942029      38929    8545883 src/bitcoind.19995

FWIW, this results in a 120 kB increase in stripped binary size for bitcoind on x86_64. In contrast to what I first thought, this is not because of the __FILE__. There’s only 12 kB more in strings. The rest is extra instructions in .text to pass the line number to every logging call.

(still concept ACK though, and thanks for addressing my comments, code changes LGTM now)

in src/logging.cpp:258 in a86ab57110

253+    bool skip_writing_to_disk_due_to_rate_limiting{false};
254+    if (!skip_disk_usage_rate_limiting) {
255+        const std::chrono::seconds now = GetTime<std::chrono::seconds>();
256+        if ((now - m_last_reset_of_bytes_logged_per_source_location) > std::chrono::hours{1}) {
257+            m_bytes_logged_per_source_location.clear();
258+            m_last_reset_of_bytes_logged_per_source_location = now;

jnewbery commented at 10:11 am on October 9, 2020:

It’d be good to check here whether logging was suppressed for this log, and if so log that logging is restarting for this log (eg. “Restarting logging from %s …”)

practicalswift commented at 1:09 pm on October 13, 2020:

That’s a good idea. I’ll implement!

in src/logging.cpp:266 in a86ab57110

261+        if (m_bytes_logged_per_source_location[source_location] > HOURLY_LOG_QUOTA_IN_BYTES_PER_SOURCE_LOCATION) {
262+            const bool print_quota_state_change_message = (m_bytes_logged_per_source_location[source_location] - str_prefixed.size()) <= HOURLY_LOG_QUOTA_IN_BYTES_PER_SOURCE_LOCATION;
263+            if (print_quota_state_change_message) {
264+                str_prefixed = LogTimestampStr(strprintf("Excessive logging detected from %s:%d (%s): >%d MiB logged during the last hour. Suppressing logging to disk from this source location for up to one hour. Console logging unaffected. Last log entry: %s", source_location.first, source_location.second, logging_function, HOURLY_LOG_QUOTA_IN_BYTES_PER_SOURCE_LOCATION / (1024 * 1024), str_prefixed));
265+            } else {
266+                skip_writing_to_disk_due_to_rate_limiting = true;

jnewbery commented at 10:14 am on October 9, 2020:

Consider keeping a tally of how many log messages were dropped, and then logging that number when you restart logging (eg "Restarting logging from <location>. <number> log messages dropped since <time>")

practicalswift commented at 1:09 pm on October 13, 2020:

Makes sense. I’ll implement.

jnewbery commented at 10:22 am on October 9, 2020: member

Concept ACK.

Log shedding needs to be done very carefully, since logs are usually the best data we have when diagnosing user issues. I’m not convinced that shedding logs per location is a good idea. Partial logs are often worse than no logs at all and can be very misleading (“it logged this and then it didn’t log this, so it can’t have taken this code branch”). Even with a special “Suppressing logging to disk from this source location” log, it’s very easy to be misled unless you know exactly what you’re looking for.

I think I’d prefer log-shedding to be done globally, with a special log at the end of a log suppression period that tallies up which log locations have been suppressed.

MarcoFalke commented at 11:48 am on October 13, 2020: member

Conceptually, I am not sure if it is good to fight the symptoms. I’d prefer if unconditional logs were only used for the init/shutdown sequence and local system errors, such as data corruption. Anything else the average user probably doesn’t care about, and if they did, they could enable the corresponding debug category and provide enough disk space for the debug log file.

practicalswift commented at 1:06 pm on October 13, 2020: contributor

@MarcoFalke

Conceptually, I am not sure if it is good to fight the symptoms.

I fail to see how introducing a mitigation would be to “fight the symptoms”.

The reason we use ASLR for example isn’t that we’ve given up on the ambition to write non-vulnerable code :)

The reason we use mitigations is to make exploitation harder in case we fail to write non-vulnerable free code, and thus hopefully turn a vulnerability into a normal bug.

Have you looked at the implementation in this PR? :)

The mitigation logic would only kick in in under extreme circumstances meaning that the only observable difference from a user perspective is that his/her node would be exploitable to disk fill attacks without the mitigation in place, whereas he/she would be safe with it.

Note that this is not a theoretical issue: we’ve had exploitable disk filling vulnerabilities in the past which this mitigation would have successfully turned into just another non-exploitable non-security bug.

I’d prefer if unconditional logs were only used for the init/shutdown sequence and local system errors, such as data corruption. Anything else the average user probably doesn’t care about, and if they did, they could enable the corresponding debug category and provide enough disk space for the debug log file.

I agree that we should log more conservatively and ideally leave no room for disk filling attacks, but that doesn’t preclude also protecting our users in case we fail to do so :)

luke-jr commented at 5:23 pm on October 24, 2020: member

Can we get a way to disable this via config file?

practicalswift commented at 10:18 pm on December 27, 2020: contributor

Could a maintainer mark this PR as “Up for grabs”? I’m afraid I won’t have time to incorporate the suggested changes any time soon.

FWIW I still think this mitigation is very much worth doing: the “up for grabs” state is purely due to lack of time on my part. I’d be glad to review any work being done to reduce the risk of disk fill attacks.

amitiuttarwar added the label Up for grabs on Dec 28, 2020

fanquake closed this on Dec 28, 2020

fanquake removed the label Up for grabs on Apr 7, 2021

fanquake commented at 6:32 am on April 7, 2021: member

Picked up in #21603.

practicalswift deleted the branch on Apr 10, 2021

DrahtBot locked this on Aug 18, 2022

log: Mitigate disk filling attacks by temporarily rate limiting LogPrintf(…) #19995

Conflicts