Immediately disconnect on invalid net message checksum

jonasschnelli commented at 9:16 pm on January 18, 2019: contributor

Currently, messages with invalid checksums will be partially tolerated (skipped). The checksum check happens after deserialising all messages in the current read buffer.

This PR moves the checksum check to the network/transport layer (where it probably belongs) and rejects messages with invalid checksums immediately and disconnects the peer.

This PR separates the transport from the processing layer and helps protocol upgrades like BIP151.

jonasschnelli added the label P2P on Jan 18, 2019

jonasschnelli renamed this:
~~p2p: immediately disconnect on invalid net message checksum~~
Immediately disconnect on invalid net message checksum
on Jan 18, 2019

DrahtBot commented at 10:37 pm on January 18, 2019: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#19107 (p2p: Refactor, move all header verification into the network layer, without changing behavior by troygiorshev)
#19053 (refactor: replace CNode pointers by references within net_processing.{h,cpp} by theStack)
#18638 (net: Use mockable time for ping/pong, add tests by MarcoFalke)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

laanwj commented at 4:29 pm on January 19, 2019: member

Concept ACK

gmaxwell commented at 5:42 pm on January 19, 2019: contributor

I don’t the deferred processing matters– the checkvalue is protecting against network corruption not malicious input or anything like that, but I also don’t think that doing it eagerly is harmful either.

One consequence of this is that it moves more processing into the networking thread, rather than leaving it in the message processing thread. As a result it may result in lower performance in cases where hashing was a bottleneck, but I’m doubtful the slowdown will matter much anywhere.

sipa commented at 5:46 pm on January 19, 2019: member

@gmaxwell It actually doesn’t change much, as the message hash is being computed on the fly as the message packets come in. The GetMessageHash call only does the SHA256 finalization.

gmaxwell commented at 6:07 pm on January 19, 2019: contributor

@sipa I was aware, but for many messages (e.g. small ones) the finalization is the hashing.

TheBlueMatt commented at 9:37 pm on January 19, 2019: member

“The checksum check happens after deserialising all messages in the current read buffer.” <– No, this has not been the case since #9045. We hash as we receive data, only the finalization (ie last compression round and second hash) happen after we’ve received the full message. Don’t really have a strong feeling about this, it does move a tiny big of CPU cost over to the net thread from net_processing, which is nice, though.

jonasschnelli commented at 6:11 am on January 20, 2019: contributor

@TheBlueMatt: thanks for the clarification.

The intention of this PR is mainly to move the transportation protocol out of the message processing… and slowly moves towards a CNetMessage that is transport logic agnostic. With something like BIP150, we will have a different checksum (Poly1305 in the case of BIP151) and a different header.

I think CNetMessage should at one point only carry the payload and the command (and eventually the recv. time). The serialisation as well as optional encryption/decryption plus the checksum should happen outside of CNetMessage.

in src/net.cpp:866 in 2bf4bf76e6 outdated

858@@ -859,6 +859,18 @@ int CNetMessage::readData(const char *pch, unsigned int nBytes)
859     memcpy(&vRecv[nDataPos], pch, nCopy);
860     nDataPos += nCopy;
861 
862+    if (complete()) {
863+        const uint256& hash = GetMessageHash();
864+        if (memcmp(hash.begin(), hdr.pchChecksum, CMessageHeader::CHECKSUM_SIZE) != 0)
865+        {
866+            LogPrint(BCLog::NET, "%s(%s, %u bytes): CHECKSUM ERROR expected %s was %s\n", __func__,

Sjors commented at 3:43 pm on January 23, 2019:

Suggest adding “, disconnecting” to this message, consistent with the other disconnection scenario in CNode::ReceiveMsgBytes.

jonasschnelli commented at 7:34 pm on January 24, 2019:

Fixed

in src/net.cpp:686 in 2bf4bf76e6 outdated

865+        {
866+            LogPrint(BCLog::NET, "%s(%s, %u bytes): CHECKSUM ERROR expected %s was %s\n", __func__,
867+               SanitizeString(hdr.pchCommand), hdr.nMessageSize,
868+               HexStr(hash.begin(), hash.begin()+CMessageHeader::CHECKSUM_SIZE),
869+               HexStr(hdr.pchChecksum, hdr.pchChecksum+CMessageHeader::CHECKSUM_SIZE));
870+            return -1;

Sjors commented at 3:45 pm on January 23, 2019:

I see that CNode::ReceiveMsgBytes returns false if readData doesn’t return 0, and that in turn CConnman::SocketHandler() calls pnode->CloseSocketDisconnect() if that returned false. This feels a bit brittle though, but maybe that’s just because I’m not super familiar with this part of the codebase.

jonasschnelli commented at 7:26 pm on January 24, 2019:

It’s the same path of closing a connection then if a remote peer would send an oversize message (or an invalid header that can’t be deserialized). I think it’s acceptable,

jonasschnelli force-pushed on Jan 24, 2019

MarcoFalke commented at 3:43 pm on January 31, 2019: member

Needs rebase (for sanity checking against the merged 36aeb43c01d250d99cfffdfbb70d2420b70054cc)

jonasschnelli force-pushed on Jan 31, 2019

jonasschnelli commented at 7:45 pm on January 31, 2019: contributor

Rebased (required for now merged #15246). Adapted test.

Sjors commented at 10:20 am on February 1, 2019: member

p2p_invalid_messages.py test passes on macOS 10.14.3 for the newly rebased e7ab7d7. It failed on one of the Travis machines though.

jonasschnelli commented at 11:48 pm on February 1, 2019: contributor

p2p_invalid_messages.py test passes on macOS 10.14.3 for the newly rebased e7ab7d7. It failed on one of the Travis machines though.

Indeed. I think it is because of the immediate termination of the connection which somehow confuses mininode’s getdata. I’ll investigate.

MarcoFalke closed this on Feb 3, 2019

MarcoFalke reopened this on Feb 3, 2019

gmaxwell commented at 4:56 am on February 3, 2019: contributor

Should we also worry about this breaking networking for nodes behind idiot firewalls that corrupt certain strings? We know they exist because they occasionally cause nodes to be unable to fetch certain blocks and other such sillyness. Right now if the stupidity that triggers them is in an ephemeral message (like a ping, inv, or an addr message) it’ll just get dropped, logged, and move on.

NicolasDorier commented at 5:37 am on February 3, 2019: contributor

I always wondered. Would the checksum could just be skipped at all?

If I understand what is being said here, the only reason for having the checksum is so people have a clearer message in logs if the message get corrupted by firewall?

gmaxwell commented at 7:57 pm on February 3, 2019: contributor

@NicolasDorier It could be but it would generally be a bad idea, because then nodes would operate with silent corruption in some cases. In particular, a common failure mode (relatively speaking) is for firewalls rewrite addresses and port numbers, corrupting version handshakes and addr messages. It was more of a concern before we started relaxing banning behaviour, but even without that, no one should have to waste countless hours troubleshooting a really dumb issue like a non-transparent firewall. Doubly so because protocol conformance requires encoding it anyways.

(plus, there is simply no reason to do so, the time they take is essentially negligible compared to other activities, and if we wanted to reduce it the authentication in the encrypted transport is faster (and thoroughly not optional)…)

DrahtBot added the label Needs rebase on Oct 9, 2019

jonasschnelli force-pushed on May 7, 2020

jonasschnelli commented at 7:15 am on May 7, 2020: contributor

Rebased

DrahtBot removed the label Needs rebase on May 7, 2020

MarcoFalke commented at 12:15 pm on May 7, 2020: member

I think there have been concerns that a bit flip on the wire will now cause the connection to be dropped instead of the msg that has the bit flip. How was this addressed?

jonasschnelli commented at 12:27 pm on May 7, 2020: contributor

I think there have been concerns that a bit flip on the wire will now cause the connection to be dropped instead of the msg that has the bit flip. How was this addressed?

No. That’s not addressed. I didn’t see that concern. Was there a public discussion?

MarcoFalke commented at 1:17 pm on May 7, 2020: member

Oh, that is how I read the previous discussion in this pull request. Maybe I am missing something obvious or read the code wrong.

jonatack commented at 12:04 pm on May 10, 2020: member

I think there have been concerns that a bit flip on the wire will now cause the connection to be dropped instead of the msg that has the bit flip. How was this addressed?

No. That’s not addressed. I didn’t see that concern. Was there a public discussion?

I think @MarcoFalke is referring to #15206 (comment) above (that is how I’m reading it, at least).

in src/net.cpp:678 in 1d9bc6a8c5 outdated

673@@ -674,6 +674,18 @@ int V1TransportDeserializer::readData(const char *pch, unsigned int nBytes)
674     memcpy(&vRecv[nDataPos], pch, nCopy);
675     nDataPos += nCopy;
676 
677+    if (Complete()) {

jonatack commented at 12:38 pm on May 10, 2020:

It may be good to add a comment here, along the lines of

0     if (Complete()) {
1+        // Verify message has valid checksum. If invalid, reject and disconnect the peer.

narula commented at 4:54 pm on May 20, 2020:

Is there a particular reason to check Complete() and do this here instead of in the complete check in ReceiveMsgBytes()?

jkczyz commented at 7:06 pm on May 20, 2020:

In the future, I’d like to see TransportDeserializer have a stateless interface such that it only produces valid messages or otherwise returns an appropriate error; see #16202 (comment). Not sure if @jonasschnelli agrees with me.

In that world, all validity checks would remain within the deserializer. Then CNetMessage would only need to represent valid messages so all its m_valid_* fields could be removed.

jnewbery commented at 1:36 am on May 22, 2020:

@jkczyz I don’t think that it’s possible to have a stateless interface for TransportDeserializer. When we call into Read(pch, nBytes) the bytes that we’ve read from the socket and are passing in may not constitute the full message. That’s why TransportDeserializer is stateful and keeps track of partially received headers/messages in hdrbuf/vRecv.

I agree with @narula that the error-handing should be done in ReceiveMsgBytes(). That’s partially because it’s neater: Read(), readHeader() and readData() could all be changed to return an error enum or object, and error-handling can all happen in one place. I also think that logging needs to happen within CNode for it to be meaningful. Without logging the node id, this log message is pretty useless.

jkczyz commented at 5:32 pm on May 29, 2020:

@jkczyz I don’t think that it’s possible to have a stateless interface for TransportDeserializer. When we call into Read(pch, nBytes) the bytes that we’ve read from the socket and are passing in may not constitute the full message. That’s why TransportDeserializer is stateful and keeps track of partially received headers/messages in hdrbuf/vRecv.

IIRC, my linked comment was alluding to the fact that additional refactoring may be needed to accomplish this. For instance, decoupling reading bytes from the wire from deserializing those read bytes.

jonatack commented at 12:46 pm on May 10, 2020: member

ACK 1d9bc6a8c52f33bb34c74f34169100e18f8e1784 modulo the concern expressed in #15206 (comment) which I don’t feel competent to evaluate. The concept seems good and the implementation yields simpler code. Built, all tests green locally, verified the functional test fails without the change, ran bitcoind with sanity check logging added to V1TransportDeserializer::readData.

rajarshimaitra commented at 7:31 am on May 18, 2020: contributor

The refactoring part seems like a good option. Though I do not have enough expertise to evaluate the proper context, I will look deeper into further review.

Currently Concept ACK apart from the #15206(comment). There seems to remain a threat of an honest node being eclipsed from the network if an attacker manages to keep corrupting its incoming data. Or I am missing something.
I am also curious on the motivation behind such aggressive connection dropping behavior.

jonasschnelli commented at 11:52 am on May 18, 2020: contributor

There seems to remain a threat of an honest node being eclipsed from the network if an attacker manages to keep corrupting its incoming data.

If an attacker can corrupt incoming data, there are plenty of other ways to force a disconnect.

I am also curious on the motivation behind such aggressive connection dropping behavior.

It seems to be wasteful to allow other peers to send endless amounts of messages containing an invalid checksum. The operators peer may need to calculate tons of sha256 hashes at almost no cost for the remote peer.

MarcoFalke added the label Review club on May 18, 2020

MarcoFalke commented at 4:13 pm on May 18, 2020: member

Might be interesting for reviewers: https://bitcoincore.reviews/15206.html

rajarshimaitra commented at 5:05 pm on May 18, 2020: contributor

Concept ACK after this comment.

MarcoFalke commented at 5:12 pm on May 18, 2020: member

The operators peer may need to calculate tons of sha256 hashes at almost no cost for the remote peer.

I am pretty sure the same is true for payloads that have a valid checksum. E.g. an unknown message type. The message is hashed, but then dropped without any adverse effects on the peer that sent it. So I think this should not be used as an argument to justify the change in this pull request.

jonasschnelli commented at 5:38 pm on May 18, 2020: contributor

I am pretty sure the same is true for payloads that have a valid checksum. E.g. an unknown message type.

Unknown messages require the remote peer to calculate at least a valid sha256 hash.

While random 4byes are enough to trigger a remote dbl-sha256 operation.

Arguing that there are other places where a remote peer can drain endless resources with almost no cost should bring little weight to this discussion. Ideally we look at this isolated wether it’s worth fixing/changing its behavior.

I though agree that it is debatable if we drop or tolerate connections sending messages with invalid checksums.

MarcoFalke commented at 5:42 pm on May 18, 2020: member

I though agree that it is debatable if we drop or tolerate connections sending messages with invalid checksums.

I’d be interested to know if it is possible to implement BIP151 without the behaviour change here? If yes, what are the trade-offs?

jonasschnelli commented at 5:57 pm on May 18, 2020: contributor

I’d be interested to know if it is possible to implement BIP151 without the behaviour change here? If yes, what are the trade-offs?

Absolutely. There is no need to change this behaviour to implement BIP324. In the V2 proposal, we have a MAC (poly1305) instead of a sha256 checksum. There, I’d strongly suggest we drop the connection on an invalid MAC (which is implemented in #18242).

MarcoFalke commented at 6:06 pm on May 18, 2020: member

Ok, so the motivation of this change is to bring the legacy behaviour in line with BIP324? If yes, it could make sense to mention that motivation in the pull request description.

troygiorshev commented at 5:59 pm on May 19, 2020: contributor

ACK 1d9bc6a. Reviewed the changes, built, and ran the tests.

Full agreement on moving the checksum verification to the net layer.

Agreeing with this comment that we should keep verifying the checksum, for the same reasons. Removing the checksum seems out of scope for this PR.

I don’t think we have consensus on what the purpose of this checksum is. It was added way back in protocol version 209 but I haven’t been able to find the original discussions around it (or any since). A clearer picture of why the checksum exists should help justify what the behavior should be upon a checksum verification failure.

With all of that said, I’m indifferent on the change of behavior. It will be changed when implementing BIP324, so that seems like reason enough to do it now and see if any unanticipated problems arise.

troygiorshev approved

in src/net.cpp:686 in 1d9bc6a8c5 outdated

680+        {
681+            LogPrint(BCLog::NET, "CHECKSUM ERROR (%s, %u bytes), expected %s was %s, disconnecting\n",
682+               SanitizeString(hdr.pchCommand), hdr.nMessageSize,
683+               HexStr(hash.begin(), hash.begin()+CMessageHeader::CHECKSUM_SIZE),
684+               HexStr(hdr.pchChecksum, hdr.pchChecksum+CMessageHeader::CHECKSUM_SIZE));
685+            return -1;

ariard commented at 5:58 am on May 20, 2020:

If we go forward with disconnecting on an invalid checksum do you think it’s worthy to add a release note for this ? We can’t evaluate how many nodes may be affected by a tampering firewall, but at least if node operators experience unexpected disconnect they are aware of a possible reason. And we would know what to revert.

glowang commented at 4:27 pm on May 20, 2020:

Out of curiousity, I l did some googling on payload tampering firewall & checksum, but I couldn’t find anything that seems relevant. Could you please point me to some resource to read more about it? Sounds very interesting :)

ariard commented at 5:11 pm on May 20, 2020:

With regards to firewall, rewriting some outgoing messages could be a NAT implementation, you can read more about this topic here : https://tools.ietf.org/html/rfc3022

glowang commented at 8:17 pm on May 20, 2020:

Thanks a lot!! 🙏

ariard commented at 6:20 am on May 20, 2020: member

I don’t see the rational to why we have to align behavior with BIP 324, its MAC has a completely different semantic, i.e a verification failure would be a hint of an active tampering at the infrastructure level ? I wouldn’t qualify honest a firewall/router melding with encrypted traffic.

I agree that a near-sender infrastructure attacker can already get the hosted peer disconnected. But for normal operations, if we have concern of unexpected behavior, why to take the risk ?

EDIT: disconnect, not ban, it’s less worrying

in src/net.h:619 in 1d9bc6a8c5 outdated

612@@ -613,7 +613,6 @@ class CNetMessage {
613     int64_t m_time = 0;                  // time (in microseconds) of message receipt.
614     bool m_valid_netmagic = false;
615     bool m_valid_header = false;
616-    bool m_valid_checksum = false;

MarcoFalke commented at 2:09 pm on May 20, 2020:

We already disconnect on invalid netmagic (which might also be due to a bit-flip), so disconnecting here is probably fine. However, I think for cleaner code, the two seem related enough to be treated in the same way (either both in net_processing or both in net). No strong opinion where, but for example this is the diff if they are kept in net_processing:

 0diff --git a/src/net.cpp b/src/net.cpp
 1index 9950b9aea4..9e213547fa 100644
 2--- a/src/net.cpp
 3+++ b/src/net.cpp
 4@@ -597,6 +597,11 @@ bool CNode::ReceiveMsgBytes(const char *pch, unsigned int nBytes, bool& complete
 5             vRecvMsg.push_back(std::move(msg));
 6 
 7             complete = true;
 8+
 9+            if (!msg.m_valid_checksum) {
10+                // return early and notify net processing
11+                return true;
12+            }
13         }
14     }
15 
16diff --git a/src/net_processing.cpp b/src/net_processing.cpp
17index 6d85b46831..a5606dfe66 100644
18--- a/src/net_processing.cpp
19+++ b/src/net_processing.cpp
20@@ -3499,6 +3499,13 @@ bool PeerLogicValidation::ProcessMessages(CNode* pfrom, std::atomic<bool>& inter
21     }
22     CNetMessage& msg(msgs.front());
23 
24+    if (!msg.m_valid_checksum) {
25+        LogPrint(BCLog::NET, "%s(%s, %u bytes): CHECKSUM ERROR peer=%d\n",
26+            __func__, SanitizeString(msg.m_command), msg.m_message_size, pfrom->GetId());
27+        pfrom->fDisconnect = true;
28+        return false;
29+    }
30+
31     msg.SetVersion(pfrom->GetRecvVersion());
32     // Check network magic
33     if (!msg.m_valid_netmagic) {
34@@ -3518,15 +3525,7 @@ bool PeerLogicValidation::ProcessMessages(CNode* pfrom, std::atomic<bool>& inter
35     // Message size
36     unsigned int nMessageSize = msg.m_message_size;
37 
38-    // Checksum
39     CDataStream& vRecv = msg.m_recv;
40-    if (!msg.m_valid_checksum)
41-    {
42-        LogPrint(BCLog::NET, "%s(%s, %u bytes): CHECKSUM ERROR peer=%d\n", __func__,
43-           SanitizeString(msg_type), nMessageSize, pfrom->GetId());
44-        return fMoreWork;
45-    }
46-
47     // Process message
48     bool fRet = false;
49     try

jkczyz commented at 6:35 pm on May 20, 2020:

Agree that message validity checks should all happen in one place. I have a preference to have them in net so that net_processing only sees valid messages.

MarcoFalke commented at 2:11 pm on May 20, 2020: member

Concept ACK 1d9bc6a8c52f33bb34c74f34169100e18f8e1784, some questions about the implementation 🐒

Signature:

 0-----BEGIN PGP SIGNED MESSAGE-----
 1Hash: SHA512
 2
 3Concept ACK 1d9bc6a8c52f33bb34c74f34169100e18f8e1784, some questions about the implementation 🐒
 4-----BEGIN PGP SIGNATURE-----
 5
 6iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
 7pUj4ZAv/UJs8Rz4wN86FJYgQY7AWTCbqSg7jQTq9B/4v1BGhjR5atI1W4xEToYUz
 8+Z0h3RMQhPoShhrMVHZFW8Fsf9LqyZgglxoVkPdnkTOCEw90oVqojKOZR3PBwBB6
 9sgB5apYWlS9vUldyqZM5Jv7ZNuM9RaziV1aLRVJ0xJvwnMgDzZtYgVXnO23Twfv2
104/g9nPgLeoucbAL2iXYFP24f1HQmHrj9LslveZ1xzNguTvWWhy2Q/l8Yx/CVloXS
11EDVYbUiDmG0nCsBCoP+87VqbTOmg8FZ3zvmmId2txhmxtNg50Dv4NY8lsXTzRY5c
12juSq8l5MyKnFY0Y6WaU9mwkUf1tk3CxlaIqW5iHPd87Iooctga5d1v3YFfhjQiXM
13dvaSW35wXAeAjkAXDBIzPR7ZH5T1Qpqd98qSmaXr2qgARhchPJjf8UbBxAEUgdIO
14u2WqNqhdertvUgKOTVNhykrUrgW38jFQtdYbFTgGzYZQkebHPQt8vJh/cghGbVk8
15150Bk/TC
16=aWOk
17-----END PGP SIGNATURE-----

Timestamp of file with hash 9b486e9184d168b333bf13a6f48f2da6400ae323f9390e4095aa458f1d1ec614 -

in src/net.cpp:705 in 1d9bc6a8c5 outdated

703@@ -692,23 +704,14 @@ CNetMessage V1TransportDeserializer::GetMessage(const CMessageHeader::MessageSta
704     // store state about valid header, netmagic and checksum

troygiorshev commented at 3:11 pm on May 20, 2020:

This should be changed to read something like:

0    // store state about valid header and netmagic

troygiorshev commented at 5:45 pm on May 20, 2020: contributor

For reference #15197

jnewbery commented at 3:01 pm on May 21, 2020: member

I agree with moving the checksum checking to the net layer. Disconnecting on checksum failure is probably ok, but I don’t think the refactor and behaviour change need to be in the same PR (and certainly not in the same commit). How about changing this PR to just do the refactor and have a follow-up PR where discussion can be focused on whether to disconnect on checksum failure?

I also agree with @jkczyz’s suggestion here: #15206 (review) which would allow all deserialization error-handling to be done in one place.

MarcoFalke commented at 3:08 pm on May 21, 2020: member

Do you think it is possible to do the refactor without the behaviour change? I believe it is only possible to do the behaviour change without the refactor (see #15206 (review)) but not the other way round.

Happy to be proven wrong

MarcoFalke commented at 3:20 pm on May 21, 2020: member

For reference, I had one checksum error on a google cloud test node for one connection. Haven’t tested other networks or NAT. Full history for that connection: history_peer_id_1849951.txt (2.5 Mb). (The error happened in an inv message after about 15 minutes of being connected. Total connection time was less than a day)

Note that checksum errors are only logged with -debug or -debug=net.

jnewbery commented at 3:24 pm on May 21, 2020: member

Do you think it is possible to do the refactor without the behaviour change?

Why wouldn’t it be possible? The checksum can be checked in ReceiveMessageBytes() and the message dropped (with a log) if the checksum is wrong.

fjahr commented at 9:01 pm on May 21, 2020: member

Concept ACK on the refactor. After reading the discussion here and in the PR review club I am still undecided if the disconnect is a good idea. We don’t seem to have enough data to be confident that it will not be a problem for some users.

Would using the banscore be a viable middle ground in this case? Yes, we would ban the peer but only after multiple offenses and we could tweak the number so that we are confident there is a bigger problem (like the actual attack discussed earlier) in which case a ban seems actually more warranted than a disconnect.

Banscore is generally not considered much these days it seems but I can’t see why it would not be a possible solution in this case.

jnewbery commented at 0:57 am on May 22, 2020: member

Banscore is generally not considered much these days it seems but I can’t see why it would not be a possible solution in this case. @fjahr my view on banscore is that it’s not particularly useful. If a peer sends us one bad message it’s generally safe to assume that it’ll probably send us more bad messages. Why wait until it’s sent us 10 or 100 before disconnecting or banning? The ban score is a simple counter (and not some protocol violations/time gauge), so even if a peer sends us one bad message every few days, it’ll eventually reach the ban limit.

More generally, it’s useful to ask why we’d disconnect a peer. I think there are three good reasons:

to protect ourselves from resource-wasting by broken peers. Note that this doesn’t really help against adversarial peers if we’re accepting incoming connections, since they can always reconnect, and even banning connections from that IP address isn’t much protection, since acquiring new IP addresses to connect from is trivial for an adversary.
to free up on of our limited connection slots for a better peer (eg one that is more likely to relay blocks or propagate transactions to us) 3 (maybe) in some cases, disconnecting from peers is the only way to signal to them that they’re sending us bad data. If all your peers are disconnecting from you, then you can infer that there’s probably something wrong with your connection or implementation. If peers were just dropping your messages, you may never find out.

In all of those cases, I think that if you’re going to disconnect, it’s better to disconnect immediately, rather than wait for some arbitrary number of protocol violations. Or put differently, if you can tolerate 9 bad messages from a peer, why should you suddenly disconnect on the 10th?

From an implementation standpoint, ban scores exist only in the net processing layer. net.cpp has no concept of a peer’s ban score. If we want to move checksum verification to the net layer, then we either can’t use ban scores, or we need to do a big change in the way they’re handled.

rajarshimaitra commented at 11:22 am on May 22, 2020: contributor

Ran all the tests. All unit tests and functional tests passing. Verified the test fails on master. Still haven’t been able to decide on connection dropping behavior.

Do you think it is possible to do the refactor without the behavior change?

This seems like a good idea as the good refactoring part of the change can get merged while we gather up more data to decide on connection dropping behavior.

Also I was wondering, in the scenario where a firewall mingles with message data causing checksum to fail, do they do it for all data all the time? or they do it intermittently?

3 (maybe) in some cases, disconnecting from peers is the only way to signal to them that they’re sending us bad data. If all your peers are disconnecting from you, then you can infer that there’s probably something wrong with your connection or implementation. If peers were just dropping your messages, you may never find out.

Is it possible that my own firewall is playing foul on me which then results in me disconnecting all the nodes? Can this also give signal that something is wrong with my end of the connection?

Emzy commented at 1:15 pm on May 22, 2020: contributor

Just for information. I see this errors at my node (about 400 connections).

2020-05-21T00:22:57Z CHECKSUM ERROR (SMBr, 0 bytes), expected 5df6e0e2 was 00000000 2020-05-21T00:23:44Z CHECKSUM ERROR (, 0 bytes), expected 5df6e0e2 was 00000000 2020-05-21T00:23:55Z CHECKSUM ERROR (, 1 bytes), expected f3035c79 was 01000000 About 15 on one day. Logging net is enabled.

MarcoFalke commented at 1:32 pm on May 22, 2020: member

Those are clearly malicious nodes and disconnection is the best we can do

DrahtBot added the label Needs rebase on May 23, 2020

p2p: immediately disconnect on invalid net message checksum 9164e1cc4d

jonasschnelli force-pushed on May 26, 2020

jonasschnelli commented at 7:00 am on May 26, 2020: contributor

Rebased.

DrahtBot removed the label Needs rebase on May 26, 2020

jonatack commented at 10:00 am on May 26, 2020: member

re-ACK 9164e1c rebase only, no change since my last review @ 1d9bc6a per git range-diff 24f7029 1d9bc6a 9164e1c. If you retouch or for a follow-up, have at a look at suggestions like #15206 (review), #15206 (review), #15206 (review), and #15206 (review).

troygiorshev commented at 2:40 pm on May 27, 2020: contributor

In acknowledgement of @MarcoFalke’s challenge and @jnewbery’s suggestion, this branch holds an example of moving the checksum verification to GetMessage() (called by ReceiveMsgBytes()), without changing the current disconnect/don’t disconnect behavior. In agreement with this other comment, a natural continuation could be this branch, which moves all header checks into net, again without changing the current behavior (along the lines of #15197).

Thanks again to @jonasschnelli for being the impetus behind these PRs. These will pave the way for improvements like BIP324 and #18989. I hope these branches can clear up some of the worries expressed both here and in the PR Review Club.

jnewbery commented at 3:07 pm on May 27, 2020: member

I like the approach in @troygiorshev’s branch https://github.com/troygiorshev/bitcoin/tree/p2p-refactor-header, which moves the checks into net.cpp without changing behaviour. @jonasschnelli : can you take a look at that branch and see what you think? Does it interfere with your BIP324 work?

jonasschnelli commented at 6:03 am on May 29, 2020: contributor

I don’t know how to proceed (if) with this PR. It seems like that the immediate disconnect on a single checksum is to harsh (by some contributors judgment). I’m personally unsure wether it’s too strict or desirable.

Splitting the PR into a pure refactoring is possible. @troygiorshev branch https://github.com/troygiorshev/bitcoin/tree/p2p-refactor-header looks not bad.

If we are not going to do the disconnect (just a refactor), I suggest do close this PR and start with a fresh one for something like @troygiorshev is suggesting.

DrahtBot commented at 8:37 pm on June 4, 2020: member

🐙 This pull request conflicts with the target branch and needs rebase.

DrahtBot added the label Needs rebase on Jun 4, 2020

PastaPastaPasta commented at 3:59 pm on June 28, 2020: contributor

Personally, I have no problem disconnecting on invalid checksum.

If we continue with this PR, can you run clang-format-diff?

 0diff --git a/src/net.cpp b/src/net.cpp
 1index 15e06ffbe..e41c64e44 100644
 2--- a/src/net.cpp
 3+++ b/src/net.cpp
 4@@ -677,12 +677,11 @@ int V1TransportDeserializer::readData(const char *pch, unsigned int nBytes)
 5 
 6     if (Complete()) {
 7         const uint256& hash = GetMessageHash();
 8-        if (memcmp(hash.begin(), hdr.pchChecksum, CMessageHeader::CHECKSUM_SIZE) != 0)
 9-        {
10+        if (memcmp(hash.begin(), hdr.pchChecksum, CMessageHeader::CHECKSUM_SIZE) != 0) {
11             LogPrint(BCLog::NET, "CHECKSUM ERROR (%s, %u bytes), expected %s was %s, disconnecting\n",
12-               SanitizeString(hdr.pchCommand), hdr.nMessageSize,
13-               HexStr(hash.begin(), hash.begin()+CMessageHeader::CHECKSUM_SIZE),
14-               HexStr(hdr.pchChecksum, hdr.pchChecksum+CMessageHeader::CHECKSUM_SIZE));
15+                SanitizeString(hdr.pchCommand), hdr.nMessageSize,
16+                HexStr(hash.begin(), hash.begin() + CMessageHeader::CHECKSUM_SIZE),
17+                HexStr(hdr.pchChecksum, hdr.pchChecksum + CMessageHeader::CHECKSUM_SIZE));
18             return -1;
19         }
20     }

fanquake referenced this in commit 6af9b31bfc on Sep 29, 2020

jnewbery commented at 10:39 am on September 29, 2020: member

This is obsoleted by #19107 @jonasschnelli - if you think that there’s anything in here that isn’t addressed by 19107, feel free to reopen or open another PR.

jnewbery closed this on Sep 29, 2020

sidhujag referenced this in commit 7f1c584210 on Sep 29, 2020

DrahtBot locked this on Feb 15, 2022

Immediately disconnect on invalid net message checksum #15206

Conflicts