net: Better askfor request management

laanwj commented at 9:38 am on September 3, 2014: member

Instead of naively planning requests every two minutes into (potentially) the far future w/ mapAlreadyAskedFor, actively keep track of requests, and which nodes offer the requested items
Better handles the case where nodes announce an inv and go away, or stop responding
Maintains state separate from CNode, which makes it easier to review and understand separately from the rest, as well as localizes changes if extension of the logic is necessary

I also think it can serve as an example of how to ‘peel off’ a single concern from net/main.

Uses CTimeoutCondition from @ashleyholman (#4230).

Cases to test:

Announce and don’t reply to getdata, make sure retry chooses different node after REQUEST_TIMEOUT
Node disconnects and was the current node requested from - should immediately try to get from different node
Last node that had announced a certain transaction goes away - request should be discarded I’ve tested these cases using custom pynode clients, as well as by running this with full debugging on my node to see how it behaves,
Premature flushing if more than 1000 getdatas queued to a node (this is a very unlikely path but it needs to be tested)

TODO:

Slim down debug logging (this is initially useful for testing and checking, but too much in production)
Make sure FlushGetdata’s handling of CNode locking is correct
Don’t request from nodes whose sendbuffer is full

Known issues:

Doesn’t currently group getdata requests (sends a getdata request per item instead of one getdata for many invs) - wouldn’t be too difficult to fit in but I’m not sure it is worth the complexity, as usually only one inv is sent at a time (done now)

Supercedes: #4828 #4547 #4827

in src/netaskfor.cpp: in 57b0fa5ade outdated

260+    /// Bound number of concurrent inventory requests to each node, this has
261+    /// the indirect effect of bounding all data structures.
262+    if (state->setAskFor.size() > MAX_SETASKFOR_SZ)
263+        return;
264+
265+    LogPrint("netaskfor", "askfor %s  peer=%d\n", inv.ToString(), nodeid);

rebroad commented at 4:30 am on September 4, 2014:

“tx” instead of “netaskfor”? And should this be added to the command line syntax messages in init.cpp?

laanwj commented at 6:29 am on September 4, 2014:

Could be called ’tx’ but I tried to keep it as general as possible. It requests inventory items. ACK on adding it to command-line syntax.

in src/netaskfor.cpp: in 57b0fa5ade outdated

156+            invRequestsWorkQueue.begin()->first ? "retry" : "first request");
157+    invstate.beingRequestedFrom = nodeid;
158+
159+    std::vector<CInv> vGetData;
160+    vGetData.push_back(inv);
161+    node->PushMessage("getdata", vGetData);

rebroad commented at 5:01 am on September 4, 2014:

Just one inv per getdata message?

Why not use the old code to create a list of invs to getdata and use that one node at a time?

laanwj commented at 6:33 am on September 4, 2014:

It is a bit more difficult because we act per request here, not per node.

It would be possible to accumulate getdatas per node, then submit them at once after an iteration, but as said under KNOWN ISSUES in the OP I’m not sure it’s worth the extra complexity. Anyhow it’s an obvious optimization that could be done. Working on this right now…

rebroad commented at 7:11 am on September 4, 2014:

I’ve added a pull request (https://github.com/laanwj/bitcoin/pull/3) to this branch that might help with this… (also makes it easier to collect stats in later commits of mine).

laanwj force-pushed on Sep 4, 2014

in src/netaskfor.cpp: in a2afa1d17b outdated

 95+/** Handler for when a new node appears */
 96+void InitializeNode(NodeId nodeid, const CNode *pnode)
 97+{
 98+    LOCK(cs_invRequests);
 99+    /*CNodeAskForState &state = */
100+    mapNodeAskForState.insert(std::make_pair(nodeid, CNodeAskForState())).first->second;

SergioDemianLerner commented at 0:30 am on September 5, 2014:

What is the trailing .first->second for?

laanwj commented at 9:52 am on September 5, 2014:

That’s a leftover from the commented code above it. Basically, the CNodeAskForState had some initial assignments here so a reference to it was stored. This is not necessary at the moment so both the commented code and the .first->second should go.

SergioDemianLerner commented at 0:57 am on September 5, 2014: contributor

This is excellent. Could we unify the handling and scheduling of transactions getdata/invs with block getdata/invs ? Because sooner or later block fetching will require a similar method to withstand malicious block invs. Maybe that can be done later on top of this patch.

sipa commented at 1:01 am on September 5, 2014: member

We are actually doing already something very similar for blocks, and the headersfirst branch extends it (#4468). It’s a bit more complicated there, as we want a moving window of block fetching to limit out-of-orderness in which blocks arrive.

sipa commented at 1:08 am on September 5, 2014: member

Big +1 on code organization: implementing independent pieces of protocol handling should definitely move to separate files, with separate data structures and separate locks (other examples that afaik could easily be converted into this style are ping/pong and alerts).

It’s a bit unfortunate that the block fetching and tx fetching code are separated but implement similar functionality. It’s a strict improvement though as they were already separate (originally my fault I guess, as I didn’t touch the tx handling code when rewriting the block handling part), and unifying them now would probably interfere with other changes.

laanwj commented at 10:08 am on September 5, 2014: member

@SergioDemianLerner As I see it, blocks handling is much more closely bound to main/core than this (basically independent) inventory item fetcher, making it enough of a separate concern to warrant being a different module. I’m not against unifying them if it can be done sanely, of course. But block handling is essentially different especially after headers-first.

rebroad commented at 11:06 am on September 5, 2014: contributor

Malicious block invs? What is one of those?

rebroad commented at 0:23 am on September 6, 2014: contributor

Currently, the logic is such that large orphan txs are requested repeatedly and ignored repeatedly:-

2014-09-06 00:01:46 ignoring large orphan tx (size: 5057, hash: 18098a192869ef0f9128be9bf1f3bb243575f88d072bb24f918c4e4f5a894b80) peer=1 2014-09-06 00:01:46 ignoring large orphan tx (size: 5057, hash: 18098a192869ef0f9128be9bf1f3bb243575f88d072bb24f918c4e4f5a894b80) peer=11 2014-09-06 00:01:47 ignoring large orphan tx (size: 5057, hash: 18098a192869ef0f9128be9bf1f3bb243575f88d072bb24f918c4e4f5a894b80) peer=27 2014-09-06 00:01:48 ignoring large orphan tx (size: 5057, hash: 18098a192869ef0f9128be9bf1f3bb243575f88d072bb24f918c4e4f5a894b80) peer=23 2014-09-06 00:01:49 ignoring large orphan tx (size: 5057, hash: 18098a192869ef0f9128be9bf1f3bb243575f88d072bb24f918c4e4f5a894b80) peer=12 2014-09-06 00:01:50 ignoring large orphan tx (size: 5057, hash: 18098a192869ef0f9128be9bf1f3bb243575f88d072bb24f918c4e4f5a894b80) peer=26 2014-09-06 00:01:51 ignoring large orphan tx (size: 5057, hash: 18098a192869ef0f9128be9bf1f3bb243575f88d072bb24f918c4e4f5a894b80) peer=24 2014-09-06 00:01:53 ignoring large orphan tx (size: 5057, hash: 18098a192869ef0f9128be9bf1f3bb243575f88d072bb24f918c4e4f5a894b80) peer=20

Is it possible to make the node remember recently ignored txs so that they don’t keep being requested? This was working in the previous implementation thanks to #4542

laanwj commented at 2:36 am on September 6, 2014: member

@rebroad This should simply work with #4542? It doesn’t touch any part of the same code.

As I see it, the logic of whether to request something is outside scope of this module. If you ask netaskfor to retrieve a transaction for you, it will do so, until being told to stop :) (or until it runs out of peers)

rebroad commented at 7:25 am on September 6, 2014: contributor

@laanwj oh.. yes, you are right. Ok, I’ll re-merge 4542 in that case (which didn’t touch large orphan txs anyway I’ve just noticed!).

in src/netaskfor.cpp: in a2afa1d17b outdated

216+                LogPrint("netaskfor", "%s: processing item %s\n", __func__, inv.ToString());
217+                if (it != mapInvRequests.end())
218+                {
219+                    CInvState &invstate = it->second;
220+                    invstate.workQueueIter = invRequestsWorkQueue.end();
221+                    /// Pick a node to request from, if available

laanwj commented at 8:21 am on September 6, 2014:

This should skip nodes whose sendbuffer is full.

What is the correct way to check this? This would be something like pnode->nSendSize < SendBufferSize(), but I’m not sure what lock is needed for that.

sipa commented at 8:28 am on September 6, 2014:

In theory you would need pnode->cs_vSend, but if you can argue that the system keeps working correctly even if the test returns the right result only most of the time, put a big comment, and use no lock…

laanwj commented at 3:19 pm on September 9, 2014: member

Looks like the timeout could indeed be reduced.

 02014-09-09 15:15:27 askfor tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698  peer=7543
 12014-09-09 15:15:27 ThreadHandleAskFor: processing item tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698
 22014-09-09 15:15:27 QueueGetdata: Requesting item tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698 from node 7543 (first request)
 32014-09-09 15:15:27 FlushGetdata: peer=7543 getdata tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698 
 42014-09-09 15:15:27 askfor tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698  peer=6150
 52014-09-09 15:15:27 askfor tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698  peer=633
 62014-09-09 15:15:27 askfor tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698  peer=7455
 72014-09-09 15:15:27 askfor tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698  peer=7018
 82014-09-09 15:15:27 askfor tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698  peer=330
 92014-09-09 15:15:27 askfor tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698  peer=7471
102014-09-09 15:15:27 askfor tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698  peer=6306
112014-09-09 15:15:27 askfor tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698  peer=978
122014-09-09 15:15:27 askfor tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698  peer=7436
132014-09-09 15:15:27 askfor tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698  peer=7596
142014-09-09 15:15:27 askfor tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698  peer=7504
152014-09-09 15:15:27 askfor tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698  peer=6215
162014-09-09 15:15:27 Completed: tx b87fabfb19b9401f20c9bb9330db0916c4781a2d06efc914556776ae6715a698 peer=7543

In by-far most cases the transaction is returned in a few seconds. At the same time, invs for it are still coming in from other nodes.

sync: Add CTimeoutCondition

A condition that can wait for a specified timeout, this is useful when
it is known in advance that events have to be processed at some time in the
future.

b9e5772597

net: add StartThreads and StopThreads signals

This allows modules to maintain their own threads.

7ea2476c0a

net: Better inventory request management 9cbaaf61ad

in src/netaskfor.cpp: in a2afa1d17b outdated

327+    if (it == mapInvRequests.end())
328+    {
329+        std::pair<MapInvRequests::iterator, bool> ins = mapInvRequests.insert(std::make_pair(inv, CInvState()));
330+        it = ins.first;
331+        /// As this is the first time that this item gets announced by anyone, add it to the work queue immediately
332+        it->second.workQueueIter = invRequestsWorkQueue.insert(std::make_pair(0, inv));

laanwj commented at 9:07 am on September 10, 2014:

There is nothing in place at the moment to make sure that data items are requested in the same order that they’re announced (std::multimap does not preserve insertion order). There probably needs to be.

OTOH when requesting from multiple nodes (or when timeouts are involved) there is no guarantee that the response will come in the same order as the requests. So maybe it’s not an issue.

laanwj force-pushed on Sep 18, 2014

BitcoinPullTester commented at 10:37 am on September 18, 2014: none

Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/p4831_9cbaaf61ade4b91469f3d728795ec83859c25192/ for binaries and test log. This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/ Contact BlueMatt on freenode if something looks broken.

sipa commented at 2:20 am on September 20, 2014: member

Going to test this.

sipa commented at 3:52 am on September 29, 2014: member

Works without problems (even in valgrind, after running for a few days). I didn’t actually check whether it fetches/relays things correctly, though.

in src/netaskfor.cpp: in 9cbaaf61ad

 6+#include "config/bitcoin-config.h"
 7+#endif
 8+
 9+#include "netaskfor.h"
10+
11+#include <boost/thread.hpp>

Diapolo commented at 8:43 am on September 29, 2014:

Nit: This should go below our headers, so just flip this with the block below.

Diapolo commented at 8:46 am on September 29, 2014: none

Just a general question, why are most/all comments beginning with ///?

fanquake commented at 8:58 am on September 29, 2014: member

@Diapolo It’s so that they’ll be picked up by doxygen. It doesn’t recognise comments starting with //

laanwj added this to the milestone 0.11.0 on Nov 3, 2014

laanwj added the label P2P on Nov 3, 2014

sipa commented at 4:27 pm on November 17, 2014: member

I hope you pick this up soon after 0.10 :)

laanwj removed this from the milestone 0.11.0 on May 1, 2015

laanwj added this to the milestone 0.12.0 on May 1, 2015

rebroad commented at 4:10 pm on June 25, 2015: contributor

Needs rebase - is this still in progress?

jgarzik commented at 6:12 pm on July 23, 2015: contributor

Ping. Needs refresh.

I think the general consensus is that we want this, but needs more review? Seems to have positive noises in the security discussion and on here, but no ACKs.

gmaxwell commented at 8:46 am on September 6, 2015: contributor

I tested this previously (at some version…) but it got put down after being punted out of 0.10. Seems to have been forgotten. Lets unforget it.

dcousens commented at 1:25 pm on September 7, 2015: contributor

concept ACK

jgarzik commented at 5:14 pm on September 15, 2015: contributor

concept ACK - let this not be forgotten

sipa commented at 5:34 pm on September 15, 2015: member

Perhaps this will be pushed back until after Cory’s P2P refactor?

laanwj commented at 3:58 am on September 18, 2015: member

Needs to be picked back up at some point, I’m not sure when. Certainly don’t want to interfere with Cory’s libevent work.

laanwj closed this on Sep 26, 2015

MarcoFalke locked this on Sep 8, 2021

net: Better askfor request management #4831