getblocktemplate blocks submitblock #7015

jtoomim commented at 1:26 AM on November 14, 2015: none

I ran into this during testing on BIP101. My mempool was pretty full, so GBT was slow. During one 12-second GBT call, ckpool found a block:

[2015-11-14 00:26:16.931] Possible block solve diff 541623.733740 !
[2015-11-14 00:26:26.164] HTTP socket read+write took 10.954s in json_rpc_call ( "getblock...)
[2015-11-14 00:26:29.144] BLOCK ACCEPTED!
[2015-11-14 00:26:29.157] Solved and confirmed block 604647

Note the timestamp. That's 12.2 seconds of delay between ckpool making the block solve and bitcoind accepting the block. I can see the other 1.3 seconds as possibly being block verification time or dealing with the other network messages that piled on during the 10.9 seconds that GBT took. IIRC, this was a 5 MB block.

This problem was seen on BitcoinXT, not Bitcoin Core, but it affects you too, so I thought I should notify upstream (Core). The underlying issue is how locks are done in CreateNewBlock and elsewhere. Locking cs_main in everything, including both submitblock (rpcmining.cpp:639) and getblocktemplate, is problematic.

A workaround for this is to set up multiple bitcoinds, and set the poolserver to do submitblock to all of them in parallel. Kinda nasty, though. Another workaround is to speed up CNB a la @morcos's work. The locks should still get fixed, though.

gmaxwell commented at 1:44 AM on November 14, 2015: contributor

Thanks. This is known, and there is a preexisting plan to improve it; though there is also a standard and widely deployed architecture miners use to work around this today (run multiple daemons).

The locks simply can't be disposed of there because createnewblock and block handling must necessarily lock the state of the blockchain. :) But just because they must doesn't mean there must be considerable delays.

jtoomim commented at 1:55 AM on November 14, 2015: none

I checked with #ckpool on this before bringing it here, and they (Con Kolivas and Kano) said that ckpool does not implement the workaround for this issue. It's possible they confused the issue with another one and do actually have this worked around, and it's also possible that the workaround can be implemented outside of ckpool.

The locks can be replaced with finer-grained locks. CreateNewBlock needs to lock against modifying or deleting transactions and against reorgs. It also needs to make sure it's working off of the same blockchain tip the entire time, so it needs to grab a pointer to the tip at the beginning. It also needs to grab a view of the mempool that will not change. It does not need to prevent new transactions from being added to the mempool or new blocks from being added to the blockchain.

gmaxwell commented at 1:56 AM on November 14, 2015: contributor

This is more complex than you think--, due to iterator invalidation, mempool eviction, cache management, utxo state, etc. Fortunately, it's also not necessary to get very low latency.

jtoomim commented at 1:59 AM on November 14, 2015: none

Do you mean because of CLTV transactions and nLockTime stuff?

pstratem commented at 2:19 AM on November 14, 2015: contributor

@jtoomim The lock is not going to be removed, it would make the code substantially more complex.

Any pool that doesn't support submitting blocks through a separate bitcoind is broken and needs to be fixed.

jtoomim commented at 2:46 AM on November 14, 2015: none

@pstratem I think one of the reasons why ckpool doesn't implement that feature is that it ckpool was written with btcd in mind. I suspect that btcd's behavior is not as bad as Core and XT's behavior in this situation.

If you guys want this issue marked as closed;wontfix, that's fine with me. I think it would be best to leave it up for a couple days first to give others a chance to comment if they feel the need.

pstratem commented at 2:48 AM on November 14, 2015: contributor

@jtoomim It is inherent in the basic architecture that any getblocktemplate implementation will require locks which any submitblock implementation would also need.

The only way to get around this requirement would be to implement an MVCC database which is both extremely difficult and inherently slow.

gmaxwell commented at 2:56 AM on November 14, 2015: contributor

@jtoomim Btcd submit also competes with work creation, AFAIK. I think you are making a common error of jumping prematurely to a particular solution. The issue you've reported (That calling GBT ends up blocking submitting blocks for an unreasonable amount of time) will get fixed, but will not likely get fixed by making things run lock free (or at least not mostly!).

But all I was saying is that it sounded like were underestimating the complexity of safely eliminating locking around blockchain/mempool access, and overestimating the necessity of doing so. But the report at face value, is a fine one, and thanks for making it.

kanoi commented at 3:07 AM on November 14, 2015: none

I very much doubt anyone is suggesting using no locks. However, broad coarse locks are a sign of lack of design or understanding of the appropriate nature and use of locks.

gmaxwell commented at 3:10 AM on November 14, 2015: contributor

That is not a helpful comment. Both submitblock and createnewblock must have mutable access to the same data (the current chainstate)-- as both perform exactly the same operation (verifying a block against the current state), we are not talking about "broad coarse locks" here. There are ways around that, but they are not "just use more specific locks". :)

kanoi commented at 3:14 AM on November 14, 2015: none

You've just stated it is a broad coarse lock across all the same data used by both of them. So I'm not sure of how helpful your comment is as you don't seem to have a clear grasp of the concept of locking - it's not "lock" or "don't lock" as you clearly implied - it's determining the appropriate granularity of the locks and the type of locks used.

TheBlueMatt commented at 3:15 AM on November 14, 2015: member

@kanoi Please go read the code before commenting further. Your comments are entirely useless and you seem to be an arrogant prick.

kanoi commented at 3:18 AM on November 14, 2015: none

#7015 (comment) "@kanoi Please go read the code before commenting further. Your comments are entirely useless and you seem to be an arrogant prick."

Gotta record that for posterity in case someone deletes it :D

jtoomim commented at 3:25 AM on November 14, 2015: none

@pstratem getblocktemplate does not make any writes to anything except the block template and its own CCoinsViewCache. Nothing that getblocktemplate creates is consumed by anything else except for the RPC caller. Any data that getblocktemplate needs from the rest of the system can be copied or (e.g. for mempool) locked in a fine-grained manner.

Furthermore, there is no need for submitblock to lock everything for the entire duration of its activities. It can change the order of operations to check the new block's validity, commit it to memory as an object (but not necessarily as part of the chain), and notify peers all before locking cs_main and executing UpdateTip. @gmaxwell, if verifying a block actually requires mutable access to the chain state in order to verify a block (which sounds surprising to me, but I'll admit that I haven't checked and you probably know better than me), it still could be implemented with a lock just around the TestBlockValidity() calls. With production code, a medium-small fraction of the total time is taken in TestBlockValidity(). With morcos's work, that fraction dominates, it's true. It's also likely that calling TestBlockValidity both before and after sending a template to a miner is redundant.

Anyway, I agree that this issue is not something that should be addressed right now, given how much faster GBT will be soon. I'd call it more of a target of opportunity. When writing code in the future, I think we should make an effort to think about what we actually need to have locked, and only lock those things. You are welcome to ignore my opinion.

TheBlueMatt commented at 3:30 AM on November 14, 2015: member

@jtoomim indeed, getblocktemplate can lose cs_main for a bit of its runtime by copying state into its CCoinsViewCache (it already nearly does this), but, sadly, it doesnt actually save a ton of cs_main time by doing so. In any case, the rewrites morcos has been working on are far superior to any of this anyway (and could be designed to fail to refresh the templates by just giving up if cs_main is under contension during run). Sadly, submitblock really does need to hold cs_main for the duration because Bitcoin Core must refuse to relay blocks which are invalid (and the point of submitblock is to accept potentially-invalid blocks and validate them).

gmaxwell commented at 3:41 AM on November 14, 2015: contributor

@TheBlueMatt Actually there are some nifty proposals that avoid some of those things for submitblock in some useful cases. :)

jtoomim commented at 3:42 AM on November 14, 2015: none

@TheBlueMatt agreed, and you're right about submitblock needing to hold the lock (IFF the lock is needed for testing block validity). My mistake for not thinking that through.

TheBlueMatt commented at 3:43 AM on November 14, 2015: member

@gmaxwell indeed, if you wish to relay unverified blocks (eg to peers who have whitelisted you), you could do that without locking.

jtoomim commented at 3:44 AM on November 14, 2015: none

@raisethelimit let's not antagonize, please. This is a forum for technical discussion, not politics.

Sorry about the trolls and the politicians. I swear I didn't invite them (except kanoi insofar that I asked them about the issue before bringing it here).

jtoomim commented at 3:55 AM on November 14, 2015: none

@TheBlueMatt @gmaxwell It's a bit off-topic for this thread, but I also think that block relay should be begun after verifying PoW and before full verification. It might have some ill effects for SPV wallets that were connecting to dishonest pseudo-fullnodes which I haven't fully thought through yet, but it would help propagation times quite a bit and would also make hard forks run more cleanly (in case Core ever decides to increase the block size past 1 MB).

TheBlueMatt commented at 3:59 AM on November 14, 2015: member

@jtoomim That's been brough up a bunch over the years. Generally the thought has always been that we should only relay validated data and ban peers who break the rule, but there is a reasonable argument to be made to relay blocks with a special flag indicating they have not been validated (except for their PoW and basic sanity checks). I'm not a big fan of the idea because we've already people mining on unvalidated (and invalid) blocks and we should generally avoid making it easier to do so. That, plus validation is becoming quicker and quicker (with yet still a bunch of improvements in 0.12).

gmaxwell commented at 4:02 AM on November 14, 2015: contributor

@jtoomim That was actually implemented previously (by luke-jr) and didn't improve performance back then. It probably would now.

It's really critically important to not hand unverified blocks to SPV clients; but that is avoidable, in that you can specifically negotiate forwarding non-verified blocks and tag them as such (which is also how you avoid transitive partitioning due to misbehavior banning); so no one is surprised by them (which is what luke's patch did). I'm fearful how that will interact with verification free mining :( but the harm there is created by the verification free mining, and not by the pre-forwarding... Rather than twiddling the basic p2p protocol, however, it would probably just make more sense to handle this in new block relay mechanisms which are explicitly unverified-- since we now know we can do a lot better than the classical p2p protocol.

jtoomim commented at 4:11 AM on November 14, 2015: none

Harm with the verification-free mining is caused by the miner never verifying the chain, not by the miner making delayed verifications of the chain. I like to call this the difference between optimistic SPV mining and blind SPV mining. (I know the term SPV mining is inaccurate because no payments are being verified, but it's short and people understand it.) I don't think there's much to worry about about a fork with a 2-block invalid branch, and anything longer is vanishingly improbable if verification times are less than about 20 seconds per block. The problem is when the fork goes on for hours.

I think the Chinese (and everyone else) learned their lesson on this one: blind SPV mining is expensive. They might not care about the harm it causes other Bitcoin users much, but as long as they care about keeping their mining rewards, I doubt anyone will make the same mistake again.

gmaxwell commented at 4:20 AM on November 14, 2015: contributor

@jtoomim SPV clients make a very strong security assumption that the blocks they receive have been validated; if you setup a simulator and model success rates for 2-confirm invalid block attacks with a small hashrate attacker (e.g. 1%), even short duration extension without verification massively amplifies the success. The result is surprising, but less so when you realize that most blocks are "fast". There are a great many users that accept transactions at 2-confirms (and even wallets that just display 'confirmed' with no count at all), wise or not. So I don't think it's correct to say that it only matters if it goes on for hours. It's only the SPV nodes that need to be worried about in any case; full nodes will happily ignore the invalid blocks. :)

You should also consider more than the verification time; most mining stacks (including the hardware/firmware itself) often has fairly high latency (and/or takes a performance hit if your update their work frequently)-- usually I use a rule of thumb and figure any work you issue will be hashed on for at least 30 seconds. When I observe work issued by pools, the delay from the first pool to half the pools moving to a new block is about 10x what bitcoind verification time is (even before the speedups in 0.12).

One of the larger verification skipping miners responded that even with the losses, they believe it has been very profitable for them, and they would continue doing it.

But in any case, even all that said I think it's largely orthogonal with the aggressive forwarding. Any amount of block extension without verification invalidates the security assumption of SPV; but thats a fault of extending bad chains, not a fault of forwarding things eagerly at least so long as the eager forwarding is marked as such.

jtoomim commented at 4:52 AM on November 14, 2015: none

SPV wallets need to be insulated from "SPV" miners. Fully verifying nodes need to be in between them. We can blame "SPV" miners for it, we can blame SPV wallets for doing it, or we can recognize that economics that are stronger than us have dictated their use and try to design a system that allows them to coexist with risks that each party deems acceptable. Such a system might be possible with an isVerified flag that SPV miners could use to talk to each other and fully verifying nodes could use to talk to each other more quickly and which SPV wallets could run away from. It might be necessary to think of a way to punish dishonest use of that bit.

[redacted paragraph]

One protection against forwarding bad chains that would be cheap to implement is only optimistically forwarding an unvalidated block if it is based on a block that has been fully validated. This would limit forwarding of invalid chains to 1 block in length while still allowing for the vast majority of the performance benefit.

tulip0 commented at 5:46 AM on November 14, 2015: none

@jtoomim

I think the Chinese (and everyone else) learned their lesson on this one: blind SPV mining is expensive.

Not as expensive as you might imagine. If miners are improving their income by even a percent or two, any losses caused by validation-free mining during the BIP66 activation would have been recovered within days. It's lucrative to operate like this and there's no reason to stop.

kanoi commented at 11:00 PM on November 18, 2015: none

The simplest solution to the original issue is to allow getblocktemplate to be notified of the lock request, release the lock, then start again and get a new template. Actually using the template provided, is the wrong action in this case. Simple use of locks to do this :)

The suggestion to use multiple bitcoinds when they won't even propagate blocks among themselves quickly seems like a bad solution.

#7049

luke-jr commented at 1:30 AM on November 19, 2015: member

@kanoi I agree that seems like reasonable behaviour. I'm not sure it's as simple as you expect, though. Can you elaborate on how you would propose implementing it? (Perhaps even submit a pull request?) Thanks

kanoi commented at 2:16 AM on November 19, 2015: none

With locking?

template would (always) wait ... and get the lock
newblock would always "trylock" and process on success, or on failure "notify" it's intent, then wait on the lock
template call holding the lock, seeing the "notification" would:

release the lock, then start again and get a new template

luke-jr commented at 2:43 AM on November 19, 2015: member

Ok, so the template-producing code needs to check for the notification every iteration - that will make performance even worse, though. Might be better to just optimise it so it doesn't matter...

kanoi commented at 2:49 AM on November 19, 2015: none

False. The CPU requirement is so insignificant I wonder if you understand programming and locks at all.

kanoi commented at 3:38 AM on November 19, 2015: none

Here, a little lesson in locks and coding.

We'll call the lock being shared by newblock and template (and anything else) MASTER

The notifier actually only needs to be a boolean global variable "notify" used for this "priority" handling of access to MASTER

Anything else than needs MASTER and doesn't cause delays can completely ignore dealing with "notify" Anything else that wants priority of MASTER (in the same way newblock does) can also work exactly the same as the newblock changes Anything else that you want to be able to take the lock away from, works the same as the template changes

the template code only need check "notify" being true/false when inside the MASTER lock, no other locking required AT ALL for access to "notify" since it is boolean - the a or b choice in 2) below will not change this
the newblock code would depend on the threading access a or b to newblock (and it's cronies) but not affect the code in 1) above

a) if there is _only one thread_ accessing newblock (and anything else wanting "priority" access to MASTER), then they need only set "notify" to true, without any "notify" locking, when waiting on MASTER due to a trylock failure, and set it to false, without any "notify" locking, immediately after gaining the MASTER lock

b) if there is multi thread access, there needs to be a separate lock VAR for access to a variable that counts the number of threads wanting priority, so the "notify" code would first take out/wait on the VAR lock, increment the var counter and also inside the VAR lock, set the "notify" boolean true. Once we have the MASTER lock we take out the VAR lock, then decrement the counter and if it is zero, we clear the boolean "notify" then release the VAR lock.

Where's the CPU usage issue in that design? There is none.

Now, I'm not at all sure why I need to explain locking techniques to developers, but I am quite surprised that I need to.

gmaxwell commented at 4:25 AM on November 19, 2015: contributor

@kanoi Can you try to ease your tone a bit-- it can help aid communication, which is often a challenge online.

kanoi commented at 4:51 AM on November 19, 2015: none

Of course it will effect performance to add more CPU cycles. Adding a single line of code called once, a=1; can also effect performance.

that will make performance even worse

It's affect on performance is negligible, if that. However it solves a rather poorly coded problem. "worse" is a pointless statement to make since it implies something false, but he can get out of that by pretending he didn't imply that the solution is problematic.

I'd also add that no matter what performance boost you do to template, this change reduces divergence, so there's no reason to disparage it. It's a specific solution to a specific problem and there's no black magic involved.

laanwj added the label Mining on Nov 19, 2015

MarcoFalke commented at 6:20 PM on May 8, 2020: member

The feature request didn't seem to attract much attention in the past. Also, the issue seems not important enough right now to keep it sitting around idle in the list of open issues.

Closing due to lack of interest. Pull requests with improvements are always welcome.

MarcoFalke closed this on May 8, 2020

DrahtBot locked this on Feb 15, 2022