Fix some concurrency issues in ActivateBestChain()

skeees commented at 8:09 pm on April 18, 2018: contributor

Originally this PR was just to add tests around concurrency in block validation - those tests seem to have uncovered another bug in ActivateBestChain - this now fixes that bug and adds tests.

ActivateBestChain (invoked after a new block is validated) proceeds in steps - acquiring and releasing cs_main while incrementally disconnecting and connecting blocks to sync to the most work chain known (FindMostWorkChain()). Every time cs_main is released the result of FindMostWorkChain() can change - but currently that value is cached across acquisitions of cs_main and only refreshed when an invalid chain is explored. It needs to be refreshed every time cs_main is reacquired. The test added in https://github.com/bitcoin/bitcoin/pull/13023/commits/6094ce73045fe0b4654ff94327c2059512af88fb will occasionally fail without the commit fixing this issue https://github.com/bitcoin/bitcoin/pull/13023/commits/26bfdbaddbb9f13864deb7241c6d513f22c5ab62

Original description below

After a bug discovered where UpdatedBlockTip() notifications could be triggered out of order (#12978), these unit tests check certain invariants about these signals.

The scheduler test asserts that a SingleThreadedSchedulerClient processes callbacks fully and sequentially.

The block validation test generates a random chain and calls ProcessNewBlock from multiple threads at random and in parallel. ValidationInterface callbacks verify that the ordering of BlockConnected BlockDisconnected and UpdatedBlockTip events occur as expected.

in src/Makefile.test.include:91 in 5f8f1bed6c outdated

86@@ -87,7 +87,8 @@ BITCOIN_TESTS =\
87   test/txvalidationcache_tests.cpp \
88   test/versionbits_tests.cpp \
89   test/uint256_tests.cpp \
90-  test/util_tests.cpp
91+  test/util_tests.cpp \
92+  test/validation_block_tests.cpp

jamesob commented at 8:11 pm on April 18, 2018:

Looks like there’s an alphabetic ordering here that may be good to preserve.

in src/test/scheduler_tests.cpp:133 in 5f8f1bed6c outdated

128+    for (int i = 0; i < 5; i++) {
129+        threads.create_thread(boost::bind(&CScheduler::serviceQueue, &scheduler));
130+    }
131+
132+    // these are not atomic, if SinglethreadedSchedulerClient prevents
133+    // parallel execution at the queue level no synchronization should required here

jamesob commented at 8:13 pm on April 18, 2018:

should be required here

in src/test/validation_block_tests.cpp:86 in 5f8f1bed6c outdated

81+    auto pblock = Block(prev_hash);
82+    pblock->vtx.push_back(pblock->vtx[0]);
83+    pblock->hashMerkleRoot = BlockMerkleRoot(*pblock);
84+
85+    while (!CheckProofOfWork(pblock->GetHash(), pblock->nBits, Params().GetConsensus()))
86+        ++(pblock->nNonce);

jamesob commented at 8:23 pm on April 18, 2018:

Could be worth parameterizing Block with a bool make_invalid option to avoid the duplication here, but that’s your call.

skeees commented at 3:25 pm on April 20, 2018:

Extracted into FinalizeBlock - its probably worth refactoring BlockAssembler into a builder style class - would make unit tests much easier to write … maybe one day

jamesob commented at 5:14 pm on April 20, 2018:

Yep, definitely agree.

in src/test/validation_block_tests.cpp:103 in 5f8f1bed6c outdated

86+        ++(pblock->nNonce);
87+
88+    return pblock;
89+}
90+
91+void BuildChain(const uint256& root, int height, const unsigned int invalid_rate, const unsigned int branch_rate, const unsigned int max_size, std::vector<std::shared_ptr<const CBlock>>& blocks)

jamesob commented at 8:24 pm on April 18, 2018:

I know we don’t have columnar limits in the styleguide, but this line’s pretty long…

skeees commented at 3:25 pm on April 20, 2018:

i agree with you - i originally had this on two lines - and then the linter put it all back on one

skeees renamed this:
~~Add unit tests for signals generated by ProcessNewBlock~~
[WIP] Add unit tests for signals generated by ProcessNewBlock
on Apr 18, 2018

in src/test/validation_block_tests.cpp:100 in 5f8f1bed6c outdated

 95+    bool gen_invalid = GetRand(100) < invalid_rate;
 96+    bool gen_fork = GetRand(100) < branch_rate;
 97+
 98+    const std::shared_ptr<const CBlock> pblock = gen_invalid ? BadBlock(root) : GoodBlock(root);
 99+    blocks.push_back(pblock);
100+    if (!gen_invalid)

jamesob commented at 8:25 pm on April 18, 2018:

Braces needed.

Empact commented at 9:37 pm on April 18, 2018:

clang-format would put those in I believe

skeees commented at 3:25 pm on April 20, 2018:

it doesn’t! but i put them in now

jamesob commented at 8:39 pm on April 18, 2018: member

Concept ACK, just a few nits. Nice test!

in src/test/validation_block_tests.cpp:135 in 5f8f1bed6c outdated

130+    // this will create parallelism and randomness inside validation - the ValidationInterface
131+    // will subscribe to events generated during block validation and assert on ordering invariance
132+    boost::thread_group threads;
133+    for (int i = 0; i < 10; i++) {
134+        threads.create_thread([&blocks]() {
135+            bool blah;

Empact commented at 9:37 pm on April 18, 2018:

Calling this ignored might be more straightforward

fanquake added the label Tests on Apr 18, 2018

skeees force-pushed on Apr 20, 2018

skeees renamed this:
~~[WIP] Add unit tests for signals generated by ProcessNewBlock~~
Always refresh most work chain when reacquiring cs_main in ActivateBestChain()
on Apr 20, 2018

skeees commented at 3:27 pm on April 20, 2018: contributor

Updated to address reviewer comments and fix a bug that this test seems to have uncovered

skeees force-pushed on Apr 20, 2018

MarcoFalke removed the label Tests on Apr 22, 2018

MarcoFalke added the label Validation on Apr 22, 2018

in src/validation.cpp:2676 in 26bfdbaddb outdated

2672@@ -2673,9 +2673,7 @@ bool CChainState::ActivateBestChain(CValidationState &state, const CChainParams&
2673             ConnectTrace connectTrace(mempool); // Destructed before cs_main is unlocked
2674 
2675             CBlockIndex *pindexOldTip = chainActive.Tip();
2676-            if (pindexMostWork == nullptr) {
2677-                pindexMostWork = FindMostWorkChain();
2678-            }
2679+            pindexMostWork = FindMostWorkChain();

ryanofsky commented at 6:07 pm on April 23, 2018:

Might be good to add a comment here saying why it’s important to recompute pindexMostWork each loop iteration (how sync could get stuck otherwise).

sdaftuar commented at 4:13 pm on April 26, 2018:

This might cause substantial slow-down during IBD or reindex, because currently FindMostWorkChain walks from the candidate tip to the fork point on the current chain to ensure none of the blocks are invalid. We may be able to optimize that, but we also might want to consider another solution.

in src/test/validation_block_tests.cpp:61 in 6094ce7304 outdated

56+
57+    CScript pubKey;
58+    pubKey << i++ << OP_TRUE;
59+
60+    auto ptemplate = BlockAssembler(Params()).CreateNewBlock(pubKey, false);
61+    CBlock* pblock = new CBlock(ptemplate->block);

ryanofsky commented at 6:13 pm on April 23, 2018:

I’d think this could be written more simply as:

0auto block = make_shared<CBlock>(ptemplate->block);
1block->hashPrevBlock = ...
2return block;

It looks like pblock is leaked currently, or is this not the case?

ryanofsky commented at 6:27 pm on April 23, 2018: member

utACK 6094ce73045fe0b4654ff94327c2059512af88fb. Fun tests!

sdaftuar commented at 3:53 pm on April 26, 2018: member

Thanks for opening this PR. I thought it would be helpful for other reviewers to clarify the (IMO significant) bug here, so that we can properly evaluate fixes.

If ActivateBestChain is invoked simultaneously in separate threads, then we can end up at a lower-work tip, and remain stuck until the next block comes in.

Suppose in thread 1, we have just been delivered a block that causes us to try to activate two blocks in ABC. We connect 1 of them in ABCStep, and then release cs_main before connecting the second.
In thread 2, suppose we have been delivered a 3rd block (say via rpc) that builds on the first two. It invokes ABC and gets to run after the first block has been connected in thread 1. It connects one block, releases cs_main, and then connects one more, and finishes.
When thread 1 gets to run again, the most work chain has advanced, but (before this PR) we don’t refresh pindexMostWork (except when we find an invalid block). Consequently we would invoke ABCStep with a less work tip than our current tip(!). This would cause us to disconnect our tip and return.

Some travis failures have been observed due to this bug, as seen for instance here: https://travis-ci.org/bitcoin/bitcoin/jobs/370848272. The test that sometimes fails, rpc_deprecated.py, generates blocks on two nodes roughly simultaneously, so one of the nodes is generating blocks in an rpc thread while also processing blocks on the network thread, which I believe is enough to trigger this bug.

in src/test/validation_block_tests.cpp:91 in 6094ce7304 outdated

86+const std::shared_ptr<const CBlock> BadBlock(const uint256& prev_hash)
87+{
88+    auto pblock = Block(prev_hash);
89+
90+    // a second coinbase will make this block invalid
91+    pblock->vtx.push_back(pblock->vtx[0]);

sdaftuar commented at 4:28 pm on April 26, 2018:

This method of constructing a bad block is not ideal, because this kind of invalidity is detected well before ConnectBlock. In order to get better code coverage, I think it would be better to generate different kinds of invalid blocks, to test failure at different points in the validation process.

(As it is, I believe this method of making an invalid block is detectable in CheckBlock (merkle-tree fails validation due to duplicate transactions). It would also fail in CheckBlock for having more than one coinbase, and it would fail in ContextualCheckBlock for having an invalid witness commitment.)

sdaftuar changes_requested

sdaftuar commented at 4:31 pm on April 26, 2018: member

Nice work finding this issue and good start on the new test. I haven’t finished my review yet, but here’s my initial feedback.

sdaftuar commented at 3:22 pm on April 27, 2018: member

I have a branch where I changed the test you wrote to produce invalid blocks that would be detected in ConnectBlock, and where I have an alternate fix for the bug you found here, as well as a fix for the bug I mentioned in #13092: https://github.com/sdaftuar/bitcoin/commits/2018-04-alternate-abc-fix.

For fixing the bug here, rather than invoke FindMostWorkChain on every loop iteration (which I expect to be very slow during reindex), I instead added a new test: if the tip has changed since the last loop iteration, or if pindexMostWork is no longer in setBlockIndexCandidates, then update pindexMostWork. (I believe that only the first criteria is actually necessary, but included the second as belt-and-suspenders.)

I compared this approach vs master for doing a reindex on my workstation, and saw < 3% slowdown (some slowdown was expected due to checking setBlockIndexCandidates each time).

skeees commented at 4:29 pm on April 27, 2018: contributor

Nice - thanks for this. I wonder if it might be a bit more readable to explore feasibility of caching results internally in findmostworkchain instead of doing it in the activatebestchain loop - the control flow there is getting harder and harder to follow. I’m happy to explore that.

sdaftuar commented at 6:07 pm on April 27, 2018: member

@TheBlueMatt and I discussed that idea offline (of rewriting FindMostWorkChain to use g_failed_blocks, our internal data structure for caching invalid blocks, to speed it up) but our initial reaction was that the review overhead to ensure correctness might be very high. I’m open to other ideas though (and if you think reworking FMWC is practical please feel free to take a shot at it).

laanwj added this to the milestone 0.16.1 on May 3, 2018

laanwj added the label Needs backport on May 3, 2018

skeees force-pushed on May 4, 2018

skeees commented at 3:57 pm on May 4, 2018: contributor

@sdaftuar - pulled in your commits.

Its still hard for me to convince myself that the set of conditions you’ve defined are the only cases in which pindexMostWork needs to be refreshed. Maybe I’m just not familiar enough with those areas - I’ll leave it to the people with more experience in that area to review.

Just wanted to throw out one alternate way to do determine when to refresh most work tip that’s (imo) a bit easier to convince myself behaves correctly. Instead of caching pindexOldTip and using that as well as a couple of other things to determine whether to refresh - how about giving setBlockIndexCandidates a sequence number that gets incremented every time it gets updated. Then all ABC has to do is record the last tip candidate set version it saw (instead of the old tip) and use that to determine whether a refresh may be necessary. You could do this in ABC or even internally in FindMostWorkChain.

MarcoFalke commented at 6:04 pm on May 4, 2018: member

Copy of travis output:

 0Running tests: validation_block_tests from test/validation_block_tests.cpp
 1Running 1 test case...
 2Test cases order is shuffled using seed: 1374948318
 3Entering test module "Bitcoin Test Suite"
 4test/validation_block_tests.cpp(24): Entering test suite "validation_block_tests"
 5test/validation_block_tests.cpp(125): Entering test case "processnewblock_signals_ordering"
 6test/validation_block_tests.cpp(160): info:  has passed
 7test/validation_block_tests.cpp(160): info: ProcessNewBlock(Params(), block, true, &ignored)
 8test/validation_block_tests.cpp(160): info: check 
 9test/validation_block_tests.cpp(160): info: ProcessNewBlock(Params(), block, true, &ignored)
10test/validation_block_tests.cpp(160): info: ProcessNewBlock(Params(), block, true, &ignored)
11test/validation_block_tests.cpp(160): info: check 
12test/validation_block_tests.cpp(160): info:  has passed
13test/validation_block_tests.cpp(160): info: ProcessNewBlock(Params(), block, true, &ignored)
14terminate called after throwing an instance of 'std::length_error'
15  what():  basic_string::_S_create
16make[3]: *** [test/validation_block_tests.cpp.test] Error 1

sdaftuar commented at 7:11 pm on May 4, 2018: member

From some quick web searching, it seems that BOOST_CHECK may not be thread safe, so invoking it in each thread is likely the problem here. Perhaps we should just assert() on the return value from ProcessNewBlock() (at line 160).

sdaftuar commented at 7:42 pm on May 4, 2018: member

I do occasionally see this failure when I run validation_block_tests in a tight loop (after fixing the BOOST_CHECK issue above):

0test/validation_block_tests.cpp(38): error in "processnewblock_signals_ordering": check m_expected_tip == block->hashPrevBlock failed [0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206 != 0000000000000000000000000000000000000000000000000000000000000000]
1Segmentation fault (core dumped)

Edit: perhaps this is also a threading issue though?

skeees force-pushed on May 4, 2018

skeees commented at 7:53 pm on May 4, 2018: contributor

Hmmm thats the genesis block which connects through a different path I believe

skeees commented at 8:07 pm on May 4, 2018: contributor

@sdaftuar - thats the genesis block getting activated. Feels like it may be more of a test artefact (wasn’t there a recent change that randomizes test case execution order) than a threading issue? I think solution for that is to just call PNB(genesis) at the very beginning.

sdaftuar commented at 8:08 pm on May 4, 2018: member

@skeees My reasoning is this:

Assume ABC never releases cs_main until the new tip has work >= that of the old tip..
Then it should be safe to always try to connect to pindexMostWork as long as it is still a candidate to be our tip (ie it’s still in setBlockIndexCandidates). Since we prune entries from setBlockIndexCandidates in ABCStep, if an entry is in setBlockIndexCandidates, then working towards it is still making progress (even if we might end up scrapping it for an even-more-work tip later – it’s no different from just learning about the more work tip after you had finished connecting the other blocks, rather than before).
However there is a small risk that we might have two threads that are simultaneously trying to connect several blocks towards two different tips, one of which has more work than the other, and both of which have more work than our prior tip. Because ABCStep continues until it arrives at a block that has more work than the prior tip, this must converge, but it could annoyingly take a while (eg thread 1 connects 1 block out of 10, then thread 2 connects 2 blocks out of the 11 it needs to get to its desired tip, then thread 1 connects 3 blocks, and so on). This is the convergence issue I alluded to in the code comment.
So as an optimization, we can just check to see if the tip has changed since we last left off. If the tip has not changed we can use the assumption that we never release cs_main at a less-work tip than we were at before to be sure that whatever our pindexMostWork target from before is must still be a valid goal.
In particular, it’s not possible for (say) other threads invoking ABC concurrently to leave us at our tip if they have been connecting towards more-work tips.
I believe that checking for the tip change is sufficient for this to be safe, but as belt-and-suspenders I also included the check for pindexMostWork being in setBlockIndexCandidates, just in case. I think the worst-case behavior if there’s a hole in the logic would be the slow-convergence-issue in the event that multiple threads are trying to connect lots of blocks towards competing tips, which seems pretty low risk to me anyway.

I believe that the fix to not release cs_main if we encounter an invalid block achieves the assumption listed at the top, and thus the rest follows.

Just wanted to throw out one alternate way to do determine when to refresh most work tip that’s (imo) a bit easier to convince myself behaves correctly. Instead of caching pindexOldTip and using that as well as a couple of other things to determine whether to refresh - how about giving setBlockIndexCandidates a sequence number that gets incremented every time it gets updated. Then all ABC has to do is record the last tip candidate set version it saw (instead of the old tip) and use that to determine whether a refresh may be necessary. You could do this in ABC or even internally in FindMostWorkChain.

This might also be a reasonable solution; it seems like a bigger refactor to me since we don’t have great encapsulation on setBlockIndexCandidates at the moment, but I agree that it could be a clearer logical change. @sipa Any thoughts on this issue?

skeees force-pushed on May 4, 2018

in src/test/validation_block_tests.cpp:144 in 04176fc02f outdated

139+    // Process all the headers so we understand the toplogy of the chain
140+    BOOST_CHECK(ProcessNewBlockHeaders(headers, state, Params()));
141+    ProcessNewBlock(Params(), std::make_shared<CBlock>(Params().GenesisBlock()), true, &ignored);
142+
143+    // subscribe to events (this subscriber will validate event ordering)
144+    TestSubscriber sub(chainActive.Tip()->GetBlockHash());

sdaftuar commented at 0:35 am on May 5, 2018:

As discussed on IRC, there’s a race condition where callbacks scheduled prior to a client subscribing to callbacks can be delivered to those clients, if the scheduler didn’t finish draining its queue before the new subscriber is added.

That causes this test to (rarely) fail, because we initialize the TestSubscriber with the genesis block, even though it might be notified of the genesis block’s connection (which has already happened at this point!) if the callbacks were delayed. This can be reproduced by adding a MilliSleep() call to SingleThreadedSchedulerClient::ProcessQueue().

The call to ProcessNewBlock at line 141 above doesn’t fix this issue; instead we need to ensure that there are no outstanding callbacks at the time the TestSubscriber is created, for instance by calling SyncWithValidationInterfaceQueue(). Also we should hold cs_main before accessing chainActive:

0    SyncWithValidationInterfaceQueue();
1
2    const CBlockIndex *initial_tip = nullptr;
3    {
4        LOCK(cs_main);
5        initial_tip = chainActive.Tip();
6    }
7
8    // subscribe to events (this subscriber will validate event ordering)
9    TestSubscriber sub(initial_tip->GetBlockHash());

skeees force-pushed on May 7, 2018

skeees renamed this:
~~Always refresh most work chain when reacquiring cs_main in ActivateBestChain()~~
Fix some concurrency issues in ActivateBestChain()
on May 7, 2018

in src/validation.cpp:151 in c496b7b386 outdated

143@@ -144,6 +144,12 @@ class CChainState {
144       */
145     std::set<CBlockIndex*> g_failed_blocks;
146 
147+    /**
148+     * the ChainState CriticalSection
149+     * A lock that must be held when modifying this ChainState - held in ActivateBestChain()
150+     */
151+    CCriticalSection m_cs_cs;

sipa commented at 5:46 pm on May 7, 2018:

I believe it may be worth adding 16 bytes to the source code to give this the mildly less confusing name m_chainstate_cs instead.

sipa commented at 6:28 pm on May 7, 2018: member

utACK code changes; I didn’t review the tests.

Also, when introducing a test to demonstrate the problem, put it after the fix. Doing otherwise means introducing temporarily broken commits, which may break the ability to bisect.

in src/validation.cpp:2661 in c496b7b386 outdated

2657@@ -2652,6 +2658,7 @@ bool CChainState::ActivateBestChain(CValidationState &state, const CChainParams&
2658     // us in the middle of ProcessNewBlock - do not assume pblock is set
2659     // sanely for performance or correctness!
2660     AssertLockNotHeld(cs_main);
2661+    LOCK(m_cs_cs);

sdaftuar commented at 6:38 pm on May 7, 2018:

nit: I think it’s worth commenting here with some explanation of the types of race conditions that this is designed to protect (along the lines of what is in the commit message, perhaps)

in src/test/test_bitcoin.h:124 in 9dd61f272a outdated

118@@ -119,5 +119,6 @@ struct TestMemPoolEntryHelper
119 };
120 
121 CBlock getBlock13b8a();
122+std::ostream& operator<<(std::ostream& os, const uint256& num);

sdaftuar commented at 6:55 pm on May 7, 2018:

nit: perhaps add a comment explaining why this is added? (guessing it’s because you want to use uint256’s in BOOST_CHECK_EQUAL())

in src/test/validation_block_tests.cpp:7 in 9dd61f272a outdated

0@@ -0,0 +1,171 @@
1+// Copyright (c) 2018 The Bitcoin Core developers
2+// Distributed under the MIT software license, see the accompanying
3+// file COPYING or http://www.opensource.org/licenses/mit-license.php.
4+
5+#include <iomanip>
6+#include <memory>
7+#include <sstream>

sdaftuar commented at 7:06 pm on May 7, 2018:

nit: doesn’t seem like these includes are necessary at all, am I missing some dependency?

sdaftuar approved

sdaftuar commented at 7:12 pm on May 7, 2018: member

ACK c496b7b386a66fb4fc73d3bacd3db2a5c4bf7872, though I left some nits for you to consider.

skeees force-pushed on May 7, 2018

skeees commented at 7:44 pm on May 7, 2018: contributor

@sipa - commits now correctly ordered and m_cs_cs renamed to m_cs_chainstate

sdaftuar commented at 7:47 pm on May 7, 2018: member

re-ACK ed1828c202c9f4376492c173e361814320a01ffa

skeees force-pushed on May 7, 2018

skeees force-pushed on May 8, 2018

skeees force-pushed on May 11, 2018

in src/validation.cpp:2719 in 8febf4850b outdated

2735+                for (const PerBlockConnectTrace& trace : connectTrace.GetBlocksConnected()) {
2736+                    assert(trace.pblock && trace.pindex);
2737+                    GetMainSignals().BlockConnected(trace.pblock, trace.pindex, trace.conflictedTxs);
2738+                }
2739+            } while (!chainActive.Tip() || (starting_tip && CBlockIndexWorkComparator()(chainActive.Tip(), starting_tip)));
2740+            if(!blocks_connected) return true;

jamesob commented at 7:31 pm on May 11, 2018:

nit: if (...

in src/validation.cpp:149 in 8febf4850b outdated

143@@ -144,6 +144,12 @@ class CChainState {
144       */
145     std::set<CBlockIndex*> g_failed_blocks;
146 
147+    /**
148+     * the ChainState CriticalSection
149+     * A lock that must be held when modifying this ChainState - held in ActivateBestChain()

jamesob commented at 7:34 pm on May 11, 2018:

must be held when modifying this ChainState

Seems like this statement is too broad; many aspects of this object are modified without this lock held (e.g. g_failed_blocks, setBlockIndexCandidates).

sdaftuar commented at 7:18 pm on May 16, 2018:

I agree – this is more of a lock on the logic in ABC than any particular data structure. The comment in ABC itself is pretty good though.

in src/validation.cpp:2663 in 8febf4850b outdated

2659@@ -2653,6 +2660,12 @@ bool CChainState::ActivateBestChain(CValidationState &state, const CChainParams&
2660     // sanely for performance or correctness!
2661     AssertLockNotHeld(cs_main);
2662 
2663+    // ABC maintains a fair degree of expensive to calculate internal state

jamesob commented at 7:35 pm on May 11, 2018:

I had a bit of trouble parsing this comment initially - might be easier to read if you wrote “expensive-to-calculate” instead?

jamesob approved

jamesob commented at 7:37 pm on May 11, 2018: member

utACK https://github.com/bitcoin/bitcoin/pull/13023/commits/8febf4850b11406883d457d3c63e713acfacf682

The tests you’ve written are great (we need more like these) and the changeset fixes a definite bug, so I’m :+1: on this. I left a few nits but feel free to ignore or fix later.

in src/validation.cpp:2708 in 4931c6d298 outdated

2728 
2729-            for (const PerBlockConnectTrace& trace : connectTrace.GetBlocksConnected()) {
2730-                assert(trace.pblock && trace.pindex);
2731-                GetMainSignals().BlockConnected(trace.pblock, trace.pindex, trace.conflictedTxs);
2732-            }
2733+            const CBlockIndex * pindexFork = chainActive.FindFork(pindexOldTip);

TheBlueMatt commented at 7:40 pm on May 11, 2018:

Not sure if this is my fault or not, but this definitely needs to be finding the fork since starting_tip for the notifications, not just since the last loop iteration. Also, we should probably move the UpdatedBlockTip callback generation inside the pindexFork != pindexNewTip check as otherwise you’ll get rather nonsense callbacks (and none of the callers seem to rely on any specific behavior). Either way behavior needs documentation in validationinterface.h (or I can do it in #12979 once this gets merged).

in src/validation.cpp:2665 in 1eee5ef33b outdated

2659@@ -2653,6 +2660,12 @@ bool CChainState::ActivateBestChain(CValidationState &state, const CChainParams&
2660     // sanely for performance or correctness!
2661     AssertLockNotHeld(cs_main);
2662 
2663+    // ABC maintains a fair degree of expensive to calculate internal state
2664+    // because this function periodically releases cs_main so that it does not lock up other threads for too long
2665+    // during large re-orgs - and to allow for e.g. the callback queue to drain

TheBlueMatt commented at 7:41 pm on May 11, 2018:

nit: comment is slightly wrong now, we dont release cs_main during large reorgs, only large connects.

in src/test/scheduler_tests.cpp:133 in 9c31fa3228 outdated

128+    for (int i = 0; i < 5; i++) {
129+        threads.create_thread(boost::bind(&CScheduler::serviceQueue, &scheduler));
130+    }
131+
132+    // these are not atomic, if SinglethreadedSchedulerClient prevents
133+    // parallel execution at the queue level no synchronization should be required here

TheBlueMatt commented at 7:57 pm on May 11, 2018:

This isnt sufficient - just because two things are ordered does not mean they have a consistent view into memory unless there was something that triggered a flush. In practice I believe the lock inside of the SingleThreadedSchedulerClient provides this, but that’s certainly not guaranteed in the future. A release-acquire should be sufficient, however.

skeees commented at 6:04 pm on May 12, 2018:

don’t quite understand what you mean here - isn’t the whole purpose of SingleThreadedSchedulerClient to ensure internally consistent ordering/views of callbacks on a single client? if a memory barrier is somehow necessary to achieve this (like you I think the lock is sufficient for now) then that should be done in the SingleThreadedSchedulerClient and not be the caller’s responsibility. This test was intended to demonstrate that the api guarantees this behavior to users.

TheBlueMatt commented at 4:35 pm on May 13, 2018:

The C++ memory model does not provide any ordering guarantees across variables at different memory locations unless you have a) some operation dependency tree which creates such an order, or b) a rel_acq/seq_cst operation somewhere which implicitly creates such an order (which a mutex lock/unlock essentially implicitly creates). So if, eg, the SingleThreadedSchedulerClient used only relaxed operations to load the next callback to call, and both callback clients update the variable, there is no happens-before relationship created on the variable and you may not have consistent results. Having the rel_acq implied by the internal lock provides what you want implicitly, but its not a guarantee. A release/acquire operation should be sufficient here.

skeees commented at 2:35 pm on May 14, 2018:

oh hmm - but each client only interacts with its own “counter” there’s no sharing across clients in this tests. I’m simply checking that for each client, its own callbacks are processed sequentially - which i think is something that the singlethreadedschedulerclient api is supposed to guarantee (otherwise might as well just use the raw scheduler) - and no current test existed

TheBlueMatt commented at 4:29 pm on May 14, 2018:

Ugh, my memory of the C++ memory model was slightly faulty, I’ve updated the comment above, though its much less a concern now (however should sitll be fixed). You could use relaxed operations to try to optimistically fetch the next callback to execute, succeed, and then have one thread run a callback, the other thread run a second callback but for the same client, and create no dependancy. On many non-x86 platforms the CPU is free to reorder (apparently) unrelated stores/loads.

TheBlueMatt commented at 7:59 pm on May 11, 2018: member

Didnt review the last new test (awesome, btw).

Do not unlock cs_main in ABC unless we've actually made progress.

Technically, some internal datastructures may be in an inconsistent
state if we do this, though there are no known bugs there. Still,
for future safety, its much better to only unlock cs_main if we've
made progress (not just tried a reorg which may make progress).

ecc3c4a019

Fix concurrency-related bugs in ActivateBestChain

If multiple threads are invoking ActivateBestChain, it was possible to have
them working towards different tips, and we could arrive at a less work tip
than we should.  Fix this by introducing a ChainState lock which must
be held for the entire duration of ActivateBestChain to enforce
exclusion in ABC.

a3ae8e6873

skeees force-pushed on May 12, 2018

TheBlueMatt commented at 4:33 pm on May 14, 2018: member

utACK non-test changes through a3ae8e68739023e5dba9e5cb190e707ed4603316, a3070a861bb3a76303b7859c4dd96ba886029111 would be fine with some release/acquire atomics, and 51539506577e5439c58b2e36c07d84f33bb4d4c5 looks awesome, though I haven’t reviewed it fully.

skeees commented at 6:02 pm on May 14, 2018: contributor

I will follow up offline in case I’m totally missing something - but I feel pretty strongly that sequential consistency should be guaranteed by the SingleThreadedSchedulerClient - whether the callbacks are actually run by one or multiple threads internally, it should feel single threaded (it’s even named that way) and the caller shouldn’t have to worry about it. Otherwise you open up a ton of surface area for more bugs like the ones being fixed here to creep in. Of course if there’s something I’m missing in the current implementation of the SingleThreadedSchedulerClient that doesn’t guarantee this ordering, I’m happy to fix it here, but it appears that the lock guarantees it now and I’d strongly advocate that any future implementations without the lock still maintain that guarantee.

TheBlueMatt commented at 6:05 pm on May 14, 2018: member

@skeees hmm, maybe we’re talking past each other, but my point was that you can fully guarantee both ordering and only one callback executing at once without implying seq_cst/rel_acq in the C++ memory model. I dont think the memory consistency of a client should be implied by an API as it can impose runtime cost.

skeees force-pushed on May 14, 2018

skeees commented at 8:19 pm on May 14, 2018: contributor

Ok - I finally understand what you’re saying. I still disagree. I think it makes the API pretty unusable if you drop that guarantee - and the theoretical (because its single threaded right now) performance tradeoff is worth the usability - and certainly it should not be called SingleThreaded anything since it drops the single threaded memory model. But in the interest of not delaying 0.16.1 I’ve removed that commit and will follow up in a separate PR.

Btw though - I think the second test still has the same issue (different callbacks referencing the same unsynchronized memory)

TheBlueMatt commented at 7:49 pm on May 15, 2018: member

utACK 7b4be5089938b188ecca86742aa6aa4d2ec0c40e, though I didn’t review the last test in detail (thanks for making that!).

laanwj commented at 11:56 am on May 16, 2018: member

Compiling this locally gives me:

0/.../bitcoin/src/test/validation_block_tests.cpp:73:13: error: use of undeclared identifier 'CheckProofOfWork'
1    while (!CheckProofOfWork(pblock->GetHash(), pblock->nBits, Params().GetConsensus())) {
2            ^
31 error generated.

Adding #include <pow.h> fixes it.

Add unit tests for signals generated by ProcessNewBlock()

After a recent bug discovered in callback ordering in MainSignals,
this test checks invariants in ordering of
BlockConnected / BlockDisconnected / UpdatedChainTip signals

dd435ad402

skeees force-pushed on May 16, 2018

skeees commented at 12:29 pm on May 16, 2018: contributor

thanks @laanwj - updated

in src/validation.cpp:2691 in ecc3c4a019 outdated

2696-                return true;
2697+                bool fInvalidFound = false;
2698+                std::shared_ptr<const CBlock> nullBlockPtr;
2699+                if (!ActivateBestChainStep(state, chainparams, pindexMostWork, pblock && pblock->GetHash() == pindexMostWork->GetBlockHash() ? pblock : nullBlockPtr, fInvalidFound, connectTrace))
2700+                    return false;
2701+                blocks_connected = true;

sdaftuar commented at 1:57 pm on May 16, 2018:

nit: This is not a very good variable name, as ABCStep can return true even if no blocks were connected.

in src/validation.cpp:2703 in ecc3c4a019 outdated

2719-            fInitialDownload = IsInitialBlockDownload();
2720+                for (const PerBlockConnectTrace& trace : connectTrace.GetBlocksConnected()) {
2721+                    assert(trace.pblock && trace.pindex);
2722+                    GetMainSignals().BlockConnected(trace.pblock, trace.pindex, trace.conflictedTxs);
2723+                }
2724+            } while (!chainActive.Tip() || (starting_tip && CBlockIndexWorkComparator()(chainActive.Tip(), starting_tip)));

sdaftuar commented at 2:03 pm on May 16, 2018:

I believe this means we’ll infinite loop if we ever had a bug like the race condition that this PR fixes, that could cause us to invoke ABCStep with a pindexMostWork that is less work than our tip.

I think an infinite loop would be strictly worse than an assert failure, so could we also add an assert above that pindexMostWork has more work than chainActive.Tip(), before we invoke ABCStep?

laanwj commented at 4:24 pm on May 16, 2018: member

utACK dd435ad40267f5c50ff17533c696f9302829a6a6

laanwj merged this on May 16, 2018

laanwj closed this on May 16, 2018

laanwj referenced this in commit 11e7bdfd90 on May 16, 2018

sdaftuar commented at 7:21 pm on May 16, 2018: member

Post-merge ACK, with some nits.

fanquake added this to the "For backport" column in a project

skeees deleted the branch on May 24, 2018

MarcoFalke referenced this in commit 3663900717 on May 24, 2018

MarcoFalke referenced this in commit 8a7cfd5557 on May 24, 2018

MarcoFalke referenced this in commit e8a283f554 on May 24, 2018

MarcoFalke referenced this in commit 0948153ea6 on May 24, 2018

MarcoFalke referenced this in commit bb79aaf93a on May 24, 2018

MarcoFalke referenced this in commit d6c3a08c48 on May 24, 2018

fanquake removed the label Needs backport on May 28, 2018

fanquake commented at 9:37 am on May 28, 2018: member

Backported in #13317

fanquake removed this from the "For backport" column in a project

HashUnlimited referenced this in commit f8f81c2d53 on Jun 29, 2018

HashUnlimited referenced this in commit 2413612b85 on Jun 29, 2018

HashUnlimited referenced this in commit c80100e3ff on Jun 29, 2018

MarcoFalke referenced this in commit e83d82a85c on Aug 1, 2018

ccebrecos referenced this in commit 85ac9b7168 on Sep 14, 2018

ccebrecos referenced this in commit 27876d57d5 on Sep 14, 2018

ccebrecos referenced this in commit d4f80b9367 on Sep 14, 2018

PastaPastaPasta referenced this in commit c5af021d59 on Jul 17, 2020

PastaPastaPasta referenced this in commit 82d64aef95 on Jul 17, 2020

PastaPastaPasta referenced this in commit f9aa9ad39f on Jul 17, 2020

LarryRuane referenced this in commit b19c9ed41b on Mar 22, 2021

LarryRuane referenced this in commit 25fa8176e2 on Mar 22, 2021

random-zebra referenced this in commit b993c56afe on Apr 4, 2021

LarryRuane referenced this in commit d9b7f45ce8 on Apr 26, 2021

LarryRuane referenced this in commit c45596e287 on Apr 26, 2021

UdjinM6 referenced this in commit 4a67d6ae09 on May 21, 2021

LarryRuane referenced this in commit 897b4fabdd on May 27, 2021

LarryRuane referenced this in commit 638b1e94be on May 27, 2021

UdjinM6 referenced this in commit d3d242f2b1 on Jun 5, 2021

UdjinM6 referenced this in commit bb44ac04c9 on Jun 5, 2021

str4d referenced this in commit ef61fe6dbc on Sep 23, 2021

str4d referenced this in commit bbaff2ab1e on Sep 23, 2021

DrahtBot locked this on Dec 16, 2021

Fix some concurrency issues in ActivateBestChain() #13023

Original description below