p2p_compactblocks.py failing occasionally on master #12978

issue sdaftuar openend this issue on April 13, 2018
  1. sdaftuar commented at 3:47 pm on April 13, 2018: member

    Reported here #12629 (comment) by @Empact. I’ve seen it now in a few different travis jobs, including:

    https://travis-ci.org/bitcoin/bitcoin/jobs/366157920 https://travis-ci.org/bitcoin/bitcoin/jobs/366139851

    From my initial investigation it seems like there may be an actual bug in block announcement, so I’m opening this issue to track.

  2. fanquake added the label Tests on Apr 13, 2018
  3. sdaftuar commented at 5:10 pm on April 13, 2018: member

    I haven’t yet been able to reproduce. What I find most disturbing from staring at the debug logs in the travis failure (eg around here: https://travis-ci.org/bitcoin/bitcoin/jobs/366157920#L6672) is that the latest block generated on the node isn’t being relayed to any peers – neither the python mininode peers, nor the other bitcoind peer.

    Not really sure how that can be happening (and only in this particular test? seems like a block relay problem of this sort would cause widespread testing failures). So hopefully I’m somehow just overlooking a particular quirk of the testing setup here.

  4. Empact commented at 5:23 pm on April 13, 2018: member
  5. jnewbery commented at 6:40 pm on April 13, 2018: member

    I think I see what’s going on here. There are two nodes under test:

    • node0
    • node1

    and three connections to those nodes:

    • self.test_node (connects to node0)
    • self.segwit_node (connects to node1 and signals NODE_NETWORK|NODE_WITNESS)
    • self.old_node (connects to node1 and signals NODE_NETWORK)

    The test is failing in the call to self.test_sendcmpct() on line 807. Before that:

    1. The call to self.test_sendcmpct() on line 805 is generating blocks on node0 and checking that they’re received by self.test_node.
    2. sync_blocks(self.nodes) on line 806 ensures that node1 is at the same tip as node0
    3. self.test_sendcmpct() on line generates blocks on node1 and checks that they’re received by self.segwit_node.

    Note that the call to sync_blocks on 806 has not ensured that the blocks have been propagated to self.segwit_node or self.old_node. The test fails because it drops into self.test_sendcmpct(), and calls generate on node1 before node1 has propagated the block from node0 to self.segwit_node. After generate is called, the test is waiting for node1 to send the newly generated block to self.segwit_node, but the next inv from node1 to self.segwit_node is for the block previously generated by node0:

     0node0 2018-04-13T13:07:45.809404Z ThreadRPCServer method=generate user=__cookie__               <--------- generate to node0 from self.test_sendcmpct() on line 805
     1[...]
     2 node0 2018-04-13T13:07:45.812956Z CreateNewBlock(): block weight: 716 txs: 0 fees: 0 sigops 400 
     3[...]
     4 node0 2018-04-13T13:07:45.815971Z UpdateTip: new best=23eb17fb1f4805b7c4adb768c250ff07a16aa16759839dc87c5ac98bd686705b height=111 version=0x20000000 log2_work=7.8073549 tx=113 date='2018-04-13T13:08:03Z' progress=1.000000 cache=0.0MiB(121txo) 
     5[...]
     6 node0 2018-04-13T13:07:45.816646Z sending headers (82 bytes) peer=0 
     7[...]
     8 node1 2018-04-13T13:07:45.817742Z received: headers (82 bytes) peer=0    <---- node1 receives block
     9[...]
    10 node1 2018-04-13T13:07:45.825341Z UpdateTip: new                         <--- node1 synced
    11[...]
    12 node1 2018-04-13T13:07:45.827344Z ThreadRPCServer method=generate user=__cookie__              <--------- generate to node1 from self.test_sendcmpct() on line 807
    13[...]
    14 node1 2018-04-13T13:07:45.830472Z UpdateTip: new best=7bfef9174d2d8c9bf165ce690b6eca7c2567fa52f3e6d9cb3d98d14327e87d33 height=112 version=0x20000000 log2_work=7.820179 tx=114 date='2018-04-13T13:08:03Z' progress=1.000000 cache=0.0MiB(122txo)                        <----------------- node1 tip advances. Test now waits for notification of block (check_announcement_of_new_block() on line 172)
    15[...]
    16 node1 2018-04-13T13:07:45.831909Z SendMessages: sending inv peer=2 hash=23eb17fb1f4805b7c4adb768c250ff07a16aa16759839dc87c5ac98bd686705b            <---------- node1 sends the blockhash for the *old block* to segwit_node
    17 node1 2018-04-13T13:07:45.831954Z sending inv (37 bytes) peer=2 
    18 test  2018-04-13T13:07:45.832000Z TestFramework.mininode (DEBUG): Received message from 127.0.0.1:11458: msg_inv(inv=[CInv(type=Block hash=23eb17fb1f4805b7c4adb768c250ff07a16aa16759839dc87c5ac98bd686705b)]) 
    19[...]
    20 test  2018-04-13T13:08:15.877000Z TestFramework.utils (ERROR): wait_until() failed. Predicate: (['        def received_hash():\n', '            return (block_hash in self.announced_blockhashes)\n'], 79)    <-------- test times out waiting for new block
    

    compare this with a passing case where the block from node0 is propagated to self.segwit_node before the generate call to node1.

    It’d be nice to rewrite this test to not have so much state between the different nodes-under-test and test connections. If we fix this one timing issue, there’s no guarantee that there aren’t other similar issues that will be hit by travis.

  6. jnewbery commented at 8:03 pm on April 13, 2018: member
    I’ve spoken to @sdaftuar - we think this may be a race in the validationinterface callbacks around vBlockHashesToAnnounce. Still investigating.
  7. jnewbery commented at 9:31 pm on April 13, 2018: member

    Caught it! https://travis-ci.org/jnewbery/bitcoin/jobs/366281859

    Looks like UpdatedBlockTip() is called in the wrong order:

    0node1 2018-04-13T21:03:36.950443Z UpdateTip: new best=74fa99e8f6767f817000b9ddb050bb2fc53d329203d63f3a084d37300be894e9 height=111 version=0x20000000 log2_work=7.8073549 tx=113 date='2018-04-13T21:03:54Z' progress=1.000000 cache=0.0MiB(121txo) 
    1[...]
    2 node1 2018-04-13T21:03:36.955464Z UpdateTip: new best=57a644748fb3e38c225972bd2e3b0fd45e7665498ebd1fcaf8e37aa6c07ac886 height=112 version=0x20000000 log2_work=7.820179 tx=114 date='2018-04-13T21:03:54Z' progress=1.000000 cache=0.0MiB(122txo) 
    

    but:

    0node1 2018-04-13T21:03:36.955924Z UpdatedBlockTip callback: new block hash: 57a644748fb3e38c225972bd2e3b0fd45e7665498ebd1fcaf8e37aa6c07ac886 
    1[...]
    2node1 2018-04-13T21:03:36.956569Z UpdatedBlockTip callback: new block hash: 74fa99e8f6767f817000b9ddb050bb2fc53d329203d63f3a084d37300be894e9 
    
  8. skeees commented at 3:17 pm on April 14, 2018: contributor

    Had some offline conversations regarding this too. After @jnewbery’s insights above and a bit of reading - I think what’s happening is that cs_main is released in ActivateBestChain() between ActivateBestChainStep but before the UpdatedBlockTip callback has been enqueued.

    In this particular case there are two threads now racing to enqueue an UpdatedBlockTip signal (the thread processing the block received over p2p and the thread which generated the block via rpc).

    After adding a randomized sleep (see below) you can reproduce this issue a bit more reliably:

    0        } // CS_MAIN RELEASED HERE
    1        // When we reach this point, we switched to a new tip (stored in pindexNewTip).
    2
    3        // Notifications/callbacks that can run without cs_main
    4        MilliSleep(GetRand(500));
    5
    6        // Notify external listeners about the new tip.
    7        GetMainSignals().UpdatedBlockTip(pindexNewTip, pindexFork, fInitialDownload);
    

    I think a potential fix is to hold cs_main until the UpdatedBlockTip signals are enqueued? Once safely enqueued they should be processed in correct order by the scheduler thread.

  9. TheBlueMatt commented at 3:40 pm on April 14, 2018: member
    Yea, should just hold cs_main there. I’d like to also fix the bug where we generate callbacks if a block connection failed ala https://github.com/bitcoin/bitcoin/pull/12138/commits/a9db3dada0119c183d16627805e90c4dbca05c6a but its somewhat tangential.
  10. skeees referenced this in commit 50ef361e66 on Apr 15, 2018
  11. skeees commented at 3:34 pm on April 15, 2018: contributor
    #12988 is a minimal fix that should resolve this specific test failure Will also spend some time going through other signals to see if there are any analogous situations that should be proactively addressed
  12. promag commented at 3:15 pm on April 16, 2018: member
    Couldn’t we add logic to PeerLogicValidation::UpdatedBlockTip so that it would’t rewind? It could ignore out of order notifications if it’s not a reorg? It already handles blocks in between: https://github.com/bitcoin/bitcoin/blob/0d6992168c2bda85b18fda8f6dea08da433a0dc9/src/net_processing.cpp#L923-L934
  13. skeees commented at 3:33 pm on April 16, 2018: contributor

    I think that the missed announcement is actually caused by the below (or maybe both)

    0// Requires cs_main
    1bool PeerHasHeader(CNodeState *state, const CBlockIndex *pindex)
    2{
    3    if (state->pindexBestKnownBlock && pindex == state->pindexBestKnownBlock->GetAncestor(pindex->nHeight))
    4        return true;
    5    if (state->pindexBestHeaderSent && pindex == state->pindexBestHeaderSent->GetAncestor(pindex->nHeight))
    6        return true;
    7    return false;
    8}
    

    But anyway - even fixing that (which would substantially complicate the logic) - its difficult to assess how many other dependencies there are on this signal being properly ordered

  14. sdaftuar commented at 3:44 pm on April 16, 2018: member
    I think requiring that UpdatedBlockTip callbacks happen in order is a far better design in the long run; being resilient to these happening out of order adds a level of complexity to the listeners that isn’t necessary at all, especially now that the callbacks are actually happening in a separate thread.
  15. skeees referenced this in commit 713c066cdc on Apr 16, 2018
  16. promag commented at 8:53 pm on April 16, 2018: member
    @sdaftuar @skeees makes sense.
  17. skeees referenced this in commit d86edd3d30 on Apr 16, 2018
  18. practicalswift commented at 11:39 am on April 17, 2018: contributor

    Should we add thread-safety analysis annotations to make sure the locking requirements are fulfilled going forward?

    Are these the locks we want to be held?

    0boost::signals2::signal<void (bool, const CBlockIndex *)> NotifyBlockTip
    1    GUARDED_BY(cs_main);
    2
    3void UpdatedBlockTip(const CBlockIndex *, const CBlockIndex *, bool fInitialDownload) 
    4    EXCLUSIVE_LOCKS_REQUIRED(cs_main);
    
  19. sdaftuar commented at 1:07 pm on April 17, 2018: member

    @practicalswift I don’t think that makes sense for the reason I gave here – we don’t just need to hold a lock while invoking the callback, we also need to not have let it go from when the tip was updated. I fear that adding checks on just holding the lock at the time the callbacks are called obscures the ordering requirement.

    I do think that we should add testing to ensure the ordering requirement is met; I was thinking a unit-test might work but I’d love to hear other ideas.

  20. skeees commented at 1:26 pm on April 17, 2018: contributor

    Apart from the test we have now which fails (but only sporadically) - I’m struggling a bit to think about what sort of test to add that would positively identify this if it broke again - what did you have in mind?

    I agree that adding a lock assert doesnt say much about what happened before the lock was acquired so doesn’t really guarantee anything about atomicity from chain update to event invocation. I think a different solution is to lock down event generation to certain paths (e.g. in this case ActivateBestChain) - I don’t think that breaks cleanliness of the interface - the interface is more to make it easy and generalized for subscribers, but event generation is still very context dependent. I don’t know of any way to accomplish this easily in c++ though - anything obvious come to mind?

  21. skeees commented at 1:39 pm on April 17, 2018: contributor

    Oh, also - I guess I should mention/shill this WIP pr #12934 I have out currently - its related. It creates a separate Validation thread and instead of threads calling directly into the validation layer - they just push messages onto a queue that the validation thread pulls from and processes completely and in order.

    In addition to more cleanly defining the interfaces into validation paths - it also makes it a lot easier to reason about concurrency and parallelism because block validation paths become single threaded and well ordered. You also get the added benefit that you can more easily parallelize stuff that happens outside of validation by explicitly making validation interfaces asynchronous - but the main goal is to better define interfaces and simplify the concurrency model in the most logically complex validation paths.

  22. sdaftuar commented at 2:00 pm on April 17, 2018: member

    Apart from the test we have now which fails (but only sporadically) - I’m struggling a bit to think about what sort of test to add that would positively identify this if it broke again - what did you have in mind?

    I haven’t worked this out yet, but I was thinking that if we could mock out the scheduler that is used by the validationinterface, then we could play around with delays in callback enqueuing time to see if that breaks ordering when generating events from multiple threads?

  23. laanwj closed this on Apr 17, 2018

  24. laanwj referenced this in commit 39e0c65b29 on Apr 17, 2018
  25. MarcoFalke referenced this in commit 9000fab07a on Apr 20, 2018
  26. laanwj referenced this in commit 11e7bdfd90 on May 16, 2018
  27. fanquake referenced this in commit ac6c2c6bf7 on May 17, 2018
  28. fanquake referenced this in commit e56783e09a on May 17, 2018
  29. fanquake referenced this in commit acdf433822 on May 18, 2018
  30. stamhe referenced this in commit 4a13d03d30 on Jun 27, 2018
  31. HashUnlimited referenced this in commit 6dce29170c on Jun 29, 2018
  32. ccebrecos referenced this in commit 8336ba655e on Sep 14, 2018
  33. lionello referenced this in commit 1fabd7b9c1 on Nov 7, 2018
  34. joemphilips referenced this in commit 16b58fd72c on Nov 9, 2018
  35. jkczyz referenced this in commit e13e0f85dc on Aug 22, 2019
  36. jkczyz referenced this in commit 4bacb541ea on Sep 3, 2019
  37. jkczyz referenced this in commit 41eb472399 on Oct 17, 2019
  38. jkczyz referenced this in commit e9eadb374e on Oct 17, 2019
  39. jkczyz referenced this in commit b6bc83628f on Oct 17, 2019
  40. jkczyz referenced this in commit 42c6fb596a on Oct 31, 2019
  41. jkczyz referenced this in commit bb65a925b6 on Oct 31, 2019
  42. jkczyz referenced this in commit 7f70f63ddf on Oct 31, 2019
  43. jkczyz referenced this in commit 93e1e930af on Oct 31, 2019
  44. jkczyz referenced this in commit b557294604 on Oct 31, 2019
  45. jkczyz referenced this in commit 6ab424c4c1 on Oct 31, 2019
  46. jkczyz referenced this in commit 9eb373c07a on Nov 1, 2019
  47. jkczyz referenced this in commit b7cb245a21 on Nov 5, 2019
  48. jkczyz referenced this in commit f871e272a1 on Nov 5, 2019
  49. jnewbery referenced this in commit 7058b453af on Nov 8, 2019
  50. jkczyz referenced this in commit 741cab0a7d on Nov 8, 2019
  51. jkczyz referenced this in commit b929189fb3 on Nov 11, 2019
  52. jkczyz referenced this in commit cd6bf9970f on Nov 11, 2019
  53. jkczyz referenced this in commit b7bb20e0f6 on Nov 22, 2019
  54. jkczyz referenced this in commit 0cadb123ba on Nov 22, 2019
  55. jkczyz referenced this in commit f9abf4ab6d on Jan 3, 2020
  56. laanwj referenced this in commit e7f8450357 on Jan 9, 2020
  57. sidhujag referenced this in commit 9d38d10d6e on Jan 9, 2020
  58. HashUnlimited referenced this in commit 0af60d4d30 on Apr 17, 2020
  59. PastaPastaPasta referenced this in commit 4cc8596d68 on Nov 10, 2020
  60. sidhujag referenced this in commit 496411c6bc on Nov 10, 2020
  61. PastaPastaPasta referenced this in commit 804428c795 on Nov 12, 2020
  62. PastaPastaPasta referenced this in commit 06aebf5eb0 on Nov 17, 2020
  63. backpacker69 referenced this in commit fbda4bfb86 on Mar 28, 2021
  64. furszy referenced this in commit 57a6fb421d on Apr 2, 2021
  65. furszy referenced this in commit 10a0b3eb26 on Apr 2, 2021
  66. furszy referenced this in commit ef24337204 on Apr 3, 2021
  67. furszy referenced this in commit 7ab7112c1f on Apr 8, 2021
  68. UdjinM6 referenced this in commit 4a67d6ae09 on May 21, 2021
  69. UdjinM6 referenced this in commit d3d242f2b1 on Jun 5, 2021
  70. UdjinM6 referenced this in commit bb44ac04c9 on Jun 5, 2021
  71. DrahtBot locked this on Feb 15, 2022
  72. gades referenced this in commit 742197c38d on Mar 17, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-07-01 13:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me