Potential deadlock assertion in net code on 0.12RC3 #7470

issue GIJensen opened this issue on February 5, 2016
  1. GIJensen commented at 2:25 AM on February 5, 2016: none

    <githubroot>I was running 0.12RC2, I downloaded 0.12RC3 (b1f031d) today and did a configure make (without a clean). This caused my mainnet and testnet nodes to crash within 1s. I tried to do a reindex on the mainnet node, and that did nothing (I shouldn't have). I did a "make clean", and rebuilt, now the testnet node appears okay (didn't try a reindex) however now the mainnet node crashes in minutes (but it sometimes takes a while, so I don't see it) with the following error:

    Crash error:

    bitcoind: sync.cpp:112: void potential_deadlock_detected(const std::pair<void, void>&, const LockStack&, const LockStack&): Assertion `onlyMaybeDeadlock' failed.

    debug.log tail:

    LoadExternalBlockFile: Processing out of order child 000000000000004f2d365bee2d14cea4cd33c238a75f5dee833048fa0f3855ac of 0000000000000004851be1a8e239bfd84b806f4e7341f09b73ffa7ffe7ea4700 2016-02-04 23:51:14 UpdateTip: new best=000000000000004f2d365bee2d14cea4cd33c238a75f5dee833048fa0f3855ac height=246847 log2_work=70.747326 tx=20743335 date=2013-07-16 11:09:45 progress=0.087758 cache=5.6MiB(7660tx) 2016-02-04 23:51:14 POTENTIAL DEADLOCK DETECTED 2016-02-04 23:51:14 Previous lock order was: 2016-02-04 23:51:14 (1) pnode->cs_vRecvMsg net.cpp:1736 (TRY) 2016-02-04 23:51:14 (2) cs_main main.cpp:296 2016-02-04 23:51:14 Current lock order is: 2016-02-04 23:51:14 pnode->cs_vSend net.cpp:1755 (TRY) 2016-02-04 23:51:14 (2) cs_main main.cpp:5345 (TRY) 2016-02-04 23:51:14 (1) pto->cs_inventory main.cpp:5445

    I'm not sure if it's relevant but I also run "getblockchaininfo" every 0.5s to get the verification progress (I have a CLI progress bar). I've been considering doing a blockchain redownload, but I don't want to if it won't fix anything.</githubroot>

  2. laanwj added the label P2P on Feb 5, 2016
  3. laanwj commented at 2:56 AM on February 5, 2016: member

    This is an error that only appears with -DDEBUG_LOCKORDER, which is enabled by --enable-debug. Did you perhaps enable this recently?

    Haven't checked whether this is a real issue or not, but the try_lock deadlock detection seems to sometimes give false positives.

    I've been considering doing a blockchain redownload, but I don't want to if it won't fix anything.

    No, this looks like an issue in the networking code redownloading the block chain shouldn't have a single effect on this.

  4. laanwj commented at 3:05 AM on February 5, 2016: member

    Changes to main.cpp in rc3 versus rc2:

    diff --git a/src/main.cpp b/src/main.cpp
    index d1b0fba..3591679 100644
    --- a/src/main.cpp
    +++ b/src/main.cpp
    @@ -75,7 +75,7 @@ bool fCheckpointsEnabled = DEFAULT_CHECKPOINTS_ENABLED;
     size_t nCoinCacheUsage = 5000 * 300;
     uint64_t nPruneTarget = 0;
     bool fAlerts = DEFAULT_ALERTS;
    -bool fPermitReplacement = DEFAULT_PERMIT_REPLACEMENT;
    +bool fEnableReplacement = DEFAULT_ENABLE_REPLACEMENT;
    
     /** Fees smaller than this (in satoshi) are considered zero fee (for relaying, mining and transaction creation) */
     CFeeRate minRelayTxFee = CFeeRate(DEFAULT_MIN_RELAY_TX_FEE);
    @@ -866,7 +866,7 @@ bool AcceptToMemoryPoolWorker(CTxMemPool& pool, CValidationState &state, const C
                     // unconfirmed ancestors anyway; doing otherwise is hopelessly
                     // insecure.
                     bool fReplacementOptOut = true;
    -                if (fPermitReplacement)
    +                if (fEnableReplacement)
                     {
                         BOOST_FOREACH(const CTxIn &txin, ptxConflicting->vin)
                         {
    @@ -4497,8 +4497,8 @@ bool static ProcessMessage(CNode* pfrom, string strCommand, CDataStream& vRecv,
    
             bool fBlocksOnly = GetBoolArg("-blocksonly", DEFAULT_BLOCKSONLY);
    
    -        // Allow whitelisted peers to send data other than blocks in blocks only mode if whitelistalwaysrelay is true
    -        if (pfrom->fWhitelisted && GetBoolArg("-whitelistalwaysrelay", DEFAULT_WHITELISTALWAYSRELAY))
    +        // Allow whitelisted peers to send data other than blocks in blocks only mode if whitelistrelay is true
    +        if (pfrom->fWhitelisted && GetBoolArg("-whitelistrelay", DEFAULT_WHITELISTRELAY))
                 fBlocksOnly = false;
    
             LOCK(cs_main);
    @@ -4677,8 +4677,8 @@ bool static ProcessMessage(CNode* pfrom, string strCommand, CDataStream& vRecv,
         else if (strCommand == NetMsgType::TX)
         {
             // Stop processing the transaction early if
    -        // We are in blocks only mode and peer is either not whitelisted or whitelistalwaysrelay is off
    -        if (GetBoolArg("-blocksonly", DEFAULT_BLOCKSONLY) && (!pfrom->fWhitelisted || !GetBoolArg("-whitelistalwaysrelay", DEFAULT_WHITELISTALWAYSRELAY)))
    +        // We are in blocks only mode and peer is either not whitelisted or whitelistrelay is off
    +        if (GetBoolArg("-blocksonly", DEFAULT_BLOCKSONLY) && (!pfrom->fWhitelisted || !GetBoolArg("-whitelistrelay", DEFAULT_WHITELISTRELAY)))
             {
                 LogPrint("net", "transaction sent in violation of protocol peer=%d\n", pfrom->id);
                 return true;
    @@ -4778,7 +4778,7 @@ bool static ProcessMessage(CNode* pfrom, string strCommand, CDataStream& vRecv,
                 assert(recentRejects);
                 recentRejects->insert(tx.GetHash());
    
    -            if (pfrom->fWhitelisted && GetBoolArg("-whitelistalwaysrelay", DEFAULT_WHITELISTALWAYSRELAY)) {
    +            if (pfrom->fWhitelisted && GetBoolArg("-whitelistforcerelay", DEFAULT_WHITELISTFORCERELAY)) {
                     // Always relay transactions received from whitelisted peers, even
                     // if they were already in the mempool or rejected from it due
                     // to policy, allowing the node to function as a gateway for
    

    Changes to net.cpp in rc3 versus rc2:

    diff --git a/src/net.cpp b/src/net.cpp
    index 8abfc4b..2f60cfa 100644
    --- a/src/net.cpp
    +++ b/src/net.cpp
    @@ -884,8 +884,6 @@ static bool AttemptToEvictConnection(bool fPreferNewConnection) {
                     continue;
                 if (node->fDisconnect)
                     continue;
    -            if (node->addr.IsLocal())
    -                continue;
                 vEvictionCandidates.push_back(CNodeRef(node));
             }
         }
    @@ -916,15 +914,20 @@ static bool AttemptToEvictConnection(bool fPreferNewConnection) {
    
         if (vEvictionCandidates.empty()) return false;
    
    -    // Identify the network group with the most connections
    +    // Identify the network group with the most connections and youngest member.
    +    // (vEvictionCandidates is already sorted by reverse connect time)
         std::vector<unsigned char> naMostConnections;
         unsigned int nMostConnections = 0;
    +    int64_t nMostConnectionsTime = 0;
         std::map<std::vector<unsigned char>, std::vector<CNodeRef> > mapAddrCounts;
         BOOST_FOREACH(const CNodeRef &node, vEvictionCandidates) {
             mapAddrCounts[node->addr.GetGroup()].push_back(node);
    +        int64_t grouptime = mapAddrCounts[node->addr.GetGroup()][0]->nTimeConnected;
    +        size_t groupsize = mapAddrCounts[node->addr.GetGroup()].size();
    
    -        if (mapAddrCounts[node->addr.GetGroup()].size() > nMostConnections) {
    -            nMostConnections = mapAddrCounts[node->addr.GetGroup()].size();
    +        if (groupsize > nMostConnections || (groupsize == nMostConnections && grouptime > nMostConnectionsTime)) {
    +            nMostConnections = groupsize;
    +            nMostConnectionsTime = grouptime;
                 naMostConnections = node->addr.GetGroup();
             }
         }
    @@ -932,14 +935,13 @@ static bool AttemptToEvictConnection(bool fPreferNewConnection) {
         // Reduce to the network group with the most connections
         vEvictionCandidates = mapAddrCounts[naMostConnections];
    
    -    // Do not disconnect peers if there is only 1 connection from their network group
    +    // Do not disconnect peers if there is only one unprotected connection from their network group.
         if (vEvictionCandidates.size() <= 1)
             // unless we prefer the new connection (for whitelisted peers)
             if (!fPreferNewConnection)
                 return false;
    
    -    // Disconnect the most recent connection from the network group with the most connections
    -    std::sort(vEvictionCandidates.begin(), vEvictionCandidates.end(), ReverseCompareNodeTimeConnected);
    +    // Disconnect from the network group with the most connections
         vEvictionCandidates[0]->fDisconnect = true;
    
         return true;
    

    As far as I see it, nothing with the locks in question changed.

  5. GIJensen commented at 3:05 AM on February 5, 2016: none

    I have two aliases for configuring:

    alias configurestd="./configure --prefix='$HOME/.local'" alias configuredev="./configure --enable-debug --program-prefix=dev --prefix='$HOME/.local'"

    I ran "configurestd" before "make && make install" so I figured that'd disable "--enable-debug". Other than that, I don't have any debugging flags anywhere.

  6. GIJensen commented at 3:19 AM on February 5, 2016: none

    Is it possible the issue existed before but never came up because I never reindexed? Should I try to build older commits?

  7. laanwj commented at 3:20 AM on February 5, 2016: member

    Yes, that is possible. Try cleaning your tree then rebuildilng without --enable-debug, then reindexing.

  8. GIJensen commented at 3:23 AM on February 5, 2016: none

    Okay, already doing that!

    BTW I'm running Debian Jessie 64- bit. Just in case that ends up being relevant.

  9. laanwj commented at 3:27 AM on February 5, 2016: member

    I have trouble parsing which locks are conflicting from that log message. It's supposed to detect A->B B->A lock conflicts, but it looks like, except cs_main the two sequences have nothing in common? @TheBlueMatt @cfields you touched that deadlock detection code most recently AFAIK, any idea what is going on here?

  10. GIJensen commented at 6:02 AM on February 5, 2016: none

    Seems to run fine now anyways, no crash yet!

  11. laanwj renamed this:
    Crash on 0.12RC3
    Potential deadlock assertion in net code on 0.12RC3
    on Feb 5, 2016
  12. laanwj commented at 5:23 PM on February 16, 2016: member

    Other issues about locking in net code:

    • Potential deadlock on {cs_main, pnode->cs_vSend} #4493
    • locking inconsistencies in the network code #4091
    • Missing cs_main lock in ProcessMessage() #5678
  13. MarcoFalke commented at 3:50 PM on February 24, 2016: member

    Travis is running CPPFLAGS=-DDEBUG_LOCKORDER ( https://github.com/bitcoin/bitcoin/blob/317462123f8e41fd7dd967ab907e59ddffb19898/.travis.yml#L46 )

    I should try to see if it triggers this bug report on the extended suite maybe...

  14. laanwj commented at 10:11 AM on March 28, 2016: member

    Just hit this in travis too:

    Running testscript walletbackup.py ...
    INFO:Initializing test directory /tmp/testATnfBW
    start_node: bitcoind started, calling bitcoin-cli -rpcwait getblockcount
    start_node: calling bitcoin-cli -rpcwait getblockcount returned
    start_node: bitcoind started, calling bitcoin-cli -rpcwait getblockcount
    start_node: calling bitcoin-cli -rpcwait getblockcount returned
    start_node: bitcoind started, calling bitcoin-cli -rpcwait getblockcount
    start_node: calling bitcoin-cli -rpcwait getblockcount returned
    start_node: bitcoind started, calling bitcoin-cli -rpcwait getblockcount
    start_node: calling bitcoin-cli -rpcwait getblockcount returned
    INFO:Generating initial blockchain
    INFO:Creating transactions
    INFO:Backing up
    INFO:More transactions
    INFO:Restoring using wallet.dat
    INFO:Re-starting nodes
    start_node: bitcoind started, calling bitcoin-cli -rpcwait getblockcount
    start_node: calling bitcoin-cli -rpcwait getblockcount returned
    start_node: bitcoind started, calling bitcoin-cli -rpcwait getblockcount
    start_node: calling bitcoin-cli -rpcwait getblockcount returned
    start_node: bitcoind started, calling bitcoin-cli -rpcwait getblockcount
    start_node: calling bitcoin-cli -rpcwait getblockcount returned
    bitcoind: sync.cpp:112: void potential_deadlock_detected(const std::pair<void*, void*>&, const LockStack&, const LockStack&): Assertion `onlyMaybeDeadlock' failed.
    Unexpected exception caught during testing: [Errno 111] Connection refused
      File "/home/travis/build/bitcoin/bitcoin/bitcoin-x86_64-unknown-linux-gnu/qa/rpc-tests/test_framework/test_framework.py", line 135, in main
        self.run_test()
      File "/home/travis/build/bitcoin/bitcoin/bitcoin-x86_64-unknown-linux-gnu/qa/rpc-tests/walletbackup.py", line 168, in run_test
        sync_blocks(self.nodes)
      File "/home/travis/build/bitcoin/bitcoin/bitcoin-x86_64-unknown-linux-gnu/qa/rpc-tests/test_framework/util.py", line 98, in sync_blocks
        counts = [ x.getblockcount() for x in rpc_connections ]
      File "/home/travis/build/bitcoin/bitcoin/bitcoin-x86_64-unknown-linux-gnu/qa/rpc-tests/test_framework/coverage.py", line 50, in __call__
        return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
      File "/home/travis/build/bitcoin/bitcoin/bitcoin-x86_64-unknown-linux-gnu/qa/rpc-tests/test_framework/authproxy.py", line 138, in __call__
        response = self._request('POST', self.__url.path, postdata)
      File "/home/travis/build/bitcoin/bitcoin/bitcoin-x86_64-unknown-linux-gnu/qa/rpc-tests/test_framework/authproxy.py", line 124, in _request
        self.__conn.request(method, path, postdata, headers)
      File "/usr/lib/python2.7/httplib.py", line 958, in request
        self._send_request(method, url, body, headers)
      File "/usr/lib/python2.7/httplib.py", line 992, in _send_request
        self.endheaders(body)
      File "/usr/lib/python2.7/httplib.py", line 954, in endheaders
        self._send_output(message_body)
      File "/usr/lib/python2.7/httplib.py", line 814, in _send_output
        self.send(msg)
      File "/usr/lib/python2.7/httplib.py", line 776, in send
        self.connect()
      File "/usr/lib/python2.7/httplib.py", line 757, in connect
        self.timeout, self.source_address)
      File "/usr/lib/python2.7/socket.py", line 571, in create_connection
        raise err
    
  15. theuni commented at 5:27 PM on April 5, 2016: member

    Just got one that looks similar here: https://travis-ci.org/theuni/bitcoin/jobs/120764964#L7682

  16. laanwj commented at 12:23 PM on April 6, 2016: member

    This is, unfortunately, one of the most frequent travis random false positives.

  17. sipa commented at 4:35 PM on April 8, 2016: member

    From the report:

    2016-02-04 23:51:14 (1) pnode->cs_vRecvMsg net.cpp:1736 (TRY) ... 2016-02-04 23:51:14 (1) pto->cs_inventory main.cpp:5445

    This means the lock detection code is thinking that pnode->cs_vRecvMsg and pto->cs_inventory are the same variable. They are not however, and my assumption is that there are enough CNode objects being created and destroyed that the address of a cs_inventory at one point ends up being occupied by a cs_vRecvMsg later.

    A fix requires deleting the lockorders in the table involving a lock that gets destroyed.

  18. laanwj commented at 1:51 PM on April 14, 2016: member

    Hopefully fixed by #7846, closing for now

  19. laanwj closed this on Apr 14, 2016

  20. MarcoFalke commented at 8:42 AM on April 15, 2016: member

    Travis will still fail the extended tests due to

    bitcoind: sync.cpp:125: void potential_deadlock_detected(const std::pair<void*, void*>&, const LockStack&, const LockStack&): Assertion `onlyMaybeDeadlock' failed.
    

    https://travis-ci.org/bitcoin/bitcoin/jobs/123165273#L2968

    This happens consistently on travis but I could not yet reproduce locally.

  21. laanwj commented at 8:43 AM on April 15, 2016: member

    Thanks for reporting, reopening

  22. laanwj reopened this on Apr 15, 2016

  23. MarcoFalke commented at 10:03 AM on April 15, 2016: member

    Hmm, travis tells that sipa's fix was already merged yesterday evening, unfortunately #7798 was not merged as of yesterday, so there is no way to verify that.

    I have retriggered #7148 and it run just fine. Need to keep an eye on that pull on the coming days...

    Edit: Seems to be still an issue:

  24. MarcoFalke commented at 11:21 AM on June 2, 2016: member

    Fixed by @gmaxwell in #8082.

  25. MarcoFalke closed this on Jun 2, 2016

  26. DrahtBot locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-13 18:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me