ci: failure in wallet_multiwallet.py --legacy-wallet - (void wallet::UnloadWallet(std::shared_ptr<CWallet> &&): Assertion 'it.second' failed.) #29073

issue fanquake openend this issue on December 13, 2023
  1. fanquake commented at 12:52 pm on December 13, 2023: member

    https://cirrus-ci.com/task/5692425725804544?logs=ci#L3370:

     0 node0 2023-12-13T12:44:27.733356Z [httpworker.1] [rpc/request.cpp:187] [parse] [rpc] ThreadRPCServer method=unloadwallet user=__cookie__ 
     1 test  2023-12-13T12:44:27.782000Z TestFramework (ERROR): Assertion failed 
     2                                   Traceback (most recent call last):
     3                                     File "/ci_container_base/ci/scratch/build/bitcoin-i686-pc-linux-gnu/test/functional/test_framework/test_framework.py", line 131, in main
     4                                       self.run_test()
     5                                     File "/ci_container_base/ci/scratch/build/bitcoin-i686-pc-linux-gnu/test/functional/wallet_multiwallet.py", line 292, in run_test
     6                                       assert_equal(got_loading_error, True)
     7                                     File "/ci_container_base/ci/scratch/build/bitcoin-i686-pc-linux-gnu/test/functional/test_framework/util.py", line 57, in assert_equal
     8                                       raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
     9                                   AssertionError: not(False == True)
    10 test  2023-12-13T12:44:27.784000Z TestFramework (DEBUG): Closing down network thread 
    11 test  2023-12-13T12:44:27.834000Z TestFramework (INFO): Stopping nodes 
    12 test  2023-12-13T12:44:27.834000Z TestFramework.node0 (DEBUG): Stopping node 
    13 node0 stderr Error: Specified -walletdir "wallets" is a relative path 
    14 node0 stderr bitcoin-node: wallet/wallet.cpp:246: void wallet::UnloadWallet(std::shared_ptr<CWallet> &&): Assertion `it.second' failed. 
    15 node0 stderr Warning: Ignoring duplicate -wallet w1. 
    16 node0 stderr Error: Specified -walletdir "wallets" does not exist 
    17 node0 stderr Error: Specified -walletdir "/ci_container_base/ci/scratch/test_runner/test_runner_₿_🏃_20231213_124140/wallet_multiwallet_255/node0/regtest/wallets/notadir" is not a directory 
    18 node0 stderr Error: BerkeleyDatabase: Can't open database w8_copy (duplicates fileid 378185002c000000dd7bcafdc6f3010000000000 from w8) 
    19 node0 stderr Error: Error parsing command line arguments: Invalid parameter -upgradewallet 
    20 node0 stderr Error: Specified -walletdir "bad" does not exist 
    21 node0 stderr Error: Invalid -wallet path 'w8_symlink'. -wallet path should point to a directory where wallet.dat and database/log.?????????? files can be stored, a location where such a directory could be created, or (for backwards compatibility) the name of an existing data file in -walletdir ("/ci_container_base/ci/scratch/test_runner/test_runner_₿_🏃_20231213_124140/wallet_multiwallet_255/node0/regtest/wallets") 
    22 node0 stderr Error: Specified -walletdir "debug.log" is not a directory 
    23 node1 stderr Error: Error initializing wallet database environment "/ci_container_base/ci/scratch/test_runner/test_runner_₿_🏃_20231213_124140/wallet_multiwallet_255/competing_walletdir"! 
    
  2. maflcko added the label Tests on Dec 13, 2023
  3. maflcko added the label Bug on Dec 13, 2023
  4. dimitaracev commented at 12:34 pm on December 25, 2023: contributor
    I can take a look at this.
  5. maflcko commented at 1:34 pm on December 25, 2023: member

    The error seems to be: node0 stderr bitcoin-node: wallet/wallet.cpp:246: void wallet::UnloadWallet(std::shared_ptr<CWallet> &&): Assertionit.second’ failed. `?

    So I think another bug would be that the error is not caught and reported by the python code?

  6. maflcko removed the label Tests on Dec 25, 2023
  7. maflcko added the label Wallet on Dec 25, 2023
  8. dimitaracev commented at 1:49 pm on December 25, 2023: contributor
    Maybe we should return a proper message that indicates that the wallet we are trying to unload is already marked for unloading and we can catch that in the python code.
  9. maflcko added the label Tests on Jan 24, 2024
  10. maflcko added the label CI failed on Jan 24, 2024
  11. maflcko removed the label Tests on Jan 24, 2024
  12. maflcko commented at 6:04 pm on February 8, 2024: member

    https://cirrus-ci.com/task/6533368603475968?logs=ci#L3401

    0 node0 stderr bitcoin-node: wallet/wallet.cpp:246: void wallet::UnloadWallet(std::shared_ptr<CWallet> &&): Assertion `it.second' failed. 
    
  13. maflcko added this to the milestone 27.0 on Feb 22, 2024
  14. maflcko commented at 6:08 pm on February 22, 2024: member
    Added to 27.0, with the understanding that it is not a regression, and can be moved to 28.0, if needed.
  15. m3dwards commented at 3:49 pm on March 11, 2024: contributor

    I’m not sure the two linked failures are necessarily the same failure. Both happen at different points in the test. The first is an assert failure during Concurrent wallet loading and the second is during Load remaining wallets.

    The first appears to be a timing issue where perhaps an inconveniently timed database flush to disk slows down concurrent requests to load / unload the same wallet and all attempts succeed without hitting the loading mutex.

    In the logs for the second occurrence we can see a broken pipe when trying to load a wallet. Flaky python rpc call?

    Although both do share the std error message complaining about it.second assertion I’m not convinced this is the cause for either of these failures. In the first test this assertion failure seems to come before the test has got to the concurrent wallet loading and in the second test the failure happens when trying to load a wallet, not unload it.

    I will see if I can get more information with the test running in a loop with --tracerpc.

  16. m3dwards commented at 9:55 am on March 12, 2024: contributor

    Ran the test on the same i686 container as CI 1000 times without fail.

    For the Concurrent wallet loading functional test, this depends on timing and therefore I believe will always have a chance of failing. Perhaps this should be a unit test only where hopefully you could have enough control to guarantee the mutex would be hit.

    My only suggestion for the Load remaining wallets failure is to upgrade python from 3.10 to 3.12 which helped resolve some RPC connection issues for me previously.

  17. maflcko commented at 10:00 am on March 12, 2024: member
    The “Assertion `it.second’ failed.” is a bug in the C++ source code. At least, I can’t see how changing the python version would fix this. IIUC a broken pipe is expected when the program crashes, so I don’t think this is a problem on the python side as well.
  18. m3dwards commented at 10:09 am on March 12, 2024: contributor
    Ah ok, I made a wrong assumption there then. I assumed it could get hit if two requests to unload came at the same time but looking at the code again the lock two lines above should guard that.
  19. fanquake removed this from the milestone 27.0 on Apr 16, 2024
  20. fanquake added this to the milestone 28.0 on Apr 16, 2024
  21. maflcko commented at 9:54 am on May 3, 2024: member
  22. DrahtBot commented at 8:28 am on May 22, 2024: contributor

    https://cirrus-ci.com/task/6247747513221120?logs=ci#L3385

    0 node0 stderr bitcoin-node: wallet/wallet.cpp:244: void wallet::UnloadWallet(std::shared_ptr<CWallet> &&): Assertion `it.second' failed. 
    
  23. maflcko commented at 6:03 am on June 21, 2024: member
  24. fanquake commented at 4:29 pm on July 11, 2024: member
  25. hebasto commented at 7:22 pm on July 15, 2024: member
  26. maflcko commented at 11:13 am on July 31, 2024: member
    Could append (void wallet::UnloadWallet(std::shared_ptr<CWallet> &&): Assertion it.second’ failed.)` to the issue title, so that it is clear that this is a crashing intermittent issue?
  27. fanquake renamed this:
    ci: failure in `wallet_multiwallet.py --legacy-wallet`
    ci: failure in `wallet_multiwallet.py --legacy-wallet` - (`void wallet::UnloadWallet(std::shared_ptr<CWallet> &&): Assertion 'it.second' failed.`)
    on Jul 31, 2024
  28. josibake commented at 11:45 am on August 5, 2024: member
  29. ryanofsky commented at 7:53 pm on August 7, 2024: contributor

    This seems like a bug in the UnloadWallet function, where it just crashes if it is called by more than one thread at the same time for the same wallet. (For example if there are multiple unloadwallet RPC calls at the same time, or if there is an unloadwallet RPC call at the same time as a shutdown.) Possible fix would be:

     0--- a/src/wallet/wallet.cpp
     1+++ b/src/wallet/wallet.cpp
     2@@ -244,15 +244,19 @@ void UnloadWallet(std::shared_ptr<CWallet>&& wallet)
     3 {
     4     // Mark wallet for unloading.
     5     const std::string name = wallet->GetName();
     6+    bool notify_unload;
     7     {
     8         LOCK(g_wallet_release_mutex);
     9         auto it = g_unloading_wallet_set.insert(name);
    10-        assert(it.second);
    11+        // Multiple threads could try to unload the wallet at the same time, but
    12+        // if this is the first thread, notify the GUI that it should release
    13+        // any pointers to the wallet.
    14+        notify_unload = it.second;
    15     }
    16     // The wallet can be in use so it's not possible to explicitly unload here.
    17     // Notify the unload intent so that all remaining shared pointers are
    18     // released.
    19-    wallet->NotifyUnload();
    20+    if (notify_unload) wallet->NotifyUnload();
    21 
    22     // Time to ditch our shared_ptr and wait for ReleaseWallet call.
    23     wallet.reset();
    

    This wallet unloading code is messy in general, and there’s an old comment about cleaning up #18338 (comment) that still applies.

  30. furszy commented at 9:03 pm on August 7, 2024: member

    @ryanofsky, I’m not seeing how multiple threads can reach UnloadWallet at the same time for the same wallet. unloadwallet first calls RemoveWallet() then UnloadWallet(). And RemoveWallet() locks the wallets context mutex and removes the wallet from the context. So the next thread executing RemoveWallet() would just return early there (the RPC would return an error) and never reach the follow-up UnloadWallet call.

    I see the issue during shutdown, inside UnloadWallets (we are also not checking the function’s return value). But that seems to be a different problem.

  31. ryanofsky commented at 1:44 am on August 8, 2024: contributor

    I see the issue during shutdown, inside UnloadWallets (we are also not checking the function’s return value). But that seems to be a different problem.

    I think my suggested change should fix that problem. It’s a good observation that if two unloadwallet calls happen at the same time only one of them will call UnloadWallet, so the “multiple unloadwallet RPC calls at the same time” race can’t happen. But the “unloadwallet RPC call at the same time as a shutdown” case I mentioned can happen because UnloadWallets which runs on shutdown makes a copy of the wallets vector, so if an unloadwallet RPC call happens at the same time as that they can both call UnloadWallet on the same wallet.

  32. furszy commented at 1:53 pm on August 8, 2024: member

    But the “unloadwallet RPC call at the same time as a shutdown” case I mentioned can happen because UnloadWallets which runs on shutdown makes a copy of the wallets vector, so if an unloadwallet RPC call happens at the same time as that they can both call UnloadWallet on the same wallet.

    I agree that there is an issue there, but how do you link it to this failure? No shutdown should be executed along the “Concurrent wallet loading” test case. The scenario merely involves two or more threads attempting to load and unload the same wallet concurrently. I think we are missing something here. Probably the assertion failure is a consequence of another issue that triggers a shutdown during the concurrent wallet load/unload.

  33. ryanofsky commented at 2:36 pm on August 8, 2024: contributor

    but how do you link it to this failure?

    It seems pretty clear to me the suggested fix in #29073 (comment) will prevent this failure because it removes the assert and handles the case where UnloadWalllet is called by more than one thread, instead of aborting. I agree it would be nice to understand more about this failure, though, because maybe there are more problems and this fix isn’t sufficient.

  34. furszy commented at 2:23 pm on August 9, 2024: member

    but how do you link it to this failure?

    It seems pretty clear to me the suggested fix in #29073 (comment) will prevent this failure because it removes the assert and handles the case where UnloadWalllet is called by more than one thread, instead of aborting. I agree it would be nice to understand more about this failure, though, because maybe there are more problems and this fix isn’t sufficient.

    Your suggestion will prevent the final error, yeah. But not the origin of it. Still, it is a good step forward. We could also go with something even simpler than your suggestion:

     0diff --git a/src/wallet/load.cpp b/src/wallet/load.cpp
     1--- a/src/wallet/load.cpp
     2+++ b/src/wallet/load.cpp
     3@@ -177,7 +177,10 @@
     4         auto wallet = wallets.back();
     5         wallets.pop_back();
     6         std::vector<bilingual_str> warnings;
     7-        RemoveWallet(context, wallet, /* load_on_start= */ std::nullopt, warnings);
     8+        if (!RemoveWallet(context, wallet, /* load_on_start= */ std::nullopt, warnings)) {
     9+            LogWarning("Unable to unload wallet %s during shutdown. Wallet already unloaded", wallet->GetName());
    10+            continue;
    11+        }
    12         UnloadWallet(std::move(wallet));
    13     }
    14 }
    

    Give me your thumbs up and will push the PR. And will probably also rework the shutdown flow too.

  35. ryanofsky commented at 2:56 pm on August 9, 2024: contributor

    Give me your thumbs up and will push the PR. And will probably also rework the shutdown flow too.

    Definitely feel free! I think the fix you suggested #29073 (comment) will prevent the crash, but it has drawbacks compared to the change suggested in #29073 (comment):

    • The main drawback is that allows the UnloadWallets call to return early before all wallets have actually unloaded. This was not how UnloadWallets was intended to work and is probably not how someone would expect it to work. I think it could just lead to problems shutting down the node and gui if they try to shut down while wallets are still loaded, though I’m not sure.
    • Secondarily that fix leaves code in a more fragile state. It adds a new warning that a wallet was unable to be unloaded without any indication of what users or developers are supposed to about the warning. Ignore it? debug it? report it? And instead of making it safe to call UnloadWallet from multiple threads and simply wait for the wallet to be unloaded, it will still crash the program if called concurrently, even though on the surface it seems like it is written be a thread-safe function.

    So I’d recommend using the fix in #29073 (comment) or something similar, and if you want to clean this code up more I think all the ideas in #18338 (comment) are still applicable and would be improvements.

  36. achow101 referenced this in commit 99ecb9a630 on Aug 15, 2024
  37. maflcko commented at 9:15 am on August 16, 2024: member

    Closing for now. I think the test may still fail intermittently, however the bitcoind crash should be fixed.

    I’d say it is the best to open a new issue for the test failure, once it happens again.

  38. maflcko closed this on Aug 16, 2024


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-21 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me