test, bench: Initialize and terminate use of Winsock properly #28486

pull hebasto wants to merge 1 commits into bitcoin:master from hebasto:230914-winsock changing 3 files +14 −2
  1. hebasto commented at 10:08 pm on September 14, 2023: member

    On the master branch, when compiling without external signer support, the bench_bitcoin.exe does not initialize Winsock DLL that is required, for example, here: https://github.com/bitcoin/bitcoin/blob/459272d639b9547f68000d2b9a5a0d991d477de5/src/bench/addrman.cpp#L124

    Moreover, Windows docs explicitly state that WSAStartup and WSACleanup must be balanced:

    There must be a call to WSACleanup for each successful call to WSAStartup. Only the final WSACleanup function call performs the actual cleanup. The preceding calls simply decrement an internal reference count in the WS2_32.DLL.

    That is not the case for our unit tests because the SetupNetworking() call is a part of the BasicTestingSetup fixture and is invoked multiple times, while ~CNetCleanup() is invoked once only, at the end of the test binary execution.

    This PR fixes Winsock DLL initialization and termination.

    More docs:

    Fix #28940.

  2. DrahtBot commented at 10:08 pm on September 14, 2023: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK maflcko

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  3. DrahtBot added the label Tests on Sep 14, 2023
  4. hebasto commented at 10:09 pm on September 14, 2023: member

    For some reasons (I failed to figure them out, unfortunately) the code on the master branch works. @sipsorcery Maybe you have any insights about that?

  5. theuni commented at 10:26 am on September 19, 2023: member

    I’m having trouble understanding the problem here. Could you please explain what the resulting error is?

    I can understand possibly needing to move SetupNetworking() to be more correct. But I don’t understand what CMake has anything to do with this. IMO we need to figure out why before moving anything around.

  6. theuni commented at 10:43 am on September 19, 2023: member
    Looking at your commit that purposefully introduces a bug, it removes a bunch of bench files. I’m guessing one of those happens to call WSAStartup somehow?
  7. hebasto commented at 10:46 am on September 19, 2023: member

    Looking at your commit that purposefully introduces a bug, it removes a bunch of bench files. I’m guessing one of those happens to call WSAStartup somehow?

    ~That was my initial guess. But it is not the case. I mean, there are no explicit WSAStartup invocations.~

    Yes, in the MakeNoLogFileContext template function where the compiler decided to instantiate the BasicTestingSetup struct.

  8. theuni commented at 1:53 pm on September 19, 2023: member

    The error in that github action is:

    0Error: Bad optional access
    1Error: Process completed with exit code 1.
    

    Which implies to me that something is throwing std::bad_optional_access. That seems to be something different from what you’re talking about?

  9. hebasto commented at 2:14 pm on September 19, 2023: member

    The error in that github action is:

    0Error: Bad optional access
    1Error: Process completed with exit code 1.
    

    Which implies to me that something is throwing std::bad_optional_access. That seems to be something different from what you’re talking about?

    The call chain from here https://github.com/bitcoin/bitcoin/blob/459272d639b9547f68000d2b9a5a0d991d477de5/src/bench/addrman.cpp#L124 is as follows:

    • std::optional<CNetAddr> LookupHost(const std::string&, bool, DNSLookupFn)
    • std::vector<CNetAddr> LookupHost(const std::string&, unsigned int, bool, DNSLookupFn)
    • std::vector<CNetAddr> LookupIntern(const std::string&, unsigned int, bool, DNSLookupFn)
    • std::vector<CNetAddr> WrappedGetAddrInfo(const std::string&, bool)
    • getaddrinfo

    If the latter fails, for example, due to uninitialized Winsock, the WrappedGetAddrInfo returns empty vector, and the first LookupHost does std::nullopt, which in turn causes Error: Bad optional access.

  10. hebasto commented at 8:51 pm on September 23, 2023: member

    In this branch, I’ve minimized the problem code to a single line of code, which is:

    0    // Comment the next line out to trigger the WSANOTINITIALISED=10093 error.
    1    const auto test_setup = MakeNoLogFileContext<const TestingSetup>();
    
  11. fanquake commented at 4:42 pm on September 24, 2023: member

    I still don’t understand why our CI isn’t failing, in the way you claim it should be, if our binaries are meant to be broken.

    Nothing should be done here, until the root cause of the issue is actually understood. As this either points to some brokenness in master (build system/CI), or your CMake/MSVC branch producing binaries differently to master.

  12. hebasto commented at 5:06 pm on September 24, 2023: member

    I still don’t understand why our CI isn’t failing, in the way you claim it should be, if our binaries are meant to be broken.

    Our code is incorrect according to Microsoft’s docs because the SetupNetworking (with WSAStartup in it) is never called in bench_bitcoin.exe binary.

  13. in src/bench/bench_bitcoin.cpp:149 in ddd8661c7a outdated
    146+        ret = EXIT_FAILURE;
    147     }
    148+
    149+#ifdef WIN32
    150+    // Shutdown Windows Sockets
    151+    WSACleanup();
    


    maflcko commented at 9:41 am on September 25, 2023:
    Shouldn’t this be fixed by moving WSACleanup from net to netbase (or similar)?

    fanquake commented at 10:06 am on November 28, 2023:
    Agree. Duplicating the Windows sockets handling code inside the benchmarking framework, isnt the right way to solve this.
  14. hebasto commented at 8:56 pm on November 25, 2023: member

    I still don’t understand why our CI isn’t failing, in the way you claim it should be, if our binaries are meant to be broken.

    Apparently, the underlying problem is not MSVC or CMake specific. I’ve opened a dedicated issue with steps to reproduce the bug.

    We might adjust our CI to catch the problem easily.


    The PR description has been updated.

  15. hebasto force-pushed on Nov 28, 2023
  16. hebasto commented at 12:42 pm on November 28, 2023: member

    I’m going to address @maflcko’s and @fanquake’s recent comments shortly.

    In the meantime, I’ve modified the “Win64, VS 2022” GHA CI job to make it run with both enabled and disabled external wallet support. This aims to prevent any similar regressions in the future.

  17. fanquake commented at 12:51 pm on November 28, 2023: member

    In the meantime, I’ve modified the “Win64, VS 2022” GHA CI job to make it run with both enabled and disabled external wallet support.

    We shouldn’t (have to) do this going forward, because the solution is either to remove/disable Boost Process for Windows, or replace it with something that doesn’t undermine our own code.

  18. hebasto force-pushed on Nov 28, 2023
  19. hebasto renamed this:
    bench: Initialize and terminate use of Winsock properly
    test, bench: Initialize and terminate use of Winsock properly
    on Nov 28, 2023
  20. hebasto commented at 4:58 pm on November 28, 2023: member

    The PR has been reworked and its description has been updated.

    In the meantime, I’ve modified the “Win64, VS 2022” GHA CI job to make it run with both enabled and disabled external wallet support.

    We shouldn’t (have to) do this going forward, because the solution is either to remove/disable Boost Process for Windows, or replace it with something that doesn’t undermine our own code.

    I agree with the replacement of the Boost.Process and going to pick up @theStack’s branch.

    Until then, it seems appropriate to have a regression test after fixing our own code. And it comes almost for free for us.

  21. fanquake commented at 5:02 pm on November 28, 2023: member

    Until then, it seems appropriate to have a regression test after fixing our own code.

    Thanks, but we aren’t going to change our code in the way you are suggesting in this PR. External signer should just be disabled for Windows, until there is a replacement that works properly.

  22. hebasto commented at 5:04 pm on November 28, 2023: member

    Until then, it seems appropriate to have a regression test after fixing our own code.

    Thanks, but we aren’t going to change our code in the way you are suggesting in this PR. External signer should just be disabled for Windows, until there is a replacement that works properly.

    Disabling external signer support is exactly the thing which exposes the fact that our code is broken. See #28940.

  23. fanquake commented at 5:10 pm on November 28, 2023: member

    Yes, and we should keep the almost dangerous, poorly maintained code disabled until we replace it with something that has actually been reviewed, and works as we expect it to. Clearly that hasn’t been the case previously. Obviously we also need to fix our networking code (as all our code should function without external signer), but I’m not sure the approach here, is a good one, and further complicates test/bench code, while diverging it from how bitcoind functions. Ideally things become more encapsulated in some lower-level networking/init code, that can be called from bitcoind/test/bench/whereever, we shouldn’t have to introduce new special code into the test/bench frameworks to initialize sockets on windows.

    it would also help to elaborate in the PR description what the underlying issue is (Boost ASIO/Process hijacking socket/network setup out from underneath us on Windows).

  24. hebasto commented at 5:15 pm on November 28, 2023: member

    What exactly do you mean by “diverging it from how bitcoind functions”?

    it would also help to elaborate in the PR description what the underlying issue is (Boost ASIO/Process hijacking socket/network setup out from underneath us on Windows).

    “Boost ASIO/Process hijacking socket/network setup” is not a cause of the bug, rather a reason why that bug was hidden from us.

  25. in src/test/util/setup_common.h:49 in 3cb83c87d6 outdated
    43@@ -44,6 +44,16 @@ std::ostream& operator<<(typename std::enable_if<std::is_enum<T>::value, std::os
    44 
    45 static constexpr CAmount CENT{1000000};
    46 
    47+/** Global network setup.
    48+ * On Windows, we globally terminate the use of the Winsock DLL when
    49+ * destructing the static instance_of_cnetcleanup. Consequently,
    


    maflcko commented at 5:25 pm on November 28, 2023:

    Can you explain what you are fixing in this pull? Why would destructing the static instance_of_cnetcleanup happen, and why does it matter?

    Also, are you fixing the lack of initialization or fixing the lack of de-initialization, or both?


    hebasto commented at 5:29 pm on November 28, 2023:

    This PR fixes:

    1. Excessive network initialization in all unit tests and in some benchmarks, which use MakeNoLogFileContext.
    2. Lack of network initialization in other benchmarks.

    hebasto commented at 5:36 pm on November 28, 2023:

    Why would destructing the static instance_of_cnetcleanup happen, and why does it matter?

    https://github.com/bitcoin/bitcoin/blob/31ce305d46a191621aaa78a73fa4bbc7e58d8ed0/src/net.cpp#L3284-L3297

    It happens because static objects are destructed at the end of binary execution. It matters because the desctructor calls WSACleanup() on Windows. As it happens once only, the WSAStartup() must be called once only as well.


    maflcko commented at 5:42 pm on November 28, 2023:

    It happens because static objects are destructed at the end of binary execution

    Which tests are run after the end of the binary execution?


    hebasto commented at 5:44 pm on November 28, 2023:
    None.

    maflcko commented at 5:52 pm on November 28, 2023:

    Lack of network initialization in other benchmarks.

    Are fuzz tests not affected? Wouldn’t it be easier to put the setup code into libtest_util, to cover all tests?


    hebasto commented at 5:57 pm on November 28, 2023:

    This PR fixes:

    1. Excessive network initialization in all unit tests and in some benchmarks, which use `MakeNoLogFileContext`.
    

    In other words, test_bitcoin.exe calls WSAStartup() >500 times, and calls WSACleanup() exactly 1 time. According to Microsoft docs, these numbers must be equal.


    hebasto commented at 5:59 pm on November 28, 2023:

    Lack of network initialization in other benchmarks.

    Are fuzz tests not affected? Wouldn’t it be easier to put the setup code into libtest_util, to cover all tests?

    I will look into that. However, we do not use fuzzing on Windows, right?


    maflcko commented at 6:15 pm on November 28, 2023:

    I will look into that. However, we do not use fuzzing on Windows, right?

    I don’t use Windows, so I don’t know whether it is possible to fuzz on Windows, or if anyone does it.


    hebasto commented at 6:29 pm on November 28, 2023:

    Wouldn’t it be easier to put the setup code into libtest_util, to cover all tests?

    Sorry, if misunderstood your question, but this PR adds a new structure to test/util/setup_common.{h,cpp}. It already is a part of the libtest_util library.

    I don’t use Windows, so I don’t know whether it is possible to fuzz on Windows, or if anyone does it. @dergoegge Do you have any related insights?


    maflcko commented at 6:32 pm on November 28, 2023:

    Sorry, if misunderstood your question, but this PR adds a new structure to test/util/setup_common.{h,cpp}. It already is a part of the libtest_util library.

    I mean similar to how the shutdown implicitly happens if you link the net module, the startup can happen (implicitly) by linking libtest_util, which should already happen for all tests?

    Edit: It’s just a style nit and the current patch looks fine.


    hebasto commented at 6:56 pm on November 28, 2023:

    Sorry, if misunderstood your question, but this PR adds a new structure to test/util/setup_common.{h,cpp}. It already is a part of the libtest_util library.

    I mean similar to how the shutdown implicitly happens if you link the net module, the startup can happen (implicitly) by linking libtest_util, which should already happen for all tests?

    Thanks! Implemented.

  26. hebasto force-pushed on Nov 28, 2023
  27. hebasto force-pushed on Nov 28, 2023
  28. hebasto commented at 6:55 pm on November 28, 2023: member
    Addressed @maflcko’s comment.
  29. hebasto force-pushed on Nov 28, 2023
  30. hebasto force-pushed on Nov 28, 2023
  31. in src/test/util/setup_common.cpp:95 in 7560aa5b33 outdated
    87@@ -88,6 +88,15 @@ std::ostream& operator<<(std::ostream& os, const uint256& num)
    88     return os;
    89 }
    90 
    91+struct NetworkSetup
    92+{
    93+    NetworkSetup()
    94+    {
    95+        SetupNetworking();
    


    maflcko commented at 7:06 pm on November 28, 2023:

    nit: Missing

    0        Assert(SetupNetworking());
    

    ?


    hebasto commented at 7:13 pm on November 28, 2023:
    Thanks! Added.
  32. test: Setup networking globally fd4c6a10f2
  33. hebasto force-pushed on Nov 28, 2023
  34. maflcko commented at 8:35 pm on November 28, 2023: member
    lgtm ACK b22c3ac50baf33a7a684266b0c7ec3d1cd5f3177, didn’t review the GHA diff
  35. hebasto commented at 9:26 pm on November 28, 2023: member

    … didn’t review the GHA diff

    Just in case, here is a branch with the reverted first commit. Here are its CI results:

  36. fanquake commented at 10:22 am on November 29, 2023: member
    I don’t understand how this solves the fact that the Boost ASIO code is still in bitcoind, running before main, and calling WSAStartup, with a version (2.0) that is different from what we want (2.2). There should not be random Boost code, initializing our networking stack, at all, let alone before we’ve even run any of our sanity checks/startup code.
  37. hebasto commented at 10:30 am on November 29, 2023: member

    I don’t understand how this solves the fact that the Boost ASIO code is still in bitcoind, running before main/any of our init checks, and calling WSAStartup, with a version (2.0) that is different from what we want (2.2).

    This PR does nothing about the fact you mentioned. It resolves another bug which is explicit when bitcoind does not contain Boost ASIO code, i.e. when building without external wallet support.

    There should not be random Boost code, initializing our networking stack, at all, let alone before we’ve even run any of our sanity checks/startup code.

    I agree with you.


    In the added “Win64, VS 2022 (ext.signer=OFF)” CI job, there is no Boost ASIO code in bitcoind.exe. The benchmark fails without this PR fix, and it passes with one.

  38. fanquake commented at 10:32 am on November 29, 2023: member

    I agree with you.

    So then why isn’t this PR disabling external signer for Windows?

  39. maflcko commented at 10:33 am on November 29, 2023: member

    So then why isn’t this PR disabling external signer for Windows?

    I think this is a test-only bugfix change. No one is holding anyone back from disabling external signer on Windows.

  40. fanquake commented at 10:34 am on November 29, 2023: member

    I think this is a test-only bugfix change. No one is holding anyone back from disabling external signer on Windows.

    Given what we know, I don’t see why we wouldn’t just disable it now. If it’s going to be disabled, there’s also no point in adding the additional CI checks in this PR.

  41. hebasto commented at 10:35 am on November 29, 2023: member

    I agree with you.

    So then why isn’t this PR disabling external signer for Windows?

    Because this PR is focused on fixing a bug in our code, not on changing user-available functionality. The question you raised should be discussed in a dedicated issue/PR.

  42. hebasto commented at 10:40 am on November 29, 2023: member

    I think this is a test-only bugfix change. No one is holding anyone back from disabling external signer on Windows.

    Given what we know, I don’t see why we wouldn’t just disable it now. If it’s going to be disabled, there’s also no point in adding the additional CI checks in this PR.

    The additional CI check proves that everything is working with external signer disabled.

  43. maflcko commented at 10:42 am on November 29, 2023: member

    there’s also no point in adding the additional CI checks in this PR.

    It the second commit is controversial, I guess it can be dropped. However, my understanding is that the first commit is correct and will be needed in any case.

  44. hebasto force-pushed on Nov 29, 2023
  45. hebasto commented at 10:47 am on November 29, 2023: member

    It the second commit is controversial, I guess it can be dropped. However, my understanding is that the first commit is correct and will be needed in any case.

    The second commit has been dropped.

  46. maflcko commented at 10:54 am on November 29, 2023: member
    lgtm ACK fd4c6a10f2285f16c5d0215eb56a3060441f3ef2
  47. in src/test/util/setup_common.cpp:92 in fd4c6a10f2
    88@@ -88,6 +89,15 @@ std::ostream& operator<<(std::ostream& os, const uint256& num)
    89     return os;
    90 }
    91 
    92+struct NetworkSetup
    


    fanquake commented at 11:51 am on November 29, 2023:
    This could probably use at least one line of doc explaining it’s existence? Given it’s only here to facilitate Windows sockets initialization, and doesn’t do anything for non-windows platforms.

    hebasto commented at 4:31 pm on November 29, 2023:

    This could probably use at least one line of doc explaining it’s existence?

    Mind suggesting one?

    Given it’s only here to facilitate Windows sockets initialization, and doesn’t do anything for non-windows platforms.

    With the current design of the SetupNetworking(), the code difference between platforms looks rather like implementation details.


    fanquake commented at 4:35 pm on November 29, 2023:

    the code difference between platforms looks rather like implementation details.

    The only code in SetupNetworking is Windows code to call WSAStartup()?


    hebasto commented at 4:41 pm on November 29, 2023:
    That’s correct.
  48. in src/test/util_tests.cpp:1223 in fd4c6a10f2
    1217@@ -1218,6 +1218,9 @@ BOOST_AUTO_TEST_CASE(test_LockDirectory)
    1218     // has released the lock as we would expect by probing it.
    1219     int processstatus;
    1220     BOOST_CHECK_EQUAL(write(fd[1], &LockCommand, 1), 1);
    1221+    // The following line invokes the ~CNetCleanup dtor without
    1222+    // a paired SetupNetworking call. This is acceptable as long as
    1223+    // ~CNetCleanup is a no-op for non-Windows platforms.
    


    fanquake commented at 11:51 am on November 29, 2023:
    Is this now only because of the change above? It’s not obvious why calling write() would invoke a NetCleanup destructor, or why we end up in a state where we are trying to tear down networking, if it’s never even been setup. In general it seems fragile to rely on ifdefs and code being a no-op to avoid additional issues.

    hebasto commented at 4:39 pm on November 29, 2023:

    Is this now only because of the change above?

    This PR diff does not change behaviour of the commented LOC. However, it seems worth mentioning that currently the SetupNetworking and ~CNetCleanup calls might be unbalanced. Perhaps, we might consider moving network initialization after daemonizing? But on platforms that support daemonizing, both calls are no-op…

    It’s not obvious why calling write() would invoke a NetCleanup destructor…

    Actually, it terminates the process with all consequences, including ~CNetCleanup call.


    fanquake commented at 4:43 pm on November 29, 2023:

    However, it seems worth mentioning that currently the SetupNetworking and ~CNetCleanup calls might be unbalanced.

    Maybe, but SetupNetworking and CNetCleanup only do anything on Windows, and the code here only runs for non-windows, where “unbalanced” calls to either doesn’t matter, because there is no code being executed.


    hebasto commented at 4:46 pm on November 29, 2023:

    SetupNetworking and CNetCleanup only do anything on Windows, and the code here only runs for non-windows, where “unbalanced” calls to either doesn’t matter, because there is no code being executed.

    That is similar to what the comment says, right?


    fanquake commented at 4:59 pm on November 29, 2023:
    I think my point is that we shouldn’t need a comment like this, or it’s not clear what value it provides. I would think it makes more sense to try and fix things so that we just don’t have situations where we try and call network teardown code if we haven’t setup any networking, rather than adding comments about Windows for code that only runs on Linux.
  49. fanquake commented at 11:56 am on November 29, 2023: member

    Left two non-blocking comments. I think we could improve SetupNetworking() and CNetCleanup further, and constrain the code to being Windows only, rather than having it exist for all platforms, but be a no-op on non-windows.

    Will merge this shortly, and open a new PR to disable external signer for Windows, if nobody else does.

  50. fanquake referenced this in commit 798b2bd492 on Nov 29, 2023
  51. fanquake referenced this in commit 3d415322b2 on Nov 29, 2023
  52. fanquake referenced this in commit 02895a89f6 on Nov 29, 2023
  53. fanquake merged this on Nov 29, 2023
  54. fanquake closed this on Nov 29, 2023

  55. hebasto deleted the branch on Nov 29, 2023
  56. fanquake referenced this in commit 9ff589c60c on Nov 29, 2023

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-16 21:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me