net: call getaddrinfo() in detachable thread to prevent stalling #27557

pull pinheadmz wants to merge 1 commits into bitcoin:master from pinheadmz:async-getaddrinfo changing 3 files +94 −7
  1. pinheadmz commented at 6:28 PM on May 2, 2023: member

    Closes #16778 Replaces #27505

    The library call getaddrinfo() has no internal timeout and depending on the user's system may wait indefinitely for a response when looking up a hostname in the DNS. By making that call in a separate thread, we can abandon it completely after some timeout (currently in this PR, 2 seconds).

    TODO:

    • We could make the polling loop interruptible but I'm not sure the best approach to that, since these functions don't live in a class with a flag like interruptNet
  2. DrahtBot commented at 6:28 PM on May 2, 2023: contributor

    <!--e57a25ab6845829454e8d69fc972939a-->

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    <!--021abf342d371248e50ceaed478a90ca-->

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    <!--174a7506f384e20aa4161008e828411d-->

    Conflicts

    No conflicts as of last run.

  3. DrahtBot added the label P2P on May 2, 2023
  4. DrahtBot added the label CI failed on May 2, 2023
  5. fanquake commented at 9:03 AM on May 3, 2023: member

    The unit test I added makes a live DNS request for nic.com that test will fail the platform has no DNS

    Or just anytime you don't have an internet connection? We're not going to add "internet connectivity" as a requirement to run the unit tests (putting aside all other potential issues with doing this).

  6. in src/test/netbase_tests.cpp:166 in 49b909cce7 outdated
     157 | @@ -158,6 +158,19 @@ BOOST_AUTO_TEST_CASE(embedded_test)
     158 |      BOOST_CHECK_EQUAL(addr1.ToStringAddr(), addr2.ToStringAddr());
     159 |  }
     160 |  
     161 | +BOOST_AUTO_TEST_CASE(dns_lookup)
     162 | +{
     163 | +    CNetAddr addr;
     164 | +    bool ret = LookupHost("nic.com", addr, /*fAllowLookup=*/true);
     165 | +    BOOST_CHECK(ret);
     166 | +    BOOST_CHECK(addr.IsRoutable());
    


    dergoegge commented at 10:52 AM on May 3, 2023:

    Making actual requests from the unit tests is not great.

    If you want to add a unit test, you could introduce a way to mock getaddrinfo or AsyncGetAddrInfo so that you can better test our code in all the relevant scenarios (i.e. getaddrinfo errors/blocking etc.).

  7. in src/netbase.cpp:117 in 49b909cce7 outdated
     115 | +            }
     116 | +            // Check every 100ms for 2 seconds
     117 | +            if (++checks > 20) break;
     118 | +            std::this_thread::sleep_for(std::chrono::milliseconds(100));
     119 | +        }
     120 | +        thread.detach();
    


    dergoegge commented at 10:55 AM on May 3, 2023:

    furszy commented at 1:30 PM on May 3, 2023:

    Performance wise, create a new thread per lookup is quite bad. Should use a workers pool instead.

    I have implemented a generic one for #26966 (https://github.com/bitcoin/bitcoin/pull/26966/commits/f448668f473675915fcebf0e1bd3d73dcb1881af) that could also use here. It also support callback futures so you can remove the costly active wait as well.

    Give it a look and if you like it, could decouple it into a standalone PR so we could start using it in both PRs.


    dergoegge commented at 9:40 AM on May 4, 2023:

    Afaict this won't work here, because the thread might not finish and then the pool would just get clogged (see comment about abandoning the thread).

    Performance should also not really matter here since this doesn't happen that often and startup time for a thread vs. time of a DNS lookup is probably negligible.


    furszy commented at 12:57 PM on May 5, 2023:

    Afaict this won't work here, because the thread might not finish and then the pool would just get clogged (see comment about abandoning the thread).

    k, getaddrinfo doesn't have a timeout and getaddrinfo_a seems to have a segfault. Cool..

    Performance should also not really matter here since this doesn't happen that often and startup time for a thread vs. time of a DNS lookup is probably negligible.

    I was thinking more about context switching. Isn't this used for OpenNetworkConnection too? So, it would create a new thread for every connection attempt? (unless the lookups are numeric, which guess that is the rule here..).

    And, following the worst case scenario, what if the user call addnode a good number of times? we would get a lot of detached threads? (unless we set a limit for the amount of lookups that can be resolved in parallel).


    dergoegge commented at 1:10 PM on May 5, 2023:

    k, getaddrinfo doesn't have a timeout and getaddrinfo_a seems to have a segfault. Cool..

    Yea one would think that making DNS requests in c++ would be a solved problem.

    Anyway, I'm also not sure about the approach here. The original issue was that we would block indefinitely while joining the DNS thread on shutdown (only happens if the user's system is broken I think?). I suggested to detach the dns thread if it doesn't join in a reasonable amount of time (would be fine, the OS will clean the thread up since we are about to exit anyway).

    The approach here risks having threads hang around for the lifetime of a node, which seems worse than the occasional shutdown hang on broken systems.

  8. pinheadmz force-pushed on May 3, 2023
  9. pinheadmz force-pushed on May 3, 2023
  10. pinheadmz force-pushed on May 3, 2023
  11. pinheadmz force-pushed on May 3, 2023
  12. pinheadmz force-pushed on May 3, 2023
  13. pinheadmz force-pushed on May 3, 2023
  14. pinheadmz commented at 7:15 PM on May 3, 2023: member

    Still waiting for CI to finish but I removed the need for internet access by looking up the name "localhost" instead. That is a non-numeric lookup that getaddrinfo() can do without the actual DNS. I think I also fixed the ipv6 test issue:

    (run in docker with no ipv6)

    test/netbase_tests.cpp(161): Entering test case "dns_lookup"
    test/netbase_tests.cpp(165): info: check LookupHost("localhost", addr, true) has passed
    test/netbase_tests.cpp(169): info: check Lookup("127.0.0.1:12345", serv, 9050, true) has passed
    test/netbase_tests.cpp(176): warning: in "netbase_tests/dns_lookup": Skipping IPv6 check
    test/netbase_tests.cpp(161): Leaving test case "dns_lookup"; testing time: 217012us
    

    I'm interested in following the worker pool idea to make these requests async as opposed to new threads.

  15. DrahtBot removed the label CI failed on May 3, 2023
  16. pinheadmz force-pushed on May 5, 2023
  17. pinheadmz commented at 5:16 PM on May 5, 2023: member

    added call to thread.join() in the "happy path" so the only threads we should be abandoning are those that time out. Also setup the unit test to stub in a black hole for getaddressinfo to ensure the 2-second timeout is enforced. On my machine at least with thread and memory sanitizers I'm not getting any errors for the abandoned thread. I'm going to try some more experiments with a mainnet node as well.

  18. pinheadmz force-pushed on May 5, 2023
  19. pinheadmz commented at 6:37 PM on May 5, 2023: member

    ...got thread sanitizer errors running locally on main when my local DNS is black-holed. Fixed that with a shared pointer.

  20. DrahtBot added the label CI failed on May 5, 2023
  21. DrahtBot removed the label CI failed on May 7, 2023
  22. in src/netbase.cpp:43 in c292288a05 outdated
      39 | @@ -40,7 +40,9 @@ bool fNameLookup = DEFAULT_NAME_LOOKUP;
      40 |  int g_socks5_recv_timeout = 20 * 1000;
      41 |  static std::atomic<bool> interruptSocks5Recv(false);
      42 |  
      43 | -std::vector<CNetAddr> WrappedGetAddrInfo(const std::string& name, bool allow_lookup)
      44 | +static GlobalMutex gai_mutex;
    


    pinheadmz commented at 5:31 PM on May 11, 2023:

    TIL I can probably replace this with std::atomic_bool

  23. pinheadmz force-pushed on May 12, 2023
  24. DrahtBot added the label CI failed on Jun 4, 2023
  25. pinheadmz force-pushed on Jun 5, 2023
  26. pinheadmz commented at 6:29 PM on June 5, 2023: member

    Needs rebase?

    ty. 🤞

  27. pinheadmz force-pushed on Jun 5, 2023
  28. DrahtBot removed the label CI failed on Jun 5, 2023
  29. in src/netbase.cpp:88 in 2910e5903c outdated
      97 | +    std::shared_ptr<GAIRequest> req = std::make_shared<struct GAIRequest>(allow_lookup, name, &resolved_addresses);
      98 | +
      99 | +    if (allow_lookup) {
     100 | +        // Execute the DNS lookup in another thread and check it periodically for completion.
     101 | +        // After sufficiently waiting, abandon the thread entirely. The user's system
     102 | +        // may wait forever for a DNS response that never returns.
    


    luke-jr commented at 10:08 PM on July 1, 2023:

    Probably not good to ignore it never returning... it's leaking resources. Maybe kill it and/or tell the user somehow? :/


    pinheadmz commented at 7:10 PM on July 5, 2023:

    I added a log message but not sure what else to do. The problem is that getaddrinfo() is blocking and getaddrinfo_a() segfaults 😭 As long as the DNS request resolves eventually the thread will close, even if bitcoind has given up and detached the thread. If getaddrinfo() hangs forever then yes you're right that is a leak, but the user also has a serious networking issue.

  30. in src/test/netbase_tests.cpp:163 in 2910e5903c outdated
     155 | @@ -156,6 +156,42 @@ BOOST_AUTO_TEST_CASE(embedded_test)
     156 |      BOOST_CHECK_EQUAL(addr1.ToStringAddr(), addr2.ToStringAddr());
     157 |  }
     158 |  
     159 | +BOOST_AUTO_TEST_CASE(dns_lookup)
     160 | +{
     161 | +    // Non-numeric lookup that does not require actual DNS service
     162 | +    BOOST_CHECK(LookupHost("localhost", /*fAllowLookup=*/true).has_value());
     163 | +    BOOST_CHECK(!LookupHost("fail.doesnotexist", /*fAllowLookup=*/true).has_value());
    


    luke-jr commented at 10:17 PM on July 1, 2023:

    Won't this still make an externally-visible DNS request?


    pinheadmz commented at 6:52 PM on July 5, 2023:

    Yep you are absoutley right, I'll remove fail.doesnotexist (localhost will NOT make an external request)

  31. net: call getaddrinfo() in detachable thread to prevent stalling be52064515
  32. pinheadmz force-pushed on Jul 5, 2023
  33. achow101 requested review from theuni on Sep 20, 2023
  34. achow101 requested review from fanquake on Sep 20, 2023
  35. DrahtBot added the label Needs rebase on Oct 19, 2023
  36. DrahtBot removed the label Needs rebase on Nov 7, 2023
  37. DrahtBot added the label CI failed on Jan 13, 2024
  38. achow101 closed this on Apr 9, 2024

  39. bitcoin locked this on Apr 9, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-15 15:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me