net, pcp: handle multi-part responses and filter for default route while querying default gateway #32159

pull willcl-ark wants to merge 3 commits into bitcoin:master from willcl-ark:pcp-default-multipart changing 1 files +62 −31
  1. willcl-ark commented at 2:11 pm on March 28, 2025: member

    …for default route in pcp pinholing.

    Currently we only make a single recv call, which trucates results from large routing tables, or in the case the kernel may split the message into multiple responses (which may happen with NLM_F_DUMP).

    We also do not filter on the default route. For IPv6, this led to selecting the first route with an RTA_GATEWAY attribute, often a non-default route instead of the actual default. This caused PCP port mapping failures because the wrong gateway was used.

    Fix both issues by adding multi-part handling of responses and filter for the default route.

    Limit responses to ~ 1MB to prevent any router-based DoS.

  2. DrahtBot commented at 2:11 pm on March 28, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32159.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK achow101, davidgumberg, Sjors
    Concept ACK laanwj

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  3. willcl-ark commented at 2:13 pm on March 28, 2025: member

    cc @laanwj

    This patch maintains the FreeBSD-style querying/filtering we have currently, but but increases the size of the response processed to a maximum of 1MB.

  4. laanwj added the label P2P on Mar 28, 2025
  5. laanwj renamed this:
    net, pcp: handle multi-part responses and filter...
    net, pcp: handle multi-part responses and filter for default route while querying default gateway
    on Mar 28, 2025
  6. laanwj commented at 2:55 pm on March 28, 2025: member

    Concept ACK

    Adding 29.0 milestone, doesn’t need to block the release but it would be nice to backport it.

    Edit: removed, the milestone, i still think it’d be nice to have in next 29.x release, but it turns out more complicated and riskier than expected, better to have no immediate time pressure

  7. laanwj added this to the milestone 29.0 on Mar 28, 2025
  8. laanwj requested review from vasild on Mar 28, 2025
  9. laanwj requested review from Sjors on Mar 28, 2025
  10. in src/common/netif.cpp:113 in c5211f3423 outdated
    136-            } else if (family == AF_INET6 && sizeof(in6_addr) == RTA_PAYLOAD(rta_gateway)) {
    137-                in6_addr gw;
    138-                std::memcpy(&gw, RTA_DATA(rta_gateway), sizeof(gw));
    139-                return CNetAddr(gw, scope_id);
    140+        for (nlmsghdr* hdr = (nlmsghdr*)response; NLMSG_OK(hdr, recv_result); hdr = NLMSG_NEXT(hdr, recv_result)) {
    141+            if (hdr->nlmsg_type == NLMSG_DONE) {
    


    laanwj commented at 3:23 pm on March 28, 2025:

    Is it guaranteed that the reponse to NLM_F_DUMP is always multipart? Or do we need to check nlmsg_flags for NLM_F_MULTI, and if not, break after the first packet?

    This is not clear to me from the documentation:


    willcl-ark commented at 9:31 am on March 29, 2025:

    I don’t think it is, but like you am unsure about what the guarantees of the protocol are here.

    I kind of reverse engineered this looking at miniupnpc and netlink sources, along with an strace of ip route show, which is where the repeated recv calls jumped out to me as a difference between our code and other tools.

    This approach simply relies on receiving an NLMSG_DONE to signal the end of the response and break. This should handle both single and multi-part messages. Here’s a snipped sample of strace -e filter=recvmsg ip route show on my system:

     0recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 2932
     1recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1444, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240135, nlmsg_pid=3248093}, <snip> 0) = 2932
     2recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 3384
     3recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1504, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240135, nlmsg_pid=3248093}, <snip> 0) = 3384
     4recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 5196
     5recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1488, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240135, nlmsg_pid=3248093}, <snip> 0) = 5196
     6recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
     7recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240135, nlmsg_pid=3248093}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
     8recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 424
     9recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=76, nlmsg_type=RTM_NEWADDR, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240136, nlmsg_pid=3248093}, <snip> 0) = 424
    10recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 600
    11recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=72, nlmsg_type=RTM_NEWADDR, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240136, nlmsg_pid=3248093}, <snip> 0) = 600
    12recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
    13recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240136, nlmsg_pid=3248093}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
    

    NLM_F_MULTI is indeed set, even in the NLMSG_DONE packet.

    One other thing I did read about, but not implement, is checking of the sequence number on each message to ensure it was meant for our request. But as we only make a single request i thought this should be OK to omit.


    laanwj commented at 2:46 pm on March 29, 2025:

    Thanks!

    The receive flow here seems to indicate:

    • Receive packet
    • Process packet
    • If NLM_F_MULTI is set and the packet is not NLMSG_DONE, repeat

    What i’m mostly worried about is that the current code will hang if NLMSG_DONE never comes, which seems to be the case for non-multi responses, which have one data packet.

    But it may be that the NETLINK_ROUTE response to RTM_GETROUTE/NLM_F_DUMP is always multi-packet. That empirically seems to be the case even for tiny routing tables.

    Looking at the ip source is a good idea. Also we need to verify this with FreeBSD.

    But as we only make a single request i thought this should be OK to omit.

    Agree, going that far in checking seems unnecessary. i don’t think we need super defensive coding as netlink is a local protocol with the kernel.


    willcl-ark commented at 8:18 am on March 31, 2025:

    Thanks, that infradead page is very handy!

    I made a few changes in 6c694c212f8d898dac9cc1c5637381452c91e79b based on your thoughts (and the protocol page):

    • set socket to non-blocking mode to avoid hanging if the kernel doesn’t send a response
    • use a vector to collect all data from multi-part responses
    • exit when recv() returns 0 (this should handle single-part messages, AFAICT)

    I think relying on receiving no more data from recv() to break the receive loop should be as (or perhaps more) robust than checking for the NLM_F_MULTI flag and exiting after first receive if it’s not set, but curious what you think here?

    If it would help, I’d be happy to break this into a few smaller commits, as I’m kinda feeling this change contains a few different changes in one in some ways now…


    laanwj commented at 9:13 am on March 31, 2025:

    I think relying on receiving no more data from recv() to break the receive loop should be as (or perhaps more) robust than checking for the NLM_F_MULTI flag and exiting after first receive if it’s not set, but curious what you think here?

    Maybe-is it safe to assume that netlink will never block?

    We don’t want to end up in the same situation as before where we miss data. But due to say, a threading race condition.

    i think the safest thing here is to mimic as closely as possible ip’s behavior, as it is the only tool these kind of interfaces tend to be written towards.


    laanwj commented at 9:16 am on March 31, 2025:

    use a vector to collect all data from multi-part responses

    i’m not sure i see the motivation here. Parsing the packets as seperate units is just as valid (“Multipart messages unlike fragmented ip packets must not be reassmbled”), avoids dynamic allocation, and is simpler.


    Sjors commented at 12:34 pm on March 31, 2025:

    checking of the sequence number on each message to ensure it was meant for our request

    This seems like a good idea to at least do in debug builds.

    It seems like a good precaution to check for the presence of NLM_F_MULTI and don’t wait for NLMSG_DONE if it isn’t. At least from my naive reading of https://man7.org/linux/man-pages/man7/netlink.7.html it seems NLMSG_DONE is only used for multipart messages.

    Splitting into multiple commits would be useful, e.g. one commit that switches to non-blocking mode.


    davidgumberg commented at 8:11 pm on August 28, 2025:

    Is it guaranteed that the reponse to NLM_F_DUMP is always multipart? Or do we need to check nlmsg_flags for NLM_F_MULTI, and if not, break after the first packet?

    It seems that in the linux kernel, that is the case. A message type’s (e.g. GETROUTE) message handling callback struct has a few callback functions, the .dumpit callback is called when you send a request with NLM_F_DUMP. Checking the .dumpit callbacks, the ones I find set NLM_F_MULTI in their responses, e.g. inet_dump_ifaddr which is a wrapper for inet_dump_addr is the is the .dumpit for RTM_GETADDR for IPv4 addresses,, and inet_dump_addr sets NLM_F_MULTI, and the same is true for the NLMSG_DONE that comes when the dump is complete:

    0static int netlink_dump_done(struct netlink_sock *nlk, struct sk_buff *skb,
    1			     struct netlink_callback *cb,
    2			     struct netlink_ext_ack *extack)
    3{
    4	nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(nlk->dump_done_errno),
    5			       NLM_F_MULTI | cb->answer_flags);
    

    https://github.com/torvalds/linux/blob/39f90c1967215375f7d87b81d14b0f3ed6b40c29/net/netlink/af_netlink.c#L2234

    Grepping for NLMSG_DONE I think that netlink_dump_done is the only place NLMSG_DONE is used that’s relevant to us, all of the other instances are in other netlink subsystems, so we only get NLMSG_DONE in reply to NLM_F_DUMP, but I think this is the case even when we don’t get a multipart message, since netlink_dump_done is always invoked when sending a request with NLM_F_DUMP I believe.

  11. willcl-ark force-pushed on Mar 31, 2025
  12. in src/common/netif.cpp:96 in 6c694c212f outdated
    113+    while (!done) {
    114+        char temp[4096];
    115+        int64_t recv_result;
    116+        do {
    117+            recv_result = sock->Recv(temp, sizeof(temp), 0);
    118+        } while (recv_result < 0 && (errno == EINTR || errno == EAGAIN));
    


    Sjors commented at 12:14 pm on March 31, 2025:
    Since you’re touching this line… According to the internet, we should also check EWOULDBLOCK even though it’s usually the same as EAGAIN, and it’s likely not relevant for any system we support. https://stackoverflow.com/a/49421517

  13. in src/common/netif.cpp:107 in 6c694c212f outdated
    112+
    113+    while (!done) {
    114+        char temp[4096];
    115+        int64_t recv_result;
    116+        do {
    117+            recv_result = sock->Recv(temp, sizeof(temp), 0);
    


    Sjors commented at 12:25 pm on March 31, 2025:
    I know this is existing code, but I don’t recall why there’s no timeout here. And also, should there be a quick wait between Recv calls?
  14. Sjors commented at 12:38 pm on March 31, 2025: member

    Although the code this PR touches isn’t compiled on macOS, I did briefly check that things still work there. I also briefly tested on Ubuntu 24.10.

    Left some inline question to wrap my head around the changes and refresh my memory of the original…

  15. laanwj added the label Needs backport (29.x) on Apr 1, 2025
  16. laanwj removed this from the milestone 29.0 on Apr 1, 2025
  17. willcl-ark commented at 9:26 am on April 1, 2025: member

    Thanks for the review @Sjors & @laanwj.

    Going to move to draft while I re-work it a little.

  18. willcl-ark marked this as a draft on Apr 1, 2025
  19. willcl-ark force-pushed on Apr 1, 2025
  20. net: filter for default routes in netlink responses
    Filter netlink responses to only consider default routes by checking the
    destination prefix length (rtm_dst_len == 0).
    
    Previously, we selected the first route with an RTA_GATEWAY attribute,
    which for IPv6 often resulted in choosing a non-default route instead of
    the actual default.
    
    This caused occasional PCP port mapping failures because a gateway for a
    non-default route was selected.
    57ce645f05
  21. net: skip non-route netlink responses
    This shouldn't usually be hit, but is a good belt-and-braces.
    42e99ad773
  22. willcl-ark force-pushed on Apr 2, 2025
  23. willcl-ark force-pushed on Apr 2, 2025
  24. DrahtBot added the label CI failed on Apr 2, 2025
  25. DrahtBot commented at 11:05 am on April 2, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/39838257320

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  26. DrahtBot removed the label CI failed on Apr 2, 2025
  27. willcl-ark commented at 1:56 pm on April 2, 2025: member

    After (additional) further investigation I gained some new insights:

    • This is not a regression from migration from libnatpmp as I first thought.

    • The extent to which tailscale interferes with the routing is more significant than I realised: Even if we find the correct gateway (by handling multi-part messages and filtering properly), the route to the gateway still doesn’t return the data we want, specifically the correct interface. This causes us to make an invalid request, which gets rejected.

      This is also observed using the ip command:

      0$ ip route show
      1default via 10.0.0.1 dev wlo1 proto dhcp src 10.0.12.100 metric 600
      210.0.0.0/20 dev wlo1 proto kernel scope link src 10.0.12.100 metric 600
      3172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
      4192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown
      5
      6will@ubuntu in …/src/core/bitcoin on  pcp-default-multipart [$?] : C v19.1.7-clang via △ v3.31.6 : 🐍 (core)
      7$ ip route get 10.0.0.1
      810.0.0.1 dev tailscale0 table 52 src 100.89.20.73 uid 1000
      9    cache
      

      Handling this would be quite invasive, require tracking the interface, and I don’t think it’s in scope for us.

    Therefore I have re-worked and split the remaining changes and into 3 commits:

    1. Filter for the default route in the response. With this change ipv6 pinholing works even with tailscale up, which is quite nice.
    2. Skip non NEWROUTE messages. Although NLMSG_DONE is already handled, this is a slight optimisation in protecting against unexpected responses being parsed.
    3. Handle single and multi-part messages. Track if NLM_F_MULTI was set and if so wait for NLMSG_DONE, otherwise break after the first response, assuming a single-part response.

    I think these are all worthwhile, but I could seean argument that what we currently have works “well enough” for the most basic use-case; a simple (single) routing table. And we could just consider anything else out-of-scope in this project. @Sjors: Ref your comments, the socket is set as nonblocking already via the implementation in CreateSockOS in https://github.com/bitcoin/bitcoin/blob/74d9598bfbc24c3b7bfe1dad5bf9d988381bf893/src/netbase.cpp#L536-L540 I did try a commit to add a backoff timer and retry mechanism to this for extra safety, but pretty sure it’s not worth the added complexity.

  28. willcl-ark marked this as ready for review on Apr 2, 2025
  29. DrahtBot added the label CI failed on Apr 29, 2025
  30. DrahtBot removed the label CI failed on May 2, 2025
  31. fanquake commented at 3:39 pm on May 8, 2025: member
    @laanwj you might be interested in circling back here to review?
  32. laanwj requested review from laanwj on May 9, 2025
  33. laanwj removed the label Needs backport (29.x) on May 13, 2025
  34. laanwj commented at 2:19 pm on May 13, 2025: member

    you might be interested in circling back here to review?

    Yes, will look into it.

    This is not a regression from migration from libnatpmp as I first thought.

    Removed the backport label as this was confirmed not to be a regression (thanks!).

  35. fanquake added this to the milestone 30.0 on Jul 17, 2025
  36. fanquake commented at 2:51 pm on July 17, 2025: member
    Put this on the v30.0 milestone.
  37. achow101 commented at 9:08 pm on August 18, 2025: member

    ACK 4c531782569b42fc47478a468f4a79e0e2d93946

    Verified against iproute2 and strace

  38. in src/common/netif.cpp:112 in 4c53178256 outdated
    138-                std::memcpy(&gw, RTA_DATA(rta_gateway), sizeof(gw));
    139-                return CNetAddr(gw, scope_id);
    140+        bool processed_one{false};
    141+        for (nlmsghdr* hdr = (nlmsghdr*)response; NLMSG_OK(hdr, recv_result); hdr = NLMSG_NEXT(hdr, recv_result)) {
    142+            rtmsg* r = (rtmsg*)NLMSG_DATA(hdr);
    143+            int remaining_len = RTM_PAYLOAD(hdr);
    


    achow101 commented at 9:52 pm on August 26, 2025:
    These could be moved after if (hdr->nlmsg_type == NLMSG_DONE) { to avoid the integer sanitizer warning in #33245
  39. in src/common/netif.cpp:103 in 57ce645f05 outdated
     96@@ -97,6 +97,11 @@ std::optional<CNetAddr> QueryDefaultGatewayImpl(sa_family_t family)
     97         rtmsg* r = (rtmsg*)NLMSG_DATA(hdr);
     98         int remaining_len = RTM_PAYLOAD(hdr);
     99 
    100+        // Only consider default routes (destination prefix length of 0).
    101+        if (r->rtm_dst_len != 0) {
    102+            continue;
    103+        }
    


    davidgumberg commented at 5:40 pm on August 28, 2025:

    In https://github.com/bitcoin/bitcoin/pull/32159/commits/57ce645f05d18d8ad10711c347a5989076f1f788 (net: filter for default routes in netlink responses)

    Just info for other reviewers, I found this useful since I’m generally unfamiliar with how routing and gateways work:

    As I understand, the destination prefix is a mask indicating what routes the gateway can be used for, a destination prefix of 0 (alternatively 0.0.0.0 or ::0) indicates that the gateway can be used for all destinations, and RFC 1812 offers this right above the beginning of section 5.2.4.4:

    (5) Default Route: This is a route to all networks for which there are no explicit routes. It is by definition the route whose prefix length is zero.


    davidgumberg commented at 8:49 pm on August 28, 2025:

    Relevant (editorialized) code from iproute2:

    0	if (tb[RTA_DST]) {
    1        // [...if the route has a specific destination, e.g.] ip route add 192.168.0.5 via  192.168.0.4
    2	} else if (r->rtm_dst_len) {
    3        // [if the route has prefix has a length]
    4		snprintf(b1, sizeof(b1), "0/%d ", r->rtm_dst_len);
    5	} else {
    6        // [if rtm_dst_len == 0]
    7		strncpy(b1, "default", sizeof(b1));
    8	}
    

    https://github.com/iproute2/iproute2/blob/3337e5013e168bcf4fab5f6518d1e4293a0a830b/ip/iproute.c#L833-L851

  40. in src/common/netif.cpp:102 in 42e99ad773 outdated
     96@@ -97,6 +97,10 @@ std::optional<CNetAddr> QueryDefaultGatewayImpl(sa_family_t family)
     97         rtmsg* r = (rtmsg*)NLMSG_DATA(hdr);
     98         int remaining_len = RTM_PAYLOAD(hdr);
     99 
    100+        if (hdr->nlmsg_type != RTM_NEWROUTE) {
    101+            continue; // Skip non-route messages
    102+        }
    


    davidgumberg commented at 6:20 pm on August 28, 2025:

    In https://github.com/bitcoin/bitcoin/pull/32159/commits/42e99ad77396e4e9b02d67daf46349e215e99a0f (net: skip non-route netlink responses)

    The man 7 rtnetlink page (https://man7.org/linux/man-pages/man7/rtnetlink.7.html) does not really offer much insight here, but as far as I can tell, the convention comes from the fact that the rtnetlink interface is also used for monitoring changes to the kernel’s routing table, and because the userspace application is trying to maintain an idea of the kernel routing table’s state, the interfaces are symmetrical, a userspace application sends an RTM_NEWROUTE when it wants to insert a new route, and likewise whenever the kernel inserts a new route, it sends out an RTM_NEWROUTE message over open NETLINK_ROUTE sockets, so the reply to an RTM_GETROUTE follows this convention and replies with RTM_NEWROUTE, and I imagine the same is true for GETADDR and GETLINK

    Looking through the linux kernel’s implementation for evidence of this:

    e.g. inet_rtm_getroute is the kernel’s callback that handles RTM_GETROUTE messages for IPv4 routes, and here is a simplified + annotated view of the relevant section of that function:

     0static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
     1			     struct netlink_ext_ack *extack)
     2{
     3	// [....]
     4
     5	// [ This is never true for us. ]
     6	if (rtm->rtm_flags & RTM_F_FIB_MATCH) {
     7		// [ fib_dump_info populates sk_buff* skb, which holds packet data,
     8		//   and places the `event` arg in the nlmsg hdr->nlmsg_type
     9		//   https://github.com/torvalds/linux/blob/39f90c1967215375f7d87b81d14b0f3ed6b40c29/net/ipv4/fib_semantics.c#L1732
    10		// ]
    11		fib_dump_info(skb, NETLINK_CB(in_skb).portid,
    12				    nlh->nlmsg_seq,/*event=*/RTM_NEWROUTE, &fri, 0);
    13	} else {
    14		// [ This case is relevant for us, rt_fill_info always creates an RTM_NEWROUTE message.]
    15		//   https://github.com/torvalds/linux/blob/39f90c1967215375f7d87b81d14b0f3ed6b40c29/net/ipv4/route.c#L2951
    16		rt_fill_info(net, dst, src, rt, table_id, res.dscp, &fl4,
    17				   skb, NETLINK_CB(in_skb).portid,
    18				   nlh->nlmsg_seq, 0);
    19	}
    20
    21	// [ Broadcast the `skb` on the socket. ]
    22	rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid);
    23}
    24
    25// [ https://github.com/torvalds/linux/blob/39f90c1967215375f7d87b81d14b0f3ed6b40c29/net/ipv4/route.c#L2939-L2951 ]
    26static int rt_fill_info(
    27    // [...]
    28)
    29{
    30	nlh = nlmsg_put(skb, portid, seq, RTM_NEWROUTE, sizeof(*r), flags);
    31}
    

    davidgumberg commented at 8:34 pm on August 28, 2025:

    Relevant section of iproute2, if a response to getroute is not of type RTM_NEWROUTE, it’s an error:

    0static int iproute_get(int argc, char **argv)
    1        // [... `Populate answer` ]
    2		struct rtmsg *r = NLMSG_DATA(answer);
    3		if (answer->nlmsg_type != RTM_NEWROUTE) {
    4			fprintf(stderr, "Not a route?\n");
    5			delete_json_obj();
    6			free(answer);
    7			return -1;
    8		}
    9}
    

    https://github.com/iproute2/iproute2/blob/3337e5013e168bcf4fab5f6518d1e4293a0a830b/ip/iproute.c#L2242-L2247

  41. in src/common/netif.cpp:136 in 4c53178256 outdated
    171+            rtattr* rta_gateway = nullptr;
    172+            int scope_id = 0;
    173+            for (rtattr* attr = RTM_RTA(r); RTA_OK(attr, remaining_len); attr = RTA_NEXT(attr, remaining_len)) {
    174+                if (attr->rta_type == RTA_GATEWAY) {
    175+                    rta_gateway = attr;
    176+                } else if (attr->rta_type == RTA_OIF && sizeof(int) == RTA_PAYLOAD(attr)) {
    


    davidgumberg commented at 9:58 pm on August 28, 2025:

    For context on this, IPv6 allows “scoped addresses”, e.g. I can send a packet to a non-globally unique address, like a link-local address that has two destinations on two different links, so in order to disambiguate scoped addresses RFC4007#section-6 recommends the use of unique zone indexes:

    Because the same non-global address may be in use in more than one zone of the same scope (e.g., the use of link-local address fe80::1 in two separate physical links) and a node may have interfaces attached to different zones of the same scope (e.g., a router normally has multiple interfaces attached to different links), a node requires an internal means to identify to which zone a non-global address belongs. This is accomplished by assigning, within the node, a distinct “zone index” to each zone of the same scope to which that node is attached, and by allowing all internal uses of an address to be qualified by a zone index.

    And RFC4007#section-7 suggests that specifying an output interface index as the zone index is sufficient, but more specific than necessary to disambiguate:

    When an upper-layer protocol sends a packet to a non-global destination address, it must have a means of identifying the intended zone to the IPv6 layer for cases in which the node is attached to more than one zone of the destination address’s scope. Although identification of an outgoing interface is sufficient to identify an intended zone (because each interface is attached to no more than one zone of each scope), in many cases that is more specific than desired.

    The name used in the rtnetlink interface (RTA_OIF) suggest that the unique scope ID / zone ID used is the output interface index (OIF), but I guess that’s an implementation detail, the point is, it’s a unique index that makes an IPv6 address globally unique from the POV of a given device.

  42. in src/common/netif.cpp:111 in 4c53178256 outdated
    138+            int remaining_len = RTM_PAYLOAD(hdr);
    139+
    140+            processed_one = true;
    141+            if (hdr->nlmsg_flags & NLM_F_MULTI) {
    142+                multi_part = true;
    143             }
    


    davidgumberg commented at 10:09 pm on August 28, 2025:

    In https://github.com/bitcoin/bitcoin/pull/32159/commits/4c531782569b42fc47478a468f4a79e0e2d93946 (net: handle multi-part netlink responses)

    nit:

    processed_one, multi_part and the check at the bottom of the outer for loop could be dropped, with the following check:

    0            if (!(hdr->nlmsg_flags & NLM_F_MULTI)) {
    1                done = true;
    2            }
    

    Sjors commented at 12:32 pm on September 1, 2025:

    A slightly different approach, which avoids confusing NLMSG_DONE with done:

     0diff --git a/src/common/netif.cpp b/src/common/netif.cpp
     1index 77246980c3..3f3b8fdcea 100644
     2--- a/src/common/netif.cpp
     3+++ b/src/common/netif.cpp
     4@@ -117,9 +117,10 @@ std::optional<CNetAddr> QueryDefaultGatewayImpl(sa_family_t family)
     5     // Receive response.
     6     char response[4096];
     7     ssize_t total_bytes_read{0};
     8-    bool done{false};
     9-    bool multi_part{false};
    10-    while (!done) {
    11+    // If we processed at least one message and multi flag not set, or if
    12+    // we received no valid messages, then we're done.
    13+    bool maybe_more{true};
    14+    while (maybe_more) {
    15         int64_t recv_result;
    16         do {
    17             recv_result = sock->Recv(response, sizeof(response), 0);
    18@@ -135,18 +136,16 @@ std::optional<CNetAddr> QueryDefaultGatewayImpl(sa_family_t family)
    19             return std::nullopt;
    20         }
    21 
    22-        bool processed_one{false};
    23         for (nlmsghdr* hdr = (nlmsghdr*)response; NLMSG_OK(hdr, recv_result); hdr = NLMSG_NEXT(hdr, recv_result)) {
    24             rtmsg* r = (rtmsg*)NLMSG_DATA(hdr);
    25             int remaining_len = RTM_PAYLOAD(hdr);
    26 
    27-            processed_one = true;
    28-            if (hdr->nlmsg_flags & NLM_F_MULTI) {
    29-                multi_part = true;
    30-            }
    31+            if (!(hdr->nlmsg_flags & NLM_F_MULTI)) {
    32+                maybe_more = false;
    33+            };
    34 
    35             if (hdr->nlmsg_type == NLMSG_DONE) {
    36-                done = true;
    37+                maybe_more = false;
    38                 break;
    39             }
    40 
    41@@ -183,12 +182,6 @@ std::optional<CNetAddr> QueryDefaultGatewayImpl(sa_family_t family)
    42                 }
    43             }
    44         }
    45-
    46-        // If we processed at least one message and multi flag is not set,
    47-        // or if we received no valid messages, then we're done.
    48-        if ((processed_one && !multi_part) || !processed_one) {
    49-            done = true;
    50-        }
    51     }
    52 
    53     return std::nullopt;
    
  43. in src/common/netif.cpp:103 in 4c53178256 outdated
    120-                rta_gateway = attr;
    121-            } else if (attr->rta_type == RTA_OIF && sizeof(int) == RTA_PAYLOAD(attr)) {
    122-                std::memcpy(&scope_id, RTA_DATA(attr), sizeof(scope_id));
    123-            }
    124+        total_bytes_read += recv_result;
    125+        if (total_bytes_read > NETLINK_MAX_RESPONSE_SIZE) {
    


    davidgumberg commented at 10:17 pm on August 28, 2025:

    nit:

    I think we should return early on total_bytes_read == 0 as well.


    davidgumberg commented at 10:29 pm on August 28, 2025:

    Feel free to mark resolved, this suggestion is actually incorrect, since when recv_result == 0 the NLMSG_OK check below would fail:

    0#define NLMSG_OK(nlh,len) ((len) >= (int)sizeof(struct nlmsghdr) && \
    1			   (nlh)->nlmsg_len >= sizeof(struct nlmsghdr) && \
    2			   (nlh)->nlmsg_len <= (len))
    

    https://github.com/torvalds/linux/blob/39f90c1967215375f7d87b81d14b0f3ed6b40c29/include/uapi/linux/netlink.h#L107-L109

  44. davidgumberg commented at 10:23 pm on August 28, 2025: contributor

    untested crACK 4c531782569b42fc474

    https://github.com/bitcoin/bitcoin/pull/32159/commits/57ce645f05d18d8ad10711c347a5989076f1f788 corrects a bug where QueryDefaultGatewayImpl would return not default gateways.

    https://github.com/bitcoin/bitcoin/pull/32159/commits/42e99ad77396e4e9b02d67daf46349e215e99a0f adds a belt and suspenders check that we have received a sensible reply to our GETROUTE message.

    https://github.com/bitcoin/bitcoin/pull/32159/commits/4c531782569b42fc47478a468f4a79e0e2d93946 correctly handles multi-part messages, I believe the current implementation would fail e.g. any time the reply to our GETROUTE was in excess of 4096 bytes

    I left a few non-blocking nits, and comments that provide some context that I had to look for when reviewing the changes here.

  45. Sjors commented at 12:35 pm on September 1, 2025: member

    utACK 4c531782569b42fc47478a468f4a79e0e2d93946

    I didn’t specifically test multipart messages.

    I did test this (rebased) PR with a router running OPNsense 25.7.2 with miniupnpd 2.3.9 using pf backend.

    From a macOS 15.6.1 machine and macOS 13.7.8 machine I can see it opens a port and I’m (still) getting inbound connections. This is unsurprising, because the changes here don’t impact macOS.

    I also tested on Ubuntu 25.04

  46. hodlinator commented at 9:41 pm on September 1, 2025: contributor

    Tested but did not dig deep into the code for now. Spent more than a day adding debug code in Bitcoin Core and fighting with my OpenWrt router (https://github.com/openwrt/packages/issues/17871#issuecomment-3243222244). Now that the router has been fully reset without any lingering configs from prior major versions, PCP & NATPMP is actually working.

    On master (in 7/7 attempts roughly):

     0 cmake --build build -j16 --target=bitcoind && ./build/bin/bitcoind -regtest -debug=net
     1...
     2Bitcoin Core version v29.99.0-74d9598bfbc2 (release build)
     3[warning] Option '-upnp=true' is given but UPnP support was dropped in version 29.0. Substituting '-natpmp=true'.
     4...
     5[net] portmap: gateway [IPv4]: 192.168.1.1
     6[net] pcp: Requesting port mapping for addr 0.0.0.0 port 18444 from gateway 192.168.1.1
     7[net] pcp: Internal address after connect: 192.168.1.201
     8AddLocal([1111:1111::1]:18444,1)
     9Discover: 1111:1111::1
    10Bound to 127.0.0.1:18445
    11Bound to [::]:18444
    12Bound to 0.0.0.0:18444
    13Loaded 0 addresses from "anchors.dat"
    140 block-relay-only anchors will be tried for connections.
    15init message: Starting network threads
    16net thread start
    17init message: Done loading
    18opencon thread start
    19addcon thread start
    20msghand thread start
    21dnsseed thread start
    22Loading addresses from DNS seed dummySeed.invalid.
    230 addresses found from DNS seeds
    24dnsseed thread exit
    25[net] pcp: Received response of 60 bytes: REDACTED...
    26[net:info] portmap: Added mapping pcp:< REDACTED  >:18450 -> 192.168.1.201:18444 (for 2400s)
    27AddLocal(< REDACTED  >:18450,3)
    28[net] portmap: gateway [IPv6]: fe80::REDACTED...
    29[net] pcp: Requesting port mapping for addr 1111:1111::1 port 18444 from gateway fe80::REDACTED...
    30[net] pcp: Internal address after connect: 1111:1111::1
    31[net] pcp: Received response of 60 bytes: REDACTED...
    32[net:warning] pcp: Mapping failed with result ADDRESS_MISMATCH (code 12)
    

    On the router side:

    0... daemon.err miniupnpd[14402]: PCP: Invalid PCP v2 MAP message.
    

    Taking the first commit (57ce645f05d18d8ad10711c347a5989076f1f788) from this PR makes the output towards the end better (my ISP doesn’t support IPv6):

    0[net] pcp: Received response of 60 bytes: REDACTED...
    1[net:info] portmap: Added mapping pcp:< REDACTED  >:18464 -> 192.168.1.201:18444 (for 2400s)
    2AddLocal(< REDACTED  >:18464,3)
    3[net] portmap: Could not determine IPv6 default gateway
    

    And there is no longer any error in the router log.

    So PCP+IPv4 mostly works regardless of the PR, but at least the failure path for IPv6 is improved by it.

    Haven’t really noticed any changes in behavior with my setup from the latter 2 commits.

    Might focus on other things for now as this already has 3 ACKs.

  47. Sjors commented at 6:47 am on September 2, 2025: member

    Now that the router has been fully reset without any lingering configs from prior major versions, PCP & NATPMP is actually working.

    You would not be the first person running into router bugs while testing NAT punching functionality :-)

  48. net: handle multi-part netlink responses
    Handle multi-part netlink responses to prevent truncated results from
    large routing tables.
    
    Previously, we only made a single recv call, which led to incomplete
    results when the kernel split the message into multiple responses (which
    happens frequently with NLM_F_DUMP).
    
    Also guard against a potential hanging issue where the code would
    indefinitely wait for NLMSG_DONE for non-multi-part responses by
    detecting the NLM_F_MULTI flag and only continue waiting when necessary.
    88db09bafe
  49. willcl-ark force-pushed on Sep 3, 2025
  50. achow101 commented at 10:09 pm on September 3, 2025: member
    ACK 88db09bafe9ec363525e5e526c5f6cdd13691447
  51. DrahtBot requested review from Sjors on Sep 3, 2025
  52. DrahtBot requested review from davidgumberg on Sep 3, 2025
  53. davidgumberg commented at 0:00 am on September 4, 2025: contributor

    Code Review re-ACK 88db09b

    0$ git range-diff 4c53178...88db09b
    

    https://github.com/bitcoin/bitcoin/commit/88db09bafe9ec363525e5e526c5f6cdd13691447 takes the reviewer suggestion of moving the rtmsg* cast after checking if hdr->nlmsg_type == NLMSG_DONEwhich resolves #33245 and implements a suggested refactor that simplifies the multi-message/done state tracking in the NLMSG_NEXT loop.

  54. Sjors commented at 7:18 am on September 4, 2025: member
    re-utACK 88db09bafe9ec363525e5e526c5f6cdd13691447
  55. fanquake merged this on Sep 4, 2025
  56. fanquake closed this on Sep 4, 2025

  57. ryanofsky commented at 2:00 pm on September 4, 2025: contributor
    Do we know if this change might fix the integer overflow error reported #32345 (review)?
  58. willcl-ark commented at 3:15 pm on September 4, 2025: member
    I moved variables as per #32159 (review) to try and fix this in here. I did not verify whether it worked, however.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-09-07 06:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me