net, pcp: handle multi-part responses and filter for default route while querying default gateway #32159

pull willcl-ark wants to merge 3 commits into bitcoin:master from willcl-ark:pcp-default-multipart changing 1 file +71 −31
  1. willcl-ark commented at 2:11 pm on March 28, 2025: member

    …for default route in pcp pinholing.

    Currently we only make a single recv call, which truncates results from large routing tables, or when the kernel splits the message into multiple responses (which may happen with NLM_F_DUMP).

    We also do not filter on the default route. For IPv6, this led to selecting the first route with an RTA_GATEWAY attribute, often a non-default route instead of the actual default. This caused PCP port mapping failures because the wrong gateway was used.

    Fix both issues by adding multi-part handling of responses and filtering for the default route.

    Limit responses to ~ 1MB to prevent any router-based DoS.
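
    Roughly, the shape of the receive path after these changes is as follows (a minimal sketch for illustration only, not the literal patch; it assumes the usual <linux/rtnetlink.h> macros and the Sock wrapper already used in src/common/netif.cpp):

     // Sketch: keep reading until NLMSG_DONE (multi-part) or after the first
     // buffer (single-part), capping total consumption at ~1MB.
     constexpr size_t MAX_RESPONSE{1024 * 1024};
     size_t total_received{0};
     bool done{false};
     while (!done && total_received < MAX_RESPONSE) {
         char buf[4096];
         ssize_t n = sock->Recv(buf, sizeof(buf), 0);
         if (n <= 0) break;
         total_received += n;
         for (nlmsghdr* hdr = (nlmsghdr*)buf; NLMSG_OK(hdr, n); hdr = NLMSG_NEXT(hdr, n)) {
             if (hdr->nlmsg_type == NLMSG_DONE) { done = true; break; }
             if (!(hdr->nlmsg_flags & NLM_F_MULTI)) done = true; // single-part: stop after this buffer
             // ... parse the RTM_NEWROUTE payload here, keeping only the default route ...
         }
     }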

  2. DrahtBot commented at 2:11 pm on March 28, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32159.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK laanwj

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  3. willcl-ark commented at 2:13 pm on March 28, 2025: member

    cc @laanwj

    This patch maintains the FreeBSD-style querying/filtering we have currently, but increases the size of the response processed to a maximum of 1MB.

  4. laanwj added the label P2P on Mar 28, 2025
  5. laanwj renamed this:
    net, pcp: handle multi-part responses and filter...
    net, pcp: handle multi-part responses and filter for default route while querying default gateway
    on Mar 28, 2025
  6. laanwj commented at 2:55 pm on March 28, 2025: member

    Concept ACK

    Adding 29.0 milestone; this doesn’t need to block the release, but it would be nice to backport it.

    Edit: removed the milestone; i still think it’d be nice to have in the next 29.x release, but it turns out to be more complicated and riskier than expected, so better to have no immediate time pressure

  7. laanwj added this to the milestone 29.0 on Mar 28, 2025
  8. laanwj requested review from vasild on Mar 28, 2025
  9. laanwj requested review from Sjors on Mar 28, 2025
  10. in src/common/netif.cpp:119 in c5211f3423 outdated
    136-            } else if (family == AF_INET6 && sizeof(in6_addr) == RTA_PAYLOAD(rta_gateway)) {
    137-                in6_addr gw;
    138-                std::memcpy(&gw, RTA_DATA(rta_gateway), sizeof(gw));
    139-                return CNetAddr(gw, scope_id);
    140+        for (nlmsghdr* hdr = (nlmsghdr*)response; NLMSG_OK(hdr, recv_result); hdr = NLMSG_NEXT(hdr, recv_result)) {
    141+            if (hdr->nlmsg_type == NLMSG_DONE) {
    


    laanwj commented at 3:23 pm on March 28, 2025:

    Is it guaranteed that the response to NLM_F_DUMP is always multipart? Or do we need to check nlmsg_flags for NLM_F_MULTI, and if not, break after the first packet?

    This is not clear to me from the documentation:


    willcl-ark commented at 9:31 am on March 29, 2025:

    I don’t think it is, but like you am unsure about what the guarantees of the protocol are here.

    I kind of reverse-engineered this by looking at the miniupnpc and netlink sources, along with an strace of ip route show, which is where the repeated recv calls jumped out at me as a difference between our code and other tools.

    This approach simply relies on receiving an NLMSG_DONE to signal the end of the response and break. This should handle both single and multi-part messages. Here’s a snipped sample of strace -e filter=recvmsg ip route show on my system:

     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 2932
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1444, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240135, nlmsg_pid=3248093}, <snip> 0) = 2932
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 3384
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1504, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240135, nlmsg_pid=3248093}, <snip> 0) = 3384
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 5196
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1488, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240135, nlmsg_pid=3248093}, <snip> 0) = 5196
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240135, nlmsg_pid=3248093}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 424
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=76, nlmsg_type=RTM_NEWADDR, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240136, nlmsg_pid=3248093}, <snip> 0) = 424
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 600
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=72, nlmsg_type=RTM_NEWADDR, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240136, nlmsg_pid=3248093}, <snip> 0) = 600
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
     recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1743240136, nlmsg_pid=3248093}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
    

    NLM_F_MULTI is indeed set, even in the NLMSG_DONE packet.

    One other thing I did read about, but not implement, is checking the sequence number on each message to ensure it was meant for our request. But as we only make a single request, I thought this should be OK to omit.
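
    For reference, such a check would only be a couple of lines; a hypothetical sketch, assuming the nlmsg_seq we sent had been saved as request_seq:

     // Hypothetical: ignore messages that don't answer our dump request.
     if (hdr->nlmsg_seq != request_seq) continue;
     // Checking hdr->nlmsg_pid against our socket's local nl_pid would serve a similar purpose.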


    laanwj commented at 2:46 pm on March 29, 2025:

    Thanks!

    The receive flow here seems to indicate:

    • Receive packet
    • Process packet
    • If NLM_F_MULTI is set and the packet is not NLMSG_DONE, repeat

    What i’m mostly worried about is that the current code will hang if NLMSG_DONE never comes, which seems to be the case for non-multi responses that consist of a single data packet.

    But it may be that the NETLINK_ROUTE response to RTM_GETROUTE/NLM_F_DUMP is always multi-packet. That empirically seems to be the case even for tiny routing tables.

    Looking at the ip source is a good idea. Also we need to verify this with FreeBSD.

    But as we only make a single request i thought this should be OK to omit.

    Agree, going that far in checking seems unnecessary. i don’t think we need super defensive coding as netlink is a local protocol with the kernel.


    willcl-ark commented at 8:18 am on March 31, 2025:

    Thanks, that infradead page is very handy!

    I made a few changes in 6c694c212f8d898dac9cc1c5637381452c91e79b based on your thoughts (and the protocol page):

    • set socket to non-blocking mode to avoid hanging if the kernel doesn’t send a response
    • use a vector to collect all data from multi-part responses
    • exit when recv() returns 0 (this should handle single-part messages, AFAICT)

    I think relying on receiving no more data from recv() to break the receive loop should be as (or perhaps more) robust than checking for the NLM_F_MULTI flag and exiting after first receive if it’s not set, but curious what you think here?

    If it would help, I’d be happy to break this into a few smaller commits, as it’s starting to feel like this change now bundles a few different things into one…


    laanwj commented at 9:13 am on March 31, 2025:

    I think relying on receiving no more data from recv() to break the receive loop should be as (or perhaps more) robust than checking for the NLM_F_MULTI flag and exiting after first receive if it’s not set, but curious what you think here?

    Maybe, but is it safe to assume that netlink will never block?

    We don’t want to end up in the same situation as before where we miss data, but this time due to, say, a threading race condition.

    i think the safest thing here is to mimic ip’s behavior as closely as possible, as it is the only tool these kinds of interfaces tend to be written against.


    laanwj commented at 9:16 am on March 31, 2025:

    use a vector to collect all data from multi-part responses

    i’m not sure i see the motivation here. Parsing the packets as separate units is just as valid (“Multipart messages unlike fragmented ip packets must not be reassembled”), avoids dynamic allocation, and is simpler.


    Sjors commented at 12:34 pm on March 31, 2025:

    checking of the sequence number on each message to ensure it was meant for our request

    This seems like a good idea to at least do in debug builds.

    It seems like a good precaution to check for the presence of NLM_F_MULTI and not wait for NLMSG_DONE if it isn’t set. At least from my naive reading of https://man7.org/linux/man-pages/man7/netlink.7.html it seems NLMSG_DONE is only used for multipart messages.

    Splitting into multiple commits would be useful, e.g. one commit that switches to non-blocking mode.

  11. willcl-ark force-pushed on Mar 31, 2025
  12. in src/common/netif.cpp:97 in 6c694c212f outdated
    113+    while (!done) {
    114+        char temp[4096];
    115+        int64_t recv_result;
    116+        do {
    117+            recv_result = sock->Recv(temp, sizeof(temp), 0);
    118+        } while (recv_result < 0 && (errno == EINTR || errno == EAGAIN));
    


    Sjors commented at 12:14 pm on March 31, 2025:
    Since you’re touching this line… According to the internet, we should also check EWOULDBLOCK even though it’s usually the same as EAGAIN, and it’s likely not relevant for any system we support. https://stackoverflow.com/a/49421517
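    i.e. something along the lines of (sketch only):

     do {
         recv_result = sock->Recv(temp, sizeof(temp), 0);
         // EWOULDBLOCK is harmless to list even on platforms where it has the same value as EAGAIN.
     } while (recv_result < 0 && (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK));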
  13. in src/common/netif.cpp:107 in 6c694c212f outdated
    112+
    113+    while (!done) {
    114+        char temp[4096];
    115+        int64_t recv_result;
    116+        do {
    117+            recv_result = sock->Recv(temp, sizeof(temp), 0);
    


    Sjors commented at 12:25 pm on March 31, 2025:
    I know this is existing code, but I don’t recall why there’s no timeout here. And also, should there be a quick wait between Recv calls?
  14. Sjors commented at 12:38 pm on March 31, 2025: member

    Although the code this PR touches isn’t compiled on macOS, I did briefly check that things still work there. I also briefly tested on Ubuntu 24.10.

    Left some inline questions to wrap my head around the changes and refresh my memory of the original…

  15. laanwj added the label Needs backport (29.x) on Apr 1, 2025
  16. laanwj removed this from the milestone 29.0 on Apr 1, 2025
  17. willcl-ark commented at 9:26 am on April 1, 2025: member

    Thanks for the review @Sjors & @laanwj.

    Going to move to draft while I re-work it a little.

  18. willcl-ark marked this as a draft on Apr 1, 2025
  19. willcl-ark force-pushed on Apr 1, 2025
  20. net: filter for default routes in netlink responses
    Filter netlink responses to only consider default routes by checking the
    destination prefix length (rtm_dst_len == 0).
    
    Previously, we selected the first route with an RTA_GATEWAY attribute,
    which for IPv6 often resulted in choosing a non-default route instead of
    the actual default.
    
    This caused occasional PCP port mapping failures because a gateway for a
    non-default route was selected.
    57ce645f05
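    The core of that filter is a check on the route message header, roughly (a sketch, not the exact diff):

     // A destination prefix length of 0 means the route matches any address
     // (0.0.0.0/0 or ::/0), i.e. it is a default route.
     rtmsg* r = (rtmsg*)NLMSG_DATA(hdr);
     if (r->rtm_dst_len != 0) continue;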
  21. net: skip non-route netlink responses
    This shouldn't usually be hit, but is a good belt-and-braces measure.
    42e99ad773
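    In practice this amounts to something like (sketch):

     // Ignore anything in the dump that isn't a route message.
     if (hdr->nlmsg_type != RTM_NEWROUTE) continue;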
  22. willcl-ark force-pushed on Apr 2, 2025
  23. net: handle multi-part netlink responses
    Handle multi-part netlink responses to prevent truncated results from
    large routing tables.
    
    Previously, we only made a single recv call, which led to incomplete
    results when the kernel split the message into multiple responses (which
    happens frequently with NLM_F_DUMP).
    
    Also guard against a potential hang where the code would wait
    indefinitely for NLMSG_DONE on non-multi-part responses, by detecting
    the NLM_F_MULTI flag and only continuing to wait when necessary.
    4c53178256
  24. willcl-ark force-pushed on Apr 2, 2025
  25. DrahtBot added the label CI failed on Apr 2, 2025
  26. DrahtBot commented at 11:05 am on April 2, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/39838257320

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  27. DrahtBot removed the label CI failed on Apr 2, 2025
  28. willcl-ark commented at 1:56 pm on April 2, 2025: member

    After (additional) further investigation I gained some new insights:

    • This is not a regression from the migration from libnatpmp, as I first thought.

    • The extent to which tailscale interferes with the routing is more significant than I realised: even if we find the correct gateway (by handling multi-part messages and filtering properly), looking up the route to that gateway still doesn’t return the data we want, specifically the correct interface. This causes us to make an invalid request, which gets rejected.

      This is also observed using the ip command:

      $ ip route show
      default via 10.0.0.1 dev wlo1 proto dhcp src 10.0.12.100 metric 600
      10.0.0.0/20 dev wlo1 proto kernel scope link src 10.0.12.100 metric 600
      172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
      192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown

      will@ubuntu in …/src/core/bitcoin on  pcp-default-multipart [$?] : C v19.1.7-clang via △ v3.31.6 : 🐍 (core)
      $ ip route get 10.0.0.1
      10.0.0.1 dev tailscale0 table 52 src 100.89.20.73 uid 1000
          cache
      

      Handling this would be quite invasive, requiring us to track the interface, and I don’t think it’s in scope for us.

    Therefore I have re-worked and split the remaining changes into 3 commits:

    1. Filter for the default route in the response. With this change, IPv6 pinholing works even with tailscale up, which is quite nice.
    2. Skip non-NEWROUTE messages. Although NLMSG_DONE is already handled, this is a slight optimisation that protects against unexpected responses being parsed.
    3. Handle single and multi-part messages. Track whether NLM_F_MULTI was set: if so, wait for NLMSG_DONE; otherwise break after the first response, treating it as single-part.

    I think these are all worthwhile, but I could see an argument that what we currently have works “well enough” for the most basic use-case: a simple (single) routing table. And we could just consider anything else out-of-scope for this project.

    @Sjors: Ref your comments, the socket is already set as non-blocking via the implementation in CreateSockOS in https://github.com/bitcoin/bitcoin/blob/74d9598bfbc24c3b7bfe1dad5bf9d988381bf893/src/netbase.cpp#L536-L540. I did try a commit adding a backoff timer and retry mechanism here for extra safety, but I’m pretty sure it’s not worth the added complexity.
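
    For the record, a bounded wait between reads (rather than a retry/backoff loop) could look roughly like the following; purely illustrative and not part of this PR, assuming the existing Sock::Wait helper from src/util/sock.h:

     // Hypothetical: wait up to 100ms for the non-blocking netlink socket to
     // become readable before each Recv, instead of spinning on EAGAIN.
     using namespace std::chrono_literals;
     Sock::Event occurred{0};
     if (!sock->Wait(100ms, Sock::RECV, &occurred) || occurred == 0) {
         break; // wait failed or timed out: stop reading further packets
     }
     ssize_t n = sock->Recv(buf, sizeof(buf), 0);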

  29. willcl-ark marked this as ready for review on Apr 2, 2025
