p2p: improve TxOrphanage denial of service bounds #31829

pull glozow wants to merge 17 commits into bitcoin:master from glozow:2025-01-orphanage-peer-dos changing 22 files +1873 −883
  1. glozow commented at 9:14 pm on February 9, 2025: member

    This PR is part of the orphan resolution project, see #27463.

    This design came from collaboration with sipa - thanks.

    We want to limit the CPU work and memory used by TxOrphanage to avoid denial of service attacks. On master, this is achieved by limiting the number of transactions in this data structure to 100, and the weight of each transaction to 400KWu (the largest standard tx) [0]. We always allow new orphans, but if the addition causes us to exceed 100, we evict one randomly. This is dead simple, but has problems:

    • It makes the orphanage trivially churnable: any one peer can render it useless by spamming us with lots of orphans. It’s possible this is happening: “Looking at data from node alice on 2024-09-14 shows that we’re sometimes removing more than 100k orphans per minute. This feels like someone flooding us with orphans.” [1]
    • Effectively, opportunistic 1p1c is useless in the presence of adversaries: it is opportunistic, pairing a low-feerate tx with a child that happens to be in the orphanage. So if nothing is able to stay in orphanages, we can’t expect 1p1cs to propagate.
    • This number is also often insufficient for the volume of orphans we handle: historical data show that overflows are pretty common, and there are times when “it seems like [the node] forgot about the orphans and re-requested them multiple times.” [1]

    Just jacking up the -maxorphantxs number is not a good enough solution, because it doesn’t solve the churnability problem, and the effective resource bounds scale poorly.

    This PR introduces numbers for {global, per-peer} {memory usage, announcements}, representing resource limits:

    • The (constant) global announcement limit caps the number of unique (wtxid, peer) pairs in the orphanage [2]. This represents a cap on CPU, and does not change with the number of peers we have. Evictions must happen whenever this limit is reached.
    • The (variable) per-peer announcement limit is the global announcement limit divided by the number of peers. Peers are allowed to exceed this limit provided the global announcement limit has not been reached. The per-peer announcement limit decreases with more peers.
    • The (constant) per-peer memory usage reservation is the amount of orphan weight [3] reserved per peer [4]. Reservation means that peers are effectively guaranteed this amount of space. It is not a limit; peers are allowed to exceed their reservation provided the global usage limit is not reached.
    • The (variable) global memory usage limit is the number of peers multiplied by the per-peer reservation [5]. As such, the global memory usage limit scales up with the number of peers we have. Evictions must happen whenever this limit is reached.
    • We introduce a “Peer DoS Score,” the maximum of a peer’s “CPU Score” and “Memory Score.” The CPU score is the ratio of the number of announcements from this peer to the per-peer announcement limit. The memory score is the ratio of the total usage of all orphans announced by this peer to the per-peer usage reservation.

    Eviction changes in a few ways:

    • It is triggered if the global announcement limit or global memory usage limit is exceeded.
    • On each iteration of the loop, instead of selecting a random orphan, we select a peer and delete 1 of its announcements. Specifically, we select the peer with the highest DoS score, which is the maximum of its CPU DoS score (based on announcements) and Memory DoS score (based on tx weight). After the peer has been selected, we evict its oldest announcement (non-reconsiderable sorted before reconsiderable).
    • Instead of evicting orphans, we evict announcements. An orphan stays in the orphanage as long as it has at least 1 announcer. Of course, over several iterations of the loop, we may erase all of an orphan’s announcers, thus erasing the orphan itself. The purpose of this change is to prevent a peer from being able to trigger eviction of another peer’s orphans.
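    The eviction procedure above can be modeled in miniature. This is a hypothetical, simplified sketch, not the PR's actual TxOrphanage code: peer state is reduced to a deque of announcement weights, and the two per-peer ratios are compared by scaling both into a common integer unit:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>

// Illustrative stand-in for per-peer orphanage state.
struct PeerInfo {
    std::deque<int64_t> announcements; // oldest first; values are orphan weights
    int64_t usage{0};                  // sum of announced orphan weights
};

// Evict announcements from the highest-scoring peer until both global limits
// hold. Returns the number of announcements evicted.
int LimitOrphans(std::map<int, PeerInfo>& peers, int64_t max_ann, int64_t reservation)
{
    // The global usage limit scales with the number of peers.
    const int64_t max_usage = std::max<int64_t>((int64_t)peers.size() * reservation, 1);
    const int64_t per_peer_ann = std::max<int64_t>(max_ann / (int64_t)peers.size(), 1);
    int evicted{0};
    while (true) {
        int64_t total_ann{0}, total_usage{0};
        for (const auto& [id, p] : peers) {
            total_ann += (int64_t)p.announcements.size();
            total_usage += p.usage;
        }
        if (total_ann <= max_ann && total_usage <= max_usage) break;
        // DoS score is max(ann/per_peer_ann, usage/reservation); multiplying
        // both ratios by per_peer_ann * reservation keeps the comparison in
        // integers: max(ann * reservation, usage * per_peer_ann).
        auto scaled_score = [&](const PeerInfo& p) {
            return std::max((int64_t)p.announcements.size() * reservation,
                            p.usage * per_peer_ann);
        };
        auto worst = std::max_element(peers.begin(), peers.end(),
            [&](const auto& a, const auto& b) {
                return scaled_score(a.second) < scaled_score(b.second);
            });
        PeerInfo& p = worst->second;
        p.usage -= p.announcements.front();
        p.announcements.pop_front(); // oldest announcement is evicted first
        ++evicted;
    }
    return evicted;
}
```

    In this model, a peer far over its allowance keeps being selected until its score drops below everyone else's, so a well-behaved peer under its reservation is never touched.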

    This PR also:

    • Reimplements TxOrphanage as a single multi-index container.
    • Effectively bounds the number of transactions that can be in a peer’s work set by ensuring it is a subset of the peer’s announcements.
    • Removes the -maxorphantxs config option, as the orphanage no longer limits by unique orphans.

    This means we can receive 1p1c packages in the presence of spammy peers. It also makes the orphanage more useful and increases our download capacity without drastically increasing orphanage resource usage.

    [0]: This means the effective memory limit in orphan weight is 100 * 400KWu = 40MWu
    [1]: https://delvingbitcoin.org/t/stats-on-orphanage-overflows/1421
    [2]: Limit is 3000, which is equivalent to one max size ancestor package (24 transactions can be missing inputs) for each peer (default max connections is 125).
    [3]: Orphan weight is used in place of actual memory usage because something like “one maximally sized standard tx” is easier to reason about than “considering the bytes allocated for vin and vout vectors, it needs to be within N bytes…” etc. We can also consider a different formula to encapsulate more of the memory overhead but still have an interface that is easy to reason about.
    [4]: The limit is 404KWu, which is the maximum size of an ancestor package.
    [5]: With 125 peers, this is 50.5MWu, which is a small increase from the existing limit of 40MWu. While the actual memory usage limit is higher (this number does not include the other memory used by TxOrphanage to store the outpoints map, etc.), this is within the same ballpark as the old limit.
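    The arithmetic in these footnotes checks out directly; the constant names below are illustrative, not the PR's identifiers:

```cpp
#include <cassert>
#include <cstdint>

// Footnote [2]: global announcement limit and default connection count.
constexpr int64_t MAX_ANNOUNCEMENTS{3000};
constexpr int64_t NUM_PEERS{125};
// Footnote [4]: per-peer weight reservation (one max size ancestor package).
constexpr int64_t RESERVED_WEIGHT_PER_PEER{404'000};

// Per-peer announcement allowance: 3000 / 125 = 24, one max size ancestor
// package's worth of missing-input transactions per peer.
constexpr int64_t per_peer_announcements = MAX_ANNOUNCEMENTS / NUM_PEERS;
// Footnote [5]: scaled global usage limit, 125 * 404KWu = 50.5MWu.
constexpr int64_t global_usage_limit = NUM_PEERS * RESERVED_WEIGHT_PER_PEER;
// Footnote [0]: old effective limit, 100 orphans * 400KWu = 40MWu.
constexpr int64_t old_effective_limit = 100 * 400'000;
```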

  2. glozow added the label P2P on Feb 9, 2025
  3. DrahtBot commented at 9:14 pm on February 9, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31829.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Approach ACK sipa

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #32631 (refactor: Convert GenTxid to std::variant by marcofleon)
    • #32189 (refactor: Txid type safety (parent PR) by marcofleon)
    • #29415 (Broadcast own transactions only via short-lived Tor or I2P connections by vasild)
    • #28690 (build: Introduce internal kernel library by TheCharlatan)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  4. glozow force-pushed on Feb 9, 2025
  5. DrahtBot added the label CI failed on Feb 9, 2025
  6. DrahtBot commented at 9:20 pm on February 9, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/36925040096

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  7. glozow force-pushed on Feb 9, 2025
  8. glozow force-pushed on Feb 10, 2025
  9. glozow commented at 1:13 pm on February 10, 2025: member
    Rebased
  10. glozow marked this as ready for review on Feb 10, 2025
  11. DrahtBot removed the label CI failed on Feb 10, 2025
  12. in src/txorphanage.cpp:55 in ef2f44e653 outdated
    35@@ -36,9 +36,10 @@ bool TxOrphanage::AddTx(const CTransactionRef& tx, NodeId peer)
    36         return false;
    37     }
    38 
    


    instagibbs commented at 4:43 pm on February 10, 2025:

    ef2f44e653a4877d4e65fbd5a51ec83ceb96d212

    MAX_GLOBAL_ORPHAN_ANNOUNCEMENTS doesn’t occur in the PR, should be DEFAULT_MAX_ORPHAN_ANNOUNCEMENTS?


    glozow commented at 12:04 pm on February 11, 2025:
    thanks, fixed
  13. in src/txorphanage.h:47 in ef2f44e653 outdated
    42+    unsigned int m_reserved_weight_per_peer{DEFAULT_RESERVED_ORPHAN_WEIGHT_PER_PEER};
    43+
    44+    /** The maximum number of announcements across all peers, representing a computational upper bound,
    45+     * i.e. the maximum number of evictions we might do at a time. There is no per-peer announcement
    46+     * limit until the global limit is reached. Also, this limit is constant regardless of how many
    47+     * peers we have: if we only have 1 peer, this is the number of orphans they may provide. As
    


    instagibbs commented at 4:50 pm on February 10, 2025:
    s/may provide/may provide without risking eviction/?

    glozow commented at 12:06 pm on February 11, 2025:
    changed
  14. in src/txorphanage.h:48 in ef2f44e653 outdated
    43+
    44+    /** The maximum number of announcements across all peers, representing a computational upper bound,
    45+     * i.e. the maximum number of evictions we might do at a time. There is no per-peer announcement
    46+     * limit until the global limit is reached. Also, this limit is constant regardless of how many
    47+     * peers we have: if we only have 1 peer, this is the number of orphans they may provide. As
    48+     * more peers are added, each peer's allocation is reduced. */
    


    instagibbs commented at 4:50 pm on February 10, 2025:
    s/allocation/protected allocation/?

    glozow commented at 12:06 pm on February 11, 2025:
    added
  15. in src/txorphanage.h:148 in ef2f44e653 outdated
    136@@ -111,7 +137,6 @@ class TxOrphanage {
    137 
    138 protected:
    139     struct OrphanTx : public OrphanTxBase {
    


    instagibbs commented at 4:51 pm on February 10, 2025:
    do we think we’ll end up using the derived class in a meaningful way later?

    glozow commented at 4:33 am on February 12, 2025:
    not entirely sure… could be cleaned up

    glozow commented at 5:42 pm on June 4, 2025:
    this is gone now
  16. in src/txorphanage.h:247 in ef2f44e653 outdated
    233+
    234+    unsigned int GetGlobalMaxUsage() const {
    235+        return std::max<unsigned int>(m_peer_orphanage_info.size() * m_reserved_weight_per_peer, 1);
    236+    }
    237+
    238+    unsigned int GetPerPeerMaxAnnouncements() const {
    


    instagibbs commented at 5:02 pm on February 10, 2025:
    note: this is used to “normalize” the one DoS score metric against the other only. If this wasn’t used, trimming would heavily favor announcement-based scores first.

    glozow commented at 4:40 am on February 12, 2025:
    Yes, they’re both meant to be ratios :+1:
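    A made-up numeric example of the normalization point discussed here: peer A has few announcements but far exceeds its memory reservation, while peer B has many announcements yet stays within both allowances. Comparing raw counts would pick B; comparing ratios (here via cross-multiplication) picks A. The helper and the numbers are illustrative only:

```cpp
#include <cassert>
#include <cstdint>

// True iff a_n/a_d > b_n/b_d, compared without floating point.
bool FirstRatioHigher(int64_t a_n, int64_t a_d, int64_t b_n, int64_t b_d)
{
    return a_n * b_d > b_n * a_d;
}
```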
  17. in src/txorphanage.h:196 in ef2f44e653 outdated
    182+            FeeFrac cpu_score(m_iter_list.size(), peer_max_ann);
    183+            FeeFrac mem_score(m_total_usage, peer_max_mem);
    184+            return std::max<FeeFrac>(cpu_score, mem_score);
    185+        }
    186     };
    187     std::map<NodeId, PeerOrphanInfo> m_peer_orphanage_info;
    


    instagibbs commented at 5:11 pm on February 10, 2025:

    re:global memory limits, they won’t be increased until each connected peer offers up their own orphan or announcement (which makes sense)

    would be good to have test/fuzz coverage that there isn’t some bug where this continuously grows, raising the global limits to unsafe levels? Arg in SanityCheck?


    sipa commented at 4:26 am on February 19, 2025:

    Regarding https://github.com/bitcoin/bitcoin/pull/31829/commits/2771b69e461230feb7761b0afff41b44bd3ba34f#r1949554535

    With GetGlobalMaxUsage() computing memory limits on-the-fly, is this comment still relevant?


    instagibbs commented at 4:40 pm on February 20, 2025:

    I guess my worry is that we would somehow add a regression that results in the peer set growing indefinitely in the orphanage, which would grow memory usage substantially. Here’s some suggested coverage which would hopefully uncover such an issue (unless the caller plumb forgets to EraseForPeer):

      0diff --git a/src/test/fuzz/txorphan.cpp b/src/test/fuzz/txorphan.cpp
      1index 9133323449..df0a02d75d 100644
      2--- a/src/test/fuzz/txorphan.cpp
      3+++ b/src/test/fuzz/txorphan.cpp
      4@@ -45,10 +45,15 @@ FUZZ_TARGET(txorphan, .init = initialize_orphanage)
      5         outpoints.emplace_back(Txid::FromUint256(uint256{i}), 0);
      6     }
      7 
      8     CTransactionRef ptx_potential_parent = nullptr;
      9 
     10+    // Peers which have offered orphans (via tx or announcement) and have not subsequently
     11+    // "disconnected" aka called EraseForPeer
     12+    std::set<NodeId> connected_peers;
     13+
     14+
     15     LIMITED_WHILE(outpoints.size() < 200'000 && fuzzed_data_provider.ConsumeBool(), 10 * DEFAULT_MAX_ORPHAN_TRANSACTIONS)
     16     {
     17         // construct transaction
     18         const CTransactionRef tx = [&] {
     19             CMutableTransaction tx_mut;
     20@@ -121,14 +126,16 @@ FUZZ_TARGET(txorphan, .init = initialize_orphanage)
     21 
     22                         if (add_tx) {
     23                             Assert(orphanage.UsageByPeer(peer_id) == tx_weight + total_peer_bytes_start);
     24                             Assert(orphanage.TotalOrphanUsage() == tx_weight + total_bytes_start);
     25                             Assert(tx_weight <= MAX_STANDARD_TX_WEIGHT);
     26+                            connected_peers.insert(peer_id);
     27                         } else {
     28                             // Peer may have been added as an announcer.
     29                             if (orphanage.UsageByPeer(peer_id) == tx_weight + total_peer_bytes_start) {
     30                                 Assert(orphanage.HaveTxFromPeer(wtxid, peer_id));
     31+                                connected_peers.insert(peer_id);
     32                             } else {
     33                                 // Otherwise, there must not be any change to the peer byte count.
     34                                 Assert(orphanage.UsageByPeer(peer_id) == total_peer_bytes_start);
     35                             }
     36 
     37@@ -158,10 +165,11 @@ FUZZ_TARGET(txorphan, .init = initialize_orphanage)
     38                         // Total bytes should not have changed. If peer was added as announcer, byte
     39                         // accounting must have been updated.
     40                         Assert(orphanage.TotalOrphanUsage() == total_bytes_start);
     41                         if (added_announcer) {
     42                             Assert(orphanage.UsageByPeer(peer_id) == tx_weight + total_peer_bytes_start);
     43+                            connected_peers.insert(peer_id);
     44                         } else {
     45                             Assert(orphanage.UsageByPeer(peer_id) == total_peer_bytes_start);
     46                         }
     47                     }
     48                 },
     49@@ -190,10 +198,11 @@ FUZZ_TARGET(txorphan, .init = initialize_orphanage)
     50                         Assert(!have_tx && !have_tx_and_peer && !orphanage.EraseTx(wtxid));
     51                     }
     52                 },
     53                 [&] {
     54                     orphanage.EraseForPeer(peer_id);
     55+                    connected_peers.erase(peer_id); // "DisconnectPeer"
     56                     Assert(!orphanage.HaveTxFromPeer(tx->GetWitnessHash(), peer_id));
     57                     Assert(orphanage.UsageByPeer(peer_id) == 0);
     58                 },
     59                 [&] {
     60                     // test mocktime and expiry
     61@@ -212,7 +221,8 @@ FUZZ_TARGET(txorphan, .init = initialize_orphanage)
     62 
     63         const bool have_tx{orphanage.HaveTx(tx->GetWitnessHash())};
     64         const bool get_tx_nonnull{orphanage.GetTx(tx->GetWitnessHash()) != nullptr};
     65         Assert(have_tx == get_tx_nonnull);
     66     }
     67-    orphanage.SanityCheck();
     68+
     69+    orphanage.SanityCheck(connected_peers.size());
     70 }
     71diff --git a/src/txorphanage.cpp b/src/txorphanage.cpp
     72index 40a2503af5..c6c156409d 100644
     73--- a/src/txorphanage.cpp
     74+++ b/src/txorphanage.cpp
     75@@ -402,11 +402,11 @@ std::vector<TxOrphanage::OrphanTxBase> TxOrphanage::GetOrphanTransactions() cons
     76         ret.push_back({o.second.tx, o.second.announcers, o.second.nTimeExpire});
     77     }
     78     return ret;
     79 }
     80 
     81-void TxOrphanage::SanityCheck() const
     82+void TxOrphanage::SanityCheck(int expected_num_peers) const
     83 {
     84     // Check that cached m_total_announcements is correct. First count when iterating through m_orphans (counting number
     85     // of announcers each), then count when iterating through peers (counting number of orphans per peer).
     86     unsigned int counted_total_announcements{0};
     87     // Check that m_total_orphan_usage is correct
     88@@ -461,10 +461,16 @@ void TxOrphanage::SanityCheck() const
     89                 return orphan_it->second.tx->GetWitnessHash() == wtxid;
     90             }) != info.m_iter_list.end());
     91         }
     92     }
     93 
     94+    // We should not be offering more global memory for orphanage than expected
     95+    if (expected_num_peers != -1) {
     96+        Assert((size_t) expected_num_peers == m_peer_orphanage_info.size());
     97+        Assert(GetGlobalMaxUsage() == std::max<int64_t>(expected_num_peers * GetPerPeerMaxUsage(), 1));
     98+    }
     99+
    100     Assert(wtxids_in_peer_map.size() == m_orphans.size());
    101     Assert(counted_total_announcements == 0);
    102 }
    103 
    104 bool TxOrphanage::NeedsTrim(unsigned int max_orphans) const
    105diff --git a/src/txorphanage.h b/src/txorphanage.h
    106index a5a6a94bec..562e9a3321 100644
    107--- a/src/txorphanage.h
    108+++ b/src/txorphanage.h
    109@@ -140,11 +140,11 @@ public:
    110         return peer_it == m_peer_orphanage_info.end() ? 0 : peer_it->second.m_iter_list.size();
    111     }
    112 
    113     /** Check consistency between PeerOrphanInfo and m_orphans. Recalculate counters and ensure they
    114      * match what is cached. */
    115-    void SanityCheck() const;
    116+    void SanityCheck(int expected_num_peers = -1) const;
    117 
    118 protected:
    119     struct OrphanTx : public OrphanTxBase {
    120     };
    
  18. in src/txorphanage.cpp:103 in ef2f44e653 outdated
    108-        it_last->second.list_pos = old_pos;
    109+        // Find this orphan iter's position in the list, and delete it.
    110+        auto& orphan_list = peer_it->second.m_iter_list;
    111+        size_t old_pos = std::distance(orphan_list.begin(), std::find(orphan_list.begin(), orphan_list.end(), it));
    112+
    113+        if (!Assume(old_pos < orphan_list.size())) continue;
    


    instagibbs commented at 5:21 pm on February 10, 2025:
    Maybe a bit too paranoid since it was just computed explicitly from the underlying list?

    glozow commented at 12:04 pm on February 11, 2025:
    removed
  19. in src/txorphanage.cpp:200 in ef2f44e653 outdated
    200-        ++nEvicted;
    201+        if (!Assume(!peer_it_heap.empty())) break;
    202+        // Find the peer with the highest DoS score, which is a fraction of {usage, announcements} used
    203+        // over the allowance. This metric causes us to naturally select peers who have exceeded
    204+        // their limits (i.e. a DoS score > 1) before peers who haven't. We may choose the same peer
    205+        // change since the last iteration of this loop.
    


    instagibbs commented at 5:29 pm on February 10, 2025:
    wording confusion: same peer change?

    glozow commented at 12:05 pm on February 11, 2025:
    fixed
  20. in src/test/orphanage_tests.cpp:247 in 95b61662e5 outdated
    242+        auto ptx = MakeTransactionSpending(/*outpoints=*/{}, det_rand);
    243+        orphanage.AddTx(ptx, dos_peer);
    244+    }
    245+    peer_usages.emplace_back(orphanage.UsageByPeer(dos_peer));
    246+
    247+    // Force an eviction. Note that no limiting has happened yet at this point. LimitOrphans may
    


    instagibbs commented at 5:44 pm on February 10, 2025:

    happened yet at this point

    Do you mean prior to LimitOrphans?


    glozow commented at 12:12 pm on February 11, 2025:
    yes, clarified
  21. in src/test/orphanage_tests.cpp:256 in 95b61662e5 outdated
    250+    orphanage.LimitOrphans(prev_count - 1, det_rand);
    251+    BOOST_CHECK(orphanage.Size() <= prev_count - 1);
    252+
    253+    // The DoS peer's orphans have been evicted, nobody else's have.
    254+    for (NodeId peer{0}; peer <= dos_peer; ++peer) {
    255+        BOOST_CHECK_EQUAL(peer == dos_peer, peer_usages.at(peer) != orphanage.UsageByPeer(peer));
    


    instagibbs commented at 5:46 pm on February 10, 2025:
    nit: would rather we check that we evicted the dos’y peer and somehow didn’t add more resources allocated to him

    glozow commented at 12:16 pm on February 11, 2025:
    Added
  22. in test/functional/p2p_opportunistic_1p1c.py:476 in 653f1bb84d outdated
    467+        peer_normal.wait_for_getdata([parent_txid_int])
    468+
    469+        self.log.info("Send another round of very large orphans from a DoSy peer")
    470+        for large_orphan in large_orphans[60:]:
    471+            peer_doser.send_and_ping(msg_tx(large_orphan))
    472+
    


    instagibbs commented at 6:04 pm on February 10, 2025:

    could do this for both cases

    0       # Something was evicted
    1        assert_greater_than(len(large_orphans), len(node.getorphantxs()))
    

    glozow commented at 12:37 pm on February 11, 2025:
    added
  23. in src/bench/txorphanage.cpp:103 in 3ce9ef7dd3 outdated
    44+}
    45+
    46+static void OrphanageEvictionMany(benchmark::Bench& bench)
    47+{
    48+    NodeId NUM_PEERS{125};
    49+    unsigned int NUM_TRANSACTIONS(DEFAULT_MAX_ORPHAN_ANNOUNCEMENTS / NUM_PEERS);
    


    instagibbs commented at 6:15 pm on February 10, 2025:
    seems wrong, you’re sending 3000/125=24 transactions total?

    glozow commented at 12:18 pm on February 11, 2025:
    Yes, and they are each announced by every peer. This bench is to test the maximum number of transactions where every peer has 100% overlap. We call AddTx DEFAULT_MAX_ORPHAN_ANNOUNCEMENTS times, which is the maximum before eviction would trigger. If we increase DEFAULT_MAX_ORPHAN_ANNOUNCEMENTS, the bench will scale too.

    instagibbs commented at 3:02 pm on February 15, 2025:
    are you sure it’s being sent via every peer for this benchmark? Looks like there’s no overlap?

    glozow commented at 4:44 pm on February 20, 2025:
    You are right. That explains why it doesn’t get slower haha
  24. in src/txorphanage.cpp:276 in 33034eaa3b outdated
    275+                // items that are no longer in the orphanage. We should only do this once per peer
    276+                // per call to AddChildrenToWorkSet, so keep track of which peers we have trimmed.
    277+                // We also never need to do it more than once since evictions don't happen in this
    278+                // function.
    279+                if (orphan_work_set.size() + 1 > MAX_ORPHAN_WORK_QUEUE && !peers_workset_trimmed.contains(announcer)) {
    280+                    std::erase_if(orphan_work_set, [&](const auto& wtxid) { return m_orphans.contains(wtxid); });
    


    instagibbs commented at 8:28 pm on February 10, 2025:

    this is backwards. See added unit test:

     0diff --git a/src/test/orphanage_tests.cpp b/src/test/orphanage_tests.cpp
     1index fe0f81fdb4..dde42d9d4a 100644
     2--- a/src/test/orphanage_tests.cpp
     3+++ b/src/test/orphanage_tests.cpp
     4@@ -77,0 +78,24 @@ static CTransactionRef MakeTransactionSpending(const std::vector<COutPoint>& out
     5+// 101 output transaction
     6+static CTransactionRef MakeHugeTransactionSpending(const std::vector<COutPoint>& outpoints, FastRandomContext& det_rand)
     7+{
     8+    CKey key;
     9+    MakeNewKeyWithFastRandomContext(key, det_rand);
    10+    CMutableTransaction tx;
    11+    // If no outpoints are given, create a random one.
    12+    if (outpoints.empty()) {
    13+        tx.vin.emplace_back(Txid::FromUint256(det_rand.rand256()), 0);
    14+    } else {
    15+        for (const auto& outpoint : outpoints) {
    16+            tx.vin.emplace_back(outpoint);
    17+        }
    18+    }
    19+    // Ensure txid != wtxid
    20+    tx.vin[0].scriptWitness.stack.push_back({1});
    21+    tx.vout.resize(101);
    22+    tx.vout[0].nValue = CENT;
    23+    tx.vout[0].scriptPubKey = GetScriptForDestination(PKHash(key.GetPubKey()));
    24+    tx.vout[1].nValue = 3 * CENT;
    25+    tx.vout[1].scriptPubKey = GetScriptForDestination(WitnessV0KeyHash(key.GetPubKey()));
    26+    return MakeTransactionRef(tx);
    27+}
    28+
    29@@ -598,0 +623,23 @@ BOOST_AUTO_TEST_CASE(peer_worksets)
    30+
    31+        {
    32+            // We will fill the orphanage with a single parent and 101 children
    33+            // from that single transaction to cause potential deletion of work set
    34+            // from peer 0.
    35+            auto tx_missing_parent = MakeHugeTransactionSpending({}, det_rand);
    36+            std::vector<CTransactionRef> tx_orphans;
    37+            for (unsigned int i{0}; i < MAX_ORPHAN_WORK_QUEUE + 1; i++) {
    38+                auto tx_orphan = MakeTransactionSpending({COutPoint{tx_missing_parent->GetHash(), i}}, det_rand);
    39+                BOOST_CHECK(orphanage.AddTx(tx_orphan, /*peer=*/node0));
    40+            }
    41+
    42+            // 101 transactions in the orphanage (no trimming of orphanage yet), now
    43+            // add parent to work set, which will all be allocated to peer 0.
    44+            // work set should get trimmed exactly once down to MAX_ORPHAN_WORK_QUEUE
    45+            orphanage.AddChildrenToWorkSet(*tx_missing_parent, det_rand);
    46+            for (unsigned int i{0}; i < MAX_ORPHAN_WORK_QUEUE; i++) {
    47+                BOOST_CHECK(orphanage.GetTxToReconsider(node0));
    48+            }
    49+
    50+            // We should have emptied the work queue in MAX_ORPHAN_WORK_QUEUE steps
    51+            BOOST_CHECK(!orphanage.HaveTxToReconsider(node0));
    52+        }
    53diff --git a/src/txorphanage.cpp b/src/txorphanage.cpp
    54index 09d50a244a..2a881db669 100644
    55--- a/src/txorphanage.cpp
    56+++ b/src/txorphanage.cpp
    57@@ -276 +276 @@ void TxOrphanage::AddChildrenToWorkSet(const CTransaction& tx, FastRandomContext
    58-                    std::erase_if(orphan_work_set, [&](const auto& wtxid) { return m_orphans.contains(wtxid); });
    59+                    std::erase_if(orphan_work_set, [&](const auto& wtxid) { return !m_orphans.contains(wtxid); });
    

    glozow commented at 12:45 pm on February 11, 2025:
    Wow, very bad bit flip :facepalm: thank you

    glozow commented at 5:06 pm on February 11, 2025:
    Wrote a similar test to check that an evicted work item is the one that doesn’t exist in m_orphans anymore.
  25. in src/txorphanage.cpp:281 in 33034eaa3b outdated
    281+                    peers_workset_trimmed.insert(announcer);
    282+                }
    283+
    284+                // Add this tx to the work set. If the workset is full, even after trimming, don't
    285+                // accept any new work items until the work queue has been flushed.
    286+                if (orphan_work_set.size() < MAX_ORPHAN_WORK_QUEUE) {
    


    instagibbs commented at 8:40 pm on February 10, 2025:
    should we debug log if we’re not adding to work set? might be good to know it’s happening

    glozow commented at 4:39 am on February 12, 2025:
    Added a couple pushes ago, but gone again. After a bit of offline discussion with @mzumsande and @sipa, it seemed better to just synchronously remove wtxids from worksets when they are removed as announcements. This means that the work set is always a subset of the announcement set (added this to SanityCheck). Also, potentially failing to add things to the work set seemed to make this less useful.

    instagibbs commented at 12:45 pm on February 12, 2025:
    Ok! that was my suggestion offline too. Bounding announcements means we bound the workset :+1:
  26. in src/txorphanage.cpp:149 in ff24c1feeb outdated
    139@@ -140,13 +140,12 @@ void TxOrphanage::EraseForPeer(NodeId peer)
    140     if (nErased > 0) LogDebug(BCLog::TXPACKAGES, "Erased %d orphan transaction(s) from peer=%d\n", nErased, peer);
    141 }
    142 
    143-void TxOrphanage::LimitOrphans(unsigned int max_orphans, FastRandomContext& rng)
    144+unsigned int TxOrphanage::MaybeExpireOrphans()
    145 {
    146-    unsigned int nEvicted = 0;
    147+    int nErased = 0;
    


    mzumsande commented at 8:42 pm on February 10, 2025:
    nit: doesn’t really matter if unsigned int (return value) or int is used, but would be nice to make it consistent, also with the %u / %d format specifiers in the logprints.

    glozow commented at 4:37 am on February 12, 2025:
    made consistent + made both logs %u
  27. in src/txorphanage.cpp:245 in ff24c1feeb outdated
    175         ++nEvicted;
    176     }
    177+    return nEvicted;
    178+}
    179+
    180+void TxOrphanage::LimitOrphans(unsigned int max_orphans, FastRandomContext& rng)
    


    mzumsande commented at 9:12 pm on February 10, 2025:
    nit: commit msg of c929d71c0828544be509934312b6a7d11b47ea4d lacks a verb (e.g. “split”).

    glozow commented at 4:25 am on February 12, 2025:
    added
  28. in test/functional/p2p_opportunistic_1p1c.py:451 in 653f1bb84d outdated
    446+        # normal package request to time out.
    447+        self.wait_until(lambda: len(node.getorphantxs()) == num_individual_dosers)
    448+
    449+        self.log.info("Send an orphan from a non-DoSy peer. Its orphan should not be evicted.")
    450+        low_fee_parent = self.create_tx_below_mempoolminfee(self.wallet)
    451+        high_fee_child = self.wallet.create_self_transfer(utxo_to_spend=low_fee_parent["new_utxo"], fee_rate=20*FEERATE_1SAT_VB)
    


    instagibbs commented at 9:22 pm on February 10, 2025:
    the honest orphan should be as large as possible: target_vsize=100000

    glozow commented at 12:38 pm on February 11, 2025:
    done
  29. in test/functional/p2p_opportunistic_1p1c.py:500 in 653f1bb84d outdated
    495+                # runtime of this test.
    496+                peer_doser_batch.send_message(msg_tx(tx))
    497+
    498+        self.log.info("Send an orphan from a non-DoSy peer. Its orphan should not be evicted.")
    499+        low_fee_parent = self.create_tx_below_mempoolminfee(self.wallet)
    500+        high_fee_child = self.wallet.create_self_transfer(utxo_to_spend=low_fee_parent["new_utxo"], fee_rate=20*FEERATE_1SAT_VB)
    


    instagibbs commented at 9:23 pm on February 10, 2025:
    the honest orphan should be as large as possible: target_vsize=100000

    glozow commented at 12:38 pm on February 11, 2025:
    done
  30. in test/functional/p2p_opportunistic_1p1c.py:485 in 653f1bb84d outdated
    473+        self.log.info("Provide the orphan's parent. This 1p1c package should be successfully accepted.")
    474+        peer_normal.send_and_ping(msg_tx(low_fee_parent["tx"]))
    475+        assert_equal(node.getmempoolentry(orphan_tx.rehash())["ancestorcount"], 2)
    476+
    477+    @cleanup
    478+    def test_orphanage_dos_many(self):
    


    instagibbs commented at 9:26 pm on February 10, 2025:
    can we get a functional test case that covers the “protects fully sized ancestor package” scenario in p2p_orphan_handling.py?

    glozow commented at 4:38 am on February 12, 2025:
    (thanks, I took your test with just a few tweaks)
  31. instagibbs commented at 9:27 pm on February 10, 2025: member

    The resource bounds additions seem to make sense, still working through the workset change implications.

    I’ve got a minimal fuzz harness checking that the “honest” peer cannot be evicted, please feel free to take it: https://github.com/instagibbs/bitcoin/tree/2025-01-orphanage-peer-dos_greg_2

  32. in src/txorphanage.h:236 in ef2f44e653 outdated
    230+    unsigned int GetPerPeerMaxUsage() const {
    231+        return m_reserved_weight_per_peer;
    232+    }
    233+
    234+    unsigned int GetGlobalMaxUsage() const {
    235+        return std::max<unsigned int>(m_peer_orphanage_info.size() * m_reserved_weight_per_peer, 1);
    


    mzumsande commented at 9:58 pm on February 10, 2025:
    should this be int64_t instead of int in the spirit of the first commit?

    glozow commented at 4:37 am on February 12, 2025:
    fixed

    sipa commented at 2:25 pm on February 12, 2025:

    In commit “[txorphanage] when full, evict from the DoSiest peers first”

    The int64_t return type won’t actually do anything here, because std::max is instantiated for unsigned int, and also, the std::map::size() may return something smaller (particularly on 32-bit systems).

    0int64_t GetGlobalMaxUsage() const {
    1        return std::max<int64_t>(int64_t(m_peer_orphanage_info.size()) * m_reserved_weight_per_peer, 1);
    2}
    

    glozow commented at 12:57 pm on February 14, 2025:
    Thanks! Fixed
  33. in src/txorphanage.cpp:185 in ef2f44e653 outdated
    169@@ -165,13 +170,68 @@ unsigned int TxOrphanage::MaybeExpireOrphans()
    170 
    171 unsigned int TxOrphanage::MaybeTrimOrphans(unsigned int max_orphans, FastRandomContext& rng)
    172 {
    173+    // Exit early to avoid building the heap unnecessarily
    174+    if (!NeedsTrim() && m_orphans.size() <= max_orphans) return 0;
    175+
    176+    std::vector<PeerMap::iterator> peer_it_heap;
    177+    for (auto it = m_peer_orphanage_info.begin(); it != m_peer_orphanage_info.end(); ++it) peer_it_heap.push_back(it);
    178+    peer_it_heap.reserve(m_peer_orphanage_info.size());
    


    mzumsande commented at 10:35 pm on February 10, 2025:
    should call reserve before pushing entries to peer_it_heap, not after.

    glozow commented at 4:25 am on February 12, 2025:
    fixed
  34. in src/txorphanage.cpp:173 in ef2f44e653 outdated
    169@@ -165,13 +170,68 @@ unsigned int TxOrphanage::MaybeExpireOrphans()
    170 
    171 unsigned int TxOrphanage::MaybeTrimOrphans(unsigned int max_orphans, FastRandomContext& rng)
    172 {
    173+    // Exit early to avoid building the heap unnecessarily
    174+    if (!NeedsTrim() && m_orphans.size() <= max_orphans) return 0;
    


    mzumsande commented at 10:40 pm on February 10, 2025:
    Could m_orphans.size() <= max_orphans be inside NeedsTrim?

    glozow commented at 4:25 am on February 12, 2025:
    Moved into NeedsTrim
  35. glozow force-pushed on Feb 11, 2025
  36. glozow commented at 5:04 pm on February 11, 2025: member
    Thanks @instagibbs for the testing and review, added your fuzz commits and took comments. Still need to write the p2p_orphan_handling test.
  37. glozow force-pushed on Feb 11, 2025
  38. DrahtBot added the label CI failed on Feb 11, 2025
  39. DrahtBot commented at 5:08 pm on February 11, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/37041607307

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  40. glozow force-pushed on Feb 11, 2025
  41. mzumsande commented at 7:33 pm on February 11, 2025: contributor

    Halfway through, some minor points below - my main conceptual question is why m_total_announcements is a meaningful metric in limiting the orphanage.

    My understanding is that m_total_orphan_usage exists to limit memory usage, and m_total_announcements to limit CPU usage - but why the number of announcements instead of number of orphans? Why would it make the situation any less DoSy if we remove an announcer but keep the orphan? Since we only assign the tx to one peer’s workset after 7426afbe62414fa575f91b4f8d3ea63bcc653e8b, more announcers for the same number of orphans doesn’t really mean any additional work.

  42. glozow commented at 10:49 pm on February 11, 2025: member

    My understanding is that m_total_orphan_usage exists to limit memory usage, and m_total_announcements to limit CPU usage - but why the number of announcements instead of number of orphans?

    Yep, to limit CPU usage. The complexity of eviction for example is bounded by the total number of announcements: in the worst case, each orphan has many announcers and the MaybeTrimOrphans loop first removes announcements until each orphan just has 1 left, and then finally can remove transactions. See comment above declaration, “The loop can run a maximum of m_max_global_announcement times”

    Why would it make the situation any less DoSy if we remove an announcer but keep the orphan?

    Perhaps I should have stated this in the OP more explicitly, but a major motivation for this eviction strategy is to prevent any peer from being able to evict another peer’s announcements, hence the per-peer limits. If we changed the eviction code to remove orphans wholesale instead of just announcements, we’d have a similar situation to today’s: an attacker can cause churn of an honest orphan by announcing it along with a lot of other orphans.

    So evicting announcements instead of orphans isn’t less DoSy, but it does make the orphanage less churnable.
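
    A toy model of this eviction strategy (illustrative Python only, not the PR’s C++, and simplified to count only announcements rather than the full announcement/memory DoS score): announcements are evicted from the most over-limit peer first, and a transaction is only dropped once its last announcer is gone, so a peer that stays within its own limits cannot have its announcements churned out by others.

    ```python
    import random

    def trim(orphanage, max_announcements, rng=random):
        """Evict announcements, most-over-limit peer first, until the global
        announcement budget is met. An orphan is only deleted once its last
        announcer is removed."""
        # orphanage maps wtxid -> set of announcing peers
        while sum(len(peers) for peers in orphanage.values()) > max_announcements:
            by_peer = {}
            for wtxid, peers in orphanage.items():
                for peer in peers:
                    by_peer.setdefault(peer, []).append(wtxid)
            dosiest = max(by_peer, key=lambda p: len(by_peer[p]))  # most announcements
            victim = rng.choice(by_peer[dosiest])  # random announcement of that peer
            orphanage[victim].discard(dosiest)
            if not orphanage[victim]:  # last announcer gone: drop the orphan itself
                del orphanage[victim]
    ```

    Each loop iteration removes exactly one announcement, which matches the bound discussed above: the loop runs at most O(total_announcements) times.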

  43. glozow force-pushed on Feb 12, 2025
  44. glozow force-pushed on Feb 12, 2025
  45. in test/functional/p2p_orphan_handling.py:753 in 4e17767f4b outdated
    748+            peer_doser.send_and_ping(msg_tx(large_orphan))
    749+
    750+        self.log.info("Provide the top ancestor. The whole package should be re-evaluated after enough time.")
    751+        peer_normal.send_and_ping(msg_tx(ancestor_package[0]["tx"]))
    752+
    753+        self.wait_until(lambda: node.getmempoolentry(ancestor_package[-1]["txid"])["ancestorcount"] == DEFAULT_ANCESTOR_LIMIT)
    


    instagibbs commented at 2:06 pm on February 12, 2025:
    on second thought, this will probably throw an assertion if the final child isn’t in the mempool yet? Probably need to prepend this clause with ancestor_package[-1]["txid"] in node.getrawmempool()

    glozow commented at 4:00 pm on February 13, 2025:
    ah hm, maybe we don’t need to wait? I guess if you send a ping, you don’t get a pong until all 24 are processed.

    instagibbs commented at 5:04 pm on February 13, 2025:
    I’ll defer to you, but IIUC we’ll do ~1 orphan processing per message processing step, so it might take a bit more to process all 24 from orphanage? Alternatively we could query the first tx and wait until the descendant count hits DEFAULT_ANCESTOR_LIMIT

    glozow commented at 12:52 pm on February 14, 2025:
    Hold on, I don’t see why it’d throw an assertion? ancestor_package[-1] is the last child right? Added some more comments but didn’t change the code
  46. in test/functional/p2p_orphan_handling.py:69 in 889afbadb4 outdated
    60-            self.nodes[0].bumpmocktime(LONG_TIME_SKIP)
    61             # Check that mempool and orphanage have been cleared
    62             self.wait_until(lambda: len(self.nodes[0].getorphantxs()) == 0)
    63             assert_equal(0, len(self.nodes[0].getrawmempool()))
    64+
    65+            self.restart_node(0, extra_args=["-persistmempool=0"])
    


    instagibbs commented at 2:20 pm on February 12, 2025:

    889afbadb417e9422c7c06fd074981fa62045568

    Since we’re restarting does this wipe the mocktime on the node? Doesn’t seem to affect timings, but I think it’s easier to think about the test this way.

    note that if you set it here, you also need to setmocktime any other time you restart as well


    glozow commented at 12:53 pm on February 14, 2025:
    nice, thanks - added a setmocktime() after all the restarts
  47. in src/txorphanage.h:179 in 8520f1e493 outdated
    169@@ -138,10 +170,27 @@ class TxOrphanage {
    170          * m_total_orphan_size. If a peer is removed as an announcer, even if the orphan still
    171          * remains in the orphanage, this number will be decremented. */
    172         int64_t m_total_usage{0};
    173+
    174+        /** Orphan transactions in vector for quick random eviction */
    175+        std::vector<OrphanMap::iterator> m_iter_list;
    


    sipa commented at 2:21 pm on February 12, 2025:

    In commit “[txorphanage] when full, evict from the DoSiest peers first”

    I think it would be helpful to add this variable, and the testing thereof, in a separate commit from the actual eviction changes.


    glozow commented at 12:53 pm on February 14, 2025:
    Added a commit before that one, just adding the list and sanity checking
  48. in src/txorphanage.h:191 in 8520f1e493 outdated
    181+         * If the peer is using more than the allowed for either resource, its DoS score is > 1.
    182+         * A peer having a DoS score > 1 does not necessarily mean that something is wrong, since we
    183+         * do not trim unless the orphanage exceeds global limits, but it means that this peer will
    184+         * be selected for trimming sooner. */
    185+        FeeFrac GetDoSScore(unsigned int peer_max_ann, unsigned int peer_max_mem) {
    186+            FeeFrac cpu_score(m_iter_list.size(), peer_max_ann);
    


    sipa commented at 2:26 pm on February 12, 2025:
    Huh, neat, I hadn’t considered using FeeFrac here, but it fits.
  49. in src/txorphanage.cpp:192 in 8520f1e493 outdated
    178+
    179+    // Sort peers that have the highest ratio of DoSiness first
    180+    auto compare_peer = [this](PeerMap::iterator left, PeerMap::iterator right) {
    181+        const auto max_ann{GetPerPeerMaxAnnouncements()};
    182+        const auto max_mem{GetPerPeerMaxUsage()};
    183+        return left->second.GetDoSScore(max_ann, max_mem) < right->second.GetDoSScore(max_ann, max_mem);
    


    sipa commented at 2:31 pm on February 12, 2025:

    In commit “[txorphanage] when full, evict from the DoSiest peers first”

    Using FeeFrac::operator< here, which tiebreaks by biggest denominator in case the ratios are equal, means that if two peers are equally DoSy, but one is so due to memory usage and the other due to announcements, the announcements one will be targeted first. That’s probably fine, but perhaps worth documenting.


    glozow commented at 12:53 pm on February 14, 2025:
    good point, added a comment
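
    The comparison discussed above can be modeled with a toy stand-in for FeeFrac (a sketch, not Core’s actual FeeFrac implementation): ratios are compared exactly by cross-multiplication, and the equal-ratio tiebreak is implemented as described in the comment, with the bigger denominator comparing as greater.

    ```python
    from functools import total_ordering

    @total_ordering
    class Frac:
        """Toy stand-in for FeeFrac used as a DoS score: an exact num/den
        ratio compared by cross-multiplication, no floating point."""
        def __init__(self, num, den):
            self.num, self.den = num, den

        def __eq__(self, other):
            return (self.num * other.den == other.num * self.den
                    and self.den == other.den)

        def __lt__(self, other):
            a, b = self.num * other.den, other.num * self.den
            if a != b:
                return a < b
            # equal ratios: the bigger denominator compares as greater
            return self.den < other.den

    def dos_score(num_anns, max_anns, usage, max_usage):
        """A peer's DoS score: the worse of its CPU (announcement) and
        memory usage ratios, each relative to its per-peer allowance."""
        return max(Frac(num_anns, max_anns), Frac(usage, max_usage))
    ```

    With this ordering, peers can be sorted (or heapified) so that the most over-budget peer is trimmed first, and a peer under both of its limits always scores below 1.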
  50. sipa commented at 2:35 pm on February 12, 2025: member
    Approach ACK
  51. in test/functional/p2p_opportunistic_1p1c.py:506 in acbe37029e outdated
    501+                # Don't sync with ping or wait for responses, because it dramatically increases the
    502+                # runtime of this test.
    503+                peer_doser_batch.send_message(msg_tx(tx))
    504+
    505+        # Something was evicted
    506+        self.wait_until(lambda: len(node.getorphantxs()) < batch_size * num_peers)
    


    instagibbs commented at 2:56 pm on February 12, 2025:

    this will trivially pass (and not actually ensure there was an eviction) unless you let the orphans be processed with a send_and_ping.

    It’s also not true, because no evictions will happen with just 1000 orphans

    Here’s a suggested patchset which has this subtest run in ~2m30s and actually causes evictions with default parameters: https://github.com/instagibbs/bitcoin/commit/fedea4a17b7fc4c442b0ad98b51b85ff93a55beb


    glozow commented at 12:54 pm on February 14, 2025:
    Thanks! Added. It does take a long time but that’s the only way to get evictions due to announcements in a functional test.

    glozow commented at 2:07 pm on February 14, 2025:

    Hm, one of the CI failures is a timeout for this (the other is a wallet thing). Perhaps it takes a bit too long? A few ideas:

    • Change these to batches of 200 * 10 + 101 * 10 and just do the wait_until once to make the test faster.
    • Use -maxorphantxs to reduce the limit.
    • Wait after each orphan is sent. This makes the overall test a lot longer, but makes it less likely we’ll hit the timeout.
    • Keep a lower count of txns and settle for a test that fails on master but not on this PR.

    instagibbs commented at 2:09 pm on February 14, 2025:

    Too bad :(

    edit: my guess is we’ll need to lower max orphan count to make CI runs happy


    glozow commented at 5:54 pm on February 14, 2025:
    Hm, I don’t want to manually change -maxorphantxs. But that’s the point of introducing the test prior to increasing the default to 3000: at that commit, the orphanage doesn’t go past 100, so it’s definitely doing evictions even with just 1010 + 101 orphans.

    glozow commented at 6:41 pm on February 14, 2025:
    Ok, I’ve made it 101 * 30 + 101 = 3131 total. I think the wait_until for the 1p1c orphan to be in the orphanage kind of achieves what we want (i.e. evictions for each of the previous orphans have already been calculated) even though we don’t explicitly wait for each peer. (edit: this is not true anymore)

    glozow commented at 2:05 pm on February 19, 2025:

    fwiw, here’s what I landed on for test_orphanage_dos_many: (1) send 51 of the same orphan from 60 peers (51 orphans, 3060 announcements). sync_with_ping and wait until at least one of the orphans is in orphanage (2) send the 1p1c, which is a small orphan (not 100kvb because prior to increasing -maxorphantxs, this peer’s memory DoS score could be higher than the other peers’ CPU DoS scores and get the tx evicted. Better to compare apples to apples). (3) send 51 unique orphans from 40 peers (2040 orphans, 2040 announcements). sync_with_ping and wait until we have at least 1 of the orphans from each peer. This gives us 2091 orphans, 5100 announcements total from the DoSy peers, plus 1 normal from the 1p1c.

    Before we increase -maxorphantxs, there are evictions during (3). After we increase -maxorphantxs, there are evictions in (1) and (3). The reason I’m doing the shared orphans first is so that we get to maximum announcements more quickly, and we can expect evictions during both rounds.

  52. in test/functional/p2p_opportunistic_1p1c.py:540 in acbe37029e outdated
    535+            peer_doser_shared = node.add_p2p_connection(P2PInterface())
    536+            for orphan in shared_orphans:
    537+                peer_doser_shared.send_message(msg_tx(orphan))
    538+
    539+        # Something was evicted; the orphanage does not contain all DoS orphans + the 1p1c child
    540+        self.wait_until(lambda: len(node.getorphantxs()) < batch_size * num_peers + len(shared_orphans) + 1)
    


    instagibbs commented at 3:33 pm on February 12, 2025:
    this is a buggy assertion because no evictions will be happening, you might just front-run validation and slip through on your local machine. see https://github.com/instagibbs/bitcoin/commit/fedea4a17b7fc4c442b0ad98b51b85ff93a55beb for what I think would be a valid assertion

    glozow commented at 12:55 pm on February 14, 2025:
    Added thanks!
  53. instagibbs commented at 3:34 pm on February 12, 2025: member
    combing through tests a bit, think I spotted the CI failure cause
  54. glozow added this to the milestone 29.0 on Feb 12, 2025
  55. glozow requested review from sr-gi on Feb 13, 2025
  56. glozow requested review from stickies-v on Feb 13, 2025
  57. glozow force-pushed on Feb 14, 2025
  58. DrahtBot removed the label CI failed on Feb 14, 2025
  59. glozow force-pushed on Feb 14, 2025
  60. glozow force-pushed on Feb 14, 2025
  61. DrahtBot added the label CI failed on Feb 14, 2025
  62. in src/test/fuzz/txorphan_protected.cpp:46 in a704b8a959 outdated
    41+
    42+    // We have NUM_PEERS, of which Peer==0 is the "honest" one
    43+    // who will never exceed their reserved weight of announcement
    44+    // count, and should therefore never be evicted.
    45+    const unsigned int NUM_PEERS = fuzzed_data_provider.ConsumeIntegralInRange<unsigned int>(1, 125);
    46+    const unsigned int global_announcement_limit = fuzzed_data_provider.ConsumeIntegralInRange<unsigned int>(NUM_PEERS, 1'000);
    


    instagibbs commented at 7:04 pm on February 14, 2025:
    Let’s set the upper range to DEFAULT_MAX_ORPHAN_ANNOUNCEMENTS?

    glozow commented at 7:23 pm on February 14, 2025:
    done
  63. in src/test/fuzz/txorphan_protected.cpp:50 in a704b8a959 outdated
    45+    const unsigned int NUM_PEERS = fuzzed_data_provider.ConsumeIntegralInRange<unsigned int>(1, 125);
    46+    const unsigned int global_announcement_limit = fuzzed_data_provider.ConsumeIntegralInRange<unsigned int>(NUM_PEERS, 1'000);
    47+    // This must match announcement limit (or exceed) otherwise "honest" peer can be evicted
    48+    const unsigned int global_tx_limit = global_announcement_limit;
    49+    const unsigned int per_peer_announcements = global_announcement_limit / NUM_PEERS;
    50+    const unsigned int per_peer_weight_reservation = fuzzed_data_provider.ConsumeIntegralInRange<unsigned int>(1, 4'000);
    


    instagibbs commented at 7:05 pm on February 14, 2025:
    set the upper range to DEFAULT_ANCESTOR_SIZE_LIMIT_KVB * 4000? See no reason not to cover the full range (we could also increase the num_outs range to make larger ranges hit, based on a bool?)

    glozow commented at 7:24 pm on February 14, 2025:
    I made it DEFAULT_RESERVED_ORPHAN_WEIGHT_PER_PEER * 10
  64. glozow force-pushed on Feb 14, 2025
  65. glozow force-pushed on Feb 14, 2025
  66. in src/test/orphanage_tests.cpp:252 in 064af55a0a outdated
    247+    // Force an eviction. Note that no limiting has happened yet at this point (we haven't called
    248+    // LimitOrphans yet) so it may be oversize and LimitOrphans may evict more than 1 transaction.
    249+    // All evictions will be from the dos_peer's transactions.
    250+    const auto prev_count = orphanage.Size();
    251+    orphanage.LimitOrphans(prev_count - 1, det_rand);
    252+    BOOST_CHECK(orphanage.Size() <= prev_count - 1);
    


    mzumsande commented at 9:01 pm on February 14, 2025:
    This seems very cautious. The test first adds 150 txns, then another DEFAULT_MAX_ORPHAN_ANNOUNCEMENTS (so there is a 1 to 1 relation between announcements and txns). If all txns are distinct, this should be enough to assert that Size() should shrink by 150 compared to prev_count, not just 1.

    glozow commented at 8:16 pm on February 18, 2025:
    changed to be asserting exact counts πŸ‘
  67. in test/functional/p2p_opportunistic_1p1c.py:478 in 793c80be58 outdated
    473+        self.log.info("Send another round of very large orphans from a DoSy peer")
    474+        for large_orphan in large_orphans[60:]:
    475+            peer_doser.send_and_ping(msg_tx(large_orphan))
    476+
    477+        # Something was evicted; the orphanage does not contain all large orphans + the 1p1c child
    478+        self.wait_until(lambda: len(node.getorphantxs()) < len(large_orphans) + 1)
    


    mzumsande commented at 9:26 pm on February 14, 2025:
    the numbers don’t add up for me: the test creates 100 large orphans, sends 20, sends 1 small tx, then sends another 40 large orphans, and finally asserts that there are fewer than 101 entries in the orphanage. This would also be true without any eviction.

    glozow commented at 8:15 pm on February 18, 2025:
    thanks, fixed
  68. DrahtBot removed the label CI failed on Feb 14, 2025
  69. in src/txorphanage.cpp:100 in 944f61e6d5 outdated
     96         auto peer_it = m_peer_orphanage_info.find(peer);
     97         if (Assume(peer_it != m_peer_orphanage_info.end())) {
     98             peer_it->second.m_total_usage -= tx_size;
     99+
    100+            auto& orphan_list = peer_it->second.m_iter_list;
    101+            size_t old_pos = std::distance(orphan_list.begin(), std::find(orphan_list.begin(), orphan_list.end(), it));
    


    sipa commented at 2:32 pm on February 15, 2025:

    In commit “[txorphanage] add per-peer iterator list and announcements accounting”

    The cost of this std::distance may be O(num_orphans_per_peer), and the for (peer : it->second.announcers) loop around it can run up to O(num_announcers_per_tx) times. However, since the sum of all orphan_list lengths is equal to the total number of announcements, the overall cost of EraseTx is bounded by O(total_announcements). In both MaybeExpireOrphans and EraseForBlock, EraseTx may be invoked once per orphan, so I think this may actually mean O(total_orphans * total_announcements), which may be millions of steps, which could mean a concerning amount of time. Something similar may apply to the call in MaybeTrimOrphans, but it’s a bit more complicated to analyse. Does that sound right?

    One possibility to reduce this may be to batch the removals. Replace EraseTx with a function that takes a set of wtxids to remove, and just loops over all peers’ orphan_lists once, removing anything that’s in the set. That would reduce the cost to just O(total_announcements). However, it would mean always iterating over all announcements whenever an orphan is erased.

    Alternatively, the iter_lists could be replaced with sets for faster removal, but that would increase the cost of random evictions from O(1) to O(n). That might not actually be an improvement for MaybeTrimOrphans.

    I think the proper solution is replacing the data structures that together encode the announcement sets (announcers, m_iter_list, and optionally also m_work_set) with a single global boost::multiindex, with hashed by-wtxid index, and a ranked by-(peer,wtxid) index (which allows for fast random access).
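
    The batched-removal alternative could be pictured roughly like this (illustrative Python with hypothetical names, not the PR’s code): one pass over every peer’s announcement list removes all announcements of the doomed wtxids, bounding the whole batch at O(total_announcements) regardless of how many orphans are erased.

    ```python
    def erase_txs(orphans, peer_lists, wtxids_to_remove):
        """Batched version of EraseTx: filter every peer's announcement list
        once, then drop the erased orphans themselves."""
        doomed = set(wtxids_to_remove)
        for peer, lst in peer_lists.items():
            # one linear pass per peer; total work across peers is the
            # total number of announcements
            peer_lists[peer] = [w for w in lst if w not in doomed]
        for w in doomed:
            orphans.pop(w, None)
    ```

    The tradeoff noted above applies: every erase batch touches all announcements, even when only a few orphans are being removed.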


    instagibbs commented at 4:21 pm on February 15, 2025:

    Added a few more benchmarks to get an idea of what it could look like with current PR: https://github.com/instagibbs/bitcoin/commit/ba2e3e339cafdf1b38742b2c288a18dd32c63db3

    0|               ns/op |                op/s |    err% |     total | benchmark
    1|--------------------:|--------------------:|--------:|----------:|:----------
    2|        7,012,786.00 |              142.60 |    0.9% |      0.08 | `OrphanageEvictionBlockManyPeers`
    3|       27,505,341.00 |               36.36 |    0.8% |      0.30 | `OrphanageEvictionBlockOnePeer`
    4|       12,507,729.00 |               79.95 |    0.6% |      0.14 | `OrphanageEvictionManyWithManyPeers`
    5|       26,721,356.00 |               37.42 |    0.4% |      0.29 | `OrphanageEvictionManyWithOnePeer`
    6|        7,262,273.00 |              137.70 |    3.3% |      0.08 | `OrphanageEvictionPeerMany`
    7|       22,306,678.00 |               44.83 |    0.7% |      0.25 | `OrphanageEvictionPeerOne`
    

    Added EraseForBlock, EraseForPeer(in a loop), and parameterized number of peers. Looks like the std::distance work is causing most of the time since things are slower with a single peer?


    sipa commented at 4:35 pm on February 15, 2025:

    @instagibbs I believe the worst case when max_orphans == max_announcements is to have exactly one peer, and then erasing the transactions in the reverse order they appear in m_iter_list.

    That would cost n^2/2 steps in std::distance.

    When max_announcements is larger than max_orphans, the worst case is having max_announcements / max_orphans peers, and every transaction be announced by all, I think.


    instagibbs commented at 4:44 pm on February 15, 2025:

    I believe the worst case when max_orphans == max_announcements is to have exactly one peer, and then erasing the transactions in the reverse order they appear in m_iter_list.

    I swapped the order of the block txns to force it to walk the whole list for the single peer, it’s about 10% slower


    glozow commented at 6:22 pm on February 15, 2025:

    Another possible solution: replace the set<NodeId> announcers in OrphanTxBase with a std::map<NodeId, size_t> announcers, where the value is the orphan’s position in PeerOrphanInfo::m_iter_list. Previously, we had a list_pos that tracked the orphan’s location in m_orphan_list. That removes the need to do std::distance.

    On the whole though, I agree a multiindex is probably the most natural data structure for orphanage.


    sipa commented at 6:32 pm on February 15, 2025:
    @glozow That works, I believe.

    glozow commented at 10:15 pm on February 18, 2025:

    replace set announcers in OrphanTxBase with a std::map<NodeId, size_t> announcers, where the value is the orphan’s position in the PeerOrphanInfo::m_iter_list

    Went ahead with this solution.
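
    The position-map idea boils down to classic swap-and-pop: each announcer’s entry stores its index into the peer’s m_iter_list, so removal is O(1) with no std::distance scan. A minimal Python sketch (hypothetical names):

    ```python
    def remove_at(iter_list, positions, idx):
        """O(1) removal: swap the element at idx with the last element,
        pop the back, and update the moved element's stored position.
        `positions` maps entry -> its index in iter_list (the list_pos)."""
        removed = iter_list[idx]
        last = len(iter_list) - 1
        if idx != last:
            iter_list[idx] = iter_list[last]
            positions[iter_list[idx]] = idx  # keep the moved entry's list_pos accurate
        iter_list.pop()
        del positions[removed]
        return removed
    ```

    This is the same trick the global m_orphan_list already uses for random eviction; keeping the index in the announcer map just extends it to the per-peer lists.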

  70. in test/functional/p2p_orphan_handling.py:635 in 212e1ea50c outdated
    648-        assert_equal(len(orphanage), DEFAULT_MAX_ORPHAN_TRANSACTIONS)
    649-
    650-        self.log.info("Clearing the orphanage")
    651-        for index, parent_orphan in enumerate(parent_orphans):
    652-            peer_1.send_and_ping(msg_tx(parent_orphan))
    653-        self.wait_until(lambda: len(node.getorphantxs()) == 0)
    


    kevkevinpal commented at 3:25 am on February 16, 2025:

    instead of removing this test we can keep it if we restart the node with the previous max orphan amount

    0self.restart_node(0, extra_args=["-maxorphantx=" + str(DEFAULT_MAX_ORPHAN_TRANSACTIONS)])
    

    and we can probably move DEFAULT_MAX_ORPHAN_TRANSACTIONS into test_max_orphan_amount and rename it to max_orphan_amount, since this isn’t the default max orphan amount anymore.

    If we still don’t want this test we can remove DEFAULT_MAX_ORPHAN_TRANSACTIONS since it is only used in this test


    glozow commented at 8:15 pm on February 18, 2025:
    thanks, removed DEFAULT_MAX_ORPHAN_TRANSACTIONS
  71. glozow force-pushed on Feb 18, 2025
  72. glozow force-pushed on Feb 18, 2025
  73. in src/txorphanage.cpp:41 in 19c77223cd outdated
    35@@ -36,8 +36,12 @@ bool TxOrphanage::AddTx(const CTransactionRef& tx, NodeId peer)
    36         return false;
    37     }
    38 
    39-    auto ret = m_orphans.emplace(wtxid, OrphanTx{{tx, {peer}, Now<NodeSeconds>() + ORPHAN_TX_EXPIRE_TIME}, m_orphan_list.size()});
    40+    auto& orphan_list = m_peer_orphanage_info.try_emplace(peer).first->second.m_iter_list;
    41+    std::map<NodeId, size_t> announcer_list_pos{{peer, orphan_list.size()}};
    42+    auto ret = m_orphans.emplace(wtxid, OrphanTx{{tx, announcer_list_pos, Now<NodeSeconds>() + ORPHAN_TX_EXPIRE_TIME}, m_orphan_list.size()});
    


    sipa commented at 2:04 pm on February 18, 2025:

    In commit “[txorphanage] add per-peer iterator list and announcements accounting”

    Use tx, std::move(announcer_list_pos), ... to avoid an allocation.


    glozow commented at 8:58 pm on February 18, 2025:
    thanks, done
  74. in src/txorphanage.h:158 in 19c77223cd outdated
    153+        /** Orphan transactions announced by this peer. */
    154+        std::vector<OrphanMap::iterator> m_iter_list;
    155+
    156+        /** Remove the element at list_pos in m_iter_list in O(1) time by swapping the last element
    157+         * with the one at list_pos and popping the back if there are multiple elements. Returns the
    158+         * swapped element, if applicable, so that the caller can update its list_pos.
    


    sipa commented at 2:18 pm on February 18, 2025:

    In commit “[txorphanage] add per-peer iterator list and announcements accounting”

    Would it be possible to do the list_pos updating here without needing to return an iteration to push that responsibility to the caller? It would mean RemoveIterAt would need to know what peer it’s operating in, so that means it’s perhaps more appropriate to have it as TxOrphanage member function rather than a PeerOrphanInfo member function.


    glozow commented at 8:58 pm on February 18, 2025:
    good point, I’ve made it a TxOrphanage method now
  75. in src/txorphanage.cpp:112 in 19c77223cd outdated
    110@@ -104,6 +111,7 @@ int TxOrphanage::EraseTx(const Wtxid& wtxid)
    111         m_orphan_list[old_pos] = it_last;
    112         it_last->second.list_pos = old_pos;
    


    sipa commented at 2:26 pm on February 18, 2025:

    In commit “[txorphanage] add per-peer iterator list and announcements accounting”

    I think this line, and the 7 lines before it, can be replaced with RemoveIterAt(it->second.list_pos), especially if it can be changed to do the list_pos updating internally?

    Not very important as the code disappears in the next commit, but would make it more obviously correct.


    glozow commented at 8:45 pm on February 18, 2025:
    Maybe I’m misunderstanding, but I don’t think we can use RemoveIterAt (which is for updating the peer list) for this (which is the global TxOrphanage::m_orphan_list)?

    sipa commented at 8:48 pm on February 18, 2025:
    Oh of course; it just looked very similar.
  76. in src/bench/txorphanage.cpp:102 in 2bffbd5c9d outdated
     96@@ -97,4 +97,105 @@ static void OrphanageEraseForBlockSinglePeer(benchmark::Bench& bench)
     97     });
     98 }
     99 
    100+static void OrphanageEvictionManyPeers(benchmark::Bench& bench)
    


    sipa commented at 3:21 pm on February 18, 2025:

    In commit “[bench] TxOrphanage::LimitOrphans”

    Would it make sense to introduce this benchmark earlier (and the other ones below), so we can see what effect the previous commit has on it?


    glozow commented at 8:58 pm on February 18, 2025:
    yes, moved it up
  77. DrahtBot added the label CI failed on Feb 18, 2025
  78. glozow force-pushed on Feb 18, 2025
  79. DrahtBot removed the label CI failed on Feb 18, 2025
  80. in src/bench/txorphanage.cpp:1 in 032b623753 outdated
    0@@ -0,0 +1,100 @@
    1+// Copyright (c) 2011-2022 The Bitcoin Core developers
    


    sipa commented at 1:12 am on February 19, 2025:

    In commit “[bench] TxOrphanage::EraseForBlock”

    Better not to have years listed than outdated/wrong ones.

  81. in src/txorphanage.h:184 in 56f05a73d8 outdated
    179@@ -168,6 +180,10 @@ class TxOrphanage {
    180 
    181     /** If there are more than max_orphans total orphans, evict randomly until that is no longer the case. */
    182     unsigned int MaybeTrimOrphans(unsigned int max_orphans, FastRandomContext& rng);
    183+
    184+    /** Remove the element at list_pos in m_iter_list in O(1) time by swapping the last element
    


    sipa commented at 1:18 am on February 19, 2025:

    In commit “[txorphanage] add per-peer iterator list and announcements accounting”

    Nit: I may have instigated this, but given the m_peer_orphanage_info.find(peer) call, it’s really O(log n) in the number of peers, not O(1).

  82. sipa commented at 2:09 pm on February 19, 2025: member

    The changes here have surprisingly little effect on the included benchmarks:

    • [bench] TxOrphanage::LimitOrphans
    ns/op op/s err% total benchmark
    68,742,718.06 14.55 0.2% 10.59 OrphanageEraseForBlockSinglePeer
    8,857.40 112,899.93 0.3% 10.86 OrphanageEvictionManyPeers
    1,544,919.67 647.28 0.2% 10.97 OrphanageWorksetManyPeers
    11,444,367.17 87.38 0.1% 11.00 OrphanageWorksetSinglePeer
    • [txorphanage] when full, evict from the DoSiest peers first
    ns/op op/s err% total benchmark
    65,643,925.00 15.23 1.7% 10.32 OrphanageEraseForBlockSinglePeer
    8,657.19 115,510.91 0.1% 10.83 OrphanageEvictionManyPeers
    1,472,007.25 679.34 0.6% 10.96 OrphanageWorksetManyPeers
    10,034,423.77 99.66 0.3% 11.05 OrphanageWorksetSinglePeer
    • [txorphanage] limit EraseForBlock iterations and use set instead of vec
    ns/op op/s err% total benchmark
    59,924,050.19 16.69 1.1% 10.98 OrphanageEraseForBlockSinglePeer
    8,684.76 115,144.26 0.3% 10.80 OrphanageEvictionManyPeers
    1,500,899.95 666.27 0.2% 11.10 OrphanageWorksetManyPeers
    10,619,315.89 94.17 0.3% 10.99 OrphanageWorksetSinglePeer
    • [txorphanage] when orphans are erased, delete them from worksets
    ns/op op/s err% total benchmark
    59,404,692.72 16.83 1.6% 10.70 OrphanageEraseForBlockSinglePeer
    8,824.15 113,325.33 0.1% 10.82 OrphanageEvictionManyPeers
    1,481,787.75 674.86 0.2% 10.96 OrphanageWorksetManyPeers
    10,452,273.40 95.67 0.5% 11.10 OrphanageWorksetSinglePeer
    • Increase default -maxorphantx to 3000
    ns/op op/s err% total benchmark
    59,868,390.44 16.70 1.8% 11.29 OrphanageEraseForBlockSinglePeer
    9,275.98 107,805.36 0.3% 10.97 OrphanageEvictionManyPeers
    1,538,763.63 649.87 0.2% 11.01 OrphanageWorksetManyPeers
    10,657,836.93 93.83 0.4% 11.03 OrphanageWorksetSinglePeer
  83. glozow commented at 2:25 pm on February 19, 2025: member

    The changes here have surprisingly little effect on the included benchmarks:

    My expectation was that the time would go up with these changes, but hopefully not by too much.

    I am pretty surprised that “[txorphanage] when full, evict from the DoSiest peers first” doesn’t make OrphanageEvictionManyPeers slower, but I guess it’s because there are only 24 transactions.

  84. glozow commented at 4:43 pm on February 19, 2025: member

    Will add something like this as comments as well but here’s the thinking around these benches:

    • The EraseForBlock bench exists to test the worst case EraseForBlock time, which occurs when every orphan conflicts with the block as much as possible. That’s (very roughly) ~2000 inputs per tx (within 400KWu), times the max number of orphans.
      • We’re kind of cheating in “limit EraseForBlock iterations and use set instead of vec” which forces it to stop at 100 orphans, even if the max number of orphans is increased.
      • There is also memory, which is what changing to a set helps with. Due to the way the loops work, we would have 2000 * max number of orphans iterators in the vector.
    • The workset benches exist to test the worst case AddChildrenToWorkSet time. Similarly, worst case is when all orphans spend an output of the transaction. The complexity can also increase when there are multiple announcers for the orphans.
    • The eviction bench exists to test the worst case LimitOrphans time. Note it artificially uses max_orphans=0 even though that doesn’t happen in real life. The worst case is when every orphan has been announced by every peer, and we need to remove all announcers one by one before we actually erase anything. This would have been pretty bad before we set a global announcement limit.

    So for all of these benches, we primarily want them to not blow up when we increase the DEFAULT_MAX_ORPHAN_TRANSACTIONS.

  85. glozow removed this from the milestone 29.0 on Feb 20, 2025
  86. glozow added this to the milestone 30.0 on Feb 20, 2025
  87. sipa commented at 3:44 pm on February 20, 2025: member
    Sad to see this slip, but given the amount of changes and discoveries that necessitated them even in just the last week, it’s probably the right decision.
  88. glozow commented at 3:52 pm on February 20, 2025: member
    I’m sad too! Seeing the stats from https://delvingbitcoin.org/t/stats-on-orphanage-overflows/1421 made this more pressing in my opinion, but it’s not a regression. I think we can still try to consider small, obviously safe changes for v29, but this feels too big. I don’t want to risk creating new DoS problems.
  89. in src/txorphanage.cpp:339 in 021fc20f9c outdated
    332@@ -332,16 +333,19 @@ void TxOrphanage::EraseForBlock(const CBlock& block)
    333             if (itByPrev == m_outpoint_to_orphan_it.end()) continue;
    334             for (auto mi = itByPrev->second.begin(); mi != itByPrev->second.end(); ++mi) {
    335                 const CTransaction& orphanTx = *(*mi)->second.tx;
    336-                vOrphanErase.push_back(orphanTx.GetWitnessHash());
    337+                wtxids_to_erase.insert(orphanTx.GetWitnessHash());
    338+                // Stop to avoid doing too much work. If there are more orphans to erase, rely on
    339+                // expiration and evictions to clean up everything eventually.
    340+                if (wtxids_to_erase.size() >= 100) break;
    


    instagibbs commented at 6:47 pm on February 20, 2025:
    seems like we’re leaving only one of 3 loops? is this intended?
  90. in src/txorphanage.cpp:215 in 6cbe944539 outdated
    215+        // ratios.
    216+        std::pop_heap(peer_it_heap.begin(), peer_it_heap.end(), compare_peer);
    217+        auto it_worst_peer = peer_it_heap.back();
    218+        peer_it_heap.pop_back();
    219+
    220+        // Evict a random orphan from this peer.
    


    mzumsande commented at 7:31 pm on February 20, 2025:
    “Remove a random announcement from this peer.” seems better because we only sometimes evict an orphan.
  91. in src/txorphanage.cpp:219 in 6cbe944539 outdated
    219+
    220+        // Evict a random orphan from this peer.
    221+        size_t randompos = rng.randrange(it_worst_peer->second.m_iter_list.size());
    222+        auto it_to_evict = it_worst_peer->second.m_iter_list.at(randompos);
    223+
    224+        // Only erase this peer as an announcer, unless it is the only announcer. Otherwise peers
    


    mzumsande commented at 7:35 pm on February 20, 2025:
    This is a bit ambiguous, I first read it as “don’t erase the peer as an announcer if it was the only one”, but what is meant is “erase the peer as an announcer, also remove the orphan if the peer was the only announcer”, so maybe reword it.
  92. in src/txorphanage.h:243 in 6cbe944539 outdated
    242+    int64_t GetPerPeerMaxUsage() const {
    243+        return m_reserved_weight_per_peer;
    244+    }
    245+
    246+    int64_t GetGlobalMaxUsage() const {
    247+        return std::max<int64_t>(int64_t(m_peer_orphanage_info.size()) * m_reserved_weight_per_peer, 1);
    


    mzumsande commented at 7:45 pm on February 20, 2025:

    If I understand it correctly, the current logic is that we assign additional weight for all peers that have an entry in m_peer_orphanage_info, which means we don’t assign weight for peers not participating in tx relay, or for peers that do but have never sent us an orphan. But once they have sent us an orphan, that weight stays reserved until they disconnect (and can be used by other peers too), even if they never send us another orphan.

    This seems like a middle-ground between assigning each peer a fixed DEFAULT_RESERVED_ORPHAN_WEIGHT_PER_PEER share which may not be exceeded, and assigning a pool DEFAULT_RESERVED_ORPHAN_WEIGHT_PER_PEER for each peer (whether that peer sent us an orphan before or not) - was that the purpose of choosing to do it this way?


    glozow commented at 7:45 pm on February 24, 2025:

    If I understand it correctly, …

    Correct πŸ‘ the general idea is to put a hard cap on the memory usage per peer. However, if a peer is using a lot more than the others simply because it is very useful, we don’t penalize them until we run out of space.

    We could do a RegisterPeer type of thing when the peer first connects, but I don’t see any particular reason to do that extra step right now. Perhaps we can add this in the future, if we want to give peers a different reservation depending on the type of connection. This has a nice side effect where we don’t reserve any space for useless spy peers, and save a bit of extra work for short-lived connections or ones where no transactions are relayed, though I wouldn’t say this is a motivation.

  93. in src/txorphanage.h:42 in 6cbe944539 outdated
    33@@ -27,7 +34,26 @@ static constexpr auto ORPHAN_TX_EXPIRE_INTERVAL{5min};
    34  * Not thread-safe. Requires external synchronization.
    35  */
    36 class TxOrphanage {
    37+    /** The usage (weight) reserved for each peer, representing the amount of memory we are willing
    38+     * to allocate for orphanage space. Note that this number is a reservation, not a limit: peers
    39+     * are allowed to exceed this reservation until the global limit is reached, and peers are
    40+     * effectively guaranteed this amount of space. Reservation is per-peer, so the global upper
    41+     * bound on memory usage scales up with more peers. */
    42+    unsigned int m_reserved_weight_per_peer{DEFAULT_RESERVED_ORPHAN_WEIGHT_PER_PEER};
    


    mzumsande commented at 8:30 pm on February 20, 2025:

    One possibility is that an attacker who wants to spam us with orphans can do so via other peers, without providing the orphans themselves:

    1.) Attacker A inv’s a parent P to victim V but doesn’t answer the GETDATA (2-minute timeout).
    2.) Within these 2 minutes, A sends P and multiple non-conflicting children C to the rest of the network - these are accepted/relayed by all other peers, so they will eventually be announced to V by all of its legitimate peers.
    3.) All of V’s legitimate peers will announce P and C; V will only request C (because P is in flight) and saves them as orphans / adds all of its peers as announcers.
    4.) V’s orphanage reaches its limits. With 125 peers, you’d need ~25 children to reach the max announcement limit of 3000, so you might need to split the children over multiple parents because of the mempool descendant limit.
    5.) V will start evicting announcements randomly from peers, so it may also evict legitimate unrelated orphans.

    Unlike with the status quo where attackers can just spam us with orphans for free, this is not free (P and C need to be valid transactions with a sufficient fee to be relayed), but I couldn’t think of any countermeasure, so maybe it kind of sets a theoretical limit on how certain we can be that legitimate orphans are resolved?


    instagibbs commented at 10:48 pm on February 20, 2025:

    IIUC this is basically a way of slowing down the parent resolution of a target package by 2 minutes, thereby increasing the window of possible random eviction when under real cpfp traffic.

    Without this delay we can still have natural overflow if there are too many (or too large) real packages in flight; the time window here just gets a lot longer.

    Unfortunately we have no way for peers to communicate which orphans they claim are higher paying (to preferentially evict those last, for example). Then at least orphan package makers could try to outbid the queue. Maybe something to think about with a future sender-initiated protocol?

    Being more aggressive about timing out inbound peers (and/or txid-relay peers) when they aren’t responding to getdatas could also be on the table, at the cost of bandwidth in the average case over a slow link.

    re:(2) you could also just imagine the scenario where the attacker simply relies on natural cpfp traffic and may just get lucky that it gets evicted in the two minutes. Costs the attacker nothing in this case, though the chances of failure are likely way higher?


    mzumsande commented at 4:49 pm on February 21, 2025:

    IIUC this is basically a way of slowing down the parent resolution of a target package by 2 minutes, thereby increasing the window of possible random eviction when under real cpfp traffic.

    Not sure if I understood that right, but I think that would be a different kind of attack where the attacker knows the target package. What I meant was that the attacker crafts spam transactions (P/C) that the rest of the network relays, but that end up only in the victim’s orphanage, resulting in eviction of random other orphans (about which the attacker doesn’t need to know any specifics) - similar to how an attacker could today just spam orphans to trigger random eviction on master, but no longer for free.


    glozow commented at 8:00 pm on February 24, 2025:

    My general impression is that, since these “spam” orphans are real, fee-paying transactions that end up in mempool, they are equally useful / equally deserve the space. I do think this represents a bound on how much total orphan volume we can handle, but I can’t really see why the victim node should evict these orphans over the others, just because they only originate from 1 peer.

    Unfortunately we have no way for peers to communicate which orphans they claim are higher paying (to preferentially evict those last, for example). Then at least orphan package makers could try to outbid the queue. Maybe something to think about with a future sender-initiated protocol?

    Even if we had a way to communicate this information, I don’t think it could be trusted. I think the only real solution to this is a sender-initiated protocol. I have half a mind to go ahead and propose a quick-and-dirty sender-initiated protocol for 1p1cs right now, but maybe that’s too short term thinking.

  94. glozow commented at 9:13 pm on March 4, 2025: member

    Following coredev discussions, I’m working on a few things:

    • given selected peer, evict by entry/announcement time instead of randomly
    • shorten expiry time (and GETDATA interval)
    • rewrite as boost multi-index
    • clarify benches
  95. instagibbs commented at 9:29 pm on March 4, 2025: member

    shorten expiry time (and GETDATA interval)

    Is this worth splitting out as its own PR?

  96. glozow commented at 9:45 pm on March 4, 2025: member
    Yes! I think multi index could also be its own PR. But wanted to give a status update here.
  97. glozow marked this as a draft on Mar 9, 2025
  98. DrahtBot added the label Needs rebase on Mar 16, 2025
  99. glozow force-pushed on May 19, 2025
  100. glozow force-pushed on May 19, 2025
  101. DrahtBot added the label CI failed on May 19, 2025
  102. DrahtBot commented at 3:27 pm on May 19, 2025: contributor

    🚧 At least one of the CI tasks failed.
    Task lint: https://github.com/bitcoin/bitcoin/runs/42487913320
    LLM reason (✨ experimental): The CI failure is due to multiple linting errors, including circular dependencies, missing include guards, and spelling mistakes.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  103. glozow force-pushed on May 19, 2025
  104. DrahtBot removed the label Needs rebase on May 19, 2025
  105. glozow force-pushed on May 20, 2025
  106. DrahtBot removed the label CI failed on May 20, 2025
  107. glozow force-pushed on May 22, 2025
  108. glozow marked this as ready for review on May 22, 2025
  109. glozow commented at 1:07 pm on May 22, 2025: member

    This is ready for review again. Main changes from February: it includes the rewrite as a boost::multi_index container, removes -maxorphantxs entirely, and drops the benches that weren’t demonstrating anything. I’ve updated the PR description.

    Rewrite and eviction changes are in the same commit. I didn’t think it made sense to first reimplement the old design as a multi_index, because the old eviction strategy requires twice as many indexes. If you are familiar with the old design and just want to see the behavior changes applied to it before comparing it with the new impl, I have a copy of the original PR here.

  110. in src/node/txorphanage_impl.h:221 in 455b4b8178 outdated
    216+            ++it;
    217+        }
    218+        return count;
    219+    }
    220+
    221+    /** Return number of announcements with this wtxid. */
    


    instagibbs commented at 5:36 pm on May 27, 2025:

    extra fn doesn’t seem worth it

     0diff --git a/src/node/txorphanage_impl.h b/src/node/txorphanage_impl.h
     1index 89974ec506..b6137845b4 100644
     2--- a/src/node/txorphanage_impl.h
     3+++ b/src/node/txorphanage_impl.h
     4@@ -205,11 +205,11 @@ class TxOrphanageImpl
     5     }
     6 
     7-    /** Return number of announcements with the same wtxid as it. */
     8-    unsigned int CountSameWtxid(Iter<ByWtxid> it) const
     9+    /** Return number of announcements with this wtxid. */
    10+    unsigned int CountWtxid(const Wtxid& wtxid) const
    11     {
    12+        auto it = m_orphans.get<ByWtxid>().lower_bound(ByWtxidView{wtxid, min_peer});
    13         if (it == m_orphans.end()) return 0;
    14 
    15         unsigned int count{0};
    16-        const auto& wtxid{it->m_tx->GetWitnessHash()};
    17         while (it != m_orphans.end() && it->m_tx->GetWitnessHash() == wtxid) {
    18             ++count;
    19@@ -218,12 +218,4 @@ class TxOrphanageImpl
    20         return count;
    21     }
    22-
    23-    /** Return number of announcements with this wtxid. */
    24-    unsigned int CountWtxid(const Wtxid& wtxid) const
    25-    {
    26-        auto it = m_orphans.get<ByWtxid>().lower_bound(ByWtxidView{wtxid, min_peer});
    27-        if (it == m_orphans.end()) return 0;
    28-        return CountSameWtxid(it);
    29-    }
    30 public:
    31     TxOrphanageImpl() = default;
    

    glozow commented at 2:36 pm on June 2, 2025:
    have consolidated the two
  111. in src/node/txorphanage_impl.h:616 in 455b4b8178 outdated
    624+
    625+            // If needs trim, then at least one peer has a DoS score higher than 1.
    626+            Assume(dos_score.fee > dos_score.size);
    627+
    628+            // Evict the oldest announcement from this peer, sorting non-reconsiderable before reconsiderable.
    629+            auto it_ann = m_orphans.get<ByPeer>().lower_bound(ByPeerView{worst_peer, false, 0});
    


    instagibbs commented at 6:26 pm on May 27, 2025:

    455b4b817884f860cd4467f0a9be4a459e89891c

    Take or leave suggestion to reduce churn in the heap, did no benchmarking but may reduce work for most DoSy peers.

     0diff --git a/src/node/txorphanage_impl.h b/src/node/txorphanage_impl.h
     1index 89974ec506..6fe7422054 100644
     2--- a/src/node/txorphanage_impl.h
     3+++ b/src/node/txorphanage_impl.h
     4@@ -623,20 +623,27 @@ public:
     5             heap_peer_dos.pop_back();
     6 
     7+            auto it_worst_peer = m_peer_orphanage_info.find(worst_peer);
     8+
     9             // If needs trim, then at least one peer has a DoS score higher than 1.
    10             Assume(dos_score.fee > dos_score.size);
    11 
    12-            // Evict the oldest announcement from this peer, sorting non-reconsiderable before reconsiderable.
    13-            auto it_ann = m_orphans.get<ByPeer>().lower_bound(ByPeerView{worst_peer, false, 0});
    14-            Assume(it_ann->m_announcer == worst_peer);
    15-            Erase<ByPeer>(it_ann, /*cleanup_outpoints_map=*/CountWtxid(it_ann->m_tx->GetWitnessHash()) == 1);
    16-            num_erased += 1;
    17-
    18-            // Unless this peer is empty (which should never happen as long as per-peer reserved usage is at least as
    19-            // large as the maximum allowed orphan size), put it back in the heap so we continue to consider evicting
    20-            // its orphans. Calculate the DoS score anew. This peer might still be the DoSiest one.
    21-            auto it_worst_peer = m_peer_orphanage_info.find(worst_peer);
    22-            if (it_worst_peer != m_peer_orphanage_info.end() && it_worst_peer->second.m_count_announcements > 0) {
    23-                heap_peer_dos.emplace_back(worst_peer, it_worst_peer->second.GetDosScore(max_ann, max_mem));
    24-                std::push_heap(heap_peer_dos.begin(), heap_peer_dos.end(), compare_score);
    25+            if (it_worst_peer != m_peer_orphanage_info.end()) {
    26+                // Avoid churn by re-entering only when worst peer's score is no longer worst
    27+                // Evict the oldest announcement from this peer, sorting non-reconsiderable before reconsiderable.
    28+                auto it_ann = m_orphans.get<ByPeer>().lower_bound(ByPeerView{worst_peer, false, 0});
    29+                const auto& next_dos_score = heap_peer_dos.empty() ? FeeFrac{0, 1} : heap_peer_dos.front().second;
    30+                while (NeedsTrim() && it_worst_peer->second.GetDosScore(max_ann, max_mem) > next_dos_score) {
    31+                    Assume(it_ann->m_announcer == worst_peer);
    32+                    Erase<ByPeer>(it_ann, /*cleanup_outpoints_map=*/CountWtxid(it_ann->m_tx->GetWitnessHash()) == 1);
    33+                    num_erased += 1;
    34+                    it_ann++; // advance to next announcement from this peer in case loop continues
    35+                }
    36+                // Unless this peer is empty (which should never happen as long as per-peer reserved usage is at least as
    37+                // large as the maximum allowed orphan size), put it back in the heap so we continue to consider evicting
    38+                // its orphans. Calculate the DoS score anew. This peer might still be the DoSiest one.
    39+                if (it_worst_peer->second.m_count_announcements > 0) {
    40+                    heap_peer_dos.emplace_back(worst_peer, it_worst_peer->second.GetDosScore(max_ann, max_mem));
    41+                    std::push_heap(heap_peer_dos.begin(), heap_peer_dos.end(), compare_score);
    42+                }
    43             }
    44         } while (!heap_peer_dos.empty() && NeedsTrim());
    

    glozow commented at 7:11 pm on June 5, 2025:
    Implemented the general idea, i.e. keep trimming until this peer wouldn’t be the next one we pop from the heap. There was some iterator invalidation going on here.
  112. in src/txorphanage.cpp:330 in aed51fe7d5 outdated
    326@@ -327,10 +327,10 @@ void TxOrphanage::SanityCheck() const
    327     // Check that cached m_total_announcements is correct
    328     unsigned int counted_total_announcements{0};
    329     // Check that m_total_orphan_usage is correct
    330-    unsigned int counted_total_usage{0};
    331+    int64_t counted_total_usage{0};
    


    instagibbs commented at 3:11 pm on May 28, 2025:

    aed51fe7d5cbcc43eba2be3cd5af666fe1d95dd7

    give motivation in commit message for the change?


    glozow commented at 4:34 pm on June 2, 2025:
    added
  113. in src/node/txorphanage.cpp:40 in 737c5127df outdated
    36@@ -37,7 +37,7 @@ bool TxOrphanage::AddTx(const CTransactionRef& tx, NodeId peer)
    37         return false;
    38     }
    39 
    40-    auto ret = m_orphans.emplace(wtxid, OrphanTx{{tx, {peer}, Now<NodeSeconds>() + ORPHAN_TX_EXPIRE_TIME}, m_orphan_list.size()});
    41+    auto ret = m_orphans.emplace(wtxid, OrphanTx{{tx, {peer}}, Now<NodeSeconds>() + ORPHAN_TX_EXPIRE_TIME, m_orphan_list.size()});
    


    instagibbs commented at 3:15 pm on May 28, 2025:

    737c5127df841e9c8037b1885284f80b0aba17dd

    might be good to note in commit message the getorphantxs is experimental, so breaking is considered ok


    glozow commented at 4:34 pm on June 2, 2025:
    added
  114. in src/test/fuzz/txdownloadman.cpp:307 in f2dcdbf700 outdated
    303@@ -305,9 +304,9 @@ FUZZ_TARGET(txdownloadman_impl, .init = initialize)
    304     // Initialize a TxDownloadManagerImpl
    305     bilingual_str error;
    306     CTxMemPool pool{MemPoolOptionsForTest(g_setup->m_node), error};
    307-    const auto max_orphan_count = fuzzed_data_provider.ConsumeIntegralInRange<unsigned int>(0, 300);
    308+    const auto max_orphan_count = node::DEFAULT_MAX_ORPHAN_TRANSACTIONS;
    


    instagibbs commented at 4:02 pm on May 28, 2025:

    f2dcdbf700b3b20a315f5a6eec57c7463955fe43

    if this goes away later anyways, ignore, but CheckInvariants could just directly use node::DEFAULT_MAX_ORPHAN_TRANSACTIONS


    glozow commented at 8:45 pm on June 3, 2025:
    ignoring because the constant gets deleted
  115. in src/node/txorphanage_impl.h:77 in 455b4b8178 outdated
    72+        int64_t GetUsage()  const {
    73+            return GetTransactionWeight(*m_tx);
    74+        }
    75+    };
    76+
    77+    // Index by wtxid, then peer. Uses:
    


    instagibbs commented at 6:51 pm on May 28, 2025:

    455b4b817884f860cd4467f0a9be4a459e89891c

    nit: unsure if “uses” section adds a lot to clarity vs reading the code


    glozow commented at 1:59 pm on June 2, 2025:
    removed
  116. in src/node/txorphanage_impl.h:70 in 455b4b8178 outdated
    83+    struct ByWtxid {};
    84+    using ByWtxidView = std::tuple<Wtxid, NodeId>;
    85+    struct WtxidExtractor
    86+    {
    87+        using result_type = ByWtxidView;
    88+        result_type operator()(const Announcement& ann) const
    


    instagibbs commented at 7:08 pm on May 28, 2025:
    not sure what result_type is doing here or in ByPeerViewExtractor

    glozow commented at 12:24 pm on June 2, 2025:
    It’s a part of what boost multiindex requires for its “key extractor” concept: https://www.boost.org/doc/libs/1_88_0/libs/multi_index/doc/reference/key_extraction.html
  117. in src/node/txorphanage_impl.h:169 in 455b4b8178 outdated
    164+            return std::max<FeeFrac>(cpu_score, mem_score);
    165+        }
    166+    };
    167+    /** Store per-peer statistics. Used to determine each peer's DoS score. */
    168+    std::unordered_map<NodeId, PeerInfo> m_peer_orphanage_info;
    169+    using PeerMap = decltype(m_peer_orphanage_info);
    


    instagibbs commented at 7:16 pm on May 28, 2025:
    PeerMap is unused circa 455b4b817884f860cd4467f0a9be4a459e89891c

    glozow commented at 2:31 pm on June 2, 2025:
    removed
  118. in src/test/orphanage_tests.cpp:134 in b5fe4383ad outdated
    131 
    132     // ... and 50 that depend on other orphans:
    133     for (int i = 0; i < 50; i++)
    134     {
    135-        CTransactionRef txPrev = orphanage.RandomOrphan();
    136+        CTransactionRef txPrev = Random(orphans_added, m_rng);
    


    sipa commented at 7:21 pm on May 28, 2025:

    In commit “[prep/test] modify test to not access TxOrphanage internals”

    It seems all invocations of Random() assume that the returned value won’t be nullptr anyway, so I think they can just be replaced with:

    0CTransactionRef txPrev = orphans_added[m_rng.randrange(orphans_added.size())];
    

    glozow commented at 4:31 pm on June 2, 2025:
    done
  119. in src/node/txorphanage_impl.h:132 in 455b4b8178 outdated
    127+
    128+    /** Index from the parents' outputs to wtxids that exist in m_orphans. Used to find children of
    129+     * a transaction that can be reconsidered and to remove entries that conflict with a block.*/
    130+    std::map<COutPoint, std::set<Wtxid>> m_outpoint_to_orphan_it;
    131+
    132+    struct PeerInfo {
    


    instagibbs commented at 7:24 pm on May 28, 2025:
    nit: s/PeerInfo/PeerDoSInfo/

    glozow commented at 2:30 pm on June 2, 2025:
    done
  120. in src/node/txorphanage_impl.h:142 in 455b4b8178 outdated
    157+        * do not trim unless the orphanage exceeds global limits, but it means that this peer will
    158+        * be selected for trimming sooner. If the global announcement or global memory usage
    159+        * limits are exceeded, it must be that there is a peer whose DoS score > 1. */
    160+        FeeFrac GetDosScore(unsigned int max_peer_count, int64_t max_peer_bytes) const
    161+        {
    162+            const FeeFrac cpu_score(m_count_announcements, max_peer_count);
    


    instagibbs commented at 7:27 pm on May 28, 2025:
    probably should assert the denominators are not 0, otherwise FeeFrac comparison becomes nonsensical

    glozow commented at 2:31 pm on June 2, 2025:
    added
  121. in src/node/txorphanage_impl.h:244 in 455b4b8178 outdated
    239+    int64_t TotalOrphanUsage() const { return m_unique_orphan_bytes; }
    240+
    241+    /** Number of unique orphans */
    242+    unsigned int CountUniqueOrphans() const { return m_unique_orphans; }
    243+
    244+    /** Number of orphans from this peer */
    


    instagibbs commented at 7:39 pm on May 28, 2025:
    0    /** Number of stored orphans from this peer */
    

    glozow commented at 2:37 pm on June 2, 2025:
    done
  122. in src/node/txorphanage_impl.h:350 in 455b4b8178 outdated
    345+        // have the tx data.
    346+        if (it == m_orphans.end()) return false;
    347+        if (it->m_tx->GetWitnessHash() != wtxid) return false;
    348+
    349+        // Quit if we already have this announcement (same wtxid and peer).
    350+        if (HaveTxFromPeer(wtxid, peer)) return false;
    


    instagibbs commented at 7:54 pm on May 28, 2025:
    could use it->m_announcer == peer since you already did the lookup

    glozow commented at 2:44 pm on June 2, 2025:
    removed
  123. in src/node/txorphanage_impl.h:361 in 455b4b8178 outdated
    356+
    357+        ++m_current_sequence;
    358+        auto& peer_info = m_peer_orphanage_info.try_emplace(peer).first->second;
    359+        peer_info.Add(*ret.first);
    360+
    361+        const auto& txid = ret.first->m_tx->GetHash();
    


    instagibbs commented at 7:56 pm on May 28, 2025:
    0        const auto& txid = ptx->GetHash();
    

    glozow commented at 2:44 pm on June 2, 2025:
    done
  124. sipa commented at 8:35 pm on May 28, 2025: member
    The PR title may need updating, as -maxorphantxs is gone now.
  125. in src/node/txorphanage_impl.h:395 in 455b4b8178 outdated
    390+        auto& index_by_peer = m_orphans.get<ByPeer>();
    391+        auto it = index_by_peer.lower_bound(ByPeerView{peer, false, 0});
    392+        unsigned int num_erased{0};
    393+        while (it != index_by_peer.end() && it->m_announcer == peer) {
    394+            // Decide what will happen next before the iter is invalidated.
    395+            const bool last_item{std::next(it) == index_by_peer.end() || std::next(it)->m_announcer != peer};
    


    instagibbs commented at 8:36 pm on May 28, 2025:

    455b4b817884f860cd4467f0a9be4a459e89891c

    Little confused how this logic is necessary. The while loop will exit as soon as it is incremented to index_by_peer.end() or it->m_announcer != peer. What am I missing?

    Same with EraseAll


    glozow commented at 2:45 pm on June 2, 2025:
    right, removed
  126. in src/node/txorphanage_impl.h:411 in 455b4b8178 outdated
    406+
    407+        if (num_erased > 0) LogDebug(BCLog::TXPACKAGES, "Erased %d orphan transaction(s) from peer=%d\n", num_erased, peer);
    408+    }
    409+
    410+    /** Erase all entries with this wtxid. Return the number of announcements erased. */
    411+    unsigned int EraseAll(const Wtxid& wtxid)
    


    instagibbs commented at 8:45 pm on May 28, 2025:

    455b4b817884f860cd4467f0a9be4a459e89891c

    difference with EraseTx is pretty subtle, and only EraseTx appears to be used externally. Make this private or just subsume into EraseTx?


    glozow commented at 2:46 pm on June 2, 2025:
    made private
  127. in src/node/txorphanage_impl.h:21 in ea3a65e698 outdated
    16+#include <util/epochguard.h>
    17+#include <util/hasher.h>
    18+#include <util/result.h>
    19+#include <util/feefrac.h>
    20+
    21+#include <boost/multi_index/hashed_index.hpp>
    


    sipa commented at 8:49 pm on May 28, 2025:

    In commit “feature: Add TxOrphanageImpl”

    Many of the includes here seem unused:

     0--- a/src/node/txorphanage_impl.h
     1+++ b/src/node/txorphanage_impl.h
     2@@ -5,33 +5,20 @@
     3 #ifndef BITCOIN_NODE_TXORPHANAGE_IMPL_H
     4 #define BITCOIN_NODE_TXORPHANAGE_IMPL_H
     5 
     6-#include <coins.h>
     7-#include <consensus/amount.h>
     8-#include <indirectmap.h>
     9 #include <logging.h>
    10-#include <net.h>
    11 #include <policy/policy.h>
    12 #include <primitives/transaction.h>
    13-#include <sync.h>
    14-#include <util/epochguard.h>
    15 #include <util/hasher.h>
    16 #include <util/result.h>
    17 #include <util/feefrac.h>
    18 
    19-#include <boost/multi_index/hashed_index.hpp>
    20-#include <boost/multi_index/identity.hpp>
    21 #include <boost/multi_index/indexed_by.hpp>
    22 #include <boost/multi_index/ordered_index.hpp>
    23-#include <boost/multi_index/sequenced_index.hpp>
    24 #include <boost/multi_index/tag.hpp>
    25 #include <boost/multi_index_container.hpp>
    26 
    27-#include <atomic>
    28 #include <map>
    29-#include <optional>
    30 #include <set>
    31-#include <string>
    32-#include <string_view>
    33 #include <utility>
    34 #include <vector>
    

    glozow commented at 1:58 pm on June 2, 2025:
    Fixed
  128. in src/node/txorphanage_impl.h:454 in 455b4b8178 outdated
    449+    {
    450+        auto it = m_orphans.get<ByPeer>().lower_bound(ByPeerView{peer, true, 0});
    451+        return it != m_orphans.get<ByPeer>().end() && it->m_announcer == peer && it->m_reconsider;
    452+    }
    453+
    454+    /** If there is a tx that can be reconsidered, return it. Otherwise, return a nullptr. */
    


    instagibbs commented at 8:49 pm on May 28, 2025:

    just noting this is going to get oldest-reconsidered-by-peer first

    sounds logical


    glozow commented at 12:26 pm on June 2, 2025:
    Yes, seemed like the most fair way to do it.
  129. in src/node/txorphanage_impl.h:108 in ea3a65e698 outdated
    125+    /** Total bytes used by orphans, deduplicated by wtxid. */
    126+    unsigned int m_unique_orphan_bytes{0};
    127+
    128+    /** Index from the parents' outputs to wtxids that exist in m_orphans. Used to find children of
    129+     * a transaction that can be reconsidered and to remove entries that conflict with a block.*/
    130+    std::map<COutPoint, std::set<Wtxid>> m_outpoint_to_orphan_it;
    


    sipa commented at 8:52 pm on May 28, 2025:

    In commit “feature: Add TxOrphanageImpl”

    Would it be possible to use std::map<COutPoint, std::set<Iter>> here as type? That would be faster (avoiding a lookup to resolve Wtxid -> Iter) and use less memory. boost::multi_index iterator objects are stable (remain valid as long as the object they point to exists), unlike std::unordered_map.


    glozow commented at 6:01 pm on June 5, 2025:
    I went pretty far into implementing this, but realized there is a downside to this approach - the set stores entries for each announcement instead of each unique orphan. It requires us to update this map each time a new announcement is added/removed instead of just for unique ones. Perhaps worth keeping as Wtxid - what do you think?

    sipa commented at 6:14 pm on June 5, 2025:
    Ah, I see. So you could have an $\mathcal{O}(n)$ blowup factor with $n$ the number of peers that have announced that Wtxid? If so, that doesn't sound like a worthwhile tradeoff.
  130. in src/node/txorphanage_impl.h:480 in 455b4b8178 outdated
    475+            for (const auto& input : block_tx.vin) {
    476+                auto it_prev = m_outpoint_to_orphan_it.find(input.prevout);
    477+                if (it_prev != m_outpoint_to_orphan_it.end()) {
    478+                    // Copy all wtxids to wtxids_to_erase.
    479+                    std::copy(it_prev->second.cbegin(), it_prev->second.cend(), std::inserter(wtxids_to_erase, wtxids_to_erase.end()));
    480+                    for (const auto& wtxid : it_prev->second) {
    


    instagibbs commented at 8:53 pm on May 28, 2025:

    455b4b817884f860cd4467f0a9be4a459e89891c

    doesn’t the inserter above do this


    glozow commented at 2:47 pm on June 2, 2025:
    ha yes, removed the duplicate
  131. in src/node/txorphanage_impl.h:483 in 455b4b8178 outdated
    490+        }
    491+
    492+        if (num_erased != 0) {
    493+            LogDebug(BCLog::TXPACKAGES, "Erased %d orphan transaction(s) included or conflicted by block\n", num_erased);
    494+        }
    495+        return wtxids_to_erase.size();
    


    instagibbs commented at 8:56 pm on May 28, 2025:
    0        Assume(wtxids_to_erase.size() == num_erased);
    1        return wtxids_to_erase.size();
    

    glozow commented at 2:49 pm on June 2, 2025:
    added
  132. in src/node/txorphanage_impl.h:163 in ea3a65e698 outdated
    179+    {
    180+        // Update m_peer_orphanage_info and clean up entries if they point to an empty struct.
    181+        // This means peers that are not storing any orphans do not have an entry in
    182+        // m_peer_orphanage_info (they can be added back later if they announce another orphan) and
    183+        // ensures disconnected peers are not tracked forever.
    184+        auto peer_it = m_peer_orphanage_info.find(it->m_announcer);
    


    sipa commented at 9:00 pm on May 28, 2025:

    In commit “feature: Add TxOrphanageImpl”

    Assume(peer_it != m_peer_orphanage_info.end()); ?


    glozow commented at 2:34 pm on June 2, 2025:
    added
  133. in src/node/txorphanage_impl.h:207 in ea3a65e698 outdated
    202+            }
    203+        }
    204+        m_orphans.get<Tag>().erase(it);
    205+    }
    206+
    207+    /** Return number of announcements with the same wtxid as it. */
    


    sipa commented at 9:04 pm on May 28, 2025:

    In commit “feature: Add TxOrphanageImpl”

    This has an unstated assumption that it is the iterator the first entry in the ByWtxid index for a given wtxid.

    Given that there is only one call site (CountWtxid below), maybe it is better to inline this function there?


    glozow commented at 2:35 pm on June 2, 2025:
    done
  134. in src/node/txorphanage_impl.h:550 in 455b4b8178 outdated
    545+                    // inputs. However, we don't want to create an issue in which the assigned peer can purposefully stop us
    546+                    // from processing the orphan by disconnecting.
    547+                    const auto num_announcers{CountWtxid(wtxid)};
    548+                    if (!Assume(num_announcers > 0)) continue;
    549+                    std::advance(it, rng.randrange(num_announcers));
    550+                    if (!Assume(it->m_tx->GetWitnessHash() == wtxid)) continue;
    


    instagibbs commented at 9:06 pm on May 28, 2025:

    probably too cautious already but

    0                    if (!Assume(it->m_tx->GetWitnessHash() == wtxid)) break;
    

    glozow commented at 3:06 pm on June 2, 2025:
    done
  135. in src/node/txorphanage_impl.h:235 in ea3a65e698 outdated
    230+    TxOrphanageImpl(unsigned int max_global_ann, int64_t reserved_peer_usage) :
    231+        m_max_global_announcements{max_global_ann},
    232+        m_reserved_usage_per_peer{reserved_peer_usage}
    233+    {}
    234+
    235+    /** Number of announcements ones for the same wtxid are not de-duplicated. */
    


    sipa commented at 9:14 pm on May 28, 2025:

    In commit “feature: Add TxOrphanageImpl”

    Nit: grammar parse error.


    glozow commented at 2:37 pm on June 2, 2025:
    fixed
  136. in src/node/txorphanage_impl.h:591 in 455b4b8178 outdated
    586+     * amount of announcements and space for each peer. The reserved amount is protected from eviction even if there
    587+     * are peers spamming the orphanage.
    588+     */
    589+    void LimitOrphans()
    590+    {
    591+        if (m_orphans.empty() || !NeedsTrim()) return;
    


    instagibbs commented at 9:14 pm on May 28, 2025:
    0        if (!NeedsTrim()) return;
    

    glozow commented at 3:06 pm on June 2, 2025:
    done
  137. in src/node/txorphanage_impl.h:259 in ea3a65e698 outdated
    254+    }
    255+
    256+    void SanityCheck() const
    257+    {
    258+        std::unordered_map<NodeId, PeerInfo> reconstructed_peer_info;
    259+        std::map<Wtxid, int64_t > unique_wtxids_to_usage;
    


    sipa commented at 9:17 pm on May 28, 2025:

    In commit “feature: Add TxOrphanageImpl”

    Nit: after int64_t.


    glozow commented at 2:37 pm on June 2, 2025:
    removed
  138. in src/node/txorphanage_impl.h:268 in ea3a65e698 outdated
    263+            for (const auto& input : it->m_tx->vin) {
    264+                all_outpoints.insert(input.prevout);
    265+            }
    266+            unique_wtxids_to_usage.emplace(it->m_tx->GetWitnessHash(), it->GetUsage());
    267+
    268+            auto& peer_info = reconstructed_peer_info.try_emplace(it->m_announcer).first->second;
    


    sipa commented at 9:18 pm on May 28, 2025:

    In commit “feature: Add TxOrphanageImpl”

    Nit: maybe auto& peer_info = reconstructed_peer_info[it->m_announcer]; ?


    glozow commented at 2:42 pm on June 2, 2025:
    done
  139. in src/node/txorphanage_impl.h:44 in ea3a65e698 outdated
    39+/** Default value for TxOrphanage::m_reserved_usage_per_peer. */
    40+static constexpr int64_t DEFAULT_RESERVED_ORPHAN_WEIGHT_PER_PEER{404'000};
    41+/** Default value for TxOrphanage::m_max_global_announcements. */
    42+static constexpr unsigned int DEFAULT_MAX_ORPHAN_ANNOUNCEMENTS{3000};
    43+/** Minimum NodeId for lower_bound lookups (in practice, NodeIds start at 0). */
    44+static constexpr NodeId min_peer{std::numeric_limits<NodeId>::min()};
    


    sipa commented at 9:21 pm on May 28, 2025:

    In commit “feature: Add TxOrphanageImpl”

    Style: MIN_PEER? It wasn't immediately clear when reading the code that this referred to a constant.


    glozow commented at 1:58 pm on June 2, 2025:
    makes sense, done
  140. in src/node/txorphanage_impl.h:300 in ea3a65e698 outdated
    295+    bool AddTx(const CTransactionRef& tx, NodeId peer)
    296+    {
    297+        const auto& wtxid{tx->GetWitnessHash()};
    298+        const auto& txid{tx->GetHash()};
    299+        // Quit if we already have this announcement (same wtxid and peer).
    300+        if (HaveTxFromPeer(wtxid, peer)) return false;
    


    sipa commented at 9:26 pm on May 28, 2025:

    In commit “feature: Add TxOrphanageImpl”

    This could be avoided by checking the ret.first result of auto ret = m_orphans.get<ByWtxid>().emplace(tx, peer, m_current_sequence); below. The ByWtxid index is unique, so emplacement will fail if the same wtxid/peer combination already exists, I think.


    glozow commented at 2:43 pm on June 2, 2025:
    looks correct, done
  141. in src/node/txorphanage_impl.h:350 in ea3a65e698 outdated
    345+        // have the tx data.
    346+        if (it == m_orphans.end()) return false;
    347+        if (it->m_tx->GetWitnessHash() != wtxid) return false;
    348+
    349+        // Quit if we already have this announcement (same wtxid and peer).
    350+        if (HaveTxFromPeer(wtxid, peer)) return false;
    


    sipa commented at 9:27 pm on May 28, 2025:

    In commit “feature: Add TxOrphanageImpl”

    This can be avoided by checking ret.first from the auto ret = m_orphans.get<ByWtxid>().emplace(ptx, peer, m_current_sequence); call below, because the ByWtxid index will fail if the same announcement already exists.


    glozow commented at 2:44 pm on June 2, 2025:
    yes, done
  142. in src/node/txorphanage_impl.h:454 in ea3a65e698 outdated
    449+    {
    450+        auto it = m_orphans.get<ByPeer>().lower_bound(ByPeerView{peer, true, 0});
    451+        return it != m_orphans.get<ByPeer>().end() && it->m_announcer == peer && it->m_reconsider;
    452+    }
    453+
    454+    /** If there is a tx that can be reconsidered, return it. Otherwise, return a nullptr. */
    


    sipa commented at 9:31 pm on May 28, 2025:

    In commit “feature: Add TxOrphanageImpl”

    Maybe note that if a tx is returned, it is also marked as non-reconsiderable.


    glozow commented at 2:46 pm on June 2, 2025:
    added in the documentation
  143. sipa commented at 9:31 pm on May 28, 2025: member
    Some comments already.
  144. in src/node/txorphanage_impl.h:631 in 455b4b8178 outdated
    626+            Assume(dos_score.fee > dos_score.size);
    627+
    628+            // Evict the oldest announcement from this peer, sorting non-reconsiderable before reconsiderable.
    629+            auto it_ann = m_orphans.get<ByPeer>().lower_bound(ByPeerView{worst_peer, false, 0});
    630+            Assume(it_ann->m_announcer == worst_peer);
    631+            Erase<ByPeer>(it_ann, /*cleanup_outpoints_map=*/CountWtxid(it_ann->m_tx->GetWitnessHash()) == 1);
    


    instagibbs commented at 9:39 pm on May 28, 2025:
    If you had MaxGlobalAnnouncements peers all announcing the same tx, it would end up being n^2 * logn work with CountWtxid calls?

    glozow commented at 11:03 am on June 1, 2025:
    Oof yes! Should be replaced with an "IsUnique" type of thing.

    glozow commented at 3:15 pm on June 2, 2025:
    Added an IsUnique to avoid this
  145. instagibbs commented at 9:45 pm on May 28, 2025: member

    reviewed through 455b4b817884f860cd4467f0a9be4a459e89891c

    apologies if later commits clear up my confusion

  146. in src/test/orphanage_tests.cpp:126 in c9a34d4cdb outdated
    121+    // Single peer: eviction is triggered if either limit is hit
    122+    {
    123+        // Test announcement limits
    124+        NodeId peer{8};
    125+        node::TxOrphanageImpl orphanage_low_ann(/*max_global_ann=*/1, /*reserved_peer_usage=*/TX_SIZE * 10);
    126+        node::TxOrphanageImpl orphanage_low_mem(/*max_global_ann=*/10, /*reserved_peer_usage=*/TX_SIZE + 1);
    


    instagibbs commented at 2:15 pm on May 29, 2025:

    c9a34d4cdb5f5eca385304dc5c836960fad2a74a

    0        node::TxOrphanageImpl orphanage_low_mem(/*max_global_ann=*/10, /*reserved_peer_usage=*/TX_SIZE);
    

    glozow commented at 3:39 pm on June 2, 2025:
    done
  147. in src/test/orphanage_tests.cpp:143 in c9a34d4cdb outdated
    138+        orphanage_low_mem.AddTx(TXNS.at(1), peer);
    139+        BOOST_CHECK(orphanage_low_mem.CountAnnouncements() <= orphanage_low_mem.MaxGlobalAnnouncements());
    140+        BOOST_CHECK(orphanage_low_mem.TotalOrphanUsage() > orphanage_low_mem.MaxGlobalUsage());
    141+        BOOST_CHECK(orphanage_low_mem.NeedsTrim());
    142+
    143+        BOOST_CHECK_EQUAL(CheckNumEvictions(orphanage_low_mem), 1);
    


    instagibbs commented at 2:20 pm on May 29, 2025:

    c9a34d4cdb5f5eca385304dc5c836960fad2a74a

    could run these checks just before additions to assert no-op


    glozow commented at 8:43 pm on June 3, 2025:
    added
  148. in src/test/orphanage_tests.cpp:182 in c9a34d4cdb outdated
    177+        orphanage.AddChildrenToWorkSet(*parents.at(0), det_rand);
    178+        BOOST_CHECK(orphanage.HaveTxToReconsider(peer));
    179+
    180+        // Add 1 more orphan, causing the orphanage to be oversize. child1 is evicted.
    181+        orphanage.AddTx(children.at(3), peer);
    182+        orphanage.LimitOrphans();
    


    instagibbs commented at 2:22 pm on May 29, 2025:

    c9a34d4cdb5f5eca385304dc5c836960fad2a74a

    might as well use and assert return of CheckNumEvictions


    glozow commented at 3:41 pm on June 2, 2025:
    done
  149. in src/test/orphanage_tests.cpp:190 in c9a34d4cdb outdated
    185+        BOOST_CHECK(orphanage.HaveTx(children.at(2)->GetWitnessHash()));
    186+        BOOST_CHECK(orphanage.HaveTx(children.at(3)->GetWitnessHash()));
    187+
    188+        // Add 1 more... child2 is evicted.
    189+        orphanage.AddTx(children.at(4), peer);
    190+        orphanage.LimitOrphans();
    


    instagibbs commented at 2:22 pm on May 29, 2025:

    c9a34d4cdb5f5eca385304dc5c836960fad2a74a

    might as well use and assert return of CheckNumEvictions


    glozow commented at 3:41 pm on June 2, 2025:
    done
  150. in src/test/orphanage_tests.cpp:203 in c9a34d4cdb outdated
    198+        orphanage.AddChildrenToWorkSet(*parents.at(3), det_rand);
    199+
    200+        orphanage.AddTx(children.at(5), peer);
    201+        orphanage.AddChildrenToWorkSet(*parents.at(5), det_rand);
    202+
    203+        orphanage.LimitOrphans();
    


    instagibbs commented at 2:23 pm on May 29, 2025:

    c9a34d4cdb5f5eca385304dc5c836960fad2a74a

    might as well use and assert return of CheckNumEvictions


    glozow commented at 3:43 pm on June 2, 2025:
    done
  151. in src/test/orphanage_tests.cpp:214 in c9a34d4cdb outdated
    209+        // The first transaction returned from GetTxToReconsider is the older one, not the one that was marked for
    210+        // reconsideration earlier.
    211+        // Transactions are marked non-reconsiderable again when returned through GetTxToReconsider
    212+        BOOST_CHECK_EQUAL(orphanage.GetTxToReconsider(peer), children.at(3));
    213+        orphanage.AddTx(children.at(6), peer);
    214+        orphanage.LimitOrphans();
    


    instagibbs commented at 2:23 pm on May 29, 2025:

    c9a34d4cdb5f5eca385304dc5c836960fad2a74a

    might as well use and assert return of CheckNumEvictions


    glozow commented at 3:43 pm on June 2, 2025:
    done
  152. in src/test/orphanage_tests.cpp:226 in c9a34d4cdb outdated
    221+        BOOST_CHECK_EQUAL(orphanage.GetTxToReconsider(peer), children.at(5));
    222+    }
    223+
    224+    // Multiple peers: when limit is exceeded, we choose the DoSiest peer and evict their oldest transaction.
    225+    {
    226+        NodeId peer0{0};
    


    instagibbs commented at 2:25 pm on May 29, 2025:

    c9a34d4cdb5f5eca385304dc5c836960fad2a74a

    nit: s/peer0/dos_peer/


    glozow commented at 4:18 pm on June 2, 2025:
    done, peer_dosy
  153. in src/test/orphanage_tests.cpp:240 in c9a34d4cdb outdated
    235+        for (unsigned int i{0}; i < max_announcements; ++i) {
    236+            orphanage.AddTx(TXNS.at(i), peer0);
    237+            BOOST_CHECK_EQUAL(CheckNumEvictions(orphanage), 0);
    238+        }
    239+        BOOST_CHECK_EQUAL(orphanage.AnnouncementsFromPeer(peer0), max_announcements);
    240+        BOOST_CHECK_EQUAL(orphanage.AnnouncementsFromPeer(peer1), 0);
    


    instagibbs commented at 2:26 pm on May 29, 2025:

    c9a34d4cdb5f5eca385304dc5c836960fad2a74a

    0        BOOST_CHECK_EQUAL(orphanage.AnnouncementsFromPeer(peer1), 0);
    1        BOOST_CHECK_EQUAL(orphanage.AnnouncementsFromPeer(peer2), 0);
    

    glozow commented at 4:19 pm on June 2, 2025:
    done
  154. in src/test/orphanage_tests.cpp:357 in c9a34d4cdb outdated
    318+        BOOST_CHECK_EQUAL(orphanage.MaxGlobalAnnouncements(), node::DEFAULT_MAX_ORPHAN_ANNOUNCEMENTS);
    319+        BOOST_CHECK_EQUAL(orphanage.ReservedPeerUsage(), node::DEFAULT_RESERVED_ORPHAN_WEIGHT_PER_PEER);
    320+        BOOST_CHECK_EQUAL(orphanage.MaxGlobalUsage(), node::DEFAULT_RESERVED_ORPHAN_WEIGHT_PER_PEER * 3);
    321+        BOOST_CHECK_EQUAL(orphanage.MaxPeerAnnouncements(), node::DEFAULT_MAX_ORPHAN_ANNOUNCEMENTS / 3);
    322+
    323+        // Number of peers didn't change.
    


    instagibbs commented at 2:32 pm on May 29, 2025:

    c9a34d4cdb5f5eca385304dc5c836960fad2a74a

    EraseForPeer and EraseTx coverage here would be good. IIUC max memory goes up IFF a peer has a live orphan announcement?


    glozow commented at 4:27 pm on June 2, 2025:
    added
  155. in src/test/orphanage_tests.cpp:93 in c9a34d4cdb outdated
    89@@ -89,6 +90,244 @@ static bool EqualTxns(const std::set<CTransactionRef>& set_txns, const std::vect
    90     return true;
    91 }
    92 
    93+unsigned int CheckNumEvictions(node::TxOrphanageImpl& orphanage)
    


    instagibbs commented at 2:33 pm on May 29, 2025:

    c9a34d4cdb5f5eca385304dc5c836960fad2a74a

    Single scenario that boots out 2+ txns with a single CheckNumEvictions seems apt


    glozow commented at 8:43 pm on June 3, 2025:
    added a test that has 10 evictions in 1 go
  156. in src/test/orphanage_tests.cpp:278 in c9a34d4cdb outdated
    273+            // Evictions are FIFO within a peer, so the ith transaction sent by peer0 is the one that was evicted.
    274+            BOOST_CHECK(!orphanage.HaveTxFromPeer(TXNS.at(i)->GetWitnessHash(), peer0));
    275+            BOOST_CHECK(orphanage.HaveTx(TXNS.at(i)->GetWitnessHash()));
    276+        }
    277+
    278+        // With 6 peers, each can add 10, and still only peer0's orphans are evicted.
    


    instagibbs commented at 2:34 pm on May 29, 2025:

    c9a34d4cdb5f5eca385304dc5c836960fad2a74a

    Would like test coverage for an "alternation" to deleting another peer's announcement to demonstrate heap behavior. IIUC this test is only doing peer0


    glozow commented at 7:11 pm on June 5, 2025:
    Added a test that has interleaved worst peers in a LimitOrphans call
  157. in src/test/fuzz/txorphan.cpp:248 in f6c4f1ed3e outdated
    243+
    244+    // Peer that must have orphans protected from eviction
    245+    NodeId honest_peerid{0};
    246+
    247+    // We have NUM_PEERS, of which Peer==0 is the "honest" one
    248+    // who will never exceed their reserved weight of announcement
    


    instagibbs commented at 2:37 pm on May 29, 2025:

    f6c4f1ed3e353d6bf1f4372adb63b1906d18890a

    0    // who will never exceed their reserved weight or announcement
    

    glozow commented at 3:36 pm on June 2, 2025:
    done
  158. in src/test/fuzz/txorphan.cpp:291 in f6c4f1ed3e outdated
    288+            }
    289+            // output amount or spendability will not affect txorphanage
    290+            tx_mut.vout.reserve(num_out);
    291+            for (uint32_t i = 0; i < num_out; i++) {
    292+                const auto payload_size = fuzzed_data_provider.ConsumeIntegralInRange<unsigned int>(1, 100000);
    293+                if (payload_size) {
    


    instagibbs commented at 2:41 pm on May 29, 2025:

    f6c4f1ed3e353d6bf1f4372adb63b1906d18890a

    this conditional is always taken


    glozow commented at 3:37 pm on June 2, 2025:
    changed
  159. in src/test/fuzz/txorphan.cpp:268 in f6c4f1ed3e outdated
    266+
    267+    // initial outpoints used to construct transactions later
    268+    for (uint8_t i = 0; i < 4; i++) {
    269+        outpoints.emplace_back(Txid::FromUint256(uint256{i}), 0);
    270+    }
    271+
    


    instagibbs commented at 2:43 pm on May 29, 2025:
    0
    1    // This set of wtxids are honest peer's live announcements that must be protected
    

    glozow commented at 3:36 pm on June 2, 2025:
    added
  160. in src/test/fuzz/txorphan.cpp:372 in f6c4f1ed3e outdated
    367+
    368+                    Assert(orphanage.CountAnnouncements() <= global_announcement_limit);
    369+                    Assert(orphanage.TotalOrphanUsage() <= per_peer_weight_reservation * NUM_PEERS);
    370+
    371+                    // This should never differ before and after since we aren't allowing
    372+                    // expiries and we've never exceeded the per-peer reservations.
    


    instagibbs commented at 2:51 pm on May 29, 2025:

    f6c4f1ed3e353d6bf1f4372adb63b1906d18890a

    expiries aren't a thing anymore


    glozow commented at 3:38 pm on June 2, 2025:
    that's what I meant to say, but yeah probably not helpful to mention
  161. instagibbs commented at 3:09 pm on May 29, 2025: member
    reviewed through ea3a65e698f519afee23484ce1b399e9a4c62529
  162. in src/node/txorphanage_impl.h:500 in 455b4b8178 outdated
    507+        // transactions are added first. Doing so helps avoid work when one of the orphans replaced
    508+        // an earlier one. Since we require the NodeId to match, one peer's announcement order does
    509+        // not bias how we process other peer's orphans.
    510+        auto& index_by_peer = m_orphans.get<ByPeer>();
    511+        auto it_upper = index_by_peer.upper_bound(ByPeerView{peer, true, std::numeric_limits<uint64_t>::max()});
    512+        auto it_lower = index_by_peer.lower_bound(ByPeerView{peer, false, 0});
    


    instagibbs commented at 3:46 pm on May 29, 2025:
    what if the peer sent us the first non-reconsidered orphan? would that not be considered below due to sequence number being 0?

    glozow commented at 3:05 pm on June 2, 2025:
    That element would be equal to it_lower. I guess the concern is if we skip that element in the while (rit != rit_end) loop? rit_end is one past the last element. Here's a diagram from cpp reference: https://saco-evaluator.org.za/docs/cppreference/en/cpp/iterator/rbegin.html
  163. glozow renamed this:
    p2p: improve TxOrphanage denial of service bounds and increase -maxorphantxs
    p2p: improve TxOrphanage denial of service bounds
    on Jun 1, 2025
  164. [txorphanage] change type of usage to int64_t
    Since this field holds a total number of bytes, overflow is within the
    realm of possibility. Use int64 to be safe.
    6eaba06b1a
  165. [prep/refactor] move txorphanage to node namespace and directory
    This is move-only.
    fd5ba7a3db
  166. [prep/rpc] remove entry and expiry time from getorphantxs
    Expiry is going away in a later commit.
    This is only an RPC change. Behavior of the orphanage does not change.
    Note that getorphantxs is marked experimental.
    c70097bf69
  167. glozow force-pushed on Jun 2, 2025
  168. glozow commented at 4:36 pm on June 2, 2025: member
    Thanks for the review! Just addressed most comments, still have a few more to get to
  169. glozow force-pushed on Jun 2, 2025
  170. DrahtBot added the label CI failed on Jun 2, 2025
  171. DrahtBot commented at 4:41 pm on June 2, 2025: contributor

    🚧 At least one of the CI tasks failed. Task lint: https://github.com/bitcoin/bitcoin/runs/43322556029 LLM reason (✨ experimental): Lint check failed due to trailing whitespace in src/test/orphanage_tests.cpp.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  172. [prep/test] modify test to not access TxOrphanage internals
    These internals should and will be private.
    3662c5b5ec
  173. [prep/config] remove -maxorphantx
    The orphanage will no longer have a maximum number of unique orphans.
    789e38ff22
  174. [prep/refactor] move DEFAULT_MAX_ORPHAN_TRANSACTIONS to txorphanage.h
    This is move only.
    b5c14b8e7a
  175. [prep/test] have TxOrphanage remember its own limits in LimitOrphans
    Move towards a model where TxOrphanage is initialized with limits that
    it remembers throughout its lifetime.
    Remove the param. Limiting by number of unique orphans will be removed
    in a later commit.
    Now that -maxorphantx is gone, this does not change the node behavior.
    The parameter is only used in tests.
    1225e06846
  176. in src/node/txorphanage.h:31 in 3da112f33b outdated
    27@@ -30,7 +28,11 @@ static const uint32_t DEFAULT_MAX_ORPHAN_TRANSACTIONS{100};
    28  * Not thread-safe. Requires external synchronization.
    29  */
    30 class TxOrphanage {
    31+    const std::unique_ptr<TxOrphanageImpl> m_impl;
    


    sipa commented at 4:53 pm on June 2, 2025:

    What do you think about using the "exposed std::unique_ptr<>" pattern (don't know if it has a name) as opposed to full-blown pimpl?

    So like TxGraph / TxGraphImpl, I'm suggesting there would be 2 files (no _impl.h), with a TxOrphanage abstract class with virtual member functions, and an implementation defined only inside the .cpp file, and a factory function std::unique_ptr<TxOrphanage> MakeTxOrphanage() or so that returns a newly-constructed TxOrphanageImpl. It avoids the boilerplate of TxOrphanage functions that forward to the impl code, and the need for a _impl.h file. A downside is that it's (probably negligibly) slower to dispatch (because virtual functions), and an inability to test the implementation beyond what the public interface offers, but I don't think that's happening here anyway?


    glozow commented at 5:02 pm on June 3, 2025:
    Sure, yes. I'll add a commit to the beginning to refactor to this structure + change all clients to do MakeTxOrphanage, and then swap out the Impls basically.

    glozow commented at 8:53 pm on June 3, 2025:
    Restructured this way, with the help of 🪄 ✨AI✨
  177. glozow force-pushed on Jun 3, 2025
  178. glozow force-pushed on Jun 3, 2025
  179. DrahtBot removed the label CI failed on Jun 3, 2025
  180. in src/test/fuzz/txorphan.cpp:232 in 3322f9301e outdated
    227@@ -228,3 +228,154 @@ FUZZ_TARGET(txorphan, .init = initialize_orphanage)
    228     }
    229     orphanage->SanityCheck();
    230 }
    231+
    232+FUZZ_TARGET(txorphan_protected, .init = initialize_orphanage)
    


    instagibbs commented at 2:37 pm on June 5, 2025:
    Had a thought: this harness could be relaxed to where the fuzzer input selects the protected peers, which could be any subset of all the peers. This would allow coverage of a scenario where no peers exceed reservation limits, and possibly stronger assertions in that case.

    glozow commented at 7:11 pm on June 5, 2025:
    Good idea, done
  181. [prep/refactor] make TxOrphanage a virtual class implemented by TxOrphanageImpl d85124825b
  182. [p2p] improve TxOrphanage DoS limits
    This is largely a reimplementation using boost::multi_index_container.
    All the same public methods are available. It has an index by outpoint,
    per-peer tracking, peer worksets, etc.
    
    A few differences:
    - Limits have changed: instead of a global limit of 100 unique orphans,
      we have a maximum number of announcements (which can include duplicate
    orphans) and a global memory limit which scales with the number of
    peers.
    - The maximum announcements limit is 100 to match the original limit,
      but this is actually a stricter limit because the announcement count
    is not de-duplicated.
    - Eviction strategy: when global limits are reached, a per-peer limit
      comes into play. While limits are exceeded, we choose the peer whose
    "DoS score" (max usage / limit ratio for announcements and memory
    limits) is highest and evict announcements by entry time, sorting
    non-reconsiderable ones before reconsiderable ones. Since announcements
    are unique by (wtxid, peer), as long as 1 announcement remains for a
    transaction, it remains in the orphanage.
    - This eviction strategy means no peer can influence the eviction of
      another peer's orphans.
    - Also, since global limits are a multiple of per-peer limits, as long
      as a peer does not exceed its limits, its orphans are protected from
    eviction.
    - Orphans no longer expire, since older announcements are generally
      removed before newer ones.
    - GetChildrenFromSamePeer returns the transactions from newest to
      oldest.
    498f1c0191
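The eviction order above hinges on the per-peer "DoS score", the worse of a peer's two usage/limit ratios. A minimal standalone sketch of how such a score could be computed and compared without floating point (`Ratio`, `RatioLess`, and `DosScore` are illustrative names, not the PR's; the PR reuses an integer fraction type for the comparison):

```cpp
#include <cassert>
#include <cstdint>

// A peer's "DoS score" is the larger of its two usage/limit ratios:
// announcement count vs. the announcement limit, and memory usage vs. the
// memory limit. Trimming targets the peer with the highest score.
struct Ratio {
    int64_t num; // usage
    int64_t den; // limit (> 0)
};

// Compare l.num/l.den < r.num/r.den by cross-multiplying, avoiding floats.
bool RatioLess(const Ratio& l, const Ratio& r)
{
    return l.num * r.den < r.num * l.den;
}

Ratio DosScore(int64_t ann_count, int64_t ann_limit, int64_t mem_usage, int64_t mem_limit)
{
    const Ratio ann{ann_count, ann_limit};
    const Ratio mem{mem_usage, mem_limit};
    return RatioLess(ann, mem) ? mem : ann;
}
```

A score above 1 (num > den) means the peer exceeds at least one of its limits, which is the condition under which trimming can select it; a peer staying within its reservation never reaches that threshold.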
  183. [cleanup] remove unused rng param from LimitOrphans 5c43a637d9
  184. [cleanup] replace TxOrphanage::Size() with CountUniqueOrphans 2d17b9fb3b
  185. [unit test] basic TxOrphanageImpl eviction and protection 5d94c6f351
  186. [unit test] strengthen GetChildrenFromSamePeer tests: results are in recency order 3b4e2eabce
  187. glozow force-pushed on Jun 5, 2025
  188. [fuzz] txorphanage_impl protection harness
    This fuzzer specifically tests protection of an honest peer's orphans.
    
    Co-authored-by: Greg Sanders <gsanders87@gmail.com>
    05e6241be6
  189. [p2p] bump DEFAULT_MAX_ORPHAN_ANNOUNCEMENTS to 3000
    For the default number of peers (125), this allows each peer to relay a
    default descendant package of transactions (up to 25-1=24 of which can
    be missing inputs) out of order.
    
    Functional tests aren't changed to check for a cap of 3000
    because it would make the runtime too long.
    
    Also deletes the now-unused DEFAULT_MAX_ORPHAN_TRANSACTIONS.
    5e86bb8b29
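The 3000 figure is just the product of the two defaults mentioned in the commit message; a compile-time check makes the arithmetic explicit (a standalone sketch, with constant names chosen here rather than taken from the codebase):

```cpp
// With the default connection limit of 125 peers, each peer can announce one
// default-sized descendant package: 25 transactions, of which all but the
// in-mempool ancestor (25 - 1 = 24) may be orphans missing inputs.
constexpr int kDefaultMaxPeerConnections{125};
constexpr int kDefaultDescendantLimit{25};
constexpr int kMaxOrphansPerPackage{kDefaultDescendantLimit - 1};

static_assert(kDefaultMaxPeerConnections * kMaxOrphansPerPackage == 3000,
              "per-peer package budget times peer count gives the global cap");
```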
  190. [prep/test] restart instead of bumpmocktime between p2p_orphan_handling subtests
    If we want to restart at all during the tests, we can't have future timestamps.
    072a77f2b4
  191. [functional test] orphan resolution works in the presence of DoSy peers
    Co-authored-by: Greg Sanders <gsanders87@gmail.com>
    0d511965c9
  192. glozow force-pushed on Jun 5, 2025
  193. DrahtBot added the label CI failed on Jun 5, 2025
  194. DrahtBot commented at 7:16 pm on June 5, 2025: contributor

    🚧 At least one of the CI tasks failed.
    Task ARM, unit tests, no functional tests: https://github.com/bitcoin/bitcoin/runs/43568906845
    LLM reason (✨ experimental): The failure is due to a missing include of <bitset> in the source file.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  195. DrahtBot removed the label CI failed on Jun 5, 2025
  196. glozow commented at 3:54 pm on June 6, 2025: member
    All comments addressed, ready for review. Will write up some notes for a review club soon.
  197. in src/node/txorphanage.cpp:14 in 498f1c0191 outdated
     7@@ -8,68 +8,159 @@
     8 #include <logging.h>
     9 #include <policy/policy.h>
    10 #include <primitives/transaction.h>
    11+#include <util/feefrac.h>
    12 #include <util/time.h>
    13 
    14+#include <boost/multi_index/indexed_by.hpp>
    


    sipa commented at 6:00 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    Nitty McNitface: commit title undersells it a bit. Maybe “Overhaul TxOrphanage with smarter limits” or so?

  198. in src/node/txorphanage.cpp:94 in 498f1c0191 outdated
    129-    std::vector<OrphanMap::iterator> m_orphan_list;
    130+    /** Number of unique orphans by wtxid. Less than or equal to the number of entries in m_orphans. */
    131+    unsigned int m_unique_orphans{0};
    132+
    133+    /** Total bytes used by orphans, deduplicated by wtxid. */
    134+    unsigned int m_unique_orphan_bytes{0};
    


    sipa commented at 6:02 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    I see in other places int64_t is used for expressing usage (like PeerDosInfo::m_total_usage). Maybe better to pick one type for all places that involve usage? Or even introduce some type aliases for things like announcementcounts/txcounts/memusages, as that may improve readability too.

  199. in src/node/txorphanage.cpp:184 in 498f1c0191 outdated
    182     int64_t UsageByPeer(NodeId peer) const override;
    183     void SanityCheck() const override;
    184 };
    185 
    186-bool TxOrphanageImpl::AddTx(const CTransactionRef& tx, NodeId peer)
    187+/** Erase from m_orphans and update m_peer_orphanage_info.
    


    sipa commented at 6:11 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    Nit: this function documentation may be more appropriate near its declaration inline in TxOrphanageImpl, rather than here with its implementation.

    (and elsewhere below)

  200. in src/node/txorphanage.cpp:203 in 498f1c0191 outdated
    207+    Assume(peer_it != m_peer_orphanage_info.end());
    208+    if (peer_it->second.Subtract(*it)) m_peer_orphanage_info.erase(peer_it);
    209+
    210+    if (cleanup_outpoints_map) {
    211+        m_unique_orphans -= 1;
    212+        m_unique_orphan_bytes -= it->GetUsage();
    


    sipa commented at 8:30 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    Nit: if cleanup_outpoints_map also controls whether m_unique_orphans and m_unique_orphan_bytes gets updated, maybe it’s better to call it “was_last_for_wtxid” or so?

    Alternatively, it seems possible to just drop the argument, and have this function call IsUnique directly. That would remove the responsibility from the caller. The only caller that doesn’t already do this is EraseTx, but even there it’s correct I think, and only negligibly slower?

  201. in src/node/txorphanage.cpp:393 in 498f1c0191 outdated
    490+    while (it != index_by_peer.end() && it->m_announcer == peer) {
    491+        // Decide what will happen next before the iter is invalidated.
    492+        auto it_next = std::next(it);
    493+
    494+        // Delete item, cleaning up m_outpoint_to_orphan_it iff this entry is unique by wtxid.
    495+        Erase<ByPeer>(it, /*cleanup_outpoints_map=*/IsUnique(it->m_tx->GetWitnessHash()));
    


    sipa commented at 8:36 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    I think IsUnique(it) would work here too, and be more efficient? Or move into Erase, see comment there.

  202. in src/node/txorphanage.cpp:463 in 498f1c0191 outdated
    575+            while (NeedsTrim()) {
    576+                if (!Assume(it_ann->m_announcer == worst_peer)) break;
    577+                // Decide what will happen next before the iter is invalidated.
    578+                auto it_next = std::next(it_ann);
    579+
    580+                Erase<ByPeer>(it_ann, /*cleanup_outpoints_map=*/IsUnique(it_ann->m_tx->GetWitnessHash()));
    


    sipa commented at 8:37 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    IsUnique(it_ann) here too? (or move it inside Erase, see comment there).

  203. in src/node/txorphanage.cpp:499 in 498f1c0191 outdated
    635-                // Get this source peer's work set, emplacing an empty set if it didn't exist
    636-                // (note: if this peer wasn't still connected, we would have removed the orphan tx already)
    637-                std::set<Wtxid>& orphan_work_set = m_peer_orphanage_info.try_emplace(announcer).first->second.m_work_set;
    638-                // Add this tx to the work set
    639-                orphan_work_set.insert(elem->first);
    640+                const auto num_announcers{CountWtxid(wtxid)};
    


    sipa commented at 8:44 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    Calling lower_bound to initialize it, and then calling CountWtxid(wtxid) which will do the same seems superfluous.

    How about:

    auto it = index_by_wtxid.lower_bound(ByWtxidView{wtxid, MIN_PEER});
    auto it_end = index_by_wtxid.upper_bound(ByWtxidView{wtxid, MAX_PEER});
    const auto num_announcers{std::distance(it, it_end)};
    if (num_announcers == 0) continue;
    std::advance(it, rng.randrange(num_announcers));
    

    This would perhaps even allow you to get rid of CountWtxid, as all remaining calls are in Assume(CountWtxid(wtxid) > 1);.

  204. in src/node/txorphanage.cpp:541 in 498f1c0191 outdated
    696-        }
    697+    auto it = m_orphans.get<ByPeer>().lower_bound(ByPeerView{peer, true, 0});
    698+    if (it != m_orphans.get<ByPeer>().end() && it->m_announcer == peer && it->m_reconsider) {
    699+        // Flip m_reconsider. Even if this transaction stays in orphanage, it shouldn't be
    700+        // reconsidered again until there is a new reason to do so.
    701+        auto mark_reconsidered_modifier = [](auto& ann) { ann.m_reconsider = false; };
    


    sipa commented at 8:49 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    Nit: static constexpr auto mark_reconsidered_modifier = ...

  205. in src/node/txorphanage.cpp:505 in 498f1c0191 outdated
    641+                if (!Assume(num_announcers > 0)) continue;
    642+                std::advance(it, rng.randrange(num_announcers));
    643+                if (!Assume(it->m_tx->GetWitnessHash() == wtxid)) break;
    644+
    645+                // Mark this orphan as ready to be reconsidered.
    646+                auto mark_reconsidered_modifier = [](auto& ann) { ann.m_reconsider = true; };
    


    sipa commented at 8:50 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    Nit: static constexpr auto mark_reconsidered_modifier = ...

  206. in src/node/txorphanage.cpp:362 in 498f1c0191 outdated
    423+    auto it = index_by_wtxid.lower_bound(ByWtxidView{wtxid, MIN_PEER});
    424+    if (it == index_by_wtxid.end()) return 0;
    425+
    426+    unsigned int num_erased{0};
    427+    const auto& txid = it->m_tx->GetHash();
    428+    while (it != index_by_wtxid.end() && it->m_tx->GetWitnessHash() == wtxid) {
    


    sipa commented at 8:56 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    This loop could also be written as:

    auto it = index_by_wtxid.lower_bound(ByWtxidView{wtxid, MIN_PEER});
    auto it_end = index_by_wtxid.upper_bound(ByWtxidView{wtxid, MAX_PEER});
    unsigned int num_erased{0};
    while (it != it_end) {
        Erase<ByWtxid>(it++, num_erased == 0);
        ++num_erased;
    }
    

    (also applies to EraseForPeer)

  207. in src/node/txorphanage.cpp:379 in 498f1c0191 outdated
    457-    m_orphan_list.pop_back();
    458-
    459-    m_orphans.erase(it);
    460-    return 1;
    461+
    462+    return std::min<unsigned int>(1, num_erased);
    


    sipa commented at 9:01 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    If the if (it == index_by_wtxid.end()) return 0; above is kept, then I think this is equivalent to return 1;.

  208. in src/node/txorphanage.cpp:375 in 498f1c0191 outdated
    444-        // Unless we're deleting the last entry in m_orphan_list, move the last
    445-        // entry to the position we're deleting.
    446-        auto it_last = m_orphan_list.back();
    447-        m_orphan_list[old_pos] = it_last;
    448-        it_last->second.list_pos = old_pos;
    449+    if (num_erased > 0) {
    


    sipa commented at 9:01 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    If the if (it == index_by_wtxid.end()) return 0; above is kept, then I think this condition is always true.

  209. in src/node/txorphanage.cpp:353 in 498f1c0191 outdated
    393+
    394+    Assume(CountWtxid(wtxid) > 1);
    395+    return true;
    396 }
    397 
    398 int TxOrphanageImpl::EraseTx(const Wtxid& wtxid)
    


    sipa commented at 9:04 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    This function only ever returns 0 or 1. Maybe make its return type bool?

  210. in src/node/txorphanage.cpp:428 in 498f1c0191 outdated
    540+    std::vector<std::pair<NodeId, FeeFrac>> heap_peer_dos;
    541+    heap_peer_dos.reserve(m_peer_orphanage_info.size());
    542+    for (const auto& [nodeid, entry] : m_peer_orphanage_info) {
    543+        heap_peer_dos.emplace_back(nodeid, entry.GetDosScore(max_ann, max_mem));
    544+    }
    545+    auto compare_score = [](auto left, auto right) { return left.second < right.second; };
    


    sipa commented at 9:08 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    Nit: static constexpr auto compare_score =

    Also, the arguments should probably be auto&, though the copy will probably be optimized away anyway.

  211. in src/node/txorphanage.cpp:452 in 498f1c0191 outdated
    564+
    565+        // If needs trim, then at least one peer has a DoS score higher than 1.
    566+        Assume(dos_score.fee > dos_score.size);
    567+
    568+        auto it_worst_peer = m_peer_orphanage_info.find(worst_peer);
    569+        if (it_worst_peer != m_peer_orphanage_info.end()) {
    


    sipa commented at 9:09 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    Is it possible that this condition is not true?

  212. in src/node/txorphanage.cpp:468 in 498f1c0191 outdated
    580+                Erase<ByPeer>(it_ann, /*cleanup_outpoints_map=*/IsUnique(it_ann->m_tx->GetWitnessHash()));
    581+                num_erased += 1;
    582+                it_ann = it_next;
    583+
    584+                // If we erased the last orphan from this peer, it_worst_peer will be invalidated.
    585+                it_worst_peer = m_peer_orphanage_info.find(worst_peer);
    


    sipa commented at 9:14 pm on June 6, 2025:

    In commit “[p2p] improve TxOrphanage DoS limits”

    I don’t think this search is necessary: whether any announcements from the same peer remain can be determined by checking whether it_ann is end() or now refers to another peer.

    (Feel free to disregard, if you think the current code is sufficiently cleaner; I don’t think the performance difference matters)

  213. sipa commented at 9:16 pm on June 6, 2025: member
    Made it through most of the big commit.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-06-08 18:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me