Problem
Under sustained low-priority P2P traffic (tx relay, addr gossip), RPC tail latency can degrade significantly, especially on resource-constrained nodes.
Root Cause
The single-threaded message handler (the msghand thread) processes all P2P messages sequentially, without regard to RPC queue pressure:
- Single-threaded bottleneck - The msghand thread can saturate a single CPU core at 100%, leaving other cores idle while processing P2P messages
- No priority differentiation - Low-priority P2P messages (INV, TX, ADDR) compete equally with time-sensitive operations
- RPC starvation - When the RPC work queue fills up during heavy P2P processing, new RPC requests are rejected with “Work queue depth exceeded”
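The starvation case above can be pictured with a bounded work queue. The following is a minimal sketch, not the code in src/httpserver.cpp: the class name `BoundedWorkQueue` and its methods are hypothetical and chosen for illustration. It shows how a fixed depth limit turns a slow consumer into rejected requests, and where a depth sample could be read.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical, simplified bounded work queue (illustration only).
class BoundedWorkQueue
{
public:
    explicit BoundedWorkQueue(std::size_t max_depth) : m_max_depth{max_depth} {}

    // Returns false when the queue is already at its depth limit. A server
    // would answer such a request with an error like "Work queue depth exceeded".
    bool Enqueue(std::function<void()> item)
    {
        std::lock_guard<std::mutex> lock{m_mutex};
        if (m_queue.size() >= m_max_depth) return false;
        m_queue.push_back(std::move(item));
        return true;
    }

    // Runs the oldest queued item outside the lock. Returns false if the queue is empty.
    bool RunOne()
    {
        std::function<void()> item;
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            if (m_queue.empty()) return false;
            item = std::move(m_queue.front());
            m_queue.pop_front();
        }
        item();
        return true;
    }

    // Current depth: the value a load monitor would sample.
    std::size_t Depth() const
    {
        std::lock_guard<std::mutex> lock{m_mutex};
        return m_queue.size();
    }

private:
    mutable std::mutex m_mutex;
    std::deque<std::function<void()>> m_queue;
    const std::size_t m_max_depth;
};
```

If workers drain the queue more slowly than requests arrive, `Enqueue` starts returning false once the limit is reached, which is the rejection path described above.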
Affected Users
- Wallet operators making frequent RPC calls
- Block explorers and indexing services
- Lightning Network nodes requiring low-latency RPC
- Any node running RPC-heavy workloads alongside P2P relay
Observed Symptoms
- RPC p95/p99 latency spikes during tx relay floods
- “Work queue depth exceeded” errors under load
- Inconsistent RPC response times
Proposed Solution
Add an opt-in backpressure mechanism that defers low-priority P2P message processing when the RPC work queue is under pressure.
Key Design Points
- RpcLoadMonitor - Lock-free state machine tracking RPC queue depth with hysteresis
- Message Classification - Separate low-priority (TX, INV, ADDR) from critical (HEADERS, BLOCK, CMPCTBLOCK)
- Defer-to-tail - Low-priority messages requeued to back of queue, not dropped
- Opt-in - Disabled by default; enabled via the -experimental-rpc-priority flag
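The classification and defer-to-tail points above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the gate in ProcessMessages(): `QueuedMessage`, `IsLowPriority` and `DrainOnce` are hypothetical names, a single queue stands in for the per-peer message queues, and the low-priority set contains only the three message types named in the list.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical message record: wire message type plus an id for tracing.
struct QueuedMessage {
    std::string type;
    int id;
};

// Assumption: only the types listed as low-priority above are deferrable.
// Everything else (e.g. "headers", "block", "cmpctblock") is treated as critical.
bool IsLowPriority(const std::string& msg_type)
{
    static const std::unordered_set<std::string> low{"tx", "inv", "addr"};
    return low.count(msg_type) > 0;
}

// One pass over the queue. Under RPC pressure, low-priority messages are
// moved to the back instead of being processed; nothing is dropped.
// Returns the ids of the messages that were processed, in order.
std::vector<int> DrainOnce(std::deque<QueuedMessage>& queue, bool rpc_under_pressure)
{
    std::vector<int> processed;
    // Bound the pass by the initial size so deferred messages are not revisited.
    const std::size_t n = queue.size();
    for (std::size_t i = 0; i < n; ++i) {
        QueuedMessage msg = std::move(queue.front());
        queue.pop_front();
        if (rpc_under_pressure && IsLowPriority(msg.type)) {
            queue.push_back(std::move(msg)); // defer to tail
            continue;
        }
        processed.push_back(msg.id); // stand-in for real message handling
    }
    return processed;
}
```

Deferred messages keep their relative order at the tail and are handled on a later pass once pressure subsides, which is what distinguishes this strategy from dropping or rate-limiting.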
State Machine
- NORMAL → ELEVATED: queue ≥ 75%
- ELEVATED → NORMAL: queue < 50% (hysteresis)
- ELEVATED → CRITICAL: queue ≥ 90%
- CRITICAL → ELEVATED: queue < 70%
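As a rough illustration of the transitions above, here is a minimal lock-free sketch built on a single atomic and a compare-and-swap loop. It is not the implementation in src/node/rpc_load_monitor.h: the names `RpcLoadMonitorSketch`, `RpcLoadState` and `OnQueueDepthSample` are hypothetical, and the choice to move at most one level per sample is an assumption based on there being no direct NORMAL → CRITICAL edge in the list.

```cpp
#include <atomic>
#include <cstddef>

enum class RpcLoadState { NORMAL, ELEVATED, CRITICAL };

// Hypothetical sketch of a hysteresis state machine driven by queue-depth samples.
class RpcLoadMonitorSketch
{
public:
    explicit RpcLoadMonitorSketch(std::size_t capacity) : m_capacity{capacity} {}

    // Feed one queue-depth sample and return the resulting state.
    RpcLoadState OnQueueDepthSample(std::size_t depth)
    {
        // Integer fill percentage; a zero capacity is treated as 0% full.
        const std::size_t pct = m_capacity == 0 ? 0 : depth * 100 / m_capacity;
        RpcLoadState cur = m_state.load(std::memory_order_relaxed);
        RpcLoadState next = Next(cur, pct);
        // Retry if another thread changed the state between load and swap.
        while (next != cur &&
               !m_state.compare_exchange_weak(cur, next, std::memory_order_relaxed)) {
            next = Next(cur, pct);
        }
        return next;
    }

    RpcLoadState State() const { return m_state.load(std::memory_order_relaxed); }

private:
    // Pure transition function using the thresholds listed above.
    static RpcLoadState Next(RpcLoadState cur, std::size_t pct)
    {
        switch (cur) {
        case RpcLoadState::NORMAL:
            return pct >= 75 ? RpcLoadState::ELEVATED : RpcLoadState::NORMAL;
        case RpcLoadState::ELEVATED:
            if (pct >= 90) return RpcLoadState::CRITICAL;
            if (pct < 50) return RpcLoadState::NORMAL;
            return RpcLoadState::ELEVATED;
        case RpcLoadState::CRITICAL:
            return pct < 70 ? RpcLoadState::ELEVATED : RpcLoadState::CRITICAL;
        }
        return cur;
    }

    const std::size_t m_capacity;
    std::atomic<RpcLoadState> m_state{RpcLoadState::NORMAL};
};
```

Because the enter and leave thresholds differ, a queue hovering around 75% does not flip the state on every sample: once ELEVATED, it has to fall below 50% before the monitor returns to NORMAL.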
Preliminary Results
A/B test with P2P INV flood (~108K entries) and concurrent RPC workload:
| Metric | Change vs. baseline |
|---|---|
| RPC p95 | -16.79% (better) |
| RPC p99 | -11.11% (better) |
| Throughput | +9.4% |
Implementation
PR: #34898
Changes
- New: src/node/rpc_load_monitor.h - State machine implementation
- Modified: src/net_processing.cpp - Backpressure gate in ProcessMessages()
- Modified: src/httpserver.cpp - Queue depth sampling
- New flag: -experimental-rpc-priority
Testing
- 12 unit tests for state machine
- Functional A/B test with metrics
Questions for Discussion
- Are the threshold values (75%/90% enter, 50%/70% leave) reasonable defaults?
- Should message classification be configurable?
- Is defer-to-tail the right strategy vs. other approaches (drop, rate-limit)?
- Should this eventually become default behavior?
Related Discussions
- Mining pools have reported ProcessMessage CPU saturation causing block delays
- Previous requests to split msghand thread into multiple threads
- Analysis of CPU time distribution in ProcessMessages()
This PR provides a lightweight mitigation without requiring architectural changes to the threading model.