net_processing: add opt-in RPC-aware P2P backpressure gate #34898

morozow wants to merge 4 commits into bitcoin:master from morozow:pr/rpc-p2p-backpressure-exp, changing 15 files (+981 −0)
  1. morozow commented at 9:47 am on March 23, 2026: none

    Summary

    Add an experimental backpressure mechanism that defers low-priority P2P message processing when the RPC work queue is under pressure, improving RPC tail latency under sustained P2P load.

    Problem

    The single-threaded message handler (the msghand thread) processes all P2P messages sequentially. Under sustained low-priority P2P traffic (tx relay, addr gossip), RPC tail latency degrades significantly because:

    1. Single-threaded bottleneck - The msghand thread can saturate a single CPU core at 100%, leaving other cores idle
    2. No priority differentiation - Low-priority P2P messages (INV, TX, ADDR) compete equally with RPC work
    3. RPC starvation - When RPC work queue fills up, new requests are rejected with “Work queue depth exceeded”

    This affects users running RPC-heavy workloads (wallets, block explorers, Lightning nodes) alongside P2P relay.

    Solution

    Introduce a minimal, opt-in backpressure policy:

    Request Flow

    sequenceDiagram
        participant Peer as P2P Peer
        participant PM as PeerManager
        participant Monitor as RpcLoadMonitor
        participant HTTP as HTTP Server
        participant RPC as RPC Client

        Note over HTTP,Monitor: RPC queue fills up
        RPC->>HTTP: Multiple RPC requests
        HTTP->>Monitor: OnQueueDepthSample(depth=80, cap=100)
        Monitor->>Monitor: State: NORMAL → ELEVATED

        Note over Peer,PM: P2P message arrives
        Peer->>PM: INV (tx announcements)
        PM->>Monitor: GetState()
        Monitor-->>PM: ELEVATED
        PM->>PM: IsLowPriorityMessage("inv") = true
        PM->>PM: RequeueMessageForProcessing()
        Note over PM: Message deferred to back of queue

        Note over Peer,PM: Critical message arrives
        Peer->>PM: HEADERS
        PM->>Monitor: GetState()
        Monitor-->>PM: ELEVATED
        PM->>PM: IsLowPriorityMessage("headers") = false
        PM->>PM: ProcessMessage()
        Note over PM: Critical messages always processed

        Note over HTTP,Monitor: RPC queue drains
        HTTP->>Monitor: OnQueueDepthSample(depth=40, cap=100)
        Monitor->>Monitor: State: ELEVATED → NORMAL
        Note over PM: Resume normal P2P processing
    

    Components

    1. RpcLoadMonitor - A lock-free state machine that tracks RPC queue depth and exposes load state (NORMAL, ELEVATED, CRITICAL) with hysteresis to prevent oscillation.

    2. Backpressure Gate - A check in PeerManagerImpl::ProcessMessages() that defers low-priority P2P messages when RPC load is elevated.

    3. Message Classification - Clear separation between:

      • Low-priority (deferrable): TX, INV (tx), GETDATA (tx), MEMPOOL, ADDR, ADDRV2, GETADDR
      • Critical (never throttled): HEADERS, BLOCK, CMPCTBLOCK, BLOCKTXN, GETHEADERS, GETBLOCKS, handshake/control messages
    4. Defer-to-tail - Deferred messages are requeued to the back of the peer’s message queue, not dropped. This preserves eventual delivery while prioritizing RPC responsiveness.
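    The classification and defer-to-tail decision described above can be sketched as follows. This is an illustrative Python model, not the PR's C++ code; the set contents follow the lists in this summary, and names such as should_defer and process_or_requeue are hypothetical stand-ins for IsLowPriorityMessage() and RequeueMessageForProcessing().

    ```python
    # Sketch of the message classification and backpressure gate.
    # Note: for GETDATA the PR distinguishes tx requests from block
    # requests; here the whole type is classified for brevity.

    # Deferrable under RPC pressure.
    LOW_PRIORITY = {"tx", "inv", "getdata", "mempool", "addr", "addrv2", "getaddr"}

    # Never throttled: block propagation and handshake/control traffic.
    CRITICAL = {"headers", "block", "cmpctblock", "blocktxn", "getheaders",
                "getblocks", "version", "verack", "ping", "pong"}

    def should_defer(msg_type: str, rpc_state: str) -> bool:
        """Defer only low-priority messages, and only while the RPC load
        state is ELEVATED or CRITICAL."""
        if rpc_state == "NORMAL":
            return False
        return msg_type in LOW_PRIORITY

    def process_or_requeue(queue: list, rpc_state: str) -> None:
        """Deferred messages go to the back of the queue, not dropped."""
        msg = queue.pop(0)
        if should_defer(msg, rpc_state):
            queue.append(msg)  # defer-to-tail preserves eventual delivery
        # else: the message would be handed to ProcessMessage()
    ```

    Because deferred messages are requeued rather than dropped, a sustained ELEVATED state delays low-priority traffic but never loses it.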

    Changes

    New Files

    • src/node/rpc_load_monitor.h - RpcLoadState enum, RpcLoadMonitor interface, AtomicRpcLoadMonitor implementation

    Modified Files

    • src/net_processing.h - Add experimental_rpc_priority and rpc_load_monitor to PeerManager::Options
    • src/net_processing.cpp - Backpressure gate in ProcessMessages(), IsLowPriorityMessage() helper
    • src/net.h - Add RequeueMessageForProcessing() to CNode
    • src/net.cpp - Implement RequeueMessageForProcessing()
    • src/httpserver.h - Add SetHttpServerRpcLoadMonitor()
    • src/httpserver.cpp - Call OnQueueDepthSample() at enqueue/dispatch points
    • src/node/peerman_args.cpp - Parse -experimental-rpc-priority flag
    • src/init.cpp - Create RpcLoadMonitor, wire to HTTP server and PeerManager
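    The httpserver wiring amounts to sampling the work-queue depth at the enqueue and dispatch points and pushing each sample to the monitor. A minimal Python model of that hook (class and method names here are hypothetical, not the PR's identifiers):

    ```python
    # Illustrative model of the HTTP work queue sampling its own depth
    # at each enqueue/dispatch transition, mirroring the
    # OnQueueDepthSample(depth, cap) calls described above.
    from collections import deque

    class WorkQueue:
        def __init__(self, capacity: int, monitor) -> None:
            self.capacity = capacity
            self.items: deque = deque()
            self.monitor = monitor

        def _sample(self) -> None:
            # One sample per transition point, as in the PR description.
            self.monitor.on_queue_depth_sample(len(self.items), self.capacity)

        def enqueue(self, item) -> bool:
            if len(self.items) >= self.capacity:
                return False  # corresponds to "Work queue depth exceeded"
            self.items.append(item)
            self._sample()
            return True

        def dispatch(self):
            item = self.items.popleft()
            self._sample()
            return item

    class RecordingMonitor:
        """Stand-in for RpcLoadMonitor that just records samples."""
        def __init__(self) -> None:
            self.samples = []
        def on_queue_depth_sample(self, depth: int, cap: int) -> None:
            self.samples.append((depth, cap))
    ```

    Sampling at both ends means the monitor sees depth rise as requests arrive and fall as workers drain the queue, which is what drives the state transitions below.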

    New Flag

    -experimental-rpc-priority=<0|1>  (default: 0)
        Enable experimental RPC-aware P2P backpressure policy.
        When enabled, low-priority P2P messages may be deferred
        during RPC queue overload to improve RPC latency.
    

    Policy Details

    State Machine

    stateDiagram-v2
        [*] --> NORMAL

        NORMAL --> ELEVATED : queue ≥ 75%
        NORMAL --> CRITICAL : queue ≥ 90%

        ELEVATED --> CRITICAL : queue ≥ 90%
        ELEVATED --> NORMAL : queue < 50%

        CRITICAL --> ELEVATED : queue < 70%

        note right of NORMAL : Process all messages
        note right of ELEVATED : Defer low-priority P2P
        note right of CRITICAL : Defer low-priority P2P
    

    Hysteresis prevents rapid state oscillation under fluctuating load.

    Thresholds

    Transition             Condition
    NORMAL → ELEVATED      queue_depth ≥ 75% capacity
    NORMAL → CRITICAL      queue_depth ≥ 90% capacity
    ELEVATED → CRITICAL    queue_depth ≥ 90% capacity
    ELEVATED → NORMAL      queue_depth < 50% capacity
    CRITICAL → ELEVATED    queue_depth < 70% capacity
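    The hysteresis logic in the table above can be modeled directly. The PR's AtomicRpcLoadMonitor is a lock-free C++ implementation; this Python sketch reproduces only the transition rules (class and method names are illustrative):

    ```python
    # Model of the RPC load state machine with hysteresis: the threshold
    # to leave an elevated state is lower than the threshold to enter it,
    # so fluctuating load does not cause rapid oscillation.
    NORMAL, ELEVATED, CRITICAL = "NORMAL", "ELEVATED", "CRITICAL"

    class RpcLoadMonitorModel:
        def __init__(self) -> None:
            self.state = NORMAL

        def on_queue_depth_sample(self, depth: int, cap: int) -> str:
            if cap <= 0:              # edge case covered by the unit tests
                return self.state
            pct = 100 * depth / cap
            if self.state == NORMAL:
                if pct >= 90:
                    self.state = CRITICAL
                elif pct >= 75:
                    self.state = ELEVATED
            elif self.state == ELEVATED:
                if pct >= 90:
                    self.state = CRITICAL
                elif pct < 50:
                    self.state = NORMAL
            else:  # CRITICAL
                if pct < 70:
                    self.state = ELEVATED
            return self.state
    ```

    For example, once ELEVATED is entered at 75%, the queue must drain below 50% (not merely 75%) before normal processing resumes; the 25-point band is the hysteresis.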

    Behavior by State

    State       Low-priority P2P    Critical P2P        RPC
    NORMAL      Process normally    Process normally    Process normally
    ELEVATED    Defer to tail       Process normally    Process normally
    CRITICAL    Defer to tail       Process normally    Process normally

    Performance Results

    A/B test with concurrent P2P INV flood (~108K entries) and RPC workload (12 threads, 45s duration):

    Baseline (flag=0)

    Metric           Value
    RPC p50          1.925 ms
    RPC p95          9.320 ms
    RPC p99          17.974 ms
    RPC calls/sec    3,919
    P2P INV msgs     3,394

    With Policy (flag=1)

    Metric           Value
    RPC p50          1.845 ms
    RPC p95          7.755 ms
    RPC p99          15.977 ms
    RPC calls/sec    4,286
    P2P INV msgs     3,372

    Improvement

    Metric        Change
    RPC p50       -4.2% (better)
    RPC p95       -16.79% (better)
    RPC p99       -11.11% (better)
    Throughput    +9.4%
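    The improvement percentages follow directly from the two tables; this snippet recomputes them from the baseline and flag=1 numbers:

    ```python
    # Recompute the reported deltas from the A/B measurements above.
    baseline = {"p50": 1.925, "p95": 9.320, "p99": 17.974, "rps": 3919}
    policy   = {"p50": 1.845, "p95": 7.755, "p99": 15.977, "rps": 4286}

    def pct_drop(base: float, new: float) -> float:
        """Relative latency reduction, in percent."""
        return 100 * (base - new) / base

    p50  = round(pct_drop(baseline["p50"], policy["p50"]), 1)   # 4.2
    p95  = round(pct_drop(baseline["p95"], policy["p95"]), 2)   # 16.79
    p99  = round(pct_drop(baseline["p99"], policy["p99"]), 2)   # 11.11
    tput = round(100 * (policy["rps"] - baseline["rps"]) / baseline["rps"], 1)  # 9.4
    ```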

    Testing

    Unit Tests (src/test/rpc_load_monitor_tests.cpp)

    • rpc_load_monitor_tests suite (12 tests):
      • State transitions (normal→elevated→critical)
      • Hysteresis behavior
      • Thread safety
      • Edge cases (zero/negative capacity)
      • Custom threshold configuration

    Functional Test (test/functional/feature_rpc_p2p_backpressure_ab.py)

    • A/B comparison with P2P INV flood workload
    • Measures RPC latency percentiles (p50/p95/p99)
    • Verifies no RPC errors under load
    • Outputs JSON metrics for analysis

    Test Commands

    # Unit tests
    build/bin/test_bitcoin --run_test=rpc_load_monitor_tests

    # Functional A/B test
    python3 test/functional/feature_rpc_p2p_backpressure_ab.py
    

    Limitations and Future Work

    1. Experimental - Feature is opt-in and may change based on feedback
    2. Overhead - Even without P2P pressure, the policy adds ~1% overhead from per-message state checks
    3. Tuning - Threshold values are initial estimates; may need adjustment based on real-world data
    4. Message granularity - Currently classifies by message type; could be refined to inspect INV/GETDATA contents

    Backwards Compatibility

    • No consensus changes
    • No P2P protocol changes
    • No behavior change when flag is disabled (default)
    • Existing tests pass

    Historical Context

    This problem has been discussed in various forms:

    • Mining pools (AntPool) reported ProcessMessage CPU saturation causing block delays
    • Requests to split the msghand thread into multiple threads
    • Analysis of CPU time spent in ProcessMessages() per peer

    This PR provides a lightweight, opt-in mitigation without requiring architectural changes to the message handler threading model.

  2. net_processing: add opt-in RPC-aware P2P backpressure gate
    Add an experimental backpressure mechanism that defers low-priority P2P
    message processing when the RPC work queue is under pressure, improving
    RPC tail latency under sustained P2P load.
    
    Problem:
    The single-threaded message handler can saturate CPU processing P2P
    messages, causing RPC latency spikes and 'Work queue depth exceeded'
    errors under heavy P2P traffic.
    
    Solution:
    - Add RpcLoadMonitor interface with lock-free atomic implementation
    - Add backpressure gate in ProcessMessages() to defer low-priority P2P
    - Add RequeueMessageForProcessing() to CNode for defer-to-tail behavior
    - Wire HTTP queue depth sampling to RpcLoadMonitor
    - Add -experimental-rpc-priority flag (default: off)
    
    Low-priority messages (TX, INV, GETDATA for tx, ADDR, MEMPOOL) are
    deferred when RPC queue depth exceeds thresholds. Critical messages
    (HEADERS, BLOCK, CMPCTBLOCK, handshake) are never throttled.
    
    State machine uses hysteresis to prevent oscillation:
    - NORMAL -> ELEVATED: queue >= 75%
    - ELEVATED -> NORMAL: queue < 50%
    - ELEVATED -> CRITICAL: queue >= 90%
    - CRITICAL -> ELEVATED: queue < 70%
    
    A/B test results with P2P INV flood (~108K entries):
    - RPC p95 latency: -16.79% (better)
    - RPC p99 latency: -11.11% (better)
    - RPC throughput: +9.4%
    
    No consensus or P2P protocol changes. Feature is opt-in and experimental.
    a1c08cc9ce
  3. docs: Add RPC-P2P backpressure issue documentation and A/B test
    - Add comprehensive issue documentation explaining RPC latency degradation
    under sustained P2P traffic with root cause analysis
    - Document proposed backpressure solution with state machine design and
    threshold values for queue pressure monitoring
    - Include preliminary A/B test results showing p95/p99 latency improvements
    - Update functional test docstring to remove specific issue reference
    - Provides context for opt-in RPC-aware P2P backpressure gate implementation
    76ccb3fc6f
  4. DrahtBot commented at 9:48 am on March 23, 2026: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept NACK stickies-v

    If your review is incorrectly listed, please copy-paste <!--meta-tag:bot-skip--> into the comment that the bot should ignore.

  5. test(feature_rpc_p2p_backpressure_ab): Use assert_greater_than_or_equal for latency checks
    - Import assert_greater_than_or_equal utility function
    - Replace direct assertions with assert_greater_than_or_equal for p95 latency comparison
    - Replace direct assertions with assert_greater_than_or_equal for p99 latency comparison
    - Improves test readability and provides clearer assertion semantics for threshold validation
    afb0c61462
  6. DrahtBot added the label CI failed on Mar 23, 2026
  7. test(test_runner): Add RPC-P2P backpressure A/B test to suite
    - Add feature_rpc_p2p_backpressure_ab.py to BASE_SCRIPTS test list
    - Position test in functional test suite after mining_getblocktemplate_longpoll.py
    - Enable A/B testing of RPC-aware P2P backpressure gate functionality
    f46fac7594
  8. maflcko commented at 1:42 pm on March 23, 2026: member

    Add an experimental backpressure mechanism that defers low-priority P2P message processing when the RPC work queue is under pressure, improving RPC tail latency under sustained P2P load.

    If your machine can’t handle the RPC load, I don’t think the solution is to starve the P2P. The correct fix would be to:

    • Reduce the load on the RPC.
    • Upgrade your machine to handle the load.
    • Create a flame graph, or other benchmarks to find the RPC bottleneck and make it faster, or create an issue.

    If you have too many P2P peers for your machine to handle, you can reduce the number of peers or otherwise reduce traffic (see the reduce traffic docs)

  9. stickies-v commented at 3:04 pm on March 23, 2026: contributor
    Concept NACK, agreed with the rationale outlined above. This is adding a lot of unnecessary complexity.
  10. maflcko commented at 3:40 pm on March 23, 2026: member
    Also, this is a low-effort LLM generated pull, in any case.
  11. maflcko closed this on Mar 23, 2026

  12. morozow commented at 4:05 pm on March 23, 2026: none

    Thanks for the feedback. I understand the concern about P2P priority. If anyone encounters RPC latency issues under P2P load in production, I’d be interested to hear about the use case.

    To clarify: the docs and tests may be LLM-generated, but the core logic is small, targeted, and integrated exactly where it is needed. The problem it addresses is potentially significant, so "low-effort" seems too abstract a characterization.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-03-30 00:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me