rpc: add cpu_load to getpeerinfo #31672

pull vasild wants to merge 1 commits into bitcoin:master from vasild:peer_cpu_load changing 9 files +184 −0
  1. vasild commented at 2:02 pm on January 16, 2025: contributor

    Add a new field cpu_load to the output of getpeerinfo RPC.

    It represents the CPU time spent by the message handling thread for the given peer, weighted for the duration of the connection. That is, for example, if two peers are equally demanding and one is connected longer than the other, then they will have the same cpu_load number.


    Monitoring CPU usage is useful on its own. Also related to #31033.

  2. DrahtBot commented at 2:02 pm on January 16, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31672.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK i-am-yuvi
    Concept ACK jonatack, sipa, theStack, BrandonOdiwuor, 1440000bytes

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    No conflicts as of last run.

  3. DrahtBot added the label RPC/REST/ZMQ on Jan 16, 2025
  4. vasild force-pushed on Jan 16, 2025
  5. DrahtBot added the label CI failed on Jan 16, 2025
  6. DrahtBot commented at 2:05 pm on January 16, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/35717681069

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  7. jonatack commented at 2:25 pm on January 16, 2025: member
    Concept ACK if this can be a reliable, useful metric and help pave the path to #31033.
  8. sipa commented at 3:24 pm on January 16, 2025: member
    Concept ACK
  9. vasild force-pushed on Jan 16, 2025
  10. theStack commented at 6:15 pm on January 16, 2025: contributor

    Concept ACK

    Might be worth noting that this is currently only available on POSIX systems (i.e. not available on Windows, but on all other systems that we support AFAICT) in both the RPC help and a release note.

  11. BrandonOdiwuor commented at 7:00 pm on January 16, 2025: contributor
    Concept ACK
  12. fanquake commented at 1:06 pm on January 17, 2025: member

    Seems like there was some very brief discussion in #31033, and the conclusion “I guess this should start with planting some metrics”, but I’m not sure putting changes into Bitcoin Core is the right first step.

    Before changing our API, it’d be good to atleast show some usage that indicates that the changes here are useful. I assume you’ve already been running this locally, and collecting the data, so it’d be good to know what it turned up?

    As noted in #31033, others have also already conducted similar research (https://b10c.me/projects/023-cpu-usage-of-peers/ or https://delvingbitcoin.org/t/cpu-usage-of-peers/196), and it seems like the data you’d expose here, will be less detailed / useful than what can/has already been produced using tracepoints or similar, so I’m wondering if this is the best approach.

  13. vasild commented at 4:44 pm on January 17, 2025: contributor

    @fanquake IMO, monitoring CPU usage is useful on its own, even if we don’t start treating peers differently based on that metric (https://github.com/bitcoin/bitcoin/issues/31033). In other words, this metric is useful at least as much as a bunch of other metrics in the getpeerinfo output.

    Yes, I have been running this locally - there is like 65x difference between the least and most demanding peers. I am curious to correlate this to the messages being sent/received to/from those peers and to have a histogram of the data, not just least / most demanding (e.g. histogram of bitcoin-cli getpeerinfo |jq ".[].cpu_load"). I view those as something nice to build on top of this PR, not as a blocker that’s needed to justify the usefulness of the CPU time metric.

  14. 1440000bytes commented at 5:34 pm on January 17, 2025: none
    Concept ACK
  15. vasild commented at 12:25 pm on January 20, 2025: contributor

    Might be worth noting that this is currently only available on POSIX systems (i.e. not available on Windows

    Right. GetThreadTimes() looks like a promising Windows alternative. I will try to implement that here. Switching to draft because I will push some work-in-progress a bunch of times.

  16. vasild marked this as a draft on Jan 20, 2025
  17. vasild force-pushed on Jan 21, 2025
  18. vasild force-pushed on Jan 22, 2025
  19. vasild force-pushed on Jan 22, 2025
  20. DrahtBot removed the label CI failed on Jan 22, 2025
  21. vasild force-pushed on Jan 22, 2025
  22. vasild marked this as ready for review on Jan 23, 2025
  23. vasild commented at 10:07 am on January 23, 2025: contributor

    Ready for review. Implemented for Windows as well.

    Here is a test program I played with to test this manually. I think it is not worth adding as a unit test or benchmark, but at least it can live here in the comments of this PR:

     0#include <assert.h>
     1
     2#include <windows.h>
     3#include <winnt.h>
     4
     5#include <processthreadsapi.h>
     6
     7#include <iostream>
     8#include <thread>
     9
    10void thread(size_t work)
    11{
    12    FILETIME before_creation;
    13    FILETIME before_exit;
    14    FILETIME before_kernel;
    15    FILETIME before_user;
    16    bool ret = GetThreadTimes(GetCurrentThread(),
    17        &before_creation,
    18        &before_exit,
    19        &before_kernel,
    20        &before_user);
    21    assert(ret == 1);
    22
    23    if (work > 10'000) {
    24        for (size_t i{0}; i < work; ++i) {
    25            (void)(i * i);
    26        }
    27    } else {
    28        std::this_thread::sleep_for(std::chrono::milliseconds{work});
    29    }
    30
    31    FILETIME after_creation;
    32    FILETIME after_exit;
    33    FILETIME after_kernel;
    34    FILETIME after_user;
    35    ret = GetThreadTimes(GetCurrentThread(),
    36        &after_creation,
    37        &after_exit,
    38        &after_kernel,
    39        &after_user);
    40    assert(ret == 1);
    41
    42    ULARGE_INTEGER before_kernel_;
    43    before_kernel_.LowPart = before_kernel.dwLowDateTime;
    44    before_kernel_.HighPart = before_kernel.dwHighDateTime;
    45
    46    ULARGE_INTEGER before_user_;
    47    before_user_.LowPart = before_user.dwLowDateTime;
    48    before_user_.HighPart = before_user.dwHighDateTime;
    49
    50    ULARGE_INTEGER after_kernel_;
    51    after_kernel_.LowPart = after_kernel.dwLowDateTime;
    52    after_kernel_.HighPart = after_kernel.dwHighDateTime;
    53
    54    ULARGE_INTEGER after_user_;
    55    after_user_.LowPart = after_user.dwLowDateTime;
    56    after_user_.HighPart = after_user.dwHighDateTime;
    57
    58    ULARGE_INTEGER elapsed;
    59    elapsed.QuadPart = after_kernel_.QuadPart + after_user_.QuadPart - before_kernel_.QuadPart - before_user_.QuadPart;
    60    ULONGLONG elapsed_ns{elapsed.QuadPart * 100};
    61    std::cout << "cpu time: " << elapsed_ns / 1'000'000'000.0 << " sec\n";
    62}
    63
    64int main ()
    65{
    66    std::thread t1{thread, 1000000000};
    67    std::thread t2{thread, 1000000000};
    68    std::thread t3{thread, 3000};
    69    t1.join();
    70    t2.join();
    71    t3.join();
    72
    73    return 0;
    74}
    
  24. in src/rpc/net.cpp:150 in 28b2a67f13 outdated
    142@@ -143,6 +143,11 @@ static RPCHelpMan getpeerinfo()
    143                     {RPCResult::Type::NUM_TIME, "last_block", "The " + UNIX_EPOCH_TIME + " of the last block received from this peer"},
    144                     {RPCResult::Type::NUM, "bytessent", "The total bytes sent"},
    145                     {RPCResult::Type::NUM, "bytesrecv", "The total bytes received"},
    146+                    {RPCResult::Type::NUM, "cpu_load", /*optional=*/true,
    147+                        "The CPU time (user + system) spent processing messages from this peer "
    148+                        "and crafting messages for it expressed in per milles (‰) of the "
    149+                        "duration of the connection. Will be omitted on platforms that do not "
    150+                        "support this or if still not measured."},
    


    vasild commented at 10:19 am on January 23, 2025:

    Currently I chose to represent the CPU load as 1/1000s of the connection time. For example, if we have spent 5 CPU seconds for a peer that has been connected for 1000 seconds, then cpu_load will be 5.

    Output of bitcoin-cli getpeerinfo |jq '.[].cpu_load ' |sort -n:

    • Shortly after startup:
     00.003207936170212766
     10.003909962962962963
     20.004000127659574468
     30.004340326086956522
     40.00904279365079365
     50.0270041914893617
     60.03776429787234042
     70.04154987234042553
     80.05750733333333333
     90.0828310843373494
    100.1014558936170213
    110.1453753214285714
    120.1619409404761905
    
    • After running for a few hours:
     00.5100616477272727
     10.7087315352941177
     20.7814365649717514
     30.8773131578947369
     40.9717403481228669
     51.069695995623632
     61.078479046189377
     71.153550253945481
     81.167638256213824
     91.187242140449438
    101.199645694868011
    111.199951082706767
    121.220689207655502
    131.264582222222222
    141.265083146221971
    151.282676876955162
    161.301380197644649
    171.303624877378436
    181.333201868001821
    191.341489760490639
    201.343411334541063
    211.361829386934673
    221.364938446935725
    231.372325493506493
    241.385421828947368
    251.472354355769231
    261.612321964879852
    271.641998848148148
    281.648207868421053
    291.681837521410579
    301.783343262569832
    311.78646682029841
    321.837577241385135
    331.844571151379764
    341.851067047263682
    351.908551290322581
    361.915087061408061
    371.923511220435511
    381.939589383639822
    391.940788162162162
    401.962705583687341
    412.000298126030624
    422.008423324931507
    432.00921122707588
    442.018911893157895
    452.019168323205742
    462.019358503865546
    472.030658536117768
    482.034708427345187
    492.03510252110758
    502.04022746739726
    512.054511708699122
    522.07339420882353
    532.076240002244949
    542.080369749066667
    552.08561637254902
    562.092916448969578
    572.095147154892331
    582.119004164227642
    592.129640033722438
    602.145774200383772
    612.164519192393115
    622.27069154817898
    632.29444800067659
    642.313547881255947
    652.31539280930693
    662.368903822619457
    672.445622954545454
    682.503377741525424
    692.557022695182724
    702.56755574523507
    712.709643024574987
    722.724320047021944
    732.913325884282384
    743.154521111713488
    753.706281482599432
    763.843126643942505
    773.865110431762718
    783.934123987370194
    794.634227207136824
    805.185085595879828
    815.233027217613057
    825.766649107296137
    836.105655600034141
    846.138089870962286
    8510.09238865432725
    8610.24402739230769
    8710.56650271556122
    8820.75054951145038
    8947.69228849275363
    9077.79810328289474
    91145.2686368960396
    

    Maybe using an integer would be better for this? Then it would have to have higher resolution, e.g. instead of 1/1000s it should show 1/1Bs? Then the above numbers would vary from 3207 to 145'268'636. E.g. the lightest peer has caused 3207 nanoseconds of CPU time for each 1 second (=1B nanoseconds) of connection time.

  25. in src/util/time.cpp:128 in 28b2a67f13 outdated
    123+
    124+    timespec t;
    125+    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t) == -1) {
    126+        return std::chrono::nanoseconds{0};
    127+    }
    128+    return std::chrono::nanoseconds{static_cast<int64_t>(t.tv_sec) * 1'000'000'000 + t.tv_nsec};
    


    maflcko commented at 10:25 am on January 23, 2025:
    nit: Instead of manual casting and multiplication, you can use chrono types and let the compiler insert what is needed.

    vasild commented at 12:38 pm on January 23, 2025:
    Right, much better and simpler this way. Thanks!
  26. in src/rpc/net.cpp:250 in 28b2a67f13 outdated
    245@@ -239,6 +246,11 @@ static RPCHelpMan getpeerinfo()
    246         obj.pushKV("last_block", count_seconds(stats.m_last_block_time));
    247         obj.pushKV("bytessent", stats.nSendBytes);
    248         obj.pushKV("bytesrecv", stats.nRecvBytes);
    249+        const auto cpu_time_ns{count_nanoseconds(stats.m_cpu_time)};
    250+        const auto connection_time_ns{count_nanoseconds(now - stats.m_connected)};
    


    maflcko commented at 10:29 am on January 23, 2025:
    nit: Instead of manually casting, you can just divide two chrono types and let the compiler insert anything that is needed.

    vasild commented at 12:40 pm on January 23, 2025:
    Thanks for the suggestion! And now no need to add a new function count_nanoseconds().
  27. maflcko commented at 10:31 am on January 23, 2025: member
    (left two nits, feel free to ignore)
  28. rpc: add cpu_load to getpeerinfo
    Add a new field `cpu_load` to the output of `getpeerinfo` RPC.
    
    It represents the CPU time spent by the message handling thread for the
    given peer, weighted for the duration of the connection. That is, for
    example, if two peers are equally demanding and one is connected longer
    than the other, then they will have the same `cpu_load` number.
    0f68c47e93
  29. vasild force-pushed on Jan 23, 2025
  30. vasild commented at 12:38 pm on January 23, 2025: contributor
    28b2a67f13...0f68c47e93: address suggestions by @maflcko
  31. maflcko commented at 1:06 pm on January 23, 2025: member
    The code here assumes that Bitcoin Core is single threaded, which may be true right now, but could change in the future.
  32. luke-jr commented at 8:35 pm on January 25, 2025: member
    How so? Both implementations use thread-specific timers…?
  33. i-am-yuvi commented at 4:40 pm on January 26, 2025: contributor

    Concept ACK

    I think this would be a clean way to get peers’ CPU usage for message processing since Tracepoints is currently not compatible with macOS/Windows for which folks will be having a hard time.

    Also, this could pave the way for prioritising peers based on CPU usage #31033.

  34. i-am-yuvi commented at 4:52 pm on January 26, 2025: contributor

    Also, during my observations I found value null as well.

    0null
    10.0352094358974359
    20.54925525
    30.579001
    40.6585706
    51.390154842105263
    61.703986581081081
    72.427459408450704
    824.85140159887005
    

    Any idea why is it null? maybe it is about to get disconnected.

  35. vasild commented at 10:52 am on January 27, 2025: contributor

    Any idea why is it null? maybe it is about to get disconnected.

    jq '.[].cpu_load' displays null when there is no cpu_load in the JSON for that peer. The new field is optional:

    0$ bitcoin-cli help getpeerinfo
    1...
    2    "cpu_load" : n,                       (numeric, optional) The CPU time (user + system) spent processing messages from this peer and crafting messages for it expressed in per milles () of the duration of the connection. Will be omitted on platforms that do not support this or if still not measured.
    

    The omission comes from this condition:

    0if (stats.m_cpu_time > 0s && now > stats.m_connected) {
    

    Within 1 second of connecting, now will be equal to stats.m_connected (they use second-precision). I am open to suggestions if you think this can be done in a better or more intuitive way.

  36. vasild commented at 11:01 am on January 27, 2025: contributor

    The code here assumes that Bitcoin Core is single threaded, which may be true right now,

    Hmm? https://github.com/bitcoin/bitcoin/blob/0a931a9787b196d7a620863cc143d9319ffd356d/doc/developer-notes.md#threads

    but could change in the future.

    I guess you mean that message processing is single threaded - the b-msghand thread. I agree it could change in the future.

  37. i-am-yuvi commented at 12:46 pm on January 27, 2025: contributor

    Any idea why is it null? maybe it is about to get disconnected.

    jq '.[].cpu_load' displays null when there is no cpu_load in the JSON for that peer. The new field is optional:

    0$ bitcoin-cli help getpeerinfo
    1...
    2    "cpu_load" : n,                       (numeric, optional) The CPU time (user + system) spent processing messages from this peer and crafting messages for it expressed in per milles () of the duration of the connection. Will be omitted on platforms that do not support this or if still not measured.
    

    The omission comes from this condition:

    0if (stats.m_cpu_time > 0s && now > stats.m_connected) {
    

    Within 1 second of connecting, now will be equal to stats.m_connected (they use second-precision). I am open to suggestions if you think this can be done in a better or more intuitive way.

    Got it, thanks!

  38. i-am-yuvi approved
  39. i-am-yuvi commented at 12:49 pm on January 27, 2025: contributor
    tACK 0f68c47e931de05200adeae639bcee50ea3c171d
  40. DrahtBot requested review from jonatack on Jan 27, 2025
  41. DrahtBot requested review from sipa on Jan 27, 2025
  42. DrahtBot requested review from theStack on Jan 27, 2025
  43. DrahtBot requested review from BrandonOdiwuor on Jan 27, 2025
  44. maflcko commented at 11:26 am on February 5, 2025: member

    I guess you mean that message processing is single threaded - the b-msghand thread. I agree it could change in the future.

    I left the comment, because if the work is done in a different thread, this feature will silently break. I guess that is fine if the work is minimal, or code could be added to do the additional accounting if the work is substantial.

  45. vasild commented at 12:29 pm on February 5, 2025: contributor
    @maflcko yes, I agree. Future changes that make message processing multi-threaded will have to take this into account.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-02-22 21:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me