[IBD] Raspberry Pi: 90% CPU time for 1.5% of block processing #32832

issue l0rinc opened this issue on June 30, 2025
  1. l0rinc commented at 9:37 am on June 30, 2025: contributor

    There are numerous reports about how slow syncing on cheaper nodes (e.g. Raspberry Pi) is, a few recent ones from different platforms:

    I’m running a Raspberry Pi 5 with 8GB of RAM and a 4TB SSD on RPi’s Debian Lite Bookworm. This is a fresh install of v29 and after more than two weeks I’m only at blockheight of 829564 of 901531.

    IBD now, I am sitting at 7 days and only 78% synced. The progress has whittled down to only ~1-2% per day. At this rate I think it will take me another 2-3 weeks to finish initial sync. […] I’m running bitcoin core 29.0 on 64bit Raspberry Pi OS (debian bookworm)

    have a raspberry pi 5 and the sync has been running for over a week and is only at 82%

    i have setup a raspberry pi 5 with umbrel and installed the bitcoin node and it is downloading the blockchain for more then 1 and half week. Its very very slow when it reached 91%.

    I have been chipping away since Thanksgiving trying to just get in sync with the blockchain.

    download is VERY slow on my setup, like 1% a day

    Something happened in the past 24 months that makes syncing on typical hard drives impossibly slow.

    Every time I sync a full node on certain Thinkpads, there’s a distinct step-function change in CPU usage.


    As part of #32043 I’m measuring IBD and -reindex performance regularly. Displaying the resulting stacks as flame graphs indicates which areas of the code take the most time. (In these graphs the callers are at the bottom and the final method is at the top; parent methods are deduplicated, so the same leaf can appear several times, once per call path.)

    On high-end devices we see that multithreaded script validation is a minor part of IBD (partially because of more threads, but the total is also a lot less dominant):

    Image

    Recently I got access to several lower-end Raspberry Pi 4 devices. IBD did indeed take very long, and the resulting flame graphs are quite surprising:

    Image

    Even the hidden ConnectTip looks quite different…

    Image

    even though it seems we do have SHANI optimizations, which I first suspected as a possible culprit:

    Image

    (I’ll add the details of the server later, but currently my Pi servers are frozen, they’re very sensitive creatures)


    An inverted flame graph showing self time (where the final methods are at the top and their different callers are below; the final methods are deduplicated, so the parents are duplicated) is also very revealing:

    Hetzner i7/i9 server:

    Corresponding to:

    === Top 20 functions by self-time ===
    24,367,833,683,883 (12.53%) Lloop1_30
    16,614,495,487,954 ( 8.54%) secp256k1_fe_mul_inner
     9,671,690,320,529 ( 4.97%) std::_Hashtable<COutPoint, std::pair<COutPoint const, CCoinsCacheEntry>, Pool...
     8,048,128,525,665 ( 4.14%) Lloop2_30
     7,479,154,097,547 ( 3.85%) secp256k1_gej_double
     4,875,602,514,443 ( 2.51%) SipHashUint256Extra
     4,848,058,903,029 ( 2.49%) leveldb::(anonymous namespace)::BloomFilterPolicy::CreateFilter
     4,794,743,071,310 ( 2.47%) __memcmp_avx2_movbe
     4,683,317,778,516 ( 2.41%) __memmove_avx_unaligned_erms
     4,469,407,701,360 ( 2.30%) _int_malloc
     3,409,056,546,109 ( 1.75%) CSHA256::Write
     3,135,255,483,667 ( 1.61%) leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::FindGreater...
     2,957,747,949,622 ( 1.52%) secp256k1_fe_sqrt
     2,473,661,676,032 ( 1.27%) leveldb::BlockBuilder::Add
     2,432,229,937,337 ( 1.25%) AutoFile::write_buffer
     2,309,378,132,280 ( 1.19%) _int_free
     2,272,582,522,437 ( 1.17%) malloc
     2,140,700,000,760 ( 1.10%) leveldb::MemTable::KeyComparator::operator
     2,088,998,156,937 ( 1.07%) AutoFile::detail_fread
     2,033,602,002,215 ( 1.05%) leveldb::InternalKeyComparator::Compare
    

    Pi - showing very heavy signature validation:

    I.e. almost 90% of the time is spent in signature validation, even though only 1.5% of the blocks required it ((900000-886157)/900000=1.5%).

    === Top 20 functions by self-time ===
    45,200,219,094,553 (49.07%) secp256k1_fe_mul_inner
    17,059,129,066,358 (18.52%) secp256k1_gej_double
     4,958,246,221,561 ( 5.38%) secp256k1_fe_sqrt
     4,537,566,553,791 ( 4.93%) secp256k1_gej_add_ge_var
     2,679,543,872,014 ( 2.91%) secp256k1_fe_sqr_inner
     1,776,534,490,509 ( 1.93%) CSHA256::Write
     1,423,064,476,505 ( 1.54%) secp256k1_ecmult_strauss_wnaf.constprop.0
     1,406,501,260,817 ( 1.53%) sha256_arm_shani::Transform
     1,160,517,562,371 ( 1.26%) __memcpy_generic
     1,087,798,904,524 ( 1.18%) secp256k1_ge_from_storage
       967,511,395,459 ( 1.05%) secp256k1_gej_add_zinv_var
       837,850,817,824 ( 0.91%) secp256k1_modinv64_var
       613,159,858,561 ( 0.67%) cfree@GLIBC_2.17
       490,119,900,000 ( 0.53%) secp256k1_ecmult_wnaf.constprop.0.isra.0
       474,344,476,011 ( 0.51%) uint256 SignatureHash<CTransaction>
       462,336,566,142 ( 0.50%) SignatureCache::Get
       455,625,547,742 ( 0.49%) GetScriptOp
       393,204,249,970 ( 0.43%) MurmurHash3
       387,212,664,603 ( 0.42%) CRollingBloomFilter::insert
       375,980,794,732 ( 0.41%) secp256k1_modinv64_update_de_62.isra.0
    

    (it’s possible I ran the two measurements with slightly different configs since it took weeks to convince the Pi not to overheat or OOM in the middle of measurements. I think the findings are still enough to continue the investigation)


    Zooming in on the signature verification doesn’t immediately reveal any obvious reason for the huge performance difference:

    Hetzner:

    Pi:


    I’ll investigate whether there’s any room for obvious optimizations for Pi in https://github.com/bitcoin-core/secp256k1 - any other questions or suggestions are welcome.

  2. maflcko commented at 10:35 am on June 30, 2025: member

    ((900000-886157)/900000=1.5%)

    Could clarify in the title or in the text that this is for blocks after assumevalid?

  3. maflcko added the label Resource usage on Jun 30, 2025
  4. l0rinc commented at 1:26 pm on June 30, 2025: contributor

    Could clarify in the title or in the text that this is for blocks after assumevalid?

    Did I miscalculate it? Since the benchmark was running until 900k blocks of which only 13843 blocks needed script validation (unless I ran into #31494), isn’t that 13843/900000=0.0153811111 i.e. 1.5% of the blocks had any script validation, yet almost 90% of the time was spent there?
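The arithmetic in this comment can be reproduced with a short C sketch. It is only a back-of-the-envelope illustration: the function names are hypothetical, the 0.9 time share is the "almost 90%" figure from the issue text, and the per-block ratio ignores that blocks differ in size and transaction count.

```c
#include <assert.h>

/* Fraction of blocks up to tip_height that lie above the assumevalid height,
 * i.e. the blocks that need script validation. */
double validated_block_fraction(long tip_height, long assumevalid_height) {
    assert(tip_height > 0);
    assert(assumevalid_height >= 0 && assumevalid_height <= tip_height);
    return (double)(tip_height - assumevalid_height) / (double)tip_height;
}

/* Rough average cost of a validated block relative to a non-validated one,
 * given the share of blocks and the share of total time they account for. */
double per_block_cost_ratio(double block_fraction, double time_share) {
    assert(block_fraction > 0.0 && block_fraction < 1.0);
    assert(time_share > 0.0 && time_share < 1.0);
    double validated_cost = time_share / block_fraction;
    double other_cost = (1.0 - time_share) / (1.0 - block_fraction);
    return validated_cost / other_cost;
}
```

With the numbers from the thread, validated_block_fraction(900000, 886157) is about 0.0154, and per_block_cost_ratio with that fraction and a 0.9 time share comes out in the hundreds.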

  5. achow101 referenced this in commit 97593c1fd3 on Aug 15, 2025
  6. Raimo33 commented at 9:01 pm on September 8, 2025: none

    I think I found the root cause, not tested, just speculating:

    New Raspberry devices (Pi 3, Pi 4, etc.) are 64-bit CPUs. They can also be set up to run in 32bit mode, but I assume both the above benchmarks and issues used them in the default 64bit mode.

    The incriminated functions, secp256k1_fe_mul_inner and secp256k1_fe_sqr_inner, are notably the most expensive operations of libsecp256k1…

    libsecp256k1 has various versions of these functions, compiled selectively based on the target architecture. Specifically:

    32bit x86 -> field_10x26_impl.h: unoptimized, baseline, not really useful (most 32bit machines are arm)

    32bit arm -> field_10x26_impl_arm.s: optimized assembly path, useful in older Raspberry devices (Pi 1, Pi 2)

    64bit x86/arm -> field_5x52_impl.h: this is our case study.

    On 64bit archs, field_5x52_impl.h will use 128bit integers (secp256k1_uint128) regardless of whether 128bit registers are supported or not. While most modern machines support 128bit registers natively, with uint128_t being an actual available type, Raspberry Pi doesn’t have them. Therefore 64x64 bit multiplications, which happen 30+ times per single secp256k1_fe_mul_inner call and 20+ times in secp256k1_fe_sqr_inner, have to be emulated using 64bit registers. I believe this is the main bottleneck!

    Emulating 128bit multiplications with 64bit registers can be 2-5x slower, or maybe more, judging by the width difference of the flamegraph stacks between Pi and i7. Furthermore, it’s not only multiplication: inside secp256k1_fe_mul_inner and secp256k1_fe_sqr_inner there are plenty of shifts, accumulations, additions…
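To make "emulating" concrete, here is a minimal C sketch of a 64x64->128 multiplication built from four 32x32->64 partial products, next to the same operation written with the compiler's 128-bit type. This is not libsecp256k1 code: u128_parts, mul_wide_portable, mul_wide_native and check_mul_wide are hypothetical names, and the thread does not establish which path a 64-bit Pi build actually takes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value as two 64-bit halves. */
typedef struct { uint64_t hi, lo; } u128_parts;

/* Portable 64x64->128 multiply from four 32x32->64 partial products. */
u128_parts mul_wide_portable(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t ll = a_lo * b_lo;
    uint64_t lh = a_lo * b_hi;
    uint64_t hl = a_hi * b_lo;
    uint64_t hh = a_hi * b_hi;

    /* Sum of three values below 2^32, so it cannot overflow 64 bits. */
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

    u128_parts r;
    r.lo = (mid << 32) | (uint32_t)ll;
    r.hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return r;
}

#if defined(__SIZEOF_INT128__)
/* GCC/Clang extension; the compiler picks the instructions. */
__extension__ typedef unsigned __int128 u128;

u128_parts mul_wide_native(uint64_t a, uint64_t b) {
    u128 p = (u128)a * b;
    u128_parts r = { (uint64_t)(p >> 64), (uint64_t)p };
    return r;
}
#endif

/* Edge-case self-check: (2^64 - 1)^2 = 2^128 - 2^65 + 1. */
void check_mul_wide(void) {
    u128_parts r = mul_wide_portable(UINT64_MAX, UINT64_MAX);
    assert(r.hi == UINT64_MAX - 1 && r.lo == 1);
#if defined(__SIZEOF_INT128__)
    u128_parts n = mul_wide_native(0x123456789ABCDEF0ULL, 0x0FEDCBA987654321ULL);
    u128_parts p = mul_wide_portable(0x123456789ABCDEF0ULL, 0x0FEDCBA987654321ULL);
    assert(n.hi == p.hi && n.lo == p.lo);
#endif
}
```

The portable version needs four multiplications plus several shifts and additions per product, which is the kind of overhead the comment above is speculating about.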

    I wouldn’t be surprised if the 32bit version Pi would perform better than the 64bit one, just because of the specific assembly implementation (field_10x26_impl_arm.s) that would be compiled rather than the emulated 128bit functions.

    Something happened in the past 24 months that makes syncing on typical hard drives impossibly slow.

    This user confirms he was using a machine without native 128bit registers

  7. l0rinc commented at 9:10 pm on September 8, 2025: contributor
    Thanks for the hint. I’m planning on investigating in more detail, but I have noticed that my intuitions were often off, so I try not to speculate anymore. That’s why it would help if we could back these up with concrete measurements before we attempt a fix.
  8. sipa commented at 9:15 pm on September 8, 2025: member

    Is this before or after the assumevalid point? If before, the secp256k1 operations won’t even be used.

    And if you are on a 64-bit ARM system, you absolutely want to use the 64-bit secp256k1 field implementations. They perform 4x fewer multiplications than the 32-bit ones. The asm optimizations compensate somewhat for that by better pipelining, but (1) can’t overcome the fact that you just need far more arithmetic operations on 32-bit, and (2) the asm optimizations aren’t even enabled by default in Bitcoin Core builds.
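The "4x fewer multiplications" figure is consistent with a simple limb count, sketched below in C. The sketch assumes plain schoolbook multiplication and counts only limb-by-limb products; it ignores the extra multiplications used for modular reduction. The function names are hypothetical; the limb counts (5 limbs of 52 bits, 10 limbs of 26 bits) follow from the 5x52 and 10x26 file names.

```c
#include <assert.h>

/* Limb-by-limb products in a schoolbook multiplication of two
 * numbers that each have `limbs` limbs. */
unsigned schoolbook_products(unsigned limbs) {
    return limbs * limbs;
}

/* How many times more products the narrow-limb representation needs. */
unsigned product_ratio(unsigned limbs_narrow, unsigned limbs_wide) {
    assert(limbs_wide > 0);
    return schoolbook_products(limbs_narrow) / schoolbook_products(limbs_wide);
}
```

Under these assumptions, 10 limbs give 100 products and 5 limbs give 25, so product_ratio(10, 5) is 4.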

  9. Raimo33 commented at 9:16 pm on September 8, 2025: none

    the asm optimizations aren’t even enabled by default in Bitcoin Core builds.

    Is there a specific reason for this?

  10. sipa commented at 9:19 pm on September 8, 2025: member
    Historically, because they were new and unreviewed, and after that I guess we forgot about them (sorry, @laanwj …).
  11. Raimo33 commented at 9:19 pm on September 8, 2025: none
    @l0rinc if I were you I would set up a simple benchmark to test how many cycles it takes to execute a single 64x64->128 multiplication, and compare that with your i7.
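A micro-benchmark along these lines might look like the following C sketch. It is an assumption-laden illustration, not a measurement from this thread: ns_per_wide_mul and bench_sink are hypothetical names, it relies on the GCC/Clang unsigned __int128 extension, and it reports nanoseconds of CPU time rather than cycles (multiply by the clock frequency in GHz for an approximate cycle count). Each iteration depends on the previous one, so it measures the latency of a multiply plus an XOR and an OR, not multiplication throughput.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* GCC/Clang extension for a 128-bit integer type. */
__extension__ typedef unsigned __int128 u128;

/* Written once after the loop so the compiler cannot drop the work. */
volatile uint64_t bench_sink;

/* Average CPU nanoseconds per iteration of a dependent chain of
 * 64x64->128 multiplications. */
double ns_per_wide_mul(uint64_t iters) {
    assert(iters > 0);
    uint64_t x = 0x9E3779B97F4A7C15ULL;

    clock_t start = clock();
    for (uint64_t i = 0; i < iters; i++) {
        u128 p = (u128)x * (x ^ i);
        /* Fold both halves back into x; OR with 1 keeps it nonzero. */
        x = ((uint64_t)p ^ (uint64_t)(p >> 64)) | 1;
    }
    clock_t end = clock();

    bench_sink = x;
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}
```

Calling ns_per_wide_mul with a large iteration count (for example 100 million) on both machines, built with the same compiler flags, would give a first comparison. A full signature verification benchmark remains the more representative test.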
  12. sipa commented at 9:21 pm on September 8, 2025: member
    Or run the libsecp256k1 benchmarks on both to see how fast signature verification is on both systems.
  13. Raimo33 referenced this in commit b51190ca76 on Sep 12, 2025
  14. Raimo33 referenced this in commit 9df8c9d150 on Sep 12, 2025
  15. Raimo33 referenced this in commit 40e75eaae8 on Sep 12, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-09-19 15:13 UTC
