Cache 26% more coins: Reduce CCoinsMap::value_type from 96 to 76 bytes #17060

pull martinus wants to merge 3 commits into bitcoin:master from martinus:2019-09-more-compact-Coin changing 10 files +199 −61
  1. martinus commented at 11:07 am on October 5, 2019: contributor

    This is an attempt to more tightly pack the data inside CCoinsMap. (replaces #16970, done after my investigations on #16957) Currently, there is quite a bit of overhead involved in the CCoinsMap::value_type data structure (total bytes to the left):

     096 CCoinsMap::value_type
     1    36 COutPoint
     2        32 uint256
     3         4 uint32_t
     4     4 >>> PADDING <<<
     5    56 CCoinsCacheEntry
     6        48 Coin
     7            40 CTxOut
     8                8 nValue
     9                32 CScript
    10             4 fCoinBase & nHeight
    11             4 >>> PADDING <<<
    12         1 flags (dirty & fresh)
    13         7 >>> PADDING <<< 
    

    So there is quite a bit of padding data. In my experiements I’ve noticed that the compiler is forced to use a padding size >=8 only because nValue’s CAmount type is int64_t which has to be 8-byte aligned. When replacing nValue with a 4-byte aligned data structure, the whole CCoinsMap::value_type will be aligned to 4 bytes, reducing padding.

    Another 4 bytes can be saved by refactoring prevector to only use a single byte for direct size information, and reducing CScript’s size from 28 to 27 bytes. It is still able to directly cache most scripts as most are <= 25 bytes long.

    The remaining 4 bytes due to the 1 byte flag + 3 bytes padding in CCoinsCacheEntry can be removed by moving the flags into Coin, stealing another 2 bits from nHeight.

    Finally, the resulting data structure is 20 bytes smaller:

    076 CCoinsMap::value_type
    1    36 COutPoint
    2        32 uint256
    3         4 uint32_t
    4    40 CCoinsCacheEntry
    5        40 Coin
    6            36 CTxOut
    7                 8 nValue
    8                28 CScript
    9            4 fCoinBase & nHeight & flags (dirty & fresh)
    

    So we can store about 26% more data into dbcache’s memory. I have evalued this on my Intel i7-8700, 32GB RAM, external SSD, for -reindex-chainstate with both -dbcache=500 and -dbcache=5000:

    out2

        time max resident set size
    master -dbcache=500 05:57:20 2957532
    2019-09-more-compact-Coin -dbcache=500 05:33:37 2919312
    Improvement   6,64% 1,29%
        time max resident set size
    master -dbcache=5000 04:22:42 7072612
    2019-09-more-compact-Coin -dbcache=5000 04:09:16 6533132
    Improvement   5,11% 7,63%

    So on my machine there is definitive an improvement, but honestly I am not sure if this ~6% improvement is worth the change. Maybe the improvement is bigger with slower disk, as ~26% more transaction can be cached.

  2. fanquake added the label Resource usage on Oct 5, 2019
  3. laanwj added the label Validation on Oct 5, 2019
  4. laanwj added the label UTXO Db and Indexes on Oct 5, 2019
  5. DrahtBot commented at 11:54 am on October 5, 2019: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #18113 ([coins] Don’t allow a coin to spent and FRESH. Improve commenting. by jnewbery)
    • #18087 (Get rid of VARINT default argument by sipa)
    • #18000 ([WIP] Coin Statistics Index by fjahr)
    • #17708 (prevector: avoid misaligned member accesses by ajtowns)
    • #17487 (coins: allow write to disk without cache drop by jamesob)
    • #9384 (CCoinsViewCache code cleanup & deduplication by ryanofsky)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  6. Sjors commented at 1:51 pm on October 6, 2019: member

    Would be nice to test this on a Raspberry-like device. They tend to have low memory and slow storage. Unfortunately they generally also require pruning, and the way pruning currently works we flush the coin cache.

    For example I have an Orange Pi chipping away at IBD this week; although it has 2 GB of RAM, because I’m pruning it to 5 GB, it never uses more than 230 MB.

  7. MarcoFalke commented at 5:05 pm on October 7, 2019: member

    For example I have an Orange Pi chipping away at IBD this week; although it has 2 GB of RAM, because I’m pruning it to 5 GB, it never uses more than 230 MB.

    A 512 GB micro SD card should do the job, no?

  8. fanquake commented at 3:38 pm on October 8, 2019: member
    @TheBlueMatt You might be interested here given you commented in #16970.
  9. in src/primitives/transaction.h:133 in d69050ecec outdated
    126@@ -127,13 +127,72 @@ class CTxIn
    127     std::string ToString() const;
    128 };
    129 
    130+
    131+/**
    132+ * CAmount is an int64_t, which requires alignment of 8. This wrapper stores the data
    133+ * in a memory array and packs/unpacks it whenever needed, regarldess of alignment.
    


    practicalswift commented at 9:31 pm on October 8, 2019:
    Should be “regardless” :)
  10. in src/primitives/transaction.h:143 in d69050ecec outdated
    138+class PackableCAmount
    139+{
    140+    uint8_t m_data[sizeof(CAmount)];
    141+
    142+public:
    143+    PackableCAmount(CAmount val) noexcept
    


    practicalswift commented at 9:31 pm on October 8, 2019:
    Should be explicit? :)

    martinus commented at 2:37 pm on October 9, 2019:

    It should, I can add that easily by adding the assignment operator for CAmount.

    Also theoretically the operator CAmount() should be explicit too, but changing that would mean that I’d have to change practically all usages of this class…

  11. laanwj commented at 10:03 am on October 9, 2019: member

    A 512 GB micro SD card should do the job, no?

    Yep, works fine.

  12. in src/primitives/transaction.h:135 in d69050ecec outdated
    126@@ -127,13 +127,72 @@ class CTxIn
    127     std::string ToString() const;
    128 };
    129 
    130+
    131+/**
    132+ * CAmount is an int64_t, which requires alignment of 8. This wrapper stores the data
    133+ * in a memory array and packs/unpacks it whenever needed, regarldess of alignment.
    134+ *
    135+ * Since CAmount's size is know, the compiler should be able to optimize the std::memcpy away
    


    GChuf commented at 1:59 pm on October 9, 2019:
    Should be “known”. … And, uh, is that “away” or “anyway”?

    martinus commented at 2:52 pm on October 9, 2019:
    fixed in 03cb535384eb6d2f4284524c33ab03371bd263e2, also it should be “optimized away” because the compiler can the function call away. See this: https://godbolt.org/z/6uP9_S
  13. fanquake added the label Needs Conceptual Review on Oct 9, 2019
  14. fanquake commented at 2:03 pm on October 9, 2019: member
    Please lets keep the discussion high level before pointing out any typos.
  15. promag commented at 7:00 am on October 10, 2019: member

    Concept ACK. Didn’t verified the claim but 26% sounds a good improvement to me.

    AFAICT dirty and fresh are almost read and written at the same time, maybe use a 2 bit flag? Or is compiler doing that anyway?

  16. laanwj commented at 7:41 am on October 12, 2019: member
    @promag that’s what the bitfield syntax unsigned int dirty : 1; does, it makes the flags take up the minimum possible space. Or maybe I’m misunderstanding what you’re trying to accomplish.
  17. GChuf commented at 1:26 pm on October 12, 2019: contributor
    Would like to test this on my VM using a HDD. I’m not sure how to benchmark this - I’ve tried https://github.com/chaincodelabs/bitcoinperf but haven’t succeeded. Any info/help would be appreciated.
  18. MarcoFalke commented at 1:30 pm on October 12, 2019: member
    bitcoinperf is a bit messy to set up. I think it requires docker. Is there any specific issue you ran into? I guess it might be better to report such upstream at https://github.com/chaincodelabs/bitcoinperf to not bloat this thread
  19. martinus commented at 1:46 pm on October 12, 2019: contributor

    Would like to test this on my VM using a HDD. I’m not sure how to benchmark this - I’ve tried https://github.com/chaincodelabs/bitcoinperf but haven’t succeeded. Any info/help would be appreciated.

    In my benchmarks I deleted debug.log, ran /usr/bin/time -v ./bitcoind -reindex-chainstate -stopatheight=594000 -printtoconsole=0 -dbcache=500, then used a hand written script to convert the debug.log into gnuplot-readable data, then used gnuplot to create the graphs.

    It would be awesome to have some kind of logger in bitcoind that can periodically log statistics into a file in an easily parseable data.

  20. GChuf commented at 1:52 pm on October 12, 2019: contributor

    There were a couple issues. Could you recommend anything else to test -reindex-chainstate? Should I simply run bitcoind and have linux count time … ? Or maybe I’ll look around things mentioned in this stackexchange answer.

    p.s. thanks for the info @martinus. I’ll figure something out with what you did and those tools mentioned in the link.

  21. promag commented at 2:24 pm on October 12, 2019: member

    @laanwj I mean stuff like this:

    0        auto oldDirty = coin.dirty;
    1        auto oldFresh = coin.fresh;
    

    could be

    0        auto old_cache = coin.cache;
    

    where

    0    unsigned int cache : 2;   // DIRTY = 1,  FRESH = 2
    
  22. GChuf commented at 5:55 pm on October 15, 2019: contributor

    The debug.log from first benchmark (master) got overwritten by the second benchmark, and I didn’t make a script to do a graph from debug.log yet anyway, so I’m just gonna post the /usr/bin/time -v results. The improvement seems remarkable! I tested this on Ubuntu 16 VM running on a Seagate SSHD (also called bybrid) disk. It seems however the process used more CPU - not sure what’s going on with those percentages though. Must be a VM bug.

    First Header Second Header time max resident set size
    2019-09-more-compact-Coin dbcache=500 2:20:09 1673108
    Master dbcache=500 3:25:46 1729412
    Improvement 31,9% 3,26%
     0
     1	Command being timed: "./bitcoindmaster -reindex-chainstate -stopatheight=325000 -printtoconsole=0 -dbcache=500"
     2	User time (seconds): 19760.08
     3	System time (seconds): 1031.02
     4	Percent of CPU this job got: 168%
     5	Elapsed (wall clock) time (h:mm:ss or m:ss): 3:25:46
     6	Average shared text size (kbytes): 0
     7	Average unshared data size (kbytes): 0
     8	Average stack size (kbytes): 0
     9	Average total size (kbytes): 0
    10	Maximum resident set size (kbytes): 1729412
    11	Average resident set size (kbytes): 0
    12	Major (requiring I/O) page faults: 0
    13	Minor (reclaiming a frame) page faults: 2088592
    14	Voluntary context switches: 38379113
    15	Involuntary context switches: 2185645
    16	Swaps: 0
    17	File system inputs: 0
    18	File system outputs: 0
    19	Socket messages sent: 0
    20	Socket messages received: 0
    21	Signals delivered: 0
    22	Page size (bytes): 4096
    23	Exit status: 0
    
     0
     1	Command being timed: "./bitcoindcoin -reindex-chainstate -stopatheight=325000 -printtoconsole=0 -dbcache=500"
     2	User time (seconds): 19004.87
     3	System time (seconds): 598.24
     4	Percent of CPU this job got: 233%
     5	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:20:09
     6	Average shared text size (kbytes): 0
     7	Average unshared data size (kbytes): 0
     8	Average stack size (kbytes): 0
     9	Average total size (kbytes): 0
    10	Maximum resident set size (kbytes): 1673108
    11	Average resident set size (kbytes): 0
    12	Major (requiring I/O) page faults: 0
    13	Minor (reclaiming a frame) page faults: 1737560
    14	Voluntary context switches: 17203875
    15	Involuntary context switches: 2926137
    16	Swaps: 0
    17	File system inputs: 0
    18	File system outputs: 0
    19	Socket messages sent: 0
    20	Socket messages received: 0
    21	Signals delivered: 0
    22	Page size (bytes): 4096
    23	Exit status: 0
    
  23. martinus commented at 7:05 pm on October 15, 2019: contributor

    Thanks @GChuf for testing this! This are really nice results.

    It seems however the process used more CPU

    I don’t think so:

    • master had 19760.08 sec user + 1031.02 sec system = 20791.1 sec
    • 2019-09-more-compact-Coin had 19004.87 sec user + 598.24 sec system = 19603,11

    So the branch used ~6% less CPU. Note that the percentage 233% is much higher than for master because it used that CPU in a much shorter timeframe.

    The percentage is calculated 100*(user + system time) / elapsed. 233 percent means that on average 2.33 cores were running for this job.

  24. GChuf commented at 5:40 pm on October 16, 2019: contributor

    The percentage is calculated 100*(user + system time) / elapsed. 233 percent means that on average 2.33 cores were running for this job.

    Thanks for the explaination! The math works :) If more testing is needed I’d be glad to help, but I think HDDs will show similar results to my SSHD and it’s clear that the improvements are real.

  25. jamesob commented at 5:48 am on October 17, 2019: member
    Concept ACK from me, seems like some great savings here. Though it’ll be crucial to test this thoroughly across platforms. Will review in depth soon.
  26. GChuf commented at 5:11 pm on October 17, 2019: contributor
    I can also test it on windows. Does anyone have the windows executable for this already, to make my life easier? @martinus maybe?
  27. sipa commented at 5:35 pm on October 17, 2019: member

    These are pretty impressive gains, so concept ACK.

    Some overall comments:

    • Since this is making the boundary between Coin and CCoinsCacheEntry a bit less clear, maybe it would be useful to either have a Coin::AssignWithoutFlags function, or even make Coin::operator= not touch the flags. That would avoid all the oldX = ...; ...; ... = oldX code, and make review easier.
    • The coding style guidelines for variables say to use snake_case in new code.

    I do want to review the prevector changes in more detail.

  28. MarcoFalke added the label Needs gitian build on Oct 17, 2019
  29. martinus commented at 5:43 pm on October 17, 2019: contributor

    I can also test it on windows. Does anyone have the windows executable for this already, to make my life easier? @martinus maybe?

    Sorry, I build only in Linux, never tried to build in Windows

    Since this is making the boundary between Coin and CCoinsCacheEntry a bit less clear, maybe it would be useful to either have a Coin::AssignWithoutFlags function, or even make Coin::operator= not touch the flags. That would avoid all the oldX = ...; ...; ... = oldX code, and make review easier.

    I’m also unsure if it was everywhere necessary to backup and then restore the previous flags. Moving the flags into Coin only saves us 4bytes. I could also remove that change and do this in a separate PR.

  30. MarcoFalke commented at 6:21 pm on October 17, 2019: member
    @DrahtBot might have a windows build ready tomorrow
  31. dongcarl commented at 7:20 pm on October 17, 2019: member
    I have Gitian-built windows binaries of 03cb535384eb6d2f4284524c33ab03371bd263e2 here for those who are keen (@GChuf): https://send.firefox.com/download/c78ac103a4dabe86/#S70uxKCUua-baEeR0v029g
  32. GChuf commented at 8:15 pm on October 17, 2019: contributor
    Thanks @MarcoFalke & @dongcarl, will test on Win10 tomorrow/this weekend.
  33. wtogami commented at 0:28 am on October 18, 2019: contributor

    A 512 GB micro SD card should do the job, no?

    You’ll find sdcards larger than 32GB to be rather terrible for anything aside from sequential write (4K video cameras). The main issue is the “erase block size” which is the minimum size of any erase. Every time you write anything to disk it could rewrite that size. 32GB is the smallest sdcard I’ve seen with 4MB erase block size. I’ve seen 128GB sdcard with 16MB. I’m guessing bigger has even larger erase blocks.

    There are possible changes that could be done to Core to make it more flash friendly but it is not worthwhile as performance of that type of flash is abysmal while SATA/NVMe SSD controllers don’t really need such optimizations.

  34. jamesob commented at 2:07 am on October 18, 2019: member

    Hm, getting the opposite of what I would expect in terms of benchmark results. For /usr/bin/time -v ./src/bitcoind -reindex-chainstate -stopatheight=550000 -dbcache=4000 -connect=0 (run using a lightly modified version of this script) across a number of different machines, I’m seeing this branch as reliably slower than master:

     0bench-ssd-2  master                               4:05:52    0.45  6283.55MB 204%
     1bench-ssd-2  martinus/2019-09-more-compact-Coin   9:06:48    1.00  5927.98MB 347%
     2
     3bench-ssd-3  martinus/2019-09-more-compact-Coin   9:07:05    1.00  5914.48MB 346%
     4bench-ssd-3  master                               2:59:52    0.33  6278.71MB 77%
     5
     6bench-ssd-4  master                               2:33:47    0.28  6257.46MB 89%
     7bench-ssd-4  martinus/2019-09-more-compact-Coin   9:07:19    1.00  5834.16MB 346%
     8
     9bench-ssd-5  martinus/2019-09-more-compact-Coin   9:03:16    1.00  5963.10MB 349%
    10bench-ssd-5  master                               3:58:48    0.44  6252.48MB 209%
    11
    12bench-hdd-4  master                               6:22:47    0.64  6263.75MB 36%
    13bench-hdd-4  martinus/2019-09-more-compact-Coin   10:02:42   1.00  5908.06MB 314%
    14
    15bench-hdd-5  martinus/2019-09-more-compact-Coin   9:52:42    1.00  5889.18MB 320%
    16bench-hdd-5  master                               6:54:05    0.70  6240.25MB 120%
    

    Sample machine specs:

    0Hostname:            bench-ssd-2
    1Kernel:              Linux 4.9.0-8-amd64
    2OS:                  Debian GNU/Linux 9
    3RAM (GB):            7.71
    4Architecture:          x86_64
    5CPU(s):                4
    6Thread(s) per core:    1
    7Core(s) per socket:    4
    8Model name:            Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz
    

    Can anyone think of an explanation for this?


    Also, FWIW, in your original PR description your graph and table show master as being faster than this branch but when I read it initially I’d assumed it was just a labeling error.

  35. martinus commented at 4:03 pm on October 18, 2019: contributor

    Can anyone think of an explanation for this?

    I really can’t explain why the diffference is so huge. Maybe it has something to do with the Xeon CPU? Or there is still a bug in the code that manifests on that CPU. Do you have the debug.log files, or graphs of the progress? Maybe there it’s visible at which point the difference occurs.

    I’m currently running a comparison on a very slow Intel Celeron N3050, 2GB of RAM, -dbcache=500, and external SSD. So far it seems the results are very close

  36. DrahtBot commented at 4:45 am on October 19, 2019: member

    Gitian builds for commit ec3ed5a4487886f1c2a35fda0a3289be7b280248 (master):

    Gitian builds for commit c03bdee106a5f9743d373955e7ae63c244a9d14f (master and this pull):

  37. DrahtBot removed the label Needs gitian build on Oct 19, 2019
  38. GChuf commented at 3:55 pm on October 19, 2019: contributor
    @jamesob please post the debug logs, some time ago I was looking at your results at #16801 which seemed weird as well, and debug logs could shed some much needed light into this. As for Xeon CPUs, I’ve got a Xeon E3 1240 v2. The previous test on ubuntu VM was run with this CPU and I’m gonna run windows benchmarks on it as well and post the results here. Anyway I doubt xeon (or any other) intel CPUs are the problem, thought it’s possible.
  39. GChuf commented at 8:56 am on October 20, 2019: contributor

    Much more modest improvements when running this on a physical Windows machine (~2% improvement in time). Same hardware was used as in my previous benchmark. I might have seen better improvements if I used an old HDD, but I just wanted to make sure the improvements are there on windows as well.

    master: 02:59:21 2019-09-more-compact-Coin: 02:56:05

    figure_1

  40. martinus commented at 12:31 pm on October 20, 2019: contributor

    @GChuf , did you run the windows benchmark on the same hardware where you got the 31% improvement in Linux?

    My benchmark with Intel Celeron N3050 has finished, where the branch is about of 1.4% faster. It seems that the CPU is definitely the limiting factor here - each run took over 6 days

    out2

  41. GChuf commented at 1:38 pm on October 20, 2019: contributor
    @martinus yes, forgot to mention. Hardware was the same. The difference was that on the first run I was running the benchmark on a VM, which certainly performs worse. I think the difference between physical vs VM is more pronounced with CPUs rather than disks.
  42. JeremyRubin commented at 7:03 pm on October 20, 2019: contributor

    Some notes on other paths to shave off bytes here:

    As an alternative, you could do something like the union below to compact the CAmount for the (evidence: https://eprint.iacr.org/2017/1095.pdf, also by pigeon hole we have 21M coins and 100+M UTXOs so most have to be small) common case of a UTXO smaller than 2**31/100e6 (about 21.47 BTC). This technique saves another 4 bytes. The nice thing about this change is that we don’t have to change the guts of CScript which has implications across a bunch of sensitive areas.

    Were you to combine it with the prevector and CScript changes, you would see the same improvement of 4 bytes, bringing it down to 72 bytes.

    There is also the option to hash the COutpoint, to save another 4 bytes, but I looked into this once and it seemed complicated because some cases we use the key of the pair IIRC. (The same trick as above could be used to shave off 2 bytes as almost all outputs indexes are < 16 bits)

     0union {
     1struct {
     2    bool is_big : 1;
     3    private:
     4    uint32_t amount: 31;
     5} read_flag;
     6struct {
     7    struct {
     8        private:
     9        bool __flag : 1;
    10        public:
    11        uint32_t amount: 31;
    12    } flag_amount;
    13    CScript script;
    14} small_value;
    15struct {
    16    struct {
    17        private:
    18        bool __flag : 1;
    19        public:
    20        uint32_t ignore: 31;
    21    } flag;
    22    std::unique_ptr<CTxOut> ptr;
    23} large_value;
    24}
    
  43. jamesob commented at 6:25 am on October 21, 2019: member

    Just finished another benchmark run on a significantly beefier machine with similar (odd) results. Only the debug.log for this branch was preserved since I started the run a few days ago, but (i) I’m working on some better tooling to parse debug.log files (optionally contingent on #16805) and generate tables, graphs, etc., and (ii) in future runs I’ll make sure I preserve the debug.log files for each branch.

    /usr/bin/time -v ./src/bitcoind -reindex-chainstate -stopatheight=550000 -dbcache=4000 -connect=0

    Edit: compiled with g++ (Debian 8.3.0-6) 8.3.0.

    master

    {'time': '2:56:18', 'cpu_perc': '68%', 'mem_kb': 6389536, 'user_time_secs': 6754.8, 'system_time_secs': 527.61}

    martinus/2019-09-more-compact-Coin (03cb53538)

    {'time': '9:16:02', 'cpu_perc': '635%', 'mem_kb': 5871084, 'user_time_secs': 211292.35, 'system_time_secs': 565.91}

    debug.log (martinus/2019-09-more-compact-Coin)

    https://transfer.sh/ncVfW/jamesob-17060-debug.log

    hwinfo

    0hostname               bench-strong        
    1cpu_model_name         Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
    2ram_gb                 31.35
    3os                     ['Debian GNU/Linux', '10', 'buster']
    4arch                   x86_64              
    5kernel                 4.19.0-5-amd64      
    6read_iops (/tmp)       29.2k
    7write_iops (/tmp)      9694
    8read_iops (/data)      396
    9write_iops (/data)     131
    
  44. martinus commented at 6:55 am on October 21, 2019: contributor

    Thanks for posting the debug.log! That’s already helpful. I was hoping to find some issue where e.g. db sync takes a long time, but this is not the case. Flushing the db cache takes a few minutes, but nothing extraordinary:

    02019-10-18T08:30:54Z UpdateTip: new best=000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f height=0 version=0x00000001 log2_work=32.000022 tx=1 date='2009-01-03T18:15:05Z' progress=0.000000 cache=0.0MiB(0txo)
    12019-10-18T10:10:27Z UpdateTip: new best=00000000000000000d533060763bd64c8b97de7fdbc407bd918cf97ff37ae69a height=372609 version=0x00000003 log2_work=83.29081 tx=82023512 date='2015-09-02T02:22:29Z' progress=0.175964 cache=4275.5MiB(35061266txo)
    22019-10-18T10:17:07Z UpdateTip: new best=0000000000000000056e394d44a7ae5eb021d6171504a7dec0a833d50763fad6 height=372610 version=0x00000003 log2_work=83.290838 tx=82024083 date='2015-09-02T02:26:03Z' progress=0.175965 cache=482.8MiB(0txo)
    32019-10-18T12:15:33Z UpdateTip: new best=00000000000000000168ff4ff0e7a5ddf08c707fe928be19a25a9c90e8444dab height=433063 version=0x20000000 log2_work=85.368569 tx=160962375 date='2016-10-06T00:55:30Z' progress=0.345290 cache=4276.1MiB(35338561txo)
    42019-10-18T12:23:07Z UpdateTip: new best=000000000000000000b7619d54035e1ad58c762daef103369d2c04a8641fb20a height=433064 version=0x20000000 log2_work=85.368598 tx=160962901 date='2016-10-06T00:58:57Z' progress=0.345290 cache=482.8MiB(0txo)
    52019-10-18T14:52:48Z UpdateTip: new best=0000000000000000002a003b9e34d263f117db2562c80171080e31d4db61bd72 height=490207 version=0x20000000 log2_work=87.294564 tx=262721239 date='2017-10-16T23:03:01Z' progress=0.563536 cache=4276.0MiB(35487228txo)
    62019-10-18T15:00:54Z UpdateTip: new best=000000000000000000564f989b2bd03905df412506a20b2a4708a52b1c230787 height=490208 version=0x20000000 log2_work=87.294603 tx=262723684 date='2017-10-16T23:11:33Z' progress=0.563539 cache=482.8MiB(0txo)
    72019-10-18T16:40:48Z UpdateTip: new best=00000000000000000032015afc76ed6c6e3e6fabae75f832e8f3d6f1a79bb2bd height=526412 version=0x20000000 log2_work=88.977034 tx=320908996 date='2018-06-07T10:09:17Z' progress=0.688312 cache=4275.9MiB(35483120txo)
    82019-10-18T16:47:17Z UpdateTip: new best=00000000000000000031989467b9aa7c18b75ccafd0ca3f788d5e413dbb86e4a height=526413 version=0x20000000 log2_work=88.977084 tx=320909164 date='2018-06-07T10:22:49Z' progress=0.688310 cache=482.8MiB(0txo)
    92019-10-18T17:43:44Z UpdateTip: new best=000000000000000000223b7a2298fb1c6c75fb0efc28a4c56853ff4112ec6bc9 height=550000 version=0x20000000 log2_work=90.011241 tx=356588225 date='2018-11-14T02:35:41Z' progress=0.764816 cache=2880.1MiB(22413485txo)
    

    It looks like progress is made quite linearly, so whatever is slowing it down, it seems to happen from start to finish.The really odd thing is the extreme CPU usage in the branch, it’s 31 times higher than on master.

    Could you try to run perf top while syncing on the branch, maybe this gives an indication where so much CPU is spent? or create a flamegraph with hotspot. Then can’t use pyperf system tune though because it severely reduces the number of allowed samples

  45. GChuf commented at 2:53 pm on October 21, 2019: contributor

    @martinus @jamesob looking at the first test results from James, it occured to me that the “problem” must be in the CPU (or at least not in the disks) - benching master on ssd vs hdd turned out to be ~2x faster (as expected), whereas benching this branch took 9h on ssd and 10h on hdd.

    James, you mentioned this in #16801:

    Before anyone asks: these are dedicated benchmarking machines (specs listed below). Before each run I’m dropping all caches (sudo /sbin/swapoff -a; sudo /sbin/sysctl vm.drop_caches=3;), tuning with pyperf (sudo /usr/local/bin/pyperf system tune;), and I’ve ensured that the CPU governors are set to performance.

    Besides what Martin suggested, can you try:

    • not running pyperf & CPU governor
    • using gitian builds above
    • using -printtoconsole=0 (This probably won’t change anything, but since we have no idea what’s going on …)
  46. jamesob commented at 8:21 pm on October 29, 2019: member

    Okay, I’ve finally gotten around to taking perf to the weird bench-strong results. I took measurements at progress=0.20 (roughly height 387,000) by doing

    0$ # configure with -fno-omit-frame-pointer, tune system params to enable perf, make, etc.
    1$ ./src/bitcoind -reindex-chainstate -stopatheight=550000 -dbcache=7000 -printtoconsole=0 -connect=0 &
    2$ while ! tail -n 10000 ~/.bitcoin/debug.log | grep progress=0.2; do sleep 3; done; kill -STOP `pidof bitcoind`
    3[... I wait ...]
    4$ kill -CONT `pidof bitcoind` && sleep 0.4 && perf record -g -F 101 -p `pidof bitcoind` -o perf.$THE_BRANCH.data sleep 90
    

    After running the above for both this branch and master at 6a97e8a060f7632bbaee27d3de8035dc6ebe3895, here’s perf diff perf.master.data perf.martinus.data:

     0# Event 'cycles:ppp'
     1#
     2# Baseline  Delta Abs  Shared Object              Symbol
     3# ........  .........  .........................  ......................................................................................
     4#
     5     2.41%     +3.06%  bitcoind                   [.] SipHashUint256Extra
     6    15.36%     +2.93%  bitcoind                   [.] secp256k1_fe_mul_inner
     7     4.09%     -2.21%  libc-2.28.so               [.] 0x000000000007b544
     8     0.26%     +1.52%  bitcoind                   [.] (anonymous namespace)::ripemd160::rol
     9    14.02%     -1.42%  bitcoind                   [.] secp256k1_fe_sqr_inner
    10     1.99%     -1.40%  bitcoind                   [.] sha256d64_avx2::(anonymous namespace)::Xor
    11     1.48%     -1.32%  bitcoind                   [.] operator new
    12               +1.12%  bitcoind                   [.] prevector<27u, unsigned char, unsigned int, int>::const_iterator::operator!=
    13               +1.10%  bitcoind                   [.] prevector<27u, unsigned char, unsigned int, int>::const_iterator::operator++
    14     0.00%     +1.07%  bitcoind                   [.] ser_writedata32<CSizeComputer>
    15               +0.99%  bitcoind                   [.] prevector<27u, unsigned char, unsigned int, int>::is_direct
    16               +0.96%  bitcoind                   [.] prevector<27u, unsigned char, unsigned int, int>::size
    17     0.03%     +0.92%  bitcoind                   [.] sha256d64_avx2::(anonymous namespace)::Or
    

    I think the percentages are misleading here because when I run sudo perf top --comms=b-loadblk, this is what I see:

     0Samples: 236K of event 'cycles:ppp', 2000 Hz, Event count (approx.): 252595020173
     1Overhead  Shared Object              Symbol
     2   2.79%  bitcoind                   [.] SipHashUint256Extra
     3   1.00%  bitcoind                   [.] base_blob<256u>::GetUint64
     4   0.56%  bitcoind                   [.] std::__detail::_Hash_code_base<COutPoint, std::pair<COutPoint const, CCoinsCacheEntry>, std::__
     5   0.50%  bitcoind                   [.] SaltedOutpointHasher::operator()
     6   0.45%  bitcoind                   [.] prevector<28u, unsigned char, unsigned int, int>::const_iterator::operator*
     7   0.33%  bitcoind                   [.] std::_Hashtable<COutPoint, std::pair<COutPoint const, CCoinsCacheEntry>, std::allocator<std::pa
     8   0.26%  bitcoind                   [.] ser_writedata32<CHashWriter>
     9   0.23%  bitcoind                   [.] std::_Hashtable<COutPoint, std::pair<COutPoint const, CCoinsCacheEntry>, std::allocator<std::pa
    10   0.22%  bitcoind                   [.] std::__uninitialized_default_n_1<false>::__uninit_default_n<CTxOut*, unsigned long>
    11   0.18%  libc-2.28.so               [.] 0x000000000015878a
    12   0.17%  bitcoind                   [.] operator-
    

    In other words, near as I can figure SipHashUint256Extra clocks in as the bottleneck on the loadblk thread, which is where the reindexing happens. Based on the (unreliable?) percentage values, it looks like this branch is spending more than twice the time hashing for map access than master is. Why this would be (or even whether or not this is the actual cause of the difference) is outside of my understanding.

    I can upload the raw perf data files if anyone wants, but to my understanding they aren’t portable - unfortunately perf is not very well documented, and I had trouble even generating flamegraphs on a secondary machine, so I’m not really sure what the best way to make the datafiles portable is.

  47. elichai commented at 9:05 pm on October 29, 2019: contributor

    @jamesob twice the time or twice the percentage? Because I can confirm that my own benchmarks of master from ~2-3 weeks ago showed SipHashUint256Extra very very high on the list (spent a while re-implementing it with SSE instructions just to see it’s actually slower with them lol)

    (i’m usually using the following flags: perf report -g 'graph,0.5,caller' that way it sorts the percentages by caller instead of callee, though you probably know more about perf than I :) )

  48. martinus commented at 9:11 pm on October 29, 2019: contributor
    @jamesob did your master build already contain #16957?
  49. JeremyRubin commented at 9:12 pm on October 29, 2019: contributor

    @martinus yes it did – are you not rebased?

    Worth pointing out that a prevector<28> on master fits perfectly in a 32 byte or 64 byte cache line.

    No longer after this change (which is fine, we can increase to prevector<31> where sensitive to it, like std::vector<prevector<».

    Also, now that CAmounts are unaligned, they too can be extra slow to do something with.

    I doubt that’s what’s going on here, but worth looking at before merge.

  50. martinus commented at 10:51 pm on October 29, 2019: contributor

    @martinus yes it did – are you not rebased?

    I didn’t saw he linked to the master revision, so both master and my branch has #16957 so it’s not that.

    I suspect that there is something strange going on with the threading code. Maybe some timing issue causes lots of caching misses for some reason? I don’t know much about how the checkqueue and surrounding classes work though.

  51. martinus commented at 11:30 pm on October 29, 2019: contributor

    I think I might have been able to reproduce this issue. I’ve tried ./bitcoind -datadir=/run/media/martinus/tera/bitcoin/db -reindex-chainstate -stopatheight=150000 -printtoconsole=0 -connect=0 -dbcache=15000 on master, and on an older version, before #16957. The old version used ~20 seconds of user time, master used 332 seconds. I’ve done git bisect and found this revision as the first bad commit: fa3a7331160d1a460b1c15fca1810e98070d629c

    I don’t know why, but this commit seems to have a dramatic effect on performance.

  52. sipa commented at 11:34 pm on October 29, 2019: member
    @martinus Are you aware that blocks before the assumevalid point don’t get signature/script validated?
  53. MarcoFalke commented at 0:31 am on October 30, 2019: member
    For benchmarking purposes, it might be best to set all nodes compiled from different branches to -noassumevalid or the same block hash.
  54. martinus commented at 5:03 am on October 30, 2019: contributor

    @martinus Are you aware that blocks before the assumevalid point don’t get signature/script validated?

    No, I didn’t know. Also it seems that since I didn’t yet have this block, -reindex-chainstate signature/script validates everything

    It might be helpful to have a list of assumevalid blocks instead of just a single block, to prevent revalidating everything when that specific marker block is not yet there

  55. DrahtBot added the label Needs rebase on Nov 8, 2019
  56. more tightly pack Coin
    This change reduces CCoinMap's value_type from 96 bytes to 80 bytes by
    more tightly packing it's data. This is achieved by these changes:
    
    * Refactored prevector so it uses a single byte to determine its size
      when in direct mode
    * Reduce CScriptBase from 28 to 27 indirect bytes
    * Introduced PackableCAmount to be able to align CTxOut to 4 bytes to
      prevent padding when used as a member in Coin
    
    This tighter packing means more data can be stored in the coinsCache
    before it is full and has to be flushed to disk. In my benchmark,
    -reindex-chainstate was 6% faster and used 6% less memory. The cache
    could fit 14% more txo's before it had to resize.
    1fb877c38b
  57. packed CCoinsCacheEntry flags into Coin
    Removed CCoinsCacheEntry's 1 byte for the flags and put it directly into
    Coin. That way we can get rid of unnecessary padding, which reduces the
    memory requirement for the coinsCache. We steal 2 bits from Coin's
    nHeight, so now there are only 29 bits left. Still, we are save until
    block 2^29-1 = 536870911.
    43328c458e
  58. make PackableCAmount's ctor explicit
    This requires the operator=(CAmount) which I've implemented here as well. Using static_cast for nicer formatting.
    
    Note that it would be cleaner to make the user defined conversion`operator CAmount()` explicit too, but that would mean that practically everywhere it is used we need to add an explicit cast.
    See "C.164: Avoid implicit conversion operators" http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c164-avoid-implicit-conversion-operators
    22d1d944d7
  59. martinus force-pushed on Nov 8, 2019
  60. DrahtBot removed the label Needs rebase on Nov 8, 2019
  61. jamesob commented at 5:09 pm on November 12, 2019: member

    Did some benches over the weekend (bench/master.1 vs. bench/compactcoin.1) with the same setup/reindex-to-550,000 as mentioned in the allocator PR (https://github.com/bitcoin/bitcoin/pull/16801#issuecomment-551883348):

     0host         tag                      time       time% maxmem  cpu%  dbcache
     1bench-ssd-2  master.1                 9:13:54    1.00  5770.10MB 344%  4000MB
     2bench-ssd-2  master.1                 9:15:57    1.00  5773.23MB 343%  4000MB
     3bench-ssd-2  compactcoin.1            9:07:01    0.98  5899.80MB 347%  4000MB
     4bench-ssd-2  compactcoin.1            9:07:41    0.99  5923.72MB 346%  4000MB
     5
     6bench-ssd-3  compactcoin.1            9:06:32    0.98  5959.14MB 347%  4000MB
     7bench-ssd-3  compactcoin.1            9:07:35    0.98  6013.83MB 346%  4000MB
     8bench-ssd-3  master.1                 9:17:00    1.00  5771.56MB 342%  4000MB
     9bench-ssd-3  master.1                 9:17:24    1.00  5773.17MB 342%  4000MB
    10
    11bench-ssd-4  compactcoin.1            9:08:11    0.98  5936.75MB 346%  4000MB
    12bench-ssd-4  compactcoin.1            9:07:46    0.98  5916.50MB 346%  4000MB
    13bench-ssd-4  master.1                 9:13:56    0.99  5770.13MB 344%  4000MB
    14bench-ssd-4  master.1                 9:17:07    1.00  5760.78MB 342%  4000MB
    15
    16bench-ssd-5  compactcoin.1            9:03:06    0.98  5943.29MB 349%  4000MB
    17bench-ssd-5  compactcoin.1            9:04:24    0.98  5975.77MB 348%  4000MB
    18bench-ssd-5  master.1                 9:11:54    1.00  5769.57MB 346%  4000MB
    19bench-ssd-5  master.1                 9:13:56    1.00  5775.04MB 344%  4000MB
    20
    21bench-strong compactcoin.1            14:59:15   0.99  5710.87MB 546%  4000MB
    22bench-strong compactcoin.1            14:59:00   0.99  5701.12MB 548%  4000MB
    23bench-strong master.1                 15:06:04   1.00  5720.08MB 544%  4000MB
    24bench-strong master.1                 15:04:26   1.00  5745.98MB 545%  4000MB
    

    See modest 1-2% improvement in runtime and (oddly) on the SSD machines I’m seeing significantly higher memory usage (usually ~200MB) for this branch.

  62. ajtowns commented at 12:47 pm on December 10, 2019: member

    If CScript needs to be 8 byte aligned (because it has a pointer), then I think this becomes:

    • 88 CCoinsMap::value_type
      • 36 COutPoint
        • 32 uint256
        • 4 uint32_t
      • 4 PADDING
      • 48 CCoinsCacheEntry
        • 48 Coin
          • 40 CTxOut
            • 8 nValue
            • 32 CScript
          • 4 nHeight
          • 1 fCoinBase & flags (dirty & fresh)
          • 3 PADDING

    and there’s no need for PackableCAmount, and nHeight doesn’t need bit fields. 96 byte to 88 is only an 9% improvement though.

    (EDIT: with 32-bit pointers, CAmount alignment becomes your blocker, but not sure that optimising that heavily for 32-bit systems makes sense?)

  63. DrahtBot commented at 8:57 am on February 11, 2020: member
  64. DrahtBot added the label Needs rebase on Feb 11, 2020
  65. martinus commented at 10:45 pm on February 18, 2020: contributor
    I’ll close this issue, as @ajtowns noted, this PR has illegally aligned the CScript to 4 bytes to get it to 28 bytes.
  66. martinus closed this on Feb 18, 2020

  67. martinus referenced this in commit a0b1f54c19 on Aug 14, 2021
  68. martinus referenced this in commit a6dc21eace on Aug 14, 2021
  69. martinus referenced this in commit 0781ea49a2 on Aug 14, 2021
  70. martinus referenced this in commit 8430523512 on Aug 15, 2021
  71. martinus referenced this in commit 0bc3d3825f on Aug 27, 2021
  72. martinus referenced this in commit 7fd45dc4df on Aug 27, 2021
  73. DrahtBot locked this on Feb 15, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-21 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me