Cache 26% more coins: Reduce CCoinsMap::value

martinus commented at 11:07 am on October 5, 2019: contributor

This is an attempt to more tightly pack the data inside CCoinsMap. (replaces #16970, done after my investigations on #16957) Currently, there is quite a bit of overhead involved in the CCoinsMap::value_type data structure (total bytes to the left):

96 CCoinsMap::value_type
    36 COutPoint
        32 uint256
         4 uint32_t
     4 >>> PADDING <<<
    56 CCoinsCacheEntry
        48 Coin
            40 CTxOut
                 8 nValue
                32 CScript
             4 fCoinBase & nHeight
             4 >>> PADDING <<<
         1 flags (dirty & fresh)
         7 >>> PADDING <<<

So there is quite a bit of padding. In my experiments I’ve noticed that the compiler is forced to use 8-byte alignment (and to pad accordingly) only because nValue’s CAmount type is int64_t, which has to be 8-byte aligned. When nValue is replaced with a data structure that needs at most 4-byte alignment, the whole CCoinsMap::value_type is aligned to 4 bytes, reducing padding.
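The alignment effect described above can be reproduced in isolation. The following is a minimal sketch using toy stand-in types (`AlignedEntry`, `PackedAmount` and `PackedEntry` are hypothetical names, not code from this PR). Exact sizes depend on the platform ABI, so the sketch exposes them as constants rather than asserting a specific padded size.

```cpp
#include <cstddef>
#include <cstdint>

// Toy stand-ins for illustration only, not the real Bitcoin Core types.
struct AlignedEntry {
    std::uint8_t key[36];   // plays the role of the 36-byte COutPoint
    std::int64_t amount;    // 8-byte aligned on common 64-bit ABIs
    std::uint32_t meta;     // plays the role of fCoinBase & nHeight
};

// The same 8 bytes of payload, but stored as raw bytes (alignment 1).
struct PackedAmount {
    std::uint8_t bytes[sizeof(std::int64_t)];
};

struct PackedEntry {
    std::uint8_t key[36];
    PackedAmount amount;
    std::uint32_t meta;     // now the strictest member: 4-byte alignment
};

// Sizes as the compiler actually lays them out on the current platform.
constexpr std::size_t kAlignedSize = sizeof(AlignedEntry);
constexpr std::size_t kPackedSize = sizeof(PackedEntry);
```

On a typical x86-64 build, `kAlignedSize` should come out larger than `kPackedSize`, because the `int64_t` member forces padding both before it and at the end of the struct.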

Another 4 bytes can be saved by refactoring prevector to use only a single byte for the direct size information, and reducing CScript’s direct (inline) capacity from 28 to 27 bytes, which shrinks CScript from 32 to 28 bytes. It can still store most scripts directly, as most are <= 25 bytes long.

The remaining 4 bytes due to the 1 byte flag + 3 bytes padding in CCoinsCacheEntry can be removed by moving the flags into Coin, stealing another 2 bits from nHeight.
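As a rough illustration of moving the flags into the same 32-bit word as the height, here is a minimal sketch with bitfields. `PackedCoinMeta`, its field names and `MakeMeta` are hypothetical and only mirror the description above; the actual layout in the PR may differ.

```cpp
#include <cstdint>

// Hypothetical stand-in for the packed word described above.
struct PackedCoinMeta {
    std::uint32_t coinbase : 1;  // role of fCoinBase
    std::uint32_t dirty : 1;     // cache flag
    std::uint32_t fresh : 1;     // cache flag
    std::uint32_t height : 29;   // 31 bits minus the 2 bits taken for the flags
};

// Largest height that still fits into 29 bits.
constexpr std::uint32_t kMaxPackedHeight = (1u << 29) - 1;

// All four fields share one 32-bit word on mainstream compilers.
static_assert(sizeof(PackedCoinMeta) == sizeof(std::uint32_t),
              "flags and height are expected to share one 32-bit word");

// Build a meta word with both cache flags cleared.
inline PackedCoinMeta MakeMeta(std::uint32_t height, bool coinbase) {
    PackedCoinMeta m{};   // value-initialization zeroes every field
    m.coinbase = coinbase;
    m.height = height;
    return m;
}
```

With 29 bits the height can still reach 536,870,911, far beyond the block heights discussed in this thread.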

Finally, the resulting data structure is 20 bytes smaller:

76 CCoinsMap::value_type
    36 COutPoint
        32 uint256
         4 uint32_t
    40 CCoinsCacheEntry
        40 Coin
            36 CTxOut
                 8 nValue
                28 CScript
             4 fCoinBase & nHeight & flags (dirty & fresh)

So we can store about 26% more data in dbcache’s memory. I have evaluated this on my Intel i7-8700, 32GB RAM, external SSD, for -reindex-chainstate with both -dbcache=500 and -dbcache=5000:

[Image: out2]

branch                     dbcache        time      max resident set size (kbytes)
master                     -dbcache=500   05:57:20  2957532
2019-09-more-compact-Coin  -dbcache=500   05:33:37  2919312
Improvement                               6.64%     1.29%

branch                     dbcache        time      max resident set size (kbytes)
master                     -dbcache=5000  04:22:42  7072612
2019-09-more-compact-Coin  -dbcache=5000  04:09:16  6533132
Improvement                               5.11%     7.63%

So on my machine there is definitely an improvement, but honestly I am not sure if this ~6% improvement is worth the change. Maybe the improvement is bigger with a slower disk, as ~26% more transactions can be cached.

fanquake added the label Resource usage on Oct 5, 2019

laanwj added the label Validation on Oct 5, 2019

laanwj added the label UTXO Db and Indexes on Oct 5, 2019

DrahtBot commented at 11:54 am on October 5, 2019: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#18113 ([coins] Don’t allow a coin to spent and FRESH. Improve commenting. by jnewbery)
#18087 (Get rid of VARINT default argument by sipa)
#18000 ([WIP] Coin Statistics Index by fjahr)
#17708 (prevector: avoid misaligned member accesses by ajtowns)
#17487 (coins: allow write to disk without cache drop by jamesob)
#9384 (CCoinsViewCache code cleanup & deduplication by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Sjors commented at 1:51 pm on October 6, 2019: member

Would be nice to test this on a Raspberry-like device. They tend to have low memory and slow storage. Unfortunately they generally also require pruning, and the way pruning currently works we flush the coin cache.

For example I have an Orange Pi chipping away at IBD this week; although it has 2 GB of RAM, because I’m pruning it to 5 GB, it never uses more than 230 MB.

MarcoFalke commented at 5:05 pm on October 7, 2019: member

For example I have an Orange Pi chipping away at IBD this week; although it has 2 GB of RAM, because I’m pruning it to 5 GB, it never uses more than 230 MB.

A 512 GB micro SD card should do the job, no?

fanquake commented at 3:38 pm on October 8, 2019: member

@TheBlueMatt You might be interested here given you commented in #16970.

in src/primitives/transaction.h:133 in d69050ecec outdated

126@@ -127,13 +127,72 @@ class CTxIn
127     std::string ToString() const;
128 };
129 
130+
131+/**
132+ * CAmount is an int64_t, which requires alignment of 8. This wrapper stores the data
133+ * in a memory array and packs/unpacks it whenever needed, regarldess of alignment.

practicalswift commented at 9:31 pm on October 8, 2019:

Should be “regardless” :)

in src/primitives/transaction.h:143 in d69050ecec outdated

138+class PackableCAmount
139+{
140+    uint8_t m_data[sizeof(CAmount)];
141+
142+public:
143+    PackableCAmount(CAmount val) noexcept

practicalswift commented at 9:31 pm on October 8, 2019:

Should be explicit? :)

martinus commented at 2:37 pm on October 9, 2019:

It should, I can add that easily by adding the assignment operator for CAmount.

Also theoretically the operator CAmount() should be explicit too, but changing that would mean that I’d have to change practically all usages of this class…
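For reference, a wrapper along the lines discussed here could look roughly like the sketch below, with an explicit constructor, an assignment operator taking the amount, and an explicit conversion back. `Amount` and `PackedAmountSketch` are hypothetical names used for illustration; this is not the class from the PR.

```cpp
#include <cstdint>
#include <cstring>

using Amount = std::int64_t;  // stand-in for CAmount

// Stores an 8-byte amount in a byte array so the type itself only
// needs 1-byte alignment. Values are copied in and out with memcpy.
class PackedAmountSketch
{
    std::uint8_t m_data[sizeof(Amount)];

public:
    explicit PackedAmountSketch(Amount val) noexcept
    {
        std::memcpy(m_data, &val, sizeof(val));
    }

    // Assignment from a raw amount keeps call sites short even
    // though the constructor is explicit.
    PackedAmountSketch& operator=(Amount val) noexcept
    {
        std::memcpy(m_data, &val, sizeof(val));
        return *this;
    }

    // Explicit conversion: callers must write static_cast<Amount>(x).
    explicit operator Amount() const noexcept
    {
        Amount val;
        std::memcpy(&val, m_data, sizeof(val));
        return val;
    }
};
```

Making the conversion operator explicit is what would force changes at every call site that currently relies on the implicit conversion, which is the trade-off mentioned above.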

laanwj commented at 10:03 am on October 9, 2019: member

A 512 GB micro SD card should do the job, no?

Yep, works fine.

in src/primitives/transaction.h:135 in d69050ecec outdated

126@@ -127,13 +127,72 @@ class CTxIn
127     std::string ToString() const;
128 };
129 
130+
131+/**
132+ * CAmount is an int64_t, which requires alignment of 8. This wrapper stores the data
133+ * in a memory array and packs/unpacks it whenever needed, regarldess of alignment.
134+ *
135+ * Since CAmount's size is know, the compiler should be able to optimize the std::memcpy away

GChuf commented at 1:59 pm on October 9, 2019:

Should be “known”. … And, uh, is that “away” or “anyway”?

martinus commented at 2:52 pm on October 9, 2019:

fixed in 03cb535384eb6d2f4284524c33ab03371bd263e2, also it should be “optimized away” because the compiler can optimize the function call away. See this: https://godbolt.org/z/6uP9_S

fanquake added the label Needs Conceptual Review on Oct 9, 2019

fanquake commented at 2:03 pm on October 9, 2019: member

Please let’s keep the discussion high level before pointing out any typos.

promag commented at 7:00 am on October 10, 2019: member

Concept ACK. Didn’t verify the claim, but 26% sounds like a good improvement to me.

AFAICT dirty and fresh are almost always read and written at the same time, maybe use a 2 bit flag? Or is the compiler doing that anyway?

laanwj commented at 7:41 am on October 12, 2019: member

@promag that’s what the bitfield syntax unsigned int dirty : 1; does, it makes the flags take up the minimum possible space. Or maybe I’m misunderstanding what you’re trying to accomplish.

GChuf commented at 1:26 pm on October 12, 2019: contributor

Would like to test this on my VM using a HDD. I’m not sure how to benchmark this - I’ve tried https://github.com/chaincodelabs/bitcoinperf but haven’t succeeded. Any info/help would be appreciated.

MarcoFalke commented at 1:30 pm on October 12, 2019: member

bitcoinperf is a bit messy to set up. I think it requires docker. Is there any specific issue you ran into? I guess it might be better to report such upstream at https://github.com/chaincodelabs/bitcoinperf to not bloat this thread

martinus commented at 1:46 pm on October 12, 2019: contributor

Would like to test this on my VM using a HDD. I’m not sure how to benchmark this - I’ve tried https://github.com/chaincodelabs/bitcoinperf but haven’t succeeded. Any info/help would be appreciated.

In my benchmarks I deleted debug.log, ran /usr/bin/time -v ./bitcoind -reindex-chainstate -stopatheight=594000 -printtoconsole=0 -dbcache=500, then used a hand written script to convert the debug.log into gnuplot-readable data, then used gnuplot to create the graphs.
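The conversion script itself is not shown in this thread. As a minimal sketch of the idea (assuming C++17 and the UpdateTip line format that appears in the debug.log excerpts further down), the height and cache size could be extracted like this. `FieldValue`, `TipSample` and `ParseUpdateTip` are hypothetical helper names.

```cpp
#include <optional>
#include <string>

// Returns the text following " key=" up to the next space, if present.
inline std::optional<std::string> FieldValue(const std::string& line, const std::string& key)
{
    const std::string needle = " " + key + "=";
    const auto pos = line.find(needle);
    if (pos == std::string::npos) return std::nullopt;
    const auto start = pos + needle.size();
    const auto end = line.find(' ', start);
    return line.substr(start, end == std::string::npos ? std::string::npos : end - start);
}

struct TipSample {
    std::string timestamp;  // first whitespace-delimited token of the line
    long height;
    double cache_mib;
};

// Parses one debug.log line; returns nullopt for non-UpdateTip lines.
// Note: std::stol/std::stod throw if a field is not numeric.
inline std::optional<TipSample> ParseUpdateTip(const std::string& line)
{
    if (line.find("UpdateTip:") == std::string::npos) return std::nullopt;
    const auto height = FieldValue(line, "height");
    const auto cache = FieldValue(line, "cache");
    if (!height || !cache) return std::nullopt;
    TipSample sample;
    sample.timestamp = line.substr(0, line.find(' '));
    sample.height = std::stol(*height);
    // std::stod reads the leading number of e.g. "4275.5MiB(35061266txo)".
    sample.cache_mib = std::stod(*cache);
    return sample;
}
```

Printing `timestamp`, `height` and `cache_mib` separated by spaces, one sample per line, would give a file that gnuplot can read directly.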

It would be awesome to have some kind of logger in bitcoind that can periodically log statistics into a file in an easily parseable data.

GChuf commented at 1:52 pm on October 12, 2019: contributor

There were a couple issues. Could you recommend anything else to test -reindex-chainstate? Should I simply run bitcoind and have linux count time … ? Or maybe I’ll look around things mentioned in this stackexchange answer.

p.s. thanks for the info @martinus. I’ll figure something out with what you did and those tools mentioned in the link.

promag commented at 2:24 pm on October 12, 2019: member

@laanwj I mean stuff like this:

        auto oldDirty = coin.dirty;
        auto oldFresh = coin.fresh;

could be

        auto old_cache = coin.cache;

where

    unsigned int cache : 2;   // DIRTY = 1,  FRESH = 2
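To make the suggestion concrete, here is a minimal sketch of a combined 2-bit flag field that is saved and restored in one step. `CoinSketch` and `AssignKeepingFlags` are hypothetical names; only the `cache : 2` field and the DIRTY/FRESH values come from the snippet above.

```cpp
#include <cstdint>

enum CacheFlags : unsigned int { DIRTY = 1, FRESH = 2 };

// Hypothetical coin-like struct with one combined flag field.
struct CoinSketch {
    std::uint32_t coinbase : 1;
    std::uint32_t height : 29;
    std::uint32_t cache : 2;   // bitwise OR of DIRTY and FRESH
};

// Overwrites the coin data of dst with src but keeps dst's cache flags,
// using a single save/restore instead of one per flag.
inline void AssignKeepingFlags(CoinSketch& dst, const CoinSketch& src)
{
    const auto old_cache = dst.cache;
    dst = src;
    dst.cache = old_cache;
}
```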

GChuf commented at 5:55 pm on October 15, 2019: contributor

The debug.log from the first benchmark (master) got overwritten by the second benchmark, and I didn’t make a script to do a graph from debug.log yet anyway, so I’m just gonna post the /usr/bin/time -v results. The improvement seems remarkable! I tested this on an Ubuntu 16 VM running on a Seagate SSHD (also called hybrid) disk. It seems however the process used more CPU - not sure what’s going on with those percentages though. Must be a VM bug.

branch                     dbcache      time     max resident set size (kbytes)
2019-09-more-compact-Coin  dbcache=500  2:20:09  1673108
Master                     dbcache=500  3:25:46  1729412
Improvement                             31.9%    3.26%

	Command being timed: "./bitcoindmaster -reindex-chainstate -stopatheight=325000 -printtoconsole=0 -dbcache=500"
	User time (seconds): 19760.08
	System time (seconds): 1031.02
	Percent of CPU this job got: 168%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 3:25:46
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1729412
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 2088592
	Voluntary context switches: 38379113
	Involuntary context switches: 2185645
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

	Command being timed: "./bitcoindcoin -reindex-chainstate -stopatheight=325000 -printtoconsole=0 -dbcache=500"
	User time (seconds): 19004.87
	System time (seconds): 598.24
	Percent of CPU this job got: 233%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:20:09
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1673108
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1737560
	Voluntary context switches: 17203875
	Involuntary context switches: 2926137
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

martinus commented at 7:05 pm on October 15, 2019: contributor

Thanks @GChuf for testing this! These are really nice results.

It seems however the process used more CPU

I don’t think so:

master had 19760.08 sec user + 1031.02 sec system = 20791.1 sec
2019-09-more-compact-Coin had 19004.87 sec user + 598.24 sec system = 19603.11 sec

So the branch used ~6% less CPU. Note that the percentage 233% is much higher than for master because it used that CPU in a much shorter timeframe.

The percentage is calculated 100*(user + system time) / elapsed. 233 percent means that on average 2.33 cores were running for this job.
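The formula can be checked against the two /usr/bin/time reports posted above. A minimal sketch (the helper names `ToSeconds` and `CpuPercent` are made up for this example):

```cpp
// Converts an h:mm:ss wall-clock time into seconds.
inline double ToSeconds(int hours, int minutes, int seconds)
{
    return hours * 3600.0 + minutes * 60.0 + seconds;
}

// CPU percentage as described above: 100 * (user + system) / elapsed.
inline double CpuPercent(double user_secs, double system_secs, double elapsed_secs)
{
    return 100.0 * (user_secs + system_secs) / elapsed_secs;
}
```

Plugging in the numbers: master gives `CpuPercent(19760.08, 1031.02, ToSeconds(3, 25, 46))`, roughly 168, and the branch gives `CpuPercent(19004.87, 598.24, ToSeconds(2, 20, 9))`, roughly 233, matching the reported 168% and 233%.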

GChuf commented at 5:40 pm on October 16, 2019: contributor

The percentage is calculated 100*(user + system time) / elapsed. 233 percent means that on average 2.33 cores were running for this job.

Thanks for the explanation! The math works :) If more testing is needed I’d be glad to help, but I think HDDs will show similar results to my SSHD and it’s clear that the improvements are real.

jamesob commented at 5:48 am on October 17, 2019: member

Concept ACK from me, seems like some great savings here. Though it’ll be crucial to test this thoroughly across platforms. Will review in depth soon.

GChuf commented at 5:11 pm on October 17, 2019: contributor

I can also test it on windows. Does anyone have the windows executable for this already, to make my life easier? @martinus maybe?

sipa commented at 5:35 pm on October 17, 2019: member

These are pretty impressive gains, so concept ACK.

Some overall comments:

Since this is making the boundary between Coin and CCoinsCacheEntry a bit less clear, maybe it would be useful to either have a Coin::AssignWithoutFlags function, or even make Coin::operator= not touch the flags. That would avoid all the oldX = ...; ...; ... = oldX code, and make review easier.
The coding style guidelines for variables say to use snake_case in new code.

I do want to review the prevector changes in more detail.

MarcoFalke added the label Needs gitian build on Oct 17, 2019

martinus commented at 5:43 pm on October 17, 2019: contributor

I can also test it on windows. Does anyone have the windows executable for this already, to make my life easier? @martinus maybe?

Sorry, I build only in Linux, never tried to build in Windows

Since this is making the boundary between Coin and CCoinsCacheEntry a bit less clear, maybe it would be useful to either have a Coin::AssignWithoutFlags function, or even make Coin::operator= not touch the flags. That would avoid all the oldX = ...; ...; ... = oldX code, and make review easier.

I’m also unsure whether it was necessary everywhere to back up and then restore the previous flags. Moving the flags into Coin only saves us 4 bytes. I could also remove that change and do this in a separate PR.

MarcoFalke commented at 6:21 pm on October 17, 2019: member

@DrahtBot might have a windows build ready tomorrow

dongcarl commented at 7:20 pm on October 17, 2019: member

I have Gitian-built windows binaries of 03cb535384eb6d2f4284524c33ab03371bd263e2 here for those who are keen (@GChuf): https://send.firefox.com/download/c78ac103a4dabe86/#S70uxKCUua-baEeR0v029g

GChuf commented at 8:15 pm on October 17, 2019: contributor

Thanks @MarcoFalke & @dongcarl, will test on Win10 tomorrow/this weekend.

wtogami commented at 0:28 am on October 18, 2019: contributor

A 512 GB micro SD card should do the job, no?

You’ll find sdcards larger than 32GB to be rather terrible for anything aside from sequential write (4K video cameras). The main issue is the “erase block size” which is the minimum size of any erase. Every time you write anything to disk it could rewrite that size. 32GB is the smallest sdcard I’ve seen with 4MB erase block size. I’ve seen 128GB sdcard with 16MB. I’m guessing bigger has even larger erase blocks.

There are possible changes that could be done to Core to make it more flash friendly but it is not worthwhile as performance of that type of flash is abysmal while SATA/NVMe SSD controllers don’t really need such optimizations.

jamesob commented at 2:07 am on October 18, 2019: member

Hm, getting the opposite of what I would expect in terms of benchmark results. For /usr/bin/time -v ./src/bitcoind -reindex-chainstate -stopatheight=550000 -dbcache=4000 -connect=0 (run using a lightly modified version of this script) across a number of different machines, I’m seeing this branch as reliably slower than master:

bench-ssd-2  master                               4:05:52    0.45  6283.55MB 204%
bench-ssd-2  martinus/2019-09-more-compact-Coin   9:06:48    1.00  5927.98MB 347%

bench-ssd-3  martinus/2019-09-more-compact-Coin   9:07:05    1.00  5914.48MB 346%
bench-ssd-3  master                               2:59:52    0.33  6278.71MB 77%

bench-ssd-4  master                               2:33:47    0.28  6257.46MB 89%
bench-ssd-4  martinus/2019-09-more-compact-Coin   9:07:19    1.00  5834.16MB 346%

bench-ssd-5  martinus/2019-09-more-compact-Coin   9:03:16    1.00  5963.10MB 349%
bench-ssd-5  master                               3:58:48    0.44  6252.48MB 209%

bench-hdd-4  master                               6:22:47    0.64  6263.75MB 36%
bench-hdd-4  martinus/2019-09-more-compact-Coin   10:02:42   1.00  5908.06MB 314%

bench-hdd-5  martinus/2019-09-more-compact-Coin   9:52:42    1.00  5889.18MB 320%
bench-hdd-5  master                               6:54:05    0.70  6240.25MB 120%

Sample machine specs:

Hostname:              bench-ssd-2
Kernel:                Linux 4.9.0-8-amd64
OS:                    Debian GNU/Linux 9
RAM (GB):              7.71
Architecture:          x86_64
CPU(s):                4
Thread(s) per core:    1
Core(s) per socket:    4
Model name:            Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz

Can anyone think of an explanation for this?

Also, FWIW, in your original PR description your graph and table show master as being faster than this branch but when I read it initially I’d assumed it was just a labeling error.

martinus commented at 4:03 pm on October 18, 2019: contributor

Can anyone think of an explanation for this?

I really can’t explain why the difference is so huge. Maybe it has something to do with the Xeon CPU? Or there is still a bug in the code that manifests on that CPU. Do you have the debug.log files, or graphs of the progress? Maybe there it’s visible at which point the difference occurs.

I’m currently running a comparison on a very slow Intel Celeron N3050, 2GB of RAM, -dbcache=500, and external SSD. So far it seems the results are very close.

DrahtBot commented at 4:45 am on October 19, 2019: member

Gitian builds for commit ec3ed5a4487886f1c2a35fda0a3289be7b280248 (master):

f64e5b9561bbfd24763ca216dfb72e6c... bitcoin-0.19.99-aarch64-linux-gnu-debug.tar.gz
3450cd39da524d4259d8971fa07f7cd3... bitcoin-0.19.99-aarch64-linux-gnu.tar.gz
f9c08aecb54c510abc1c49a0f615cc47... bitcoin-0.19.99-arm-linux-gnueabihf-debug.tar.gz
48858aa869f72b05f6fd77a174cec72d... bitcoin-0.19.99-arm-linux-gnueabihf.tar.gz
2f3d551fa3a0744ced807e575f1e4643... bitcoin-0.19.99-i686-pc-linux-gnu-debug.tar.gz
61eef7a7416e41bd49f11481339df776... bitcoin-0.19.99-i686-pc-linux-gnu.tar.gz
befb99d795c38a2f320e99e677e437fb... bitcoin-0.19.99-osx-unsigned.dmg
aadf3d3811d300b416a2e3a36e5d9636... bitcoin-0.19.99-osx64.tar.gz
75ad9c5b79ef283885def690f4210e28... bitcoin-0.19.99-riscv64-linux-gnu-debug.tar.gz
0bdc56205b7a686505b49d9694dfbb7d... bitcoin-0.19.99-riscv64-linux-gnu.tar.gz
0ed997185c1a65c6ab29882306a5c4ed... bitcoin-0.19.99-win64-debug.zip
e642bb08c080af778aad7ab1bc889769... bitcoin-0.19.99-win64-setup-unsigned.exe
471525f13e4ee98129b3e140175776a5... bitcoin-0.19.99-win64.zip
a2d38074e8a186c6251fbb30948caf6f... bitcoin-0.19.99-x86_64-linux-gnu-debug.tar.gz
544d75a68b31657850b22296599a3d3f... bitcoin-0.19.99-x86_64-linux-gnu.tar.gz
c780c4b108f18250dfaa13a06d7ded80... bitcoin-0.19.99.tar.gz
42923b2fe3a4f72ea620650c3be0a040... bitcoin-core-linux-0.20-res.yml
15e9a3c42739179b937c4bb927c02731... bitcoin-core-osx-0.20-res.yml
4f45465e66f847f97cfd6af3c2bd3514... bitcoin-core-win-0.20-res.yml
a34a94c2dee07af19dfd80a405f39c1d... linux-build.log
e0a214ba922d8febd8566d7102414cc0... osx-build.log
bfb37291238fa4be8727e7c882d422e0... win-build.log

Gitian builds for commit c03bdee106a5f9743d373955e7ae63c244a9d14f (master and this pull):

21be324e602ab9137713978809daf2d4... bitcoin-0.19.99-aarch64-linux-gnu-debug.tar.gz
a2a947b221298ecd34e6ce1cb37c0e24... bitcoin-0.19.99-aarch64-linux-gnu.tar.gz
bd91c4a5226072e4edbd5e98b97cb501... bitcoin-0.19.99-arm-linux-gnueabihf-debug.tar.gz
faeb989fd34185ce53da69936d733f79... bitcoin-0.19.99-arm-linux-gnueabihf.tar.gz
a3123122a7e65d5cbf21bef0f8fcf735... bitcoin-0.19.99-i686-pc-linux-gnu-debug.tar.gz
9671072a15c92d5c4e319e106d35562b... bitcoin-0.19.99-i686-pc-linux-gnu.tar.gz
b6ac4ce3519f9c8b1f7f48f178f9ca65... bitcoin-0.19.99-osx-unsigned.dmg
3a85bf2bc871654de1d6e5e54e0f96ad... bitcoin-0.19.99-osx64.tar.gz
b8dde959618ef0d0377edddacf15206d... bitcoin-0.19.99-riscv64-linux-gnu-debug.tar.gz
0440b4bd516b3dce22624bb4238c8f15... bitcoin-0.19.99-riscv64-linux-gnu.tar.gz
c637cd01437f265728fb56e3e1d59251... bitcoin-0.19.99-win64-debug.zip
129e664178ed338e3bdacd004a9bef0f... bitcoin-0.19.99-win64-setup-unsigned.exe
a109ae374325151140aa4d1198d9b0b3... bitcoin-0.19.99-win64.zip
51db059ac46e7bb3023eb388d0d15c2a... bitcoin-0.19.99-x86_64-linux-gnu-debug.tar.gz
156a11f1a47ef7117f042e61884941c6... bitcoin-0.19.99-x86_64-linux-gnu.tar.gz
24814072e1ad7f0737ac7f27041652ec... bitcoin-0.19.99.tar.gz
273925fb3dc8bf3bf938d2fbbad25498... bitcoin-core-linux-0.20-res.yml
fa5fe33a45bd629de2c70f19d4591e14... bitcoin-core-linux-0.20-res.yml.diff
85c08b95741702e1d30f1ddf661a998e... bitcoin-core-osx-0.20-res.yml
ff0ae3c0febd01ff9f5ae8143668e186... bitcoin-core-osx-0.20-res.yml.diff
c436fdc5ae135eef95b78b1a87fd046e... bitcoin-core-win-0.20-res.yml
445445b454c55b180f66c987724dc908... bitcoin-core-win-0.20-res.yml.diff
f623e4a8beac28c21ae777486a2d9896... linux-build.log
bc158fbfa6154a602bbd26d94e152cbc... linux-build.log.diff
3b19b1483a35da9ab989094d72883afe... osx-build.log
fe3f35265a17b343e21809b7b81af0d1... osx-build.log.diff
e7d72b7ce7a9ef37e687fb95165837b7... win-build.log
3aee332171621d115568e8fd07e7ac9f... win-build.log.diff

DrahtBot removed the label Needs gitian build on Oct 19, 2019

GChuf commented at 3:55 pm on October 19, 2019: contributor

@jamesob please post the debug logs, some time ago I was looking at your results at #16801 which seemed weird as well, and debug logs could shed some much needed light on this. As for Xeon CPUs, I’ve got a Xeon E3 1240 v2. The previous test on the Ubuntu VM was run with this CPU and I’m gonna run Windows benchmarks on it as well and post the results here. Anyway I doubt Xeon (or any other) Intel CPUs are the problem, though it’s possible.

GChuf commented at 8:56 am on October 20, 2019: contributor

Much more modest improvements when running this on a physical Windows machine (~2% improvement in time). Same hardware was used as in my previous benchmark. I might have seen better improvements if I used an old HDD, but I just wanted to make sure the improvements are there on windows as well.

master: 02:59:21
2019-09-more-compact-Coin: 02:56:05

[Image: figure_1]

martinus commented at 12:31 pm on October 20, 2019: contributor

@GChuf , did you run the windows benchmark on the same hardware where you got the 31% improvement in Linux?

My benchmark with Intel Celeron N3050 has finished, where the branch is about 1.4% faster. It seems that the CPU is definitely the limiting factor here - each run took over 6 days.

[Image: out2]

GChuf commented at 1:38 pm on October 20, 2019: contributor

@martinus yes, forgot to mention. Hardware was the same. The difference was that on the first run I was running the benchmark on a VM, which certainly performs worse. I think the difference between physical vs VM is more pronounced with CPUs rather than disks.

JeremyRubin commented at 7:03 pm on October 20, 2019: contributor

Some notes on other paths to shave off bytes here:

As an alternative, you could do something like the union below to compact the CAmount for the common case of a UTXO smaller than 2**31/100e6 (about 21.47 BTC). Evidence that this case is common: https://eprint.iacr.org/2017/1095.pdf; also, by the pigeonhole principle, we have 21M coins and 100+M UTXOs, so most have to be small. This technique saves another 4 bytes. The nice thing about this change is that we don’t have to change the guts of CScript, which has implications across a bunch of sensitive areas.

Were you to combine it with the prevector and CScript changes, you would see the same improvement of 4 bytes, bringing it down to 72 bytes.

There is also the option to hash the COutPoint, to save another 4 bytes, but I looked into this once and it seemed complicated because in some cases we use the key of the pair, IIRC. (The same trick as above could be used to shave off 2 bytes, as almost all output indexes are < 16 bits.)

union {
    struct {
        bool is_big : 1;
    private:
        uint32_t amount : 31;
    } read_flag;
    struct {
        struct {
        private:
            bool __flag : 1;
        public:
            uint32_t amount : 31;
        } flag_amount;
        CScript script;
    } small_value;
    struct {
        struct {
        private:
            bool __flag : 1;
        public:
            uint32_t ignore : 31;
        } flag;
        std::unique_ptr<CTxOut> ptr;
    } large_value;
};
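The size limit behind this idea can be sketched independently of the union layout. The snippet below uses explicit shifts instead of bitfields (a deliberately simpler encoding than the union above) and only covers the small-value case; `kCoin`, `kSmallLimit`, `FitsSmall`, `EncodeSmall` and `DecodeSmall` are hypothetical names.

```cpp
#include <cstdint>

constexpr std::int64_t kCoin = 100000000;                    // satoshis per BTC
constexpr std::int64_t kSmallLimit = std::int64_t{1} << 31;  // 31 value bits

// True if the amount fits into 31 bits, i.e. is below about 21.47 BTC.
constexpr bool FitsSmall(std::int64_t amount)
{
    return amount >= 0 && amount < kSmallLimit;
}

// Packs a small amount into 32 bits: bit 0 is the is_big flag (0 here),
// bits 1..31 hold the value. Precondition: FitsSmall(amount).
constexpr std::uint32_t EncodeSmall(std::int64_t amount)
{
    return static_cast<std::uint32_t>(amount) << 1;
}

// Recovers the amount from a word whose is_big flag is 0.
constexpr std::int64_t DecodeSmall(std::uint32_t word)
{
    return static_cast<std::int64_t>(word >> 1);
}
```

Amounts that fail `FitsSmall` would need the out-of-line representation that the `large_value` arm of the union stands for, which this sketch leaves out.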

jamesob commented at 6:25 am on October 21, 2019: member

Just finished another benchmark run on a significantly beefier machine with similar (odd) results. Only the debug.log for this branch was preserved since I started the run a few days ago, but (i) I’m working on some better tooling to parse debug.log files (optionally contingent on #16805) and generate tables, graphs, etc., and (ii) in future runs I’ll make sure I preserve the debug.log files for each branch.

/usr/bin/time -v ./src/bitcoind -reindex-chainstate -stopatheight=550000 -dbcache=4000 -connect=0

Edit: compiled with g++ (Debian 8.3.0-6) 8.3.0.

master

{'time': '2:56:18', 'cpu_perc': '68%', 'mem_kb': 6389536, 'user_time_secs': 6754.8, 'system_time_secs': 527.61}

martinus/2019-09-more-compact-Coin (03cb53538)

{'time': '9:16:02', 'cpu_perc': '635%', 'mem_kb': 5871084, 'user_time_secs': 211292.35, 'system_time_secs': 565.91}

debug.log (martinus/2019-09-more-compact-Coin)

https://transfer.sh/ncVfW/jamesob-17060-debug.log

hwinfo

hostname               bench-strong
cpu_model_name         Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
ram_gb                 31.35
os                     ['Debian GNU/Linux', '10', 'buster']
arch                   x86_64
kernel                 4.19.0-5-amd64
read_iops (/tmp)       29.2k
write_iops (/tmp)      9694
read_iops (/data)      396
write_iops (/data)     131

martinus commented at 6:55 am on October 21, 2019: contributor

Thanks for posting the debug.log! That’s already helpful. I was hoping to find some issue where e.g. db sync takes a long time, but this is not the case. Flushing the db cache takes a few minutes, but nothing extraordinary:

2019-10-18T08:30:54Z UpdateTip: new best=000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f height=0 version=0x00000001 log2_work=32.000022 tx=1 date='2009-01-03T18:15:05Z' progress=0.000000 cache=0.0MiB(0txo)
2019-10-18T10:10:27Z UpdateTip: new best=00000000000000000d533060763bd64c8b97de7fdbc407bd918cf97ff37ae69a height=372609 version=0x00000003 log2_work=83.29081 tx=82023512 date='2015-09-02T02:22:29Z' progress=0.175964 cache=4275.5MiB(35061266txo)
2019-10-18T10:17:07Z UpdateTip: new best=0000000000000000056e394d44a7ae5eb021d6171504a7dec0a833d50763fad6 height=372610 version=0x00000003 log2_work=83.290838 tx=82024083 date='2015-09-02T02:26:03Z' progress=0.175965 cache=482.8MiB(0txo)
2019-10-18T12:15:33Z UpdateTip: new best=00000000000000000168ff4ff0e7a5ddf08c707fe928be19a25a9c90e8444dab height=433063 version=0x20000000 log2_work=85.368569 tx=160962375 date='2016-10-06T00:55:30Z' progress=0.345290 cache=4276.1MiB(35338561txo)
2019-10-18T12:23:07Z UpdateTip: new best=000000000000000000b7619d54035e1ad58c762daef103369d2c04a8641fb20a height=433064 version=0x20000000 log2_work=85.368598 tx=160962901 date='2016-10-06T00:58:57Z' progress=0.345290 cache=482.8MiB(0txo)
2019-10-18T14:52:48Z UpdateTip: new best=0000000000000000002a003b9e34d263f117db2562c80171080e31d4db61bd72 height=490207 version=0x20000000 log2_work=87.294564 tx=262721239 date='2017-10-16T23:03:01Z' progress=0.563536 cache=4276.0MiB(35487228txo)
2019-10-18T15:00:54Z UpdateTip: new best=000000000000000000564f989b2bd03905df412506a20b2a4708a52b1c230787 height=490208 version=0x20000000 log2_work=87.294603 tx=262723684 date='2017-10-16T23:11:33Z' progress=0.563539 cache=482.8MiB(0txo)
2019-10-18T16:40:48Z UpdateTip: new best=00000000000000000032015afc76ed6c6e3e6fabae75f832e8f3d6f1a79bb2bd height=526412 version=0x20000000 log2_work=88.977034 tx=320908996 date='2018-06-07T10:09:17Z' progress=0.688312 cache=4275.9MiB(35483120txo)
2019-10-18T16:47:17Z UpdateTip: new best=00000000000000000031989467b9aa7c18b75ccafd0ca3f788d5e413dbb86e4a height=526413 version=0x20000000 log2_work=88.977084 tx=320909164 date='2018-06-07T10:22:49Z' progress=0.688310 cache=482.8MiB(0txo)
2019-10-18T17:43:44Z UpdateTip: new best=000000000000000000223b7a2298fb1c6c75fb0efc28a4c56853ff4112ec6bc9 height=550000 version=0x20000000 log2_work=90.011241 tx=356588225 date='2018-11-14T02:35:41Z' progress=0.764816 cache=2880.1MiB(22413485txo)

It looks like progress is made quite linearly, so whatever is slowing it down, it seems to happen from start to finish. The really odd thing is the extreme CPU usage in the branch, it’s 31 times higher than on master.

Could you try to run perf top while syncing on the branch? Maybe this gives an indication of where so much CPU is spent. Or create a flamegraph with hotspot. You can’t use pyperf system tune then, though, because it severely reduces the number of allowed samples.

GChuf commented at 2:53 pm on October 21, 2019: contributor

@martinus @jamesob looking at the first test results from James, it occurred to me that the “problem” must be in the CPU (or at least not in the disks) - benching master on ssd vs hdd turned out to be ~2x faster (as expected), whereas benching this branch took 9h on ssd and 10h on hdd.

James, you mentioned this in #16801:

Before anyone asks: these are dedicated benchmarking machines (specs listed below). Before each run I’m dropping all caches (sudo /sbin/swapoff -a; sudo /sbin/sysctl vm.drop_caches=3;), tuning with pyperf (sudo /usr/local/bin/pyperf system tune;), and I’ve ensured that the CPU governors are set to performance.

Besides what Martin suggested, can you try:

not running pyperf & CPU governor
using gitian builds above
using -printtoconsole=0 (This probably won’t change anything, but since we have no idea what’s going on …)

jamesob commented at 8:21 pm on October 29, 2019: member

Okay, I’ve finally gotten around to taking perf to the weird bench-strong results. I took measurements at progress=0.20 (roughly height 387,000) by doing

$ # configure with -fno-omit-frame-pointer, tune system params to enable perf, make, etc.
$ ./src/bitcoind -reindex-chainstate -stopatheight=550000 -dbcache=7000 -printtoconsole=0 -connect=0 &
$ while ! tail -n 10000 ~/.bitcoin/debug.log | grep progress=0.2; do sleep 3; done; kill -STOP `pidof bitcoind`
[... I wait ...]
$ kill -CONT `pidof bitcoind` && sleep 0.4 && perf record -g -F 101 -p `pidof bitcoind` -o perf.$THE_BRANCH.data sleep 90

After running the above for both this branch and master at 6a97e8a060f7632bbaee27d3de8035dc6ebe3895, here’s perf diff perf.master.data perf.martinus.data:

# Event 'cycles:ppp'
#
# Baseline  Delta Abs  Shared Object              Symbol
# ........  .........  .........................  ......................................................................................
#
     2.41%     +3.06%  bitcoind                   [.] SipHashUint256Extra
    15.36%     +2.93%  bitcoind                   [.] secp256k1_fe_mul_inner
     4.09%     -2.21%  libc-2.28.so               [.] 0x000000000007b544
     0.26%     +1.52%  bitcoind                   [.] (anonymous namespace)::ripemd160::rol
    14.02%     -1.42%  bitcoind                   [.] secp256k1_fe_sqr_inner
     1.99%     -1.40%  bitcoind                   [.] sha256d64_avx2::(anonymous namespace)::Xor
     1.48%     -1.32%  bitcoind                   [.] operator new
               +1.12%  bitcoind                   [.] prevector<27u, unsigned char, unsigned int, int>::const_iterator::operator!=
               +1.10%  bitcoind                   [.] prevector<27u, unsigned char, unsigned int, int>::const_iterator::operator++
     0.00%     +1.07%  bitcoind                   [.] ser_writedata32<CSizeComputer>
               +0.99%  bitcoind                   [.] prevector<27u, unsigned char, unsigned int, int>::is_direct
               +0.96%  bitcoind                   [.] prevector<27u, unsigned char, unsigned int, int>::size
     0.03%     +0.92%  bitcoind                   [.] sha256d64_avx2::(anonymous namespace)::Or

I think the percentages are misleading here because when I run sudo perf top --comms=b-loadblk, this is what I see:

Samples: 236K of event 'cycles:ppp', 2000 Hz, Event count (approx.): 252595020173
Overhead  Shared Object              Symbol
   2.79%  bitcoind                   [.] SipHashUint256Extra
   1.00%  bitcoind                   [.] base_blob<256u>::GetUint64
   0.56%  bitcoind                   [.] std::__detail::_Hash_code_base<COutPoint, std::pair<COutPoint const, CCoinsCacheEntry>, std::__
   0.50%  bitcoind                   [.] SaltedOutpointHasher::operator()
   0.45%  bitcoind                   [.] prevector<28u, unsigned char, unsigned int, int>::const_iterator::operator*
   0.33%  bitcoind                   [.] std::_Hashtable<COutPoint, std::pair<COutPoint const, CCoinsCacheEntry>, std::allocator<std::pa
   0.26%  bitcoind                   [.] ser_writedata32<CHashWriter>
   0.23%  bitcoind                   [.] std::_Hashtable<COutPoint, std::pair<COutPoint const, CCoinsCacheEntry>, std::allocator<std::pa
   0.22%  bitcoind                   [.] std::__uninitialized_default_n_1<false>::__uninit_default_n<CTxOut*, unsigned long>
   0.18%  libc-2.28.so               [.] 0x000000000015878a
   0.17%  bitcoind                   [.] operator-

In other words, near as I can figure SipHashUint256Extra clocks in as the bottleneck on the loadblk thread, which is where the reindexing happens. Based on the (unreliable?) percentage values, it looks like this branch is spending more than twice the time hashing for map access than master is. Why this would be (or even whether or not this is the actual cause of the difference) is outside of my understanding.
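To make the "hashing for map access" point concrete, here is a minimal sketch of why a hasher can show up prominently in a profile: every lookup in a std::unordered_map invokes the hasher at least once. Everything below is hypothetical illustration. OutPointSketch stands in for COutPoint, and the mixing function is an FNV-1a style placeholder, not SipHash and not the code used by SaltedOutpointHasher.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>

// Hypothetical stand-in for COutPoint: a 32-byte txid plus an output index.
struct OutPointSketch {
    unsigned char txid[32];
    uint32_t n;
    bool operator==(const OutPointSketch& o) const {
        return n == o.n && std::memcmp(txid, o.txid, sizeof(txid)) == 0;
    }
};

// Global counter so the number of hasher invocations becomes visible.
static std::size_t g_hash_calls = 0;

// Hypothetical salted hasher. The mixing is an FNV-1a style placeholder,
// NOT SipHash; it only keeps the example short.
struct CountingSaltedHasher {
    uint64_t salt;
    std::size_t operator()(const OutPointSketch& o) const {
        ++g_hash_calls;
        uint64_t h = 14695981039346656037ULL ^ salt;
        for (unsigned char b : o.txid) {
            h ^= b;
            h *= 1099511628211ULL;
        }
        h ^= o.n;
        h *= 1099511628211ULL;
        return static_cast<std::size_t>(h);
    }
};

using MapSketch = std::unordered_map<OutPointSketch, int, CountingSaltedHasher>;

// Performs `lookups` successful finds and returns how many hasher calls
// those finds triggered (the initial insert is not counted).
std::size_t HashCallsForLookups(std::size_t lookups) {
    MapSketch map(16, CountingSaltedHasher{0x1234});
    OutPointSketch key{};
    key.n = 7;
    map.emplace(key, 1);
    g_hash_calls = 0;
    for (std::size_t i = 0; i < lookups; ++i) {
        if (map.find(key) == map.end()) return 0; // not expected
    }
    return g_hash_calls;
}
```

Calling HashCallsForLookups(100) should report at least 100 hasher calls. The exact count depends on the standard library, since some implementations rehash neighbouring nodes while walking a bucket.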

I can upload the raw perf data files if anyone wants, but to my understanding they aren’t portable - unfortunately perf is not very well documented, and I had trouble even generating flamegraphs on a secondary machine, so I’m not really sure what the best way to make the datafiles portable is.

elichai commented at 9:05 pm on October 29, 2019: contributor

@jamesob twice the time or twice the percentage? Because I can confirm that my own benchmarks of master from ~2-3 weeks ago showed SipHashUint256Extra very very high on the list (spent a while re-implementing it with SSE instructions just to see it’s actually slower with them lol)

(i’m usually using the following flags: perf report -g 'graph,0.5,caller' that way it sorts the percentages by caller instead of callee, though you probably know more about perf than I :) )

martinus commented at 9:11 pm on October 29, 2019: contributor

@jamesob did your master build already contain #16957?

JeremyRubin commented at 9:12 pm on October 29, 2019: contributor

@martinus yes it did – are you not rebased?

Worth pointing out that a prevector<28> on master fits perfectly in a 32 byte or 64 byte cache line.

That is no longer the case after this change (which is fine, we can increase to prevector<31> where sensitive to it, like std::vector<prevector<>>).

Also, now that CAmounts are unaligned, they too can be extra slow to do something with.

I doubt that’s what’s going on here, but worth looking at before merge.
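The cache-line remark can be checked with a little arithmetic. The sketch below assumes a contiguous array whose base is aligned to a cache line, and takes the object sizes from the layout tables in the PR description (32 bytes for CScript on master, 28 bytes on this branch). The helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// True if an object of `size` bytes starting at byte `offset` touches
// more than one cache line of `line` bytes.
inline bool StraddlesLine(std::size_t offset, std::size_t size, std::size_t line) {
    return offset / line != (offset + size - 1) / line;
}

// Counts how many of the first `count` elements of a tightly packed array
// straddle a cache-line boundary, assuming the array base is line-aligned.
inline std::size_t CountStraddling(std::size_t elem_size, std::size_t count, std::size_t line) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (StraddlesLine(i * elem_size, elem_size, line)) ++n;
    }
    return n;
}
```

Under these assumptions, 32-byte elements never straddle a 64-byte line, while 6 of every 16 consecutive 28-byte elements do.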

martinus commented at 10:51 pm on October 29, 2019: contributor

@martinus yes it did – are you not rebased?

I didn’t see that he linked to the master revision, so both master and my branch have #16957, so it’s not that.

I suspect that there is something strange going on with the threading code. Maybe some timing issue causes lots of cache misses for some reason? I don’t know much about how the checkqueue and surrounding classes work though.

martinus commented at 11:30 pm on October 29, 2019: contributor

I think I might have been able to reproduce this issue. I’ve tried ./bitcoind -datadir=/run/media/martinus/tera/bitcoin/db -reindex-chainstate -stopatheight=150000 -printtoconsole=0 -connect=0 -dbcache=15000 on master, and on an older version, before #16957. The old version used ~20 seconds of user time, master used 332 seconds. I’ve done git bisect and found this revision as the first bad commit: fa3a7331160d1a460b1c15fca1810e98070d629c

I don’t know why, but this commit seems to have a dramatic effect on performance.

sipa commented at 11:34 pm on October 29, 2019: member

@martinus Are you aware that blocks before the assumevalid point don’t get signature/script validated?

MarcoFalke commented at 0:31 am on October 30, 2019: member

For benchmarking purposes, it might be best to set all nodes compiled from different branches to -noassumevalid or the same block hash.

martinus commented at 5:03 am on October 30, 2019: contributor

@martinus Are you aware that blocks before the assumevalid point don’t get signature/script validated?

No, I didn’t know. It also seems that, since I didn’t yet have this block, -reindex-chainstate signature/script validates everything.

It might be helpful to have a list of assumevalid blocks instead of just a single block, to prevent revalidating everything when that specific marker block is not yet there.

DrahtBot added the label Needs rebase on Nov 8, 2019

more tightly pack Coin

This change reduces CCoinsMap's value_type from 96 bytes to 80 bytes by
more tightly packing its data. This is achieved by these changes:

* Refactored prevector so it uses a single byte to determine its size
  when in direct mode
* Reduce CScriptBase from 28 to 27 direct bytes
* Introduced PackableCAmount to be able to align CTxOut to 4 bytes to
  prevent padding when used as a member in Coin

This tighter packing means more data can be stored in the coinsCache
before it is full and has to be flushed to disk. In my benchmark,
-reindex-chainstate was 6% faster and used 6% less memory. The cache
could fit 14% more txo's before it had to resize.

1fb877c38b

packed CCoinsCacheEntry flags into Coin

Removed CCoinsCacheEntry's 1 byte for the flags and put it directly into
Coin. That way we can get rid of unnecessary padding, which reduces the
memory requirement for the coinsCache. We steal 2 bits from Coin's
nHeight, so now there are only 29 bits left. Still, we are safe until
block 2^29-1 = 536870911.

43328c458e
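As an illustration of the bit stealing described in this commit message, the following sketch packs the coinbase bit, the two cache flags, and a 29-bit height into one 32-bit word. The class name, bit positions, and accessor names are assumptions made for this example; the commit's actual layout may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit assignment: bit 0 = coinbase, bit 1 = dirty,
// bit 2 = fresh, bits 3..31 = height (29 bits).
constexpr uint32_t COINBASE_BIT = 1u << 0;
constexpr uint32_t DIRTY_BIT = 1u << 1;
constexpr uint32_t FRESH_BIT = 1u << 2;
constexpr unsigned HEIGHT_SHIFT = 3;
constexpr uint32_t MAX_PACKED_HEIGHT = (1u << 29) - 1;

static_assert(MAX_PACKED_HEIGHT == 536870911, "29 bits cover heights up to 2^29-1");

class PackedCoinMetaSketch {
    uint32_t m_bits = 0;

public:
    // Stores the height; returns false (and changes nothing) if it does not fit in 29 bits.
    bool SetHeight(uint32_t height) {
        if (height > MAX_PACKED_HEIGHT) return false;
        m_bits = (m_bits & ((1u << HEIGHT_SHIFT) - 1)) | (height << HEIGHT_SHIFT);
        return true;
    }
    uint32_t Height() const { return m_bits >> HEIGHT_SHIFT; }

    // Sets or clears one of the single-bit fields defined above.
    void SetBit(uint32_t bit, bool on) { m_bits = on ? (m_bits | bit) : (m_bits & ~bit); }
    bool HasBit(uint32_t bit) const { return (m_bits & bit) != 0; }
};

static_assert(sizeof(PackedCoinMetaSketch) == 4, "all fields share one 32-bit word");
```

Explicit shifts and masks are used instead of C++ bit-fields because bit-field layout is implementation-defined, whereas this form has the same size and bit positions on every compiler.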

make PackableCAmount's ctor explicit

This requires the operator=(CAmount) which I've implemented here as well. Using static_cast for nicer formatting.

Note that it would be cleaner to make the user-defined conversion `operator CAmount()` explicit too, but that would mean that practically everywhere it is used we need to add an explicit cast.
See "C.164: Avoid implicit conversion operators" http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c164-avoid-implicit-conversion-operators

22d1d944d7
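A rough sketch of the interface this commit message describes: a 64-bit amount kept in 4-byte-aligned storage, with an explicit constructor, an operator=(CAmount), and an implicit conversion back to CAmount. The class name PackableCAmountSketch and the two-word storage are assumptions for illustration; this is not the PR's actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

using CAmount = int64_t;

// Stores a CAmount in two 32-bit words so the type only requires 4-byte alignment.
class PackableCAmountSketch {
    uint32_t m_words[2];

public:
    // Explicit: a bare CAmount does not silently turn into the packed type.
    explicit PackableCAmountSketch(CAmount v) { std::memcpy(m_words, &v, sizeof(v)); }

    // Needed once the constructor is explicit, so `amount = value;` still compiles.
    PackableCAmountSketch& operator=(CAmount v) {
        std::memcpy(m_words, &v, sizeof(v));
        return *this;
    }

    // Implicit conversion back, so existing arithmetic on amounts needs no casts.
    operator CAmount() const {
        CAmount v;
        std::memcpy(&v, m_words, sizeof(v));
        return v;
    }
};

static_assert(sizeof(PackableCAmountSketch) == 8, "same size as CAmount");
static_assert(alignof(PackableCAmountSketch) == 4, "only 4-byte aligned");

// With 4-byte alignment, a following 32-bit member needs no tail padding.
struct PackedPairSketch {
    PackableCAmountSketch amount;
    uint32_t extra;
};
static_assert(sizeof(PackedPairSketch) == 12, "no padding after the 32-bit member");

// Helper for the usage example: stores a value and reads it back.
inline CAmount RoundTrip(CAmount v) {
    PackableCAmountSketch packed{v};
    return packed;
}
```

The memcpy calls sidestep unaligned 64-bit loads and stores. Compilers typically lower them to plain moves on platforms where unaligned access is permitted.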

martinus force-pushed on Nov 8, 2019

DrahtBot removed the label Needs rebase on Nov 8, 2019

jamesob commented at 5:09 pm on November 12, 2019: member

Did some benches over the weekend (bench/master.1 vs. bench/compactcoin.1) with the same setup/reindex-to-550,000 as mentioned in the allocator PR (https://github.com/bitcoin/bitcoin/pull/16801#issuecomment-551883348):

host         tag                      time       time% maxmem  cpu%  dbcache
bench-ssd-2  master.1                 9:13:54    1.00  5770.10MB 344%  4000MB
bench-ssd-2  master.1                 9:15:57    1.00  5773.23MB 343%  4000MB
bench-ssd-2  compactcoin.1            9:07:01    0.98  5899.80MB 347%  4000MB
bench-ssd-2  compactcoin.1            9:07:41    0.99  5923.72MB 346%  4000MB

bench-ssd-3  compactcoin.1            9:06:32    0.98  5959.14MB 347%  4000MB
bench-ssd-3  compactcoin.1            9:07:35    0.98  6013.83MB 346%  4000MB
bench-ssd-3  master.1                 9:17:00    1.00  5771.56MB 342%  4000MB
bench-ssd-3  master.1                 9:17:24    1.00  5773.17MB 342%  4000MB

bench-ssd-4  compactcoin.1            9:08:11    0.98  5936.75MB 346%  4000MB
bench-ssd-4  compactcoin.1            9:07:46    0.98  5916.50MB 346%  4000MB
bench-ssd-4  master.1                 9:13:56    0.99  5770.13MB 344%  4000MB
bench-ssd-4  master.1                 9:17:07    1.00  5760.78MB 342%  4000MB

bench-ssd-5  compactcoin.1            9:03:06    0.98  5943.29MB 349%  4000MB
bench-ssd-5  compactcoin.1            9:04:24    0.98  5975.77MB 348%  4000MB
bench-ssd-5  master.1                 9:11:54    1.00  5769.57MB 346%  4000MB
bench-ssd-5  master.1                 9:13:56    1.00  5775.04MB 344%  4000MB

bench-strong compactcoin.1            14:59:15   0.99  5710.87MB 546%  4000MB
bench-strong compactcoin.1            14:59:00   0.99  5701.12MB 548%  4000MB
bench-strong master.1                 15:06:04   1.00  5720.08MB 544%  4000MB
bench-strong master.1                 15:04:26   1.00  5745.98MB 545%  4000MB

I see a modest 1-2% improvement in runtime and, oddly, significantly higher memory usage on the SSD machines (usually ~200MB more) for this branch.

ajtowns commented at 12:47 pm on December 10, 2019: member

If CScript needs to be 8 byte aligned (because it has a pointer), then I think this becomes:

88 CCoinsMap::value_type
- 36 COutPoint
  - 32 uint256
  - 4 uint32_t
- 4 PADDING
- 48 CCoinsCacheEntry
  - 48 Coin
    - 40 CTxOut
      - 8 nValue
      - 32 CScript
    - 4 nHeight
    - 1 fCoinBase & flags (dirty & fresh)
    - 3 PADDING

and there’s no need for PackableCAmount, and nHeight doesn’t need bit fields. Going from 96 bytes to 88 is only a 9% improvement though.

(EDIT: with 32-bit pointers, CAmount alignment becomes your blocker, but not sure that optimising that heavily for 32-bit systems makes sense?)
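The 88-byte layout above can be sanity-checked with mock types. The sketch below is not Bitcoin Core code: every type name is a hypothetical stand-in, and CScript is modelled as an opaque 32-byte blob with 8-byte alignment (the alignment a 64-bit pointer member would impose). The helper also reproduces the "9%" and "26%" figures as ratios of entry sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Mock of COutPoint: 32-byte hash plus 32-bit index, 4-byte aligned.
struct OutPointMock {
    unsigned char hash[32];
    uint32_t n;
};

// Mock of CScript: 32 bytes, 8-byte aligned because the real type holds a pointer.
struct alignas(8) ScriptMock {
    unsigned char bytes[32];
};

struct TxOutMock {
    int64_t nValue;
    ScriptMock scriptPubKey;
};

// nHeight keeps a full 32-bit word; coinbase and cache flags share one byte.
struct CoinMock {
    TxOutMock out;
    uint32_t nHeight;
    uint8_t coinbase_and_flags;
};

struct CacheEntryMock {
    CoinMock coin;
};

using ValueTypeMock = std::pair<const OutPointMock, CacheEntryMock>;

static_assert(sizeof(OutPointMock) == 36, "32 + 4");
static_assert(sizeof(TxOutMock) == 40, "8 + 32");
static_assert(sizeof(CoinMock) == 48, "40 + 4 + 1 + 3 bytes padding");
static_assert(sizeof(ValueTypeMock) == 88, "36 + 4 bytes padding + 48");

// Percentage of additional entries that fit in the same memory when each
// entry shrinks from old_size to new_size bytes (ignores allocator overhead).
inline double ExtraCapacityPercent(std::size_t old_size, std::size_t new_size) {
    return (static_cast<double>(old_size) / static_cast<double>(new_size) - 1.0) * 100.0;
}
```

ExtraCapacityPercent(96, 88) comes out at roughly 9.1, and ExtraCapacityPercent(96, 76) at roughly 26.3, matching the figures quoted in the thread. Real gains also depend on per-node allocator and hash-table overhead, which this ignores.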

DrahtBot commented at 8:57 am on February 11, 2020: member

DrahtBot added the label Needs rebase on Feb 11, 2020

martinus commented at 10:45 pm on February 18, 2020: contributor

I’ll close this PR: as @ajtowns noted, it illegally aligns CScript to 4 bytes to get it down to 28 bytes.

martinus closed this on Feb 18, 2020

martinus referenced this in commit a0b1f54c19 on Aug 14, 2021

martinus referenced this in commit a6dc21eace on Aug 14, 2021

martinus referenced this in commit 0781ea49a2 on Aug 14, 2021

martinus referenced this in commit 8430523512 on Aug 15, 2021

martinus referenced this in commit 0bc3d3825f on Aug 27, 2021

martinus referenced this in commit 7fd45dc4df on Aug 27, 2021

DrahtBot locked this on Feb 15, 2022
