The old code needed 365ns for 256 byte input on an i7-6920HQ CPU, this commit needs 268ns for the same 256 byte input. This is a speedup of about 27%.
Add a test and optimize HexStr() by 27% #23364
pull denis2342 wants to merge 2 commits into bitcoin:master from denis2342:master changing 3 files +88 −8-
denis2342 commented at 9:10 PM on October 26, 2021: none
-
test: Add four tests for HexStr() in util/strencodings.cpp 497f20a9a9
-
7f561b7b86
util: optimize HexStr() by 27% with a larger table and 16-bit writes
The old code needed 365ns for 256 byte input on an i7-6920HQ CPU, this commit needs 268ns for the same 256 byte input. This is a speedup of about 27%.
- DrahtBot added the label Build system on Oct 26, 2021
- DrahtBot added the label Utils/log/libs on Oct 26, 2021
-
madars commented at 10:31 PM on October 26, 2021: none
Could you PR the benchmark used for this? It would be interesting to compare it with bit-twiddling approaches; e.g. g++/clang will SSE-vectorize code that uses something like
*it++ = ((v >> 4) < 10 ? '0' + (v >> 4) : 'a' + (v >> 4) - 10); *it++ = ((v & 0xF) < 10 ? '0' + (v & 0xF) : 'a' + (v & 0xF) - 10);(I'm clearly assuming ASCII here, but so does the current implementation of
HexDigit) -
sipa commented at 10:32 PM on October 26, 2021: member
I'm not entirely sure, but my intuition says that accessing a char array through a uint16_t pointer is undefined behavior.
-
denis2342 commented at 11:29 PM on October 26, 2021: none
Could you PR the benchmark used for this? It would be interesting to compare it with bit-twiddling approaches; e.g. g++/clang will SSE-vectorize code that uses something like
*it++ = ((v >> 4) < 10 ? '0' + (v >> 4) : 'a' + (v >> 4) - 10); *it++ = ((v & 0xF) < 10 ? '0' + (v & 0xF) : 'a' + (v & 0xF) - 10);(I'm clearly assuming ASCII here, but so does the current implementation of
HexDigit)The speedup comes from the 16-bit write.
I'm not entirely sure, but my intuition says that accessing a char array through a uint16_t pointer is undefined behavior.
modifying is undefined behavior.
- denis2342 closed this on Oct 26, 2021
- denis2342 reopened this on Oct 26, 2021
-
denis2342 commented at 12:46 AM on October 27, 2021: none
Could you PR the benchmark used for this?
The times from the commit log where measured with a clock_gettime(CLOCK_MONOTONIC_RAW) in the test. To be sure I patched roc/blockchain.cpp to measure the getblock non verbose call. For a 1.6MB block (706,813 ) unpatched it needed 1.7ms average and 1.2ms average with the new code.
-
theuni commented at 2:08 AM on October 27, 2021: member
modifying is undefined behavior.
memcpywould make this defined while still likely aggressively optimized by a modern compiler, no? -
laanwj commented at 6:31 AM on October 27, 2021: member
Thanks for your contribution. However, from a risk perspective I'd prefer to keep this function as is instead of push boundaries on C++ undefined behavior here. Sorry (I also don't think particular improvements in micro-benchmarks here are as important to the big picture.)
Edit: the test is very welcome though!
-
in src/util/strencodings.cpp:528 in 7f561b7b86
531 | + "606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f" 532 | + "808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9f" 533 | + "a0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebf" 534 | + "c0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedf" 535 | + "e0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff"; 536 | + uint16_t *it = (uint16_t *)rv.data();
laanwj commented at 6:34 AM on October 27, 2021:Please don't do this. You can't assume
.data()is 16-bit aligned.martinus commented at 7:17 AM on October 27, 2021: contributorI've run the benchmark
BlockToJsonVerboseon this and master on my CPU. Just for interest I also tried the non-lookuptable version from @madars, and a fixed variant that shouldn't have any alignment issues:std::string HexStr(const Span<const uint8_t> s) { std::string rv(s.size() * 2, '\0'); static const char* hexmap = "000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f" "202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f" "404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f" "606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f" "808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9f" "a0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebf" "c0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedf" "e0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff"; char* it = rv.data(); for (uint8_t v : s) { std::memcpy(it, hexmap + v * 2, 2); it += 2; } assert(it == rv.data() + rv.size()); return rv; }| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 61,945,723.75 | 16.14 | 0.1% | 480,230,046.38 | 190,496,947.38 | 2.521 | 82,830,913.62 | 0.5% | 5.51 |
BlockToJsonVerbose master| 60,552,789.50 | 16.51 | 0.2% | 464,155,088.33 | 185,292,740.50 | 2.505 | 82,185,538.22 | 0.5% | 5.58 |BlockToJsonVerbose [#23364](/bitcoin-bitcoin/23364/)| 64,150,811.88 | 15.59 | 0.0% | 507,344,048.00 | 197,153,314.75 | 2.573 | 82,831,592.25 | 0.5% | 5.32 | @madars variant | 60,510,153.56 | 16.53 | 0.0% | 464,058,161.00 | 185,538,679.67 | 2.501 | 82,185,536.89 | 0.5% | 5.58 | @martinus variantThe PR is 2.2% faster than master, the non-lookup table of @madars is slower, and my rewrite of the PR seems to produce exactly the same code as the PR (ins/op are 0.02% different)
From what I've seen how
HexStris used I think more could be gained by changing the API to something likeAppendHexStr(const Span<const uint8_t> s, std::string& out);. This would get rid of a lot of temporarystd::strings. Such a change might be less controversial.MarcoFalke removed the label Build system on Oct 27, 2021denis2342 commented at 9:43 AM on October 27, 2021: noneBlockToJsonVerbose tests a lot of code which is not part of this PR. Maybe there should be a benchmark which is more like getblock(). I made a benchmark like this:
static void BlockToHex(benchmark::Bench& bench) { TestBlockAndIndex data; CDataStream ssBlock(SER_NETWORK, PROTOCOL_VERSION); ssBlock << data.block; bench.run([&] { auto univalue = HexStr(ssBlock); ankerl::nanobench::doNotOptimizeAway(univalue); }); } BENCHMARK(BlockToHex);I got the following results:
| ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 390,514.00 | 2,560.73 | 0.0% | 0.01 |
BlockToHex_PR| ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 899,745.00 | 1,111.43 | 0.0% | 0.01 |
BlockToHex_HEADmartinus commented at 11:50 AM on October 27, 2021: contributor@denis2342 Ah, I didn't know that full blocks are also converted to hex. Makes sense to add a benchmark for that.
laanwj commented at 3:59 PM on November 26, 2021: membermemcpy would make this defined while still likely aggressively optimized by a modern compiler, no?
We can try this, at least it would help move this forward. As it is now it cannot be merged.
martinus referenced this in commit 25aad0c5b8 on Mar 23, 2022martinus referenced this in commit 1f6a0da4f1 on Mar 23, 2022fanquake closed this on Mar 24, 2022laanwj referenced this in commit fe6a299fc0 on May 4, 2022sidhujag referenced this in commit 3f7e7668f7 on May 4, 2022DrahtBot locked this on Apr 14, 2023Labels
This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-13 18:14 UTC
More mirrored repositories can be found on mirror.b10c.me