Add a test and optimize HexStr() by 27% #23364

denis2342 commented at 9:10 PM on October 26, 2021: none

The old code needed 365ns for 256 byte input on an i7-6920HQ CPU, this commit needs 268ns for the same 256 byte input. This is a speedup of about 27%.

test: Add four tests for HexStr() in util/strencodings.cpp 497f20a9a9

util: optimize HexStr() by 27% with a larger table and 16-bit writes

The old code needed 365ns for 256 byte input on an i7-6920HQ CPU,
this commit needs 268ns for the same 256 byte input.
This is a speedup of about 27%.

7f561b7b86

DrahtBot added the label Build system on Oct 26, 2021

DrahtBot added the label Utils/log/libs on Oct 26, 2021

madars commented at 10:31 PM on October 26, 2021: none

Could you PR the benchmark used for this? It would be interesting to compare it with bit-twiddling approaches; e.g. g++/clang will SSE-vectorize code that uses something like

    *it++ = ((v >> 4) < 10 ? '0' + (v >> 4) : 'a' + (v >> 4) - 10);
    *it++ = ((v & 0xF) < 10 ? '0' + (v & 0xF) : 'a' + (v & 0xF) - 10);

(I'm clearly assuming ASCII here, but so does the current implementation of HexDigit)

sipa commented at 10:32 PM on October 26, 2021: member

I'm not entirely sure, but my intuition says that accessing a char array through a uint16_t pointer is undefined behavior.

denis2342 commented at 11:29 PM on October 26, 2021: none

Could you PR the benchmark used for this? It would be interesting to compare it with bit-twiddling approaches; e.g. g++/clang will SSE-vectorize code that uses something like
    *it++ = ((v >> 4) < 10 ? '0' + (v >> 4) : 'a' + (v >> 4) - 10);
    *it++ = ((v & 0xF) < 10 ? '0' + (v & 0xF) : 'a' + (v & 0xF) - 10);
(I'm clearly assuming ASCII here, but so does the current implementation of HexDigit)

The speedup comes from the 16-bit write.

I'm not entirely sure, but my intuition says that accessing a char array through a uint16_t pointer is undefined behavior.

modifying is undefined behavior.

denis2342 closed this on Oct 26, 2021

denis2342 reopened this on Oct 26, 2021

denis2342 commented at 12:46 AM on October 27, 2021: none

Could you PR the benchmark used for this?

The times from the commit log where measured with a clock_gettime(CLOCK_MONOTONIC_RAW) in the test. To be sure I patched roc/blockchain.cpp to measure the getblock non verbose call. For a 1.6MB block (706,813 ) unpatched it needed 1.7ms average and 1.2ms average with the new code.

theuni commented at 2:08 AM on October 27, 2021: member

modifying is undefined behavior.

memcpy would make this defined while still likely aggressively optimized by a modern compiler, no?

laanwj commented at 6:31 AM on October 27, 2021: member

Thanks for your contribution. However, from a risk perspective I'd prefer to keep this function as is instead of push boundaries on C++ undefined behavior here. Sorry (I also don't think particular improvements in micro-benchmarks here are as important to the big picture.)

Edit: the test is very welcome though!

in src/util/strencodings.cpp:528 in 7f561b7b86

 531 | +        "606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f"
 532 | +        "808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9f"
 533 | +        "a0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebf"
 534 | +        "c0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedf"
 535 | +        "e0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff";
 536 | +    uint16_t *it = (uint16_t *)rv.data();

laanwj commented at 6:34 AM on October 27, 2021:

Please don't do this. You can't assume .data() is 16-bit aligned.

martinus commented at 7:17 AM on October 27, 2021: contributor

I've run the benchmark BlockToJsonVerbose on this and master on my CPU. Just for interest I also tried the non-lookuptable version from @madars, and a fixed variant that shouldn't have any alignment issues:

std::string HexStr(const Span<const uint8_t> s)
{
    std::string rv(s.size() * 2, '\0');
    static const char* hexmap =
        "000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f"
        "202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f"
        "404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f"
        "606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f"
        "808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9f"
        "a0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebf"
        "c0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedf"
        "e0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff";
    char* it = rv.data();

    for (uint8_t v : s) {
        std::memcpy(it, hexmap + v * 2, 2);
        it += 2;
    }

    assert(it == rv.data() + rv.size());
    return rv;
}

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 61,945,723.75 | 16.14 | 0.1% | 480,230,046.38 | 190,496,947.38 | 2.521 | 82,830,913.62 | 0.5% | 5.51 | BlockToJsonVerbose master | 60,552,789.50 | 16.51 | 0.2% | 464,155,088.33 | 185,292,740.50 | 2.505 | 82,185,538.22 | 0.5% | 5.58 | BlockToJsonVerbose [#23364](/bitcoin-bitcoin/23364/) | 64,150,811.88 | 15.59 | 0.0% | 507,344,048.00 | 197,153,314.75 | 2.573 | 82,831,592.25 | 0.5% | 5.32 | @madars variant | 60,510,153.56 | 16.53 | 0.0% | 464,058,161.00 | 185,538,679.67 | 2.501 | 82,185,536.89 | 0.5% | 5.58 | @martinus variant

The PR is 2.2% faster than master, the non-lookup table of @madars is slower, and my rewrite of the PR seems to produce exactly the same code as the PR (ins/op are 0.02% different)

From what I've seen how HexStr is used I think more could be gained by changing the API to something like AppendHexStr(const Span<const uint8_t> s, std::string& out);. This would get rid of a lot of temporary std::strings. Such a change might be less controversial.

MarcoFalke removed the label Build system on Oct 27, 2021

denis2342 commented at 9:43 AM on October 27, 2021: none

BlockToJsonVerbose tests a lot of code which is not part of this PR. Maybe there should be a benchmark which is more like getblock(). I made a benchmark like this:

static void BlockToHex(benchmark::Bench& bench)
{
    TestBlockAndIndex data;
    CDataStream ssBlock(SER_NETWORK, PROTOCOL_VERSION);
    ssBlock << data.block;
    bench.run([&] {
        auto univalue = HexStr(ssBlock);
        ankerl::nanobench::doNotOptimizeAway(univalue);
    });
}
BENCHMARK(BlockToHex);

I got the following results:

| ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 390,514.00 | 2,560.73 | 0.0% | 0.01 | BlockToHex_PR

| ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 899,745.00 | 1,111.43 | 0.0% | 0.01 | BlockToHex_HEAD

martinus commented at 11:50 AM on October 27, 2021: contributor

@denis2342 Ah, I didn't know that full blocks are also converted to hex. Makes sense to add a benchmark for that.

laanwj commented at 3:59 PM on November 26, 2021: member

memcpy would make this defined while still likely aggressively optimized by a modern compiler, no?

We can try this, at least it would help move this forward. As it is now it cannot be merged.

martinus referenced this in commit 25aad0c5b8 on Mar 23, 2022

martinus referenced this in commit 1f6a0da4f1 on Mar 23, 2022

fanquake commented at 3:38 PM on March 24, 2022: member

Going to close this for now. Let us know if you'd like to continue working on it, and address comments, and it can be re-opened. Looks like @martinus may also be working on a follow up.

fanquake closed this on Mar 24, 2022

fanquake commented at 7:42 AM on April 14, 2022: member

Follow up in #24852.

laanwj referenced this in commit fe6a299fc0 on May 4, 2022

sidhujag referenced this in commit 3f7e7668f7 on May 4, 2022

DrahtBot locked this on Apr 14, 2023