Fix a violation of C++ standard rules where unions are used for type-punning #18167

TheQuantumPhysicist commented at 8:15 PM on February 17, 2020: contributor

Type punning in C++ is not like C. As per the C++ standard, one cannot use unions to convert the bit type. A discussion about this can be found here. In C++, a union is supposed to only hold one type at a time. It's intended to be used only as std::variant. Switching types is undefined behavior.

In fact, C++20 has a special casting function, called bit_cast that solved this problem.

Why has it been working so far? Because some compilers tolerate using unions and switching types, like gcc. More information here.

One important thing to mention is that performance is generally not affected by that memcpy. Compilers are smart enough to convert that to a memory cast when possible. But we have to do it the right way, otherwise, it's jut undefined behavior that depends on the compiler.

Fix a violation of C++ standard rules that unions cannot be switched. be94096dfb

practicalswift commented at 10:08 PM on February 17, 2020: contributor

Great first-time contribution! Welcome as a contributor! Hope to see more contributions from you.

Thanks for tackling UB issues in the project. The bulk of them should be fixed by now and this is one of the last few UB issues I'm aware of. Don't hesitate to report and/or fix any other UB issues you might find and don't hesitate to ping me if you want your work reviewed.

Concept ACK

FWIW:

The C++ Core Guidelines – with editors Bjarne Stroustrup and Herb Sutter – has this to say: "It is undefined behavior to read a union member with a different type from the one with which it was written."
The C++ Core Guidelines has this to say: "Unfortunately, unions are commonly used for type punning. We don’t consider “sometimes, it works as expected” a strong argument."
Professor John Regehr has this to say: "Again, although we might expect that adding a function call would make the code harder to optimize, both compilers understand memcpy deeply enough that we get the desired object code […] In my opinion c5 is the easiest code to understand out of this little batch of functions because it doesn’t do the messy shifting and also it is totally, completely, obviously free of complications that might arise from the confusing rules for unions and strict aliasing. It became my preferred idiom for type punning a few years ago when I discovered that compilers could see through the memcpy and generate the right code."
Video recommendation: Timur Doumler's talk “Type punning in modern C++” from CppCon 2019

practicalswift commented at 2:32 PM on February 18, 2020: contributor

ACK be94096dfb0c4862e2314cbae4120d7360b08ef2

Verified that clang++ -O1 (trunk) generates exactly the same object code before and after the change (as expected) for each of the changed functions:

<details><summary>Click for results</summary>

#include <cstdint>
#include <cstring>

uint64_t ser_double_to_uint64_old(double x)
{
    union { double x; uint64_t y; } tmp;
    tmp.x = x;
    return tmp.y;
}

uint64_t ser_double_to_uint64_new(double x)
{
    uint64_t tmp;
    std::memcpy(&tmp, &x, sizeof(x));
    return tmp;
}

uint32_t ser_float_to_uint32_old(float x)
{
    union { float x; uint32_t y; } tmp;
    tmp.x = x;
    return tmp.y;
}

uint32_t ser_float_to_uint32_new(float x)
{
    uint32_t tmp;
    std::memcpy(&tmp, &x, sizeof(x));
    return tmp;
}

double ser_uint64_to_double_old(uint64_t y)
{
    union { double x; uint64_t y; } tmp;
    tmp.y = y;
    return tmp.x;
}

double ser_uint64_to_double_new(uint64_t y)
{
    double tmp;
    std::memcpy(&tmp, &y, sizeof(y));
    return tmp;
}

float ser_uint32_to_float_old(uint32_t y)
{
    union { float x; uint32_t y; } tmp;
    tmp.y = y;
    return tmp.x;
}

float ser_uint32_to_float_new(uint32_t y)
{
    float tmp;
    std::memcpy(&tmp, &y, sizeof(y));
    return tmp;
}

…

ser_double_to_uint64_old(double):          # [@ser](/bitcoin-bitcoin/contributor/ser/)_double_to_uint64_old(double)
        movq    rax, xmm0
        ret
ser_double_to_uint64_new(double):          # [@ser](/bitcoin-bitcoin/contributor/ser/)_double_to_uint64_new(double)
        movq    rax, xmm0
        ret
ser_float_to_uint32_old(float):           # [@ser](/bitcoin-bitcoin/contributor/ser/)_float_to_uint32_old(float)
        movd    eax, xmm0
        ret
ser_float_to_uint32_new(float):           # [@ser](/bitcoin-bitcoin/contributor/ser/)_float_to_uint32_new(float)
        movd    eax, xmm0
        ret
ser_uint64_to_double_old(unsigned long):          # [@ser](/bitcoin-bitcoin/contributor/ser/)_uint64_to_double_old(unsigned long)
        movq    xmm0, rdi
        ret
ser_uint64_to_double_new(unsigned long):          # [@ser](/bitcoin-bitcoin/contributor/ser/)_uint64_to_double_new(unsigned long)
        movq    xmm0, rdi
        ret
ser_uint32_to_float_old(unsigned int):           # [@ser](/bitcoin-bitcoin/contributor/ser/)_uint32_to_float_old(unsigned int)
        movd    xmm0, edi
        ret
ser_uint32_to_float_new(unsigned int):           # [@ser](/bitcoin-bitcoin/contributor/ser/)_uint32_to_float_new(unsigned int)
        movd    xmm0, edi
        ret

</details>

luke-jr commented at 3:45 AM on February 19, 2020: member

~0

Seems like the better solution here is to stop assuming floats have a specific internal representation? :/

sipa commented at 3:48 AM on February 19, 2020: member

@luke-jr That would indeed be preferable, but it seems that would effectively require implementing a IEEE 754 software encoder/decoder, which is nontrivial.

This is still an improvement though. memcpy into another type seems indeed to be the modern idiom (we use the same in ReadLE32 and friends, after verifying that compilers indeed optimize through that).

Concept ACK

in src/serialize.h:144 in be94096dfb outdated

 139 | @@ -139,27 +140,27 @@ template<typename Stream> inline uint64_t ser_readdata64(Stream &s)
 140 |  }
 141 |  inline uint64_t ser_double_to_uint64(double x)
 142 |  {
 143 | -    union { double x; uint64_t y; } tmp;
 144 | -    tmp.x = x;
 145 | -    return tmp.y;
 146 | +    uint64_t tmp;
 147 | +    std::memcpy(&tmp, &x, sizeof(x));

elichai commented at 12:26 PM on February 19, 2020:

Maybe add static_assert(sizeof(tmp) == sizeof(x), "double and uint64_t assumed to have the same size");? (this might be redundant with other places but I think it's also helpful as documentation)

(Same for the rest of the functions)

TheQuantumPhysicist commented at 1:43 PM on February 19, 2020:

I swear I wanted to do that, but then I was afraid people would find it paranoid 🤐

You sure I should go for that?

practicalswift commented at 2:01 PM on February 19, 2020:

Explicitly stating assumptions is good and static_assert:s cannot hurt :) Go for it! :)

FWIW:

https://github.com/bitcoin/bitcoin/blob/68e841e0af223a220d1f037e4c5680c1b228aa3e/src/compat/assumptions.h#L43-L46

TheQuantumPhysicist commented at 5:46 PM on February 19, 2020:

Done.

TheQuantumPhysicist commented at 1:46 PM on February 19, 2020: contributor

@luke-jr That would indeed be preferable, but it seems that would effectively require implementing a IEEE 754 software encoder/decoder, which is nontrivial.

There's a famous library, called softfloat. It does all that. But I think that's unnecessary. IEEE 754 is the same everywhere as repsentation. The differences between different CPUs come from certain operations, like rounding and division by epsilon (IIRC). You can see in softfloat different rounding modes and similar things.

Add static_asserts to ser_X_to_Y() methods 0653939ac1

practicalswift commented at 7:02 PM on February 19, 2020: contributor

ACK 0653939ac130eddffe40c53ac418bea305d3bf82

elichai commented at 8:26 PM on February 19, 2020: contributor

ACK 0653939ac130eddffe40c53ac418bea305d3bf82

kristapsk approved

kristapsk commented at 1:25 AM on February 21, 2020: contributor

ACK 0653939ac130eddffe40c53ac418bea305d3bf82

laanwj commented at 5:55 PM on February 26, 2020: member

Code review ACK 0653939ac130eddffe40c53ac418bea305d3bf82

laanwj commented at 5:59 PM on February 26, 2020: member

That would indeed be preferable, but it seems that would effectively require implementing a IEEE 754 software encoder/decoder, which is nontrivial.

Please, don't. If you're really willing to pick up a huge amount of work to support obscure platforms, I would prefer to move away from serializing floating point numbers at all.I tried this once but there's quite some usage (luckily, not in the P2P or consensus, only for internal file formats, so it's not impossible at least!).

laanwj merged this on Feb 26, 2020

laanwj closed this on Feb 26, 2020

practicalswift commented at 6:59 PM on February 26, 2020: contributor

Great to see this merged! @TheQuantumPhysicist If you want to tackle the few remaining instances of UB you might want to build with CC=clang CXX=clang++ ./configure --with-sanitizers=undefined && make and fix any violations you are able to trigger :)

TheQuantumPhysicist commented at 7:04 PM on February 26, 2020: contributor

@practicalswift Thanks for the support! Actually I use clang sanitizers all the time. Thread sanitizer, undefined behavior sanitizer, memory sanitizer and address sanitizer. You can see them in all my github projects that I wrote myself. But this one here I discovered by coincidence and hence fixed 😉

And I almost never use gcc anymore 😄

Btw, if you're interested, there's an issue I opened in libbtc about undefined behavior like 2 years ago... it's still untouched and not merged.

MarcoFalke commented at 7:45 PM on February 26, 2020: member

Checked that be94096dfb0c4862e2314cbae4120d7360b08ef2 compiles to the same bitcoind with -O2 on:

clang version 6.0.0-1ubuntu2 (tags/RELEASE_600/final) Target: aarch64-unknown-linux-gnu
clang version 9.0.0 (Fedora 9.0.0-1.fc31) Target: x86_64-unknown-linux-gnu

sidhujag referenced this in commit c1e0c5a2ac on Feb 26, 2020

MarkLTZ cross-referenced this on Apr 4, 2020 from issue Bitcoin PR tracking by MarkLTZ

MarkLTZ referenced this in commit 6f592ec1c2 on Apr 19, 2020

MarkLTZ referenced this in commit a775842ea1 on Apr 19, 2020

sidhujag referenced this in commit 1489ff2028 on Nov 10, 2020

deadalnix referenced this in commit 6421ff497a on Jan 4, 2021

furszy cross-referenced this on Jun 7, 2021 from issue Road to Tor v3 support (BIP155) by furszy

furszy cross-referenced this on Jun 8, 2021 from issue [Backport] Serialization framework updates by furszy

furszy referenced this in commit 5c93f159bc on Jul 5, 2021

random-zebra referenced this in commit b4751e10ce on Aug 11, 2021

bitcoin locked this on Feb 15, 2022