Fix a violation of C++ standard rules where unions are used for type-punning #18167

pull TheQuantumPhysicist wants to merge 2 commits into bitcoin:master from TheQuantumPhysicist:master changing 1 files +17 −12
  1. TheQuantumPhysicist commented at 8:15 PM on February 17, 2020: contributor

    Type punning in C++ is not like C. As per the C++ standard, one cannot use unions to convert the bit type. A discussion about this can be found here. In C++, a union is supposed to only hold one type at a time. It's intended to be used only as std::variant. Switching types is undefined behavior.

    In fact, C++20 has a special casting function, called bit_cast that solved this problem.

    Why has it been working so far? Because some compilers tolerate using unions and switching types, like gcc. More information here.

    One important thing to mention is that performance is generally not affected by that memcpy. Compilers are smart enough to convert that to a memory cast when possible. But we have to do it the right way, otherwise, it's jut undefined behavior that depends on the compiler.

  2. Fix a violation of C++ standard rules that unions cannot be switched. be94096dfb
  3. practicalswift commented at 10:08 PM on February 17, 2020: contributor

    Great first-time contribution! Welcome as a contributor! Hope to see more contributions from you.

    Thanks for tackling UB issues in the project. The bulk of them should be fixed by now and this is one of the last few UB issues I'm aware of. Don't hesitate to report and/or fix any other UB issues you might find and don't hesitate to ping me if you want your work reviewed.

    Concept ACK

    FWIW:

    1. The C++ Core Guidelines – with editors Bjarne Stroustrup and Herb Sutter – has this to say: "It is undefined behavior to read a union member with a different type from the one with which it was written."

    2. The C++ Core Guidelines has this to say: "Unfortunately, unions are commonly used for type punning. We don’t consider “sometimes, it works as expected” a strong argument."

    3. Professor John Regehr has this to say: "Again, although we might expect that adding a function call would make the code harder to optimize, both compilers understand memcpy deeply enough that we get the desired object code […] In my opinion c5 is the easiest code to understand out of this little batch of functions because it doesn’t do the messy shifting and also it is totally, completely, obviously free of complications that might arise from the confusing rules for unions and strict aliasing. It became my preferred idiom for type punning a few years ago when I discovered that compilers could see through the memcpy and generate the right code."

    4. Video recommendation: Timur Doumler's talk “Type punning in modern C++” from CppCon 2019

  4. practicalswift commented at 2:32 PM on February 18, 2020: contributor

    ACK be94096dfb0c4862e2314cbae4120d7360b08ef2

    Verified that clang++ -O1 (trunk) generates exactly the same object code before and after the change (as expected) for each of the changed functions:

    <details><summary>Click for results</summary>

    #include <cstdint>
    #include <cstring>
    
    uint64_t ser_double_to_uint64_old(double x)
    {
        union { double x; uint64_t y; } tmp;
        tmp.x = x;
        return tmp.y;
    }
    
    uint64_t ser_double_to_uint64_new(double x)
    {
        uint64_t tmp;
        std::memcpy(&tmp, &x, sizeof(x));
        return tmp;
    }
    
    uint32_t ser_float_to_uint32_old(float x)
    {
        union { float x; uint32_t y; } tmp;
        tmp.x = x;
        return tmp.y;
    }
    
    uint32_t ser_float_to_uint32_new(float x)
    {
        uint32_t tmp;
        std::memcpy(&tmp, &x, sizeof(x));
        return tmp;
    }
    
    double ser_uint64_to_double_old(uint64_t y)
    {
        union { double x; uint64_t y; } tmp;
        tmp.y = y;
        return tmp.x;
    }
    
    double ser_uint64_to_double_new(uint64_t y)
    {
        double tmp;
        std::memcpy(&tmp, &y, sizeof(y));
        return tmp;
    }
    
    float ser_uint32_to_float_old(uint32_t y)
    {
        union { float x; uint32_t y; } tmp;
        tmp.y = y;
        return tmp.x;
    }
    
    float ser_uint32_to_float_new(uint32_t y)
    {
        float tmp;
        std::memcpy(&tmp, &y, sizeof(y));
        return tmp;
    }
    

    ser_double_to_uint64_old(double):          # [@ser](/bitcoin-bitcoin/contributor/ser/)_double_to_uint64_old(double)
            movq    rax, xmm0
            ret
    ser_double_to_uint64_new(double):          # [@ser](/bitcoin-bitcoin/contributor/ser/)_double_to_uint64_new(double)
            movq    rax, xmm0
            ret
    ser_float_to_uint32_old(float):           # [@ser](/bitcoin-bitcoin/contributor/ser/)_float_to_uint32_old(float)
            movd    eax, xmm0
            ret
    ser_float_to_uint32_new(float):           # [@ser](/bitcoin-bitcoin/contributor/ser/)_float_to_uint32_new(float)
            movd    eax, xmm0
            ret
    ser_uint64_to_double_old(unsigned long):          # [@ser](/bitcoin-bitcoin/contributor/ser/)_uint64_to_double_old(unsigned long)
            movq    xmm0, rdi
            ret
    ser_uint64_to_double_new(unsigned long):          # [@ser](/bitcoin-bitcoin/contributor/ser/)_uint64_to_double_new(unsigned long)
            movq    xmm0, rdi
            ret
    ser_uint32_to_float_old(unsigned int):           # [@ser](/bitcoin-bitcoin/contributor/ser/)_uint32_to_float_old(unsigned int)
            movd    xmm0, edi
            ret
    ser_uint32_to_float_new(unsigned int):           # [@ser](/bitcoin-bitcoin/contributor/ser/)_uint32_to_float_new(unsigned int)
            movd    xmm0, edi
            ret
    

    </details>

  5. luke-jr commented at 3:45 AM on February 19, 2020: member

    ~0

    Seems like the better solution here is to stop assuming floats have a specific internal representation? :/

  6. sipa commented at 3:48 AM on February 19, 2020: member

    @luke-jr That would indeed be preferable, but it seems that would effectively require implementing a IEEE 754 software encoder/decoder, which is nontrivial.

    This is still an improvement though. memcpy into another type seems indeed to be the modern idiom (we use the same in ReadLE32 and friends, after verifying that compilers indeed optimize through that).

    Concept ACK

  7. in src/serialize.h:144 in be94096dfb outdated
     139 | @@ -139,27 +140,27 @@ template<typename Stream> inline uint64_t ser_readdata64(Stream &s)
     140 |  }
     141 |  inline uint64_t ser_double_to_uint64(double x)
     142 |  {
     143 | -    union { double x; uint64_t y; } tmp;
     144 | -    tmp.x = x;
     145 | -    return tmp.y;
     146 | +    uint64_t tmp;
     147 | +    std::memcpy(&tmp, &x, sizeof(x));
    


    elichai commented at 12:26 PM on February 19, 2020:

    Maybe add static_assert(sizeof(tmp) == sizeof(x), "double and uint64_t assumed to have the same size");? (this might be redundant with other places but I think it's also helpful as documentation)

    (Same for the rest of the functions)


    TheQuantumPhysicist commented at 1:43 PM on February 19, 2020:

    I swear I wanted to do that, but then I was afraid people would find it paranoid 🤐

    You sure I should go for that?


    practicalswift commented at 2:01 PM on February 19, 2020:

    Explicitly stating assumptions is good and static_assert:s cannot hurt :) Go for it! :)

    FWIW:

    https://github.com/bitcoin/bitcoin/blob/68e841e0af223a220d1f037e4c5680c1b228aa3e/src/compat/assumptions.h#L43-L46


    TheQuantumPhysicist commented at 5:46 PM on February 19, 2020:

    Done.

  8. TheQuantumPhysicist commented at 1:46 PM on February 19, 2020: contributor

    @luke-jr That would indeed be preferable, but it seems that would effectively require implementing a IEEE 754 software encoder/decoder, which is nontrivial.

    There's a famous library, called softfloat. It does all that. But I think that's unnecessary. IEEE 754 is the same everywhere as repsentation. The differences between different CPUs come from certain operations, like rounding and division by epsilon (IIRC). You can see in softfloat different rounding modes and similar things.

  9. Add static_asserts to ser_X_to_Y() methods 0653939ac1
  10. practicalswift commented at 7:02 PM on February 19, 2020: contributor

    ACK 0653939ac130eddffe40c53ac418bea305d3bf82

  11. elichai commented at 8:26 PM on February 19, 2020: contributor

    ACK 0653939ac130eddffe40c53ac418bea305d3bf82

  12. kristapsk approved
  13. kristapsk commented at 1:25 AM on February 21, 2020: contributor

    ACK 0653939ac130eddffe40c53ac418bea305d3bf82

  14. laanwj commented at 5:55 PM on February 26, 2020: member

    Code review ACK 0653939ac130eddffe40c53ac418bea305d3bf82

  15. laanwj commented at 5:59 PM on February 26, 2020: member

    That would indeed be preferable, but it seems that would effectively require implementing a IEEE 754 software encoder/decoder, which is nontrivial.

    Please, don't. If you're really willing to pick up a huge amount of work to support obscure platforms, I would prefer to move away from serializing floating point numbers at all.I tried this once but there's quite some usage (luckily, not in the P2P or consensus, only for internal file formats, so it's not impossible at least!).

  16. laanwj merged this on Feb 26, 2020
  17. laanwj closed this on Feb 26, 2020

  18. practicalswift commented at 6:59 PM on February 26, 2020: contributor

    Great to see this merged! @TheQuantumPhysicist If you want to tackle the few remaining instances of UB you might want to build with CC=clang CXX=clang++ ./configure --with-sanitizers=undefined && make and fix any violations you are able to trigger :)

  19. TheQuantumPhysicist commented at 7:04 PM on February 26, 2020: contributor

    @practicalswift Thanks for the support! Actually I use clang sanitizers all the time. Thread sanitizer, undefined behavior sanitizer, memory sanitizer and address sanitizer. You can see them in all my github projects that I wrote myself. But this one here I discovered by coincidence and hence fixed 😉

    And I almost never use gcc anymore 😄

    Btw, if you're interested, there's an issue I opened in libbtc about undefined behavior like 2 years ago... it's still untouched and not merged.

  20. MarcoFalke commented at 7:45 PM on February 26, 2020: member

    Checked that be94096dfb0c4862e2314cbae4120d7360b08ef2 compiles to the same bitcoind with -O2 on:

    • clang version 6.0.0-1ubuntu2 (tags/RELEASE_600/final) Target: aarch64-unknown-linux-gnu

    • clang version 9.0.0 (Fedora 9.0.0-1.fc31) Target: x86_64-unknown-linux-gnu

  21. sidhujag referenced this in commit c1e0c5a2ac on Feb 26, 2020
  22. MarkLTZ referenced this in commit 6f592ec1c2 on Apr 19, 2020
  23. MarkLTZ referenced this in commit a775842ea1 on Apr 19, 2020
  24. sidhujag referenced this in commit 1489ff2028 on Nov 10, 2020
  25. deadalnix referenced this in commit 6421ff497a on Jan 4, 2021
  26. furszy referenced this in commit 5c93f159bc on Jul 5, 2021
  27. random-zebra referenced this in commit b4751e10ce on Aug 11, 2021
  28. DrahtBot locked this on Feb 15, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-29 03:14 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me