test: Correctly decode UTF-8 literal string paths #24469

pull ryanofsky wants to merge 1 commits into bitcoin:master from ryanofsky:pr/testu changing 1 files +3 −3
  1. ryanofsky commented at 7:34 pm on March 3, 2022: member

    Call fs::u8path() to convert some UTF-8 string literals to paths, instead of relying on the implicit conversion. MarcoFalke pointed out in #24306 (review) that fs_tests are incorrectly decoding some literal UTF-8 paths using the current Windows codepage, instead of treating them as UTF-8. This could cause test failures depending on what environment the Windows tests are run under.

    The fs::path class exists to avoid problems like this, but because it is lenient with const char* conversions, under the assumption that they are “safe as long as the literals are ASCII”, bugs like this are still possible.
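To make the distinction concrete, here is a minimal sketch using only the standard library. It assumes that fs::u8path() behaves like std::filesystem::u8path(); the helper names Utf8ToPath and PathToUtf8 are hypothetical and do not come from the PR.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into a path explicitly,
// rather than through the native narrow encoding (the code page on Windows).
// Note: std::filesystem::u8path is deprecated, but still available, in C++20.
inline std::filesystem::path Utf8ToPath(const std::string& utf8)
{
    return std::filesystem::u8path(utf8);
}

// Hypothetical helper: return the path as UTF-8 bytes in a std::string.
// u8string() returns std::string in C++17 and std::u8string in C++20, so the
// bytes are copied through an iterator range to compile under both.
inline std::string PathToUtf8(const std::filesystem::path& p)
{
    const auto u8 = p.u8string();
    return std::string(u8.begin(), u8.end());
}
```

On POSIX systems the implicit and explicit conversions typically leave the bytes unchanged, so the difference is only expected to show up on Windows, where a path constructed directly from a narrow string is decoded with the native narrow encoding.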

    If we think this is a concern, followup options to try to prevent this bug in the future are:

    0. Do nothing
    1. Improve the “safe as long as the literals are ASCII” comment. Make it clear that non-ASCII strings are invalid.
    2. Drop the implicit const char* conversion functions. This would be nice because it would simplify the fs::path class a little, while making it safer. The drawback is that it would require some more verbosity from callers. For example, instead of GetDataDirNet() / "mempool.dat" they would have to write GetDataDirNet() / fs::u8path("mempool.dat")
    3. Keep the implicit const char* conversion functions, but make them call fs::u8path() internally. Change the “safe as long as the literals are ASCII” comment to “safe as long as the literals are UTF-8”.

    I’d be happy with 0, 1, or 2. I’d be a little resistant to 3 even though it would add more safety, because it would slightly increase complexity, and because I think it would encourage representing paths as strings. There are so many footguns associated with paths as strings that it’s best to convert strings to paths at the earliest point possible, and convert paths to strings at the latest point possible.
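The idea of keeping literal appends terse while always decoding them as UTF-8 could look roughly like the sketch below. Utf8Path is a hypothetical standalone wrapper written for illustration; it is not Bitcoin Core's fs::path and makes no claim about how that class is implemented.

```cpp
#include <cassert>
#include <filesystem>
#include <utility>

// Hypothetical wrapper (not Bitcoin Core's fs::path): appending a string
// literal always decodes it as UTF-8 through std::filesystem::u8path, so the
// result does not depend on the current Windows code page.
class Utf8Path
{
public:
    explicit Utf8Path(std::filesystem::path p) : m_path(std::move(p)) {}

    // Append a UTF-8 encoded component, e.g. base / "mempool.dat".
    Utf8Path operator/(const char* utf8_component) const
    {
        return Utf8Path(m_path / std::filesystem::u8path(utf8_component));
    }

    const std::filesystem::path& std_path() const { return m_path; }

private:
    std::filesystem::path m_path;
};
```

With this shape, callers keep writing base / "mempool.dat", and the UTF-8 decoding is hidden inside the operator. That convenience is also the trade-off raised above: strings remain easy to pass around as path components.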

  2. test: Correctly decode UTF-8 literal string paths
    Call fs::u8path to convert some UTF-8 string literals to paths, instead
    of relying on implicit conversions. The implicit conversions incorrectly
    decode const char* paths using the current windows codepage, instead of
    treating them as UTF-8. This could cause test failures depending what
    environment windows tests are run in.
    
    Issue was reported by MarcoFalke <falke.marco@gmail.com> in
    https://github.com/bitcoin/bitcoin/pull/24306#discussion_r818566106
    2f5fd3cf92
  3. DrahtBot added the label Tests on Mar 3, 2022
  4. w0xlt approved
  5. w0xlt commented at 9:26 pm on March 3, 2022: contributor

    crACK 2f5fd3c

    I think the second option is good. While it requires more verbosity from callers, it also makes it explicit that the object must be an fs::path instance, not a string.

  6. hebasto commented at 6:23 pm on March 5, 2022: member
    Concept ACK. My only concern is about the future maintainability of the codebase, as the suggested changes, while correct, are not enforced by a test or by the fs::path interface. So I lean toward option 2.
  7. shaavan commented at 3:31 pm on March 6, 2022: contributor

    Concept ACK

    • I think the potential risk of inconsistent behavior of fs_tests based on the different environments is a bug severe enough to be solved. Hence I think options 0 and 1 are not feasible.
    • At first glance, I found option 3 more appealing, as it reduces the need for verbose arguments. However, after reading the argument against option 3, I am convinced it is not the way forward. We do not want to encourage representing paths as strings.
    • I think option 2 is the way to go. Though it makes arguments verbose, it would ensure there is no risk of wrongly interpreting a non-ASCII char as an ASCII one. It would also encourage developers to convert strings to paths as soon as possible, without relying on internal conversions.
  8. ryanofsky commented at 5:27 pm on March 7, 2022: member

    If we think this is a concern, followup options to try to prevent this bug in the future are:

    There are also other followup options beyond what’s listed above. Since the goal of providing const char* helpers is to make it possible to write simple datadir / "indexes" / "blockfilter" and file + ".ext" expressions, another followup option might be to use some constexpr magic and make it a compile-time error to append non-ASCII path literals. Or yet another possible followup option could be to overhaul the fs::path implementation so it whitelists operations instead of blacklisting them, as described in #24493 (comment)
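One possible shape for the "constexpr magic" is a compile-time check over the bytes of a literal, sketched below. IsAsciiLiteral is a hypothetical name, and this only shows the check itself. Turning it into a hard error at every append site would need further machinery (for example C++20 consteval or a literal suffix), which is not shown here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical compile-time check: true only if every byte of the string
// literal (including the terminating NUL) is in the 7-bit ASCII range.
template <std::size_t N>
constexpr bool IsAsciiLiteral(const char (&literal)[N])
{
    for (std::size_t i = 0; i < N; ++i) {
        if (static_cast<unsigned char>(literal[i]) > 0x7F) return false;
    }
    return true;
}

// Evaluated by the compiler: a plain ASCII literal passes, while a literal
// containing UTF-8 encoded non-ASCII bytes is detected.
static_assert(IsAsciiLiteral("mempool.dat"), "ASCII literal is accepted");
static_assert(!IsAsciiLiteral("caf\xc3\xa9"), "non-ASCII bytes are detected");
```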

  9. luke-jr commented at 5:29 am on March 8, 2022: member
    Or something like GetDataDirNet() / "mempool.dat"_u8p
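A literal suffix along these lines could be sketched as follows. The suffix name _u8p is taken from the suggestion above; its definition here is an assumption built on std::filesystem::u8path and is not code from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>

// Hypothetical user-defined literal: decode the literal's bytes as UTF-8 and
// return a path. Definition assumed for illustration only.
inline std::filesystem::path operator""_u8p(const char* str, std::size_t len)
{
    return std::filesystem::u8path(str, str + len);
}
```

A caller would then write something like datadir / "mempool.dat"_u8p, which keeps the expression short while making the UTF-8 decoding visible at the call site.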
  10. laanwj commented at 11:46 am on March 10, 2022: member
    I’m not sure non-ASCII path literals are something that needs a lot of special thought (or a special syntax). It’s not something that we’re likely to do except for testing. Unicode paths will generally come from the system or from the configuration, not our code. This PR looks fine to me. Code review ACK 2f5fd3cf9225aed439d1de767312bb340972d665
  11. MarcoFalke merged this on Mar 10, 2022
  12. MarcoFalke closed this on Mar 10, 2022

  13. sidhujag referenced this in commit 941d3a743e on Mar 11, 2022
  14. MarcoFalke referenced this in commit 12455acca2 on May 3, 2022
  15. DrahtBot locked this on Mar 10, 2023
