fix WIN32 boost::filesystem::path issues when using special chars for datadir path #6093

pull jonasschnelli wants to merge 2 commits into bitcoin:master from jonasschnelli:2015/05/fix_win_boost_path changing 1 files +5 −3
  1. jonasschnelli commented at 11:32 am on May 1, 2015: contributor
    fixes #6078 Meant for 0.11 branch.
  2. don't imbue boost::filesystem::path with locale "C" on windows
    fixes https://github.com/bitcoin/bitcoin/issues/6078
    b3ffcdf916
  3. laanwj commented at 12:15 pm on May 1, 2015: member
    Good catch. I already fixed this once before, people really shouldn’t be touching this ‘magic’ code. Maybe add a comment referring to the issue.
  4. laanwj commented at 12:21 pm on May 1, 2015: member
    This will likely bring back issue #3136 (solved by #5877). You should imbue some locale on WIN32, I’m not sure which one, but indeed C is wrong. Maybe the result of std::locale("")? what does boost lazily initialize it to?
  5. laanwj added the label Bug on May 1, 2015
  6. laanwj added the label Windows on May 1, 2015
  7. laanwj added this to the milestone 0.10.0 on May 1, 2015
  8. jonasschnelli commented at 1:03 pm on May 1, 2015: contributor

    Right. This is not the right approach. Was trying to go down the rabbit hole of win/unicode handling… Can’t find a solution. It seems that it could also be a compiler wchar issue. It looks like that the default locale is C?

    I could also guess that it might be connected to https://github.com/bitcoin/bitcoin/blob/master/src/util.cpp#L672 SHGetSpecialFolderPathA. The A stands for ANSI. There is also a SHGetSpecialFolderPathW equivalent which supports WCHAR.

  9. laanwj commented at 1:48 pm on May 1, 2015: member
    In windows the file system locale depends on the user locale. I think it only works if boost uses the same locale as windows. But I’m not sure anymore how this was determined. I got it to work once, hence it worked in 0.10… (using the W* functions is out of the question unless you use them everywhere and that’d involve a lot of platform specific code as well as widechar to UTF-8 conversion)
  10. dexX7 commented at 1:54 pm on May 1, 2015: contributor

    @jonasschnelli: using the “C” locale to imbue boost::filesystem::path indeed triggers the failure on Windows in this context.

    There is more than one problem tackled by SetupEnvironment():

    1. On POSIX systems, messed up environment settings can cause a failure (Python test).
    2. During the deinitialization, an internal facet pointer of boost::path is deleted, and if boost::path was not initialized by the main thread, it causes failures such as #3136 (and related).

    The current situation:

    1. As default, the “C” locale is used.
    2. On POSIX systems, the environment locale is used (via std::locale("")).
    3. On POSIX systems, to detect bad environment settings, an exception thrown by std::locale("") is caught.
    4. On POSIX systems, and if there are bad environment settings, the “C” locale is used as fallback.
    5. As workaround, to explicitly force the initialization of the internal facet pointer by the main thread, boost::filesystem::path::imbue() is called.
    6. Bad initialization of boost::filesystem::boost::path can result in exceptions, which are currently not caught.
    7. On Windows systems, the “C” locale appearingly does not cover the whole character set, which can result in (6), as shown by #6078.

    In practise, the current 0.10.1 can result in a failure on Russian (and other) Windows, and earlier versions are affected by the deinitialization issue, resulting in faulty Boost test executions, as well as other failures during the shutdown, such as the crash reported in the context of rebuilding the block database.

  11. laanwj commented at 3:47 pm on May 1, 2015: member

    Arguably the new problem is worse, as it prevents people from running the client at all with some characters in their data directories, instead of a rare shutdown race. What about:

     0    std::locale loc("C");
     1    // On most POSIX systems (e.g. Linux, but not BSD) the environment's locale
     2    // may be invalid, in which case the "C" locale is used as fallback.
     3    try {
     4        loc = std::locale(""); // Raises a runtime error if current locale is invalid
     5    } catch (const std::runtime_error&) {
     6#if !defined(WIN32) && !defined(MAC_OSX) && !defined(__FreeBSD__) && !defined(__OpenBSD__)
     7        setenv("LC_ALL", "C", 1);
     8#endif
     9    }
    10    // The path locale is lazy initialized and to avoid deinitialization errors 
    11    // in multithreading environments, it is set explicitly by the main thread.
    12    boost::filesystem::path::imbue(loc);
    

    This would imbue the system locale on all platforms and should avoid the race condition, as well as the data directory problems. Needs to be tested though..

  12. jonasschnelli commented at 4:33 pm on May 1, 2015: contributor

    @laanwj i tried you proposed code and it does not fix the problem. Maybe it would be worth to revert #5877 and find a proper solution.

    What really surprises me: At least since 0.9.2 (didn’t try further down), bitcoind and Bitcoin-Qt does not run with a datadir-path containing special Chinese or other Asian chars (Windows only).

  13. dexX7 commented at 5:45 pm on May 1, 2015: contributor

    boost::filesystem::path::imbue() returns the “previously used locale”, which should be the “right one”, if called in a pristine environment, or to put it another way: if imbue() wouldn’t be called at all, then this appears to be the one fs::path would use.

    In the following patch, a dummy locale is used to extract the internal locale, to use it explicitly, which seems to be a preferable choice, because the goal is not to modify the locale at all, but to prevent deinitialization issues, and the bad-environment-fallback also kicks in earlier.

     0diff --git a/src/util.cpp b/src/util.cpp
     1index 4fea18b..68fb326 100644
     2--- a/src/util.cpp
     3+++ b/src/util.cpp
     4@@ -713,18 +713,20 @@ void RenameThread(const char* name)
     5
     6 void SetupEnvironment()
     7 {
     8-    std::locale loc("C");
     9     // On most POSIX systems (e.g. Linux, but not BSD) the environment's locale
    10     // may be invalid, in which case the "C" locale is used as fallback.
    11 #if !defined(WIN32) && !defined(MAC_OSX) && !defined(__FreeBSD__) && !defined(__OpenBSD__)
    12     try {
    13-        loc = std::locale(""); // Raises a runtime error if current locale is invalid
    14+        std::locale(""); // Raises a runtime error if current locale is invalid
    15     } catch (const std::runtime_error&) {
    16         setenv("LC_ALL", "C", 1);
    17     }
    18 #endif
    19     // The path locale is lazy initialized and to avoid deinitialization errors 
    20     // in multithreading environments, it is set explicitly by the main thread.
    21+    // A dummy locale is used to extract the internal default locale, used by
    22+    // boost::filesystem::path, which is then used to explicitly imbue the path.
    23+    std::locale loc = boost::filesystem::path::imbue(std::locale::classic());
    24     boost::filesystem::path::imbue(loc);
    25 }
    

    I tested 0.9.3, 0.10.1, another patch and this one, and it’s behavior appears to be similar to 0.9.3 (i.e. it can handle paths, which 0.10.1 can’t handle).

    Example paths and test results:

    Neither the newly patched version, 0.9.3, or 0.10.0, are able to use "E:\LocaleTests\юзза" as datadir.

    When choosing "E:\LocaleTests\юзза" as datadir via bitcoin-qt 0.9.3, then the error message is in German, while the patched version, as well as bitcoin-qt 0.10.0, shows an English message, even though the rest of the UI is actually translated.

  14. theuni commented at 8:23 pm on May 1, 2015: member

    I think that’s a bit circuitous. It works, but not for obvious reasons.

    The returned loc from boost is the classic locale plus custom boost facet for windows. Otherwise, you could just replace the first call with std::locale().

    The below is greatly simplified and I think it does what we want. Note that std::locale::global("") does not work as intended on windows, presumably because we’re statically linking libstdc++. The basic C setlocale() does though.

    0void SetupEnvironment()
    1{
    2    if (!setlocale(LC_ALL, ""))
    3    {
    4        setenv("LC_ALL", "C", 1);
    5        setlocale(LC_ALL, "C");
    6    }
    7    boost::filesystem::path::imbue(std::locale());
    8}
    

    All we’re really trying to accomplish here is to teach boost how to handle locale-specific char conversions for paths, right? Afaik we don’t mess with c++ locales elsewhere.

    The above accomplishes that, won’t throw exceptions, and I believe it’s portable. Tested and verified with a Chinese profile/username/locale on win7, and on Linux with a busted locale via LC_LANG=bad ./bitcoind

  15. dexX7 commented at 9:30 pm on May 1, 2015: contributor

    @theuni: when building on Ubuntu 14.04, with HOST=x86_64-w64-mingw32, and the dependency packages provided via depends, I had to wrap setenv with #if !defined(WIN32) ..., because it would otherwise cause a build error.

    Back on Windows 8.1 x64, when using bitcoin-qt.exe, the tricky paths, such as ..\œ and ..\€@ä®, are accepted and work very well, however, when using ..\юзза, a runtime error is raised.

    I was not able to use that path with any version or patch, though 0.9.3, and the build with the dummy-workaround, respond with "Error: Specified data directory "E:\LocaleTests\????" does not exist." instead.

    The behavior of bitcoind.exe 0.9.3, the dummy-workaround patch, and your simple patch appears to be similar.

  16. jonasschnelli commented at 8:45 am on May 2, 2015: contributor

    Updated this PR with @theunis solution (fixed: added a #if !defined(WIN32) for setenv()). Built through gitian: https://builds.jonasschnelli.ch/pulls/6093/

    Won’t run on win7 with a username containing Chinese chars (as it was with 0.9.3, 0.10.0, probably it was always like this). But won’t also run on win7 with a username containing Russian chars (where 0.9.3 and 0.10.0 was running okay).

    Short term we need a solution to fix #6078 without reintroducing #3136. Long term we need a solution who can handle all possible windows charsets.

    bildschirmfoto 2015-05-02 um 10 37 53

    bildschirmfoto 2015-05-02 um 10 43 54

  17. theuni commented at 2:43 pm on May 2, 2015: member

    Ok, I’ve done more debugging here. tl;dr: After poking some more with even more variables, I think @dexX7’s change is the right one. @jonasschnelli It works fine for me with a Chinese username. Turns out there’s another variable: selected system “Format” in system location settings. Looks like that’s what influences the default locale at runtime rather than the system locale.

    Some testing using printf("locale %s:"setlocale(LC_ALL,NULL)): With a Chinese format selected:locale: Chinese (Simplified)_People's Republic of China.936 With English selected: English_United States.1252

    So.. makes sense. Locale is per-user.

    When the Format is set to Chinese, a Chinese username works. When the format is left (default?) in English, the Chinese username doesn’t work.

    This is the case because when we use the user’s locale (Format), paths for that language should just work.

    However.

    Using @dexX7’s trick, Chinese username with English Format does work. This is because of boost’s usage of its own codecvt, which queries the Win API for the codepage type: https://github.com/boostorg/filesystem/blob/boost-1.55.0/src/windows_file_codecvt.cpp#L39

    So it looks like that’s actually the way to go.

  18. dexX7 commented at 4:14 pm on May 2, 2015: contributor

    @theuni: my idea was basically to get access and capture the whole following block, which uses different facets, depending on the OS and build: boost-1.55.0/src/path.cpp#L792-L873, and then use it to properly “initialize” Boost path by the main thread with the correct locale, after taking care of bad environment settings.

    Nevertheless, it would probably make sense to catch the exception, or handle failures related to the datadir. In your simpler solution I can trigger a runtime error, if I use the path "E:\LocaleTests\юзза" (which is a legit path) on a German localized Windows, while 0.9.3 and my patch results in "Error: Specified data directory "E:\LocaleTests\????" does not exist.". I’m not entirely sure, where this is coming from, but ideally there should be a more expressive user faced message, with clear instructions to use a “compatible” path, if possible

    FWIW: https://github.com/dexX7/bitcoin/commit/c9a37843bc58bce12ece6bcf896339eeadc58ccd @laanwj’s words are still ringing in my ear.. :/

    This more and more looks like just some occult incantation to ward off the evil spirits of bad locales :)

  19. sipa commented at 4:16 pm on May 2, 2015: member
    Can’t we all just agree to go back to EBCDIC?
  20. laanwj commented at 8:49 am on May 4, 2015: member

    I preferred PETSCII :)

    Question: at least for the 0.10 branch, can you code this so that it doesn’t affect behavior on non-Windows? I’d hate to release a ‘fix’ then break the Linux issue again, where an invalid locale prevents the application from starting.

    Using @dexX7’s trick, Chinese username with English Format does work. This is because of boost’s usage of its own codecvt, which queries the Win API for the codepage type: https://github.com/boostorg/filesystem/blob/boost-1.55.0/src/windows_file_codecvt.cpp#L39

    Phew – that’s suble. Thanks for shedding some light on this dark matter. I agree that @dexX7’s swap-the-locale trick, though circuitous. seems to be the best way forward.

  21. in src/util.cpp: in 426ee48f86 outdated
    737+    {
    738+#if !defined(WIN32)
    739         setenv("LC_ALL", "C", 1);
    740-    }
    741 #endif
    742-    // The path locale is lazy initialized and to avoid deinitialization errors 
    


    laanwj commented at 8:52 am on May 4, 2015:
    Please correct instead of remove the comments. Esoteric knowledge about locales is needed to understand this on reading, heck, we will have forgotten about the rationale for this in a few months.
  22. jonasschnelli force-pushed on May 4, 2015
  23. jonasschnelli force-pushed on May 4, 2015
  24. jonasschnelli force-pushed on May 4, 2015
  25. jonasschnelli force-pushed on May 4, 2015
  26. jonasschnelli force-pushed on May 4, 2015
  27. jonasschnelli force-pushed on May 4, 2015
  28. jonasschnelli force-pushed on May 4, 2015
  29. jonasschnelli renamed this:
    don't imbue boost::filesystem::path with locale "C" on windows
    fix WIN32 boost::filesystem::path issues when using special chars for datadir path
    on May 4, 2015
  30. jonasschnelli commented at 12:10 pm on May 4, 2015: contributor

    Updated to make use of @dexX7 solution (https://github.com/bitcoin/bitcoin/pull/6093#issuecomment-98185779). I think this is okay for 0.11.

    If there are no complains, i’ll open a PR for the 0.10 branch with the same solution but only touching Win32 behavior (some ifdefs).

  31. theuni commented at 1:58 pm on May 4, 2015: member

    @dexX7 Thanks for sticking with this and explaining the issue well. It looks like it took me a little while to figure out exactly what you already had, I just had to work through it myself for it to make sense. Nice detective work! @laanwj Agreed on making it windows only for 0.10, and on adding comments.

    Additionally, we’re depending on internal boost workings here that may change in the future, I wonder if it’d be worthwhile to add a quick test to use ‘./bitcoind.exe -datadir=c:\something\non-latin’. Sadly, I don’t think we could trust wine to faithfully reproduce the issue, but at least we’d have a set of tests that could be run manually.

    If that’s possible, we’d probably want to bake it into the test_bitcoin.exe binary to rule out python/shell middle-man issues.

    Edit: thinking on it a little more, using -datadir for a test is overkill. We could just add a quick unit-test to add a subdir inside the current profile and touch a file in there.

  32. dexX7 commented at 3:52 pm on May 4, 2015: contributor

    @laanwj: Question: at least for the 0.10 branch, can you code this so that it doesn’t affect behavior on non-Windows?

    Is this the case? Travis is happy with the change applied on top of 0.10.1, and the bad-environment test passes as well. It does not bring back #3136, and at least to my understanding, it should restore the behavior of pre-0.10, but without the multithreading-deinitialization issue. But please don’t give my understanding much weight here.

    All this is based on the assumption that path::codecvt() is called via boost::filesystem::path at some point, which then undergoes a similar path as path::imbue() does, to return the original locale.

    @theuni: Thanks for sticking with this and explaining the issue well.

    It’s the least I can do, given that my fix for #3136 (deinitialization issue), and the follow up #5950 (bad environment handling), ultimately led in this situation.

    We could just add a quick unit-test to add a subdir inside the current profile and touch a file in there.

    Having more tests for this would be incredibly useful.

    There might be a twist though: the unit tests (via TestingSetup) appear to use pathTemp = GetTempPath() / ... for the temporary test datadir, and this probably throws, even before any tests are executed, if path can’t handle paths inside the current profile, because GetTempPath() returns \Users\{username}\AppData\Local\Temp as per default on Windows Vista+.

    Just brainstorming on this: what I tested manually might be wrapped into a Python test, which starts and stops nodes, using a bunch of datadirs with “exotic” characters. Unfortunally I’m not sure, which test data could be used for it, given that, appearingly, the system’s localization plays a significant role here (e.g. œ™€® passes on my system, while юзза, which seems to work fine for the original poster of #6078, doesn’t).

    Tackling this from another perspective: ideally there would also be some form of mitigation in the context of GetDataDir() ([create datadir] and related), to a) catch potential exceptions, and b) provide some form of guidance for the user, or more telling messages, if it actually happens.

    Edit: where I’m getting: there are probably other related issues in the context of the filesystem, which are not related to localization. For example, using a datadir without permission, results in a segfault.

  33. theuni commented at 4:10 pm on May 4, 2015: member
    @dexX7 If the test throws during initialization, then it will fail anyway, no? I don’t see how that’s a problem as it still indicates an error.
  34. [squashme] simplify SetupEnvironment() (by dexX7) 3da7849007
  35. in src/util.cpp: in fda4da8025 outdated
    736     } catch (const std::runtime_error&) {
    737         setenv("LC_ALL", "C", 1);
    738     }
    739 #endif
    740-    // The path locale is lazy initialized and to avoid deinitialization errors 
    741+    // The path locale is lazy initialized and to avoid deinitialization errors
    


    laanwj commented at 3:16 pm on May 8, 2015:
    Duplicated comment :)

    jonasschnelli commented at 8:07 am on May 10, 2015:
    Fixed.
  36. jonasschnelli force-pushed on May 10, 2015
  37. jonasschnelli commented at 8:12 am on May 10, 2015: contributor

    I lost track of this PR. How we proceed here? @theuni @dexX7? Is this ready?

    My tests still tell me that there are problems with special char datadir on windows (as it always was the case). Maybe this PR should concentrate on fixing #6078 without reintroducing #3136 (solved by #5877). There will still be cases where bitcoind/qt won’t start up. As example: if you use a Chinese username on a english windows.

  38. dexX7 commented at 9:30 am on May 10, 2015: contributor

    @jonasschnelli: As example: if you use a Chinese username on a english windows.

    It’s the best solution I’ve found, though this is still very unfortunate. As far as I can see, it restores the behavior of 0.9.3, without reintroducing #3136.

  39. jonasschnelli commented at 9:38 am on May 10, 2015: contributor
    Okay. Then i think this is merge ready and should probably get a ACK/NACK from @theuni. @laanwj: Should we backport this (as it is) to 0.10 or should i create a PR where it only touches win32 codeflow?
  40. laanwj merged this on May 10, 2015
  41. laanwj closed this on May 10, 2015

  42. laanwj referenced this in commit 23254131a3 on May 10, 2015
  43. laanwj commented at 12:27 pm on May 10, 2015: member
    Cherry-picked to 0.10 as 424ae66 (the change is now subtle enough that no special path for WIN32 is necessary)
  44. theuni commented at 5:00 pm on May 10, 2015: member
    post-merge ACK.
  45. dexX7 commented at 5:20 pm on May 10, 2015: contributor

    @laanwj: the current 0.10 tip doesn’t include the changes of this PR?

    Edit: now I see it. :)

  46. laanwj referenced this in commit 424ae6629b on May 11, 2015
  47. mart2038 commented at 8:33 pm on May 29, 2015: none
    post-merge comments. Kudos to you guys for sticking with a win32 issue, but… Why? ‘They’ say that coding for *nix is too human resource hungry due to intra-distribution incompatibilities… hmmmmm, Oooookay, if ’they’ say so. @sipa @laanwj EBCDIC, PETSCII You must have been rich! “When I were a lad” (olde Englysh comedy) we programmed our 4004 directly in binary using mechanical switches. Our output device was four LED’s. Encoding? Who needs it? ;-)
  48. reddink referenced this in commit 200b5b8c93 on May 27, 2020
  49. MarcoFalke locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-07-05 22:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me