Add a developer note about avoiding fs::PathToString
in RPCs, and improve some other fs::PathToString
comments.
Developer note might have been useful in two recent review comments:
Add a developer note about avoiding fs::PathToString
in RPCs, and improve some other fs::PathToString
comments.
Developer note might have been useful in two recent review comments:
Documentation review ACK 9b575f1c734c052b695ce921fb6412b22c18fdb4
Are the tests going to fail if we switch to std filesystem and have an incorrect string conversion?
Yes, especially the Windows tests will fail, as the situation is most dire there. Both the functional and unit tests, as they both have some tests that use unicode characters in paths (that are not in any of the “narrow” codepages).
ACK 9b575f1c734c052b695ce921fb6412b22c18fdb4
Thanks for improving documentation on that topic!
93@@ -94,31 +94,34 @@ static inline path operator+(path p1, path p2)
94
95 /**
96 * Convert path object to byte string. On POSIX, paths natively are byte
0 * Convert path object to a byte string. On POSIX, paths natively are byte
93@@ -94,31 +94,34 @@ static inline path operator+(path p1, path p2)
94
95 /**
96 * Convert path object to byte string. On POSIX, paths natively are byte
97- * strings so this is trivial. On Windows, paths natively are Unicode, so an
98- * encoding step is necessary.
99+ * strings, so this is trivial. On Windows, paths natively are Unicode, so an
100+ * encoding step is necessary. The inverse of \ref PathToString is \ref
101+ * PathFromString. The strings returned and parsed by these functions can be
102+ * used to call POSIX APIs, and for roundtrip conversion, logging, and
0 * used to call POSIX APIs and for roundtrip conversion, logging, and
124+ * Because \ref PathToString and \ref PathFromString functions don't specify an
125+ * encoding, they are meant to be used internally, not externally. They are not
126+ * appropriate to use in applications requiring UTF-8, where
127+ * fs::path::u8string() and fs::u8path() methods should be used instead. Other
128+ * applications could require still different encodings. For example, JSON, XML,
129+ * or URI applications might prefer to use higher level escapes (\uXXXX or
0 * or URI applications might prefer to use higher-level escapes (\uXXXX or
ACK 9b575f1c734c052b695ce921fb6412b22c18fdb4
This PR intends to add documentation for the PathToString functions. This PR:
The documentation is nicely worded and easy to understand. There is no critical grammar error in the wording. However, I found some minor grammatical corrections that you might consider if you update the PR for some other crucial reason.
re: #23522 (comment)
Is there any way to enforce this without looking at the code with human eyes? Are the tests going to fail if we switch to std filesystem and have an incorrect string conversion?
I think ideally PathToString
and PathFromString
functions would be eliminated. I wrote these functions to allow switching from boost::filesystem
to std::filesystem
while preserving current behavior, but current behavior is actually a little broken.
PathToString
is mostly used for logging, but this is not ideal because the result is usually not quoted, and paths can contain spaces, quotes, punctuation and text that make resulting log messages ambiguous. They could also contain arbitrary characters including newlines, invalid UTF-8 sequences, invalid multibyte sequences that break log parsing. I think ideally these PathToString
calls would be replaced with calls to a new PathToLogString
function, and that function would add surrounding quotes around paths, replace internal quotes with \"
, replace special ASCII characters like tabs and newlines with C \t
\n
escapes, replace non-latin unicode characters with \uXXXX
escapes on windows, and replace non-latin bytes with \xXX
escapes on POSIX.
PathFromString
is mostly used to parse command line and configuration arguments. I think ideally these calls would be removed and replaced with calls to a new ArgsManager::GetPathArg
method. The method would treat command line arguments and configuration file paths as bytes on POSIX with no encoding or decoding. On windows, it would decode configuration file paths as UTF-8, and it would either decode command line arguments according to the current code page, or it would bypass the code page encoding and decoding done by the windows C library by overriding wmain
and getting the original unicode arguments passed by the parent process in wchar_t *argv[]
So ideally we could get rid of PathToString
to PathFromString
entirely. But more narrowly if we just wanted to try minimize new encoding bugs and avoid PathToString
results being passed to the UniValue
constructor, we would need to use distinct string types like std::string
for byte strings and std::u8string
for unicode strings and make it an error to construct a UniValue
from a byte string. We could also try to renaming the PathToString
function to discourage use. Maybe to something like PathToByteString
, PathToInternalString
, PathToSystemString
, LegacyPathToString