Improve fs::PathToString documentation

ryanofsky commented at 1:46 am on November 16, 2021: member

Add a developer note about avoiding fs::PathToString in RPCs, and improve some other fs::PathToString comments.

Developer note might have been useful in two recent review comments:

Improve fs::PathToString documentation 9b575f1c73

DrahtBot added the label UTXO Db and Indexes on Nov 16, 2021

unknown approved

unknown commented at 4:58 am on November 16, 2021: none

ACK https://github.com/bitcoin/bitcoin/pull/23522/commits/9b575f1c734c052b695ce921fb6412b22c18fdb4

MarcoFalke added the label Docs on Nov 16, 2021

MarcoFalke removed the label UTXO Db and Indexes on Nov 16, 2021

MarcoFalke commented at 7:35 am on November 16, 2021: member

Is there any way to enforce this without looking at the code with human eyes? Are the tests going to fail if we switch to std filesystem and have an incorrect string conversion?

laanwj commented at 9:52 am on November 16, 2021: member

Documentation review ACK 9b575f1c734c052b695ce921fb6412b22c18fdb4

Are the tests going to fail if we switch to std filesystem and have an incorrect string conversion?

Yes, especially the Windows tests will fail, as the situation is most dire there. Both the functional and unit tests, as they both have some tests that use unicode characters in paths (that are not in any of the “narrow” codepages).

hebasto approved

hebasto commented at 11:05 am on November 16, 2021: member

ACK 9b575f1c734c052b695ce921fb6412b22c18fdb4

Thanks for improving documentation on that topic!

in src/fs.h:96 in 9b575f1c73

93@@ -94,31 +94,34 @@ static inline path operator+(path p1, path p2)
94 
95 /**
96  * Convert path object to byte string. On POSIX, paths natively are byte

shaavan commented at 1:19 pm on November 16, 2021:

0 * Convert path object to a byte string. On POSIX, paths natively are byte

fanquake commented at 10:05 am on January 11, 2022:

Done in #20744.

in src/fs.h:100 in 9b575f1c73

 93@@ -94,31 +94,34 @@ static inline path operator+(path p1, path p2)
 94 
 95 /**
 96  * Convert path object to byte string. On POSIX, paths natively are byte
 97- * strings so this is trivial. On Windows, paths natively are Unicode, so an
 98- * encoding step is necessary.
 99+ * strings, so this is trivial. On Windows, paths natively are Unicode, so an
100+ * encoding step is necessary. The inverse of \ref PathToString is \ref
101+ * PathFromString. The strings returned and parsed by these functions can be
102+ * used to call POSIX APIs, and for roundtrip conversion, logging, and

shaavan commented at 1:19 pm on November 16, 2021:

0 * used to call POSIX APIs and for roundtrip conversion, logging, and

fanquake commented at 10:05 am on January 11, 2022:

I don’t think this is more correct, have left out of #20744.

in src/fs.h:108 in 9b575f1c73

124+ * Because \ref PathToString and \ref PathFromString functions don't specify an
125+ * encoding, they are meant to be used internally, not externally. They are not
126+ * appropriate to use in applications requiring UTF-8, where
127+ * fs::path::u8string() and fs::u8path() methods should be used instead. Other
128+ * applications could require still different encodings. For example, JSON, XML,
129+ * or URI applications might prefer to use higher level escapes (\uXXXX or

shaavan commented at 1:19 pm on November 16, 2021:

0 * or URI applications might prefer to use higher-level escapes (\uXXXX or

fanquake commented at 10:05 am on January 11, 2022:

Done in #20744.

shaavan approved

shaavan commented at 1:21 pm on November 16, 2021: contributor

ACK 9b575f1c734c052b695ce921fb6412b22c18fdb4

This PR intends to add documentation for the PathToString functions. This PR:

Clarifies where it is safe to use this command and where it used to be replaced with other functions.
Rewords the comments in the fs.h file detailing pathToString usage to make them more logically and verbally sound.
Move the Windows and POSIX implementation notes under the PathToString function

The documentation is nicely worded and easy to understand. There is no critical grammar error in the wording. However, I found some minor grammatical corrections that you might consider if you update the PR for some other crucial reason.

jamesob commented at 7:33 pm on November 16, 2021: member

ACK https://github.com/bitcoin/bitcoin/pull/23522/commits/9b575f1c734c052b695ce921fb6412b22c18fdb4

fanquake merged this on Nov 17, 2021

fanquake closed this on Nov 17, 2021

ryanofsky commented at 3:28 am on November 17, 2021: member

re: #23522 (comment)

Is there any way to enforce this without looking at the code with human eyes? Are the tests going to fail if we switch to std filesystem and have an incorrect string conversion?

I think ideally PathToString and PathFromString functions would be eliminated. I wrote these functions to allow switching from boost::filesystem to std::filesystem while preserving current behavior, but current behavior is actually a little broken.

PathToString is mostly used for logging, but this is not ideal because the result is usually not quoted, and paths can contain spaces, quotes, punctuation and text that make resulting log messages ambiguous. They could also contain arbitrary characters including newlines, invalid UTF-8 sequences, invalid multibyte sequences that break log parsing. I think ideally these PathToString calls would be replaced with calls to a new PathToLogString function, and that function would add surrounding quotes around paths, replace internal quotes with \", replace special ASCII characters like tabs and newlines with C \t \n escapes, replace non-latin unicode characters with \uXXXX escapes on windows, and replace non-latin bytes with \xXX escapes on POSIX.

PathFromString is mostly used to parse command line and configuration arguments. I think ideally these calls would be removed and replaced with calls to a new ArgsManager::GetPathArg method. The method would treat command line arguments and configuration file paths as bytes on POSIX with no encoding or decoding. On windows, it would decode configuration file paths as UTF-8, and it would either decode command line arguments according to the current code page, or it would bypass the code page encoding and decoding done by the windows C library by overriding wmain and getting the original unicode arguments passed by the parent process in wchar_t *argv[]

So ideally we could get rid of PathToString to PathFromString entirely. But more narrowly if we just wanted to try minimize new encoding bugs and avoid PathToString results being passed to the UniValue constructor, we would need to use distinct string types like std::string for byte strings and std::u8string for unicode strings and make it an error to construct a UniValue from a byte string. We could also try to renaming the PathToString function to discourage use. Maybe to something like PathToByteString, PathToInternalString, PathToSystemString, LegacyPathToString

ryanofsky commented at 3:38 am on November 17, 2021: member

Thanks for all reviews and thanks @shaavan for suggested edits. I didn’t get to apply them here, but they could be used in another followup or maybe #20744

sidhujag referenced this in commit 867659eaf6 on Nov 17, 2021

DrahtBot locked this on Jan 11, 2023

Improve fs::PathToString documentation #23522