This:
- Removes dead code.
- Avoids unused copies in some places.
- Adds copies in other places for safety.
This:
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
For detailed information about the code coverage, see the test coverage report.
See the guideline for information on the review process.
Type | Reviewers |
---|---|
ACK | ryanofsky, stickies-v, achow101 |
Concept ACK | hebasto |
Stale ACK | vasild |
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Concept ACK.
This silences the following (clang 18):
0common/args.cpp:288:31: error: returning variable 'm_cached_blocks_path' by reference requires holding mutex 'cs_args' [-Werror,-Wthread-safety-reference-return] 1 288 | if (!path.empty()) return path; 2 | ^
Does it happen on the master branch?
Does it happen on the master branch? @hebasto, yes, see https://github.com/bitcoin/bitcoin/pull/28774
13@@ -14,6 +14,8 @@
14 #include <ios>
15 #include <ostream>
16 #include <string>
17+#include <system_error>
18+#include <type_traits>
nit: the commit message has comments:
0 #include <system_error> // for error_code
1 #include <type_traits> // for is_same
personally I like them and find them useful.
61- // (C++20). Convert both to std::string. This method can be removed
62- // after switching to C++20.
63+ const std::u8string& utf8_str{std::filesystem::path::u8string()};
64+ // Convert to std::string as a convenience for use in RPC code.
65 return std::string{utf8_str.begin(), utf8_str.end()};
66 }
What about the comment that says “This method can be removed after switching to C++20”? I think indeed this method should be removed entirely.
In C++17 its prototype, according to the spec is std::string u8string() const
which matches ours and is ok. But in C++20 it is: std::u8string u8string() const
and it is confusing to have a mismatch. I would better drop this method like the comment says and then fix problems (if any) that arise due to that.
“This method can be removed after switching to C++20” was written by me and I think it no longer makes sense.
But in C++20 it is: std::u8string u8string() const and it is confusing to have a mismatch.
I think this is fine. fs::path
is not an exact API-copy of std::filesystem::path
. Some methods are different, or deleted, so this mismatch should be fine.
If a u8string was returned here, UniValue would have to be updated to take it as well, in which case the conversion would need to happen there. Other parts of the codebase may have to be adjusted as well. I don’t think forcing conversions all over the codebase adds any value, but I am happy to change my view, if there are any good reasons.
I need to understand this better. What I grasp now is that in C++17 it was ok to store UTF-8 strings in std::string
and in C++20 it is not and one has to use std::u8string
instead?
This smells because if callers need std::string
why do they call u8string()
and not string()
!? Is it because it was ok to do that in C++17 because in C++17 std::filesystem::path::string()
would return std::string
itself and on top of this it was needed because of some complication on Windows?
On Windows filesystem::path
uses wchar_t
and it has to be somehow converted to std::string<char>
, I guess:
filesystem::path::string()
which does that conversion is not okfilesystem::path::u8string()
converts wchar_t
to
2.1. C++17: char
from which we convert it explicitly to char
which is a noop
2.2. C++20: char8_t
from which we convert it explicitly to char
. Is that ok? Why C++20 has std::u8string
if we can store UTF-8 strings in std::string
?What would be an example in C++20 where it is broken by calling string()
(1.) and ok by converting wchar_t
->char8_t
->char
(2.2)?
I need to understand this better. What I grasp now is that in C++17 it was ok to store UTF-8 strings in
std::string
and in C++20 it is not and one has to usestd::u8string
instead?
The underlying data structure is just a container to hold bytes. If the bytes are identical, which they are (in this function), it only matters for readers of the code what the data structure is called.
The important part to getting the right bytes is to call the right function, which is called u8string
in all versions of C++ and in this file.
2.2. C++20:
char8_t
from which we convert it explicitly tochar
. Is that ok? Why C++20 hasstd::u8string
if we can store UTF-8 strings instd::string
?
Yes, conversion should be fine. See https://stackoverflow.com/a/57453713
What would be an example in C++20 where it is broken by calling
string()
(1.) and ok by convertingwchar_t
->char8_t
->char
(2.2)?
This has nothing to do with C++20. If you call the wrong function string()
vs u8string()
, you will also get the wrong result with C++17.
This has nothing to do with C++20
ok, rephrase (I mentioned C++20 because C++17 does not have char8_t
):
What would be an example where it is broken by calling string()
(1.) and ok by converting wchar_t
->char8_t
->char
(2.2)?
What would be an example where it is broken by calling
string()
Any place, which correctly calls u8string()
, and instead incorrectly calls string()
is broken. For example, the code may be broken on systems whose native encoding is not UTF-8.
See also
0diff --git a/src/test/fs_tests.cpp b/src/test/fs_tests.cpp
1index 7cfecb2b22..34168a551b 100644
2--- a/src/test/fs_tests.cpp
3+++ b/src/test/fs_tests.cpp
4@@ -19,7 +19,7 @@ BOOST_AUTO_TEST_CASE(fsbridge_pathtostring)
5 {
6 std::string u8_str = "fs_tests_₿_🏃";
7 BOOST_CHECK_EQUAL(fs::PathToString(fs::PathFromString(u8_str)), u8_str);
8- BOOST_CHECK_EQUAL(fs::u8path(u8_str).u8string(), u8_str);
9+ BOOST_CHECK_EQUAL(std::filesystem::path{fs::u8path(u8_str)}.string(), u8_str);
10 BOOST_CHECK_EQUAL(fs::PathFromString(u8_str).u8string(), u8_str);
11 BOOST_CHECK_EQUAL(fs::PathToString(fs::u8path(u8_str)), u8_str);
12 #ifndef WIN32
0unknown location(0): fatal error: in "fs_tests/fsbridge_pathtostring": class std::system_error: No mapping for the Unicode character exists in the target multi-byte code page.
https://github.com/maflcko/bitcoin-core-with-ci/actions/runs/7180069987/job/19553642111#step:23:477
In commit “refactor: Remove pre-C++20 fs code” (fa3d9304e80c214c8b073f12a7f4b08c5a94af04)
re: #29040 (review)
If a u8string was returned here, UniValue would have to be updated to take it as well, in which case the conversion would need to happen there.
In theory, no conversions are actually required, right? In theory UniValue could use u8string instead of string internally, and other parts of the codebase could too, so nothing would need to be converted?
In any case, I think going forward it will be confusing to have a u8string method that doesn’t actually return a u8string. The method isn’t called very many places, so maybe we could follow this up with a scripted-diff renaming u8string to utf8sting or something like that, to avoid confusing it with the standard method.
Renaming might also be good since it should make the standard u8string method easier to call and allow getting the utf8 paths more efficiently without {utf8_str.begin(), utf8_str.end()} copies.
confusing to have a u8string method that doesn’t actually return a u8string
My concern exactly!
In theory, no conversions are actually required, right? In theory UniValue could use u8string instead of string internally, and other parts of the codebase could too, so nothing would need to be converted?
Not sure if it makes sense to update the whole codebase from std::string
to std::u8string
, just so that UniValue can accept an std::u8string
. I think it would be fine for UniValue to accept both string types, though, then one would have to be “converted”/copied to the other (I don’t think reinterpret_cast+std::move is guaranteed to work from char
strings to char8_t
strings, is it?)
re: #29040 (review)
Not sure if it makes sense to update the whole codebase from
std::string
tostd::u8string
, just so that UniValue can accept anstd::u8string
. I think it would be fine for UniValue to accept both string types, though, then one would have to be “converted”/copied to the other (I don’t think reinterpret_cast+std::move is guaranteed to work fromchar
strings tochar8_t
strings, is it?)
Yeah I don’t think it makes sense to make an effort to convert code to std::u8string
. To explain “In theory UniValue could use u8string”, I mean the UniValue
class could switch to std::u8string
members internally, and it could accept u8string
values without converting, and it could return both string_view
and u8string_view
values without converting (just casting). The downside is it would need a conversion step for accepting std::string
values, which would not be great right now because the rest of our code base is using std::string
. But maybe later if we have new code using u8string
it might make sense to switch UniValue
then. It could also make sense to switch UniValue
to use std::u8string
if it turns out there an efficient way to move a std::string
value into a std::u8string
object, but I think you are right that the standard doesn’t provide a way to do this now and the reinterpret_cast + move option would technically be undefined behavior even if it works in practice.
UniValue
with https://github.com/nlohmann/json
u8string()
method.
The operator accepts a const& reference, so no copy or move is needed.
See https://en.cppreference.com/w/cpp/filesystem/path/append
Also, add missing includes:
#include <system_error> // for error_code
#include <type_traits> // for is_same
#include <cerrno> // for errno
Treating std::string as UTF-8 is deprecated in std::filesystem::path
since C++20.
However, it makes this codebase easier to read and maintain to retain
the ability for std::string to hold UTF-8.
`ArgsManager::m_cached_blocks_path` is protected by
`ArgsManager::cs_args` and returning a reference to it after releasing
the mutex is unsafe.
To resolve this, return a copy of the path. This has some performance
penalty which is presumably ok, given that paths are a few 100s bytes
at most and `GetBlocksDirPath()` is not called often.
This silences the following (clang 18):
```
common/args.cpp:288:31: error: returning variable 'm_cached_blocks_path' by reference requires holding mutex 'cs_args' [-Werror,-Wthread-safety-reference-return]
288 | if (!path.empty()) return path;
| ^
```
Do the same with
`ArgsManager::GetDataDir()`,
`ArgsManager::GetDataDirBase()` and
`ArgsManager::GetDataDirNet()`.
ACK 856c88776f8486446602476a1c9e133ac0cff510
I realized that this discussion #29040 (review) is not a blocker for this PR because it merely removes a comment and changes const auto
to const u8string
which is noop, just a style thing.
276@@ -277,7 +277,7 @@ fs::path ArgsManager::GetPathArg(std::string arg, const fs::path& default_value)
277 return result.has_filename() ? result : result.parent_path();
278 }
279
280-const fs::path& ArgsManager::GetBlocksDirPath() const
281+fs::path ArgsManager::GetBlocksDirPath() const
In commit “ArgsManager: return path by value from GetBlocksDirPath()” (856c88776f8486446602476a1c9e133ac0cff510)
This change looks ok, but it seems like the clang-18 warning is a false alarm. We know know m_cached_blocks_path will never change after it is written, and we always acquire the cs_args lock before reading m_cached_blocks_path so we can be sure that by the time we do read it we are necessarily seeing the latest value. I don’t see how there could be a race condition here, because it shouldn’t matter how long reader threads hold the lock, as long as they are acquire the lock before reading and we know the value never changes.
Maybe declaring the fs::path& path
const, and writing to m_cached_blocks_path
directly could make the warning go away?
The warning tells us that the code is potentially unsafe. Not that it has a certain bug now. It is similar to having a variable declared with GUARDED_BY()
, then accessing it without the mutex in such a way that it is safe.
Maybe declaring the fs::path& path const, and writing to m_cached_blocks_path directly could make the warning go away?
Would that be like tricking the compiler, going around the warning without actually removing the underlying potential problem?
re: #29040 (review)
Maybe declaring the fs::path& path const, and writing to m_cached_blocks_path directly could make the warning go away?
Would that be like tricking the compiler, going around the warning without actually removing the underlying potential problem?
I’m actually not sure the change I’m suggesting would be enough to silence the compiler warning, so it may be a moot point. But the compiler just has incomplete information. It has no way of knowing that m_cached_blocks_path string is not written again after the first access. So I think suppressing the warning or declaring the reference const would not be tricks really, just ways of giving the compiler instructions or information.
I do think this change is fine as is. I was mostly just curious if the warning was actually a false-positive, since it does seem like one.
Code review ACK 856c88776f8486446602476a1c9e133ac0cff510
I suggested a scripted diff followup, but all these changes seem good, and the followup could happen later. I am also a little disturbed that 856c88776f8486446602476a1c9e133ac0cff510 doesn’t begin with fa
but I think I can get over it.
but I think I can get over it.
53@@ -54,7 +54,7 @@ class path : public std::filesystem::path
54 // Disallow std::string conversion method to avoid locale-dependent encoding on windows.
55 std::string string() const = delete;
56
57- std::string u8string() const
58+ std::string utf8string() const
In commit “refactor: Rename fs::path::u8string() to fs::path::utf8string()” (facdafaea8cfc2a37e55173d0500794c80b28f7f) ~In commit “ArgsManager: return path by value from GetBlocksDirPath()” (856c88776f8486446602476a1c9e133ac0cff510)~
Might be good to add documention for this method. Maybe “Return a UTF-8 representation of the path as a std::string, for compatibility with code using std::string. For code using the newer std::u8string type, it is more efficient to call the inherited std::filesystem::path::u8string method instead.” to explain why this exists and also point out that there is an alternative.
In commit “ArgsManager: return path by value from GetBlocksDirPath()” (https://github.com/bitcoin/bitcoin/commit/856c88776f8486446602476a1c9e133ac0cff510)
Wrong commit? :)
documention
Thanks, copy-pasted this and removed my one-line comment.
Code review ACK facdafaea8cfc2a37e55173d0500794c80b28f7f.
Only change since last review is a new commit renaming the fs::path::u8string method.
ACK facdafaea8cfc2a37e55173d0500794c80b28f7f
To verify that nothing has been forgotten in the last commit which renames fs::path::u8string()
to fs::path::utf8string()
I added std::u8string u8string() const = delete;
. The only call to that function is in BOOST_CHECK(fs::path(str8).u8string() == str8);
which I guess is intentional.
To verify that nothing has been forgotten in the last commit which renames
fs::path::u8string()
tofs::path::utf8string()
I addedstd::u8string u8string() const = delete;
This should not be needed, because the two functions return different and incompatible types, so any oversight should result in a compile failure as-is.
double
before converting to CAmount
).
Code review ACK 66667130416b86208e01a0eb5541a15ea805ac26. Just documentation change since last review.
To clarify and clean things up more, we could also document the fs::path::utf8string()
method and fs::u8path()
function as being inverses of each other, and maybe rename fs::u8path
to fs::utf8path
for consistency. I don’t think either of these things are necessary though, just thoughts for improvement.
maybe rename
fs::u8path
tofs::utf8path
for consistency
I think it is nice that it shadows the deprecated “constructor”, but I am happy to review a follow-up. (Can be a scripted-diff, because no use of the deprecated u8path
remain in the code)