Remove the boost/algorithm/string/case_conv.hpp dependency #13671

pull l2a5b1 wants to merge 3 commits into bitcoin:master from l2a5b1:patch/remove_boost_case_conv.hpp changing 8 files +116 −10
  1. l2a5b1 commented at 10:50 pm on July 15, 2018: contributor

    This pull request removes the boost/algorithm/string/case_conv.hpp dependency from the project.

    boost/algorithm/string/case_conv.hpp is included for the boost::to_lower and boost::to_upper template functions.

    We can replace the calls to these functions with straightforward alternative implementations that use the C++ Standard Library, because the functions are called with std::string objects that use standard 7-bit ASCII characters as argument.

    The refactored implementation should work without the explicit static_cast<unsigned char> cast and unsigned char lambda return type. Both have been added defensively and to be explicit. Especially in case of the former, behaviour is undefined (potentially result in a crash) if the std::toupper argument is not an unsigned char.

    A potential alternative, maybe even preferred, implementation to address the boost::to_lower function call in ParseNetwork(std::string) could have been:

    0if (net == "ipv4" || net == "IPv4") return NET_IPV4;
    1if (net == "ipv6" || net == "IPv6") return NET_IPV6;
    

    This alternative implementation would however change the external behaviour of ParseNetwork(std::string).

    This pull requests includes a unit test to validate the implementation of ParseNetwork(std::string) prior and after the removal of the case_conv.hpp dependency.

    boost/algorithm/string/case_conv.hpp has been removed from the EXPECTED_BOOST_INCLUDES in test/lint/lint-includes.sh because it is no longer required.

  2. l2a5b1 force-pushed on Jul 15, 2018
  3. practicalswift commented at 11:33 pm on July 15, 2018: contributor

    Concept ACK modulo …

    • Nice to see the EXPECTED_BOOST_INCLUDES array in test/lint/lint-includes.sh get shorter
    • This is a good opportunity to also shorten the KNOWN_VIOLATIONS array in test/lint/lint-locale-dependence.sh – please replace the use of locale dependent functions std::lower and std::upper with locale independent alternatives
  4. fanquake added the label Refactoring on Jul 15, 2018
  5. in src/rpc/server.cpp:194 in e5085d3f8f outdated
    190@@ -192,9 +191,8 @@ std::string CRPCTable::help(const std::string& strCommand, const JSONRPCRequest&
    191                     if (!category.empty())
    192                         strRet += "\n";
    193                     category = pcmd->category;
    194-                    std::string firstLetter = category.substr(0,1);
    195-                    boost::to_upper(firstLetter);
    196-                    strRet += "== " + firstLetter + category.substr(1) + " ==\n";
    197+                    unsigned char firstLetter = std::toupper(static_cast<unsigned char>(category[0]));
    


    Empact commented at 1:12 am on July 16, 2018:
    std::toupper(category.front()) should work

    l2a5b1 commented at 0:50 am on July 18, 2018:
    Thanks @Empact, I have addressed your feedback. The code to capitalize the first letter of a string has been isolated in void Capitalize(std::string& str) in utilstrencodings.cpp.
  6. in src/netbase.cpp:40 in e5085d3f8f outdated
    36@@ -38,7 +37,7 @@ static const int SOCKS5_RECV_TIMEOUT = 20 * 1000;
    37 static std::atomic<bool> interruptSocks5Recv(false);
    38 
    39 enum Network ParseNetwork(std::string net) {
    40-    boost::to_lower(net);
    41+    std::transform(net.begin(), net.end(), net.begin(), [](unsigned char c) -> unsigned char {return std::tolower(c);});
    


    Empact commented at 1:28 am on July 16, 2018:

    You can pass tolower directly like so: std::transform(net.begin(), net.end(), net.begin(), tolower);

    The same doesn’t work with std::tolower fyi.


    Empact commented at 1:40 am on July 16, 2018:

    You could cast like so, I believe that has the same effect: std::transform(net.begin(), net.end(), net.begin(), (int (*)(int)) std::tolower);

    Note tolower accepts an int but handles it as an unsigned char: https://en.cppreference.com/w/c/string/byte/tolower


    l2a5b1 commented at 1:13 am on July 18, 2018:

    Thanks @Empact!

    I opted for the lambda expression, because it allowed me to pass unsigned char characters to the tolower function.

    I understood from the referenced documentation that it doesn’t handle the int as an unsigned char (character to be converted. If the value of ch is not representable as unsigned char and does not equal EOF, the behavior is undefined.).

    I hope you like the new implementation better.

  7. Empact commented at 1:37 am on July 16, 2018: member
    Good to include <algorithm>/<ctype.h> where appropriate.
  8. Empact commented at 1:42 am on July 16, 2018: member
    Concept ACK
  9. laanwj commented at 6:59 am on July 16, 2018: member
    • is std::tolower locale-independent? (I don’t think so, as @practicalswift mentions, so let’s use this opportunity to write a locale-independent ASCII deterministic one)
    • I’d prefer to factor this out to our own strtolower function in utilstrencodings.cpp, instead of writing this out here
  10. DrahtBot commented at 11:18 am on July 16, 2018: member
    • #13945 (Refactoring CRPCCommand with enum category by isghe)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  11. l2a5b1 commented at 2:03 pm on July 16, 2018: contributor

    Many thanks for the review comments! @laanwj

    is std::tolower locale-independent?

    As used in this project at the moment both std::tolower and std::toupper should be locale-independent and fully deterministic given that the arguments are std::string objects with characters in the standard 7-bit ASCII range.

    • In one instance std::toupper is used to uppercase the first character of CRPCCommand::category, which gets its value from arrays declared like the one below where category is a std::string that uses standard 7-bit ASCII characters:

    https://github.com/bitcoin/bitcoin/blob/171b03de065dae94ed6beb5db9ba7f08e92e7c7c/src/rpc/blockchain.cpp#L1923-L1929

    • In the second instance it lowercases network names where the accepted values are limited to [“ipv4”, “ipv6”, “tor”, “onion”] to catch a potential “IPv4” and “IPv6” input.

    In this particular case I have also raised the suggestion to drop boost::to_lower(net) and replace it with:

    0if (net == "ipv4" || net == "IPv4") return NET_IPV4;
    1if (net == "ipv6" || net == "IPv6") return NET_IPV6;
    

    I would even prefer this alternative if the modified external behaviour is acceptable. @laanwj

    I’d prefer to factor this out to our own strtolower function in utilstrencodings.cpp, instead of writing this out here

    I had definitely considered this but felt that it would be a potential rabbit hole without guidance.

    Implementing utility functions strtolower and strtoupper without the use of the std::tolower or std::toupper should be trivial if it is acceptable to limit their use to std::string objects with standard 7-bit ASCII characters.

    Is this good enough for our purposes?

    If the objective is to write custom functions that accept std::string objects with extended 8-bit ASCII characters, but can’t use std::tolower or std::toupper, then I am not sure this is doable unless we simplify things tremendously. If the most important thing is to be deterministic, one thing that come to mind is doing conversions based on a single character encoding like ISO-8859-15 and use conversion tables that literally transform characters without considering different alphabets and language specific cases.

    Either way, I am happy to give this another shot by addressing @Empact ’s comments; or pursuing alternatives proposed by @practicalswift and @laanwj if I can get further guidance in which direction to take this.

  12. practicalswift commented at 4:11 pm on July 16, 2018: contributor

    @251Labs For say the custom locale independent std::toupper function you would simply return any character outside of the ASCII range as is:

    In the default “C” locale, the following lowercase letters abcdefghijklmnopqrstuvwxyz are replaced with respective uppercase letters ABCDEFGHIJKLMNOPQRSTUVWXYZ.

    More generally: as long as your custom functions return the same value as the standard library equivalents would have done under the default “C” locale everything is fine.

  13. l2a5b1 force-pushed on Jul 17, 2018
  14. l2a5b1 force-pushed on Jul 17, 2018
  15. l2a5b1 force-pushed on Jul 17, 2018
  16. l2a5b1 force-pushed on Jul 17, 2018
  17. l2a5b1 commented at 1:32 am on July 18, 2018: contributor

    Thanks again for the feedback!

    Per the review comments, I have implemented custom ToLower and ToUpper functions in 6b2d814 that are locale independent and ASCII deterministic.

    I have tried to make the ToLower(unsigned char) and ToUpper(unsigned char) functions defined in utilstrencodings.h as efficient as possible. I hope I did not miss something (obvious) that makes it a failed attempt.

  18. in src/test/util_tests.cpp:1207 in 6b2d814211 outdated
    1201@@ -1202,4 +1202,63 @@ BOOST_AUTO_TEST_CASE(test_DirIsWritable)
    1202     fs::remove(tmpdirname);
    1203 }
    1204 
    1205+BOOST_AUTO_TEST_CASE(test_ToLower)
    1206+{
    1207+    BOOST_CHECK(ToLower('@') == '@');
    


    sipa commented at 1:54 am on July 18, 2018:
    Use BOOST_CHECK_EQUAL(ToLower('@'), '@') (it gives a clearer diagnostic message in case of failure).

    l2a5b1 commented at 1:15 pm on July 18, 2018:
    Thanks, addressed in 0234af5
  19. in src/utilstrencodings.h:186 in 6b2d814211 outdated
    181+ * @return          the lowercase equivalent of c; or the argument
    182+ *                  if no conversion is possible.
    183+ */
    184+inline unsigned char ToLower(unsigned char c)
    185+{
    186+    static const unsigned char mappings[256] = {
    


    sipa commented at 1:56 am on July 18, 2018:

    That’s a lot of code!

    What about return (c >= 'A' && c <= 'Z' ? (c - 'A') + 'a' : c);?


    l2a5b1 commented at 1:13 pm on July 18, 2018:

    Thanks @sipa,

    I asked myself the same question and I am happy to change it!

    Just to share why I decided to use a mappings array:

    Support for extended ASCII

    Should there be a time where it becomes important to have support for characters in the extended ASCII range then it will be trivial to implement it. It will then just be a matter of adjusting entries in mappings array vs extending the implementation with control statements like if (c == 'Ü') return 'ü'; or alternative logic.

    Performance

    I wanted our locale independent functions to be a good citizen in terms of performance for code that requires it.

    I did some naive tests to get an impression what the code would compile to and it seemed that the compiler was able to generate more efficient code in the case return mappings[c];

    At all optimization levels I would get a pointer to the mappings array where mappings are performed on:

    0lea rax, [rip + __ZZ7ToLowerhE15tolowermappings]
    1movzx   eax, byte ptr [rdi + rax]
    

    Alternatively in the case of (c >= 'A' && c <= 'Z' ? (c - 'A') + 'a' : c) the compiler would generate something like this at O3 optimization:

    0    mov eax, edi 
    1    add al, -65 
    2    cmp al, 26
    3    jae LBB3_2
    4    add dil, 32
    5LBB3_2:
    6    movzx   eax, dil
    

    Particularly in the case of iterations I felt the compiler was able to do better optimizations, for example minimizing iterations by operating in sequences. In the case of the former it could do 4 mapping operations per sequence; while in the latter of at most two mappings per iteration.

    On this basis of this I assumed the code would also perform better, but I realize I can be very wrong because I am no expert at this.

    I hope this feedback is useful. It it isn’t, I will gladly change the code!


    practicalswift commented at 1:28 pm on July 18, 2018:
    I strongly prefer @sipa:s more readable shorter version. I’d bet any potential performance difference is insignificant in the grand scheme of things :-)

    l2a5b1 commented at 1:32 pm on July 18, 2018:
    Thanks @practicalswift, will update the code per @sipa’s feedback!

    Empact commented at 1:32 pm on July 18, 2018:
    I agree with @sipa. If we need to add extended ascii support, we can address that then, and this is not a performance-critical operation that calls for extra code to optimize it. Simple, repetitive code adds visual noise and can hide bugs as well.

    l2a5b1 commented at 1:45 pm on July 18, 2018:
    Thanks @empact, I will update the pull request!

    l2a5b1 commented at 8:59 am on July 19, 2018:
    Thanks again for the feedback. I have updated the pull request and addressed your comments in 53b3bfa.
  20. in src/utilstrencodings.h:252 in 6b2d814211 outdated
    247+ * Converts the given string to its uppercase equivalent.
    248+ * This function is locale independent. It only converts lowercase
    249+ * characters in the standard 7-bit ASCII range.
    250+ * @param[in,out] str   the string to convert to uppercase.
    251+ */
    252+void ToUpper(std::string& str);
    


    sipa commented at 1:57 am on July 18, 2018:
    I feel it’s slightly confusing to have a version of ToUpper that operates on chars (and does not modify the argument) while the version operating on vectors does modify. Perhaps call this ConvertToUpper or so?

    l2a5b1 commented at 1:16 pm on July 18, 2018:
    Thanks, addressed in 0234af5
  21. l2a5b1 force-pushed on Jul 18, 2018
  22. in src/rpc/server.cpp:196 in f296a939ba outdated
    190@@ -192,9 +191,9 @@ std::string CRPCTable::help(const std::string& strCommand, const JSONRPCRequest&
    191                     if (!category.empty())
    192                         strRet += "\n";
    193                     category = pcmd->category;
    194-                    std::string firstLetter = category.substr(0,1);
    195-                    boost::to_upper(firstLetter);
    196-                    strRet += "== " + firstLetter + category.substr(1) + " ==\n";
    197+                    std::string heading = category;
    198+                    Capitalize(heading);
    199+                    strRet += "== " + heading + " ==\n";
    


    Empact commented at 1:23 pm on July 18, 2018:
    nit: I would return the string from Capitalize and inline that into the string construction.
  23. in src/utilstrencodings.cpp:553 in f296a939ba outdated
    544@@ -544,3 +545,18 @@ bool ParseFixedPoint(const std::string &val, int decimals, int64_t *amount_out)
    545     return true;
    546 }
    547 
    548+void ConvertToLower(std::string& str)
    549+{
    550+    std::transform(str.begin(), str.end(), str.begin(), [](unsigned char c){return ToLower(c);});
    551+}
    552+
    553+void ConvertToUpper(std::string& str)
    


    Empact commented at 1:24 pm on July 18, 2018:
    mu-nit: Ruby uses the terms Downcase and Upcase for these operations, which I like. Just a suggestion.

    Empact commented at 1:37 pm on July 18, 2018:
    I don’t think we should define ConvertToUpper if we’re not using it. The construction can be trivially re-created from ConvertToLower if we need it, and in the meantime it’s just dead code.
  24. in src/utilstrencodings.h:223 in f296a939ba outdated
    218+ * characters in the standard 7-bit ASCII range.
    219+ * @param[in] c     the character to convert to uppercase.
    220+ * @return          the uppercase equivalent of c; or the argument
    221+ *                  if no conversion is possible.
    222+ */
    223+inline unsigned char ToUpper(unsigned char c)
    


    Empact commented at 1:33 pm on July 18, 2018:
    nit: constexpr
  25. in src/utilstrencodings.h:184 in f296a939ba outdated
    179+ * characters in the standard 7-bit ASCII range.
    180+ * @param[in] c     the character to convert to lowercase.
    181+ * @return          the lowercase equivalent of c; or the argument
    182+ *                  if no conversion is possible.
    183+ */
    184+inline unsigned char ToLower(unsigned char c)
    


    Empact commented at 1:33 pm on July 18, 2018:
    nit: constexpr
  26. l2a5b1 force-pushed on Jul 18, 2018
  27. in src/test/util_tests.cpp:1227 in 400dadf4a5 outdated
    1207+    BOOST_CHECK_EQUAL(ToLower('@'), '@');
    1208+    BOOST_CHECK_EQUAL(ToLower('A'), 'a');
    1209+    BOOST_CHECK_EQUAL(ToLower('Z'), 'z');
    1210+    BOOST_CHECK_EQUAL(ToLower('['), '[');
    1211+    BOOST_CHECK_EQUAL(ToLower(0), 0);
    1212+    BOOST_CHECK_EQUAL(ToLower(255), 255);
    


    practicalswift commented at 9:23 am on July 19, 2018:
    Please add some ToUpper(...) tests as well :-)

    l2a5b1 commented at 11:22 am on July 19, 2018:
    Thanks @practicalswift! I was a bit too eager when I removed ToUpper(std::string&) and its tests (facepalm). Fixed in dff017f.
  28. l2a5b1 force-pushed on Jul 19, 2018
  29. in src/rpc/server.cpp:194 in 76d70e5fb1 outdated
    190@@ -192,9 +191,7 @@ std::string CRPCTable::help(const std::string& strCommand, const JSONRPCRequest&
    191                     if (!category.empty())
    192                         strRet += "\n";
    193                     category = pcmd->category;
    194-                    std::string firstLetter = category.substr(0,1);
    195-                    boost::to_upper(firstLetter);
    196-                    strRet += "== " + firstLetter + category.substr(1) + " ==\n";
    197+                    strRet += "== " + Capitalize(category) + " ==\n";
    


    Empact commented at 1:41 pm on July 19, 2018:
    ~Since this modifies the category in-place, I think this breaks the comparison logic, e.g. on line 189. Sorry, I think my original nit was misguided.~

    practicalswift commented at 1:53 pm on July 19, 2018:
    Capitalize(…) (std::string Capitalize(std::string str)) does not modify category in-place :-)

    l2a5b1 commented at 2:46 pm on July 19, 2018:
    Hi @Empact, we should be good. When I addressed your feedback, I did change the method signature from void Capitalize(std::string&) to std::string Capitalize(std::string), which passes the argument by value (category is copied into str).

    Empact commented at 2:48 pm on July 19, 2018:
    Thanks, I misread - agree the argument is passed by value, so it’s fine. 👍
  30. Empact commented at 2:50 pm on July 19, 2018: member
    utACK 76d70e5fb1923cdbc4771f438f10ae0039beccaa
  31. fanquake added this to the "In progress" column in a project

  32. DrahtBot added the label Needs rebase on Jul 24, 2018
  33. l2a5b1 force-pushed on Jul 25, 2018
  34. DrahtBot removed the label Needs rebase on Jul 25, 2018
  35. l2a5b1 commented at 3:28 pm on July 25, 2018: contributor
    aa32fe3 has been rebased on master (1211b15bf6c0b2904d90b96a9b3834c5cb9e7b4e)
  36. DrahtBot added the label Needs rebase on Aug 28, 2018
  37. Implements ParseNetwork unit test.
    This commit implements a unit test that validates the `ParseNetwork(std::string)` implementation in `netbase.cpp`.
    e2ba043b8d
  38. Implements custom tolower and toupper functions.
    This commit implements custom equivalents for the C and C++ `tolower` and `toupper` Standard Library functions.
    In addition it implements a utility function to capitalize the first letter of a string.
    7a208d9fad
  39. Removes the Boost case_conv.hpp dependency.
    This commit removes the `boost/algorithm/string/case_conv.hpp` dependency from the project. It replaces the `boost::to_lower` and `boost::to_upper` functions with custom functions that are locale independent and ASCII deterministic.
    b193d5a443
  40. l2a5b1 force-pushed on Aug 28, 2018
  41. DrahtBot removed the label Needs rebase on Aug 28, 2018
  42. l2a5b1 commented at 6:49 pm on August 28, 2018: contributor
    b193d5a has been rebased on master (aa39ca764578a9017e03796b554955796022eb0d), woot!
  43. in src/test/netbase_tests.cpp:318 in b193d5a443
    313+    BOOST_CHECK_EQUAL(ParseNetwork("IPv6"), NET_IPV6);
    314+    BOOST_CHECK_EQUAL(ParseNetwork("ONION"), NET_ONION);
    315+    BOOST_CHECK_EQUAL(ParseNetwork("TOR"), NET_ONION);
    316+
    317+    BOOST_CHECK_EQUAL(ParseNetwork(":)"), NET_UNROUTABLE);
    318+    BOOST_CHECK_EQUAL(ParseNetwork("tÖr"), NET_UNROUTABLE);
    


    laanwj commented at 12:59 pm on August 29, 2018:
    lol, tÖr, that’s like the Swedish version of Tor right?

    l2a5b1 commented at 3:31 pm on August 29, 2018:
    Lol @laanwj, best review comment ever! I do hidden messages or easter eggs from time to time, but this isn’t one of them 😂😂
  44. laanwj commented at 12:59 pm on August 29, 2018: member
    utACK b193d5a443bfd994936ad21b807b2bb37756ef2c certainly an improvement
  45. ken2812221 referenced this in commit 5924dadc2f on Aug 29, 2018
  46. laanwj merged this on Aug 29, 2018
  47. laanwj closed this on Aug 29, 2018

  48. fanquake moved this from the "In progress" to the "Done" column in a project

  49. kittywhiskers referenced this in commit 466b153103 on May 22, 2021
  50. Munkybooty referenced this in commit e0e67a220f on Jun 30, 2021
  51. kittywhiskers referenced this in commit aa51821b69 on Jul 13, 2021
  52. kittywhiskers referenced this in commit 648e5ede87 on Jul 13, 2021
  53. kittywhiskers referenced this in commit 6492d47b22 on Jul 15, 2021
  54. kittywhiskers referenced this in commit 7ab19cc9ee on Jul 15, 2021
  55. kittywhiskers referenced this in commit 4619630551 on Jul 16, 2021
  56. UdjinM6 referenced this in commit 730a89f8ed on Jul 16, 2021
  57. DrahtBot locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-07-06 01:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me