sync, log: inline lock contention logging macro to fix duration, improve BCLog::LogMsg() #22904

pull jonatack wants to merge 4 commits into bitcoin:master from jonatack:logging-and-lock-contention-fixups changing 6 files +27 −30
  1. jonatack commented at 11:08 pm on September 6, 2021: member

    Follow-up to #22736.

    The first commit addresses the issue identified and reported by Martin Ankerl in #22736 (review) to fix the lock contention duration reporting.

    The next three commits make improvements to the timer code in BCLog::LogMsg() and add util/types.h with an ALWAYS_FALSE template, that springboard from #22736 (review) by Marco Falke.

  2. sync: inline lock contention logging macro to fix time duration
    Co-authored-by: Martin Ankerl <martin.ankerl@gmail.com>
    8d2f847ed9
  3. log, timer: improve BCLog::LogMsg()
    - make timer code more homogeneous
    - replace division with multiplication
    - log if the time type is unexpected
    498b323425
  4. DrahtBot added the label Tests on Sep 7, 2021
  5. in src/logging/timer.h:61 in 498b323425 outdated
    59@@ -60,21 +60,13 @@ class Timer
    60 
    61         if (std::is_same<TimeType, std::chrono::microseconds>::value) {
    


    martinus commented at 6:55 am on September 7, 2021:

    Maybe overkill, but it would be possible here to use if constexpr so the error case becomes a compile time error, like so:

    std::string LogMsg(const std::string& msg)
    {
        const auto end_time = GetTime<std::chrono::microseconds>() - m_start_t;
        if (m_start_t.count() <= 0) {
            return strprintf("%s: %s", m_prefix, msg);
        }

        if constexpr (std::is_same<TimeType, std::chrono::microseconds>::value) {
            return strprintf("%s: %s (%iμs)", m_prefix, msg, end_time.count());
        } else if constexpr (std::is_same<TimeType, std::chrono::milliseconds>::value) {
            return strprintf("%s: %s (%.2fms)", m_prefix, msg, end_time.count() * 0.001);
        } else if constexpr (std::is_same<TimeType, std::chrono::seconds>::value) {
            return strprintf("%s: %s (%.2fs)", m_prefix, msg, end_time.count() * 0.000001);
        } else {
            static_assert(dependent_false<TimeType>, "Error: unexpected time type");
        }
    }
    

    for the static_assert to work though this needs a utility function dependent_false, which could in future be useful in other cases too (taken from p1830r1):

    template <bool value, typename... Args>
    inline constexpr bool dependent_bool_value = value;

    template <typename... Args>
    inline constexpr bool dependent_false = dependent_bool_value<false, Args...>;
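
    For context on why the helper is needed: a bare `static_assert(false, ...)` in the final `else` branch is ill-formed even if that branch is never instantiated, whereas `dependent_false<TimeType>` is only evaluated when the branch is instantiated for a concrete type. A standalone sketch of the pattern (simplified names, not the exact tree code):

    ```cpp
    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <type_traits>

    // Dependent-bool helper from p1830r1, as in the snippet above: the value
    // depends on the template arguments, so the static_assert below fires only
    // if the final else branch is actually instantiated.
    template <bool value, typename... Args>
    inline constexpr bool dependent_bool_value = value;

    template <typename... Args>
    inline constexpr bool dependent_false = dependent_bool_value<false, Args...>;

    // Simplified stand-in for Timer::LogMsg(): picks a unit suffix at compile time.
    template <typename TimeType>
    std::string Unit()
    {
        if constexpr (std::is_same<TimeType, std::chrono::microseconds>::value) {
            return "μs";
        } else if constexpr (std::is_same<TimeType, std::chrono::milliseconds>::value) {
            return "ms";
        } else if constexpr (std::is_same<TimeType, std::chrono::seconds>::value) {
            return "s";
        } else {
            static_assert(dependent_false<TimeType>, "Error: unexpected time type");
        }
    }

    int main()
    {
        std::printf("%s %s\n", Unit<std::chrono::milliseconds>().c_str(),
                    Unit<std::chrono::seconds>().c_str());
        // Unit<std::chrono::minutes>(); // uncommenting this fails to compile
    }
    ```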
    

    MarcoFalke commented at 7:13 am on September 7, 2021:
    You can use commit fac588b57bece4d3572a8ecdcd58140b5343cdf6

    jonatack commented at 4:44 pm on September 7, 2021:
    Good ideas! Done.
  6. in src/sync.h:140 in 8d2f847ed9 outdated
    136@@ -138,7 +137,7 @@ class SCOPED_LOCKABLE UniqueLock : public Base
    137     {
    138         EnterCritical(pszName, pszFile, nLine, Base::mutex());
    139         if (Base::try_lock()) return;
    140-        LockContention(pszName, pszFile, nLine); // log the contention
    141+        LOG_TIME_MICROS_WITH_CATEGORY(strprintf("lock contention %s, %s:%d", pszName, pszFile, nLine), BCLog::LOCK);
    


    martinus commented at 6:57 am on September 7, 2021:
    I think it would be nice to also have util::ThreadGetInternalName() in the log message, then it’s more obvious which thread is blocking which

    martinus commented at 2:47 pm on September 7, 2021:
    disregard that, I just saw that this is already nicely possible with -logthreadnames.
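    The shape of the inlined fix, for readers without the diff context: the scoped timer is started only after try_lock() fails, so its lifetime spans exactly the blocking lock() call. A self-contained sketch (ScopedWaitTimer and Enter are illustrative stand-ins, not the real UniqueLock/LOG_TIME_MICROS_WITH_CATEGORY code):

    ```cpp
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    // Minimal sketch of the pattern: when try_lock() fails, start a scoped
    // timer, then block on lock(); the timer's destructor logs the elapsed
    // wait, i.e. the actual contention duration.
    class ScopedWaitTimer
    {
        std::chrono::steady_clock::time_point m_start{std::chrono::steady_clock::now()};
    public:
        ~ScopedWaitTimer()
        {
            const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::steady_clock::now() - m_start).count();
            std::printf("lock contention lasted %lldus\n", static_cast<long long>(us));
        }
    };

    void Enter(std::mutex& m)
    {
        if (m.try_lock()) return;
        ScopedWaitTimer timer; // alive for exactly the duration of the wait
        m.lock();
    }

    int main()
    {
        std::mutex m;
        std::thread holder([&] {
            std::lock_guard<std::mutex> l(m);
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        });
        std::this_thread::sleep_for(std::chrono::milliseconds(10)); // let holder grab the lock
        Enter(m); // contends, logs a wait of roughly 40ms
        m.unlock();
        holder.join();
    }
    ```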
  7. martinus approved
  8. martinus commented at 7:03 am on September 7, 2021: contributor
    ACK 498b323425d960274c40472a6a847afc1982201d, code review & tested. Found one potential improvement.
  9. Add util/types.h with ALWAYS_FALSE template bddae7e7ff
  10. Make unexpected time type in BCLog::LogMsg() a compile-time error f530202353
  11. in src/serialize.h:26 in e9af7c2fab outdated
    22@@ -23,6 +23,7 @@
    23 
    24 #include <prevector.h>
    25 #include <span.h>
    26+#include <util/types.h>
    


    jonatack commented at 5:15 pm on September 7, 2021:
    I guess this can be dropped, though I see how it might be used in this file for SERIALIZE_METHODS.
  12. jonatack force-pushed on Sep 7, 2021
  13. theStack approved
  14. theStack commented at 11:50 am on September 8, 2021: member

    ACK f530202353a4f8bb444966559aa15681ab3cebc6

    Thanks for catching the timing flaw @martinus (and also sorry to all for not noticing it in my review of #22736)! Indeed the numbers look way more plausible now – on a signet non-debug build (src/bitcoind -signet -debug=lock), lock contention times are now in the range of hundreds to thousands of microseconds on my machine.

  15. in src/test/logging_tests.cpp:35 in 498b323425 outdated
    34+    BOOST_CHECK_EQUAL(sec_timer.LogMsg("test secs"), "tests: test secs (1.00s)");
    35+
    36+    SetMockTime(1);
    37+    auto minute_timer = BCLog::Timer<std::chrono::minutes>("tests", "end_msg");
    38+    SetMockTime(2);
    39+    BOOST_CHECK_EQUAL(minute_timer.LogMsg("test minutes"), "Error: unexpected time type");
    


    MarcoFalke commented at 11:56 am on September 8, 2021:

    seems a bit odd to add a new feature in one commit and then remove it in the next commit.

    If you squash the two commits and keep the ordering, the diff will be a lot smaller, if not empty.


    jonatack commented at 1:45 pm on September 9, 2021:
    Kept the last three commits separate as they do three different things and were contributed by three different people.

    MarcoFalke commented at 1:55 pm on September 9, 2021:
    Fair enough

    fanquake commented at 1:59 pm on September 9, 2021:
    3 different commits by 3 different people is irrelevant if we’re just adding code to immediately delete it. As long as bddae7e7ff7bb5931ed807acaef7336f2ee98476 came first, the ordering of the rest didn’t matter. So I don’t understand why the remaining two commits, 498b323425d960274c40472a6a847afc1982201d and f530202353a4f8bb444966559aa15681ab3cebc6, couldn’t have been squashed together and just have the two authors.
  16. martinus commented at 7:09 pm on September 8, 2021: contributor

    re-ACK f530202353a4f8bb444966559aa15681ab3cebc6. I ran a fully synced node for about a day. My node was mostly idle though so not much was going on. I wrote a little script to parse the debug.log and summarize the output to see if anything interesting was going on, here is the result:

    total[ms] average[ms] count thread where
    869.151 17.383 50 msghand cs_main, net_processing.cpp:4051
    434.890 27.181 16 opencon cs, ./addrman.h:172
    382.321 7.802 49 scheduler cs_main, net_processing.cpp:1476
    153.231 19.154 8 opencon cs_main, net_processing.cpp:1152
    98.148 0.008 11960 msghand cs_vNodes, net.cpp:2263
    87.235 0.022 3983 msghand m_mutex, ./checkqueue.h:168
    68.884 0.004 18252 msghand cs_vNodes, net.cpp:2228
    66.666 1.449 46 msghand m_recent_confirmed_transactions_mutex, net_processing.cpp:1649
    62.938 0.054 1167 scriptch.8 m_mutex, ./checkqueue.h:77
    62.812 0.054 1173 scriptch.6 m_mutex, ./checkqueue.h:77
    62.467 0.053 1178 scriptch.5 m_mutex, ./checkqueue.h:77
    61.860 0.050 1241 scriptch.7 m_mutex, ./checkqueue.h:77
    61.779 0.052 1192 scriptch.2 m_mutex, ./checkqueue.h:77
    61.284 0.051 1200 scriptch.3 m_mutex, ./checkqueue.h:77
    60.051 0.051 1181 scriptch.10 m_mutex, ./checkqueue.h:77
    59.659 0.049 1207 scriptch.4 m_mutex, ./checkqueue.h:77
    58.570 0.049 1207 scriptch.1 m_mutex, ./checkqueue.h:77
    58.133 0.049 1175 scriptch.0 m_mutex, ./checkqueue.h:77
    56.667 0.046 1245 scriptch.9 m_mutex, ./checkqueue.h:77
    47.258 47.258 1 opencon cs, ./addrman.h:206
    31.575 31.575 1 msghand cs, ./addrman.h:180
    23.652 0.004 5278 net pnode->cs_hSocket, net.cpp:1539
    23.523 3.920 6 msghand cs_main, net_processing.cpp:4522
    17.490 0.008 2266 net pnode->cs_vSend, net.cpp:1359
    15.514 0.287 54 msghand g_cs_orphans, net_processing.cpp:4051
    9.338 0.004 2346 net cs_vNodes, net.cpp:1611
    8.315 0.004 2080 net cs_vNodes, net.cpp:1522
    6.490 0.004 1770 net cs_vNodes, net.cpp:1342
    3.453 0.003 1033 net cs_vNodes, net.cpp:1236
    2.838 0.218 13 msghand ::cs_main, validation.cpp:4950
    2.663 2.663 1 msghand cs_main, net_processing.cpp:3130
    1.745 0.291 6 scheduler cs_main, net_processing.cpp:4278
    1.443 0.005 299 msghand newTaskMutex, scheduler.cpp:76
    1.064 0.003 317 net cs_vNodes, net.cpp:1287
    0.799 0.266 3 msghand cs_main, net_processing.cpp:3411
    0.639 0.004 145 msghand cs_vNodes, ./net.h:828
    0.615 0.308 2 msghand cs_main, net_processing.cpp:2921
    0.487 0.061 8 msghand m_mutex, ./checkqueue.h:77
    0.424 0.071 6 msghand g_cs_orphans, net_processing.cpp:3451
    0.409 0.004 100 opencon cs_vNodes, net.cpp:1906
    0.262 0.262 1 msghand cs_main, net_processing.cpp:3204
    0.256 0.008 32 loadblk cs_main, validation.cpp:4498
    0.242 0.004 58 msghand node.cs_hSocket, net.cpp:781
    0.236 0.118 2 msghand cs_main, net_processing.cpp:2600
    0.174 0.004 44 net mutexMsgProc, net.cpp:1630
    0.110 0.110 1 scheduler cs_main, index/base.cpp:282
    0.108 0.108 1 msghand g_cs_orphans, txorphanage.cpp:156
    0.098 0.005 18 scheduler m_cs_callbacks_pending, scheduler.cpp:151
    0.079 0.004 19 msghand m_cs_callbacks_pending, scheduler.cpp:183
    0.078 0.004 20 addcon cs_vNodes, net.cpp:2120
    0.051 0.003 15 msghand pnode->cs_vSend, net.cpp:3016
    0.028 0.006 5 scheduler instance->m_cs_callbacks_pending, scheduler.cpp:168
    0.016 0.005 3 scheduler m_cs_callbacks_pending, scheduler.cpp:137
    0.014 0.003 5 net pnode->cs_vProcessMsg, net.cpp:1572
    0.009 0.003 3 scheduler newTaskMutex, scheduler.cpp:76
    0.005 0.002 2 msghand pfrom->cs_vProcessMsg, net_processing.cpp:4077
    0.005 0.002 2 msghand m_cs_callbacks_pending, scheduler.cpp:137
    0.004 0.002 2 opencon cs_vNodes, net.cpp:332
    0.002 0.002 1 scheduler g_cs_orphans, txorphanage.cpp:175
    0.001 0.001 2 loadblk m_mutex, random.cpp:419
    0.001 0.001 2 msghand cs_args, util/system.cpp:1024
    0.001 0.001 1 loadblk newTaskMutex, scheduler.cpp:76
    0.001 0.001 1 loadblk cs_args, util/system.cpp:1024
    0.001 0.001 1 net cs_hSocket, net.cpp:499
    0.001 0.001 1 msghand cs_main, net_processing.cpp:2044

    Almost all of the contention in cs_main, net_processing.cpp:4051 happened at the startup of the node.
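
    The summarizing script itself was not posted; a hypothetical sketch of the same aggregation (grouping "lock contention ... completed" lines by thread and lock site, and totaling the durations) might look like:

    ```cpp
    #include <cassert>
    #include <cstdio>
    #include <map>
    #include <regex>
    #include <sstream>
    #include <string>

    struct Stats { double total_ms = 0.0; int count = 0; };

    // Group contention log lines by "thread lock-site" and total the waits.
    // The `..?s` at the end tolerates both "μs" (two bytes in UTF-8) and "us".
    std::map<std::string, Stats> Summarize(std::istream& log)
    {
        static const std::regex re(
            R"(\[([^\]]+)\] .*lock contention (.+) completed \((\d+)..?s\))");
        std::map<std::string, Stats> out;
        std::string line;
        while (std::getline(log, line)) {
            std::smatch m;
            if (!std::regex_search(line, m, re)) continue;
            Stats& s = out[m[1].str() + " " + m[2].str()];
            s.total_ms += std::stod(m[3].str()) / 1000.0; // us -> ms
            ++s.count;
        }
        return out;
    }

    int main()
    {
        std::istringstream log(
            "2021-09-29T18:35:19.275213Z [scriptch.2] [logging/timer.h:51] [Log] "
            "Enter: lock contention m_mutex, ./checkqueue.h:77 completed (306us)\n"
            "2021-09-29T18:35:19.275244Z [scriptch.2] [logging/timer.h:51] [Log] "
            "Enter: lock contention m_mutex, ./checkqueue.h:77 completed (222us)\n");
        const auto stats = Summarize(log);
        const auto& s = stats.at("scriptch.2 m_mutex, ./checkqueue.h:77");
        assert(s.count == 2);
        std::printf("total %.3fms count %d\n", s.total_ms, s.count);
    }
    ```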

  17. MarcoFalke commented at 1:53 pm on September 9, 2021: member
    I expect lock contention to be more pronounced in the GUI (or when there is an active RPC poll)
  18. MarcoFalke merged this on Sep 9, 2021
  19. MarcoFalke closed this on Sep 9, 2021

  20. jonatack deleted the branch on Sep 9, 2021
  21. sidhujag referenced this in commit 388bc0e0cc on Sep 11, 2021
  22. ajtowns commented at 7:11 pm on September 29, 2021: member

    This seems to introduce a busy loop when running the checkqueue tests for me:

    $ make -j5 check &
    $ tail -f test/*.log
    2021-09-29T18:35:19.275154Z [test] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:168 started
    2021-09-29T18:35:19.275213Z [scriptch.2] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 completed (306μs)
    2021-09-29T18:35:19.275244Z [scriptch.0] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 completed (222μs)
    2021-09-29T18:35:19.275275Z [scriptch.1] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 completed (127μs)
    2021-09-29T18:35:19.275293Z [scriptch.0] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 started
    2021-09-29T18:35:19.275319Z [scriptch.2] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 started
    2021-09-29T18:35:19.275347Z [test] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:168 completed (109μs)
    2021-09-29T18:35:19.275366Z [scriptch.1] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 started
    2021-09-29T18:35:19.275387Z [scriptch.0] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 completed (62μs)
    2021-09-29T18:35:19.275404Z [test] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:168 started
    2021-09-29T18:35:19.275425Z [scriptch.2] [logging/timer.h:51] [Log] Enter: lock contention m_mutex, ./checkqueue.h:77 completed (69μs)
    

    It does finish eventually with a >1GB log file:

    $ ls -l test/*.log --sort=s -h -r | tail -n2
    -rw-r--r-- 1 aj aj   20M Sep 30 04:30 test/validationinterface_tests.cpp.log
    -rw-r--r-- 1 aj aj  1.3G Sep 30 04:35 test/checkqueue_tests.cpp.log
    $ cat test/checkqueue_tests.cpp.log | grep 'Enter: lock contention' | cut -d\  -f8,9 | sort | uniq -c
          2 instance->m_cs_callbacks_pending, scheduler.cpp:168
          6 m_cs_callbacks_pending, scheduler.cpp:183
    2599034 m_mutex, ./checkqueue.h:168
    7399288 m_mutex, ./checkqueue.h:77
     161286 m, test/checkqueue_tests.cpp:66
          2 newTaskMutex, scheduler.cpp:76
    

    I was thinking that the fact that logging takes its own lock might have been a problem (leading to contention on that lock? or just taking more time leading to more contention on other locks?) but it seems like there really is that many instances of contention for the checkqueue mutex – reducing the logging by a factor of 1000 only reduces the logging proportionally. But gigabyte log files and many minutes to run a simple test doesn’t seem great…

  23. theuni commented at 7:37 pm on September 29, 2021: member

    I suspect you have a bunch of spurious failures. Regardless of whether that’s the case or not, relying on unspecified behavior (spurious mutex failures) from libc/stdlib is a bad idea, especially if it leads to possible remote triggers (3 of the 4 most contended locks are in net_processing).

    IMO this is far too dangerous to have turned on by default.

  24. theuni commented at 7:41 pm on September 29, 2021: member
    Following up: if this were to be reverted, here is an alternative solution to the header macro problem for lock contention and lock ordering. As a caveat, I haven’t benchmarked to see the impact of the lost inlining.
  25. martinus commented at 6:38 am on September 30, 2021: contributor

    I suspect you have a bunch of spurious failures.

    I think the contentions are real and not spurious. For me the file checkqueue_tests.cpp.log is by far the largest, and it looks like most of the contentions come from the test test_CheckQueue_Correct_Random. When I run

    ./src/test/test_bitcoin --run_test=checkqueue_tests/test_CheckQueue_Correct_Random -- DEBUG_LOG_OUT >out.txt
    

    I get a 1.1GB log file. Parsing it with this script I get this output:

    total[ms] average[ms] count thread where
    44386.247 0.039 1133991 test m_mutex, ./checkqueue.h:168
    41293.635 0.039 1065538 scriptch.0 m_mutex, ./checkqueue.h:77
    41167.037 0.039 1068648 scriptch.1 m_mutex, ./checkqueue.h:77
    41140.515 0.039 1068460 scriptch.2 m_mutex, ./checkqueue.h:77
    17.652 0.041 434 test m_mutex, ./checkqueue.h:77

    the first 4 threads have almost the same number of contentions, so it seems to me that they are blocking each other. Especially the 3 scriptch threads have practically the same number of contentions. I suspect that these 3 threads are always started simultaneously but only one of them gets the lock, blocking the other 2.

  26. fanquake commented at 3:12 am on October 4, 2021: member
    Clearly this needs following up on. I’ve opened an issue, #23167, for further discussion / so this doesn’t get lost.
  27. jonatack commented at 11:57 am on October 8, 2021: member

    I think the contentions are real and not spurious.

    Good call @martinus. Testing this further, the contention was not only present on every run but also pre-existing with DEBUG_LOCKCONTENTION defined, as some of these tests are heavily contended by design. Logging contentions by default in the tests served to bring the issue to light, which was the goal.

    Screenshot from 2021-10-08 13-07-06

    Opened #23223 to resolve.

  28. fanquake referenced this in commit c1059c9fef on Apr 8, 2022
  29. Fabcien referenced this in commit 5d9c53f2a6 on Sep 28, 2022
  30. DrahtBot locked this on Oct 30, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-22 18:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me