scheduler test causing pthread_cond_timedwait: Invalid argument error #18227

issue amitiuttarwar openend this issue on February 28, 2020
  1. amitiuttarwar commented at 8:36 pm on February 28, 2020: contributor

    Problem

    In #18037 I introduced code to mock the scheduler from functional tests. I added a scheduler#MockForward method & associated scheduler_tests#mockforward unit test. The unit test caused sporadic failures on one of the Travis machines & failures on Bitcoin builds.

    After a couple attempts at small fixes [1] [2], we decided to disable the test as we investigate the underlying issue.

    Failures

    Some examples of failing builds

    Travis: [1] [2] Note that it is always the travis machine with x86_64 Linux [GOAL: install] [bionic] [no wallet]

    Bitcoin Builds: [1] [2] [3]

    The test always fails in the same way:

    0terminate called after throwing an instance of 'boost::wrapexcept<boost::condition_error>'
    1  what():  boost::condition_variable::do_wait_until failed in pthread_cond_timedwait: Invalid argument
    2unknown location(0): fatal error: in "scheduler_tests/mockforward": signal: SIGABRT (application abort requested)
    

    Reproducing the issue

    One of the issues that makes this really difficult to debug is the failure is hard to reproduce. I have not been able to reproduce it myself. In this section, I’m compiling the information from others that were able to reproduce.

    • @jonasschnelli found that enabling ccache meant consistent failure, and clearing ccache meant consistent success. [1] [2]

    • @MarcoFalke was able to reproduce on a single CPU instance, but unable to extract much debugging information. [1].

    Relevant debugging information

    The boost:condition_variable triggers the SIGABRT as a result of the pthread_cond_timedwait returning an Invalid argument, or EINVAL error. According to the docs [1] & [2], this indicates that either The value specified by cond, mutex, or abstime is invalid or Different mutexes were supplied for concurrent pthread_cond_timedwait() or pthread_cond_wait() operations on the same condition variable.

    Next steps

    This bug doesn’t seem urgent, but should be investigated and resolved. I’m opening this issue to track any relevant comments or findings.

    Thank you to everyone who has jumped in to take a look & thanks in advance to anybody willing to help continue the investigation 🙏🏽

  2. amitiuttarwar added the label Bug on Feb 28, 2020
  3. fanquake added the label Tests on Feb 28, 2020
  4. MarcoFalke commented at 9:14 pm on February 28, 2020: member
    For completeness, the other bugs that mention the same exception “Invalid argument” in the scheduler: #16307, #16027, #14200.
  5. fanquake commented at 9:24 pm on February 28, 2020: member
    @MarcoFalke Could you post the specs/setup for the Google Cloud instance you were using to reliably recreate the issue?
  6. MarcoFalke commented at 9:27 pm on February 28, 2020: member

    I think it was n1-standard-1 (1 vCPU, 3.75 GB memory) and it only happened after a reboot once.

    Also, it only happened with make -j 2 check, as opposed to src/test/test_bitcoin -t scheduler_tests

  7. nothingmuch commented at 0:08 am on February 29, 2020: contributor

    Although I haven’t been able to replicate the failure, my current best guess is an invalid timespec passed to pthread_cond_timedwait from the timed wait loop.

    I’m hoping someone can suggest a cleaner way of corroborating or ruling out this hypothesis (see below), especially since @amitiuttarwar told me that @ajtowns has concluded that an invalid timespec is not the cause (edit: i seem to have interpreted that too strongly, apologies to both)

    Assumptions/observations:

    Alternative explanations for EINVAL and evidence against them

    • unintiialized condvar or mutex, since pthread_cond_init and pthread_mutex_init are called from the constructors of the scheduler member variables boost::condition_variable newTasksScheduled and mutable boost::mutex newTasksMutex
    • destroyed condvar or mutex, since the scheduler lifetime is bound to the test’s scope
    • different mutex for same condvar, since there is only one serviceQueue thread[1]

    As far as I’m aware the only way for the timespec to be invalid is if tv_msec tv_nsec is outside the range [0, 1e9).

    There is also some very weak evidence for this cause, namely condition variable and other timed wait related bugfixes mentioned boost release notes, but these generally pertain to win32, and non monotonic system clock values. Furthermore, such a bug only seems plausible if time values are converted (see my question below about do_wait_until variants and internal_chrono_clock).

    Next steps

    I suppose this can be validated by patching boost in make depends, so that it checks the tv_msec field of the timespec returned by getTs to see if this makes a difference on the CI machines that reliably fail.

    However my understanding of C++ is at it’s limit and it’s not clear to me which definition of do_wait_until is used by wait_until:

    Additionally, it’s not clear to me which definition of wait_until itself is used, i.e. when does detail::internal_chrono_clock correspond to boost::system_clock, but that doesn’t seem consequential.

    In either case, sanity checks could on the following getTs calls’ return values could be added in a patch:

    but like I said I’m hoping someone can suggest a simpler approach, or finds flaws in my reasoning, since I’m not really looking forward to trying this approach without being able to replicate the failure locally.

    [1] furthermore, if BOOST_THREAD_PROVIDES_INTERRUPTIONS is defined boost::condition_variable will use an internal mutex instead the one corresponding to the lock argument

  8. ajtowns commented at 10:40 am on February 29, 2020: member

    I’m hoping someone can suggest a cleaner way of corroborating or ruling out this hypothesis (see below), especially since @amitiuttarwar told me that @ajtowns has concluded that an invalid timespec is not the cause (edit: i seem to have interpreted that too strongly, apologies to both)

    As far as I’m aware the only way for the timespec to be invalid is if tv_msec is outside the range [0, 1e9).

    I think this is a typo – should be tv_nsec (nanoseconds).

    And yeah that’s the only way I can see it happening per http://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_cond_timedwait.c;h=7278ec45b0eb0b48be50fe832fcd17ae988dca27;hb=HEAD

    Okay, I guess I can delete the paragraph about how that doesn’t seem plausible. I added:

     0#include <pthread.h>
     1#include <errno.h>
     2inline int my_pthread_cond_timedwait(pthread_cond_t *cond,
     3       pthread_mutex_t *mutex,
     4       const struct timespec *abstime)
     5{
     6    if (abstime->tv_nsec < 0) {
     7        if (abstime->tv_sec < 0) return EISDIR;
     8        else return ENOTDIR;
     9    }
    10    if (abstime->tv_nsec >= 1000000000l) return ENOTBLK;
    11    return pthread_cond_timedwait(cond,mutex,abstime);
    12}
    13#define pthread_cond_timedwait(c,m,t) my_pthread_cond_timedwait(c,m,t)
    

    as the very first thing in scheduler.cpp and ran it on travis, and got:

    0terminate called after throwing an instance of 'boost::wrapexcept<boost::condition_error>'
    1  what():  boost::condition_variable::do_wait_until failed in pthread_cond_timedwait: Is a directory
    

    (Ref: https://travis-ci.org/ajtowns/bitcoin/jobs/656619767?utm_medium=notification&utm_source=github_status and https://github.com/ajtowns/bitcoin/commits/202002-pthread )

    so seems like it’s getting an invalid timespec passed in, with both tv_sec < 0 and tv_nsec, so I’m guessing as a result of int-division-rounding-towards-zero making ns_to_timespec mess up in https://www.boost.org/doc/libs/1_70_0/boost/thread/detail/platform_time.hpp . But pthread_cond_timedwait expects an absolute time since the unix epoch, so trying to pass through a negative value is crazy, and I can’t see anywhere that should be causing that…

  9. MarcoFalke commented at 8:38 pm on February 29, 2020: member

    Those are the steps that lead to the exception (with example values filled in for illustrative purposes):

    Reported upstream: https://github.com/boostorg/thread/issues/308

  10. MarcoFalke added the label Upstream on Feb 29, 2020
  11. ajtowns commented at 5:14 am on March 1, 2020: member

    @MarcoFalke nice!

    While I was thinking it was a mutex problem I had a go at switching scheduler.cpp to the sync.h primitives to see if that would make the bug obvious. Might be worth doing anyway, and I think the code in the standard libraries for converting to timespec avoids the boost bug above. Opened #18234 if anyone wants to take a look.

  12. laanwj closed this on Mar 6, 2020

  13. laanwj referenced this in commit d20d5dc824 on Mar 11, 2020
  14. sidhujag referenced this in commit d9eaa09aea on Mar 11, 2020
  15. MarcoFalke referenced this in commit 89a6bb9245 on Aug 28, 2020
  16. sidhujag referenced this in commit a15661f3b5 on Nov 10, 2020
  17. PastaPastaPasta referenced this in commit 16feb0bf56 on Jun 27, 2021
  18. PastaPastaPasta referenced this in commit cdf7274255 on Jun 28, 2021
  19. PastaPastaPasta referenced this in commit 07813b5bce on Jun 29, 2021
  20. PastaPastaPasta referenced this in commit 7b9dbad9c2 on Jul 1, 2021
  21. PastaPastaPasta referenced this in commit b6d94d9a67 on Jul 1, 2021
  22. PastaPastaPasta referenced this in commit 2f6f313942 on Jul 14, 2021
  23. PastaPastaPasta referenced this in commit 417099870a on Sep 28, 2021
  24. PastaPastaPasta referenced this in commit 9b8884b430 on Sep 28, 2021
  25. kittywhiskers referenced this in commit efb9cdad7a on Oct 12, 2021
  26. DrahtBot locked this on Feb 15, 2022
  27. gades referenced this in commit 65ec4496fb on May 2, 2022
  28. Munkybooty referenced this in commit 308aa902f3 on Nov 8, 2022
  29. Munkybooty referenced this in commit 229a901ad8 on Nov 15, 2022
  30. Munkybooty referenced this in commit 0b453d5296 on Nov 25, 2022
  31. Munkybooty referenced this in commit e08534c56e on Nov 25, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-19 00:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me