Better error and log messages #218

pull ryanofsky wants to merge 4 commits into bitcoin-core:master from ryanofsky:pr/errs changing 10 files +212 −11
  1. ryanofsky commented at 9:45 pm on September 24, 2025: collaborator
    This PR contains some logging and error message improvements I made locally in recent months that seem useful to upstream. Details are in commit messages
  2. DrahtBot commented at 9:45 pm on September 24, 2025: none

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK ismaelsadeeq
    Stale ACK theuni

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #223 (ci: Replace nix-shell with equivalent nix develop command by ryanofsky)
    • #222 (test, ci: Fix threadsanitizer errors in mptest by ryanofsky)
    • #212 (ci: add newdeps job testing newer versions of cmake and capnproto by ryanofsky)
    • #209 (cmake: Increase cmake policy version by ryanofsky)
    • #175 (Set cmake_minimum_required(VERSION 3.22) by maflcko)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. ryanofsky force-pushed on Sep 25, 2025
  4. ryanofsky commented at 1:50 pm on September 25, 2025: collaborator
    Updated 395e3266b5e29a3241e8ea6c5e4024971c21dd39 -> a5b8a2a3293a9650feab1101afa35a1d47981049 (pr/errs.1 -> pr/errs.2, compare) to fix iwyu errors (https://github.com/bitcoin-core/libmultiprocess/actions/runs/17990500563)
  5. ismaelsadeeq commented at 10:26 am on September 26, 2025: member
    Concept ACK
  6. ryanofsky force-pushed on Sep 26, 2025
  7. ryanofsky commented at 5:46 pm on September 26, 2025: collaborator
    Updated a5b8a2a3293a9650feab1101afa35a1d47981049 -> 0ce75e3162bcc605757cd36cb3b90541d7a13a88 (pr/errs.2 -> pr/errs.3, compare) rewrite type-context.h comment to try to describe the error better
  8. in include/mp/util.h:261 in 0ce75e3162 outdated
    256+inline std::string _demangle(const char* m) { return m; }
    257+#endif
    258+
    259+template<class T>
    260+std::string TypeName(const T& /*unused*/)
    261+{
    


    theuni commented at 6:59 pm on October 1, 2025:
    Since this is only done in one place now, maybe wrap this in (c++20 standard) #ifdef __cpp_rtti? I checked quickly, and gcc and clang treat that define as-expected with or with or without -fno-rtti.

    ryanofsky commented at 7:55 pm on October 2, 2025:

    re: #218 (review)

    Sure, added #ifdef __cpp_rtti, though have not tested without rtti myself.

  9. in include/mp/util.h:247 in 0ce75e3162
    243@@ -238,6 +244,24 @@ inline char* CharCast(unsigned char* c) { return (char*)c; }
    244 inline const char* CharCast(const char* c) { return c; }
    245 inline const char* CharCast(const unsigned char* c) { return (const char*)c; }
    246 
    247+#if defined(__GNUG__)          // GCC & Clang ─ use <cxxabi.h> to demangle
    


    theuni commented at 7:16 pm on October 1, 2025:

    I don’t love this without an escape-hatch for Frankenstein compilers like clang-cl. Something like #if defined(__GNUG__) && !defined(MP_NO_DEMANGLE) would make it more bulletproof.

    That can always be added if it ever becomes a real problem though.


    ryanofsky commented at 7:53 pm on October 2, 2025:

    re: #218 (review)

    Thanks, replaced this with __has_include(<cxxabi.h>) which I think should be more on-target than __GNUG__. I’d hope for something like this, the compiler could provide enough information to detect the feature without needing to involve cmake, but cmake is available if it turns out to be necessary.

  10. theuni commented at 7:22 pm on October 1, 2025: contributor

    code review ACK 0ce75e3162bcc605757cd36cb3b90541d7a13a88.

    The logging additions look helpful, though they’ll need to be updated to use the new macros from #220.

    I’d rather hold off on that second commit until after v30 has settled though, since it introduces some platform-specific functionality that has the (admittedly small) potential for causing build problems.

  11. logging: Add better logging on IPC server-side failures
    Motivation: when trying to add a new unit test ran into a confusing error
    "(remote):0: failed: remote exception: Called null capability" caused by
    forgetting to make a FooInterface.initThreadMap call during initialization and
    these logging prints would have made it easier to see where the error was
    coming from and debug.
    21b2f653f4
  12. debug: Add TypeName() function and log statements for Proxy objects being created and destroyed 2af6a9bc30
  13. DrahtBot added the label Needs rebase on Oct 2, 2025
  14. ryanofsky force-pushed on Oct 2, 2025
  15. ryanofsky commented at 7:58 pm on October 2, 2025: collaborator

    Rebased 0ce75e3162bcc605757cd36cb3b90541d7a13a88 -> 7283a2b3d3b9acbc1b3939ef054f1d1537cdfebd (pr/errs.3 -> pr/errs.4, compare) to fix conflict with #220, and also applied review suggestions, and renamed TypeName to CxxTypename since there is actually already another TypeName function for cap’nproto types

    Updated 7283a2b3d3b9acbc1b3939ef054f1d1537cdfebd -> 2af6a9bc3036895ddddbd80d1f7789bec92a6516 (pr/errs.4 -> pr/errs.5, compare) fixing typo found by drahtbot

  16. ryanofsky force-pushed on Oct 2, 2025
  17. DrahtBot removed the label Needs rebase on Oct 2, 2025
  18. ryanofsky commented at 9:12 am on October 3, 2025: collaborator

    gnu32 CI job is failing but I think it is not directly related to changes in this PR. The new thread busy test introduced in #214 is segfaulting here but it has a known race condition fixed in #222 . The segfault is consistent with the race and might just be happening now because new log prints in this PR change the timing of the test, and fact that bugs tend to trigger more segfaults in the gnu32 job because of the smaller memory space.

    Will cherry-pick the race fix from #222 as a workaround to see if it fixes the segfault. Assuming this works, if reviewing this PR, would be nice to just ACK first two commits (2af6a9bc3036895ddddbd80d1f7789bec92a6516) and ignore the workaround commit so it can be dropped later without needing to rebase or re-review.

    Added 1 commits 2af6a9bc3036895ddddbd80d1f7789bec92a6516 -> 0fc31ec036091fcc5e62372fade78faa80e83d0f (pr/errs.5 -> pr/errs.6, compare) adding attempted workaround for CI failure

  19. (workaround for ci, to be dropped after #214 is merged)
    Cherry-picking 73d22ba2e9301fd77727a2193fc289e5299563bb from https://github.com/bitcoin-core/libmultiprocess/pull/222
    0fc31ec036
  20. ci: try to get stack trace from mptest 437f042541
  21. ryanofsky commented at 4:06 pm on October 3, 2025: collaborator

    Really struggling to debug the gnu32 segfault in the “thread busy” test. Failed to get a stack trace from core dump in CI, but I was able to make failure happen locally compiling with gnu32 and asan, though failure seems to be just a normal segfault and there aren’t asan errors. GDB shows exception happening in:

    In serverInvoke, the kj::Exception& object is valid and holds the “thread busy” error, but after the argument is forwarded to kj::operator*, the kj::Exception& argument somehow points to a different address with garbage data and the result is a segfault. I have no idea how this is happening because the two functions in between are just forwarding the reference argument, and they are hard to inspect with GDB because they are varargs functions.

    The failure seems very strange, but the segfault is happening in the line of code in serverInvoke that this PR is changing, so I don’t think it is spurious. Will have to continue debugging. I hope it is not just something stupid in the gnu32 CI job setup.

  22. in include/mp/proxy-types.h:729 in 21b2f653f4 outdated
    725@@ -726,6 +726,13 @@ kj::Promise<void> serverInvoke(Server& server, CallContext& call_context, Fn fn)
    726                 MP_LOG(*server.m_context.loop, Log::Debug) << "IPC server send response #" << req << " " << TypeName<Results>();
    727                 MP_LOG(*server.m_context.loop, Log::Trace) << "response data: "
    728                     << LogEscape(call_context.getResults().toString(), server.m_context.loop->m_log_opts.max_chars);
    729+            }, [&server, req](::kj::Exception&& e) {
    


    theuni commented at 10:15 pm on October 3, 2025:

    isn’t the local server_capture being captured by reference problematic if invoke is being called async? Possible that’s smashing the stack here?

    (Maybe I’m way off, these tangled callbacks a breaking my brain).

    But I think I’d expect this to be return ReplaceVoid([server_context = std::move(server_context)]()...


    ryanofsky commented at 12:26 pm on October 4, 2025:

    I should double-check all the lifetimes here, but I think the problem in the gnu32 job is weirder than a lifetime issue, because in gdb I can inspect the kj::Exception&& e variable here and it is a valid object containing the “thread busy” error and full context information. But then two calls deeper in the call stack, the stringify function which is supposed to print the exception, somehow receives a garbage reference to a kj::Exception with a completely different address. So it seems like there is an ABI issue or compiler bug exposed by this change, but I am not sure.

    On server vs server_context, the server object is actually the one with the longer lifetime. When you declare an interface MyInterface { myMethod [@0](/bitcoin-core-multiprocess/contributor/0/) () -> (); } capnproto generates C++ class like:

    0class MyInterface::Server {
    1  virtual Promise<void> myMethod(MyMethodContext) = 0;
    2}
    

    which you inherit from to implement all the methods. The server variable is a reference to the object implementing these methods, and there is also a MyInterface::Client class generated to call the methods. The lifetime of the server object is completely controlled by Cap’n Proto. It will keep the server around until nothing is referencing it, so it will be alive as long as any client on a live connection exists, and as long as any IPC call is in progress, even if the client is released or disconnected.

    By contrast server_context is just a local variable in the serverInvoke function, so you would need to copy or move it like you suggested to use it in a promise.then() handler but it is fine to use by reference in ReplaceVoid lambdas because of the way ReplaceVoid is defined, where it just calls them right away before returning. I try to use convention of using [&] and implicit captures for any lambda that will be called right away before the statement returns, and using explicit captures for lambdas that are delayed callbacks.


    maflcko commented at 12:12 pm on October 6, 2025:

    So it seems like there is an ABI issue or compiler bug exposed by this change, but I am not sure.

    I haven’t looked at anything here, but I presume capnp is compiled with clang in the CI and libmultiprocess with GCC, so it just reminds me of https://github.com/bitcoin/bitcoin/issues/31772. But I haven’t checked any of this.

  23. DrahtBot added the label Needs rebase on Oct 7, 2025
  24. DrahtBot commented at 3:06 pm on October 7, 2025: none
    🐙 This pull request conflicts with the target branch and needs rebase.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/libmultiprocess. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-12-04 19:30 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me