Better error and log messages #218
pull ryanofsky wants to merge 4 commits into bitcoin-core:master from ryanofsky:pr/errs changing 10 files +212 −11-
ryanofsky commented at 9:45 pm on September 24, 2025: collaboratorThis PR contains some logging and error message improvements I made locally in recent months that seem useful to upstream. Details are in commit messages
-
DrahtBot commented at 9:45 pm on September 24, 2025: none
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Reviews
See the guideline for information on the review process.
Type Reviewers Concept ACK ismaelsadeeq Stale ACK theuni If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Conflicts
Reviewers, this pull request conflicts with the following ones:
- #223 (ci: Replace nix-shell with equivalent nix develop command by ryanofsky)
- #222 (test, ci: Fix threadsanitizer errors in mptest by ryanofsky)
- #212 (ci: add newdeps job testing newer versions of cmake and capnproto by ryanofsky)
- #209 (cmake: Increase cmake policy version by ryanofsky)
- #175 (Set cmake_minimum_required(VERSION 3.22) by maflcko)
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
-
ryanofsky force-pushed on Sep 25, 2025
-
ryanofsky commented at 1:50 pm on September 25, 2025: collaboratorUpdated 395e3266b5e29a3241e8ea6c5e4024971c21dd39 -> a5b8a2a3293a9650feab1101afa35a1d47981049 (
pr/errs.1->pr/errs.2, compare) to fix iwyu errors (https://github.com/bitcoin-core/libmultiprocess/actions/runs/17990500563) -
ismaelsadeeq commented at 10:26 am on September 26, 2025: memberConcept ACK
-
ryanofsky force-pushed on Sep 26, 2025
-
in include/mp/util.h:261 in 0ce75e3162 outdated
256+inline std::string _demangle(const char* m) { return m; } 257+#endif 258+ 259+template<class T> 260+std::string TypeName(const T& /*unused*/) 261+{
theuni commented at 6:59 pm on October 1, 2025:Since this is only done in one place now, maybe wrap this in (c++20 standard)#ifdef __cpp_rtti? I checked quickly, and gcc and clang treat that define as-expected with or with or without-fno-rtti.
ryanofsky commented at 7:55 pm on October 2, 2025:re: #218 (review)
Sure, added
#ifdef __cpp_rtti, though have not tested without rtti myself.in include/mp/util.h:247 in 0ce75e3162
243@@ -238,6 +244,24 @@ inline char* CharCast(unsigned char* c) { return (char*)c; } 244 inline const char* CharCast(const char* c) { return c; } 245 inline const char* CharCast(const unsigned char* c) { return (const char*)c; } 246 247+#if defined(__GNUG__) // GCC & Clang ─ use <cxxabi.h> to demangle
theuni commented at 7:16 pm on October 1, 2025:I don’t love this without an escape-hatch for Frankenstein compilers like
clang-cl. Something like#if defined(__GNUG__) && !defined(MP_NO_DEMANGLE)would make it more bulletproof.That can always be added if it ever becomes a real problem though.
ryanofsky commented at 7:53 pm on October 2, 2025:re: #218 (review)
Thanks, replaced this with
__has_include(<cxxabi.h>)which I think should be more on-target than__GNUG__. I’d hope for something like this, the compiler could provide enough information to detect the feature without needing to involve cmake, but cmake is available if it turns out to be necessary.theuni commented at 7:22 pm on October 1, 2025: contributorcode review ACK 0ce75e3162bcc605757cd36cb3b90541d7a13a88.
The logging additions look helpful, though they’ll need to be updated to use the new macros from #220.
I’d rather hold off on that second commit until after v30 has settled though, since it introduces some platform-specific functionality that has the (admittedly small) potential for causing build problems.
21b2f653f4logging: Add better logging on IPC server-side failures
Motivation: when trying to add a new unit test ran into a confusing error "(remote):0: failed: remote exception: Called null capability" caused by forgetting to make a FooInterface.initThreadMap call during initialization and these logging prints would have made it easier to see where the error was coming from and debug.
debug: Add TypeName() function and log statements for Proxy objects being created and destroyed 2af6a9bc30DrahtBot added the label Needs rebase on Oct 2, 2025ryanofsky force-pushed on Oct 2, 2025ryanofsky commented at 7:58 pm on October 2, 2025: collaboratorRebased 0ce75e3162bcc605757cd36cb3b90541d7a13a88 -> 7283a2b3d3b9acbc1b3939ef054f1d1537cdfebd (
pr/errs.3->pr/errs.4, compare) to fix conflict with #220, and also applied review suggestions, and renamedTypeNametoCxxTypenamesince there is actually already anotherTypeNamefunction for cap’nproto typesUpdated 7283a2b3d3b9acbc1b3939ef054f1d1537cdfebd -> 2af6a9bc3036895ddddbd80d1f7789bec92a6516 (
pr/errs.4->pr/errs.5, compare) fixing typo found by drahtbotryanofsky force-pushed on Oct 2, 2025DrahtBot removed the label Needs rebase on Oct 2, 2025ryanofsky commented at 9:12 am on October 3, 2025: collaboratorgnu32 CI job is failing but I think it is not directly related to changes in this PR. The new thread busy test introduced in #214 is segfaulting here but it has a known race condition fixed in #222 . The segfault is consistent with the race and might just be happening now because new log prints in this PR change the timing of the test, and fact that bugs tend to trigger more segfaults in the gnu32 job because of the smaller memory space.
Will cherry-pick the race fix from #222 as a workaround to see if it fixes the segfault. Assuming this works, if reviewing this PR, would be nice to just ACK first two commits (2af6a9bc3036895ddddbd80d1f7789bec92a6516) and ignore the workaround commit so it can be dropped later without needing to rebase or re-review.
Added 1 commits 2af6a9bc3036895ddddbd80d1f7789bec92a6516 -> 0fc31ec036091fcc5e62372fade78faa80e83d0f (
pr/errs.5->pr/errs.6, compare) adding attempted workaround for CI failure0fc31ec036(workaround for ci, to be dropped after #214 is merged)
Cherry-picking 73d22ba2e9301fd77727a2193fc289e5299563bb from https://github.com/bitcoin-core/libmultiprocess/pull/222
ci: try to get stack trace from mptest 437f042541ryanofsky commented at 4:06 pm on October 3, 2025: collaboratorReally struggling to debug the gnu32 segfault in the “thread busy” test. Failed to get a stack trace from core dump in CI, but I was able to make failure happen locally compiling with gnu32 and asan, though failure seems to be just a normal segfault and there aren’t asan errors. GDB shows exception happening in:
serverInvokeproxy-types.h:734kj::str<char const (&) [16], kj::Exception&>kj/string.h:627kj::toCharSequence<kj::Exception&>kj/string.h:610kj::operator*kj/exception.c++:779
In serverInvoke, the
kj::Exception&object is valid and holds the “thread busy” error, but after the argument is forwarded tokj::operator*, thekj::Exception&argument somehow points to a different address with garbage data and the result is a segfault. I have no idea how this is happening because the two functions in between are just forwarding the reference argument, and they are hard to inspect with GDB because they are varargs functions.The failure seems very strange, but the segfault is happening in the line of code in
serverInvokethat this PR is changing, so I don’t think it is spurious. Will have to continue debugging. I hope it is not just something stupid in the gnu32 CI job setup.in include/mp/proxy-types.h:729 in 21b2f653f4 outdated
725@@ -726,6 +726,13 @@ kj::Promise<void> serverInvoke(Server& server, CallContext& call_context, Fn fn) 726 MP_LOG(*server.m_context.loop, Log::Debug) << "IPC server send response #" << req << " " << TypeName<Results>(); 727 MP_LOG(*server.m_context.loop, Log::Trace) << "response data: " 728 << LogEscape(call_context.getResults().toString(), server.m_context.loop->m_log_opts.max_chars); 729+ }, [&server, req](::kj::Exception&& e) {
theuni commented at 10:15 pm on October 3, 2025:isn’t the local
server_capturebeing captured by reference problematic ifinvokeis being called async? Possible that’s smashing the stack here?(Maybe I’m way off, these tangled callbacks a breaking my brain).
But I think I’d expect this to be
return ReplaceVoid([server_context = std::move(server_context)]()...
ryanofsky commented at 12:26 pm on October 4, 2025:I should double-check all the lifetimes here, but I think the problem in the gnu32 job is weirder than a lifetime issue, because in gdb I can inspect the
kj::Exception&& evariable here and it is a valid object containing the “thread busy” error and full context information. But then two calls deeper in the call stack, the stringify function which is supposed to print the exception, somehow receives a garbage reference to akj::Exceptionwith a completely different address. So it seems like there is an ABI issue or compiler bug exposed by this change, but I am not sure.On
servervsserver_context, theserverobject is actually the one with the longer lifetime. When you declare aninterface MyInterface { myMethod [@0](/bitcoin-core-multiprocess/contributor/0/) () -> (); }capnproto generates C++ class like:0class MyInterface::Server { 1 virtual Promise<void> myMethod(MyMethodContext) = 0; 2}which you inherit from to implement all the methods. The
servervariable is a reference to the object implementing these methods, and there is also aMyInterface::Clientclass generated to call the methods. The lifetime of the server object is completely controlled by Cap’n Proto. It will keep the server around until nothing is referencing it, so it will be alive as long as any client on a live connection exists, and as long as any IPC call is in progress, even if the client is released or disconnected.By contrast
server_contextis just a local variable in theserverInvokefunction, so you would need to copy or move it like you suggested to use it in a promise.then() handler but it is fine to use by reference inReplaceVoidlambdas because of the wayReplaceVoidis defined, where it just calls them right away before returning. I try to use convention of using[&]and implicit captures for any lambda that will be called right away before the statement returns, and using explicit captures for lambdas that are delayed callbacks.
maflcko commented at 12:12 pm on October 6, 2025:So it seems like there is an ABI issue or compiler bug exposed by this change, but I am not sure.
I haven’t looked at anything here, but I presume capnp is compiled with clang in the CI and libmultiprocess with GCC, so it just reminds me of https://github.com/bitcoin/bitcoin/issues/31772. But I haven’t checked any of this.
DrahtBot added the label Needs rebase on Oct 7, 2025DrahtBot commented at 3:06 pm on October 7, 2025: none🐙 This pull request conflicts with the target branch and needs rebase.
github-metadata-mirror
This is a metadata mirror of the GitHub repository bitcoin-core/libmultiprocess. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-12-04 19:30 UTC
This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me