Switches Windows CI to produce RelWithDebInfo binaries and adds code that uses ProcDump to generate a dump file of the bitcoind process which may help explain where it’s gotten stuck.
Unresolved:
Whether the CI change is remotely correct.
How to get access to the generated .EXE, .DMP and .PDB files.
Obtaining a .DMP artifact(s).
Verify debugging with downloaded artifacts works.
Making the GitHub workflow begin execution based on schedule rather than based on git push.
Undo most temporary changes and let CI run every 3 hours on this branch on personal master branch for 2 weeks.
Compare execution time of release vs RelWithDebInfo to see if defaults could be changed on master branch.
DrahtBot
commented at 12:56 pm on September 24, 2024:
contributor
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
See the guideline for information on the review process.
A summary of reviews will appear here.
Conflicts
Reviewers, this pull request conflicts with the following ones:
#30660 (test: Shut down framework cleanly on RPC connection failure by hodlinator)
#30487 (ci: skip Github CI on branch pushes for forks by Sjors)
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
hodlinator marked this as a draft
on Sep 24, 2024
hodlinator force-pushed
on Sep 24, 2024
hodlinator
commented at 9:05 pm on September 24, 2024:
contributor
Seems fine to integrate in a pull request and then keep the pull request unmerged, and only merge the pull request in the GHA CI, as that is the only place where it reproduces. Adding temporary test test-only code for everyone else may be a bit too much.
Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?
Do you have any ideas for being able to retrieve generated .DMP and .PDB files from CI? I’m totally unfamiliar with GHA so RTFM-links are welcome too. An alternative approach could be to make this temporary PR upload the relevant files to somewhere, only if the issue occurs.
Feel free to loop in others to answer if you think it’s more their area of ownership.
davidgumberg
commented at 9:42 pm on September 24, 2024:
contributor
Do you have any ideas for being able to retrieve generated .DMP and .PDB files from CI?
maflcko
commented at 6:22 am on September 25, 2024:
member
Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?
Seems fine to do permanent, if there are no downsides or slow-down. cc @hebasto
hodlinator force-pushed
on Sep 25, 2024
hodlinator
commented at 10:08 am on September 25, 2024:
contributor
actions/upload-artifact@v4 is not allowed to be used in bitcoin/bitcoin. Actions in this workflow must be: within a repository owned by bitcoin or matching the following: actions/cache@, actions/cache/restore@, actions/cache/save@, actions/checkout@, ilammy/msvc-dev-cmd@*.
Maybe one could (ab)use actions/cache/save but I’ll pause here for more input.
maflcko
commented at 10:12 am on September 25, 2024:
member
Maybe you can leave it in your repo and just trigger the task every 3 hours for two weeks to get the dump eventually?
hodlinator force-pushed
on Sep 25, 2024
hodlinator force-pushed
on Sep 25, 2024
hodlinator force-pushed
on Sep 25, 2024
hebasto
commented at 11:23 am on September 26, 2024:
member
Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?
Seems fine to do permanent, if there are no downsides or slow-down. cc @hebasto
Here is some historical context:
In pre-Cmake times, manually crafted project files had two configurations: “Release” and “Debug”.
While working on the CMake staging branch, the CI build configuration remained “Release” to allow the use of Ccache. This functionality was later dropped.
At this point, I don’t see any drawbacks to switching the CI job’s build configuration from “Release” to “RelWithDebInfo”.
hodlinator force-pushed
on Sep 26, 2024
hodlinator force-pushed
on Sep 26, 2024
hodlinator force-pushed
on Sep 26, 2024
hodlinator force-pushed
on Sep 27, 2024
hodlinator force-pushed
on Sep 27, 2024
DrahtBot added the label
CI failed
on Sep 27, 2024
DrahtBot
commented at 11:05 am on September 27, 2024:
contributor
Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
hodlinator force-pushed
on Sep 27, 2024
hodlinator
commented at 9:31 pm on September 27, 2024:
contributor
The Procdump.exe process itself times out when trying to produce the dump. :facepalm:
Could be that it’s popping up a GUI dialogue on first run, will investigate further.
hodlinator force-pushed
on Sep 28, 2024
hodlinator force-pushed
on Sep 28, 2024
hodlinator force-pushed
on Sep 28, 2024
hodlinator force-pushed
on Sep 28, 2024
hodlinator force-pushed
on Sep 28, 2024
hodlinator
commented at 10:43 pm on September 28, 2024:
contributor
Still remains to see if it loads nicely in the debugger.
Other remaining issue is that I used a hardcoded absolute path for the upload-artifact step since I’m still struggling to find the correct glob-expression. Can’t assume it will be node0 that stalls, and unsure if the path will be the same when removing –nocleanup. (Come to think of it, –nocleanup shouldn’t be necessary now that the test fails).
I’m a bit mystified by RUNNER_TEMP passed as --tmpdirprefix to the functional test runner having the value D:/a/_temp/, while the test data seems to end up in D:/a/bitcoin/bitcoin/build/test/cache/, but maybe the cache part is a clue.
hodlinator
commented at 2:03 pm on September 30, 2024:
contributor
Still remains to see if it loads nicely in the debugger.
It debugs!
Requires .DMP, .PDB and .EXE (last one currently luckily added just as an extra test).
hodlinator force-pushed
on Sep 30, 2024
hodlinator force-pushed
on Sep 30, 2024
hodlinator force-pushed
on Oct 1, 2024
hodlinator force-pushed
on Oct 1, 2024
hodlinator force-pushed
on Oct 1, 2024
hodlinator force-pushed
on Oct 1, 2024
hodlinator force-pushed
on Oct 1, 2024
hodlinator force-pushed
on Oct 2, 2024
hodlinator force-pushed
on Oct 2, 2024
hodlinator force-pushed
on Oct 2, 2024
hodlinator renamed this:
DO NOT MERGE: Windows bitcoind stall debugging
Windows bitcoind stall debugging [WIP, NOMERGE, DRAFT]
on Oct 2, 2024
DrahtBot renamed this:
Windows bitcoind stall debugging [WIP, NOMERGE, DRAFT]
Windows bitcoind stall debugging [WIP, NOMERGE, DRAFT]
on Oct 2, 2024
hodlinator force-pushed
on Oct 2, 2024
ci: Temporarily switch Windows to RelWithDebInfo
RelWithDebInfo (Release with debug information) ensures we generate .PDB files which are useful when debugging dump files.
7421a0219a
test: Generate dump file for stalled bitcoind
Upon failing to establish an RPC connection to bitcoind:
Download SysInternals Procdump, unzip it, and use it to generate a dump file that can be opened in a debugger to see what bitcoind was up to.
28751ebbd7
hodlinator force-pushed
on Oct 3, 2024
hodlinator
commented at 9:48 am on October 3, 2024:
contributor
I finally read the documentation for schedule a bit more closely and realized it only applies to the master branch, so I’m making the “every 3 hours”-schedule happen on my personal master branch for now.
hodlinator force-pushed
on Oct 3, 2024
hodlinator
commented at 8:24 pm on October 3, 2024:
contributor
hodlinator
commented at 12:01 pm on October 19, 2024:
contributor
Had a few other cases. First 2 are confirmed to be the same issue - seeding randomness entropy from Windows performance data obtained through the Registry API during startup hangs when attempting to release a semaphore deep in Microsoft code. Turned off the once-per-hour CI schedule on my master-branch for now. More work to follow next week.
davidgumberg
commented at 9:29 pm on October 19, 2024:
contributor
seeding randomness entropy from Windows performance data obtained through the Registry API during startup hangs when attempting to release a semaphore deep in Microsoft code.
The call to RegQueryValueExA in RandAddSeedPerfmon:
apparently, querying HKEY_PERFORMANCE_DATA can potentially load all kinds of DLLs for performance counter data, depending on drivers installed on the system, as well as collects performance information for every process and thread on the system (could be a lot on busy systems)
it’s very possible that one of these has a race condition, or other crash bug, in the configuration used in the CI
if we have enough other random sources in Windows (i haven’t checked this) then we could consider disabling this one, it was more of a workaround when there were no official OS entropy sources
does disabling it fix the stalls?
fanquake
commented at 8:41 am on October 21, 2024:
member
The call to RegQueryValueExA in RandAddSeedPerfmon:
I think you can open a PR to just remove this entirely.
hodlinator
commented at 9:07 am on October 21, 2024:
contributor
I think you can open a PR to just remove this entirely.
On it. Will explore other possible fixes after publishing that PR.
hodlinator
commented at 11:54 am on October 21, 2024:
contributor
Googling for _PerfdiskPnpNotification finds it sometimes tries to open a pop up window.
Linked post uses the PDH API, which internally queries the registry (seen in linked posts’ callstack). Seems like their case was solved by moving the calling code out of DllMain. Would be weird if us calling Perf Counter functions from inside our main() was unsupported though.
This might indicate there are lingering threads from a previous loop iteration that were not cleaned up yet. Despite the official example not calling RegCloseKeyat all, maybe we should try doing it for every loop iteration to see if that fixes it.
Some older performance counter consumers use the registry APIs to collect performance data from the special HKEY_PERFORMANCE_DATA registry key. This is not recommended for new code because processing the data from the registry is complex and error-prone. The registry API implementation directly supports collecting data from V1 providers. It indirectly supports collecting data from V2 providers through a translation layer that uses the V2 consumer APIs.
Some performance counter consumers use the PerfLib V2 Consumer functions to directly access data from V2 providers. This is more complex than consuming data using the PDH APIs, but this approach can be useful if PDH APIs cannot be used due to performance or dependency concerns. The PerfLib V2 implementation directly supports collecting data from V2 providers. It does not support collecting data from V1 providers.
The main callstack above with our hang contains QueryV1Provider. Makes me think maybe we should migrate to the more modern V2 API that does not go through the Registry layer.
If the buffer specified by lpData parameter is not large enough to hold the data, the function returns ERROR_MORE_DATA and stores the required buffer size in the variable pointed to by lpcbData. In this case, the contents of the lpData buffer are undefined.
It is unclear whether we get anything useful back in the failure case, but we could add code to see if lpData (PPERF_DATA_BLOCK) has anything usable. Maybe it would provide hints at what is different from successful runs and causing the failures.
Be sure to use the RegCloseKey function to close the handle to the key when you are finished obtaining the performance data. This is important for both the local and remote cases:
RegCloseKey(HKEY_PERFORMANCE_DATA) does not actually close a registry handle, but it clears all cached data and frees the loaded performance DLLs.
RegCloseKey(hkeyRemotePerformanceData) closes the handle to the remote machine’s registry.
It is not entirely clear whether it should be called after every RegQueryValueExA or if it’s fine as-is, after the loop.
The system defines predefined keys that are always open.
and
The predefined handles are not thread safe. Closing a predefined handle in one thread affects any other threads that are using the handle.
So there’s some mixed messaging around closing.
RandAddPeriodic (documented by us to be thread-safe) is called with one minute between every call, so shouldn’t require a mutex even on sluggish machines. (We seed startup randomness before scheduling the periodic re-seeding).
If the key is not one of the predefined registry keys, call the RegCloseKey function after you have finished using the handle.
Closing after the loop happens after the issue covered here, so seems to be not too often unless there’s some inter-CI-run interference. Or maybe inter-process interference is possible and another unknown process on CI is closing the handle? Could write a minimal test case of two processes querying the registry and closing handles on each other forever. :\
As mentioned above with the case of duplicate thread callstacks, it might be fruitful to attempt closing in every iteration of the loop.
DrahtBot added the label
Needs rebase
on Oct 24, 2024
DrahtBot
commented at 3:54 pm on October 24, 2024:
contributor
🐙 This pull request conflicts with the target branch and needs rebase.
laanwj
commented at 5:34 pm on October 24, 2024:
member
It is not entirely clear whether it should be called after every RegQueryValueExA or if it’s fine as-is, after the loop.
FWIW the idea that i got from reading that was that you should close the key when done with the API. Not at any specific point. This will unload the performance collection DLLs. The next time you query the key again they will be loaded again.
i concluded that we do the right thing (never close it) because, we’re never done with it, we keep queryiing them time after time after time, so might as well keep the DLLs in memory.
But yes it’s ambiguous.
achow101 referenced this in commit
947f2925d5
on Oct 24, 2024
hodlinator
commented at 10:19 pm on October 24, 2024:
contributor
Turned off hourly CI for now, since fix removing the offending function was merged into bitcoin:master.
Intend to experiment with more surgical solutions and document that here in the future.
davidgumberg
commented at 6:43 pm on October 26, 2024:
contributor
Turned off hourly CI for now, since fix removing the offending function was merged into bitcoin:master.
Would it make sense to leave it running for a little longer to validate that the fix worked?
hodlinator
commented at 8:11 pm on October 26, 2024:
contributor
Would it make sense to leave it running for a little longer to validate that the fix worked?
My reasoning was that there should be enough CI runs on master and all runs on PRs based on top of the merged fix. If it re-occurs hopefully one of us will reopen #30390. But I don’t have a well-developed sense for how much strain is reasonable to put on the CI resources.
davidgumberg
commented at 9:26 am on October 27, 2024:
contributor
My reasoning was that there should be enough CI runs on master and all runs on PRs based on top of the merged fix.
Oh, didn’t think of that, makes sense to me.
maflcko
commented at 9:23 am on October 28, 2024:
member
Can this be closed, or is it waiting on something?
WIP - Attempt at more surgical fix for bitcoind stalling757918c22b
WIP - ci: Temporarily schedule to re-run every 3 hours (at minute 21)
Goal is to verify more surgical fix to issue with stalled bitcoind within 1-2.
f69a238bce
hodlinator force-pushed
on Oct 28, 2024
hodlinator
commented at 9:50 pm on October 28, 2024:
contributor
Can this be closed, or is it waiting on something?
fanquake
commented at 10:19 pm on October 28, 2024:
member
I’d like to keep it open as I explore less nuclear solutions. Pushed an initial attempt to here and my personal master: hodlinator/bitcoin@84cd647…f69a238
In my opinion your (already merged) change to drop the offending code from the RNG is fine, and isn’t really something we need to investigate further / work on reintroducing.
laanwj
commented at 7:48 am on October 29, 2024:
member
isn’t really something we need to investigate further / work on reintroducing.
Agree, it’s fine, it’s good to get rid of the perfmon query.
hodlinator
commented at 11:11 am on October 29, 2024:
contributor
Alright, our time is finite and hopefully CryptGenRandom should be enough for now. Closing.
This is a metadata mirror of the GitHub repository
bitcoin/bitcoin.
This site is not affiliated with GitHub.
Content is generated from a GitHub metadata backup.
generated: 2024-12-21 15:12 UTC
This site is hosted by @0xB10C More mirrored repositories can be found on mirror.b10c.me