Windows bitcoind stall debugging [NOMERGE, DRAFT] #30956

pull hodlinator wants to merge 6 commits into bitcoin:master from hodlinator:30390_windows_debug changing 3 files +104 −21
  1. hodlinator commented at 12:56 pm on September 24, 2024: contributor

    This PR is meant to help investigate #30390 (https://github.com/bitcoin/bitcoin/issues/30390#issuecomment-2326628987).

    Switches Windows CI to produce RelWithDebInfo binaries and adds code that uses ProcDump to generate a dump file of the bitcoind process which may help explain where it’s gotten stuck.

    Unresolved:

    • Whether the CI change is remotely correct.
    • How to get access to the generated .EXE, .DMP and .PDB files.
    • Obtaining a .DMP artifact(s).
    • Verify debugging with downloaded artifacts works.
    • Making the GitHub workflow begin execution based on schedule rather than based on git push.
    • Undo most temporary changes and let CI run every 3 hours on this branch on personal master branch for 2 weeks.
    • Compare execution time of release vs RelWithDebInfo to see if defaults could be changed on master branch.
  2. DrahtBot commented at 12:56 pm on September 24, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #30660 (test: Shut down framework cleanly on RPC connection failure by hodlinator)
    • #30487 (ci: skip Github CI on branch pushes for forks by Sjors)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. hodlinator marked this as a draft on Sep 24, 2024
  4. hodlinator force-pushed on Sep 24, 2024
  5. hodlinator commented at 9:05 pm on September 24, 2024: contributor

    https://github.com/bitcoin/bitcoin/actions/runs/11014291937/job/30584604403?pr=30956#step:10:930 contains: bitcoind.vcxproj -> D:\a\bitcoin\bitcoin\build\src\RelWithDebInfo\bitcoind.exe …notice the “RelWithDebInfo”, so it should be generating .PDB files like it does for me locally. @maflcko continuing discussion after your comment in the related issue:

    Seems fine to integrate in a pull request and then keep the pull request unmerged, and only merge the pull request in the GHA CI, as that is the only place where it reproduces. Adding temporary test test-only code for everyone else may be a bit too much.

    1. Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?

    2. Do you have any ideas for being able to retrieve generated .DMP and .PDB files from CI? I’m totally unfamiliar with GHA so RTFM-links are welcome too. An alternative approach could be to make this temporary PR upload the relevant files to somewhere, only if the issue occurs.

    Feel free to loop in others to answer if you think it’s more their area of ownership.

  6. davidgumberg commented at 9:42 pm on September 24, 2024: contributor

    Do you have any ideas for being able to retrieve generated .DMP and .PDB files from CI?

    I think you can add something like:

    0- uses: actions/upload-artifact@v4
    1  with:
    2    name: dump
    3    path: %RUNNER_TEMP%/**/bitcoind.dmp
    4    if-no-files-found: ignore
    

    after the run: part of each job.

    https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-and-sharing-data-from-a-workflow https://github.com/actions/upload-artifact#examples

  7. maflcko commented at 6:22 am on September 25, 2024: member
    1. Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?

    Seems fine to do permanent, if there are no downsides or slow-down. cc @hebasto

  8. hodlinator force-pushed on Sep 25, 2024
  9. hodlinator commented at 10:08 am on September 25, 2024: contributor

    https://github.com/bitcoin/bitcoin/actions/runs/11030752196 gives me:

    actions/upload-artifact@v4 is not allowed to be used in bitcoin/bitcoin. Actions in this workflow must be: within a repository owned by bitcoin or matching the following: actions/cache@, actions/cache/restore@, actions/cache/save@, actions/checkout@, ilammy/msvc-dev-cmd@*.

    Maybe one could (ab)use actions/cache/save but I’ll pause here for more input.

  10. maflcko commented at 10:12 am on September 25, 2024: member
    Maybe you can leave it in your repo and just trigger the task every 3 hours for two weeks to get the dump eventually?
  11. hodlinator force-pushed on Sep 25, 2024
  12. hodlinator force-pushed on Sep 25, 2024
  13. hodlinator force-pushed on Sep 25, 2024
  14. hebasto commented at 11:23 am on September 26, 2024: member
    1. Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?

    Seems fine to do permanent, if there are no downsides or slow-down. cc @hebasto

    Here is some historical context:

    1. In pre-Cmake times, manually crafted project files had two configurations: “Release” and “Debug”.
    2. While working on the CMake staging branch, the CI build configuration remained “Release” to allow the use of Ccache. This functionality was later dropped.

    At this point, I don’t see any drawbacks to switching the CI job’s build configuration from “Release” to “RelWithDebInfo”.

  15. hodlinator force-pushed on Sep 26, 2024
  16. hodlinator force-pushed on Sep 26, 2024
  17. hodlinator force-pushed on Sep 26, 2024
  18. hodlinator force-pushed on Sep 27, 2024
  19. hodlinator force-pushed on Sep 27, 2024
  20. DrahtBot added the label CI failed on Sep 27, 2024
  21. DrahtBot commented at 11:05 am on September 27, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/30753975513

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  22. hodlinator force-pushed on Sep 27, 2024
  23. hodlinator commented at 9:31 pm on September 27, 2024: contributor

    https://github.com/hodlinator/bitcoin/actions/runs/11069726468/job/30757882453#step:15:114 : subprocess.TimeoutExpired: Command '['D:\\a\\bitcoin\\bitcoin\\build\\test\\cache\\node0\\Procdump.exe', '-mm', '7904', 'D:\\a\\bitcoin\\bitcoin\\build\\test\\cache\\node0\\bitcoind.dmp']' timed out after 4800 seconds

    The Procdump.exe process itself times out when trying to produce the dump. :facepalm: Could be that it’s popping up a GUI dialogue on first run, will investigate further.

  24. hodlinator force-pushed on Sep 28, 2024
  25. hodlinator force-pushed on Sep 28, 2024
  26. hodlinator force-pushed on Sep 28, 2024
  27. hodlinator force-pushed on Sep 28, 2024
  28. hodlinator force-pushed on Sep 28, 2024
  29. hodlinator commented at 10:43 pm on September 28, 2024: contributor

    Partial victory! - Latest run https://github.com/hodlinator/bitcoin/actions/runs/11087284207 actually uploaded a .DMP file (but it wasn’t a real stall, just forced it to happen).

    Still remains to see if it loads nicely in the debugger.

    Other remaining issue is that I used a hardcoded absolute path for the upload-artifact step since I’m still struggling to find the correct glob-expression. Can’t assume it will be node0 that stalls, and unsure if the path will be the same when removing –nocleanup. (Come to think of it, –nocleanup shouldn’t be necessary now that the test fails).

    0      - name: Upload .DMP file
    1        uses: actions/upload-artifact@v4
    2        id: dmp-upload
    3        if: ${{ failure() && steps.functional-tests.conclusion == 'failure' }}
    4        with:
    5          name: bitcoind-windows-dmp-file
    6          path: D:/a/bitcoin/bitcoin/build/test/cache/node0/bitcoind.dmp
    

    I’m a bit mystified by RUNNER_TEMP passed as --tmpdirprefix to the functional test runner having the value D:/a/_temp/, while the test data seems to end up in D:/a/bitcoin/bitcoin/build/test/cache/, but maybe the cache part is a clue.

  30. hodlinator commented at 2:03 pm on September 30, 2024: contributor

    Still remains to see if it loads nicely in the debugger.

    It debugs!

    image

    Requires .DMP, .PDB and .EXE (last one currently luckily added just as an extra test).

  31. hodlinator force-pushed on Sep 30, 2024
  32. hodlinator force-pushed on Sep 30, 2024
  33. hodlinator force-pushed on Oct 1, 2024
  34. hodlinator force-pushed on Oct 1, 2024
  35. hodlinator force-pushed on Oct 1, 2024
  36. hodlinator force-pushed on Oct 1, 2024
  37. hodlinator force-pushed on Oct 1, 2024
  38. hodlinator force-pushed on Oct 2, 2024
  39. hodlinator force-pushed on Oct 2, 2024
  40. hodlinator force-pushed on Oct 2, 2024
  41. hodlinator renamed this:
    DO NOT MERGE: Windows bitcoind stall debugging
    Windows bitcoind stall debugging [WIP, NOMERGE, DRAFT]
    on Oct 2, 2024
  42. DrahtBot renamed this:
    Windows bitcoind stall debugging [WIP, NOMERGE, DRAFT]
    Windows bitcoind stall debugging [WIP, NOMERGE, DRAFT]
    on Oct 2, 2024
  43. hodlinator force-pushed on Oct 2, 2024
  44. ci: Temporarily switch Windows to RelWithDebInfo
    RelWithDebInfo (Release with debug information) ensures we generate .PDB files which are useful when debugging dump files.
    7421a0219a
  45. test: Generate dump file for stalled bitcoind
    Upon failing to establish an RPC connection to bitcoind:
    Download SysInternals Procdump, unzip it, and use it to generate a dump file that can be opened in a debugger to see what bitcoind was up to.
    28751ebbd7
  46. hodlinator force-pushed on Oct 3, 2024
  47. hodlinator commented at 9:48 am on October 3, 2024: contributor
    I finally read the documentation for schedule a bit more closely and realized it only applies to the master branch, so I’m making the “every 3 hours”-schedule happen on my personal master branch for now.
  48. hodlinator force-pushed on Oct 3, 2024
  49. hodlinator commented at 8:24 pm on October 3, 2024: contributor

    Had 2 scheduled runs on my modified master, 3 hours apart.

    • Dump file generation upon bitcoind connection failure is enabled for functional tests, and no longer artificially forced.
    • Re-enabled unit tests to keep the setup as close as possible to regular CI.
  50. hodlinator renamed this:
    Windows bitcoind stall debugging [WIP, NOMERGE, DRAFT]
    Windows bitcoind stall debugging [NOMERGE, DRAFT]
    on Oct 3, 2024
  51. in .github/workflows/ci.yml:221 in e9c6b02462 outdated
    223+          BITCOINWALLET: '${{ github.workspace }}\build\src\RelWithDebInfo\bitcoin-wallet.exe'
    224           TEST_RUNNER_EXTRA: ${{ github.event_name != 'pull_request' && '--extended' || '' }}
    225         shell: cmd
    226-        run: py -3 test\functional\test_runner.py --jobs %NUMBER_OF_PROCESSORS% --ci --quiet --tmpdirprefix=%RUNNER_TEMP% --combinedlogslen=99999999 --timeout-factor=%TEST_RUNNER_TIMEOUT_FACTOR% %TEST_RUNNER_EXTRA%
    227+        run: |
    228+          py -3 test\functional\test_runner.py --jobs %NUMBER_OF_PROCESSORS% --ci --quiet --tmpdirprefix=%RUNNER_TEMP% --combinedlogslen=99999999 --timeout-factor=%TEST_RUNNER_TIMEOUT_FACTOR% %TEST_RUNNER_EXTRA% --filter mempool_
    


    maflcko commented at 12:15 pm on October 7, 2024:
    Any reason to filter to only mempool_?

    hodlinator commented at 3:57 pm on October 7, 2024:
    Thanks! Leftover WIP change. Removed filter in latest push here and to my personal master.
  52. ci: Upload .DMP, .EXE and .PDB artifacts
    Only done upon functional test failure. .EXE & .PDB are only uploaded if .DMP file is successfully uploaded first.
    0c7b73c40b
  53. WIP - ci: Temporarily disable non-windows jobs 76f01a4397
  54. hodlinator force-pushed on Oct 7, 2024
  55. DrahtBot removed the label CI failed on Oct 7, 2024
  56. maflcko commented at 8:36 am on October 10, 2024: member
    Looking at https://github.com/hodlinator/bitcoin/actions it doesn’t look it happened yet. Maybe you can crank up the cron to once an hour instead of every three hours?
  57. hodlinator commented at 8:37 am on October 10, 2024: contributor
    Sure, will do! Thanks again for spotting the remaining mempool filter.
  58. hodlinator force-pushed on Oct 10, 2024
  59. hodlinator commented at 7:23 am on October 16, 2024: contributor
    We have a winner! https://github.com/hodlinator/bitcoin/actions/runs/11311779692/job/31458399560 Downloaded the artifacts now just to be sure, but I won’t be able to look at them until early next week. Might keep the CI schedule running for one more dump before turning it off unless anyone disagrees.
  60. hodlinator commented at 11:33 pm on October 16, 2024: contributor
  61. hodlinator commented at 12:01 pm on October 19, 2024: contributor
    Had a few other cases. First 2 are confirmed to be the same issue - seeding randomness entropy from Windows performance data obtained through the Registry API during startup hangs when attempting to release a semaphore deep in Microsoft code. Turned off the once-per-hour CI schedule on my master-branch for now. More work to follow next week.
  62. davidgumberg commented at 9:29 pm on October 19, 2024: contributor

    seeding randomness entropy from Windows performance data obtained through the Registry API during startup hangs when attempting to release a semaphore deep in Microsoft code.

    The call to RegQueryValueExA in RandAddSeedPerfmon:

    https://github.com/bitcoin/bitcoin/blob/e8f72aefd20049eac81b150e7f0d33709acd18ed/src/randomenv.cpp#L80-L91

  63. laanwj added the label Windows on Oct 20, 2024
  64. laanwj added the label Tests on Oct 20, 2024
  65. laanwj commented at 6:01 pm on October 20, 2024: member

    The call to RegQueryValueExA in RandAddSeedPerfmon:

    good find !

    i couldn’t find anything wrong with that code; however i looked into the Microsoft documentation for this: https://learn.microsoft.com/en-us/windows/win32/perfctrs/using-the-registry-functions-to-consume-counter-data

    apparently, querying HKEY_PERFORMANCE_DATA can potentially load all kinds of DLLs for performance counter data, depending on drivers installed on the system, as well as collects performance information for every process and thread on the system (could be a lot on busy systems)

    it’s very possible that one of these has a race condition, or other crash bug, in the configuration used in the CI

    if we have enough other random sources in Windows (i haven’t checked this) then we could consider disabling this one, it was more of a workaround when there were no official OS entropy sources

    does disabling it fix the stalls?

  66. fanquake commented at 8:41 am on October 21, 2024: member

    The call to RegQueryValueExA in RandAddSeedPerfmon:

    I think you can open a PR to just remove this entirely.

  67. hodlinator commented at 9:07 am on October 21, 2024: contributor

    I think you can open a PR to just remove this entirely.

    On it. Will explore other possible fixes after publishing that PR.

  68. hodlinator commented at 11:54 am on October 21, 2024: contributor
    PR: #31124
  69. hodlinator commented at 9:02 pm on October 21, 2024: contributor

    Debugged a 3rd dump: https://github.com/hodlinator/bitcoin/actions/runs/11414623090 Same stacks in threads and local function state in RandAddSeedPerfmon() was similar/same as observed in 2 previous dump files - nSize = 843750 (250000 × (1.5^3)).

    Main thread

     0 	ntdll.dll!NtReleaseSemaphore()	Unknown
     1 	KERNELBASE.dll!ReleaseSemaphore()	Unknown
     2 	WmiApRpl.dll!WmiReverseGuard::LeaveRead(long)	Unknown
     3 	WmiApRpl.dll!WmiReverseMemoryExt<class WmiReverseGuard>::Read(void *,unsigned long,unsigned long *,unsigned long)	Unknown
     4 	WmiApRpl.dll!WmiAdapterWrapper::GetValidity(unsigned long)	Unknown
     5 	WmiApRpl.dll!WmiAdapterWrapper::CollectObjects(unsigned short const *,void * *,unsigned long *,unsigned long *)	Unknown
     6 	WmiApRpl.dll!WmiAdapterWrapper::Collect(unsigned short const *,void * *,unsigned long *,unsigned long *)	Unknown
     7 	WmiApRpl.dll!WmiCollectPerfData(unsigned short const *,void * *,unsigned long *,unsigned long *)	Unknown
     8 	advapi32.dll!CallExtObj()	Unknown
     9 	advapi32.dll!QueryV1Provider()	Unknown
    10 	advapi32.dll!QueryExtensibleData()	Unknown
    11 	advapi32.dll!PerfRegQueryValueEx()	Unknown
    12 	advapi32.dll!PerfRegQueryValue()	Unknown
    13 	KERNELBASE.dll!BaseRegQueryValueInternal()	Unknown
    14>	KERNELBASE.dll!RegQueryValueExA()	Unknown
    15 	bitcoind.exe!`anonymous namespace'::RandAddSeedPerfmon(CSHA512 & hasher) Line 86	C++
    16 	bitcoind.exe!RandAddDynamicEnv(CSHA512 & hasher) Line 235	C++
    17 	bitcoind.exe!`anonymous namespace'::SeedStartup(CSHA512 & hasher, `anonymous-namespace'::RNGState & rng) Line 624	C++
    18 	bitcoind.exe!`anonymous namespace'::ProcRand(unsigned char * out, int num, `anonymous-namespace'::RNGLevel level, bool always_use_real_rng) Line 662	C++
    19 	[Inline Frame] bitcoind.exe!kernel::Context::{ctor}::__l2::<lambda_1>::operator()() Line 21	C++
    20 	[Inline Frame] bitcoind.exe!std::invoke(kernel::Context::{ctor}::__l2::<lambda_1> &&) Line 1704	C++
    21 	bitcoind.exe!std::call_once<`kernel::Context::Context'::`2'::<lambda_1>>(std::once_flag & _Once, kernel::Context::{ctor}::__l2::<lambda_1> && _Fx) Line 103	C++
    22 	bitcoind.exe!kernel::Context::Context() Line 23	C++
    23 	[Inline Frame] bitcoind.exe!std::make_unique() Line 3597	C++
    24 	bitcoind.exe!AppInit(node::NodeContext & node) Line 188	C++
    25 	bitcoind.exe!main(int argc, char * * argv) Line 277	C++
    26 	[Inline Frame] bitcoind.exe!invoke_main() Line 78	C++
    27 	bitcoind.exe!__scrt_common_main_seh() Line 288	C++
    28 	kernel32.dll!BaseThreadInitThunk()	Unknown
    29 	ntdll.dll!RtlUserThreadStart()	Unknown
    

    PerfdiskPnpNotification thread

    0 	win32u.dll!NtUserGetMessage()	Unknown
    1 	user32.dll!GetMessageW()	Unknown
    2>	perfdisk.dll!PerfdiskPnpNotification(void *)	Unknown
    3 	kernel32.dll!BaseThreadInitThunk()	Unknown
    4 	ntdll.dll!RtlUserThreadStart()	Unknown
    

    This was the only thread not stuck in NtReleaseSemaphore/NtWaitFor*.

    Another project also encountered a hang with [_]PerfdiskPnpNotification in another thread. https://stackoverflow.com/questions/15730185/performance-data-handler-hangs-for-physicaldisk

    Googling for _PerfdiskPnpNotification finds it sometimes tries to open a pop up window.

    Linked post uses the PDH API, which internally queries the registry (seen in linked posts’ callstack). Seems like their case was solved by moving the calling code out of DllMain. Would be weird if us calling Perf Counter functions from inside our main() was unsupported though.

    Remote Access Manager thread

    0 	ntdll.dll!NtWaitForSingleObject()	Unknown
    1 	KERNELBASE.dll!WaitForSingleObjectEx()	Unknown
    2>	rasctrs.dll!RasmanServiceMonitor()	Unknown
    3 	kernel32.dll!BaseThreadInitThunk()	Unknown
    4 	ntdll.dll!RtlUserThreadStart()	Unknown
    

    Possibly a side effect of running ProcDump to capture the .DMP file from the sound of https://github.com/Rem0o/FanControl.Releases/issues/101#issuecomment-762076872.

    Other threads

    0 	ntdll.dll!NtWaitForSingleObject()	Unknown
    1 	KERNELBASE.dll!WaitForSingleObjectEx()	Unknown
    2>	aspnet_perf.dll!CPerfCounterClient::GatherPerfData(void)	Unknown
    3 	aspnet_perf.dll!PerfDataGatherThreadStart(void *)	Unknown
    4 	kernel32.dll!BaseThreadInitThunk()	Unknown
    5 	ntdll.dll!RtlUserThreadStart()	Unknown
    
    0 	ntdll.dll!NtWaitForMultipleObjects()	Unknown
    1 	sechost.dll!EtwpProcessRealTimeTraces()	Unknown
    2 	sechost.dll!ProcessTrace()	Unknown
    3>	perfts.dll!BlockOnProcessTrace(void *)	Unknown
    4 	kernel32.dll!BaseThreadInitThunk()	Unknown
    5 	ntdll.dll!RtlUserThreadStart()	Unknown
    
    0 	ntdll.dll!NtWaitForMultipleObjects()	Unknown
    1 	KERNELBASE.dll!WaitForMultipleObjectsEx()	Unknown
    2 	KERNELBASE.dll!WaitForMultipleObjects()	Unknown
    3>	perfos.dll!StandbyMonitorThreadProc(void *)	Unknown
    4 	kernel32.dll!BaseThreadInitThunk()	Unknown
    5 	ntdll.dll!RtlUserThreadStart()	Unknown
    
    0 	ntdll.dll!NtWaitForWorkViaWorkerFactory()	Unknown
    1>	ntdll.dll!TppWorkerThread()	Unknown
    2 	kernel32.dll!BaseThreadInitThunk()	Unknown
    3 	ntdll.dll!RtlUserThreadStart()	Unknown
    

    2 each of

    0 	ntdll.dll!NtWaitForMultipleObjects()	Unknown
    1 	KERNELBASE.dll!WaitForMultipleObjectsEx()	Unknown
    2 	KERNELBASE.dll!WaitForMultipleObjects()	Unknown
    3>	CorperfmonExt.dll!HandlerAuxThreadProc(void *)	Unknown
    4 	kernel32.dll!BaseThreadInitThunk()	Unknown
    5 	ntdll.dll!RtlUserThreadStart()	Unknown
    
    0 	ntdll.dll!NtWaitForSingleObject()	Unknown
    1 	KERNELBASE.dll!WaitForSingleObjectEx()	Unknown
    2>	aspnet_perf.dll!CPerfCounterClient::MonitorPerfPipeNames(void)	Unknown
    3 	aspnet_perf.dll!RegistryMonitorThreadStart(void *)	Unknown
    4 	kernel32.dll!BaseThreadInitThunk()	Unknown
    5 	ntdll.dll!RtlUserThreadStart()	Unknown
    

    This might indicate there are lingering threads from a previous loop iteration that were not cleaned up yet. Despite the official example not calling RegCloseKey at all, maybe we should try doing it for every loop iteration to see if that fixes it.

    Alternative Performance Counter APIs

    https://learn.microsoft.com/en-us/windows/win32/perfctrs/about-performance-counters#performance-api-architecture <- Has a handy diagram of the 3 consumer-facing apis. It states:

    Some older performance counter consumers use the registry APIs to collect performance data from the special HKEY_PERFORMANCE_DATA registry key. This is not recommended for new code because processing the data from the registry is complex and error-prone. The registry API implementation directly supports collecting data from V1 providers. It indirectly supports collecting data from V2 providers through a translation layer that uses the V2 consumer APIs.

    Some performance counter consumers use the PerfLib V2 Consumer functions to directly access data from V2 providers. This is more complex than consuming data using the PDH APIs, but this approach can be useful if PDH APIs cannot be used due to performance or dependency concerns. The PerfLib V2 implementation directly supports collecting data from V2 providers. It does not support collecting data from V1 providers.

    The main callstack above with our hang contains QueryV1Provider. Makes me think maybe we should migrate to the more modern V2 API that does not go through the Registry layer.

    Log performance counter data?

    https://learn.microsoft.com/en-us/windows/win32/api/winreg/nf-winreg-regqueryvalueexa:

    If the buffer specified by lpData parameter is not large enough to hold the data, the function returns ERROR_MORE_DATA and stores the required buffer size in the variable pointed to by lpcbData. In this case, the contents of the lpData buffer are undefined.

    It is unclear whether we get anything useful back in the failure case, but we could add code to see if lpData (PPERF_DATA_BLOCK) has anything usable. Maybe it would provide hints at what is different from successful runs and causing the failures.

    RegCloseKey investigation

    https://learn.microsoft.com/en-us/windows/win32/perfctrs/using-the-registry-functions-to-consume-counter-data States:

    Be sure to use the RegCloseKey function to close the handle to the key when you are finished obtaining the performance data. This is important for both the local and remote cases:

    • RegCloseKey(HKEY_PERFORMANCE_DATA) does not actually close a registry handle, but it clears all cached data and frees the loaded performance DLLs.
    • RegCloseKey(hkeyRemotePerformanceData) closes the handle to the remote machine’s registry.

    It is not entirely clear whether it should be called after every RegQueryValueExA or if it’s fine as-is, after the loop.

    https://learn.microsoft.com/en-us/windows/win32/sysinfo/predefined-keys also mentions:

    The system defines predefined keys that are always open.

    and

    The predefined handles are not thread safe. Closing a predefined handle in one thread affects any other threads that are using the handle.

    So there’s some mixed messaging around closing.

    RandAddPeriodic (documented by us to be thread-safe) is called with one minute between every call, so shouldn’t require a mutex even on sluggish machines. (We seed startup randomness before scheduling the periodic re-seeding).

    https://learn.microsoft.com/en-us/windows/win32/api/winreg/nf-winreg-regopenkeyexw explicitly says:

    If the key is not one of the predefined registry keys, call the RegCloseKey function after you have finished using the handle.

    Closing after the loop happens after the issue covered here, so seems to be not too often unless there’s some inter-CI-run interference. Or maybe inter-process interference is possible and another unknown process on CI is closing the handle? Could write a minimal test case of two processes querying the registry and closing handles on each other forever. :\

    As mentioned above with the case of duplicate thread callstacks, it might be fruitful to attempt closing in every iteration of the loop.

  70. DrahtBot added the label Needs rebase on Oct 24, 2024
  71. DrahtBot commented at 3:54 pm on October 24, 2024: contributor

    🐙 This pull request conflicts with the target branch and needs rebase.

  72. laanwj commented at 5:34 pm on October 24, 2024: member

    It is not entirely clear whether it should be called after every RegQueryValueExA or if it’s fine as-is, after the loop.

    FWIW the idea that i got from reading that was that you should close the key when done with the API. Not at any specific point. This will unload the performance collection DLLs. The next time you query the key again they will be loaded again.

    i concluded that we do the right thing (never close it) because, we’re never done with it, we keep queryiing them time after time after time, so might as well keep the DLLs in memory.

    But yes it’s ambiguous.

  73. achow101 referenced this in commit 947f2925d5 on Oct 24, 2024
  74. hodlinator commented at 10:19 pm on October 24, 2024: contributor

    Turned off hourly CI for now, since fix removing the offending function was merged into bitcoin:master.

    Intend to experiment with more surgical solutions and document that here in the future.

  75. davidgumberg commented at 6:43 pm on October 26, 2024: contributor

    Turned off hourly CI for now, since fix removing the offending function was merged into bitcoin:master.

    Would it make sense to leave it running for a little longer to validate that the fix worked?

  76. hodlinator commented at 8:11 pm on October 26, 2024: contributor

    Would it make sense to leave it running for a little longer to validate that the fix worked?

    My reasoning was that there should be enough CI runs on master and all runs on PRs based on top of the merged fix. If it re-occurs hopefully one of us will reopen #30390. But I don’t have a well-developed sense for how much strain is reasonable to put on the CI resources.

  77. davidgumberg commented at 9:26 am on October 27, 2024: contributor

    My reasoning was that there should be enough CI runs on master and all runs on PRs based on top of the merged fix.

    Oh, didn’t think of that, makes sense to me.

  78. maflcko commented at 9:23 am on October 28, 2024: member
    Can this be closed, or is it waiting on something?
  79. WIP - Attempt at more surgical fix for bitcoind stalling 757918c22b
  80. WIP - ci: Temporarily schedule to re-run every 3 hours (at minute 21)
    Goal is to verify more surgical fix to issue with stalled bitcoind within 1-2.
    f69a238bce
  81. hodlinator force-pushed on Oct 28, 2024
  82. hodlinator commented at 9:50 pm on October 28, 2024: contributor

    Can this be closed, or is it waiting on something?

    I’d like to keep it open as I explore less nuclear solutions. Pushed an initial attempt to here and my personal master: https://github.com/hodlinator/bitcoin/compare/84cd6478c422bc296589ab031f5c76e7bab2704d...f69a238bcec12eb9fc2d8a13264fa471e06a4d20

  83. fanquake commented at 10:19 pm on October 28, 2024: member

    I’d like to keep it open as I explore less nuclear solutions. Pushed an initial attempt to here and my personal master: hodlinator/bitcoin@84cd647…f69a238

    In my opinion your (already merged) change to drop the offending code from the RNG is fine, and isn’t really something we need to investigate further / work on reintroducing.

  84. laanwj commented at 7:48 am on October 29, 2024: member

    isn’t really something we need to investigate further / work on reintroducing.

    Agree, it’s fine, it’s good to get rid of the perfmon query.

  85. hodlinator commented at 11:11 am on October 29, 2024: contributor
    Alright, our time is finite and hopefully CryptGenRandom should be enough for now. Closing.
  86. hodlinator closed this on Oct 29, 2024


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-21 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me