Windows bitcoind stall debugging [NOMERGE, DRAFT] #30956

pull hodlinator wants to merge 5 commits into bitcoin:master from hodlinator:30390_windows_debug changing 2 files +103 −20
  1. hodlinator commented at 12:56 pm on September 24, 2024: contributor

    This PR is meant to help investigate #30390 (https://github.com/bitcoin/bitcoin/issues/30390#issuecomment-2326628987).

    Switches Windows CI to produce RelWithDebInfo binaries and adds code that uses ProcDump to generate a dump file of the bitcoind process which may help explain where it’s gotten stuck.

    Unresolved:

    • ~Whether the CI change is remotely correct.~
    • ~How to get access to the generated .EXE, .DMP and .PDB files.~
    • ~Obtaining a .DMP artifact(s).~
    • ~Verify debugging with downloaded artifacts works.~
    • ~Making the GitHub workflow begin execution based on schedule rather than based on git push.~
    • ~Undo most temporary changes and let CI run every 3 hours on this branch~ on personal master branch for 2 weeks.
    • Compare execution time of release vs RelWithDebInfo to see if defaults could be changed on master branch.
  2. DrahtBot commented at 12:56 pm on September 24, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #30660 (test: Shut down framework cleanly on RPC connection failure by hodlinator)
    • #30487 (ci: skip Github CI on branch pushes for forks by Sjors)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. hodlinator marked this as a draft on Sep 24, 2024
  4. hodlinator force-pushed on Sep 24, 2024
  5. hodlinator commented at 9:05 pm on September 24, 2024: contributor

    https://github.com/bitcoin/bitcoin/actions/runs/11014291937/job/30584604403?pr=30956#step:10:930 contains: bitcoind.vcxproj -> D:\a\bitcoin\bitcoin\build\src\RelWithDebInfo\bitcoind.exe …notice the “RelWithDebInfo”, so it should be generating .PDB files like it does for me locally. @maflcko continuing discussion after your comment in the related issue:

    Seems fine to integrate in a pull request and then keep the pull request unmerged, and only merge the pull request in the GHA CI, as that is the only place where it reproduces. Adding temporary test test-only code for everyone else may be a bit too much.

    1. Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?

    2. Do you have any ideas for being able to retrieve generated .DMP and .PDB files from CI? I’m totally unfamiliar with GHA so RTFM-links are welcome too. An alternative approach could be to make this temporary PR upload the relevant files to somewhere, only if the issue occurs.

    Feel free to loop in others to answer if you think it’s more their area of ownership.

  6. davidgumberg commented at 9:42 pm on September 24, 2024: contributor

    Do you have any ideas for being able to retrieve generated .DMP and .PDB files from CI?

    I think you can add something like:

    0- uses: actions/upload-artifact@v4
    1  with:
    2    name: dump
    3    path: %RUNNER_TEMP%/**/bitcoind.dmp
    4    if-no-files-found: ignore
    

    after the run: part of each job.

    https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-and-sharing-data-from-a-workflow https://github.com/actions/upload-artifact#examples

  7. maflcko commented at 6:22 am on September 25, 2024: member
    1. Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?

    Seems fine to do permanent, if there are no downsides or slow-down. cc @hebasto

  8. hodlinator force-pushed on Sep 25, 2024
  9. hodlinator commented at 10:08 am on September 25, 2024: contributor

    https://github.com/bitcoin/bitcoin/actions/runs/11030752196 gives me:

    actions/upload-artifact@v4 is not allowed to be used in bitcoin/bitcoin. Actions in this workflow must be: within a repository owned by bitcoin or matching the following: actions/cache@, actions/cache/restore@, actions/cache/save@, actions/checkout@, ilammy/msvc-dev-cmd@*.

    Maybe one could (ab)use actions/cache/save but I’ll pause here for more input.

  10. maflcko commented at 10:12 am on September 25, 2024: member
    Maybe you can leave it in your repo and just trigger the task every 3 hours for two weeks to get the dump eventually?
  11. hodlinator force-pushed on Sep 25, 2024
  12. hodlinator force-pushed on Sep 25, 2024
  13. hodlinator force-pushed on Sep 25, 2024
  14. hebasto commented at 11:23 am on September 26, 2024: member
    1. Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?

    Seems fine to do permanent, if there are no downsides or slow-down. cc @hebasto

    Here is some historical context:

    1. In pre-Cmake times, manually crafted project files had two configurations: “Release” and “Debug”.
    2. While working on the CMake staging branch, the CI build configuration remained “Release” to allow the use of Ccache. This functionality was later dropped.

    At this point, I don’t see any drawbacks to switching the CI job’s build configuration from “Release” to “RelWithDebInfo”.

  15. hodlinator force-pushed on Sep 26, 2024
  16. hodlinator force-pushed on Sep 26, 2024
  17. hodlinator force-pushed on Sep 26, 2024
  18. hodlinator force-pushed on Sep 27, 2024
  19. hodlinator force-pushed on Sep 27, 2024
  20. DrahtBot added the label CI failed on Sep 27, 2024
  21. DrahtBot commented at 11:05 am on September 27, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/30753975513

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  22. hodlinator force-pushed on Sep 27, 2024
  23. hodlinator commented at 9:31 pm on September 27, 2024: contributor

    https://github.com/hodlinator/bitcoin/actions/runs/11069726468/job/30757882453#step:15:114 : subprocess.TimeoutExpired: Command '['D:\\a\\bitcoin\\bitcoin\\build\\test\\cache\\node0\\Procdump.exe', '-mm', '7904', 'D:\\a\\bitcoin\\bitcoin\\build\\test\\cache\\node0\\bitcoind.dmp']' timed out after 4800 seconds

    The Procdump.exe process itself times out when trying to produce the dump. :facepalm: Could be that it’s popping up a GUI dialogue on first run, will investigate further.

  24. hodlinator force-pushed on Sep 28, 2024
  25. hodlinator force-pushed on Sep 28, 2024
  26. hodlinator force-pushed on Sep 28, 2024
  27. hodlinator force-pushed on Sep 28, 2024
  28. hodlinator force-pushed on Sep 28, 2024
  29. hodlinator commented at 10:43 pm on September 28, 2024: contributor

    Partial victory! - Latest run https://github.com/hodlinator/bitcoin/actions/runs/11087284207 actually uploaded a .DMP file (but it wasn’t a real stall, just forced it to happen).

    Still remains to see if it loads nicely in the debugger.

    Other remaining issue is that I used a hardcoded absolute path for the upload-artifact step since I’m still struggling to find the correct glob-expression. Can’t assume it will be node0 that stalls, and unsure if the path will be the same when removing –nocleanup. (Come to think of it, –nocleanup shouldn’t be necessary now that the test fails).

    0      - name: Upload .DMP file
    1        uses: actions/upload-artifact@v4
    2        id: dmp-upload
    3        if: ${{ failure() && steps.functional-tests.conclusion == 'failure' }}
    4        with:
    5          name: bitcoind-windows-dmp-file
    6          path: D:/a/bitcoin/bitcoin/build/test/cache/node0/bitcoind.dmp
    

    I’m a bit mystified by RUNNER_TEMP passed as --tmpdirprefix to the functional test runner having the value D:/a/_temp/, while the test data seems to end up in D:/a/bitcoin/bitcoin/build/test/cache/, but maybe the cache part is a clue.

  30. hodlinator commented at 2:03 pm on September 30, 2024: contributor

    Still remains to see if it loads nicely in the debugger.

    It debugs!

    image

    Requires .DMP, .PDB and .EXE (last one currently luckily added just as an extra test).

  31. hodlinator force-pushed on Sep 30, 2024
  32. hodlinator force-pushed on Sep 30, 2024
  33. hodlinator force-pushed on Oct 1, 2024
  34. hodlinator force-pushed on Oct 1, 2024
  35. hodlinator force-pushed on Oct 1, 2024
  36. hodlinator force-pushed on Oct 1, 2024
  37. hodlinator force-pushed on Oct 1, 2024
  38. hodlinator force-pushed on Oct 2, 2024
  39. hodlinator force-pushed on Oct 2, 2024
  40. hodlinator force-pushed on Oct 2, 2024
  41. hodlinator renamed this:
    DO NOT MERGE: Windows bitcoind stall debugging
    Windows bitcoind stall debugging [WIP, NOMERGE, DRAFT]
    on Oct 2, 2024
  42. DrahtBot renamed this:
    Windows bitcoind stall debugging [WIP, NOMERGE, DRAFT]
    Windows bitcoind stall debugging [WIP, NOMERGE, DRAFT]
    on Oct 2, 2024
  43. hodlinator force-pushed on Oct 2, 2024
  44. ci: Temporarily switch Windows to RelWithDebInfo
    RelWithDebInfo (Release with debug information) ensures we generate .PDB files which are useful when debugging dump files.
    7421a0219a
  45. test: Generate dump file for stalled bitcoind
    Upon failing to establish an RPC connection to bitcoind:
    Download SysInternals Procdump, unzip it, and use it to generate a dump file that can be opened in a debugger to see what bitcoind was up to.
    28751ebbd7
  46. hodlinator force-pushed on Oct 3, 2024
  47. hodlinator commented at 9:48 am on October 3, 2024: contributor
    I finally read the documentation for schedule a bit more closely and realized it only applies to the master branch, so I’m making the “every 3 hours”-schedule happen on my personal master branch for now.
  48. hodlinator force-pushed on Oct 3, 2024
  49. hodlinator commented at 8:24 pm on October 3, 2024: contributor

    Had 2 scheduled runs on my modified master, 3 hours apart.

    • Dump file generation upon bitcoind connection failure is enabled for functional tests, and no longer artificially forced.
    • Re-enabled unit tests to keep the setup as close as possible to regular CI.
  50. hodlinator renamed this:
    Windows bitcoind stall debugging [WIP, NOMERGE, DRAFT]
    Windows bitcoind stall debugging [NOMERGE, DRAFT]
    on Oct 3, 2024
  51. in .github/workflows/ci.yml:221 in e9c6b02462 outdated
    223+          BITCOINWALLET: '${{ github.workspace }}\build\src\RelWithDebInfo\bitcoin-wallet.exe'
    224           TEST_RUNNER_EXTRA: ${{ github.event_name != 'pull_request' && '--extended' || '' }}
    225         shell: cmd
    226-        run: py -3 test\functional\test_runner.py --jobs %NUMBER_OF_PROCESSORS% --ci --quiet --tmpdirprefix=%RUNNER_TEMP% --combinedlogslen=99999999 --timeout-factor=%TEST_RUNNER_TIMEOUT_FACTOR% %TEST_RUNNER_EXTRA%
    227+        run: |
    228+          py -3 test\functional\test_runner.py --jobs %NUMBER_OF_PROCESSORS% --ci --quiet --tmpdirprefix=%RUNNER_TEMP% --combinedlogslen=99999999 --timeout-factor=%TEST_RUNNER_TIMEOUT_FACTOR% %TEST_RUNNER_EXTRA% --filter mempool_
    


    maflcko commented at 12:15 pm on October 7, 2024:
    Any reason to filter to only mempool_?

    hodlinator commented at 3:57 pm on October 7, 2024:
    Thanks! Leftover WIP change. Removed filter in latest push here and to my personal master.
  52. ci: Upload .DMP, .EXE and .PDB artifacts
    Only done upon functional test failure. .EXE & .PDB are only uploaded if .DMP file is successfully uploaded first.
    0c7b73c40b
  53. WIP - ci: Temporarily disable non-windows jobs 76f01a4397
  54. WIP - ci: Temporarily schedule to re-run every 3rd hour (at minute 21)
    Goal is to trigger issue with stalled bitcoind within 2 weeks.
    c35e307f10
  55. hodlinator force-pushed on Oct 7, 2024
  56. DrahtBot removed the label CI failed on Oct 7, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-10-08 16:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me