DO NOT MERGE: Windows bitcoind stall debugging #30956

pull hodlinator wants to merge 8 commits into bitcoin:master from hodlinator:30390_windows_debug changing 2 files +189 −13
  1. hodlinator commented at 12:56 pm on September 24, 2024: contributor

    This PR is meant to help investigate #30390 (https://github.com/bitcoin/bitcoin/issues/30390#issuecomment-2326628987).

    Switches Windows CI to produce RelWithDebInfo binaries and adds code that uses ProcDump to generate a dump file of the bitcoind process which may help explain where it’s gotten stuck.

    Unresolved:

    • ~Whether the CI change is remotely correct.~
    • ~How to get access to the generated .DMP and .PDB files.~
    • Obtaining a .DMP artifact.
    • Comparing execution time of release vs RelWithDebInfo to see if defaults could be changed
  2. DrahtBot commented at 12:56 pm on September 24, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #30660 (test: Shut down framework cleanly on RPC connection failure by hodlinator)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. hodlinator marked this as a draft on Sep 24, 2024
  4. ci: Temporarily switch Windows to RelWithDebInfo
    RelWithDebInfo (Release with debug information) ensures we generate .PDB files which are useful when debugging dump files.
    9c06696487
  5. hodlinator force-pushed on Sep 24, 2024
  6. hodlinator commented at 9:05 pm on September 24, 2024: contributor

    https://github.com/bitcoin/bitcoin/actions/runs/11014291937/job/30584604403?pr=30956#step:10:930 contains: bitcoind.vcxproj -> D:\a\bitcoin\bitcoin\build\src\RelWithDebInfo\bitcoind.exe …notice the “RelWithDebInfo”, so it should be generating .PDB files like it does for me locally. @maflcko continuing discussion after your comment in the related issue:

    Seems fine to integrate in a pull request and then keep the pull request unmerged, and only merge the pull request in the GHA CI, as that is the only place where it reproduces. Adding temporary test test-only code for everyone else may be a bit too much.

    1. Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?

    2. Do you have any ideas for being able to retrieve generated .DMP and .PDB files from CI? I’m totally unfamiliar with GHA so RTFM-links are welcome too. An alternative approach could be to make this temporary PR upload the relevant files to somewhere, only if the issue occurs.

    Feel free to loop in others to answer if you think it’s more their area of ownership.

  7. davidgumberg commented at 9:42 pm on September 24, 2024: contributor

    Do you have any ideas for being able to retrieve generated .DMP and .PDB files from CI?

    I think you can add something like:

    0- uses: actions/upload-artifact@v4
    1  with:
    2    name: dump
    3    path: %RUNNER_TEMP%/**/bitcoind.dmp
    4    if-no-files-found: ignore
    

    after the run: part of each job.

    https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-and-sharing-data-from-a-workflow https://github.com/actions/upload-artifact#examples

  8. maflcko commented at 6:22 am on September 25, 2024: member
    1. Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?

    Seems fine to do permanent, if there are no downsides or slow-down. cc @hebasto

  9. hodlinator force-pushed on Sep 25, 2024
  10. hodlinator commented at 10:08 am on September 25, 2024: contributor

    https://github.com/bitcoin/bitcoin/actions/runs/11030752196 gives me:

    actions/upload-artifact@v4 is not allowed to be used in bitcoin/bitcoin. Actions in this workflow must be: within a repository owned by bitcoin or matching the following: actions/cache@, actions/cache/restore@, actions/cache/save@, actions/checkout@, ilammy/msvc-dev-cmd@*.

    Maybe one could (ab)use actions/cache/save but I’ll pause here for more input.

  11. maflcko commented at 10:12 am on September 25, 2024: member
    Maybe you can leave it in your repo and just trigger the task every 3 hours for two weeks to get the dump eventually?
  12. hodlinator force-pushed on Sep 25, 2024
  13. hodlinator force-pushed on Sep 25, 2024
  14. hodlinator force-pushed on Sep 25, 2024
  15. hebasto commented at 11:23 am on September 26, 2024: member
    1. Do you think the current approach of temporarily changing from Release -> RelWithDebInfo for Windows would be acceptable, as a PR that is only run on CI?

    Seems fine to do permanent, if there are no downsides or slow-down. cc @hebasto

    Here is some historical context:

    1. In pre-Cmake times, manually crafted project files had two configurations: “Release” and “Debug”.
    2. While working on the CMake staging branch, the CI build configuration remained “Release” to allow the use of Ccache. This functionality was later dropped.

    At this point, I don’t see any drawbacks to switching the CI job’s build configuration from “Release” to “RelWithDebInfo”.

  16. hodlinator force-pushed on Sep 26, 2024
  17. hodlinator force-pushed on Sep 26, 2024
  18. hodlinator force-pushed on Sep 26, 2024
  19. hodlinator force-pushed on Sep 27, 2024
  20. hodlinator force-pushed on Sep 27, 2024
  21. DrahtBot added the label CI failed on Sep 27, 2024
  22. DrahtBot commented at 11:05 am on September 27, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/30753975513

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  23. hodlinator force-pushed on Sep 27, 2024
  24. hodlinator commented at 9:31 pm on September 27, 2024: contributor

    https://github.com/hodlinator/bitcoin/actions/runs/11069726468/job/30757882453#step:15:114 : subprocess.TimeoutExpired: Command '['D:\\a\\bitcoin\\bitcoin\\build\\test\\cache\\node0\\Procdump.exe', '-mm', '7904', 'D:\\a\\bitcoin\\bitcoin\\build\\test\\cache\\node0\\bitcoind.dmp']' timed out after 4800 seconds

    The Procdump.exe process itself times out when trying to produce the dump. :facepalm: Could be that it’s popping up a GUI dialogue on first run, will investigate further.

  25. hodlinator force-pushed on Sep 28, 2024
  26. test: Generate dump file for stalled bitcoind
    Upon failing to establish an RPC connection to bitcoind:
    Download SysInternals Procdump, unzip it, and use it to generate a dump file that can be opened in a debugger to see what bitcoind was up to.
    e6ddd997fa
  27. ci: Store .DMP and .PDB files if both are found 6192f40356
  28. hodlinator force-pushed on Sep 28, 2024
  29. hodlinator force-pushed on Sep 28, 2024
  30. hodlinator force-pushed on Sep 28, 2024
  31. ci: Temporarily add steps to verify GHA behavior 0bb26087ce
  32. ci: Temporarily disable non-windows jobs d6440c11cf
  33. ci: Temporarily add schedule to re-run every 3rd hour (at minute 21)
    Goal is to trigger issue with stalled bitcoind within 2 weeks.
    a8b1c84c9f
  34. test: Temporarily force timeout behavior so .DMP file is generated on CI 3a8ee4a3ce
  35. ci: Temporarily disable ctest until dumping is working 3ef1236708
  36. hodlinator force-pushed on Sep 28, 2024
  37. hodlinator commented at 10:43 pm on September 28, 2024: contributor

    Partial victory! - Latest run https://github.com/hodlinator/bitcoin/actions/runs/11087284207 actually uploaded a .DMP file (but it wasn’t a real stall, just forced it to happen).

    Still remains to see if it loads nicely in the debugger.

    Other remaining issue is that I used a hardcoded absolute path for the upload-artifact step since I’m still struggling to find the correct glob-expression. Can’t assume it will be node0 that stalls, and unsure if the path will be the same when removing –nocleanup. (Come to think of it, –nocleanup shouldn’t be necessary now that the test fails).

    0      - name: Upload .DMP file
    1        uses: actions/upload-artifact@v4
    2        id: dmp-upload
    3        if: ${{ failure() && steps.functional-tests.conclusion == 'failure' }}
    4        with:
    5          name: bitcoind-windows-dmp-file
    6          path: D:/a/bitcoin/bitcoin/build/test/cache/node0/bitcoind.dmp
    

    I’m a bit mystified by RUNNER_TEMP passed as --tmpdirprefix to the functional test runner having the value D:/a/_temp/, while the test data seems to end up in D:/a/bitcoin/bitcoin/build/test/cache/, but maybe the cache part is a clue.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-09-29 01:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me