correct wrong assumptions in the contrib linearize data script #31888

pull midnightmagic wants to merge 1 commits into bitcoin:master from midnightmagic:fix-linearize-gjpyn changing 1 files +6 −1
  1. midnightmagic commented at 0:32 am on February 17, 2025: contributor

    The linearize-data contrib script does not quite work correctly in the case of a chain reorg, and it fully-truncates files with an incorrect open mode.

    This small patch clips off incorrect trailing data and corrects file open mode type.

    (small patch being passed around which I did not write but thought would be helpful to the community)

  2. DrahtBot commented at 0:32 am on February 17, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31888.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

  3. DrahtBot commented at 1:22 am on February 17, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/37309076255

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  4. DrahtBot added the label CI failed on Feb 17, 2025
  5. laanwj added the label Scripts and tools on Feb 17, 2025
  6. laanwj commented at 11:42 am on February 17, 2025: member

    Lint is failing on the formatting of the commit message:

    0[10:12:47.126] The subject line of commit hash 0cf4a6245d2bc8d57f91dd2b4f5bd5a0eed18ed4 is followed by a non-empty line. Subject lines should always be followed by a blank line.
    1[10:12:47.126] ^^^
    2[10:12:47.126] 
    3[10:12:47.126] ^---- ⚠️ Failure generated from lint check 'commit_msg' (Check that commit messages have a new line before the body or no body at all.)!
    

    It also looks like the changes somehow break the feature_loadblock.py test, which uses the linearize script

  7. midnightmagic force-pushed on Feb 19, 2025
  8. midnightmagic force-pushed on Feb 19, 2025
  9. maflcko commented at 9:59 am on February 19, 2025: member
    It would be good to include a test, or modify the existing one. Otherwise, it will be harder to test and see what behavior actually changed and future changes could break the fix again.
  10. correct the linearize-data script (minor)
    this clips off unwanted final data in event of reorg and fixes the file open
    mode to prevent the truncation of files reopened
    
    small patch being passed around which I did not write but thought would be
    helpful to the community at large
    e9dc16194b
  11. midnightmagic force-pushed on Feb 19, 2025
  12. DrahtBot removed the label CI failed on Feb 19, 2025
  13. theStack commented at 2:12 pm on March 12, 2025: contributor
    Can you give more context on what problem this PR is trying to solve? Looking at the description and the diff I can’t follow, e.g. it’s unclear to me what an “incorrect open mode” is.
  14. midnightmagic commented at 5:01 am on April 23, 2025: contributor

    The changes function to make linearize “less-perturb” on-disk data. Altogether, in the event of a block reorganization, the reorg’d blocks are sometimes shorter than prior blocks’ data in individual output files when split on e.g. size. When the writer potentially moves on to the next blk data file therefore, the rest of the block data that pre-existed the reorganization might need to be truncated off the end of that file before moving on to the next one.

    For the “wb” Python open mode, all pre-existing block files as it writes out blk* files are destroyed and re-created by an open with “wb”, which in python will auto-truncate any pre-existing file to 0 bytes—for many large files, something pulling data from the source will always see rapidly-changing source dataset all the time.

    With the change, the attempt to verify that the file doesn’t pre-exist and then re-write in-place from 0 will be much less likely to break anything trying to read the files, most of them will effectively never change, and instead we just get the (much less likely) chance of a reorg write racing a potential midstream read, plus the much less-likely truncate-eliminating-a-bit-of-end-file data.

    It’s not really perfect when considered in conjunction with other tools, but it’s much better than the current behaviour. Probably some form of atomic file modification would be “better” in the sense of guaranteeing on-disk state consistency, or, perhaps a trigger to external tools to identify “current” consistent state, but in the meanwhile I can’t really see this change hurting because of how shallow reorgs usually are and the inordinate size of the dataset we have to manipulate.

    The use case is for people who repeatedly re-run linearize and make use of the results on an ongoing basis.

  15. bitcoin31888 commented at 2:57 am on May 19, 2025: none

    Consider the following bash script

    while true;do ./linearize-hashes.py linearize.cfg > hashlist.txt; ./linearize-data.py linearize.cfg; sleep 120; done

    This will continuously write out the bootstrap files, without this patch the files will continuously be truncated to zero length and then re-written, that behavior is undesirable.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-05-24 21:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me