correct wrong assumptions in the contrib linearize data script #31888

pull midnightmagic wants to merge 1 commits into bitcoin:master from midnightmagic:fix-linearize-gjpyn changing 1 files +6 −1
  1. midnightmagic commented at 0:32 am on February 17, 2025: contributor

    The linearize-data contrib script does not quite work correctly in the case of a chain reorg, and it fully-truncates files with an incorrect open mode.

    This small patch clips off incorrect trailing data and corrects file open mode type.

    (small patch being passed around which I did not write but thought would be helpful to the community)

  2. DrahtBot commented at 0:32 am on February 17, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31888.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

  3. DrahtBot commented at 1:22 am on February 17, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/37309076255

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  4. DrahtBot added the label CI failed on Feb 17, 2025
  5. laanwj added the label Scripts and tools on Feb 17, 2025
  6. laanwj commented at 11:42 am on February 17, 2025: member

    Lint is failing on the formatting of the commit message:

    0[10:12:47.126] The subject line of commit hash 0cf4a6245d2bc8d57f91dd2b4f5bd5a0eed18ed4 is followed by a non-empty line. Subject lines should always be followed by a blank line.
    1[10:12:47.126] ^^^
    2[10:12:47.126] 
    3[10:12:47.126] ^---- ⚠️ Failure generated from lint check 'commit_msg' (Check that commit messages have a new line before the body or no body at all.)!
    

    It also looks like the changes somehow break the feature_loadblock.py test, which uses the linearize script

  7. midnightmagic force-pushed on Feb 19, 2025
  8. midnightmagic force-pushed on Feb 19, 2025
  9. maflcko commented at 9:59 am on February 19, 2025: member
    It would be good to include a test, or modify the existing one. Otherwise, it will be harder to test and see what behavior actually changed and future changes could break the fix again.
  10. correct the linearize-data script (minor)
    this clips off unwanted final data in event of reorg and fixes the file open
    mode to prevent the truncation of files reopened
    
    small patch being passed around which I did not write but thought would be
    helpful to the community at large
    e9dc16194b
  11. midnightmagic force-pushed on Feb 19, 2025
  12. DrahtBot removed the label CI failed on Feb 19, 2025
  13. theStack commented at 2:12 pm on March 12, 2025: contributor
    Can you give more context on what problem this PR is trying to solve? Looking at the description and the diff I can’t follow, e.g. it’s unclear to me what an “incorrect open mode” is.
  14. midnightmagic commented at 5:01 am on April 23, 2025: contributor

    The changes function to make linearize “less-perturb” on-disk data. Altogether, in the event of a block reorganization, the reorg’d blocks are sometimes shorter than prior blocks’ data in individual output files when split on e.g. size. When the writer potentially moves on to the next blk data file therefore, the rest of the block data that pre-existed the reorganization might need to be truncated off the end of that file before moving on to the next one.

    For the “wb” Python open mode, all pre-existing block files as it writes out blk* files are destroyed and re-created by an open with “wb”, which in python will auto-truncate any pre-existing file to 0 bytes—for many large files, something pulling data from the source will always see rapidly-changing source dataset all the time.

    With the change, the attempt to verify that the file doesn’t pre-exist and then re-write in-place from 0 will be much less likely to break anything trying to read the files, most of them will effectively never change, and instead we just get the (much less likely) chance of a reorg write racing a potential midstream read, plus the much less-likely truncate-eliminating-a-bit-of-end-file data.

    It’s not really perfect when considered in conjunction with other tools, but it’s much better than the current behaviour. Probably some form of atomic file modification would be “better” in the sense of guaranteeing on-disk state consistency, or, perhaps a trigger to external tools to identify “current” consistent state, but in the meanwhile I can’t really see this change hurting because of how shallow reorgs usually are and the inordinate size of the dataset we have to manipulate.

    The use case is for people who repeatedly re-run linearize and make use of the results on an ongoing basis.

  15. bitcoin31888 commented at 2:57 am on May 19, 2025: none

    Consider the following bash script

    while true;do ./linearize-hashes.py linearize.cfg > hashlist.txt; ./linearize-data.py linearize.cfg; sleep 120; done

    This will continuously write out the bootstrap files, without this patch the files will continuously be truncated to zero length and then re-written, that behavior is undesirable.

  16. achow101 requested review from Eunovo on Oct 22, 2025
  17. sedited commented at 7:54 am on March 4, 2026: contributor
    There’s not been a response from the author to a review comment here, in quite some time. There has also been no further review. Maybe this should be closed and marked as up for grabs?
  18. maflcko commented at 8:50 am on March 4, 2026: member
    My reading from #31888 (comment) seems to give an example of why this pull is needed. However, I haven’t looked closely and reviewed/tested this myself
  19. sedited commented at 9:11 am on March 4, 2026: contributor

    My reading from #31888 (comment) seems to give an example of why this pull is needed. However, I haven’t looked closely and reviewed/tested this myself

    Huh, yes, I did read that the wrong way around. Will take a look myself then :)

  20. sedited commented at 12:03 pm on March 4, 2026: contributor
    I’m not sure about this patch being framed as a correction of a wrong assumption. As described in a later comment, this adds support for writing to the same set of files again. I can see why that might be useful, but the description both for the PR and the commit should clarify this. Would also be good to capture this behaviour in the README.
  21. midnightmagic commented at 7:21 pm on March 4, 2026: contributor
    The use case described in that comment is precisely demonstrating a logical outcome of the regressive, drive-destroying behaviour for anyone who wants to use this script more than once, so, I’d thought it was a good example of what the goal of the diff is and didn’t need comment. Does it need further comment? :-)
  22. sedited commented at 9:36 am on March 5, 2026: contributor

    The use case described in that comment is precisely demonstrating a logical outcome of the regressive, drive-destroying behaviour for anyone who wants to use this script more than once, so, I’d thought it was a good example of what the goal of the diff is and didn’t need comment. Does it need further comment?

    To be clear, I agree it is a desirable thing to patch and the examples in the discussion here clarify that. I’m just saying that this should be clearly described in the pull request description and the commit message.

  23. midnightmagic commented at 8:32 pm on March 5, 2026: contributor
    Sure, I can flesh it out more if you like and make it more explicit.
  24. sedited commented at 1:13 pm on March 9, 2026: contributor

    Sure, I can flesh it out more if you like and make it more explicit.

    Yes, that’s all that’s required to make progress here.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-03-23 06:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me