[contrib] New linearize format #4762

pull jgarzik wants to merge 3 commits into bitcoin:master from jgarzik:2014_linearize changing 2 files +52 −26
  1. jgarzik commented at 5:57 PM on August 24, 2014: contributor

    The new linearize format outputs one month's worth of bitcoin transactions per file, as determined by the timestamp inside each block. Each file's last-modified time is set to the highest timestamp seen thus far in the processing:

    -rw-rw-r-- 1 jgarzik jgarzik    512129 Sep 30  2009 blk00008.dat
    -rw-rw-r-- 1 jgarzik jgarzik    522737 Oct 31  2009 blk00009.dat
    -rw-rw-r-- 1 jgarzik jgarzik    521531 Nov 30  2009 blk00010.dat
    -rw-rw-r-- 1 jgarzik jgarzik    992159 Dec 31  2009 blk00011.dat
    -rw-rw-r-- 1 jgarzik jgarzik   1190943 Jan 31  2010 blk00012.dat
    -rw-rw-r-- 1 jgarzik jgarzik   1580654 Feb 28  2010 blk00013.dat
    -rw-rw-r-- 1 jgarzik jgarzik   1435103 Mar 31  2010 blk00014.dat
    -rw-rw-r-- 1 jgarzik jgarzik   2646319 Apr 30  2010 blk00015.dat
    -rw-rw-r-- 1 jgarzik jgarzik   2040834 May 31  2010 blk00016.dat
    -rw-rw-r-- 1 jgarzik jgarzik   1911267 Jun 30  2010 blk00017.dat
    -rw-rw-r-- 1 jgarzik jgarzik   7862024 Jul 31  2010 blk00018.dat
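
A minimal sketch of how a file's last-modified time could be set from block header times, as described above. It is not the code from this pull request: the helper names (`block_time`, `highest_timestamp`, `stamp_file`) are hypothetical, and it assumes each block starts with the standard 80-byte serialized header.

```python
import os
import struct


def block_time(header: bytes) -> int:
    """Return the time field of an 80-byte serialized block header.

    Header layout: version(4) | prev_hash(32) | merkle_root(32) |
    time(4) | bits(4) | nonce(4), all little-endian.
    """
    if len(header) < 80:
        raise ValueError("block header must be at least 80 bytes")
    return struct.unpack_from("<I", header, 68)[0]


def highest_timestamp(headers) -> int:
    """Return the highest header time among the given headers (0 if none)."""
    highest = 0
    for header in headers:
        highest = max(highest, block_time(header))
    return highest


def stamp_file(path: str, timestamp: int) -> None:
    """Set both the access and modification time of path to timestamp."""
    os.utime(path, (timestamp, timestamp))
```

After writing an output file, calling `stamp_file(path, highest_timestamp(headers))` would make `ls -l` show a date derived from the blocks rather than from when the script ran.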
    

    A check is added to verify that blocks remain in the proper order. Previously this was guaranteed by bitcoind anyway, so it is simply a sanity check. Who knows what weird input people might feed to this. Headers-first will change this, sometimes storing blocks out of order on disk. linearize will need a future update to buffer blocks.
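
One way such an ordering check could look is sketched below. This is an assumption for illustration, not the check added in this pull request: it verifies that each block header's previous-block hash equals the double-SHA256 hash of the header written before it. The class name `OrderChecker` is hypothetical.

```python
import hashlib


def header_hash(header: bytes) -> bytes:
    """Double-SHA256 of the 80-byte block header (raw byte order)."""
    return hashlib.sha256(hashlib.sha256(header[:80]).digest()).digest()


def prev_hash(header: bytes) -> bytes:
    """Previous-block hash stored in bytes 4..35 of the header."""
    return header[4:36]


class OrderChecker:
    """Rejects a block whose parent is not the block accepted just before it."""

    def __init__(self) -> None:
        self.last_hash = None  # hash of the most recently accepted header

    def accept(self, header: bytes) -> None:
        if len(header) < 80:
            raise ValueError("block header must be at least 80 bytes")
        # The first block is accepted unconditionally; later ones must chain.
        if self.last_hash is not None and prev_hash(header) != self.last_hash:
            raise ValueError("block out of order")
        self.last_hash = header_hash(header)
```

A checker like this only detects out-of-order input. Handling blocks that really are stored out of order, as anticipated for headers-first, would need buffering on top of it.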

    This new format is suitable for researchers performing lots of raw block processing, and bitcoind users importing via "-reindex". The reindex import method is superior to bootstrap.dat.

  2. contrib/linearize: Guarantee that output is generated in-order
    This was typically ensured implicitly by virtue of normal bitcoind
    operation.  Adding an explicit check provides a stronger guarantee, and
    it is cheap to add.
    75400a2a41
  3. contrib/linearize: split block files based on year-month, not just year 8f5a423344
  4. contrib/linearize: Add feature to set file's timestamp
    based on block header time.
    4bb61b4535
  5. BitcoinPullTester commented at 6:10 PM on August 24, 2014: none

    Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/p4762_4bb61b45359b98632c64fca2f3640d276ab75244/ for binaries and test log. This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/ Contact BlueMatt on freenode if something looks broken.

  6. TheBlueMatt commented at 2:15 AM on August 27, 2014: member

    Why keep chunks by date instead of by total size (as bitcoind does anyway)?

  7. jgarzik commented at 3:40 AM on August 27, 2014: contributor

    Date is more human-centric and useful to some queries. "ls -l" will therefore produce useful, date-based information. "ls -l" tells you which files contain data from >= 2014, etc. No need to scan the files to determine that.

  8. laanwj commented at 10:34 AM on August 27, 2014: member

    TheBlueMatt: also it makes it more straightforward to extend the range; just add a file per month, the other files will stay the same

  9. sipa commented at 7:49 PM on August 27, 2014: member

    Well, the reason for choosing size as split criterion is because of pruning: you prune entire files, so ideally they are approximately equal in size.

  10. jgarzik commented at 7:55 PM on August 27, 2014: contributor

    Let me slay this shed painting, of which I am also guilty, by noting that the script fully supports split-by-file-size as a criterion, via configuration file setting.

    The default is split-by-size. Default of "max_out_sz" is 1000L * 1000 * 1000 bytes.

    This commit adds a non-default split-by-date, enabled via config file.
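
The two split criteria described above can be sketched as a single decision function. This is an illustrative sketch, not the script's actual logic: `needs_new_file` and its parameters are hypothetical names, and it assumes the date split applies in addition to the size limit and compares year-month in UTC.

```python
import datetime

# Default maximum output file size mentioned in this thread.
MAX_OUT_SZ = 1000 * 1000 * 1000


def year_month(timestamp: int) -> tuple:
    """Return (year, month) in UTC for a Unix timestamp."""
    d = datetime.datetime.fromtimestamp(timestamp, datetime.timezone.utc)
    return (d.year, d.month)


def needs_new_file(cur_size: int, next_len: int, cur_month, block_ts: int,
                   max_out_sz: int = MAX_OUT_SZ,
                   split_timestamp: bool = False) -> bool:
    """Decide whether the next block should start a new output file.

    cur_size:   bytes already written to the current file
    next_len:   bytes the next block would add
    cur_month:  (year, month) of the current file, or None if no block yet
    block_ts:   header time of the next block
    """
    # Size criterion: always active.
    if cur_size + next_len > max_out_sz:
        return True
    # Date criterion: only when enabled, and only once a month is established.
    if split_timestamp and cur_month is not None:
        return year_month(block_ts) != cur_month
    return False
```

With `split_timestamp` left at `False`, only the size limit triggers a new file, which matches the default behaviour described in the comment above.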

    It is my fault for not being more explicit about the capabilities of the code, and how they were changing.

  11. sipa commented at 7:58 PM on August 27, 2014: member

    @jgarzik thanks for clearing that up!

    It is my fault for not actually ever having read the code...

  12. in contrib/linearize/README.md:None in 4bb61b4535
      @@ -27,6 +27,7 @@ output.
       Optional config file setting for linearize-data:
       * "netmagic": network magic number
       * "max_out_sz": maximum output file size (default 1000*1000*1000)
      -* "split_year": Split files when a new year is first seen, in addition to
      +* "split_timestamp": Split files when a new year is first seen, in addition to
    


    TheBlueMatt commented at 4:22 AM on August 28, 2014:

    I think you meant month here (and maybe not _timestamp, but _month?)


    jgarzik commented at 1:35 PM on August 28, 2014:

    The config file variable name is correct, but the English description is not. Will fix.

  13. TheBlueMatt commented at 4:25 AM on August 28, 2014: member

    ut ACK pending last comment.

  14. laanwj added the label Improvement on Aug 28, 2014
  15. laanwj commented at 1:06 PM on September 4, 2014: member

    Let's get that last comment fixed and merge this

  16. laanwj referenced this in commit d800dcc32a on Sep 4, 2014
  17. laanwj commented at 1:21 PM on September 4, 2014: member

    Merged via d800dcc, replacing the "year" with "month" in README.md

  18. laanwj closed this on Sep 4, 2014

  19. MarcoFalke locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-20 00:15 UTC
