-masterdatadir for datadir bootstrapping #13746

pull kallewoof wants to merge 9 commits into bitcoin:master from kallewoof:masterdatadir changing 8 files +327 −23
  1. kallewoof commented at 9:55 am on July 23, 2018: member

    This PR adds the ability to bootstrap another node from an existing datadir.

    To fire up a new node with access to an existing node’s datadir y, you would simply do

    0$ bitcoind -datadir=x -masterdatadir=y
    

    (presumably permanently in bitcoin.conf) and it would use the blocks files from the former up until it needed to change something, at which point it would copy. (See accompanying test.)

    It currently only supports blocks/blk* caching, which is implemented as copy-on-write. Due to limitations in LevelDB, the chainstate and blocks/index dirs are copied to -datadir on first launch. Latter is only ~75 MB, but former is ~4 GB, so this is sad but better than nothing.

    Note: the feature_masterdatadir.py test makes ~150 MB of block data. May want to combine this with the pruning tests to not redo the work of mining, but it is in its own test for now. Also may wanna move it to extended, but keeping it in base while this PR is ongoing.

  2. kallewoof force-pushed on Jul 23, 2018
  3. kallewoof force-pushed on Jul 23, 2018
  4. laanwj added the label Feature on Jul 23, 2018
  5. laanwj commented at 10:22 am on July 23, 2018: member
    Concept ACK!
  6. in src/validation.cpp:228 in 3ef0f47d47 outdated
    225@@ -226,6 +226,7 @@ std::atomic_bool fImporting(false);
    226 std::atomic_bool fReindex(false);
    227 bool fHavePruned = false;
    228 bool fPruneMode = false;
    229+bool has_master_datadir = false;
    


    promag commented at 10:40 am on July 23, 2018:
    Use prefix g_, g_has_master_datadir.
  7. DrahtBot added the label Needs rebase on Jul 24, 2018
  8. kallewoof force-pushed on Jul 25, 2018
  9. DrahtBot removed the label Needs rebase on Jul 25, 2018
  10. in src/validation.cpp:3806 in 797c160dbf outdated
    3801+            // need writing; copy to own datadir
    3802+            file = fsbridge::fopen(path, "wb+");
    3803+            uint8_t buffer[16384];
    3804+            while (!feof(mfile)) {
    3805+                size_t r = fread(buffer, 1, 16384, mfile);
    3806+                if (r) fwrite(buffer, 1, r, file);
    


    promag commented at 2:00 pm on July 25, 2018:
    Should handle failure instead of ignoring? Could lead to corrupt blocks?
  11. in test/functional/feature_masterdatadir.py:2 in ed4e3909cd outdated
    0@@ -0,0 +1,88 @@
    1+#!/usr/bin/env python3
    2+# Copyright (c) 2014-2017 The Bitcoin Core developers
    


    promag commented at 2:00 pm on July 25, 2018:
    2018
  12. promag commented at 2:02 pm on July 25, 2018: member
    What if there is a node (A) running in the master data dir of node (B) and A is writing a block at the same time B is reading?
  13. DrahtBot commented at 3:00 pm on July 25, 2018: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #15118 (Refactor block file logic by jimpo)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  14. kallewoof force-pushed on Jul 26, 2018
  15. kallewoof commented at 2:24 am on July 26, 2018: member

    @promag

    What if there is a node (A) running in the master data dir of node (B) and A is writing a block at the same time B is reading?

    I’m not sure what the chances are that this results in actual corruption. The master is safe always, but the slave may end up with a copy that has incomplete data. That depends on whether the master flushes the block file atomically or not, though. I would expect it makes an effort to always keep files in a clean state, in case of crashes and such.

    That said, the block files are checked on startup and I expect a corruption would be detected almost immediately, since it is almost guaranteed to be the newest block file.

    I’ll experiment a bit with the ‘open existing file as readonly’ by leaving out the last 1 byte and seeing how it behaves.

  16. kallewoof commented at 2:41 am on July 26, 2018: member

    Experiment: skip last byte for case “!readonly & copy existing”:

     0diff --git a/src/validation.cpp b/src/validation.cpp
     1index b5646f107..7af6956b3 100644
     2--- a/src/validation.cpp
     3+++ b/src/validation.cpp
     4@@ -3804,6 +3804,11 @@ static FILE* OpenDiskFile(const CDiskBlockPos &pos, const char *prefix, bool fRe
     5             while (!feof(mfile)) {
     6                 size_t r = fread(buffer, 1, 16384, mfile);
     7                 if (r) {
     8+                    // TEST: CORRUPTION
     9+                    if (r > 0 && r < 16384) {
    10+                        LogPrintf("*** Corrupting file %s by writing %d/%d bytes from master\n", path.string(), r-1, r);
    11+                        r--;
    12+                    }
    13                     if (r != fwrite(buffer, 1, r, file)) {
    14                         fclose(mfile);
    15                         fclose(file);
    

    Result: slave node crashes with a ‘corrupt block database’ error:

     02018-07-26T02:35:14Z Loaded best chain: hashBestChain=00000000000000000029ac6548add211c4f3d20d20da3d5858c3eda1fb84fece height=529951 date=2018-06-30T22:01:25Z progress=0.977126
     12018-07-26T02:35:14Z init message: Rewinding blocks...
     22018-07-26T02:35:16Z *** Corrupting file /Users/karljohan-alm/workspace/bitcoin-kw/src/slavedata/blocks/blk01301.dat by writing 5744/5745 bytes from master
     32018-07-26T02:35:17Z init message: Verifying blocks...
     42018-07-26T02:35:17Z Verifying last 6 blocks at level 3
     52018-07-26T02:35:17Z [0%]...ERROR: ReadBlockFromDisk: Deserialize or I/O error - CAutoFile::read: end of file: unspecified iostream_category error at CBlockDiskPos(nFile=1301, nPos=97518532)
     62018-07-26T02:35:17Z ERROR: VerifyDB(): *** ReadBlockFromDisk failed at 529951, hash=00000000000000000029ac6548add211c4f3d20d20da3d5858c3eda1fb84fece
     72018-07-26T02:35:17Z : Corrupted block database detected.
     8Please restart with -reindex or -reindex-chainstate to recover.
     9: Corrupted block database detected.
    10Please restart with -reindex or -reindex-chainstate to recover.
    112018-07-26T02:35:17Z Aborted block database rebuild. Exiting.
    122018-07-26T02:35:17Z Shutdown: In progress...
    132018-07-26T02:35:17Z scheduler thread interrupt
    142018-07-26T02:35:17Z Shutdown: done
    

    Disabling corruption patch and restarting results in another crash. Only fix was to -reindex, which initiated a full redownload of blocks (master was pruned).

    I think it would be interesting to patch the ‘corrupt!’ logic with a fallback to grab master file again before failing. Actually, I think I’m going to do that.

    Edit: I did some testing, and the above idea to copy again does not work, because the only time it copies is when it is about to write something, and it does not check the block file consistency at this time. I.e. it gets a corrupt file, writes to it, closes it, then later opens it, finds that it’s corrupted, recopies it, but now it has lost the data it wrote before.

  17. in src/init.cpp:1262 in 79af9869d6 outdated
    1259@@ -1256,6 +1260,27 @@ bool AppInitMain()
    1260             gArgs.GetArg("-datadir", ""), fs::current_path().string());
    1261     }
    1262 
    1263+    if (g_has_master_datadir) {
    1264+        // LDB does not support readonly access, so we need to copy the entire
    


    jonasschnelli commented at 3:54 pm on July 30, 2018:
    nit: s/LDB/levelDB
  18. in src/init.cpp:1285 in 79af9869d6 outdated
    1259@@ -1256,6 +1260,27 @@ bool AppInitMain()
    1260             gArgs.GetArg("-datadir", ""), fs::current_path().string());
    1261     }
    1262 
    1263+    if (g_has_master_datadir) {
    1264+        // LDB does not support readonly access, so we need to copy the entire
    1265+        // chainstate and blocks/index dirs over to the new location, if not found already
    1266+        fs::path path = GetDataDir() / "chainstate";
    1267+        if (!fs::exists(path)) {
    


    jonasschnelli commented at 3:57 pm on July 30, 2018:
    Doesn’t this copy action require a protection that the masterdatadir is currently not in use/write? If a Bitcoin Core instance is using this (master) datadir it may result in a corrupted copy?

    kallewoof commented at 5:06 pm on July 30, 2018:

    Yes. Master is guaranteed to be safe as no write operation is ever done from the slave, but the opposite is not the case. See #13746 (comment) for experiment on corruption. There’s no good way to recover, as the slave will have already written data and closed the file before realizing it was truncated.

    The question remains how frequent this kind of race condition is, though.

  19. jonasschnelli commented at 3:57 pm on July 30, 2018: contributor
    Concept ACK
  20. kallewoof force-pushed on Jul 30, 2018
  21. kallewoof commented at 6:23 am on July 31, 2018: member

    Actually, I realized that it should be possible to make a perfectly safe slave instance by checking the master directory for the last block file in existence, and marking that minus 1 as the last available master block file.

    I.e. if master has blk000 ~ blk500, slave will use files blk000 ~ blk499 from master, copy blk500 over the first time it starts up, and then make its own files from there.

    Important to note is that it will not, for example, try to use master’s blk501, in the case where slave is shutdown and master keeps chugging along to the point where it makes a new block file before the slave, something this PR does not actually address.

    Still means master should preferably be stopped during initial start-up of slave, but it can run safely without slave risking corruption after slave has finished first startup.

  22. jonasschnelli commented at 11:12 am on July 31, 2018: contributor
    I’m more worried about hot copying the levelDB part (chainstate/). Imo to be safe, the master must close the leveldb environment.
  23. sipa commented at 10:40 pm on July 31, 2018: member

    You can’t do this while an instance is running on the master (unless you can somehow force it to temporarily close its database environment and pause operations). If you don’t need it to work while the master is running you should grab the lock on the master to avoid it.

    Alternatively you can just not copy databases at all, and require a reindex to setup a second datadir.

  24. kallewoof force-pushed on Aug 1, 2018
  25. kallewoof force-pushed on Aug 1, 2018
  26. kallewoof force-pushed on Aug 1, 2018
  27. kallewoof force-pushed on Aug 1, 2018
  28. kallewoof commented at 9:46 am on August 1, 2018: member

    @sipa I am now locking master temporarily while doing the copying, and init-error-shutdown if it is already locked.

    I am also now tracking the last block file that the slave relies on in master, so it won’t accidentally start reading from master block files if master runs ahead in file count (e.g. slave writes blk123, master then creates blk124 and slave starts trying to read from it). @jonasschnelli

    I’m more worried about hot copying the levelDB part (chainstate/). Imo to be safe, the master must close the leveldb environment.

    This should be addressed now that the master is locked before copy. @promag

    What if there is a node (A) running in the master data dir of node (B) and A is writing a block at the same time B is reading?

    The slave will now copy the very last block file over on initial start. It also holds a lock on the master dir for the duration of these operations, so your concerns have been addressed, I think.

  29. kallewoof force-pushed on Aug 1, 2018
  30. kallewoof force-pushed on Aug 1, 2018
  31. kallewoof force-pushed on Aug 1, 2018
  32. jonasschnelli commented at 7:04 am on August 2, 2018: contributor
    @kallewoof Does the lock works against masters not running this change (0.17 as example)?
  33. kallewoof commented at 7:14 am on August 2, 2018: member

    Does the lock works against masters not running this change (0.17 as example)?

    Yes. The slave locks the master datadir before copying, and after the initial copying is done, it no longer relies on master playing nice. Since the datadir locking is done in older clients as well, this should (1) cause the slave to fail if the master is running when it’s started up for the first time, and (2) cause the master to fail to startup if the slave is in the process of copying.

  34. DrahtBot added the label Needs rebase on Aug 29, 2018
  35. kallewoof force-pushed on Sep 10, 2018
  36. kallewoof force-pushed on Sep 10, 2018
  37. kallewoof force-pushed on Sep 10, 2018
  38. DrahtBot removed the label Needs rebase on Sep 10, 2018
  39. DrahtBot added the label Needs rebase on Sep 10, 2018
  40. kallewoof force-pushed on Sep 11, 2018
  41. DrahtBot removed the label Needs rebase on Sep 11, 2018
  42. DrahtBot added the label Needs rebase on Sep 16, 2018
  43. kallewoof force-pushed on Sep 18, 2018
  44. DrahtBot removed the label Needs rebase on Sep 18, 2018
  45. in test/functional/feature_masterdatadir.py:50 in 0d8b04d60a outdated
    45+        assert_equal(150, self.nodes[0].getblockcount())
    46+        # We want more than one block file, so make a few large blocks
    47+        for i in range(150):
    48+            mine_large_block(self.nodes[0])
    49+            if i % 50 == 0:
    50+                self.log.info("... " + str(i) + " / 150")
    


    practicalswift commented at 9:01 pm on October 2, 2018:

    Use placeholders in the log message like this:

    0self.log.info("... %s / 150", i)
    

    It is the preferred form since it allows for lazy evaluation.

    Applies to all logging calls in this PR :-)


    kallewoof commented at 0:44 am on October 6, 2018:
    Fixed!
  46. kallewoof force-pushed on Oct 6, 2018
  47. DrahtBot added the label Needs rebase on Oct 8, 2018
  48. kallewoof force-pushed on Oct 8, 2018
  49. DrahtBot removed the label Needs rebase on Oct 8, 2018
  50. DrahtBot added the label Needs rebase on Nov 5, 2018
  51. kallewoof force-pushed on Nov 6, 2018
  52. DrahtBot removed the label Needs rebase on Nov 6, 2018
  53. DrahtBot added the label Needs rebase on Jan 16, 2019
  54. kallewoof force-pushed on Jan 17, 2019
  55. DrahtBot removed the label Needs rebase on Jan 17, 2019
  56. DrahtBot added the label Needs rebase on Jan 21, 2019
  57. kallewoof force-pushed on Jan 22, 2019
  58. DrahtBot removed the label Needs rebase on Jan 22, 2019
  59. kallewoof force-pushed on Jan 22, 2019
  60. kallewoof force-pushed on Jan 22, 2019
  61. kallewoof force-pushed on Jan 22, 2019
  62. kallewoof force-pushed on Jan 22, 2019
  63. kallewoof force-pushed on Jan 22, 2019
  64. kallewoof force-pushed on Jan 23, 2019
  65. kallewoof force-pushed on Jan 23, 2019
  66. kallewoof force-pushed on Jan 23, 2019
  67. DrahtBot added the label Needs rebase on Feb 4, 2019
  68. kallewoof force-pushed on Feb 7, 2019
  69. DrahtBot removed the label Needs rebase on Feb 7, 2019
  70. DrahtBot added the label Needs rebase on Feb 21, 2019
  71. util: add CopyDirectory helper b1334d42f3
  72. util: add master datadir path support f3f7c49de5
  73. util: add ObtainDirectoryLock() for temporary dir locks 96a4cb2fa8
  74. validation: add g_has_master_datadir and g_master_endblock 3f7f08447b
  75. init: add -masterdatadir arg d02569a89d
  76. init: copy chainstate and blocks/index to datadir if -masterdatadir is used
    Also find and track the last blk/rev files that we rely on from master, as we do not want to load anything beyond that, in case master runs ahead of us at some point.
    d73d9773a3
  77. validation: use or copy master blocks file when local datadir file does not exist ebb0b3b3e7
  78. test: basic -masterdatadir tests b28076d899
  79. linter: add exception to locale dependent function linter d88e6c326d
  80. kallewoof force-pushed on Feb 22, 2019
  81. DrahtBot removed the label Needs rebase on Feb 22, 2019
  82. DrahtBot commented at 10:38 pm on March 2, 2019: member
  83. DrahtBot added the label Needs rebase on Mar 2, 2019
  84. practicalswift commented at 9:32 am on March 3, 2019: contributor
    Concept ACK
  85. kallewoof closed this on Apr 3, 2019

  86. laanwj removed the label Needs rebase on Oct 24, 2019
  87. kallewoof deleted the branch on Jan 4, 2020
  88. DrahtBot locked this on Feb 15, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-18 18:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me