net: Serve blocks directly from disk when possible #13151

pull laanwj wants to merge 1 commits into bitcoin:master from laanwj:2018_05_direct_from_disk changing 3 files +107 −47
  1. laanwj commented at 12:17 pm on May 2, 2018: member

    In ProcessGetBlockData, send the block data directly from disk if type MSG_WITNESS_BLOCK is requested. This is a valid shortcut as the on-disk format matches the network format.

    This is expected to increase performance because a deserialization and subsequent serialization roundtrip is avoided.

  2. fanquake added the label P2P on May 2, 2018
  3. fanquake added the label Validation on May 2, 2018
  4. laanwj force-pushed on May 2, 2018
  5. MarcoFalke commented at 12:34 pm on May 2, 2018: member
    Could you please add a benchmark to ./src/bench/checkblock.cpp, so it is easier to see how much this improves?
  6. laanwj commented at 12:38 pm on May 2, 2018: member

    Sure, though I’m not sure how to do that; none of the benches actually uses ReadBlockFromDisk, I would have to set up a fake block index or such.

    (I also don’t think it will work on block413567 as-is because it has no magic/size header, and is not a file on disk, though it’s easy enough to write a temporary file of course). [When doing this from memory there’s effectively nothing to benchmark, too.]

  7. in src/validation.cpp:1150 in 4c790dff74 outdated
    1145+                HexStr(messageStart, messageStart + CMessageHeader::MESSAGE_START_SIZE));
    1146+    }
    1147+
    1148+    try {
    1149+        block.resize(nSize);
    1150+        filein.read((char*)block.data(), nSize);
    


    promag commented at 2:00 pm on May 2, 2018:
    Just to throw out the idea, mmap wouldn’t pay off right?

    laanwj commented at 2:07 pm on May 2, 2018:
    I don’t think that’s a win here, as the entire block is read consecutively - could even be slower as it’d have to create and destroy the mapping. Also it’s not portable.
  8. promag commented at 2:03 pm on May 2, 2018: member

    Concept ACK.

    Couldn’t we serve corrupted blocks?

  9. in src/net_processing.cpp:1146 in 4c790dff74 outdated
    1142@@ -1142,60 +1143,71 @@ void static ProcessGetBlockData(CNode* pfrom, const Consensus::Params& consensus
    1143         std::shared_ptr<const CBlock> pblock;
    1144         if (a_recent_block && a_recent_block->GetHash() == pindex->GetBlockHash()) {
    1145             pblock = a_recent_block;
    1146+        } else if (inv.type == MSG_WITNESS_BLOCK) {
    


    MarcoFalke commented at 2:23 pm on May 2, 2018:
    Shouldn’t this compare against the serialization flags of the block on disk? Currently you are assuming that all blocks are serialized as witness blocks on disk, but this is not true for all “early” blocks.

    laanwj commented at 2:32 pm on May 2, 2018:

    Yes I’m not convinced this logic is correct. It seems to work, though, even for the initial blocks.

    Edit: What is the operation to convert from a non-witness block to witness block with no witnesses? I suppose this could still be done without a full round-trip?


    sipa commented at 4:12 pm on May 2, 2018:

    @laanwj It’s always correct to give the raw blocks we store to peers that ask for witnesses (even if the block does not have a witness).

    Converting extended format to basic format is a lot more complicated. You could have a special CTransaction which skips the witness fields instead of reading/deserializing them, but I don’t see how to do it without going through some form of serialization code.


    laanwj commented at 4:22 pm on May 2, 2018:

    @sipa Thanks.

    Converting extended format to basic format is a lot more complicated. You could have a special CTransaction which skips the witness fields instead of reading/deserializing them, but I don’t see how to do it without going through some form of serialization code.

    Right, that case should fall back to deserialization->serialization right now. I don’t think we can do much better there. Not sure it’s even worth optimizing, there won’t be many new clients being synced with pre-segwit versions.


    MarcoFalke commented at 8:24 pm on May 2, 2018:
    Leaving out the less common edge case (non-witness peers) seems fine for now.

    sipa commented at 7:39 pm on May 13, 2018:
    Unsure if we care, but we could also check whether SegWit was active for the requested block, and if not, we can serve without deserialization even when witnesses are not requested.

    laanwj commented at 2:58 pm on May 14, 2018:
    @sipa Yes, that would be something that could be done here.
  10. MarcoFalke commented at 2:24 pm on May 2, 2018: member
    Concept ACK. Would be nice to see how much the additional savings are on top of #13098.
  11. laanwj commented at 2:28 pm on May 2, 2018: member

    Concept ACK. Would be nice to see how much the additional savings are on top of #13098.

    At least it’s a lot simpler.

  12. laanwj commented at 4:25 pm on May 2, 2018: member

    Couldn’t we serve corrupted blocks?

    Yes, that’s a possibility, though only if the underlying storage is corrupted. I’ve posited the idea to add a CRC32C to the on-disk blocks at some point (which is quick to verify, especially with specialized instructions, and should protect against accidental corruptions), but that’s quite an invasive change. It’s something that could be done later.

    The only option to verify with the current information would be to do a Merkle tree check - which could be done without deserialization but it’s not pretty… (and a serious overhead SHA256-hashing)

  13. MarcoFalke commented at 4:46 pm on May 2, 2018: member

    The only option to verify with the current information would be to do a Merkle tree check

    I don’t see that we currently do this, so it wouldn’t make anything worse here.

  14. gmaxwell commented at 5:53 pm on May 2, 2018: contributor

    I think serving a corrupted block if our state is corrupted is fine, the peer will just disconnect us and go get the block from someone else, seems pretty harmless!

    This is a much smaller change than I was expecting– in particular I forgot there was a size, light review ACK.

  15. in src/validation.cpp:1149 in 4c790dff74 outdated
    1144+                HexStr(msg_start_in, msg_start_in + CMessageHeader::MESSAGE_START_SIZE),
    1145+                HexStr(messageStart, messageStart + CMessageHeader::MESSAGE_START_SIZE));
    1146+    }
    1147+
    1148+    try {
    1149+        block.resize(nSize);
    


    TheBlueMatt commented at 7:49 pm on May 2, 2018:
    Probably want to check the size is sane before we do this.

    laanwj commented at 5:42 am on May 3, 2018:
    Good point. What constant would be appropriate here? Edit: I’ll go with MAX_SIZE from serialize.h.

    laanwj commented at 5:47 am on May 3, 2018:
    Another thing I wondered here: what is the C++11 proper way to allocate a vector (or a RAII memory area) without zeroing it? I think that’s unnecessary here. That was a bad idea: even though we handle errors while reading, as this data is sent directly over P2P, zeroing is defense-in-depth against heartbleed-style issues here
  16. TheBlueMatt commented at 7:49 pm on May 2, 2018: member
    utACK except for the below:
  17. in src/net_processing.cpp:1149 in 4c790dff74 outdated
    1142@@ -1142,60 +1143,71 @@ void static ProcessGetBlockData(CNode* pfrom, const Consensus::Params& consensus
    1143         std::shared_ptr<const CBlock> pblock;
    1144         if (a_recent_block && a_recent_block->GetHash() == pindex->GetBlockHash()) {
    1145             pblock = a_recent_block;
    1146+        } else if (inv.type == MSG_WITNESS_BLOCK) {
    1147+            // Fast-path: in this case it is possible to serve the block directly from disk,
    1148+            // as the network format matches the format on disk
    1149+            LogPrintf("debug: Serving raw block directly from disk: %s\n", pindex->ToString());
    


    MarcoFalke commented at 8:20 pm on May 2, 2018:
    Should probably remove this debug logging

    laanwj commented at 5:41 am on May 3, 2018:
    Yes, definitely. I added it while WIP so that people testing this can be sure that the code actually triggers and they’re testing the right thing.
  18. MarcoFalke commented at 8:25 pm on May 2, 2018: member
    Will measure some round trips tomorrow.
  19. jonasschnelli commented at 6:44 am on May 3, 2018: contributor

    utACK 4c790dff7481d1464a906ad6b17a3179a7da3431

    This would probably also speedup an external indexing daemon via p2p (see experiment in https://github.com/jonasschnelli/bitcoincore-indexd [very WIP])

    Here a flamegraph of serving the first 200k blocks via p2p localhost (though the real deser/ser starts probably at higher up in the chain).

  20. in src/validation.cpp:1140 in 4c790dff74 outdated
    1135+        return error("%s: OpenBlockFile failed for %s", __func__, pos.ToString());
    1136+    }
    1137+
    1138+    CMessageHeader::MessageStartChars msg_start_in;
    1139+    unsigned int nSize;
    1140+    filein >> msg_start_in >> nSize;
    


    laanwj commented at 8:25 am on May 3, 2018:
    I guess this deserialization logic should be within the try {}.
  21. laanwj renamed this:
    WIP: net: Serve blocks directly from disk when possible
    net: Serve blocks directly from disk when possible
    on May 3, 2018
  22. laanwj commented at 8:49 am on May 3, 2018: member

    OK, thanks for review everyone, removed WIP tag and pushed commits with the following changes:

    • Remove extraneous debug log message
    • Check nSize against MAX_SIZE
    • Move deserialization of msg_start_in, and size into exception try
    • Improved variable and parameter naming

    Will squash if no further issues.

    This would probably also speedup an external indexing daemon via p2p (see experiment in https://github.com/jonasschnelli/bitcoincore-indexd [very WIP])

    Yeah - I saw in #13098 that @MarcoFalke had also optimized the zmq full-block notification part. I’m not sure that is any critical path performance-wise (maybe for your indexer?) but if so I’ll leave that for a later PR.

    Here a flamegraph of serving the first 200k blocks via p2p localhost (though the real deser/ser starts probably at higher up in the chain).

    Did you forget to post the link?

  23. in src/net_processing.cpp:1151 in e0223ebf0c outdated
    1142@@ -1142,60 +1143,70 @@ void static ProcessGetBlockData(CNode* pfrom, const Consensus::Params& consensus
    1143         std::shared_ptr<const CBlock> pblock;
    1144         if (a_recent_block && a_recent_block->GetHash() == pindex->GetBlockHash()) {
    1145             pblock = a_recent_block;
    1146+        } else if (inv.type == MSG_WITNESS_BLOCK) {
    1147+            // Fast-path: in this case it is possible to serve the block directly from disk,
    1148+            // as the network format matches the format on disk
    1149+            std::vector<uint8_t> block_data;
    1150+            if (!ReadRawBlockFromDisk(block_data, pindex, chainparams.MessageStart()))
    1151+                assert(!"cannot load block from disk");
    


    sipa commented at 11:25 pm on May 3, 2018:
    Nit: braces around then-branch if on a separate line.
  24. sipa commented at 11:26 pm on May 3, 2018: member
    utACK after squash.
  25. MarcoFalke commented at 0:33 am on May 4, 2018: member

    Running with e0223ebf0c58f7beedea91df48e9586154cd4436 and just looking at the wall clock time for reading+optional deserialization shows for me on an ssd:

    ssd

    Edit: Note that this was done with full fake blocks and not real blocks from the network.

  26. laanwj commented at 3:53 pm on May 4, 2018: member

    Thanks for benchmarking @MarcoFalke.

    I used a patched version of @jonasschnelli’s bitcoincore-indexd to benchmark the time for fetching block 0..473600 through P2P, with no processing client-side. The result is:

    0With patch:
    1real    63m51.273s
    2
    3Without patch:
    4real    70m28.956s
    

    10% speedup. And in my case is the blocks are on a slow harddisk. I expect the gains to be more significant in case of a faster storage medium, or slower CPU.

  27. in src/validation.cpp:1156 in 9893e712e9 outdated
    1151+                    blk_size, MAX_SIZE);
    1152+        }
    1153+
    1154+        block.resize(blk_size); // Zeroing of memory is intentional here
    1155+        filein.read((char*)block.data(), blk_size);
    1156+    } catch(const std::exception& e) {
    


    promag commented at 1:39 pm on May 7, 2018:
    nit, space after catch } catch (....
  28. in src/validation.cpp:1154 in 9893e712e9 outdated
    1149+        if (blk_size > MAX_SIZE) {
    1150+            return error("%s: Block data is larger than maximum deserialization size for %s: %s versus %s", __func__, pos.ToString(),
    1151+                    blk_size, MAX_SIZE);
    1152+        }
    1153+
    1154+        block.resize(blk_size); // Zeroing of memory is intentional here
    


    promag commented at 1:43 pm on May 7, 2018:

    Zeroing of memory is intentional

    Why?


    laanwj commented at 2:01 pm on May 7, 2018:
    To avoid heartbleed-type leaks as this data goes directly over the network.
  29. jonasschnelli commented at 3:10 pm on May 7, 2018: contributor

    Did 10 rounds of requesting blocks in range 490'000 up to 500'000 on master and got. Setup:

    • Non VM machine
    • SSD 1400MB/s
    • Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz
    • txindex was enabled
    • connect=0 –whitebind=127.0.0.1:8333
    • no other resource intense applications where running on that system
    • used a modified version of https://github.com/laanwj/bitcoincore-indexd

    Master (-g -O2):

    95211ms in avg. (all rounds where very similar and there was no need to exclude the first round) exemption

    This PR (-g O2):

    101051ms in avg.

    I can’t figure out why this PR performs ~6% slower and I double checked the comparison by manually rolling back from the head of this PR to 598db389c33e5e90783ef1223df2eeab095ed622 and back to head. Also added the LogPrintfs back to ensure I’m using the “fast track” (removed again during benchmarking).

    sidenode: found out that -debug=net (which was disabled during the rounds reported above) account for a 3% slowdown for the above test-scenario.

  30. laanwj commented at 5:35 am on May 9, 2018: member
    @jonasschnelli That’s really strange. As reported, I did see some actual speed-ups where using this. Maybe someone else can try some measurements, or we should just close, I don’t know.
  31. jonasschnelli commented at 8:47 am on May 9, 2018: contributor

    If someone wants to compare master against this PR built in the same environment:

    PR: https://bitcoin.jonasschnelli.ch/build/600 master: https://bitcoin.jonasschnelli.ch/build/599

  32. MarcoFalke commented at 6:12 pm on May 9, 2018: member

    I updated my benchmark to also include the time it takes to Make (serialize) the net message:

    net message serialization times

  33. MarcoFalke commented at 6:15 pm on May 9, 2018: member
    I think we should definitively look into why it is slower to sync, since that indicates a problem (potentially in our code) exists elsewhere.
  34. laanwj commented at 5:45 pm on May 13, 2018: member

    I did the same experiment as @jonasschnelli, a modified bitcoincore-indexd that requests block 490000..500000 (https://github.com/laanwj/bitcoincore-indexd/tree/bench). Tried both cases 5 times;

     0with patch:
     1real    0m55.928s
     2real    0m55.986s
     3real    0m55.913s
     4real    0m55.844s
     5real    0m55.790s
     6
     7without patch (using the commit before):
     8real    2m47.673s
     9real    2m46.329s
    10real    2m46.634s
    11real    2m46.413s
    12real    2m46.458s
    

    ~66% speedup. This is with a spinning rust disk not a SSD. This was done on a different computer than my previous test.

  35. jonasschnelli commented at 3:19 pm on May 14, 2018: contributor

    I think this is a clear benefit for spinning disk and probably also for ssd in non absurd localhost cases.

    utACK

  36. MarcoFalke commented at 3:27 pm on May 14, 2018: member
    @jonasschnelli Did you have a chance to look into why your result was unexpected?
  37. jonasschnelli commented at 3:40 pm on May 14, 2018: contributor
    @Marcofalke: no. I haven’t but I’m willing to do as soon as someone could confirm my results (SSD test).
  38. MarcoFalke commented at 4:04 pm on May 14, 2018: member

    Sure, will do

    On Mon, May 14, 2018, 11:41 Jonas Schnelli notifications@github.com wrote:

    @MarcoFalke https://github.com/MarcoFalke: no. I haven’t but I’m willing to do as soon as someone could confirm my results (SSD test).

    — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/bitcoin/bitcoin/pull/13151#issuecomment-388863028, or mute the thread https://github.com/notifications/unsubscribe-auth/AGGmv9-r1eVW0U-ZpIGMl9OABHCU7k27ks5tyaWYgaJpZM4TvWjI .

  39. MarcoFalke commented at 9:04 pm on May 14, 2018: member

    For clarity: 598db means master@598db and 9893e means this pull request. I used @laanwj’s branch of bitcoincore-indexd.

    10k block fetch times 490k-500k

  40. MarcoFalke commented at 9:06 pm on May 14, 2018: member
    @jonasschnelli I coulnd’t find the branch you were using. Mind to share, otherwise I can’t reproduce?
  41. laanwj commented at 6:05 am on May 15, 2018: member

    Same experiment as #13151 (comment) on i.MX6Q ARM board w/ USB2 spinning disk:

     0with patch:
     1real    10m18.368s
     2real    11m14.600s
     3real    10m12.006s
     4real    10m21.668s
     5real    10m11.070s
     6
     7without patch (using the commit before):
     8real    27m30.574s
     9real    26m27.591s
    10real    25m38.311s
    11real    25m36.661s
    12real    25m40.902s
    

    Seems to help even with a slow CPU and slow I/O.

  42. net: Serve blocks directly from disk when possible
    In `ProcessGetBlockData`, send the block data directly from disk if
    type MSG_WITNESS_BLOCK is requested. This is a valid shortcut as the
    on-disk format matches the network format.
    
    This is expected to increase performance because a deserialization and
    subsequent serialization roundtrip is avoided.
    0bf431870e
  43. laanwj force-pushed on May 15, 2018
  44. laanwj commented at 6:13 am on May 15, 2018: member
    squashed, no other changes – 9893e712e9e04e8b9478e36e0b5d843899540bd20bf431870e45d8e20c4671e51a782ebf97b75fac
  45. jonasschnelli commented at 6:23 pm on May 15, 2018: contributor
    I guess my setup was either faulty or there is a performance loss with that particular setup (>1000MB/s IO r&w on very fast CPUs). However, this PR is a clear and significant win!
  46. MarcoFalke commented at 6:57 pm on May 15, 2018: member
    @jonasschnelli I can’t explain why, but you might want to try to drop the files cached in memory with sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'. For me this was speeding up the sync on an ssd.
  47. laanwj merged this on May 23, 2018
  48. laanwj closed this on May 23, 2018

  49. laanwj referenced this in commit 7f4db9a7c3 on May 23, 2018
  50. PastaPastaPasta referenced this in commit b6dee626e9 on Apr 12, 2020
  51. PastaPastaPasta referenced this in commit 9e172995d1 on Apr 12, 2020
  52. PastaPastaPasta referenced this in commit 837fa178b3 on Apr 16, 2020
  53. PastaPastaPasta referenced this in commit b50217e2b8 on Apr 18, 2020
  54. PastaPastaPasta referenced this in commit 27206a9f60 on Apr 18, 2020
  55. UdjinM6 referenced this in commit 3d175aa2e5 on Apr 19, 2020
  56. ckti referenced this in commit a1aee7db6b on Mar 28, 2021
  57. MarcoFalke locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-22 03:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me