net: Serve blocks directly from disk when possible #13151

pull laanwj wants to merge 1 commit into bitcoin:master from laanwj:2018_05_direct_from_disk changing 3 files +107 −47
  1. laanwj commented at 12:17 pm on May 2, 2018: member

    In ProcessGetBlockData, send the block data directly from disk if type MSG_WITNESS_BLOCK is requested. This is a valid shortcut as the on-disk format matches the network format.

    This is expected to increase performance because a deserialization and subsequent serialization roundtrip is avoided.
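
    For context, a minimal sketch of the idea (not the actual patch; the function added by this PR is ReadRawBlockFromDisk in validation.cpp, which uses CAutoFile and proper error reporting): each record in the blk?????.dat files is a 4-byte network magic, a 4-byte little-endian length, and then the block serialized exactly as it goes over the wire, so the payload can be copied straight into a "block" message.

        #include <cstdint>
        #include <cstdio>
        #include <cstring>
        #include <vector>

        // Sketch: read one raw block record; the bytes after the 8-byte record
        // header are already in network format (witness serialization).
        bool ReadRawBlockSketch(std::vector<uint8_t>& block, FILE* file,
                                const unsigned char expected_start[4])
        {
            unsigned char start[4];
            uint32_t size = 0;
            if (fread(start, 1, 4, file) != 4) return false;            // network magic
            if (memcmp(start, expected_start, 4) != 0) return false;    // wrong chain or corrupt record
            if (fread(&size, 1, 4, file) != 4) return false;            // record length (assumes little-endian host)
            block.resize(size);                                         // std::vector zero-initialises the buffer
            return fread(block.data(), 1, size, file) == size;          // raw block bytes, ready to send
        }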

  2. fanquake added the label P2P on May 2, 2018
  3. fanquake added the label Validation on May 2, 2018
  4. laanwj force-pushed on May 2, 2018
  5. MarcoFalke commented at 12:34 pm on May 2, 2018: member
    Could you please add a benchmark to ./src/bench/checkblock.cpp, so it is easier to see how much this improves?
  6. laanwj commented at 12:38 pm on May 2, 2018: member

    Sure, though I’m not sure how to do that; none of the benches actually uses ReadBlockFromDisk, so I would have to set up a fake block index or some such.

    (I also don’t think it will work on block413567 as-is, because it has no magic/size header and is not a file on disk, though it’s easy enough to write a temporary file, of course.) [When doing this purely from memory there’s effectively nothing to benchmark, either.]
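
    To make concrete what such a bench would measure, here is a rough standalone timing sketch (not the src/bench/ framework; it assumes the serialized block has already been written to a temporary file at a hypothetical path of known size): time a plain read of the raw bytes against a read followed by CBlock deserialization, which is the work the fast path avoids.

        #include <chrono>
        #include <cstdio>
        #include <vector>

        #include <primitives/block.h>
        #include <streams.h>
        #include <version.h>

        static std::vector<char> ReadFile(const char* path, size_t size)
        {
            std::vector<char> buf(size);
            if (FILE* f = fopen(path, "rb")) {
                if (fread(buf.data(), 1, size, f) != size) buf.clear();
                fclose(f);
            }
            return buf;
        }

        // What the proposed fast path does: just read the bytes.
        static double TimeRawRead(const char* path, size_t size)
        {
            const auto t0 = std::chrono::steady_clock::now();
            std::vector<char> buf = ReadFile(path, size);
            return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
        }

        // What the current path does on top of that: deserialize into a CBlock
        // (the re-serialization for the wire is not even counted here).
        static double TimeReadAndDeserialize(const char* path, size_t size)
        {
            const auto t0 = std::chrono::steady_clock::now();
            std::vector<char> buf = ReadFile(path, size);
            CDataStream stream(buf.data(), buf.data() + buf.size(), SER_NETWORK, PROTOCOL_VERSION);
            CBlock block;
            stream >> block;
            return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
        }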

  7. in src/validation.cpp:1150 in 4c790dff74 outdated
    1145+                HexStr(messageStart, messageStart + CMessageHeader::MESSAGE_START_SIZE));
    1146+    }
    1147+
    1148+    try {
    1149+        block.resize(nSize);
    1150+        filein.read((char*)block.data(), nSize);
    


    promag commented at 2:00 pm on May 2, 2018:
    Just to throw out the idea, mmap wouldn’t pay off right?

    laanwj commented at 2:07 pm on May 2, 2018:
    I don’t think that’s a win here, as the entire block is read consecutively - could even be slower as it’d have to create and destroy the mapping. Also it’s not portable.
  8. promag commented at 2:03 pm on May 2, 2018: member

    Concept ACK.

    Couldn’t we serve corrupted blocks?

  9. in src/net_processing.cpp:1146 in 4c790dff74 outdated
    1142@@ -1142,60 +1143,71 @@ void static ProcessGetBlockData(CNode* pfrom, const Consensus::Params& consensus
    1143         std::shared_ptr<const CBlock> pblock;
    1144         if (a_recent_block && a_recent_block->GetHash() == pindex->GetBlockHash()) {
    1145             pblock = a_recent_block;
    1146+        } else if (inv.type == MSG_WITNESS_BLOCK) {
    


    MarcoFalke commented at 2:23 pm on May 2, 2018:
    Shouldn’t this compare against the serialization flags of the block on disk? Currently you are assuming that all blocks are serialized as witness blocks on disk, but this is not true for all “early” blocks.

    laanwj commented at 2:32 pm on May 2, 2018:

    Yes, I’m not convinced this logic is correct. It seems to work, though, even for the initial blocks.

    Edit: What is the operation to convert from a non-witness block to a witness block with no witnesses? I suppose this could still be done without a full round-trip?


    sipa commented at 4:12 pm on May 2, 2018:

    @laanwj It’s always correct to give the raw blocks we store to peers that ask for witnesses (even if the block does not have a witness).

    Converting extended format to basic format is a lot more complicated. You could have a special CTransaction which skips the witness fields instead of reading/deserializing them, but I don’t see how to do it without going through some form of serialization code.


    laanwj commented at 4:22 pm on May 2, 2018:

    @sipa Thanks.

    Converting extended format to basic format is a lot more complicated. You could have a special CTransaction which skips the witness fields instead of reading/deserializing them, but I don’t see how to do it without going through some form of serialization code.

    Right, that case should fall back to deserialization->serialization right now. I don’t think we can do much better there. Not sure it’s even worth optimizing, there won’t be many new clients being synced with pre-segwit versions.


    MarcoFalke commented at 8:24 pm on May 2, 2018:
    Leaving out the less common edge case (non-witness peers) seems fine for now.

    sipa commented at 7:39 pm on May 13, 2018:
    Unsure if we care, but we could also check whether SegWit was active for the requested block, and if not, we can serve without deserialization even when witnesses are not requested.

    laanwj commented at 2:58 pm on May 14, 2018:
    @sipa Yes, that would be something that could be done here.
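
    Summarizing the thread above as code, a simplified sketch of the dispatch inside ProcessGetBlockData (not the exact diff; names follow the snippets quoted in this review, and pushing the raw bytes onto the wire is elided):

        if (a_recent_block && a_recent_block->GetHash() == pindex->GetBlockHash()) {
            pblock = a_recent_block;                 // most recently validated block is cached in memory
        } else if (inv.type == MSG_WITNESS_BLOCK) {
            // Fast path: the on-disk bytes are exactly the witness (network) serialization,
            // so they can be sent as-is, even for pre-segwit blocks (per sipa above).
            std::vector<uint8_t> block_data;
            if (!ReadRawBlockFromDisk(block_data, pindex, chainparams.MessageStart())) {
                assert(!"cannot load block from disk");
            }
            // ... push block_data as the payload of a "block" message, then return ...
        } else {
            // Slow path (MSG_BLOCK from a non-witness peer): deserialize from disk,
            // then re-serialize with the witness data stripped.
            std::shared_ptr<CBlock> pblockRead = std::make_shared<CBlock>();
            if (!ReadBlockFromDisk(*pblockRead, pindex, consensusParams)) {
                assert(!"cannot load block from disk");
            }
            pblock = pblockRead;
        }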
  10. MarcoFalke commented at 2:24 pm on May 2, 2018: member
    Concept ACK. Would be nice to see how much the additional savings are on top of #13098.
  11. laanwj commented at 2:28 pm on May 2, 2018: member

    Concept ACK. Would be nice to see how much the additional savings are on top of #13098.

    At least it’s a lot simpler.

  12. laanwj commented at 4:25 pm on May 2, 2018: member

    Couldn’t we serve corrupted blocks?

    Yes, that’s a possibility, though only if the underlying storage is corrupted. I’ve posited the idea of adding a CRC32C to the on-disk blocks at some point (which is quick to verify, especially with specialized instructions, and should protect against accidental corruption), but that’s quite an invasive change. It’s something that could be done later.

    The only option to verify with the current information would be to do a Merkle tree check, which could be done without deserialization, but it’s not pretty… (and SHA256-hashing everything is a serious overhead)
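
    For illustration of how cheap such a record checksum would be (a sketch only; storing and verifying a checksum is not part of this PR): a CRC32C over the raw block bytes, written bit-at-a-time for clarity. A real implementation would use a lookup table or the SSE4.2 CRC32 instructions.

        #include <cstddef>
        #include <cstdint>

        // Reflected CRC32C (Castagnoli polynomial 0x1EDC6F41, reflected form 0x82F63B78).
        uint32_t Crc32c(const unsigned char* data, size_t len)
        {
            uint32_t crc = 0xFFFFFFFFu;
            for (size_t i = 0; i < len; ++i) {
                crc ^= data[i];
                for (int bit = 0; bit < 8; ++bit) {
                    crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
                }
            }
            return ~crc;
        }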

  13. MarcoFalke commented at 4:46 pm on May 2, 2018: member

    The only option to verify with the current information would be to do a Merkle tree check

    I don’t see that we currently do this, so it wouldn’t make anything worse here.

  14. gmaxwell commented at 5:53 pm on May 2, 2018: contributor

    I think serving a corrupted block if our state is corrupted is fine; the peer will just disconnect us and go get the block from someone else. Seems pretty harmless!

    This is a much smaller change than I was expecting; in particular, I forgot there was a size field. Light review ACK.

  15. in src/validation.cpp:1149 in 4c790dff74 outdated
    1144+                HexStr(msg_start_in, msg_start_in + CMessageHeader::MESSAGE_START_SIZE),
    1145+                HexStr(messageStart, messageStart + CMessageHeader::MESSAGE_START_SIZE));
    1146+    }
    1147+
    1148+    try {
    1149+        block.resize(nSize);
    


    TheBlueMatt commented at 7:49 pm on May 2, 2018:
    Probably want to check the size is sane before we do this.

    laanwj commented at 5:42 am on May 3, 2018:
    Good point. What constant would be appropriate here? Edit: I’ll go with MAX_SIZE from serialize.h.

    laanwj commented at 5:47 am on May 3, 2018:
    Another thing I wondered here: what is the proper C++11 way to allocate a vector (or an RAII memory area) without zeroing it? I think that’s unnecessary here. Edit: that was a bad idea. Even though we handle errors while reading, this data is sent directly over P2P, so zeroing is defense-in-depth against heartbleed-style issues here.
  16. TheBlueMatt commented at 7:49 pm on May 2, 2018: member
    utACK except for the below:
  17. in src/net_processing.cpp:1149 in 4c790dff74 outdated
    1142@@ -1142,60 +1143,71 @@ void static ProcessGetBlockData(CNode* pfrom, const Consensus::Params& consensus
    1143         std::shared_ptr<const CBlock> pblock;
    1144         if (a_recent_block && a_recent_block->GetHash() == pindex->GetBlockHash()) {
    1145             pblock = a_recent_block;
    1146+        } else if (inv.type == MSG_WITNESS_BLOCK) {
    1147+            // Fast-path: in this case it is possible to serve the block directly from disk,
    1148+            // as the network format matches the format on disk
    1149+            LogPrintf("debug: Serving raw block directly from disk: %s\n", pindex->ToString());
    


    MarcoFalke commented at 8:20 pm on May 2, 2018:
    Should probably remove this debug logging

    laanwj commented at 5:41 am on May 3, 2018:
    Yes, definitely. I added it while WIP so that people testing this can be sure that the code actually triggers and they’re testing the right thing.
  18. MarcoFalke commented at 8:25 pm on May 2, 2018: member
    Will measure some round trips tomorrow.
  19. jonasschnelli commented at 6:44 am on May 3, 2018: contributor

    utACK 4c790dff7481d1464a906ad6b17a3179a7da3431

    This would probably also speed up an external indexing daemon via p2p (see experiment in https://github.com/jonasschnelli/bitcoincore-indexd [very WIP])

    Here’s a flamegraph of serving the first 200k blocks via p2p on localhost (though the real deser/ser probably starts higher up in the chain).

  20. in src/validation.cpp:1140 in 4c790dff74 outdated
    1135+        return error("%s: OpenBlockFile failed for %s", __func__, pos.ToString());
    1136+    }
    1137+
    1138+    CMessageHeader::MessageStartChars msg_start_in;
    1139+    unsigned int nSize;
    1140+    filein >> msg_start_in >> nSize;
    


    laanwj commented at 8:25 am on May 3, 2018:
    I guess this deserialization logic should be within the try {}.
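
    A sketch of the shape being suggested (a fragment of ReadRawBlockFromDisk as quoted above, not the exact final code): keep the header reads inside the try, so a stream exception from a truncated or corrupt file becomes an error() return instead of escaping.

        try {
            CMessageHeader::MessageStartChars blk_start;
            unsigned int blk_size;
            filein >> blk_start >> blk_size;                     // record header: magic + length

            // ... verify blk_start against the expected message start,
            //     and reject blk_size values above MAX_SIZE ...

            block.resize(blk_size);                              // zero-initialised on purpose
            filein.read((char*)block.data(), blk_size);          // raw network-format block
        } catch (const std::exception& e) {
            return error("%s: read failed for %s: %s", __func__, pos.ToString(), e.what());
        }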
  21. laanwj renamed this:
    WIP: net: Serve blocks directly from disk when possible
    net: Serve blocks directly from disk when possible
    on May 3, 2018
  22. laanwj commented at 8:49 am on May 3, 2018: member

    OK, thanks for review everyone, removed WIP tag and pushed commits with the following changes:

    • Remove extraneous debug log message
    • Check nSize against MAX_SIZE
    • Move deserialization of msg_start_in and size into the exception try block
    • Improved variable and parameter naming

    Will squash if no further issues.

    This would probably also speed up an external indexing daemon via p2p (see experiment in https://github.com/jonasschnelli/bitcoincore-indexd [very WIP])

    Yeah, I saw in #13098 that @MarcoFalke had also optimized the zmq full-block notification part. I’m not sure that is on any critical path performance-wise (maybe for your indexer?), but if so I’ll leave that for a later PR.

    Here’s a flamegraph of serving the first 200k blocks via p2p on localhost (though the real deser/ser probably starts higher up in the chain).

    Did you forget to post the link?

  23. in src/net_processing.cpp:1151 in e0223ebf0c outdated
    1142@@ -1142,60 +1143,70 @@ void static ProcessGetBlockData(CNode* pfrom, const Consensus::Params& consensus
    1143         std::shared_ptr<const CBlock> pblock;
    1144         if (a_recent_block && a_recent_block->GetHash() == pindex->GetBlockHash()) {
    1145             pblock = a_recent_block;
    1146+        } else if (inv.type == MSG_WITNESS_BLOCK) {
    1147+            // Fast-path: in this case it is possible to serve the block directly from disk,
    1148+            // as the network format matches the format on disk
    1149+            std::vector<uint8_t> block_data;
    1150+            if (!ReadRawBlockFromDisk(block_data, pindex, chainparams.MessageStart()))
    1151+                assert(!"cannot load block from disk");
    


    sipa commented at 11:25 pm on May 3, 2018:
    Nit: braces around then-branch if on a separate line.
  24. sipa commented at 11:26 pm on May 3, 2018: member
    utACK after squash.
  25. MarcoFalke commented at 0:33 am on May 4, 2018: member

    Running with e0223ebf0c58f7beedea91df48e9586154cd4436 and just looking at the wall-clock time for reading + optional deserialization gives, for me on an SSD:

    [chart: wall-clock read + deserialization times on an SSD]

    Edit: Note that this was done with full fake blocks and not real blocks from the network.

  26. laanwj commented at 3:53 pm on May 4, 2018: member

    Thanks for benchmarking @MarcoFalke.

    I used a patched version of @jonasschnelli’s bitcoincore-indexd to benchmark the time for fetching blocks 0..473600 through P2P, with no processing client-side. The result is:

    With patch:
    real    63m51.273s

    Without patch:
    real    70m28.956s
    

    A ~10% speedup. And in my case the blocks are on a slow hard disk; I expect the gains to be more significant with a faster storage medium or a slower CPU.

  27. in src/validation.cpp:1156 in 9893e712e9 outdated
    1151+                    blk_size, MAX_SIZE);
    1152+        }
    1153+
    1154+        block.resize(blk_size); // Zeroing of memory is intentional here
    1155+        filein.read((char*)block.data(), blk_size);
    1156+    } catch(const std::exception& e) {
    


    promag commented at 1:39 pm on May 7, 2018:
    nit, space after catch } catch (....
  28. in src/validation.cpp:1154 in 9893e712e9 outdated
    1149+        if (blk_size > MAX_SIZE) {
    1150+            return error("%s: Block data is larger than maximum deserialization size for %s: %s versus %s", __func__, pos.ToString(),
    1151+                    blk_size, MAX_SIZE);
    1152+        }
    1153+
    1154+        block.resize(blk_size); // Zeroing of memory is intentional here
    


    promag commented at 1:43 pm on May 7, 2018:

    Zeroing of memory is intentional

    Why?


    laanwj commented at 2:01 pm on May 7, 2018:
    To avoid heartbleed-type leaks as this data goes directly over the network.
  29. jonasschnelli commented at 3:10 pm on May 7, 2018: contributor

    Did 10 rounds of requesting blocks in the range 490'000 to 500'000 on both master and this PR, and got the results below. Setup:

    • Non VM machine
    • SSD 1400MB/s
    • Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz
    • txindex was enabled
    • connect=0 -whitebind=127.0.0.1:8333
    • no other resource-intensive applications were running on that system
    • used a modified version of https://github.com/laanwj/bitcoincore-indexd

    Master (-g -O2):

    95211ms on average (all rounds were very similar and there was no need to exclude the first round)

    This PR (-g -O2):

    101051ms on average.

    I can’t figure out why this PR performs ~6% slower. I double-checked the comparison by manually rolling back from the head of this PR to 598db389c33e5e90783ef1223df2eeab095ed622 and back to head, and also added the LogPrintfs back to ensure I’m using the “fast path” (removed again during benchmarking).

    Side note: I found that -debug=net (which was disabled during the rounds reported above) accounts for a ~3% slowdown in the above test scenario.

  30. laanwj commented at 5:35 am on May 9, 2018: member
    @jonasschnelli That’s really strange. As reported, I did see some actual speed-ups when using this. Maybe someone else can try some measurements, or we should just close this, I don’t know.
  31. jonasschnelli commented at 8:47 am on May 9, 2018: contributor

    If someone wants to compare master against this PR built in the same environment:

    PR: https://bitcoin.jonasschnelli.ch/build/600 master: https://bitcoin.jonasschnelli.ch/build/599

  32. MarcoFalke commented at 6:12 pm on May 9, 2018: member

    I updated my benchmark to also include the time it takes to Make (serialize) the net message:

    [chart: net message serialization times]

  33. MarcoFalke commented at 6:15 pm on May 9, 2018: member
    I think we should definitely look into why it is slower to sync, since that indicates a problem (potentially in our code) exists elsewhere.
  34. laanwj commented at 5:45 pm on May 13, 2018: member

    I did the same experiment as @jonasschnelli, with a modified bitcoincore-indexd that requests blocks 490000..500000 (https://github.com/laanwj/bitcoincore-indexd/tree/bench). Tried both cases 5 times:

     with patch:
     real    0m55.928s
     real    0m55.986s
     real    0m55.913s
     real    0m55.844s
     real    0m55.790s

     without patch (using the commit before):
     real    2m47.673s
     real    2m46.329s
     real    2m46.634s
     real    2m46.413s
     real    2m46.458s
    

    A ~66% speedup. This is with a spinning-rust disk, not an SSD. This was done on a different computer than my previous test.

  35. jonasschnelli commented at 3:19 pm on May 14, 2018: contributor

    I think this is a clear benefit for spinning disks, and probably also for SSDs in non-absurd localhost cases.

    utACK

  36. MarcoFalke commented at 3:27 pm on May 14, 2018: member
    @jonasschnelli Did you have a chance to look into why your result was unexpected?
  37. jonasschnelli commented at 3:40 pm on May 14, 2018: contributor
    @MarcoFalke: no, I haven’t, but I’m willing to as soon as someone can confirm my results (SSD test).
  38. MarcoFalke commented at 4:04 pm on May 14, 2018: member

    Sure, will do

    On Mon, May 14, 2018, 11:41 Jonas Schnelli notifications@github.com wrote:

    @MarcoFalke https://github.com/MarcoFalke: no, I haven’t, but I’m willing to as soon as someone can confirm my results (SSD test).


  39. MarcoFalke commented at 9:04 pm on May 14, 2018: member

    For clarity: 598db means master@598db and 9893e means this pull request. I used @laanwj’s branch of bitcoincore-indexd.

    [chart: 10k block fetch times, blocks 490k-500k]

  40. MarcoFalke commented at 9:06 pm on May 14, 2018: member
    @jonasschnelli I couldn’t find the branch you were using. Mind sharing? Otherwise I can’t reproduce.
  41. laanwj commented at 6:05 am on May 15, 2018: member

    Same experiment as #13151 (comment), on an i.MX6Q ARM board with a USB2 spinning disk:

     with patch:
     real    10m18.368s
     real    11m14.600s
     real    10m12.006s
     real    10m21.668s
     real    10m11.070s

     without patch (using the commit before):
     real    27m30.574s
     real    26m27.591s
     real    25m38.311s
     real    25m36.661s
     real    25m40.902s
    

    Seems to help even with a slow CPU and slow I/O.

  42. net: Serve blocks directly from disk when possible
    In `ProcessGetBlockData`, send the block data directly from disk if
    type MSG_WITNESS_BLOCK is requested. This is a valid shortcut as the
    on-disk format matches the network format.
    
    This is expected to increase performance because a deserialization and
    subsequent serialization roundtrip is avoided.
    0bf431870e
  43. laanwj force-pushed on May 15, 2018
  44. laanwj commented at 6:13 am on May 15, 2018: member
    squashed, no other changes: 9893e712e9e04e8b9478e36e0b5d843899540bd2 -> 0bf431870e45d8e20c4671e51a782ebf97b75fac
  45. jonasschnelli commented at 6:23 pm on May 15, 2018: contributor
    I guess my setup was either faulty, or there is a performance loss with that particular setup (>1000MB/s I/O read/write on a very fast CPU). However, this PR is a clear and significant win!
  46. MarcoFalke commented at 6:57 pm on May 15, 2018: member
    @jonasschnelli I can’t explain why, but you might want to try dropping the files cached in memory with sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'. For me this sped up the sync on an SSD.
  47. laanwj merged this on May 23, 2018
  48. laanwj closed this on May 23, 2018

  49. laanwj referenced this in commit 7f4db9a7c3 on May 23, 2018
  50. PastaPastaPasta referenced this in commit b6dee626e9 on Apr 12, 2020
  51. PastaPastaPasta referenced this in commit 9e172995d1 on Apr 12, 2020
  52. PastaPastaPasta referenced this in commit 837fa178b3 on Apr 16, 2020
  53. PastaPastaPasta referenced this in commit b50217e2b8 on Apr 18, 2020
  54. PastaPastaPasta referenced this in commit 27206a9f60 on Apr 18, 2020
  55. UdjinM6 referenced this in commit 3d175aa2e5 on Apr 19, 2020
  56. ckti referenced this in commit a1aee7db6b on Mar 28, 2021
  57. MarcoFalke locked this on Sep 8, 2021
