Provide a way to access raw blocks at high speed #8614

GSPP opened this issue on August 27, 2016
  1. GSPP commented at 6:40 pm on August 27, 2016: none

    Please provide a high speed way to enumerate all blocks in their serialized form. The use case is blockchain data analysis. Deserialization can be performed using any number of libraries. Only raw blocks are required.

    The blk* files store the blocks database. As far as I was able to find out there is no official way to access these files. There are various code snippets on the web accessing them. They do not appear to quite work. For example, in my own files I found strange data of the form <magic><length><magic>. So the data portion was missing. There was just the header. Apparently, it cannot be assumed that the blocks come contiguously. According to the source code this could happen if the software or OS exits at the right time. This is totally OK as I was accessing undocumented internals, of course. My solution was to search for the magic, then try reading and deserializing. This works but that should not be necessary.
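The "search for the magic, then try reading" workaround described above could look roughly like the sketch below. It assumes the commonly observed `<magic><length><block>` record layout with the mainnet magic `f9beb4d9` and a 4-byte little-endian length; that layout is an undocumented internal and may change. The function name `iter_raw_blocks` is hypothetical, and the plausibility checks are only a heuristic stand-in for real deserialization.

```python
import struct

MAGIC = bytes.fromhex("f9beb4d9")  # mainnet network magic (assumed layout)
HEADER_SIZE = 80                   # a serialized block header is 80 bytes


def iter_raw_blocks(data: bytes):
    """Yield (offset, raw_block) for each plausible <magic><length><block> record."""
    pos = 0
    while True:
        pos = data.find(MAGIC, pos)
        if pos < 0 or pos + 8 > len(data):
            return  # no further record prefix fits in the buffer
        (length,) = struct.unpack_from("<I", data, pos + 4)
        start, end = pos + 8, pos + 8 + length
        if length < HEADER_SIZE or end > len(data):
            pos += 1  # implausible length: resume the search one byte later
            continue
        if data[start:start + 4] == MAGIC:
            pos = start  # prefix without data (<magic><length><magic>): skip it
            continue
        yield start, data[start:end]
        pos = end
```

A caller would still deserialize each candidate and discard anything that fails to parse, since a byte sequence equal to the magic can also occur inside block data.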

    The getblock API can transmit a block as a hexadecimal string. This is really slow according to my tests. Between 10 and 100 times slower. getblock(verbose=true) becomes very slow for high block numbers, too (looks like 10 per second).

    There should be a fast way to obtain raw blocks. I can think of two approaches:

    1. Add an API that sends the disk positions of the blocks in the blk* files. That way the API only needs to send tiny amounts of data.
    2. Add an API to send blocks more efficiently. This precludes the use of JSON and hexadecimal strings. Also, the API would need to be able to send multiple blocks in one go. Otherwise it is too chatty.

    There’s also the issue of orphan blocks. The API should have a mode that makes it send only blocks from the main chain. I’m solving that right now by calling getblock(verbose=true) 450k times. This is very slow.
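One possible way to restrict a scan to the main chain without one verbose call per block is to ask for hashes by height, since the `getblockhash` RPC returns the best-chain hash at a given height. The sketch below only builds a JSON-RPC batch body; the helper name `getblockhash_batch` is hypothetical, and sending the body (URL, credentials, chunk size) is left out.

```python
import json


def getblockhash_batch(first_height: int, last_height: int) -> str:
    """Build a JSON-RPC batch body requesting the hash at each height in range."""
    requests = [
        {"jsonrpc": "1.0", "id": height, "method": "getblockhash", "params": [height]}
        for height in range(first_height, last_height + 1)
    ]
    return json.dumps(requests)
```

Using the height as the request `id` makes it easy to match each response to its height, because batch responses are not guaranteed to arrive in request order.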

    I think (1) would be far less work and would add fewer new concepts (no new endpoint).

    This is my concrete proposal:

    1. Make getblock(verbose=true) also send the file name and disk position.
    2. Add an API to obtain many block headers in one call. This solves the chattiness issue. The API should return all headers subject to the following filter options with defaults: int minBlockHeight = 0, int maxBlockHeight = maxint, bool allowNonMainChain = false, string[] blockHashes = null. These filters would be and-combined. If obtaining headers is costly it would be OK to just return the disk positions. Clients can then deserialize the headers themselves.
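The and-combined filter semantics proposed in item 2 can be illustrated as follows. This is only a sketch of the proposal, not an existing API: `HeaderEntry` and `filter_headers` are hypothetical names, and the parameters mirror the filter options listed above.

```python
import sys
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Set


@dataclass(frozen=True)
class HeaderEntry:
    block_hash: str
    height: int
    in_main_chain: bool


def filter_headers(
    entries: Iterable[HeaderEntry],
    min_block_height: int = 0,
    max_block_height: int = sys.maxsize,
    allow_non_main_chain: bool = False,
    block_hashes: Optional[Set[str]] = None,
) -> Iterator[HeaderEntry]:
    """Yield entries that pass every filter (the filters are and-combined)."""
    for entry in entries:
        if not (min_block_height <= entry.height <= max_block_height):
            continue
        if not allow_non_main_chain and not entry.in_main_chain:
            continue
        if block_hashes is not None and entry.block_hash not in block_hashes:
            continue
        yield entry
```

With the defaults, the call returns every main-chain header and nothing else.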
  2. sipa commented at 7:07 pm on August 27, 2016: member
    The REST API allows you to fetch raw blocks in binary by hash, I think.
  3. jgarzik commented at 8:14 pm on August 27, 2016: contributor
    Correct - the REST API already provides this. Enable it with -rest.
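A minimal sketch of fetching one raw block over REST, assuming a node started with `-rest` and reachable on localhost at the default mainnet port 8332. The helper names are hypothetical; the `/rest/block/<hash>.bin` path is the binary block endpoint.

```python
from urllib.request import urlopen


def rest_block_url(block_hash: str, host: str = "127.0.0.1", port: int = 8332) -> str:
    """Return the REST URL for a block in raw binary form (host and port are assumptions)."""
    return f"http://{host}:{port}/rest/block/{block_hash}.bin"


def fetch_raw_block(block_hash: str, host: str = "127.0.0.1", port: int = 8332) -> bytes:
    """Download the serialized block; requires a running node with -rest enabled."""
    with urlopen(rest_block_url(block_hash, host, port)) as response:
        return response.read()
```

The returned bytes are the serialized block, so any block deserialization library can consume them directly.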
  4. GSPP commented at 8:39 pm on August 27, 2016: none
    I will check this out and report back.
  5. GSPP commented at 7:04 pm on August 28, 2016: none

    I tested the block API. For testing purposes I compared the two:

    1. Block API: A full scan takes roughly 2 hours including deserialization in the client.
    2. blk* files: A full scan takes 45 minutes including deserialization in the client.

    I called the block API on 8 concurrent threads. The blocks were sitting on an SSD. Bitcoin Core ran inside of a VM.
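The 8-thread setup might be reproduced with something like the sketch below. The `fetch` callable is a placeholder for whatever performs the actual REST request, and `fetch_blocks_concurrently` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator


def fetch_blocks_concurrently(
    block_hashes: Iterable[str],
    fetch: Callable[[str], bytes],
    workers: int = 8,
) -> Iterator[bytes]:
    """Fetch blocks on `workers` threads, yielding results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map submits every task up front and preserves input order.
        yield from pool.map(fetch, block_hashes)
```

Because all tasks are submitted eagerly, a full-chain scan would probably want to feed the hashes in chunks to bound memory use.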

    The REST API is helpful and might be enough for some but if you really want high speed access there’s no substitute for the blk* files right now.

    What do you think about my API proposal of exposing the block positions through the JSON API?

  6. sipa commented at 7:09 pm on August 28, 2016: member
    I don’t like committing to having a stable on-disk format. Future versions may use a different serialization, or include indexes/pointers interleaved with data.
  7. GSPP commented at 7:13 pm on August 28, 2016: none
    That would even be OK as long as there are subranges in the file storing whole blocks. I understand the concern, though.
  8. jonasschnelli added the label RPC/REST/ZMQ on Aug 29, 2016
  9. jonasschnelli added the label Block storage on Aug 29, 2016
  10. jonasschnelli commented at 6:43 am on August 29, 2016: contributor
    @GSPP: could you time-profile the REST API and try to optimize it? I think a factor of ~2 in performance seems reasonable for higher-level, file-format-independent access to block data.
  11. GSPP commented at 7:12 am on August 29, 2016: none
    That’s not a bad way to proceed I think. I’m working on benchmarking obtaining the block hashes. For some reason I sometimes see perf problems and sometimes not. Need to investigate more.
  12. laanwj commented at 12:41 pm on February 7, 2017: member

    The on-disk format is not an interface *, so this is not going to be part of mainline bitcoind. The REST interface with binary format will have to do - feel free to optimize that if it’s still a bottleneck (for example: we may be doing an extraneous deserialize/serialize step).

    * If you really want to cheat this, an option is to use contrib/linearize.py’s strategy: this tool gets the hashes from bitcoind in a JSONRPC batch operation, then parses the block files itself, hashing the block headers to find out what blocks are where. As fast block access is mostly important when you want to process all (or a large subset of) blocks anyway, this seems a feasible way to do it. Must say it again though: this is not supported, and this may break at any time in future updates. (It has already broken once, when switching to out-of-order block download.)
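The header-hashing step of that strategy can be sketched as follows. This is not the code from contrib/linearize.py, only an illustration of the idea: a block's hash is the double SHA-256 of its 80-byte header, displayed in reversed byte order. The function names and the `(offset, raw_block)` input shape are assumptions.

```python
import hashlib
from typing import Iterable, List, Tuple


def block_hash_from_header(header: bytes) -> str:
    """Return the block hash as displayed hex: double SHA-256, byte-reversed."""
    if len(header) != 80:
        raise ValueError("a block header must be exactly 80 bytes")
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return digest[::-1].hex()


def offsets_in_chain_order(
    found: Iterable[Tuple[int, bytes]],
    main_chain_hashes: Iterable[str],
) -> List[int]:
    """Map scanned (offset, raw_block) pairs onto the main-chain hash order."""
    by_hash = {block_hash_from_header(raw[:80]): offset for offset, raw in found}
    # Hashes not present in the scanned files are skipped.
    return [by_hash[h] for h in main_chain_hashes if h in by_hash]
```

Blocks found on disk whose hash is not in the main-chain list (for example orphans) simply never appear in the result.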

  13. laanwj closed this on Feb 7, 2017

  14. hkalodner referenced this in commit b1f517013d on May 5, 2018
  15. MarcoFalke locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-22 00:12 UTC
