Please provide a high-speed way to enumerate all blocks in their serialized form. The use case is blockchain data analysis. Deserialization can be performed with any number of libraries, so only raw blocks are required.
The blk* files store the block database. As far as I was able to find out, there is no official way to access these files. There are various code snippets on the web that read them, but they do not quite work. For example, in my own files I found strange data of the form <magic><length><magic>. So the data portion was missing and only the record header was there. Apparently, it cannot be assumed that the blocks are stored contiguously. According to the source code, this can happen if the software or the OS exits at just the wrong moment. This is totally OK, of course, as I was accessing undocumented internals. My solution was to search for the magic, then try reading and deserializing. This works, but it should not be necessary.
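The scan-for-magic workaround described above could look roughly like the following. This is a minimal sketch, not the code I actually ran: it assumes the usual record layout of 4 magic bytes, a 4-byte little-endian length, then the block, and it uses the mainnet magic `f9beb4d9`. The function name `iter_raw_blocks` is made up for illustration. Instead of fully deserializing each candidate, it only applies cheap plausibility checks (a block is at least as long as its 80-byte header, must fit in the file, and must not itself begin with the magic).

```python
import struct

MAGIC = bytes.fromhex("f9beb4d9")  # mainnet magic (assumption; other networks use other values)
HEADER_SIZE = 80                   # a serialized block starts with an 80-byte header


def iter_raw_blocks(data: bytes, magic: bytes = MAGIC):
    """Yield (data_offset, raw_block) for each plausible record in one blk file's bytes."""
    pos = 0
    while True:
        pos = data.find(magic, pos)
        if pos == -1 or pos + 8 > len(data):
            return  # no further magic, or no room left for a length field
        (length,) = struct.unpack_from("<I", data, pos + 4)
        start, end = pos + 8, pos + 8 + length
        payload = data[start:end]
        # Skip implausible records, including the header-only
        # <magic><length><magic> case where the data portion is missing.
        if length < HEADER_SIZE or end > len(data) or payload.startswith(magic):
            pos += 1  # resume the search just past this magic
            continue
        yield start, payload
        pos = end
```

A caller would still hand each `raw_block` to a deserializer and discard anything that fails to parse, since a plausibility check alone cannot rule out garbage that happens to follow a magic value.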
The getblock API can transmit a block as a hexadecimal string. According to my tests, this is really slow: between 10 and 100 times slower than reading the blk* files directly. getblock(verbose=true) becomes very slow for high block numbers, too (it looks like about 10 blocks per second).
There should be a fast way to obtain raw blocks. I can think of two approaches:
- Add an API that sends the disk positions of the blocks in the blk* files. That way the API only needs to send tiny amounts of data.
- Add an API to send blocks more efficiently. This precludes the use of JSON and hexadecimal strings. Also, the API would need to be able to send multiple blocks in one go. Otherwise it is too chatty.
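To illustrate why the first approach keeps the API traffic tiny, here is a sketch of what a client might do once it knows a block's file and position. Everything about the position format is an assumption: the hypothetical `read_block_at` helper takes the position to be the offset of the record's magic bytes, followed by a 4-byte little-endian length and then the block. If the API reported the offset of the block data instead, the 8-byte prefix would sit just before it.

```python
import struct

MAGIC = bytes.fromhex("f9beb4d9")  # mainnet magic (assumption)


def read_block_at(path, record_offset: int, magic: bytes = MAGIC) -> bytes:
    """Read one raw block from a blk file, given the offset of its record prefix."""
    with open(path, "rb") as f:
        f.seek(record_offset)
        prefix = f.read(8)
        if len(prefix) != 8 or prefix[:4] != magic:
            raise ValueError("no record prefix at the given offset")
        (length,) = struct.unpack("<I", prefix[4:])
        block = f.read(length)
    if len(block) != length:
        raise ValueError("record is truncated")
    return block
```

With this, the server only has to send a file name and an integer per block, and the bulk data never passes through JSON or hex encoding.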
There’s also the issue of orphan blocks. The API should have a mode in which it sends only blocks from the main chain. I’m working around that right now by calling getblock(verbose=true) 450k times. This is very slow.
I think the first approach (sending disk positions) would be far less work and would introduce fewer additional concepts (no new endpoint).
This is my concrete proposal:
- Make getblock(verbose=true) also send the file name and disk position.
- Add an API to obtain many block headers in one call. This solves the chattiness issue. The API should return all headers subject to the following filter options with defaults: int minBlockHeight = 0, int maxBlockHeight = maxint, bool allowNonMainChain = false, string[] blockHashes = null. These filters would be and-combined. If obtaining headers is costly it would be OK to just return the disk positions. Clients can then deserialize the headers themselves.
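The intended filter semantics can be sketched as follows. This is only an illustration of how the and-combined options would behave, not a proposed implementation: the `HeaderInfo` record and the `filter_headers` function are hypothetical, and `sys.maxsize` stands in for maxint.

```python
import sys
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class HeaderInfo:
    """Hypothetical per-block record; the real API could carry more fields."""
    block_hash: str
    height: int
    in_main_chain: bool


def filter_headers(
    headers: Iterable[HeaderInfo],
    min_block_height: int = 0,
    max_block_height: int = sys.maxsize,
    allow_non_main_chain: bool = False,
    block_hashes: Optional[Iterable[str]] = None,
) -> Iterator[HeaderInfo]:
    """Yield the headers that pass every filter (the filters are and-combined)."""
    wanted = set(block_hashes) if block_hashes is not None else None
    for h in headers:
        if not (min_block_height <= h.height <= max_block_height):
            continue
        if not allow_non_main_chain and not h.in_main_chain:
            continue
        if wanted is not None and h.block_hash not in wanted:
            continue
        yield h
```

With the defaults, a single call returns every main-chain header and nothing else, which is exactly the case that currently takes 450k getblock calls.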