Ultraprune: use a pruned-txout-set database for block validation #1677

pull sipa wants to merge 27 commits into bitcoin:master from sipa:ultraprune changing 170 files +29299 −1392
  1. sipa commented at 6:26 pm on August 16, 2012: member

    This is a rewrite of the block storage and validation engine.

    Instead of blkindex.dat (a database with block tree data, and all transactions and their spendings in the active chain), it uses chain.dat (only block tree data) and coins.dat (pruned txout set). These two databases together are significantly smaller than blkindex.dat (<200 MiB), and only coins.dat is actively needed during block validation, speeding it up significantly (15 minutes for importing 185000 blocks from a local disk file).

    Blocks are still stored in blk????.dat files, in the same file format, but smaller files (up to 128 MiB). To prevent excessive fragmentation, they are allocated in chunks of 16 MiB, and some statistics are kept about them. To assist with reorganisation, undo files are created (rev????.dat), which contain the data necessary to undo block connections.

    Block pruning itself is not yet implemented, but this makes it trivial to do so; all that is required is deleting old block and undo files when certain thresholds are reached. Also note that this block pruning mechanism is different from the transaction pruning mechanism described by Satoshi. This one does not prevent a node from acting as a full node.

    All commits result in a functional code tree, with succeeding unit tests. The first few add some extra classes, without changing actual semantics. “One file per block” and “Multiple blocks per file” form a refactor of the block storage mechanism, with related database changes. “Do not store hashNext on disk” only introduces a forward-incompatible change that simplifies the database layout. “Ultraprune” itself contains the switch from txindex.dat to coins.dat as validation data, and contains the majority of the changes. What follows are optimizations and some other improvements that do not affect compatibility.

    There are a few TODOs left (see comment below), but I’d like to give the code some exposure already.

  2. sipa commented at 6:29 pm on August 16, 2012: member

    (EDITED)

    List of implementation changes:

    • new database layout:
      • 2 leveldb’s (coins/ and blktree/ subdirs), replacing blkindex.dat
      • separate directory (blocks/) with block data (in the usual format, but smaller files) and undo data
    • database keys are of the form (char,key) instead of (string,key) for reasons of compactness
    • there is no txid-to-diskpos index anymore, only blkid-to-diskpos and txid-to-unspent-outputs
      • this makes getrawtransaction only work on unspent outputs (and slower)
        • an optional txid-to-diskpos index is planned
    • some new very specialized serializers are added (compact variable-length integer, compact amount, compact txout)
    • block index does not store hashBlockNext anymore - this is reconstructed from hashBestBlock at startup
    • at startup, automatically reorg to the best block in blktree/blocks
    • new RPCs: gettxoutsetinfo and gettxout operate on the coins database
    • no more CTxIndex in-scope: instead, a global pcoinsTip (representing the coin db) and pblocktree (representing the blktree db)
      • intended to be moved to separate modules/classes, properly encapsulated
    • blktree database contains statistics about the block file (size, which blocks in it, times, heights, undo stats, …)
    • blktree database contains flag per block that determines the degree of validation it had, to allow future headers-first mode
    • block files are pre-allocated (in batches of 16 MiB, the files grow to max 128 MiB), to reduce fragmentation
    • transaction hashes are cached, and typically never calculated more than once

    Included in the pullreq, but technically separate:

    • do -loadblock= and bootstrap.dat import in a separate thread
    • add check for strict DER encoding for signatures, and standard public keys
  3. Diapolo commented at 12:18 pm on August 21, 2012: none
    @sipa One question: our current AppendBlockFile() function takes MAX_SIZE into account and starts a new block file if the space left in the current one (relative to the maximum allowed file size) is < MAX_SIZE. So 128 MiB files would have a maximum of 96 MiB of usable data, right?
  4. sipa commented at 12:24 pm on August 21, 2012: member
    @Diapolo: not sure what you mean; I don’t use AppendBlockFile anymore.
  5. Diapolo commented at 12:27 pm on August 21, 2012: none
    @sipa I saw that and wanted to understand the change here, which condition is used to determine, if a new block-file needs to be created, where is the check in your new code for that and what’s the space limit?
  6. sipa commented at 12:29 pm on August 21, 2012: member
    The check is in FindBlockPos in main.cpp. And a new file is created if (old_used_size + new_block_size >= MAX_BLOCKFILE_SIZE).
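The condition sipa describes could be restated as follows (simplified, illustrative names; the actual check lives in FindBlockPos in main.cpp):

```cpp
#include <cassert>

// Illustrative restatement of the check in FindBlockPos: start a new
// blk????.dat file when appending the next block would push the current
// file past the 128 MiB limit.
static const unsigned int MAX_BLOCKFILE_SIZE = 128 * 1024 * 1024; // 128 MiB

bool NeedNewBlockFile(unsigned int nOldUsedSize, unsigned int nNewBlockSize) {
    return nOldUsedSize + nNewBlockSize >= MAX_BLOCKFILE_SIZE;
}
```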
  7. in src/main.cpp: in f51448802f outdated
    1892-    unsigned int nBlockPos = 0;
    1893-    if (!WriteToDisk(nFile, nBlockPos))
    1894+    CDiskBlockPos blockPos;
    1895+    {
    1896+        CChainDB chaindb;
    1897+        if (!FindBlockPos(chaindb, blockPos, nBlockSize+8, nHeight, nTime))
    


    Diapolo commented at 1:13 pm on August 21, 2012:
    Why nBlockSize+8? Is that padding?

    sipa commented at 1:27 pm on August 21, 2012:
    4 bytes magic, 4 bytes block length; that’s just the file format of blk*.dat.

    Diapolo commented at 1:34 pm on August 21, 2012:
    I’m lacking some background information here, sorry :). Is the format defined / described somewhere?

    sipa commented at 1:39 pm on August 21, 2012:
    No idea, but I wanted to retain compatibility between pre- and post-ultraprune block files, so I used the same format. That is: the files are a concatenation of {4 bytes magic, 4 bytes LE integer with the actual block size, block data itself}.
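In code, the record layout sipa describes might look like this (the magic value shown is the mainnet message-start bytes; the helper itself is illustrative, not code from the pull request):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the blk*.dat record format: each stored block is
// {4 bytes magic, 4-byte little-endian length, raw block data},
// which is where the nBlockSize+8 passed to FindBlockPos comes from.
static const unsigned char pchMessageStart[4] = {0xf9, 0xbe, 0xb4, 0xd9};

std::vector<unsigned char> MakeBlockRecord(const std::vector<unsigned char>& block) {
    std::vector<unsigned char> rec(pchMessageStart, pchMessageStart + 4);
    uint32_t nSize = block.size();
    for (int i = 0; i < 4; i++)
        rec.push_back((nSize >> (8 * i)) & 0xff);   // length, little-endian
    rec.insert(rec.end(), block.begin(), block.end());
    return rec;
}
```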

    Diapolo commented at 1:43 pm on August 21, 2012:

    I found this one and it explains what I was missing here: https://bitcointalk.org/index.php?topic=101514.0 thanks for your further explanation, too.

    Why keep things compatible here? Perhaps it’s the right time to optimize the internals of the block files (e.g. compression or something similar)?

  8. luke-jr commented at 4:03 am on August 24, 2012: member
    Does this break the ability to downgrade at all? (I expect it just means wasted “padding” space in the blk*.dat files?)
  9. sipa commented at 0:23 am on August 27, 2012: member
    Updated. Batch block connection now keeps a permanent cache, and modifies that (instead of delaying block connection until several blocks were available, which interfered with normal network-based downloading). Also added a commit that changes the block database format, in preparation of things like parallel signature checking and initial headers-only mode.
  10. Diapolo commented at 5:47 am on August 27, 2012: none
    @sipa With block database format you mean stored blocks in blk0000x.dat?
  11. sipa commented at 10:31 am on August 27, 2012: member
    @luke-jr how do you mean breaking the ability to downgrade? The blk000*.dat files remain exactly the same format, but the other databases are incompatible. @Diapolo No, it uses coins.dat (the unspent txout set) and chain.dat (the block index), in addition to the blk*.dat (and rev*.dat) files. It’s the format of chain.dat that changed in the last commit.
  12. luke-jr commented at 4:47 pm on August 27, 2012: member
    @sipa If it interacts with downgrades in ugly ways, I’d probably not want to put it into next-test.
  13. sipa commented at 5:07 pm on August 27, 2012: member

    @luke-jr Shouldn’t be a problem - the filenames are all different, so you can (almost) run ultraprune and non-ultraprune together in the same datadir independently.

    That said, it’s likely to conflict with a lot of other stuff, so decide for yourself.

  14. mikehearn commented at 12:34 pm on August 30, 2012: contributor
    Could you provide a squashed version of the patch somewhere, for review? It’s really hard to review as is because it’s just a record of how you implemented it over time.
  15. sipa commented at 12:48 pm on August 30, 2012: member
    @mikehearn #1677.diff ?
  16. mikehearn commented at 10:32 am on August 31, 2012: contributor
    Thanks, that looks useful.
  17. sipa commented at 11:29 am on August 31, 2012: member

    @mikehearn Seems that through rebasing I lost some comments you made earlier on the commits?

    Regarding the encodings, I plan to write some text about the final format for all datastructures, but I may change a few things still.

  18. sipa commented at 11:43 pm on September 4, 2012: member
    Rebased/combined with @mikehearn’s LevelDB patch
  19. sipa commented at 2:40 pm on September 20, 2012: member

    Rebased on 0.7, and moved the more experimental block caching and parallel signature checking to a separate branch. The code in here should be stable and can be tested.

    The only things that remain to be done are automatic import of old data, and more elaborate consistency checks at startup. I think those can be done in separate pull requests though.

    This branch has its own LevelDB glue, independent of (though similar to, but simpler than) the one in Mike’s leveldb branch. As the coin and block indexes are only opened once, there was no need for a CDB-like wrapper and a global CDBEnv to cache database accessors. If LevelDB is merged first, I’ll add reverts for most of it here.

  20. mikehearn commented at 4:05 pm on September 20, 2012: contributor

    I closed the LevelDB pull req. Let’s merge it as part of this.

    Note that my LevelDB branch has code that replays the blocks with some GUI progress. It’s not great, because it actually re-writes the block files in order to track the block offsets … I didn’t do any deep refactoring to fix that, as I wanted it to be as easy/fast to merge as possible, and it’s a one-off migration anyway. But since that bridge was crossed when it became part of ultraprune, you could just re-use whatever GUI code is usable.

  21. sipa commented at 12:07 pm on September 21, 2012: member
    @TheBlueMatt any way to disable the build tester here, as it seems to be incompatible with this anyway?
  22. laanwj commented at 1:01 pm on September 21, 2012: member

    I’ve tested this a bit on the testnet. No problems found, and synchronization is super-fast.

    One small comment: in your bitcoin-qt.pro, please use $(MAKE) instead of make. This prevents an annoying warning about a job server in Qt Creator.

  23. sipa commented at 1:12 pm on September 21, 2012: member
    @laanwj: updated to use $(MAKE)
  24. TheBlueMatt commented at 8:46 pm on September 22, 2012: member
    @sipa I’d rather not; the patch is really quite simple (http://jenkins.bluematt.me/pull-tester/files/bitcoind-comparison.patch). Afaict, it’s only failing because setBlockIndexValid was added directly above hashGenesisBlock in main.cpp. Can you just move that line and see if it works?
  25. sipa commented at 11:48 am on September 25, 2012: member
    Changed the database/serialization format one more time: coins and undo data now contain the transaction version number. This may be necessary when new transaction versions are defined that have an influence on their ability to be spent. @TheBlueMatt ok, moved the setBlockIndexValid line in main.cpp.
  26. mikehearn commented at 3:00 pm on September 27, 2012: contributor
    This does not build on MacOS X because there is no fdatasync on that platform.
  27. sipa commented at 2:56 pm on September 28, 2012: member

    @TheBlueMatt I wonder why it still complains?

    EDIT: Oh, just out of date with master. Let’s wait for the next cycle.

  28. mikehearn commented at 10:19 am on September 29, 2012: contributor

    I just tried to start my client based on this branch and got:

    Loading block index…
    Opening LevelDB in /Users/hearn/Library/Application Support/Bitcoin/blktree
    Opened LevelDB successfully
    Opening LevelDB in /Users/hearn/Library/Application Support/Bitcoin/coins
    Opened LevelDB successfully
    LoadBlockIndex(): last block file = 23
    LoadBlockIndex(): last block file: CBlockFileInfo(blocks=1572, size=132444896, heights=199237..200807, time=2012-09-17..2012-09-27)
    LoadBlockIndex(): hashBestChain=00000000000000e78688 height=200806 date=09/27/2012 21:08:42
    Verifying last 2500 blocks at level 1
    block index 36135ms
    Loading wallet…
    dbenv.open LogDir=/Users/hearn/Library/Application Support/Bitcoin/database ErrorFile=/Users/hearn/Library/Application Support/Bitcoin/db.log
    nFileVersion = 70099
    wallet 1192ms
    REORGANIZE: Disconnect 1 blocks; 000000000000051dcdc2..00000000000000e78688
    REORGANIZE: Connect 2 blocks; 000000000000051dcdc2..00000000000003d0a2b1


    EXCEPTION: NSt8ios_base7failureE
    CAutoFile::read : end of file
    bitcoin in Runaway exception

  29. mikehearn commented at 11:19 am on September 29, 2012: contributor

    On investigation this failure can happen with both ultralevelprune and old bdb code, it happens when the block is not written but the db updates are. Typically if power is yanked at just the wrong time.

    As it is not a new failure mode, I guess it should not delay review/merge of this code.

  30. in bitcoin-qt.pro: in 41f98b3c50 outdated
    89@@ -90,6 +90,33 @@ contains(BITCOIN_NEED_QT_PLUGINS, 1) {
    90     QTPLUGIN += qcncodecs qjpcodecs qtwcodecs qkrcodecs qtaccessiblewidgets
    91 }
    92 
    93+contains(USE_LEVELDB, -) {
    


    Diapolo commented at 8:34 pm on October 11, 2012:
    So this still includes legacy BDB support? That means we need to keep 2 code-bases up to date. Was the intention in keeping it to be able to revert? Just wanna know :).

    sipa commented at 9:00 pm on October 11, 2012:
    Yes, though the BDB version most likely doesn’t compile anymore. This was converted from Mike’s code, which tried to keep compatibility, but that’s just an unnecessary burden.

    Diapolo commented at 5:36 am on October 12, 2012:
    Thanks, so it would be nice to remove that burden entirely from this pull and the code. If this is a one-way ticket, there is no need to keep BDB compatibility code in.

    mikehearn commented at 8:41 am on October 12, 2012:

    The original idea was to reduce the risk of merging the code, in case there were issues with LevelDB [on some specific platform] we don’t want to hold up the release or do a potentially messy revert.

    I agree it’s irritating and a burden, but it’d suck if all of ultraprune ended up getting reverted due to unanticipated issues with LevelDB. Once 0.8 has been successfully rolled out to the userbase and things are quiet, it could be deleted.


    Diapolo commented at 9:16 am on October 12, 2012:
    I’m fine with removing that later, as long as you / sipa keep track of it. That whole block of commands in the pro-file looks like Voodoo to me anyway :-D.
  31. Diapolo commented at 8:37 pm on October 11, 2012: none
    Did anyone build this directly on Windows with MinGW? I saw there was a cross-compile Windows flag in the pro file. Perhaps I should just fetch that branch and try it in the next few days.
  32. sipa commented at 8:59 pm on October 11, 2012: member
    @Diapolo yes, I’ve done Windows builds; I even had to backport the LevelDB env for Windows from C++0x to C++, as gitian only has a GCC 4.2 MinGW compiler, which doesn’t support C++0x.
  33. gavinandresen commented at 4:07 pm on October 16, 2012: contributor

    Errors compiling on my Mac:

    makefile.osx has 4 spaces instead of a tab: @echo "Building LevelDB ..."; cd leveldb-1.5.0; make; cd ..

    And:

    txdb-bdb.cpp: In member function ‘bool CCoinsDB::HaveCoins(uint256)’:
    txdb-bdb.cpp:10: error: ‘make_pair’ was not declared in this scope

  34. sipa commented at 6:51 pm on October 16, 2012: member
    @gavinandresen Why does it build the BDB version? Did you explicitly disable USE_LEVELDB, or is there a problem with the makefile that causes this?
  35. sipa commented at 2:20 pm on October 20, 2012: member

    Some additional changes: reorganized the commits a bit, fixed a bug that caused unit tests to fail, removed some dead code, and added a lot of comments (mostly in main.h).

    Also see the list of implementation changes in the first comment here.

  36. gavinandresen commented at 2:32 pm on October 20, 2012: contributor

    Design ACK; I reviewed all of the changes to main.h and about half of main.cpp, and had just a handful of tiny nits that aren’t worth picking.

    I feel comfortable pulling this as long as @sipa can commit to working through the remaining TODOs and help fix any bugs that crop up over the next month or so.

  37. sipa commented at 2:40 pm on October 20, 2012: member
    By the way: this pull request is rebased on top of ’threadimport’ (#1880) and ‘canonical’ (#1742). I suppose that means those require ACKs first. If there is a problem, I’ll remove them from under this pull.
  38. Import LevelDB 1.5, it will be used for the transaction database. 5e650d6d2d
  39. Leveldb Windows port by Edouard Alligand, adapted for MingW by me. 94a50fb339
  40. Disable libsnappy detection in LevelDB 9d503a7285
  41. Backport Win32 LevelDB env from C++0x to C++
    Since the gitian mingw compiler doesn't support C++0x yet.
    9f56678fce
  42. Makefile integration of LevelDB 3ff3a2bd60
  43. LevelDB glue
    Database-independent glue for supporting LevelDB databases.
    
    Based on code from earlier commits by Mike Hearn in his leveldb
    branch.
    43b7905e98
  44. Compact serialization for variable-length integers
    Variable-length integers: bytes are a MSB base-128 encoding of the number.
    The high bit in each byte signifies whether another digit follows. To make
    the encoding one-to-one, one is subtracted from all but the last digit.
    Thus, the byte sequence a[] with length len, where all but the last byte
    has bit 128 set, encodes the number:
    
      (a[len-1] & 0x7F) + sum(i=1..len-1, 128^i*((a[len-i-1] & 0x7F)+1))
    
    Properties:
    * Very small (0-127: 1 byte, 128-16511: 2 bytes, 16512-2113663: 3 bytes)
    * Every integer has exactly one encoding
    * Encoding does not depend on size of original integer type
    4d6144f97f
  45. Compact serialization for scripts
    Special serializers for script which detect common cases and encode
    them much more efficiently. 3 special cases are defined:
    * Pay to pubkey hash (encoded as 21 bytes)
    * Pay to script hash (encoded as 21 bytes)
    * Pay to pubkey starting with 0x02, 0x03 or 0x04 (encoded as 33 bytes)
    
    Other scripts up to 121 bytes require 1 byte + script length. Above
    that, scripts up to 16505 bytes require 2 bytes + script length.
    69fc8047a9
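The template detection behind this could look roughly like the following (illustrative sketch; it covers only the compressed-pubkey case, while the commit also encodes 0x04 uncompressed keys in 33 bytes):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the compressed size for a recognized script template, or 0 if
// the script is not special and must be stored as length prefix + bytes.
size_t CompressedScriptSize(const std::vector<unsigned char>& s) {
    // Pay-to-pubkey-hash: OP_DUP OP_HASH160 <20 bytes> OP_EQUALVERIFY OP_CHECKSIG
    if (s.size() == 25 && s[0] == 0x76 && s[1] == 0xa9 && s[2] == 0x14 &&
        s[23] == 0x88 && s[24] == 0xac)
        return 21;  // 1 type byte + 20-byte key hash
    // Pay-to-script-hash: OP_HASH160 <20 bytes> OP_EQUAL
    if (s.size() == 23 && s[0] == 0xa9 && s[1] == 0x14 && s[22] == 0x87)
        return 21;  // 1 type byte + 20-byte script hash
    // Pay-to-pubkey, compressed key (0x02/0x03): <33-byte push> OP_CHECKSIG
    if (s.size() == 35 && s[0] == 0x21 && (s[1] == 0x02 || s[1] == 0x03) &&
        s[34] == 0xac)
        return 33;  // 1 type byte + 32-byte X coordinate
    return 0;
}
```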
  46. Compact serialization for amounts
    Special serializer/deserializer for amount values. It is optimized for
    values which have few non-zero digits in decimal representation. Most
    amounts currently in the txout set take only 1 or 2 bytes to
    represent.
    0fa593d0fb
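The idea can be sketched like this (hedged: an illustrative version of the scheme described, which may differ in detail from the committed serializer): strip trailing decimal zeros, then pack the remaining digits, so round satoshi amounts compress to very small numbers.

```cpp
#include <cassert>
#include <cstdint>

// Compress an amount by factoring out powers of ten (up to 10^9) and
// packing the exponent together with the remaining digits.
uint64_t CompressAmount(uint64_t n) {
    if (n == 0) return 0;
    int e = 0;
    while ((n % 10) == 0 && e < 9) { n /= 10; e++; }  // count trailing zeros
    if (e < 9) {
        int d = n % 10;   // last non-zero decimal digit, 1..9
        n /= 10;
        return 1 + (n * 9 + d - 1) * 10 + e;
    }
    return 1 + (n - 1) * 10 + 9;
}

uint64_t DecompressAmount(uint64_t x) {
    if (x == 0) return 0;
    x--;
    int e = x % 10;   // recover the exponent
    x /= 10;
    uint64_t n;
    if (e < 9) {
        int d = (x % 9) + 1;   // recover the last non-zero digit
        x /= 9;
        n = x * 10 + d;
    } else {
        n = x + 1;
    }
    while (e--) n *= 10;   // re-append the stripped zeros
    return n;
}
```

Combined with the variable-length integer serializer, a round value like 1 BTC (100000000 satoshi) compresses to a single-digit number and hence a single byte.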
  47. Add CCoins: pruned list of transaction outputs
    The CCoins class represents a pruned set of transaction outputs from
    a given transaction. It only retains information about its height in
    the block chain, whether it was a coinbase transaction, and its
    unspent outputs (script + amount).
    
    It has a custom serializer that has very low redundancy.
    10fd8604d7
  48. Add CTxUndo: transaction undo information
    The CTxUndo class encapsulates data necessary to undo the effects of
    a transaction on the txout set, namely the previous outputs consumed
    by it (script + amount), and potentially transaction meta-data when
    it is spent entirely.
    44ac1c0fe3
  49. One file per block
    Refactor of the block storage code, which now stores one file per block.
    This will allow easier pruning, as blocks can be removed individually.
    630fd8dcb6
  50. Preliminary undo file creation
    Create files (one per block) with undo information for the transactions
    in it.
    8adf48dc9b
  51. Multiple blocks per file
    Change the block storage layer again, this time with multiple blocks
    per file, but tracked by txindex.dat database entries. The file
    format is exactly the same as the earlier blk00001.dat, but with
    smaller files (128 MiB for now).
    
    The database entries track how many bytes each block file already
    uses, how many blocks are in it, which range of heights is present
    and which range of dates.
    5382bcf8cd
  52. Pre-allocate block and undo files in chunks
    Introduce an AllocateFileRange() function in util, which wipes or
    at least allocates a given range of a file. It can be overridden
    by more efficient OS-dependent versions if necessary.
    
    Block and undo files are now allocated in chunks of 16 and 1 MiB,
    respectively.
    bba89aa82a
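A portable fallback for such a function might look like this (illustrative sketch, not the committed implementation; OS-specific versions, e.g. using fallocate, can replace it):

```cpp
#include <cassert>
#include <cstdio>

// Overwrite/extend the byte range [offset, offset+length) of an open file
// with zeros, in fixed-size steps. Pre-allocating block files in 16 MiB
// chunks (1 MiB for undo files) this way reduces fragmentation.
void AllocateFileRange(FILE* file, unsigned int offset, unsigned int length) {
    static const char buf[65536] = {0};
    fseek(file, offset, SEEK_SET);
    while (length > 0) {
        unsigned int now = sizeof(buf);
        if (now > length) now = length;
        fwrite(buf, 1, now, file);  // zero out up to 64 KiB per iteration
        length -= now;
    }
}
```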
  53. Ultraprune
    This switches bitcoin's transaction/block verification logic to use a
    "coin database", which contains all unredeemed transaction output scripts,
    amounts and heights.
    
    The name ultraprune comes from the fact that instead of a full transaction
    index, we only (need to) keep an index with unspent outputs. For now, the
    blocks themselves are kept as usual, although they are only necessary for
    serving, rescanning and reorganizing.
    
    The basic datastructures are CCoins (representing the coins of a single
    transaction), and CCoinsView (representing a state of the coins database).
    There are several implementations for CCoinsView. A dummy, one backed by
    the coins database (coins.dat), one backed by the memory pool, and one
    that adds a cache on top of it. FetchInputs, ConnectInputs, ConnectBlock,
    DisconnectBlock, ... now operate on a generic CCoinsView.
    
    The block switching logic now builds a single cached CCoinsView with
    changes to be committed to the database before any changes are made.
    This means no uncommitted changes are ever read from the database, and
    should ease the transition to another database layer which does not
    support transactions (but does support atomic writes), like LevelDB.
    
    For the getrawtransaction() RPC call, access to a txid-to-disk index
    would be preferable. As this index is not necessary or even useful
    for any other part of the implementation, it is not provided. Instead,
    getrawtransaction() uses the coin database to find the block height,
    and then scans that block to find the requested transaction. This is
    slow, but should suffice for debug purposes.
    450cbb0944
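The view layering described in this commit message can be sketched as follows (heavily simplified, with illustrative names standing in for CCoins / CCoinsView / CCoinsViewCache, and plain strings standing in for txids):

```cpp
#include <cassert>
#include <map>
#include <string>

struct Coins {
    int nHeight;      // height of the block containing the transaction
    bool fCoinBase;   // whether it was a coinbase
    // ... unspent outputs (script + amount) would live here ...
};

// Abstract view of the coin state; backends include a dummy, the on-disk
// coins database, the memory pool, and a cache stacked on another view.
class CoinsView {
public:
    virtual bool GetCoins(const std::string& txid, Coins& coins) = 0;
    virtual bool SetCoins(const std::string& txid, const Coins& coins) = 0;
    virtual ~CoinsView() {}
};

// A trivial in-memory backend, standing in for the database-backed view.
class CoinsViewMemory : public CoinsView {
    std::map<std::string, Coins> data;
public:
    bool GetCoins(const std::string& txid, Coins& coins) {
        std::map<std::string, Coins>::iterator it = data.find(txid);
        if (it == data.end()) return false;
        coins = it->second;
        return true;
    }
    bool SetCoins(const std::string& txid, const Coins& coins) {
        data[txid] = coins;
        return true;
    }
};

// A cache view: buffers changes and commits them to the backing view in
// one batch, so uncommitted state is never read back from the database.
class CoinsViewCache : public CoinsView {
    CoinsView& base;
    std::map<std::string, Coins> cache;
public:
    CoinsViewCache(CoinsView& baseIn) : base(baseIn) {}
    bool GetCoins(const std::string& txid, Coins& coins) {
        std::map<std::string, Coins>::iterator it = cache.find(txid);
        if (it != cache.end()) { coins = it->second; return true; }
        return base.GetCoins(txid, coins);
    }
    bool SetCoins(const std::string& txid, const Coins& coins) {
        cache[txid] = coins;
        return true;
    }
    bool Flush() {  // commit all buffered changes at once
        for (std::map<std::string, Coins>::iterator it = cache.begin();
             it != cache.end(); ++it)
            base.SetCoins(it->first, it->second);
        cache.clear();
        return true;
    }
};
```

Stacking a cache on the database view is what lets block connection accumulate changes and write them in a single atomic batch, as the commit message explains.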
  54. Batch block connection during IBD
    During the initial block download (or -loadblock), delay connection
    of new blocks a bit, and perform them in a single action. This reduces
    the load on the database engine, as subsequent blocks often update an
    earlier block's transaction already.
    ae8bfd12da
  55. Transaction hash caching
    Use CBlock's vMerkleTree to cache transaction hashes, and pass them
    along as argument in more function calls. During initial block download,
    this results in every transaction's hash being computed only once.
    64dd46fd05
  56. Direct CCoins references
    To prevent excessive copying of CCoins in and out of the CCoinsView
    implementations, introduce a GetCoins() function in CCoinsViewCache
    which returns a direct reference. The block validation and connection
    logic is updated to require caching CCoinsViews, and exploits the
    GetCoins() function heavily.
    13c51f20f6
  57. Automatically reorganize at startup to best known block
    Given that the block tree database (chain.dat) and the active chain
    database (coins.dat) are entirely separate now, it becomes legal to
    swap one with another instance without affecting the other.
    
    This commit introduces a check in the startup code that detects the
    presence of a better chain in chain.dat that has not been activated
    yet, and does so efficiently (in batch, while reusing the blk???.dat
    files).
    4fea06db25
  58. Prepare database format for multi-stage block processing
    This commit adds a status field and a transaction counter to the block
    indexes.
    857c61df0b
  59. Use singleton block tree database instance d979e6e36a
  60. Flush and sync block data 44d40f26dc
  61. LevelDB block and coin databases
    Split off CBlockTreeDB and CCoinsViewDB into txdb-*.{cpp,h} files,
    implemented by either LevelDB or BDB.
    
    Based on code from earlier commits by Mike Hearn in his leveldb
    branch.
    2d8a48292b
  62. Add LevelDB MemEnv support
    Support LevelDB memory-backed environments, and use them in unit tests.
    e1bfbab802
  63. Add gettxout and gettxoutsetinfo RPCs beeb57610c
  64. Remove BDB block database support 4ca60bba5c
  65. gmaxwell commented at 9:41 pm on October 20, 2012: contributor
    ACK. This appears ready for integration.
  66. sipa referenced this in commit cf9b49fa50 on Oct 20, 2012
  67. sipa merged this on Oct 20, 2012
  68. sipa closed this on Oct 20, 2012

  69. laudney referenced this in commit f7eb23fc8b on Mar 19, 2014
  70. luke-jr referenced this in commit d3ef9b00ec on Mar 24, 2014
  71. DrahtBot locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-10-06 16:12 UTC
