Is your feature request related to a problem? Please describe.
The serialized UTXO set produced by dumptxoutset is larger than it needs to be.
Describe the solution you’d like
The serialized UTXO set produced by dumptxoutset is a list of (COutPoint, Coin) pairs.
However, many outpoints refer to the same transaction, so the entries can be grouped by Txid:
list((Txid, list((vout, Coin))))
Since the cursor already iterates in sorted Txid order, this doesn’t add complexity to the serialization code.
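The grouping described above can be sketched in a few lines. This is only an illustration in Python, not the Bitcoin Core serialization code: the `Txid` and `Coin` aliases and the `group_by_txid` helper are hypothetical names, and the input is assumed to arrive already sorted by txid, as the cursor provides.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-ins for illustration: a txid is raw bytes,
# a coin is an opaque serialized blob.
Txid = bytes
Coin = bytes
OutPoint = Tuple[Txid, int]  # (txid, vout)


def group_by_txid(
    entries: Iterable[Tuple[OutPoint, Coin]],
) -> Iterator[Tuple[Txid, List[Tuple[int, Coin]]]]:
    """Turn a txid-sorted stream of (outpoint, coin) pairs into
    (txid, [(vout, coin), ...]) groups, one group at a time."""
    # groupby only merges adjacent items, so it relies on the sorted order
    # and never needs more than one transaction's outputs in memory.
    for txid, group in groupby(entries, key=lambda entry: entry[0][0]):
        yield txid, [(outpoint[1], coin) for outpoint, coin in group]


if __name__ == "__main__":
    flat = [
        ((b"\xaa" * 32, 0), b"coin-a0"),
        ((b"\xaa" * 32, 3), b"coin-a3"),
        ((b"\xbb" * 32, 1), b"coin-b1"),
    ]
    for txid, outputs in group_by_txid(flat):
        print(txid.hex()[:8], outputs)
```

Because each group is emitted as soon as the txid changes, the format keeps its streaming property.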
Considering the UTXO set at height 745995:
serialized_size: ~5.3 GB
total_elements: 83_082_178
unique_txids: 49_517_483
bytes_lost: total_elements // due to the additional byte for the length of the inner list
bytes_savings: (total_elements - unique_txids) * 32 - bytes_lost ~= 1 GB
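As a quick check, the estimate can be reproduced with the figures quoted above. The variable names mirror the ones in the issue, and `TXID_SIZE` is simply the 32-byte txid length the formula already uses.

```python
TXID_SIZE = 32  # bytes per txid

total_elements = 83_082_178
unique_txids = 49_517_483

# Every element after the first one in a group no longer repeats its txid.
txid_bytes_saved = (total_elements - unique_txids) * TXID_SIZE

# Cost charged in the issue for the inner list length (one byte per element).
bytes_lost = total_elements

bytes_savings = txid_bytes_saved - bytes_lost
print(bytes_savings)  # 990988062, i.e. just under 1 GB
```

Against a file of about 5.3 GB, that is a reduction of a little under one fifth.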
Describe alternatives you’ve considered
Additional bytes could be saved by deduplicating repeated scripts (address reuse). However, this is not considered worthwhile: the format would lose its streaming property, and we don’t want to optimize for a practice that is not recommended.
The byte spent on the length of the inner list could be recovered with a special byte that packs both the list length and the first vout. Both values are usually very small, so they should fit in a single byte most of the time, with a fallback encoding for edge cases.
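One possible shape for such a packed byte is sketched below. Everything specific here is an assumption made for illustration, not part of the proposal: the 4/4 bit split, the `0x00` escape value, the fixed-width fallback, and the `encode_header` / `decode_header` names are all hypothetical.

```python
import struct

ESCAPE = 0x00  # hypothetical marker: "explicit count and vout follow"


def encode_header(count: int, first_vout: int) -> bytes:
    """Pack the list length and the first vout into one byte when both are small."""
    if count < 1 or first_vout < 0:
        raise ValueError("count must be >= 1 and first_vout >= 0")
    if count <= 15 and first_vout <= 15:
        # High nibble = count (1..15), low nibble = first vout (0..15).
        # count >= 1 guarantees this byte is never equal to ESCAPE.
        return bytes([(count << 4) | first_vout])
    # Fallback for edge cases: escape byte plus two little-endian uint32 values.
    # (A real format would more likely use a variable-length integer here.)
    return bytes([ESCAPE]) + struct.pack("<II", count, first_vout)


def decode_header(buf: bytes) -> tuple:
    """Return (count, first_vout, bytes_consumed)."""
    first = buf[0]
    if first >> 4:  # non-zero high nibble means the compact form
        return first >> 4, first & 0x0F, 1
    if first != ESCAPE:
        raise ValueError("invalid header byte")
    count, first_vout = struct.unpack_from("<II", buf, 1)
    return count, first_vout, 9


if __name__ == "__main__":
    print(encode_header(2, 1).hex())          # compact form, one byte
    print(len(encode_header(2000, 3)))        # fallback form, nine bytes
```

With this layout the common case costs the same single byte as a bare length, but also carries the first vout, so the extra byte per group is effectively recovered whenever both values are below 16.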
Additional context