addrman: bump peers.dat format version number #23382

vasild commented at 2:43 PM on October 28, 2021: contributor

After 92617b7a758c0425330fba4b886296730567927c (https://github.com/bitcoin/bitcoin/pull/23306) we allow multiple entries in addrman with the same address that only differ by port.

Such peers.dat trips versions before the above commit and they fail to read it with an obscure "corruption" error:

Corrupt data. Consistency check failed with code -5

So, if we detect that there are such "duplicate" entries, write the file with a newver format version, so that old software emits:

Unsupported format of addrman database: 4. It is compatible with formats >=4, but the maximum supported by this version of Bitcoin Core is 3.

addrman: bump peers.dat format version number

After 92617b7a758c0425330fba4b886296730567927c
(https://github.com/bitcoin/bitcoin/pull/23306) we allow multiple entries
in addrman with the same address that only differ by port.

Such `peers.dat` trips versions before the above commit and they fail to
read it with an obscure "corruption" error:

```
Corrupt data. Consistency check failed with code -5
```

So, if we detect that there are such "duplicate" entries, write the file
with a newver format version, so that old software emits:

```
Unsupported format of addrman database: 4. It is compatible with formats >=4, but the maximum supported by this version of Bitcoin Core is 3.
```

1a8a382787

in src/addrman.cpp:172 in 1a8a382787

 168 | @@ -169,9 +169,17 @@ void AddrManImpl::Serialize(Stream& s_) const
 169 |  
 170 |      s << static_cast<uint8_t>(FILE_FORMAT);
 171 |  
 172 | +    std::set<CNetAddr> dup_guard;

vasild commented at 2:44 PM on October 28, 2021:

Checking 70k addresses takes 0.032 sec. If changed to unordered_map then it takes 0.017 sec. While that is ~2x speedup, I think it is insignificant during Serialize() and not worth introducing and maintaining CNetAddr hasher.

<details> <summary>CNetAddrHash</summary>

class CNetAddrHash
{
public:
    size_t operator()(const CNetAddr& a) const noexcept
    {
        CSipHasher hasher(m_salt_k0, m_salt_k1);
        hasher.Write(a.m_net);
        hasher.Write(a.m_addr.data(), a.m_addr.size());
        return static_cast<size_t>(hasher.Finalize());
    }

private:
    const uint64_t m_salt_k0 = GetRand(std::numeric_limits<uint64_t>::max());
    const uint64_t m_salt_k1 = GetRand(std::numeric_limits<uint64_t>::max());
};

</details>

sipa commented at 2:50 PM on October 28, 2021:

std::sort + std::adjacent_find may be faster than both.

vasild commented at 3:42 PM on October 28, 2021:

Inserting the elements in std::set<CNetAddr> is essentially sorting them (by address only). Inserting N elements would be O(N*log(N)). Same as std::sort, except that the std::set variant could quit early if a duplicate is found.

The std::unordered_set variant is O(N) (time and memory).

sipa commented at 3:53 PM on October 28, 2021:

@vasild Yes, but both set/unordered_set will require an allocation per element. I would expect that to outweigh the complexity difference for the sizes we're talking about.

vasild commented at 9:17 AM on October 29, 2021:

I played with this out of curiosity. Speed is not so important in this case. But anyway:

The "sort + adjacent_find":

    std::vector<CNetAddr> dup_guard;
    dup_guard.reserve(mapAddr.size());
    for (const auto& [addr, id] : mapAddr) {
        dup_guard.push_back(addr);
    }
    std::sort(dup_guard.begin(), dup_guard.end());
    const bool duplicates_by_addr =
        std::adjacent_find(dup_guard.begin(), dup_guard.end()) != dup_guard.end();

Performs about the same as the unordered_set variant. The reserve() call is very important - if missing then this is slower than unordered_set.

However, the unordered_set also benefits from a reserve() call:

    std::unordered_set<CNetAddr, CNetAddrHash> dup_guard;
    dup_guard.reserve(mapAddr.size());

This makes it the fastest on Earth - 0.011 sec.

The above is for about 70k entries without duplicates (worst case). If there are duplicates, then the unordered_set is even faster as it would cancel the sorting somewhere in the middle (sorting is the slowest operation). In my case, with just 73 duplicates from 70k entries, the timings from a few runs are: 0.0013, 0.00077, 0.002, 0.00064.

Clang 13, -O2

vasild cross-referenced this on Oct 28, 2021 from issue Make AddrMan support multiple ports per IP by sipa

sipa commented at 2:46 PM on October 28, 2021: member

Duplicate of #23354, though this variant also uses V3 when possible.

vasild commented at 2:55 PM on October 28, 2021: contributor

Duplicate of #23354, though this variant also uses V3 when possible.

Doh, closing in favor of that one. I wasn't aware of it. Sorry for the noise.

Feel free to pick the use-V3-if-no-duplicates tweak into #23354.

vasild closed this on Oct 28, 2021

bitcoin locked this on Oct 30, 2022