Corrections to bad-chain alert triggering #7568

pull dgenr8 wants to merge 1 commit into bitcoin:master from dgenr8:fix_bad_chain_alert_trigger changing 4 files +38 −26
  1. dgenr8 commented at 6:49 PM on February 20, 2016: contributor

    Two corrections are made to the bad-chain alert triggering mechanism: oversampling is eliminated and the test is changed to make correct use of the poisson CDF.

    Overall, the effect will be for the alert to trigger less often, since it was triggering too easily before. The alert was originally introduced by #5947.
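
    As a rough illustration of the PDF-vs-CDF point (this is not the actual PartitionCheck code), the sketch below shows why thresholding the poisson PMF is incorrectly sensitive to the sample interval while the CDF is not: the PMF of a perfectly typical observation (k equal to the mean) shrinks roughly as 1/sqrt(mean) as the window grows, but its CDF stays near 0.5 for any window size.

```python
import math

def poisson_pmf(k, lam):
    # P(X == k) for X ~ Poisson(lam), computed in log space for stability
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_cdf(k, lam):
    # P(X <= k)
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

# The most typical observation (k == lam blocks seen) gets a smaller PMF
# value as the window grows, so a fixed PMF threshold means very different
# things for different sample intervals...
print(poisson_pmf(13, 13))   # 13 blocks expected in a 130-minute window
print(poisson_pmf(52, 52))   # 52 blocks expected in a 520-minute window
# ...while the CDF of the same observation stays near 0.5 either way.
print(poisson_cdf(13, 13))
print(poisson_cdf(52, 52))
```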

  2. btcdrak commented at 9:45 PM on February 20, 2016: contributor

    Is this related to #7488 as well?

  3. sipa commented at 10:17 PM on February 20, 2016: member

    @btcdrak I believe so.

  4. dgenr8 commented at 10:52 PM on February 20, 2016: contributor

    Oh yes, that was actually what prompted me. I've asked Meni Rosenfeld and another person who ran simulations to comment.

  5. dgenr8 force-pushed on Feb 23, 2016
  6. Corrections to bad-chain alert triggering
    Three corrections are made to the bad-chain alert triggering mechanism:
    the sample interval is reduced to 130 minutes (from 240 minutes),
    oversampling is eliminated, and the test is changed to make correct use
    of the poisson CDF.
    
    A sample interval of 130 minutes is chosen as the smallest interval
    capable of detecting a too-few-blocks event with a false positive rate
    of once in 50 years.
    
    It is critical that a sample be taken only every 130 minutes, otherwise
    the test will catch events that span sample intervals, which it is
    designed to miss.  Oversampling is eliminated by changing the
    PartitionCheck function to return the interval at which it must be run,
    and scheduling it at that interval.
    
    A much smaller bias in the other direction is also eliminated.  The
    poisson PDF was being used to check the likelihood of a specific
    observation, a method which is incorrectly sensitive to the size of the
    sample interval.  This is corrected by using the poisson CDF and doing a
    separate statistical check for the high and low cases.
    
    The effect of the third correction is small because, in practice, the
    targeted PDF levels accounted for almost all of the probability mass
    in the neighborhood of the test.
    580aaa36a0
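
    The rates implied by the commit message can be cross-checked with a short sketch, assuming ideal 10-minute block spacing, one poisson test per non-overlapping 130-minute sample, and the PR's thresholds of zero blocks (low side) and 33-or-more blocks (high side):

```python
import math

BLOCK_MIN  = 10.0    # consensus target block spacing, minutes
SAMPLE_MIN = 130.0   # non-overlapping sample interval from this PR
LAM = SAMPLE_MIN / BLOCK_MIN          # expected blocks per sample = 13

def pmf(k, lam=LAM):
    # Poisson P(X == k), computed in log space for stability
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

p_low  = pmf(0)                                # zero blocks observed
p_high = 1.0 - sum(pmf(k) for k in range(33))  # >= 33 blocks observed

checks_per_year = 365.25 * 24 * 60 / SAMPLE_MIN    # ~4046 samples/year

years_low  = 1.0 / (checks_per_year * p_low)
years_high = 1.0 / (checks_per_year * p_high)
years_both = 1.0 / (checks_per_year * (p_low + p_high))
```

    This gives one-sided false-positive rates of roughly once per 100 years each, and a combined rate of roughly once per 50 years.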
  7. dgenr8 force-pushed on Feb 23, 2016
  8. dgenr8 commented at 1:52 AM on February 23, 2016: contributor

    We use an arbitrary threshold of a once-in-fifty-years event. We don't need to use a second arbitrary parameter for the sample interval.

    Updated to sample at 130-minute intervals, instead of 240 minutes. This is the minimum interval required for an observation of zero blocks to happen spuriously less than once in fifty years. The high-side check works out to 33 or more blocks in 130 minutes.
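
    The "minimum interval" claim can be reproduced by searching sample intervals in 10-minute steps (the step size is an assumption; the comment above only states the result):

```python
import math

MINUTES_PER_YEAR = 365.25 * 24 * 60

def years_between_zero_block_alerts(sample_min, block_min=10.0):
    """Mean years between spurious zero-block observations, with one
    test per non-overlapping sample, assuming ideal 10-minute blocks."""
    p_zero = math.exp(-sample_min / block_min)     # Poisson P(X = 0)
    checks_per_year = MINUTES_PER_YEAR / sample_min
    return 1.0 / (checks_per_year * p_zero)

# Smallest interval, in 10-minute steps, whose zero-block false positive
# occurs less often than once in 50 years:
interval = next(t for t in range(10, 1000, 10)
                if years_between_zero_block_alerts(t) > 50)
```

    A 120-minute sample would spuriously see zero blocks more often than once in 50 years; 130 minutes is the first step that clears the bar.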

  9. bitcoinfees commented at 2:27 AM on February 25, 2016: none

    The math is correct, i.e. the false positive rate is once every ~ 50 years assuming an average block time of 10 min. To be clear, this is the combined (two-sided) alert rate; the one-sided alerts (pHigh or pLow) have individual rates of every ~ 100 years.

    This is in contrast to the original code, which from simulation, gives an alert rate of ~ 3.5 years.

    Additional statistics: assuming the block rate deviates from the proof-of-work target by 20% (due to changing hash rate), then pHigh / pLow will have rates of once per 3 / 8 years respectively, which seems reasonable. If this is deemed too high, the test can be modified to allow for deviations up to a specified amount.

    The other significant change here is that the test spans are made non-overlapping, and their frequency is reduced from once per 10 minutes to once per 130 minutes. This unfortunately reduces the potential statistical power (the ability to correctly detect an anomaly), and also adds latency to the alerts.

    The main reason for this change is that it is not trivial to determine the threshold for overlapping spans. It can be done by simulation, but that's not as good as the closed-form solution that you would get for non-overlapping spans.
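
    The deviation figures above can be cross-checked the same way. The sketch below assumes the hash rate sits 20% above or below what the difficulty implies, scaling the poisson mean of the 130-minute window accordingly:

```python
import math

SAMPLE_MIN = 130.0
CHECKS_PER_YEAR = 365.25 * 24 * 60 / SAMPLE_MIN

def pmf(k, lam):
    # Poisson P(X == k), computed in log space for stability
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

# Hash rate 20% above the difficulty target: blocks arrive 1.2x as fast,
# so the window sees Poisson(13 * 1.2) blocks; alert fires on >= 33.
p_high = 1.0 - sum(pmf(k, 13 * 1.2) for k in range(33))
# Hash rate 20% below target: Poisson(13 * 0.8); alert fires on zero.
p_low = pmf(0, 13 * 0.8)

years_high = 1.0 / (CHECKS_PER_YEAR * p_high)   # roughly once per 3 years
years_low  = 1.0 / (CHECKS_PER_YEAR * p_low)    # roughly once per 8 years
```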

  10. laanwj added the label Docs and Output on Feb 29, 2016
  11. laanwj commented at 7:31 PM on March 7, 2016: member

    Concept ACK

  12. in src/init.cpp:None in 580aaa36a0
    1671 | @@ -1672,7 +1672,8 @@ bool AppInit2(boost::thread_group& threadGroup, CScheduler& scheduler)
    1672 |      int64_t nPowTargetSpacing = Params().GetConsensus().nPowTargetSpacing;
    1673 |      CScheduler::Function f = boost::bind(&PartitionCheck, &IsInitialBlockDownload,
    1674 |                                           boost::ref(cs_main), boost::cref(pindexBestHeader), nPowTargetSpacing);
    1675 | -    scheduler.scheduleEvery(f, nPowTargetSpacing);
    1676 | +    CBlockIndex *pdummy = NULL;
    1677 | +    scheduler.scheduleEvery(f, PartitionCheck(&IsInitialBlockDownload, boost::ref(cs_main), boost::cref(pdummy), nPowTargetSpacing));
    


    sipa commented at 10:07 PM on March 7, 2016:

    It took me a while to understand this, with PartitionCheck called once as the scheduled function, and once specially to determine the frequency.

    Wouldn't it be easier to just schedule it once, and have each invocation schedule its own next call? That way the interval can also be made dynamic more easily.


    dgenr8 commented at 12:35 AM on March 8, 2016:

    I did this just to keep the constants out of global scope; your approach would probably be more readable. A drawback of constant rescheduling would be some drift, though probably not much.
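
    The self-rescheduling idea can be sketched as follows. FakeScheduler, partition_check, and run_and_reschedule are hypothetical stand-ins for CScheduler and PartitionCheck, not the actual Bitcoin Core code; the point is that each invocation returns and schedules its own next delay, so the interval can vary per call:

```python
# A minimal, single-threaded stand-in for an event scheduler.
class FakeScheduler:
    def __init__(self):
        self.now = 0.0
        self.queue = []                  # list of (fire_time, callback)

    def schedule(self, fn, delay):
        self.queue.append((self.now + delay, fn))

    def run_until(self, t_end):
        while self.queue:
            self.queue.sort(key=lambda item: item[0])
            t, fn = self.queue[0]
            if t > t_end:
                break
            self.queue.pop(0)
            self.now = t
            fn()

calls = []

def partition_check():
    """Stand-in for PartitionCheck; records when it ran and returns the
    delay until its next run (fixed here, but it could vary per call)."""
    calls.append(sched.now)
    return 130 * 60                      # 130 minutes, in seconds

def run_and_reschedule():
    interval = partition_check()
    sched.schedule(run_and_reschedule, interval)

sched = FakeScheduler()
sched.schedule(run_and_reschedule, 130 * 60)   # schedule it exactly once
sched.run_until(10 * 130 * 60)                 # simulate ten intervals
```

    Ten non-overlapping checks fire, each 130 minutes after the previous one, without any globally scheduled fixed period.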

  13. MeniRosenfeld commented at 10:16 PM on March 15, 2016: none

    In case it's still relevant, here's one issue that came up in my discussions with dgenr8:

    As far as I understand, the test is conducted at 4 hour intervals after the node starts. With the computed threshold and assuming constant hashrate, the fail-high (>=33 blocks in 130 minutes) should trigger for this node every ~100 years.

    However, this is not the only node on the network; and different nodes start at different times, and so they sample on different schedules. So there are more opportunities for the alarm to be falsely triggered somewhere on the network.

    So if the event we are trying to prevent is "hashrate is constant and yet an alarm is triggered somewhere on the network" then this is no better than the original method where each node sampled continuously (because the network as a whole sampled continuously). For the new parameters, I estimate it will trigger every ~18 years (I haven't run simulations yet).

    On top of that we have the added confusion of some nodes reporting an alarm while others do not.

    It would make more sense to me if each node tested for the alarm once when it starts, and then on a 4-hour schedule which is the same for all nodes.

  14. dgenr8 commented at 11:06 PM on March 30, 2016: contributor

    @MeniRosenfeld What will happen is that a more severe problem that lasts longer will get more nodes reporting it, while a shorter-lived problem will get fewer nodes reporting it. It's similar to different nodes seeing different resolutions to block races, in proportion to the closeness of the race.

    Synchronizing the checks network-wide can't be a good idea. Also, we must not check at startup because the last check may have been less than 130 minutes ago.

  15. morcos commented at 5:35 PM on March 31, 2016: member

    Do we know that the cause of false positives would be cured by the changes here? It seems likely to me that the bigger culprit is the variability that can exist in block timestamps, which aren't representative of when the blocks were actually found.

  16. MeniRosenfeld commented at 5:50 PM on March 31, 2016: none

    @morcos :

    1. "A" cause would be cured. There might be other problems but it doesn't mean we shouldn't solve this one.
    2. According to a model that assumes that reported timestamps are accurate, the original code resulted in false positives being triggered x10 too frequently. The new code tests the time span for 33 blocks - if you tell me how much you expect the reported span to be off, I can say how much this effect makes false positives more frequent.
  17. sipa commented at 6:19 PM on March 31, 2016: member

    So there are various issues, and it seems different people are arguing for fixing different ones.

    • The probabilities in the current code are wrong, they're using PDF instead of CDF. This PR fixes it.
    • The current code samples overlapping periods, causing the false positive rate to be too high. This PR fixes it.
    • We're not taking into account that the difficulty does not necessarily correspond to the actual hashrate. For extreme cases, detecting exactly that is of course the whole point, but if there is normal steady continuous growth of the hashrate, the difficulty is on average slightly off, and the probabilities don't take that into account. We could account for that by just multiplying or dividing by 1.1 towards the expected value, but the PR here does not do that.
    • We're not taking into account that the block timestamps do not necessarily correspond to the actual block generation timestamps. We could fix that by remembering the actual local system time at which a block was received, but that would only work for long running nodes.
    • Depending on whether you consider a single node showing a false alert to be less bad, the fact that different nodes check at different times may or may not be a problem. As suggested before, using coordinated check times would fix that.
  18. gmaxwell commented at 6:32 PM on March 31, 2016: contributor

    I think that it's better to have all the FPs happen at once, so that communication can effectively address them. If they happen at different times, there will be an increased general rate of FPs, and no one will know whether something is actually wrong or they were just the first to detect it.

    We still don't seem to completely understand all the causes of false positives here.

  19. laanwj added this to the milestone 0.13.0 on Apr 1, 2016
  20. arowser commented at 8:44 AM on May 25, 2016: contributor

    Can one of the admins verify this patch?

  21. dgenr8 commented at 7:19 PM on June 9, 2016: contributor

    @sipa This is regarding your last three points. Sometimes (rarely) a factor of .9 would be better than a factor of 1.1, so maybe that's trying too hard. It would be kind of nice to detect blocktime shenanigans. And I feel coordinating alert times would be totally artificial because there's no globally correct sampling schedule.

  22. laanwj commented at 6:52 AM on June 14, 2016: member

    I still don't think this has received enough review or testing. Also there are lots of proposals but no further progress. This has to be resolved before 0.13 release (but not necessarily the feature freeze, it's not a new feature) or the functionality will have to be disabled like was done for 0.12.

  23. laanwj removed this from the milestone 0.13.0 on Jul 6, 2016
  24. laanwj commented at 9:55 AM on August 3, 2016: member

    This functionality has been removed in #8275, closing.

  25. laanwj closed this on Aug 3, 2016

  26. MarcoFalke locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-13 18:15 UTC
