In GetMedianTimePast, there is no need to fully sort the time of the previous blocks since we are only interested in the median. Therefore we can use std::nth_element instead of std::sort which better matches the semantics of what we want to do and might perform a bit better.
Use std::nth_element instead of std::sort to calculate median time past #11333
pull danra wants to merge 3 commits into bitcoin:master from danra:improve/median-time-past changing 2 files +48 −5-
danra commented at 9:48 PM on September 14, 2017: contributor
- danra renamed this:
Use `std::nth_element` instead of `std::sort` to calculate median time past
Use std::nth_element instead of std::sort to calculate median time past
on Sep 14, 2017 - danra force-pushed on Sep 14, 2017
-
010fe71ac9
Use std::nth_element instead of std::sort to calculate median time past
In GetMedianTimePast, there is no need to fully sort the time of the previous blocks since we are only interested in the median. Therefore we can use std::nth_element instead of std::sort which better matches the semantics of what we want to do and might perform a bit better.
-
Minor style improvements in GetMedianTimePast 0b2c88216b
- danra force-pushed on Sep 14, 2017
-
gmaxwell commented at 11:24 PM on September 14, 2017: contributor
Or maybe sort works in place and nth_element copies to build a heap and is actually 10x slower when we're only talking about 11 items. If the justification is performance you should at least give it a casual test.
- fanquake added the label Refactoring on Sep 15, 2017
-
(Not for merge) Add benchmark for verifying std::nth_element is faster than std::sort for an array of length nMedianTimeSpan==11 06663cff04
-
danra commented at 7:33 AM on September 15, 2017: contributor
@gmaxwell @MeshCollider Added a (not for merge) benchmark commit showing the median is faster even for such a short length.
On my machine, bench_bitcoin gives SmallMedian,184549376,0.000000005698851,0.000000006605831,0.000000005873540,14,16,14 SmallSort,134217728,0.000000007392771,0.000000009320686,0.000000008182214,18,23,20
- laanwj added the label Consensus on Sep 15, 2017
-
laanwj commented at 11:06 AM on September 15, 2017: member
This changes consensus-critical code, so it will have to be carefully tested that this returns the same in all cases. To see if this is warranted, we need a more global benchmark as well - e.g. what % of the validation time is spent in this function?
-
danra commented at 11:14 AM on September 15, 2017: contributor
@laanwj My guess is very little - but in case your guess is different and you think even some small percentage could be gained, let me know and I'll try to benchmark that.
My initial motivation was not performance gain, rather to make the semantics a bit clearer, but it makes sense that that's not a strong enough argument to change consensus critical code.
-
jnewbery commented at 5:00 PM on September 15, 2017: member
I'd tend to NACK changes like this, unless there's a significant performance gain. I'm especially not keen on the gratuitous style changes (why prefer
auto*overint64_t*?) -
danra commented at 9:23 PM on September 15, 2017: contributor
@jnewbery The rationale is simple:
auto* pbegin = &pmedian[11];conveys the concise intention of having a pointer to an object representing time.int64_t* pbegin = &pmedian[11];conveys the less concise intention of having a pointer to an object representing time, and that you expect this object to beint64_t. In the context ofGetMedianTimePast(), you don't really care about the exact time type. Therefore, usingauto*better reflects the semantics needed in the context of the function. -
jnewbery commented at 9:33 PM on September 15, 2017: member
@danra thanks for the nice explanation. There are different views on when
autois appropriate (see https://botbot.me/freenode/bitcoin-core-dev/2017-02-02/?msg=80353999 for example). I personally would prefer that it not get used for simple types likeint64_t.The project doesn't have style guidelines for
auto. Changing existing, functioning code to match your personal preferences seems unnecessary. -
jnewbery commented at 9:48 PM on September 15, 2017: member
Why?
Because in general I think explicit is better than implicit. But you seem to have more C++ experience than me, so I may just be wrong :)
In any case,
autoshould compile to the same thing and there's no preference expressed in the project style guide, so in general I'd expect there to be a pretty strong argument to make changes (especially as this is consensus code).Anyway, I'm only expressing a vague personal preference. I don't want to completely derail the discussion of this PR.
-
theuni commented at 10:06 PM on September 15, 2017: member
NACK. Not that it's not a readability improvement, but it's just not worth the risk of some freak divergence.
Also, while I personally agree with your use of auto in general:
@jnewbery It's not a personal preference, it's better code semantics in this case.
Better is subjective. Please keep in mind that there are many ways to measure the code.
-
promag commented at 10:28 PM on September 15, 2017: member
NACK. IMHO more code to read, test and maintain vs sort 11 elements worst case.
-
sipa commented at 11:00 PM on September 15, 2017: member
If we're going to rewrite this anyway, I think this code can be simplified further:
int64_t GetMedianTimePast() const { int64_t median[nMedianTimeSpan]; size_t count = 0; const CBlockIndex* index = this; while (count < nMedianTimeSpan && index) { median[count++] = index->GetBlockTime(); index = index->pprev; } std::nth_element(median, median + count / 2, median + count); return median[count / 2]; }Regardless of whether people like the idea of changing this function, thanks for using
nth_element, I wasn't aware of that function. It's pretty incredible how short the compiler output here for this function (52 asm instructions, no function calls). -
gmaxwell commented at 11:21 PM on September 15, 2017: contributor
FWIW, I was in the process of writing something pretty much like sipa's suggestion in english (though using std::begin())-- with the suggestion of reducing the very C-ish references to array members and pointer arithmetic. (Also.. comma operator? meh....)
As far as the use of auto: The principle I think we prefer to follow is that explicit is better than implicit, except where explicit creates a ton of noise or breaks generality--e.g. in a template; or deals with a far-away defined type making changes have to happen in many places. If those things aren't true, we prefer explicitness.
I'm a fan of using the nth element in a cleanup like sipa suggests. I think that code is easy enough to review and test (I would just try all n! inputs of c*1-n for n [1..11] and a number of c values and make sure it gives the same results; or something similar which is almost exhaustive)
-
laanwj commented at 7:06 AM on September 16, 2017: member
In the context of GetMedianTimePast(), you don't really care about the exact time type.
For a generic function that would be true - but because this is consensus code, we really strongly care about the exact types and everything, and shouldn't leave anything to the compiler.
Due to the high risk I'd personally prefer not to do this change, or another rewrite of the function, unless there is a strong other driver than aesthetics. If the code is unclear please add a comment instead.
This seems to be the opinion of many other people here as well. Hence I'm closing this. I'd say there are more important priorities, too.
- laanwj closed this on Sep 16, 2017
-
danra commented at 8:32 AM on September 16, 2017: contributor
@gmaxwell @laanwj Explicit is better than implicit in this context when you want to commit to a type. If what you're interested in doing is just tracking a type, it is a violation of DRY.
In the consensus code, I think there's a trade-off. You can re-commit to types as you go, but that makes the code less concise and more resistant to change, and as a consequence less secure, IMHO. The advantage of re-committing to types everywhere is similar to adding asserts - you wouldn't stick asserts everywhere, but you would at junctions where you think they would matter, or you're afraid of mistakes.
I can speak to this advantage of not using
autofrom my own personal experience of writing DSP code, wheredoubles andfloats often mix together and usingautowould be bug-prone - that's a case where I often want to commit to types, since I don't trust myself to get the type tracking right, so I'm insuring myself against my own bugs.However, if I had to, e.g.,
std::sortan array ofdoubles, I wouldn't hesitate to getautopointers to it, since there is no fear of any mixup. If I later change the array tofloats, the sorting would still work fine. @sipa @gmaxwell I suppose the code intentionally initializes the vector in reverse order, so it is likely to be already sorted or close to sorted (since later blocks will tend to have later times), so the followingstd::sorttakes less time. Does that make sense? I agree the loop could still probably be simplified. -
gmaxwell commented at 4:25 PM on September 16, 2017: contributor
and as a consequence less secure,
Mixing up the types as you change things in cases like this will be caught by the compiler. Which is also a helpful signal that you're being careless. I agree it's debatable but in general we prefer to be explicit where it doesn't really harm generality or make a mess by repeating a complex type signature. I think we've found it aids review.
so the following std::sort takes less time. Does that make sense?
That is quite plausible to me. Depends on the specifics of the underlying algorithim.
- DrahtBot locked this on Sep 8, 2021