@gmaxwell wrote down a few rules to make it clear what is expected of DNS seeds and their operators.
Rendered document: https://github.com/laanwj/bitcoin/blob/2014_07_dnsseed_policy/doc/dnsseed-policy.md
TODO:
+
+0. A DNSseed operating organization or person is expected
+to follow good host security practices and maintain control of
+their serving infrastructure and not sell or transfer control of their
+infrastructure. Any hosting services contracted by the operator are
+equally expected to uphold these expectations.
It seems unreasonable to forbid usage of datacenters without a commitment from them not to be bought out...
Yeah, sorry, that's not the intent. I mentioned hosting services at all because the first sentence of 0 could have been read as precluding the use of hosting services. Can you suggest a rephrasing that is consistent with your own expectations?
I would think a clause prohibiting sale of the server may be problematic to any organisationally-hosted seeds, even aside from hosting companies. What is the goal of prohibiting such transfers (without knowing, I can't come up with any alternative ideas)?
Oh I see how you're reading this. In light of that it should say instead:
their serving infrastructure and not sell or transfer control of their DNSseed.
The expectations here are largely not directly enforceable by technology (or we wouldn't need to ask for them, they'd just be enforced), so there is a degree of reliance on honest behavior by operators. What I want to address here is the risk that some anonymous party comes up with a way to exploit the position of being a DNS seed for a dishonest end and they offer to buy control of a DNSseed from an existing operator (without mentioning their intended attack). An honest operator should, per these expectations (if not common sense first), refuse such a request.
Ah, so the goal is to stop selling/leasing of the DNS seed by itself, but doing so as part of a larger transfer (company sale) is okay? I can't think of a good way to phrase this better :(
Yep. Well at least we can make it clear that it's the dnsseed and not the underlying server. :)
+3. The results may not be served with a DNS TTL of less than one minute.
+
+4. Any logging of DNS queries should be only that which is necessary
+for the operation of the service or urgent health of the Bitcoin
+network and must not be retained longer than necessary or disclosed
+to any third party.
How about anonymous statistics?
It's very easy to mess up anonymous statistics and leak information. Can you suggest a way that would be better defined? E.g. what level of statistics you think would be sensible? I don't see any big issue with raw traffic amounts.
How about using a similar level of granularity to the EDNS Client Subnet extension (http://tools.ietf.org/html/draft-vandergaast-edns-client-subnet-02)? The originating IP address would be truncated to the BGP prefix in logs used for statistical analysis. This would prevent any analysis from identifying an individual user, which is the rationale behind point 4 if I'm not mistaken, and would still allow research based on the query logs.
As it stands right now, point 4 simply is too restrictive.
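For concreteness, the truncation proposed above could look roughly like the following. This is a minimal sketch, not anything taken from an existing seeder: the function name `truncate_address` is hypothetical, and the fixed /24 and /48 prefix lengths are assumptions standing in for a real BGP prefix, which would require a routing-table lookup.

```python
import ipaddress

# Assumed fixed prefix lengths. A true BGP prefix would need a routing-table
# lookup; /24 and /48 are only illustrative stand-ins.
V4_PREFIX_LEN = 24
V6_PREFIX_LEN = 48


def truncate_address(addr: str) -> str:
    """Return the enclosing network of addr, discarding the host bits."""
    ip = ipaddress.ip_address(addr)
    prefix_len = V4_PREFIX_LEN if ip.version == 4 else V6_PREFIX_LEN
    # strict=False lets ip_network() accept an address with host bits set
    # and mask them off instead of raising ValueError.
    network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    return str(network)


print(truncate_address("203.0.113.77"))           # 203.0.113.0/24
print(truncate_address("2001:db8:1234:5678::1"))  # 2001:db8:1234::/48
```

Only the truncated network would ever be written to a log under this scheme; whether that granularity is coarse enough is exactly what the replies below dispute.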
The purpose of the DNSseed infrastructure is to facilitate fast introduction to the network for hosts. Can you explain how (4) is at odds with this goal?
Identifying organizations which are using Bitcoin may still be harmful for those users and a violation of their privacy.
Bitcoin users are human beings and should not be made subjects of research without informed consent. Use of the reference software is not consent. The privileged position to observe DNSseed queries should not be used for research purposes.
While I do agree on the main purpose of the DNSSeed infrastructure, I do also believe that valuable insights can be gained from analyzing the queries, that might ultimately benefit Bitcoin itself.
I'm proposing the BGP prefix as a lower bound on the granularity of the collected data, which I believe is far more reasonable than an outright ban on any collection and use of this data. Data that by the way is still available to anyone with enough time to build a crawler of the network.
According to your stance any research on the Bitcoin system would require the consent of all participants. This includes all research into message propagation and into protocol optimizations. Without the ground truth gathered by measurements we cannot improve the system.
A crawler returns nothing on clients which are either not listening/advertising or on full nodes which are not advertising and are only listening to a subset of the network. The document already makes an effort to separate data produced from crawling without the aid of a privileged position in the network.
Without reaching any conclusions on the merits of collecting data on users, what you're proposing is equivalent to an explicit phone-home feature, but less transparent and less subject to user choice and privacy controls because it silently piggybacks on existing infrastructure.
If it were desirable to have phone-home user monitoring it should be done as an explicit feature so that its privacy implications can be carefully considered and so that it could be disabled independently of the rest of the system.
I had no idea that anyone would have thought it even remotely acceptable to utilize the privileged position of operating a DNS seed to track users, but the purpose of this document is to minimize those sorts of surprises.
The address of a node will be forwarded independently of whether it is listening to the network or not, you can collect the IP addresses of all nodes simply by listening for incoming addr messages.
I am trying to make it possible to make anonymized information available to the public in a controlled fashion in order to level the playing field. It is likely that some people will make use of the information they collect, ignoring the policy because it cannot be enforced by technical means. Without an agreed upon way to make this information available we actually increase the disparity between seed providers that have the privileged access and the rest that do not.
That being said, my seed will adhere to the policy, should the other seed providers agree to do so. I would however prefer if a fair-use policy were put in place that allows anonymized data to be used for research.
Just to be clear: this section is about logging of received DNS queries, not about crawling or data gathered through crawling.
In that respect, DNS seeds are in a privileged position: they get information about who is running Bitcoin nodes at all, rather than only about who is running a reachable full node.
Just a point on the behavior of the software:
> The address of a node will be forwarded independently of whether it is listening to the network or not, you can collect the IP addresses of all nodes simply by listening for incoming addr messages.
It will be forwarded if the node announces itself. If the node does not announce, there will be nothing to forward. Setting listen=0, for example, disables announcements... or just being a non-full node client will also result in no announcements.
Hm, I didn't know that. That assumes all nodes behave like Bitcoin Core, though: if a node decides to send an addr on behalf of one of its peers, then that will be forwarded, right? That would explain some strange behavior I saw a few weeks ago, but that's probably off-topic, sorry :-)
> on behalf of one of its peers then that will be forwarded, right?
It will be, but that's broken. It isn't how the protocol works; any implementation doing that would be broken, and it would be mildly harmful to the network. Of course, nodes can do malicious things, and the system is generally robust against them.
OK - so anonymous statistics (collected from DNS queries) are not allowed either. Do we need to reword anything to make that clear?
How about rephrasing the first sentence to "Any logging of DNS queries, or derivatives thereof, must be limited to the scope necessary for the operation of the service." to explicitly include statistics and aggregations.
I find the part about urgent health of the network confusing, and it might create loopholes.
I think that aggregation/logging of total queries (nothing broken down by any IP range) may be useful in the longer term to monitor service health.
I guess for the operation of the seeds we have a limit on the retention time in place.
I started asking about statistics and aggregate data because like you I believe there is a use for some statistics, if handled carefully.
We need to define two granularity levels:
What I gathered so far from the discussion is that any logging is ok as long as it is strictly needed for the operation of the seed (still a bit vague for my taste) and we have a limited retention time. For the granularity of data to be released to the public some are pushing for total silence.
I'm ok with these if everybody agrees, but the current formulation is ambiguous as it concentrates on logs of individual queries, which is why I started asking about aggregated data.
The intention is to absolutely prohibit using this as a privileged position to monitor users. "Aggregates" have a long history of surprising outcomes, especially because the user's threat model might not be what the aggregator is thinking about.
If we wanted to enable user-monitoring we would do so with a separate service specifically designed and intended for that so it could be transparent about its operation, so that the risks could be maximally mitigated, and so that users could opt out of it without otherwise degrading the operation of the software.
"Necessary for the operation" was intended to cover things like measuring traffic levels for capacity planning or investigating high load (e.g. to figure out how to block a DOS attack), or dumping queries that fail for software troubleshooting. Since it seems to be enabling some confusion here I'll think up some other language.
Aggregates that are OK: total number of queries, bandwidth up/down, CPU load, I/O load - these don't discriminate clients in any way, and require no logging of possibly identifying (meta)data.
Aggregates that are not OK: counting requests per country, client OS/program, number of unique IPs - these require acting on request contents or metadata.
Ack, sounds reasonable, thanks for clarifying. I would however allow counting queries by type (A, AAAA, SRV, ...) since at least in my case that helped debug quite a few issues with my DNS seed.
That all sounds reasonable to me too. I'd prefer to see that spelled out explicitly in the document though.
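As one way of spelling that out, the distinction between the two kinds of aggregates can be sketched as a counter that never sees the client address. This is an illustrative assumption, not code from any existing seeder: the `SeedStats` class and its field names are hypothetical, and the per-type counter reflects the exception suggested above rather than settled policy.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SeedStats:
    """Hypothetical aggregate-only counters for a DNS seed.

    Nothing derived from the client address (IP, country, unique-client
    counts) is recorded, so no identifying metadata needs to be logged.
    """

    total_queries: int = 0
    bytes_in: int = 0
    bytes_out: int = 0
    # Per-record-type counts (A, AAAA, SRV, ...), the exception suggested
    # in the thread for debugging purposes.
    by_qtype: Counter = field(default_factory=Counter)

    def record(self, qtype: str, request_size: int, response_size: int) -> None:
        # The client address is deliberately not a parameter of this method.
        self.total_queries += 1
        self.bytes_in += request_size
        self.bytes_out += response_size
        self.by_qtype[qtype] += 1


stats = SeedStats()
stats.record("A", 45, 210)
stats.record("AAAA", 45, 320)
stats.record("A", 45, 210)
print(stats.total_queries, stats.bytes_out, dict(stats.by_qtype))
```

Keeping the address out of the `record` signature makes the "not OK" aggregates (per country, per client, unique IPs) impossible to compute by accident, rather than relying on the operator to discard that data later.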
+infrastructure. Any hosting services contracted by the operator are
+equally expected to uphold these expectations.
+
+1. The DNSseed results must consist exclusively of fairly selected and
+functioning Bitcoin nodes from the public network to the best of the
+operators understanding and capability.
"Fairly selected" might be good to define, but maybe not very easy to. Some seed nodes may out of necessity only index IPv4 or IPv6, so discrimination based on IP/address in general can't be prohibited. OTOH, leaving this undefined may be better just to avoid loopholes around common sense...
@@ -0,0 +1,49 @@
+Expectations for DNSSeed operators
+====================================
+
+Bitcoin Core attempts to minimize the level of trust in DNS seeds,
+but DNS seeds still pose a small amount of risk for the network.
+Other implementations of Bitcoin software may also use the same
+seeds and may be more exposed. In light of this exposure this
+document establishes some basic expectations for the expectations
+for the operation of dnsseeds.
Nit: If you call them a DNSSeed, you should write DNSSeeds here, IMO.
Yeah that's my fault. I find 'DNSseed' difficult to read, and changed it in the initial sentence, then saw it was used all over the place and forgot to change it back here.
I vote for just 'DNS seed'.
I also vote for DNS seed.
Changed to DNS seed everywhere
ACK
Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/p4566_0a0878d43a2e7db9c41b20ba1d3eb714fd6806c4/ for binaries and test log. This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/ Contact BlueMatt on freenode if something looks broken.