CONTRIBUTING: Caution against using AI/LLMs (ChatGPT, Copilot, etc)

luke-jr commented at 11:05 pm on July 27, 2023: member

There have been at least a few instances where someone tried to contribute LLM-generated content, but such content has a dubious copyright status.

Our contributing policy already implicitly rules out such contributions, but being more explicit here might help.

CONTRIBUTING: Caution against using LLMs 08f9f62dc4

DrahtBot commented at 11:05 pm on July 27, 2023: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	kevkevinpal
Concept NACK	ryanofsky, glozow
Concept ACK	jonatack, russeree, Sjors, petertodd

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

jonatack commented at 11:08 pm on July 27, 2023: member

Concept ACK, makes sense, though IANAL.

DrahtBot added the label CI failed on Jul 28, 2023

russeree commented at 2:01 am on July 28, 2023: contributor

Concept ACK.

The two thoughts that come to mind are that

but not limited to, ChatGPT, GitHub Copilot, and Meta LLaMA

1. This section could become a cat-and-mouse game between the various models, and it is unclear which of them will still be relevant in the future.
2. LLMs are currently the benchmark for text-to-text models, but there are other types of models as well, for example NLP and RNN models. So this language could become obsolete over time.

This post was written by GPT4 … Just Kidding.

in CONTRIBUTING.md:245 in 08f9f62dc4

237 @@ -238,3 +238,9 @@ By contributing to this repository, you agree to license your work under the
238 MIT license unless specified otherwise in `contrib/debian/copyright` or at 
239 the top of the file itself. Any work contributed where you are not the original 
240 author must contain its license header with the original author(s) and source.
241+
242+If you do not know where the work comes from and/or its license terms, it may
243+not be contributed until that is resolved. In particular, anything generated by
244+AI or LLMs derived from undisclosed or otherwise non-MIT-compatible inputs
245+(including, but not limited to, ChatGPT, GitHub Copilot, and Meta LLaMA) cannot

ariard commented at 3:11 am on July 28, 2023:

I would recommend dropping any reference to a corporate entity, or one of its products, in Bitcoin Core documentation, to avoid mischaracterizing what they’re doing (whatever one’s personal opinion).

We already mention GitHub a lot in CONTRIBUTING.md, though only as the technical platform where contributions happen, without taking a stance on any of their products.

luke-jr commented at 9:03 pm on July 28, 2023:

I mentioned these specifically because:

ChatGPT is the most popularly known, and most likely to be searched for if someone is considering using it.
GitHub promotes use of Copilot heavily, and we are using GitHub.
Meta is falsely advertising LLaMA as open source, and many people are just believing that without verifying. (The source code is not available, and the license is not permissive)

Sjors commented at 11:01 am on July 29, 2023:

I think it’s fine to mention these examples.

ariard commented at 11:36 pm on July 31, 2023:

I think a) there is no certainty that ChatGPT / LLaMA will be the most popular frameworks 12 / 18 months from now, and I don’t think we’re going to update the contributing rules every time, and b) Meta is a registered trademark of a commercial entity, and I think it’s better not to give the appearance that Bitcoin Core the project is supportive of, supported by, or linked to Meta in any way.

kevkevinpal commented at 3:42 am on July 28, 2023: contributor

Concept ACK 08f9f62

Mine was one of the mentioned PRs: #28101 (review)

Would make sense to have this in the CONTRIBUTING.md

ariard commented at 3:53 am on July 28, 2023: contributor

To be honest, after looking at some LLM terms of service and with a basic knowledge of copyright law, there is uncertainty about the status of LLM output. It seems that LLM or AI operating platforms do not claim in their terms of service to own the intellectual property of the LLM output, and even if they did, it would probably be an unfounded claim. A user might contribute an “original” or “creative” element by sending an individual prompt request, which is a determining factor in any matter of intellectual property rights assignment.

To the best of my knowledge, there has been no legal precedent on the matter in any major jurisdiction. However, there are ongoing proposals to rework the legal framework on data use and AI (at least in the EU), and this will probably change the question.

My personal opinion would be to leave the contributing rules unchanged for now and look again in 24 / 36 months, when there is more clarity on the matter, if any.

DrahtBot removed the label CI failed on Jul 28, 2023

ariard commented at 7:52 pm on July 28, 2023: contributor

Sent a mail to Jess from the Bitcoin Legal Defense Fund to collect more legal opinions with the context. luke-jr (luke@dashjr.org) and jonatack (jon@atack.com) should be cc’d.

Sjors commented at 11:13 am on July 29, 2023: member

Concept ACK, but happy to wait for legal opinions. Hopefully they clarify the risks in two separate categories:

1. Content the AI obtained somewhere else without permission (i.e. claims from the original author).
2. Content the AI generated itself, potentially owned by some corporation that didn’t give the person making the pull request permission to license it under MIT.

When it comes to (1) I’m more worried about snippets of fresh code than e.g. suggested refactorings. I don’t see how one can claim copyright over e.g. the use of std::map over std::vector.

When it comes to (2), in a way it’s not a new risk. A contributor could already have a consultant standing next to them telling them what to write. It could then turn out that the code belongs to that consultant (who didn’t license it under MIT) and not the contributor (who at least implicitly did). But these AI companies have a lot more budget to actually legally harass the project over such claims.

petertodd commented at 11:51 am on July 29, 2023: contributor

ACK

The copyright lobby is pretty strong, and stands to lose a lot from AI. I think there’s a significant chance that AI copyright gets resolved in favor of copyright owners in a way that is disastrous for AI. Just look at how the copyright lobby managed to keep extending the duration of copyrights worldwide to ludicrous, economically irrational lengths until very recently.

Also, AI poses unknown security threats. It frequently hallucinates incorrect answers. Bitcoin Core is a type of software that needs to meet very stringent reliability standards, so much so that review is usually more work than writing the code. Saving time writing the code doesn’t help much, and poses as yet unknown risks.

ryanofsky commented at 5:11 pm on July 31, 2023: contributor

NACK from me, because I think legal questions like this are essentially political questions, and you make absurd legal outcomes more likely to happen by expecting them to happen, and by writing official documentation which gives them credence.

If the risk is that OpenAI or Microsoft could claim copyright over parts of the Bitcoin codebase, that would be absurd because their usage agreements assign away their rights to the output and say it can be used for any purpose.

If the risk is that someone else could claim copyright over parts of the bitcoin codebase, like in the SCO case (https://en.wikipedia.org/wiki/SCO%E2%80%93Linux_disputes), that would also be an absurd outcome, which would have bigger repercussions beyond this project, and could happen about as easily without an LLM involved.

ariard commented at 11:46 pm on July 31, 2023: contributor

I got feedback from the Bitcoin Legal Defense Fund: they need more time to analyze the issue, though they consider it an important one.

I additionally cc’d Andrew Chow and Fanquake as maintainers on the mail thread for open-source transparency.

I think that, whatever one’s individual political opinion on copyright or tolerance for legal risk, the fact is that we already have dubious copyright litigation affecting the project, so it’s reasonable to wait for the risks to be clarified before changing the contributing rules on the usage of AI / LLM tooling.

fanquake commented at 9:33 am on August 1, 2023: member

I mostly agree with @ryanofsky.

The reality is that going forward it’ll be essentially impossible to avoid contributions that may include output from AI/LLMs, just because (in almost all cases) it’ll be impossible to tell, unless the author makes it apparent.

We certainly don’t want to end up in some situation where contributors are trying to “guess” or point out these types of contributions, or end up with reversion PRs (incorrectly) trying to remove certain content.

If we end up with an opinion from the BLDF then maybe we can consider making an addition to our license, if necessary.

ryanofsky commented at 1:04 pm on August 1, 2023: contributor

If we end up with an opinion from the BLDF then maybe we can consider making an addition to our license, if necessary.

+1. If we have professional advice to change the license or add a separate policy document or agreement like a CLA, we should consider doing that. But we shouldn’t freelance and add legal speculation to developer documentation.

In this case and in general I think a good strategy is to:

First, focus on doing the right thing morally. If a contribution includes content that seems plagiarized, or not credited properly, or is not fair to someone, we should not include it.
Second, try not to make political mistakes. Avoid doing things that would be unpopular broadly or would offend a particular group of people and provoke an attack. Avoid waving meat in front of hyenas and taking actions that could give credibility to absurd legal claims.
Third, try not to innovate. Have a software license, follow professional advice, maybe participate in a patent network. Avoid doing things that are speculative or new and not obvious wins.

mzumsande commented at 10:14 pm on August 1, 2023: contributor

The reality is that going forward it’ll be essentially impossible to avoid contributions that may include output from AI/LLMs, just because (in almost all cases) it’ll be impossible to tell, unless the author makes it apparent.

Maybe making it apparent is part of the problem. There is no requirement to state publicly which technical tools were involved in a contribution, so for now it might be best if everyone would just use their favourite LLM helpers silently (as, I am sure, many contributors already do!).

fanquake marked this as a draft on Aug 3, 2023

fanquake commented at 9:56 am on August 3, 2023: member

Moved to draft for now, as there’s no consensus to merge as-is, and in any case, this is waiting on further legal opinions.

DrahtBot added the label CI failed on Oct 15, 2023

DrahtBot removed the label CI failed on Oct 16, 2023

DrahtBot added the label CI failed on Oct 25, 2023

glozow commented at 11:34 am on December 21, 2023: member

Agree with the intent of avoiding legal problems, but NACK on adding this text. Unless we have some kind of legal guidance saying this text would protect us beyond what our existing license-related docs say, I don’t see any reason to discourage specific tools in the contributing guidelines. I agree with the above that trying to speculate/innovate can do more harm than good.

I think we should close this for now and reconsider if/when a lawyer advises us to do something like this.

DrahtBot requested review from jonatack on Dec 21, 2023

fanquake commented at 5:47 pm on January 5, 2024: member

Closing this for now.

fanquake closed this on Jan 5, 2024

bitcoin locked this on Jan 4, 2025

CONTRIBUTING: Caution against using AI/LLMs (ChatGPT, Copilot, etc) #28175
