
committers: add AI policy
Needs Revision · Public

Authored by dch on Jun 2 2025, 1:27 PM.

Details

Reviewers
glebius
olivier
lwhsu
Group Reviewers
Core Team
Summary
Test Plan

open questions & comments

  • should we allow documentation translation via AI?
  • contributing AI tools to ports is already permitted
  • should we be clearer in the general committers guide that you need to be 100% clear and transparent about the origin of your code/patches/contributions?
  • can I *use* AI/LLM tooling to help me with commit messages, checking my language and style?

Diff Detail

Repository
R9 FreeBSD doc repository

Event Timeline

dch requested review of this revision. Jun 2 2025, 1:27 PM
dch created this revision.
dch edited the test plan for this revision. (Show Details)
This revision is now accepted and ready to land. Jun 2 2025, 6:40 PM
olivier requested changes to this revision. (Edited) Jun 9 2025, 10:47 AM
olivier added a subscriber: olivier.

About the documentation, comments in code or commit message: Is using AI to fix my English forbidden too ?

This first sentence was written by non-native-English me, but for documentation or commit message, I might ask the AI to "fix my English," and the AI result will be something like this:

"Am I also prohibited from using AI to correct my English?"

I'm asking because I have dyslexia, which is a serious issue when you need to write in French (where correct writing is mandatory in French culture). Therefore, I'm accustomed to using software to check for all grammar and orthographic errors. However, since these tools are now AI-based, does that mean we can't use them either?

This revision now requires changes to proceed. Jun 9 2025, 10:47 AM
lwhsu requested changes to this revision. Jun 9 2025, 3:48 PM
lwhsu added a subscriber: lwhsu.

I fully agree that the biggest issue, which must be solved, is the license concern, but putting "expressly forbidden" on a tool because of its current limitations is too narrow. I believe the spirit of the project is to be more inclusive, as long as the contribution meets the requirements, e.g. the license, quality, conventions, etc.

Rather than phrasing it negatively, I would prefer to draw a clear line on the requirements for any kind of contribution to the project; it is the contributor's responsibility to follow them, and the committer's responsibility to verify them.
It is not about the tool itself, but about whether the contributor (a committer is also a kind of contributor) uses it in a correct way. You must have full knowledge of, and responsibility for, what you commit to the project repository.

https://www.apache.org/legal/generative-tooling.html
https://www.linuxfoundation.org/legal/generative-ai

> About the documentation, comments in code or commit message: Is using AI to fix my English forbidden too ?
>
> This first sentence was written by non-native-English me, but for documentation or commit message, I might ask the AI to "fix my English," and the AI result will be something like this:
>
> "Am I also prohibited from using AI to correct my English?"
>
> I'm asking because I have dyslexia, which is a serious issue when you need to write in French (where correct writing is mandatory in French culture). Therefore, I'm accustomed to using software to check for all grammar and orthographic errors. However, since these tools are now AI-based, does that mean we can't use them either?

Me too! The core issue here is not "what tool am I using?" but:

  • does this change the provenance of this contribution?
  • can I still provide a personal commitment that the attribution is still mine?

I think orthographic, spelling, and similar "assistive" tooling is fair, assuming it still meets that bar.

dch edited the test plan for this revision. (Show Details)
dch edited the summary of this revision. (Show Details)

> I fully agree that the biggest issue, which must be solved, is the license concern, but putting "expressly forbidden" on a tool because of its current limitations is too narrow. I believe the spirit of the project is to be more inclusive, as long as the contribution meets the requirements, e.g. the license, quality, conventions, etc.

If you leave a crack open then that crack will be exploited, intentionally or otherwise, and then our "97% BSD licensed, fairly attributed" codebase becomes "YOLO AI-License".

The lesson we learned from 5600 words of GPL licensing is that clarity & simplicity matter a great deal.

When you read the 1100-word ASF one closely, you will find that in *every* case, it's still No, unless you can reasonably show that it's actually OK. It just takes more examples and many more words to say so:

https://www.apache.org/legal/generative-tooling.html

## Can contributions to ASF projects include AI generated content?
...

Given the above, code generated in whole or in part using AI can be contributed if
the contributor ensures that:

    1. The terms and conditions of the generative AI tool do not place any restrictions
        on use of the output that would be inconsistent with the Open Source Definition.
    2. At least one of the following conditions is met:
        2.1. The output is not copyrightable subject matter (and would not be even if
            produced by a human).
        2.2. No third party materials are included in the output.
        2.3. Any third party materials that are included in the output are being used with
            permission (e.g., under a compatible open-source license) of the third party
            copyright holders and in compliance with the applicable license terms.
    3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI
            tool itself provides sufficient information about output that may be similar to
            training data, or from code scanning results.
...
## What About Documentation?
The above text applies to documentation as well. 
...
## What About Images?
As with documentation, the above principles would still apply.

Same for the Linux Foundation one:

https://www.linuxfoundation.org/legal/generative-ai

If any pre-existing copyrighted materials (including pre-existing open source code) authored
or owned by third parties are included in the AI tool’s output, prior to contributing such
output to the project, the Contributor should confirm that they have permission from the
third party owners–such as in the form of an open source license or public domain declaration
that complies with the project’s licensing policies–to use and modify such pre-existing
materials and contribute them to the project. Additionally, the contributor should provide
notice and attribution of such third party rights, along with information about the applicable
license terms, with their contribution.

> Rather than phrasing it negatively, I would prefer to draw a clear line on the requirements for any kind of contribution to the project; it is the contributor's responsibility to follow them, and the committer's responsibility to verify them.

Yes, we should have this in the contributors / committers guide. I think in this case, clarity
and simplicity matter. Having a decent FAQ with a bunch of examples is fine, but we should end
up in the same place:

*if you can't be certain 100% of provenance and attribution, then this is not suitable for inclusion*

> It is not about the tool itself, but about whether the contributor (a committer is also a kind of contributor) uses it in a correct way. You must have full knowledge of, and responsibility for, what you commit to the project repository.

I disagree. The *tool* is everything. If you didn't produce this content (code, docs, whatever) yourself then how can you *guarantee* the provenance & attribution? How can you present this content as under *your* copyright when you didn't even produce it?

If and when there are AI tools that can provide provenance & attribution, then we should revisit this position, but as of today, I am not aware of any. If somebody trained an LLM entirely on the BSD-licensed history of this project, then arguably that would be fair play for inclusion & usage.

documentation/content/en/articles/committers-guide/_index.adoc
2391

"The phrase 'contribute material generated by AI' is too broad:

  • Should we allow AI to format style? This isn't coding, but it helps with boring tasks. So are spaces and tabs considered 'material generated by AI' as well?
  • If your AI-assisted IDE detects typos in your code, missing braces, etc., and you just click 'yes fix it' to correct them, are these minor corrections considered 'material generated by AI' too?

Here, we need to forbid AI-generated 'algorithms,' rather than all 'material.'"

> I disagree. The *tool* is everything. If you didn't produce this content (code, docs, whatever) yourself then how can you *guarantee* the provenance & attribution? How can you present this content as under *your* copyright when you didn't even produce it?

You can do this in a lot of ways.

First, you can ignore the third-party-code question unless third-party code is obviously there. This is largely a red herring based on extreme examples posted early on.
Second, the code might not be copyrightable, even if it were produced by a human. Here our current policy must apply, and I can guarantee it, at least as well as I could before.
Third, if the GenAI took my original content and improved it, and I selectively applied the recommendations, then I've exercised enough control over the process to have a copyright. I can guarantee it as much as I can with a spell checker and grammar checker; this is just a more advanced use of that kind of tool. I'm still creatively producing the content.
Fourth, if I asked a GenAI to find a bug, and it does, then I can contribute that. Most bug fixes don't qualify for copyright, even when produced by humans (this is a specific example of the above).
Fifth, if I use GenAI to create a template and I then fill it in, I can contribute it: my filling in the template is transformative enough that I have a copyright. The generated template wouldn't qualify for copyright because it was produced by GenAI, but also because it's scènes à faire: the template/boilerplate parts of the code that have to be there due to the needs of the form.
Sixth, if I tell GenAI to make me something and then substantially rewrite it, I have a copyright, just like I do today. The original is in the public domain, and my transformation is likely covered by copyright because it's transformative, every bit as much as when CSRG rewrote the AT&T parts of the Unix kernel one plank at a time.
Seventh, there is no absolute guarantee, even today. I've contributed code to FreeBSD while working for someone who thought they owned everything I did and made me sign paperwork to that effect. But a lawyer told me that if I did it on my own time, on my own machines, it was mine to contribute. There might have been a difference of opinion about who owned it, but I felt confident that I could defend my ownership in court. It wasn't 100%, though.
Eighth, if I write something but use AI to test it, and that testing finds a ton of bugs, then that's fine. The work is still my own, even though AI was used to produce it. The analog today is producing a ton of patches for POSIX compliance so we pass a POSIX compliance test suite that I can't distribute.
Ninth, I use GenAI to reformat my code to comply with FreeBSD's style(9) after it has been trained on the FreeBSD source base. This is just a more advanced version of indent(1) and doesn't inject any real creative value, even if it were to rewrite 3/4 of the file because my free-form code totally sucked. Of course, I'd have to verify that the style changes were, in fact, just style changes and not semantic ones, just like I'd do with clang-format or indent.
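As a minimal sketch of the kind of verification the last point calls for: strip all whitespace from the before and after versions and compare the remaining token streams. A style-only reformat leaves them identical. This is a rough heuristic, not a proof of semantic equivalence, and in this sketch the "formatter" is simulated with sed(1); in practice you would run clang-format(1) or indent(1), and the file names here are purely illustrative.

```shell
#!/bin/sh
# Sample input: a one-line C program standing in for real code.
cat > prog.c <<'EOF'
int main(void){return 0;}
EOF

# Simulated formatter pass: a layout-only rewrite (sed stands in for
# clang-format or indent, which would be used in practice).
sed 's/{/{\n\t/; s/;}/;\n}/' prog.c > prog.fmt.c

# Compare the two files with all whitespace removed. A style-only
# change leaves the token stream identical; anything else is suspect.
tr -d ' \t\n' < prog.c     > orig.tok
tr -d ' \t\n' < prog.fmt.c > fmt.tok
if cmp -s orig.tok fmt.tok; then
	echo "style-only change"
else
	echo "semantic difference - review carefully"
fi
```

Note the caveat: whitespace inside string literals is semantically significant, so this check can miss such edits; compiling both versions and comparing the object files is a stronger test.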

So when looking at AI, I think we should look at it through the same lenses we look at other contributions by and large.

So using AI tools to improve your work is fine, legal, and relatively low risk. Using AI to do all the work isn't. Somewhere in between these two extremes is the 'safe enough' dividing line. We already tolerate some risk in our IP (as do all open source projects); in fact, we tolerate a lot more risk than other projects. We lack a contributor agreement, we don't enforce Signed-off-by lines, etc. Any AI use policy has to balance risk vs reward.

There's also the currently proposed list of rules for Linux kernel contributions: https://www.phoronix.com/news/Linux-Kernel-AI-Docs-Rules has an article covering them, with links to the original material and discussions.

Just like we say "if you write more than 20% of a file you can add a copyright, generally" or "if you rewrite more than 75-85% of the file, you can replace copyrights in general", we should have more concrete guidelines. Though I'll admit these aren't in the committers guide, and they are only rules of thumb, not absolutes. There are times when adding a few percent can be so meaningful that one can add a copyright, and other times when an 80% rewrite doesn't replace all the creative content that was there originally. And these rules are imperfectly applied: there are lots of files that retained a copyright when it was just the template that was used in the copying, or years after all the original content was removed (arm and mips support files were often like this).

Or put another way: if I use GenAI to correct my grammar, and it adds a few commas, fixes spelling, and rewords an awkward phrase (10 words out of 200), then that's clearly fair use. Even if I write a commit message in Italian and use GenAI to translate it to English, that's still transformative (and copyright law recognizes that the original author has some copyright interest in translations). And how is that different from using Google or Bing to do the translation? You're effectively stating that the use of GenAI somehow strips off my copyright in all cases, when there are a lot of times when it clearly doesn't.

So having such harsh, heavy-handed wording as is in the review at the moment strikes me as somewhat out of touch with all the other nuanced bits of analysis we need to do with other content presented to us for inclusion in FreeBSD.

There are no real absolute guarantees before this policy change, just a million shades of gray that are close enough for people to be comfortable using FreeBSD.