
Mk/Scripts: Add SPDX license file normalizer and matcher
Needs Review · Public

Authored by bofh on Jul 18 2025, 9:06 PM.

Details

Reviewers
imp
Group Reviewers
portmgr
Summary
This commit introduces a new script `check_spdx.lua` to support
automated SPDX license verification and matching for FreeBSD ports. The
goal is to improve the accuracy and consistency of LICENSE_FILE entries
by comparing their normalized content against official SPDX license
templates.

Key features:

- Supports `LICENSE_FILE`, `LICENSE_FILE_<LICENSE>`, and fallback SPDX
  header scanning (`SPDX-License-Identifier`) within WRKSRC sources.
- Normalizes both the LICENSE_FILE and SPDX license templates using a
  Python-compatible preprocessing pipeline (implemented in flua) that:
  - Removes URLs, comments, copyright notices
  - Canonicalizes typographic and spelling variants (e.g., licence →
    license, organisation → organization)
  - Replaces fancy quotes and normalizes whitespace
- Computes similarity via the Sørensen–Dice coefficient and reports the
  top matches.
- Gracefully handles:
  - Missing or improperly defined LICENSE_FILE
  - Multiple LICENSE values with LICENSE_COMB and no per-license file
    mapping
  - Ports without any declared license metadata (using `-s` to search
    source files)
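
The normalization and scoring steps listed above can be sketched roughly as follows. This is an illustrative Python sketch, not the flua code from the diff: the names `normalize` and `dice`, the regexes, the two-entry spelling table, and the choice of character bigrams for the Sørensen–Dice coefficient are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical spelling table; the script's real list is not shown in this review.
SPELLING = {"licence": "license", "organisation": "organization"}
# Typographic quotes mapped to their ASCII equivalents.
QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize(text):
    """Reduce a license text to one canonical line (illustrative only)."""
    text = text.lower().translate(QUOTES)
    # Drop URLs.
    text = re.sub(r"https?://\S+", " ", text)
    # Drop lines that start with a copyright notice.
    text = re.sub(r"^\s*copyright\b.*$", " ", text, flags=re.MULTILINE)
    # Canonicalize spelling variants.
    for variant, canonical in SPELLING.items():
        text = re.sub(rf"\b{variant}\b", canonical, text)
    # Collapse all whitespace runs, which also joins the text into one line.
    return " ".join(text.split())


def bigrams(s):
    """Multiset of adjacent character pairs in s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a, b):
    """Sørensen–Dice coefficient over character bigrams, in [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are shorter than two characters.
        return 1.0 if a == b else 0.0
    overlap = sum((x & y).values())
    return 2.0 * overlap / total
```

Under these assumptions, a LICENSE_FILE would be scored as `dice(normalize(license_text), normalize(template_text))` against each template, with the highest scores reported first.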

SPDX license templates are cached in /var/db/ports-licenses/normalized/
to avoid repeated downloads. The matcher is modeled on prior work from
the spdx-license-matcher[1] project and reimplemented here in flua for
native use in the ports tree.
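
A minimal sketch of the caching idea, in Python rather than flua. The function name `cached_template`, the injected `fetch` callable, and the `<id>.txt` file naming are hypothetical; the summary states only that normalized templates are kept under /var/db/ports-licenses/normalized/.

```python
from pathlib import Path


def cached_template(license_id, fetch, cache_dir):
    """Return the normalized template for license_id, downloading at most once.

    fetch is a caller-supplied function (hypothetical here) that takes a
    license identifier and returns the normalized template text.
    """
    path = Path(cache_dir) / f"{license_id}.txt"  # file naming is an assumption
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch(license_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text
```

Passing the cache directory as a parameter keeps the sketch testable without write access to /var/db.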

This tool relies on `libucl` for parsing the official SPDX license JSON
index. FreeBSD 15 includes `libucl` in the base system with flua
bindings at `/usr/lib/flua/ucl.so`. On earlier FreeBSD versions, users
must install `textproc/libucl` from ports to provide the required Lua
bindings (typically in `/usr/local/lib/lua/5.4/ucl.so`).

Makefile integration logic automatically:
- Invokes the checker based on LICENSE / LICENSE_FILE settings
- Falls back to scanning source files for SPDX headers if no license
  file is found
- Skips SPDX matching when LICENSE_COMB=dual/multi is used with a single
  LICENSE_FILE
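
The decision logic above could look roughly like this. It is a Python sketch under stated assumptions, not the Makefile code from the diff: the function names, the returned action strings, and the exact regex for the `SPDX-License-Identifier` tag are all hypothetical.

```python
import re

# Matches an SPDX tag line and captures the license expression, ignoring a
# trailing C-style comment terminator. The exact pattern is an assumption.
SPDX_TAG = re.compile(
    r"SPDX-License-Identifier:[ \t]*(.+?)[ \t]*(?:\*/)?[ \t]*$",
    re.MULTILINE,
)


def find_spdx_tags(source_text):
    """Return every SPDX license expression tagged in source_text."""
    return SPDX_TAG.findall(source_text)


def choose_action(license_comb, license_file, per_license_files):
    """Pick what the checker should do for one port (illustrative only).

    license_comb: value of LICENSE_COMB, or None when unset
    license_file: value of LICENSE_FILE, or None when unset
    per_license_files: dict mapping license name to its LICENSE_FILE_<LICENSE>
    """
    if not license_file and not per_license_files:
        # No license file at all: fall back to scanning sources for SPDX tags.
        return "scan-sources"
    if license_comb in ("dual", "multi") and not per_license_files:
        # One file cannot be matched against several licenses at once.
        return "skip"
    return "match-files"
```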

This addition helps validate license declarations, identify
misattributed or legacy licenses, and assist maintainers in migrating
toward SPDX-aligned metadata.

If the similarity score is not exactly 1.000, the file is not a verbatim
copy of any SPDX license template. Manual review is advised in such
cases.

[1] https://github.com/spdx/spdx-license-matcher
Test Plan

Run `make check-spdx-license` inside the directory of any port that sets LICENSE_FILE*.

Diff Detail

Repository
R11 FreeBSD ports repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

bofh requested review of this revision. Jul 18 2025, 9:06 PM
bofh created this revision.
bofh edited the summary of this revision. (Show Details)

Fix some corner cases of normalization.

I've tested a few random packages in ports.

  • It's a bit confusing that check-spdx-license compiles things when one might just want to check the license. What I mean is that it should just extract and check the license file in the current port directory, not try to compile the whole package.
  • Another thing that might make this easier for users to understand: if a match is 1.0 (which I understood to mean 100%), is it still relevant to show the other possible licenses?
  • It would be nice to see a unified diff if the license file differs from the license that is declared in the port's Makefile.

> I've tested a few random packages in ports.
>
>   • It's a bit confusing that check-spdx-license compiles things when one might just want to check the license. What I mean is that it should just extract and check the license file in the current port directory, not try to compile the whole package.

I think that is what this is doing. It's extracting and checking the license. There is no compiling involved.

>   • Another thing that might make this easier for users to understand: if a match is 1.0 (which I understood to mean 100%), is it still relevant to show the other possible licenses?

Yes, I still have that code commented out somewhere in the scripts, and I am waiting for feedback from others.

>   • It would be nice to see a unified diff if the license file differs from the license that is declared in the port's Makefile.

This is not easy to do. We currently normalize each entire document into a single line to avoid any ambiguity, so a unified diff is not helpful: it prints two huge lines. To get something really readable we would have to break the text into lines again. I will look into it, but no promises.
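
One possible way around the single-line problem, sketched in Python rather than flua: split each normalized text into one word per line before diffing, so an inserted or changed word does not shift every following line. The function name `word_diff` and the `spdx-template` / `license-file` labels are hypothetical.

```python
import difflib


def word_diff(expected, actual, context=2):
    """Unified diff of two normalized single-line texts, one word per line."""
    return list(
        difflib.unified_diff(
            expected.split(),
            actual.split(),
            fromfile="spdx-template",
            tofile="license-file",
            lineterm="",
            n=context,
        )
    )
```

Calling `print("\n".join(word_diff(template, license_text)))` would show only the changed words with a couple of words of context on each side. Whether this is readable enough for long license texts is an open question.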

>> I've tested a few random packages in ports.
>>
>>   • It's a bit confusing that check-spdx-license compiles things when one might just want to check the license. What I mean is that it should just extract and check the license file in the current port directory, not try to compile the whole package.
>
> I think that is what this is doing. It's extracting and checking the license. There is no compiling involved.

Then it's just me. I find it confusing that some of the tested ports, like the Go ones, print so much output that it looks like something is starting to compile.

I've commented on the code a little bit. If the comments don't seem relevant, just ignore them. The code is clean and seems to do what it should without much hassle. Some more commenting would not hurt, though.

Mk/Scripts/check_spdx.lua
41

Where do these colors come from? They don't seem to be ANSI escape colors.

61

Should there be a global dprint for all Lua scripts in Mk/Scripts?

167

I didn't get the idea behind this one.

180

Would this be faster as one big regex than as a for-loop? I admit it would be large, and I don't know if Lua can handle nested regexes, but it would take only one pass to make every change.
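
For comparison, the single-pass idea can be sketched in Python. Lua's standard string patterns have no alternation operator, so this would not carry over to flua directly. The replacement table here is hypothetical (the real list lives in check_spdx.lua and is not shown in this thread), as is the name `canonicalize`.

```python
import re

# Hypothetical variant table, for illustration only.
VARIANTS = {
    "licence": "license",
    "organisation": "organization",
    "favour": "favor",
}

# One alternation covering every variant; longest first so that a longer
# variant is never shadowed by a shorter one sharing its prefix.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(v) for v in sorted(VARIANTS, key=len, reverse=True))
    + r")\b"
)


def canonicalize(text):
    """Replace every known variant in a single pass over text."""
    return _PATTERN.sub(lambda m: VARIANTS[m.group(1)], text)
```

Whether one pass is actually faster than the loop would need measuring.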

236

I understand this is only used here, but does having a nested function just for this add unnecessary complication? It could be a normal function. Did I get it right that it builds an array with every char of the string?

254

This should be in a global library if one is ever created. It could be beneficial for other Lua scripts as well.

255

There should be a check that this succeeds. Or does execution stop if it fails?

266

For readability, it would be easier to have something like an eprint function that prints in colors.red, rather than repeating the color every time it's needed. Just like dprint.

273

Should some error be returned here, or nil?

296

As this is the second occurrence of mostly the same code, I would move it into a function.

326

Why is this not used for normalization of the licenses above? And why is the LICENSE file that is read not normalized the same way as the downloaded licenses?

396

Would this be easier if it were not case sensitive?

402

Same as with the red color. A function would be easier to maintain.

444

Is there nothing similar to getopt available?

My only other feedback from looking at the draft standard was that while we should have LicenseRef-Foo as described in the document, each time we have to do that because the matching score isn't high enough, we should encourage the maintainer to submit that to the SPDX legal team so that they can either tweak the markup for license Foo, or create a new Foo variant if the changes are legally different enough.

Beyond that, and my lack of time to give this a super-close look, I really like how the policy has evolved after the initial proposal. I was worried it would take several iterations. I must have had a good day describing the changes, or Moin is a good mind reader :). In all seriousness, I'm super glad this is being done. I can also introduce people around to the SPDX folks I have a relationship with.

Mk/Scripts/check_spdx.lua
41

They match https://en.wikipedia.org/wiki/ANSI_escape_code
These codes have come from later editions of ECMA-48...

Coming from C, \27 looks wrong, but in Lua \ddd is a decimal escape, not an octal one. Lua is not quite the same as C in these details.
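
The same trap seen from the other side, as a small Python sketch: Python follows C here, so `\ddd` is octal and `"\27"` is not the ESC character that Lua's `"\27"` is. The helper name `red` is hypothetical; SGR code 31 is the standard red foreground and 0 resets attributes.

```python
ESC = "\x1b"  # ESC is decimal 27, hex 1b, octal 033


def red(text):
    """Wrap text in the ANSI SGR sequence for a red foreground, then reset."""
    return f"{ESC}[31m{text}{ESC}[0m"


# In Python (as in C) a backslash followed by digits is an octal escape:
assert "\033" == ESC == chr(27)
# ...so "\27" is octal 27, i.e. decimal 23, not ESC.
assert "\27" == chr(23)
```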