Page MenuHomeFreeBSD

New port: textproc/py-ocrmypdf: Adds an OCR test layer to scanned PDF files

Authored by kai on Jul 11 2019, 10:48 PM.
Referenced Files
Unknown Object (File)
Fri, Jan 27, 8:15 AM
Unknown Object (File)
Dec 28 2022, 6:06 PM
Unknown Object (File)
Dec 23 2022, 12:16 AM
Unknown Object (File)
Dec 23 2022, 12:16 AM
Unknown Object (File)
Dec 23 2022, 12:16 AM
Unknown Object (File)
Dec 23 2022, 12:16 AM
Unknown Object (File)
Dec 11 2022, 4:07 PM
Unknown Object (File)
Dec 7 2022, 9:02 PM


Group Reviewers
rP506461: New port: textproc/py-ocrmypdf

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be
searched or copy-pasted.

Main features:

  • Generates a searchable PDF/A file from a regular PDF
  • Places OCR text accurately below the image to ease copy / paste
  • Keeps the exact resolution of the original embedded images
  • When possible, inserts OCR information as a "lossless" operation without disrupting any other content
  • Optimizes PDF images, often producing files smaller than the input file
  • If requested deskews and/or cleans the image before performing OCR
  • Validates input and output files
  • Distributes work across all available CPU cores
  • Uses Tesseract OCR engine to recognize more than 100 languages
  • Scales properly to handle files with thousands of pages
  • Battle-tested on millions of PDFs


It's also listed on the "WantedPorts" FreeBSD wiki page.

Test Plan
  • poudriere (11.2-, 11.3-, 12.0-RELEASE, 13.0-CURRENT@r349352 amd64 + i386) for each py36 + py37 flavor -> OK
  • portlint + portclippy -> No issues
  • Runtime tests with py36 + py37 -> OK
  • "make test" gives "168 passed, 32 skipped, 1 xfailed"

Questions from my side:

  • Is the naming "textproc/py-ocrmypdf" reasonable? Or should it be renamed to "textproc/ocrmypdf" because it's rather an application and not a library?
  • The workaround in the post-stage is required to pass the PLIST checks. Maybe there's an better solution to cope with that issue?

Diff Detail

No Lint Coverage
No Test Coverage
Build Status
Buildable 25302
Build 23965: arc lint + arc unit

Event Timeline

0mp added inline comments.

Wouldn't = instead of += be enough?


Is it necessary to use a subshell here?

Addressing 0mp's suggestions:

  • Use '=' instead '+=' for MAKE_ENV
  • Don't run the tests in a subshell
koobs added a subscriber: koobs.

Looks good (with minor patch comment), ship it.

Very nice work Kai


Add a comment to this patch as to why it was needed. Send upstream first (if applicable) and add upstream refs to comment if created upstream

This revision is now accepted and ready to land.Jul 12 2019, 9:12 AM
This revision was automatically updated to reflect the committed changes.

I did some minor changes:

  • Added comments with explanations
  • Replaced ".PLIST.pymodtmp" with ${_PYTHONPKGLIST}
  • Added two "-o" parameters for pytest to omit the patch for setup.cfg and do some further tests that would be skipped otherwise

Thank you, @0mp and @koobs, for reviewing this diff!

mat added inline comments.
65 ↗(On Diff #59689)

post-stage runs very late in the install process. It should only be used if post-install is ran too early. Are you sure you need post-stage here ?

kai marked an inline comment as done.Jul 12 2019, 5:26 PM
kai added inline comments.
65 ↗(On Diff #59689)

Thank you for your suggestion and pointing that out! It works indeed fine with post-install and it's fixed in rP506483.

Minor fixes.

38 ↗(On Diff #59689)

There's no need to add "concurrent" since this port supports python 3 only.

55 ↗(On Diff #59689)


66–68 ↗(On Diff #59689)

Use ECHO_CMD instead of ECHO.

from Mk/

# ECHO is defined in /usr/share/mk/, which can either be "echo",
# or "true" if the make flag -s is given.  Use ECHO_CMD where you mean
# the echo command.
73 ↗(On Diff #59689)

Replace LC_ALL=en_US.UTF-8 with MAKE_ENV.

kai marked 4 inline comments as done.Jul 15 2019, 10:45 AM
kai added inline comments.
38 ↗(On Diff #59689)

Ok, that's right. I automatically added "concurrent" as the port installs a script to ${PREFIX}/bin to prevent conflicts when one uses py36 and py37 in its environment.

But the actual state of the Ports tree that has Python 2.7 and Python 3.6 as default versions the "concurrent" keyword seems superfluous. I'll remove it then with the next update of textproc/py-ocrmypdf to avoid bumping the port.

55 ↗(On Diff #59689)

Perfect, thank you for this hint! That also reduce the hardcoding of the locale further down in the do-test target. It's changed in rP506655.

66–68 ↗(On Diff #59689)

Thanks for that hint, too! It's fixed in rP506655.

concurrent should always be used when anything is installed by a package in locations (filenames/pathnames) that are not version-specific, like /usr/local/bin.

For most cases, since most python packages install into a single place (the python environment), this basically means anything installed outside LOCALBASE/lib/pythonX.Y/ needs to be made concurrent-safe.

In almost all cases ive seen, this means scripts, docs, man pages.

It should even be considered for python packages that only support a single version *if* that port has shared-location files, as packages can change their supported versions at any time in the future, and its more likely than not that concurrency will *NOT* be checked (due to autoplist), and the packages will then conflict. Poudriere runs wont pick this up either.

What the 'present' default versions of python happen to be at any particular point in time is entirely irrelevant and unrelated to declaring (with 'concurrent') whether or not a port needs to have shared path/file references de-conflicted, processed and pointing to the correct place (according to the users or systems, default selection).


Comment these too if reported upstream