Page MenuHomeFreeBSD

New port: textproc/py-ocrmypdf: Adds an OCR test layer to scanned PDF files
ClosedPublic

Authored by kai on Jul 11 2019, 10:48 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Mar 14, 2:13 PM
Unknown Object (File)
Jan 8 2024, 6:47 AM
Unknown Object (File)
Jan 5 2024, 8:47 PM
Unknown Object (File)
Jan 2 2024, 7:41 PM
Unknown Object (File)
Dec 20 2023, 2:25 AM
Unknown Object (File)
Dec 10 2023, 9:51 PM
Unknown Object (File)
Nov 23 2023, 3:12 PM
Unknown Object (File)
Nov 15 2023, 3:21 PM

Details

Reviewers
koobs
Group Reviewers
Python
Commits
rP506461: New port: textproc/py-ocrmypdf
Summary

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be
searched or copy-pasted.

Main features:

  • Generates a searchable PDF/A file from a regular PDF
  • Places OCR text accurately below the image to ease copy / paste
  • Keeps the exact resolution of the original embedded images
  • When possible, inserts OCR information as a "lossless" operation without disrupting any other content
  • Optimizes PDF images, often producing files smaller than the input file
  • If requested deskews and/or cleans the image before performing OCR
  • Validates input and output files
  • Distributes work across all available CPU cores
  • Uses Tesseract OCR engine to recognize more than 100 languages
  • Scales properly to handle files with thousands of pages
  • Battle-tested on millions of PDFs

WWW: https://github.com/jbarlow83/OCRmyPDF

It's also listed on the "WantedPorts" FreeBSD wiki page.

Test Plan
  • poudriere (11.2-, 11.3-, 12.0-RELEASE, 13.0-CURRENT@r349352 amd64 + i386) for each py36 + py37 flavor -> OK
  • portlint + portclippy -> No issues
  • Runtime tests with py36 + py37 -> OK
  • "make test" gives "168 passed, 32 skipped, 1 xfailed"

Questions from my side:

  • Is the naming "textproc/py-ocrmypdf" reasonable? Or should it be renamed to "textproc/ocrmypdf" because it's rather an application and not a library?
  • The workaround in the post-stage is required to pass the PLIST checks. Maybe there's an better solution to cope with that issue?

Diff Detail

Repository
rP FreeBSD ports repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

0mp added inline comments.
textproc/py-ocrmypdf/Makefile
55 ↗(On Diff #59659)

Wouldn't = instead of += be enough?

67 ↗(On Diff #59659)

Is it necessary to use a subshell here?

Addressing 0mp's suggestions:

  • Use '=' instead '+=' for MAKE_ENV
  • Don't run the tests in a subshell
koobs added a subscriber: koobs.

Looks good (with minor patch comment), ship it.

Very nice work Kai

textproc/py-ocrmypdf/files/patch-setup.cfg
7 ↗(On Diff #59676)

Add a comment to this patch as to why it was needed. Send upstream first (if applicable) and add upstream refs to comment if created upstream

This revision is now accepted and ready to land.Jul 12 2019, 9:12 AM
This revision was automatically updated to reflect the committed changes.

I did some minor changes:

  • Added comments with explanations
  • Replaced ".PLIST.pymodtmp" with ${_PYTHONPKGLIST}
  • Added two "-o" parameters for pytest to omit the patch for setup.cfg and do some further tests that would be skipped otherwise

Thank you, @0mp and @koobs, for reviewing this diff!

mat added inline comments.
head/textproc/py-ocrmypdf/Makefile
65

post-stage runs very late in the install process. It should only be used if post-install is ran too early. Are you sure you need post-stage here ?

kai marked an inline comment as done.Jul 12 2019, 5:26 PM
kai added inline comments.
head/textproc/py-ocrmypdf/Makefile
65

Thank you for your suggestion and pointing that out! It works indeed fine with post-install and it's fixed in rP506483.

Minor fixes.

head/textproc/py-ocrmypdf/Makefile
38

There's no need to add "concurrent" since this port supports python 3 only.

55

Try USE_LOCALE=en_US.UTF-8

66–68

Use ECHO_CMD instead of ECHO.

from Mk/bsd.commands.mk:

# ECHO is defined in /usr/share/mk/sys.mk, which can either be "echo",
# or "true" if the make flag -s is given.  Use ECHO_CMD where you mean
# the echo command.
73

Replace LC_ALL=en_US.UTF-8 with MAKE_ENV.

kai marked 4 inline comments as done.Jul 15 2019, 10:45 AM
kai added inline comments.
head/textproc/py-ocrmypdf/Makefile
38

Ok, that's right. I automatically added "concurrent" as the port installs a script to ${PREFIX}/bin to prevent conflicts when one uses py36 and py37 in its environment.

But the actual state of the Ports tree that has Python 2.7 and Python 3.6 as default versions the "concurrent" keyword seems superfluous. I'll remove it then with the next update of textproc/py-ocrmypdf to avoid bumping the port.

55

Perfect, thank you for this hint! That also reduce the hardcoding of the locale further down in the do-test target. It's changed in rP506655.

66–68

Thanks for that hint, too! It's fixed in rP506655.

concurrent should always be used when anything is installed by a package in locations (filenames/pathnames) that are not version-specific, like /usr/local/bin.

For most cases, since most python packages install into a single place (the python environment), this basically means anything installed outside LOCALBASE/lib/pythonX.Y/ needs to be made concurrent-safe.

In almost all cases ive seen, this means scripts, docs, man pages.

It should even be considered for python packages that only support a single version *if* that port has shared-location files, as packages can change their supported versions at any time in the future, and its more likely than not that concurrency will *NOT* be checked (due to autoplist), and the packages will then conflict. Poudriere runs wont pick this up either.

What the 'present' default versions of python happen to be at any particular point in time is entirely irrelevant and unrelated to declaring (with 'concurrent') whether or not a port needs to have shared path/file references de-conflicted, processed and pointing to the correct place (according to the users or systems, default selection).

textproc/py-ocrmypdf/Makefile
69 ↗(On Diff #59676)

Comment these too if reported upstream