Page MenuHomeFreeBSD

New port: textproc/py-ocrmypdf: Adds an OCR test layer to scanned PDF files
ClosedPublic

Authored by kai on Jul 11 2019, 10:48 PM.

Details

Reviewers
koobs
Group Reviewers
Python
Commits
rP506461: New port: textproc/py-ocrmypdf
Summary

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be
searched or copy-pasted.

Main features:

  • Generates a searchable PDF/A file from a regular PDF
  • Places OCR text accurately below the image to ease copy / paste
  • Keeps the exact resolution of the original embedded images
  • When possible, inserts OCR information as a "lossless" operation without disrupting any other content
  • Optimizes PDF images, often producing files smaller than the input file
  • If requested deskews and/or cleans the image before performing OCR
  • Validates input and output files
  • Distributes work across all available CPU cores
  • Uses Tesseract OCR engine to recognize more than 100 languages
  • Scales properly to handle files with thousands of pages
  • Battle-tested on millions of PDFs

WWW: https://github.com/jbarlow83/OCRmyPDF

It's also listed on the "WantedPorts" FreeBSD wiki page.

Test Plan
  • poudriere (11.2-, 11.3-, 12.0-RELEASE, 13.0-CURRENT@r349352 amd64 + i386) for each py36 + py37 flavor -> OK
  • portlint + portclippy -> No issues
  • Runtime tests with py36 + py37 -> OK
  • "make test" gives "168 passed, 32 skipped, 1 xfailed"

Questions from my side:

  • Is the naming "textproc/py-ocrmypdf" reasonable? Or should it be renamed to "textproc/ocrmypdf" because it's rather an application and not a library?
  • The workaround in the post-stage is required to pass the PLIST checks. Maybe there's an better solution to cope with that issue?

Diff Detail

Repository
rP FreeBSD ports repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

kai created this revision.Jul 11 2019, 10:48 PM
0mp added a subscriber: 0mp.Jul 12 2019, 6:02 AM
0mp added inline comments.
textproc/py-ocrmypdf/Makefile
55 ↗(On Diff #59659)

Wouldn't = instead of += be enough?

67 ↗(On Diff #59659)

Is it necessary to use a subshell here?

kai updated this revision to Diff 59676.Jul 12 2019, 6:25 AM

Addressing 0mp's suggestions:

  • Use '=' instead '+=' for MAKE_ENV
  • Don't run the tests in a subshell
koobs accepted this revision.Jul 12 2019, 9:12 AM
koobs added a subscriber: koobs.

Looks good (with minor patch comment), ship it.

Very nice work Kai

textproc/py-ocrmypdf/files/patch-setup.cfg
7 ↗(On Diff #59676)

Add a comment to this patch as to why it was needed. Send upstream first (if applicable) and add upstream refs to comment if created upstream

This revision is now accepted and ready to land.Jul 12 2019, 9:12 AM
This revision was automatically updated to reflect the committed changes.
kai added a comment.Jul 12 2019, 3:14 PM

I did some minor changes:

  • Added comments with explanations
  • Replaced ".PLIST.pymodtmp" with ${_PYTHONPKGLIST}
  • Added two "-o" parameters for pytest to omit the patch for setup.cfg and do some further tests that would be skipped otherwise

Thank you, @0mp and @koobs, for reviewing this diff!

mat added a subscriber: mat.Jul 12 2019, 4:21 PM
mat added inline comments.
head/textproc/py-ocrmypdf/Makefile
65

post-stage runs very late in the install process. It should only be used if post-install is ran too early. Are you sure you need post-stage here ?

kai marked an inline comment as done.Jul 12 2019, 5:26 PM
kai added inline comments.
head/textproc/py-ocrmypdf/Makefile
65

Thank you for your suggestion and pointing that out! It works indeed fine with post-install and it's fixed in rP506483.

Minor fixes.

head/textproc/py-ocrmypdf/Makefile
38

There's no need to add "concurrent" since this port supports python 3 only.

55

Try USE_LOCALE=en_US.UTF-8

66–68

Use ECHO_CMD instead of ECHO.

from Mk/bsd.commands.mk:

# ECHO is defined in /usr/share/mk/sys.mk, which can either be "echo",
# or "true" if the make flag -s is given.  Use ECHO_CMD where you mean
# the echo command.
73

Replace LC_ALL=en_US.UTF-8 with MAKE_ENV.

kai marked 4 inline comments as done.Jul 15 2019, 10:45 AM
kai added inline comments.
head/textproc/py-ocrmypdf/Makefile
38

Ok, that's right. I automatically added "concurrent" as the port installs a script to ${PREFIX}/bin to prevent conflicts when one uses py36 and py37 in its environment.

But the actual state of the Ports tree that has Python 2.7 and Python 3.6 as default versions the "concurrent" keyword seems superfluous. I'll remove it then with the next update of textproc/py-ocrmypdf to avoid bumping the port.

55

Perfect, thank you for this hint! That also reduce the hardcoding of the locale further down in the do-test target. It's changed in rP506655.

66–68

Thanks for that hint, too! It's fixed in rP506655.

koobs added a comment.EditedJul 15 2019, 12:18 PM

concurrent should always be used when anything is installed by a package in locations (filenames/pathnames) that are not version-specific, like /usr/local/bin.

For most cases, since most python packages install into a single place (the python environment), this basically means anything installed outside LOCALBASE/lib/pythonX.Y/ needs to be made concurrent-safe.

In almost all cases ive seen, this means scripts, docs, man pages.

It should even be considered for python packages that only support a single version *if* that port has shared-location files, as packages can change their supported versions at any time in the future, and its more likely than not that concurrency will *NOT* be checked (due to autoplist), and the packages will then conflict. Poudriere runs wont pick this up either.

What the 'present' default versions of python happen to be at any particular point in time is entirely irrelevant and unrelated to declaring (with 'concurrent') whether or not a port needs to have shared path/file references de-conflicted, processed and pointing to the correct place (according to the users or systems, default selection).

textproc/py-ocrmypdf/Makefile
69 ↗(On Diff #59676)

Comment these too if reported upstream