This might be feasible to include in Guix proper, but I am wondering if you would be able to provide the r-tesseract package. The tesseract-ocr package in Guix also seems outdated relative to the needs of the R package.
Below I offer my pathetic attempt towards this goal. I could get the tesseract-ocr package to build by not exposing its Makefile to the asciidoc and docbook tooling. I could also get the r-tesseract package to build on top of it. Unfortunately, the package fails to load in R since it cannot find any language training data.
(use-modules (guix packages)
             (guix download)
             (guix build-system gnu)
             (gnu packages compression)
             (gnu packages xml)
             (gnu packages gtk)
             (gnu packages backup)
             (gnu packages curl)
             (gnu packages icu4c)
             (gnu packages image)
             (gnu packages python)
             (gnu packages documentation)
             (gnu packages autotools)
             ;; (gnu packages docbook)
             (guix build-system r)
             (guix git-download)
             ((guix licenses) #:prefix license:)
             (gnu packages pkg-config)
             (gnu packages cran)
             (gnu packages ocr)
             (gnu packages graph)
             (gnu packages statistics)
             (gnu packages gcc))
(define-public tesseract-ocr5.1
  ;; There are useful commits beyond the last official stable release.
  (let ((commit "c2a3efe2824e1c8a0810e82a43406ba8e01527c4")
        (revision "1"))
    (package
      (name "tesseract-ocr5.1")
      (version (git-version "5.1.0" revision commit))
      (source
       (origin
         (method git-fetch)
         (uri (git-reference
               (url "https://github.com/tesseract-ocr/tesseract")
               (commit commit)))
         (file-name (git-file-name name version))
         (sha256
          (base32
           "11asiy9zbmhp8x1xlqiv7a22nhac1xviw03gn2mpsjpx3b1pfp07"))))
      (build-system gnu-build-system)
      (inputs
       `(("cairo" ,cairo)
         ("icu" ,icu4c)
         ("leptonica" ,leptonica)
         ("pango" ,pango)
         ("python-wrapper" ,python-wrapper)))
      (native-inputs
       `(("autoconf" ,autoconf)
         ("automake" ,automake)
         ("libarchive" ,libarchive)
         ("libcurl" ,curl)
         ("libtool" ,libtool)
         ("libtiff" ,libtiff)
         ("pkg-config" ,pkg-config)
         ("xsltproc" ,libxslt)))
      (arguments
       `(#:configure-flags
         (let ((leptonica (assoc-ref %build-inputs "leptonica")))
           (list (string-append "LIBLEPT_HEADERSDIR=" leptonica "/include")))
         #:tests? #f                    ; Tests currently result in a segfault
         #:phases
         (modify-phases %standard-phases
           ;; Note: these two targets build and install the training
           ;; tools only; they do not provide any *.traineddata files.
           (add-after 'install 'build-training
             (lambda _
               (invoke "make" "training")))
           (add-after 'build-training 'install-training
             (lambda _
               (invoke "make" "training-install"))))))
      (home-page "https://github.com/tesseract-ocr/tesseract")
      (synopsis "Optical character recognition engine")
      (description
       "Tesseract is an optical character recognition (OCR) engine with very
high accuracy.  It supports many languages, output text formatting, hOCR
positional information and page layout analysis.  Several image formats are
supported through the Leptonica library.  It can also detect whether text is
monospaced or proportional.")
      (license license:asl2.0))))
I can't say ignoring the documentation is the best practice, but I was just hoping to be able to use the R package...
(define-public r-tesseract
  (package
    (name "r-tesseract")
    (version "5.0.0")
    (source
     (origin
       (method url-fetch)
       (uri (cran-uri "tesseract" version))
       (sha256
        (base32 "1xdwjm3bing15ljdicl20g88ymmd0bbjmlbah5hzvws5b656iicn"))))
    (properties `((upstream-name . "tesseract")))
    (build-system r-build-system)
    (inputs (list zlib))
    (propagated-inputs (list r-curl r-digest r-pdftools r-rappdirs r-rcpp))
    (native-inputs (list pkg-config r-knitr tesseract-ocr5.1 leptonica))
    ;; The imported home page had two URLs run together; keep the first.
    (home-page "https://docs.ropensci.org/tesseract/")
    (synopsis "Open Source OCR Engine")
    (description
     "Bindings to 'Tesseract' <https://opensource.google/projects/tesseract>: a
powerful optical character recognition (OCR) engine that supports over 100
languages.  The engine is highly configurable in order to tune the detection
algorithms and obtain the best possible results.")
    (license license:gpl2)))
Here is the text of the error I saw in R when loading:
Error opening data file /gnu/store/ihga2gkalxzfmbaczaf5xazdqbc5h4ly-tesseract-ocr5.1-5.1.0-1.c2a3efe/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Warning message:
Unable to find English training data
I looked in that /gnu/store directory and sure enough there was no trained data. I also didn't see any evidence that you could download language training data from Guix either.
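One possible way around the missing data would be a small data-only package that installs eng.traineddata under share/tessdata. The following is only a sketch under several assumptions: the package name tessdata-fast-eng is made up, the "4.1.0" tag and raw-file URL of the upstream tessdata_fast repository are assumed rather than checked, and the hash is a placeholder that has to be replaced before this can build.

```scheme
(use-modules (guix packages)
             (guix download)
             (guix build-system copy)
             ((guix licenses) #:prefix license:))

;; Hypothetical package: neither the name nor the hash comes from Guix.
(define-public tessdata-fast-eng
  (package
    (name "tessdata-fast-eng")
    ;; Assumed upstream tag of the tessdata_fast repository.
    (version "4.1.0")
    (source
     (origin
       (method url-fetch)
       (uri (string-append
             "https://github.com/tesseract-ocr/tessdata_fast/raw/"
             version "/eng.traineddata"))
       (sha256
        ;; Placeholder hash: replace it with the value reported by
        ;; `guix download' or by the failed build.
        (base32 "0000000000000000000000000000000000000000000000000000"))))
    (build-system copy-build-system)
    (arguments
     ;; Install the single model file under share/tessdata/.
     '(#:install-plan '(("eng.traineddata" "share/tessdata/"))))
    (home-page "https://github.com/tesseract-ocr/tessdata_fast")
    (synopsis "English trained model for the Tesseract OCR engine")
    (description
     "This package provides the English @file{eng.traineddata} model from the
tessdata_fast repository, installed under @file{share/tessdata}.")
    (license license:asl2.0)))
```

Even with such a package in the same profile, TESSDATA_PREFIX would presumably still have to point at that profile's share/tessdata directory, as the error message asks, since the model would not live inside the tesseract-ocr5.1 store item.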
I originally had an older version of r-tesseract in my personal channel, which seems to date from around the same period as Guix's tesseract-ocr package, but sadly that now fails to build.
(define-public r-tesseract
  (package
    (name "r-tesseract")
    (version "4.1")
    (source
     (origin
       (method url-fetch)
       (uri (cran-uri "tesseract" version))
       (sha256
        (base32
         "1a7cf48n7hdd6inqz23ifqhq6kx6wxri34a79ns2vxaj6f4llxf0"))))
    (properties `((upstream-name . "tesseract")))
    (build-system r-build-system)
    (inputs `(("zlib" ,zlib)
              ("tesseract-ocr" ,tesseract-ocr)
              ("leptonica" ,leptonica)))
    (propagated-inputs
     `(("r-curl" ,r-curl)
       ("r-digest" ,r-digest)
       ("r-pdftools" ,r-pdftools)
       ("r-rappdirs" ,r-rappdirs)
       ("r-rcpp" ,r-rcpp)))
    (native-inputs
     `(("pkg-config" ,pkg-config)
       ("r-knitr" ,r-knitr)))
    (home-page
     "https://github.com/ropensci/tesseract")
    (synopsis "Open Source OCR Engine")
    (description
     "Bindings to 'Tesseract' <https://opensource.google.com/projects/tesseract>: a
powerful optical character recognition (OCR) engine that supports over 100
languages.  The engine is highly configurable in order to tune the detection
algorithms and obtain the best possible results.")
    (license license:asl2.0)))