galkahana / pdf-text-extraction
cli for extracting text from PDF files (and maybe possibly tables)
License: Apache License 2.0
Hi,
For a university project we chose to fuzz-test your application.
During these tests we found some flaws in the application. The PDFs that cause crashes can be found here: pdfs.zip
After cloning and then running
cd pdf-text-extraction
mkdir build
cd build
cmake ..
make
I get
Downloads/pdf-text-extraction/build$ cmake ..
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU
-- Check for working C compiler: /usr/bin/
-- Check for working C compiler: /usr/bin/cc
-- Detecting C compiler ABI
-- Detecting C compiler ABI info - done
-- Detecting C compile
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
Scanning dependencies of target pdfhummus-populate
[ 11%] Creating directories for 'pdfhummus-populate'
[ 22%] Performing download step (git clone) for 'pdfhummus-populate'
-- Avoiding repeated git clone, stamp file is up to date: 'Downloads/pdf-text-extraction/build/_deps/pdfhummus-subbuild/pdfhummus-populate-prefix/src/pdfhummus-populate-stamp/pdfhummus-populate-gitinfo.txt'
[ 33%] No patch step for 'pdfhummus-populate'
[ 44%] Performing update step for 'pdfhummus-populate'
CMake Error at /home/gokaf001/Downloads/pdf-text-extraction/build/_deps/pdfhummus-subbuild/pdfhummus-populate-prefix/tmp/pdfhummus-populate-gitupdate.cmake:10 (message):
Failed to get the hash for HEAD
make[2]: *** [CMakeFiles/pdfhummus-populate.dir/build.make:97: pdfhummus-populate-prefix/src/pdfhummus-populate-stamp/pdfhummus-populate-update] Error 1
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/pdfhummus-populate.dir/all] Error 2
make: *** [Makefile:84: all] Error 2
CMake Error at /usr/share/cmake-3.16/Modules/FetchContent.cmake:915 (message):
Build step for pdfhummus failed:
Call Stack (most recent call first:
usr/share/cmake-3.16/Modules/FetchContent.cmake:1006 (__FetchContent_directPopulate
/usr/share/cmake-3.16/Modules/FetchContent.cmake:1047 (FetchContent_Populate)
CMakeLists.txt:17 (FetchContent_MakeAvailable)
-- Configuring incomplete, errors occurred
"See also "/Downloads/pdf-text-extraction/build/CMakeFiles/CMakeOutput.log".
My Wi-Fi is fine, and git and cmake are the latest versions... I have no idea what could be wrong here.
I was running your text extraction tool and noticed a crash when extracting https://www.supremecourt.gov/opinions/22pdf/20-1199_l6gn.pdf. I turned on Apple's memory sanitizers and confirmed that the crash was indeed a use-after-free bug.
Upon looking at the code, I noticed the following in TextExtraction.cpp:64
PdfPageInput pageInput(inParser, pageObject.GetPtr())
This causes pageObject to effectively lose one reference when it is passed to pageInput. When pageInput cleans up, it deletes the pageObject pointer because it believes that it no longer has references. When the function returns, pageObject tries to delete the object again, not knowing that it has already been deleted.
I solved this problem by changing the line to read as follows:
PdfPageInput pageInput(inParser, pageObject)
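The ownership problem described above can be illustrated with a small stand-alone sketch. The `Counted` and `Ref` types below are hypothetical stand-ins written for this illustration, not the PDFHummus classes, and they assume the behaviour reported in this issue: a smart pointer built from a raw pointer takes over an existing reference without adding one, while a copy adds a reference.

```cpp
#include <cassert>

// Hypothetical intrusive ref-counted object (NOT a PDFHummus class).
struct Counted {
    int refs = 1;                 // the creator holds the first reference
    static int liveObjects;       // how many Counted instances currently exist
    Counted() { ++liveObjects; }
    ~Counted() { --liveObjects; }
    void AddRef() { ++refs; }
    void Release() { if (--refs == 0) delete this; }
};
int Counted::liveObjects = 0;

// Hypothetical smart pointer. Assumption: adopting a raw pointer takes over
// an existing reference (no AddRef); copying shares ownership (AddRef).
class Ref {
public:
    explicit Ref(Counted* adopted) : ptr_(adopted) {}
    Ref(const Ref& other) : ptr_(other.ptr_) { if (ptr_) ptr_->AddRef(); }
    Ref& operator=(const Ref&) = delete;
    ~Ref() { if (ptr_) ptr_->Release(); }
    Counted* GetPtr() const { return ptr_; }
    int Count() const { return ptr_ ? ptr_->refs : 0; }
private:
    Counted* ptr_;
};

// Passing the smart pointer itself: two owners, two references.
int refsWhenShared() {
    Ref first(new Counted());
    Ref second(first);            // copy constructor calls AddRef
    return second.Count();
}

// Passing GetPtr(): two owners but only one reference recorded.
int refsAfterRawHandOff() {
    Ref first(new Counted());
    Ref second(first.GetPtr());   // adopts the raw pointer, no AddRef
    int observed = second.Count();
    // Without the next line, both destructors would call Release() on a
    // count of 1 and the second one would touch freed memory.
    first.GetPtr()->AddRef();     // repair the count so this demo exits cleanly
    return observed;
}
```

Calling `refsWhenShared()` reports two references for two owners, while `refsAfterRawHandOff()` reports one reference for two owners, which is the mismatch that leads to the second delete.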
Not working for PDFs with Chinese characters.
The output is unreadable and doesn't contain all the characters of that PDF.
Hello!
I am trying to read a fairly complicated PDF containing various symbols, and many of them come out broken.
I originally found this issue while reading German text: all umlauts (ä, ö, ü) were misinterpreted in the resulting file.
.\TextExtraction.exe "C:\Users\misha\cloudplan\documentos\UTF-8 test file.pdf" >> utf_test.txt
UTF-8 test file.pdf
utf_test.txt
UPD:
The problem was in Windows pipes :) Using the -o flag fixed the issue.
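The thread does not say why the pipe broke the text. One plausible explanation, stated here as an assumption, is that the shell re-encodes the program's output when redirecting it, whereas a program that writes the file itself controls the exact bytes. A minimal sketch of byte-exact file output follows; `writeUtf8File` and `readFileBytes` are hypothetical helper names, not functions from this project.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: write UTF-8 text to a file byte for byte.
// Binary mode prevents the runtime from translating any characters.
bool writeUtf8File(const std::string& path, const std::string& utf8Text) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(utf8Text.data(), static_cast<std::streamsize>(utf8Text.size()));
    return static_cast<bool>(out);
}

// Hypothetical helper: read a whole file back as raw bytes.
std::string readFileBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Writing the UTF-8 bytes for "äöü" with `writeUtf8File` and reading them back with `readFileBytes` should return the same byte sequence, since no shell or console layer sits between the program and the file.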
I would like to use this as a dynamic library in a server-side application written in Dart using FFI, but for that I need to compile it as a dynamic library (DLL/SO) with C-style exports for Linux and Windows.
I thought of something like this:
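//Lib.h
// __declspec(dllexport) only exists on Windows compilers; GCC and Clang use a visibility attribute.
#if defined(_WIN32)
#define LIB_EXPORT __declspec(dllexport)
#else
#define LIB_EXPORT __attribute__((visibility("default")))
#endif
extern "C" LIB_EXPORT const char* extractText(const char* inFilePath, int startPage, int endPage, int (*callback)(const char*));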
//Lib.cpp (the code is C++, so it needs a C++ source file rather than Lib.c)
#include "Lib.h"
#include <iostream>
#include <string>
#include "EStatusCode.h"
#include "BoxingBase.h"
#include "OutputFile.h"
#include "InputStringStream.h"
#include "OutputStreamTraits.h"
#include "IByteReaderWithPosition.h"
#include "TextExtraction.h"
using namespace std;
using namespace PDFHummus;
/// <summary>
/// Extracts text from a PDF file and returns it as a C string.
/// The returned pointer stays valid until the next call on the same thread.
/// </summary>
/// <param name="inFilePath">path of the PDF file to read</param>
/// <param name="startPage">start page, set 0</param>
/// <param name="endPage">end page, set -1</param>
/// <param name="callback">receives each error and warning message; may be null</param>
/// <returns>the extracted text, or "-1" on failure</returns>
const char* extractText(const char* inFilePath, int startPage, int endPage, int (*callback)(const char*))
{
    // A plain local string would be destroyed on return and leave the caller
    // with a dangling pointer, so the buffer is kept alive per thread.
    thread_local string result;
    result.clear();
    if (inFilePath == nullptr) {
        return "-1";
    }
    string filePath = inFilePath;
    long bidiFlag = -1;
    TextExtraction textExtraction;
    EStatusCode status = textExtraction.ExtractText(filePath, startPage, endPage);
    if (status != eSuccess) {
        cerr << "Error: " << textExtraction.LatestError.description.c_str() << endl;
        if (callback != nullptr) {
            callback(textExtraction.LatestError.description.c_str());
        }
    }
    // Report warnings whether or not extraction succeeded.
    TextExtractionWarningList::iterator it = textExtraction.LatestWarnings.begin();
    for (; it != textExtraction.LatestWarnings.end(); ++it) {
        cerr << "Warning: " << it->description.c_str() << endl;
        if (callback != nullptr) {
            callback(it->description.c_str());
        }
    }
    if (status != eSuccess) {
        return "-1";
    }
    result = textExtraction.GetResultsAsXML(bidiFlag);
    return result.c_str();
}
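A C-style export that returns `const char*` has to decide who owns the returned memory, because a pointer into a function-local `std::string` is no longer valid once the function returns. One common FFI pattern is to hand the caller a heap copy and export a matching free function. The sketch below uses hypothetical names (`copyToHeap`, `freeText`) that are not part of this project.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper: copy a std::string into malloc'd memory that the
// caller owns. Returns nullptr if the allocation fails.
char* copyToHeap(const std::string& text) {
    char* buffer = static_cast<char*>(std::malloc(text.size() + 1));
    if (buffer == nullptr) {
        return nullptr;
    }
    // c_str() is null-terminated, so size() + 1 bytes includes the terminator.
    std::memcpy(buffer, text.c_str(), text.size() + 1);
    return buffer;
}

// Hypothetical export: the caller (for example a Dart FFI binding) passes
// back every pointer it received so the library frees it with the same
// allocator that created it.
extern "C" void freeText(char* text) {
    std::free(text);
}
```

With this design the exported function would return `copyToHeap(result)` and the calling side would call `freeText` once it has copied the text into its own string type. The cost is one extra allocation per call, in exchange for a pointer whose lifetime does not depend on the library's internal state.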