galkahana / pdf-text-extraction

cli for extracting text from PDF files (and maybe possibly tables)

License: Apache License 2.0

CMake 3.87% C 0.76% C++ 95.36%
pdf pdf-to-text

pdf-text-extraction's Issues

Fuzz testing the application

Hi,

For a university project we chose to fuzz-test your application.

During these tests we found some flaws in the application. The PDFs that cause crashes can be found here: pdfs.zip

CMake Error when attempting to build project

After cloning and then running

cd pdf-text-extraction
mkdir build
cd build
cmake ..
make

I get

Downloads/pdf-text-extraction/build$ cmake ..
-- The C compiler identification is GNU 9.4.0 
-- The CXX compiler identification is GNU 
-- Check for working C compiler: /usr/bin/
-- Check for working C compiler: /usr/bin/cc 
-- Detecting C compiler ABI 
-- Detecting C compiler ABI info - done 
-- Detecting C compile 
-- Detecting C compile features - done 
-- Check for working CXX compiler: /usr/bin/c++   
-- Check for working CXX compiler: /usr/bin/c++ -- works        
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done 
-- Detecting CXX compile features  
-- Detecting CXX compile features - done
Scanning dependencies of target pdfhummus-populate 
[ 11%] Creating directories for 'pdfhummus-populate'
[ 22%] Performing download step (git clone) for 'pdfhummus-populate'
-- Avoiding repeated git clone, stamp file is up to date: 'Downloads/pdf-text-extraction/build/_deps/pdfhummus-subbuild/pdfhummus-populate-prefix/src/pdfhummus-populate-stamp/pdfhummus-populate-gitinfo.txt'
[ 33%] No patch step for 'pdfhummus-populate'
[ 44%] Performing update step for 'pdfhummus-populate'
CMake Error at /home/gokaf001/Downloads/pdf-text-extraction/build/_deps/pdfhummus-subbuild/pdfhummus-populate-prefix/tmp/pdfhummus-populate-gitupdate.cmake:10 (message):
Failed to get the hash for HEAD
make[2]: *** [CMakeFiles/pdfhummus-populate.dir/build.make:97: pdfhummus-populate-prefix/src/pdfhummus-populate-stamp/pdfhummus-populate-update] Error 1
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/pdfhummus-populate.dir/all] Error 2
make: *** [Makefile:84: all] Error 2
CMake Error at /usr/share/cmake-3.16/Modules/FetchContent.cmake:915 (message):
Build step for pdfhummus failed:
Call Stack (most recent call first):
/usr/share/cmake-3.16/Modules/FetchContent.cmake:1006 (__FetchContent_directPopulate)
/usr/share/cmake-3.16/Modules/FetchContent.cmake:1047 (FetchContent_Populate)
CMakeLists.txt:17 (FetchContent_MakeAvailable)
-- Configuring incomplete, errors occurred
See also "/Downloads/pdf-text-extraction/build/CMakeFiles/CMakeOutput.log".

My Wi-Fi is fine, and git and cmake are the latest versions... no idea what could be wrong here.

Use after free in TextExtraction.cpp

I was running your text extraction tool and noticed a crash when extracting https://www.supremecourt.gov/opinions/22pdf/20-1199_l6gn.pdf. I turned on Apple's memory sanitizers and confirmed that the crash was indeed a use after free bug.

Upon looking at the code, I noticed the following in TextExtraction.cpp:64
PdfPageInput pageInput(inParser, pageObject.GetPtr())

This causes pageObject to effectively lose one reference when it is passed to pageInput. When pageInput cleans up, it deletes the pageObject pointer because it believes that it no longer has references. When the function returns, pageObject tries to delete the object again, not knowing that it has already been deleted.

I solved this problem by changing the line to read as follows:
PdfPageInput pageInput(inParser, pageObject)
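
The failure mode described above can be reproduced in isolation. The following is only a minimal sketch, not the PDFHummus classes: `Counted` and `Ref` are hypothetical stand-ins, and the sketch assumes (as the description suggests) that the wrapper adopts a raw pointer without adding a reference, while copying the wrapper does add one. Instead of actually freeing memory, `Counted` records how many times it was released after its count had already reached zero, so the imbalance can be observed safely.

```cpp
#include <cassert>

// Hypothetical intrusive ref-counted object (a stand-in, not a PDFHummus type).
// A real implementation would delete itself when the count reaches zero; here a
// release at zero is only recorded, so the sketch has no undefined behavior.
struct Counted {
    int refs = 1;          // the creator holds the first reference
    int overReleases = 0;  // releases that arrived after the count hit zero
    void AddRef() { ++refs; }
    void Release() {
        if (refs == 0) ++overReleases;  // would be a double delete in real code
        else --refs;
    }
};

// Hypothetical smart pointer: adopts raw pointers, shares on copy.
class Ref {
public:
    explicit Ref(Counted* adopt) : p_(adopt) { assert(adopt != nullptr); }  // no AddRef
    Ref(const Ref& other) : p_(other.p_) { p_->AddRef(); }                  // AddRef
    Ref& operator=(const Ref&) = delete;
    ~Ref() { p_->Release(); }
    Counted* GetPtr() const { return p_; }
private:
    Counted* p_;
};

// Passing owner.GetPtr(): two wrappers share a single reference.
int overReleasesWhenPassingRaw() {
    Counted obj;
    {
        Ref owner(&obj);
        { Ref second(owner.GetPtr()); }  // adopts, then releases: count drops to 0
    }                                    // owner releases again: one release too many
    return obj.overReleases;
}

// Passing the wrapper itself: the copy takes its own reference.
int overReleasesWhenPassingRef() {
    Counted obj;
    {
        Ref owner(&obj);
        { Ref second(owner); }  // AddRef on copy, Release on destruction
    }                           // owner releases the last reference
    return obj.overReleases;
}
```

Under these assumptions, the raw-pointer path ends with one release too many and the wrapper-copy path ends balanced, which matches the reported fix of passing `pageObject` directly.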

not working for Chinese PDF

It does not work for PDFs with Chinese characters.

Output is unreadable and doesn't contain all the characters of that pdf.

Reading non-ASCII symbol

Hello!
I am trying to read a fairly complicated PDF with various symbols, and many of them come out broken.
I originally found this issue while reading German text: all umlauts (ä, ö, ü) were misinterpreted in the resulting file.

.\TextExtraction.exe "C:\Users\misha\cloudplan\documentos\UTF-8 test file.pdf" >> utf_test.txt

UTF-8 test file.pdf
utf_test.txt

UPD:
The problem was in Windows pipes :) Using the -o flag fixed the issue.
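
A plausible reading of the fix is that, with -o, the program writes the bytes to the file itself, so the shell's redirection never gets a chance to re-encode them. The sketch below illustrates that idea only; it is not the tool's code. `writeUtf8File` and `readFileBytes` are hypothetical helpers, and the sketch assumes the extracted text is already held as UTF-8 bytes in a `std::string`.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: writes the bytes of utf8Text to path unchanged.
// Binary mode prevents newline translation on Windows.
bool writeUtf8File(const std::string& path, const std::string& utf8Text) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(utf8Text.data(), static_cast<std::streamsize>(utf8Text.size()));
    out.flush();
    return out.good();
}

// Hypothetical helper: reads a whole file back as raw bytes, for checking.
std::string readFileBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Writing the UTF-8 bytes for "äöü" with `writeUtf8File` and reading them back with `readFileBytes` should return the same byte sequence, since nothing between the program and the file reinterprets the encoding.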

add an option to compile as a dynamic library for windows and linux amd64bits

I would like to use this as a dynamic library in a server-side application written in Dart using FFI, but for that I need to compile it as a dynamic library (DLL/SO) with C-style exports for Linux and Windows.

I thought of something like this:

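//Lib.h
// __declspec(dllexport) is Windows-only; on Linux, GCC and Clang use a
// visibility attribute to export the symbol from the shared object.
#if defined(_WIN32)
#define LIB_EXPORT __declspec(dllexport)
#else
#define LIB_EXPORT __attribute__((visibility("default")))
#endif

extern "C" LIB_EXPORT const char* extractText(const char* inFilePath, int startPage, int endPage, int (*callback)(const char*));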

//Lib.cpp (the code is C++, only the exported symbol is C-style)

#include "Lib.h"
#include <iostream>
#include <string>

#include "EStatusCode.h"
#include "BoxingBase.h"
#include "OutputFile.h"
#include "InputStringStream.h"
#include "OutputStreamTraits.h"
#include "IByteReaderWithPosition.h"

#include "TextExtraction.h"

using namespace std;
using namespace PDFHummus;

/// <summary>
/// Extracts text from the PDF at inFilePath. Errors and warnings are written
/// to stderr and, if callback is not null, also passed to callback.
/// </summary>
/// <param name="inFilePath">path of the PDF file</param>
/// <param name="startPage"> set 0 </param>
/// <param name="endPage">set -1</param>
/// <returns>the result text, or "-1" on failure; the pointer stays valid
/// until the next call on the same thread</returns>
const char* extractText(const char* inFilePath, int startPage, int endPage, int (*callback)(const char*))
{
	// Returning c_str() of a local string would hand the caller a dangling
	// pointer, so the result lives in per-thread storage that outlives the call.
	static thread_local string result;
	result.clear();

	if (inFilePath == nullptr)
		return "-1";

	string filePath = inFilePath;
	long bidiFlag = -1;
	TextExtraction textExtraction;
	EStatusCode status;

	status = textExtraction.ExtractText(filePath, startPage, endPage);

	if (status != eSuccess) {
		cerr << "Error: " << textExtraction.LatestError.description.c_str() << endl;
		if (callback)
			callback(textExtraction.LatestError.description.c_str());
	}
	TextExtractionWarningList::iterator it = textExtraction.LatestWarnings.begin();
	for (; it != textExtraction.LatestWarnings.end(); ++it) {
		cerr << "Warning: " << it->description.c_str() << endl;
		if (callback)
			callback(it->description.c_str());
	}

	if (status != eSuccess)
		return "-1";

	result = textExtraction.GetResultsAsXML(bidiFlag);
	return result.c_str();
}
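
Returning a `const char*` across an FFI boundary raises the question of who owns the buffer and how long it stays valid. One common alternative, sketched here under stated assumptions, is to let the caller supply the buffer. `copyToCallerBuffer` is a hypothetical helper, not part of pdf-text-extraction; an exported function could call it with the extracted text, and the Dart side would allocate and free the buffer itself.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copies text into a caller-owned buffer, always
// NUL-terminating when capacity > 0. Returns the number of bytes needed to
// hold the full text including the terminating NUL; a return value greater
// than capacity means the copy was truncated.
std::size_t copyToCallerBuffer(const std::string& text, char* buffer, std::size_t capacity) {
    const std::size_t needed = text.size() + 1;
    if (buffer != nullptr && capacity > 0) {
        // Copy the whole text if it fits, otherwise as much as leaves room for the NUL.
        const std::size_t n = (needed <= capacity) ? text.size() : capacity - 1;
        std::memcpy(buffer, text.data(), n);
        buffer[n] = '\0';
    }
    return needed;
}
```

A caller could first pass a null buffer with capacity 0 to learn the required size, allocate that many bytes, and then call again to receive the text. The trade-off is two calls instead of one, in exchange for no shared or static storage inside the library.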
