galkahana / pdf-text-extraction


cli for extracting text from PDF files (and maybe possibly tables)

License: Apache License 2.0

CMake 3.87% C 0.76% C++ 95.36%

pdf pdf-to-text

pdf-text-extraction's Issues

Fuzz testing the application

Hi,

For a university project, we chose to fuzz-test your application.

During these tests we found some flaws in the application. The PDFs that cause crashes can be found here: pdfs.zip

PDF is in English, extracted text is in Greek, correct text can be copied from Chrome with no problem

Attached is the example pdf.
Attached is the output of TextExtraction.exe.

The problem description is in the title.

Is it technically possible to fix this, that is, to extract the text as it appears, in English?
If so, is a fix planned, or can it be made soon?

Thanks and cheers.
Lidia.

4.pdf
aa.txt

CMake Error when attempting to build project

After cloning and then running

cd pdf-text-extraction
mkdir build
cd build
cmake ..
make

I get

Downloads/pdf-text-extraction/build$ cmake ..
-- The C compiler identification is GNU 9.4.0 
-- The CXX compiler identification is GNU 
-- Check for working C compiler: /usr/bin/
-- Check for working C compiler: /usr/bin/cc 
-- Detecting C compiler ABI 
-- Detecting C compiler ABI info - done 
-- Detecting C compile 
-- Detecting C compile features - done 
-- Check for working CXX compiler: /usr/bin/c++   
-- Check for working CXX compiler: /usr/bin/c++ -- works        
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done 
-- Detecting CXX compile features  
-- Detecting CXX compile features - done
Scanning dependencies of target pdfhummus-populate 
[11%] Creating directories for 'pdfhummus-populate'
[ 22%] Performing download step (git clone) for 'pdfhummus-populate'
-- Avoiding repeated git clone, stamp file is up to date: 'Downloads/pdf-text-extraction/build/_deps/pdfhummus-subbuild/pdfhummus-populate-prefix/src/pdfhummus-populate-stamp/pdfhummus-populate-gitinfo.txt'
[ 33%] No patch step for 'pdfhummus-populate'
[44%] Performing update step for 'pdfhummus-populate'
CMake Error at /home/gokaf001/Downloads/pdf-text-extraction/build/_deps/pdfhummus-subbuild/pdfhummus-populate-prefix/tmp/pdfhummus-populate-gitupdate.cmake:10 (message):
Failed to get the hash for HEAD
make[2]: *** [CMakeFiles/pdfhummus-populate.dir/build.make:97: pdfhummus-populate-prefix/src/pdfhummus-populate-stamp/pdfhummus-populate-update] Error 1
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/pdfhummus-populate.dir/all] Error 2
make: *** [Makefile:84: all] Error 2
CMake Error at /usr/share/cmake-3.16/Modules/FetchContent.cmake:915 (message):
Build step for pdfhummus failed: 
Call Stack (most recent call first: 
usr/share/cmake-3.16/Modules/FetchContent.cmake:1006 (__FetchContent_directPopulate
/usr/share/cmake-3.16/Modules/FetchContent.cmake:1047 (FetchContent_Populate)
CMakeLists.txt:17 (FetchContent_MakeAvailable)
-- Configuring incomplete, errors occurred
See also "/Downloads/pdf-text-extraction/build/CMakeFiles/CMakeOutput.log".

My wifi is fine, git and cmake are the latest versions...no idea what could be wrong here.

Use after free in TextExtraction.cpp

I was running your text extraction tool and noticed a crash when extracting https://www.supremecourt.gov/opinions/22pdf/20-1199_l6gn.pdf. I turned on Apple's memory sanitizers and confirmed that the crash was indeed a use after free bug.

Upon looking at the code, I noticed the following in TextExtraction.cpp:64
PdfPageInput pageInput(inParser, pageObject.GetPtr())

This causes pageObject to effectively lose one reference when it is passed to pageInput. When pageInput cleans up, it deletes the pageObject pointer because it believes that it no longer has references. When the function returns, pageObject tries to delete the object again, not knowing that it has already been deleted.

I solved this problem by changing the line to read as follows:
PdfPageInput pageInput(inParser, pageObject)
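
The double release described above can be modelled in isolation. The following is a minimal sketch, not the library's code: `Counted`, `Ref`, and `PageInput` are hypothetical stand-ins, and it assumes a smart pointer that adopts a raw pointer without adding a reference but adds one when copied. Instead of really deleting anything, it counts how often the object would have been destroyed.

```cpp
#include <cassert>

// Hypothetical stand-in for a reference-counted PDF object.
// It only counts "destructions" so the double delete can be observed safely.
struct Counted {
    int refs = 1;          // the creator holds the first reference
    int destroyCalls = 0;  // how many times a real object would have been deleted
    void AddRef() { ++refs; }
    void Release() { if (--refs <= 0) ++destroyCalls; }
};

// Hypothetical smart pointer (an assumption about the semantics involved):
// adopting a raw pointer adds no reference, copying does.
class Ref {
public:
    Ref(Counted* p) : p_(p) { assert(p != nullptr); }       // adopt, no AddRef
    Ref(const Ref& other) : p_(other.p_) { p_->AddRef(); }  // share, AddRef
    Ref& operator=(const Ref&) = delete;
    ~Ref() { p_->Release(); }
    Counted* GetPtr() const { return p_; }
private:
    Counted* p_;
};

// Hypothetical consumer that keeps its own reference to the page.
struct PageInput {
    Ref page;
    explicit PageInput(Ref p) : page(p) {}
};

// Returns how many times the object would have been destroyed.
int destroyCallsWhenPassing(bool passRawPointer) {
    Counted object;  // lives on the stack; "destruction" is only counted
    {
        Ref owner(&object);
        if (passRawPointer) {
            PageInput input(owner.GetPtr());  // a second, unbalanced owner is created
        } else {
            PageInput input(owner);           // the copy adds a reference first
        }
    }
    return object.destroyCalls;
}
```

Under these assumptions, `destroyCallsWhenPassing(true)` reports two destructions (the double delete), while `destroyCallsWhenPassing(false)` reports exactly one.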

not working for Chinese PDF

Text extraction does not work for PDFs with Chinese characters.

Output is unreadable and doesn't contain all the characters of that pdf.

Reading non-ASCII symbol

Hello!
I am trying to read a fairly complicated PDF with various symbols, and I get a lot of broken symbols.
I originally found this issue while reading German text: all umlauts (ä, ö, ü) were misinterpreted in the resulting file.

.\TextExtraction.exe "C:\Users\misha\cloudplan\documentos\UTF-8 test file.pdf" >> utf_test.txt

UTF-8 test file.pdf
utf_test.txt

UPD:
The problem was in Windows pipes :) Using the -o flag fixed the issue.

add an option to compile as a dynamic library for windows and linux amd64bits

I would like to use this as a dynamic library in a server-side application written in Dart using FFI, but for that I need to compile it as a dynamic library (DLL/SO) with C-style exports for Linux and Windows.

I thought of something like this:

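//Lib.h
// LIB_EXPORT is an illustrative macro name: __declspec(dllexport) exists only on Windows,
// while GCC/Clang on Linux use the visibility attribute instead.
#ifdef _WIN32
#define LIB_EXPORT __declspec(dllexport)
#else
#define LIB_EXPORT __attribute__((visibility("default")))
#endif
extern "C" LIB_EXPORT const char* extractText(const char* inFilePath, int startPage, int endPage, int (*callback)(const char*));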

//Lib.cpp

#include "Lib.h"
#include <iostream>
#include <string>

#include "EStatusCode.h"
#include "BoxingBase.h"
#include "OutputFile.h"
#include "InputStringStream.h"
#include "OutputStreamTraits.h"
#include "IByteReaderWithPosition.h"

#include "TextExtraction.h"

using namespace std;
using namespace PDFHummus;

/// <summary>
/// </summary>
/// <param name="inFilePath"></param>
/// <param name="startPage"> set 0 </param>
/// <param name="endPage">set -1</param>
/// <returns></returns>
const char* extractText(const char* inFilePath, int startPage, int endPage, int (*callback)(const char*))
{

	string  filePath = inFilePath;

	bool writeToOutputFile = false;
	string outputFilePath = "";
	bool quiet = false;
	long bidiFlag = -1;
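	// static thread_local: the buffer must outlive this call, otherwise the returned c_str() would dangle
	static thread_local string result;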
	TextExtraction textExtraction;
	EStatusCode status;

	status = textExtraction.ExtractText(filePath, startPage, endPage);

	if (status != eSuccess) {
		cerr << "Error: " << textExtraction.LatestError.description.c_str() << endl;
		callback(textExtraction.LatestError.description.c_str());
	}
	TextExtractionWarningList::iterator it = textExtraction.LatestWarnings.begin();
	for (; it != textExtraction.LatestWarnings.end(); ++it) {
		cerr << "Warning: " << it->description.c_str() << endl;
		callback(it->description.c_str());
	}

	if (status == eSuccess) {
		result = textExtraction.GetResultsAsXML(bidiFlag);
	}

	return  status == eSuccess ? result.c_str() : "-1";
}
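
Returning a `const char*` across a C boundary raises the question of who owns the buffer. One common pattern for FFI callers, sketched here as an assumption and not as anything the project provides, is to hand back a heap-allocated copy together with a matching free function. The names `duplicateForCaller`, `freeExtractedText`, and `exampleCaller` are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper: copy a std::string into a malloc'd, NUL-terminated
// buffer that the caller owns. Returns nullptr if the allocation fails.
char* duplicateForCaller(const std::string& text) {
    char* buffer = static_cast<char*>(std::malloc(text.size() + 1));
    if (buffer == nullptr) {
        return nullptr;
    }
    // c_str() is NUL-terminated, so copying size() + 1 bytes includes the terminator.
    std::memcpy(buffer, text.c_str(), text.size() + 1);
    return buffer;
}

// Hypothetical export: the library frees what the library allocated, so the
// caller never mixes allocators across the DLL/SO boundary.
extern "C" void freeExtractedText(char* text) {
    std::free(text);  // free(nullptr) is a no-op
}

// Usage sketch: the caller reads the text, then releases it through the library.
void exampleCaller() {
    char* text = duplicateForCaller("page text");
    assert(text != nullptr);
    assert(std::strcmp(text, "page text") == 0);
    freeExtractedText(text);
}
```

With this design the returned pointer stays valid until the caller releases it, at the cost of one extra copy per call.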
