cycomanic / menextract2pdf

Extract Mendeley annotations to PDF files

License: GNU General Public License v3.0

Python 95.50% Shell 1.91% Batchfile 2.59%

menextract2pdf's Issues

Table of contents missing in converted files

Table of contents / outline of PDF files is not preserved after conversion. This would make it very difficult to navigate some long articles.

ImportError: No module named PyPDF2.generic

Hi guys!
I am trying to migrate my Mendeley Desktop library from v. 1.8 to the latest version of Zotero, on macOS Catalina 10.15.4, and I get the following output on Terminal:
Traceback (most recent call last):
  File "src/menextract2pdf.py", line 14, in <module>
    import pdfannotation
  File "/Users/johnny/Downloads/Menextract2pdf-master/src/pdfannotation.py", line 10, in <module>
    from PyPDF2.generic import *
ImportError: No module named PyPDF2.generic
~~I have no idea what I'm doing wrong, and will try doing this in a Windows VM shortly if I don't get feedback soon. In any case, I'll let you know how that went.~~
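
An ImportError like the one above usually means the package is not installed for the interpreter that runs the script. The following is a minimal sketch (written for Python 3, not part of menextract2pdf) that checks this before the script is started; the helper name module_available is hypothetical.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


if not module_available("PyPDF2"):
    # sys.executable shows which interpreter is missing the package
    print("PyPDF2 is not installed for", sys.executable)
else:
    print("PyPDF2 was found for", sys.executable)
```

If the package is reported missing, installing it with the same interpreter (for example `python -m pip install PyPDF2`) is the usual remedy; whether that resolves this particular report is an assumption.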

Exported files without text

Hello, thanks for this useful tool.

I am using Ubuntu and ran the script "menextract2pdf_overwrite.sh" to overwrite my Mendeley library (with the aim of importing it into Zotero using its native Mendeley importer).

Unfortunately, none of the PDF files with comments show any text at all; only the highlighted regions are visible, with no text in them.
Is this a known problem?

See one example file here.

zlib.error: Error -3 while decompressing data: incorrect header check

Hi,
I am a long-time Mendeley user on Mac, but I have become extremely fed up with Mendeley's various bugs and limitations, so I have decided to try switching to Zotero. The problem is that I have 10 years of annotated (and highlighted) PDFs that I cannot lose in the conversion process.
I have tried running the .sh script from my macOS Sierra terminal, but it does not work.
The only command that starts any sort of process is:

python3 menextract2pdf.py mydatabase.sqlite mypdffolder/ --overwrite

Overwriting the PDFs works for a while, and about a third of my 2800 files get modified with the highlighting as they should. Then the process stops and I get the following error message:

Traceback (most recent call last):
  File "menextract2pdf.py", line 193, in <module>
    mendeley2pdf(fn, dir_pdf)
  File "menextract2pdf.py", line 177, in mendeley2pdf
    processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
  File "menextract2pdf.py", line 156, in processpdf
    inpdf._flatten()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1506, in _flatten
    pages = catalog["/Pages"].getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1593, in getObject
    retval = self._getObjectFromStream(indirectReference)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1543, in _getObjectFromStream
    streamData = BytesIO(b_(objStm.getData()))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/filters.py", line 346, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/filters.py", line 111, in decode
    data = decompress(data)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/filters.py", line 49, in decompress
    return zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check

Thanks in advance for your help!
Max

converturl2abspath does not work in Windows and the temporary workaround

The .bat files do not work on Windows, as indicated in #13.

The Python script itself does not work either. The function converturl2abspath does not return the correct file name on Windows: when the URL is file:///C:/Users/xxxxxx/xxxxxx.pdf, it returns C:/C:/Users/xxxxxx/xxxxxx.pdf.

Here is my temporary fix, in case someone needs it in the future.

In menextract2pdf.py

Add import urllib, urllib.request at the beginning.
In converturl2abspath, change the return line to return urllib.request.url2pathname(pth).

Then run the following command.

python menextract2pdf.py "C:\Users\xxxxx\AppData\Local\Mendeley Ltd.\Mendeley Desktop\[email protected]@www.mendeley.com.sqlite" "C:\Users\xxxxx\AppData\Local\Mendeley Ltd.\Mendeley Desktop\Downloaded_new"

Now it works like magic!
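
The workaround above relies on urllib.request.url2pathname. As a minimal sketch of that idea in isolation (Python 3; the function name file_url_to_path is hypothetical and not part of menextract2pdf), the conversion could look like this:

```python
from urllib.parse import urlparse
from urllib.request import url2pathname


def file_url_to_path(url):
    """Convert a file:// URL to a local path for the current platform."""
    # urlparse keeps the percent-encoded path, e.g. '/C:/Users/x/My%20Paper.pdf'
    encoded_path = urlparse(url).path
    # url2pathname decodes percent-escapes and applies the platform's
    # path conventions (drive letters and backslashes on Windows)
    return url2pathname(encoded_path)


print(file_url_to_path("file:///home/user/My%20Paper.pdf"))
```

Because url2pathname is platform dependent, the drive-letter handling only takes effect when the code runs on Windows; on other systems it mainly decodes the percent-escapes.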

Lost highlight in 1st page

Hi,
I believe there is a bug in menextract2pdf.py, lines 101-107.
The annotations dict is filled with highlights first (line 151) and then with notes (line 152).
Suppose a PDF has both highlights and notes on its first page. Then pth is already in results, but the entry under results[pth] for that page
has only 'highlights' and not 'notes'. In that case the line results[pth][pg] = {'notes': [note]} overwrites the sub-dict, and the highlight records are lost.
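
One way to avoid this kind of overwrite is to create the nested dictionaries only when they are missing. The sketch below assumes the layout results[pth][pg][kind] described in this report; the helper name add_annotation and the sample values are hypothetical.

```python
def add_annotation(results, pth, pg, kind, item):
    """Store item under results[pth][pg][kind] without discarding other kinds."""
    # setdefault returns the existing sub-dict if present, else inserts a new one
    page_entry = results.setdefault(pth, {}).setdefault(pg, {})
    page_entry.setdefault(kind, []).append(item)


results = {}
add_annotation(results, "paper.pdf", 1, "highlights", "highlight A")
add_annotation(results, "paper.pdf", 1, "notes", "note B")
print(results)
```

With this approach the order in which highlights and notes are added no longer matters, because an existing page entry is extended rather than replaced.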

Process fails with certain encoding characters in PDF title (letter ñ)

First of all, sorry for the lack of logs, evidence, and a reproducible scenario; I'm at work and don't have much time to gather them.

I just wanted to point out that the process may fail if a character like the Spanish letter ñ appears in the title of the PDF file.

Maybe in my spare time I can find a way to fix this and open a PR, but for the moment I just wanted to let you know.

Highlight in line-break

If a highlight spans a line break, separate highlights (one per line) are generated.

Fail to open PDF Files. "Could not find pdffile~"

While opening the original PDF files, the following error occurred:

"Could not find pdffile C:\C:\Users*\AppData\Local\Mendeley Ltd\Mendeley Desktop\Downloaded*.pdf"

The problem was an incorrectly built file path: the drive prefix "C:\" appears twice ("C:\C:\").

I added one line to the converturl2abspath function in the original code:

pth = pth.lstrip('/')

def converturl2abspath(url):
    """Convert a url string to an absolute path"""
    try:
        # unquote is necessary for filenames with unicode strings
        pth = unquote(urlparse(url).path)
    except Exception:
        # Python 2 fallback (str only has .decode() there), also for unicode filenames
        pth = unquote(str(urlparse(url).path)).decode("utf8")
    # added line: drop the leading '/' so that '/C:/...' becomes 'C:/...'
    pth = pth.lstrip('/')
    return os.path.abspath(pth)

After this change, the annotations were added successfully, but I did not change the code on GitHub because I am not sure whether this error occurs for all users.
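
Regarding the concern about other users: stripping every leading '/' would also remove the root slash from POSIX paths such as /home/user/x.pdf, which os.path.abspath would then resolve relative to the current directory. A hedged sketch of a narrower variant, which removes the slash only when a drive letter follows, is shown below; the helper name strip_slash_before_drive is hypothetical.

```python
import re

# One or more leading slashes, but only when a Windows drive letter follows
_DRIVE_PREFIX = re.compile(r"^/+(?=[A-Za-z]:)")


def strip_slash_before_drive(pth):
    """Turn '/C:/Users/x.pdf' into 'C:/Users/x.pdf'; leave POSIX paths alone."""
    return _DRIVE_PREFIX.sub("", pth)


print(strip_slash_before_drive("/C:/Users/xxxxxx/xxxxxx.pdf"))
print(strip_slash_before_drive("/home/user/xxxxxx.pdf"))
```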

PDF files not found

Hi Jochen

I would love to use your tool but I have an issue when running menextract2pdf.py: it appears to correctly read my Mendeley database, but it can't seem to find the accompanying pdf files. Output is a long list, e.g.:

Could not find pdffile D:\PapersMen\2010_Hatfull.pdf
Could not find pdffile D:\PapersMen\2012_Zhong et al.pdf
Could not find pdffile D:\PapersMen\1953_Bloch, Noll.pdf
Could not find pdffile D:\PapersMen\2011_Lawson, Norton, Clements.pdf
Could not find pdffile D:\PapersMen\2013_Goodman, Church, Kosuri.pdf

I saw in the code that this message results from an I/O error when opening the file. However, I'm not familiar with PDF editing and can't figure out what's wrong. Note that the error is not limited to file names containing special characters (the first one above should be fine). When I copy a path from the error message and paste it into the command line, the PDF opens just fine.

Running Win7 64bit, Python 2.7, updated sqlite3 to get the script to work.

Would you know what happened?

Cheers
Loes

(Also: line 109 has an extra dot after ['notes'])
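
For "Could not find pdffile" reports like the one above, it can help to print the underlying operating-system error rather than a generic message. The following is a small diagnostic sketch written for Python 3 (the report used Python 2.7); the function name describe_missing_file is hypothetical and not part of menextract2pdf.

```python
import os


def describe_missing_file(path):
    """Return None if path can be opened for reading, else a diagnostic string."""
    try:
        with open(path, "rb"):
            return None
    except OSError as exc:
        parent = os.path.dirname(path) or "."
        # repr() exposes stray whitespace or unexpected characters in the path
        return "%r: %s (parent folder exists: %s)" % (
            path, exc.strerror, os.path.isdir(parent))


print(describe_missing_file(os.path.join("no_such_folder", "missing.pdf")))
```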

zlib.error: Error -5 while decompressing data: incomplete or truncated stream

Hello,
I get the following error when trying to open a PDF file. I think the cause is that the PDF file is editable; it worked with a non-editable file.

Code
existing_pdf = PdfFileReader(open("S21.pdf","rb"))

Error

Traceback (most recent call last):
  File "fill.py", line 16, in <module>
    existing_pdf = PdfFileReader(open("S21.pdf","rb"))
  File "/home/dcabrera/.local/lib/python3.8/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/home/dcabrera/.local/lib/python3.8/site-packages/PyPDF2/pdf.py", line 1807, in read
    streamData = BytesIO(b_(xrefstream.getData()))
  File "/home/dcabrera/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/home/dcabrera/.local/lib/python3.8/site-packages/PyPDF2/filters.py", line 346, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/home/dcabrera/.local/lib/python3.8/site-packages/PyPDF2/filters.py", line 111, in decode
    data = decompress(data)
  File "/home/dcabrera/.local/lib/python3.8/site-packages/PyPDF2/filters.py", line 49, in decompress
    return zlib.decompress(data)
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
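
Both zlib errors reported on this page come from zlib.decompress being handed data it cannot decode. As an illustration only (the byte strings below are made up and unrelated to any of the reported PDFs), the two error codes can be reproduced with the standard library alone:

```python
import zlib

# A valid zlib stream to start from (sample data, assumed for illustration)
payload = zlib.compress(b"example PDF stream content " * 20)


def try_decompress(data):
    """Return the decompressed bytes, or the zlib error message as a string."""
    try:
        return zlib.decompress(data)
    except zlib.error as exc:
        return str(exc)


# Overwriting the two header bytes makes the header check fail (error code -3)
bad_header = try_decompress(b"xx" + payload[2:])
# Cutting off the end leaves an incomplete stream (error code -5)
truncated = try_decompress(payload[:-8])

print(bad_header)
print(truncated)
```

This suggests that the failures originate in the compressed stream data of the individual files; whether a file being editable plays a role is not something this sketch can confirm.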

ImportError: No module named PyPDF2.generic

Hi all, I get this error when running Menextract2pdf - can anyone advise, please?

"ImportError: No module named PyPDF2.generic"

PyPDF2.utils.PdfReadError: EOF marker not found

Hi cycomanic,

Thank you very much for creating this tool! Very handy, especially now that Mendeley has started encrypting its local database.

I'm running Menextract2pdf.py on Mac, and have installed the dependencies you listed. I'm extracting pdfs from Mendeley version 1.18. Menextract2pdf.py runs perfectly fine for a bunch of papers, but at some point it hits a pdf of a book chapter and gives me the following error:

Traceback (most recent call last):
  File "menextract2pdf.py", line 184, in <module>
    mendeley2pdf(fn, dir_pdf)
  File "menextract2pdf.py", line 168, in mendeley2pdf
    processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
  File "menextract2pdf.py", line 141, in processpdf
    inpdf = PyPDF2.PdfFileReader(open(fn, 'rb'), strict=False)
  File "/Users/pplsuser/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/Users/pplsuser/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1696, in read
    raise utils.PdfReadError("EOF marker not found")
PyPDF2.utils.PdfReadError: EOF marker not found

(Apologies for the messed-up formatting.)

Thanks,
Marieke
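
A PDF file normally ends with a %%EOF marker, which is what the error above reports as missing. The sketch below is a stand-alone pre-check that looks for the marker near the end of a file; the function name has_eof_marker and the 1024-byte window are assumptions for illustration, not PyPDF2's own logic.

```python
import os


def has_eof_marker(path, tail_size=1024):
    """Return True if b'%%EOF' occurs within the last tail_size bytes of the file."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        # Jump back at most tail_size bytes from the end of the file
        fh.seek(max(0, size - tail_size))
        return b"%%EOF" in fh.read()
```

Running such a check over the library before the conversion would list files that are likely truncated or otherwise damaged, so they can be re-downloaded or set aside.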

Could not find pdffile

Hi, I'm immensely grateful for this project!
However, so far I haven't been able to get it to work. It may well be an error on my side, as I'm not too proficient with Python, but maybe it's a general issue.

I've copied all my Mendeley files (1.18 on Windows 10) to my Linux machine, where I'm running:

~/Desktop/mendeley_transfer/Menextract2pdf-master/src$ python menextract2pdf.py "/home/myname/Desktop/mendeley_transfer/Mendeley Desktop/[email protected]@www.mendeley.com.sqlite" "/home/myname/Desktop/mendeley_transfer/Paper_Literatur/"

and as an error message I'm getting:
Could not find pdffile /C:/Users/_myusername_/Documents/Mendeley Desktop/PaperName.pdf

I have a hard time interpreting this error message, as the folder containing the PDF files is /home/myname/Desktop/mendeley_transfer/Paper_Literatur/ and not the folder where my sqlite DB is stored. Could anybody shed some light on what this error message means and how to work around it?

As a note: I tried on my Windows 10 machine first, but ran into the problem described in Issue 16 (also mentioned here), and the workarounds proposed there did not work for me. So I tried on my Linux machine and got stuck a bit later in the process.

If I can provide further useful information on this issue I would be glad to do so.
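
The path in the error message is the Windows location recorded in the database, which does not exist on the Linux machine. Assuming the PDFs were copied with their file names unchanged, one possible approach is to keep only the file name from the stored path and look for it in the new folder. The sketch below illustrates that idea; the helper name remap_to_folder is hypothetical and not part of menextract2pdf.

```python
import ntpath
import os


def remap_to_folder(stored_path, pdf_folder):
    """Build a path in pdf_folder from the file name of a stored (Windows) path."""
    # ntpath.basename splits on both '/' and '\\', so it copes with
    # '/C:/Users/...' as well as 'C:\\Users\\...'
    filename = ntpath.basename(stored_path)
    return os.path.join(pdf_folder, filename)


stored = "/C:/Users/_myusername_/Documents/Mendeley Desktop/PaperName.pdf"
folder = "/home/myname/Desktop/mendeley_transfer/Paper_Literatur/"
print(remap_to_folder(stored, folder))
```

This ignores subfolders and would not distinguish two PDFs with the same file name, so it is a starting point rather than a complete solution.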

Command not found

Hello,

Thank you for developing this code; I know it will be invaluable once I get it working. I'm working on a Mac and trying to execute the "menextract2pdf_overwrite.sh" command after navigating to the unzipped folder, but I get an error message saying "command not found".

I've tried putting the folder in the Mendeley Desktop folder in Application Support, and in other folders not containing anything related to Mendeley (i.e. download folders), with no luck.

Any help would be greatly appreciated. I'm sure this is due to my inexperience with the terminal, and I've found very little help on other sites. I was following the instructions from (https://www.zotero.org/support/kb/mendeley_import#preserving_mendeley_annotations_and_highlights)

Thank you!

Unable to find DB

I don't know if I am doing something wrong (probably yes), but I am unable to use the program. I always get the same error:
"REM Helps to find the right mendeley.sqlite-DB 1*www.mendeley.com.sqlite"') do @set mendeleydb=a was unexpected at this time." even when the path is set up correctly.

I am doing this:
"menextract2pdf_overwrite.bat "C:\Users\AppData\Local\Mendeley Ltd\Mendeley Desktop\[email protected]" "C:\Users\Documents\Mendeley Desktop"

Any help will be greatly appreciated, I really want to stop using Mendeley
Thanks

Umlauts in filename problem and PyPDF2 hiccups

After I decrypted my database I used menextract2pdf to get my annotations into the pdfs. I encountered a couple of errors:

Could not find pdffile /Users/armin/Desktop/ProjekteOnHold/ceat/mendeley_archive/Mach - 1886 - BeitrÃ¤ge zur Analyse der Empfindungen.pdf

This is an Umlaut encoding issue. Adding .decode("utf8") on line 28 solved this problem for me.
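
The garbled "BeitrÃ¤ge" in the message above is what UTF-8 bytes look like when they are decoded with a single-byte encoding. The short sketch below reproduces the effect; Latin-1 is assumed as the wrong encoding purely for illustration.

```python
original = "Beiträge"

# 'ä' is stored as two bytes in UTF-8
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as Latin-1 turns each byte into its own character
garbled = utf8_bytes.decode("latin-1")

# Decoding them as UTF-8 restores the intended text
repaired = utf8_bytes.decode("utf-8")

print(garbled)
print(repaired)
```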

zlib.error: Error -3 while decompressing data: incorrect header check
and
ValueError: invalid literal for int() with base 10: 'dobj'

These errors were related to specific (somewhat corrupted) PDFs. I added print(fn) to processpdf(fn, fn_out, annotations) so I could identify and manually remove the culprits.
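
Instead of stopping at the first damaged file, a batch run could record the failure and carry on. The sketch below shows the pattern with a stand-in processing function; process_all and fake_process are hypothetical names, and in the real script the callable would wrap processpdf (where catching PyPDF2's PdfReadError as well would be a reasonable addition).

```python
import zlib


def process_all(filenames, process_one):
    """Call process_one on every file; return a list of (filename, error) pairs."""
    failures = []
    for fn in filenames:
        try:
            process_one(fn)
        except (zlib.error, ValueError, OSError) as exc:
            # Remember the culprit and keep going with the remaining files
            failures.append((fn, "%s: %s" % (type(exc).__name__, exc)))
    return failures


def fake_process(fn):
    """Stand-in for the real per-file processing; fails on one sample name."""
    if fn == "corrupt.pdf":
        raise zlib.error("Error -3 while decompressing data: incorrect header check")


failures = process_all(["a.pdf", "corrupt.pdf", "b.pdf"], fake_process)
for fn, reason in failures:
    print("Skipped", fn, "because of", reason)
```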

Thank you for writing Menextract2pdf!
