cycomanic / menextract2pdf
Extract Mendeley annotations to PDF files
License: GNU General Public License v3.0
Table of contents / outline of PDF files is not preserved after conversion. This would make it very difficult to navigate some long articles.
Hi guys!
I am trying to migrate my Mendeley Desktop library from v. 1.8 to the latest version of Zotero, on macOS Catalina 10.15.4, and I get the following output on Terminal:
Traceback (most recent call last):
File "src/menextract2pdf.py", line 14, in <module>
import pdfannotation
File "/Users/johnny/Downloads/Menextract2pdf-master/src/pdfannotation.py", line 10, in <module>
from PyPDF2.generic import *
ImportError: No module named PyPDF2.generic
I have no idea what I'm doing wrong, and will try doing this in a Windows VM shortly if I don't get feedback soon. In any case, I'll let you know how that went.
Hello, thanks for this useful tool.
I am using Ubuntu and I used the script "menextract2pdf__overwrite.sh" to overwrite my Mendeley library (with the aim of importing it into Zotero, using their native Mendeley importer).
Unfortunately, none of the PDF files with comments show any text at all; you can only see the highlighted regions (with no text in them).
Is this a known problem?
See one example file here.
Hi,
I am a long-time mac Mendeley user, but I have become extremely fed up with the various bugs and limitations of Mendeley so I have decided to try to switch to Zotero. The problem is I have 10 years of annotated (and highlighted) PDFs I cannot lose in the conversion process.
I have tried running the .sh from my macOS Sierra terminal but it does not work.
The only command that starts some sort of process is:
python3 menextract2pdf.py mydatabase.sqlite mypdffolder/ --overwrite
The overwriting of PDFs works for a while, and about a third of my 2800 files get modified with the highlighting as they should. But then the process stops and I get the following error message:
Traceback (most recent call last):
File "menextract2pdf.py", line 193, in <module>
mendeley2pdf(fn, dir_pdf)
File "menextract2pdf.py", line 177, in mendeley2pdf
processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
File "menextract2pdf.py", line 156, in processpdf
inpdf._flatten()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1506, in _flatten
pages = catalog["/Pages"].getObject()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/generic.py", line 516, in __getitem__
return dict.__getitem__(self, key).getObject()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/generic.py", line 178, in getObject
return self.pdf.getObject(self).getObject()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1593, in getObject
retval = self._getObjectFromStream(indirectReference)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1543, in _getObjectFromStream
streamData = BytesIO(b_(objStm.getData()))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/generic.py", line 841, in getData
decoded._data = filters.decodeStreamData(self)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/filters.py", line 346, in decodeStreamData
data = FlateDecode.decode(data, stream.get("/DecodeParms"))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/filters.py", line 111, in decode
data = decompress(data)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/PyPDF2/filters.py", line 49, in decompress
return zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check
Thanks in advance for your help!
Max
The bat files do not work in Windows, as indicated in #13.
The Python script itself does not work either. The function converturl2abspath does not return the correct file name in Windows. When url is file:///C:/Users/xxxxxx/xxxxxx.pdf, it returns C:/C:/Users/xxxxxx/xxxxxx.pdf.
Here is my temporary fix, in case someone needs it in the future.
In menextract2pdf.py, add import urllib, urllib.request at the beginning.
In converturl2abspath, change the return line to return urllib.request.url2pathname(pth).
Then run the following command:
python menextract2pdf.py "C:\Users\xxxxx\AppData\Local\Mendeley Ltd.\Mendeley Desktop\[email protected]@www.mendeley.com.sqlite" "C:\Users\xxxxx\AppData\Local\Mendeley Ltd.\Mendeley Desktop\Downloaded_new"
Now it works like magic!
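For readers who hit the same doubled drive letter, the conversion can be sketched as below. This is only an illustration, not the project's actual code: the name file_url_to_path is hypothetical, and the sketch assumes that attachment locations are stored as file:// URLs like the one quoted above.

```python
import re
from urllib.parse import unquote, urlparse


def file_url_to_path(url):
    """Turn a file:// URL into a filesystem path (hypothetical helper)."""
    # urlparse keeps the leading slash, so a Windows URL yields "/C:/Users/...".
    path = unquote(urlparse(url).path)
    # Drop that slash only when a drive letter follows; POSIX paths keep theirs.
    if re.match(r"^/[A-Za-z]:", path):
        path = path[1:]
    return path


print(file_url_to_path("file:///C:/Users/xxxxxx/xxxxxx.pdf"))
print(file_url_to_path("file:///home/user/My%20Paper.pdf"))
```

On Windows, the standard-library urllib.request.url2pathname used in the fix above performs a similar conversion. The explicit drive-letter check here is just one way to keep a single function usable for both Windows-style and POSIX-style URLs.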
Hi,
I believe there is a bug in the file menextract2pdf.py in lines 101-107:
The annotations dict first takes in highlights (line 151), then notes (line 152).
Suppose one PDF has both highlights and notes on its first page. Then pth is in results, but results[pth] has only 'highlights' and not 'notes'. In such cases the line results[pth][pg] = {'notes': [note]} will overwrite the sub-dict, and the highlight records are lost.
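One possible way to avoid the overwrite is to create each nesting level only when it is missing and then append to it. The sketch below is an assumption about how such a fix could look, not the code in the repository: add_annotation is a hypothetical helper, and only the results[pth][pg] layout with 'highlights' and 'notes' keys is taken from the description above.

```python
def add_annotation(results, pth, pg, kind, item):
    """Append item under results[pth][pg][kind], creating levels as needed."""
    # setdefault returns the existing sub-dict if there is one, so entries
    # stored earlier for the same file and page are kept.
    page = results.setdefault(pth, {}).setdefault(pg, {})
    page.setdefault(kind, []).append(item)


results = {}
add_annotation(results, "paper.pdf", 1, "highlights", "highlight A")
add_annotation(results, "paper.pdf", 1, "notes", "note B")
print(results)
```

Because nothing is ever assigned to results[pth][pg] directly, the order in which highlights and notes are read no longer matters.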
First of all, sorry for the lack of logs, proof and reproducible scenarios; I'm at work and I don't have much time to gather them.
I just wanted to point out that the process may fail if a character like the Spanish letter ñ is found in the title of the PDF file.
Maybe in my spare time I can find a way to fix this and open a PR, but for the moment I just wanted you to know.
If a highlight includes line breaks, separate highlights (one per line) are generated.
While opening the original PDF files, the error below occurred.
"Could not find pdffile C:\C:\Users*\AppData\Local\Mendeley Ltd\Mendeley Desktop\Downloaded*.pdf"
The problem was a misrecognized file path: "C:\" appears twice ("C:\C:\").
I added one line to the converturl2abspath method in the original code:
pth = pth.lstrip('/')
def converturl2abspath(url):
    """Convert a url string to an absolute path"""
    try:
        pth = unquote(urlparse(url).path)  # this is necessary for filenames with unicode strings
    except:
        pth = unquote(str(urlparse(url).path)).decode("utf8")  # this is necessary for filenames with unicode strings
    pth = pth.lstrip('/')  # added line: "/C:/Users/..." becomes "C:/Users/..."
    return os.path.abspath(pth)
After modifying the code, the annotations were added successfully, but I did not change the code on GitHub because I am not sure whether this error occurs for all users.
Hi Jochen
I would love to use your tool but I have an issue when running menextract2pdf.py: it appears to correctly read my Mendeley database, but it can't seem to find the accompanying pdf files. Output is a long list, e.g.:
Could not find pdffile D:\PapersMen\2010_Hatfull.pdf
Could not find pdffile D:\PapersMen\2012_Zhong et al.pdf
Could not find pdffile D:\PapersMen\1953_Bloch, Noll.pdf
Could not find pdffile D:\PapersMen\2011_Lawson, Norton, Clements.pdf
Could not find pdffile D:\PapersMen\2013_Goodman, Church, Kosuri.pdf
I saw in the code it's due to an I/O error when opening the file. However, I'm not familiar with pdf editing and I can't find out what's wrong. Note that the error is not limited to file names containing special characters (the upper one should be fine). When I copy a path from the error message and paste it in the command line, the pdf opens just fine.
Running Win7 64bit, Python 2.7, updated sqlite3 to get the script to work.
Would you know what happened?
Cheers
Loes
(Also: line 109 has an extra dot after ['notes'])
Hello,
I am having the following error when trying to open a PDF file. I think the issue is because the pdf file is editable. It worked using a non editable file.
Code
existing_pdf = PdfFileReader(open("S21.pdf","rb"))
Error
Traceback (most recent call last):
File "fill.py", line 16, in <module>
existing_pdf = PdfFileReader(open("S21.pdf","rb"))
File "/home/dcabrera/.local/lib/python3.8/site-packages/PyPDF2/pdf.py", line 1084, in __init__
self.read(stream)
File "/home/dcabrera/.local/lib/python3.8/site-packages/PyPDF2/pdf.py", line 1807, in read
streamData = BytesIO(b_(xrefstream.getData()))
File "/home/dcabrera/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 841, in getData
decoded._data = filters.decodeStreamData(self)
File "/home/dcabrera/.local/lib/python3.8/site-packages/PyPDF2/filters.py", line 346, in decodeStreamData
data = FlateDecode.decode(data, stream.get("/DecodeParms"))
File "/home/dcabrera/.local/lib/python3.8/site-packages/PyPDF2/filters.py", line 111, in decode
data = decompress(data)
File "/home/dcabrera/.local/lib/python3.8/site-packages/PyPDF2/filters.py", line 49, in decompress
return zlib.decompress(data)
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
Hi cycomanic,
Thank you very much for creating this tool! Very handy, especially now that Mendeley has started encrypting its local database.
I'm running Menextract2pdf.py on Mac, and have installed the dependencies you listed. I'm extracting pdfs from Mendeley version 1.18. Menextract2pdf.py runs perfectly fine for a bunch of papers, but at some point it hits a pdf of a book chapter and gives me the following error:
Traceback (most recent call last):
File "menextract2pdf.py", line 184, in <module>
mendeley2pdf(fn, dir_pdf)
File "menextract2pdf.py", line 168, in mendeley2pdf
processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons)
File "menextract2pdf.py", line 141, in processpdf
inpdf = PyPDF2.PdfFileReader(open(fn, 'rb'), strict=False)
File "/Users/pplsuser/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1084, in __init__
self.read(stream)
File "/Users/pplsuser/anaconda2/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1696, in read
raise utils.PdfReadError("EOF marker not found")
PyPDF2.utils.PdfReadError: EOF marker not found
(Apologies for the messed-up formatting.)
Thanks,
Marieke
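A PDF file normally ends with a %%EOF line, so an "EOF marker not found" error often points to a truncated or otherwise damaged file. A quick pre-check along the following lines might help to find such files before running the tool. This is a rough sketch only: has_eof_marker and the 1024-byte window are assumptions made for illustration, and PyPDF2's own search may behave differently.

```python
import os


def has_eof_marker(path, tail_size=1024):
    """Return True if b'%%EOF' occurs in the last tail_size bytes of the file."""
    with open(path, "rb") as f:
        # Jump to the end to learn the file size, then read only the tail.
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_size))
        return b"%%EOF" in f.read()
```

Calling has_eof_marker on each PDF in the library and listing the paths for which it returns False gives a shortlist of files worth re-downloading or opening and re-saving in a PDF viewer.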
Hi, I'm immensely grateful for this project!
However, so far I wasn't able to get it to work - It may well be an error on my side, as I'm not too proficient with python. But maybe it's a general issue.
I've copied all my Mendeley files (1.18 on Windows 10) on my Linux machine and there I'm running:
~/Desktop/mendeley_transfer/Menextract2pdf-master/src$ python menextract2pdf.py "/home/myname/Desktop/mendeley_transfer/Mendeley Desktop/[email protected]@www.mendeley.com.sqlite" "/home/myname/Desktop/mendeley_transfer/Paper_Literatur/"
and as an error message I'm getting
Could not find pdffile /C:/Users/_myusername_/Documents/Mendeley Desktop/PaperName.pdf
I have a hard time interpreting this error message, as the folder containing the PDF files is /home/myname/Desktop/mendeley_transfer/Paper_Literatur/ and not the folder where my sqlite DB is stored. Could anybody shed some light on what this error message means and how to circumvent it?
As a note: I tried on my Windows 10 machine before, but I ran into the problem from issue 16 that is also mentioned here, and the workarounds proposed there did not work for me. So I tried on my Linux machine and got stuck a bit later in the process.
If I can provide further useful information on this issue I would be glad to do so.
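The error message suggests that the database still records the Windows location of each PDF. The tracebacks quoted in other reports show the call processpdf(fn, os.path.join(dir_pdf, os.path.basename(fn)), annons), which hints that the folder given on the command line is where output is written, while the input path fn comes from the database. If that reading is right, one conceivable workaround is to map each stored path to a file of the same name in the copied folder. The helper below is hypothetical and is not part of menextract2pdf.

```python
import ntpath
import os


def remap_to_local_dir(stored_path, local_dir):
    """Return the path the same file name would have inside local_dir."""
    # ntpath treats both "/" and "\" as separators, so it extracts the file
    # name from Windows-style paths even when run on Linux or macOS.
    filename = ntpath.basename(stored_path)
    return os.path.join(local_dir, filename)


print(remap_to_local_dir(
    "/C:/Users/_myusername_/Documents/Mendeley Desktop/PaperName.pdf",
    "/home/myname/Desktop/mendeley_transfer/Paper_Literatur/",
))
```

This only works when file names are unique across the library, since the directory part of the stored path is discarded.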
Hello,
Thank you for developing this code, I know it will be invaluable once I can get it working. I'm working on a Mac and trying to execute the "menextract2pdf_overwrite.sh" command after navigating to the unzipped folder, but an error message saying "command not found" pops up.
I've tried putting the folder in the same Mendeley Desktop folder in Application Support and in other folders not containing anything relevant to Mendeley (i.e. download folders) with no luck.
Any help would be greatly appreciated. I'm sure this is due to my inexperience working in the terminal, and I've found very little help on other sites. I was following instructions from (https://www.zotero.org/support/kb/mendeley_import#preserving_mendeley_annotations_and_highlights)
Thank you!
I don't know if I am doing something wrong (probably yes), but I am unable to use the program. I always get the same error:
"REM Helps to find the right mendeley.sqlite-DB 1*www.mendeley.com.sqlite"') do @set mendeleydb=a was unexpected at this time."
This happens even if the path is set up correctly.
I am doing this:
"menextract2pdf_overwrite.bat "C:\Users\AppData\Local\Mendeley Ltd\Mendeley Desktop\[email protected]" "C:\Users\Documents\Mendeley Desktop"
Any help will be greatly appreciated, I really want to stop using Mendeley
Thanks
After I decrypted my database I used menextract2pdf to get my annotations into the pdfs. I encountered a couple of errors:
Could not find pdffile /Users/armin/Desktop/ProjekteOnHold/ceat/mendeley_archive/Mach - 1886 - Beiträge zur Analyse der Empfindungen.pdf
This is an umlaut encoding issue. Adding .decode("utf8") on line 28 solved this problem for me.
zlib.error: Error -3 while decompressing data: incorrect header check
and
ValueError: invalid literal for int() with base 10: 'dobj'
These were errors related to specific (kind of corrupted) PDFs. I added print(fn) to processpdf(fn, fn_out, annotations) so I could identify and manually remove the culprits.
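Instead of printing each file name and removing bad files by hand, the per-file step could be wrapped so that one damaged PDF does not stop the whole run. The following is a generic sketch under that assumption: process_all and its process_one argument are hypothetical names, and in practice process_one would be whatever function handles a single file.

```python
def process_all(paths, process_one):
    """Run process_one on every path; return a list of (path, error) for failures."""
    failures = []
    for path in paths:
        try:
            process_one(path)
        except Exception as exc:  # broad on purpose: damaged PDFs fail in many ways
            failures.append((path, exc))
            print("Skipping {}: {!r}".format(path, exc))
    return failures
```

The returned list names every file that raised an error, such as the zlib.error and ValueError cases above, so those files can be inspected after the remaining ones have been processed.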
Thank you for writing Menextract2pdf!