Comments (10)
That is a nice idea. Do you know how this information is extracted from the pdf? Could you send me an example of a PDF you know has this metadata in it?
from org-ref.
I believe metadata can sometimes be embedded directly in the PDF, but I'm talking more about tools that read the PDF and guess which paper it is (e.g. based on the title, authors, etc.) by searching for it. A nice tool I've used that does renaming in this manner is gscholar.
There appears to be some elisp code that semi-automates the pulling of google-scholar (and other) source data and constructing a bibtex entry from it -- see gscholar-bibtex.
As for metadata directly embedded in the PDF, there seems to be some older information on this here and here.
Lastly, the pdf-tools emacs package seems like it is able to extract (and edit!) annotations in a PDF file.
from org-ref.
Thanks for these links, edgimar. I will take a look at them. I actually tried the Python one, and after the third use or so Google blocked me! But it looks like a lot of the work is already done in gscholar-bibtex.
Professor John Kitchin
Doherty Hall A207F
Department of Chemical Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
412-268-7803
@johnkitchin
http://kitchingroup.cheme.cmu.edu
from org-ref.
Update on this issue. I added https://github.com/jkitchin/org-ref/blob/master/org-ref-url-utils.el, which provides some support to drag a webpage onto a bibtex file to add a bibtex entry.
from org-ref.
I implemented a rough method to add a BibTeX entry by dragging and dropping the PDF onto Emacs, provided the DOI is embedded in the file.
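(require 'doi-utils)  ; provides doi-utils-add-bibtex-entry-from-doi

(defun extract-metadata-from-pdf (event)
  "Add a BibTeX entry for the PDF dropped in EVENT, using the first DOI in its text."
  (interactive "e")
  (goto-char (nth 1 (event-start event)))
  (x-focus-frame nil)
  (let* ((payload (car (last event)))
         ;; Normalize Windows backslashes and expand "~" so pdftotext gets a real path.
         (pdf-file (expand-file-name
                    (replace-regexp-in-string "\\\\" "/" (car payload))))
         ;; Write the text to a temporary file instead of next to the PDF.
         (text-file (make-temp-file "pdf-doi-" nil ".txt"))
         doi)
    (unwind-protect
        (progn
          ;; call-process passes file names as separate arguments, so spaces are safe.
          (call-process "pdftotext" nil nil nil pdf-file text-file)
          (with-temp-buffer
            (insert-file-contents text-file)
            ;; NOERROR is t so that a failed search reaches the user-error below.
            (if (re-search-forward
                 "https?://\\(?:dx\\.\\)?doi\\.org/\\(10\\.[^[:space:]]+\\)" nil t)
                (setq doi (match-string 1))
              (user-error "No DOI can be found in the PDF file"))))
      ;; Always remove the temporary text file, even when no DOI is found.
      (when (file-exists-p text-file)
        (delete-file text-file)))
    (doi-utils-add-bibtex-entry-from-doi
     doi (car org-ref-default-bibliography))))

(global-set-key (kbd "<drag-n-drop>") #'extract-metadata-from-pdf)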
I used the pdftotext command from git, which can convert a PDF to a text file. If the DOI of the current file is embedded in it, we can search for the DOI and then use it to add a BibTeX entry to the default bibliography file.
This is just a rough idea. I tried to extract the title from the text file; however, the title is just plain text without any properties. So I think a better solution is to find an application that can extract the right metadata, instead of pdftotext.
Any improvements or advice are appreciated!
from org-ref.
This is a good start. I will test it this weekend. I think we could set up a series of functions to try in order. First, if embedded metadata exists, we should use it, since it is the most reliable. Second, we could try this approach; the only risk is that it takes the first DOI link, which we have to assume is for the article. If that fails, a Google/CrossRef search on some text from the PDF might be the last resort before giving up.
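The "series of functions to try" could be sketched roughly as below. This is only an illustration of the fallback order, not code from org-ref: `my/pdf-find-doi` is a hypothetical name, and the finder functions (metadata lookup, first DOI link in the text, web search) are assumed to exist elsewhere and to return a DOI string or nil.

```elisp
(require 'seq)

(defun my/pdf-find-doi (pdf-file finders)
  "Try each function in FINDERS on PDF-FILE and return the first non-nil DOI.
Each finder takes a PDF file name and returns a DOI string or nil.
Returns nil when every finder gives up."
  (seq-some (lambda (finder) (funcall finder pdf-file)) finders))

;; Usage with stand-in finders: the first one fails, the second succeeds.
(my/pdf-find-doi "paper.pdf"
                 (list (lambda (_file) nil)
                       (lambda (_file) "10.1000/example")))
;; => "10.1000/example"
```

Ordering the list from most to least reliable keeps the cheap, trustworthy sources first and the network search last.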
from org-ref.
pdftotext can accept arguments to generate an HTML file containing metadata. We can then use it to get the title, and even the authors and other fields. I updated the function above and pushed it to https://github.com/llcc/org-ref-extraction-metadata-from-pdf. Please have a look (sorry, I gave it a name starting with org-ref; please tell me if that is a problem).
It still needs some fixes, but the basic idea is there.
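A minimal sketch of reading that metadata, assuming the argument meant here is pdftotext's `-htmlmeta` option, which wraps the text in a small HTML page with a `<title>` and `<meta name="..." content="...">` tags. The function names are hypothetical, the parsing is a plain regexp pass rather than a real HTML parser, and HTML entities are left undecoded.

```elisp
(defun my/parse-html-meta (html)
  "Return an alist of (NAME . CONTENT) pairs found in the HTML string.
The page title, if present, is stored under \"Title\"."
  (let ((meta '())
        (start 0))
    (when (string-match "<title>\\([^<]*\\)</title>" html)
      (push (cons "Title" (match-string 1 html)) meta))
    (while (string-match
            "<meta name=\"\\([^\"]+\\)\" content=\"\\([^\"]*\\)\"" html start)
      (push (cons (match-string 1 html) (match-string 2 html)) meta)
      (setq start (match-end 0)))
    (nreverse meta)))

(defun my/pdf-html-meta (pdf-file)
  "Run pdftotext with -htmlmeta on PDF-FILE and return its metadata alist."
  (let ((file (expand-file-name pdf-file)))
    (with-temp-buffer
      ;; "-l 1" stops after the first page; "-" sends the output to stdout.
      (unless (eq 0 (call-process "pdftotext" nil t nil
                                  "-htmlmeta" "-l" "1" file "-"))
        (error "pdftotext failed on %s" file))
      (my/parse-html-meta (buffer-string)))))
```

Many PDFs carry empty or misleading title and author fields, so the result is best treated as a hint to check rather than as a finished entry.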
from org-ref.
@jkitchin can you create a file for PDF metadata extraction in the org-ref repository? That would make it easier for us to submit changes. If so, I will merge the function into org-ref. Thanks!
from org-ref.
I committed a draft file here https://github.com/jkitchin/org-ref/blob/master/org-ref-pdf.el
It matches two types of DOIs in the pdftotext output. If one DOI is found, it is added as a BibTeX entry. If two are found, it offers a helm selection menu so you can choose which one you want.
I like your idea of getting more structured metadata, but on some tests of about 100 pdfs, I didn't find any useful information in them. This still needs a lot of testing, so PRs are welcome for improvements!
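The "one DOI versus several" behaviour might look something like the sketch below. It is not the code in org-ref-pdf.el: the single regexp here is an assumption standing in for the two DOI patterns mentioned above, the function names are hypothetical, and the built-in `completing-read` stands in for the helm menu.

```elisp
(defun my/dois-in-string (text)
  "Return the unique DOIs found in TEXT, in order of first appearance."
  (let ((dois '())
        (start 0))
    (while (string-match "\\b10\\.[0-9]\\{4,9\\}/[^[:space:]\"<>]+" text start)
      ;; Record the end of the match first; the cleanup below resets match data.
      (setq start (match-end 0))
      ;; Drop punctuation that trails the DOI at the end of a sentence or list.
      (push (replace-regexp-in-string "[.,;)]+\\'" "" (match-string 0 text))
            dois))
    (delete-dups (nreverse dois))))

(defun my/choose-doi (dois)
  "Return the only DOI in DOIS, or ask the user when there are several."
  (cond ((null dois) nil)
        ((null (cdr dois)) (car dois))
        (t (completing-read "DOI: " dois nil t))))

(my/dois-in-string
 "See https://doi.org/10.1000/abc123. Also doi:10.1000/abc123 and 10.5555/xyz, etc.")
;; => ("10.1000/abc123" "10.5555/xyz")
```

Removing duplicates before prompting matters because the same DOI often appears several times in a paper, for example in the header and again in the footer.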
from org-ref.
I am going to close this. Most of the functionality described above has been implemented now. Thanks for the idea! The second idea about extracting annotations is outside the scope of org-ref for now I think. The only other thing not done is renaming the pdf. That might be a good idea to do some time.
from org-ref.