Comments (10)

jkitchin commented on September 18, 2024

That is a nice idea. Do you know how this information is extracted from the pdf? Could you send me an example of a PDF you know has this metadata in it?

from org-ref.

edgimar commented on September 18, 2024

I believe metadata can sometimes be embedded in the PDF directly, but I'm talking more about tools that read the PDF and guess which paper it is (e.g. from the title, authors, etc.) by searching for it. A nice tool I've used that renames PDFs in this manner is gscholar.

There appears to be some elisp code that semi-automates the pulling of google-scholar (and other) source data and constructing a bibtex entry from it -- see gscholar-bibtex.

As for metadata directly embedded in the PDF, there seems to be some older information on this here and here.

Lastly, the pdf-tools emacs package seems like it is able to extract (and edit!) annotations in a PDF file.

jkitchin commented on September 18, 2024

Thanks for these links. I will take a look at them. I actually tried the
python one, and after the third use or so google blocked me! But it
looks like a lot of the work is done in gscholar-bibtex already.

jkitchin commented on September 18, 2024

Update on this issue. I added https://github.com/jkitchin/org-ref/blob/master/org-ref-url-utils.el, which provides some support to drag a webpage onto a bibtex file to add a bibtex entry.

llcc commented on September 18, 2024

I implemented a rough method to add a BibTeX entry by dragging and dropping a PDF onto Emacs, provided the DOI is embedded in the file.

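(require 'doi-utils)  ; provides `doi-utils-add-bibtex-entry-from-doi'

(defun extract-metadata-from-pdf (event)
  "Add a BibTeX entry for the PDF dropped in EVENT, using a DOI found in its text."
  (interactive "e")
  (goto-char (nth 1 (event-start event)))
  (x-focus-frame nil)
  (let* ((payload (car (last event)))
         ;; Normalize Windows backslashes and expand to an absolute file name.
         (pdf-file (expand-file-name
                    (replace-regexp-in-string "\\\\" "/" (car payload))))
         ;; A temporary file avoids clobbering anything next to the PDF.
         (text-file (make-temp-file "pdftotext-" nil ".txt"))
         doi)
    (unwind-protect
        (progn
          ;; Run pdftotext without a shell so spaces in file names are safe.
          (call-process "pdftotext" nil nil nil pdf-file text-file)
          (with-temp-buffer
            (insert-file-contents text-file)
            ;; NOERROR is t, so a failed search returns nil instead of signaling.
            (when (re-search-forward "http://dx\\.doi\\.org/\\(10\\..+\\)$" nil t)
              (setq doi (match-string 1)))))
      ;; Always remove the temporary text file, even after an error.
      (delete-file text-file))
    (unless doi
      (user-error "No DOI found in %s" pdf-file))
    (doi-utils-add-bibtex-entry-from-doi doi (car org-ref-default-bibliography))))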

(bind-key "<drag-n-drop>" 'extract-metadata-from-pdf)

I used a pdftotext command from git, which converts a PDF to a text file. If the DOI of the dropped file is embedded in it, we can search the text for the DOI and then use it to add a BibTeX entry to the default bibliography file.

This is just a rough idea. I tried to extract the title from the text file, but the title is just plain text without any properties that identify it. So I think a better solution is to find a different application that can extract proper metadata, instead of pdftotext.

Any improvements or advice are appreciated!
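
The DOI search described above can be tried on a plain string, without a dropped file. This is only a sketch: `my/doi-url-regexp` and `my/first-doi-in-string` are hypothetical names, and the pattern is an assumption that also accepts `https` and `doi.org` links, which the function above does not.

```elisp
;; Hypothetical helper names; not part of org-ref.
(defvar my/doi-url-regexp
  "https?://\\(?:dx\\.\\)?doi\\.org/\\(10\\.[0-9]+/[^[:space:]\"<>]+\\)"
  "Assumed pattern for a DOI link; group 1 is the bare DOI.")

(defun my/first-doi-in-string (text)
  "Return the first DOI linked in TEXT, or nil if there is none."
  (when (string-match my/doi-url-regexp text)
    (match-string 1 text)))
```

A DOI followed directly by sentence punctuation would keep that punctuation, so real input probably needs some trimming.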

jkitchin commented on September 18, 2024

This is a good start. I will test it this weekend. I think we could set up a series of functions to try in order. First, if metadata exists we should use it, since it is the most reliable. Second, we could try this approach; the only risk is that it takes the first DOI link, which we have to assume belongs to the article. If that fails, a Google/Crossref search on some text from the PDF might be the last resort before giving up.
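
One way to picture the "series of functions" idea is a list of lookup functions tried in order until one returns a DOI. This is a minimal sketch, not org-ref code: `my/find-doi` is a hypothetical name, and each strategy is assumed to take a PDF file name and return a DOI string or nil.

```elisp
(require 'cl-lib)

(defun my/find-doi (pdf-file strategies)
  "Return the first non-nil result of calling STRATEGIES on PDF-FILE.
STRATEGIES is a list of functions, most reliable first.  Each takes
the file name and returns a DOI string or nil.  Later functions are
not called once one succeeds."
  (cl-some (lambda (strategy) (funcall strategy pdf-file))
           strategies))
```

Called with three functions for embedded metadata, text search, and online search, in that order, it would only reach the online search when the first two return nil.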

llcc commented on September 18, 2024

pdftotext accepts arguments that make it generate an HTML file containing the metadata. We can then use that file to get the title, and even the authors and other fields. I updated the function above and pushed it to https://github.com/llcc/org-ref-extraction-metadata-from-pdf; please have a look (sorry, I gave it a name starting with org-ref; please tell me if that is a problem).

It still needs some fixes, but the basic idea is there.
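
As a rough illustration of the metadata route, the sketch below runs pdftotext with its `-htmlmeta` option on the first page and reads the title back. The function names are hypothetical, and the assumption that the title arrives in a `<title>` element should be checked against the output of the installed pdftotext.

```elisp
(require 'subr-x)  ; `string-trim', `string-empty-p'

(defun my/pdf-html-meta (pdf-file)
  "Return the output of pdftotext -htmlmeta for page 1 of PDF-FILE."
  (with-temp-buffer
    ;; \"-\" as the output file makes pdftotext write to stdout.
    (call-process "pdftotext" nil t nil
                  "-htmlmeta" "-l" "1" (expand-file-name pdf-file) "-")
    (buffer-string)))

(defun my/html-meta-title (html)
  "Return the non-empty <title> text in HTML, or nil."
  (when (string-match "<title>\\([^<]*\\)</title>" html)
    (let ((title (string-trim (match-string 1 html))))
      (unless (string-empty-p title)
        title))))
```

`(my/html-meta-title (my/pdf-html-meta "paper.pdf"))` would then give either a title or nil. Many PDFs carry an empty or unhelpful title field, so nil has to be treated as a normal outcome.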

llcc commented on September 18, 2024

@jkitchin can you create a file for PDF metadata extraction in the org-ref repository? That would make it easier for us to submit changes. If so, I will merge the function into org-ref. Thanks!

jkitchin commented on September 18, 2024

I committed a draft file here https://github.com/jkitchin/org-ref/blob/master/org-ref-pdf.el

It matches two types of DOIs in pdftotext. If one doi is found, it adds it as a bibtex entry. If two are found, it offers a helm selection menu for which one you want.

I like your idea of getting more structured metadata, but on some tests of about 100 pdfs, I didn't find any useful information in them. This still needs a lot of testing, so PRs are welcome for improvements!
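
The one-or-several behavior described above could look roughly like the sketch below. It is not the code in org-ref-pdf.el: the function names are hypothetical, the single DOI pattern is an assumption, and the built-in `completing-read` stands in for the helm menu.

```elisp
(require 'cl-lib)

(defun my/dois-in-string (text)
  "Return the distinct DOIs in TEXT, in order of first appearance."
  (let ((start 0)
        dois)
    ;; Assumed pattern: \"10.\", 4-9 digits, a slash, then non-space text.
    (while (string-match "\\b10\\.[0-9]\\{4,9\\}/[^[:space:]\"<>]+" text start)
      (cl-pushnew (match-string 0 text) dois :test #'string=)
      (setq start (match-end 0)))
    (nreverse dois)))

(defun my/choose-doi (dois)
  "Return the only DOI in DOIS, or ask the user when there are several."
  (cond ((null dois) nil)
        ((null (cdr dois)) (car dois))
        (t (completing-read "DOI: " dois nil t))))
```

Removing duplicates first means a DOI repeated on the page does not trigger the selection prompt.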

jkitchin commented on September 18, 2024

I am going to close this. Most of the functionality described above has been implemented now. Thanks for the idea! The second idea about extracting annotations is outside the scope of org-ref for now I think. The only other thing not done is renaming the pdf. That might be a good idea to do some time.
