ceon / cermine


Content ExtRactor and MINEr

License: GNU Affero General Public License v3.0

Java 57.35% CSS 1.00% HTML 40.62% PHP 0.68% JavaScript 0.36%

metadata-extraction reference-parsing affiliation-parsing machine-learning java pdf

cermine's People

matfed pdendek pszostek kuraju neozhangthe1 dominik-bln gitter-badger adamchandra danduma project-renard-survey osake luzc08 anukat2015 mkobos nooralahzadeh mjspka talgathairov mkrnr uchman21 yklab janes dtrckd muddasani pandurang-kolekar eitanf distresearch seanrife zeyd31 ivan-mashonskiy kyvaith jyfmidi findingbenjamin fendaq gesielrios hatemhosny spbriggs mguggari mjoppich nkconnor dnarnaatp emallson researcherone dtkaczyk j4m355 pvk444 emrul fbeneventi andrei-rusu-imi pkrouth birkbeckctp beckettws shnupta hanssen0 softuncle zy077962 climostatistics jlleitschuh lee1985 pidugusundeep hzxiao jbris beleiaya henry-nlp rrchaudhari c-dongbo 5l1v3r1 murrman95 meatware rafaelbidese mingzi151 varung carnagie dipakbagal crabtail clementleong dlhuang kzyfn techwolfalpha florencekim etspielberg rohit-21 cgrard erima2020 princesegzy01 wangludewdrop resourcesunite tbarkai palantir555 mmarrone bwakkie eleonorap1996 stinger101 uneetsingh cherfaoui-syphax nostosgenomics xielm12 vladdie

cermine's Issues

bibtex to xml

NullPointerException thrown for some articles

Hi,
When I tried to parse the following documents

With the following piece of code:

Element result = extractor.getContentAsNLM();

I get the following exception

Exception in thread "main" java.lang.NullPointerException
	at pl.edu.icm.cermine.metadata.transformers.MetadataToNLMConverter.convertJournalMetadata(MetadataToNLMConverter.java:61)
	at pl.edu.icm.cermine.metadata.transformers.MetadataToNLMConverter.convert(MetadataToNLMConverter.java:42)
	at pl.edu.icm.cermine.InternalContentExtractor.getMetadataAsNLM(InternalContentExtractor.java:198)
	at pl.edu.icm.cermine.InternalContentExtractor.getContentAsNLM(InternalContentExtractor.java:303)
	at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:662)
	at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:678)

improve accuracy of doi extraction

I use Cermine in a small console utility to extract references from papers. I noticed that DOI extraction often goes wrong; the most common mistake is an extra bracket appended to the end of the DOI, which is why I remove it manually in my code (https://github.com/antonkulaga/extractor/blob/master/src/main/scala/org/comp/bio/aging/extractor/Reference.scala#L22). Please improve DOI extraction accuracy!
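A minimal sketch of the kind of post-processing described above, assuming the extra character is a closing bracket with no matching opening bracket inside the DOI. The class and method names are hypothetical and are not part of CERMINE or of the linked utility.

```java
final class DoiCleaner {
    /**
     * Removes trailing ')' or ']' characters that have no matching opening
     * bracket inside the DOI. Balanced brackets are left alone, because
     * some valid DOIs contain them.
     */
    static String stripUnbalancedTrailingBrackets(String doi) {
        String s = doi.trim();
        while (!s.isEmpty()) {
            char last = s.charAt(s.length() - 1);
            char open;
            if (last == ')') {
                open = '(';
            } else if (last == ']') {
                open = '[';
            } else {
                break; // does not end with a closing bracket
            }
            if (count(s, open) >= count(s, last)) {
                break; // brackets are balanced, so this one belongs to the DOI
            }
            s = s.substring(0, s.length() - 1);
        }
        return s;
    }

    private static long count(String s, char c) {
        return s.chars().filter(ch -> ch == c).count();
    }

    public static void main(String[] args) {
        System.out.println(stripUnbalancedTrailingBrackets("10.1000/xyz123)"));
        System.out.println(stripUnbalancedTrailingBrackets("10.1002/(SICI)1097-4571(199601)"));
    }
}
```

Counting brackets rather than blindly dropping the last character keeps DOIs that legitimately end in a bracket intact.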

Documentation for the XML output

Hi,

Is there documentation somewhere about the structure of the XML output, especially the available tags? I could not find any.

Thanks!

Could not find or load main class

I am obviously doing something wrong here: trying to run the latest 1.9 jar with dependencies from the terminal keeps returning "Error: Could not find or load main class pl.edu.icm.cermine.ContentExtractor". I have reinstalled Java and reset the JRE home and PATH a couple of times, so I wonder what I am missing. Running on Windows 10 x64 with 32-bit Java (131).

AssertionError in ContentExtractor().getContentAsNLM()

When attempting to extract data from this article https://doi.org/10.7717/peerj-cs.118 (and probably any other article with the same PDF layout) the following failure occurs:

java.lang.AssertionError
	at pl.edu.icm.cermine.structure.readingorder.DocumentPlane.add(DocumentPlane.java:165)
	at pl.edu.icm.cermine.structure.readingorder.DocumentPlane.<init>(DocumentPlane.java:95)
	at pl.edu.icm.cermine.structure.HierarchicalReadingOrderResolver.groupZonesHierarchically(HierarchicalReadingOrderResolver.java:207)
	at pl.edu.icm.cermine.structure.HierarchicalReadingOrderResolver.reorderZones(HierarchicalReadingOrderResolver.java:122)
	at pl.edu.icm.cermine.structure.HierarchicalReadingOrderResolver.resolve(HierarchicalReadingOrderResolver.java:96)
	at pl.edu.icm.cermine.ExtractionUtils.resolveReadingOrder(ExtractionUtils.java:79)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:354)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341)
	at pl.edu.icm.cermine.InternalContentExtractor.getContentAsNLM(InternalContentExtractor.java:301)
	at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:662)
	at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:678)
	at pl.edu.icm.cermine.ContentExtractorLoopTest.extractionLoopTest(ContentExtractorLoopTest.java:57)

Add pdfimages support for image extraction?

Hi,

Are there plans to add a call to pdfimages (from xpdf/poppler) to ensure images are extracted when parsing full text via Grobid? The accuracy and performance of pdfimages seem to be very good, but I don't think it is currently used directly by any PDF parsers.

Missing references

Cermine does not always correctly handle reference sections that span multiple pages.

Often, only the last page is recognised as references, and the previous pages are either included as part of the body or left out of the output altogether.

Examples

Tested with 1.8-SNAPSHOT.

Input

Output

Extraneous references

Cermine does not correctly handle reference sections that are followed by an appendix.

The text following the reference section is mistakenly recognised as additional references.

Example

Tested with 1.8-SNAPSHOT.

Input

Output

Use enums for identifiers

I didn't know which values I can pass to getId until I found the constants ID_DOI, ID_URN, ....
In my opinion it would make the code a bit clearer if you use an enum for the possible identifier types.

A similar remark applies to the different types of dates.
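A rough sketch of what the enum-based variant could look like. Only the identifier kinds DOI and URN are taken from the constants named above; the `IdentifierType` and `DocumentIds` names and the `setId`/`getId` signatures are assumptions for illustration, not the actual CERMINE API.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical identifier types; only DOI and URN are named in the issue. */
enum IdentifierType { DOI, URN }

/** Hypothetical holder that keys identifiers by enum value instead of by string constant. */
final class DocumentIds {
    private final Map<IdentifierType, String> ids = new EnumMap<>(IdentifierType.class);

    void setId(IdentifierType type, String value) {
        ids.put(type, value);
    }

    /** Returns the identifier of the given type, or null if none was set. */
    String getId(IdentifierType type) {
        return ids.get(type);
    }

    public static void main(String[] args) {
        DocumentIds ids = new DocumentIds();
        ids.setId(IdentifierType.DOI, "10.7717/peerj-cs.118");
        System.out.println(ids.getId(IdentifierType.DOI));
    }
}
```

With an enum, the accepted values show up in IDE completion and a mistyped identifier type becomes a compile error rather than a silent null. The same pattern would apply to date types.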

Problem extracting abstract to jats

The attached paper parses just fine to text or zones, but parsing it to JATS skips the abstract completely.
CIDR_17_020.pdf

Fatal java.lang.OutOfMemoryError thrown while processing document

The following exception is thrown:

java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3236)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
	at com.itextpdf.text.pdf.PdfReader.FlateDecode(PdfReader.java:2188)
	at com.itextpdf.text.pdf.PdfReader.FlateDecode(PdfReader.java:2043)
	at com.itextpdf.text.pdf.FilterHandlers$Filter_FLATEDECODE.decode(FilterHandlers.java:107)
	at com.itextpdf.text.pdf.PdfReader.decodeBytes(PdfReader.java:2619)
	at com.itextpdf.text.pdf.parser.PdfImageObject.<init>(PdfImageObject.java:189)
	at com.itextpdf.text.pdf.parser.PdfImageObject.<init>(PdfImageObject.java:168)
	at com.itextpdf.text.pdf.parser.ImageRenderInfo.prepareImageObject(ImageRenderInfo.java:150)
	at com.itextpdf.text.pdf.parser.ImageRenderInfo.getImage(ImageRenderInfo.java:140)
	at pl.edu.icm.cermine.structure.ITextCharacterExtractor$BxDocumentCreator.renderImage(ITextCharacterExtractor.java:366)
	at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$ImageXObjectDoHandler.handleXObject(PdfContentStreamProcessor.java:1311)
	at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.displayXObject(PdfContentStreamProcessor.java:375)
	at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$6100(PdfContentStreamProcessor.java:83)
	at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$Do.invoke(PdfContentStreamProcessor.java:1023)
	at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:310)
	at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:448)
	at pl.edu.icm.cermine.structure.ITextCharacterExtractor.extractCharacters(ITextCharacterExtractor.java:112)
	at pl.edu.icm.cermine.ExtractionUtils.extractCharacters(ExtractionUtils.java:60)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:346)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:339)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:339)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:339)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:339)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:339)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:339)
	at pl.edu.icm.cermine.InternalContentExtractor.getContentAsNLM(InternalContentExtractor.java:299)
	at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:662)
	at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:678)
	at eu.dnetlib.iis.wf.metadataextraction.MetadataExtractorMapper.handleContent(MetadataExtractorMapper.java:249)

when processing this document:

http://eprints.nottingham.ac.uk/41436/1/PhD_Thesis_Maria_Manuela_Marinho_de_Castro.pdf

using CERMINE version 1.13.

This seems to be similar to #33, because the only way to make it work is to increase the Xmx memory up to 5GB. I tried forcing the most recent iText version, 5.5.12, but that did not solve the issue.

As already mentioned in #33#issuecomment-257929226, the IIS metadataextraction mapper is allowed to use 4GB of memory. Assigning more memory to the job that triggers CERMINE would decrease task parallelization. For now I am simply blacklisting this document.

Should we consider this a CERMINE issue, or should we report it to the iText developers?
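The trace above shows the heap being exhausted while a decoded image stream grows an in-memory buffer. One general way to fail fast instead of running out of heap is to cap how many bytes a buffer may receive. The sketch below illustrates that idea with plain JDK classes; `BoundedOutputStream` is a hypothetical name, and neither iText nor CERMINE is known to expose a hook for plugging such a stream in.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Wraps a stream and rejects writes once a byte limit would be exceeded. */
final class BoundedOutputStream extends FilterOutputStream {
    private final long limit;
    private long written;

    BoundedOutputStream(OutputStream out, long limit) {
        super(out);
        this.limit = limit;
    }

    @Override
    public void write(int b) throws IOException {
        ensureRoom(1);
        out.write(b);
        written++;
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        ensureRoom(len);
        out.write(buf, off, len);
        written += len;
    }

    // Throws a recoverable IOException instead of letting the buffer grow unbounded.
    private void ensureRoom(long extra) throws IOException {
        if (written + extra > limit) {
            throw new IOException("Output exceeds limit of " + limit + " bytes");
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (BoundedOutputStream out = new BoundedOutputStream(sink, 4)) {
            out.write(new byte[] {1, 2, 3, 4});
            out.write(5); // exceeds the limit
        } catch (IOException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

An IOException for one oversized stream can be handled per document, whereas an OutOfMemoryError typically takes down the whole task.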

Support theses and books

CERMINE has problems with academic literature other than classical articles. Support for theses and books, for example, would be great.

For example, for my thesis the information returned is:

Type: ARTICLE // expected: thesis
Author: "Master's Thesis and Presented by Tobias Diez and Assessors: Dr. G. Rudolph Dr. R. Verch") // expected Tobias Diez
Pages: "86127") // expected: 1 - 127
Title: "Slice theorem for Fréchet group actions and covariant symplectic field theory" // correct

.jar won't run

I have downloaded "cermine-impl-1.9-jar-with-dependencies.jar" on Windows 10.

I created a folder C:\NewTest and placed the jar in it.

The location of the jar is C:\NewTest\cermine-impl-1.9-jar-with-dependencies.jar

I placed the PDF in a folder, C:\NewTest\Input\

I then opened a command prompt and typed the command below:

java -cp cermine-impl-1.9-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path C:\NewTest\Input\

I get the error "Error: Could not find or load main class pl.edu.icm.cermine.ContentExtractor".
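This error means the JVM could not locate the named class on the classpath it was given. Because the jar is referenced by a relative name, one thing worth ruling out is that the command was run from a different directory, or that the downloaded jar is incomplete. The following diagnostic sketch uses only the JDK to check whether a jar actually contains a given class; `JarClassCheck` is a hypothetical helper, not something shipped with CERMINE.

```java
import java.io.IOException;
import java.util.jar.JarFile;

final class JarClassCheck {
    /** Returns true if the jar contains a .class entry for the fully qualified class name. */
    static boolean containsClass(String jarPath, String className) throws IOException {
        String entryName = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java JarClassCheck <jar> <fully.qualified.ClassName>");
            return;
        }
        System.out.println(containsClass(args[0], args[1]));
    }
}
```

If opening the jar fails, or the check returns false for pl.edu.icm.cermine.ContentExtractor, the problem lies with the file or its path rather than with the Java installation.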

Affiliations: Improve the way of finding country in affiliation string

Reported by @mkobos:

Affiliations: Improve way of finding country in affiliation string (this might potentially impact 23% of all records - these are the records where the country name (and thus country code) is missing)

1.1 Easy improvement. Look for the name of the country in affiliation string (this might solve the problem in up to 23%*25%= 6% of all affiliation records).
1.2 More difficult improvement. Recognize the country based on a dictionary of well-known scientific organizations and a dictionary of US states and major cities in various countries (this might solve the problem in up to 23%*80% = 18% of all affiliation records).

Note that all numbers come from the first version of report "Analysis of affiliations extracted by IIS from XMLs and PDFs" available at https://issue.openaire.research-infrastructures.eu/issues/2010, namely: https://issue.openaire.research-infrastructures.eu/attachments/download/509/2016-04-17_21_46_analysis.html. All these numbers are very rough approximations of the real values.
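The "easy improvement" in 1.1 could be sketched roughly as follows. The tiny dictionary, the class name, and the choice to prefer the last country mention in the string are all assumptions made for illustration; a real implementation would need a complete country list and name variants.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CountryFinder {
    // Illustrative subset only: lower-cased country name -> ISO 3166-1 alpha-2 code.
    private static final Map<String, String> COUNTRIES = new LinkedHashMap<>();
    static {
        COUNTRIES.put("united kingdom", "GB");
        COUNTRIES.put("netherlands", "NL");
        COUNTRIES.put("germany", "DE");
        COUNTRIES.put("finland", "FI");
        COUNTRIES.put("poland", "PL");
    }

    /**
     * Looks for a known country name as a whole word in the affiliation string.
     * If several names occur, the one mentioned last wins, on the assumption
     * that the country usually closes the address.
     */
    static Optional<String> findCountryCode(String affiliation) {
        String text = affiliation.toLowerCase(Locale.ROOT);
        String bestCode = null;
        int bestPos = -1;
        for (Map.Entry<String, String> e : COUNTRIES.entrySet()) {
            Matcher m = Pattern.compile("\\b" + Pattern.quote(e.getKey()) + "\\b").matcher(text);
            int pos = -1;
            while (m.find()) {
                pos = m.start(); // remember the last occurrence of this name
            }
            if (pos > bestPos) {
                bestPos = pos;
                bestCode = e.getValue();
            }
        }
        return Optional.ofNullable(bestCode);
    }

    public static void main(String[] args) {
        System.out.println(findCountryCode(
            "Department of Oncology, Helsinki University Central Hospital, Helsinki, Finland"));
    }
}
```

Affiliations that end in a US state abbreviation or a bare city name return an empty result here, which is the gap that the dictionary-based approach in 1.2 is meant to close.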

Incorrectly extracts DOI or PMID from bibliography

DOIs are frequently reported in the bibliography sections. Example from PlosONE:

Van Heuven WJB, Dijkstra T. Language comprehension in the bilingual brain: fMRI and ERP support
for psycholinguistic models. Brain Res Rev. 2010; 64(1):104 – 22. doi: 10.1016/j.brainresrev.2010.03.
002 PMID: 20227440

However, they are not extracted by CERMINE, and the first part of DOI gets interpreted as repeated information about volume:

        <mixed-citation>
          14.
          <string-name>
            <surname>Van Heuven</surname>
            <given-names>WJB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dijkstra</surname>
            <given-names>T.</given-names>
          </string-name>
          <article-title>Language comprehension in the bilingual brain: fMRI and ERP support for psycholinguistic models</article-title>
          .
          <source>Brain Res Rev</source>
          .
          <year>2010</year>
          ;
          <volume>64</volume>
          (
          <issue>1</issue>
          ):
          <fpage>104</fpage>
          -
          <lpage>22</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.brainresrev.
          <year>2010</year>
          .
          <volume>03</volume>
          . 002 PMID:
          <fpage>20227440</fpage>
        </mixed-citation>
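For references in the format shown above, a pre-tokenisation pass could pull out the labelled identifiers before the rest of the string is parsed. This is a sketch under stated assumptions: the identifiers are introduced by the literal labels "doi:" and "PMID:", the DOI runs up to the PMID label or the end of the string, and any whitespace inside the DOI comes from line wrapping. The class and method names are hypothetical and this is not how CERMINE's parser works.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReferenceIds {
    // "doi:" label, then the DOI up to a following "PMID:" label or the end of the text.
    private static final Pattern DOI = Pattern.compile(
        "doi:\\s*(10\\.\\d{4,9}/.+?)(?=\\s+PMID:|\\s*$)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private static final Pattern PMID = Pattern.compile(
        "PMID:\\s*(\\d+)", Pattern.CASE_INSENSITIVE);

    /** Returns the labelled DOI with line-wrap whitespace removed, if present. */
    static Optional<String> doi(String reference) {
        Matcher m = DOI.matcher(reference);
        return m.find()
            ? Optional.of(m.group(1).replaceAll("\\s+", ""))
            : Optional.empty();
    }

    /** Returns the labelled PMID, if present. */
    static Optional<String> pmid(String reference) {
        Matcher m = PMID.matcher(reference);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String ref = "Brain Res Rev. 2010; 64(1):104 – 22. "
            + "doi: 10.1016/j.brainresrev.2010.03.\n002 PMID: 20227440";
        System.out.println(doi(ref));
        System.out.println(pmid(ref));
    }
}
```

A known limitation of this sketch is that any text following the DOI other than a PMID label would be absorbed into it, so the pattern fits the PLOS ONE layout but not reference styles in general.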

Publish in a reliable place

http://maven.icm.edu.pl/artifactory/ is very slow. It would be nice to see CERMINE published on Bintray or Maven Central as well.

Incorrectly assigns references to citations

Cermine seems to incorrectly assign references to citations, even for perfectly recognized PDFs, and for correctly recognized bibliography. A citation co-authored by X is being assigned to other publications also including author X that occur in the bibliography. This is a very frequent bug and it often happens for obvious cases where the name of the first author is different for the citation and for the actually assigned item in the bibliography.

I can provide examples with specific PDFs if required, although this behaviour should be very easy to elicit using any PDF that includes citations co-authored by the same person.

Affiliations: Improve cleaning the affiliation string

Originally reported by @mkobos:

Affiliations: Improve cleaning the affiliation string because, in a small number of cases, the organization field contains some text that doesn't belong there (e.g. address, short name, name of the organization in a different language, unrelated text). This might impact up to 20% of all affiliation records.

Feature request: OCR

Some PDFs contain text that has been generated using an unspecified OCR process. For such PDFs, the quality of Cermine output depends directly on the quality of the particular OCR process, which may be far from satisfactory.

It would be great if Cermine performed its own OCR, and attempted to process both the existing and the newly recognised text, in order to get the best result.

Performing OCR may also be the only way to solve #11.

Examples

Tested with 1.8-SNAPSHOT.

Input

BrownPalsbergReviewCopy.pdf

Output

BrownPalsbergReviewCopy.cermxml.txt

Cannot build

Hi,

When trying to build on the latest tag, I get:

CERMINE/cermine-impl % mvn compile assembly:single
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for pl.edu.icm.cermine:cermine-impl:jar:1.7-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-javadoc-plugin is missing. @ pl.edu.icm.cermine:cermine-parent:1.7-SNAPSHOT, /home/phyks/tmp/papers/CERMINE/pom.xml, line 79, column 21
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ pl.edu.icm.cermine:cermine-parent:1.7-SNAPSHOT, /home/phyks/tmp/papers/CERMINE/pom.xml, line 67, column 21
[WARNING] 
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING] 
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING] 
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building CERMINE Engine Implementation - 1.7-SNAPSHOT 1.7-SNAPSHOT
[INFO] ------------------------------------------------------------------------
Downloading: http://maven.icm.edu.pl/artifactory/repo/edu/umass/cs/mallet/mallet/0.1.3/mallet-0.1.3.pom
Downloading: http://maven.icm.edu.pl/artifactory/repo/edu/umass/cs/mallet/grmm-deps/0.1.3/grmm-deps-0.1.3.pom
Downloading: http://maven.icm.edu.pl/artifactory/repo/org/bouncycastle/bcprov-jdk14/1.47/bcprov-jdk14-1.47.pom
Downloading: https://repo.maven.apache.org/maven2/org/bouncycastle/bcprov-jdk14/1.47/bcprov-jdk14-1.47.pom
Downloaded: https://repo.maven.apache.org/maven2/org/bouncycastle/bcprov-jdk14/1.47/bcprov-jdk14-1.47.pom (819 B at 1.2 KB/sec)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:29 min
[INFO] Finished at: 2016-01-17T20:42:21+01:00
[INFO] Final Memory: 15M/102M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project cermine-impl: Could not resolve dependencies for project pl.edu.icm.cermine:cermine-impl:jar:1.7-SNAPSHOT: Failed to collect dependencies at edu.umass.cs.mallet:mallet:jar:0.1.3: Failed to read artifact descriptor for edu.umass.cs.mallet:mallet:jar:0.1.3: Could not transfer artifact edu.umass.cs.mallet:mallet:pom:0.1.3 from/to yadda (http://maven.icm.edu.pl/artifactory/repo): Failed to transfer file: http://maven.icm.edu.pl/artifactory/repo/edu/umass/cs/mallet/mallet/0.1.3/mallet-0.1.3.pom. Return code is: 502 , ReasonPhrase:Bad Gateway. -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

Thanks

Affiliations: Recognize that some texts currently provided by CERMINE as affiliations are certainly not affiliations

Originally reported by @mkobos:

Affiliations: Recognize that some texts currently provided by CERMINE as affiliations are certainly not affiliations, e.g.: "Electronic address:" (this is the most numerous "affiliation" - it can be found in 0.03% of all affiliation records), "These authors contributed equally to this work".
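A simple starting point could be a filter over known non-affiliation phrases. The two phrases below are the ones quoted in the report; matching them as case-insensitive prefixes, and the class and method names, are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class AffiliationFilter {
    // Lower-cased phrases taken from the report above.
    private static final List<String> NON_AFFILIATION_PREFIXES = Arrays.asList(
        "electronic address:",
        "these authors contributed equally to this work");

    /** Returns true if the text starts with a phrase known not to be an affiliation. */
    static boolean looksLikeNonAffiliation(String text) {
        String normalized = text.trim().toLowerCase(Locale.ROOT);
        for (String prefix : NON_AFFILIATION_PREFIXES) {
            if (normalized.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeNonAffiliation("Electronic address: someone@example.org"));
        System.out.println(looksLikeNonAffiliation("Department of Oncology, University of Cambridge, UK"));
    }
}
```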

CRF-based Affiliation parser fails with StackOverflowError on large input text

It seems the StackOverflowError is thrown by the Mallet library:

2016-09-16 00:56:09,493 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.StackOverflowError
    at java.lang.StringBuffer.append(StringBuffer.java:272)
    at edu.umass.cs.mallet.grmm.inference.TRP.lambdaPropagation(TRP.java:487)
    at edu.umass.cs.mallet.grmm.inference.TRP.lambdaPropagation(TRP.java:491)
    at edu.umass.cs.mallet.grmm.inference.TRP.lambdaPropagation(TRP.java:491)
    [...]

when providing large text input to CRFAffiliationParser#parse().

After several tests, it turned out that affiliation text exceeding 8000-9000 characters causes this problem.

Here is an example causing StackOverflowError:

Affiliations of authors:Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, UK (QG, JT, AMD, MS, JEA, DFE, PDPP); Netherlands Cancer Institute, Antoni van Leeuwenhoek hospital, Amsterdam, the Netherlands (MKS, SC, AB, FBH); Department of Epidemiology, Harvard School of Public Health, Boston, MA (PK, SH, DJH, SL); Program in Genetic Epidemiology and Statistical Genetics, Department of Epidemiology, Harvard School of Public Health, Boston, MA (PK, CCh, DJH, SL); Department of Obstetrics and Gynecology, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland (SK, RF, TAM, HN); Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK (MKB, QW, JD, KM, ML, SK, DFE, PDPP); Department of Genetics, QIMR Berghofer Medical Research Institute, Brisbane, Australia (JBee, GCT); Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm 17177, Sweden (KC, HD, ME, JiL, JBr, KH, PH); Laboratory for Translational Genetics, Department of Oncology, University of Leuven, Leuven, Belgium (DL); Vesalius Research Center, VIB, Leuven, Belgium (DL); Oncology Department, University Hospital Gasthuisberg, Leuven, Belgium (CW, KL); Copenhagen General Population Study, Herlev Hospital, Copenhagen, Denmark (SEB, BGN, SFN); Department of Clinical Biochemistry, Herlev Hospital, Copenhagen University Hospital, Denmark (SEB, BGN, SFN); Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark (SEB, BGN); Department of Breast Surgery, Herlev Hospital, Copenhagen University Hospital, Denmark (HF); Division of Cancer Epidemiology, German Cancer Research Center (Deutsches Krebsforschungszentrum), Heidelberg, Germany (JCC, AR, PS, DC, AHü, RK, MB); Department of Cancer Epidemiology/Clinical Cancer Registry and Institute for Medical Biometrics and Epidemiology, University Clinic Hamburg-Eppendorf, Hamburg, Germany 
(DFJ); Department of Oncology, Helsinki University Central Hospital, Helsinki, Finland (CBl); Department of Clinical Genetics, Helsinki University Central Hospital, Helsinki, Finland (KA); Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN (FJC); Department of Health Sciences Research, Mayo Clinic, Rochester, MN (JEO, CV); Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada (ILA); Ontario Cancer Genetics Network, Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital, Toronto, Ontario, Canada (ILA, GG); Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada (JAK); Prosserman Centre for Health Research, Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada (JAK); Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada (AMM); Laboratory Medicine Program, University Health Network, Toronto, Ontario, Canada (AMM); Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA (CAH, BEH, FS); University of Hawaii Cancer Centre, Honolulu, HI (LLM); Centre for Epidemiology and Biostatistics, Melbourne School of Population Health, the University of Melbourne, Melbourne, Australia (JLH, CA, GGG, RLM); Genetic Epidemiology Laboratory, Department of Pathology, the University of Melbourne, Melbourne, Australia (HT, MCS); Sheffield Cancer Research Centre, Department of Oncology, University of Sheffield, Sheffield, UK (AC, MWRR); Academic Unit of Pathology, Department of Neuroscience, University of Sheffield, UK (SSC); Cancer Epidemiology Centre, Cancer Council Victoria, Melbourne, Australia (GGG, RLM); Anatomical Pathology, the Alfred Hospital, Melbourne, Australia (CM); Laboratory of Cancer Genetics and Tumor Biology, Department of Clinical Chemistry and Biocenter Oulu, University of Oulu, Oulu, Finland (RW); Laboratory of Cancer Genetics 
and Tumor Biology, Northern Finland Laboratory Centre NordLab, Oulu, Finland (KP); Department of Oncology, Oulu University Hospital, University of Oulu, Oulu, Finland (AJV); Department of Surgery, Oulu University Hospital, University of Oulu, Oulu, Finland (MG); Department of Medical Oncology, Family Cancer Clinic, Erasmus MC Cancer Institute, Rotterdam, the Netherlands (MJH, AHo, JWMM, AMWvdO); Department of Obstetrics and Gynecology, University of Heidelberg, Heidelberg, Germany (FM, AS, RY, BB); National Center for Tumor Diseases, University of Heidelberg, Heidelberg, Germany (FM, AS); Molecular Epidemiology Group, German Cancer Research Center, Heidelberg, Germany (BB); Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD (JF, SJC); Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD (JF); Core Genotyping Facility, Frederick National Laboratory for Cancer Research, Gaithersburg, MD (SJC); Department of Cancer Epidemiology and Prevention, M. Sklodowska-Curie Memorial Cancer Center & Institute of Oncology, Warsaw, Poland (JoL); Division of Cancer Studies, National Institute for Health Research, Comprehensive Biomedical Research Centre, Guy’s & St. 
Thomas’ NHS Foundation Trust in partnership with King’s College London, London, UK (EJS); Wellcome Trust Centre for Human Genetics and Oxford NIHR Biomedical Research Centre, University of Oxford, UK (IT); Clinical Science Institute, University Hospital Galway, Galway, Ireland (MJK, NM); Division of Clinical Epidemiology and Aging Research, German Cancer Research Center, Heidelberg, Germany (HB, AKD, VA); German Cancer Consortium (DKTK), Heidelberg, Germany (HB, AKD); Saarland Cancer Registry, Saarbrücken, Germany (BH); Imaging Center, Department of Clinical Pathology, Kuopio University Hospital, Kuopio, Finland (AM, VMK, JMH); School of Medicine, Institute of Clinical Medicine, Pathology and Forensic Medicine, University of Eastern Finland, Kuopio, Finland (AM, VMK, JMH); Biocenter Kuopio, Cancer Center of Eastern Finland, Kuopio University Hospital, Kuopio, Finland (VKa); School of Medicine, Institute of Clinical Medicine, Oncology, University of Eastern Finland, Kuopio, Finland (VKa); Department of Human Genetics & Department of Pathology, Leiden University Medical Center, 2300 RC Leiden, the Netherlands (PD); Department of Surgical Oncology, Leiden University Medical Center, 2300 RC Leiden, the Netherlands (RAEMT); Family Cancer Clinic, Department of Medical Oncology, Erasmus MC-Daniel den Hoed Cancer Centrer, Rotterdam, the Netherlands (CS); Unit of Molecular Bases of Genetic Risk and Genetic Testing, Department of Preventive and Predictive Medicine, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy (PR); IFOM, Fondazione Istituto FIRC di Oncologia Molecolare, Milan, Italy (PP, PM); Division of Cancer Prevention and Genetics, Istituto Europeo di Oncologia, Milan, Italy (BB); Cogentech Cancer Genetic Test Laboratory, Milan, Italy (PM); David Geffen School of Medicine, Department of Medicine, Division of Hematology and Oncology, University of California at Los Angeles, CA (PAF); Department of Gynecology and Obstetrics, University Hospital Erlangen, 
Friedrich-Alexander University Erlangen-Nuremberg, Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany (PAF, MWB, AHe); Institute of Human Genetics; University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nuremberg, Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany (ABE); Western Sydney and Nepean Blue Mountains Local Health Districts, Westmead Millennium Institute for Medical Research, University of Sydney, Sydney, Australia (RB); Peter MacCallum Cancer Center, Melbourne, Australia (kConFab Investigators); the University of Melbourne, Melbourne, Australia (KAP); Division of Cancer Medicine, Peter MacCallum Cancer Centre, Melbourne, Australia (KAP); Centro de Investigación en Red de Enfermedades Raras, Valencia, Spain (JBen); Human Genetics Group, Human Cancer Genetics Program, Spanish National Cancer Research Centre, Madrid, Spain (JBen); Servicio de Oncología Médica, Hospital Universitario La Paz, Madrid, Spain (MPZ); Servicio de Cirugía General y Especialidades, Hospital Monte Naranco, Oviedo, Spain (JIAP); Servicio de Anatomía Patológica, Hospital Monte Naranco, Oviedo, Spain (PM); Department of Genetics and Pathology, Pomeranian Medical University, Szczecin, Poland (AJ, JL, KJB, KD); Molecular Genetics of Breast Cancer, German Cancer Research Center, Heidelberg, Germany (UH, MK); Frauenklinik der Stadtklinik Baden-Baden, Baden-Baden, Germany (HUU); Institute of Pathology, Städtisches Klinikum Karlsruhe, Karlsruhe, Germany (TR); Department of Oncology - Pathology, Karolinska Institutet, Stockholm, Sweden (SM); Department of Genetics, Institute for Cancer Research, Oslo University Hospital, Radiumhospitalet, Oslo, Norway (VKr, SN); Faculty of Medicine (Faculty Division Ahus), University of Oslo, Norway (VKr, SN); Genomic Medicine, Manchester Academic Health Science Centre, University of Manchester, Central Manchester Foundation Trust, St. 
Mary’s Hospital, Manchester, UK (DGE); Cambridge Breast Research Unit and NIHR Cambridge Biomedical Research Centre, University of Cambridge, Department of Oncology, Cambridge, UK (JEA, HME, CCa); Cambridge Experimental Cancer Medicine Centre, Cambridge, UK (JEA, HME, CCa); Warwick Clinical Trials Unit, University of Warwick, UK (LH, JAD); Cancer Research UK Clinical Trials Unit, Institute for Cancer Studies, the University of Birmingham, Edgbaston, Birmingham, UK (SB); Early Detection Research Group, Division of Cancer Prevention National Cancer Institute Bethesda, MD (CBe); Department of Biology, University of Pisa, Pisa, Italy (DC); Epidemiology Research Program, American Cancer Society, Atlanta, GA (WRD, SMG, MMG); Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Boston, MA (SH); Division of Biostatistics and Epidemiology, University of Massachusetts-Amherst School of Public Health and Health Sciences, Amherst, MA (SH); Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD (RNH, MJM); Department of Nutrition, Harvard School of Public Health, Boston, MA (WW); Genomic Epidemiology Group, German Cancer Research Center, Heidelberg, Germany (FC); Breast Cancer Functional Genomics Laboratory, Cancer Research UK Cambridge Institute, University of Cambridge, Li Ka Shing Centre, UK (SFC, CCa); Breakthrough Breast Cancer Research Centre, Division of Breast Cancer Research, the Institute of Cancer Research, London, UK (MGC); Division of Genetics and Epidemiology, Institute of Cancer Research, Sutton, Surrey, UK (MGC, NR); Faculty of Medicine, University of Southampton, UK (DME).

retrieved by IIS PMC parser from one of the PMC XML resources.

I've just created a corresponding issue in IIS (openaire/iis#663) to work around this problem.
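A caller-side guard along these lines could keep oversized input away from the parser. This is only a sketch: the semicolon as delimiter is suggested by the example text, the threshold would be chosen from the 8000-9000 character range reported above, and `AffiliationSplitter` is a hypothetical helper rather than part of CERMINE or IIS.

```java
import java.util.ArrayList;
import java.util.List;

final class AffiliationSplitter {
    /**
     * Returns the text unchanged if it fits within maxLength; otherwise splits it
     * on semicolons so that each part can be parsed separately.
     * Parts are trimmed and empty parts are dropped.
     */
    static List<String> splitIfTooLong(String text, int maxLength) {
        List<String> parts = new ArrayList<>();
        if (text.length() <= maxLength) {
            parts.add(text.trim());
            return parts;
        }
        for (String piece : text.split(";")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "Vesalius Research Center, VIB, Leuven, Belgium (DL); "
            + "Oncology Department, University Hospital Gasthuisberg, Leuven, Belgium (CW, KL)";
        System.out.println(splitIfTooLong(text, 40));
    }
}
```

Splitting on semicolons is imperfect, since some affiliations in the example contain a semicolon themselves, and an individual part could still exceed the limit, so callers would still need to check the length of each part before parsing it.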

xref elements sometimes have multiple values for rid element

Hi folks,

Looks like parsing this document with Cermine results in some weird rid= values for xref elements:

doc2vec.pdf

<xref ref-type="bibr" rid="16 14">(Mikolov et al., 2013c)</xref>
<xref ref-type="bibr" rid="25 27">(Socher et al., 2011b)</xref>
<xref ref-type="bibr" rid="17 16 20 13">(Morin & Bengio, 2005; Mnih & Hinton, 2008; Mikolov et al., 2013c)</xref>
<xref ref-type="bibr" rid="13 14 17">(Mikolov et al., 2013a)</xref>
<xref ref-type="bibr" rid="36">(Collobert & Weston, 2008; Zhila et al., 2013)</xref>

I'm pretty sure it's not valid JATS that way -- does it just break on (authorname, year) citation styles currently?

Invalid reference section recognition

As reported in the OpenAIRE portal feedback system, the metadata extracted from the following thesis:

http://ria.ua.pt/bitstream/10773/1682/1/2009001399.pdf

contained an invalid bibliographic references section.

I just confirmed this by uploading the publication to:

http://cermine.ceon.pl/index.html

and receiving:

  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>am ro A p m t p 9 i t f</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>

or:

  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>9 i t f</mixed-citation>
      </ref>
    </ref-list>
  </back>

interchangeably, as a part of NLM.

It looks like the table from annex 1 was interpreted as a part of bibliographic references.

Explore how the trained models are used in cermine

Garbage output

For some PDFs, Cermine fails to determine the structure of the document, and outputs garbage.

Examples

Tested with 1.8-SNAPSHOT.

Input

Output

Large portions of text body missing.

For example, processing the example article cuts out over 250 of its ~320 pages. While the main text body of an article isn't CERMINE's main interest, extracting the full content would be a desirable default behavior.

Comparison of the original and the .cermxml file:
Page 19 (20 by PDF page count) has a paragraph starting with "Chapter 5, 6...", after which the next paragraph in the output is "There is also a more practical, concrete response. Dahlberg and Moss (2005: 107-110)", which is on page 311 (312 by PDF page count) in the original PDF.

Tested with Cermine 2.12 and via web interface, same behavior.

Allow using external trained model files [impl]

the mechanism of external files with properties
properties defining the locations of the trained model files

Fatal java.lang.OutOfMemoryError thrown while processing document

Unfortunately, we found one publication (out of ~2 million already processed) that causes a fatal, unrecoverable error and interrupts IIS processing.

I managed to reproduce this problem locally.

Link to publication causing problem:
https://www.cmi.no/publications/file/5192-ungrateful-children.pdf

TaxPub implementation?

Would it be possible to add TaxPub as an add-on? TaxPub is an NLM DTD extension;
see https://sourceforge.net/projects/taxpub/
and
https://www.researchgate.net/publication/233807452

Cermine code extension

I would like to extend the CERMINE code by adding new keywords for extraction. For example, I tried removing the "Prof." keyword from the author enhancer and the "Case Report" keyword from the title enhancer, then ran the extraction on an input PDF document, but the change was not reflected in the output XML file. Can anyone help me with this?

Problem with a two column paper

Hi,

I have had problems with the following paper using version 1.13 of the standalone tool:

Maximizing the Spread of Influence through a Social Network (2003) by
David Kempe, Jon Kleinberg, Éva Tardos
https://www.cs.cornell.edu/home/kleinber/kdd03-inf.pdf

It only extracts references in the left column of the last page. It seems to extract the right hand column of the body text, but fails on the references section.

Thanks

Affiliations: Improve mapping country name to country code

Originally reported by @mkobos:

Affiliations: Improve mapping country name to country code. In the vast majority of cases where the country code is missing but the country name is not, the country name field contains a substring that is the name of a country. In these cases it should be possible to assign the correct country code. This problem appears in up to 1.6% of all affiliation records (the percentage of records where the country code is missing while the country name is not).
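
The substring idea can be sketched as follows. This is only an illustration, not CERMINE code: the class `CountryCodeGuesser` and its method are hypothetical names, and the country names come from the JDK's own locale data rather than from any dictionary CERMINE may ship.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of CERMINE.
final class CountryCodeGuesser {
    // Lower-cased English country name -> ISO 3166-1 alpha-2 code, from JDK locale data.
    private static final Map<String, String> NAME_TO_CODE = new LinkedHashMap<>();

    static {
        for (String code : Locale.getISOCountries()) {
            String name = new Locale("", code).getDisplayCountry(Locale.ENGLISH);
            if (!name.isEmpty()) {
                NAME_TO_CODE.put(name.toLowerCase(Locale.ROOT), code);
            }
        }
    }

    /** Returns the code of the longest known country name contained in the field, if any. */
    static Optional<String> guess(String countryField) {
        String text = countryField.toLowerCase(Locale.ROOT);
        String bestName = null;
        for (String name : NAME_TO_CODE.keySet()) {
            // Prefer the longest match so that "Nigeria" is not mistaken for "Niger".
            if (text.contains(name) && (bestName == null || name.length() > bestName.length())) {
                bestName = name;
            }
        }
        return Optional.ofNullable(bestName).map(NAME_TO_CODE::get);
    }
}
```

Plain substring matching can still produce false positives on short names, so a real implementation would probably want word-boundary checks and a list of name variants.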

Concept of providing external properties to cermine

Some inputs cause article titles to be all-capitalized.

For the following inputs, the extracted article titles are all-capitalized even though the titles in the articles are not.
input.zip
output.zip

EDIT: My mistake, I misinterpreted the inputs.

Can not access Webservice "http://cermine.ceon.pl/extract.do"

Hello Support,

I tried to consume the REST service given above in a .NET application. Do you have a .wsdl file for generating a proxy in .NET?
Otherwise, please guide me on how to extract content from PDF documents in .NET.
Do you have a .NET library that exposes CERMINE's functionality?

Thanks in advance,
Jisha

article-meta elements are out of order and fail JATS validation

As much as I hate having to conform to ordered XML schemas, I think this is technically a bug as it does fail validation and can cause some JATS apps to panic.

Right now, in CERMINE output, <abstract/> and <kwd-group/> come before <pub-date/> and other associated elements (<volume/>, etc), which causes an error (you can reproduce in testing with just xmllint --valid); these should be moved to the end of <article-meta/>.

Timeout parameter does not interrupt processing in some occasions

There are still cases where the timeout feature, thoroughly described in #7, doesn't seem to work.

After upgrading CERMINE to the newly released version 1.10, the IIS metadataextraction workflow failed because ContentExtractor was stuck for more than 1 hour while getting document content as NLM.

First: this is a regression in CERMINE, because even without the timeout feature we were previously able to pass all available contents through ContentExtractor without trouble.

Second: after CERMINE reaches this point of processing:

1.1 Character extraction: 2.385
1.2 Page segmentation: 4.723
1.3 Reading order resolving: 0.422
1.4 Initial classification: 9.869
2.1 Metadata classification: 0.055

processing is not interrupted even though the timeout threshold is exceeded, which hangs the whole process.

Decreasing the timeout value eventually triggers the interruption, but this is probably because CERMINE has not yet reached the "point of no return".

Affiliations: Improve recognizing whether a given affiliation text is really an affiliation

Originally reported by @mkobos:

Affiliations: Improve recognizing whether a given affiliation text is really an affiliation. This can be done using a simple rule that checks that the text is neither too short nor too long (this might impact up to 6% of all affiliation records; these are the records where the affiliation text does not really correspond to an affiliation).
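
A minimal sketch of such a length rule is shown below. The class name and both thresholds are assumptions made up for illustration; real values would have to be tuned on actual affiliation records.

```java
// Hypothetical helper, not part of CERMINE.
final class AffiliationLengthFilter {
    // Assumed thresholds, chosen only for illustration.
    static final int MIN_CHARS = 10;
    static final int MAX_CHARS = 300;

    /** Accepts only texts whose trimmed length lies within [MIN_CHARS, MAX_CHARS]. */
    static boolean looksLikeAffiliation(String text) {
        if (text == null) {
            return false;
        }
        int length = text.trim().length();
        return length >= MIN_CHARS && length <= MAX_CHARS;
    }
}
```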

Extract arXiv identifier

The articles on the arXiv (for example this thesis) contain a side label with the identifier (e.g., arXiv:1405.2249v1 [math-ph] 9 May 2014). It would be nice if CERMINE could read and parse this information.

The different formats for the arXiv ids are discussed in the official help.
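
As a rough sketch of what recognizing the label could look like, the pattern below covers the two identifier schemes as I understand them from the arXiv help: the newer `YYMM.NNNN` / `YYMM.NNNNN` form and the older `archive/YYMMNNN` form, each with an optional version suffix. The class `ArxivIdFinder` is a hypothetical name, and the sketch assumes the label text has already been extracted in reading order.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CERMINE.
final class ArxivIdFinder {
    private static final Pattern ARXIV_ID = Pattern.compile(
            "arXiv:\\s*("
            + "\\d{4}\\.\\d{4,5}"                      // newer scheme: YYMM.NNNN or YYMM.NNNNN
            + "|[a-z][a-z-]*(?:\\.[A-Z]{2})?/\\d{7}"   // older scheme: archive[.SC]/YYMMNNN
            + ")(v\\d+)?");                            // optional version, e.g. v1

    /** Returns the first identifier found after an "arXiv:" prefix, including its version if present. */
    static Optional<String> find(String text) {
        Matcher m = ARXIV_ID.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        String version = m.group(2) == null ? "" : m.group(2);
        return Optional.of(m.group(1) + version);
    }
}
```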

Keyword extraction

Are there any plans to include keyword extraction based on the typical Keywords sections of, e.g., journal articles?

Multithreading problems using CRFBibReferenceParser.parse method

The CRFBibReferenceParser.parse(String affiliation) method does not work correctly when called from multiple threads. I could not figure out what the actual problem is. Sometimes an IndexOutOfBoundsException is thrown, sometimes a NullPointerException. With a single thread the problem does not occur. I created a synchronized wrapper around the parse method, and this is the only way I could make it work. My use case is that I have a large batch of affiliations to parse, so it would be great if I could use the power of multiple threads.
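
One alternative to a synchronized wrapper is to give every thread its own parser instance. The sketch below is generic on purpose: `PerThreadParser` is a hypothetical class, the parser is modeled as a plain `Function`, and nothing here is CERMINE API. It only helps if the unsafe state lives in the parser instance rather than in static fields, which I have not verified for CERMINE.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Gives each thread its own parser instance, so no instance is shared between threads. */
final class PerThreadParser<I, O> {
    private final ThreadLocal<Function<I, O>> parsers;

    /** The factory is invoked once per thread, on that thread's first call to parse. */
    PerThreadParser(Supplier<Function<I, O>> parserFactory) {
        this.parsers = ThreadLocal.withInitial(parserFactory);
    }

    O parse(I input) {
        return parsers.get().apply(input);
    }
}
```

The trade-off is memory and start-up time: each worker thread constructs its own parser, which may be noticeable if construction loads a trained model.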

Testing issues

Your README doesn't give detailed installation instructions.
Can you give some more detailed information?

Compiling with mvn compile succeeds.

But when I run mvn install -DskipTests, I hit this issue:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.10.4:jar (attach-javadocs) on project cermine-impl: MavenReportException: Error while generating Javadoc:
[ERROR] Exit code: 1 - src/CERMINE-master/cermine-impl/src/main/java/pl/edu/icm/cermine/ComponentFactory.java:87: warning: no description for @throws

UPDATED

I googled the issue and used mvn install -DskipTests -Dadditionalparam=-Xdoclint:none
to get rid of all the errors; the install then succeeded.

But mvn test fails like this:
Results :

Failed tests: metadataExtractionTest(pl.edu.icm.cermine.AltPdfNLMMetadataExtractorTest)
metadataExtractionTest(pl.edu.icm.cermine.bibref.KMeansBibReferenceExtractorTest): expected:<...98, 186:528-33.(..)
testAddFeatures(pl.edu.icm.cermine.metadata.affiliation.features.AffiliationDictionaryFeatureTest): Token: 18 W expected:<1> but was:<0>
getBxDocumentTest(pl.edu.icm.cermine.ContentExtractorTest)
getBxDocumentWithGeneralLabelsTest(pl.edu.icm.cermine.ContentExtractorTest)
getBxDocumentWithSpecificLabelsTest(pl.edu.icm.cermine.ContentExtractorTest)
textRawFullTextTest(pl.edu.icm.cermine.ContentExtractorTest): expected:<...nnerhagen(..)
getFullTextWithLabelsTest(pl.edu.icm.cermine.ContentExtractorTest)
getNLMBodyTest(pl.edu.icm.cermine.ContentExtractorTest)
getNLMReferencesTest(pl.edu.icm.cermine.ContentExtractorTest)
getNLMContentTest(pl.edu.icm.cermine.ContentExtractorTest)

Tests run: 118, Failures: 11, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] CERMINE Content Extractor and Miner ................ SUCCESS [ 0.164 s]
[INFO] CERMINE Engine Implementation - 1.12-SNAPSHOT ...... FAILURE [01:44 min]
[INFO] CERMINE Tools - 1.12-SNAPSHOT ...................... SKIPPED
[INFO] CERMINE Web Interface - 1.12-SNAPSHOT .............. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:44 min
[INFO] Finished at: 2016-12-19T22:35:29+08:00
[INFO] Final Memory: 12M/159M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project cermine-impl: There are test failures.
[ERROR]
[ERROR] Please refer to /home/duyu/src/CERMINE-master/cermine-impl/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :cermine-impl

Could you give some advice to solve this problem?

Thanks for your reply.

.jar won't run

Hi all,
Thank you for developing Cermine which I found very useful when I tried it on the web.
I've downloaded the standalone version. I wanted to use it on Windows 10 (64-bit), but it did not execute. Note that I have other executable jars which launch normally with a double click.
Thank you for your responses.

BibTeX Export

Hi!

Using a current snapshot build, I was able to export references as a bibtex file, but the different authors were not correctly separated by "and" but with a comma:

@article{Peiker2016,
  author = {Peiker, C, Pott, C, Eckardt, L, Kelm, M, Shin, DI, Willems, S},
  doi = {10.1093/europace/euv056},
  journal = {Europace},
  pages = {332--339},
  title = {Dual atrioventricular nodal non-re-entrant tachycardia},
  volume = {18},
  year = {2016},
}

Furthermore, duplicate citekeys were generated.

Stephan
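
For reference, a small sketch of the expected behavior: BibTeX separates author names with " and ", and cite keys must be unique within a file. The class and method names below are hypothetical and this is not the CERMINE exporter; the numeric suffix used for duplicate keys is just one possible convention.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of CERMINE.
final class BibTexFields {
    /** Joins author names with the " and " separator that BibTeX expects. */
    static String authorField(List<String> authors) {
        return "author = {" + String.join(" and ", authors) + "}";
    }

    /** Returns baseKey, or baseKey with a numeric suffix if it was already used; records the result. */
    static String uniqueKey(String baseKey, Set<String> usedKeys) {
        String key = baseKey;
        int n = 2;
        while (!usedKeys.add(key)) { // add() returns false when the key is already present
            key = baseKey + "-" + n++;
        }
        return key;
    }
}
```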

Error and crash: Exception in thread "main" java.lang.IllegalArgumentException: Illegal group reference: group index is missing

I am running CERMINE through 10000 or so pdfs, but some of them throw this error, and the program stops running. Can I somehow fix this, or tell CERMINE to skip errors and continue?

File processed: /home/moritz/Desktop/pdf_extraction/pdfs/Other/Öztunç et al. 1991 - A new haloether from Laurencia possessing a lauroxacyclododecane ring. Structure and conformational studies.pdf
Exception in thread "main" java.lang.IllegalArgumentException: Illegal group reference: group index is missing
	at java.util.regex.Matcher.appendReplacement(Matcher.java:819)
	at pl.edu.icm.cermine.content.cleaning.ContentCleaner.cleanHyphenationAndBreaks(ContentCleaner.java:180)
	at pl.edu.icm.cermine.content.cleaning.ContentCleaner.cleanAllAndBreaks(ContentCleaner.java:236)
	at pl.edu.icm.cermine.metadata.model.DocumentMetadata.clean(DocumentMetadata.java:277)
	at pl.edu.icm.cermine.metadata.EnhancerMetadataExtractor.extractMetadata(EnhancerMetadataExtractor.java:106)
	at pl.edu.icm.cermine.metadata.EnhancerMetadataExtractor.extractMetadata(EnhancerMetadataExtractor.java:36)
	at pl.edu.icm.cermine.ExtractionUtils.cleanMetadata(ExtractionUtils.java:101)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341)
	at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:320)
	at pl.edu.icm.cermine.InternalContentExtractor.getContentAsNLM(InternalContentExtractor.java:286)
	at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:612)
	at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:628)
	at pl.edu.icm.cermine.ContentExtractor.main(ContentExtractor.java:724)

Öztunç et al. 1991 - A new haloether from Laurencia possessing a lauroxacyclododecane ring. Structure and conformational studies.pdf
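
For context, this exception is what `Matcher.appendReplacement` throws when the replacement string contains a `$` that is not followed by a group reference, which can happen when the replacement is built from document text. I have not looked at `ContentCleaner` itself, so the following is only a generic sketch of the usual remedy, `Matcher.quoteReplacement`; the class `LiteralReplacer` is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CERMINE.
final class LiteralReplacer {
    /** Replaces every match with the replacement taken literally, so '$' and '\' are safe. */
    static String replaceAllLiteral(Pattern pattern, String input, String replacement) {
        Matcher m = pattern.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Without quoteReplacement, a replacement such as "$" would be read as a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```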

How can I avoid NaN exception: AtRelativeCount?

I'm trying to retrain CERMINE on my university's scientific paper collection. However, after the InitialBuilder has iterated over all the files (~2300), it always throws this exception:

Exception in thread "main" java.lang.RuntimeException: Feature value is set to NaN: AtRelativeCount
	at pl.edu.icm.cermine.tools.classification.general.LinearScaling.scaleFeatureVector(LinearScaling.java:49)
	at pl.edu.icm.cermine.tools.classification.general.FeatureVectorScalerImpl.scaleFeatureVector(FeatureVectorScalerImpl.java:61)
	at pl.edu.icm.cermine.tools.classification.svm.SVMClassifier.buildDatasetForTraining(SVMClassifier.java:157)
	at pl.edu.icm.cermine.tools.classification.svm.SVMClassifier.buildClassifier(SVMClassifier.java:117)
	at pl.edu.icm.cermine.libsvm.training.SVMInitialBuilder.getZoneClassifier(SVMInitialBuilder.java:69)
	at pl.edu.icm.cermine.libsvm.training.SVMInitialBuilder.main(SVMInitialBuilder.java:143)

The line that causes the exception:
featureValue = a*featureValue + b;

Somehow featureValue was already NaN before the calculation. I think the issue may arise from my dataset. The cermstr dataset was generated from sidecar XML files in which all metadata are included: title, author, abstract, references,... I've proofread 20% of the files and everything seems normal. So far, the only problem I can think of is that there are too few examples of the correspondence section, and this section usually spans only one line in a PDF file.
How should I improve my dataset to avoid this exception? Thanks!
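
I don't know how CERMINE computes `a`, `b` or the `AtRelativeCount` feature, so the following is only a generic illustration of two common sources of NaN in this kind of code: a ratio whose denominator is zero (for example, a zone with no characters), and min-max scaling of a feature that has the same value in every sample. All names below are hypothetical.

```java
// Hypothetical helper, not part of CERMINE.
final class NanSafeFeatures {
    /** A relative count such as "matching characters / all characters", safe for empty input. */
    static double relativeCount(int matching, int total) {
        // 0 / 0 in floating point is NaN, so an empty zone needs an explicit default.
        return total == 0 ? 0.0 : (double) matching / total;
    }

    /** Scales value from [min, max] to [0, 1]; a constant feature maps to 0 instead of NaN. */
    static double scale(double value, double min, double max) {
        if (Double.isNaN(value)) {
            throw new IllegalArgumentException("feature value is NaN before scaling");
        }
        double range = max - min;
        if (range == 0.0) {
            return 0.0; // (value - min) / 0.0 would be NaN here
        }
        return (value - min) / range;
    }
}
```

If either case applies, looking for empty zones in the training documents may be more productive than adding more examples of a rare label.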

make cermine interruptible to allow interruption in batch processing

We rely heavily on cermine in IIS and unfortunately, every now and then, a malicious PDF file appears that takes a huge amount of time to process.

Currently we have adjusted the Hadoop mapper (by overriding mapred.task.timeout for the metadataextraction module) to wait 60 minutes before assuming a task is unresponsive. Apparently this is not enough. Recently, after ~400k new PDF contents were added to ObjectStore, cermine encountered files that each take over 60 minutes to process. This causes the whole IIS processing to fail, and the only way to proceed is to blacklist the given document, retry the IIS execution and hope we won't need to do this again (but we probably will).

One possible solution is to execute cermine processing in a dedicated thread and interrupt it before the time defined in mapred.task.timeout passes. Currently this won't have any effect because cermine does not honor interruption. If you could check Thread.interrupted() every now and then, probably as part of a loop condition, then we could stop the metadataextraction worker thread and continue processing other documents.

Up until now my impression was that the iText library blocks the execution, which could be a dead end. But after running the VisualVM profiler on cermine I realized that no single method is blocked for ages; rather, the execution time is spread across several methods that alternate. This gives us an opportunity to check whether the thread was interrupted.

Take a look at the attachment with metadataextraction CPU-time profiling, which shows the methods taking most of the execution time. This could guide us on where to check for interruption.
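
To make the proposal concrete, here is a minimal sketch of both halves: a worker loop that checks `Thread.interrupted()`, and a caller that runs the work in a dedicated thread and interrupts it on timeout. `InterruptibleWork` and its methods are hypothetical names, and the counting loop is only a stand-in for a long-running processing step, not CERMINE code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of CERMINE.
final class InterruptibleWork {
    /** Stand-in for a long-running loop that cooperates with interruption. */
    static long countUntilInterrupted(long limit) throws InterruptedException {
        long done = 0;
        for (long i = 0; i < limit; i++) {
            if (Thread.interrupted()) { // checked as part of the loop, as proposed above
                throw new InterruptedException("processing interrupted");
            }
            done++;
        }
        return done;
    }

    /** Runs the task in a dedicated thread and interrupts it if it exceeds the timeout. */
    static <T> T runWithTimeout(Callable<T> task, long timeout, TimeUnit unit) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // delivers an interrupt to the worker thread
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The interrupt only has an effect because the loop checks for it; a task that never checks would keep running after the caller has given up, which matches the behavior described above.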

Acknowledgements extraction

Hi,

Is there any plan to extract acknowledgements from PDF files? For instance, for this paper, extracting as metadata that

This work was supported by the ERC through the QGBE grant and by Provincia Autonoma di Trento.

(although this may not be the simplest example, as this sentence is embedded in the paper and not in a separate "Acknowledgements" section).

Thanks
