ceon / cermine
Content ExtRactor and MINEr
License: GNU Affero General Public License v3.0
Hi,
When I tried to parse the following documents
with the following piece of code:
Element result = extractor.getContentAsNLM();
I get the following exception:
Exception in thread "main" java.lang.NullPointerException
at pl.edu.icm.cermine.metadata.transformers.MetadataToNLMConverter.convertJournalMetadata(MetadataToNLMConverter.java:61)
at pl.edu.icm.cermine.metadata.transformers.MetadataToNLMConverter.convert(MetadataToNLMConverter.java:42)
at pl.edu.icm.cermine.InternalContentExtractor.getMetadataAsNLM(InternalContentExtractor.java:198)
at pl.edu.icm.cermine.InternalContentExtractor.getContentAsNLM(InternalContentExtractor.java:303)
at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:662)
at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:678)
I use Cermine in my little console utility to extract references from papers. I noticed that DOI extraction often goes wrong; the most common mistake is an extra bracket appended to the end of the DOI (which is why I remove it manually in my code: https://github.com/antonkulaga/extractor/blob/master/src/main/scala/org/comp/bio/aging/extractor/Reference.scala#L22 ). Please improve DOI extraction accuracy!
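For anyone hitting the same problem, here is a minimal sketch of the kind of post-processing described above. It is not CERMINE code: the class name `DoiCleaner`, the loose DOI pattern, and the rule "drop a closing bracket only when it has no matching opening bracket" are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of CERMINE: tidies DOI-like strings after extraction. */
final class DoiCleaner {
    // Loose DOI shape (an assumption): "10.", 4-9 digits, "/", then a non-whitespace suffix.
    private static final Pattern DOI = Pattern.compile("10\\.\\d{4,9}/\\S+");

    /** Returns the first DOI-like token in text without stray trailing punctuation, or null. */
    static String clean(String text) {
        Matcher m = DOI.matcher(text);
        if (!m.find()) {
            return null;
        }
        String doi = m.group();
        while (!doi.isEmpty()) {
            char last = doi.charAt(doi.length() - 1);
            // A closing bracket is only stray when it is unbalanced within the DOI itself.
            boolean stray = last == '.' || last == ',' || last == ';'
                    || (last == ')' && count(doi, ')') > count(doi, '('))
                    || (last == ']' && count(doi, ']') > count(doi, '['));
            if (!stray) {
                break;
            }
            doi = doi.substring(0, doi.length() - 1);
        }
        return doi;
    }

    private static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }
}
```

Counting brackets rather than always stripping the last one keeps DOIs that legitimately end in a balanced bracket intact.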
Hi,
Is there documentation somewhere about the structure of the XML output, especially the available tags and so on? I could not find any.
Thanks!
I am obviously doing something wrong here: running the latest 1.9 jar with dependencies from the terminal keeps returning "Error: Could not find or load main class pl.edu.icm.cermine.ContentExtractor". I have re-installed Java and reset the JRE path/home a couple of times, and I wonder what I am missing. Running on Windows 10 x64 with 32-bit Java (131).
When attempting to extract data from this article https://doi.org/10.7717/peerj-cs.118 (and probably any other article with the same PDF layout) the following failure occurs:
java.lang.AssertionError
at pl.edu.icm.cermine.structure.readingorder.DocumentPlane.add(DocumentPlane.java:165)
at pl.edu.icm.cermine.structure.readingorder.DocumentPlane.<init>(DocumentPlane.java:95)
at pl.edu.icm.cermine.structure.HierarchicalReadingOrderResolver.groupZonesHierarchically(HierarchicalReadingOrderResolver.java:207)
at pl.edu.icm.cermine.structure.HierarchicalReadingOrderResolver.reorderZones(HierarchicalReadingOrderResolver.java:122)
at pl.edu.icm.cermine.structure.HierarchicalReadingOrderResolver.resolve(HierarchicalReadingOrderResolver.java:96)
at pl.edu.icm.cermine.ExtractionUtils.resolveReadingOrder(ExtractionUtils.java:79)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:354)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341)
at pl.edu.icm.cermine.InternalContentExtractor.getContentAsNLM(InternalContentExtractor.java:301)
at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:662)
at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:678)
at pl.edu.icm.cermine.ContentExtractorLoopTest.extractionLoopTest(ContentExtractorLoopTest.java:57)
Hi,
Are there plans to add a call to pdfimages (from xpdf/poppler) to ensure images are extracted when parsing full text via Grobid? The accuracy and performance of pdfimages seem to be very good, but I don't think it is used directly by any PDF parser currently.
Cermine does not always correctly handle reference sections that span multiple pages.
Often, only the last page is recognised as references, and the previous pages are either included as part of the body or left out of the output entirely.
Tested with 1.8-SNAPSHOT.
Input
Output
Cermine does not correctly handle reference sections that are followed by an appendix.
The text following the reference section is mistakenly recognised as additional references.
Tested with 1.8-SNAPSHOT.
Input
Output
I didn't know which values I can pass to getId until I found the constants ID_DOI, ID_URN, ....
In my opinion it would make the code a bit clearer if you use an enum for the possible identifier types.
A similar remark applies to the different types of dates.
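To make the suggestion concrete, here is a rough sketch of what an enum-based API could look like. The names `IdType` and `IdentifierHolder` are hypothetical and are not CERMINE classes; only the DOI and URN types are listed because those are the only constants named above.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical identifier types; only DOI and URN are named in the report above. */
enum IdType { DOI, URN }

/** Hypothetical stand-in for a metadata class that stores identifiers by type. */
final class IdentifierHolder {
    private final Map<IdType, String> ids = new EnumMap<>(IdType.class);

    void setId(IdType type, String value) {
        ids.put(type, value);
    }

    /** Returns the identifier of the given type, or null if none was set. */
    String getId(IdType type) {
        return ids.get(type);
    }
}
```

With an enum parameter, the accepted values show up in IDE completion and a misspelled type fails at compile time instead of silently returning nothing. The same pattern would apply to the date types.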
The attached paper parses just fine to text or zones, but parsing it to JATS skips the abstract completely.
CIDR_17_020.pdf
The following exception is thrown:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at com.itextpdf.text.pdf.PdfReader.FlateDecode(PdfReader.java:2188)
at com.itextpdf.text.pdf.PdfReader.FlateDecode(PdfReader.java:2043)
at com.itextpdf.text.pdf.FilterHandlers$Filter_FLATEDECODE.decode(FilterHandlers.java:107)
at com.itextpdf.text.pdf.PdfReader.decodeBytes(PdfReader.java:2619)
at com.itextpdf.text.pdf.parser.PdfImageObject.<init>(PdfImageObject.java:189)
at com.itextpdf.text.pdf.parser.PdfImageObject.<init>(PdfImageObject.java:168)
at com.itextpdf.text.pdf.parser.ImageRenderInfo.prepareImageObject(ImageRenderInfo.java:150)
at com.itextpdf.text.pdf.parser.ImageRenderInfo.getImage(ImageRenderInfo.java:140)
at pl.edu.icm.cermine.structure.ITextCharacterExtractor$BxDocumentCreator.renderImage(ITextCharacterExtractor.java:366)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$ImageXObjectDoHandler.handleXObject(PdfContentStreamProcessor.java:1311)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.displayXObject(PdfContentStreamProcessor.java:375)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$6100(PdfContentStreamProcessor.java:83)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$Do.invoke(PdfContentStreamProcessor.java:1023)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:310)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:448)
at pl.edu.icm.cermine.structure.ITextCharacterExtractor.extractCharacters(ITextCharacterExtractor.java:112)
at pl.edu.icm.cermine.ExtractionUtils.extractCharacters(ExtractionUtils.java:60)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:346)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:339)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:339)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:339)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:339)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:339)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:339)
at pl.edu.icm.cermine.InternalContentExtractor.getContentAsNLM(InternalContentExtractor.java:299)
at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:662)
at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:678)
at eu.dnetlib.iis.wf.metadataextraction.MetadataExtractorMapper.handleContent(MetadataExtractorMapper.java:249)
when processing this document:
http://eprints.nottingham.ac.uk/41436/1/PhD_Thesis_Maria_Manuela_Marinho_de_Castro.pdf
using CERMINE version 1.13.
This seems to be similar to #33, because the only way to make it work is to increase the Xmx memory limit up to 5GB. I tried forcing the most recent iText version, 5.5.12, but it did not solve the issue.
As already mentioned in #33#issuecomment-257929226, the IIS metadataextraction mapper is allowed to use 4GB of memory. Assigning more memory to the job triggering CERMINE would decrease task parallelization. For now I am simply blacklisting this document.
Should we consider this a CERMINE issue, or should we report it to the iText developers?
CERMINE has problems with academic literature other than classical articles. For example, support for theses and books would be great.
For example, for my thesis the information returned is:
Type: ARTICLE // expected: thesis
Author: "Master's Thesis and Presented by Tobias Diez and Assessors: Dr. G. Rudolph Dr. R. Verch" // expected: Tobias Diez
Pages: "86127" // expected: 1 - 127
Title: "Slice theorem for Fréchet group actions and covariant symplectic field theory" // correct
I have downloaded "cermine-impl-1.9-jar-with-dependencies.jar" on Windows 10.
I created a folder C:\NewTest and placed the jar in it.
The location of the jar is C:\NewTest\cermine-impl-1.9-jar-with-dependencies.jar
I placed the PDF in a folder, C:\NewTest\Input\
Then I opened a command prompt and typed the command below:
java -cp cermine-impl-1.9-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path C:\NewTest\Input\
I get the error "Error: Could not find or load main class pl.edu.icm.cermine.ContentExtractor"
Reported by @mkobos:
Affiliations: Improve way of finding country in affiliation string (this might potentially impact 23% of all records - these are the records where the country name (and thus country code) is missing)
1.1 Easy improvement. Look for the name of the country in affiliation string (this might solve the problem in up to 23%*25%= 6% of all affiliation records).
1.2 More difficult improvement. Recognize the country based on a dictionary of well-known scientific organizations and a dictionary of US states and major cities in various countries (this might solve the problem in up to 23%*80% = 18% of all affiliation records).
Note that all numbers come from the first version of report "Analysis of affiliations extracted by IIS from XMLs and PDFs" available at https://issue.openaire.research-infrastructures.eu/issues/2010, namely: https://issue.openaire.research-infrastructures.eu/attachments/download/509/2016-04-17_21_46_analysis.html. All these numbers are very rough approximations of the real values.
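The "easy improvement" in 1.1 could be sketched roughly as below. This is an illustration under stated assumptions, not CERMINE's implementation: the class `CountryFinder` is hypothetical, and country names are taken from the JDK's English locale data rather than from any CERMINE dictionary.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of CERMINE: looks for a country name in an affiliation string. */
final class CountryFinder {
    /**
     * Returns the ISO 3166-1 alpha-2 code of the longest English country name that occurs
     * as a whole word sequence in the affiliation, or null if no name is found.
     */
    static String findCountryCode(String affiliation) {
        String best = null;
        int bestLength = 0;
        for (String code : Locale.getISOCountries()) {
            String name = new Locale.Builder().setRegion(code).build()
                    .getDisplayCountry(Locale.ENGLISH);
            // Skip codes without a usable name, and names no longer than the current best match.
            if (name.isEmpty() || name.equals(code) || name.length() <= bestLength) {
                continue;
            }
            // Word boundaries stop "Niger" from matching inside "Nigeria".
            Pattern whole = Pattern.compile("\\b" + Pattern.quote(name) + "\\b",
                    Pattern.CASE_INSENSITIVE);
            if (whole.matcher(affiliation).find()) {
                best = code;
                bestLength = name.length();
            }
        }
        return best;
    }
}
```

Abbreviations such as "UK", US states, and city names are not covered by this lookup; those would need the dictionaries described in 1.2.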
DOIs are frequently reported in the bibliography sections. Example from PlosONE:
Van Heuven WJB, Dijkstra T. Language comprehension in the bilingual brain: fMRI and ERP support
for psycholinguistic models. Brain Res Rev. 2010; 64(1):104 – 22. doi: 10.1016/j.brainresrev.2010.03.
002 PMID: 20227440
However, they are not extracted by CERMINE, and the first part of DOI gets interpreted as repeated information about volume:
<mixed-citation>
14.
<string-name>
<surname>Van Heuven</surname>
<given-names>WJB</given-names>
</string-name>
,
<string-name>
<surname>Dijkstra</surname>
<given-names>T.</given-names>
</string-name>
<article-title>Language comprehension in the bilingual brain: fMRI and ERP support for psycholinguistic models</article-title>
.
<source>Brain Res Rev</source>
.
<year>2010</year>
;
<volume>64</volume>
(
<issue>1</issue>
):
<fpage>104</fpage>
-
<lpage>22</lpage>
. doi:
<volume>10</volume>
.1016/j.brainresrev.
<year>2010</year>
.
<volume>03</volume>
. 002 PMID:
<fpage>20227440</fpage>
</mixed-citation>
http://maven.icm.edu.pl/artifactory/ is very slow. It would be nice to have CERMINE available on Bintray or Maven Central as well.
Cermine seems to incorrectly assign references to citations, even for perfectly recognized PDFs, and for correctly recognized bibliography. A citation co-authored by X is being assigned to other publications also including author X that occur in the bibliography. This is a very frequent bug and it often happens for obvious cases where the name of the first author is different for the citation and for the actually assigned item in the bibliography.
I can provide examples with specific PDFs if required, although this behaviour should be very easy to elicit using any PDF that includes citations co-authored by the same person.
Originally reported by @mkobos:
Affiliations: Improve cleaning of the affiliation string, because in a small number of cases the organization field contains some text that doesn't belong there (e.g. address, short name, name of the organization in a different language, unrelated text). This might impact up to 20% of all affiliation records.
Note that all numbers come from the first version of report "Analysis of affiliations extracted by IIS from XMLs and PDFs" available at https://issue.openaire.research-infrastructures.eu/issues/2010, namely: https://issue.openaire.research-infrastructures.eu/attachments/download/509/2016-04-17_21_46_analysis.html. All these numbers are very rough approximations of the real values.
Some PDFs contain text that has been generated using an unspecified OCR process. For such PDFs, the quality of Cermine output depends directly on the quality of the particular OCR process, which may be far from satisfactory.
It would be great if Cermine performed its own OCR, and attempted to process both the existing and the newly recognised text, in order to get the best result.
Performing OCR may also be the only way to solve #11.
Tested with 1.8-SNAPSHOT.
Input
Output
Hi,
When trying to build on latest tag, I get:
CERMINE/cermine-impl % mvn compile assembly:single ±[cermine-parent-1.7^0]
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for pl.edu.icm.cermine:cermine-impl:jar:1.7-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-javadoc-plugin is missing. @ pl.edu.icm.cermine:cermine-parent:1.7-SNAPSHOT, /home/phyks/tmp/papers/CERMINE/pom.xml, line 79, column 21
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ pl.edu.icm.cermine:cermine-parent:1.7-SNAPSHOT, /home/phyks/tmp/papers/CERMINE/pom.xml, line 67, column 21
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building CERMINE Engine Implementation - 1.7-SNAPSHOT 1.7-SNAPSHOT
[INFO] ------------------------------------------------------------------------
Downloading: http://maven.icm.edu.pl/artifactory/repo/edu/umass/cs/mallet/mallet/0.1.3/mallet-0.1.3.pom
Downloading: http://maven.icm.edu.pl/artifactory/repo/edu/umass/cs/mallet/grmm-deps/0.1.3/grmm-deps-0.1.3.pom
Downloading: http://maven.icm.edu.pl/artifactory/repo/org/bouncycastle/bcprov-jdk14/1.47/bcprov-jdk14-1.47.pom
Downloading: https://repo.maven.apache.org/maven2/org/bouncycastle/bcprov-jdk14/1.47/bcprov-jdk14-1.47.pom
Downloaded: https://repo.maven.apache.org/maven2/org/bouncycastle/bcprov-jdk14/1.47/bcprov-jdk14-1.47.pom (819 B at 1.2 KB/sec)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:29 min
[INFO] Finished at: 2016-01-17T20:42:21+01:00
[INFO] Final Memory: 15M/102M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project cermine-impl: Could not resolve dependencies for project pl.edu.icm.cermine:cermine-impl:jar:1.7-SNAPSHOT: Failed to collect dependencies at edu.umass.cs.mallet:mallet:jar:0.1.3: Failed to read artifact descriptor for edu.umass.cs.mallet:mallet:jar:0.1.3: Could not transfer artifact edu.umass.cs.mallet:mallet:pom:0.1.3 from/to yadda (http://maven.icm.edu.pl/artifactory/repo): Failed to transfer file: http://maven.icm.edu.pl/artifactory/repo/edu/umass/cs/mallet/mallet/0.1.3/mallet-0.1.3.pom. Return code is: 502 , ReasonPhrase:Bad Gateway. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
Thanks
Originally reported by @mkobos:
Affiliations: Recognize that some texts currently provided by CERMINE as affiliations are certainly not affiliations, e.g.: "Electronic address:" (this is the most numerous "affiliation" - it can be found in 0.03% of all affiliation records), "These authors contributed equally to this work".
Note that all numbers come from the first version of report "Analysis of affiliations extracted by IIS from XMLs and PDFs" available at https://issue.openaire.research-infrastructures.eu/issues/2010, namely: https://issue.openaire.research-infrastructures.eu/attachments/download/509/2016-04-17_21_46_analysis.html. All these numbers are very rough approximations of the real values.
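A filter for such strings could start as small as the sketch below. It is only an illustration: `AffiliationFilter` is a hypothetical class, and its phrase list contains just the two examples quoted above, whereas a real list would have to be derived from the report data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of CERMINE: rejects known non-affiliation phrases. */
final class AffiliationFilter {
    // Only the two phrases quoted in the report; a real list would be longer.
    private static final List<String> NON_AFFILIATIONS = Arrays.asList(
            "electronic address:",
            "these authors contributed equally to this work");

    /** Returns false for empty strings and for strings that consist only of a known phrase. */
    static boolean looksLikeAffiliation(String text) {
        // Normalise case and drop trailing full stops and whitespace before comparing.
        String normalized = text.trim().toLowerCase(Locale.ROOT).replaceAll("[.\\s]+$", "");
        return !normalized.isEmpty() && !NON_AFFILIATIONS.contains(normalized);
    }
}
```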
It seems a StackOverflowError is thrown by the Mallet library:
2016-09-16 00:56:09,493 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.StackOverflowError
at java.lang.StringBuffer.append(StringBuffer.java:272)
at edu.umass.cs.mallet.grmm.inference.TRP.lambdaPropagation(TRP.java:487)
at edu.umass.cs.mallet.grmm.inference.TRP.lambdaPropagation(TRP.java:491)
at edu.umass.cs.mallet.grmm.inference.TRP.lambdaPropagation(TRP.java:491)
[...]
when providing large text input to CRFAffiliationParser#parse().
After several tests it turned out that affiliation text exceeding 8000-9000 characters causes the problem.
Here is an example causing StackOverflowError:
Affiliations of authors:Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, UK (QG, JT, AMD, MS, JEA, DFE, PDPP); Netherlands Cancer Institute, Antoni van Leeuwenhoek hospital, Amsterdam, the Netherlands (MKS, SC, AB, FBH); Department of Epidemiology, Harvard School of Public Health, Boston, MA (PK, SH, DJH, SL); Program in Genetic Epidemiology and Statistical Genetics, Department of Epidemiology, Harvard School of Public Health, Boston, MA (PK, CCh, DJH, SL); Department of Obstetrics and Gynecology, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland (SK, RF, TAM, HN); Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK (MKB, QW, JD, KM, ML, SK, DFE, PDPP); Department of Genetics, QIMR Berghofer Medical Research Institute, Brisbane, Australia (JBee, GCT); Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm 17177, Sweden (KC, HD, ME, JiL, JBr, KH, PH); Laboratory for Translational Genetics, Department of Oncology, University of Leuven, Leuven, Belgium (DL); Vesalius Research Center, VIB, Leuven, Belgium (DL); Oncology Department, University Hospital Gasthuisberg, Leuven, Belgium (CW, KL); Copenhagen General Population Study, Herlev Hospital, Copenhagen, Denmark (SEB, BGN, SFN); Department of Clinical Biochemistry, Herlev Hospital, Copenhagen University Hospital, Denmark (SEB, BGN, SFN); Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark (SEB, BGN); Department of Breast Surgery, Herlev Hospital, Copenhagen University Hospital, Denmark (HF); Division of Cancer Epidemiology, German Cancer Research Center (Deutsches Krebsforschungszentrum), Heidelberg, Germany (JCC, AR, PS, DC, AHü, RK, MB); Department of Cancer Epidemiology/Clinical Cancer Registry and Institute for Medical Biometrics and Epidemiology, University Clinic Hamburg-Eppendorf, Hamburg, Germany 
(DFJ); Department of Oncology, Helsinki University Central Hospital, Helsinki, Finland (CBl); Department of Clinical Genetics, Helsinki University Central Hospital, Helsinki, Finland (KA); Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN (FJC); Department of Health Sciences Research, Mayo Clinic, Rochester, MN (JEO, CV); Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada (ILA); Ontario Cancer Genetics Network, Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital, Toronto, Ontario, Canada (ILA, GG); Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada (JAK); Prosserman Centre for Health Research, Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada (JAK); Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada (AMM); Laboratory Medicine Program, University Health Network, Toronto, Ontario, Canada (AMM); Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA (CAH, BEH, FS); University of Hawaii Cancer Centre, Honolulu, HI (LLM); Centre for Epidemiology and Biostatistics, Melbourne School of Population Health, the University of Melbourne, Melbourne, Australia (JLH, CA, GGG, RLM); Genetic Epidemiology Laboratory, Department of Pathology, the University of Melbourne, Melbourne, Australia (HT, MCS); Sheffield Cancer Research Centre, Department of Oncology, University of Sheffield, Sheffield, UK (AC, MWRR); Academic Unit of Pathology, Department of Neuroscience, University of Sheffield, UK (SSC); Cancer Epidemiology Centre, Cancer Council Victoria, Melbourne, Australia (GGG, RLM); Anatomical Pathology, the Alfred Hospital, Melbourne, Australia (CM); Laboratory of Cancer Genetics and Tumor Biology, Department of Clinical Chemistry and Biocenter Oulu, University of Oulu, Oulu, Finland (RW); Laboratory of Cancer Genetics 
and Tumor Biology, Northern Finland Laboratory Centre NordLab, Oulu, Finland (KP); Department of Oncology, Oulu University Hospital, University of Oulu, Oulu, Finland (AJV); Department of Surgery, Oulu University Hospital, University of Oulu, Oulu, Finland (MG); Department of Medical Oncology, Family Cancer Clinic, Erasmus MC Cancer Institute, Rotterdam, the Netherlands (MJH, AHo, JWMM, AMWvdO); Department of Obstetrics and Gynecology, University of Heidelberg, Heidelberg, Germany (FM, AS, RY, BB); National Center for Tumor Diseases, University of Heidelberg, Heidelberg, Germany (FM, AS); Molecular Epidemiology Group, German Cancer Research Center, Heidelberg, Germany (BB); Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD (JF, SJC); Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD (JF); Core Genotyping Facility, Frederick National Laboratory for Cancer Research, Gaithersburg, MD (SJC); Department of Cancer Epidemiology and Prevention, M. Sklodowska-Curie Memorial Cancer Center & Institute of Oncology, Warsaw, Poland (JoL); Division of Cancer Studies, National Institute for Health Research, Comprehensive Biomedical Research Centre, Guy’s & St. 
Thomas’ NHS Foundation Trust in partnership with King’s College London, London, UK (EJS); Wellcome Trust Centre for Human Genetics and Oxford NIHR Biomedical Research Centre, University of Oxford, UK (IT); Clinical Science Institute, University Hospital Galway, Galway, Ireland (MJK, NM); Division of Clinical Epidemiology and Aging Research, German Cancer Research Center, Heidelberg, Germany (HB, AKD, VA); German Cancer Consortium (DKTK), Heidelberg, Germany (HB, AKD); Saarland Cancer Registry, Saarbrücken, Germany (BH); Imaging Center, Department of Clinical Pathology, Kuopio University Hospital, Kuopio, Finland (AM, VMK, JMH); School of Medicine, Institute of Clinical Medicine, Pathology and Forensic Medicine, University of Eastern Finland, Kuopio, Finland (AM, VMK, JMH); Biocenter Kuopio, Cancer Center of Eastern Finland, Kuopio University Hospital, Kuopio, Finland (VKa); School of Medicine, Institute of Clinical Medicine, Oncology, University of Eastern Finland, Kuopio, Finland (VKa); Department of Human Genetics & Department of Pathology, Leiden University Medical Center, 2300 RC Leiden, the Netherlands (PD); Department of Surgical Oncology, Leiden University Medical Center, 2300 RC Leiden, the Netherlands (RAEMT); Family Cancer Clinic, Department of Medical Oncology, Erasmus MC-Daniel den Hoed Cancer Centrer, Rotterdam, the Netherlands (CS); Unit of Molecular Bases of Genetic Risk and Genetic Testing, Department of Preventive and Predictive Medicine, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy (PR); IFOM, Fondazione Istituto FIRC di Oncologia Molecolare, Milan, Italy (PP, PM); Division of Cancer Prevention and Genetics, Istituto Europeo di Oncologia, Milan, Italy (BB); Cogentech Cancer Genetic Test Laboratory, Milan, Italy (PM); David Geffen School of Medicine, Department of Medicine, Division of Hematology and Oncology, University of California at Los Angeles, CA (PAF); Department of Gynecology and Obstetrics, University Hospital Erlangen, 
Friedrich-Alexander University Erlangen-Nuremberg, Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany (PAF, MWB, AHe); Institute of Human Genetics; University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nuremberg, Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany (ABE); Western Sydney and Nepean Blue Mountains Local Health Districts, Westmead Millennium Institute for Medical Research, University of Sydney, Sydney, Australia (RB); Peter MacCallum Cancer Center, Melbourne, Australia (kConFab Investigators); the University of Melbourne, Melbourne, Australia (KAP); Division of Cancer Medicine, Peter MacCallum Cancer Centre, Melbourne, Australia (KAP); Centro de Investigación en Red de Enfermedades Raras, Valencia, Spain (JBen); Human Genetics Group, Human Cancer Genetics Program, Spanish National Cancer Research Centre, Madrid, Spain (JBen); Servicio de Oncología Médica, Hospital Universitario La Paz, Madrid, Spain (MPZ); Servicio de Cirugía General y Especialidades, Hospital Monte Naranco, Oviedo, Spain (JIAP); Servicio de Anatomía Patológica, Hospital Monte Naranco, Oviedo, Spain (PM); Department of Genetics and Pathology, Pomeranian Medical University, Szczecin, Poland (AJ, JL, KJB, KD); Molecular Genetics of Breast Cancer, German Cancer Research Center, Heidelberg, Germany (UH, MK); Frauenklinik der Stadtklinik Baden-Baden, Baden-Baden, Germany (HUU); Institute of Pathology, Städtisches Klinikum Karlsruhe, Karlsruhe, Germany (TR); Department of Oncology - Pathology, Karolinska Institutet, Stockholm, Sweden (SM); Department of Genetics, Institute for Cancer Research, Oslo University Hospital, Radiumhospitalet, Oslo, Norway (VKr, SN); Faculty of Medicine (Faculty Division Ahus), University of Oslo, Norway (VKr, SN); Genomic Medicine, Manchester Academic Health Science Centre, University of Manchester, Central Manchester Foundation Trust, St. 
Mary’s Hospital, Manchester, UK (DGE); Cambridge Breast Research Unit and NIHR Cambridge Biomedical Research Centre, University of Cambridge, Department of Oncology, Cambridge, UK (JEA, HME, CCa); Cambridge Experimental Cancer Medicine Centre, Cambridge, UK (JEA, HME, CCa); Warwick Clinical Trials Unit, University of Warwick, UK (LH, JAD); Cancer Research UK Clinical Trials Unit, Institute for Cancer Studies, the University of Birmingham, Edgbaston, Birmingham, UK (SB); Early Detection Research Group, Division of Cancer Prevention National Cancer Institute Bethesda, MD (CBe); Department of Biology, University of Pisa, Pisa, Italy (DC); Epidemiology Research Program, American Cancer Society, Atlanta, GA (WRD, SMG, MMG); Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Boston, MA (SH); Division of Biostatistics and Epidemiology, University of Massachusetts-Amherst School of Public Health and Health Sciences, Amherst, MA (SH); Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD (RNH, MJM); Department of Nutrition, Harvard School of Public Health, Boston, MA (WW); Genomic Epidemiology Group, German Cancer Research Center, Heidelberg, Germany (FC); Breast Cancer Functional Genomics Laboratory, Cancer Research UK Cambridge Institute, University of Cambridge, Li Ka Shing Centre, UK (SFC, CCa); Breakthrough Breast Cancer Research Centre, Division of Breast Cancer Research, the Institute of Cancer Research, London, UK (MGC); Division of Genetics and Epidemiology, Institute of Cancer Research, Sutton, Surrey, UK (MGC, NR); Faculty of Medicine, University of Southampton, UK (DME).
retrieved by IIS PMC parser from one of the PMC XML resources.
I've just created the same issue in IIS, openaire/iis#663, to bypass this problem.
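One possible shape for such a bypass is a length guard in front of the parser. The sketch below is an assumption-laden illustration, not the code from openaire/iis#663: `GuardedParser` is a hypothetical wrapper, the parser is passed in as a plain function because the exact signature of CRFAffiliationParser#parse() is not shown here, and the limit would be chosen by the caller (for example somewhere below the 8000-9000 characters observed above).

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical wrapper: skips inputs that are too long for the wrapped parser. */
final class GuardedParser<T> {
    private final Function<String, T> delegate;
    private final int maxLength;

    GuardedParser(Function<String, T> delegate, int maxLength) {
        this.delegate = delegate;
        this.maxLength = maxLength;
    }

    /** Returns the parse result, or an empty Optional when the text is null or exceeds the limit. */
    Optional<T> parse(String text) {
        if (text == null || text.length() > maxLength) {
            return Optional.empty();
        }
        return Optional.ofNullable(delegate.apply(text));
    }
}
```

Skipping oversized input loses those affiliations, but it keeps a single record from killing the whole job.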
Hi folks,
Looks like parsing this document with Cermine results in some weird rid= values for xref elements:
<xref ref-type="bibr" rid="16 14">(Mikolov et al., 2013c)</xref>
<xref ref-type="bibr" rid="25 27">(Socher et al., 2011b)</xref>
<xref ref-type="bibr" rid="17 16 20 13">(Morin & Bengio, 2005; Mnih & Hinton, 2008; Mikolov et al., 2013c)</xref>
<xref ref-type="bibr" rid="13 14 17">(Mikolov et al., 2013a)</xref>
<xref ref-type="bibr" rid="36">(Collobert & Weston, 2008; Zhila et al., 2013)</xref>
I'm pretty sure it's not valid JATS that way -- does it just break on (authorname, year) citation styles currently?
As reported in the OpenAIRE portal feedback system, metadata extracted from the following thesis:
http://ria.ua.pt/bitstream/10773/1682/1/2009001399.pdf
contained an invalid bibliographic references section.
I just confirmed it by uploading the publication to:
http://cermine.ceon.pl/index.html
and receiving:
<back>
<ref-list>
<ref id="ref1">
<mixed-citation>
<article-title>am ro A p m t p 9 i t f</article-title>
</mixed-citation>
</ref>
</ref-list>
</back>
or:
<back>
<ref-list>
<ref id="ref1">
<mixed-citation>9 i t f</mixed-citation>
</ref>
</ref-list>
</back>
interchangeably, as a part of NLM.
It looks like the table from annex 1 was interpreted as a part of bibliographic references.
For some PDFs, Cermine fails to determine the structure of the document, and outputs garbage.
Tested with 1.8-SNAPSHOT.
Input
Output
For example, processing the example article cuts over 250 pages out of ~320. While the main text body of an article isn't CERMINE's main focus, having the full content would be the desired default behavior.
Comparison of original and .cermxml file:
Page 19 (20 with PDF counting) has a paragraph starting with "Chapter 5, 6...", after which the next paragraph is "There is also a more practical, concrete response. Dahlberg and Moss (2005: 107-110)", which is on page 311 (312 with PDF counting) in the original PDF.
Tested with Cermine 2.12 and via web interface, same behavior.
Unfortunately we found one publication (out of ~2 million already processed) that causes a fatal, unrecoverable error and interrupts IIS processing.
I managed to reproduce this problem locally.
Link to publication causing problem:
https://www.cmi.no/publications/file/5192-ungrateful-children.pdf
Would it somehow be possible to add TaxPub as an add-on? TaxPub is an NLM DTD extension;
see https://sourceforge.net/projects/taxpub/
and
https://www.researchgate.net/publication/233807452
I would like to extend the CERMINE code by adding new keywords for extraction. For example, I tried removing the "Prof." keyword from the author enhancer and the "Case Report" keyword from the title enhancer, then ran an input PDF document through it, but the change was not reflected in the output XML file. Can anyone help me with this?
Hi,
I have had problems with the following paper using version 1.13 of the standalone tool:
Maximizing the Spread of Influence through a Social Network (2003) by
David Kempe, Jon Kleinberg, Éva Tardos
https://www.cs.cornell.edu/home/kleinber/kdd03-inf.pdf
It only extracts references in the left column of the last page. It seems to extract the right hand column of the body text, but fails on the references section.
Thanks
Originally reported by @mkobos:
Affiliations: Improve the mapping of country name to country code. In the vast majority of cases where the country code is missing while the country name is not, the country name field contains a substring which is the name of a country. In these cases it should be possible to assign the correct country code. This problem appears in up to 1.6% of all affiliation records (since this is the percentage of records where the country code is missing while the country name is not).
Note that all numbers come from the first version of the report "Analysis of affiliations extracted by IIS from XMLs and PDFs" available at https://issue.openaire.research-infrastructures.eu/issues/2010, namely: https://issue.openaire.research-infrastructures.eu/attachments/download/509/2016-04-17_21_46_analysis.html. All these numbers are very rough approximations of the real values.
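The substring rule proposed above could be sketched roughly as follows. This is a minimal illustration, not CERMINE or IIS code: the class and method names are hypothetical, and it assumes that English country names from the JDK's locale data are good enough as a dictionary. The longest matching name wins, so that "Nigeria" is not mistaken for "Niger".

```java
import java.util.Locale;

final class CountryCodeGuesser {
    /**
     * Returns the ISO 3166-1 alpha-2 code of the longest English country name
     * contained in the given text, or null if no country name is found.
     */
    static String guessCode(String countryField) {
        String haystack = countryField.toLowerCase(Locale.ROOT);
        String bestCode = null;
        int bestLength = 0;
        for (String code : Locale.getISOCountries()) {
            // Display name of the country in English, e.g. "Poland" for "PL".
            String name = new Locale("", code)
                    .getDisplayCountry(Locale.ENGLISH)
                    .toLowerCase(Locale.ROOT);
            if (name.length() > bestLength && haystack.contains(name)) {
                bestCode = code;
                bestLength = name.length();
            }
        }
        return bestCode;
    }
}
```

A real implementation would probably also need alternative spellings and non-English names, which this sketch does not cover.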
For the following inputs, the extracted titles are all capitalized even when the article title is not.
input.zip
output.zip
EDIT: My mistake, I misinterpreted the inputs.
Hello Support,
I tried to consume the REST service given above in a .NET application. Do you have a .wsdl file for generating a proxy in .NET?
Otherwise, please guide me on how to extract PDF documents in .NET.
Do you have a .NET library that exposes the CERMINE functionality?
Thanks in advance,
Jisha
As much as I hate having to conform to ordered XML schemas, I think this is technically a bug as it does fail validation and can cause some JATS apps to panic.
Right now, in CERMINE output, <abstract/> and <kwd-group/> come before <pub-date/> and other associated elements (<volume/>, etc.), which causes an error (you can reproduce in testing with just xmllint --valid); these should be moved to the end of <article-meta/>.
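As a possible post-processing workaround, the reordering described above can be sketched with the JDK's built-in DOM API. This is an assumption-laden illustration: the class and method names are hypothetical, it works on an XML string rather than on the element object CERMINE returns, and it simply follows the suggestion in this report (move both elements to the end of article-meta) without checking the result against the DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class ArticleMetaReorder {
    /** Parses the XML and moves abstract and kwd-group to the end of the first article-meta. */
    static Document moveAbstractAndKeywordsLast(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node meta = doc.getElementsByTagName("article-meta").item(0);
            if (meta == null) {
                return doc; // nothing to reorder
            }
            // Collect first, then move, so the sibling iteration is not disturbed.
            List<Node> toMove = new ArrayList<>();
            for (Node c = meta.getFirstChild(); c != null; c = c.getNextSibling()) {
                String name = c.getNodeName();
                if (name.equals("abstract") || name.equals("kwd-group")) {
                    toMove.add(c);
                }
            }
            for (Node node : toMove) {
                meta.appendChild(node); // appendChild detaches an attached node first
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Names of the element children of a node, in document order. */
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                names.add(c.getNodeName());
            }
        }
        return names;
    }
}
```

The relative order of abstract and kwd-group is preserved; running xmllint --valid on the serialized result would still be the real check.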
There are still cases where the timeout feature, thoroughly described in #7, doesn't seem to work.
After upgrading CERMINE to the newly released 1.10 version, I got an IIS metadataextraction workflow failure because ContentExtractor was stuck for more than 1h while getting document content as NLM.
First thing: we have a regression in CERMINE, because even without the timeout feature we were able to pass all available contents through ContentExtractor without trouble.
Second thing: after CERMINE reaches this point of processing:
1.1 Character extraction: 2.385
1.2 Page segmentation: 4.723
1.3 Reading order resolving: 0.422
1.4 Initial classification: 9.869
2.1 Metadata classification: 0.055
even though the timeout threshold is exceeded, processing is not interrupted, which hangs the whole process.
Decreasing the timeout value eventually triggers interruption, but this is probably because CERMINE has not yet reached the "point of no return".
Originally reported by @mkobos:
Affiliations: Improve recognizing whether a given affiliation text is really an affiliation. This can be done using a simple rule that checks whether the text is not too short and not too long (this might impact up to 6% of all affiliation records; these are the records where the affiliation text does not really correspond to an affiliation).
Note that all numbers come from the first version of the report "Analysis of affiliations extracted by IIS from XMLs and PDFs" available at https://issue.openaire.research-infrastructures.eu/issues/2010, namely: https://issue.openaire.research-infrastructures.eu/attachments/download/509/2016-04-17_21_46_analysis.html. All these numbers are very rough approximations of the real values.
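The length rule suggested above might look like the sketch below. The class name and both thresholds are hypothetical placeholders chosen only for illustration; the report does not state any concrete limits, so they would have to be tuned on real affiliation data.

```java
final class AffiliationLengthFilter {
    // Hypothetical thresholds, not taken from the report or from CERMINE.
    static final int MIN_CHARS = 10;
    static final int MAX_CHARS = 300;

    /** Returns true when the trimmed text is neither too short nor too long. */
    static boolean looksLikeAffiliation(String text) {
        if (text == null) {
            return false;
        }
        int length = text.trim().length();
        return length >= MIN_CHARS && length <= MAX_CHARS;
    }
}
```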
Articles on the arXiv (for example, this thesis) carry a side label containing the identifier (e.g., arXiv:1405.2249v1 [math-ph] 9 May 2014). It would be nice if CERMINE could read and parse this information.
The different formats for the arXiv ids are discussed in the official help.
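To make the request concrete, here is a rough sketch of parsing such a side label with a regular expression. It is not part of CERMINE; the class name is hypothetical, and the pattern assumes only the two identifier shapes I am aware of: the newer YYMM.NNNN or YYMM.NNNNN form and the older archive/YYMMNNN form. The official help should be treated as the authority on the formats.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ArxivLabelParser {
    // Group 1: identifier, group 2: version number, group 3: primary category.
    private static final Pattern LABEL = Pattern.compile(
            "arXiv:(\\d{4}\\.\\d{4,5}|[a-z-]+(?:\\.[A-Z]{2})?/\\d{7})"
            + "(?:v(\\d+))?"
            + "(?:\\s*\\[([^\\]]+)\\])?");

    /**
     * Returns {identifier, version, category}; version and category are null
     * when absent. Returns null when the text contains no arXiv label.
     */
    static String[] parse(String text) {
        Matcher m = LABEL.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```

For the label quoted above, this yields the identifier 1405.2249, version 1 and category math-ph; the trailing date is left unparsed.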
The CRFBibReferenceParser.parse(String affiliation) method does not work correctly with multiple threads. I could not figure out what the actual problem is. Sometimes an indexOutOfBounds exception is thrown, sometimes a nullPointer exception. When I use one thread, the problem does not occur. I created a synchronized wrapper method around parse, and this is the only way I could make it work. The actual problem is that I have a big bunch of affiliations to parse. It would be great if I could use the power of multiple threads.
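One alternative to the synchronized wrapper mentioned above is to give each thread its own parser instance. The sketch below is generic and deliberately avoids CERMINE types: PerThreadParser is a hypothetical helper, and the parser is modelled as a plain Function. Whether this helps depends on an unverified assumption, namely that the unsafe state lives inside each parser instance rather than in shared static data.

```java
import java.util.function.Function;
import java.util.function.Supplier;

final class PerThreadParser<R> {
    private final ThreadLocal<Function<String, R>> local;

    /** The factory is called once per thread, on that thread's first parse. */
    PerThreadParser(Supplier<Function<String, R>> factory) {
        this.local = ThreadLocal.withInitial(factory);
    }

    /** Parses with the calling thread's private parser instance. */
    R parse(String text) {
        return local.get().apply(text);
    }
}
```

Usage would be along the lines of creating one PerThreadParser with a factory that builds a fresh parser and returns a reference to its parse method, then calling parse from a thread pool. If the exceptions persist, the shared state is elsewhere and the synchronized wrapper remains the safe choice.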
Your README doesn't give detailed installation instructions.
Can you give some more detailed info?
Compiling with mvn compile succeeds.
But when I run mvn install -DskipTests, I hit this issue:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.10.4:jar (attach-javadocs) on project cermine-impl: MavenReportException: Error while generating Javadoc:
[ERROR] Exit code: 1 - src/CERMINE-master/cermine-impl/src/main/java/pl/edu/icm/cermine/ComponentFactory.java:87: warning: no description for @throws
UPDATED
I googled the issue and used mvn install -DskipTests -Dadditionalparam=-Xdoclint:none to get rid of all the errors and install successfully.
But mvn test fails like this:
Results :
Failed tests: metadataExtractionTest(pl.edu.icm.cermine.AltPdfNLMMetadataExtractorTest)
metadataExtractionTest(pl.edu.icm.cermine.bibref.KMeansBibReferenceExtractorTest): expected:<...98, 186:528-33.(..)
testAddFeatures(pl.edu.icm.cermine.metadata.affiliation.features.AffiliationDictionaryFeatureTest): Token: 18 W expected:<1> but was:<0>
getBxDocumentTest(pl.edu.icm.cermine.ContentExtractorTest)
getBxDocumentWithGeneralLabelsTest(pl.edu.icm.cermine.ContentExtractorTest)
getBxDocumentWithSpecificLabelsTest(pl.edu.icm.cermine.ContentExtractorTest)
textRawFullTextTest(pl.edu.icm.cermine.ContentExtractorTest): expected:<...nnerhagen(..)
getFullTextWithLabelsTest(pl.edu.icm.cermine.ContentExtractorTest)
getNLMBodyTest(pl.edu.icm.cermine.ContentExtractorTest)
getNLMReferencesTest(pl.edu.icm.cermine.ContentExtractorTest)
getNLMContentTest(pl.edu.icm.cermine.ContentExtractorTest)
Tests run: 118, Failures: 11, Errors: 0, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] CERMINE Content Extractor and Miner ................ SUCCESS [ 0.164 s]
[INFO] CERMINE Engine Implementation - 1.12-SNAPSHOT ...... FAILURE [01:44 min]
[INFO] CERMINE Tools - 1.12-SNAPSHOT ...................... SKIPPED
[INFO] CERMINE Web Interface - 1.12-SNAPSHOT .............. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:44 min
[INFO] Finished at: 2016-12-19T22:35:29+08:00
[INFO] Final Memory: 12M/159M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project cermine-impl: There are test failures.
[ERROR]
[ERROR] Please refer to /home/duyu/src/CERMINE-master/cermine-impl/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :cermine-impl
Could you give some advice to solve this problem?
Thanks for your reply.
Yu
Hi all,
Thank you for developing Cermine which I found very useful when I tried it on the web.
I've downloaded the standalone version. I wanted to use it on Windows 10 (64-bit), but it did not execute. Note that I have other executable JARs which launch normally with a double click.
Thank you for your responses.
Hi!
Using a current snapshot build, I was able to export references as a BibTeX file, but the authors were separated by commas instead of "and":
@article{Peiker2016,
author = {Peiker, C, Pott, C, Eckardt, L, Kelm, M, Shin, DI, Willems, S},
doi = {10.1093/europace/euv056},
journal = {Europace},
pages = {332--339},
title = {Dual atrioventricular nodal non-re-entrant tachycardia},
volume = {18},
year = {2016},
}
Furthermore, duplicate citekeys were generated.
Stephan
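For reference, the expected output can be sketched in a few lines. This is not CERMINE's exporter: BibTexFields and both methods are hypothetical, and the numeric suffix used for duplicate keys is just one possible convention.

```java
import java.util.List;
import java.util.Set;

final class BibTexFields {
    /** Joins author names with " and ", as BibTeX expects. */
    static String authorField(List<String> authors) {
        return "author = {" + String.join(" and ", authors) + "}";
    }

    /**
     * Returns base if it is unused, otherwise base-2, base-3, and so on.
     * The chosen key is recorded in the given set.
     */
    static String uniqueKey(String base, Set<String> used) {
        String key = base;
        for (int n = 2; !used.add(key); n++) {
            key = base + "-" + n;
        }
        return key;
    }
}
```

With the authors from the entry above, authorField would produce author = {Peiker, C and Pott, C and Eckardt, L and Kelm, M and Shin, DI and Willems, S}.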
I am running CERMINE on 10,000 or so PDFs, but some of them throw this error, and the program stops running. Can I somehow fix this, or tell CERMINE to skip errors and continue?
File processed: /home/moritz/Desktop/pdf_extraction/pdfs/Other/��ztun et al. 1991 - A new haloether from Laurencia possessing a lauroxacyclododecane ring. Structure and conformational studies.pdf
Exception in thread "main" java.lang.IllegalArgumentException: Illegal group reference: group index is missing
at java.util.regex.Matcher.appendReplacement(Matcher.java:819)
at pl.edu.icm.cermine.content.cleaning.ContentCleaner.cleanHyphenationAndBreaks(ContentCleaner.java:180)
at pl.edu.icm.cermine.content.cleaning.ContentCleaner.cleanAllAndBreaks(ContentCleaner.java:236)
at pl.edu.icm.cermine.metadata.model.DocumentMetadata.clean(DocumentMetadata.java:277)
at pl.edu.icm.cermine.metadata.EnhancerMetadataExtractor.extractMetadata(EnhancerMetadataExtractor.java:106)
at pl.edu.icm.cermine.metadata.EnhancerMetadataExtractor.extractMetadata(EnhancerMetadataExtractor.java:36)
at pl.edu.icm.cermine.ExtractionUtils.cleanMetadata(ExtractionUtils.java:101)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341)
at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:320)
at pl.edu.icm.cermine.InternalContentExtractor.getContentAsNLM(InternalContentExtractor.java:286)
at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:612)
at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:628)
at pl.edu.icm.cermine.ContentExtractor.main(ContentExtractor.java:724)
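This kind of exception is what Matcher.appendReplacement throws when the replacement string contains a "$" that is not followed by a group index, for example when the replaced text itself comes from the document. The sketch below shows the general pattern of escaping the replacement with Matcher.quoteReplacement. It is an illustration only: the regex and class are hypothetical and are not the actual ContentCleaner code, whose internals are not shown in this report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SafeHyphenJoin {
    // Hypothetical pattern: a token ending in "-" at a line break, then the next token.
    private static final Pattern HYPHEN_BREAK = Pattern.compile("(\\S+)-\\n(\\S+)");

    /** Joins words split by a hyphen at a line break, treating the text literally. */
    static String joinHyphenated(String text) {
        Matcher m = HYPHEN_BREAK.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String joined = m.group(1) + m.group(2);
            // quoteReplacement escapes '$' and '\' so they are not read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(joined));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Without the quoteReplacement call, input text containing a "$" next to such a break could fail in appendReplacement instead of being joined.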
I'm trying to retrain CERMINE on my university's scientific paper collection. However, after the InitialBuilder has iterated over all the files (~2300), it always throws this exception:
Exception in thread "main" java.lang.RuntimeException: Feature value is set to NaN: AtRelativeCount
at pl.edu.icm.cermine.tools.classification.general.LinearScaling.scaleFeatureVector(LinearScaling.java:49)
at pl.edu.icm.cermine.tools.classification.general.FeatureVectorScalerImpl.scaleFeatureVector(FeatureVectorScalerImpl.java:61)
at pl.edu.icm.cermine.tools.classification.svm.SVMClassifier.buildDatasetForTraining(SVMClassifier.java:157)
at pl.edu.icm.cermine.tools.classification.svm.SVMClassifier.buildClassifier(SVMClassifier.java:117)
at pl.edu.icm.cermine.libsvm.training.SVMInitialBuilder.getZoneClassifier(SVMInitialBuilder.java:69)
at pl.edu.icm.cermine.libsvm.training.SVMInitialBuilder.main(SVMInitialBuilder.java:143)
The line that causes the exception:
featureValue = a*featureValue + b;
Somehow the featureValue was NaN before the calculation. I think the issue may arise from my dataset. The cermstr dataset was generated from sidecar XML files in which all metadata are included: title, author, abstract, references, etc. I've proofread 20% of the files and everything seems normal. So far, the only problem I can think of is that there are too few examples of the correspondence section, and this section usually spans only one line in a PDF file.
How should I improve my dataset to avoid this exception? Thanks!
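One way a ratio-style feature can become NaN is a 0.0/0.0 division, for instance when a zone has no text. Whether that is what happens for AtRelativeCount is only a guess based on the feature's name; the sketch below is not CERMINE's feature code. It shows a guarded ratio and a small helper for locating the first NaN in a feature vector, which could help identify the offending zone in the dataset.

```java
final class NanFeatureCheck {
    /**
     * Share of '@' characters in the text. Returns 0.0 for empty text,
     * where an unguarded 0.0 / 0.0 would produce NaN.
     */
    static double atRelativeCount(String zoneText) {
        if (zoneText.isEmpty()) {
            return 0.0;
        }
        long ats = zoneText.chars().filter(c -> c == '@').count();
        return (double) ats / zoneText.length();
    }

    /** Index of the first NaN value in the vector, or -1 if there is none. */
    static int firstNaN(double[] features) {
        for (int i = 0; i < features.length; i++) {
            if (Double.isNaN(features[i])) {
                return i;
            }
        }
        return -1;
    }
}
```

Running a check like firstNaN over the feature vectors before scaling would at least point to the document and zone that triggers the exception.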
We rely heavily on cermine in IIS, and unfortunately every now and then a malicious PDF file appears that takes a huge amount of time to process.
Currently we have adjusted the hadoop mapper (by overriding mapred.task.timeout for the metadataextraction module) to wait 60 minutes before assuming a task does not respond. Apparently this is not enough. Recently, after adding ~400k new PDF contents to ObjectStore, cermine got files that each took over 60 minutes to process. This causes the whole IIS processing to fail, and the only way to proceed is to blacklist the given document, retry the IIS execution and hope we won't need to do this again (but we probably will).
One possible solution is to execute cermine processing in a dedicated thread and interrupt it before the time defined in mapred.task.timeout passes. Currently this won't have any effect because cermine does not honor interruption. If you could check Thread.interrupted() every now and then, probably as part of a loop condition check, then we could stop the metadataextraction worker thread and continue processing other documents.
Up until now my perception was that the iText library blocks the execution, which could be a dead end. But after running the visualVM profiler on cermine I realized that no single method is blocked for ages; instead, the execution time shifts between several methods. This gives us an opportunity to check whether the thread was interrupted.
Take a look at the attachment with metadataextraction CPU-time profiling, which shows the methods taking most of the execution time. This could guide us on where to check for interruption.
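The proposal above can be sketched as follows. This is an illustration under stated assumptions, not cermine or IIS code: TimedExtraction, processPages and runWithTimeout are hypothetical names, and processPages only simulates long-running work that polls the interrupt flag in its loop.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedExtraction {
    /** Simulated long-running work that honors interruption between steps. */
    static int processPages(int pages) throws InterruptedException {
        int done = 0;
        while (done < pages) {
            if (Thread.interrupted()) {
                throw new InterruptedException("extraction cancelled");
            }
            done++; // one unit of real work would go here
        }
        return done;
    }

    /** Runs the task in a dedicated thread; returns empty if it exceeds the timeout. */
    static <T> Optional<T> runWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The cancel call only sets the worker's interrupt flag, so it stops the work only if the task checks Thread.interrupted() or calls an interruptible method, which is exactly the behavior requested here.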
Hi,
Is there any plan to extract acknowledgements from the PDF files? For instance, taking this paper, extract as metadata that
This work was supported by the ERC through the QGBE grant and by Provincia Autonoma di Trento.
(although this may not be the simplest example, as this sentence is embedded in the paper and not in a separate "Acknowledgements" section).
Thanks