mcs07 / chemdataextractor

Automatically extract chemical information from scientific documents

Home Page: http://chemdataextractor.org

License: MIT License

Python 60.41% Shell 0.44% HTML 39.15%
information-extraction python chemistry text-mining natural-language-processing nlp

chemdataextractor's Introduction

ChemDataExtractor


ChemDataExtractor is a toolkit for extracting chemical information from the scientific literature.

Features

  • HTML, XML and PDF document readers
  • Chemistry-aware natural language processing pipeline
  • Chemical named entity recognition
  • Rule-based parsing grammars for property and spectra extraction
  • Table parser for extracting tabulated data
  • Document processing to resolve data interdependencies

Installation

To install ChemDataExtractor, simply run:

pip install chemdataextractor

Or if you are an Anaconda user, run:

conda install -c chemdataextractor chemdataextractor

Alternatively, try one of the other installation options.

Documentation

Full documentation is available at http://chemdataextractor.org/docs

License

ChemDataExtractor is licensed under the MIT license, a permissive, business-friendly license for open source software.

chemdataextractor's People

Contributors

mcs07, rtchoua

chemdataextractor's Issues

import error: HTMLParser in Python 3

In Python 3, HTMLParser lives in the html.parser module.

So to prevent the import error I had to change two lines in cli/dict.py:

line 16: import HTMLParser → from html.parser import HTMLParser
line 27: pars = HTMLParser.HTMLParser() → pars = HTMLParser()
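The same fix can be written version-agnostically, so the module works unchanged under both interpreters; a sketch (not the project's committed patch) using a try/except import:

```python
# Version-agnostic import: under Python 3 the class lives in
# html.parser; under Python 2 it was a top-level HTMLParser module.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

pars = HTMLParser()  # works on both versions
```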

Fails to recognize 'water'

Expected water to be recognized, but it is not.

from chemdataextractor.doc import Document
Document('water').cems # returns []
Document('H2O').cems # returns [Span('H2O', 0, 3)]

installation failed, windows10, "conda install -c chemdataextractor chemdataextractor"

Following your advice on issue #3, I installed Anaconda and ran the command. But solving the environment failed; detailed output is listed below. Please help.

conda install -c chemdataextractor chemdataextractor
Solving environment: failed

UnsatisfiableError: The following specifications were found to be in conflict:

  • anaconda==2018.12=py37_0 -> curl==7.63.0=h2a8f88b_1000 -> krb5[version='>=1.16.1,<1.17.0a0'] -> tk[version='>=8.6.7,<8.7.0a0']
  • anaconda==2018.12=py37_0 -> mkl-service==1.1.2=py37hb782905_5
  • anaconda==2018.12=py37_0 -> numexpr==2.6.8=py37hdce8814_0
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> msvc_runtime
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> msgpack-python
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> chardet[version='>=3.0.2,<3.1.0']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> idna[version='>=2.5,<2.7']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> asn1crypto[version='>=0.21.0']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> cffi[version='>=1.7'] -> pycparser
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> cryptography-vectors=2.3
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> enum34 -> ordereddict
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> ipaddress
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> packaging -> pyparsing
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> pyasn1[version='>=0.1.8']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> six[version='>=1.4.1']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> pyopenssl[version='>=0.14']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> win_inet_pton
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> colorama
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> distlib
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> distribute
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> html5lib -> webencodings
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> lockfile
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> progress
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> setuptools -> certifi
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> setuptools -> wincertstore
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> wheel
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> sqlite[version='>=3.26.0,<4.0a0']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> vc=9
  • chemdataextractor -> beautifulsoup4 -> soupsieve -> backports.functools_lru_cache -> backports
  • chemdataextractor -> click
  • chemdataextractor -> lxml -> libxml2[version='>=2.9.4,<2.10.0a0'] -> libiconv[version='>=1.15,<2.0a0']
  • chemdataextractor -> lxml -> libxml2[version='>=2.9.4,<2.10.0a0'] -> zlib[version='>=1.2.11,<1.3.0a0']
  • chemdataextractor -> lxml -> libxslt
  • chemdataextractor -> nltk -> gensim -> numpy[version='>=1.11.3,<2.0a0'] -> mkl_fft
  • chemdataextractor -> nltk -> gensim -> numpy[version='>=1.11.3,<2.0a0'] -> mkl_random
  • chemdataextractor -> nltk -> gensim -> numpy[version='>=1.11.3,<2.0a0'] -> numpy-base==1.11.3=py27hb1d0314_11
  • chemdataextractor -> nltk -> gensim -> scipy[version='>=0.18.1']
  • chemdataextractor -> nltk -> gensim -> smart_open[version='>=1.2.1'] -> boto3 -> botocore[version='>=1.7.0,<1.8.0'] -> docutils[version='>=0.10']
  • chemdataextractor -> nltk -> gensim -> smart_open[version='>=1.2.1'] -> boto3 -> botocore[version='>=1.7.0,<1.8.0'] -> python-dateutil[version='>=2.1,<3.0.0']
  • chemdataextractor -> nltk -> gensim -> smart_open[version='>=1.2.1'] -> boto3 -> s3transfer[version='>=0.1.10,<0.2.0'] -> futures[version='>=2.2.0,<4.0.0']
  • chemdataextractor -> nltk -> gensim -> smart_open[version='>=1.2.1'] -> boto[version='>=2.32']
  • chemdataextractor -> nltk -> matplotlib -> cycler[version='>=0.10']
  • chemdataextractor -> nltk -> matplotlib -> dateutil
  • chemdataextractor -> nltk -> matplotlib -> freetype[version='>=2.8,<2.9.0a0'] -> libpng[version='>=1.6.32,<1.7.0a0']
  • chemdataextractor -> nltk -> matplotlib -> functools32
  • chemdataextractor -> nltk -> matplotlib -> kiwisolver
  • chemdataextractor -> nltk -> matplotlib -> pyqt=5.6 -> qt=5.6 -> icu[version='>=58.2,<59.0a0']
  • chemdataextractor -> nltk -> matplotlib -> pyqt=5.6 -> qt=5.6 -> jpeg[version='>=9b,<10a']
  • chemdataextractor -> nltk -> matplotlib -> pyqt=5.6 -> sip=4.18
  • chemdataextractor -> nltk -> matplotlib -> pyside==1.1.2
  • chemdataextractor -> nltk -> matplotlib -> pytz
  • chemdataextractor -> nltk -> matplotlib -> subprocess32
  • chemdataextractor -> nltk -> matplotlib -> tk=8.5
  • chemdataextractor -> nltk -> matplotlib -> tornado -> backports_abc[version='>=0.4']
  • chemdataextractor -> nltk -> matplotlib -> tornado -> singledispatch
  • chemdataextractor -> nltk -> matplotlib -> tornado -> ssl_match_hostname
  • chemdataextractor -> nltk -> python-crfsuite -> *[track_features=vc10]
  • chemdataextractor -> nltk -> pyyaml -> yaml[version='>=0.1.7,<0.2.0a0']
  • chemdataextractor -> nltk -> scikit-learn -> nose
  • chemdataextractor -> nltk -> twython -> requests-oauthlib[version='>=0.4.0'] -> oauthlib[version='>=0.6.2'] -> pyjwt[version='>=1.0.0'] -> flake8 -> configparser
  • chemdataextractor -> nltk -> twython -> requests-oauthlib[version='>=0.4.0'] -> oauthlib[version='>=0.6.2'] -> pyjwt[version='>=1.0.0'] -> flake8 -> mccabe[version='>=0.6.0,<0.7.0']
  • chemdataextractor -> nltk -> twython -> requests-oauthlib[version='>=0.4.0'] -> oauthlib[version='>=0.6.2'] -> pyjwt[version='>=1.0.0'] -> flake8 -> pep8
  • chemdataextractor -> nltk -> twython -> requests-oauthlib[version='>=0.4.0'] -> oauthlib[version='>=0.6.2'] -> pyjwt[version='>=1.0.0'] -> flake8 -> pycodestyle[version='>=2.0.0,<2.4.0']
  • chemdataextractor -> nltk -> twython -> requests-oauthlib[version='>=0.4.0'] -> oauthlib[version='>=0.6.2'] -> pyjwt[version='>=1.0.0'] -> flake8 -> pyflakes[version='>=1.5.0,<1.6.0']
    Use "conda info " to see the dependencies for each package.

How to create a custom parser to get entities in given text?

Can anyone let me know how to write a custom parser to fetch a chemical name together with its constituent details, in the desired format:

[Chemical name + addition : Constituents],[Chemical name + addition : Constituents]

doc = Paragraph('4-Methylmorpholine N-oxide (1.76 mL, 8.42 mmol) and potassium osmate dihydrate (97.3 mg, 0.38 mmol) ')
print(doc.records.serialize())

class BoilingPoint(BaseModel):
    name = StringType()
    Quan = StringType()
    units = StringType()

Compound.addition = ListType(ModelType(BoilingPoint))

import re
from chemdataextractor.parse import R, I, W, Optional, merge

units = R('^(mg|mL|mmol)$')(u'units').add_action(merge)  # define all units in the parser
Quan = R(u'^\d+(\.\d+)?$')(u'value')
bp = (Quan + units)(u'mL')

from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class BpParser(BaseParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound(
            addition=[
                BoilingPoint(
                    # name=first(result.xpath('./name/text()'))
                    Quan=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
        yield compound

Paragraph.parsers = [CompoundParser(), BpParser()]

Result :
[{'names': ['4-Methylmorpholine N-oxide']}, {'names': ['potassium osmate dihydrate']}, {'addition': [{'Quan': '1.76', 'units': 'mL'}]}, {'addition': [{'Quan': '8.42', 'units': 'mmol'}]}, {'addition': [{'Quan': '97.3', 'units': 'mg'}]}, {'addition': [{'Quan': '0.38', 'units': 'mmol'}]}]

Expected Result: Chemical name + addition : Constituents

[{'names': ['4-Methylmorpholine N-oxide'], 'addition': [{'Quan': '1.76', 'units': 'mL'}, {'Quan': '8.42', 'units': 'mmol'}]}]
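Associating each quantity with its compound is normally done inside the parse grammar, but the desired pairing can also be illustrated with the stdlib re module alone. The pattern and field names below are my own assumptions for illustration, not ChemDataExtractor API:

```python
import re

# Purely illustrative: pair each name with the quantities in the
# parentheses that follow it, using plain regexes instead of CDE's
# parse elements.
text = ('4-Methylmorpholine N-oxide (1.76 mL, 8.42 mmol) and '
        'potassium osmate dihydrate (97.3 mg, 0.38 mmol)')

records = []
for name, quants in re.findall(r'([^(]+?)\s*\(([^)]+)\)', text):
    name = re.sub(r'^\s*and\s+', '', name).strip()
    addition = [{'Quan': q, 'units': u}
                for q, u in re.findall(r'([\d.]+)\s*(mg|mL|mmol)', quants)]
    records.append({'names': [name], 'addition': addition})
```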

Consider tika-python for text extraction?

Hi,

Not sure how you are doing text extraction, but I just saw an article in IEEE Computing Edge that cited your tool. If you have any interest in Apache Tika, we provide a functional Python library that you could leverage. Does pdfminer also do the text extraction part?

The benefit of Tika is that it supports text extraction from 1400+ formats.

Cheers,
Chris

Unable to read in xml file

USpatenttest.xml.zip

Having trouble reading in this XML file with the generic XMLReader. It's downloaded from the WIPO Patentscope site.

I run:
from chemdataextractor import Document
f = open('USpatenttest.xml', 'rb')
doc=Document.from_file(f)

And I get the error:

File "/home/ubuntu/miniconda3/envs/reverie_env/lib/python3.6/site-packages/chemdataextractor/reader/markup.py", line 208, in parse
    root = self._css(self.root_css, root)[0]
IndexError: list index out of range

Any advice is greatly appreciated! Thanks
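The IndexError means the reader's root_css selector matched nothing in the document. Before picking or writing a reader, the root element of an unfamiliar XML file can be checked with the stdlib, independently of ChemDataExtractor (the sample string below is a stand-in for the patent file's contents, not the actual WIPO schema):

```python
import xml.etree.ElementTree as ET

# Diagnostic sketch: inspect the root element of an unfamiliar XML
# document before choosing a reader. In practice, pass the patent
# file's path to ET.parse() instead of using an inline sample.
sample = b'<us-patent-grant><abstract>Example text</abstract></us-patent-grant>'
root = ET.fromstring(sample)
root_tag = root.tag  # the element a custom reader would need to target
```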

CDE configuration

Hello,

I couldn't find any documentation on managing the configuration of the command-line interface, and running cde config list doesn't return anything on my installation. What are the parameters that I can set?

Code for Demo Files

I have been working with this library to extract chemical information from HTML pages.
I followed http://chemdataextractor.org/demo and saved https://pubs.rsc.org/en/content/articlelanding/2015/TC/C5TC02626A as an HTML file (input3.html).

Below is my code.

with open('input/input3.html', 'rb') as f:
    doc = Document.from_file(f)

records = doc.records.serialize()

This does not match the records in the JSON output published at https://pubs.rsc.org/en/content/articlelanding/2015/TC/C5TC02626A .
A lot of information is missing, including smiles, fluorescence_lifetimes, etc.

@mcs07, I was wondering if you could publish the code that was used for the demo.

P.S.: Is there a method that creates the entire JSON (abbreviations + biblio + records), or are they extracted separately and stitched together to create the final output?

type object 'HTMLAwareEntitySubstitution' has no attribute 'preserve_whitespace_tags'

Hello.
I have installed ChemDataExtractor with pip.
Now trying to use it in jupyter notebook with python3.5, calling import chemdataextractor.
Getting the following error:
AttributeError                            Traceback (most recent call last)
<ipython-input-1> in <module>()
----> 1 import chemdataextractor

/usr/local/lib/python3.5/dist-packages/chemdataextractor/__init__.py in <module>()
     24
     25
---> 26 from .doc.document import Document

/usr/local/lib/python3.5/dist-packages/chemdataextractor/doc/__init__.py in <module>()
     13 from __future__ import unicode_literals
     14
---> 15 from .document import Document
     16 from .text import Text, Title, Heading, Paragraph, Footnote, Citation, Caption, Sentence, Span, Token
     17 from .figure import Figure

/usr/local/lib/python3.5/dist-packages/chemdataextractor/doc/document.py in <module>()
     22
     23 from ..utils import python_2_unicode_compatible
---> 24 from .text import Paragraph, Citation, Footnote, Heading, Title
     25 from .table import Table
     26 from .figure import Figure

/usr/local/lib/python3.5/dist-packages/chemdataextractor/doc/text.py in <module>()
     20
     21 from ..model import ModelList
---> 22 from ..parse.context import ContextParser
     23 from ..parse.cem import ChemicalLabelParser, CompoundHeadingParser, CompoundParser, chemical_name
     24 from ..parse.table import CaptionContextParser

/usr/local/lib/python3.5/dist-packages/chemdataextractor/parse/__init__.py in <module>()
     13 from __future__ import unicode_literals
     14
---> 15 from .actions import join, merge, strip_stop, fix_whitespace
     16 from .elements import W, I, R, T, H
     17 from .elements import Any, Word, Tag, IWord, Regex, Start, End, Hide, Not

/usr/local/lib/python3.5/dist-packages/chemdataextractor/parse/actions.py in <module>()
     18 from lxml.etree import strip_tags
     19
---> 20 from ..text import HYPHENS
     21
     22

/usr/local/lib/python3.5/dist-packages/chemdataextractor/text/__init__.py in <module>()
     15 import unicodedata
     16
---> 17 from bs4 import UnicodeDammit
     18
     19

/usr/local/lib/python3.5/dist-packages/bs4/__init__.py in <module>()
     33 import warnings
     34
---> 35 from .builder import builder_registry, ParserRejectedMarkup
     36 from .dammit import UnicodeDammit
     37 from .element import (

/usr/local/lib/python3.5/dist-packages/bs4/builder/__init__.py in <module>()
    226
    227
--> 228 class HTMLTreeBuilder(TreeBuilder):
    229     """This TreeBuilder knows facts about HTML.
    230

/usr/local/lib/python3.5/dist-packages/bs4/builder/__init__.py in HTMLTreeBuilder()
    232     """
    233
--> 234     preserve_whitespace_tags = HTMLAwareEntitySubstitution.preserve_whitespace_tags
    235     empty_element_tags = set([
    236         # These are from HTML5.

AttributeError: type object 'HTMLAwareEntitySubstitution' has no attribute 'preserve_whitespace_tags'

Is there a way to fix it?
I tried to apply the solutions for BeautifulSoup4 installations suggested elsewhere, but nothing works.
Thank you!

Issue specifying a Reader

I'm following the examples of reading a document into the tool shown here: http://chemdataextractor.org/docs/reading

I am using version 1.2.2 installed using the conda install on linux and working with Python 3.5, I'm testing from a Jupyter Notebook.

If I use the first example on that page to read a locally stored html file, then I successfully read in the HTML and can query the data stored in the doc variable.

If I then try and explicitly call the correct reader RscHtmlReader() I get an error stating that the name is not defined, and the same if I try to specify the AcsHtmlReader().

NameError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      1 f = open('//Downloads//chemdataextractor-journal-articles-source//evaluation journal articles//articles//rsc.nj.c5nj01594d.html', 'rb')
----> 2 doc = Document.from_file(f, readers=[AcsHtmlReader()])

NameError: name 'AcsHtmlReader' is not defined

I was also able to use the command-line interface to generate an output JSON file; I'm not sure whether that was generated by successfully calling the correct reader or whether the application just fell back to the generic HTML reader.
Any ideas?

Upgrade to Python > 3.6?

I know this package has not been updated for some time; however, is there any way to install it on more recent Python versions?

Not able to install chemdataextractor

I am trying to install chemdataextractor on Python 3.10.11.

It gives me the following error:

error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for DAWG
Running setup.py clean for DAWG
Failed to build DAWG
ERROR: Could not build wheels for DAWG, which is required to install pyproject.toml-based projects

Kindly help

Regex expression to extract colon not working

I am attempting to make a custom parser to extract ratios from this text:

d = Document(Heading(u'1-(5-Bromo-2-(trifluoromethyl)pyridin-3-yl)ethanone (4A-C2) '), Paragraph(u'Standard Procedure A (0.25 mmol, 1.0 mL CH2Cl2/0.4 mL H2O) was followed with a reaction time of 24 h to provide 4A-C2 and 4A-C6 (ratio of 2:3 and C6:C2 by GC-MS, see below) in a combined yield of 66% as a white solid.'))

I am using the following regex in order to do this (it has proven to work independently of the custom parser code):

value = (R("\w+:+\w+"))('value')

Any help would be appreciated.

Full code is here:

class RatioOf(BaseModel): 
    value = StringType() 
    prefix = StringType()
    
Compound.ratio_of = ListType(ModelType(RatioOf))

import re
from chemdataextractor.parse import R, I, W, Optional, merge

prefix = (I('ratio') | I('of')).hide() 

value = (R("\w+:+\w+"))('value')

ro = (prefix + value)(u'ro') 

from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class RoParser(BaseParser):
    root = ro
    def interpret(self, result, start, end):
        compound = Compound(
         ratio_of=[
               RatioOf(
                    prefix=first(result.xpath('./prefix/text()')),
                    value=first(result.xpath('./value/text()'))
                )
            ]
        )
        yield compound

Paragraph.parsers = [RoParser()]

d = Document(
    Heading(u'1-(5-Bromo-2-(trifluoromethyl)pyridin-3-yl)ethanone (4A-C2) and 1-(5-Bromo-2-(trifluoromethyl)pyridin-3-yl)ethanone (4A-C6)'),
    Paragraph(u'Standard Procedure A (0.25 mmol, 1.0 mL CH2Cl2/0.4 mL H2O) was followed with a reaction time of 24 h to provide 4A-C2 and 4A-C6 (ratio of 2:3 and C6:C2 by GC-MS, see below)in a combined yield of 66% as a white solid.'))

d.records.serialize()
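Checked in isolation with the standard re module, the pattern itself does match the ratio strings, which suggests the failure may come from tokenisation (the colon may split '2:3' into separate tokens before the parser sees it) rather than from the regex:

```python
import re

# Sanity check of the ratio pattern on whole strings, outside the
# CDE parser pipeline: both ratio forms from the paragraph match.
ratio = re.compile(r'\w+:+\w+')
matches = ratio.findall('ratio of 2:3 and C6:C2 by GC-MS')
```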



RSCHTMLReader throws bytes/string error

I'm using python 3.5 and trying to process an RSC article (10.1039/C6OB02074G)

I see the error:

TypeError: %b requires bytes, or an object that implements __bytes__, not 'str'

The issue seems to be with the replace_rsc_img_chars function in rsc.py.

Looking at it, the matches obtained from parsing the entity xpath (u1 and u2) are unicode strings (see lines 270 and 272). u1 and u2 are then used to generate rep (line 276), where the code tries to insert a unicode string into a byte string.
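The failure mode can be reproduced in a couple of lines, along with one possible fix: encoding the unicode matches before interpolating them into the bytes template. This is a sketch (the stand-in values are mine), not necessarily how rsc.py resolves it:

```python
# Minimal reproduction: %-interpolating a unicode str into a bytes
# template raises TypeError on Python 3.
u1, u2 = '\u2009', 'm'   # stand-ins for the matched entity strings
template = b'%s%s'
try:
    rep = template % (u1, u2)
except TypeError:
    # One possible fix: encode the unicode pieces first, then
    # interpolate bytes into bytes.
    rep = template % (u1.encode('utf-8'), u2.encode('utf-8'))
```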

add readers for wiley html format

Hi,
ChemDataExtractor really helps me dig information out of papers.
However, I've found that Wiley papers get messed up using the generic HTML reader.
Could you please add a specific reader for Wiley?
Or how should I define a reader? I've read your source code, but there is little description in the reader modules.
Thanks!

https breaks NlmXmlReader

In the NlmXmlReader class

    def detect(self, fstring, fname=None):
        """"""
        if fname and not (fname.endswith('.xml') or fname.endswith('.nxml')):
            return False
        if b'xmlns="http://jats.nlm.nih.gov/ns/archiving' in fstring:
            return True
        if b'JATS-archivearticle1.dtd' in fstring:
            return True
        if b'-//NLM//DTD JATS' in fstring:
            return True
        return False

The NLM's JATS namespace URI now uses https, so my document wasn't being recognized as compatible with NlmXmlReader.
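A sketch of a broadened detect that accepts both schemes — an assumed fix, not necessarily the patch the project adopted:

```python
# Sketch: check the JATS namespace marker under both http and https.
def detect(fstring, fname=None):
    if fname and not (fname.endswith('.xml') or fname.endswith('.nxml')):
        return False
    for marker in (b'xmlns="http://jats.nlm.nih.gov/ns/archiving',
                   b'xmlns="https://jats.nlm.nih.gov/ns/archiving',
                   b'JATS-archivearticle1.dtd',
                   b'-//NLM//DTD JATS'):
        if marker in fstring:
            return True
    return False
```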

conda installation issue

Hi, I've followed the installation instructions but get this error. What am I doing wrong?

thanks

Charlie

(base) charlies-MacBook-Pro:~ charliejeynes$ conda config
(base) charlies-MacBook-Pro:~ charliejeynes$ conda install chemdataextractor
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  • chemdataextractor

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

_in_stoplist should return True for entities trimmed out of existence

For an entity like "-aromatic", which is in IGNORE_SUFFIX, the entity that results after running _in_stoplist has length 0, so the entity should be ignored (i.e. the function should return True) rather than reported as a zero-length entity.

For an entity that is in both IGNORE_PREFIX and IGNORE_SUFFIX, you can even get into a situation where the end index falls before the start index:

d = Document("non-aromatic")
d.cems
[Span(u'', 4, 3)]

I assume adding this check that the resultant entity's length is > 0 will fix that case as well.
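The behaviour and the suggested guard can be illustrated in isolation — a sketch of the trimming logic, not the library's actual _in_stoplist:

```python
# Sketch: trim ignored affixes from an entity span; with the
# suggested guard, a fully trimmed-away entity is stoplisted.
IGNORE_PREFIX = {'non-'}
IGNORE_SUFFIX = {'-aromatic'}

def in_stoplist(text, start, end):
    span = text[start:end]
    for p in IGNORE_PREFIX:
        if span.startswith(p):
            start += len(p)
    for s in IGNORE_SUFFIX:
        if span.endswith(s):
            end -= len(s)
    # For "non-aromatic" this leaves end (3) before start (4),
    # matching the Span(u'', 4, 3) reported above; the proposed
    # fix treats any such zero-or-negative-length span as ignored.
    return end <= start
```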

version of pdfminer

Hi Matt.
The installation comes up with an error when I am installing using pip. pdfminer-20140328.tar.gz seems to be the problem.

Collecting pdfminer (from ChemDataExtractor)
Using cached pdfminer-20140328.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\ernest\AppData\Local\Temp\pip-build-uz7pty9c\pdfminer\setup.py", line 3, in
from pdfminer import version
File "C:\Users\ernest\AppData\Local\Temp\pip-build-uz7pty9c\pdfminer\pdfminer\__init__.py", line 5
print version
^
SyntaxError: Missing parentheses in call to 'print'

Could I install pdfminer.six OR pdfminer3k and use one of them instead?

Extracting entities inside an entity

Does anyone know how to write a custom parser to extract a named entity nested inside another entity?

For example, from the following sentence I want to extract 'boiling', which sits inside the prefix entity.

d = Sentence('Synthesis of 2,4,6-trinitrotoluene (3a).The procedure was followed to yield a pale yellow solid (boiling point 240 °C)')

This is my attempt to write the parser:

class BoilingPoint(BaseModel):
    value = StringType()
    units = StringType()
    prefix = StringType()
    name = StringType()
    
Compound.boiling_points = ListType(ModelType(BoilingPoint))


prefix = (R(u'^b\.?p\.?$', re.I) | I(u'boiling')(u'name') + I(u'point')).add_action(join)(u'prefix')
units = (W(u'°') + Optional(R(u'^[CFK]\.?$')))(u'units').add_action(merge)
value = R(u'^\d+(\.\d+)?$')(u'value')
bp = (prefix + value + units)(u'bp')


class BpParser(BaseParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound(
            boiling_points=[
                BoilingPoint(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()')),
                    prefix = first(result.xpath('./prefix/text()')),
                    name = first(result.xpath('./name/text()')),
                    
                )
            ]
        )
        yield compound

Sentence.parsers = [BpParser()]

However, what d.records.serialize() produces is:

[{'boiling_points': [{'value': '240', 'units': '°C', 'prefix': 'boiling point'}]}]
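The same nesting can be expressed with stdlib re named groups, where the inner 'name' group captures a sub-span of the outer 'prefix' group — shown here independently of CDE's parse elements, as an illustration of the desired behaviour:

```python
import re

# Nested named groups: 'name' is captured inside 'prefix', so both
# spans are available from a single match.
pattern = re.compile(r'(?P<prefix>(?P<name>boiling)\s+point)\s+(?P<value>\d+)')
m = pattern.search('a pale yellow solid (boiling point 240 °C)')
```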

Regex expression of \S* is not recognized.

I am trying to create a custom parser to extract the boiling points from the following texts, so that the text between "boiling point" and "of" is optional.

Paragraph(u'The boiling point limit of 2,4,6-trinitrotoluene is 240 °C') # <- text 1
Paragraph(u'The boiling point of 2,4,6-trinitrotoluene is 240 °C') # <- text 2

I try to use the following prefix,

prefix = (I('boiling') + I('point') + Optional(R('^\S*$')).hide() + I('of') + R(r'\S+') + (I('is')|I('was')).hide())(u'prefix').add_action(join)

But this fails for text2 when there is no text between "boiling point" and "of".

I am not sure whether this is related to the way the code is written.

Full code is given below.

from chemdataextractor.model import BaseModel, StringType, ListType, ModelType, Compound
from chemdataextractor.doc import Document, Heading, Paragraph

class BoilingPoint(BaseModel):
    prefix = StringType()
    value = StringType()
    units = StringType()


Compound.boiling_points = ListType(ModelType(BoilingPoint))

import re
from chemdataextractor.parse import R, I, W, Optional, merge, join

prefix = (I('boiling') + I('point') + Optional(R('^\S*$')).hide() + I('of') +
          R(r'\S+') + (I('is')|I('was')).hide())(u'prefix').add_action(join)

units = (W(u'°') + Optional(R(u'^[CFK]\.?$')))(u'units').add_action(merge)
value = R(u'^\d+(\.\d+)?$')(u'value')
bp = (prefix + value + units)(u'bp')

from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class BpParser(BaseParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound(
            boiling_points=[
                BoilingPoint(
                    prefix=first(result.xpath('./prefix/text()')),
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
        yield compound


Paragraph.parsers = [BpParser()]

d = Document(
    Heading(u'Synthesis of (3a)'),
#     Paragraph(u'The boiling point limit of 2,4,6-trinitrotoluene is 240 °C') # <- text 1
    Paragraph(u'The boiling point of 2,4,6-trinitrotoluene is 240 °C') # <- text 2
)

d.records.serialize()
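One likely cause is that R('^\S*$') matches any token — including "of" itself — so the Optional element may consume "of", after which the required I('of') cannot match; unlike the re module, the parse elements may not backtrack out of that choice. The intended behaviour (an optional single word between "point" and "of") can be sanity-checked with plain re, which does backtrack:

```python
import re

# Both sentences should match: the middle word ("limit") is
# optional, and re backtracks out of the optional group when the
# following literal 'of' would otherwise fail.
pat = re.compile(r'boiling point(?:\s+\S+)?\s+of\s+(\S+)\s+is')
m1 = pat.search('The boiling point limit of 2,4,6-trinitrotoluene is 240 °C')
m2 = pat.search('The boiling point of 2,4,6-trinitrotoluene is 240 °C')
```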
