mcs07 / chemdataextractor

Automatically extract chemical information from scientific documents

Home Page: http://chemdataextractor.org

License: MIT License

Python 60.41% Shell 0.44% HTML 39.15%
information-extraction python chemistry text-mining natural-language-processing nlp

chemdataextractor's Introduction

ChemDataExtractor


ChemDataExtractor is a toolkit for extracting chemical information from the scientific literature.

Features

  • HTML, XML and PDF document readers
  • Chemistry-aware natural language processing pipeline
  • Chemical named entity recognition
  • Rule-based parsing grammars for property and spectra extraction
  • Table parser for extracting tabulated data
  • Document processing to resolve data interdependencies

Installation

To install ChemDataExtractor, simply run:

pip install chemdataextractor

Or if you are an Anaconda user, run:

conda install -c chemdataextractor chemdataextractor

Alternatively, try one of the other installation options.

Documentation

Full documentation is available at http://chemdataextractor.org/docs

License

ChemDataExtractor is licensed under the MIT license, a permissive, business-friendly license for open source software.

chemdataextractor's People

Contributors

mcs07, rtchoua

chemdataextractor's Issues

import error: HTMLParser in Python 3

In Python 3, HTMLParser lives in the html.parser module.

So to prevent the import error I had to change two lines in cli/dict.py:

line 16: import HTMLParser → from html.parser import HTMLParser
line 27: pars = HTMLParser.HTMLParser() → pars = HTMLParser()
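The same fix can be written version-agnostically, so the module works unchanged under both interpreters; a sketch (not the project's committed patch) using a try/except import:

```python
# Version-agnostic import: under Python 3 the class lives in
# html.parser; under Python 2 it was a top-level HTMLParser module.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

pars = HTMLParser()  # works on both versions
```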

Fails to recognize 'water'

Expected water to be recognized, but it is not.

from chemdataextractor.doc import Document
Document('water').cems # returns []
Document('H2O').cems # returns [Span('H2O', 0, 3)]

installation failed, windows10, "conda install -c chemdataextractor chemdataextractor"

Following your advice on issue #3, I installed Anaconda and ran the command. But solving the environment failed; detailed output is listed below. Please help.

conda install -c chemdataextractor chemdataextractor
Solving environment: failed

UnsatisfiableError: The following specifications were found to be in conflict:

  • anaconda==2018.12=py37_0 -> curl==7.63.0=h2a8f88b_1000 -> krb5[version='>=1.16.1,<1.17.0a0'] -> tk[version='>=8.6.7,<8.7.0a0']
  • anaconda==2018.12=py37_0 -> mkl-service==1.1.2=py37hb782905_5
  • anaconda==2018.12=py37_0 -> numexpr==2.6.8=py37hdce8814_0
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> msvc_runtime
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> msgpack-python
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> chardet[version='>=3.0.2,<3.1.0']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> idna[version='>=2.5,<2.7']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> asn1crypto[version='>=0.21.0']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> cffi[version='>=1.7'] -> pycparser
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> cryptography-vectors=2.3
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> enum34 -> ordereddict
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> ipaddress
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> packaging -> pyparsing
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> pyasn1[version='>=0.1.8']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> cryptography[version='>=1.3.4'] -> six[version='>=1.4.1']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> pyopenssl[version='>=0.14']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> cachecontrol -> requests -> urllib3[version='>=1.21.1,<1.23'] -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> win_inet_pton
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> colorama
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> distlib
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> distribute
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> html5lib -> webencodings
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> lockfile
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> progress
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> setuptools -> certifi
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> setuptools -> wincertstore
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> pip -> wheel
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> sqlite[version='>=3.26.0,<4.0a0']
  • chemdataextractor -> appdirs -> python[version='>=2.7,<2.8.0a0'] -> vc=9
  • chemdataextractor -> beautifulsoup4 -> soupsieve -> backports.functools_lru_cache -> backports
  • chemdataextractor -> click
  • chemdataextractor -> lxml -> libxml2[version='>=2.9.4,<2.10.0a0'] -> libiconv[version='>=1.15,<2.0a0']
  • chemdataextractor -> lxml -> libxml2[version='>=2.9.4,<2.10.0a0'] -> zlib[version='>=1.2.11,<1.3.0a0']
  • chemdataextractor -> lxml -> libxslt
  • chemdataextractor -> nltk -> gensim -> numpy[version='>=1.11.3,<2.0a0'] -> mkl_fft
  • chemdataextractor -> nltk -> gensim -> numpy[version='>=1.11.3,<2.0a0'] -> mkl_random
  • chemdataextractor -> nltk -> gensim -> numpy[version='>=1.11.3,<2.0a0'] -> numpy-base==1.11.3=py27hb1d0314_11
  • chemdataextractor -> nltk -> gensim -> scipy[version='>=0.18.1']
  • chemdataextractor -> nltk -> gensim -> smart_open[version='>=1.2.1'] -> boto3 -> botocore[version='>=1.7.0,<1.8.0'] -> docutils[version='>=0.10']
  • chemdataextractor -> nltk -> gensim -> smart_open[version='>=1.2.1'] -> boto3 -> botocore[version='>=1.7.0,<1.8.0'] -> python-dateutil[version='>=2.1,<3.0.0']
  • chemdataextractor -> nltk -> gensim -> smart_open[version='>=1.2.1'] -> boto3 -> s3transfer[version='>=0.1.10,<0.2.0'] -> futures[version='>=2.2.0,<4.0.0']
  • chemdataextractor -> nltk -> gensim -> smart_open[version='>=1.2.1'] -> boto[version='>=2.32']
  • chemdataextractor -> nltk -> matplotlib -> cycler[version='>=0.10']
  • chemdataextractor -> nltk -> matplotlib -> dateutil
  • chemdataextractor -> nltk -> matplotlib -> freetype[version='>=2.8,<2.9.0a0'] -> libpng[version='>=1.6.32,<1.7.0a0']
  • chemdataextractor -> nltk -> matplotlib -> functools32
  • chemdataextractor -> nltk -> matplotlib -> kiwisolver
  • chemdataextractor -> nltk -> matplotlib -> pyqt=5.6 -> qt=5.6 -> icu[version='>=58.2,<59.0a0']
  • chemdataextractor -> nltk -> matplotlib -> pyqt=5.6 -> qt=5.6 -> jpeg[version='>=9b,<10a']
  • chemdataextractor -> nltk -> matplotlib -> pyqt=5.6 -> sip=4.18
  • chemdataextractor -> nltk -> matplotlib -> pyside==1.1.2
  • chemdataextractor -> nltk -> matplotlib -> pytz
  • chemdataextractor -> nltk -> matplotlib -> subprocess32
  • chemdataextractor -> nltk -> matplotlib -> tk=8.5
  • chemdataextractor -> nltk -> matplotlib -> tornado -> backports_abc[version='>=0.4']
  • chemdataextractor -> nltk -> matplotlib -> tornado -> singledispatch
  • chemdataextractor -> nltk -> matplotlib -> tornado -> ssl_match_hostname
  • chemdataextractor -> nltk -> python-crfsuite -> *[track_features=vc10]
  • chemdataextractor -> nltk -> pyyaml -> yaml[version='>=0.1.7,<0.2.0a0']
  • chemdataextractor -> nltk -> scikit-learn -> nose
  • chemdataextractor -> nltk -> twython -> requests-oauthlib[version='>=0.4.0'] -> oauthlib[version='>=0.6.2'] -> pyjwt[version='>=1.0.0'] -> flake8 -> configparser
  • chemdataextractor -> nltk -> twython -> requests-oauthlib[version='>=0.4.0'] -> oauthlib[version='>=0.6.2'] -> pyjwt[version='>=1.0.0'] -> flake8 -> mccabe[version='>=0.6.0,<0.7.0']
  • chemdataextractor -> nltk -> twython -> requests-oauthlib[version='>=0.4.0'] -> oauthlib[version='>=0.6.2'] -> pyjwt[version='>=1.0.0'] -> flake8 -> pep8
  • chemdataextractor -> nltk -> twython -> requests-oauthlib[version='>=0.4.0'] -> oauthlib[version='>=0.6.2'] -> pyjwt[version='>=1.0.0'] -> flake8 -> pycodestyle[version='>=2.0.0,<2.4.0']
  • chemdataextractor -> nltk -> twython -> requests-oauthlib[version='>=0.4.0'] -> oauthlib[version='>=0.6.2'] -> pyjwt[version='>=1.0.0'] -> flake8 -> pyflakes[version='>=1.5.0,<1.6.0']
    Use "conda info " to see the dependencies for each package.

How to create a custom parser to get entities in given text?

Can anyone let me know how to write a custom parser to fetch a chemical name together with its constituent details, in the desired format:

[Chemical name + addition : Constituents],[Chemical name + addition : Constituents]

doc = Paragraph('4-Methylmorpholine N-oxide (1.76 mL, 8.42 mmol) and potassium osmate dihydrate (97.3 mg, 0.38 mmol) ')
print(doc.records.serialize())

class BoilingPoint(BaseModel):
    name = StringType()
    Quan = StringType()
    units = StringType()

Compound.addition = ListType(ModelType(BoilingPoint))

import re
from chemdataextractor.parse import R, I, W, Optional, merge

units = R('^(mg|mL|mmol)$')(u'units').add_action(merge)  # define all units in the parser
Quan = R(u'^\d+(\.\d+)?$')(u'value')
bp = (Quan + units)(u'mL')

from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class BpParser(BaseParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound(
            addition=[
                BoilingPoint(
                    # name=first(result.xpath('./name/text()'))
                    Quan=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
        yield compound

Paragraph.parsers = [CompoundParser(), BpParser()]

Result :
[{'names': ['4-Methylmorpholine N-oxide']}, {'names': ['potassium osmate dihydrate']}, {'addition': [{'Quan': '1.76', 'units': 'mL'}]}, {'addition': [{'Quan': '8.42', 'units': 'mmol'}]}, {'addition': [{'Quan': '97.3', 'units': 'mg'}]}, {'addition': [{'Quan': '0.38', 'units': 'mmol'}]}]

Expected Result: Chemical name + addition : Constituents

[{'names': ['4-Methylmorpholine N-oxide'], 'addition': [{'Quan': '1.76', 'units': 'mL'}, {'Quan': '8.42', 'units': 'mmol'}]}]
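Associating each quantity with its compound is normally done inside the parse grammar, but the desired pairing can also be illustrated with the stdlib re module alone. The pattern and field names below are my own assumptions for illustration, not ChemDataExtractor API:

```python
import re

# Purely illustrative: pair each name with the quantities in the
# parentheses that follow it, using plain regexes instead of CDE's
# parse elements.
text = ('4-Methylmorpholine N-oxide (1.76 mL, 8.42 mmol) and '
        'potassium osmate dihydrate (97.3 mg, 0.38 mmol)')

records = []
for name, quants in re.findall(r'([^(]+?)\s*\(([^)]+)\)', text):
    name = re.sub(r'^\s*and\s+', '', name).strip()
    addition = [{'Quan': q, 'units': u}
                for q, u in re.findall(r'([\d.]+)\s*(mg|mL|mmol)', quants)]
    records.append({'names': [name], 'addition': addition})
```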

Consider tika-python for text extraction?

Hi,

Not sure how you are doing text extraction, but I just saw an article in IEEE Computing Edge that cited your tool. If you have any interest in Apache Tika, we provide a functional Python library that you could leverage. Does pdfminer also do the text extraction part?

The benefit of Tika is that it supports text extraction from 1400+ formats.

Cheers,
Chris

Unable to read in xml file

USpatenttest.xml.zip

Having trouble reading in this XML file with the generic XMLReader. It's downloaded from the WIPO Patentscope site.

I run:
from chemdataextractor import Document
f = open('USpatenttest.xml', 'rb')
doc=Document.from_file(f)

And I get the error:

File "/home/ubuntu/miniconda3/envs/reverie_env/lib/python3.6/site-packages/chemdataextractor/reader/markup.py", line 208, in parse
    root = self._css(self.root_css, root)[0]
IndexError: list index out of range

Any advice is greatly appreciated! Thanks
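The IndexError means the reader's root_css selector matched nothing in the document. Before picking or writing a reader, the root element of an unfamiliar XML file can be checked with the stdlib, independently of ChemDataExtractor (the sample string below is a stand-in for the patent file's contents, not the actual WIPO schema):

```python
import xml.etree.ElementTree as ET

# Diagnostic sketch: inspect the root element of an unfamiliar XML
# document before choosing a reader. In practice, pass the patent
# file's path to ET.parse() instead of using an inline sample.
sample = b'<us-patent-grant><abstract>Example text</abstract></us-patent-grant>'
root = ET.fromstring(sample)
root_tag = root.tag  # the element a custom reader would need to target
```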

CDE configuration

Hello,

I couldn't find any documentation on managing the configuration of the command-line interface, and running cde config list doesn't return anything on my installation. What are the parameters that I can set?

Code for Demo Files

I have been working with this library to extract chemical information from HTML pages.
I followed http://chemdataextractor.org/demo and saved https://pubs.rsc.org/en/content/articlelanding/2015/TC/C5TC02626A as an HTML file (input3.html).

Below is my code.

with open('input/input3.html', 'rb') as f:
    doc = Document.from_file(f)

records = doc.records.serialize()

This does not match the records in the JSON output published at https://pubs.rsc.org/en/content/articlelanding/2015/TC/C5TC02626A .
A lot of information is missing, including smiles, fluorescence_lifetimes, etc.

@mcs07, I was wondering if you could publish the code that was used for the demo.

P.S.: Is there a method that creates the entire JSON (abbreviations + biblio + records), or are they extracted separately and stitched together to create the final output?

type object 'HTMLAwareEntitySubstitution' has no attribute 'preserve_whitespace_tags'

Hello.
I have installed ChemDataExtractor with pip.
Now trying to use it in jupyter notebook with python3.5, calling import chemdataextractor.
Getting the following error:
AttributeError                            Traceback (most recent call last)
<ipython-input-1> in <module>()
----> 1 import chemdataextractor

/usr/local/lib/python3.5/dist-packages/chemdataextractor/__init__.py in <module>()
     24
     25
---> 26 from .doc.document import Document

/usr/local/lib/python3.5/dist-packages/chemdataextractor/doc/__init__.py in <module>()
     13 from __future__ import unicode_literals
     14
---> 15 from .document import Document
     16 from .text import Text, Title, Heading, Paragraph, Footnote, Citation, Caption, Sentence, Span, Token
     17 from .figure import Figure

/usr/local/lib/python3.5/dist-packages/chemdataextractor/doc/document.py in <module>()
     22
     23 from ..utils import python_2_unicode_compatible
---> 24 from .text import Paragraph, Citation, Footnote, Heading, Title
     25 from .table import Table
     26 from .figure import Figure

/usr/local/lib/python3.5/dist-packages/chemdataextractor/doc/text.py in <module>()
     20
     21 from ..model import ModelList
---> 22 from ..parse.context import ContextParser
     23 from ..parse.cem import ChemicalLabelParser, CompoundHeadingParser, CompoundParser, chemical_name
     24 from ..parse.table import CaptionContextParser

/usr/local/lib/python3.5/dist-packages/chemdataextractor/parse/__init__.py in <module>()
     13 from __future__ import unicode_literals
     14
---> 15 from .actions import join, merge, strip_stop, fix_whitespace
     16 from .elements import W, I, R, T, H
     17 from .elements import Any, Word, Tag, IWord, Regex, Start, End, Hide, Not

/usr/local/lib/python3.5/dist-packages/chemdataextractor/parse/actions.py in <module>()
     18 from lxml.etree import strip_tags
     19
---> 20 from ..text import HYPHENS
     21
     22

/usr/local/lib/python3.5/dist-packages/chemdataextractor/text/__init__.py in <module>()
     15 import unicodedata
     16
---> 17 from bs4 import UnicodeDammit
     18
     19

/usr/local/lib/python3.5/dist-packages/bs4/__init__.py in <module>()
     33 import warnings
     34
---> 35 from .builder import builder_registry, ParserRejectedMarkup
     36 from .dammit import UnicodeDammit
     37 from .element import (

/usr/local/lib/python3.5/dist-packages/bs4/builder/__init__.py in <module>()
    226
    227
--> 228 class HTMLTreeBuilder(TreeBuilder):
    229     """This TreeBuilder knows facts about HTML.
    230

/usr/local/lib/python3.5/dist-packages/bs4/builder/__init__.py in HTMLTreeBuilder()
    232     """
    233
--> 234     preserve_whitespace_tags = HTMLAwareEntitySubstitution.preserve_whitespace_tags
    235     empty_element_tags = set([
    236         # These are from HTML5.

AttributeError: type object 'HTMLAwareEntitySubstitution' has no attribute 'preserve_whitespace_tags'

Is there a way to fix it?
I tried to apply the solutions for BeautifulSoup4 installations suggested elsewhere, but nothing works.
Thank you!

Issue specifying a Reader

I'm following the examples of reading a document into the tool shown here: http://chemdataextractor.org/docs/reading

I am using version 1.2.2 installed using the conda install on linux and working with Python 3.5, I'm testing from a Jupyter Notebook.

If I use the first example on that page to read a locally stored html file, then I successfully read in the HTML and can query the data stored in the doc variable.

If I then try and explicitly call the correct reader RscHtmlReader() I get an error stating that the name is not defined, and the same if I try to specify the AcsHtmlReader().

NameError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      1 f = open('//Downloads//chemdataextractor-journal-articles-source//evaluation journal articles//articles//rsc.nj.c5nj01594d.html', 'rb')
----> 2 doc = Document.from_file(f, readers=[AcsHtmlReader()])

NameError: name 'AcsHtmlReader' is not defined

I was also able to use the command-line interface to generate an output JSON file; I'm not sure whether that was generated by successfully calling the correct reader or whether the application just fell back to the generic HTML reader.
Any ideas?

Upgrade to Python > 3.6?

I know this package has not been updated for some time; however, is there any way to install it on more recent Python versions?

Not able to install chemdataextractor

I am trying to install chemdataextractor on Python 3.10.11.

It gives me the following error:

error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for DAWG
Running setup.py clean for DAWG
Failed to build DAWG
ERROR: Could not build wheels for DAWG, which is required to install pyproject.toml-based projects

Kindly help

Regex expression to extract colon not working

I am attempting to make a custom parser to extract ratios from this text:

d = Document(Heading(u'1-(5-Bromo-2-(trifluoromethyl)pyridin-3-yl)ethanone (4A-C2) '), Paragraph(u'Standard Procedure A (0.25 mmol, 1.0 mL CH2Cl2/0.4 mL H2O) was followed with a reaction time of 24 h to provide 4A-C2 and 4A-C6 (ratio of 2:3 and C6:C2 by GC-MS, see below) in a combined yield of 66% as a white solid.'))

I am using the following regex in order to do this (it has proven to work independently of the custom parser code):

value = (R("\w+:+\w+"))('value')

Any help would be appreciated.

Full code is here:

class RatioOf(BaseModel): 
    value = StringType() 
    prefix = StringType()
    
Compound.ratio_of = ListType(ModelType(RatioOf))

import re
from chemdataextractor.parse import R, I, W, Optional, merge

prefix = (I('ratio') | I('of')).hide() 

value = (R("\w+:+\w+"))('value')

ro = (prefix + value)(u'ro') 

from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class RoParser(BaseParser):
    root = ro
    def interpret(self, result, start, end):
        compound = Compound(
         ratio_of=[
               RatioOf(
                    prefix=first(result.xpath('./prefix/text()')),
                    value=first(result.xpath('./value/text()'))
                )
            ]
        )
        yield compound

Paragraph.parsers = [RoParser()]

d = Document(
    Heading(u'1-(5-Bromo-2-(trifluoromethyl)pyridin-3-yl)ethanone (4A-C2) and 1-(5-Bromo-2-(trifluoromethyl)pyridin-3-yl)ethanone (4A-C6)'),
    Paragraph(u'Standard Procedure A (0.25 mmol, 1.0 mL CH2Cl2/0.4 mL H2O) was followed with a reaction time of 24 h to provide 4A-C2 and 4A-C6 (ratio of 2:3 and C6:C2 by GC-MS, see below)in a combined yield of 66% as a white solid.'))

d.records.serialize()
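Checked in isolation with the standard re module, the pattern itself does match the ratio strings, which suggests the failure may come from tokenisation (the colon may split '2:3' into separate tokens before the parser sees it) rather than from the regex:

```python
import re

# Sanity check of the ratio pattern on whole strings, outside the
# CDE parser pipeline: both ratio forms from the paragraph match.
ratio = re.compile(r'\w+:+\w+')
matches = ratio.findall('ratio of 2:3 and C6:C2 by GC-MS')
```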



RSCHTMLReader throws bytes/string error

I'm using python 3.5 and trying to process an RSC article (10.1039/C6OB02074G)

I see the error:

TypeError: %b requires bytes, or an object that implements __bytes__, not 'str'

The issue seems to be with the replace_rsc_img_chars function in rsc.py.

Looking at it, the matches obtained from parsing the entity xpath (u1 and u2) are unicode strings (see lines 270 and 272). u1 and u2 are then used to generate rep (line 276), where the code tries to insert a unicode string into a byte string.
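The failure mode can be reproduced in a couple of lines, along with one possible fix: encoding the unicode matches before interpolating them into the bytes template. This is a sketch (the stand-in values are mine), not necessarily how rsc.py resolves it:

```python
# Minimal reproduction: %-interpolating a unicode str into a bytes
# template raises TypeError on Python 3.
u1, u2 = '\u2009', 'm'   # stand-ins for the matched entity strings
template = b'%s%s'
try:
    rep = template % (u1, u2)
except TypeError:
    # One possible fix: encode the unicode pieces first, then
    # interpolate bytes into bytes.
    rep = template % (u1.encode('utf-8'), u2.encode('utf-8'))
```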

add readers for wiley html format

Hi,
ChemDataExtractor really helps me dig information out of papers.
However, I've found that Wiley papers get messed up using the generic HTML reader.
Could you please add a specific reader for Wiley?
Or how should I define a reader? I've read your source code, but there is little description in the reader modules.
Thanks!

https breaks NlmXmlReader

In the NlmXmlReader class

    def detect(self, fstring, fname=None):
        """"""
        if fname and not (fname.endswith('.xml') or fname.endswith('.nxml')):
            return False
        if b'xmlns="http://jats.nlm.nih.gov/ns/archiving' in fstring:
            return True
        if b'JATS-archivearticle1.dtd' in fstring:
            return True
        if b'-//NLM//DTD JATS' in fstring:
            return True
        return False

The NLM's JATS namespace URI now uses https, so my document wasn't being recognized as compatible with NlmXmlReader.
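A sketch of a broadened detect that accepts both schemes — an assumed fix, not necessarily the patch the project adopted:

```python
# Sketch: check the JATS namespace marker under both http and https.
def detect(fstring, fname=None):
    if fname and not (fname.endswith('.xml') or fname.endswith('.nxml')):
        return False
    for marker in (b'xmlns="http://jats.nlm.nih.gov/ns/archiving',
                   b'xmlns="https://jats.nlm.nih.gov/ns/archiving',
                   b'JATS-archivearticle1.dtd',
                   b'-//NLM//DTD JATS'):
        if marker in fstring:
            return True
    return False
```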

conda installation issue

Hi, I've followed the installation instructions but get this error. What am I doing wrong?

thanks

Charlie

(base) charlies-MacBook-Pro:~ charliejeynes$ conda config
(base) charlies-MacBook-Pro:~ charliejeynes$ conda install chemdataextractor
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  • chemdataextractor

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

_in_stoplist should return True for entities trimmed out of existence

For an entity like "-aromatic", which is in IGNORE_SUFFIX, the entity that results after running _in_stoplist has length 0, so the entity should be ignored (i.e. the function should return True) rather than reported as a zero-length entity.

For an entity that is in both IGNORE_PREFIX and IGNORE_SUFFIX, you can even get into a situation where the end index falls before the start index:

d = Document("non-aromatic")
d.cems
[Span(u'', 4, 3)]

I assume adding this check that the resultant entity's length is > 0 will fix that case as well.
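The behaviour and the suggested guard can be illustrated in isolation — a sketch of the trimming logic, not the library's actual _in_stoplist:

```python
# Sketch: trim ignored affixes from an entity span; with the
# suggested guard, a fully trimmed-away entity is stoplisted.
IGNORE_PREFIX = {'non-'}
IGNORE_SUFFIX = {'-aromatic'}

def in_stoplist(text, start, end):
    span = text[start:end]
    for p in IGNORE_PREFIX:
        if span.startswith(p):
            start += len(p)
    for s in IGNORE_SUFFIX:
        if span.endswith(s):
            end -= len(s)
    # For "non-aromatic" this leaves end (3) before start (4),
    # matching the Span(u'', 4, 3) reported above; the proposed
    # fix treats any such zero-or-negative-length span as ignored.
    return end <= start
```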

version of pdfminer

Hi Matt.
The installation comes up with an error when I am installing using pip. pdfminer-20140328.tar.gz seems to be the problem.

Collecting pdfminer (from ChemDataExtractor)
Using cached pdfminer-20140328.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\ernest\AppData\Local\Temp\pip-build-uz7pty9c\pdfminer\setup.py", line 3, in
from pdfminer import version
File "C:\Users\ernest\AppData\Local\Temp\pip-build-uz7pty9c\pdfminer\pdfminer\__init__.py", line 5
print version
^
SyntaxError: Missing parentheses in call to 'print'

Could I install pdfminer.six OR pdfminer3k and use one of them instead?

Extracting entities inside an entity

Does anyone know how to write a custom parser to extract a named entity nested inside another entity?

For example, from the following sentence I want to extract 'boiling', which sits inside the prefix entity.

d = Sentence('Synthesis of 2,4,6-trinitrotoluene (3a).The procedure was followed to yield a pale yellow solid (boiling point 240 °C)')

This is my attempt to write the parser:

class BoilingPoint(BaseModel):
    value = StringType()
    units = StringType()
    prefix = StringType()
    name = StringType()
    
Compound.boiling_points = ListType(ModelType(BoilingPoint))


prefix = (R(u'^b\.?p\.?$', re.I) | I(u'boiling')(u'name') + I(u'point')).add_action(join)(u'prefix')
units = (W(u'°') + Optional(R(u'^[CFK]\.?$')))(u'units').add_action(merge)
value = R(u'^\d+(\.\d+)?$')(u'value')
bp = (prefix + value + units)(u'bp')


class BpParser(BaseParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound(
            boiling_points=[
                BoilingPoint(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()')),
                    prefix = first(result.xpath('./prefix/text()')),
                    name = first(result.xpath('./name/text()')),
                    
                )
            ]
        )
        yield compound

Sentence.parsers = [BpParser()]

However, what d.records.serialize() produces is:

[{'boiling_points': [{'value': '240', 'units': '°C', 'prefix': 'boiling point'}]}]
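The same nesting can be expressed with stdlib re named groups, where the inner 'name' group captures a sub-span of the outer 'prefix' group — shown here independently of CDE's parse elements, as an illustration of the desired behaviour:

```python
import re

# Nested named groups: 'name' is captured inside 'prefix', so both
# spans are available from a single match.
pattern = re.compile(r'(?P<prefix>(?P<name>boiling)\s+point)\s+(?P<value>\d+)')
m = pattern.search('a pale yellow solid (boiling point 240 °C)')
```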

Regex expression of \S* is not recognized.

I am trying to create a custom parser to extract the boiling points from the following texts, so that the text between "boiling point" and "of" is optional.

Paragraph(u'The boiling point limit of 2,4,6-trinitrotoluene is 240 °C') # <- text 1
Paragraph(u'The boiling point of 2,4,6-trinitrotoluene is 240 °C') # <- text 2

I try to use the following prefix,

prefix = (I('boiling') + I('point') + Optional(R('^\S*$')).hide() + I('of') + R(r'\S+') + (I('is')|I('was')).hide())(u'prefix').add_action(join)

But this fails for text2 when there is no text between "boiling point" and "of".

I am not sure whether this is related to the way the code is written.

Full code is given below.

from chemdataextractor.model import BaseModel, StringType, ListType, ModelType, Compound
from chemdataextractor.doc import Document, Heading, Paragraph

class BoilingPoint(BaseModel):
    prefix = StringType()
    value = StringType()
    units = StringType()


Compound.boiling_points = ListType(ModelType(BoilingPoint))

import re
from chemdataextractor.parse import R, I, W, Optional, merge, join

prefix = (I('boiling') + I('point') + Optional(R('^\S*$')).hide() + I('of') +
          R(r'\S+') + (I('is')|I('was')).hide())(u'prefix').add_action(join)

units = (W(u'°') + Optional(R(u'^[CFK]\.?$')))(u'units').add_action(merge)
value = R(u'^\d+(\.\d+)?$')(u'value')
bp = (prefix + value + units)(u'bp')

from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class BpParser(BaseParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound(
            boiling_points=[
                BoilingPoint(
                    prefix=first(result.xpath('./prefix/text()')),
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
        yield compound


Paragraph.parsers = [BpParser()]

d = Document(
    Heading(u'Synthesis of (3a)'),
#     Paragraph(u'The boiling point limit of 2,4,6-trinitrotoluene is 240 °C') # <- text 1
    Paragraph(u'The boiling point of 2,4,6-trinitrotoluene is 240 °C') # <- text 2
)

d.records.serialize()
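One likely cause is that R('^\S*$') matches any token — including "of" itself — so the Optional element may consume "of", after which the required I('of') cannot match; unlike the re module, the parse elements may not backtrack out of that choice. The intended behaviour (an optional single word between "point" and "of") can be sanity-checked with plain re, which does backtrack:

```python
import re

# Both sentences should match: the middle word ("limit") is
# optional, and re backtracks out of the optional group when the
# following literal 'of' would otherwise fail.
pat = re.compile(r'boiling point(?:\s+\S+)?\s+of\s+(\S+)\s+is')
m1 = pat.search('The boiling point limit of 2,4,6-trinitrotoluene is 240 °C')
m2 = pat.search('The boiling point of 2,4,6-trinitrotoluene is 240 °C')
```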
