bramvanroy / spacy_conll

Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpipe) that adds CoNLL-U properties to a Doc and its sentences and tokens. Can also be used as a command-line tool.

License: BSD 2-Clause "Simplified" License


spacy_conll's Introduction

Parsing to CoNLL with spaCy, spacy-stanza, and spacy-udpipe

This module allows you to parse text into CoNLL-U format. You can use it as a command line tool, or embed it in your own scripts by adding it as a custom pipeline component to a spaCy, spacy-stanza, or spacy-udpipe pipeline. It also provides an easy-to-use function to quickly initialize a parser as well as a ConllParser class with built-in functionality to parse files or text.

Note that the module simply takes a parser's output and puts it in a formatted string adhering to the CoNLL-U format. The output tags depend on the spaCy model used. If you want Universal Dependencies tags as output, I advise you to use this library in combination with spacy-stanza, a spaCy interface that uses stanza and its models behind the scenes. Those models use the Universal Dependencies formalism and yield state-of-the-art performance. stanza is a new and improved version of stanfordnlp. As an alternative to the Stanford models, you can use the spaCy wrapper for UDPipe, spacy-udpipe, which is slightly less accurate than stanza but much faster.

Installation

By default, this package automatically installs only spaCy as a dependency. Because spaCy's models are not necessarily trained on Universal Dependencies conventions, their output labels are not UD either. By using spacy-stanza or spacy-udpipe, we get the easy-to-use interface of spaCy as a wrapper around stanza and UDPipe respectively, including their models that are trained on UD data.

NOTE: spacy-stanza and spacy-udpipe are not installed automatically as dependencies of this library, because they might be too much overhead for those who don't need UD. If you wish to use their functionality, you have to install them manually or use one of the installation options described below.

If you want to retrieve CoNLL info as a pandas DataFrame, this library will expose that automatically (via ._.conll_pd) if it detects that pandas is installed. See the Usage section for more.

To install the library, simply use pip.

# only includes spacy by default
pip install spacy_conll

A number of options are available to make installation of additional dependencies easier:

# include spacy-stanza and spacy-udpipe
pip install spacy_conll[parsers]
# include pandas
pip install spacy_conll[pd]
# include pandas, spacy-stanza and spacy-udpipe
pip install spacy_conll[all]
# include pandas, spacy-stanza and spacy-udpipe, and additional libraries for testing and formatting
pip install spacy_conll[dev]

Usage

When the ConllFormatter is added to a spaCy pipeline, it adds CoNLL properties to Token, sentence Span and Doc objects. Note that arbitrary Spans are not included and do not receive these properties.

On all three of these levels, two custom properties are exposed by default: ._.conll and its string representation ._.conll_str. If you have pandas installed, then ._.conll_pd will be added automatically, too! A short example covering all three levels follows the list below.

  • ._.conll: raw CoNLL format

    • in Token: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as values.
    • in sentence Span: a list of its tokens' ._.conll dictionaries (list of dictionaries).
    • in a Doc: a list of its sentences' ._.conll lists (list of list of dictionaries).
  • ._.conll_str: string representation of the CoNLL format

    • in Token: tab-separated representation of the contents of the CoNLL fields ending with a newline.
    • in sentence Span: the expected CoNLL format where each row represents a token. When ConllFormatter(include_headers=True) is used, two header lines are included as well, as per the CoNLL format.
    • in Doc: all its sentences' ._.conll_str combined and separated by new lines.
  • ._.conll_pd: pandas representation of the CoNLL format

    • in Token: a Series representation of this token's CoNLL properties.
    • in sentence Span: a DataFrame representation of this sentence, with the CoNLL names as column headers.
    • in Doc: a concatenation of its sentences' DataFrame's, leading to a new a DataFrame whose index is reset.

You can use spacy_conll in your own Python code as a custom pipeline component, or you can use the built-in command-line script which offers typically needed functionality. See the following section for more.

In Python

This library offers the ConllFormatter class which serves as a custom spaCy pipeline component. It can be instantiated as follows. It is important that you import spacy_conll before adding the pipe!

import spacy
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("conll_formatter", last=True)

Because this library supports different spaCy wrappers (spacy, stanza, and udpipe), a convenience function is available as well. With utils.init_parser you can easily instantiate a parser with a single line. You can find the function's signature below. Have a look at the source code to read more about all the possible arguments or try out the examples.

NOTE: is_tokenized does not work for spacy-udpipe. Using is_tokenized for spacy-stanza also affects sentence segmentation, effectively only splitting on new lines. With spacy, is_tokenized disables sentence splitting completely.

def init_parser(
    model_or_lang: str,
    parser: str,
    *,
    is_tokenized: bool = False,
    disable_sbd: bool = False,
    exclude_spacy_components: Optional[List[str]] = None,
    parser_opts: Optional[Dict] = None,
    **kwargs,
)
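
For example, a minimal sketch (assuming the en_core_web_sm model is installed) that initialises a spaCy parser for pre-tokenized input:

from spacy_conll import init_parser

# Tokens must be separated by spaces. Note that with "spacy" as the
# parser, is_tokenized also disables sentence splitting (see the note above).
nlp = init_parser("en_core_web_sm", "spacy", is_tokenized=True)
doc = nlp("I like cookies .")
print(doc._.conll_str)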

For instance, if you want to load a Dutch stanza model in silent mode with the CoNLL formatter already attached, you can simply use the following snippet. parser_opts is passed to the stanza pipeline initialisation automatically. Any other keyword arguments (kwargs), on the other hand, are passed to the ConllFormatter initialisation.

from spacy_conll import init_parser

nlp = init_parser("nl", "stanza", parser_opts={"verbose": False})
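
Similarly, and assuming that spacy-udpipe is installed, you can swap in the UDPipe back-end; its models are downloaded automatically on first use:

from spacy_conll import init_parser

nlp = init_parser("nl", "udpipe")
doc = nlp("Ik hou van koekjes.")
print(doc._.conll_str)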

The ConllFormatter allows you to customize the extension names, and you can also specify conversion maps for the output properties.

To illustrate, here is an advanced example, showing the more complex options:

  • ext_names: changes the attribute names to a custom key by using a dictionary.
  • conversion_maps: a two-level dictionary that looks like {field_name: {tag_name: replacement}}. In other words, you can specify in which field a certain value should be replaced by another. This is especially useful when you are not satisfied with the tagset of a model and wish to change some tags to an alternative.
  • field_names: allows you to change the default CoNLL-U field names to your own custom names. Similar to the conversion map above, use the default field names as keys and your own custom names as values. Possible keys are: "ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC".

The example below

  • shows how to manually add the component;
  • changes the custom attribute conll_pd to pandas (conll_pd is only available if pandas is installed);
  • converts any nsubj deprel tag to subj.

import spacy


nlp = spacy.load("en_core_web_sm")
config = {"ext_names": {"conll_pd": "pandas"},
          "conversion_maps": {"deprel": {"nsubj": "subj"}}}
nlp.add_pipe("conll_formatter", config=config, last=True)
doc = nlp("I like cookies.")
print(doc._.pandas)

This is the same as:

from spacy_conll import init_parser

nlp = init_parser("en_core_web_sm",
                  "spacy",
                  ext_names={"conll_pd": "pandas"},
                  conversion_maps={"deprel": {"nsubj": "subj"}})
doc = nlp("I like cookies.")
print(doc._.pandas)

By using ._.pandas rather than the standard ._.conll_pd, the snippets above will output a pandas DataFrame in which all occurrences of nsubj in the deprel field have been replaced by subj.

   ID     FORM   LEMMA    UPOS    XPOS                                       FEATS  HEAD DEPREL DEPS           MISC
0   1        I       I    PRON     PRP  Case=Nom|Number=Sing|Person=1|PronType=Prs     2   subj    _              _
1   2     like    like    VERB     VBP                     Tense=Pres|VerbForm=Fin     0   ROOT    _              _
2   3  cookies  cookie    NOUN     NNS                                 Number=Plur     2   dobj    _  SpaceAfter=No
3   4        .       .   PUNCT       .                              PunctType=Peri     2  punct    _  SpaceAfter=No

Another initialization example that replaces the column names "UPOS" with "upostag" and "XPOS" with "xpostag":

import spacy


nlp = spacy.load("en_core_web_sm")
config = {"field_names": {"UPOS": "upostag", "XPOS": "xpostag"}}
nlp.add_pipe("conll_formatter", config=config, last=True)
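
As a hedged continuation of this snippet (assuming pandas is installed), the renamed fields should then appear wherever the field names are used, for instance as the DataFrame column headers:

doc = nlp("I like cookies.")
# Expected to list 'upostag' and 'xpostag' instead of 'UPOS' and 'XPOS'
print(doc._.conll_pd.columns.tolist())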

Reading CoNLL into a spaCy object

It is possible to read a CoNLL string or text file and parse it as a spaCy object. This can be useful if you have raw CoNLL data that you wish to process in different ways. The process is straightforward.

from spacy_conll import init_parser
from spacy_conll.parser import ConllParser


nlp = ConllParser(init_parser("en_core_web_sm", "spacy"))

doc = nlp.parse_conll_file_as_spacy("path/to/your/conll-sample.txt")

# ... or straight from raw text:
conllstr = """
# text = From the AP comes this story :
1	From	from	ADP	IN	_	3	case	3:case	_
2	the	the	DET	DT	Definite=Def|PronType=Art	3	det	3:det	_
3	AP	AP	PROPN	NNP	Number=Sing	4	obl	4:obl:from	_
4	comes	come	VERB	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	0:root	_
5	this	this	DET	DT	Number=Sing|PronType=Dem	6	det	6:det	_
6	story	story	NOUN	NN	Number=Sing	4	nsubj	4:nsubj	_
"""
doc = nlp.parse_conll_text_as_spacy(conllstr)

# Multiple CoNLL entries (separated by two newlines) will be included as different sentences in the resulting Doc
for sent in doc.sents:
    for token in sent:
        print(token.text, token.dep_, token.pos_)

Command line

Upon installation, a command-line script is added under the alias parse-as-conll. You can use it to parse a string or file into CoNLL format, given a number of options.

parse-as-conll -h
usage: parse-as-conll [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR] [-o OUTPUT_FILE]
                  [-c OUTPUT_ENCODING] [-s] [-t] [-d] [-e] [-j N_PROCESS] [-v]
                  [--ignore_pipe_errors] [--no_split_on_newline]
                  model_or_lang {spacy,stanza,udpipe}

Parse an input string or input file to CoNLL-U format using a spaCy-wrapped parser. The output
can be written to stdout or a file, or both.

positional arguments:
  model_or_lang         Model or language to use. SpaCy models must be pre-installed, stanza
                        and udpipe models will be downloaded automatically
  {spacy,stanza,udpipe}
                        Which parser to use. Parsers other than 'spacy' need to be installed
                        separately. For 'stanza' you need 'spacy-stanza', and for 'udpipe' the
                        'spacy-udpipe' library is required.

optional arguments:
  -h, --help            show this help message and exit
  -f INPUT_FILE, --input_file INPUT_FILE
                        Path to file with sentences to parse. Has precedence over 'input_str'.
                        (default: None)
  -a INPUT_ENCODING, --input_encoding INPUT_ENCODING
                        Encoding of the input file. Default value is system default. (default:
                        cp1252)
  -b INPUT_STR, --input_str INPUT_STR
                        Input string to parse. (default: None)
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Path to output file. If not specified, the output will be printed on
                        standard output. (default: None)
  -c OUTPUT_ENCODING, --output_encoding OUTPUT_ENCODING
                        Encoding of the output file. Default value is system default. (default:
                        cp1252)
  -s, --disable_sbd     Whether to disable spaCy automatic sentence boundary detection. In
                        practice, disabling means that every line will be parsed as one
                        sentence, regardless of its actual content. When 'is_tokenized' is
                        enabled, 'disable_sbd' is enabled automatically (see 'is_tokenized').
                        Only works when using 'spacy' as 'parser'. (default: False)
  -t, --is_tokenized    Whether your text has already been tokenized (space-separated). Setting
                        this option has as an important consequence that no sentence splitting
                        at all will be done except splitting on new lines. So if your input is
                        a file, and you want to use pretokenised text, make sure that each line
                        contains exactly one sentence. (default: False)
  -d, --include_headers
                        Whether to include headers before the output of every sentence. These
                        headers include the sentence text and the sentence ID as per the CoNLL
                        format. (default: False)
  -e, --no_force_counting
                        Whether to disable force counting the 'sent_id', starting from 1 and
                        increasing for each sentence. Instead, 'sent_id' will depend on how
                        spaCy returns the sentences. Must have 'include_headers' enabled.
                        (default: False)
  -j N_PROCESS, --n_process N_PROCESS
                        Number of processes to use in nlp.pipe(). -1 will use as many cores as
                        available. Might not work for a 'parser' other than 'spacy' depending
                        on your environment. (default: 1)
  -v, --verbose         Whether to always print the output to stdout, regardless of
                        'output_file'. (default: False)
  --ignore_pipe_errors  Whether to ignore a priori errors concerning 'n_process'. By default we
                        try to determine whether processing works on your system and stop
                        execution if we think it doesn't. If you know what you are doing, you
                        can ignore such pre-emptive errors, though, and run the code as-is,
                        which will then throw the default Python errors when applicable.
                        (default: False)
  --no_split_on_newline
                        By default, the input file or string is split on newlines for faster
                        processing of the split up parts. If you want to disable that behavior,
                        you can use this flag. (default: False)

For example, parsing a single line, multi-sentence string:

parse-as-conll en_core_web_sm spacy --input_str "I like cookies. What about you?" --include_headers

# sent_id = 1
# text = I like cookies.
1       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      2       nsubj   _       _
2       like    like    VERB    VBP     Tense=Pres|VerbForm=Fin 0       ROOT    _       _
3       cookies cookie  NOUN    NNS     Number=Plur     2       dobj    _       SpaceAfter=No
4       .       .       PUNCT   .       PunctType=Peri  2       punct   _       _

# sent_id = 2
# text = What about you?
1       What    what    PRON    WP      _       2       dep     _       _
2       about   about   ADP     IN      _       0       ROOT    _       _
3       you     you     PRON    PRP     Case=Acc|Person=2|PronType=Prs  2       pobj    _       SpaceAfter=No
4       ?       ?       PUNCT   .       PunctType=Peri  2       punct   _       SpaceAfter=No

For example, parsing a large input file and writing output to a given output file, using four processes:

parse-as-conll en_core_web_sm spacy --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4

Credits

The first version of this library was inspired by initial work by rgalhama and has evolved a lot since then.

spacy_conll's People

Contributors

bramvanroy, koichiyasuoka, rgalhama, shaked571


spacy_conll's Issues

Incorrect example

There seems to be an error in the example at https://spacy.io/universe/project/spacy-conll, at least when I try to run it. Instead of init_parser("stanza", "en", ...) it should say init_parser("en", "stanza", ...).

It is not possible to install the package in a virtualenv if the system environment has no spacy installed

It is not possible to install the package in a virtualenv when spacy is not installed in the system environment, because setup.py imports all project modules indirectly by importing __init__.py:

$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install .
Processing /home/rominf/dev/spacy_conll
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [24 lines of output]
      Traceback (most recent call last):
        File "/usr/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/usr/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-tbqma0n0/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-tbqma0n0/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-tbqma0n0/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 487, in run_setup
          super().run_setup(setup_script=setup_script)
        File "/tmp/pip-build-env-tbqma0n0/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 4, in <module>
        File "/home/rominf/dev/spacy_conll/spacy_conll/__init__.py", line 3, in <module>
          from .formatter import ConllFormatter
        File "/home/rominf/dev/spacy_conll/spacy_conll/formatter.py", line 5, in <module>
          from spacy.language import Language
      ModuleNotFoundError: No module named 'spacy'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

I am working on this.

TypeError: ConllParser.__init__() got an unexpected keyword argument 'is_tokenized' when using command parse-as-conll

This happens whether I use the -t argument or not.

(spacy) mememe@ubuntugpu:~$ parse-as-conll en_core_web_sm spacy --input_str "I like cookies. What about you?" -t --include_headers
Traceback (most recent call last):
  File "/mnt/data/mememe/.local/bin/parse-as-conll", line 8, in <module>
    sys.exit(main())
  File "/mnt/data/mememe/.local/lib/python3.10/site-packages/spacy_conll/cli/parse.py", line 175, in main
    parse(cargs)
  File "/mnt/data/mememe/.local/lib/python3.10/site-packages/spacy_conll/cli/parse.py", line 23, in parse
    parser = ConllParser(nlp, is_tokenized=args.is_tokenized)
TypeError: ConllParser.__init__() got an unexpected keyword argument 'is_tokenized'

I have to change line 23 to "parser = ConllParser(nlp)".

command line example is not working

% python -m spacy_conll -h
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/ar/.env/lib/python3.7/site-packages/spacy_conll/__main__.py", line 7, in <module>
    from packaging import version
ModuleNotFoundError: No module named 'packaging'

Wrong Column Names in Output

First off thanks for this library. It fits some great use cases!

I am leveraging this to create CoNLL output from stanza, and I notice that the column names don't actually match the UD format spec. Specifically, the names xpostag and upostag are used instead of xpos and upos.

Understandably there's a bit of ambiguity here, since this is spacy_conll and not spacy_conllu, and I think the original CoNLL formats did use xpostag and upostag, but the actual documentation refers frequently to the CoNLL-U format, which I think is the only variant that has an actual spec that can be adhered to.

This is all to say, an option to specify the column names would be very helpful at minimum. I'm not sure how the majority of downstream processing tools handle this, but it may also be helpful to have a flag that allows for "strict" adherence to the CoNLL-U spec (to avoid the same duplication of official headers by different users) and, if false, use the existing column names as the legacy default.

Support CoNLL-U Plus

As requested as part of #24

It would be neat to support CoNLL-U Plus:

  • export only the requested fields (and mark the output CoNLL-U with global.columns)
  • allow reading in a CoNLL-U Plus file
  • it also supports custom columns, but I am hesitant to support those. Perhaps we can use them: if a custom field is present in the private spaCy registered space ._., then we may use that as the destination. Will have to think about it some more.

Here is an example of a CoNLL-U Plus file. Note how the first line indicates which fields are present (separated by spaces).

# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
# newdoc id = mf920901-001
# newpar id = mf920901-001-p1
# sent_id = mf920901-001-p1s1A
# text = Slovenská ústava: pro i proti
# text_en = Slovak constitution: pros and cons
1   Slovenská   slovenský   ADJ     AAFS1----1A---- Case=Nom|Degree=Pos|Gender=Fem|Number=Sing|Polarity=Pos 2 amod _ _
2   ústava      ústava      NOUN    NNFS1-----A---- Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos 0 root _ SpaceAfter=No
3   :           :           PUNCT   Z:------------- _          2       punct   _       _
4   pro         pro         ADP     RR--4---------- Case=Acc   2       appos   _       LId=pro-1
5   i           i           CCONJ   J^------------- _          6       cc      _       LId=i-1
6   proti       proti       ADP     RR--3---------- Case=Dat   4       conj    _       LId=proti-1

If you want to see this implemented, please give this post a thumbs up so that I know what to prioritize.

Newlines in text break CoNLL format

When I run

import spacy
from spacy_conll import init_parser
import re

nlp = init_parser("spacy", 'en_core_web_sm',  parser_opts={'verbose': False})

# Parse a given string
txt = """
I like cookies.
You like ice cream.
We like spacy.
""".strip()
# txt = re.sub('\n', '', txt)

doc = nlp(txt)

# Get the CoNLL representation of the whole document, including headers
conll = doc._.conll_str
print(conll)

Then the result is

1	I	-PRON-	PRON	PRP	PronType=prs	2	nsubj	_	_
2	like	like	VERB	VBP	VerbForm=fin|Tense=pres	0	ROOT	_	_
3	cookies	cookie	NOUN	NNS	Number=plur	2	dobj	_	SpaceAfter=No
4	.	.	PUNCT	.	PunctType=peri	2	punct	_	SpaceAfter=No
5	
	
	SPACE	_SP	_	4		_	SpaceAfter=No

1	You	-PRON-	PRON	PRP	PronType=prs	2	nsubj	_	_
2	like	like	SCONJ	IN	_	0	ROOT	_	_
3	ice	ice	NOUN	NN	Number=sing	4	compound	_	_
4	cream	cream	NOUN	NN	Number=sing	2	dobj	_	SpaceAfter=No
5	.	.	PUNCT	.	PunctType=peri	2	punct	_	SpaceAfter=No
6	
	
	SPACE	_SP	_	5		_	SpaceAfter=No

1	We	-PRON-	PRON	PRP	PronType=prs	2	nsubj	_	_
2	like	like	SCONJ	IN	_	0	ROOT	_	_
3	spacy	spacy	NOUN	NN	Number=sing	2	dobj	_	SpaceAfter=No
4	.	.	PUNCT	.	PunctType=peri	2	punct	_	SpaceAfter=No

which is not a valid CoNLL-U file. While it is possible to remove newlines before the call to nlp, it would be nice if this package also handled this case.

Can I read CoNLL format and turn it into a spaCy object?

Can I take a CoNLL-format file and transform it into a spaCy object using the repo code?
Let's say I have the following file:

  # sent_id = 9
  # text = היא אמרה כי שירות התעסוקה הציע להביא עובדים מדרום לבנון, אך תנועת המושבים סירבה.
  1	היא	הוא	PRON	PRON	Gender=Fem|Number=Sing|Person=3|PronType=Prs	2	nsubj	_	_
  2	אמרה	אמר	VERB	VERB	Gender=Fem|HebBinyan=PAAL|Number=Sing|Person=3|Tense=Past|Voice=Act	0	root	_	_
  3	כי	כי	SCONJ	SCONJ	_	7	mark	_	_
  4	שירות	שירות	NOUN	NOUN	Definite=Cons|Gender=Masc|Number=Sing	7	nsubj	_	_
  5-6	התעסוקה	_	_	_	_	_	_	_	_
  5	ה	ה	DET	DET	Definite=Def|PronType=Art	6	det	_	_
  6	תעסוקה	תעסוקה	NOUN	NOUN	Gender=Fem|Number=Sing	4	compound:smixut	_	_
  7	הציע	הציע	VERB	VERB	Gender=Masc|HebBinyan=HIFIL|Number=Sing|Person=3|Tense=Past|Voice=Act	2	ccomp	_	_
  8	להביא	הביא	VERB	VERB	HebBinyan=HIFIL|VerbForm=Inf|Voice=Act	7	xcomp	_	_
  9	עובדים	עובד	NOUN	NOUN	Gender=Masc|Number=Plur	8	obj	_	_
  10-11	מדרום	_	_	_	_	_	_	_	_
  10	מ	מ	ADP	ADP	_	11	case	_	_
  11	דרום	דרום	NOUN	NOUN	Definite=Cons|Gender=Masc|Number=Sing	8	obl	_	_
  12	לבנון	לבנון	PROPN	PROPN	_	11	compound:smixut	_	SpaceAfter=No
  13	,	,	PUNCT	PUNCT	_	18	punct	_	_
  14	אך	אך	CCONJ	CCONJ	_	18	cc	_	_
  15	תנועת	תנועה	NOUN	NOUN	Definite=Cons|Gender=Fem|Number=Sing	18	nsubj	_	_
  16-17	המושבים	_	_	_	_	_	_	_	_
  16	ה	ה	DET	DET	Definite=Def|PronType=Art	17	det	_	_
  17	מושבים	מושב	NOUN	NOUN	Gender=Masc|Number=Plur	15	compound:smixut	_	_
  18	סירבה	סירב	VERB	VERB	Gender=Fem|HebBinyan=PIEL|Number=Sing|Person=3|Tense=Past|Voice=Act	7	conj	_	SpaceAfter=No
  19	.	.	PUNCT	PUNCT	_	2	punct	_	_

Can I somehow use the code to read this format as a spaCy Doc?

type object 'EnglishDefaults' has no attribute 'tag_map'

When using spaCy 3.0.6, I am getting the error below:

type object 'EnglishDefaults' has no attribute 'tag_map'

import spacy
from spacy_conll import ConllFormatter


nlp = spacy.load('en_core_web_sm')
conllformatter = ConllFormatter(nlp,
                                ext_names={'conll_pd': 'pandas'},
                                conversion_maps={'lemma': {'-PRON-': 'PRON'}})
nlp.add_pipe(conllformatter, after='parser')
doc = nlp('I like cookies.')
print(doc._.pandas)

The only change is that instead of loading the linked en model, I have loaded en_core_web_sm.

Disabling automatic sentence boundary detection

Could you please let me know how to disable automatic sentence boundary detection and parse the following input as a single piece of text? I've tried disable_sbd, but it doesn't seem to work.

import stanza
stanza.download('fr')
from spacy_conll import init_parser

nlp = init_parser("stanza", "fr", parser_opts={"use_gpu": True, "verbose": False}, disable_sbd=True, include_headers=True)

doc = nlp("A l&apos; opposé de ses deux prédécesseurs - Nelson Mandela le béatifié qui a mis l&apos; accent sur la fin du conflit racial et l&apos; aristocrate Thabo Mbeki , dont la maîtrise de la macroéconomie a rassuré les financiers - Zuma reconnaît la demande sous @-@ jacente d&apos; améliorer les conditions de vie matérielles des dix millions d&apos; indigents du pays . &quot; � Les erreurs commises au cours des 15 dernières années nous ont beaucoup appris , notamment la manière dont nous avons , dans une certaine mesure , négligé l &quot; évolution de la population &quot; a @-@ t @-@ il déclaré en avril , avant que son parti n&apos; emporte la victoire .")
conll = doc._.conll_str

print(conll)


# sent_id = 1
# text = A l&apos ; opposé de ses deux prédécesseurs - Nelson Mandela le béatifié qui a mis l&apos ;
1       A       à       ADP             _       2       case    _       _
2       l&apos  l&apos  PROPN           _       0       root    _       _
3       ;       ;       PUNCT           _       4       punct   _       _
4       opposé  opposer VERB    Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part        _       2       acl     _      _
5       de      de      ADP             _       8       case    _       _
6       ses     son     DET     Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs   _       8       det     _       _
7       deux    deux    NUM             _       8       nummod  _       _
8       prédécesseurs   prédécesseur    NOUN    Gender=Masc|Number=Plur _       4       obl:arg _       _
9       -       -       PUNCT           _       10      punct   _       _
10      Nelson  Nelson  PROPN           _       8       appos   _       _
11      Mandela Mandela PROPN           _       10      flat:name       _       _
12      le      le      DET     Definite=Def|Gender=Masc|Number=Sing|PronType=Art       _       13      det     _      _
13      béatifié        béatifier       NOUN    Gender=Masc|Number=Sing _       10      appos   _       _
14      qui     qui     PRON    PronType=Rel    _       16      nsubj   _       _
15      a       avoir   AUX     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   _       16      aux:tense      _                                               _
16      mis     mettre  VERB    Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part        _       13      acl:relcl      _                                               _
17      l&apos  l&apos  NOUN    Gender=Masc|Number=Sing _       16      obj     _       _
18      ;       ;       PUNCT           _       4       punct   _       _

# sent_id = 2
# text = accent sur la fin de le conflit racial et l&apos ;
1       accent  accent  NOUN    Gender=Masc|Number=Sing _       0       root    _       _
2       sur     sur     ADP             _       4       case    _       _
3       la      le      DET     Definite=Def|Gender=Fem|Number=Sing|PronType=Art        _       4       det     _      _
4       fin     fin     NOUN    Gender=Fem|Number=Sing  _       1       nmod    _       _
5       de      de      ADP             _       7       case    _       _
6       le      le      DET     Definite=Def|Gender=Masc|Number=Sing|PronType=Art       _       7       det     _      _
7       conflit conflit NOUN    Gender=Masc|Number=Sing _       4       nmod    _       _
8       racial  racial  ADJ     Gender=Masc|Number=Sing _       7       amod    _       _
9       et      et      CCONJ           _       10      cc      _       _
10      l&apos  l&apos  ADJ     Gender=Masc|Number=Sing _       8       conj    _       _
11      ;       ;       PUNCT           _       1       punct   _       _

# sent_id = 3
# text = aristocrate Thabo Mbeki , dont la maîtrise de la macroéconomie a rassuré les financiers - Zuma reconnaît la demande sous @-@ jacente d&apos ;
1       aristocrate     aristocrate     NOUN    Gender=Masc|Number=Sing _       17      nsubj   _       _
2       Thabo   Thabo   PROPN           _       1       appos   _       _
3       Mbeki   Mbeki   PROPN           _       2       flat:name       _       _
4       ,       ,       PUNCT           _       12      punct   _       _
5       dont    dont    PRON    PronType=Rel    _       7       nmod    _       _
6       la      le      DET     Definite=Def|Gender=Fem|Number=Sing|PronType=Art        _       7       det     _      _
7       maîtrise        maîtrise        NOUN    Gender=Fem|Number=Sing  _       12      nsubj   _       _
8       de      de      ADP             _       10      case    _       _
9       la      le      DET     Definite=Def|Gender=Fem|Number=Sing|PronType=Art        _       10      det     _      _
10      macroéconomie   macroéconomie   NOUN    Gender=Fem|Number=Sing  _       7       nmod    _       _
11      a       avoir   AUX     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   _       12      aux:tense      _                                               _
12      rassuré rassurer        VERB    Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part        _       1       acl:relcl                                              __
13      les     le      DET     Definite=Def|Gender=Masc|Number=Plur|PronType=Art       _       14      det     _      _
14      financiers      financier       NOUN    Gender=Masc|Number=Plur _       12      obj     _       _
15      -       -       PUNCT           _       12      punct   _       _
16      Zuma    Zuma    PROPN           _       17      nsubj   _       _
17      reconnaît       reconnaître     VERB    Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   _       0      root                                            __
18      la      le      DET     Definite=Def|Gender=Fem|Number=Sing|PronType=Art        _       19      det     _      _
19      demande demande NOUN    Gender=Fem|Number=Sing  _       17      obj     _       _
20      sous    sous    ADP             _       21      case    _       _
21      @-@     @-@     NOUN    Gender=Masc|Number=Sing _       19      nmod    _       _
22      jacente jacent  ADJ     Gender=Fem|Number=Sing  _       21      amod    _       _
23      d&apos  d&apos  NOUN    Gender=Fem|Number=Sing  _       21      nmod    _       _
24      ;       ;       PUNCT           _       17      punct   _       _

# sent_id = 4
# text = améliorer les conditions de vie matérielles de les dix millions d&apos ;
1       améliorer       améliorer       VERB    VerbForm=Inf    _       0       root    _       _
2       les     le      DET     Definite=Def|Gender=Fem|Number=Plur|PronType=Art        _       3       det     _      _
3       conditions      condition       NOUN    Gender=Fem|Number=Plur  _       1       obj     _       _
4       de      de      ADP             _       5       case    _       _
5       vie     vie     NOUN    Gender=Fem|Number=Sing  _       3       nmod    _       _
6       matérielles     matériel        ADJ     Gender=Fem|Number=Plur  _       5       amod    _       _
7       de      de      ADP             _       10      case    _       _
8       les     le      DET     Definite=Def|Gender=Masc|Number=Plur|PronType=Art       _       10      det     _      _
9       dix     dix     NUM             _       10      nummod  _       _
10      millions        million NOUN    Gender=Masc|Number=Plur _       3       nmod    _       _
11      d&apos  d&apos  ADJ     Gender=Masc|Number=Plur _       10      amod    _       _
12      ;       ;       PUNCT           _       1       punct   _       _

# sent_id = 5
# text = indigents de le pays . &quot ;
1       indigents       indigent        ADJ     Gender=Masc|Number=Plur _       0       root    _       _
2       de      de      ADP             _       4       case    _       _
3       le      le      DET     Definite=Def|Gender=Masc|Number=Sing|PronType=Art       _       4       det     _      _
4       pays    pays    NOUN    Gender=Masc|Number=Sing _       1       obl:arg _       _
5       .       .       PUNCT           _       6       punct   _       _
6       &quot   &quot   X               _       1       conj    _       _
7       ;       ;       PUNCT           _       1       punct   _       _

# sent_id = 6
# text = � Les erreurs commises à le cours de les 15 dernières années nous ont beaucoup appris , notamment la manière dont nous avons , dans une certaine mesure , négligé l &quot;
1       �       �       INTJ            _       16      discourse       _       _
2       Les     le      DET     Definite=Def|Gender=Fem|Number=Plur|PronType=Art        _       3       det     _      _
3       erreurs erreur  NOUN    Gender=Fem|Number=Plur  _       16      obl:mod _       _
4       commises        commettre       VERB    Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part _       3       acl    _                                               _
5       à       à       ADP             _       7       case    _       _
6       le      le      DET     Definite=Def|Gender=Masc|Number=Sing|PronType=Art       _       7       det     _      _
7       cours   cours   NOUN    Gender=Masc|Number=Sing _       4       obl:mod _       _
8       de      de      ADP             _       12      case    _       _
9       les     le      DET     Definite=Def|Gender=Fem|Number=Plur|PronType=Art        _       12      det     _      _
10      15      15      NUM             _       12      nummod  _       _
11      dernières       dernier ADJ     Gender=Fem|Number=Plur  _       12      amod    _       _
12      années  année   NOUN    Gender=Fem|Number=Plur  _       7       nmod    _       _
13      nous    il      PRON    Number=Plur|Person=1|PronType=Prs       _       16      nsubj   _       _
14      ont     avoir   AUX     Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   _       16      aux:tense      _                                               _
15      beaucoup        beaucoup        ADV             _       16      advmod  _       _
16      appris  apprendre       VERB    Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part        _       0       root   _                                               _
17      ,       ,       PUNCT           _       20      punct   _       _
18      notamment       notamment       ADV             _       20      advmod  _       _
19      la      le      DET     Definite=Def|Gender=Fem|Number=Sing|PronType=Art        _       20      det     _      _
20      manière manière NOUN    Gender=Fem|Number=Sing  _       16      obj     _       _
21      dont    dont    PRON    PronType=Rel    _       30      iobj    _       _
22      nous    il      PRON    Number=Plur|Person=1|PronType=Prs       _       30      nsubj   _       _
23      avons   avoir   AUX     Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin   _       30      aux:tense      _                                               _
24      ,       ,       PUNCT           _       28      punct   _       _
25      dans    dans    ADP             _       28      case    _       _
26      une     un      DET     Definite=Ind|Gender=Fem|Number=Sing|PronType=Art        _       28      det     _      _
27      certaine        certain ADJ     Gender=Fem|Number=Sing  _       28      amod    _       _
28      mesure  mesure  NOUN    Gender=Fem|Number=Sing  _       30      obl:mod _       _
29      ,       ,       PUNCT           _       28      punct   _       _
30      négligé négliger        VERB    Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part        _       20      acl:relcl                                              __
31      l       le      DET     Definite=Def|Gender=Masc|Number=Sing|PronType=Art       _       32      det     _      _
32      &quot;  &quot;  NOUN    Gender=Masc|Number=Sing _       30      obj     _       _

# sent_id = 7
# text = évolution de la population &quot ;
1       évolution       évolution       NOUN    Gender=Fem|Number=Sing  _       0       root    _       _
2       de      de      ADP             _       4       case    _       _
3       la      le      DET     Definite=Def|Gender=Fem|Number=Sing|PronType=Art        _       4       det     _      _
4       population      population      NOUN    Gender=Fem|Number=Sing  _       1       nmod    _       _
5       &quot   &quot   NOUN    Gender=Fem|Number=Sing  _       4       appos   _       _
6       ;       ;       PUNCT           _       1       punct   _       _

# sent_id = 8
# text = a @-@ t @-@ il déclaré en avril , avant que son parti n&apos ;
1       a       à       ADP             _       3       case    _       _
2       @-@     @-@     X               _       6       obl:mod _       _
3       t       t       NOUN    Gender=Masc|Number=Sing _       6       obl:mod _       _
4       @-@     @-@     X               _       3       appos   _       _
5       il      il      PRON    Gender=Masc|Number=Sing|Person=3|PronType=Prs   _       6       nsubj   _       _
6       déclaré déclarer        VERB    Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part        _       0       root   _                                               _
7       en      en      ADP             _       8       case    _       _
8       avril   avril   NOUN    Gender=Masc|Number=Sing _       6       obl     _       _
9       ,       ,       PUNCT           _       14      punct   _       _
10      avant   avant   ADP             _       14      mark    _       _
11      que     que     SCONJ           _       14      mark    _       _
12      son     son     DET     Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs   _       13      det     _       _
13      parti   parti   NOUN    Gender=Masc|Number=Sing _       14      nsubj   _       _
14      n&apos  ncaposs VERB    Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   _       6       advcl   _      _
15      ;       ;       PUNCT           _       6       punct   _       _

# sent_id = 9
# text = emporte la victoire .
1       emporte emporter        VERB    Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   _       0       root   _                                               _
2       la      le      DET     Definite=Def|Gender=Fem|Number=Sing|PronType=Art        _       3       det     _      _
3       victoire        victoire        NOUN    Gender=Fem|Number=Sing  _       1       obj     _       _
4       .       .       PUNCT           _       1       punct   _       _

thanks,
Ranjita

Sentence boundaries unset

Hello,

I have trained a transformer model with only a morphologizer and tagger (no sentencizer included) and I wanted to tag
my data, but I get the following error:

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe('sentencizer'). Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting doc[i].is_sent_start.

Any idea how I can bypass this?

Thank you!!

Invalid CoNLL string when model output doesn't provide lemma

When a model doesn't successfully lemmatize a token and the lemma attribute of the spaCy token is unassigned, spacy_conll leaves that column blank in the CoNLL string. As the spacy_conll error message says when it attempts to convert the CoNLL string back into a spaCy document, it is invalid to have an empty column in a CoNLL document. To avoid generating an invalid CoNLL string, I think we want to add a condition to the formatter to make sure the lemma attribute, and therefore the column's value, is never None.

For example:

        token_conll = (
            token_idx,
            token.text,
            token.lemma_ if token.lemma_ else "_",
            token.pos_,
            token.tag_,
            str(token.morph) if token.has_morph and str(token.morph) else "_",
            head_idx,
            token.dep_,
            token._.conll_deps_graphs_field,
            token._.conll_misc_field,
        )

The model I've used that failed to lemmatize a token and left it blank was spacy-stanza, English. And here's a text that I've encountered, which had a token that stanza didn't lemmatize: "VillaJakeF1 There are already a number of tools that can detect it. I’ve been using chatGPT a bit recently to get coding snippets, and I have to say a lot of it is either incomplete or incorrect. I wouldn’t want to rely on it for something as important as a thesis. But it is early days still"

I ran into this problem when, having produced a SpaCy document with a spacy-stanza pipeline that had the CoNLL formatter, I gave the doc._.conll_str to the CoNLL parser to turn the string back into a SpaCy document. But because the formatter had rendered invalid CoNLL, due to the missing lemma, the CoNLL parser failed.

how to specify particular language models in init_parser()?

Can you please let me know how to specify particular language models as a parameter of init_parser() when using spacy-stanza or spacy-udpipe? There are multiple models in many languages and not all the models follow the same tagging scheme, so sometimes we may need to use specific models for certain purposes.

How to directly convert spaCy DocBin object to coNLL format?

Hi,

I am struggling to convert an already NER-tagged spaCy DocBin object to CoNLL format using the parser. It contains the sentences and the related entity spans.

Does this framework support this functionality? I am struggling to find the correct way.

Kindly help!
