codait / text-extensions-for-pandas Goto Github PK

Natural language processing support for Pandas dataframes.

License: Apache License 2.0

Jupyter Notebook 94.67% Shell 0.05% Python 5.04% CSS 0.05% JavaScript 0.19%

text-extensions-for-pandas's Introduction

Text Extensions for Pandas

Natural language processing support for Pandas dataframes.

Text Extensions for Pandas turns Pandas DataFrames into a universal data structure for representing intermediate data in all phases of your NLP application development workflow.

Web site: https://ibm.biz/text-extensions-for-pandas

API docs: https://text-extensions-for-pandas.readthedocs.io/

Features

SpanArray: A Pandas extension type for spans of text

Connect features with regions of a document
Visualize the internal data of your NLP application
Analyze the accuracy of your models
Combine the results of multiple models

TensorArray: A Pandas extension type for tensors

Represent BERT embeddings in a Pandas series
Store logits and other feature vectors in a Pandas series
Store an entire time series in each cell of a Pandas series

Pandas front-ends for popular NLP toolkits

CoNLL-2020 Paper

Looking for the model training code from our CoNLL-2020 paper, "Identifying Incorrect Labels in the CoNLL-2003 Corpus"? See the notebooks in this directory.

The associated data set is here.

Installation

This library requires Python 3.7+, Pandas, and Numpy.

To install the latest release, just run:

pip install text-extensions-for-pandas

Depending on your use case, you may also need the following additional packages:

spacy (for SpaCy support)
transformers (for transformer-based embeddings and BERT tokenization)
ibm_watson (for IBM Watson support)

Alternatively, packages are available to be installed from conda-forge for use in a conda environment with:

conda install --channel=conda-forge text_extensions_for_pandas

Installation from Source

If you'd like to try out the very latest version of our code, you can install directly from the head of the master branch:

pip install git+https://github.com/CODAIT/text-extensions-for-pandas

You can also directly import our package from your local copy of the text_extensions_for_pandas source tree. Just add the root of your local copy of this repository to the front of sys.path.

Documentation

For examples of how to use the library, take a look at the example notebooks in this directory. You can try out these notebooks on Binder by navigating to https://mybinder.org/v2/gh/frreiss/tep-fred/branch-binder?urlpath=lab/tree/notebooks

To run the notebooks on your local machine, follow the following steps:

Install Anaconda or Miniconda.
Check out a copy of this repository.
Use the script env.sh to set up an Anaconda environment for running the code in this repository.
Type jupyter lab from the root of your local source tree to start a JupyterLab environment.
Navigate to the notebooks directory and choose any of the notebooks there

API documentation can be found at https://text-extensions-for-pandas.readthedocs.io/en/latest/

Contents of this repository

text_extensions_for_pandas: Source code for the text_extensions_for_pandas module.
env.sh: Script to create a conda environment pd capable of running the notebooks and test cases in this project
generate_docs.sh: Script to build the API documentation
api_docs: Configuration files for generate_docs.sh
binder: Configuration files for running notebooks on Binder
config: Configuration files for env.sh.
docs: Project web site
notebooks: example notebooks
resources: various input files used by our example notebooks
test_data: data files for regression tests. The tests themselves are located adjacent to the library code files.
tutorials: Detailed tutorials on using Text Extensions for Pandas to cover complex end-to-end NLP use cases (work in progress).

Contributing

This project is an IBM open source project. We are developing the code in the open under the Apache License, and we welcome contributions from both inside and outside IBM.

To contribute, just open a Github issue or submit a pull request. Be sure to include a copy of the Developer's Certificate of Origin 1.1 along with your pull request.

Building and Running Tests

Before building the code in this repository, we recommend that you use the provided script env.sh to set up a consistent build environment:

$ ./env.sh --env_name myenv
$ conda activate myenv

(replace myenv with your choice of environment name).

To run tests, navigate to the root of your local copy and run:

pytest text_extensions_for_pandas

To build pip and source code packages:

python setup.py sdist bdist_wheel

(outputs go into ./dist).

To build API documentation, run:

./generate_docs.sh

text-extensions-for-pandas's People

Contributors

Stargazers

Watchers

text-extensions-for-pandas's Issues

Create BaseSpanArray class as common subclass to CharSpanArray and TokenSpanArray

Currently, TokenSpanArray is derived from CharSpanArray, but is not a proper instance of a CharSpanArray and is missing certain attributes. For example the test for TokenSpanArray in TestPandasMethods.test_equals fails because it uses the CharSpanArray.equals method and does not have attributes like _equivalent_arrays.

It would make the class structure clearer if common functionality was moved to an abstract BaseSpanArray class and specific functionality implemented in the concrete classes.

Implement `_from_sequence_of_strings` for CharSpan and TokenSpan Arrays

This enables reading from CSV, I believe. Not sure if that is something we want to support.

Test class currently skipped: TestPandasParsing

Extend Watson NLU entities translation to include individual entity spans

By default, the entities tool in Watson Natural Language Understanding only outputs aggregate information about entity mentions in the document. Our Pandas translation currently surfaces this information as a dataframe with one row per aggregate entity:

The entities tool has an option called mentions that is off by default, but if you turn this option on...

entities=EntitiesOptions(mentions=True)

...the returned JSON value will have detailed information about specific entity mentions, including span information:

{
      "type": "Person",
      "text": "Lancelot",
      "sentiment": {
        "score": 0.835873,
        "label": "positive"
      },
      "relevance": 0.678523,
      "mentions": [
        {
          "text": "Lancelot",
          "location": [
            1393,
            1401
          ],
          "confidence": 0.614782
        },
        {
          "text": "Lancelot",
          "location": [
            2884,
            2892
          ],
          "confidence": 0.99065
        },
        {
          "text": "Lancelot",
          "location": [
            2894,
            2902
          ],
          "confidence": 0.984138
        },
        {
          "text": "Lancelot",
          "location": [
            4253,
            4261
          ],
          "confidence": 0.989195
        },
        {
          "text": "Lancelot",
          "location": [
            4759,
            4767
          ],
          "confidence": 0.977934
        }
      ],
      "count": 5,
      "confidence": 1
    },

We should extend our watson_nlu_parse_response() function so that it detects the presence of this detailed information and produces a dataframe with one row per mention (as opposed to one row per aggregate) when the detailed information is available.

BUG: Reading large TensorArrays from Feather file fails with buffer size error

Steps to reproduce:

Uncomment the line:
```
#corpus_df = pd.read_feather("outputs/corpus.feather")
```
in cell 19 of notebooks/CoNLL_3.ipynb.
Run the notebook.

Expected result: The notebook should read back the feather file it just wrote.

Actual result: Cell 19 fails with the following stack trace:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-cc540a64447d> in <module>
----> 1 corpus_df = pd.read_feather("outputs/corpus.feather")
      2 corpus_df

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/io/feather_format.py in read_feather(path, columns, use_threads)
    101     path = stringify_path(path)
    102 
--> 103     return feather.read_feather(path, columns=columns, use_threads=bool(use_threads))

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
    207     _check_pandas_version()
    208     return (read_table(source, columns=columns, memory_map=memory_map)
--> 209             .to_pandas(use_threads=use_threads))
    210 
    211 

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    764     _check_data_column_metadata_consistency(all_columns)
    765     columns = _deserialize_column_index(table, all_columns, column_indexes)
--> 766     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
    767 
    768     axes = [columns, index]

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories, extension_columns)
   1102                                     list(extension_columns.keys()))
   1103     return [_reconstruct_block(item, columns, extension_columns)
-> 1104             for item in result]
   1105 
   1106 

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
   1102                                     list(extension_columns.keys()))
   1103     return [_reconstruct_block(item, columns, extension_columns)
-> 1104             for item in result]
   1105 
   1106 

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _reconstruct_block(item, columns, extension_columns)
    723             raise ValueError("This column does not support to be converted "
    724                              "to a pandas ExtensionArray")
--> 725         pd_ext_arr = pandas_dtype.__from_arrow__(arr)
    726         block = _int.make_block(pd_ext_arr, placement=placement,
    727                                 klass=_int.ExtensionBlock)

~/pd/tep-conll-3/text_extensions_for_pandas/array/tensor.py in __from_arrow__(self, extension_array)
     73     def __from_arrow__(self, extension_array):
     74         from text_extensions_for_pandas.array.arrow_conversion import ArrowTensorArray
---> 75         values = ArrowTensorArray.to_numpy(extension_array)
     76         return TensorArray(values)
     77 

~/pd/tep-conll-3/text_extensions_for_pandas/array/arrow_conversion.py in to_numpy(pa_ext_array)
    414                 # TODO: look into removing concat and constructing from list w/ shape
    415                 result = np.concatenate([make_numpy_array(chunk)
--> 416                                          for chunk in pa_ext_array.iterchunks()])
    417             else:
    418                 result = make_numpy_array(pa_ext_array.chunk(0))

~/pd/tep-conll-3/text_extensions_for_pandas/array/arrow_conversion.py in <listcomp>(.0)
    414                 # TODO: look into removing concat and constructing from list w/ shape
    415                 result = np.concatenate([make_numpy_array(chunk)
--> 416                                          for chunk in pa_ext_array.iterchunks()])
    417             else:
    418                 result = make_numpy_array(pa_ext_array.chunk(0))

~/pd/tep-conll-3/text_extensions_for_pandas/array/arrow_conversion.py in make_numpy_array(ext_arr)
    408             ext_dtype = ext_list_type.value_type.to_pandas_dtype()
    409             buf = ext_arr.storage.buffers()[3]
--> 410             return np.ndarray(ext_type.shape, buffer=buf, dtype=ext_dtype)
    411 
    412         if isinstance(pa_ext_array, pa.ChunkedArray):

TypeError: buffer is too small for requested array

Add script to generate API docs using sphinx

Follow the instructions at https://docs.readthedocs.io/en/stable/intro/getting-started-with-sphinx.html to set up API documentation generation for the text_extensions_for_pandas project.

If CI is in place (see #33), augment the integration tests so that they also generate the API documentation and check for errors.

Enable support for Pandas `where` method for CharSpan and TokenSpan Arrays

Test currently skipped at: TestPandasMethods.test_where_series

Add support for CharSpanArray and TokenSpanArray from a sequence of only NaNs

Test currently skipped in TestPandasConstructors.test_series_constructor_no_data_with_index

General freshening of documentation for the upcoming release

Main things to do:

Rewrite the README.md
Spiff up the Watson NLU example notebook
Spiff up the table extraction example notebook
Write a blog post to go with the release
Write a front page for the API documentation (will also serve as a web site for the project for now)

Record video intro/demo to Text Extensions for Pandas

Create a 5-10 minute intro/demo video for the project.

Potential flow: Start with some slides that summarize the current contents of README.md (once it's been reworked as part of #112). Then segue into a live demo based off of one of the notebooks in the notebooks directory.

Add a link to the video to our README.md file once it's done.

Add a method to CharSpan/TokenSpan that only compares offsets

The equality operation __equals__ on CharSpan objects and all the data structures that derive from them (TokenSpan, CharSpanArray, TokenSpanArray) uses the following semantics: Two spans are equal if their begin character offsets are equal, their end character offsets are equal, and their target texts are equal.

This definition works pretty well overall, but it can cause problems in some cases:

The user's program has two versions of the same text that differ very slightly -- say, in the addition of a trailing newline -- and wants to compare spans against these slightly different but functionally equivalent target texts.
The user's program creates many identical copies of the target text. The __equals__ method ends up comparing these identical strings repeatedly, resulting in a slowdown.

We should add a second version of the __equals__ method with a different name, say same_begin_and_end, that only compares the begin and end character offsets. This second method would make it convenient for users in any of the above situations to still do a relaxed equality comparison.

Factor out Gremlin code

Move all the code under text_extensions_for_pandas/gremlin to a new project.
Remove all references to Gremlin from the text-extensions-for-pandas project's README.md file.
Add enough regression testing to the new project that we can tell whether the Gremlin engine is completely broken by an upstream change from Text Extensions for Pandas.
Move the current version of the notebook Watson_NLU_Demo.ipynb to the new project to serve as a demo for Gremlin-related functionality.
Replace the portions of Watson_NLU_Demo.ipynb that depend on our Gremlin engine with code that doesn't depend on that engine. My recommendation is to pull out the region of each sentence between two sides of a relationship that Watson NLU found, then display the portion of the SpaCy parse tree that covers that part of the sentence.
Write a README.md file for the new project, explaining that this project holds an experimental embedded Gremlin query processing engine built on top of Pandas and Text Extensions for Pandas, and that the aim of this engine is to support declarative processing of NLP-related graphs such as parse trees and relationship graphs.

Rename `iob_to_spans` to `iob2_to_spans`

Since iob_to_spans takes IOB2 data as input, its name is confusing. The function should be renamed, and all the code that consumes it across our notebooks, tests, and downstream projects, should be updated.

BUG: Indexing with an array of False values ==> crash

Code to reproduce:

import text_extensions_for_pandas as tp
import pandas as pd
df = pd.DataFrame({
    "col1": tp.TensorArray([[1, 2], [3, 4]]) 
})
df[[False, False]]

Expected result: Empty dataframe.

Actual result:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-763acf5b214d> in <module>
      2     "col1": tp.TensorArray([[1, 2], [3, 4]])
      3 })
----> 4 df[[False, False]]

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2789         # Do we have a (boolean) 1d indexer?
   2790         if com.is_bool_indexer(key):
-> 2791             return self._getitem_bool_array(key)
   2792 
   2793         # We are left with two options: a single key, and a collection of keys,

[... 7 levels of Pandas internals stack trace ...]

~/pd/fred-tep/text_extensions_for_pandas/array/tensor.py in take(self, indices, allow_fill, fill_value)
    169                     # of each row.
    170                     values[i] = fill_value
--> 171         return TensorArray(values)
    172 
    173     @property

~/pd/fred-tep/text_extensions_for_pandas/array/tensor.py in __init__(self, values, make_contiguous)
    111         """
    112         if isinstance(values, Iterable):
--> 113             self._tensor = np.stack(values, axis=0)
    114         elif isinstance(values, np.ndarray):
    115             self._tensor = values

<__array_function__ internals> in stack(*args, **kwargs)

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/numpy/core/shape_base.py in stack(arrays, axis, out)
    420     arrays = [asanyarray(arr) for arr in arrays]
    421     if not arrays:
--> 422         raise ValueError('need at least one array to stack')
    423 
    424     shapes = {arr.shape for arr in arrays}

ValueError: need at least one array to stack

Rework README.md

The README.md file for this project is out of date. We need to make several changes:

Tighten up the introduction to the project
Put some example use cases up front
Include links to the notebooks with explanatory text for each ("Here's an in-depth introduction to how to use Text Extensions for Pandas for ...")
Make the linked notebooks something that will display directly in a web browser; either host the static HTML on Github pages or link to a static Watson Studio notebook

Create tokenizer with offsets support for BERT (or similar) embeddings

To implement part 3 of the CoNLL demo, we need to convert each document in the CoNLL-2003 corpus into a dataframe with 3 columns:

<span of token> | <label of token> | <embedding of token>

Once we have such a dataframe, we can quickly train up multiple models on the prebuilt embeddings and use those models to zero in on incorrectly-labeled entities in the data set.

To create this dataframe, we need to retokenize the corpus using the tokenizer that the BERT embeddings are designed for. That tokenizer's code can be found here.

Unfortunately, the BERT tokenizer does not output the character (or code point) offsets of the tokens that it produces. To use this tokenizer, we need to modify the implementation in tokenization.py so that it produces begin and end offsets for every token.I recommend that we contribute this functionality back to the official BERT models as a pull request, so that we don't need to worry about our changes getting stomped on in the future.

Potential alternate approach: Look through the changes and comments of https://github.com/huggingface/transformers/pull/2674 to figure out how to enable huggingface's implementation of tokenization for BERT embeddings that captures character offsets, assuming such an implementation exists.

ENH: Implement converting large TensorArrays from Arrow

The current Arrow conversion code can convert large TensorArray columns to Arrow. Pandas takes care of dividing large values into multiple blocks. However, the conversion code cannot read back large TensorArray values that the user has written out this way, because input from multi-block ArrowRecordBatches is not currently implemented. See line 256 of arrow_conversion.py:

        if extension_array.num_chunks > 1:
            raise ValueError("Only pyarrow.Array with a single chunk is supported")

This limitation prevents us from using binary I/O to save precomputed embeddings in part 3 of the CoNLL demo (see #5). Regenerating BERT embeddings for the entire corpus takes about an hour.

Flesh out regression tests of `spanner` module

We are currently missing regression tests for the files: spanner/extract.py and spanner/project.py. Add basic tests for these two files. It is not necessary at this point to test everything.

Enable automatic running of regression tests on new pull requests

Create a .travis.yml file that configures CI for the repository.

Set up this file to trigger the following actions on the Travis build server for each pull request:

Create a suitable Python environment for running our regression tests (see env.sh)
Activate the Python environment
Run our regression tests (python -m unittest discover from the root of the repository)

Make a pull request to the put the file in place.

Make a second test pull request that deliberately breaks a regression test to ensure that the CI is working after the first PR is merged.

Support le and ge operators for CharSpan, TokenSpan

If we don't want to support these, remove these from all_compare_operators test fixture

Investigate test failure TestPandasMethods.test_value_counts_with_normalize

Something doesn't seem quite right with this test

Bring span join APIs closer to DataFrame.merge()

The Text Extensions span join operations (AdjacentJoin, OverlapJoin, and ContainJoin) ought to have an API closer to that of DataFrame.merge(). This change would lower the learning curve for Pandas users, as well as making use cases like outer joins between sets of results easier to implement.

Major TODOs to make this change happen:

Create a new entry point merge_spans to replace AdjacentJoin, OverlapJoin, and ContainJoin. This function will take two DataFrames instead of two Series as arguments. In addition to the join columns, other columns of the input dataframes should be copied to the output. Main arguments to merge_spans:
- op: {"adjacent", "overlap", "contain"}, default "overlap"
- on/left_on/right_on
- sort
- suffixes
Add outer join support to merge_spans. Outer join support will require some additional arguments (semantics same as the corresponding arguments of pd.merge():
- how: {"left", "right", "outer", "inner"}, default "inner"
- indicator
(stretch) Allow merge_spans to also take an optional list of columns to perform an equijoin on at the same time that it performs the span join.
(stretch) Allow merge_spans to take a pd.Series, TokenSpanArray, or CharSpanArray for either argument.
Deprecate the old operations.

Add more complete testing to the spanner extract test suite

Currently only a simple test case exists. As per comments at #83 (comment), more tests need to be added to exercise the function completely.

Fred's comments on text_extract_dict:

I'd recommend that you remove the last three lines of the current file and replace "file_text" below with a string that exercises the major cases of dictionary extraction:

Match at beginning of string, end of string, or in the middle of the string
One-token match and multi-token match
Non-match that shares the first token (and only the first token) with a two-token dictionary entry
Two overlapping matches

You'll also want to exercise case-insensitivity of the dictionary matching.

I think the location of this file is an anachronism. Would you mind moving it to test_data/spanner?

comments on test_extract_regex_tok:

As with the dictionary test, it would be useful to have a target string that contains the main types of regex match -- matches at the beginning, middle, or end of the string; partial matches; substrings that would be matches except they don't start or end on a token boundary.

Record 30-60 minute video introduction to NLP using Text Extensions for Pandas

Create a 30-60 minute video aimed at practitioners who are familiar with general Python data science tools but are not familiar with natural language processing.

The video should use Text Extensions for Pandas, plus our integrations with Watson NLU, SpaCy, and Huggingface's transformers library.

Potential elevator pitch for this video:

The past few years have seen many exciting new developments in Natural Language Processing, or NLP technology. In this video, we will teach you about the very latest technology, but with an emphasis on the timeless aspects of practical NLP --- the things that haven't changed in the last five, ten, or even fifteen years. We'll show you how you can use the data science tools you already know, plus a little bit of the latest state-of-the-art AI, to build effective production-ready NLP applications.

Review Analyze_Text.ipynb and update as needed

This issue covers making another review pass over the notebook notebooks/Analyze_Text.ipynb to cover the following:

Identify areas where the Markdown text and code isn't clear to a newcomer to the project
Make sure the notebook works in hosted Jupyter environments (Watson Studio, Colab, and similar)

You can create pull requests to address any issues with clarity of the code and text and/or making the notebook work in hosted Jupyter. Be sure to reference this issue in your PRs.

Decide support for 'all', 'any' aggregations for CharSpan and TokenSpan Arrays

Test class currently skipped: TestPandasBooleanReduce

Version 0.1 Roadmap

This issue holds the main checklist of TODO items for the first release of this project.

When an item is completed, please check off the item and add a link to the relevant pull request or commit.

Build/CI

Set up Travis CI for Github repo (#36)
Auto-run regression tests for pull requests (#36)
Scripts and configs to create pip package (2651e37)
Instructions and hooks for running tests with unittest (0c4951c)
Scripts to generate API documentation from docstrings (#39)
Web site for API documentation
Update version and publish pip package to PyPI

Features

Create dataframes from:
- SpaCy tokens and dependency parse (6084523)
- SpaCy entities (27e7fd6)
- CoNLL-X/CoNLL-U format
- CoNLL-2003 format (40b3527 and 2c4b8c5)
- Watson Natural Language Understanding JSON output (9f7a7c8)
Arrow and Feather Serialization/Deserialization
- CharSpanArray/TokenSpanArray (#14)
- TensorArray (#9)
- Reconstitute tokens after deserializing TokenSpanArrays (#14)
Implement all required hooks for ExtensionArray types
- CharSpanArray (#71)
- TokenSpanArray (#81)
- TensorArray (#69, #99, #106)
Implement key elements of spanner algebra
- text_extensions_for_pandas/spanner sub-module (4131b81)
- AdjacentJoin (4131b81)
- OverlapJoin (f50cac5)
- ContainJoin (7a3e73e)
- Consolidate (644575c and #26)

Regression Tests

Documentation

Replace existing Person use case notebook with a demonstration use case for syntactic analysis
Create getting started guide

Refactoring

Move spanner algebra code to multiple files under text_extensions_for_pandas/spanner (4131b81)
Move ExtensionArray types to text_extensions_for_pandas/array (fd9042a)
Split io.py into multiple files (afd64df)
Move the experimental Gremlin engine into a separate project (#60, #65)

Add caching to remaining hash functions of span types

The hash function for TensorArray caches the hash values for performance, with other accessors invalidating the cached values as needed. The remaining extension types, including the types for scalar values, should be updated to perform similar caching. In the case of CharSpan, it would probably be simpler to only cache the hash value of the target string.

Refactor other binary operations over span types to match add.

PR #98 refactored the binary add operation of all the span types (Span, TokenSpan, SpanArray, TokenSpanArray) into a single function add_spans() (located in text_extensions_for_pandas/array/span_util.py) that handles all combinations of the four types in either argument.

Other binary operations should be refactored into the same structure. Here's a list:

Note: The above list may not be complete. Please edit this description to add any missing ones.

Create demo of Text Extensions for Pandas + PySpark

Create a demonstration of using Text Extensions for Pandas and PySpark together to cover an NLP-related use case.

Some ideas for possible use cases:

Compute precision and recall for NLP model outputs over a large corpus
Generate BERT embeddings for an entire corpus in parallel and save the embeddings into a Spark DataFrame
Load BERT embeddings into a Spark DataFrame, then use Spark to train train multiple classification models in parallel over the embeddings
Demonstrate serialization of Pandas extension types between Parquet files, Spark DataFrames, and Pandas DataFrames

Get rid of -DOCSTART- tokens in CoNLL output

conll_2003_to_dataframes() currently passes through the special -DOCSTART- token when importing the CoNLL file format. It would be better if the import code dropped this special token and the sentence boundary that follows it and did not include either of them in the reconstructed document.

Major subtasks

Modify conll_2003_to_dataframes() so that it drops the -DOCSTART- token and the blank line after it when importing a data set in CoNLL-2003 format.
Modify conll_2003_output_to_dataframes() so that it also drops the first two lines of each document when importing model outputs
Update examples and tutorials to reflect this change. Where needed, subtract 11 from the offsets of any spans we computed with the previous version of conll_2003_to_dataframes()

Extend Watson NLU notebook to also use SpaCy

The demo notebook notebooks/WatsonNLP_Demo.ipynb in #25 puts a document through Watson Natural Language Understanding and converts the results from Watson NLU into a collection of dataframes.

Currently the last step of this notebook involves taking the outputs of relationship extraction and connecting those outputs to sentence information from another part of the Watson NLU output.

We should extend this notebook with some additional steps to illustrate both our integration with SpaCy and how Text Extensions for Pandas makes it easier to use multiple NLP frameworks in the same application. Here's a rough outline of the proposed additional steps:

Pass the document text (should be available either in a local variable or as the target text of a CharSpanArray object) to SpaCy's English language model, and convert the resulting graph of Python objects to a dataframe (See example code in Person.ipynb).
Match the entities involved in relationships that Watson NLU returns with the corresponding SpaCy tokens from the SpaCy language model output.
Now we have a connection from entities to sets of 1 or more tokens. Trace the parse tree links in the SpaCy output (encoded in the head column of the dataframe) to identify the least common ancestor of both sides of the relationship. Then pull out the notes of the parse tree that are below this common ancestor. The output of this operation should of course be a dataframe (containing all the nodes in the subtree of the parse tree). The tracing itself should if possible be implemented with a Gremlin query or two.
Use our integration with DisplaCy to show the relevant subparts of the parse tree. The entry point to use is tp.render_parse_tree(). See the example usage at the bottom of Person.ipynb for more info.

Input adapters for CoNLL-X/CoNLL-U format

Add functionality to our io package to handle the Unversal Dependencies CoNLL-X/CoNLL-U format.

Put this new code alongside our existing code for handling the original CoNLL format.

BUG: Document text with two dollar signs doesn't render properly in Jupyter

Steps to reproduce:

Open CoNLL_View_Doc.ipynb in this project, which views individual documents from the CoNLL-2003 corpus.
Point the viewing code at document 35 of the dev fold of the corpus

You'll see something like this:

Note the italic text without spaces. That text is supposed to read "... for $31 a share, or $135 million ...". Instead, JupyterLab is interpreting the dollar signs as the boundaries of a region of embedded LaTeX code.

The _repr_html_() method of CharSpanArray and TokenSpanArray should return a string that renders properly, even when the document contains dollar signs. The current code does escape dollar signs (the above snippet turns into for $31 a share, or $135 million), but some component of JupyterLab drills through the escaping and treats the $'s and everything between them as LaTeX.

A simpler way to reproduce the core problem is to run the following code in a JupyterLab notebook:

class ReproduceBug: 
    def _repr_html_(self):
        return "for &#36;31 a share, or &#36;135 million"

ReproduceBug()

Review Analyze_Model_Outputs.ipynb and update as needed

This issue covers making another review pass over the notebook notebooks/Analyze_Model_Outputs.ipynb to cover the following:

Identify areas where the Markdown text and code isn't clear to a newcomer to the project
Make sure the notebook works in hosted Jupyter environments (Watson Studio, Colab, and similar)

You can create pull requests to address any issues with clarity of the code and text and/or making the notebook work in hosted Jupyter. Be sure to reference this issue in your PRs.

Fix TokenSpanArray.hash

Currently falls back to CharSpanArray.hash, which fails.

Test failing: TestPandasMethods.test_not_hashable

Expose the contents of `io` as multiple sub-packages

Currently, all the public functions under text_extensions_for_pandas.io are exposed in the top-level package of the project. This portion of the namespace is getting crowded, and some of the names are becoming confusing. For example, it's not clear from the name of the make_tokens_and_features() function that the function is part of the SpaCy integration.

We should expose the functions under io as a hierarchy of functions, not a flat collection. I recommend using the current location within the source tree to define the hierarchy. For example, tp.make_tokens_and_features() would become tp.io.spacy.make_tokens_and_features().

FYI @BryanCutler

Stabilize tests that rely on SpaCy models

Some of our regression tests keep failing because the SpaCy models that they depend on keep changing.

We need to adjust our build/test dependencies so that the SpaCy model versions that we run regression tests against do not change from day to day.

Example of failing tests:

$ python -m unittest discover
Downloading: 100%|███████████████████████████| 232k/232k [00:00<00:00, 5.72MB/s]
.......
======================================================================
FAIL: test_iob_to_spans (text_extensions_for_pandas.io.test_conll.CoNLLTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/CODAIT/text-extensions-for-pandas/text_extensions_for_pandas/io/test_conll.py", line 57, in test_iob_to_spans
    4  [100, 113): 'Steven Wright'   PERSON"""
AssertionError: "    [45 chars]    [61, 67): 'Alaska'      GPE\n1      [73, 8[63 chars]RSON" != "    [45 chars]    [4, 11): 'Bermuda'      ORG\n1         [12[145 chars]RSON"
                      token_span ent_type
+ 0           [4, 11): 'Bermuda'      ORG
+ 1         [12, 20): 'Triangle'  PRODUCT
- 0           [61, 67): 'Alaska'      GPE
? ^
+ 2           [61, 67): 'Alaska'      GPE
? ^
- 1      [73, 84): 'Santa Claus'   PERSON
? ^
+ 3      [73, 84): 'Santa Claus'   PERSON
? ^
- 2  [100, 113): 'Steven Wright'   PERSON? ^
+ 4  [100, 113): 'Steven Wright'   PERSON? ^
======================================================================
FAIL: test_make_tokens_and_features (text_extensions_for_pandas.io.test_spacy.IOTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/CODAIT/text-extensions-for-pandas/text_extensions_for_pandas/io/test_spacy.py", line 71, in test_make_tokens_and_features
    (8, 8, [34, 35): '.', [34, 35): '.', '.', 'PUNCT', '.', 'punct', 1, '.', 'O', '', False, False, [0, 35): 'She sold c shills by the Sith Lord.')]"""
AssertionError: "[(0,[355 chars]N', 'det', 3, 'x', 'O', '',  True, False, [0, [970 chars].')]" != "[(0,[355 chars]N', 'compound', 3, 'x', 'O', '',  True, False,[975 chars].')]"
Diff is 1957 characters long. Set self.maxDiff to None to see it.
----------------------------------------------------------------------
Ran 89 tests in 1.210s
FAILED (failures=2, skipped=1)
The command "python -m unittest discover" exited with 1.

Decide if supporting Pandas `shift` method for CharSpan and TokenSpan Arrays

Tests in question are TestPandasMethods.test_container_shift and test_shift_non_empty_array

Symbol locations in generated docs don't match intended locations

The generated API docs show functions and classes as being located at the regions of the namespace where their source files reside. For example, the docs for the Span class reference that class as text_extensions_for_pandas.array.char_span.Span even though we export this class at the top level of our namespace as text_extensions_for_pandas.Span.

The locations of symbols in the API docs should match the locations where we intend users to access them from.

Review Integrate_NLP_Libraries.ipynb and update as needed

This issue covers making another review pass over the notebook notebooks/Integrate_NLP_Libraries.ipynb to cover the following:

Identify areas where the Markdown text and code isn't clear to a newcomer to the project
Make sure the notebook works in hosted Jupyter environments (Watson Studio, Colab, and similar)

You can create pull requests to address any issues with clarity of the code and text and/or making the notebook work in hosted Jupyter. Be sure to reference this issue in your PRs.

Implement all required hooks for ExtensionArray types

Implement all required hooks for ExtensionArray types.

In particular, the regression tests that Pandas ships with for ensuring correctness of ExtensionArrays should pass.

Also, extend our own regression suite so that it runs the Pandas tests in addition to our low-level tests.

CoNLL demo, part 3

At the end of part 2 of the demo, we've shown that there are incorrect labels hidden in the CoNLL-2003 validation set, and that you can pinpoint those incorrect labels by data-mining the results of the 16 models the competitors submitted.

Our goal for part 3 of the demo is to pinpoint incorrect labels across the entire data set. The (rough) process to do so will be:

Retokenize the entire corpus using a "BERT-compatible" tokenizer, and map the token/entity labels from the original corpus on to the new tokenization.
Generate BERT embeddings for every token in the entire corpus in one pass, and store those embeddings in a dataframe column (of type TensorType) alongside the tokens and labels.
Use the embeddings to quickly train multiple models at multiple levels of sophistication (something like: SVMs, random forests, and LSTMs with small and large numbers of hidden states). Split the corpus into 10 parts and perform a 10-fold cross-validation.
Repeat the process from part 2 on each fold of the 10-fold cross-validation, comparing the outputs of every model on the validation set for each fold.
Analyze the results of the models to pinpoint potential incorrect labels. Inspect those labels manually and build up a list of labels that are actually incorrect.

Add descriptions for Watson NLU outputs at bottom of `Analyze_Text.ipynb`

The last four cells of Analyze_Text.ipynb show the DataFrame version of Watson NLU's output for entities, keywords, relations, and semantic roles.

We should add some Markdown text in between these cells to explain what kind of data is stored in each column of each DataFrame. Where applicable, we should explain in plain English what the Watson NLU output in one example row of the DataFrame means.

Extend CharSpanArray and TokenSpanArray to support multiple documents

The current implementation of CharSpanArray and TokenSpanArray only allows a single target text for all of the spans in a given array. This restriction is fine as long as all the spans in a given Dataframe come from a single document, but it complicates use cases involving combining information from multiple documents in a single Dataframe. Currently the only way to have spans from multiple documents in a series is to convert CharSpanArray/TokenSpanArray arrays into arrays of type Object containing individual CharSpan and TokenSpan objects.

We should extend our span array types to allow for multiple target texts per array. Key challenges to address:

Memory-efficient representation for the span data
Clean semantics for span comparison across documents. What happens if two documents in a corpus have the same text?
Efficient implementations of the span operations under text_extensions_for_pandas.spanner with multiple target texts
Serialization/deserialization to/from Arrow and Feather format

BUG: Feather-related regression tests failing at head of master branch

When I run python -m unittest discover at the head of the master branch, 5 regression tests related to Feather I/O fail. Output follows:

======================================================================
ERROR: test_feather (text_extensions_for_pandas.array.test_char_span.CharSpanArrayIOTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/freiss/pd/tep-conll-4/text_extensions_for_pandas/array/test_char_span.py", line 396, in test_feather
    pd.testing.assert_frame_equal(df, df_read)
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/_testing.py", line 1382, in assert_frame_equal
    obj=f'{obj}.iloc[:, {i}] (column name="{col}")',
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/_testing.py", line 1184, in assert_series_equal
    assert_extension_array_equal(left.array, right.array)
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/_testing.py", line 1034, in assert_extension_array_equal
    right_valid = np.asarray(right[~right_na].astype(object))
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/arrays/base.py", line 443, in astype
    return np.array(self, dtype=dtype, copy=copy)
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/arrays/base.py", line 352, in __iter__
    yield self[i]
  File "/Users/freiss/pd/tep-conll-4/text_extensions_for_pandas/array/char_span.py", line 297, in __getitem__
    int(self._ends[item]))
  File "/Users/freiss/pd/tep-conll-4/text_extensions_for_pandas/array/char_span.py", line 65, in __init__
    raise ValueError(f"end must be less than length of target string "
ValueError: end must be less than length of target string (32 > 15

======================================================================
ERROR: test_feather (text_extensions_for_pandas.array.test_tensor.TensorArrayIOTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/freiss/pd/tep-conll-4/text_extensions_for_pandas/array/test_tensor.py", line 346, in test_feather
    df_read = pd.read_feather(filename)
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/io/feather_format.py", line 103, in read_feather
    return feather.read_feather(path, columns=columns, use_threads=bool(use_threads))
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pyarrow/feather.py", line 214, in read_feather
    return (read_table(source, columns=columns, memory_map=memory_map)
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pyarrow/feather.py", line 239, in read_table
    return reader.read()
  File "pyarrow/feather.pxi", line 79, in pyarrow.lib.FeatherReader.read
  File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: List child array invalid: Invalid: Buffer #1 too small in array of type int64 and length 10: expected at least 80 byte(s), got 68

======================================================================
ERROR: test_feather_auto_chunked (text_extensions_for_pandas.array.test_tensor.TensorArrayIOTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/freiss/pd/tep-conll-4/text_extensions_for_pandas/array/test_tensor.py", line 399, in test_feather_auto_chunked
    table = read_table(filename)
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pyarrow/feather.py", line 239, in read_table
    return reader.read()
  File "pyarrow/feather.pxi", line 79, in pyarrow.lib.FeatherReader.read
  File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: List child array invalid: Invalid: Buffer #1 too small in array of type int64 and length 1024: expected at least 8192 byte(s), got 4121

======================================================================
ERROR: test_feather_chunked (text_extensions_for_pandas.array.test_tensor.TensorArrayIOTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/freiss/pd/tep-conll-4/text_extensions_for_pandas/array/test_tensor.py", line 381, in test_feather_chunked
    df_read = pd.read_feather(filename)
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/io/feather_format.py", line 103, in read_feather
    return feather.read_feather(path, columns=columns, use_threads=bool(use_threads))
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pyarrow/feather.py", line 214, in read_feather
    return (read_table(source, columns=columns, memory_map=memory_map)
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pyarrow/feather.py", line 239, in read_table
    return reader.read()
  File "pyarrow/feather.pxi", line 79, in pyarrow.lib.FeatherReader.read
  File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: List child array invalid: Invalid: Buffer #1 too small in array of type int64 and length 10: expected at least 80 byte(s), got 68

======================================================================
ERROR: test_feather (text_extensions_for_pandas.array.test_token_span.TokenSpanArrayIOTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/freiss/pd/tep-conll-4/text_extensions_for_pandas/array/test_token_span.py", line 341, in test_feather
    self.do_roundtrip(df1)
  File "/Users/freiss/pd/tep-conll-4/text_extensions_for_pandas/array/test_token_span.py", line 333, in do_roundtrip
    pd.testing.assert_frame_equal(df, df_read)
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/_testing.py", line 1382, in assert_frame_equal
    obj=f'{obj}.iloc[:, {i}] (column name="{col}")',
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/_testing.py", line 1184, in assert_series_equal
    assert_extension_array_equal(left.array, right.array)
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/_testing.py", line 1034, in assert_extension_array_equal
    right_valid = np.asarray(right[~right_na].astype(object))
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/arrays/base.py", line 443, in astype
    return np.array(self, dtype=dtype, copy=copy)
  File "/Users/freiss/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/arrays/base.py", line 352, in __iter__
    yield self[i]
  File "/Users/freiss/pd/tep-conll-4/text_extensions_for_pandas/array/token_span.py", line 311, in __getitem__
    self._tokens, int(self._begin_tokens[item]), int(self._end_tokens[item])
  File "/Users/freiss/pd/tep-conll-4/text_extensions_for_pandas/array/token_span.py", line 71, in __init__
    f"Begin token offset of {begin_token} larger than "
ValueError: Begin token offset of 32 larger than number of tokens (4)

----------------------------------------------------------------------
Ran 87 tests in 1.751s

FAILED (errors=5, skipped=1)

Clarify NER-related background material in Analyze_Model_Outputs.ipynb

In the notebook notebooks/Analyze_Model_Outputs.ipynb (see here), some of the terminology used may be unfamiliar to a newcomer to NLP. In particular, this paragraph could use a gentler introduction to the concepts of named entity recognition and token-level error rate:

IOB2 format is a convenient way to represent a corpus, but it is a less useful representation for analyzing the result quality of named entity recognition models. Most tokens in a typical NER corpus will be tagged O, any measure of error rate in terms of tokens will over-emphasizing the tokens that are part of entities. Token-level error rate implicitly assigns higher weight to named entity mentions that consist of multiple tokens, further unbalancing error metrics. And most crucially, a naive comparison of IOB tags can result in marking an incorrect answer as correct. Consider a case where the correct sequence of labels is B, B, I but the model has output B, I, I; in this case, last two tokens of model output are both incorrect (the model has assigned them to the same entity as the first token), but a naive token-level comparison will consider the last token to be correct.

We should add more Markdown text to this notebook in two places:

At the beginning, there should be a more detailed explanation of named entity recognition models, ideally with a visual illustration of NER model outputs (perhaps drawn by some Python code using displaCy).
The above paragraph should be expanded out with a more detailed explanation of what happens when you use token classification (instead of entity extraction) as the basis for computing model quality.

ENH: Create Dataframe wrapper for Watson table extraction API

Create a wrapper for the Watson Compare and Comply table extraction APIs as part of the "Watson Natural Language Understanding JSON output" line item from our version 0.1 roadmap (#1)

API docs are here: https://cloud.ibm.com/apidocs/compare-comply#extract-a-document-s-tables

The API doc page includes an example of the JSON output of the API (element_classification). The dataframe version of the API output should include all information present in the JSON.

Add value_counts() method to SpanArray to support DataFrame.describe()

DataFrame.describe() on a DataFrame with a span column currently doesn't work because our array types are missing the method value_counts(). The specific stack trace looks like this:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-55-f041b2b916b3> in <module>
----> 1 syntax_df.describe()

[...]

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/algorithms.py in value_counts(values, sort, ascending, normalize, bins, dropna)
    736 
    737             # handle Categorical and sparse,
--> 738             result = Series(values)._values.value_counts(dropna=dropna)
    739             result.name = name
    740             counts = result._values

AttributeError: 'CharSpanArray' object has no attribute 'value_counts'

We should implement the value_counts() method, following the example of the implementation for Pandas' built-in IntervalArray type.

Note that TokenSpanArray is currently a subclass of CharSpanArray, but the implementation of #91 may change that relationship.

Add support for creating empty DataFrame with TokenSpanArray

Test currently catches exception in test_token_span.py TestPandasConstructors.test_construct_empty_dataframe. This should not raise a TypeError.

ENH: Extend CoNLL format I/O to handle files arbitrary extra tags per token

The CoNLL data set we have been using internally contains just the annotations, i.e.:

-DOCSTART- O

CRICKET O
- O
LEICESTERSHIRE I-ORG
TAKE O
OVER O
AT O

Other copies of this data set available on the internet contain part of speech tags and noun phrase annotations, i.e.:

-DOCSTART- -X- -X- O

CRICKET NNP I-NP O
- : O O
LEICESTERSHIRE NNP I-NP I-ORG
TAKE NNP I-NP O
OVER IN I-PP O
AT NNP I-NP O

(see https://github.com/patverga/torch-ner-nlp-from-scratch/blob/master/data/conll2003/eng.testa for example)

The API conll_2003_to_dataframes in text_extensions_for_pandas/io/conll.py should be extended to handle both of these formats. Ideally, that API should be extended to handle an arbitrary mix of single-token and IOB-format columns, so that other data sets in "CoNLL format", such as the ones at https://github.com/juand-r/entity-recognition-datasets, can be read using that API. After these changes, the signature of conll_2003_to_dataframes() should look something like this:

def conll_2003_to_dataframes(input_file: str,
                             column_names: List[str],
                             iob_columns: List[bool],
                             space_before_punct: bool = False)\
        -> List[pd.DataFrame]:
    """
     [. . .]
    :param column_names: Names for the metadata columns that come after the 
     token text. These names will be used to generate the names of the dataframe
     that this function returns.
    :param iob_columns: Mask indicating which of the metadata columns after the
     token text should be treated as being in IOB format. If a column is in IOB format,
     the returned dataframe will contain *two* columns, holding IOB2 tags and 
     entity type tags, respectively. For example, an input column "ent" will turn into
     output columns "ent_iob" and "ent_type".
     [. . .]
    """

The conll_2003_output_to_dataframes function should also be changed so that it passes through the additional data columns that the modified conll_2003_to_dataframes function produces.

The notebooks CoNLL_1.ipynb, CoNLL_2.ipynb, CoNLL_3.ipynb, and CoNLL_4.ipynb should be modified to download the CoNLL dataset from Github (available in this directory as well as a few other places). Instead of downloading the dataset every time they run, these scripts should cache the dataset locally in a local location such as the existing directory notebooks/outputs. NOTE: The data sets we have been using mix up the meanings of testa and testb. When switching to the versions of the CoNLL data set available from Github, you will need to modify some constants in the notebooks to reflect this change -- see, for example cell 2 of CoNLL_1.ipynb.

codait / text-extensions-for-pandas Goto Github PK

text-extensions-for-pandas's Introduction

Text Extensions for Pandas

Features

SpanArray: A Pandas extension type for spans of text

TensorArray: A Pandas extension type for tensors

Pandas front-ends for popular NLP toolkits

CoNLL-2020 Paper

Installation

Installation from Source

Documentation

Contents of this repository

Contributing

Building and Running Tests

text-extensions-for-pandas's People

Contributors

Stargazers

Watchers

Forkers

text-extensions-for-pandas's Issues

Build/CI

Features

Regression Tests

Documentation

Refactoring

Recommend Projects

Recommend Topics

Recommend Org

Jobs