datasig-ac-uk / nlpsig
Package for constructing paths of embeddings obtained from transformers.
License: BSD 3-Clause "New" or "Revised" License
In the notebook, after encoding the text data, we can plot the embeddings. The plot seems reasonable, as we have three classes. In fact, we have four classes, but we map them into three:
"label":
{"economy": 2,
"obama": 1,
"microsoft": 0,
"palestine": 0
}
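For illustration, a minimal sketch of applying this mapping with pandas (the column names content and topic are assumptions, not necessarily the ones used in the notebook):

```python
import pandas as pd

# Hypothetical example: collapse the four original topics into three classes.
label_map = {"economy": 2, "obama": 1, "microsoft": 0, "palestine": 0}

df = pd.DataFrame({
    "content": ["text about markets", "text about a speech", "text about software", "text about the region"],
    "topic": ["economy", "obama", "microsoft", "palestine"],
})
df["label"] = df["topic"].map(label_map)
print(df["label"].value_counts())  # three distinct classes: 0 (x2), 1, 2
```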
However, after time injection and
x_data = dsig.create_features(path, sig_combined, last_index_dt_all, bert_embeddings, time_feature)
the results look very different (see the notebook). Am I missing anything?
Improve the readthedocs by having more example snippets of code.
Let the user pass indices into the data splitting classes instead of always having to compute them. In Folds, you can make sure that the folds are split by group, but this is not available for DataSplits.
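For reference, a minimal sketch of group-aware splitting using scikit-learn's GroupKFold; this only illustrates the behaviour being requested, not how Folds or DataSplits are implemented:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 8 samples belonging to 4 groups (e.g. 4 distinct timelines/users).
X = np.random.rand(8, 5)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

# Each group lands in exactly one test fold, so a group is never split across folds.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    print("train:", train_idx, "test:", test_idx)
```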
I changed the model_specifics dictionary to a nested one with the following entries. Does this make sense?
model_specifics = {
    "encoder_args": {
        "col_name_text": "content",
        "model_name": "all-MiniLM-L6-v2",
        "model_args": {
            "batch_size": 64,
            "show_progress_bar": True,
            "output_value": 'sentence_embedding',
            "convert_to_numpy": True,
            "convert_to_tensor": False,
            "device": None,
            "normalize_embeddings": False
        }
    },
    "dim_reduction": {
        "method": 'ppapca',  # options: ppapca, ppapcappa, umap
        "num_components": 10,  # options: any integer between 1 and the embedding dimension
    },
    "time_injection": {
        "history_tp": 'timestamp',  # options: timestamp, None
        "post_tp": 'timestamp',  # options: timestamp, timediff, None
    },
    "embedding": {
        "global_embedding_tp": 'SBERT',  # options: SBERT, BERT_cls, BERT_mean, BERT_max
        "post_embedding_tp": 'sentence',  # options: sentence, reduced
        "feature_combination_method": 'attention',  # options: concatenation, attention
    },
    "signature": {
        "dimensions": 3,  # options: any integer larger than 1
        "method": 'log',  # options: log, sig
        "interval": 1/12
    },
    "classifier": {
        "classifier_name": 'FFN2hidden',  # options: FFN2hidden (any future classifiers added)
        "classes_num": '3class',  # options: 3class (5class to be added in the future)
    }
}
Default True
Currently in PrepareData, if the data passed in has datetime as a column, it will sort the dataframe by id_column and datetime - this is so that we can make the timeline_index column properly - but we should re-sort this dataframe afterwards for consistency with the original dataframe.
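A minimal sketch of the suggested re-sorting, assuming plain pandas and illustrative column names (this is not the actual PrepareData code):

```python
import pandas as pd

df = pd.DataFrame({
    "id_column": [1, 0, 1, 0],
    "datetime": pd.to_datetime(["2020-01-02", "2020-01-01", "2020-01-01", "2020-01-03"]),
    "content": ["b", "a", "c", "d"],
})

# Sort so that timeline_index can be assigned in time order within each id.
df = df.sort_values(by=["id_column", "datetime"])
df["timeline_index"] = df.groupby("id_column").cumcount()

# Re-sort back to the original row order for consistency with the input dataframe.
df = df.sort_index()
```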
The PrepareData class essentially takes in a dataframe and constructs paths by looking at the history, or by its id. It would be good to be able to compute the path signature directly here.
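For context, a minimal sketch of computing a (log-)signature of a path with Signatory, which the package already depends on; the path here is a toy tensor rather than one produced by PrepareData:

```python
import torch
import signatory

# Toy path: batch of 1, stream length 5, channel dimension 3
# (e.g. a short history of 3-dimensional reduced embeddings).
path = torch.rand(1, 5, 3)

# Truncated signature and log-signature up to depth 3.
sig = signatory.signature(path, depth=3)
logsig = signatory.logsignature(path, depth=3)
print(sig.shape, logsig.shape)
```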
Currently, the TextEncoder class only works with sentences, via the sentence-transformers package, encoding sentences with its SentenceTransformer class. It should allow passing in the dataframe and have the capability to encode the sentences (i.e. obtain embeddings for them) and to fine-tune the transformer model to some task (or just update the weights by training a language model on the new text).
After meeting with Terry, the package should also be able to work with any streams of text, e.g. with the alphabet dataset (the task of predicting the language of a word). In this setting, the paths are embeddings of the letters. In the simplest case, these embeddings can be 26-dimensional one-hot encodings of letters, but we could also have more sophisticated embeddings for them.
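As a rough illustration of the simplest case, a sketch of 26-dimensional one-hot letter embeddings forming a path (the function name and word are purely illustrative):

```python
import string
import numpy as np

def one_hot_letter_path(word: str) -> np.ndarray:
    """Embed each letter of a word as a 26-dimensional one-hot vector,
    giving a path of shape (len(word), 26)."""
    alphabet = {letter: i for i, letter in enumerate(string.ascii_lowercase)}
    path = np.zeros((len(word), 26))
    for position, letter in enumerate(word.lower()):
        path[position, alphabet[letter]] = 1.0
    return path

print(one_hot_letter_path("signature").shape)  # (9, 26)
```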
We can still use SentenceTransformer because I guess we're just pooling one BERT embedding (because it would be just one token). But perhaps if we're working with just tokens, transformers is more appropriate? It would be nicer when fine-tuning the transformer too, I think.
Potentially we could rename TextEncoder to SentenceEncoder, which works when the items in the dataframe passed in are sentences - i.e. working with a stream of sentences - and uses the sentence-transformers package. We could then make a new class TextEncoder which assumes that we're working with streams of tokens, so each item in the dataframe is a token, and we can use the class to obtain embeddings. The class would be able to fine-tune the transformer to the available data too.
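A minimal sketch of what obtaining embeddings for a stream of tokens with the transformers library might look like; the model name and mean pooling are assumptions for illustration, not the package's actual API:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Treat each item in the stream as a single token and embed it separately.
tokens = ["cat", "sat", "mat"]
inputs = tokenizer(tokens, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over the (sub)word pieces of each item to get one vector per token.
embeddings = outputs.last_hidden_state.mean(dim=1)
print(embeddings.shape)  # (3, 768)
```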
Add tests for methods which construct paths in the PrepareData class.
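A hedged sketch of what such a test might look like with pytest; the constructor arguments and the behaviour asserted here are assumptions about the PrepareData API, not the real interface:

```python
import pandas as pd
import pytest

from nlpsig import PrepareData  # assumed import path


@pytest.fixture
def toy_df() -> pd.DataFrame:
    return pd.DataFrame({
        "id_column": [0, 0, 0, 1, 1],
        "datetime": pd.to_datetime(
            ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-02"]
        ),
        "content": ["a", "b", "c", "d", "e"],
    })


def test_path_has_one_entry_per_post(toy_df):
    # Hypothetical constructor/method names; adjust to the actual PrepareData API.
    prep = PrepareData(toy_df, id_column="id_column")
    path = prep.get_path()
    assert path.shape[0] == len(toy_df)
```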
We have decided to start working on an example using the Manifesto Project database. Using this issue to note down ideas.
The Manifesto Project has compiled a large corpus of political manifestos from many elections around the world. We will start out by focusing on the manifestos from UK parties for the last few elections (2019, 2017, 2015, ...) and to begin will just focus on the ones from the largest four parties (Conservative, Labour, Liberal Democrats, SNP).
Here you can find annotated CSVs, which label each of the "quasi-sentences" (sentences which contain exactly one statement) with a topic. The topics can be quite granular and can be found here, but importantly they are grouped (the first digit of a topic's label indicates its larger topic - e.g. 101 "Foreign Special Relationships: Positive", 102 "Foreign Special Relationships: Negative", 103 "Anti-Imperialism", etc. fall under the larger topic 1: "External Relations").
We can construct a binary classification task: predicting at the post level whether or not a change in topic has happened, using the path signature of the preceding posts and using the pipeline.
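A hedged sketch of how the binary "topic changed" label could be derived from the annotated quasi-sentences; the column names and grouping logic are illustrative assumptions, not the Manifesto Project CSV schema:

```python
import pandas as pd

# Toy annotated quasi-sentences: one row per quasi-sentence, in document order.
df = pd.DataFrame({
    "manifesto_id": ["lab2019"] * 5,
    "text": ["q1", "q2", "q3", "q4", "q5"],
    "topic": [101, 102, 102, 504, 504],
})

# Map granular codes to their larger topic via the first digit.
df["major_topic"] = df["topic"] // 100

# Binary target: did the major topic change relative to the previous quasi-sentence?
previous = df.groupby("manifesto_id")["major_topic"].shift()
df["topic_changed"] = (previous.notna() & (previous != df["major_topic"])).astype(int)
print(df)
```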
In #1, I removed the cell "K fold cross validation with random seeds in FFN" (at the very end of the notebook). This is because several paths in classification_utils.py are hard-coded, and I don't have access to the folders/files (and I am not sure about the data format/columns/...).
There may be columns in the dataframe passed into PrepareData that we wish to include in the path as variables. Add an additional_features parameter to the pad and get_path methods to allow for this.
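A rough sketch of the idea (not the actual pad/get_path implementation): extra numeric columns are simply concatenated onto each step of the embedding path. The column names here are illustrative:

```python
import numpy as np
import pandas as pd

# Toy timeline: 4 posts, each with a 3-dimensional (reduced) embedding.
embeddings = np.random.rand(4, 3)
df = pd.DataFrame({
    "n_replies": [0, 2, 1, 5],
    "sentiment": [0.1, -0.3, 0.0, 0.7],
})

additional_features = ["n_replies", "sentiment"]
# Each step of the path now carries the embedding plus the requested extra variables.
path = np.concatenate([embeddings, df[additional_features].to_numpy()], axis=1)
print(path.shape)  # (4, 5)
```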
The .fit_transformer() method in the SentenceEncoder class tries to fine-tune the sentence transformer to new data. It is currently not fully implemented, and may need some thought about what this means - are we just going to tune the underlying BERT in SBERT to the new data?
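For reference, a minimal sketch of fine-tuning a sentence transformer with the sentence-transformers training API; the training pairs and loss are illustrative, and this is not the .fit_transformer() implementation:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy similarity-labelled pairs; in practice these would come from the task data.
train_examples = [
    InputExample(texts=["the economy is growing", "GDP rose this quarter"], label=0.9),
    InputExample(texts=["the economy is growing", "the match ended in a draw"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# One short epoch of fine-tuning, which updates the underlying transformer weights.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```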
The website says:
Older versions of Signatory supported earlier versions of Python and PyTorch.
It also included support for MacOS,
but this has now been dropped as being difficult to maintain.
Currently, the title is: Extract features from textual data using BERT and Path signature
Does it make sense?
https://github.com/datasig-ac-uk/signature_applications is a public repo with some example notebooks. Once we are done with the refactoring/review, should we move this repo there?
pytest framework for parts of the pipeline.
Signatory is currently no longer in development, so we will need to think about whether or not it makes sense to keep it as a dependency in the future. It currently holds back being able to use newer versions of PyTorch. We are keeping it for now as there is work that uses this package which needs Signatory.
Goal: create a library that does the following steps on textual data (and not only the datasets that we have worked with so far).
Several changes are needed to achieve this (not a complete list):
- classification_utils.py needs to be refactored to work with other datasets
- timeline_id and postid
Documentation in the readthedocs is not properly formatted. Need to fix the doc-strings so that they display nicer.
Also fix the warnings that are generated when creating the docs.
Add random projections as one of the options for dimensionality reduction (into the DimReduce class):
- method argument
- reducer attribute
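A minimal sketch of what the random-projection option could look like using scikit-learn; this is only an illustration, not the DimReduce implementation:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

# Toy embeddings: 100 sentences with 384-dimensional SBERT-style vectors.
embeddings = np.random.rand(100, 384)

reducer = GaussianRandomProjection(n_components=10, random_state=0)
reduced = reducer.fit_transform(embeddings)
print(reduced.shape)  # (100, 10)
```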