
datasig-ac-uk / nlpsig


Package for constructing paths of embeddings obtained from transformers.

License: BSD 3-Clause "New" or "Revised" License

Language: Python 100.00%
Topics: hut23-1132 hut23-1376 nlp transformers

nlpsig's People

Contributors

dependabot[bot], kasra-hosseini, phinate, rchan26, ttseriotou


nlpsig's Issues

Check plotEmbedding after `dsig.create_features`

In the notebook, after encoding the text data, we can plot the embeddings:

[Screenshot, 2022-10-12: plot of the encoded embeddings]

which seems reasonable as we have three classes. In fact, we have four classes, but we map them into three:

            "label":
                {"economy": 2,
                 "obama": 1,
                 "microsoft": 0,
                 "palestine": 0
                }

However, after time injection and

x_data = dsig.create_features(path, sig_combined, last_index_dt_all, bert_embeddings, time_feature)

the results look very different (see the notebook). Am I missing anything?
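For reference, a minimal sketch of one way to eyeball the embeddings (this is not the notebook's plotEmbedding code; df["label"] is a stand-in for whichever column holds the mapped labels):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# project the embeddings to 2-D and colour by the three mapped classes
points = PCA(n_components=2).fit_transform(bert_embeddings)
plt.scatter(points[:, 0], points[:, 1], c=df["label"], s=5, cmap="viridis")
plt.colorbar(label="label")
plt.show()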

Model specifications

I changed the model_specifics dictionary to a nested one with the following entries. Does this make sense?

model_specifics = {
    "encoder_args": {
        "col_name_text": "content",
        "model_name": "all-MiniLM-L6-v2",
        "model_args": {
            "batch_size": 64,
            "show_progress_bar": True,
            "output_value": 'sentence_embedding', 
            "convert_to_numpy": True,
            "convert_to_tensor": False,
            "device": None,
            "normalize_embeddings": False
        }
    },
    "dim_reduction": {
        "method": 'ppapca', #options: ppapca, ppapcappa, umap
        "num_components": 10, # options: any int number between 1 and embedding dimensions
    },
    "time_injection": {
        "history_tp": 'timestamp', #options: timestamp, None
        "post_tp": 'timestamp', #options: timestamp, timediff, None
    },
    "embedding":{
        "global_embedding_tp": 'SBERT', #options: SBERT, BERT_cls , BERT_mean, BERT_max
        "post_embedding_tp": 'sentence', #options: sentence, reduced
        "feature_combination_method": 'attention', #options concatenation, attention 
    },
    "signature": {
        "dimensions": 3, #options: any int number larger than 1
        "method": 'log', # options: log, sig
        "interval": 1/12
    },
    "classifier": {
        "classifier_name": 'FFN2hidden', # options: FFN2hidden (any future classifiers added)
        "classes_num": '3class', #options: 3class (5class to be added in the future)
    }
}

Make PrepareData keep the same index ordering as the original dataframe

Currently in PrepareData, if the data passed in has datetime as a column, it will sort the dataframe by id_column and datetime - this is so that we can construct the timeline_index column properly - but we should re-sort this dataframe afterwards so that it keeps the same ordering as the original dataframe (see the sketch below).
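A minimal sketch of the idea (not the actual PrepareData code; id_column and datetime_column are placeholder column names):

import pandas as pd

def add_timeline_index(df: pd.DataFrame, id_column: str, datetime_column: str) -> pd.DataFrame:
    df = df.copy()
    df["_original_order"] = range(len(df))
    # sort so that the per-id timeline_index can be constructed in time order
    df = df.sort_values(by=[id_column, datetime_column])
    df["timeline_index"] = df.groupby(id_column).cumcount()
    # restore the ordering of the dataframe that was passed in
    return df.sort_values("_original_order").drop(columns="_original_order")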

Be able to work with sequence of tokens

Currently, the TextEncoder class only works with sentences, via the sentence-transformers package: sentences are encoded with that package's SentenceTransformer class. It allows passing in the data frame, can encode the sentences (i.e. obtain embeddings for them), and can fine-tune the transformer model to some task (or just update the weights by training a language model on the new text).

After meeting with Terry, the package should also be able to work with any streams of text, e.g. with the alphabet dataset (the task of predicting the language of a word). In this setting, the paths are embeddings of the letters. In the simplest case, these embeddings can be 26-dimensional one-hot encodings of the letters (a sketch is below), but we could also have more sophisticated embeddings for them.
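Purely for illustration (this is not nlpsig code), the one-hot letter paths could look like:

import numpy as np
import string

# 26-dimensional one-hot encoding of lower-case letters, so a word becomes a
# path of shape (word_length, 26)
LETTER_TO_INDEX = {letter: i for i, letter in enumerate(string.ascii_lowercase)}

def one_hot_word(word: str) -> np.ndarray:
    path = np.zeros((len(word), 26))
    for position, letter in enumerate(word.lower()):
        path[position, LETTER_TO_INDEX[letter]] = 1.0
    return path

# e.g. one_hot_word("cat") has shape (3, 26)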

We can still use SentenceTransformer, because I guess we're just pooling a single BERT embedding (it would be just one token).

But perhaps if we're working with just tokens, the transformers library is more appropriate? It would be nicer when fine-tuning the transformer too, I think.

Potentially we could rename TextEncoder to SentenceEncoder, which works when the items in the dataframe passed in are sentences - so working with a stream of sentences - and which uses the sentence-transformers package. We could then make a new TextEncoder class which assumes we're working with streams of tokens, so each item in the dataframe is a token, and use that class to obtain embeddings. The new class would be able to fine-tune the transformer to the available data too. A rough sketch of obtaining token embeddings with transformers is below.
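Rough sketch of per-token embeddings with the transformers library (the model name and mean-pooling over sub-word pieces are just example choices, not decisions made in this issue):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embedding(token: str) -> torch.Tensor:
    inputs = tokenizer(token, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # drop the [CLS]/[SEP] positions and average any sub-word pieces
    return outputs.last_hidden_state[0, 1:-1].mean(dim=0)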

Example with Manifesto Project database

We have decided to start working on an example using the Manifesto Project database. Using this issue to note down ideas.

The Manifesto Project has compiled a large corpus of political manifestos from many elections around the world. We will start by focusing on the manifestos from UK parties for the last few elections (2019, 2017, 2015, ...), and to begin with will just look at the ones from the four largest parties (Conservative, Labour, Liberal Democrats, SNP).

Here you can find annotated CSVs, which label each of the "quasi-sentences" (sentences which contain exactly one statement) with a topic. The topics can be quite granular and can be found here, but importantly they are grouped: the first digit of a label indicates its larger topic, e.g. 101 "Foreign Special Relationships: Positive", 102 "Foreign Special Relationships: Negative", 103 "Anti-Imperialism", etc. all fall under topic 1 ("External Relations").

We can construct a binary classification task: predict, at the post level, whether or not a change in topic has happened, using the path signature of the last few posts and the existing pipeline. A rough sketch of building such a label is below.
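Rough sketch of the labelling step (the file and column names here are assumptions, not the real CSV schema):

import pandas as pd

df = pd.read_csv("manifesto_annotated.csv")  # hypothetical file name
df = df.sort_values(["manifesto_id", "position"])  # hypothetical columns

# the larger topic is the first digit of the granular code, e.g. 101 -> 1
df["main_topic"] = df["code"].astype(str).str[0]

# label is 1 when the topic differs from the previous quasi-sentence in the same manifesto
previous_topic = df.groupby("manifesto_id")["main_topic"].shift(1)
df["topic_changed"] = (df["main_topic"] != previous_topic).astype(int)
# (the first quasi-sentence of each manifesto has no predecessor; handle as needed)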

[REVIEW] K fold cross validation with random seeds in FFN

In #1, I removed the cell "K fold cross validation with random seeds in FFN" (at the very end of the notebook). This is because several paths in classification_utils.py are hard-coded and I don't have access to the folders/files (and I am not sure about the data format/columns/...).

Add option to include additional features into path

There may be columns in the dataframe passed into PrepareData that we wish to include in the path as variables. Add an additional_features parameter to the pad and get_path methods to allow for this; a hypothetical sketch is below.
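Hypothetical sketch of what the option could do (this is not the actual pad/get_path signature in nlpsig):

import numpy as np
import pandas as pd

def build_path(embeddings: np.ndarray, df: pd.DataFrame, additional_features=None) -> np.ndarray:
    if not additional_features:
        return embeddings
    # append the requested columns to each step of the path,
    # so each step becomes [embedding dims..., extra feature dims...]
    extra = df[additional_features].to_numpy(dtype=float)
    return np.concatenate([embeddings, extra], axis=1)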

Implement possibility to train sentence-transformer

The .fit_transformer() method in the SentenceEncoder class tries to fine-tune the sentence transformer to new data. It is currently not fully implemented. This may need some thought about what it means - are we just going to tune the underlying BERT in SBERT to new data? One possible sketch with the sentence-transformers training API is below.
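A sketch using the sentence-transformers training API (the example pairs, labels and loss are placeholders, not nlpsig code):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")
train_examples = [
    InputExample(texts=["first sentence", "a similar sentence"], label=0.9),
    InputExample(texts=["first sentence", "an unrelated sentence"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# this updates the underlying transformer weights (the "BERT in SBERT")
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)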

New contributors and our workflow

  • Add their GitHub handle to the repo (if actively developing, maybe as an Editor)
  • (when it is public) people would create a PR (Pull Request) for further development
  • Workflow:

[Screenshot, 2022-10-13: workflow diagram]

Signatory doesn't support MacOS

The website says:

Older versions of Signatory supported earlier versions of Python and PyTorch. 
It also included support for MacOS, 
but this has now been dropped as being difficult to maintain.

Title of the README

Currently, the title is: Extract features from textual data using BERT and Path signature

Does it make sense?

Break dependency with signatory

Signatory is no longer in development, so we will need to think about whether or not it makes sense to keep it as a dependency in the future. It currently holds back being able to use newer versions of PyTorch.

Keeping it for now, as there's work that uses this package which needs signatory.

Generalise the existing implementation to other textual data

Goal: create a library that does the following steps on textual data (and not only the datasets that we have worked with so far):

Several changes are needed to achieve this (not a complete list):

  • FFN, cross-validation ---> classification_utils.py needs to be refactored to work with other datasets
  • How to generalise timeline_id and postid

Fix formatting in readthedocs

Documentation on readthedocs is not properly formatted. Need to fix the docstrings so that they display more nicely; an example of a cleanly rendering docstring is sketched below.

Also fix the warnings that are generated when building the docs.
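For example, a numpydoc-style docstring that Sphinx renders cleanly might look like this (illustrative only, not an existing nlpsig function):

def reduce_dimensions(embeddings, num_components=10):
    """Reduce the dimensionality of a set of embeddings.

    Parameters
    ----------
    embeddings : numpy.ndarray
        Array of shape (n_samples, n_features).
    num_components : int, optional
        Number of components to keep, by default 10.

    Returns
    -------
    numpy.ndarray
        Array of shape (n_samples, num_components).
    """
    ...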

Add random projections to dimensionality reduction class

Add random projections as one of the options for dimensionality reduction (in the DimReduce class); a scikit-learn-based sketch is below the list.

  • Add "random_projection" as an option to the method argument
  • Add a description to the docstrings for the class
  • Add further options for the type of random projection ("gaussian", "sparse", ...)
  • Store the fitted transformer in the .reducer attribute
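Sketch of how the option could be backed by scikit-learn (illustrative, not the actual DimReduce implementation):

from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

def make_random_projection(kind="gaussian", num_components=10):
    # "gaussian" and "sparse" map onto scikit-learn's two random projection types
    if kind == "gaussian":
        return GaussianRandomProjection(n_components=num_components)
    if kind == "sparse":
        return SparseRandomProjection(n_components=num_components)
    raise ValueError(f"unknown random projection type: {kind}")

reducer = make_random_projection("gaussian", num_components=10)
# reduced = reducer.fit_transform(embeddings)  # keep the fitted object in the .reducer attribute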
