datasig-ac-uk / nlpsig
Package for constructing paths of embeddings obtained from transformers.
License: BSD 3-Clause "New" or "Revised" License
In the notebook, after encoding the text data, we can plot the embeddings. The plot seems reasonable, as we have three classes. In fact, we have four classes, but we map them into three:
"label":
{"economy": 2,
"obama": 1,
"microsoft": 0,
"palestine": 0
}
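For illustration, a minimal sketch of applying this mapping with pandas (the column names content and topic are assumptions, not necessarily the ones used in the notebook):

```python
import pandas as pd

# Hypothetical example: collapse the four original topics into three classes.
label_map = {"economy": 2, "obama": 1, "microsoft": 0, "palestine": 0}

df = pd.DataFrame({
    "content": ["text about markets", "text about a speech", "text about software", "text about the region"],
    "topic": ["economy", "obama", "microsoft", "palestine"],
})
df["label"] = df["topic"].map(label_map)
print(df["label"].value_counts())  # three distinct classes: 0 (x2), 1, 2
```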
However, after time injection and
x_data = dsig.create_features(path, sig_combined, last_index_dt_all, bert_embeddings, time_feature)
the results look very different (see the notebook). Am I missing anything?
Improve the readthedocs by having more example snippets of code.
Let the user pass indices into the data splitting classes instead of always having to compute them. In Folds, you can make sure that the folds are split by group, but this is not available for DataSplits.
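For reference, a minimal sketch of group-aware splitting using scikit-learn's GroupKFold; this only illustrates the behaviour being requested, not how Folds or DataSplits are implemented:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 8 samples belonging to 4 groups (e.g. 4 distinct timelines/users).
X = np.random.rand(8, 5)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

# Each group lands in exactly one test fold, so a group is never split across folds.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    print("train:", train_idx, "test:", test_idx)
```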
I changed the model_specifics dictionary to a nested one with the following entries. Does this make sense?
model_specifics = {
    "encoder_args": {
        "col_name_text": "content",
        "model_name": "all-MiniLM-L6-v2",
        "model_args": {
            "batch_size": 64,
            "show_progress_bar": True,
            "output_value": 'sentence_embedding',
            "convert_to_numpy": True,
            "convert_to_tensor": False,
            "device": None,
            "normalize_embeddings": False
        }
    },
    "dim_reduction": {
        "method": 'ppapca',  # options: ppapca, ppapcappa, umap
        "num_components": 10,  # options: any integer between 1 and the embedding dimension
    },
    "time_injection": {
        "history_tp": 'timestamp',  # options: timestamp, None
        "post_tp": 'timestamp',  # options: timestamp, timediff, None
    },
    "embedding": {
        "global_embedding_tp": 'SBERT',  # options: SBERT, BERT_cls, BERT_mean, BERT_max
        "post_embedding_tp": 'sentence',  # options: sentence, reduced
        "feature_combination_method": 'attention',  # options: concatenation, attention
    },
    "signature": {
        "dimensions": 3,  # options: any integer larger than 1
        "method": 'log',  # options: log, sig
        "interval": 1/12
    },
    "classifier": {
        "classifier_name": 'FFN2hidden',  # options: FFN2hidden (any future classifiers added)
        "classes_num": '3class',  # options: 3class (5class to be added in the future)
    }
}
Default True
Currently in PrepareData, if the data passed in has datetime as a column, it will sort the dataframe by id_column and datetime - this is so that we can make the timeline_index column properly - but we should re-sort this dataframe afterwards for consistency with the original dataframe.
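A minimal sketch of the suggested re-sorting, assuming plain pandas and illustrative column names (this is not the actual PrepareData code):

```python
import pandas as pd

df = pd.DataFrame({
    "id_column": [1, 0, 1, 0],
    "datetime": pd.to_datetime(["2020-01-02", "2020-01-01", "2020-01-01", "2020-01-03"]),
    "content": ["b", "a", "c", "d"],
})

# Sort so that timeline_index can be assigned in time order within each id.
df = df.sort_values(by=["id_column", "datetime"])
df["timeline_index"] = df.groupby("id_column").cumcount()

# Re-sort back to the original row order for consistency with the input dataframe.
df = df.sort_index()
```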
The PrepareData class essentially takes in a dataframe and constructs paths by looking at the history, or by its id. It would be good to be able to compute the path signature directly here.
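For context, a minimal sketch of computing a (log-)signature of a path with Signatory, which the package already depends on; the path here is a toy tensor rather than one produced by PrepareData:

```python
import torch
import signatory

# Toy path: batch of 1, stream length 5, channel dimension 3
# (e.g. a short history of 3-dimensional reduced embeddings).
path = torch.rand(1, 5, 3)

# Truncated signature and log-signature up to depth 3.
sig = signatory.signature(path, depth=3)
logsig = signatory.logsignature(path, depth=3)
print(sig.shape, logsig.shape)
```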
Currently, the TextEncoder class only works with sentences, via the sentence-transformers package, encoding sentences with its SentenceTransformer class. It should allow passing in the dataframe and have the capability to encode the sentences (i.e. obtain embeddings for them) and to fine-tune the transformer model to some task (or just update the weights by training a language model on the new text).
After meeting with Terry, the package should also be able to work with any streams of text, e.g. with the alphabet dataset (the task of predicting the language of a word). In this setting, the paths are embeddings of the letters. In the simplest case, these embeddings can be 26-dimensional one-hot encodings of letters, but we could also have more sophisticated embeddings for them.
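As a rough illustration of the simplest case, a sketch of 26-dimensional one-hot letter embeddings forming a path (the function name and word are purely illustrative):

```python
import string
import numpy as np

def one_hot_letter_path(word: str) -> np.ndarray:
    """Embed each letter of a word as a 26-dimensional one-hot vector,
    giving a path of shape (len(word), 26)."""
    alphabet = {letter: i for i, letter in enumerate(string.ascii_lowercase)}
    path = np.zeros((len(word), 26))
    for position, letter in enumerate(word.lower()):
        path[position, alphabet[letter]] = 1.0
    return path

print(one_hot_letter_path("signature").shape)  # (9, 26)
```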
We can still use SentenceTransformer because I guess we're just pooling one BERT embedding (because it would be just one token). But perhaps if we're working with just tokens, transformers is more appropriate? It would be nicer when fine-tuning the transformer too, I think.
Potentially we could rename TextEncoder to SentenceEncoder, which works when the items in the dataframe passed in are sentences - i.e. working with a stream of sentences - and uses the sentence-transformers package. We could then make a new class TextEncoder which assumes that we're working with streams of tokens, so each item in the dataframe is a token, and we can use the class to obtain embeddings. The class would be able to fine-tune the transformer to the available data too.
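A minimal sketch of what obtaining embeddings for a stream of tokens with the transformers library might look like; the model name and mean pooling are assumptions for illustration, not the package's actual API:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Treat each item in the stream as a single token and embed it separately.
tokens = ["cat", "sat", "mat"]
inputs = tokenizer(tokens, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over the (sub)word pieces of each item to get one vector per token.
embeddings = outputs.last_hidden_state.mean(dim=1)
print(embeddings.shape)  # (3, 768)
```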
Add tests for methods which construct paths in the PrepareData class.
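A hedged sketch of what such a test might look like with pytest; the constructor arguments and the behaviour asserted here are assumptions about the PrepareData API, not the real interface:

```python
import pandas as pd
import pytest

from nlpsig import PrepareData  # assumed import path


@pytest.fixture
def toy_df() -> pd.DataFrame:
    return pd.DataFrame({
        "id_column": [0, 0, 0, 1, 1],
        "datetime": pd.to_datetime(
            ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-02"]
        ),
        "content": ["a", "b", "c", "d", "e"],
    })


def test_path_has_one_entry_per_post(toy_df):
    # Hypothetical constructor/method names; adjust to the actual PrepareData API.
    prep = PrepareData(toy_df, id_column="id_column")
    path = prep.get_path()
    assert path.shape[0] == len(toy_df)
```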
We have decided to start working on an example using the Manifesto Project database. Using this issue to note down ideas.
The Manifesto Project has compiled a large corpus of political manifestos from many elections around the world. We will start out by focusing on the manifestos from UK parties for the last few elections (2019, 2017, 2015, ...) and to begin will just focus on the ones from the largest four parties (Conservative, Labour, Liberal Democrats, SNP).
Here you can find annotated CSVs, which label each of the "quasi-sentences" (sentences which contain exactly one statement) with a topic. The topics can be quite granular and can be found here, but importantly they are grouped (the first digit of a topic's label indicates its larger topic - e.g. 101 "Foreign Special Relationships: Positive", 102 "Foreign Special Relationships: Negative", 103 "Anti-Imperialism", etc. fall under the larger topic 1: "External Relations").
We can construct a binary classification task: predicting at the post level whether or not a change in topic has happened, using the path signature of the preceding posts and using the pipeline.
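A hedged sketch of how the binary "topic changed" label could be derived from the annotated quasi-sentences; the column names and grouping logic are illustrative assumptions, not the Manifesto Project CSV schema:

```python
import pandas as pd

# Toy annotated quasi-sentences: one row per quasi-sentence, in document order.
df = pd.DataFrame({
    "manifesto_id": ["lab2019"] * 5,
    "text": ["q1", "q2", "q3", "q4", "q5"],
    "topic": [101, 102, 102, 504, 504],
})

# Map granular codes to their larger topic via the first digit.
df["major_topic"] = df["topic"] // 100

# Binary target: did the major topic change relative to the previous quasi-sentence?
previous = df.groupby("manifesto_id")["major_topic"].shift()
df["topic_changed"] = (previous.notna() & (previous != df["major_topic"])).astype(int)
print(df)
```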
In #1, I removed the cell "K fold cross validation with random seeds in FFN" (at the very end of the notebook). This is because several paths in classification_utils.py are hard-coded, and I don't have access to the folders/files (and I am not sure about the data format/columns/...).
There may be columns in the dataframe passed into PrepareData that we wish to include in the path as variables. Add an additional_features parameter to the pad and get_path methods to allow for this.
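A rough sketch of the idea (not the actual pad/get_path implementation): extra numeric columns are simply concatenated onto each step of the embedding path. The column names here are illustrative:

```python
import numpy as np
import pandas as pd

# Toy timeline: 4 posts, each with a 3-dimensional (reduced) embedding.
embeddings = np.random.rand(4, 3)
df = pd.DataFrame({
    "n_replies": [0, 2, 1, 5],
    "sentiment": [0.1, -0.3, 0.0, 0.7],
})

additional_features = ["n_replies", "sentiment"]
# Each step of the path now carries the embedding plus the requested extra variables.
path = np.concatenate([embeddings, df[additional_features].to_numpy()], axis=1)
print(path.shape)  # (4, 5)
```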
The .fit_transformer() method in the SentenceEncoder class tries to fine-tune the sentence transformer to new data. It is currently not fully implemented, and may need some thought about what this means - are we just going to tune the underlying BERT in SBERT to the new data?
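For reference, a minimal sketch of fine-tuning a sentence transformer with the sentence-transformers training API; the training pairs and loss are illustrative, and this is not the .fit_transformer() implementation:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy similarity-labelled pairs; in practice these would come from the task data.
train_examples = [
    InputExample(texts=["the economy is growing", "GDP rose this quarter"], label=0.9),
    InputExample(texts=["the economy is growing", "the match ended in a draw"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# One short epoch of fine-tuning, which updates the underlying transformer weights.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```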
The website says:
Older versions of Signatory supported earlier versions of Python and PyTorch.
It also included support for MacOS,
but this has now been dropped as being difficult to maintain.
Currently, the title is: Extract features from textual data using BERT and Path signature
Does it make sense?
https://github.com/datasig-ac-uk/signature_applications is a public repo with some example notebooks. Once we are done with the refactoring/review, should we move this repo there?
pytest framework for parts of the pipeline.
Signatory is currently no longer in development, so we will need to think about whether or not it makes sense to keep it as a dependency in the future. It currently holds back being able to use newer versions of PyTorch. We are keeping it for now as there is work that uses this package which needs Signatory.
Goal: create a library that does the following steps on textual data (and not only the datasets that we have worked with so far).
Several changes are needed to achieve this (not a complete list):
- classification_utils.py needs to be refactored to work with other datasets
- timeline_id and postid
Documentation in the readthedocs is not properly formatted. Need to fix the doc-strings so that they display nicer.
Also fix the warnings that are generated when creating the docs.
Add random projections as one of the options for dimensionality reduction (into the DimReduce class):
- method argument
- reducer attribute
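A minimal sketch of what the random-projection option could look like using scikit-learn; this is only an illustration, not the DimReduce implementation:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

# Toy embeddings: 100 sentences with 384-dimensional SBERT-style vectors.
embeddings = np.random.rand(100, 384)

reducer = GaussianRandomProjection(n_components=10, random_state=0)
reduced = reducer.fit_transform(embeddings)
print(reduced.shape)  # (100, 10)
```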