Language-Identification

Language identification at sentence as well as word level in both monolingual as well as code-mixed bilingual texts.


Task-1

  • Our task was to develop a model for a Language Identification service that returns the language code of the language in which a text is written. We had datasets in three languages, Spanish (ES), Portuguese (PT) and English (EN), and had to predict the language of sample texts in these three languages.

Data

  • We had 3 datasets (en, es, pt) with 2.6M lines in each file.
  • We could not process such large files at once, so we used a random sampler and drew 20,000 samples from each of the three files. (command: shuf -n 20000 input_file > output_file)
  • We assigned language IDs as {En: 0, Es: 1, Pt: 2}.
  • Please contact us for access to the data.

Pre-Processing

  • All text was converted to lower case.
  • Punctuation was removed.
  • All digits were removed from the sentences.
  • Runs of contiguous whitespace were replaced by a single space.
  • Hyperlinks were removed (a cleaning sketch follows this list).
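
As an illustration, the cleaning steps above can be combined into one small function. This is a minimal sketch with assumed regex patterns, not necessarily the exact rules used in the notebook:

    import re

    def preprocess(text):
        text = text.lower()                                 # lower case
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove hyperlinks
        text = re.sub(r"\d+", " ", text)                    # remove digits
        text = re.sub(r"[^\w\s]", " ", text)                # remove punctuation
        text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
        return text

    print(preprocess("Visit https://example.com! 123 times."))  # -> "visit times"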

Representation

  • We used TfidfVectorizer to represent the text in our corpus (a short sketch follows).
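
For instance (a minimal sketch; the vectorizer settings in the actual notebook are not stated, so the defaults below are an assumption):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["hello world", "hola mundo", "olá mundo"]  # toy stand-in sentences
    vectorizer = TfidfVectorizer()                       # default word-level features
    X = vectorizer.fit_transform(corpus)                 # sparse TF-IDF matrix
    print(X.shape)                                       # (3, vocabulary size)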

Data split

  • We split the data 80/20 for training and testing, while keeping a similar number of instances for all three languages in the test set (see the sketch after this list).
  • Train dataset: {En: 14197, Es: 15279, Pt: 15550}
  • Test dataset: {En: 4178, Es: 3503, Pt: 3576}
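
One way to approximate the balanced split described above is scikit-learn's stratified train_test_split. This sketch uses toy data and assumed variable names:

    from sklearn.model_selection import train_test_split

    # Toy stand-ins; in practice these come from the 20,000-line samples.
    sentences = ["hello there", "hola amigo", "olá amigo", "good day",
                 "buenos días", "bom dia", "hi all", "hola a todos",
                 "olá a todos", "hello world", "hola mundo", "olá mundo"]
    labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]  # {En: 0, Es: 1, Pt: 2}

    # stratify keeps the per-language proportions similar in train and test.
    X_train, X_test, y_train, y_test = train_test_split(
        sentences, labels, test_size=0.20, stratify=labels, random_state=42)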

ML Model

  • We used scikit-learn implementations of the models.
  • We used the LogisticRegression classifier with solver='lbfgs', as sketched below.
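
Putting the representation, split and classifier together might look like this (a sketch; only solver='lbfgs' is confirmed by the text above, the rest is assumed):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    vectorizer = TfidfVectorizer()
    clf = LogisticRegression(solver="lbfgs", max_iter=1000)  # max_iter is our guess

    # X_train, X_test, y_train, y_test as produced by the split sketched earlier.
    clf.fit(vectorizer.fit_transform(X_train), y_train)
    pred = clf.predict(vectorizer.transform(X_test))
    print("accuracy:", accuracy_score(y_test, pred))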

Results

  • Our system achieved an accuracy of 93%.

  • The confusion matrix for the result is as below:

                   Pred En (0)  Pred Es (1)  Pred Pt (2)
    True En (0)           3620          299          259
    True Es (1)             38         3388           77
    True Pt (2)             22           71         3483
  • The Classification report is as below:

          precision  recall  f1-score  support
    En         0.98    0.87      0.92     4178
    Es         0.90    0.97      0.93     3503
    Pt         0.91    0.97      0.94     3576
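
Both tables can be reproduced with scikit-learn's metrics (a short sketch, assuming y_test and pred from the sketches above):

    from sklearn.metrics import confusion_matrix, classification_report

    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred, target_names=["En", "Es", "Pt"]))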

Running the script and results

  • To run our model, use the command below:
  • python3 script_task1.py data.en data.es data.pt langid.test
  • In the above command, the 1st argument should be the English data file, the 2nd the Spanish, the 3rd the Portuguese, and the 4th the test file.
  • test_results.txt will contain the language label predicted for the langid.test file once it has been run through our model.
  • The test_results.txt included here contains the output on the langid.test file provided for us to test.
  • The tags here are numeric: 0: en, 1: es, 2: pt.

Task-2

  • Our task focused on developing a model to distinguish between language variants. Here we wish to distinguish between European Portuguese (PT-PT) and Brazilian Portuguese (PT-BR).

Data

  • We had 2 datasets (pt-pt and pt-br) with 1.9M and 1.5M lines respectively.
  • Since we could not process such large files at once (no high-spec system), we used a random sampler and drew 65,000 samples from pt-br and 50,000 samples from pt-pt. (command: shuf -n N input_file > output_file)
  • We assigned language IDs as {pt-br: 0, pt-pt: 1}.

Pre-Processing

  • All text was converted to lower case.
  • Punctuation was removed.
  • All digits were removed from the sentences.
  • Runs of contiguous whitespace were replaced by a single space.
  • Hyperlinks were removed.

Representation

  • We used TfidfVectorizer to represent the text in our corpus.
  • We kept the top 6,000 words for the representation of the sentences.

Data split

  • We split the data 80/20 for training and testing, while keeping a similar number of instances for both variants in the test set.
  • Total pt-br: 47554 and pt-pt: 50000.
  • Train dataset: {pt-br: 38052, pt-pt: 39991}
  • Test dataset: {pt-br: 9502, pt-pt: 10009}

ML Model

  • We used scikit-learn implementations of the models.
  • We used the LogisticRegression classifier with solver='lbfgs'.

Results

  • Our system achieved an accuracy of 81.9%.

  • The confusion matrix for the result is as below:

                    Pred pt-br (0)  Pred pt-pt (1)
    True pt-br (0)            7735            1767
    True pt-pt (1)            1761            8248
  • The Classification report is as below:

            precision  recall  f1-score  support
    pt-br        0.81    0.81      0.81     9502
    pt-pt        0.82    0.82      0.82    10009

Running the script and results

  • To run our model, use the command below:
  • python3 script_task2.py data.pt-br data.pt-pt langid-variants.test
  • In the above command, the 1st argument should be the Brazilian Portuguese (pt-br) file, the 2nd the European Portuguese (pt-pt) file, and the 3rd the test file.
  • test_results.txt will contain the language label predicted for the langid-variants.test file once it has been run through our model.
  • The test_results.txt included here contains the output on the langid-variants.test file provided for us to test.
  • The tags here are numeric: 0: pt-br, 1: pt-pt.

Task-3

  • Implement a deep learning model (recommended: a BiLSTM tagger) to detect code switching (language mixture) and return both a list of tokens and a list with one language label per token.
  • To simplify the task, our work focused on English and Spanish, so we only needed to return 'en', 'es' or 'other' for each token.

Data

  • For code switching we will focus on Spanish and English, and the data provided is derived from http://www.care4lang.seas.gwu.edu/cs2/call.html.

  • This data is a collection of tweets; in particular, there are three files for the training set and three for the validation set:

  • offsets_mod.tsv

  • tweets.tsv

  • data.tsv

  • The first file has the ID information about the tweets, together with the token positions and the gold labels.

  • The second has the IDs and the actual tweet text.

  • The third combines the previous two files, giving the tokens of each sentence with their associated gold labels. More specifically, the columns are:

    offsets_mod.tsv: {tweet_id, user_id, start, end, gold label}
    tweets.tsv:      {tweet_id, user_id, tweet text}
    data.tsv:        {tweet_id, user_id, start, end, token, gold label}

The gold labels can be one of three:

  • en
  • es
  • other

For this task, we were required to implement a BiLSTM tagger.


Approach

  • We implemented a BiLSTM model with character embeddings to see how it performs on this task.
  • To encode character-level information, we use character embeddings and an LSTM to encode every word into a vector (a sketch follows).
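
A minimal sketch of such a tagger in PyTorch is shown below. The framework choice, the layer sizes, and the one-tweet-at-a-time batching are all our assumptions, not necessarily what the notebook does:

    import torch
    import torch.nn as nn

    class CharBiLSTMTagger(nn.Module):
        # Sizes below are illustrative assumptions, not the notebook's values.
        def __init__(self, n_chars, n_tags, char_dim=25, char_hidden=25,
                     word_hidden=100):
            super().__init__()
            self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
            # Character-level LSTM: encodes each word's characters into one vector.
            self.char_lstm = nn.LSTM(char_dim, char_hidden, batch_first=True)
            # Word-level BiLSTM over the sequence of word vectors of a tweet.
            self.word_lstm = nn.LSTM(char_hidden, word_hidden,
                                     batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * word_hidden, n_tags)

        def forward(self, char_ids):
            # char_ids: (n_words, max_word_len) for a single tweet.
            emb = self.char_emb(char_ids)          # (n_words, len, char_dim)
            _, (h, _) = self.char_lstm(emb)        # h: (1, n_words, char_hidden)
            word_vecs = h.squeeze(0).unsqueeze(0)  # (1, n_words, char_hidden)
            states, _ = self.word_lstm(word_vecs)  # (1, n_words, 2*word_hidden)
            return self.out(states).squeeze(0)     # (n_words, n_tags) logits

    # Smoke test with made-up vocabulary sizes: 7 tokens, up to 12 chars each.
    model = CharBiLSTMTagger(n_chars=60, n_tags=3)
    logits = model(torch.randint(1, 60, (7, 12)))
    print(logits.shape)  # torch.Size([7, 3])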

Data processing

  • There were lines in the *_data.tsv files which had " as a token, and these broke reading of the whole file with pandas' read_csv function.
  • Hence we removed all lines containing " from both the train and the dev data files.
  • We kept each tweet together, since this adds context that the model can potentially learn from.
  • We created a list of lists of tuples: each word/token is paired with its tag in a tuple, and all the tuples of words from a single tweet go inside one list (see the sketch below).
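
A sketch of this loading step, assuming the data.tsv column layout described earlier (the column names and the absence of a header row are assumptions):

    import pandas as pd
    from io import StringIO

    cols = ["tweet_id", "user_id", "start", "end", "token", "tag"]

    def load_tweets(path):
        # Skip lines containing a bare double quote, which break read_csv.
        with open(path, encoding="utf-8") as f:
            clean = "".join(line for line in f if '"' not in line)
        df = pd.read_csv(StringIO(clean), sep="\t", names=cols)
        # One inner list per tweet, preserving the original tweet order.
        return [list(zip(g["token"], g["tag"]))
                for _, g in df.groupby("tweet_id", sort=False)]

    # tweets = load_tweets("train_data.tsv")  # [[(token, tag), ...], ...]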

Results

  • Our system achieved an accuracy of 96.2% when trained and tested on the train_data.tsv file only.

  • The confusion matrix for the result is as below:

                    Pred Other (0)  Pred En (1)  Pred Es (2)
    True Other (0)            1816           45           26
    True En (1)                 96         4651           70
    True Es (2)                 65           51         2545
  • The Classification report is as below:

           precision  recall  f1-score  support
    Other       0.92    0.96      0.94     1887
    En          0.98    0.97      0.97     4817
    Es          0.96    0.96      0.96     2661

Final test result

  • Our system achieved an accuracy of 96.5% when trained on the train_data.tsv file and tested on the dev_data.tsv file.

  • The confusion matrix for the result is as below:

                    Pred Other (0)  Pred En (1)  Pred Es (2)
    True Other (0)           17929          156          230
    True En (1)                876        45412          618
    True Es (2)                796          469        24715
  • The Classification report is as below:

           precision  recall  f1-score  support
    Other       0.91    0.98      0.95    18315
    En          0.99    0.97      0.98    46906
    Es          0.97    0.95      0.96    25980

Running the system

  • Keep the train and test datasets, in the same format as train_data.tsv, in the same directory as script_task3.py.
  • Run the command: python3 script_task3.py train_data.tsv test_data.tsv
  • It will show two images: 1. the variation of training loss and validation loss during training; 2. the confusion matrix.
  • Finally, it will print the confusion matrix and the classification report, along with the accuracy of the model.
