
This is the code for predicting the geolocation of tweets, trained on token frequencies using Decision Tree and Naïve Bayes. @TheUniversityOfMelbourne @pancak3 all rights reserved.

License: BSD 3-Clause "New" or "Revised" License



Geolocation Predictor

This is the code for predicting the geolocation of tweets, trained on token frequencies using Decision Tree and Naïve Bayes.

Implementation

Feature selection

In util/preprocessing/merge.py,

  • feature_filter drops single-character features such as [a, b, ..., n].
  • merge heuristically combines similar features into one, for example [aha, ahah, ..., ahahahaha] and [taco, tacos].
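The two steps above can be sketched as follows. This is a minimal illustration and not the code in merge.py: the laughter regex, the plural rule, and the helper `canonical` are assumptions chosen only to reproduce the examples given in the list.

```python
import re


def feature_filter(counts):
    """Drop single-character tokens such as 'a' or 'n'."""
    return {token: n for token, n in counts.items() if len(token) > 1}


def canonical(token, vocab):
    """Map a token to a representative form (hypothetical rules)."""
    # Collapse laughter variants: aha, ahah, ahahahaha -> aha
    if re.fullmatch(r"a?(ha)+h?", token):
        return "aha"
    # Fold a simple plural into its singular if the singular is a feature
    if token.endswith("s") and token[:-1] in vocab:
        return token[:-1]
    return token


def merge(counts):
    """Sum the frequencies of tokens that share a canonical form."""
    vocab = set(counts)
    merged = {}
    for token, n in counts.items():
        key = canonical(token, vocab)
        merged[key] = merged.get(key, 0) + n
    return merged


# Toy token frequencies for one user
counts = {"a": 3, "aha": 1, "ahah": 2, "ahahahaha": 1, "taco": 4, "tacos": 2, "n": 5}
merged = merge(feature_filter(counts))
print(merged)
```

Merging before training reduces the number of columns and pools the evidence of near-identical tokens into a single frequency.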

Classifier Combination

In preprocess/merge.py,

Instance manipulation

In util/train.py,

  • complement_nb uses bagging to generate multiple training datasets.
  • complement_nb also uses 42-fold cross-validation to generate multiple training datasets.
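As a rough sketch of what "generating multiple training datasets" means for each technique, the snippet below builds bootstrap samples and 42 cross-validation training splits with scikit-learn. The toy data, the number of bootstrap samples, and the variable names are assumptions; how complement_nb actually combines the two is not shown in this README.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(84, 6))  # toy token-count matrix (assumed shape)
y = rng.randint(0, 3, size=84)       # toy labels for three classes

# Bagging: each training set is a bootstrap sample drawn with replacement
bagged_sets = [
    resample(X, y, replace=True, n_samples=len(X), random_state=seed)
    for seed in range(5)
]

# 42-fold cross-validation: each training set leaves out one of 42 disjoint folds
kfold = KFold(n_splits=42, shuffle=True, random_state=0)
fold_sets = [(X[train_idx], y[train_idx]) for train_idx, _ in kfold.split(X)]

print(len(bagged_sets), len(fold_sets))
```

Bootstrap samples keep the original size but repeat some rows, while each cross-validation split is a slightly smaller subset without repeats.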

Algorithm manipulation

In util/train.py,

  • complement_nb also uses GridSearchCV to build multiple classifiers and select the best one by accuracy.
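A minimal sketch of this kind of search, assuming a Complement Naive Bayes estimator and a small hypothetical parameter grid (the grid and fold count used in util/train.py are not stated here):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import ComplementNB

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(90, 6))  # toy non-negative token counts
y = rng.randint(0, 3, size=90)       # toy labels for three classes

# Hypothetical grid: every combination becomes one candidate classifier
param_grid = {"alpha": [0.1, 0.5, 1.0], "norm": [False, True]}

search = GridSearchCV(ComplementNB(), param_grid, scoring="accuracy", cv=3)
search.fit(X, y)

best_model = search.best_estimator_  # refit on all data with the best parameters
print(search.best_params_, search.best_score_)
```

With `scoring="accuracy"`, the candidate with the highest mean cross-validated accuracy is selected and refit on the full training data.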

Dataset

Requirements

  • python3+
pip install -r requirements.txt  

Usage

Note: The code removes the old models and results on every run. MAKE SURE you have saved any models you want to keep.

Train

python run.py -t datasets/train-best200.csv datasets/dev-best200.csv  

The output looks like this:

INFO:root:[*] Merging datasets/train-best200.csv   
 42%|████████         | 1006/2396 [00:05<00:20, 92.03 users/s]  
...  
...  
[*] Saved models/0.8126_2019-10-02_20:02  
[*] Accuracy: 0.8125955095803455  
            precision    recall   f_score  
California   0.618944  0.835128  0.710966  
NewYork      0.899371  0.854647  0.876439  
Georgia      0.788070  0.622080  0.695305  
weighted     0.827448  0.812596  0.814974  

Predict

python run.py -p models/ datasets/dev-best200.csv   

The output looks like this:

...  
INFO:root:[*] Saved results/final_results.csv  
INFO:root:[*] Time costs in seconds:  
           Predict  
Time_cost   11.98s  

Score

python run.py -s results/final_results.csv  datasets/dev-best200.csv  

The output looks like this:

[*] Accuracy: 0.8224697308099213  
            precision    recall   f_score  
California   0.653035  0.852199  0.739441  
NewYork      0.747993  0.647940  0.694381  
Georgia      0.909456  0.858296  0.883136  
weighted     0.833854  0.822470  0.824577  
INFO:root:[*] Time costs in seconds:  
           Score  
Time_cost  1.48s  
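For readers who want to see how figures of this kind are derived, here is a small sketch using scikit-learn's metrics on made-up labels. It is not the scoring code in run.py, and the six toy predictions are purely illustrative.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

labels = ["California", "NewYork", "Georgia"]

# Toy ground truth and predictions (illustrative only)
y_true = ["California", "NewYork", "Georgia", "NewYork", "California", "Georgia"]
y_pred = ["California", "NewYork", "NewYork", "NewYork", "Georgia", "Georgia"]

accuracy = accuracy_score(y_true, y_pred)

# Per-class scores, in the order given by `labels`
precision, recall, f_score, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average=None
)

# Support-weighted average across classes, like the "weighted" row
w_precision, w_recall, w_f_score, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average="weighted"
)

print("[*] Accuracy:", accuracy)
for name, p, r, f in zip(labels, precision, recall, f_score):
    print(f"{name:<12} {p:.6f}  {r:.6f}  {f:.6f}")
print(f"{'weighted':<12} {w_precision:.6f}  {w_recall:.6f}  {w_f_score:.6f}")
```

Weighted recall always equals accuracy for single-label classification, which is why the two values coincide in the outputs shown in this README.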
  

Train & Predict & Score

python run.py \
 -t datasets/train-best200.csv datasets/dev-best200.csv \
 -p models/ datasets/dev-best200.csv \
 -s results/final_results.csv datasets/dev-best200.csv

Help

python run.py -h  
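The commands above suggest that each of -t, -p and -s takes two paths and that the flags can be combined in one call. A hypothetical argparse setup with that shape is sketched below; the `dest` names, metavars, and the interpretation of each second argument are assumptions, not the actual contents of run.py.

```python
import argparse


def build_parser():
    """Build a parser with the three two-argument flags seen in the usage examples."""
    parser = argparse.ArgumentParser(prog="run.py")
    # The meaning attached to each pair of paths is assumed, not documented
    parser.add_argument("-t", dest="train", nargs=2, metavar=("TRAIN_CSV", "DEV_CSV"))
    parser.add_argument("-p", dest="predict", nargs=2, metavar=("MODEL_DIR", "INPUT_CSV"))
    parser.add_argument("-s", dest="score", nargs=2, metavar=("RESULT_CSV", "TRUTH_CSV"))
    return parser


args = build_parser().parse_args([
    "-t", "datasets/train-best200.csv", "datasets/dev-best200.csv",
    "-s", "results/final_results.csv", "datasets/dev-best200.csv",
])

# Run only the stages whose flag was supplied, in a fixed order
steps = [name for name in ("train", "predict", "score") if getattr(args, name)]
print(steps)
```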

Used libraries

  • sklearn for Complement Naive Bayes, some feature selectors and other learning tools.
  • pandas and numpy for handling data.
  • tqdm for showing loop progress.
  • joblib for dumping objects to disk and loading them back.
  • nltk for identifying word types for the purpose of feature filtering.
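As a brief illustration of the joblib round trip mentioned above, the sketch below dumps an object to disk and loads it back. The dictionary is a stand-in for a fitted classifier, and the file name is hypothetical.

```python
import os
import tempfile

import joblib

# Stand-in for a fitted model; any picklable Python object works the same way
model = {"alpha": 0.5, "classes": ["California", "NewYork", "Georgia"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")  # hypothetical file name
    joblib.dump(model, path)       # serialize to disk
    restored = joblib.load(path)   # read it back into memory

print(restored == model)
```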

License

See LICENSE file.
