
yachay-ai / byt5-geotagging


Confidence- and ByT5-based geotagging model predicting coordinates from text alone.

Home Page: http://www.yachay.ai/

License: MIT License

Python 70.33% Jupyter Notebook 29.67%
python deep-learning machine-learning nlp nlp-machine-learning transformers neural-network pytorch geotagging coordinates geo-location

byt5-geotagging's Introduction


Geotagging Model

This repository is designed to support developers in building and training their own geotagging models. The geotagging model architecture provided here allows for customization and training. Additionally, we publish datasets that are well-suited for training in different geolocation detection scenarios.

The current models reach a 30 km median error (Haversine distance) on the top 10% most relevant texts. Open challenges in the repository issues invite contributions to improve the model's performance.
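The Haversine distance used in this metric can be sketched as follows (a minimal standalone implementation, not code from this repository):

```python
import math
import statistics

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def median_error_km(pairs):
    """Median Haversine error over ((pred_lat, pred_lon), (true_lat, true_lon)) pairs."""
    return statistics.median(haversine_km(*pred, *true) for pred, true in pairs)
```

For reference, one degree of longitude at the equator is about 111.19 km, which is a convenient sanity check for the implementation.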

Architecture and Training

Click to unfold geotagging model architecture diagram.

```mermaid
%%{init:{'theme':'neutral'}}%%
flowchart TD
subgraph "ByT5 classifier"
  a("Input text") --> b("Input_ids")
subgraph "byt5(T5EncoderModel)"
  b("Input_ids")  --> c("byt5.encoder.inp_input_ids")
subgraph "byt5.encoder(T5Stack)"
  c("byt5.encoder.inp_input_ids")  --> d("byt5.encoder.embed_tokens")
subgraph "byt5.encoder.embed_tokens (Embedding)"
  d("byt5.encoder.embed_tokens")  --> f("embedding")
  e("byt5.encoder.embed_tokens.inp_weights") --> f("embedding") --> g("byt5.encoder.embed_tokens.out_0")
end
  g("byt5.encoder.embed_tokens.out_0") --> h("byt5.encoder.dropout(Dropout)") --> i("byt5.encoder.block.0(T5Block)") --> j("byt5.encoder.block.1(T5Block)") & k("byt5.encoder.block.2-9(T5Block)") & l("byt5.encoder.block.10(T5Block)")
  j("byt5.encoder.block.1(T5Block)") --> k("byt5.encoder.block.2(T5Block)<br><br> ...<br><br>byt5.encoder.block.10(T5Block) ") --> l("byt5.encoder.block.11(T5Block)") --> m("byt5.encoder.final_layer_norm(T5LayerNorm)")
  m("byt5.encoder.final_layer_norm(T5LayerNorm)")-->n("byt5.encoder.dropout(Dropout)")--> o("byt5.encoder.out_0")
end
o("byt5.encoder.out_0") --> p("byt5.out_0")
end
p("byt5.out_0")-->q("(Linear)")
end
q("(Linear)") -->r("logits")
```
Train your text-to-location model Open In Colab

Dependencies

Ensure that the following dependencies are installed in your environment to build and train your geotagging model:

```
transformers==4.29.1
tqdm==4.63.2
pandas==1.4.4
pytorch==1.7.1
```

To train your geotagging model using the ByT5-encoder based approach, execute the following script:

```shell
python train_model.py --train_input_file <training_file> --test_input_file <test_file> --do_train true --do_test true --load_clustering .
```

Refer to the train_model.py file for a comprehensive list of available parameters.

Output Example

```json
{
   "text":"These kittens need homes and are located in the Omaha area! They have their shots and are spayed/neutered. They need to be gone by JAN 1st! Please Retweet to help spread the word!",
   "geotagging":{
      "lat":41.257160,
      "lon":-95.995102,
      "confidence":0.9950085878372192
   }
}
```

```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "id": 1,
      "properties": {
        "ID": 0
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [-96.296363, 41.112793],
            [-96.296363, 41.345177],
            [-95.786877, 41.345177],
            [-95.786877, 41.112793],
            [-96.296363, 41.112793]
          ]
        ]
      }
    },
    {
      "type": "Feature",
      "id": 2,
      "properties": {
        "ID": 0
      },
      "geometry": {
        "type": "Point",
        "coordinates": [-95.995102, 41.257160]
      }
    }
  ]
}
```
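The GeoJSON output pairs a bounding polygon with a point prediction. Building such a FeatureCollection from a prediction dict might look like this (a sketch; the fixed-degree bounding-box margin is an assumption, not the model's actual cluster boundary):

```python
import json

def to_geojson(pred, margin=0.25):
    """Wrap a geotagging prediction ({"lat": ..., "lon": ...}) in a
    GeoJSON FeatureCollection with an illustrative +/- `margin`-degree
    bounding box and a point feature.  GeoJSON uses [lon, lat] order."""
    lat, lon = pred["lat"], pred["lon"]
    ring = [
        [lon - margin, lat - margin],
        [lon - margin, lat + margin],
        [lon + margin, lat + margin],
        [lon + margin, lat - margin],
        [lon - margin, lat - margin],  # a polygon ring must be closed
    ]
    return {
        "type": "FeatureCollection",
        "features": [
            {"type": "Feature", "id": 1, "properties": {"ID": 0},
             "geometry": {"type": "Polygon", "coordinates": [ring]}},
            {"type": "Feature", "id": 2, "properties": {"ID": 0},
             "geometry": {"type": "Point", "coordinates": [lon, lat]}},
        ],
    }

print(json.dumps(to_geojson({"lat": 41.257160, "lon": -95.995102}), indent=2))
```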


Datasets

Our team has curated two comprehensive datasets for two distinct training approaches. These datasets are intended for use in training and validating the models. Share your training results in the repository issues.

Regions dataset Google Drive

The Regions approach focuses on a dataset drawn from the most populated regions around the world. The dataset:

  • is an annotated corpus of 500k texts with their respective geocoordinates
  • covers 123 regions
  • includes 5000 tweets per location
Seasons dataset Google Drive

The goal of the Seasons approach is to identify the correlation between the time/date of a post, its content, and the location. Time zone differences, as well as the seasonality of events, should be analyzed and used to predict the location. For example: snow is more likely to appear in the Northern Hemisphere, especially in December. Rock concerts are more likely to happen in the evening and in bigger cities, so the time of a post about a concert can be used to identify the author's time zone and narrow down the list of potential locations.

  • is a .json of >600,000 texts
  • collected over the span of 12 months
  • covers 15 different time zones
  • focuses on 6 countries (Cuba, Iran, Russia, North Korea, Syria, Venezuela)
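The time/season signals described above can be sketched as a feature-extraction step (a minimal sketch; the column and field names are assumptions, not the dataset's actual schema):

```python
import pandas as pd

# Hypothetical frame; the real Seasons dataset is a JSON with its own field names.
df = pd.DataFrame({
    "text": ["First snow of the year!", "Great concert tonight"],
    "timestamp": ["2022-12-03T08:15:00Z", "2022-07-14T21:40:00Z"],
})
ts = pd.to_datetime(df["timestamp"], utc=True)
df["hour"] = ts.dt.hour    # proxy for local activity patterns / time zone
df["month"] = ts.dt.month  # proxy for seasonality (e.g., snow in December)
df["is_winter_nh"] = df["month"].isin([12, 1, 2])  # Northern-Hemisphere winter
```

Features like these can then be concatenated with the text representation before the prediction head.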

Your custom data. The geotagging model supports training and testing on custom datasets. Prepare your data in CSV format with the following columns: text, lat, and lon.
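Preparing such a file might look like this (a minimal sketch; the file name and sanity checks are illustrative):

```python
import pandas as pd

rows = [
    {"text": "Walking along the Seine this morning", "lat": 48.8566, "lon": 2.3522},
    {"text": "Best ramen in Shibuya", "lat": 35.6595, "lon": 139.7005},
]
df = pd.DataFrame(rows, columns=["text", "lat", "lon"])

# Basic sanity checks before training
assert df["lat"].between(-90, 90).all()
assert df["lon"].between(-180, 180).all()

df.to_csv("train.csv", index=False)
```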

Confidence and Prediction

The geotagging model incorporates confidence estimation to assess the reliability of predicted coordinates. The confidence field in the output indicates prediction confidence, ranging from 0.0 to 1.0; higher values indicate greater confidence. For details on confidence estimation and how to use the model for geotagging predictions, refer to the inference.py file, which provides an example script demonstrating the model architecture and the integration of confidence estimation.
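Given predictions shaped like the output example above, selecting only high-confidence results might look like this (a sketch; the 0.9 threshold is an arbitrary assumption):

```python
def filter_confident(predictions, threshold=0.9):
    """Keep predictions whose confidence meets the threshold.

    `predictions` is a list of dicts shaped like the output example:
    {"text": ..., "geotagging": {"lat": ..., "lon": ..., "confidence": ...}}
    """
    return [p for p in predictions if p["geotagging"]["confidence"] >= threshold]

preds = [
    {"text": "a", "geotagging": {"lat": 41.26, "lon": -96.00, "confidence": 0.995}},
    {"text": "b", "geotagging": {"lat": 10.00, "lon": 20.00, "confidence": 0.41}},
]
print(len(filter_confident(preds)))  # -> 1
```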

Welcome!


Feel free to explore the code, adapt it to your specific requirements, and integrate it into your projects. If you have any questions or require assistance, please don't hesitate to reach out. We highly appreciate your feedback and are dedicated to continuously enhancing the geotagging models.

byt5-geotagging's People

Contributors

alinapark · ingakaspar · lavriz · morzhenovsky · nataligmhall · orzhan · romandkane


byt5-geotagging's Issues

`[Challenge]` 12 months of data

The objective of this challenge is to train a deep learning model to identify the correlation between the time/date of a post, its content, and the location. Time zone differences, as well as the seasonality of events, should be analyzed and used to predict the location.

For example: snow is more likely to appear in the Northern Hemisphere, especially in December. Rock concerts are more likely to happen in the evening and in bigger cities, so the time of a post about a concert can be used to identify the author's time zone and narrow down the list of potential locations.

The dataset provided:

  • is a .json of >600,000 texts
  • was collected over the span of 12 months
  • covers 15 different time zones
  • focuses on 6 countries (Cuba, Iran, Russia, North Korea, Syria, Venezuela)

The dataset is here

Deliverable

  • A model that takes a text as input and returns coordinates as output
  • Evaluation metrics obtained on the development dataset, including Mean Absolute Error in kilometers

We will evaluate the model using the test dataset that is not shared here.

Additional notes

Contact us at [email protected] for any questions or additional requests.

Thank you for contributing to Open Source and making a difference! ʕ•́ᴥ•̀ʔ

Dependencies

I'm trying to get this to run but can't figure out a set of working dependencies. Could you share a list?

`[Challenge]` Metadata and Clusters

The objective of this challenge is to train a deep learning model to predict the coordinates (or region-cluster coordinates) of texts while improving on Yachay's original infrastructure.

We offer an annotated dataset for training and testing, comprising texts and their region cluster IDs, coordinates, post metadata, and more. We recommend considering the post metadata field, but you are free to exclude or include any of the provided dataset fields if it leads to improved validation metrics on your end. Regression, classification, multi-task, or otherwise: all solutions and suggestions are welcome!

The Yachay team will evaluate the model using a test dataset that is not shared here.

Note: the metadata-and-clusters challenge allows for a wider number and variety of experiments. There are no hard MSE or EER requirements; we're looking for innovative ideas for infrastructure development.

The provided dataset is here; it:

  • is an annotated corpus of ~600k texts, with respective regions (clusters), timestamps, and over 40k user_ids
  • has a median of 415 texts per region (cluster)
  • includes at least 6 texts per user
  • comes with an additional list of cluster_ids and cluster coordinates for mapping texts to coordinates
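Mapping texts to coordinates through such a cluster list might be sketched as follows (all field names and values here are hypothetical, not the dataset's actual schema):

```python
# Hypothetical cluster lookup: cluster_id -> (lat, lon)
clusters = {
    0: (41.2572, -95.9951),
    1: (48.8566, 2.3522),
}

texts = [
    {"text": "kittens in Omaha", "cluster_id": 0},
    {"text": "morning by the Seine", "cluster_id": 1},
]

# Attach the cluster's coordinates to each text record
for t in texts:
    t["lat"], t["lon"] = clusters[t["cluster_id"]]

print(texts[0]["lat"])  # -> 41.2572
```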

As for the deliverables, we are looking for:

  • a model that takes a text as input and returns coordinates as output
  • evaluation metrics obtained on the development dataset, including Mean Absolute Error in Haversine distance

Send a Pull Request with your results, comment here for questions, or ping on Discord for requests!

Thank you for contributing to Open Source and making a difference! ʕ•́ᴥ•̀ʔ

`[Challenge]` Top regions

Challenge 1

This competition takes on the goal of improving upon Yachay.ai's infrastructure by training a deep learning model that predicts the coordinates (latitude, longitude) of individual texts.

The first suggested methodology is to train the model on the annotated dataset of texts posted from 123 populated regions around the world.

There are no hard MSE or EER requirements; we're looking for innovative ideas for infrastructure development.

The dataset provided:

  • is an annotated corpus of 500k texts with their respective geocoordinates
  • covers 123 regions
  • includes 5000 tweets per location

The dataset is here

Deliverable

  • A model that takes a text as input and returns coordinates as output
  • Evaluation metrics obtained on the development dataset, including Mean Absolute Error in kilometers

We will evaluate the model using the test dataset that is not shared here.

Additional notes

Contact us at [email protected] for any questions or additional requests.

`[Challenge]` Confident Predictions Selection

Confident Predictions Selection is a bounty challenge 💸

Your task

Design a model capable of selecting the top 10% of predictions that exhibit the smallest mean distance.
You may work with the text data, the raw predictions, or both. Additionally, you may perform any form of aggregation or transformation on the raw predictions as you see fit.

Validation Metrics

The primary metric for this challenge is the mean distance of the selected top 10% predictions. Your objective is to minimize this value.
As a secondary metric, we designed the Class Representation Index (CRI). In essence, CRI compares the class distribution before and after filtering, giving a higher weight to classes that were initially larger. The primary purpose of this metric is to detect cases where a class is significantly less represented after filtering compared to its original size.
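The primary metric can be sketched as follows (the ranking-by-score selection is illustrative; see the dedicated repository for the official evaluation):

```python
def mean_distance_of_top_fraction(scores, distances, fraction=0.10):
    """Mean distance over the `fraction` of predictions with the highest scores.

    `scores` are the selector's confidence values; `distances` are the
    corresponding prediction errors (e.g., Haversine distance in km).
    """
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, distances), key=lambda t: t[0], reverse=True)
    top = ranked[:k]
    return sum(d for _, d in top) / len(top)

# 10 predictions: the most confident one also has the smallest error
scores = [0.99, 0.95, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
dists  = [12.0, 30.0, 500, 800, 900, 950, 1000, 1100, 1200, 1300]
print(mean_distance_of_top_fraction(scores, dists))  # -> 12.0
```

A good selector makes this value much smaller than the mean distance over all predictions.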

Please see the dedicated repository for instructions.
