
mar-muel / local-geocode


Simple library for efficient geocoding without making API calls

License: MIT License

Python 100.00%
geocoding geoparser geolocation geocode geonames parser twitter countries

local-geocode's Introduction

Local-geocode 🌎

This is a very simple geocoding library that runs fully locally, without calling any APIs, and therefore has no API-imposed limits on how much you can process. It runs very fast because it uses FlashText, an efficient in-memory data structure for keyword matching. It uses data from http://www.geonames.org/.
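To give a feel for why an in-memory lookup is fast, here is a minimal sketch of dictionary-based place-name matching. It is not FlashText's implementation (FlashText is trie-based), and the two-entry `GAZETTEER`, `MAX_WORDS`, and `find_places` are hypothetical names made up for illustration; the real library builds its lookup from GeoNames data.

```python
import re

# Hypothetical miniature gazetteer (illustration only).
GAZETTEER = {
    "tel aviv": {"name": "Tel Aviv", "country_code": "IL"},
    "mangalore": {"name": "Mangalore", "country_code": "IN"},
}
# Longest place name, measured in words, so we know how far to look ahead.
MAX_WORDS = max(len(key.split()) for key in GAZETTEER)


def find_places(text):
    """Greedy longest-match lookup of known place names in free text."""
    tokens = re.findall(r"\w+", text.lower())
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for size in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate in GAZETTEER:
                matches.append(GAZETTEER[candidate])
                i += size  # skip past the matched words
                break
        else:
            i += 1  # no place name starts at this token
    return matches


print(find_places("Living in Tel Aviv, sometimes Mangalore"))
```

Each lookup is a hash-table probe on a short phrase, so the cost grows with the length of the input text rather than with the number of known places.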

This project is mainly used to decode the "user.location" field of tweets, but in principle it can be used on any raw address/location text field. Note that if you need very precise geographical information, it is better to use one of the many available APIs. By default, this repo only detects places with more than 30k inhabitants.

I have compared the predictions of local-geocode with those of geopy for 500 Twitter user locations. Local-geocode performs significantly better (85% accuracy) than geopy (64% accuracy) for this use case. Read more about the benchmark here.

Install

pip install local-geocode

Example usage

Local-geocode can parse arbitrary location names in many languages, as well as numerous alternative names of places, and returns geographic information for each match.

from geocode.geocode import Geocode

gc = Geocode()
gc.load()  # load geonames data

mydata = ['Tel Aviv', 'Mangalore 🇮🇳']

for input_text in mydata:
    locations = gc.decode(input_text)
    print(locations)

[
    {
        "name": "Tel Aviv",
        "official_name": "Tel Aviv",
        "country_code": "IL",
        "longitude": 34.780570000000004,
        "latitude": 32.08088,
        "geoname_id": "293397",
        "location_type": "city",
        "population": 432892
    }
]
[
    {
        "name": "Mangalore",
        "official_name": "Mangalore",
        "country_code": "IN",
        "longitude": 74.85603,
        "latitude": 12.91723,
        "geoname_id": "1263780",
        "location_type": "city",
        "population": 417387
    },
    {
        "name": "\ud83c\uddee\ud83c\uddf3",
        "official_name": "Republic of India",
        "country_code": "IN",
        "longitude": 79.0,
        "latitude": 22.0,
        "geoname_id": "1269750",
        "location_type": "country",
        "population": 1352617328
    }
]
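Since each call returns a list of dictionaries, downstream code usually wants just the first entry. The sketch below works on a trimmed copy of the Mangalore output shown here, so it runs without the library installed. `best_match` is a hypothetical helper, and it assumes that an input with no recognizable place yields an empty list.

```python
# Trimmed copy of the second result list from the example output.
locations = [
    {"name": "Mangalore", "country_code": "IN", "longitude": 74.85603,
     "latitude": 12.91723, "location_type": "city", "population": 417387},
    {"name": "\U0001F1EE\U0001F1F3", "country_code": "IN", "longitude": 79.0,
     "latitude": 22.0, "location_type": "country", "population": 1352617328},
]


def best_match(locations):
    """Return the first (highest-priority) result, or None if the list is empty."""
    return locations[0] if locations else None


top = best_match(locations)
if top is not None:
    print(top["name"], top["country_code"], (top["latitude"], top["longitude"]))
```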

Usage

The easiest way to integrate local-geocode into your project is to run pip install local-geocode. Alternatively, you can clone this repository and copy the geocode folder into your project.

Configuration

When installed with pip, local-geocode comes packaged with 2 pickle files that were generated using the default configuration. You can, however, change the configuration and then re-compute the pickle files for your needs.

The Geocode() initializer accepts the following arguments:

  • min_population_cutoff (default: 30k): Places below this population size are excluded
  • large_city_population_cutoff (default: 200k): Cities with a population size larger than this will be prioritized. Example: "Los Angeles, USA" will result in "Los Angeles" as the first result, and not "USA".
  • location_types: A list of the location types to include in the results. By default, all location types are used (i.e. ['city', 'place', 'country', 'admin1', 'admin2', 'admin3', 'admin4', 'admin5', 'admin6', 'admin_other', 'continent', 'region']).

Example:

from geocode.geocode import Geocode

gc = Geocode(min_population_cutoff=100000)
gc.load()  # downloads geonames data (~1.2GB), parses data, generates pickle files in <package folder>/geocode/data for new configuration

(This may take 1-2min to run)

Prioritization

If multiple locations are detected in an input string, local-geocode sorts the output by the following prioritization:

  1. Large cities (population size > large_city_population_cutoff)
  2. States/provinces (admin level 1)
  3. Countries
  4. Places (population size <= large_city_population_cutoff)
  5. Counties (admin levels > 1)
  6. Continents
  7. Regions
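The ordering above can be sketched as a sort key over result dictionaries. This is an illustration of the documented order, not the library's own code: `priority_rank` and `sort_by_priority` are hypothetical helpers, and the sketch assumes that both 'city' and 'place' records are split by the population cutoff and that 'admin_other' ranks with the other admin levels above 1.

```python
def priority_rank(location, large_city_population_cutoff=200_000):
    """Map a result dict to its rank in the documented order (1 = highest)."""
    loc_type = location["location_type"]
    population = location.get("population") or 0
    if loc_type in ("city", "place"):
        # Assumption: cities and places share the population split.
        return 1 if population > large_city_population_cutoff else 4
    ranks = {"admin1": 2, "country": 3, "continent": 6, "region": 7}
    if loc_type in ranks:
        return ranks[loc_type]
    if loc_type.startswith("admin"):
        return 5  # admin2..admin6 and, by assumption, admin_other
    return 8  # unknown types sort last


def sort_by_priority(locations, large_city_population_cutoff=200_000):
    """Stable sort of results by documented priority."""
    return sorted(
        locations,
        key=lambda loc: priority_rank(loc, large_city_population_cutoff),
    )


candidates = [
    {"location_type": "country", "population": 1352617328},
    {"location_type": "city", "population": 417387},
]
print([loc["location_type"] for loc in sort_by_priority(candidates)])
```

With the default cutoff of 200k, the city record sorts ahead of the country record, matching the "Los Angeles, USA" behaviour described in the configuration section.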

Parallelized

If you have a large number of texts to decode, it might make sense to use decode_parallel which runs decode in parallel:

from geocode.geocode import Geocode

gc = Geocode()
gc.load()  # load geonames data

# a large number of items
mydata = ['Tel Aviv']  # ... followed by many more location strings
num_cpus = None  # By default use all CPUs

locations = gc.decode_parallel(mydata, num_cpus=num_cpus)
print(locations)

Contact

Please open an issue if you run into problems!

local-geocode's People

Contributors

francoispichard, mar-muel

local-geocode's Issues

Way to change prioritisation

Hello, I'm having the following issue with a local-geocode search. My query is how to go about the following.

Input:
Toledo, Spain
Expected Output:
Toledo, Spain [documented on geonames here]
Actual Output:
Toledo, Ohio [documented on geonames here]

I expect this has something to do with the prioritisation: both Toledos were discovered, but the Ohio one is being prioritised due to its larger population? If so, it would be great to have a mechanism to fall back to some other type of prioritisation, or even to filter on the country via a parameter in the decode function. Thanks!
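The country filter requested here could be approximated outside the library by post-filtering the returned list. This is a sketch, not an existing feature: `filter_by_country` is a hypothetical helper, the candidate list is made up for illustration, and it only helps if decode actually returns both Toledos, which this issue conjectures but does not confirm.

```python
def filter_by_country(locations, country_code):
    """Keep only results whose country_code matches, preserving their order."""
    return [loc for loc in locations if loc.get("country_code") == country_code]


# Hypothetical candidates, in the order the issue describes (Ohio first).
candidates = [
    {"name": "Toledo", "country_code": "US", "location_type": "city"},
    {"name": "Toledo", "country_code": "ES", "location_type": "place"},
]

spanish = filter_by_country(candidates, "ES")
print(spanish[0] if spanish else None)
```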

Pandas==1.5.3 for rebuilding the geocode dataset

Hi,

Could you update your project for the new usage of the DataFrame append function?

"AttributeError: 'DataFrame' object has no attribute 'append'. Did you mean: '_append'?"

Pinning pandas to 1.5.3 is starting to add great complexity to the work.

Great project,
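For reference, the error comes from `DataFrame.append`, which was deprecated in pandas 1.4 and removed in 2.0. The usual replacement is `pd.concat`. The sketch below shows the generic migration on a made-up two-column frame; it is not taken from the library's dataset-building code, so the column names and rows are assumptions.

```python
import pandas as pd

# Made-up frame for illustration; not the library's real schema.
df = pd.DataFrame({"name": ["Tel Aviv"], "country_code": ["IL"]})
new_row = {"name": "Mangalore", "country_code": "IN"}

# pandas < 2.0 only:
#   df = df.append(new_row, ignore_index=True)
# Works on both old and new pandas versions:
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

print(df)
```

When many rows are added in a loop, collecting them in a list and calling `pd.concat` once at the end avoids copying the frame on every iteration.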

Recommended way for use in production?

When deploying a project that includes local-geocode to production, I am wondering if there are any specific considerations to take into account? So far, I am struggling to be able to change the Geocode initialisation during production. Generally, I don't mind adding time to my build-time. I am wondering what sort of workflow is best here.

Might it be:

...
-> Python dependencies installed, including local-geocode
-> Some way to run the local-geocode initialisation
-> Dump the output of the load to S3/other storage DB or keep this on the server?
-> Continue build and deploy

Thank you
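One way to picture the "run the local-geocode initialisation" step is a small build-time script. The README states that calling load() with a non-default configuration downloads the GeoNames data and writes pickle files into the package's data folder, so running it once during the build should leave those files in the deployed environment. This is a sketch under that assumption: `warm_geocode_data` and its `geocode_cls` parameter are hypothetical, and whether a later load() with the same configuration reuses the files without downloading again has not been verified here.

```python
def warm_geocode_data(geocode_cls=None, **config):
    """Instantiate the geocoder with `config` and call load() once."""
    if geocode_cls is None:
        # Imported lazily so this module can be imported (and tested)
        # without local-geocode installed.
        from geocode.geocode import Geocode as geocode_cls
    gc = geocode_cls(**config)
    gc.load()  # per the README, this generates the pickle files
    return gc


# In a build script one would call, for example:
#   warm_geocode_data(min_population_cutoff=100000)
```

The application would then construct Geocode with the same arguments at start-up.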

Odd number of countries retrieved during retrieval of data

Hey there,
I was reinitialising the Geocode class, when I noticed that the number of countries returned by the file is much greater than what I would expect. Screenshot below of the data retrieved and the count of the countries. I would expect this number to be more like 200 so it's closer to the countries listed on this page https://www.geonames.org/countries/

[Image: Screenshot 2024-02-04 at 18 19 45]

This is a fantastic library btw, thank you for providing it!
