indecol / country_converter
The country converter (coco) - a Python package for converting country names between different classification schemes.
License: GNU General Public License v3.0
Maybe add "Programming Language :: Python :: 3" or similar to setup.py.
Issue #1
cc.convert(names='UK',to='name_short')
gives you the following result
WARNING:root:UK not found in ISO2
'not found'
The code looks only at ISO2 for two-letter input, even though the regex check would have matched this correctly.
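One way to get the behaviour asked for here is to fall back to the regex search whenever a two-letter input is not a valid ISO2 code. The sketch below is not the package's code: the miniature data set, its patterns and the function name to_short_name are hypothetical stand-ins.

```python
import re

# Hypothetical miniature data set: ISO2 code, short name, name regex.
COUNTRIES = [
    ("GB", "United Kingdom", r"united.?kingdom|britain|^u\.?k\.?$"),
    ("DE", "Germany", r"germany|deutschland"),
]


def to_short_name(name):
    """Look up a two-letter input as ISO2 first, then fall back to the regexes."""
    text = name.strip()
    if len(text) == 2:
        for iso2, short, _ in COUNTRIES:
            if iso2 == text.upper():
                return short
    # Fallback: every input, including unmatched two-letter ones, is tried as a name.
    for _, short, pattern in COUNTRIES:
        if re.search(pattern, text, flags=re.IGNORECASE):
            return short
    return "not found"
```

With this ordering, 'GB' is resolved through the ISO2 table while 'UK' is only resolved by the name regex.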
Issue #2
cc.convert(names='macau',to='name_short') gives macao
The ideal name is Macau: https://en.wikipedia.org/wiki/Macau
Macau also works with another python package "CountryInfo"
If "excluding" comes before a country name, that country is still matched,
e.g. "Asia excluding China" matches China.
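A possible way to handle this is to drop everything from an exclusion keyword onwards before the name is matched. The following is only a sketch under that assumption; the patterns, the keyword list and the function name match_countries are made up for illustration and are not taken from the package.

```python
import re

# Hypothetical patterns; the package's real regexes are more elaborate.
PATTERNS = {"China": r"china", "India": r"india"}

# Everything from an exclusion keyword onwards is dropped before matching.
EXCLUSION = re.compile(r"\b(excl\w*|without|w/o)\b.*", flags=re.IGNORECASE)


def match_countries(name):
    """Return the countries matched by the part of `name` before any exclusion."""
    kept = EXCLUSION.sub("", name)
    return [country for country, pattern in PATTERNS.items()
            if re.search(pattern, kept, flags=re.IGNORECASE)]
```

Under these assumptions "Asia excluding China" no longer matches China, while "China excluding Hong Kong" still does.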
I am not sure whether this is a UNregion issue or a country-converter issue, but running:
countries = ['AQ', 'GS', 'IO', 'CX']
country_converter.convert(names=countries,src='ISO2' , to='UNregion')
I get NaN
AQ - Antarctica, GS - South Georgia and the South Sandwich Islands,
IO - British Indian Ocean Territory, CX - Christmas Island
Add regions from MESSAGE:
http://www.iiasa.ac.at/web/home/research/researchPrograms/Energy/MESSAGE-model-regions.en.html
"In April 1964, the republic (Zanzibar) merged with mainland Tanganyika. This United Republic of Tanganyika and Zanzibar was soon renamed, blending the two names, as the United Republic of Tanzania,"
https://en.wikipedia.org/wiki/Zanzibar
Thank you for this package to @konstantinstadler
I have a quick question for whoever can answer: can we do this using this package? I have a dataframe (in pandas format) with a list of countries in one column. There could be multiple duplicates. Example:
df = pd.DataFrame({"Country":["Afghanistan ","Afghanistan ", " Afghanistan", "Åland Island", "Åland Islan"], "Values":[23,45,46,787,875]})
As you can see, some of them have leading or trailing whitespace, some names are incomplete, etc. I know we can fix this using regex, but I am wondering whether you have already done this for us. Let's assume I want my country names fixed according to the UN member list. I want an extra column with all errors fixed, i.e. columns = Country, Values, Country_NameFixed
Goal: I want to input a df and get a pd.DataFrame back at the end.
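As a rough illustration of the clean-up described above, the sketch below strips whitespace and snaps each entry to the closest name in a reference list using the standard library's difflib. This is not how the package matches names (it uses regular expressions); the reference list and the function name fix_name are assumptions made for the example.

```python
import difflib

# Hypothetical reference list; in practice this would come from a
# classification such as the package's short names.
REFERENCE = ["Afghanistan", "Åland Islands", "Albania"]


def fix_name(raw, cutoff=0.8):
    """Strip whitespace and snap `raw` to the closest reference name, if any."""
    cleaned = raw.strip()
    matches = difflib.get_close_matches(cleaned, REFERENCE, n=1, cutoff=cutoff)
    return matches[0] if matches else "not found"


countries = ["Afghanistan ", "Afghanistan ", " Afghanistan", "Åland Island", "Åland Islan"]
fixed = [fix_name(c) for c in countries]
```

With pandas, the same list comprehension over df["Country"] could be assigned to a new Country_NameFixed column, which gives a DataFrame in and a DataFrame out.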
Could you maybe add some instructions on how to run the tests? This would be helpful for people not familiar with py.test.
I used (after cloning the repo):
python3 -m venv venv
./venv/bin/pip install -e .
./venv/bin/pip install pytest
./venv/bin/pytest --verbose
Provide a possibility to do something like
coco OECD
and get all OECD countries.
coco EXIO1 should give a unique list of EXIO1 countries ...
It seems the Working Paper editor added the paper to ResearchGate and assigned a doi:
Not sure if it's better to use this doi; adding it to the BibTeX file seems to stop the wiod.org url from being displayed when running pandoc.
It's the first time I have noticed a RG doi, and there seem to have been some questions around them:
http://blog.impactstory.org/researchgate-doi/
So it's probably ok to leave the references as is, I just thought I'd mention it.
Some of them work off the code.
e.g. if I input US or USA to my bot it prints "not found", but if I input United States of America it works (the same issue occurs with the UK).
The error in my console is: WARNING:root: not found not found in regex
Also, if I input South Korea I get the error: TypeError: can only concatenate str (not "list") to str
But it works perfectly with North Korea, even when I call it DPRK (which relates to the previous problem with the USA and the UK).
Furthermore, it doesn't work with Ireland. I am completely confused by this one, since it's neither an abbreviation nor two words long like South Korea.
The error is the same as the first one: WARNING:root:not found not found in regex
I can provide my code if you want; just let me know.
It would be really nice if you could help me solve all or some of these issues.
Thanks
I receive the following error when trying to convert a list of countries:
  File "sample_scripts.py", line 21, in <module>
    names=list(data_frame["location"]), to="name_short"
  File "/Users/simon/Academic/dev/env/lib/python3.7/site-packages/country_converter/country_converter.py", line 319, in convert
    return coco.convert(*args, **kargs)
  File "/Users/simon/Academic/dev/env/lib/python3.7/site-packages/country_converter/country_converter.py", line 540, in convert
    na=False)][to].values]
  File "/Users/simon/Academic/dev/env/lib/python3.7/site-packages/pandas/core/strings.py", line 1954, in wrapper
    return func(self, *args, **kwargs)
  File "/Users/simon/Academic/dev/env/lib/python3.7/site-packages/pandas/core/strings.py", line 2763, in contains
    self._parent, pat, case=case, flags=flags, na=na, regex=regex
  File "/Users/simon/Academic/dev/env/lib/python3.7/site-packages/pandas/core/strings.py", line 441, in str_contains
    regex = re.compile(pat, flags=flags)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 234, in compile
    return _compile(pattern, flags)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 924, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 420, in _parse_sub
    not nested and not items))
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 645, in _parse
    source.tell() - here + len(this))
As part of my review (openjournals/joss-reviews#332) a question on licensing:
More than anything, the choice of license is also cultural (https://twitter.com/hadleywickham/status/873554179792355328) and it seems that permissive licenses are more popular for Python projects. Given that country_converter is "infrastructure" and not a scientific model, maybe it could make sense to release it under BSD or MIT? This could improve its adoption; I have noticed that some Python projects are reluctant to include GPL-ed code. There are of course good reasons to choose GPL, and I have used it myself in scientific projects.
If you chose GPL because it in part builds on https://github.com/vincentarelbundock/countrycode, then maybe its authors should be mentioned in the LICENSE file?
Currently, when no parameters are given, coco is not printing anything.
Instead, this should give a help message.
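With argparse, printing the help text when no input is given takes only a small check after parsing. The sketch below assumes a minimal, hypothetical command-line interface (one positional names argument and a --to option); it is not the package's actual CLI.

```python
import argparse


def build_parser():
    # Hypothetical minimal interface; the real coco CLI may differ.
    parser = argparse.ArgumentParser(prog="coco", description="Convert country names.")
    parser.add_argument("names", nargs="*", help="country names or codes to convert")
    parser.add_argument("--to", default="ISO3", help="target classification")
    return parser


def main(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.names:
        # No input at all: show the help text instead of exiting silently.
        parser.print_help()
        return 1
    print(" ".join(args.names), "->", args.to)
    return 0
```

Calling main([]) prints the usage message and returns a non-zero status, so a bare call to the command no longer produces empty output.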
According to
https://unstats.un.org/unsd/tradekb/Knowledgebase/Country-Code
and
https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3
PSE is the valid ISO3 code for Palestine (instead of PAL).
Needs to change.
WARNING:root: Yugoslavia not found in regex
I think "Heard and Mc Donald Islands" should not have a space after the "Mc".
Maybe you could add .ipynb_checkpoints to your .gitignore? I think this is just storing the manually saved versions.
It's great you have UN recognised as an option, but I think you need one more: all countries except those that have become obsolete. One reason is that if you list the countries in a region using the "all" option, e.g. Europe, you could get duplicates, e.g. Channel Islands as well as Jersey and Guernsey. UN recognised can be too strict, e.g. it excludes Taiwan.
UNStats https://unstats.un.org/unsd/methodology/m49/ has Jersey and Guernsey. Remove or mark obsolete Channel Islands and add Jersey and Guernsey.
By default, this package logs to the root logger using logging.warning().
How can this be disabled without affecting the root logger of the rest of the app?
Use install_requires in setup.py for the minimal requirements,
and a requirements file for the full development environment (tests).
Input "Congo" returns ISO3 code "COD" while you would expect ISO3 code "COG".
Latvia joined the OECD in 2016,
could you please update your data?
Should I write a pull request?
After #24 I wanted to compare country_converter (which I use a lot as coco) with pycountry, which covers only ISO3 codes:
import pandas as pd
import country_converter
import pycountry

data = pd.read_table(country_converter.COUNTRY_DATA_FILE, sep='\t', encoding='utf-8')
for _, (code, name) in data[['ISO3', 'name_short']].iterrows():
    try:
        pycountry.countries.get(alpha_3=code).name
    except (KeyError, AttributeError):
        # depending on the pycountry version, a missing code raises KeyError or returns None
        print(code, name)
This gives these (non-standard) codes:
BA1 British Antarctic Territories
CHI Channel Islands
KSV Kosovo
ANT Netherlands Antilles
EAT Tanganjika
EAZ Zanzibar
Not sure whether these are partly former ones. For Kosovo, XK and XKK seem to be used as placeholders:
https://geonames.wordpress.com/2010/03/08/xk-country-code-for-kosovo/
Maybe it's worth making it explicit in the docs that codes are amended.
Thanks for your project!
I think Kosovo's 3 letter code should be XKX not KSV.
A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
Could you elaborate a bit more on this in the Readme? Mentioning ISO codes, numbers as examples might be useful. I don't think it needs to be as detailed as in the paper but some example usages might be helpful.
Thanks for this useful library.
I realise this might be difficult, but Pandas and its dependencies are quite heavyweight, so I was wondering if the code could be written in such a way as to not require Pandas.
If country names are stated multiple times, multiple country codes show.
e.g. China (P.R. of China and Hong Kong, China) becomes '[''CHN'', ''HKG'']' in ISO3 codes
Maybe add links to related software like
pycountry, iso-3166 or countrynames (for non-English names)
This could help potential users to decide which one they need and see advantages and features of country_converter
NA is Namibia, but when I run:
country_converter.convert(names='NA', src='ISO2', to='UNregion')
I get "not found" in country-converter (0.5.2).
First of all thanks for building this. I've been reconciling some geographic data by hand the tedious way so far and this looks like it's going to be a huge time saver. The one country string from my data source that country_converter didn't recognize was "Republic of Ireland", which is one of the official names of Ireland. See: [ https://en.wikipedia.org/wiki/Republic_of_Ireland ]. For myself, I'm just going to put in an if statement to catch it, but I thought you might want to know.
Thanks again!
Great job on this library. Easy to use and doing what it says on the can.
I had an issue with Congo vs DRC though. See the reproducible snippet below:
import country_converter as coco
some_names = ['Congo, Dem. Rep.', 'Congo, Rep.']
standard_names = coco.convert(names=some_names, to='name_short')
In [96]: standard_names
Out[96]: [['Congo Republic', 'DR Congo'], 'Congo Republic']
expected output is ['DR Congo', 'Congo Republic']
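A nested list like the one above is what one would expect when a single input matches more than one country regex. The sketch below reproduces that effect with hypothetical patterns (not the package's actual regexes) and shows how a negative lookahead can keep the Republic of the Congo pattern from also matching names that mention "Dem.".

```python
import re

# Hypothetical patterns for illustration only, not the package's regexes.
BROAD = {
    "Congo Republic": r"congo",
    "DR Congo": r"congo.*dem|dem.*congo|\bdrc?\b",
}
STRICT = {
    # The lookahead rejects any name that also mentions "dem" (Democratic).
    "Congo Republic": r"^(?!.*\bdem).*congo",
    "DR Congo": BROAD["DR Congo"],
}


def matches(name, patterns):
    """Return every short name whose pattern matches `name`."""
    return [short for short, pattern in patterns.items()
            if re.search(pattern, name, flags=re.IGNORECASE)]
```

With BROAD, 'Congo, Dem. Rep.' hits both entries; with STRICT, each of the two inputs from the snippet resolves to exactly one country.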
With regards to the "Community guidelines" question in openjournals/joss-reviews#332
Community guidelines: Are there clear guidelines for third parties wishing to
- Contribute to the software
- Report issues or problems with the software
- Seek support
Maybe adding a CONTRIBUTING.md file would be useful? Also, mentioning the issue tracker in the README.md could be helpful, as people might come across the project e.g. from its PyPI page.
I would also be interested to read what kind of additional groupings you would consider for inclusion.
Another minor question that came up while testing the library: for functions like EU28in, I at first assumed this would be a function to check membership of a group.
Maybe something like Pandas astype could work as well, e.g. EU28as?
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html
But this certainly depends on familiarity and convention ...
Huge fan of this library, but it's not currently matching the Netherlands Antilles:
import country_converter as cc
>>> cc.convert(names=['Netherlands Antilles'])
WARNING:root:Netherlands Antilles not found in regex
'not found'
The issue seems to be the regex, although it's not entirely clear to me why since regex101 thinks it should work:
>>> import re
>>> re.match('^(?=.*\bant).*(neth.*|dutch)', 'netherlands antilles')
>>>
A simpler regex without the lookahead seems to work fine:
>>> re.match('^(neth.*|dutch).*ant', 'netherlands antilles')
<re.Match object; span=(0, 15), match='netherlands ant'>
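One likely explanation for the interactive session above (though not necessarily for the package's own behaviour, since patterns loaded from a data file do not pass through Python string escaping) is that the pattern was typed without the r prefix. In a normal string literal, "\b" is a backspace character rather than a regex word boundary. A minimal check:

```python
import re

name = "netherlands antilles"

# Without the r prefix, Python turns "\b" into a backspace character (\x08)
# before the regex engine sees it, so the lookahead can never succeed here.
plain = re.match('^(?=.*\bant).*(neth.*|dutch)', name)

# With a raw string, "\b" reaches the regex engine as a word boundary.
raw = re.match(r'^(?=.*\bant).*(neth.*|dutch)', name)
```

Here plain is None while raw is a match object covering the whole name, which agrees with what regex101 reports for the same pattern.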
Hi, thanks for the library.
I got bitten by unexpected behaviour upon passing a pandas series to convert. I would have expected the code either to raise a TypeError or to handle the input correctly. Instead, the series gets converted to a single string, which is matched against the country regexes. In many cases this actually gives the correct result in the end (although with "Warning: More than one regular expression match for [list of all countries in series]" being printed a million times). In a few cases, the formatting of the series string somehow prevents a match, and so a country or two are not included in the result.
These lines are the root of the problem:
names = list(names) if (
    isinstance(names, tuple) or
    isinstance(names, set)) else names
names = names if isinstance(names, list) else [names]
names = [str(n) for n in names]
If names is a pandas series, after these lines names will be a one-element list containing a single string representing the entire series (the result of calling str on the series).
The same thing will happen if you input a numpy array, or in fact anything which implements __repr__.
I suggest changing the code so it tries to coerce the input to a list, and raises a TypeError upon failure. For instance we could change the above lines to this:
if not isinstance(names, str):
    try:
        names = list(names)
    except TypeError as e:
        raise TypeError("names must be coercible to list") from e
    names = [str(n) for n in names]
else:
    names = [names]
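The difference between the two variants can be seen without pandas, using any iterable that is neither a list, tuple nor set. The sketch below wraps both snippets in functions; the names coerce_old and coerce_new are only for this illustration.

```python
def coerce_old(names):
    """The original logic quoted above: only tuple and set are unpacked."""
    names = list(names) if isinstance(names, (tuple, set)) else names
    names = names if isinstance(names, list) else [names]
    return [str(n) for n in names]


def coerce_new(names):
    """The proposed logic: any non-string iterable becomes a list of strings."""
    if isinstance(names, str):
        return [names]
    try:
        names = list(names)
    except TypeError as e:
        raise TypeError("names must be coercible to list") from e
    return [str(n) for n in names]


keys = {"France": 1, "Spain": 2}.keys()
old_result = coerce_old(keys)  # one string holding the repr of the whole view
new_result = coerce_new(keys)  # one string per country
```

One side effect worth noting: a bare non-iterable such as an integer is accepted by the original lines (it becomes a one-element list) but raises TypeError under the proposed ones.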
Happy to make a PR if you agree.
Thanks again for the library, saved me tonnes of work.
country_converter.py:412: FutureWarning: read_table is deprecated, use read_csv instead
From Evert:
I believe the ISO3 code for Romania is ROU, not ROM. (see: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3)
In the doc notebook the example with a separate germany appears to be broken.
Version: Does the release version given match the GitHub release (v0.4)?
Could you tag the version published on PyPI in Git as well? This is also helpful for the Zenodo archiving.
Instead of using logging.warning and so on, could you get a specific logger and use that instead?
Otherwise, I think it's impossible to change the log level only for this package.
For example, this code does not print anything:
import logging
from country_converter import CountryConverter
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.CRITICAL)
my_logger = logging.getLogger('mylogger')
my_logger.info('test')
logging.info('test')
logging.getLogger().info('test')
CountryConverter().convert(['france', 'sadflkj'], src='regex')
While this code prints my log message, but also a warning from CountryConverter:
import logging
from country_converter import CountryConverter
logging.basicConfig(level=logging.DEBUG)
my_logger = logging.getLogger('mylogger')
my_logger.info('test')
CountryConverter().convert(['france', 'sadflkj'], src='regex')
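The request above amounts to the usual pattern of a library logging through its own named logger, which lets an application change that logger's level without touching the root logger. The sketch below is self-contained; the logger name "country_converter" and the stand-in convert function are assumptions for the example, not the package's current behaviour.

```python
import logging

# A library would create its own logger once, typically at module level.
# The name "country_converter" is an assumption for this sketch.
lib_logger = logging.getLogger("country_converter")


def convert(name):
    """Stand-in for a library call that warns about unmatched input."""
    lib_logger.warning("%s not found in regex", name)
    return "not found"


class ListHandler(logging.Handler):
    """Collects messages so the effect of the log levels is visible."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

# Silence only the library; the application's own loggers are unaffected.
lib_logger.setLevel(logging.ERROR)

convert("sadflkj")                          # suppressed by the library logger's level
logging.getLogger("mylogger").info("test")  # still reaches the root handler
```

After running this, handler.messages holds only the application's "test" message; the library warning is filtered before it propagates to the root logger.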
I'm not sure if this is best done outside or inside of the match function, but usually when I run the function I'm aware that some matches won't be found and will look for and deal with multiple matches. Meanwhile the long list of warnings pushes useful information further away in my terminal. Can the warnings be turned off or piped somewhere out of the way?
For both doc notebooks!!!