Geocoding

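  • This module normalizes irregular (or unsegmented) text addresses as far as possible, and computes the similarity between two addresses.
  • It is a Python wrapper around the bitlap/geocoding project, which is written in Kotlin.
  • To make the project callable from Python, bitlap/geocoding is wrapped with the jpype module, so this module needs a Java environment (environment variables such as JAVA_HOME must be set).
  • Reloading GeocodingCHN may fail on Windows; see the Jpype Changelog entry for 1.2.0 (2020-11-29).
  • Install with: pip install GeocodingCHN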
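
Because the module starts a JVM through jpype, a missing Java environment usually shows up only when the package is imported or first used. The sketch below is a small pre-flight check, assuming JAVA_HOME is how Java is located on the machine. The helper name check_java_home is hypothetical and is not part of GeocodingCHN.

```python
import os


def check_java_home(env=None):
    """Return the JAVA_HOME path if it is set and is a directory, else None.

    Hypothetical helper, not part of GeocodingCHN.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if java_home and os.path.isdir(java_home):
        return java_home
    return None


if check_java_home() is None:
    print("JAVA_HOME is not set or is not a directory; the JVM may fail to start.")
```

Running this before importing GeocodingCHN gives a clearer message than a JVM start-up failure.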

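Changelog:

1.4.5

  1. Fixed MatchedResult failing to parse an empty result

1.4.4

  1. Fixed Address instances failing to be created

1.4.3

  1. Added the save method, which generates a custom dat dictionary file
  2. Added the match method, which does a depth-first match of the address information that fits the input
  3. Added the analyze method, which splits an address

1.4.2

Fixed custom addresses failing to be added, and updated the jar to 1.3.1

1.4.1

The upstream project updated its jar, and this release adapts to the new features. New features

  • GeocodingCHN.Geocoding accepts new parameters (to match the org.bitlap.geocoding.GeocodingX class)
    • data_class_path: path to a custom address file. It must be an absolute file path; by default the built-in core/region.dat is used
    • strict: defaults to False. When the address has no province or city and exactly one parent entry matches, the match succeeds in either mode.
      • True: strict mode. When the address has no province or city and more than one parent entry matches, None is returned
      • False: non-strict mode. When the address has no province or city and more than one parent entry matches, one of the matching province/city entries is picked at random
    • jvm_path: sets the JVM path. It must be an absolute file path
  • addRegionEntry accepts a new replace parameter that says whether an existing address is replaced; defaults to True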
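
The strict rule described in the list above can be modelled in a few lines. This is an illustrative model of the documented behavior only, not the library's code: the function resolve_parent and its candidates argument are hypothetical, and returning None for an empty candidate list is an assumption.

```python
import random


def resolve_parent(candidates, strict=False, rng=random):
    """Pick a parent region for an address that has no province or city.

    Hypothetical model of the documented `strict` behavior.
    """
    if not candidates:
        return None                  # assumption: nothing matched at all
    if len(candidates) == 1:
        return candidates[0]         # unambiguous: succeeds in both modes
    if strict:
        return None                  # ambiguous and strict: no result
    return rng.choice(candidates)    # ambiguous and non-strict: random pick
```

With strict=False the result for an ambiguous address can differ between runs, so strict=True is the safer choice when a wrong province or city is worse than no match.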

Other changes:

  • Separated the similarityWithResult and similarity methods: similarityWithResult returns a MatchedResult, similarity returns a float
  • Wrapped the word segmentation method segment
GeocodingCHN.Geocoding

from GeocodingCHN import Geocoding
geocoding = Geocoding(data_class_path="core/region.dat",
                      strict=False,
                      jvm_path=None)
  • data_class_path: path to a custom address file
  • strict: matching mode
  • jvm_path: path to the JVM
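
The changelog says custom values for data_class_path and jvm_path must be absolute file paths. The sketch below resolves them before calling the constructor. The helper names build_geocoding_kwargs and make_geocoding are hypothetical; only the three keyword arguments come from this README.

```python
import os


def build_geocoding_kwargs(dat_file=None, strict=False, jvm_path=None):
    """Build keyword arguments for Geocoding(), making custom paths absolute.

    Hypothetical helper; omitted paths fall back to the library defaults.
    """
    kwargs = {"strict": strict}
    if dat_file is not None:
        kwargs["data_class_path"] = os.path.abspath(dat_file)
    if jvm_path is not None:
        kwargs["jvm_path"] = os.path.abspath(jvm_path)
    return kwargs


def make_geocoding(**options):
    """Create a Geocoding instance; the import is deferred because it needs Java."""
    from GeocodingCHN import Geocoding
    return Geocoding(**build_geocoding_kwargs(**options))
```

For example, make_geocoding(dat_file="my_region.dat", strict=True) would pass the absolute form of a hypothetical my_region.dat in the current directory.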

GeocodingCHN.Geocoding.normalizing

Normalizes an address.

normalizing(address) -> Address

  • address: the address as text
from GeocodingCHN import Geocoding
geocoding = Geocoding()
text =  '山东青岛李沧区延川路116号绿城城园东区7号楼2单元802户'
address_nor = geocoding.normalizing(text)
print(address_nor)
Address(
	provinceId=370000000000, province=山东省, 
	cityId=370200000000, city=青岛市, 
	districtId=370213000000, district=李沧区, 
	streetId=0, street=, 
	townId=0, town=, 
	villageId=0, village=, 
	road=延川路, 
	roadNum=116号, 
	buildingNum=7号楼2单元802户, 
	text=绿城城园东区
)
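
When only the printed form of an Address is available (for example in a log), the fields can be recovered with a small parser. This is a sketch that works on the printed text shown above and assumes no field value contains a comma or a closing parenthesis. The function parse_address_repr is hypothetical; if the Address object exposes its fields directly, reading them from the object would be preferable.

```python
import re

# A field looks like name=value, where the value runs up to a comma,
# a newline, or the closing parenthesis.
_FIELD = re.compile(r"(\w+)=([^,\n)]*)")


def parse_address_repr(printed):
    """Turn printed Address(...) text into a dict of field name -> string value."""
    return {key: value.strip() for key, value in _FIELD.findall(printed)}


sample = "Address(\n\tprovinceId=370000000000, province=山东省, \n\tstreetId=0, street=, \n\ttext=绿城城园东区\n)"
fields = parse_address_repr(sample)
print(fields["province"], fields["text"])
```

All values come back as strings, so IDs such as provinceId need an explicit int() if they are used as numbers.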

GeocodingCHN.Geocoding.similarityWithResult

Computes the similarity of two addresses and returns the detailed result.

similarityWithResult(Address1:Address, Address2:Address) -> MatchedResult

  • Address1: the first address, an Address returned by normalizing
  • Address2: the second address, an Address returned by normalizing
from GeocodingCHN import Geocoding
geocoding = Geocoding()
text1 = '山东青岛李沧区延川路116号绿城城园东区7号楼2单元802户'
text2 = '山东青岛李沧区延川路绿城城园东区7-2-802'
Address_1 = geocoding.normalizing(text1)
Address_2 = geocoding.normalizing(text2)
print(geocoding.similarityWithResult(Address_1, Address_2))
MatchedResult(
	doc1=Document(terms=[Term(延川路), Term(116号), Term(7), Term(2), Term(802), Term(绿城), Term(城), Term(园), Term(东区)], town=None, village=None, road=Term(延川路), roadNum=Term(116号), roadNumValue=116), 
	doc2=Document(terms=[Term(延川路), Term(7), Term(2), Term(802), Term(绿城), Term(城), Term(园), Term(东区)], town=None, village=None, road=Term(延川路), roadNum=None, roadNumValue=0), 
	terms=['MatchedTerm(Term(延川路), coord=-1.0, density=-1.0, boost=2.0, tfidf=8.0)', 'MatchedTerm(Term(7), coord=-1.0, density=-1.0, boost=1.0, tfidf=2.0)', 'MatchedTerm(Term(2), coord=-1.0, density=-1.0, boost=1.0, tfidf=2.0)', 'MatchedTerm(Term(802), coord=-1.0, density=-1.0, boost=1.0, tfidf=2.0)', 'MatchedTerm(Term(绿城), coord=1.0, density=1.0, boost=1.0, tfidf=4.0)', 'MatchedTerm(Term(城), coord=1.0, density=1.0, boost=1.0, tfidf=4.0)', 'MatchedTerm(Term(园), coord=1.0, density=1.0, boost=1.0, tfidf=4.0)', 'MatchedTerm(Term(东区), coord=1.0, density=1.0, boost=1.0, tfidf=4.0)'], 
	similarity=0.9473309334313418
)

GeocodingCHN.Geocoding.similarity

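Computes the similarity of two addresses and returns the score only.

similarity(Address1:[Address, str], Address2:[Address, str]) -> float

  • Address1: the first address, either an Address or plain text
  • Address2: the second address, either an Address or plain text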
from GeocodingCHN import Geocoding
geocoding = Geocoding()
text1 = '山东青岛李沧区延川路116号绿城城园东区7号楼2单元802户'
text2 = '山东青岛李沧区延川路绿城城园东区7-2-802'
Address_1 = geocoding.normalizing(text1)
Address_2 = geocoding.normalizing(text2)
print(geocoding.similarity(Address_1, Address_2))
0.9473309334313418
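
A common use of the score is to pick the closest address from a list. The sketch below takes the similarity function as an argument so that it can be tried without a JVM. The helper best_match and its threshold parameter are hypothetical and not part of GeocodingCHN.

```python
def best_match(query, candidates, similarity_fn, threshold=0.0):
    """Return (candidate, score) for the most similar candidate, or None.

    similarity_fn(a, b) must return a float; candidates scoring below
    `threshold` are ignored. Hypothetical helper.
    """
    best = None
    for candidate in candidates:
        score = similarity_fn(query, candidate)
        if score >= threshold and (best is None or score > best[1]):
            best = (candidate, score)
    return best
```

Since similarity accepts plain text as well as Address objects, a call such as best_match(text1, [text2], geocoding.similarity, threshold=0.9) should work without normalizing first; the threshold value here is only an illustration.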

GeocodingCHN.Geocoding.addRegionEntry

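Adds a custom address.

addRegionEntry(Id, parentId, name, RegionType, alias='', replace=True) -> bool

  • Id: ID of the address
  • parentId: ID of the parent address; the parent must already exist
  • name: name of the address
  • RegionType: a RegionType value giving the address type
  • alias: alias of the address, default=''
  • replace: whether to replace an existing address, default=True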
from GeocodingCHN import Geocoding
geocoding = Geocoding()
geocoding.addRegionEntry(1, 321200000000, "A街道", geocoding.RegionType.Street)
address_nor = geocoding.normalizing("江苏泰州A街道")
print(address_nor)
Address(
	provinceId=320000000000, province=江苏省, 
	cityId=321200000000, city=泰州市, 
	districtId=321200000000, district=泰州市, 
	streetId=1, street=A街道, 
	townId=0, town=, 
	villageId=0, village=, 
	road=, 
	roadNum=, 
	buildingNum=, 
	text=
)
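
addRegionEntry returns a bool, so a failed insert is easy to miss. The sketch below adds several entries and collects the ones that were rejected. The helper add_entries is hypothetical, and treating a False return value as "rejected" is an assumption based on the signature above.

```python
def add_entries(add_fn, entries):
    """Call add_fn(*entry) for each entry and return the rejected entries.

    add_fn is expected to behave like Geocoding.addRegionEntry and return
    a truthy value on success. Hypothetical helper.
    """
    rejected = []
    for entry in entries:
        if not add_fn(*entry):
            rejected.append(entry)
    return rejected
```

A call would look like add_entries(geocoding.addRegionEntry, [(1, 321200000000, "A街道", geocoding.RegionType.Street)]); an empty list back means every entry was accepted.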

GeocodingCHN.Geocoding.segment

Word segmentation.

segment(text: str, seg_type: str = 'ik') -> list

  • text: the input text
  • seg_type: one of ['ik', 'simple', 'smart', 'word'], default = 'ik'
from GeocodingCHN import Geocoding
geocoding = Geocoding()
text = '山东青岛李沧区延川路绿城城园东区7-2-802'
print(geocoding.segment(text))
['山东', '青岛', '李沧区', '延川路', '绿城', '城', '园', '东区', '7-2-802']

Acknowledgements

Contributors

casuallyname

geocoding's Issues

The segment method splits 郫都区 incorrectly

Input: 四川省成都市郫都区西源大道1311号3栋4单元1楼102号
Using the segment method with seg_type = 'ik',
the resulting list is: ['四川省', '成都市', '郫', '都', '西源大道', '1311号', '3栋', '4', '单元', '1楼', '102号']
The expected list is: ['四川省', '成都市', '郫都区', '西源大道', '1311号', '3栋', '4', '单元', '1楼', '102号']
Is there any way to correct the result? Thanks!

Input: 四川省成都市双流区西航港街道临港路四段9号和贵久居福4栋3单元4层402号
The resulting list is: ['四川省', '成都市', '双流', '西航港街道', '临港路', '四段', '9号', '和贵久居福', '4栋', '3', '单元', '4层', '402号']
The character 区 in 双流区 was dropped during segmentation. What causes this?
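
One possible workaround is to post-process the token list against a list of region names that are known to be split badly. The sketch below is not a GeocodingCHN feature and does not change how segment works: the function merge_known_regions, the known_names list, and the set of suffix characters are all assumptions made for illustration.

```python
# Administrative suffixes that the tokenizer may drop (assumed list).
SUFFIXES = ("省", "市", "区", "县")


def merge_known_regions(text, tokens, known_names):
    """Rejoin consecutive tokens that spell a known region name found in text.

    A run of tokens is merged when it equals the name, or equals the name
    minus one trailing administrative suffix. Hypothetical workaround.
    """
    result = []
    i = 0
    while i < len(tokens):
        match = None
        for name in known_names:
            if name not in text:
                continue
            joined, j = "", i
            # Consume tokens while they keep spelling a prefix of the name.
            while j < len(tokens) and name.startswith(joined + tokens[j]):
                joined += tokens[j]
                j += 1
            complete = joined == name or any(joined + s == name for s in SUFFIXES)
            if j > i and complete:
                match = (name, j)
                break
        if match:
            result.append(match[0])
            i = match[1]
        else:
            result.append(tokens[i])
            i += 1
    return result


text = "四川省成都市郫都区西源大道1311号3栋4单元1楼102号"
tokens = ["四川省", "成都市", "郫", "都", "西源大道", "1311号", "3栋", "4", "单元", "1楼", "102号"]
print(merge_known_regions(text, tokens, ["郫都区"]))
```

For the first input above this yields the expected list, and with known_names=["双流区"] the token 双流 in the second input becomes 双流区. The approach only helps for names listed in advance and can merge wrongly if a short name is a prefix of unrelated tokens.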

Examples for the new methods in 1.4.3

Could you add usage notes and examples for the methods introduced in 1.4.3 (save, match, analyze)? Thanks.

Error: 'NoneType' object has no attribute 'getTerms'

The similarityWithResult method

Example
text1 = '江西省南昌市新建县新建区长堎镇工业三路东侧保利紫云6栋'
text2 = '广东省深圳市宝安区长堎镇工业三路东侧保利紫云6栋'
Address_1 = geocoding.normalizing(text1)
Address_2 = geocoding.normalizing(text2)
print(geocoding.similarityWithResult(Address_1, Address_2))

Error:
File "D:\python3.8\lib\site-packages\GeocodingCHN\Geocoding.py", line 136, in similarityWithResult
return MatchedResult.from_java(self.geocoding.similarityWithResult(address_1.java_class, address_2.java_class))
File "D:\python3.8\lib\site-packages\GeocodingCHN\model\matched.py", line 64, in from_java
return cls(doc1=Document.from_java_class(java.getDoc1()),
File "D:\python3.8\lib\site-packages\GeocodingCHN\model\document.py", line 44, in from_java_class
return cls(terms=[Term.from_java_class(i) for i in java.getTerms()],
AttributeError: 'NoneType' object has no attribute 'getTerms'
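
The 1.4.5 changelog entry about MatchedResult and empty results may address this error, so upgrading is the first thing to try. On a version that still raises, a caller can guard the call as sketched below. The helper safe_similarity_with_result is hypothetical, and catching AttributeError this broadly is an assumption that can also hide unrelated bugs.

```python
def safe_similarity_with_result(similarity_fn, address_1, address_2):
    """Return similarity_fn(address_1, address_2), or None on AttributeError.

    Meant for the case where the wrapper fails while converting an empty
    Java result. Hypothetical helper.
    """
    try:
        return similarity_fn(address_1, address_2)
    except AttributeError:
        return None
```

A call would look like safe_similarity_with_result(geocoding.similarityWithResult, Address_1, Address_2), with None treated as "no comparable result".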

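1.4.1: custom addresses are not added effectively

Respect! I ran into a problem while using the module: custom addresses are not added effectively.

My example code: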

geocoding = Geocoding()
geocoding.addRegionEntry(1, 321200000000, "A街道", geocoding.RegionType.Street)
print(geocoding.normalizing("江苏泰州A街道"))

The result returned:

Address(
	provinceId=320000000000, province=江苏省, 
	cityId=321200000000, city=泰州市, 
	districtId=None, district=None, 
	streetId=None, street=None, 
	townId=None, town=None, 
	villageId=None, village=None, 
	road=None, 
	roadNum=None, 
	buildingNum=None, 
	text=A街道
)

A街道 is not placed in the street field.

Also, where should a custom address file for bulk import be placed, and what format does the file use?

Adding a custom address fails

Adding a custom address fails: geocoding.addRegionEntry(1, 321200000000, "A街道", geocoding.RegionType.Street) always returns False,
and the expected result is not produced. The output is:
Address(
provinceId=320000000000, province=江苏省,
cityId=321200000000, city=泰州市,
districtId=None, district=None,
streetId=None, street=None,
townId=None, town=None,
villageId=None, village=None,
road=None,
roadNum=None,
buildingNum=None,
text=A街道
)
