bitlap / geocoding

🌐 Geocoding technology that provides address standardization and similarity calculation.

License: MIT License

Kotlin 92.42% Java 7.58%
address geocoding kotlin segmentation similarity

geocoding's Issues

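region.dat is out of date

Hi, while parsing “重庆市开州区南门镇” I found that the address data in the current region.dat is fairly old: it has no entry for 开州区.
Do we maintain region.dat ourselves, or can it be obtained from the internet? If it can be downloaded, could you share the link? Thanks.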

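Effect of higher-level region names that reappear later in an address on standardization

Removing higher-level region information that reappears later in the address raises the similarity considerably. Could the author optimize for this case?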

String t1 = "海南省海口市灵山镇海榆大道4号绿地城.润园海口市灵山西片去旧改项目A-32地块11#楼(栋)2(单元)2(层)203(号)";
String t2 = "海南省海口市灵山镇海榆大道4号绿地城.润园11#楼2单元203";

Result:

海南省海口市灵山镇海榆大道4号绿地城.润园海口市灵山西片去旧改项目A-32地块11#楼(栋)2(单元)2(层)203(号)
addr1 >>>> Address(
	provinceId=460000000000, province=海南省, 
	cityId=460100000000, city=海口市, 
	districtId=460108000000, district=美兰区, 
	streetId=460108101000, street=灵山镇, 
	townId=460108101000, town=灵山镇, 
	villageId=null, village=null, 
	road=null, 
	roadNum=null, 
	buildingNum=A-32, 
	text=西片去旧改项目地块11#楼22203栋单元层号
)
>>>>>>>>>>>>>>>>>
海南省海口市灵山镇海榆大道4号绿地城.润园11#楼2单元203
addr2 >>>> Address(
	provinceId=460000000000, province=海南省, 
	cityId=460100000000, city=海口市, 
	districtId=460108000000, district=美兰区, 
	streetId=460108101000, street=灵山镇, 
	townId=460108101000, town=灵山镇, 
	villageId=null, village=null, 
	road=海榆大道, 
	roadNum=4号, 
	buildingNum=11#楼2单元203, 
	text=绿地城润园
)
加载扩展词典:dic/region.dic
加载扩展词典:dic/community.dic
加载扩展停止词典:dic/stop.dic
相似度结果分析 >>>>>>>>> MatchedResult(
	doc1=Document(terms=[Term(灵山镇), Term(A), Term(32), Term(西片), Term(去), Term(旧), Term(改), Term(项目), Term(地块), Term(11#), Term(楼), Term(22203), Term(栋), Term(单元), Term(层), Term(号)], town=Term(灵山镇), village=null, road=null, roadNum=null, roadNumValue=0), 
	doc2=Document(terms=[Term(灵山镇), Term(海榆大道), Term(4号), Term(11), Term(2), Term(203), Term(绿地城), Term(润园)], town=Term(灵山镇), village=null, road=Term(海榆大道), roadNum=Term(4号), roadNumValue=4), 
	terms=[io.patamon.geocoding.similarity.MatchedTerm@2cfb4a64], 
	similarity=0.4886777774252209
)

After removing the second 海口市:

String t1 = "海南省海口市灵山镇海榆大道4号绿地城.润园灵山西片去旧改项目A-32地块11#楼(栋)2(单元)2(层)203(号)";
String t2 = "海南省海口市灵山镇海榆大道4号绿地城.润园11#楼2单元203";

Result:

海南省海口市灵山镇海榆大道4号绿地城.润园灵山西片去旧改项目A-32地块11#楼(栋)2(单元)2(层)203(号)
addr1 >>>> Address(
	provinceId=460000000000, province=海南省, 
	cityId=460100000000, city=海口市, 
	districtId=460108000000, district=美兰区, 
	streetId=460108101000, street=灵山镇, 
	townId=460108101000, town=灵山镇, 
	villageId=null, village=null, 
	road=海榆大道, 
	roadNum=4号, 
	buildingNum=A-32, 
	text=绿地城润园灵山西片去旧改项目地块11#楼22203栋单元层号
)
>>>>>>>>>>>>>>>>>
海南省海口市灵山镇海榆大道4号绿地城.润园11#楼2单元203
addr2 >>>> Address(
	provinceId=460000000000, province=海南省, 
	cityId=460100000000, city=海口市, 
	districtId=460108000000, district=美兰区, 
	streetId=460108101000, street=灵山镇, 
	townId=460108101000, town=灵山镇, 
	villageId=null, village=null, 
	road=海榆大道, 
	roadNum=4号, 
	buildingNum=11#楼2单元203, 
	text=绿地城润园
)
加载扩展词典:dic/region.dic
加载扩展词典:dic/community.dic
加载扩展停止词典:dic/stop.dic
相似度结果分析 >>>>>>>>> MatchedResult(
	doc1=Document(terms=[Term(灵山镇), Term(海榆大道), Term(4号), Term(A), Term(32), Term(绿地城), Term(润园), Term(灵山), Term(西片), Term(去), Term(旧), Term(改), Term(项目), Term(地块), Term(11#), Term(楼), Term(22203), Term(栋), Term(单元), Term(层), Term(号)], town=Term(灵山镇), village=null, road=Term(海榆大道), roadNum=Term(4号), roadNumValue=4), 
	doc2=Document(terms=[Term(灵山镇), Term(海榆大道), Term(4号), Term(11), Term(2), Term(203), Term(绿地城), Term(润园)], town=Term(灵山镇), village=null, road=Term(海榆大道), roadNum=Term(4号), roadNumValue=4), 
	terms=[io.patamon.geocoding.similarity.MatchedTerm@4b6995df, io.patamon.geocoding.similarity.MatchedTerm@2fc14f68, io.patamon.geocoding.similarity.MatchedTerm@591f989e, io.patamon.geocoding.similarity.MatchedTerm@66048bfd, io.patamon.geocoding.similarity.MatchedTerm@61443d8f], 
	similarity=0.7152705001057788
)
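
One possible preprocessing step, sketched here in Python independently of the library, is to drop later repeats of region names that have already appeared at the head of the address. The function name strip_repeated_regions and the region list passed to it are assumptions made for illustration; they are not part of the geocoding API.

```python
def strip_repeated_regions(address: str, regions: list[str]) -> str:
    """Keep the first occurrence of each region name and delete later repeats."""
    for name in regions:
        first = address.find(name)
        if first == -1:
            continue  # this region name does not occur at all
        head_end = first + len(name)
        # Everything up to and including the first occurrence is kept as-is;
        # repeats of the same name in the remainder are removed.
        address = address[:head_end] + address[head_end:].replace(name, "")
    return address


t1 = "海南省海口市灵山镇海榆大道4号绿地城.润园海口市灵山西片去旧改项目A-32地块11#楼(栋)2(单元)2(层)203(号)"
# Assumption: the caller already knows which province and city were resolved.
cleaned = strip_repeated_regions(t1, ["海南省", "海口市"])
print(cleaned)
```

Applied to the first t1 above, this produces the second t1 (the one with the repeated 海口市 removed). The same rule would also delete a place name that legitimately recurs inside a POI, so it is only a heuristic.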

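removeRedundancy mistakenly deletes useful POI strings

Hello,
Thank you for open-sourcing such a useful tool.
In the Geocoding.normalizing API, after the four administrative levels have been matched, removeRedundancy() goes on to remove any province, city, district/county, or town/street name it can still resolve, together with everything before it. This handles addresses that repeat the province, city, and district, and leaves only the POI string for further processing. However, when the POI string itself contains an ordinary place name (for example 浙江省杭州市西湖区**建设银河西湖支行), removeRedundancy() wrongly deletes the POI information, and only “支行” remains.

For example: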

print(Geocoding.normalizing("浙江省杭州市西湖区**建设银河西湖支行"))

[Out]
Address(
	provinceId=330000000000, province=浙江省, 
	cityId=330100000000, city=杭州市, 
	districtId=330106000000, district=西湖区, 
	streetId=null, street=null, 
	townId=null, town=null, 
	villageId=null, village=null, 
	road=null, 
	roadNum=null, 
	buildingNum=null, 
	text=支行
)

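Can a custom address entry map a misspelled address to the correct one?

It seems that custom addresses add new entries to the dictionary rather than correct a wrong address before parsing? @IceMimosa

For example: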

Geocoding.addRegionEntry(510000000000L, 100000000000L, "四州省", RegionType.Province, "四川")
Geocoding.normalizing("四州省广安市广安区")

Can this address be correctly parsed as 四川省广安市广安区?
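
Whether addRegionEntry can act as a correction table is for the maintainer to confirm. A library-independent workaround is to rewrite known misspellings before the address is passed to normalizing. The sketch below is written under that assumption; the ALIASES table and the function correct_aliases are hypothetical names, not part of the geocoding API.

```python
# Hypothetical correction table: misspelled name -> canonical name.
ALIASES = {"四州省": "四川省"}


def correct_aliases(address: str, aliases: dict[str, str]) -> str:
    """Replace every known misspelling with its canonical form."""
    for wrong, right in aliases.items():
        address = address.replace(wrong, right)
    return address


fixed = correct_aliases("四州省广安市广安区", ALIASES)
print(fixed)
```

The corrected string can then be handed to the normalizer unchanged, so the dictionary itself stays free of misspelled entries.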

静海区 in “天津市静海区” is parsed as a county

“天津市静海区大丰堆镇齐小王村村委会东100米”
This address is parsed as:
provinceId=120000000000, province=天津,
cityId=120100000000, city=天津市,
districtId=120223000000, district=静海县,
streetId=120223113000, street=大丰堆镇,
townId=120223113000, town=大丰堆镇,
villageId=null, village=null,
road=null,
roadNum=null,
buildingNum=null,
text=齐小王村村委会东100米

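Adding “静海区” through custom data still does not fix it.
Is this because the address database has not been updated?
Does it have to be fixed by modifying the address database?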

Segmentation method segment: problem parsing 【郫都区】

Input: 四川省成都市郫都区西源大道1311号3栋4单元1楼102号
Using the segment method with seg_type = 'ik',
the resulting token list is: ['四川省', '成都市', '郫', '都', '西源大道', '1311号', '3栋', '4', '单元', '1楼', '102号']
The expected token list is: ['四川省', '成都市', '郫都区', '西源大道', '1311号', '3栋', '4', '单元', '1楼', '102号']
Is there any way to correct the result? Thanks!
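
Dictionary-based segmenters generally split an unknown word into single characters, which matches the symptom here. The sketch below uses plain forward maximum matching (a simplified stand-in, not the IK algorithm the library uses) to show how adding 郫都区 to the vocabulary changes the output. Whether adding the entry to the library's own dictionary (the logs elsewhere on this page mention dic/region.dic) fixes the real case is an assumption.

```python
def forward_max_match(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max((len(word) for word in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and fall back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


vocab = {"四川省", "成都市", "西源大道"}
text = "四川省成都市郫都区西源大道"
before = forward_max_match(text, vocab)
after = forward_max_match(text, vocab | {"郫都区"})
print(before)
print(after)
```

Without the entry the district name falls apart into single characters; with it, 郫都区 comes out as one token.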

normalizing: standardization drops the digits

With the normalizing method and the input address 北京市海淀区西北旺东路10号院东区323102, the number 323102 is missing from the returned result.
Please take a look.

buildingNum is wrong when parsing an address; how can this be fixed? Thanks

Address(
provinceId=110000000000, province=北京,
cityId=110100000000, city=北京市,
districtId=110102000000, district=西城区,
streetId=null, street=null,
townId=null, town=null,
villageId=null, village=null,
road=新康街,
roadNum=2号院,
buildingNum=null,
text=1号楼北侧楼房
)

Similarity is 0

Addr1:江苏省南京市建邺区庐山路98-1号
Addr2:江苏省南京市庐山路98-1号

But the result I got is 0.0?

I am not sure whether I broke something while tinkering with the code or whether this is an existing bug.

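Write a utility class that generates the dat file from the national address database

Implementation approach

Utility class inputs

1. Address data URL

For example: http://www.stats.gov.cn/sj/tjbz/tjyqhdmhcxhfdm/2023/, or a class parameter fixed to a year such as 2022 or 2023.
If an API is available, calling it directly is better; otherwise the pages can be scraped with jsoup.

2. Address level

The deeper the level, the larger the generated file, so the level needs to be limited, for example 1: province, 2: city, 3: district, 4: street/town, 5: neighborhood committee.

3. File format

json/pb...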
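
The level limit and the output format described above can be sketched as follows. This is only an outline under stated assumptions: the scraping step is left out, the input is assumed to be flat (id, parent_id, name, level) rows, the function name build_region_tree is hypothetical, and the output is plain JSON rather than the library's actual dat format. The ids are the ones that appear in the parse results elsewhere on this page.

```python
import json


def build_region_tree(rows, max_level):
    """Nest flat (id, parent_id, name, level) rows, skipping levels deeper than max_level."""
    nodes = {}
    roots = []
    # Sort by level so every parent is created before its children.
    for rid, parent_id, name, level in sorted(rows, key=lambda row: row[3]):
        if level > max_level:
            continue  # honour the configured depth limit
        node = {"id": rid, "name": name, "level": level, "children": []}
        nodes[rid] = node
        parent = nodes.get(parent_id)
        if parent is None:
            roots.append(node)  # no known parent: treated as a top-level entry
        else:
            parent["children"].append(node)
    return roots


rows = [
    (460108101000, 460108000000, "灵山镇", 4),
    (460000000000, 100000000000, "海南省", 1),
    (460100000000, 460000000000, "海口市", 2),
    (460108000000, 460100000000, "美兰区", 3),
]
tree = build_region_tree(rows, max_level=3)
payload = json.dumps(tree, ensure_ascii=False)
print(payload)
```

With max_level=3 the town row is dropped, so the file size can be traded against depth by changing a single parameter.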

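RegionEntity's children property is not initialized, so adding lower-level regions fails

With an empty dat dictionary file, GeocodingX.addRegionEntry does not initialize the children property of RegionEntity, so lower-level administrative regions are not added. In the following code from DefaultRegionCache, when children is uninitialized (null), the last line does not add the child RegionEntity to the parent RegionEntity: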

override fun addRegionEntity(entity: RegionEntity) {
    this.loadChildrenInCache(entity)
    this.REGION_CACHE[entity.id] = entity
    this.REGION_CACHE[entity.parentId]?.children?.add(entity)
}
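
The failure mode and one possible fix can be modelled outside the library. The Python sketch below is a language-neutral illustration, not the project's code: the class and method names only mirror the Kotlin ones, and the fix shown (creating the list on first use instead of silently skipping the add) is an assumption about what the maintainers might want.

```python
class RegionEntity:
    def __init__(self, id, parent_id, name):
        self.id = id
        self.parent_id = parent_id
        self.name = name
        self.children = None  # mirrors the uninitialised state described above


class RegionCache:
    def __init__(self):
        self._cache = {}

    def add_region_entity(self, entity):
        self._cache[entity.id] = entity
        parent = self._cache.get(entity.parent_id)
        if parent is not None:
            if parent.children is None:
                # Create the list lazily rather than dropping the child.
                parent.children = []
            parent.children.append(entity)


cache = RegionCache()
province = RegionEntity(460000000000, 100000000000, "海南省")
city = RegionEntity(460100000000, 460000000000, "海口市")
cache.add_region_entity(province)
cache.add_region_entity(city)
print([child.name for child in province.children])
```

In the Kotlin original, the safe-call chain evaluates to null when children is null, so the add is skipped without any error; the explicit check above makes that case visible and handles it.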

Why does it fail at runtime after being packaged as an .exe, and how can this be fixed?

Code:
from GeocodingCHN import Geocoding

geocoding = Geocoding()

text = '山东青岛李沧区延川路116号绿城城园东区7号楼2单元802户'

address_nor = geocoding.normalizing(text)

print(address_nor)

Error:
Traceback (most recent call last):
  File "main.py", line 3, in <module>
  File "GeocodingCHN\Geocoding.py", line 61, in __init__
  File "jpype\_jclass.py", line 99, in __new__
TypeError: Class org.bitlap.geocoding.GeocodingX is not found
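
A "class not found" error from JPype usually means the jar was not on the JVM classpath. One plausible cause here, stated as an assumption, is that the packaged .exe does not contain the library's jar or cannot locate it after unpacking. The sketch below shows how a bundled file can be located at runtime under PyInstaller; the file name geocoding.jar and the helper resource_path are hypothetical.

```python
import os
import sys


def resource_path(relative: str, fallback_dir: str) -> str:
    """Return the path of a bundled file, both when frozen and when run from source."""
    # PyInstaller sets sys._MEIPASS to the directory where the bundle is unpacked.
    base = getattr(sys, "_MEIPASS", fallback_dir)
    return os.path.join(base, relative)


# Hypothetical jar name; the real file shipped by GeocodingCHN may differ.
jar = resource_path("geocoding.jar", os.getcwd())
print(jar)
```

The jar would also have to be included in the bundle, for example with PyInstaller's --add-data option. Whether GeocodingCHN lets the caller supply a jar path is not stated in this issue.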

Standardization cannot reach level 5 with the national standard database

For example, this address: 东莞市莞城街道罗沙社区东兴路(道路)东兴门诊部1层.

Result:

  Address(
	  provinceId=440000000000, province=广东省, 
	  cityId=441900000000, city=东莞市, 
	  districtId=441900006000, district=莞城街道, 
	  streetId=null, street=null, 
	  townId=null, town=null, 
	  villageId=null, village=null, 
	  road=罗沙社区东兴路, 
	  roadNum=, 
	  buildingNum=null, 
	  text=东兴门诊部1层道路
  )

How to locate an address from a unique town name

灵山镇海榆大道4号绿地城.润园11#楼2单元203

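I used the data for only one province and ran the match. It matched the street and district ids but did not go on to match the province and city. How should the code be changed?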

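Cannot resolve to level 5

Hi, following the instructions I imported the five-level address database into MySQL and regenerated the dat file, but the standardized address still does not reach the fifth level. How can this be handled?

Province / municipality
City / prefecture
County / district
Township / town
Village / community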

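Returning multiple match results

For example, the input “南山区” has two matches, one in Heilongjiang Province and one in Guangdong Province, but currently only the first is returned.
Also, if only a town is entered, the result is just null. How should this be changed?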
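
Returning every candidate instead of the first one amounts to indexing regions by name and keeping a list per name. The sketch below illustrates the idea on a tiny hand-written table; the record layout and the function build_name_index are assumptions and do not reflect how the library stores its regions.

```python
from collections import defaultdict


def build_name_index(regions):
    """Map each region name to the list of all regions carrying that name."""
    index = defaultdict(list)
    for region in regions:
        index[region["name"]].append(region)
    return index


# Hypothetical records; only the names and provinces come from the issues on this page.
regions = [
    {"name": "南山区", "province": "黑龙江省"},
    {"name": "南山区", "province": "广东省"},
    {"name": "灵山镇", "province": "海南省"},
]
index = build_name_index(regions)
candidates = index.get("南山区", [])
print([c["province"] for c in candidates])
```

A caller can then rank or filter the candidates, for instance by other parts of the address, rather than receiving an arbitrary first hit or null.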

Dependency download fails

Downloading the dependency from the GitHub repo as described in the README fails:

Failed to execute goal on project customer-experience-data-factory: Could not resolve dependencies for project com.treeyee.cloud:customer-experience-data-factory:jar:0.0.1-SNAPSHOT: io.patamon.geocoding:geocoding:jar:1.1.6 was not found in https://raw.github.com/icemimosa/maven/release/ during a previous attempt. This failure was cached in the local repository and resolution is not reattempted until the update interval of patamon.release.repository has elapsed or updates are forced -> [Help 1]

Cloning the project, building the jar locally, and adding it to my project also fails at runtime (my project is a Java project):
java.lang.NoClassDefFoundError: kotlin/jvm/internal/Intrinsics
at io.patamon.geocoding.Geocoding.similarity(Geocoding.kt)

Do you know what causes this?
