Why do I have to configure the "nGram" filter when building the index before a pinyin search returns any data? (elasticsearch-analysis-pinyin, closed)

medcl commented on June 10, 2024
Why do I have to configure the "nGram" filter when building the index before a pinyin search returns any data?

from elasticsearch-analysis-pinyin.

Comments (18)

medcl commented on June 10, 2024

In the first approach, the analyzer was created under the index named medcl, but your index is searchshowindex_v3. The two are separate. Switch to your own index, configure it again, and retry.

ganxiaomao commented on June 10, 2024

Thank you very much. I still have not worked out what the various configuration fields mean. I will give it a try.

ganxiaomao commented on June 10, 2024

I remember now: when I created the analyzer, the index I specified was in fact searchshowindex_v3. What I pasted in the post was copied straight from the tutorial. So I still do not understand the cause. After configuration, the analyzer test output looks right, but indexing does not seem to work. The log shows no errors, yet searches find nothing. My search request is:
http://localhost:9200/searchshowindex_v3/searchshowtype/_search?q=dianshiji
My database holds many records about "电视机" (television set), but not one of them is returned. As soon as I add nGram when configuring the analyzer I do get results, but then nearly every record in the database is returned. Here is my complete setup:
Step 1: create the index
curl -XPUT localhost:9200/searchshowindex_v3 -d'{
  "index":{
    "analysis":{
      "analyzer":{"pinyin_analyzer":{"tokenizer":"my_pinyin","filter":["standard","lowercase"]}},
      "tokenizer":{"my_pinyin":{"type":"pinyin","first_letter":"append","padding_char":""}}
    }}}'
Step 2: configure the mapping
curl -XPOST localhost:9200/searchshowindex_v3/searchshowtype/_mapping -d'
{
  "searchshowtype":{
    "_all":{"analyzer":"pinyin_analyzer","term_vector":"no","store":false}}}'
Step 3: create the MongoDB river
curl -XPUT localhost:9200/_river/ssmongo2/_meta -d'{
  "type":"mongodb",
  "mongodb":{
    "host":"192.168.0.10", "port":22222,
    "db":"verticalsearch",
    "collection":"searchshow"
  },
  "index":{"name":"searchshowindex_v3","type":"searchshowtype"}
}'
I went through the same procedure with IK and it worked without problems. The issue appears only with pinyin, and I cannot tell what is wrong.

medcl commented on June 10, 2024

Please post your test data and the query as well.

ganxiaomao commented on June 10, 2024

The documents in my MongoDB database look like this:
{
'url' => 'http://gdxrz.com/',
'title' => '电视挂架|电视机吊架|电视机支架|显示器支架|液晶电视机挂架|液晶...',
'info' => ' tcl 电视挂架 nb 电视挂架 投影仪吊架 投影机支架 显示器推车 投影机吊架 lg 电视挂架 电视架厂家 电视机吊架 红叶支架幕 红叶支架幕 投影仪支架 ',
}
The query is:
http://localhost:9200/searchshowindex_v3/searchshowtype/_search?q=dianshiji

medcl commented on June 10, 2024

The query has to name the field, for example title:dianshiji

ganxiaomao commented on June 10, 2024

Thank you very much for your reply. I was off work for the past two days and did not see it. The problem is still there: I named the field in the query as you suggested, and it still does not work. From the start I configured _all in the mapping with the pinyin analyzer and left the other fields unconfigured. In my experience with IK, a plain q=keyword query does return results; otherwise I would not get results after adding an nGram filter to the pinyin analyzer either. Following your point about the field, I tried again with these steps:
1. Create the index:
curl -XPUT localhost:9200/searchshowindex_v1 -d'
{
  "index":{
    "analysis":{
      "analyzer":{"pinyin_analyzer":{"tokenizer":"my_pinyin","filter":["standard"]}},
      "tokenizer":{"my_pinyin":{"type":"pinyin","first_letter":"none","padding_char":""}}
    }
  }
}'
2. Configure the mapping:
curl -XPOST localhost:9200/searchshowindex_v1/searchshowtype/_mapping -d'
{
  "searchshowtype":{
    "properties":{
      "title":{
        "type":"string",
        "store":"no",
        "term_vector":"with_positions_offsets",
        "analyzer":"pinyin_analyzer",
        "boost":5}
    }
  }
}'
3. Create the river:
curl -XPUT localhost:9200/_river/ssmongo1/_meta -d'{
  "type":"mongodb",
  "mongodb":{
    "host":"192.168.0.10", "port":22222,
    "db":"verticalsearch",
    "collection":"searchshow"
  },
  "index":{"name":"searchshowindex_v1","type":"searchshowtype"}
}'
4. Query:
http://localhost:9200/searchshowindex_v1/searchshowtype/_search?q=title:dianshiji
Result:
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
5. Analyzer test:
http://localhost:9200/searchshowindex_v1/_analyze?text=电视机&analyzer=pinyin_analyzer
Result:
{"tokens":[{"token":"dianshiji","start_offset":0,"end_offset":3,"type":"word","position":1}]}

That is what I get when I configure ES for a single field and test it. The expected results do not appear, so I still do not understand what is going on.

medcl commented on June 10, 2024

Is there any data in the index?

http://localhost:9200/searchshowindex_v1/searchshowtype/_search?q=*

ganxiaomao commented on June 10, 2024

I ran the query you gave and there is data. Part of the output follows:
{"took":48,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":92,"max_score":1.0,"hits":[{"_index":"searchshowindex_v1","_type":"searchshowtype","_id":"53857e17f2136f31f083c5cd","_score":1.0, "_source" : {"title_suggest":"电视--家电--人民网","_id":"53857e17f2136f31f083c5cd","title":"电视--家电--人民网","url":"http://homea.people.com.cn/GB/41406/index.html","info":" 选 电视上人民网家电频道。权威实力源自人民!"}},{"_index":"searchshowindex_v1","_type":"searchshowtype","_id":"53857e17f2136f31f083c5da","_score":1.0, "_source" : {"title_suggest":"7788电视网___**最大的电视机拍卖、交易网站","_id":"53857e17f2136f31f083c5da","title":"7788电视网___**最大的电视机拍卖、交易网站","url":"http://www.7788ds.com/","info":" 7788 电视网是黑白 电视、显像管 电视机、旧 电视机等的收藏、投资、交易平台。"}},{"_index":"searchshowindex_v1","_type":"searchshowtype","_id":"53857e17f2136f31f083c601","_score":1.0, "_source" : {"title_suggest":"康佳集团 - 精致产品,美妙生活","_id":"53857e17f2136f31f083c601","title":"康佳集团 - 精致产品,美妙生活","url":"http://www.konka.com/","info":" 电视 白色家电 手机 机顶盒 生活电器 厨卫电器 视讯 房地产 商用 电视 商用视讯 商用机顶盒 www.konka.com/ 2014-05-17 - 百度快照"}},{"_index":"searchshowindex_v1","_type":"searchshowtype","_id":"53857e17f2136f31f083c5bb","_score":1.0, "_source" : {"title_suggest":"【液晶电视】液晶电视报价及图片大全-ZOL中关村在线","_id":"53857e17f2136f31f083c5bb","title":"【液晶电视】液晶电视报价及图片大全-ZOL中关村在线","url":"http://detail.zol.com.cn/digital_tv/","info":" ZOL中关村在线提供液晶 电视最新价格及经销商报价,包括液晶 电视大全,液晶 电视参数,液晶 电视评测,液晶 电视图片,液晶 电视论坛等详细内容,为您购买液晶 电视提供最全面参考 detail.zol.com.c"}},{"_index":"searchshowindex_v1","_type":"searchshowtype","_id":"53857e17f2136f31f083c5c2","_score":1.0, "_source" : {"title_suggest":"【液晶电视频道】液晶电视排行榜|行情|评测-万维家电网","_id":"53857e17f2136f31f083c5c2","title":"【液晶电视频道】液晶电视排行榜|行情|评测-万维家电网","url":"http://tv.ea3w.com/","info":" 作为国内最专业的液晶 电视,等离子 电视频道,本频道提供液晶 电视,等离子 电视报价,行情、评测、导购、调研等相关资讯 tv.ea3w.com/ 2014-05-17 - 百度快照"}},

liukaitj commented on June 10, 2024

Has this been resolved? I am running into the same problem.

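medcl commented on June 10, 2024

The pinyin plugin does exactly one thing: it turns "中文" into "zhongwen". By default the output is the complete pinyin, emitted as one whole token. If you need fuzzy matching, tokenize it further by configuring a filter. An ngram filter splits "zhongwen" into smaller pieces such as "zh", "ho", "on" and so on, and that is what makes fuzzy search possible.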
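
The effect described above can be sketched in a few lines of Python. This is only an illustration and does not call Elasticsearch or the plugin: the document names, the pinyin strings, and the fixed gram length of 2 are assumptions chosen to mirror the "zh", "ho", "on" example, and the matching rule (any shared gram counts as a hit) is a simplification of how an OR query over analyzed terms behaves.

```python
def ngrams(token, min_gram=2, max_gram=2):
    """Return every character n-gram of token with length in [min_gram, max_gram]."""
    return [
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    ]

# Hypothetical index: each title was turned into ONE whole pinyin token.
docs = {
    "doc1": "dianshijidiaojia",  # 电视机吊架
    "doc2": "dianshiwang",       # 电视网
    "doc3": "kangjiajituan",     # 康佳集团
}

def whole_token_hits(query):
    """A document matches only if its single token equals the query exactly."""
    return [name for name, token in docs.items() if token == query]

def ngram_hits(query):
    """A document matches if it shares at least one n-gram with the query."""
    query_grams = set(ngrams(query))
    return [name for name, token in docs.items() if query_grams & set(ngrams(token))]

print(ngrams("zhongwen"))             # the pieces an n-gram step would produce
print(whole_token_hits("dianshiji"))  # no document consists of exactly this token
print(ngram_hits("dianshiji"))        # every document shares some 2-gram
```

Under these assumptions the whole-token lookup finds nothing for "dianshiji", while the n-gram lookup returns all three documents, including the unrelated doc3, because it happens to share grams such as "an" and "ji" with the query.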

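liukaitj commented on June 10, 2024

But the ngram filter then matches every document that contains "zh" or "on", which is clearly not what one usually wants. Is there a filter somewhere in between, one that splits into "zhong" and "wen", for example?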

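liukaitj commented on June 10, 2024

I looked into it further, and the my_pinyin tokenizer seems to be the problem. Take the four characters "全国首发": inside ES they are tokenized into the single token "quan guo shou fa" rather than the four tokens "quan", "guo", "shou", "fa". The padding_char parameter of my_pinyin is set to a space, so why does the tokenizer not split on it? Strange.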

medcl commented on June 10, 2024

Use the whitespace or standard filter, set the pinyin padding char to a space, and split on the spaces.

medcl commented on June 10, 2024

padding only inserts a separator between the individual pinyin syllables; it does not split them.
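
The difference between padding and splitting can be sketched as follows. This is plain Python, not the plugin's code: `pad_pinyin` and `whitespace_split` are hypothetical helpers that model the behavior described in this thread, with `str.split` standing in for whatever whitespace-based splitting step is configured in ES.

```python
def pad_pinyin(syllables, padding_char=" "):
    """Join syllables with padding_char; the result is still ONE string (one token)."""
    return padding_char.join(syllables)

def whitespace_split(token):
    """Stand-in for a whitespace splitting step: break one token on spaces."""
    return token.split()

syllables = ["quan", "guo", "shou", "fa"]  # 全国首发

padded = pad_pinyin(syllables)             # one token containing spaces
unpadded = pad_pinyin(syllables, "")       # one token without separators

print(repr(padded))
print(whitespace_split(padded))            # splitting only works when a separator exists
print(whitespace_split(unpadded))
```

With a space as the padding character, the later split yields four separate syllables. With an empty padding character there is nothing to split on, so the whole string stays a single token.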

ganxiaomao commented on June 10, 2024

Thank you very much for your reply. I will try it.

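liukaitj commented on June 10, 2024

I have dropped ngram; its matching is far too imprecise. I wrote an extension, a pinyin plugin built on IK tokenization, and put it here: https://github.com/liukaitj/elasticsearch-analysis-ik-pinyin . It matches pinyin against the phrases that IK tokenization produces, which avoids the over-matching problem.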

medcl commented on June 10, 2024

Actually, you can simply use an ik tokenizer and add a pinyin filter after it.
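
The idea of "word tokenizer first, pinyin filter second" can be sketched like this. Nothing here is real IK or plugin code: `segment` is a hard-coded stand-in for a word tokenizer, the segmentation it returns is an assumption (IK may split differently), and `PINYIN` is a tiny hypothetical lookup table covering only the characters in the example.

```python
# Hypothetical per-character pinyin table, just large enough for the example.
PINYIN = {"电": "dian", "视": "shi", "机": "ji", "吊": "diao", "架": "jia"}

def segment(text):
    """Stand-in for a word tokenizer such as IK; the split below is assumed."""
    hard_coded = {"电视机吊架": ["电视机", "吊架"]}
    return hard_coded.get(text, [text])

def pinyin_filter(tokens):
    """Convert each token to pinyin independently, keeping token boundaries."""
    return ["".join(PINYIN[ch] for ch in token) for token in tokens]

title = "电视机吊架"

whole = pinyin_filter([title])            # pinyin applied to the unsegmented title
per_word = pinyin_filter(segment(title))  # pinyin applied after word segmentation

print(whole)
print(per_word)
print("dianshiji" in whole, "dianshiji" in per_word)
```

Because the title is cut into words before conversion, "dianshiji" exists as a term of its own and an exact term lookup can find it, without the over-matching that short n-grams cause.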
