ik-analyzer's Issues

java.lang.ArrayIndexOutOfBoundsException: 0

java.lang.ArrayIndexOutOfBoundsException: 0
        at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:167)
        at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:156)
        at org.wltea.analyzer.dic.Dictionary.loadExtendWords(Dictionary.java:405)

Brother Lin, this error is currently thrown when the segmenter dictionary is being loaded.

I have not yet found the root cause. Please help take a look.

Original issue reported on code.google.com by xplazy on 1 Aug 2011 at 8:19

分词"词元长度优先"导致在lucene中高亮显示错位

Symptom: when indexing, if a Document contains only one Field named "content", i.e.
********
doc.add(new Field("content", "大家好", Store.YES, Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS))
********
highlighting works correctly.
But if a Document contains two or more Fields named "content", i.e.
********
doc.add(new Field("content", "大家好", Store.YES, Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("content", "这里是随便的一句话", Store.YES, Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
********
then the highlighting result for the query "随便" is "大家好;这里<b>是随<b/>便的一句话", whereas the expected result is "大家好;这里是<b>随便<b/>的一句话".
Note 1: when a Document contains multiple Fields with the same name, their values are concatenated for highlighting, separated by one character; in the example above the separator is a half-width semicolon.
Note 2: this problem only occurs with the IK analyzer; other analyzers such as StandardAnalyzer are unaffected.

Cause: when IK segments "大家好" the result is
********
[0-3] 大家好
[0-2] 大家
********
which shifts the offset and position calculations off by one character at index time. If IK instead produced
********
[0-2] 大家
[0-3] 大家好
********
the problem would not occur.

Fix: modify the following code in the Lexeme.compareTo method
********
if(this.begin == other.getBegin()){
  //longer lexeme first
  if(this.length > other.getLength()){
    return -1;
  }else if(this.length == other.getLength()){
    return 0;
  }else {
    return 1;
  }
}
********
and invert the "longer lexeme first" logic, i.e. swap the return values -1 and 1.
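A minimal sketch of the swapped comparison (assuming the Lexeme fields begin and length shown above); lexemes starting at the same offset would then sort shortest first:
********
if(this.begin == other.getBegin()){
  //shorter lexeme first: return values -1 and 1 swapped relative to the original
  if(this.length > other.getLength()){
    return 1;
  }else if(this.length == other.getLength()){
    return 0;
  }else {
    return -1;
  }
}
********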

Original issue reported on code.google.com by [email protected] on 19 Jan 2011 at 5:12

disableWords can only remove user-defined words, not words from the built-in dictionary

For Example:

IKSegmenter iks = new IKSegmenter(new StringReader("uml Just test"), true);
List<String> words = new ArrayList<String>();
words.add("uml");
Dictionary.disableWords(words);
Lexeme exeme = null;
do {
    exeme = iks.next();
    if (exeme != null) {
        System.out.println(exeme.getLexemeText());
    }
} while (exeme != null);
Am I using it incorrectly?




Original issue reported on code.google.com by [email protected] on 30 Mar 2012 at 6:27

Please add an ant, maven, or ivy build configuration file

First of all, greetings to the developers; thank you for your hard work!
If possible, please add an ant, maven, or ivy configuration file to make it easy for us to download and build the project. Of course, if you give me a list of the dependencies, I can also help put together a maven/ant/ivy build configuration.


Original issue reported on code.google.com by [email protected] on 29 Mar 2012 at 9:20

Could the source code be uploaded to SVN?

Hello Brother Lin,

I have been using IKAnalyzer and learning from it, and I was delighted to hear that a new version has been released. Could the new code be uploaded to SVN? The version currently available via SVN checkout is still 2.8. Many thanks for your contribution to open-source word segmentation; best wishes.

Lovell Liu
2012/3/16

Original issue reported on code.google.com by [email protected] on 16 Mar 2012 at 3:09

Problem with multiple spaces in a query

"唱歌_____跳舞" (underscores denote spaces) is parsed as
==> (+title:唱歌 +title: +title: +title: +title: +title:跳舞)

Best regards

Original issue reported on code.google.com by [email protected] on 20 Nov 2009 at 5:31

Segmentation problem

What steps will reproduce the problem?
1. IKSegmentation ikSeg = new IKSegmentation(new StringReader(testString), true);
2. Segment the sentence "一一列举,一一对应"

What is the expected output? What do you see instead?
I believe the expected output is:
0-4 : 一一列举 :    CJK_NORMAL
0-2 : 一一 :  NUMEBER
0-1 : 一 :     UNKNOWN
1-3 : 一列 :  CJK_NORMAL
2-4 : 列举 :  CJK_NORMAL
2-3 : 列 :     COUNT
5-9 : 一一对应 :    CJK_NORMAL
5-7 : 一一 :  NUMEBER
5-6 : 一 :     UNKNOWN
6-8 : 一对 :  CJK_NORMAL
7-9 : 对应 :  CJK_NORMAL
But the actual output is:
0-2 : 一一 :  NUMEBER
0-1 : 一 :     UNKNOWN
1-3 : 一列 :  CJK_NORMAL
2-4 : 列举 :  CJK_NORMAL
2-3 : 列 :     COUNT
5-9 : 一一对应 :    CJK_NORMAL
5-7 : 一一 :  NUMEBER
5-6 : 一 :     UNKNOWN
6-8 : 一对 :  CJK_NORMAL
7-9 : 对应 :  CJK_NORMAL

What version of the product are you using? On what operating system?
Windows 7 Ultimate English (64-bit), JDK 1.6_u21 (64-bit), Eclipse 3.4.1 (default encoding GBK). The same problem occurs in both debug and run mode (inside the Eclipse project).

Please provide any additional information below.

I suspect that the UTF-8 BOM at the start of the main dictionary file corrupts its first word ("一一列举" is the first word in the main dictionary).
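If the BOM is indeed the cause, a loading-side workaround is to strip it before the first word is inserted. A minimal sketch, assuming the dictionary loader reads the .dic file line by line; the class and method below are hypothetical stand-ins, not IK's actual Dictionary.loadMainDict:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class BomSafeDictLoader {
    private static final char BOM = '\uFEFF';

    // Reads dictionary words one per line, dropping a leading UTF-8 BOM if present.
    public static void load(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        String word;
        boolean firstLine = true;
        while ((word = reader.readLine()) != null) {
            if (firstLine) {
                // A UTF-8 BOM decodes to U+FEFF at the start of the first line.
                if (!word.isEmpty() && word.charAt(0) == BOM) {
                    word = word.substring(1);
                }
                firstLine = false;
            }
            word = word.trim();
            if (!word.isEmpty()) {
                System.out.println(word); // the real loader would insert it into the dictionary here
            }
        }
    }
}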

Original issue reported on code.google.com by [email protected] on 7 Apr 2011 at 10:59

How to implement an AND operation between keywords

1. I am using IKAnalyzer3.2.3Stable_bin and lucene-3.0.2
2. I use IKQueryParser.parseMultiField(fields, keyword)
3. How do I combine keywords with an AND operation? (see the sketch below)
4. Which Lucene version does the Query Parser Syntax used by IKAnalyzer 3.2.3 correspond to?
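One possible way to get AND semantics with Lucene 3.0.x is to parse each keyword separately and combine the results with MUST clauses. This is only a sketch built on the parseMultiField call mentioned above (assuming IKQueryParser lives under org.wltea.analyzer.lucene as in the 3.x sources), not an API the library provides for this purpose:

import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.wltea.analyzer.lucene.IKQueryParser;

public class AndQuerySketch {
    // All keywords must match (each in any of the given fields).
    public static Query parseAll(String[] fields, String[] keywords) throws Exception {
        BooleanQuery combined = new BooleanQuery();
        for (String keyword : keywords) {
            Query q = IKQueryParser.parseMultiField(fields, keyword);
            combined.add(q, Occur.MUST);
        }
        return combined;
    }
}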




Original issue reported on code.google.com by [email protected] on 27 Jul 2010 at 8:47

Part-of-speech question

The Lexeme returned by IK carries type information such as NORMAL, but I feel that is not enough.
Could the current version also provide prior part-of-speech information such as noun or verb?


Original issue reported on code.google.com by [email protected] on 8 Jul 2011 at 4:09

Support for Solr 3.3

When using IK for phrase queries in Solr 3.3, the phrase is always split into an AND relationship. Does IK currently not support use with Solr 3.3?


Original issue reported on code.google.com by [email protected] on 16 Aug 2011 at 2:17

Setting isMaxWordLength to true: the search finds hits, but the highlighted value is null

IKAnalyzer analyzer = new IKAnalyzer(true);
        QueryParser parser = new QueryParser(Version.LUCENE_29, "title",
                new IKAnalyzer());
        Directory fsDir = FSDirectory.open(new File("D:/java/server/apache-tomcat-6.0.18/bin/solr/data/index"));
        IndexReader reader = null;

        try {
            reader = IndexReader.open(fsDir, true);

            Query query = parser.parse("title:【搜索引擎】Google购物搜索");

            IndexSearcher searcher = new IndexSearcher(reader);

            TopDocs hits = searcher.search(query, null, 10000);

            Document doc = null;

            ScoreDoc[] scoreDocs = hits.scoreDocs;
            int length = scoreDocs.length;

            for (int i = 0; i < length; i++) {
                doc = searcher.doc(scoreDocs[i].doc);
                // has a value
                String value = doc.get("title");
                //System.out.println(doc.get("id"));
                //System.out.println(doc.getBoost());

                SimpleHTMLFormatter sHtmlF = new SimpleHTMLFormatter(
                        "<b>", "</b>");
                Highlighter highlighter = new Highlighter(sHtmlF, new QueryScorer(
                        query));
                highlighter.setTextFragmenter(new SimpleFragmenter(100000));
                if (value != null) {
                    TokenStream tokenStream = analyzer.tokenStream("title",
                            new StringReader(value));
                    String str = highlighter.getBestFragment(tokenStream, value);
                    // the value is null
                    System.out.println(str);
                }

            }

            searcher.close();

Original issue reported on code.google.com by [email protected] on 28 Jun 2010 at 3:17

Segmentation error when building an index

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.wltea.analyzer.dic.DictSegment.lookforSegment(DictSegment.java:183)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:148)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:152)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:152)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:152)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:128)
    at org.wltea.analyzer.dic.Dictionary.loadMainDict(Dictionary.java:97)
    at org.wltea.analyzer.dic.Dictionary.<init>(Dictionary.java:71)
    at org.wltea.analyzer.dic.Dictionary.<clinit>(Dictionary.java:41)
    at org.wltea.analyzer.seg.ChineseSegmenter.<init>(ChineseSegmenter.java:37)
    at org.wltea.analyzer.cfg.Configuration.loadSegmenter(Configuration.java:114)
    at org.wltea.analyzer.IKSegmentation.<init>(IKSegmentation.java:54)
    at org.wltea.analyzer.lucene.IKTokenizer.<init>(IKTokenizer.java:44)
    at org.wltea.analyzer.lucene.IKAnalyzer.tokenStream(IKAnalyzer.java:45)
    at org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:52)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:126)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:246)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:773)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:751)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1928)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1902)

Original issue reported on code.google.com by [email protected] on 10 Mar 2010 at 4:23

DictSegment matching code

Lines 89-92 of DictSegment.java:
for(DictSegment seg : segmentArray){
    if(seg != null && seg.nodeChar.equals(keyChar)){
        //found the matching segment
        ds = seg;
    }
}
After ds = seg; has found the match, it seems the code should break out of the for loop.
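A sketch of the suggested change, using the fields from the snippet above:

for(DictSegment seg : segmentArray){
    if(seg != null && seg.nodeChar.equals(keyChar)){
        //found the matching segment; no need to scan the rest of the array
        ds = seg;
        break;
    }
}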

Original issue reported on code.google.com by [email protected] on 28 Apr 2010 at 3:41

A very easily reproduced bug in IKAnalyzerDemo

text and keyword are both set to a very long string consisting of the character "顶" repeated continuously for dozens of lines and ending with "回顶下~~~~~~~~~~~~~~~~".

The demo then crashes.
Please fix this as soon as possible.

Original issue reported on code.google.com by [email protected] on 20 Dec 2010 at 6:10

When running the segmentation test, calling Dictionary.addWords(distinctWords) before the Analyzer has been loaded throws a NullPointerException

To explain: what I need is to add words dynamically at runtime (including loading a specified word list at initialization).
1. Calling Dictionary.addWords(Collection<String>) before org.wltea.analyzer.lucene.IKAnalyzer() has been initialized throws a NullPointerException.
2. After studying the code I found that Dictionary.getInstance(new Configuration()) has to be called manually beforehand (see the sketch below).
3. I think this counts as a bug?
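A minimal sketch of the workaround from step 2, using the calls exactly as reported above; the import paths follow the package layout seen in the stack traces elsewhere in these issues, and the exact signatures may differ between IK versions:

import java.util.Arrays;
import java.util.List;

import org.wltea.analyzer.cfg.Configuration;
import org.wltea.analyzer.dic.Dictionary;

public class DynamicWordsSketch {
    public static void main(String[] args) {
        // Create the singleton dictionary before adding words; calling
        // addWords first dereferences an uninitialized instance.
        Dictionary.getInstance(new Configuration());

        List<String> words = Arrays.asList("自定义词", "另一个词");
        Dictionary.addWords(words);
    }
}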

Thank you for your hard work and open-source spirit!

Keep up the good work!

Original issue reported on code.google.com by [email protected] on 16 Mar 2012 at 12:35

Could IKAnalyzer be published to a Maven repository?

Hello:
Because our projects all manage jars with Maven, could IKAnalyzer be published to a Maven repository?
That would make dependency management much easier.
Thanks!!

Original issue reported on code.google.com by [email protected] on 29 Apr 2010 at 5:37

Give callers more ways to configure

Currently the Configuration constructor is private and the instance is obtained via a static getter, so the XML configuration file has to be on the classpath.
I would like the XML configuration file to be specifiable from code, for example by passing a Configuration when constructing an Analyzer, so that callers can load configuration from a database or some other source.
The same goes for the dictionary files: it would be even better if Configuration could accept a String[] or a Reader as the word dictionary or the StopWords dictionary, so that dictionaries can be read from a database or anywhere else.
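A purely hypothetical sketch of the requested API shape (none of these types or overloads exist in IK; this only illustrates what the request asks for):

import java.io.Reader;

// Hypothetical: a configuration object the caller builds in code and hands
// to the analyzer, instead of relying on a classpath XML file.
class CallerSuppliedConfig {
    private String[] extraWords;      // e.g. loaded from a database
    private Reader stopWordSource;    // e.g. streamed from any source

    public void setExtraWords(String[] words) { this.extraWords = words; }
    public void setStopWordSource(Reader reader) { this.stopWordSource = reader; }

    public String[] getExtraWords() { return extraWords; }
    public Reader getStopWordSource() { return stopWordSource; }
}

// Hypothetical constructor overload the issue is asking for:
//     Analyzer analyzer = new IKAnalyzer(new CallerSuppliedConfig());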

Original issue reported on code.google.com by [email protected] on 24 Jun 2011 at 12:36

Is it compatible with Lucene 2.9.2?

Hello:
The situation in my project's testing is as follows:
We use Hibernate Search, which is currently only compatible with Lucene 2.9.
We downloaded the IKAnalyzer 3.2 source code and found that its IKTokenizer is implemented against Lucene 3.0.
Is there any way to make these two work together?
Thanks!!

Original issue reported on code.google.com by [email protected] on 29 Apr 2010 at 7:14

Suggestion for the interface signatures

org.wltea.analyzer.dic.Dictionary.loadExtendWords(List<String> extWords)
org.wltea.analyzer.dic.Dictionary.loadExtendStopWords(List<String> extStopWords)
......

Change List to Collection to make the interfaces easier to extend.

Original issue reported on code.google.com by [email protected] on 3 Aug 2010 at 2:42

The plus sign is dropped; what should I do?

IKQueryParser.parse("a", QueryParser.escape("我爱C++"))

I have already added C++ to the dictionary, but the segmentation result contains only C, not C++. How should this be handled?

Original issue reported on code.google.com by AvengerBevis on 17 Feb 2011 at 6:14

Words combining digits and Chinese characters are not effective in queries

What steps will reproduce the problem?
1. For example, load a custom dictionary that includes words combining digits and Chinese characters:
List<String> words = new ArrayList<String>();
words.add("江宁");
words.add("江宁区");
words.add("庄排路");
words.add("胜利新寓");
words.add("39号");
words.add("22栋");
words.add("1单元");
words.add("201室");
Dictionary.loadExtendWords(words);
2. Use IKQueryParser to parse a query string:
String source = "江宁江宁区庄排路胜利新寓39号22栋1单元201室";
Query ikQuery = IKQueryParser.parse("fullName", source);
3. Print the resulting ikQuery as a string; you will find that the all-Chinese words take effect, but the words that combine digits and Chinese characters are not treated as single words during segmentation:
System.out.println("IKQuery = " + ikQuery.toString());
What is the expected output? What do you see instead?
Expected output:
IKQuery = +fullName:江宁 +((+fullName:江宁区 +fullName:庄排路) 
fullName:区庄) +fullName:胜利新寓 +((fullName:39 +fullName:号) 
fullName:39号) +((fullName:22 +fullName:栋) fullName:22栋) +((fullName:1 
+fullName:单元) fullName:1单元) +((fullName:201 +fullName:室) 
fullName:201室)

Actual output:
IKQuery = +fullName:江宁 +((+fullName:江宁区 +fullName:庄排路) 
fullName:区庄) +fullName:胜利新寓 +fullName:39 +fullName:号 +fullName:22 
+fullName:栋 +fullName:1 +fullName:单元 +fullName:201 +fullName:室

In other words, during segmentation the digits become one token and the following Chinese character another, but the digit+Chinese combination is never produced as a single word, even though such words were explicitly loaded into the dictionary.
What version of the product are you using? On what operating system?
Version: V3.2.3
OS: Windows 7


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 8 Jul 2010 at 4:49

How to get the segmentation results

For example, a piece of text is segmented into words like: 用户|套餐|咨询|2M|陈佳|...
How do I get these words, and how often each word occurs? (see the sketch below)

Sorry, I am a beginner; thanks for your guidance.
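A minimal sketch of collecting the words and their frequencies. The package names below follow the 2012-era stack traces elsewhere in this list (older 3.x releases use org.wltea.analyzer.IKSegmentation and org.wltea.analyzer.Lexeme instead), so adjust the imports to your version:

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class TermFrequencySketch {
    // Segments the text and counts how often each word appears.
    public static Map<String, Integer> count(String text, boolean useSmart) throws IOException {
        IKSegmenter seg = new IKSegmenter(new StringReader(text), useSmart);
        Map<String, Integer> freq = new HashMap<String, Integer>();
        Lexeme lexeme;
        while ((lexeme = seg.next()) != null) {
            String word = lexeme.getLexemeText();
            Integer n = freq.get(word);
            freq.put(word, n == null ? 1 : n + 1);
        }
        return freq;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(count("用户来电咨询2M套餐,用户咨询套餐资费", true));
    }
}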

Original issue reported on code.google.com by [email protected] on 25 Oct 2011 at 9:15

Traditional Chinese

Does it support segmentation of Traditional Chinese?

If so, what are the main differences in usage between Traditional and Simplified Chinese segmentation?

Original issue reported on code.google.com by [email protected] on 3 May 2011 at 7:18

Dictionary problem

/org/wltea/analyzer/dic/main.dic
/org/wltea/analyzer/dic/stopword.dic
/org/wltea/analyzer/dic/surname.dic
These three files have a BOM. Also, I suggest moving the default stopword.dic out of the jar and letting users decide whether to use it; the default stop words cause many side effects when searching English names.

Original issue reported on code.google.com by [email protected] on 14 Apr 2011 at 2:57

Two problems with IKAnalyzer

1. Dictionary.addWords fails when the added word list is too large (close to 300,000 words in my case); with about 200,000 words the problem does not occur:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.wltea.analyzer.dic.DictSegment.lookforSegment(DictSegment.java:228)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:199)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:204)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:204)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:170)
    at org.wltea.analyzer.dic.Dictionary.addWords(Dictionary.java:119)

2. When the dictionary jumps from 200,000 to 300,000 words, segmentation speed drops sharply. What causes this?

Environment: Win7 + JDK 1.6
version: IKAnalyzer2012_u3.zip 


What version of the product are you using? On what operating system?


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 27 Mar 2012 at 11:05

Stop-word list has no effect?

I load the stop-word list dynamically. For example, with the stop word "怎么了", debugging confirms the stop word is present:
org.wltea.analyzer.dic.Dictionary.isStopWord("怎么了吧".toCharArray(), 0, 3) returns true.
But at query time the stop word has no effect: when I search for "你怎么了啊", the query parser produces: Boolean Query: +NAME:"你 怎么 啊".




Original issue reported on code.google.com by [email protected] on 12 Nov 2010 at 3:17

i want to join this project

I am a student from Central South University, majoring in software engineering, and it would be my pleasure to join this project. Since I can't find any email address for you, please contact me at: canly.xiao[at]gmail.com. Thanks.

Original issue reported on code.google.com by [email protected] on 30 Oct 2010 at 12:18

Exception thrown when running on Linux

What steps will reproduce the problem?
1. Exception in thread "main" java.lang.NullPointerException
        at org.wltea.analyzer.core.AnalyzeContext.compound(AnalyzeContext.java:382)
        at org.wltea.analyzer.core.AnalyzeContext.getNextLexeme(AnalyzeContext.java:325)
        at org.wltea.analyzer.core.IKSegmenter.next(IKSegmenter.java:116)
        at org.wltea.analyzer.lucene.IKTokenizer.incrementToken(IKTokenizer.java:73)
2.
3.

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?


Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 14 Mar 2012 at 10:11

With the latest u2 update package, quantifiers ending in 0 are still not merged for UTF-8 encoded input sentences

What steps will reproduce the problem?
Thanks for your hard work. With the latest u2 update package, quantifiers whose numeric part ends in 0 are still not merged for UTF-8 encoded input sentences. Example below:
1.      [java]    64    66  |1丈|
     [java]    66    69  |三百克|
     [java]    69    72  |1公克|
     [java]    72    74  |5克|
     [java]    74    76  |10|
     [java]    76    77  |克|
     [java]    77    78  |向|
     [java]    78    80  |迭代|
     [java]    80    81  |最|
     [java]    81    84  |细粒度|
     [java]    84    86  |切分|
     [java]    86    88  |算法|
     [java]    88    94  |2000ml|
     [java]    96    99  |300|
     [java]    99   100  |克|
     [java]   101   104  |550|
     [java]   104   106  |毫升|
2.
3.

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?


Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 16 Mar 2012 at 1:49

The quantifier "10克" is not recognized while "5克" is; probably caused by the handling of 0

What steps will reproduce the problem?
1. "1丈三百克1公克5克10克向迭代最细粒度切分算法2000ml, 
300克。550毫升时尚
2. 1丈, 三百克, 1公克, 5克, 10, 克, 向, 迭代, 最, 细粒度, 
切分, 算法,2000ml, 300, 克, 550, 毫升,
3.

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?


Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 14 Mar 2012 at 7:38

NullPointerException caused by a particular input string

Input string: "重庆康田2.7亿西永拿地 楼面价2474元/平方按公示,该地块属于沙坪坝区西永组团U分区U8-8-1/03地块,土地用途为二类居住用地、商业金融业用地,占地面积约72756方,建筑规模要求不超过11万方,起拍价约2.1亿。"

Code calling the Lucene tokenization:
public static List<String> analysing(String input, boolean useSmart) {
        try {
            // create the Analyzer instance
            Analyzer analyzer = new IKAnalyzer(useSmart);
            // obtain the token stream
            Reader reader = new StringReader(input);
            TokenStream stream = analyzer.tokenStream("", reader);
            // reset to the beginning of the stream
            stream.reset();
            // attach the term attribute
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            // loop over all tokens, printing each
            List<String> result = new ArrayList<String>();
            while (stream.incrementToken()) {
                LOG.info(termAtt.toString());
                result.add(termAtt.toString());
            }

            return result;
        } catch (Exception e) {
            LOG.error("分词异常", e);
        }

        return null;
    }

Exception:
java.lang.NullPointerException
    at org.wltea.analyzer.core.AnalyzeContext.compound(AnalyzeContext.java:382)
    at org.wltea.analyzer.core.AnalyzeContext.getNextLexeme(AnalyzeContext.java:325)
    at org.wltea.analyzer.core.IKSegmenter.next(IKSegmenter.java:116)
    at org.wltea.analyzer.lucene.IKTokenizer.incrementToken(IKTokenizer.java:73)
    at com.test.util.AnalyzerUtil.analysing(AnalyzerUtil.java:77)
    at com.test.util.AnalyzerUtil.main(AnalyzerUtil.java:54)

Original issue reported on code.google.com by [email protected] on 19 Mar 2012 at 3:47

Quality problem with IKQueryParser

The way IKQueryParser currently splits numeral-measure words gives poor results. For example, even though "三星" is already in the dictionary, because "三" is a numeral the parsed result is an OR of two terms, "f1:三星 f1:三", so the search results contain a pile of content unrelated to 三星. I am currently planning to rewrite IKQueryParser to remove this behavior: in acceptedBranchs, for each TokenBranch whose term is in the dictionary, check whether an adjacent TokenBranch completely overlaps it, and if so filter it out when converting to a Query. What do you think of this?

from:[email protected]

Original issue reported on code.google.com by [email protected] on 20 Oct 2010 at 4:20

How does IK Analyzer handle position increments?

Looking at the source code, it seems several ISegmenter instances process the same text at the same time; how are position increments handled in that case?
Will phrase queries run into problems? (a small diagnostic sketch follows below)
Thanks for the guidance.
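One way to see how IK assigns position increments before relying on phrase queries is to dump them from the token stream; a sketch, assuming the Lucene 3.1+ attribute API that other issues in this list already use (CharTermAttribute):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class PositionIncrementProbe {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new IKAnalyzer();
        TokenStream ts = analyzer.tokenStream("f", new StringReader("一一列举,一一对应"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // An increment of 0 means the token overlaps the previous one,
            // which is what matters for phrase queries.
            System.out.println(term.toString() + " +" + posIncr.getPositionIncrement());
        }
    }
}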


Original issue reported on code.google.com by [email protected] on 23 Dec 2010 at 9:40

When using IKQueryParser, there is no way to make the segmenter use the max-word-length algorithm

What steps will reproduce the problem?
The current IKQueryParser._parse method uses the following code:
IKSegmentation ikSeg = new IKSegmentation(input); //version 3.2, line 118
It defaults to the finest-grained segmentation and provides no way for the user to change the segmentation algorithm.

What is the expected output? What do you see instead?
Suggestion: add a static method to IKQueryParser that lets callers choose between the max-word-length and finest-grained segmentation algorithms, defaulting to finest-grained when not specified:
private static boolean isMaxWordLength = false;

public static void setMaxWordLength(boolean maxWordLength) {
    isMaxWordLength = maxWordLength;
}

Then change line 118 to:
IKSegmentation ikSeg = new IKSegmentation(input, isMaxWordLength);

What version of the product are you using? On what operating system?
3.2.0Stable

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 10 May 2010 at 5:31

Fuzzy search

The index is built with IKAnalyzer and searches use IKParser.

IKParser.parseMut..Fields("金水路10号");
The result set is empty, because the index contains no term "号".

What should I do?

Original issue reported on code.google.com by [email protected] on 28 Nov 2010 at 1:26

Supporting Lucene 3.5


Exceptions:
java.lang.AssertionError: Analyzer implementation classes or at least their tokenStream() and reusableTokenStream() implementations must be final
    at org.apache.lucene.analysis.Analyzer.assertFinal(Analyzer.java:59)
    at org.apache.lucene.analysis.Analyzer.<init>(Analyzer.java:45)
    at org.wltea.analyzer.lucene.IKAnalyzer.<init>(IKAnalyzer.java:65)
    at org.wltea.analyzer.lucene.IKAnalyzer.<init>(IKAnalyzer.java:56)

java.lang.AssertionError: TokenStream implementation classes or at least their incrementToken() implementation must be final
    at org.apache.lucene.analysis.TokenStream.assertFinal(TokenStream.java:119)
    at org.apache.lucene.analysis.TokenStream.<init>(TokenStream.java:92)
    at org.apache.lucene.analysis.Tokenizer.<init>(Tokenizer.java:41)
    at org.wltea.analyzer.lucene.IKTokenizer.<init>(IKTokenizer.java:20)
    at org.wltea.analyzer.lucene.IKAnalyzer.reusableTokenStream(IKAnalyzer.java:52)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:126)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:278)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2066)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2040)
--------------------------------------------------------------------
Solution
1. change org.wltea.analyzer.lucene.IKAnalyzer to "final"
2. change org.wltea.analyzer.lucene.IKTokenizer to "final"
--------------------------------------------------------------------
Test
Not tested yet.

Original issue reported on code.google.com by [email protected] on 24 Mar 2012 at 2:12

Why do the latest segmentation results differ so much from version 1.4?

ChineseAnalyzer ik = new ChineseAnalyzer();
// text to analyze
String text = "IKAnalyzer是一个开源的,10块,100平方米基 java 语言开发"
        + "的轻量级的中文分词工具包。从2006年12月推出1.0版开始, "
        + "IKAnalyzer 已经推出了 3 个大版本。";

List dictList = ik.getKeyWord(text);
for (int i = 0; i < dictList.size(); i++) {
    System.out.println(dictList.get(i));
}
Segmentation result with the latest version:
ikanalyzer
是
一个
开源
的
块
100
平方米
基
java
语言
开发
轻量级
中文
分词
工具包
从
2006
年
12
月
推出
1.0
开始
已经
出了
3
个大
版本

Why are the units no longer attached to the numbers? And the 10 before "块" is lost.


Original issue reported on code.google.com by [email protected] on 18 Nov 2009 at 2:28

IKQParserPlugin, when searching a sentence

What steps will reproduce the problem?
1. solrconfig.xml
2. <queryParser name="ik" class="org.apache.solr.search.IKQParserPlugin"/>
3. IKQParserPlugin code:

public class IKQParserPlugin extends QParserPlugin {
  public static String NAME = "ik";

  public void init(NamedList args) {
  }

  public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    return new IKQParser(qstr, localParams, params, req);
  }
}

class IKQParser extends QParser {
    String defaultField;

    public IKQParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
        super(qstr, localParams, params, req);
    }

    public Query parse() throws ParseException {
        String qstr = getString();

        defaultField = getParam(CommonParams.DF);
        if (defaultField==null) {
            defaultField = getReq().getSchema().getDefaultSearchFieldName();
        }

        Query query = null;
        try {
            query = IKQueryParser.parse(defaultField, qstr);
        }
        catch (IOException e) {
            // TODO: handle exception
        }
        return query;
    }

    public String[] getDefaultHighlightFields() {
        return defaultField == null ? null : new String[] {defaultField};
    }

}

Original issue reported on code.google.com by [email protected] on 28 Jan 2010 at 10:11

Suggest using MUST between the user's own space-separated terms

Other search engines all seem to default to MUST, e.g. for "唱歌 跳舞".

One additional point:
Do methods like IKQueryParser.parseMultiField re-segment the text for each field? Wouldn't it be more efficient to segment once and build multiple Queries from that result? (a sketch of that idea follows below)
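A sketch of the "segment once, build per-field queries" idea, assuming the Lucene 3.x TermQuery/BooleanQuery classes and the IKSegmentation API shown in other issues here (older 3.x package layout); overlap handling and boosts are ignored for brevity:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.wltea.analyzer.IKSegmentation;
import org.wltea.analyzer.Lexeme;

public class SegmentOnceMultiField {
    public static Query build(String[] fields, String keyword) throws IOException {
        IKSegmentation seg = new IKSegmentation(new StringReader(keyword));
        BooleanQuery all = new BooleanQuery();
        Lexeme lexeme;
        while ((lexeme = seg.next()) != null) {         // segment the keyword only once
            BooleanQuery anyField = new BooleanQuery(); // the term may match in any field
            for (String field : fields) {
                anyField.add(new TermQuery(new Term(field, lexeme.getLexemeText())), Occur.SHOULD);
            }
            all.add(anyField, Occur.MUST);
        }
        return all;
    }
}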


Thanks for providing such a good segmentation tool; best wishes.

Original issue reported on code.google.com by [email protected] on 20 Nov 2009 at 5:23

java.lang.ArrayIndexOutOfBoundsException: 3072

In org.wltea.analyzer.core.AnalyzeContext:
//default buffer size
private static final int BUFF_SIZE = 3072;
Can this be changed? I could not find a related method in the javadoc.
I am a beginner, so this may be a basic question; thanks for your patience.

Original issue reported on code.google.com by [email protected] on 20 Mar 2012 at 1:00
