
mmseg4j's Introduction

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

1. mmseg4j is a Chinese word segmenter that implements Chih-Hao Tsai's MMSeg algorithm (http://technology.chtsai.org/mmseg/ ). It also provides a Lucene analyzer and a Solr TokenizerFactory so that it can be used easily in Lucene and Solr.

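2. The MMSeg algorithm has two segmentation methods, Simple and Complex, both based on forward maximum matching. Complex adds four filtering rules to resolve ambiguity. According to the algorithm's author, the correct word identification rate reaches 98.41%. mmseg4j implements both methods.
 * In version 1.5, the simple algorithm segments at about 1100 kb/s and the complex algorithm at about 700 kb/s (test machine: AMD Athlon 64 2800+, 1 GB RAM, Windows XP).
 * Version 1.6 adds max-word segmentation on top of complex: “很好听” -> "很好|好听"; “中华人民共和国” -> "中华|华人|共和|国"; “**人民银行” -> "**|人民|银行".
 * In 1.7-beta, complex currently runs at about 1200 kb/s and simple at about 1900 kb/s, but memory use is about 50 MB. The previous versions all used about 10 MB.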
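
To make the Simple method concrete, here is a minimal Java sketch of forward maximum matching. It is an illustration only, not mmseg4j's implementation: the class name `ForwardMaxMatch`, the four-word toy dictionary, and `MAX_WORD_LEN` are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ForwardMaxMatch {
    // Hypothetical toy dictionary; a real words.dic is far larger.
    static final Set<String> DICT =
            new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
    // Longest entry in the toy dictionary.
    static final int MAX_WORD_LEN = 3;

    // At each position take the longest dictionary word;
    // fall back to a single character when nothing matches.
    static List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + MAX_WORD_LEN);
            while (end > i + 1 && !DICT.contains(text.substring(i, end))) {
                end--;
            }
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(segment("研究生命起源")); // prints [研究生, 命, 起源]
    }
}
```

With this toy dictionary the greedy strategy yields 研究生 | 命 | 起源. This is the kind of ambiguity the Complex rules are meant to resolve, where 研究 | 生命 | 起源 would be the better reading.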

For details of the features implemented in mmseg4j, see http://mmseg4j.googlecode.com/svn/trunk/CHANGES.txt

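3. The classes in the com.chenlb.mmseg4j.example package demonstrate the three segmentation modes.

4. The com.chenlb.mmseg4j.analysis package provides the Lucene analyzer extensions. MMSegAnalyzer uses max-word segmentation by default.

5. The com.chenlb.mmseg4j.solr package provides the Solr TokenizerFactory extension.
Define field types in Solr's schema.xml, for example: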
	<fieldType name="textComplex" class="solr.TextField" >
      <analyzer>
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic"/>
      </analyzer>
    </fieldType>
	<fieldType name="textMaxWord" class="solr.TextField" >
      <analyzer>
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="dic"/>
      </analyzer>
    </fieldType>
	<fieldType name="textSimple" class="solr.TextField" >
      <analyzer>
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="n:/OpenSource/apache-solr-1.3.0/example/solr/my_dic"/>
      </analyzer>
    </fieldType>

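dicPath specifies the dictionary location (each MMSegTokenizerFactory can point to a different directory; a relative path is resolved against solr.home). mode specifies the segmentation mode (simple|complex|max-word; the default is max-word).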
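
The path rule for dicPath can be sketched in a few lines of Java. This is only an illustration of the rule, not the code mmseg4j uses: the class `DicPathResolver`, the method `resolve`, and the sample paths are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class DicPathResolver {
    // A relative dicPath is resolved against solr.home;
    // an absolute dicPath is used unchanged.
    static Path resolve(String solrHome, String dicPath) {
        Path p = Paths.get(dicPath);
        return p.isAbsolute() ? p : Paths.get(solrHome).resolve(p).normalize();
    }

    public static void main(String[] args) {
        // Hypothetical solr.home used only for the demonstration.
        System.out.println(resolve("/opt/solr/home", "dic"));
        System.out.println(resolve("/opt/solr/home",
                Paths.get("my_dic").toAbsolutePath().toString()));
    }
}
```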

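6. Running: the dictionary directory is given by the mmseg.dic.path property, or else is the data directory on the classpath or under the current directory; the default is classpath/data. If you use the mmseg4j-*-with-dic.jar package you do not need to specify a dictionary directory (you may still specify one, and its dictionaries are loaded as well).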

java -jar mmseg4j-core-1.9.0.jar 这里是字符串。

java -cp .;mmseg4j-core-1.9.0.jar -Dmmseg.dic.path=./other-dic com.chenlb.mmseg4j.example.Simple 这里是字符串。

java -cp .;mmseg4j-core-1.9.0.jar com.chenlb.mmseg4j.example.MaxWord 这里是字符串

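7. Handling of some characters
Runs of English, Russian, and Greek letters and of digits (including ①㈠⒈) are each emitted as one token. The current version does not yet handle a few numeral forms: for example, ⅠⅡⅢ are split into single characters. Characters not found in the character dictionary (chars.dic) are also split one by one.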

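8. Dictionaries:
 * data/chars.dic holds single characters and their corpus frequencies. It normally needs no changes. Since version 1.5 it is bundled in the mmseg4j jar, so you do not need to manage it, although a file of this name placed in the dictionary directory may override it.
 * data/units.dic holds single-character unit words. By default it is read from the jar; you can override it with your own file.
 * data/words.dic is the word dictionary, one word per line. You can also use your own. Version 1.5 uses the Sogou dictionary; version 1.0 uses the dictionary shipped with rmmseg.
 * data/wordsxxx.dic: since version 1.6, multiple dictionary files are supported. Files in the data directory (or the directory you specify) whose names start with "words" and end with ".dic" are loaded, for example data/words-my.dic.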
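
The naming rule for multiple dictionary files can be illustrated with a short Java sketch. It is not taken from mmseg4j's source; the class `WordsFileFinder` and its methods are hypothetical, and only the "words" prefix and ".dic" suffix come from the description above.

```java
import java.io.File;
import java.util.Arrays;

class WordsFileFinder {
    // Matches words.dic, words-my.dic, ...: "words" prefix and ".dic" suffix.
    static boolean isWordsFile(String name) {
        return name.startsWith("words") && name.endsWith(".dic");
    }

    // Lists the matching files in a dictionary directory, sorted by path.
    static File[] listWordsFiles(File dicDir) {
        File[] files = dicDir.listFiles((dir, name) -> isWordsFile(name));
        if (files == null) {
            return new File[0]; // not a directory, or an I/O error occurred
        }
        Arrays.sort(files);
        return files;
    }

    public static void main(String[] args) {
        // Defaults to ./data when no directory is given on the command line.
        File dir = new File(args.length > 0 ? args[0] : "data");
        for (File f : listWordsFiles(dir)) {
            System.out.println(f.getName());
        }
    }
}
```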

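9. MMseg4jHandler:
The MMseg4jHandler class lets you control dictionary checking and loading in Solr through a URL. Parameters:
 * dicPath specifies the dictionary directory and behaves like dicPath in MMSegTokenizerFactory (a relative path is resolved against solr.home).
 * check specifies whether to check the dictionaries; its value is true or on.
 * reload specifies whether to try to reload the dictionaries; its value is true or on. When reload is true, the check parameter is ignored.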

solrconfig.xml:

	<requestHandler name="/mmseg4j" class="com.chenlb.mmseg4j.solr.MMseg4jHandler" >
		<lst name="defaults">
			<str name="dicPath">dic</str>
		</lst>
	</requestHandler>

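This lets an external program drive the process: for example, it can trigger a dictionary reload and then decide whether to rebuild the index.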
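
The precedence between the check and reload parameters can be sketched as follows. This is a hypothetical illustration, not MMseg4jHandler's code: the class `HandlerParams`, the `Action` enum, and the exact string comparison are assumptions based only on the parameter descriptions above.

```java
class HandlerParams {
    enum Action { RELOAD, CHECK, NONE }

    // The accepted values are described as "true" or "on".
    static boolean isOn(String value) {
        return "true".equals(value) || "on".equals(value);
    }

    // reload takes precedence: when it is on, check is ignored.
    static Action decide(String check, String reload) {
        if (isOn(reload)) {
            return Action.RELOAD;
        }
        if (isOn(check)) {
            return Action.CHECK;
        }
        return Action.NONE;
    }

    public static void main(String[] args) {
        System.out.println(decide("true", "on")); // prints RELOAD
        System.out.println(decide("on", null));   // prints CHECK
    }
}
```

An external program would pass these values as request parameters to the handler registered under /mmseg4j in solrconfig.xml.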


Tested with Solr 1.3/1.4 and Lucene 2.3/2.4/2.9. Official blog: http://blog.chenlb.com/category/mmseg4j . If you find a problem or bug, contact me at chenlb2008#gmail.com .

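Starting with 1.7.2 and 1.6.2, the core program is packaged separately from the Lucene and Solr extensions, which makes it easier to stay compatible with older Lucene versions. For a Lucene extension targeting older versions (<= Lucene 2.2), model it on MMSegTokenizer.java.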

1.9.0 supports the official Solr/Lucene 4.0.0 release.
1.9.1 supports Solr/Lucene 4.3.1.

You can request features for mmseg4j at http://code.google.com/p/mmseg4j/issues/list .

Release history:

 * 1.0.2 http://mmseg4j.googlecode.com/svn/branches/mmseg4j-1.0/
 * 1.5   http://mmseg4j.googlecode.com/svn/branches/mmseg4j-1.5/
 * 1.6.2 http://mmseg4j.googlecode.com/svn/branches/mmseg4j-1.6/
 * 1.7.3 http://mmseg4j.googlecode.com/svn/branches/mmseg4j-1.7/

mmseg4j's Issues

Error in Solr multi-core mode

HTTP Status 500 - Severe errors in solr configuration. Check your log files for 
more detailed information on what may be wrong. If you want solr to continue 
after configuration errors, change: 
<abortOnConfigurationError>false</abortOnConfigurationError> in solr.xml 
------------------------------------------------------------- 
java.lang.OutOfMemoryError: Java heap space
        at java.util.HashMap.<init>(HashMap.java:209)
        at com.chenlb.mmseg4j.CharNode$TreeNode.<init>(CharNode.java:151)
        at com.chenlb.mmseg4j.CharNode$KeyTree.add(CharNode.java:91)
        at com.chenlb.mmseg4j.CharNode.addWordTail(CharNode.java:25)
        at com.chenlb.mmseg4j.Dictionary$WordsFileLoading.row(Dictionary.java:265)
        at com.chenlb.mmseg4j.Dictionary.load(Dictionary.java:283)
        at com.chenlb.mmseg4j.Dictionary.loadWord(Dictionary.java:209)
        at com.chenlb.mmseg4j.Dictionary.loadDic(Dictionary.java:191)
        at com.chenlb.mmseg4j.Dictionary.reload(Dictionary.java:356)
        at com.chenlb.mmseg4j.Dictionary.init(Dictionary.java:122)
        at com.chenlb.mmseg4j.Dictionary.<init>(Dictionary.java:115)
        at com.chenlb.mmseg4j.Dictionary.getInstance(Dictionary.java:78)
        at com.chenlb.mmseg4j.solr.Utils.getDict(Utils.java:22)
        at com.chenlb.mmseg4j.solr.MMSegTokenizerFactory.inform(MMSegTokenizerFactory.java:70)
        at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:551)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:124)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:481)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:335)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:219)
        at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:161)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96)
        at org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
        at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
        at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
        at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:103)
        at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4638)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5294)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:895)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:871)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:615)
        at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:649)
The error occurs when Solr runs in multi-core mode and each core has its own dictionary files.
solr 3.6  
mmseg4j 1.8.5
tomcat 7.0.27

Original issue reported on code.google.com by [email protected] on 20 Jun 2012 at 9:28

Class Cast Error

What steps will reproduce the problem?
1. Move solr/example to a standalone folder.
2. Build mmseg with ant against Solr's libs, not mmseg's own libs.
3. Put mmseg.jar in example/lib.
4. Update the XML configuration.
5. Run java -jar start.jar.

Result: a class cast error.


Solr version 1.40 and the latest mmseg version, 1.8.2.


I created a project in Eclipse and imported MMseg.jar and all of Solr's jars.

TokenizerFactory tt = new MMSegTokenizerFactory();

This compiles without errors.

I cannot figure out what happened.


The blog comments mention putting mmseg.jar in solr/lib instead of
solr/example/lib.

I do not understand why that would solve the problem. I tried it, but it failed.

Besides, I only want to run the example project standalone with MMSeg.



Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 27 Aug 2010 at 5:18

java.lang.OutOfMemoryError after reload Solr on Tomcat

I use mmseg4j-1.7.2 with solr 1.3 (nightly) on Tomcat 6.0.18. When I reload
solr from Tomcat Web Application Manager I get (it works all right after
restart tomcat):

HTTP Status 500 - Severe errors in solr configuration. Check your log files
for more detailed information on what may be wrong. If you want solr to
continue after configuration errors, change:
<abortOnConfigurationError>false</abortOnConfigurationError> in solr.xml
-------------------------------------------------------------
java.lang.OutOfMemoryError: Java heap space
        at java.util.HashMap.<init>(HashMap.java:209)
        at com.chenlb.mmseg4j.CharNode$TreeNode.<init>(CharNode.java:230)
        at com.chenlb.mmseg4j.CharNode$KeyTree.add(CharNode.java:170)
        at com.chenlb.mmseg4j.CharNode.addWordTail(CharNode.java:30)
        at com.chenlb.mmseg4j.Dictionary$3.row(Dictionary.java:152)
        at com.chenlb.mmseg4j.Dictionary.load(Dictionary.java:202)
        at com.chenlb.mmseg4j.Dictionary.loadDic(Dictionary.java:141)
        at com.chenlb.mmseg4j.Dictionary.init(Dictionary.java:69)
        at com.chenlb.mmseg4j.Dictionary.<init>(Dictionary.java:60)
        at com.chenlb.mmseg4j.solr.MMSegTokenizerFactory.inform(MMSegTokenizerFactory.java:80)
        at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:426)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:102)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:376)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:237)
        at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:113)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
        at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
        at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
        at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
        at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709)
        at org.apache.catalina.core.StandardContext.start(StandardContext.java:4363)
        at org.apache.catalina.core.StandardContext.reload(StandardContext.java:3099)
        at org.apache.catalina.manager.ManagerServlet.reload(ManagerServlet.java:916)
        at org.apache.catalina.manager.HTMLManagerServlet.reload(HTMLManagerServlet.java:536)
        at org.apache.catalina.manager.HTMLManagerServlet.doGet(HTMLManagerServlet.java:114)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:244)

type Status report

message Severe errors in solr configuration. Check your log files for more
detailed information on what may be wrong. If you want solr to continue
after configuration errors, change:
<abortOnConfigurationError>false</abortOnConfigurationError> in solr.xml

description The server encountered an internal error (Severe errors in solr
configuration. Check your log files for more detailed information on what
may be wrong. If you want solr to continue after configuration errors,
change: <abortOnConfigurationError>false</abortOnConfigurationError> in
solr.xml)
that prevented it from fulfilling this request.
Apache Tomcat/6.0.18

Original issue reported on code.google.com by [email protected] on 1 Jul 2009 at 8:27

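Really strong; a little better and it would have an absolute advantage

I ran a side-by-side evaluation of the following six Chinese word segmenters, using a search engine's segmentation requirements as the evaluation criteria: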
IK
-http://code.google.com/p/ik-analyzer/
Mmseg4j
-http://code.google.com/p/mmseg4j/
SmartCN
-a Java implementation of ICTCLAS, from the latest Lucene package, under 
org.apache.lucene.analysis.cn.smart
Paoding
-http://code.google.com/p/paoding/
Stanford
-http://nlp.stanford.edu/software/segmenter.shtml
ICTCLAS2011
-http://hi.baidu.com/drkevinzhang/blog/item/149e29f8ace33e046c22eb45.html

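The results surprised me. Mmseg4j was not only the fastest (Stanford and ICTCLAS2011 are comparatively slow) but also gave the best results: its segmentation granularity is small, it rarely produces irrelevant words (compared with Paoding and IK), and it segments company names well. Its weakness is that its ambiguity resolution is less accurate than that of SmartCN and ICTCLAS2011. Search engines care less about ambiguity than about segmenting nouns, but an improvement in this area would give it an absolute advantage over the other segmenters.

These are the test cases where I found problems: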
结婚 的 和尚 未 结婚 的 
他说 的确 实在 理 
把手 抬起 来 (把/手)
邓 颖 超生 前 使用 过 的 物品 
阿拉 斯 加 遭 强暴 风雪 袭击 致 xx 人 死亡 (强/暴)
今后 三年 中将 翻 两 番 (中/将)
乒乓 球 拍卖 完了 
粮食 不 卖给 八路 军 
ICTCLAS2011 cannot handle the next two either:
费 孝 通向 人大 常委 会 提交 书面 报告 
梁 启 超生 前 住在 这里 
Of course, Mmseg handles many cases that ICTCLAS2011 cannot, such as:
吴 江西 陵 印刷厂

email:[email protected]

Original issue reported on code.google.com by *[email protected] on 12 May 2011 at 6:15

Various problems when used in elasticsearch: not thread-safe, NullPointerException, array index out of bounds, etc.

f82c878dee7","location":{"provinceId":"320000","cityId":"320500"},"address":"常
熟市虞山镇富仓路8号","trade":{"id":"336","name":"医院","parentId":"12
"},"name":"常熟三院(二级甲等)","namePinyin":[]}]}
java.util.ConcurrentModificationException
        at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
        at java.util.AbstractList$Itr.remove(AbstractList.java:357)
        at com.chenlb.mmseg4j.rule.Rule.remainChunks(Rule.java:41)
        at com.chenlb.mmseg4j.ComplexSeg.seg(ComplexSeg.java:93)
        at com.chenlb.mmseg4j.MaxWordSeg.seg(MaxWordSeg.java:19)
        at com.chenlb.mmseg4j.MMSeg.next(MMSeg.java:179)
        at com.chenlb.mmseg4j.analysis.MMSegTokenizer.incrementToken(MMSegTokenizer.java:62)
        at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:55)
        at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:60)
        at org.apache.lucene.analysis.snowball.SnowballFilter.incrementToken(SnowballFilter.java:76)
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:141)
        at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:276)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2060)
        at org.elasticsearch.index.engine.robin.RobinEngine.innerIndex(RobinEngine.java:574)
        at org.elasticsearch.index.engine.robin.RobinEngine.index(RobinEngine.java:486)
        at org.elasticsearch.index.shard.service.InternalIndexShard.index(InternalIndexShard.java:323)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:158)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
[2012-08-01 15:42:28,279][DEBUG][action.bulk              ] [Aftershock] 
[ptc][2] failed to execute bulk item (index) index 
{[ptc][dial][158f4902ad9e335d976f6dbe8b1841b1], 
source[{"id":"158f4902ad9e335d976f6dbe8b1841b1","location":{"provinceId":"430000
","cityId":"430700"},"address":"常德市洞庭大道中段104号","trade":{"id"
:"336","name":"医院","parentId":"12"},"name":"常德市妇幼保健院(二级�
��等)","namePinyin":[]}]}
java.lang.NullPointerException
        at com.chenlb.mmseg4j.MMSeg.next(MMSeg.java:180)
        at com.chenlb.mmseg4j.analysis.MMSegTokenizer.incrementToken(MMSegTokenizer.java:62)
        at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:55)
        at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:60)
       at org.apache.lucene.analysis.snowball.SnowballFilter.incrementToken(SnowballFilter.java:76)
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:197)
        at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:276)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2060)
        at org.elasticsearch.index.engine.robin.RobinEngine.innerIndex(RobinEngine.java:574)
        at org.elasticsearch.index.engine.robin.RobinEngine.index(RobinEngine.java:486)
        at org.elasticsearch.index.shard.service.InternalIndexShard.index(InternalIndexShard.java:323)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:158)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

Original issue reported on code.google.com by [email protected] on 2 Aug 2012 at 3:26

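Is segmentation of HTML code supported?

I have a string like "金属 &gt; 金属丝". &gt; is an HTML entity, right? I want to split it into three tokens:
1. 金属
2. &gt;
3. 金属丝

I tried, but it did not work. Is this officially supported?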

Original issue reported on code.google.com by [email protected] on 19 Mar 2013 at 4:45

Possible bug in 1.8.2: many characters are dropped

Text: 记者连日来在珠江新城、海珠区、荔湾区等区域踩盘,并采访了广州一手住宅主要代理商和部分开发商,试图拼出一幅较为完整的买房群体图:6月房价迅速拉升以来,众多改善型置业和投资客“跑步”进入楼市,尤其是企业主和批发市场档主等生意人,在购房群体中日渐成“多数派”。
Segmentation result: 记 | 连日来 | 在 | 珠江 | 新城 | 海 | 珠 | 区 | 荔 | 湾 | 区 | 等 | 区域 | 踩 | 盘 | 并 | 采访 | 了 | 广州 | 住宅 | 主要 | 代理商 | 和 | 部分 | 商 | 试图 | 拼出 | 较为 | 完整 | 的 | 买房 | 群体 | 图 | 月 | 房价 | 迅速 | 拉 | 升 | 以来 | 众多 | 改善 | 型 | 置 | 业 | 和 | 投资 | 客 | 跑步 | 进入 | 楼 | 市 | 尤其是 | 企业 | 主和 | 批发市场 | 档 | 主 | 等 | 生意人 | 在 | 购房 | 群体 | 中日 | 渐成 | 多数 | 派


Original issue reported on code.google.com by [email protected] on 1 Dec 2009 at 5:30

Words that begin with a digit are not segmented as a whole

#1 Extend the dictionary
Add the word [7天连锁酒店]
#2 Test
Input: 我喜欢7天连锁酒店
Output: 我 | 喜欢 | 7 | 天 | 连锁 | 酒店
Expected: 我 | 喜欢 | 7天连锁酒店

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?
mmseg4j-1.8
ubuntu

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 10 Feb 2011 at 8:30

Problem using the latest version 1.8: Lucene highlighter exception

The project previously used mmseg4j 1.7.3 with the following Lucene setup:
lucene: 2.9, highlighter component: lucene-highlighter-2.9.0.jar

mmseg4j: mmseg4j-all-1.7.3.jar

Searching with Lucene worked fine.
After switching to mmseg4j 1.8, searching throws an exception:
......
org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token 未知 exceeds length of provided text sized 2
    at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:254)
    at org.apache.lucene.search.highlight.Highlighter.getBestFragments(Highlighter.java:184)
    at org.apache.lucene.search.highlight.Highlighter.getBestFragment(Highlighter.java:107)
    at org.apache.lucene.search.highlight.Highlighter.getBestFragment(Highlighter.java:85)

When the highlighter component is switched to lucene-highlighter-2.2.0.jar (Lucene still 2.9, mmseg4j 1.8), the following exception is thrown instead:
java.lang.StringIndexOutOfBoundsException: String index out of range: 3
    at java.lang.String.substring(String.java:1934)
    at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:271)
    at org.apache.lucene.search.highlight.Highlighter.getBestFragments(Highlighter.java:175)
    at org.apache.lucene.search.highlight.Highlighter.getBestFragment(Highlighter.java:101)
    at org.apache.lucene.search.highlight.Highlighter.getBestFragment(Highlighter.java:80)


In both cases, simply switching mmseg4j 1.8 back to 1.7.3 makes everything run normally.

I am busy at the moment and have no time to modify the source code, so I am just reporting the problem briefly. Because this was written in a hurry, I am not sure whether I have explained it clearly.

I hope this can be fixed.
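
Both messages above say that a token offset points past the end of the text handed to the highlighter. The sketch below only illustrates that invariant; `OffsetCheck` and its `Token` class are hypothetical helpers, not part of Lucene or mmseg4j. A check of this kind could be run over an analyzer's output while debugging to find the first token whose offsets do not fit the text.

```java
import java.util.List;

// Hypothetical debugging helper; not a Lucene or mmseg4j class.
class OffsetCheck {
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the first token whose offsets do not fit inside text, or null if all fit. */
    static Token firstInvalid(String text, List<Token> tokens) {
        for (Token t : tokens) {
            if (t.start < 0 || t.end < t.start || t.end > text.length()) {
                return t;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Offsets 2..4 cannot belong to a text of length 2.
        Token bad = firstInvalid("未知", List.of(new Token("未知", 2, 4)));
        System.out.println(bad == null ? "all offsets fit" : "bad token: " + bad.term);
    }
}
```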


Original issue reported on code.google.com by [email protected] on 21 Oct 2009 at 3:08

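Segmentation exception

An error occurs when analyzing text in the Solr admin console, and segmentation fails:
null:java.lang.RuntimeException: java.lang.NoSuchMethodError: org.apache.lucene.analysis.Tokenizer.reset(Ljava/io/Reader;)V

It recovers after a restart, but for some unknown reason the error reappears after a while.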

Original issue reported on code.google.com by [email protected] on 18 Jan 2013 at 9:06

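Request: lowercase English text during segmentation


For example, after "Kobe Bryant" is segmented, searching for kobe returns no results while Kobe does. Although this is a Chinese segmenter, mixed Chinese and English input is common, and it is disappointing when English input finds nothing.
I also read your source code and suggest loading the dictionary through a singleton, which works better in web applications.
Thanks for your help. I also looked at solol's mmseg interface, which is quite good; would you be interested in taking a look? Thanks again for your Solr segmenter.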
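
A minimal sketch of the requested behaviour, assuming tokens are available as plain strings: lowercase every token at both index time and query time so that kobe and Kobe match. The class `LowercaseTokens` is hypothetical. In a Lucene or Solr analyzer chain the same effect is normally obtained by placing a lowercase filter after the tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; shows token normalization only.
class LowercaseTokens {
    static List<String> normalize(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            // Locale.ROOT avoids locale-specific surprises; Chinese characters are unchanged.
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(normalize(List.of("Kobe", "Bryant", "分词")));
    }
}
```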

Original issue reported on code.google.com by [email protected] on 15 Aug 2009 at 6:37

Add license plate segmentation

For example: 津A12345
A1 A12 A123 A1234 A12345 1 2 3 4 5 12 123 1234 12345 23 234 2345 .....
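
The token list above looks like contiguous substrings of the alphanumeric part of the plate. As a sketch under that assumption, the hypothetical helper `PlateNgrams.ngrams` below enumerates every substring within a length range; it is not an mmseg4j feature.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; enumerates character n-grams of a string.
class PlateNgrams {
    /** All contiguous substrings of s whose length is between minLen and maxLen. */
    static List<String> ngrams(String s, int minLen, int maxLen) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < s.length(); start++) {
            for (int len = minLen; len <= maxLen && start + len <= s.length(); len++) {
                out.add(s.substring(start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("A12345", 1, 6));
    }
}
```

The number of tokens grows quadratically with the length of the string, so an index built this way is considerably larger than one built from dictionary words.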

Original issue reported on code.google.com by [email protected] on 9 Dec 2011 at 4:19

Exception while loading mmseg4j under Solr 4 beta


Hi,
When I use mmseg4j-1.9.0.v20120712-SNAPSHOT with Solr 4 beta, I get the error below. I put all the mmseg4j jar files directly under {solr.home}\lib. I attached the schema.xml file; it is the example schema shipped with Solr 4, with a small update made according to the mmseg4j guide. Could someone tell me what is wrong?

Here are the error messages:
----------
init failure for [schema.xml] analyzer/tokenizer: Error loading class 'com.chenlb.mmseg4j.solr.MMSegTokenizerFactory'
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:168)
        at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:369)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:113)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:850)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:539)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:360)
        at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:309)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:106)
        at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:114)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
        at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:754)
        at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:258)
        at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1221)
        at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:699)
        at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:454)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
        at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36)
        at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183)
        at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491)
        at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138)
        at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142)
        at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53)
        at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604)
        at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535)
        at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398)
        at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
        at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
        at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552)
        at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
        at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63)
        at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53)
        at org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:91)
        at org.eclipse.jetty.server.Server.doStart(Server.java:263)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
        at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1215)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1138)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.eclipse.jetty.start.Main.invokeMain(Main.java:457)
        at org.eclipse.jetty.start.Main.start(Main.java:602)
        at org.eclipse.jetty.start.Main.main(Main.java:82)
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: Error loading class 'com.chenlb.mmseg4j.solr.MMSegTokenizerFactory'
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:168)
        at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:344)
        at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
        at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:142)
        ... 46 more
Caused by: org.apache.solr.common.SolrException: Error loading class 'com.chenlb.mmseg4j.solr.MMSegTokenizerFactory'
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:438)
        at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:459)
        at org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:86)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:142)
        ... 50 more
Caused by: java.lang.ClassNotFoundException: com.chenlb.mmseg4j.solr.MMSegTokenizerFactory
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Unknown Source)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:422)
        ... 53 more
------------

thanks.

Original issue reported on code.google.com by [email protected] on 4 Jan 2013 at 7:47

Segmenting “很好听” in max-word mode

What steps will reproduce the problem?
1.  java -cp .;mmseg4j-all-1.8.4.jar com.chenlb.mmseg4j.example.MaxWord
2. Then enter: 很好听
3.

What is the expected output? What do you see instead?
Expected output: 很好 | 好听
Actual result: 很 | 好听
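
To make the expected behaviour concrete, here is a simplified sketch of overlapping two-character splitting. It is an assumption-laden toy, not mmseg4j's MaxWordSeg: the class `MaxWordSketch` and the dictionaries passed to it are hypothetical, and the real rules may differ. Every dictionary pair found at a position is emitted, and a character is emitted alone only when no emitted pair covers it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified overlapping-bigram splitter; NOT mmseg4j's max-word algorithm.
class MaxWordSketch {
    static List<String> split(String chunk, Set<String> dict) {
        List<String> out = new ArrayList<>();
        boolean coveredByPrevious = false; // is chunk[i] the second char of the pair emitted at i-1?
        for (int i = 0; i < chunk.length(); i++) {
            boolean pair = i + 2 <= chunk.length() && dict.contains(chunk.substring(i, i + 2));
            if (pair) {
                out.add(chunk.substring(i, i + 2));
            } else if (!coveredByPrevious) {
                out.add(chunk.substring(i, i + 1));
            }
            coveredByPrevious = pair;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", split("很好听", Set.of("很好", "好听"))));
        System.out.println(String.join(" | ", split("很好听", Set.of("好听"))));
    }
}
```

In this toy, the second call (a dictionary without 很好) produces 很 | 好听, the same shape as the reported result; whether that is what happens inside mmseg4j is not established by the report.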

What version of the product are you using? On what operating system?
mmseg4j-all-1.8.4.jar

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 4 Jan 2011 at 7:30

mmseg4j-1.9.1-snapshot mmseg4j-solr MMSegTokenizerFactory exception

MMSeg is not reset

Here is the code:
public class MMSegTokenizerFactory extends TokenizerFactory implements ResourceLoaderAware {

    public Tokenizer create(Reader input) {
        MMSegTokenizer tokenizer = tokenizerLocal.get();
        if(tokenizer == null) {
            tokenizer = newTokenizer(input);
        } else {
            try {
                tokenizer.setReader(input);
                // reset() should be called here; adding this line fixes the bug
                tokenizer.reset();
            } catch (IOException e) {
                tokenizer = newTokenizer(input);
                log.info("MMSegTokenizer.reset i/o error by:"+e.getMessage());
            }
        }

        return tokenizer;
    }
......
}
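
Why a missing reset() matters can be shown with a toy reusable tokenizer. `ReusableCharTokenizer` below is a hypothetical stand-in, not Lucene's Tokenizer or mmseg4j's MMSegTokenizer; it only assumes that the tokenizer keeps end-of-stream state from the previous input until reset() clears it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Toy stand-in for a reusable tokenizer; NOT a Lucene or mmseg4j class.
class ReusableCharTokenizer {
    private Reader input;
    private boolean exhausted = true; // state left over from the previous stream

    void setReader(Reader input) {
        this.input = input;
    }

    /** Clears per-stream state so the new reader is actually consumed. */
    void reset() {
        exhausted = false;
    }

    /** Returns the next character as a token, or null at end of input. */
    String next() {
        if (exhausted) {
            return null;
        }
        try {
            int c = input.read();
            if (c < 0) {
                exhausted = true;
                return null;
            }
            return String.valueOf((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String drain(ReusableCharTokenizer t) {
        StringBuilder sb = new StringBuilder();
        for (String tok = t.next(); tok != null; tok = t.next()) {
            sb.append(tok);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ReusableCharTokenizer t = new ReusableCharTokenizer();
        t.setReader(new StringReader("ab"));
        t.reset();
        System.out.println("first:  [" + drain(t) + "]");
        t.setReader(new StringReader("cd")); // reused without reset()
        System.out.println("second: [" + drain(t) + "]");
    }
}
```

In this sketch the second input yields no tokens until reset() is called, which is the kind of symptom the one-line fix above addresses.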

Original issue reported on code.google.com by [email protected] on 22 Mar 2013 at 9:03

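Support for multiple dictionary files

Can multiple dictionary files be used?
1. Add several custom dictionaries, e.g. place names, company names, and others.
2. Maintain the dictionary as several separate files.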
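
One way such a feature could work is sketched below: merge every file in a directory whose name matches a pattern into a single word set. The class `MultiDicLoader`, the `words*.dic` pattern, and the one-word-per-line UTF-8 format are assumptions for this sketch; it is not mmseg4j's loader.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical loader; merges several dictionary files into one set.
class MultiDicLoader {
    /** Reads every file matching words*.dic in dir, one word per line. */
    static Set<String> loadWords(Path dir) {
        Set<String> words = new TreeSet<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "words*.dic")) {
            for (Path file : files) {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    String word = line.trim();
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Pass the dictionary directory as the first argument.
        System.out.println(loadWords(Paths.get(args.length > 0 ? args[0] : ".")));
    }
}
```

With this layout, place names and company names can live in files such as words-place.dic and words-company.dic (hypothetical names) and be maintained independently.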




Original issue reported on code.google.com by [email protected] on 5 Apr 2009 at 2:40

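Add unigram segmentation for special fields such as ID card numbers, mobile phone numbers, and license plate numbers

Hello,
     I have been using mmseg4j for segmentation and it works very well. Could mmseg4j add a segmentation mode that performs unigram (single-character) segmentation on special fields such as ID card numbers, mobile phone numbers, and license plate numbers? That way, when searching, entering any contiguous run of digits, or something like 京B3, would find the expected results (with highlighting).
I tried modifying mmseg4j and some of Solr's tokenizers myself. The feature works, but not ideally: once the data reaches hundreds of millions of records, queries become very slow.

Contact: [email protected]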

Original issue reported on code.google.com by [email protected] on 19 Mar 2012 at 3:36

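Request: segment words that mix Chinese with English letters or digits

In practice we encounter words such as “比亚迪F3”, "马自达2", "蜘蛛侠3", "7天连锁酒店" that mix Chinese with English letters or digits. I tried adding these words to the dictionary, but the segmentation result still splits the letters and digits out separately.

In which version could this be supported? Thanks.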


Original issue reported on code.google.com by [email protected] on 3 Oct 2012 at 3:56

1.8.2 highlighting bug

Searching for “日本”
matches the record “<日本本州6日>日本文化体验,经济超值”,
but the highlighting result returns every term in “<日本本州6日>日本文化体验,经济超值” as highlighted.
I segment in max-word mode with my own dictionary.

Solr version: 1.4.

Solr 1.3 + mmseg4j 1.6 does not have this problem.



Original issue reported on code.google.com by [email protected] on 30 Dec 2009 at 1:59

Thread-safety problem in MMSeg segmentation

What steps will reproduce the problem?
1. We use the Elasticsearch plugin: https://github.com/medcl/elasticsearch-analysis-mmseg
2. We put data to the ES node for indexing.
3. Exceptions (null pointer and concurrent modification) happen on some of the data. We checked each field analyzed by mmseg and found no exception when running single-threaded.

The Elasticsearch log is as below:
f82c878dee7","location":{"provinceId":"320000","cityId":"320500"},"address":"常熟市虞山镇富仓路8号","trade":{"id":"336","name":"医院","parentId":"12"},"name":"常熟三院(二级甲等)","namePinyin":[]}]}

java.util.ConcurrentModificationException
        at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
        at java.util.AbstractList$Itr.remove(AbstractList.java:357)
        at com.chenlb.mmseg4j.rule.Rule.remainChunks(Rule.java:41)
        at com.chenlb.mmseg4j.ComplexSeg.seg(ComplexSeg.java:93)
        at com.chenlb.mmseg4j.MaxWordSeg.seg(MaxWordSeg.java:19)
        at com.chenlb.mmseg4j.MMSeg.next(MMSeg.java:179)
        at com.chenlb.mmseg4j.analysis.MMSegTokenizer.incrementToken(MMSegTokenizer.java:62)
        at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:55)
        at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:60)
        at org.apache.lucene.analysis.snowball.SnowballFilter.incrementToken(SnowballFilter.java:76)
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:141)
        at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:276)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2060)
        at org.elasticsearch.index.engine.robin.RobinEngine.innerIndex(RobinEngine.java:574)
        at org.elasticsearch.index.engine.robin.RobinEngine.index(RobinEngine.java:486)
        at org.elasticsearch.index.shard.service.InternalIndexShard.index(InternalIndexShard.java:323)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:158)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

[2012-08-01 15:42:28,279][DEBUG][action.bulk              ] [Aftershock] [ptc][2] failed to execute bulk item (index) index {[ptc][dial][158f4902ad9e335d976f6dbe8b1841b1], source[{"id":"158f4902ad9e335d976f6dbe8b1841b1","location":{"provinceId":"430000","cityId":"430700"},"address":"常德市洞庭大道中段104号","trade":{"id":"336","name":"医院","parentId":"12"},"name":"常德市妇幼保健院(二级�等)","namePinyin":[]}]}

java.lang.NullPointerException
        at com.chenlb.mmseg4j.MMSeg.next(MMSeg.java:180)
        at com.chenlb.mmseg4j.analysis.MMSegTokenizer.incrementToken(MMSegTokenizer.java:62)
        at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:55)
        at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:60)
        at org.apache.lucene.analysis.snowball.SnowballFilter.incrementToken(SnowballFilter.java:76)
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:197)
        at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:276)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2060)
        at org.elasticsearch.index.engine.robin.RobinEngine.innerIndex(RobinEngine.java:574)
        at org.elasticsearch.index.engine.robin.RobinEngine.index(RobinEngine.java:486)
        at org.elasticsearch.index.shard.service.InternalIndexShard.index(InternalIndexShard.java:323)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:158)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
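
The traces are consistent with several threads sharing one segmenter that keeps mutable state. A common mitigation, sketched here with hypothetical classes (`PerThreadSegmenter` and `StatefulSeg` are not mmseg4j or Elasticsearch code), is to give each thread its own instance through a ThreadLocal so that the internal buffers are never shared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one stateful segmenter instance per thread.
class PerThreadSegmenter {
    // Stand-in for a segmenter with a mutable internal buffer.
    static class StatefulSeg {
        private final List<String> buffer = new ArrayList<>();

        List<String> segment(String text) {
            buffer.clear();
            for (int i = 0; i < text.length(); i++) {
                buffer.add(text.substring(i, i + 1));
            }
            return new ArrayList<>(buffer);
        }
    }

    private static final ThreadLocal<StatefulSeg> SEG = ThreadLocal.withInitial(StatefulSeg::new);

    static List<String> segment(String text) {
        return SEG.get().segment(text);
    }

    /** Runs segment() from several threads and counts wrong results or exceptions. */
    static int countFailures(int threads, int rounds) {
        AtomicInteger failures = new AtomicInteger();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            String text = "常熟市虞山镇富仓路" + t;
            Thread worker = new Thread(() -> {
                for (int r = 0; r < rounds; r++) {
                    try {
                        if (!String.join("", segment(text)).equals(text)) {
                            failures.incrementAndGet();
                        }
                    } catch (RuntimeException e) {
                        failures.incrementAndGet();
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("failures: " + countFailures(4, 1000));
    }
}
```

The trade-off is memory: each thread holds its own segmenter state, while read-only data such as the dictionary can still be shared.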

Original issue reported on code.google.com by ww.wang.cs on 2 Aug 2012 at 5:06

Exception during segmentation under Solr 4.0: is this a compatibility problem? Using version 1.9


Caused by: java.lang.NoSuchMethodError: org.apache.lucene.analysis.Tokenizer.reset(Ljava/io/Reader;)V
    at com.chenlb.mmseg4j.analysis.MMSegTokenizer.reset(MMSegTokenizer.java:33)
    at com.chenlb.mmseg4j.solr.MMSegTokenizerFactory.create(MMSegTokenizerFactory.java:51)
    at org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerChain.java:64)
    at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:66)
    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:134)
    at org.apache.lucene.document.Field.tokenStream(Field.java:555)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:95)
    at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:307)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:244)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:373)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1445)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:210)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:432)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:557)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:325)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:230)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:157)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
    ... 12 more

Original issue reported on code.google.com by [email protected] on 28 Dec 2012 at 4:59

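Traditional Chinese

Is segmentation of Traditional Chinese supported?

Is the usage for Traditional Chinese the same as for Simplified Chinese?

Thanks!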


Original issue reported on code.google.com by [email protected] on 3 May 2011 at 7:15

Is Solr 3.6.1 not supported?

solr3.6.1, jetty8.17. After configuring the schema, startup fails with an error:

2012-9-29 11:17:52 org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created : org.apache.solr.analysis.StandardTokenizerFactory
2012-9-29 11:17:52 org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created : org.apache.solr.analysis.TurkishLowerCaseFilterFactory
2012-9-29 11:17:52 org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created : org.apache.solr.analysis.StopFilterFactory
2012-9-29 11:17:52 org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created : org.apache.solr.analysis.SnowballPorterFilterFactory
2012-9-29 11:17:52 org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created text_tr: org.apache.solr.schema.TextField
2012-9-29 11:17:52 org.apache.solr.common.SolrException log
SEVERE: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/util/ResourceLoaderAware
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:429)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
    at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:627)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:378)
    at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:409)
    at org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:83)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
    at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:986)
    at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:60)
    at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:453)
    at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:481)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:490)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:123)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:478)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:332)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:216)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:161)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96)
    at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:119)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:724)
    at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:263)
    at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1238)
    at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:706)
    at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:480)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:39)
    at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:186)
    at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:494)
    at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:141)
    at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:145)
    at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:56)
    at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:609)
    at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:540)
    at org.eclipse.jetty.util.Scanner.scan(Scanner.java:403)
    at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:337)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:121)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:555)
    at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:230)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:81)
    at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:58)
    at org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:96)
    at org.eclipse.jetty.server.Server.doStart(Server.java:277)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1265)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1188)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.eclipse.jetty.start.Main.invokeMain(Main.java:468)
    at org.eclipse.jetty.start.Main.start(Main.java:616)
    at org.eclipse.jetty.start.Main.main(Main.java:92)
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.analysis.util.ResourceLoaderAware
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    ... 72 more

Original issue reported on code.google.com by [email protected] on 29 Sep 2012 at 3:24

Lucene 2.9 changed the Tokenizer class interface, so MMSegTokenizer fails to run

In Lucene 2.4 the Tokenizer class maintains a field: protected Reader input;
in Lucene 2.9 this became: protected CharStream input;

As a result, MMSegTokenizer throws:
java.lang.NoSuchFieldError: input
    at com.chenlb.mmseg4j.analysis.MMSegTokenizer.init(MMSegTokenizer.java:34)
    at com.chenlb.mmseg4j.analysis.MMSegTokenizer.<init>(MMSegTokenizer.java:30)
    at com.chenlb.mmseg4j.analysis.MMSegAnalyzer.tokenStream(MMSegAnalyzer.java:63)
    at search.analysis.MyAnalyzer.getMMseg4jTokenStream(MyAnalyzer.java:97)

Suggested fix:
Have the MMSegTokenizer class maintain its own Reader input member.


Original issue reported on code.google.com by [email protected] on 6 Aug 2009 at 11:42

Bug in 1.8.3: segmentation stops unexpectedly

What steps will reproduce the problem?

----------------------------------------------------------
C:\development\3rdparties\mmseg4j-1.8.3>java -Dmmseg.dic.path=C:\development\3rdparties\mmseg4j-1.8.3\data -jar mmseg4j-all-1.8.3.jar 仅售48元,即可抢购LA MER海蓝之谜洁肤油30ml。其利用了珍贵的海洋精华油,与肌肤�易贴合,是日常清洁护理第一步的最佳选择。
2011-4-24 10:57:13 com.chenlb.mmseg4j.Dictionary getDefalutPath
INFO: look up in mmseg.dic.path=C:\development\3rdparties\mmseg4j-1.8.3\data
2011-4-24 10:57:13 com.chenlb.mmseg4j.Dictionary loadDic
INFO: chars loaded time=199ms, line=12638, on file=C:\development\3rdparties\mmseg4j-1.8.3\data\chars.dic
2011-4-24 10:57:13 com.chenlb.mmseg4j.Dictionary loadWord
INFO: words loaded time=0ms, line=36, on file=C:\development\3rdparties\mmseg4j-1.8.3\data\words-my.dic
2011-4-24 10:57:13 com.chenlb.mmseg4j.Dictionary loadWord
INFO: words loaded time=127ms, line=157202, on file=C:\development\3rdparties\mmseg4j-1.8.3\data\words.dic
2011-4-24 10:57:13 com.chenlb.mmseg4j.Dictionary loadDic
INFO: load all dic use time=331ms
2011-4-24 10:57:13 com.chenlb.mmseg4j.Dictionary loadUnit
INFO: unit loaded time=0ms, line=24, on file=C:\development\3rdparties\mmseg4j-1.8.3\data\units.dic
仅售 | 48 | 元 | 即可 | 抢购 | la

        -- Note: enter QUIT or EXIT to quit
----------------------------------------------------------

What is the expected output? What do you see instead?
Segmentation stops unexpectedly on reaching “LA MER”, and the rest of the text is not segmented.

What version of the product are you using? On what operating system?
mmseg4j-1.8.3, windows 7

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 24 Apr 2011 at 3:03

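Lucene 3.1.0 assertion problem

Hello, I am using mmseg4j-all-1.8.4-with-dic-source.jar and Lucene 3.1.0.

In Lucene 3.1.0, org.apache.lucene.analysis.Analyzer asserts during initialization that the subclass's tokenStream and reusableTokenStream methods are final. mmseg4j's MMSegAnalyzer currently does not declare them final; please fix this.

Thanks.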
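
A check of this kind can be written with reflection. The sketch below is an illustration under assumptions, not Lucene's source: `FinalityCheck` and the analyzer classes inside it are hypothetical stand-ins. It accepts a class that is itself final or that declares both methods final.

```java
import java.io.Reader;
import java.lang.reflect.Modifier;

// Hypothetical sketch of a "class or methods must be final" check; NOT Lucene code.
class FinalityCheck {
    static class BaseAnalyzer {
        public Object tokenStream(String field, Reader reader) { return null; }
        public Object reusableTokenStream(String field, Reader reader) { return null; }
    }

    // Neither the class nor the two methods are final: fails the check.
    static class OpenAnalyzer extends BaseAnalyzer { }

    // A final class passes.
    static final class FinalClassAnalyzer extends BaseAnalyzer { }

    // Final overrides of both methods pass as well.
    static class FinalMethodsAnalyzer extends BaseAnalyzer {
        @Override public final Object tokenStream(String field, Reader reader) { return null; }
        @Override public final Object reusableTokenStream(String field, Reader reader) { return null; }
    }

    static boolean passesFinalCheck(Class<?> clazz) {
        if (Modifier.isFinal(clazz.getModifiers())) {
            return true;
        }
        try {
            return Modifier.isFinal(clazz.getMethod("tokenStream", String.class, Reader.class).getModifiers())
                && Modifier.isFinal(clazz.getMethod("reusableTokenStream", String.class, Reader.class).getModifiers());
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("open:          " + passesFinalCheck(OpenAnalyzer.class));
        System.out.println("final class:   " + passesFinalCheck(FinalClassAnalyzer.class));
        System.out.println("final methods: " + passesFinalCheck(FinalMethodsAnalyzer.class));
    }
}
```

Under this reading, declaring the analyzer class final, or marking its two methods final, would be enough to satisfy the assertion.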

Original issue reported on code.google.com by [email protected] on 15 Apr 2011 at 2:13

java.lang.AbstractMethodError with Solr 1.4 Nightly Build

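1. Use Solr 1.4 nightly build + mmseg4j-1.6.2
2. On startup, a "java.lang.AbstractMethodError" message appears
3. The admin page "http://localhost:8983/solr/admin/" is still reachable,
   but trying segmentation at "http://localhost:8983/solr/admin/analysis.jsp" returns Error 500
See the attached HTM file for the message details.

Does mmseg4j not yet support Solr 1.4?
I am developing a website that must use Solr 1.4
and very much hope mmseg4j can be used with 1.4.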

Original issue reported on code.google.com by [email protected] on 18 Sep 2009 at 9:37

solr4.0 eg4j-1.9.0.v20120712-SNAPSHOT.zip: error during highlighting

What steps will reproduce the problem?
1.
2.
3.

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?


Please provide any additional information below.
Exception details:
INFO: [collection1] webapp=/solr path=/select params={q=news_title:程序&hl.simple.pre=<em>&hl.simple.post=</em>&hl.fl=news_title&wt=xml&hl=true} hits=2 status=500 QTime=150
Oct 11, 2012 5:11:19 PM org.apache.solr.common.SolrException log
SEVERE: null:java.io.IOException: Stream closed
    at java.io.StringReader.ensureOpen(Unknown Source)
    at java.io.StringReader.read(Unknown Source)
    at java.io.BufferedReader.fill(Unknown Source)
    at java.io.BufferedReader.read(Unknown Source)
    at java.io.FilterReader.read(Unknown Source)
    at java.io.PushbackReader.read(Unknown Source)
    at com.chenlb.mmseg4j.MMSeg.readNext(MMSeg.java:42)
    at com.chenlb.mmseg4j.MMSeg.next(MMSeg.java:64)
    at com.chenlb.mmseg4j.analysis.MMSegTokenizer.incrementToken(MMSegTokenizer.java:63)
    at org.apache.solr.highlight.TokenOrderingFilter.incrementToken(DefaultSolrHighlighter.java:629)
    at org.apache.lucene.search.highlight.OffsetLimitTokenFilter.incrementToken(OffsetLimitTokenFilter.java:43)
    at org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:78)
    at org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:50)
    at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:225)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:510)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
    at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:136)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:454)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:275)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1001)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:585)
    at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1770)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)


Original issue reported on code.google.com by [email protected] on 11 Oct 2012 at 9:25

1.9 throws an error when building an index under Solr 4 beta

2012-8-16 0:14:43 com.chenlb.mmseg4j.solr.MMSegTokenizerFactory newSeg
INFO: create new Seg ...
2012-8-16 0:14:43 com.chenlb.mmseg4j.solr.MMSegTokenizerFactory newSeg
INFO: use max-word mode
2012-8-16 0:14:43 org.apache.solr.common.SolrException log
SEVERE: null:java.lang.RuntimeException: java.lang.NoSuchMethodError: org.apache.lucene.analysis.Tokenizer.reset(Ljava/io/Reader;)V
        at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:468)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:296)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
        at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1805)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NoSuchMethodError: org.apache.lucene.analysis.Tokenizer.reset(Ljava/io/Reader;)V
        at com.chenlb.mmseg4j.analysis.MMSegTokenizer.reset(MMSegTokenizer.java:33)
        at com.chenlb.mmseg4j.solr.MMSegTokenizerFactory.create(MMSegTokenizerFa
ctory.java:51)
        at org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerCha
in.java:64)
        at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerW
rapper.java:69)
        at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:82)
        at org.apache.solr.highlight.DefaultSolrHighlighter.createAnalyzerTStrea
m(DefaultSolrHighlighter.java:603)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHigh
lighter(DefaultSolrHighlighter.java:477)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(Defau
ltSolrHighlighter.java:401)
        at org.apache.solr.handler.component.HighlightComponent.process(Highligh
tComponent.java:136)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(Sea
rchHandler.java:206)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandl
erBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter
.java:454)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilte
r.java:275)
        ... 15 more



Original issue reported on code.google.com by [email protected] on 15 Aug 2012 at 4:23
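
A NoSuchMethodError like the one above usually means the jar was compiled against a different version of a library than the one found at runtime. The trace is consistent with Lucene 4 no longer offering Tokenizer.reset(Reader) (to my knowledge it was replaced by setReader(Reader), but treat that as an assumption and check the Lucene migration notes). The following is a minimal diagnostic sketch, not part of mmseg4j: the class name TokenizerApiCheck and the helper hasPublicMethod are made up for illustration. It uses reflection to report which of the two methods the Lucene jar on the classpath actually exposes.

```java
import java.io.Reader;

// Hypothetical diagnostic helper, not part of mmseg4j or Lucene.
final class TokenizerApiCheck {

    /**
     * Returns true if the named class can be loaded and exposes a public
     * method (declared or inherited) with the given name and parameter types.
     */
    static boolean hasPublicMethod(String className, String method, Class<?>... params) {
        try {
            Class.forName(className).getMethod(method, params);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            // Class missing from the classpath, or method not present.
            return false;
        }
    }

    public static void main(String[] args) {
        String tokenizer = "org.apache.lucene.analysis.Tokenizer";
        System.out.println("reset(Reader) present:     "
                + hasPublicMethod(tokenizer, "reset", Reader.class));
        System.out.println("setReader(Reader) present: "
                + hasPublicMethod(tokenizer, "setReader", Reader.class));
    }
}
```

Run it with the same classpath as the Solr webapp. If reset(Reader) is reported as absent, the mmseg4j jar in use was built for an older Lucene API and needs a build that matches the installed Lucene version.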

Garbled Chinese text with Solr 4.0 + mmseg4j

After installing Solr 4.0 + mmseg4j, I opened the Analysis page and analyzed the Chinese text “我是**人”. The result was:

text  raw_bytes  start  end  type          position
½     [c2 bd]    11     12   other_number  1

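I am using the textcomplex mode, and Tomcat's URIEncoding has already been set.

Could anyone tell me what prevents the Chinese text from being displayed? Thanks in advance, everyone!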

Original issue reported on code.google.com by [email protected] on 17 Dec 2012 at 1:37
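
The raw bytes [c2 bd] are the UTF-8 encoding of "½" (U+00BD). One way such a token can appear, offered here as a guess that the report does not confirm, is that the UTF-8 bytes of the submitted text were decoded as ISO-8859-1 somewhere between the browser and the analyzer. The sketch below reproduces that effect with a hypothetical input character, U+56FD, chosen only because its last UTF-8 byte is 0xBD. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a charset mismatch, not mmseg4j or Solr code.
final class MojibakeSketch {

    /** Encodes text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1. */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Formats the UTF-8 bytes of text as space-separated lowercase hex. */
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "\u56FD";              // hypothetical input character
        String garbled = misdecode(input);    // three ISO-8859-1 characters
        String last = garbled.substring(garbled.length() - 1);

        System.out.println(utf8Hex(input));   // prints "e5 9b bd"
        System.out.println(utf8Hex(last));    // prints "c2 bd" (U+00BD)
    }
}
```

If this is what happened, it may be worth checking how the Analysis page submits the text: as far as I know, Tomcat's URIEncoding only affects parameters in the request URI, so a form POST body could still be decoded with a different charset.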

Attachments:

Segmenting “很好听” in max-word mode

What steps will reproduce the problem?
1.  java -cp .;mmseg4j-all-1.8.4.jar com.chenlb.mmseg4j.example.MaxWord
2. Then enter: 很好听
3.

What is the expected output? What do you see instead?
Expected output: 很好 | 好听
Actual output: 很 | 好听

What version of the product are you using? On what operating system?
mmseg4j-all-1.8.4.jar

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 4 Jan 2011 at 7:29

  • Merged into: #16

Version number not updated

The latest version is 1.9.x, but Solr Admin still shows 1.8.
Checking the code, MMseg4jHandler.java still reports the old version...

public String getVersion() {

  return "1.8";

}


Original issue reported on code.google.com by [email protected] on 24 Jan 2013 at 4:10
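
A hard-coded version string like the one quoted above has to be edited by hand on every release. One possible alternative, sketched here under the assumption that the jar's manifest carries an Implementation-Version entry (the report does not say whether mmseg4j's does), is to read the version from the package metadata and fall back to a default. VersionSketch and versionOf are hypothetical names.

```java
// Hypothetical helper, not part of mmseg4j.
final class VersionSketch {

    /**
     * Returns the Implementation-Version recorded in the manifest of the jar
     * that contains the given class, or the fallback if none is available.
     */
    static String versionOf(Class<?> type, String fallback) {
        Package pkg = type.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();
        return (version == null) ? fallback : version;
    }

    public static void main(String[] args) {
        // Outside a jar with a suitable manifest this prints the fallback.
        System.out.println(versionOf(VersionSketch.class, "unknown"));
    }
}
```

A getVersion() method could then return versionOf(getClass(), "unknown") instead of a literal, so the value follows whatever the build writes into the manifest.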

Sentence-splitting bug in MMSeg.next()

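MMSeg.next() has a sentence-splitting bug.
English text that follows a whitespace character is lost, and segmentation stops.

For example, in “手机电子书 http” the “http” after the space is dropped.

This has been fixed; see mmseg4j-1.patch.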



Original issue reported on code.google.com by [email protected] on 28 Mar 2009 at 7:21
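
The expected behaviour described above can be written down as a small check: whitespace ends the current run of text, but scanning continues so that whatever follows is still returned. The sketch below is a toy stand-in and not the code from mmseg4j-1.patch. The class and method names are hypothetical, and it only splits on whitespace without doing any dictionary-based segmentation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the expected behaviour, not the mmseg4j implementation.
final class WhitespaceRunSketch {

    /** Splits text into runs of non-whitespace characters, in order. */
    static List<String> runs(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace closes the current run but does not stop the scan.
                if (current.length() > 0) {
                    out.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // The example from the report, written with Unicode escapes.
        String sample = "\u624B\u673A\u7535\u5B50\u4E66 http";
        System.out.println(runs(sample).size()); // prints 2
        System.out.println(runs(sample).get(1)); // prints http
    }
}
```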

Attachments:

mmseg4j 1.9.0 cannot build an index under Lucene 4



Document doc = new Document();
doc.add(new TextField("content", "This is the text to be indexed.", Field.Store.YES));
doc.add(new TextField("addr", "河北", Field.Store.YES));
indexWriter.addDocument(doc);
indexWriter.close();

Only the addr field gets indexed; the content field does not. Only one field can be indexed. What could be the cause?

Original issue reported on code.google.com by [email protected] on 25 Oct 2012 at 3:13
