CombinatoricsKnife.java (file browser)

  • Published 2016-05-17
/**
 * Copyright 2007 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package net.paoding.analysis.knife;

import java.util.HashSet;

import net.paoding.analysis.dictionary.Dictionary;
import net.paoding.analysis.dictionary.Hit;

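/**
 * Combinatorics Knife.
 * <p>
 * 
 * This Knife treats the run of non-LIMIT characters it encounters as a single word and cuts it out.<br>
 * In addition, if a string that starts with that word appears in x-for-combinatorics.dic, it is cut out as well.
 * 
 * @author Zhiliang Wang [qieqie.wang@gmail.com]
 * 
 * @since 1.0
 * 
 */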
public abstract class CombinatoricsKnife implements Knife, DictionariesWare {

	protected Dictionary combinatoricsDictionary;

	protected HashSet/* <String> */noiseTable;

	public CombinatoricsKnife() {
	}

	public CombinatoricsKnife(String[] noiseWords) {
		setNoiseWords(noiseWords);
	}

	public void setNoiseWords(String[] noiseWords) {
		noiseTable = new HashSet/* <String> */((int) (noiseWords.length * 1.5));
		for (int i = 0; i < noiseWords.length; i++) {
			noiseTable.add(noiseWords[i]);
		}
	}

	public void setDictionaries(Dictionaries dictionaries) {
		combinatoricsDictionary = dictionaries.getCombinatoricsDictionary();
	}

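	public int dissect(Collector collector, Beef beef, int offset) {
		// point == -1 means this pass met no character of POINT nature;
		// if point != -1, it is the start position of the POINT character.
		// That position is returned, and the next Knife starts segmenting from point.
		int point = -1;

		// Exclusive end of the run of same-kind characters (the character at limit is not included),
		// i.e. the position at which assignable met a character of LIMIT nature.
		// If point == -1, limit is returned and the next Knife tries to segment from limit.
		int limit = offset + 1;

		// Build the values of point and limit:
		// advance until a LIMIT character is met;
		// the first POINT character met along the way is recorded as point.
		GO_UNTIL_LIMIT: while (true) {
			switch (assignable(beef, offset, limit)) {
			case LIMIT:
				break GO_UNTIL_LIMIT;
			case POINT:
				if (point == -1) {
					point = limit;
				}
			}
			limit++;
		}
		// If the last character is also ASSIGNED or POINT,
		// and part of beef has already been dissected (so that space can be freed to read in new characters),
		// more characters must be read in before segmenting again.
		if (limit == beef.length() && offset > 0) {
			return -offset;
		}

		// Check whether the dictionary holds words that have this word as a prefix;
		// if so, cut them out.
		int dicWordVote = -1;
		if (combinatoricsDictionary != null && beef.charAt(limit) > 0xFF) {
			dicWordVote = tryDicWord(collector, beef, offset, limit);
		}

		// Collect the words from offset to point and from offset to limit.
		// Note that the word from point to limit is not collected here.
		// -> The characters from point to limit may of course also form a word, but that is not the responsibility of this pass.
		// -> If it should be a word, configure a suitable other Knife instance; that Knife gets the chance to cut it out,
		// -> because point is returned as the start position for the next Knife.

		int pointVote = collectPoint(collector, beef, offset, point, limit,
				dicWordVote);
		int limitVote = collectLimit(collector, beef, offset, point, limit,
				dicWordVote);

		return nextOffset(beef, offset, point, limit, pointVote, limitVote,
				dicWordVote);
	}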

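	/**
	 * Asks for the word from offset up to the first POINT character to be collected, and votes on the position
	 * at which the next Knife starts segmenting. If no POINT character exists, the value of point is -1.
	 * <p>
	 * 
	 * Default implementation: if no character of POINT nature exists, return directly without cutting any word.
	 * 
	 * @param collector
	 * @param beef
	 * @param offset
	 *            start position in beef of the content dissected in this pass
	 * @param point
	 *            position of the first POINT character in the content dissected in this pass; -1 means no such character exists
	 * @param limit
	 *            position of the LIMIT character for the content dissected in this pass
	 * @param dicWordVote
	 *            the vote returned by tryDicWord, or -1 if no dictionary word was found
	 * @return the position voted as the next Knife's start; -1 means abstain. Default implementation: abstain.
	 */
	protected int collectPoint(Collector collector, Beef beef, int offset,
			int point, int limit, int dicWordVote) {
		if (point != -1 && dicWordVote == -1) {
			collectIfNotNoise(collector, beef, offset, point);
		}
		return -1;
	}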

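	/**
	 * Asks for the word from offset up to the first LIMIT character to be collected, and votes on the position
	 * at which the next Knife starts segmenting.
	 * <p>
	 * 
	 * Default implementation: treat the string from offset up to limit (boundary excluded) as one word and cut it out.
	 * 
	 * @param collector
	 * @param beef
	 * @param offset
	 *            start position in beef of the content dissected in this pass
	 * @param point
	 *            position of the first POINT character in the content dissected in this pass; -1 means no such character exists
	 * @param limit
	 *            position of the LIMIT character for the content dissected in this pass
	 * 
	 * @param dicWordVote
	 *            the vote returned by tryDicWord, or -1 if no dictionary word was found
	 * 
	 * @return the position voted as the next Knife's start; -1 means abstain. Default implementation: abstain.
	 */
	protected int collectLimit(Collector collector, Beef beef, int offset,
			int point, int limit, int dicWordVote) {
		if (dicWordVote == -1) {
			collectIfNotNoise(collector, beef, offset, limit);
		}
		return -1;
	}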

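	/**
	 * Tries a lookup in the combinatorics dictionary: if a word exists that starts with the string from offset
	 * up to limit (the limit boundary excluded), that word is cut out.
	 * <p>
	 * If no such word is found, this method returns -1, abstaining from the vote on where the next Knife starts dissecting.<br>
	 * If such a word is found, it is cut out and its end position is returned as the vote (the word itself does not include the character at the end position).
	 * <p>
	 * 
	 * (for version 2.0.4+):<br>
	 * Current limitation of this method:<br>
	 * If a dictionary word happens to be split across two beef loads, for example "U" is the last character of this beef and "盘" is the first character of the next beef,<br>
	 * {@link CombinatoricsKnife} currently has no mechanism to recognize this case and handle it as one word.
	 * 
	 * @param collector
	 * @param beef
	 * @param offset
	 * @param limit
	 * @return the end position (exclusive) of the dictionary word that was cut out, or -1 to abstain
	 */
	protected int tryDicWord(Collector collector, Beef beef, int offset,
			int limit) {
		int ret = limit;
		for (int end = limit + 1, count = limit - offset + 1; end <= beef
				.length(); end++, count++) {
			Hit hit = combinatoricsDictionary.search(beef, offset, count);
			if (hit.isUndefined()) {
				break;
			} else if (hit.isHit()) {
				collectIfNotNoise(collector, beef, offset, end);
				// A word was collected; set ret to the end of that word
				ret = end;
			}
			// gotoNextChar being true means there exists in the dictionary, with the current word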
...
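
The point/limit scan at the start of dissect can be illustrated in isolation. The following is a minimal sketch, not the Paoding implementation: the class ScanSketch, its toy assignable classifier (letters are ASSIGNED, digits are POINT, anything else or the end of the text is LIMIT) and the constant values are assumptions made for this illustration, and a plain String stands in for Beef.

```java
final class ScanSketch {

	// Hypothetical stand-ins for the three results of assignable(...).
	static final int ASSIGNED = 1;
	static final int POINT = 0;
	static final int LIMIT = -1;

	// Toy classifier (an assumption, not the rule of any real Knife):
	// letters are ASSIGNED, digits are POINT, everything else is LIMIT.
	static int assignable(String text, int offset, int index) {
		if (index >= text.length()) {
			return LIMIT;
		}
		char c = text.charAt(index);
		if (Character.isLetter(c)) {
			return ASSIGNED;
		}
		if (Character.isDigit(c)) {
			return POINT;
		}
		return LIMIT;
	}

	// Mirrors the loop in dissect: advance limit until a LIMIT character
	// is met and remember the first POINT position seen on the way.
	// Returns { point, limit }; point is -1 when no POINT character was met.
	static int[] scan(String text, int offset) {
		int point = -1;
		int limit = offset + 1;
		scanning: while (true) {
			switch (assignable(text, offset, limit)) {
			case LIMIT:
				break scanning;
			case POINT:
				if (point == -1) {
					point = limit;
				}
				break;
			default:
				break;
			}
			limit++;
		}
		return new int[] { point, limit };
	}

	public static void main(String[] args) {
		String text = "usb2go!";
		int[] r = scan(text, 0);
		// prints: point=3 limit=6 head=usb whole=usb2go
		System.out.println("point=" + r[0] + " limit=" + r[1]
				+ " head=" + text.substring(0, r[0])
				+ " whole=" + text.substring(0, r[1]));
	}
}
```

With this toy classifier, the two substrings printed correspond to the two candidates that collectPoint (offset to point) and collectLimit (offset to limit) would offer to the collector when no dictionary word was found.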