Elasticsearch Analyzer: Built-in Analyzers (Part 2)

4. Simple Analyzer

The Simple Analyzer splits text on any character that is not a letter and lowercases every token (it is built on the Lower Case Tokenizer).
POST _analyze{"analyzer": "simple","text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."}[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]4.1 DefinitionTokenizer

  • Lower Case Tokenizer
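The Simple Analyzer is nothing more than this tokenizer on its own. As a quick check, calling _analyze with the bare lowercase tokenizer should return the same tokens as the simple analyzer above; the request below is only a sketch that reuses the sample sentence from this article:

// using the tokenizer directly instead of the analyzer
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}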
4.2 Configuration

The Simple Analyzer has no configuration parameters.
4.3 Experiment

The Simple Analyzer can be rebuilt as follows:
PUT /simple_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": []
        }
      }
    }
  }
}

5. Stop Analyzer

The Stop Analyzer is the same as the Simple Analyzer, except that it adds a token filter that removes stop words; by default it uses the _english_ stop word list.
POST _analyze{"analyzer": "stop","text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."}// 可以看到 非字母进行分词 并且转小写 然后 去除了停顿词[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]5.1 DefinitionTokenizer
  • Lower Case Tokenizer: lowercases the tokens
Token filters
  • Stop Token Filter: removes stop words, defaulting to the _english_ list
5.2 Configuration
  • stopwords: a pre-defined stop word list (defaults to _english_) or an array of stop words
  • stopwords_path: the path to a file containing stop words (see the sketch after this list)
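For stopwords_path, a minimal sketch could look like the following. The index name, analyzer name, and file path are made up for illustration; the file is resolved relative to the Elasticsearch config directory and is expected to contain one stop word per line:

// analysis/my_stopwords.txt is only an example path, relative to the config directory
PUT /stop_path_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_stop_analyzer": {
          "type": "stop",
          "stopwords_path": "analysis/my_stopwords.txt"
        }
      }
    }
  }
}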
5.3 Experiment

The following rebuilds the Stop Analyzer: the text is lowercased first, then stop words are filtered out:
PUT /stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "rebuilt_stop": {
          "tokenizer": "lowercase",
          "filter": ["english_stop"]
        }
      }
    }
  }
}

The stopwords parameter can also be set directly to specify the list of stop words to filter:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

[ quick, brown, foxes, jumped, lazy, dog, s, bone ]

6. Whitespace Analyzer

As the name suggests, the Whitespace Analyzer splits text on whitespace only and does not lowercase the tokens.
POST _analyze{"analyzer": "whitespace","text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."}[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]6.1 DefinitionTokenizer
  • Whitespace Tokenizer
6.2 Configuration

The Whitespace Analyzer has no configuration parameters.
6.3 Experiment

The Whitespace Analyzer can be rebuilt as follows; token filters can be added as needed (a sketch with an added filter follows the rebuilt definition):
PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_whitespace": {
          "tokenizer": "whitespace",
          "filter": []
        }
      }
    }
  }
}
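For example, keeping whitespace tokenization while still lowercasing the output could look like the sketch below. The index and analyzer names are made up; lowercase is the built-in token filter:

// whitespace_lowercase_example and whitespace_lowercase are example names
PUT /whitespace_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_lowercase": {
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Running _analyze with this analyzer on the sample sentence should still keep Brown-Foxes as a single token, but output it as brown-foxes.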
POST _analyze{"analyzer": "keyword","text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."}//注意 这里并没有进行分词 而是原样输出[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]7.1 DefinitionTokennizer
  • Keyword Tokenizer
7.2 Configuration

The Keyword Analyzer has no configuration parameters.
7.3 Experiment

The Keyword Analyzer can be rebuilt as follows:
PUT /keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_keyword": {
          "tokenizer": "keyword",
          "filter": []
        }
      }
    }
  }
}

8. Pattern Analyzer

The Pattern Analyzer splits text using a regular expression. Note that the regex should match the token separators, not the tokens themselves; by default it splits on \W+ (one or more non-word characters).
POST _analyze{"analyzer": "pattern","text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."}// 默认是 按照 \w+ 正则[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]8.1 DefinitionTokennizer
  • Pattern Tokenizer
Token Filters
  • Lower Case Token Filter
  • Stop Token Filter (disabled by default)
8.2 Configuration

  • pattern: a Java regular expression, defaults to \W+
  • flags: Java regular expression flags
  • lowercase: whether terms should be lowercased, defaults to true
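As a configured example, the sketch below (the index and analyzer names are made up for illustration) splits on commas instead of the default \W+ and keeps lowercasing enabled:

// pattern_example and comma_analyzer are example names
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ",",
          "lowercase": true
        }
      }
    }
  }
}

POST /pattern_example/_analyze
{
  "analyzer": "comma_analyzer",
  "text": "A,B,C"
}

This should produce the tokens [ a, b, c ].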
