4. Simple Analyzer
The simple analyzer splits text on any non-letter character and lowercases the resulting tokens (it is built on the lowercase tokenizer).
POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
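To have a field use the simple analyzer at index time, set it in the mapping. A minimal sketch; the index and field names below (my_simple_index, title) are made up for illustration:

PUT my_simple_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "simple"
      }
    }
  }
}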
4.1 Definition
Tokenizer
- Lower Case Tokenizer
4.3 Experiment
The simple analyzer can be rebuilt as follows:
PUT /simple_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": []
        }
      }
    }
  }
}
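To check that the rebuilt analyzer behaves like the built-in one, you can run the same sentence through it (a quick verification sketch against the simple_example index created above):

POST /simple_example/_analyze
{
  "analyzer": "rebuilt_simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

// expected: the same tokens as the built-in simple analyzer
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]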
5. Stop Analyzer
The stop analyzer is the same as the simple analyzer, except that it adds a stop token filter that removes stop words; by default it uses the _english_ stop word list.

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

// tokens are split on non-letters, lowercased, and stop words are removed
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
5.1 Definition
Tokenizer
- Lower Case Tokenizer : lowercases the tokens
Token Filter
- Stop Token Filter : removes stop words, using the _english_ list by default
Configuration:
- stopwords : the stop word list to use; defaults to _english_, or an array of stop words
- stopwords_path : the path to a file containing stop words
The stop analyzer can be rebuilt as follows:

PUT /stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "rebuilt_stop": {
          "tokenizer": "lowercase",
          "filter": ["english_stop"]
        }
      }
    }
  }
}
Set the stopwords parameter to supply a custom list of stop words to filter out:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
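For longer lists it is often more convenient to keep the stop words in a file and point the analyzer at it with stopwords_path, mentioned above. A sketch only; the index name and file path are hypothetical, and the path is resolved relative to the Elasticsearch config directory, so the file must exist on every node:

PUT my_stop_file_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_stop_analyzer": {
          "type": "stop",
          "stopwords_path": "analysis/my_stopwords.txt"
        }
      }
    }
  }
}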
6. Whitespace Analyzer
The whitespace analyzer, as the name suggests, splits text on whitespace only; it does not lowercase the tokens.
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
6.1 Definition
Tokenizer
- Whitespace Tokenizer
6.3 Experiment
The whitespace analyzer can be rebuilt as follows; add token filters to the filter array as needed (see the sketch after the example below):
PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_whitespace": {
          "tokenizer": "whitespace",
          "filter": []
        }
      }
    }
  }
}
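As one illustration of adding a filter (my own example, not part of the built-in analyzer), a whitespace analyzer that also lowercases could look like this:

PUT /whitespace_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_lowercase": {
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

// "Brown-Foxes" would now come out as the single token "brown-foxes"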
7. Keyword Analyzer
The keyword analyzer is special: it does not tokenize at all and returns the input as a single token, exactly as given.

POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

// note: no tokenization happens; the whole input comes back as one token
[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]
7.1 Definition
Tokenizer
- Keyword Tokenizer
7.3 Experiment
Rebuilt, the keyword analyzer looks like this:
PUT /keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_keyword": {
          "tokenizer": "keyword",
          "filter": []
        }
      }
    }
  }
}
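In practice, a field that should never be tokenized is often simply mapped with the keyword field type instead of a text field using the keyword analyzer; both index the value as a single, unmodified term, though keyword fields additionally support features such as normalizers. A hypothetical mapping showing the two variants side by side:

PUT keyword_mapping_example
{
  "mappings": {
    "properties": {
      "tag_as_text":    { "type": "text", "analyzer": "keyword" },
      "tag_as_keyword": { "type": "keyword" }
    }
  }
}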
8. Pattern Analyzer
The pattern analyzer splits text using a regular expression. Note that the regex matches the token separators, not the tokens themselves; the default pattern is \W+, i.e. the text is split on runs of non-word characters.
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

// with the default pattern (\W+, split on non-word characters)
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
8.1 Definition
Tokenizer
- Pattern Tokenizer
Token Filters
- Lower Case Token Filter
- Stop Token Filter (disabled by default)
Configuration:
- pattern : a Java regular expression, defaults to \W+
- flags : Java regular expression flags, pipe-separated (e.g. "CASE_INSENSITIVE|COMMENTS")
- lowercase : whether terms should be lowercased, defaults to true
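As a small example of the configuration above (the index name, analyzer name, and comma pattern are my own choices), a pattern analyzer that splits comma-separated values and lowercases them could be defined and tested like this:

PUT my_pattern_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ",",
          "lowercase": true
        }
      }
    }
  }
}

POST my_pattern_index/_analyze
{
  "analyzer": "comma_analyzer",
  "text": "Red,GREEN,Blue"
}

// expected tokens
[ red, green, blue ]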