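[Elasticsearch] Chinese Full-Text Search with Elasticsearch + Kibana

Introduction

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It stores data as JSON documents and provides near real-time search and analysis.

Kibana visualizes the data stored in Elasticsearch and provides a UI for operating the Elastic Stack.

ik analyzer is a Chinese word-segmentation plugin for Elasticsearch. By default, Elasticsearch splits Chinese text one character at a time, and without proper word segmentation Chinese search results are poor, so we add a Chinese segmentation plugin to improve them.

Basic Concepts

Elasticsearch is architected differently from a typical RDBMS, so its terminology differs as well. The table below maps MySQL terms to their Elasticsearch counterparts: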

MySQL	Elasticsearch
Server	Node
Database	Index
Table	Type
Row	Document
Column	Field

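In Elasticsearch, Index names must be lowercase, and the records in the same Index and Type do not need to share the same fields (a NoSQL concept). Also note that Elasticsearch 6.x allows only one Type per Index; Types are deprecated in 7.x and removed entirely in 8.0.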
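The lowercase rule is easy to check before creating an Index. Below is a minimal Python sketch; the helper name `is_valid_index_name` is hypothetical, and every check other than the lowercase rule is an assumption based on the commonly documented Elasticsearch naming restrictions rather than something stated in this post.

```python
# Hypothetical helper: pre-check an index name before sending it to Elasticsearch.
# Only the lowercase rule comes from the text above; the remaining checks are
# assumptions based on the commonly documented naming restrictions.
FORBIDDEN_CHARS = set('\\/*?"<>| ,#')
FORBIDDEN_PREFIXES = ("-", "_", "+")


def is_valid_index_name(name: str) -> bool:
    if not name or name in (".", ".."):
        return False
    if name != name.lower():  # index names must be lowercase
        return False
    if name.startswith(FORBIDDEN_PREFIXES):
        return False
    if any(ch in FORBIDDEN_CHARS for ch in name):
        return False
    # Assumed limit: names are capped at 255 bytes, not characters.
    return len(name.encode("utf-8")) <= 255


if __name__ == "__main__":
    for candidate in ("my_index", "MyIndex", "_hidden", "news/2019"):
        print(candidate, is_valid_index_name(candidate))
```

Running such a check on the client side only saves a round trip; Elasticsearch itself remains the authority on which names it accepts.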

Installation and Configuration

Installing & Configuring Elasticsearch

Before installing Elasticsearch, install a Java runtime:

$ sudo apt-get update
$ sudo apt-get install default-jre

Once Java is installed, install Elasticsearch by downloading and extracting the archive:

$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.7.2.tar.gz
$ tar zxvf elasticsearch-6.7.2.tar.gz

Configure Elasticsearch (<ES_DIR>/config/elasticsearch.yml):

cluster.name
If multiple Elasticsearch nodes are to join the same cluster, they must all define the same cluster name.

cluster.name: cluster_name

node.name
A custom node name.

node.name: node1

bootstrap.memory_lock
Setting this to true locks the process memory in RAM so that the OS does not swap out Elasticsearch's memory.

bootstrap.memory_lock: true

network

# ------------------------ Network ------------------------
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: host_ip
#
# Set a custom port for HTTP:
#
http.port: port
http.cors.enabled: true
http.cors.allow-origin: '*'
http.cors.allow-methods : OPTIONS, HEAD, GET, POST, PUT, DELETE
http.cors.allow-headers : X-Requested-With,X-Auth-Token,Content-Type, Content-Length

thread_pool

See the official Elasticsearch documentation for details.

index
For index/delete operations. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size of this pool is 1 + # of available processors.

write
For single-document index/delete/update and bulk requests. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.

The size parameter controls the number of threads. For the index and write pools described above, it defaults to the number of available processors.

# Thread pool
thread_pool:
    index:
        size: 13
        queue_size: 1000
    write:
        size: 13
        queue_size: 1000

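P.S. If you don't know how many available processors the machine has, use the nproc command to find out. Because the maximum size of these pools is 1 + # of available processors, the size: 13 setting above assumes a machine with at least 12 processors.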
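The size bound quoted above can be checked before editing elasticsearch.yml. This is a small Python sketch with hypothetical helper names; `os.cpu_count()` stands in for `nproc` here, which is an approximation because `nproc` also honours CPU affinity and container limits.

```python
import os


def max_fixed_pool_size(processors: int) -> int:
    """Upper bound quoted above: 1 + # of available processors."""
    return processors + 1


def clamp_pool_size(requested: int, processors: int) -> int:
    """Return the requested size, capped at the documented maximum."""
    return min(requested, max_fixed_pool_size(processors))


if __name__ == "__main__":
    # os.cpu_count() may return None on some platforms, so fall back to 1.
    processors = os.cpu_count() or 1
    print("processors:", processors)
    print("largest usable size:", max_fixed_pool_size(processors))
    print("size to use for a requested 13:", clamp_pool_size(13, processors))
```

On an 8-processor machine, for instance, a requested size of 13 would be capped at 9.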

Next, adjust the JVM settings. Elasticsearch's default JVM heap size is 1 GB; to change it, edit ./elasticsearch-6.7.2/config/jvm.options:

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

# Set the minimum heap size (Xms) and maximum heap size (Xmx) to be equal to each other
-Xms4g
-Xmx4g

P.S. It is recommended to give the Elasticsearch heap no more than 50% of the system's memory, and not to set it above 32 GB.
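The two rules of thumb above (at most half of the RAM, and below 32 GB) can be turned into a quick calculation. The following Python sketch uses hypothetical helper names, and the default cap of 31 GB is my own assumption for "stay below 32 GB".

```python
def suggest_heap_gb(total_ram_gb: int, cap_gb: int = 31) -> int:
    """Half of physical RAM, capped below 32 GB, and never less than 1 GB."""
    return max(1, min(total_ram_gb // 2, cap_gb))


def jvm_heap_options(heap_gb: int) -> str:
    """Render the two jvm.options lines with Xms equal to Xmx."""
    return f"-Xms{heap_gb}g\n-Xmx{heap_gb}g"


if __name__ == "__main__":
    # An 8 GB machine yields the 4 GB heap used in the example above.
    print(jvm_heap_options(suggest_heap_gb(8)))
```

Treat the result as a starting point only; the right heap size also depends on how much memory the rest of the machine needs.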

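Once everything is configured, start Elasticsearch:

$ cd elasticsearch-6.7.2/bin
$ ./elasticsearch

Startup may fail with a few known errors. The ones I have run into so far are listed below together with their fixes:

vm.max_map_count too low
Error message:
max virtual memory areas vm.max_map_count [65530] is too low, increase to at least
Fix: raise the vm.max_map_count limit. vm.max_map_count caps the number of virtual memory areas (memory-mapped regions) a process may have. Note that sysctl -w does not survive a reboot; to make the change permanent, add vm.max_map_count=262144 to /etc/sysctl.conf.
$ sudo sysctl -w vm.max_map_count=262144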

Unable to lock memory
Error message:

Unable to lock JVM Memory: error=12, reason=Cannot allocate memory

Fix: edit /etc/security/limits.conf

# allow user 'userA' mlockall
userA soft memlock unlimited
userA hard memlock unlimited

Log in again for the change to take effect.

NullPointerException
Error message:

[ERROR][o.e.b.Bootstrap ] Exception java.lang.NullPointerException: null

Fix: mount the cgroup hierarchy

$ sudo mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,cpu,cpuacct cgroup /sys/fs/cgroup/cpu,cpuacct

# If Elasticsearch still fails to start after the command above, try mounting the following as well:
# sudo mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,freezer cgroup /sys/fs/cgroup/freezer
# sudo mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,blkio cgroup /sys/fs/cgroup/blkio
# sudo mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,hugetlb cgroup /sys/fs/cgroup/hugetlb
# sudo mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,devices cgroup /sys/fs/cgroup/devices
# sudo mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,net_cls,net_prio cgroup /sys/fs/cgroup/net_cls,net_prio
# sudo mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,cpuset cgroup /sys/fs/cgroup/cpuset
# sudo mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,memory cgroup /sys/fs/cgroup/memory
# sudo mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,pids cgroup /sys/fs/cgroup/pids
# sudo mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,perf_event cgroup /sys/fs/cgroup/perf_event

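Installing & Configuring ik analyzer

Following the ik analyzer documentation, install ik analyzer with elasticsearch-plugin:

$ ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.7.2/elasticsearch-analysis-ik-6.7.2.zip

NOTE: Replace 6.7.2 with the Elasticsearch version you are using.

Next, set up the dictionaries. You can supply custom dictionary files to improve search quality. The dictionaries and the configuration file can be placed under <ES_DIR>/config/analysis-ik/ or <ES_DIR>/plugins/analysis-ik/config/. Then configure IKAnalyzer.cfg.xml to use the custom dictionaries: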

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer extension configuration</comment>
  <!-- Configure your own extension dictionaries here -->
  <entry key="ext_dict">
    custom/mydict.dic;
    extra_main.dic;
    extra_single_word_low_freq.dic;
  </entry>
  <!-- Configure your own extension stopword dictionaries here -->
  <entry key="ext_stopwords">
    extra_stopword.dic
  </entry>
  <!-- Configure a remote extension dictionary here -->
  <!-- <entry key="remote_ext_dict">words_location</entry> -->
  <!-- Configure a remote extension stopword dictionary here -->
  <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

Finally, restart Elasticsearch and ik analyzer is ready to use.

Installing & Configuring Kibana

Download Kibana from the Kibana website and extract it:

$ wget https://artifacts.elastic.co/downloads/kibana/kibana-6.7.2-linux-x86_64.tar.gz
$ tar zxvf kibana-6.7.2-linux-x86_64.tar.gz
$ cd kibana-6.7.2-linux-x86_64

Configure <KIBANA_DIR>/config/kibana.yml:

# Kibana host and port
server.port: 5601
server.host: "localhost"

# Elasticsearch url
elasticsearch.url: "http://localhost:9200"

logging.dest: /path/to/kibana/log

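Run Kibana:

$ ./bin/kibana

After Kibana starts, open http://localhost:5601 in a browser. If Kibana is connected to Elasticsearch, the Kibana home page loads.

Working with Elasticsearch

Configuring the Mapping

Before indexing any documents, set up the Index mapping, which defines each field's datatype, format, analyzer, and so on, to improve search quality. See the official documentation for the details of Mapping configuration. The Index must already exist before its mapping can be set (for example, create it first with curl -XPUT 'http://localhost:9200/index'). This example assigns ik analyzer to a field: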

$ curl -XPOST 'http://localhost:9200/index/type/_mapping' -H 'Content-Type:application/json' -d'
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}'
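To see how a given analyzer actually segments a piece of text, the _analyze API is handy. The sketch below is in Python and uses only the standard library; the helper names are hypothetical, and it assumes Elasticsearch with ik analyzer is reachable at localhost:9200 and that the Index named index already exists. The main block only builds and prints the request, so nothing is sent unless you call analyze_tokens yourself.

```python
import json
import urllib.request


def build_analyze_request(base_url: str, index: str, analyzer: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST <base_url>/<index>/_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_tokens(request: urllib.request.Request) -> list:
    """Send the request and return the token strings from the response."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


if __name__ == "__main__":
    # Build the request and show what would go over the wire.
    request = build_analyze_request(
        "http://localhost:9200", "index", "ik_max_word", "阿里山花季謝幕"
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Comparing the tokens produced by ik_max_word with those of the default standard analyzer is a quick way to confirm that the plugin is installed and being used.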

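Raising the Result Window Limit

By default, Elasticsearch allows from + size to be at most 10000. To raise the limit, change the following setting:

$ curl -XPUT 'http://<host>:<port>/<index>/_settings' -H 'Content-Type:application/json' -d'
{
  "index": {
    "max_result_window" : "300000"
  }
}'

A larger window can consume more memory, so raise it with care.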
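Whether a given page is still reachable follows directly from the from + size rule. Here is a minimal Python sketch of that arithmetic; the function names are hypothetical.

```python
DEFAULT_MAX_RESULT_WINDOW = 10000  # the default limit mentioned above


def fits_result_window(from_: int, size: int,
                       max_result_window: int = DEFAULT_MAX_RESULT_WINDOW) -> bool:
    """True if a request with these paging values stays within the window."""
    return from_ + size <= max_result_window


def last_full_page(size: int,
                   max_result_window: int = DEFAULT_MAX_RESULT_WINDOW) -> int:
    """Number of full pages of `size` hits that fit inside the window."""
    return max_result_window // size


if __name__ == "__main__":
    print(fits_result_window(9990, 10))            # last page within the default window
    print(fits_result_window(9991, 10))            # one hit too far
    print(fits_result_window(299990, 10, 300000))  # after raising the limit
    print(last_full_page(10))
```

With 10 hits per page, the default window covers 1000 pages; anything beyond that needs either a larger max_result_window or a different way of paging.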

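Setting the Log Level

With large data volumes, full-text searches in Elasticsearch often drive CPU usage very high, sometimes to 100%. Adjusting the log level can cut down on logging I/O, which may lower CPU usage:

$ curl -XPUT 'http://<host>:<port>/<index>/_settings' -H 'Content-Type:application/json' -d'
{
  "index.search.slowlog.level": "info"
}'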

Checking Status

Index status

$ curl -XGET 'http://127.0.0.1:9200/_cat/indices?v'

Response:

health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   dcard    SWUzC7XRSU6DfGGwa_rtHw   5   1          6            0     84.4kb         84.4kb
yellow open   ptt      DzwrF9YMRQm_vzpp_Xz8eg   5   1        912            0    166.4kb        166.4kb
yellow open   mylog    x74lnUFcTZO1IU4rznok0w   5   1          5            0     37.5kb         37.5kb
yellow open   my_index IFc49rh8So2AtXrW5gIXng   5   1          5            0     16.9kb         16.9kb
yellow open   fbpost   xD9-BAR_R4ybkhAYnXzLYA   5   1          6            1     98.9kb         98.9kb
yellow open   news     WLNPnFXpQ8yQEH8c_saClg   5   1          5            0     55.3kb         55.3kb
green  open   .kibana  isZ0WfytSzWQcXXH0PxmWA   1   0          5            2     33.1kb         33.1kb

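health: health of the index
- red: at least one primary shard is unassigned, so some data is unavailable
- yellow: all primary shards are assigned but at least one replica shard is not; the data exists in only one copy, so it cannot be recovered if that node is lost
- green: all primary and replica shards are assigned, so search keeps working if a single node fails
status: whether the index is open or closed
index: index name
uuid: unique key of the index
pri: number of primary shards
rep: number of replica shards
docs.count: number of docs in the index
docs.deleted: number of deleted docs in the index
store.size: disk space used by primary and replica data
pri.store.size: disk space used by primary data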
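The tabular output of _cat/indices?v is easy to process in a script. The Python sketch below parses it into dictionaries and lists the indices that are not green; the function names are hypothetical, the sample rows are copied from the output above, and the whitespace split assumes every column is filled in (which may not hold for closed indices).

```python
SAMPLE = """\
health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   dcard    SWUzC7XRSU6DfGGwa_rtHw   5   1          6            0     84.4kb         84.4kb
green  open   .kibana  isZ0WfytSzWQcXXH0PxmWA   1   0          5            2     33.1kb         33.1kb
"""


def parse_cat_table(text: str) -> list:
    """Turn `_cat/...?v` output into a list of {column: value} dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def indices_not_green(rows: list) -> list:
    """Names of the indices whose health is red or yellow."""
    return [row["index"] for row in rows if row["health"] != "green"]


if __name__ == "__main__":
    rows = parse_cat_table(SAMPLE)
    print(indices_not_green(rows))
```

On a single-node setup like the one in this post, every index with rep set to 1 shows up as yellow, because there is no second node to hold the replica.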

Thread pool status

$ curl -s -XGET 'http://127.0.0.1:9200/_cat/thread_pool?v'

Cluster status

$ curl -XGET 'http://127.0.0.1:9200/_cat/health?v'

Node status

$ curl -XGET 'http://127.0.0.1:9200/_cat/nodes?v'

Adding Data

Basic command format:

$ curl -XPUT 'http://<host>:<port>/<index>/<type>/<doc_id>' -H 'Content-Type:application/json' -d '
{"data": data}'

The doc_id is optional; if you omit it, use the POST HTTP method instead. For example:

# with doc_id
$ curl -XPUT http://localhost:9200/index/type/1 -H 'Content-Type:application/json' -d'
{"content":"蘋果好綠！宣布全球設施已 100％ 使用再生能源"}'

# without doc_id
$ curl -XPOST http://localhost:9200/index/type -H 'Content-Type:application/json' -d'
{"content":"阿里山花季謝幕 紫藤接替營造紫色浪漫"}'

$ curl -XPOST http://localhost:9200/index/type -H 'Content-Type:application/json' -d'
{"content":"日本環球影城全新夜間遊行 四大特點搶先看"}'

$ curl -XPOST http://localhost:9200/index/type -H 'Content-Type:application/json' -d'
{"content":"阿里山花季今閉幕群花接力開"}'

$ curl -XPOST http://localhost:9200/index/type -H 'Content-Type:application/json' -d'
{"content":"不甩聯合國美國怒嗆：要讓「怪物」阿薩德付出代價"}'
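Sending one curl request per document gets tedious with many records; the Bulk API accepts several documents in a single request as newline-delimited JSON. The following Python sketch builds such a body; the function name is hypothetical, and the result would be POSTed to /_bulk with the header Content-Type: application/x-ndjson.

```python
import json


def build_bulk_body(index: str, doc_type: str, docs) -> str:
    """Build an NDJSON bulk body.

    `docs` is an iterable of (doc_id, source) pairs; a doc_id of None lets
    Elasticsearch generate the id, like the POST requests above.
    """
    lines = []
    for doc_id, source in docs:
        meta = {"_index": index, "_type": doc_type}
        if doc_id is not None:
            meta["_id"] = doc_id
        # One action line followed by one source line per document.
        lines.append(json.dumps({"index": meta}, ensure_ascii=False))
        lines.append(json.dumps(source, ensure_ascii=False))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    body = build_bulk_body("index", "type", [
        ("1", {"content": "蘋果好綠！宣布全球設施已 100％ 使用再生能源"}),
        (None, {"content": "阿里山花季謝幕 紫藤接替營造紫色浪漫"}),
    ])
    print(body, end="")
```

The body can be saved to a file and sent with curl using --data-binary, which keeps the newlines intact.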

Retrieving Data (by doc_id)

Basic command format:

$ curl 'http://<host>:<port>/<index>/<type>/<doc_id>?pretty=true'

The URL parameter pretty=true returns the response in a human-readable format. For example:

$ curl 'http://localhost:9200/index/type/1?pretty=true'

Response:

{
  "_index" : "index",
  "_type" : "type",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "content":"蘋果好綠！宣布全球設施已 100％ 使用再生能源"
  }
}

Deleting Data

Use the DELETE method. For example:

$ curl -XDELETE 'http://localhost:9200/index/type/1'

You can also delete the entire Index, including all of its data:

$ curl -XDELETE 'http://localhost:9200/index'

Updating Data

Use the PUT method and send the request again with the same doc_id:

$ curl -XPUT http://localhost:9200/index/type/1 -H 'Content-Type:application/json' -d'
{"content":"蘋果好綠！宣布全球設施已 100％使用再生能源. Update!!"}'

Response:

{
  "_index": "index",
  "_type": "type",
  "_id": "1",
  "_version": 2,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 3,
  "_primary_term": 1
}

In the response, _version is 2 and result is updated, which shows that the document was updated rather than newly created.
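A PUT like the one above replaces the whole document. When only some fields should change, the _update endpoint accepts a partial document instead. The Python sketch below builds such a request for the 6.x URL layout (<index>/<type>/<doc_id>/_update); the function name is hypothetical, and the main block only prints the request instead of sending it.

```python
import json
import urllib.parse
import urllib.request


def build_partial_update_request(base_url: str, index: str, doc_type: str,
                                 doc_id: str, changes: dict) -> urllib.request.Request:
    """Build (but do not send) a partial-update request for one document."""
    # Quote the id so characters such as "/" cannot alter the URL path.
    quoted_id = urllib.parse.quote(doc_id, safe="")
    body = json.dumps({"doc": changes}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{base_url}/{index}/{doc_type}/{quoted_id}/_update",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_partial_update_request(
        "http://localhost:9200", "index", "type", "1", {"content": "Update!!"}
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Fields that are not listed under doc keep their current values, which is the main difference from re-sending the document with PUT.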

Searching

Search is the most important feature of Elasticsearch. We use the Search API it provides:

Retrieving all records

Use the GET method and append _search:

$ curl 'http://localhost:9200/index/type/_search'

Response:

{
  "took": 31,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      {
        "_index": "index",
        "_type": "type",
        "_id": "PXlMs2IBy_MbTvZJQ_1N",
        "_score": 1,
        "_source": {
          "content": "阿里山花季謝幕 紫藤接替營造紫色浪漫"
        }
      },
      {
        "_index": "index",
        "_type": "type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "content": "蘋果好綠！宣布全球設施已 100％ 使用再生能源"
        }
      },
      // ...
    ]
  }
}

Full-text search

Elasticsearch has its own query syntax; see Query DSL for details. This example searches for 美國 and highlights the matches:

$ curl 'localhost:9200/index/type/_search' -H 'Content-Type:application/json' -d '
{
  "query" : {
    "match": { "content": "美國" }
  },
  "highlight" : {
    "fields" : {
      "content" : {}
    }
  }
}'

Search result:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.92769736,
    "hits": [
      {
        "_index": "index",
        "_type": "type",
        "_id": "QHlMs2IBy_MbTvZJXv3t",
        "_score": 0.92769736,
        "_source": {
          "content": "不甩聯合國美國怒嗆：要讓「怪物」阿薩德付出代價"
        },
        "highlight": {
          "content": [
            "不甩聯合國<em>美國</em>怒嗆：要讓「怪物」阿薩德付出代價"
          ]
        }
      }
    ]
  }
}
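In an application, the query body is usually built in code and the highlighted fragments are pulled out of the response. Here is a minimal Python sketch of both steps; the function names are hypothetical, and the sample response is a trimmed copy of the search result above.

```python
def build_match_query(field: str, text: str) -> dict:
    """Same body as the curl example: a match query plus highlighting."""
    return {
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }


def extract_highlights(response: dict, field: str) -> list:
    """Return (doc_id, fragments) pairs; fragments is empty if nothing was highlighted."""
    results = []
    for hit in response["hits"]["hits"]:
        fragments = hit.get("highlight", {}).get(field, [])
        results.append((hit["_id"], fragments))
    return results


if __name__ == "__main__":
    sample_response = {
        "hits": {
            "total": 1,
            "hits": [{
                "_id": "QHlMs2IBy_MbTvZJXv3t",
                "_source": {"content": "不甩聯合國美國怒嗆：要讓「怪物」阿薩德付出代價"},
                "highlight": {"content": ["不甩聯合國<em>美國</em>怒嗆：要讓「怪物」阿薩德付出代價"]},
            }],
        }
    }
    print(build_match_query("content", "美國"))
    for doc_id, fragments in extract_highlights(sample_response, "content"):
        print(doc_id, fragments)
```

The matched text arrives wrapped in em tags, so the fragments can be rendered as HTML directly.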

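Backup

For backing up Elasticsearch, consider elasticsearch-dump, a tool developed by taskrabbit for backing up and migrating data.

Working with Kibana

Working with Elasticsearch from the command line is hard to read. Kibana provides a polished interface that makes operating Elasticsearch easier and visualizes the data stored in it. Here is a brief introduction to using Kibana.

On the Kibana page, the main menu on the left offers the following features:

Discover: browse the number of records and their contents under each index.
Visualize: present search results as charts, and save search results or charts.
Dashboard: combine multiple saved charts or search results to view all the information at once.
Dev Tools: a handy debugging and testing tool; enter commands in the Console to operate Elasticsearch directly.
Management: configure the Elasticsearch index patterns Kibana uses, and manage saved search objects, visualizations, and advanced settings.

First-time users need to create an Index pattern under Management. After that, you can search for data matching specific conditions in Discover, present it as charts in Visualize, and finally use Dashboard to combine multiple saved searches and charts to view all the needed information at once.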

wshs0713's blog
