Elasticsearch from Basics to Advanced

Table of Contents

Part 1: Foundations

Chapter 1: Introduction to Elasticsearch
- What is Elasticsearch? Why use ES?
- Architecture overview
- Comparison with RDBMSs and other search engines
- Suitable use cases
Chapter 2: Installation and Configuration
- Installing with Docker
- Setting up a multi-node cluster
- Kibana and Dev Tools
- Basic security configuration
Chapter 3: Core Concepts
- Index, Document, Field
- Shards and Replicas
- Inverted Index - the heart of Elasticsearch
- Node roles and cluster topology

Part 2: Basic Operations

Chapter 4: CRUD Operations
- Index API - creating documents
- Get API - reading documents
- Update API - updating documents
- Delete API - deleting documents
- Bulk API - batch operations
Chapter 5: Mapping and Data Types
- Dynamic vs explicit mapping
- All data types
- Multi-fields
- Index templates
- Runtime fields

Part 3: Search

Chapter 6: Basic Query DSL
- Query context vs Filter context
- Full-text queries: match, multi_match, query_string
- Term-level queries: term, terms, range, exists
- Compound queries: bool, dis_max
Chapter 7: Advanced Query DSL
- Function score queries
- Nested queries
- Parent-child queries
- Geo queries
- Percolator

Part 4: Text Analysis

Chapter 8: Text Analysis
- Analyzer pipeline
- Built-in analyzers
- Custom analyzers
- Tokenizers and token filters
- Handling Vietnamese

Part 5: Aggregations

Chapter 9: Aggregations
- Metric aggregations
- Bucket aggregations
- Pipeline aggregations
- Nested aggregations
- Hands-on practice with real-world data

Part 6: Performance and Production

Chapter 10: Performance Optimization
- Shard strategy
- Caching mechanisms
- Query optimization
- Indexing optimization
- Monitoring and troubleshooting
Chapter 11: Advanced Features
- Highlighting
- Suggesters (completion, term)
- Scroll and Search After
- Point in Time
- Cross-cluster search

Part 7: In Practice

Chapter 12: Real-World Use Cases
- E-commerce search engine
- Log analytics with the ELK Stack
- Full-text search for applications
- Real-time analytics dashboard
- Vietnamese content search
Chapter 13: Best Practices & Production
- Cluster sizing and capacity planning
- Security hardening
- Backup and restore
- Zero-downtime migrations
- Monitoring with the Elastic Stack

Chapter 1: Introduction to Elasticsearch

1.1 What is Elasticsearch?

Elasticsearch (ES) is a distributed, open-source search engine built on Apache Lucene. It was created by Shay Banon and first released in 2010. Elasticsearch lets you store, search, and analyze large volumes of data quickly and in near real time.

Technical definition

Elasticsearch is:

Search Engine: Optimized for full-text search with relevance scoring
Distributed Database: Automatically distributes data across multiple nodes
NoSQL Document Store: Stores JSON documents without requiring a rigid schema
Analytics Engine: Provides a powerful aggregation framework
REST API First: All interaction goes through an HTTP/JSON API

Elasticsearch in the Elastic Stack

Elasticsearch is usually used together with the other tools in the Elastic Stack (formerly known as the ELK Stack):

┌─────────────────────────────────────────────────────────────┐
│                        ELASTIC STACK                        │
│                                                             │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────────┐  │
│  │  Beats   │    │ Logstash │    │    Kibana (UI)        │  │
│  │(Agents)  │───>│(Pipeline)│───>│  Visualization &     │  │
│  └──────────┘    └──────────┘    │  Dashboards          │  │
│                                  └──────────────────────┘  │
│                         │                    │              │
│                         ▼                    ▼              │
│                  ┌──────────────────────────────────────┐  │
│                  │         ELASTICSEARCH                 │  │
│                  │   Search + Storage + Analytics       │  │
│                  └──────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Beats: Lightweight data shippers (Filebeat, Metricbeat, Packetbeat...)
Logstash: Data processing pipeline (ETL)
Elasticsearch: Core search and storage engine
Kibana: Visualization and management UI

1.2 Why Elasticsearch?

The problem with SQL LIKE and traditional full-text search

Most backend developers start with MySQL/PostgreSQL and reach for the LIKE operator:

sql

-- Find products whose name contains "điện thoại" (phone)
SELECT * FROM products WHERE name LIKE '%điện thoại%';

-- Or PostgreSQL full-text search
SELECT * FROM products 
WHERE to_tsvector('english', description) @@ to_tsquery('smartphone');

Problems:

Problem	Explanation
Poor performance	`LIKE '%keyword%'` cannot use a regular (B-tree) index, so it forces a full table scan
No ranking	There is no way to tell which document is "more relevant"
Limited language support	Accented Vietnamese and synonyms are hard to handle
Does not scale	Becomes very slow beyond 10M rows
No analytics	Complex reports are hard to build
Not real-time	Especially with MySQL full-text indexes

Elasticsearch solves these problems

Search: "điện thoại samsung màn hình lớn pin trâu"
                          │
                    Elasticsearch
                          │
              ┌───────────┴───────────┐
              │                       │
       Query analysis           Inverted Index
    (tokenize, analyze)         (pre-built)
              │                       │
              └───────────┬───────────┘
                          │
                    Ranked results
                  (relevance score)
                          │
              ┌───────────┴───────────┐
              │ Score: 0.95           │ Score: 0.87
              │ Samsung S24 Ultra     │ Samsung A55
              │ "điện thoại màn lớn   │ "màn hình 6.4 inch
              │  pin 5000mAh"         │  pin 5000mAh"
              └───────────────────────┘

Concrete benefits:

Speed: Searches across millions of documents typically return in under 100 ms
Relevance Scoring: Computes relevance scores automatically (BM25 by default; TF-IDF in versions before 5.0)
Text Analysis: Handles natural-language text, including punctuation and synonyms
Distributed: Distributes data automatically and scales horizontally
Near Real-Time: Documents become searchable about 1 second after they are indexed
Aggregations: Faceted search, analytics dashboards

1.3 Suitable Use Cases

Use Case 1: E-Commerce Search

Problem: An e-commerce site like Shopee/Tiki with 10 million products. A user searches for "điện thoại samsung 5G dưới 10 triệu" (a Samsung 5G phone under 10 million VND).

Requirements:

Full-text search with typo tolerance
Filtering by price, brand, and category
Faceted search (showing counts per category)
Spell correction: "samsumg" → "samsung"
Autocomplete as the user types

Why MySQL falls short:

sql

-- This query is very slow and has no ranking
SELECT * FROM products 
WHERE (name LIKE '%samsung%' OR description LIKE '%samsung%')
  AND (name LIKE '%5G%' OR description LIKE '%5G%')
  AND price < 10000000
ORDER BY ??? -- There is nothing meaningful to sort by
LIMIT 20;

How Elasticsearch handles it:

json

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "điện thoại samsung 5G",
            "fields": ["name^3", "description", "brand^2"],
            "fuzziness": "AUTO"
          }
        }
      ],
      "filter": [
        { "range": { "price": { "lte": 10000000 } } },
        { "term": { "status": "active" } }
      ]
    }
  },
  "aggs": {
    "by_brand": {
      "terms": { "field": "brand.keyword", "size": 10 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 3000000, "key": "Dưới 3 triệu" },
          { "from": 3000000, "to": 7000000, "key": "3-7 triệu" },
          { "from": 7000000, "to": 10000000, "key": "7-10 triệu" }
        ]
      }
    }
  }
}
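
One detail of the `range` aggregation in this query is easy to miss: each bucket includes its `from` value and excludes its `to` value. The Python sketch below mimics that bucketing for the three ranges used in the query. `RANGES` and `price_bucket` are hypothetical names introduced only for illustration; they are not part of any Elasticsearch client.

```python
from typing import Optional

# (key, from, to) mirroring the "price_ranges" aggregation; None means unbounded.
RANGES = [
    ("Dưới 3 triệu", None, 3_000_000),
    ("3-7 triệu", 3_000_000, 7_000_000),
    ("7-10 triệu", 7_000_000, 10_000_000),
]


def price_bucket(price: int) -> Optional[str]:
    """Return the key of the bucket containing price, or None if no bucket matches."""
    for key, lower, upper in RANGES:
        # "from" is inclusive, "to" is exclusive.
        if (lower is None or price >= lower) and price < upper:
            return key
    return None


for price in (2_999_999, 3_000_000, 10_000_000):
    print(price, "->", price_bucket(price))
```

Note that a product priced at exactly 10,000,000 passes the `lte` filter but falls into none of the three buckets, because the last bucket's `to` bound is exclusive.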

Use Case 2: Log Analytics (ELK Stack)

Problem: A microservices system with 50 services that generates 100,000 log lines per second. You need to debug a production issue.

Log lifecycle:

Application Logs (JSON)
        │
        ▼
    Filebeat (collect)
        │
        ▼
    Logstash (parse, enrich)
        │
        ▼
  Elasticsearch (store, index)
        │
        ▼
    Kibana (visualize, alert)

Example of an indexed log entry:

json

{
  "@timestamp": "2024-01-15T10:30:45.123Z",
  "service": "payment-service",
  "level": "ERROR",
  "message": "Payment failed: insufficient funds",
  "trace_id": "abc123def456",
  "user_id": "user_789",
  "amount": 500000,
  "error_code": "INSUFFICIENT_FUNDS",
  "response_time_ms": 45,
  "host": "prod-microservice-03"
}

Query for errors in the last hour:

json

GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "match": { "service": "payment-service" } }
      ],
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1h",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "errors_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m"
      }
    },
    "top_errors": {
      "terms": {
        "field": "error_code.keyword",
        "size": 10
      }
    }
  },
  "sort": [{ "@timestamp": "desc" }]
}
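
The `errors_over_time` aggregation groups log events into 5-minute buckets. As a rough illustration of what that bucketing means, the sketch below rounds a timestamp down to the start of its fixed-width bucket. `bucket_start` is a hypothetical helper written for this example, and it assumes UTC timestamps; Elasticsearch performs this step internally.

```python
from datetime import datetime, timezone


def bucket_start(ts: datetime, interval_seconds: int = 300) -> datetime:
    """Round a timestamp down to the start of its fixed-width bucket (UTC)."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % interval_seconds, tz=timezone.utc)


# The @timestamp of the sample log entry shown earlier.
event = datetime(2024, 1, 15, 10, 30, 45, 123000, tzinfo=timezone.utc)
print(bucket_start(event).isoformat())
```

Every event between 10:30:00 and 10:34:59 lands in the same bucket, so the histogram reports one error count per 5-minute window.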

Use Case 3: Content/Article Search

Problem: A news platform like VnExpress with 5 million articles. A user searches for "bão số 3 thiệt hại" (damage from typhoon no. 3).

Characteristics:

Vietnamese text with diacritics, which is hard to process
Long content (10,000+ words per article)
Search results need highlighting
Classification by topic (news categories)

json

GET /articles/_search
{
  "query": {
    "multi_match": {
      "query": "bão số 3 thiệt hại",
      "fields": ["title^5", "summary^3", "content"],
      "type": "best_fields"
    }
  },
  "highlight": {
    "fields": {
      "title": {},
      "content": {
        "fragment_size": 200,
        "number_of_fragments": 3
      }
    }
  },
  "aggs": {
    "by_category": {
      "terms": { "field": "category.keyword" }
    },
    "published_date": {
      "date_histogram": {
        "field": "published_at",
        "calendar_interval": "day"
      }
    }
  }
}

Use Case 4: Real-time Analytics & Monitoring

Problem: A dashboard that tracks system metrics (CPU, memory, response time, error rate) in real time.

Data flow:

Servers → Metricbeat → Elasticsearch → Kibana Dashboard
                                           │
                                    [Real-time refresh: 5s]
                                    - CPU usage per host
                                    - Memory trends
                                    - Request rate
                                    - Error rate alerting

Use Case 5: Autocomplete & Suggestions

Problem: A search box with Google-style autocomplete: when the user types "iph", it suggests "iPhone 15", "iPhone 14 Pro", "iPhone charger"...

Elasticsearch provides the Completion Suggester, which is optimized for this use case. It keeps an FST (Finite State Transducer) in memory, so response times are typically under 5 ms.
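
To make the idea of prefix lookup concrete, here is a minimal sketch in Python. It is not how the Completion Suggester is implemented: it swaps the FST for a sorted list plus binary search, which is enough to show why a prefix query does not need to scan every entry. The `suggest` function and the "iPad Air" and "Samsung A55" entries are assumptions added for the example.

```python
import bisect


def suggest(sorted_terms: list[str], prefix: str, limit: int = 5) -> list[str]:
    """Return up to `limit` terms starting with `prefix` from a sorted list."""
    prefix = prefix.lower()
    # Binary search jumps to the first term that is >= prefix.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        # Terms sharing the prefix are contiguous, so stop at the first miss.
        if len(matches) == limit or not term.startswith(prefix):
            break
        matches.append(term)
    return matches


titles = ["iPhone 15", "iPhone 14 Pro", "iPhone charger", "iPad Air", "Samsung A55"]
terms = sorted(title.lower() for title in titles)
print(suggest(terms, "iph"))
```

A real suggester also ranks candidates by weight and preserves the original casing; this sketch only returns them in sorted order.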

1.4 Comparing Elasticsearch with Other Solutions

Elasticsearch vs MySQL Full-Text Search

Criterion	MySQL FTS	Elasticsearch
Search speed	Slow on large data sets	Fast (< 100 ms over millions of docs)
Relevance scoring	Basic	BM25, customizable
Scalability	Vertical scaling	Horizontal scaling (add nodes)
Text analysis	Limited	Extremely flexible
Aggregations	Limited (GROUP BY)	Very powerful
Near real-time	Slow indexing	~1 second
Vietnamese	Poor	Good support (custom analyzers)
Transactions	ACID	None
Joins	Easy	Complex (nested/parent-child)
Operational cost	Simple	More complex

Elasticsearch vs MongoDB Atlas Search

Criterion	MongoDB Atlas Search	Elasticsearch
Database integration	Built into MongoDB	Requires a separate sync
Setup complexity	Simple	More complex
Performance	Good	Better for search-heavy workloads
Aggregations	Good	More powerful
Ecosystem	Smaller	Very large (ELK)
Cost	Usage-based	Self-managed or Elastic Cloud

Elasticsearch vs Apache Solr

Both use Lucene at their core, but:

Criterion	Solr	Elasticsearch
API	XML/JSON	JSON REST API
Distributed	ZooKeeper	Built in natively
Ease of use	More complex	More developer-friendly
Analytics	Limited	Powerful aggregation framework
Community	Smaller	Larger
Cloud	No first-party cloud service	Elastic Cloud
Use case	Legacy enterprise search	Modern applications

When NOT to use Elasticsearch

Elasticsearch is not a good fit for:

Primary database: There are no ACID transactions, and data can be lost
Relational data with many joins: ES is not optimized for joins
Small data (< 1000 records): The overhead is not worth it
Write-heavy transactional workloads: ES is optimized for reads; individual writes have higher latency than in a SQL database and only become searchable after a refresh
Sensitive financial data: When transaction guarantees are required

1.5 Architecture Overview

Distributed Architecture

                    CLIENT (Application)
                           │
                    Load Balancer
                           │
                ┌──────────┼──────────┐
                │          │          │
           ┌────▼────┐ ┌───▼────┐ ┌──▼─────┐
           │ Node 1  │ │ Node 2 │ │ Node 3 │
           │(Master) │ │(Data)  │ │(Data)  │
           └────┬────┘ └───┬────┘ └──┬─────┘
                │          │         │
                └──────────┴─────────┘
                        Cluster

Query Processing Flow

Client sends a query
       │
       ▼
Coordinating Node (receives the request)
       │
       ▼
Broadcast the query to the relevant shards
       │
    ┌──┴──┐
    │     │
 Shard 0  Shard 1
 (Node 1) (Node 2)
    │     │
    └──┬──┘
       │
Gather & merge results
(sort, rank, paginate)
       │
       ▼
Return results to the client
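
The gather-and-merge step can be sketched in a few lines of Python. Each shard returns its own best hits, and the coordinating node keeps only the global top results. This is a simplified illustration under stated assumptions: `merge_shard_hits` is a hypothetical function, hits are plain `(score, title)` tuples, and the two low-scoring entries are made up for the example.

```python
import heapq


def merge_shard_hits(shard_results, size):
    """Merge per-shard hit lists into one global top-`size` list by score."""
    all_hits = (hit for hits in shard_results for hit in hits)
    # nlargest returns the hits sorted from highest to lowest score.
    return heapq.nlargest(size, all_hits, key=lambda hit: hit[0])


# Each shard has already ranked its own documents.
shard0 = [(0.95, "Samsung S24 Ultra"), (0.40, "Samsung Galaxy Book")]
shard1 = [(0.87, "Samsung A55"), (0.10, "Samsung charger")]

top = merge_shard_hits([shard0, shard1], size=2)
print(top)
```

Because every shard must return enough hits to fill the requested page, deep pagination makes this step progressively more expensive.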

Document Indexing Flow

Client sends a new document
         │
         ▼
Routing: hash(_routing) % num_primary_shards
(_routing defaults to the document ID)
         │
         ▼
Primary shard receives the document
         │
    ┌────┴────┐
    │         │
Write to     Replicate to
Lucene       in-sync replica shards
segment      (primary waits for their acks)
    │
    ▼
Translog (durability)
    │
    ▼
Refresh (default 1s) → searchable
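
The routing step in this flow can be sketched as follows. This is an approximation for illustration only: Elasticsearch hashes the routing value with Murmur3, while the sketch uses `zlib.crc32` as a stand-in because it ships with Python. `route_to_shard` is a hypothetical name.

```python
import zlib


def route_to_shard(routing_key: str, num_primary_shards: int) -> int:
    """Pick a primary shard for a document from its routing key."""
    # crc32 is a stand-in hash; Elasticsearch itself uses Murmur3.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


for doc_id in ("product-1", "product-2", "product-3"):
    print(doc_id, "-> shard", route_to_shard(doc_id, 5))
```

The same key always maps to the same shard, which is how a later GET by ID finds the document without searching every shard. It is also why the number of primary shards cannot simply be changed after an index is created: a different modulus would send existing IDs to different shards.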

1.6 History and Versions

Year	Event
2004	Shay Banon starts Compass (the predecessor of ES)
2010	Elasticsearch 0.4 - first public release
2012	Elastic (the company) is founded
2013	Logstash and Kibana join the Elastic Stack
2014	Elasticsearch 1.0
2015	Elasticsearch 2.0
2016	Elasticsearch 5.0 (version numbers unified with Kibana and Logstash)
2017	Elasticsearch 6.0
2019	Elasticsearch 7.0 (mapping types deprecated)
2021	Elasticsearch 7.11 - license changes to SSPL / Elastic License (7.10 is the last Apache 2.0 release)
2022	Elasticsearch 8.0 (security by default, mapping types removed)
2023	Elasticsearch 8.x with vector search (kNN) and time series data streams (TSDS)

Current version (2024): Elasticsearch 8.x

Major changes in 8.x worth knowing:

Security by default: HTTPS and authentication are enabled out of the box
Mapping types removed: The _type field is gone
Stack-based licensing: More features are available for free
Vector search (kNN): Search over embeddings for AI/ML
TSDS (time series data streams): Optimized for time series data

1.7 How Elasticsearch Works - The Basics

What is an inverted index?

This is the most important concept. Let's understand it through an example:

Data:

Doc 1: "Điện thoại Samsung Galaxy S24"
Doc 2: "Điện thoại iPhone 15 Pro"
Doc 3: "Laptop Samsung Galaxy Book"

The inverted index built from it:

Token         → Documents
─────────────────────────
"điện"        → [Doc1, Doc2]
"thoại"       → [Doc1, Doc2]
"samsung"     → [Doc1, Doc3]
"galaxy"      → [Doc1, Doc3]
"s24"         → [Doc1]
"iphone"      → [Doc2]
"15"          → [Doc2]
"pro"         → [Doc2]
"laptop"      → [Doc3]
"book"        → [Doc3]

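When querying "samsung":

Look up "samsung" → [Doc1, Doc3]
Compute a relevance score for each doc
Return the sorted results

Compared with a SQL table scan:

SQL: Scans every row → O(n)
Elasticsearch: Looks up the term in the inverted index → O(1) ~ O(log n) for the lookup; the remaining work scales with the number of matching documents, not the total number of documents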
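
The example above can be reproduced with a short Python sketch. It is a toy model, not Lucene's data structure: the tokenizer simply lowercases and splits on non-word characters, and `tokenize` and `build_inverted_index` are hypothetical names.

```python
import re
import unicodedata
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on non-word characters."""
    normalized = unicodedata.normalize("NFC", text).lower()
    return [token for token in re.split(r"\W+", normalized) if token]


def build_inverted_index(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each token to the list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # dict.fromkeys removes duplicate tokens while keeping their order.
        for token in dict.fromkeys(tokenize(text)):
            index[token].append(doc_id)
    return dict(index)


docs = {
    "Doc1": "Điện thoại Samsung Galaxy S24",
    "Doc2": "Điện thoại iPhone 15 Pro",
    "Doc3": "Laptop Samsung Galaxy Book",
}
index = build_inverted_index(docs)
print(index["samsung"])
print(index["iphone"])
```

Answering the query "samsung" is now a single dictionary lookup instead of a pass over every document.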

BM25 - The Relevance Scoring Algorithm

Elasticsearch uses BM25 (Best Match 25) to compute relevance scores:

$$\text{score}(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})}$$

Where:

$f(q_i, D)$ = frequency of term $q_i$ in document $D$ (TF)
$|D|$ = length of the document
$avgdl$ = average length of all documents
$k_1$ = parameter controlling the influence of TF (default: 1.2)
$b$ = length normalization parameter (default: 0.75)
$IDF(q_i)$ = Inverse Document Frequency

IDF is computed as follows, where $N$ is the total number of documents and $n(q_i)$ is the number of documents containing $q_i$: $$IDF(q_i) = \ln\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)$$

Intuition:

A term that appears more often in a document → higher score (TF)
A term that is rare across the whole corpus → more valuable → higher score (IDF)
A short document containing the term → more relevant than a long one (length normalization)
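
The formula can be checked with a small Python sketch. It follows the equations exactly as written in this section and treats each document as a list of tokens; `bm25_score` is a hypothetical function, and scores returned by a real Elasticsearch cluster will differ because they depend on per-shard statistics and Lucene's implementation details.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one document against a query using the BM25 formula above."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        # n(q_i): number of documents that contain the term.
        n_q = sum(1 for doc in corpus if term in doc)
        idf = math.log(1 + (n_docs - n_q + 0.5) / (n_q + 0.5))
        tf = doc_tokens.count(term)
        length_norm = 1 - b + b * len(doc_tokens) / avgdl
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


corpus = [
    ["điện", "thoại", "samsung", "galaxy", "s24"],
    ["điện", "thoại", "iphone", "15", "pro"],
    ["laptop", "samsung", "galaxy", "book"],
]
for doc in corpus:
    print(doc, round(bm25_score(["samsung"], doc, corpus), 4))
```

Both Samsung documents contain the term once, but the shorter one scores higher, which is the length normalization at work.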

1.8 Concepts to Remember Right Away

ES concept	SQL equivalent
Index	Database/Table
Document	Row/Record
Field	Column
Mapping	Schema
Shard	Partition
Node	Server
Cluster	Database Cluster

Important note: Before ES 7.0 there was a "Type" concept, equivalent to a table inside a database. From ES 7.0 onward, each index has a single type (_doc), so Index ≈ Table.

Chapter 1 Summary

Elasticsearch is a distributed search engine built on Lucene
Use ES when you need full-text search, log analytics, real-time analytics, or autocomplete
Do not use ES as a primary database or when you need ACID transactions
The inverted index is the reason ES searches quickly
BM25 is the default relevance scoring algorithm
ES works best alongside a SQL database (SQL for primary storage, ES for search)

Next Steps

→ Chapter 2: Installation and Configuration - set up an environment to start practicing

Chapter 2: Installing and Configuring Elasticsearch

2.1 System Requirements

Minimum hardware requirements (Development)

Resource	Minimum	Recommended
RAM	2GB	8GB+
CPU	2 cores	4+ cores
Disk	10GB	SSD 50GB+
OS	Linux/macOS/Windows	Linux (Ubuntu 22.04+)
JVM	Java 17+	Bundled JDK

Minimum system requirements (Production)

Resource	Hot Nodes	Warm/Cold Nodes
RAM	32-64GB	16-32GB
CPU	8-16 cores	4-8 cores
Disk	NVMe SSD	SSD/HDD
Network	10Gbps	1Gbps

Important: Elasticsearch uses both the JVM heap and the OS page cache. Rule of thumb: set the JVM heap to 50% of RAM and keep it under about 30GB so the JVM can keep using compressed object pointers.
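
The rule of thumb translates into a one-line calculation. The sketch below is only an illustration of that rule, with hypothetical helper names; it assumes the machine has at least 2GB of RAM and rounds the heap down to whole gigabytes.

```python
def recommended_heap_gb(total_ram_gb: float, cap_gb: float = 30.0) -> float:
    """Half of the RAM, capped so compressed object pointers stay enabled."""
    return min(total_ram_gb / 2, cap_gb)


def jvm_heap_options(total_ram_gb: float) -> list[str]:
    """Build matching -Xms/-Xmx flags (assumes at least 2GB of RAM)."""
    heap = int(recommended_heap_gb(total_ram_gb))
    # Min and max heap are set to the same value to avoid heap resizing.
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


for ram in (16, 64, 128):
    print(ram, "GB RAM ->", jvm_heap_options(ram))
```

The other half of the RAM is not wasted: the operating system uses it as page cache for Lucene's index files.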

2.2 Installing with Docker (Recommended for Development)

Option 1: Single Node with Docker Compose

Create a docker-compose.yml file:

yaml

version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    container_name: elasticsearch
    environment:
      - node.name=es01
      - cluster.name=my-elasticsearch
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
      # Disable security for development (do NOT do this in production!)
      - xpack.security.enabled=false
      - xpack.security.http.ssl.enabled=false
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
      - "9300:9300"
    networks:
      - elastic
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    container_name: kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    networks:
      - elastic
    depends_on:
      elasticsearch:
        condition: service_healthy

volumes:
  es_data:
    driver: local

networks:
  elastic:
    driver: bridge

Start it:

bash

docker-compose up -d

# Check the status
docker-compose ps

# Follow the logs
docker-compose logs -f elasticsearch

Verify that ES is running:

bash

curl http://localhost:9200

# Expected result:
{
  "name" : "es01",
  "cluster_name" : "my-elasticsearch",
  "cluster_uuid" : "abc123...",
  "version" : {
    "number" : "8.12.0",
    ...
  },
  "tagline" : "You Know, for Search"
}

Option 2: Multi-Node Cluster with Docker Compose

This is a more realistic setup that simulates a production cluster:

yaml

version: '3.8'

services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=my-cluster
      - node.roles=master,data,ingest
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - xpack.security.enabled=false
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - es01_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    networks:
      - elastic

  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    container_name: es02
    environment:
      - node.name=es02
      - cluster.name=my-cluster
      - node.roles=master,data,ingest
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - xpack.security.enabled=false
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - es02_data:/usr/share/elasticsearch/data
    networks:
      - elastic

  es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    container_name: es03
    environment:
      - node.name=es03
      - cluster.name=my-cluster
      - node.roles=master,data,ingest
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - xpack.security.enabled=false
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - es03_data:/usr/share/elasticsearch/data
    networks:
      - elastic

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    container_name: kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://es01:9200
    ports:
      - "5601:5601"
    networks:
      - elastic
    depends_on:
      - es01

volumes:
  es01_data:
  es02_data:
  es03_data:

networks:
  elastic:
    driver: bridge

Management commands:

bash

# Start the cluster
docker-compose up -d

# View cluster health
curl "http://localhost:9200/_cluster/health?pretty"

# List the nodes
curl "http://localhost:9200/_cat/nodes?v"

# Expected output (node.role is "dim" for data, ingest, master; * marks the elected master):
# ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
# 172.18.0.2           12          95   3    0.02    0.07     0.13 dim       *      es01
# 172.18.0.3           11          95   2    0.02    0.07     0.13 dim       -      es02
# 172.18.0.4           13          95   2    0.02    0.07     0.13 dim       -      es03

2.3 Installing with Security (Recommended for Production)

Since ES 8.0, security is enabled by default. Here is how to set it up properly:

yaml

version: '3.8'

services:
  setup:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
    user: "0"
    command: >
      bash -c '
        if [ x${ELASTIC_PASSWORD} == x ]; then
          echo "Set the ELASTIC_PASSWORD environment variable in the .env file";
          exit 1;
        fi;
        if [ ! -f config/certs/ca.zip ]; then
          echo "Creating CA";
          bin/elasticsearch-certutil ca --silent --pem -out config/certs/ca.zip;
          unzip config/certs/ca.zip -d config/certs;
        fi;
        if [ ! -f config/certs/certs.zip ]; then
          echo "Creating certs";
          printf "instances:\n  - name: es01\n    dns:\n      - es01\n      - localhost\n    ip:\n      - 127.0.0.1\n" \
            > config/certs/instances.yml;
          bin/elasticsearch-certutil cert --silent --pem -out config/certs/certs.zip \
            --in config/certs/instances.yml --ca-cert config/certs/ca/ca.crt \
            --ca-key config/certs/ca/ca.key;
          unzip config/certs/certs.zip -d config/certs;
        fi;
        echo "Setting file permissions"
        chown -R root:root config/certs;
        find . -type d -exec chmod 750 \{\} \;;
        find . -type f -exec chmod 640 \{\} \;;
        echo "Waiting for Elasticsearch availability";
        until curl -s --cacert config/certs/ca/ca.crt https://es01:9200 | \
          grep -q "missing authentication credentials"; do sleep 30; done;
        echo "Verifying the elastic user password";
        until curl -s --cacert config/certs/ca/ca.crt \
          -u "elastic:${ELASTIC_PASSWORD}" https://es01:9200 | \
          grep -q "cluster_name"; do sleep 10; done;
        echo "All done!";
      '

  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=my-secure-cluster
      - discovery.type=single-node
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=true
      - xpack.security.http.ssl.key=certs/es01/es01.key
      - xpack.security.http.ssl.certificate=certs/es01/es01.crt
      - xpack.security.http.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.key=certs/es01/es01.key
      - xpack.security.transport.ssl.certificate=certs/es01/es01.crt
      - xpack.security.transport.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.transport.ssl.verification_mode=certificate
    volumes:
      - certs:/usr/share/elasticsearch/config/certs
      - es01_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    ulimits:
      memlock:
        soft: -1
        hard: -1

volumes:
  certs:
  es01_data:

The .env file:

ELASTIC_PASSWORD=MySecureP@ssw0rd123

Connecting with security enabled:

bash

# With SSL and authentication
curl -u elastic:MySecureP@ssw0rd123 \
  --cacert /path/to/ca.crt \
  https://localhost:9200

# Or skip SSL verification (never in production!)
curl -k -u elastic:MySecureP@ssw0rd123 https://localhost:9200

2.4 Installing on Ubuntu Server (Production-like)

Step 1: Import GPG Key

bash

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | \
  sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg

Step 2: Add Repository

bash

sudo apt-get install apt-transport-https
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] \
  https://artifacts.elastic.co/packages/8.x/apt stable main" | \
  sudo tee /etc/apt/sources.list.d/elastic-8.x.list

Step 3: Install

bash

sudo apt-get update && sudo apt-get install elasticsearch

Step 4: Configure (elasticsearch.yml)

bash

sudo nano /etc/elasticsearch/elasticsearch.yml

yaml

# /etc/elasticsearch/elasticsearch.yml

# Cluster
cluster.name: production-cluster
node.name: node-1

# Network
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

# Discovery
discovery.seed_hosts: ["node-1-ip", "node-2-ip", "node-3-ip"]
cluster.initial_master_nodes: ["node-1", "node-2", "node-3"]

# Paths
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch

# Memory
bootstrap.memory_lock: true

# Security (production)
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true

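# Slow log thresholds
# These are index-level settings and cannot be set in elasticsearch.yml:
# since ES 5.0 the node refuses to start if they appear here. Apply them
# per index with PUT /<index>/_settings or through an index template:
#   index.search.slowlog.threshold.query.warn: 10s
#   index.search.slowlog.threshold.query.info: 5s
#   index.search.slowlog.threshold.fetch.warn: 1s
#   index.indexing.slowlog.threshold.index.warn: 10s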

Step 5: Configure JVM

bash

sudo nano /etc/elasticsearch/jvm.options.d/heap.options

# Set heap to 50% of RAM, max 30GB
-Xms8g
-Xmx8g

Step 6: System Configuration

bash

# Disable swap
sudo swapoff -a
# Permanent: comment out swap in /etc/fstab

# Increase virtual memory map count
sudo sysctl -w vm.max_map_count=262144
# Permanent:
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf

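# Increase file descriptor limits
# Note: the DEB package runs Elasticsearch under systemd, which does not read
# /etc/security/limits.conf. In that case set LimitMEMLOCK=infinity in a unit
# override (sudo systemctl edit elasticsearch) instead.
sudo nano /etc/security/limits.conf
# Add:
# elasticsearch soft nofile 65536
# elasticsearch hard nofile 65536
# elasticsearch soft memlock unlimited
# elasticsearch hard memlock unlimited

Step 7: Start and Enable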

bash

sudo systemctl daemon-reload
sudo systemctl enable elasticsearch
sudo systemctl start elasticsearch

# Check status
sudo systemctl status elasticsearch
sudo journalctl -u elasticsearch -f

2.5 Kibana - The Management UI

Kibana with Docker

Kibana is already included in the docker-compose files above. Open: http://localhost:5601

Kibana Dev Tools

This is the most important tool for learning ES. Path: Kibana → Dev Tools (under Management)

The Dev Tools interface:

┌─────────────────────────────────────────────────────────────┐
│  Dev Tools                              [History] [Settings]│
├───────────────────────┬─────────────────────────────────────┤
│                       │                                     │
│  GET _cluster/health  │  {                                  │
│  ▶ (Ctrl+Enter)       │    "cluster_name": "my-cluster",   │
│                       │    "status": "green",               │
│  POST /products/_doc  │    "timed_out": false,              │
│  {                    │    "number_of_nodes": 3,            │
│    "name": "iPhone"   │    "number_of_data_nodes": 3,       │
│  }                    │    "active_shards": 15              │
│                       │  }                                  │
│  [Request panel]      │  [Response panel]                   │
└───────────────────────┴─────────────────────────────────────┘

Useful shortcuts:

Ctrl+Enter or Cmd+Enter: Run the current query
Ctrl+Space: Auto-complete
Ctrl+/: Toggle comment
Click the ▶ button next to the query

2.6 Important Configuration for Production

elasticsearch.yml Production Template

yaml

# ======================== Elasticsearch Configuration =========================
# Cluster
cluster.name: prod-cluster
node.name: ${HOSTNAME}

# Node roles - separate master and data nodes in large clusters
node.roles: [data, ingest]
# Master-only nodes:
# node.roles: [master]

# ---------------------------------- Paths ------------------------------------
path.data:
  - /mnt/data/elasticsearch
path.logs: /var/log/elasticsearch

# ---------------------------------- Memory -----------------------------------
bootstrap.memory_lock: true

# ---------------------------------- Network ----------------------------------
network.host: _eth0_
http.port: 9200
transport.port: 9300
http.compression: true
http.max_content_length: 200mb

# --------------------------------- Discovery ---------------------------------
discovery.seed_hosts:
  - "master-1:9300"
  - "master-2:9300"
  - "master-3:9300"

cluster.initial_master_nodes:
  - "master-1"
  - "master-2"
  - "master-3"

# ---------------------------------- Various -----------------------------------
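# Restrict dynamic scripting (security)
script.allowed_types: stored
script.allowed_contexts: search, update

# Slow logs
# These are index-level settings and cannot be set in elasticsearch.yml:
# since ES 5.0 the node refuses to start if they appear here. Apply them
# per index with PUT /<index>/_settings or through an index template:
#   index.search.slowlog.threshold.query.warn: 10s
#   index.search.slowlog.threshold.query.info: 5s
#   index.search.slowlog.threshold.query.debug: 2s
#   index.search.slowlog.threshold.fetch.warn: 1s
#   index.indexing.slowlog.threshold.index.warn: 10s
#   index.indexing.slowlog.source: 1000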

# Circuit breakers
indices.breaker.total.use_real_memory: true
indices.breaker.total.limit: 95%
indices.breaker.request.limit: 60%
indices.breaker.fielddata.limit: 40%

# --------------------------------- Security ----------------------------------
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.client_authentication: required
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch/certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch/certs/elastic-certificates.p12

xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: /etc/elasticsearch/certs/http.p12

# --------------------------------- Monitoring -----------------------------------
xpack.monitoring.collection.enabled: true

Important: vm.max_map_count

Elasticsearch requires vm.max_map_count >= 262144. If it is not set, the bootstrap check fails at startup with:

[1]: max virtual memory areas vm.max_map_count [65530] is too low, 
     increase to at least [262144]

Fix:

bash

# Temporary
sudo sysctl -w vm.max_map_count=262144

# Permanent
echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.d/elasticsearch.conf
sudo sysctl --system

2.7 Kiểm Tra Cluster và APIs Cơ Bản

Cluster Health API

bash

GET /_cluster/health

# Response:
{
  "cluster_name": "my-cluster",
  "status": "green",        # green/yellow/red
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 5,
  "active_shards": 15,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}

Ý nghĩa cluster status:

🟢 Green: Tất cả primary và replica shards hoạt động
🟡 Yellow: Primary shards hoạt động, nhưng một số replica chưa được assign (thường với single node)
🔴 Red: Một số primary shards không hoạt động → data loss có thể xảy ra
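
The three status values above can be turned into a simple triage rule. The following is a minimal sketch in Python that works on the JSON body of a `/_cluster/health` response. The function name `triage_cluster_health` and the message wording are illustrative assumptions, not part of any Elasticsearch client.

```python
def triage_cluster_health(health: dict) -> str:
    """Turn a /_cluster/health response body into a short triage message."""
    status = health["status"]
    unassigned = health.get("unassigned_shards", 0)
    if status == "green":
        return "ok: all primary and replica shards are active"
    if status == "yellow":
        return f"warn: {unassigned} replica shard(s) unassigned, data is still fully available"
    if status == "red":
        return f"critical: at least one primary shard is inactive ({unassigned} shard(s) unassigned)"
    raise ValueError(f"unknown cluster status: {status!r}")


# A single-node cluster with replicas configured typically reports yellow.
single_node = {"status": "yellow", "number_of_nodes": 1, "unassigned_shards": 5}
print(triage_cluster_health(single_node))
```

In a monitoring script, the dictionary would come from the health API response instead of being hard-coded.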

Cat APIs - Human Readable

bash

# Xem tất cả nodes
GET /_cat/nodes?v

# Xem tất cả indices
GET /_cat/indices?v&s=index

# Xem shards
GET /_cat/shards?v

# Xem aliases
GET /_cat/aliases?v

# Xem cluster health ngắn gọn
GET /_cat/health?v

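# Output of cat/nodes (columns abridged). node.role is abbreviated:
# d = data, m = master-eligible. The default output also has a "master"
# column in which * marks the elected master.
# ip         heap.percent ram.percent cpu load_1m node.role name
# 172.18.0.2           45          67   5    0.15 dm        es01
# 172.18.0.3           38          67   3    0.10 d         es02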

Node Info APIs

bash

# Thông tin chi tiết về các nodes
GET /_nodes

# Stats của nodes (quan trọng cho monitoring)
GET /_nodes/stats

# JVM stats
GET /_nodes/stats/jvm

# Indices stats
GET /_nodes/stats/indices

# OS stats
GET /_nodes/stats/os

Cluster Settings

bash

# Xem tất cả settings hiện tại
GET /_cluster/settings?include_defaults=true

# Thay đổi settings động (không cần restart)
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  },
  "transient": {
    "logger.org.elasticsearch.discovery": "DEBUG"
  }
}

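Persistent vs. transient settings:

persistent: stored in the cluster state and survives a full cluster restart
transient: lost after a full cluster restart; while set, it takes precedence over the persistent value (recent Elasticsearch versions recommend persistent settings only)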
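
To make the difference concrete, here is a small Python simulation of how the effective value of a setting is resolved. It assumes the documented precedence order (transient, then persistent, then `elasticsearch.yml`, then the built-in default). The helper `effective_settings` is hypothetical and does not talk to a cluster.

```python
from collections import ChainMap


def effective_settings(transient: dict, persistent: dict, yml: dict, defaults: dict) -> ChainMap:
    """Look up each key in precedence order: transient, persistent, elasticsearch.yml, default."""
    return ChainMap(transient, persistent, yml, defaults)


key = "cluster.routing.allocation.enable"
defaults = {key: "all"}
yml = {}
persistent = {key: "primaries"}
transient = {key: "none"}

before_restart = effective_settings(transient, persistent, yml, defaults)[key]

# A full cluster restart drops every transient value; persistent ones remain.
after_restart = effective_settings({}, persistent, yml, defaults)[key]

print(before_restart, after_restart)
```

The transient value wins while it exists, and the persistent value takes over again after the restart.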

2.8 Kết Nối từ ứng dụng Backend

Node.js với @elastic/elasticsearch

bash

npm install @elastic/elasticsearch

javascript

const fs = require('fs');
const { Client } = require('@elastic/elasticsearch');

// Choose the configuration by environment so that `client` is declared only once
const isProduction = process.env.NODE_ENV === 'production';

const client = isProduction
  ? new Client({
      // Production (with security)
      node: 'https://my-es-cluster:9200',
      auth: {
        username: 'elastic',
        password: 'MySecurePassword'
      },
      tls: {
        ca: fs.readFileSync('/path/to/ca.crt'),
        rejectUnauthorized: true
      }
    })
  : new Client({
      // Development (no security)
      node: 'http://localhost:9200'
    });

// Test the connection (client.ping() resolves to true in the 8.x client)
async function ping() {
  const result = await client.ping();
  console.log('Connected:', result);
}

ping().catch(console.error);

Java với Spring Boot

xml

<!-- pom.xml -->
<dependency>
  <groupId>co.elastic.clients</groupId>
  <artifactId>elasticsearch-java</artifactId>
  <version>8.12.0</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.17.0</version>
</dependency>

java

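// ElasticSearchConfig.java
import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.json.jackson.JacksonJsonpMapper;
import co.elastic.clients.transport.ElasticsearchTransport;
import co.elastic.clients.transport.rest_client.RestClientTransport;
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
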
@Configuration
public class ElasticSearchConfig {
    
    @Value("${elasticsearch.host}")
    private String host;
    
    @Value("${elasticsearch.port}")
    private int port;
    
    @Bean
    public ElasticsearchClient elasticsearchClient() {
        RestClient restClient = RestClient.builder(
            new HttpHost(host, port, "http")
        ).build();
        
        ElasticsearchTransport transport = new RestClientTransport(
            restClient, new JacksonJsonpMapper()
        );
        
        return new ElasticsearchClient(transport);
    }
}

Python với elasticsearch-py

bash

pip install elasticsearch

python

from elasticsearch import Elasticsearch

# Development
es = Elasticsearch("http://localhost:9200")

# Production
es = Elasticsearch(
    "https://my-es-cluster:9200",
    basic_auth=("elastic", "password"),  # http_auth is deprecated in the 8.x client
    verify_certs=True,
    ca_certs="/path/to/ca.crt"
)

# Test connection
if es.ping():
    print("Connected to Elasticsearch!")
    info = es.info()
    print(f"Version: {info['version']['number']}")

Go with elastic/go-elasticsearch (the official client; olivere/elastic is a community alternative)

bash

go get github.com/elastic/go-elasticsearch/v8

go

package main

import (
    "github.com/elastic/go-elasticsearch/v8"
    "log"
)

func main() {
    cfg := elasticsearch.Config{
        Addresses: []string{
            "http://localhost:9200",
        },
    }
    
    es, err := elasticsearch.NewClient(cfg)
    if err != nil {
        log.Fatalf("Error creating client: %s", err)
    }
    
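    // Test connection
    res, err := es.Info()
    if err != nil {
        log.Fatalf("Error getting response: %s", err)
    }
    defer res.Body.Close()

    // A transport-level success can still carry an HTTP error (e.g. 401)
    if res.IsError() {
        log.Fatalf("Elasticsearch returned an error: %s", res.Status())
    }

    log.Println("Connected to Elasticsearch!")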
}

2.9 Troubleshooting Cài Đặt Thường Gặp

Lỗi: "max virtual memory areas vm.max_map_count too low"

bash

sudo sysctl -w vm.max_map_count=262144

Lỗi: "max file descriptors too low"

bash

# Check hiện tại
ulimit -n

# Tăng lên
ulimit -n 65536

# Permanent: /etc/security/limits.conf
elasticsearch soft nofile 65536
elasticsearch hard nofile 65536

Lỗi: cluster status RED sau khi restart

bash

# Check unassigned shards
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason

# Retry shard allocation
POST /_cluster/reroute?retry_failed=true

# Nếu vẫn RED, check specific index
GET /_cluster/allocation/explain

Lỗi: Out of Memory / GC pressure

bash

# Check heap usage
GET /_nodes/stats/jvm

# Nếu heap usage > 75% thường xuyên:
# 1. Tăng JVM heap (tối đa 50% RAM, không quá 30GB)
# 2. Giảm field data cache
# 3. Xem xét thêm nodes

# Emergency: clear field data cache
POST /_cache/clear?fielddata=true
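
The heap guidance above (at most 50% of RAM, no more than about 30GB, investigate when usage stays above 75%) can be written down as two small helpers. This is a minimal Python sketch. The function names, and the reading of "frequently" as "more than half of the samples", are assumptions made for illustration.

```python
def recommended_heap_gb(total_ram_gb: float, cap_gb: float = 30.0) -> float:
    """Half of the machine's RAM, capped at cap_gb."""
    return min(total_ram_gb * 0.5, cap_gb)


def heap_pressure(used_percent_samples: list, threshold: float = 75.0) -> bool:
    """True when heap usage exceeds the threshold in more than half of the samples."""
    above = sum(1 for p in used_percent_samples if p > threshold)
    return above > len(used_percent_samples) / 2


print(recommended_heap_gb(16))    # small node: half of RAM
print(recommended_heap_gb(128))   # large node: capped
print(heap_pressure([80, 82, 60, 90]))
```

The samples would typically be the `jvm.mem.heap_used_percent` values collected from `GET /_nodes/stats/jvm` over time.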

Lỗi: "This node is not master eligible"

A newly formed cluster needs at least one master-eligible node. Check which roles each node has:

bash

GET /_cat/nodes?v&h=name,node.role,master

# Set voting configuration exclusions nếu cần
POST /_cluster/voting_config_exclusions?node_names=old-master-node

Tóm Tắt Chương 2

Docker là cách nhanh nhất để bắt đầu với ES development
Only set xpack.security.enabled=false for development/learning, never in production
Production cần: security, SSL, proper JVM heap, vm.max_map_count
Kibana Dev Tools là công cụ tốt nhất để thực hành ES queries
Cluster status green = tốt, yellow = cần chú ý, red = khẩn cấp
Elastic provides official clients for the most popular backend languages (Node.js, Java, Python, Go, and others)

Bước Tiếp Theo

→ Chương 3: Khái niệm Cốt lõi - Hiểu sâu về Index, Shard, Document và Inverted Index

Chương 3: Khái Niệm Cốt Lõi của Elasticsearch

3.1 Kiến Trúc Dữ Liệu

Index

Index là đơn vị lưu trữ logic cao nhất trong Elasticsearch. Nó tương đương với một bảng (table) trong SQL hoặc một collection trong MongoDB.

RELATIONAL DB          ELASTICSEARCH
─────────────────      ─────────────────
Database        ≈      (cluster)
Table           ≈      Index
Row             ≈      Document  
Column          ≈      Field
Schema          ≈      Mapping

Index naming rules:

Lowercase only
Cannot contain \, /, *, ?, ", <, >, |, space, comma, or #; colons are also rejected since 7.0
Cannot start with -, _, or +
Cannot be . or ..
At most 255 bytes
A good convention: {app}-{resource}-{env} → myapp-products-prod
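
The naming rules can be checked before an index is created. The sketch below is a client-side validator in Python. The forbidden-character set follows the Elasticsearch create-index documentation, and the function name `index_name_problems` is a hypothetical helper, not an Elasticsearch API.

```python
FORBIDDEN_CHARS = set('\\/*?"<>| ,#:')
FORBIDDEN_PREFIXES = ("-", "_", "+")
MAX_NAME_BYTES = 255


def index_name_problems(name: str) -> list:
    """Return a list of rule violations; an empty list means the name looks valid."""
    problems = []
    if name != name.lower():
        problems.append("must be lowercase")
    bad = sorted(FORBIDDEN_CHARS.intersection(name))
    if bad:
        problems.append("contains forbidden characters: " + "".join(bad))
    if name.startswith(FORBIDDEN_PREFIXES):
        problems.append("must not start with -, _ or +")
    if name in (".", ".."):
        problems.append("must not be . or ..")
    if len(name.encode("utf-8")) > MAX_NAME_BYTES:
        problems.append("longer than 255 bytes")
    return problems


print(index_name_problems("myapp-products-prod"))
print(index_name_problems("My Products"))
```

The server still performs its own validation; a check like this only gives earlier and friendlier feedback.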

Tạo index:

bash

# Tạo index đơn giản
PUT /products

# Tạo index với settings và mappings
PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "index": {
      "refresh_interval": "1s",
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "stop"]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "price": {
        "type": "double"
      },
      "created_at": {
        "type": "date"
      }
    }
  }
}

Document

Document là đơn vị lưu trữ cơ bản trong ES. Mỗi document là một JSON object.

json

{
  "_index": "products",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "id": 1,
    "name": "iPhone 15 Pro Max",
    "brand": "Apple",
    "category": "smartphones",
    "price": 34990000,
    "specs": {
      "ram": "8GB",
      "storage": "256GB",
      "screen_size": 6.7,
      "battery": 4422
    },
    "tags": ["flagship", "5G", "iOS"],
    "images": [
      "https://cdn.apple.com/iphone-15-pro-1.jpg",
      "https://cdn.apple.com/iphone-15-pro-2.jpg"
    ],
    "in_stock": true,
    "created_at": "2023-09-22T10:00:00Z",
    "updated_at": "2024-01-15T08:30:00Z"
  }
}

Các metadata fields của Document:

Field	Mô tả
`_index`	Index chứa document
`_id`	Unique identifier của document
`_version`	Version number, tăng mỗi lần update
`_seq_no`	Sequence number cho optimistic concurrency
`_primary_term`	Primary term cho optimistic concurrency
`_source`	JSON document gốc được index
`_score`	Relevance score (chỉ khi search)

Field và Data Types

Mỗi field trong document có một data type xác định cách ES lưu trữ và index nó.

Ví dụ mapping với nhiều loại field:

json

{
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" },
      "name": { "type": "text" },
      "description": { "type": "text" },
      "price": { "type": "double" },
      "quantity": { "type": "integer" },
      "discount_rate": { "type": "float" },
      "in_stock": { "type": "boolean" },
      "created_at": { "type": "date" },
      "last_updated": { 
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      },
      "location": { "type": "geo_point" },
      "tags": { "type": "keyword" },
      "category_path": { "type": "text" },
      "specs": {
        "type": "object",
        "properties": {
          "weight": { "type": "float" },
          "dimensions": { "type": "keyword" }
        }
      },
      "reviews": {
        "type": "nested",
        "properties": {
          "user_id": { "type": "keyword" },
          "rating": { "type": "integer" },
          "comment": { "type": "text" }
        }
      }
    }
  }
}

Phân biệt text và keyword - cực kỳ quan trọng:

	`text`	`keyword`
Mục đích	Full-text search	Exact match, aggregations
Analyzed	Có (tokenize, lowercase...)	Không
Ví dụ	"iPhone 15 Pro Max" → ["iphone", "15", "pro", "max"]	"iPhone 15 Pro Max"
Dùng cho	match query	term query, sort, aggregation
Case sensitive	Không (lowercase)	Có
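
The table can be reproduced with a toy model of the two field types. This Python sketch uses a rough stand-in for the standard analyzer (split on non-word characters, then lowercase); the real analyzer handles many more cases. All function names are hypothetical.

```python
import re


def analyze_text(value: str) -> list:
    """Rough stand-in for the standard analyzer: tokenize, then lowercase."""
    return [token.lower() for token in re.findall(r"\w+", value)]


def analyze_keyword(value: str) -> list:
    """keyword fields index the whole value as a single, unmodified term."""
    return [value]


def match_query(indexed_terms: list, query: str) -> bool:
    """match analyzes the query text too and, by default, needs any one term to hit."""
    return any(term in indexed_terms for term in analyze_text(query))


def term_query(indexed_terms: list, term: str) -> bool:
    """term looks up the exact term without analyzing it."""
    return term in indexed_terms


name_text = analyze_text("iPhone 15 Pro Max")
name_keyword = analyze_keyword("iPhone 15 Pro Max")

print(name_text)
print(match_query(name_text, "iphone pro"))            # full-text match on the text field
print(term_query(name_keyword, "iPhone 15 Pro Max"))   # exact match on the keyword field
print(term_query(name_text, "iPhone"))                 # term query against analyzed text
```

The last call shows a common pitfall: a `term` query with the original casing finds nothing on a `text` field, because only the lowercased tokens were indexed.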

bash

# TEXT field - full-text search
GET /products/_search
{
  "query": {
    "match": {
      "name": "iphone pro"  # Tìm được "iPhone 15 Pro Max"
    }
  }
}

# KEYWORD field - exact match
GET /products/_search
{
  "query": {
    "term": {
      "brand.keyword": "Apple"  # Phải match chính xác "Apple"
    }
  }
}

3.2 Shards và Replicas

Primary Shards

Shard là đơn vị phân tán cơ bản. Mỗi index được chia thành nhiều shards, mỗi shard là một Lucene index độc lập.

INDEX: products (3 shards, 1 replica)
─────────────────────────────────────
Node 1                Node 2               Node 3
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  Shard P0   │   │  Shard P1   │   │  Shard P2   │
│ (Primary)   │   │ (Primary)   │   │ (Primary)   │
│             │   │             │   │             │
│  Shard R1   │   │  Shard R2   │   │  Shard R0   │
│ (Replica)   │   │ (Replica)   │   │ (Replica)   │
└─────────────┘   └─────────────┘   └─────────────┘

Key rules:

A primary and a replica of the same shard are never allocated to the same node
The number of primary shards is fixed when the index is created; changing it later requires the split/shrink APIs or a reindex into a new index
The number of replica shards can be changed at any time

Cách tính số shards:

Công thức tham khảo:

Số primary shards = Tổng dung lượng dữ liệu / Target shard size  
Target shard size = 20-40GB (guidelines từ Elastic)

Ví dụ: 
- Dữ liệu: 300GB
- Target shard: 30GB
- Số shards = 300GB / 30GB = 10 shards
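
The same arithmetic as a small Python helper, rounding up so that no shard exceeds the target size. The function name and the 30GB default are illustrative assumptions.

```python
import math


def primary_shard_count(total_data_gb: float, target_shard_gb: float = 30.0) -> int:
    """Number of primary shards so that each shard stays at or below the target size."""
    if total_data_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(total_data_gb / target_shard_gb))


print(primary_shard_count(300))   # the example above: 300GB / 30GB
print(primary_shard_count(100))   # not evenly divisible, so round up
print(primary_shard_count(5))     # small index: a single shard is enough
```

When sizing, use the expected data volume over the lifetime of the index rather than the current size, because the primary shard count is fixed at creation.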

Tác hại của quá nhiều shards (over-sharding):

Mỗi shard tốn RAM (~1.5MB metadata/shard trong heap)
Query overhead: scatter-gather across nhiều shards
Segment merging phức tạp hơn

Ví dụ cấu hình:

bash

# Index nhỏ (< 10GB)
PUT /small-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

# Index vừa (10-100GB)  
PUT /medium-index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

# Index lớn (100GB+)
PUT /large-index
{
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 2
  }
}

Routing - Document đến Shard nào?

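When a document is indexed, ES has to decide which primary shard will hold it:

shard_num = hash(_routing) % number_of_primary_shards

_routing defaults to the document _id. (This is the simplified form of the formula; recent versions add a routing factor to support index splitting.)

Example:

document_id = "product-123"
hash("product-123") = 12345678   (illustrative value)
number_of_primary_shards = 3

shard_num = 12345678 % 3 = 0  → Shard 0

This is why the number of primary shards cannot simply be changed after the index is created: the formula would start pointing existing documents at different shards.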
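
The effect of changing the shard count can be simulated in a few lines of Python. Note the assumption: `zlib.crc32` is used here only as a stand-in hash, while Elasticsearch itself uses a Murmur3 hash of the routing value. The function name `shard_for` is hypothetical.

```python
import zlib


def shard_for(routing_value: str, num_primary_shards: int) -> int:
    """Simplified routing: hash the routing value and take it modulo the shard count."""
    # crc32 is a stand-in; the real implementation uses Murmur3.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


doc_ids = [f"product-{i}" for i in range(1000)]
placement_3 = {doc_id: shard_for(doc_id, 3) for doc_id in doc_ids}
placement_4 = {doc_id: shard_for(doc_id, 4) for doc_id in doc_ids}

moved = sum(1 for doc_id in doc_ids if placement_3[doc_id] != placement_4[doc_id])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Most documents end up on a different shard number once the divisor changes, so lookups by ID would search the wrong shard unless the data were re-routed.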

Custom routing:

bash

# Index document vào shard cụ thể
PUT /products/_doc/1?routing=category-electronics
{
  "name": "iPhone 15",
  "category": "electronics"
}

# Search với routing (chỉ search relevant shards)
GET /products/_search?routing=category-electronics
{
  "query": {
    "term": { "category.keyword": "electronics" }
  }
}

Replica Shards

Replica là bản sao của primary shard, phục vụ 2 mục đích:

High Availability: Nếu node chứa primary bị chết, replica sẽ được promoted thành primary
Read Throughput: Search request có thể được gửi đến cả primary và replica

bash

# Thay đổi số replicas (có thể thay đổi sau khi tạo)
PUT /products/_settings
{
  "number_of_replicas": 2
}

# 0 replica - Tốt nhất cho bulk indexing
PUT /products/_settings
{
  "number_of_replicas": 0
}
# Sau khi bulk xong:
PUT /products/_settings
{
  "number_of_replicas": 1
}

Trade-off của replicas:

Nhiều replicas = High availability + Read throughput cao hơn
Nhiều replicas = Tốn thêm disk, RAM, network cho replication
Với single node: replicas không được assign (cluster status yellow)
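
The last point, and the rule that copies of the same shard never share a node, can be illustrated with a toy allocator in Python. It is a deliberately simplified round-robin sketch, not the real Elasticsearch allocation algorithm. The names `allocate` and `cluster_status` are hypothetical, and at least one node is assumed.

```python
def allocate(num_primaries: int, num_replicas: int, nodes: list):
    """Place shard copies round-robin, never putting two copies of one shard on a node."""
    assignments = {node: [] for node in nodes}
    unassigned = []
    for shard in range(num_primaries):
        offset = shard % len(nodes)
        candidates = nodes[offset:] + nodes[:offset]   # rotate to spread primaries
        copies = [f"P{shard}"] + [f"R{shard}"] * num_replicas
        for copy, node in zip(copies, candidates):     # one copy per node at most
            assignments[node].append(copy)
        unassigned += copies[len(candidates):]         # copies with no node left
    return assignments, unassigned


def cluster_status(unassigned: list) -> str:
    if any(copy.startswith("P") for copy in unassigned):
        return "red"
    return "yellow" if unassigned else "green"


three_nodes, left_over = allocate(3, 1, ["node1", "node2", "node3"])
print(three_nodes, cluster_status(left_over))

one_node, left_over_single = allocate(3, 1, ["node1"])
print(one_node, left_over_single, cluster_status(left_over_single))
```

With a single node, every replica stays unassigned because its primary already occupies the only node, which is exactly the yellow status described above.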

3.3 Inverted Index - Trái Tim của Elasticsearch

Lucene Segment

Mỗi Elasticsearch shard = một Lucene index. Mỗi Lucene index gồm nhiều segments (immutable).

Elasticsearch Shard
        │
        ▼
   Lucene Index
        │
   ┌────┼────┐
   │    │    │
Seg0  Seg1  Seg2  (immutable files)
   │
   ├── .fdt  (stored fields - _source)
   ├── .fdx  (field index)
   ├── .tim  (term dictionary)
   ├── .tip  (term index, points into the term dictionary)
   ├── .doc  (doc IDs, frequencies, skip data)
   ├── .pos  (positions)
   ├── .pay  (payloads)
   ├── .nvd  (norms data)
   ├── .nvm  (norms metadata)
   └── .dvm  (doc values metadata)

Quá Trình Index Document

Bước 1: Document được nhận
{
  "title": "Elasticsearch Guide",
  "content": "Learn Elasticsearch from scratch"
}
                    │
                    ▼
Bước 2: Analyzer xử lý text fields
"Elasticsearch Guide" 
  → tokenize: ["Elasticsearch", "Guide"]
  → lowercase: ["elasticsearch", "guide"]
  → stemming/stop words nếu có

"Learn Elasticsearch from scratch"
  → ["learn", "elasticsearch", "from", "scratch"]
  → (stop filter with a custom list containing "from"; the default _english_ list does not include it): ["learn", "elasticsearch", "scratch"]
                    │
                    ▼
Bước 3: Ghi vào in-memory buffer
(chờ refresh)
                    │
          ┌─────────┴─────────┐
          │                   │
          ▼                   ▼
  Translog            Refresh (mỗi 1s)
  (WAL for           Tạo mới Segment
  durability)        từ buffer
                          │
                          ▼
                  Document searchable
                   (near real-time)

Inverted Index Chi Tiết

Sau khi analyze, ES xây dựng inverted index:

Term Dictionary:
┌────────────────┬────────────────────────────────────┐
│ Term           │ Posting List                       │
├────────────────┼────────────────────────────────────┤
│ elasticsearch  │ Doc1(pos:0), Doc2(pos:1), Doc5(pos:0) │
│ guide          │ Doc1(pos:1), Doc3(pos:2)            │
│ learn          │ Doc2(pos:0)                         │
│ scratch        │ Doc2(pos:3)                         │
│ tutorial       │ Doc3(pos:0), Doc4(pos:0)            │
└────────────────┴────────────────────────────────────┘

Posting List chi tiết (với positions và frequencies):
"elasticsearch":
  - Doc1: freq=1, positions=[0]
  - Doc2: freq=2, positions=[1, 5]  (xuất hiện 2 lần)
  - Doc5: freq=1, positions=[0]

Positions cho phép phrase queries:

bash

# Phrase query - tìm "elasticsearch guide" liên tiếp
GET /docs/_search
{
  "query": {
    "match_phrase": {
      "title": "elasticsearch guide"
    }
  }
}
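
To see why positions matter, here is a minimal inverted index with positional posting lists in Python, plus a naive phrase matcher. It is a teaching sketch (whitespace tokenization, no stop words, no scoring), not how Lucene stores postings on disk. The sample documents and function names are assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs: dict) -> dict:
    """Map term -> {doc_id: [positions]} using lowercase whitespace tokens."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def match_phrase(index: dict, phrase: str) -> set:
    """Return doc IDs in which the phrase terms appear at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            if all(start + i in index[term].get(doc_id, []) for i, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {
    "Doc1": "Elasticsearch Guide",
    "Doc2": "Learn Elasticsearch from scratch",
    "Doc3": "Guide to Elasticsearch",
}
index = build_inverted_index(docs)
print(dict(index["elasticsearch"]))
print(match_phrase(index, "elasticsearch guide"))
```

Doc3 contains both terms but in the wrong order, so only a positional check can exclude it from the phrase result.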

Segment Merge

Lucene segments là immutable. Khi delete/update document:

Delete: Đánh dấu "deleted" trong .liv file, không xóa khỏi segment
Update: Delete old + Insert new

Theo thời gian, nhiều small segments được merge thành ít large segments hơn:

Before merge:
Seg0 (100 docs, 20 deleted)
Seg1 (50 docs, 5 deleted)
Seg2 (30 docs, 10 deleted)
         │
    Merge bởi Lucene
         │
After merge:
Seg3 (145 docs, 0 deleted)  ← docs deleted thật sự ở đây
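
The numbers in the merge diagram can be checked with a short Python model, reading "100 docs, 20 deleted" as 100 documents in the segment of which 20 are marked deleted. The `Segment` class and `merge` function are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    total_docs: int     # includes documents that are only marked as deleted
    deleted_docs: int

    @property
    def live_docs(self) -> int:
        return self.total_docs - self.deleted_docs


def merge(segments: list, new_name: str) -> Segment:
    """Copy only live documents into a new segment; deleted ones are dropped for good."""
    live = sum(segment.live_docs for segment in segments)
    return Segment(new_name, total_docs=live, deleted_docs=0)


before = [Segment("Seg0", 100, 20), Segment("Seg1", 50, 5), Segment("Seg2", 30, 10)]
merged = merge(before, "Seg3")
print(merged)
```

The 35 deleted documents only stop using disk space once the merge has written the new segment and the old ones are removed.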

Force merge (dùng cẩn thận, chỉ cho read-only indices):

bash

POST /products/_forcemerge?max_num_segments=1

3.4 Node Roles trong Cluster

Các Loại Node

yaml

# elasticsearch.yml - Cấu hình node roles

# Master-eligible node (quản lý cluster)
node.roles: [master]

# Data node (lưu trữ và xử lý data)
node.roles: [data]

# Data nodes theo tier (Elasticsearch 7.10+)
node.roles: [data_hot]   # SSD, frequent access
node.roles: [data_warm]  # HDD, less frequent
node.roles: [data_cold]  # Archived data
node.roles: [data_frozen]# Very rarely accessed

# Ingest node (pre-processing)
node.roles: [ingest]

# Coordinating-only node (route requests)
node.roles: []  # Empty = coordinating only

# ML node
node.roles: [ml]

# Transform node
node.roles: [transform]

Master Node

Master node chịu trách nhiệm:

Theo dõi trạng thái của tất cả nodes
Quyết định shard allocation
Quản lý cluster state changes (tạo/xóa index)
Cluster-wide settings

Master election: Elasticsearch 7+ uses its own quorum-based consensus algorithm (similar in spirit to Raft):

Cluster cần minimum (N/2 + 1) master-eligible nodes để vote
→ Tối thiểu 3 master-eligible nodes để có fault tolerance

Nếu chỉ có 2 nodes:
- 1 node fail → không đủ quorum → cluster stop nhận writes
→ Luôn dùng số lẻ (3, 5, 7)
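
The quorum arithmetic is easy to tabulate. This short Python sketch applies the majority rule (N/2 + 1, using integer division) to different numbers of master-eligible nodes. The function names are illustrative.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


for n in (1, 2, 3, 4, 5):
    print(f"nodes={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

Going from 3 to 4 nodes does not increase the number of failures that can be tolerated, which is why an odd count is the usual recommendation.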

Dedicated master (production):

yaml

# Dedicated master node: holds no data and only does cluster management
node.roles: [master]

# Note: the legacy node.master / node.data settings cannot be combined
# with node.roles and were removed in 8.0

Data Node

Data nodes lưu trữ shards và xử lý CRUD + search operations.

Data tiers (ILM - Index Lifecycle Management):

Hot Tier (SSD)          Warm Tier (SSD/HDD)    Cold Tier (HDD)     Frozen Tier (S3)
─────────────           ───────────────────    ────────────        ─────────────────
Indices hiện tại        Indices ít truy cập    Archive data        Very old data
Write + Read Heavy      Read-only              Read-only           Search on demand
Fast indexing           Slower queries OK      Slow queries OK     Very slow
Replicas = 1+           Replicas có thể = 0    Replicas = 0        Mount on demand

Coordinating Node

Mỗi node có thể là coordinating node. Khi nhận search request:

Client → Coordinating Node
              │
    ┌─────────┼─────────┐
    │         │         │
  Shard 0   Shard 1   Shard 2
(Node 1)  (Node 2)  (Node 3)
    │         │         │
    └─────────┼─────────┘
              │
    Merge results
    Sort & Rank globally
              │
           Response

Dedicated coordinating nodes thường dùng khi:

Cluster có heavy search load
Muốn tách biệt coordination overhead khỏi data nodes
API gateway pattern
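
The scatter-gather flow in the diagram can be sketched in Python: every shard returns its own top-k hits, and the coordinating node merges them into a global top-k. This is a simplified model that ignores the separate fetch phase, pagination, and tie-breaking. The data and function names are assumptions.

```python
import heapq


def shard_top_k(hits: list, k: int) -> list:
    """What a single shard returns: its k best (doc_id, score) pairs."""
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])


def coordinate(per_shard_hits: list, k: int) -> list:
    """Merge the per-shard results and keep the global top k by score."""
    partial_results = [shard_top_k(hits, k) for hits in per_shard_hits]
    all_hits = (hit for shard_hits in partial_results for hit in shard_hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])


shards = [
    [("doc-a", 2.1), ("doc-b", 0.4)],   # shard 0
    [("doc-c", 3.7), ("doc-d", 1.9)],   # shard 1
    [("doc-e", 0.9)],                   # shard 2
]
top = coordinate(shards, 3)
print(top)
```

The coordinating node has to hold k hits from every shard in memory before it can produce the final list, which is part of the overhead that dedicated coordinating nodes take off the data nodes.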

3.5 Cluster State và Cluster Metadata

Cluster state chứa:

Index metadata (settings, mappings, aliases)
Shard allocation (shard nào ở node nào)
Nodes trong cluster

Cluster state được đồng bộ trên tất cả nodes và được quản lý bởi master node.

bash

# Xem cluster state (rất lớn, cẩn thận)
GET /_cluster/state/metadata/products

# Xem cluster state ngắn gọn
GET /_cluster/state/metadata?filter_path=metadata.indices.*.settings

3.6 Near Real-Time Search

Elasticsearch không phải real-time 100% - nó near real-time (NRT).

Refresh Cycle

Document indexed
       │
       ▼
In-memory buffer      Translog
(not searchable yet)  (durability)
       │
    [Refresh - default: 1 giây]
       │  
       ▼
   Lucene Segment  ← Document NOW searchable
(immutable file)

Tùy chỉnh refresh interval:

bash

# Tắt auto-refresh (cho bulk indexing)
PUT /products/_settings
{
  "refresh_interval": "-1"
}

# Sau khi bulk xong:
PUT /products/_settings
{
  "refresh_interval": "1s"
}

# Force refresh ngay lập tức
POST /products/_refresh

# Index document và refresh ngay
PUT /products/_doc/1?refresh=true
{
  "name": "iPhone 15"
}

# Index document và chờ refresh
PUT /products/_doc/1?refresh=wait_for
{
  "name": "iPhone 15"
}

Translog - Durability

Translog là Write-Ahead Log (WAL) như trong databases:

Document write
      │
      ├──→ Memory buffer (sẽ refresh)
      │
      └──→ Translog (survives crash)
                │
           [Flush - when the translog reaches index.translog.flush_threshold_size (e.g. 512MB), or the shard goes idle]
                │
                ▼
          Lucene commit
          (fsync to disk)
          Translog cleared

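Flush vs Refresh:

Refresh: buffer → segment (searchable), every 1 second by default
Flush: Lucene commit + translog truncation; triggered when the translog grows past index.translog.flush_threshold_size or when a shard goes idle, not on a fixed wall-clock schedule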
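
The interplay of buffer, translog, refresh, and flush can be modelled with a tiny in-memory class. This Python sketch only mirrors the bookkeeping described in this section; it does no disk I/O, and the class `ShardSim` is purely illustrative.

```python
class ShardSim:
    """Toy model of one shard's write path."""

    def __init__(self):
        self.buffer = []      # indexed but not yet searchable
        self.translog = []    # operations not yet covered by a Lucene commit
        self.segments = []    # immutable, searchable batches of documents
        self.committed = 0    # number of segments included in the last commit

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        """Turn the buffer into a new searchable segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer.clear()

    def flush(self):
        """Commit all segments and start a fresh translog."""
        self.refresh()
        self.committed = len(self.segments)
        self.translog.clear()

    def search(self):
        return [doc for segment in self.segments for doc in segment]


shard = ShardSim()
shard.index("doc-1")
print(shard.search())      # nothing yet: the document is only in the buffer
shard.refresh()
print(shard.search())      # searchable after the refresh
print(shard.translog)      # still in the translog until a flush
shard.flush()
print(shard.translog)
```

Between the refresh and the flush, the document is searchable but its durability still depends on the translog, which is why the two operations are separate.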

3.7 Aliases - Quản Lý Index Linh Hoạt

Alias là tên tham chiếu đến một hoặc nhiều indices. Rất quan trọng trong production.

Tại sao dùng Alias?

Bài toán: Bạn có index products-v1 với 3 shards. Giờ cần tái cấu trúc (reindex) thành products-v2 với 6 shards. Nếu application code dùng index name trực tiếp, bạn phải deploy code mới. Nhưng nếu dùng alias products, chỉ cần: swap alias.

bash

# Tạo alias trỏ đến index cụ thể
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "products-v1",
        "alias": "products"
      }
    }
  ]
}

# Application đọc/ghi qua alias
GET /products/_search {...}
POST /products/_doc {...}

# Zero-downtime reindex
# Bước 1: Reindex sang index mới
POST /_reindex
{
  "source": { "index": "products-v1" },
  "dest": { "index": "products-v2" }
}

# Bước 2: Swap alias atomically
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products-v1", "alias": "products" } },
    { "add": { "index": "products-v2", "alias": "products" } }
  ]
}

Alias với Filter (Virtual Index)

bash

# Alias với filter - chỉ thấy active products
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "products",
        "alias": "active_products",
        "filter": {
          "term": { "status": "active" }
        }
      }
    }
  ]
}

# Query qua alias - tự động filter
GET /active_products/_search
{
  "query": { "match_all": {} }
  # Tự động thêm: status = active
}

Alias với Write Index

Khi alias trỏ đến nhiều indices:

bash

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "products-2024-01",
        "alias": "products",
        "is_write_index": true  # Writes go here
      }
    },
    {
      "add": {
        "index": "products-2023",
        "alias": "products"
        # Reads come from both, writes only to 2024-01
      }
    }
  ]
}
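
How reads and writes are resolved through an alias can be modelled in a few lines of Python. The sketch assumes the documented behaviour: reads fan out to every index behind the alias, a write goes to the index flagged `is_write_index`, and an alias with exactly one index accepts writes without the flag. The class `AliasTable` is hypothetical.

```python
class AliasTable:
    """Toy model of alias resolution for reads and writes."""

    def __init__(self):
        self._aliases = {}   # alias -> {index_name: is_write_index}

    def add(self, index: str, alias: str, is_write_index: bool = False):
        self._aliases.setdefault(alias, {})[index] = is_write_index

    def remove(self, index: str, alias: str):
        self._aliases.get(alias, {}).pop(index, None)

    def read_targets(self, alias: str) -> list:
        return sorted(self._aliases.get(alias, {}))

    def write_target(self, alias: str) -> str:
        targets = self._aliases.get(alias, {})
        if len(targets) == 1:
            return next(iter(targets))
        writers = [index for index, is_writer in targets.items() if is_writer]
        if len(writers) != 1:
            raise ValueError(f"alias {alias!r} has no unique write index")
        return writers[0]


table = AliasTable()
table.add("products-2024-01", "products", is_write_index=True)
table.add("products-2023", "products")

print(table.read_targets("products"))
print(table.write_target("products"))
```

Without the `is_write_index` flag on one of the two indices, the write lookup fails, which matches the error Elasticsearch returns when indexing through an ambiguous alias.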

3.8 Index Templates

Templates cho phép tự động áp dụng settings/mappings khi tạo index mới khớp với pattern.

bash

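# Basic index template
# Note: Elasticsearch ships built-in templates for logs-*-* with priority 100.
# A custom template that should win for names like logs-2024-01-15 needs a higher priority.
PUT /_index_template/logs-template
{
  "index_patterns": ["logs-*", "events-*"],
  "priority": 200,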
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-ilm-policy",
      "index.lifecycle.rollover_alias": "logs"
    },
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "level": {
          "type": "keyword"
        },
        "message": {
          "type": "text"
        },
        "service": {
          "type": "keyword"
        },
        "trace_id": {
          "type": "keyword"
        }
      }
    }
  }
}

# Khi tạo index "logs-2024-01-15" tự động áp dụng template
PUT /logs-2024-01-15
# Không cần specify settings - lấy từ template
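
Template selection can be sketched in Python: among all composable templates whose pattern matches the new index name, the one with the highest priority is applied. The template names and priorities below are hypothetical, and `fnmatchcase` is used as an approximation of Elasticsearch's wildcard matching.

```python
from fnmatch import fnmatchcase

# Hypothetical templates; names and priorities are for illustration only.
templates = [
    {"name": "generic-logs", "index_patterns": ["logs-*", "events-*"], "priority": 200},
    {"name": "audit-logs", "index_patterns": ["logs-audit-*"], "priority": 300},
]


def pick_template(index_name: str, templates: list):
    """Return the matching template with the highest priority, or None."""
    matching = [
        template
        for template in templates
        if any(fnmatchcase(index_name, pattern) for pattern in template["index_patterns"])
    ]
    return max(matching, key=lambda template: template["priority"], default=None)


print(pick_template("logs-2024-01-15", templates)["name"])
print(pick_template("logs-audit-2024-01", templates)["name"])
print(pick_template("metrics-2024", templates))
```

Only one index template is applied to a new index, so a more specific pattern needs a higher priority than the generic one it overlaps with.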

3.9 Data Streams

Data Streams là abstraction mới (ES 7.9+) cho time-series data (logs, metrics, traces).

bash

# Tạo index template cho data stream
PUT /_index_template/logs-ds-template
{
  "index_patterns": ["logs-app-*"],
  "data_stream": {},
  "priority": 500,  # above the built-in logs-*-* templates (priority 100)
  "template": {
    "settings": {
      "number_of_shards": 1
    },
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "message": { "type": "text" }
      }
    }
  }
}

# Tạo data stream
PUT /_data_stream/logs-app-prod

# Index document vào data stream
POST /logs-app-prod/_doc
{
  "@timestamp": "2024-01-15T10:30:00Z",
  "message": "Application started",
  "level": "INFO"
}

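A data stream manages the following automatically:

Rollover: creates a new backing index when the current one becomes too large or too old (driven by the ILM policy or an explicit rollover request)
Backing indices: hidden indices named .ds-logs-app-prod-<yyyy.MM.dd>-000001, .ds-logs-app-prod-<yyyy.MM.dd>-000002, ...
Writes always go to the newest backing index
Reads span all backing indices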

3.10 Cluster Coordination và Fault Tolerance

Split Brain Problem

Split brain happens when a network partition divides the cluster into two parts and each part believes it has the elected master:

Trước khi partition:
Node1(Master) ─────── Node2 ─────── Node3

Sau network partition:
Node1(Master)          Node2 ─────── Node3
                                (Node3 được elected master)
                       
Cả 2 group đều accept writes → data inconsistency!

Elasticsearch giải quyết bằng minimum_master_nodes (ES 6) hoặc voting configuration (ES 7+):

yaml

# ES 7+: Tự động tính dựa trên master-eligible nodes
# Nếu có 3 master-eligible nodes, minimum quorum = 2
# Network partition → nhóm 1 node không thể form cluster
# → Chỉ nhóm 2 nodes tiếp tục hoạt động

Node Failure và Recovery

bash

# Xem allocation status
GET /_cluster/allocation/explain

# Khi node fail, ES tự động:
# 1. Replica của shards trên failed node được promoted to primary
# 2. New replicas được tạo trên remaining nodes
# 3. Nếu primary bị mất → data có thể mất (nếu chỉ có 1 replica và cũng mất)

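# Delay allocation - avoid unnecessary shard copying when a node leaves briefly
# (this is an index-level setting, so it is applied through the index settings API)
PUT /_all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}
# ES waits 5 minutes before reallocating the missing replicas,
# giving the failed node time to restart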

Tóm Tắt Chương 3

Kiến trúc dữ liệu:

Index = Table trong SQL
Document = Row, được lưu dưới dạng JSON
Field = Column, có data type
Mapping = Schema

Phân tán:

Primary Shard: Đơn vị phân tán, số cố định khi tạo
Replica Shard: Bản sao, số có thể thay đổi
Routing: shard = hash(_routing) % num_primary_shards (_routing defaults to the document _id)
ES never allocates a primary and its own replica to the same node

Tìm kiếm:

Inverted Index: Cấu trúc data lookup nhanh
Segment: Immutable Lucene file, merge theo thời gian
Refresh (1s): Buffer → Segment (near real-time)
Flush (triggered by translog size or shard idleness): Lucene commit + clear translog

Cluster:

Master node: Quản lý cluster state
Data node: Lưu trữ shards
Coordinating: Route requests, merge results
Quorum: N/2 + 1 để tránh split brain

Best practices:

Dùng Alias thay vì index name trực tiếp trong code
Index Template cho time-series data
Data Streams cho logs/metrics
Single node → yellow status (replicas unassigned) là bình thường

Bước Tiếp Theo

→ Chương 4: CRUD Operations - Thực hành các thao tác cơ bản với documents

Chương 4: CRUD Operations - Thao Tác Cơ Bản

4.1 Setup Dữ Liệu Thực Hành

Trước tiên, tạo index và insert dữ liệu mẫu. Chúng ta sẽ dùng dataset về sản phẩm thương mại điện tử xuyên suốt khóa học.

bash

# Tạo index products với mapping cụ thể
PUT /products
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" },
      "name": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "brand": { "type": "keyword" },
      "category": { "type": "keyword" },
      "subcategory": { "type": "keyword" },
      "description": { "type": "text" },
      "price": { "type": "double" },
      "original_price": { "type": "double" },
      "discount_percentage": { "type": "integer" },
      "rating": { "type": "float" },
      "review_count": { "type": "integer" },
      "in_stock": { "type": "boolean" },
      "stock_quantity": { "type": "integer" },
      "tags": { "type": "keyword" },
      "images": { "type": "keyword", "index": false },
      "created_at": { "type": "date" },
      "updated_at": { "type": "date" }
    }
  }
}

4.2 Index API - Tạo Document

Tạo Document với ID tự định nghĩa

bash

# PUT - Chỉ định ID
PUT /products/_doc/1
{
  "product_id": "SP001",
  "name": "iPhone 15 Pro Max 256GB",
  "brand": "Apple",
  "category": "smartphones",
  "subcategory": "iOS",
  "description": "iPhone 15 Pro Max với chip A17 Pro, camera 48MP, màn hình 6.7 inch Super Retina XDR",
  "price": 34990000,
  "original_price": 38490000,
  "discount_percentage": 9,
  "rating": 4.8,
  "review_count": 1250,
  "in_stock": true,
  "stock_quantity": 45,
  "tags": ["flagship", "5G", "iOS", "camera"],
  "images": [
    "https://example.com/iphone15-1.jpg",
    "https://example.com/iphone15-2.jpg"
  ],
  "created_at": "2024-01-10T08:00:00Z",
  "updated_at": "2024-01-15T10:30:00Z"
}

Response:

json

{
  "_index": "products",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

Tạo Document với ID tự động (POST)

bash

# POST - ES tự sinh ID (UUID)
POST /products/_doc
{
  "product_id": "SP002",
  "name": "Samsung Galaxy S24 Ultra",
  "brand": "Samsung",
  "category": "smartphones",
  "subcategory": "Android",
  "description": "Samsung Galaxy S24 Ultra với bút S Pen, camera 200MP, màn hình 6.8 inch Dynamic AMOLED",
  "price": 31990000,
  "original_price": 33990000,
  "discount_percentage": 6,
  "rating": 4.7,
  "review_count": 890,
  "in_stock": true,
  "stock_quantity": 62,
  "tags": ["flagship", "5G", "Android", "S-Pen", "camera"],
  "created_at": "2024-01-12T09:00:00Z",
  "updated_at": "2024-01-15T11:00:00Z"
}

Response với auto-generated ID:

json

{
  "_index": "products",
  "_id": "xK7mHI4BpQz3t2RfMnLw",   // Auto-generated UUID
  "_version": 1,
  "result": "created",
  ...
}

Tạo Document chỉ khi chưa tồn tại (CREATE - not UPDATE)

bash

# Sử dụng _create endpoint
PUT /products/_create/1
{
  "name": "iPhone 15 Pro Max"
}
# Nếu ID=1 đã tồn tại → lỗi 409 Conflict

# Hoặc op_type=create
PUT /products/_doc/1?op_type=create
{
  "name": "iPhone 15 Pro Max"
}

Khi lỗi duplicate:

json

{
  "error": {
    "type": "version_conflict_engine_exception",
    "reason": "[1]: version conflict, document already exists (current version [1])"
  },
  "status": 409
}

4.3 Get API - Đọc Document

Lấy Document theo ID

bash

GET /products/_doc/1

# Response:
{
  "_index": "products",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "product_id": "SP001",
    "name": "iPhone 15 Pro Max 256GB",
    "brand": "Apple",
    "price": 34990000,
    ...
  }
}

Khi Document không tồn tại

bash

GET /products/_doc/999

# Response:
{
  "_index": "products",
  "_id": "999",
  "found": false
}
# HTTP 404

Lấy chỉ một số fields (source filtering)

bash

# Chỉ lấy name và price
GET /products/_doc/1?_source_includes=name,price

# Loại trừ fields nặng
GET /products/_doc/1?_source_excludes=description,images

# Tắt _source hoàn toàn (chỉ metadata)
GET /products/_doc/1?_source=false

Multi Get API - Lấy nhiều documents cùng lúc

bash

GET /products/_mget
{
  "ids": ["1", "2", "3"]
}

# Hoặc từ nhiều indices
GET /_mget
{
  "docs": [
    { "_index": "products", "_id": "1" },
    { "_index": "products", "_id": "2" },
    { "_index": "orders", "_id": "ORD001" }
  ]
}

# Response:
{
  "docs": [
    {
      "_index": "products",
      "_id": "1",
      "found": true,
      "_source": { ... }
    },
    {
      "_index": "products",
      "_id": "999",
      "found": false
    }
  ]
}

Kiểm tra Document Tồn Tại (HEAD)

bash

# Chỉ check existence, không trả về data (tiết kiệm bandwidth)
HEAD /products/_doc/1

# HTTP 200 nếu tồn tại, 404 nếu không

4.4 Update API - Cập Nhật Document

Partial Update (chỉ update field cụ thể)

bash

# Update một số fields
POST /products/_update/1
{
  "doc": {
    "price": 32990000,
    "discount_percentage": 14,
    "updated_at": "2024-01-16T09:00:00Z"
  }
}

# Response:
{
  "_index": "products",
  "_id": "1",
  "_version": 2,
  "result": "updated",
  "_shards": { "total": 1, "successful": 1 }
}

How an update works internally:

Step 1: Retrieve the current document from Lucene
Step 2: Merge the new doc into the current document
Step 3: Delete the old document (mark it as deleted)
Step 4: Index the new document

This is why an update in ES is not in-place: every update re-reads and fully reindexes the document, which typically makes it more expensive than an in-place SQL UPDATE.
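The four steps above can be sketched with a small in-memory model. This is only an illustration: `createShard`, `mergeDoc`, and `partialUpdate` are hypothetical helper names, and a `Map` stands in for Lucene segments, which in reality are immutable files on disk.

```javascript
// In-memory stand-in for a shard: live documents plus a list of soft-deleted versions.
function createShard() {
  return { live: new Map(), deleted: [] };
}

// Recursively merge `partial` into a copy of `current`.
// Plain objects are merged; arrays and scalars are replaced.
function mergeDoc(current, partial) {
  const result = { ...current };
  for (const [key, value] of Object.entries(partial)) {
    const bothObjects =
      value !== null && typeof value === 'object' && !Array.isArray(value) &&
      result[key] !== null && typeof result[key] === 'object' && !Array.isArray(result[key]);
    result[key] = bothObjects ? mergeDoc(result[key], value) : value;
  }
  return result;
}

function partialUpdate(shard, id, partial) {
  const existing = shard.live.get(id);               // Step 1: fetch the current document
  if (!existing) throw new Error(`document ${id} not found`);
  const merged = mergeDoc(existing.source, partial); // Step 2: merge the partial doc
  shard.deleted.push(existing);                      // Step 3: mark the old version as deleted
  const next = { source: merged, version: existing.version + 1 };
  shard.live.set(id, next);                          // Step 4: index the new version
  return next;
}

const shard = createShard();
shard.live.set('1', { source: { name: 'iPhone 15 Pro Max', price: 34990000 }, version: 1 });
const updated = partialUpdate(shard, '1', { price: 32990000, discount_percentage: 14 });
console.log(updated, shard.deleted.length);
```

The old version is never modified; it is only marked as deleted and a complete new version is written, which is the cost the paragraph above refers to.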

Upsert - Update, or Insert if the Document Does Not Exist

bash

POST /products/_update/5
{
  "doc": {
    "stock_quantity": 100,
    "updated_at": "2024-01-16T10:00:00Z"
  },
  "upsert": {
    "product_id": "SP005",
    "name": "New Product",
    "stock_quantity": 100,
    "price": 5000000,
    "created_at": "2024-01-16T10:00:00Z",
    "updated_at": "2024-01-16T10:00:00Z"
  }
}

Scripted Update - Updates with Complex Logic

bash

# Increase the product price by 10%
POST /products/_update/1
{
  "script": {
    "source": "ctx._source.price = ctx._source.price * 1.1",
    "lang": "painless"
  }
}

# Add an element to an array
POST /products/_update/1
{
  "script": {
    "source": "ctx._source.tags.add(params.new_tag)",
    "lang": "painless",
    "params": {
      "new_tag": "sale"
    }
  }
}

# Remove an element from an array
POST /products/_update/1
{
  "script": {
    "source": "ctx._source.tags.removeAll(Collections.singletonList(params.tag_to_remove))",
    "lang": "painless",
    "params": {
      "tag_to_remove": "old-tag"
    }
  }
}

# Conditional update
POST /products/_update/1
{
  "script": {
    "source": """
      if (ctx._source.stock_quantity > 0) {
        ctx._source.stock_quantity--;
        if (ctx._source.stock_quantity == 0) {
          ctx._source.in_stock = false;
        }
      } else {
        ctx.op = 'noop';
      }
    """,
    "lang": "painless"
  }
}

Update By Query - Updating Many Documents

bash

# Increase the price of all Apple products by 5%
POST /products/_update_by_query
{
  "script": {
    "source": "ctx._source.price = ctx._source.price * 1.05",
    "lang": "painless"
  },
  "query": {
    "term": {
      "brand": "Apple"
    }
  }
}

# Response:
{
  "took": 156,
  "timed_out": false,
  "total": 5,
  "updated": 5,
  "deleted": 0,
  "batches": 1,
  "conflicts": 0,
  "noops": 0,
  "throttled_millis": 0,
  "failures": []
}

Full Document Replace

bash

# PUT replaces the ENTIRE document.
# Every field that is not in the request body is REMOVED.
PUT /products/_doc/1
{
  "name": "iPhone 15 Pro Max 512GB",
  "price": 39990000
}

Caution: PUT _doc removes every field that is not present in the new request.

4.5 Delete API - Deleting a Document

Delete a Document by ID

bash

DELETE /products/_doc/1

# Response:
{
  "_index": "products",
  "_id": "1",
  "_version": 2,    # The version also increments on delete
  "result": "deleted",
  "_shards": { "total": 1, "successful": 1, "failed": 0 }
}

Delete By Query - Deleting Many Documents

bash

# Delete all products that are out of stock and have no reviews
POST /products/_delete_by_query
{
  "query": {
    "bool": {
      "must": [
        { "term": { "in_stock": false } },
        { "range": { "review_count": { "lte": 0 } } }
      ]
    }
  }
}

# Delete asynchronously (for large datasets)
POST /products/_delete_by_query?wait_for_completion=false
{
  "query": { "match_all": {} }
}
# Returns a task ID for tracking progress

Delete an Index

bash

DELETE /products

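# Delete multiple indices
DELETE /products,orders

# Delete by pattern (careful!)
# Since 8.0, action.destructive_requires_name defaults to true, so wildcard deletes are rejected unless that setting is changed
DELETE /logs-2023-*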

4.6 Bulk API - Batch Operations

This is the most important API for performance when processing many documents.

Bulk Request Format

{"action": {"metadata"}}
{"document_data"}
{"action": {"metadata"}}
{"document_data"}
...

Important: each action line and its data line must be on two separate lines (NDJSON format). A delete action has no data line.

bash

POST /products/_bulk
{"index": {"_id": "2"}}
{"product_id": "SP002", "name": "Samsung Galaxy S24 Ultra", "brand": "Samsung", "price": 31990000, "rating": 4.7, "in_stock": true, "category": "smartphones"}
{"index": {"_id": "3"}}
{"product_id": "SP003", "name": "Xiaomi 14 Pro", "brand": "Xiaomi", "price": 19990000, "rating": 4.5, "in_stock": true, "category": "smartphones"}
{"create": {"_id": "4"}}
{"product_id": "SP004", "name": "OPPO Find X7 Ultra", "brand": "OPPO", "price": 23990000, "rating": 4.6, "in_stock": false, "category": "smartphones"}
{"update": {"_id": "2"}}
{"doc": {"discount_percentage": 10}}
{"delete": {"_id": "999"}}

Note: the last line must end with a newline (\n).
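The NDJSON rules above can be captured in a small serializer. This is a minimal sketch; `toBulkBody` and its `{ action, meta, doc }` entry shape are illustrative names chosen here, not part of any Elasticsearch client.

```javascript
// Serialize bulk operations into an NDJSON request body.
// Each entry is { action, meta, doc }; `doc` is omitted for delete actions.
function toBulkBody(entries) {
  const lines = [];
  for (const { action, meta, doc } of entries) {
    lines.push(JSON.stringify({ [action]: meta }));  // action/metadata line
    if (action !== 'delete') {
      if (doc === undefined) throw new Error(`${action} requires a data line`);
      lines.push(JSON.stringify(doc));               // data line, always a single line
    }
  }
  return lines.join('\n') + '\n';                    // the body must end with a newline
}

const body = toBulkBody([
  { action: 'index', meta: { _id: '2' }, doc: { name: 'Samsung Galaxy S24 Ultra' } },
  { action: 'update', meta: { _id: '2' }, doc: { doc: { discount_percentage: 10 } } },
  { action: 'delete', meta: { _id: '999' } }
]);
console.log(body);
```

Because `JSON.stringify` never emits raw newlines, each document is guaranteed to stay on one line, which is what the bulk parser expects.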

Bulk Across Multiple Indices

bash

POST /_bulk
{"index": {"_index": "products", "_id": "10"}}
{"name": "Laptop Dell XPS 15", "price": 45000000}
{"index": {"_index": "orders", "_id": "ORD001"}}
{"user_id": "U001", "total": 45000000, "status": "processing"}
{"delete": {"_index": "products", "_id": "old-product-1"}}

Bulk Response and Error Handling

json

{
  "took": 45,
  "errors": true,
  "items": [
    {
      "index": {
        "_index": "products",
        "_id": "2",
        "result": "created",
        "status": 201
      }
    },
    {
      "index": {
        "_index": "products",
        "_id": "3",
        "result": "created",
        "status": 201
      }
    },
    {
      "delete": {
        "_index": "products",
        "_id": "999",
        "result": "not_found",
        "status": 404
      }
    }
  ]
}

The Bulk API does not fail the whole request when one item fails. Each item carries its own status, and the top-level errors flag tells you whether any item needs inspection.

Best Practices for the Bulk API

bash

# 1. Reasonable bulk size: 5-15MB per request, or 1000-5000 documents
# Too large → GC pressure; too small → per-request overhead

# 2. Disable replicas and refresh during bulk loading
PUT /products/_settings
{
  "number_of_replicas": 0,
  "refresh_interval": "-1"
}

# 3. Bulk load
POST /products/_bulk
... many documents ...

# 4. Re-enable replicas and refresh
PUT /products/_settings
{
  "number_of_replicas": 1,
  "refresh_interval": "1s"
}

POST /products/_refresh
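To follow the size guideline in step 1, documents can be split into batches before sending. The sketch below is one possible approach; `chunkForBulk`, `maxDocs`, and `maxBytes` are names assumed for illustration, and the size estimate counts only the data lines, not the action lines.

```javascript
// Split documents into batches bounded by both document count and serialized size.
function chunkForBulk(docs, { maxDocs = 1000, maxBytes = 5 * 1024 * 1024 } = {}) {
  const batches = [];
  let current = [];
  let currentBytes = 0;
  for (const doc of docs) {
    const size = Buffer.byteLength(JSON.stringify(doc), 'utf8') + 1; // +1 for the newline
    const full = current.length >= maxDocs || currentBytes + size > maxBytes;
    if (full && current.length > 0) {
      batches.push(current);   // close the current batch before adding this document
      current = [];
      currentBytes = 0;
    }
    current.push(doc);         // a single oversized document still gets its own batch
    currentBytes += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

const docs = Array.from({ length: 2500 }, (_, i) => ({ id: i, name: `Product ${i}` }));
const batches = chunkForBulk(docs, { maxDocs: 1000 });
console.log(batches.map(b => b.length));
```

The right limits depend on document size and cluster resources, so treat the defaults as a starting point to measure against.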

Bulk with Node.js (Practical Example)

javascript

const { Client } = require('@elastic/elasticsearch');
const fs = require('fs');
const readline = require('readline');

const client = new Client({ node: 'http://localhost:9200' });

async function bulkIndexProducts(products) {
  const operations = products.flatMap(product => [
    { index: { _index: 'products', _id: product.id } },
    product
  ]);

  // Client 8.x (which accepts `operations`) returns the response body directly; 7.x wrapped it in { body }
  const bulkResponse = await client.bulk({
    refresh: true,
    operations
  });

  if (bulkResponse.errors) {
    const erroredDocuments = [];
    bulkResponse.items.forEach((action, i) => {
      const operation = Object.keys(action)[0];
      if (action[operation].error) {
        erroredDocuments.push({
          status: action[operation].status,
          error: action[operation].error,
          operation: operations[i * 2],
          document: operations[i * 2 + 1]
        });
      }
    });
    console.error('Failed documents:', erroredDocuments);
  }

  return bulkResponse;
}

// Read an NDJSON file (one JSON document per line) and bulk index in batches
async function indexFromFile(filePath) {
  const BATCH_SIZE = 1000;
  let batch = [];
  let totalIndexed = 0;

  const rl = readline.createInterface({
    input: fs.createReadStream(filePath),
    crlfDelay: Infinity
  });

  for await (const line of rl) {
    const product = JSON.parse(line);
    batch.push(product);

    if (batch.length >= BATCH_SIZE) {
      await bulkIndexProducts(batch);
      totalIndexed += batch.length;
      console.log(`Indexed ${totalIndexed} documents`);
      batch = [];
    }
  }

  // Index remaining
  if (batch.length > 0) {
    await bulkIndexProducts(batch);
    totalIndexed += batch.length;
  }

  console.log(`Total indexed: ${totalIndexed} documents`);
}

4.7 Basic Search API

URI Search (Quick & Dirty)

bash

# Simple search via a URL parameter
GET /products/_search?q=iPhone

# On a specific field
GET /products/_search?q=name:iPhone

# With multiple terms
GET /products/_search?q=brand:Apple AND category:smartphones

# With sort and pagination
GET /products/_search?q=Apple&sort=price:asc&from=0&size=10

Request Body Search

bash

# Match all documents
GET /products/_search
{
  "query": {
    "match_all": {}
  }
}

# With pagination and sorting
GET /products/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "name": "iPhone"
    }
  },
  "sort": [
    { "price": "asc" },
    { "_score": "desc" }
  ],
  "_source": ["name", "brand", "price", "rating"]
}

Response structure:

json

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1.5,
    "hits": [
      {
        "_index": "products",
        "_id": "1",
        "_score": 1.5,
        "_source": {
          "name": "iPhone 15 Pro Max 256GB",
          "brand": "Apple",
          "price": 34990000,
          "rating": 4.8
        }
      }
    ]
  }
}

4.8 Concurrency Control

Optimistic Concurrency Control

ES uses _seq_no and _primary_term for optimistic locking:

bash

# Get the document together with its seq_no and primary_term
GET /products/_doc/1

# The response includes:
{
  "_seq_no": 5,
  "_primary_term": 1,
  ...
}

# Update only if seq_no and primary_term still match
PUT /products/_doc/1?if_seq_no=5&if_primary_term=1
{
  "name": "Updated iPhone 15 Pro Max",
  "price": 33990000
}

# If another process has updated the document in the meantime:
# 409 Conflict
{
  "error": {
    "type": "version_conflict_engine_exception",
    "reason": "[1]: version conflict, required seqNo [5], primary term [1]. current document has seqNo [7] and primary term [1]"
  }
}
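The compare-and-set behaviour behind if_seq_no and if_primary_term can be modelled without a cluster. The sketch below is a simplified in-memory stand-in; `InMemoryIndex` and `VersionConflictError` are hypothetical classes written for this illustration, not client APIs.

```javascript
// Minimal in-memory model of a primary shard that enforces if_seq_no / if_primary_term.
class VersionConflictError extends Error {
  constructor(message) { super(message); this.statusCode = 409; }
}

class InMemoryIndex {
  constructor() { this.docs = new Map(); this.nextSeqNo = 0; this.primaryTerm = 1; }

  index(id, source, { ifSeqNo, ifPrimaryTerm } = {}) {
    const current = this.docs.get(id);
    if (ifSeqNo !== undefined) {
      // Reject the write unless the caller saw the latest version of the document.
      if (!current || current.seqNo !== ifSeqNo || current.primaryTerm !== ifPrimaryTerm) {
        throw new VersionConflictError(`[${id}]: version conflict`);
      }
    }
    const doc = { source, seqNo: this.nextSeqNo++, primaryTerm: this.primaryTerm };
    this.docs.set(id, doc);
    return doc;
  }

  get(id) { return this.docs.get(id); }
}

const idx = new InMemoryIndex();
idx.index('1', { stock_quantity: 10 });
const seenByA = idx.get('1');                       // writer A reads the document
const seenByB = idx.get('1');                       // writer B reads the same version
idx.index('1', { stock_quantity: 9 }, { ifSeqNo: seenByA.seqNo, ifPrimaryTerm: seenByA.primaryTerm });

let conflict = null;
try {
  idx.index('1', { stock_quantity: 8 }, { ifSeqNo: seenByB.seqNo, ifPrimaryTerm: seenByB.primaryTerm });
} catch (err) {
  conflict = err;                                   // writer B used a stale seq_no
}
console.log(conflict.statusCode, idx.get('1').source);
```

Writer B must re-read the document and retry, which prevents its stale view from silently overwriting writer A's change.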

Real-world use case - decreasing stock for a shopping cart:

javascript

async function decreaseStock(productId, quantity) {
  const maxRetries = 3;
  
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    // Get current state (client 8.x returns the response body directly)
    const product = await client.get({
      index: 'products',
      id: productId
    });
    
    const currentStock = product._source.stock_quantity;
    
    if (currentStock < quantity) {
      throw new Error('Insufficient stock');
    }
    
    try {
      // Try to update with concurrency check
      await client.index({
        index: 'products',
        id: productId,
        if_seq_no: product._seq_no,
        if_primary_term: product._primary_term,
        document: {
          ...product._source,
          stock_quantity: currentStock - quantity,
          in_stock: (currentStock - quantity) > 0
        }
      });
      
      return { success: true, remaining_stock: currentStock - quantity };
      
    } catch (error) {
      if (error.statusCode === 409 && attempt < maxRetries - 1) {
        // Conflict - retry
        console.log(`Retry attempt ${attempt + 1}`);
        await new Promise(resolve => setTimeout(resolve, 100 * (attempt + 1)));
        continue;
      }
      throw error;
    }
  }
  
  throw new Error('Max retries exceeded');
}

4.9 Reindex API

Use it when you need to restructure an index (change the mapping or the number of shards):

bash

# Reindex from one index into another
POST /_reindex
{
  "source": {
    "index": "products-v1"
  },
  "dest": {
    "index": "products-v2"
  }
}

# Reindex with a query filter (copy only a subset)
POST /_reindex
{
  "source": {
    "index": "products",
    "query": {
      "term": { "brand": "Apple" }
    }
  },
  "dest": {
    "index": "apple-products"
  }
}

# Reindex with a script transformation
POST /_reindex
{
  "source": {
    "index": "old-products"
  },
  "dest": {
    "index": "new-products"
  },
  "script": {
    "source": """
      // Rename a field
      ctx._source.product_name = ctx._source.name;
      ctx._source.remove('name');
      
      // Compute the discount
      if (ctx._source.original_price != null && ctx._source.price != null) {
        ctx._source.discount_amount = ctx._source.original_price - ctx._source.price;
      }
      
      // Convert string to integer
      ctx._source.view_count = Integer.parseInt(ctx._source.views);
      ctx._source.remove('views');
    """,
    "lang": "painless"
  }
}

# Async reindex for large datasets
POST /_reindex?wait_for_completion=false
{
  "source": { "index": "millions-of-products" },
  "dest": { "index": "new-millions-of-products" }
}
# Returns: {"task": "abc123:456"}

# Check task progress
GET /_tasks/abc123:456

4.10 Inserting a Complete Dataset for Practice

bash

# Insert several products to practice with
POST /products/_bulk
{"index": {"_id": "1"}}
{"product_id": "SP001", "name": "iPhone 15 Pro Max 256GB", "brand": "Apple", "category": "smartphones", "subcategory": "iOS", "price": 34990000, "original_price": 38490000, "discount_percentage": 9, "rating": 4.8, "review_count": 1250, "in_stock": true, "stock_quantity": 45, "tags": ["flagship", "5G", "iOS", "camera"], "created_at": "2024-01-10T08:00:00Z"}
{"index": {"_id": "2"}}
{"product_id": "SP002", "name": "Samsung Galaxy S24 Ultra 256GB", "brand": "Samsung", "category": "smartphones", "subcategory": "Android", "price": 31990000, "original_price": 33990000, "discount_percentage": 6, "rating": 4.7, "review_count": 890, "in_stock": true, "stock_quantity": 62, "tags": ["flagship", "5G", "Android", "S-Pen"], "created_at": "2024-01-12T09:00:00Z"}
{"index": {"_id": "3"}}
{"product_id": "SP003", "name": "Xiaomi 14 Pro 512GB", "brand": "Xiaomi", "category": "smartphones", "subcategory": "Android", "price": 19990000, "original_price": 21990000, "discount_percentage": 9, "rating": 4.5, "review_count": 456, "in_stock": true, "stock_quantity": 120, "tags": ["flagship", "5G", "Android", "Leica"], "created_at": "2024-01-08T10:00:00Z"}
{"index": {"_id": "4"}}
{"product_id": "SP004", "name": "MacBook Pro 16 M3 Max", "brand": "Apple", "category": "laptops", "subcategory": "macOS", "price": 89990000, "original_price": 94990000, "discount_percentage": 5, "rating": 4.9, "review_count": 678, "in_stock": true, "stock_quantity": 23, "tags": ["professional", "M3", "macOS"], "created_at": "2024-01-05T08:00:00Z"}
{"index": {"_id": "5"}}
{"product_id": "SP005", "name": "Dell XPS 15 9530", "brand": "Dell", "category": "laptops", "subcategory": "Windows", "price": 45990000, "original_price": 49990000, "discount_percentage": 8, "rating": 4.4, "review_count": 234, "in_stock": true, "stock_quantity": 18, "tags": ["professional", "OLED", "Windows"], "created_at": "2024-01-06T09:00:00Z"}
{"index": {"_id": "6"}}
{"product_id": "SP006", "name": "iPad Pro 12.9 M2 256GB WiFi", "brand": "Apple", "category": "tablets", "subcategory": "iPadOS", "price": 27990000, "original_price": 29990000, "discount_percentage": 7, "rating": 4.7, "review_count": 567, "in_stock": true, "stock_quantity": 34, "tags": ["M2", "iPadOS", "creative"], "created_at": "2024-01-07T08:00:00Z"}
{"index": {"_id": "7"}}
{"product_id": "SP007", "name": "Sony WH-1000XM5 Headphones", "brand": "Sony", "category": "audio", "subcategory": "headphones", "price": 8990000, "original_price": 9990000, "discount_percentage": 10, "rating": 4.8, "review_count": 2100, "in_stock": true, "stock_quantity": 89, "tags": ["ANC", "wireless", "premium"], "created_at": "2024-01-09T10:00:00Z"}
{"index": {"_id": "8"}}
{"product_id": "SP008", "name": "Samsung 65\" QLED 4K Smart TV", "brand": "Samsung", "category": "televisions", "subcategory": "QLED", "price": 35990000, "original_price": 42990000, "discount_percentage": 16, "rating": 4.6, "review_count": 345, "in_stock": false, "stock_quantity": 0, "tags": ["4K", "QLED", "Smart TV", "large"], "created_at": "2024-01-04T08:00:00Z"}
{"index": {"_id": "9"}}
{"product_id": "SP009", "name": "Asus ROG Strix G16 RTX 4070", "brand": "Asus", "category": "laptops", "subcategory": "gaming", "price": 52990000, "original_price": 57990000, "discount_percentage": 9, "rating": 4.5, "review_count": 189, "in_stock": true, "stock_quantity": 15, "tags": ["gaming", "RTX4070", "144Hz", "RGB"], "created_at": "2024-01-11T10:00:00Z"}
{"index": {"_id": "10"}}
{"product_id": "SP010", "name": "AirPods Pro 2nd Gen", "brand": "Apple", "category": "audio", "subcategory": "earbuds", "price": 6490000, "original_price": 6990000, "discount_percentage": 7, "rating": 4.7, "review_count": 3400, "in_stock": true, "stock_quantity": 156, "tags": ["ANC", "wireless", "iOS"], "created_at": "2024-01-13T09:00:00Z"}

4.11 Hands-On: CRUD with Real-World Scenarios

Scenario 1: E-commerce - Order Processing

bash

# Check stock before placing the order
GET /products/_doc/1

# Decrease the stock quantity (using a scripted update)
POST /products/_update/1
{
  "script": {
    "source": """
      if (ctx._source.stock_quantity >= params.quantity) {
        ctx._source.stock_quantity -= params.quantity;
        ctx._source.in_stock = ctx._source.stock_quantity > 0;
        ctx._source.updated_at = params.now;
      } else {
        // Do not update; report the operation as a noop
        ctx.op = 'noop';
      }
    """,
    "lang": "painless",
    "params": {
      "quantity": 2,
      "now": "2024-01-16T14:30:00Z"
    }
  }
}

Scenario 2: Content Management - Bulk publish articles

bash

# Create the articles index
PUT /articles
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
      "content": { "type": "text" },
      "author": { "type": "keyword" },
      "category": { "type": "keyword" },
      "tags": { "type": "keyword" },
      "status": { "type": "keyword" },
      "published_at": { "type": "date" },
      "view_count": { "type": "integer" },
      "like_count": { "type": "integer" }
    }
  }
}

# Bulk insert articles
POST /articles/_bulk
{"index": {"_id": "ART001"}}
{"title": "Hướng dẫn sử dụng Elasticsearch cho người mới", "content": "Elasticsearch là một search engine mạnh mẽ...", "author": "Nguyễn Văn A", "category": "technology", "tags": ["elasticsearch", "backend", "search"], "status": "published", "published_at": "2024-01-15T08:00:00Z", "view_count": 1520, "like_count": 89}
{"index": {"_id": "ART002"}}
{"title": "Top 10 frameworks Node.js năm 2024", "content": "Node.js ecosystem ngày càng phát triển...", "author": "Trần Thị B", "category": "technology", "tags": ["nodejs", "javascript", "backend"], "status": "published", "published_at": "2024-01-14T10:00:00Z", "view_count": 2340, "like_count": 156}
{"index": {"_id": "ART003"}}
{"title": "Docker và Kubernetes trong thực tế", "content": "Containerization đã thay đổi cách chúng ta deploy...", "author": "Lê Văn C", "category": "devops", "tags": ["docker", "kubernetes", "devops"], "status": "draft", "view_count": 0, "like_count": 0}

# Publish all draft articles by one author
POST /articles/_update_by_query
{
  "query": {
    "bool": {
      "must": [
        { "term": { "status": "draft" } },
        { "term": { "author": "Lê Văn C" } }
      ]
    }
  },
  "script": {
    "source": """
      ctx._source.status = 'published';
      ctx._source.published_at = params.published_at;
    """,
    "lang": "painless",
    "params": {
      "published_at": "2024-01-16T09:00:00Z"
    }
  }
}

Scenario 3: User Activity Tracking

bash

# Index user events  
PUT /user-events
{
  "settings": { "number_of_shards": 1 },
  "mappings": {
    "properties": {
      "user_id": { "type": "keyword" },
      "event_type": { "type": "keyword" },
      "product_id": { "type": "keyword" },
      "timestamp": { "type": "date" },
      "session_id": { "type": "keyword" },
      "device": { "type": "keyword" },
      "metadata": { "type": "object", "dynamic": true }
    }
  }
}

# Track events realtime
POST /user-events/_doc
{
  "user_id": "U001",
  "event_type": "product_view",
  "product_id": "SP001",
  "timestamp": "2024-01-16T14:30:00Z",
  "session_id": "sess_abc123",
  "device": "mobile",
  "metadata": {
    "referrer": "search",
    "search_query": "iPhone 15",
    "position": 1
  }
}

POST /user-events/_doc
{
  "user_id": "U001",
  "event_type": "add_to_cart",
  "product_id": "SP001",
  "timestamp": "2024-01-16T14:32:00Z",
  "session_id": "sess_abc123",
  "device": "mobile",
  "metadata": {
    "quantity": 1
  }
}

Chapter 4 Summary

Operation	API	Method	Notes
Create with ID	`/index/_doc/id`	PUT	Overwrites if it exists
Create with auto ID	`/index/_doc`	POST	ID is generated automatically
Create only	`/index/_create/id`	PUT	409 if it already exists
Get document	`/index/_doc/id`	GET	404 if not found
Get multiple docs	`/index/_mget`	GET/POST	Batch get
Check existence	`/index/_doc/id`	HEAD	200/404
Partial update	`/index/_update/id`	POST	Merges fields
Scripted update	`/index/_update/id`	POST	Complex logic
Update many	`/index/_update_by_query`	POST	With a query filter
Delete document	`/index/_doc/id`	DELETE	404 if not found
Delete many	`/index/_delete_by_query`	POST	With a query filter
Bulk ops	`/index/_bulk`	POST	Many operations per request
Reindex	`/_reindex`	POST	Copy/transform

Performance tips:

Use the Bulk API when writing more than about 10 documents
Disable refresh_interval and replicas during bulk loading
Use _source_includes to reduce network traffic
Use optimistic concurrency control (_seq_no) when you need to avoid race conditions

Next Steps

→ Chapter 5: Mapping and Data Types - a deeper look at how ES stores and analyzes different kinds of data

Chapter 5: Mapping and Data Types

5.1 What Is a Mapping?

A mapping in Elasticsearch is similar to a schema in SQL: it defines how ES stores and indexes the fields of a document. ES has two mapping modes:

Dynamic Mapping: ES infers the data type when it sees a new field
Explicit Mapping: you define each field explicitly

Why Does Explicit Mapping Matter?

javascript

// What ES dynamic mapping infers (numeric detection is off by default):
{
  "phone": "0901234567",  // string → text + keyword (fine)
  "price": "29.99",       // string → text + keyword (WRONG! should be a numeric type)
  "user_id": 123456,      // number → long (may not be what you want)
  "zip_code": 10000       // number → long (WRONG! should be keyword)
}

Consequences of a wrong dynamic mapping:

price as text → numeric sorting and range queries do not work
zip_code as long → a value such as "01234" is coerced to 1234, so the leading zero is lost
user_id as long → indexed for range queries it never needs; identifiers used in term lookups are usually better as keyword

5.2 Dynamic Mapping

Default Type Mapping Rules

JSON value	ES data type
`"hello"`	`text` with a `keyword` sub-field
`123`	`long`
`12.34`	`float`
`true`/`false`	`boolean`
`"2024-01-15"`	`date`
`{ "a": 1 }`	`object`
`[1, 2, 3]`	`long` array
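The rules in the table above can be approximated in a few lines. This is a rough sketch, not the actual Elasticsearch implementation: `guessDynamicType` is a hypothetical helper, it assumes the defaults (date detection on, numeric detection off), and it treats only ISO-8601-style strings as dates. JavaScript also cannot distinguish `1.0` from `1`, whereas ES maps a JSON `1.0` to `float`.

```javascript
// Simplified date pattern: yyyy-MM-dd with an optional time and zone.
const ISO_DATE = /^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2})?)?$/;

function guessDynamicType(value) {
  if (value === null || value === undefined) return null;   // null adds no field mapping
  if (Array.isArray(value)) {
    const first = value.find(v => v !== null);              // arrays take the type of the first non-null element
    return first === undefined ? null : guessDynamicType(first);
  }
  switch (typeof value) {
    case 'boolean': return 'boolean';
    case 'number':  return Number.isInteger(value) ? 'long' : 'float';
    case 'string':  return ISO_DATE.test(value) ? 'date' : 'text';
    case 'object':  return 'object';
    default:        return null;
  }
}

const sample = {
  phone: '0901234567', price: '29.99', age: 30, rating: 4.5,
  created: '2024-01-15', tags: ['a', 'b'], address: { city: 'Hanoi' }
};
for (const [field, value] of Object.entries(sample)) {
  console.log(field, '->', guessDynamicType(value));
}
```

Note how the quoted `price` ends up as `text`: a quoted number is still a string unless numeric detection is enabled or the field is mapped explicitly.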

Inspecting Dynamic Mapping

bash

# Index a document without an explicit mapping
PUT /auto-mapping-test/_doc/1
{
  "name": "John Doe",
  "age": 30,
  "salary": 15000000.50,
  "is_active": true,
  "created_date": "2024-01-15",
  "address": {
    "city": "Hanoi",
    "district": "Cau Giay"
  },
  "hobbies": ["reading", "coding"]
}

# View the automatically generated mapping
GET /auto-mapping-test/_mapping

Result:

json

{
  "auto-mapping-test": {
    "mappings": {
      "properties": {
        "address": {
          "properties": {
            "city": {
              "type": "text",
              "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
            },
            "district": {
              "type": "text",
              "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
            }
          }
        },
        "age": { "type": "long" },
        "created_date": { "type": "date" },
        "hobbies": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
        },
        "is_active": { "type": "boolean" },
        "name": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
        },
        "salary": { "type": "float" }
      }
    }
  }
}

Configuring Dynamic Mapping

bash

PUT /products
{
  "mappings": {
    "dynamic": "strict",    # strict | true | false | runtime
    "properties": {
      "name": { "type": "text" },
      "price": { "type": "double" }
    }
  }
}

Value	Behavior
`true` (default)	Automatically adds mappings for new fields
`false`	Ignores fields that are not in the mapping (not indexed, but kept in _source)
`strict`	Rejects the document with an error if it contains an undeclared field
`runtime`	Adds new fields as runtime fields instead of indexed fields

5.3 All Data Types

String Types

`text` - Full-text search

bash

PUT /articles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",       # Analyzer used at index time
        "search_analyzer": "standard", # Analyzer used at search time (default: same as analyzer)
        "index_options": "positions",  # docs | freqs | positions | offsets
        "term_vector": "with_positions_offsets",  # For highlighting
        "similarity": "BM25",          # Similarity algorithm
        "fielddata": false,            # Must be true to sort/aggregate on text (RAM-hungry!)
        "eager_global_ordinals": false,
        "norms": true                  # Needed for length normalization in relevance scoring
      }
    }
  }
}

`keyword` - Exact match, aggregations, sorting

bash

{
  "status": {
    "type": "keyword",
    "ignore_above": 256,     # Longer strings are not indexed
    "null_value": "NULL",    # Value indexed in place of an explicit null
    "doc_values": true,      # Needed for sorting/aggregations
    "normalizer": "lowercase_normalizer",  # Applied before indexing
    "split_queries_on_whitespace": false
  }
}

Keyword normalizer:

bash

PUT /test-normalizer
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword",
        "normalizer": "lowercase_normalizer"
      }
    }
  }
}

# With this normalizer, "Active", "ACTIVE", and "active" all match each other

Numeric Types

bash

{
  "mappings": {
    "properties": {
      "byte_field": { "type": "byte" },           # -128 to 127
      "short_field": { "type": "short" },         # -32768 to 32767
      "int_field": { "type": "integer" },         # ±2^31
      "long_field": { "type": "long" },           # ±2^63 (default for integers)
      "float_field": { "type": "float" },         # 32-bit IEEE 754
      "double_field": { "type": "double" },       # 64-bit IEEE 754
      "half_float": { "type": "half_float" },     # 16-bit IEEE 754
      "scaled_float": {
        "type": "scaled_float",
        "scaling_factor": 100                     # Stores 29.99 as 2999 (integer)
      }
    }
  }
}

Which type should you use?

Type	Use case
`byte`/`short`/`integer`	Small integers (age, quantity, score)
`long`	Large integers (timestamp millis, large user_id values)
`float`	Ratios, percentages, coordinates (floating-point error is acceptable)
`double`	Values that need more precision than `float`; for money, prefer `scaled_float`
`scaled_float`	Currency (price * 100 stored as an integer), saves disk space

bash

# Best practice for money:
{
  "price": {
    "type": "scaled_float",
    "scaling_factor": 100
  }
}
# 29990000 VND is stored as the integer 2999000000 (avoids floating-point errors)
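The idea behind scaled_float can be illustrated with plain arithmetic. The sketch below only demonstrates the encoding (multiply by the scaling factor, round, store an integer); `encodeScaled` and `decodeScaled` are hypothetical helper names, and this is not a description of how ES executes aggregations internally.

```javascript
// Binary floating point cannot represent many decimal fractions exactly.
console.log(0.1 + 0.2 === 0.3);          // false

// Encode: multiply by the scaling factor and round to the nearest integer.
function encodeScaled(value, scalingFactor) {
  return Math.round(value * scalingFactor);
}

// Decode: divide the stored integer by the same factor.
function decodeScaled(stored, scalingFactor) {
  return stored / scalingFactor;
}

const factor = 100;
const stored = encodeScaled(29.99, factor);
console.log(stored, decodeScaled(stored, factor));

// Adding the encoded integers does not accumulate fractional rounding error.
const cartTotal = [19.99, 5.01, 0.1, 0.2]
  .map(p => encodeScaled(p, factor))
  .reduce((a, b) => a + b, 0);
console.log(cartTotal, decodeScaled(cartTotal, factor));
```

For a currency without fractional units, such as VND, a plain `long` field or a scaling factor of 1 would be enough; a factor of 100 suits currencies with two decimal places.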

Date Types

bash

{
  "mappings": {
    "properties": {
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||strict_date_optional_time",
        "null_value": null,
        "ignore_malformed": false
      },
      "timestamp_ms": {
        "type": "date",
        "format": "epoch_millis"       # Unix timestamp in ms
      },
      "timestamp_sec": {
        "type": "date",
        "format": "epoch_second"       # Unix timestamp in seconds
      }
    }
  }
}

Supported date formats:

bash

"format": "strict_date_optional_time"  # ISO8601: 2024-01-15T10:30:00Z
"format": "yyyy-MM-dd"                 # 2024-01-15
"format": "dd/MM/yyyy"                 # 15/01/2024
"format": "epoch_millis"               # 1705312200000
"format": "epoch_second"              # 1705312200

# Multiple formats with ||
"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"

Date math in queries:

bash

GET /products/_search
{
  "query": {
    "range": {
      "created_at": {
        "gte": "now-7d/d",   # Rounded down to the start of the day, 7 days ago
        "lte": "now/d"       # With lte, rounded up to the end of today
      }
    }
  }
}

# Date math operators:
# now-1h    = 1 hour ago
# now+1d    = this time tomorrow
# now/d     = round to the day (down for gte/lt, up for gt/lte)
# now/M     = round to the month
# 2024-01-15||-3d = 3 days before 2024-01-15
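In a range query, `gte` rounds down to the start of the unit while `lte` rounds up to its last millisecond. The sketch below evaluates a small subset of date math to make that concrete. `resolveDateMath` is a hypothetical helper that supports only `now`, an optional `+/-N` offset in days, hours, or minutes, and `/d` rounding, all in UTC.

```javascript
// Milliseconds per supported unit: d = days, h = hours, m = minutes.
const UNIT_MS = { d: 86400000, h: 3600000, m: 60000 };

// `roundUp` selects the upper-bound behaviour (as used by lte).
function resolveDateMath(expr, now, roundUp = false) {
  const match = /^now(?:([+-])(\d+)([dhm]))?(\/d)?$/.exec(expr);
  if (!match) throw new Error(`unsupported expression: ${expr}`);
  const [, sign, amount, unit, rounding] = match;
  let ms = now.getTime();
  if (sign) ms += (sign === '-' ? -1 : 1) * Number(amount) * UNIT_MS[unit];
  if (rounding) {
    const startOfDay = Math.floor(ms / UNIT_MS.d) * UNIT_MS.d;  // midnight UTC
    ms = roundUp ? startOfDay + UNIT_MS.d - 1 : startOfDay;     // last ms of the day, or first
  }
  return new Date(ms);
}

const now = new Date('2024-01-16T14:30:00Z');
const gte = resolveDateMath('now-7d/d', now);          // lower bound rounds down
const lte = resolveDateMath('now/d', now, true);       // upper bound rounds up
console.log(gte.toISOString(), lte.toISOString());
```

With these bounds the query covers eight full calendar days, from the start of the day seven days ago through the end of today.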

Boolean Type

bash

{
  "in_stock": {
    "type": "boolean",
    "null_value": false    # Value used when the field is null
  }
}

# Accepted values:
# true: true, "true"
# false: false, "false", "" (empty string)

Object Type

bash

{
  "address": {
    "type": "object",          # (default for inner JSON objects)
    "dynamic": true,
    "enabled": true,           # false = kept in _source but not indexed
    "properties": {
      "street": { "type": "text" },
      "city": { "type": "keyword" },
      "lat": { "type": "double" },
      "lng": { "type": "double" }
    }
  }
}

# Document:
{
  "address": {
    "street": "123 Đường Cầu Giấy",
    "city": "Hà Nội",
    "lat": 21.0278,
    "lng": 105.8342
  }
}

# Internally stored as flat:
# address.street, address.city, address.lat, address.lng

The problem with arrays of objects:

bash

# THE PROBLEM:
{
  "users": [
    { "email": "alice@example.com", "role": "admin" },
    { "email": "bob@example.com", "role": "viewer" }
  ]
}

# ES FLATTENS to:
# users.email = ["alice@example.com", "bob@example.com"]
# users.role = ["admin", "viewer"]

# Query: email "alice@example.com" AND role "viewer" → MATCH (wrong!)
# because the association between each email and its role is lost

# Solution: the nested type (see below)
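The false positive can be reproduced in a few lines. This sketch only models the matching semantics; `flattenObjects`, `matchesFlattened`, and `matchesNested` are illustrative helper names, not Elasticsearch APIs.

```javascript
// Flatten an array of objects the way the `object` type does:
// one multi-valued list per leaf field.
function flattenObjects(path, objects) {
  const flat = {};
  for (const obj of objects) {
    for (const [key, value] of Object.entries(obj)) {
      const field = `${path}.${key}`;
      if (!flat[field]) flat[field] = [];
      flat[field].push(value);
    }
  }
  return flat;
}

// "object" semantics: each condition is checked against its flattened list independently.
function matchesFlattened(flat, conditions) {
  return Object.entries(conditions)
    .every(([field, value]) => (flat[field] || []).includes(value));
}

// "nested" semantics: all conditions must hold within the same inner object.
function matchesNested(path, objects, conditions) {
  return objects.some(obj =>
    Object.entries(conditions)
      .every(([field, value]) => obj[field.slice(path.length + 1)] === value));
}

const users = [
  { email: 'alice@example.com', role: 'admin' },
  { email: 'bob@example.com', role: 'viewer' }
];
const conditions = { 'users.email': 'alice@example.com', 'users.role': 'viewer' };

const flat = flattenObjects('users', users);
console.log(matchesFlattened(flat, conditions));        // true  (false positive)
console.log(matchesNested('users', users, conditions)); // false (relationship preserved)
```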

Nested Type

bash

{
  "reviews": {
    "type": "nested",
    "properties": {
      "user_id": { "type": "keyword" },
      "rating": { "type": "integer" },
      "comment": { "type": "text" },
      "created_at": { "type": "date" }
    }
  }
}

# Document:
{
  "product_id": "SP001",
  "reviews": [
    {
      "user_id": "U001",
      "rating": 5,
      "comment": "Sản phẩm tuyệt vời!",
      "created_at": "2024-01-05T10:00:00Z"
    },
    {
      "user_id": "U002",
      "rating": 3,
      "comment": "Bình thường, không xuất sắc.",
      "created_at": "2024-01-10T14:00:00Z"
    }
  ]
}

# Query with nested:
GET /products/_search
{
  "query": {
    "nested": {
      "path": "reviews",
      "query": {
        "bool": {
          "must": [
            { "term": { "reviews.user_id": "U001" } },
            { "range": { "reviews.rating": { "gte": 4 } } }
          ]
        }
      }
    }
  }
}

Note: nested objects are stored as separate hidden documents, which costs extra I/O and makes queries slower. Use nested only when you must preserve the relationship between the fields of each object.

Geo Types

`geo_point` - A Point on the Map

bash

{
  "location": {
    "type": "geo_point"
  }
}

# Several ways to represent a point:
{ "location": { "lat": 21.0278, "lon": 105.8342 } }
{ "location": [105.8342, 21.0278] }  # [lon, lat]!
{ "location": "21.0278,105.8342" }   # "lat,lon"
{ "location": "u4pruydqqvj" }        # Geohash
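Because the array form uses [lon, lat] while the string form uses "lat,lon", mixing them up is a common mistake. A client-side normalizer such as the sketch below can catch it early. `toLatLon` and `assertValid` are hypothetical helpers, and geohash strings are deliberately not decoded here.

```javascript
// Normalize the object, array, and "lat,lon" string forms into { lat, lon }.
function toLatLon(value) {
  if (Array.isArray(value)) {
    const [lon, lat] = value;                    // array order: [lon, lat]
    return { lat, lon };
  }
  if (typeof value === 'string') {
    const parts = value.split(',');
    if (parts.length !== 2) throw new Error('geohash or unsupported string format');
    const [lat, lon] = parts.map(Number);        // string order: "lat,lon"
    return { lat, lon };
  }
  return { lat: value.lat, lon: value.lon };     // object form
}

// Reject coordinates outside the valid latitude/longitude ranges.
function assertValid({ lat, lon }) {
  if (!(lat >= -90 && lat <= 90) || !(lon >= -180 && lon <= 180)) {
    throw new RangeError(`invalid coordinates: lat=${lat}, lon=${lon}`);
  }
}

const inputs = [{ lat: 21.0278, lon: 105.8342 }, [105.8342, 21.0278], '21.0278,105.8342'];
const points = inputs.map(toLatLon);
points.forEach(assertValid);
console.log(points);
```

For Hanoi the range check happens to catch a swapped array, because a longitude of 105.8 is not a valid latitude. For locations where both values are below 90 a swap passes validation silently, so the check is a safety net rather than a guarantee.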

bash

# Distance query
GET /stores/_search
{
  "query": {
    "geo_distance": {
      "distance": "10km",
      "location": {
        "lat": 21.0278,
        "lon": 105.8342
      }
    }
  }
}

`geo_shape` - Shapes (polygon, line...)

bash

{
  "area": {
    "type": "geo_shape"
  }
}

# Polygon example
{
  "area": {
    "type": "polygon",
    "coordinates": [
      [[105.82, 21.02], [105.85, 21.02], [105.85, 21.05], [105.82, 21.05], [105.82, 21.02]]
    ]
  }
}

Specialized Types

`ip` - IP addresses

bash

{
  "client_ip": {
    "type": "ip"
  }
}

# Supports IPv4 and IPv6
{ "client_ip": "192.168.1.100" }
{ "client_ip": "2001:db8::1" }

# CIDR range query
GET /access-logs/_search
{
  "query": {
    "term": {
      "client_ip": "192.168.1.0/24"  # CIDR notation
    }
  }
}
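What "an address matches a CIDR block" means can be shown with simple integer arithmetic. The sketch below handles IPv4 only; `ipv4ToInt` and `ipv4InCidr` are illustrative helpers and do not reflect how Elasticsearch indexes ip fields internally.

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  const octets = ip.split('.').map(Number);
  if (octets.length !== 4 || octets.some(o => !Number.isInteger(o) || o < 0 || o > 255)) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  return octets.reduce((acc, o) => acc * 256 + o, 0);
}

// True when `ip` falls inside the CIDR block, e.g. "192.168.1.0/24".
function ipv4InCidr(ip, cidr) {
  const [base, prefixText] = cidr.split('/');
  const prefix = Number(prefixText);
  if (!Number.isInteger(prefix) || prefix < 0 || prefix > 32) {
    throw new Error(`invalid prefix: ${cidr}`);
  }
  const blockSize = 2 ** (32 - prefix);            // number of addresses in the block
  const start = Math.floor(ipv4ToInt(base) / blockSize) * blockSize;
  const value = ipv4ToInt(ip);
  return value >= start && value < start + blockSize;
}

console.log(ipv4InCidr('192.168.1.100', '192.168.1.0/24'));
console.log(ipv4InCidr('192.168.2.1', '192.168.1.0/24'));
```

A /24 prefix fixes the first 24 bits, so the block spans 256 addresses from 192.168.1.0 to 192.168.1.255.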

`range` - Range Values

bash

{
  "age_range": { "type": "integer_range" },
  "date_range": { "type": "date_range" },
  "price_range": { "type": "float_range" },
  "ip_range": { "type": "ip_range" }
}

# Document with range values
{
  "age_range": {
    "gte": 25,
    "lte": 35
  },
  "date_range": {
    "gte": "2024-01-01",
    "lte": "2024-01-31"
  }
}

# Find documents whose range contains a value
GET /promotions/_search
{
  "query": {
    "term": {
      "date_range": {
        "value": "2024-01-15"  # Does the 15th fall within the range?
      }
    }
  }
}

`completion` - Autocomplete

bash

{
  "product_suggest": {
    "type": "completion",
    "analyzer": "simple",
    "search_analyzer": "simple",
    "preserve_separators": true,
    "preserve_position_increments": true,
    "max_input_length": 50
  }
}

# Document with suggest input
{
  "name": "iPhone 15 Pro Max",
  "product_suggest": {
    "input": ["iPhone", "iPhone 15", "iPhone 15 Pro", "iPhone 15 Pro Max", "Apple iPhone"],
    "weight": 10  # Ranking priority among suggestions
  }
}

# Suggest query
POST /products/_search
{
  "suggest": {
    "product_name_suggest": {
      "prefix": "iph",
      "completion": {
        "field": "product_suggest",
        "size": 5,
        "skip_duplicates": true
      }
    }
  }
}

`join` - Parent-Child Relationship

bash

{
  "mappings": {
    "properties": {
      "product_category": {
        "type": "join",
        "relations": {
          "category": "product"  # category is the parent of product
        }
      }
    }
  }
}

# Index parent
PUT /catalog/_doc/cat-1
{
  "name": "Smartphones",
  "product_category": { "name": "category" }
}

# Index child (routing must equal the parent ID so both land on the same shard)
PUT /catalog/_doc/prod-1?routing=cat-1
{
  "name": "iPhone 15",
  "price": 34990000,
  "product_category": {
    "name": "product",
    "parent": "cat-1"
  }
}
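
Why the routing value matters can be sketched in a few lines of Python. The hash below (CRC32) is only a stand-in, since Elasticsearch uses its own hash function, and the shard count is a hypothetical setting; the point is that equal routing values always map to the same shard.

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # hypothetical index setting


def shard_for(routing_value, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a shard from the routing value (stand-in hash, not the real one)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


parent_shard = shard_for("cat-1")           # parent: routing defaults to its own _id
child_shard = shard_for("cat-1")            # child indexed with ?routing=cat-1
unrouted_child_shard = shard_for("prod-1")  # where the child could land without ?routing

print(parent_shard == child_shard)
```

Without the explicit routing, the child would be placed by its own ID and might end up on a different shard than its parent, which the join field does not allow.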

`dense_vector` - Vector Search (AI/ML)

bash

{
  "description_vector": {
    "type": "dense_vector",
    "dims": 384,             # Number of dimensions (depends on the embedding model)
    "index": true,           # Required for kNN search
    "similarity": "cosine"   # cosine | dot_product | l2_norm
  }
}

# Index với vector embedding
PUT /articles/_doc/1
{
  "title": "Elasticsearch và AI",
  "description": "...",
  "description_vector": [0.1, -0.3, 0.7, ...]  # 384 float values
}

# kNN search (Semantic Search)
POST /articles/_search
{
  "knn": {
    "field": "description_vector",
    "query_vector": [0.15, -0.28, 0.65, ...],  # Embedding of the query text
    "k": 10,
    "num_candidates": 100
  }
}
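
As a rough mental model of what the kNN search ranks by, the following Python sketch does an exact, brute-force nearest-neighbour search with cosine similarity. Elasticsearch itself uses an approximate index and considers `num_candidates` per shard, so this is only an illustration; the tiny 2-dimensional vectors are hypothetical, and the score transform `(1 + cosine) / 2` is the one documented for the `cosine` similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query_vector, vectors_by_id, k):
    """Return the k best (doc_id, score) pairs, best first."""
    scored = [
        (doc_id, (1 + cosine(query_vector, vec)) / 2)
        for doc_id, vec in vectors_by_id.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-dimensional embeddings (real models use hundreds of dimensions)
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
top = knn([1.0, 0.0], vectors, k=2)
print(top)
```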

5.4 Multi-Fields

Multi-fields let you index the same field in several ways, which is very common in practice:

bash

PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",               # Full-text search
        "analyzer": "english",
        "fields": {
          "keyword": {
            "type": "keyword"          # Exact match, sort, agg
          },
          "suggest": {
            "type": "completion"       # Autocomplete
          },
          "ngram": {
            "type": "text",
            "analyzer": "ngram_analyzer"  # Partial match (n-grams)
          }
          # Note: "ngram_analyzer" is a custom analyzer; it must be defined under settings.analysis or index creation fails
        }
      },
      "category": {
        "type": "keyword",
        "fields": {
          "text": {
            "type": "text"
          }
        }
      }
    }
  }
}

Querying multi-fields:

bash

# Full-text search
{ "match": { "name": "iphone pro" } }

# Exact match
{ "term": { "name.keyword": "iPhone 15 Pro Max 256GB" } }

# Sort by name alphabetically
{ "sort": [{ "name.keyword": "asc" }] }

# Aggregate by name
{ "aggs": { "top_products": { "terms": { "field": "name.keyword" } } } }

# Suggest (every suggester needs a name; "name_suggest" is an arbitrary label)
{ "suggest": { "name_suggest": { "prefix": "iph", "completion": { "field": "name.suggest" } } } }

5.5 Important Mapping Parameters

`index` - Should the field be indexed?

bash

{
  "image_urls": {
    "type": "keyword",
    "index": false     # Kept in _source but not indexed for search
  }
}
# Use for fields that are only displayed, never searched

`doc_values` - Column-oriented storage

bash

{
  "price": {
    "type": "double",
    "doc_values": true   # Default: true (needed for sort/agg)
    # false = sorting and aggregating on this field is unavailable, but disk is saved
  }
}

Understanding doc_values:

Inverted Index (term → documents):
"apple" → [doc1, doc5, doc9]
"samsung" → [doc2, doc3, doc7]

Doc Values (column-oriented, document → value for each field):
price: doc1=34990000, doc2=31990000, doc3=29990000
brand: doc1=Apple, doc2=Samsung, doc3=Samsung

The inverted index is good for search (term → documents).
Doc values are good for sort/aggregation (document → values).
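
The two structures can be mimicked with plain Python dictionaries. This is a toy model under simplified assumptions (three hypothetical documents, no analysis, everything in memory), not how Lucene stores data, but it shows why search goes through one structure and sorting or aggregating through the other.

```python
from collections import defaultdict

# Hypothetical documents
docs = {
    1: {"brand": "Apple", "price": 34990000},
    2: {"brand": "Samsung", "price": 31990000},
    3: {"brand": "Samsung", "price": 29990000},
}

# Inverted index: term -> list of doc IDs (used to answer "which docs match?")
inverted = defaultdict(list)
for doc_id, doc in docs.items():
    inverted[doc["brand"]].append(doc_id)

# Doc values: field -> {doc ID -> value} (used to sort and aggregate)
doc_values = {
    field: {doc_id: doc[field] for doc_id, doc in docs.items()}
    for field in ("brand", "price")
}

matching = inverted["Samsung"]                                    # search step
sorted_by_price = sorted(matching, key=lambda d: doc_values["price"][d])  # sort step
avg_price = sum(doc_values["price"][d] for d in matching) / len(matching)  # agg step

print(matching, sorted_by_price, avg_price)
```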

`store` - Store the field separately in the index

bash

{
  "title": {
    "type": "text",
    "store": true      # Stored separately, in addition to _source
  }
}
# Rarely needed. Useful when _source is very large but only one field has to be retrieved

`enabled` - Turn off indexing entirely

bash

{
  "raw_json": {
    "type": "object",
    "enabled": false   # Kept in _source but not parsed, indexed, or searchable
  }
}
# Use for fields that only need to be stored (blob-like payloads)

`null_value` - Substitute value for explicit nulls

bash

{
  "status": {
    "type": "keyword",
    "null_value": "unknown"
  }
}
# An explicit null is indexed as "unknown" (the _source still shows null)
# Query: term { "status": "unknown" } also finds documents where the field was null

`copy_to` - Copy values into another field

bash

{
  "first_name": {
    "type": "text",
    "copy_to": "full_name"
  },
  "last_name": {
    "type": "text",
    "copy_to": "full_name"
  },
  "full_name": {
    "type": "text"
  }
}

# Document:
{ "first_name": "Nguyễn", "last_name": "Văn A" }

# full_name receives the values of both fields (they are copied, not concatenated into one string)
# Search: match { "full_name": "Nguyễn A" } → finds the document
# full_name does NOT appear in _source; it is only indexed

`ignore_above` - Skip overly long strings

bash

{
  "tag": {
    "type": "keyword",
    "ignore_above": 256   # Strings longer than 256 characters are not indexed
  }
}
# Keeps garbage data out of the index and reduces index size

`ignore_malformed` - Skip badly formatted data

bash

{
  "price": {
    "type": "double",
    "ignore_malformed": true
  }
}
# "price": "not-a-number" → the document is still indexed, only this field is skipped
# Without it, the whole document is rejected with an error (default behavior)

5.6 Index Templates and Component Templates

Component Templates - Reusing Mappings

bash

# Create a component template for common fields
PUT /_component_template/common-fields
{
  "template": {
    "mappings": {
      "properties": {
        "created_at": { "type": "date" },
        "updated_at": { "type": "date" },
        "created_by": { "type": "keyword" },
        "is_deleted": { "type": "boolean" }
      }
    }
  }
}

# Component template for settings
PUT /_component_template/standard-settings
{
  "template": {
    "settings": {
      "number_of_replicas": 1,
      "refresh_interval": "1s",
      "index.codec": "best_compression"
    }
  }
}

# Index template composed of several component templates
PUT /_index_template/products-template
{
  "index_patterns": ["products-*"],
  "composed_of": ["standard-settings", "common-fields"],
  "priority": 100,
  "template": {
    "mappings": {
      "properties": {
        "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
        "price": { "type": "scaled_float", "scaling_factor": 100 },
        "brand": { "type": "keyword" }
      }
    }
  }
}

# Any new index whose name matches "products-*" inherits this template
PUT /products-2024-01
# It automatically gets: common-fields + standard-settings + the products-specific mappings
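
How the pieces combine can be pictured as a series of dictionary merges. The Python sketch below assumes the documented precedence as I understand it: component templates are merged in the order listed in `composed_of`, and the index template's own `template` section is applied last. The helper names are hypothetical and the template bodies are shortened copies of the ones above.

```python
def deep_merge(base, override):
    """Recursively merge two dicts; values from `override` win on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_template(component_templates, composed_of, own_template):
    """Merge components in order, then apply the index template's own section."""
    resolved = {}
    for name in composed_of:
        resolved = deep_merge(resolved, component_templates[name])
    return deep_merge(resolved, own_template)


component_templates = {
    "common-fields": {"mappings": {"properties": {"created_at": {"type": "date"}}}},
    "standard-settings": {"settings": {"number_of_replicas": 1}},
}
own_template = {"mappings": {"properties": {"brand": {"type": "keyword"}}}}

resolved = resolve_template(
    component_templates, ["standard-settings", "common-fields"], own_template
)
print(sorted(resolved["mappings"]["properties"]))
```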

5.7 Runtime Fields

Runtime fields are computed at query time, so no reindex is needed:

bash

PUT /products
{
  "mappings": {
    "runtime": {
      "discounted_price": {
        "type": "double",
        "script": {
          "source": """
            double price = doc['price'].value;
            double discount = doc['discount_percentage'].value;
            emit(price * (1 - discount / 100.0));
          """,
          "lang": "painless"
        }
      },
      "is_premium": {
        "type": "boolean",
        "script": {
          "source": "emit(doc['price'].value > 20000000)",
          "lang": "painless"
        }
      },
      "category_brand": {
        "type": "keyword",
        "script": {
          "source": "emit(doc['category'].value + '-' + doc['brand'].value)",
          "lang": "painless"
        }
      }
    },
    "properties": {
      "price": { "type": "double" },
      "discount_percentage": { "type": "integer" },
      "category": { "type": "keyword" },
      "brand": { "type": "keyword" }
    }
  }
}

# Query using a runtime field
GET /products/_search
{
  "query": {
    "range": {
      "discounted_price": {
        "gte": 15000000,
        "lte": 25000000
      }
    }
  },
  "_source": ["name", "price", "discount_percentage"],
  "fields": ["discounted_price"]  # Include computed field in response
}

Ad hoc runtime fields (defined for a single query):

bash

GET /products/_search
{
  "runtime_mappings": {
    "final_price": {
      "type": "double",
      "script": {
        "source": """
          double base = doc['price'].value;
          if (doc['discount_percentage'].size() > 0) {
            base = base * (1 - doc['discount_percentage'].value / 100.0);
          }
          emit(base);
        """
      }
    }
  },
  "query": {
    "range": {
      "final_price": { "lte": 10000000 }
    }
  }
}
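
The Painless script above is easy to reason about when mirrored in Python. This sketch recomputes `final_price` for a few hypothetical products and applies the same `lte` filter; it only mirrors the script's logic and says nothing about how Elasticsearch executes it.

```python
def final_price(doc):
    """Mirror of the script: apply the discount only when the field is present."""
    base = doc["price"]
    discount = doc.get("discount_percentage")
    if discount is not None:
        base = base * (1 - discount / 100.0)
    return base


# Hypothetical products
products = [
    {"name": "A", "price": 12000000, "discount_percentage": 20},
    {"name": "B", "price": 9000000},
    {"name": "C", "price": 15000000, "discount_percentage": 10},
]

# Equivalent of: range { final_price: { lte: 10000000 } }
cheap = [p["name"] for p in products if final_price(p) <= 10000000]
print(cheap)
```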

5.8 Mapping Best Practices

Mapping for an E-commerce Product Index

bash

PUT /ecommerce-products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "product_name_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "word_delimiter_graph", "stop"]
        },
        "sku_analyzer": {
          "type": "custom",
          "tokenizer": "keyword"
        },
        "path_hierarchy_analyzer": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      
      "product_id": {
        "type": "keyword"
      },
      
      "sku": {
        "type": "keyword",
        "normalizer": "lowercase"
      },
      
      "name": {
        "type": "text",
        "analyzer": "product_name_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 512
          },
          "suggest": {
            "type": "completion"
          }
        }
      },
      
      "description": {
        "type": "text",
        "index_options": "offsets"
      },
      
      "short_description": {
        "type": "text",
        "copy_to": "combined_text"
      },
      
      "combined_text": {
        "type": "text",
        "store": false
      },
      
      "brand": {
        "type": "keyword"
      },
      
      "category": {
        "type": "keyword"
      },
      
      "category_path": {
        "type": "text",
        "analyzer": "path_hierarchy_analyzer"
      },
      
      "price": {
        "type": "scaled_float",
        "scaling_factor": 100
      },
      
      "original_price": {
        "type": "scaled_float",
        "scaling_factor": 100
      },
      
      "discount_pct": {
        "type": "byte"
      },
      
      "currency": {
        "type": "keyword"
      },
      
      "stock": {
        "properties": {
          "quantity": { "type": "integer" },
          "in_stock": { "type": "boolean" },
          "warehouse": { "type": "keyword" }
        }
      },
      
      "ratings": {
        "properties": {
          "average": { "type": "half_float" },
          "count": { "type": "integer" },
          "distribution": {
            "properties": {
              "1_star": { "type": "integer" },
              "2_star": { "type": "integer" },
              "3_star": { "type": "integer" },
              "4_star": { "type": "integer" },
              "5_star": { "type": "integer" }
            }
          }
        }
      },
      
      "attributes": {
        "type": "nested",
        "properties": {
          "name": { "type": "keyword" },
          "value": { "type": "keyword" }
        }
      },
      
      "tags": {
        "type": "keyword"
      },
      
      "images": {
        "type": "keyword",
        "index": false,
        "doc_values": false
      },
      
      "status": {
        "type": "keyword"
      },
      
      "visibility": {
        "type": "keyword"
      },
      
      "created_at": { "type": "date" },
      "updated_at": { "type": "date" },
      "published_at": { "type": "date" }
    }
  }
}

Mapping for a Log Index

bash

PUT /app-logs
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.lifecycle.name": "logs-policy"
  },
  "mappings": {
    "dynamic": "false",
    "properties": {
      "@timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "service": { "type": "keyword" },
      "message": { "type": "text", "norms": false },
      "trace_id": { "type": "keyword", "doc_values": false },
      "span_id": { "type": "keyword", "doc_values": false },
      "user_id": { "type": "keyword" },
      "http": {
        "properties": {
          "method": { "type": "keyword" },
          "path": { "type": "keyword" },
          "status_code": { "type": "short" },
          "duration_ms": { "type": "float" }
        }
      },
      "error": {
        "properties": {
          "type": { "type": "keyword" },
          "message": { "type": "text", "norms": false },
          "stack_trace": { "type": "text", "index": false }
        }
      },
      "host": { "type": "keyword" },
      "environment": { "type": "keyword" }
    }
  }
}

Why these settings suit logs:

"norms": false on message: norms cost memory and disk and are unnecessary here (logs do not need relevance scoring)
"doc_values": false on trace_id: no sorting or aggregating on this field
"dynamic": "false": stops ES from adding mappings for unknown fields (they stay in _source but are not searchable)
"index": false on stack_trace: stack traces are very long; they only need to be stored, not searched

5.9 Updating a Mapping

New fields can be added to an existing mapping:

bash

# Add new fields to an existing mapping
PUT /products/_mapping
{
  "properties": {
    "weight_kg": { "type": "float" },
    "country_of_origin": { "type": "keyword" }
  }
}

What CANNOT be changed after creation:

The type of an existing field (text → keyword)
The analyzer of an existing field
The index parameter from false to true
The number of primary shards (an index setting, fixed at creation)

Solution: reindex into a new index that has the new mapping

bash

# Step 1: Create a new index with the correct mapping
PUT /products-v2
{ "mappings": { ... correct mappings ... } }

# Step 2: Reindex the data
POST /_reindex
{
  "source": { "index": "products" },
  "dest": { "index": "products-v2" }
}

# Step 3: Switch the alias (both actions in one request are applied atomically)
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products", "alias": "products-current" } },
    { "add": { "index": "products-v2", "alias": "products-current" } }
  ]
}

Chapter 5 Summary

The most important data types:

text vs keyword: text for full-text search, keyword for exact match/sort/agg
date: supports many formats and powerful date math
nested: preserves relationships inside arrays of objects
scaled_float: a good fit for currency
keyword with a normalizer: case-insensitive exact match
dense_vector: for AI/ML similarity search

Mapping best practices:

Use "dynamic": "strict" in production (avoids surprises)
Use copy_to to build a "search all fields" field
Turn off doc_values on keyword/numeric fields that are never sorted or aggregated, and norms on text fields that do not need relevance scoring
Use ignore_above on keyword fields to keep garbage out of the index
Use ignore_malformed: true for data of uncertain quality
Always use explicit mappings; do not rely on dynamic mapping in production

Next Steps

→ Chapter 6: Basic Query DSL - Master the art of writing Elasticsearch queries

Chapter 6: Basic Query DSL

6.1 Query Context vs Filter Context

This distinction is crucial: it affects both behavior and performance.

Query Context - "How well does this match?"

The question: how relevant is this document?

bash

GET /products/_search
{
  "query": {
    "match": {           # ← This is query context
      "name": "iPhone"  # A relevance score (_score) is computed
    }
  }
}

ES computes a _score for every matching document
Results are sorted by score (by default)
Slower, because scores have to be computed
Use it when you need relevance ranking

Filter Context - "Does this match?"

The question: does the document match? (Yes/No)

bash

GET /products/_search
{
  "query": {
    "bool": {
      "filter": [        # ← This is filter context
        { "term": { "brand": "Apple" } },
        { "range": { "price": { "lte": 20000000 } } }
      ]
    }
  }
}

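No relevance scoring (a filter-only bool gives every hit a _score of 0; constant_score gives a fixed score)
Frequently used filters can be cached by ES
Usually faster than query context
Use for: filtering by status, filtering by category, range filters...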

Combining Both: "Best of Both Worlds"

bash

GET /products/_search
{
  "query": {
    "bool": {
      "must": [                         # Query context - computes _score
        {
          "multi_match": {
            "query": "điện thoại samsung",
            "fields": ["name^3", "description"]
          }
        }
      ],
      "filter": [                       # Filter context - no scoring, cacheable
        { "term": { "in_stock": true } },
        { "term": { "category": "smartphones" } },
        { "range": { "price": { "gte": 5000000, "lte": 30000000 } } }
      ]
    }
  }
}

Rule of thumb: put every "must be true" condition that does not need scoring into filter context.

6.2 Match Queries - Full-text Search

`match` - The Most Basic Query

bash

GET /products/_search
{
  "query": {
    "match": {
      "name": "iphone pro"
    }
  }
}

What happens internally:

Analyze "iphone pro" → tokens: ["iphone", "pro"]
Find documents containing "iphone" OR "pro"
Compute a BM25 score
Sort and return
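
The four steps can be sketched in Python with a tiny in-memory corpus. This uses the textbook BM25 formula with the common defaults k1 = 1.2 and b = 0.75 and a naive whitespace analyzer; Lucene's actual implementation differs in details, so the absolute scores are not meant to match Elasticsearch. The product names are hypothetical.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # common BM25 defaults


def analyze(text):
    """Naive analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_search(query, docs):
    """Return (doc_id, score) pairs for docs matching ANY query term, best first."""
    tokenized = {doc_id: analyze(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    query_terms = analyze(query)
    # Document frequency of every query term
    df = {t: sum(1 for tokens in tokenized.values() if t in tokens) for t in query_terms}

    results = []
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = K1 * (1 - B + B * len(tokens) / avg_len)
            score += idf * tf[term] * (K1 + 1) / (tf[term] + length_norm)
        if score > 0:
            results.append((doc_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


docs = {
    1: "iPhone 15 Pro Max",
    2: "iPhone 15",
    3: "MacBook Pro 14",
    4: "Galaxy S24 Ultra",
}
results = bm25_search("iphone pro", docs)
print([doc_id for doc_id, _ in results])
```

Document 1 contains both terms and therefore ranks first, documents 2 and 3 match one term each, and document 4 is not returned at all.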

Options of match:

bash

GET /products/_search
{
  "query": {
    "match": {
      "name": {
        "query": "iphone samsung galaxy",
        "operator": "and",          # ALL terms must match (default: or)
        "minimum_should_match": "75%",  # At least 75% of terms must match (only meaningful with operator: or)
        "fuzziness": "AUTO",        # Typo tolerance
        "prefix_length": 2,         # No fuzziness on the first 2 characters
        "max_expansions": 10,       # Max fuzzy expansions
        "zero_terms_query": "none",  # none|all when the analyzer removes every term (e.g. only stop words)
        "lenient": false            # Whether to ignore format errors
      }
    }
  }
}

fuzziness explained:

Value	Meaning
`0`	No fuzziness (exact match)
`1`	1 edit allowed (edit distance = 1)
`2`	2 edits allowed
`"AUTO"`	ES decides by term length: 0 for 1-2 chars, 1 for 3-5 chars, 2 for 6+ chars

bash

# "samsumg" → finds "samsung" (edit distance 1)
# "iphne" → finds "iphone" (edit distance 1)
GET /products/_search
{
  "query": {
    "match": {
      "brand": {
        "query": "samsumg",
        "fuzziness": "AUTO"
      }
    }
  }
}
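
The AUTO rule and the notion of edit distance can be checked with a short Python sketch. It implements the length thresholds from the table above and an optimal-string-alignment distance (Levenshtein plus adjacent transpositions). It compares already-normalized terms only and is an illustration, not the automaton-based matching Lucene uses.

```python
def auto_fuzziness(term):
    """Allowed edits for fuzziness AUTO, based on term length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def edit_distance(a, b):
    """Levenshtein distance where swapping two adjacent characters counts as 1 edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def fuzzy_match(query_term, indexed_term):
    """True when the indexed term is within the AUTO edit budget of the query term."""
    return edit_distance(query_term, indexed_term) <= auto_fuzziness(query_term)


print(fuzzy_match("samsumg", "samsung"), fuzzy_match("iphne", "iphone"))
```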

`match_phrase` - Exact Phrase Search

bash

# Find "pro max" as adjacent terms in exactly this order
GET /products/_search
{
  "query": {
    "match_phrase": {
      "name": "pro max"
    }
  }
}
# Match: "iPhone 15 Pro Max" ✓
# No match: "Max Pro iPhone 15" ✗ (wrong order)
# No match: "Pro chip Max battery" ✗ (not adjacent)

slop - Allows gaps between the terms:

bash

GET /products/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "iPhone Max",
        "slop": 2    # Allows up to 2 position moves, e.g. 2 other words between "iPhone" and "Max"
      }
    }
  }
}
# Match: "iPhone 15 Pro Max" ✓ (2 words in between, needs slop = 2)
# Match: "iPhone Max 15" ✓ (terms already adjacent, needs slop = 0)
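
A simplified model of slop helps to predict which phrases match. In the Python sketch below, each query term's position in the document is shifted back by its offset in the query, and the required slop is the spread of those shifted positions. This follows the usual description of sloppy phrase matching, but it is an approximation for illustration (whitespace tokens, no repeated query terms handled specially), not Lucene's implementation.

```python
from itertools import product


def required_slop(query, text):
    """Smallest slop at which the phrase matches, or None if a term is missing."""
    query_terms = query.lower().split()
    tokens = text.lower().split()
    positions = [
        [i for i, token in enumerate(tokens) if token == term]
        for term in query_terms
    ]
    if any(not p for p in positions):
        return None
    best = None
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # two query terms cannot consume the same token
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        spread = max(shifted) - min(shifted)
        best = spread if best is None else min(best, spread)
    return best


print(required_slop("iPhone Max", "iPhone 15 Pro Max"))
print(required_slop("iPhone Max", "iPhone Max 15"))
print(required_slop("pro max", "Max Pro iPhone 15"))
```

Under this model a plain match_phrase (slop 0) matches only when the result is 0, and swapping two adjacent words costs 2.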

`match_phrase_prefix` - Simple Autocomplete

bash

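# The last word is treated as a prefix: "pro" also matches "professional", ...
GET /products/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": {
        "query": "iphone pro",   # Phrase match; only the last word is a prefix
        "max_expansions": 20     # Limit on prefix expansions
      }
    }
  }
}
# "iphone pro" → matches "iphone pro max", "iphone pro case", "iphone professional kit"
# It does not match "iphone 16 pro" unless slop is set, because the terms must be adjacent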

`multi_match` - Searching Across Multiple Fields

bash

GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "samsung flagship",
      "fields": [
        "name^3",            # Boost x3 for matches in name
        "description^1",     # Boost x1 (default)
        "brand^2",           # Boost x2
        "tags"
      ],
      "type": "best_fields"  # best_fields | most_fields | cross_fields | phrase | phrase_prefix
    }
  }
}

The type options of multi_match:

`best_fields` (default)

Uses the score of the best-matching field
tie_breaker adds a fraction of the scores of the other matching fields

bash

{
  "type": "best_fields",
  "tie_breaker": 0.3
  # score = best_field_score + 0.3 * other_matching_fields_score
}

`most_fields`

Adds up the scores of all matching fields
Use when the same content is indexed in several ways (e.g. with different language analyzers)

bash

{
  "type": "most_fields"
}

`cross_fields`

Treats the fields as one big field
Each term may appear in any of the fields

bash

{
  "type": "cross_fields",
  "operator": "and"
  # "John" must match in at least one of the fields
  # "Smith" must match in at least one of the fields
}

Use case: searching people by name:

bash

GET /users/_search
{
  "query": {
    "multi_match": {
      "query": "Nguyễn Văn A",
      "fields": ["first_name", "last_name"],
      "type": "cross_fields",
      "operator": "and"
    }
  }
}

6.3 Term-Level Queries - Exact Match

Term-level queries do not analyze the input; it is compared directly with the indexed values.

`term` - Exact Match on a Single Value

bash

GET /products/_search
{
  "query": {
    "term": {
      "brand": {
        "value": "Apple",
        "boost": 1.0         # Boost factor
      }
    }
  }
}

# Short form:
{
  "query": {
    "term": { "brand": "Apple" }
  }
}

Be careful with text fields:

bash

# WRONG - text fields are analyzed (lowercased), so the term "Apple" does not exist in the index
{
  "term": { "name": "Apple" }   # ← WRONG
}

# RIGHT - use the .keyword sub-field for exact match
{
  "term": { "name.keyword": "Apple iPhone 15" }  # ← RIGHT
}

# OR use match for text fields
{
  "match": { "name": "iPhone 15" }  # ← RIGHT (analyzed)
}

`terms` - Match Any of Several Values

bash

GET /products/_search
{
  "query": {
    "terms": {
      "brand": ["Apple", "Samsung", "Xiaomi"]
    }
  }
}
# SQL equivalent: WHERE brand IN ('Apple', 'Samsung', 'Xiaomi')

Terms Lookup - Fetch the terms from another document:

bash

# Read the list of followed brands from the user profile
GET /products/_search
{
  "query": {
    "terms": {
      "brand": {
        "index": "users",         # Index to read from
        "id": "user_123",         # Document ID
        "path": "followed_brands" # Field holding the values
      }
    }
  }
}

`range` - Query by Value Range

bash

# Numbers
GET /products/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 5000000,   # greater than or equal
        "lte": 20000000   # less than or equal
        # "gt": 4000000   would be a strict lower bound (use instead of gte, not together)
        # "lt": 20000001  would be a strict upper bound (use instead of lte, not together)
      }
    }
  }
}

# Dates
GET /orders/_search
{
  "query": {
    "range": {
      "created_at": {
        "gte": "2024-01-01",
        "lte": "2024-01-31",
        "format": "yyyy-MM-dd",
        "time_zone": "+07:00"
      }
    }
  }
}

# Relative dates
GET /orders/_search
{
  "query": {
    "range": {
      "created_at": {
        "gte": "now-30d/d",  # 30 days ago, rounded down to the start of that day
        "lte": "now/d"       # Now, rounded up to the end of today (lte rounds up)
      }
    }
  }
}
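
The rounding behavior of date math is easy to misread, so here is a Python sketch of the window the query above describes. It assumes the documented rule that `/d` rounds down for `gte` and up to the last millisecond of the day for `lte`, and it ignores time zones (Elasticsearch rounds in UTC unless `time_zone` is set). The helper names are made up for this example.

```python
from datetime import datetime, timedelta


def floor_to_day(ts):
    """Start of the day containing `ts`."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def ceil_to_day(ts):
    """Last millisecond of the day containing `ts`."""
    return floor_to_day(ts) + timedelta(days=1) - timedelta(milliseconds=1)


def last_30_days_window(now):
    """Bounds equivalent to gte: now-30d/d and lte: now/d."""
    gte = floor_to_day(now - timedelta(days=30))
    lte = ceil_to_day(now)
    return gte, lte


gte, lte = last_30_days_window(datetime(2024, 1, 31, 15, 30))
print(gte.isoformat(), lte.isoformat())
```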

`exists` - Does the Document Have the Field?

bash

# Find documents that have the description field
GET /products/_search
{
  "query": {
    "exists": { "field": "description" }
  }
}

# Find documents that do NOT have the field (must_not + exists)
GET /products/_search
{
  "query": {
    "bool": {
      "must_not": [
        { "exists": { "field": "description" } }
      ]
    }
  }
}

A note on "null" vs "not exists":

An explicit null (or an empty array) → nothing is indexed, so exists does NOT match the document
A field mapped with "null_value" → null is indexed as the substitute value, so exists DOES match

`prefix` - Search by Prefix

bash

# Find brands that start with "Sam"
GET /products/_search
{
  "query": {
    "prefix": {
      "brand": {
        "value": "Sam",
        "rewrite": "constant_score"
      }
    }
  }
}
# Match: "Samsung", "Samsonite", "Samlex"...

`wildcard` - Search by Pattern

bash

GET /products/_search
{
  "query": {
    "wildcard": {
      "product_id": {
        "value": "SP0*",          # * = any characters
        "case_insensitive": true  # ES 7.10+
      }
    }
  }
}

# ? = single character
{
  "wildcard": {
    "product_id": "SP0??"  # Matches SP001, SP099 (5 characters in total)
  }
}

Performance warning: a wildcard with a leading * (*pattern) is very slow, because every term in the index has to be scanned.
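
The semantics of `*` and `?` can be tried out in Python with `fnmatch`, which uses the same two wildcards and also matches the whole string. Note that `fnmatch` additionally gives `[...]` a special meaning, which the Elasticsearch wildcard query does not, so this sketch is only a stand-in for patterns that use `*` and `?`. The product IDs are hypothetical.

```python
import fnmatch

product_ids = ["SP001", "SP099", "SP0001", "SP100", "XSP001"]  # hypothetical values


def wildcard(pattern, values):
    """Case-sensitive whole-string wildcard match."""
    return [value for value in values if fnmatch.fnmatchcase(value, pattern)]


print(wildcard("SP0*", product_ids))   # any number of trailing characters
print(wildcard("SP0??", product_ids))  # exactly two trailing characters
print(wildcard("*001", product_ids))   # leading wildcard: every value must be inspected
```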

`regexp` - Regular Expression

bash

GET /products/_search
{
  "query": {
    "regexp": {
      "product_id": {
        "value": "SP[0-9]{3}",
        "flags": "ALL",
        "case_insensitive": true,
        "max_determinized_states": 10000  # Giới hạn complexity
      }
    }
  }
}

Warning: regexp queries can be very slow; avoid them in production search paths.
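
One detail worth internalizing is that a regexp query has to match the entire term, not a substring. The Python sketch below mimics that with `re.fullmatch`; Lucene's regular-expression syntax is not identical to Python's, so treat this only as an illustration of the anchoring behavior for the simple pattern used above.

```python
import re

# Same pattern as in the query above, with case-insensitive matching
PATTERN = re.compile(r"SP[0-9]{3}", re.IGNORECASE)


def regexp_match(term):
    """True only when the whole term matches the pattern."""
    return PATTERN.fullmatch(term) is not None


for term in ["SP001", "sp123", "SP0001", "XSP001"]:
    print(term, regexp_match(term))
```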

`fuzzy` - Fuzzy Search

bash

GET /products/_search
{
  "query": {
    "fuzzy": {
      "brand": {
        "value": "Samsumg",
        "fuzziness": 1,
        "prefix_length": 3,    # The prefix "Sam" must match exactly
        "max_expansions": 50,
        "transpositions": true  # "ab" → "ba" (count as 1 edit)
      }
    }
  }
}

`ids` - Search by Document IDs

bash

GET /products/_search
{
  "query": {
    "ids": {
      "values": ["1", "2", "3", "4"]
    }
  }
}

6.4 Compound Queries - Combining Queries

`bool` Query - The Most Flexible

A bool query combines several queries with Boolean logic:

bash

GET /products/_search
{
  "query": {
    "bool": {
      "must": [...],         # AND - must match, contributes to the score
      "should": [...],       # OR - should match, raises the score when it does
      "must_not": [...],     # NOT - must not match
      "filter": [...]        # AND - must match, does NOT affect the score (cacheable)
    }
  }
}

A practical example - product search:

bash

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "điện thoại 5G",
            "fields": ["name^3", "description"]
          }
        }
      ],
      "should": [
        { "term": { "tags": "flagship" } },
        { "range": { "rating": { "gte": 4.5 } } }
      ],
      "must_not": [
        { "term": { "status": "discontinued" } }
      ],
      "filter": [
        { "term": { "in_stock": true } },
        { "term": { "category": "smartphones" } },
        {
          "range": {
            "price": {
              "gte": 5000000,
              "lte": 30000000
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

minimum_should_match:

bash

"bool": {
  "should": [
    { "term": { "tags": "flagship" } },
    { "term": { "tags": "5G" } },
    { "term": { "tags": "camera" } }
  ],
  "minimum_should_match": 2  # At least 2 of the 3 should clauses must match
}
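
The rule is simply a count of matching should clauses. A minimal Python sketch, with hypothetical tag lists standing in for documents:

```python
def passes_should(doc_tags, should_terms, minimum_should_match):
    """True when enough of the should terms are present in the document's tags."""
    matched = sum(1 for term in should_terms if term in doc_tags)
    return matched >= minimum_should_match


should_terms = ["flagship", "5G", "camera"]

print(passes_should(["flagship", "5G"], should_terms, 2))
print(passes_should(["camera"], should_terms, 2))
```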

Nested bool queries:

bash

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "smartphone" } }
      ],
      "filter": [
        {
          "bool": {
            "should": [
              { "term": { "brand": "Apple" } },
              { "term": { "brand": "Samsung" } }
            ],
            "minimum_should_match": 1
          }
        },
        { "term": { "in_stock": true } }
      ]
    }
  }
}

`boosting` Query

Returns documents that match the positive query, but lowers the score of those that also match the negative query:

bash

GET /products/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": { "name": "smartphone" }  # Find smartphones
      },
      "negative": {
        "term": { "condition": "refurbished" }  # Demote refurbished items
      },
      "negative_boost": 0.3   # Score of refurbished items = original * 0.3
    }
  }
}

`dis_max` - Disjunction Max

Takes the maximum score of the sub-queries (instead of adding them up like bool should):

bash

GET /products/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "name": "iPhone" } },
        { "match": { "description": "iPhone" } }
      ],
      "tie_breaker": 0.3   # 0 = pure max, 1 = sum of all sub-query scores, 0.3 = max + 0.3 * the others
    }
  }
}

When to use dis_max vs multi_match best_fields?

multi_match with type: best_fields is effectively a wrapper around dis_max. In most cases multi_match is the more convenient choice.
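
The scoring rule shared by dis_max and best_fields fits in a few lines of Python. The per-field scores below are made-up numbers; the formula is the best score plus tie_breaker times the sum of the other matching scores.

```python
def dis_max_score(sub_scores, tie_breaker=0.0):
    """Combine sub-query scores: best score plus a fraction of the others."""
    matching = [score for score in sub_scores if score > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


scores = [2.0, 1.0]  # hypothetical scores for name and description

print(dis_max_score(scores, tie_breaker=0.0))  # pure max
print(dis_max_score(scores, tie_breaker=0.3))  # max plus 30% of the rest
print(dis_max_score(scores, tie_breaker=1.0))  # plain sum
```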

`constant_score` - Filter with a Fixed Score

bash

GET /products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "brand": "Apple" }
      },
      "boost": 1.5    # Every matching doc gets score = 1.5
    }
  }
}

6.5 Special Queries

`match_all` and `match_none`

bash

# Returns all documents
GET /products/_search
{
  "query": { "match_all": {} }
}

# Returns no documents (handy when building queries dynamically)
GET /products/_search
{
  "query": { "match_none": {} }
}

`query_string` - Apache Lucene Query Syntax

bash

GET /products/_search
{
  "query": {
    "query_string": {
      "query": "brand:Apple AND price:[20000000 TO 50000000] AND -category:accessories",
      "default_field": "name",
      "allow_leading_wildcard": false
    }
  }
}

# Query string syntax:
# AND, OR, NOT, +, -
# field:value
# range: price:[10 TO 50], date:[2024-01-01 TO 2024-12-31]
# wildcard: name:ipho*
# phrase: name:"iphone pro"
# boost: name:iphone^3

simple_query_string - Like query_string, but it never throws on invalid syntax; invalid parts are ignored (safer for user input):

bash

GET /products/_search
{
  "query": {
    "simple_query_string": {
      "query": "iphone +apple -case",
      "fields": ["name^3", "description"],
      "default_operator": "and"
    }
  }
}

# Syntax:
# + = must match (AND)
# | = should match (OR)
# - = must not match
# " " = phrase
# * = prefix
# ( ) = grouping
# ~ = fuzzy

6.6 Highlighting - Marking Matches in Results

bash

GET /products/_search
{
  "query": {
    "match": { "name": "iPhone pro" }
  },
  "highlight": {
    "fields": {
      "name": {
        "pre_tags": ["<strong>"],
        "post_tags": ["</strong>"],
        "number_of_fragments": 0,      # 0 = return full field (no fragmentation)
        "fragment_size": 150           # Fragment size (ignored when number_of_fragments is 0)
      },
      "description": {
        "pre_tags": ["<em style='color:red'>"],
        "post_tags": ["</em>"],
        "number_of_fragments": 3,      # Number of fragments to return
        "fragment_size": 200
      }
    },
    "require_field_match": false,      # false = highlight even if not in query
    "type": "unified"                  # unified | plain | fvh
  }
}

Response with highlighting:

json

{
  "hits": {
    "hits": [
      {
        "_source": {
          "name": "iPhone 15 Pro Max 256GB",
          "description": "iPhone 15 Pro Max với chip A17 Pro..."
        },
        "highlight": {
          "name": ["<strong>iPhone</strong> 15 <strong>Pro</strong> Max 256GB"],
          "description": [
            "...với chip A17 <em style='color:red'>Pro</em>...",
            "...<em style='color:red'>iPhone</em> 15 <em style='color:red'>Pro</em> Max..."
          ]
        }
      }
    ]
  }
}

Highlight types:

unified (default): the best general choice; uses BM25 to score and pick the best fragments
plain: slower, re-analyzes the text in memory; use it when unified does not work well
fvh (Fast Vector Highlighter): fast on large fields; requires term_vector: with_positions_offsets in the mapping

6.7 Pagination

From/Size (Standard Pagination)

bash

GET /products/_search
{
  "from": 0,      # Offset (page 1 = 0, page 2 = 10, ...)
  "size": 10,     # Page size
  "query": { "match_all": {} },
  "sort": [{ "price": "asc" }]
}

Limit: from + size <= 10000 (default). To override it:

bash

PUT /products/_settings
{
  "index.max_result_window": 50000
}

Overriding it is not recommended, though, because deep pagination is very memory-hungry:

from 9990, size 10 → ES has to fetch 10000 docs from each shard, sort them, and keep only the last 10
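
The cost can be estimated with simple arithmetic. The Python sketch below computes the offset for a page and the number of sort entries the coordinating node has to merge, assuming every shard returns its top from + size entries; the shard count of 5 is a hypothetical example.

```python
def from_for_page(page, size):
    """Offset of a 1-based page number."""
    return (page - 1) * size


def coordinator_entries(from_, size, num_shards):
    """Entries collected across shards before the final page is cut out."""
    return (from_ + size) * num_shards


print(from_for_page(2, 10))
print(coordinator_entries(from_for_page(1000, 10), 10, num_shards=5))
```

For page 1000 with 10 hits per page on 5 shards, 50000 entries are gathered and sorted just to return 10 documents.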

Search After (Efficient Deep Pagination)

bash

# First page
GET /products/_search
{
  "size": 10,
  "query": { "match_all": {} },
  "sort": [
    { "price": "asc" },
    { "_id": "asc" }    # Tiebreaker (must be unique); newer versions discourage sorting on _id, so a keyword copy of the ID may be preferable
  ]
}

# Response last hit:
# "_source": { "price": 5990000 }
# "_id": "prod-789"

# Next page - use the sort values of the last document
GET /products/_search
{
  "size": 10,
  "query": { "match_all": {} },
  "sort": [
    { "price": "asc" },
    { "_id": "asc" }
  ],
  "search_after": [5990000, "prod-789"]  # Values from the last document
}

Advantages of search_after:

Not subject to the 10000 limit
Memory use stays roughly constant no matter how deep the page is
However: you cannot jump to an arbitrary page; you can only step to the next page (going back requires reversing the sort)
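
The cursor mechanics can be simulated in memory. In this Python sketch the sort key is (price, id), the cursor is the sort key of the last hit, and each page contains only documents strictly after the cursor. It re-sorts a small hypothetical list on every call for clarity, which a real engine does not do.

```python
def search_after_page(docs, size, search_after=None):
    """Return one page of hits plus the cursor for the next page."""
    ordered = sorted(docs, key=lambda d: (d["price"], d["id"]))
    if search_after is not None:
        cursor = tuple(search_after)
        ordered = [d for d in ordered if (d["price"], d["id"]) > cursor]
    hits = ordered[:size]
    next_cursor = [hits[-1]["price"], hits[-1]["id"]] if hits else None
    return hits, next_cursor


def paginate_all(docs, size):
    """Walk through every page and collect the IDs in order."""
    seen, cursor = [], None
    while True:
        hits, cursor = search_after_page(docs, size, cursor)
        if not hits:
            return seen
        seen.extend(hit["id"] for hit in hits)


# Hypothetical products; three share the same price, so the id tiebreaker matters
docs = [
    {"id": "prod-1", "price": 5990000},
    {"id": "prod-2", "price": 5990000},
    {"id": "prod-3", "price": 4990000},
    {"id": "prod-4", "price": 7990000},
    {"id": "prod-5", "price": 5990000},
]
all_ids = paginate_all(docs, size=2)
print(all_ids)
```

Without the unique tiebreaker, documents sharing the price at a page boundary could be skipped or repeated.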

Point in Time (PIT) - Consistent Pagination

When data changes continuously, search_after may return inconsistent results. A PIT creates a "snapshot" of the index state:

bash

# Create a PIT
POST /products/_pit?keep_alive=5m

# Response:
{ "id": "46ToAwMDaWR5BXV1..." }

# Search with the PIT
GET /_search
{
  "size": 10,
  "query": { "match_all": {} },
  "sort": [
    { "price": "asc" },
    { "_shard_doc": "asc" }  # Tiebreaker available with PIT (added implicitly if omitted)
  ],
  "pit": {
    "id": "46ToAwMDaWR5BXV1...",
    "keep_alive": "5m"
  }
}

# Next page
GET /_search
{
  "size": 10,
  "sort": [...],
  "search_after": [5990000, 12345],
  "pit": {
    "id": "46ToAwMDaWR5BXV1...",
    "keep_alive": "5m"
  }
}

# Delete the PIT when done
DELETE /_pit
{
  "id": "46ToAwMDaWR5BXV1..."
}

6.8 Sorting

Basic Sort

bash

GET /products/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "price": "asc" },       # Price, low to high
    { "rating": "desc" },     # Rating, high to low
    "_score"                  # Then by relevance
  ]
}

Sorting with Missing Values

bash

GET /products/_search
{
  "sort": [
    {
      "discount_percentage": {
        "order": "desc",
        "missing": "_last"   # _first | _last | custom_value
      }
    }
  ]
}

Sorting by a Field in a Nested Object

bash

GET /hotels/_search
{
  "sort": [
    {
      "rooms.price": {
        "order": "asc",
        "nested": {
          "path": "rooms",
          "filter": {
            "term": { "rooms.type": "double" }
          }
        },
        "mode": "min"   # min | max | sum | avg | median
      }
    }
  ]
}

6.9 Source Filtering

bash

GET /products/_search
{
  "query": { "match_all": {} },

  # Use exactly ONE of the three "_source" forms below per request.

  # Option 1: return only selected fields
  "_source": ["name", "price", "brand"]

  # Option 2: exclude fields
  # "_source": { "excludes": ["description", "images"] }

  # Option 3: include and exclude
  # "_source": {
  #   "includes": ["name", "specs.*"],
  #   "excludes": ["specs.internal"]
  # }
}

`fields` vs `_source`

Since ES 7.10, the fields option can return values as the mapping sees them (for example, formatted dates) instead of the raw _source:

bash

GET /products/_search
{
  "_source": false,
  "fields": [
    "name",
    "price",
    { "field": "created_at", "format": "dd/MM/yyyy" }  # Custom format
  ]
}

6.10 Script Score - Custom Scoring

bash

GET /products/_search
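{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          # A filter-only bool query scores every hit 0, which would zero out
          # the product below, so a scoring clause (an example match) goes in "must".
          "must": [
            { "match": { "name": "phone" } }
          ],
          "filter": [
            { "term": { "in_stock": true } }
          ]
        }
      },
      "script": {
        "source": """
          // Custom score: combine several signals
          double base = _score;
          double rating = doc['rating'].value;
          // Note: evaluates to 0 for products with no reviews
          double reviewBoost = Math.log(1 + doc['review_count'].value);
          double freshness = 1.0;
          
          // Boost newer products; "now" is passed in so every shard uses the same value
          long created = doc['created_at'].value.toInstant().toEpochMilli();
          long daysOld = (params.now_ms - created) / (1000L * 60 * 60 * 24);
          if (daysOld < 30) freshness = 1.5;
          else if (daysOld < 90) freshness = 1.2;
          
          return base * rating * reviewBoost * freshness;
        """,
        "params": { "now_ms": 1735689600000 },   # Example value (2025-01-01 UTC); send the current epoch millis
        "lang": "painless"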
      }
    }
  }
}

6.11 Hands-On: Building a Search API

Use Case: E-commerce Product Search API

bash

# Complete query for a product search page
GET /products/_search
{
  "from": 0,
  "size": 20,
  
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "điện thoại samsung 5G",
            "fields": ["name^5", "description^1", "brand^3", "tags^2"],
            "type": "best_fields",
            "fuzziness": "AUTO",
            "minimum_should_match": "60%"
          }
        }
      ],
      "filter": [
        { "term": { "in_stock": true } },
        { "term": { "status": "active" } },
        {
          "range": {
            "price": { "gte": 5000000, "lte": 35000000 }
          }
        }
      ],
      "should": [
        { "term": { "tags": "5G" } },
        { "term": { "tags": "flagship" } },
        {
          "range": { "rating": { "gte": 4.5 } }
        }
      ],
      "minimum_should_match": 0
    }
  },
  
  "sort": [
    { "_score": "desc" },
    { "rating": "desc" },
    { "review_count": "desc" }
  ],
  
  "highlight": {
    "fields": {
      "name": { "pre_tags": ["<mark>"], "post_tags": ["</mark>"] },
      "description": {
        "pre_tags": ["<mark>"],
        "post_tags": ["</mark>"],
        "number_of_fragments": 2,
        "fragment_size": 150
      }
    }
  },
  
  "aggs": {
    "brands": {
      "terms": { "field": "brand", "size": 20 }
    },
    "price_stats": {
      "stats": { "field": "price" }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "key": "Dưới 5 triệu", "to": 5000000 },
          { "key": "5-10 triệu", "from": 5000000, "to": 10000000 },
          { "key": "10-20 triệu", "from": 10000000, "to": 20000000 },
          { "key": "20-30 triệu", "from": 20000000, "to": 30000000 },
          { "key": "Trên 30 triệu", "from": 30000000 }
        ]
      }
    },
    "avg_rating": {
      "avg": { "field": "rating" }
    }
  },
  
  "_source": ["product_id", "name", "brand", "price", "original_price", 
              "discount_percentage", "rating", "review_count", "in_stock",
              "tags", "images"]
}

Chapter 6 Summary

Choosing the Right Query

Situation	Query to use
Full-text search	`match`, `multi_match`
Exact match	`term`, `terms`
Value ranges	`range`
Checking that a field exists	`exists`
Combining multiple conditions	`bool`
Typo tolerance	`match` with `fuzziness`
Phrase search	`match_phrase`
Pattern matching	`wildcard` (watch performance)
Match everything	`match_all`

Filter vs Query

	Filter	Query
Score	Not computed	Computed
Cache	Yes (faster)	No
Use when	A condition must hold	Ranking is needed
Example	status=active, price range	full-text search

Next Steps

→ Chapter 7: Advanced Query DSL - Function score, nested queries, percolator, and other advanced techniques

Chapter 7: Advanced Query DSL

7.1 Function Score Query - Customizing Relevance Scoring

Function Score Query lets you take full control of how the relevance score is computed. It is one of the most flexible tools for controlling the order of results.

Basic Structure

bash

GET /products/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "name": "smartphone" }   # Base query
      },
      "functions": [
        {
          # Function 1
        },
        {
          # Function 2
        }
      ],
      "score_mode": "multiply",   # How the function scores are combined
      "boost_mode": "multiply",   # How the result is combined with the query score
      "min_score": 0.5,           # Drop docs with a score < 0.5
      "boost": 1.0
    }
  }
}

score_mode - how the scores from multiple functions are combined:

multiply (default): Multiply all function scores
sum: Add them up
avg: Average
first: Score of the first function whose filter matches
max: Largest value
min: Smallest value

boost_mode - how the combined function score is merged with the query score:

multiply (default): query_score * function_score
replace: Use function_score and ignore query_score
sum: Add the two
avg: Average
max/min: Take the max/min
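
The two-stage combination described above can be written out as plain arithmetic. This is a simplified model rather than Elasticsearch's implementation: `final_score` is a hypothetical helper, and the `avg` score_mode here is a plain mean, whereas Elasticsearch weights the average by each function's weight.

```python
from functools import reduce
from statistics import mean

# Stage 1 (score_mode): collapse the list of function scores into one number.
SCORE_MODES = {
    "multiply": lambda scores: reduce(lambda a, b: a * b, scores, 1.0),
    "sum": sum,
    "avg": mean,  # simplification: the real avg is weighted by function weights
    "first": lambda scores: scores[0],
    "max": max,
    "min": min,
}

# Stage 2 (boost_mode): merge that number with the query's own score.
BOOST_MODES = {
    "multiply": lambda q, f: q * f,
    "replace": lambda q, f: f,
    "sum": lambda q, f: q + f,
    "avg": lambda q, f: (q + f) / 2,
    "max": max,
    "min": min,
}


def final_score(query_score, function_scores, score_mode="multiply", boost_mode="multiply"):
    """function_scores holds the scores of the functions whose filter matched."""
    function_score = SCORE_MODES[score_mode](function_scores)
    return BOOST_MODES[boost_mode](query_score, function_score)


print(final_score(1.2, [2.0, 1.5]))                     # multiply, then multiply
print(final_score(1.2, [2.0, 1.5], "sum", "replace"))   # sum, then replace
```

Working through a candidate configuration this way before deploying it makes it easier to see which signal dominates the final ranking.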

Function 1: `weight` - Simple Boost

bash

GET /products/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "name": "phone" } },
      "functions": [
        {
          "filter": { "term": { "brand": "Apple" } },
          "weight": 2.0         # Apple products get score * 2
        },
        {
          "filter": { "term": { "tags": "sale" } },
          "weight": 1.5         # Products on sale get a 1.5x boost
        },
        {
          "filter": {
            "range": { "rating": { "gte": 4.5 } }
          },
          "weight": 1.3         # Highly rated products get a 1.3x boost
        }
      ],
      "score_mode": "multiply"
    }
  }
}

Function 2: `field_value_factor` - Boost by Field Value

bash

GET /products/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "description": "smartphone" } },
      "field_value_factor": {
        "field": "popularity_score",    # Field that holds the value
        "factor": 1.2,                  # Multiply the value by this factor
        "modifier": "log1p",            # Apply a math function
        "missing": 1                    # Value used when the field is missing
      },
      "boost_mode": "multiply"
    }
  }
}

modifier options (log is the base-10 logarithm, ln is the natural logarithm):

Modifier	Formula
`none`	`field_value * factor`
`log`	`log(field_value * factor)`
`log1p`	`log(1 + field_value * factor)`
`log2p`	`log(2 + field_value * factor)`
`ln`	`ln(field_value * factor)`
`ln1p`	`ln(1 + field_value * factor)`
`square`	`(field_value * factor)^2`
`sqrt`	`sqrt(field_value * factor)`
`reciprocal`	`1 / (field_value * factor)`
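
To get a feel for how each modifier reshapes a value, the table can be transcribed into Python. This is a sketch of the formulas only, assuming (as the table does) that `factor` is applied before the modifier and that `missing` stands in for the field value; `field_value_factor` here is a local helper, not a client API.

```python
import math

# One entry per row of the modifier table; the argument is field_value * factor.
MODIFIERS = {
    "none": lambda v: v,
    "log": math.log10,
    "log1p": lambda v: math.log10(1 + v),
    "log2p": lambda v: math.log10(2 + v),
    "ln": math.log,
    "ln1p": math.log1p,
    "square": lambda v: v * v,
    "sqrt": math.sqrt,
    "reciprocal": lambda v: 1 / v,
}


def field_value_factor(value, factor=1.0, modifier="none", missing=None):
    """Return modifier(value * factor); `missing` replaces an absent value."""
    if value is None:
        if missing is None:
            raise ValueError("field has no value and no 'missing' default was given")
        value = missing
    return MODIFIERS[modifier](value * factor)


print(field_value_factor(9, modifier="log1p"))                          # log10(1 + 9)
print(field_value_factor(None, factor=1.2, modifier="log1p", missing=1))
print(field_value_factor(4.0, factor=0.5, modifier="square"))
```

Plain `log` and `ln` go negative for inputs between 0 and 1 and are undefined at 0, which is why the `1p` and `2p` variants are the safer choice for counts that can be zero.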

In practice: boost by review count and rating:

bash

GET /products/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "name": "laptop" }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "review_count",
            "modifier": "log1p",
            "factor": 0.1,
            "missing": 0
          }
        },
        {
          "field_value_factor": {
            "field": "rating",
            "modifier": "square",
            "factor": 0.5,
            "missing": 3.0
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}

Function 3: `gauss/linear/exp` - Decay Functions

Decay functions lower the score as a field's value moves away from an origin point. They are useful for:

Boosting products by recency (newer = higher score)
Boosting locations close to the user
Boosting products close to a desired price

Gauss Decay - smooth, bell-shaped falloff

bash

GET /products/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "gauss": {
        "created_at": {
          "origin": "now",       # Reference point
          "scale": "30d",        # Score falls to "decay" at offset + scale (37 days from origin)
          "offset": "7d",        # No decay during the first 7 days (score = 1)
          "decay": 0.5           # Score at distance offset + scale
        }
      }
    }
  }
}

Linear Decay - linear falloff

bash

GET /stores/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "linear": {
        "location": {
          "origin": "21.0278,105.8342",  # User's location ("lat,lon")
          "scale": "10km",               # Score = 0.5 at offset + scale (11km)
          "offset": "1km",               # Score = 1 within 1km
          "decay": 0.5
        }
      }
    }
  }
}

Exp Decay - exponential falloff (steep drop-off)

bash

GET /flights/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "destination": "Hanoi" }
      },
      "exp": {
        "price": {
          "origin": 3000000,    # Target price
          "scale": 500000,      # Score = 0.5 when the price is 500k VND away from the target
          "decay": 0.5
        }
      }
    }
  }
}
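
The three curves differ only in their formula, so they are easy to compare side by side. The sketch below follows the decay formulas from the Elasticsearch function_score reference as best understood here; the function names are local to this example, and distances are plain numbers (days, km, VND) rather than the date or geo units Elasticsearch parses.

```python
import math


def _distance(value, origin, offset):
    """Distance from origin, ignoring everything inside the offset zone."""
    return max(0.0, abs(value - origin) - offset)


def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-_distance(value, origin, offset) ** 2 / (2.0 * sigma_sq))


def exp_decay(value, origin, scale, offset=0.0, decay=0.5):
    lam = math.log(decay) / scale
    return math.exp(lam * _distance(value, origin, offset))


def linear_decay(value, origin, scale, offset=0.0, decay=0.5):
    s = scale / (1.0 - decay)
    return max(0.0, (s - _distance(value, origin, offset)) / s)


# Product age in days, mirroring the gauss example: scale 30, offset 7.
for age in (5, 37, 90):
    print(age, round(gauss_decay(age, origin=0, scale=30, offset=7), 3))
```

All three return 1 inside the offset zone and exactly `decay` at a distance of `offset + scale`; they differ in how quickly the score falls off on either side of that point.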

Function 4: `random_score` - Consistent Random Results

bash

GET /recommendations/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": { "term": { "category": "electronics" } }
        }
      },
      "random_score": {
        "seed": 12345,           # Same seed = same order
        "field": "_seq_no"       # Field used as the source of randomness
      }
    }
  }
}

Use case: shuffle products while keeping the order stable within one session (derive the seed from the session ID).
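
Session IDs are usually strings, while the seed is a number, so some stable mapping is needed. One possible approach, sketched below, hashes the session ID; `seed_from_session` and `random_score_body` are hypothetical helpers, and the 31-bit range is a conservative assumption rather than a documented requirement.

```python
import hashlib


def seed_from_session(session_id: str) -> int:
    """Map a session ID to a stable, non-negative integer seed."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF  # keep 31 bits


def random_score_body(session_id: str, category: str) -> dict:
    """Build the request body from the example above with a per-session seed."""
    return {
        "query": {
            "function_score": {
                "query": {"bool": {"filter": {"term": {"category": category}}}},
                "random_score": {"seed": seed_from_session(session_id), "field": "_seq_no"},
            }
        }
    }


body = random_score_body("session-abc-123", "electronics")
print(body["query"]["function_score"]["random_score"])
```

A cryptographic hash is used instead of Python's built-in `hash()` because string hashing in Python is randomized per process, which would change the seed, and therefore the order, after every application restart.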

Function 5: `script_score`

bash

GET /products/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "name": "phone" } },
      "script_score": {
        "script": {
          "source": """
            double score = _score;
            
            // Boost trending products
            if (doc['is_trending'].value) {
              score *= 1.5;
            }
            
            // Boost high-margin products
            double margin = (doc['price'].value - doc['cost'].value) / doc['price'].value;
            score *= (1 + margin * 0.5);
            
            // Freshness boost
            long ageInDays = (System.currentTimeMillis() - doc['created_at'].value.toEpochMilli()) 
                              / (1000L * 60 * 60 * 24);
            double freshnessBoost = Math.max(0.5, 1.0 - ageInDays / 365.0);
            score *= freshnessBoost;
            
            return score;
          """
        }
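      },
      "boost_mode": "replace"   # The script already multiplies by _score; the default "multiply" would apply it twice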
    }
  }
}

Real-World Example: E-commerce Search Ranking

bash

GET /products/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "multi_match": {
                "query": "samsung điện thoại",
                "fields": ["name^3", "description"]
              }
            }
          ],
          "filter": [
            { "term": { "in_stock": true } },
            { "term": { "status": "active" } }
          ]
        }
      },
      "functions": [
        {
          "filter": { "term": { "tags": "sponsored" } },
          "weight": 3.0
        },
        {
          "field_value_factor": {
            "field": "rating",
            "modifier": "square",
            "factor": 0.4,
            "missing": 3.0
          }
        },
        {
          "field_value_factor": {
            "field": "review_count",
            "modifier": "log1p",
            "factor": 0.15,
            "missing": 0
          }
        },
        {
          "gauss": {
            "created_at": {
              "origin": "now",
              "scale": "60d",
              "offset": "14d",
              "decay": 0.5
            }
          }
        },
        {
          "filter": {
            "range": {
              "discount_percentage": { "gte": 10 }
            }
          },
          "weight": 1.2
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}

7.2 Nested Queries

When to Use Nested?

When a document holds an array of objects and you need to query on a combination of fields within the same object:

bash

# Data:
{
  "product": "Laptop Dell",
  "variants": [
    { "color": "black", "storage": "512GB", "price": 25000000 },
    { "color": "silver", "storage": "1TB", "price": 30000000 }
  ]
}

# Problem with a plain object mapping:
# A search for "black AND 1TB" matches incorrectly because ES flattens the array
# "black" comes from variant 1 and "1TB" from variant 2 → wrong!
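
The false match is easy to reproduce outside Elasticsearch. The sketch below imitates the two behaviours in memory; `flatten`, `matches_flattened`, and `matches_nested` are illustrative helpers, not Elasticsearch APIs.

```python
doc = {
    "product": "Laptop Dell",
    "variants": [
        {"color": "black", "storage": "512GB", "price": 25000000},
        {"color": "silver", "storage": "1TB", "price": 30000000},
    ],
}


def flatten(document: dict, path: str) -> dict:
    """Imitate the default object mapping: one multi-valued field per sub-field."""
    flat = {}
    for obj in document[path]:
        for key, value in obj.items():
            flat.setdefault(f"{path}.{key}", set()).add(value)
    return flat


def matches_flattened(document: dict, path: str, wanted: dict) -> bool:
    """Each condition is checked against the pooled values of all objects."""
    flat = flatten(document, path)
    return all(value in flat.get(f"{path}.{key}", set()) for key, value in wanted.items())


def matches_nested(document: dict, path: str, wanted: dict) -> bool:
    """All conditions must hold inside the same object."""
    return any(all(obj.get(k) == v for k, v in wanted.items()) for obj in document[path])


query = {"color": "black", "storage": "1TB"}
print(matches_flattened(doc, "variants", query))   # the false positive
print(matches_nested(doc, "variants", query))      # what the user actually meant
```

Once the values are pooled per field, the information about which color belongs to which storage size is gone; the nested type keeps each object as its own hidden document to preserve it.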

Basic Nested Query

bash

# Mapping:
PUT /products-with-variants
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "variants": {
        "type": "nested",
        "properties": {
          "color": { "type": "keyword" },
          "storage": { "type": "keyword" },
          "price": { "type": "double" }
        }
      }
    }
  }
}

# Query: find products that have a variant with color black AND storage 1TB
GET /products-with-variants/_search
{
  "query": {
    "nested": {
      "path": "variants",
      "query": {
        "bool": {
          "must": [
            { "term": { "variants.color": "black" } },
            { "term": { "variants.storage": "1TB" } }
          ]
        }
      },
      "score_mode": "max",     # Highest score among the matching nested docs
      "inner_hits": {
        "name": "matching_variants",   # Return the nested docs that matched
        "size": 3,
        "highlight": {
          "fields": { "variants.color": {} }
        }
      }
    }
  }
}

Nested Aggregations

bash

GET /products-with-variants/_search
{
  "aggs": {
    "variants_agg": {
      "nested": {
        "path": "variants"
      },
      "aggs": {
        "by_color": {
          "terms": { "field": "variants.color" }
        },
        "price_stats": {
          "stats": { "field": "variants.price" }
        },
        "by_storage": {
          "terms": { "field": "variants.storage" },
          "aggs": {
            "avg_price_per_storage": {
              "avg": { "field": "variants.price" }
            }
          }
        }
      }
    }
  }
}

Reverse Nested Aggregation

bash

GET /products-with-variants/_search
{
  "aggs": {
    "colors": {
      "nested": { "path": "variants" },
      "aggs": {
        "color_terms": {
          "terms": { "field": "variants.color" },
          "aggs": {
            "products_count": {
              "reverse_nested": {}   # Step back up to the parent document
            }
          }
        }
      }
    }
  }
}

7.3 Parent-Child Relationships

Parent-child lets you define relationships between documents in the same index without denormalizing.

When to Use Parent-Child vs Nested?

	Nested	Parent-Child
Storage	Embedded in the parent	Separate documents
Update	Reindexes the entire parent	Independent
Query performance	Faster	Slower (join)
Use when	Data rarely changes, modest number of children	Frequent updates, large number of children

Setting Up Parent-Child

bash

PUT /company
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
          "department": "employee",  # department is the parent of employee
          "employee": "task"         # employee is the parent of task (multi-level)
        }
      },
      "name": { "type": "keyword" },
      "description": { "type": "text" }
    }
  }
}

# Index parent (department)
PUT /company/_doc/dept-1
{
  "name": "Engineering",
  "description": "Software engineering department",
  "my_join_field": {
    "name": "department"
  }
}

# Index child (employee): routing must be set to the parent's ID
PUT /company/_doc/emp-1?routing=dept-1
{
  "name": "Nguyễn Văn A",
  "email": "nguyenvana@company.com",
  "salary": 30000000,
  "my_join_field": {
    "name": "employee",
    "parent": "dept-1"
  }
}

PUT /company/_doc/emp-2?routing=dept-1
{
  "name": "Trần Thị B",
  "email": "tranthib@company.com",
  "salary": 35000000,
  "my_join_field": {
    "name": "employee",
    "parent": "dept-1"
  }
}

# Index grandchild (task): "parent" is the direct parent (emp-1),
# but routing must be the top-level ancestor (dept-1) so the whole family lives on one shard
PUT /company/_doc/task-1?routing=dept-1
{
  "title": "Build search UI",
  "status": "in_progress",
  "my_join_field": {
    "name": "task",
    "parent": "emp-1"
  }
}
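
Picking the routing value by hand gets error-prone once there is more than one level. A small helper can derive it by walking up the parent chain, as sketched below; `PARENT_OF` and `routing_for` are hypothetical names, and the mapping of child to parent would come from your own data model.

```python
# child id -> direct parent id, matching the documents indexed above
PARENT_OF = {"emp-1": "dept-1", "emp-2": "dept-1", "task-1": "emp-1"}


def routing_for(doc_id: str, parent_of: dict) -> str:
    """Return the id of the top-level ancestor, which is the routing value for the whole family."""
    current, seen = doc_id, set()
    while current in parent_of:
        if current in seen:
            raise ValueError(f"cycle detected in parent chain at {current!r}")
        seen.add(current)
        current = parent_of[current]
    return current


def index_path(doc_id: str, parent_of: dict) -> str:
    """Build the request path used in the examples above."""
    return f"/company/_doc/{doc_id}?routing={routing_for(doc_id, parent_of)}"


for doc_id in ("dept-1", "emp-1", "task-1"):
    print(index_path(doc_id, PARENT_OF))
```

For a top-level document the helper returns its own ID, which has the same effect as the default routing by `_id`.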

Parent-Child Queries

`has_child` - Find Parents with Matching Children

bash

# Find departments that have an employee with salary > 30M
GET /company/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "range": { "salary": { "gt": 30000000 } }
      },
      "min_children": 1,      # At least 1 matching child
      "max_children": 10,     # At most 10 matching children
      "score_mode": "max",    # none | avg | sum | max | min
      "inner_hits": {
        "size": 5
      }
    }
  }
}

`has_parent` - Find Children with a Matching Parent

bash

# Find employees that belong to the "Engineering" department
GET /company/_search
{
  "query": {
    "has_parent": {
      "parent_type": "department",
      "query": {
        "term": { "name": "Engineering" }
      },
      "score": true   # Use the parent's score for the child hits
    }
  }
}

`parent_id` - Find Children of a Specific Parent ID

bash

GET /company/_search
{
  "query": {
    "parent_id": {
      "type": "employee",
      "id": "dept-1"     # Return all employees of dept-1
    }
  }
}

7.4 Percolate Query

The percolator inverts the usual flow: instead of finding documents that match a query, you find stored queries that match a document.

Use cases:

Notifications: a new product matches a user's saved filter
Alerting: a log event matches an alert rule
Content classification: tag content automatically

Setting Up the Percolator

bash

# Create the index that stores the queries (percolator index)
PUT /product-alerts
{
  "mappings": {
    "properties": {
      "query": {
        "type": "percolator"    # Field that stores the query
      },
      # Mirror the mapping of the index whose documents will be matched
      "name": { "type": "text" },
      "brand": { "type": "keyword" },
      "price": { "type": "double" },
      "category": { "type": "keyword" },
      "tags": { "type": "keyword" }
    }
  }
}

# Register user alerts as stored queries
PUT /product-alerts/_doc/alert-user1-iphone
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "iPhone" } }
      ],
      "filter": [
        { "range": { "price": { "lte": 30000000 } } }
      ]
    }
  },
  "user_id": "user_001",
  "alert_name": "iPhone under 30M"
}

PUT /product-alerts/_doc/alert-user2-samsung
{
  "query": {
    "bool": {
      "must": [
        { "term": { "brand": "Samsung" } },
        { "range": { "price": { "lte": 25000000 } } }
      ]
    }
  },
  "user_id": "user_002",
  "alert_name": "Samsung cheap phone"
}

Using the Percolate Query

bash

# When a new product is created, check which alerts match
GET /product-alerts/_search
{
  "query": {
    "percolate": {
      "field": "query",          # Field holding the stored queries
      "document": {              # The newly created document
        "name": "iPhone 15 Pro Max 256GB",
        "brand": "Apple",
        "price": 34990000,
        "category": "smartphones",
        "tags": ["flagship", "5G"]
      }
    }
  }
}

# Response: hits contains every stored query that matches the document.
# For this document nothing matches, so hits is empty:
# - alert-user1-iphone requires price <= 30M, but the price is 34.99M
# - alert-user2-samsung requires brand = Samsung
{
  "hits": {
    "total": { "value": 0, "relation": "eq" },
    "hits": []
  }
}

Percolating Multiple Documents at Once

bash

GET /product-alerts/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "documents": [
        {
          "name": "Samsung Galaxy S24",
          "brand": "Samsung",
          "price": 24990000
        },
        {
          "name": "iPhone 16",
          "brand": "Apple",
          "price": 28000000
        }
      ]
    }
  }
}
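
To check which alerts the documents above should trigger, the reverse-matching idea can be simulated in memory. This is only a model of the concept: the two predicates below are hand-written approximations of the stored queries (the `match` on name is reduced to a case-insensitive substring test), and `percolate` is a local helper.

```python
from typing import Callable, Dict, List

# Stored "queries", keyed by the alert IDs registered earlier.
ALERTS: Dict[str, Callable[[dict], bool]] = {
    "alert-user1-iphone": lambda d: "iphone" in d["name"].lower() and d["price"] <= 30_000_000,
    "alert-user2-samsung": lambda d: d["brand"] == "Samsung" and d["price"] <= 25_000_000,
}


def percolate(document: dict, alerts: Dict[str, Callable[[dict], bool]]) -> List[str]:
    """Return the IDs of all stored queries that match the document."""
    return [alert_id for alert_id, matches in alerts.items() if matches(document)]


documents = [
    {"name": "iPhone 15 Pro Max 256GB", "brand": "Apple", "price": 34_990_000},
    {"name": "Samsung Galaxy S24", "brand": "Samsung", "price": 24_990_000},
    {"name": "iPhone 16", "brand": "Apple", "price": 28_000_000},
]
for document in documents:
    print(document["name"], "->", percolate(document, ALERTS))
```

Unlike this linear scan, the real percolator indexes terms extracted from the stored queries so that most non-matching queries are skipped without being evaluated.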

7.5 Geo Queries

`geo_distance` - Within a Radius

bash

# Setup mapping
PUT /restaurants
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "location": { "type": "geo_point" },
      "rating": { "type": "float" }
    }
  }
}

# Index restaurants
POST /restaurants/_bulk
{"index": {"_id": "r1"}}
{"name": "Phở Hà Nội", "location": {"lat": 21.028, "lon": 105.834 }, "rating": 4.5}
{"index": {"_id": "r2"}}
{"name": "Bún Bò Huế", "location": {"lat": 21.032, "lon": 105.838 }, "rating": 4.3}
{"index": {"_id": "r3"}}
{"name": "Pizza Hut", "location": {"lat": 21.006, "lon": 105.852 }, "rating": 3.8}

# Find restaurants within a 5km radius
GET /restaurants/_search
{
  "query": {
    "geo_distance": {
      "distance": "5km",
      "distance_type": "arc",   # arc (accurate) | plane (approximate, faster)
      "location": {
        "lat": 21.028,
        "lon": 105.834
      }
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": { "lat": 21.028, "lon": 105.834 },
        "order": "asc",
        "unit": "km",
        "distance_type": "arc"
      }
    }
  ],
  "script_fields": {
    "distance_km": {
      "script": {
        "source": "doc['location'].arcDistance(params.lat, params.lon) / 1000",
        "params": { "lat": 21.028, "lon": 105.834 }
      }
    }
  }
}
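
To sanity-check the distances a query like this should produce, the great-circle distance can be computed by hand. The sketch uses the haversine formula with a mean Earth radius, which is an assumption of this example; Elasticsearch's `arc` calculation may differ slightly in its constants.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed for this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# The three restaurants indexed above, and the search origin.
RESTAURANTS = {"r1": (21.028, 105.834), "r2": (21.032, 105.838), "r3": (21.006, 105.852)}
ORIGIN = (21.028, 105.834)


def within(radius_km):
    """Return (id, distance) pairs inside the radius, nearest first."""
    distances = {rid: haversine_km(*ORIGIN, *loc) for rid, loc in RESTAURANTS.items()}
    nearby = [(rid, d) for rid, d in distances.items() if d <= radius_km]
    return sorted(nearby, key=lambda pair: pair[1])


for rid, distance in within(5):
    print(rid, round(distance, 2), "km")
```

All three restaurants fall inside the 5km radius, with r1 sitting exactly on the origin.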

`geo_bounding_box` - Within a Rectangle

bash

GET /restaurants/_search
{
  "query": {
    "geo_bounding_box": {
      "location": {
        "top_left": { "lat": 21.05, "lon": 105.80 },
        "bottom_right": { "lat": 21.00, "lon": 105.90 }
      }
    }
  }
}

`geo_polygon` - Within a Polygon

bash

# Note: geo_polygon is deprecated since 7.12; geo_shape with a polygon is the replacement
GET /properties/_search
{
  "query": {
    "geo_polygon": {
      "location": {
        "points": [
          { "lat": 21.05, "lon": 105.80 },
          { "lat": 21.05, "lon": 105.90 },
          { "lat": 21.00, "lon": 105.90 },
          { "lat": 21.00, "lon": 105.80 }
        ]
      }
    }
  }
}

`geo_shape` - Querying with Shapes

bash

GET /districts/_search
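{
  "query": {
    "geo_shape": {
      "area": {
        "shape": {
          # Point-radius circles are not a GeoJSON type, so this example uses
          # an envelope: [[min_lon, max_lat], [max_lon, min_lat]]
          "type": "envelope",
          "coordinates": [[105.80, 21.05], [105.90, 21.00]]
        },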
        "relation": "intersects"  # intersects | within | contains | disjoint
      }
    }
  }
}

7.6 More Like This (MLT) Query

Finds documents similar to a given document or piece of text:

bash

# Find articles similar to the article with ID "article-123"
GET /articles/_search
{
  "query": {
    "more_like_this": {
      "fields": ["title", "content"],
      "like": [
        {
          "_index": "articles",
          "_id": "article-123"
        }
      ],
      "min_term_freq": 1,       # Term must occur at least once in the source document
      "max_query_terms": 25,    # Number of terms in the generated query
      "min_doc_freq": 1,        # Term must occur in at least N docs
      "minimum_should_match": "20%"
    }
  }
}

# Find documents similar to a given text
GET /articles/_search
{
  "query": {
    "more_like_this": {
      "fields": ["title", "content"],
      "like": "Elasticsearch search engine distributed",
      "min_term_freq": 1,
      "max_query_terms": 20
    }
  }
}

7.7 Span Queries

Span queries let you query on the positions of terms within a document (useful for legal/academic search):

bash

# Find "elasticsearch" and "distributed" within 5 positions of each other
GET /articles/_search
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "content": "elasticsearch" } },
        { "span_term": { "content": "distributed" } }
      ],
      "slop": 5,          # At most 5 intervening positions
      "in_order": false   # Whether the terms must appear in this order
    }
  }
}

7.8 Knn (k-Nearest Neighbor) - Vector Search

ES 8.x supports vector search natively, which matters for AI/ML applications:

bash

# Mapping with dense_vector
PUT /articles-semantic
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { "type": "text" },
      "embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

# Index a document with a vector (embedding from a sentence-transformers model)
PUT /articles-semantic/_doc/1
{
  "title": "Tìm kiếm ngữ nghĩa với Elasticsearch",
  "content": "Vector search là tương lai...",
  "embedding": [0.1, -0.3, 0.7, 0.2, ...]  # 384 dimensions
}

# kNN search - Semantic search
GET /articles-semantic/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [-0.1, 0.3, 0.65, ...],  # Embedding of the search query
    "k": 10,                                   # Top K results
    "num_candidates": 100                      # Candidates examined per shard
  },
  "fields": ["title", "content"],
  "_source": false
}

# Hybrid search: kNN + BM25
GET /articles-semantic/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [...],
    "k": 5,
    "num_candidates": 50,
    "boost": 0.7
  },
  "query": {
    "match": {
      "content": {
        "query": "elasticsearch distributed search",
        "boost": 0.3
      }
    }
  }
}
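
How the two boosts interact is easier to see with numbers. The sketch below assumes the scoring described in the Elasticsearch kNN documentation: with `cosine` similarity a vector match scores `(1 + cosine) / 2`, and a document found by both the kNN and the query clause gets the boost-weighted sum of the two scores. The helper names are local to this example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_score(query_vector, doc_vector):
    """Map cosine similarity from [-1, 1] to a score in [0, 1]."""
    return (1 + cosine(query_vector, doc_vector)) / 2


def hybrid_score(knn, bm25, knn_boost=0.7, query_boost=0.3):
    """Score of a document matched by both clauses, using the boosts from the request above."""
    return knn_boost * knn + query_boost * bm25


semantic = knn_score([1.0, 0.0], [1.0, 0.0])   # identical direction
print(semantic)
print(hybrid_score(semantic, bm25=2.0))
```

BM25 scores are unbounded while the cosine-based score stays within 0 to 1, so boosts of 0.7 and 0.3 do not translate into a 70/30 split of influence; the lexical part can still dominate when its raw score is high.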

Real-world Vector Search Pipeline

python

from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch

model = SentenceTransformer('all-MiniLM-L6-v2')
es = Elasticsearch("http://localhost:9200")

# Encode documents at indexing time
def index_article(article):
    embedding = model.encode(article['content']).tolist()
    es.index(
        index='articles-semantic',
        document={
            **article,
            'embedding': embedding
        }
    )

# Encode the query at search time
def semantic_search(query_text, size=10):
    query_embedding = model.encode(query_text).tolist()
    
    result = es.search(
        index='articles-semantic',
        body={
            "knn": {
                "field": "embedding",
                "query_vector": query_embedding,
                "k": size,
                "num_candidates": size * 10
            },
            "fields": ["title"],
            "_source": {"excludes": ["embedding"]}
        }
    )
    
    return result['hits']['hits']

# Usage:
results = semantic_search("cách tối ưu hóa elasticsearch performance")

7.9 Advanced Highlighting

Fast Vector Highlighter (FVH)

FVH needs a special mapping, but on large fields it is faster than the default highlighter:

bash

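# Mapping
PUT /articles
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "term_vector": "with_positions_offsets",  # Required for FVH
        "fields": {
          "english": {                            # Sub-field used by matched_fields in the query
            "type": "text",
            "analyzer": "english",
            "term_vector": "with_positions_offsets"
          }
        }
      }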
    }
  }
}

# Query with FVH
GET /articles/_search
{
  "query": { "match": { "content": "elasticsearch" } },
  "highlight": {
    "type": "fvh",
    "fields": {
      "content": {
        "fragment_size": 150,
        "number_of_fragments": 3,
        "order": "score",         # score | none
        "boundary_max_scan": 20,  # Max characters to scan for a boundary
        "boundary_chars": ".,!?\n",
        "matched_fields": ["content", "content.english"]  # Multi-field highlighting
      }
    }
  }
}

7.10 Collapse - Deduplication

Groups results by a field and returns only the top document of each group; inner_hits can return additional documents per group:

bash

# Search products, showing only the best-matching product per brand
GET /products/_search
{
  "query": { "match": { "name": "phone" } },
  "collapse": {
    "field": "brand",            # Collapse by brand
    "inner_hits": {
      "name": "all_brands",      # Also return the 2 cheapest matching products per brand
      "size": 2,
      "sort": [{ "price": "asc" }]
    },
    "max_concurrent_group_searches": 4
  },
  "sort": ["_score"]
}
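
What collapse does to a result list can be imitated in memory. This is a conceptual sketch only: `collapse` is a local helper, the hits are made-up sample data, and they are assumed to arrive already sorted by relevance.

```python
def collapse(hits, field, inner_size=0, inner_sort_key=None):
    """Keep the first (best) hit per group and attach up to inner_size inner hits."""
    groups = {}
    for hit in hits:  # dicts preserve insertion order, so groups stay in ranking order
        groups.setdefault(hit[field], []).append(hit)

    results = []
    for members in groups.values():
        inner = sorted(members, key=inner_sort_key) if inner_sort_key else members
        results.append({"top": members[0], "inner_hits": inner[:inner_size]})
    return results


# Sample hits, best match first (scores are illustrative).
hits = [
    {"name": "Galaxy S24", "brand": "Samsung", "price": 25, "score": 9.1},
    {"name": "iPhone 15", "brand": "Apple", "price": 30, "score": 8.7},
    {"name": "Galaxy A55", "brand": "Samsung", "price": 10, "score": 7.9},
    {"name": "iPhone SE", "brand": "Apple", "price": 12, "score": 6.5},
]

collapsed = collapse(hits, "brand", inner_size=2, inner_sort_key=lambda h: h["price"])
for group in collapsed:
    print(group["top"]["name"], [h["name"] for h in group["inner_hits"]])
```

The inner hits are sorted independently of the main ranking, so the top hit of a group can reappear among its own inner hits, as it does here.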

7.11 Search Templates

Parameterize queries so they can be reused:

bash

# Store the search template. The template uses Mustache sections
# ({{#brand}}...{{/brand}}) outside of string values, so "source" must be a string.
PUT /_scripts/product-search-template
{
  "script": {
    "lang": "mustache",
    "source": """
    {
      "query": {
        "bool": {
          "must": [
            {
              "multi_match": {
                "query": "{{query}}",
                "fields": ["name^3", "description"]
              }
            }
          ],
          "filter": [
            {{#brand}}
            { "term": { "brand": "{{brand}}" } },
            {{/brand}}
            {{#min_price}}
            { "range": { "price": { "gte": {{min_price}} } } },
            {{/min_price}}
            {{#max_price}}
            { "range": { "price": { "lte": {{max_price}} } } },
            {{/max_price}}
            { "match_all": {} }
          ]
        }
      },
      "from": "{{from}}{{^from}}0{{/from}}",
      "size": "{{size}}{{^size}}10{{/size}}"
    }
    """
  }
}

# Use the template
GET /products/_search/template
{
  "id": "product-search-template",
  "params": {
    "query": "iphone pro",
    "brand": "Apple",
    "min_price": 20000000,
    "max_price": 50000000,
    "from": 0,
    "size": 20
  }
}

# Preview rendered template
GET /_render/template
{
  "id": "product-search-template",
  "params": {
    "query": "samsung",
    "size": 5
  }
}

Chapter 7 Summary

Advanced Query Reference

Query	When to use
`function_score`	Custom scoring with business logic
`nested`	Querying arrays while preserving per-object relationships
`has_child`/`has_parent`	Parent-child documents
`percolate`	Reverse matching (queries match documents)
`more_like_this`	Content-based recommendations
`knn`	Semantic/vector search with AI/ML
`geo_distance`	Searching by geographic location
`span_near`	Proximity search (distance between terms)

Decay Functions Cheat Sheet

Function	Shape	Use when
`gauss`	Bell curve	Popularity or recency decay; the most commonly used
`linear`	Straight line	Geographic distance
`exp`	Exponential	Values very close to the target matter most (for example, price)

Next Steps

→ Chapter 8: Text Analysis - A closer look at how ES analyzes and processes text, Vietnamese in particular

Chapter 8: Text Analysis

8.1 Why Text Analysis Matters

Text analysis is the process of turning text into tokens that are stored in the inverted index. It decides how "smart" search can be.

What Goes Wrong Without Analysis

User searches: "Điện thoại iOS 5G tốt nhất"

Without analysis:
- Token: ["Điện thoại iOS 5G tốt nhất"] (a single token)
- Only an exact match works, so nothing is found

Standard analysis:
- Tokens: ["điện", "thoại", "ios", "5g", "tốt", "nhất"]
- Finds documents that contain any of the tokens

Custom analysis with synonyms:
- Tokens: ["điện", "thoại", "điện_thoại", "smartphone", "ios", "apple", "5g", "tốt", "nhất"]
- Synonyms are matched and can be highlighted too

8.2 Analysis Pipeline

When ES analyzes text, it passes through three stages:

Raw Text
   │
   ▼
┌─────────────┐
│  CHARACTER  │  Processes characters before tokenization
│   FILTERS   │  HTML strip, pattern replace, etc.
└─────────────┘
   │
   ▼
┌─────────────┐
│  TOKENIZER  │  Splits text into tokens
│             │  (exactly one tokenizer per analyzer)
└─────────────┘
   │
   ▼
┌─────────────┐
│   TOKEN     │  Processes each token
│   FILTERS   │  Lowercase, stop words, synonyms, stemming...
└─────────────┘
   │
   ▼
Tokens (terms in inverted index)
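
The three stages can be mimicked with a toy analyzer. This is a deliberately crude sketch, not how Lucene implements analysis: the regex tokenizer, the tiny stop-word list, and all function names are assumptions made for illustration.

```python
import html
import re

STOP_WORDS = {"the", "a", "an", "and", "of"}  # tiny illustrative list


def strip_html(text):
    """Character filter: remove tags and decode entities before tokenizing."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def tokenize(text):
    """Tokenizer: split into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: normalize case."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Token filter: drop very common words."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text):
    text = strip_html(text)                                # 1. character filters
    tokens = tokenize(text)                                # 2. exactly one tokenizer
    for token_filter in (lowercase, remove_stop_words):    # 3. token filters, in order
        tokens = token_filter(tokens)
    return tokens


print(analyze("<b>The Quick</b> Brown Fox"))
```

The order of the token filters matters: the stop-word filter only recognizes "The" because `lowercase` has already run.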

8.3 Testing Analysis

Always test an analyzer before deploying it:

bash

# Test a built-in analyzer
GET /_analyze
{
  "analyzer": "standard",
  "text": "iPhone 15 Pro Max là điện thoại tốt nhất 2024"
}

# Response:
{
  "tokens": [
    { "token": "iphone", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 },
    { "token": "15", "start_offset": 7, "end_offset": 9, "type": "<NUM>", "position": 1 },
    { "token": "pro", "start_offset": 10, "end_offset": 13, "type": "<ALPHANUM>", "position": 2 },
    { "token": "max", "start_offset": 14, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "là", "start_offset": 18, "end_offset": 20, "type": "<ALPHANUM>", "position": 4 },
    { "token": "điện", "start_offset": 21, "end_offset": 25, "type": "<ALPHANUM>", "position": 5 },
    { "token": "thoại", "start_offset": 26, "end_offset": 31, "type": "<ALPHANUM>", "position": 6 },
    { "token": "tốt", "start_offset": 32, "end_offset": 35, "type": "<ALPHANUM>", "position": 7 },
    { "token": "nhất", "start_offset": 36, "end_offset": 40, "type": "<ALPHANUM>", "position": 8 },
    { "token": "2024", "start_offset": 41, "end_offset": 45, "type": "<NUM>", "position": 9 }
  ]
}

# Test custom tokenizer
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The Quick Brown Fox"
}

# Test against a specific index
GET /products/_analyze
{
  "field": "name",   # Use the analyzer configured for this field
  "text": "iPhone 15 Pro Max"
}

8.4 Built-in Analyzers

`standard` Analyzer (Default)

bash

GET /_analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over lazy dog's bone."
}
# Tokens: [the, 2, quick, brown, foxes, jumped, over, lazy, dog's, bone]

Tokenizes on word boundaries (Unicode Text Segmentation)
Lowercases every token
Does not remove stop words by default (`stopwords` defaults to `_none_`); set `stopwords` to enable removal

`simple` Analyzer

bash

GET /_analyze
{
  "analyzer": "simple",
  "text": "iPhone 15 Pro Max"
}
# Tokens: [iphone, pro, max]  ← Digits are dropped; splits on any non-letter character

`whitespace` Analyzer

bash

GET /_analyze
{
  "analyzer": "whitespace",
  "text": "iPhone 15 Pro Max"
}
# Tokens: [iPhone, 15, Pro, Max]  ← Splits on whitespace only, case is PRESERVED

`keyword` Analyzer

bash

GET /_analyze
{
  "analyzer": "keyword",
  "text": "iPhone 15 Pro Max"
}
# Tokens: [iPhone 15 Pro Max]  ← No analysis, the whole input is kept as one token

`pattern` Analyzer

bash

GET /_analyze
{
  "analyzer": "pattern",
  "text": "product:SKU001:category:electronics"
}
# Default split by: \W+
# Tokens: [product, sku001, category, electronics]

Language Analyzers

bash

# English analyzer - stemming, stop words
GET /_analyze
{
  "analyzer": "english",
  "text": "The cats are running quickly through the trees"
}
# Tokens: [cat, run, quickli, through, tree]  ← Stemmed; "the" and "are" are removed as stop words

# Vietnamese:
# There is no built-in Vietnamese analyzer
# Use a custom analyzer or a plugin

8.5 Tokenizers

`standard` Tokenizer

bash

GET /_analyze
{
  "tokenizer": "standard",
  "text": "Hello, World! Testing 1-2-3."
}
# Tokens: [Hello, World, Testing, 1, 2, 3]
# Based on Unicode Text Segmentation (UAX #29)

`whitespace` Tokenizer

bash

GET /_analyze
{
  "tokenizer": "whitespace",
  "text": "Hello World testing-1-2-3"
}
# Tokens: [Hello, World, testing-1-2-3]  ← Hyphens are kept

`keyword` Tokenizer

bash

# No analysis: the whole text becomes 1 token
GET /_analyze
{
  "tokenizer": "keyword",
  "text": "My Product SKU-001"
}
# Tokens: [My Product SKU-001]

`pattern` Tokenizer (Regex-based)

bash

GET /_analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": ","     # Split on commas
  },
  "text": "apple,samsung,xiaomi,oppo"
}
# Tokens: [apple, samsung, xiaomi, oppo]

`ngram` Tokenizer

The ngram tokenizer emits every contiguous substring whose length is between min_gram and max_gram. This is very useful for partial matching:

bash

GET /_analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 3,
    "token_chars": ["letter", "digit"]
  },
  "text": "iphone"
}
# Tokens: [ip, iph, ph, pho, ho, hon, on, one, ne]
# A search for "pho" can now match "iphone" → partial search!
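
The token list above can be reproduced with a short Python sketch. It assumes the input is a single run of allowed characters (the real tokenizer first splits on characters outside `token_chars`), and the function name `ngrams` is hypothetical.

```python
def ngrams(text, min_gram, max_gram):
    """Return all substrings of length min_gram..max_gram, in the order
    start position first, then increasing length."""
    out = []
    for start in range(len(text)):
        for n in range(min_gram, max_gram + 1):
            if start + n <= len(text):
                out.append(text[start:start + n])
    return out


print(ngrams("iphone", 2, 3))
# ['ip', 'iph', 'ph', 'pho', 'ho', 'hon', 'on', 'one', 'ne']
```

The number of tokens grows quickly with the text length and with the gap between min_gram and max_gram, which is why n-gram fields tend to inflate index size.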

`edge_ngram` Tokenizer

Emits n-grams anchored at the start of each word only, which makes it a good fit for autocomplete:

bash

GET /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 10,
    "token_chars": ["letter", "digit"]
  },
  "text": "iphone"
}
# Tokens: [ip, iph, ipho, iphon, iphone]
# Typing "iph" finds "iphone"!
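
For comparison with the plain n-gram case, here is a minimal Python sketch of edge n-grams for a single word. The function name `edge_ngrams` is hypothetical, and the sketch ignores `token_chars` splitting.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Return the prefixes of token with length min_gram..max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("iphone", 2, 10))
# ['ip', 'iph', 'ipho', 'iphon', 'iphone']
```

A word of length L yields at most L - min_gram + 1 prefixes, so the index grows far less than with full n-grams.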

`path_hierarchy` Tokenizer

For hierarchical paths:

bash

GET /_analyze
{
  "tokenizer": "path_hierarchy",
  "text": "electronics/smartphones/apple/iphone-15"
}
# Tokens: [electronics, electronics/smartphones, electronics/smartphones/apple, 
#          electronics/smartphones/apple/iphone-15]
# A term query for "electronics" finds every product in the electronics category!
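
The expansion performed by `path_hierarchy` can be sketched in Python as follows. The sketch assumes the default "/" delimiter and a path without a leading delimiter; the function name is hypothetical.

```python
def path_hierarchy(path, delimiter="/"):
    """Return every ancestor prefix of path, shortest first."""
    parts = path.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


for token in path_hierarchy("electronics/smartphones/apple/iphone-15"):
    print(token)
```

Because every ancestor path is indexed as its own term, a single exact-term lookup on a category prefix is enough to retrieve the whole subtree.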

8.6 Token Filters

Token filters process each token emitted by the tokenizer:

`lowercase` Filter

bash

# "iPhone" → "iphone"
# "SAMSUNG" → "samsung"

`stop` Filter - Removing Stop Words

bash

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stop",
      "stopwords": "_english_",  # Built-in stop word list
      "ignore_case": true        # The filter is case-sensitive by default, so "The" would otherwise survive
    }
  ],
  "text": "The quick brown fox"
}
# Tokens: [quick, brown, fox]  ← "The" is removed

Note: there is no built-in Vietnamese stop word list (no `_vietnamese_`), so a custom list is needed:

bash

{
  "type": "stop",
  "stopwords": ["là", "và", "của", "có", "trong", "với", "được", "để", "cho", "từ", "tại", "về", "theo", "đó", "này", "đã", "sẽ", "thì", "không", "một", "những", "các", "cũng", "như", "nhưng", "hay", "hoặc", "mà"]
}

`synonym` Filter - Synonyms

bash

PUT /products-vi
{
  "settings": {
    "analysis": {
      "filter": {
        "vi_synonyms": {
          "type": "synonym",
          "synonyms": [
            "điện thoại, smartphone, mobile phone",
            "laptop, máy tính xách tay, notebook",
            "tai nghe, headphone, earphone",
            "iphone => apple iphone",     # ← implies: "iphone" → indexed as "apple iphone"
            "sam sung => samsung"          # ← Typo correction via synonym
          ],
          "lenient": true   # Ignore synonym rules that fail to parse
        }
      }
    }
  }
}

Synonym file (use a file instead of hardcoding the rules):

bash

{
  "type": "synonym_graph",
  "synonyms_path": "analysis/synonyms.txt",  # Relative to the ES config dir
  "updateable": true   # Reloadable without a restart via the _reload_search_analyzers API; only allowed in search analyzers
}

File synonyms.txt:

# E-commerce synonyms
điện thoại, smartphone, mobile phone, dt
laptop, máy tính xách tay, notebook, máy tính
tai nghe, headphone, earphone, headset

# Brand synonyms
sam sung => samsung
iphone 15 pro => apple iphone 15 pro

`stemmer` Filter - Stemming

Reduces words to their root form:

bash

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "stemmer",
      "language": "english"
    }
  ],
  "text": "running runners ran"
}
# Tokens: [run, runner, ran]  ← stems are not perfect: the irregular form "ran" is left as is

# Snowball stemmer (Porter2 for English, often slightly better)
{
  "type": "snowball",
  "language": "English"
}

`word_delimiter_graph` Filter

Splits compound tokens and handles complex punctuation:

bash

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "generate_word_parts": true,
      "generate_number_parts": true,
      "catenate_words": true,    # "Wi-Fi" → "WiFi" is emitted as well
      "catenate_numbers": true,
      "split_on_case_change": true    # "camelCase" → "camel" + "Case"
    }
  ],
  "text": "Wi-Fi 6E"
}
# Tokens: [WiFi, Wi, Fi, 6, E]  ← "6E" is split at the digit/letter boundary (split_on_numerics defaults to true)

`edge_ngram` Filter

Unlike the edge_ngram tokenizer, this filter is applied to each token after tokenization:

bash

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 10
    }
  ],
  "text": "iPhone Samsung"
}
# Tokens: [i, ip, iph, ipho, iphon, iphone, s, sa, sam, sams, samsu, samsun, samsung]

`asciifolding` Filter

Converts accented characters to their ASCII equivalents:

bash

GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["asciifolding"],
  "text": "café résumé"
}
# Tokens: [cafe, resume]
# "café" → "cafe", "résumé" → "resume"

`phonetic` Filter

Search by pronunciation (Soundex, Metaphone...):

bash

# Requires the plugin: bin/elasticsearch-plugin install analysis-phonetic
PUT /phonetic-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phonetic_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_phonetic"]
        }
      },
      "filter": {
        "my_phonetic": {
          "type": "phonetic",
          "encoder": "metaphone",
          "replace": false
        }
      }
    }
  }
}

8.7 Character Filters

`html_strip` - Stripping HTML Tags

bash

GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "text": "<h1>Sản phẩm</h1> <p>Mô tả <strong>tốt</strong></p>"
}
# Tokens: [Sản, phẩm, Mô, tả, tốt]  ← HTML tags removed

`mapping` - Replacing Characters

bash

GET /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "₫ => vnd",
        "& => and",
        ":) => happy",
        "😊 => happy"
      ]
    }
  ],
  "tokenizer": "standard",
  "text": "Giá: 500₫ bạn & tôi :)"
}
# "Giá: 500₫ bạn & tôi :)" → "Giá: 500vnd bạn and tôi happy"

`pattern_replace` - Regex Replacement

bash

GET /_analyze
{
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-SKU",
      "replacement": "SKU_$1"
    }
  ],
  "tokenizer": "standard",
  "text": "Product 123-SKU in stock"
}
# "123-SKU" → "SKU_123"

8.8 Custom Analyzers

Example 1: Product Name Analyzer

bash

PUT /ecommerce
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_and_special": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "/ => or",
            "₫ => vnd",
            "% => percent"
          ]
        }
      },
      "tokenizer": {
        "product_tokenizer": {
          "type": "standard"
        }
      },
      "filter": {
        "product_stop": {
          "type": "stop",
          "stopwords": ["tại", "của", "và", "hoặc", "với", "cho"]
        },
        "product_synonyms": {
          "type": "synonym",
          "synonyms": [
            "điện thoại, smartphone, mobile",
            "laptop, notebook, máy tính xách tay",
            "gb => gigabyte",
            "tb => terabyte"
          ]
        },
        "sku_code_preserving": {
          "type": "word_delimiter_graph",
          "generate_word_parts": true,
          "generate_number_parts": true,
          "split_on_case_change": false,
          "preserve_original": true
        }
      },
      "analyzer": {
        "product_name_analyzer": {
          "type": "custom",
          "char_filter": ["html_and_special"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "product_stop",
            "product_synonyms",
            "sku_code_preserving"   # Must come after the synonym filter: ES 7+ rejects a synonym filter placed after word_delimiter_graph
          ]
        },
        "product_search_analyzer": {
          "type": "custom",
          "char_filter": ["html_and_special"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "product_stop",
            "product_synonyms"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "product_name_analyzer",
        "search_analyzer": "product_search_analyzer"
      }
    }
  }
}

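# Test:
GET /ecommerce/_analyze
{
  "analyzer": "product_name_analyzer",
  "text": "Samsung Galaxy S24 Ultra 256GB/12GB RAM"
}
# Note: the char filter replaces "/" with "or" without surrounding spaces,
# so "256GB/12GB" reaches the tokenizer as "256GBor12GB"; check the output before relying on it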

Example 2: Autocomplete Analyzer (Edge N-gram)

bash

PUT /autocomplete-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        },
        "autocomplete_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
          # Do not apply edge_ngram at search time!
        }
      },
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete_index_analyzer",
        "search_analyzer": "autocomplete_search_analyzer"
        # At index time: "iphone" → [ip, iph, ipho, iphon, iphone]
        # At search time: "ipho" → [ipho]
        # Match: "ipho" is in [ip, iph, ipho, iphon, iphone] → FOUND!
      }
    }
  }
}

Test autocomplete:

bash

# Index
PUT /autocomplete-index/_doc/1
{ "name": "iPhone 15 Pro Max 256GB" }

PUT /autocomplete-index/_doc/2
{ "name": "iPad Pro 12.9 M2" }

# Search - autocomplete
GET /autocomplete-index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "ipho",
        "analyzer": "autocomplete_search_analyzer"
      }
    }
  }
}
# Result: only document 1 (iPhone) matches. "ipho" is an indexed edge n-gram of "iphone",
# while "ipad" only produces [ip, ipa, ipad]

Example 3: N-gram Analyzer for Substring Search

bash

PUT /substring-search
{
  "settings": {
    "index": {
      "max_ngram_diff": 2   # Required: max_gram - min_gram is 2, and the default limit is 1
    },
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 4,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_code": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "search_analyzer"
      }
    }
  }
}

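# Test: "SP001234" is lowercased to "sp001234" and produces:
# [sp, sp0, sp00, p0, p00, p001, 00, 001, 0012, 01, 012, 0123, 12, 123, 1234, 23, 234, 34]
# A search for "001" → match! (A query longer than max_gram, e.g. "00123", has no matching indexed term)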

8.9 Handling Vietnamese

Vietnamese is an isolating language with a few properties that matter for search:

Spaces separate syllables, not words, so whitespace alone does not give word boundaries
Many words span several syllables: "điện thoại" is 2 syllables but 1 word
Tone marks change meaning: "ma", "má", "mà", "mã", "mả", "mạ" are 6 different words

The Problem with the Standard Analyzer

bash

GET /_analyze
{
  "analyzer": "standard",
  "text": "điện thoại di động thông minh"
}
# Tokens: [điện, thoại, di, động, thông, minh]
# "điện thoại" is split into 2 separate tokens
# "thông minh" → 2 separate tokens

Consequence: "điện thoại" (phone) and "thoại kịch" (stage play) both contain "thoại", so they can cross-match.

Plugin ICU (Unicode Support)

bash

# Install the plugin
bin/elasticsearch-plugin install analysis-icu

# ICU Analyzer
GET /_analyze
{
  "analyzer": "icu_analyzer",
  "text": "điện thoại di động thông minh"
}
# Better Unicode handling (diacritics, word boundaries)

Plugin Vietnamese Analysis (vn-nlp)

Use a plugin that integrates NLP-based word segmentation for Vietnamese:

bash

# Option 1: elasticsearch-analysis-vietnamese plugin
# https://github.com/duydo/elasticsearch-analysis-vietnamese

# Install (the plugin build must match the ES version):
bin/elasticsearch-plugin install \
  file:///path/to/elasticsearch-analysis-vietnamese-8.x.x.zip

# After installing:
PUT /vi-articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "vi_analyzer": {
          "type": "vi_analyzer"  # Plugin analyzer
        }
      }
    }
  }
}

GET /vi-articles/_analyze
{
  "analyzer": "vi_analyzer",
  "text": "điện thoại di động thông minh tốt nhất 2024"
}
# Tokens (illustrative): [điện_thoại, di_động, thông_minh, tốt_nhất, 2024]
# Multi-syllable words are kept together; the exact output depends on the plugin version

Custom Vietnamese Analyzer (No Plugin Required)

If installing a plugin is not an option, stop words, synonyms, and a diacritics-stripping sub-field cover the most common cases:

bash

PUT /vi-products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "remove_diacritics_map": {
          "type": "mapping",
          "mappings": [
            "à => a", "á => a", "ả => a", "ã => a", "ạ => a",
            "â => a", "ầ => a", "ấ => a", "ẩ => a", "ẫ => a", "ậ => a",
            "ă => a", "ằ => a", "ắ => a", "ẳ => a", "ẵ => a", "ặ => a",
            "è => e", "é => e", "ẻ => e", "ẽ => e", "ẹ => e",
            "ê => e", "ề => e", "ế => e", "ể => e", "ễ => e", "ệ => e",
            "ì => i", "í => i", "ỉ => i", "ĩ => i", "ị => i",
            "ò => o", "ó => o", "ỏ => o", "õ => o", "ọ => o",
            "ô => o", "ồ => o", "ố => o", "ổ => o", "ỗ => o", "ộ => o",
            "ơ => o", "ờ => o", "ớ => o", "ở => o", "ỡ => o", "ợ => o",
            "ù => u", "ú => u", "ủ => u", "ũ => u", "ụ => u",
            "ư => u", "ừ => u", "ứ => u", "ử => u", "ữ => u", "ự => u",
            "ỳ => y", "ý => y", "ỷ => y", "ỹ => y", "ỵ => y",
            "đ => d"
          ]
        }
      },
      "filter": {
        "vi_stop_words": {
          "type": "stop",
          "stopwords": [
            "là", "và", "của", "có", "trong", "với", "được", "để", "cho",
            "từ", "tại", "về", "theo", "đó", "này", "đã", "sẽ", "thì",
            "không", "một", "những", "các", "cũng", "như", "nhưng",
            "hay", "hoặc", "mà", "rất", "thêm", "bởi", "vì", "sau",
            "trước", "khi", "nếu", "vẫn", "còn", "đang"
          ]
        },
        "vi_synonyms": {
          "type": "synonym",
          "synonyms": [
            "đt, dien thoai, smartphone => điện thoại",
            "laptop, may tinh xach tay, notebook => laptop",
            "airpods, tai nghe apple => tai nghe apple",
            "op lung, bao da, case => phụ kiện điện thoại",
            "pin du phong, power bank => sạc dự phòng"
          ]
        }
      },
      "analyzer": {
        "vi_standard": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "vi_stop_words",
            "vi_synonyms"
          ]
        },
        "vi_no_diacritics": {
          "type": "custom",
          "char_filter": ["remove_diacritics_map"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "vi_stop_words"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "vi_standard",
        "fields": {
          "no_diacritics": {
            "type": "text",
            "analyzer": "vi_no_diacritics"  # Unaccented queries match too
          }
        }
      }
    }
  }
}

Testing unaccented search:

bash

GET /vi-products/_analyze
{
  "analyzer": "vi_no_diacritics",
  "text": "dien thoai samsung galaxy"
}
# "điện thoại samsung galaxy" and "dien thoai samsung galaxy" produce the same tokens with this analyzer, so they match each other

# Unaccented query
GET /vi-products/_search
{
  "query": {
    "multi_match": {
      "query": "dien thoai samsung",
      "fields": ["name", "name.no_diacritics"]
    }
  }
}
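
To sanity-check which accented and unaccented strings should collapse to the same form, the folding can be approximated outside Elasticsearch. The sketch below uses Unicode decomposition rather than an explicit mapping table, so it is an alternative technique to the `mapping` char filter above, not a copy of it; the function name is hypothetical.

```python
import unicodedata


def strip_vietnamese_diacritics(text):
    """Fold Vietnamese letters to plain ASCII letters.

    "đ"/"Đ" have no Unicode decomposition, so they are replaced explicitly;
    every other accented letter is decomposed (NFD) and its combining marks
    (tone marks, circumflex, breve, horn) are dropped.
    """
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_vietnamese_diacritics("điện thoại"))        # dien thoai
print(strip_vietnamese_diacritics("Đường Nguyễn Huệ"))  # Duong Nguyen Hue
```

Unlike the lowercase-only mapping table, this approach also handles uppercase letters. The trade-off of any folding is the same: "ma", "má" and "mạ" all become "ma", which is why the unaccented form lives in a separate sub-field.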

8.10 Index-time vs Search-time Analysis

Important: ES analyzes text at two points:

At index time: text → tokens → stored in the inverted index
At search time: query → tokens → looked up in the index

By default the same analyzer is used for both, but they can differ:

bash

PUT /products
{
  # "autocomplete_index_analyzer" must be defined under settings.analysis (see Example 2 in 8.8)
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete_index_analyzer",   # Index: create edge n-grams
        "search_analyzer": "standard"                # Search: standard tokens
    }
  }
}

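When to use different analyzers:

Autocomplete: index with edge_ngram, search with standard
Synonym expansion: apply synonyms at search time only to keep the index small and allow rule updates without reindexing
Phonetic matching is not such a case: the phonetic filter has to run at both index and search time (typically on a sub-field), because the encoded search tokens must exist in the index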
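
The autocomplete case can be simulated with a toy inverted index in Python. This is a sketch under simplifying assumptions: whitespace splitting stands in for the standard tokenizer, a query matches if any of its terms is found (as in a default match query), and the two sample product names are the ones indexed in Example 2. All function names are hypothetical.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    # Prefixes of the token with length min_gram..max_gram.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def index_analyzer(text):
    # Index time: lowercase, split on whitespace, expand to edge n-grams.
    return [gram for tok in text.lower().split() for gram in edge_ngrams(tok)]


def search_analyzer(text):
    # Search time: lowercase and split only, no n-grams.
    return text.lower().split()


docs = {1: "iPhone 15 Pro Max 256GB", 2: "iPad Pro 12.9 M2"}

# Build the inverted index: term -> set of document ids.
inverted = {}
for doc_id, name in docs.items():
    for term in index_analyzer(name):
        inverted.setdefault(term, set()).add(doc_id)


def search(query, analyzer):
    # A document matches if it contains any of the query terms.
    hits = set()
    for term in analyzer(query):
        hits |= inverted.get(term, set())
    return sorted(hits)


print(search("ipho", search_analyzer))  # [1]
print(search("ipho", index_analyzer))   # [1, 2]
```

With the n-gram analyzer on the query side, "ipho" is expanded to [ip, iph, ipho], and the short prefix "ip" also matches the iPad document. That is the false positive the separate search analyzer avoids.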

8.11 Advanced Analyze API

bash

# Inspect the analysis process in detail
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "stop"
  ],
  "char_filter": ["html_strip"],
  "text": "<h1>Hello World</h1>",
  "explain": true    # Show the output of every stage
}

# Response (abbreviated; offsets refer to the original text):
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [
      {
        "name": "html_strip",
        "filtered_text": ["\nHello World\n"]
      }
    ],
    "tokenizer": {
      "name": "standard",
      "tokens": [
        { "token": "Hello", "start_offset": 4, "end_offset": 9 },
        { "token": "World", "start_offset": 10, "end_offset": 15 }
      ]
    },
    "tokenfilters": [
      {
        "name": "lowercase",
        "tokens": [
          { "token": "hello", "start_offset": 4, "end_offset": 9 },
          { "token": "world", "start_offset": 10, "end_offset": 15 }
        ]
      },
      {
        "name": "stop",
        "tokens": [
          { "token": "hello", "start_offset": 4, "end_offset": 9 },
          { "token": "world", "start_offset": 10, "end_offset": 15 }
        ]
      }
    ]
  }
}

8.12 Multi-language Support

bash

PUT /multilang-articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "en_analyzer": {
          "type": "english"
        },
        "vi_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        },
        "fr_analyzer": {
          "type": "french"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title_en": { "type": "text", "analyzer": "en_analyzer" },
      "title_vi": { "type": "text", "analyzer": "vi_analyzer" },
      "title_fr": { "type": "text", "analyzer": "fr_analyzer" },
      "language": { "type": "keyword" }
    }
  }
}

# Query based on language
GET /multilang-articles/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              { "term": { "language": "en" } },
              { "match": { "title_en": "search engine" } }
            ]
          }
        },
        {
          "bool": {
            "must": [
              { "term": { "language": "vi" } },
              { "match": { "title_vi": "công cụ tìm kiếm" } }
            ]
          }
        }
      ]
    }
  }
}

Chapter 8 Summary

Analysis Pipeline Recap

Text → Char Filters → Tokenizer → Token Filters → Tokens

Char Filters: html_strip, mapping, pattern_replace
Tokenizers: standard, whitespace, ngram, edge_ngram, path_hierarchy
Token Filters: lowercase, stop, synonym, stemmer, edge_ngram

Choosing an Analyzer

Use case	Suitable analyzer
Full-text search in English	`english` (stemming)
Full-text search in Vietnamese	Custom (stop words + synonyms)
Exact match	`keyword`
Autocomplete	Custom edge_ngram
Partial/substring search	Custom ngram
Product codes/IDs	`keyword` or `whitespace`
Log messages	`standard` or `whitespace`
Multi-language	Separate fields per language

Golden Rules

Test with the _analyze API before deploying
Index and search analyzers must be compatible (the tokens produced at search time must exist in the index)
Do not use edge_ngram in the search analyzer (the query would be expanded into short prefixes that match unrelated documents)
Prefer the synonym_graph filter for synonyms, applied at search time only where possible
Be careful with stop words: "not" is an English stop word, yet it matters in many contexts

Next Steps

→ Chapter 9: Aggregations - powerful analytics with the Elasticsearch aggregation framework

Chapter 9: Aggregations - Data Analysis

9.1 What Are Aggregations?

Aggregations are Elasticsearch's analytics feature: they compute statistics, group data, and analyze trends over large data sets in near real time.

Query vs Aggregation

bash

# Query: find documents
GET /orders/_search
{
  "query": { "term": { "status": "completed" } }
}
# Response contains: "hits": { "total": { "value": 1250, "relation": "eq" } }

# Aggregation: analyze the data
GET /orders/_search
{
  "size": 0,             # No documents needed, only the aggregation result
  "aggs": {
    "total_revenue": {   # Aggregation name (your choice)
      "sum": {
        "field": "amount"
      }
    }
  }
}
# Result: total revenue = 254,750,000 VND

9.2 Types of Aggregations

Aggregations
├── Metric Aggregations    (Compute metrics: sum, avg, min, max)
├── Bucket Aggregations    (Group documents: terms, date_histogram, range)
├── Pipeline Aggregations  (Aggregate the output of other aggregations)
└── Matrix Aggregations    (Matrix computations, rarely used)

9.3 Metric Aggregations

`avg`, `sum`, `min`, `max`, `value_count`

bash

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "avg_order_value": {
      "avg": { "field": "amount" }
    },
    "total_revenue": {
      "sum": { "field": "amount" }
    },
    "min_order": {
      "min": { "field": "amount" }
    },
    "max_order": {
      "max": { "field": "amount" }
    },
    "order_count": {
      "value_count": { "field": "order_id" }
    }
  }
}

# Response:
{
  "aggregations": {
    "avg_order_value": { "value": 204125.0 },
    "total_revenue": { "value": 255156250.0 },
    "min_order": { "value": 15000.0 },
    "max_order": { "value": 45000000.0 },
    "order_count": { "value": 1250 }
  }
}

`stats` - Five Metrics in One

bash

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "order_stats": {
      "stats": { "field": "amount" }
    }
  }
}

# Response:
{
  "aggregations": {
    "order_stats": {
      "count": 1250,
      "min": 15000.0,
      "max": 45000000.0,
      "avg": 204125.0,
      "sum": 255156250.0
    }
  }
}

`extended_stats` - Adding Distribution Statistics

bash

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "order_extended_stats": {
      "extended_stats": {
        "field": "amount",
        "sigma": 2.0    # Standard deviation bounds
      }
    }
  }
}

# Response (illustrative values):
{
  "aggregations": {
    "order_extended_stats": {
      "count": 1250,
      "min": 15000.0,
      "max": 45000000.0,
      "avg": 204125.0,
      "sum": 255156250.0,
      "sum_of_squares": 2.1e+15,
      "variance": 1.6383e+12,       # sum_of_squares / count - avg^2
      "std_deviation": 1279974.0,
      "std_deviation_bounds": {
        "upper": 2764073.0,    # avg + 2 * std_dev
        "lower": -2355823.0    # avg - 2 * std_dev
      }
    }
  }
}
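
The relationships between these fields can be checked with a small Python sketch. It assumes the population variance (sum of squares divided by count, minus the squared mean), which is how the fields above relate to each other; the function name and the sample values are hypothetical.

```python
import math


def extended_stats(values, sigma=2.0):
    """Compute the core extended_stats fields for a list of numbers."""
    count = len(values)
    total = sum(values)
    avg = total / count
    sum_of_squares = sum(v * v for v in values)
    variance = sum_of_squares / count - avg * avg  # population variance
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


stats = extended_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(stats["avg"], stats["variance"], stats["std_deviation"])  # 5.0 4.0 2.0
print(stats["std_deviation_bounds"])  # {'upper': 9.0, 'lower': 1.0}
```

A single large outlier dominates the sum of squares, so on skewed data such as order amounts the lower bound can easily fall below zero.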

`percentiles` - Percentiles

bash

GET /response-times/_search
{
  "size": 0,
  "aggs": {
    "latency_percentiles": {
      "percentiles": {
        "field": "response_time_ms",
        "percents": [50, 90, 95, 99, 99.9],
        "keyed": false,   # Return an array instead of an object
        "tdigest": {
          "compression": 100   # Accuracy vs memory (default 100; higher = more accurate, more memory)
        }
      }
    }
  }
}

# Response:
{
  "aggregations": {
    "latency_percentiles": {
      "values": [
        { "key": 50.0, "value": 45.2 },   # p50 = median = 45ms
        { "key": 90.0, "value": 189.7 },  # p90 = 189ms
        { "key": 95.0, "value": 340.1 },  # p95 = 340ms
        { "key": 99.0, "value": 1250.0 }, # p99 = 1.25s
        { "key": 99.9, "value": 4500.0 }  # p99.9 = 4.5s
      ]
    }
  }
}
# Conclusion: 99% of requests finish within 1.25 s, but the slowest 0.1% take 4.5 s or more

`percentile_ranks` - Percentile Ranks

bash

GET /response-times/_search
{
  "size": 0,
  "aggs": {
    "rank_200ms": {
      "percentile_ranks": {
        "field": "response_time_ms",
        "values": [200, 500, 1000]   # What % of requests took < 200ms, 500ms, 1000ms?
      }
    }
  }
}

# Response:
{
  "aggregations": {
    "rank_200ms": {
      "values": {
        "200.0": 90.5,    # 90.5% of requests < 200ms
        "500.0": 97.2,    # 97.2% < 500ms
        "1000.0": 99.1    # 99.1% < 1000ms
      }
    }
  }
}
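
Conceptually, a percentile rank is the share of observations at or below a threshold. The Python sketch below computes it exactly on a small, made-up sample. Elasticsearch estimates the same quantity with an approximation algorithm (TDigest by default), so results on a real cluster can differ slightly. The function name and the latency values are hypothetical.

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to threshold."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)


latencies_ms = [20, 45, 80, 150, 190, 210, 340, 480, 900, 1250]

for limit in (200, 500, 1000):
    print(limit, percentile_rank(latencies_ms, limit))
# 200 50.0
# 500 80.0
# 1000 90.0
```

This is the inverse question of `percentiles`: instead of asking "what latency covers 90% of requests?", it asks "what share of requests meets a 200 ms target?", which maps directly onto SLA checks.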

`cardinality` - Distinct Counts

bash

GET /logs/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "user_id",
        "precision_threshold": 10000   # Tradeoff: memory vs accuracy
      }
    },
    "unique_ips": {
      "cardinality": {
        "field": "ip_address"
      }
    }
  }
}
# The result is an estimate (HyperLogLog++ algorithm); the error is typically in the low single-digit percent range
# A higher precision_threshold (default 3000, max 40000) = more accurate = more memory

`top_hits` - Sample Documents

bash

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "top_recent_orders": {
      "top_hits": {
        "size": 3,
        "_source": ["order_id", "customer_name", "amount"],
        "sort": [{ "created_at": "desc" }]
      }
    }
  }
}

Scripted Metric Aggregation

bash

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "profit_metric": {
      "scripted_metric": {
        "init_script": "state.total_revenue = 0; state.total_cost = 0",
        "map_script": """
          state.total_revenue += doc['revenue'].value;
          state.total_cost += doc['cost'].value;
        """,
        "combine_script": """
          return ['revenue': state.total_revenue, 'cost': state.total_cost]
        """,
        "reduce_script": """
          double total_revenue = 0;
          double total_cost = 0;
          for (state in states) {
            total_revenue += state.revenue;
            total_cost += state.cost;
          }
          return (total_revenue - total_cost) / total_revenue * 100;
        """
      }
    }
  }
}
# Computes the profit margin (%) with Painless scripts
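
The four script phases follow a map/reduce pattern: init, map and combine run once per shard, and reduce runs once on the coordinating node. The Python sketch below mirrors that flow with two made-up shards. It illustrates the data flow only, not Painless itself, and all names and numbers are hypothetical.

```python
# Two hypothetical shards, each holding a few order documents.
shards = [
    [{"revenue": 100.0, "cost": 75.0}, {"revenue": 100.0, "cost": 75.0}],
    [{"revenue": 200.0, "cost": 150.0}],
]


def run_on_shard(docs):
    # init_script: create the per-shard state.
    state = {"total_revenue": 0.0, "total_cost": 0.0}
    # map_script: executed once per matching document.
    for doc in docs:
        state["total_revenue"] += doc["revenue"]
        state["total_cost"] += doc["cost"]
    # combine_script: return one compact result per shard.
    return {"revenue": state["total_revenue"], "cost": state["total_cost"]}


def reduce_states(states):
    # reduce_script: merge the per-shard results on the coordinating node.
    total_revenue = sum(s["revenue"] for s in states)
    total_cost = sum(s["cost"] for s in states)
    return (total_revenue - total_cost) / total_revenue * 100


margin = reduce_states([run_on_shard(docs) for docs in shards])
print(margin)  # 25.0
```

Only the small combined state leaves each shard, which keeps network traffic low. Note that the reduce step divides by the total revenue, so a result set with zero revenue needs a guard in a real script.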

9.4 Bucket Aggregations

Bucket aggregations group documents into "buckets":

`terms` - Grouping by Value

bash

GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category.keyword",
        "size": 10,              # Number of buckets returned (default 10)
        "order": { "_count": "desc" }   # Sort by document count
      }
    }
  }
}

# Response:
{
  "aggregations": {
    "by_category": {
      "doc_count_error_upper_bound": 0,    # Upper bound on the count error caused by merging per-shard top terms
      "sum_other_doc_count": 150,          # Documents that fall outside the top 10 buckets
      "buckets": [
        { "key": "smartphones", "doc_count": 450 },
        { "key": "laptops", "doc_count": 280 },
        { "key": "tablets", "doc_count": 120 },
        ...
      ]
    }
  }
}

Important notes on the terms aggregation:

Document counts from terms can be approximate on multi-shard indices
Each shard returns only its own top terms, so the merged result can miss documents (or even whole terms)
Increase shard_size for better accuracy (default: size * 1.5 + 10)

bash

{
  "terms": {
    "field": "category.keyword",
    "size": 10,
    "shard_size": 100   # Each shard returns its top 100; the results are merged to pick the top 10
  }
}
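
Why a larger shard_size helps can be seen in a small simulation. The Python sketch below assumes two shards with made-up term counts and a coordinator that simply sums whatever each shard returns. shard_size is set artificially low to make the effect visible; with the default formula it would be 11 for size 1. All names are hypothetical.

```python
from collections import Counter

# Per-shard term counts. Term "b" is never first on any shard,
# but it is the most frequent term overall (9 + 9 = 18).
shards = [
    Counter({"a": 10, "b": 9, "c": 1}),
    Counter({"c": 10, "b": 9, "a": 1}),
]


def terms_agg(shards, size, shard_size):
    merged = Counter()
    for shard in shards:
        # Each shard reports only its own top shard_size terms.
        for term, count in shard.most_common(shard_size):
            merged[term] += count
    # The coordinator keeps the top `size` terms of the merged counts.
    return merged.most_common(size)


approx = terms_agg(shards, size=1, shard_size=1)
better = terms_agg(shards, size=1, shard_size=2)
print(approx)  # top term has count 10, and "b" is missing entirely
print(better)  # [('b', 18)]
```

With shard_size=1 the true winner never reaches the coordinator, and the reported count (10) is also lower than the real total for that term (11). Raising shard_size trades more per-shard work and network traffic for accuracy.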

`rare_terms` - Finding Rare Values

bash

GET /logs/_search
{
  "size": 0,
  "aggs": {
    "rare_error_codes": {
      "rare_terms": {
        "field": "error_code",
        "max_doc_count": 5    # Only buckets with <= 5 documents
      }
    }
  }
}

`date_histogram` - Grouping by Time

bash

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "orders_over_time": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "day",    # 1d, 1w, 1M, 1q, 1y
        "time_zone": "Asia/Ho_Chi_Minh",
        "min_doc_count": 0,            # Also return days without orders
        "extended_bounds": {           # Force the range even where there is no data
          "min": "2024-01-01",
          "max": "2024-12-31"
        },
        "format": "yyyy-MM-dd"         # Format of key_as_string
      }
    }
  }
}

# Response:
{
  "aggregations": {
    "orders_over_time": {
      "buckets": [
        { "key_as_string": "2024-01-01", "key": 1704042000000, "doc_count": 45 },  # key = local midnight (UTC+7) in epoch millis
        { "key_as_string": "2024-01-02", "key": 1704128400000, "doc_count": 67 },
        { "key_as_string": "2024-01-03", "key": 1704214800000, "doc_count": 0 },
        ...
      ]
    }
  }
}
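
With a time_zone set, each daily bucket starts at local midnight, so the epoch-millisecond key is shifted relative to UTC midnight. The Python sketch below shows the arithmetic. It assumes a fixed UTC+07:00 offset as a stand-in for Asia/Ho_Chi_Minh (which has no daylight saving time); the function name and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-in for the Asia/Ho_Chi_Minh time zone (UTC+07:00).
ICT = timezone(timedelta(hours=7))


def day_bucket_key(epoch_millis, tz=ICT):
    """Return the epoch-millis key of the local-day bucket for a timestamp."""
    local = datetime.fromtimestamp(epoch_millis / 1000, tz)
    midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(midnight.timestamp() * 1000)


order_a = 1704079800000  # 2024-01-01T10:30:00+07:00
order_b = 1704151800000  # 2024-01-02T06:30:00+07:00 (still 2024-01-01 in UTC)

print(day_bucket_key(order_a))  # 1704042000000 -> local day 2024-01-01
print(day_bucket_key(order_b))  # 1704128400000 -> local day 2024-01-02
```

The second order shows why the time zone matters for daily reports: in UTC it would be counted on January 1, while for a business in Vietnam it belongs to January 2.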

Fixed interval instead of calendar interval:

bash

{
  "date_histogram": {
    "field": "timestamp",
    "fixed_interval": "1h",     # 1h, 30m, 15m, 1m, etc.
    "offset": "+30m"            # Shift bucket boundaries (buckets start at minute 30); an offset is only meaningful if smaller than the interval
  }
}

`histogram` - Grouping by Numeric Interval

bash

GET /products/_search
{
  "size": 0,
  "aggs": {
    "price_distribution": {
      "histogram": {
        "field": "price",
        "interval": 500000,      # 0-500k, 500k-1M, 1M-1.5M...
        "min_doc_count": 1
      }
    }
  }
}

# Response:
{
  "aggregations": {
    "price_distribution": {
      "buckets": [
        { "key": 0.0, "doc_count": 150 },       # 0-500k
        { "key": 500000.0, "doc_count": 280 },   # 500k-1M
        { "key": 1000000.0, "doc_count": 350 },  # 1M-1.5M
        { "key": 1500000.0, "doc_count": 200 },  # 1.5M-2M
        { "key": 2000000.0, "doc_count": 120 }   # 2M-2.5M
      ]
    }
  }
}
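
Each value is assigned to a bucket by rounding it down to a multiple of the interval: bucket_key = floor((value - offset) / interval) * interval + offset. The Python sketch below applies that formula to a few made-up prices; the function name and sample data are hypothetical.

```python
import math
from collections import Counter


def bucket_key(value, interval, offset=0.0):
    """Lower bound of the histogram bucket that contains value."""
    return math.floor((value - offset) / interval) * interval + offset


prices = [120000, 499999, 500000, 1250000]
histogram = Counter(bucket_key(p, 500000) for p in prices)

for key in sorted(histogram):
    print(key, histogram[key])
# 0.0 2
# 500000.0 1
# 1000000.0 1
```

Bucket boundaries are inclusive at the bottom and exclusive at the top: 499,999 lands in the 0 bucket, while 500,000 starts the next one.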

`range` - Custom Ranges

bash

GET /products/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "key": "Dưới 1 triệu", "to": 1000000 },
          { "key": "1-5 triệu", "from": 1000000, "to": 5000000 },
          { "key": "5-10 triệu", "from": 5000000, "to": 10000000 },
          { "key": "10-20 triệu", "from": 10000000, "to": 20000000 },
          { "key": "Trên 20 triệu", "from": 20000000 }
        ]
      }
    }
  }
}

`date_range`

bash

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "order_periods": {
      "date_range": {
        "field": "created_at",
        "time_zone": "Asia/Ho_Chi_Minh",
        "format": "yyyy-MM-dd",
        "ranges": [
          { "key": "Hôm qua", "from": "now-1d/d", "to": "now/d" },   # Yesterday: from the start of yesterday up to the start of today
          { "key": "Hôm nay", "from": "now/d", "to": "now" },
          { "key": "Tháng này", "from": "now/M", "to": "now" },
          { "key": "Tháng trước", "from": "now-1M/M", "to": "now/M" }
        ]
      }
    }
  }
}

`filters` - Multiple Named Filters

bash

GET /logs/_search
{
  "size": 0,
  "aggs": {
    "log_levels": {
      "filters": {
        "filters": {
          "errors": { "term": { "level": "ERROR" } },
          "warnings": { "term": { "level": "WARN" } },
          "infos": { "term": { "level": "INFO" } },
          "criticals": { "match": { "message": "critical fatal" } }
        }
      }
    }
  }
}

# Response:
{
  "aggregations": {
    "log_levels": {
      "buckets": {
        "errors": { "doc_count": 45 },
        "warnings": { "doc_count": 230 },
        "infos": { "doc_count": 15420 },
        "criticals": { "doc_count": 12 }
      }
    }
  }
}

`geo_distance` - Grouping by Geographic Distance

bash

GET /restaurants/_search
{
  "size": 0,
  "aggs": {
    "distance_from_center": {
      "geo_distance": {
        "field": "location",
        "origin": { "lat": 10.7769, "lon": 106.7009 },  # Central Ho Chi Minh City
        "unit": "km",
        "ranges": [
          { "key": "< 1km", "to": 1 },
          { "key": "1-3km", "from": 1, "to": 3 },
          { "key": "3-5km", "from": 3, "to": 5 },
          { "key": "> 5km", "from": 5 }
        ]
      }
    }
  }
}

`significant_terms` - Statistically Distinctive Terms

bash

GET /articles/_search
{
  "query": {
    "term": { "category": "technology" }
  },
  "size": 0,
  "aggs": {
    "significant_tech_terms": {
      "significant_terms": {
        "field": "content"
      }
    }
  }
}
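# Finds terms that are unusually frequent in technology articles
# compared with the whole corpus.
# Note: significant_terms expects a keyword field; for an analyzed text field
# such as "content", significant_text is the aggregation intended for the job.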

`nested` Aggregation

bash

# With nested data (reviews inside a product):
GET /products/_search
{
  "size": 0,
  "aggs": {
    "reviews": {
      "nested": {
        "path": "reviews"
      },
      "aggs": {
        "avg_rating": {
          "avg": { "field": "reviews.rating" }
        },
        "review_distribution": {
          "terms": { "field": "reviews.rating" }
        }
      }
    }
  }
}

9.5 Sub-Aggregations (Nesting Aggregations)

Sub-aggregations, meaning aggregations nested inside other aggregations, are arguably the most powerful aggregation feature:

bash

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "orders_by_month": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "month",
        "format": "yyyy-MM"
      },
      "aggs": {                          # Sub-aggregation
        "revenue_per_month": {
          "sum": { "field": "amount" }
        },
        "avg_order_value": {
          "avg": { "field": "amount" }
        },
        "top_categories": {
          "terms": {
            "field": "category.keyword",
            "size": 3
          },
          "aggs": {                      # Sub-sub-aggregation!
            "category_revenue": {
              "sum": { "field": "amount" }
            }
          }
        }
      }
    }
  }
}

# Response structure:
{
  "aggregations": {
    "orders_by_month": {
      "buckets": [
        {
          "key_as_string": "2024-01",
          "doc_count": 1250,
          "revenue_per_month": { "value": 25000000 },
          "avg_order_value": { "value": 20000 },
          "top_categories": {
            "buckets": [
              {
                "key": "smartphones",
                "doc_count": 450,
                "category_revenue": { "value": 10000000 }
              },
              {
                "key": "laptops",
                "doc_count": 280,
                "category_revenue": { "value": 8500000 }
              }
            ]
          }
        }
      ]
    }
  }
}

9.6 Pipeline Aggregations

Pipeline aggregations compute over the output of other aggregations rather than over documents:

`avg_bucket`, `sum_bucket`, `min_bucket`, `max_bucket`

bash

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "monthly_revenue": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "month"
      },
      "aggs": {
        "revenue": {
          "sum": { "field": "amount" }
        }
      }
    },
    "avg_monthly_revenue": {
      "avg_bucket": {
        "buckets_path": "monthly_revenue>revenue"  # Sibling pipeline: bucket aggregation > metric
      }
    },
    "max_monthly_revenue": {
      "max_bucket": {
        "buckets_path": "monthly_revenue>revenue"
      }
    }
  }
}

# "monthly_revenue>revenue" means: inside monthly_revenue, take the revenue metric of each bucket
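
To make the `buckets_path` idea concrete, here is a minimal client-side sketch that resolves the same path against a response and computes what `avg_bucket` and `max_bucket` would report. The response fragment and the helper name `resolve_buckets_path` are hypothetical; Elasticsearch performs this step on the server.

```python
from statistics import mean


def resolve_buckets_path(response: dict, buckets_path: str) -> list:
    """Collect one metric value per bucket, following an "agg>metric" path."""
    agg_name, metric_name = buckets_path.split(">")
    buckets = response["aggregations"][agg_name]["buckets"]
    values = [bucket[metric_name]["value"] for bucket in buckets]
    # Ignore buckets whose metric is null (a gap in the data).
    return [value for value in values if value is not None]


# Hypothetical response fragment with three monthly buckets.
response = {
    "aggregations": {
        "monthly_revenue": {
            "buckets": [
                {"key_as_string": "2024-01", "revenue": {"value": 10.0}},
                {"key_as_string": "2024-02", "revenue": {"value": 20.0}},
                {"key_as_string": "2024-03", "revenue": {"value": 30.0}},
            ]
        }
    }
}

values = resolve_buckets_path(response, "monthly_revenue>revenue")
avg_monthly_revenue = mean(values)  # the number avg_bucket would report
max_monthly_revenue = max(values)   # the number max_bucket would report
print(avg_monthly_revenue, max_monthly_revenue)
```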

`derivative` - Rate of Change

bash

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "sales_per_week": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "week"
      },
      "aggs": {
        "total_sales": {
          "sum": { "field": "amount" }
        },
        "sales_growth": {   # Derivative: week-over-week growth
          "derivative": {
            "buckets_path": "total_sales"
          }
        }
      }
    }
  }
}

# Response:
# Week 1: total_sales = 5M, sales_growth = null (no previous)
# Week 2: total_sales = 6M, sales_growth = +1M
# Week 3: total_sales = 5.5M, sales_growth = -500K (decline!)
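
The week-over-week numbers above are just differences between consecutive buckets. A small sketch in plain Python (the function name `derivative` is only illustrative) reproduces them:

```python
def derivative(values: list) -> list:
    """First-order difference between consecutive buckets.

    The first bucket has no predecessor, so its derivative is None.
    """
    if not values:
        return []
    result = [None]
    for previous, current in zip(values, values[1:]):
        result.append(current - previous)
    return result


weekly_sales = [5_000_000, 6_000_000, 5_500_000]
print(derivative(weekly_sales))  # [None, 1000000, -500000]
```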

`cumulative_sum` - Running Total

bash

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "daily_sales": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "day",
        "format": "yyyy-MM-dd"
      },
      "aggs": {
        "daily_revenue": {
          "sum": { "field": "amount" }
        },
        "cumulative_revenue": {
          "cumulative_sum": {
            "buckets_path": "daily_revenue"
          }
        }
      }
    }
  }
}

# Response:
# Day 1: daily = 500K, cumulative = 500K
# Day 2: daily = 750K, cumulative = 1.25M
# Day 3: daily = 600K, cumulative = 1.85M
# ...which gives you a cumulative revenue chart.
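
The same running total can be checked by hand. A minimal sketch with Python's standard library, using the three daily figures from the example:

```python
from itertools import accumulate

daily_revenue = [500_000, 750_000, 600_000]

# accumulate() yields the running sum, one value per daily bucket.
cumulative_revenue = list(accumulate(daily_revenue))
print(cumulative_revenue)  # [500000, 1250000, 1850000]
```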

`moving_avg` - Moving Average (Deprecated, Replaced by `moving_fn`)

bash

# moving_fn replaces moving_avg (which was removed in 8.0):
GET /metrics/_search
{
  "size": 0,
  "aggs": {
    "hourly_requests": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1h"
      },
      "aggs": {
        "request_count": {
          "value_count": { "field": "request_id" }
        },
        "smooth_trend": {
          "moving_fn": {
            "buckets_path": "request_count",
            "window": 24,       # 24-hour window
            "script": "MovingFunctions.unweightedAvg(values)"
          }
        }
      }
    }
  }
}
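
A rough sketch of what `MovingFunctions.unweightedAvg` computes for each bucket. It assumes the default `shift` of 0, where the window holds the preceding `window` values and excludes the current bucket; the function name and sample numbers are illustrative only.

```python
def moving_unweighted_avg(values: list, window: int) -> list:
    """Trailing average over the `window` values before each position."""
    result = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]  # excludes values[i] itself
        result.append(sum(prior) / len(prior) if prior else None)
    return result


request_counts = [10, 20, 30, 40]
print(moving_unweighted_avg(request_counts, window=2))  # [None, 10.0, 15.0, 25.0]
```

The first bucket has nothing before it, so the sketch returns None there.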

`bucket_sort` - Sorting Buckets

bash

GET /products/_search
{
  "size": 0,
  "aggs": {
    "top_selling_categories": {
      "terms": {
        "field": "category.keyword"
      },
      "aggs": {
        "total_sales": {
          "sum": { "field": "sales_count" }
        },
        "sort_by_sales": {
          "bucket_sort": {
            "sort": [{ "total_sales": { "order": "desc" } }],
            "size": 5         # Top 5 categories
          }
        }
      }
    }
  }
}

`bucket_selector` - Filtering Buckets

bash

GET /products/_search
{
  "size": 0,
  "aggs": {
    "monthly_sales": {
      "date_histogram": {
        "field": "sold_date",
        "calendar_interval": "month"
      },
      "aggs": {
        "total_sales": {
          "sum": { "field": "amount" }
        },
        "only_high_months": {
          "bucket_selector": {
            "buckets_path": { "totalSales": "total_sales" },
            "script": "params.totalSales > 10000000"  # Keep only months above 10M
          }
        }
      }
    }
  }
}

9.7 Combining Query + Aggregation

bash

GET /orders/_search
{
  "query": {                    # Scope: aggregate only over this customer's orders
    "term": { "customer_id": "CUST001" }
  },
  "size": 0,
  "aggs": {
    "my_orders_by_status": {
      "terms": { "field": "status" }
    },
    "total_spent": {
      "sum": { "field": "amount" }
    }
  }
}

Global Aggregation - Ignore Query Scope

bash

GET /orders/_search
{
  "query": {
    "term": { "status": "completed" }
  },
  "size": 0,
  "aggs": {
    "completed_revenue": {
      "sum": { "field": "amount" }              # Over completed orders only
    },
    "all_orders_revenue": {
      "global": {},                              # Ignore the query filter
      "aggs": {
        "total": {
          "sum": { "field": "amount" }           # Over ALL orders
        }
      }
    }
  }
}
# You can then compute completed_revenue / all_orders_revenue.total = share of revenue from completed orders

Filter Aggregation

bash

GET /products/_search
{
  "size": 0,
  "aggs": {
    "premium_products": {
      "filter": {
        "range": { "price": { "gte": 10000000 } }
      },
      "aggs": {
        "avg_premium_price": {
          "avg": { "field": "price" }
        }
      }
    },
    "budget_products": {
      "filter": {
        "range": { "price": { "lt": 1000000 } }
      },
      "aggs": {
        "avg_budget_price": {
          "avg": { "field": "price" }
        }
      }
    }
  }
}

9.8 Real-World Use Cases

Use Case 1: E-Commerce Sales Dashboard

bash

# Revenue overview dashboard
GET /orders/_search
{
  "size": 0,
  "query": {
    "range": {
      "created_at": {
        "gte": "now-30d",
        "lte": "now"
      }
    }
  },
  "aggs": {
    "total_revenue": {
      "sum": { "field": "amount" }
    },
    "total_orders": {
      "value_count": { "field": "order_id" }
    },
    "avg_order_value": {
      "avg": { "field": "amount" }
    },
    "unique_customers": {
      "cardinality": { "field": "customer_id" }
    },
    "revenue_by_day": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "day",
        "format": "yyyy-MM-dd",
        "time_zone": "Asia/Ho_Chi_Minh"
      },
      "aggs": {
        "daily_revenue": { "sum": { "field": "amount" } },
        "daily_orders": { "value_count": { "field": "order_id" } }
      }
    },
    "revenue_by_category": {
      "terms": {
        "field": "category.keyword",
        "size": 10,
        "order": { "category_revenue": "desc" }
      },
      "aggs": {
        "category_revenue": { "sum": { "field": "amount" } },
        "avg_price": { "avg": { "field": "price" } }
      }
    },
    "payment_methods": {
      "terms": { "field": "payment_method.keyword" }
    },
    "order_status": {
      "terms": { "field": "status.keyword" }
    },
    "refund_rate": {
      "filters": {
        "filters": {
          "refunded": { "term": { "status": "refunded" } },
          "completed": { "term": { "status": "completed" } }
        }
      },
      "aggs": {
        "amount": { "sum": { "field": "amount" } }
      }
    }
  }
}

Use Case 2: Faceted Search for E-Commerce

bash

# Search + aggregations for faceted search (the filters panel)
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "điện thoại" } }
      ],
      "filter": [
        { "term": { "category.keyword": "smartphones" } },
        { "range": { "price": { "gte": 5000000, "lte": 20000000 } } },
        { "term": { "brand.keyword": "Samsung" } }
      ]
    }
  },
  "aggs": {
    "all_brands": {
      "global": {},    # Escape the query scope so the brand filter does not limit the brand counts
      "aggs": {
        "filtered": {
          "filter": {  # Re-apply every other filter (everything except brand)
            "bool": {
              "must": [
                { "match": { "name": "điện thoại" } }
              ],
              "filter": [
                { "term": { "category.keyword": "smartphones" } },
                { "range": { "price": { "gte": 5000000, "lte": 20000000 } } }
              ]
            }
          },
          "aggs": {
            "brands": {
              "terms": { "field": "brand.keyword", "size": 20 }
            }
          }
        }
      }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "key": "Under 5 million", "to": 5000000 },
          { "key": "5-10 million", "from": 5000000, "to": 10000000 },
          { "key": "10-20 million", "from": 10000000, "to": 20000000 },
          { "key": "Over 20 million", "from": 20000000 }
        ]
      }
    },
    "avg_rating": {
      "avg": { "field": "rating" }
    },
    "ram_options": {
      "terms": { "field": "specs.ram.keyword", "size": 10 }
    },
    "storage_options": {
      "terms": { "field": "specs.storage.keyword", "size": 10 }
    }
  }
}

Use Case 3: Log Analytics

bash

# Application performance monitoring
GET /application-logs-*/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1h",
        "lte": "now"
      }
    }
  },
  "aggs": {
    "error_rate": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m",
        "format": "HH:mm"
      },
      "aggs": {
        "total_requests": { "value_count": { "field": "request_id" } },
        "error_requests": {
          "filter": {
            "range": { "status_code": { "gte": 500 } }
          }
        },
        "error_pct": {
          "bucket_script": {
            "buckets_path": {
              "errors": "error_requests._count",
              "total": "total_requests"
            },
            "script": "params.errors / params.total * 100"
          }
        },
        "p95_latency": {
          "percentiles": {
            "field": "duration_ms",
            "percents": [95]
          }
        }
      }
    },
    "slow_endpoints": {
      "terms": {
        "field": "endpoint.keyword",
        "size": 10,
        "order": { "avg_duration": "desc" }
      },
      "aggs": {
        "avg_duration": { "avg": { "field": "duration_ms" } },
        "max_duration": { "max": { "field": "duration_ms" } },
        "error_count": {
          "filter": {
            "range": { "status_code": { "gte": 400 } }
          }
        }
      }
    },
    "status_breakdown": {
      "range": {
        "field": "status_code",
        "ranges": [
          { "key": "2xx", "from": 200, "to": 300 },
          { "key": "3xx", "from": 300, "to": 400 },
          { "key": "4xx", "from": 400, "to": 500 },
          { "key": "5xx", "from": 500, "to": 600 }
        ]
      }
    }
  }
}

Use Case 4: Time-Series Analytics with a Moving Average

bash

# Detect anomalies in the number of orders
GET /orders/_search
{
  "size": 0,
  "aggs": {
    "hourly_orders": {
      "date_histogram": {
        "field": "created_at",
        "fixed_interval": "1h",
        "format": "yyyy-MM-dd HH:mm"
      },
      "aggs": {
        "order_count": {
          "value_count": { "field": "order_id" }
        },
        "moving_avg": {
          "moving_fn": {
            "buckets_path": "order_count",
            "window": 24,
            "script": "MovingFunctions.unweightedAvg(values)"
          }
        },
        "moving_std": {
          "moving_fn": {
            "buckets_path": "order_count",
            "window": 24,
            "script": "MovingFunctions.stdDev(values, MovingFunctions.unweightedAvg(values))"
          }
        },
        "upper_bound": {
          "bucket_script": {
            "buckets_path": {
              "avg": "moving_avg",
              "std": "moving_std"
            },
            "script": "params.avg + 2 * params.std"
          }
        },
        "is_anomaly": {
          "bucket_selector": {
            "buckets_path": {
              "count": "order_count",
              "upper": "upper_bound"
            },
            "script": "params.count > params.upper"
          }
        }
      }
    }
  }
}
# Keeps only the hours with an unusually high order count (spike detection)
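
The rule used above (flag a bucket when its count exceeds the trailing mean plus two standard deviations) is easy to prototype offline before running it in the cluster. This is a sketch under stated assumptions: it uses the population standard deviation over the preceding `window` values, the function name `find_spikes` and the sample counts are made up, and it is not a substitute for the aggregation itself.

```python
from statistics import mean, pstdev


def find_spikes(counts: list, window: int = 24, k: float = 2.0) -> list:
    """Return indexes whose count exceeds mean + k * stddev of the trailing window."""
    spikes = []
    for i, count in enumerate(counts):
        prior = counts[max(0, i - window):i]  # values before the current bucket
        if len(prior) < 2:
            continue  # not enough history to estimate a spread
        upper_bound = mean(prior) + k * pstdev(prior)
        if count > upper_bound:
            spikes.append(i)
    return spikes


hourly_orders = [10, 12, 11, 9, 10, 50, 11]
print(find_spikes(hourly_orders, window=5))  # [5]
```

Note that a spike inflates the window that follows it, so the hours right after an anomaly are judged against a wider band.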

9.9 Performance Tips for Aggregations

1. Filter Before Aggregating

bash

# SLOW: no query, so the request runs over all data and filters inside the aggregation
GET /orders/_search
{
  "size": 0,
  "aggs": {
    "recent_revenue": {
      "filter": { "range": { "created_at": { "gte": "now-7d" } } },
      "aggs": {
        "total": { "sum": { "field": "amount" } }
      }
    }
  }
}

# FAST: filter in the query (cacheable) so the aggregation only sees matching documents
GET /orders/_search
{
  "size": 0,
  "query": {
    "range": { "created_at": { "gte": "now-7d" } }
  },
  "aggs": {
    "total_revenue": {
      "sum": { "field": "amount" }
    }
  }
}

2. Limit `size` in `terms`

bash

# Avoid requesting too many buckets
{
  "terms": {
    "field": "product_id",
    "size": 10      # Default is 10; avoid setting 1000+
  }
}

3. Use `doc_values`

Fields used in aggregations need doc_values: true, which is the default for non-analyzed fields:

bash

PUT /orders
{
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword",
        "doc_values": true    # Default: true for keyword
      },
      "amount": {
        "type": "double",
        "doc_values": true    # Default: true for numeric fields
      }
    }
  }
}

4. `eager_global_ordinals` for High-Cardinality Terms

bash

PUT /products
{
  "mappings": {
    "properties": {
      "category": {
        "type": "keyword",
        "eager_global_ordinals": true    # Build global ordinals at refresh time instead of on the first search
      }
    }
  }
}

Chapter 9 Summary

Type	Use when	Examples
Metric	Computing numeric values	avg, sum, min/max, percentiles
Bucket	Grouping documents	terms, date_histogram, range, geo_distance
Pipeline	Aggregating over other aggregations	derivative, cumulative_sum, bucket_selector

When to Use What

Sales dashboard: date_histogram + sum per period
Faceted search: terms + global agg to keep facet counts independent
Performance monitoring: percentiles, date_histogram with moving_fn
Anomaly detection: moving_fn + bucket_selector
Inventory analysis: terms + sub-aggs with stats

Next Steps

→ Chapter 10: Performance Optimization

Chapter 10: Performance Optimization

10.1 Why Optimize?

Elasticsearch can be slow when:

The cluster is not configured appropriately
Queries are inefficient
The mapping uses the wrong data types
Shards are not distributed sensibly
Heap memory is insufficient

Targets used in this chapter: search latency < 100ms, indexing throughput > 10,000 docs/second

10.2 Shard Strategy

Shard Sizing Rules

Shard Size:     10GB - 50GB is ideal
                At most ~65GB (in production)
Shard Count:    Roughly the number of CPU cores (or double)
Total Shards:   < 1000 per node (rule of thumb)
Heap Usage:     ~30MB RAM per shard (rough estimate)

Calculating the shard count:

Shard count = ceil(Total Data Size / Target Shard Size)

Example:
- Data: 500GB
- Target shard size: 25GB
- Primary shards = ceil(500/25) = 20 shards
- With 3 replicas: 20 * 4 = 80 total shards
- With 5 nodes: 80/5 = 16 shards per node (OK!)
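
The worked example above can be wrapped in a small helper for trying other inputs. This is only a sketch of the arithmetic; the function name `plan_shards` and its parameters are hypothetical.

```python
import math


def plan_shards(total_gb: float, target_shard_gb: float, replicas: int, nodes: int) -> dict:
    """Apply the sizing formula: primaries = ceil(total size / target shard size)."""
    primaries = math.ceil(total_gb / target_shard_gb)
    total_shards = primaries * (1 + replicas)  # each primary plus its replica copies
    shards_per_node = math.ceil(total_shards / nodes)
    return {
        "primaries": primaries,
        "total_shards": total_shards,
        "shards_per_node": shards_per_node,
    }


plan = plan_shards(total_gb=500, target_shard_gb=25, replicas=3, nodes=5)
print(plan)  # {'primaries': 20, 'total_shards': 80, 'shards_per_node': 16}
```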

Shard Splitting vs Reindex

bash

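# To split shards (when data grows), block writes on the source index first:
PUT /my-index/_settings
{
  "index.blocks.write": true
}

POST /my-index/_split/my-index-split
{
  "settings": {
    "index.number_of_shards": 6    # Must be a multiple (double, triple...) of the source shard count
  }
}

# To shrink, first move a copy of every shard onto one node and block writes:
PUT /my-index/_settings
{
  "index.routing.allocation.require._name": "node-1",
  "index.blocks.write": true
}

# Then shrink (the target shard count must be a factor of the source count):
POST /my-index/_shrink/my-index-small
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1,
    "index.routing.allocation.require._name": null,   # Clear the settings copied from the source
    "index.blocks.write": null
  }
}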

Oversharding - A Common Mistake

bash

# WRONG: 50 shards for 1GB of data
PUT /small-index
{
  "settings": {
    "number_of_shards": 50    # WAY TOO MANY for small data!
  }
}

# RIGHT:
PUT /small-index
{
  "settings": {
    "number_of_shards": 1,    # 1 shard for data < 20GB
    "number_of_replicas": 1
  }
}

Force Merge for Read-only Indices

bash

# After indexing is finished, merge segments to speed up search:
POST /my-index/_forcemerge?max_num_segments=1

# Only do this on indices that are NO LONGER WRITTEN to!
# (Time-series: previous months, old log indices)

10.3 Mapping Optimization

Disabling `_source` When It Is Not Needed

bash

# If you only need to search and never fetch the original document:
PUT /analytics-events
{
  "mappings": {
    "_source": { "enabled": false }   # Can save a lot of disk space (the actual saving depends on the data)
    # WARNING: no update, reindex, or debugging against the original document afterwards!
  }
}

# Usually the safer option: keep _source and use source filtering
GET /products/_search
{
  "_source": ["name", "price"],   # Return only the fields you need
  "query": { "match_all": {} }
}

Choosing the Right Data Type

bash

# SLOW: IP stored as a plain string
{ "ip_address": "192.168.1.1" }   # mapped as text/keyword

# FAST: use the ip type
{
  "mappings": {
    "properties": {
      "ip_address": { "type": "ip" }    # Native IP type, CIDR queries, range queries
    }
  }
}

# SLOW: date stored as a string
{ "created_at": "2024-01-15 10:30:00" }   # text/keyword

# FAST: date type
{
  "created_at": {
    "type": "date",
    "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
  }
}

Disabling `index` for Fields Used Only for Aggregation/Sort

bash

PUT /products
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },       # index: true (default)
      "price": { "type": "double" },    # index: true, so range queries work
      "internal_cost": {
        "type": "double",
        "index": false    # Not indexed for search, but still usable in aggregations and sorting
      },
      "metadata": {
        "type": "object",
        "enabled": false    # The whole object is kept in _source but not indexed (stored, not searchable)
      }
    }
  }
}

`doc_values` - When to Disable

bash

PUT /logs
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text"          # text fields have no doc_values, so there is nothing to disable
      },
      "log_level": {
        "type": "keyword",
        "doc_values": false     # Disable only if you will NEVER sort/aggregate on this field
                                # Saves disk space
      }
    }
  }
}

`norms` - Disable for Fields That Do Not Need Relevance Scoring

bash

PUT /products
{
  "mappings": {
    "properties": {
      "product_code": {
        "type": "text",
        "norms": false        # Product code: exact match, relevance scoring not needed
      },
      "description": {
        "type": "text",
        "norms": true         # Description: needs relevance scoring (default)
      }
    }
  }
}

10.4 Indexing Performance

Speeding Up Bulk Indexing

bash

# 1. Disable replicas and refresh while loading
PUT /my-index/_settings
{
  "index": {
    "number_of_replicas": 0,    # No replicas during the load
    "refresh_interval": "-1"    # Disable automatic refresh
  }
}

# 2. Bulk index
POST /my-index/_bulk
{ "index": {} }
{ "field1": "value1" }
{ "index": {} }
{ "field2": "value2" }
...

# 3. When done, restore the settings
PUT /my-index/_settings
{
  "index": {
    "number_of_replicas": 1,
    "refresh_interval": "1s"
  }
}

# 4. Force merge
POST /my-index/_forcemerge?max_num_segments=5

Bulk Size Optimization

bash

# Start with 5MB-15MB per batch:
POST /_bulk    # Request body ~10MB
{ "index": { "_index": "products" } }
{ ... }
...
# 1000-5000 documents per request is often a good starting point
# Benchmark against your own cluster
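
One way to keep each request inside such limits is to build the NDJSON bodies on the client and cut a new batch whenever a size or document-count limit is reached. The sketch below uses only the Python standard library; `bulk_bodies`, its parameters, and the default limits are assumptions for illustration, and sending the bodies to `/_bulk` is left to whichever HTTP client you use.

```python
import json
from typing import Iterable, Iterator


def bulk_bodies(docs: Iterable[dict], index: str,
                max_bytes: int = 10 * 1024 * 1024,
                max_docs: int = 5000) -> Iterator[str]:
    """Yield NDJSON bodies for the _bulk API, each within the given limits."""
    action = json.dumps({"index": {"_index": index}})
    lines, size, count = [], 0, 0
    for doc in docs:
        # One action line plus one source line, each terminated by a newline.
        pair = action + "\n" + json.dumps(doc, ensure_ascii=False) + "\n"
        pair_size = len(pair.encode("utf-8"))
        if lines and (size + pair_size > max_bytes or count >= max_docs):
            yield "".join(lines)
            lines, size, count = [], 0, 0
        lines.append(pair)
        size += pair_size
        count += 1
    if lines:
        yield "".join(lines)


docs = [{"n": i} for i in range(12)]
bodies = list(bulk_bodies(docs, "products", max_docs=5))
print(len(bodies))  # 3
```

Every body ends with a newline, which the bulk format requires.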

Refresh Interval

bash

# Default: 1 second (near real-time)
# Increase it to reduce write load:
PUT /high-write-index/_settings
{
  "index.refresh_interval": "30s"    # New documents become searchable within about 30 seconds
}

# Disable it entirely (useful for an initial load):
PUT /initial-load-index/_settings
{
  "index.refresh_interval": "-1"
}

Translog Settings

bash

PUT /my-index/_settings
{
  "index": {
    "translog": {
      "sync_interval": "5s",         # Default: 5s (durability vs performance)
      "durability": "async",         # async: faster, slight data loss risk
      "flush_threshold_size": "1gb"  # Flush translog when > 1GB
    }
  }
}

10.5 Query Optimization

Filter vs Query Context

bash

# SLOW: range in query context (requires scoring)
GET /orders/_search
{
  "query": {
    "bool": {
      "must": [
        { "range": { "created_at": { "gte": "now-7d" } } },  # Must-score
        { "match": { "product_name": "iPhone" } }
      ]
    }
  }
}

# FAST: range in filter context (cacheable, no scoring)
GET /orders/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "product_name": "iPhone" } }   # Scoring needed
      ],
      "filter": [
        { "range": { "created_at": { "gte": "now-7d" } } }   # Cached!
      ]
    }
  }
}

Avoid Wildcards, Especially Leading Wildcards

bash

# SLOW: leading wildcard, every term in the field has to be checked
GET /products/_search
{
  "query": {
    "wildcard": { "sku": "*123" }   # Cannot jump to a position in the sorted term dictionary
  }
}

# FASTER: match on the start of the value (prefix query)
GET /products/_search
{
  "query": {
    "prefix": { "sku": "SKU-123" }  # Faster because terms sharing a prefix sit next to each other
  }
}

# BEST: use edge_ngram to support partial matching

Pagination

bash

# SLOW: deep pagination with from/size
GET /products/_search
{
  "from": 9990,     # Each shard must collect from + size hits, sort them, then discard most
  "size": 10,       # from + size above 10,000 is rejected by default (index.max_result_window)
  "query": { "match_all": {} }
}

# FAST: search_after
GET /products/_search
{
  "size": 10,
  "sort": [
    { "created_at": "desc" },
    { "_id": "asc" }          # The tiebreaker must be unique per document
  ],
  "search_after": ["2024-01-15", "doc-id-12345"]   # Sort values of the last hit on the previous page
}

# BEST for scanning everything: PIT (Point in Time)
POST /products/_pit?keep_alive=5m
# Response: { "id": "pit-id-abc123" }

GET /products/_search
{
  "size": 1000,
  "pit": { "id": "pit-id-abc123", "keep_alive": "1m" },
  "sort": [{ "_shard_doc": "asc" }]  # Efficient sort with PIT
}
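
On the client side, the `search_after` flow boils down to copying the `sort` values of the last hit into the next request. A minimal sketch of that step (the helper name `next_page_request` and the sample hit are hypothetical; no request is actually sent here):

```python
import copy


def next_page_request(base_request: dict, last_hit: dict) -> dict:
    """Build the request for the next page from the last hit of the current page."""
    request = copy.deepcopy(base_request)  # leave the caller's request untouched
    request.pop("from", None)              # from/size paging is not used with search_after
    request["search_after"] = last_hit["sort"]
    return request


base_request = {
    "size": 10,
    "sort": [{"created_at": "desc"}, {"_id": "asc"}],
}
# Hypothetical last hit of the previous page; "sort" is returned by Elasticsearch per hit.
last_hit = {"_id": "doc-id-12345", "sort": [1705276800000, "doc-id-12345"]}

print(next_page_request(base_request, last_hit))
```

Repeat until a page comes back with no hits. The sort clause has to stay identical between pages for the cursor to remain valid.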

`_count` Instead of Counting with an Aggregation

bash

# If you only need a count:
GET /orders/_count
{
  "query": {
    "term": { "status": "completed" }
  }
}
# Simpler and usually cheaper than a search with size=0 plus a value_count aggregation

Query Profiling

bash

GET /products/_search
{
  "profile": true,    # Detailed timing for each query component
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "iPhone" } }
      ],
      "filter": [
        { "term": { "category": "smartphones" } }
      ]
    }
  }
}

# The response includes:
{
  "profile": {
    "shards": [
      {
        "id": "[node1][products][0]",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "description": "+name:iphone #category:smartphones",
                "time_in_nanos": 1234567,
                "breakdown": {
                  "score": 45000,
                  "build_scorer": 125000,
                  "match": 890000
                }
              }
            ],
            "rewrite_time": 12345,
            "collector": [...]
          }
        ]
      }
    ]
  }
}

10.6 Caching

Node Query Cache (Filter Cache)

bash

# Caches the results of filter-context queries automatically
# Cache key = the query itself
GET /orders/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "active" } }   # Filter context: no scoring, eligible for caching when reused
      ]
    }
  }
}

# View cache stats:
GET /_stats/query_cache

# Configuration:
# elasticsearch.yml
# indices.queries.cache.size: 10%   # Default 10% heap

Shard Request Cache

bash

# Caches aggregation results (not hits)
GET /orders/_search
{
  "size": 0,           # IMPORTANT: only size=0 requests are cached by default
  "request_cache": true,
  "aggs": {
    "total": { "sum": { "field": "amount" } }
  }
}

# View the cache:
GET /_stats/request_cache

# The cache is invalidated when the shard refreshes and its segments have changed (new data, merges)

Fielddata Cache

bash

# Used for aggregating/sorting on text fields (not recommended!)
# Text fields have no doc_values, so values must be loaded into the heap

# Configure a limit:
PUT /products/_settings
{
  "index.fielddata.cache": "none"    # Disable the fielddata cache (surfaces problems early)
}

# Or a global limit:
# elasticsearch.yml
# indices.fielddata.cache.size: 20%

# View fielddata usage:
GET /_cat/fielddata?v

# Clear cache:
POST /products/_cache/clear?fielddata=true

10.7 Index Lifecycle Management (ILM)

ILM automatically manages the index lifecycle:

ILM Policy

bash

PUT /_ilm/policy/log-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50GB",
            "max_age": "7d",
            "max_docs": 10000000
          },
          "set_priority": {
            "priority": 100    # Hot indices: highest priority
          }
        }
      },
      "warm": {
        "min_age": "30d",    # 30 days after rollover
        "actions": {
          "shrink": {
            "number_of_shards": 1    # Reduce the shard count
          },
          "forcemerge": {
            "max_num_segments": 1    # Merge down to a single segment
          },
          "allocate": {
            "number_of_replicas": 1  # Replica count for the warm phase
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "freeze": {},            # Freeze the index (read from disk, not kept in memory); frozen indices are deprecated since 7.14, so check your version
          "allocate": {
            "number_of_replicas": 0,  # No replicas
            "require": {
              "data": "cold"     # Move to nodes tagged with the custom attribute data=cold
            }
          },
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "365d",    # Delete after 1 year
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Index Template with ILM

bash

PUT /_index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "index.lifecycle.name": "log-policy",           # Attach the ILM policy
      "index.lifecycle.rollover_alias": "logs-active" # Alias pointing at the current write index
    }
  }
}

# Bootstrap first index:
PUT /logs-000001
{
  "aliases": {
    "logs-active": {
      "is_write_index": true    # Writes through the alias go to this index only
    }
  }
}

# Write via alias:
POST /logs-active/_doc
{ "message": "log message", "@timestamp": "2024-01-15T10:30:00Z" }

Manual Rollover (When Needed)

bash

POST /logs-active/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 5000000
  }
}
# Creates logs-000002 if any condition is met

10.8 Slow Log

bash

# Configure slow logs to identify slow queries
# (the *.slowlog.level settings below are 7.x settings; they were removed in 8.0):
PUT /products/_settings
{
  "index.search.slowlog.threshold.query.warn": "2s",
  "index.search.slowlog.threshold.query.info": "1s",
  "index.search.slowlog.threshold.query.debug": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.level": "info"
}

PUT /products/_settings
{
  "index.indexing.slowlog.threshold.index.warn": "5s",
  "index.indexing.slowlog.threshold.index.info": "2s",
  "index.indexing.slowlog.level": "info"
}

# Entries appear in: logs/elasticsearch_index_search_slowlog.json

10.9 Hardware và JVM Tuning

JVM Heap Settings

bash

# jvm.options:
# Rules:
# 1. At most 50% of physical RAM
# 2. Do NOT exceed ~31GB (compressed oops threshold)
# 3. Xms = Xmx (avoids heap resizing at runtime)

-Xms16g      # Min heap
-Xmx16g      # Max heap (= Min)

# On a machine with 32GB RAM:
# 16GB heap for ES, the remaining 16GB for the OS file cache (which Lucene relies on!)
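
The first two rules can be combined into a one-line calculation. A minimal sketch, assuming whole gigabytes and the 31GB cap quoted above (the function name `recommended_heap_gb` is made up; treat the result as a starting point rather than a tuned value):

```python
def recommended_heap_gb(ram_gb: int, cap_gb: int = 31) -> int:
    """Half of physical RAM, capped to stay under the compressed-oops threshold."""
    return max(1, min(ram_gb // 2, cap_gb))


for ram in (8, 32, 128):
    heap = recommended_heap_gb(ram)
    # Xms and Xmx get the same value, per rule 3.
    print(f"{ram}GB RAM -> -Xms{heap}g -Xmx{heap}g")
```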

Swap

bash

# Disable swap entirely:
swapoff -a
# And in /etc/fstab: comment out the swap lines

# Or lock the memory:
# elasticsearch.yml
bootstrap.memory_lock: true

# Verify:
GET /_nodes/stats/process?filter_path=**.mlockall

Disk

bash

# SSD: Intel NVMe or equivalent
# RAID: RAID-0 (ES already replicates data)
# Avoid: NFS and other network storage for data nodes

# elasticsearch.yml
path.data:
  - /data1/elasticsearch    # Multiple data paths (deprecated since 7.13)
  - /data2/elasticsearch    # Spreads data across multiple disks

# Disk-based allocation:
cluster.routing.allocation.disk.watermark.low: "85%"    # Stop allocating
cluster.routing.allocation.disk.watermark.high: "90%"   # Start moving
cluster.routing.allocation.disk.watermark.flood_stage: "95%"  # Read-only!

10.10 Concurrent Search Optimization

`batched_reduce_size`

bash

# batched_reduce_size is a URL parameter: shard results are reduced in batches of 512 (saves memory on the coordinating node)
GET /products/_search?batched_reduce_size=512
{
  "query": { "match_all": {} }
}

Adaptive Replica Selection

bash

# elasticsearch.yml
cluster.routing.use_adaptive_replica_selection: true   # Default: true
# ES routes each request to the shard copy it expects to respond fastest

Search Thread Pool

bash

# elasticsearch.yml
thread_pool.search.size: 14      # Default: int((CPU_count * 3) / 2) + 1
thread_pool.search.queue_size: 1000
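
For reference, the sizing formula in the comment above works out as follows. This is a sketch of the arithmetic only; the function name is illustrative, and `cpus` stands for the number of processors allocated to the node.

```python
def default_search_pool_size(cpus: int) -> int:
    """int((cpus * 3) / 2) + 1, using integer division."""
    return (cpus * 3) // 2 + 1


for cpus in (4, 8, 9, 16):
    print(cpus, "->", default_search_pool_size(cpus))
```

With 9 processors the formula gives 14, which matches the value in the configuration above.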

10.11 Monitoring Performance

Cluster Health và Stats

bash

# Cluster health:
GET /_cluster/health?pretty

# Node stats:
GET /_nodes/stats?filter_path=nodes.*.indices.search,nodes.*.indices.indexing

# Hot threads:
GET /_nodes/hot_threads

# Task API:
GET /_tasks?actions=*search*&detailed=true

# Pending tasks:
GET /_cluster/pending_tasks

Cat API

bash

GET /_cat/health?v              # Cluster health
GET /_cat/nodes?v&h=name,heap.percent,cpu,load_1m,master
GET /_cat/indices?v&h=index,docs.count,store.size&s=store.size:desc
GET /_cat/shards?v&h=index,shard,prirep,state,docs,store,node
GET /_cat/segments?v            # Segment count per shard
GET /_cat/allocation?v          # Disk usage per node
GET /_cat/recovery?v&active_only=true  # Active recoveries
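
Cat API output is easy to post-process when requested with `format=json`, which returns an array of row objects whose values are strings. The sketch below flags nodes with high heap usage; `nodesOverHeap` and the 75% threshold are assumptions for illustration, and the rows are hand-written sample data rather than real cluster output.

```javascript
// Hypothetical helper: rows are shaped like the JSON form of
// GET /_cat/nodes?h=name,heap.percent&format=json (values arrive as strings).
function nodesOverHeap(catNodesRows, thresholdPercent = 75) {
  return catNodesRows
    .filter(row => Number(row['heap.percent']) > thresholdPercent)
    .map(row => row.name);
}

// Sample rows for illustration only.
const sampleRows = [
  { name: 'node-1', 'heap.percent': '82' },
  { name: 'node-2', 'heap.percent': '41' }
];
console.log(nodesOverHeap(sampleRows));
```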

Performance Optimization Summary

Area	Quick Win
Mapping	Use the right types; disable index/norms/fielddata where not needed
Sharding	10-50GB per shard; do not over-shard
Indexing	Disable replicas and refresh during bulk loads
Queries	Use filter context; avoid leading wildcards
Pagination	Use search_after instead of deep from/size
Caching	Take advantage of the filter cache; relax near-real-time refresh where it is not needed
ILM	Automatic rollover, warm/cold tiers
JVM	50% of RAM, max 31GB, Xms=Xmx
Disk	SSD, RAID-0, swap disabled

Next Steps

→ Chapter 11: Advanced Features

Chapter 11: Advanced Features

11.1 Completion Suggester - High-Performance Autocomplete

The completion suggester is purpose-built for autocomplete. It is usually much faster than a prefix query because suggestions are served from an in-memory data structure (an FST) rather than through a regular search:

Mapping for Completion

bash

PUT /products
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "suggest": {
        "type": "completion",
        "analyzer": "simple",
        "search_analyzer": "simple",
        "max_input_length": 50,
        "contexts": [          # Optional: Context-aware suggestions
          {
            "name": "category",
            "type": "category"
          }
        ]
      }
    }
  }
}

Indexing with the Suggest Field

bash

PUT /products/_doc/1
{
  "name": "iPhone 15 Pro Max 256GB",
  "suggest": {
    "input": [
      "iPhone 15 Pro Max",
      "iPhone 15",
      "iPhone Pro Max",
      "điện thoại Apple"
    ],
    "weight": 90,    # Ranking priority (higher = ranked first)
    "contexts": {    # Required: the mapping defines a "category" context
      "category": ["smartphone", "apple"]
    }
  }
}

PUT /products/_doc/2
{
  "name": "Samsung Galaxy S24 Ultra",
  "suggest": {
    "input": [
      "Samsung Galaxy S24 Ultra",
      "Samsung S24",
      "Galaxy Ultra"
    ],
    "weight": 85,
    "contexts": {
      "category": ["smartphone", "samsung"]
    }
  }
}

# A document in a different category:
PUT /products/_doc/3
{
  "name": "MacBook Pro 14 M3",
  "suggest": {
    "input": ["MacBook Pro", "MacBook M3"],
    "weight": 80,
    "contexts": {
      "category": ["laptop", "apple"]
    }
  }
}

Querying Completion Suggester

bash

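POST /products/_search
{
  "suggest": {
    "product_suggest": {
      "prefix": "iph",      # What the user has typed so far
      "completion": {
        "field": "suggest",
        "fuzzy": {
          "fuzziness": 1    # Tolerate one typo
        },
        "size": 5,          # Number of suggestions to return
        "contexts": {
          "category": ["smartphone"]   # A context is mandatory on a context-enabled field
        }
      }
    }
  }
}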

# Context-aware suggestion:
POST /products/_search
{
  "suggest": {
    "product_suggest": {
      "prefix": "mac",
      "completion": {
        "field": "suggest",
        "size": 5,
        "contexts": {
          "category": ["laptop"]    # Only suggest laptops
        }
      }
    }
  }
}

# Response:
{
  "suggest": {
    "product_suggest": [
      {
        "text": "iph",
        "options": [
          {
            "text": "iPhone 15 Pro Max",
            "_score": 90,
            "_id": "1",
            "_source": { "name": "iPhone 15 Pro Max 256GB" }
          }
        ]
      }
    ]
  }
}
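
On the application side, the suggest response usually needs to be flattened into a list for the UI. A minimal sketch, assuming the response shape shown above; `extractSuggestions` is a hypothetical helper name, and it drops repeated `_id` values so a document appears only once.

```javascript
// Hypothetical helper: flatten a completion-suggester response into
// { id, text, score } items, keeping the first option seen per document.
function extractSuggestions(searchResponse, suggestName) {
  const entries = (searchResponse.suggest && searchResponse.suggest[suggestName]) || [];
  const seen = new Set();
  const items = [];
  for (const entry of entries) {
    for (const opt of entry.options || []) {
      if (seen.has(opt._id)) continue;
      seen.add(opt._id);
      items.push({ id: opt._id, text: opt.text, score: opt._score });
    }
  }
  return items;
}

// Sample response copied from the example above.
const sampleResponse = {
  suggest: {
    product_suggest: [
      {
        text: 'iph',
        options: [
          { text: 'iPhone 15 Pro Max', _score: 90, _id: '1', _source: { name: 'iPhone 15 Pro Max 256GB' } }
        ]
      }
    ]
  }
};
console.log(extractSuggestions(sampleResponse, 'product_suggest'));
```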

11.2 Term Suggester - Spell Correction

Suggests corrections when the user misspells a term:

bash

POST /products/_search
{
  "suggest": {
    "spelling_correction": {
      "text": "samsng gaalaxy",   # Misspelled input
      "term": {
        "field": "name",
        "suggest_mode": "missing",     # missing | popular | always
        "min_word_length": 3,
        "min_doc_freq": 1,
        "max_edits": 2,                # Edit distance
        "sort": "score"                # score | frequency
      }
    }
  }
}

# Response:
{
  "suggest": {
    "spelling_correction": [
      {
        "text": "samsng",
        "options": [
          { "text": "samsung", "score": 0.75, "freq": 450 }
        ]
      },
      {
        "text": "gaalaxy",
        "options": [
          { "text": "galaxy", "score": 0.8, "freq": 380 }
        ]
      }
    ]
  }
}
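
A "did you mean" string can be assembled from this response by swapping each misspelled token for its top option. The sketch below assumes the abbreviated response shape shown above; `applyCorrections` is a hypothetical helper, and it replaces only the first occurrence of each token.

```javascript
// Hypothetical helper: build a corrected query from term-suggester entries.
function applyCorrections(originalText, suggestEntries) {
  let corrected = originalText;
  for (const entry of suggestEntries) {
    const best = entry.options && entry.options[0];
    // A function replacer avoids special "$" patterns in the replacement text.
    if (best) corrected = corrected.replace(entry.text, () => best.text);
  }
  return corrected;
}

const sampleEntries = [
  { text: 'samsng', options: [{ text: 'samsung', score: 0.75, freq: 450 }] },
  { text: 'gaalaxy', options: [{ text: 'galaxy', score: 0.8, freq: 380 }] }
];
console.log(applyCorrections('samsng gaalaxy', sampleEntries));
```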

Phrase Suggester

bash

POST /articles/_search
{
  "suggest": {
    "phrase_suggestion": {
      "text": "điên thoại di động",   # "điên" was typed instead of "điện"
      "phrase": {
        "field": "content",
        "max_errors": 2,
        "confidence": 1.0,
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}

11.3 Percolator - Reverse Search

Instead of finding documents that match a query, the percolator finds stored queries that match a document:

bash

# Use case: notification system
# A user subscribes: "Notify me when an iPhone 15 drops below 25 million VND"

# 1. Create the percolator index (it stores queries)
PUT /price-alerts
{
  "mappings": {
    "properties": {
      "query": {
        "type": "percolator"   # Stores the query DSL
      },
      "user_id": { "type": "keyword" },
      # Fields referenced by the stored queries must be mapped here as well:
      "name": {
        "type": "text",
        "analyzer": "standard"
      },
      "price": { "type": "double" }
    }
  }
}

# 2. Index user alerts (as percolator queries)
PUT /price-alerts/_doc/alert-001
{
  "user_id": "user123",
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "iPhone 15" } }
      ],
      "filter": [
        { "range": { "price": { "lte": 25000000 } } }
      ]
    }
  }
}

PUT /price-alerts/_doc/alert-002
{
  "user_id": "user456",
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "MacBook Pro" } }
      ],
      "filter": [
        { "range": { "price": { "lte": 40000000 } } }
      ]
    }
  }
}

# 3. When a product is added or updated, find the users to notify:
GET /price-alerts/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {               # The product that was just added or updated
        "name": "iPhone 15 Pro Max 256GB Xanh",
        "price": 23500000,
        "brand": "Apple",
        "category": "smartphones"
      }
    }
  }
}

# Response: alert-001 (user123) matches → send a notification to user123
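
Turning the percolate hits into notifications is then a plain mapping over the response. A minimal sketch, assuming the standard search response shape where each hit carries the stored alert in `_source`; `usersToNotify` is a hypothetical helper and the response below is hand-written sample data.

```javascript
// Hypothetical helper: each hit of a percolate search is a stored alert
// whose query matched the product, so its user_id should be notified.
function usersToNotify(percolateResponse) {
  const hits = (percolateResponse.hits && percolateResponse.hits.hits) || [];
  return hits.map(hit => ({ alertId: hit._id, userId: hit._source.user_id }));
}

// Sample response for illustration only.
const sampleHits = {
  hits: {
    hits: [{ _id: 'alert-001', _source: { user_id: 'user123' } }]
  }
};
console.log(usersToNotify(sampleHits));
```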

Bulk Percolation (Multiple Documents)

bash

GET /price-alerts/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "documents": [       # Several documents in one request
        { "name": "iPhone 15", "price": 23000000 },
        { "name": "Samsung S24", "price": 18000000 }
      ]
    }
  }
}

11.4 Async Search

For long-running queries (analytics, reports):

bash

# Submit async search
POST /orders/_async_search?wait_for_completion_timeout=1s
{
  "aggs": {
    "monthly_revenue": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "month"
      },
      "aggs": {
        "revenue": { "sum": { "field": "amount" } }
      }
    }
  }
}

# The response returns immediately:
{
  "id": "async-search-id-abc123",
  "is_partial": true,      # Not finished yet
  "is_running": true,
  "start_time_in_millis": 1704067200000,
  "expiration_time_in_millis": 1704153600000,
  "response": {
    "took": 150,
    "hits": { "total": { "value": 10000 } }
  }
}

# Then poll for the result:
GET /_async_search/async-search-id-abc123
# is_running becomes false once the search has finished

# Delete it when it is no longer needed:
DELETE /_async_search/async-search-id-abc123

11.5 Cross-Cluster Search (CCS)

Search across multiple clusters:

bash

# Configure the remote clusters:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.remote": {
      "cluster-us-west": {
        "seeds": ["node1.us-west.example.com:9300"]
      },
      "cluster-eu-central": {
        "seeds": ["node1.eu-central.example.com:9300"]
      }
    }
  }
}

# Search across several clusters:
GET /logs-*,cluster-us-west:logs-*,cluster-eu-central:logs-*/_search
{
  "query": {
    "match": { "message": "error" }
  }
}

# Aggregate across several clusters:
GET /cluster-us-west:orders,cluster-eu-central:orders/_search
{
  "size": 0,
  "aggs": {
    "global_revenue": { "sum": { "field": "amount" } }
  }
}
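
The index expression used in these requests follows a simple `<cluster-alias>:<index-pattern>` convention, so it can be generated from a list of aliases. A small sketch; `ccsIndexExpression` is a hypothetical helper name.

```javascript
// Hypothetical helper: build a cross-cluster index expression from the
// remote cluster aliases, optionally including the local cluster.
function ccsIndexExpression(pattern, remoteClusters, includeLocal = true) {
  const parts = remoteClusters.map(alias => `${alias}:${pattern}`);
  if (includeLocal) parts.unshift(pattern);
  return parts.join(',');
}

console.log(ccsIndexExpression('logs-*', ['cluster-us-west', 'cluster-eu-central']));
console.log(ccsIndexExpression('orders', ['cluster-us-west', 'cluster-eu-central'], false));
```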

11.6 EQL - Event Query Language

EQL targets security event analysis and the detection of event sequences:

bash

# Detect a brute-force attack:
# Sequence: several failed logins followed by one successful login
GET /security-events/_eql/search
{
  "query": """
    sequence with maxspan=5m
      [authentication where event.outcome == "failure"] with runs=5
      [authentication where event.outcome == "success"]
  """,
  "filter": {
    "term": { "source.ip": "192.168.1.100" }
  }
}

# Find events that match a pattern:
GET /security-events/_eql/search
{
  "query": """
    process where process.name == "cmd.exe" 
      and process.parent.name == "excel.exe"
  """
}
# Detects Excel spawning cmd.exe (a typical sign of macro malware)

11.7 Data Streams

Data streams are optimized for append-only time-series data (logs, metrics):

bash

# Create an index lifecycle policy
PUT /_ilm/policy/data-stream-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50GB",
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}

# Create an index template for the data stream:
PUT /_index_template/logs-template
{
  "index_patterns": ["logs-myapp-*"],
  "priority": 200,            # Must outrank the built-in logs-*-* template (priority 100)
  "data_stream": {},          # Mark as data stream template
  "template": {
    "settings": {
      "index.lifecycle.name": "data-stream-policy"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },   # REQUIRED field!
        "message": { "type": "text" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" }
      }
    }
  }
}

# Create the data stream:
PUT /_data_stream/logs-myapp-production

# Index into the data stream:
POST /logs-myapp-production/_doc
{
  "@timestamp": "2024-01-15T10:30:00Z",
  "message": "User login successful",
  "level": "INFO",
  "service": "auth-service"
}

# Search trên data stream (transparent, search all backing indices):
GET /logs-myapp-production/_search
{
  "query": {
    "range": {
      "@timestamp": { "gte": "now-1h" }
    }
  }
}

# Inspect the data stream:
GET /_data_stream/logs-myapp-production

# Manual rollover:
POST /logs-myapp-production/_rollover
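
Each rollover adds a new hidden backing index behind the stream. As a sketch of the naming convention used by recent Elasticsearch versions (`.ds-<data-stream>-<yyyy.MM.dd>-<generation>`, with a six-digit zero-padded generation), the hypothetical helper below reproduces the expected name; treat the exact format as an assumption to verify against your version.

```javascript
// Hypothetical helper: expected backing index name for a data stream,
// assuming the .ds-<stream>-<yyyy.MM.dd>-<generation> convention.
function backingIndexName(dataStream, createdAt, generation) {
  const yyyy = createdAt.getUTCFullYear();
  const mm = String(createdAt.getUTCMonth() + 1).padStart(2, '0');
  const dd = String(createdAt.getUTCDate()).padStart(2, '0');
  const gen = String(generation).padStart(6, '0');
  return `.ds-${dataStream}-${yyyy}.${mm}.${dd}-${gen}`;
}

console.log(backingIndexName('logs-myapp-production', new Date('2024-01-15T10:30:00Z'), 1));
```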

11.8 Transform API

A transform summarizes source data into a separate destination index, much like a pivot table:

bash

# Create a transform: pivot orders into daily revenue per product
PUT /_transform/daily-product-revenue
{
  "source": {
    "index": "orders",
    "query": {
      "range": {
        "created_at": { "gte": "now-90d" }
      }
    }
  },
  "dest": {
    "index": "revenue-by-product-daily"
  },
  "pivot": {
    "group_by": {
      "date": {
        "date_histogram": {
          "field": "created_at",
          "calendar_interval": "day"
        }
      },
      "product_id": {
        "terms": { "field": "product_id" }
      }
    },
    "aggregations": {
      "total_revenue": {
        "sum": { "field": "amount" }
      },
      "order_count": {
        "value_count": { "field": "order_id" }
      },
      "avg_price": {
        "avg": { "field": "price" }
      }
    }
  },
  "sync": {
    "time": {
      "field": "created_at",
      "delay": "60s"    # Continuously update with 60s delay
    }
  }
}

# Start transform:
POST /_transform/daily-product-revenue/_start

# View stats:
GET /_transform/daily-product-revenue/_stats

# Query destination index:
GET /revenue-by-product-daily/_search
{
  "query": {
    "term": { "product_id": "PROD001" }
  },
  "sort": [{ "date": "desc" }]
}
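
To see what the pivot computes, the same grouping can be reproduced in memory on a few sample orders. This is only an illustration of the `group_by` and aggregation semantics, not how the transform runs; `pivotDailyRevenue` is a hypothetical function and it assumes `created_at` is an ISO-8601 UTC timestamp.

```javascript
// In-memory illustration of the pivot: group by (day, product_id),
// then sum the amount and count the orders in each group.
function pivotDailyRevenue(orders) {
  const buckets = new Map();
  for (const order of orders) {
    const day = order.created_at.slice(0, 10); // "YYYY-MM-DD" prefix of the timestamp
    const key = `${day}|${order.product_id}`;
    const bucket = buckets.get(key) ||
      { date: day, product_id: order.product_id, total_revenue: 0, order_count: 0 };
    bucket.total_revenue += order.amount;
    bucket.order_count += 1;
    buckets.set(key, bucket);
  }
  return [...buckets.values()];
}

// Sample data for illustration only.
const sampleOrders = [
  { created_at: '2024-01-15T10:00:00Z', product_id: 'PROD001', amount: 100 },
  { created_at: '2024-01-15T18:00:00Z', product_id: 'PROD001', amount: 50 },
  { created_at: '2024-01-16T09:00:00Z', product_id: 'PROD001', amount: 70 }
];
console.log(pivotDailyRevenue(sampleOrders));
```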

11.9 Enrich Processor (Ingest Pipeline)

Enriches incoming documents with data from another index at ingest time:

bash

# 1. Create the source index (lookup table):
PUT /user-profiles
{
  "mappings": {
    "properties": {
      "user_id": { "type": "keyword" },
      "name": { "type": "text" },
      "tier": { "type": "keyword" },
      "location": { "type": "keyword" }
    }
  }
}

PUT /user-profiles/_doc/1
{ "user_id": "USR001", "name": "Nguyễn Văn A", "tier": "gold", "location": "TPHCM" }

# 2. Create the enrich policy, then execute it to build the enrich index:
PUT /_enrich/policy/user-enrich-policy
{
  "match": {
    "indices": "user-profiles",
    "match_field": "user_id",
    "enrich_fields": ["name", "tier", "location"]
  }
}

POST /_enrich/policy/user-enrich-policy/_execute

# 3. Create an ingest pipeline that uses the enrich processor:
PUT /_ingest/pipeline/order-enrich-pipeline
{
  "processors": [
    {
      "enrich": {
        "policy_name": "user-enrich-policy",
        "field": "customer_id",       # Field in document to match
        "target_field": "customer",   # Where to put enriched data
        "max_matches": 1
      }
    }
  ]
}

# 4. Index through the pipeline:
POST /orders/_doc?pipeline=order-enrich-pipeline
{
  "order_id": "ORD001",
  "customer_id": "USR001",
  "amount": 1500000
}

# The stored document:
{
  "order_id": "ORD001",
  "customer_id": "USR001",
  "amount": 1500000,
  "customer": {                  # Enriched!
    "user_id": "USR001",         # The match field is copied along with the enrich fields
    "name": "Nguyễn Văn A",
    "tier": "gold",
    "location": "TPHCM"
  }
}
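
Conceptually, the enrich processor performs a keyed lookup and copies the matching row into the target field. The in-memory sketch below mimics that behavior for a `match` policy; `enrichDocument` and its option names are hypothetical, and like the real processor it copies the match field together with the enrich fields.

```javascript
// In-memory illustration of a "match" enrich policy (not the real processor).
function enrichDocument(doc, lookupRows, { matchField, field, targetField, enrichFields }) {
  const match = lookupRows.find(row => row[matchField] === doc[field]);
  if (!match) return doc; // no match: the document passes through unchanged
  const enriched = { [matchField]: match[matchField] };
  for (const name of enrichFields) enriched[name] = match[name];
  return { ...doc, [targetField]: enriched };
}

const userProfiles = [
  { user_id: 'USR001', name: 'Nguyễn Văn A', tier: 'gold', location: 'TPHCM' }
];
const enrichedOrder = enrichDocument(
  { order_id: 'ORD001', customer_id: 'USR001', amount: 1500000 },
  userProfiles,
  { matchField: 'user_id', field: 'customer_id', targetField: 'customer', enrichFields: ['name', 'tier', 'location'] }
);
console.log(enrichedOrder);
```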

11.10 Cross-Fields Query

Search for terms that are spread across several fields:

bash

# The terms of "Nguyễn Văn A" may be split between first_name and last_name:
GET /users/_search
{
  "query": {
    "multi_match": {
      "query": "Nguyễn Văn A",
      "fields": ["first_name", "last_name", "full_name"],
      "type": "cross_fields",    # Treat as one big field!
      "operator": "and"
    }
  }
}

# Alternatively, use copy_to in the mapping to combine the fields:
PUT /users
{
  "mappings": {
    "properties": {
      "first_name": {
        "type": "text",
        "copy_to": "full_name"   # Copied into the combined field at index time
      },
      "last_name": {
        "type": "text",
        "copy_to": "full_name"
      },
      "full_name": {
        "type": "text"           # Indexed and searchable, but absent from _source
      }
    }
  }
}

GET /users/_search
{
  "query": {
    "match": {
      "full_name": "Nguyễn Văn A"   # Search the combined field
    }
  }
}

11.11 Runtime Fields

Runtime fields are computed at query time (their values are neither indexed nor stored):

bash

PUT /orders
{
  "mappings": {
    "runtime": {
      "is_high_value": {
        "type": "boolean",
        "script": {
          "source": "emit(doc['amount'].value > 10000000)"
        }
      },
      "total_with_tax": {
        "type": "double",
        "script": {
          "source": "emit(doc['amount'].value * 1.1)"  # +10% VAT
        }
      },
      "order_day_of_week": {
        "type": "keyword",
        "script": {
          "source": """
            ZonedDateTime date = doc['created_at'].value;
            emit(date.getDayOfWeek().toString());
          """
        }
      }
    }
  }
}

# Query on runtime field:
GET /orders/_search
{
  "query": {
    "term": { "is_high_value": true }
  },
  "_source": ["order_id", "amount"],
  "fields": ["total_with_tax", "order_day_of_week"]
}

# Agg on runtime field:
GET /orders/_search
{
  "size": 0,
  "aggs": {
    "orders_by_day": {
      "terms": { "field": "order_day_of_week" }
    }
  }
}

Runtime Fields Trong Query (Ad-hoc)

bash

GET /orders/_search
{
  "runtime_mappings": {
    "discount_amount": {
      "type": "double",
      "script": "emit(doc['original_price'].value - doc['final_price'].value)"
    }
  },
  "query": {
    "range": {
      "discount_amount": { "gte": 100000 }    # Orders discounted by at least 100k
    }
  }
}

11.12 Search As You Type

A dedicated field type optimized for search-as-you-type:

bash

PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "search_as_you_type",      # Automatically creates 3 sub-fields
        "max_shingle_size": 3
      }
    }
  }
}

# Created automatically:
# name              (standard analysis)
# name._2gram       (2-term shingles)
# name._3gram       (3-term shingles)
# name._index_prefix (edge n-grams)

# Query:
GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "iphone pro",
      "type": "bool_prefix",    # Specific type for SAYT
      "fields": [
        "name",
        "name._2gram",
        "name._3gram",
        "name._index_prefix"
      ]
    }
  }
}
# "iphone pro" → matches "iPhone 15 Pro Max"
# Results come back on every keystroke, because the last term is matched as a prefix
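
To get an intuition for what the sub-fields hold, the sketch below generates shingles and edge n-grams for a product name. It is a rough approximation only: the real analysis chain differs in detail, and `tokenize`, `shingles`, and `edgeNgrams` are hypothetical helpers.

```javascript
// Rough approximation of the terms behind a search_as_you_type field.
function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Word n-grams (shingles), similar in spirit to name._2gram / name._3gram.
function shingles(tokens, size) {
  const out = [];
  for (let i = 0; i + size <= tokens.length; i++) {
    out.push(tokens.slice(i, i + size).join(' '));
  }
  return out;
}

// Leading substrings of a term, similar in spirit to name._index_prefix.
function edgeNgrams(term, maxLength = 20) {
  const out = [];
  for (let n = 1; n <= Math.min(term.length, maxLength); n++) {
    out.push(term.slice(0, n));
  }
  return out;
}

const nameTokens = tokenize('iPhone 15 Pro Max');
console.log(shingles(nameTokens, 2));
console.log(shingles(nameTokens, 3));
console.log(edgeNgrams('pro'));
```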

11.13 Security - Field-Level và Document-Level

bash

# Field-level security: hide sensitive fields
POST /_security/role/product-viewer
{
  "indices": [
    {
      "names": ["products"],
      "privileges": ["read"],
      "field_security": {
        "grant": ["*"],
        "except": ["cost_price", "supplier_id"]  # Hide internal fields
      }
    }
  ]
}

# Document-level security: expose active products only
POST /_security/role/customer-role
{
  "indices": [
    {
      "names": ["products"],
      "privileges": ["read"],
      "query": {
        "term": { "status": "active" }   # Only active products are visible
      }
    }
  ]
}

# Users only see their own orders (templated role query using the username):
POST /_security/role/customer-order-role
{
  "indices": [
    {
      "names": ["orders"],
      "privileges": ["read"],
      "query": {
        "template": {
          "source": {
            "term": { "customer_username": "{{_user.username}}" }
          }
        }
      }
    }
  ]
}

Chapter 11 Summary

Feature	Use when
Completion Suggester	You need the fastest autocomplete
Term Suggester	Spell correction
Percolator	Reverse search, alerts/notifications
Async Search	Long-running queries (analytics, reports)
Cross-cluster Search	Searching across multiple clusters
EQL	Security event analysis, sequence detection
Data Streams	Time-series data (logs, metrics)
Transform	Pivot tables, continuous aggregations
Enrich Processor	Enriching data at index time
Runtime Fields	Ad-hoc computed fields
Search As You Type	Real-time search-as-you-type
Field/Doc-level Security	Fine-grained access control

Next Steps

→ Chapter 12: Real-world Use Cases

Chapter 12: Real-world Use Cases

12.1 Use Case 1: E-Commerce Search Engine (Tiki/Shopee-style)

System Architecture

                    ┌─────────────────────────────────────┐
                    │           API Gateway               │
                    └────────────────┬────────────────────┘
                                     │
               ┌─────────────────────┼─────────────────────┐
               │                     │                     │
        ┌──────▼──────┐      ┌───────▼───────┐     ┌───────▼──────┐
        │  Product    │      │   Search      │     │  Analytics   │
        │  Service    │      │   Service     │     │   Service    │
        └──────┬──────┘      └───────┬───────┘     └───────┬──────┘
               │                     │                     │
               │        ┌────────────▼────────────┐        │
               │        │      Elasticsearch      │        │
               │        │   (3-node production)   │        │
               │        └─────────────────────────┘        │
               │                                           │
        ┌──────▼───────────────────────────────────────────▼──────┐
        │                       PostgreSQL                        │
        │             (Source of truth for products)              │
        └─────────────────────────────────────────────────────────┘

Step 1: Design the Index Mapping

bash

PUT /ecommerce-products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "char_filter": {
        "html_strip": { "type": "html_strip" },
        "special_chars": {
          "type": "mapping",
          "mappings": ["& => and", "₫ => vnd", "% => percent"]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "filter": {
        "vi_stopwords": {
          "type": "stop",
          "stopwords": ["là","và","của","có","trong","với","được","để","cho","từ","tại"]
        },
        "product_synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "điện thoại, smartphone, dt, mobile",
            "laptop, notebook, máy tính xách tay",
            "iphone => apple iphone",
            "macbook => apple macbook",
            "usb-c, usbc, type-c"
          ],
          "lenient": true
        },
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "vi_product_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "special_chars"],
          "tokenizer": "standard",
          "filter": ["lowercase", "vi_stopwords"]
        },
        "vi_search_analyzer": {
          "type": "custom",
          "char_filter": ["special_chars"],
          "tokenizer": "standard",
          "filter": ["lowercase", "vi_stopwords", "product_synonyms"]
        },
        "autocomplete_index": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_ngram_filter"]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        },
        "path_analyzer": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" },
      "sku": { "type": "keyword" },
      "name": {
        "type": "text",
        "analyzer": "vi_product_analyzer",
        "search_analyzer": "vi_search_analyzer",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 },
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete_index",
            "search_analyzer": "autocomplete_search"
          },
          "ngram": {
            "type": "text",
            "analyzer": "vi_product_analyzer"
          }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "vi_product_analyzer",
        "search_analyzer": "vi_search_analyzer",
        "norms": false
      },
      "brand": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "category": {
        "type": "keyword"
      },
      "category_path": {
        "type": "text",
        "analyzer": "path_analyzer",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "price": { "type": "double" },
      "original_price": { "type": "double" },
      "discount_percent": { "type": "integer" },
      "currency": { "type": "keyword" },
      "rating": { "type": "float" },
      "review_count": { "type": "integer" },
      "sold_count": { "type": "long" },
      "stock_quantity": { "type": "integer" },
      "status": { "type": "keyword" },
      "is_featured": { "type": "boolean" },
      "is_flash_sale": { "type": "boolean" },
      "flash_sale_end": { "type": "date" },
      "images": { "type": "keyword", "index": false },
      "tags": { "type": "keyword" },
      "attributes": {
        "type": "nested",
        "properties": {
          "name": { "type": "keyword" },
          "value": { "type": "keyword" }
        }
      },
      "shipping": {
        "properties": {
          "weight_g": { "type": "integer" },
          "free_shipping": { "type": "boolean" },
          "estimated_days": { "type": "integer" }
        }
      },
      "seller": {
        "properties": {
          "seller_id": { "type": "keyword" },
          "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
          "is_official": { "type": "boolean" },
          "rating": { "type": "float" }
        }
      },
      "location": { "type": "geo_point" },
      "created_at": { "type": "date" },
      "updated_at": { "type": "date" },
      "boost_score": { "type": "float" },
      "suggest": {
        "type": "completion",
        "analyzer": "simple"
      }
    }
  }
}

Step 2: Product Search Query

bash

GET /ecommerce-products/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "multi_match": {
                "query": "điện thoại samsung 5g",
                "fields": [
                  "name^5",          # Boost name
                  "brand^3",
                  "tags^2",
                  "description"
                ],
                "type": "best_fields",
                "fuzziness": "AUTO",
                "prefix_length": 2
              }
            }
          ],
          "filter": [
            { "term": { "status": "active" } },
            { "range": { "stock_quantity": { "gt": 0 } } }
          ],
          "should": [
            { "term": { "is_featured": true } },
            { "term": { "is_flash_sale": true } },
            { "term": { "seller.is_official": true } }
          ]
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "sold_count",
            "factor": 0.5,
            "modifier": "log1p",
            "missing": 0
          }
        },
        {
          "field_value_factor": {
            "field": "rating",
            "factor": 2,
            "modifier": "none",
            "missing": 3
          }
        },
        {
          "field_value_factor": {
            "field": "boost_score",
            "factor": 1,
            "modifier": "none",
            "missing": 1
          }
        },
        {
          "gauss": {
            "updated_at": {
              "origin": "now",
              "scale": "30d",
              "offset": "7d",
              "decay": 0.5
            }
          },
          "weight": 1.5
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  },
  "aggs": {
    "brands": {
      "terms": { "field": "brand.keyword", "size": 20 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "key": "Dưới 3 triệu", "to": 3000000 },
          { "key": "3-7 triệu", "from": 3000000, "to": 7000000 },
          { "key": "7-15 triệu", "from": 7000000, "to": 15000000 },
          { "key": "15-30 triệu", "from": 15000000, "to": 30000000 },
          { "key": "Trên 30 triệu", "from": 30000000 }
        ]
      }
    },
    "avg_rating": { "avg": { "field": "rating" } }
  },
  "sort": [
    { "_score": "desc" },        # Relevance first
    { "is_flash_sale": "desc" },  # Flash sale items
    { "sold_count": "desc" }      # Popular items
  ],
  "from": 0,
  "size": 24,
  "highlight": {
    "fields": {
      "name": {
        "pre_tags": ["<em>"],
        "post_tags": ["</em>"]
      }
    }
  }
}
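
The scoring section of this query can be followed with a small calculator. The sketch below assumes the documented formulas (`log1p` as log10(1 + factor × value), and the Gaussian decay exp(-d² / 2σ²) with σ² = -scale² / (2 ln decay)); it ignores details of the real implementation, and `fieldValueFactor`, `gaussDecay`, and `finalScore` are hypothetical names. Ages are expressed in days.

```javascript
// field_value_factor: apply factor, then the modifier ("none" or "log1p").
function fieldValueFactor(value, { factor = 1, modifier = 'none', missing }) {
  const v = value == null ? missing : value;
  const scaled = factor * v;
  return modifier === 'log1p' ? Math.log10(1 + scaled) : scaled;
}

// gauss decay: 1 inside the offset, `decay` at offset + scale.
function gaussDecay(distance, { scale, offset = 0, decay = 0.5 }) {
  const sigmaSquared = -(scale * scale) / (2 * Math.log(decay));
  const d = Math.max(0, Math.abs(distance) - offset);
  return Math.exp(-(d * d) / (2 * sigmaSquared));
}

// score_mode "sum" over the four functions, boost_mode "multiply" with the query score.
function finalScore(queryScore, doc, ageDays) {
  const sum =
    fieldValueFactor(doc.sold_count, { factor: 0.5, modifier: 'log1p', missing: 0 }) +
    fieldValueFactor(doc.rating, { factor: 2, missing: 3 }) +
    fieldValueFactor(doc.boost_score, { factor: 1, missing: 1 }) +
    1.5 * gaussDecay(ageDays, { scale: 30, offset: 7, decay: 0.5 });
  return queryScore * sum;
}

console.log(finalScore(2, { sold_count: 18, rating: 4, boost_score: 1 }, 0));
```

Because the rating term (2 × rating) dominates the sum in this setup, it is worth checking with numbers like these whether the balance between popularity, rating, and freshness matches the intended ranking.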

Step 3: Node.js Service Implementation

javascript

// search-service.js
const fs = require('fs');
const { Client } = require('@elastic/elasticsearch');

const client = new Client({
  node: 'https://localhost:9200',
  auth: { username: 'elastic', password: process.env.ES_PASSWORD },
  // Trust the CA certificate that secures the cluster's HTTPS endpoint
  tls: { ca: fs.readFileSync('certs/http_ca.crt') }
});

class ProductSearchService {
  async search({
    query,
    category,
    brand,
    priceMin,
    priceMax,
    minRating,
    sortBy = 'relevance',
    page = 1,
    size = 24
  }) {
    const must = [];
    const filter = [
      { term: { status: 'active' } },
      { range: { stock_quantity: { gt: 0 } } }
    ];

    // Text search
    if (query) {
      must.push({
        multi_match: {
          query,
          fields: ['name^5', 'brand^3', 'tags^2', 'description'],
          type: 'best_fields',
          fuzziness: 'AUTO',
          prefix_length: 2
        }
      });
    } else {
      must.push({ match_all: {} });
    }

    // Category filter
    if (category) filter.push({ term: { category } });
    
    // Brand filter
    if (brand) filter.push({ terms: { 'brand.keyword': Array.isArray(brand) ? brand : [brand] } });
    
    // Price range filter
    if (priceMin || priceMax) {
      const priceRange = { range: { price: {} } };
      if (priceMin) priceRange.range.price.gte = priceMin;
      if (priceMax) priceRange.range.price.lte = priceMax;
      filter.push(priceRange);
    }
    
    // Rating filter
    if (minRating) filter.push({ range: { rating: { gte: minRating } } });

    // Sort options
    const sortOptions = {
      relevance: [{ _score: 'desc' }, { sold_count: 'desc' }],
      price_asc: [{ price: 'asc' }],
      price_desc: [{ price: 'desc' }],
      newest: [{ created_at: 'desc' }],
      best_selling: [{ sold_count: 'desc' }],
      rating: [{ rating: 'desc' }, { review_count: 'desc' }]
    };

    const body = {
      query: {
        function_score: {
          query: {
            bool: {
              must,
              filter,
              should: [
                { term: { is_featured: true } },
                { term: { is_flash_sale: true } },
                { term: { 'seller.is_official': true } }
              ]
            }
          },
          functions: [
            {
              field_value_factor: {
                field: 'sold_count',
                factor: 0.5,
                modifier: 'log1p',
                missing: 0
              }
            },
            {
              field_value_factor: {
                field: 'rating',
                factor: 2,
                modifier: 'none',
                missing: 3
              }
            }
          ],
          score_mode: 'sum',
          boost_mode: 'multiply'
        }
      },
      aggs: {
        brands: { terms: { field: 'brand.keyword', size: 20 } },
        price_ranges: {
          range: {
            field: 'price',
            ranges: [
              { key: 'Dưới 3 triệu', to: 3000000 },
              { key: '3-7 triệu', from: 3000000, to: 7000000 },
              { key: '7-15 triệu', from: 7000000, to: 15000000 },
              { key: 'Trên 15 triệu', from: 15000000 }
            ]
          }
        },
        rating_distribution: {
          range: {
            field: 'rating',
            ranges: [
              { key: '5 sao', from: 4.5, to: 5.1 },
              { key: '4+ sao', from: 4 },
              { key: '3+ sao', from: 3 }
            ]
          }
        }
      },
      sort: sortOptions[sortBy] || sortOptions.relevance,
      from: (page - 1) * size,
      size,
      highlight: {
        fields: {
          name: { pre_tags: ['<mark>'], post_tags: ['</mark>'], number_of_fragments: 0 }
        }
      },
      _source: ['product_id', 'name', 'brand', 'price', 'original_price', 
                'discount_percent', 'rating', 'review_count', 'sold_count',
                'images', 'is_flash_sale', 'shipping']
    };

    const result = await client.search({ index: 'ecommerce-products', body });
    
    return {
      total: result.hits.total.value,
      products: result.hits.hits.map(hit => ({
        ...hit._source,
        score: hit._score,
        highlight: hit.highlight
      })),
      aggregations: {
        brands: result.aggregations.brands.buckets,
        price_ranges: result.aggregations.price_ranges.buckets,
        rating_distribution: result.aggregations.rating_distribution.buckets
      },
      page,
      size,
      total_pages: Math.ceil(result.hits.total.value / size)
    };
  }

  async autocomplete(query) {
    const result = await client.search({
      index: 'ecommerce-products',
      body: {
        suggest: {
          product_suggest: {
            prefix: query,
            completion: {
              field: 'suggest',
              fuzzy: { fuzziness: 1 },
              size: 8
            }
          }
        }
      }
    });

    return result.suggest.product_suggest[0].options.map(opt => ({
      id: opt._id,
      name: opt._source.name,
      brand: opt._source.brand,
      price: opt._source.price
    }));
  }

  async indexProduct(product) {
    // Sync one product from PostgreSQL (the source of truth) into Elasticsearch
    await client.index({
      index: 'ecommerce-products',
      id: product.product_id,
      document: {
        ...product,
        suggest: {
          input: [
            product.name,
            product.brand,
            `${product.brand} ${product.name}`,
            ...(product.tags || [])
          ],
          // Completion weights must be integers, so round the combined value
          weight: Math.round(Math.floor(product.sold_count / 100) + product.rating * 10)
        }
      }
    });
  }
}
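
One caveat with `from: (page - 1) * size`: by default a search fails once `from + size` exceeds `index.max_result_window` (10,000). A minimal sketch of a guard that clamps the requested page is shown below; `pageWindow` is a hypothetical helper, and deeper pagination would need `search_after` instead.

```javascript
// Hypothetical guard: keep from + size within index.max_result_window.
const MAX_RESULT_WINDOW = 10000; // Elasticsearch default

function pageWindow(page, size, maxWindow = MAX_RESULT_WINDOW) {
  const safeSize = Math.min(Math.max(1, size), maxWindow);
  const lastPage = Math.max(1, Math.floor(maxWindow / safeSize));
  const safePage = Math.min(Math.max(1, page), lastPage);
  return { from: (safePage - 1) * safeSize, size: safeSize };
}

console.log(pageWindow(1, 24));
console.log(pageWindow(1000, 24)); // clamped to the last reachable page
```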

12.2 Use Case 2: ELK Stack Log Analytics

Architecture

Applications → Beats (Filebeat/Metricbeat) → Logstash → Elasticsearch → Kibana

Logstash Pipeline Configuration

ruby

# /etc/logstash/conf.d/app-logs.conf

input {
  beats {
    port => 5044
  }
}

filter {
  # Parse JSON logs emitted by the applications
  if [type] == "app-log" {
    json {
      source => "message"
      target => "app"
    }
    
    # Enrich with GeoIP
    if [app][client_ip] {
      geoip {
        source => "[app][client_ip]"
        target => "[app][geoip]"
      }
    }
    
    # User-agent parsing
    if [app][user_agent] {
      useragent {
        source => "[app][user_agent]"
        target => "[app][ua]"
      }
    }
    
    # Set the name used to build the index name in the output section
    mutate {
      add_field => { "[@metadata][rollover_alias]" => "app-logs" }
    }
  }
  
  # Parse nginx access logs
  if [type] == "nginx" {
    grok {
      match => {
        "message" => '%{COMBINEDAPACHELOG}'
      }
    }
    date {
      match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
    }
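    mutate {
      # One convert hash per mutate; request_time exists only if your log format adds it
      convert => {
        "response" => "integer"
        "bytes" => "integer"
        "request_time" => "float"
      }
      # The output index pattern needs this metadata field for every event type
      add_field => { "[@metadata][rollover_alias]" => "nginx" }
    }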
  }
  
  # Remove unnecessary fields
  mutate {
    remove_field => ["@version", "agent", "ecs"]
  }
}

output {
  elasticsearch {
    hosts => ["https://es-node1:9200", "https://es-node2:9200"]
    user => "logstash_user"
    password => "${LOGSTASH_ES_PASSWORD}"
    ssl_certificate_authorities => "/etc/logstash/certs/ca.crt"
    
    index => "logs-%{[@metadata][rollover_alias]}-%{+YYYY.MM.dd}"
    
    # ILM-managed rollover:
    # ilm_rollover_alias => "app-logs"
    # ilm_policy => "log-retention-policy"
  }
}

Kibana Dashboard Queries (via ES)

bash

# Dashboard: Error rate over time
GET /logs-app-*/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-24h" } }
  },
  "aggs": {
    "error_rate_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m",
        "format": "HH:mm"
      },
      "aggs": {
        "errors": {
          "filter": { "range": { "app.status_code": { "gte": 500 } } }
        },
        "error_rate": {
          "bucket_script": {
            "buckets_path": { "err": "errors._count", "tot": "_count" },
            "script": "params.tot > 0 ? params.err / params.tot * 100 : 0"
          }
        }
      }
    }
  }
}

# Top slow endpoints
GET /logs-app-*/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-1h" } }
  },
  "aggs": {
    "slow_endpoints": {
      "terms": {
        "field": "app.endpoint.keyword",
        "size": 20,
        "order": { "p95_latency.95": "desc" }
      },
      "aggs": {
        "p95_latency": {
          "percentiles": {
            "field": "app.duration_ms",
            "percents": [50, 95, 99]
          }
        },
        "error_count": {
          "filter": { "range": { "app.status_code": { "gte": 500 } } }
        }
      }
    }
  }
}

# Geographic distribution
GET /logs-app-*/_search
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-24h" } } },
  "aggs": {
    "by_country": {
      "terms": { "field": "app.geoip.country_code2.keyword", "size": 20 }
    }
  }
}
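
On the client side, the error-rate calculation from the first dashboard query can be reproduced from the histogram buckets. The following is a minimal sketch, assuming a response shaped like a `date_histogram` with an `errors` filter sub-aggregation; the function names and the sample buckets are hypothetical.

```python
def error_rate_percent(error_count: int, total_count: int) -> float:
    """Errors as a percentage of all requests; 0 for an empty bucket."""
    return error_count / total_count * 100 if total_count > 0 else 0.0


def summarize(buckets: list) -> list:
    """Turn date_histogram buckets into (time label, error rate %) pairs."""
    return [
        (b["key_as_string"], error_rate_percent(b["errors"]["doc_count"], b["doc_count"]))
        for b in buckets
    ]


# Hand-written sample shaped like the aggregation response (not real data).
sample_buckets = [
    {"key_as_string": "10:00", "doc_count": 200, "errors": {"doc_count": 5}},
    {"key_as_string": "10:05", "doc_count": 0, "errors": {"doc_count": 0}},
]
rates = summarize(sample_buckets)
```

Guarding against an empty bucket mirrors the `params.tot > 0` check in the `bucket_script`.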

12.3 Use Case 3: Real-time Inventory Search

Problem

A warehouse holds 1 million SKUs. Staff need to find items quickly by several criteria:

SKU code
Product name
Warehouse location
Supplier
Expiry date

bash

PUT /warehouse-inventory
{
  "settings": {
    "number_of_shards": 2,
    "analysis": {
      "analyzer": {
        "sku_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["uppercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "sku": {
        "type": "keyword",
        "fields": {
          "analyze": { "type": "text", "analyzer": "sku_analyzer" }
        }
      },
      "name": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "quantity": { "type": "integer" },
      "location": {
        "properties": {
          "warehouse": { "type": "keyword" },
          "zone": { "type": "keyword" },
          "rack": { "type": "keyword" },
          "shelf": { "type": "keyword" },
          "bin": { "type": "keyword" },
          "full_location": { "type": "keyword" }
        }
      },
      "supplier": {
        "properties": {
          "supplier_id": { "type": "keyword" },
          "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }
        }
      },
      "expiry_date": { "type": "date", "format": "yyyy-MM-dd" },
      "last_updated": { "type": "date" },
      "status": { "type": "keyword" },
      "unit_cost": { "type": "double" }
    }
  }
}

# Query: find products that expire within the next 30 days
GET /warehouse-inventory/_search
{
  "query": {
    "bool": {
      "must": [
        { "query_string": { "query": "thực phẩm" } }
      ],
      "filter": [
        {
          "range": {
            "expiry_date": {
              "lte": "now+30d",
              "gte": "now"
            }
          }
        },
        { "range": { "quantity": { "gt": 0 } } }
      ]
    }
  },
  "aggs": {
    "expiry_urgency": {
      "date_range": {
        "field": "expiry_date",
        "ranges": [
          { "key": "Expires within 7 days", "to": "now+7d" },
          { "key": "7-15 days", "from": "now+7d", "to": "now+15d" },
          { "key": "15-30 days", "from": "now+15d", "to": "now+30d" }
        ]
      },
      "aggs": {
        "total_quantity": { "sum": { "field": "quantity" } },
        "total_value": {
          "sum": {
            "script": {
              "source": "doc['quantity'].value * doc['unit_cost'].value"
            }
          }
        }
      }
    },
    "by_warehouse": {
      "terms": { "field": "location.warehouse" }
    }
  },
  "sort": [{ "expiry_date": "asc" }]
}
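
The urgency buckets of the aggregation can be mirrored in application code, for example to label a single item in a UI. This is a minimal sketch working on whole days, assuming each range includes its lower bound and excludes its upper bound; the function name and the labels are illustrative.

```python
from datetime import date, timedelta
from typing import Optional


def expiry_bucket(expiry: date, today: date) -> Optional[str]:
    """Return the urgency label for an expiry date, or None outside the 30-day window."""
    days_left = (expiry - today).days
    if not 0 <= days_left < 30:
        return None
    if days_left < 7:
        return "Expires within 7 days"
    if days_left < 15:
        return "7-15 days"
    return "15-30 days"


today = date(2024, 1, 15)  # fixed date so the example is reproducible
labels = [expiry_bucket(today + timedelta(days=d), today) for d in (3, 7, 20, 45)]
```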

12.4 Use Case 4: Semantic Search with Vector Embeddings

Architecture

User Query → Embedding Model → Vector → kNN Search → Results
Document   → Embedding Model → Vector → ES dense_vector field

Setup

bash

# Mapping with dense_vector
PUT /knowledge-base
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { "type": "text" },
      "content_vector": {
        "type": "dense_vector",
        "dims": 1536,           # OpenAI embedding dimensions
        "index": true,
        "similarity": "cosine"
      },
      "category": { "type": "keyword" },
      "created_at": { "type": "date" }
    }
  }
}

Python Implementation

python

import os
from datetime import datetime

import openai
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", "your_password"),  # http_auth is deprecated in the 8.x client
    ca_certs="certs/http_ca.crt"
)

openai.api_key = os.environ["OPENAI_API_KEY"]

def get_embedding(text: str) -> list[float]:
    """Create an embedding vector with the OpenAI API (pre-1.0 openai SDK interface)."""
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response['data'][0]['embedding']

def index_document(doc_id: str, title: str, content: str, category: str):
    """Index a document together with its embedding."""
    embedding = get_embedding(title + " " + content[:500])
    
    es.index(
        index="knowledge-base",
        id=doc_id,
        document={
            "title": title,
            "content": content,
            "content_vector": embedding,
            "category": category,
            "created_at": datetime.now().isoformat()
        }
    )

def semantic_search(query: str, category: str = None, size: int = 10):
    """Hybrid semantic search: vector similarity plus keyword matching."""
    query_vector = get_embedding(query)
    
    # Hybrid search: semantic + keyword
    body = {
        "query": {
            "bool": {
                "should": [
                    {
                        "knn": {
                            "field": "content_vector",
                            "query_vector": query_vector,
                            "num_candidates": 100,
                            "boost": 0.7    # Relative weight for the semantic score
                        }
                    },
                    {
                        "multi_match": {
                            "query": query,
                            "fields": ["title^3", "content"],
                            "boost": 0.3    # Relative weight for the keyword score (scores are not normalized, so this is not an exact 70/30 split)
                        }
                    }
                ]
            }
        },
        "size": size,
        "_source": ["title", "content", "category", "created_at"]
    }
    
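    if category:
        body["query"]["bool"]["filter"] = [{"term": {"category": category}}]
        # With a filter present, should clauses become optional by default;
        # require at least one so filter-only matches are not returned.
        body["query"]["bool"]["minimum_should_match"] = 1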
    
    result = es.search(index="knowledge-base", body=body)
    
    return [{
        "title": hit["_source"]["title"],
        "content": hit["_source"]["content"][:200] + "...",
        "score": hit["_score"],
        "id": hit["_id"]
    } for hit in result["hits"]["hits"]]

# Usage:
results = semantic_search("Cách tối ưu truy vấn database PostgreSQL")
# This also finds:
# - Articles about PostgreSQL query optimization
# - Articles about database performance that use synonyms
# - Articles about index tuning even if they do not contain the exact keywords
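
To build intuition for the `"similarity": "cosine"` setting in the mapping, the score of a single vector match can be computed by hand. This is a minimal sketch in plain Python, assuming the documented kNN scoring for cosine similarity, `(1 + cosine) / 2`; the function names are illustrative and nothing here calls Elasticsearch.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_cosine_score(query_vector: list, doc_vector: list) -> float:
    """Map cosine similarity from [-1, 1] to a score in [0, 1]."""
    return (1 + cosine_similarity(query_vector, doc_vector)) / 2


same_direction = knn_cosine_score([1.0, 0.0], [2.0, 0.0])
orthogonal = knn_cosine_score([1.0, 0.0], [0.0, 1.0])
opposite = knn_cosine_score([1.0, 0.0], [-1.0, 0.0])
```

Because the vector score stays within [0, 1] while BM25 scores are unbounded, the two `boost` values in a hybrid query are relative weights rather than exact percentages.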

12.5 Use Case 5: Multi-tenant Search

Scenario

A SaaS platform where each customer's data must stay isolated:

Option 1: Index per tenant (strong isolation, scales poorly as the number of tenants grows)

bash

# Each customer → 1 index
products-tenant-001
products-tenant-002
...

# Update tenant settings:
PUT /products-tenant-001/_settings
{
  "index.routing.allocation.require.tier": "hot"
}

Option 2: Shared index with a tenant_id field (easier to manage, but carries isolation risk)

bash

PUT /products-shared
{
  "mappings": {
    "properties": {
      "tenant_id": { "type": "keyword" },
      "name": { "type": "text" },
      ...
    }
  }
}

# Always filter by tenant_id:
GET /products-shared/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "tenant_id": "tenant-001" } }
      ],
      "must": [
        { "match": { "name": "query" } }
      ]
    }
  }
}

Option 3: Document-level Security (DLS)

bash

POST /_security/role/tenant-001-role
{
  "indices": [{
    "names": ["products-shared"],
    "privileges": ["read"],
    "query": {
      "term": { "tenant_id": "tenant-001" }
    }
  }]
}
# Any user with this role sees only tenant-001's data
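
With the shared-index option, the isolation risk comes from forgetting the tenant filter in some code path. One way to reduce it is to route every query through a single wrapper. This is a minimal sketch, assuming the `tenant_id` keyword field from the shared mapping; `scope_to_tenant` is a hypothetical helper name.

```python
def scope_to_tenant(query: dict, tenant_id: str) -> dict:
    """Wrap any query so it can only match documents of one tenant."""
    return {
        "bool": {
            "filter": [{"term": {"tenant_id": tenant_id}}],  # not scored, cacheable
            "must": [query],
        }
    }


# Request body for GET /products-shared/_search
body = {"query": scope_to_tenant({"match": {"name": "laptop"}}, "tenant-001")}
```

The tenant ID should come from the authenticated session, never from user-supplied query parameters.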

12.6 Use Case 6: Real-time User Activity Stream

Clickstream Analytics

bash

PUT /user-events
{
  "settings": {
    "number_of_shards": 5,
    "index.lifecycle.name": "events-policy"
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "user_id": { "type": "keyword" },
      "session_id": { "type": "keyword" },
      "event_type": { "type": "keyword" },
      "product_id": { "type": "keyword" },
      "category": { "type": "keyword" },
      "page": { "type": "keyword" },
      "duration_ms": { "type": "integer" },
      "device": {
        "properties": {
          "type": { "type": "keyword" },
          "os": { "type": "keyword" },
          "browser": { "type": "keyword" }
        }
      },
      "location": { "type": "geo_point" },
      "referrer": { "type": "keyword" },
      "ip": { "type": "ip" }
    }
  }
}

# Real-time funnel analysis
GET /user-events/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-24h" } }
  },
  "aggs": {
    "funnel": {
      "filters": {
        "filters": {
          "1_view_product": { "term": { "event_type": "product_view" } },
          "2_add_to_cart": { "term": { "event_type": "add_to_cart" } },
          "3_checkout": { "term": { "event_type": "checkout_start" } },
          "4_purchase": { "term": { "event_type": "purchase_complete" } }
        }
      },
      "aggs": {
        "unique_users": {
          "cardinality": { "field": "user_id" }
        }
      }
    },
    "top_exit_pages": {
      "terms": {
        "field": "page",
        "size": 10,
        "order": { "avg_time": "asc" }
      },
      "aggs": {
        "avg_time": { "avg": { "field": "duration_ms" } }
      }
    }
  }
}

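# User cohort analysis (assumes each event also carries a first_visit date field, which the mapping above does not define)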
GET /user-events/_search
{
  "size": 0,
  "aggs": {
    "weekly_cohorts": {
      "date_histogram": {
        "field": "first_visit",
        "calendar_interval": "week"
      },
      "aggs": {
        "retention": {
          "date_histogram": {
            "field": "@timestamp",
            "calendar_interval": "week"
          },
          "aggs": {
            "users": { "cardinality": { "field": "user_id" } }
          }
        }
      }
    }
  }
}
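
The funnel aggregation returns one unique-user count per step; turning those counts into step-to-step conversion rates is left to the client. This is a minimal sketch, assuming step names prefixed with `1_`, `2_`, ... as in the query; the function name and the sample counts are made up. Note that `cardinality` is approximate and the steps are counted independently, so this is not a strict per-user sequence.

```python
def funnel_conversion(step_counts: dict) -> dict:
    """Return the conversion from each step to the next, in percent."""
    steps = sorted(step_counts)  # the numeric prefixes give the funnel order
    rates = {}
    for prev, curr in zip(steps, steps[1:]):
        prev_users = step_counts[prev]
        rates[curr] = round(step_counts[curr] / prev_users * 100, 1) if prev_users else 0.0
    return rates


# Made-up unique-user counts per funnel step.
sample_counts = {"1_view_product": 1000, "2_add_to_cart": 250, "3_checkout": 100, "4_purchase": 80}
conversion = funnel_conversion(sample_counts)
```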

Chapter 12 Summary

Use Case	ES Features Used
E-commerce Search	Multi-match, function_score, faceted search, autocomplete
Log Analytics	Time-series, aggregations, ELK Stack
Inventory Search	Range queries, nested aggs, scripted fields
Semantic Search	dense_vector, kNN, hybrid search
Multi-tenant	Index isolation, DLS, tenant_id filtering
Clickstream Analytics	High-volume indexing, time-series, cardinality

Chapter 13: Best Practices & Production Operations

13.1 Production Checklist

Before Deploying

bash

# 1. Security
✅ TLS/SSL enabled (no plaintext HTTP)
✅ Authentication enabled
✅ Default passwords changed
✅ API keys for external access (do not use the elastic superuser)
✅ Ports 9200/9300 not exposed to the public internet

# 2. Cluster
✅ At least 3 master-eligible nodes
✅ discovery.seed_hosts configured
✅ cluster.initial_master_nodes set (only for the very first bootstrap)
✅ vm.max_map_count = 262144

# 3. Memory
✅ Heap size = 50% of RAM, at most 31GB
✅ Xms = Xmx
✅ bootstrap.memory_lock = true
✅ Swap disabled

# 4. Disk
✅ SSDs for data nodes
✅ Disk watermarks configured
✅ Backup strategy in place (snapshot repository)

# 5. Index
✅ Appropriate shard sizing (< 50GB per shard)
✅ At least 1 replica
✅ ILM policy for time-series indices
✅ Carefully designed mappings (no uncontrolled dynamic mapping)
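
The memory rule from the checklist (half of RAM, capped at 31GB, with Xms equal to Xmx) is easy to encode when generating configuration. This is a minimal sketch; the function names are hypothetical and the 31GB cap is the figure used in this chapter.

```python
def recommended_heap_gb(ram_gb: float, max_heap_gb: float = 31.0) -> float:
    """Half of the machine's RAM, capped at max_heap_gb."""
    return min(ram_gb * 0.5, max_heap_gb)


def jvm_heap_flags(ram_gb: float) -> list:
    """Return matching -Xms/-Xmx flags (whole gigabytes) for jvm.options."""
    heap = int(recommended_heap_gb(ram_gb))
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


flags_64gb = jvm_heap_flags(64)
flags_16gb = jvm_heap_flags(16)
```

The remaining RAM is left to the operating system's file cache, which Lucene relies on heavily.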

13.2 Security Hardening

Creating Users and Roles

bash

# Create roles
POST /_security/role/app-read-role
{
  "indices": [
    {
      "names": ["products", "categories"],
      "privileges": ["read", "view_index_metadata"]
    }
  ],
  "cluster": ["monitor"]
}

POST /_security/role/app-write-role
{
  "indices": [
    {
      "names": ["products", "categories"],
      "privileges": ["read", "write", "view_index_metadata"]
    }
  ],
  "cluster": ["monitor", "manage_index_templates"]
}

POST /_security/role/analytics-role
{
  "indices": [
    {
      "names": ["orders", "events-*"],
      "privileges": ["read"],
      "field_security": {
        "grant": ["*"],
        "except": ["payment_details", "personal_info"]    # Hide sensitive fields
      }
    }
  ]
}

# Create users
POST /_security/user/app-service
{
  "password": "SecurePassword123!",
  "roles": ["app-read-role", "app-write-role"],
  "full_name": "E-commerce Application",
  "email": "app@company.com"
}

POST /_security/user/analytics-dashboard
{
  "password": "AnalyticsPass456!",
  "roles": ["analytics-role"],
  "full_name": "Analytics Dashboard"
}

API Keys (Preferred for Services)

bash

# Create an API key (scoped permissions, expires after 90 days)
POST /_security/api_key
{
  "name": "ecommerce-backend-key",
  "expiration": "90d",
  "role_descriptors": {
    "product-access": {
      "indices": [
        {
          "names": ["products"],
          "privileges": ["read", "write"]
        }
      ]
    }
  }
}

# Response:
{
  "id": "api-key-id",
  "name": "ecommerce-backend-key",
  "api_key": "abc123...",
  "encoded": "base64-encoded-key"   # Use this value in the Authorization header
}

# Use the API key:
curl -H "Authorization: ApiKey base64-encoded-key" \
     https://localhost:9200/products/_search

# Node.js:
const client = new Client({
  node: 'https://localhost:9200',
  auth: { apiKey: 'base64-encoded-key' }
});

# View and revoke API keys:
GET /_security/api_key?name=ecommerce-backend-key
DELETE /_security/api_key
{ "ids": ["api-key-id"] }
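
The `encoded` value in the create-key response is the Base64 encoding of `id:api_key`, so the header can also be built from those two fields when only they were stored. This is a minimal sketch using only the standard library; the helper name and the sample credentials are made up.

```python
import base64


def api_key_header(key_id: str, api_key: str) -> dict:
    """Build the Authorization header from the id and api_key response fields."""
    token = base64.b64encode(f"{key_id}:{api_key}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"ApiKey {token}"}


# Placeholder credentials for illustration only.
headers = api_key_header("my-key-id", "my-secret")
```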

Network Security

bash

# elasticsearch.yml
network.host: _site_          # Bind only to site-local (private network) addresses
http.port: 9200
transport.port: 9300

# Or explicitly:
network.bind_host: ["_local_", "10.0.0.10"]
network.publish_host: "10.0.0.10"   # Address other nodes use to connect

# Firewall (iptables):
# Allow 9200 only from application servers:
iptables -A INPUT -p tcp --dport 9200 -s 10.0.1.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 9200 -j DROP

# Allow 9300 only between ES nodes:
iptables -A INPUT -p tcp --dport 9300 -s 10.0.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 9300 -j DROP

13.3 Backup & Restore - Snapshot

Configuring a Snapshot Repository

bash

# Option 1: Shared file system (NFS/EFS)
PUT /_snapshot/my-backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/elasticsearch",
    "compress": true
  }
}

# Option 2: AWS S3
# In 8.x the S3 repository type is built in; on 7.x install it first: elasticsearch-plugin install repository-s3
PUT /_snapshot/s3-backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-backups",
    "region": "ap-southeast-1",
    "base_path": "production-backups",
    "compress": true,
    "chunk_size": "1gb",
    "server_side_encryption": true
  }
}
# AWS credentials come from an IAM role, or from the Elasticsearch keystore:
# s3.client.default.access_key and s3.client.default.secret_key

# Option 3: GCS (Google Cloud Storage)
PUT /_snapshot/gcs-backup
{
  "type": "gcs",
  "settings": {
    "bucket": "my-es-backups-bucket",
    "base_path": "elasticsearch",
    "compress": true
  }
}

Creating Snapshots

bash

# Snapshot the entire cluster:
PUT /_snapshot/s3-backup/snapshot-01
{
  "indices": "*",
  "include_global_state": true,   # Includes cluster settings, ILM policies
  "metadata": {
    "taken_by": "cron-job",
    "taken_because": "nightly backup"
  }
}

# Snapshot specific indices:
PUT /_snapshot/s3-backup/products-snapshot-20240115
{
  "indices": ["products", "categories"],
  "include_global_state": false
}

# View snapshot status:
GET /_snapshot/s3-backup/snapshot-01/_status

# List snapshots:
GET /_snapshot/s3-backup/*

# Delete an old snapshot:
DELETE /_snapshot/s3-backup/snapshot-01

SLM - Snapshot Lifecycle Management

bash

PUT /_slm/policy/nightly-backup
{
  "schedule": "0 30 1 * * ?",           # Every day at 1:30 AM
  "name": "<daily-snap-{now/d}>",
  "repository": "s3-backup",
  "config": {
    "indices": "*",
    "include_global_state": true
  },
  "retention": {
    "expire_after": "30d",      # Delete snapshots older than 30 days
    "min_count": 5,             # Keep at least 5 snapshots
    "max_count": 50             # Keep at most 50 snapshots
  }
}

# Run the policy once right away (the schedule is already active after the PUT):
POST /_slm/policy/nightly-backup/_execute

# View SLM stats:
GET /_slm/stats

Restore

bash

# Restore the entire snapshot:
POST /_snapshot/s3-backup/snapshot-01/_restore
{
  "indices": "*",
  "include_global_state": true,
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"   # Rename to avoid conflicts with existing indices
}

# Restore a single index:
POST /_snapshot/s3-backup/snapshot-01/_restore
{
  "indices": ["products"],
  "ignore_unavailable": true,
  "index_settings": {
    "index.number_of_replicas": 0    # Override settings during restore
  }
}

# Monitor the restore:
GET /_recovery?active_only=true

13.4 Zero-Downtime Migration

Scenario: Changing a Mapping

You need to change a field's analyzer and add a new field. The analyzer of an existing field cannot be modified in place, so a reindex is required.

bash

# Step 1: Create a new index with the new mapping
PUT /products-v2
{
  "settings": { ... },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "new-analyzer"   # New analyzer
      },
      "new_field": { "type": "keyword" }   # New field
    }
  }
}

# Step 2: Reindex from v1 to v2 (v1 keeps serving traffic)
POST /_reindex?wait_for_completion=false
{
  "source": { "index": "products-v1" },
  "dest": { "index": "products-v2" },
  "script": {
    "source": """
      ctx._source.new_field = ctx._source.category + '_' + ctx._source.brand;
    """
  }
}

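# Monitor the reindex:
GET /_tasks?actions=*reindex&detailed=true

# Step 3: Once the reindex has finished, verify the data
GET /products-v2/_count     # Should equal the products-v1 count
GET /products-v2/_search    # Test queries

# Step 4: Switch the alias from v1 to v2 (atomic operation!)
# Note: documents written to v1 after the reindex started are not copied; pause or replay those writes first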
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products-v1", "alias": "products" } },
    { "add": { "index": "products-v2", "alias": "products" } }
  ]
}
# All traffic now goes to products-v2!
# Zero downtime because alias switch is atomic

# Step 5: Verify through the alias (now served by v2)
GET /products/_search

# Step 6: Delete v1 after a few days
DELETE /products-v1
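
The alias switch is the step most worth scripting, because both actions must go out in one request to stay atomic. This is a minimal sketch that only builds the request body for `POST /_aliases`; the function name is hypothetical and sending the request is left to whichever client you use.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build a single _aliases request that moves an alias from one index to another."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap_body = alias_swap_actions("products", "products-v1", "products-v2")
```

Keeping both actions in one body is what makes the switch atomic: no request ever sees the alias pointing at zero or two indices.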

13.5 Monitoring

Key Elasticsearch Metrics

bash

# Cluster health:
GET /_cluster/health?level=indices

# Key metrics:
GET /_cat/nodes?v&h=name,heap.percent,heap.max,cpu,load_1m,disk.used_percent,master

# Indices using the most storage:
GET /_cat/indices?v&h=index,health,docs.count,store.size,pri.store.size&s=store.size:desc

# GC statistics:
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc

# Thread pools (important for finding bottlenecks):
GET /_nodes/stats/thread_pool?filter_path=nodes.*.thread_pool

# Response:
{
  "nodes": {
    "node-1": {
      "thread_pool": {
        "search": {
          "threads": 14,
          "queue": 0,         # Sustained queue > 0 = search threads are saturated
          "active": 2,
          "rejected": 5,      # Rejected > 0 = overloaded!
          "largest": 14,
          "completed": 120000
        }
      }
    }
  }
}
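
A response like the one above can be scanned automatically for rejections. This is a minimal sketch that works on the parsed JSON of `GET /_nodes/stats/thread_pool`; the function name is hypothetical and the sample mirrors the response shown in this section.

```python
def pools_with_rejections(stats: dict) -> list:
    """Return (node, pool, rejected) for every thread pool that rejected work."""
    findings = []
    for node_id, node in stats.get("nodes", {}).items():
        for pool_name, pool in node.get("thread_pool", {}).items():
            if pool.get("rejected", 0) > 0:
                findings.append((node_id, pool_name, pool["rejected"]))
    return findings


sample_stats = {
    "nodes": {
        "node-1": {
            "thread_pool": {
                "search": {"threads": 14, "queue": 0, "active": 2, "rejected": 5},
                "write": {"threads": 8, "queue": 0, "active": 1, "rejected": 0},
            }
        }
    }
}
problems = pools_with_rejections(sample_stats)
```

The `rejected` counter is cumulative since node start, so alerting on its increase between two polls is usually more useful than alerting on the absolute value.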

Metricbeat for Elasticsearch Monitoring

yaml

# metricbeat.yml
metricbeat.modules:
- module: elasticsearch
  metricsets:
    - node
    - node_stats
    - index
    - index_recovery
    - index_summary
    - shard
    - ml_job
  period: 10s
  hosts: ["https://localhost:9200"]
  username: "monitoring-user"
  password: "${ES_MONITORING_PASSWORD}"
  ssl.certificate_authorities: ["/etc/metricbeat/certs/ca.crt"]

output.elasticsearch:
  hosts: ["https://monitoring-cluster:9200"]  # Separate monitoring cluster
  username: "monitoring-writer"
  password: "${MONITORING_WRITER_PASSWORD}"
  ssl.certificate_authorities: ["/etc/metricbeat/certs/ca.crt"]

13.6 Alerting

Watcher Alerts

bash

# Alert: cluster health is red
PUT /_watcher/watch/cluster-health-alert
{
  "trigger": {
    "schedule": { "interval": "1m" }    # Check every minute
  },
  "input": {
    "http": {
      "request": {
        "url": "https://localhost:9200/_cluster/health",
        "auth": { "basic": { "username": "watcher-user", "password": "xxx" } }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.status": {
        "eq": "red"    # Condition: status = "red"
      }
    }
  },
  "actions": {
    "slack_alert": {
      "webhook": {
        "scheme": "https",
        "host": "hooks.slack.com",
        "port": 443,
        "method": "post",
        "path": "/services/xxx/yyy/zzz",
        "params": {},
        "headers": { "Content-Type": "application/json" },
        "body": """{"text": "🚨 ES Cluster HEALTH is RED! Immediate action required."}"""
      }
    },
    "email_alert": {
      "email": {
        "to": ["devops@company.com"],
        "subject": "ALERT: Elasticsearch Cluster Health",
        "body": {
          "html": "<p>Cluster status: <strong>RED</strong></p>"
        }
      }
    }
  }
}

# Alert: high disk usage
PUT /_watcher/watch/disk-usage-alert
{
  "trigger": {
    "schedule": { "interval": "5m" }
  },
  "input": {
    "search": {
      "request": {
        "indices": [".monitoring-es-*"],
        "body": {
          "size": 1,
          "query": {
            "bool": {
              "filter": [
                { "range": { "@timestamp": { "gte": "now-10m" } } }
              ]
            }
          },
          "aggs": {
            "min_avail": {
              "min": { "field": "elasticsearch.node.stats.fs.total.available_in_bytes" }
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": """
        // Simplification: total size is read from one hit, so this assumes similarly sized disks
        def total = ctx.payload.hits.hits[0]._source.elasticsearch.node.stats.fs.total.total_in_bytes;
        def avail = ctx.payload.aggregations.min_avail.value;   // least free space reported by any node
        def used_pct = (total - avail) / total * 100;
        return used_pct > 80;
      """
    }
  },
  "actions": {
    "notify": {
      "webhook": {
        "scheme": "https",
        "host": "hooks.slack.com",
        "port": 443,
        "method": "post",
        "path": "/services/xxx/yyy/zzz",
        "headers": { "Content-Type": "application/json" },
        "body": """{"text": "⚠️ Elasticsearch disk usage is above 80%!"}"""
      }
    }
  }
}

13.7 Common Pitfalls and Solutions

1. Split-Brain (Network Partition)

bash

# PROBLEM: the cluster splits into two parts and each part elects its own master

# SOLUTION: configure discovery correctly
# elasticsearch.yml
discovery.seed_hosts:
  - "node1:9300"
  - "node2:9300"
  - "node3:9300"

cluster.initial_master_nodes:
  - "node1"
  - "node2"
  - "node3"

# ALWAYS use an odd number of master-eligible nodes: 3, 5, 7
# Quorum: floor(n/2) + 1
# 3 nodes → quorum = 2 (2 of 3 must agree to elect a master)
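
The quorum formula also explains the advice to use an odd number of master-eligible nodes: adding a fourth node raises the quorum without letting the cluster survive any more failures. This is a minimal sketch of the arithmetic; the function names are illustrative. Recent Elasticsearch versions manage the voting configuration themselves, so this is for reasoning, not for configuration.

```python
def quorum(master_eligible: int) -> int:
    """Votes needed to elect a master: floor(n/2) + 1."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a quorum remains."""
    return master_eligible - quorum(master_eligible)


table = {n: (quorum(n), tolerated_failures(n)) for n in (3, 4, 5)}
```

Three and four nodes both tolerate a single failure, which is why the even count buys nothing.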

2. Out of Memory / GC Issues

bash

# Symptoms: slow queries, GC overhead, OutOfMemoryError

# How to find the problem:
GET /_nodes/stats/jvm

# GC time > 10% of CPU time → there is an issue

# Solutions:
# 1. Cap the fielddata cache (static node setting: set it in elasticsearch.yml and restart):
# indices.fielddata.cache.size: 20%

# 2. Avoid aggregations on text fields (use keyword):
# BAD: agg on the "message" text field → fielddata is loaded onto the heap
# GOOD: agg on "status.keyword"

# 3. Increase the heap (max 31GB):
# jvm.options: -Xms24g -Xmx24g

# 4. Limit concurrent shard recoveries per node (this setting throttles recoveries, not searches):
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}

3. Mapping Explosion

bash

# PROBLEM: dynamic mapping creates too many fields
# Example: logs carry a JSON payload with many different keys
# → Cluster state balloons to GB+

# Solution 1: Index template with strict mapping
PUT /_index_template/strict-logs
{
  "index_patterns": ["logs-*"],
  "template": {
    "mappings": {
      "dynamic": "strict",          # Reject unknown fields
      "properties": {
        "timestamp": { "type": "date" },
        "message": { "type": "text" },
        "level": { "type": "keyword" }
      }
    }
  }
}

# Solution 2: Flatten arbitrary JSON
{
  "properties": {
    "metadata": {
      "type": "flattened"    # Stored as a single field; no sub-fields are mapped
    }
  }
}

# Solution 3: Limit the total number of fields
PUT /_settings
{
  "index.mapping.total_fields.limit": 200    # Default: 1000
}

4. Hotspot Shards

bash

# PROBLEM: a few shards receive a disproportionate share of writes/reads
# → The nodes holding those shards become overloaded

# Solution: better routing
# Option 1: Do not use custom routing (let ES distribute documents automatically)
# Option 2: If you need custom routing, make sure the key is evenly distributed:

# Check per-shard document counts and sizes to spot skew:
GET /_cat/shards/orders?v&h=index,shard,prirep,docs,store,node&s=store:desc

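# Use routing_partition_size to spread each custom routing value over several shards:
PUT /orders
{
  "settings": {
    "number_of_shards": 4,
    "routing_partition_size": 2    # Each routing value maps to 2 shards instead of 1 fixed shard
  },
  "mappings": {
    "_routing": { "required": true }    # Required for partitioned indices
  }
}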
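
The effect of `routing_partition_size` can be illustrated with a toy model: the routing value picks a starting shard, and the document ID picks an offset inside a window of `routing_partition_size` shards. This is a simplified sketch, assuming the number of routing shards equals the number of primary shards; `zlib.crc32` is only a stand-in for the Murmur3 hash Elasticsearch uses internally, so actual shard numbers will differ.

```python
import zlib


def stand_in_hash(value: str) -> int:
    """Stand-in hash for illustration only (Elasticsearch uses Murmur3)."""
    return zlib.crc32(value.encode("utf-8"))


def shard_for(routing: str, doc_id: str, num_shards: int, partition_size: int = 1) -> int:
    """The routing value selects a window of shards; the id picks one shard inside it."""
    offset = stand_in_hash(doc_id) % partition_size
    return (stand_in_hash(routing) + offset) % num_shards


# All orders of one (hypothetical) customer, with and without partitioning.
single_shard = {shard_for("customer-42", f"order-{i}", num_shards=4) for i in range(100)}
partitioned = {shard_for("customer-42", f"order-{i}", num_shards=4, partition_size=2) for i in range(100)}
```

Without partitioning, every document of a routing value lands on one shard; with a partition size of 2, a large customer's documents can land on up to two shards, while searches with that routing value still touch only that subset.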

5. Unexpected Scores / Results

bash

# Debug scoring:
GET /products/_explain/doc-id
{
  "query": {
    "match": { "name": "iPhone" }
  }
}

# The response explains why this document received score X:
{
  "matched": true,
  "explanation": {
    "value": 4.5,
    "description": "weight(name:iphone in 0)",
    "details": [
      {
        "value": 4.5,
        "description": "score(freq=1.0), product of:",
        "details": [
          {
            "value": 2.2,
            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))"
          }
        ]
      }
    ]
  }
}
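
The `idf` line in the explain output can be reproduced by hand, which helps when a rare term dominates the score. This is a minimal sketch of the formula quoted in the response, where `N` is the number of documents with the field and `n` the number containing the term; the function name and the sample counts are made up.

```python
import math


def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), as printed by the explain API."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


rare_term = bm25_idf(total_docs=10_000, docs_with_term=10)
common_term = bm25_idf(total_docs=10_000, docs_with_term=5_000)
```

A term found in few documents gets a much higher idf than one found in half of them, so it contributes more to the final score.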

13.8 Upgrade Best Practices

Rolling Upgrade

bash

# Upgrade procedure from 8.x to 8.y (minor versions):

# 1. Back up first
PUT /_snapshot/backup/pre-upgrade-snapshot

# 2. Restrict shard allocation to primaries:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# 3. Flush (synced flush was removed in 8.0; a regular flush is enough):
POST /_flush

# 4. Stop node
systemctl stop elasticsearch

# 5. Upgrade Elasticsearch package

# 6. Start node
systemctl start elasticsearch

# 7. Verify the node is healthy and has rejoined the cluster:
GET /_nodes/_local/stats

# 8. Re-enable allocation once the node has rejoined, then wait for green status:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null  # Reset to default
  }
}

# 9. Repeat for the next node

13.9 Capacity Planning

Estimating Resources

Data Volume:
  - Daily ingest: 1 billion events/day
  - Event size: 500 bytes avg
  - Daily raw data: 500GB/day
  - Compression factor: ~3x
  - Stored daily: ~167GB/day

Retention: 30 days
Total hot storage: 5TB

With 1 replica: 10TB total
With 20% overhead: 12TB disk space

Shard sizing (50GB target):
  10TB (primaries + 1 replica) / 50GB = 200 shards (100 primaries + 100 replicas)
  With 6 data nodes ≈ 34 shards/node, ~2TB disk each (acceptable)

Memory (heap) per node:
  @30MB/shard × 34 shards ≈ 1GB
  Add processing overhead: 6-8GB heap per node
  On 64GB servers: up to 31GB heap, the rest for the OS file cache

CPU:
  Search: p95 < 100ms → need ~14 CPU cores
  Indexing: 100k events/s × 500B = 50MB/s → need 8-12 cores
  Use: 32 CPU cores per node (standard server)
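
The storage and shard arithmetic can be captured in a small calculator so the numbers are easy to redo for other workloads. This is a rough sketch, assuming the inputs of the worked example (1 billion events/day, 500 bytes each, 3x compression, 30 days, 1 replica, 20% overhead, 50GB shards, 6 data nodes) and treating the replicated data volume as what gets divided into shards; the function name and parameters are hypothetical.

```python
import math


def plan_capacity(
    events_per_day: int = 1_000_000_000,
    event_bytes: int = 500,
    compression: float = 3.0,
    retention_days: int = 30,
    replicas: int = 1,
    overhead: float = 0.20,
    target_shard_gb: int = 50,
    data_nodes: int = 6,
) -> dict:
    """Rough storage and shard estimate; the defaults come from the worked example."""
    raw_gb_per_day = events_per_day * event_bytes / 1e9
    primary_gb = raw_gb_per_day * retention_days / compression
    data_gb = primary_gb * (1 + replicas)   # primaries plus replica copies
    disk_gb = data_gb * (1 + overhead)      # headroom on top of the stored data
    total_shards = math.ceil(data_gb / target_shard_gb)
    return {
        "raw_gb_per_day": round(raw_gb_per_day),
        "primary_tb": round(primary_gb / 1000, 1),
        "data_tb": round(data_gb / 1000, 1),
        "disk_tb": round(disk_gb / 1000, 1),
        "total_shards": total_shards,
        "shards_per_node": math.ceil(total_shards / data_nodes),
    }


plan = plan_capacity()
```

Changing a single input, for example `retention_days=90`, shows immediately how disk and shard counts scale.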

Sample Production Cluster Configuration

yaml

# production-elasticsearch.yml template

cluster.name: production-search-cluster

# Node identification
node.name: ${HOSTNAME}
node.roles: [ master, data, ingest ]      # Hot nodes

# Paths
path.data: /data/elasticsearch
path.logs: /var/log/elasticsearch

# Network
network.host: _site_
http.port: 9200
transport.port: 9300

# Discovery
discovery.seed_hosts:
  - "es-node-01:9300"
  - "es-node-02:9300"
  - "es-node-03:9300"
  - "es-node-04:9300"
  - "es-node-05:9300"

cluster.initial_master_nodes:
  - "es-node-01"
  - "es-node-02"
  - "es-node-03"

# Security
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: certs/transport.p12
xpack.security.transport.ssl.truststore.path: certs/transport.p12
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/http.p12

# Memory
bootstrap.memory_lock: true

# Disk watermarks
cluster.routing.allocation.disk.watermark.low: "80%"
cluster.routing.allocation.disk.watermark.high: "85%"
cluster.routing.allocation.disk.watermark.flood_stage: "90%"

# Logging
logger.level: WARN

13.10 Series Wrap-up

Elasticsearch Learning Path

Level 1 - Beginner (Chapters 1-4)
  ✅ Understand what Elasticsearch is and when to use it
  ✅ Set up local development with Docker
  ✅ Understand Index, Document, Mapping, Shards
  ✅ Basic CRUD operations

Level 2 - Intermediate (Chapters 5-8)
  ✅ Design mappings professionally
  ✅ Write complex queries with the Bool query
  ✅ Advanced queries (nested, parent-child, geo)
  ✅ Text analysis and custom analyzers

Level 3 - Advanced (Chapters 9-11)
  ✅ The full aggregation framework
  ✅ Performance optimization
  ✅ Advanced features (percolator, semantic search)

Level 4 - Production (Chapters 12-13)
  ✅ Real-world use cases
  ✅ Security, backup, monitoring
  ✅ Capacity planning and troubleshooting

Quick Reference Card

bash

# ----- CLUSTER -----
GET /_cluster/health
GET /_cat/nodes?v
GET /_cat/indices?v&s=store.size:desc

# ----- SEARCH -----
GET /{index}/_search
GET /{index}/_count
GET /{index}/_explain/{id}

# ----- INDEX -----
PUT /{index}
DELETE /{index}
PUT /{index}/_settings
GET /{index}/_mapping
POST /_reindex           # source and dest indices go in the request body

# ----- DOCS -----
PUT /{index}/_doc/{id}
GET /{index}/_doc/{id}
POST /{index}/_update/{id}
DELETE /{index}/_doc/{id}
POST /{index}/_bulk

# ----- ANALYSIS -----
GET /{index}/_analyze
GET /_analyze

# ----- ALIASES -----
POST /_aliases

# ----- SNAPSHOTS -----
PUT /_snapshot/{repo}/{name}
GET /_snapshot/{repo}/{name}
POST /_snapshot/{repo}/{name}/_restore

# ----- SECURITY -----
POST /_security/user/{username}
POST /_security/role/{rolename}
POST /_security/api_key

References

Official Documentation

Books

Elasticsearch: The Definitive Guide (O'Reilly)
Relevant Search (Manning)
Elasticsearch in Action (Manning)

Community

Elastic Discuss
Elastic Blog
GitHub - elastic/elasticsearch

All code samples were tested with Elasticsearch 8.x.
