[DOC] Tokenizer - Edge-n-gram #8378

150 changes: 150 additions & 0 deletions _analyzers/tokenizers/edge-n-gram-tokenizer.md
---
layout: default
title: Edge-N-Gram Tokenizer
parent: Tokenizers
nav_order: 20
---

# Edge n-gram tokenizer

The `edge_ngram` tokenizer generates partial word tokens, or n-grams, starting from the beginning of each word. It splits the text on specified characters and produces tokens whose lengths fall within a configured minimum and maximum range. This tokenizer is particularly useful for implementing search-as-you-type functionality.

Edge n-grams are ideal for autocomplete searches where the order of the words may vary, such as with product names or addresses. However, for text with a fixed order, like movie or song titles, the completion suggester may be more efficient.

## How the edge n-gram tokenizer works
By default, the `edge_ngram` tokenizer produces tokens with a minimum length of `1` and a maximum length of `2`.

For example, analyzing the text `OpenSearch` with the default configuration produces the following tokens:
```
O, Op
```
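You can confirm this with the `_analyze` API, which does not require creating an index first:
```
POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "OpenSearch"
}
```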
These short n-grams are often not sufficient for meaningful searches, so configuring the tokenizer is necessary to adjust the gram lengths.

## Edge n-gram tokenizer configuration
Configuration options include:
- `min_gram`: The minimum token length. Default is `1`.
- `max_gram`: The maximum token length. Default is `2`.
- `custom_token_chars`: Defines custom characters to be treated as part of a token (e.g., `+-_`).
- `token_chars`: Defines which character classes are included in tokens. The tokenizer splits tokens on characters that aren't part of the specified classes. Default is to include all characters. Available character classes include:
  - `letter`: Alphabetic characters (e.g., `a`, `b`, `ç`, `京`)
  - `digit`: Numeric characters (e.g., `3`, `7`)
  - `punctuation`: Punctuation symbols (e.g., `!`, `?`)
  - `symbol`: Other symbols (e.g., `$`, `√`)
  - `whitespace`: Space or newline characters
  - `custom`: Allows you to specify custom characters through the `custom_token_chars` setting, as shown in the sketch after this list.
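
As a sketch (the index and tokenizer names here are illustrative), the following tokenizer keeps letters, digits, and the characters `+-_` as token characters:
```
PUT custom_chars_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "code_friendly_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit",
            "custom"
          ],
          "custom_token_chars": "+-_"
        }
      }
    }
  }
}
```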

### `max_gram` parameter limitations
The `max_gram` value defines the upper limit for token length. If a search query is longer than the longest indexed token, the query may fail to match any indexed terms. For example, if `max_gram` is set to `4`, a search for `searching` will not match the indexed term `sear` because the full query term is longer than the longest generated gram.

A possible solution is to apply a `truncate` token filter in the search analyzer to limit search terms to the `max_gram` length, though this can yield imprecise results. For instance, truncating `searching` to `sear` might also match terms like `search` or `seared`, which may not be relevant.
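
As a rough sketch (the index, filter, and analyzer names below are illustrative), a search-time analyzer could combine the `standard` tokenizer with a `truncate` filter capped at the `max_gram` length:
```
PUT truncated_search_index
{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_to_max_gram": {
          "type": "truncate",
          "length": 4
        }
      },
      "analyzer": {
        "truncated_search_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "truncate_to_max_gram"
          ]
        }
      }
    }
  }
}
```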

### Example configuration
We can configure the `edge_ngram` tokenizer to produce tokens between `3` and `6` characters in length, treating only letters as valid token characters:
```
PUT edge_n_gram_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "my_custom_tokenizer"
        }
      },
      "tokenizer": {
        "my_custom_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 6,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  }
}
```
Analyze some sample text using the new index:
```
POST edge_n_gram_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Code 42 rocks!"
}
```
The resulting tokens are:
```
[Cod, Code, roc, rock, rocks]
```
Note that `42` and `!` produce no tokens because only `letter` characters are kept, and no grams shorter than the `min_gram` of `3` are generated.

## Best practices
It is recommended to use the `edge_ngram` tokenizer only at index time so that partial word tokens are stored. At search time, use a simpler analyzer so that the full user query is matched against the indexed prefixes.

## Search-as-you-type configuration
To implement search-as-you-type functionality, it's typical to use the `edge_ngram` tokenizer only during indexing and a simpler analyzer at search time. The following configuration defines both analyzers and maps them to the `product` field:
```
PUT my_autocomplete_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_index": {
          "tokenizer": "autocomplete_edge_ngram",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "autocomplete_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 8,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
```
Index a document containing the `product` field and refresh the index:
```
PUT my_autocomplete_index/_doc/1
{
  "product": "Laptop Pro"
}

POST my_autocomplete_index/_refresh
```
Then, perform a search using the partial query `Lap`:
```
GET my_autocomplete_index/_search
{
  "query": {
    "match": {
      "product": {
        "query": "Lap",
        "operator": "and"
      }
    }
  }
}
```
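
Because the prefix `lap` was stored as an edge n-gram at index time, the document should be returned. An abbreviated response (scores and shard metadata omitted) would look roughly like the following:
```
{
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "hits": [
      {
        "_index": "my_autocomplete_index",
        "_id": "1",
        "_source": {
          "product": "Laptop Pro"
        }
      }
    ]
  }
}
```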

In this way, the edge n-gram tokenizer enables partial matching: prefixes of terms like `Laptop` and `Pro` are indexed, so incomplete queries typed by a user still find the document.