Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update ai data masking doc #1310

Merged
merged 1 commit into from
Sep 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
55 changes: 23 additions & 32 deletions plugins/wasm-rust/extensions/ai-data-masking/README.md
Original file line number Diff line number Diff line change
@@ -1,44 +1,35 @@
# 功能说明
---
title: AI 数据脱敏
keywords: [higress,ai data masking]
description: AI 数据脱敏插件配置参考
---

## 功能说明

对请求/返回中的敏感词拦截、替换

![image](https://github.com/user-attachments/assets/f281c8c3-9613-4053-94aa-067694cc5fd4)


```mermaid
sequenceDiagram
participant 用户
participant 敏感词插件
participant 后端服务

用户->>敏感词插件: 请求数据(如:包含[email protected])
敏感词插件->>敏感词插件: 数据解析
opt 如果包含拦截词
敏感词插件-->>用户: 返回预设错误消息 (拦截)
end
opt 替换敏感词
敏感词插件->>后端服务: 关键词替换后的请求数据 (将[email protected]替换为****@gmail.com)
后端服务->>敏感词插件: 原始返回响应(包含 ****@gmail.com)
敏感词插件->>用户: 数据恢复后的相应数据(将****@gmail.com恢复为[email protected])
end
```
![image](https://img.alicdn.com/imgextra/i4/O1CN0156Wtko1T9JO0RiWow_!!6000000002339-0-tps-1314-638.jpg)

## 处理数据范围
### 处理数据范围
- openai协议:请求/返回对话内容
- jsonpath:只处理指定字段
- raw:整个请求/返回body

## 敏感词拦截
### 敏感词拦截
- 处理数据范围中出现敏感词直接拦截,返回预设错误信息
- 支持系统内置敏感词库和自定义敏感词

## 敏感词替换
### 敏感词替换
- 将请求数据中出现的敏感词替换为脱敏字符串,传递给后端服务。可保证敏感数据不出域
- 部分脱敏数据在后端服务返回后可进行还原
- 自定义规则支持标准正则和grok规则,替换字符串支持变量替换

## 运行属性

插件执行阶段:`认证阶段`
插件执行优先级:`991`

# 配置字段
## 配置字段

| 名称 | 数据类型 | 默认值 | 描述 |
| -------- | -------- | -------- | -------- |
Expand All @@ -57,7 +48,7 @@ sequenceDiagram
| replace_roles.restore | bool | false | 是否恢复 |
| replace_roles.value | string | - | 替换值(支持正则变量) |

# 配置示例
## 配置示例

```yaml
system_deny: true
Expand Down Expand Up @@ -98,17 +89,17 @@ replace_roles:
# hash后的值提供给大模型,从大模型返回的数据中会将hash值还原为原始值
```

# 敏感词替换样例
## 敏感词替换样例

## 用户请求内容
### 用户请求内容

请将 `curl http://172.20.5.14/api/openai/v1/chat/completions -H "Authorization: sk-12345" -H "Auth: [email protected]"` 改成post方式

## 处理后请求大模型内容
### 处理后请求大模型内容

`curl http://***.***.***.***/api/openai/v1/chat/completions -H "Authorization: 48a7e98a91d93896d8dac522c5853948" -H "Auth: ****@gmail.com"` 改成post方式

## 大模型返回内容
### 大模型返回内容

您想要将一个 `curl` 的 GET 请求转换为 POST 请求,并且这个请求是向一个特定的 API 发送数据。下面是修改后的 `curl` 命令,以 POST 方式发送:

Expand All @@ -129,7 +120,7 @@ curl -X POST \

请注意,您需要将 `"key":"value"` 替换为您实际要发送的数据内容。如果您的 API 接受不同的数据结构或者需要特定的字段,请根据实际情况调整这部分内容。

## 处理后返回用户内容
### 处理后返回用户内容

您想要将一个 `curl` 的 GET 请求转换为 POST 请求,并且这个请求是向一个特定的 API 发送数据。下面是修改后的 `curl` 命令,以 POST 方式发送:

Expand All @@ -151,7 +142,7 @@ curl -X POST \
请注意,您需要将 `"key":"value"` 替换为您实际要发送的数据内容。如果您的 API 接受不同的数据结构或者需要特定的字段,请根据实际情况调整这部分内容。


# 相关说明
## 相关说明

- 流模式中如果脱敏后的词被多个chunk拆分,可能无法进行还原
- 流模式中,如果敏感词语被多个chunk拆分,可能会有敏感词的一部分返回给用户的情况
Expand Down
131 changes: 131 additions & 0 deletions plugins/wasm-rust/extensions/ai-data-masking/README_EN.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,131 @@
---
title: AI Data Masking
keywords: [higress, ai data masking]
description: AI Data Masking Plugin Configuration Reference
---
## Function Description
Interception and replacement of sensitive words in requests/responses
![image](https://img.alicdn.com/imgextra/i4/O1CN0156Wtko1T9JO0RiWow_!!6000000002339-0-tps-1314-638.jpg)

### Data Handling Scope
- openai protocol: Request/response conversation content
- jsonpath: Only process specified fields
- raw: Entire request/response body

### Sensitive Word Interception
- Directly intercept sensitive words in the data handling scope and return preset error messages
- Supports system's built-in sensitive word library and custom sensitive words

### Sensitive Word Replacement
- Replace sensitive words in request data with masked strings before passing to back-end services. Ensures that sensitive data does not leave the domain
- Some masked data can be restored after being returned by the back-end service
- Custom rules support standard regular expressions and grok rules, and replacement strings support variable substitution

## Execution Properties
Plugin Execution Phase: `Authentication Phase`
Plugin Execution Priority: `991`

## Configuration Fields
| Name | Data Type | Default Value | Description |
| ---------------------- | ---------------- | -------------- | ------------------------------------ |
| deny_openai | bool | true | Intercept openai protocol |
| deny_jsonpath | string | [] | Intercept specified jsonpath |
| deny_raw | bool | false | Intercept raw body |
| system_deny | bool | true | Enable built-in interception rules |
| deny_code | int | 200 | HTTP status code when intercepted |
| deny_message | string | Sensitive words found in the question or answer have been blocked | AI returned message when intercepted |
| deny_raw_message | string | {"errmsg":"Sensitive words found in the question or answer have been blocked"} | Content returned when not openai intercepted |
| deny_content_type | string | application/json | Content type header returned when not openai intercepted |
| deny_words | array of string | [] | Custom sensitive word list |
| replace_roles | array | - | Custom sensitive word regex replacement |
| replace_roles.regex | string | - | Rule regex (built-in GROK rule) |
| replace_roles.type | [replace, hash] | - | Replacement type |
| replace_roles.restore | bool | false | Whether to restore |
| replace_roles.value | string | - | Replacement value (supports regex variables) |

## Configuration Example
```yaml
system_deny: true
deny_openai: true
deny_jsonpath:
- "$.messages[*].content"
deny_raw: true
deny_code: 200
deny_message: "Sensitive words found in the question or answer have been blocked"
deny_raw_message: "{\"errmsg\":\"Sensitive words found in the question or answer have been blocked\"}"
deny_content_type: "application/json"
deny_words:
- "Custom sensitive word 1"
- "Custom sensitive word 2"
replace_roles:
- regex: "%{MOBILE}"
type: "replace"
value: "****"
# Mobile number 13800138000 -> ****
- regex: "%{EMAILLOCALPART}@%{HOSTNAME:domain}"
type: "replace"
restore: true
value: "****@$domain"
# Email [email protected] -> ****@gmail.com
- regex: "%{IP}"
type: "replace"
restore: true
value: "***.***.***.***"
# IP 192.168.0.1 -> ***.***.***.***
- regex: "%{IDCARD}"
type: "replace"
value: "****"
# ID card number 110000000000000000 -> ****
- regex: "sk-[0-9a-zA-Z]*"
restore: true
type: "hash"
# hash sk-12345 -> 9cb495455da32f41567dab1d07f1973d
# The hashed value is provided to the large model, and the hash value will be restored to the original value from the data returned by the large model
```

## Sensitive Word Replacement Example
### User Request Content
Please change `curl http://172.20.5.14/api/openai/v1/chat/completions -H "Authorization: sk-12345" -H "Auth: [email protected]"` to POST method

### Processed Request Large Model Content
`curl http://***.***.***.***/api/openai/v1/chat/completions -H "Authorization: 48a7e98a91d93896d8dac522c5853948" -H "Auth: ****@gmail.com"` change to POST method

### Large Model Returned Content
You want to convert a `curl` GET request to a POST request, and this request is sending data to a specific API. Below is the modified `curl` command to send as POST:
```sh
curl -X POST \
-H "Authorization: 48a7e98a91d93896d8dac522c5853948" \
-H "Auth: ****@gmail.com" \
-H "Content-Type: application/json" \
-d '{"key":"value"}' \
http://***.***.***.***/api/openai/v1/chat/completions
```
Here are the following modifications made:
- `-X POST` sets the request method to POST.
- `-H "Content-Type: application/json"` sets the `Content-Type` in the request header to `application/json`, which is typically used to inform the server that the data you are sending is in JSON format.
- `-d '{"key":"value"}'` sets the data to be sent, where `'{"key":"value"}'` is a simple example of a JSON object. You need to replace it with the actual data you want to send.

Please note that you need to replace `"key":"value"` with the actual data content you want to send. If your API accepts a different data structure or requires specific fields, please adjust this part according to your actual situation.

### Processed Return to User Content
You want to convert a `curl` GET request to a POST request, and this request is sending data to a specific API. Below is the modified `curl` command to send as POST:
```sh
curl -X POST \
-H "Authorization: sk-12345" \
-H "Auth: [email protected]" \
-H "Content-Type: application/json" \
-d '{"key":"value"}' \
http://172.20.5.14/api/openai/v1/chat/completions
```
Here are the following modifications made:
- `-X POST` sets the request method to POST.
- `-H "Content-Type: application/json"` sets the `Content-Type` in the request header to `application/json`, which is typically used to inform the server that the data you are sending is in JSON format.
- `-d '{"key":"value"}'` sets the data to be sent, where `'{"key":"value"}'` is a simple example of a JSON object. You need to replace it with the actual data you want to send.

Please note that you need to replace `"key":"value"` with the actual data content you want to send. If your API accepts a different data structure or requires specific fields, please adjust this part according to your actual situation.

## Related Notes
- In streaming mode, if the masked words are split across multiple chunks, restoration may not be possible
- In streaming mode, if sensitive words are split across multiple chunks, there may be cases where part of the sensitive word is returned to the user
- Grok built-in rule list: https://help.aliyun.com/zh/sls/user-guide/grok-patterns
- Built-in sensitive word library data source: https://github.com/houbb/sensitive-word/tree/master/src/main/resources
Loading