[FEATURE] Enable easy migration from BM25 to Neural Index with Reindex Step #617

Closed
dbwiddis opened this issue Mar 26, 2024 · 5 comments
Labels
enhancement New feature or request v2.15.0

Comments

@dbwiddis
Member

Is your feature request related to a problem?

One of the most common use cases for ML offerings is enabling Neural Search by adding embeddings to an existing index. The steps to set this up are simple and documented (here and here), and @owaiskazi19 demonstrated them early in our exploratory development with a scrappy demo:

  • Configure an Ingest Pipeline to add the text embeddings to an index
  • Create a new index using the pipeline prepared in the previous step
  • Reindex an existing text (BM25) index onto the index created in the previous step.
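
As a rough sketch of the three steps above, the following Python builds the corresponding REST requests as plain data. The `text_embedding` ingest processor, the `index.default_pipeline` setting, and the `_reindex` endpoint are existing OpenSearch features; the pipeline name, the `passage_embedding` field, the dimension of 768, and the model ID placeholder are assumptions made only for illustration.

```python
import json

# Hypothetical names for illustration; substitute your own.
PIPELINE = "nlp-ingest-pipeline"
MODEL_ID = "<your_model_id>"       # placeholder, not a real model ID
TEXT_FIELD = "your_field"
VECTOR_FIELD = "passage_embedding"
DIMENSION = 768                    # assumed; must match the model's output size


def migration_requests(source_index, dest_index):
    """Return (method, path, body) tuples for the three migration steps."""
    # Step 1: ingest pipeline that writes an embedding of TEXT_FIELD.
    pipeline = {
        "description": "Add text embeddings at ingest time",
        "processors": [
            {"text_embedding": {
                "model_id": MODEL_ID,
                "field_map": {TEXT_FIELD: VECTOR_FIELD},
            }}
        ],
    }
    # Step 2: k-NN enabled index that runs the pipeline by default.
    index = {
        "settings": {"index": {"knn": True, "default_pipeline": PIPELINE}},
        "mappings": {"properties": {
            TEXT_FIELD: {"type": "text"},
            VECTOR_FIELD: {"type": "knn_vector", "dimension": DIMENSION},
        }},
    }
    # Step 3: copy documents from the BM25 index into the new index.
    reindex = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    return [
        ("PUT", f"/_ingest/pipeline/{PIPELINE}", pipeline),
        ("PUT", f"/{dest_index}", index),
        ("POST", "/_reindex", reindex),
    ]


if __name__ == "__main__":
    for method, path, body in migration_requests("bm25_index", "neural_index"):
        print(method, path)
        print(json.dumps(body, indent=2))
```

The requests are returned as data rather than sent, so the sketch does not depend on any particular client library.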

What solution would you like?

Ingest pipeline creation and new index configuration have already been completed and will be part of the 2.13 release.

To complete this solution we need to add a ReindexStep that calls the Reindex API.

Reindexing comes with some cautions that a user should be aware of, and hiding them behind an automated workflow risks surprising users with unexpected behavior.

Reindex only copies documents that were in the source index when the operation started, so if the index is still being written to, the reindex won't capture new documents. It is also an expensive operation on large indices, as this note in the linked docs points out:

Reindexing can be an expensive operation depending on the size of your source index. We recommend you disable replicas in your destination index by setting number_of_replicas to 0 and re-enable them once the reindex process is complete.

Using a model for embeddings adds even more "expense" to this process.
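
A minimal sketch of the replica recommendation quoted above, again as request data only. The `number_of_replicas` index setting and the `_reindex` endpoint are existing OpenSearch APIs; the function names and the default of restoring one replica (matching the example indices in this thread) are assumptions.

```python
def replica_settings(count):
    """Body for PUT /<index>/_settings that sets the replica count."""
    return {"index": {"number_of_replicas": count}}


def reindex_plan(source_index, dest_index, final_replicas=1):
    """Ordered (method, path, body) steps: drop replicas, reindex, restore."""
    settings_path = f"/{dest_index}/_settings"
    reindex_body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    return [
        ("PUT", settings_path, replica_settings(0)),               # disable replicas
        ("POST", "/_reindex", reindex_body),                       # copy documents
        ("PUT", settings_path, replica_settings(final_replicas)),  # re-enable replicas
    ]
```

The last step should only run once the reindex has actually finished, which matters if the reindex is started asynchronously.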

Accordingly, there should be at least some sort of "confirmation prompt" when this workflow step is used. From a backend/template perspective, a path parameter expressing the user's acknowledgement of these cautions (e.g., allow_expensive=true or similar) should be required, and provisioning should fail fast with a helpful, verbose error message if a reindex step is present and the parameter is not set to true.
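
To make the proposed check concrete, here is an illustrative Python sketch of the fail-fast logic. It is not the plugin's implementation (the plugin is written in Java); `allow_expensive` is the example name suggested above, and the exception and function names are hypothetical.

```python
class ReindexNotAcknowledged(ValueError):
    """Raised when a template has a reindex step but no acknowledgement."""


def check_reindex_acknowledged(template, params):
    """Fail fast if any node is a reindex step and allow_expensive is not true.

    Returns the IDs of the reindex nodes found (possibly empty).
    """
    reindex_ids = [
        node["id"]
        for workflow in template.get("workflows", {}).values()
        for node in workflow.get("nodes", [])
        if node.get("type") == "reindex"
    ]
    # Request parameters usually arrive as strings, so compare case-insensitively.
    acknowledged = str(params.get("allow_expensive", "false")).lower() == "true"
    if reindex_ids and not acknowledged:
        raise ReindexNotAcknowledged(
            f"Workflow contains reindex step(s) {reindex_ids}. Reindexing only copies "
            "documents present when it starts and can be expensive on large indices, "
            "especially with an embedding model. Set allow_expensive=true to proceed."
        )
    return reindex_ids
```

Running the check before any node is queued means nothing is partially provisioned when the acknowledgement is missing.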

What alternatives have you considered?

The status quo as of 2.13, which enables setting up the ingest pipeline and new index but requires the user to perform the reindexing operation manually. (These steps could be combined from a front-end perspective but remain separate on the back end.)

Do you have any additional context?

It's also possible to update the same index in-place with a pipeline using Update by query. This should be considered for a future addition.
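
For reference, a sketch of what the in-place alternative could look like as a request. The `_update_by_query` endpoint and its `pipeline` query parameter are existing OpenSearch APIs; the function name and the pipeline name in the usage note are hypothetical, and the sketch does not address whether the existing index's settings and mappings can hold the embedding field.

```python
from urllib.parse import urlencode


def update_by_query_request(index, pipeline):
    """Build a request that re-processes every document through a pipeline."""
    query = urlencode({"pipeline": pipeline})
    body = {"query": {"match_all": {}}}  # touch all documents in the index
    return ("POST", f"/{index}/_update_by_query?{query}", body)
```

For example, `update_by_query_request("bm25_index", "nlp-ingest-pipeline")` targets the original index directly instead of copying into a new one.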

@dbwiddis dbwiddis added enhancement New feature or request untriaged v2.14.0 and removed untriaged labels Mar 26, 2024
@minalsha
Collaborator

minalsha commented Apr 2, 2024

@sean-zheng-amazon could you please review this issue?

@owaiskazi19
Member

owaiskazi19 commented Apr 12, 2024

BM25 index

PUT bm25_index
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "similarity": {
        "default": {
          "type": "BM25"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "your_field": {
        "type": "text"
      }
    }
  }
}

Neural index with same mapping

PUT neural_index
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "your_field": {
        "type": "text"
      }
    }
  }
}

Template

POST _plugins/_flow_framework/workflow
{
  "name": "Reindex test",
  "description": "Reindex test",
  "use_case": "PROVISION",
  "version": {
    "template": "1.0.0",
    "compatibility": [
      "2.12.0",
      "3.0.0"
    ]
  },
  "workflows": {
    "provision": {
      "nodes": [
        {
          "id": "reindex",
          "type": "reindex",
          "user_inputs": {
            "source_index": "bm25_index",
            "destination_index": "neural_index"
          }
        }
      ]
    }
  }
}

Response:

[2024-04-12T20:40:55,600][INFO ][o.o.f.t.ProvisionWorkflowTransportAction] [ip-172-31-56-214] Queueing process [reindex]. Can start immediately!
[2024-04-12T20:40:55,600][INFO ][o.o.f.w.ProcessNode      ] [ip-172-31-56-214] Starting reindex.
[2024-04-12T20:40:55,603][INFO ][o.o.f.w.ReIndexStep      ] [ip-172-31-56-214] Reindex from source: bm25_index to destination neural_index1
[2024-04-12T20:40:55,618][INFO ][o.o.f.i.FlowFrameworkIndicesHandler] [ip-172-31-56-214] updated resources created of UWIK1I4BfZukSsNLfiXu
[2024-04-12T20:40:55,618][INFO ][o.o.f.w.ReIndexStep      ] [ip-172-31-56-214] successfully updated resource created in state index: .plugins-flow-framework-state
[2024-04-12T20:40:55,618][INFO ][o.o.f.w.ProcessNode      ] [ip-172-31-56-214] Finished reindex.
[2024-04-12T20:40:55,618][INFO ][o.o.f.t.ProvisionWorkflowTransportAction] [ip-172-31-56-214] Provisioning completed successfully for workflow UWIK1I4BfZukSsNLfiXu
[2024-04-12T20:40:55,633][INFO ][o.o.f.t.ProvisionWorkflowTransportAction] [ip-172-31-56-214] updated workflow UWIK1I4BfZukSsNLfiXu state to COMPLETED

@owaiskazi19
Member

@navneet1v and @vamshin can you take a look at the issue and the draft PR? Thanks

@minalsha minalsha added v2.15.0 and removed v2.14.0 labels Apr 22, 2024
@minalsha
Collaborator

@owaiskazi19 please post the latest status and share the draft PR with the additional parameters for reindexing. Thanks.

@owaiskazi19
Member

@minalsha Added the refresh, requests_per_second, require_alias, slices, and max_docs parameters and raised a new PR, #718.

{
  "name": "Reindex test",
  "description": "Reindex test",
  "use_case": "PROVISION",
  "version": {
    "template": "1.0.0",
    "compatibility": [
      "2.12.0",
      "3.0.0"
    ]
  },
  "workflows": {
    "provision": {
      "nodes": [
        {
          "id": "reindex",
          "type": "reindex",
          "user_inputs": {
            "source_index": "bm25_index",
            "destination_index": "neural_index",
            "refresh": true,
            "requests_per_second": 2,
            "require_alias": "false",
            "slices": 1,
            "max_docs": 2
          }
        }
      ]
    }
  }
}
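
For comparison, here is a sketch of how these `user_inputs` would correspond to a direct Reindex API call, where `refresh`, `requests_per_second`, `require_alias`, `slices`, and `max_docs` are accepted as URL parameters. How the workflow step forwards them internally is not shown in this thread, so the helper below is only an assumed equivalent, and its name is hypothetical.

```python
from urllib.parse import urlencode

# Optional reindex settings from the template's user_inputs.
QUERY_PARAMS = ("refresh", "requests_per_second", "require_alias", "slices", "max_docs")


def to_reindex_request(user_inputs):
    """Translate reindex-step user_inputs into (method, path, body)."""
    body = {
        "source": {"index": user_inputs["source_index"]},
        "dest": {"index": user_inputs["destination_index"]},
    }
    params = {}
    for name in QUERY_PARAMS:
        if name in user_inputs:
            value = user_inputs[name]
            # JSON booleans become lowercase strings; other values are stringified.
            params[name] = str(value).lower() if isinstance(value, bool) else str(value)
    path = "/_reindex" + ("?" + urlencode(params) if params else "")
    return ("POST", path, body)
```

Because string and boolean inputs both end up as lowercase strings in the URL, the mixed `true` and `"false"` values in the template above are handled the same way.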

Projects
Status: 2.15.0 (Release window opens on June 10th, 2024 and closes on June 25th, 2024)