API Updates: routing table, models endpoint, inference routing, rebase on safety_refactor #85

Open · wants to merge 45 commits into safety_refactor

Conversation

@yanxi0830 (Contributor) commented Sep 20, 2024

Changes

Major Changes

✅ rebase on top of safety_refactor branch
✅ routers migration
- move the memory router to llama_stack/distribution/routers
- add a RoutingTable class to manage provider implementations
- add inference routing that selects a provider based on the model param (see the sketch after this list)
✅ models/list & models/get endpoints to query which models are currently being served and by which API provider
- one adapter can support multiple models (e.g. fireworks/together/ollama); in this case, we list all models the adapter supports
✅ safety/list_shields to query the shields available in a distribution
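
As a rough sketch of the routing idea (class and method names below are illustrative, not the actual llama_stack/distribution/routers implementation):

# Illustrative sketch only -- names are hypothetical, not the actual
# llama_stack/distribution/routers implementation.
from typing import Any, Dict


class RoutingTable:
    """Maps a routing key (e.g. a model name) to a provider implementation."""

    def __init__(self, providers: Dict[str, Any]) -> None:
        # providers: routing_key -> instantiated provider impl
        self.providers = providers

    def get_provider(self, routing_key: str) -> Any:
        if routing_key not in self.providers:
            raise ValueError(f"no provider registered for routing_key={routing_key!r}")
        return self.providers[routing_key]


class InferenceRouter:
    """Dispatches inference requests to a provider based on the model param."""

    def __init__(self, table: RoutingTable) -> None:
        self.table = table

    async def chat_completion(self, model: str, messages: list, **kwargs: Any) -> Any:
        provider = self.table.get_provider(model)
        return await provider.chat_completion(model=model, messages=messages, **kwargs)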

Minor Changes

✅ backward compatibility: a single adapter without a routing table is still supported (see the sketch after this list)
✅ example configs demonstrating an advanced routing table
✅ configure script for single-adapter-backed endpoints
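
A minimal sketch of how that backward compatibility could work (build_routing_table is a hypothetical helper, not part of llama_stack; only the provider_map / provider_routing_table keys come from this PR's configs):

# Sketch only: treat a plain provider_map entry as a one-entry routing
# table so single-adapter configs keep working.
from typing import Any, Dict


def build_routing_table(run_config: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    table: Dict[str, Dict[str, Any]] = {}
    for api, entries in run_config.get("provider_routing_table", {}).items():
        table[api] = {entry["routing_key"]: entry for entry in entries}
    for api, entry in run_config.get("provider_map", {}).items():
        # single adapter without routing: route every request to it
        table.setdefault(api, {"*": entry})
    return table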

Not in this PR

🚧 register_model endpoint to add new models
🚧 Client CLI for querying the distribution's models endpoint (will land in llama-stack-apps)
🚧 Update the client SDK package to handle the new OpenAPI spec for run_shield

Tests

Routing Table

Test Inference Routing
python -m llama_stack.apis.inference.client 
python sdk_examples/inference/client.py localhost 5000 false
  • Meta-Llama3.1-8B-Instruct (screenshot)
  • Meta-Llama3.1-8B (screenshot)
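
The same routing can be exercised over raw HTTP; a minimal sketch, assuming the server is on localhost:5000 (the endpoint path and payload shape are assumptions based on this PR's description, not a documented contract):

# Sketch only: send the same prompt through two routing keys.
import requests

for model in ["Meta-Llama3.1-8B-Instruct", "Meta-Llama3.1-8B"]:
    resp = requests.post(
        "http://localhost:5000/inference/chat_completion",
        json={
            "model": model,  # routing key: selects the provider from the table
            "messages": [{"role": "user", "content": "hello"}],
            "stream": False,
        },
    )
    print(model, "->", resp.status_code)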
Test Memory Routing
python -m llama_stack.apis.memory.client
  • vector memory type (screenshot)
  • switch to keyvalue memory type via remote::pgvector (screenshot)
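
A memory-bank creation request of the kind this test exercises might look like the following sketch (the endpoint path and the config shape, including type acting as the routing key, are assumptions):

# Sketch only: the memory router picks a provider from the routing key
# ("vector" vs "keyvalue"). Path and payload are assumptions.
import requests

resp = requests.post(
    "http://localhost:5000/memory/create",
    json={
        "config": {
            "type": "vector",  # routing key -> the vector provider
            "embedding_model": "all-MiniLM-L6-v2",
            "chunk_size_in_tokens": 512,
        },
    },
)
print(resp.status_code)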

Test Agents
python -m llama_stack.apis.agents.client
python sdk_examples/agents/client.py

(screenshot)

Test Safety
python -m llama_stack.apis.safety.client localhost 5000

(screenshot)

Models Endpoint without Routing Table

  • using the ollama adapter's supported models list
inference:
    provider_id: remote::ollama
    config:
      url: https://xxx

(screenshot)

  • using default local
inference:
    provider_id: meta-reference
    config:
      model: Meta-Llama3.1-8B-Instruct
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1

(screenshot)

Models Endpoint with Routing Table

  • using routing table
provider_routing_table:
  inference:
    - routing_key: Meta-Llama3.1-8B-Instruct
      provider_id: meta-reference
      config:
        model: Meta-Llama3.1-8B-Instruct
        quantization: null
        torch_seed: null
        max_seq_len: 4096
        max_batch_size: 1
    - routing_key: Meta-Llama3.1-8B
      provider_id: remote::ollama
      config:
        url: https://ollama.com

(screenshot)
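
Either way, the endpoint can be queried directly; a minimal sketch (the exact path and response shape are assumptions):

# Sketch only: list the models currently being served and their providers.
import requests

resp = requests.get("http://localhost:5000/models/list")
for spec in resp.json():
    print(spec)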

run.yaml

built_at: '2024-09-18T13:41:17.656743'
image_name: local
docker_image: null
conda_env: local
apis_to_serve:
- inference
- memory
- telemetry
- agents
- safety
provider_map:
  telemetry:
    provider_id: meta-reference
    config: {}
  safety:
    provider_id: meta-reference
    config:
      llama_guard_shield:
        model: Llama-Guard-3-8B
        excluded_categories: []
        disable_input_check: false
        disable_output_check: false
      prompt_guard_shield:
        model: Prompt-Guard-86M
  agents:
    provider_id: meta-reference
    config: {}
provider_routing_table:
  inference:
    - routing_key: Meta-Llama3.1-8B-Instruct
      provider_id: meta-reference
      config:
        model: Meta-Llama3.1-8B-Instruct
        quantization: null
        torch_seed: null
        max_seq_len: 4096
        max_batch_size: 1
    # - routing_key: Meta-Llama3.1-8B
    #   provider_id: meta-reference
    #   config:
    #     model: Meta-Llama3.1-8B
    #     quantization: null
    #     torch_seed: null
    #     max_seq_len: 4096
    #     max_batch_size: 1
  memory:
    # - routing_key: keyvalue
    #   provider_id: remote::pgvector
    #   config:
    #     host: localhost
    #     port: 5432
    #     db: vectordb
    #     user: vectoruser
    #     password: xxxx
    - routing_key: vector
      provider_id: meta-reference
      config: {}

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Sep 20, 2024
@yanxi0830 changed the title from [WIP] new router migration to API Updates: routing table, models endpoint, inference routing, rebase on safety_refactor Sep 22, 2024
@ashwinb changed the base branch from main to api_updates_3 September 22, 2024 07:11
@ashwinb changed the base branch from api_updates_3 to safety_refactor September 22, 2024 07:12
@yanxi0830 marked this pull request as ready for review September 22, 2024 07:25