
feat(llmobs): poc ragas evaluation integration #10431

Draft · wants to merge 29 commits into base: main

Conversation
Conversation

@lievan (Contributor) commented Aug 28, 2024

(Not meant to be merged.) This is a full E2E implementation of a RAGAS evaluation runner that continuously generates faithfulness scores from finished LLM Obs span events and submits those scores as evaluation metrics.

This PR will be split into two separate PRs:

  • Implement a background faithfulness evaluation runner with a dummy "faithfulness" function
  • Introduce ragas as a dependency and plug in ragas faithfulness evals

Future features not in this PR:

  • implement sampling
  • implement more ragas eval runners (e.g. context utilization); we can then refactor some of the code from RagasFaithfulnessEvaluationRunner into a BaseEvaluationRunner, etc.
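The planned refactor of shared logic into a `BaseEvaluationRunner` could be sketched roughly as below. This is an illustrative sketch only, not the actual ddtrace API; all class and method names besides the two runner names mentioned above are hypothetical.

```python
# Sketch: shared submission logic lives in a base class; each metric
# implements only its scoring step. Names are illustrative, not ddtrace's API.

class BaseEvaluationRunner:
    def __init__(self, label):
        self.label = label
        self.submitted = []  # stand-in for sending metrics to LLM Obs

    def evaluate(self, span_event):
        """Score a span event; implemented by each concrete runner."""
        raise NotImplementedError

    def run(self, span_event):
        score = self.evaluate(span_event)
        self.submitted.append({"label": self.label, "score": score})
        return score


class RagasFaithfulnessEvaluationRunner(BaseEvaluationRunner):
    def __init__(self):
        super().__init__(label="ragas_faithfulness")

    def evaluate(self, span_event):
        # Dummy scorer: a real implementation would run the ragas
        # faithfulness steps against the span's input/output.
        return 1.0 if span_event.get("output") else 0.0


runner = RagasFaithfulnessEvaluationRunner()
print(runner.run({"output": "Paris is the capital of France."}))  # 1.0
```

Each new metric (e.g. context utilization) would then only need to subclass and implement `evaluate`.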

Some design things:

  • we're not going to expose any public in-code options to enable ragas (yet); keep it as an env var for now
  • spans from the ragas eval itself have the same ml app but a different service name
  • calling ragas.evaluate(metrics=[..]) is the most popular entrypoint for using ragas. We don't use this function; instead we re-implement faithfulness step by step so we have more control over what's going on under the hood
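Re-implementing faithfulness step by step means running the metric's two LLM stages directly: statement extraction, then verdict generation against the retrieved context. A minimal sketch of that control flow, with a stubbed-out LLM callable; all function names here are hypothetical and this is not ragas internals:

```python
# Faithfulness decomposed into its two LLM stages so each step can be traced
# and controlled individually. `llm` is a stand-in for a real model call.

def extract_statements(question, answer, llm):
    """Stage 1: break the answer into atomic statements."""
    prompt = f"Break the answer into simple statements.\nQ: {question}\nA: {answer}"
    return llm(prompt)

def verdicts(statements, context, llm):
    """Stage 2: judge each statement against the retrieved context."""
    return [llm(f"Context: {context}\nStatement: {s}\nSupported? yes/no") == "yes"
            for s in statements]

def faithfulness_score(question, answer, context, llm):
    statements = extract_statements(question, answer, llm)
    if not statements:
        return 0.0
    supported = verdicts(statements, context, llm)
    return sum(supported) / len(supported)

# Stub LLM: "extracts" one statement and marks every statement as supported.
def stub_llm(prompt):
    return ["The capital of France is Paris."] if prompt.startswith("Break") else "yes"

print(faithfulness_score("Capital of France?", "Paris.",
                         "Paris is the capital.", stub_llm))  # 1.0
```

Owning each stage like this is what lets the runner trace the intermediate prompts and outputs, rather than treating ragas.evaluate as a black box.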

This POC roughly combines these PRs:

#10645 (modifying integration generated spans)
#10638 (prompt templating)
#10662 (ragas faithfulness runner)
(insert one that has real ragas faithfulness evals)

Usage

Run

DD_LLMOBS_RAGAS_FAITHFULNESS_ENABLED=true python3 your_llm_script.py

You will see the faithfulness score in the UI


There will also be a trace of the ragas evaluation itself, with the same ml app name but a different service name (ragas)


Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

@datadog-dd-trace-py-rkomorn

Datadog Report

Branch report: evan.li/ragas-integration
Commit report: a0d7718
Test service: dd-trace-py

❌ 301 Failed (0 Known Flaky), 139671 Passed, 1713 Skipped, 5h 15m 45.99s Total Time
❄️ 12 New Flaky
⌛ 1 Performance Regression

❌ Failed Tests (301)

This report shows up to 5 failed tests.

  • test_completion[ddtrace_global_config0] - test_anthropic_llmobs.py - Details

     expected call not found.
     Expected: enqueue({'trace_id': '66cf40c50000000037287785185eb587', 'span_id': '3403625228837624890', 'parent_id': 'undefined', 'session_id': '66cf40c50000000037287785185eb587', 'name': 'anthropic.request', 'tags': ['version:', 'env:', 'service:', 'source:integration', 'ml_app:<ml-app-name>', 'session_id:66cf40c50000000037287785185eb587', 'ddtrace.version:2.13.0.dev87+ga0d7718b7', 'error:0'], 'start_ns': 1724858565109591403, 'duration': 8149866, 'status': 'ok', 'meta': {'span.kind': 'llm', 'input': {'messages': [{'content': 'Respond only in all caps.', 'role': 'system'}, {'content': 'Hello, I am looking for information about some books!', 'role': 'user'}, {'content': 'What is the best selling book?', 'role': 'user'}]}, 'output': {'messages': [{'content': 'THE BEST-SELLING BOOK OF ALL TIME IS "DON', 'role': 'assistant'}]}, 'model_name': 'claude-3-opus-20240229', 'model_provider': 'anthropic', 'metadata': {'temperature': 0.8, 'max_tokens': 15.0}}, 'metrics': {'input_tokens': 32, 'output_tokens': 15, 'total_tokens': 47}})
     Actual: enqueue({'trace_id': '66cf40c50000000037287785185eb587', 'span_id': '3403625228837624890', 'parent_id': 'undefined', 'session_id': '66cf40c50000000037287785185eb587', 'ml_app': '<ml-app-name>', 'name': 'anthropic.request', 'tags': ['version:', 'env:', 'service:', 'source:integration', 'ml_app:<ml-app-name>', 'session_id:66cf40c50000000037287785185eb587', 'ddtrace.version:2.13.0.dev87+ga0d7718b7', 'error:0'], 'start_ns': 1724858565109591403, 'duration': 8149866, 'status': 'ok', 'meta': {'span.kind': 'llm', 'input': {'messages': [{'content': 'Respond only in all caps.', 'role': 'system'}, {'content': 'Hello, I am looking for information about some books!', 'role': 'user'}, {'content': 'What is the best selling book?', 'role': 'user'}]}, 'output': {'messages': [{'content': 'THE BEST-SELLING BOOK OF ALL TIME IS "DON', 'role': 'assistant'}]}, 'model_name': 'claude-3-opus-20240229', 'model_provider': 'anthropic', 'metadata': {'temperature': 0.8, 'max_tokens': 15}}, 'metrics': {'input_tokens': 32, 'output_tokens': 15, 'total_tokens': 47}})
    
  • test_completion[ddtrace_global_config0] - test_anthropic_llmobs.py - Details

     expected call not found.
     Expected: enqueue({'trace_id': '66cf40a70000000084bc5d21d0fac8ff', 'span_id': '11644070325589362363', 'parent_id': 'undefined', 'session_id': '66cf40a70000000084bc5d21d0fac8ff', 'name': 'anthropic.request', 'tags': ['version:', 'env:', 'service:', 'source:integration', 'ml_app:<ml-app-name>', 'session_id:66cf40a70000000084bc5d21d0fac8ff', 'ddtrace.version:2.13.0.dev87+ga0d7718b7', 'error:0'], 'start_ns': 1724858535325358370, 'duration': 7759169, 'status': 'ok', 'meta': {'span.kind': 'llm', 'input': {'messages': [{'content': 'Respond only in all caps.', 'role': 'system'}, {'content': 'Hello, I am looking for information about some books!', 'role': 'user'}, {'content': 'What is the best selling book?', 'role': 'user'}]}, 'output': {'messages': [{'content': 'THE BEST-SELLING BOOK OF ALL TIME IS "DON', 'role': 'assistant'}]}, 'model_name': 'claude-3-opus-20240229', 'model_provider': 'anthropic', 'metadata': {'temperature': 0.8, 'max_tokens': 15.0}}, 'metrics': {'input_tokens': 32, 'output_tokens': 15, 'total_tokens': 47}})
     Actual: enqueue({'trace_id': '66cf40a70000000084bc5d21d0fac8ff', 'span_id': '11644070325589362363', 'parent_id': 'undefined', 'session_id': '66cf40a70000000084bc5d21d0fac8ff', 'ml_app': '<ml-app-name>', 'name': 'anthropic.request', 'tags': ['version:', 'env:', 'service:', 'source:integration', 'ml_app:<ml-app-name>', 'session_id:66cf40a70000000084bc5d21d0fac8ff', 'ddtrace.version:2.13.0.dev87+ga0d7718b7', 'error:0'], 'start_ns': 1724858535325358370, 'duration': 7759169, 'status': 'ok', 'meta': {'span.kind': 'llm', 'input': {'messages': [{'content': 'Respond only in all caps.', 'role': 'system'}, {'content': 'Hello, I am looking for information about some books!', 'role': 'user'}, {'content': 'What is the best selling book?', 'role': 'user'}]}, 'output': {'messages': [{'content': 'THE BEST-SELLING BOOK OF ALL TIME IS "DON', 'role': 'assistant'}]}, 'model_name': 'claude-3-opus-20240229', 'model_provider': 'anthropic', 'metadata': {'temperature': 0.8, 'max_tokens': 15}}, 'metrics': {'input_tokens': 32, 'output_tokens': 15, 'total_tokens': 47}})
    
  • test_completion[ddtrace_global_config0] - test_anthropic_llmobs.py - Details

     expected call not found.
     Expected: enqueue({'trace_id': '66cf40b50000000066b9f58a86213995', 'span_id': '11869315331452028980', 'parent_id': 'undefined', 'session_id': '66cf40b50000000066b9f58a86213995', 'name': 'anthropic.request', 'tags': ['version:', 'env:', 'service:', 'source:integration', 'ml_app:<ml-app-name>', 'session_id:66cf40b50000000066b9f58a86213995', 'ddtrace.version:2.13.0.dev87+ga0d7718b7', 'error:0'], 'start_ns': 1724858549126614686, 'duration': 7909324, 'status': 'ok', 'meta': {'span.kind': 'llm', 'input': {'messages': [{'content': 'Respond only in all caps.', 'role': 'system'}, {'content': 'Hello, I am looking for information about some books!', 'role': 'user'}, {'content': 'What is the best selling book?', 'role': 'user'}]}, 'output': {'messages': [{'content': 'THE BEST-SELLING BOOK OF ALL TIME IS "DON', 'role': 'assistant'}]}, 'model_name': 'claude-3-opus-20240229', 'model_provider': 'anthropic', 'metadata': {'temperature': 0.8, 'max_tokens': 15.0}}, 'metrics': {'input_tokens': 32, 'output_tokens': 15, 'total_tokens': 47}})
     Actual: enqueue({'trace_id': '66cf40b50000000066b9f58a86213995', 'span_id': '11869315331452028980', 'parent_id': 'undefined', 'session_id': '66cf40b50000000066b9f58a86213995', 'ml_app': '<ml-app-name>', 'name': 'anthropic.request', 'tags': ['version:', 'env:', 'service:', 'source:integration', 'ml_app:<ml-app-name>', 'session_id:66cf40b50000000066b9f58a86213995', 'ddtrace.version:2.13.0.dev87+ga0d7718b7', 'error:0'], 'start_ns': 1724858549126614686, 'duration': 7909324, 'status': 'ok', 'meta': {'span.kind': 'llm', 'input': {'messages': [{'content': 'Respond only in all caps.', 'role': 'system'}, {'content': 'Hello, I am looking for information about some books!', 'role': 'user'}, {'content': 'What is the best selling book?', 'role': 'user'}]}, 'output': {'messages': [{'content': 'THE BEST-SELLING BOOK OF ALL TIME IS "DON', 'role': 'assistant'}]}, 'model_name': 'claude-3-opus-20240229', 'model_provider': 'anthropic', 'metadata': {'temperature': 0.8, 'max_tokens': 15}}, 'metrics': {'input_tokens': 32, 'output_tokens': 15, 'total_tokens': 47}})
    
  • test_image[ddtrace_global_config0] - test_anthropic_llmobs.py - Details

     expected call not found.
     Expected: enqueue({'trace_id': '66cf40a700000000ab93ea9642221cd3', 'span_id': '14265083996756512590', 'parent_id': 'undefined', 'session_id': '66cf40a700000000ab93ea9642221cd3', 'name': 'anthropic.request', 'tags': ['version:', 'env:', 'service:', 'source:integration', 'ml_app:<ml-app-name>', 'session_id:66cf40a700000000ab93ea9642221cd3', 'ddtrace.version:2.13.0.dev87+ga0d7718b7', 'error:0'], 'start_ns': 1724858535740954352, 'duration': 8972315, 'status': 'ok', 'meta': {'span.kind': 'llm', 'input': {'messages': [{'content': 'Hello, what do you see in the following image?', 'role': 'user'}, {'content': '([IMAGE DETECTED])', 'role': 'user'}]}, 'output': {'messages': [{'content': 'The image shows the logo for a company or product called "Datadog', 'role': 'assistant'}]}, 'model_name': 'claude-3-opus-20240229', 'model_provider': 'anthropic', 'metadata': {'temperature': 0.8, 'max_tokens': 15.0}}, 'metrics': {'input_tokens': 246, 'output_tokens': 15, 'total_tokens': 261}})
     Actual: enqueue({'trace_id': '66cf40a700000000ab93ea9642221cd3', 'span_id': '14265083996756512590', 'parent_id': 'undefined', 'session_id': '66cf40a700000000ab93ea9642221cd3', 'ml_app': '<ml-app-name>', 'name': 'anthropic.request', 'tags': ['version:', 'env:', 'service:', 'source:integration', 'ml_app:<ml-app-name>', 'session_id:66cf40a700000000ab93ea9642221cd3', 'ddtrace.version:2.13.0.dev87+ga0d7718b7', 'error:0'], 'start_ns': 1724858535740954352, 'duration': 8972315, 'status': 'ok', 'meta': {'span.kind': 'llm', 'input': {'messages': [{'content': 'Hello, what do you see in the following image?', 'role': 'user'}, {'content': '([IMAGE DETECTED])', 'role': 'user'}]}, 'output': {'messages': [{'content': 'The image shows the logo for a company or product called "Datadog', 'role': 'assistant'}]}, 'model_name': 'claude-3-opus-20240229', 'model_provider': 'anthropic', 'metadata': {'temperature': 0.8, 'max_tokens': 15}}, 'metrics': {'input_tokens': 246, 'output_tokens': 15, 'total_tokens': 261}})
    
  • test_image[ddtrace_global_config0] - test_anthropic_llmobs.py - Details

     expected call not found.
     Expected: enqueue({'trace_id': '66cf40b500000000d4d41a76928e75d4', 'span_id': '1603422911116823466', 'parent_id': 'undefined', 'session_id': '66cf40b500000000d4d41a76928e75d4', 'name': 'anthropic.request', 'tags': ['version:', 'env:', 'service:', 'source:integration', 'ml_app:<ml-app-name>', 'session_id:66cf40b500000000d4d41a76928e75d4', 'ddtrace.version:2.13.0.dev87+ga0d7718b7', 'error:0'], 'start_ns': 1724858549584668859, 'duration': 8718860, 'status': 'ok', 'meta': {'span.kind': 'llm', 'input': {'messages': [{'content': 'Hello, what do you see in the following image?', 'role': 'user'}, {'content': '([IMAGE DETECTED])', 'role': 'user'}]}, 'output': {'messages': [{'content': 'The image shows the logo for a company or product called "Datadog', 'role': 'assistant'}]}, 'model_name': 'claude-3-opus-20240229', 'model_provider': 'anthropic', 'metadata': {'temperature': 0.8, 'max_tokens': 15.0}}, 'metrics': {'input_tokens': 246, 'output_tokens': 15, 'total_tokens': 261}})
     Actual: enqueue({'trace_id': '66cf40b500000000d4d41a76928e75d4', 'span_id': '1603422911116823466', 'parent_id': 'undefined', 'session_id': '66cf40b500000000d4d41a76928e75d4', 'ml_app': '<ml-app-name>', 'name': 'anthropic.request', 'tags': ['version:', 'env:', 'service:', 'source:integration', 'ml_app:<ml-app-name>', 'session_id:66cf40b500000000d4d41a76928e75d4', 'ddtrace.version:2.13.0.dev87+ga0d7718b7', 'error:0'], 'start_ns': 1724858549584668859, 'duration': 8718860, 'status': 'ok', 'meta': {'span.kind': 'llm', 'input': {'messages': [{'content': 'Hello, what do you see in the following image?', 'role': 'user'}, {'content': '([IMAGE DETECTED])', 'role': 'user'}]}, 'output': {'messages': [{'content': 'The image shows the logo for a company or product called "Datadog', 'role': 'assistant'}]}, 'model_name': 'claude-3-opus-20240229', 'model_provider': 'anthropic', 'metadata': {'temperature': 0.8, 'max_tokens': 15}}, 'metrics': {'input_tokens': 246, 'output_tokens': 15, 'total_tokens': 261}})
    

New Flaky Tests (12)

  • test_schematization[service_schema0] - test_snowflake.py - Last Failure

     failed to import ddtrace module 'ddtrace.contrib.botocore' when patching on import
       Traceback (most recent call last):
         File "/root/project/ddtrace/_monkey.py", line 165, in on_import
           imported_module = importlib.import_module(path)
         File "/root/.pyenv/versions/3.9.16/lib/python3.9/importlib/__init__.py", line 127, in import_module
           return _bootstrap._gcd_import(name[level:], package, level)
         File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
         File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
         File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
         File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
     ...
    
  • test_schematization[service_schema0] - test_snowflake.py - Last Failure

     failed to import ddtrace module 'ddtrace.contrib.botocore' when patching on import
       Traceback (most recent call last):
         File "/root/project/ddtrace/_monkey.py", line 165, in on_import
           imported_module = importlib.import_module(path)
         File "/root/.pyenv/versions/3.7.16/lib/python3.7/importlib/__init__.py", line 127, in import_module
           return _bootstrap._gcd_import(name[level:], package, level)
         File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
         File "<frozen importlib._bootstrap>", line 983, in _find_and_load
         File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
         File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
     ...
    
  • test_schematization[service_schema1] - test_snowflake.py - Last Failure

     failed to import ddtrace module 'ddtrace.contrib.botocore' when patching on import
       Traceback (most recent call last):
         File "/root/project/ddtrace/_monkey.py", line 165, in on_import
           imported_module = importlib.import_module(path)
         File "/root/.pyenv/versions/3.9.16/lib/python3.9/importlib/__init__.py", line 127, in import_module
           return _bootstrap._gcd_import(name[level:], package, level)
         File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
         File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
         File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
         File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
     ...
    
  • test_schematization[service_schema1] - test_snowflake.py - Last Failure

     failed to import ddtrace module 'ddtrace.contrib.botocore' when patching on import
       Traceback (most recent call last):
         File "/root/project/ddtrace/_monkey.py", line 165, in on_import
           imported_module = importlib.import_module(path)
         File "/root/.pyenv/versions/3.7.16/lib/python3.7/importlib/__init__.py", line 127, in import_module
           return _bootstrap._gcd_import(name[level:], package, level)
         File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
         File "<frozen importlib._bootstrap>", line 983, in _find_and_load
         File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
         File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
     ...
    
  • test_schematization[service_schema2] - test_snowflake.py - Last Failure

     failed to import ddtrace module 'ddtrace.contrib.botocore' when patching on import
       Traceback (most recent call last):
         File "/root/project/ddtrace/_monkey.py", line 165, in on_import
           imported_module = importlib.import_module(path)
         File "/root/.pyenv/versions/3.9.16/lib/python3.9/importlib/__init__.py", line 127, in import_module
           return _bootstrap._gcd_import(name[level:], package, level)
         File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
         File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
         File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
         File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
     ...
    

⌛ Performance Regressions vs Default Branch (1)

  • test_schematized_operation_name_default - test_molten.py 3.69s (+3.1s, +523%) - Details

@pr-commenter

pr-commenter bot commented Aug 28, 2024

Benchmarks

Benchmark execution time: 2024-09-13 12:23:39

Comparing candidate commit d0a0304 in PR branch evan.li/ragas-integration with baseline commit dc7e31e in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 353 metrics, 47 unstable metrics.


github-actions bot commented Sep 5, 2024

CODEOWNERS have been resolved as:

ddtrace/llmobs/evaluations/ragas/faithfulness/_runner.py                @DataDog/ml-observability
ddtrace/llmobs/evaluations/ragas/faithfulness/_scorer.py                @DataDog/ml-observability
ddtrace/llmobs/evaluations/ragas/faithfulness/_utils.py                 @DataDog/ml-observability
releasenotes/notes/support-prompt-annotations-ffc517e600499b6b.yaml     @DataDog/apm-python
ddtrace/contrib/gevent/__init__.py                                      @DataDog/apm-core-python @DataDog/apm-idm-python
ddtrace/llmobs/_constants.py                                            @DataDog/ml-observability
ddtrace/llmobs/_llmobs.py                                               @DataDog/ml-observability
ddtrace/llmobs/_trace_processor.py                                      @DataDog/ml-observability
ddtrace/llmobs/_writer.py                                               @DataDog/ml-observability
ddtrace/llmobs/utils.py                                                 @DataDog/ml-observability
riotfile.py                                                             @DataDog/apm-python
tests/llmobs/test_llmobs_service.py                                     @DataDog/ml-observability
tests/llmobs/test_llmobs_trace_processor.py                             @DataDog/ml-observability
.riot/requirements/10c1700.txt                                          @DataDog/apm-python
.riot/requirements/1d27194.txt                                          @DataDog/apm-python
.riot/requirements/1e43cd0.txt                                          @DataDog/apm-python
.riot/requirements/53ca1b8.txt                                          @DataDog/apm-python
.riot/requirements/9b8c904.txt                                          @DataDog/apm-python
.riot/requirements/da9b714.txt                                          @DataDog/apm-python

lievan and others added 3 commits September 6, 2024 09:16
Co-authored-by: datadog-datadog-prod-us1[bot] <88084959+datadog-datadog-prod-us1[bot]@users.noreply.github.com>
Co-authored-by: datadog-datadog-prod-us1[bot] <88084959+datadog-datadog-prod-us1[bot]@users.noreply.github.com>
@lievan lievan changed the title feat(llmobs): POC ragas evaluation callback feat(llmobs): poc ragas evaluation callback Sep 6, 2024
@lievan lievan changed the title feat(llmobs): poc ragas evaluation callback feat(llmobs): poc ragas evaluation integration Sep 6, 2024
}


def test_annotate_prompt_object(LLMObs):
Code Quality Violation: use snake_case and not camelCase

Suggested change:
def test_annotate_prompt_object(LLMObs):
def test_annotate_prompt_object(l_l_m_obs):

Ensure that functions use snake_case. This rule does not apply to test files (prefixed with test_ or suffixed with _test.py), because testing requires some camelCase methods, such as tearDown and setUp.

}


def test_annotate_prompt_wrong_type(LLMObs, mock_logs):
Code Quality Violation: use snake_case and not camelCase

Suggested change:
def test_annotate_prompt_wrong_type(LLMObs, mock_logs):
def test_annotate_prompt_wrong_type(l_l_m_obs, mock_logs):

@@ -787,6 +789,78 @@
)


def test_annotate_prompt_dict(LLMObs):
Code Quality Violation: use snake_case and not camelCase

Suggested change:
def test_annotate_prompt_dict(LLMObs):
def test_annotate_prompt_dict(l_l_m_obs):

mock_logs.reset_mock()


def test_annotate_prompt_wrong_kind(LLMObs, mock_logs):
Code Quality Violation: use snake_case and not camelCase

Suggested change:
def test_annotate_prompt_wrong_kind(LLMObs, mock_logs):
def test_annotate_prompt_wrong_kind(l_l_m_obs, mock_logs):

lievan added a commit that referenced this pull request Sep 30, 2024
…or (#10662)

This PR 
1. implements `EvaluatorRunner`, a periodic service for LLM Obs that is
responsible for running evaluations.
2. implements one dummy ragas faithfulness evaluator. The job of an
evaluator is to define a `run_and_submit_evaluation` function which
takes a span, generates an evaluation metric for that span, and submits
the evaluation metric using a stored reference to the LLM Obs instance.

We add the `_DD_LLMOBS_EVALUATORS` env var to detect which evaluators
should be enabled. Right now, the only supported evaluation is
`ragas_faithfulness`. If no evaluators are detected, the evaluator
runner is not started.

Within the trace processor, span events are enqueued to the evaluator
runner after being enqueued to the span writer.
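
The hand-off described above can be sketched as follows. This is an illustrative sketch under assumed names (only `EvaluatorRunner` and `run_and_submit_evaluation` come from this PR; the rest is hypothetical), not the actual ddtrace implementation:

```python
# Sketch: the evaluator runner buffers span events and, on each periodic
# tick, runs every enabled evaluator over the drained buffer.

class EvaluatorRunner:
    def __init__(self, evaluators):
        self._buffer = []
        self._evaluators = evaluators  # objects exposing run_and_submit_evaluation

    def enqueue(self, span_event):
        # Called by the trace processor after the span writer enqueue.
        self._buffer.append(span_event)

    def periodic(self):
        # Called on an interval by the periodic-service machinery.
        events, self._buffer = self._buffer, []
        for event in events:
            for evaluator in self._evaluators:
                evaluator.run_and_submit_evaluation(event)


class DummyFaithfulnessEvaluator:
    def __init__(self):
        self.results = []

    def run_and_submit_evaluation(self, span_event):
        # A real evaluator would compute a score and submit it to LLM Obs.
        self.results.append((span_event["span_id"], 1.0))


evaluator = DummyFaithfulnessEvaluator()
runner = EvaluatorRunner([evaluator])
runner.enqueue({"span_id": "abc"})
runner.periodic()
print(evaluator.results)  # [('abc', 1.0)]
```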

#### Intended Usage
```
_DD_LLMOBS_EVALUATORS="ragas_faithfulness,ragas_.." DD_LLMOBS_ENABLED=true python3 app.py
```

### No user facing changes for this pr
No changelog since this PR only implements the internal skeleton code
necessary for RAGAS evaluation integration. The environment variable to
enable the ragas evaluator service is hidden
(`_DD_LLMOBS_RAGAS_FAITHFULNESS_ENABLED`) and will be made public when
we implement an actual faithfulness function.

### (Full e2e poc, which contains some differences)
See #10431 for an idea of
what the full e2e implementation of the ragas integration looks like.

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: lievan <[email protected]>