feat(llmobs): poc ragas evaluation integration #10431
base: main
Conversation
Datadog Report (branch): ❌ 301 failed (0 known flaky), 139,671 passed, 1,713 skipped; total time 5h 15m 45.99s.

New flaky tests: 12

⌛ Performance regressions vs default branch: 1

Benchmarks: execution time 2024-09-13 12:23:39, comparing candidate commit d0a0304 in the PR branch. Found 0 performance improvements and 0 performance regressions; performance is the same for 353 metrics, 47 unstable metrics.
Co-authored-by: datadog-datadog-prod-us1[bot] <88084959+datadog-datadog-prod-us1[bot]@users.noreply.github.com>
…ce-py into evan.li/ragas-integration
```python
def test_annotate_prompt_object(LLMObs):
```

⚪ Code Quality Violation: use snake_case and not camelCase. Suggested change:

```diff
- def test_annotate_prompt_object(LLMObs):
+ def test_annotate_prompt_object(l_l_m_obs):
```

Ensure that functions use snake_case. This rule does not apply to test files (prefixed with `test_` or suffixed with `_test.py`), because testing requires some camelCase methods, such as `tearDown`, `setUp`, and more.
```python
def test_annotate_prompt_wrong_type(LLMObs, mock_logs):
```

⚪ Code Quality Violation: use snake_case and not camelCase. Suggested change:

```diff
- def test_annotate_prompt_wrong_type(LLMObs, mock_logs):
+ def test_annotate_prompt_wrong_type(l_l_m_obs, mock_logs):
```
```python
@@ -787,6 +789,78 @@
def test_annotate_prompt_dict(LLMObs):
```

⚪ Code Quality Violation: use snake_case and not camelCase. Suggested change:

```diff
- def test_annotate_prompt_dict(LLMObs):
+ def test_annotate_prompt_dict(l_l_m_obs):
```
```python
mock_logs.reset_mock()

def test_annotate_prompt_wrong_kind(LLMObs, mock_logs):
```

⚪ Code Quality Violation: use snake_case and not camelCase. Suggested change:

```diff
- def test_annotate_prompt_wrong_kind(LLMObs, mock_logs):
+ def test_annotate_prompt_wrong_kind(l_l_m_obs, mock_logs):
```
…or (#10662)

This PR:

1. implements `EvaluatorRunner`, a periodic service for LLM Obs that is responsible for running evaluations.
2. implements one dummy ragas faithfulness evaluator.

The job of an evaluator is to define a `run_and_submit_evaluation` function which takes a span, generates an evaluation metric for that span, and submits the evaluation metric using a stored reference to the LLM Obs instance.

We add the `_DD_LLMOBS_EVALUATORS` env var to detect which evaluators should be enabled. Right now, the only supported evaluation is `ragas_faithfulness`. If no evaluators are detected, the evaluator runner is not started.

Within the trace processor, span events are enqueued to the evaluator runner after being enqueued to the span writer.

#### Intended Usage

```
_DD_LLMOBS_EVALUATORS="ragas_faithfulness,ragas_.." DD_LLMOBS_ENABLED=true python3 app.py
```

### No user-facing changes for this PR

No changelog, since this PR only implements the internal skeleton code necessary for RAGAS evaluation integration. The environment variable to enable the ragas evaluator service is hidden (`_DD_LLMOBS_RAGAS_FAITHFULNESS_ENABLED`) and will be made public when we implement an actual faithfulness function.

### Full e2e poc (which contains some differences)

See #10431 for an idea of what the full e2e implementation of the ragas integration looks like.
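The evaluator contract described above (a `run_and_submit_evaluation` function that takes a span event, scores it, and submits a metric through a stored LLM Obs reference) can be sketched roughly as follows. All class and method names here are illustrative stand-ins, not the actual ddtrace internals:

```python
# Hypothetical sketch of the evaluator contract: an evaluator exposes
# run_and_submit_evaluation(span_event), computes a score for that span,
# and submits it via a stored reference to the LLM Obs instance.

class FakeLLMObs:
    """Stand-in for the LLM Obs instance that receives evaluation metrics."""

    def __init__(self):
        self.submitted = []

    def submit_evaluation(self, span_context, label, metric_type, value):
        # A real implementation would ship the metric to the backend;
        # here we just record it for inspection.
        self.submitted.append((span_context, label, metric_type, value))


class DummyFaithfulnessEvaluator:
    LABEL = "ragas_faithfulness"

    def __init__(self, llmobs):
        self._llmobs = llmobs  # stored reference used for submission

    def run_and_submit_evaluation(self, span_event):
        # A real evaluator would score the span's input/output here;
        # the dummy version always reports 1.0.
        score = 1.0
        span_context = {
            "trace_id": span_event["trace_id"],
            "span_id": span_event["span_id"],
        }
        self._llmobs.submit_evaluation(span_context, self.LABEL, "score", score)


llmobs = FakeLLMObs()
evaluator = DummyFaithfulnessEvaluator(llmobs)
evaluator.run_and_submit_evaluation({"trace_id": "t1", "span_id": "s1"})
print(llmobs.submitted[0][1])  # → ragas_faithfulness
```

A periodic runner would hold a queue of span events and call `run_and_submit_evaluation` on each enabled evaluator for every dequeued event.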
## Checklist

- [x] PR author has checked that all the criteria below are met
  - The PR description includes an overview of the change
  - The PR description articulates the motivation for the change
  - The change includes tests OR the PR description describes a testing strategy
  - The PR description notes risks associated with the change, if any
  - Newly-added code is easy to change
  - The change follows the [library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
  - The change includes or references documentation updates if necessary
  - Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist

- [x] Reviewer has checked that all the criteria below are met
  - Title is accurate
  - All changes are related to the pull request's stated goal
  - Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes
  - Testing strategy adequately addresses listed risks
  - Newly-added code is easy to change
  - Release note makes sense to a user of the library
  - If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  - Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: lievan <[email protected]>
(Not meant to be merged) This is meant to be a full E2E implementation of a RAGAS evaluation runner that continuously generates faithfulness scores from finished LLM Obs span events and sends those scores as evaluation metrics.
This PR will be split into two separate PRs.

Future features not in this PR:

- Renaming `RagasFaithfulnessEvaluationRunner` to `BaseEvaluationRunner`, etc.

Some design things:

- `ragas.evaluate(metrics=[..])` is the most popular entrypoint for using ragas. We don't use this function; instead, we re-implement faithfulness step by step so we have more control over what's going on under the hood.

This POC roughly represents these PRs, combined:
#10645 (modifying integration generated spans)
#10638 (prompt templating)
#10662 (ragas faithfulness runner)
(insert one that has real ragas faithfulness evals)
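Since the POC re-implements faithfulness step by step rather than calling `ragas.evaluate`, the core computation can be sketched as: decompose the answer into statements, verify each against the retrieved context, and score supported / total. The LLM-driven extraction and verification steps are stubbed out with trivial substring checks here; all function names are illustrative:

```python
# Rough, hypothetical sketch of a step-by-step faithfulness score.
# Real ragas uses LLM calls for statement extraction and verification;
# these stubs only illustrate the control flow and the final ratio.

def extract_statements(answer):
    # Stub: a real implementation prompts an LLM to decompose the answer.
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported(statement, context):
    # Stub: a real implementation asks an LLM judge for a verdict.
    return statement.lower() in context.lower()

def faithfulness_score(answer, context):
    statements = extract_statements(answer)
    if not statements:
        return 0.0
    supported = sum(1 for s in statements if is_supported(s, context))
    return supported / len(statements)

score = faithfulness_score(
    "Paris is in France. Paris has ten moons.",
    "Paris is in France and is its capital.",
)
print(score)  # → 0.5: one of two statements is supported by the context
```

Owning each step (rather than calling `ragas.evaluate`) is what lets the runner trace and tune the extraction and verification prompts individually.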
### Usage

Run
You will see the faithfulness score in the UI
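Based on the intended usage shown in the #10662 description, enabling the evaluator might look like the following; `app.py` stands in for your own LLM application and is an assumption, not part of this PR:

```shell
# Hypothetical invocation mirroring the intended usage from #10662:
# enable LLM Obs and select the ragas faithfulness evaluator via env vars.
_DD_LLMOBS_EVALUATORS="ragas_faithfulness" DD_LLMOBS_ENABLED=true python3 app.py
```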
There will also be a trace of the ragas evaluation itself, with the same ML app name but a different `ragas` service.