feat(llmobs): poc ragas evaluation integration #10431
base: main
Conversation
Datadog Report (branch): ❌ 301 failed (0 known flaky), 139,671 passed, 1,713 skipped; total time 5h 15m 45.99s.

New flaky tests: 12

⌛ Performance regressions vs default branch: 1

Benchmarks: execution time 2024-09-13 12:23:39, comparing candidate commit d0a0304 in the PR branch. Found 0 performance improvements and 0 performance regressions; performance is the same for 353 metrics, 47 unstable metrics.
Co-authored-by: datadog-datadog-prod-us1[bot] <88084959+datadog-datadog-prod-us1[bot]@users.noreply.github.com>
…ce-py into evan.li/ragas-integration
```python
def test_annotate_prompt_object(LLMObs):
```

⚪ Code Quality Violation: use snake_case and not camelCase. Suggested change:

```diff
- def test_annotate_prompt_object(LLMObs):
+ def test_annotate_prompt_object(l_l_m_obs):
```

Ensure that functions use snake_case. This rule does not apply to test files (prefixed with `test_` or suffixed with `_test.py`), because testing requires some camelCase methods, such as `tearDown`, `setUp`, and more.
```python
def test_annotate_prompt_wrong_type(LLMObs, mock_logs):
```

⚪ Code Quality Violation: use snake_case and not camelCase. Suggested change:

```diff
- def test_annotate_prompt_wrong_type(LLMObs, mock_logs):
+ def test_annotate_prompt_wrong_type(l_l_m_obs, mock_logs):
```
```python
@@ -787,6 +789,78 @@
def test_annotate_prompt_dict(LLMObs):
```

⚪ Code Quality Violation: use snake_case and not camelCase. Suggested change:

```diff
- def test_annotate_prompt_dict(LLMObs):
+ def test_annotate_prompt_dict(l_l_m_obs):
```
```python
mock_logs.reset_mock()

def test_annotate_prompt_wrong_kind(LLMObs, mock_logs):
```

⚪ Code Quality Violation: use snake_case and not camelCase. Suggested change:

```diff
- def test_annotate_prompt_wrong_kind(LLMObs, mock_logs):
+ def test_annotate_prompt_wrong_kind(l_l_m_obs, mock_logs):
```
…or (#10662)

This PR:

1. implements `EvaluatorRunner`, a periodic service for LLM Obs that is responsible for running evaluations.
2. implements one dummy ragas faithfulness evaluator.

The job of an evaluator is to define a `run_and_submit_evaluation` function which takes a span, generates an evaluation metric for that span, and submits the evaluation metric using a stored reference to the LLM Obs instance.

We add the `_DD_LLMOBS_EVALUATORS` env var to detect which evaluators should be enabled. Right now, the only supported evaluation is `ragas_faithfulness`. If no evaluators are detected, the evaluator runner is not started.

Within the trace processor, span events are enqueued to the evaluator runner after being enqueued to the span writer.

#### Intended Usage

```
_DD_LLMOBS_EVALUATORS="ragas_faithfulness,ragas_.." DD_LLMOBS_ENABLED=true python3 app.py
```

### No user-facing changes for this PR

No changelog, since this PR only implements the internal skeleton code necessary for RAGAS evaluation integration. The environment variable to enable the ragas evaluator service is hidden (`_DD_LLMOBS_RAGAS_FAITHFULNESS_ENABLED`) and will be made public when we implement an actual faithfulness function.

### Full e2e poc (which contains some differences)

See #10431 for an idea of what the full e2e implementation of the ragas integration looks like.
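The evaluator contract described above (a `run_and_submit_evaluation` function that takes a span event, scores it, and submits a metric through a stored LLM Obs reference) can be sketched roughly as follows. All class and method names here are illustrative stand-ins, not the actual ddtrace internals:

```python
# Hypothetical sketch of the evaluator contract: an evaluator exposes
# run_and_submit_evaluation(span_event), computes a score for that span,
# and submits it via a stored reference to the LLM Obs instance.

class FakeLLMObs:
    """Stand-in for the LLM Obs instance that receives evaluation metrics."""

    def __init__(self):
        self.submitted = []

    def submit_evaluation(self, span_context, label, metric_type, value):
        # A real implementation would ship the metric to the backend;
        # here we just record it for inspection.
        self.submitted.append((span_context, label, metric_type, value))


class DummyFaithfulnessEvaluator:
    LABEL = "ragas_faithfulness"

    def __init__(self, llmobs):
        self._llmobs = llmobs  # stored reference used for submission

    def run_and_submit_evaluation(self, span_event):
        # A real evaluator would score the span's input/output here;
        # the dummy version always reports 1.0.
        score = 1.0
        span_context = {
            "trace_id": span_event["trace_id"],
            "span_id": span_event["span_id"],
        }
        self._llmobs.submit_evaluation(span_context, self.LABEL, "score", score)


llmobs = FakeLLMObs()
evaluator = DummyFaithfulnessEvaluator(llmobs)
evaluator.run_and_submit_evaluation({"trace_id": "t1", "span_id": "s1"})
print(llmobs.submitted[0][1])  # → ragas_faithfulness
```

A periodic runner would hold a queue of span events and call `run_and_submit_evaluation` on each enabled evaluator for every dequeued event.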
## Checklist

- [x] PR author has checked that all the criteria below are met
  - The PR description includes an overview of the change
  - The PR description articulates the motivation for the change
  - The change includes tests OR the PR description describes a testing strategy
  - The PR description notes risks associated with the change, if any
  - Newly-added code is easy to change
  - The change follows the [library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
  - The change includes or references documentation updates if necessary
  - Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist

- [x] Reviewer has checked that all the criteria below are met
  - Title is accurate
  - All changes are related to the pull request's stated goal
  - Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes
  - Testing strategy adequately addresses listed risks
  - Newly-added code is easy to change
  - Release note makes sense to a user of the library
  - If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  - Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: lievan <[email protected]>
(Not meant to be merged) This is meant to be a full E2E implementation of a RAGAS evaluation runner that continuously generates faithfulness scores from finished LLM Obs span events and sends those scores as evaluation metrics.
This PR will be split into two separate PRs.

Future features not in this PR:

- Renaming `RagasFaithfulnessEvaluationRunner` to `BaseEvaluationRunner`, etc.

Some design things:

- `ragas.evaluate(metrics=[..])` is the most popular entrypoint for using ragas. We don't use this function; instead, we re-implement faithfulness step by step so we have more control over what's going on under the hood.

This POC roughly represents these PRs, combined:
#10645 (modifying integration generated spans)
#10638 (prompt templating)
#10662 (ragas faithfulness runner)
(insert one that has real ragas faithfulness evals)
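Since the POC re-implements faithfulness step by step rather than calling `ragas.evaluate`, the core computation can be sketched as: decompose the answer into statements, verify each against the retrieved context, and score supported / total. The LLM-driven extraction and verification steps are stubbed out with trivial substring checks here; all function names are illustrative:

```python
# Rough, hypothetical sketch of a step-by-step faithfulness score.
# Real ragas uses LLM calls for statement extraction and verification;
# these stubs only illustrate the control flow and the final ratio.

def extract_statements(answer):
    # Stub: a real implementation prompts an LLM to decompose the answer.
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported(statement, context):
    # Stub: a real implementation asks an LLM judge for a verdict.
    return statement.lower() in context.lower()

def faithfulness_score(answer, context):
    statements = extract_statements(answer)
    if not statements:
        return 0.0
    supported = sum(1 for s in statements if is_supported(s, context))
    return supported / len(statements)

score = faithfulness_score(
    "Paris is in France. Paris has ten moons.",
    "Paris is in France and is its capital.",
)
print(score)  # → 0.5: one of two statements is supported by the context
```

Owning each step (rather than calling `ragas.evaluate`) is what lets the runner trace and tune the extraction and verification prompts individually.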
### Usage

Run
You will see the faithfulness score in the UI
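Based on the intended usage shown in the #10662 description, enabling the evaluator might look like the following; `app.py` stands in for your own LLM application and is an assumption, not part of this PR:

```shell
# Hypothetical invocation mirroring the intended usage from #10662:
# enable LLM Obs and select the ragas faithfulness evaluator via env vars.
_DD_LLMOBS_EVALUATORS="ragas_faithfulness" DD_LLMOBS_ENABLED=true python3 app.py
```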
There will also be a trace of the ragas evaluation itself, with the same ML app name but a different `ragas` service.