chore(llmobs): implement skeleton code for ragas faithfulness evaluator #10662

lievan · 2024-09-13T19:32:17Z

This PR

implements EvaluatorRunner, a periodic service for LLM Obs that is responsible for running evaluations.
implements one dummy ragas faithfulness evaluator. The job of an evaluator is to define a run_and_submit_evaluation function which takes a span, generates an evaluation metric for that span, and submits the evaluation metric using a stored reference to the LLM Obs instance.
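As a rough illustration of the evaluator contract described above, here is a minimal sketch. `FakeLLMObs`, `DummyFaithfulnessEvaluator`, the span dictionary shape, and the keyword arguments passed to `submit_evaluation` are assumptions made for this example, not the code in this PR.

```python
class FakeLLMObs:
    """Hypothetical stand-in for the LLM Obs service; it only records submissions."""

    def __init__(self):
        self.submitted = []

    def submit_evaluation(self, span_context, label, metric_type, value):
        # The real service would enqueue the metric; here we just keep it in memory.
        self.submitted.append(
            {
                "span_context": span_context,
                "label": label,
                "metric_type": metric_type,
                "value": value,
            }
        )


class DummyFaithfulnessEvaluator:
    """Hypothetical evaluator that holds a reference to the service it submits to."""

    LABEL = "ragas_faithfulness"

    def __init__(self, llmobs_service):
        self.llmobs_service = llmobs_service

    def run_and_submit_evaluation(self, span):
        # Dummy score: a real evaluator would compute faithfulness from the span.
        score = 1.0
        self.llmobs_service.submit_evaluation(
            span_context={"span_id": span["span_id"], "trace_id": span["trace_id"]},
            label=self.LABEL,
            metric_type="score",
            value=score,
        )
```

The point of the design is that the evaluator owns submission, so the runner only has to hand it spans.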

We add the _DD_LLMOBS_EVALUATORS env var to detect which evaluators should be enabled. Right now, the only supported evaluation is ragas_faithfulness. If no evaluators are detected, the evaluator runner is not started.
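The detection logic could look roughly like the sketch below. The helper names `parse_enabled_evaluators` and `should_start_runner` are hypothetical, and silently ignoring unsupported names is an assumption; the PR may handle them differently.

```python
import os

# Per the description, ragas_faithfulness is the only supported evaluator right now.
SUPPORTED_EVALUATORS = {"ragas_faithfulness"}


def parse_enabled_evaluators(env=None):
    """Return the supported evaluator names listed in _DD_LLMOBS_EVALUATORS."""
    env = os.environ if env is None else env
    raw = env.get("_DD_LLMOBS_EVALUATORS", "")
    # Split the comma-separated list and drop empty entries and stray whitespace.
    names = [name.strip() for name in raw.split(",") if name.strip()]
    # Assumption: unsupported names are ignored rather than raising an error.
    return [name for name in names if name in SUPPORTED_EVALUATORS]


def should_start_runner(env=None):
    """The runner is only started when at least one evaluator is detected."""
    return len(parse_enabled_evaluators(env)) > 0
```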

Within the trace processor, span events are enqueued to the evaluator runner after they have been enqueued to the span writer.

Intended Usage

_DD_LLMOBS_EVALUATORS="ragas_faithfulness,ragas_.." DD_LLMOBS_ENABLED=true python3 app.py

No user-facing changes for this PR

No changelog since this PR only implements the internal skeleton code necessary for RAGAS evaluation integration. The environment variable to enable the ragas evaluator service is hidden (_DD_LLMOBS_RAGAS_FAITHFULNESS_ENABLED) and will be made public when we implement an actual faithfulness function.

(Full e2e poc, which contains some differences)

See #10431 for an idea of what the full e2e implementation of the ragas integration looks like.

Checklist

PR author has checked that all the criteria below are met
The PR description includes an overview of the change
The PR description articulates the motivation for the change
The change includes tests OR the PR description describes a testing strategy
The PR description notes risks associated with the change, if any
Newly-added code is easy to change
The change follows the library release note guidelines
The change includes or references documentation updates if necessary
Backport labels are set (if applicable)

Reviewer Checklist

Reviewer has checked that all the criteria below are met
Title is accurate
All changes are related to the pull request's stated goal
Avoids breaking API changes
Testing strategy adequately addresses listed risks
Newly-added code is easy to change
Release note makes sense to a user of the library
If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
Backport labels are set in a manner that is consistent with the release branch maintenance policy

github-actions · 2024-09-13T19:32:57Z

CODEOWNERS have been resolved as:

ddtrace/llmobs/_evaluators/ragas/faithfulness.py                        @DataDog/ml-observability
ddtrace/llmobs/_evaluators/runner.py                                    @DataDog/ml-observability
tests/llmobs/test_llmobs_evaluator_runner.py                            @DataDog/ml-observability
ddtrace/llmobs/_llmobs.py                                               @DataDog/ml-observability
ddtrace/llmobs/_trace_processor.py                                      @DataDog/ml-observability
tests/llmobs/conftest.py                                                @DataDog/ml-observability
tests/llmobs/test_llmobs_service.py                                     @DataDog/ml-observability

datadog-dd-trace-py-rkomorn · 2024-09-13T19:50:26Z

Datadog Report

Branch report: evan.li/ragas-skeleton
Commit report: be893a1
Test service: dd-trace-py

✅ 0 Failed, 100 Passed, 850 Skipped, 1m 26.57s Total duration (13m 4.6s time saved)

pr-commenter · 2024-09-17T17:11:35Z

Benchmarks

Benchmark execution time: 2024-09-26 22:59:56

Comparing candidate commit 62ecbe4 in PR branch evan.li/ragas-skeleton with baseline commit 23a54ce in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 257 metrics, 51 unstable metrics.

Yun-Kim · 2024-09-24T23:15:47Z

ddtrace/llmobs/_evaluators/runner.py

+        batches_of_results = []
+
+        for evaluator in self.evaluators:
+            batches_of_results.append(self.executor.map(lambda span: evaluator.evaluate(span), spans))


Any specific reason why we need to use multithreading in a periodic service which is already a background worker thread? Does multithreading make a large difference in the evaluation call performance?

Is ThreadPoolExecutor.map() a blocking function? I.e. are we guaranteed to wait on this before we hit line 75?

Since the evaluation of each span is independent of the others, I thought it would make sense to use multithreading here. I imagine this would make a performance difference for evaluations that require a lot of I/O, e.g. API calls to model providers, but I have not run any benchmarks yet.

This also has another benefit: an uncaught crash in one evaluation thread for a span wouldn't impact the evaluation of other spans in the same dequeued batch.

> Is ThreadPoolExecutor.map() a blocking function? I.e. are we guaranteed to wait on this before we hit line 75?

It isn't blocking. map schedules the calls and immediately returns a lazy iterator over the results; it only blocks when you iterate and a result isn't ready yet.

But this doesn't really matter -- I forgot to make the update here but we don't actually need to collect the evaluation results in this run function.

Previously, the EvaluatorRunner stored an instance of the eval metric writer, so it was the job of the runner to collect eval results and enqueue them to the eval metric writer. However, I've refactored it so that the runner passes an instance of LLMObs to each evaluator, and the evaluator can just use LLMObs.submit_evaluation to enqueue evaluations.
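To make the blocking question in this thread concrete, here is a small standalone sketch of how ThreadPoolExecutor.map behaves. `slow_evaluate` and `run_demo` are hypothetical names, and the sleep stands in for an I/O-bound evaluation call; this is not the runner code from the PR.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_evaluate(span_id):
    """Hypothetical evaluation: sleeps to mimic I/O, and fails for span 2."""
    time.sleep(0.3)
    if span_id == 2:
        raise ValueError("evaluation failed for span 2")
    return span_id * 10


def run_demo():
    with ThreadPoolExecutor(max_workers=3) as executor:
        start = time.monotonic()
        # map submits all three calls and returns an iterator without waiting.
        results = executor.map(slow_evaluate, [1, 2, 3])
        submit_elapsed = time.monotonic() - start

        collected = []
        error = None
        try:
            # Iterating blocks until each result is ready, in input order.
            for value in results:
                collected.append(value)
        except ValueError as exc:
            # The exception from span 2 surfaces here, when its result is requested.
            # Iteration stops, so span 3's result is not retrieved via this iterator.
            error = str(exc)
    return submit_elapsed, collected, error
```

Calling `run_demo()` returns quickly from the `map` call itself, collects the result for span 1, and then surfaces the span 2 failure during iteration. If the results are never iterated, as in the refactored runner where each evaluator submits its own metric, exceptions raised inside the worker calls are never re-raised in the calling thread.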

sabrenner

Looks good! Just left a couple of questions, but overall it's really cool to see this start to take shape 😄

lievan added 2 commits September 13, 2024 14:51

implement ragas faithfulenss runner with dummy ragas score generator

571d317

remove newline

4b3d840

lievan changed the title ~~feat(llmobs): Implement ragas faithfulenss runner with dummy ragas score generator~~ feat(llmobs): implement ragas faithfulenss runner with dummy ragas score generator Sep 13, 2024

lievan changed the title ~~feat(llmobs): implement ragas faithfulenss runner with dummy ragas score generator~~ feat(llmobs): implement ragas faithfulness runner with dummy ragas score generator Sep 13, 2024

lievan changed the title ~~feat(llmobs): implement ragas faithfulness runner with dummy ragas score generator~~ feat(llmobs): implement skeleton of ragas faithfulness runner that uses a dummy faithfulness function Sep 13, 2024

lievan mentioned this pull request Sep 13, 2024

feat(llmobs): poc ragas evaluation integration #10431

Draft

2 tasks

lievan added the changelog/no-changelog A changelog entry is not required for this PR. label Sep 13, 2024

lievan added 3 commits September 16, 2024 08:55

pydantic v1

7b9c929

refactor into evaluator list

2e883a0

add unit tests

7b31443

datadog-datadog-prod-us1 bot reviewed Sep 17, 2024

lievan added 2 commits September 17, 2024 12:49

fix expectde span event

13229bd

merg conf

b493e20

lievan marked this pull request as ready for review September 17, 2024 16:54

lievan requested review from a team as code owners September 17, 2024 16:54

lievan requested review from tabgok and rachelyangdog September 17, 2024 16:54

Yun-Kim changed the title ~~feat(llmobs): implement skeleton of ragas faithfulness runner that uses a dummy faithfulness function~~ chore(llmobs): implement skeleton of ragas faithfulness runner that uses a dummy faithfulness function Sep 17, 2024

Yun-Kim reviewed Sep 17, 2024

ddtrace/settings/config.py (outdated, resolved)

lievan changed the title ~~chore(llmobs): implement skeleton of ragas faithfulness runner that uses a dummy faithfulness function~~ chore(llmobs): implement skeleton code for ragas faithfulness runner that uses a dummy faithfulness function Sep 17, 2024

lievan changed the title ~~chore(llmobs): implement skeleton code for ragas faithfulness runner that uses a dummy faithfulness function~~ chore(llmobs): implement skeleton code for ragas faithfulness evaluator Sep 17, 2024

remove config option, use only env var

d290dcd

Yun-Kim reviewed Sep 17, 2024

lievan added 2 commits September 17, 2024 14:33

address comments

6e49cca

refactor into one evaluator service

fcf9991

datadog-datadog-prod-us1 bot reviewed Sep 18, 2024

tests/llmobs/test_llmobs_evaluator_runner.py (outdated, resolved)

tests/llmobs/test_llmobs_evaluator_runner.py (outdated, resolved)

lievan and others added 11 commits September 18, 2024 11:10

dont cancel futures

10b276f

refactor dummy faithfulness into class

b6fa4e0

rename field to label

be893a1

Merge branch 'main' into evan.li/ragas-skeleton

a309330

Merge branch 'main' of github.com:DataDog/dd-trace-py into evan.li/ragas-skeleton

d849067

Merge branch 'evan.li/ragas-skeleton' of github.com:DataDog/dd-trace-py into evan.li/ragas-skeleton

fd73621

Merge branch 'main' of github.com:DataDog/dd-trace-py into evan.li/ragas-skeleton

2f48461

Merge branch 'main' into evan.li/ragas-skeleton

04d202e

refactor so we store the service only in ragas

a991f15

Merge branch 'evan.li/ragas-skeleton' of github.com:DataDog/dd-trace-py into evan.li/ragas-skeleton

38e0a23

rename a test

ea5d4fa

datadog-datadog-prod-us1 bot reviewed Sep 23, 2024

tests/llmobs/test_llmobs_evaluator_runner.py (outdated, resolved)

Yun-Kim reviewed Sep 24, 2024

clean up

e66ee7e

datadog-datadog-prod-us1 bot reviewed Sep 26, 2024

tests/llmobs/test_llmobs_evaluator_runner.py (resolved)

tests/llmobs/test_llmobs_evaluator_runner.py (resolved)

lievan added 5 commits September 26, 2024 09:54

rename, fix test

c04dac4

Merge branch 'main' of github.com:DataDog/dd-trace-py into evan.li/ragas-skeleton

0a956e1

fork safety

6d9c136

fix tests

7d7192c

add more comments

bb8d388

lievan commented Sep 26, 2024

ddtrace/llmobs/_evaluators/ragas/faithfulness.py (resolved)

delete unused fixture

3c17dee

lievan commented Sep 26, 2024

tests/llmobs/conftest.py (resolved)

sabrenner reviewed Sep 26, 2024

ddtrace/llmobs/_trace_processor.py (resolved)

ddtrace/llmobs/_evaluators/runner.py (resolved)

ddtrace/llmobs/_evaluators/runner.py (outdated, resolved)

remove debug

62ecbe4

sabrenner approved these changes Sep 27, 2024


Review thread participants: Yun-Kim (two comments, Sep 24, 2024) and lievan (two replies, Sep 25, 2024).
