-
Notifications
You must be signed in to change notification settings - Fork 411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(llmobs): implement skeleton code for ragas faithfulness evaluator #10662
base: main
Are you sure you want to change the base?
Conversation
|
Datadog ReportBranch report: ✅ 0 Failed, 100 Passed, 850 Skipped, 1m 26.57s Total duration (13m 4.6s time saved) |
…py into evan.li/ragas-skeleton
…py into evan.li/ragas-skeleton
ddtrace/llmobs/_evaluators/runner.py
Outdated
batches_of_results = [] | ||
|
||
for evaluator in self.evaluators: | ||
batches_of_results.append(self.executor.map(lambda span: evaluator.evaluate(span), spans)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any specific reason why we need to use multithreading in a periodic service which is already a background worker thread? Does multithreading make a large difference in the evaluation call performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ThreadPoolExecutor.map()
a blocking function? I.e. are we guaranteed to wait on this before we hit line 75?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the each evaluation for each span is independent of each other i thought it would make sense to implement multi-threading here. I imagine this would make a performance difference for evaluations that require a lot of IO operations e.g. api calls to model providers but I have not done any benchmarks here yet.
This also has another benefit where uncaught crashes in one evaluation thread over a span wouldn't impact the evaluation of other spans in the same dequeued batch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ThreadPoolExecutor.map() a blocking function? I.e. are we guaranteed to wait on this before we hit line 75?
It isn't blocking, map
just returns a lazy iterator over the results as they finish.
But this doesn't really matter -- I forgot to make the update here but we don't actually need to collect the evaluation results in this run
function.
Previously, the EvaluationRunner
stored an instance of the eval metric writer so it was the job of the runner to collect eval results and enqueue it to the eval metric writer. However, i've refactored it so that the runner passes an instance of LLMObs
to each evaluator and the evaluator can just use LLMObs.submit_evaluation
to enqueue evaluations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good! just left a couple questions, but overall really cool to see this start to take shape 😄
This PR
EvaluatorRunner
, a periodic service for LLM Obs that is responsible for running evaluations.run_and_submit_evaluation
function which takes a span, generates an evaluation metric for that span, and submit the evaluation metric using a stored reference to LLM Obs instance.We add the
_DD_LLMOBS_EVALUATORS
env var to detect which evaluators should be enabled. Right now, the only supported evaluation isragas_faithfulness
. If no evaluators are detected, the evaluator runner is not started.Within the trace processor, spans events—after being enqueued to the span writer—are enqueued to the evaluator runner.
Intended Usage
No user facing changes for this pr
No changelog since this PR only implements the internal skeleton code necessary for RAGAS evaluation integration. The environment variable to enable the ragas evaluator service is hidden (
_DD_LLMOBS_RAGAS_FAITHFULNESS_ENABLED
) and will be made public when we implement an actual faithfulness function.(Full e2e poc, which contains some differences)
See #10431 for an idea of what the full e2e implementation of the ragas integration looks like.
Checklist
Reviewer Checklist