
Explore the use of static LLM model implementations for deterministic testing and potentially lowered costs #27

Open
d33bs opened this issue Jul 20, 2023 · 7 comments

@d33bs
Collaborator

d33bs commented Jul 20, 2023

This issue was inspired by #25 and thinking through its testing limitations. It proposes using static models, such as Llama and many others, as part of testing procedures: explicit model versioning would help make test results more deterministic (acknowledging that OpenAI models receive continuous updates) and could also reduce the costs of OpenAI API access. Benefits might include greater testing assurance, a way to explore how new or different models perform with manubot-ai-editor, and insulation from OpenAI's continuous updates (where results may shift enough that test outcomes become confusing). Implementing these static models might involve tooling like privateGPT or similar.
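As a rough illustration of the idea (purely a sketch: the model path, test, and choice of llama-cpp-python as a local runtime are all hypothetical, not part of manubot-ai-editor), a test pinned to an explicit local model file could look like:

```python
# Hypothetical sketch of a deterministic test against a pinned, local model.
# Assumes llama-cpp-python and a locally downloaded model file; all names
# here are illustrative only.
from llama_cpp import Llama

MODEL_PATH = "models/llama-2-7b-chat.q4.bin"  # explicit, versioned artifact


def test_revision_is_reproducible():
    llm = Llama(model_path=MODEL_PATH, seed=0, verbose=False)
    prompt = "Revise the following abstract for clarity: ..."
    first = llm(prompt, temperature=0.0, max_tokens=64)["choices"][0]["text"]
    second = llm(prompt, temperature=0.0, max_tokens=64)["choices"][0]["text"]
    # Same pinned weights, fixed seed, zero temperature -> same output.
    assert first == second
```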

@miltondp
Collaborator

miltondp commented Jul 24, 2023

@d33bs, this is a really great idea, thank you for opening this issue! I already requested access and I'm downloading the models to give it a quick try :)

The model files might be too big to download within GitHub Actions (that's the advantage of the API). But this is definitely something we'd like to do in the future if time allows (since it was not originally included in the Sloan grant).

@miltondp
Collaborator

Sorry for the confusion. I realized that you're referring to the testing part, and I think it's a great idea to use explicit snapshots when referencing models. Nonetheless, as I mentioned here, I think prompt testing should use more advanced tools such as OpenAI Evals instead of the simple unit testing we have now.

@miltondp
Collaborator

Ok, I couldn't resist the temptation to try Llama 2 :) I downloaded the 7B-chat model (I think I need more than one GPU for the other models), and it produced some very interesting revisions, for instance, of this abstract:

[image: example revision of an abstract produced by the Llama 2 7B-chat model]

Llama 2's API for chat completion is very similar to OpenAI's, so it shouldn't be hard to implement. And as you suggest, this could greatly improve prompt testing, for instance, by serving as a pilot model before using an expensive one.
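For example, with llama-cpp-python (one possible local runtime; this is just a sketch, and the model path is illustrative), the call mirrors the shape of OpenAI's chat completions:

```python
from llama_cpp import Llama

llm = Llama(model_path="models/llama-2-7b-chat.q4.bin")  # illustrative path

# The messages structure matches OpenAI's chat completions endpoint.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a scientific manuscript editor."},
        {"role": "user", "content": "Revise this abstract: ..."},
    ],
)
print(response["choices"][0]["message"]["content"])
```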

@miltondp
Collaborator

Another option I just realized is to use a very low temperature, even zero.
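With the openai Python package (the pre-1.0 interface in use around this time; a sketch rather than the project's actual code), that would just be:

```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Revise this abstract: ..."}],
    # Greedy decoding; reduces (but may not fully eliminate) run-to-run variation.
    temperature=0,
)
print(response["choices"][0]["message"]["content"])
```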

@d33bs
Collaborator Author

d33bs commented Jul 28, 2023

Cheers, thanks for these thoughts and the exploration, @miltondp! The temperature parameter seems like a great way to ensure greater determinism. Exciting that the Llama 2 API is similar to OpenAI's; I wonder how much of this could be abstracted for the use cases here (maybe there's already a unified way to interact with these models?).
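For instance (purely a hypothetical sketch: `ChatModel`, `revise()`, and both wrapper classes are illustrative, not part of manubot-ai-editor or any existing library), the two backends could sit behind one small interface:

```python
from typing import Protocol


class ChatModel(Protocol):
    """Minimal interface that any backend could satisfy (hypothetical)."""

    def revise(self, text: str) -> str: ...


class OpenAIChatModel:
    """Backend using the (pre-1.0) openai package."""

    def __init__(self, model: str = "gpt-3.5-turbo"):
        self.model = model

    def revise(self, text: str) -> str:
        import openai

        response = openai.ChatCompletion.create(
            model=self.model,
            messages=[{"role": "user", "content": f"Revise: {text}"}],
            temperature=0,
        )
        return response["choices"][0]["message"]["content"]


class LlamaChatModel:
    """Backend using llama-cpp-python with a local model file."""

    def __init__(self, model_path: str):
        from llama_cpp import Llama

        self.llm = Llama(model_path=model_path, verbose=False)

    def revise(self, text: str) -> str:
        response = self.llm.create_chat_completion(
            messages=[{"role": "user", "content": f"Revise: {text}"}],
            temperature=0,
        )
        return response["choices"][0]["message"]["content"]
```

Tests (or cost-sensitive users) could then be handed a `LlamaChatModel` while other paths keep the OpenAI backend.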

Generally, I'm concerned that some OpenAI-based tests may shift in their output as continuous updates are applied, with no way to statically pin a version. I'm unsure how this works on their end, but some of their endpoints feel akin to a Docker image :latest tag (where it's often a best practice to specify a version for software deployments). While the latest version is often what we want for experimentation, cutting-edge changes can sometimes lead to unexpected results. For example, model updates could produce different output even at temperature 0, meaning a test might pass one moment and fail the next.

Here I also wondered about additional advantages of open-source models (mostly thinking about Llama here, but there are others):

  • Lighter-weight model sizes and implementation footprints could mean quicker iteration and development velocity (in addition to eco-friendliness)
  • Free models would make this tool more widely available to those who may not have budgetary access to OpenAI
  • Free / open models may be available in geographic locations where OpenAI is not accessible

@miltondp
Collaborator

miltondp commented Aug 1, 2023

I agree that referencing versions of models would be ideal. I understand that OpenAI offers this as "snapshots" of models, at least for some of them such as gpt-3.5-turbo. They also announce a deprecation date once new models are released, but I believe you are not forced to move to the next model version as soon as that happens.
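For instance (a sketch; gpt-3.5-turbo-0613 was one such dated snapshot at the time of writing):

```python
# Floating alias: silently tracks whatever OpenAI currently serves.
model = "gpt-3.5-turbo"

# Dated snapshot: behavior stays fixed until its announced deprecation.
model = "gpt-3.5-turbo-0613"
```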

I completely agree with all of your points. I will make Llama 2 support a high priority for the next iterations, since it can reduce OpenAI API costs (at least for preliminary prompt testing) as well as development time, as you said. And I'd love to see the tool used by colleagues in Argentina, for instance, where the costs of proprietary models are sometimes prohibitive.

Great comments and ideas. Thank you again, @d33bs!

@castedo

castedo commented Sep 6, 2024

#51 might be useful here
