What is llamero?
Simply put, llamero is a shard for Crystal that allows you to interact with llama.cpp models from within your application.
Here's a basic example:
require "llamero"
model = Llamero::BaseModel.new(model_name: "meta-llama-3-8b-instruct-Q6_K.gguf")
puts model.quick_chat([{ role: "user", content: "Hey Llama! Tell me your best joke about programming" }])
Currently, you will need to clone the llama.cpp repo, build it and symlink the bin to /usr/local/bin/llamacpp for this shard to work as intended.
You will also need Python 3.12 or later and pip. On macOS with Homebrew, the Python formula already includes pip:
brew install python3
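To confirm the Python requirement is met before building, a small check like the following can help. This is a minimal sketch, not part of llamero: the `version_ok` helper is a hypothetical name, and it assumes `python3 --version` prints a string of the form `Python X.Y.Z`.

```shell
#!/bin/sh
# Return 0 when a "major.minor[.patch]" version string is at least 3.12.
# version_ok is a hypothetical helper, not something llamero provides.
version_ok() {
  major=$(printf '%s' "$1" | cut -d. -f1)
  minor=$(printf '%s' "$1" | cut -d. -f2)
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 12 ]; }
}

# "python3 --version" prints something like "Python 3.12.4"; keep the number.
if command -v python3 >/dev/null 2>&1; then
  installed=$(python3 --version 2>&1 | awk '{print $2}')
  if version_ok "$installed"; then
    echo "python3 $installed is new enough"
  else
    echo "python3 $installed is older than 3.12"
  fi
else
  echo "python3 not found on PATH"
fi
```

If the script reports an older version, upgrading through your package manager before continuing avoids surprises later in the build.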
Then you can clone and build llama.cpp
Important Note: these instructions tie you to an older release of llama.cpp due to a bug that was introduced around late Feb 2024 - March 2024. This bug has not been fixed as of yet, which breaks this shard entirely because the llama.cpp binary will not execute from the symbolic link we want to create for running it outside of the llama.cpp directory.
cd ~/ && git clone git@github.com:ggerganov/llama.cpp.git && cd llama.cpp && git fetch --tags && git checkout f1a98c52 && make
You will now be on a stable version of llama.cpp and able to make the symbolic link to run this shard. You will be in a detached HEAD state, so if you later switch to master/main or another release, you will need to check out the f1a98c52 commit again to return to this version.
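Because the shard depends on this exact commit, it can be useful to check which commit a checkout is on after switching branches. The sketch below assumes the repository lives at `~/llama.cpp` as in the clone command; `commit_matches` and the `LLAMA_CPP_DIR` override are hypothetical names introduced only for illustration.

```shell
#!/bin/sh
# Return 0 when the full commit hash $1 starts with the short prefix $2.
commit_matches() {
  case "$1" in
    "$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# LLAMA_CPP_DIR is a hypothetical override; the default matches the clone step.
repo="${LLAMA_CPP_DIR:-$HOME/llama.cpp}"
if head=$(git -C "$repo" rev-parse HEAD 2>/dev/null); then
  if commit_matches "$head" f1a98c52; then
    echo "llama.cpp is pinned at f1a98c52"
  else
    echo "llama.cpp is at $head; run: git -C \"$repo\" checkout f1a98c52"
  fi
else
  echo "no git checkout found at $repo"
fi
```

Remember to rebuild with make after checking out the pinned commit again, since the binaries on disk reflect whichever commit was last built.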
Now create the symlink for the main binary. Run this from within the llama.cpp directory root.
For Mac users, this command will create the symlink for you:
sudo ln -s $(pwd)/main /usr/local/bin/llamacpp
Next we'll link the tokenizer
sudo ln -s $(pwd)/tokenize /usr/local/bin/llamatokenize
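Once both links exist, a quick check can confirm that each one resolves to an executable file rather than dangling. This is a minimal sketch; `check_link` is a hypothetical helper, and the two paths are the link names used in the commands above.

```shell
#!/bin/sh
# Report whether $1 is a symlink that resolves to an executable regular file.
# check_link is a hypothetical helper, not part of llamero or llama.cpp.
check_link() {
  if [ ! -L "$1" ]; then
    echo "$1: not a symlink"
    return 1
  fi
  if [ ! -f "$1" ] || [ ! -x "$1" ]; then
    echo "$1: symlink target is missing or not executable"
    return 1
  fi
  echo "$1: ok"
}

# Check both links; keep going even if the first one fails.
for link in /usr/local/bin/llamacpp /usr/local/bin/llamatokenize; do
  check_link "$link" || true
done
```

A "target is missing" message usually means the link was created from the wrong directory, so the path it stores does not point at the built binary.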
You will also need to download some models. This is a quick reference list. You can choose any model that's already quantized into gguf, or you can convert your own models using the llama.cpp quantization tool.
Choose a model from below to start with. The links should bring you directly to the model files page. You want to "download" the model file.
Model Name | Description | RAM Required | Prompt Template |
---|---|---|---|
Mixtral dolphin-2.7-mixtral-8x7b-GGUF | A quantized model optimized for 8x7b settings, works about as well as ChatGPT 4 | ~27GB | chat template |
Mistral-7B-instruct-v0.2-GGUF | A quantized model from Mistral, works about as well as ChatGPT 3.5 | ~6GB | chat template |
Llama3 8b-Instruct-GGUF | A quantized model from Llama 3, works about as well as GPT-4 but limited knowledge | ~8GB | chat template |
You can always download a different model; it just needs to be in the GGUF quantized format, or you'll need to quantize the model with llama.cpp's quantization tool.
Move the model you downloaded into a directory that you'll configure in your project to use.
I recommend ~/models, as this is the default directory that Llamero will check for models.
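To see which GGUF files are actually in the models directory, something like the following can be used. It is a sketch under stated assumptions: `list_models` and the `MODELS_DIR` override are hypothetical names, and the default path is the ~/models directory recommended above.

```shell
#!/bin/sh
# List the .gguf files directly inside directory $1, one file name per line.
# list_models is a hypothetical helper, not part of llamero.
list_models() {
  for f in "$1"/*.gguf; do
    # When nothing matches, the glob stays literal and this test skips it.
    [ -f "$f" ] && basename "$f"
  done
  return 0
}

# MODELS_DIR is a hypothetical override; ~/models is the recommended location.
models_dir="${MODELS_DIR:-$HOME/models}"
if [ -d "$models_dir" ]; then
  list_models "$models_dir"
else
  echo "$models_dir does not exist yet; create it with: mkdir -p \"$models_dir\""
fi
```

The file name printed here is the value to pass as model_name when constructing a model, as in the basic example at the top.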
- Add the dependency to your shard.yml:
  dependencies:
    llamero:
      github: crimson-knight/llamero
- Run shards install
require "llamero"
TODO: Write usage instructions here
To Do: [] Generate chat templates by reading from the model (integrate with HF's C-lib)
TODO: Write development instructions here
- Fork it (https://github.com/crimson-knight/llamero/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
- crimson-knight - creator and maintainer