Halp!
I've used AWS SageMaker to fine-tune Llama 3.2 1B on a set of questions and answers, and downloaded the output from S3. But when I try to convert it to run in Ollama, it seems two extra tokens have mysteriously appeared and stop it from working:
% ollama create llama-q-and-a
transferring model data 100%
converting model
Error: vocabulary is larger than expected '128258' instead of '128256'
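For what it's worth, the mismatch is visible before Ollama ever gets involved. A quick check along these lines (the path is a placeholder for wherever I unpacked the S3 output) shows the tokenizer and config.json disagreeing:

```python
# Compare the tokenizer's idea of the vocab with what config.json declares.
import json
from transformers import AutoTokenizer

model_dir = "./llama-q-and-a"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
with open(f"{model_dir}/config.json") as f:
    config = json.load(f)

print("tokenizer size:   ", len(tokenizer))        # 128258 here
print("config vocab_size:", config["vocab_size"])  # 128256

# Any token with an id at or above vocab_size is one of the extras.
extras = {t: i for t, i in tokenizer.get_vocab().items()
          if i >= config["vocab_size"]}
print("extra tokens:", extras)
```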
If I trick Ollama by editing the downloaded config.json, changing "vocab_size": 128256 to "vocab_size": 128258, the create step then succeeds, but running the model breaks because the embedding tensor is still out by two:
% ollama create llama-q-and-a
transferring model data 100%
converting model
creating new layer sha256:27cc8e47a5b0677b27796952267dc8a821d478de44482bee52a2860f01a2d380
creating new layer sha256:e4e2d5fb1c3129b5ccc8fc5c19d1c06f6e8421f28d7dcfc3e80a081e34ecffdf
writing manifest
success
% ollama run llama-q-and-a
Error: llama runner process has terminated: error loading model: check_tensor_dims: tensor 'token_embd.weight' has wrong shape; expected 2048, 128258, got 2048, 128256, 1, 1
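My working theory is that the SageMaker fine-tuning job added two special tokens to the tokenizer (a pad token, perhaps) without resizing the saved embedding matrix, so the tokenizer and checkpoint disagree. If that's right, something like this should reconcile them before conversion (untested on my end; paths are placeholders):

```python
# Resize the embedding (and any tied LM head) to cover the added tokens,
# then re-save so the weights, config.json and tokenizer all agree.
from transformers import AutoModelForCausalLM, AutoTokenizer

src, dst = "./llama-q-and-a", "./llama-q-and-a-resized"  # placeholder paths
tokenizer = AutoTokenizer.from_pretrained(src)
model = AutoModelForCausalLM.from_pretrained(src)

model.resize_token_embeddings(len(tokenizer))  # 128256 -> 128258
model.save_pretrained(dst)
tokenizer.save_pretrained(dst)
```

I'd then point ollama create at the re-saved directory, but I haven't verified this end to end yet.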
I've also tried various ways of converting the model to GGUF and ONNX with a spot of Python first; none have worked so far. Any advice would be greatly appreciated. Ultimately I want to run Ollama + my model on a Raspberry Pi 5 8GB. Thanks!
PS
For reference, when I load and run the model with HF transformers in Python, inference works fine; it's just that transformers is too meaty for my needs, whereas Ollama is optimised purely for inference.
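Concretely, the check that works is just the stock transformers flow, roughly this (placeholder path and prompt):

```python
# The sanity check that works: plain transformers inference on the tuned model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./llama-q-and-a"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

inputs = tokenizer("One of my fine-tuning questions?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```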