Concrete ML v1.8: Towards Decentralized Private LLAMA Fine-Tuning

January 14, 2025
Andrei Stoian

Concrete ML v1.8 marks a major step in enabling privacy-preserving fine-tuning for Large Language Models (LLMs). This release improves the speed and usability of LLM hybrid fine-tuning with an FHE backend optimized specifically for LLMs and a new Low-Rank Adaptation (LoRA) API.

Additionally, Concrete ML v1.8 now supports Python 3.12, ensuring compatibility with the latest tools and frameworks.

A better API to fine-tune LLMs on encrypted data

Concrete ML provides fine-tuning functionality for LLMs as a client-server protocol:

  • The server performs encrypted (FHE) inference on the linear layers of the model.
  • The client performs gradient descent on the fine-tuning weights (a minimal sketch of this split follows below).
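
To make this split concrete, here is a minimal plain-PyTorch sketch of the idea, with no FHE involved. All names and shapes are illustrative and are not part of the Concrete ML API: the "server" evaluates a frozen linear layer, while the "client" owns and updates only the small LoRA adapter matrices.

import torch

torch.manual_seed(0)
d, r = 16, 4                                     # toy hidden size and LoRA rank

W = torch.randn(d, d)                            # frozen base weight (server side)
A = (0.01 * torch.randn(d, r)).requires_grad_()  # LoRA adapter (client side)
B = torch.zeros(r, d, requires_grad=True)
optimizer = torch.optim.SGD([A, B], lr=1e-2)

def server_linear(x):
    # In the real protocol this matmul runs on the server, on an encrypted x
    # against the clear weight W.
    return x @ W

for _ in range(10):
    x = torch.randn(8, d)                        # private client data
    y = server_linear(x) + x @ A @ B             # base output + local LoRA path
    loss = y.pow(2).mean()                       # placeholder loss
    loss.backward()                              # gradients flow only to A and B
    optimizer.step()
    optimizer.zero_grad()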

As a first step, developers compile an LLM to work with FHE in this manner. Before deploying the model, they can check on their own machine that the compiled model produces good results. To streamline this workflow, Concrete ML v1.8 introduces a fine-tuning API, LoraTrainer, inspired by HuggingFace PEFT, making the process more efficient.

from concrete.ml.torch.lora import LoraTrainer

# peft_model, optimizer, causal_lm_loss, lr_scheduler, training_args_dict,
# inputset and dataloader are assumed to be set up beforehand with
# HuggingFace PEFT and PyTorch.
lora_trainer = LoraTrainer(
    model=peft_model,
    optimizer=optimizer,
    loss_fn=causal_lm_loss,
    lr_scheduler=lr_scheduler,
    training_args=training_args_dict,
    n_layers_to_skip_for_backprop=3,
)

# Calibrate and compile the hybrid model for FHE with 16-bit quantization.
lora_trainer.compile(inputset, n_bits=16)

# Fine-tune with the linear layers executed in FHE ("execute" mode).
lora_trainer.train(dataloader, num_epochs=EPOCHS, fhe="execute", device=device)

As the snippet shows, Concrete ML requires only a single additional call to compile, compared to the HuggingFace LoraTrainer.

Furthermore, this new version allows developers to use the GPU to efficiently evaluate the compiled model on their own machine, significantly speeding up the development process. Check out this notebook to learn how to use the new API on LLAMA.
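
Before paying the cost of real FHE execution, a quick local dry run can use the same trainer on the GPU. The fhe="disable" mode string below follows Concrete ML's hybrid-model conventions; treat its use with LoraTrainer.train as an assumption rather than a documented contract.

# Local GPU dry run (assumed flags, see note above): runs the quantized model
# in the clear to check that fine-tuning still converges before using FHE.
lora_trainer.train(dataloader, num_epochs=EPOCHS, fhe="disable", device="cuda")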

Faster FHE LLM fine-tuning

Concrete ML v1.8 adds an optimized FHE backend for the LLM fine-tuning use case. Leveraging the GPU, this backend speeds up computation significantly. The low-level operation implemented by the backend is an encrypted-matrix by clear-matrix multiplication, where the encrypted matrix holds private user data and the clear matrix holds the model's original weights.
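
In clear arithmetic, this operation is simply a standard matrix product between private activations and public weights. The sketch below is a plain NumPy illustration with hypothetical shapes, not the FHE backend itself:

import numpy as np

X = np.random.randn(32, 4096)    # private token activations (encrypted in production)
W = np.random.randn(4096, 4096)  # clear model weights (server side)
Y = X @ W                        # in FHE: ciphertext-by-plaintext matrix multiplication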

The key highlight of the backend is its efficient ciphertext compression: both inputs and outputs are compressed to sizes only around 4 times larger than the equivalent non-encrypted data.
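
As a back-of-the-envelope check of what this expansion means in practice (the activation shape below is hypothetical; only the roughly 4x factor comes from the release):

clear_mb = 32 * 4096 * 4 / 2**20   # fp32 activation matrix: ~0.5 MB
enc_mb = 4 * clear_mb              # ~4x expansion after compression: ~2 MB
print(f"clear: {clear_mb:.2f} MB, encrypted: ~{enc_mb:.2f} MB")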

With the current implementation:

  • Fine-tuning a LLAMA 8B model on 100,000 tokens takes around 70 hours.
  • The estimated cost is around $500 using a decentralized network of 100 consumer-grade GPUs.

This represents a significant step forward, and further optimizations in future releases are expected to reduce cost and latency by a factor of 4.
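
For reference, here is the aggregate throughput implied by those figures, together with the projected effect of the expected 4x improvement (derived arithmetic, not additional measurements):

tokens, hours, cost_usd = 100_000, 70, 500
print(f"throughput: {tokens / (hours * 3600):.2f} tokens/s")  # ~0.40 tokens/s
print(f"projected: {hours / 4:.1f} h, ${cost_usd / 4:.0f}")   # after the 4x gain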

In summary, Concrete ML v1.8 makes it easier and faster to fine-tune LLMs securely on encrypted data, bringing us one step closer to scalable and decentralized AI solutions. Stay tuned for more updates in the future!
