TFHE-rs v1.1: Fine-Grained GPU Control and More Operators

April 10, 2025
Jean-Baptiste Orfila, Arthur Meyre, Agnes Leroy

This month, Zama released TFHE-rs v1.1, bringing several major improvements and new features for both GPU and CPU backends.

On the GPU side, the backend now adopts the same default cryptographic parameters as the CPU, reducing the probability of computational errors to less than 2⁻¹²⁸, with minimal impact on performance. Multi-GPU support has also been significantly improved: users can now explicitly choose which GPU each operation runs on, enabling close to 500 encrypted 64-bit additions per second on 8×H100 GPUs.

On the CPU side, this release expands the operator set by supporting more scalar cases, making homomorphic computations more versatile and efficient.

In this blog post, we’ll dive into the details of what’s new in TFHE-rs v1.1.

Better multi-GPU throughput

Before v1.1, TFHE-rs' High-Level API automatically dispatched workloads across all available GPUs using a hardcoded strategy for every encrypted operation. While this is effective for very large integer precisions (>128 bits) and for operations that load the GPUs heavily (such as multiplication), it is not ideal for smaller operations, typically 64-bit encrypted additions or comparisons.

Starting with v1.1, developers can select exactly which GPU to use for each operation, optimizing performance on multi-GPU setups. 

Here’s a quick example of executing a hundred 64-bit encrypted additions per GPU in parallel, where each addition is computed on a single GPU:

use tfhe::{ConfigBuilder, set_server_key, ClientKey, CompressedServerKey, FheUint64, GpuIndex};
use tfhe::prelude::*;
use rayon::prelude::*;
use tfhe::core_crypto::gpu::get_number_of_gpus;
use rand::{thread_rng, Rng};

fn main() {
    let config = ConfigBuilder::default().build();

    let client_key = ClientKey::generate(config);
    let compressed_server_key = CompressedServerKey::new(&client_key);

    // Decompress one server key per GPU, each bound to a specific device
    let num_gpus = get_number_of_gpus();
    let sks_vec = (0..num_gpus)
        .map(|i| compressed_server_key.decompress_to_specific_gpu(GpuIndex::new(i)))
        .collect::<Vec<_>>();

    let batch_size = num_gpus * 100;

    // Encrypt random 64-bit inputs
    let mut rng = thread_rng();
    let left_inputs = (0..batch_size)
        .map(|_| FheUint64::encrypt(rng.gen::<u64>(), &client_key))
        .collect::<Vec<_>>();
    let right_inputs = (0..batch_size)
        .map(|_| FheUint64::encrypt(rng.gen::<u64>(), &client_key))
        .collect::<Vec<_>>();

    // Split the inputs into one chunk per GPU and run the additions in parallel
    let chunk_size = (batch_size / num_gpus) as usize;
    left_inputs
        .par_chunks(chunk_size)
        .zip(
            right_inputs
                .par_chunks(chunk_size)
        )
        .enumerate()
        .for_each(
            |(i, (left_inputs_on_gpu_i, right_inputs_on_gpu_i))| {
                left_inputs_on_gpu_i
                    .par_iter()
                    .zip(right_inputs_on_gpu_i.par_iter())
                    .for_each(|(left_input, right_input)| {
                        set_server_key(sks_vec[i].clone());
                        left_input + right_input;
                    });
            },
        );
}

What’s happening here? The first thing that differs from a usual GPU computation with TFHE-rs is the way the server key is defined:

let sks_vec = (0..num_gpus)
        .map(|i| compressed_server_key.decompress_to_specific_gpu(GpuIndex::new(i)))
        .collect::<Vec<_>>();

Here, a vector of server keys is created, each on a specific GPU. 

Then, instead of calling [.c-inline-code]par_iter()[.c-inline-code] directly on the inputs as one would naturally do, the inputs are chunked so they can be distributed across all the GPUs:

left_inputs
        .par_chunks(chunk_size)
        .zip(
            right_inputs
                .par_chunks(chunk_size)
        )
        .enumerate()
        .for_each(
            |(i, (left_inputs_on_gpu_i, right_inputs_on_gpu_i))| {
                left_inputs_on_gpu_i
                    .par_iter()
                    .zip(right_inputs_on_gpu_i.par_iter())
                    .for_each(|(left_input, right_input)| {
                        set_server_key(sks_vec[i].clone());
                        left_input + right_input;
                    });
            },
        );

By setting the server key corresponding to the GPU associated with each chunk via [.c-inline-code]set_server_key(sks_vec[i].clone())[.c-inline-code], the additions are computed on all the GPUs independently. Note that [.c-inline-code]sks_vec[i].clone()[.c-inline-code] only copies a pointer to the server key into the thread, not the contents of the key itself, so it does not induce additional overhead. You can go further to maximize multi-GPU throughput by following our dedicated tutorial.

With this logic set up, TFHE-rs can now achieve close to 500 additions of 64-bit encrypted integers per second on 8×H100 GPUs.

New operators on the CPU backend

TFHE-rs v1.1 brings several additions and improvements to the CPU backend:

  • Scalar support for [.c-inline-code]select[.c-inline-code]: Previously, the [.c-inline-code]select[.c-inline-code] operation only worked with encrypted values. In v1.1, you can now use scalar (plaintext) values as selectable operands. For 64-bit inputs, this operation executes in approximately 20 milliseconds.
  • Improved subtraction: Subtraction now supports a scalar on the left-hand side, making expressions like [.c-inline-code]scalar - encrypted[.c-inline-code] possible. For 64-bit operands, this operation takes around 79 milliseconds.
  • New dot product operator: v1.1 introduces a dot product operation between a vector of [.c-inline-code]FheBool[.c-inline-code] values and any supported scalar type. On a vector of 1,024 elements, execution time is approximately 2 seconds.

All performance benchmarks were measured on an [.c-inline-code]AWS hpc7a.96xlarge[.c-inline-code] instance.
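To give an idea of how these might look in code, here is a minimal sketch using the High-Level API. The scalar [.c-inline-code]select[.c-inline-code] call and the commented dot product entry point are assumptions made for illustration and may differ from the released API; check the documentation for the exact signatures:

use tfhe::prelude::*;
use tfhe::{generate_keys, set_server_key, ConfigBuilder, FheBool, FheUint64};

fn main() {
    let config = ConfigBuilder::default().build();
    let (client_key, server_key) = generate_keys(config);
    set_server_key(server_key);

    let condition = FheBool::encrypt(true, &client_key);
    let encrypted = FheUint64::encrypt(7u64, &client_key);

    // Scalar select: both branches are plaintext values.
    // The signature is assumed to mirror the encrypted `select` variant.
    let chosen = condition.select(42u64, 13u64);

    // Scalar on the left-hand side of a subtraction: clear - encrypted.
    let difference = 100u64 - &encrypted;

    // Dot product between a vector of FheBool and clear values:
    // the entry point below is hypothetical, shown only to illustrate the shape of the call.
    // let dot = FheUint64::dot_product(&encrypted_bits, &clear_weights);

    let chosen_clear: u64 = chosen.decrypt(&client_key);
    let difference_clear: u64 = difference.decrypt(&client_key);
    assert_eq!(chosen_clear, 42);
    assert_eq!(difference_clear, 93);
}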

Smarter key generation

To better support operations in memory-constrained environments, v1.1 also introduces “chunked” bootstrapping key generation. This feature allows the bootstrapping key to be generated in smaller chunks, which can later be assembled into a full key on higher-capacity servers used for encrypted computation.
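To make the workflow concrete, here is a minimal, self-contained sketch of the idea. The types and functions below are illustrative placeholders standing in for the chunked key-generation entry points, not the actual TFHE-rs API; they only show how chunks produced on a constrained device can be reassembled into a full key on the server:

// Conceptual model of the chunked bootstrapping-key workflow.
// These types are illustrative placeholders, not the TFHE-rs API.

/// One piece of the bootstrapping key, small enough to be generated
/// on a memory-constrained device.
struct BootstrappingKeyChunk {
    data: Vec<u8>,
}

/// The full key, assembled on the server that runs the encrypted computation.
struct BootstrappingKey {
    data: Vec<u8>,
}

/// On the constrained device: produce the key piece by piece so that only one
/// chunk needs to be held in memory at a time.
fn generate_chunk(chunk_index: usize, chunk_len: usize) -> BootstrappingKeyChunk {
    // In the real library this would run the chunked key-generation routine;
    // here we just fill placeholder bytes.
    BootstrappingKeyChunk {
        data: vec![chunk_index as u8; chunk_len],
    }
}

/// On the server: concatenate the chunks back into a complete key.
fn assemble(chunks: Vec<BootstrappingKeyChunk>) -> BootstrappingKey {
    BootstrappingKey {
        data: chunks.into_iter().flat_map(|c| c.data).collect(),
    }
}

fn main() {
    let chunks: Vec<_> = (0..4).map(|i| generate_chunk(i, 1024)).collect();
    let full_key = assemble(chunks);
    assert_eq!(full_key.data.len(), 4 * 1024);
}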

The next release of TFHE-rs will continue to improve performance and introduce new features. Stay tuned!
