TFHE-rs v0.7: Ciphertext Compression, Multi-GPU Support and More

July 5, 2024
Jean-Baptiste Orfila, Arthur Meyre, Agnes Leroy

TFHE-rs v0.7 now supports the compression of ciphertexts that encrypt the result of some homomorphic computations. This new feature reduces the size of ciphertexts by up to 1,900x with the provided parameters! Additionally, TFHE-rs v0.7 allows users to leverage multi-GPU architectures, which are widely deployed on servers, to drastically enhance computational performance. As usual, this release introduces a plethora of new features and improvements, as detailed below!

Compressing ciphertexts after homomorphic computation

One of the challenges of implementing FHE is the size of the ciphertexts. By default, the ratio between one bit of cleartext and its encrypted equivalent is around 8,200, meaning that it takes 8,200 bits of ciphertext to represent 1 bit of cleartext. TFHE-rs has supported input compression since v0.2; however, reducing the size of ciphertexts after computation was not possible until now. Starting with v0.7, ciphertexts can be compressed at any point in the program.
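To make the expansion ratio concrete, here is a minimal sketch that estimates the size of an uncompressed ciphertext from the number of cleartext bits. It relies only on the approximate 8,200 ratio quoted above; the constant and function names are illustrative and not part of the TFHE-rs API, and real serialized sizes depend on the parameter set and on serialization overhead.

```rust
/// Approximate expansion ratio quoted in the text:
/// ciphertext bits needed per cleartext bit (an estimate, not an exact figure).
const EXPANSION_RATIO: u64 = 8_200;

/// Rough size in bytes of an uncompressed ciphertext
/// encrypting `cleartext_bits` bits of data.
fn estimated_ciphertext_bytes(cleartext_bits: u64) -> u64 {
    cleartext_bits * EXPANSION_RATIO / 8
}

fn main() {
    // Cleartext widths of common integer types.
    for bits in [8u64, 32, 64] {
        println!(
            "{}-bit cleartext -> roughly {} bytes of ciphertext",
            bits,
            estimated_ciphertext_bytes(bits)
        );
    }
}
```

Under this estimate, a single encrypted 64-bit integer weighs in at tens of kilobytes, which is why compressing results before sending them back matters.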

For now, post-computation compression is available only for a few parameter sets, as demonstrated in the code snippet below. A compressed ciphertext list with these parameters can have up to 256 slots, each capable of containing 2 bits of encrypted data, for a total of 512 bits per list. For example, 8 [.c-inline-code]FheUint64[.c-inline-code] values (8 × 64 = 512 bits) fill one list exactly. Table 1 provides a summary of the ciphertext sizes and compression ratios.

Table 1: Sizes of compressed ciphertexts as a function of the number of cleartext bits.
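The slot arithmetic described above can be sketched as follows. This is a back-of-the-envelope model rather than the library's packing logic: the names are hypothetical, and it assumes that a value whose width is not a multiple of 2 bits (such as a boolean) still occupies a whole slot and that lists are filled slot by slot.

```rust
/// Figures taken from the text for the provided parameters.
const SLOTS_PER_LIST: usize = 256;
const BITS_PER_SLOT: usize = 2;

/// Number of 2-bit slots needed for one value of `bits` cleartext bits.
/// Assumption: partial slots are rounded up to a full slot.
fn slots_for(bits: usize) -> usize {
    (bits + BITS_PER_SLOT - 1) / BITS_PER_SLOT
}

/// Minimum number of compressed lists needed to hold all values,
/// given their cleartext bit widths.
fn lists_for(widths: &[usize]) -> usize {
    let total_slots: usize = widths.iter().map(|&w| slots_for(w)).sum();
    (total_slots + SLOTS_PER_LIST - 1) / SLOTS_PER_LIST
}

fn main() {
    // Eight 64-bit values fill exactly one list (8 * 32 = 256 slots).
    println!("8 x 64-bit -> {} list(s)", lists_for(&[64; 8]));
    // A ninth value no longer fits, so a second list is required.
    println!("9 x 64-bit -> {} list(s)", lists_for(&[64; 9]));
    // A mixed batch: 32-bit, 64-bit, boolean and 2-bit values.
    println!("mixed batch -> {} list(s)", lists_for(&[32, 64, 1, 2]));
}
```

Because the capacity is counted in slots rather than in values, batching many small ciphertexts together makes better use of a list than compressing them one at a time.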

The following example demonstrates how to use TFHE-rs to compress ciphertexts with the newly introduced [.c-inline-code]CompressedCiphertextList[.c-inline-code]. The list is heterogeneous, so ciphertexts of different types, such as [.c-inline-code]FheUint32[.c-inline-code], [.c-inline-code]FheUint16[.c-inline-code], [.c-inline-code]FheBool[.c-inline-code], etc., can be stored together. When the number of input bits exceeds the capacity of a single list, the list is automatically split into several.

use tfhe::prelude::*;
use tfhe::shortint::parameters::{COMP_PARAM_MESSAGE_2_CARRY_2, PARAM_MESSAGE_2_CARRY_2};
use tfhe::{
    set_server_key, CompressedCiphertextListBuilder, FheBool, FheInt64, FheUint2, FheUint32,
};

fn main() {
    // Enable compression by attaching compression parameters to the configuration.
    let config = tfhe::ConfigBuilder::with_custom_parameters(PARAM_MESSAGE_2_CARRY_2, None)
        .enable_compression(COMP_PARAM_MESSAGE_2_CARRY_2)
        .build();

    let ck = tfhe::ClientKey::generate(config);
    let sk = tfhe::ServerKey::new(&ck);

    set_server_key(sk);

    let ct1 = FheUint32::encrypt(17_u32, &ck);
    let ct2 = FheInt64::encrypt(-1i64, &ck);
    let ct3 = FheBool::encrypt(false, &ck);
    let ct4 = FheUint2::encrypt(3u8, &ck);

    let serialized_ct1 = bincode::serialize(&ct1).unwrap();
    let serialized_ct2 = bincode::serialize(&ct2).unwrap();
    let serialized_ct3 = bincode::serialize(&ct3).unwrap();
    let serialized_ct4 = bincode::serialize(&ct4).unwrap();

    let uncompressed_serialized_size =
        serialized_ct1.len() + serialized_ct2.len() + serialized_ct3.len() + serialized_ct4.len();

    println!(
        "Uncompressed serialized size: {} bytes",
        uncompressed_serialized_size
    );

    // Pack ciphertexts of different types into a single compressed list.
    let compressed_list = CompressedCiphertextListBuilder::new()
        .push(ct1)
        .push(ct2)
        .push(ct3)
        .push(ct4)
        .build()
        .unwrap();

    let serialized = bincode::serialize(&compressed_list).unwrap();

    let compressed_serialized_size = serialized.len();

    println!(
        "Compressed serialized size: {} bytes",
        compressed_serialized_size
    );

    // The four inputs encrypt 32 + 64 + 1 + 2 = 99 bits of cleartext.
    println!(
        "Compression ratio for 99 bits: {}",
        uncompressed_serialized_size as f64 / compressed_serialized_size as f64
    );
}

Accelerating homomorphic computations with multiple GPUs

TFHE-rs v0.7 enables, for the first time, the use of multiple GPUs for homomorphic computations. No code changes are needed to run on multiple GPUs: to keep the API as user-friendly as possible, the configuration is set automatically, and the user has no fine-grained control over which GPUs are selected.

However, there are certain limitations: only GPUs with peer access to GPU 0 via NVLink are used for the computations. Depending on the platform, this may limit the number of GPUs that TFHE-rs can effectively harness.

The multi-GPU support, along with some optimizations introduced in this release, brings unprecedented performance for integer operations.

Figure 1: Timings of 64-bit addition, multiplication and division, where the two inputs are encrypted, running on CPU (hpc7a.96xlarge from AWS) vs one and two H100 GPUs. The parameters correspond to two bits of message and two bits of carry, using the multi-bit PBS with a grouping factor equal to 3.

The optimal number of GPUs per operation varies depending on the operation itself and the integer precision specified by the user. Comprehensive arrays of benchmark results for both single and multiple GPUs across all specified precisions are available in the documentation.

Additional features and improvements

TFHE-rs v0.7 also includes some other new features and performance improvements:

  • Updated cryptographic parameter sets: Previously, the default failure probability for programmable bootstrapping was less than 2^−40. To reduce the probability of errors over a long run, the new parameter sets now default to 2^−64. The impact on performance is negligible.
  • New vector and array operations: TFHE-rs now includes operations on vectors of ciphertexts. For example, it is now possible to compute equality between two vectors of ciphertexts or to check if one vector is contained within another.
  • Improved Zero-Knowledge Proofs: Through optimizations and dedicated parameter sets for compact public key encryption, both the commitment size and the proof and verification timings have been reduced. More details and benchmarks are available in the documentation.
  • Optimized keyswitch on GPU: The time to keyswitch has been reduced from 5.3 ms to 123 µs for the default parameters, bringing the overall latency of programmable bootstrapping (which includes the aforementioned keyswitch) down to 4.3 ms (compared to 9.5 ms in the previous version of TFHE-rs). 
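Two of the figures in the list above can be checked with quick arithmetic. The sketch below is illustrative only: it uses a union bound to estimate the chance that at least one programmable bootstrap fails over a long computation, and it computes the speedups implied by the quoted GPU timings. The workload of one billion bootstraps is a hypothetical number chosen for the example, and the function names are not part of TFHE-rs.

```rust
/// Union-bound estimate of the probability that at least one of
/// `n_pbs` bootstraps fails, when each fails with probability
/// at most 2^log2_p_fail. The result is capped at 1.
fn failure_bound(n_pbs: f64, log2_p_fail: i32) -> f64 {
    (n_pbs * 2f64.powi(log2_p_fail)).min(1.0)
}

/// Speedup factor between an old and a new timing in the same unit.
fn speedup(old: f64, new: f64) -> f64 {
    old / new
}

fn main() {
    // Hypothetical long-running workload: one billion bootstraps.
    let n_pbs = 1e9;
    println!("bound with 2^-40: {:e}", failure_bound(n_pbs, -40));
    println!("bound with 2^-64: {:e}", failure_bound(n_pbs, -64));

    // GPU timings quoted in the text, in milliseconds (123 µs = 0.123 ms).
    let (ks_old, ks_new) = (5.3, 0.123);
    let (pbs_old, pbs_new) = (9.5, 4.3);
    println!("keyswitch speedup: {:.1}x", speedup(ks_old, ks_new));
    println!("PBS speedup: {:.1}x", speedup(pbs_old, pbs_new));

    // Time saved on the keyswitch versus time saved on the whole PBS.
    println!("keyswitch saving: {:.3} ms", ks_old - ks_new);
    println!("PBS saving: {:.3} ms", pbs_old - pbs_new);
}
```

The last two lines show that the time saved on the keyswitch accounts for essentially all of the reduction in bootstrapping latency, which is consistent with the keyswitch being part of the programmable bootstrap.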

The next release of TFHE-rs will focus on enhancing multi-GPU performance and expanding the set of available operations.
