Making FHE Faster for ML: Beating our Previous Paper Benchmarks with Concrete ML

July 23, 2024
Benoit Chevallier-Mames and Celia Kherfallah

Speeding up FHE ML

At Zama, our goal is not only to make FHE accessible to all developers, but also to make it extremely fast. Indeed, improving the speed and efficiency of FHE is key to making it useful in real-world applications. In this blog post, we'll show you how we've made FHE faster by using Concrete ML, and how we've beaten our previous performance benchmarks from our paper called: "Programmable Bootstrapping Enables Efficient Homomorphic Inference of Deep Neural Networks".

Making an FHE compiler 

From the beginning, we realized at Zama that making an FHE compiler was key to simplifying the user experience and sparing developers from dealing with complicated cryptographic parameters and complex settings (which are critical for security and for ensuring the correctness of computations). We also decided that adding support for various backends should be a top priority, with CPU first and GPU / FPGA / ASICs later. Of course, as with any complicated deep tech project, the path is not a straight line.

Our first attempt: an ONNX-based compiler

We initially began with a purely Python-based compiler built on ONNX. Although this experiment was not completed and therefore not open-sourced, it provided valuable insights. The concept was to take an ONNX model from a machine learning user and progressively modify it to be Fully Homomorphic Encryption (FHE)-friendly. The final step would be generating FHE bytecode executable on our FHE virtual machine. Our goal was ambitious: to handle every aspect for the user.

Looking back at our 2021 privacy-preserving ML paper

In 2020 and 2021, we made rapid progress and successfully converted multi-layer perceptrons (MLPs). Our goal was to demonstrate that, with TFHE—the cryptographic scheme we use at Zama—it is possible to handle very deep neural networks. This is significant because cryptographic noise has historically limited the depth of neural networks that can be evaluated homomorphically. We showcased this breakthrough in our paper called: "Programmable Bootstrapping Enables Efficient Homomorphic Inference of Deep Neural Networks".

In this paper, we used our ONNX-based prototype to convert NN-20, NN-50, and NN-100 into their FHE equivalents and ran them on powerful servers. The NN-i models were convolutional neural networks, starting with a convolutional layer and a ReLU activation, followed by (i-1) dense layers with ReLU activations, and ending with a final dense layer. NN-20, NN-50, and NN-100 are notably deep neural networks. Our goal was to showcase the capability of TFHE, rather than to address a specific machine learning problem, since MNIST is a relatively simple task.

The results were as follows:

Table 6.2 from the Programmable Bootstrapping Enables Efficient Homomorphic Inference of Deep Neural Networks paper

From manual optimizations to FHE compilers

While the method in our paper achieved good results, we realized that creating a monolithic compiler to handle both machine learning tasks, such as quantization, and cryptographic tasks simultaneously would be too cumbersome for practical use. A change of plan was necessary to separate the different stages of model conversion to FHE:

  1. User Control: Users would handle machine-learning-related choices and tasks, such as selecting FHE-compatible layers, applying pruning, or choosing the appropriate quantization method (e.g., post-training quantization (PTQ) or quantization-aware training (QAT)).
  2. Compiler Focus: The new compiler would focus exclusively on cryptography and security, the core of our expertise.

This pivot was also the right moment to adjust our strategy. We moved away from the purely ML-oriented compiler concept towards a more generic compiler framework with specialized frontends. The first of these frontends would be Concrete ML (for machine learning) and Concrete Python (for classical Python mathematical computations).

We replaced our ONNX-based approach with an MLIR-based one. MLIR, a well-established framework used in compilers, allowed us to leverage existing community-built tools for traditional compilers and to layer FHE constraints on top of them. This transition enabled us to build a more robust and flexible solution.

There were additional advantages to this pivot, such as improved support for hardware accelerators like GPUs, FPGAs, and ASICs, thanks to the Concrete backends. Additionally, this change simplified our Concrete optimizer, making the overall system more efficient and versatile.

The impact of our changes

In our ONNX-based compiler, we used an approximate computation approach, where computations in the clear were replaced by computations over encrypted data that were close but not fully exact. With Concrete, we decided to use the approximate approach only when it worked well. We also provided tools for users to verify the impact of the loss of exactness on accuracy.

For models that require full exactness, and as a fallback when the approximate computation is not sufficiently accurate, Concrete offers a fully exact approach. Today, with these two approaches, advanced users can set the degree of "approximateness" according to their needs.

At the beginning of Concrete, operators like roundPBS or truncatePBS, which replace table lookups T[i] with T′[i′] where i′ contains only the leading bits of i, were not present. These operators are particularly valuable in machine learning, as models are inherently robust to this kind of computational simplification, and the shorter i′ index makes the table lookups significantly faster. In our ONNX-based compiler, these operators were available natively for complex reasons. Today, they are also available in Concrete and are widely used in Concrete ML.
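To give an intuition of why this helps, here is a minimal pure-Python sketch of the idea (not the Concrete API): a 6-bit table lookup T[i] is replaced by a lookup indexed on the 3 leading bits of i. The table contents and bit-widths are arbitrary illustrative choices:

import numpy as np

# Illustrative only (not the Concrete API): a 6-bit table lookup T[i]
# approximated by a lookup indexed on the 3 leading bits of i.
n_bits, kept_bits = 6, 3
dropped = n_bits - kept_bits
T = np.round(127 * np.tanh((np.arange(2**n_bits) - 32) / 8)).astype(int)

def rounded_lookup(i: int) -> int:
    # Round i to its kept leading bits (conceptually what roundPBS does),
    # then evaluate the original table at the reconstructed index.
    i_rounded = ((i + (1 << (dropped - 1))) >> dropped) << dropped
    i_rounded = min(i_rounded, 2**n_bits - 2**dropped)  # saturate to the last bucket
    return int(T[i_rounded])

exact = T
approx = np.array([rounded_lookup(i) for i in range(2**n_bits)])
print("distinct table entries actually used:", len(set(approx)))  # at most 2**kept_bits
print("max absolute error:", int(np.max(np.abs(exact - approx))))

Because only 2^3 = 8 distinct indices remain after rounding, the lookup effectively operates at a much lower precision, which is what makes the corresponding programmable bootstrapping significantly cheaper.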

Due to these changes (the exact approach and the initial absence of roundPBS), reproducing experiments like NN-20 on the first releases of Concrete ML would have resulted in much longer execution times. As a team, our goal was to make Concrete outperform our first compiler. We are pleased to announce that we have achieved this milestone, and we have created notebooks to demonstrate this improvement to you.

Replicating the NN-20 and NN-50 Experiments with Concrete ML

Now, with this notebook, you can easily reproduce the NN experiments. A 20-layer neural network, Fp32MNIST, was trained in the usual manner, using PyTorch on clear (unencrypted) MNIST data, with a cross-entropy loss. To initialize the network, simply specify the number of layers and create an instance as follows:

fp32_mnist = Fp32MNIST(nb_layers=20)
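The Fp32MNIST class itself comes from the notebook. Purely as an illustration of the NN-i architecture described above, a hypothetical definition could look like the sketch below; the hidden width, the channel count, and the exact module layout (and therefore indices such as linears.54 used further down) are illustrative assumptions, not the notebook's real code:

import torch
import torch.nn as nn

class Fp32MNIST(nn.Module):
    """Hypothetical NN-i model: one conv layer, then dense + ReLU blocks, then a classifier."""

    def __init__(self, nb_layers: int = 20, hidden: int = 92):
        super().__init__()
        # The convolutional front-end counts as the first of the nb_layers layers
        self.conv = nn.Conv2d(1, 2, kernel_size=3, stride=1, padding=1)
        self.flatten = nn.Flatten()
        layers = [nn.Linear(2 * 28 * 28, hidden), nn.ReLU()]
        for _ in range(nb_layers - 3):  # intermediate dense + ReLU blocks
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers += [nn.Linear(hidden, 10)]  # final classifier over the 10 MNIST digits
        self.linears = nn.Sequential(*layers)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        return self.linears(self.flatten(x))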

Train the network for several epochs on MNIST until convergence. To speed up the training of the 50-layer model, we began by initializing its first 20 layers with the weights from the previously trained 20-layer network. These layers were then frozen to focus training solely on the last 30 layers. Finally, we unfroze all the layers and fine-tuned the entire network.

fp32_mnist = Fp32MNIST(nb_layers=50)

# Load the weights of the previously trained 20-layer network
checkpoint = torch.load("MLP_20/fp32/MNIST_fp32_state_dict.pt")
# The 20-layer model's final classifier is not reused: drop it from the checkpoint
del checkpoint["linears.54.weight"]
del checkpoint["linears.54.bias"]

# Freeze the layers that are initialized from the 20-layer checkpoint
for k, v in fp32_mnist.named_parameters():
    if "linears.54" in k:  # stop at the point where the 20-layer network ended
        break
    v.requires_grad = False
    print(f"Frozen {k}")

# strict=False: the remaining (deeper) layers are not present in the checkpoint
fp32_mnist.load_state_dict(checkpoint, strict=False)
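The final fine-tuning pass mentioned above (unfreezing everything) might look like the following minimal sketch; the optimizer, learning rate, number of epochs, and train_loader are placeholders rather than the notebook's actual settings:

import torch.nn.functional as F

# Unfreeze all layers before fine-tuning the whole 50-layer network
for p in fp32_mnist.parameters():
    p.requires_grad = True

optimizer = torch.optim.Adam(fp32_mnist.parameters(), lr=1e-4)  # placeholder hyper-parameters
for epoch in range(2):                                          # a few fine-tuning epochs
    for x, y in train_loader:                                   # assumed MNIST DataLoader
        optimizer.zero_grad()
        loss = F.cross_entropy(fp32_mnist(x), y)                # same cross-entropy loss as before
        loss.backward()
        optimizer.step()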

We did not reproduce the NN-100 training and experiments, as such depth does not offer any accuracy advantage and training such a deep neural network is time-consuming. However, we emphasize that NN-100 would still be supported by Concrete ML: thanks to its bootstrapping capability, which can reduce noise as often as needed, TFHE can support any depth.

The NN-20 and NN-50 models were compiled with Concrete ML to work on encrypted data as follows:

from concrete.ml.torch.compile import compile_torch_model

q_module = compile_torch_model(
    fp32_mnist,
    torch_inputset=data_calibration,
    n_bits=6,
    rounding_threshold_bits={"n_bits": 6, "method": "APPROXIMATE"},
    p_error=0.1,
)

The quantization parameter n_bits was set to 6. With post-training quantization, which is performed by compile_torch_model, values greater than or equal to 6 bits should be used to avoid degrading accuracy. To ensure optimal latency on encrypted data, the rounding_threshold_bits and p_error values were set through experimentation. See the documentation on rounding and on the TLU one-off error tolerance for more details.
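Once compiled, the returned quantized module can be used for inference. The snippet below is a minimal sketch of how one might compare FHE simulation with an actual encrypted run; x_test is a placeholder batch, the fhe="simulate" / "execute" modes follow the Concrete ML documentation, and key generation is assumed to be handled by the library:

import numpy as np

# Placeholder: a small batch of MNIST images as a float32 array of shape (N, 1, 28, 28)
x_test = np.asarray(data_calibration[:10], dtype=np.float32)

# FHE simulation: runs in the clear but models the TFHE error probability,
# which is a quick way to check the accuracy impact of rounding and p_error
y_sim = q_module.forward(x_test, fhe="simulate")

# Actual execution on encrypted data (much slower; this is what the latency benchmarks measure)
y_fhe = q_module.forward(x_test, fhe="execute")

print("simulated predictions:", y_sim.argmax(axis=1))
print("encrypted predictions:", y_fhe.argmax(axis=1))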

New results

We ran the experiments on an AWS hpc7a machine. The results can be found in the following table:

As we can see, the execution time for NN-20 is 21x faster than it was in our 2021 paper, and 14x faster for NN-50. Finally, we have reached the point where every aspect of our pivot is positive:

  • It’s easier for our users
  • We benefit from the advantages of the MLIR framework
  • We can easily support several frontends, enabling the compilation of Python, ML models, and potentially many other languages
  • We can support various backends for different hardware accelerators.

Part of this improvement ratio (approximately 2x) comes from the machines themselves, which have improved. The rest comes from:

(i) Our efforts in software engineering, particularly in the underlying TFHE-rs library;

(ii) Our improvements in cryptography, notably the roundPBS operator;

(iii) Better quantization techniques in the ML parts, as well as the effective use of the roundPBS;

(iv) Much better management of the compilation in our MLIR-based Concrete.
