Training Predictive Models on Encrypted Data using Fully Homomorphic Encryption

March 14, 2024
Jordan Frery and Luis Montero

The transformative power of data across sectors like healthcare, finance, advertising, and genomics cannot be overstated. Yet, as valuable as this data can be, it is often laden with sensitive information that can include personally identifiable details, making its security paramount. Herein lies the potential of Fully Homomorphic Encryption (FHE) – a groundbreaking technology that secures data while preserving its utility, allowing data owners to process it even in its encrypted state.

The implications of FHE stretch far into the future of machine learning, offering a path to unlock use-cases where data privacy isn't just a requirement but a cornerstone. By enabling the training of machine learning models on encrypted data, FHE introduces a new era of privacy protections in collaborative environments. Imagine a scenario where entities can enrich their models by leveraging the data of others, without ever compromising the integrity and confidentiality of the information shared. This not only safeguards privacy but also fosters a culture of trust and cooperation.

Moreover, in such collaborative settings, the ability to train interpretable models on encrypted data can be revolutionary. It offers clear insights into how external data enhances model performance, thereby validating the value of collaboration. This transparency and understanding are crucial, creating a solid foundation of confidence among parties that their partnership is not only secure but also mutually beneficial.

In essence, Fully Homomorphic Encryption is not just a tool for data security; it is a catalyst for innovation, enabling safer, more productive collaborations across industries where privacy concerns have traditionally hindered progress. Its application in training machine learning models on encrypted data promises a future where privacy and utility go hand in hand, unlocking unprecedented potential for growth and advancement.

Training on encrypted data with Concrete ML

With its most recent release, v1.4, Concrete ML supports training Logistic Regression models on encrypted data. This class of models is well known for its simplicity, robustness, and interpretability. The implementation uses stochastic gradient descent to train the model, so the dataset is split into batches during training.
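To make the batch-wise training concrete, here is a minimal cleartext sketch of mini-batch stochastic gradient descent for logistic regression. It is an illustration of the general algorithm only: the function names, the learning rate, and the default values are assumptions chosen for this example and are not part of the Concrete ML API, and nothing here is encrypted.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic_regression(X, y, batch_size=8, epochs=50, learning_rate=0.5, seed=0):
    """Fit weights and a bias with mini-batch SGD on cleartext data (hypothetical helper)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    indices = list(range(len(X)))

    for _ in range(epochs):
        rng.shuffle(indices)
        # Walk through the shuffled samples one batch at a time
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = [0.0] * n_features
            grad_b = 0.0
            for i in batch:
                z = sum(w * x for w, x in zip(weights, X[i])) + bias
                error = sigmoid(z) - y[i]  # gradient of the log loss w.r.t. z
                for j in range(n_features):
                    grad_w[j] += error * X[i][j]
                grad_b += error
            # Update with the gradient averaged over the batch
            for j in range(n_features):
                weights[j] -= learning_rate * grad_w[j] / len(batch)
            bias -= learning_rate * grad_b / len(batch)
    return weights, bias


def predict(X, weights, bias):
    """Return the predicted class (0 or 1) for each row of X."""
    return [
        1 if sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) >= 0.5 else 0
        for row in X
    ]


# Toy, linearly separable data with a single feature
X_toy = [[-1.0], [-0.8], [-0.5], [0.5], [0.8], [1.0]]
y_toy = [0, 0, 0, 1, 1, 1]
toy_weights, toy_bias = sgd_logistic_regression(X_toy, y_toy)
toy_predictions = predict(X_toy, toy_weights, toy_bias)
```

Each batch update only needs additions and multiplications plus the sigmoid, which gives a sense of why this kind of model is a reasonable first candidate for training on encrypted data.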

The current encrypted training approach in Concrete ML performs encryption, training, and decryption jointly. This is useful during development for data scientists who want to explore the accuracy of models trained on encrypted data. In a real client-server application, these steps would be separate so that the server only receives and sends encrypted data.

First, instantiate the Logistic Regression training class, SGDClassifier.

from concrete.ml.sklearn import SGDClassifier
model = SGDClassifier(fit_encrypted=True)

Next, call the fit function while specifying that training should run in FHE. This function quantizes and encrypts the training data and labels and, after training, decrypts the resulting model.

model.fit(X_binary, y_binary, fhe="execute")
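The quantization step mentioned above can be pictured with a generic uniform quantizer. The sketch below is an assumption made for illustration: it is not Concrete ML's actual quantization scheme, and the helper names, bit width, and value range are hypothetical.

```python
def quantize(values, n_bits, value_range):
    """Map floats in value_range to integers in [0, 2**n_bits - 1] (hypothetical helper)."""
    low, high = value_range
    levels = 2 ** n_bits - 1
    scale = (high - low) / levels
    quantized = []
    for v in values:
        clipped = min(max(v, low), high)  # values outside the range are clipped
        quantized.append(round((clipped - low) / scale))
    return quantized, scale


def dequantize(quantized, scale, value_range):
    """Map quantized integers back to approximate float values."""
    low, _ = value_range
    return [q * scale + low for q in quantized]


# Features scaled to [-1, 1] and quantized on 8 bits
features = [-1.0, -0.33, 0.25, 1.0]
q_features, q_scale = quantize(features, n_bits=8, value_range=(-1.0, 1.0))
recovered = dequantize(q_features, q_scale, (-1.0, 1.0))
```

The recovered values differ from the originals by at most half a quantization step, which is the precision that is traded away so that the computation can run on integers.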

The model parameters, stored in the clear in the model object, can now be used to predict on new clear or encrypted data.

You can use the above code to learn a classifier on the well-known breast-cancer dataset. The following code shows how to train batch by batch in order to monitor the model accuracy throughout the training process.

First, load, split, scale, and shuffle the dataset.

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Load the dataset and hold out 30% of it for testing
X2, y2 = datasets.load_breast_cancer(return_X_y=True)
x2_train, x2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3, stratify=y2)

# Scale the features to [-1, 1], fitting the scaler on the training set only
scaler = MinMaxScaler(feature_range=(-1, 1))
x2_train = scaler.fit_transform(x2_train)
x2_test = scaler.transform(x2_test)

# Shuffle the training set
rng = np.random.default_rng()
perm = rng.permutation(x2_train.shape[0])
x2_train = x2_train[perm, :]
y2_train = y2_train[perm]

Now, train the classifier by encrypting each batch of data individually and running it through the training algorithm. The model is decrypted after each batch so that it can be evaluated.

clf = SGDClassifier(
    random_state=42,
    max_iter=50,
    fit_encrypted=True,
    warm_start=True,
)

# Accuracy measured after each batch, kept for plotting
accuracies = []

# Go through the training batches
for idx in range(x2_train.shape[0] // clf.batch_size):
    batch_range = range(idx * clf.batch_size, (idx + 1) * clf.batch_size)
    x_batch = x2_train[batch_range, :]
    y_batch = y2_train[batch_range]

    # Fit on a single batch with partial_fit
    clf.partial_fit(x_batch, y_batch, fhe="execute")

    # Measure accuracy of the model with FHE simulation
    clf.compile(x2_train)
    y_pred_fhe = clf.predict(x2_test, fhe="simulate")
    accuracy = (y_pred_fhe == y2_test).mean()
    accuracies.append(accuracy)
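One detail of the loop above is worth spelling out: because the number of iterations is the sample count floor-divided by the batch size, only complete batches are processed and any leftover samples are skipped. The short sketch below shows the same indexing with a hypothetical helper and made-up sizes.

```python
def full_batch_slices(n_samples, batch_size):
    """Yield one slice per complete batch; leftover samples are skipped (hypothetical helper)."""
    for idx in range(n_samples // batch_size):
        yield slice(idx * batch_size, (idx + 1) * batch_size)


# 20 samples with a batch size of 8: two complete batches, 4 samples left over
batch_slices = list(full_batch_slices(n_samples=20, batch_size=8))
n_skipped = 20 - sum(s.stop - s.start for s in batch_slices)
```

Shuffling the training set before batching, as done earlier, means the skipped samples are a random subset rather than always the same rows.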

Plotting the accuracy measured after each batch gives the following graph:

[Figure: model accuracy on the test set after each training batch]

The training time for this model is around 11 seconds per batch on a large AWS server. For the entire breast-cancer dataset, the time to train the model is 13 minutes. For more results, check out this research paper, which will be presented at FHE.org 2024.

Future work

While only the single-user experimentation use case is currently explored, the natural use case for training is the client-server setting. Future versions of Concrete ML will add this deployment setting, following the API for deploying encrypted inference.

Collaboration between multiple parties is where training on encrypted data really shines. Using threshold protocols, multiple parties can generate keys that keep their individual data secure while allowing joint training. Furthermore, future versions of Concrete ML will enable encrypted training for more complex models such as neural networks.
