Faster Post-Quantum TLS 1.3 Based on ML-KEM: Implementation and Assessment (2024)

School of Computer Science and Technology, Fudan University, Shanghai, China
Email: {jyzheng23,hlzhu22,yfdong22,songzy23,zhenhaozhang23}@m.fudan.edu.cn, {18110240046,ylzhao}@fudan.edu.cn

Jieyu Zheng, Haoliang Zhu, Yifan Dong, Zhenyu Song, Zhenhao Zhang, Yafang Yang, Yunlei Zhao (ylzhao@fudan.edu.cn)

Abstract

TLS is extensively utilized for secure data transmission over networks. However, with the advent of quantum computers, the security of TLS based on traditional public-key cryptography is under threat. To counter quantum threats, it is imperative to integrate post-quantum algorithms into TLS. Most PQ-TLS research focuses on integration and evaluation, but few studies address the improvement of PQ-TLS performance by optimizing PQC implementation.
For the TLS protocol, handshake performance is crucial, and for post-quantum TLS (PQ-TLS) the performance of post-quantum key encapsulation mechanisms (KEMs) directly impacts handshake performance. In this work, we explore the impact of post-quantum KEMs on PQ-TLS performance. We explore how to improve ML-KEM performance using Intel’s latest Advanced Vector Extensions instruction set, AVX-512. We detail a spectrum of techniques devised to parallelize polynomial multiplication, modular reduction, and other computationally intensive modules within ML-KEM. Our optimized ML-KEM implementation achieves up to a 1.64 $\times$ speedup compared to the latest AVX2 implementation. Furthermore, we introduce a novel batch key generation method for ML-KEM that can seamlessly integrate into the TLS protocols. The batch method accelerates the key generation procedure by 3.5 $\times$ to 4.9 $\times$ . We integrate the optimized AVX-512 implementation of ML-KEM into TLS 1.3, and assess handshake performance under both PQ-only and hybrid modes. The assessment demonstrates that our faster ML-KEM implementation results in a higher number of TLS 1.3 handshakes per second under both modes. Additionally, we revisit two IND-1-CCA KEM constructions discussed in Eurocrypt’22 and Asiacrypt’23. We implement them based on ML-KEM and integrate the better-performing one into TLS 1.3 with benchmarks.

Keywords: Post-Quantum Cryptography, TLS 1.3, ML-KEM, AVX-512.

1 Introduction

Digital communications are ubiquitous worldwide, with most Internet connections relying on Transport Layer Security (TLS) to secure data transmission. However, the current TLS protocol remains vulnerable to quantum attacks. TLS employs public-key cryptography algorithms, including Elliptic Curve Diffie-Hellman (ECDH), the Elliptic Curve Digital Signature Algorithm (ECDSA), and RSA, all of which are susceptible to quantum computing threats, as demonstrated by Shor’s algorithm [46]. To address this vulnerability, the National Institute of Standards and Technology (NIST) launched the Post-Quantum Cryptography (PQC) competition in 2016. After three rounds, NIST announced the selection of the first algorithms to be standardized, including Kyber, Dilithium, Falcon, and SPHINCS+ [35]. In August 2023, NIST released three draft standards: ML-KEM [51], ML-DSA [50], and SLH-DSA [52], derived from Kyber, Dilithium, and SPHINCS+ respectively.

Amidst the continuous progress in NIST’s PQC algorithm standardization, significant endeavors have been devoted to the development of PQ-TLS over the past decade. Beginning in 2014, Chang et al. [14] introduced a post-quantum SSL/TLS library for embedded systems. Subsequently, Bos et al. [12] developed PQC ciphersuites for TLS based on the ring learning with errors (R-LWE) problem. In July 2016, Chrome introduced the newhope1024 post-quantum option, signaling an initial step towards integrating post-quantum cryptography into mainstream browsers. However, due to patent issues, Chrome removed the newhope1024 option in November 2016. Subsequent efforts by industry giants such as Google and Cloudflare focused on evaluating the performance of post-quantum cryptographic candidates within TLS. In 2018, Google and Cloudflare conducted experiments with NIST PQC candidate HRSS and X25519 in TLS 1.3 Chrome, while in 2019 they utilized ntruhrss701 and sntrup761 in their TLS experiments [31].

Recent research on PQ-TLS has predominantly concentrated on three pivotal domains:

• Integration of PQC into the TLS Protocol: Exploring methodologies to seamlessly incorporate post-quantum cryptographic mechanisms into the TLS framework [18, 44, 43, 40, 39, 23, 37].
• Performance Evaluation and Communication Overheads: Assessing the computational efficiency and communication overheads incurred by post-quantum cryptographic primitives within the TLS ecosystem [26, 49, 13, 53, 25, 21, 48, 38, 47, 6].
• Optimized Implementations for Enhanced PQ-TLS Efficiency: Enhancing the computational efficiency of post-quantum cryptography within TLS through optimized implementations [10].

Most research on PQ-TLS primarily focuses on exploring how to integrate various PQC cryptographic primitives into TLS and evaluating their performance. However, there is a noticeable scarcity of work improving PQ-TLS performance. Performance stands as a critical factor in TLS applications. As an integral component of PQ-TLS, PQC algorithms directly impact the handshake time of PQ-TLS. Therefore, optimizing the implementation of PQC algorithms and integrating them into TLS may contribute to reducing the handshake time of PQ-TLS.

Motivations.

Recent work presented an accelerated Ed25519 and X25519 AVX-512 engine tailored for TLS 1.3, offering significant performance improvement [55]. Previous work explored optimized PQC implementations using the latest Intel Single Instruction Multiple Data (SIMD) instruction AVX-512 (e.g. [15, 32, 16]). However, the integration of PQC AVX-512 optimized implementations into TLS 1.3 remains unrealized, and there is currently no AVX-512 implementation available for ML-KEM. This gap prompts our focus on optimizing ML-KEM using AVX-512 and seamlessly integrating the optimized ML-KEM implementation into TLS 1.3, and also provides an opportunity to explore the impact of AVX-512 instructions on PQ-TLS handshake protocols.

As of now, existing research on PQ-TLS migration relies on OQS-OpenSSL, which lacks support for OpenSSL3 and remains outdated. However, the latest OQS provider [8] not only supports OpenSSL3 but also facilitates a clear separation between OpenSSL code and PQC KEM code. Our comprehensive integration of ML-KEM using the OQS provider could provide valuable guidance for researchers seeking to migrate to PQ-TLS.

In addition, recent studies [28, 30, 43, 44] have affirmed that IND-1-CCA KEMs are sufficient for the TLS 1.3 handshake to be secure. Intriguingly, IND-1-CCA KEMs can be obtained from any OW-CPA/IND-CPA KEM without the re-encryption and de-randomization [28, 30] that are required by the IND-CCA KEMs previously used in PQ-TLS. It follows naturally that the TLS 1.3 handshake might become more efficient when such IND-1-CCA KEMs are applied. However, there remains a notable absence of experiments on TLS 1.3 with IND-1-CCA-secure KEMs. This gap motivates us to conduct experiments and evaluations on IND-1-CCA-secure PQ-TLS handshake protocols.

Contributions.

In this work, we aim to bridge the gap between PQC engineering implementations and TLS protocol applications. We approach this task from both an optimization engineering perspective and a TLS system perspective. We will open source our code later.

• We present the first optimized implementation of ML-KEM using AVX-512. As the main bottlenecks in ML-KEM are polynomial multiplication and hash functions, we achieve 32-way parallel polynomial multiplication and 8-way parallel hashing. Besides, we enhance polynomial rejection sampling and centered binomial distribution sampling through new AVX-512 features such as mask registers and compress-store instructions. Our implementation passes NIST’s KAT tests and achieves up to a 1.64 $\times$ speedup compared to the state-of-the-art AVX2 implementation of ML-KEM.
• We propose a batch key generation method for ML-KEM that generates 8 independent key pairs at once. Our batch key generation method achieves a speedup of 3.5 $\times$ to 4.9 $\times$ compared to key generation without batching. This batch generation approach can also be applied to other key generation processes involving hash function calls.
• We revisit two IND-1-CCA KEM constructions discussed in Eurocrypt’22 [28] and Asiacrypt’23 [30], and implement them with the underlying CPA-secure PKE of ML-KEM. We then evaluate the performance of IND-1-CCA KEMs, and integrate the better one into TLS 1.3. The benchmark results indicate that IND-1-CCA KEMs improve the performance of the TLS 1.3 handshake compared to IND-CCA KEMs.
• We integrate the AVX-512 optimized implementation of ML-KEM into TLS 1.3, assess its impact on TLS 1.3 handshake time, and evaluate the influence of different KEM constructions on TLS handshake efficiency. Our evaluation reveals that an efficient implementation of ML-KEM utilizing AVX-512 can yield a higher number of handshakes per second compared to the latest AVX2 implementation.

2 Preliminaries

2.1 Notation

The notation in this paper is the same as the FIPS 203 draft [51]. We denote $\mathcal{R}_{q}$ as the cyclic polynomial ring $\mathbb{Z}_{q}[x]/(x^{n}+1)$ . We define $r^{\prime}=r\bmod^{\pm}\alpha$ (resp. $r^{\prime}=r\bmod\alpha$ ) to be the unique element $r^{\prime}$ in the range $-\left\lfloor\frac{\alpha}{2}\right\rfloor<r^{\prime}\leq\left\lfloor\frac{\alpha}{2}\right\rfloor$ (resp. $0\leq r^{\prime}<\alpha$ ) such that $\alpha\mid(r-r^{\prime})$ . By default, regular font letters denote elements in $\mathcal{R}_{q}$ , bold lower-case letters are column vectors and bold upper-case letters are matrices.

2.2 ML-KEM

ML-KEM is a NIST-standardized lattice-based KEM. Its security is based on the Module Learning With Errors (M-LWE) problem. ML-KEM is derived from Round 3 version of Kyber [7]. The polynomial multiplication over $\mathbb{Z}_{3329}[x]/\left(x^{256}+1\right)$ is a fundamental operation in ML-KEM. Utilizing the property $n|(q-1)$ , ML-KEM employs an incomplete Number Theoretic Transform (NTT) to accelerate this operation. ML-KEM uses four SHA-3 hash functions: SHA3-256, SHA3-512, SHAKE128, and SHAKE256. For more details, readers can refer to FIPS 203 draft [51].
Number Theoretic Transform. NTT is a variant of the Fast Fourier Transform (FFT) over finite fields. Its essence is to use the point-value representation of polynomials to perform efficient polynomial multiplication. We denote the forward transform as NTT and the inverse transform as INTT. The symbol “ $\cdot$ ” denotes point-wise multiplication. Polynomial multiplication $h(x)=f(x)\times g(x)\in\mathbb{Z}_{q}[x]/\left(x^{n}+1\right)$ can be computed as $h=\mathrm{INTT}\left(\mathrm{NTT}(f)\cdot\mathrm{NTT}(g)\right)$ .
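The NTT-based product $h=\mathrm{INTT}(\mathrm{NTT}(f)\cdot\mathrm{NTT}(g))$ can be checked numerically with a small sketch. It is an illustration under simplifying assumptions, not the implementation described in this paper: it works in the toy ring $\mathbb{Z}_{3329}[x]/(x^{128}+1)$ , where $\psi=17$ is a primitive 256th root of unity so that a complete transform exists, and it evaluates the transform naively in $O(n^{2})$ instead of using layered butterflies. All function names are hypothetical.

```python
import random

Q = 3329   # ML-KEM modulus
N = 128    # toy degree (assumption): a complete transform exists for x^128 + 1
PSI = 17   # primitive 2N-th root of unity mod Q, so PSI**N == -1 (mod Q)

def ntt(f):
    """Evaluate f at the N roots PSI^(2i+1) of x^N + 1 (naive O(N^2))."""
    return [sum(c * pow(PSI, (2 * i + 1) * j, Q) for j, c in enumerate(f)) % Q
            for i in range(N)]

def intt(F):
    """Inverse of ntt: interpolate back to coefficient form."""
    n_inv = pow(N, -1, Q)
    psi_inv = pow(PSI, -1, Q)
    return [n_inv * sum(v * pow(psi_inv, (2 * i + 1) * j, Q)
                        for i, v in enumerate(F)) % Q
            for j in range(N)]

def ntt_mul(f, g):
    """h = INTT(NTT(f) . NTT(g)) with point-wise products."""
    return intt([a * b % Q for a, b in zip(ntt(f), ntt(g))])

def schoolbook_mul(f, g):
    """Reference product in Z_Q[x]/(x^N + 1), using x^N = -1."""
    h = [0] * N
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            k, sign = (i + j, 1) if i + j < N else (i + j - N, -1)
            h[k] = (h[k] + sign * a * b) % Q
    return h
```

Comparing `ntt_mul(f, g)` with `schoolbook_mul(f, g)` on random inputs confirms that point-wise multiplication in the transformed domain matches multiplication modulo $x^{n}+1$ .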

2.3 AVX-512 Instruction Set

The AVX-512 instruction set was introduced by Intel in 2013 and initially supported in the 2016 Xeon Phi series processors [2]. It has since become a pivotal feature for both major CPU manufacturers. AVX-512 offers several key functionalities: it introduces 32 512-bit zmm registers, enabling simultaneous processing of multiple data elements and accelerating vectorized computations. AVX-512 covers a wide range of floating-point and integer operations for diverse computational requirements. Additionally, AVX-512 provides 8 mask registers (k0-k7) for conditional operations, allowing instructions to act only on selected lanes. These mask registers enhance the flexibility and efficiency of data processing by enabling advanced operations such as compression/expansion and masking.

2.4 TLS 1.3 and PQ-TLS

TLS is a standard developed by the Internet Engineering Task Force (IETF) in 1999. Its primary role is to encrypt communication between web applications and servers, with the latest version being TLS 1.3 [41]. When initiating a TLS connection, the client and server exchange parameters such as TLS version and ciphersuite. Our study focuses on enhancing TLS security against post-quantum threats, particularly through the adoption of PQC Key Exchange (KEX) in PQ-TLS. PQ-TLS operates in two modes: hybrid mode and PQ-only mode. The hybrid mode, as standardized by the IETF in TLS 1.3 [1], supports simultaneous usage of ECDH and PQC KEM, albeit at the cost of increased data transmission size and computational resources. Conversely, the PQ-only mode exclusively employs PQC KEM for Key Exchange and PQC signature algorithms for authentication.

3 ML-KEM AVX-512 Implementation

The most time-consuming operations in ML-KEM include modular reduction, polynomial sampling, polynomial multiplication, and hash functions. In this section, we will outline our rationale behind the AVX-512 implementation design for these computationally intensive components.

3.1 Modular Reduction Implementation

Modular reduction plays a crucial role in ML-KEM polynomial arithmetic, where the arithmetic operates over the ring $\mathbb{Z}_{q}$ with $q=3329$ . However, AVX-512 instructions lack dedicated support for modular reduction computations. In this context, we introduce the signed versions of two constant-time reduction algorithms commonly utilized in lattice-based cryptography: Montgomery reduction [34] and Barrett reduction [9], as proposed by Seiler [45].
Reduction Algorithms. The signed Montgomery reduction computes the Hensel remainder of a signed integer within the range $[-\beta q/2,\beta q/2)$ , and is employed in ML-KEM to reduce the product of two 16-bit coefficients. The output is congruent to the input multiplied by $\beta^{-1}$ modulo $q$ , which is why operands are kept in the Montgomery domain. Barrett reduction operates within a smaller input range compared to Montgomery reduction. Typically, the input to Barrett reduction does not exceed a 16-bit signed integer range. Therefore, Barrett reduction is commonly utilized to reduce coefficients that surpass the range $[-q,q]$ after addition and subtraction operations. The resulting output from Barrett reduction remains within $\mathbb{Z}_{q}$ .
AVX-512 Implementations. We give our AVX-512 implementations of signed Montgomery reduction and Barrett reduction. For the modulus $q=3329=2^{8}\cdot 13+1$ , we set $\beta=2^{16}$ . This configuration ensures that one coefficient occupies 16 bits within a 512-bit vector register, enabling 32-way parallelism. Given that $v$ remains constant in Barrett reduction [45, Sec 3.3], we can precompute $v$ and store it in a vector register. The computation of $av/\beta$ involves two operations: multiplication and division. Both can be executed using a single AVX-512 instruction, vpmulhw, which multiplies corresponding signed 16-bit lanes of two vector registers and retains only the high 16 bits of each product. The remaining division by $2^{\lfloor\log(q)\rfloor-1}=2^{10}$ can be efficiently implemented using AVX-512 shift instructions. In Algorithm 1, we introduce a macro red16 designed to compute Barrett reduction using AVX-512 instructions. Within this macro, the zmm\r register holds the input and the output of the Barrett reduction, while zmm\rv stores the constant used in Barrett reduction, $v=\left\lfloor\frac{2^{\lfloor\log(3329)\rfloor-1}2^{16}}{3329}\right\rceil=20159$ . Additionally, the zmm\rl register stores an intermediate value. Furthermore, we will introduce our 32-way Montgomery reduction AVX-512 implementation, combined with the butterfly operation, in Section 3.2.

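Algorithm 1: Barrett reduction macro red16
.macro red16 r, rv, rl
    vpmulhw %zmm\rv, %zmm\r, %zmm\rl    # rl = high 16 bits of r * v
    vpsraw  $10, %zmm\rl, %zmm\rl       # rl = floor(r * v / 2^26), the quotient estimate
    vpmullw %zmm0, %zmm\rl, %zmm\rl     # rl = quotient * q (assumes zmm0 holds q, as in Section 3.2)
    vpsubw  %zmm\rl, %zmm\r, %zmm\r     # r = r - quotient * q
.endm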
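A scalar model of a single 16-bit lane can make the arithmetic behind signed Barrett reduction concrete. The sketch below is an illustration in Python rather than the assembly used in this work: the helper names are hypothetical, `mulhi16` and `mullo16` mimic the per-lane behaviour of vpmulhw and vpmullw, and the quotient estimate is multiplied by $q$ before the subtraction, as in the textbook algorithm [45].

```python
Q = 3329
V = 20159  # round(2**26 / Q), the precomputed Barrett constant

def to_int16(x):
    """Wrap an integer to a signed 16-bit lane (two's complement)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def mulhi16(a, b):
    """High 16 bits of a signed 16x16-bit product (vpmulhw-style lane)."""
    return (a * b) >> 16

def mullo16(a, b):
    """Low 16 bits of a signed 16x16-bit product (vpmullw-style lane)."""
    return to_int16(a * b)

def barrett_reduce_lane(a):
    """Reduce a signed 16-bit value a to a representative congruent mod Q."""
    t = mulhi16(a, V)      # floor(a * V / 2^16)
    t >>= 10               # floor(a * V / 2^26), estimate of floor(a / Q)
    t = mullo16(t, Q)      # quotient * Q, wrapped to 16 bits
    return to_int16(a - t) # wrap-around cancels because the true result is small
```

For every signed 16-bit input the model returns a value in $[0,q]$ that is congruent to the input modulo $q$ ; the intermediate product may wrap around 16 bits, but the final subtraction undoes the wrap.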

3.2 NTT Implementation

NTT is one of the most intricate components in the ML-KEM AVX-512 implementation. Both NTT and INTT operations in ML-KEM require 7 layers of butterfly operations to obtain the final result. In this section, we will introduce several techniques used in NTT implementation.
Register Allocation. A 512-bit vector register zmm can accommodate a maximum of 32 16-bit integers, so only 8 vector registers are required to store all 256 polynomial coefficients. Therefore, we allocate 8 vector registers for storing coefficients, 2 for intermediate values, and 2 for constants, leaving 20 vector registers unused.
Butterfly Unit. Our Cooley-Tukey butterfly pseudo-code is outlined in Algorithm 3. Registers zmm\l and zmm\r respectively store coefficients $f_{j}$ and $f_{j+len/2}$ , while zmm0 holds the modulus $q$ . We fold $q^{-1}\bmod^{\pm}\beta$ into the precomputed twiddle factors to save one multiplication in Montgomery reduction. To obtain the low and high 16 bits of each 16-bit integer product, we utilize the vector integer multiplication instructions vpmullw and vpmulhw. These instructions eliminate the necessity of extending coefficients to 32 bits after multiplication. Our Gentleman-Sande butterfly pseudo-code is presented in Algorithm 2. Similar to the Cooley-Tukey butterfly, registers zmm\l and zmm\r store coefficients $f_{j}$ and $f_{j+len/2}$ , respectively. Additionally, registers zmm\zl and zmm\zh hold the precomputed $\zeta\cdot q^{-1}$ and $\zeta$ values, respectively. Although the order of twiddle factors in the NTT differs from that in the INTT, the same twiddle factors can be used. However, employing a single twiddle factor table necessitates additional permutation. In our implementation, we opt to utilize two distinct twiddle factor tables for the NTT and INTT.

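Algorithm 2: Gentleman-Sande butterfly
.macro gs_butterfly l, r, zl, zh
    vpsubw  %zmm\l, %zmm\r, %zmm21      # t = r - l
    vpaddw  %zmm\r, %zmm\l, %zmm\l      # l = l + r
    vpmulhw %zmm\zh, %zmm21, %zmm22     # zmm22 = high 16 bits of t * zeta
    vpmullw %zmm\zl, %zmm21, %zmm21     # zmm21 = low 16 bits of t * (zeta * q^-1)
    vpmulhw %zmm0, %zmm21, %zmm21       # zmm21 = high 16 bits of q * zmm21
    vpsubw  %zmm21, %zmm22, %zmm\r      # r = Montgomery product of t and zeta
.endm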

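Algorithm 3: Cooley-Tukey butterfly
.macro ct_butterfly l, r, zl, zh
    vpmullw %zmm\zl, %zmm\r, %zmm11     # zmm11 = low 16 bits of r * (zeta * q^-1)
    vpmulhw %zmm\zh, %zmm\r, %zmm\r     # r = high 16 bits of r * zeta
    vpmulhw %zmm0, %zmm11, %zmm11       # zmm11 = high 16 bits of q * zmm11
    vpsubw  %zmm11, %zmm\r, %zmm11      # zmm11 = t, the Montgomery product of r and zeta
    vpsubw  %zmm11, %zmm\l, %zmm\r      # r = l - t
    vpaddw  %zmm11, %zmm\l, %zmm\l      # l = l + t
.endm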
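The data flow of both butterflies can be modelled one lane at a time. The following Python sketch rests on stated assumptions and is not the assembly of this work: the helper names are hypothetical, the twiddle factor $\zeta$ is assumed to be stored in Montgomery form ( $\zeta\beta\bmod^{\pm}q$ ) together with its product with $q^{-1}\bmod\beta$ , and the inverse butterfly is assumed to use the negated inverse twiddle so that it undoes the forward butterfly up to a factor of 2.

```python
Q = 3329
BETA = 1 << 16
QINV = pow(Q, -1, BETA)  # q^{-1} mod beta

def to_int16(x):
    """Wrap an integer to a signed 16-bit lane."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def mullo16(a, b):
    """Low 16 bits of a signed product (vpmullw-style lane)."""
    return to_int16(a * b)

def mulhi16(a, b):
    """High 16 bits of a signed 16x16-bit product (vpmulhw-style lane)."""
    return (a * b) >> 16

def twiddle_pair(zeta):
    """Return (zl, zh): zh = zeta*beta centered mod q, zl = zh * q^{-1} mod beta."""
    zh = (zeta * BETA) % Q
    if zh > Q // 2:
        zh -= Q
    return to_int16(zh * QINV), zh

def montmul(a, zl, zh):
    """Signed Montgomery product: congruent to a * zh * beta^{-1} mod q."""
    lo = mullo16(a, zl)                    # a * zeta * q^{-1} mod beta
    hi = mulhi16(a, zh)                    # high half of a * zeta
    return to_int16(hi - mulhi16(lo, Q))   # subtract high half of q * lo

def ct_butterfly(l, r, zl, zh):
    """Cooley-Tukey: (l, r) -> (l + zeta*r, l - zeta*r)."""
    t = montmul(r, zl, zh)
    return to_int16(l + t), to_int16(l - t)

def gs_butterfly(l, r, zl, zh):
    """Gentleman-Sande: (l, r) -> (l + r, zeta*(r - l))."""
    d = to_int16(r - l)
    return to_int16(l + r), montmul(d, zl, zh)
```

Because the twiddle already carries the factor $q^{-1}$ , each Montgomery product needs only two low/high multiplications by the twiddle pair and one high multiplication by $q$ , with no widening to 32 bits.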

In the implementation of NTT and INTT, we employ layer merging and coefficient permutation methods to reduce the memory access overhead. Specifically, we merge the 7 layers of NTT. We load all coefficients into 8 vector registers only at the first layer and store them back in memory after completing the final layer computation. Achieving such layer merging incurs some additional overhead for the coefficient permutation. Since each vector register accommodates 32 coefficients, in the initial three layers of the NTT, the coefficients stored in vector registers satisfy the correct distance and can directly perform the Cooley-Tukey butterfly. Starting from the fourth layer, as the butterfly distance becomes 16, the coefficient pairs requiring butterfly operations are housed within the same vector register. Thus, it’s necessary to separate the coefficient pairs at corresponding distances into different vector registers.
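The point at which permutation becomes necessary follows from simple index arithmetic. The sketch below is a hypothetical helper, assuming the contiguous layout in which register $k$ holds coefficients $32k$ to $32k+31$ ; for each of the 7 layers it lists the butterfly distance and whether both coefficients of every pair fall into the same register.

```python
N = 256      # coefficients per polynomial
LANES = 32   # 16-bit coefficients per 512-bit register

def layer_plan():
    """Return (layer, distance, pairs_share_register) for the 7 NTT layers."""
    plan = []
    dist = N // 2
    for layer in range(1, 8):
        # j is the lower index of a butterfly pair (j, j + dist)
        lower = [j for j in range(N) if (j // dist) % 2 == 0]
        same_reg = all(j // LANES == (j + dist) // LANES for j in lower)
        plan.append((layer, dist, same_reg))
        dist //= 2
    return plan
```

Under this layout the first three layers (distances 128, 64, 32) pair coefficients from different registers, while from the fourth layer on (distance 16 and below) both coefficients of a pair share one register and must be separated by a shuffle first.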


Figure 3 depicts the arrangement of coefficients within a vector register during the third to seventh layers of the NTT. The functions Shuffle16, Shuffle8, Shuffle4, Shuffle2, and Shuffle1 are utilized for permuting pairs of 16, 8, 4, 2, and 1 coefficients, respectively. We perform an extra Shuffle1 to simplify the subsequent polynomial point-wise multiplication. After completing all NTT layers, the coefficients are constrained within the range of signed 16-bit integers, eliminating the need for Barrett reduction in the NTT. This is not the case for the INTT, where we adopt lazy reduction as outlined in [54].

3.3 SHA3 Keccak Implementation

ML-KEM utilizes SHA3-256, SHA3-512, SHAKE128, and SHAKE256 as hash functions, pseudorandom functions (PRFs), and eXtendable-output functions (XOFs). These algorithms belong to the SHA-3 family [22] developed by NIST and are based on the sponge construction [11]. The permutation underlying the SHA-3 algorithms is Keccak-p[1600,24], which operates on a 1600-bit state organized as 25 lanes of 64 bits. In the previous AVX2 implementation of ML-KEM by the Crystals team, Keccak-p[1600,24] achieved 4-way parallelism, since a 256-bit vector register holds one 64-bit lane from each of 4 states. With AVX-512, we can compute 8 Keccak permutations in parallel, as a 512-bit vector register holds one 64-bit lane from each of 8 Keccak states.
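The 8-way layout can be illustrated with a plain Python model in which every Keccak lane is a list of 8 words, one per state, standing in for one 512-bit register. This is a sketch under assumptions rather than the AVX-512 code of this work: the function names are hypothetical, the vector helpers emulate element-wise register operations, and for brevity only single-block messages (shorter than the 136-byte SHA3-256 rate) are handled.

```python
import hashlib

MASK = (1 << 64) - 1
WAYS = 8  # one 64-bit lane from each of 8 states per 512-bit register

def vxor(*vecs):
    """Element-wise XOR of several 8-element vectors."""
    out = [0] * WAYS
    for v in vecs:
        out = [a ^ b for a, b in zip(out, v)]
    return out

def vrol(v, n):
    """Element-wise 64-bit left rotation."""
    n %= 64
    return [((x << n) | (x >> (64 - n))) & MASK for x in v]

def vandn(a, b):
    """Element-wise (~a) & b."""
    return [(~x) & y & MASK for x, y in zip(a, b)]

def keccak_f1600_x8(lanes):
    """Keccak-p[1600,24] on 8 states at once; lanes[x][y] is an 8-element vector."""
    r = 1
    for _ in range(24):
        # theta
        c = [vxor(*lanes[x]) for x in range(5)]
        d = [vxor(c[(x + 4) % 5], vrol(c[(x + 1) % 5], 1)) for x in range(5)]
        lanes = [[vxor(lanes[x][y], d[x]) for y in range(5)] for x in range(5)]
        # rho and pi
        x, y = 1, 0
        cur = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            cur, lanes[x][y] = lanes[x][y], vrol(cur, (t + 1) * (t + 2) // 2)
        # chi
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = vxor(row[x], vandn(row[(x + 1) % 5], row[(x + 2) % 5]))
        # iota: round constant generated by an LFSR, applied to all 8 states
        for j in range(7):
            r = ((r << 1) ^ ((r >> 7) * 0x71)) % 256
            if r & 2:
                rc = 1 << ((1 << j) - 1)
                lanes[0][0] = [w ^ rc for w in lanes[0][0]]
    return lanes

def sha3_256_x8(msgs):
    """SHA3-256 of 8 single-block messages, processed lane-parallel."""
    rate = 136
    assert len(msgs) == WAYS and all(len(m) < rate for m in msgs)
    blocks = []
    for m in msgs:
        b = bytearray(200)
        b[:len(m)] = m
        b[len(m)] ^= 0x06      # SHA-3 domain separation and first padding bit
        b[rate - 1] ^= 0x80    # final padding bit
        blocks.append(b)
    lanes = [[[int.from_bytes(b[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
               for b in blocks] for y in range(5)] for x in range(5)]
    lanes = keccak_f1600_x8(lanes)
    return [b"".join(lanes[x][0][i].to_bytes(8, "little") for x in range(4))
            for i in range(WAYS)]
```

Every step of the permutation touches all 8 states with one vector operation, which is the property the 512-bit registers exploit; the 8 inputs are independent and may have different lengths.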

3.4 Other Modules

We also implement Compress and Decompress, pkDecode, skDecode, skEncode and pkEncode, and several polynomial arithmetic functions. The implementation idea is similar to the AVX2 implementation, so we do not discuss the details here. One notable difference is that, in the implementation of pkDecode, skDecode, and Decompress, we utilize the AVX-512 instruction vpermb, which permutes individual bytes across the whole register, whereas the AVX2 byte shuffle vpshufb is limited to 128-bit lanes. This instruction allows us to conveniently adjust the order of byte streams. We implement rejection sampling using AVX-512 according to the method presented in [56].

4 ML-KEM TLS 1.3 Integration Design Consideration

In this section, we discuss how to integrate ML-KEM AVX-512 implementation into TLS 1.3.

4.1 Batch Key Generation Using Parallel Keccak

The concept of batch key generation was discussed in both [55] and [10]. In [10], Montgomery’s trick was used to compute multiple polynomial inversions, specifically targeting key generation algorithms involving polynomial inversion. On the other hand, [55] introduces an $8\times 1$ X25519 approach to batch 8 key pairs in parallel. We explore the potential applicability of batch key generation in other PQC algorithms. Drawing inspiration from [42], we propose a batch key generation approach for ML-KEM, as outlined in Algorithm 4. The SHA3-256x8 function is constructed based on the 8-way Keccak function discussed in Section 3.3. Our 8-way ML-KEM key generation AVX-512 implementation function is designed to generate 8 independent $(pk,sk)$ pairs simultaneously.

Algorithm 4: Batch ML-KEM key generation (8-way)
Output: Encapsulation keys array ek[8], each in $\mathbb{B}^{384k+32}$ .
Output: Decapsulation keys array dk[8], each in $\mathbb{B}^{768k+96}$ .
$z\stackrel{\$}{\longleftarrow}\mathbb{B}^{32}$
for i = 0 to 7 do
    $\left(\mathrm{ek}_{\mathrm{PKE}}[i],\mathrm{dk}_{\mathrm{PKE}}[i]\right)\leftarrow$ K-PKE.KeyGen()
    $\mathrm{ek}[i]\leftarrow\mathrm{ek}_{\mathrm{PKE}}[i]$
end for
$output[8]\leftarrow$ SHA3-256x8 $(\mathrm{ek}[0],\mathrm{ek}[1],\mathrm{ek}[2],\mathrm{ek}[3],\mathrm{ek}[4],\mathrm{ek}[5],\mathrm{ek}[6],\mathrm{ek}[7])$
for i = 0 to 7 do
    $\mathrm{dk}[i]\leftarrow\left(\mathrm{dk}_{\mathrm{PKE}}[i]\|\mathrm{ek}[i]\|output[i]\|z\right)$
end for
return (ek[8], dk[8])
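The data flow of the batch procedure, and the byte lengths it implies, can be sketched as follows. This is only a structural illustration: `k_pke_keygen_stub` is a hypothetical placeholder that returns random bytes of the right lengths instead of running K-PKE.KeyGen, and `sha3_256_x8` is a sequential stand-in built on hashlib rather than the 8-way Keccak of Section 3.3. As printed in Algorithm 4, a single $z$ is drawn for the whole batch.

```python
import hashlib
import os

def k_pke_keygen_stub(k):
    """Placeholder, NOT real K-PKE.KeyGen: random bytes of the right lengths."""
    ek_pke = os.urandom(384 * k + 32)
    dk_pke = os.urandom(384 * k)
    return ek_pke, dk_pke

def sha3_256_x8(msgs):
    """Sequential stand-in for the 8-way SHA3-256x8 primitive."""
    assert len(msgs) == 8
    return [hashlib.sha3_256(m).digest() for m in msgs]

def batch_keygen(k):
    """Produce 8 (ek, dk) pairs following the structure of Algorithm 4."""
    z = os.urandom(32)
    pairs = [k_pke_keygen_stub(k) for _ in range(8)]
    ek = [p[0] for p in pairs]
    dk_pke = [p[1] for p in pairs]
    h = sha3_256_x8(ek)  # one batched call hashes all 8 encapsulation keys
    dk = [dk_pke[i] + ek[i] + h[i] + z for i in range(8)]
    return ek, dk
```

The lengths add up as $384k+(384k+32)+32+32=768k+96$ bytes per decapsulation key, matching the output type of Algorithm 4, and the 8 hash evaluations are grouped into a single call that an 8-way Keccak can serve.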

4.2 ML-KEM AVX-512 TLS 1.3 Migration Implementation

We use OQS provider proposed by the Open Quantum Safe (OQS) team [36] and OpenSSL 3.3.0-dev to migrate the ML-KEM AVX-512 implementation. OpenSSL is an open-source software library implementing the SSL and TLS protocols. Most PQ-TLS research works use OQS-OpenSSL to migrate PQC algorithms into TLS 1.3. However, OQS-OpenSSL ceased updates in July 2023 and does not support the latest OpenSSL 3.0 version. The latest OQS provider separates the integration of PQ algorithms into TLS 1.3 from the main logic of OpenSSL, without altering the core cryptographic algorithms. This separation isolates the embedding of post-quantum algorithms from the extensive OpenSSL codebase. For ML-KEM AVX-512 code migration, we choose liboqs 0.10.1 [19]. Liboqs is an open-source C library for quantum-safe cryptographic algorithms. The newest liboqs 0.10.1 version adds ML-DSA and ML-KEM C reference and AVX2 codes. We add our ML-KEM AVX-512 code in the ml_kem directory of liboqs. Besides, we implement the corresponding ML-KEM AVX-512 Keygen, Encaps, and Decaps APIs in liboqs. We define the macro OQS_ENABLE_KEM_ml_kem_512_avx512 in the OQS configuration file. By configuring this macro, users can run ML-KEM-512 AVX-512 code within the liboqs library.

5 Revisiting PQC Security of KEM in TLS 1.3

In this section, we revisit PQC security of KEM in TLS 1.3 based on recent research [28, 30]. We state that the IND-CCA KEM used in TLS 1.3 handshake can be replaced by an IND-1-CCA KEM, providing improved efficiency and sufficient security.

5.1 An Efficient Choice of Key-Exchange: IND-1-CCA KEM

In existing PQ-TLS implementations, the ephemeral KEX is implemented with IND-CCA KEMs. IND-CCA KEMs are usually constructed by applying the Fujisaki-Okamoto (FO) transform or one of its variants (e.g. ML-KEM [51]) to an OW/IND-CPA PKE scheme. However, the FO transform requires re-encrypting the plaintext during decapsulation, which significantly reduces the efficiency of KEMs and increases the cost of side-channel protection [33].

Recent protocols (e.g. TLS 1.3 and KEMTLS) are designed to achieve forward security. In such protocols, each pair of ephemeral public/private keys is discarded immediately after being used once, and a new key pair will be generated for new messages. This means that an adversary will be able to request a decryption only once for a given key pair. Informally, IND-1-CCA security states that an adversary needs to distinguish an honestly generated key from a randomly generated key with at most one decapsulation query. Thus, the IND-1-CCA security of KEMs is sufficient to replace the Diffie-Hellman (DH) key-exchange, ensuring the security of such protocols. In the security proof of TLS 1.3 handshake under the multi-stage model given in [20], the DH key-exchange could be replaced by an IND-1-CCA KEM and the proof would still hold. This idea inspired a series of work, see [28, 30, 43, 44] for details.

Huguenin-Dumittan and Vaudenay [28] showed that an IND-$q$-CCA-secure KEM can be obtained from any passively secure PKE (OW-CPA/IND-CPA) without re-encryption. Specifically, [28] presented two constructions named ${T_{CH}}$ and ${T_{H}}$ , together with their security proofs. Neither ${T_{CH}}$ nor ${T_{H}}$ requires re-encryption or de-randomization, and such IND-1-CCA KEMs could be used in the TLS 1.3 handshake, improving efficiency while ensuring security. However, ${T_{CH}}$ leads to ciphertext expansion, and the security of ${T_{H}}$ was proved only in the ROM; no QROM proof was provided. Building on [28], [30] provided both ROM and QROM proofs of ${T_{H}}$ and ${T_{RH}}$ (an implicit variant of ${T_{H}}$ ), with much tighter reductions than [28].

5.2 IND-1-CCA KEM Constructions

In this section, we review the definition of ${T_{CH}}$ and ${T_{H}}/{T_{RH}}$ , as well as their reduction tightness in ROM/QROM.
Constructions. The idea of ${T_{CH}}$ is to send an additional hash value along with the ciphertext, which is used for key confirmation in decapsulation (see Figure 4). ${T_{H}}$ works without the additional hash value; the key is derived directly as $H(m,c)$ (Figure 5). ${T_{RH}}$ is an implicit variant of ${T_{H}}$ : the only difference between ${T_{RH}}$ and ${T_{H}}$ is the return value for an invalid decryption. In fact, ${T_{H}}$ and ${T_{RH}}$ are equivalent to ${U^{\bot}}$ and ${U^{\not\bot}}$ in [27], respectively.
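The shape of the two transforms can be sketched generically. The code below is an illustration under loud assumptions, not the constructions as specified in [28, 30]: the toy PKE is a hypothetical stand-in with no security at all (its public and secret keys coincide), the exact hash inputs and domain-separation prefixes are assumptions, and since the toy decryption never fails, the difference between ${T_{H}}$ and ${T_{RH}}$ is not exercised.

```python
import hashlib
import hmac
import os

def H(m, c):
    """Key-derivation hash K = H(m, c) (prefix is an assumption)."""
    return hashlib.sha3_256(b"K" + m + c).digest()

def H_tag(m, c):
    """Second hash used for the T_CH confirmation tag (assumption)."""
    return hashlib.sha3_256(b"T" + m + c).digest()

# Toy stand-in for a CPA-secure PKE: insecure, for data flow only.
def toy_keygen():
    key = os.urandom(32)
    return key, key

def toy_enc(pk, m):
    pad = hashlib.shake_128(pk).digest(len(m))
    return bytes(a ^ b for a, b in zip(m, pad))

def toy_dec(sk, c):
    return toy_enc(sk, c)

def th_encaps(pk):
    """T_H: ciphertext is the plain PKE ciphertext, key is H(m, c)."""
    m = os.urandom(32)
    c = toy_enc(pk, m)
    return c, H(m, c)

def th_decaps(sk, c):
    """T_H: decrypt and hash, with no re-encryption check."""
    return H(toy_dec(sk, c), c)

def tch_encaps(pk):
    """T_CH: additionally send a hash tag, which expands the ciphertext."""
    m = os.urandom(32)
    c = toy_enc(pk, m)
    return (c, H_tag(m, c)), H(m, c)

def tch_decaps(sk, ct):
    """T_CH: recompute the tag for key confirmation; None models rejection."""
    c, tag = ct
    m = toy_dec(sk, c)
    if not hmac.compare_digest(H_tag(m, c), tag):
        return None
    return H(m, c)
```

Neither decapsulation routine calls the encryption function again, which is where the saving over the FO transform comes from; the price for ${T_{CH}}$ is the extra 32-byte tag in this sketch.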
Reduction Tightness. The security of ${T_{CH}}$ was proved in ROM and QROM by Huguenin-Dumittan and Vaudenay [28], with tightness ${\epsilon_{R}}\approx O(1/{q_{H}}){\epsilon_{\cal A}}$ and ${\epsilon_{R}}\approx O(1/{{q_{H}}^{3}})\epsilon_{\cal A}^{2}$ respectively, where ${\epsilon_{R}}$ (resp. ${\epsilon_{\cal A}}$ ) is the advantage of the reduction $R$ (resp. adversary ${\cal A}$ ) against the underlying PKE (resp. the IND-1-CCA KEM) and ${q_{H}}$ denotes the number of ${\cal A}$ ’s queries to $H$ . Later, they updated an ePrint version of [28] and gave a tighter QROM proof of ${T_{CH}}$ with bound ${\epsilon_{R}}\approx O(1/{{q_{H}}^{2}})\epsilon_{\cal A}^{2}-O({{q_{H}}^{3}}/{2^{n}})-O({q_{H}}/\sqrt{{2^{n}}})$ . The ROM and QROM proofs of ${T_{H}}$ and ${T_{RH}}$ were presented in [30], with tightness ${\epsilon_{R}}\approx O(1/{q_{H}}){\epsilon_{\cal A}}$ and ${\epsilon_{R}}\approx O(1/{{q_{H}}^{2}})\epsilon_{\cal A}^{2}$ respectively. The comparison between FO transform and ${T_{CH}}/{T_{H}}/{T_{RH}}$ is summarized in Table 1.


Table 1: Comparison between the FO transform and ${T_{CH}}/{T_{H}}/{T_{RH}}$ .
Construction	Security	$c$ -Expansion	Re-Enc	ROM tightness	QROM tightness
$FO$	IND-CPA $\Rightarrow$ IND-CCA	No	Yes	${\epsilon_{R}}\approx{\epsilon_{\cal A}}$	${\epsilon_{R}}\approx O(1/{q_{H}})\epsilon_{\cal A}^{2}$
${T_{CH}}$	OW-CPA $\Rightarrow$ IND-1-CCA	Yes	No	${\epsilon_{R}}\approx O(1/{q_{H}}){\epsilon_{\cal A}}$	${\epsilon_{R}}\approx O(1/{{q_{H}}^{2}})\epsilon_{\cal A}^{2}$
${T_{H}}/{T_{RH}}$	IND-CPA $\Rightarrow$ IND-1-CCA	No	No	${\epsilon_{R}}\approx O(1/{q_{H}}){\epsilon_{\cal A}}$	${\epsilon_{R}}\approx O(1/{{q_{H}}^{2}})\epsilon_{\cal A}^{2}$

6 Discussions and Results

6.1 Experimental Setup

In this section, we describe our experimental setup. We ran an emulated setup on a single machine. The benchmark platform is an Intel Xeon Platinum 8475B with 4 vCPUs and 8 GiB memory. (Footnote: our testing machine is provided by Alibaba Cloud and belongs to the compute-optimized instance family c8i. It is powered by an Intel Xeon Sapphire Rapids processor, with the specification ecs.c8i.xlarge, featuring 4 vCPUs and 8 GiB of memory.) The OpenSSL version we used is OpenSSL 3.3.0-dev. We used GCC 9.4.0 to compile all programs. We disabled AVX-512 instructions when benchmarking the ML-KEM AVX2 implementation for a fair comparison. We did not disable Turbo Boost or hyper-threading, in order to simulate real-world conditions during the TLS handshake.

6.2 Speed of ML-KEM AVX-512 Implementation

In this section, we conduct a performance evaluation of our ML-KEM AVX-512 implementations. Our approach is built upon the Crystals team’s open-source Kyber Standard Code [5], which was implemented for the FIPS 203 draft. To assess performance, we executed the speed testing program 100,000 times and recorded the median CPU cycle count. The results are summarized in Table 2, which compares various ML-KEM implementations across three security levels. Our implementation achieved a speedup ratio ranging from approximately 1.30 $\times$ to 1.64 $\times$ compared to the state-of-the-art AVX2 implementation. This performance enhancement can be attributed to the intricate design of our NTT AVX-512 implementation and the vectorization benefits offered by AVX-512 instructions. As illustrated in Table 2, adopting batch method in Section 4.1 can accelerate the key generation process by 3.5 $\times$ to 4.9 $\times$ .

Table 2: Median CPU cycle counts of ML-KEM implementations at three security levels. Speed-up rows give the ratio of the row above to this work.
	ML-KEM-512			ML-KEM-768			ML-KEM-1024			ISA/ISE
	Keygen	Encaps	Decaps	Keygen	Encaps	Decaps	Keygen	Encaps	Decaps
[5]	81,750	90,922	118,790	130,048	142,688	178,206	217,174	223,780	274,588	x86-64
Speed-up	6.21 $\times$	6.68 $\times$	7.80 $\times$	6.61 $\times$	7.61 $\times$	8.36 $\times$	8.61 $\times$	8.93 $\times$	9.52 $\times$
[5]	17,750	18,708	19,838	28,996	28,566	30,476	39,996	41,162	44,538	AVX2
Speed-up	1.35 $\times$	1.37 $\times$	1.30 $\times$	1.47 $\times$	1.52 $\times$	1.43 $\times$	1.59 $\times$	1.64 $\times$	1.54 $\times$
This work	13,170	13,620	15,230	19,684	18,760	21,308	25,216	25,062	28,834	AVX-512
Batch Keygen	2,804 (4.9 $\times$ )	13,620	15,230	4,866 (4.2 $\times$ )	18,760	21,308	7,394 (3.5 $\times$ )	25,062	28,834

6.3 ML-KEM AVX-512 TLS 1.3 Benchmark

We evaluated the performance of TLS 1.3 handshakes that use the ML-KEM AVX-512 implementation in a simulated network environment. We first used OpenSSL commands to generate key pairs and server certificates. With these keys and certificates, we launched a TLS server via the openssl s_server tool [3] and measured the number of handshakes per second with the openssl s_time tool [4]. Both the server and the client were built against OpenSSL 3.3.0-dev with the OQS provider, which supports both hybrid mode and PQ-only mode.
PQ-only ML-KEM PQ-TLS 1.3 Benchmark. For key exchange (KEX), we focus on ML-KEM. For authentication, however, several signature algorithms are available, each with different security levels, so it is worth examining how different signature algorithms combine with ML-KEM. For our experiments, we selected three signature algorithms: the post-quantum signature algorithms Dilithium and Falcon, both selected by NIST for standardization, and the traditional public-key scheme RSA2048.

KEM/SIG	ML-KEM-512		ML-KEM-768		ML-KEM-1024
KEM/SIG	Dilithium2	Falcon512	RSA2048	Dilithium3	Dilithium5
AVX2	3416.02	3191.46	2582.38	2710.85	3191.46
AVX-512	3589.22	3364.61	2650.45	2856.71	2451.08

As shown in Table 3, in the PQ-only setting the AVX-512 implementation of ML-KEM yields more handshakes per second than the AVX2 implementation. This indicates that optimizing ML-KEM with AVX-512 improves the PQ-TLS handshake. However, the improvement is small. Handshake time is determined not only by KEX time but also by factors such as transmission time and authentication time, so speeding up ML-KEM alone does not significantly reduce the overall handshake time. In addition, the number of handshakes per second decreases as the security level increases. This is in line with expectations, because higher security levels typically make both the signature scheme and the KEM slower. Overall, Dilithium as the authentication scheme achieves more handshakes per second than Falcon at the same security level. Compared to RSA2048, authenticating with the post-quantum signature schemes Dilithium and Falcon does not impose significant performance overhead. These observations show how different signature algorithms combined with ML-KEM affect TLS handshake performance and can serve as a reference when selecting a configuration.
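
The limited gain can be made concrete with Amdahl's law. The Python sketch below is a simplified model, assuming the handshake time splits into a KEM share that is accelerated and a remainder (signatures, transmission, protocol processing) that is not. The function names are illustrative, and the 1.35 $\times$ KEM speedup is taken as a representative ML-KEM-512 figure from Table 2.

```python
def overall_speedup(kem_fraction, kem_speedup):
    """Amdahl's law: only the KEM share of the handshake gets faster."""
    return 1.0 / ((1.0 - kem_fraction) + kem_fraction / kem_speedup)


def implied_kem_fraction(overall, kem_speedup):
    """Invert Amdahl's law: KEM share that would explain an observed gain."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kem_speedup)


if __name__ == "__main__":
    # Handshakes per second from Table 3 (ML-KEM-512 with Dilithium2).
    avx2_hs, avx512_hs = 3416.02, 3589.22
    observed = avx512_hs / avx2_hs
    # Assumed KEM-level speedup, representative of ML-KEM-512 in Table 2.
    share = implied_kem_fraction(observed, 1.35)
    print(f"observed handshake gain: {observed:.3f}x")
    print(f"implied KEM share of handshake time: {share:.1%}")
```

Under these assumptions, the roughly 5% gain observed for ML-KEM-512 with Dilithium2 corresponds to a KEM share of a little under one fifth of the handshake time, which is why even a 1.3 $\times$ to 1.6 $\times$ faster KEM moves the handshake rate only slightly.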
Hybrid ML-KEM PQ-TLS 1.3 Benchmark. The OQS provider supports hybrid key exchange for TLS 1.3 following the IETF draft on Post-Quantum Traditional Hybrid Schemes [29]. We therefore also evaluated hybrid schemes that combine ML-KEM with various ECDH curves. The data in Table 4 indicates a significant decrease in handshake performance in hybrid mode compared to PQ-only mode. Furthermore, the improvement in handshake performance from the AVX-512 implementation is minimal. A likely explanation is that in hybrid mode the key exchange time consists of two components, ECDH and ML-KEM. Improving only ML-KEM is therefore insufficient in hybrid mode; the ECDH scheme must also be made faster.

Hybrid KEM	ML-KEM-512		ML-KEM-768		ML-KEM-1024
Hybrid KEM	p256	x25519	p384	x448	p521	p384
AVX2	893.98	1965.58	375.19	948.04	200.45	369.84
AVX-512	944.23	2031.37	377.72	999.85	201.66	371.39
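
To quantify how small the hybrid-mode gain is, the relative improvement can be computed per group from Table 4. This is a minimal sketch for illustration; the group labels and the helper name are hypothetical, and the numbers are copied from the table.

```python
# Handshakes per second copied from Table 4: group -> (AVX2, AVX-512).
HYBRID = {
    "p256 + ML-KEM-512": (893.98, 944.23),
    "x25519 + ML-KEM-512": (1965.58, 2031.37),
    "p384 + ML-KEM-768": (375.19, 377.72),
    "x448 + ML-KEM-768": (948.04, 999.85),
    "p521 + ML-KEM-1024": (200.45, 201.66),
    "p384 + ML-KEM-1024": (369.84, 371.39),
}


def gain_percent(before, after):
    """Relative increase in handshakes per second, in percent."""
    return (after / before - 1.0) * 100.0


if __name__ == "__main__":
    for group, (avx2, avx512) in HYBRID.items():
        print(f"{group}: {gain_percent(avx2, avx512):+.2f}%")
```

Every gain stays below 6%, and for the groups using p384 or p521 it stays below 1%, which is consistent with the ECDH component dominating the key exchange time for those curves.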

6.4 TLS 1.3 Handshake with IND-1-CCA KEM

In this section, we implement IND-1-CCA KEMs based on the ${T_{CH}}$ and ${T_{RH}}$ constructions proposed in [28] and [30], integrate the better one into TLS 1.3, and analyze its performance.
Comparison of KEM Constructions. We constructed two IND-1-CCA KEMs on top of the underlying PKE of ML-KEM, using ${T_{CH}}$ and ${T_{RH}}$ respectively. Both constructions instantiate the hash function H with SHA3-256, consistent with the original $FO$ transform in ML-KEM. Additionally, we fixed the length of the $tag$ in ${T_{CH}}$ to 32 bytes. Table 5 summarizes the results, comparing the execution speed (in CPU cycles) and the communication overhead (ciphertext size in bytes) of the three constructions (ML-KEM's original $FO$ construction, ${T_{CH}}$ and ${T_{RH}}$ ). From the experimental results we draw the following conclusions. For key generation and encapsulation, ${T_{CH}}$ and ${T_{RH}}$ show a modest efficiency gain, mainly because they omit the hash of $pk$ in key generation and the de-randomization in encapsulation. For decapsulation, ${T_{CH}}$ and ${T_{RH}}$ are at least 3.04 $\times$ faster than the $FO$ implementation, which can be attributed to removing re-encryption. The improvement becomes more significant as the security level increases, because higher security levels make encryption slower. ${T_{CH}}$ requires an extra 32 bytes of ciphertext, and its decapsulation is slightly slower than that of ${T_{RH}}$ due to the key confirmation. Overall, KEMs based on ${T_{RH}}$ perform better in both efficiency and communication overhead.

	ML-KEM-512				ML-KEM-768				ML-KEM-1024
	Keygen	Encaps	Decaps	Ct	Keygen	Encaps	Decaps	Ct	Keygen	Encaps	Decaps	Ct
$FO$	17,750	18,708	19,838	768	28,996	28,566	30,476	1,088	39,996	41,162	44,538	1,568
${T_{CH}}$	12,932	18,434	6,520	800	21,942	28,316	8,974	1,120	30,292	42,450	12,172	1,600
${T_{RH}}$	12,896	17,676	5,666	768	21,942	27,580	8,440	1,088	30,336	40,096	11,240	1,568

Note: Ct represents Ciphertext.
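
The source of the decapsulation gain can be seen in a toy model. The Python sketch below contrasts an $FO$-style decapsulation, which re-encrypts and compares, with a ${T_{RH}}$-style decapsulation, which only decrypts and hashes. It rests on several assumptions: a toy hashed-ElGamal PKE stands in for the underlying PKE of ML-KEM and is not secure, the key derivation $K = H(m, c)$ and the domain tags are simplifications, and all function names are hypothetical. It does not reproduce the exact constructions of [28] and [30].

```python
import hashlib
import secrets

P = 2**127 - 1          # Mersenne prime; toy group, far too small to be secure
G = 3
COUNTS = {"enc": 0}     # number of PKE encryptions performed so far


def H(*parts: bytes) -> bytes:
    """SHA3-256 over length-prefixed parts; the first part is a domain tag."""
    h = hashlib.sha3_256()
    for part in parts:
        h.update(len(part).to_bytes(4, "big") + part)
    return h.digest()


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def pke_keygen():
    sk = secrets.randbelow(P - 2) + 1
    return pow(G, sk, P).to_bytes(16, "big"), sk


def pke_enc(pk: bytes, m: bytes, coins: bytes) -> bytes:
    """Encryption that is deterministic once the coins are fixed."""
    COUNTS["enc"] += 1
    r = int.from_bytes(coins, "big") % (P - 1)
    shared = pow(int.from_bytes(pk, "big"), r, P).to_bytes(16, "big")
    return pow(G, r, P).to_bytes(16, "big") + xor(m, H(b"pad", shared))


def pke_dec(sk: int, c: bytes) -> bytes:
    shared = pow(int.from_bytes(c[:16], "big"), sk, P).to_bytes(16, "big")
    return xor(c[16:], H(b"pad", shared))


def fo_encaps(pk: bytes):
    m = secrets.token_bytes(32)
    coins = H(b"coins", m, H(b"pk", pk))         # de-randomization, uses H(pk)
    c = pke_enc(pk, m, coins)
    return c, H(b"key", m, c)


def fo_decaps(pk: bytes, sk: int, z: bytes, c: bytes) -> bytes:
    m = pke_dec(sk, c)
    c_check = pke_enc(pk, m, H(b"coins", m, H(b"pk", pk)))  # re-encryption
    if not secrets.compare_digest(c_check, c):
        return H(b"reject", z, c)                # implicit rejection
    return H(b"key", m, c)


def trh_encaps(pk: bytes):
    m = secrets.token_bytes(32)
    c = pke_enc(pk, m, secrets.token_bytes(32))  # fresh coins, no H(pk)
    return c, H(b"key", m, c)


def trh_decaps(sk: int, c: bytes) -> bytes:
    return H(b"key", pke_dec(sk, c), c)          # one decryption, one hash


if __name__ == "__main__":
    pk, sk = pke_keygen()
    z = secrets.token_bytes(32)
    for name, encaps, decaps in (
        ("FO-style", fo_encaps, lambda c: fo_decaps(pk, sk, z, c)),
        ("T_RH-style", trh_encaps, lambda c: trh_decaps(sk, c)),
    ):
        c, key = encaps(pk)
        before = COUNTS["enc"]
        assert decaps(c) == key
        print(name, "PKE encryptions in decaps:", COUNTS["enc"] - before)
```

In this model the $FO$-style decapsulation performs one PKE encryption while the ${T_{RH}}$-style decapsulation performs none. Since encryption cost grows with the security level, this matches the trend in Table 5.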

IND-1-CCA KEM TLS 1.3 Benchmark. We integrated the ${T_{RH}}$ -based IND-1-CCA KEM into TLS 1.3 in both PQ-only mode and hybrid mode. We measured the number of handshakes per second and compared the results with the original $FO$ implementation, as shown in Table 6.

SIG	KEM	$FO$	${T_{RH}}$
	ML-KEM-512	3416.02	3476.17
Dilithium2	p256_ML-KEM-512	924.61	978.12
	x25519_ML-KEM-512	2014.98	2038.19
	ML-KEM-768	2710.85	2803.24
Dilithium3	x25519_ML-KEM-768	1942.31	2036.58
	p256_ML-KEM-768	902.83	931.99

The results show that ${T_{RH}}$ -based IND-1-CCA KEMs increase the number of TLS 1.3 handshakes per second without increasing the communication overhead. In PQ-only mode in particular, the removal of re-encryption brings a larger improvement at the higher security level, which is in line with our expectations.

Our experiments confirm, at a practical level, the advantages of applying IND-1-CCA KEMs in TLS 1.3. Specifically, in application scenarios that emphasize handshake efficiency, IND-1-CCA KEMs are better suited than $FO$ -based IND-CCA KEMs. These results motivate further research on IND-1-CCA constructions (e.g., tighter reductions) and their TLS 1.3 implementation.

7 Conclusion

In this paper, we present our implementation of ML-KEM using AVX-512 and introduce a novel batch method for ML-KEM key generation. With the support of the OQS provider, we integrate our optimized ML-KEM AVX-512 implementation into TLS 1.3, enhancing its resistance to quantum threats. We evaluate the performance of our implementation in both PQ-only and hybrid modes. Furthermore, we revisit two IND-1-CCA KEMs and analyze their impact on PQ-TLS handshake performance. Our AVX-512 implementation achieves a speedup of up to 1.64 $\times$ for ML-KEM over the AVX2 implementation, while our batch method speeds up key generation by 3.5 $\times$ to 4.9 $\times$. Our measurements also show improvements in TLS 1.3 handshake performance with our AVX-512 implementation. Through our optimized implementation, integration, and assessment, we provide insights for future work aimed at enhancing PQ-TLS handshake performance.

References

[1]Hybrid key exchange in TLS 1.3, https://www.ietf.org/archive/id/draft-ietf-tls-hybrid-design-04.html
[2]Intel xeon phi processor 7250 specifications, https://www.intel.com/content/www/us/en/products/sku/94035/intel-xeon-phi-processor-7250-16gb-1-40-ghz-68-core/specifications.html
[3]OpenSSL: s_server - TLS/SSL server program. OpenSSL Documentation (2022), https://www.openssl.org/docs/man3.3/man1/s_server.html
[4]OpenSSL: s_time - SSL/TLS performance timing program. OpenSSL Documentation (2022), https://www.openssl.org/docs/man3.3/man1/s_time.html
[5]Kyber standard code (2024), https://github.com/pq-crystals/kyber/tree/standard, accessed on: 2024-03-27
[6]Abiega-L’Eglisse, A.F.D., Delgado-Vargas, K.A., Valencia-Rodriguez, F.Q., Quiroga, V.G., Gallegos-García, G., Nakano-Miyatake, M.: Performance of new hope and crystals-dilithium postquantum schemes in the transport layer security protocol. IEEE Access 8, 213968–213980 (2020). https://doi.org/10.1109/ACCESS.2020.3040324
[7]Avanzi, R., Bos, J., Ducas, L., Kiltz, E., Lepoint, T., Lyubashevsky, V., Schanck, J.M., Schwabe, P., Seiler, G., Stehlé, D.: Crystals-kyber algorithm specifications and supporting documentation. NIST PQC Round 2(4), 1–43 (2019)
[8]Baentsch, M., Paquin, C., Levitte, R., Hess, B., Segeth, J.: oqsprovider - open quantum safe provider for openssl. Website (2023), https://github.com/open-quantum-safe/oqs-provider
[9]Barrett, P.: Implementing the rivest shamir and adleman public key encryption algorithm on a standard digital signal processor. In: Conference on the Theory and Application of Cryptographic Techniques. pp. 311–323. Springer (1986)
[10]Bernstein, D.J., Brumley, B.B., Chen, M., Tuveri, N.: Opensslntru: Faster post-quantum TLS key exchange. In: Butler, K.R.B., Thomas, K. (eds.) 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022. pp. 845–862. USENIX Association (2022), https://www.usenix.org/conference/usenixsecurity22/presentation/bernstein
[11]Bertoni, G., Daemen, J., Peeters, M., Van Assche, G.: Sponge functions. In: ECRYPT hash workshop. vol. 2007 (2007)
[12]Bos, J.W., Costello, C., Naehrig, M., Stebila, D.: Post-quantum key exchange for the tls protocol from the ring learning with errors problem. In: 2015 IEEE Symposium on Security and Privacy. pp. 553–570. IEEE (2015)
[13]Bozhko, J., Hanna, Y., Harrilal-Parchment, R., Tonyali, S., Akkaya, K.: Performance evaluation of quantum-resistant tls for consumer iot devices. In: 2023 IEEE 20th Consumer Communications & Networking Conference (CCNC). pp. 230–235. IEEE (2023)
[14]Chang, Y.A., Chen, M.S., Wu, J.S., Yang, B.Y.: Postquantum ssl/tls for embedded systems. In: 2014 IEEE 7th International Conference on Service-Oriented Computing and Applications. pp. 266–270. IEEE (2014)
[15]Cheng, H., Fotiadis, G., Groszschädl, J., Ryan, P.Y.: Highly vectorized sike for avx-512. IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES) 2022(2) (2022)
[16]Cheng, H., Fotiadis, G., Groszschädl, J., Ryan, P.Y., Roenne, P.: Batching csidh group actions using avx-512. IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES) 2021(4), 618–649 (2021)
[17]Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex fourier series. Mathematics of computation 19(90), 297–301 (1965)
[18]Crockett, E., Paquin, C., Stebila, D.: Prototyping post-quantum and hybrid key exchange and authentication in tls and ssh. In: NIST 2nd PQC Standardization Conference. Santa Barbara, California (August 2019), published elsewhere
[19]Stebila, D., Mosca, M.: liboqs project (2024), https://github.com/open-quantum-safe/liboqs
[20]Dowling, B., Fischlin, M., Günther, F., Stebila, D.: A cryptographic analysis of the tls 1.3 handshake protocol. Journal of Cryptology 34(4), 37 (2021)
[21]Fan, J., Willems, F., Zahed, J., Gray, J., Mister, S., Ounsworth, M., Adams, C.: Impact of post-quantum hybrid certificates on pki, common libraries, and protocols. International Journal of Security and Networks 16(3), 200–211 (2021)
[22]National Institute of Standards and Technology: Secure hash algorithm-3 (SHA-3) standard: Permutation-based hash and extendable-output functions. Federal Information Processing Standards Publication (FIPS) 202 (2014)
[23]Garcia, C.R., Aguilera, A.C., Olmos, J.J.V., Monroy, I.T., Rommel, S.: Quantum-resistant tls 1.3: A hybrid solution combining classical, quantum and post-quantum cryptography. In: 2023 IEEE 28th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD). pp. 246–251. IEEE (2023)
[24]Gentleman, W.M., Sande, G.: Fast fourier transforms: for fun and profit. In: Proceedings of the November 7-10, 1966, fall joint computer conference. pp. 563–578 (1966)
[25]Gonzalez, R., Wiggers, T.: Kemtls vs. post-quantum tls: Performance on embedded systems. In: International Conference on Security, Privacy, and Applied Cryptography Engineering. pp. 99–117. Springer (2022)
[26]Henrich, J., Heinemann, A., Wiesmaier, A., Schmitt, N.: Performance impact of pqc kems on tls 1.3 under varying network characteristics. In: International Conference on Information Security. pp. 267–287. Springer (2023)
[27]Hofheinz, D., Hövelmanns, K., Kiltz, E.: A modular analysis of the fujisaki-okamoto transformation. In: Theory of Cryptography Conference. pp. 341–371. Springer (2017)
[28]Huguenin-Dumittan, L., Vaudenay, S.: On ind-qcca security in the ROM and its applications - CPA security is sufficient for TLS 1.3. In: Dunkelman, O., Dziembowski, S. (eds.) Advances in Cryptology - EUROCRYPT 2022 - 41st Annual International Conference on the Theory and Applications of Cryptographic Techniques, Trondheim, Norway, May 30 - June 3, 2022, Proceedings, Part III. Lecture Notes in Computer Science, vol. 13277, pp. 613–642. Springer (2022). https://doi.org/10.1007/978-3-031-07082-2_22
[29]Internet Engineering Task Force: Hybrid Terminology for Post-Quantum Key Establishment. Tech. rep., Internet Engineering Task Force (2023)
[30]Jiang, H., Ma, Z., Zhang, Z.: Post-quantum security of key encapsulation mechanism against cca attacks with a single decapsulation query. In: International Conference on the Theory and Application of Cryptology and Information Security. pp. 434–468. Springer (2023)
[31]Kwiatkowski, K., Valenta, L.: The tls post-quantum experiment (2019), https://blog.cloudflare.com/the-tls-post-quantum-experiment
[32]Lei, D., He, D., Peng, C., Luo, M., Liu, Z., Huang, X.: Faster implementation of ideal lattice-based cryptography using avx512. ACM Transactions on Embedded Computing Systems 22(5), 1–18 (2023)
[33]Azouaoui, M., Bos, J.W.: Surviving the fo-calypse: Securing pqc implementations in practice. (2022), https://iacr.org/submit/files/slides/2022/rwc/rwc2022/48/slides.pdf
[34]Montgomery, P.L.: Modular multiplication without trial division. Mathematics of computation 44(170), 519–521 (1985)
[35]National Institute of Standards and Technology: Post-quantum cryptography standardization: Selected algorithms (2022), https://csrc.nist.gov/Projects/post-quantum-cryptography/selected-algorithms-2022
[36]Ashley, N., Baentsch, M.: Open quantum safe: software for the transition to quantum-resistant cryptography (2024), https://openquantumsafe.org/
[37]Pablos, J.I.E., Marriaga, M.E., del Pozo, Á.L.P.: Design and implementation of a post-quantum group authenticated key exchange protocol with the liboqs library: A comparative performance analysis from classic mceliece, dowling, ntru, and saber. IEEE Access 10, 120951–120983 (2022)
[38]Paquin, C., Stebila, D., Tamvada, G.: Benchmarking post-quantum cryptography in tls. In: Post-Quantum Cryptography: 11th International Conference, PQCrypto 2020, Paris, France, April 15–17, 2020, Proceedings 11. pp. 72–91. Springer (2020)
[39]Paul, S., Kuzovkova, Y., Lahr, N., Niederhagen, R.: Mixed certificate chains for the transition to post-quantum authentication in tls 1.3. In: Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security. pp. 727–740 (2022)
[40]Paul, S., Schick, F., Seedorf, J.: Tpm-based post-quantum cryptography: A case study on quantum-resistant and mutually authenticated tls for iot environments. In: Proceedings of the 16th International Conference on Availability, Reliability and Security. pp. 1–10 (2021)
[41]Rescorla, E.: The transport layer security TLS protocol version 1.3. Tech. Rep. RFC 8446, RFC Editor (Aug 2018), https://doi.org/10.17487/RFC8446
[42]Roy, S.S.: Saberx4: High-throughput software implementation of saber key encapsulation mechanism. In: 2019 IEEE 37th International Conference on Computer Design (ICCD). pp. 321–324. IEEE (2019)
[43]Schwabe, P., Stebila, D., Wiggers, T.: Post-quantum tls without handshake signatures. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. pp. 1461–1480 (2020)
[44]Schwabe, P., Stebila, D., Wiggers, T.: More efficient post-quantum kemtls with pre-distributed public keys. In: Computer Security–ESORICS 2021: 26th European Symposium on Research in Computer Security, Darmstadt, Germany, October 4–8, 2021, Proceedings, Part I 26. pp. 3–22. Springer (2021)
[45]Seiler, G.: Faster AVX2 optimized NTT multiplication for ring-lwe lattice cryptography. IACR Cryptol. ePrint Arch. p. 39 (2018), http://eprint.iacr.org/2018/039
[46]Shor, P.W.: Algorithms for quantum computation: discrete logarithms and factoring. In: Proceedings 35th annual symposium on foundations of computer science. pp. 124–134. IEEE (1994)
[47]Sikeridis, D., Kampanakis, P., Devetsikiotis, M.: Assessing the overhead of post-quantum cryptography in TLS 1.3 and SSH. In: Han, D., Feldmann, A. (eds.) CoNEXT ’20: The 16th International Conference on emerging Networking EXperiments and Technologies, Barcelona, Spain, December, 2020. pp. 149–156. ACM (2020). https://doi.org/10.1145/3386367.3431305
[48]Sikeridis, D., Kampanakis, P., Devetsikiotis, M.: Post-quantum authentication in TLS 1.3: A performance study. In: 27th Annual Network and Distributed System Security Symposium, NDSS 2020, San Diego, California, USA, February 23-26, 2020. The Internet Society (2020), https://www.ndss-symposium.org/ndss-paper/post-quantum-authentication-in-tls-1-3-a-performance-study/
[49]Sosnowski, M., Wiedner, F., Hauser, E., Steger, L., Schoinianakis, D., Gallenmüller, S., Carle, G.: The performance of post-quantum tls 1.3. In: Companion of the 19th International Conference on emerging Networking EXperiments and Technologies. pp. 19–27 (2023)
[50]National Institute of Standards and Technology: Module-lattice-based digital signature standard. Federal Information Processing Standards Publication (FIPS) NIST FIPS 204 ipd, Department of Commerce, Washington, D.C. (2023)
[51]National Institute of Standards and Technology: Module-lattice-based key-encapsulation mechanism standard. Federal Information Processing Standards Publication (FIPS) NIST FIPS 203 ipd, Department of Commerce, Washington, D.C. (2023)
[52]National Institute of Standards and Technology: Stateless hash-based digital signature standard. Federal Information Processing Standards Publication (FIPS) NIST FIPS 205 ipd, Department of Commerce, Washington, D.C. (2023)
[53]Tasopoulos, G., Li, J., Fournaris, A.P., Zhao, R.K., Sakzad, A., Steinfeld, R.: Performance evaluation of post-quantum tls 1.3 on resource-constrained embedded systems. In: International Conference on Information Security Practice and Experience. pp. 432–451. Springer (2022)
[54]Westerbaan, B.: When to barrett reduce in the inverse ntt. Cryptology ePrint Archive (2020)
[55]Zhang, J., Huang, J., et al.: ENG25519: Faster TLS 1.3 handshake using optimized X25519 and Ed25519. In: Usenix Security (2024)
[56]Zheng, J., Zhu, H., Song, Z., Wang, Z., Zhao, Y.: Optimized vectorization implementation of crystals-dilithium. arXiv preprint arXiv:2306.01989 (2023)