Faster Post-Quantum TLS 1.3 Based on ML-KEM: Implementation and Assessment (2024)

School of Computer Science and Technology, Fudan University, Shanghai, China
{jyzheng23,hlzhu22,yfdong22,songzy23,zhenhaozhang23}@m.fudan.edu.cn, {18110240046,ylzhao}@fudan.edu.cn

Jieyu Zheng, Haoliang Zhu, Yifan Dong, Zhenyu Song, Zhenhao Zhang, Yafang Yang, Yunlei Zhao (✉ ylzhao@fudan.edu.cn)

Abstract

TLS is extensively utilized for secure data transmission over networks. However, with the advent of quantum computers, the security of TLS based on traditional public-key cryptography is under threat. To counter quantum threats, it is imperative to integrate post-quantum algorithms into TLS. Most PQ-TLS research focuses on integration and evaluation, but few studies address the improvement of PQ-TLS performance by optimizing PQC implementation.
For the TLS protocol, handshake performance is crucial, and for post-quantum TLS (PQ-TLS) the performance of post-quantum key encapsulation mechanisms (KEMs) directly impacts handshake performance. In this work, we explore the impact of post-quantum KEMs on PQ-TLS performance. We explore how to improve ML-KEM performance using Intel's latest Advanced Vector Extensions instruction set, AVX-512. We detail a spectrum of techniques devised to parallelize polynomial multiplication, modular reduction, and other computationally intensive modules within ML-KEM. Our optimized ML-KEM implementation achieves up to a 1.64× speedup compared to the latest AVX2 implementation. Furthermore, we introduce a novel batch key generation method for ML-KEM that can seamlessly integrate into the TLS protocols. The batch method accelerates the key generation procedure by 3.5× to 4.9×. We integrate the optimized AVX-512 implementation of ML-KEM into TLS 1.3, and assess handshake performance under both PQ-only and hybrid modes. The assessment demonstrates that our faster ML-KEM implementation results in a higher number of TLS 1.3 handshakes per second under both modes. Additionally, we revisit two IND-1-CCA KEM constructions discussed in Eurocrypt'22 and Asiacrypt'23. We implement them based on ML-KEM and integrate the better-performing one into TLS 1.3, with benchmarks.

Keywords:

Post-Quantum Cryptography · TLS 1.3 · ML-KEM · AVX-512

1 Introduction

Digital communications are ubiquitous worldwide, with most Internet connections relying on Transport Layer Security (TLS) to secure data transmission. However, the current TLS protocol remains vulnerable to quantum attacks. TLS employs public-key cryptography algorithms, including Elliptic Curve Diffie-Hellman (ECDH), Elliptic Curve Digital Signature Algorithm (ECDSA), and RSA, all of which are susceptible to quantum computing threats, as demonstrated by Shor's algorithm [46]. To address this vulnerability, the National Institute of Standards and Technology (NIST) launched the Post-Quantum Cryptography (PQC) competition in 2016. After three rounds, NIST announced the selection of the first algorithms to be standardized, including Kyber, Dilithium, Falcon, and SPHINCS+ [35]. In August 2023, NIST released three draft standards: ML-KEM [51], ML-DSA [50], and SLH-DSA [52], renamed from Kyber, Dilithium, and SPHINCS+ respectively.

Amidst the continuous progress in NIST’s PQC algorithm standardization, significant endeavors have been devoted to the development of PQ-TLS over the past decade. Beginning in 2014, Chang et al. [14] introduced a post-quantum SSL/TLS library for embedded systems. Subsequently, Bos et al. [12] developed PQC ciphersuites for TLS based on the ring learning with errors (R-LWE) problem. In July 2016, Chrome introduced the newhope1024 post-quantum option, signaling an initial step towards integrating post-quantum cryptography into mainstream browsers. However, due to patent issues, Chrome removed the newhope1024 option in November 2016. Subsequent efforts by industry giants such as Google and Cloudflare focused on evaluating the performance of post-quantum cryptographic candidates within TLS. In 2018, Google and Cloudflare conducted experiments with NIST PQC candidate HRSS and X25519 in TLS 1.3 Chrome, while in 2019 they utilized ntruhrss701 and sntrup761 in their TLS experiments [31].

Recent research on PQ-TLS has predominantly concentrated on three pivotal domains:

  • Integration of PQC into the TLS Protocol: Exploring methodologies to seamlessly incorporate post-quantum cryptographic mechanisms into the TLS framework [18, 44, 43, 40, 39, 23, 37].

  • Performance Evaluation and Communication Overheads: Assessing the computational efficiency and communication overheads incurred by post-quantum cryptographic primitives within the TLS ecosystem [26, 49, 13, 53, 25, 21, 48, 38, 47, 6].

  • Optimized Implementations for Enhanced PQ-TLS Efficiency: Enhancing the computational efficiency of post-quantum cryptography within TLS through optimized implementations [10].

Most research on PQ-TLS primarily focuses on exploring how to integrate various PQC cryptographic primitives into TLS and evaluating their performance. However, there is a noticeable scarcity of work improving PQ-TLS performance. Performance stands as a critical factor in TLS applications. As an integral component of PQ-TLS, PQC algorithms directly impact the handshake time of PQ-TLS. Therefore, optimizing the implementation of PQC algorithms and integrating them into TLS may contribute to reducing the handshake time of PQ-TLS.

Motivations.

Recent work presented an accelerated Ed25519 and X25519 AVX-512 engine tailored for TLS 1.3, offering significant performance improvement [55]. Previous work explored optimized PQC implementations using the latest Intel Single Instruction Multiple Data (SIMD) instruction AVX-512 (e.g. [15, 32, 16]). However, the integration of PQC AVX-512 optimized implementations into TLS 1.3 remains unrealized, and there is currently no AVX-512 implementation available for ML-KEM. This gap prompts our focus on optimizing ML-KEM using AVX-512 and seamlessly integrating the optimized ML-KEM implementation into TLS 1.3, and also provides an opportunity to explore the impact of AVX-512 instructions on PQ-TLS handshake protocols.

As of now, existing research on PQ-TLS migration relies on OQS-OpenSSL, which lacks support for OpenSSL3 and remains outdated. However, the latest OQS provider [8] not only supports OpenSSL3 but also facilitates a clear separation between OpenSSL code and PQC KEM code. Our comprehensive integration of ML-KEM using the OQS provider could provide valuable guidance for researchers seeking to migrate to PQ-TLS.

In addition, recent studies [28, 30, 43, 44] have shown that IND-1-CCA KEMs suffice for the TLS 1.3 handshake to be secure. Intriguingly, IND-1-CCA KEMs can be obtained from any OW-CPA/IND-CPA KEM without the re-encryption and de-randomization [28, 30] required by the IND-CCA KEMs previously used in PQ-TLS. It follows naturally that the TLS 1.3 handshake might become more efficient with such IND-1-CCA KEMs. However, experiments on TLS 1.3 with IND-1-CCA-secure KEMs are notably absent. This gap motivates us to conduct experiments and evaluations on IND-1-CCA-secure PQ-TLS handshake protocols.

Contributions.

In this work, we aim to bridge the gap between PQC engineering implementations and TLS protocol applications. We approach this task from both an optimization engineering perspective and a TLS system perspective. We will later open source our code.

  • We present the first optimized implementation of ML-KEM using AVX-512. As the main bottleneck in ML-KEM lies in polynomial multiplication and hash functions, we achieve 32-way parallel polynomial multiplication and 8-way parallel hashing. Besides, we enhance rejection sampling and centered binomial distribution sampling of polynomials through new AVX-512 features such as mask registers and compress-store instructions. Our implementation passes NIST's KAT tests and achieves a 1.64× speedup compared to the state-of-the-art AVX2 implementation of ML-KEM.

  • We propose a batch key generation method for ML-KEM to batch 8 independent key pairs. Our batch key generation method achieves a speedup of 3.5× to 4.9× compared to key generation without batching. This batch generation approach can also be applied to other key generation processes involving hash function calls.

  • We revisit two IND-1-CCA KEM constructions discussed in Eurocrypt'22 [28] and Asiacrypt'23 [30], and implement them with the underlying CPA-secure PKE of ML-KEM. We then evaluate the performance of the IND-1-CCA KEMs, and integrate the better one into TLS 1.3. The benchmark results indicate that IND-1-CCA KEMs improve the performance of the TLS 1.3 handshake compared to IND-CCA KEMs.

  • We integrate the AVX-512 optimized implementation of ML-KEM into TLS 1.3, assess its impact on TLS 1.3 handshake time, and evaluate the influence of different KEM constructions on TLS handshake efficiency. Our evaluation reveals that an efficient implementation of ML-KEM utilizing AVX-512 can yield a higher number of handshakes per second compared to the latest AVX2 implementation.

2 Preliminaries

2.1 Notation

The notation in this paper is the same as in the FIPS 203 draft [51]. We denote $\mathcal{R}_q$ as the cyclic polynomial ring $\mathbb{Z}_q[x]/(x^n+1)$. We define $r' = r \bmod^{\pm} \alpha$ (resp. $r' = r \bmod \alpha$) to be the unique element $r'$ in the range $-\lfloor \frac{\alpha}{2} \rfloor < r' \leq \lfloor \frac{\alpha}{2} \rfloor$ (resp. $0 \leq r' < \alpha$) such that $\alpha \mid (r - r')$. By default, regular font letters denote elements in $\mathcal{R}_q$, bold lower-case letters are column vectors, and bold upper-case letters are matrices.

2.2 ML-KEM

ML-KEM is a NIST-standardized lattice-based KEM. Its security is based on the Module Learning With Errors (M-LWE) problem. ML-KEM is derived from the Round 3 version of Kyber [7]. Polynomial multiplication over $\mathbb{Z}_{3329}[x]/(x^{256}+1)$ is a fundamental operation in ML-KEM. Utilizing the property $n \mid (q-1)$, ML-KEM employs an incomplete Number Theoretic Transform (NTT) to accelerate this operation. ML-KEM uses four SHA-3 hash functions: SHA3-256, SHA3-512, SHAKE128, and SHAKE256. For more details, readers can refer to the FIPS 203 draft [51].

Number Theoretic Transform. The NTT is a variant of the Fast Fourier Transform (FFT) over finite fields. Its essence is to use the point-value representation of polynomials to perform efficient polynomial multiplication. We denote the forward transform as NTT and the inverse transform as INTT. The symbol "$\cdot$" denotes point-wise multiplication. Polynomial multiplication $h(x) = f(x) \times g(x) \in \mathbb{Z}_q[x]/(x^n+1)$ can be computed as follows:

$$h(x) = f(x) \times g(x) = \operatorname{INTT}(\operatorname{NTT}(f) \cdot \operatorname{NTT}(g)).$$

For the cyclic polynomial ring $\mathbb{Z}_q[x]/(x^n+1)$, a complete NTT requires $2n \mid (q-1)$, which the ML-KEM parameters do not satisfy. Instead, ML-KEM uses a variant of the NTT that omits the last layer. We call this kind of NTT, which follows the "bottom cropping" method, T-NTT (truncated NTT) for short, and let $\beta$ be the number of truncated layers. ML-KEM uses T-NTT with $\beta = 1$. Given $\mathbb{Z}_q[x]/(x^n+1)$, for any integer $\beta \geq 0$ and a prime $q$ satisfying $\frac{n}{2^{\beta-1}} \mid (q-1)$, T-NTT$(f)$ has the general form:

$$\mathbb{Z}_q[x]/(x^n+1) \cong \prod_{i=0}^{\frac{n}{2^{\beta}}-1} \mathbb{Z}_q[x]/\left(x^{2^{\beta}} - \omega_{\frac{2n}{2^{\beta}}}^{2 \cdot br_{n/2^{\beta}}(i)+1}\right)$$

where $\omega_{2n/2^{\beta}}$ denotes the $\frac{2n}{2^{\beta}}$-th root of unity, and $br_{n/2^{\beta}}(i)$ denotes the bit-reversal permutation of $\{0, 1, \ldots, \frac{n}{2^{\beta}}-1\}$.

Through the T-NTT transformation, the $n$-dimensional polynomials $f, g$ are each decomposed into $\frac{n}{2}$ linear polynomials. The original multiplication $f \times g$ is then transformed into $\frac{n}{2}$ multiplications of linear polynomials.
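The decomposition can be checked numerically. The following is a minimal, unoptimized Python sketch of the truncated NTT for the ML-KEM parameters ($q = 3329$, $n = 256$, $\beta = 1$), assuming the root of unity $\zeta = 17$ and the index conventions of the FIPS 203 draft. The function names are illustrative and are not taken from the implementation described in this paper.

```python
Q, N, ZETA = 3329, 256, 17  # modulus, degree, 256th root of unity (FIPS 203 draft)

def bitrev7(i):
    """Reverse the 7-bit binary representation of i."""
    return int(f"{i:07b}"[::-1], 2)

def ntt(f):
    """Forward T-NTT: 7 Cooley-Tukey layers, stopping at degree-1 residues."""
    f, k, length = list(f), 1, 128
    while length >= 2:
        for start in range(0, N, 2 * length):
            z = pow(ZETA, bitrev7(k), Q)
            k += 1
            for j in range(start, start + length):
                t = z * f[j + length] % Q
                f[j + length] = (f[j] - t) % Q
                f[j] = (f[j] + t) % Q
        length //= 2
    return f

def intt(f):
    """Inverse T-NTT: 7 Gentleman-Sande layers, then scaling by 128^-1."""
    f, k, length = list(f), 127, 2
    while length <= 128:
        for start in range(0, N, 2 * length):
            z = pow(ZETA, bitrev7(k), Q)
            k -= 1
            for j in range(start, start + length):
                t = f[j]
                f[j] = (t + f[j + length]) % Q
                f[j + length] = z * (f[j + length] - t) % Q
        length *= 2
    n_inv = pow(128, -1, Q)
    return [c * n_inv % Q for c in f]

def pointwise(a, b):
    """n/2 products of linear polynomials modulo x^2 - zeta^(2*bitrev7(i)+1)."""
    c = [0] * N
    for i in range(N // 2):
        g = pow(ZETA, 2 * bitrev7(i) + 1, Q)
        a0, a1, b0, b1 = a[2 * i], a[2 * i + 1], b[2 * i], b[2 * i + 1]
        c[2 * i] = (a0 * b0 + a1 * b1 * g) % Q
        c[2 * i + 1] = (a0 * b1 + a1 * b0) % Q
    return c

def schoolbook(f, g):
    """Reference multiplication in Z_q[x]/(x^n + 1), used only for checking."""
    h = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                h[i + j] = (h[i + j] + f[i] * g[j]) % Q
            else:
                h[i + j - N] = (h[i + j - N] - f[i] * g[j]) % Q
    return h
```

Computing `intt(pointwise(ntt(f), ntt(g)))` should agree with `schoolbook(f, g)` for any two coefficient lists of length 256.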

[Figure 1]

[Figure 2]

The Cooley-Tukey transform [17] and the Gentleman-Sande transform [24] are used in the NTT and INTT respectively, as shown in Figure 1 and Figure 2.

2.3 AVX-512 Instruction Set

The AVX-512 instruction set was introduced by Intel in 2013 and first supported in the 2016 Xeon Phi series processors [2]. It has since become a pivotal feature for both major CPU manufacturers. AVX-512 offers several key functionalities: it provides 32 512-bit zmm registers, enabling simultaneous processing of multiple data elements and accelerating vectorized computations. AVX-512 covers a wide range of floating-point and integer operations for diverse computational requirements. Additionally, AVX-512 provides 8 mask registers (k0-k7) for conditional operations, so that an instruction can act on only a selected subset of lanes. These mask registers enhance the flexibility and efficiency of data processing by enabling advanced operations such as compression/expansion and masking.

2.4 TLS 1.3 and PQ-TLS

TLS is a standard developed by the Internet Engineering Task Force (IETF) in 1999. Its primary role is to encrypt communication between web applications and servers, with the latest version being TLS 1.3 [41]. When initiating a TLS connection, the client and server exchange parameters such as TLS version and ciphersuite. Our study focuses on enhancing TLS security against post-quantum threats, particularly through the adoption of PQC Key Exchange (KEX) in PQ-TLS. PQ-TLS operates in two modes: hybrid mode and PQ-only mode. The hybrid mode, as standardized by the IETF in TLS 1.3 [1], supports simultaneous usage of ECDH and PQC KEM, albeit at the cost of increased data transmission size and computational resources. Conversely, the PQ-only mode exclusively employs PQC KEM for Key Exchange and PQC signature algorithms for authentication.

3 ML-KEM AVX-512 Implementation

The most time-consuming operations in ML-KEM include modular reduction, polynomial sampling, polynomial multiplication, and hash functions. In this section, we will outline our rationale behind the AVX-512 implementation design for these computationally intensive components.

3.1 Modular Reduction Implementation

Modular reduction plays a crucial role in ML-KEM polynomial arithmetic, which operates over the ring $\mathbb{Z}_q$ with $q = 3329$. However, AVX-512 has no dedicated instructions for modular reduction. We therefore use the signed versions of two constant-time reduction algorithms common in lattice-based cryptography, Montgomery reduction [34] and Barrett reduction [9], as proposed by Seiler [45].

Reduction Algorithms. The signed Montgomery reduction computes the Hensel remainder of a signed integer within the range $[-\beta q/2, \beta q/2)$; ML-KEM uses it to reduce the product of two 16-bit coefficients. The output remainder resides in the Montgomery domain, that is, it carries an extra factor of $\beta^{-1}$. Barrett reduction operates within a smaller input range than Montgomery reduction. Typically, the input to Barrett reduction does not exceed the 16-bit signed integer range. Therefore, Barrett reduction is commonly used to reduce coefficients that leave the range $[-q, q]$ after addition and subtraction operations. The output of Barrett reduction remains within $\mathbb{Z}_q$.

AVX-512 Implementations. We give our AVX-512 implementations of signed Montgomery reduction and Barrett reduction. For the modulus $q = 3329 = 2^{11} + 2^{10} + 2^{8} + 1$, we set $\beta = 2^{16}$. With this choice, each coefficient occupies 16 bits of a 512-bit vector register, enabling 32-way parallelism. Given that $v$ remains constant in Barrett reduction [45, Sec 3.3], we can precompute $v$ and keep it in a vector register. Computing $av/2^{16}$ involves a multiplication and a division. Both are performed by a single AVX-512 instruction, vpmulhw, which multiplies corresponding 16-bit lanes of two vector registers and retains only the high 16 bits of each product. The remaining division by $2^{\lfloor \log(q) \rfloor - 1}$ is implemented with an AVX-512 shift instruction. In Algorithm 1, we introduce a macro red16 that computes Barrett reduction using AVX-512 instructions. Within this macro, the zmm\r register holds the input and, on return, the output of the Barrett reduction, while zmm\rv stores the Barrett constant $v = \left\lfloor \frac{2^{\lfloor \log(3329) \rfloor - 1} \cdot 2^{16}}{3329} \right\rceil = 20159$. The zmm\rl register holds an intermediate value.

Furthermore, we introduce our 32-way AVX-512 Montgomery reduction, combined with the butterfly operation, in Section 3.2.

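Algorithm 1 Barrett reduction macro red16
.macro red16 r, rv, rl
vpmulhw %zmm\rv, %zmm\r, %zmm\rl    # rl = high 16 bits of r * v
vpsraw $10, %zmm\rl, %zmm\rl        # rl = floor(r * v / 2^26), the quotient estimate
vpmullw %zmm0, %zmm\rl, %zmm\rl     # rl = rl * q (assumes zmm0 holds q in every lane, as in Section 3.2)
vpsubw %zmm\rl, %zmm\r, %zmm\r      # r = r - rl
.endm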
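The arithmetic of the Barrett macro can be modeled one lane at a time. The following Python sketch assumes the constants given above ($q = 3329$, $v = 20159$, shift by 10) and includes the multiplication of the quotient estimate by $q$ before the final subtraction. The helper names are illustrative, and Python integers stand in for 16-bit lanes.

```python
Q = 3329
V = ((1 << 26) + Q // 2) // Q  # round(2^10 * 2^16 / q)

def mulhi16(a, b):
    """High 16 bits of the signed 32-bit product, as vpmulhw returns per lane."""
    return (a * b) >> 16  # Python's >> floors, matching an arithmetic shift

def barrett_red16(a):
    """Scalar model of one lane of the Barrett reduction for a signed 16-bit input."""
    t = mulhi16(a, V) >> 10  # vpmulhw then vpsraw $10: floor(a * v / 2^26)
    return a - t * Q         # multiply the quotient estimate by q, then subtract
```

For every signed 16-bit input, the result is congruent to the input modulo $q$ and lies in $[0, q]$. The endpoint $q$ itself occurs only for negative multiples of $q$, which is the usual behaviour of a floor-based Barrett estimate.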

3.2 NTT Implementation

NTT is one of the most intricate components in the ML-KEM AVX-512 implementation. Both NTT and INTT operations in ML-KEM require 7 layers of butterfly operations to obtain the final result. In this section, we will introduce several techniques used in NTT implementation.
Register Allocation.A 512-bit vector register zmm can accommodate a maximum of 32 16-bit integers, requiring only 8 vector registers to store all 256 polynomial coefficients. Therefore, we allocate 8 vector registers for storing coefficients, 2 for intermediate registers, and 2 for constant registers, leaving 20 vector registers unused.
Butterfly Unit. Our Cooley-Tukey butterfly pseudo-code is outlined in Algorithm 3. Registers zmm\l and zmm\r respectively store coefficients $f_i$ and $f_{j+len/2}$, while zmm0 holds the modulus $q$. We fold $q^{-1} \bmod^{\pm} \beta$ into the precomputed twiddle factors, which saves one multiplication in each Montgomery reduction. To obtain the low and high 16 bits of each 16-bit integer product, we use the vector integer multiplication instructions vpmullw and vpmulhw. These instructions eliminate the need to extend coefficients to 32 bits for the multiplication. Our Gentleman-Sande butterfly pseudo-code is presented in Algorithm 2. As in the Cooley-Tukey butterfly, registers zmm\l and zmm\r store coefficients $f_i$ and $f_{j+len/2}$, respectively. Additionally, registers zmm\zl and zmm\zh hold the precomputed values $\zeta \cdot q^{-1}$ and $\zeta$, respectively. The NTT and INTT use the twiddle factors in different orders. The same table can serve both, but doing so requires additional permutation. In our implementation, we opt to use two distinct twiddle factor tables for the NTT and INTT.

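Algorithm 2 Gentleman-Sande butterfly
.macro gs_butterfly l,r,zl,zh
vpsubw %zmm\l,%zmm\r,%zmm21      # zmm21 = r - l
vpaddw %zmm\r,%zmm\l,%zmm\l      # l = l + r
vpmulhw %zmm\zh,%zmm21,%zmm22    # zmm22 = high 16 bits of (r - l) * zeta
vpmullw %zmm\zl,%zmm21,%zmm21    # zmm21 = low 16 bits of (r - l) * zeta * q^-1
vpmulhw %zmm0,%zmm21,%zmm21      # zmm21 = high 16 bits of zmm21 * q
vpsubw %zmm21,%zmm22,%zmm\r      # r = zmm22 - zmm21, the Montgomery product
.endm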

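Algorithm 3 Cooley-Tukey butterfly
.macro ct_butterfly l,r,zl,zh
vpmullw %zmm\zl,%zmm\r,%zmm11    # zmm11 = low 16 bits of r * zeta * q^-1
vpmulhw %zmm\zh,%zmm\r,%zmm\r    # r = high 16 bits of r * zeta
vpmulhw %zmm0,%zmm11,%zmm11      # zmm11 = high 16 bits of zmm11 * q
vpsubw %zmm11,%zmm\r,%zmm11      # zmm11 = r - zmm11, the Montgomery product t
vpsubw %zmm11,%zmm\l,%zmm\r      # r = l - t
vpaddw %zmm11,%zmm\l,%zmm\l      # l = l + t
.endm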
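The two butterflies share one Montgomery multiplication pattern: a low product with $\zeta \cdot q^{-1}$, a high product with $\zeta$, a high product with $q$, and a subtraction. The Python sketch below models a single 16-bit lane under the assumption $\beta = 2^{16}$. The function names are illustrative, and the twiddle factor is passed as a plain integer rather than in any particular precomputed form.

```python
Q = 3329
R = 1 << 16              # beta = 2^16
QINV = pow(Q, -1, R)     # q^-1 mod 2^16

def i16(x):
    """Wrap an integer to a signed 16-bit lane."""
    x &= 0xFFFF
    return x - R if x & 0x8000 else x

def mullo(a, b):
    return i16(a * b)    # low 16 bits of the product (vpmullw)

def mulhi(a, b):
    return (a * b) >> 16 # high 16 bits of the signed product (vpmulhw)

def mont_mul(a, zeta, zeta_qinv):
    """Return a * zeta * 2^-16 mod q as a small signed representative."""
    lo = mullo(a, zeta_qinv)  # low half, using the twiddle pre-multiplied by q^-1
    hi = mulhi(a, zeta)       # high half, using the plain twiddle
    return hi - mulhi(lo, Q)  # subtract the high half of lo * q

def ct_butterfly(l, r, zeta):
    t = mont_mul(r, zeta, mullo(zeta, QINV))
    return i16(l + t), i16(l - t)

def gs_butterfly(l, r, zeta):
    d = i16(r - l)
    return i16(l + r), mont_mul(d, zeta, mullo(zeta, QINV))
```

Because the low 16 bits of `a * zeta` and of `lo * Q` coincide, their difference is an exact multiple of $2^{16}$, and no 32-bit intermediate has to be kept. This is why the lanes can stay 16 bits wide.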

In the implementation of NTT and INTT, we employ layer merging and coefficient permutation methods to reduce the memory access overhead. Specifically, we merge the 7 layers of NTT. We load all coefficients into 8 vector registers only at the first layer and store them back in memory after completing the final layer computation. Achieving such layer merging incurs some additional overhead for the coefficient permutation. Since each vector register accommodates 32 coefficients, in the initial three layers of the NTT, the coefficients stored in vector registers satisfy the correct distance and can directly perform the Cooley-Tukey butterfly. Starting from the fourth layer, as the butterfly distance becomes 16, the coefficient pairs requiring butterfly operations are housed within the same vector register. Thus, it is necessary to separate the coefficient pairs at corresponding distances into different vector registers.
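The point at which permutation becomes necessary follows directly from the register width. As a small illustration, assuming coefficients are loaded in natural order with 32 per register, the sketch below lists for each of the 7 layers the butterfly distance and whether both partners of a butterfly fall into the same register. The names are illustrative.

```python
N = 256      # coefficients per polynomial
LANES = 32   # 16-bit coefficients per 512-bit register

def layer_plan():
    """Return (layer, butterfly distance, needs_shuffle) for the 7 NTT layers."""
    plan, distance, layer = [], N // 2, 1
    while distance >= 2:
        pairs = [(j, j + distance)
                 for start in range(0, N, 2 * distance)
                 for j in range(start, start + distance)]
        # A shuffle is needed when both partners live in the same register.
        same_register = all(a // LANES == b // LANES for a, b in pairs)
        plan.append((layer, distance, same_register))
        distance //= 2
        layer += 1
    return plan
```

Layers 1 to 3 (distances 128, 64, 32) pair coefficients from different registers, while layers 4 to 7 (distances 16, 8, 4, 2) pair coefficients inside one register.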

[Figure 3]

Figure 3 depicts the arrangement of coefficients within a vector register during the third to seventh layers of the NTT. The functions Shuffle16, Shuffle8, Shuffle4, Shuffle2, and Shuffle1 permute pairs of 16, 8, 4, 2, and 1 coefficients, respectively. We perform an extra Shuffle1 so that the resulting coefficient layout is convenient for polynomial point-wise multiplication. After completing all NTT layers, the coefficients remain within the range of signed 16-bit integers, eliminating the need for Barrett reduction in the NTT. This is not the case for the INTT, where we adopt lazy reduction as outlined in [54].

3.3 SHA3 Keccak Implementation

ML-KEM utilizes SHA3-256, SHA3-512, SHAKE128, and SHAKE256 as hash functions, pseudorandom functions (PRFs), and eXtendable-output functions (XOFs). These algorithms belong to the SHA-3 family [22] standardized by NIST and are based on the sponge construction [11]. The underlying permutation of SHA-3 is Keccak-p[1600,24], which operates on a 1600-bit state organized as 25 lanes of 64 bits. In the previous AVX2 implementation of ML-KEM by the CRYSTALS team, Keccak-p[1600,24] achieved 4-way parallelism due to the 256-bit vector register width. With AVX-512, we can compute 8 Keccak permutations in parallel, as each 512-bit vector register holds the same 64-bit lane of 8 independent Keccak states.
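The lane-interleaved layout can be illustrated with a small Python model in which each "register" is a list of eight 64-bit integers, one per independent state, and every step of the permutation is applied to all eight at once. This is a sketch for single-block SHA3-256 inputs only; the function names are illustrative and say nothing about how the AVX-512 code is organized internally.

```python
import hashlib

MASK = (1 << 64) - 1
WAYS = 8  # eight 64-bit lanes fit in one 512-bit register

def vxor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def vandn(a, b):
    return [(~x & MASK) & y for x, y in zip(a, b)]  # (NOT a) AND b

def vrol(a, n):
    n %= 64
    return [((x << n) | (x >> (64 - n))) & MASK for x in a]

def keccak_f1600_x8(A):
    """A[x][y] holds lane (x, y) of all eight states; 24 rounds of Keccak-p."""
    lfsr = 1
    for _ in range(24):
        # theta
        C = [vxor(vxor(vxor(vxor(A[x][0], A[x][1]), A[x][2]), A[x][3]), A[x][4])
             for x in range(5)]
        D = [vxor(C[(x + 4) % 5], vrol(C[(x + 1) % 5], 1)) for x in range(5)]
        A = [[vxor(A[x][y], D[x]) for y in range(5)] for x in range(5)]
        # rho and pi
        x, y = 1, 0
        cur = A[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            cur, A[x][y] = A[x][y], vrol(cur, (t + 1) * (t + 2) // 2)
        # chi
        for y in range(5):
            T = [A[x][y] for x in range(5)]
            for x in range(5):
                A[x][y] = vxor(T[x], vandn(T[(x + 1) % 5], T[(x + 2) % 5]))
        # iota: round constant bits come from an 8-bit LFSR
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                A[0][0] = [v ^ (1 << ((1 << j) - 1)) for v in A[0][0]]
    return A

def sha3_256_x8(msgs):
    """SHA3-256 of eight messages, each shorter than one 136-byte block."""
    rate = 136
    assert len(msgs) == WAYS and all(len(m) < rate for m in msgs)
    blocks = []
    for m in msgs:
        b = bytearray(200)
        b[:len(m)] = m
        b[len(m)] ^= 0x06      # SHA-3 domain separation and first padding bit
        b[rate - 1] ^= 0x80    # final padding bit
        blocks.append(b)
    A = [[[int.from_bytes(b[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
           for b in blocks] for y in range(5)] for x in range(5)]
    A = keccak_f1600_x8(A)
    return [b"".join(A[x][y][i].to_bytes(8, "little")
                     for y in range(5) for x in range(5))[:32]
            for i in range(WAYS)]
```

The eight digests can be compared against `hashlib.sha3_256` computed separately on each message.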

3.4 Other Modules

We also implement Compress and Decompress, pkDecode, skDecode, skEncode, and pkEncode, as well as several polynomial arithmetic functions. The implementation idea is similar to the AVX2 implementation. For simplicity, we do not discuss details here. One notable difference is that, in the implementation of pkDecode, skDecode, and Decompress, we utilize the AVX-512 instruction vpermb, which operates on bytes, unlike AVX2 instructions. This instruction allows us to conveniently adjust the order of byte streams. We implement rejection sampling using AVX-512 according to the method presented in [56].
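Rejection sampling maps naturally onto mask registers and compress-style stores: a vector comparison produces a mask of accepted lanes, and the accepted lanes are then packed contiguously. The Python sketch below is a scalar model of that idea only; it assumes the 12-bit parsing of the FIPS 203 draft and is not a reproduction of the method in [56]. All names are illustrative.

```python
Q = 3329

def parse12(buf):
    """Split each group of 3 bytes into two 12-bit candidates."""
    out = []
    for i in range(0, len(buf) - len(buf) % 3, 3):
        b0, b1, b2 = buf[i], buf[i + 1], buf[i + 2]
        out.append(b0 | ((b1 & 0x0F) << 8))
        out.append((b1 >> 4) | (b2 << 4))
    return out

def compress_store(lanes, mask):
    """Model of a masked compress-store: keep selected lanes, packed contiguously."""
    return [v for v, keep in zip(lanes, mask) if keep]

def rej_uniform(buf, want=256, width=32):
    """Process candidates in groups of `width` lanes, as a 512-bit register would."""
    cands = parse12(buf)
    coeffs = []
    for i in range(0, len(cands), width):
        lanes = cands[i:i + width]
        mask = [v < Q for v in lanes]       # vector compare producing a lane mask
        coeffs += compress_store(lanes, mask)
    return coeffs[:want]
```

The output is the same as that of a one-candidate-at-a-time loop; only the grouping into 32-lane batches differs.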

4 ML-KEM TLS 1.3 Integration Design Consideration

In this section, we discuss how to integrate ML-KEM AVX-512 implementation into TLS 1.3.

4.1 Batch Key Generation Using Parallel Keccak

The concept of batch key generation was discussed in both [55] and [10]. In [10], Montgomery's trick was used to compute multiple polynomial inversions, specifically targeting key generation algorithms involving polynomial inversion. On the other hand, [55] introduces an 8×1 X25519 approach to batch 8 key pairs in parallel. We explore the potential applicability of batch key generation in other PQC algorithms. Drawing inspiration from [42], we propose a batch key generation approach for ML-KEM, as outlined in Algorithm 4. The SHA3-256x8 function is constructed based on the 8-way Keccak function discussed in Section 3.3. Our 8-way ML-KEM key generation AVX-512 implementation function is designed to generate 8 independent $(pk, sk)$ pairs simultaneously.

Algorithm 4: Batch key generation for ML-KEM

Output: Encapsulation keys array $\mathrm{ek}[8] \in \mathbb{B}^{384k+32}$.

Output: Decapsulation keys array $\mathrm{dk}[8] \in \mathbb{B}^{768k+96}$.

1: $z \xleftarrow{\$} \mathbb{B}^{32}$

2: for $i = 0$ to $7$ do

3:   $(\mathrm{ek}_{\mathrm{PKE}}[i], \mathrm{dk}_{\mathrm{PKE}}[i]) \leftarrow$ K-PKE.KeyGen()

4:   $\mathrm{ek}[i] \leftarrow \mathrm{ek}_{\mathrm{PKE}}[i]$

5: end for

6: $output[8] \leftarrow$ SHA3-256x8$(\mathrm{ek}[0], \mathrm{ek}[1], \mathrm{ek}[2], \mathrm{ek}[3], \mathrm{ek}[4], \mathrm{ek}[5], \mathrm{ek}[6], \mathrm{ek}[7])$

7: for $i = 0$ to $7$ do

8:   $\mathrm{dk}[i] \leftarrow (\mathrm{dk}_{\mathrm{PKE}}[i] \,\|\, \mathrm{ek}[i] \,\|\, output[i] \,\|\, z)$

9: end for

10: return $(\mathrm{ek}[8], \mathrm{dk}[8])$
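To make the data flow of Algorithm 4 concrete, the following Python sketch mirrors its structure. It is only an illustration under stated assumptions: `k_pke_keygen_stub` is a hypothetical placeholder that returns random bytes of the correct lengths instead of running K-PKE.KeyGen, and `sha3_256_x8` hashes the eight keys sequentially with `hashlib` rather than with the 8-way AVX-512 Keccak of Section 3.3. The outputs have the sizes stated in Algorithm 4, but they are not usable ML-KEM keys.

```python
import hashlib
import os

BATCH = 8  # number of key pairs produced per call, as in Algorithm 4


def k_pke_keygen_stub(k):
    """Hypothetical stand-in for K-PKE.KeyGen(): random bytes of the right sizes."""
    ek_pke = os.urandom(384 * k + 32)
    dk_pke = os.urandom(384 * k)
    return ek_pke, dk_pke


def sha3_256_x8(messages):
    """Sequential stand-in for SHA3-256x8 (same digests, no SIMD parallelism)."""
    assert len(messages) == BATCH
    return [hashlib.sha3_256(m).digest() for m in messages]


def batch_keygen(k=2):
    """Assemble 8 (ek, dk) pairs; k = 2, 3, 4 for ML-KEM-512/768/1024."""
    z = os.urandom(32)  # sampled once, following the listing
    ek, dk_pke = [], []
    for _ in range(BATCH):
        e, d = k_pke_keygen_stub(k)
        ek.append(e)
        dk_pke.append(d)
    digests = sha3_256_x8(ek)  # one batched hash call over all eight keys
    dk = [dk_pke[i] + ek[i] + digests[i] + z for i in range(BATCH)]
    return ek, dk
```

The point of the layout is that the eight public-key hashes are deferred until all keys exist, so that a single 8-way Keccak call can process them together.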

4.2 ML-KEM AVX-512 TLS 1.3 Migration Implementation

We use the OQS provider developed by the Open Quantum Safe (OQS) team [36], together with OpenSSL 3.3.0-dev, to integrate the ML-KEM AVX-512 implementation. OpenSSL is an open-source software library implementing the SSL and TLS protocols. Most PQ-TLS research uses OQS-OpenSSL to integrate PQC algorithms into TLS 1.3. However, OQS-OpenSSL ceased updates in July 2023 and does not support the latest OpenSSL 3 versions. The OQS provider separates the integration of PQ algorithms into TLS 1.3 from the main logic of OpenSSL, without altering the core cryptographic algorithms. This separation isolates the embedding of post-quantum algorithms from the extensive OpenSSL codebase. For the ML-KEM AVX-512 code, we choose liboqs 0.10.1 [19]. liboqs is an open-source C library for quantum-safe cryptographic algorithms. The liboqs 0.10.1 release includes C reference and AVX2 code for ML-DSA and ML-KEM. We add our ML-KEM AVX-512 code to the ml_kem directory of liboqs and implement the corresponding ML-KEM AVX-512 Keygen, Encaps, and Decaps APIs. We define the macro OQS_ENABLE_KEM_ml_kem_512_avx512 in the OQS configuration file. By enabling this macro, users can run the ML-KEM-512 AVX-512 code within the liboqs library.

5 Revisiting PQC Security of KEM in TLS 1.3

In this section, we revisit the PQC security of the KEM in TLS 1.3 based on recent research [28, 30]. We argue that the IND-CCA KEM used in the TLS 1.3 handshake can be replaced by an IND-1-CCA KEM, which improves efficiency while providing sufficient security.

5.1 An Efficient Choice of Key-Exchange: IND-1-CCA KEM

In existing PQ-TLS implementations, the ephemeral KEX is implemented with IND-CCA KEMs. IND-CCA KEMs are usually constructed by applying the Fujisaki-Okamoto (FO) transform or one of its variants (e.g., ML-KEM [51]) to an OW-CPA/IND-CPA PKE scheme. However, the FO transform requires re-encrypting the plaintext during decapsulation, which significantly reduces the efficiency of KEMs and increases the cost of side-channel protection [33].

Recent protocols (e.g., TLS 1.3 and KEMTLS) are designed to achieve forward security. In such protocols, each ephemeral public/private key pair is discarded immediately after being used once, and a new key pair is generated for new messages. This means that an adversary can request a decryption at most once for a given key pair. Informally, in the IND-1-CCA security game, an adversary must distinguish an honestly generated key from a randomly generated key while making at most one decapsulation query. Thus, the IND-1-CCA security of KEMs is sufficient to replace the Diffie-Hellman (DH) key exchange while ensuring the security of such protocols. In the security proof of the TLS 1.3 handshake under the multi-stage model given in [20], the DH key exchange could be replaced by an IND-1-CCA KEM and the proof would still hold. This idea inspired a series of works; see [28, 30, 43, 44] for details.

[28] showed that an IND-$q$-CCA-secure KEM can be obtained from any passively secure PKE (OW-CPA/IND-CPA) without re-encryption. Specifically, [28] presented two constructions, named $T_{CH}$ and $T_H$, as well as their security proofs. Neither $T_{CH}$ nor $T_H$ requires re-encryption or de-randomization, and such IND-1-CCA KEMs can be used in the TLS 1.3 handshake, improving efficiency while ensuring security. However, $T_{CH}$ leads to ciphertext expansion, and the security of $T_H$ was proved only in the ROM; a QROM proof was not provided. Building on [28], [30] provided both ROM and QROM proofs of $T_H$ and $T_{RH}$ (an implicit variant of $T_H$), with much tighter reductions than [28].

5.2 IND-1-CCA KEM Constructions

In this section, we review the definitions of $T_{CH}$ and $T_H$/$T_{RH}$, as well as their reduction tightness in the ROM and QROM.
Constructions. The idea of $T_{CH}$ is to send an additional hash value along with the ciphertext, which is used for key confirmation in decapsulation (see Figure 4). $T_H$ works without the additional hash value; the key is derived directly as $H(m, c)$ (Figure 5). $T_{RH}$ is an implicit variant of $T_H$: the only difference between $T_{RH}$ and $T_H$ is the return value for invalid decryption. In fact, $T_H$ and $T_{RH}$ are equivalent to $U^{\bot}$ and $U^{\not\bot}$ in [27], respectively.
Reduction Tightness. The security of $T_{CH}$ was proved in the ROM and QROM by Huguenin-Dumittan and Vaudenay [28], with tightness $\epsilon_R \approx O(1/q_H)\,\epsilon_{\mathcal{A}}$ and $\epsilon_R \approx O(1/q_H^3)\,\epsilon_{\mathcal{A}}^2$ respectively, where $\epsilon_R$ (resp. $\epsilon_{\mathcal{A}}$) is the advantage of the reduction $R$ (resp. adversary $\mathcal{A}$) against the underlying PKE (resp. the IND-1-CCA KEM) and $q_H$ denotes the number of $\mathcal{A}$'s queries to $H$.
Later, they updated the ePrint version of [28] and gave a tighter QROM proof of $T_{CH}$ with bound $\epsilon_R \approx O(1/q_H^2)\,\epsilon_{\mathcal{A}}^2 - O(q_H^3/2^n) - O(q_H/\sqrt{2^n})$.
The ROM and QROM proofs of $T_H$ and $T_{RH}$ were presented in [30], with tightness $\epsilon_R \approx O(1/q_H)\,\epsilon_{\mathcal{A}}$ and $\epsilon_R \approx O(1/q_H^2)\,\epsilon_{\mathcal{A}}^2$ respectively. The comparison between the FO transform and $T_{CH}$/$T_H$/$T_{RH}$ is summarized in Table 1.
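The three constructions can be sketched in Python over a toy PKE. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the PKE is textbook ElGamal over a 127-bit prime field (chosen only because it fits in a few lines and is not secure), all function names are hypothetical, and the exact inputs to the tag hash and key hash in `tch_encaps` are an assumption of this sketch; [28] and [30] give the authoritative definitions.

```python
import hashlib
import hmac
import secrets

P = 2**127 - 1  # Mersenne prime; toy parameter, NOT secure
G = 3


def _h(label, *parts):
    """Domain-separated SHA3-256 over length-prefixed int/bytes parts."""
    h = hashlib.sha3_256(label)
    for part in parts:
        if not isinstance(part, bytes):
            part = part.to_bytes((part.bit_length() + 7) // 8 or 1, "big")
        h.update(len(part).to_bytes(2, "big") + part)
    return h.digest()


def pke_keygen():
    sk = secrets.randbelow(P - 2) + 1
    return pow(G, sk, P), sk


def pke_enc(pk, m):
    r = secrets.randbelow(P - 2) + 1
    return pow(G, r, P), (m * pow(pk, r, P)) % P


def pke_dec(sk, c):
    c1, c2 = c
    if not (0 < c1 < P and 0 < c2 < P):
        return None  # models a failed decryption (bottom)
    return (c2 * pow(c1, P - 1 - sk, P)) % P


def th_encaps(pk):
    m = secrets.randbelow(P - 1) + 1
    c = pke_enc(pk, m)
    return c, _h(b"H", m, *c)  # key = H(m, c), no re-encryption needed later


def th_decaps(sk, c):
    m = pke_dec(sk, c)
    return None if m is None else _h(b"H", m, *c)  # explicit rejection


def trh_decaps(sk, s, c):
    m = pke_dec(sk, c)
    # Implicit rejection: a pseudorandom key derived from the secret seed s.
    return _h(b"H", s if m is None else m, *c)


def tch_encaps(pk):
    m = secrets.randbelow(P - 1) + 1
    c = pke_enc(pk, m)
    tag = _h(b"H'", m, *c)  # extra hash value sent with the ciphertext
    return (c, tag), _h(b"H", m)


def tch_decaps(sk, ct):
    c, tag = ct
    m = pke_dec(sk, c)
    if m is None or not hmac.compare_digest(_h(b"H'", m, *c), tag):
        return None  # key confirmation failed
    return _h(b"H", m)
```

None of the decapsulation routines calls `pke_enc`, which is the source of the efficiency gain over FO; the price for $T_{CH}$ is the extra 32-byte tag in the ciphertext.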

[Figure 4: the $T_{CH}$ construction]
[Figure 5: the $T_H$ construction]

Table 1: Comparison between the FO transform and $T_{CH}$/$T_H$/$T_{RH}$.

| Construction | Security | $c$-Expansion | Re-Enc | ROM tightness | QROM tightness |
|---|---|---|---|---|---|
| $FO$ | IND-CPA $\Rightarrow$ IND-CCA | ○ | ● | $\epsilon_R \approx \epsilon_{\mathcal{A}}$ | $\epsilon_R \approx O(1/q_H)\,\epsilon_{\mathcal{A}}^2$ |
| $T_{CH}$ | OW-CPA $\Rightarrow$ IND-1-CCA | ● | ○ | $\epsilon_R \approx O(1/q_H)\,\epsilon_{\mathcal{A}}$ | $\epsilon_R \approx O(1/q_H^2)\,\epsilon_{\mathcal{A}}^2$ |
| $T_H$/$T_{RH}$ | IND-CPA $\Rightarrow$ IND-1-CCA | ○ | ○ | $\epsilon_R \approx O(1/q_H)\,\epsilon_{\mathcal{A}}$ | $\epsilon_R \approx O(1/q_H^2)\,\epsilon_{\mathcal{A}}^2$ |
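To get a feel for what these bounds mean numerically, the short Python sketch below evaluates the leading terms of Table 1 in the log domain. It is a rough illustration only: every constant hidden by $O(\cdot)$ is assumed to be 1, additive terms are ignored, and the function name and example values are hypothetical.

```python
def log2_reduction_advantage(construction, model, log2_eps_a, log2_q_h):
    """log2 of eps_R from the leading term in Table 1 (O(.) constants taken as 1)."""
    bounds = {
        ("FO", "ROM"): log2_eps_a,
        ("FO", "QROM"): 2 * log2_eps_a - log2_q_h,
        ("T_CH", "ROM"): log2_eps_a - log2_q_h,
        ("T_CH", "QROM"): 2 * log2_eps_a - 2 * log2_q_h,
        ("T_H/T_RH", "ROM"): log2_eps_a - log2_q_h,
        ("T_H/T_RH", "QROM"): 2 * log2_eps_a - 2 * log2_q_h,
    }
    return bounds[(construction, model)]


# Example: adversary advantage 2^-20 with 2^40 hash queries.
for key in [("FO", "ROM"), ("T_H/T_RH", "ROM"), ("FO", "QROM"), ("T_H/T_RH", "QROM")]:
    print(key, log2_reduction_advantage(*key, log2_eps_a=-20, log2_q_h=40))
```

Under these assumptions, the IND-1-CCA constructions lose an additional factor of $q_H$ relative to FO in both models, which is the cost of dropping re-encryption.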

6 Discussions and Results

6.1 Experimental Setup

In this section, we describe our experimental setup. We ran an emulated experiment on a single machine. The benchmark platform is an Intel Xeon Platinum 8475B with 4 vCPUs and 8 GiB of memory. (Our testing machine is provided by Alibaba Cloud and belongs to the compute-optimized instance family c8i. It is powered by an Intel Xeon Sapphire Rapids processor, with the instance specification ecs.c8i.xlarge, featuring 4 vCPUs and 8 GiB of memory.) The OpenSSL version we used is OpenSSL 3.3.0-dev. We used GCC 9.4.0 to compile all programs. We disabled AVX-512 instructions when benchmarking the ML-KEM AVX2 implementation for a fair comparison. We did not disable Turbo Boost or hyper-threading, in order to simulate real-world conditions during the TLS handshake.

6.2 Speed of ML-KEM AVX-512 Implementation

In this section, we evaluate the performance of our ML-KEM AVX-512 implementation. Our work builds on the CRYSTALS team's open-source Kyber standard code [5], which implements the FIPS 203 draft. To assess performance, we executed the speed testing program 100,000 times and recorded the median CPU cycle count. The results are summarized in Table 2, which compares various ML-KEM implementations across three security levels. Our implementation achieves a speedup of approximately 1.30× to 1.64× over the state-of-the-art AVX2 implementation. This gain can be attributed to the design of our AVX-512 NTT implementation and the wider vectorization offered by AVX-512 instructions. As Table 2 also shows, adopting the batch method of Section 4.1 accelerates key generation by 3.5× to 4.9×.
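The measurement procedure (repeat many times, keep the median, report a ratio) can be sketched as follows. This is a hedged illustration rather than the benchmark code used here: it reads a nanosecond wall clock through `time.perf_counter_ns` instead of the CPU cycle counter, and the helper names `median_ns` and `speedup` are hypothetical.

```python
import statistics
import time


def median_ns(fn, runs=100_000):
    """Run fn() `runs` times and return the median elapsed time in nanoseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return statistics.median(samples)


def speedup(baseline, optimized):
    """Ratio of baseline cost to optimized cost (larger means faster)."""
    return baseline / optimized


# Example with the ML-KEM-1024 Encaps medians reported in Table 2
# (AVX2: 41,162 cycles; this work: 25,062 cycles).
print(round(speedup(41162, 25062), 2))
```

The median is preferred over the mean in this kind of harness because it is insensitive to occasional outliers caused by interrupts or frequency changes.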

Table 2: Median CPU cycle counts of ML-KEM implementations. Each Speed-up row gives the ratio of the row above it to this work.

| Implementation | ML-KEM-512 Keygen | ML-KEM-512 Encaps | ML-KEM-512 Decaps | ML-KEM-768 Keygen | ML-KEM-768 Encaps | ML-KEM-768 Decaps | ML-KEM-1024 Keygen | ML-KEM-1024 Encaps | ML-KEM-1024 Decaps | ISA/ISE |
|---|---|---|---|---|---|---|---|---|---|---|
| [5] | 81,750 | 90,922 | 118,790 | 130,048 | 142,688 | 178,206 | 217,174 | 223,780 | 274,588 | x86-64 |
| Speed-up | 6.21× | 6.68× | 7.80× | 6.61× | 7.61× | 8.36× | 8.61× | 8.93× | 9.52× | |
| [5] | 17,750 | 18,708 | 19,838 | 28,996 | 28,566 | 30,476 | 39,996 | 41,162 | 44,538 | AVX2 |
| Speed-up | 1.35× | 1.37× | 1.30× | 1.47× | 1.52× | 1.43× | 1.59× | 1.64× | 1.54× | |
| This work | 13,170 | 13,620 | 15,230 | 19,684 | 18,760 | 21,308 | 25,216 | 25,062 | 28,834 | AVX-512 |
| Batch Keygen | 2,804 (4.9×) | 13,620 | 15,230 | 4,866 (4.2×) | 18,760 | 21,308 | 7,394 (3.5×) | 25,062 | 28,834 | AVX-512 |

6.3 ML-KEM AVX-512 TLS 1.3 Benchmark

We evaluated the performance of TLS 1.3 handshakes integrated with the ML-KEM AVX-512 implementation in a simulated network environment. First, we used OpenSSL commands to generate key pairs and server certificates. Then, using these keys and certificates, we launched a TLS server via the openssl s_server tool [3]. We measured the number of handshake connections per second using the openssl s_time tool [4]. Both the server and the client were compiled against OpenSSL 3.3.0-dev with the OQS provider, which supports both hybrid mode and PQ-only mode.
PQ-only ML-KEM PQ-TLS 1.3 Benchmark. In TLS, we mainly observe the ML-KEM algorithm for KEX. However, there are multiple signature algorithms available, each with different security levels. Therefore, observing the combination of different signature algorithms with ML-KEM is crucial and worth considering. For our experiments, we selected three signature algorithms: the standardized post-quantum signature algorithms Dilithium and Falcon, as well as the traditional public key cipher RSA2048.

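Table 3: TLS 1.3 handshakes per second in PQ-only mode.

| KEM | SIG | AVX2 | AVX-512 |
|---|---|---|---|
| ML-KEM-512 | Dilithium2 | 3416.02 | 3589.22 |
| ML-KEM-512 | Falcon512 | 3191.46 | 3364.61 |
| ML-KEM-512 | RSA2048 | 2582.38 | 2650.45 |
| ML-KEM-768 | Dilithium3 | 2710.85 | 2856.71 |
| ML-KEM-1024 | Dilithium5 | 3191.46 | 2451.08 |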

As shown in Table 3, under the PQ-only setting, the AVX-512 implementation of ML-KEM yields more handshakes per second than the AVX2 implementation. This indicates that using AVX-512 to optimize ML-KEM can improve the PQ-TLS handshake. However, the improvement is subtle. Because handshake time is determined not only by KEX time but also by factors such as transmission time and authentication time, improving the performance of ML-KEM alone does not lead to a significant improvement in overall handshake performance. Besides, the number of handshakes per second decreases as the security level increases. This is in line with expectations, because higher security levels typically slow down both the signature and the KEM. Overall, at the same security level, using Dilithium as the signature scheme for authentication yields better handshake performance than Falcon. Compared to RSA2048, using the Dilithium and Falcon post-quantum signature schemes for authentication does not impose significant performance overhead. These conclusions provide insight into how different signature algorithms combined with ML-KEM affect TLS handshake performance and serve as a reference for selecting a suitable combination.
Hybrid ML-KEM PQ-TLS 1.3 Benchmark. The OQS provider supports hybrid key exchange for TLS 1.3 following the IETF draft on post-quantum/traditional hybrid schemes [29]. Consequently, we also evaluated hybrid schemes that combine ML-KEM with ECDH curves. The data in Table 4 indicate a significant decrease in handshake performance in hybrid mode compared to PQ-only mode. Furthermore, the improvement in handshake performance achieved by the AVX-512 implementation is minimal. This may be because, in hybrid mode, the key-exchange time consists of two components: ECDH and ML-KEM. Therefore, improving only the performance of ML-KEM is insufficient in hybrid mode; the performance of the ECDH scheme also needs to be improved.

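Table 4: TLS 1.3 handshakes per second in hybrid mode.

| KEM | ECDH curve | AVX2 | AVX-512 |
|---|---|---|---|
| ML-KEM-512 | p256 | 893.98 | 944.23 |
| ML-KEM-512 | x25519 | 1965.58 | 2031.37 |
| ML-KEM-768 | p384 | 375.19 | 377.72 |
| ML-KEM-768 | x448 | 948.04 | 999.85 |
| ML-KEM-1024 | p521 | 200.45 | 201.66 |
| ML-KEM-1024 | p384 | 369.84 | 371.39 |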
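The size of the hybrid-mode improvement can be quantified directly from the Table 4 figures. The Python sketch below computes the relative gain in handshakes per second for each hybrid group; the dictionary layout and the name `percent_gain` are illustrative choices, while the numbers are copied from Table 4.

```python
# (KEM, ECDH curve) -> (AVX2 handshakes/s, AVX-512 handshakes/s), from Table 4
HYBRID_HANDSHAKES = {
    ("ML-KEM-512", "p256"): (893.98, 944.23),
    ("ML-KEM-512", "x25519"): (1965.58, 2031.37),
    ("ML-KEM-768", "p384"): (375.19, 377.72),
    ("ML-KEM-768", "x448"): (948.04, 999.85),
    ("ML-KEM-1024", "p521"): (200.45, 201.66),
    ("ML-KEM-1024", "p384"): (369.84, 371.39),
}


def percent_gain(avx2, avx512):
    """Relative increase in handshakes per second, in percent."""
    return (avx512 / avx2 - 1.0) * 100.0


for (kem, curve), (avx2, avx512) in HYBRID_HANDSHAKES.items():
    print(f"{curve}_{kem}: {percent_gain(avx2, avx512):+.2f}%")
```

Every gain stays below 6%, and the groups using the slower curves (p384, p521) gain less than 1%, which is consistent with ECDH dominating the key-exchange cost in those configurations.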

6.4 TLS 1.3 Handshake with IND-1-CCA KEM

In this section, we implement IND-1-CCA KEMs based on the $T_{CH}$ and $T_{RH}$ constructions proposed in [28] and [30], integrate the better one into TLS 1.3, and analyze its performance.
Comparison of KEM Constructions. We constructed two IND-1-CCA KEMs based on the underlying PKE of ML-KEM, employing $T_{CH}$ and $T_{RH}$ respectively. Both $T_{CH}$ and $T_{RH}$ use SHA3-256 to instantiate the hash function $H$, consistent with the original $FO$ transform in ML-KEM. Additionally, we fixed the length of the $tag$ in $T_{CH}$ to 32 bytes. The results are summarized in Table 5, which compares the execution speed and communication overhead of three constructions (ML-KEM's original $FO$ construction, $T_{CH}$, and $T_{RH}$). From the experimental results, we draw the following conclusions.
For key generation and encapsulation, $T_{CH}$ and $T_{RH}$ show a reasonable improvement in efficiency, mainly because they omit the hash of $pk$ in key generation as well as the de-randomization in encapsulation.
For decapsulation, $T_{CH}$ and $T_{RH}$ achieve a speedup of at least 3.04× over the $FO$ implementation, which can be attributed to removing re-encryption. The improvement becomes more significant as the security level increases, because higher security levels result in slower encryption. $T_{CH}$ requires an extra 32 bytes of ciphertext, and its decapsulation is slightly slower than that of $T_{RH}$ due to the key confirmation. Overall, KEMs based on $T_{RH}$ perform better in both efficiency and communication overhead.

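Table 5: Execution speed and ciphertext size of the $FO$, $T_{CH}$, and $T_{RH}$ constructions.

| Parameter set | Construction | Keygen | Encaps | Decaps | Ct |
|---|---|---|---|---|---|
| ML-KEM-512 | $FO$ | 17,750 | 18,708 | 19,838 | 768 |
| ML-KEM-512 | $T_{CH}$ | 12,932 | 18,434 | 6,520 | 800 |
| ML-KEM-512 | $T_{RH}$ | 12,896 | 17,676 | 5,666 | 768 |
| ML-KEM-768 | $FO$ | 28,996 | 28,566 | 30,476 | 1,088 |
| ML-KEM-768 | $T_{CH}$ | 21,942 | 28,316 | 8,974 | 1,120 |
| ML-KEM-768 | $T_{RH}$ | 21,942 | 27,580 | 8,440 | 1,088 |
| ML-KEM-1024 | $FO$ | 39,996 | 41,162 | 44,538 | 1,568 |
| ML-KEM-1024 | $T_{CH}$ | 30,292 | 42,450 | 12,172 | 1,600 |
| ML-KEM-1024 | $T_{RH}$ | 30,336 | 40,096 | 11,240 | 1,568 |

Note: Ct represents the ciphertext size in bytes; Keygen, Encaps, and Decaps are given in CPU cycles.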
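The decapsulation ratios quoted in the text follow directly from the Decaps columns of Table 5. As a small check, the Python sketch below recomputes them; the data structure and the function name `decaps_speedups` are hypothetical, and the cycle counts are copied from the table.

```python
# Decapsulation cycle counts from Table 5
DECAPS_CYCLES = {
    "ML-KEM-512": {"FO": 19838, "T_CH": 6520, "T_RH": 5666},
    "ML-KEM-768": {"FO": 30476, "T_CH": 8974, "T_RH": 8440},
    "ML-KEM-1024": {"FO": 44538, "T_CH": 12172, "T_RH": 11240},
}


def decaps_speedups(table=DECAPS_CYCLES):
    """Speedup of each IND-1-CCA construction over FO, per parameter set."""
    return {
        level: {name: row["FO"] / row[name] for name in ("T_CH", "T_RH")}
        for level, row in table.items()
    }


for level, ratios in decaps_speedups().items():
    print(level, {name: round(r, 2) for name, r in ratios.items()})
```

The smallest ratio is the $T_{CH}$ entry for ML-KEM-512, and the ratios for both constructions increase from ML-KEM-512 to ML-KEM-1024.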

IND-1-CCA KEM TLS 1.3 Benchmark. The $T_{RH}$-based IND-1-CCA KEM is integrated into TLS 1.3 in both PQ-only mode and hybrid mode. We measured the number of handshakes per second and compared the results with the original $FO$ implementation, as shown in Table 6.

Table 6: TLS 1.3 handshakes per second with the $FO$-based and $T_{RH}$-based KEMs.

| SIG | KEM | $FO$ | $T_{RH}$ |
|---|---|---|---|
| Dilithium2 | ML-KEM-512 | 3416.02 | 3476.17 |
| Dilithium2 | p256_ML-KEM-512 | 924.61 | 978.12 |
| Dilithium2 | x25519_ML-KEM-512 | 2014.98 | 2038.19 |
| Dilithium3 | ML-KEM-768 | 2710.85 | 2803.24 |
| Dilithium3 | x25519_ML-KEM-768 | 1942.31 | 2036.58 |
| Dilithium3 | p256_ML-KEM-768 | 902.83 | 931.99 |

The results show that $T_{RH}$-based IND-1-CCA KEMs increase the number of TLS 1.3 handshakes per second without increasing communication overhead. Especially in PQ-only mode with a higher security level, the removal of re-encryption brings a more significant improvement, which is in line with our expectations.

Our experiments confirm the advantages of applying IND-1-CCA KEMs in TLS 1.3 at a practical level. Specifically, in application scenarios that place more emphasis on handshake efficiency, IND-1-CCA KEMs are better suited than $FO$-based IND-CCA KEMs. These results motivate further research on IND-1-CCA constructions (e.g., tighter reductions) and their TLS 1.3 implementation.

7 Conclusion

In this paper, we present our implementation of ML-KEM using AVX-512 and introduce a novel batch method for ML-KEM key generation. With the support of the OQS provider, we integrate our optimized ML-KEM AVX-512 implementation into TLS 1.3, enhancing its resistance to quantum threats. We evaluate the performance of our implementation in both PQ-only and hybrid modes. Furthermore, we revisit two IND-1-CCA KEMs and analyze their impact on PQ-TLS handshake performance. Our AVX-512 implementation demonstrates a speedup of up to 1.64× for ML-KEM, while our batch method achieves speedups ranging from 3.5× to 4.9× for key generation. Our measurements also show improvements in TLS 1.3 handshake performance with our AVX-512 implementation. Through our optimized implementation, integration, and assessment efforts, we provide insights for future work aimed at enhancing PQ-TLS handshake performance.

References

  • [1]Hybrid key exchange in TLS 1.3, https://www.ietf.org/archive/id/draft-ietf-tls-hybrid-design-04.html
  • [2]Intel xeon phi processor 7250 specifications, https://www.intel.com/content/www/us/en/products/sku/94035/intel-xeon-phi-processor-7250-16gb-1-40-ghz-68-core/specifications.html
  • [3]OpenSSL: s_server- tls/ssl server program. OpenSSL Documentation (2022), https://www.openssl.org/docs/man3.3/man1/s_server.html
  • [4]OpenSSL: s_time- ssl/tls performance timing program. OpenSSL Documentation (2022), https://www.openssl.org/docs/man3.3/man1/s_time.html
  • [5]Kyber standard code (2024), https://github.com/pq-crystals/kyber/tree/standard, accessed on: 2024-03-27
  • [6]Abiega-L’Eglisse, A.F.D., Delgado-Vargas, K.A., Valencia-Rodriguez, F.Q., Quiroga, V.G., Gallegos-GarcΓ­a, G., Nakano-Miyatake, M.: Performance of new hope and crystals-dilithium postquantum schemes in the transport layer security protocol. IEEE Access 8, 213968–213980 (2020). https://doi.org/10.1109/ACCESS.2020.3040324, https://doi.org/10.1109/ACCESS.2020.3040324
  • [7]Avanzi, R., Bos, J., Ducas, L., Kiltz, E., Lepoint, T., Lyubashevsky, V., Schanck, J.M., Schwabe, P., Seiler, G., StehlΓ©, D.: Crystals-kyber algorithm specifications and supporting documentation. NIST PQC Round 2(4), 1–43 (2019)
  • [8]Baentsch, M., Paquin, C., Levitte, R., Hess, B., Segeth, J.: oqsprovider - open quantum safe provider for openssl. Website (2023), https://github.com/open-quantum-safe/oqs-provider
  • [9]Barrett, P.: Implementing the rivest shamir and adleman public key encryption algorithm on a standard digital signal processor. In: Conference on the Theory and Application of Cryptographic Techniques. pp. 311–323. Springer (1986)
  • [10]Bernstein, D.J., Brumley, B.B., Chen, M., Tuveri, N.: Opensslntru: Faster post-quantum TLS key exchange. In: Butler, K.R.B., Thomas, K. (eds.) 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022. pp. 845–862. USENIX Association (2022), https://www.usenix.org/conference/usenixsecurity22/presentation/bernstein
  • [11]Bertoni, G., Daemen, J., Peeters, M., VanAssche, G.: Sponge functions. In: ECRYPT hash workshop. vol.2007 (2007)
  • [12]Bos, J.W., Costello, C., Naehrig, M., Stebila, D.: Post-quantum key exchange for the tls protocol from the ring learning with errors problem. In: 2015 IEEE Symposium on Security and Privacy. pp. 553–570. IEEE (2015)
  • [13]Bozhko, J., Hanna, Y., Harrilal-Parchment, R., Tonyali, S., Akkaya, K.: Performance evaluation of quantum-resistant tls for consumer iot devices. In: 2023 IEEE 20th Consumer Communications & Networking Conference (CCNC). pp. 230–235. IEEE (2023)
  • [14]Chang, Y.A., Chen, M.S., Wu, J.S., Yang, B.Y.: Postquantum ssl/tls for embedded systems. In: 2014 IEEE 7th International Conference on Service-Oriented Computing and Applications. pp. 266–270. IEEE (2014)
  • [15]Cheng, H., Fotiadis, G., GroszschΓ€dl, J., Ryan, P.Y.: Highly vectorized sike for avx-512. IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES) 2022(2) (2022)
  • [16]Cheng, H., Fotiadis, G., GroszschΓ€dl, J., Ryan, P.Y., Roenne, P.: Batching csidh group actions using avx-512. IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES) 2021(4), 618–649 (2021)
  • [17]Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex fourier series. Mathematics of computation 19(90), 297–301 (1965)
  • [18]Crockett, E., Paquin, C., Stebila, D.: Prototyping post-quantum and hybrid key exchange and authentication in tls and ssh. In: NIST 2nd PQC Standardization Conference. Santa Barbara, California (August 2019), published elsewhere
  • [19]DouglasStebila, M.M.: liboqs project (2024), https://github.com/open-quantum-safe/liboqs
  • [20]Dowling, B., Fischlin, M., GΓΌnther, F., Stebila, D.: A cryptographic analysis of the tls 1.3 handshake protocol. Journal of Cryptology 34(4), 37 (2021)
  • [21]Fan, J., Willems, F., Zahed, J., Gray, J., Mister, S., Ounsworth, M., Adams, C.: Impact of post-quantum hybrid certificates on pki, common libraries, and protocols. International Journal of Security and Networks 16(3), 200–211 (2021)
  • [22]FIPS, P.: Secure hash algorithm-3 (sha-3) standard: Permutation-based hash and extendable-output functions. National Institute for Standards and Technology (NIST) 202(0) (2014)
  • [23]Garcia, C.R., Aguilera, A.C., Olmos, J.J.V., Monroy, I.T., Rommel, S.: Quantum-resistant tls 1.3: A hybrid solution combining classical, quantum and post-quantum cryptography. In: 2023 IEEE 28th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD). pp. 246–251. IEEE (2023)
  • [24]Gentleman, W.M., Sande, G.: Fast fourier transforms: for fun and profit. In: Proceedings of the November 7-10, 1966, fall joint computer conference. pp. 563–578 (1966)
  • [25]Gonzalez, R., Wiggers, T.: Kemtls vs. post-quantum tls: Performance on embedded systems. In: International Conference on Security, Privacy, and Applied Cryptography Engineering. pp. 99–117. Springer (2022)
  • [26]Henrich, J., Heinemann, A., Wiesmaier, A., Schmitt, N.: Performance impact of pqc kems on tls 1.3 under varying network characteristics. In: International Conference on Information Security. pp. 267–287. Springer (2023)
  • [27]Hofheinz, D., HΓΆvelmanns, K., Kiltz, E.: A modular analysis of the fujisaki-okamoto transformation. In: Theory of Cryptography Conference. pp. 341–371. Springer (2017)
  • [28]Huguenin-Dumittan, L., Vaudenay, S.: On ind-qcca security in the ROM and its applications - CPA security is sufficient for TLS 1.3. In: Dunkelman, O., Dziembowski, S. (eds.) Advances in Cryptology - EUROCRYPT 2022 - 41st Annual International Conference on the Theory and Applications of Cryptographic Techniques, Trondheim, Norway, May 30 - June 3, 2022, Proceedings, Part III. Lecture Notes in Computer Science, vol. 13277, pp. 613–642. Springer (2022). https://doi.org/10.1007/978-3-031-07082-2_22, https://doi.org/10.1007/978-3-031-07082-2_22
  • [29]Internet Engineering Task Force: Hybrid Terminology for Post-Quantum Key Establishment. Tech. rep., Internet Engineering Task Force (2023)
  • [30]Jiang, H., Ma, Z., Zhang, Z.: Post-quantum security of key encapsulation mechanism against cca attacks with a single decapsulation query. In: International Conference on the Theory and Application of Cryptology and Information Security. pp. 434–468. Springer (2023)
  • [31]KrisKwiatkowski, L.V.: The tls post-quantum experiment (2019), https://blog.cloudflare.com/the-tls-post-quantum-experiment
  • [32] Lei, D., He, D., Peng, C., Luo, M., Liu, Z., Huang, X.: Faster implementation of ideal lattice-based cryptography using avx512. ACM Transactions on Embedded Computing Systems 22(5), 1–18 (2023)
  • [33] Azouaoui, M., Bos, J.W.: Surviving the fo-calypse: Securing pqc implementations in practice (2022), https://iacr.org/submit/files/slides/2022/rwc/rwc2022/48/slides.pdf
  • [34] Montgomery, P.L.: Modular multiplication without trial division. Mathematics of computation 44(170), 519–521 (1985)
  • [35] National Institute of Standards and Technology: Post-quantum cryptography standardization: Selected algorithms (2022), https://csrc.nist.gov/Projects/post-quantum-cryptography/selected-algorithms-2022
  • [36] Ashley, N., Baentsch, M.: Open quantum safe: software for the transition to quantum-resistant cryptography (2024), https://openquantumsafe.org/
  • [37] Pablos, J.I.E., Marriaga, M.E., del Pozo, Á.L.P.: Design and implementation of a post-quantum group authenticated key exchange protocol with the liboqs library: A comparative performance analysis from classic mceliece, dowling, ntru, and saber. IEEE Access 10, 120951–120983 (2022)
  • [38] Paquin, C., Stebila, D., Tamvada, G.: Benchmarking post-quantum cryptography in tls. In: Post-Quantum Cryptography: 11th International Conference, PQCrypto 2020, Paris, France, April 15–17, 2020, Proceedings 11. pp. 72–91. Springer (2020)
  • [39] Paul, S., Kuzovkova, Y., Lahr, N., Niederhagen, R.: Mixed certificate chains for the transition to post-quantum authentication in tls 1.3. In: Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security. pp. 727–740 (2022)
  • [40] Paul, S., Schick, F., Seedorf, J.: Tpm-based post-quantum cryptography: A case study on quantum-resistant and mutually authenticated tls for iot environments. In: Proceedings of the 16th International Conference on Availability, Reliability and Security. pp. 1–10 (2021)
  • [41] Rescorla, E.: The transport layer security TLS protocol version 1.3. Tech. Rep. RFC 8446, RFC Editor (Aug 2018), https://doi.org/10.17487/RFC8446
  • [42] Roy, S.S.: Saberx4: High-throughput software implementation of saber key encapsulation mechanism. In: 2019 IEEE 37th International Conference on Computer Design (ICCD). pp. 321–324. IEEE (2019)
  • [43] Schwabe, P., Stebila, D., Wiggers, T.: Post-quantum tls without handshake signatures. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. pp. 1461–1480 (2020)
  • [44] Schwabe, P., Stebila, D., Wiggers, T.: More efficient post-quantum kemtls with pre-distributed public keys. In: Computer Security – ESORICS 2021: 26th European Symposium on Research in Computer Security, Darmstadt, Germany, October 4–8, 2021, Proceedings, Part I 26. pp. 3–22. Springer (2021)
  • [45] Seiler, G.: Faster AVX2 optimized NTT multiplication for ring-lwe lattice cryptography. IACR Cryptol. ePrint Arch. p.39 (2018), http://eprint.iacr.org/2018/039
  • [46] Shor, P.W.: Algorithms for quantum computation: discrete logarithms and factoring. In: Proceedings 35th annual symposium on foundations of computer science. pp. 124–134. IEEE (1994)
  • [47] Sikeridis, D., Kampanakis, P., Devetsikiotis, M.: Assessing the overhead of post-quantum cryptography in TLS 1.3 and SSH. In: Han, D., Feldmann, A. (eds.) CoNEXT ’20: The 16th International Conference on emerging Networking EXperiments and Technologies, Barcelona, Spain, December, 2020. pp. 149–156. ACM (2020). https://doi.org/10.1145/3386367.3431305
  • [48] Sikeridis, D., Kampanakis, P., Devetsikiotis, M.: Post-quantum authentication in TLS 1.3: A performance study. In: 27th Annual Network and Distributed System Security Symposium, NDSS 2020, San Diego, California, USA, February 23-26, 2020. The Internet Society (2020), https://www.ndss-symposium.org/ndss-paper/post-quantum-authentication-in-tls-1-3-a-performance-study/
  • [49] Sosnowski, M., Wiedner, F., Hauser, E., Steger, L., Schoinianakis, D., Gallenmüller, S., Carle, G.: The performance of post-quantum tls 1.3. In: Companion of the 19th International Conference on emerging Networking EXperiments and Technologies. pp. 19–27 (2023)
  • [50] National Institute of Standards and Technology: Module-lattice-based digital signature standard. Federal Information Processing Standards Publication (FIPS) NIST FIPS 204 ipd, Department of Commerce, Washington, D.C. (2023)
  • [51] National Institute of Standards and Technology: Module-lattice-based key-encapsulation mechanism standard. Federal Information Processing Standards Publication (FIPS) NIST FIPS 203 ipd, Department of Commerce, Washington, D.C. (2023)
  • [52] National Institute of Standards and Technology: Stateless hash-based digital signature standard. Federal Information Processing Standards Publication (FIPS) NIST FIPS 205 ipd, Department of Commerce, Washington, D.C. (2023)
  • [53] Tasopoulos, G., Li, J., Fournaris, A.P., Zhao, R.K., Sakzad, A., Steinfeld, R.: Performance evaluation of post-quantum tls 1.3 on resource-constrained embedded systems. In: International Conference on Information Security Practice and Experience. pp. 432–451. Springer (2022)
  • [54] Westerbaan, B.: When to barrett reduce in the inverse ntt. Cryptology ePrint Archive (2020)
  • [55] Zhang, J., Huang, J., et al.: ENG25519: Faster TLS 1.3 handshake using optimized X25519 and Ed25519. In: Usenix Security (2024)
  • [56] Zheng, J., Zhu, H., Song, Z., Wang, Z., Zhao, Y.: Optimized vectorization implementation of crystals-dilithium. arXiv preprint arXiv:2306.01989 (2023)