Revisiting Keccak and Dilithium Implementations on ARMv7-M

Keccak is widely used in lattice-based cryptography (LBC), and its impact on the overall running time of LBC schemes can be predominant on platforms lacking dedicated SHA-3 instructions. This holds true on embedded devices for Kyber and Dilithium, two LBC schemes selected by NIST to be standardized a...


Bibliographic Details
Main Authors: Junhao Huang, Alexandre Adomnicăi, Jipeng Zhang, Wangchen Dai, Yao Liu, Ray C. C. Cheung, Çetin Kaya Koç, Donglong Chen
Format: Article
Language: English
Published: Ruhr-Universität Bochum 2024-03-01
Series: Transactions on Cryptographic Hardware and Embedded Systems
Subjects: Keccak; Dilithium; ARMv7-M; Plantard arithmetic; lattice-based cryptography
Online Access: https://tches.iacr.org/index.php/TCHES/article/view/11419
_version_ 1797262239929991168
author Junhao Huang
Alexandre Adomnicăi
Jipeng Zhang
Wangchen Dai
Yao Liu
Ray C. C. Cheung
Çetin Kaya Koç
Donglong Chen
collection DOAJ
description Keccak is widely used in lattice-based cryptography (LBC), and its impact on the overall running time of LBC schemes can be predominant on platforms lacking dedicated SHA-3 instructions. This holds true on embedded devices for Kyber and Dilithium, two LBC schemes selected by NIST to be standardized as quantum-safe cryptographic algorithms. While extensive work has been done to optimize the polynomial arithmetic in these schemes, it was generally assumed that Keccak implementations were already optimal and left little room for enhancement. In this paper, we revisit various optimization techniques for both Keccak and Dilithium on two ARMv7-M processors, namely Cortex-M3 and Cortex-M4. For Keccak, we improve efficiency using two architecture-specific optimizations: lazy rotation and memory access pipelining. These optimizations yield performance gains of up to 24.78% and 21.4% for the largest Keccak permutation instance on Cortex-M3 and M4, respectively. For Dilithium, we first apply the multi-moduli NTT to the small polynomial multiplication ct_i on Cortex-M3. We then thoroughly integrate the efficient Plantard arithmetic into the 16-bit NTTs used to compute the small polynomial multiplications cs_i and ct_i on Cortex-M3 and M4. We show that the multi-moduli NTT combined with the efficient Plantard arithmetic obtains significant speed-ups for the small polynomial multiplications of Dilithium on Cortex-M3. Combining all of these optimizations for Keccak and Dilithium, we obtain speed-ups of 15.44% to 23.75% and 13.94% to 15.52% for Dilithium on Cortex-M3 and M4, respectively. Furthermore, we demonstrate that the Keccak optimizations yield speed-ups of 13.35% to 15.00% for Kyber and decrease the proportion of time spent on hashing in Dilithium and Kyber by 2.46% to 5.03% on Cortex-M4.
first_indexed 2024-04-24T23:53:58Z
format Article
id doaj.art-5e78fc90c44a4a0ea9d9e6a83c4bc3c9
institution Directory Open Access Journal
issn 2569-2925
language English
last_indexed 2024-04-24T23:53:58Z
publishDate 2024-03-01
publisher Ruhr-Universität Bochum
record_format Article
series Transactions on Cryptographic Hardware and Embedded Systems
spelling doaj.art-5e78fc90c44a4a0ea9d9e6a83c4bc3c9
2024-03-14T16:24:50Z
eng
Ruhr-Universität Bochum
Transactions on Cryptographic Hardware and Embedded Systems
2569-2925
2024-03-01
Volume 2024, Issue 2
DOI 10.46586/tches.v2024.i2.1-24
Revisiting Keccak and Dilithium Implementations on ARMv7-M
Junhao Huang (Guangdong Provincial Key Laboratory IRADS, BNU-HKBU United International College, Zhuhai, China; Hong Kong Baptist University, Hong Kong, China)
Alexandre Adomnicăi (Independent researcher, Paris, France)
Jipeng Zhang (Nanjing University of Aeronautics and Astronautics, Nanjing, China)
Wangchen Dai (Zhejiang Lab, Hangzhou, China)
Yao Liu (Sun Yat-sen University, Zhuhai, China)
Ray C. C. Cheung (City University of Hong Kong, Hong Kong, China)
Çetin Kaya Koç (Nanjing University of Aeronautics and Astronautics, Nanjing, China; Iğdır University, Merkez, Turkey; University of California Santa Barbara, Santa Barbara, USA)
Donglong Chen (Guangdong Provincial Key Laboratory IRADS, BNU-HKBU United International College, Zhuhai, China)
title Revisiting Keccak and Dilithium Implementations on ARMv7-M
topic Keccak
Dilithium
ARMv7-M
Plantard arithmetic
lattice-based cryptography
url https://tches.iacr.org/index.php/TCHES/article/view/11419