Fast and Clean: Auditable high-performance assembly via constraint solving

Handwritten assembly is a widely used tool in the development of high-performance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only ti...


Bibliographic Details
Main Authors: Amin Abdulrahman, Hanno Becker, Matthias J. Kannwischer, Fabien Klein
Format: Article
Language: English
Published: Ruhr-Universität Bochum 2023-12-01
Series: Transactions on Cryptographic Hardware and Embedded Systems
Subjects: Superoptimization, Constraint Solving, Cryptography, Post-Quantum Cryptography, Armv8.1-M, AArch64
Online Access: https://tches.iacr.org/index.php/TCHES/article/view/11241
_version_ 1797403985031725056
author Amin Abdulrahman
Hanno Becker
Matthias J. Kannwischer
Fabien Klein
collection DOAJ
description Handwritten assembly is a widely used tool in the development of high-performance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain – threatening their suitability for use in practice. In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture. We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72 microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability.
first_indexed 2024-03-09T02:46:29Z
format Article
id doaj.art-a535fe408b5a476386b59dcac10fdcd0
institution Directory Open Access Journal
issn 2569-2925
language English
last_indexed 2024-03-09T02:46:29Z
publishDate 2023-12-01
publisher Ruhr-Universität Bochum
record_format Article
series Transactions on Cryptographic Hardware and Embedded Systems
spelling doaj.art-a535fe408b5a476386b59dcac10fdcd0; 2023-12-05T16:13:01Z; eng; Ruhr-Universität Bochum; Transactions on Cryptographic Hardware and Embedded Systems; 2569-2925; 2023-12-01; vol. 2024, no. 1; 10.46586/tches.v2024.i1.87-132
Fast and Clean: Auditable high-performance assembly via constraint solving
Amin Abdulrahman (Ruhr University Bochum, Bochum, Germany; Max Planck Institute for Security and Privacy, Bochum, Germany)
Hanno Becker (Automated Reasoning Group, Amazon Web Services, Cambridge, United Kingdom)
Matthias J. Kannwischer (Quantum Safe Migration Center, Chelpis Quantum Tech, Taipei, Taiwan)
Fabien Klein (Arm Limited)
https://tches.iacr.org/index.php/TCHES/article/view/11241
Superoptimization; Constraint Solving; Cryptography; Post-Quantum Cryptography; Armv8.1-M; AArch64
title Fast and Clean: Auditable high-performance assembly via constraint solving
topic Superoptimization
Constraint Solving
Cryptography
Post-Quantum Cryptography
Armv8.1-M
AArch64
url https://tches.iacr.org/index.php/TCHES/article/view/11241