Fast and Clean: Auditable high-performance assembly via constraint solving
Handwritten assembly is a widely used tool in the development of high-performance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only ti...
Main Authors: | Amin Abdulrahman, Hanno Becker, Matthias J. Kannwischer, Fabien Klein |
---|---|
Format: | Article |
Language: | English |
Published: | Ruhr-Universität Bochum, 2023-12-01 |
Series: | Transactions on Cryptographic Hardware and Embedded Systems |
Subjects: | Superoptimization, Constraint Solving, Cryptography, Post-Quantum Cryptography, Armv8.1-M, AArch64 |
Online Access: | https://tches.iacr.org/index.php/TCHES/article/view/11241 |
_version_ | 1797403985031725056 |
---|---|
author | Amin Abdulrahman, Hanno Becker, Matthias J. Kannwischer, Fabien Klein |
author_facet | Amin Abdulrahman, Hanno Becker, Matthias J. Kannwischer, Fabien Klein |
author_sort | Amin Abdulrahman |
collection | DOAJ |
description |
Handwritten assembly is a widely used tool in the development of high-performance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain – threatening their suitability for use in practice.
In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture.
We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72 microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability.
|
first_indexed | 2024-03-09T02:46:29Z |
format | Article |
id | doaj.art-a535fe408b5a476386b59dcac10fdcd0 |
institution | Directory Open Access Journal |
issn | 2569-2925 |
language | English |
last_indexed | 2024-03-09T02:46:29Z |
publishDate | 2023-12-01 |
publisher | Ruhr-Universität Bochum |
record_format | Article |
series | Transactions on Cryptographic Hardware and Embedded Systems |
spelling | doaj.art-a535fe408b5a476386b59dcac10fdcd0; 2023-12-05T16:13:01Z; eng; Ruhr-Universität Bochum; Transactions on Cryptographic Hardware and Embedded Systems; 2569-2925; 2023-12-01; 2024; 1; 10.46586/tches.v2024.i1.87-132; Fast and Clean: Auditable high-performance assembly via constraint solving; Amin Abdulrahman (Ruhr University Bochum, Bochum, Germany; Max Planck Institute for Security and Privacy, Bochum, Germany); Hanno Becker (Automated Reasoning Group, Amazon Web Services, Cambridge, United Kingdom); Matthias J. Kannwischer (Quantum Safe Migration Center, Chelpis Quantum Tech, Taipei, Taiwan); Fabien Klein (Arm Limited); https://tches.iacr.org/index.php/TCHES/article/view/11241; Superoptimization; Constraint Solving; Cryptography; Post-Quantum Cryptography; Armv8.1-M; AArch64 |
spellingShingle | Amin Abdulrahman; Hanno Becker; Matthias J. Kannwischer; Fabien Klein; Fast and Clean: Auditable high-performance assembly via constraint solving; Transactions on Cryptographic Hardware and Embedded Systems; Superoptimization; Constraint Solving; Cryptography; Post-Quantum Cryptography; Armv8.1-M; AArch64 |
title | Fast and Clean: Auditable high-performance assembly via constraint solving |
title_full | Fast and Clean: Auditable high-performance assembly via constraint solving |
title_fullStr | Fast and Clean: Auditable high-performance assembly via constraint solving |
title_full_unstemmed | Fast and Clean: Auditable high-performance assembly via constraint solving |
title_short | Fast and Clean: Auditable high-performance assembly via constraint solving |
title_sort | fast and clean auditable high performance assembly via constraint solving |
topic | Superoptimization, Constraint Solving, Cryptography, Post-Quantum Cryptography, Armv8.1-M, AArch64 |
url | https://tches.iacr.org/index.php/TCHES/article/view/11241 |
work_keys_str_mv | AT aminabdulrahman fastandcleanauditablehighperformanceassemblyviaconstraintsolving AT hannobecker fastandcleanauditablehighperformanceassemblyviaconstraintsolving AT matthiasjkannwischer fastandcleanauditablehighperformanceassemblyviaconstraintsolving AT fabienklein fastandcleanauditablehighperformanceassemblyviaconstraintsolving |