# High Performance Modular Multiplication for SIDH 

Weiqiang Liu ${ }^{\odot}$, Senior Member, IEEE, Ziying Ni, Jian Ni, Ciara Rafferty, Member, IEEE, and Máire O'Neill, Senior Member, IEEE


#### Abstract

The latest research indicates that quantum computers will be realized in the near future. In theory, the computation speed of a quantum computer is much faster than current computers, which will pose a serious threat to current cryptosystems. Post-quantum cryptography (PQC) is a class of cryptography based on underlying mathematical problems that are considered infeasible to crack even with access to a quantum computer. The supersingular isogeny Diffie-Hellman (SIDH) key exchange protocol is a new post-quantum cryptosystem, which offers advantages in reduced secret key length and attack resistance. SIDH is the basis of the supersingular isogeny key encapsulation (SIKE) protocol, which is in the second round of the U.S. National Institute of Standards and Technology (NIST) PQC standardization process. In this article, we propose a new modular multiplication algorithm and a new interleaved hardware architecture for SIDH. Performance results for the proposed modular multiplier using four parameter sets for the prime, $\boldsymbol{p}$ that correspond to the SIKE Round 2 parameter sets show significant advantages in speed.


Index Terms-FPGA, modular multiplication, post-quantum cryptography (PQC), supersingular isogeny Diffie-Hellman (SIDH).

## I. Introduction

The computational performance of a quantum computer is expected to be much higher than that of a traditional computer. In 1994, Shor [1] proposed a fast decomposition method for large numbers based on a quantum computer, whose complexity is $\mathcal{O}(\log N)$, and the larger the number, the better the performance of Shor's algorithm. Later in 1996, Grover [2] proposed a quantum algorithm that can be used to search an unsorted database faster than a traditional computer [quadratic speedup $\mathcal{O}(N / 2)$ time rather than $\mathcal{O}(N)$ ]. In recent years much effort has gone into the development of quantum computers by industry, and in 2019, Google developed a new 54-qubit processor, which is 9 orders of magnitude faster than the fastest supercomputer [3]. Thus, it is predictable that quantum computers will become practical in the near future.

Traditional encryption methods, such as RSA [4] based on the large number factorization problem, and elliptic curve cryptography (ECC) [5], based on the discrete logarithm problem, will be easily

[^0]broken by quantum computers, which will have an important impact on current communications and network security.
Post-quantum cryptography (PQC) [6] is a form of public-key cryptography with high complexity, and includes encryption, digital signatures, and key encapsulation mechanisms. PQC contains a variety of algorithms based on different hard problems [7], such as lattice-based cryptography, multivariate quadratic cryptography, hashbased signatures, and code-based cryptography. The supersingular isogeny Diffie-Hellman (SIDH) [8] key exchange protocol is the most recently proposed PQC scheme. SIDH is based on the theory of point addition and point doubling in ECC, combined with the theory of isogenies in elliptic curves. Two curves $E / K$ and $E^{\prime} / K$ are isogenous over $K$ if there exists a morphism $\phi: E \rightarrow E^{\prime}$ with coefficients in $K$ mapping the neutral element of $E$ to the neutral element of $E^{\prime}$. Therefore, SIDH is more complex than ECC, and is believed to be resistant to attacks by quantum computers [9]. In addition, SIDH is characterized by relatively small key sizes compared with other PQC schemes. SIDH is the basis of the supersingular isogeny key encapsulation (SIKE) protocol [10], which is a candidate in the NIST PQC standardization process. However, as SIDH was proposed more than a decade after other PQC cryptosystems, there is still much work needed to understand its practical capability.
In 2011, Jao and De Feo [8] proposed the SIDH algorithm and implemented it for the first time in software. They used a 768bit modulus to satisfy 128 -bit security and the result is two orders of magnitude faster than isogeny-based cryptosystems over ordinary curves. In 2016, Costello et al. [9] proposed a fast software implementation for SIDH, whose computing speed is 2.5 times faster than the original software implementations. In the same year, the first hardware implementation of SIDH on Virtex-7 FPGA was presented in [11]. The implementation is $50 \%$ faster than the fastest known software implementation. Most recently, in 2019, Koziel et al. [12] improved the Montgomery multiplier with an optimized radix, which is faster than the previous designs.

In SIDH, the modulus $p=l_{a}^{e_{A}} l_{b}^{e_{B}} \cdot f \pm 1$, where $l_{a}$ and $l_{b}$ are two small prime numbers, and $f$ is a small number, so that $p$ conforms to the form of a prime number. In general, $\log _{2} l_{a}^{e_{A}}$ is required to be approximately $\log _{2} l_{b}^{e_{B}}$ to ensure that either side of A and $B$ is equally difficult for attackers. The modulus plays an important role in public key cryptography schemes. Karmakar et al. [13] proposed a fast modulus multiplication method for the modulus in this special field. Liu et al. [14] improved the modular multiplication algorithm over [13] by proposing a new algorithm. Bos and Friedberger [15] compared several modular multiplication methods for SIDH mathematically, but did not provide hardware results.
This article proposes a new modular multiplication algorithm named high-performance finite field multiplication (HFFM) for the special field $F_{p}, p=f \cdot 2^{a} \cdot l_{b}^{e_{B}}-1$, which is suitable for a larger modulus range. In addition, to reduce the overall computation time, a new multipipelined interleaved architecture running on FPGA is also proposed.

The rest of this article is organized as follows. Section II reviews the special field modulus multiplication method. Section III proposes the new HFFM algorithm and its hardware architecture. Section IV
presents the results of the hardware implementation and compares it with the existing hardware implementations. Section V concludes this article.

## II. Review of Finite Field Multiplication for SidH

The Barrett reduction [16] is introduced first, which is used in the following algorithms. Barrett reduction consists of several multiplications and shift operations instead of the division operation. As a result, it is efficient for hardware implementation. The transformation of division in $a \bmod b$ is shown in (1), where $x=2^{k} / b$, $k=\log _{2} a$. For more details on Barrett reduction, please refer to [16]

$$
\begin{equation*}
\frac{1}{b}=\frac{\left(2^{k}\right) / b}{b \cdot 2^{k} / b}=\frac{\left(2^{k}\right) / b}{2^{k}} \approx \frac{x}{2^{k}} \tag{1}
\end{equation*}
$$

One of the moduli in the SIDH public key exchange protocol is $p=f \cdot 2^{a} 3^{b}-1$, where $f$ is a very small positive integer, such as 1,2 , etc. Based on this form of special field, [13] proves that a good performance can be achieved using Barrett reduction directly and proposed a fast modulus multiplication method named efficient finite field multiplication (EFFM) through mathematical transformation. Assume $p=2 \cdot 2^{a} \cdot 3^{b}-1$ and that both $a$ and $b$ are even numbers, and the radix $R=2^{a / 2} \cdot 3^{b / 2}$. Therefore, the number $A$ in this field can be expressed as follows:

$$
\begin{equation*}
A=a_{1} \cdot R^{2}+a_{2} \cdot R+a_{3}, a_{1} \in\{0,1\}, a_{2}, a_{3} \in[0, R) \tag{2}
\end{equation*}
$$

$B$ in this field also satisfies the structure of (2), and the product $C$ can be expressed as follows:

$$
\begin{align*}
C= & a_{1} b_{1} \cdot R^{4}+\left(a_{1} b_{2}+a_{2} b_{1}\right) \cdot R^{3}+\left(a_{1} b_{3}+a_{2} b_{2}+a_{3} b_{1}\right) \cdot R^{2} \\
& +\left(a_{2} b_{3}+a_{3} b_{2}\right) \cdot R+a_{3} b_{3} . \tag{3}
\end{align*}
$$

As $2 \cdot R^{2}=p+1$, we can get $R^{2} \equiv 2^{-1}(\bmod p) . R^{2}$ and $R^{4}$ in (3) can be replaced with $2^{-1}(\bmod p)$ and $2^{-2}(\bmod p)$, respectively, which can be precomputed and stored directly in the hardware to save computation time. Through the above transformation, (3) can be converted into (4) and $C$ can be expressed as $C=c_{1} \cdot R^{2}+c_{2} \cdot R+c_{3}$

$$
\begin{align*}
C= & \left(a_{1} b_{3}+a_{2} b_{2}+a_{3} b_{1}\right)(\bmod 2) \cdot R^{2} \\
& +\left(\left\lfloor\left(a_{1} b_{2}+a_{2} b_{1}\right) / 2\right\rfloor+\left(a_{2} b_{3}+a_{3} b_{2}\right)\right) \cdot R \\
& +\left(2^{-2}(\bmod p) a_{1} b_{1}+\left(a_{1} b_{2}+a_{2} b_{1}\right)(\bmod 2)\right) \cdot R / 2 \\
& +\left\lfloor\left(a_{1} b_{3}+a_{3} b_{1}+a_{2} b_{2} / 2\right)\right\rfloor+a_{3} b_{3} . \tag{4}
\end{align*}
$$

As $c_{2}$ and $c_{3}$ are beyond the range of $[0, R)$, a final division operation is required. Karmakar et al. [13] proposed the use of the Barrett division algorithm based on the special structure of the modulus and Barrett reduction. $2^{a / 2}$ is a factor of $R$ and the multiplication and division of 2 in the digital circuit only need shift operation. Thus, we can extract $2^{a / 2}$ from the divisor and divide $3^{b / 2}$ only. Since $b$ is a fixed integer, the Barrett reduction can be used. The above method can shorten the bit width of the dividend by a quarter. After the above process, the result of the Barrett division may be greater than $p$, so a subtraction is required for this case.

The values of $a_{1}$ and $b_{1}$ in (2) can only be 0 or 1 , and dividing the number in this field into three parts for calculation is not the best choice. Liu et al. [14] proposed the improved EFFM (FFM1) algorithm to optimize (4). Suppose $A$ in this special field still satisfies (2), as $(p-A)(p-B)(\bmod p)=A \cdot B(\bmod p)$, if $A$ is greater than $R^{2}$, then $A^{\prime}$ can be found using

$$
\begin{equation*}
A^{\prime}=p-A \quad\left(a_{1}=1\right) \tag{5}
\end{equation*}
$$

Equation (6) is the expression expanded by $A^{\prime}$, where $a_{i}^{\prime}=R-$ $1-a_{i}, i=2,3$

$$
\begin{equation*}
A^{\prime}=a_{2}^{\prime} \cdot R+a_{3}^{\prime}, a_{2}^{\prime}, a_{3}^{\prime} \in[0, R) \tag{6}
\end{equation*}
$$

Then, the product $C^{\prime}$ of $A^{\prime}$ and $B^{\prime}$ is expressed as (7). Moreover, it is not guaranteed that the two multiplicands $A$ and $B$ are greater than $R^{2}$ at the same time, and the final results may need to be modified. The modified process is shown in

$$
\begin{align*}
C^{\prime}= & a_{2}^{\prime} b_{2}^{\prime} \cdot R^{2}+\left(a_{2}^{\prime} b_{3}^{\prime}+a_{3}^{\prime} b_{2}^{\prime}\right) \cdot R+a_{3}^{\prime} b_{3}^{\prime} \\
= & \left(a_{2}^{\prime} b_{2}^{\prime}(\bmod 2)\right) \cdot R^{2}+\left(a_{2}^{\prime} b_{3}^{\prime}+a_{3}^{\prime} b_{2}^{\prime}\right) \cdot R \\
& +\left(a_{3}^{\prime} b_{3}^{\prime}+\left\lfloor a_{2}^{\prime} b_{2}^{\prime} / 2\right\rfloor\right) \\
= & c_{1}^{\prime} \cdot R^{2}+c_{2}^{\prime} \cdot R+c_{3}^{\prime}  \tag{7}\\
A \cdot B(\bmod p)= & A^{\prime} \cdot B^{\prime}(\bmod p) \cdot(-1)^{a_{1} \oplus b_{1}} . \tag{8}
\end{align*}
$$

In addition, $c_{2}^{\prime}$ and $c_{3}^{\prime}$ in (7) may be greater than $R$, and hence, may need to be reduced by Barrett division.
In order to solve the problem that only a small range of moduli can be used in the FFM1 algorithm, Liu et al. [14] proposed another finite field multiplication (FFM2) algorithm. The FFM2 algorithm no longer divides operands $A$ and $B . f, a$, and $b$ are no longer required to meet the above conditions as in FFM1. In addition, FFM2 algorithm using $p=2^{a} 3^{b} \cdot f+1$ and $p=2^{a} 3^{b} \cdot f-1$ are studied separately.

Assume that $p=2^{a} 3^{b} \cdot f-1$, in this case, we can get

$$
\begin{equation*}
f \cdot 2^{a} 3^{b}=p+1 \tag{9}
\end{equation*}
$$

Then, the product $C \bmod p$ can be expressed as follows:

$$
\begin{equation*}
C \equiv q \cdot(p+1)+r \equiv q p+q+r \equiv(q+r) \bmod p \tag{10}
\end{equation*}
$$

From (10), we know that when $p$ is $f \cdot 2^{a} 3^{b}-1$, we can first find the quotient and remainder of $C$ to $f \cdot 2^{a} 3^{b}$, and then add the quotient and remainder to get the result. However, the sum of the quotient and remainder may be greater than the modulus $p$; so the result needs to be modified. In (11), $r$ is the remainder and $q$ is the quotient. By deduction, we have

$$
\begin{equation*}
0 \leq r+q<2 p \tag{11}
\end{equation*}
$$

As it can be seen from (11), in this case, a subtraction may be required in the final stage to modify the final result. Similarly, when $p=f \cdot 2^{a} 3^{b}+1$, we have

$$
\begin{equation*}
C \equiv q \cdot(p-1)+r \equiv q p-q+r \equiv(r-q) \bmod p . \tag{12}
\end{equation*}
$$

By deduction, we have

$$
\begin{equation*}
-p \leq r-q \leq p \tag{13}
\end{equation*}
$$

When the difference is less than 0 , we have to add $p$ to the final result. For more details on the FFM1 and FFM2 algorithm, please refer to [11].

## III. HFFM Multiplication Algorithm and Its Hardware Architecture

## A. HFFM Algorithm for SIDH

The dividing and conquering method of the EFFM and FFM1 algorithms is an effective idea, and (10) can simplify the computation. Inspired by this, if $a / 2$ is used to preprocess the operands, only one Barrett reduction is required. Therefore, we choose $f \cdot l_{b}^{b}$ as a radix. For $f \cdot l_{b}^{b}$, the values of $f$ and $b$ will be chosen to make sure that $p$ is a prime.
Thus, the HFFM algorithm is proposed as follows. If $p=f \cdot 2^{a}$. $l_{b}^{b}-1$, then $f$ is a very small positive integer, and $a$ and $b$ do not need to be even. The number $A$ in this special field can be expressed as follows:

$$
\begin{equation*}
A=a_{1} \cdot f \cdot l_{b}^{b}+a_{2}, a_{1} \in\left[0,2^{a}\right), a_{2} \in\left[0, l_{b}^{b}\right) \tag{14}
\end{equation*}
$$



Fig. 1. Decomposition in the HFFM algorithm. (H: the first half of the operand, L: the second half of the operand).
$B$ in this field can also be written in the same expression as (14). Therefore, the product $C$ of $A$ and $B$ can be expressed as follows:

$$
\begin{equation*}
C=a_{1} b_{1} \cdot\left(f \cdot l_{b}^{b}\right)^{2}+\left(a_{1} b_{2}+a_{2} b_{1}\right) \cdot f \cdot l_{b}^{b}+a_{2} b_{2} . \tag{15}
\end{equation*}
$$

As $a_{1} b_{2}+a_{2} b_{1}$ is less than $2 p$ and the product of $2^{a}$ and $f \cdot l_{b}^{b}$ is $(p+1), a_{1} b_{2}+a_{2} b_{1}$ is split in half by $2^{a}$. Set $m_{1}=\left\lfloor\left(a_{1} b_{2}+\right.\right.$ $\left.\left.a_{2} b_{1}\right) / 2^{a}\right\rfloor$, which is the first half of $a_{1} b_{2}+a_{2} b_{1}$, and set $m_{2}=$ $\left(a_{1} b_{2}+a_{2} b_{1}\right)\left(\bmod 2^{a}\right)$, which is the second half of $a_{1} b_{2}+a_{2} b_{1}$. Then $\left(a_{1} b_{2}+a_{2} b_{1}\right) \cdot f \cdot l_{b}^{b}$ is converted into

$$
\begin{align*}
\left(a_{1} b_{2}+a_{2} b_{1}\right) \cdot f \cdot l_{b}^{b}= & \left(m_{1} \cdot 2^{a}+m_{2}\right) f \cdot l_{b}^{b} \\
= & m_{1} \cdot 2^{a}\left(f \cdot l_{b}^{b}\right)+m_{2} \cdot f \cdot l_{b}^{b} \\
= & m_{1} \cdot(p+1)+m_{2} \cdot f \cdot l_{b}^{b} \\
= & m_{1}+m_{2} \cdot f \cdot l_{b}^{b} \\
= & \left\lfloor\left(a_{1} b_{2}+a_{2} b_{1}\right) / 2^{a}\right\rfloor \\
& +\left(a_{1} b_{2}+a_{2} b_{1}\right)\left(\bmod 2^{a}\right) \cdot f \cdot l_{b}^{b} . \tag{16}
\end{align*}
$$

Equation (16) breaks $\left(a_{1} b_{2}+a_{2} b_{1}\right) \cdot f \cdot l_{b}^{b}$ into two parts. One part takes 1 as the radix and the other part takes $f \cdot l_{b}^{b}$ as the radix. The bit width of each part is only about half of $p$. Similarly, since $a_{1} b_{1}$ is less than $2 p$, we split it into two terms by $2^{a}$. After the first split, it will produce a radix of $\left(f \cdot l_{b}^{b}\right)^{2}$. As the bit width of $f \cdot l_{b}^{b}$ and $\left(a_{1} b_{1}\right)\left(\bmod 2^{a}\right)$ are both about half of $p$, the product can be divided by $2^{a}$ as above. Therefore, $a_{1} b_{1} \cdot\left(f \cdot l_{b}^{b}\right)^{2}$ is converted as follows:

$$
\begin{align*}
& a_{1} b_{1}\left(f \cdot l_{b}^{b}\right)^{2} \\
& =\left\lfloor\left(a_{1} b_{1}\right) / 2^{a}\right\rfloor\left(f \cdot l_{b}^{b}\right)+\left(a_{1} b_{1}\right)\left(\bmod 2^{a}\right)\left(f \cdot l_{b}^{b}\right)^{2} \\
& =\left\lfloor\left(a_{1} b_{1}\right) / 2^{a}\right\rfloor\left(f \cdot l_{b}^{b}\right)+\left\lfloor\left(\left(a_{1} b_{1}\right)\left(\bmod 2^{a}\right)\left(f \cdot l_{b}^{b}\right) / 2^{a}\right)\right\rfloor \\
& \quad+\left(\left(\left(a_{1} b_{1}\right)\left(\bmod 2^{a}\right) \cdot\left(f \cdot l_{b}^{b}\right)\right)\left(\bmod 2^{a}\right)\right) \cdot\left(f \cdot l_{b}^{b}\right) . \tag{17}
\end{align*}
$$

Equation (15) is finally expressed as follows:

$$
\begin{align*}
C= & c_{1} \cdot\left(f \cdot l_{b}^{b}\right)+c_{2} \\
= & \left(\left(a_{1} b_{2}+a_{2} b_{1}\right)\left(\bmod 2^{a}\right)+\left\lfloor\left(a_{1} b_{1}\right) / 2^{a}\right\rfloor\right. \\
& \left.+\left(\left(\left(a_{1} b_{1}\right)\left(\bmod 2^{a}\right) \cdot\left(f \cdot l_{b}^{b}\right)\right)\left(\bmod 2^{a}\right)\right)\right) \cdot\left(f \cdot l_{b}^{b}\right) \\
& +\left(a_{2} b_{2}+\left\lfloor\left(a_{1} b_{2}+a_{2} b_{1}\right) / 2^{a}\right\rfloor\right. \\
& \left.+\left\lfloor\left(\left(a_{1} b_{1}\right)\left(\bmod 2^{a}\right)\left(f \cdot l_{b}^{b}\right) / 2^{a}\right)\right\rfloor\right) . \tag{18}
\end{align*}
$$

The above decomposition process is shown in Fig. 1. In addition, in the initial stage of the algorithm, the Karatsuba algorithm can be used to calculate $a_{1} b_{2}+a_{2} b_{1}$ to save one multiplication. $c_{2}$ in (18) may be larger than $f \cdot l_{b}^{b}$; so Barrett reduction is used to ensure $c_{2}$ falls in $\left[0, f \cdot l_{b}^{b}\right)$. To further save multiplication operations, we only

```
Algorithm 1 HFFM Multiplication With Special Field
    input : \(A=a_{1} f_{b}^{b}+a_{2} ; B=b_{1} f_{b}^{b}+b_{2}\)
    output : \(C=c_{1} f_{b}^{b}+c_{2}\)
    compute \(a_{1} b_{1},\left(a_{1}+a_{2}\right)\left(b_{1}+b_{2}\right), a_{2} b_{2}, a_{1} b_{2}+a_{2} b_{1}=\)
    \(\left(a_{1}+a_{2}\right)\left(b_{1}+b_{2}\right)-a_{1} b_{1}-a_{2} b_{2}, r_{1}=a_{1} b_{1} \bmod 2^{a}, q_{1}=\)
    \(\left\lfloor a_{1} b_{1} / 2^{a}\right\rfloor, r_{1} f_{b}^{b}\);
    \(c_{2}=a_{2} b_{2}+\left\lfloor\left(a_{2} b_{1}+a_{1} b_{2}\right) / 2^{a}\right\rfloor ;\)
    \(c_{2}=c_{2}+\left\lfloor r_{1} f_{b}^{b} / 2^{a}\right\rfloor ;\)
    \(c_{1}=q_{1}+\left(a_{2} b_{1}+a_{1} b_{2}\right) \bmod 2^{a} ;\)
    \(c_{1}=c_{1}+r_{1} f_{b}^{b} \bmod 2^{a}\);
    Barrett Reduction \(\left(c_{2}\right) \Rightarrow c_{2}=r, c_{1}=p+q\);
    \(c_{1}=c_{1} \bmod 2^{a} ; c_{2}=c_{2}+\left\lfloor c_{1} / 2^{a}\right\rfloor\).
```



Fig. 2. Proposed new architecture for $N \cdot N$ multiplication in eight cycles.
calculate the first half of the dividend in the Barrett reduction, which may produce a result of Barrett reduction that is more than $2 \cdot f$. $l_{b}^{b}$, and one extra subtraction is required. After that, the results of adding the quotient and $c_{1}$ may also be greater than $2^{a}$. Therefore, a shift operation may be required. In (15), splitting $a_{1} b_{2}+a_{2} b_{1}$ is not necessary; however, it shortens the coefficient bit width of $f \cdot l_{b}^{b}$. The proposed HFFM algorithm is shown in Algorithm 1.
Similar to the FFM1 algorithm, the HFFM algorithm divides the multiplier into two segments. However, the new algorithm uses $f \cdot l_{b}^{b}$ to segment operands and uses multiple shifting operations instead of partial division. As a result, the number of multiplications and additions in the HFFM algorithm are less than that of the FFM1 and FFM2 algorithms. The multiplication in the HFFM algorithm is also more friendly for pipelined hardware implementation. Compared with FFM2, if under the same modulus, the HFFM algorithm performs the same operation with only half of the bit width.

## B. New Hardware Architecture for the Proposed HFFM Algorithm

We also propose a new interleaved hardware architecture to increase the operating frequency and reduce the computation time of the SIDH modular multiplication. In this architecture, there are 9 $N / 6$-bit multipliers, $3 N / 3$-bit carry-save adders (CSAs), 1 ( $N / 3+1$ )bit CSA, $4(N / 3+1)$-bit adders, and $2 N$-bit adders. The whole design is controlled by a finite state machine (FSM).

As observed from previous designs, the large multiplier limits the operating frequency. Further splitting up the multiplication units can further increase the frequency. For this design, we choose the modulus $p=2^{a} \cdot l_{b}^{b} \cdot f-1$, and $K=a$ which is used for the shift operation, $N^{\prime}=\max \left(\operatorname{digit}\left(2^{a}\right), \operatorname{digit}\left(l_{b}^{b} \cdot f\right)\right)$. Complete $N^{\prime}$ to the point that it is divisible by 6 to get $N$. It takes 8 cycles to complete an $N \cdot N$ multiplication, which is divided into two stages. As shown in Fig. 2,

TABLE I
Comparison of Different Hardware Architectures and Algorithms of Modular Multiplication for Sidh on Virtex-7 FPGA
(F: Frequency, Mult.: The First Two Modular Multiplications)

| Prime | Designs | LUTs | FFs | DSPs | $\begin{array}{\|c\|} \hline \mathrm{F} \\ (\mathrm{MHz}) \end{array}$ | Mult. <br> (Cycles) | Mult. <br> (Time) | Interleave (Cycles) | Interleave (Time) | Improvement in Time(Mult.) | Improvement in Time (Interleave) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | Montgomery Multiplier [12] | - | - | 56 | 207 | 88 | 425ns | 61 | 295ns | 34.12\% | 22.37\% |
|  | HFFM Algorithm(N/6) | 6724 | 6216 | 36 | 236 | 66 | 280ns | 54 | 229ns |  |  |
| SIKEp503 | Design in [12] | 2952 | 5752 | 88 | 171 | 70 | 409ns | 49 | 286ns | $\mathbf{2 4 . 2 1 \%}$ | 11.19\% |
|  | HFFM Algorithm(N/8) | 7757 | 7357 | 64 | 213 | 66 | 310ns | 54 | 254ns |  |  |
| SIKEp610 | Montgomery Multiplier [12] | - | - | 78 | 207 | 121 | 585ns | 83 | 401 ns | 40.85\% | 29.43\% |
|  | HFFM Algorithm(N/6) | 8600 | 8004 | 81 | 191 | 66 | 346ns | 54 | 283ns |  |  |
| SIKEp751 | Design in [12] | - | - | 128 | 167 | 100 | 597 ns | 69 | 412ns | 31.32\% | 18.69\% |
|  | HFFM Algorithm(N/6) | 10768 | 10481 | 144 | 161 | 66 | 410ns | 54 | 335ns |  |  |
| $P_{771}$ | FFM1 Algorithm [14] | 16502 | 9661 | 122 | 61 | 64 | 1049ns | 64 | 1049ns | 58.63\% | 66.16\% |
|  | HFFM Algorithm(N/6) | 11007 | 10680 | 144 | 152 | 66 | 434ns | 54 | 355ns |  |  |



Fig. 3. Steps of $N \cdot N$ multiplications of the proposed modular multiplier.
it takes 4 cycles for the first stage to complete a multiplication of $N / 2 \cdot N / 2$.

In the first cycle of the first stage, we use $9 N / 6 \cdot N / 6$ multipliers to calculate the partial products. In the second cycle, the results from the first cycle are stored in a register. In addition, we calculate the sum of the lowest three partial products using an adder and a CSA, then store the $N / 6$-bit sum and pass the $(N / 6+1)$-bit carry to the next cycle. In order to reduce the delay of the third cycle, a CSA is used to perform preliminary operations on the middle three partial products. In the third cycle, a CSA and $(N / 3+1)$-bit adders are used to obtain the addition result of the last six partial products. After that, the $(N / 6+1)$-bit carry and two partial products are sent to a CSA for calculation. In the fourth cycle, we use two $(N / 3+1)$-bit adders to get the most significant ( $5 N / 6$ )-bit of the $N / 2 \cdot N / 2$ multiplication result.

If the data pipeline passes the 4 cycles in the first stage, then the next cycle after the first $N / 2 \cdot N / 2$ multiplication operation is complete will get the result of the second $N / 2 \cdot N / 2$ multiplication. In order to use an adder with a smaller number, the partial product of $N \cdot N$ is calculated in the order from the least significant part to the most significant part. Therefore, for the 4 cycles of the second stage, an N bit adder is used, and the adder gets the $N$-bit addition result in every clock cycle. If the second half of the result no longer participates in the operation, it will be stored in the register in advance. Fig. 2 shows the calculation of an $N \cdot N$ multiplication in 8 cycles.

However, the last multiplication $\left[\left(\left(a_{1} b_{1}\right) \bmod 2^{a}\right)\left(f \cdot l_{b}^{b}\right)\right]$ is dependant on the result of $a_{1} b_{1}$, while the other multiplications are independent of each other. For multiple $N \cdot N$ multiplications, the pipelined architecture can also be used. A four-stage pipeline architecture is proposed for use here, which can save 12 cycles.

Since only the first $N$-bit of the dividend is calculated when performing the first step of the Barrett reduction, the reduction result may be greater than 2 times $f \cdot l_{b}^{b}$. As it is an odd number (extract a multiple of 2 in $f$ and add it to $a$ ), the lowest significant bit (LSB) of the inversion of $f \cdot l_{b}^{b}$ is changed from 0 to 1 to get its complement. For the same reason, the complement of $2 f \cdot l_{b}^{b}$ is inverted by $f \cdot l_{b}^{b}$
first, and the LSB is replaced by 1 and then it is shifted one bit to the left. In addition, we only need to subtract the lower $(b+1)$ bits of $c_{2}$. The modular multiplication is complete.

It can be seen that all blocks of the proposed modular multiplier are not working at the same time, which causes low computational efficiency. There is a lengthy waiting time between the initial stage of the modular multiplier and the multiplication required by Barrett reduction; therefore, we consider interleaving the two modular multiplications. The specific steps of the operation are shown in Fig. 3.

We call each four cycles a layer. As we can see from Fig. 3, the interleaved multiplier calculates the four partial products required for the first modular multiplication at layers $0-4$, and then calculates four partial products required for the second modular multiplication at layers 5-8. At the same time, we calculate $c_{1}$ and $c_{2}$ of the first modular multiplication. Then the two multiplications in Barrett reduction are calculated in turn. There is only one cycle in the 10th layer and 13th layer because the operand of the next multiplication has not yet been calculated and needs to be extended by one cycle. Since the next two modular multiplications can be calculated at the beginning of each 0 layer, the iteration period is also shortened. At this time, the multiplier requires 66 cycles for the first two modular multiplications, and a multiplication iteration is completed every 54 cycles.

## IV. Results and Comparison

To offer a fair comparison with previous work, the proposed hardware modular multiplier architecture is implemented on a Virtex-7 xc7vx690tffg1157-3 device, and compiled with Xilinx Vivado 2018.2. The modulus $p=2^{387} \cdot 3^{242}-1$ and four parameter sets corresponding to the NIST Round 2 SIKE parameter sets are used.

The FPGA implementation results for the five parameter sets are listed in Table I. The proposed modular multiplier design is compared with the latest design [12] for parameter sets SIKEp503 and SIKEp751, while it is compared with the systolic array Montgomery multiplier from [12] for SIKEp434 and SIKEp610. Since there is no implementation for reference and every block in the systolic array Montgomery multiplier are the same, we select the highest frequency of SIKEp503 and SIKEp751 in [12] to estimate the frequency of the systolic array Montgomery multiplier of SIKEp434 and SIKEp610. We also compare with [14] for $P_{771}$. In order to save DSP resources, we use $16 \mathrm{~N} / 8$ multipliers in the $\operatorname{SIKEp} 503$, which reduces DSPs from 81 to 64.

From Table I, all of the proposed HFFM designs are faster than the previous designs. Compared with the optimized Montgomery multiplier in [12], the proposed design can still achieve 11-18\% speed improvement and compared with the conventional Montgomery multiplier, the proposed design achieves a $22-29 \%$ speed improvement. As can be seen from the table, In the case of SIKEp434 and

SIKEp503, the proposed design uses less DSP resources. Since a DSP can calculate a $17 \times 17$ unsigned multiplication [12] and can be converted into 345LUTs in Multiplier 12.0 of Vivado, the resources we used in SIKEp434 are similar to the previous design. The proposed design with SIKEp503 uses less resources. In addition, the proposed design for prime, $P_{771}$, is nearly two times faster than the previous reported equivalent design.

## V. Conclusion

In this article, a new high performance modular multiplication algorithm named HFFM for the specific fields in SIDH is proposed. This algorithm saves multiplications and additions compared with the previous algorithms. In addition, we propose a new interleaved multiplication architecture that allows two multiplications to be interleaved, which significantly saves computation time. Compared with previous work, the proposed multipliers are much faster. In addition, the proposed multiplier for the SIKEp503 parameter set uses even less hardware resources. Overall, it is evident that the proposed multipliers can be used to achieve high speed SIKE implementations in hardware.

## REFERENCES

[1] P. W. Shor, "Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer," SIAM J. Sci. Stat. Comput., vol. 26, no. 5, pp. 1484-1509, 1994.
[2] L. K. Grover, "A fast quantum mechanical algorithm for database search," in Proc. 28th ACM Symp. Theory Comput., 1996, pp. 212-219.
[3] F. Arute et al., "Quantum supremacy using a programmable superconducting processor," Nature, vol. 574, pp. 505-510, Oct. 2019.
[4] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Commun. ACM, vol. 21, no. 2, pp. 120-126, 1978.
[5] V. S. Miller, "Use of elliptic curves in cryptography," in Proc. 7th Annu. Int. Cryptol. Conf. Adv. Cryptol., vol. 218, 1985, pp. 417-426.
[6] D. J. Bernstein, "Introduction to post-quantum cryptography," in PostQuantum Cryptography. Berlin, Germany: Springer, 2009, pp. 1-14.
[7] L. Chen et al., "NIST: Report on post-quantum cryptography," Nat. Inst. Stand. Technol., Gaithersburg, MD, USA, Rep. NISTIR 8105, 2016.
[8] D. Jao and L. De Feo, "Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies," in Proc. 1st Int. Conf. Post Quantum Cryptography, 2011, pp. 19-34.
[9] C. Costello, P. Longa, and M. Naehrig, "Efficient algorithms for supersingular isogeny Diffie-Hellman," in Proc. 36th Annu. Int. Cryptol. Conf. Adv. Cryptol., vol. 9814, 2016, pp. 572-601.
[10] D. Jao et al., Supersingular Isogeny Key Encapsulation NIST Post Quantum Standardization Project, Gaithersburg, MD, USA, 2017.
[11] B. Koziel, R. Azarderakhsh, M. M. Kermani, and D. Jao, "Post-quantum cryptography on FPGA based on isogenies on elliptic curves," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 64, no. 1, pp. 86-99, Jan. 2017.
[12] B. Koziel, A.-B. Ackie, R. E. Khatib, R. Azarderakhsh, and M. Mozaffari-Kermani, "SIKE'd up: Fast and secure hardware architectures for supersingular isogeny key encapsulation," IACR Cryptol. ePrint Archive, Rep. 2019/711, 2019. [Online]. Available: https://eprint.iacr.org/2019/711
[13] A. Karmakar, S. S. Roy, F. Vercauteren, and I. Verbauwhede, "Efficient finite field multiplication for isogeny based post quantum cryptography," in Proc. 10th Int. Workshop Arithmetic Finite Fields, 2016, pp. 193-207.
[14] W. Liu, J. Ni, Z. Liu, C. Liu, and M. O’Neill, "Optimized modular multiplication for supersingular isogeny Diffie-Hellman," IEEE Trans. Comput., vol. 68, no. 8, pp. 1249-1255, Aug. 2019.
[15] J. W. Bos and S. J. Friedberger, "Arithmetic considerations for isogenybased cryptography," IEEE Trans. Comput., vol. 68, no. 7, pp. 979-990, Jul. 2019.
[16] P. Barrett, "Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor," in Proc. 7th Annu. Int. Cryptol. Conf. Adv. Cryptol., vol. 263, 1987, pp. 311-323.


[^0]:    Manuscript received April 23, 2019; revised September 18, 2019; accepted December 10, 2019. Date of publication December 17, 2019; date of current version September 18, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61871216, in part by the Fundamental Research Funds for the Central Universities China under Grant NE2019102, and in part by the Six Talent Peaks Project in Jiangsu Province under Grant 2018XYDXX-009. This article was recommended by Associate Editor Y. Makris. (Corresponding author: Weiqiang Liu.)
    Weiqiang Liu, Ziying Ni, and Jian Ni are with the College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China (e-mail: liuweiqiang@nuaa.edu.cn; nzy@nuaa.edu.cn; nijian@nuaa.edu.cn).
    Ciara Rafferty and Máire O'Neill are with the Centre for Secure Information Technologies, Queens University Belfast, Belfast BT3 9DT, U.K. (e-mail: c.m.raffertyg@qub.ac.uk; m.oneill@ecit.qub.ac.uk).

    Digital Object Identifier 10.1109/TCAD.2019.2960330

