Download TomsFastMath User Manual v0.12
Transcript
TomsFastMath User Manual v0.12 Tom St Denis [email protected] March 14, 2007 This text and library are all hereby placed in the public domain. This book has been formatted for B5 [176x250] paper using the LATEX book macro package. This project was sponsored in part by Secure Science Corporation http://www.securescience.net. Contents 1 Introduction 1.1 What is TomsFastMath? . . . . . . 1.2 License . . . . . . . . . . . . . . . . 1.3 Building . . . . . . . . . . . . . . . 1.3.1 Intel CC . . . . . . . . . . . 1.3.2 MSVC . . . . . . . . . . . . 1.3.3 Build Limitations . . . . . . 1.3.4 Optimization Configuration 1.3.5 Build Configurations . . . . 1.3.6 Precision Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 2 2 2 2 3 3 5 5 2 Getting Started 2.1 Data Types . . . . . . . . . . . . 2.2 Initialization . . . . . . . . . . . 2.2.1 Simple Initialization . . . 2.2.2 Initialize Small Constants 2.2.3 Initialize Copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 7 8 8 8 8 3 Arithmetic Operations 3.1 Odds and Evens . . . . 3.2 Sign Manipulation . . . 3.3 Comparisons . . . . . . 3.4 Shifting . . . . . . . . . 3.5 Basic Algebra . . . . . . 3.6 Modular Exponentiation 3.7 Number Theoretic . . . 3.8 Prime Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 9 9 10 10 11 11 11 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii . . . . . . . . 4 Porting TomsFastMath 4.1 Getting Started . . . . . . 4.2 Multiply with Comba . . 4.3 Squaring with Comba . . 4.4 Montgomery with Comba . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 13 13 15 17 List of Figures 1.1 Recommended Build Modes . . . . . . . . . . . . . . . . . . . . . v 5 Chapter 1 Introduction 1.1 What is TomsFastMath? TomsFastMath is meant to be a very fast yet still fairly portable and easy to port large integer arithmetic library written in ISO C. The goal specifically is to be able to perform very fast modular exponentiations and other related functions required for ECC, DH and RSA cryptosystems. Most of the library is pure ISO C portable source code while a small portion (three files) contain a mixture of ISO C and assembler inline fragments. Compared to LibTomMath this new library is meant to be much faster while sacrificing flexibiltiy. This is accomplished through several means. 1. The new code is slightly messier and contains asm blocks. 2. This uses fixed not multiple precision integers. 3. It is designed only for fast modular exponentiations [e.g. less flexibility]. To mitigate some of the problems that arise from using assembler it has been carefully and appropriately used where it would make the most gain in performance. Also we use macro’s for assembler code which allows new ports to be inserted easily. The new code uses fixed precision arithmetic which means at compile time you choose a maximum precision and all numbers are limited to that. This has the benefit of not requiring any memory heap operations (which are slow) in 1 2 CHAPTER 1. INTRODUCTION any of the functions. It has the downside that integers that are too large are truncated. The goal of this library is to be able to perform modular exponentiations (with an odd modulus) very fast. This is what takes the most time in systems such as RSA and DH. This also requires fast multiplication and squaring and has the side effect of speeding up ECC operations as well. 1.2 License TomsFastMath is public domain. 1.3 Building To build the library simply type “make”. Or to install in typical *unix like directories use “make install”. Similarly a shared library can be built with “make -f makefile.shared install”. You can build the test program with “make test”. To perform simple static testing (useful to test out new assembly ports) use the stest program. Type “make stest” and run it on your target. The program will perform three multiplications, squarings and montgomery reductions. Likely if your assembly code is invalid this code will exhibit the bug. 1.3.1 Intel CC In theory you should be able to build the library with CFLAGS="-O3 -ip" CC=icc make IGNORE_SPEED=1 However, Intels inline assembler is way less advanced than GCCs. As a result it doesn’t compile. Fortunately it doesn’t really matter. 1.3.2 MSVC The library doesn’t build with MSVC. Imagine that. 1.3. BUILDING 1.3.3 3 Build Limitations TomsFastMath has the following build requirements which are non–portable but under most circumstances not problematic. 1. “CHAR BIT” must be eight. 2. The “fp digit” type must be a multiple of eight bits long. 3. The “fp word” must be at least twice the length of fp digit. 1.3.4 Optimization Configuration By default TFM is configured for 32–bit digits using ISO C source code. This mode while portable is not very efficient. While building the library (from scratch) you can define one of several “CFLAGS” defines. For example, to build with with SSE2 optimizations type CFLAGS=-DTFM_SSE2 make clean libtfm.a x86–32 The “x86–32” mode is defined by “TFM X86” and covers all i386 and beyond processors. It requires GCC to build and only works with 32–bit digits. In this mode fp digit is 32–bits and fp word is 64–bits. This mode will be autodetected when building with GCC to an “i386” target. You can override this behaviour by defining TFM NO ASM or another optimization mode (such as SSE2). SSE2 The “SSE2” mode is defined by “TFM SSE2” and requires a Pentium 4, Pentium M or Athlon64 processor. It requires GCC to build. Note that you shouldn’t define both TFM X86 and TFM SSE2 at the same time. This mode only works with 32–bit digits. In this mode fp digit is 32–bits and fp word is 64–bits. While this mode will work on the AMD Athlon64 series of processors it is less efficient than the native “x86–64” mode and not recommended. There is an additional “TFM PRESCOTT” flag that you can define for P4 Prescott processors. This causes the mul/sqr functions to use x86 32 and the montgomery reduction to use SSE2 which is (so far) the fastest combination. If you are using an older (e.g. Northwood) generation P4 don’t define this. 4 CHAPTER 1. INTRODUCTION x86–64 The “x86–64” mode is defined by “TFM X86 64” and requires a “x86–64” capable processor (Athlon64 and future Pentium processors). It requires GCC to build and only works with 64–bit digits. Note that by enabling this mode it will automatically enable 64–bit digits. In this mode fp digit is 64–bits and fp word is 128–bits. This mode will be autodetected when building with GCC to an “x86–64” target. You can override this behaviour by defining TFM NO ASM. ARM The “ARM” mode is defined by “TFM ARM” and requires a ARMv4 with the M instructions (enhanced multipliers) or higher processor. It requires GCC and works with 32–bit digits. In this mode fp digit is 32–bits and fp word is 64–bits. PPC32 The “PPC32” mode is defined by “TFM PPC32” and requires a standard PPC processor. It doesn’t use altivec or other extensions so it should work on all compliant implementations of PPC. It requires GCC and works with 32–bit digits. In this mode fp digit is 32–bits and fp word is 64–bits. PPC64 The “PPC64” mode is defined by “TFM PPC64” and requires a 64–bit PPC processor. AVR32 The “AVR32” mode is defined by “TFM AVR32” and requires an Atmel AVR32 processor. Future Releases Future releases will support additional platform optimizations. Developers of MIPS and SPARC platforms are encouraged to submit GCC asm inline patches (see chapter 4 for more information). 5 1.3. BUILDING Processor All 32–bit x86 platforms Pentium 4 Pentium 4 Prescott Athlon64 ARMv4 or higher with M G3/G4 (32-bit PPC) G5 (64-bit PPC) Atmel AVR32 Recommended Mode TFM X86 TFM SSE2 TFM SSE2 + TFM PRESCOTT TFM X86 64 TFM ARM TFM PPC32 TFM PPC64 TFM AVR32 x86–32 or x86–64 (with GCC) Leave blank and let autodetect work Figure 1.1: Recommended Build Modes 1.3.5 Build Configurations TomsFastMath is configurable in terms of which unrolled code (if any) is included. By default, the majority of the code is included which results in large binaries. The first flag to try out is TFM ALREADY SET which tells TFM to turn off all unrolled code. This will result in a smaller library but also a much slower library. From this clean state, you can start enabling unrolled code for given cryptographic tasks at hand. A series of TFM MULXYZ and TFM SQRXYZ macros exist to enable specific unrolled code. For instance, TFM MUL32 will enable a 32 digit unrolled multiplier. For a complete list see the tfm.h header file. Keep in mind this is for digits not bits. For example, you should enable TFM MUL16 if you are doing 1024-bit exptmods on a 64–bit platform, enable TFM MUL32 on 32–bit platforms. To help developers use ECC there are a set of defines for the five NIST curve sizes. They are named TFM ECCXYZ where XYZ is one of 192, 224, 256, 384, or 521. These enable the multipliers and squaring code for a given curve, autodetecting 64–bit platforms as well. 1.3.6 Precision Configuration The precision of all integers in this library are fixed to a limited precision. Essentially the rule of setting the precision is if you plan on doing modular exponentiation with k–bit numbers than the precision must be fixed to 2k–bits plus four digits. 6 CHAPTER 1. INTRODUCTION This is changed by altering the value of “FP MAX SIZE” in tfm.h to your desired size. By default, the library is configured to handle upto 2048–bit inputs to the modular exponentiator. Chapter 2 Getting Started 2.1 Data Types TomsFastMath is a large fixed precision integer library. It provides the functionality to manipulate large signed integers through a relatively trivial api and a single data type. The “fp int” or fixed precision integer is the data type that the functions operate with. typedef struct { fp_digit dp[FP_SIZE]; int used, sign; } fp_int; The dp member is the array of digits that forms the number. It must always be zero padded. The used member is the count of digits used in the array. Although the precision is fixed the algorithms are still tuned to not process the entire array if it does not have to. The sign indicates the sign of the integer. It is FP ZPOS (0) if the integer is zero or positive and FP NEG (1) otherwise. 7 8 CHAPTER 2. GETTING STARTED 2.2 2.2.1 Initialization Simple Initialization To initialize an integer to the default state of zero use the fp init() function. void fp_init(fp_int *a); This will initialize the fp int a to zero. Note that the function fp zero() is an alias for fp init(). 2.2.2 Initialize Small Constants To initialize an integer with a small single digit value use the fp set() function. void fp_set(fp_int *a, fp_digit b); This will initialize a and set it equal to the digit b. 2.2.3 Initialize Copy To initialize an integer with a copy of another integer use the fp init copy() function. void fp_init_copy(fp_int *a, fp_int *b) This will initialize a as a copy of b. Note that for compatibility with LibTomMath the function fp copy() is also provided. Chapter 3 Arithmetic Operations 3.1 Odds and Evens To quickly and easily tell if an integer is zero, odd or even use the following functions. int fp_iszero(fp_int *a); int fp_iseven(fp_int *a); int fp_isodd(fp_int *a); These will return FP YES if the answer to their respective questions is yes. Otherwise they return FP NO. Note that these are implemented as macros and as such you should avoid using ++ or – – operators on the input operand. 3.2 Sign Manipulation To negate or compute the absolute of an integer use the following functions. void fp_neg(fp_int *a, fp_int *b); void fp_abs(fp_int *a, fp_int *b); This will compute the negation (or absolute) of a and store the result in b. Note that these are implemented as macros and as such you should avoid using ++ or – – operators on the input operand. 9 10 CHAPTER 3. ARITHMETIC OPERATIONS 3.3 Comparisons To perform signed or unsigned comparisons use following functions. int fp_cmp(fp_int *a, fp_int *b); int fp_cmp_mag(fp_int *a, fp_int *b); These will compare a to b. They will return FP GT if a is larger than b, FP EQ if they are equal and FP LT if a is less than b. The function fp cmp performs signed comparisons while the other performs unsigned comparisons. 3.4 Shifting To shift the digits of an fp int left or right use the following functions. void fp_lshd(fp_int *a, int x); void fp_rshd(fp_int *a, int x); These will shift the digits of a left (or right respectively) x digits. To shift individual bits of an fp int use the following functions. void void void void void void fp_div_2d(fp_int *a, int b, fp_int *c, fp_int *d); fp_mod_2d(fp_int *a, int b, fp_int *c); fp_mul_2d(fp_int *a, int b, fp_int *c); fp_mul_2(fp_int *a, fp_int *c); fp_div_2(fp_int *a, fp_int *c); fp_2expt(fp_int *a, int b); fp div 2d() will divide a by 2b and store the quotient in c and remainder in d. Either of c or d can be NULL if their value is not required. fp mod 2d() is a shortcut to compute the remainder directly. fp mul 2d() will multiply a by 2b and store the result in c. The fp mul 2() and fp div 2() functions are optimized multiplication and divisions by two. The function fp 2expt() will compute a = 2b quickly. To quickly count the number of least significant bits that are zero use the following function. int fp_cnt_lsb(fp_int *a); This will return the number of adjacent least significant bits that are zero. This is equivalent to the number of times two evenly divides a. 3.5. BASIC ALGEBRA 3.5 11 Basic Algebra The following functions round out the basic algebraic functionality of the library. void fp_add(fp_int *a, fp_int *b, fp_int *c); void fp_sub(fp_int *a, fp_int *b, fp_int *c); void fp_mul(fp_int *a, fp_int *b, fp_int *c); void fp_sqr(fp_int *a, fp_int *b); int fp_div(fp_int *a, fp_int *b, fp_int *c, fp_int *d); int fp_mod(fp_int *a, fp_int *b, fp_int *c); The functions fp add(), fp sub() and fp mul() perform their respective operations on a and b and store the result in c. The function fp sqr() computes b = a2 and is faster than using fp mul() to perform the same operation. The function fp div() divides a by b and stores the quotient in c and remainder in d. Either of c and d can be NULL if the result is not required. The function fp mod() is a simple shortcut to find the remainder. 3.6 Modular Exponentiation To compute a modular exponentiation use the following function. int fp_exptmod(fp_int *a, fp_int *b, fp_int *c, fp_int *d); This computes d ≡ ab (mod c) for any odd c and b. b may be negative so long as a−1 (mod c) exists. The initial value of a may be larger than c. The size of c must be half of the maximum precision used during the build of the library. For example, by default c must be less than 22048 . 3.7 Number Theoretic To perform modular inverses, greatest common divisor or least common multiples use the following functions. int fp_invmod(fp_int *a, fp_int *b, fp_int *c); void fp_gcd(fp_int *a, fp_int *b, fp_int *c); void fp_lcm(fp_int *a, fp_int *b, fp_int *c); 12 CHAPTER 3. ARITHMETIC OPERATIONS The fp invmod() function will find the modular inverse of a modulo an odd modulus b and store it in c (provided it exists). The function fp gcd() will compute the greatest common divisor of a and b and store it in c. Similarly the fp lcm() function will compute the least common multiple of a and b and store it in c. 3.8 Prime Numbers To quickly test a number for primality call this function. int fp_isprime(fp_int *a); This will return FP YES if a is probably prime. It uses 256 trial divisions and eight rounds of Rabin-Miller testing. Note that this routine performs modular exponentiations which means that a must be in a valid range of precision. Chapter 4 Porting TomsFastMath 4.1 Getting Started Porting TomsFastMath to a given processor target is usually a simple procedure. For the most part assembly is used to get around the lack of a “add with carry” operation in the C language. To make matters simpler the use of assembler is through macro blocks. Each “port” is defined by a block of code that re-defines the portable ISO C macros with assembler inline blocks. To add a new port you must designate a TFM XXX define that will enable your port when built. 4.2 Multiply with Comba The file “fp mul comba.c” is responsible for providing the fast multiplication within the library. This comba multiplication is fairly simple. It uses a sliding three digit carry system with the variables c0, c1, c2. For every digit of output c0 is the what will be that digit, c1 will carry into the next digit and c2 will be the “c1” carry for the next digit. For every “next” digit effectively c0 is stored as output, c1 moves into c0, c2 into c1 and zero into c2. The following macros define the assmebler interface to the code. #define COMBA_START This is issued at the beginning of the multiplication function. This is in place to allow you to initialize any registers or machine words required. You 13 14 CHAPTER 4. PORTING TOMSFASTMATH can leave it blank if you do not need it. #define COMBA_CLEAR \ c0 = c1 = c2 = 0; This clears the three comba carries. If you are going to place carries in registers then zero the appropriate registers. Note that the functions do not use c0, c1 or c2 directly so you are free to ignore these varibles and use registers directly. #define COMBA_FORWARD \ c0 = c1; c1 = c2; c2 = 0; This propagates the carries after a digit has been produced. #define COMBA_STORE(x) \ x = c0; This stores the c0 digit in the memory location specified by x. Note that if you manually aliased c0 with a register than just store that register in x. #define COMBA_STORE2(x) \ x = c1; This stores the c1 digit in the memory location specified by x. Note that if you manually aliased c1 with a register than just store that register in x. #define COMBA_FINI If at the end of the function you need to perform some action fill this macro in. #define t = c0 = c1 = MULADD(i, j) \ ((fp_word)i) * ((fp_word)j); \ (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \ (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2; This macro performs the “multiply and add” step that is central to the comba multiplier. It multiplies the fp digits i and j to produce a fp word result. Effectively the double–digit value is added to the three-digit carry formed by c0, c1, c2 where c0 is the least significant digit. 4.3. SQUARING WITH COMBA 4.3 15 Squaring with Comba Squaring is similar to multiplication except that it uses a special “multiply and add twice” macro that replaces multiplications that are not required. #define COMBA_START This allows for any initialization code you might have. #define CLEAR_CARRY \ c0 = c1 = c2 = 0; This will clear the carries. Like multiplication you can safely alias the three carry variables to registers if you can/want to. #define COMBA_STORE(x) \ x = c0; Store the c0 carry to a given memory location. #define COMBA_STORE2(x) \ x = c1; Store the c1 carry to a given memory location. #define CARRY_FORWARD \ c0 = c1; c1 = c2; c2 = 0; Forward propagate all three carry variables. #define COMBA_FINI If you need to clean up at the end of the function. /* multiplies point i and j, updates carry "c1" and digit c2 */ #define SQRADD(i, j) \ t = ((fp_word)i) * ((fp_word)j); \ c0 = (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \ c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2; This is essentially the MULADD macro from the multiplication code. 16 CHAPTER 4. PORTING TOMSFASTMATH /* for squaring some of the terms are #define SQRADD2(i, j) t = ((fp_word)i) * ((fp_word)j); c0 = (c0 + t); if (c0 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 c0 = (c0 + t); if (c0 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 doubled... */ \ \ < ((fp_digit)t)) < (t>>DIGIT_BIT)) < ((fp_digit)t)) < (t>>DIGIT_BIT)) ++c1; \ ++c2; \ ++c1; \ ++c2; This is like SQRADD except it adds the produce twice. It’s similar to computing SQRADD(i, j*2). To further make things interesting the squaring code also has “doubles” (see my LTM book chapter five...) which are handled with these macros. #define SQRADDSC(i, j) do { fp_word t; t = ((fp_word)i) * ((fp_word)j); sc0 = (fp_digit)t; sc1 = (t >> DIGIT_BIT); sc2 = 0; } while (0); \ \ \ \ This computes a product and stores it in the “secondary” carry registers hsc0, sc1, sc2i. #define SQRADDAC(i, j) do { fp_word t; t = sc0 + ((fp_word)i) * ((fp_word)j); sc0 = t; t = sc1 + (t >> DIGIT_BIT); sc1 = t; sc2 += t >> DIGIT_BIT; } while (0); \ \ \ \ This computes a product and adds it to the “secondary” carry registers. #define SQRADDDB do { fp_word t; t = ((fp_word)sc0) + ((fp_word)sc0) + c0; c0 = t; t = ((fp_word)sc1) + ((fp_word)sc1) + c1 + (t >> DIGIT_BIT); c1 = t; c2 = c2 + ((fp_word)sc2) + ((fp_word)sc2) + (t >> DIGIT_BIT); } while (0); This doubles the “secondary” carry registers and adds the sum to the main carry registers. Really complicated. \ \ 17 4.4. MONTGOMERY WITH COMBA 4.4 Montgomery with Comba Montgomery reduction is used in modular exponentiation and is most called function during that operation. It’s important to make sure this routine is very fast or all is lost. Unlike the two other comba routines this one does not use a single three– digit carry system. It does have three–digit carries except that the routine steps through them in the inner loop. This means you cannot alias them to registers (at all). To make matters simple though the three arrays of carries are stored in one array. The “c0” array resides in c[0 . . . OF F 1−1], “c1” in c[OF F 1 . . . OF F 2−1] and “c2” in c[OF F 2 . . . OF F 2 + F P SIZE − 1]. #define MONT_START This allows you to insert anything at the start that you need. #define MONT_FINI This allows you to insert anything at the end that you need. #define LOOP_START \ mu = c[x] * mp; This computes the µ value for the inner loop. You can safely alias mu and mp to a register if you want. #define INNERMUL do { fp_word t; _c[0] = t = ((fp_word)_c[0] + (fp_word)cy) + (((fp_word)mu) * ((fp_word)*tmpm++)); cy = (t >> DIGIT_BIT); } while (0) \ \ \ \ \ This computes the inner product and adds it to the destination and carry variable cy. This uses the mu value computed above (can be in a register already) and the cy which is a chaining carry. Inside the INNERMUL loop the cy value can be kept inside a register (hint: it always starts as cy = 0 in the first iteration). Upon completion of the inner loop the macro LOOP END is called which is used to fetch cy into the variable the C program can see. This is where, if you cached cy in a register you would copy it to the locally accessible C variable. 18 CHAPTER 4. PORTING TOMSFASTMATH #define PROPCARRY \ do { fp_digit t = _c[0] += cy; cy = (t < cy); } while (0) This propagates the carry upwards by one digit. Index fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp abs, 9 add, 11 cmp, 10 cmp mag, 10 cnt lsb, 10 div, 11 div 2, 10 div 2d, 10 exptmod, 11 gcd, 11 init, 8 init copy, 8 invmod, 11 iseven, 9 isodd, 9 isprime, 12 iszero, 9 lcm, 11 lshd, 10 mod, 11 mod 2d, 10 mul, 11 mul 2, 10 mul 2d, 10 neg, 9 rshd, 10 set, 8 sqr, 11 sub, 11 19