Armv8.2 SM3 and SM4

Sun Yimin edited this page Oct 10, 2023 · 31 revisions

SM3 arm64 plain asm on arm64-graviton2

go test -v -short -bench . -run=^$ ./...
goos: linux
goarch: arm64
pkg: github.com/emmansun/gmsm/sm3
BenchmarkHash8Bytes
BenchmarkHash8Bytes-2     	 2738724	       438.4 ns/op	  18.25 MB/s
BenchmarkHash1K
BenchmarkHash1K-2         	  192519	      6232 ns/op	 164.32 MB/s
BenchmarkHash8K
BenchmarkHash8K-2         	   24950	     48112 ns/op	 170.27 MB/s
BenchmarkHash8K_SH256
BenchmarkHash8K_SH256-2   	  223354	      5369 ns/op	1525.81 MB/s
PASS
ok  	github.com/emmansun/gmsm/sm3	5.857s

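The gap to a CPU-instruction-level implementation is nearly 10x: 170.27 MB/s for plain-asm SM3 versus 1525.81 MB/s for BenchmarkHash8K_SH256 (about 9x).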

SM4 with AES

The AESE instruction is equivalent to:

  1. AddRoundKey(State, RoundKey)
  2. ShiftRows(State)
  3. SubBytes(State)

So if RoundKey = 0, AESE effectively performs only:

  1. ShiftRows(State)
  2. SubBytes(State)
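
The equivalence above can be checked with a small software model. This is only a sketch: the function names (`aese`, `shiftRows`, `invShiftRows`, `subBytes`, `sbox`) are hypothetical, and it assumes the usual column-major AES state layout (byte index = 4*column + row). It shows that with an all-zero round key only ShiftRows and SubBytes remain, and that applying the inverse ShiftRows permutation beforehand leaves a pure SubBytes.

```go
package main

import (
	"fmt"
	"math/bits"
)

type block [16]byte

// gmul multiplies two elements of GF(2^8) modulo the AES polynomial 0x11b.
func gmul(a, b byte) byte {
	var p byte
	for i := 0; i < 8; i++ {
		if b&1 != 0 {
			p ^= a
		}
		hi := a & 0x80
		a <<= 1
		if hi != 0 {
			a ^= 0x1b
		}
		b >>= 1
	}
	return p
}

// sbox computes one AES S-box entry: inverse in GF(2^8) (x^254), then the affine map.
func sbox(x byte) byte {
	inv := byte(1)
	for i := 0; i < 254; i++ {
		inv = gmul(inv, x) // for x == 0 this collapses to 0, as required
	}
	return inv ^ bits.RotateLeft8(inv, 1) ^ bits.RotateLeft8(inv, 2) ^
		bits.RotateLeft8(inv, 3) ^ bits.RotateLeft8(inv, 4) ^ 0x63
}

// shiftRows rotates row r of the column-major state left by r columns.
func shiftRows(in block) block {
	var out block
	for c := 0; c < 4; c++ {
		for r := 0; r < 4; r++ {
			out[4*c+r] = in[4*((c+r)%4)+r]
		}
	}
	return out
}

// invShiftRows undoes shiftRows.
func invShiftRows(in block) block {
	var out block
	for c := 0; c < 4; c++ {
		for r := 0; r < 4; r++ {
			out[4*((c+r)%4)+r] = in[4*c+r]
		}
	}
	return out
}

// subBytes applies the S-box to every byte.
func subBytes(in block) block {
	var out block
	for i, b := range in {
		out[i] = sbox(b)
	}
	return out
}

// aese models AESE: AddRoundKey, then ShiftRows, then SubBytes.
func aese(state, key block) block {
	for i := range state {
		state[i] ^= key[i]
	}
	return subBytes(shiftRows(state))
}

func main() {
	var state, zero block
	for i := range state {
		state[i] = byte(i)
	}
	fmt.Println("zero key == SubBytes(ShiftRows):", aese(state, zero) == subBytes(shiftRows(state)))
	fmt.Println("pre-permuted == SubBytes only:  ", aese(invShiftRows(state), zero) == subBytes(state))
}
```

Because ShiftRows is only a byte permutation, it can be cancelled with a single table-lookup shuffle before or after AESE, which is what makes the zero-key form usable as a 16-way S-box lookup.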

Does using an all-zero RoundKey have any side effects?

    go test -v -short -bench . -run=^$ ./...
    goos: linux
    goarch: arm64
    pkg: github.com/emmansun/gmsm/sm4
    BenchmarkEncrypt
    BenchmarkEncrypt-2   	 2145859	       559.1 ns/op	  28.62 MB/s
    BenchmarkDecrypt
    BenchmarkDecrypt-2   	 2145296	       559.4 ns/op	  28.60 MB/s
    BenchmarkExpand
    BenchmarkExpand-2    	 2064466	       581.2 ns/op
    PASS
    ok  	github.com/emmansun/gmsm/sm4	5.334s

SM4 with SM4E & SM4EKEY

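SM4EKEY / SM4E: Go does not currently support the SM4E/SM4EKEY instructions, but we can handle them the way unsupported opcodes are handled:

  1. Clone the code from https://github.com/golang/arch
  2. Modify arm64asm/tables.go: add the SM4E/SM4EKEY constants, add their names to opstr, and add the instructions to instFormats.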
	// SM4E <Vd>.4S, <Vn>.4S
	{0xfffffc00, 0xcec08400, SM4E, instArgs{arg_Vd_arrangement_4S, arg_Vn_arrangement_4S}, nil},
	// SM4EKEY <Vd>.4S, <Vn>.4S, <Vm>.4S
	{0xffe0fc00, 0xce60c800, SM4EKEY, instArgs{arg_Vd_arrangement_4S, arg_Vn_arrangement_4S, arg_Vm_arrangement_4S}, nil},	
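  3. Modify arm64asm/plan9x.go: add SM4E and SM4EKEY to noSuffixOpSet. This step is optional; with it, the Plan 9 syntax output does not put a V prefix on these instructions.
  4. Write tests. The testDecodeLine() helper is extracted from testDecode() in decode_test.go. Once you have read the Decode() method, you can encode those 32-bit codes yourself.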
func TestDecodeSM4Codes(t *testing.T) {
	// GNU syntax: load the 16-byte plaintext into v8 (reverse the byte order first) and the 32 round keys into v0-v7; the final result must be byte-reversed again.
	testDecodeLine(t, "gnu", "0884c0ce|	sm4e v8.4s, v0.4s")
	testDecodeLine(t, "gnu", "2884c0ce|	sm4e v8.4s, v1.4s")
	testDecodeLine(t, "gnu", "4884c0ce|	sm4e v8.4s, v2.4s")
	testDecodeLine(t, "gnu", "6884c0ce|	sm4e v8.4s, v3.4s")
	testDecodeLine(t, "gnu", "8884c0ce|	sm4e v8.4s, v4.4s")
	testDecodeLine(t, "gnu", "a884c0ce|	sm4e v8.4s, v5.4s")
	testDecodeLine(t, "gnu", "c884c0ce|	sm4e v8.4s, v6.4s")
	testDecodeLine(t, "gnu", "e884c0ce|	sm4e v8.4s, v7.4s")
	// Plan 9 syntax: load the 16-byte plaintext into V8 (reverse the byte order first) and the 32 round keys into V0-V7; the final result must be byte-reversed again.
	testDecodeLine(t, "plan9", "0884c0ce|	SM4E V0.S4, V8.S4")
	testDecodeLine(t, "plan9", "2884c0ce|	SM4E V1.S4, V8.S4")
	testDecodeLine(t, "plan9", "4884c0ce|	SM4E V2.S4, V8.S4")
	testDecodeLine(t, "plan9", "6884c0ce|	SM4E V3.S4, V8.S4")
	testDecodeLine(t, "plan9", "8884c0ce|	SM4E V4.S4, V8.S4")
	testDecodeLine(t, "plan9", "a884c0ce|	SM4E V5.S4, V8.S4")
	testDecodeLine(t, "plan9", "c884c0ce|	SM4E V6.S4, V8.S4")
	testDecodeLine(t, "plan9", "e884c0ce|	SM4E V7.S4, V8.S4")
	// GNU syntax: load the 32 CK constants into v0-v7 and the root key (byte order reversed first) XOR FK into v8; the round keys are produced in v9, so from the second sm4ekey invocation on, v9 must first be moved to v8.
	testDecodeLine(t, "gnu", "09c960ce|	sm4ekey v9.4s, v8.4s, v0.4s")
	testDecodeLine(t, "gnu", "09c961ce|	sm4ekey v9.4s, v8.4s, v1.4s")
	testDecodeLine(t, "gnu", "09c962ce|	sm4ekey v9.4s, v8.4s, v2.4s")
	testDecodeLine(t, "gnu", "09c963ce|	sm4ekey v9.4s, v8.4s, v3.4s")
	testDecodeLine(t, "gnu", "09c964ce|	sm4ekey v9.4s, v8.4s, v4.4s")
	testDecodeLine(t, "gnu", "09c965ce|	sm4ekey v9.4s, v8.4s, v5.4s")
	testDecodeLine(t, "gnu", "09c966ce|	sm4ekey v9.4s, v8.4s, v6.4s")
	testDecodeLine(t, "gnu", "09c967ce|	sm4ekey v9.4s, v8.4s, v7.4s")
	// GNU syntax: load the 32 CK constants into v0-v7 and the root key (byte order reversed first) XOR FK into v8; the round keys alternate between v9 (invocations 1, 3, 5, 7) and v8 (invocations 2, 4, 6, 8), which avoids register copies.
	testDecodeLine(t, "gnu", "09c960ce|	sm4ekey v9.4s, v8.4s, v0.4s")
	testDecodeLine(t, "gnu", "28c961ce|	sm4ekey v8.4s, v9.4s, v1.4s")
	testDecodeLine(t, "gnu", "09c962ce|	sm4ekey v9.4s, v8.4s, v2.4s")
	testDecodeLine(t, "gnu", "28c963ce|	sm4ekey v8.4s, v9.4s, v3.4s")
	testDecodeLine(t, "gnu", "09c964ce|	sm4ekey v9.4s, v8.4s, v4.4s")
	testDecodeLine(t, "gnu", "28c965ce|	sm4ekey v8.4s, v9.4s, v5.4s")
	testDecodeLine(t, "gnu", "09c966ce|	sm4ekey v9.4s, v8.4s, v6.4s")
	testDecodeLine(t, "gnu", "28c967ce|	sm4ekey v8.4s, v9.4s, v7.4s")
}

Each sm4e/sm4ekey invocation performs only 4 rounds, so it has to be called 8 times to cover all 32 rounds.
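
The eight instruction words for each instruction can also be computed instead of typed by hand. The sketch below is an assumption-laden helper, not part of golang/arch: `encodeSM4E`, `encodeSM4EKEY` and `memBytes` are hypothetical names, and the base opcodes are the value fields of the instFormats entries shown earlier, with Vd in bits 0-4, Vn in bits 5-9 and Vm in bits 16-20. `memBytes` prints the word in little-endian memory order, which is the form used in the test strings.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

const (
	sm4eBase    uint32 = 0xcec08400 // SM4E <Vd>.4S, <Vn>.4S
	sm4ekeyBase uint32 = 0xce60c800 // SM4EKEY <Vd>.4S, <Vn>.4S, <Vm>.4S
)

// encodeSM4E returns the 32-bit instruction word for SM4E Vd.4S, Vn.4S.
func encodeSM4E(vd, vn uint32) uint32 {
	return sm4eBase | vn<<5 | vd
}

// encodeSM4EKEY returns the 32-bit instruction word for SM4EKEY Vd.4S, Vn.4S, Vm.4S.
func encodeSM4EKEY(vd, vn, vm uint32) uint32 {
	return sm4ekeyBase | vm<<16 | vn<<5 | vd
}

// memBytes renders a word as hex in little-endian memory order.
func memBytes(w uint32) string {
	var b [4]byte
	binary.LittleEndian.PutUint32(b[:], w)
	return hex.EncodeToString(b[:])
}

func main() {
	// Encryption: 8 invocations, round keys in v0-v7, state in v8.
	for i := uint32(0); i < 8; i++ {
		w := encodeSM4E(8, i)
		fmt.Printf("0x%08x  %s|  sm4e v8.4s, v%d.4s\n", w, memBytes(w), i)
	}
	// Key expansion: 8 invocations, alternating v9 and v8 as destination.
	dst, src := uint32(9), uint32(8)
	for i := uint32(0); i < 8; i++ {
		w := encodeSM4EKEY(dst, src, i)
		fmt.Printf("0x%08x  %s|  sm4ekey v%d.4s, v%d.4s, v%d.4s\n", w, memBytes(w), dst, src, i)
		dst, src = src, dst
	}
}
```

The byte strings printed in the second column should line up with the ones passed to testDecodeLine() in the test above.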

  5. After that, you can use those 32-bit codes in Go arm64 assembly.

WORD	$0x0884c0ce       // SM4E V0.S4, V8.S4

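[3/30/2023] Further study and testing in a QEMU environment showed that no byte-order conversion is needed. The form below is the correct one! The SM3/SM4 NI implementations in this project have passed testing under QEMU.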

WORD	$0xcec08408       // SM4E V0.S4, V8.S4

The main drawback of raw instruction words is poor readability; another is that macros become impossible or awkward to write.
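
Since a raw WORD is easy to get wrong, a quick sanity check is to match the constant against the mask/value pairs used in instFormats. The following is a minimal sketch with hypothetical names (`format`, `formats`, `match`); only the two mask/value pairs come from the table entries above. It shows that the byte-swapped constant matches neither pattern, while the corrected one decodes as SM4E.

```go
package main

import (
	"fmt"
	"math/bits"
)

// format mirrors the first two fields of an instFormats entry plus a name.
type format struct {
	mask, value uint32
	name        string
}

var formats = []format{
	{0xfffffc00, 0xcec08400, "SM4E"},
	{0xffe0fc00, 0xce60c800, "SM4EKEY"},
}

// match returns the name of the first format whose fixed bits equal those
// of word, or "" if none matches.
func match(word uint32) string {
	for _, f := range formats {
		if word&f.mask == f.value {
			return f.name
		}
	}
	return ""
}

func main() {
	swapped := uint32(0x0884c0ce)           // the first, byte-swapped attempt
	correct := bits.ReverseBytes32(swapped) // 0xcec08408

	fmt.Printf("0x%08x -> %q\n", swapped, match(swapped))
	fmt.Printf("0x%08x -> %q Vd=V%d Vn=V%d\n",
		correct, match(correct), correct&31, correct>>5&31)
}
```

A check like this could run over every WORD constant in an assembly file before it is ever executed on QEMU or hardware.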

SM3/SM4 instruction word generation tool

Code

SM3 with SM3PARTW1 / SM3PARTW2 / SM3SS1 / SM3TT1A / SM3TT1B / SM3TT2A / SM3TT2B

 P1(X)= X XOR (X <<< 15) XOR (X <<< 23)
 
 P1(X1 XOR X2)
=(X1 XOR X2) XOR ((X1 XOR X2) <<< 15) XOR ((X1 XOR X2) <<< 23)
=X1 XOR X2 XOR (X1 <<< 15) XOR (X2 <<< 15) XOR (X1 <<< 23) XOR (X2 <<< 23)
=X1 XOR (X1 <<< 15) XOR (X1 <<< 23) XOR X2 XOR (X2 <<< 15) XOR (X2 <<< 23)
=P1(X1) XOR P1(X2)

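Here, XOR satisfies:
the commutative law
the associative law
and we assume (X1 XOR X2) <<< 15 = (X1 <<< 15) XOR (X2 <<< 15), that is, rotate-left (ROL) distributes over XOR. This is not obvious at first sight; it holds because a rotation only permutes bit positions, while XOR acts on each bit position independently.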

The last word in SM3PARTW1:
Vd[3] = P1(C XOR (R1 <<< 15)), where C is the XOR of the other two words and R1 is one part of X(4i+16): X(4i+16) = R1 XOR R2

tmp.value[0] in SM3PARTW2 is R2:
 P1(C XOR (R1 <<< 15)) XOR P1(R2 <<< 15) = P1(C XOR (R1 <<< 15) XOR (R2 <<< 15)) = P1(C XOR ((R1 XOR R2) <<< 15))

So the key point is that rotation distributes over XOR or, more generally, that logical shifts distribute over XOR (see: Does a shift operation distribute over XOR).
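
The identities used in this argument are easy to spot-check numerically. The sketch below uses hypothetical helper names (`p1`, `rol15`, `rotDistributes`, `p1Linear`, `partwIdentity`) and arbitrary sample values; it is a sanity check on a few inputs, not a proof.

```go
package main

import (
	"fmt"
	"math/bits"
)

// p1 is the SM3 message-expansion permutation P1(X) = X ^ (X <<< 15) ^ (X <<< 23).
func p1(x uint32) uint32 {
	return x ^ bits.RotateLeft32(x, 15) ^ bits.RotateLeft32(x, 23)
}

// rol15 rotates x left by 15 bits.
func rol15(x uint32) uint32 { return bits.RotateLeft32(x, 15) }

// rotDistributes checks (a ^ b) <<< 15 == (a <<< 15) ^ (b <<< 15).
func rotDistributes(a, b uint32) bool {
	return rol15(a^b) == rol15(a)^rol15(b)
}

// p1Linear checks P1(a ^ b) == P1(a) ^ P1(b).
func p1Linear(a, b uint32) bool {
	return p1(a^b) == p1(a)^p1(b)
}

// partwIdentity checks the SM3PARTW1/SM3PARTW2 split:
// P1(C ^ (R1 <<< 15)) ^ P1(R2 <<< 15) == P1(C ^ ((R1 ^ R2) <<< 15)).
func partwIdentity(c, r1, r2 uint32) bool {
	return p1(c^rol15(r1))^p1(rol15(r2)) == p1(c^rol15(r1^r2))
}

func main() {
	// Arbitrary sample words; any values should behave the same way.
	samples := []uint32{0, 1, 0x80000000, 0xffffffff, 0x12345678, 0xdeadbeef, 0x7380166f}
	ok := true
	for _, a := range samples {
		for _, b := range samples {
			ok = ok && rotDistributes(a, b) && p1Linear(a, b)
			for _, c := range samples {
				ok = ok && partwIdentity(c, a, b)
			}
		}
	}
	fmt.Println("all identities hold on the samples:", ok)
}
```

The same check with a logical left shift (`<<`) in place of the rotation also passes, which matches the more general statement about shifts.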

Simulation code

Reference

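CPU-instruction implementations of SM3 and SM4. I could not find a CPU environment that supports them, so these are bookmarked for now.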

  1. Summary of A64 cryptographic instructions
  2. Arm A64 Instruction Set Architecture
  3. linux arm64 crypto (https://github.com/torvalds/linux/tree/master/arch/arm64/crypto)
  4. A Quick Guide to Go's Assembler
  5. Golang arm instructions mapping
  6. A C/C++ header file that converts Intel SSE intrinsics to Arm/Aarch64 NEON intrinsics.
  7. asm2go