FEX-2409

Sonicadvance1 released this 06 Sep 01:04
d5db894

Read the blog post at FEX-Emu's Site!

FEX-2409 is now released... with a big performance boost.

I'm tired, carry me.

Little differences between x86 and Arm can cause big performance penalties for
emulators. None more so than flags.

What are those?

Flags are bits representing the processor state. For example, if an operation
results in zero, the "zero flag" is set. Both x86 and Arm have flags, so for
emulating x86 on Arm, we map x86 flags to Arm flags to reduce emulation
overhead. That is possible because x86 and Arm have similar flags. By contrast,
architectures like RISC-V lack flags, slowing down x86-on-RISC-V emulators.

Many arithmetic operations set flags. Programs can then conditionally jump
("branch") according to the flags. On x86, the flags are thus the building blocks
of if statements and loops. To check if two variables are equal, x86 code
subtracts them and checks the zero flag. To check if one variable is less than
another, x86 code subtracts and checks the negative flag (for signed values,
together with the overflow flag). This pattern --
subtracting, setting flags, and discarding the actual result -- is so common
that it has a special instruction: CMP ("CoMPare").
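To make the pattern concrete, here is a minimal Python sketch of what CMP does on 64-bit values. The function name `cmp_flags` and the dictionary layout are made up for illustration, and the model is deliberately simplified: it only tracks the zero and negative flags, and it ignores the overflow flag that real signed comparisons also need.

```python
MASK64 = (1 << 64) - 1  # keep results within 64 bits, like a 64-bit register


def cmp_flags(a: int, b: int) -> dict:
    """Toy model of CMP: subtract, keep only the flags, discard the result."""
    result = (a - b) & MASK64
    return {
        "zero": result == 0,              # set when a == b
        "negative": bool(result >> 63),   # top bit of the 64-bit result
    }


# "if a == b" becomes: compare, then branch on the zero flag.
print(cmp_flags(5, 5)["zero"])
# "if a < b" (small values, no signed overflow) checks the negative flag.
print(cmp_flags(3, 5)["negative"])
```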

If the story ended here, emulation would be easy. Unfortunately, we need to
talk about the carry flag.

After an addition, the carry flag indicates if the unsigned result overflowed.
Programs can then check the carry flag to detect overflows. The flag can also
be input to another addition, which is how wider arithmetic such as a 128-bit
addition is built from 64-bit additions.
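As a rough illustration of carry chaining, the sketch below builds a 128-bit addition out of two 64-bit additions. The helper names `add64` and `add128` are hypothetical; on x86 the two steps would correspond to an ADD followed by an ADC.

```python
MASK64 = (1 << 64) - 1


def add64(a: int, b: int, carry_in: int = 0):
    """64-bit add with carry in; returns (truncated result, carry out)."""
    total = a + b + carry_in
    return total & MASK64, total >> 64


def add128(a_lo: int, a_hi: int, b_lo: int, b_hi: int):
    """128-bit add on (low, high) 64-bit halves."""
    lo, carry = add64(a_lo, b_lo)            # low halves set the carry flag
    hi, carry_out = add64(a_hi, b_hi, carry)  # high halves consume it
    return lo, hi, carry_out


# The low half overflows, so the carry ripples into the high half.
print(add128(MASK64, 0, 1, 0))
```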

Subtractions are similar. In hardware, subtractions are additions with an
operand negated. Because they are additions in hardware, subtractions set the
carry flag. Precisely how is the carry flag defined for subtraction? There are
two competing conventions.

The first sets the flag when there is a borrow, by mathematical analogy with
addition. x86 uses this "borrow flag" convention, as it seems more natural.

The second option sets the flag when there is not a borrow. Isn't that
backwards? It turns out that adding a (two's complement) negated operand
overflows exactly when the subtraction does not borrow. This "true carry"
convention matches actual hardware behaviour, while the "borrow" x86 convention
requires extra gates to invert carry. Arm uses the "true carry" convention to
save a few gates.
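The relationship between the two conventions is easy to check by brute force. The sketch below is only a model, with a hypothetical helper `sub_flags`: it computes the x86-style borrow directly, computes the Arm-style carry by adding the two's complement negation as the hardware would, and confirms that one is always the inverse of the other for every pair of 8-bit values.

```python
def sub_flags(a: int, b: int, bits: int = 64):
    """Return (borrow, true_carry) for the subtraction a - b."""
    mask = (1 << bits) - 1
    borrow = int(a < b)               # "borrow flag" convention (x86)
    total = a + ((~b) & mask) + 1     # a + (-b), with -b as two's complement
    true_carry = total >> bits        # "true carry" convention (Arm)
    return borrow, true_carry


# Exhaustive check over all 8-bit operands: the flags are always opposite.
for a in range(256):
    for b in range(256):
        borrow, true_carry = sub_flags(a, b, bits=8)
        assert true_carry == 1 - borrow

print(sub_flags(5, 3, bits=8), sub_flags(3, 5, bits=8))
```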

Which convention should FEX use?

We could store the x86 carry flag in the Arm carry flag. Unfortunately, that
requires an extra instruction after each subtraction to invert carry to get the
borrow flag.

The counter-intuitive alternative is storing the opposite of the x86 flag.
That requires an extra invert after every addition, but it eliminates the
invert after subtraction.

Either we pay after additions or after subtractions. Which should we pick?

While addition is common, using the flags from an addition is not. Flags are
typically used with comparisons, which are subtractions. Therefore, the
inverted convention usually wins. This month, Alyssa adjusted FEX to invert
carries, speeding up typical workloads by a few percent.
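The trade-off can be sketched as a toy cost model. This is not how FEX is implemented; `extra_inverts` and the instruction stream below are invented for illustration, and the model assumes an invert is only needed when the carry from an operation is actually read.

```python
def extra_inverts(ops, store_inverted: bool) -> int:
    """Count the extra invert instructions needed for a stream of operations.

    ops is a list of (kind, carry_is_read) pairs, where kind is "add" or "sub".
    store_inverted=False keeps the x86 carry in the Arm carry flag;
    store_inverted=True keeps its opposite there.
    """
    count = 0
    for kind, carry_is_read in ops:
        if not carry_is_read:
            continue  # nobody looks at the carry, so nothing to fix up
        # Arm's add yields the x86 carry, Arm's sub yields its inverse.
        host_gives_inverted = kind == "sub"
        if host_gives_inverted != store_inverted:
            count += 1
    return count


# Hypothetical stream: plenty of additions, but flags mostly come from compares.
stream = [("add", False), ("sub", True), ("sub", True), ("add", True), ("add", False)]
print(extra_inverts(stream, store_inverted=False))  # pay after subtractions
print(extra_inverts(stream, store_inverted=True))   # pay after additions
```

In this made-up stream the inverted convention needs one invert instead of two, which mirrors the argument above: the winner depends on which kind of operation has its carry read more often.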

After tackling the carry flag, Alyssa optimized FEX's translations of address
modes, push/pop, AVX load/stores, and more. Overall, benchmarks are upwards of
10% faster since the last release.

A Qt change

What about more user-visible changes? If you use the FEXConfig tool to
configure the emulator, you're in for a treat. While it works, this ImGui-based
tool isn't exactly known for its convenience. In between his work optimizing the
[redacted] out of FEX's [redacted], Tony
rewrote FEXConfig as a simple Qt application, improving aesthetics, usability,
and accessibility all in one go. Here's a preview:

Besides look and feel, we've polished first-time setup for logging,
library forwarding, and RootFS images. We've also made tweaking various
emulation settings a bit nicer. Users of our Ubuntu PPA can simply
update to unlock these improvements without any further action.

But with so much optimization, who needs speedhacks anymore?

Raw Changes

FEX Release FEX-2409

  • ARM64EC
      • Always use the CPU area context for BeginSimulation (cc589ba)
      • Fix ContextFlag member tests (47b3637)

  • AVX128
      • Fixes 256-bit float compares (2478abb)
      • Fixes incorrect size usage in AVX128_Vector_CVT_Int_To_Float (a4db585)

  • Arm64
      • Implement support for strict in-process split-locks (74e95df)
      • Ensure 256-bit operations always assert without 256-bit SVE (90f7cc9)
      • Fixes SIGBUS handler for FEAT_LRCPC (2357253)
      • Allow directly correlating an ARM register back to an x86 register (92c951c)
      • Handle backpatching in a thread-safe manner (226f5e2)

  • CMake
      • Adds an AArch64 cross-compile toolchain file (06497fd)
      • Don't install binfmts for MinGW builds (f009a00)

  • CPUID
      • add missing Apple core part numbers (8296bfc)

  • ConstProp
      • stop pooling inline constants (5013b8a)

  • DeadStoreElimination
      • handle PF/AF invalidate (03832b2)

  • External
      • Update robin-map from 1.2.1 to 1.3.0 (96055cb)
      • Update fmt from 10.1.1 to 11.0.2 (f1d7879)

  • FEX
      • Moves sigreturn symbols to frontend (4baeffe)

  • FEXCore
      • Dynamically scale TSC (46a2a06)
      • Splits up atomic enablement checks (689b461)
      • Stop installing static library (4abac0c)
      • Disable vixl linking if vixl disasm or simulator is disabled (5e706df)

  • FEXInterpreter
      • Support portable installs (9336e35)

  • FEXLoader
      • don't install FEXUpdateAOTIRCache (1e1bcc4)

  • FEXQConfig
      • Add strict split-lock option (8fe1e95)

  • FEXQonfig
      • Fix minor saving/loading quirks (e234e11)
      • Minor followup changes (c74df6a)

  • FEXRootFSFetcher
      • Allow UI override through options (5b65f30)

  • Frontend
      • short-circuit code generation on invalid instructions with multiblock (92ddc00)

  • HostFeatures
      • Moves MIDR querying to the frontend (fbf62f1)
      • Removes vixl usage (1caa31c)
      • fix clang reformat constantly with missing comment (86e5e1a)

  • IR
      • fix scalar FMA tied sources (a66fac6)

  • InstCountCI
      • include x86 instruction count (50a9cea)

  • InstructionCountCI
      • explicitly enable flagm for multiinst (33558e6)

  • LinuxEmulation
      • Implement support for seccomp (b368223)
      • Moves guest signal frame generation to its own file (d9544e7)

  • LinuxSyscalls
      • Adds missing header (4f40416)
      • Implements less invasive assertion only EFAULT handlers (ce88f5f)
      • Some minor cleanups (27cd399)

  • OpcodeDispatcher
      • Convert more template handlers to Bind handlers (2829ad5)
      • Allow x86 code to read CNTVCT on ARM64EC (e6aa268)
      • fix tso checks (2a170cf)
      • Remove old bad assumption in INC/DEC (c634c53)

  • SignalDelegator
      • Refactor how thread local data is stored (114112a)
      • Changes AFmt to ERROR_AND_DIE_FMT (f897579)
      • Make sure to defer a signal if the guest signal mask desires (8617150)

  • SpinWaitLock
      • Update comment about WFE spurious wakeups (894aaa9)

  • Thunks
      • Removes global static initializer (b4a67a6)

  • VDSO
      • Fixes a pretty nasty bug where we were never using the host VDSO (0754aff)

  • WOW64
      • Improve exception handling (8e1695a)

  • Windows
      • Load per-application configs (49b40b7)
      • Fix missing pragma and license text (e415b94)
      • Fill in per-core MIDR information (7a9eb01)
      • Fix some file operations on actual Windows (b6cb897)
      • Implement CreateDirectoryW CRT function (eadb502)
      • Install libraries to CMAKE_INSTALL_LIBDIR (21f9841)
      • Support CPU feature detection from ID registry keys (97c229d)

  • Misc
      • More EFAULT handlers (62e1767)
      • Add Qt-based config editor (3020a0d)
      • Install GDBSymbol integration in the correct location (ac814ac)
      • Eliminate a move in 64-bit umul (2039950)
      • Adds more syscall memory access checks (b526c60)
      • Fix multiblock on ARM64EC (812224a)
      • Optimize AXFLAG-less systems (42f2851)
      • Optimize adc x, 0 (43cd897)
      • Optimise test al, 1 (0e5f5a2)
      • Optimize test (cfa2ad8)
      • Rearrange SRA to let us coalesce cmpxchg moves (877b2f4)
      • Optimize AVX load/store with ldp/stp (5ac7d5d)
      • Optimize BTC (a7138f2)
      • Improvements from bytemark "huffman" (99afd87)
      • binfmt_misc: Support systemd binfmt_misc (d2c82ba)
      • Library Forwarding: Follow up from #3964 (e5149fb)
      • Install libraries in the correct location (abbd655)
      • Clean up our JSON dependencies (00ef1ba)
      • small optimizations for returns (933c65d)
      • Support cooperative suspend on ARM64EC (df0ecad)
      • Add a hack for multiple destinations & make good use of it (aa5d2ff)
      • Little opcodedispatcher optimizations (6f43c8f)
      • Invert carry flag internally (8aa7d1a)

  • x32

  • Signals

  • Fixes bug in the sigqueue syscalls (0416950)