How to: pow(real, real) in x86

I'm looking for the implementation of pow(real, real) in x86 Assembly. Also I'd like to understand how the algorithm works.

asked Jan 9, 2011 at 09:01

glibc's implementation of the pow() function is in sysdeps/ieee754/dbl-64/e_pow.c. It uses some integer examination of the FP bit patterns, and some FP multiplies and adds, but doesn't use any special x87 instructions. For x86-64, it gets compiled into __ieee754_pow_sse2() (by this code that #includes it). Anyway, x87 isn't the best way to do it on modern CPUs.
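As a rough illustration of what "integer examination of the FP bit patterns" can look like, here is a minimal sketch that splits a double into its IEEE 754 fields. This is not glibc's code; the struct double_fields and the function decompose are names made up for this example, and the unbiased exponent is only meaningful for normal numbers.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three IEEE 754 binary64 fields. */
typedef struct {
    int      sign;      /* 1 if the sign bit is set */
    int      exponent;  /* unbiased exponent (normal numbers only) */
    uint64_t mantissa;  /* 52 stored fraction bits, implicit leading 1 not included */
} double_fields;

/* Hypothetical helper: reinterpret the double as an integer and mask out fields. */
double_fields decompose(double x)
{
    uint64_t bits;
    double_fields f;

    memcpy(&bits, &x, sizeof bits);               /* well-defined bit reinterpretation */
    f.sign     = (int)(bits >> 63);               /* bit 63 */
    f.exponent = (int)((bits >> 52) & 0x7FF) - 1023;  /* bits 62..52, bias removed */
    f.mantissa = bits & 0x000FFFFFFFFFFFFFull;    /* bits 51..0 */
    return f;
}
```

For example, decompose(8.0) gives sign 0, exponent 3 and mantissa 0, since 8.0 is 1.0 * 2^3.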

I assume glibc's code is either more accurate or faster than x87. Possibly both, but maybe just more accurate (correctly rounded to nearest). It doesn't use a loop, though, and single-stepping through the instructions, there aren't that many for pow(1.175, 33.75). FYL2X is a very slow instruction (~100 cycles) on modern CPUs, so it shouldn't be that hard to beat it.

Related: Optimizations for pow() with const non-integer exponent? has a fast approximate version (using SIMD intrinsics). See also Where is Clang's '_mm256_pow_ps' intrinsic? for SIMD math libraries that provide pow.

3 Answers

Just compute it as 2^(y*log2(x)).

There is an x86 instruction FYL2X to compute y*log2(x) and an x86 instruction F2XM1 to do exponentiation. F2XM1 requires an argument in the [-1,1] range, so you'd have to add some code in between to extract the integer part and the remainder, exponentiate the remainder, and use FSCALE to scale the result by the appropriate power of 2.

answered Jan 9, 2011 at 12:01

I know this is an old thread, but here's an implementation: madwizard.org - Matth

Just to have code here: fld power / fld x / fyl2x / fld1 / fld st(1) / fprem / f2xm1 / fadd / fscale / fxch st(1) / fstp st ;st0 = X^power - Qwertiy

OK, I implemented power(double a, double b, double * result); in x86 just as you recommended.

Code: http://pastebin.com/VWfE9CZT

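%define a               QWORD [ebp+8]
%define b               QWORD [ebp+16]
%define result          DWORD [ebp+24]
%define ctrlWord        WORD [ebp-2]
%define tmp             DWORD [ebp-6]

segment .text
    global power

power:
    push ebp
    mov ebp, esp
    sub esp, 6
    push ebx

    ; set the x87 rounding-control bits (RC = 11b) so fist truncates toward zero
    fstcw ctrlWord
    or ctrlWord, 110000000000b
    fldcw ctrlWord

    fld b
    fld a
    fyl2x                   ; st0 = b * log2(a)

    fist tmp                ; tmp = integer part of st0 (truncated)

    fild tmp
    fsubp st1, st0          ; st0 = fractional part, within (-1, 1)
    f2xm1                   ; st0 = 2^fraction - 1
    fld1
    faddp st1, st0          ; st0 = 2^fraction
    fild tmp
    fxch st1                ; st0 = 2^fraction, st1 = integer part
    fscale                  ; st0 = 2^fraction * 2^integer = a^b

    mov ebx, result
    fstp QWORD [ebx]        ; store the result and pop it
    fstp st0                ; pop the integer part so the x87 stack is left empty

    pop ebx
    mov esp, ebp
    pop ebp
    ret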

answered Jun 6, 2012 at 14:06

Could I recommend that you go ahead and include that code here, in your answer? - Jonathon Reinhart

You should sub esp, 8 to keep it aligned for pushing ebx. You could also swap tmp and ctrlWord, e.g. %define tmp DWORD [ebp-4], so it's aligned. - Peter Cordes

It would make a lot more sense to return a double instead of taking an output arg, so you can just leave the value in st0. Or if you insist on taking a pointer, load the pointer into EAX, ECX, or EDX so you don't have to save/restore EBX at all. Also, you should restore the original rounding mode when you're done. (e.g. save the original in a register, then store and fldcw it). This leaves it set to truncation (toward zero), not the default round-to-nearest. efg2.com/Lab/Library/Delphi/MathFunctions/FPUControlWord.Txt. - Peter Cordes

Update: frndint is slow on a lot of CPUs (agner.org/optimize), so fist/fild is actually better. And if you want to convert with truncation, you can use SSE3 fisttp if available instead of changing the FP rounding mode and restoring it. Or just use SSE2 cvttsd2si to convert with truncation. - Peter Cordes
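To illustrate the truncating split without touching the FP rounding mode, here is a small C sketch. It relies on the fact that a double-to-int cast in C truncates toward zero; compilers targeting SSE2 typically implement that cast with cvttsd2si, though that is up to the compiler. The function name split_trunc is made up, and the input is assumed to fit in an int.

```c
/* Hypothetical helper: split t into a truncated integer part and the remainder.
   Assumes t is finite and its integer part fits in an int. */
void split_trunc(double t, int *int_part, double *frac_part)
{
    *int_part  = (int)t;                  /* C casts truncate toward zero */
    *frac_part = t - (double)*int_part;   /* same sign as t, magnitude below 1 */
}
```

With truncation the remainder always lies in (-1, 1), which is inside the range F2XM1 accepts, so no control-word save/restore is needed for this step.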

Here's my function, using the main algorithm by 'The Svin'. I wrapped it in __fastcall and __declspec(naked) decorations and added code to make sure the base x is positive. If x is negative, the FPU will totally fail. You need to check the sign bit of x, consider the odd/even bit of y, and apply the sign after it's finished! Any random reader, let me know what you think. I'm looking for even better versions with x87 FPU code if possible. It compiles and works with Microsoft VC++ 2005, which I always stick with for various reasons.

Compatibility v. ANSI pow(x,y): Very good! Faster, predictable results, negative values are handled, just no error feedback for invalid input. But, if you know 'y' can always be an INT/LONG, do NOT use this version; I posted Agner Fog's version with some tweaks which avoids the very slow FSCALE, search my profile for it! His is the fastest x87/FPU way under those limited circumstances!
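For the integer-exponent case mentioned above, one widely used approach is exponentiation by squaring, which needs only multiplies. The sketch below is a generic C illustration of that technique, not the Agner Fog version referred to; the name ipow is made up for this example.

```c
/* Hypothetical helper: x raised to an integer power by repeated squaring. */
double ipow(double x, int n)
{
    /* Work with the magnitude of n as unsigned so INT_MIN does not overflow. */
    unsigned int e = (n < 0) ? 0u - (unsigned int)n : (unsigned int)n;
    double result = 1.0;

    while (e != 0) {
        if (e & 1u)          /* this bit of the exponent is set */
            result *= x;
        x *= x;              /* x, x^2, x^4, x^8, ... */
        e >>= 1;
    }
    return (n < 0) ? 1.0 / result : result;
}
```

The loop runs once per bit of the exponent, and the sign of a negative base falls out of the multiplications naturally, so no separate sign fix-up is needed.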

extern double __fastcall fs_Power(double x, double y);

// Main Source: The Svin
// pow(x,y) is equivalent to exp(y * ln(x))
// Version: 1.00

__declspec(naked) double __fastcall fs_Power(double x, double y) { __asm {
    LEA   EAX, [ESP+12]         ;// Save 'y' index in EAX
    FLD   QWORD PTR [EAX]       ;// Load 'y' (exponent) (works positive OR negative!)
    FIST  DWORD PTR [EAX]       ;// Round 'y' back to INT form to test for odd/even bit
    MOVZX EAX, WORD PTR [EAX-1] ;// Get x's left sign bit AND y's right odd/even bit!
    FLD   QWORD PTR [ESP+4]     ;// Load 'x' (base) (make positive next!)
    FABS            ;// 'x' MUST be positive, BUT check sign/odd bits pre-exit!
    AND   AX, 0180h ;// AND off all bits except right 'y' odd bit AND left 'x' sign bit!
    FYL2X       ;// 'y' * log2 'x' - (ST(0) = ST(1) * log2 ST(0)), pop
    FLD1        ;// Load 1.0f: 2 uses, mantissa extract, add 1.0 back post-F2XM1
    FLD   ST(1) ;// Duplicate current result
    FPREM1      ;// Extract mantissa via partial ST0/ST1 remainder with 80387+ IEEE cmd
    F2XM1       ;// Compute (2 ^ ST(0) - 1)
    FADDP ST(1), ST ;// ADD 1.0f back! We want (2 ^ X), NOT (2 ^ X - 1)!
    FSCALE      ;// ST(0) = ST(0) * 2 ^ ST(1) (Scale by factor of 2)
    FFREE ST(1) ;// Maintain FPU stack balance
;// Final task, make result negative if needed!
    CMP   AX, 0180h    ;// Combo-test: Is 'y' odd bit AND 'x' sign bit set?
    JNE   EXIT_RETURN  ;// If positive, exit; if not, add '-' sign!
        FCHS           ;// 'x' is negative, 'y' is ~odd, final result = negative! :)
EXIT_RETURN:
;// For __fastcall/__declspec(naked), gotta clean stack here (2 x 8-byte doubles)!
    RET   16     ;// Return & pop 16 bytes off stack
}}

Alright, to wrap this experiment up, I ran a benchmark test using the RDTSC CPU time stamp/clocks counter instruction. I followed the advice of also setting the process to High priority with "SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);" and I closed out all other apps.

Results: Our retro x87 FPU Math function "fs_Power(x,y)" is 50-60% faster than the MSCRT2005 pow(x,y) version which uses a pretty long SSE branch of code labeled '_pow_pentium4:' if it detects a 64-bit >Pentium4+ CPU. So yaaaaay!! :-)

Notes: (1) The CRT pow() appears to have a ~33 microsecond initialization branch, which shows up as 46,000 cycles in this test. After that it runs at a normal average of 1200 to 3000 cycles. Our hand-crafted x87 FPU beauty runs consistently, with no init penalty on the first call!

(2) While CRT pow() lost every test, it DID win in ONE area: If you entered wild, huge, out-of-range/overflow values, it quickly returned an error. Since most apps don't need error checks for typical/normal use, it's irrelevant.


2nd Test (I had to run it again to copy/paste text after the image snap):

 x86 fs_Power(2, 32): CPU Cycles (RDTSC): 1248
MSCRT SSE pow(2, 32): CPU Cycles (RDTSC): 50112

 x86 fs_Power(-5, 256): CPU Cycles (RDTSC): 1120
MSCRT SSE pow(-5, 256): CPU Cycles (RDTSC): 2560

 x86 fs_Power(-35, 24): CPU Cycles (RDTSC): 1120
MSCRT SSE pow(-35, 24): CPU Cycles (RDTSC): 2528

 x86 fs_Power(64, -9): CPU Cycles (RDTSC): 1120
MSCRT SSE pow(64, -9): CPU Cycles (RDTSC): 1280

 x86 fs_Power(-45.5, 7): CPU Cycles (RDTSC): 1312
MSCRT SSE pow(-45.5, 7): CPU Cycles (RDTSC): 1632

 x86 fs_Power(72, -16): CPU Cycles (RDTSC): 1120
MSCRT SSE pow(72, -16): CPU Cycles (RDTSC): 1632

 x86 fs_Power(7, 127): CPU Cycles (RDTSC): 1056
MSCRT SSE pow(7, 127): CPU Cycles (RDTSC): 2016

 x86 fs_Power(6, 38): CPU Cycles (RDTSC): 1024
MSCRT SSE pow(6, 38): CPU Cycles (RDTSC): 2048

 x86 fs_Power(9, 200): CPU Cycles (RDTSC): 1152
MSCRT SSE pow(9, 200): CPU Cycles (RDTSC): 7168

 x86 fs_Power(3, 100): CPU Cycles (RDTSC): 1984
MSCRT SSE pow(3, 100): CPU Cycles (RDTSC): 2784

Any real world applications? YES! Pow(x,y) is used heavily to help encode/decode a CD's WAVE format to OGG and vice versa! When you're encoding a full 60 minutes of WAVE data, that's where the time-saving payoff would be significant! Many Math functions are used in OGG/libvorbis also like acos(), cos(), sin(), atan(), sqrt(), ldexp() (very important), etc. So fine-tuned versions like this, which don't bother/need error checks, can save lots of time!!

My experiment is the result of building an OGG decoder for the NSIS installer system which led to me replacing all the Math "C" library functions the algorithm needs with what you see above. Well, ALMOST, I need acos() in x86, but I STILL can't find anything for that...

Regards, and hope this is useful to anyone else that likes to tinker!

answered Jul 31, 2019 at 00:07

Why not use the fabs instruction? It's much faster (1 cycle latency on modern AMD and Intel) and doesn't cause a store-forwarding stall by writing half of a double right before you do a qword load of the whole thing. (You can still branch on the original value in memory at the end). Also, you don't need to save/restore EDX: it's call-clobbered in all the standard calling conventions. - Peter Cordes

Also you can avoid fxch st1. Use fstp st(1) to keep st0 = st0 while popping the stack. - Peter Cordes

fprem is slow. If you're just using that to get the integer and fractional parts, use frndint and subtract. (Hmm, according to agner.org/optimize frndint is also slow on Intel, but fast on Ryzen. Strange because SSE/AVX roundpd is fast) - Peter Cordes

(1) Yeah, regarding FRNDINT, I've been studying Agner Fog, he says it's very slow. (2) Honestly haven't timer-tested FABS after I've loaded a float versus INT commands to memory to AND off the sign bit. But given Agner, I think it might be faster, not 100%. ;) (3) I preserved EDX because I sorta prefer to know that all registers are preserved when I return from a call. I remember building DLLs in VC++, calling them in VB/VBA, but if EDI wasn't restored, you'd get a bad DLL calling convention. I inline ASM a lot more, try to use all registers, so peace of mind I don't break something. - Fulano de Tal

Yes, unconditional fabs at the top is what I'd do; no possibly-mispredicted branches until long after the FP value has been ready, so FP latency can hide the branch miss penalty. You don't need a dword load, just byte. test [mem], imm8 / jz is smaller than mov eax, [esp+4] / test eax,eax / jns. Although doing the mov eax, [esp+4] before the slow microcoded FP instructions could be good for reducing branch mispredict detection latency for a test/jns at the end, in case the microcoded instructions tie up the front-end. - Peter Cordes
