Netwide Assembly
Assembly is not the sort of programming language I'd like to code in all of the time, for obvious reasons. That said, learning it provides many benefits. Simply understanding more about how programs really work under the hood is its own reward, and an understanding of what a compiler turns your code into can help you figure out what the best way to write extremely performance-sensitive code is. It's also not unheard of for this kind of performance-sensitive code to be written in assembly manually, if it happens to be the sort of code that the compiler won't be able to optimize automatically.
I had a bit of trouble compiling my Hello World program, which is normal when trying to learn assembly (On Windows anyway). It turns out that Mingw is single-target, and despite the fact that my Mingw is 32-bit, it targets my 64-bit architecture. I guess. Regardless, I was able to compile the following code and run it successfully. It must link to the C standard libraries.
global main
extern printf
section .text
main:
push rbp
mov rbp, rsp
sub rsp, 8*4 ; reserve for local variables
mov rcx, message
call printf
mov rax,0
add rsp,8*4
pop rbp
ret
message:
db 'Hello, World', 13, 10, 0
And this was compiled and run with:
nasm -fwin64 hello.asm -o hello.obj
gcc -m64 hello.obj -o hello.exe
hello.exe
I was a tad confused by the printf message for a moment, it didn't help that the "13" happens to be the length of the message including a null terminator, but that's just the carriage return. We can take it out. The printf in assembly is pretty powerful too, we can use format strings like normal. For example, the following code prints "123456".
global main
extern printf
section .data
fmt_str: db '%d', 10, 0
section .text
main:
push rbp
mov rbp, rsp
sub rsp, 8*4
mov rax, 123456
mov rcx, fmt_str
mov rdx, rax
call printf
mov rax,0
add rsp,8*4
pop rbp
ret
Note that we use rcx, rdx, r8, and r9 for function parameters on Windows, but that Linux uses rdi, rsi, rdx, rcx, r8, and r9.
The messing about with the stack at the beginning and end of the main section also confused me for a moment, but that turns out to be the shadow space. This is space we must add to the stack before calling functions in clib, because they must have access to this region and will use it as scratch space.
global main
extern printf
section .data
fmt_num: db '%ld Params...', 10, 0
fmt_str: db ' %s', 10, 0
section .text
main:
push rbp
; Reserve shadow space on stack.
mov rbp, rsp
sub rsp, 8*4
mov r12, rcx ; argc
mov r13, rdx ; argv
; Print argc
mov rcx, fmt_num
mov rdx, r12
call printf
; Print args in argv
print_arg:
; Print an arg
mov rcx, fmt_str
mov rdx, [r13]
call printf
; Go to the next arg
add r13, 8
; Decrement the number of args and go back to the start of the loop if it isn't 0.
dec r12
cmp r12, 0
jne print_arg
; Return shadow space.
add rsp, 8*4
pop rbp
; Return value
mov rax, 0
ret
We have access to the entire C library, which is frankly overpowered. Any function with four arguments are fair game if we put arguments into rcx/rdx/r8/r9. If it has more, we have to push them to the stack. We can write to files by simply calling the appropriate functions.
global main
extern printf
extern fopen
extern fwrite
extern fclose
section .data
fn: db 'out.txt', 0
fn_o: db 'w', 0
txt: db 'Hello File!', 0
section .text
main:
push rbp
; Reserve shadow space on stack.
mov rbp, rsp
sub rsp, 8*4
; Open the file
mov rcx, fn
mov rdx, fn_o
call fopen
; Save the file pointer
mov r12, rax
; Write to the file
mov rcx, txt ; ptr
mov rdx, 1 ; size
mov r8, 11 ; nmemb
mov r9, r12 ; stream
call fwrite
; Close the file
mov rcx, r12
call fclose
; Free shadow space
add rsp, 8*4
pop rbp
; Return value
mov rax, 0
ret
And if we can write to files, that means was can make images using the Netpbm format. I'm using the P5 format that includes an ASCII header followed by raw bytes representing one-channel pixels. The below image was made by giving each pixel a brightness corresponding to its y-coordinate, and was used to test my code for iterating over all of the pixel values, which is not such an easy thing as it is in C.
Now, we can access the Floating Point Unit, or FPU. This is a sub-component of the processor that performs floating-point arithmetic, and we'll need to be able to use it for any reasonably interesting images. The FPU is a bit jarring compared to the CPU, it contains 8 80-bit (extended precision) registers that are presented as a stack to the user. We must take care not to forget our exact place in that stack or we risk pulling the wrong value from it or filling it up completely, causing errors. I was able to implement a simple sin-wave renderer without too much trouble. It takes a pixel's x-coordinate, divides it by a constant to scale the waves, calculates the sin of it, and transforms that into a valid pixel brightness between 0 and 250 before writing it to the final image.
The best resources I found for this were this guide for FPU in MASM by Raymond Filiatreault (See the NASM manual for the differences between NASM and MASM) and this instruction set reference.
DEFAULT REL ; Important for some thing.
%define edge 256
global main
extern printf
extern sprintf
extern fopen
extern fwrite
extern fclose
section .data
hdr_fmt: db 'P5 %d %d 255 ', 0
fn: db 'out.pgm', 0
fn_o: db 'wb', 0
sin_div: dw 15
sin_add: dd 1.0
sin_mul: dw 125
section .bss
hdr: resb 64
img: resb edge*edge
; x and y coords of the current pixel
px: resq 1
py: resq 1
; Final brightness of the current pixel
val: resw 1
section .text
; Increments [px]. If [px] becomes equal to edge, set it to 0 and increment [py].
next_pixel:
push rax
mov rax, [px]
inc rax
cmp qword rax, edge
je next_row
mov [px], rax
pop rax
ret
next_row:
mov qword [px], 0
inc qword [py]
pop rax
ret
main:
push rbp
; Reserve shadow space on stack.
mov rbp, rsp
sub rsp, 8*4
; Write the header
mov rcx, hdr
mov rdx, hdr_fmt
mov r8, edge
mov r9, edge
call sprintf
; Save header length
mov r13, rax
mov r14, edge*edge
mov r15, img
; x and y pixel coordinates
mov qword [px], 0
mov qword [py], 0
write_loop:
; Push the pixel coordinates to the FPU stack.
FILD qword [px] ; Load the pixel's x coordinate
FIDIV word [sin_div] ; Divide the pixel coordinate's value by sin_div
FSIN ; Take the sin of the pixel's value
FADD dword [sin_add] ; Add 1 to the sine, which is now in the range 0 to 2
FIMUL word [sin_mul] ; Multiply the sine by 125, to get it to a valid pixel brightness
FISTP word [val] ; Save the value back to memory and remove it from the FPU.
; Write to the next pixel
mov bx, [val]
mov [r15], bl
; Set ax/bx to the next pixel
call next_pixel
; Point to next pixel
inc r15
; Decrement pixel counter
dec r14
cmp r14, 0
jne write_loop
; Open the file
mov rcx, fn
mov rdx, fn_o
call fopen
; Save the file pointer
mov r12, rax
; Write header to the file
mov rcx, hdr ; ptr
mov rdx, 1 ; size
mov r8, r13 ; nmemb
mov r9, r12 ; stream
call fwrite
; Write data the file
mov rcx, img ; ptr
mov rdx, 1 ; size
mov r8, edge*edge ; nmemb
mov r9, r12 ; stream
call fwrite
; Close the file
mov rcx, r12
call fclose
; Free shadow space
add rsp, 8*4
pop rbp
; Return value
mov rax, 0
ret
This project was an excellent introduction to the FPU, it required very little concern for the stack presented by the FPU, so I didn't have worry too much about encountering FPU issues I wasn't prepared for, and I could focus more on just getting used to using the thing. I have to say, it's also very fun when the only documentation available to you is un-styled HTML that was copied out of a PDF decades ago and never touched. It has a certain charm. Anyway, that code produced the following image:
Now that I have the basics of the FPU down, I can get to work rendering the Mandelbrot set. That sounds complicated, but it's really just a matter of knowing how to convert the math into code, and in practice that just means I need to write out the necessary arithmetic and not screw up. We first need to convert the pixel coords into floats in a reasonable range, then repeatedly iterate them. The main challenge that I foresaw was having to do more stack fiddling on the FPU. Indeed, that gave me a little more trouble, but even though I still don't have the slightest clue why the FPU is arranged the way it is, it starts to come pretty naturally once you get into the flow. Also, every few lines contains a comment giving the layout of the FPU registers and a function call to a function I wrote that dumps the FPU registers to console. I used this function to track down an issue where I accidentally freed one too many FPU registers. After figuring everything out, it was able to produce the image below with no trouble.
Now that I've got it working, I'm curious, how does my assembly stack up against GCC's, performance-wise? I'll use the clock() function in both. I can use my new-found FPU skills to convert the values into seconds and display them. Note that I'm only timing the portion of the code that calculates the Mandelbrot set values, not the time it takes to write the data to a file. I find that my assembly code consistently takes between 560 and 570 milliseconds. While my code does beat GCC in the default configuration, using the -O3 flag to optimize for the best possible performance gets the C output to hover around 330 milliseconds. So, the legends were true. If you want to optimize by writing the assembly manually, you really need to know what you're doing. Granted, I wasn't trying to write fast code at all, but I can only imagine that nearly doubling the speed of my code would take an unreasonable amount of effort. It's not even clear whether the compiler left any room for improvement or not.
Regardless, it was fun to revisit and expand upon my assembly skills. I didn't really expect to learn about the FPU when I started but I'm glad I did. And I have to say, programming in assembly brings with it a certain kind of catharsis. Maybe it's just the change of pace. The final code is given below.
DEFAULT REL
%define edge 2048
%define iter 200
%define escape 200
global main
extern printf
extern sprintf
extern fopen
extern fwrite
extern fclose
extern clock
section .data
hdr_fmt: db 'P5 %d %d 255 ', 0
fn: db 'out.pgm', 0
fn_o: db 'wb', 0
prn_flt: db '%lf, ', 0
prn_nwl: db 10, 0
prn_int: db '%d', 10, 0
prn_tim: db 'Completed in %lf seconds.', 10, 0
edge_mem: dw edge ; Needed because constants can't be loaded into the FPU
cam_edge: dd 4.0 ; Width of the camera in the complex plane
cam_of_x: dd -2.0 ; x-coord of left edge of the camera in the complex plane
cam_of_y: dd -2.0 ; y-coord of upper edge of the camera in the complex plane
CLOCKS_PER_SEC: dd 1000
const_2: dw 2
section .bss
hdr: resb 64
img: resb edge*edge
; Temporary space to hold a single-precision float
f_tmp: resq 1
cnt_t: resq 1
iter_c: resq 1
; x and y coords of the current pixel
px: resq 1
py: resq 1
; abs(Z)^2
sqr_abs_z: resw 1
; Final brightness of the current pixel
val: resw 1
; Timing
start_t: resq 1
end_t: resq 1
diff_t: resq 1
section .text
; Increments rax. If rax becomes edge, set it to 0 and increment rbx
next_pixel:
push rax
mov rax, [px]
inc rax
cmp qword rax, edge
je next_row
mov [px], rax
pop rax
ret
next_row:
mov qword [px], 0
inc qword [py]
pop rax
ret
; Writes the contents of st0 through st7 to the console.
; Destroys rbx, rcx, and rdx plus the registers destroyed by printf.
dump_fpu_regs:
mov qword [cnt_t], 8
sub rsp, 8*4
dump_fpu_regs_loop:
mov rcx, prn_flt
FST qword [f_tmp]
mov rdx, [f_tmp]
call printf
FINCSTP
mov rbx, [cnt_t]
dec rbx
mov [cnt_t], rbx
cmp rbx, 0
jne dump_fpu_regs_loop
mov rcx, prn_nwl
call printf
add rsp, 8*4
ret
main:
push rbp
; Reserve shadow space on stack.
mov rbp, rsp
sub rsp, 8*4
; Write the header
mov rcx, hdr
mov rdx, hdr_fmt
mov r8, edge
mov r9, edge
call sprintf
; Save header length
mov r13, rax
mov r14, edge*edge
mov r15, img
; x and y pixel coordinates
mov qword [px], 0
mov qword [py], 0
call clock
mov [start_t], rax
FINIT ; Reset the FPU (Unecessary but good practice)
write_loop:
; Push some zeroes to the FPU stack.
FLDZ ; Zi
FLDZ ; Zr
; Push the pixel coordinates to the FPU stack.
FILD qword [py] ; Load the pixel's y coordinate
FIDIV word [edge_mem] ; Get the pixel coordinate into the range 0-1
FMUL dword [cam_edge] ; Scale by the camera's edge
FADD dword [cam_of_y] ; Translate the camera by its offset
FILD qword [px] ; Load the pixel's x coordinate
FIDIV word [edge_mem] ; Get the pixel coordinate into the range 0-1
FMUL dword [cam_edge] ; Scale by the camera's edge
FADD dword [cam_of_x] ; Translate the camera by its offset
; ST: Cr, Ci, Zr, Zi, NULL, NULL, NULL, NULL
; Perform the iterations
mov qword [iter_c], iter ; Initialize the counter
iter_loop:
; Copy Zi and multiply it by itself
FLD st3
FMUL st0
; Copy Zr and multiply it by itself
FLD st3
FMUL st0
; ST: Zr*Zr, Zi*Zi, Cr, Ci, Zr, Zi, NULL, NULL
;call dump_fpu_regs
; Calculate the new Zr
FLD st0
; ST: Zr*Zr, Zr*Zr, Zi*Zi, Cr, Ci, Zr, Zi, NULL
;call dump_fpu_regs
FSUB st2
FADD st3
FXCH st5
; ST: Zr, Zr*Zr, Zi*Zi, Cr, Ci, Zr*Zr - Zi*Zi + Cr, Zi, NULL
;call dump_fpu_regs
; Calculate the new Zi
FMUL st6
FIMUL word [const_2]
FADD st4
FXCH st6
; ST: Zi, Zr*Zr, Zi*Zi, Cr, Ci, Zr*Zr - Zi*Zi + Cr, Zr*Zi*2 + Ci, NULL
;call dump_fpu_regs
FXCH st1
FADD st2
; ST: Zr*Zr + Zi*Zi, Zi, Zi*Zi, Cr, Ci, Zr*Zr - Zi*Zi + Cr, Zr*Zi*2 + Ci, NULL
;call dump_fpu_regs
; Convert the abs(Z)^2 value to an int and save it to memory, and pop the FPU
FISTP word [sqr_abs_z]
; Clean up the FPU
FFREEP st0
FFREEP st0
;call dump_fpu_regs
; Compare the abs(z)^2 with escape.
mov rax, [sqr_abs_z]
cmp rax, escape
jg escaped
; Decrement the counter and exit when it reaches 0
mov rcx, [iter_c]
dec rcx
mov [iter_c], rcx
cmp rcx, 0
jne iter_loop
not_escaped:
; Write to the next pixel
mov byte [r15], 0
jmp escape_end
escaped:
; Write to the next pixel
mov byte [r15], 255
escape_end:
; Clean up the FPU
FFREEP st0
FFREEP st0
FFREEP st0
FFREEP st0
; Set ax/bx to the next pixel
call next_pixel
; Point to next pixel
inc r15
; Decrement pixel counter
dec r14
cmp r14, 0
jne write_loop
call clock
sub rax, [start_t]
mov [end_t], rax
FILD qword [end_t]
FIDIV dword [CLOCKS_PER_SEC]
FSTP qword [end_t]
mov rcx, prn_tim
mov rdx, [end_t]
call printf
; Open the file
mov rcx, fn
mov rdx, fn_o
call fopen
; Save the file pointer
mov r12, rax
; Write header to the file
mov rcx, hdr ; ptr
mov rdx, 1 ; size
mov r8, r13 ; nmemb
mov r9, r12 ; stream
call fwrite
; Write data the file
mov rcx, img ; ptr
mov rdx, 1 ; size
mov r8, edge*edge ; nmemb
mov r9, r12 ; stream
call fwrite
; Close the file
mov rcx, r12
call fclose
; Free shadow space
add rsp, 8*4
pop rbp
; Return value
mov rax, 0
ret