Home / Other

Netwide Assembly

Assembly is not the sort of programming language I'd like to code in all of the time, for obvious reasons. That said, learning it provides many benefits. Simply understanding more about how programs really work under the hood is its own reward, and an understanding of what a compiler turns your code into can help you figure out what the best way to write extremely performance-sensitive code is. It's also not unheard of for this kind of performance-sensitive code to be written in assembly manually, if it happens to be the sort of code that the compiler won't be able to optimize automatically.

I had a bit of trouble compiling my Hello World program, which is normal when trying to learn assembly (On Windows anyway). It turns out that Mingw is single-target, and despite the fact that my Mingw is 32-bit, it targets my 64-bit architecture. I guess. Regardless, I was able to compile the following code and run it successfully. It must link to the C standard libraries.


global  main
    extern  printf

    section .text

main:
    push rbp
    mov rbp, rsp
    sub rsp, 8*4  ; reserve for local variables   

    mov rcx, message
    call printf
   
    mov rax,0
    add rsp,8*4
    pop rbp
	
    ret

message:
    db  'Hello, World', 13, 10, 0

And this was compiled and run with:


nasm -fwin64 hello.asm -o hello.obj
gcc -m64 hello.obj -o hello.exe
hello.exe

I was a tad confused by the printf message for a moment, it didn't help that the "13" happens to be the length of the message including a null terminator, but that's just the carriage return. We can take it out. The printf in assembly is pretty powerful too, we can use format strings like normal. For example, the following code prints "123456".


global  main
	extern  printf

section .data
	fmt_str: db '%d', 10, 0

section .text
main:
	push rbp
	mov rbp, rsp
	sub rsp, 8*4
	
	mov rax, 123456

	mov rcx, fmt_str
	mov rdx, rax
	call printf
   
	mov rax,0
	add rsp,8*4
	pop rbp
	ret

Note that we use rcx, rdx, r8, and r9 for function parameters on Windows, but that Linux uses rdi, rsi, rdx, rcx, r8, and r9.

The messing about with the stack at the beginning and end of the main section also confused me for a moment, but that turns out to be the shadow space. This is space we must add to the stack before calling functions in clib, because they must have access to this region and will use it as scratch space.


global main
	extern  printf

section .data
	fmt_num: db '%ld Params...', 10, 0
	fmt_str: db '  %s', 10, 0

section .text
main:
	push rbp
	
	; Reserve shadow space on stack.
	mov rbp, rsp
	sub rsp, 8*4
	
	mov r12, rcx ; argc
	mov r13, rdx ; argv
	
	; Print argc
	mov rcx, fmt_num
	mov rdx, r12
	call printf
	
	; Print args in argv
	print_arg:
		; Print an arg
		mov rcx, fmt_str
		mov rdx, [r13]
		call printf
		
		; Go to the next arg
		add r13, 8
		
		; Decrement the number of args and go back to the start of the loop if it isn't 0.
		dec r12
		cmp r12, 0
		jne print_arg
	
	; Return shadow space.
	add rsp, 8*4
	pop rbp
	
	; Return value
	mov rax, 0
	
	ret

We have access to the entire C library, which is frankly overpowered. Any function with four arguments are fair game if we put arguments into rcx/rdx/r8/r9. If it has more, we have to push them to the stack. We can write to files by simply calling the appropriate functions.


global main
	extern printf
	extern fopen
	extern fwrite
	extern fclose

section .data
	fn: db 'out.txt', 0
	fn_o: db 'w', 0
	txt: db 'Hello File!', 0

section .text
main:
	push rbp
	
	; Reserve shadow space on stack.
	mov rbp, rsp
	sub rsp, 8*4
	
	; Open the file
	mov rcx, fn
	mov rdx, fn_o
	call fopen
	
	; Save the file pointer
	mov r12, rax
	
	; Write to the file
	mov rcx, txt	; ptr
	mov rdx, 1		; size
	mov r8, 11		; nmemb
	mov r9, r12		; stream
	call fwrite
	
	; Close the file
	mov rcx, r12
	call fclose
	
	; Free shadow space
	add rsp, 8*4
	pop rbp
	
	; Return value
	mov rax, 0
	
	ret

And if we can write to files, that means was can make images using the Netpbm format. I'm using the P5 format that includes an ASCII header followed by raw bytes representing one-channel pixels. The below image was made by giving each pixel a brightness corresponding to its y-coordinate, and was used to test my code for iterating over all of the pixel values, which is not such an easy thing as it is in C.

Now, we can access the Floating Point Unit, or FPU. This is a sub-component of the processor that performs floating-point arithmetic, and we'll need to be able to use it for any reasonably interesting images. The FPU is a bit jarring compared to the CPU, it contains 8 80-bit (extended precision) registers that are presented as a stack to the user. We must take care not to forget our exact place in that stack or we risk pulling the wrong value from it or filling it up completely, causing errors. I was able to implement a simple sin-wave renderer without too much trouble. It takes a pixel's x-coordinate, divides it by a constant to scale the waves, calculates the sin of it, and transforms that into a valid pixel brightness between 0 and 250 before writing it to the final image.

The best resources I found for this were this guide for FPU in MASM by Raymond Filiatreault (See the NASM manual for the differences between NASM and MASM) and this instruction set reference.


DEFAULT REL ; Important for some thing.

%define edge 256

global main
	extern printf
	extern sprintf
	extern fopen
	extern fwrite
	extern fclose

section .data
	hdr_fmt: db 'P5 %d %d 255 ', 0
	fn: db 'out.pgm', 0
	fn_o: db 'wb', 0
	
	sin_div: dw 15
	sin_add: dd 1.0
	sin_mul: dw 125

section .bss
	hdr: resb 64
	img: resb edge*edge
	
	; x and y coords of the current pixel
	px: resq 1
	py: resq 1
	
	; Final brightness of the current pixel
	val: resw 1

section .text

; Increments [px]. If [px] becomes equal to edge, set it to 0 and increment [py].
next_pixel:
	push rax
	
	mov rax, [px]
	inc rax
	
	cmp qword rax, edge
	je next_row
	
	mov [px], rax
	pop rax
	ret
	
	next_row:
		mov qword [px], 0
		inc qword [py]
		
		pop rax
		ret

main:
	push rbp
	
	; Reserve shadow space on stack.
	mov rbp, rsp
	sub rsp, 8*4
	
	; Write the header
	mov rcx, hdr
	mov rdx, hdr_fmt
	mov r8, edge
	mov r9, edge
	call sprintf
	
	; Save header length
	mov r13, rax
	
	mov r14, edge*edge
	mov r15, img
	
	; x and y pixel coordinates
	mov qword [px], 0
	mov qword [py], 0
	
	write_loop:
		; Push the pixel coordinates to the FPU stack.
		FILD qword [px] ; Load the pixel's x coordinate
		
		FIDIV word [sin_div]  ; Divide the pixel coordinate's value by sin_div
		FSIN                  ; Take the sin of the pixel's value
		FADD dword [sin_add]  ; Add 1 to the sine, which is now in the range 0 to 2
		FIMUL word [sin_mul]  ; Multiply the sine by 125, to get it to a valid pixel brightness
		
		FISTP word [val] ; Save the value back to memory and remove it from the FPU.
		
		; Write to the next pixel
		mov bx, [val]
		mov [r15], bl
		
		; Set ax/bx to the next pixel
		call next_pixel
		
		; Point to next pixel
		inc r15
		
		; Decrement pixel counter
		dec r14
		
		cmp r14, 0
		jne write_loop
	
	; Open the file
	mov rcx, fn
	mov rdx, fn_o
	call fopen
	
	; Save the file pointer
	mov r12, rax
	
	; Write header to the file
	mov rcx, hdr	; ptr
	mov rdx, 1		; size
	mov r8, r13		; nmemb
	mov r9, r12		; stream
	call fwrite
	
	; Write data the file
	mov rcx, img		; ptr
	mov rdx, 1			; size
	mov r8, edge*edge	; nmemb
	mov r9, r12			; stream
	call fwrite
	
	; Close the file
	mov rcx, r12
	call fclose
	
	; Free shadow space
	add rsp, 8*4
	pop rbp
	
	; Return value
	mov rax, 0
	
	ret

This project was an excellent introduction to the FPU, it required very little concern for the stack presented by the FPU, so I didn't have worry too much about encountering FPU issues I wasn't prepared for, and I could focus more on just getting used to using the thing. I have to say, it's also very fun when the only documentation available to you is un-styled HTML that was copied out of a PDF decades ago and never touched. It has a certain charm. Anyway, that code produced the following image:

Now that I have the basics of the FPU down, I can get to work rendering the Mandelbrot set. That sounds complicated, but it's really just a matter of knowing how to convert the math into code, and in practice that just means I need to write out the necessary arithmetic and not screw up. We first need to convert the pixel coords into floats in a reasonable range, then repeatedly iterate them. The main challenge that I foresaw was having to do more stack fiddling on the FPU. Indeed, that gave me a little more trouble, but even though I still don't have the slightest clue why the FPU is arranged the way it is, it starts to come pretty naturally once you get into the flow. Also, every few lines contains a comment giving the layout of the FPU registers and a function call to a function I wrote that dumps the FPU registers to console. I used this function to track down an issue where I accidentally freed one too many FPU registers. After figuring everything out, it was able to produce the image below with no trouble.

Now that I've got it working, I'm curious, how does my assembly stack up against GCC's, performance-wise? I'll use the clock() function in both. I can use my new-found FPU skills to convert the values into seconds and display them. Note that I'm only timing the portion of the code that calculates the Mandelbrot set values, not the time it takes to write the data to a file. I find that my assembly code consistently takes between 560 and 570 milliseconds. While my code does beat GCC in the default configuration, using the -O3 flag to optimize for the best possible performance gets the C output to hover around 330 milliseconds. So, the legends were true. If you want to optimize by writing the assembly manually, you really need to know what you're doing. Granted, I wasn't trying to write fast code at all, but I can only imagine that nearly doubling the speed of my code would take an unreasonable amount of effort. It's not even clear whether the compiler left any room for improvement or not.

Regardless, it was fun to revisit and expand upon my assembly skills. I didn't really expect to learn about the FPU when I started but I'm glad I did. And I have to say, programming in assembly brings with it a certain kind of catharsis. Maybe it's just the change of pace. The final code is given below.


DEFAULT REL

%define edge 2048
%define iter 200
%define escape 200

global main
	extern printf
	extern sprintf
	extern fopen
	extern fwrite
	extern fclose
	extern clock

section .data
	hdr_fmt: db 'P5 %d %d 255 ', 0
	fn: db 'out.pgm', 0
	fn_o: db 'wb', 0
	
	prn_flt: db '%lf, ', 0
	prn_nwl: db 10, 0
	prn_int: db '%d', 10, 0
	
	prn_tim: db 'Completed in %lf seconds.', 10, 0
	
	edge_mem: dw edge ; Needed because constants can't be loaded into the FPU
	cam_edge: dd 4.0  ; Width of the camera in the complex plane
	cam_of_x: dd -2.0 ; x-coord of left edge of the camera in the complex plane
	cam_of_y: dd -2.0 ; y-coord of upper edge of the camera in the complex plane
	
	CLOCKS_PER_SEC: dd 1000
	
	const_2: dw 2

section .bss
	hdr: resb 64
	img: resb edge*edge
	
	; Temporary space to hold a single-precision float
	f_tmp: resq 1
	cnt_t: resq 1
	
	iter_c: resq 1
	
	; x and y coords of the current pixel
	px: resq 1
	py: resq 1
	
	; abs(Z)^2
	sqr_abs_z: resw 1
	
	; Final brightness of the current pixel
	val: resw 1
	
	; Timing
	start_t: resq 1
	end_t: resq 1
	diff_t: resq 1

section .text

; Increments rax. If rax becomes edge, set it to 0 and increment rbx
next_pixel:
	push rax
	
	mov rax, [px]
	inc rax
	
	cmp qword rax, edge
	je next_row
	
	mov [px], rax
	pop rax
	ret
	
	next_row:
		mov qword [px], 0
		inc qword [py]
		
		pop rax
		ret

; Writes the contents of st0 through st7 to the console.
; Destroys rbx, rcx, and rdx plus the registers destroyed by printf.
dump_fpu_regs:
	mov qword [cnt_t], 8
	
	sub rsp, 8*4
	
	dump_fpu_regs_loop:
		mov rcx, prn_flt
		FST qword [f_tmp]
		mov rdx, [f_tmp]
		call printf
		
		FINCSTP
	
		mov rbx, [cnt_t]
		dec rbx
		mov [cnt_t], rbx
		cmp rbx, 0
		jne dump_fpu_regs_loop
	
	mov rcx, prn_nwl
	call printf
	
	add rsp, 8*4
	
	ret

main:
	push rbp
	
	; Reserve shadow space on stack.
	mov rbp, rsp
	sub rsp, 8*4
	
	; Write the header
	mov rcx, hdr
	mov rdx, hdr_fmt
	mov r8, edge
	mov r9, edge
	call sprintf
	
	; Save header length
	mov r13, rax
	
	mov r14, edge*edge
	mov r15, img
	
	; x and y pixel coordinates
	mov qword [px], 0
	mov qword [py], 0
	
	call clock
	mov [start_t], rax
	
	FINIT ; Reset the FPU (Unecessary but good practice)
	
	write_loop:
		; Push some zeroes to the FPU stack.
		FLDZ ; Zi
		FLDZ ; Zr
		
		; Push the pixel coordinates to the FPU stack.
		FILD qword [py]       ; Load the pixel's y coordinate
		FIDIV word [edge_mem]  ; Get the pixel coordinate into the range 0-1
		FMUL dword [cam_edge]  ; Scale by the camera's edge
		FADD dword [cam_of_y]  ; Translate the camera by its offset
		
		FILD qword [px]       ; Load the pixel's x coordinate
		FIDIV word [edge_mem]  ; Get the pixel coordinate into the range 0-1
		FMUL dword [cam_edge]  ; Scale by the camera's edge
		FADD dword [cam_of_x]  ; Translate the camera by its offset
		
		; ST: Cr, Ci, Zr, Zi, NULL, NULL, NULL, NULL
		
		; Perform the iterations
		mov qword [iter_c], iter ; Initialize the counter
		iter_loop:
			; Copy Zi and multiply it by itself
			FLD st3
			FMUL st0
			
			; Copy Zr and multiply it by itself
			FLD st3
			FMUL st0
			
			; ST: Zr*Zr, Zi*Zi, Cr, Ci, Zr, Zi, NULL, NULL
			;call dump_fpu_regs
			
			; Calculate the new Zr
			FLD st0
			
			; ST: Zr*Zr, Zr*Zr, Zi*Zi, Cr, Ci, Zr, Zi, NULL
			;call dump_fpu_regs
			
			FSUB st2
			FADD st3
			FXCH st5
			
			; ST: Zr, Zr*Zr, Zi*Zi, Cr, Ci, Zr*Zr - Zi*Zi + Cr, Zi, NULL
			;call dump_fpu_regs
			
			; Calculate the new Zi
			FMUL st6
			FIMUL word [const_2]
			FADD st4
			FXCH st6
			
			; ST: Zi, Zr*Zr, Zi*Zi, Cr, Ci, Zr*Zr - Zi*Zi + Cr, Zr*Zi*2 + Ci, NULL
			;call dump_fpu_regs
			
			FXCH st1
			FADD st2
			
			; ST: Zr*Zr + Zi*Zi, Zi, Zi*Zi, Cr, Ci, Zr*Zr - Zi*Zi + Cr, Zr*Zi*2 + Ci, NULL
			;call dump_fpu_regs
			
			; Convert the abs(Z)^2 value to an int and save it to memory, and pop the FPU
			FISTP word [sqr_abs_z]
			
			; Clean up the FPU
			FFREEP st0
			FFREEP st0
			
			;call dump_fpu_regs
			
			; Compare the abs(z)^2 with escape.
			mov rax, [sqr_abs_z]
			
			cmp rax, escape
			jg escaped
			
			; Decrement the counter and exit when it reaches 0
			mov rcx, [iter_c]
			dec rcx
			mov [iter_c], rcx
			cmp rcx, 0
			jne iter_loop
		
		not_escaped:
			; Write to the next pixel
			mov byte [r15], 0
			jmp escape_end
		
		escaped:
			; Write to the next pixel
			mov byte [r15], 255
		
		escape_end:
		
		; Clean up the FPU
		FFREEP st0
		FFREEP st0
		FFREEP st0
		FFREEP st0
		
		; Set ax/bx to the next pixel
		call next_pixel
		
		; Point to next pixel
		inc r15
		
		; Decrement pixel counter
		dec r14
		
		cmp r14, 0
		jne write_loop
	
	call clock
	
	sub rax, [start_t]
	mov [end_t], rax
	
	FILD qword [end_t]
	FIDIV dword [CLOCKS_PER_SEC]
	FSTP qword [end_t]
	
	mov rcx, prn_tim
	mov rdx, [end_t]
	call printf
	
	; Open the file
	mov rcx, fn
	mov rdx, fn_o
	call fopen
	
	; Save the file pointer
	mov r12, rax
	
	; Write header to the file
	mov rcx, hdr	; ptr
	mov rdx, 1		; size
	mov r8, r13		; nmemb
	mov r9, r12		; stream
	call fwrite
	
	; Write data the file
	mov rcx, img		; ptr
	mov rdx, 1			; size
	mov r8, edge*edge	; nmemb
	mov r9, r12			; stream
	call fwrite
	
	; Close the file
	mov rcx, r12
	call fclose
	
	; Free shadow space
	add rsp, 8*4
	pop rbp
	
	; Return value
	mov rax, 0
	
	ret