Z80 Bits

Milos "baze" Bazelides, baze@stonline.sk
last updated March 3rd, 2003

I decided to create this collection of Z80 routines for one simple reason - I like magic. And of course, I was fed up with bad code one can find in many embedded devices, web pages, tutorials and such. The routines presented here is what I believe to be the best of its kind or at least very close to it.

However, if you find a bug or optimisation please let me know. My objective is to know and share the best stuff out there. Also, if you feel you've got something that should be posted here drop me a mail. Keep in mind though that any code you submit should be machine independent and of general use. Of course, you'll be guaranteed a honourable mention in the Credits :)

Please don't complain about lack of comments. This is not a coding tutorial but rather a collection of tricks for (more or less) experienced coders. I'm sure it's not that hard to figure out what's going on.

Note: This document is by far not finished yet. I'll continue to add new code in near future. I also think of providing binary images of look-up tables in cases where table generator is not trivial. Also, I'd be glad if some native English speaker helped me to correct numerous grammar and spelling mistakes :)

Table of Contents

1 Integer Multiplication 2 Integer Division 3 Random Number Generators 4 Conversions Between Numbers and Strings 5 Cyclic Redundancy Checks (CRC) Credits

1 Integer Multiplication

1.1 Classic 8-bit * 8-bit Unsigned

Input: H = Multiplier, E = Multiplicand, L = 0, D = 0
Output: HL = Product

	sla	h		; optimised 1st iteration
	jr	nc,$+3
	ld	l,e

	add	hl,hl		; unroll 7 times
	jr	nc,$+3		; ...
	add	hl,de		; ...

1.2 Classic 16-bit * 8-bit Unsigned

Input: A = Multiplier, DE = Multiplicand, HL = 0, C = 0
Output: A:HL = Product

	add	a,a		; optimised 1st iteration
	jr	nc,$+4
	ld	h,d
	ld	l,e

	add	hl,hl		; unroll 7 times
	rla			; ...
	jr	nc,$+4		; ...
	add	hl,de		; ...
	adc	a,c		; ...

1.3 Square Table 8-bit * 8-bit Signed

Input: B = Multiplier, C = Multiplicand (both in range -128..127)
Output: HL = Product

Note: Routine uses one of these two formulas: 2ab = (a + b)^2 - a^2 - b^2 or 2ab = a^2 + b^2 - (a - b)^2, depends if (a + b) overflows or not. Powering by 2 is done by table lookup. 512 bytes long table is aligned to 256 byte boundary and contains entries of form SqrTab[x] = x^2. If we treat one of the operands as fractional number -1..1 premultiplied by 128, 2ab performs native shift of the result into register H. That's especially useful e.g. for x * sin(y). Otherwise we have to shift HL right (divide it by 2). We could divide table entries by 2 instead but that causes loss of precision.

Mul8x8	ld	h,SqrTab/256
	ld	l,b
	ld	a,b
	ld	e,(hl)
	inc	h
	ld	d,(hl)		; DE = a^2
	ld	l,c
	ld	b,(hl)
	dec	h
	ld	c,(hl)		; BC = b^2
	add	a,l		; let's try (a + b)
	jp	pe,Plus		; jump if no overflow

	sub	l
	sub	l
	ld	l,a
	ld	a,(hl)
	inc	h
	ld	h,(hl)
	ld	l,a		; HL = (a - b)^2
	ex	de,hl
	add	hl,bc
	sbc	hl,de		; HL = a^2 + b^2 - (a - b)^2

;	sra	h		; uncomment to get real product
;	rr	l
	ret

Plus	ld	l,a
	ld	a,(hl)
	inc	h
	ld	h,(hl)
	ld	l,a		; HL = (a + b)^2
	or	a
	sbc	hl,bc
	or	a
	sbc	hl,de		; HL = (a + b)^2 - a^2 - b^2

;	sra	h		; uncomment to get real product
;	rr	l
	ret

Square table generator is based on a fact that differences between consecutive squares (0, 1, 4, 9, 16, 25, ...) are a sequence of odd numbers (1, 3, 5, 7, 9, ...).

	ld	hl,SqrTab	; must be a multiple of 256
	ld	b,l
	ld	c,l		; BC holds odd numbers
	ld	d,l
	ld	e,l		; DE holds squares

SqrGen	ld	(hl),e
	inc	h
	ld	(hl),d		; store x^2
	ld	a,l
	neg
	ld	l,a
	ld	(hl),d
	dec	h
	ld	(hl),e		; store -x^2
	ex	de,hl
	inc	c
	add	hl,bc		; add next odd number
	inc	c
	ex	de,hl

	cpl			; one byte replacement for NEG, DEC A
	ld	l,a
	rla
	jr	c,SqrGen

1.4 Square Table 6-bit * 6-bit Signed

The first thing that should be pointed out here is that the topic is not particularly correct. Actually, the routine is able to multiply any pair of numbers x, y as long as (x + y) <= 127 and (x - y) >= -128. But if x, y are signed 6-bit values, these rules are never violated, no overflows occur and no specific checking is needed.

The routine is based on a formula 4xy = (x + y)^2 - (x - y)^2 and uses the same lookup table (see previous chapter) except all table entries are pre-divided by 4 to avoid division (shifting) at the end. An explanation why it works can be found here. In case we leave the table as is, routine nicely handles fixed point multiplications. That means, if we treat one of the operands as a fractional number in range (-1, 1) pre-multiplied by 64, integer part of the result gets shifted handily into register H.

Input: B = Multiplier, C = Multiplicand
Output: HL = Product

Note: SqrTab must be aligned to a 256 byte boundary.

Mul6x6	ld	h,SqrTab/256
	ld	a,b
	sub	c		; A = x - y
	ld	l,a
	ld	a,b
	add	a,c		; A = x + y
	ld	c,(hl)
	inc	h
	ld	b,(hl)		; BC = (x - y)^2
	ld	l,a
	ld	a,(hl)
	dec	h
	ld	l,(hl)
	ld	h,a		; HL = (x + y)^2
	or	a
	sbc	hl,bc		; HL = (x + y)^2 - (x - y)^2

It's possible to speed up the routine by having two consecutive square tables where first table holds entries of negative sign. Question is, however, if 4 cycles are worth wasting another 512 bytes.

Mul6x6	ld	h,SqrTab/256
	ld	a,b
	sub	c		; A = x - y
	ld	l,a
	ld	a,b
	add	a,c		; A = x + y
	ld	c,(hl)
	inc	h
	ld	b,(hl)		; BC = -(x - y)^2, that's the trick
	inc	h
	ld	l,a
	ld	a,(hl)
	inc	h
	ld	h,(hl)
	ld	l,a		; HL = (x + y)^2
	add	hl,bc		; HL = (x + y)^2 - (x - y)^2

1.5 Logarithmic Table 8-bit * 8-bit Signed

(to do)

2 Integer Division

Division is an awkward arithmetic operation even if it's directly supported by hardware. Thus, we can't expect blazing speed even from well written code. Before you attempt to use any of these routines, please consider these hints:

The routines here are implementations of so-called restoring and non-restoring division algorithms. Unfortunately it seems that implementations of more sophisticated methods yield to slower code.

Note: SLIA stands for semi-documented instruction Shift Left Inverted Arithmetic (operation codes 30h..37h prefixed by CBh, DDh CBh or EDh CBh) also known as SLL (Shift Left Logical). It shifts register left and sets the least significant bit to 1. In most division routines register B is left unused. It enables you to create loops using DJNZ in case you prefer small size over speed of unrolled code.

2.1 Classic 8-bit / 8-bit Unsigned

Input: D = Dividend, E = Divisor, A = 0
Output: D = Quotient, A = Remainder

	sla	d		; unroll 8 times
	rla			; ...
	cp	e		; ...
	jr	c,$+4		; ...
	sub	e		; ...
	inc	d		; ...

The most awkward part of this algorithm is Carry complement. If we had success subtracting divisor from partial remainder, Carry is set to 0. However, it means that we should add 1 to the partial result and vice versa. One workaround is to leave Carry as is, get rid of one instruction and complement whole result at the end (which introduces a little overhead but it's still more efficient).

Input: D = Dividend, E = Divisor, A = 0, Carry = 0
Output: A = Quotient, E = Remainder

	rl	d		; unroll 8 times
	rla			; ...
	sub	e		; ...
	jr	nc,$+3		; ...
	add	a,e		; ...

	ld	e,a		; save remainder
	ld	a,d		; complement the result
	cpl

In case you are really looking to save each cycle there's a slightly optimised non-restoring version of the algorithm. The downside is that routine splits into two almost identical tracks of code (I wrote them to columns for better readability).

Input: D = Dividend, E = Divisor, A = 0
Output: D = Quotient, A = Remainder

	sla	d
	rla
	cp	e
	jr	c,C1
NC0	sub	e

	slia	d	C1	sla	d
	rla			rla
	cp	e		cp	e
	jr	c,C2		jr	nc,NC1
NC1	sub	e

	slia	d	C2	sla	d
	rla			rla
	cp	e		cp	e
	jr	c,C3		jr	nc,NC2
NC2	sub	e

	slia	d	C3	sla	d
	rla			rla
	cp	e		cp	e
	jr	c,C4		jr	nc,NC3
NC3	sub	e

	slia	d	C4	sla	d
	rla			rla
	cp	e		cp	e
	jr	c,C5		jr	nc,NC4
NC4	sub	e

	slia	d	C5	sla	d
	rla			rla
	cp	e		cp	e
	jr	c,C6		jr	nc,NC5
NC5	sub	e

	slia	d	C6	sla	d
	rla			rla
	cp	e		cp	e
	jr	c,C7		jr	nc,NC6
NC6	sub	e

	slia	d	C7	sla	d

2.2 Classic 16-bit / 8-bit Unsigned

Input: HL = Dividend, C = Divisor, A = 0
Output: HL = Quotient, A = Remainder (see note)

	add	hl,hl		; unroll 16 times
	rla			; ...
	cp	c		; ...
	jr	c,$+4		; ...
	sub	c		; ...
	inc	l		; ...

2.3 Classic 16-bit / 16-bit Unsigned

Input: A:C = Dividend, DE = Divisor, HL = 0
Output: A:C = Quotient, HL = Remainder

	slia	c		; unroll 16 times
	rla			; ...
	adc	hl,hl		; ...
	sbc	hl,de		; ...
	jr	nc,$+4		; ...
	add	hl,de		; ...
	dec	c		; ...

2.4 Classic 24-bit / 8-bit Unsigned

Input: E:HL = Dividend, D = Divisor, A = 0
Output: E:HL = Quotient, A = Remainder

	add	hl,hl		; unroll 24 times
	rl	e		; ...
	rla			; ...
	cp	d		; ...
	jr	c,$+4		; ...
	sub	d		; ...
	inc	l		; ...

2.5 Classic 24-bit / 16-bit Unsigned

Input: A:BC = Dividend, DE = Divisor, HL = 0
Output: A:BC = Quotient, HL = Remainder

	slia	c		; unroll 24 times
	rl	b		; ...
	rla			; ...
	adc	hl,hl		; ...
	sbc	hl,de		; ...
	jr	nc,$+4		; ...
	add	hl,de		; ...
	dec	c		; ...

2.6 Classic 32-bit / 8-bit Unsigned

Input: DE:HL = Dividend, C = Divisor, A = 0
Output: DE:HL = Quotient, A = Remainder

	add	hl,hl		; unroll 32 times
	rl	e		; ...
	rl	d		; ...
	rla			; ...
	cp	c		; ...
	jr	c,$+4		; ...
	sub	c		; ...
	inc	l		; ...

3 Random Number Generators

3.1 8-bit Random Number Generator

This is a very simple linear congruential generator. The formula is x[i + 1] = (5 * x[i] + 1) mod 256. Its only advantage is small size and simplicity. Due to nature of such generators only a couple of higher bits should be considered random.

Input: none
Output: A = pseudo-random number, period 256

Rand8	ld	a,Seed		; Seed is usually 0
	ld	b,a
	add	a,a
	add	a,a
	add	a,b
	inc	a		; another possibility is ADD A,7
	ld	(Rand8+1),a
	ret

3.2 16-bit Random Number Generator

This generator is based on similar method but gives much better results. It was taken from an old ZX Spectrum game and slightly optimised.

Input: none
Output: HL = pseudo-random number, period 65536

Rand16	ld	de,Seed		; Seed is usually 0
	ld	a,d
	ld	h,e
	ld	l,253
	or	a
	sbc	hl,de
	sbc	a,0
	sbc	hl,de
	ld	d,0
	sbc	a,d
	ld	e,a
	sbc	hl,de
	jr	nc,Rand
	inc	hl
Rand	ld	(Rand16+1),hl
	ret

4 Conversions Between Numbers and Strings

4.1 16-bit Integer to ASCII (decimal)

Input: HL = number to convert, DE = location of ASCII string
Output: ASCII string at (DE)

Num2Dec	ld	bc,-10000
	call	Num1
	ld	bc,-1000
	call	Num1
	ld	bc,-100
	call	Num1
	ld	c,-10
	call	Num1
	ld	c,-1

Num1	ld	a,'0'-1
Num2	inc	a
	add	hl,bc
	jr	c,Num2
	sbc	hl,bc

	ld	(de),a
	inc	de
	ret

4.2 16-bit Integer to ASCII (hexadecimal)

Hexadecimal conversion operates directly on nibbles and takes advantage of nifty DAA trick.

Input: HL = number to convert, DE = location of ASCII string
Output: ASCII string at (DE)

Num2Hex	ld	a,h
	call	Num1
	ld	a,h
	call	Num2
	ld	a,l
	call	Num1
	ld	a,l
	jr	Num2

Num1	rra
	rra
	rra
	rra
Num2	or	F0h
	daa
	add	a,A0h
	adc	a,40h

	ld	(de),a
	inc	de
	ret

4.3 Memory dump (hexadecimal)

As this is one of the most typical tasks, why not to do it tricky way? The code snippet here takes a byte from (HL) and prints it. Note that it uses another (shorter) DAA trick as we know that Half Carry is cleared before DAA.

Input: HL = address of data
Output: memory dump

Note: You'll have to replace the PRINT_CHAR macro by actual platform-specific code. Don't forget to preserve the contents of HL!

	xor	a
	rld
	call	Nibble

Nibble	push	af
	daa
	add	a,F0h
	adc	a,40h

	PRINT_CHAR		; prints ASCII character in A

	pop	af
	rld
	ret

5 Cyclic Redundancy Checks (CRC)

5.1 16-bit CRC-CCITT

The following routine calculates standard CRC-CCITT bit-by-bit using polynomial 1021h. Another common scheme CRC-16 uses polynomial A001h and starts with value 0 (so it's likely that you misinterpret bunch of zeros as valid data). It might be useful to extend the code to use 16-bit byte counter.

Input: DE = address of input data, C = number of bytes to process
Output: HL = CRC

Crc16	ld	hl,FFFFh
Read	ld	a,(de)
	inc	de
	xor	h
	ld	h,a
	ld	b,8
CrcByte	add	hl,hl
	jr	nc,Next
	ld	a,h
	xor	10h
	ld	h,a
	ld	a,l
	xor	21h
	ld	l,a
Next	djnz	CrcByte
	dec	c
	jr	nz,Read
	ret

5.2 Table-driven 16-bit CRC-CCITT

Note: I haven't tested the results yet so there might be a bug somewhere (most likely wrong polynomial producing bad results).

This is a much faster equivalent of the previous routine. It processes one byte at a time using 512 byte long table. There is a change in algorithm though. Intermediate results are shifted right and polynomial is reversed. It means that even results are reversed (the most significant bit is actually the least significant one and vice versa). Depending on actual use this might be a problem or not (for example, it's suitable if you interoperate with hardware as UARTs send least significant bit first). Even if you decide to adjust result back to correct value, you should still gain more than you loss.

Input: HL = address of input data, BC = number of bytes to process
Output: DE = CRC

Note: CrcTab must be aligned to a 256 byte boundary. Table generator uses reverse of 1021h, that is 8408h.

Crc16	ld	d,FFh
	ld	a,e
Read	xor	(hl)
	ex	de,hl
	ld	l,a
	ld	a,h
	ld	h,CrcTab/256
	xor	(hl)
	inc	h
	ld	h,(hl)
	ex	de,hl
	cpi
	jp	pe,Read
	ld	e,a
	ret

CRC table generator:

	ld	hl,CrcTab
CrcGen	ld	d,0
	ld	e,l
	ld	b,8
CrcByte	srl	d
	rr	e
	jr	nc,Next
	ld	a,d
	xor	84h
	ld	d,a
	ld	a,e
	xor	08h
	ld	e,a
Next	djnz	CrcByte
	ld	(hl),e
	inc	h
	ld	(hl),d
	dec	h
	inc	l
	jr	nz,CrcGen
	ret

5.3 32-bit CRC-32

(to do)

5.4 Table-driven 32-bit CRC-32

(to do)

Credits

My thanks goes to the following people who contributed to this document:

Slavomir "Busy" Labsky
for help with optimisation of exponential multiplication (well... and for teaching me how to code ;).

Pavel "Zilogator" Cimbal
for 8-bit random number generator, integer-to-hexadecimal conversions and many inspiring brainstorms.

Norbert "Noro" Grellneth
for neat "ADD HL,HL does it all" trick in 8-bit * 8-bit multiplication routine.

Tomas "Universum" Vilim
for clever use of SLIA in integer division routines that were a great resource for further optimisation.

Patai "CoBB" Gergely
for use of CP to avoid undocumented SLIA or Carry complement in some integer division routines.

Lawrence Chitty
for Z80 implementation of "fast CRC" without lookup table and helpful hints on table-driven CRC.

Petr "Poke" Petyovsky
for non-restoring optimisation of integer division routine.

David Revelj
for pointing out that pre-division of table entries by 4 in 6-bit * 6-bit multiplication doesn't cause loss of precision (and several hints that helped to clear some confusing comments).