Friday, April 19, 2019

More on drawing sprites on the ZX Spectrum

A recent discussion on the Z80 Assembly Programming on the ZX Spectrum Facebook group prompted me to look into the differing costs of various ways of drawing sprite images on the ZX Spectrum.

The discussion revolved around handling the Speccy's slightly odd display file layout.  To recap, the bitmap display consists of 24 rows of 32 columns of cells, each cell of which has 8 lines of 8 pixels (one byte per line).  The address for a particular cell line is 010RRLLL RRRCCCCC where RRRRR is the row number (0..23), CCCCC is the column number (0..31), and LLL is the line number (0..7), all in binary.  Note that the display addresses are broken into three sections under this scheme: rows 0..7, rows 8..15, and rows 16..23.

(There is a separate attribute overlay giving the foreground and background colour for each cell, but I'm not going to discuss that here.)

The advantage of this layout is that if the display address is in a Z80 register pair, HL say, one can move from one cell line to the next simply by incrementing the upper half of the register pair: inc H.

There are three main ways of drawing a sprite image on to the display.  For the sake of argument, I will assume a 16x16 pixel image (more or less standard for most games on the Speccy).  I am not going to consider clipping at the edges of the display.  I am going to assume that we will draw images to the display using the xor method (very common: redrawing an image in the same place has the effect of removing it from the display).  I will further assume that, for speed, each image comes in 8 pre-shifted forms:

xxxxxxxx xxxxxxxx ........  - x 16 lines for 0px offset
.xxxxxxx xxxxxxxx x.......  - x 16 lines for 1px offset
..xxxxxx xxxxxxxx xx......  - x 16 lines for 2px offset
.
.
.
.......x xxxxxxxx xxxxxxx.  - x 16 lines for 7px offset

(1) Break the image into cell sized chunks and draw each cell as a separate operation.  This means that a 16x16 image at an arbitrary pixel position on the display will be drawn over a 3x3 cell region (i.e., 24 lines of drawing for 50% over-draw).  I have discussed this at length here before.

(2) Draw only the lines in the image (i.e., no over-draw), dealing with the line/row/third transition with in-line code.

(3) Draw only the lines in the image, but use a separate table of display line addresses (i.e., pre-computing the line/row/third transitions ahead of time).

So how do these schemes work out?  (I'm using Simon Brattel's excellent Zeus assembler syntax here, in case anyone is wondering.)

Dealing with the line/row/third transitions with in-line code.

The best approach I could come up with here was this:
- unroll the drawing code for 8 lines at a time;
- handle the row/third adjustment after the 8th line;
- jump into the correct spot at the start (e.g., if the top line of our image starts at line 3 of our cell, then we would start drawing at the "line 3" entry point).

Here's my code to do that:

; DE points to the image data: L-to-R, T-to-B, three bytes per line.
; C is height of sprite in lines.
; HL is address of display byte for the top-left byte of the sprite image.
Xor24px_Adj proc
                ; Jump into the correct offset for row adjustment.
                ; Mean cost is about 18 + 3 * (4 + 7) + 1 * (4 + 12)
                ; = 67 t-states.
                ld A, H
                and 000111
                xor 000111
                       jr z, _L7
                dec A: jr Z, _L6
                dec A: jr Z, _L5
                dec A: jr Z, _L4
                dec A: jr Z, _L3
                dec A: jr Z, _L2
                dec A: jr Z, _L1
                ; Unrolled drawing loop.
                _XorLineAdj macro()
                    ; Cost per 3 byte line is 3 * (21 + 6) + 6 * 4 + 5
                    ; = 112 t-states.
                    ld A, (DE): xor (HL): ld (HL), A: inc DE: inc L
                    ld A, (DE): xor (HL): ld (HL), A: inc DE: inc L
                    ld A, (DE): xor (HL): ld (HL), A: inc DE
                    dec L: dec L: inc H
                    dec C: ret z
                endm
    _L0         _XorLineAdj()
    _L1         _XorLineAdj()
    _L2         _XorLineAdj()
    _L3         _XorLineAdj()
    _L4         _XorLineAdj()
    _L5         _XorLineAdj()
    _L6         _XorLineAdj()
    _L7         _XorLineAdj()
                ; Adjustment for next row.
                ; Mean cost per line is (2 * (4 + 7 + 4) + 7 + 12) / 8
                ; = 6 t-states.
                ld A, L: add 32: ld L, A: jr c, _L0
                lf A, H: sub  8: ld H, A: jr _L0
                ; Overall cost after setup overhead is
                ; 112 + 6 = 118 t-states per line.
endp

So, for 16x16px pre-shifted sprite images, we expect to take around 67 + 16 * 118 = 1,955 t-states per image with this approach (there are other overheads, of course, but let's take this as our baseline).

Note: this runs to about 189 bytes of machine code, so it is quite compact.

Dealing with the line/row/third transitions with a pre-computed table of display line addresses.

The best approach I could come up with here is to pre-compute the starting address of each display line:

    dw $4000, $4100, $4200, ..., $4700
    dw $4020, $4120, $4220, ..., $4720
    ...
    dw $40e0, $41e0, $42e0, ..., $47e0
    dw $4800, $4900, $4a00, ..., $4f00
    dw $4820, $4920, $4a20, ..., $4f20
    ...

then
- point the stack pointer SP at the correct entry for the starting display line of the sprite image;
- put the image column number in a different register, B;
- pop each display line address off the stack (this is the fastest way of fetching 16-byte data on the Z80);
- drawing code for the entire image is unrolled!  If you want to draw a smaller image, you jump into the correct entry point for that.

Here's my code to do that:

; DE points to the image data: L-to-R, T-to-B, three bytes per line.
; SP points to the table of display line start addresses offset for
;   the first line of the sprite image.
; B is the leftmost column number of the sprite image.
; User calls into the correct offset (e.g., Xor24px_Tbl_16 for a 16 line
; image).
_XorLineTbl macro()
                ; Cost per 3 byte line is 22 + 3 * (21 + 6)
                ; = 93 t-states.
                pop HL: ld A, L: or B: ld L, A
                ld A, (DE): xor (HL): ld (HL), A: inc DE: inc L
                ld A, (DE): xor (HL): ld (HL), A: inc DE: inc L
                ld A, (DE): xor (HL): ld (HL), A: inc DE
endm

Xor24px_Tbl_16  _XorLineTbl()
Xor24px_Tbl_15  _XorLineTbl()
Xor24px_Tbl_14  _XorLineTbl()
; ...
Xor24px_Tbl_3   _XorLineTbl()
Xor24px_Tbl_2   _XorLineTbl()
Xor24px_Tbl_1   _XorLineTbl()
; Exit code goes here (e.g., restore SP and ret).

So, for 16x16px pre-shifted sprite images, we expect to take around 16 * 93 = 1,488 t-states per image.  Again, there is a bit more overhead to take into account here, but I'll call this the base-line figure.

Note that this scheme uses a lot more code: around 288 bytes of unrolled drawing code for a 16x16px image along with a 384 byte address table.  For my money, that is a good trade-off for the extra speed.

Comparison.

Rounding the numbers up, fixing things at the line-8 transition costs you 2,000 t-states per sprite image whereas using the stack-pointer-into-a-lookup-table scheme costs you only 1,500 t-states per sprite image.

What does this mean in practice?  Well, the usual approach to obtaining flicker-free graphics involves waiting for the screen-refresh interrupt, then redrawing everything before the CRT scan line catches up with you.  On the ZX Spectrum the upper screen border (before the display bitmap is drawn) takes around 14,000 t-states. 

For the fix-in-code approach, this is enough time to erase and redraw 14,000 / 2,000 / 2 = 3.5 16x16px sprites before you risk being caught be the screen refresh.

For the use-a-table approach, this is enough time to erase and redraw 14,000 / 1,500 / 2 = 4.6 16x16px sprites before you risk being caught by the screen refresh.

Conclusion.

For my money, I'd go for the table-based approach.

No comments: