Wednesday, September 30, 2015

Faster Sprites on the ZX Spectrum

This is one for anybody who grew up programming the Sinclair ZX Spectrum, a wonderful machine from the early '80s microcomputer days, sporting a 3.5 MHz Z80 CPU and 48 KBytes of RAM.

Lacking hardware sprites, ZX Spectrum programmers were/are forced to draw directly to the display. Moving a sprite therefore involves removing the previous sprite image before drawing the new one. Since sprites can overlap, this typically means first removing all previously drawn sprites before redrawing any of them in their new positions (unless you use XOR to draw your sprites, in which case removing a sprite is just redrawing it in its previous position and removing/redrawing can occur in any order).

My code implementing the ideas here is available at https://github.com/ralphbecket/Z80/tree/master/FasterSprites

Screen Layout and Timing

To achieve flicker-free animation, all drawing must occur before the display refresh crosses the part of the screen being updated. This means understanding something of the ZX Spectrum display arrangement:

+-----------------------------------------------------+
| Border: 64 scan lines before display file.          |
|                                                     |
|    +-------------------------------------------+    |
|    |  |  |  |  |  |   Display: 256 x 192 pixels|    |
|    |--+--+--+--+--+-- arranged as 32 cols of   |    |
|    |  |  |  |  |  |   24 rows of 8x8 cells.    |    |
|    |--+--+--+--+--+--                          |    |
|    |  |  |  |  |  |                            |    |
|    |--+--+--+--+--+--                          |    |
|    |  |  |  |  |  |                            |    |
|    |                                           |    |
|    |                                           |    |
|    |                                           |    | 
|    |                                           |    | 
|    |                                           |    | 
|    |                                           |    | 
|    +-------------------------------------------+    | 
|                                                     | 
|                                                     | 
+-----------------------------------------------------+ 

The border is a single, block colour surrounding the display file. It starts 64 scan lines before the display file. The display file is a 256 x 192 pixel bitmap arranged as 24 rows of 32 columns of 8x8 pixel cells. A separate attribute map specifies a single foreground/background colour pair for each cell (i.e., colours can only be specified at the level of cells, not pixels).

Each line takes 224 Ts for the hardware to display, where 1 T = 1 t-state = 1 clock tick of the 3.5 MHz Z80 CPU. The only reliable way to know when to start drawing is to wait for the vertical refresh interrupt, which occurs every time the hardware starts drawing a new frame from the top of the screen. To put this in perspective, a typical Z80 instruction takes around 4 to 16 Ts to execute.

Assuming we want flicker-free animation, this means we have 224 * 64 = 14,336 Ts from the start of the interrupt before the display refresh catches up with us. (There is a clever trick involving waiting for the display hardware to just pass the top of the display file and start drawing behind it, but it isn't very portable.) A whole frame takes 224 * (64 + 192 + 64) = 71,680 Ts.

Display Addressing

The display file is arranged as 32 columns of 24 rows of 8x8 pixel cells. Each cell has a colour attribute and eight bytes of bitmap data. Bitmap addresses are given in binary as 010RRLLL RRRCCCCC where RRRRR is the row number (0..23), CCCCC is the column number (0..31), and LLL is the line number within the cell (0..7). Attribute addresses are given as 010110RR RRRCCCCC. The curious looking bitmap addressing is cunningly arranged so that successive lines within a cell can be reached by simply incrementing the upper byte of the address.

Moving Bytes to the Display

Now, how best to move bits on to the display file? Setting aside the attribute file for now, there are a number of options. For the sake of this part of the discussion, I'm going to assume we are just interested in moving bytes from some (pre-shifted) source to a cell on the display file.

Here are the most common approaches, assuming the source bitmap address is in DE or SP and the target display bitmap cell address is in HL:

(1) Naive copy

    ld A, (DE)      ;   7 Ts
    ld (HL), A      ;   7 Ts
    inc DE          ;   6 Ts
    inc H           ;   4 Ts
    ; ... and so on for the next 7 lines.
    ; Cost per byte: 24 Ts

(2) Loading from the stack

    pop DE          ;  10 Ts
    ld (HL), D      ;   7 Ts
    inc H           ;   4 Ts
    ld (HL), E      ;   7 Ts
    inc H           ;   4 Ts
    ; ... and so on for the next 3 line pairs.
    ; Cost per byte: 16 Ts

(3) Self-modifying code

    ld (HL), line_1 ;  10 Ts
    inc H           ;   4 Ts
    ld (HL), line_2 ;  10 Ts
    inc H           ;   4 Ts
    ; and so on for the next 6 lines.
    ; Cost per byte: 14 Ts

(4) Copying from double buffer with some SMC

    ld SP, src_addr
    pop HL
    pop DE
    pop BC
    ; ... and so on for other register pairs.
    ld SP, disp_addr
    ; ...
    push BC
    push DE
    push HL
    ; and repeat until done.
    ; Cost per byte: 10.5 Ts

Comparing the Options

We can safely discard option (1) as the slowest and having no compensating merits.

Option (4) essentially assumes we have written to some backing buffer and are simply copying it to the display. A full screen takes 192 * 32 * 10.5 = 64,512 Ts, which, being approximately the same as the 224 * (64 + 192 + 64) = 71,680 Ts it takes for the hardware to display a full frame, leaves no time for drawing the backing buffer in the first place. In other words, option (4) is only acceptable if we are prepared to render every other frame or less. For this reason, I'm going to discard option (4) -- I want to see what we can manage in the opening 14,336 Ts of every frame.

Options (2) and (3) remain. In theory, we could manage 14,336 / (8 * 16) = 112 cells per frame with option (2) or 14,336 / (8 * 14) = 128 cells per frame with option (3).

These figures, of course, are without counting bookkeeping such as setting up DE/SP/HL or writing colour attributes. For example, for each cell we'd need something like the following to set up the registers and set the attribute byte:

    pop HL      ;  10 Ts -- load the attr addr.
    pop DE      ;  10 Ts -- load the attr byte and disp upper addr.
    ld (HL), D  ;   7 Ts -- write the attr byte.
    ld H, E     ;   4 Ts -- convert attr addr to disp addr.

Let's assume we need at least 40 Ts of overhead per cell or 40 / 8 = 5 Ts per byte. This gives 14,336 / (8 * 21) = 85 cells per frame for option (2) and 14,336 / (8 * 19) = 94 cells per frame for option (3).

How many sprites at full tilt?

Assuming a "standard" 16 x 16 pixel sprite, pre-shifted to any position, such a sprite will occupy a 3 * 3 cell region. Therefore, option (2) allows for 85 / 9 = 9.4 sprites per frame, while option (3) allows for 94 / 9 = 10.4 sprites per frame.

Memory budgets

The downside of option (3) is the amount of memory it requires. Each cell that we draw requires the following space for self-modifying code:

    ld HL, attr_addr    ; 3 bytes
    ld (HL), attr       ; 2 bytes
    ld H, disp_hi_addr  ; 2 bytes
    ld (HL), line_1     ; 2 bytes
    inc H               ; 1 byte
    ; ... last two lines repeated eight times in total.
    ; Total cost: 30 bytes.

Option (2) requires less space per cell, just the data for the drawing loop:

    dw attr_addr
    db attr, disp_hi_addr
    db line_1, line_2, ..., line_8
    ; Total cost: 12 bytes.

The winner is...

After trying both approaches, I found option (3) to be costly in memory and costly to set up each frame, leaving little time for game logic. For that reason, I claim option (2) is the optimal strategy provided we are aiming for full-screen full-speed flickerless animation without resorting to beam chasing tricks.

Digression: full cells or part cells

The discussion so far has assumed we are always drawing full cells. Consider a 16x16 sprite, this, when shifted into position, will occupy 3x16 bytes. However, it will (seven times out of eight) cross three cell rows. We can either optimise drawing of complete cells (3x24 bytes) or we can try to draw only what we need to (3x16 bytes). There are two problems with the latter approach. The first is what to do when sprites overlap (see below). With the prepare-first-then-draw-quickly strategy of option (2), we can handle the merging of overlapping images in the preparation period, which is outside the time-sensitive drawing period. With the just-draw-what-you-need-and-no-more strategy, it's not at all clear how you could manage this. You could restrict yourself to just xor-drawing and add the xor-code inline, but that costs you -- here's the code for doing that for one byte pair:

    ; Xor-drawing.
    pop DE      ;  10 Ts - Fetch two lines
    ld A, (HL)  ;   7 Ts
    xor D       ;   4 Ts
    ld (HL), A  ;   7 Ts
    inc H       ;   4 Ts
    ld A, (HL)  ;   7 Ts
    xor E       ;   4 Ts
    ld (HL), A  ;   7 Ts
    inc H       ;   4 Ts
    ; All repeated four times...
    ; Total cost: 27 Ts per byte

Add in the 5 Ts cost for attribute handling et cetera and we're up to 32 Ts per byte, limiting us to 14,336 / (8 * 32) = 56 cells. Not nearly so good compared to the 85 cells we can draw with option (2). Now, you might say, "but in this case we're only going to draw just what we need, so our 16x16 sprites only occupy 6 cells worth of space, giving us 56 / 6 = 9 sprites per frame". That is true, until you consider that each of those sprites will, seven times out of eight, cross two cell row boundaries. You now need to add logic to handle both "drawing from one to eight lines of a cell (top or bottom)" and "updating the display address for the top of the next cell row". Neither of those operations are cheap -- I've tried every trick I can think of -- resulting in a more complex, slower solution. That's why I've decided to stick to drawing complete cells: the overheads involved in drawing partial cells are simply too high to pay off.

Overlapping sprites

When cells from two sprites overlap, we need to decide how to handle the situation. The simplest approach of "one of them wins, the other is never seen" looks awful. Instead, we want to combine them somehow -- typically using or- or xor-merging or even using masking (and-mask then or-image) if we want to be really fancy. The key thing is, if we want to sustain nine sprites per frame, this "compositing" has to occur in the set-up period, not in the drawing period.

One way to do this is to maintain a map, much like the display attribute map, indicating which of the cells on screen is to be drawn in the next frame. Our "used cells" map will, instead, hold zero for unused cells, and positive number n if the nth "draw list" record holds the data (attribute pointer, attribute, display pointer high byte, and eight bitmap bytes) for that cell.

Used cell map:
+-------------------------------------+
|  |  |  |  |  |                      |
|--+--+--+--+--+--                    |
|  | 1| 4| 7|  |                      |
|--+--+--+--+--+--                    |
|  | 2| 5| 8|11|14                    |
|--+--+--+--+--+--                    |
|  | 3| 6| 9|12|15                    |
|--+--+--+--+--+--                    |
|  |  |  |10|13|16 etc.               | 
|                                     |
|                                     |
+-------------------------------------+

Draw list:
1 : attr ptr, attr, disp hi ptr, bits1 ... bits 8
2 : attr ptr, attr, disp hi ptr, bits1 ... bits 8
3 : attr ptr, attr, disp hi ptr, bits1 ... bits 8
4 : attr ptr, attr, disp hi ptr, bits1 ... bits 8
etc.

When we want to add a cell to our draw list, we first examine the corresponding entry in the used cells map. If this is a zero, then we need to set up a new draw list record for this cell and update the used cells map accordingly. If the used cells map contains a non-zero value, that tells us that the cell already has a draw list record and which record we must adjust in our compositing operation.

Erasing sprites

The Speccy allows for a neat, cost effective erasure trick provided you don't have to preserve any background. Simply set the attributes for any cell to be erased to have the same foreground and background colours! No bitmap data needs to be changed at all. This is a single byte write and vastly cheaper than redrawing using xor or any similar shenanigans.

Thankfully, our used cell map provides a convenient mechanism for working out which cells need to be erased from the previous frame (we don't want to blank any cells we are drawing to this frame, just the ones that were drawn to in the previous frame, but not this frame).

Basically, we need to track which cells were used in the previous frame. After drawing the current frame, we see which cells were used in the previous frame, but not in this frame (i.e., we check a list of cells from the previous frame against the cells used in this frame -- via the used cells map -- and write the blank attribute value to those cells).

Empirical testing

Using Simon Brattel's marvellous Zeus assembler/emulator/debugger/profiler I have collected some performance data. Running nine 16x16 pixel (i.e., 3x3 cell) sprites I observe the following T-states per frame:

  • 13,054 Ts spent drawing cells (tidily within our 14,336 T budget);
  • 8,087 Ts spent blanking cells (over our 14,336 T budget);
  • 36,587 Ts spent setting up cells;
  • 9,801 Ts spent managing sprites.
This is rather encouraging. I am somewhat surprised by the amount of time needed for blanking, but there's a fair amount of other bookkeeping going on in there and only a few cells typically need blanking per frame (sprites will typically move less than one cell width per frame). There is plenty of time left over for game logic, sound, and what-have-you.

Further thoughts

Given that so much remaining time is available per frame, it seems quite likely that a similar number of masked sprites could be supported at full frame rate. This would require adjustment to the "compositing" code and sprite layout, but nothing major. If there were a non-blank background to deal with, that would also complicate matters somewhat, but I can imagine at least a couple of plausible approaches.

All that said, I'm happy that my analysis played out in practice. I'm also happy that this is a shape-agnostic solution: any sprite sizes can be accommodated provided they fit within the approximately 100 cells per frame. That would be one huge 72x72 pixel sprite or four 32x32 pixel sprites or any other combination. All at 50Hz, in full colour, with compositing.

8 comments:

Ilya said...

If only there was a "ZX Spectrum expertise" category on LinkedIn, you'd have my endorsement :)

BloodBaz said...

Great write up. I've always tried to beat the scanline but there is some great stuff here to make me rethink things

Rafe said...

Thanks -- there are some amazing demoes out there, but in this case I was looking for something fairly general purpose that didn't require an unrealistic amount of mystical set-up by the user.

Hiro Protagonist said...

using Push (and you've set the stack to point at screen ram) would be even quicker :)

11 T-States per byte and the inc is free

Rafe said...

Isn't that what I propose as option (4) above? It's only effective if you're trying to redraw substantial rectangular regions of the display.

Roberto said...

Hello, what assembler compiler do you use? In ZX Spin's assembler
the source code does not compile.

Rafe said...

@baze, thank you for your comments: most interesting.

The 384-push-clear is a good option and twice the speed. The reason I didn't go for something like that is that path requires you to blank the attributes _before_ you draw anything, which would reduce the actual sprite drawing time before the beam hits the display area. Cleaning up _after_ seemed like the option with the least amount of flicker -- but I'm sure you're right, there are more experiments to be tried.

Regarding the ldi/ldd zig-zag, that's a neat approach, but it still takes 16 Ts/byte, same as option (2), although I grant you that any other overheads would be minimised.

Anywaym, thanks again!

Rafe said...

@Roberto, I use Simon Brattel's most excellent Zeus assembler which you can obtain here: http://desdes.com/products/oldfiles/index.htm -- I hope the code isn't too hard to convert for other assemblers, though.