Friday, April 19, 2019

More on drawing sprites on the ZX Spectrum

A recent discussion on the Z80 Assembly Programming on the ZX Spectrum Facebook group prompted me to look into the differing costs of various ways of drawing sprite images on the ZX Spectrum.

The discussion revolved around handling the Speccy's slightly odd display file layout.  To recap, the bitmap display consists of 24 rows of 32 columns of cells, each cell of which has 8 lines of 8 pixels (one byte per line).  The address for a particular cell line is 010RRLLL RRRCCCCC where RRRRR is the row number (0..23), CCCCC is the column number (0..31), and LLL is the line number (0..7), all in binary.  Note that the display addresses are broken into three sections under this scheme: rows 0..7, rows 8..15, and rows 16..23.

(There is a separate attribute overlay giving the foreground and background colour for each cell, but I'm not going to discuss that here.)

The advantage of this layout is that if the display address is in a Z80 register pair, HL say, one can move from one cell line to the next simply by incrementing the upper half of the register pair: inc H.

There are three main ways of drawing a sprite image on to the display.  For the sake of argument, I will assume a 16x16 pixel image (more or less standard for most games on the Speccy).  I am not going to consider clipping at the edges of the display.  I am going to assume that we will draw images to the display using the xor method (very common: redrawing an image in the same place has the effect of removing it from the display).  I will further assume that, for speed, each image comes in 8 pre-shifted forms:

xxxxxxxx xxxxxxxx ........  - x 16 lines for 0px offset
.xxxxxxx xxxxxxxx x.......  - x 16 lines for 1px offset
..xxxxxx xxxxxxxx xx......  - x 16 lines for 2px offset
.
.
.
.......x xxxxxxxx xxxxxxx.  - x 16 lines for 7px offset

(1) Break the image into cell sized chunks and draw each cell as a separate operation.  This means that a 16x16 image at an arbitrary pixel position on the display will be drawn over a 3x3 cell region (i.e., 24 lines of drawing for 50% over-draw).  I have discussed this at length here before.

(2) Draw only the lines in the image (i.e., no over-draw), dealing with the line/row/third transition with in-line code.

(3) Draw only the lines in the image, but use a separate table of display line addresses (i.e., pre-computing the line/row/third transitions ahead of time).

So how do these schemes work out?  (I'm using Simon Brattel's excellent Zeus assembler syntax here, in case anyone is wondering.)

Dealing with the line/row/third transitions with in-line code.

The best approach I could come up with here was this:
- unroll the drawing code for 8 lines at a time;
- handle the row/third adjustment after the 8th line;
- jump into the correct spot at the start (e.g., if the top line of our image starts at line 3 of our cell, then we would start drawing at the "line 3" entry point).

Here's my code to do that:

; DE points to the image data: L-to-R, T-to-B, three bytes per line.
; C is height of sprite in lines.
; HL is address of display byte for the top-left byte of the sprite image.
Xor24px_Adj proc
                ; Jump into the correct offset for row adjustment.
                ; Mean cost is about 18 + 3 * (4 + 7) + 1 * (4 + 12)
                ; = 67 t-states.
                ld A, H
                and 000111
                xor 000111
                       jr z, _L7
                dec A: jr Z, _L6
                dec A: jr Z, _L5
                dec A: jr Z, _L4
                dec A: jr Z, _L3
                dec A: jr Z, _L2
                dec A: jr Z, _L1
                ; Unrolled drawing loop.
                _XorLineAdj macro()
                    ; Cost per 3 byte line is 3 * (21 + 6) + 6 * 4 + 5
                    ; = 112 t-states.
                    ld A, (DE): xor (HL): ld (HL), A: inc DE: inc L
                    ld A, (DE): xor (HL): ld (HL), A: inc DE: inc L
                    ld A, (DE): xor (HL): ld (HL), A: inc DE
                    dec L: dec L: inc H
                    dec C: ret z
                endm
    _L0         _XorLineAdj()
    _L1         _XorLineAdj()
    _L2         _XorLineAdj()
    _L3         _XorLineAdj()
    _L4         _XorLineAdj()
    _L5         _XorLineAdj()
    _L6         _XorLineAdj()
    _L7         _XorLineAdj()
                ; Adjustment for next row.
                ; Mean cost per line is (2 * (4 + 7 + 4) + 7 + 12) / 8
                ; = 6 t-states.
                ld A, L: add 32: ld L, A: jr c, _L0
                lf A, H: sub  8: ld H, A: jr _L0
                ; Overall cost after setup overhead is
                ; 112 + 6 = 118 t-states per line.
endp

So, for 16x16px pre-shifted sprite images, we expect to take around 67 + 16 * 118 = 1,955 t-states per image with this approach (there are other overheads, of course, but let's take this as our baseline).

Note: this runs to about 189 bytes of machine code, so it is quite compact.

Dealing with the line/row/third transitions with a pre-computed table of display line addresses.

The best approach I could come up with here is to pre-compute the starting address of each display line:

    dw $4000, $4100, $4200, ..., $4700
    dw $4020, $4120, $4220, ..., $4720
    ...
    dw $40e0, $41e0, $42e0, ..., $47e0
    dw $4800, $4900, $4a00, ..., $4f00
    dw $4820, $4920, $4a20, ..., $4f20
    ...

then
- point the stack pointer SP at the correct entry for the starting display line of the sprite image;
- put the image column number in a different register, B;
- pop each display line address off the stack (this is the fastest way of fetching 16-byte data on the Z80);
- drawing code for the entire image is unrolled!  If you want to draw a smaller image, you jump into the correct entry point for that.

Here's my code to do that:

; DE points to the image data: L-to-R, T-to-B, three bytes per line.
; SP points to the table of display line start addresses offset for
;   the first line of the sprite image.
; B is the leftmost column number of the sprite image.
; User calls into the correct offset (e.g., Xor24px_Tbl_16 for a 16 line
; image).
_XorLineTbl macro()
                ; Cost per 3 byte line is 22 + 3 * (21 + 6)
                ; = 93 t-states.
                pop HL: ld A, L: or B: ld L, A
                ld A, (DE): xor (HL): ld (HL), A: inc DE: inc L
                ld A, (DE): xor (HL): ld (HL), A: inc DE: inc L
                ld A, (DE): xor (HL): ld (HL), A: inc DE
endm

Xor24px_Tbl_16  _XorLineTbl()
Xor24px_Tbl_15  _XorLineTbl()
Xor24px_Tbl_14  _XorLineTbl()
; ...
Xor24px_Tbl_3   _XorLineTbl()
Xor24px_Tbl_2   _XorLineTbl()
Xor24px_Tbl_1   _XorLineTbl()
; Exit code goes here (e.g., restore SP and ret).

So, for 16x16px pre-shifted sprite images, we expect to take around 16 * 93 = 1,488 t-states per image.  Again, there is a bit more overhead to take into account here, but I'll call this the base-line figure.

Note that this scheme uses a lot more code: around 288 bytes of unrolled drawing code for a 16x16px image along with a 384 byte address table.  For my money, that is a good trade-off for the extra speed.

Comparison.

Rounding the numbers up, fixing things at the line-8 transition costs you 2,000 t-states per sprite image whereas using the stack-pointer-into-a-lookup-table scheme costs you only 1,500 t-states per sprite image.

What does this mean in practice?  Well, the usual approach to obtaining flicker-free graphics involves waiting for the screen-refresh interrupt, then redrawing everything before the CRT scan line catches up with you.  On the ZX Spectrum the upper screen border (before the display bitmap is drawn) takes around 14,000 t-states. 

For the fix-in-code approach, this is enough time to erase and redraw 14,000 / 2,000 / 2 = 3.5 16x16px sprites before you risk being caught be the screen refresh.

For the use-a-table approach, this is enough time to erase and redraw 14,000 / 1,500 / 2 = 4.6 16x16px sprites before you risk being caught by the screen refresh.

Conclusion.

For my money, I'd go for the table-based approach.

Wednesday, July 18, 2018

High speed full-screen drawing on the ZX Spectrum

This is another retro-programming idea that occurred to me, harking back to my old ZX Spectrum days.  If you're not au fait with the speccy's screen layout, I refer you to this posting of mine about rendering sprites.

The problem is that the old Z80 isn't fast enough to update all of the video memory before the CRT beam catches up with you.  The screen has a 64 pixel upper border (requiring no video memory), followed by a 192 pixel high by 256 pixel wide bitmap, followed by a lower border (also requiring no video memory).  The bitmap is monochrome, with each consecutive set of 8 pixels being represented in a single byte.  There is a separate 24 row by 32 column attribute map specifying foreground and background colours for each 8x8 pixel cell.

The only interrupt you get on the speccy occurs once every 50th of a second when the CRT beam starts from the top of the upper border (so this is pretty much when you have to start drawing).  It takes 224 T-states (i.e., clock ticks) for the CRT beam to scan one pixel line.  Since it takes around 16 Ts to move a single byte (give or take), the Z80 can update at most half of the 32 bytes on each row of pixels before the CRT beam catches up (which would appear as "tearing" on the screen).

[Note: some games did manage a full 50 FPS refresh rate, but only by severely restricting the area of the screen that was updated.]

However, let's say you only wanted to update about a third of the bitmap at 50 FPS, but didn't want to be restricted to which third.  The remainder would be painted in the background colour, which is easily arranged by setting the corresponding cells in the attribute map to have the same foreground and background colour (i.e., the set pixels and the clear pixels in such an 8x8 cell will have the same colour).

What occurs to me is this.  For a "blank" cell, we just need to set the cell attribute.  For a "bitmap" cell, we need to set the cell attribute and set the corresponding bitmap bytes in the video memory.  Assuming that we have the address of the "current" attribute entry in DE and the bitmap entry in HL (and the magic number 1 - $0700 in BC), then we can do the following:

DrawSomeBlankCell:
    ld A, SomeAttr
    ld (DE), A
    inc E
    inc L
    ret

(for 32 Ts per blank cell)

DrawSomeBitmapCell:
    ld A, SomeAttr
    ld (DE), A
    inc E
    ld (HL), SomeBitmapByte0 : inc H
    ld (HL), SomeBitmapByte1 : inc H
    ld (HL), SomeBitmapByte2 : inc H
    ld (HL), SomeBitmapByte3 : inc H
    ld (HL), SomeBitmapByte4 : inc H
    ld (HL), SomeBitmapByte5 : inc H
    ld (HL), SomeBitmapByte6 : inc H
    ld (HL), SomeBitmapByte7 : add HL, BC
    ret

(for 146 Ts per bitmap cell).

The "draw list" would simply be a consecutive list of pointers to these drawing functions:

DrawList:
    dw Draw0, Draw1, Draw2, ..., Draw31, [fix up]
    dw Draw32, Draw33, Draw34, ..., Draw 63, [fix up]
    ....
    dw ..., Draw767, [finish up]

We would redraw the entire screen by simply pointing DE at the display attributes, HL at the display bitmap, SP at DrawList and executing a 'ret'.  (The [fix up] calls would simply deal with the odd end-of-line addressing fixes that are needed when handling the ZX Spectrum display.)

How much could we redraw, assuming an even distribution of action across the screen?  A simple calculation shows we can have a good 11 bitmaps on each line and still update the entire display before the CRT beam catches up.  That is just slightly in excess of 33% of the entire display!

(I'm currently working on a full-screen Pac Man demo. to illustrate this idea.  Time is scarce, so don't hold your breath.)

A very belated update: you can find the working source code at https://github.com/ralphbecket/Z80/tree/master/MegaPacMan showing a full-screen 50 FPS scrolling maze with four ghosts running around at random.  It could support many more and/or much larger sprites, but I'm happy to call it a day here!

Friday, November 17, 2017

Retro-programming: an High-Level Byte-Code for 8-bit CPUs

This idea has been percolating around in my hind-brain for a while now, so I thought I'd set it down on paper.

The Problem

Imagine you are back in the 1980s, programming one of the wonderfully primitive home micro-computers that has just arrived on the scene.  Say, a ZX Spectrum, sporting a mere 48 Kbytes RAM and a 3.5 MHz Z80 CPU (alternatively, a comparable 1 MHz 6502-based Commodore 64).  These CPUs are aggressively 8-bit in nature -- yes, the Z80 has some 16-bit support, but it isn't great.  Worse, these CPUs are non-orthogonal: every register seems to have its own special rules.  What would have been nice was some kind of easy-to-implement byte-code interpreter which would be comfortable to program in and for which it would be easy to write a compiler in BASIC (all home micro-computers booted directly to BASIC in those days).  This virtual machine would, of course, run more slowly than hand-crafted assembly language, but many times faster than the interpreted BASICs of the day.  Having a standardised byte-code would mean you could run the same program on any machine on which you had an implementation of the byte-code interpreter.

[Note: these thoughts are partially inspired by Wozniak's SWEET16, although that is a much lower-level VM than what I have in mind.]

One thing I have aimed for is a rich set of addressing modes, important for raising the level of abstraction and for reasonable code density.  Most 8-bit VMs that I've seen, like SWEET16, provide a rather low-level virtual machine typically requiring several instructions to do anything useful.  Ugh!

Outline of the Virtual Machine

The proposed VM works in 16-bit words.

A variable is any address in RAM.

Op-codes take 0, 1, or 2 arguments (if op-codes take a third argument which is a fixed address to branch to).  Typically, the first argument will be an assignment target.

There are two "registers", X and Y, to hold the arguments for op-codes.

The registers are set via the following instructions (the explanation uses C syntax for the semantics, recall that & in C is the "address-of" operator).

  • Xk -- load X with a constant; X := &k.
  • Xv -- load X with a variable; X := &v.
  • X_k -- subscript X by a constant; X := &(X[k]).
  • X_v -- subscript X by a variable; X := &(X[v]).
  • X@ -- indirection through X (equivalent to X_k 0); X := *X.
and similarly for the Y register.

Together, these allow for quite rich argument addressing.  For example, loading register X with x.i.foo.bar (or, in C syntax, x[i]->foo->bar) would be expressed in the VM as

  Xv x
  X_v i
  X_k foo
  X_k bar


In the VM, SP is a global stack-pointer variable used by the push, pop, call, and ret op-codes.

Op-codes:
  • X := Y; X += Y; X -= Y; X *= Y; X /= Y; X &= Y; X |= Y; X ^= Y; etc.
  • inc X; dec X
  • push X -- equivalent to SP -= 2; @SP := X.
  • pop X -- equivalent to X := @SP; SP += 2.
  • jp X -- jump to address X.
  • call X -- equivalent to push IP; jp X where IP is the address of the next op-code.
  • ret -- equivalent to pop IP.
  • poke X, Y -- write the least significant byte of to address X.
  • X :=peek Y -- assign the byte at address Y to X.
  • if X = Y A; if X != Y A; if X < Y A;etc. -- branch to address A if the test fails (i.e., unless X = Y; X != Y; X < Y; etc.).
  • And anything else that might be useful.
The odd inversion of the if op-codes is to support some sensible high-level syntax, such as

if q != 0 { p /= q }

which would have the obvious translation.

Example Code: Searching a Binary Tree

I'm making up some high-level syntax here, fill in any gaps using your imagination.

// Symbols for the field offsets in our binary tree vertices.
Value .equ 0
Left .equ 1
Right .equ 2

// We assume t is the root of the binary tree and we are looking for the vertex containing x.
loop {
  if t = 0 { ret } // We hit null, we failed to find x.
  t := t.Value
  if t = x { ret }
  if t < x { t := t.Left } else { t := t.Right }
}

That, I claim, is fairly neat.  The equivalent Z80 or 6502 assembly code would be far less transparent.

The above would be expanded in a simple "assembler" into something like this (where  X or Y is syntactically unchanged from its previous use, no op-codes need be omitted for set-up):

_1:
  Xv t; Yk 0; if= _2
    ret
_2:
  Yv t; Y_k Value; :=
  Yv x; if= _3
    ret
_3:
  if< _4
    Yv t; Y_k Left; :=
    Xk _5; jp
_4:
    Yv t; Y_k Right; :=
_5:
  Xk _1; jp

The Byte-code and Interpreter

Now, as an ex-Sinclair kiddie, I'm used to the Z80, so I'm going to use that to inform my idea of how the byte code might be expressed.

Rather than interpreting byte-codes, I'm simply going to propose a sequence of [op code address][arg] or just [op code address] 16-bit values as my byte code.  That is, every op-code is represented by the address of the machine code implementing it.

For example:

Xk:
  ld (X), SP: pop HL: ret    ; Because the Z80 SP now holds the address of the constant.

Xv:
  pop HL: ld (X), HL: ret

X_k:
  ld HL, (X): pop DE; add HL, DE: add HL, DE: ld (X), HL: ret

':=':
  ld HL, (Y): ld DE, (X): ldi: ldi: ret

'+=':
  ld HL, (X): ld DE, (Y)
  ld A, (DE): inc DE: add (HL): ld (HL), A: inc HL
  ld A, (DE): ld B, (HL): adc A, B: ld (HL), A    ; Ain't the Z80 horrible?
  ret

By pointing the Z80 stack-pointer at the byte-code stream, I can start executing it just by issuing the ret assembly instruction (which pops the next address off the stack and jumps to it).

Going by this, each register set-up op-code and argument takes 4 bytes, each if op-code and jump address takes 4 bytes, and every other op-code takes 2 bytes.

Under this scheme, the above binary tree search function would take 140 bytes.  This is comparable to the equivalent Z80 machine code.  It would run fairly quickly, the "interpretation" overhead being about as small as I can think of making it.

A Quick Comparison with SWEET16

This article on SWEET16 provides an 18-byte SWEET16 memory copy routine.  How would our putative VM do?

loop {
  if n = 0 { ret }
  poke dest, src
  inc dest
  inc src
}

which would turn into the following byte code:

_1:
  Xv n; Yk 0; if= _2
    ret
_2:
  Xv dest; Yv src; poke
  inc
  Xv src; inc
  jp _1

which, by my reckoning, comes to 36 bytes, twice what SWEET16 requires.  Not such a big deal given how much more comfortable the new VM feels to program in than SWEET16.

Closing Thoughts

I wish I'd thought of this back in the early 1980s!  While coding everything in assembly language was fun, it was error prone, and the weirdness of the Z80 turned every little design decision into a mental challenge.  I think this VM (and accompanying compiler) would have been possible to write in BASIC and a handful of assembler in a few evenings and that it would been a great tool for bootstrapping other projects.

(Later in life I ended up spending fifteen years in programming language design and compiler research...)


Tuesday, August 29, 2017

Tuesday, May 03, 2016

Od -- it's like Mithril and Knockout, but better

I've loved using Knockout these last few years, but struggled with its unpredictable performance.  I really like the idea of Mithril (and the other virtual-DOM schemes), but it has really wonky problems if you move a component around in your DOM.

To address these issues, I recently came up with my own scheme, Od, and I really like it:

https://github.com/ralphbecket/Web/tree/master/Od

The web site there has plenty of explanation, examples, and documentation.
TypeScript, the beautiful face of the ugly tapeworm that is JavaScript.  It's the only way to do this stuff.

To add a TypeScript component to a non-TypeScript Visual Studio project, add the following to the end of the project .csproj file (do this in notepad or somesuch; adjust the TypeScript path version accordingly):

Reload the project and away you go.  When you add TypeScript files to your project, you can set their BuildAction properties to TypeScriptCompile, if this doesn't happen automatically.

Friday, March 18, 2016

On Python (vulgar, but funny)

https://semitwist.com/articles/article/view/why-i-hate-python-or-any-dynamic-language-really