Tao Te KaChing

Workin' the cash register of the Great Tao

So, the last post was pretty long, running through basic kernels practically line-by-line to see how the 6507 instructs the TIA to do great things with our TV. The remainder of our posts, then, really have to do simply with performing various functions on the 2600. How do I show a sprite? How do I position it on the screen? Sounds, collisions, etc... This post, however, will be one last smackdown on the relationship between our machine cycles and the TV rastering. This time, we're getting into cycle counting and troubleshooting our assembly code.

If you've been farting around with the kernels from the last two posts, you may have noticed that the SciTE / DASM solution is not exactly awesome for stepping through program logic. Going on Google to look for a 6502 simulator will give you tons of options. I'm using the 6502 Macroassembler & Simulator by Michal Kowalski. It pretty much rocks and does everything we need it to, AND MORE! You can get it here.

Cycle counting refers to an ongoing verification we will be doing while creating our game. We know that a 6502 instruction takes a certain number of machine cycles. We also know that we have a certain number of machine cycles per scanline. We *further* know that, to the Atari VCS' TIA chip, there are a certain number of scanlines per NTSC frame, and that once we've started rasterizing frames, we are in “The Kernel Loop” and will be measuring all our gametime in the instructions we can perform per frame. Really, there are essentially two "blocks" of code we'll always work with: initialization routines, and our kernel loop (we can always have more than one kernel, but we'll stay with this general assumption for now). We unfortunately need to care about efficiency even in our initialization routines to save space, as we only have 4,096 bytes for our code.*

To get a sense of how little we have to work with, let's look at some "specs". If a scanline is the equivalent of 76 machine cycles (per § Television Protocol, Stella guide), and we have a total of 262 scanlines per frame, that's 76 × 262 = 19,912 machine cycles **total** per frame for our kernel. If we assume we use on average 3 cycles per instruction, that leaves us just over 6,600 instructions per frame.
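If you'd rather let the assembler do the sums, here's a rough DASM sketch of that budget. The symbol names are made up for illustration, the 3 / 37 / 192 / 30 scanline split is the one the samples at the end of this post use, and the 3-cycles-per-instruction average is just the guess from the paragraph above.

```asm
        processor 6502

; NTSC frame layout (same split the samples at the end of this post use)
CYCLES_PER_SCANLINE  = 76
VSYNC_LINES          = 3
VBLANK_LINES         = 37
PICTURE_LINES        = 192
OVERSCAN_LINES       = 30

; 3 + 37 + 192 + 30 = 262 scanlines
SCANLINES_PER_FRAME  = VSYNC_LINES + VBLANK_LINES + PICTURE_LINES + OVERSCAN_LINES

; 76 * 262 = 19912 machine cycles per frame
CYCLES_PER_FRAME     = CYCLES_PER_SCANLINE * SCANLINES_PER_FRAME

; assumption: an "average" instruction costs 3 cycles
AVG_CYCLES_PER_INSTR = 3

; 19912 / 3 = 6637 (integer division)
INSTR_PER_FRAME      = CYCLES_PER_FRAME / AVG_CYCLES_PER_INSTR

        ; [exp]d asks DASM to print the value in decimal
        echo "cycles per frame:", [CYCLES_PER_FRAME]d
        echo "instructions per frame (approx):", [INSTR_PER_FRAME]d
```

This emits no code at all; it only has DASM echo the two totals while assembling, which makes it easy to fiddle with the numbers.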

And it gets uglier. Let's say we're checking to see if the accumulator is currently holding a value of, say, 28. If it is, set the X index to 10, otherwise set the X index to 5. Here's what this code might look like:

            LDA #20         ; don't count this...
            CMP #28         ; 2
            BNE not_equal   ; 2/3
    equal   LDX #10         ; 2
            JMP done        ; 3
    not_equal
            LDX #5          ; 2
    done    BRK

Try running this in your emulator. You may need to add a few lines at the beginning. Here's my code for this if you're using Kowalski's emulator (the one I use, from here on referred to as MK65E):

    ; IO area of the simulator has to be set at the address $e000
    ; (Option/Simulator/InOut memory area)
    ; In/Out Window will only accept input when it has focus (is active)

            *= $100

    io_area = $e000
    io_cls  = io_area + 0   ; clear terminal window
    io_putc = io_area + 1   ; put char
    io_putr = io_area + 2   ; put raw char (doesn't interpret CR/LF)
    io_puth = io_area + 3   ; put as hex number
    io_getc = io_area + 4   ; get char

            LDA #20
            CMP #28         ; 2
            BNE not_equal   ; 2/3
    equal   LDX #10         ; 2
            JMP done        ; 3
    not_equal
            LDX #5          ; 2
    done    BRK

Make sure to clear your cycle counter when starting at CMP #28. Here's where to do that if you're using MK65E:

My clock showed 7 cycles when the accumulator does not equal 28. Now change LDA #20 to LDA #28. You should see two things. First, this operation now takes 9 cycles. Second, the **B**ranch if **N**ot **E**qual only takes 2 cycles instead of 3, because a branch that isn't taken is one cycle cheaper than one that is. Since everything is timed on the 2600 down to the machine cycle, this just won't do; both paths need to cost the same, else our kernel is not keeping in sync with the TIA / TV. Add a NOP after the LDX #5. That should give us 9 either way. Excellent. Now we *know* our routine will cost us 9 machine cycles either way.
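For reference, here's roughly what the padded routine looks like with both paths tallied up. Treat it as a sketch: it reuses the origin and label names from the listing above, and the totals assume the branch doesn't cross a page boundary (a taken branch that does costs one extra cycle).

```asm
        *= $100                 ; same origin as the listing above

        LDA #28                 ; don't count this... try both #20 and #28
        CMP #28                 ; 2
        BNE not_equal           ; 2 when equal (falls through), 3 when taken
equal   LDX #10                 ; 2
        JMP done                ; 3   equal path:     2 + 2 + 2 + 3 = 9
not_equal
        LDX #5                  ; 2
        NOP                     ; 2   not-equal path: 2 + 3 + 2 + 2 = 9
done    BRK
```

The NOP does nothing except burn 2 cycles, which is exactly the gap between the two paths.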

We can save ourselves some cycles and bytes, however, if we can use the X index for our comparison and accumulator for the 5 or 10 result:

            LDX #20         ; don't count this...
            LDA #5          ; 2
            CPX #28         ; 2
            BNE done        ; 2/3
    equal   ASL             ; 2
    done    BRK

Well, our clever use of the bitshift left us with the possibility of either 7 or 8 cycles. We can't use a NOP to even that out, since a NOP costs 2 cycles and we're only off by 1. However, we need to remember that the timing in our kernel is always by rasterized scanlines. A WSYNC call (STA WSYNC) tells the TIA to hold the 6507 until the scanline that we are on has finished getting rastered, so no further instructions run until we're starting the next scanline. This means that our 7-or-8-cycles routine above doesn't matter as long as we make a WSYNC within the scanline we're working within. For instance, if we have a routine with several branches so that we end up with a range of different possible cycle costs, say 50, 52, 58, 61, and 67, and we make a STA WSYNC call at the end of this routine, we have in a sense padded any of those costs to equal 76 machine cycles, and we start evenly for any of those possibilities on the next scanline. Here's a visual of how our routine above is “evened out” by a STA WSYNC:

Obviously this concept extends to routines that may cost more than one scanline. The TIA is expecting the equivalent of 262 scanlines worth of machine cycles to stay in sync with the TV. A scanline is 76 machine cycles. If our routine varies from 77 to 95 machine cycles, that's fine, because we've guaranteed “using up” a scanline, so just follow up with a WSYNC call and we're good. If we have a routine that varies from, say 50 to 90 machine cycles, now we may need to add some NOPs or break the routine up with a WSYNC call to guarantee we'll end up using two scanlines.
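To make the padding idea concrete, here's a hedged sketch of the 7-or-8-cycle routine followed by a STA WSYNC. It's a DASM fragment meant to be dropped into the vertical blank section of one of the samples at the end of this post (it relies on their vcs.h include for WSYNC), not assembled on its own, and the demo_Done label is made up for illustration. One thing to budget for: the STA WSYNC itself costs 3 cycles, so whatever sits in front of it needs to leave those 3 cycles free in the scanline.

```asm
        ; --- scanline N: we arrive here right after a previous STA WSYNC
        LDX #20                 ; 2
        LDA #5                  ; 2
        CPX #28                 ; 2
        BNE demo_Done           ; 2/3
        ASL                     ; 2   only runs when X == 28
demo_Done
        STA WSYNC               ; 3   CPU waits out the rest of the scanline
        ; --- scanline N+1: both paths resume here at the same point
        ;     not-equal path: 2 + 2 + 2 + 3     + 3 = 12 of 76 cycles used
        ;     equal path:     2 + 2 + 2 + 2 + 2 + 3 = 13 of 76 cycles used

        ; A routine that can take anywhere from 50 to 90 cycles gets split
        ; so that each half fits in one scanline, with a WSYNC after each:
        ;   (first half, worst case no more than 76 - 3 = 73 cycles)
        STA WSYNC
        ;   (second half, worst case no more than 73 cycles)
        STA WSYNC
```

Either way the routine always eats a whole number of scanlines, which is the only thing the TIA / TV cares about.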

So this is one aspect of cycle counting, and an introduction to why a good 6502 simulator will be very helpful to us. Let's look now at an actual problem we'll be facing: vertical placement of sprites. Here's some pseudo-code for placing a sprite vertically on the screen:

- If we're on a scanline that our sprite is on:
  - Get the bits we'll turn on for our sprite for that scanline
  - Set the bits to turn on for our sprite for that scanline
  - Get the color for our sprite for that scanline
  - Set the color for our sprite for that scanline
- Otherwise:
  - Skip drawing the sprite...

We know we have one branch, and some loading and setting to do. We can come up with a reasonable guesstimate for how much this routine will cost us:

- If...: (BRANCH = 2/3 machine cycles)
  - Get (bit addr)... (LD? zeropage = 3 machine cycles)
  - Set (bit addr)... (ST? zeropage = 3 machine cycles)
  - Increase (bit addr) to next (bit addr)... (INC zeropage = 5 cycles)
  - Get (color addr)... (LD? zeropage = 3 machine cycles)
  - Set (color addr)... (ST? zeropage = 3 machine cycles)
  - Increase (color addr) to next (color addr)... (INC zeropage = 5 cycles)
- Otherwise:
  - ( label to jump to )

So at most we're looking at 25 machine cycles. That's already a third of a scanline. If we use both sprites, we're looking at 50 machine cycles. The missiles are only one scanline high, so we can avoid the INC costs for them (15 cycles each instead of 25); therefore both sprites and both missiles would be approximately 80 machine cycles! DOH! We're over our 76-cycle budget! Both sprites, both missiles, and both playfields would be at least 130 machine cycles, and a playfield needs 3 bytes set (20 bits wide), so we're actually looking more like 40 cycles *just for rastering a playfield object!* Can it be done? Can we even fit all these things together in one scanline? We'll see later on when we deal directly with this issue (the answer is somewhere between “sort-of” and “no”, it seems). However, for our purposes now, we see the other aspect of cycle counting: trying to fit our logic in a very specific amount of “space”. We can use a different algorithm to get the same result:
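Before moving on to that other algorithm, here's a rough sketch of how you could have DASM itself flag the over-budget estimate above. All the symbol names are hypothetical, and the per-object costs are just the guesstimates from this post, not measured values.

```asm
        processor 6502

CYCLES_PER_SCANLINE = 76

SPRITE_COST         = 25        ; worst case from the estimate above
INC_COST            = 5         ; one INC zeropage
SKIPPED_INCS        = 2*INC_COST                    ; a missile skips both INCs
MISSILE_COST        = SPRITE_COST - SKIPPED_INCS    ; 25 - 10 = 15

BOTH_SPRITES        = 2*SPRITE_COST                 ; 50
BOTH_MISSILES       = 2*MISSILE_COST                ; 30
LINE_COST           = BOTH_SPRITES + BOTH_MISSILES  ; 80

        ; complain at assembly time if the estimate doesn't fit a scanline
        if LINE_COST > CYCLES_PER_SCANLINE
        echo "over budget by", [LINE_COST - CYCLES_PER_SCANLINE]d, "cycles"
        else
        echo "cycles to spare:", [CYCLES_PER_SCANLINE - LINE_COST]d
        endif
```

With the numbers above the estimate lands 4 cycles over the 76 we get per scanline, so the first echo is the one that fires.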

- Add current scanline counter to magic Y-pos for sprite
- If we have an overflow:
  - From our sprite's beginning address plus the overflow, get the bits
  - Set the bits
  - From our sprite's colors beginning address plus the overflow, get the color
  - Set the color

Our costs now look something like:

- Add current scanline counter... (ADC zeropage = 3 machine cycles)
- If...: (BRANCH = 2/3 machine cycles)
  - Put overflow in Y index (TAY = 2 machine cycles)
  - Get (starting bit addr + overflow)... (LD? addr,Y = 4/5 machine cycles)
  - Set (bit addr)... (ST? zeropage = 3 machine cycles)
  - Get (starting color addr + overflow)... (LD? addr,Y = 4/5 machine cycles)
  - Set (color addr)... (ST? zeropage = 3 machine cycles)

A reduction of ONE WHOLE MACHINE CYCLE! It may not sound like much, oh but it is. Also both algorithms assume we need to track the coloring of our sprite. If we use one single color and set it before even starting the 192 scanlines of display, we could eliminate an additional 7 or 8 machine cycles.

So for some homework, try implementing the pseudo-code above in your emulator, and then keep whittling the cycles down as much as possible. Also, compare and play around with the following two code samples:

Sample 1:

        processor 6502
        include "vcs.h"         ; DASM pseudo-op to include mnemonics for TIA / RIOT registers

        seg
        org $F000               ; our ROM starts at the beginning of the second 4K block

    var_counter      = $80
    var_numberDispay = $81

    label_Reset
        LDA #0
        STA var_counter
        STA var_numberDispay
        LDA #100
        STA COLUP0

    ; --------------------------- our "kernel" starts here
    label_Frame

    ; --------------------------- do the v-sync (Stella, §3.3)
    label_Vsync
        LDA #2
        STA VSYNC               ; enable D1 bit at VSYNC to start vertical sync
        STA WSYNC
        STA WSYNC
        STA WSYNC               ; our 3 scanlines
        LDA #0
        STA VSYNC               ; disable D1 bit to finish vertical sync

    ; --------------------------- do the v-blank (Stella, §3.3)
    label_Vblank
        LDA #2
        STA VBLANK              ; enable D1 bit at VBLANK to start vertical blank
        repeat 35
        STA WSYNC
        repend                  ; do 35 scanlines using DASM "REPEAT" pseudo-op

        LDA #240
        STA HMP0
        STA WSYNC               ; use a scanline (WSYNC) pre our HMOVE to "slide"
        STA HMOVE               ; our sprite to the right by one "pixel"
                                ; ( Stella, TIA 1A - TELEVISION INTERFACE ADAPTOR (MODEL 1A)
                                ;   §5.A & B )

        INC var_counter                 ; 5
        LDA #60                         ; 2
        CMP var_counter                 ; 3
        BNE label_SkipCounterUpdate     ; 2/3 ======== total to here = 12/13
        ; --- match cycles from here on in label_SkipCounterUpdate
        LDA #0                          ; 2
        STA var_counter                 ; 3
        INC var_numberDispay            ; 5
        JMP label_CounterFinished       ; 3 ========== total to here = 13

    label_SkipCounterUpdate
        NOP
        NOP
        NOP
        NOP
        NOP
        JMP label_CounterFinished

    label_CounterFinished
        STA WSYNC
        LDA #0
        STA VBLANK              ; disable D1 bit at VBLANK to finish vertical blank

    ; --------------------------- do the picture lines (Stella, §TELEVISION PROTOCOL)
        LDA var_numberDispay
    label_Picture
        repeat 192
        STA GRP0
        STA WSYNC               ; do 192 scanlines using DASM "REPEAT" pseudo-op
        repend

    ; --------------------------- do the overscan lines (Stella, §TELEVISION PROTOCOL)
    label_Overscan
        repeat 30
        STA WSYNC               ; do 30 scanlines using DASM "REPEAT" pseudo-op
        repend

    ; --------------------------- do it all over again, 60 times per second!
        JMP label_Frame

        org $FFFA
        .word label_Reset       ; NMI
        .word label_Reset       ; RESET
        .word label_Reset       ; IRQ

        end

Sample 2:

        processor 6502
        include "vcs.h"         ; DASM pseudo-op to include mnemonics for TIA / RIOT registers

        seg
        org $F000               ; our ROM starts at the beginning of the second 4K block

    var_counter      = $80
    var_numberDispay = $81

    label_Reset
        LDA #0
        STA var_counter
        STA var_numberDispay
        LDA #100
        STA COLUP0

    ; --------------------------- our "kernel" starts here
    label_Frame

    ; --------------------------- do the v-sync (Stella, §3.3)
    label_Vsync
        LDA #2
        STA VSYNC               ; enable D1 bit at VSYNC to start vertical sync
        STA WSYNC
        STA WSYNC
        STA WSYNC               ; our 3 scanlines
        LDA #0
        STA VSYNC               ; disable D1 bit to finish vertical sync

    ; --------------------------- do the v-blank (Stella, §3.3)
    label_Vblank
        LDA #2
        STA VBLANK              ; enable D1 bit at VBLANK to start vertical blank
        repeat 35
        STA WSYNC
        repend                  ; do 35 scanlines using DASM "REPEAT" pseudo-op

        LDA #240
        STA HMP0
        STA WSYNC               ; use a scanline (WSYNC) pre our HMOVE to "slide"
        STA HMOVE               ; our sprite to the right by one "pixel"
                                ; ( Stella, TIA 1A - TELEVISION INTERFACE ADAPTOR (MODEL 1A)
                                ;   §5.A & B )

        INC var_counter                 ; 5
        LDA #60                         ; 2
        CMP var_counter                 ; 3
        BNE label_CounterFinished       ; 2/3 ======== total to here = 12/13
        ; --- let WSYNC after label_CounterFinished "use up" the remaining cycles
        LDA #0                          ; 2
        STA var_counter                 ; 3
        INC var_numberDispay            ; 5
        JMP label_CounterFinished       ; 3 ========== total to here = 13

    label_CounterFinished
        STA WSYNC
        LDA #0
        STA VBLANK              ; disable D1 bit at VBLANK to finish vertical blank

    ; --------------------------- do the picture lines (Stella, §TELEVISION PROTOCOL)
        LDA var_numberDispay
    label_Picture
        repeat 192
        STA GRP0
        STA WSYNC               ; do 192 scanlines using DASM "REPEAT" pseudo-op
        repend

    ; --------------------------- do the overscan lines (Stella, §TELEVISION PROTOCOL)
    label_Overscan
        repeat 30
        STA WSYNC               ; do 30 scanlines using DASM "REPEAT" pseudo-op
        repend

    ; --------------------------- do it all over again, 60 times per second!
        JMP label_Frame

        org $FFFA
        .word label_Reset       ; NMI
        .word label_Reset       ; RESET
        .word label_Reset       ; IRQ

        end

~ZagNut

*(bank switching to extend the size of the game is not going to be covered in this series...we have enough to deal with already!)