Cycle-counting for tight code

Started by stepleton, November 25, 2025, 12:02:16 PM

stepleton

Let's say you're writing some speed-critical Lisa code. Every cycle counts, and you have choices to make between different implementation options. There are various guides like this one for counting M68K instruction timings, but how can you know how long things really take if you need to access memory or one of the peripheral devices (e.g. the SCC or a VIA)?

I have to admit that the Lisa's complicated memory system makes this a bit confusing to me. I know the full story must be derivable from the Lisa Hardware Manual (e.g. section 4.2) and general timing information about the 68000 itself, but it's still confusing: the timing guide linked above refers to read and write bus cycles; do those take the same number of clocks on the Lisa? Do the MMU or accesses to peripheral devices introduce wait states?

I'm happy to keep working to figure it out on my own, but I wonder if someone might know these things off the top of their head...

sigma7

#1
I use the tables in the Motorola MC68000 User's Manual (typically a PDF of it); the Quick Reference Guide is also handy for this. Presumably third-party web sites have reproduced the data accurately, so if their presentation makes more sense to you, you might as well use them.

For the unfamiliar, the specified number of clock cycles for a 68000 instruction includes the fetching of the instruction and the internal operations performed, but does not include the clock cycles consumed by any memory accesses performed by the instruction (which vary depending on what type of memory access it is as well as how many the instruction needs). So one looks up the "effective address" cycle timing(s) and adds that (those) to the instruction's cycle count.
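
As a worked example of that bookkeeping (the figures below are what I read out of the manual's tables, so check them against your own copy before relying on them in tight code):

  ADD.W   (A0),D0   ; base time for ADD <ea>,Dn (word): 4 clocks
                    ; + effective address time for (An), word: 4 clocks
                    ; = 8 clocks total, with 2 read bus cycles
                    ;   (the instruction fetch and the operand read) and no writes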

IIRC, regular bus cycles (terminated by DTACK) operate in the documented number of clock cycles on the Lisa.

I believe the MMU and video are synchronized with CPU operation without adding wait states, but now that you've asked I'm not confident I can say there are never any wait states. A video access has to occur without fail, so it would have a higher design priority than the CPU. I suppose if there is an occasional wait state, you probably wouldn't be able to do anything about it. I recognize that in this case you want to be able to predict it more than avoid it.

Wait states do occur when trying to access the memory shared with the 6504 if the 6504 has locked out the 68000 to perform timing critical floppy disk access.

The AM9512 coprocessor circuitry also has a wait-state-generating circuit, but it's unlikely anyone will run into that.

An expansion card may generate wait states; I don't recall any that do. A missing expansion card will cause wait states until bus timeout.

VPA/VMA bus cycles are different from (and slower than) DTACK cycles, but they are atypical, being used for the SCC and for the VIAs on the Lisa 1 (2/5) I/O Board, but not on the 2/10 I/O Board or expansion cards.

Adding an XLerator changes timing as accesses to the original hardware require synchronization with the CPU Board.

This may not be very helpful; do you have a specific circumstance you are considering? Perhaps I can measure it for you.

stepleton

Thanks for this very instructive response! It's good to know that regular bus cycles are (probably) dependable in their timing.

I note that MOVE.x Dy,Dz is listed as taking four cycles and as carrying out one bus read cycle. I was confused by the read cycle bit until I realised that this must simply be for the instruction fetch. This is confirmed in a document called the 68000 "Yet Another Cycle Hunting Table" (YACHT), which I had never heard of before. It conveniently breaks down what the bus cycles are for.
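
For instance (this is my reading of the tables, so worth verifying against YACHT itself):

  MOVE.W  D1,D2   ; listed as 4 clocks, 1 read bus cycle, 0 writes;
                  ; the single read is the fetch of an instruction word,
                  ; not an operand access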

It would be interesting to know about how transactions via VPA/VMA cycles differ. It might be easiest and good practice for me to work out a way to measure this myself. It would be interesting, for example, to get an idea of the theoretical upper bound on parallel port communication.

sigma7

Quote: "It would be interesting to know about how transactions via VPA/VMA cycles differ."
Theoretical consequences of the 6800 VPA/VMA bus cycle operation from perusing the 68000 User Manual...

The VPA/VMA bus cycle option was provided by Motorola to simplify interfacing the 68000 with peripheral chips originally designed for use with the 6800 series.

VPA/VMA bus operation is synchronized with the "E clock" signal generated by the 68000. E is the CPU clock divided by 10, with a 60/40 duty cycle. That means the E clock can be in one of 10 different alignments with the CPU clock at the start of a particular instruction.

The User Manual provides a "Best Case" "MC68000 to M6800 Peripheral Timing Diagram" which shows 7 clock periods of wait states added to a bus cycle. If the E clock was in the worst case alignment with the start of the cycle, then an additional 9 clock cycles would be needed for a total of 16 added clock periods.

So the penalty of a VPA/VMA cycle is somewhere between 7 and 16 clock cycles, averaging around 12.

(That's assuming the peripheral doesn't need to add further wait states of its own, which holds on the Lisa apart from the AM9512: the circuitry starts the VPA bus cycle response as soon as the corresponding I/O address is decoded.)

Since all 68000 instructions are a multiple of 2 clock cycles (as far as I can tell), I think one might be able to optimize block moving code such that 8 wait states are added for each successive VPA/VMA cycle. To do this, successive instructions using VPA/VMA would need to be a multiple of 10 clock cycles apart (including all the clock cycles consumed by bus cycles, not just the instruction cycle time). The first VPA cycle would suffer the average ~12 cycle penalty, but each of the following VPA cycles could suffer the minimum penalty (7 cycles + 1 since 68000 instructions are multiples of 2).

The E clock is about 500 KHz in a stock Lisa, and with a 16 MHz XLerator installed, it is 1.6 MHz. The XLerator 12.5 and 18 retain the stock E clock frequency.

VPA/VMA bus cycles are used to access the 8530 SCC, and the Parallel Port and Keyboard 6522 VIAs on the Lisa 1 aka 2/5 I/O Board. IIRC, they are also used for interrupt acknowledge cycles on some expansion cards as that simplifies the circuitry.

The 6522 VIAs on the 2/10 I/O Board and the Dual Parallel Expansion Card don't use VPA/VMA bus cycles.

stepleton

This is fantastic detail, thanks! It's more complicated than I'd imagined. I'll have to chew on it for a while to fully understand it. It's also interesting to see this example of a difference between the 2/10 I/O board and its predecessor.

Glancing at the schematics, here's how the Lisa 1 I/O board is serving VMA to one of the keyboard 6522's chip select pins.
[attached image: lisa1.png]
E is coming from the buffered, BGACK-gated E line from the CPU board.

The 2/10 ties that same chip select pin high, meanwhile:
[attached image: lisa210.png]
and E is coming from some flip-flops which at a glance seemed to me to be dividing CPUCK.

sigma7

Quote: "and E is coming from some flip-flops which at a glance seemed to me to be dividing CPUCK."

Yes, the VIAs' clock on the 2/10 I/O Board is driven by CPUCK/4, which is also how the VIAs on the Dual Parallel Port card are clocked.

This makes those VIA clocks about 1.25 MHz versus the CPUCK/10 frequency of the E clock used by the original I/O Board.

The effect of the different VIA clock rates is observed mostly in their programmable timers, which are used for making sound as well as for time constants used by, e.g., the Macintosh Time Manager.

Hence the references in source code to "fast timers" on the newer (2/10) I/O Board.
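
Putting rough numbers on that (derived from the clock figures above, so treat them as approximate):

  E = CPUCK/10, about 500 kHz (2/5 I/O Board):  one VIA timer tick is about 2.0 us
  CPUCK/4, about 1.25 MHz (2/10 I/O Board):     one VIA timer tick is about 0.8 us

so the same timer constant runs out roughly 2.5x sooner on the 2/10 board.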

IIRC, the length of the Parallel Port strobe pulse is also affected proportionally by the VIA clock frequency.

sigma7

#6
Quote: "It might be easiest and good practice for me to work out a way to measure this myself. It would be interesting, for example, to get an idea of the theoretical upper bound on parallel port communication."

You should see some performance variation depending on the hardware:

The Dual Parallel Port expansion card's VIAs respond with DTACK bus cycles, so these will give the same performance in a Lisa 1 (2/5) and a 2/10; the 2/10 I/O Board's parallel port VIA circuit is similar to the Dual Port card's.

On the Lisa 1 (2/5) I/O Board, the VIAs respond with VPA instead of DTACK, so these will be slower than the 2/10 and expansion port card.

Then there is the oddity of the 16 MHz XLerator's E clock being 1.6 MHz, which will make the Lisa 1 I/O Board VIAs and VPA cycles perform better, but I'd guess still slower than a 2/10 with 16 MHz XLerator.

The parallel ports are connected only to the lower byte of the data bus. So, regrettably, they are generally accessed only by byte operations (not word or long).

One of the simplest transfer operations for reading from the parallel port into memory is:

  MOVE.B (A5),(A0)+  ; A5 points to the desired VIA input register, A0 points to memory where data is to be stored

which is 12 clock cycles (+ those required for memory and peripheral access) per byte.
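
As a back-of-envelope ceiling (assuming a 5 MHz CPU clock and, optimistically, no added wait states):

  12 clocks/byte at 5 MHz = 2.4 us/byte, i.e. roughly 400 KB/s before loop overhead and the VPA/memory penalties discussed earlier.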

Since transferring a byte of data between a parallel port and memory necessitates a memory access and a peripheral access, the primary opportunity to maximize throughput of a block of data is to minimize the number of instruction cycles.

In addition to unrolling a loop to minimize branching time, the MOVEP instruction is potentially useful, as it does word and long transfers to/from alternating (odd or even) byte addresses.

For the 68000, the MOVEP instruction doesn't support the "register indirect" addressing mode (e.g. "(A0)"), but rather requires "register indirect with offset" (e.g. "9(A0)" or "(9,A0)" depending on the assembler syntax), which means another word of instruction fetch. Its other limitation is that it will transfer only to/from a data register.

Consider:

 MOVEP.L 0(A5),D0      ; move 4 bytes from VIA into D0
 MOVE.L  D0,(A0)+

which is 24 clock cycles for the MOVEP.L, plus 12 clock cycles for the MOVE.L, making 36 cycles for 4 bytes, or 9 cycles per byte (+ those required for memory and peripheral access).
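
By the same back-of-envelope arithmetic as above (5 MHz CPU clock, no added wait states): 9 clocks/byte is 1.8 us/byte, or roughly 550 KB/s, so the MOVEP variant saves about 25% of the raw instruction clocks per byte.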

One way to measure execution time of a random piece of code is to wrap it in code that toggles an accessible pin.

For example, one might use the DTR pin (20) of Serial Port A connected to a test instrument something like this:

; assumes the parallel port VIA was already set up for use with a parallel port drive

VIAX  EQU $FCD979 ; parallel port VIA's "input/output without handshake" register (ie. don't pulse the strobe line)
SCCA  EQU $FCD203 ; SCC channel A control register
DTRAL EQU $68    ; value to stuff into SCC register 5 to de-assert DTRA
DTRAH EQU $E8    ; value to stuff into SCC register 5 to assert DTRA

 ; the SCC needs a recovery period between the write to select its register and then accessing that register
 ; to implement this delay, we embed the long address in each instruction rather than eg. using a register and NOPs

      MOVE.B      #5,SCCA    ; select SCC register 5
      MOVE.B      #DTRAL,SCCA ; make sure DTRA starts de-asserted
 
 ; measure overhead of timing code
      MOVE.B      #5,SCCA    ; select SCC register 5
      MOVE.B      #DTRAH,SCCA ; assert DTRA
      MOVE.B      #5,SCCA    ; select SCC register 5
      MOVE.B      #DTRAL,SCCA ; deassert DTRA
 ; subtract the length of the preceding quick pulse from the longer pulse that follows to compensate for the timing overhead
 
 ; set up timing loop 1
      MOVE.W      #63,D7      ; to loop 64 times and move 256 bytes
      MOVEA.L    #VIAX,A5    ; point to register for timing test
      LEA        @Data,A0    ; point to a data storage area
 
 ; start timing loop 1
      MOVE.B      #5,SCCA    ; select SCC register 5
      MOVE.B      #DTRAH,SCCA ; assert DTRA
@LP1  MOVE.B      (A5),(A0)+  ; move 1 byte, repeat 4 times to compare timing with MOVEP.L
      MOVE.B      (A5),(A0)+  ;
      MOVE.B      (A5),(A0)+  ;
      MOVE.B      (A5),(A0)+  ;
      DBRA        D7,@LP1    ; loop until register decrements to -1
      MOVE.B      #5,SCCA    ; select SCC register 5
      MOVE.B      #DTRAL,SCCA ; deassert DTRA
 ; timing loop 1 ended
 
 ; set up timing loop 2
      MOVE.W      #63,D7      ; to loop 64 times and move 256 bytes
      MOVEA.L    #VIAX,A5    ; point to register for timing test
      LEA        @Data,A0    ; point to a data storage area
 
 ; start timing loop 2
      MOVE.B      #5,SCCA    ; select SCC register 5
      MOVE.B      #DTRAH,SCCA ; assert DTRA
@LP2  MOVEP.L    0(A5),D0
      MOVE.L      D0,(A0)+
      DBRA        D7,@LP2    ; loop until register decrements to -1
      MOVE.B      #5,SCCA    ; select SCC register 5
      MOVE.B      #DTRAL,SCCA ; deassert DTRA
 ; timing loop 2 ended
      RTS                    ; return to calling code
     
@Data EQU  *                ; store data here (in the memory following this code)
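
For a rough idea of what the scope should show, here are the raw table cycle counts for the two loops (5 MHz CPU clock assumed, ignoring the bracketing SCC writes and all of the VPA and other wait-state penalties discussed above, so these are optimistic estimates rather than predictions):

  loop 1: 63 x (4x12 + 10) + (4x12 + 14) = 3716 clocks, about 743 us
  loop 2: 63 x (24 + 12 + 10) + (24 + 12 + 14) = 2948 clocks, about 590 us
  (DBRA takes 10 clocks when the branch is taken, 14 when the count expires)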

An "easy" way to run a quick test like this is with BLU's "download code and execute" function. You'll need to send the executable/binary code of course. I presume anyone interested in this level of detail can fix any typos and assemble the code, but I'd be happy to assemble with MPW and upload the binary if that would be useful to someone. I just looked for an online 68k assembler, but was unsuccessful in finding one that generated the binary code.