Don't get misled. That "1.25MHz speed" thing is about the T1 and T2 timers that are inside the VIAs, which can cause an interrupt, or be used to generate a square wave output, etc. It's not about the transfer rate to/from a ProFile or Widget.
The "625K bytes/second maximum data transfer rate" figure is pure bullshit. They got to it like this: the 68000 runs at 5MHz, so
5,000,000 cycles/second. Each memory access is 8 cycles.
5,000,000 / 8 = 625,000 or
625KB/s.
However, that's a lie. If you have a tight 68000 assembly loop that reads from port A on that VIA, then turns around and writes to memory and increments an address register, you need at least two memory accesses (each 8 CPU cycles): one for the read and one for the write.
But wait! There's more! The 68000 doesn't really have a cache (well it has the IR register, which is a single word, so might as well have none), so it needs some bus cycles to read those opcodes, and then, to execute them.
The smallest opcode can be read in a single shot (2 bytes/1 16-bit word) - so 8 CPU cycles.
So if you use register-based (address register indirect) addressing, you can do something like this:
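        LEA.L   VIA_portA,A0      ; A0 -> VIA port A data register
        LEA.L   BUFFER,A1         ; A1 -> destination buffer in RAM
        MOVE.W  #512+20-1,D0      ; 532 bytes; DBRA runs D0+1 times
loop:   MOVE.B  (A0),(A1)+        ; read a byte from the VIA, store it, bump A1
        DBRA    D0,loop           ; decrement D0, loop until it reaches -1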
This guy, MOVE.B (A0),(A1)+, will take at least 24 CPU cycles, likely more: 8 cycles to read the opcode, 8 cycles to read from the VIA port A (IRA), 8 cycles to write the port's data to memory (A1), and a few more to increment A1 (the +). Then DBRA will take at least 8 more cycles just to read the opcode, likely more for the decrement of D0.
So, right off the bat you're looking at minimum 32 cycles.
On top of that, half the bus cycles are used by the video state machine, during which time the 68000 is just sitting there waiting. Sure, it could perform internal operations, but that's unlikely as there's no cache. So you're already starting with
[ 5,000,000 cycles/sec / 2 (video state steal) / 32 (cycles for those two instructions) ] at the absolute minimum.
So doing that division gives us something like 78.125KB/s in the absolute best case. And yes, I cheated: I was lazy and didn't assemble this and look up each generated opcode in the 68000UGM to see the exact cycle counts. Most likely it's more cycles than I said, so slower than that 78KB/s, possibly even as slow as half of that.
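For anyone who wants to poke at the numbers, here's the same back-of-the-envelope arithmetic as a small Python sketch. The constants are the ones used above (5MHz clock, 8 cycles per memory access, half the bus time lost to video). The names are made up for illustration, and the 4 accesses per byte is the optimistic lower bound for the MOVE.B/DBRA loop, not a measured figure.

```python
# Rough model only: these are the post's assumed figures, not measured timings.
CPU_HZ = 5_000_000         # Lisa 68000 clock
CYCLES_PER_ACCESS = 8      # CPU cycles per memory access (as assumed above)
VIDEO_SHARE = 2            # half the bus time goes to video/DRAM refresh


def bytes_per_second(accesses_per_byte: int) -> float:
    """Best-case transfer rate when each byte costs this many memory accesses."""
    cycles_per_byte = accesses_per_byte * CYCLES_PER_ACCESS
    return CPU_HZ / VIDEO_SHARE / cycles_per_byte


# The brochure number: one access per byte, no video stealing, no opcode fetches.
marketing_rate = CPU_HZ / CYCLES_PER_ACCESS

# The loop above: MOVE.B = opcode fetch + VIA read + RAM write, DBRA = opcode fetch.
loop_rate = bytes_per_second(3 + 1)

print(f"marketing: {marketing_rate:,.0f} bytes/s")
print(f"loop:      {loop_rate:,.0f} bytes/s")
```

Changing the argument to bytes_per_second() is a quick way to see how fast the rate drops if the real instructions turn out to need more accesses than assumed here.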
Sure you can play tricks like loop unrolling (which LOS 3.1 does) or whatever, but there is some limit that you can't go past.
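To put a rough number on that limit, here's a sketch using the same simplified model: each MOVE.B costs 3 memory accesses and the DBRA costs 1, shared across however many copies of MOVE.B sit in the loop body. This is an assumption-laden lower bound on cycles, not real cycle counts from the manual, and the function name is hypothetical.

```python
# Same simplified model as the estimate above; not real 68000 cycle counts.
CPU_HZ = 5_000_000         # Lisa 68000 clock
CYCLES_PER_ACCESS = 8      # CPU cycles per memory access (assumed)
VIDEO_SHARE = 2            # half the bus time goes to video/DRAM refresh


def unrolled_bytes_per_second(copies: int) -> float:
    """Best-case rate with `copies` MOVE.B instructions per DBRA."""
    if copies < 1:
        raise ValueError("need at least one MOVE.B in the loop body")
    accesses = 3 * copies + 1                  # 3 per MOVE.B, 1 for the DBRA fetch
    cycles_per_byte = accesses * CYCLES_PER_ACCESS / copies
    return CPU_HZ / VIDEO_SHARE / cycles_per_byte


for copies in (1, 2, 4, 8, 16):
    print(copies, round(unrolled_bytes_per_second(copies)))
```

Under this model 8 copies gets you 100,000 bytes/s, and no amount of unrolling can reach 2,500,000 / 24, which is about 104KB/s. That's still nowhere near 625KB/s.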
Perhaps if they had rigged up a DMA controller it might have gotten closer, but it would still take about the same number of cycles, since the 68000 sits waiting for the bus while the DMA controller does its thing. To get that fast you'd need a separate path to the DRAM, out of the CPU's way, tied into the DRAM refresh: instead of using half the cycles just to refresh the DRAM (which is the other half of the purpose of the video state machine), you could have the DMA controller push data from the ProFile directly into a buffer set up by the OS in RAM. Which is really hard to do.
The Mac does something similar with its video state machine, but uses faster DRAM so it can go the full 8MHz and also do the DRAM refreshes while the CPU isn't using the bus (I think it uses !CLK vs CLK so both high and low clocks are used). Steve Chamberlin had a nice writeup on this when he was building his Mac clone: https://www.bigmessowires.com/category/plustoo/ or here https://www.bigmessowires.com/category/68katy/ - but I remember he had another one with the bus cycle timings and whatnot that I don't see right now.
The Lisa actually alternates between 8 cycles for the CPU, and then 8 cycles for the video/DRAM refresh - somewhere in the Lisa HWG they show this with all the bus cycles 0-7 for CPU, then 0-7 for video. That's why it's much slower. So it's closer to 2.5MHz in reality.
So tl;dr there's no way to get anywhere near 625KBPS. I'll go away and shut up now.