Topic: Cameo/Aphid benchmarking (Read 12695 times)

Lisa2 · « **on:** September 08, 2021, 11:30:09 am »

So I finally was able to put together a Cameo/Aphid device to play with. I was interested in the performance of the Cameo/Aphid vs. the X/Profile. My test setup is a 2/10 running MW+II with an X/Profile in place of the Widget, the Cameo/Aphid connected to a dual parallel port card, and a PCMCIA flash drive connected to a Sun SCSI card. To test the performance of the drives I used an old HD benchmarking program from LaCie named TimeDrive. This program does non-destructive testing using the driver (not direct SCSI calls), so it's a good fit for these parallel port drives. I made a quick video of the testing (that my cat photo-bombed). It's boring, shaky, and about 5 min. long, but for those who may be interested, here is the link:

https://youtu.be/pDfxcUbvBc0

The test is not scientific, but the results were interesting.

Rick

stepleton · « **Reply #1 on:** September 08, 2021, 01:45:00 pm »

Wow, cool! Maybe I shouldn't write hard drive emulators in Python :-)

It would be nice to compare two drives on the same port; it's possible that the internal port and the expansion card could have different rates. Another factor might be the SD card in use inside the Aphid (as well as the media in the other devices). Most SD cards will be plenty quick, but the speed difference between an older Class 4 device and a Class 10 or better will likely be substantial in Cameo/Aphid.

One slightly obnoxious quirk of the internal I/O mechanisms that Cameo/Aphid uses is that data has to be passed around in 512-byte blocks. As a ProFile drive block is 532 bytes, you need two of these transfer operations to access a single block. If you wanted to, you could write these blocks directly to shared memory inside the PocketBeagle, but I chose the path of purity...
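
To make the arithmetic above concrete, here is a minimal sketch of splitting a payload into fixed-size transfers. Only the 512 and 532 figures come from this post; the constant and function names are made up for illustration and are not taken from the Cameo/Aphid source.

```python
# Hypothetical names; only the 512- and 532-byte figures come from the post.
TRANSFER_SIZE = 512   # bytes per internal transfer operation
PROFILE_BLOCK = 532   # bytes per ProFile drive block


def transfers_needed(payload_bytes, transfer_size=TRANSFER_SIZE):
    """Return how many fixed-size transfers cover payload_bytes (ceiling division)."""
    return -(-payload_bytes // transfer_size)


def split_into_transfers(payload, transfer_size=TRANSFER_SIZE):
    """Cut payload into consecutive pieces of at most transfer_size bytes."""
    return [payload[i:i + transfer_size]
            for i in range(0, len(payload), transfer_size)]


if __name__ == "__main__":
    block = bytes(PROFILE_BLOCK)  # an all-zero stand-in for one drive block
    print(transfers_needed(len(block)),
          [len(piece) for piece in split_into_transfers(block)])
```

A 532-byte block therefore needs one full 512-byte transfer plus a 20-byte remainder.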

I wonder how a real drive would do!

Lisa2 · « **Reply #2 on:** September 08, 2021, 02:08:47 pm »

A couple more quick notes on this:

1. I was using the latest build of the Aphid software from this summer.

2. I think the SD card used in the Aphid is a class 10 ( I will confirm this later ).

3. The CF card in the X/Profile is rated at 80x speed ( 80 x 150 kB/s = 12,000 kB/s, or about 12 MB/s ).

4. I did try connecting the Aphid to the internal port on the 2/10 without success so far ( I do have the correct cable for this ), but I can try the test again using both connected to the dual par port card ( or to the same exact port if you want ).

5. I can also run the same test on a real ProFile or a Widget if you want; I have both working and available.

Rick

blusnowkitty · « **Reply #3 on:** September 08, 2021, 02:38:24 pm »

Quote from: stepleton on September 08, 2021, 01:45:00 pm

It would be nice to compare two drives on the same port; it's possible that the internal port and the expansion card could have different rates.

For what it's worth, I found this in the Parallel Card manual trying to debug my homemade DMP cable:

Quote

1. Timing. The Lisa parallel connector is driven by a 500 kHz clock. The Parallel Interface Card supports a faster clock operating at 1.25 MHz. This affects software using 6522A internal timers.

https://generalphiles.com/files/Apple/Lisa/Parallel%20Interface%20Card.pdf
Page 14, or PDF page 21

Lisa2 · « **Reply #4 on:** September 08, 2021, 03:52:34 pm »

This would seem to indicate that the Dual Parallel card is more than twice as fast as the internal port. I am not sure this is the case. The Dual Parallel card manual also states the card has a "625K bytes/second maximum data transfer rate". I can't find this spec for the internal port right now.

I can test the same X/Profile on the Dual Parallel card and see if it's any faster...

Rick

stepleton · « **Reply #5 on:** September 08, 2021, 04:04:40 pm »

I remembered that thing about the faster clock on the parallel card; how that affects I/O with the Lisa I'm not sure. But @Lisa2, if you're willing to try your X/ProFile on the parallel port, that would be an excellent test of the difference.

I'm girding myself for the news that Cameo/Aphid really is slower! I never have tried to optimise it for speed, and there's a fair amount you could do to make it go faster. (Ditching Python for C might be a start.) But I'm betting that most Lisa applications aren't really disk I/O bound. The main thing I'm worried about is compatibility, so that bit about it not working internally is what grabs my attention. Do you have the inline 100-ohm terminating resistors fitted? (I'm guessing you do.) These are what enable Widget replacement for me, but I haven't had more 2/10s to try it out on besides my own.

It's worth saying that when James Denton was working on preparing this product derived from Cameo/Aphid, he needed to make some tweaks to get it working on the internal connector. Note the presence of Rev. A and Rev. B options on his page. I'm not sure how his Rev. B is made, but I know he was investigating a simpler replacement for the TXS0108 bidirectional level adaptor chips that I use in my design, which I think is a pretty good idea. James was planning to share his designs and may have already done it somewhere.

jamesdenton · « **Reply #6 on:** September 08, 2021, 05:25:07 pm »

Quote from: stepleton on September 08, 2021, 04:04:40 pm

It's worth saying that when James Denton was working on preparing this product derived from Cameo/Aphid, he needed to make some tweaks to get it working on the internal connector. Note the presence of Rev. A and Rev. B options on his page. I'm not sure how his Rev. B is made, but I know he was investigating a simpler replacement for the TXS0108 bidirectional level adaptor chips that I use in my design, which I think is a pretty good idea. James was planning to share his designs and may have already done it somewhere.

Thanks for the nudge - I've pushed the schematics, board, and related BOM to my forked repo here.

FWIW, the Rev A board uses the same components found in Tom's design and uses TXS0108s and 100ohm resistors (where necessary). Works well with Lisa 2/5 and Apple ///. Been a while since I tested with a parallel card. I was not able to get it to work reliably with my 2/10 using the internal cable. The same can be said for the original cameo/aphid I built.

Which brings me to Rev B - I used many BSS138s in place of the (2) TXS0108s. This has been shown to work really well on the 2/10, along with the 2/5 and Apple ///. However, I have found the PocketBeagles themselves to be a little more flaky with this board. Some work great 100% of the time, others are a bit more temperamental and I see checksum errors and various oddities.

stepleton · « **Reply #7 on:** September 08, 2021, 05:59:34 pm »

Thanks James! Looks like I might need to find some different 2/10s for testing besides my own. About the only thing I can think of that might be different is that I'm not using any adaptor cable between my own device and the Widget cable, which I plug straight into the 26-pin header at the back of the Aphid board.

It's been about three years since I designed the Aphid board. It's possible that trying to use fancy all-in-one automatic bidirectional level adaptor ICs for the data and signaling lines was just too clever by half. (ProFile emulation was only meant to be one application for Aphid --- I wanted to have the board be useful for other things too, like GPIB. But I've never tried any other use.)

I think there are bidirectional level shifter ICs that require you to toggle the data direction with a pin; alternatively, you could do what the ProFile did and have two tri-state buffers, one for in and one for out. The R/~W line picks the active one directly, if I recall correctly. I suspect either would be more dependable. In the meantime, I wish I knew why my 2/10 was more tolerant than other folks' machines.

rayarachelian · « **Reply #8 on:** September 08, 2021, 06:02:52 pm »

Don't get misled. That "1.25MHz speed" thing is about the T1 and T2 timers that are inside the VIAs, which can cause an interrupt, or be used to generate a square wave output, etc. It's not about the transfer rate to/from a ProFile or Widget.

"625K bytes/second maximum data transfer rate" is pure bullshit. They got to it like this: 68000 at 5MHz so 5,000,000 cycles/second. Each memory access is 8 cycles. 5,000,000 / 8 = 625,000 or 625KB/s.

However that's a lie. If you have a tight 68000 assembly loop that reads from port A on that via, and then turns around and writes to memory and increments an address register, you need at least two memory accesses (each 8 CPU cycles), one for a read and one for a write.

But wait! There's more! The 68000 doesn't really have a cache (well it has the IR register, which is a single word, so might as well have none), so it needs some bus cycles to read those opcodes, and then, to execute them.

The smallest opcode can be read in a single shot (2 bytes/1 16-bit word) - so 8 CPU cycles.

So if you use register based addressing, indexing you can do something like this:

Code:

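       LEA.L  VIA_portA,A0      ; A0 -> VIA port A data register
       LEA.L  BUFFER,A1         ; A1 -> destination buffer in RAM
       MOVE.W #512+20-1,D0      ; 532 bytes; DBRA runs the loop D0+1 times
loop:  MOVE.B (A0),(A1)+        ; read a byte from the VIA, store it, bump A1
       DBRA   D0,loop           ; decrement D0, loop until it reaches -1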

This guy MOVE.B (A0),(A1)+ will take at least 24 CPU cycles, likely more: 8 cycles to read the opcode, 8 cycles to read from the VIA port A (IRA), 8 cycles to write the port's data to memory at (A1), and a few more to increment A1. Then DBRA will take at least 8 more cycles just to read the opcode, likely more for the decrement of D0.

So, right off the bat you're looking at minimum 32 cycles.

On top of that, half the bus cycles are used by the video state machine, during which time the 68000 is just sitting there waiting. Sure, it could perform internal operations, but that's unlikely as there's no cache. So you're already starting with [ 5,000,000 cycles/sec / 2 (video state steal) / 32 (cycles for those two instructions) ] at the absolute minimum.

So doing that division gives us something like 78.125KB/s at the absolute best case. And yes, I cheated, I was lazy and didn't assemble this and look up each opcode generated in the 68000UGM to see the exact cycle counts. Most likely it's more cycles than I said, so slower than that 78KB/s - possibly even as slow as half of that.
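
For anyone who wants to replay the back-of-the-envelope numbers, here is a small sketch. It uses only the figures quoted in this post (5 MHz clock, 8 cycles per bus access, 32 cycles per byte for the loop, half the bus time lost to video); the names are illustrative, and none of the cycle counts have been checked against the 68000 manual.

```python
# Figures quoted in the post; names are illustrative only.
CPU_HZ = 5_000_000          # 68000 clock in the Lisa
CYCLES_PER_BUS_ACCESS = 8   # per-access cost assumed in the post


def bytes_per_second(cycles_per_byte, cpu_share=1.0):
    """Bytes moved per second if each byte costs cycles_per_byte CPU cycles
    and the CPU only gets cpu_share of the available bus time."""
    return CPU_HZ * cpu_share / cycles_per_byte


# The manual's figure: one 8-cycle access per byte, nothing else counted.
manual_claim = bytes_per_second(CYCLES_PER_BUS_ACCESS)

# The loop estimate: 32 cycles per byte, with half the bus time going to video.
loop_estimate = bytes_per_second(32, cpu_share=0.5)

print(manual_claim, loop_estimate)
```

Swapping in a larger cycles-per-byte value shows how quickly the ceiling drops if the real instruction timings are worse than the minimums assumed here.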

Sure you can play tricks like loop unrolling (which LOS 3.1 does) or whatever, but there is some limit that you can't go past.

Perhaps if they had rigged up a DMA controller it might have gotten closer, but still, it would take about the same number of cycles while the DMA controller does its thing and the 68000 is waiting for the bus. You'd need to rig up a separate bus out of the way of the CPU to the DRAM and tie it into the DRAM refresh to get that fast. That is, instead of using half the cycles to just refresh the DRAM (which is the other half of the purpose of the video state machine), you could have the DMA controller push data from the ProFile directly to a buffer set up by the OS in RAM. Which is really hard to do.

The Mac does something similar with its video state machine, but uses faster DRAM so they can go the full 8MHz and also do the DRAM refreshes while the CPU isn't using the bus (I think it uses !CLK vs CLK so both high and low clocks are used) - Steve Chamberlain had a nice writeup on this when he was building his Mac clone: https://www.bigmessowires.com/category/plustoo/ or here https://www.bigmessowires.com/category/68katy/ - but I remember he had another with the bus cycle timings and what not that I don't see right now.

The Lisa actually alternates between 8 cycles for the CPU, and then 8 cycles for the video/DRAM refresh - somewhere in the Lisa HWG they show this with all the bus cycles 0-7 for CPU, then 0-7 for video. That's why it's much slower. So it's closer to 2.5MHz in reality.

So tl;dr there's no way to get anywhere near 625KBPS. I'll go away and shut up now.

Lisa2 · « **Reply #9 on:** September 09, 2021, 12:11:16 pm »

As a followup to my post yesterday:

1. I did confirm the SD card in the Aphid I used for the test was a 16Gig UHS speed class 1 ( equivalent to a speed class 10 ).

2. My interface uses TXS0108s and 100ohm resistors

3. I did try to test my real 5M profile, but while it was working last weekend, it was not cooperating last night. Those things are quite temperamental.

4. Tested the same X/Profile on both the internal 2/10 port and using a Dual Par Card. The end result is that the Dual Par Card is faster than the internal port in this very un-scientific test. Long, un-professional video of the testing here:

https://youtu.be/63BF9FbOylU

Rick

stepleton · « **Reply #10 on:** September 09, 2021, 01:40:49 pm »

Thanks for trying it out! I guess that settles it. Cameo/Aphid will just be slower than the X/ProFile for now.

In a read request, there is basically one likely place for an Apple parallel port hard drive to be slow, and that's when it's getting the data ready for the computer to read back. Once the hard drive says "ready", the computer sets the pace, and there's no reason to expect that the Lisa would treat any hard drive differently at that point.

So the place to look for speedups on read is at the point where the drive marshals data to be read into the computer. In Cameo/Aphid there's definitely some room for improvement in the higher-level side of things. (Cameo/Aphid's programming is split into two parts: fast lower-level I/O that runs on these dedicated TI coprocessors that the PocketBeagle has, then a slower high-level Python program that runs on the ARM.)

While Cameo/Aphid is running, the entire drive image is living in a memory mapped file, so it's starting off in a place where it can be accessed very rapidly. It's simple to access a block --- you just pull the right run of bytes out of an array.
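
As a minimal sketch of that idea (not the actual Cameo/Aphid code), here is block access by slicing a memory-mapped file. The flat layout of back-to-back 532-byte blocks, the file name, and the function names are all assumptions for illustration.

```python
import mmap
import os
import tempfile

BLOCK_SIZE = 532  # bytes per ProFile block, per the thread


def read_block(image, block_number):
    """Return one block by slicing the memory-mapped image.

    Assumes blocks are stored back to back with no header, which may not
    match the real image format.
    """
    start = block_number * BLOCK_SIZE
    return image[start:start + BLOCK_SIZE]


def demo():
    """Build a tiny 4-block stand-in image and read block 2 back out of it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.image")  # hypothetical file name
        with open(path, "wb") as f:
            for n in range(4):
                f.write(bytes([n]) * BLOCK_SIZE)  # block n is filled with byte n
        with open(path, "r+b") as f:
            with mmap.mmap(f.fileno(), 0) as image:
                return read_block(image, 2)


if __name__ == "__main__":
    block = demo()
    print(len(block), block[0])
```

Slicing an `mmap` object returns a `bytes` copy of just that range, so the lookup itself costs very little compared with what happens to the block afterwards.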

But then it has to copy the data over to the PocketBeagle's speedy I/O coprocessors --- and ah yes, this is probably a big reason for why reads are slower. It can't just do a copy --- it also has to compute and interleave the parity bytes* for all of the data bytes on the fly. This line would probably be WAY faster in any common compiled language. Once it does that, it has to do that copy, but it does it in three chunks (not two) because of that 512-byte limited internal I/O that I mentioned: we have 2*532 bytes to copy given the parity bytes.
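
To show the shape of that work, here is a rough pure-Python sketch, not the real Cameo/Aphid code. The odd-parity convention, the data-then-parity byte order, and every name below are assumptions; only the 532-byte block, the one-parity-byte-per-data-byte scheme, and the 512-byte transfer limit come from this thread.

```python
CHUNK_SIZE = 512  # internal transfer limit mentioned in the thread


def parity_byte(value):
    """Return 1 if value has an even number of set bits, else 0.

    That makes data plus parity odd overall. Odd parity is an assumption
    here, not something stated in the thread.
    """
    return (bin(value).count("1") + 1) % 2


def interleave_parity(block):
    """Follow every data byte with a whole byte holding its parity bit.

    The data-then-parity ordering is also an assumption.
    """
    out = bytearray()
    for value in block:
        out.append(value)
        out.append(parity_byte(value))
    return bytes(out)


def chunks(payload, size=CHUNK_SIZE):
    """Cut payload into consecutive pieces of at most size bytes."""
    return [payload[i:i + size] for i in range(0, len(payload), size)]


if __name__ == "__main__":
    payload = interleave_parity(bytes(532))  # all-zero stand-in block
    print(len(payload), [len(c) for c in chunks(payload)])
```

A per-byte loop like this is exactly the kind of thing an interpreter handles slowly, and the 1064-byte result needs two full 512-byte transfers plus a 40-byte tail, hence three copies.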

In a way I'm relieved, since there's probably a lot of low-hanging fruit to get more speed. I'm reluctant to move away from Python since that's what makes the "magic block" plugin facility so easy to program. (This is the system that powers the hard drive image selector.)

But I probably won't try to do anything unless I hear that running twice as slowly as an X/ProFile is a problem :-) I'm guessing (hoping!) that Cameo/Aphid will still smoke a ProFile or a Widget, but the main setting where I could anticipate an issue with the speed as-is is if you're trying to watch a video.

Thanks for doing this investigation! I'm still annoyed that the termination resistors aren't a general solution. More development is required to see if the TXS0108 chip option can be reliable at the end of the internal cable. I'll update the Cameo/Aphid GitHub page soon to say that it's not a sure bet.

* ETA: Parity bytes? Well, we only use one bit of each byte, but we still use a whole byte for simplicity.
