Topic: LisaEm Bugs - LisaTest crash on CPU (Read 13850 times)

D.Finni · « **on:** April 11, 2019, 01:43:52 pm »

Crash running LisaTest off the disk labeled "LisaTest for Peripherals."

Maybe this is low-hanging fruit to find and fix?

rayarachelian · « **Reply #1 on:** April 12, 2019, 06:59:32 pm »

Thank you D. Finni

Hmmm, that's unfortunate, LisaTest was one of the first things to boot and run on LisaEm, and they should pass. I guess over time new bugs have slipped in.

Can you tell me which version of LisaTest you used (since there are a quite few different ones on bitsavers and other places)
That message means the Program Counter went to zero which is not a good place to be for a 68000 (as an aside, reboots are done by setting it to -1 which causes both a segfault and an address error at the same time, and since the 68k can't recover from that, it resets), since the low addresses are where the 68000 store vectors to exceptions.
So either:

the code went wrong somewhere during while taking a vector
the vector is set to zero and an interrupt happened incorrectly
something clobbered the PC by writing memory randomly
or perhaps LisaTest is testing this specific condition and expects different results than we emulate.

P.S. I've moved this out of the workshop discussion as it's a different topic.

D.Finni · « **Reply #2 on:** April 12, 2019, 09:08:29 pm »

Oops, I wasn't a very responsible bug reporter.

OK, here I have attached the disk image.

Once booted, the button sequence is Short, Continue, Continue. After a few moments of the CPU test, it crashes.

rayarachelian · « **Reply #3 on:** April 13, 2019, 10:19:08 am »

Ouch, well this is even more embarrassing, it crashes early during booting in 1.2.7.

Back to the drawingboard I guess.

Looks like it switches MMU context, but something goes off the rails and blows up, it actually causes a segfault, I don't think this happens on 1.2.6 on macos X since you're able to at least start a test. I did fix a bunch of IPC related memory leaks in 1.2.7, that may be related.

Code: [Select]

1/000e537a (1 1/0/0) : 1f28 0001                  : ....     :  266 : MOVE.B     $0001(A0),-(A7)  SRC:clk:000000001b499b32 +16 clks
D 0:00248000 1:000007b2 2:000000ff 3:00042004 4:00000003 5:000003b2 6:00000012 7:00000012 .Sz.... imsk:7 pnd:3 (1/0/normal cx:1)SRC:
A 0:0004d3e2 1:00004000 2:000e5328 3:00000d2c 4:00003566 5:0004e5a4 6:0004d3ac 7:0004d376 SP:00000800 PC:000e537e SRC:

1/000e537e (1 1/0/0) : 341f                       : 4.       :  451 : MOVE.W     (A7)+,D2  SRC:clk:000000001b499b42 +8 clks
D 0:00248000 1:000007b2 2:0000073e 3:00042004 4:00000003 5:000003b2 6:00000012 7:00000012 .Sz.... imsk:7 pnd:3 (1/0/normal cx:1)SRC:
A 0:0004d3e2 1:00004000 2:000e5328 3:00000d2c 4:00003566 5:0004e5a4 6:0004d3ac 7:0004d378 SP:00000800 PC:000e5380 SRC:

1/000e5380 (1 1/0/0) : 206e 000c                  : .n..     :  420 : MOVE.L     $000c(A6),A0  SRC:clk:000000001b499b4a +16 clks
D 0:00248000 1:000007b2 2:0000073e 3:00042004 4:00000003 5:000003b2 6:00000012 7:00000012 .Sz.... imsk:7 pnd:3 (1/0/normal cx:1)SRC:
A 0:0004d3e0 1:00004000 2:000e5328 3:00000d2c 4:00003566 5:0004e5a4 6:0004d3ac 7:0004d378 SP:00000800 PC:000e5384 SRC:

1/000e5384 (1 1/0/0) : 1428 0001                  : ....     :  263 : MOVE.B     $0001(A0),D2  SRC:clk:000000001b499b5a +12 clks
D 0:00248000 1:000007b2 2:000007fc 3:00042004 4:00000003 5:000003b2 6:00000012 7:00000012 .S..n.. imsk:7 pnd:3 (1/0/normal cx:1)SRC:
A 0:0004d3e0 1:00004000 2:000e5328 3:00000d2c 4:00003566 5:0004e5a4 6:0004d3ac 7:0004d378 SP:00000800 PC:000e5388 SRC:

cpu68k.c:cpu68k_ipc:508:dt_ImmW @ 000e538a target:00000016 opcode:d2fc| 10:21:18.2 457808742
1/000e5388 (1 1/0/0) : d2fc 0016                  : ....     : 1466 : ADDA.W     #$0016,A1  SRC:clk:000000001b499b66 +12 clks
D 0:00248000 1:000007b2 2:000007fc 3:00042004 4:00000003 5:000003b2 6:00000012 7:00000012 .S..n.. imsk:7 pnd:3 (1/0/normal cx:1)SRC:
A 0:0004d3e0 1:00004016 2:000e5328 3:00000d2c 4:00003566 5:0004e5a4 6:0004d3ac 7:0004d378 SP:00000800 PC:000e538c SRC:

1/000e538c (1 1/0/0) : 4e91                       : N.       :  752 : JSR.L      (A1)  SRC:clk:000000001b499b72 +16 clks
D 0:00248000 1:000007b2 2:000007fc 3:00042004 4:00000003 5:000003b2 6:00000012 7:00000012 .S..n.. imsk:7 pnd:3 (1/0/normal cx:1)SRC:
A 0:0004d3e0 1:00004016 2:000e5328 3:00000d2c 4:00003566 5:0004e5a4 6:0004d3ac 7:0004d374 SP:00000800 PC:00004016 SRC:

cpu68k.c:cpu68k_ipc:452:dt_AbsL @ 00004018 target:00fce010 opcode:4a39| 10:21:18.2 457808770
cpu68k.c:cpu68k_ipc:452:dt_AbsL @ 00004026 target:00fce012 opcode:4a39| 10:21:18.2 457808770
1/00004016 (1 1/0/0) : 4a39 00fc e010             : J9....   :  698 : TST.B      $00fce010  SRC:clk:000000001b499b82 +16 clks
memory.c:lisa_rb_io:3519:@00fce010| 10:21:18.2 457808770
memory.c:lisa_wb_Oxe000_latches:2003:@00fce010| 10:21:18.2 457808770
memory.c:lisa_wb_Oxe000_latches:2039:start=1 was 0 SRC:context_switch!| 10:21:18.2 457808770
mmu.c:mmuflush:890:Entering MMUflush Context Change - current context:1| 10:21:18.2 457808770
mmu.c:mmuflush:900:--->switched to context:0 s1/s2/start:1/0/1<----
| 10:21:18.2 457808770
mmu.c:mmuflush:904:mmu flush - old context:1 supervisor_flag:1 current s flag:1 switching to context=0 s1/s2=1/0 start=1
| 10:21:18.2 457808770
mmu.c:init_start_mode_segment:290:*** mmu_t[0][0].adr!=adr invalid/changed, good to reset it
| 10:21:18.2 457808770
mmu.c:init_start_mode_segment:291:*** mmu_t[0][0].rfn!=rfn invalid/changed, good to reset it
| 10:21:18.2 457808770
mmu.c:init_start_mode_segment:292:*** mmu_t[0][0].wfn!=wfn invalid/changed, good to reset it
| 10:21:18.2 457808770
mmu.c:init_start_mode_segment:290:*** mmu_t[0][1].adr!=adr invalid/changed, good to reset it
| 10:21:18.2 457808770
mmu.c:init_start_mode_segment:291:*** mmu_t[0][1].rfn!=rfn invalid/changed, good to reset it
| 10:21:18.2 457808770
mmu.c:init_start_mode_segment:292:*** mmu_t[0][1].wfn!=wfn invalid/changed, good to reset it
| 10:21:18.2 457808770
mmu.c:init_start_mode_segment:290:*** mmu_t[0][2].adr!=adr invalid/changed, good to reset it
| 10:21:18.2 457808770
mmu.c:init_start_mode_segment:291:*** mmu_t[0][2].rfn!=rfn invalid/changed, good to reset it
...
mmu.c:init_start_mode_segment:290:*** mmu_t[0][68].adr!=adr invalid/changed, good to reset it
| 10:21:18.2 457808770
mmu.c:init_start_mode_segment:292:*** mmu_t[0][68].wfn!=wfn invalid/changed, good to reset it
| 10:21:18.2 457808770
mmu.c:mmuflush:920:Initialized start mode, done with mmuflush
| 10:21:18.2 457808770
D 0:00248000 1:000007b2 2:000007fc 3:00042004 4:00000003 5:000003b2 6:00000012 7:00000012 .S..n.. imsk:7 pnd:3 (1/0/START cx:0)SRC:
A 0:0004d3e0 1:00004016 2:000e5328 3:00000d2c 4:00003566 5:0004e5a4 6:0004d3ac 7:0004d374 SP:00000800 PC:0000401c SRC:

Ok, so built a fresh copy of 1.2.6.2 (I know, I know) in a Ubuntu 18.04 VM, and yes, the above is a fresh new bug in 1.2.7, and confirmed, the bug you reported still does happen in 1.2.6.2.

From what I see, a vertical retrace interrupt occurred, but the target for the vector handler was in a memory page that wasn't loaded, which caused a bus error, and the bus error vector was set to zero, so that set the PC to zero.
I'm guessing that the vertical retrace interrupt should not have happened and occurred wrongly since LisaTest didn't setup an interrupt handler, either that or the MMU refresh code failed and returned an empty page instead of page zero, and so when the ISR read the vector for bus error it got a zero.

Code: [Select]

reg68k.c:reg68k_external_execute:1037:

Current opcode lives inside a bad_page, throwing lisa_mmu_exception @ 1/0066000e. Previous IR=66fc
reg68k.c:reg68k_external_execute:1038:mmu_trans[3300] segment #51


reg68k.c:reg68k_internal_vector:1840:Entering: internal_vector - VECTOR:2, addr_err:0066000e oldpc:e466000e, pc24:0066000e, reg68k_pc:e466000e
irqdoneD 0:00000000 1:0000ff00 2:00001e4a 3:00f7c9ca 4:0000ccea 5:0003bc00 6:0000c9ba 7:00000003 .S..... imsk:1 pnd:2 (0/0/normal cx:1)SRC:
irqdoneA 0:002a00ee 1:e466000e 2:0000dcda 3:00221c5e 4:00f80100 5:00f7ff00 6:00f7cc40 7:00f7cbd6 SP:00f76bba PC:e466000e SRC:

memory.c:dmem68k_fetch_long:436:
reg68k.c:reg68k_internal_vector:1881: REAd LONG @ 1/00000008:: 
chkmtmmu: (ok):17/17 {mmu/mt:00000008/00080008 *MISMATCH*},   mmu_t[1][0000].fn=17 17-ram   mmu[0].fn=17 17-ram, mmudirty:0
memory.c:lisa_rl_ram:2529:mmu translation of 1/00000008 is: 00080008
memory.c:dmem68k_fetch_long:442::::::READ LONG 00000000 '    ' from @1/00000008 (@00000008)
reg68k.c:reg68k_internal_vector:1930:PUSH PC e466000e context:1
memory.c:dmem68k_store_long:511::::::WRITE LONG e466000e 'df O' to @1/00f7cbd2

Created a new bug here: https://github.com/rayarachelian/lisaem-1.2.x/issues/9

D.Finni · « **Reply #4 on:** May 02, 2019, 11:00:54 pm »

Quote from: rayarachelian on April 13, 2019, 10:19:08 am

Created a new bug here: https://github.com/rayarachelian/lisaem-1.2.x/issues/9

Any update on this? Have you had time to look into this bug?

rayarachelian · « **Reply #5 on:** May 05, 2019, 01:29:15 pm »

Nothing useful yet. The last bits of code before this happen turn on VIA1 IRQ for CA1, then there's a time delay, then when the vertical retrace happens, the ISR for autovector 1 is either not loaded or is in a bad page, so a bus error occurs, but the bus error vector is zero, so it dies.

The only odd thing is this MMU mismatch warning, but when I check on the RAM dump and MMU dump, I see they all have valid values, and the MMU cache matches the MMU registers, so not sure what's wrong. (look for *MISMATCH*)

I don't think LisaTest sets up a vector for bus errors and so this is an unexpected event. The real question then is why this IRQ happened and wasn't handled. Though I do see some code before the mismatches that seems to copy low RAM elsewhere, so perhaps it was getting ready to test low memory before this happened.

I do see some of the code winds up executing with the high address byte set (on the Lisa and pre-32 bit clean Macs the high address byte is ignored, and is used for flags. It does look like LOS uses this too and some of that is addressed in 1.2.7, but I don't see any obvious thing that I can point at and say where things went wrong to cause the bus error to be triggered, so whatever the bug is, it's still hidden.
Here's a bit of the cleaned up tracelog, as well as a larger uncleaned one twoards the end, but I don't know what else is useful there.

rayarachelian · « **Reply #6 on:** May 30, 2019, 07:40:28 pm »

So I've stopped looking to fix this on 1.2.6 and instead am trying to fix this (or other related bugs) on 1.2.7. There's some code in reg68k.c that allocates IPCTs (Instruction Pointer Cache Tables) which libGenerator uses as an instruction cache.

I think either something is clobbering the data stored in that cache or in the MMU pointers, or something else is causing corruption there.

I've eliminated the free ipc table - the source of the bug isn't there by adding code to walk through the free ipcts and check that they are initialized to zero and none are "dirty" - the bug doesn't happen there.

The symptom in 1.2.7 is that in this code there's a segfault caused following ipc->opcode.
As per gdb ipc is not NULL, but following the pointer causes a crash, which means it's pointing to the wrong place, which means something has corrupted, or invalid pointers somehow were written to it - which should never happen. It's possible some other code somewhere is either clobbering the IPCTs or the mmu translation cache.

This code is a little bit unoptimized vs 1.2.6 - because I'm trying to track down exactly where the segfault is happening, using -O3 optimizes some of the variables out so it's hard to tell and the segfault happens on if (flag)

Code: [Select]

               abort_opcode=2;

               #ifndef EVALUATE_EACH_IPC
               static int flag;
               flag=(ipc==NULL);
               if (!flag) flag=(ipc->function==NULL);

               if (!flag) 
                  {  uint16 myword=fetchword(pc24 & 0x00ffffff);
                     if (ipc->opcode!=myword) flag=1;
                  }
               if (flag) //==13256== Conditional jump or move depends on uninitialised value(s)
               #endif
                {
                    if (abort_opcode==1) break;
                    if (!mt->table) mt->table=get_ipct();  //we can skip free_ipct
                    cpu68k_makeipclist(pc24 & 0x00ffffff); if (abort_opcode==1) break; //==24726== Conditional jump or move depends on uninitialised value(s)
                    ipc=&(mt->table->ipc[(pc24 & 0x1ff)>>1]);
                }
                abort_opcode=0;

rayarachelian · « **Reply #7 on:** June 01, 2019, 08:09:28 pm »

Found a minor initialization bug that was causing the 1.2.7 boot up crashes on LisaTest, so basically when I use calloc to allocate memory dynamically, that guarantees that the space allocated is initialized to zero, however, any arrays declared in C code itself that get allocated to the heap are apparently garbage filled. I did clear most of the fields in mmu_trans, except the table pointer, which had random junk in it, this caused the segfault in previous reply.

This might or might not be the issue with LisaTest failing, as mentioned in another post tonight, had some mouse movement issues that prevented running LisaTest properly, but it did expose a possible easter egg in Monitor OS (or perhaps it's not well known).

LisaList2

News:

Author Topic: LisaEm Bugs - LisaTest crash on CPU (Read 13850 times)

D.Finni

LisaEm Bugs - LisaTest crash on CPU

rayarachelian

Re: LisaEm Bugs

D.Finni

Re: LisaEm Bugs

rayarachelian

Re: LisaEm Bugs

D.Finni

Re: LisaEm Bugs

rayarachelian

Re: LisaEm Bugs - LisaTest crash on CPU

rayarachelian

Re: LisaEm Bugs - LisaTest crash on CPU

rayarachelian

Re: LisaEm Bugs - LisaTest crash on CPU