Synth-DIY Yahoo! Groups Archives

Having a persistent problem that is proving resistant to standard debugging 
tricks.
I'm hoping that talking it through here will dredge up the problem.

I'd like to say that I've eliminated all the obvious problems, but I've 
been knocking on this for a while now, with no resolution.
Apologies for the lengthy explanation, but it's a dynamic problem, and 
requires some understanding of what's happening to debug.

Chip: Mega8, 8 MHz internal R/C osc.
Boot routine includes an explicit scrub of all registers, and all SRAM.
Tons of stack space, and there's a constant check for creeping stack or 
undesired ISRs, which never fires.
Emulation says the stack only goes down by about 6 calls, which is normal; 
there's over 100 bytes between the last RAM var and the lowest stack point.
The trap routine has been tested in sim, and in emulation, by corrupting 
the stack, or triggering unused ISRs, and it does work there.
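
For anyone who wants a picture of what a "creeping stack" check can look like, here is a minimal sketch in C. It is only an assumption about the general technique (a painted guard region between the last RAM variable and the lowest expected stack point), not the actual trap routine, which is AVR assembly. GUARD_BYTES, GUARD_PATTERN, guard_paint and guard_intact are all made-up names.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical values: the real guard size and fill byte are not given. */
enum { GUARD_BYTES = 8 };
#define GUARD_PATTERN 0xA5u

/* Paint the guard region once at boot, after the SRAM scrub.
 * On the target, 'guard' would point just above the last RAM variable. */
static void guard_paint(uint8_t *guard)
{
    for (size_t i = 0; i < GUARD_BYTES; i++) {
        guard[i] = GUARD_PATTERN;
    }
}

/* Called regularly from the main loop: returns false as soon as any
 * guard byte no longer holds the pattern, i.e. something (a deep stack
 * or a stray pointer write) has touched the region. */
static bool guard_intact(const uint8_t *guard)
{
    for (size_t i = 0; i < GUARD_BYTES; i++) {
        if (guard[i] != GUARD_PATTERN) {
            return false;
        }
    }
    return true;
}
```

A check like this only catches writes that land inside the guard region, so it says nothing about a stray write that jumps straight to some other address.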

I don't think the data reception is at fault, but I'll give a brief 
explanation here:

This system receives data from a master system, using a latch that I read, 
during the INT0 ISR.
If the host "blips" the int line, then it will no longer be low by the time 
that I look at it in the ISR (physically impossible for it to still be 
low), and so that is "data", which I count, and store using X as a pointer 
dedicated to this ISR.
If the host holds the int line low, then I read this data as a "command", 
and store it in a special SRAM slot, where it is acted on in the main 
routine.  There is more to the comms, like acking data and command, but I 
don't think it's relevant.

The ISR maintains a count of bytes received, in SRAM at 3A7h (Note, not on 
a boundary)

The ISR decides it has a full packet when the byte count reaches 104, and 
it then increments a count of packets received, which is stored in a low 
register. (I've tried moving the byte count to SRAM, no diff).  Receiving a 
command with < 104 bytes received terminates a packet early, and this is 
normal, and desired. Sim and emulation say this is working fine.

Whenever we get a full packet, then I check for buffer wrap, and if so, 
reset X to the beginning of the buffer space.
I also do a "full" check, based on packet count, and signal the master to 
hold off when I'm full (appears to work fine)

Data is stored in a number of 104 byte packets, starting at a fixed address 
in SRAM, and rolling over when full.
It's more or less a standard circular buffer, except that the data is in 
104 byte records, and each is used multiple times before discarding it, 
which involves paving it to zeroes, and incrementing the tail pointer.
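
To make the buffer scheme concrete, here is a rough C model of it. This is a sketch under assumptions, not the real code (which is assembly, with X as the write pointer): NUM_PACKETS is an invented size, the struct and function names are hypothetical, and early termination of a packet by a command is left out.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

enum { PACKET_BYTES = 104, NUM_PACKETS = 4 }; /* NUM_PACKETS is made up */

typedef struct {
    uint8_t store[NUM_PACKETS][PACKET_BYTES];
    uint8_t head;         /* record being filled (where X would point) */
    uint8_t tail;         /* oldest record still in use */
    uint8_t byte_count;   /* bytes received in the current record */
    uint8_t packet_count; /* complete records currently held */
} ring_t;

/* "Full" check based on packet count; the master is told to hold off. */
static bool ring_full(const ring_t *r)
{
    return r->packet_count >= NUM_PACKETS;
}

/* Store one data byte. Returns true when this byte completed a packet. */
static bool ring_put_byte(ring_t *r, uint8_t b)
{
    if (ring_full(r)) {
        return false; /* should not happen while the master holds off */
    }
    r->store[r->head][r->byte_count++] = b;
    if (r->byte_count == PACKET_BYTES) {
        r->byte_count = 0;
        r->packet_count++;
        /* Wrap check: go back to the first record after the last one. */
        r->head = (uint8_t)((r->head + 1) % NUM_PACKETS);
        return true;
    }
    return false;
}

/* Discard the oldest record: pave it to zeroes, advance the tail. */
static void ring_discard(ring_t *r)
{
    if (r->packet_count == 0) {
        return;
    }
    memset(r->store[r->tail], 0, PACKET_BYTES);
    r->tail = (uint8_t)((r->tail + 1) % NUM_PACKETS);
    r->packet_count--;
}
```

In this model ring_put_byte plays the part of the ISR and ring_discard the part of the main routine, so on real hardware the shared counts would need the same interrupt protection as any other flag.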

Ok, so we have lots of packets in SRAM, and we're wrapping properly, 
detecting full when we should, and all that fun stuff.

I have run this in sim, including simulating the ints, and there is never a 
problem.
Unfortunately, I can't sim the command reception.

Also, the diag version runs fine in real hardware, producing endless 
perfect output.

In the diag version, the int0 ints can only happen at one point, when all 
the other tasks have run, so it's not a "real" test.
Because the ints are faked by code in the target system, they can only 
occur at one point in the code, and they "slam" a full packet in before 
returning to the main program flow. Making this more scattered would be 
extremely problematic.

So, the ISR code, when examined, hand-flown, and run in the sim, works 
fine, for hours.
In real life, in the ICE-50 or real chip, it appears to lose its place 
occasionally.
Unfortunately, telling exactly what happens is next to impossible because 
the ICE has no facility for break on write to RAM/REG.

When running for real, the system appears to think it has received a 
command, after some number of packets and commands have been transferred. 
Sometimes it fails in the first packet, sometimes it runs for 100-200 
packets or more before failure.

When I breakpoint on receiving a bogus command, (Check of Master_Command in 
SRAM is <>0) the master and I both agree on the number of bytes 
transferred, and a scope trace proves that the master is not sending me a 
command at that time. So, it's not that a byte is getting misinterpreted.

When I break on receiving a command in the ISR, I only see valid commands, 
so apparently the ISR is not the source of the problem.

The system decides it has a command to process whenever Master_Command in 
SRAM is <>0. This is a very simple test.
Commands of 00h are illegal, and the ISR traps those and converts them to 
FFh, which I can see, and which is harmlessly discarded.
The ICE tells me this never happens, which is good.
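
The command path described above boils down to something like the following C model. It is a sketch with assumptions: the real code is assembly, the function names and data_bytes_seen are hypothetical, and clearing Master_Command in the main routine after picking it up is my assumption about how the slot gets emptied.

```c
#include <stdint.h>
#include <stdbool.h>

static volatile uint8_t Master_Command;  /* 0 means "nothing pending" */
static volatile uint8_t data_bytes_seen; /* hypothetical stand-in for the data path */

/* Model of the INT0 ISR decision. int_line_still_low is the level of the
 * int line as sampled inside the ISR; latch_value is the byte read from
 * the latch. */
static void int0_isr_model(bool int_line_still_low, uint8_t latch_value)
{
    if (int_line_still_low) {
        /* Held low: command. 00h is illegal, so trap it as FFh. */
        Master_Command = (latch_value == 0x00) ? 0xFF : latch_value;
    } else {
        /* Blip: data byte (storage via X not modelled here). */
        data_bytes_seen++;
    }
}

/* Main-routine side: the "very simple test" for Master_Command <> 0.
 * Returns true and writes *cmd_out when a real command was pending.
 * On the target this read-and-clear would sit inside an int-disabled
 * section so the ISR cannot write the slot in between. */
static bool poll_command(uint8_t *cmd_out)
{
    uint8_t c = Master_Command;
    if (c == 0) {
        return false;
    }
    Master_Command = 0;      /* assumption: main clears the slot */
    if (c == 0xFF) {
        return false;        /* trapped illegal command, discarded */
    }
    *cmd_out = c;
    return true;
}
```

Note that in this model nothing except int0_isr_model ever puts a non-zero value into Master_Command, which is exactly why a bogus command with no matching ISR hit points at a stray write from somewhere else.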

It appears that the SRAM var Master_Command at 3B1h is getting corrupted.
I had the same symptoms when Master_Command was in a low register.

Clearing Master_Command whenever I receive a data byte appears to help the 
problem, but does not solve it.
Nor should this be necessary. It is simply masking the problem.  The 
system checks for commands relatively infrequently, every 10-400 us depending 
on what tasks need to run in the scheduler, and when the command byte is stored.

There are no STDs in the code, and I strongly believe that all pointer 
writes (ST X etc) are done with initialized pointers.
I moved Master_Command to SRAM thinking that this would isolate an improper 
write to low ram space, which overlaps the registers, but the symptoms are 
the same, which eliminates that possibility.

If I turn off the command mechanism, then I get very occasional bad data 
lines, where it appears that I thought that I had one more packet than I 
really had, or that a packet terminated very early, which would be hard to 
tell from one "extra" packet.
This could be cases of "false command" happening, it happens at roughly the 
same frequency, and the results are consistent with that happening.  This 
never happens during sim, only in emulation, or on the real chip.

The tasks each preserve the global int enable/disable status, and they 
locally disable ints while they are checking flags.
When they have decided not to run, they restore the int status and exit.
If they decide to run, then they may enable or disable ints as appropriate, 
but the int status is always restored on exit.
There are many points where ints are no problem, and only a few where I 
disable the ints for short periods.
During flag manipulation or checking, ints are disabled, to prevent 
problems if an ISR changes a flag bit in the middle of a decision.
This is also apparently unrelated, but...
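
For reference, the save/disable/restore pattern the tasks use looks roughly like this when written in C. It is a sketch only: the real tasks are assembly, task_flags, FLAG_PACKET_READY and task_should_run are hypothetical names, and the non-AVR branch is just a stand-in so the snippet compiles on a PC.

```c
#include <stdint.h>
#include <stdbool.h>

#ifdef __AVR__
#include <avr/io.h>         /* SREG */
#include <avr/interrupt.h>  /* cli() */
#else
/* Host-side stand-ins so the sketch compiles off-target. */
static volatile uint8_t SREG;
#define cli() (SREG &= (uint8_t)~0x80u) /* clear the global I bit (bit 7) */
#endif

static volatile uint8_t task_flags;     /* hypothetical flag byte shared with ISRs */
#define FLAG_PACKET_READY 0x01u         /* hypothetical flag bit */

/* Decide whether a task should run, with ints off during the decision.
 * The caller's int enable/disable state is restored on exit either way. */
static bool task_should_run(void)
{
    uint8_t saved = SREG;   /* remember the global int status */
    cli();                  /* no ISR can change a flag mid-decision */

    bool run = (task_flags & FLAG_PACKET_READY) != 0;
    if (run) {
        task_flags &= (uint8_t)~FLAG_PACKET_READY; /* consume the flag */
    }

    SREG = saved;           /* restore, rather than blindly re-enable */
    return run;
}
```

Restoring the saved SREG instead of executing SEI is what keeps a task from turning ints back on when its caller had them off.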


Examining and simming/emulating all the data manipulation routines says 
they are fine.

To sum it up, I guess I'm looking for what could corrupt a value stored in 
low register, or high SRAM, apparently identically, when that value is 
relocated.

Any ideas on how to isolate this?  It's a complex system, so turning things 
on and off is pretty much impractical.
There are many tasks, each waiting on flags from other tasks that they have 
completed, before they run, and they in turn set/clear flags appropriately 
to signal other tasks.  That mechanism works, and is apparently unrelated 
to the problem.  If that portion was having problems, the results would be BAD.

So, I think I've talked it out pretty thoroughly, and unfortunately, the 
light bulb is still dark..
AVR-Chat

Sneaky Sram Subterfuge (Somewhat long and complicated)
