I have a persistent problem that is proving resistant to the standard debugging tricks. I'm hoping that talking it through here will dredge up the cause. I'd like to say that I've eliminated all the obvious problems, but I've been knocking on this for a while now with no resolution. Apologies for the lengthy explanation, but it's a dynamic problem, and debugging it requires some understanding of what's happening.

Chip: Mega8, 8 MHz internal R/C osc. The boot routine includes an explicit scrub of all registers and all SRAM. There is tons of stack space, and there's a constant check for a creeping stack or undesired ISRs, which never fires. Emulation says the stack only goes down by about 6 calls, which is normal, and there are over 100 bytes between the last RAM var and the lowest stack point. The trap routine has been tested in sim and in emulation, by corrupting the stack or triggering unused ISRs, and it does work there.

I don't think the data reception is at fault, but I'll give a brief explanation here. This system receives data from a master system, using a latch that I read during the INT0 ISR. If the host "blips" the int line, then it won't still be low by the time I look at it in the ISR (physically impossible), and so that is "data", which I count and store using X as a pointer dedicated to this ISR. If the host holds the int line low, then I read this data as a "command" and store it in a special SRAM slot, where it is acted on in the main routine. There is more to the comms, like acking data and commands, but I don't think it's relevant.

The ISR maintains a count of bytes received, in SRAM at 3A7h (note: not on a boundary). The ISR decides it has a full packet when the byte count reaches 104, and it then increments a count of packets received, which is stored in a low register. (I've tried moving the byte count to SRAM, no diff.) Receiving a command with < 104 bytes received terminates a packet early; this is normal and desired.
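For readers following along, the reception logic described above might be sketched in C roughly as follows. This is only an illustration that runs on a host, not the actual firmware: every name here (`int0_handler`, `rx_buffer`, `line_still_low`, and so on) is hypothetical, the latch value and pin state are passed in as arguments instead of being read from hardware, and resetting the byte count on a full packet is an assumption. Early packet termination and the ack handshake are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define PACKET_BYTES 104u /* full-packet size stated in the post */

/* All names below are hypothetical stand-ins, not the original firmware's. */
static volatile uint8_t byte_count;     /* bytes received in the current packet */
static volatile uint8_t packet_count;   /* complete packets received */
static volatile uint8_t master_command; /* 0 means no command pending */
static uint8_t rx_buffer[PACKET_BYTES]; /* simplification: room for one record */
static uint8_t *rx_ptr = rx_buffer;     /* plays the role of the dedicated X pointer */

/* Body of the INT0 handler. On real hardware the two arguments would come
 * from reading the latch and sampling the int line inside the ISR. */
static void int0_handler(uint8_t latch_value, bool line_still_low)
{
    if (line_still_low) {
        /* Host is holding the line low: the byte is a command. */
        master_command = latch_value;
        return;
    }

    /* Host only blipped the line: the byte is data. */
    *rx_ptr++ = latch_value;
    byte_count++;

    if (byte_count == PACKET_BYTES) {
        byte_count = 0;      /* assumption: the count restarts per packet */
        packet_count++;
        rx_ptr = rx_buffer;  /* simplification: reuse the single record */
    }
}
```

Feeding 104 data bytes through `int0_handler` should bump `packet_count` once, and a call with `line_still_low` set should only touch `master_command`.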
Sim and emulation say this is working fine. Whenever we get a full packet, I check for buffer wrap, and if so, reset X to the beginning of the buffer space. I also do a "full" check, based on packet count, and signal the master to hold off when I'm full (this appears to work fine). Data is stored in a number of 104-byte packets, starting at a fixed address in SRAM, and rolling over when full. It's more or less a standard circular buffer, except that the data is in 104-byte records, and each is used multiple times before being discarded, which involves paving it to zeroes and incrementing the tail pointer.

OK, so we have lots of packets in SRAM, and we're wrapping properly, detecting full when we should, and all that fun stuff. I have run this in sim, including simulating the ints, and there is never a problem. Unfortunately, I can't sim the command reception. Also, the diag version runs fine in real hardware, producing endless perfect output. In the diag version, the INT0 ints can only happen at one point, when all the other tasks have run, so it's not a "real" test. Because the ints are faked by code in the target system, they can only occur at one point in the code, and they "slam" a full packet in before returning to the main program flow. Making this more scattered would be extremely problematic.

So, the ISR code, when examined, hand-flown, and run in the sim, works fine, for hours. In real life, in the ICE-50 or the real chip, it appears to lose its place occasionally. Unfortunately, telling exactly what happens is next to impossible because the ICE has no facility for break on write to RAM/REG. When running for real, the system appears to think it has received a command, after some number of packets and commands have been transferred. Sometimes it fails in the first packet, sometimes it runs for 100-200 packets or more before failure.
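The record-oriented circular buffer described above (104-byte records, a full check based on packet count, and a discard step that zeroes the record and advances the tail) could look something like this in C. It is a sketch under assumptions: the number of records is not given in the post, so `NUM_RECORDS` is invented, and all function and variable names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RECORD_BYTES 104u /* record size stated in the post */
#define NUM_RECORDS  4u   /* assumption: the post does not give the count */

/* Hypothetical names; the original layout is only described in prose. */
static uint8_t records[NUM_RECORDS][RECORD_BYTES];
static uint8_t head;   /* index of the next record to fill */
static uint8_t tail;   /* index of the oldest record still in use */
static uint8_t stored; /* packets currently held (drives the "full" check) */

static bool ring_full(void)
{
    return stored == NUM_RECORDS;
}

/* Claim the next record for an incoming packet, wrapping at the end of the
 * buffer space. Returns NULL when full, which is where the master would be
 * told to hold off. */
static uint8_t *ring_claim(void)
{
    if (ring_full()) {
        return NULL;
    }
    uint8_t *slot = records[head];
    head = (uint8_t)((head + 1u) % NUM_RECORDS);
    stored++;
    return slot;
}

/* Discard the oldest record: pave it to zeroes, then advance the tail. */
static void ring_discard(void)
{
    if (stored == 0u) {
        return;
    }
    memset(records[tail], 0, RECORD_BYTES);
    tail = (uint8_t)((tail + 1u) % NUM_RECORDS);
    stored--;
}
```

Tracking an explicit `stored` count, as the post's packet-count based full check suggests, avoids the usual ambiguity between "empty" and "full" when head and tail are equal.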
When I breakpoint on receiving a bogus command (the check is Master_Command in SRAM <> 0), the master and I both agree on the number of bytes transferred, and a scope trace proves that the master is not sending me a command at that time. So, it's not that a byte is getting misinterpreted. When I break on receiving a command in the ISR, I only see valid commands, so apparently the ISR is not the source of the problem.

The system decides it has a command to process whenever Master_Command in SRAM is <> 0. This is a very simple test. Commands of 00h are illegal, and the ISR traps those and converts them to FFh, which I can see, and which is harmlessly discarded. The ICE tells me this never happens, which is good.

It appears that the SRAM var Master_Command at 3B1h is getting corrupted. I had the same symptoms when Master_Command was in a low register. Clearing Master_Command whenever I receive a data byte appears to help the problem, but does not solve it. Nor should this be necessary. It is simply masking the problem.

The system checks for commands relatively infrequently, every 10-400 us depending on what tasks need to run in the scheduler and when the command byte is stored. There are no STDs in the code, and I strongly believe that all pointer writes (ST X etc.) are done with initialized pointers. I moved Master_Command to SRAM thinking that this would isolate an improper write to low RAM space, which overlaps the registers, but the symptoms are the same, which eliminates that possibility.

If I turn off the command mechanism, then I get very occasional bad data lines, where it appears that I thought I had one more packet than I really had, or that a packet terminated very early, which would be hard to tell from one "extra" packet. These could be cases of the "false command" happening: it happens at roughly the same frequency, and the results are consistent with that. This never happens during sim, only in emulation or on the real chip.
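The command hand-off described above (the ISR maps an illegal 00h to FFh, and the main routine treats any nonzero Master_Command as pending) can be illustrated with a small C sketch. The function names are hypothetical, and the take-and-clear step in `poll_command` is an assumption, since the post does not say how or when the slot is cleared. On real hardware that read-and-clear would also need protecting from the ISR, which a host-side sketch cannot show.

```c
#include <stdint.h>

/* Hypothetical stand-in for the Master_Command slot; 0 means nothing pending. */
static volatile uint8_t master_command;

/* ISR side: store a received command. 00h is illegal, so it is converted to
 * FFh, which stays visible to a debugger and can be discarded later. */
static void isr_store_command(uint8_t raw)
{
    master_command = (raw == 0x00u) ? 0xFFu : raw;
}

/* Main-loop side: the "<> 0" test from the post. Returns 0 when nothing is
 * pending; otherwise returns the command and clears the slot (assumption). */
static uint8_t poll_command(void)
{
    uint8_t cmd = master_command;
    if (cmd != 0u) {
        master_command = 0u;
    }
    return cmd;
}
```

With this shape, any stray write of a nonzero value to the slot is indistinguishable from a real command, which matches the symptom being chased.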
The tasks each preserve the global int enable/disable status, and they locally disable ints while they are checking flags. When they have decided not to run, they restore the int status and exit. If they decide to run, then they may enable or disable ints as appropriate, but the int status is always restored on exit. There are many points where ints are no problem, and only a few where I disable the ints for short periods. During flag manipulation or checking, ints are disabled, to prevent problems if an ISR changes a flag bit in the middle of a decision. This is also apparently unrelated, but...

Examining and simming/emulating all the data manipulation routines says they are fine.

To sum it up, I guess I'm looking for what could corrupt a value stored in a low register or in high SRAM, apparently identically, when that value is relocated. Any ideas on how to isolate this? It's a complex system, so turning things on and off is pretty much impractical. There are many tasks, each waiting on flags from other tasks that they have completed before they run, and they in turn set/clear flags appropriately to signal other tasks. That mechanism works, and is apparently unrelated to the problem. If that portion were having problems, the results would be BAD.

So, I think I've talked it out pretty thoroughly, and unfortunately, the light bulb is still dark...
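As an illustration of the task convention described above (save the global int status, disable ints while checking flags, always restore on exit), here is a minimal C sketch that runs on a host. The status register is simulated with an ordinary variable (`sim_sreg`), and the task and flag names are hypothetical. The only hardware fact relied on is that the global interrupt enable is bit 7 of SREG on AVR.

```c
#include <stdbool.h>
#include <stdint.h>

#define SREG_I 0x80u /* global interrupt enable: bit 7 of SREG on AVR */

/* Host-side stand-ins; on the chip these would be SREG and a flag byte. */
static uint8_t sim_sreg = SREG_I;  /* simulated status register, ints enabled */
static volatile uint8_t task_flags;

#define FLAG_TASK_A_READY 0x01u    /* hypothetical flag set by another task */

/* One scheduler task following the convention from the post. Returns true
 * if the task decided to run. */
static bool task_a(void)
{
    uint8_t saved = sim_sreg;                  /* remember caller's int status */
    sim_sreg = (uint8_t)(sim_sreg & ~SREG_I);  /* equivalent of CLI */

    /* Flag check and update happen with ints off, so an ISR cannot change
     * a flag bit in the middle of the decision. */
    bool run = (task_flags & FLAG_TASK_A_READY) != 0u;
    if (run) {
        task_flags = (uint8_t)(task_flags & ~FLAG_TASK_A_READY);
    }

    sim_sreg = saved;                          /* restore, never force-enable */

    if (!run) {
        return false;
    }
    /* The task's real work would go here. */
    return true;
}
```

Restoring the saved value instead of unconditionally re-enabling ints is what lets a task be called safely from a context that already has them disabled.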
Sneaky Sram Subterfuge (Somewhat long and complicated)
2004-02-09 by Dave VanHorn