A tale of timing and short pulses
2004-02-06 by Robert Adsett
I just spent a day tracing down a problem. I thought I'd share the, umm
adventure, in the hopes that the information might be useful (or
entertaining in a "thank god it happened to him not me" way).
I've been updating my timing routines and testing and documenting
them. One of the final steps was to test the one-wire communication
library I've been working on. Now this library is a port of the Dallas
library adapted for the LPC and they've just updated it so thought I'd
bring over the changes at the same time. That was probably a mistake.
Transferring the changes went without too many problems. (It did reveal an
apparent bug in PC-Lint but I haven't managed to make a small enough test
case that shows the problem). The first test of the functionality went
without problem.
Feeling confident I moved on to the second test. Compile, load run. Looks
OK. Run first step, that's odd where's the output. No output, I figured
something had changed in the way the routines worked or the display
routines and I hadn't noticed. Some time later and I hadn't found anything.
Well, next step is to add debugging prints to figure out where it stops
behaving as expected. Add a few and narrow it down to an area where the
devices are being enumerated.
Add debug prints at coarse level to enumeration. Uh Oh! It suddenly starts
working. Now it's looking like some sort of memory problem or timing issue.
Feeling confident in the timing routines (I'd just finished testing them
after all) I starting down the road looking for memory issues, such as
overflow, bad pointers (especially in newly modified code) and even code
generation. Hours later no progress except rather more puzzlement.
Oscilloscope time. The puzzle deepens. With the debug prints in the
enumeration process is clearly proceeding with each device responding in
turn. Take out the prints and the process collapses into something that is
not even consistent between tests. Eeekk!
Replace debug prints with calls to WaitUs (a delay routine). With large
waits between enumerations (5mS) everything appears to work. I didn't see
why that should be necessary, time to explore further. Shorten the delay,
the whole thing collapses again. Hmm, that's pointing back to timing again.
Look more carefully at the collapse. Things make less sense than
before. A wait after the enumeration appears to affect the
enumeration. IE enumerate(), wait for 5ms works, but enumerate wait for
1ms fails. The program is managing to effects events after they
occur? Obviously I'm missing something. Time to stop trying to solve the
problem and just start gathering information, something I should have
started earlier.
Add pin toggles around the individual enumerations so that the process is
easy to correlate on the oscilloscope. Looks right on the
oscilloscope. The enumeration bit pattern pauses at the pause between
enumerations.
Remove the delay between enumerations. It still works. What the...! The
pin toggle appears not to happen between enumerations. After a few more
tests and fooling around I find it. The pin toggle is just narrow enough
to hide behind an oscilloscope graticule.
Remove the toggle between the enumerations. It collapses again. Good, at
least it hasn't just disappeared on me.
Now, let's zoom in on the enumeration and see if I can't find out exactly
where it's breaking down. Gather up code to trace along the path as I
manually decode. (Note that each bit signal is 80-90uS long and there are
70 odd bits so seeing each bit does involve going to a higher level of
detail, and as it turns out the devil is in the detail). Repeat several
measurements at higher resolution, the first 1/2 dozen or so bits are not
consistently repeatable, they seen to fall into two patterns. Ahh, now I'm
are getting somewhere.
Freeze a sample and look in greater detail (thank heavens for DSOs). Reset
OK, bit 1 OK, bit 2 OK, bit 3 OK, bit 4 OK. Wait a minute what's that
glitch doing there and why is bit 4 so much longer than the others?
Take a break. An important part of the process, probably too long delayed.
Come back to problem. Take several measurements. Save the first one as a
reference and take others until I have one that is different from the first
for comparison. Hmm the two measurements are very close. Where I have a
glitch on the first one, I have a full rise on my second
measurement. Check the timing and indeed the glitch occurs where the
output should normally rise and the extended low period covers the low
period for two bits. It appears something is killing the output of the
timers match register as it occurs.
Check, read code, check documentation. Some time later I notice the
following sequence and realize its possible significance
- Set output low
- Set timer match to set output high in 75 uS
- Wait 15 uS
- Sample input
- Wait 60 uS
- Disable timer match (1)
- Wait additional time
The disable at (1) is there to prevent the match from occurring on wrap
around after it's no longer needed. Unfortunately if the waits are
accurate (and I just spent a fair amount of time ensuring they were) that
disable happens at approximately the same time that the match actually
occurs and cancels it just as it starts to affect the output. It appears
there is a race condition in the timer module. The fix is simple
enough. Move the disable to later in the process.
I don't think I could hit this consistently if I tried (I expect the timing
would change if you looked at it crosswise), so why am I (relatively)
confident that this is the source of the problem? A couple of
reasons. The sensitivity of the effect to code changes. Almost any change
in code position or timing eliminates the problem. Changing the timing
using the fix suggested works as does adding time to the waits before the
disable. Also the glitch coming out of the CPU is only 30nS wide and I've
not run across anything in the micro that can produce a pulse that narrow
under normal circumstances (hmm if that could be made repeatable it might
be useful). Finally reducing the wait time in the sequence above
eliminates the rise entirely.
I don't suppose anyone can confirm that the timer has this apparent
glitch? I know whenever I see an issue like that mentioned I wonder "how
did anyone ever run across that?". Well now I know :)
Robert
" 'Freedom' has no meaning of itself. There are always restrictions,
be they legal, genetic, or physical. If you don't believe me, try to
chew a radio signal. "
Kelvin Throop, III