32 bit Division
2004-08-31 by bobbruce000
Yahoo Groups archive
Index last updated: 2026-04-28 23:31 UTC
Thread
2004-08-31 by bobbruce000
Anyone have fast working code for unsigned 32 bit division? Currently, I am using the __udivsi3 routine that comes with gcc, but it is slow and appears to be written to work on early ARMs without hardware multipliers.
2004-09-07 by Frank Sergeant
"bobbruce000" <bobbruce000@...> writes:
> Anyone have fast working code for unsigned 32 bit division?
> Currently, I am using the __udivsi3 routine that comes with
> gcc, but it is slow and appears to be written to work on
> early ARMs without hardware multipliers.
Below my sig is the division routine, in ARM assembler, that I am using
in my Forth. I don't know that it is any faster. It doesn't use the
multiplier. How would you use the multiplier such that it would be
faster? I can see how to do it with a known divisor (by multiplying by
the reciprocal of the divisor) but not with a routine that accepts an
arbitrary divisor. Do you have some sort of Newton's Method that would
be faster? I don't actually do enough division at the moment to need it
any faster, but it is an interesting topic.
Also, see the question in the code in case anyone knows a way to reduce
the two instructions to just one, repeated here:
@mov r1, r1, rol #1 @ this is what we'd like to do
mov r1, r1, lsl #1 @ but we accomplish it with these
adc r1, r1, #0 @ two instructions. Is there a
@ way to do it in just one?
In the listing, 'DSTK' is the register R10 and 'nxt' is more or less the
Forth equivalent of a subroutine return in that it transfers control to
the next Forth word.
I suspect there are some "early out" improvements that could be made for
smaller divisors, etc. Also, I think ARM may have had some sample code
including a division algorithm.
--
Frank
http://pygmy.utoh.org/riscy
@@@@ U/MOD ( dividend divisor - uR uQ)
@ Divide 32bits by 32bits with 32bit remainder and 32bit quotient.
@ See Patterson & Hennessy, Computer Organization & Design, around
@ pp. 219-221. This algorithm is a cross between the book's
@ second and third versions.
UslashMOD:
@ divisor is in r0 (TOS)
@ put dividend into r2 and extend it to 64 bits with r1=0
ldr r2, [DSTK]
mov r1, #0
@ Divide r1:r2 by r0
@ For each of the 32 bits,
@ shift r1:r2 left
@ subtract r0 from r1
@ if result is negative
@ undo subtraction and new quotient bit (lsbit of r2) gets 0
@ otherwise
@ new quotient bit (lsbit of r2) gets 1
@ Remainder is in r1 and quotient is in r2
@ Use r3 to count the iterations
mov r3, #32
@ First, shift dividend (r1:r2 left by one bit)
1: movs r2, r2, lsl #1 @ msbit is now in Carry
@mov r1, r1, rol #1 @ this is what we'd like to do
mov r1, r1, lsl #1 @ but we accomplish it with these
adc r1, r1, #0 @ two instructions. Is there a
@ way to do it in just one?
@ Subtract divisor from dividend. If result is not negative,
@ set least significant bit of quotient (in r2). Otherwise,
@ undo the subtraction by adding divisor back to r1.
subs r1, r1, TOS
@ new quotient bit should be 1 only if subtraction result
@ was non-negative, i.e. add complement of sign to r2
addpl r2, r2, #1
@ if subtraction result was negative, add back the divisor
@ to r1, i.e. undo the subtraction
addmi r1, r1, TOS
subs r3, r3, #1
bne 1b
@ At this point, r2 contains the quotient and r1 the remainder
str r1, [DSTK]
mov TOS, r2
nxt2004-09-07 by Peter
--- In lpc2000@yahoogroups.com, Frank Sergeant <frank@p...> wrote:
<snip>
> Also, see the question in the code in case anyone knows a way to
reduce
> the two instructions to just one, repeated here:
>
> @mov r1, r1, rol #1 @ this is what we'd like to do
> mov r1, r1, lsl #1 @ but we accomplish it with
these
> adc r1, r1, #0 @ two instructions. Is there a
> @ way to do it in just one?
>
Unfortunately, the immediate and the shifted register have to come
out on the same register read port in ARM7.
An alternative is to hold zero in another register (for example R12)
and do this:
adc r1, r12, r1, lsl #1 @ R12 contains zero
The added speed *may* be worth the cost of tying up a register. R12
is allowed to be corrupted during a function call so it doesn't have
to be stacked.
Peter.2004-09-07 by lpc2100
Info on the following link may also be useful http://www.poppyfields.net/acorn/tech/division.shtml Tom
2004-09-08 by Frank Sergeant
> An alternative is to hold zero in another register (for example R12) > and do this: > adc r1, r12, r1, lsl #1 @ R12 contains zero > http://www.poppyfields.net/acorn/tech/division.shtml Thank you, Peter and Tom, for the suggestion and link. I believe I'll start with a scratch register to hold zero and pursue the link when/if the speed becomes more important. It looks like a great resource. -- Frank