Yahoo Groups archive

Lpc2000

Re: [lpc2100] Simple test program - is now instruction pipeline/VPB question

2003-11-22 by microbit

Hi Leon,
I'm "on the air" now too with the LPC2106!
Here is a preliminary current consumption figure:
I measure 6.5 mA @ 10 MHz while setting and clearing P0.0 in a loop
(PLL bypassed by default, 1 cclk per fetch), executing out of Flash.
Pretty impressive!
Trying to execute this as fast as possible has lifted the veil a bit on some of the instructions,
but there is still a mystery, so here is a question for people who are much more intimate with ARM7/LPC
(I don't feel like asking an FAE who won't know the answer anyway, and I can't find it in the
ARM7TDMI reference manual):
It seems that the VPB bus either causes NOPs to be inserted in the pipeline, or wait states are
automatically generated when I "write too fast" to the VPB bus.
Furthermore, my test C code isn't even generating read/modify/write instructions!

The 2nd question is: what if I write to the VPB at a slower rate?
Do I still need the fastest pclk for my I/O pins to update as fast as possible?
(I can't trace :-)
As an example, C code generating this sequence :
.......
STR R4,[R7,#0] ; P0.0 set to "1"
MOV R4,#1
STR R4,[R0,#0] ; P0.0 set to "0"
......
Takes :
1.6 uS (16 cclks) with pclk = 4 cclks
1.0 uS (10 cclks) with pclk = 2 cclks
0.8 uS (8 cclks) with pclk = 1 cclks
(that's what I measure here on HW)
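
As a quick sanity check on the figures above, here is a small sketch that converts the measured times back into core clocks and shows how many extra clocks each VPB divider setting costs. It only restates the numbers from this post; the names `measured_us` and `to_cclks` are made up for illustration.

```python
# Measured toggle times at cclk = 10 MHz, keyed by the VPB divider
# (pclk = cclk / divider). Values are copied from the post above.
CCLK_HZ = 10_000_000

measured_us = {4: 1.6, 2: 1.0, 1: 0.8}

def to_cclks(time_us, cclk_hz=CCLK_HZ):
    """Convert a duration in microseconds to a whole number of core clocks."""
    return round(time_us * 1e-6 * cclk_hz)

cclks = {div: to_cclks(t) for div, t in measured_us.items()}

# Extra core clocks relative to the pclk = cclk case.
extra = {div: n - cclks[1] for div, n in cclks.items()}

for div in sorted(cclks):
    print(f"pclk = cclk/{div}: {cclks[div]} cclks (+{extra[div]} vs. divider 1)")
```

The cost is not linear in the divider (2 extra clocks at divider 2, 8 extra at divider 4), which is consistent with the stores waiting on the slower peripheral clock rather than a fixed number of NOPs being inserted.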
Is there anyone that can shed some light on this ?
toggling P0.0 with pclk = cclk/4
-- Kris
----- Original Message -----
From: "Leon Heller" <leon_heller@...>
Sent: Saturday, November 22, 2003 9:49 PM
Subject: [lpc2100] Simple test program

Re: [lpc2100] Simple test program - is now instruction pipeline/VPB question

2003-11-22 by Robert Adsett

At 05:07 AM 11/23/03 +1100, you wrote:
>The 2nd question is, what if I write at a slower rate to VPB ?
>Do I still need the fastest pclk for my I/O pins to update as fast as 
>possible ?????
>(I can't trace :-)
>
>As an example, C code generating this sequence :
>.......
>STR     R4,[R7,#0]            P0.0 set to "1"
>MOV    R4,#1
>STR     R4,[R0,#0]            P0.0 set to "0"
>......
>
>Takes :
>1.6 uS (16 cclks) with pclk = 4 cclks
>1.0 uS (10 cclks) with pclk = 2 cclks
>0.8 uS (8 cclks)  with pclk = 1 cclks
>
>(that's what I measure here on HW)
>
>Is there anyone that can shed some light on this ?
>
>
>
>toggling P0.0 with pclk = cclk/4
>
>
>-- Kris
>http://www.microbit.com.au
>

I was going to play with bus optimization next anyway so I thought I'd 
measure the results and pass them along.

All of these with a 10MHz clock PLL'd to 60MHz.

MAM Off
ASM optimized 1.06uS period ~740nS on ~330nS off
C  1.8uS period ~800ns on ~1uS off

MAM on, Access to flash at recommended 3 cycles, VPB divider at default.
ASM optimized near square wave with 600nS period
C near square wave with 736nS period

MAM on, Access to flash at recommended 3 cycles, VPB divider to 1
ASM optimized 264nS period ~168nS off ~118nS on
C near square wave with 416nS period
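
To make the periods above easier to compare, here is a rough sketch that expresses each one in core clocks and as a toggle frequency. It assumes the core really was running at 60 MHz during these measurements; the dictionary keys and function names are only illustrative.

```python
CCLK_HZ = 60_000_000  # 10 MHz clock multiplied to 60 MHz by the PLL (assumed exact)

# (configuration, loop) -> measured period in nanoseconds, from the post above
periods_ns = {
    ("MAM off", "asm"): 1060,
    ("MAM off", "C"): 1800,
    ("MAM on, VPB default", "asm"): 600,
    ("MAM on, VPB default", "C"): 736,
    ("MAM on, VPB div 1", "asm"): 264,
    ("MAM on, VPB div 1", "C"): 416,
}

def cclks_per_period(period_ns, cclk_hz=CCLK_HZ):
    """Number of core clocks (possibly fractional) in one output period."""
    return period_ns * 1e-9 * cclk_hz

def toggle_mhz(period_ns):
    """Output frequency in MHz for a period given in nanoseconds."""
    return 1e3 / period_ns

for (config, loop), period in periods_ns.items():
    print(f"{config:22s} {loop:3s}: {cclks_per_period(period):6.1f} cclks, "
          f"{toggle_mhz(period):.2f} MHz")
```

Under that assumption the fastest case (264 ns) works out to roughly 16 core clocks per loop, or about 3.8 MHz at the pin.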

The (hand) optimized assembly loop used is

         mov     r3, #256
         ldr     r2, .L67+32
         ldr     r4, .L67+36
.L64:
         str     r3, [r2, #0]
         str     r3, [r4, #0]
         b       .L64

If the output is instruction rate limited then I would expect an output 
with an approx 2/3 duty cycle.  That is only approached for the first 
case.  For all other cases there is clearly some time taken up with I/O.
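
A small sketch of the duty-cycle comparison, using the approximate on/off times quoted above. The 2/3 reference is my reading of the argument: with three instructions of roughly equal duration, the pin would sit at one level for two of the three slots. The helper name `duty` is illustrative.

```python
def duty(on_ns, off_ns):
    """Fraction of the period that the pin spends 'on'."""
    return on_ns / (on_ns + off_ns)

EXPECTED = 2 / 3  # instruction-rate-limited estimate (assumption, see above)

mam_off_asm = duty(740, 330)  # MAM off, hand-optimized loop
vpb1_asm = duty(118, 168)     # MAM on, VPB divider 1, hand-optimized loop

print(f"MAM off:       {mam_off_asm:.2f} (expected ~{EXPECTED:.2f})")
print(f"VPB divider 1: {vpb1_asm:.2f} (expected ~{EXPECTED:.2f})")
```

Only the MAM-off case lands near 2/3; with the VPB divider at 1 the ratio is closer to 0.41, so something other than instruction rate is shaping the waveform there.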

Also clearly getting maximum throughput will depend on setting up the bus 
'correctly'.

Setting the VPB divider to 1 in this configuration also seems to have an 
effect on the UART.   I haven't figured that out yet but what should be 
9600 baud drops to about 9000 baud.

Robert Adsett

Re: [lpc2100] Simple test program - is now instruction pipeline/VPB question

2003-12-03 by Robert Adsett

At 05:47 PM 11/22/03 -0500, you wrote:

>Setting the VPB divider to 1 in this configuration also seems to have an
>effect on the UART.   I haven't figured that out yet but what should be
>9600 baud drops to about 9000 baud.
>
>Robert Adsett

Got it.  While cleaning up my support code and generalizing it so I could 
place it with some newlib support, I realized that I had misplaced the PLL 
divider field by 1 bit, resulting in a value 1/2 of what I expected.  That 
means that the internal PLL was running at ~120MHz, which is below the 
156MHz specified minimum.  Apparently when that happens some peripherals 
notice the effect and others don't.
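
For anyone wanting to check their own PLL settings, here is a minimal sketch of the range check. It assumes the usual LPC2000 relationship Fcco = cclk * 2 * P, and a 320MHz upper limit as I recall it from the user manual; only the 156MHz minimum comes from this post. The function names are illustrative.

```python
FCCO_MIN_HZ = 156_000_000  # minimum quoted in the post above
FCCO_MAX_HZ = 320_000_000  # upper limit: assumption, check the user manual

def fcco_hz(cclk_hz, p):
    """Internal oscillator (CCO) frequency, assuming Fcco = cclk * 2 * P."""
    return cclk_hz * 2 * p

def cco_in_range(cclk_hz, p):
    """True when the CCO frequency falls inside the specified range."""
    return FCCO_MIN_HZ <= fcco_hz(cclk_hz, p) <= FCCO_MAX_HZ

CCLK_HZ = 60_000_000

for p in (1, 2):
    mhz = fcco_hz(CCLK_HZ, p) / 1e6
    status = "OK" if cco_in_range(CCLK_HZ, p) else "out of range"
    print(f"P = {p}: Fcco = {mhz:.0f} MHz ({status})")
```

With cclk at 60MHz, P = 2 gives 240MHz, while a divider of half that value gives the ~120MHz described above, which fails the check.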

Robert

" 'Freedom' has no meaning of itself.  There are always restrictions,
be they legal, genetic, or physical.  If you don't believe me, try to
chew a radio signal. "

                         Kelvin Throop, III

I/O Speed - An Explanation

2004-11-10 by philips_apps

Here is an explanation of the I/O toggle speed that is observed in 
these devices.

Richard

The I/O toggle rate tops out at ~3.7 MHz for several reasons, none of 
them specific to our parts. The limit comes from interactions between 
the ARM pipeline, the VPB bus, the ARM AHB wrapper (the interface 
between the ARM7TDMI-S core and the AHB bus), and the instruction 
timing itself. For the minimal 3-instruction loop below (a Store that 
writes the I/O pin, a second Store that toggles it, and a Branch back 
to the first Store), the timing is as follows (Fe for Fetch, De for 
Decode, En for execution clock n):

Pass1:

STR:  Fe-De-E1-E1-E2-E2-E2-E2-E2
STR:        Fe-De----------------------------E1-E1-E2-E2-E2-E2
B:                Fe-----------------------------De----------------------E1-E1-E2-E3

Pass2:

STR:  Fe-De

And so on...

An STR to VPB space takes 8 clocks because the last 2 phases (STR is 
a 4 phase instruction) are Non-Sequential (NS) accesses and the AHB 
wrapper adds one wait state for every NS access. This means the 3rd 
phase of the instruction takes 2 clocks, and the fourth phase takes 4 
because of the wait state and the VPB operations being 3 clocks.

The second STR can be fetched and decoded in the pipeline but will 
then stall because the Execute stage is busy (the first Store has 
not completed yet). The Branch instruction can also be fetched 
during the Decode slot of the second STR, but it will then stall 
because the Decode stage is occupied by the second STR.

After the first STR completes, the second STR will start its 
execution phase and finally will allow the Branch instruction (which 
also has one NS phase) to proceed. 

End result: the loop takes 16 clocks (266.7 ns at 60 MHz with the VPB 
divider set to 1) with a duty cycle of 6:10.


Code:

.loop:
str r2, [r7, #0]
str r2, [r6, #0]
b .loop
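
The clock accounting in the explanation above can be sketched as follows. The per-instruction figures come from the text (STR: 2 + 4 execute clocks; B: the four execute clocks shown in the diagram); the variable names are only illustrative.

```python
CCLK_HZ = 60_000_000

STR_EXEC = 2 + 4  # 3rd phase (2 clocks) + 4th phase (4 clocks)
B_EXEC = 4        # E1-E1-E2-E3 in the diagram

# One pass of the loop: STR, STR, B executed back to back.
loop_clocks = STR_EXEC + STR_EXEC + B_EXEC

period_ns = loop_clocks / CCLK_HZ * 1e9
freq_mhz = CCLK_HZ / loop_clocks / 1e6

# The pin changes at the end of each STR, so one level lasts for the
# second STR's execute time and the other for B plus the next first STR.
short_phase = STR_EXEC
long_phase = B_EXEC + STR_EXEC

print(f"{loop_clocks} clocks, {period_ns:.1f} ns, {freq_mhz:.2f} MHz, "
      f"duty {short_phase}:{long_phase}")
```

This reproduces the 16 clocks, 266.7 ns and 6:10 figures, and gives 3.75 MHz for the quoted ~3.7 MHz ceiling.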

Re: [lpc2000] I/O Speed - An Explanation

2004-11-11 by capiman@t-online.de

Hello,

First, many thanks for your explanation! We had a similar thread on this 
mailing list on 1 Feb 2004 (subject: Optimization of capture routine...). 
A trick was suggested there: instead of doing the access once and then 
jumping, do it multiple times and only then check whether the loop is 
finished (depending on what you want to do). That way you can go much higher:

> I am now at around 5,8956 MBytes / second, which is close to 5,898 MBytes /
> sec. ( = Fosc * 4 / 10).
> So the two operations ( ldr ip, [r0, #0] and strb ip, [r2], #1) seems to
> take in sum 10 cycles.
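
Working the quoted figures backwards gives a feel for how close the unrolled loop gets to the 10-cycle estimate. This is only a sketch: it reads "Fosc * 4" as a PLL multiplier of 4 (an assumption) and derives Fosc from the quoted 5.898 figure rather than from a known crystal value.

```python
measured_mbytes_s = 5.8956   # throughput quoted above
predicted_mbytes_s = 5.898   # Fosc * 4 / 10, as quoted above

PLL_MULT = 4          # assumed meaning of the "* 4"
CYCLES_PER_BYTE = 10  # ldr + strb pair, per the quote

# Oscillator and core clock implied by the predicted figure (in MHz).
fosc_mhz = predicted_mbytes_s * CYCLES_PER_BYTE / PLL_MULT
cclk_mhz = fosc_mhz * PLL_MULT

# How close the measurement comes to the prediction.
efficiency = measured_mbytes_s / predicted_mbytes_s
cycles_per_byte_measured = cclk_mhz / measured_mbytes_s

print(f"Fosc ~ {fosc_mhz:.3f} MHz, cclk ~ {cclk_mhz:.2f} MHz")
print(f"measured/predicted = {efficiency:.4f}, "
      f"{cycles_per_byte_measured:.3f} cycles per byte")
```

So the measured rate sits within about 0.05% of the 10-cycles-per-byte prediction, with the remainder presumably being the loop check that runs once per batch of accesses.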

Regards,

          Martin



Re: I/O Speed - An Explanation

2004-11-11 by lp2000c

Is there any published documentation which contains this information 
(e.g.: that VPB operations are 3 clocks)?



