[PATCH] bsp/tms570: implemented and tested initialization of Cortex-R performance counters.

Discussion:

Pavel Pisa

2014-08-22 15:20:46 UTC

The code is BSP specific for now, but it should work on all Cortex-A
and Cortex-R based CPUs and can be moved to a generic ARM location in
the future.

The sequence of register writes suggested on StackOverflow is used
to start the counters.

http://stackoverflow.com/questions/3247373/how-to-measure-program-execution-time-in-arm-cortex-a8-processor
---
c/src/lib/libbsp/arm/tms570/misc/cpucounterread.c | 88 +++++++++++++++++++++--
1 file changed, 84 insertions(+), 4 deletions(-)

diff --git a/c/src/lib/libbsp/arm/tms570/misc/cpucounterread.c b/c/src/lib/libbsp/arm/tms570/misc/cpucounterread.c
index f25380c..3ce2f63 100644
--- a/c/src/lib/libbsp/arm/tms570/misc/cpucounterread.c
+++ b/c/src/lib/libbsp/arm/tms570/misc/cpucounterread.c
@@ -3,7 +3,14 @@
*
* @ingroup tms570_clocks
*
- * @brief System clocks.
+ * @brief Cortex-R performance counters
+ *
+ * The counter setup functions are those suggested
+ * on StackOverflow.
+ *
+ * The code can probably be used on Cortex-A without modification as well.
+ *
+ * http://stackoverflow.com/questions/3247373/how-to-measure-program-execution-time-in-arm-cortex-a8-processor
*/

/*
@@ -14,9 +21,6 @@
* 166 36 Praha 6
* Czech Republic
*
- * Based on LPC24xx and LPC1768 BSP
- * by embedded brains GmbH and others
- *
* The license and distribution terms for this file may be
* found in the file LICENSE in this distribution or at
* http://www.rtems.org/license/LICENSE.
@@ -27,6 +31,79 @@
#include <rtems.h>
#include <bsp.h>

+static int cpu_counter_initialized;
+
+
+/**
+ * @brief set mode of Cortex-R performance counters
+ *
+ * Based on example found on http://stackoverflow.com
+ *
+ * @param[in] do_reset if set, values of the counters are reset
+ * @param[in] enable_divider if set, CCNT counts clocks divided by 64
+ * @retval Void
+ */
+static inline void _CPU_Counter_init_perfcounters(
+ int32_t do_reset,
+ int32_t enable_divider
+)
+{
+ /* in general enable all counters (including cycle counter) */
+ int32_t value = 1;
+
+ /* perform reset */
+ if (do_reset)
+ {
+ value |= 2; /* reset all counters to zero */
+ value |= 4; /* reset cycle counter to zero */
+ }
+
+ if (enable_divider)
+ value |= 8; /* enable "by 64" divider for CCNT */
+
+ value |= 16; /* enable export of events (X bit) */
+
+ /* program the performance-counter control-register */
+ asm volatile ("mcr p15, 0, %0, c9, c12, 0\t\n" :: "r"(value));
+
+ /* enable all counters */
+ asm volatile ("mcr p15, 0, %0, c9, c12, 1\t\n" :: "r"(0x8000000f));
+
+ /* clear overflows */
+ asm volatile ("mcr p15, 0, %0, c9, c12, 3\t\n" :: "r"(0x8000000f));
+}
+
+/**
+ * @brief initialize Cortex-R performance counters subsystem
+ *
+ * Based on example found on http://stackoverflow.com
+ *
+ * @retval Void
+ *
+ */
+static void _CPU_Counter_initialize(void)
+{
+ rtems_interrupt_level level;
+
+ rtems_interrupt_disable(level);
+
+ if ( cpu_counter_initialized ) {
+ rtems_interrupt_enable(level);
+ return;
+ }
+
+ /* enable user-mode access to the performance counter */
+ asm volatile ("mcr p15, 0, %0, c9, c14, 0\n\t" :: "r"(1));
+
+ /* disable counter overflow interrupts (just in case) */
+ asm volatile ("mcr p15, 0, %0, c9, c14, 2\n\t" :: "r"(0x8000000f));
+
+ _CPU_Counter_init_perfcounters(false, false);
+
+ cpu_counter_initialized = 1;
+
+ rtems_interrupt_enable(level);
+}

/**
* @brief returns the actual value of Cortex-R cycle counter register
@@ -39,6 +116,9 @@
CPU_Counter_ticks _CPU_Counter_read(void)
{
uint32_t ticks;
+ if ( !cpu_counter_initialized ) {
+ _CPU_Counter_initialize();
+ }
asm volatile ("mrc p15, 0, %0, c9, c13, 0\n": "=r" (ticks));
return ticks;
}

--
1.9.1

Joel Sherrill

2014-08-22 15:25:24 UTC

Pushed.

Followups can just be subsequent patches.

--
Joel Sherrill, Ph.D. Director of Research & Development
joel.sherrill-***@public.gmane.org On-Line Applications Research
Ask me about RTEMS: a free RTOS Huntsville AL 35805
Support Available (256) 722-9985

Pavel Pisa

2014-08-22 16:44:11 UTC

Hello Joel,

Post by Joel Sherrill
Pushed.
Followups can just be subsequent patches.

thanks, you are faster than light ...

As for the RTEMS timekeeping code, I can imagine how it could
look better. I do not like Clock_driver_nanoseconds_since_last_tick.
I am not even sure if it is really used by TOD (e.g. the ticker test
seems to print rounded values on our board).

In fact, I would like to see RTEMS work completely tickless
on hardware with a modern free-running timebase and easily updated
compare-event hardware. That would allow all POSIX time-related
functions to be implemented with resolution limited only by the hardware.

The scheduler is a question. When more than one task of the same
priority is ready to run, a tick is easiest, but even in that case
the slice time can be computed and only a timer event for its expiry
is set.

But all that is a huge amount of work.

I would start with the easier side now. It is necessary to have a reliable
timebase. Consider a 64-bit value running at some clock source speed.
It is really hard to have that reliable on PC hardware; the common
base 8254 can be used for that but access is horribly slow. All other
mechanisms (HPET, TSC) are problematic: they need to be probed and
checked that they are correct and synchronous between cores, do not
change with sleep modes etc. A really difficult task which is solved by
thousands of lines of code in the Linux kernel.

But ARM and PowerPC based systems usually provide a reasonable
timer source register which is synchronized over all cores.
Unfortunately, ARM ones usually provide only a 32-bit wide register.
I have solved the problem of how to extend that 32-bit counter to 64
bits for a friend of mine who worked at BlackBerry. Their phone platform
uses Cortex-A and QNX. The design constraints were given
by the use case: userspace event timestamping in the QML profiler.
This adds the constraints that the code can be called on several cores
concurrently, using a mutex would degrade performance horribly, privileged
instructions cannot be used, and the value available from the core was
only 32 bits.

I designed the attached code fragments for him and he has written
some Qt-derived code which was used in Q10 phone debugging builds.

The main idea is to extend the counter to more than 60 bits without
locking and to use GCC builtin atomic support to ensure that a counter
overflow results in only a single increment of the higher value part.

The only requirement for correct function is that clockCyclesExt()
is called at least once per half of the counter overflow period
and its execution is not interrupted for longer than an equivalent time.
The code even minimizes cache write contention cases.
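
To make the idea above concrete, here is a minimal sketch of one way to extend a 32-bit free-running counter to 64 bits without locks. It is not the attached code referred to in this message, which is not reproduced in the archive. The names read_hw_counter(), counter_ext_init() and counter_read_ext() are hypothetical, and the hardware counter is replaced by a plain variable so that the fragment can be run on a host. In this particular sketch the shared value is only rewritten after the counter has advanced by a quarter of its period, so a call is needed at least once per quarter period, which is stricter than the half-period bound stated above.

```c
#include <stdint.h>

/* Stand-in for the 32-bit hardware counter (assumption for host testing).
 * On a real target read_hw_counter() would read the cycle counter. */
static volatile uint32_t fake_hw_counter;

static uint32_t read_hw_counter(void)
{
  return fake_hw_counter;
}

/* Last extended value published by any caller, shared between cores. */
static uint64_t last_ext;

/* Seed the shared value once, before the first reader runs. */
void counter_ext_init(void)
{
  __atomic_store_n(&last_ext, (uint64_t) read_hw_counter(), __ATOMIC_RELEASE);
}

uint64_t counter_read_ext(void)
{
  uint64_t last = __atomic_load_n(&last_ext, __ATOMIC_ACQUIRE);
  uint32_t now = read_hw_counter();

  /* Signed distance from the low word of the published value to "now".
   * Valid while the two are less than half a period apart. The cast relies
   * on GCC's modular conversion to int32_t. */
  int32_t delta = (int32_t) (now - (uint32_t) last);
  uint64_t ext = last + (uint64_t) (int64_t) delta;

  /* Publish only after a quarter period of progress, which keeps writes
   * to the shared cache line rare. A failed compare-and-swap means another
   * caller already published a newer value, which is equally good. */
  if (delta >= (int32_t) 0x40000000) {
    __atomic_compare_exchange_n(&last_ext, &last, ext, 0,
                                __ATOMIC_RELEASE, __ATOMIC_RELAXED);
  }

  return ext;
}
```

The 64-bit __atomic builtins require a target with 64-bit atomic access (or libatomic); whether that holds on a given core is something to check, not something this sketch guarantees.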

What do you think about using this approach in RTEMS?

Then the next step is to base timing on values which are not based on
ticks. I have seen the discussion about the NTP time format
(integer seconds + 1/2^32 fractions). Another option is 64-bit nsec,
which is better regarding the 2038 overflow problem. The priority queue
for ordering fine-grained timers is a tough task. It would be worth
giving all operations an additional parameter for the required precision
of each interval/time event etc ...
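
As a rough illustration of the two representations mentioned above, the following sketch converts between a 64-bit nanosecond count and a 32.32 fixed-point value (integer seconds in the upper word, 1/2^32 s fractions in the lower word). The function names are made up for this example and are not RTEMS or NTP API. Note the trade-off: an unsigned 64-bit nanosecond count wraps after roughly 584 years, while 32 bits of seconds wrap after roughly 136 years.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* 64-bit nanoseconds -> 32.32 fixed point (seconds above 2^32 are lost). */
static uint64_t ns_to_fixed32(uint64_t ns)
{
  uint64_t sec = ns / NSEC_PER_SEC;
  uint64_t rem = ns % NSEC_PER_SEC;   /* < 2^30, so rem << 32 cannot overflow */

  return (sec << 32) | ((rem << 32) / NSEC_PER_SEC);
}

/* 32.32 fixed point -> 64-bit nanoseconds (fraction is rounded down). */
static uint64_t fixed32_to_ns(uint64_t t)
{
  uint64_t sec = t >> 32;
  uint64_t frac = t & 0xFFFFFFFFull;

  return sec * NSEC_PER_SEC + ((frac * NSEC_PER_SEC) >> 32);
}
```

Because both directions round down, a round trip can lose up to one nanosecond; half a second (0x80000000 fractions) converts exactly.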

But that is for a longer discussion and an incremental solution.

I cannot provide my full time for such enhancements anyway.

But it could be a nice project if funding is found. I have a friend
who has grants from ESA to develop theory for precise time source
fusion (atomic clocks etc) and works on real hardware for satellite
based clock synchronization too. We spoke about Linux kernel
NTP time synchronization and the PLL loop a long time ago and both came
to the same conclusion about how it should be done the right way. It
would be interesting to have this solution in RTEMS as well. But to do
it right would require some agency/company funded project. We even have
networking cards with full IEEE-1588 HW support there for Intel and some
articles about our findings regarding the problem of synchronizing time,
where the most problematic part is the latency between the Ethernet card
hardware and the CPU core. It is even more problematic than precise time
over a local Ethernet LAN ... So I think that there are enough competent
people to come up with something useful. But most of them cannot
afford to work on it only for their pleasure.

OK, that is some dump of my ideas.

I need to switch to other HW testing now to keep our company
and university project above sea level.

Best wishes,

Pavel

Joel Sherrill

2014-08-22 17:45:02 UTC

Post by Pavel Pisa
Hello Joel,

Post by Joel Sherrill
Pushed.
Followups can just be subsequent patches.

thanks, you are faster than light ...

Just trying to wrap up things on a Friday. :)

Post by Pavel Pisa
As for the RTEMS timekeeping code, I can imagine how it could
look better. I do not like Clock_driver_nanoseconds_since_last_tick.
I am not even sure if it is really used by TOD (e.g. the ticker test
seems to print rounded values on our board).

The Classic API get time method used returns TOD in a format with seconds and ticks since the last second. The print in that test only prints seconds. There is a nanoseconds sample which prints at higher granularity.

Post by Pavel Pisa
In fact, I would like to see RTEMS work completely tickless
on hardware with a modern free-running timebase and easily updated
compare-event hardware. That would allow all POSIX time-related
functions to be implemented with resolution limited only by the hardware.

Agreed.

Post by Pavel Pisa
The scheduler is a question. When more than one task of the same
priority is ready to run, a tick is easiest, but even in that case
the slice time can be computed and only a timer event for its expiry
is set.

We just need to take that into account as an input for calculating when the next tick occurs. I think time slice is the only factor other than the watchdog timers.
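
A minimal sketch of that calculation, under the assumption that both inputs are available as absolute times in the same unit: the next hardware timer event is the earlier of the nearest watchdog expiry and, when time slicing is active, the end of the current slice. The function and parameter names are hypothetical and do not correspond to existing RTEMS interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Marker for "nothing scheduled" (assumption of this sketch). */
#define NO_EVENT UINT64_MAX

/* Return the absolute time at which the timer hardware must fire next. */
static uint64_t next_timer_event(
  uint64_t next_watchdog_expiry,  /* NO_EVENT if no watchdog is pending */
  bool     timeslicing_active,    /* several ready tasks share a priority */
  uint64_t slice_end              /* end of the running task's slice */
)
{
  uint64_t next = next_watchdog_expiry;

  if (timeslicing_active && slice_end < next) {
    next = slice_end;
  }

  return next;
}
```

When the result is NO_EVENT the compare hardware would not need to be armed at all, which is the tickless case.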

Post by Pavel Pisa
But all that is a huge amount of work.

Yep. But we can identify a development path where it is a sequence of smallish steps.

Post by Pavel Pisa
I would start with the easier side now. It is necessary to have a reliable
timebase. Consider a 64-bit value running at some clock source speed.
It is really hard to have that reliable on PC hardware; the common
base 8254 can be used for that but access is horribly slow. All other
mechanisms (HPET, TSC) are problematic: they need to be probed and
checked that they are correct and synchronous between cores, do not
change with sleep modes etc. A really difficult task which is solved by
thousands of lines of code in the Linux kernel.
But ARM and PowerPC based systems usually provide a reasonable
timer source register which is synchronized over all cores.
Unfortunately, ARM ones usually provide only a 32-bit wide register.
I have solved the problem of how to extend that 32-bit counter to 64
bits for a friend of mine who worked at BlackBerry. Their phone platform
uses Cortex-A and QNX. The design constraints were given
by the use case: userspace event timestamping in the QML profiler.
This adds the constraints that the code can be called on several cores
concurrently, using a mutex would degrade performance horribly, privileged
instructions cannot be used, and the value available from the core was
only 32 bits.
I designed the attached code fragments for him and he has written
some Qt-derived code which was used in Q10 phone debugging builds.
The main idea is to extend the counter to more than 60 bits without
locking and to use GCC builtin atomic support to ensure that a counter
overflow results in only a single increment of the higher value part.
The only requirement for correct function is that clockCyclesExt()
is called at least once per half of the counter overflow period
and its execution is not interrupted for longer than an equivalent time.
The code even minimizes cache write contention cases.
What do you think about using this approach in RTEMS?

Sounds reasonable. The counter overflow period should be relatively long so this should be considered when determining the maximum length of time allowed between ticks.

Post by Pavel Pisa
Then the next step is to base timing on values which are not based on
ticks. I have seen the discussion about the NTP time format
(integer seconds + 1/2^32 fractions). Another option is 64-bit nsec,
which is better regarding the 2038 overflow problem. The priority queue
for ordering fine-grained timers is a tough task. It would be worth
giving all operations an additional parameter for the required precision
of each interval/time event etc ...

I have discussed offline converting the delta chains in the watchdog to use timestamps. This would also let us do higher granularity absolute time events. Right now the TOD chain is second granularity.

But I don't have any good solutions to that. But it could be a discrete unit of work.
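
For readers unfamiliar with the distinction: a delta chain stores each entry's distance from its predecessor, whereas a timestamp-based chain stores an absolute expiry time per entry. The following is only a sketch of the latter under simple assumptions (a singly linked list, 64-bit nanosecond timestamps, no locking); the type and function names are invented for illustration and are not the RTEMS watchdog implementation.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct timer_node {
  uint64_t           expire_ns;  /* absolute expiry time */
  struct timer_node *next;
} timer_node;

/* Insert in ascending expiry order; equal times keep insertion order. */
static void timer_insert(timer_node **head, timer_node *n)
{
  while (*head != NULL && (*head)->expire_ns <= n->expire_ns) {
    head = &(*head)->next;
  }

  n->next = *head;
  *head = n;
}

/* Remove and return the first node that has expired at now_ns, or NULL. */
static timer_node *timer_pop_expired(timer_node **head, uint64_t now_ns)
{
  timer_node *n = *head;

  if (n == NULL || n->expire_ns > now_ns) {
    return NULL;
  }

  *head = n->next;
  n->next = NULL;
  return n;
}
```

With absolute timestamps the head of the list directly gives the next expiry, and no per-tick decrement of a delta is needed; the linear insertion cost is the part a real design would want to improve.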

Post by Pavel Pisa
But that is for a longer discussion and an incremental solution.
I cannot provide my full time for such enhancements anyway.

None of us can. We have to have a plan with the steps and nibble.

Post by Pavel Pisa
But it could be a nice project if funding is found. I have a friend
who has grants from ESA to develop theory for precise time source
fusion (atomic clocks etc) and works on real hardware for satellite
based clock synchronization too. We spoke about Linux kernel
NTP time synchronization and the PLL loop a long time ago and both came
to the same conclusion about how it should be done the right way. It
would be interesting to have this solution in RTEMS as well. But to do
it right would require some agency/company funded project. We even have
networking cards with full IEEE-1588 HW support there for Intel and some
articles about our findings regarding the problem of synchronizing time,
where the most problematic part is the latency between the Ethernet card
hardware and the CPU core. It is even more problematic than precise time
over a local Ethernet LAN ... So I think that there are enough competent
people to come up with something useful. But most of them cannot
afford to work on it only for their pleasure.

This is likely not an area where volunteer effort will push it all the way through.

Post by Pavel Pisa
OK, that is some dump of my ideas.
I need to switch to other HW testing now to keep our company
and university project above sea level.

+1 I was reviewing a document while letting all the bsps build :)

Post by Pavel Pisa
Best wishes,
Pavel

Gedare Bloom

2014-08-22 18:12:11 UTC

On Fri, Aug 22, 2014 at 1:45 PM, Joel Sherrill

Post by Joel Sherrill

Post by Pavel Pisa
Hello Joel,

Post by Joel Sherrill
Pushed.
Followups can just be subsequent patches.

thanks, you are faster than light ...

Just trying to wrap up things on a Friday. :)

Hi Pavel, you may also be interested in the recent mailing list thread
with subject:

[Bug 2180] New: _TOD_Get_with_nanoseconds() is broken on SMP

Sebastian suggested we adopt FreeBSD mechanisms for our clock drivers.
-Gedare

Daniel Gutson

2014-09-18 16:08:04 UTC

Hi folks,

could you please give us an update of the shape of this BSP?

There's a project that uses the TMS570, and I'd like to convince some
people to move to RTEMS. How far do you consider the BSP is from the
"industrial" status?
I don't see much progress details in its wiki, such as stability and
drivers status.

Thanks!

Daniel.

On Fri, Aug 22, 2014 at 12:25 PM, Joel Sherrill

Post by Joel Sherrill
Pushed.
Followups can just be subsequent patches.

--
Joel Sherrill, Ph.D. Director of Research & Development
Ask me about RTEMS: a free RTOS Huntsville AL 35805
Support Available (256) 722-9985

--
Daniel F. Gutson
Chief Engineering Officer, SPD

San Lorenzo 47, 3rd Floor, Office 5

Córdoba, Argentina

Phone: +54 351 4217888 / +54 351 4218211

Skype: dgutson

Pavel Pisa

2014-09-18 20:55:03 UTC

Hello Daniel,

Post by Daniel Gutson
Hi folks,
could you please give us an update of the shape of this BSP?
There's a project that uses the TMS570, and I'd like to convince some
people to move to RTEMS. How far do you consider the BSP is from the
"industrial" status?
I don't see much progress details in its wiki, such as stability and
drivers status.

the project is not ready for industrial grade use yet. I would be careful
even about using the current RTEMS Git until it is polished for a real
4.11 release.

As for the TMS570, I am not in close contact with Premek now.
About two weeks ago he worked hard on preparing proper header
files. But I expect that he needs some rest before the semester
starts, after hard work over the holiday.

I expect that basic TMS570 support will mature by the end of the year
during Premek's master's thesis work. The final release is expected in
Q1 2015.

As for the status:

I have run many tests from the testsuite. The success rate from internal
SRAM is very high (only some capture engine tests and tests with memory
demands above SRAM capacity fail). But there are still some issues
when running from SDRAM and even from Flash, which I have managed
to make usable with RTEMS in recent weeks. I can upload the output to
the Wiki or my page if there is interest.

We need a proper automated test suite setup, probably in Python
and for sure with OpenOCD. I have worked on OpenOCD and its TMS570
Flash support so we do not need any TI/Java/Eclipse tools
for regular RTEMS development now. I will look at the testing setup
with Premek when he returns.

The RTEMS Flash image still runs with the initial setup built by TI tools
and HALCoGen. We expect to include the initial setup code
in RTEMS as well during the next work phase.

But a bootloader-based startup sequence with RTEMS at some Flash base
offset would still be important for many use cases. I can imagine
that one option is to port U-Boot to the TMS570 to allow Ethernet and TFTP
RTEMS+application updates. But it is a considerable amount of work
which cannot be done within my time constraints. I have a colleague who
worked on U-Boot for AM3xxx and other Linux systems under his
contracts. But to gain his time, a contract/payment for the work is
required. So a proper solution with boot via a bootloader is not
on my free plan now. But if we need some loader in our contract-paid
project I would argue for making it public, though it would probably be
some HALCoGen-based hacks in such a case.

The RTEMS TMS570 port is still running soft float to keep things
simple until we resolve other problems, so as not to mix them with
possible problems caused by HF. Cortex-R4F HF has been prepared by
Sebastian in the tools and RTEMS core, and actually switching to HF is
a matter of one or two GCC switches in a BSP config. I plan to test that
soon. But it is not a priority in our plan now.

The priority is header files to cover most of the peripherals.

We now have a proper driver for both serial ports which builds and
works reliably in interrupt-driven and polled mode.
The timer is OK, still fixed tick based but that is the normal setup
for RTEMS. Support for subtick time read resolution is there
and we have working CPU support for micro benchmarking with
CPU clock cycle resolution. I am not aware of another
ARM RTEMS BSP with such support.
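
As a small illustration of cycle-resolution micro benchmarking, the arithmetic on two 32-bit counter samples can be sketched as follows. The helper names and the 100 MHz clock used in the usage note are assumptions for the example, not values taken from the BSP; on the target the two samples would come from reads of the CPU cycle counter taken before and after the measured code.

```c
#include <stdint.h>

/* Cycles between two samples of a free-running 32-bit counter.
 * Unsigned subtraction gives the right result across one wraparound,
 * provided fewer than 2^32 cycles elapsed between the samples. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
  return end - start;
}

/* Convert a cycle count to nanoseconds for a given clock frequency in Hz.
 * The 64-bit intermediate cannot overflow for any 32-bit cycle count. */
static uint64_t cycles_to_ns(uint32_t cycles, uint32_t hz)
{
  return (uint64_t) cycles * 1000000000ull / hz;
}
```

For example, with an assumed 100 MHz clock, 250 cycles correspond to 2500 ns. If the "by 64" divider from the patch were enabled, each counter increment would stand for 64 clocks and the cycle count would need to be scaled accordingly.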

Ethernet support has been analyzed and we have identified matching
BSD driver code, but finishing the port can take some months
(also because of project and step prioritization).

As for my other colleagues' work, we will probably gain a paid project
to deliver industry-ready Matlab/Simulink target support
for TMS570 and RM48. But that will be based on TI tools (not RTEMS)
to fulfill tight time constraints.

http://rtime.felk.cvut.cz/rpp-tms570/

I dream that there will be an RTEMS based version of this support
one day. But that whole project is on the scale of many man-years
(three of my colleagues worked on it for a year, paid students
over two holiday seasons etc.) and it has to pay the bills of our
department. So there is minimal chance that an RTEMS port happens
and would be supported by the budget of our department head if there
is no contract.

But I hope that solid RTEMS TMS570 support is on its way
within our open/enthusiast activities. We would be
happy for all possible kinds of cooperation and testing.
If you can access a TMS570LS31x HDK kit, I will help you set up
the system to the state where we are. You need to build a patched
OpenOCD, Linux host preferred; I have never used it on Windows,
and the same goes for the TI tools. The toolchain is available as
x86 64-bit packages or regular mainline sources. The SDRAM setup code
is on GitHub and I provide a compiled binary too.

If your customer wants to have an industrial ready system
by the end of the year, then I would answer that RTEMS is not
a solution unless you or they plan to invest significant
time or money in the project. If they want to build a platform
for future applications and expect to start final application
development around the end of the year with product testing at the
start of Q2 2015, then RTEMS should become ready on the basis of our
interest/work investment. Anyway, for a real grade target
project, they (in house), you, OAR, Embedded Brains or somebody
else has to take care of the platform, its testing,
certification, professional grade support etc.

Best wishes,

Pavel