I recently got Circuitpython working on Fomu! But there's an issue with the speed of Fomu's SPI flash: It's currently incredibly slow, and that makes for a poor experience. Worse still, the program runs so slowly that the tinyusb stack can't keep up with the USB protocol, and the host gives up and resets the device.

An easy solution is to put frequently-used functions in RAM.  This is as simple as annotating them with __attribute__((section(".ramtext"))). However, the Circuitpython program is over 300 kilobytes, and Fomu only has 128 kilobytes of RAM. So we'll have to pick and choose which functions to load into RAM.

The Fomu EVT board has the ability to be debugged via SPI, which makes it possible to debug the board even before USB has loaded.  To do this, I need to load the spi-enabled firmware and start the debugger:

fomu@fomu-jig:~ $ fomu-flash -w ~/evt-spi-multiboot-v2.0.3.bin ; fomu-flash -r
Erasing @ 019000 / 01973a  Done
Programming @ 01973a / 01973a  Done
resetting fpga
fomu@fomu-jig:~ $ ~/wishbone-tool --spi-pins 2,3,4,18 -s gdb --bind-addr
INFO [wishbone_tool::bridge::spi::raspberry_spi] re-initialized spi device with pins MOSI:2 MISO:3 CLK:4 CS:18
INFO [wishbone_tool::server] accepting connections on
INFO [wishbone_tool::server] connection from

Now it's possible to debug the board, but how does that help with profiling?

Well, the simple answer is that we can keep breaking into the debugger, printing the current address, and continuing.  Over time we can build up a view of what addresses get called the most, and then make sure to put those into RAM.

There are lots of ways we could do that, but I didn't want to write a whole lot of code.  My solution was to print backtraces to a file, then use ordinary Unix tools on that.

In gdb, run the following script (either by pasting it in, or passing it to gdb e.g. gdb -x profiling.gdb circuitpython.elf):

set pagination 0
set logging file profile.txt
set logging overwrite

server define poor_profile
set $total = $arg0
set $i = 0
   set logging on
     set $i = $i + 1
     p $pc
   set logging off

Then we can start profiling and taking 100 samples by running poor_profile 100.  But how to interrupt gdb?

I did say this was cobbled together. To interrupt gdb, simply send it a SIGINT.  We can do that in another shell with the one-liner for i in $(seq 100); do echo $i...; killall -INT riscv64-unknown-elf-gdb; sleep 5; done.

The end result is a file called profile.txt that contains all of our samples.  We can run Unix commands on this to tease out a list of functions we should optimize.  For example: grep '0x200' profile.txt | awk '{print  " " $4}' | sort -n | uniq -c | sort -n | tail -n 10 gives:

      9  stdio_write
     10  fun_builtin_var_call
     10  mp_call_function_n_kw
     10  mp_stream_rw
     10  mp_stream_write_adaptor
     10  mp_stream_write
     13  mp_hal_stdin_rx_chr
     13  readline
     28  pyexec_friendly_repl
     28  run_repl

These are good candidates for inlining.

It may be preferrable to simply take random $pc samples and post-process them, and that may be a process I try in the future. But this approach is good enough to have a vague idea of which functions are hot and should be moved into RAM, and it just involves gluing together various Unix commands.