Cobbled-together Profiler
I recently got CircuitPython working on Fomu! But there's an issue: Fomu's SPI flash is currently incredibly slow, and that makes for a poor experience. Worse still, the program runs so slowly that the tinyusb stack can't keep up with the USB protocol, and the host gives up and resets the device.
An easy solution is to put frequently-used functions in RAM. This is as simple as annotating them with __attribute__((section(".ramtext"))). However, the CircuitPython program is over 300 kilobytes, and Fomu only has 128 kilobytes of RAM, so we'll have to pick and choose which functions to load into RAM.
The Fomu EVT board can be debugged over SPI, which makes it possible to debug it even before USB has loaded. To do this, I load the SPI-enabled firmware and start the debug bridge:
fomu@fomu-jig:~ $ fomu-flash -w ~/evt-spi-multiboot-v2.0.3.bin ; fomu-flash -r
Erasing @ 019000 / 01973a Done
Programming @ 01973a / 01973a Done
resetting fpga
fomu@fomu-jig:~ $ ~/wishbone-tool --spi-pins 2,3,4,18 -s gdb --bind-addr 0.0.0.0
INFO [wishbone_tool::bridge::spi::raspberry_spi] re-initialized spi device with pins MOSI:2 MISO:3 CLK:4 CS:18
INFO [wishbone_tool::server] accepting connections on 0.0.0.0:1234
INFO [wishbone_tool::server] connection from 10.0.240.4:62061
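From another machine, we can then attach gdb to that server. A minimal sketch, assuming the jig is reachable on the network as fomu-jig (substitute its IP address otherwise); the port is the 1234 that wishbone-tool printed above:

$ riscv64-unknown-elf-gdb circuitpython.elf
(gdb) target remote fomu-jig:1234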
Now it’s possible to debug the board, but how does that help with profiling?
Well, the simple answer is that we can keep breaking into the debugger, printing the current address, and continuing. Over time we build up a view of which addresses get hit most often, and then make sure to put those into RAM.
There are lots of ways we could do that, but I didn’t want to write a whole lot of code. My solution was to print backtraces to a file, then use ordinary Unix tools on that.
In gdb, run the following script (either by pasting it in, or by passing it to gdb, e.g. gdb -x profiling.gdb circuitpython.elf):
# Log all output to profile.txt, overwriting any previous run
set pagination 0
set logging file profile.txt
set logging overwrite on

# poor_profile N: resume the target N times; each time it stops,
# record the current $pc and a full backtrace
server define poor_profile
  set $total = $arg0
  set $i = 0
  set logging on
  while ($i < $total)
    set $i = $i + 1
    cont
    p $pc
    bt
  end
  set logging off
end
Then we can start profiling and take 100 samples by running poor_profile 100. Each cont resumes the target and only stops (and logs a sample) when gdb gets interrupted. But how do we interrupt gdb?
I did say this was cobbled together. To interrupt gdb, simply send it a SIGINT. We can do that in another shell with the one-liner for i in $(seq 100); do echo $i...; killall -INT riscv64-unknown-elf-gdb; sleep 5; done.
The end result is a file called profile.txt that contains all of our samples. We can run ordinary Unix commands on this to tease out a list of functions we should optimize. For example, grep '0x200' profile.txt | awk '{print " " $4}' | sort -n | uniq -c | sort -n | tail -n 10 gives:
9 stdio_write
10 fun_builtin_var_call
10 mp_call_function_n_kw
10 mp_stream_rw
10 mp_stream_write_adaptor
10 mp_stream_write
13 mp_hal_stdin_rx_chr
13 readline
28 pyexec_friendly_repl
28 run_repl
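To unpack what that pipeline is doing: the 0x200 pattern matches frames whose addresses fall in the 0x2000xxxx range, where Fomu maps its SPI flash, i.e. code that isn't already running from RAM. A commented sketch of the same pipeline, assuming gdb's usual backtrace format of #N 0xADDRESS in function (...):

grep '0x200' profile.txt |     # keep frames executing out of SPI flash
    awk '{print " " $4}' |     # field 4 of a backtrace line is the function name
    sort -n |                  # group identical names together (numeric sort is a no-op on names)
    uniq -c |                  # count how many samples hit each one
    sort -n |                  # rank by sample count, ascending
    tail -n 10                 # show only the ten hottest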
These are good candidates for moving into RAM.
It may be preferable to simply take random $pc samples and post-process them, and that may be a process I try in the future. But this approach is good enough to get a rough idea of which functions are hot and should be moved into RAM, and it just involves gluing together various Unix commands.
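If I do go that route, the post-processing could reuse the same Unix glue, since the existing log already contains the p $pc lines. A sketch, assuming gdb resolves each sampled address to a symbol and prints something like $1 = (void (*)()) 0x20012abc <mp_hal_stdin_rx_chr+8>:

grep '^\$' profile.txt |       # keep only the "p $pc" sample lines
    grep -o '<[^>]*>' |        # extract gdb's <function+offset> annotation
    cut -d+ -f1 |              # drop the +offset part
    tr -d '<>' |               # and the angle brackets
    sort | uniq -c | sort -n   # count each function, hottest last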