I'm trying to benchmark a piece of DSP code on a Raspberry Pi 4 using std::chrono::steady_clock, but the results I'm getting are peculiar. Because GNU profiling tools don't work on Raspberry Pi, I'm stuck with benchmarking to evaluate code optimizations, so this is rather a big deal.

What would cause performance to vary by 10% between executions of the benchmark program, while remaining consistent to +/- 1% when the same test is run multiple times within the same execution of the program?

Results for a ~6-second benchmark vary by ~10% between runs. The peculiar thing is that the variance seems to be sticky for a particular execution of the benchmark: I run the benchmark three times in a row each time the program is run, and get roughly the same results, +/- 1%. But when I re-run the program, the results of the three benchmarks vary by +/- 10% from the previous run, with each of the three results in the new run again agreeing to within +/- 1%. From run to run the results fall randomly, roughly anywhere between those two extremes, yet within a single run the three tests stay consistent to +/- 1%.

I'm an experienced programmer, so I get that benchmarks will vary somewhat. But the ~10% variance is unworkable for what I'm trying to do, and I am unable to come up with a reasonable theory as to why the variance changes from invocation to invocation.

The code under test is a machine-learning algorithm (LSTM->Dense) used to generate real-time audio. The bulk of the execution (~90%) is matrix and vector arithmetic using hand-optimized Neon intrinsics. The data footprint is about 13 kB (fits comfortably in the L1 d-cache); the code footprint is unknown, and may not fit in the L1 i-cache. Most of the code pipelines beautifully, so it may be running close to L1-cache bandwidth limits. So far, optimization has resulted in an improvement from ~0.18x realtime to ~0.093x realtime. I think there's probably another ~15% improvement available, but the timing inaccuracies are getting in the way at this point. The code under test gets executed three times, taking ~0.3x realtime in total, so further optimizations are in fact critical.

Things that I have checked:

- All matrices, matrix rows, and vectors are 16-byte aligned (checked with asserts in debug compiles).
- The CPU scaling governors have been set to performance, and all CPUs are running at 1.8 GHz. Verified the governor and CPU frequency with:
  $ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  $ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
- I don't think it's related to cache competition between processes. HTOP indicates ~6% CPU use at idle when connected by VNC, and about 0.3% (wifi supplicant) when connected via SSH. The pattern doesn't significantly change when connected via SSH.
- I don't think it varies depending on which CPU core the code is running on, although I'm only able to determine which core the code is running on in a particular run using HTOP, which isn't completely definitive. Test runs seem to occasionally get shifted to a different CPU core, but for the most part they seem to run on a single randomly selected core for the duration of the three tests per execution run.
- I don't think Raspberry Pi 4s heat throttle until they get to 80 C.
- Vector operations rely on GCC compiler auto-vectorization; the loops have been properly annotated with restrict declarations, and verified to produce optimal Neon vectorization (with better instruction scheduling than I could produce with Neon intrinsics).
- steady_clock, system_clock, and high_resolution_clock all exhibit the same behaviour. Consecutive calls to std::chrono::steady_clock::now() produce increments of between 37 and 56 ns.

Things that I don't know that you may be able to help with:

- How std::chrono::steady_clock is implemented on Raspberry Pi.
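For context, the measurement pattern is essentially the shape below. This is a minimal sketch rather than my actual harness; process_block() is a placeholder standing in for the real LSTM->Dense code, and the three passes correspond to the three tests per execution described above.

// Minimal sketch of the per-pass timing (not the real harness).
// process_block() is a placeholder for the actual Neon LSTM->Dense code.
#include <chrono>
#include <cstdio>

static void process_block()
{
    // Placeholder workload so the sketch is runnable.
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i) x = x + 1.0;
}

int main()
{
    using clock = std::chrono::steady_clock;
    for (int pass = 0; pass < 3; ++pass) {
        auto start = clock::now();
        process_block();
        auto stop = clock::now();
        double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        std::printf("pass %d: %.3f ms\n", pass, ms);
    }
    return 0;
}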
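The 37-56 ns increments mentioned above can be observed with a check along these lines (again just a sketch; the exact numbers depend on the clock source and system load):

// Measures the smallest non-zero step between consecutive steady_clock::now()
// calls, which approximates the effective granularity of the clock.
#include <chrono>
#include <cstdio>

int main()
{
    using clock = std::chrono::steady_clock;
    auto min_step = clock::duration::max();
    auto prev = clock::now();
    for (int i = 0; i < 1000000; ++i) {
        auto cur = clock::now();
        auto d = cur - prev;
        if (d > clock::duration::zero() && d < min_step) min_step = d;
        prev = cur;
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(min_step).count();
    std::printf("smallest non-zero now() step: %lld ns\n", static_cast<long long>(ns));
    return 0;
}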
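One thing I could try, to rule out core selection more definitively than watching HTOP, is pinning the process to a single core before the benchmark starts. A minimal sketch using Linux's sched_setaffinity follows; the choice of core 2 is arbitrary, and on Raspberry Pi OS with g++ both sched_setaffinity and sched_getcpu are available from <sched.h>.

// Pins the benchmark process to one core so per-core differences and
// migrations can be ruled out. Linux-specific.
#include <sched.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);  // arbitrary: pin to core 2 (a Pi 4 has cores 0-3)
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return EXIT_FAILURE;
    }
    std::printf("benchmark pinned to core %d\n", sched_getcpu());
    // ... run the three benchmark passes here ...
    return 0;
}

If the run-to-run spread disappears when pinned, the particular core (or its cache and clock state) would be implicated; if not, that theory can probably be dropped.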