docs: Add example of profiling and optimization

This commit is contained in:
Gabriel Ferreira
2025-03-20 01:08:40 -03:00
parent 4e48af7621
commit 7f89cbfb53
6 changed files with 26591 additions and 0 deletions

View File

@@ -92,6 +92,10 @@ SOURCEFIGS = \
figures/vtune-uarch-core-stats.png \
figures/vtune-uarch-profiling-summary.png \
figures/vtune-uarch-wifi-stats.png \
figures/perf.svg \
figures/perf-lte-frequency-reuse.png \
figures/perf-chunk-processor.png \
figures/perf-detail.png \
${SRC}/stats/doc/Stat-framework-arch.png \
${SRC}/stats/doc/Wifi-default.png \
${SRC}/stats/doc/dcf-overview.dia \

Binary file not shown.

After

Width:  |  Height:  |  Size: 304 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 500 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 390 KiB

26426
doc/manual/figures/perf.svg Normal file

File diff suppressed because it is too large Load Diff

After

Width:  |  Height:  |  Size: 2.0 MiB

View File

@@ -489,6 +489,167 @@ There are many tools to profile your program, including:
An overview on how to use `Perf`_ with `Hotspot`_, `AMD uProf`_ and
`Intel VTune`_ is provided in the following sections.
.. _Profiling and optimization :
Profiling and optimization
++++++++++++++++++++++++++
While profilers will help point out interest hotspots, they won't help fixing the issues.
At least for now...
After profiling, optimization should be done following the Pareto principle.
Focus first on least amount of work that can provide most benefits.
Let us see an example, based on ``lte-frequency-reuse``. In the example
below you will see a flamegraph generated from the CPU-``cycles`` metrics
collected by perf for this LTE example.
.. image:: figures/perf-lte-frequency-reuse.png
The vertical axis contain stacked bars. Every bar corresponds to a function.
The stack itself represents the call stack (when a function call other,
and other, etc). The width of each bar corresponds to the time it takes to
execute the function itself and the functions it calls (bars on top of it).
This is where we start looking for bottlenecks. Start looking at widest
columns at the top, and move down. As an example, we can select
``LteChunkProcessor::End()`` (highlighted in purple).
.. image:: figures/perf-chunk-processor.png
If we click on the ``LteChunkProcessor::End()``, we have a closer view of
where it is actually spending time on. We can see many bars related to callback
indirections. Note that each bar is slightly shorter and offset to the right.
This small difference is the overhead of each function call. When everything
is added together, it corresponds to about 35% of total time of
``LteChunkProcessor::End()``. Yes, callbacks are really expensive,
even in optimized builds. Now, let us take a look at the big gap and big
``LteAmc`` function being called by this callback.
.. image:: figures/perf-detail.png
Now that we are closed to the top of the stack, we have a better idea of
what is actually happening and can optimize more easily.
For example, notice that about 14% of ``LteAmc::CreateCqiFeedbacks()`` is
spent on itself (gap on top of bar). This usually means one of the following:
- many conditional operations (if-else, switches), causing instruction cache-misses or pipeline stalls in case of wrong branch prediction
- accessing data in erratic patterns, causing data cache-misses
- waiting for a value to be computed at a congested CPU unit (e.g. slow division, trigonometric), causing pipeline stalls
- a lot of actual work, also known as retired instructions
To check which case it really is, instead of CPU-``cycles``, one can collect complementary metrics
such as ``cache-misses``, ``branch-misses``, ``stalled-cycles-frontend``, ``stalled-cycles-backend``.
It is easy to spot which one of them is the actual culprit, because its bar will be wider.
After figuring out the issue, you need to look at the code to locate the source of the issue.
Some profilers, such as `Intel VTune`_ and `AMD uProf`_ can give you the rough location of the
instructions (line of code) causing these misses and stalls.
We also see that 86% of the time is spent elsewhere. In LteMiErrorModel, and some vector related calls.
One of these vector calls is ``push_back``, which takes incredible 30% of the time.
Now let us look at the code to see where we can improve it.
.. sourcecode:: cpp
std::vector<int>
LteAmc::CreateCqiFeedbacks(const SpectrumValue& sinr, uint8_t rbgSize)
{
NS_LOG_FUNCTION(this);
std::vector<int> cqi;
if (m_amcModel == MiErrorModel)
{
NS_LOG_DEBUG(this << " AMC-VIENNA RBG size " << (uint16_t)rbgSize);
NS_ASSERT_MSG(rbgSize > 0, " LteAmc-Vienna: RBG size must be greater than 0");
std::vector<int> rbgMap;
int rbId = 0;
for (auto it = sinr.ConstValuesBegin(); it != sinr.ConstValuesEnd(); it++)
{
/// WE CAN CUT 30% OF TIME BY AVOIDING THESE REALLOCATIONS WITH rbgMap.resize(rbgSize) AT BEGINNING
/// THEN USING std::iota(rbgMap.begin(), rbgMap.end(),rbId+1-rbgSize) INSIDE THE NEXT IF
rbgMap.push_back(rbId++);
if ((rbId % rbgSize == 0) || ((it + 1) == sinr.ConstValuesEnd()))
{
uint8_t mcs = 0;
TbStats_t tbStats;
while (mcs <= 28)
{
HarqProcessInfoList_t harqInfoList;
tbStats = LteMiErrorModel::GetTbDecodificationStats(
sinr,
rbgMap,
(uint16_t)GetDlTbSizeFromMcs(mcs, rbgSize) / 8, /// DIVISIONS ARE EXPENSIVE. OPTIMIZER WILL
/// REPLACE THIS TRIVIAL CASE WITH >> 3,
/// BUT THIS ISN'T ALWAYS THE CASE.
/// IF THE DIVISION IS BY A LOOP INVARIANT,
/// THEN COMPUTE ITS INVERSE OUTSIDE
/// THE LOOP TO REPLACE THE DIVISION WITH
/// A MULTIPLICATION.
mcs,
harqInfoList);
if (tbStats.tbler > 0.1)
{
break;
}
mcs++;
}
if (mcs > 0)
{
mcs--;
}
NS_LOG_DEBUG(this << "\t RBG " << rbId << " MCS " << (uint16_t)mcs << " TBLER "
<< tbStats.tbler);
int rbgCqi = 0;
if ((tbStats.tbler > 0.1) && (mcs == 0))
{
rbgCqi = 0; // any MCS can guarantee the 10 % of BER
}
else if (mcs == 28)
{
rbgCqi = 15; // all MCSs can guarantee the 10 % of BER
}
else
{
double s = SpectralEfficiencyForMcs[mcs];
rbgCqi = 0;
while ((rbgCqi < 15) && (SpectralEfficiencyForCqi[rbgCqi + 1] < s))
{
++rbgCqi;
}
}
NS_LOG_DEBUG(this << "\t MCS " << (uint16_t)mcs << "-> CQI " << rbgCqi);
// fill the cqi vector (per RB basis)
/// WE CAN CUT 30% OF TIME BY AVOIDING THESE REALLOCATIONS WITH cqi.resize(cqi.size()+rbgSize)
/// THEN std::fill(cqi.rbegin(), cqi.rbegin()+rbgSize, rbgCqi)
for (uint8_t j = 0; j < rbgSize; j++)
{
cqi.push_back(rbgCqi);
}
rbgMap.clear();
}
}
}
return cqi;
}
Try to explore by yourself with the interactive flamegraph below
(only available for html-based documentation).
.. raw:: html
:file: figures/perf.svg
Note: interactive flamegraph SVGs were generated with `perf.data` output exported by Linux Perf,
then transformed to text and finally rendered as SVG.
.. sourcecode:: console
git clone https://github.com/brendangregg/FlameGraph
./FlameGraph/stackcollapse-perf.pl perf.data > perf.folded
./FlameGraph/flamegraph.pl perf.folded --width=800 > perf.svg
.. _Linux Perf and Hotspot GUI :
Linux Perf and Hotspot GUI