mtp, mpi: Add fat-tree examples

This commit is contained in:
F5
2023-11-20 23:11:33 +08:00
parent b69f13f5d2
commit d70b17337a
7 changed files with 1891 additions and 16 deletions

View File

@@ -1,13 +1,13 @@
# UNISON for ns-3
# Unison for ns-3
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10077300.svg)](https://doi.org/10.5281/zenodo.10077300)
[![CI](https://github.com/NASA-NJU/UNISON-for-ns-3/actions/workflows/per_commit.yml/badge.svg)](https://github.com/NASA-NJU/UNISON-for-ns-3/actions/workflows/per_commit.yml)
A fast and user-transparent parallel simulator implementation for ns-3.
More information about UNISON can be found in our EuroSys '24 paper (coming soon).
More information about Unison can be found in our EuroSys '24 paper (coming soon).
Supported ns-3 version: [3.36.1](tree/unison-3.36.1), [3.37](tree/unison-3.37), [3.38](tree/unison-3.38), [3.39](tree/unison-3.36.1).
We are trying to keep UNISON updated with the latest version of ns-3.
Supported ns-3 versions: [3.36.1](https://github.com/NASA-NJU/UNISON-for-ns-3/tree/unison-3.36.1), [3.37](https://github.com/NASA-NJU/UNISON-for-ns-3/tree/unison-3.37), [3.38](https://github.com/NASA-NJU/UNISON-for-ns-3/tree/unison-3.38), [3.39](https://github.com/NASA-NJU/UNISON-for-ns-3/tree/unison-3.39) and [3.40](https://github.com/NASA-NJU/UNISON-for-ns-3/tree/unison-3.40).
We are trying to keep Unison updated with the latest version of ns-3.
You can find each unison-enabled ns-3 version via `unison-*` tags.
## Getting Started
@@ -22,7 +22,7 @@ The quickest way to get started is to type the command
> If you want to get `-O3` optimized build and discard all log outputs, please add `-d optimized` arguments.
The `--enable-mtp` option will enable multi-threaded parallelization.
You can verify UNISON is enabled by checking whether `Multithreaded Simulation : ON` appears in the optional feature list.
You can verify Unison is enabled by checking whether `Multithreaded Simulation : ON` appears in the optional feature list.
Now, let's build and run a DCTCP example with default sequential simulation and parallel simulation (using 4 threads) respectively:
@@ -35,29 +35,32 @@ time ./ns3 run dctcp-example-mtp
The simulation should finish in 4-5 minutes for `dctcp-example` and 1-2 minutes for `dctcp-example-mtp`, depending on your hardware and your build profile.
The output in `*.dat` should be in accordance with the comments in the source file.
The speedup of UNISON is more significant for larger topologies and traffic volumes.
The speedup of Unison is more significant for larger topologies and traffic volumes.
If you are interested in using it to simulate topologies like fat-tree, BCube and 2D-torus, please refer to [Running Evaluations](#running-evaluations).
## Speedup Your Existing Code
To understand how UNISON affects your model code, let's find the differences between two versions of the source files of the above example:
To understand how Unison affects your model code, let's find the differences between two versions of the source files of the above example:
```shell
diff examples/tcp/dctcp-example.cc examples/mtp/dctcp-example-mtp.cc
```
It turns out that to bring UNISON to existing model code, all you need to do is to include the `ns3/mtp-interface.h` header file and add the following line at the beginning of the `main` function:
It turns out that to bring Unison to the existing model code, all you need to do is to include the `ns3/mtp-interface.h` header file and add the following line at the beginning of the `main` function:
```c++
MtpInterface::Enable(numberOfThreads);
```
The parameter `numberOfThreads` is optional.
If it is omitted, the number of threads is automatically chosen and will not exceed the maximum number of available hardware threads on your system. If you want to enable UNISON for distributed simulation on existing MPI programs for further speedup, place the above line before MPI initialization.
If it is omitted, the number of threads is automatically chosen and will not exceed the maximum number of available hardware threads on your system.
If you want to enable Unison for distributed simulation on existing MPI programs for further speedup, place the above line before MPI initialization and do not explicitly specify the simulator implementation in your code.
For such hybrid simulation with MPI, the `--enable-mpi` option is also required when configuring ns-3.
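The thread-count rule described above can be sketched with the standard library alone. This is a minimal illustration, not Unison's actual implementation: `ChooseThreadCount` is a hypothetical helper, and clamping an explicit request to the hardware limit is an assumption made for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>

// Hypothetical helper (not part of ns-3 or Unison): pick a worker thread count.
// A request of 0 means "choose automatically"; any result is capped at the
// number of hardware threads reported by the standard library.
inline uint32_t
ChooseThreadCount(uint32_t requested,
                  uint32_t hardware = std::thread::hardware_concurrency())
{
    // hardware_concurrency() may return 0 when the value cannot be determined
    uint32_t available = std::max<uint32_t>(hardware, 1);
    if (requested == 0)
    {
        return available;
    }
    return std::min(requested, available);
}
```

In real model code you would simply pass the desired value to `MtpInterface::Enable`, or omit it to let Unison decide.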
UNISON resolved a lot of thread-safety issues with ns-3's architecture.
Unison resolved a lot of thread-safety issues with ns-3's architecture.
Most of the time you don't need to handle these issues yourself, unless you keep custom global statistics other than the built-in flow-monitor.
In the latter case, if multiple nodes can access your global statistics, you can replace them with atomic variables via `std::atomic<>`.
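As a rough sketch of the `std::atomic<>` approach, the following standalone snippet updates one global counter from several threads. The names `g_totalRxBytes`, `OnPacketReceived` and `RunWorkers` are hypothetical, and plain `std::thread` workers stand in for the simulator's threads.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical global statistic shared by all nodes (illustrative name).
std::atomic<uint64_t> g_totalRxBytes{0};

// Stand-in for a per-node receive callback that may run on any worker thread.
void
OnPacketReceived(uint32_t packetSize)
{
    // Relaxed ordering is enough for a pure counter that is read after join().
    g_totalRxBytes.fetch_add(packetSize, std::memory_order_relaxed);
}

// Let `threads` workers each report `packets` packets of `size` bytes.
uint64_t
RunWorkers(uint32_t threads, uint32_t packets, uint32_t size)
{
    g_totalRxBytes = 0;
    std::vector<std::thread> pool;
    for (uint32_t t = 0; t < threads; t++)
    {
        pool.emplace_back([=] {
            for (uint32_t i = 0; i < packets; i++)
            {
                OnPacketReceived(size);
            }
        });
    }
    for (auto& th : pool)
    {
        th.join();
    }
    return g_totalRxBytes.load();
}
```

With a plain `uint64_t` instead of `std::atomic<uint64_t>`, concurrent increments could be lost and the total would be unreliable.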
When collecting tracing data such as Pcap, it is strongly recommended to create separate output files for each node instead of a single trace file.
For complex custom data structures, you can create critical sections by adding
```c++
@@ -66,16 +69,30 @@ MtpInterface::CriticalSection cs;
at the beginning of your methods.
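The idea of a guard object declared at the top of a method can be illustrated with the standard library. This sketch uses `std::lock_guard` as an analogue only; how `MtpInterface::CriticalSection` is implemented internally is not stated here, and `FlowTable` and `RunConcurrentUpdates` are hypothetical names.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical statistics table keyed by flow id (illustrative only).
class FlowTable
{
  public:
    void AddBytes(uint32_t flowId, uint64_t bytes)
    {
        // RAII guard: the lock is held until the method returns.
        std::lock_guard<std::mutex> guard(m_mutex);
        m_bytes[flowId] += bytes;
    }

    uint64_t GetBytes(uint32_t flowId)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        auto it = m_bytes.find(flowId);
        return it == m_bytes.end() ? 0 : it->second;
    }

  private:
    std::mutex m_mutex;
    std::map<uint32_t, uint64_t> m_bytes;
};

// Update the same table from several threads and return the grand total.
uint64_t
RunConcurrentUpdates(uint32_t threads, uint32_t updates)
{
    FlowTable table;
    std::vector<std::thread> pool;
    for (uint32_t t = 0; t < threads; t++)
    {
        pool.emplace_back([&table, updates] {
            for (uint32_t i = 0; i < updates; i++)
            {
                table.AddBytes(i % 4, 1);
            }
        });
    }
    for (auto& th : pool)
    {
        th.join();
    }
    uint64_t total = 0;
    for (uint32_t id = 0; id < 4; id++)
    {
        total += table.GetBytes(id);
    }
    return total;
}
```

Because the guard releases the lock in its destructor, every return path of the method leaves the critical section correctly.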
In addition to the DCTCP example above, you can find other adapted examples in `examples/mtp`.
## Examples
In addition to the DCTCP example, you can find other adapted examples in `examples/mtp`.
Unison also supports manual partitioning; you can find a minimal example in `src/mtp/examples/simple-mtp.cc`.
For hybrid simulation with MPI, you can find a minimal example in `src/mpi/examples/simple-hybrid.cc`.
We also provide three detailed fat-tree examples for Unison, traditional MPI parallel simulation and hybrid simulation:
| Name | Location | Required configuration flags | Running commands |
| - | - | - | - |
| fat-tree-mtp | src/mtp/examples/fat-tree-mtp.cc | `--enable-mtp --enable-examples` without `--enable-mpi` | `./ns3 run "fat-tree-mtp --thread=4"` |
| fat-tree-mpi | src/mpi/examples/fat-tree-mpi.cc | `--enable-mpi --enable-examples` without `--enable-mtp` | `./ns3 run fat-tree-mpi --command-template "mpirun -np 4 %s"` |
| fat-tree-hybrid | src/mpi/examples/fat-tree-hybrid.cc | `--enable-mtp --enable-mpi --enable-examples` | `./ns3 run fat-tree-hybrid --command-template "mpirun -np 2 %s --thread=2"` |
Feel free to explore these examples, compare code changes and adjust the `-np` and `--thread` arguments.
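To get a feel for how large these topologies become, the node counts can be derived from the `k` and `cluster` options in the same way the fat-tree examples compute their scales (`k / 2` core groups, and `k / 2` core switches per group, aggregation switches per pod, edge switches per pod and hosts per edge switch). The struct and function names below are hypothetical and only serve this sketch.

```cpp
#include <cstdint>

// Hypothetical summary of a fat-tree's size (names are illustrative).
struct FatTreeScale
{
    uint32_t pods;  // number of pods
    uint32_t core;  // total core switches
    uint32_t agg;   // total aggregation switches
    uint32_t edge;  // total edge switches
    uint32_t hosts; // total hosts
};

// Derive the scale from k; a non-zero `cluster` overrides the pod count.
FatTreeScale
ComputeFatTreeScale(uint32_t k, uint32_t cluster = 0)
{
    uint32_t nPod = cluster ? cluster : k;
    uint32_t half = k / 2;
    FatTreeScale s;
    s.pods = nPod;
    s.core = half * half;         // k/2 groups of k/2 core switches
    s.agg = nPod * half;          // k/2 aggregation switches per pod
    s.edge = nPod * half;         // k/2 edge switches per pod
    s.hosts = nPod * half * half; // k/2 hosts under each edge switch
    return s;
}
```

For the default `k = 4` this gives 4 core, 8 aggregation and 8 edge switches with 16 hosts; `k = 8` already yields 128 hosts.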
## Running Evaluations
To evaluate UNISON, please switch to [unison-evaluations](https://github.com/NASA-NJU/UNISON-for-ns-3/tree/unison-evaluations) branch, which is based on ns-3.36.1.
To evaluate Unison, please switch to the [unison-evaluations](https://github.com/NASA-NJU/UNISON-for-ns-3/tree/unison-evaluations) branch, which is based on ns-3.36.1.
In this branch, you can find various topology models in the `scratch` folder.
There are a lot of parameters you can set for each topology.
We provided a utility script `exp.py` to compare these simulators and parameters.
We also provided `process.py` to convert these raw experiment data to CSV files suitable for plotting.
Please see the [README in that branch](https://github.com/NASA-NJU/UNISON-for-ns-3/tree/unison-evaluations) for more details.
Please see the [README in that branch](https://github.com/NASA-NJU/UNISON-for-ns-3/tree/unison-evaluations) for more details.
The evaluated artifact (based on ns-3.36.1) is persistently indexed by DOI [10.5281/zenodo.10077300](https://doi.org/10.5281/zenodo.10077300).
@@ -83,7 +100,7 @@ The evaluated artifact (based on ns-3.36.1) is persistently indexed by DOI [10.5
### 1. Overview
UNISON for ns-3 is mainly implemented in the `mtp` module (located at `src/mtp/*`), which stands for multi-threaded parallelization.
Unison for ns-3 is mainly implemented in the `mtp` module (located at `src/mtp/*`), which stands for multi-threaded parallelization.
This module contains three parts: A parallel simulator implementation `multithreaded-simulator-impl`, an interface to users `mtp-interface`, and `logical-process` to represent LPs in terms of parallel simulation.
All LPs and threads are stored in the `mtp-interface`.
@@ -105,7 +122,7 @@ We also modified the module to make it locally thread-safe.
In addition to the `mtp` and `mpi` modules, we also modified the following parts of the ns-3 architecture to make them thread-safe, and fixed some ns-3 bugs along the way.
You can find the modifications to each unison-enabled ns-3 version via `git diff unison-* ns-*`.
Modifications to the build system to provide `--enable-mtp` option to enable/disable UNISON:
Modifications to the build system to provide `--enable-mtp` option to enable/disable Unison:
```
ns3 | 2 +
@@ -179,7 +196,7 @@ src/mpi/model/mpi-interface.cc | 3 +-
### 3. Logging
The reason behind UNISON's fast speed is that it divides the network into multiple logical processes (LPs) with fine granularity and schedules them dynamically.
The reason behind Unison's fast speed is that it divides the network into multiple logical processes (LPs) with fine granularity and schedules them dynamically.
To learn more about this workflow, you can enable the following log component:
```c++

View File

@@ -38,6 +38,18 @@ build_lib_example(
${libapplications}
)
build_lib_example(
NAME fat-tree-mpi
SOURCE_FILES fat-tree-mpi.cc
LIBRARIES_TO_LINK
${libmpi}
${libpoint-to-point}
${libinternet}
${libnix-vector-routing}
${libapplications}
${libflow-monitor}
)
if(${ENABLE_MTP})
build_lib_example(
NAME simple-hybrid
@@ -51,4 +63,17 @@ if(${ENABLE_MTP})
${libapplications}
${libmtp}
)
build_lib_example(
NAME fat-tree-hybrid
SOURCE_FILES fat-tree-hybrid.cc
LIBRARIES_TO_LINK
${libmpi}
${libmtp}
${libpoint-to-point}
${libinternet}
${libnix-vector-routing}
${libapplications}
${libflow-monitor}
)
endif()

View File

@@ -0,0 +1,613 @@
#include "ns3/applications-module.h"
#include "ns3/core-module.h"
#include "ns3/flow-monitor-module.h"
#include "ns3/internet-module.h"
#include "ns3/mpi-module.h"
#include "ns3/mtp-module.h"
#include "ns3/network-module.h"
#include "ns3/nix-vector-routing-module.h"
#include "ns3/point-to-point-module.h"
#include "ns3/traffic-control-module.h"
#include <chrono>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <map>
#include <numeric>
#include <vector>
using namespace std;
using namespace chrono;
using namespace ns3;
#define LOCAL(r) ((r) == conf::rank)
#define LOG(content) \
{ \
if (conf::rank == 0) \
cout << content << endl; \
}
// random variable distribution
class Distribution
{
public:
// load a distribution from a CDF file
Distribution(string filename = "src/mtp/examples/web-search.txt")
{
ifstream fin;
fin.open(filename);
double x, cdf;
// stop at the first failed read so the last entry is not pushed twice
while (fin >> x >> cdf)
{
m_cdf.push_back(std::make_pair(x, cdf));
}
fin.close();
m_rand = CreateObject<UniformRandomVariable>();
}
// expectation value of the distribution
double Expectation()
{
double ex = 0;
for (uint32_t i = 1; i < m_cdf.size(); i++)
{
ex +=
(m_cdf[i].first + m_cdf[i - 1].first) / 2 * (m_cdf[i].second - m_cdf[i - 1].second);
}
return ex;
}
// get a random value from the distribution
double Sample()
{
double rand = m_rand->GetValue(0, 1);
for (uint32_t i = 1; i < m_cdf.size(); i++)
{
if (rand <= m_cdf[i].second)
{
double slope =
(m_cdf[i].first - m_cdf[i - 1].first) / (m_cdf[i].second - m_cdf[i - 1].second);
return m_cdf[i - 1].first + slope * (rand - m_cdf[i - 1].second);
}
}
// fall back to the largest value (not its CDF probability)
return m_cdf[m_cdf.size() - 1].first;
}
private:
// the actual CDF function
vector<pair<double, double>> m_cdf;
// random variable stream
Ptr<UniformRandomVariable> m_rand;
};
// traffic generator
class TrafficGenerator
{
public:
TrafficGenerator(string cdfFile,
uint32_t hostTotal,
double dataRate,
double incastRatio,
vector<uint32_t> victims)
{
m_distribution = Distribution(cdfFile);
m_currentTime = 0;
m_averageInterval = m_distribution.Expectation() * 8 / dataRate;
m_incastRatio = incastRatio;
m_hostTotal = hostTotal;
m_victims = victims;
m_flowCount = 0;
m_flowSizeTotal = 0;
m_uniformRand = CreateObject<UniformRandomVariable>();
m_expRand = CreateObject<ExponentialRandomVariable>();
}
// get one flow with incremental time and random src, dst and size
tuple<double, uint32_t, uint32_t, uint32_t> GetFlow()
{
uint32_t src, dst;
if (m_uniformRand->GetValue(0, 1) < m_incastRatio)
{
dst = m_victims[m_uniformRand->GetInteger(0, m_victims.size() - 1)];
}
else
{
dst = m_uniformRand->GetInteger(0, m_hostTotal - 1);
}
do
{
src = m_uniformRand->GetInteger(0, m_hostTotal - 1);
} while (src == dst);
uint32_t flowSize = max((uint32_t)round(m_distribution.Sample()), 1U);
m_currentTime += m_expRand->GetValue(m_averageInterval, 0);
m_flowSizeTotal += flowSize;
m_flowCount++;
return make_tuple(m_currentTime, src, dst, flowSize);
}
double GetActualDataRate()
{
return m_flowSizeTotal / m_currentTime * 8;
}
double GetAvgFlowSize()
{
return m_distribution.Expectation();
}
double GetActualAvgFlowSize()
{
return m_flowSizeTotal / (double)m_flowCount;
}
uint32_t GetFlowCount()
{
return m_flowCount;
}
private:
double m_currentTime;
double m_averageInterval;
double m_incastRatio;
uint32_t m_hostTotal;
vector<uint32_t> m_victims;
uint32_t m_flowCount;
uint64_t m_flowSizeTotal;
Distribution m_distribution;
Ptr<UniformRandomVariable> m_uniformRand;
Ptr<ExponentialRandomVariable> m_expRand;
};
namespace conf
{
// fat-tree scale
uint32_t k = 4;
uint32_t cluster = 0;
// link layer options
uint32_t mtu = 1500;
uint32_t delay = 3000;
string bandwidth = "10Gbps";
// traffic-control layer options
string buffer = "4MB";
bool ecn = true;
// network layer options
bool nix = false;
bool rip = false;
bool ecmp = true;
bool flow = true;
// transport layer options
uint32_t port = 443;
string socket = "ns3::TcpSocketFactory";
string tcp = "ns3::TcpDctcp";
// application layer options
uint32_t size = 1448;
string cdf = "src/mtp/examples/web-search.txt";
double load = 0.3;
double incast = 0;
string victim = "0";
// simulation options
string seed = "";
bool flowmon = false;
double time = 1;
double interval = 0.1;
// mtp options
uint32_t thread = 4;
// mpi options
uint32_t system = 0;
uint32_t rank = 0;
bool nullmsg = false;
}; // namespace conf
void
Initialize(int argc, char* argv[])
{
CommandLine cmd;
// parse scale
cmd.AddValue("k", "Number of pods in a fat-tree", conf::k);
cmd.AddValue("cluster", "Number of clusters in a variant fat-tree", conf::cluster);
// parse network options
cmd.AddValue("mtu", "P2P link MTU", conf::mtu);
cmd.AddValue("delay", "Link delay in nanoseconds", conf::delay);
cmd.AddValue("bandwidth", "Link bandwidth", conf::bandwidth);
cmd.AddValue("buffer", "Switch buffer size", conf::buffer);
cmd.AddValue("ecn", "Use explicit congestion notification (ECN)", conf::ecn);
cmd.AddValue("nix", "Enable nix-vector routing", conf::nix);
cmd.AddValue("rip", "Enable RIP routing", conf::rip);
cmd.AddValue("ecmp", "Use equal-cost multi-path routing", conf::ecmp);
cmd.AddValue("flow", "Use per-flow ECMP routing", conf::flow);
cmd.AddValue("port", "Port number of server applications", conf::port);
cmd.AddValue("socket", "Socket protocol", conf::socket);
cmd.AddValue("tcp", "TCP protocol", conf::tcp);
cmd.AddValue("size", "Application packet size", conf::size);
cmd.AddValue("cdf", "Traffic CDF file location", conf::cdf);
cmd.AddValue("load", "Traffic load relative to bisection bandwidth", conf::load);
cmd.AddValue("incast", "Incast traffic ratio", conf::incast);
cmd.AddValue("victim", "Incast traffic victim list", conf::victim);
// parse simulation options
cmd.AddValue("seed", "The seed of the random number generator", conf::seed);
cmd.AddValue("flowmon", "Use flow-monitor to record statistics", conf::flowmon);
cmd.AddValue("time", "Simulation time in seconds", conf::time);
cmd.AddValue("interval", "Simulation progress print interval in seconds", conf::interval);
// parse mtp/mpi options
cmd.AddValue("thread", "Maximum number of threads", conf::thread);
cmd.AddValue("system", "Number of logical processes in MTP manual partition", conf::system);
cmd.AddValue("nullmsg", "Enable null message algorithm", conf::nullmsg);
cmd.Parse(argc, argv);
// link layer settings
Config::SetDefault("ns3::PointToPointChannel::Delay", TimeValue(NanoSeconds(conf::delay)));
Config::SetDefault("ns3::PointToPointNetDevice::DataRate", StringValue(conf::bandwidth));
Config::SetDefault("ns3::PointToPointNetDevice::Mtu", UintegerValue(conf::mtu));
// traffic control layer settings
Config::SetDefault("ns3::RedQueueDisc::MeanPktSize", UintegerValue(conf::mtu));
Config::SetDefault("ns3::RedQueueDisc::UseEcn", BooleanValue(conf::ecn));
Config::SetDefault("ns3::RedQueueDisc::UseHardDrop", BooleanValue(false));
Config::SetDefault("ns3::RedQueueDisc::LinkDelay", TimeValue(NanoSeconds(conf::delay)));
Config::SetDefault("ns3::RedQueueDisc::LinkBandwidth", StringValue(conf::bandwidth));
Config::SetDefault("ns3::RedQueueDisc::MaxSize", QueueSizeValue(QueueSize(conf::buffer)));
Config::SetDefault("ns3::RedQueueDisc::MinTh", DoubleValue(50));
Config::SetDefault("ns3::RedQueueDisc::MaxTh", DoubleValue(150));
// network layer settings
Config::SetDefault("ns3::Ipv4GlobalRouting::RandomEcmpRouting", BooleanValue(conf::ecmp));
Config::SetDefault("ns3::Ipv4GlobalRouting::FlowEcmpRouting", BooleanValue(conf::flow));
// transport layer settings
Config::SetDefault("ns3::TcpL4Protocol::SocketType", StringValue(conf::tcp));
Config::SetDefault("ns3::TcpSocket::SegmentSize", UintegerValue(conf::size));
Config::SetDefault("ns3::TcpSocket::ConnTimeout",
TimeValue(conf::tcp == "ns3::TcpDctcp" ? MilliSeconds(10) : Seconds(3)));
Config::SetDefault("ns3::TcpSocket::SndBufSize", UintegerValue(1073725440));
Config::SetDefault("ns3::TcpSocket::RcvBufSize", UintegerValue(1073725440));
Config::SetDefault(
"ns3::TcpSocketBase::MinRto",
TimeValue(conf::tcp == "ns3::TcpDctcp" ? MilliSeconds(5) : MilliSeconds(200)));
Config::SetDefault(
"ns3::TcpSocketBase::ClockGranularity",
TimeValue(conf::tcp == "ns3::TcpDctcp" ? MicroSeconds(100) : MilliSeconds(1)));
Config::SetDefault("ns3::RttEstimator::InitialEstimation",
TimeValue(conf::tcp == "ns3::TcpDctcp" ? MicroSeconds(200) : Seconds(1)));
// application layer settings
Config::SetDefault("ns3::BulkSendApplication::SendSize", UintegerValue(UINT32_MAX));
Config::SetDefault("ns3::OnOffApplication::DataRate", StringValue(conf::bandwidth));
Config::SetDefault("ns3::OnOffApplication::PacketSize", UintegerValue(conf::size));
Config::SetDefault("ns3::OnOffApplication::OnTime",
StringValue("ns3::ConstantRandomVariable[Constant=1000]"));
Config::SetDefault("ns3::OnOffApplication::OffTime",
StringValue("ns3::ConstantRandomVariable[Constant=0]"));
// simulation settings
Time::SetResolution(Time::PS);
RngSeedManager::SetSeed(Hash32(conf::seed));
// initialize hybrid
MtpInterface::Enable(conf::thread);
MpiInterface::Enable(&argc, &argv);
conf::rank = MpiInterface::GetSystemId();
conf::system = MpiInterface::GetSize();
}
void
SetupRouting()
{
InternetStackHelper internet;
if (conf::nix)
{
internet.SetRoutingHelper(Ipv4NixVectorHelper());
}
else if (conf::rip)
{
internet.SetRoutingHelper(RipHelper());
}
else
{
internet.SetRoutingHelper(Ipv4GlobalRoutingHelper());
}
internet.SetIpv6StackInstall(false);
internet.InstallAll();
LOG("\n- Setup the topology...");
}
void
InstallTraffic(map<uint32_t, Ptr<Node>>& hosts,
map<Ptr<Node>, Ipv4Address>& addrs,
double bisection)
{
// output address for debugging
LOG("\n- Calculating routes...");
LOG(" Host NodeId System Address");
for (auto& p : hosts)
{
LOG(" " << left << setw(6) << p.first << setw(8) << p.second->GetId() << setw(8)
<< p.second->GetSystemId() << addrs[p.second]);
}
if (!conf::nix)
{
Ipv4GlobalRoutingHelper::PopulateRoutingTables();
}
// server applications
PacketSinkHelper server(conf::socket, InetSocketAddress(Ipv4Address::GetAny(), conf::port));
for (auto& p : hosts)
{
if (LOCAL(p.second->GetSystemId()))
{
server.Install(p.second).Start(Seconds(0));
}
}
// calculate traffic
LOG("\n- Generating traffic...");
double bandwidth = bisection * DataRate(conf::bandwidth).GetBitRate() * 2;
string victim;
stringstream sin(conf::victim);
vector<uint32_t> victims;
while (getline(sin, victim, '-'))
{
victims.push_back(stoi(victim));
}
TrafficGenerator traffic(conf::cdf,
hosts.size(),
bandwidth * conf::load,
conf::incast,
victims);
// install traffic (client applications)
auto flow = traffic.GetFlow();
while (get<0>(flow) < conf::time)
{
Ptr<Node> clientNode = hosts[get<1>(flow)];
Ptr<Node> serverNode = hosts[get<2>(flow)];
if (LOCAL(clientNode->GetSystemId()))
{
if (conf::socket != "ns3::TcpSocketFactory")
{
OnOffHelper client(conf::socket, InetSocketAddress(addrs[serverNode], conf::port));
client.SetAttribute("MaxBytes", UintegerValue(get<3>(flow)));
client.Install(clientNode).Start(Seconds(get<0>(flow)));
}
else
{
BulkSendHelper client(conf::socket,
InetSocketAddress(addrs[serverNode], conf::port));
client.SetAttribute("MaxBytes", UintegerValue(get<3>(flow)));
client.Install(clientNode).Start(Seconds(get<0>(flow)));
}
}
flow = traffic.GetFlow();
}
// traffic installation check
LOG(" Expected data rate = " << bandwidth * conf::load / 1e9 << "Gbps");
LOG(" Generated data rate = " << traffic.GetActualDataRate() / 1e9 << "Gbps");
LOG(" Expected avg flow size = " << traffic.GetAvgFlowSize() / 1e6 << "MB");
LOG(" Generated avg flow size = " << traffic.GetActualAvgFlowSize() / 1e6 << "MB");
LOG(" Total flow count = " << traffic.GetFlowCount());
}
void
PrintProgress()
{
LOG(" Progressed to " << Simulator::Now().GetSeconds() << "s");
Simulator::Schedule(Seconds(conf::interval), PrintProgress);
}
void
StartSimulation()
{
// install flow-monitor
Ptr<FlowMonitor> flowMonitor;
FlowMonitorHelper flowHelper;
if (conf::flowmon)
{
flowMonitor = flowHelper.InstallAll();
}
// print progress
if (conf::interval)
{
Simulator::Schedule(Seconds(conf::interval), PrintProgress);
}
// start the simulation
Simulator::Stop(Seconds(conf::time));
LOG("\n- Start simulation...");
auto start = system_clock::now();
Simulator::Run();
auto end = system_clock::now();
auto time = duration_cast<duration<double>>(end - start).count();
// output simulation statistics
uint64_t eventCount = Simulator::GetEventCount();
if (conf::flowmon)
{
uint64_t dropped = 0, totalTx = 0, totalRx = 0, totalTxBytes = 0, flowCount = 0,
finishedFlowCount = 0;
double totalThroughput = 0;
Time totalFct(0), totalFinishedFct(0), totalDelay(0);
flowMonitor->CheckForLostPackets();
for (auto& p : flowMonitor->GetFlowStats())
{
// accumulate drops over all flows instead of keeping only the last flow's count
dropped += p.second.packetsDropped.size();
if ((p.second.timeLastRxPacket - p.second.timeFirstTxPacket).GetTimeStep() > 0 &&
p.second.txPackets && p.second.rxPackets)
{
totalTx += p.second.txPackets;
totalRx += p.second.rxPackets;
totalTxBytes += p.second.txBytes;
totalFct += p.second.timeLastRxPacket - p.second.timeFirstTxPacket;
if (p.second.txPackets - p.second.rxPackets == p.second.packetsDropped.size())
{
totalFinishedFct += p.second.timeLastRxPacket - p.second.timeFirstTxPacket;
finishedFlowCount++;
}
totalDelay += p.second.delaySum;
totalThroughput +=
(double)p.second.txBytes /
(p.second.timeLastRxPacket - p.second.timeFirstTxPacket).GetSeconds();
flowCount++;
}
}
double avgFct = (double)totalFct.GetMicroSeconds() / flowCount;
double avgFinishedFct = (double)totalFinishedFct.GetMicroSeconds() / finishedFlowCount;
double avgDelay = (double)totalDelay.GetMicroSeconds() / totalRx;
double avgThroughput = totalThroughput / flowCount / 1e9 * 8;
LOG(" Detected #flow = " << flowCount);
LOG(" Finished #flow = " << finishedFlowCount);
LOG(" Average FCT (all) = " << avgFct << "us");
LOG(" Average FCT (finished) = " << avgFinishedFct << "us");
LOG(" Average end to end delay = " << avgDelay << "us");
LOG(" Average flow throughput = " << avgThroughput << "Gbps");
LOG(" Network throughput = " << totalTxBytes / 1e9 * 8 / conf::time << "Gbps");
LOG(" Total Tx packets = " << totalTx);
LOG(" Total Rx packets = " << totalRx);
LOG(" Dropped packets = " << dropped);
}
Simulator::Destroy();
uint64_t eventCounts[conf::system];
MPI_Gather(&eventCount,
1,
MPI_UNSIGNED_LONG_LONG,
eventCounts,
1,
MPI_UNSIGNED_LONG_LONG,
0,
MpiInterface::GetCommunicator());
LOG("\n- Done!");
for (uint32_t i = 0; i < conf::system; i++)
{
LOG(" Event count of LP " << i << " = " << eventCounts[i]);
}
LOG(" Event count = " << accumulate(eventCounts, eventCounts + conf::system, 0ULL));
LOG(" Simulation time = " << time << "s\n");
MpiInterface::Disable();
}
int
main(int argc, char* argv[])
{
Initialize(argc, argv);
uint32_t hostId = 0;
map<uint32_t, Ptr<Node>> hosts;
map<Ptr<Node>, Ipv4Address> addrs;
// calculate topo scales
uint32_t nPod = conf::cluster ? conf::cluster : conf::k; // number of pods
uint32_t nGroup = conf::k / 2; // number of group of core switches
uint32_t nCore = conf::k / 2; // number of core switch in a group
uint32_t nAgg = conf::k / 2; // number of aggregation switch in a pod
uint32_t nEdge = conf::k / 2; // number of edge switch in a pod
uint32_t nHost = conf::k / 2; // number of hosts under a switch
NodeContainer core[nGroup], agg[nPod], edge[nPod], host[nPod][nEdge];
// create nodes
for (uint32_t i = 0; i < nGroup; i++)
{
core[i].Create(nCore / 2, (2 * i) % conf::system);
core[i].Create((nCore - 1) / 2 + 1, (2 * i + 1) % conf::system);
}
for (uint32_t i = 0; i < nPod; i++)
{
agg[i].Create(nAgg, i % conf::system);
}
for (uint32_t i = 0; i < nPod; i++)
{
edge[i].Create(nEdge, i % conf::system);
}
for (uint32_t i = 0; i < nPod; i++)
{
for (uint32_t j = 0; j < nEdge; j++)
{
host[i][j].Create(nHost, i % conf::system);
for (uint32_t k = 0; k < nHost; k++)
{
hosts[hostId++] = host[i][j].Get(k);
}
}
}
SetupRouting();
Ipv4AddressHelper addr;
TrafficControlHelper red;
PointToPointHelper p2p;
red.SetRootQueueDisc("ns3::RedQueueDisc");
// connect edge switches to hosts
for (uint32_t i = 0; i < nPod; i++)
{
for (uint32_t j = 0; j < nEdge; j++)
{
string subnet = "10." + to_string(i) + "." + to_string(j) + ".0";
addr.SetBase(subnet.c_str(), "255.255.255.0");
for (uint32_t k = 0; k < nHost; k++)
{
Ptr<Node> node = host[i][j].Get(k);
NetDeviceContainer ndc = p2p.Install(NodeContainer(node, edge[i].Get(j)));
red.Install(ndc.Get(1));
addrs[node] = addr.Assign(ndc).GetAddress(0);
}
}
}
// connect aggregate switches to edge switches
for (uint32_t i = 0; i < nPod; i++)
{
for (uint32_t j = 0; j < nAgg; j++)
{
string subnet = "10." + to_string(i) + "." + to_string(j + nEdge) + ".0";
addr.SetBase(subnet.c_str(), "255.255.255.0");
for (uint32_t k = 0; k < nEdge; k++)
{
NetDeviceContainer ndc = p2p.Install(agg[i].Get(j), edge[i].Get(k));
red.Install(ndc);
addr.Assign(ndc);
}
}
}
// connect core switches to aggregate switches
for (uint32_t i = 0; i < nGroup; i++)
{
for (uint32_t j = 0; j < nPod; j++)
{
string subnet = "10." + to_string(i + nPod) + "." + to_string(j) + ".0";
addr.SetBase(subnet.c_str(), "255.255.255.0");
for (uint32_t k = 0; k < nCore; k++)
{
NetDeviceContainer ndc = p2p.Install(core[i].Get(k), agg[j].Get(i));
red.Install(ndc);
addr.Assign(ndc);
}
}
}
InstallTraffic(hosts, addrs, nGroup * nCore * nPod / 2.0);
StartSimulation();
return 0;
}

View File

@@ -0,0 +1,617 @@
#include "ns3/applications-module.h"
#include "ns3/core-module.h"
#include "ns3/flow-monitor-module.h"
#include "ns3/internet-module.h"
#include "ns3/mpi-module.h"
#include "ns3/network-module.h"
#include "ns3/nix-vector-routing-module.h"
#include "ns3/point-to-point-module.h"
#include "ns3/traffic-control-module.h"
#include <chrono>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <map>
#include <numeric>
#include <vector>
using namespace std;
using namespace chrono;
using namespace ns3;
#define LOCAL(r) ((r) == conf::rank)
#define LOG(content) \
{ \
if (conf::rank == 0) \
cout << content << endl; \
}
// random variable distribution
class Distribution
{
public:
// load a distribution from a CDF file
Distribution(string filename = "src/mtp/examples/web-search.txt")
{
ifstream fin;
fin.open(filename);
double x, cdf;
// stop at the first failed read so the last entry is not pushed twice
while (fin >> x >> cdf)
{
m_cdf.push_back(std::make_pair(x, cdf));
}
fin.close();
m_rand = CreateObject<UniformRandomVariable>();
}
// expectation value of the distribution
double Expectation()
{
double ex = 0;
for (uint32_t i = 1; i < m_cdf.size(); i++)
{
ex +=
(m_cdf[i].first + m_cdf[i - 1].first) / 2 * (m_cdf[i].second - m_cdf[i - 1].second);
}
return ex;
}
// get a random value from the distribution
double Sample()
{
double rand = m_rand->GetValue(0, 1);
for (uint32_t i = 1; i < m_cdf.size(); i++)
{
if (rand <= m_cdf[i].second)
{
double slope =
(m_cdf[i].first - m_cdf[i - 1].first) / (m_cdf[i].second - m_cdf[i - 1].second);
return m_cdf[i - 1].first + slope * (rand - m_cdf[i - 1].second);
}
}
// fall back to the largest value (not its CDF probability)
return m_cdf[m_cdf.size() - 1].first;
}
private:
// the actual CDF function
vector<pair<double, double>> m_cdf;
// random variable stream
Ptr<UniformRandomVariable> m_rand;
};
// traffic generator
class TrafficGenerator
{
public:
TrafficGenerator(string cdfFile,
uint32_t hostTotal,
double dataRate,
double incastRatio,
vector<uint32_t> victims)
{
m_distribution = Distribution(cdfFile);
m_currentTime = 0;
m_averageInterval = m_distribution.Expectation() * 8 / dataRate;
m_incastRatio = incastRatio;
m_hostTotal = hostTotal;
m_victims = victims;
m_flowCount = 0;
m_flowSizeTotal = 0;
m_uniformRand = CreateObject<UniformRandomVariable>();
m_expRand = CreateObject<ExponentialRandomVariable>();
}
// get one flow with incremental time and random src, dst and size
tuple<double, uint32_t, uint32_t, uint32_t> GetFlow()
{
uint32_t src, dst;
if (m_uniformRand->GetValue(0, 1) < m_incastRatio)
{
dst = m_victims[m_uniformRand->GetInteger(0, m_victims.size() - 1)];
}
else
{
dst = m_uniformRand->GetInteger(0, m_hostTotal - 1);
}
do
{
src = m_uniformRand->GetInteger(0, m_hostTotal - 1);
} while (src == dst);
uint32_t flowSize = max((uint32_t)round(m_distribution.Sample()), 1U);
m_currentTime += m_expRand->GetValue(m_averageInterval, 0);
m_flowSizeTotal += flowSize;
m_flowCount++;
return make_tuple(m_currentTime, src, dst, flowSize);
}
double GetActualDataRate()
{
return m_flowSizeTotal / m_currentTime * 8;
}
double GetAvgFlowSize()
{
return m_distribution.Expectation();
}
double GetActualAvgFlowSize()
{
return m_flowSizeTotal / (double)m_flowCount;
}
uint32_t GetFlowCount()
{
return m_flowCount;
}
private:
double m_currentTime;
double m_averageInterval;
double m_incastRatio;
uint32_t m_hostTotal;
vector<uint32_t> m_victims;
uint32_t m_flowCount;
uint64_t m_flowSizeTotal;
Distribution m_distribution;
Ptr<UniformRandomVariable> m_uniformRand;
Ptr<ExponentialRandomVariable> m_expRand;
};
namespace conf
{
// fat-tree scale
uint32_t k = 4;
uint32_t cluster = 0;
// link layer options
uint32_t mtu = 1500;
uint32_t delay = 3000;
string bandwidth = "10Gbps";
// traffic-control layer options
string buffer = "4MB";
bool ecn = true;
// network layer options
bool nix = false;
bool rip = false;
bool ecmp = true;
bool flow = true;
// transport layer options
uint32_t port = 443;
string socket = "ns3::TcpSocketFactory";
string tcp = "ns3::TcpDctcp";
// application layer options
uint32_t size = 1448;
string cdf = "src/mtp/examples/web-search.txt";
double load = 0.3;
double incast = 0;
string victim = "0";
// simulation options
string seed = "";
bool flowmon = false;
double time = 1;
double interval = 0.1;
// mpi options
uint32_t system = 0;
uint32_t rank = 0;
bool nullmsg = false;
}; // namespace conf
void
Initialize(int argc, char* argv[])
{
CommandLine cmd;
// parse scale
cmd.AddValue("k", "Number of pods in a fat-tree", conf::k);
cmd.AddValue("cluster", "Number of clusters in a variant fat-tree", conf::cluster);
// parse network options
cmd.AddValue("mtu", "P2P link MTU", conf::mtu);
cmd.AddValue("delay", "Link delay in nanoseconds", conf::delay);
cmd.AddValue("bandwidth", "Link bandwidth", conf::bandwidth);
cmd.AddValue("buffer", "Switch buffer size", conf::buffer);
cmd.AddValue("ecn", "Use explicit congestion notification (ECN)", conf::ecn);
cmd.AddValue("nix", "Enable nix-vector routing", conf::nix);
cmd.AddValue("rip", "Enable RIP routing", conf::rip);
cmd.AddValue("ecmp", "Use equal-cost multi-path routing", conf::ecmp);
cmd.AddValue("flow", "Use per-flow ECMP routing", conf::flow);
cmd.AddValue("port", "Port number of server applications", conf::port);
cmd.AddValue("socket", "Socket protocol", conf::socket);
cmd.AddValue("tcp", "TCP protocol", conf::tcp);
cmd.AddValue("size", "Application packet size", conf::size);
cmd.AddValue("cdf", "Traffic CDF file location", conf::cdf);
cmd.AddValue("load", "Traffic load relative to bisection bandwidth", conf::load);
cmd.AddValue("incast", "Incast traffic ratio", conf::incast);
cmd.AddValue("victim", "Incast traffic victim list", conf::victim);
// parse simulation options
cmd.AddValue("seed", "The seed of the random number generator", conf::seed);
cmd.AddValue("flowmon", "Use flow-monitor to record statistics", conf::flowmon);
cmd.AddValue("time", "Simulation time in seconds", conf::time);
cmd.AddValue("interval", "Simulation progress print interval in seconds", conf::interval);
// parse mtp/mpi options
cmd.AddValue("system", "Number of logical processes in MTP manual partition", conf::system);
cmd.AddValue("nullmsg", "Enable null message algorithm", conf::nullmsg);
cmd.Parse(argc, argv);
// link layer settings
Config::SetDefault("ns3::PointToPointChannel::Delay", TimeValue(NanoSeconds(conf::delay)));
Config::SetDefault("ns3::PointToPointNetDevice::DataRate", StringValue(conf::bandwidth));
Config::SetDefault("ns3::PointToPointNetDevice::Mtu", UintegerValue(conf::mtu));
// traffic control layer settings
Config::SetDefault("ns3::RedQueueDisc::MeanPktSize", UintegerValue(conf::mtu));
Config::SetDefault("ns3::RedQueueDisc::UseEcn", BooleanValue(conf::ecn));
Config::SetDefault("ns3::RedQueueDisc::UseHardDrop", BooleanValue(false));
Config::SetDefault("ns3::RedQueueDisc::LinkDelay", TimeValue(NanoSeconds(conf::delay)));
Config::SetDefault("ns3::RedQueueDisc::LinkBandwidth", StringValue(conf::bandwidth));
Config::SetDefault("ns3::RedQueueDisc::MaxSize", QueueSizeValue(QueueSize(conf::buffer)));
Config::SetDefault("ns3::RedQueueDisc::MinTh", DoubleValue(50));
Config::SetDefault("ns3::RedQueueDisc::MaxTh", DoubleValue(150));
// network layer settings
Config::SetDefault("ns3::Ipv4GlobalRouting::RandomEcmpRouting", BooleanValue(conf::ecmp));
Config::SetDefault("ns3::Ipv4GlobalRouting::FlowEcmpRouting", BooleanValue(conf::flow));
// transport layer settings
Config::SetDefault("ns3::TcpL4Protocol::SocketType", StringValue(conf::tcp));
Config::SetDefault("ns3::TcpSocket::SegmentSize", UintegerValue(conf::size));
Config::SetDefault("ns3::TcpSocket::ConnTimeout",
TimeValue(conf::tcp == "ns3::TcpDctcp" ? MilliSeconds(10) : Seconds(3)));
Config::SetDefault("ns3::TcpSocket::SndBufSize", UintegerValue(1073725440));
Config::SetDefault("ns3::TcpSocket::RcvBufSize", UintegerValue(1073725440));
Config::SetDefault(
"ns3::TcpSocketBase::MinRto",
TimeValue(conf::tcp == "ns3::TcpDctcp" ? MilliSeconds(5) : MilliSeconds(200)));
Config::SetDefault(
"ns3::TcpSocketBase::ClockGranularity",
TimeValue(conf::tcp == "ns3::TcpDctcp" ? MicroSeconds(100) : MilliSeconds(1)));
Config::SetDefault("ns3::RttEstimator::InitialEstimation",
TimeValue(conf::tcp == "ns3::TcpDctcp" ? MicroSeconds(200) : Seconds(1)));
// application layer settings
Config::SetDefault("ns3::BulkSendApplication::SendSize", UintegerValue(UINT32_MAX));
Config::SetDefault("ns3::OnOffApplication::DataRate", StringValue(conf::bandwidth));
Config::SetDefault("ns3::OnOffApplication::PacketSize", UintegerValue(conf::size));
Config::SetDefault("ns3::OnOffApplication::OnTime",
StringValue("ns3::ConstantRandomVariable[Constant=1000]"));
Config::SetDefault("ns3::OnOffApplication::OffTime",
StringValue("ns3::ConstantRandomVariable[Constant=0]"));
// simulation settings
Time::SetResolution(Time::PS);
RngSeedManager::SetSeed(Hash32(conf::seed));
// initialize mpi
if (conf::nullmsg)
{
GlobalValue::Bind("SimulatorImplementationType",
StringValue("ns3::NullMessageSimulatorImpl"));
}
else
{
GlobalValue::Bind("SimulatorImplementationType",
StringValue("ns3::DistributedSimulatorImpl"));
}
MpiInterface::Enable(&argc, &argv);
conf::rank = MpiInterface::GetSystemId();
conf::system = MpiInterface::GetSize();
}
void
SetupRouting()
{
InternetStackHelper internet;
if (conf::nix)
{
internet.SetRoutingHelper(Ipv4NixVectorHelper());
}
else if (conf::rip)
{
internet.SetRoutingHelper(RipHelper());
}
else
{
internet.SetRoutingHelper(Ipv4GlobalRoutingHelper());
}
internet.SetIpv6StackInstall(false);
internet.InstallAll();
LOG("\n- Setup the topology...");
}
void
InstallTraffic(map<uint32_t, Ptr<Node>>& hosts,
map<Ptr<Node>, Ipv4Address>& addrs,
double bisection)
{
// output address for debugging
LOG("\n- Calculating routes...");
LOG(" Host NodeId System Address");
for (auto& p : hosts)
{
LOG(" " << left << setw(6) << p.first << setw(8) << p.second->GetId() << setw(8)
<< p.second->GetSystemId() << addrs[p.second]);
}
if (!conf::nix)
{
Ipv4GlobalRoutingHelper::PopulateRoutingTables();
}
// server applications
PacketSinkHelper server(conf::socket, InetSocketAddress(Ipv4Address::GetAny(), conf::port));
for (auto& p : hosts)
{
if (LOCAL(p.second->GetSystemId()))
{
server.Install(p.second).Start(Seconds(0));
}
}
// calculate traffic
LOG("\n- Generating traffic...");
double bandwidth = bisection * DataRate(conf::bandwidth).GetBitRate() * 2;
string victim;
stringstream sin(conf::victim);
vector<uint32_t> victims;
while (getline(sin, victim, '-'))
{
victims.push_back(stoi(victim));
}
TrafficGenerator traffic(conf::cdf,
hosts.size(),
bandwidth * conf::load,
conf::incast,
victims);
// install traffic (client applications)
auto flow = traffic.GetFlow();
while (get<0>(flow) < conf::time)
{
Ptr<Node> clientNode = hosts[get<1>(flow)];
Ptr<Node> serverNode = hosts[get<2>(flow)];
if (LOCAL(clientNode->GetSystemId()))
{
if (conf::socket != "ns3::TcpSocketFactory")
{
OnOffHelper client(conf::socket, InetSocketAddress(addrs[serverNode], conf::port));
client.SetAttribute("MaxBytes", UintegerValue(get<3>(flow)));
client.Install(clientNode).Start(Seconds(get<0>(flow)));
}
else
{
BulkSendHelper client(conf::socket,
InetSocketAddress(addrs[serverNode], conf::port));
client.SetAttribute("MaxBytes", UintegerValue(get<3>(flow)));
client.Install(clientNode).Start(Seconds(get<0>(flow)));
}
}
flow = traffic.GetFlow();
}
// traffic installation check
LOG(" Expected data rate = " << bandwidth * conf::load / 1e9 << "Gbps");
LOG(" Generated data rate = " << traffic.GetActualDataRate() / 1e9 << "Gbps");
LOG(" Expected avg flow size = " << traffic.GetAvgFlowSize() / 1e6 << "MB");
LOG(" Generated avg flow size = " << traffic.GetActualAvgFlowSize() / 1e6 << "MB");
LOG(" Total flow count = " << traffic.GetFlowCount());
}
void
PrintProgress()
{
LOG(" Progressed to " << Simulator::Now().GetSeconds() << "s");
Simulator::Schedule(Seconds(conf::interval), PrintProgress);
}
void
StartSimulation()
{
// install flow-monitor
Ptr<FlowMonitor> flowMonitor;
FlowMonitorHelper flowHelper;
if (conf::flowmon)
{
flowMonitor = flowHelper.InstallAll();
}
// print progress
if (conf::interval)
{
Simulator::Schedule(Seconds(conf::interval), PrintProgress);
}
// start the simulation
Simulator::Stop(Seconds(conf::time));
LOG("\n- Start simulation...");
auto start = system_clock::now();
Simulator::Run();
auto end = system_clock::now();
auto time = duration_cast<duration<double>>(end - start).count();
// output simulation statistics
uint64_t eventCount = Simulator::GetEventCount();
if (conf::flowmon)
{
uint64_t dropped = 0, totalTx = 0, totalRx = 0, totalTxBytes = 0, flowCount = 0,
finishedFlowCount = 0;
double totalThroughput = 0;
Time totalFct(0), totalFinishedFct(0), totalDelay(0);
flowMonitor->CheckForLostPackets();
for (auto& p : flowMonitor->GetFlowStats())
{
// accumulate over all flows instead of keeping only the last flow's value
dropped += p.second.packetsDropped.size();
if ((p.second.timeLastRxPacket - p.second.timeFirstTxPacket).GetTimeStep() > 0 &&
p.second.txPackets && p.second.rxPackets)
{
totalTx += p.second.txPackets;
totalRx += p.second.rxPackets;
totalTxBytes += p.second.txBytes;
totalFct += p.second.timeLastRxPacket - p.second.timeFirstTxPacket;
if (p.second.txPackets - p.second.rxPackets == p.second.packetsDropped.size())
{
totalFinishedFct += p.second.timeLastRxPacket - p.second.timeFirstTxPacket;
finishedFlowCount++;
}
totalDelay += p.second.delaySum;
totalThroughput +=
(double)p.second.txBytes /
(p.second.timeLastRxPacket - p.second.timeFirstTxPacket).GetSeconds();
flowCount++;
}
}
double avgFct = (double)totalFct.GetMicroSeconds() / flowCount;
double avgFinishedFct = (double)totalFinishedFct.GetMicroSeconds() / finishedFlowCount;
double avgDelay = (double)totalDelay.GetMicroSeconds() / totalRx;
double avgThroughput = totalThroughput / flowCount / 1e9 * 8;
LOG(" Detected #flow = " << flowCount);
LOG(" Finished #flow = " << finishedFlowCount);
LOG(" Average FCT (all) = " << avgFct << "us");
LOG(" Average FCT (finished) = " << avgFinishedFct << "us");
LOG(" Average end to end delay = " << avgDelay << "us");
LOG(" Average flow throughput = " << avgThroughput << "Gbps");
LOG(" Network throughput = " << totalTxBytes / 1e9 * 8 / conf::time << "Gbps");
LOG(" Total Tx packets = " << totalTx);
LOG(" Total Rx packets = " << totalRx);
LOG(" Dropped packets = " << dropped);
}
Simulator::Destroy();
uint64_t eventCounts[conf::system];
MPI_Gather(&eventCount,
1,
MPI_UNSIGNED_LONG_LONG,
eventCounts,
1,
MPI_UNSIGNED_LONG_LONG,
0,
MpiInterface::GetCommunicator());
LOG("\n- Done!");
for (uint32_t i = 0; i < conf::system; i++)
{
LOG(" Event count of LP " << i << " = " << eventCounts[i]);
}
LOG(" Event count = " << accumulate(eventCounts, eventCounts + conf::system, 0ULL));
LOG(" Simulation time = " << time << "s\n");
MpiInterface::Disable();
}
int
main(int argc, char* argv[])
{
Initialize(argc, argv);
uint32_t hostId = 0;
map<uint32_t, Ptr<Node>> hosts;
map<Ptr<Node>, Ipv4Address> addrs;
// calculate topo scales
uint32_t nPod = conf::cluster ? conf::cluster : conf::k; // number of pods
uint32_t nGroup = conf::k / 2; // number of core switch groups
uint32_t nCore = conf::k / 2; // number of core switches in a group
uint32_t nAgg = conf::k / 2; // number of aggregation switches in a pod
uint32_t nEdge = conf::k / 2; // number of edge switches in a pod
uint32_t nHost = conf::k / 2; // number of hosts under an edge switch
NodeContainer core[nGroup], agg[nPod], edge[nPod], host[nPod][nEdge];
// create nodes
for (uint32_t i = 0; i < nGroup; i++)
{
core[i].Create(nCore / 2, (2 * i) % conf::system);
core[i].Create((nCore - 1) / 2 + 1, (2 * i + 1) % conf::system);
}
for (uint32_t i = 0; i < nPod; i++)
{
agg[i].Create(nAgg, i % conf::system);
}
for (uint32_t i = 0; i < nPod; i++)
{
edge[i].Create(nEdge, i % conf::system);
}
for (uint32_t i = 0; i < nPod; i++)
{
for (uint32_t j = 0; j < nEdge; j++)
{
host[i][j].Create(nHost, i % conf::system);
for (uint32_t k = 0; k < nHost; k++)
{
hosts[hostId++] = host[i][j].Get(k);
}
}
}
SetupRouting();
Ipv4AddressHelper addr;
TrafficControlHelper red;
PointToPointHelper p2p;
red.SetRootQueueDisc("ns3::RedQueueDisc");
// connect edge switches to hosts
for (uint32_t i = 0; i < nPod; i++)
{
for (uint32_t j = 0; j < nEdge; j++)
{
string subnet = "10." + to_string(i) + "." + to_string(j) + ".0";
addr.SetBase(subnet.c_str(), "255.255.255.0");
for (uint32_t k = 0; k < nHost; k++)
{
Ptr<Node> node = host[i][j].Get(k);
NetDeviceContainer ndc = p2p.Install(NodeContainer(node, edge[i].Get(j)));
red.Install(ndc.Get(1));
addrs[node] = addr.Assign(ndc).GetAddress(0);
}
}
}
// connect aggregate switches to edge switches
for (uint32_t i = 0; i < nPod; i++)
{
for (uint32_t j = 0; j < nAgg; j++)
{
string subnet = "10." + to_string(i) + "." + to_string(j + nEdge) + ".0";
addr.SetBase(subnet.c_str(), "255.255.255.0");
for (uint32_t k = 0; k < nEdge; k++)
{
NetDeviceContainer ndc = p2p.Install(agg[i].Get(j), edge[i].Get(k));
red.Install(ndc);
addr.Assign(ndc);
}
}
}
// connect core switches to aggregate switches
for (uint32_t i = 0; i < nGroup; i++)
{
for (uint32_t j = 0; j < nPod; j++)
{
string subnet = "10." + to_string(i + nPod) + "." + to_string(j) + ".0";
addr.SetBase(subnet.c_str(), "255.255.255.0");
for (uint32_t k = 0; k < nCore; k++)
{
NetDeviceContainer ndc = p2p.Install(core[i].Get(k), agg[j].Get(i));
red.Install(ndc);
addr.Assign(ndc);
}
}
}
InstallTraffic(hosts, addrs, nGroup * nCore * nPod / 2.0);
StartSimulation();
return 0;
}

@@ -8,3 +8,15 @@ build_lib_example(
${libnix-vector-routing}
${libapplications}
)
build_lib_example(
NAME fat-tree-mtp
SOURCE_FILES fat-tree-mtp.cc
LIBRARIES_TO_LINK
${libmtp}
${libpoint-to-point}
${libinternet}
${libnix-vector-routing}
${libapplications}
${libflow-monitor}
)

@@ -0,0 +1,579 @@
#include "ns3/applications-module.h"
#include "ns3/core-module.h"
#include "ns3/flow-monitor-module.h"
#include "ns3/internet-module.h"
#include "ns3/mtp-module.h"
#include "ns3/network-module.h"
#include "ns3/nix-vector-routing-module.h"
#include "ns3/point-to-point-module.h"
#include "ns3/traffic-control-module.h"
#include <algorithm>
#include <chrono>
#include <cmath>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <map>
#include <numeric>
#include <sstream>
#include <string>
#include <tuple>
#include <vector>
using namespace std;
using namespace chrono;
using namespace ns3;
#define LOG(content) \
{ \
cout << content << endl; \
}
// random variable distribution
class Distribution
{
public:
// load a distribution from a CDF file
Distribution(string filename = "src/mtp/examples/web-search.txt")
{
ifstream fin;
fin.open(filename);
// stop on the first failed read, so a trailing newline does not add a bogus entry
double x, cdf;
while (fin >> x >> cdf)
{
m_cdf.push_back(std::make_pair(x, cdf));
}
fin.close();
m_rand = CreateObject<UniformRandomVariable>();
}
// expectation value of the distribution
double Expectation()
{
double ex = 0;
for (uint32_t i = 1; i < m_cdf.size(); i++)
{
ex +=
(m_cdf[i].first + m_cdf[i - 1].first) / 2 * (m_cdf[i].second - m_cdf[i - 1].second);
}
return ex;
}
// get a random value from the distribution
double Sample()
{
double rand = m_rand->GetValue(0, 1);
for (uint32_t i = 1; i < m_cdf.size(); i++)
{
if (rand <= m_cdf[i].second)
{
double slope =
(m_cdf[i].first - m_cdf[i - 1].first) / (m_cdf[i].second - m_cdf[i - 1].second);
return m_cdf[i - 1].first + slope * (rand - m_cdf[i - 1].second);
}
}
// fall back to the largest value (not its cumulative probability)
return m_cdf[m_cdf.size() - 1].first;
}
private:
// the actual CDF function
vector<pair<double, double>> m_cdf;
// random variable stream
Ptr<UniformRandomVariable> m_rand;
};
// traffic generator
class TrafficGenerator
{
public:
TrafficGenerator(string cdfFile,
uint32_t hostTotal,
double dataRate,
double incastRatio,
vector<uint32_t> victims)
{
m_distribution = Distribution(cdfFile);
m_currentTime = 0;
m_averageInterval = m_distribution.Expectation() * 8 / dataRate;
m_incastRatio = incastRatio;
m_hostTotal = hostTotal;
m_victims = victims;
m_flowCount = 0;
m_flowSizeTotal = 0;
m_uniformRand = CreateObject<UniformRandomVariable>();
m_expRand = CreateObject<ExponentialRandomVariable>();
}
// get one flow with incremental time and random src, dst and size
tuple<double, uint32_t, uint32_t, uint32_t> GetFlow()
{
uint32_t src, dst;
if (m_uniformRand->GetValue(0, 1) < m_incastRatio)
{
dst = m_victims[m_uniformRand->GetInteger(0, m_victims.size() - 1)];
}
else
{
dst = m_uniformRand->GetInteger(0, m_hostTotal - 1);
}
do
{
src = m_uniformRand->GetInteger(0, m_hostTotal - 1);
} while (src == dst);
uint32_t flowSize = max((uint32_t)round(m_distribution.Sample()), 1U);
m_currentTime += m_expRand->GetValue(m_averageInterval, 0);
m_flowSizeTotal += flowSize;
m_flowCount++;
return make_tuple(m_currentTime, src, dst, flowSize);
}
double GetActualDataRate()
{
return m_flowSizeTotal / m_currentTime * 8;
}
double GetAvgFlowSize()
{
return m_distribution.Expectation();
}
double GetActualAvgFlowSize()
{
return m_flowSizeTotal / (double)m_flowCount;
}
uint32_t GetFlowCount()
{
return m_flowCount;
}
private:
double m_currentTime;
double m_averageInterval;
double m_incastRatio;
uint32_t m_hostTotal;
vector<uint32_t> m_victims;
uint32_t m_flowCount;
uint64_t m_flowSizeTotal;
Distribution m_distribution;
Ptr<UniformRandomVariable> m_uniformRand;
Ptr<ExponentialRandomVariable> m_expRand;
};
namespace conf
{
// fat-tree scale
uint32_t k = 4;
uint32_t cluster = 0;
// link layer options
uint32_t mtu = 1500;
uint32_t delay = 3000;
string bandwidth = "10Gbps";
// traffic-control layer options
string buffer = "4MB";
bool ecn = true;
// network layer options
bool nix = false;
bool rip = false;
bool ecmp = true;
bool flow = true;
// transport layer options
uint32_t port = 443;
string socket = "ns3::TcpSocketFactory";
string tcp = "ns3::TcpDctcp";
// application layer options
uint32_t size = 1448;
string cdf = "src/mtp/examples/web-search.txt";
double load = 0.3;
double incast = 0;
string victim = "0";
// simulation options
string seed = "";
bool flowmon = false;
double time = 1;
double interval = 0.1;
// mtp options
uint32_t thread = 4;
} // namespace conf
void
Initialize(int argc, char* argv[])
{
CommandLine cmd;
// parse scale
cmd.AddValue("k", "Number of pods in a fat-tree", conf::k);
cmd.AddValue("cluster", "Number of clusters in a variant fat-tree", conf::cluster);
// parse network options
cmd.AddValue("mtu", "P2P link MTU", conf::mtu);
cmd.AddValue("delay", "Link delay in nanoseconds", conf::delay);
cmd.AddValue("bandwidth", "Link bandwidth", conf::bandwidth);
cmd.AddValue("buffer", "Switch buffer size", conf::buffer);
cmd.AddValue("ecn", "Use explicit congestion notification", conf::ecn);
cmd.AddValue("nix", "Enable nix-vector routing", conf::nix);
cmd.AddValue("rip", "Enable RIP routing", conf::rip);
cmd.AddValue("ecmp", "Use equal-cost multi-path routing", conf::ecmp);
cmd.AddValue("flow", "Use per-flow ECMP routing", conf::flow);
cmd.AddValue("port", "Port number of server applications", conf::port);
cmd.AddValue("socket", "Socket protocol", conf::socket);
cmd.AddValue("tcp", "TCP protocol", conf::tcp);
cmd.AddValue("size", "Application packet size", conf::size);
cmd.AddValue("cdf", "Traffic CDF file location", conf::cdf);
cmd.AddValue("load", "Traffic load relative to bisection bandwidth", conf::load);
cmd.AddValue("incast", "Incast traffic ratio", conf::incast);
cmd.AddValue("victim", "Incast traffic victim list", conf::victim);
// parse simulation options
cmd.AddValue("seed", "The seed of the random number generator", conf::seed);
cmd.AddValue("flowmon", "Use flow-monitor to record statistics", conf::flowmon);
cmd.AddValue("time", "Simulation time in seconds", conf::time);
cmd.AddValue("interval", "Simulation progress print interval in seconds", conf::interval);
// parse mtp options
cmd.AddValue("thread", "Maximum number of threads", conf::thread);
cmd.Parse(argc, argv);
// link layer settings
Config::SetDefault("ns3::PointToPointChannel::Delay", TimeValue(NanoSeconds(conf::delay)));
Config::SetDefault("ns3::PointToPointNetDevice::DataRate", StringValue(conf::bandwidth));
Config::SetDefault("ns3::PointToPointNetDevice::Mtu", UintegerValue(conf::mtu));
// traffic control layer settings
Config::SetDefault("ns3::RedQueueDisc::MeanPktSize", UintegerValue(conf::mtu));
Config::SetDefault("ns3::RedQueueDisc::UseEcn", BooleanValue(conf::ecn));
Config::SetDefault("ns3::RedQueueDisc::UseHardDrop", BooleanValue(false));
Config::SetDefault("ns3::RedQueueDisc::LinkDelay", TimeValue(NanoSeconds(conf::delay)));
Config::SetDefault("ns3::RedQueueDisc::LinkBandwidth", StringValue(conf::bandwidth));
Config::SetDefault("ns3::RedQueueDisc::MaxSize", QueueSizeValue(QueueSize(conf::buffer)));
Config::SetDefault("ns3::RedQueueDisc::MinTh", DoubleValue(50));
Config::SetDefault("ns3::RedQueueDisc::MaxTh", DoubleValue(150));
// network layer settings
Config::SetDefault("ns3::Ipv4GlobalRouting::RandomEcmpRouting", BooleanValue(conf::ecmp));
Config::SetDefault("ns3::Ipv4GlobalRouting::FlowEcmpRouting", BooleanValue(conf::flow));
// transport layer settings
Config::SetDefault("ns3::TcpL4Protocol::SocketType", StringValue(conf::tcp));
Config::SetDefault("ns3::TcpSocket::SegmentSize", UintegerValue(conf::size));
Config::SetDefault("ns3::TcpSocket::ConnTimeout",
TimeValue(conf::tcp == "ns3::TcpDctcp" ? MilliSeconds(10) : Seconds(3)));
Config::SetDefault("ns3::TcpSocket::SndBufSize", UintegerValue(1073725440));
Config::SetDefault("ns3::TcpSocket::RcvBufSize", UintegerValue(1073725440));
Config::SetDefault(
"ns3::TcpSocketBase::MinRto",
TimeValue(conf::tcp == "ns3::TcpDctcp" ? MilliSeconds(5) : MilliSeconds(200)));
Config::SetDefault(
"ns3::TcpSocketBase::ClockGranularity",
TimeValue(conf::tcp == "ns3::TcpDctcp" ? MicroSeconds(100) : MilliSeconds(1)));
Config::SetDefault("ns3::RttEstimator::InitialEstimation",
TimeValue(conf::tcp == "ns3::TcpDctcp" ? MicroSeconds(200) : Seconds(1)));
// application layer settings
Config::SetDefault("ns3::BulkSendApplication::SendSize", UintegerValue(UINT32_MAX));
Config::SetDefault("ns3::OnOffApplication::DataRate", StringValue(conf::bandwidth));
Config::SetDefault("ns3::OnOffApplication::PacketSize", UintegerValue(conf::size));
Config::SetDefault("ns3::OnOffApplication::OnTime",
StringValue("ns3::ConstantRandomVariable[Constant=1000]"));
Config::SetDefault("ns3::OnOffApplication::OffTime",
StringValue("ns3::ConstantRandomVariable[Constant=0]"));
// simulation settings
Time::SetResolution(Time::PS);
RngSeedManager::SetSeed(Hash32(conf::seed));
// initialize mtp
MtpInterface::Enable(conf::thread);
}
void
SetupRouting()
{
InternetStackHelper internet;
if (conf::nix)
{
internet.SetRoutingHelper(Ipv4NixVectorHelper());
}
else if (conf::rip)
{
internet.SetRoutingHelper(RipHelper());
}
else
{
internet.SetRoutingHelper(Ipv4GlobalRoutingHelper());
}
internet.SetIpv6StackInstall(false);
internet.InstallAll();
LOG("\n- Setup the topology...");
}
void
InstallTraffic(map<uint32_t, Ptr<Node>>& hosts,
map<Ptr<Node>, Ipv4Address>& addrs,
double bisection)
{
// output address for debugging
LOG("\n- Calculating routes...");
LOG(" Host NodeId System Address");
for (auto& p : hosts)
{
LOG(" " << left << setw(6) << p.first << setw(8) << p.second->GetId() << setw(8)
<< p.second->GetSystemId() << addrs[p.second]);
}
if (!conf::nix)
{
Ipv4GlobalRoutingHelper::PopulateRoutingTables();
}
// server applications
PacketSinkHelper server(conf::socket, InetSocketAddress(Ipv4Address::GetAny(), conf::port));
for (auto& p : hosts)
{
server.Install(p.second).Start(Seconds(0));
}
// calculate traffic
LOG("\n- Generating traffic...");
double bandwidth = bisection * DataRate(conf::bandwidth).GetBitRate() * 2;
string victim;
stringstream sin(conf::victim);
vector<uint32_t> victims;
while (getline(sin, victim, '-'))
{
victims.push_back(stoi(victim));
}
TrafficGenerator traffic(conf::cdf,
hosts.size(),
bandwidth * conf::load,
conf::incast,
victims);
// install traffic (client applications)
auto flow = traffic.GetFlow();
while (get<0>(flow) < conf::time)
{
Ptr<Node> clientNode = hosts[get<1>(flow)];
Ptr<Node> serverNode = hosts[get<2>(flow)];
if (conf::socket != "ns3::TcpSocketFactory")
{
OnOffHelper client(conf::socket, InetSocketAddress(addrs[serverNode], conf::port));
client.SetAttribute("MaxBytes", UintegerValue(get<3>(flow)));
client.Install(clientNode).Start(Seconds(get<0>(flow)));
}
else
{
BulkSendHelper client(conf::socket, InetSocketAddress(addrs[serverNode], conf::port));
client.SetAttribute("MaxBytes", UintegerValue(get<3>(flow)));
client.Install(clientNode).Start(Seconds(get<0>(flow)));
}
flow = traffic.GetFlow();
}
// traffic installation check
LOG(" Expected data rate = " << bandwidth * conf::load / 1e9 << "Gbps");
LOG(" Generated data rate = " << traffic.GetActualDataRate() / 1e9 << "Gbps");
LOG(" Expected avg flow size = " << traffic.GetAvgFlowSize() / 1e6 << "MB");
LOG(" Generated avg flow size = " << traffic.GetActualAvgFlowSize() / 1e6 << "MB");
LOG(" Total flow count = " << traffic.GetFlowCount());
}
void
PrintProgress()
{
LOG(" Progressed to " << Simulator::Now().GetSeconds() << "s");
Simulator::Schedule(Seconds(conf::interval), PrintProgress);
}
void
StartSimulation()
{
// install flow-monitor
Ptr<FlowMonitor> flowMonitor;
FlowMonitorHelper flowHelper;
if (conf::flowmon)
{
flowMonitor = flowHelper.InstallAll();
}
// print progress
if (conf::interval)
{
Simulator::Schedule(Seconds(conf::interval), PrintProgress);
}
// start the simulation
Simulator::Stop(Seconds(conf::time));
LOG("\n- Start simulation...");
auto start = system_clock::now();
Simulator::Run();
auto end = system_clock::now();
auto time = duration_cast<duration<double>>(end - start).count();
// output simulation statistics
uint64_t eventCount = Simulator::GetEventCount();
if (conf::flowmon)
{
uint64_t dropped = 0, totalTx = 0, totalRx = 0, totalTxBytes = 0, flowCount = 0,
finishedFlowCount = 0;
double totalThroughput = 0;
Time totalFct(0), totalFinishedFct(0), totalDelay(0);
flowMonitor->CheckForLostPackets();
for (auto& p : flowMonitor->GetFlowStats())
{
// accumulate over all flows instead of keeping only the last flow's value
dropped += p.second.packetsDropped.size();
if ((p.second.timeLastRxPacket - p.second.timeFirstTxPacket).GetTimeStep() > 0 &&
p.second.txPackets && p.second.rxPackets)
{
totalTx += p.second.txPackets;
totalRx += p.second.rxPackets;
totalTxBytes += p.second.txBytes;
totalFct += p.second.timeLastRxPacket - p.second.timeFirstTxPacket;
if (p.second.txPackets - p.second.rxPackets == p.second.packetsDropped.size())
{
totalFinishedFct += p.second.timeLastRxPacket - p.second.timeFirstTxPacket;
finishedFlowCount++;
}
totalDelay += p.second.delaySum;
totalThroughput +=
(double)p.second.txBytes /
(p.second.timeLastRxPacket - p.second.timeFirstTxPacket).GetSeconds();
flowCount++;
}
}
double avgFct = (double)totalFct.GetMicroSeconds() / flowCount;
double avgFinishedFct = (double)totalFinishedFct.GetMicroSeconds() / finishedFlowCount;
double avgDelay = (double)totalDelay.GetMicroSeconds() / totalRx;
double avgThroughput = totalThroughput / flowCount / 1e9 * 8;
LOG(" Detected #flow = " << flowCount);
LOG(" Finished #flow = " << finishedFlowCount);
LOG(" Average FCT (all) = " << avgFct << "us");
LOG(" Average FCT (finished) = " << avgFinishedFct << "us");
LOG(" Average end to end delay = " << avgDelay << "us");
LOG(" Average flow throughput = " << avgThroughput << "Gbps");
LOG(" Network throughput = " << totalTxBytes / 1e9 * 8 / conf::time << "Gbps");
LOG(" Total Tx packets = " << totalTx);
LOG(" Total Rx packets = " << totalRx);
LOG(" Dropped packets = " << dropped);
}
Simulator::Destroy();
LOG("\n- Done!");
LOG(" Event count = " << eventCount);
LOG(" Simulation time = " << time << "s\n");
}
int
main(int argc, char* argv[])
{
Initialize(argc, argv);
uint32_t hostId = 0;
map<uint32_t, Ptr<Node>> hosts;
map<Ptr<Node>, Ipv4Address> addrs;
// calculate topo scales
uint32_t nPod = conf::cluster ? conf::cluster : conf::k; // number of pods
uint32_t nGroup = conf::k / 2; // number of core switch groups
uint32_t nCore = conf::k / 2; // number of core switches in a group
uint32_t nAgg = conf::k / 2; // number of aggregation switches in a pod
uint32_t nEdge = conf::k / 2; // number of edge switches in a pod
uint32_t nHost = conf::k / 2; // number of hosts under an edge switch
NodeContainer core[nGroup], agg[nPod], edge[nPod], host[nPod][nEdge];
// create nodes
for (uint32_t i = 0; i < nGroup; i++)
{
core[i].Create(nCore / 2);
core[i].Create((nCore - 1) / 2 + 1);
}
for (uint32_t i = 0; i < nPod; i++)
{
agg[i].Create(nAgg);
}
for (uint32_t i = 0; i < nPod; i++)
{
edge[i].Create(nEdge);
}
for (uint32_t i = 0; i < nPod; i++)
{
for (uint32_t j = 0; j < nEdge; j++)
{
host[i][j].Create(nHost);
for (uint32_t k = 0; k < nHost; k++)
{
hosts[hostId++] = host[i][j].Get(k);
}
}
}
SetupRouting();
Ipv4AddressHelper addr;
TrafficControlHelper red;
PointToPointHelper p2p;
red.SetRootQueueDisc("ns3::RedQueueDisc");
// connect edge switches to hosts
for (uint32_t i = 0; i < nPod; i++)
{
for (uint32_t j = 0; j < nEdge; j++)
{
string subnet = "10." + to_string(i) + "." + to_string(j) + ".0";
addr.SetBase(subnet.c_str(), "255.255.255.0");
for (uint32_t k = 0; k < nHost; k++)
{
Ptr<Node> node = host[i][j].Get(k);
NetDeviceContainer ndc = p2p.Install(NodeContainer(node, edge[i].Get(j)));
red.Install(ndc.Get(1));
addrs[node] = addr.Assign(ndc).GetAddress(0);
}
}
}
// connect aggregate switches to edge switches
for (uint32_t i = 0; i < nPod; i++)
{
for (uint32_t j = 0; j < nAgg; j++)
{
string subnet = "10." + to_string(i) + "." + to_string(j + nEdge) + ".0";
addr.SetBase(subnet.c_str(), "255.255.255.0");
for (uint32_t k = 0; k < nEdge; k++)
{
NetDeviceContainer ndc = p2p.Install(agg[i].Get(j), edge[i].Get(k));
red.Install(ndc);
addr.Assign(ndc);
}
}
}
// connect core switches to aggregate switches
for (uint32_t i = 0; i < nGroup; i++)
{
for (uint32_t j = 0; j < nPod; j++)
{
string subnet = "10." + to_string(i + nPod) + "." + to_string(j) + ".0";
addr.SetBase(subnet.c_str(), "255.255.255.0");
for (uint32_t k = 0; k < nCore; k++)
{
NetDeviceContainer ndc = p2p.Install(core[i].Get(k), agg[j].Get(i));
red.Install(ndc);
addr.Assign(ndc);
}
}
}
InstallTraffic(hosts, addrs, nGroup * nCore * nPod / 2.0);
StartSimulation();
return 0;
}

@@ -0,0 +1,12 @@
0 0
10000 0.15
20000 0.20
30000 0.30
50000 0.40
80000 0.53
200000 0.60
1000000 0.70
2000000 0.80
5000000 0.90
10000000 0.97
30000000 1
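
The twelve lines above appear to be the `web-search.txt` CDF that `conf::cdf` points to by default: each line is a `(flow size in bytes, cumulative probability)` pair. As a rough illustration of how `Distribution::Expectation()` and `Distribution::Sample()` treat these points (linear interpolation between neighbouring points), the sketch below re-implements both using only the standard library. `WebSearchCdf`, `Expectation` and `SampleAt` are hypothetical names used for this sketch only, and `SampleAt` takes the uniform draw `u` as an argument rather than pulling it from an ns-3 random stream, so the result is reproducible.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// (flow size in bytes, cumulative probability) points of a piecewise-linear CDF.
using Cdf = std::vector<std::pair<double, double>>;

// Hypothetical helper: the twelve points listed above, hard-coded for this sketch.
Cdf
WebSearchCdf()
{
    return {{0.0, 0.0},        {10000.0, 0.15},   {20000.0, 0.20},    {30000.0, 0.30},
            {50000.0, 0.40},   {80000.0, 0.53},   {200000.0, 0.60},   {1000000.0, 0.70},
            {2000000.0, 0.80}, {5000000.0, 0.90}, {10000000.0, 0.97}, {30000000.0, 1.0}};
}

// Mean flow size: every segment contributes its midpoint value weighted by
// the probability mass that the segment covers.
double
Expectation(const Cdf& cdf)
{
    double ex = 0;
    for (std::size_t i = 1; i < cdf.size(); i++)
    {
        ex += (cdf[i].first + cdf[i - 1].first) / 2 * (cdf[i].second - cdf[i - 1].second);
    }
    return ex;
}

// Inverse-transform sampling: map a uniform draw u in [0, 1] to a flow size by
// interpolating linearly inside the first segment whose upper probability is >= u.
// Assumes the CDF has at least one point.
double
SampleAt(const Cdf& cdf, double u)
{
    for (std::size_t i = 1; i < cdf.size(); i++)
    {
        if (u <= cdf[i].second)
        {
            double slope =
                (cdf[i].first - cdf[i - 1].first) / (cdf[i].second - cdf[i - 1].second);
            return cdf[i - 1].first + slope * (u - cdf[i - 1].second);
        }
    }
    // u is above the last cumulative probability: return the largest flow size.
    return cdf.back().first;
}
```

With these points, `Expectation(WebSearchCdf())` works out to 1,711,250 bytes (about 1.7 MB), and `SampleAt(WebSearchCdf(), 0.175)` lands halfway along the second segment, at roughly 15,000 bytes. `TrafficGenerator` multiplies this mean by 8 and divides by the target data rate to obtain the mean flow inter-arrival time.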