diff --git a/src/internet/doc/tcp.rst b/src/internet/doc/tcp.rst
index 1256b9770..6b2910bb6 100644
--- a/src/internet/doc/tcp.rst
+++ b/src/internet/doc/tcp.rst
@@ -752,10 +752,46 @@
 are then asked to lower such value, and to return it. PktsAcked is used in
 case the algorithm needs timing information (such as RTT), and it is called
 each time an ACK is received.
+
+TCP SACK and non-SACK
++++++++++++++++++++++
+
+To avoid code duplication and the effort of maintaining two different versions
+of the TCP core, namely RFC 6675 (TCP SACK) and RFC 5681 (TCP congestion
+control), we have merged RFC 6675 into the current code base. If the receiver
+supports the option, the sender bases its retransmissions on the received
+SACK information. In the absence of that option, the best the sender can do
+is to follow the RFC 5681 specification (Fast Retransmit/Fast Recovery) and
+employ the NewReno modifications in the case of partial ACKs.
+
+The merge work consisted of emulating SACK options in the sender (when the
+receiver does not support SACK) following the RFC 5681 rules. The generation
+is straightforward: each duplicate ACK (as defined by RFC 5681) carries a new
+SACK option that indicates (in increasing order) the blocks transmitted after
+SND.UNA, not including the block starting from SND.UNA itself.
+
+With this emulated SACK information, the sender behaviour is unified in the
+two cases. By carefully generating these SACK blocks, we are able to employ
+all the algorithms outlined in RFC 6675 (e.g., Update(), NextSeg(), IsLost())
+during non-SACK transfers. Of course, in the case of an RTO expiration, no
+guess about SACK blocks can be made, so they are not generated (consequently,
+the implementation will re-send all segments starting from SND.UNA, even the
+ones correctly received).
+Please note that the SACK options generated by the sender (in the case of a
+non-SACK receiver) never leave the sender node itself; they are created
+locally by the TCP implementation and then consumed.
+
+A similar concept is used in Linux with the function tcp_add_reno_sack. Our
+implementation resides in the TcpTxBuffer class, which implements a scoreboard
+through two different lists of segments. TcpSocketBase actively uses the API
+provided by TcpTxBuffer to query the scoreboard; please refer to the Doxygen
+documentation (and to the in-code comments) if you want to learn more about
+this implementation.
+
+When the SACK attribute is enabled on the receiver socket, the sender does
+not craft any SACK option, relying only on what it receives from the network.
+
 Current limitations
 +++++++++++++++++++
-* SACK is not supported
 * TcpCongestionOps interface does not contain every possible Linux operation
 * Fast retransmit / fast recovery are bound with TcpSocketBase, thereby
   preventing easy simulation of TCP Tahoe
diff --git a/src/internet/model/tcp-socket-base.h b/src/internet/model/tcp-socket-base.h
index 3b39fe349..08a91d5b7 100644
--- a/src/internet/model/tcp-socket-base.h
+++ b/src/internet/model/tcp-socket-base.h
@@ -192,11 +192,11 @@ public:
 * \brief A base class for implementation of a stream socket using TCP.
 *
 * This class contains the essential components of TCP, as well as a sockets
- * interface for upper layers to call. This serves as a base for other TCP
- * functions where the sliding window mechanism is handled here. This class
- * provides connection orientation and sliding window flow control. Part of
- * this class is modified from the original NS-3 TCP socket implementation
- * (TcpSocketImpl) by Raj Bhattacharjea of Georgia Tech.
+ * interface for upper layers to call. This class provides connection
+ * orientation and sliding window flow control; congestion control is
+ * delegated to subclasses of TcpCongestionOps.
+ * Part of TcpSocketBase is modified from the original
+ * NS-3 TCP socket implementation (TcpSocketImpl) by
+ * Raj Bhattacharjea of Georgia Tech.
 *
 * For IPv4 packets, the TOS set for the socket is used. The Bind and Connect
 * operations set the TOS for the socket to the value specified in the provided
@@ -225,10 +225,15 @@ public:
 * Congestion control interface
 * ---------------------------
 *
- * Congestion control, unlike older releases of ns-3, has been splitted from
+ * Congestion control, unlike older releases of ns-3, has been split from
 * TcpSocketBase. In particular, each congestion control is now a subclass of
 * the main TcpCongestionOps class. Switching between congestion algorithm is
- * now a matter of setting a pointer into the TcpSocketBase class.
+ * now a matter of setting a pointer into the TcpSocketBase class. The idea
+ * and the interfaces are inspired by the Linux operating system, and in
+ * particular by the structure tcp_congestion_ops.
+ *
+ * Transmission Control Block (TCB)
+ * --------------------------------
 *
 * The variables needed to congestion control classes to operate correctly have
 * been moved inside the TcpSocketState class. It contains information on the
@@ -240,13 +245,13 @@
 * (see for example cWnd trace source).
 *
 * Fast retransmit
- * ---------------------------
+ * ---------------
 *
 * The fast retransmit enhancement is introduced in RFC 2581 and updated in
- * RFC 5681. It basically reduces the time a sender waits before retransmitting
+ * RFC 5681. It reduces the time a sender waits before retransmitting
 * a lost segment, through the assumption that if it receives a certain number
 * of duplicate ACKs, a segment has been lost and it can be retransmitted.
- * Usually it is coupled with the Limited Transmit algorithm, defined in
+ * Usually, it is coupled with the Limited Transmit algorithm, defined in
 * RFC 3042.
 *
 * In ns-3, these algorithms are included in this class, and it is implemented inside
@@ -257,15 +262,39 @@ public:
 * recovery phase, the method EnterRecovery is called.
 *
 * Fast recovery
- * --------------------------
+ * -------------
 *
- * The fast recovery algorithm is introduced RFC 2001, and it basically
+ * The fast recovery algorithm is introduced in RFC 2001, and it
 * avoids to reset cWnd to 1 segment after sensing a loss on the channel. Instead,
 * the slow start threshold is halved, and the cWnd is set equal to such value,
 * plus segments for the cWnd inflation.
 *
 * The algorithm is implemented in the ProcessAck method.
 *
+ * RTO expiration
+ * --------------
+ *
+ * When the retransmission timeout (RTO) expires, TCP faces a big performance
+ * drop. The expiration event is managed in the ReTxTimeout method, which
+ * sets cWnd to 1 segment and starts "from scratch" again.
+ *
+ * Options management
+ * ------------------
+ *
+ * SYN and SYN-ACK options, which are allowed only at the beginning of the
+ * connection, are managed in the DoForwardUp and SendEmptyPacket methods.
+ * All the other options are read in a loop inside ReadOptions. There is no
+ * unique place for adding them, since the options (and the information
+ * available to build them) are scattered around the code. For instance,
+ * the SACK option is built in SendEmptyPacket only under certain conditions.
+ *
+ * SACK
+ * ----
+ *
+ * The SACK generation/management is delegated to the buffer classes, namely
+ * TcpTxBuffer and TcpRxBuffer. Please take a look at their documentation if
+ * you need more information.
+ *
 */
class TcpSocketBase : public TcpSocket
{
diff --git a/src/internet/model/tcp-tx-buffer.h b/src/internet/model/tcp-tx-buffer.h
index 68e58882a..216f8cc9c 100644
--- a/src/internet/model/tcp-tx-buffer.h
+++ b/src/internet/model/tcp-tx-buffer.h
@@ -69,33 +69,55 @@ public:
 *
 * \brief Tcp sender buffer
 *
- * The class keeps track of all data that the application wishes to transmit
- * to the other end. When the data is acknowledged, it is removed from the buffer.
+ * The class keeps track of all data that the application wishes to transmit to
+ * the other end. When the data is acknowledged, it is removed from the buffer.
 * The buffer has a maximum size, and data is not saved if the amount exceeds
- * the limit. Packets can be added to the class through the method Add().
- * An important thing to remember is that all the data managed is strictly
- * sequential. It can be divided in blocks, but all the data follow a strict
- * ordering. That ordering is managed through SequenceNumber.
+ * the limit. Packets can be added to the class through the method Add(). An
+ * important thing to remember is that all the data managed is strictly
+ * sequential. It can be divided into blocks, but all the data follows a strict
+ * ordering. That order is managed through SequenceNumber.
 *
- * In other words, this buffer contains numbered bytes (e.g 1,2,3), and the class
- * is allowed to return only ordered (using "<" as operator) subsets (e.g. 1,2
- * or 2,3 or 1,2,3).
+ * In other words, this buffer contains numbered bytes (e.g., 1,2,3), and the
+ * class is allowed to return only ordered (using "<" as operator) subsets
+ * (e.g., 1,2 or 2,3 or 1,2,3).
 *
 * The data structure underlying this is composed by two distinct packet lists.
- *
- * The first (SentList) is initially empty, and it contains the packets returned
- * by the method CopyFromSequence.
- *
- * The second (AppList) is initially empty, and it contains the packets coming
- * from the applications, but that are not transmitted yet as segments.
- *
- * To discover how the chunk are managed and retrieved from these lists, check
- * CopyFromSequence documentation.
+ * The first (SentList) is initially empty, and it contains the packets
+ * returned by the method CopyFromSequence. The second (AppList) is initially
+ * empty, and it contains the packets coming from the applications but not yet
+ * transmitted as segments. To discover how the chunks are managed and
+ * retrieved from these lists, check the CopyFromSequence documentation.
 *
 * The head of the data is represented by m_firstByteSeq, and it is returned by
- * HeadSequence(). The last byte is returned by TailSequence().
- * In this class we store also the size (in bytes) of the packets inside the
- * SentList in the variable m_sentSize.
+ * HeadSequence(). The last byte is returned by TailSequence(). In this class,
+ * we also store the size (in bytes) of the packets inside the SentList in the
+ * variable m_sentSize.
+ *
+ * SACK management
+ * ---------------
+ *
+ * The SACK information is usually saved in a data structure referred to as a
+ * scoreboard. In this implementation, the scoreboard is developed on top of
+ * the existing classes. In particular, instead of keeping raw pointers to
+ * packets in TcpTxBuffer, we added the capability to store some flags
+ * associated with every segment sent. This is done through the class
+ * TcpTxItem: instead of storing a list of packets, we store a list of
+ * TcpTxItem. Each item has different flags (check the corresponding
+ * documentation), and maintaining the scoreboard is a matter of traversing
+ * the list and setting the SACK flag on the corresponding sent segments.
+ *
+ * Inefficiencies
+ * --------------
+ *
+ * The algorithms outlined in RFC 6675 are full of inefficiencies.
In + * particular, traveling all the sent list each time it is needed to compute + * the bytes in flight is expensive. We try to overcome the issue by + * maintaining a pointer to the highest sequence SACKed; in this way, we can + * avoid traveling all the list in some cases. Another option could be keeping + * a count of each critical value (e.g., the number of packets sacked). + * However, this would be different from the algorithms in RFC. There are some + * other possible improvements; if you wish, take a look and try to add some + * earlier exit conditions in the loops. * * \see Size * \see SizeFromSequence