Overview

Important concepts:

  1. Stream
  2. Reliable
  3. Full-duplex
  4. Connection oriented
  5. Port numbers

TCP is a layer 4 protocol that takes the layer 3 packet service of IP and builds on it a reliable stream-of-bytes communication model.

This is done using positive acknowledgements, in which each byte is numbered by its position in the data stream. Acknowledgements are sent in every packet to indicate which bytes have been correctly received. Each connection independently numbers the bytes in the two directions of data flow (full duplex). TCP is a connection-oriented protocol: before any data flows, the two parties engage in a handshake which establishes a bi-directional communication channel.

The channel is uniquely identified by four values: source IP, destination IP, source port, and destination port. A port is an integer which multiplexes multiple TCP streams onto a single IP-to-IP path. All four numbers matter in the definition of a connection. Typically, the destination port is well-known; for instance, www is port 80. The destination IP will be some familiar server, such as www.amazon.com, whose IP is 72.21.210.11 (see notes below). The source IP will be your machine's IP. If you run multiple browsers on your machine, each browser will create a connection with a different source port, chosen dynamically and randomly for each connection. Because all four numbers define a connection, the connections will not get confused.
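The role of the four values can be sketched as a lookup table keyed by all four of them. This is only an illustration: the names FourTuple and demultiplex, the local address 192.0.2.10 and the source ports are assumptions made up for the example, not part of any real TCP implementation.

```python
from typing import Dict, NamedTuple, Optional


class FourTuple(NamedTuple):
    """The four values that together identify one TCP connection."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def demultiplex(table: Dict[FourTuple, str], key: FourTuple) -> Optional[str]:
    """Return the connection registered under this four-tuple, if any."""
    return table.get(key)


# Two browsers on one machine talking to the same server and port.
# Only the dynamically chosen source port differs (values are made up).
browser_a = FourTuple("192.0.2.10", 49152, "72.21.210.11", 80)
browser_b = FourTuple("192.0.2.10", 49153, "72.21.210.11", 80)
connections = {browser_a: "browser A", browser_b: "browser B"}
```

Looking up browser_a and browser_b gives two different entries even though three of the four values are identical.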

Basic segment management

Important concepts:

  1. Sequence Number, the index of a byte in the stream of bytes; also refers to the header field which gives the sequence number of the first byte of data that is being sent, or that would be sent, if no data is being sent by this packet;
  2. Acknowledgement Number, the index of the next byte expected to be received; also acknowledges reception of all lower indexed bytes;
  3. Window Advertisement, the size of the window; this value plus the acknowledgement number indicate the sequence number beyond which sender bytes will be discarded, as being outside the window.

TCP sets up a pair of data streams, where each stream is an ordered sequence of bytes. There is one stream in each of the two directions of data flow between the network endpoints. The bytes are numbered from an Initial Sequence Number, which is different for each direction, and is randomly chosen during TCP connection setup. However, most network analysis tools allow you to see the data streams with the ISN subtracted, so that the sequence number is more simply a 32-bit unsigned integer numbering the bytes starting with byte 0 for the stream establishment byte and continuing 1, 2, 3, etc.
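The subtraction that analysis tools perform can be sketched as follows. The function name relative_seq is an assumption for this example; the only point is that the subtraction is done modulo 2^32, since the 32-bit sequence number wraps around.

```python
MOD = 2 ** 32  # sequence numbers are 32-bit unsigned integers and wrap around


def relative_seq(raw_seq: int, isn: int) -> int:
    """Sequence number with the Initial Sequence Number subtracted."""
    return (raw_seq - isn) % MOD
```

With an ISN close to the top of the range, a raw sequence number that has already wrapped past zero still yields a small relative number.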

The data stream is cut up into segments of consecutive stream bytes with sequence numbers x, x+1, ... , x+l. Each segment is sent as the payload of a TCP packet along with a header that contains three important numbers: the sequence number, an acknowledgement number, and a window advertisement.

The sequence number is the sequence number of the first byte of the segment in the body of the packet. The packet might not contain a data segment, for instance, if the only purpose of the packet is to acknowledge data or to adjust the window. In this case, the sequence number is the value of the next byte to send. TCP connection setup consumes one data byte, so if no data has been sent, the sequence number will be set to 1 (or just think that the byte numbers count from 1, not 0).

The acknowledgement number is the sequence number of the next byte of data that the sender of this header is expecting to receive, and it also acknowledges that all bytes of lower sequence number have been received. The purpose of a packet can be to send data, to acknowledge data, to both send and acknowledge data, or neither (for instance, to announce the opening of the data window). However, every TCP header acknowledges the sequence numbers already received. Setting up the data stream consumes one pseudo-byte, so the acknowledgement number is set to one in the event that there is no real data to acknowledge. This style of sending acknowledgements on packets that can and do send data at the same time is called a piggy-back ack.

The acknowledgement number acknowledges receipt of all lower numbered bytes. If the receiver of bytes has a gap in the bytes it has received, for instance because the packet containing a segment of TCP data has been dropped, it will make no acknowledgement beyond the gap. The receiver will continue to send an acknowledgement number equal to the byte number of the first byte in the gap until that byte is successfully received. At that point the acknowledgement number can jump to include all the segments that have been received beyond the gap.
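A small sketch of how the acknowledgement number stalls at a gap and then jumps. The function name cumulative_ack and the (sequence number, length) representation of a segment are assumptions for this example, and sequence number wraparound is ignored to keep it short.

```python
from typing import Iterable, Tuple


def cumulative_ack(first_expected: int, segments: Iterable[Tuple[int, int]]) -> int:
    """Return the next byte expected, given the segments received so far.

    Each segment is a (sequence number of first byte, length) pair.
    Segments may arrive in any order; wraparound is ignored.
    """
    ack = first_expected
    for seq, length in sorted(segments):
        if seq > ack:
            break  # a gap: nothing beyond it can be acknowledged yet
        ack = max(ack, seq + length)
    return ack
```

If bytes 1 to 100 and 201 to 300 have arrived, the result is 101. Once the missing segment starting at 101 arrives, the result jumps to 301.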

The window advertisement indirectly gives the largest sequence number the sender of the header will accept. It prevents too much data from being sent. The value is the number of bytes currently open in the window. Add this value to the acknowledgement number in the header to get the sequence number of the first byte beyond the window; the most forward byte that will be accepted is one less than that sum. It is an error for a TCP endpoint to rescind a window, in the sense of sending an acknowledgement number and window advertisement combination that moves that sum backwards. However, the window can be allowed to close, eventually sending packets with advertisements of zero. Later, when the TCP endpoint is ready for more data, it can reopen the window by spontaneously sending a packet with a positive window advertisement.

In order to improve efficiency, multiple segments can be sent while waiting for acknowledgements. If this were not permitted, the data rate would be limited to one segment per round trip time, the time for a packet to be sent and its ack received. Instead, a series of segments is sent while a series of acknowledgements is received, in a constant flow, so that the data rate is limited by the channel capacity rather than by the round trip time.

This highlights the difference between latency and throughput. Latency is about the round-trip time; throughput is the amount of data pressed through a channel in a given time.
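The arithmetic behind this can be sketched as below. The function names and the example figures (a 1460-byte segment, a 65535-byte window, a 0.25 second round trip) are illustrative assumptions, and the model ignores headers, losses and queueing.

```python
def stop_and_wait_rate(segment_bytes: int, rtt_seconds: float) -> float:
    """Bytes per second when only one segment may be outstanding per round trip."""
    return segment_bytes / rtt_seconds


def windowed_rate(window_bytes: int, rtt_seconds: float,
                  capacity_bytes_per_second: float) -> float:
    """Bytes per second when a whole window may be outstanding per round trip.

    The rate cannot exceed the channel capacity.
    """
    return min(window_bytes / rtt_seconds, capacity_bytes_per_second)
```

With these figures, one segment per round trip gives 1460 / 0.25 = 5840 bytes per second, while a full window per round trip gives 65535 / 0.25 = 262140 bytes per second, provided the channel can carry that much.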

However, the acknowledgements can control the amount of data sent by using the window advertisement. This is called a sliding window because if the window size is held fixed it moves forward as the acknowledgement number moves forward. With a window size of w and an ack value of y, the header is advertising its interest in receiving bytes y through y+w-1, but bytes beyond that index should not be sent and will be discarded.
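The window test, and the rule that an advertised window must never be rescinded, can be sketched as two small checks. The function names are assumptions for this example. The first check is done modulo 2^32; the second ignores wraparound for brevity.

```python
MOD = 2 ** 32  # sequence numbers are 32-bit unsigned integers


def in_window(seq: int, ack: int, window: int) -> bool:
    """True if byte seq lies in ack .. ack+window-1 (modulo 2**32)."""
    return (seq - ack) % MOD < window


def rescinds_window(old_ack: int, old_window: int,
                    new_ack: int, new_window: int) -> bool:
    """True if the new header moves the right edge (ack + window) backwards.

    Wraparound is ignored in this check.
    """
    return new_ack + new_window < old_ack + old_window
```

With y = 1000 and w = 500, bytes 1000 through 1499 are inside the window and byte 1500 is not. Moving from (ack 1000, window 500) to (ack 1200, window 300) keeps the right edge at 1500 and is allowed; (ack 1200, window 200) would pull it back to 1400.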

Connection establishment

Important concepts:

  1. Three way handshake;
  2. Initial sequence numbers;
  3. Close and half close;
  4. 2 MSL wait (Maximum Segment Lifetime).

TCP is a connection-oriented protocol. Connections are set up before use and torn down after use. There are special flags in the TCP header used for setup and tear-down: the synchronization flag SYN, the acknowledgement flag ACK, and the finishing flag FIN. Actually, the ACK flag is used during the connection as well. It is set whenever the acknowledgement number in the header contains a valid value. The only time it does not contain a valid value is the very first packet sent during connection establishment; with that one exception, it is always set. The SYN flag, however, is specific to establishment, and the FIN flag to tear-down.

Connections are established using a three-way handshake. In client-server terms, the client will begin the handshake by sending a packet with the SYN flag set. The server will respond with a packet with both the SYN and the ACK flags set. The client will complete the establishment with a packet with the ACK flag set. The SYN flag is never used again after establishment.

Each byte in each of the two data streams has an increasing index number, called the sequence number. The starting values for the two sequence numbers are chosen randomly at the time of the three-way handshake, and are part of the handshake.

  1. The client chooses its Initial Sequence Number C, the starting value of the indexing of bytes flowing from client to server, and sends this value to the server in the sequence number field of the header, and sets the SYN flag.
  2. The server responds, acknowledging receipt of that sequence number by setting the acknowledgement number field of the return packet to C+1, one more than the client's ISN, and setting the ACK flag in that packet. The server also chooses its own Initial Sequence Number S, to index the bytes flowing from server to client, and sets the sequence number field of this return packet to this ISN, setting the SYN flag.
  3. The client completes the protocol by sending back to the server a packet with only the ACK flag set, the sequence field set to C+1, one more than its own ISN, and the acknowledgement field set to S+1, one more than the server's ISN.
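The three steps above can be traced with a short sketch that fills in the sequence and acknowledgement fields for given ISNs C and S. The Segment type and the function name three_way_handshake are assumptions for this example; no packets are actually sent.

```python
from typing import List, NamedTuple, Optional, Tuple

MOD = 2 ** 32  # sequence numbers are 32-bit unsigned integers


class Segment(NamedTuple):
    sender: str
    flags: Tuple[str, ...]
    seq: int
    ack: Optional[int]  # None when the ACK flag is not set


def three_way_handshake(client_isn: int, server_isn: int) -> List[Segment]:
    """Return the three header summaries exchanged during connection setup."""
    # Step 1: client sends SYN carrying its ISN; no valid acknowledgement yet.
    syn = Segment("client", ("SYN",), client_isn, None)
    # Step 2: server acknowledges C+1 and announces its own ISN.
    syn_ack = Segment("server", ("SYN", "ACK"), server_isn, (client_isn + 1) % MOD)
    # Step 3: client acknowledges S+1; its own sequence number is now C+1.
    ack = Segment("client", ("ACK",), (client_isn + 1) % MOD, (server_isn + 1) % MOD)
    return [syn, syn_ack, ack]
```

For example, three_way_handshake(100, 5000) yields a SYN with sequence 100, a SYN+ACK with sequence 5000 and acknowledgement 101, and a final ACK with sequence 101 and acknowledgement 5001.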

The connection is released by setting the FIN flag, or, in error conditions, the RST (reset) flag. Each side closes its direction of the data stream separately. A packet with the FIN flag set is a promise to send no more data in that direction. However, the other direction of data can remain open. This is called a half-close. Once the other side sends a FIN, it is ack'ed and the connection is considered fully dis-established. The FIN packet, similar to the SYN packet, consumes one sequence number. So a FIN with sequence number X is ack'ed with a response packet with acknowledgement number X+1.
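The way SYN and FIN each consume one sequence number can be written as a single rule for the acknowledgement number a packet should elicit. The function name expected_ack is an assumption for this example.

```python
MOD = 2 ** 32  # sequence numbers are 32-bit unsigned integers


def expected_ack(seq: int, payload_len: int,
                 syn: bool = False, fin: bool = False) -> int:
    """Acknowledgement number that acknowledges everything in this packet.

    Data bytes consume one sequence number each; SYN and FIN consume one each.
    """
    return (seq + payload_len + int(syn) + int(fin)) % MOD
```

A FIN with sequence number X and no data gives X+1, and a plain 100-byte data segment starting at X gives X+100.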

The purpose of random ISNs is twofold: to prevent TCP hijacking, and to keep a reused connection from confusing host software. If the connection (the four numbers source IP, destination IP, source port, destination port) gets reused, the random ISN means that any old segments that manage to survive and are delivered, like some sort of ice-age postcard from the past, are not likely to be mistaken for segments of the current connection. The ISNs will be too different and the segment will be ignored. Hijacking is where a bad guy tries to predict the next packet that a client or server awaits, and sends it with forged addresses, correct except for substituted, bad-guy data in the segment. In order for this to work, the attacker must guess the random ISN, which is unlikely.

A further attempt to avoid the reuse of a connection is that after close the connection must not be used again for a period of twice the length of time a segment could possibly survive out there on the internet, passing from router to router. This is called the 2MSL wait.
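A sketch of the bookkeeping this implies: remember when each connection was closed and refuse to reuse its four values until twice the MSL has passed. The class name TimeWaitTable, the tuple layout and the 120-second MSL used in the usage note are assumptions for this example; real systems choose their own MSL.

```python
from typing import Dict, Tuple

# (source IP, source port, destination IP, destination port)
FourTuple = Tuple[str, int, str, int]


class TimeWaitTable:
    """Tracks closed connections that are still inside the 2MSL wait."""

    def __init__(self, msl_seconds: float) -> None:
        self.wait = 2 * msl_seconds
        self.closed_at: Dict[FourTuple, float] = {}

    def close(self, conn: FourTuple, now: float) -> None:
        """Record the time at which this connection was closed."""
        self.closed_at[conn] = now

    def may_reuse(self, conn: FourTuple, now: float) -> bool:
        """True if the connection was never closed or 2MSL has elapsed."""
        closed = self.closed_at.get(conn)
        return closed is None or now - closed >= self.wait
```

With TimeWaitTable(120.0), a connection closed at time 10 may not be reused at time 249 but may be reused at time 250.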

Making TCP work

Important concepts:

  1. Resend timers and Karn's algorithm
  2. Congestion control: Multiplicative decrease and slow start
  3. Silly window syndrome and Nagle's Algorithm

Karn's algorithm

A lost packet elicits no acknowledgment. After a certain amount of time, the sender of the segment resends. Determining the correct wait time before resending is an important issue and is covered in detail by the TCP protocol.

A segment is retransmitted if an acknowledgement is not received within a time set as a multiple of the dynamically determined Round Trip Time (RTT). TCP requires that the RTT be dynamically determined using the averaging technique:

    RTT = alpha * RTT + (1-alpha) * New-Sample-RTT

where alpha is set by the TCP protocol. A retransmission occurs if an ack is not received after time:

    Timeout = beta * RTT

where beta is also dynamically determined by estimating the network variance.
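The two formulas translate directly into code. The function names are assumptions for this example, and the values alpha = 0.875 and beta = 2.0 in the usage note are illustrative choices, not values fixed by the text above.

```python
def update_rtt(rtt: float, new_sample_rtt: float, alpha: float) -> float:
    """RTT = alpha * RTT + (1 - alpha) * New-Sample-RTT."""
    return alpha * rtt + (1 - alpha) * new_sample_rtt


def retransmit_timeout(rtt: float, beta: float) -> float:
    """Timeout = beta * RTT."""
    return beta * rtt
```

Starting from an RTT of 100 ms, a new sample of 200 ms with alpha = 0.875 moves the estimate to 0.875 * 100 + 0.125 * 200 = 112.5 ms, and with beta = 2.0 the timeout becomes 225 ms. A large alpha makes the estimate change slowly; a small alpha makes it follow each sample closely.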

A problem remains: what to accept as a New-Sample-RTT. If a segment is lost and retransmitted, the acknowledgement for the resent packet should not be used as a New-Sample-RTT, because it is ambiguous. It is impossible to know if this is a late acknowledgement to the first attempt at sending the segment.

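Karn's algorithm: take a New-Sample-RTT only from segments that were sent exactly once. Acknowledgements for retransmitted segments are not used to update the RTT; instead, the Timeout is backed off (increased) on each retransmission until an acknowledgement arrives for a segment that was not retransmitted.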
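A sketch of an RTT estimator that handles the ambiguity in the way Karn's algorithm is usually stated: samples from retransmitted segments are ignored, and the timeout is doubled on each retransmission until a clean sample arrives. The class name, the doubling factor and the default alpha and beta values are assumptions for this example.

```python
class RttEstimator:
    """RTT averaging that skips ambiguous samples and backs off the timer."""

    def __init__(self, initial_rtt: float, alpha: float = 0.875,
                 beta: float = 2.0) -> None:
        self.rtt = initial_rtt
        self.alpha = alpha
        self.beta = beta
        self.backoff = 1  # multiplier applied to the timeout after a loss

    def on_ack(self, sample_rtt: float, was_retransmitted: bool) -> None:
        """Process an acknowledgement for a segment."""
        if was_retransmitted:
            return  # ambiguous: the ack may belong to the first transmission
        self.rtt = self.alpha * self.rtt + (1 - self.alpha) * sample_rtt
        self.backoff = 1  # a clean sample ends the backoff

    def on_timeout(self) -> None:
        """Called when a segment is retransmitted after a timeout."""
        self.backoff *= 2

    def timeout(self) -> float:
        return self.beta * self.rtt * self.backoff
```

An acknowledgement for a retransmitted segment leaves the RTT estimate untouched, so one badly timed sample cannot corrupt it.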

Congestion window: a sender-side window

Congestion on the internet is a big problem which is handled by two mechanisms: Multiplicative decrease and Slow start.

The data sender keeps a congestion window and will not send bytes beyond this window. In the case of no congestion, the congestion window is opened up until it equals the advertised window size.

The discipline of multiplicative decrease requires that the congestion window be halved for each lost segment. Slow start recovers from a closed congestion window, and is also used at the start of any new connection. The slow-start technique opens the congestion window by one segment for each acknowledgement packet which arrives. When used on a new connection, the congestion window is initialized to a single segment.
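The two rules can be traced with a small simulation that counts the congestion window in segments. The function name simulate_cwnd and the event encoding ("A" for an arriving acknowledgement, "L" for a lost segment) are assumptions for this example, and the floor of one segment after halving is a simplification.

```python
from typing import List


def simulate_cwnd(events: str, advertised: int) -> List[int]:
    """Trace the congestion window, in segments, over a string of events.

    "A" is an acknowledgement arriving (slow start: open by one segment,
    but never beyond the advertised window); "L" is a lost segment
    (multiplicative decrease: halve, but keep at least one segment).
    """
    cwnd = 1  # a new connection starts with a single segment
    trace = [cwnd]
    for event in events:
        if event == "A":
            cwnd = min(cwnd + 1, advertised)
        elif event == "L":
            cwnd = max(cwnd // 2, 1)
        else:
            raise ValueError("unknown event: " + event)
        trace.append(cwnd)
    return trace
```

For the event string "AAAALAA" with an advertised window of 16 segments, the window grows 1, 2, 3, 4, 5, drops to 2 at the loss, then grows again to 3 and 4.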

Nagle's algorithm

A further problem remains, and it deals with the tradeoff between two categories of data: interactive versus bulk. Interactive data require low latency: data should be sent as soon as it is available. Examples are keystrokes or mouse clicks. Bulk data require the highest overall efficiency and hence work best if sent in a few large segments, even if the sender must delay sending while assembling a large segment.

A related problem is called silly window syndrome. Suppose the window is full on the receiver side, and suppose the receiver now leisurely consumes one byte at a time. It could be that each byte consumed opens the window by one byte, and the receiver sends an ack packet to inform the sender of one byte of room. And it could be that the sender then sends the one byte. And so it goes, with sender and receiver taking the trouble to keep the window closed, at great cost in network overhead. What should happen is that the window should be substantially emptied before a new window size is advertised.

It is therefore required for the receiver to delay acknowledgements until a substantial amount of new space is available inside the window. The send side helps avoid the problem by using Nagle's algorithm:

Nagle's algorithm: Hold off the sending of data until either all previously sent data has been acknowledged, or enough data has accumulated to fill a full-sized segment.
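The sender-side decision can be sketched as a single test, following the usual statement of Nagle's algorithm: small data waits while earlier data is still unacknowledged, unless a full-sized segment has accumulated. The function name nagle_can_send and its parameters are assumptions for this example.

```python
def nagle_can_send(buffered_bytes: int, max_segment_bytes: int,
                   unacked_bytes: int) -> bool:
    """Decide whether the sender should transmit now.

    Send if a full-sized segment is ready, or if nothing sent earlier
    is still waiting for an acknowledgement. Otherwise keep buffering.
    """
    if buffered_bytes <= 0:
        return False  # nothing to send
    if buffered_bytes >= max_segment_bytes:
        return True   # bulk case: a full segment is efficient to send
    return unacked_bytes == 0  # interactive case: at most one small segment in flight
```

A single keystroke is sent at once on an idle connection, but a second keystroke typed before the first is acknowledged waits and is sent together with whatever else has accumulated by then.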

Network address translation (NAT)

Although not a basic topic, network address translation, also known as NAT, is important for making sense of commonly encountered situations on the Internet. The IP address that the destination host sees is often not the IP address belonging to the source host. Along the path, a NAT box has replaced the local IP of the source host with a global IP, usually temporarily and dynamically assigned for that source host. The NAT box is responsible for setting up a correspondence between these local and global IPs, continuing the substitution with each further packet, and doing the reverse substitution, putting back the local IP, in response packets from the destination host.

There are roughly two variants on this idea. Sometimes a different global IP address is chosen for each different local IP. Other times, many local IP addresses will receive the same global IP address. In the first variant, although the IP address is changed by the NAT, the port number is not. In the second variant, which in Cisco terminology is PAT, port address translation, the port number might need to be changed, since many local IPs are competing for one global IP, and the port numbers could overlap. In PAT, the reverse IP substitution requires looking at the port number, to see which local IP has claimed the port number.
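A toy model of the second variant, PAT. The class name Pat, the addresses and the policy of trying the next higher port on a clash are all assumptions for this example; a real NAT box also tracks the remote endpoint, timeouts and the valid port range.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


class Pat:
    """Many local (IP, port) pairs share one global IP address."""

    def __init__(self, global_ip: str) -> None:
        self.global_ip = global_ip
        self.out: Dict[Endpoint, int] = {}   # local endpoint -> global port
        self.back: Dict[int, Endpoint] = {}  # global port -> local endpoint

    def outbound(self, local_ip: str, local_port: int) -> Endpoint:
        """Translate the source of a packet leaving the local network."""
        key = (local_ip, local_port)
        if key not in self.out:
            port = local_port
            while port in self.back:  # already claimed by another local host
                port += 1
            self.out[key] = port
            self.back[port] = key
        return (self.global_ip, self.out[key])

    def inbound(self, global_port: int) -> Optional[Endpoint]:
        """Reverse substitution: which local endpoint claimed this port?"""
        return self.back.get(global_port)
```

Two local hosts that both pick source port 40000 end up with different global ports, and the reverse lookup on the global port recovers the right local host.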

NAT treats connection establishments initiated from the outside (the global IP side) very differently from establishments initiated from the inside. Going from the inside out, the global-local IP binding can be made dynamically and on demand. However, to establish connections from the outside to the inside, the NAT box must be configured with a fixed, so-called static, global-local IP binding (or global port/local IP binding, in the case of PAT). People who have home ADSL routers might be familiar with this. My AirPort has a configuration panel called "Port Mapping", which allows me to associate with a Public Port a Private Address and Private Port.
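The handling of an establishment attempt arriving from the outside can be sketched as a lookup in the static port mapping. The function name, the table layout and the addresses are assumptions for this example.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (private address, private port)


def accept_inbound_syn(public_port: int,
                       port_mapping: Dict[int, Endpoint]) -> Optional[Endpoint]:
    """Return the private endpoint for an outside-initiated connection.

    Without a static binding for the public port there is nowhere to
    forward the packet, so the result is None and the attempt is refused.
    """
    return port_mapping.get(public_port)
```

With the mapping {8080: ("10.0.0.2", 80)}, an outside connection to public port 8080 reaches the private web server, while one to any other port is refused. With an empty mapping, every outside-initiated connection is refused.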

NAT seems to have begun life as a way of conserving IP addresses. For instance, I have many computers at home sharing one ADSL modem, and thanks to PAT, it only costs the world at large one IP address. Without NAT, I would need one IP address per machine. However, it is a powerful security management tool as well. By refusing, controlling, and monitoring bindings at the NAT, strict controls can be placed on the communications of a particular machine. My AirPort has no static bindings, so any TCP session must have been initiated by my machine's actions. This lowers my machine's vulnerability a great deal.

Conclusion

TCP/IP has been a wildly successful networking protocol. It is successful because the engineering assumptions were good, and the response to those assumptions effective: making IP a packet-oriented, best-effort delivery service, with independent routing decisions being made along the route; and, in TCP, the use of piggy-back acks, a sliding window to pipeline segments, and a scheme for byte acknowledgement which is robust and efficient. It is also successful because its deployment was followed by a series of fixes such as Karn's and Nagle's algorithms, and the tuning of parameters such as the alpha in the RTT computation and the beta in the timeout computation.

On top of TCP many services can be built: Remote Procedure Call (RPC), or Remote Method Call; SMB file sharing and printer sharing services; HTTP for the web, FTP for file transfer, ssh and telnet for terminal communications. These are considered level 5 and up; the levels at this point get a bit poorly defined and less standardized. Levels 3 and 4 are pretty much dominated by IP, UDP and TCP. Other level 4 protocols, such as IBM/Microsoft NetBIOS, respond to this by creating shim layers such as NBT, NetBIOS over TCP, which allows them to ride the success of TCP/IP while maintaining their code and style of network communications.


Appendix

Amazon.com name resolve

moonachie:~ burt$ date
Fri Mar  7 11:27:52 EST 2008
moonachie:~ burt$ dig www.amazon.com

; <<>> DiG 9.3.4 <<>> www.amazon.com
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57737
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 4

;; QUESTION SECTION:
;www.amazon.com.                        IN      A

;; ANSWER SECTION:
www.amazon.com.         24      IN      A       72.21.210.11

;; AUTHORITY SECTION:
www.amazon.com.         6757    IN      NS      ns-923.amazon.com.
www.amazon.com.         6757    IN      NS      ns-911.amazon.com.
www.amazon.com.         6757    IN      NS      ns-912.amazon.com.
www.amazon.com.         6757    IN      NS      ns-921.amazon.com.

;; ADDITIONAL SECTION:
ns-911.amazon.com.      402     IN      A       207.171.178.13
ns-912.amazon.com.      9       IN      A       207.171.191.123
ns-921.amazon.com.      402     IN      A       72.21.192.209
ns-923.amazon.com.      402     IN      A       72.21.204.208

;; Query time: 1 msec
;; SERVER: 172.20.0.6#53(172.20.0.6)
;; WHEN: Fri Mar  7 11:27:58 2008
;; MSG SIZE  rcvd: 196

moonachie:~ burt$ 

IP Header, from RFC 791:

    0                   1                   2                   3   
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Source Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Destination Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                    Example Internet Datagram Header

                               Figure 4.
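The fixed 20-byte part of the header in the figure can be unpacked with Python's struct module. This is a sketch: the names Ipv4Header and parse_ipv4_header are assumptions, only some of the fields are returned, and options and checksum verification are left out.

```python
import socket
import struct
from typing import NamedTuple


class Ipv4Header(NamedTuple):
    version: int
    ihl: int            # header length in 32-bit words
    total_length: int
    ttl: int
    protocol: int       # 6 means TCP
    src: str
    dst: str


def parse_ipv4_header(raw: bytes) -> Ipv4Header:
    """Unpack the first 20 bytes of an IP datagram (network byte order)."""
    if len(raw) < 20:
        raise ValueError("an IP header is at least 20 bytes")
    (ver_ihl, _tos, total_length, _ident, _flags_frag,
     ttl, protocol, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return Ipv4Header(
        version=ver_ihl >> 4,      # high four bits of the first byte
        ihl=ver_ihl & 0x0F,        # low four bits of the first byte
        total_length=total_length,
        ttl=ttl,
        protocol=protocol,
        src=socket.inet_ntoa(src),
        dst=socket.inet_ntoa(dst),
    )
```

The format string follows the figure row by row: two single bytes and a 16-bit length, two 16-bit fields, two single bytes and a 16-bit checksum, then the two 4-byte addresses.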

TCP Header, from RFC 793:

  TCP Header Format

    0                   1                   2                   3   
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Acknowledgment Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |           |U|A|P|R|S|F|                               |
   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
   |       |           |G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                            TCP Header Format

          Note that one tick mark represents one bit position.

                               Figure 3.
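The fixed 20-byte part of the TCP header in the figure can be unpacked the same way, which makes the sequence number, acknowledgement number, flags and window discussed above visible in a captured packet. This is a sketch: the names TcpHeader and parse_tcp_header are assumptions, and options and checksum verification are left out.

```python
import struct
from typing import FrozenSet, NamedTuple

# Flag bits in the low six bits of the 16-bit word that follows the
# acknowledgement number, as laid out in the figure.
FLAG_BITS = (("FIN", 0x01), ("SYN", 0x02), ("RST", 0x04),
             ("PSH", 0x08), ("ACK", 0x10), ("URG", 0x20))


class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int
    ack: int
    data_offset: int         # header length in 32-bit words
    flags: FrozenSet[str]
    window: int


def parse_tcp_header(raw: bytes) -> TcpHeader:
    """Unpack the first 20 bytes of a TCP segment (network byte order)."""
    if len(raw) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    (src_port, dst_port, seq, ack, offset_flags,
     window, _checksum, _urgent) = struct.unpack("!HHIIHHHH", raw[:20])
    flags = frozenset(name for name, bit in FLAG_BITS if offset_flags & bit)
    return TcpHeader(src_port, dst_port, seq, ack,
                     offset_flags >> 12, flags, window)
```

A header whose flags come out as just SYN is the first packet of a three-way handshake; its acknowledgement field carries no valid value.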