Techniques
1. Dedicated Cores for TCP
Unlike conventional network stacks that share CPU cores between the application and TCP processing, we move TCP processing to a dedicated set of cores. Applications communicate asynchronously with TAS through in-memory queues. This split provides two benefits: 1) Full cache isolation between applications and TAS for data and instruction caches, TLB and branch predictors, avoiding cache pollution in either direction. 2) Applications and TCP processing operate in parallel, thereby reducing tail latency.
2. Separate Fast Path and Slow Path
We further separate TCP processing into a streamlined fast-path that performs all per packet operations, a slow path for less frequent and periodic tasks such as connection establishment and timeouts, and a small application library that provides the sockets interface. The fast path minimizes the CPU cycles required to receive and transmit TCP packets and because of its simplified nature is enables additional optimizations, such as careful cache management and simplified control flow.
3. Minimal Connection State
Splitting out a separate fast path, also allows us to minimize per-connection state. In TAS, the connection state used by the fast path amounts to two cache lines per connections consisting of a flat set of integer fields (no complicated data structures). As a result, TAS scales to large numbers of connections.
4. Centralized Packet Scheduling
We schedule TCP packet transmissions centrally and with precise timing. This allows TAS to enforce per-flow fairness under congestion and keep tail latency low even with large numbers of connections.