Throughput vs. Latency in System Design

In system design, latency and throughput are critical counterparts, each influencing the other in shaping the performance landscape. Latency, the pause between request and response, defines the user experience, while throughput, the rate at which the system completes work, determines how efficiently it operates at scale. Navigating the balance between minimizing latency for real-time interactions and maximizing throughput for scalability is a core requirement for system designers.

Motivation

Latency and throughput are two commonly used performance metrics in large-scale systems. When we say that an API call from the client to the service and back should complete in a few milliseconds, we're talking about latency; when we say that a specific service can process millions of requests per second, we're talking about the service's throughput.

Let's take an analogy from the world of freight trains. Assume a freight train has 100 container cars, and it takes 24 hours to move those 100 containers from source to destination. In this example, the latency is 24 hours, while the throughput is 100 containers per day. Now, if we stack the containers so that there are 200 in total, and this new train also takes 24 hours to reach its destination, we've doubled our throughput (200 containers per day). However, we would have to add locomotives to keep the latency at 24 hours; otherwise, the heavier train will take longer to reach its destination.

On the other hand, if we double the train's speed and reduce the number of containers to 50, it might reach its destination in 12 hours (reduced latency), but the throughput remains the same: about 4.17 containers per hour, or 100 per day.
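
To make the arithmetic concrete, here is a minimal sketch in Python. The numbers are just the train analogy's, not real measurements:

```python
# Throughput vs. latency for the freight-train analogy.
# Latency    = end-to-end trip time.
# Throughput = containers delivered per unit of time.

def throughput_per_day(containers: int, trip_hours: float) -> float:
    """Containers delivered per 24-hour day."""
    return containers / trip_hours * 24

scenarios = {
    "baseline (100 containers, 24 h)": (100, 24),
    "stacked  (200 containers, 24 h)": (200, 24),
    "faster   ( 50 containers, 12 h)": (50, 12),
}

for name, (containers, hours) in scenarios.items():
    print(f"{name}: latency = {hours} h, "
          f"throughput = {throughput_per_day(containers, hours):.0f} containers/day")
```

Doubling the load at the same speed doubles throughput without changing latency, while halving the trip time with half the load halves latency but leaves throughput unchanged.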

The analogy above highlights how latency and throughput can interact with each other.

The concepts of latency and throughput permeate the technology stack. Let's discuss throughput vs. latency in detail.

Latency

Latency is commonly used in computing and networking to describe the delay or lag in data transmission or a system's response time. It represents the time it takes for a packet of data to travel from its source to its destination or for a request to get a response from a service. Latency is typically measured in units of time, such as milliseconds (ms), microseconds (μs), or even nanoseconds (ns), depending on the context. For example, instruction execution time inside a processor is on the order of nanoseconds, while a typical API call's response time can range from a few milliseconds to seconds.

There are several types of latency, each relevant to different aspects of technology and communication:

  • Network latency: Network latency refers to the delay that occurs when data packets travel from one point in a network to another. It can be caused by various factors, including the physical distance between devices, the number of network hops, and the processing time at each hop. Network latency can be categorized into four main types:
    • Transmission latency: This is the time it takes to put the data on the transmission medium, including encoding and packetization. Transmission latency depends on the physical properties of the medium, such as the link's bandwidth. For example, a given grade of copper wire can support up to 10 Gbps, while certain optical fiber cables can carry up to 60 Tbps (terabits per second).
    • Propagation latency: This is the time it takes for data to travel from the sender to the receiver. It is determined by the physical distance between the devices and the signal speed in the transmission medium. For example, an electrical signal on copper wire travels at about 200 million meters per second.
    • Node processing latency: This is the time it takes for network devices (routers, switches, and endpoints) to process data packets, including routing decisions and data manipulation.
    • Queueing latency: This is the time data waits inside a network switch while the outgoing link is unavailable. When a network is busy with many users, queueing latency can increase. We can understand queueing latency using the analogy of driving from home to work during the morning rush hour: queues build up at the intersections, and we may have to wait longer for our turn to cross. A rough calculation combining these network delays appears in the sketch after this list.
  • Application latency: Application latency refers to the time a software application or system takes to respond to a user’s input or request. It includes factors like the processing time within the application, time spent over the network, database queries, and other computational tasks. High application latency can result in slow response times for users, which can be particularly problematic for real-time applications like online gaming and video conferencing.
  • Disk latency: Disk latency is the delay in reading or writing data to storage devices, like hard drives or solid-state drives. It’s influenced by factors such as seek time, rotational delay (for HDDs), and data transfer speed. High disk latency can slow down file access and system performance.
  • Memory latency: Memory latency refers to the delay in accessing data from computer memory (RAM). It includes factors such as the time it takes to locate and transfer data within memory modules. High memory latency can impact overall system performance.
  • Display latency: Display latency, often important in gaming and multimedia applications, is the delay between the graphics card sending a frame to the monitor and that frame being displayed on the screen. This can include factors like input lag and pixel response time.
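
As a rough illustration of how the network latency components above add up, here is a back-of-the-envelope sketch in Python. The link speed, packet size, and distance are assumed values for illustration, not figures from any particular network:

```python
# Back-of-the-envelope network latency for a single 1,500-byte packet
# sent over an assumed 1 Gbps link spanning 1,000 km of copper.
PACKET_BITS  = 1_500 * 8        # packet size in bits
LINK_BPS     = 1_000_000_000    # link bandwidth: 1 Gbps
DISTANCE_M   = 1_000_000        # 1,000 km
SIGNAL_SPEED = 200_000_000      # ~200 million meters/second in copper

transmission_delay = PACKET_BITS / LINK_BPS     # time to put the bits on the wire
propagation_delay  = DISTANCE_M / SIGNAL_SPEED  # time for the signal to travel

print(f"transmission delay: {transmission_delay * 1e6:.0f} µs")  # ~12 µs
print(f"propagation delay:  {propagation_delay * 1e3:.0f} ms")   # ~5 ms
# Node-processing and queueing delays come on top of these two, and
# queueing delay in particular grows as the network gets busier.
```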

Latency is a critical consideration in many technological applications. Low-latency systems are desirable, especially for real-time and interactive applications, because they provide a more responsive and efficient user experience. Reducing latency often involves optimizing hardware, software, and network configurations to minimize delays and ensure faster data transmission and processing.

Latency values determine whether an interaction feels real-time. For example, in VoIP applications, a one-way delay of more than about 150 ms is noticeable to humans, and if latency increases much further, it can make such an app unusable.

Note: Latency within a data center can be a few microseconds, thanks to a technique called remote direct memory access (RDMA). On the other extreme, for NASA's Mars rover, the one-way communication latency from Earth to Mars is multiple minutes, which makes it impossible to communicate with the rover from Earth in real time.
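
For the Mars example, the one-way delay is dominated by propagation at the speed of light. The distances below are approximate round numbers (not from this article), but they show why real-time control is out of the question:

```python
# One-way light-travel time from Earth to Mars at roughly its nearest
# and farthest approach (distances are approximate).
SPEED_OF_LIGHT_KM_S = 300_000         # ~3 x 10^8 m/s
DISTANCES_KM = {"nearest": 55_000_000, "farthest": 400_000_000}

for label, distance_km in DISTANCES_KM.items():
    minutes = distance_km / SPEED_OF_LIGHT_KM_S / 60
    print(f"{label}: ~{minutes:.0f} minutes one way")
# Output: roughly 3 minutes at the nearest approach, ~22 at the farthest.
```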

The following illustration shows latency from AWS's US-West (Northern California) region to several other data centers across the globe. We can see that latency increases with distance (the propagation-delay factor).

Latency as measured from AWS's US-West 1 (California) region to AWS regions in Canada (80 ms), Virginia (62 ms), Ireland (128 ms), and the UK (148 ms)

Throughput

In the context of system design, throughput refers to the rate at which a system can complete work, that is, its capacity to handle and execute tasks efficiently. Throughput is often expressed as the number of tasks completed per unit of time. For example, in a database system, throughput might be measured in database transactions processed per second; in a web server context, it could be measured in requests served per second. The choice of throughput metric depends on the specific characteristics and requirements of the system under consideration.

Throughput indicates how efficiently a system can process and deliver results. Higher throughput generally implies better performance and responsiveness.

Throughput can be influenced by the system’s ability to handle multiple tasks concurrently or in parallel. Systems that can efficiently utilize parallel processing or concurrency often exhibit higher throughput.

Maximizing throughput involves optimizing the use of system resources, such as CPU, memory, and network bandwidth. Efficient resource utilization contributes to higher throughput without sacrificing performance.

Throughput is an important factor when designing scalable systems. A scalable system should be able to handle increased workloads without a proportional decrease in performance, maintaining or increasing throughput as demand grows.

Each of the latency examples in the section above has a throughput counterpart:

  • Network throughput: Network throughput tells us how much data can be moved between a source and a destination per unit of time. It's often measured in bits per second.
  • Requests served per second by a server: A busy web server receives thousands of requests per second and can send responses back to clients at a specific rate. Both the request and response rates are a kind of system throughput (a rough way to measure this is sketched after the figure below).
  • Disk throughput: The maximum rate at which a disk system can read or write data over a sustained period is its sustained throughput.
  • Memory throughput: The maximum rate at which a system's memory bus can feed data to the processor is the memory throughput. For example, Apple's M3 Max chip has a memory bandwidth of 300 GB per second.
  • Display throughput: The higher the display resolution, the more bandwidth we usually need. According to one estimate, we need about 25 Mbps for 4K content and about 100 Mbps for 8K content.

Network bandwidth is the maximum capacity a communication channel can support, while network throughput is the portion of that capacity actually being used
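
To see throughput as "work completed per unit of time" in practice, here is a toy, single-threaded benchmark in Python. The handle_request function is a stand-in for real work, not part of any system discussed above:

```python
import time

def handle_request() -> None:
    """Stand-in for real request-handling work."""
    sum(range(1_000))  # a small, fixed amount of CPU work

DURATION_S = 1.0
completed = 0
start = time.perf_counter()
while time.perf_counter() - start < DURATION_S:
    handle_request()
    completed += 1
elapsed = time.perf_counter() - start

print(f"throughput: {completed / elapsed:,.0f} requests/second")
# On a single thread, average latency per request is roughly
# elapsed / completed, so higher per-request latency means lower throughput.
```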

Relationship between latency and throughput

Inverse relationship: In general, there’s an inverse relationship between latency and throughput. As we push the system to increase the throughput, latency also starts increasing.

Latency and throughput are often inversely related to each other

Example: Consider a network connection. If we want to maximize throughput, we might send large batches of data in each transmission (high throughput), but this could introduce higher latency for the packets to arrive at the destination due to possible congestion at some network switch. Conversely, if we send smaller batches, we might reduce latency for individual packets but might sacrifice overall throughput.
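
The batching trade-off can be sketched with a toy model. The per-transmission overhead and per-item cost below are assumed numbers chosen for illustration: larger batches amortize the fixed overhead and raise throughput, but every item waits for its whole batch, so latency rises too.

```python
# Toy model: each transmission pays a fixed overhead plus a per-item cost.
FIXED_OVERHEAD_MS = 5.0   # assumed per-transmission overhead (headers, handshakes, ...)
PER_ITEM_MS       = 0.1   # assumed time to send one item

for batch_size in (1, 10, 100, 1000):
    batch_time_ms = FIXED_OVERHEAD_MS + PER_ITEM_MS * batch_size
    throughput    = batch_size / (batch_time_ms / 1000)  # items per second
    latency_ms    = batch_time_ms                        # an item may wait for the whole batch
    print(f"batch={batch_size:4d}: throughput ≈ {throughput:8,.0f} items/s, "
          f"latency ≈ {latency_ms:7.1f} ms")
# Throughput climbs toward 1 / PER_ITEM_MS as batches grow, while latency keeps rising.
```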

Saturation point: Systems often have a saturation point beyond which pushing for more throughput leads to increased latency due to resource limitations or contention. Networks are a good example: as the offered load increases, throughput initially rises and then plateaus; pushing further can decrease throughput and increase latency because packets dropped at congested switches trigger retransmissions.

Latency vs. throughput in the context of offered load
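
One standard way to see the saturation point, borrowed from queueing theory rather than from this article, is the M/M/1 model, where the average latency is 1 / (μ - λ) for service rate μ and arrival rate λ. As the offered load approaches capacity, latency grows without bound:

```python
# Average latency in an M/M/1 queue: W = 1 / (mu - lam), valid for lam < mu.
SERVICE_RATE = 1000.0  # requests/second the system can handle (assumed value)

for utilization in (0.10, 0.50, 0.80, 0.95, 0.99):
    arrival_rate = utilization * SERVICE_RATE
    latency_ms = 1.0 / (SERVICE_RATE - arrival_rate) * 1000
    print(f"offered load {utilization:.0%}: average latency ≈ {latency_ms:6.1f} ms")
# At 10% load the average latency is ~1.1 ms; at 99% load it is ~100 ms.
# Throughput has nearly saturated while latency has grown about a hundredfold.
```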

We see a similar relationship for CPU utilization. As we increase the degree of multiprogramming, utilization rises at first, but beyond a point the system starts thrashing because there aren't enough resources available to support any more processes.

Thrashing in OS

Trade-offs: System designers often face trade-offs between optimizing for low latency or high throughput because improving one can come at the expense of the other.

Conclusion

Latency and throughput are important performance metrics for smaller components like disks and large-scale systems like Google Search. We desire the lowest possible latency with the highest possible throughput from our systems. Often, there’s a throughput vs. latency trade-off, and a designer might have to sacrifice one for the other.
