Picture this: a global esports final draws 50 million concurrent viewers, and the chat explodes with 200,000 messages per second during a game-winning play. Within milliseconds, every single viewer must see that wave of excitement ripple across their screen. This is the engineering reality that platforms like Twitch, YouTube Live, and TikTok face daily.
Traditional comment systems where users refresh pages or load content on demand collapse under these conditions. Building a live comments system means rethinking everything from connection protocols to message ordering guarantees. It also means rethinking consistency models across global regions and the precise throughput calculations that determine whether your infrastructure survives or crumbles.
This guide walks you through the complete architecture of real-time comment delivery at massive scale. You’ll learn how to design ingestion pipelines that process messages in under 30 milliseconds and build pub/sub systems that fan out to millions of subscribers. You’ll also learn how to implement graceful degradation strategies when traffic spikes threaten system stability and calculate the exact bandwidth and storage requirements for production deployments.
Whether you’re preparing for a System Design interview or architecting production infrastructure, these patterns form the foundation of every major live streaming platform’s chat system.
The following diagram illustrates the end-to-end flow of a live comments system, from user submission through delivery to millions of connected viewers.
Functional and non-functional requirements
Before diving into architecture, you need crystal-clear requirements. In interviews, this step demonstrates strategic thinking. Rushing to solutions without understanding constraints is a red flag. In production, unclear requirements lead to protocol mismatches, broken fan-out logic, and infrastructure that can’t handle real traffic patterns. The requirements also establish your delivery guarantees, which affect every downstream decision from storage selection to client reconnection logic.
Functional requirements
Real-time comment posting and delivery forms the core capability. Users submit messages that must reach all active viewers within milliseconds. Each livestream operates as an isolated channel with its own message feed to prevent cross-contamination between events.
The system must support fan-out to thousands or millions of concurrent viewers per stream. Popular events regularly exceed one million active connections. This creates a read-heavy workload where the read/write ratio can exceed 10,000:1 during peak moments. Every single posted comment generates thousands of delivery operations.
Message ordering requires careful consideration of trade-offs between consistency and performance. Strict global ordering across millions of users is prohibitively expensive because it would require distributed coordination that adds unacceptable latency.
Most production systems implement best-effort ordering with server-assigned sequence numbers per stream. This provides partition-level guarantees rather than global consistency. These sequence numbers enable clients to detect gaps, request missed messages, and maintain local ordering even when network conditions cause out-of-order delivery. The sequence number becomes your primary mechanism for catch-up and reconnection logic. Clients can use small buffers (50-100ms) to reorder slightly out-of-sequence messages before rendering.
Moderation capabilities must operate in real-time without blocking the critical path. This includes spam detection using keyword filters and ML-based toxicity scoring, rate limiting to prevent message floods, shadow banning where offending users see their own comments but others don’t, and soft deletes that remove content while preserving audit trails. Optional features like reactions, pinned comments, badges, and slow mode (limiting users to one message every N seconds) enhance engagement but shouldn’t compromise core delivery latency.
Pro tip: When discussing requirements in interviews, explicitly state your delivery semantics. “At-least-once delivery” means clients may receive duplicates during reconnection, which they must deduplicate using message IDs. This is far cheaper than exactly-once guarantees and acceptable for chat systems where seeing a duplicate message is merely annoying rather than dangerous.
Non-functional requirements and scale targets
Latency targets define user experience and must be quantified precisely rather than described vaguely as “low latency.” Aim for under 500 milliseconds end-to-end from publish to delivery at the 95th percentile. Ingestion should complete in under 30 milliseconds, and fan-out should add no more than 50 milliseconds per hop. These targets require careful attention to every component in the pipeline because synchronous database writes or blocking moderation checks will immediately blow your latency budget.
Throughput and scale requirements for popular platforms are staggering and must drive your capacity planning. A single viral stream might support over one million concurrent viewers generating 50,000 messages per second. Multiply message rate by viewer count and that is roughly 50 billion message deliveries per second at the peak.
With an average message payload of 200 bytes (including user ID, display name, text, and metadata), ingestion bandwidth reaches 10 MB/second while fan-out bandwidth explodes to terabytes per second across your WebSocket fleet. Your architecture must handle these peaks without degradation. This means horizontal scaling at every layer and careful attention to connection limits per server.
Availability and fault tolerance expectations require multi-region deployment with specific SLA targets. Livestream audiences are global and expect uninterrupted interaction. Target 99.99% availability (roughly 52 minutes of downtime per year) with automatic failover completing within 30 seconds.
No single server failure should disrupt comment delivery. This requires multi-region redundancy, automatic failover, and self-healing infrastructure. The system must also handle unreliable client connections gracefully. Mobile users switching between WiFi and cellular, network drops, and browser tab hibernation are all common scenarios requiring seamless reconnection with catch-up.
| Requirement | Target | Implications |
|---|---|---|
| End-to-end latency | < 500ms p95 | Async processing, no synchronous DB writes on critical path |
| Ingestion latency | < 30ms | Pre-warm connection pools, minimize validation overhead |
| Concurrent viewers per stream | 1M+ | Sharded WebSocket servers, regional clusters |
| Message throughput per stream | 50K/sec | Partitioned pub/sub, multiple consumer groups |
| Availability SLA | 99.99% | Multi-region, automatic failover, no single points of failure |
| Storage retention | 7-30 days | Compressed NoSQL storage, time-based partitioning |
These constraints shape every architectural decision. The next section translates these requirements into a layered system architecture that separates concerns and enables independent scaling of each component.
High-level architecture and component design
A live comments system uses a distributed, multi-layer architecture where each component has clear responsibilities and can scale independently. The goal is ensuring any user’s comment moves from client through ingestion, into the pub/sub backbone, and out to delivery subscribers with minimal bottlenecks.
Separating these concerns allows each layer to scale based on its specific load characteristics. Connection servers scale with viewer count. Ingestion scales with message rate. Fan-out scales with the product of both.
Edge layer and API gateway
All traffic enters through a global load balancer or API gateway that handles routing, authentication, and protection before requests reach your application servers. The edge layer routes users to the nearest regional cluster to minimize latency. A user in Tokyo connects to Asia-Pacific servers rather than Virginia.
The edge layer also authorizes connections and requests using JWT validation or session tokens, applies DDoS mitigation and WAF rules to filter malicious traffic, manages TLS termination to offload encryption overhead from application servers, and forwards WebSocket upgrade requests to the appropriate connection brokers. For global deployments, consider using anycast routing or a CDN with WebSocket support (like Cloudflare or Fastly) to ensure users automatically connect to their nearest point of presence.
Real-world context: Twitch uses a sophisticated edge network that routes users to regional chat clusters based on geographic proximity and current load. During major events like The International or League of Legends Worlds, they pre-provision additional capacity in regions where they expect high viewership based on team locations and historical data.
Connection management layer
Once clients establish persistent connections (typically WebSockets), connection brokers maintain these sessions throughout the livestream duration. These servers track which users are watching which streams through subscription registries and manage millions of concurrent WebSocket connections with careful memory management.
They send messages over active sessions when fan-out workers deliver new comments, perform heartbeat checks to detect stale connections that didn’t close cleanly, and handle reconnections gracefully by accepting last-seen sequence numbers for catch-up. Connection servers are typically sharded by livestream ID using consistent hashing. This ensures all viewers of a particular stream connect to a predictable set of servers and simplifies message routing from the pub/sub layer.
Watch out: Connection state management is tricky at scale. Storing user session data on individual WebSocket servers creates problems during failover because that state is lost when the server restarts. Instead, maintain minimal state locally (just the socket reference and current subscriptions) and store subscription metadata in a shared store like Redis that survives server restarts and enables any server to resume a user’s session.
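A minimal sketch of that pattern, assuming redis-py and hypothetical key names (`subs:{user_id}`, `session:{user_id}`): the WebSocket server keeps only socket references locally and records subscription metadata in Redis so any server can resume the session after a failover.

```python
import redis

# Hypothetical shared registry; key names and TTLs are illustrative assumptions.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

SESSION_TTL_SECONDS = 6 * 60 * 60  # roughly one long livestream

def register_subscription(user_id: str, stream_id: str, server_id: str) -> None:
    """Record which stream a user watches and which server currently holds the socket."""
    pipe = r.pipeline()
    pipe.sadd(f"subs:{user_id}", stream_id)               # streams this user is subscribed to
    pipe.hset(f"session:{user_id}", mapping={"server": server_id})
    pipe.expire(f"subs:{user_id}", SESSION_TTL_SECONDS)
    pipe.expire(f"session:{user_id}", SESSION_TTL_SECONDS)
    pipe.execute()

def resume_subscriptions(user_id: str) -> set[str]:
    """After a reconnect (possibly to a different server), restore the user's subscriptions."""
    return r.smembers(f"subs:{user_id}")
```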
Ingestion service and durable event log
The ingestion service accepts user-submitted comments and transforms them into events for distribution through the fan-out pipeline. It authenticates users against your identity service, validates that messages meet length and format requirements, runs fast spam checks using pre-compiled keyword filters, and sanitizes text to remove XSS vectors and script tags. It then packages the comment into a structured event with a consistent schema, assigns a server-generated sequence number for ordering, and publishes to a durable log. The sequence number is critical: it provides ordering guarantees within a stream and lets clients detect missing messages during catch-up after reconnection.
The durable log (Kafka, Pulsar, or Redis Streams) serves as the system’s source of truth and provides capabilities that ephemeral pub/sub cannot match. Unlike pure pub/sub where missed messages are lost forever, a durable log retains events for a configurable window (typically 24-72 hours for live chat).
This enables replay during consumer restarts when fan-out workers crash, catch-up for reconnecting clients who need messages they missed, failure recovery when entire regions go offline, and audit trails for moderation review. Partitioning by stream_id ensures all messages for a given livestream flow through the same partition, maintaining strict ordering within that stream even as the system scales horizontally.
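A condensed sketch of that ingestion path, assuming redis-py for the per-stream counter and confluent-kafka for the durable log. The topic name, event schema, length limits, and keyword list are illustrative assumptions, and authentication and full sanitization are omitted for brevity.

```python
import json, re, time, uuid
import redis
from confluent_kafka import Producer

r = redis.Redis()
producer = Producer({"bootstrap.servers": "localhost:9092"})
BANNED = re.compile(r"\b(badword1|badword2)\b", re.IGNORECASE)  # placeholder keyword filter

def ingest_comment(stream_id: str, user_id: str, text: str) -> dict:
    # Fast synchronous checks: length, format, obvious spam. Stay well under the 30ms budget.
    if not (1 <= len(text) <= 500):
        raise ValueError("message length out of bounds")
    if BANNED.search(text):
        raise ValueError("message rejected by keyword filter")

    # Server-assigned, per-stream sequence number: the basis for ordering and catch-up.
    seq = r.incr(f"seq:{stream_id}")

    event = {
        "message_id": str(uuid.uuid4()),
        "stream_id": stream_id,
        "user_id": user_id,
        "seq": seq,
        "text": text,
        "ts": int(time.time() * 1000),
    }
    # Keying by stream_id keeps every message for a stream in one partition, preserving order.
    producer.produce("live-comments", key=stream_id.encode(), value=json.dumps(event).encode())
    producer.poll(0)  # serve delivery callbacks without blocking the request path
    return event
```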
The following diagram shows the ingestion pipeline from user submission through moderation to the durable log.
Pub/sub and fan-out architecture
The pub/sub layer is the core message distribution engine that routes comments from ingestion to all connected viewers. Technologies like Kafka, Redis Streams, Google Pub/Sub, NATS, and Pulsar can serve this role, each with different trade-offs around latency, durability, and operational complexity.
The system must provide high throughput (millions of messages per second across all streams), guarantee at-least-once delivery so no messages are silently dropped, and scale horizontally because popular livestreams create enormous fan-out traffic that can spike unpredictably.
Fan-out workers consume from the pub/sub system and push messages to connected clients. This represents the critical multiplication point in your architecture. Each WebSocket server subscribes to topics corresponding to the streams its connected users are watching. When a new comment arrives, the fan-out worker iterates through all local connections for that stream and pushes the message. One incoming message becomes millions of outgoing messages during popular events, which is why this layer must be horizontally scalable and carefully monitored for lag.
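A simplified fan-out worker, sketched with asyncio and an in-process connection registry rather than a specific WebSocket library; in production the queue would feed a real socket writer and the consume loop would read from Kafka or Redis Streams. All names here are illustrative.

```python
import asyncio, json
from collections import defaultdict

# stream_id -> set of per-connection outbound queues held by this server.
local_connections: dict[str, set[asyncio.Queue]] = defaultdict(set)

def attach(stream_id: str) -> asyncio.Queue:
    """Called when a viewer's WebSocket is accepted on this server."""
    q: asyncio.Queue = asyncio.Queue(maxsize=1000)  # bound memory per connection
    local_connections[stream_id].add(q)
    return q

async def fan_out(event: dict) -> None:
    """One inbound event becomes one outbound push per local viewer of that stream."""
    payload = json.dumps(event)
    for q in list(local_connections[event["stream_id"]]):
        try:
            q.put_nowait(payload)
        except asyncio.QueueFull:
            # Slow consumer: drop for this connection rather than let one laggard stall the stream.
            pass

async def consume_forever(pubsub_messages):
    """pubsub_messages is any async iterator of events from the pub/sub subscription."""
    async for event in pubsub_messages:
        await fan_out(event)
```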
Historical note: YouTube Live and Twitch both use push-based fan-out (fan-out on write) because it delivers the lowest latency for real-time interaction. Early chat systems used pull-based approaches where clients periodically fetched new comments, but users noticed even 2-3 second lags in chat responsiveness. The industry shifted to push-based architectures around 2012-2014 as WebSocket support became universal in browsers.
Understanding real-time delivery protocols is essential for choosing the right approach for your scale and use case. We’ll explore this in depth next.
Real-time delivery protocols and connection scaling
Choosing the appropriate real-time communication protocol is one of the most consequential decisions in live comments System Design. The protocol determines your latency floor, connection overhead, operational complexity, and which edge cases you’ll need to handle. Most production systems use WebSockets, but understanding alternatives helps you make informed trade-offs and defend your choices in interviews.
WebSockets as the industry standard
WebSockets provide a full-duplex bidirectional channel over a single TCP connection. They offer sub-millisecond messaging capability and minimal per-message overhead after the initial HTTP upgrade handshake. They’re compatible with all modern browsers, have mature server-side implementations in every major language (ws in Node.js, gorilla/websocket in Go, Netty in Java), and support both text and binary message formats. For live comments where clients both send messages and receive the full chat stream, WebSockets are almost always the right choice because the bidirectional nature matches the interaction pattern perfectly.
The challenges with WebSockets center on operational complexity at scale rather than protocol limitations. Managing millions of open connections requires careful memory management because each connection consumes server resources even when idle, typically 10-50KB per connection depending on your implementation.
Load balancer configuration becomes tricky because WebSocket connections are long-lived and stateful. This requires sticky sessions or connection-aware routing to ensure subsequent messages reach the same server. Server failovers must handle graceful migration of connections, and you need monitoring to detect zombie connections that consume resources without active users on the other end.
Server-Sent Events (SSE) offer a simpler alternative for scenarios where clients primarily receive messages rather than send them. SSE provides automatic reconnection support built into the browser, simpler server implementation than WebSockets, and works through HTTP/2 multiplexing for efficient resource usage. However, SSE only supports server-to-client communication, so you’d need a separate HTTP endpoint for posting comments. This hybrid approach works but adds complexity and slightly higher latency for the post path because each comment requires a full HTTP request/response cycle rather than using an already-open connection.
Long polling remains the fallback for environments where WebSockets are blocked by corporate firewalls, older proxy servers, or restrictive network configurations. The client makes an HTTP request, and the server holds it open until new messages arrive or a timeout occurs (typically 30 seconds). Then the client immediately reconnects. This approach is inefficient because each message delivery requires a full HTTP request/response cycle with headers and connection setup overhead. Use it only when WebSockets genuinely aren’t available, and implement automatic protocol negotiation so clients fall back transparently.
Watch out: Don’t assume WebSockets will work everywhere. Some corporate networks, mobile carriers, and countries with restrictive internet policies block or interfere with WebSocket connections. Always implement fallback mechanisms and test your system’s behavior when WebSocket upgrades fail. A robust client should attempt WebSocket first, fall back to SSE, and finally resort to long polling.
Connection scaling and sharding strategies
To manage millions of persistent connections, you need systematic approaches to distribution and resource management that go beyond simply adding more servers. Sharding WebSocket servers by livestream ID using consistent hashing ensures all viewers of a particular stream connect to a predictable set of servers. This simplifies message routing from the pub/sub layer and enables efficient fan-out.
Maintaining region-specific connection clusters reduces cross-continent latency so European users connect to European servers and Asian users connect to Asian servers. Consistent hashing distributes connection load evenly while allowing servers to be added or removed with minimal connection disruption. This is important for scaling up during events and scaling down afterward to reduce costs.
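A minimal consistent-hash ring sketch, assuming virtual nodes per server to smooth out distribution; only the mapping from stream_id to server is shown, not connection handoff.

```python
import bisect, hashlib

class HashRing:
    def __init__(self, servers: list[str], vnodes: int = 100):
        self._ring: list[tuple[int, str]] = []
        for server in servers:
            for i in range(vnodes):  # virtual nodes smooth out load distribution
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, stream_id: str) -> str:
        """All viewers of a stream land on the same, predictable server shard."""
        idx = bisect.bisect(self._keys, self._hash(stream_id)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["ws-eu-1", "ws-eu-2", "ws-eu-3"])
print(ring.server_for("stream-12345"))  # stable as long as the server set is unchanged
```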
Heartbeat mechanisms detect dropped connections that didn’t close cleanly. This happens frequently with mobile users who lose connectivity or close apps without proper disconnection. Clients send periodic ping frames (typically every 30 seconds), and servers mark connections as dead if no ping arrives within the timeout window (typically 90 seconds or 3 missed heartbeats). This prevents resource leaks from zombie connections that would otherwise accumulate and eventually exhaust server capacity. The heartbeat interval involves a trade-off. Shorter intervals detect dead connections faster but increase bandwidth overhead, especially significant when you have millions of connections each sending heartbeats.
Reconnection logic on the client side handles the inevitable network disruptions that occur during any long-lived session. Best practices include exponential backoff with jitter (starting at 1 second and doubling up to 30 seconds with random variation) to avoid thundering herd problems during outages. Request missed messages using the last-seen sequence number so the server knows exactly what to replay. Deduplicate messages that might be delivered twice during the reconnection window using message IDs stored in a local cache. Quickly re-establish subscription state so the user doesn’t miss ongoing chat. Reliable client-side logic directly impacts user experience because users should barely notice brief network hiccups.
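A client-side reconnection loop sketching exponential backoff with jitter and sequence-based catch-up. The `connect` and `subscribe` callables stand in for whatever WebSocket client you use; they are assumptions, not a specific API.

```python
import random, time

MAX_BACKOFF = 30.0  # seconds

def reconnect_with_backoff(connect, subscribe, last_seen_seq: int):
    """connect() returns a connection or raises; subscribe(conn, from_seq=...) resumes the stream."""
    attempt = 0
    while True:
        try:
            conn = connect()
            # Ask the server to replay everything after the last sequence number we rendered.
            subscribe(conn, from_seq=last_seen_seq + 1)
            return conn
        except OSError:  # or whatever error your WebSocket client raises on failure
            # Exponential backoff starting at ~1s, capped at 30s, with jitter to avoid a thundering herd.
            delay = min(MAX_BACKOFF, (2 ** attempt) * random.uniform(0.5, 1.5))
            time.sleep(delay)
            attempt += 1
```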
The following diagram compares connection patterns across different protocol choices and their trade-offs.
With delivery protocols established, the next critical challenge is ensuring messages survive system failures and clients can recover from disconnections without losing chat context.
Message ordering, catch-up, and delivery guarantees
Perfect global ordering across millions of concurrent users posting to thousands of simultaneous streams is computationally prohibitive. It requires distributed consensus that would add seconds of latency to every message. But users still expect chat to make sense, with messages appearing in roughly the order they were sent. The solution involves server-assigned sequence numbers, partition-level ordering guarantees, and robust catch-up mechanisms that let clients recover missed messages without noticeable gaps in their experience.
Sequence numbers and ordering guarantees
Each comment receives a server-assigned sequence number within its stream, creating a total order for that livestream’s messages independent of when different servers process them. When the ingestion service accepts a comment, it atomically increments a per-stream counter (typically with Redis INCR; some systems instead use a distributed ID generator like Snowflake, which yields time-ordered but not gap-free IDs) and stamps the message before publishing to the durable log. Clients track the highest sequence number they’ve seen and can detect gaps when messages arrive out of order. They can request missing messages from a history API or accept best-effort ordering depending on how large the gap is.
Partition-level ordering in systems like Kafka guarantees that messages within a partition maintain their publish order. This provides strong consistency within a bounded scope. By partitioning on stream_id using consistent hashing, you ensure all messages for a given livestream pass through the same partition and arrive at consumers in sequence. This doesn’t guarantee global ordering across all streams (messages in different streams may interleave arbitrarily), but within a single chat, ordering is preserved. For most chat applications, small reorderings across streams are invisible to users who are only watching one stream at a time.
Client-side handling addresses edge cases that occur despite server-side ordering guarantees. When a message arrives with a sequence number higher than expected (indicating missed messages), the client can either request the gap from a history API if the gap is small (fewer than 100 messages), accept best-effort ordering and display messages as they arrive if the gap is large, or buffer incoming messages briefly (50-100ms) before rendering to allow slightly out-of-order messages to be sorted locally. For most chat applications, users won’t notice if message 1000 appears slightly before message 999 as long as the delay is under 200 milliseconds.
Pro tip: Implement client-side buffering with a short delay (50-100ms) before rendering messages to the screen. This allows slightly out-of-order messages to be sorted locally before display, improving perceived ordering without requiring expensive server-side coordination. The buffer should be sequence-number aware, flushing when the expected next message arrives or when the timeout expires.
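One way to implement that buffer, as a sketch: hold out-of-order messages keyed by sequence number, flush in order when the expected message arrives, and skip a gap after a short timeout. The 100ms wait mirrors the figure above and is an assumption to tune.

```python
import time

class RenderBuffer:
    def __init__(self, next_seq: int, max_wait: float = 0.1):
        self.next_seq = next_seq          # next sequence number we expect to render
        self.max_wait = max_wait          # how long to wait for a missing message (seconds)
        self.pending: dict[int, dict] = {}
        self.waiting_since: float | None = None

    def add(self, message: dict) -> list[dict]:
        """Returns the messages that are now safe to render, in order."""
        self.pending[message["seq"]] = message
        return self._drain()

    def _drain(self) -> list[dict]:
        out = []
        while self.next_seq in self.pending:
            out.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
            self.waiting_since = None
        if self.pending and self.waiting_since is None:
            self.waiting_since = time.monotonic()
        # Gap timed out: skip ahead rather than stall the chat.
        if self.waiting_since and time.monotonic() - self.waiting_since > self.max_wait:
            self.next_seq = min(self.pending)
            self.waiting_since = None
            out.extend(self._drain())
        return out

# A real client would also call _drain() on a short timer so a gap times out even with no new arrivals.
```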
Cold-start and catch-up mechanisms
When a viewer joins a livestream already in progress, they need context to participate meaningfully in the conversation. Seeing an empty chat while everyone else discusses a play that just happened creates a poor experience and confusion about what’s being referenced. The cold-start problem requires a history API that returns the last N messages (typically 50-200 depending on screen size and chat velocity) along with the current sequence number. The client establishes its baseline state, subscribes to the real-time stream starting from that sequence number, and begins receiving new messages without gaps.
Reconnection catch-up handles network disruptions that disconnect users mid-stream. When a client reconnects, it supplies its last_seen_sequence_number in the connection request. The server compares this against the current sequence and either streams the missing messages directly from the durable log if the gap is small (under 1000 messages), returns them via the history API if the gap is moderate, or returns only recent messages with an indication that older messages were skipped if the client was disconnected for hours. The threshold for “too large a gap” depends on your retention policy and the cost of replay.
The durable log enables this catch-up capability in ways that pure pub/sub cannot match. Unlike ephemeral messaging where missed messages vanish immediately, Kafka and similar systems retain messages for a configurable retention period (typically 24-72 hours for live chat, longer for compliance-heavy applications). Consumer groups track their position (offset) in the log, and replaying from a specific offset provides exactly the messages the client missed without requiring a separate database query. This architecture transforms reconnection from “hope you didn’t miss anything important” to “here’s exactly what you missed, in order.”
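The reconnection handler reduces to a small decision over the gap size. A sketch, using the gap thresholds from the text for the small case and an assumed cutoff for the moderate one:

```python
def catch_up_plan(current_seq: int, last_seen_seq: int) -> dict:
    """Decide how to bring a reconnecting client back up to date."""
    gap = current_seq - last_seen_seq
    if gap <= 0:
        return {"mode": "live"}                                        # nothing missed, just resume
    if gap <= 1_000:
        return {"mode": "replay_log", "from_seq": last_seen_seq + 1}   # stream directly from the durable log
    if gap <= 10_000:                                                  # "moderate" cutoff is an assumption
        return {"mode": "history_api", "from_seq": last_seen_seq + 1}  # paged fetch, then resume live
    return {"mode": "recent_only", "skipped": gap}                     # too far behind: recent messages only
```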
Delivery semantics and deduplication
At-least-once delivery is the practical choice for live comments because it provides reliability without the coordination overhead of exactly-once semantics. The system guarantees every message will be delivered at least once, but during failures, reconnections, or consumer restarts, duplicates may occur. Clients must deduplicate using message IDs, maintaining a small cache of recently-seen IDs (sized to the maximum expected duplicate window, typically 1000-5000 messages). When a message arrives with an ID already in the cache, it’s discarded silently.
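A bounded deduplication cache, sketched with an OrderedDict so the oldest IDs fall out once the window is full:

```python
from collections import OrderedDict

class DedupCache:
    def __init__(self, capacity: int = 5000):
        self.capacity = capacity
        self._seen: OrderedDict[str, None] = OrderedDict()

    def is_duplicate(self, message_id: str) -> bool:
        """True if we've already rendered this message; otherwise remember it."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)
            return True
        self._seen[message_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest ID
        return False
```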
Exactly-once delivery requires distributed transactions spanning ingestion, pub/sub, and delivery. Each component participates in a two-phase commit or uses idempotency keys with persistent storage. Each coordination step adds milliseconds of latency and introduces additional failure modes that must be handled. Since duplicate messages in chat are merely annoying rather than dangerous (unlike duplicate financial transactions or inventory decrements), at-least-once with client deduplication is the industry standard. The engineering effort required for exactly-once is better spent on reducing base latency and improving reliability.
These guarantees form the foundation for reliable message delivery. Production systems must also handle the inevitable traffic spikes that threaten stability during the most important moments.
Hot stream detection and graceful degradation
Popular livestreams create traffic patterns that can overwhelm even well-designed systems within seconds. A surprise celebrity appearance, a game-winning goal, a controversial statement, or a product announcement can spike message volume by 10-100x with no warning. Systems that don’t plan for these scenarios fail at the worst possible moment, which is exactly when the most users are watching and the business impact of failure is highest. Graceful degradation preserves the user experience during overload by strategically reducing quality rather than failing completely.
Detecting hot streams
Proactive detection identifies streams approaching dangerous thresholds before they cause cascading failures that affect other streams or the entire platform. The key metrics to monitor continuously include fan-out latency (time from pub/sub consumption to client delivery, which indicates whether workers are keeping up), queue lag (how far behind consumers are from producers, measured in messages or seconds), publish rate per stream (messages per second, compared against capacity thresholds), active connection count per stream (which determines fan-out multiplication factor), and error rates (failed deliveries, dropped connections, timeout errors). When these metrics cross configurable warning thresholds, the system can automatically engage protective measures before crossing critical thresholds that would require emergency intervention.
Real-time dashboards should surface these metrics with per-stream granularity so operations teams can see which streams are hot, how hot they are, and whether protective measures are engaged. Alerts trigger when metrics exceed warning thresholds, allowing human intervention before automatic degradation kicks in if the situation warrants special handling. For planned events like major esports finals or product launches, teams can pre-enable degradation modes or pre-provision additional capacity based on expected viewership.
Watch out: Don’t rely solely on aggregate metrics across all streams because a single hot stream can hide behind healthy system-wide averages while causing localized failures. If your average fan-out latency is 50ms but one stream is at 2000ms, that stream’s users have a terrible experience even though dashboards look green. Always monitor per-stream metrics for at least your top 100 streams by viewer count.
Graceful degradation strategies
Slow mode is the gentlest intervention, limiting users to one message every N seconds (typically 5-30 seconds depending on severity). This directly reduces ingestion load while still allowing participation. Most users accept slow mode during major events because they understand the system is under strain. Slow mode can be applied globally to a stream or selectively to non-subscribers, new accounts, or users without verified email, preserving full-speed chat for your most engaged community members.
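Slow mode reduces to one atomic check per message. A sketch assuming redis-py, where `SET NX EX` succeeds only if the user hasn’t posted within the window; the key name and exemption check are assumptions.

```python
import redis

r = redis.Redis()

def allow_message(stream_id: str, user_id: str, slow_mode_seconds: int, is_exempt: bool) -> bool:
    """Return True if this user may post right now under slow mode."""
    if slow_mode_seconds <= 0 or is_exempt:   # e.g. moderators or subscribers keep full-speed chat
        return True
    key = f"slowmode:{stream_id}:{user_id}"
    # SET ... NX EX n succeeds only if the key doesn't exist, then blocks reposts for n seconds.
    return bool(r.set(key, 1, nx=True, ex=slow_mode_seconds))
```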
Message sampling becomes necessary when even aggressive slow mode can’t reduce volume enough to prevent system overload. Rather than delivering every message to every viewer, the system randomly samples a percentage of messages for delivery, typically 10-50% during extreme spikes. A viewer might see half of all comments during peak moments, missing some but still experiencing an active chat.
This approach is controversial because users sending messages want them to be seen. However, it preserves system stability and ensures the chat experience continues rather than failing entirely. Some platforms implement “lottery” visibility where each message has a chance of being shown, and users understand this trade-off.
Reaction aggregation addresses a specific type of traffic spike that occurs during exciting moments. Emoji reactions flood in when something exciting happens, with thousands of users clicking reaction buttons simultaneously. Rather than delivering every individual reaction as a separate message, the system aggregates them into periodic summaries (“1,247 users reacted with 🔥 in the last 5 seconds”). This dramatically reduces message count while preserving the social signal that reactions provide. Users generally prefer seeing aggregated counts over a stream of individual reaction notifications.
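A sketch of reaction aggregation: count reactions in memory and flush one summary event per window instead of one message per click. The five-second window and the `publish` callback are assumptions.

```python
import asyncio
from collections import Counter

class ReactionAggregator:
    def __init__(self, stream_id: str, publish, window_seconds: float = 5.0):
        self.stream_id = stream_id
        self.publish = publish            # async callable that sends a single summary event
        self.window = window_seconds
        self.counts: Counter = Counter()

    def record(self, emoji: str) -> None:
        self.counts[emoji] += 1           # cheap in-memory increment per reaction click

    async def run(self) -> None:
        while True:
            await asyncio.sleep(self.window)
            if self.counts:
                summary, self.counts = self.counts, Counter()
                # e.g. {"🔥": 1247, "👏": 302} becomes one fan-out message instead of thousands
                await self.publish({"stream_id": self.stream_id, "reactions": dict(summary)})
```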
Tiered delivery priority ensures important messages survive even aggressive degradation modes. Moderator messages, pinned comments, verified user posts, and subscription announcements receive guaranteed delivery even when the system is sampling regular comments at 10%. Moderation retractions (removing a comment that violated rules) are highest priority because leaving harmful content on screen after a dropped retraction is far worse than dropping a regular message. Implement priority queues in your fan-out workers with strict ordering guarantees for high-priority message types.
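A sketch of tiered delivery with a priority queue: retractions and moderator messages always dequeue before regular chat, and sampling only ever drops regular messages. The priority values and sampling hook are assumptions.

```python
import heapq, itertools, random

RETRACTION, MODERATOR, REGULAR = 0, 1, 2   # lower value = higher priority
_counter = itertools.count()               # tie-breaker keeps FIFO order within a tier

class PriorityFanoutQueue:
    def __init__(self, sample_rate: float = 1.0):
        self._heap: list = []
        self.sample_rate = sample_rate     # e.g. 0.1 during extreme spikes

    def enqueue(self, priority: int, message: dict) -> None:
        # Sampling only ever drops regular messages, never moderation traffic.
        if priority == REGULAR and random.random() > self.sample_rate:
            return
        heapq.heappush(self._heap, (priority, next(_counter), message))

    def dequeue(self) -> dict | None:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```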
The following diagram illustrates the degradation decision tree based on system load metrics.
CDN-based delivery for mega-streams
For truly massive streams with tens of millions of viewers, even sharded WebSocket infrastructure struggles to handle the fan-out multiplication. An alternative approach uses CDN-based delivery with periodic snapshots, trading latency for massive scale.
Rather than pushing every message individually over millions of WebSocket connections, the system generates chat snapshots every few seconds (typically 3-10 seconds) containing recent messages, and distributes them via CDN edge nodes worldwide. Clients poll for snapshot updates from their nearest edge cache, which can serve millions of requests per second with minimal origin load.
This hybrid model trades latency for scale in a way that makes sense at extreme viewer counts. Messages might be 5-10 seconds delayed rather than sub-second, but the infrastructure cost drops by orders of magnitude since CDN edge nodes handle the fan-out rather than your WebSocket servers. Some platforms use this approach automatically for streams exceeding certain thresholds (perhaps 5 million concurrent viewers), transparently switching clients from real-time WebSocket mode to snapshot polling mode without user intervention.
Degradation strategies protect the real-time path. Many systems also need to store comments for replay, compliance, or analytics purposes, introducing additional architectural considerations.
Storage, persistence, and retrieval patterns
Live comment systems vary widely in their persistence requirements based on product needs and regulatory constraints. Twitch chat is mostly ephemeral, with comments existing only during the stream and disappearing afterward. YouTube Live stores comments as part of video playback, allowing viewers to see synchronized chat during VOD replay. Facebook Live sometimes preserves comments with the video archive for social context. Your storage architecture must match your product requirements while not compromising real-time delivery performance. This means persistence happens asynchronously from the critical path.
Persistence models and storage selection
Ephemeral storage keeps comments only in memory or short-lived caches, simplifying infrastructure and reducing cost but meaning comments vanish when the stream ends. Ephemeral systems still need caching for the cold-start problem (serving recent messages to joining viewers) and catch-up (serving missed messages to reconnecting clients), but don’t require durable databases. Redis with TTL-based expiration works well for this pattern, automatically cleaning up after streams end without explicit deletion logic.
Persistent storage saves comments for replay during VOD viewing, moderation audits and appeals, analytics on engagement patterns, legal compliance and discovery requests, or long-term archival. The write volume is substantial and requires databases designed for high-throughput writes. At 50,000 messages per second for a hot stream, you’re looking at 180 million messages per hour that need durable storage.
Traditional SQL databases struggle with this write throughput due to transaction overhead and index maintenance. NoSQL solutions like DynamoDB, Cassandra, or ScyllaDB are common choices. These systems offer excellent write performance through append-only storage patterns and horizontal scaling through consistent hashing, ideal for high-volume short messages.
Time-series databases offer another option for teams already operating InfluxDB, TimescaleDB, or similar systems. Comments are inherently time-series data with each message having a timestamp and belonging to a temporal sequence within its stream. Time-series databases optimize for high-frequency sequential writes and time-range queries, both common access patterns for chat data. However, they’re less common for this use case than general-purpose NoSQL because most teams don’t want to introduce a new database category just for chat storage.
Real-world context: YouTube stores live chat messages as part of video metadata, enabling the “chat replay” feature during VOD playback. The comments are chunked by time offset (typically 30-second segments), allowing efficient seeking without loading the entire chat history. When you jump to 45 minutes into a video, only that segment’s comments load, then adjacent segments prefetch in the background.
Sharding and partitioning strategies
Comments partition naturally by livestream_id, keeping all messages for a stream together and enabling efficient retrieval for replay or moderation review. Within a stream, secondary partitioning by timestamp allows range queries for replay (give me comments from minute 30 to minute 35) and limits partition size for very long streams that might otherwise create hot partitions. Some systems add regional partitioning for compliance requirements (European user data stays in European regions under GDPR) or latency optimization (queries from Europe hit European replicas).
Large platforms may shard a single popular stream across multiple storage partitions when traffic is unusually high. This prevents any single database partition from becoming a write bottleneck. It requires more complex query routing when retrieving the full chat history because you need to merge results from multiple partitions, but the write scalability is worth the read complexity for streams that would otherwise overwhelm a single partition.
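One way to lay out keys for a wide-column or key-value store, sketched as plain Python: partition by stream (optionally sub-sharded for unusually hot streams), and sort by a time bucket plus sequence number so range queries like “minute 30 to 35” stay cheap. The bucket size and shard count are assumptions.

```python
BUCKET_SECONDS = 30          # chunk comments into 30-second segments for range reads
HOT_STREAM_SHARDS = 4        # sub-shard only unusually hot streams to avoid a write hotspot

def storage_keys(stream_id: str, seq: int, ts_ms: int, hot: bool = False) -> dict:
    shard = seq % HOT_STREAM_SHARDS if hot else 0
    bucket = (ts_ms // 1000) // BUCKET_SECONDS
    return {
        "partition_key": f"{stream_id}#{shard}",   # keeps one stream's writes together (or spread when hot)
        "sort_key": f"{bucket:012d}#{seq:012d}",   # time bucket first, then sequence for in-bucket order
    }

# A replay query for minutes 30-35 scans sort keys between the two bucket prefixes,
# merging results from shards 0..HOT_STREAM_SHARDS-1 if the stream was sub-sharded.
```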
Caching for recent comments
When a user joins a livestream, they expect to see recent chat history immediately without waiting for database queries. Hitting the database for every joining viewer (potentially thousands per second during popular streams) creates unnecessary load on storage systems designed for durability rather than read throughput. Instead, maintain a rolling cache of recent comments (typically the last 100-500 messages depending on typical screen size and scroll depth) in Redis or Memcached, serving cold-start requests from cache and only hitting storage for deeper history.
The ingestion pipeline updates this cache as part of its processing, after publishing to the durable log. It pushes the comment to the cache with a TTL slightly longer than the expected stream duration, and the cache automatically expires old entries using an LRU policy or explicit TTL to maintain a sliding window of recent activity. When the livestream ends, cache entries expire naturally over the following hours, requiring no cleanup logic. For very long streams, the cache size limits naturally bound memory usage.
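A sketch of the rolling recent-comments cache with redis-py: LPUSH the new comment, LTRIM to the window size, refresh the TTL, and serve cold-start reads straight from the list. Key names and sizes are assumptions.

```python
import json
import redis

r = redis.Redis(decode_responses=True)

RECENT_WINDOW = 300            # keep roughly the last 300 messages per stream
CACHE_TTL_SECONDS = 6 * 3600   # slightly longer than a typical stream duration

def cache_comment(stream_id: str, event: dict) -> None:
    key = f"recent:{stream_id}"
    pipe = r.pipeline()
    pipe.lpush(key, json.dumps(event))          # newest first
    pipe.ltrim(key, 0, RECENT_WINDOW - 1)       # keep only the sliding window
    pipe.expire(key, CACHE_TTL_SECONDS)         # cache expires on its own after the stream ends
    pipe.execute()

def recent_comments(stream_id: str, limit: int = 100) -> list[dict]:
    """Serve a joining viewer's cold-start request without touching durable storage."""
    raw = r.lrange(f"recent:{stream_id}", 0, limit - 1)
    return [json.loads(item) for item in reversed(raw)]  # oldest-to-newest for display
```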
Retrieval during VOD replay uses different patterns optimized for different access characteristics. Replay requires stronger consistency (showing the actual historical chat) but handles lower concurrency since users scrub through videos at different points rather than all requesting the same moment simultaneously. Comments can be prefetched in chunks aligned with video segments, compressed for faster loading using standard algorithms like gzip or brotli, and cached at CDN edges for popular archived streams that receive many replay views.
Storage decisions interact closely with moderation requirements, which need audit trails, the ability to retroactively remove content, and compliance with data retention policies.
Moderation as a real-time control plane
Large live events attract spam, harassment, hate speech, and coordinated abuse campaigns. Effective moderation must operate in real-time without blocking the critical message path because you can’t add 500ms of latency waiting for ML models to score every message before delivery. The solution treats moderation as a parallel control plane that can intercept, modify, or retract messages at multiple points in the pipeline, optimizing for the common case (legitimate messages that should flow through immediately) while still catching violations.
Pre-delivery and asynchronous moderation
Fast synchronous checks run during ingestion and must complete within the latency budget (under 30ms total for all checks combined). Keyword filters catch obvious violations like slurs, banned terms, and known spam patterns using pre-compiled regular expressions or Aho-Corasick automata for efficient multi-pattern matching. Rate limiting prevents message floods from individual users by tracking recent message counts per user in Redis. Known-bad-actor lists immediately reject messages from previously banned accounts or IP addresses without further processing. These checks use pre-computed data structures loaded into memory and avoid external service calls that would add latency.
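The keyword stage can be a single precompiled pattern checked entirely in memory. A sketch with a placeholder word list; real deployments load much larger lists from config or use an Aho-Corasick automaton.

```python
import re

BANNED_TERMS = ["spamword", "badterm", "scamlink"]  # placeholder list, loaded from config in practice

# One compiled alternation with word boundaries; rebuilt only when the list changes.
_BANNED_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in BANNED_TERMS) + r")\b",
    re.IGNORECASE,
)

def passes_keyword_filter(text: str) -> bool:
    """Fast, allocation-light check suitable for the synchronous ingestion path."""
    return _BANNED_RE.search(text) is None
```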
Asynchronous scoring handles expensive checks like ML-based toxicity detection, image analysis for embedded media, and link safety verification. The message publishes to the fan-out pipeline immediately without waiting for these checks, while a parallel path sends it to the moderation service for analysis.
If the moderation service flags the message (toxicity score above threshold, known spam pattern, prohibited link), it publishes a retraction event that propagates through the same fan-out infrastructure. Clients receive the retraction and remove the already-displayed message, typically within 2-5 seconds of original display. This approach optimizes for the common case (legitimate messages) while still catching violations with acceptable delay.
Pro tip: Prioritize retraction events above regular messages in your fan-out queue because a harmful message that displays for 5 seconds before retraction is far worse than a regular message being delayed by 100ms. Implement priority queues or separate high-priority channels for moderation actions, and monitor retraction latency as a key SLO.
Shadow banning and soft deletes
Shadow banning is a powerful tool against persistent bad actors who would otherwise create new accounts to evade traditional bans. The shadow-banned user sees their own messages appear normally in their client, but other viewers never receive them because the ingestion service checks a shadow-ban list and, for flagged users, publishes messages only to that user’s own connection rather than the broadcast channel. The user has no indication they’re banned, reducing the likelihood they’ll create new accounts, appeal the decision, or escalate their behavior. Shadow banning is ethically complex and should be used judiciously with clear internal policies about when it’s appropriate.
Soft deletes remove content from display while preserving data for audit, appeal, and compliance purposes. When a moderator removes a message, the system publishes a deletion event containing the message ID that propagates to all clients, but retains the original content in storage with a “deleted” flag and metadata about who deleted it and why. This creates an audit trail for reviewing moderation decisions, enables restoration if a deletion was incorrect or successfully appealed, and satisfies legal hold requirements that may prevent actual deletion even when content is removed from public view.
Audit trails and compliance
Moderation actions themselves need comprehensive logging for accountability, training, and legal compliance. Every ban, deletion, timeout, warning, and flag should record the action taken, the actor (human moderator username or automated system identifier), the timestamp with timezone, the reason code and any free-text justification, the content that triggered the action (preserved even if the message is deleted from public view), and the outcome of any appeal.
This audit log enables reviewing moderator performance and consistency, investigating user complaints about unfair treatment, demonstrating compliance with content policies to regulators or advertisers, training ML models on moderation decisions, and responding to legal discovery requests.
The control plane approach means moderation scales independently from message delivery. You can add more sophisticated ML models, expand human review queues, implement new compliance checks for different jurisdictions, or add integration with third-party trust and safety services without modifying the core fan-out infrastructure.
With moderation architecture established, the next consideration is ensuring your infrastructure can handle the calculated load through systematic capacity planning.
Capacity planning and workload estimation
System Design interviews often expect back-of-envelope calculations demonstrating you understand the scale involved rather than just hand-waving at “big numbers.” Production planning requires similar estimates with more precision to provision infrastructure appropriately, negotiate cloud contracts, and set realistic expectations with stakeholders. This section walks through realistic numbers for a major live streaming platform, showing the methodology you can adapt to your specific requirements.
Connection and bandwidth modeling
Consider a popular livestream scenario with 500,000 concurrent viewers and an average message rate of 10,000 messages per second (representing about 2% of viewers actively chatting at any moment). Each message payload averages 200 bytes including user ID, display name, message text, timestamp, and metadata. The raw ingestion bandwidth is straightforward at 10,000 messages × 200 bytes = 2 MB/second. But fan-out multiplies this dramatically because each message must reach every viewer.
In practice, sharding spreads this multiplication across the fleet. If you have 500 WebSocket servers each handling 1,000 viewers for this stream, each server receives the 2 MB/second message stream from pub/sub and fans out to its local connections. The pub/sub bandwidth is 2 MB/second × 500 servers = 1 GB/second to distribute messages to all WebSocket servers. Each server’s outbound is 2 MB/second per client, or 2 GB/second across its 1,000 connected clients, and total delivery bandwidth reaches 500,000 viewers × 2 MB/second = 1 TB/second. That is substantial, consistent with the terabytes-per-second fan-out estimate from the requirements section, but it becomes tractable when spread across 500 servers with modern NICs and careful capacity planning.
| Metric | Value | Calculation |
|---|---|---|
| Concurrent viewers | 500,000 | Given scenario |
| Active chatters | 10,000 | ~2% participation rate |
| Message rate | 10,000/sec | 1 msg/sec per active chatter |
| Message size | 200 bytes | Average including metadata |
| Ingestion bandwidth | 2 MB/sec | 10,000 × 200B |
| WebSocket servers needed | 500 | 500,000 ÷ 1,000 connections/server |
| Pub/sub bandwidth | 1 GB/sec | 2 MB/sec × 500 subscribers |
| Per-server outbound | 2 GB/sec | 1,000 local clients × 2 MB/sec each |
| Total delivery bandwidth | 1 TB/sec | 500,000 viewers × 2 MB/sec |
Watch out: Connection counts per server depend heavily on your language runtime, garbage collection characteristics, and message patterns. A Go or Rust server with efficient memory management might handle 50,000+ mostly-idle connections. Java or Node.js might max out at 10,000-20,000 with active message flow due to GC pressure and event loop blocking. Benchmark your specific stack under realistic load before committing to capacity numbers in production planning.
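The numbers in the table above reduce to a few lines of arithmetic; a sketch you can adapt to your own assumptions:

```python
viewers = 500_000
participation = 0.02                  # ~2% of viewers chatting
msgs_per_chatter_per_sec = 1
msg_bytes = 200
connections_per_server = 1_000

msg_rate = int(viewers * participation * msgs_per_chatter_per_sec)   # 10,000 msg/s
ingest_bw = msg_rate * msg_bytes                                      # 2 MB/s into the system
servers = viewers // connections_per_server                           # 500 WebSocket servers
pubsub_bw = ingest_bw * servers                                       # 1 GB/s to all subscribers
per_server_out = ingest_bw * connections_per_server                   # 2 GB/s out of each server
total_delivery_bw = ingest_bw * viewers                               # 1 TB/s across the fleet

print(f"{msg_rate=} {ingest_bw=} {servers=} {pubsub_bw=} {per_server_out=} {total_delivery_bw=}")
```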
Storage and retention calculations
If comments require persistent storage for replay, compliance, or analytics, storage accumulates quickly and requires careful capacity planning. At 10,000 messages per second for a 3-hour livestream, you generate 10,000 × 60 × 60 × 3 = 108 million messages. At 200 bytes each, that’s approximately 21.6 GB per stream before any indexing overhead. A platform hosting 1,000 such three-hour streams per day would store roughly 21.6 TB per day just for comments, not including other metadata.
Retention policies become critical for controlling storage costs and meeting compliance requirements. Keeping 30 days of data means approximately 650 TB of raw storage, while 7 days is more manageable at around 150 TB. Compression helps significantly because chat messages are highly compressible (lots of repeated patterns, common words, similar metadata structures, repeated usernames). LZ4 or Zstandard compression typically achieves 3-5x reduction for text-heavy workloads, bringing that 650 TB down to 130-220 TB of actual storage consumption.
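The same back-of-envelope style works for storage and retention, matching the figures above:

```python
msg_rate = 10_000                     # messages/second for a hot stream
stream_hours = 3
msg_bytes = 200
streams_per_day = 1_000

messages_per_stream = msg_rate * 3600 * stream_hours          # 108 million
bytes_per_stream = messages_per_stream * msg_bytes            # ~21.6 GB
daily_bytes = bytes_per_stream * streams_per_day              # ~21.6 TB/day

for retention_days in (7, 30):
    raw = daily_bytes * retention_days
    compressed_low, compressed_high = raw / 5, raw / 3        # 3-5x compression (Zstandard/LZ4)
    print(f"{retention_days} days: raw {raw/1e12:.0f} TB, "
          f"compressed {compressed_low/1e12:.0f}-{compressed_high/1e12:.0f} TB")
```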
For platforms subject to legal hold requirements or GDPR right-to-deletion requests, storage architecture must support efficient deletion of individual user’s messages across all streams they participated in. This typically requires secondary indexes on user_id in addition to the primary stream_id partitioning, adding storage overhead but enabling compliance operations.
With capacity planning complete, the final consideration is how to present this system effectively in interview settings where communication matters as much as technical correctness.
Interview strategy and trade-off discussions
Live comments System Design tests your understanding of real-time systems, connection protocols, scaling strategies, and distributed system trade-offs in a compact format that reveals how you think about complex problems. A well-structured presentation demonstrates senior-level thinking and separates you from candidates who jump straight to implementation details without establishing context. The key is showing that you can navigate ambiguity, make reasoned decisions, and communicate technical concepts clearly.
Requirements clarification
Start by asking clarifying questions that demonstrate awareness of design dimensions and prevent wasted effort on irrelevant features. Ask about scale: are you targeting thousands of viewers or millions? The architecture differs significantly. Ask about message ordering: is strict ordering required, or is best-effort acceptable given the latency trade-offs? Ask about persistence: do comments need to survive for replay, or can they be ephemeral? Ask about latency: what is the target from post to delivery, and is sub-second acceptable or do you need sub-100ms? Ask about feature scope: which features matter most among basic chat, reactions, moderation, threading, and rich media?
These questions prevent wasted effort on unnecessary features and show you understand that requirements drive architecture. A system for 10,000 concurrent viewers looks fundamentally different from one serving 10 million. The smaller system might use a single Redis instance for pub/sub while the larger one needs Kafka clusters across multiple regions.
Structured architecture presentation
Walk through the system in logical order that mirrors the actual message path, making your design easy to follow even for interviewers less familiar with real-time systems. Cover client connection (WebSocket vs SSE, with justification), edge routing and geographic distribution, the ingestion pipeline with validation and moderation hooks, the durable log for ordering guarantees and catch-up capability, pub/sub for distribution to WebSocket servers, fan-out workers that multiply messages to connected clients, WebSocket servers for delivery, and optional persistence for replay or compliance.
Draw diagrams as you explain, even rough boxes and arrows, because visual representation clarifies relationships between components far better than verbal description alone. Label data flows with expected latency contributions (edge at 5ms, ingestion at 20ms, pub/sub at 10ms, delivery at 5ms equals roughly 40ms total) to demonstrate you understand where time is spent in the system.
Trade-off articulation
Interviewers care deeply about trade-off awareness because real engineering requires choosing between imperfect options rather than finding perfect solutions. Prepare to discuss WebSockets versus SSE (bidirectional capability versus implementation simplicity, with WebSockets winning for chat because users both send and receive). Prepare to discuss fan-out on write versus fan-out on read (latency versus infrastructure cost, with write-time fan-out winning for real-time requirements). Prepare to discuss strict ordering versus best-effort (coordination overhead versus user experience, with best-effort plus sequence numbers being the practical choice). Prepare to discuss persistent versus ephemeral storage (replay capability versus operational simplicity). Prepare to discuss regional versus global routing (latency versus consistency trade-offs).
Don’t just list trade-offs abstractly. Explain which choice you’d make for this specific system and why, grounding your decision in the requirements you clarified earlier. “For live comments, I’d choose fan-out on write because our latency requirement of sub-500ms is critical for user experience. The infrastructure cost is higher than fan-out on read, but it’s the only way to meet our latency target. We can manage costs through graceful degradation during extreme spikes.”
Pro tip: When interviewers push back on your choices or propose alternative approaches, don’t get defensive. Acknowledge valid concerns, explain why you still prefer your approach given the requirements, or adapt your design if their point reveals a flaw. Flexibility and reasoned discussion demonstrate engineering maturity far better than rigid defense of initial decisions.
The following diagram provides a reference architecture suitable for whiteboard sessions that captures the essential components without overwhelming detail.
Conclusion
Building a live comments system requires mastering the intersection of real-time messaging, distributed systems, and scale engineering in ways that few other System Design challenges demand. The core architecture separates concerns into distinct layers. Edge routing handles geographic distribution and protection. Connection management maintains millions of WebSocket sessions. Ingestion validates and sequences messages. Durable logging provides ordering guarantees and catch-up capability. Pub/sub distributes messages to regional clusters. Fan-out workers multiply each message to thousands or millions of connected viewers. Each layer scales independently based on its specific load characteristics, allowing you to add capacity precisely where bottlenecks emerge.
Server-assigned sequence numbers provide ordering guarantees within streams without the prohibitive cost of global consensus, while durable logs enable catch-up for reconnecting clients and cold-start for joining viewers. Graceful degradation through slow mode, message sampling, and reaction aggregation preserves user experience during traffic spikes that would otherwise cause system failure. The moderation control plane operates in parallel with message delivery, catching violations through asynchronous ML scoring and propagating retractions through the same fan-out infrastructure.
The future of live interaction is pushing toward even lower latencies and richer experiences as viewer expectations evolve. We’re seeing experiments with peer-to-peer message relay to reduce server load and latency for geographically proximate users, AI-powered moderation that adapts to context and community norms rather than applying fixed rules, and integration with AR/VR experiences where chat messages appear in spatial environments rather than flat overlays. Edge computing pushes more logic closer to users through WebAssembly and serverless functions at CDN edges, potentially enabling sub-50ms global delivery. As live streaming grows across gaming, education, commerce, and social connection, these systems become increasingly critical infrastructure that defines what real-time interaction means for billions of users.