Every time someone taps an advertisement on their phone or clicks a banner on a website, a small but critical event fires into the void. That event must travel through networks, survive retries and duplicates, get validated against fraud, and ultimately land in a system that can aggregate it with billions of similar events happening simultaneously around the world. The margin for error is razor-thin because advertisers make million-dollar decisions based on these counts, and billing disputes can arise from even minor inaccuracies. At Google’s scale, a 0.1% error rate means millions of miscounted clicks and potential disputes worth hundreds of thousands of dollars daily. This is the challenge of building an ad click aggregator, a system that sits at the intersection of real-time streaming, distributed storage, and high-stakes data accuracy.
This guide walks through the complete architecture of an ad click aggregator, from ingestion endpoints absorbing traffic spikes to streaming pipelines performing deduplication and windowed aggregation. You will learn how to handle late and out-of-order events using watermarking, design storage tiers that balance cost with query performance, and implement idempotency keys that prevent duplicate billing. Whether you are preparing for a System Design interview or architecting a production ad platform, this deep dive covers the technical decisions that separate a functional prototype from a battle-tested system handling hundreds of thousands of events per second.
The following diagram illustrates the end-to-end flow of an ad click aggregator, from user interaction through ingestion, processing, and storage layers.
Why ad click aggregators matter
An ad click aggregator is a large-scale backend system that collects, processes, deduplicates, and aggregates click events from millions of users across devices, geographies, and platforms. When someone interacts with an advertisement, that click event becomes part of the data advertisers use to measure campaign performance, calculate conversions, analyze user behavior, and determine budget allocation. Modern advertising platforms must track these clicks in real time because marketers expect up-to-the-second dashboards showing impressions, clicks, click-through rates, cost per click, and revenue attribution.
The complexity arises from scale and reliability requirements that few other systems match. A global ad platform must handle traffic surges driven by campaign launches, Super Bowl ads, holiday shopping seasons, or viral content. Events arrive from thousands of publishers and millions of devices simultaneously. They may be duplicated, delayed, malformed, fraudulent, or out of order due to network conditions and client behavior. All of this must be handled with low latency and high accuracy because advertisers rely on these metrics for billing and optimization.
Real-world context: Major ad platforms like Google Ads and Meta process billions of click events daily. Their systems must maintain reconciliation pipelines that compare raw event logs against aggregated metrics, catching discrepancies before they affect billing cycles.
In System Design interviews, the ad click aggregator problem is particularly valuable because it requires thinking deeply about ingestion throughput, data quality, idempotency, distributed queues, streaming pipelines, storage choices, multi-region availability, and large-scale analytics. It touches nearly every major distributed systems concept, making it an excellent test of architectural thinking.
Understanding this system prepares you for backend and data infrastructure roles where similar patterns apply to metrics collection, event sourcing, and real-time analytics platforms. The first step toward building such a system is establishing clear functional and non-functional requirements.
Functional and non-functional requirements
Before designing an architecture, you must clearly articulate what the system needs to do and how well it needs to perform. Ad tech systems have strict functional correctness requirements like deduplication and filtering, combined with aggressive non-functional requirements around latency and throughput. Clear requirements shape everything from API design to storage selection and help you make principled trade-offs during implementation.
Functional requirements
The system must receive click events from multiple sources including mobile SDKs, web pixels, server-to-server callbacks, and third-party ad networks. Each channel has different characteristics that affect design decisions. Mobile SDKs batch events to conserve battery, accumulating clicks and sending them periodically or when the app moves to the background. Web pixels fire immediately on click, generating synchronous traffic patterns. Server callbacks may arrive with significant delays of minutes or even hours depending on partner integration quality. The ingestion layer must accept millions of events per second across all channels while maintaining consistent behavior regardless of source.
Every incoming event requires validation and normalization before processing. Click events may arrive with invalid fields, missing IDs, incorrect timestamps, or nonstandard formats depending on the integration quality of each publisher. Normalization transforms these varied inputs into a uniform schema that downstream processing can handle consistently. This includes timestamp standardization to UTC, field type coercion, and enrichment with derived fields like geographic region based on IP address.
Schema versioning becomes critical here because ad formats and tracking requirements evolve constantly. Forward-compatible schemas with envelope versioning allow the system to handle both old and new payload formats without breaking existing processing jobs.
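To make envelope versioning concrete, here is a minimal Python sketch of a version-dispatching parser. The field names (`schema_version`, `payload`, `ts`, `click_id`, `geo`) are illustrative assumptions rather than a standard format; the point is that v2 events extend v1 without breaking older parsers.

```python
from datetime import datetime, timezone

# Registry of per-version parsers; new versions register alongside old ones.
PARSERS = {}

def parser(version):
    def register(fn):
        PARSERS[version] = fn
        return fn
    return register

@parser(1)
def parse_v1(payload):
    # v1 payloads carry epoch seconds; normalize to UTC ISO-8601.
    ts = datetime.fromtimestamp(payload["ts"], tz=timezone.utc)
    return {"click_id": payload["click_id"], "ts": ts.isoformat()}

@parser(2)
def parse_v2(payload):
    # v2 adds an optional geo field; unknown extras are simply ignored,
    # which is what keeps the envelope forward-compatible.
    event = parse_v1(payload)
    event["geo"] = payload.get("geo", "unknown")
    return event

def parse_event(envelope):
    fn = PARSERS.get(envelope.get("schema_version"))
    if fn is None:
        raise ValueError(f"unsupported schema_version: {envelope.get('schema_version')}")
    return fn(envelope["payload"])
```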
Deduplication prevents inflation of metrics caused by client retries, network delays, load balancer retries, or malicious bot traffic. A single user click can generate multiple events through these mechanisms, and each duplicate must be filtered out before aggregation. Methods include click ID checking against recent events, device ID combined with timestamp windows, and probabilistic structures like Bloom filters for memory-efficient approximate deduplication at massive scale. Signature validation adds another layer by cryptographically verifying that click events originated from legitimate impression renders, preventing fabricated clicks from entering the pipeline.
Watch out: Relying solely on exact click ID matching fails when clients generate IDs incorrectly or when malicious actors deliberately vary IDs. Production systems often combine multiple deduplication strategies as defense in depth, including behavioral analysis that detects impossible click patterns.
The aggregator must compute metrics across multiple dimensions including clicks per campaign, clicks per ad, clicks per publisher, hourly and daily rollups, and custom metrics like click-through rate when combined with impression data. These aggregations span what engineers call multi-dimensional cubes, where each combination of campaign, geography, device type, and time bucket represents a cell that must be computed and stored.
Managing cardinality explosion becomes essential because the number of possible dimension combinations grows multiplicatively. Downstream systems consume the aggregated data for attribution models, billing calculations, reporting dashboards, fraud detection, and data science pipelines through reliable publishing mechanisms.
Non-functional requirements
Traffic patterns in advertising are inherently spiky, with volumes that can reach hundreds of thousands of events per second during normal operation and millions per second during peak periods such as major sporting events or holiday shopping seasons. Consider a concrete scenario. During a typical day, the system might handle 200,000 clicks per second on average, but a viral campaign launch could spike traffic to 800,000 clicks per second within minutes. The entire processing chain from ingestion through streaming to storage must scale horizontally to handle these surges without dropping events or introducing unacceptable latency. Advertisers expect dashboard updates within one minute ideally, with five minutes being the upper bound of acceptability for most use cases.
Fault tolerance is non-negotiable because advertising metrics directly impact billing. The system must preserve correctness and never lose data even when nodes crash, partitions become unavailable, data centers fail, or events arrive out of order. This requires careful attention to replication, checkpointing, and exactly-once or at-least-once processing semantics depending on where in the pipeline you are operating.
Audit trails and lineage tracking become essential for billing disputes, allowing engineers to trace any aggregated metric back to the raw events that produced it. Global availability adds another dimension since ad traffic comes from around the world, requiring multi-region ingestion with geo-routing, regional endpoints, and failover strategies.
Cost efficiency balances against all other requirements because storing raw click events indefinitely would be prohibitively expensive. At 500 bytes per event after compression and 200,000 events per second, raw data accumulates at approximately 8.5 TB daily. The system must compress data, batch writes, archive to cold storage, and implement lifecycle policies that migrate older data to cheaper tiers. A well-designed storage strategy using hot and cold tiering can reduce costs by an order of magnitude compared to keeping everything in hot queryable storage.
The following table summarizes the key requirements and their typical targets for a production ad click aggregator.
| Requirement | Category | Typical target |
|---|---|---|
| Peak throughput | Non-functional | 500K–2M events/second |
| Dashboard latency | Non-functional | Under 60 seconds |
| Billing accuracy | Functional | 99.99% after reconciliation |
| Data durability | Non-functional | Zero loss (replicated storage) |
| Deduplication window | Functional | 24–72 hours |
| Geographic coverage | Non-functional | Multi-region (3+ regions) |
| Storage retention (hot) | Non-functional | 7 days |
| Storage retention (cold) | Non-functional | 2+ years |
With requirements clearly defined, the next step is designing the high-level architecture that satisfies these constraints while remaining operationally manageable.
High-level architecture
An ad click aggregator is fundamentally a real-time event processing pipeline. While many architectural variations exist, most high-performing systems share key components arranged in a dataflow pattern: ingest, buffer, process, store, and analyze. Each layer has specific responsibilities and scaling characteristics that together enable the system to handle massive throughput while maintaining accuracy. The separation of concerns between layers allows each to scale independently and fail gracefully without cascading failures.
The ingestion layer begins with HTTP or gRPC APIs that receive events from various sources. SDK-based event emitters batch and send from client applications, while CDN edge endpoints provide faster global access by terminating connections close to users. Load balancers distribute incoming traffic across ingestion servers using consistent hashing or round-robin strategies. Rate limits and throttling protect downstream components from being overwhelmed by traffic spikes or misbehaving clients. Edge validation at this layer rejects obviously malformed requests before they consume downstream resources, checking basic schema compliance and authentication credentials.
A durable, high-throughput streaming system like Kafka, Kinesis, or Pulsar serves as the central buffer between producers and consumers. This message queue decouples the ingestion layer from processing, isolating failures and allowing each layer to scale independently. When processing slows down or fails, the queue absorbs the backlog until capacity recovers, preventing data loss during transient issues. Partition count directly impacts parallelism since each partition can have only one consumer per consumer group. Replication factor determines durability, with most production systems using a replication factor of three to survive broker failures without data loss.
Pro tip: Choosing partition keys carefully in your streaming layer determines whether related events land on the same processor. Partitioning by campaign ID or advertiser ID enables correct windowing and simplifies stateful operations like deduplication. However, watch for hot keys when a single campaign generates disproportionate traffic.
Stream processors like Flink, Kafka Streams, or Spark Streaming consume from the queue to perform deduplication, filtering, schema validation, enrichment with external data, and windowed aggregation. Raw events flow to cheap, durable object storage like S3, GCS, or HDFS for long-term retention, audit trails, and reprocessing capability. Aggregated metrics go to analytical databases optimized for fast queries, with the choice depending on query patterns and latency requirements. Finally, monitoring and anomaly detection systems watch for sudden traffic changes, suspicious click bursts, bot farm patterns, and system health issues.
The diagram below shows how stream processing components connect internally, with data flowing through validation, enrichment, and aggregation stages.
Understanding the high-level flow sets the stage for examining each component in detail, starting with the critical ingestion layer that forms the system’s entry point.
Event ingestion at scale
The ingestion layer is the foundation of the system because everything downstream depends on its reliability and speed. A poorly designed ingestion layer causes dropped events, backpressure buildup, latency spikes, and incorrect reporting that propagates through the entire pipeline. Getting this layer right requires handling multiple event sources, absorbing traffic bursts, protecting against abuse, and managing failures gracefully while maintaining consistent semantics for clients.
Modern platforms receive events through multiple channels with different characteristics that demand flexible API design. Websites send click events via JavaScript beacons that fire immediately on user interaction, generating synchronous request patterns. Mobile apps use SDKs that batch events to conserve battery and network resources, sending accumulated clicks periodically or when the app moves to the background. Backend services forward events through direct API calls with retry logic, while external ad networks deliver callbacks that may arrive with significant delays. The API must support JSON or Protocol Buffer formats, authentication keys for access control, and optional batching for efficiency. Forward-compatible schemas with version fields in the envelope allow the system to accept events from older SDK versions while newer versions include additional fields.
Ad traffic is inherently spiky because campaign launches, influencer posts, viral content, or major events can multiply traffic instantly. The ingestion layer absorbs these bursts through several complementary mechanisms. Auto-scaling ingestion servers based on CPU and network metrics handles gradual increases over minutes. Buffering events in Kafka provides minutes to hours of absorption time for downstream systems to catch up. CDN edges can absorb geographic bursts by terminating connections close to users, reducing latency and distributing load across edge locations. Backpressure signals to clients through HTTP 429 responses or exponential backoff guidance slow down producers when the system approaches capacity limits.
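A minimal sketch of the client side of this contract, assuming a hypothetical collector endpoint that returns HTTP 429 with an optional Retry-After header when overloaded:

```python
import random
import time

import requests  # any HTTP client works the same way

def send_with_backoff(url: str, events: list, max_attempts: int = 5):
    """Post a batch of click events, backing off when the collector
    signals overload with HTTP 429."""
    delay = 0.5
    for _ in range(max_attempts):
        resp = requests.post(url, json={"events": events}, timeout=5)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Honor Retry-After when given in seconds; otherwise use
        # exponential backoff with jitter to avoid synchronized retries.
        retry_after = resp.headers.get("Retry-After", "")
        sleep_s = float(retry_after) if retry_after.isdigit() else delay + random.uniform(0, delay)
        time.sleep(sleep_s)
        delay *= 2
    raise RuntimeError("collector still overloaded after retries")
```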
Historical note: Early ad systems used synchronous database writes directly from ingestion servers, which created severe bottlenecks during traffic spikes. The adoption of message queues as buffers was a paradigm shift that enabled modern real-time ad analytics by decoupling receipt acknowledgment from processing completion.
Rate limiting and throttling protect downstream systems from both malicious attacks and misconfigured clients. Per-publisher limits prevent any single source from overwhelming the system, typically implemented using token bucket algorithms with configurable burst allowances. Dynamic quotas adjust during extreme surges to prioritize high-value traffic from top advertisers. Throttling on clients sending malformed or suspicious data prevents abuse from impacting legitimate traffic. These limits must be adjustable in real time without requiring deployments, typically through configuration systems or feature flags that propagate within seconds.
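A per-publisher token bucket is straightforward to sketch. The rates below are illustrative, and a production limiter would typically keep its counters in shared state (for example Redis) rather than per-server memory:

```python
import threading
import time

class TokenBucket:
    """Per-publisher token bucket: a steady refill rate plus a burst allowance."""
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self, cost: float = 1) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at the burst size.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False  # caller should respond with HTTP 429

# One bucket per publisher, e.g. 10k events/sec steady with a 50k burst.
buckets = {"publisher-123": TokenBucket(rate_per_sec=10_000, burst=50_000)}
```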
Error handling in the ingestion layer requires graceful degradation rather than hard failures. Malformed data should be logged and rejected with clear error codes rather than crashing the server or silently dropping events. Unauthorized requests return appropriate HTTP status codes without revealing internal system details that could aid attackers. Retries from clients are expected behavior, so the system must be idempotent at the ingestion level, using the same deduplication infrastructure that protects against network-level duplicates. Partial failures where some events in a batch succeed while others fail need clear semantics communicated back to clients through structured response bodies.
Regional ingestion with geo-routing minimizes latency for global traffic and improves resilience against regional outages. Clients connect to the nearest regional endpoint through DNS-based or anycast routing, reducing round-trip time and improving user experience for latency-sensitive click tracking. Data replicates across regions for durability, with streams synchronized periodically to maintain global consistency for reporting. Failover strategies automatically redirect traffic when a region experiences issues, ensuring continuous availability even during infrastructure problems. This geographic distribution also provides natural isolation, preventing issues in one region from affecting others.
With events reliably entering the system, the next critical challenge is ensuring each click is counted exactly once through robust deduplication and idempotency mechanisms.
Deduplication and idempotency
Deduplication is one of the most challenging aspects of ad click aggregator design because duplicates directly cause financial harm through inflated metrics and incorrect billing. Without robust deduplication, advertisers see artificially high click counts, billing calculations become inaccurate, and trust in the platform erodes quickly. The challenge is implementing deduplication that is fast enough for real-time processing, accurate enough for billing, and scalable enough to handle billions of daily events while consuming reasonable memory.
Duplicates arise through multiple mechanisms that must all be addressed systematically. Client retries occur when network timeouts cause SDKs to resend events that actually succeeded on the server side. Load balancer retries happen when backend servers respond slowly and the load balancer assumes failure. Ad networks sometimes forward the same event multiple times due to their own reliability mechanisms or integration bugs. Malicious bot traffic deliberately generates duplicate clicks to inflate metrics or drain advertiser budgets. Delayed events from poor network conditions can arrive long after the original, appearing as duplicates to naive detection systems that only check recent windows.
Idempotency keys and click IDs
Many ad platforms generate a globally unique click ID at the moment of user interaction, before the event ever reaches the backend. This ID, sometimes called an idempotency key or impression signature, enables straightforward deduplication by checking whether an ID has been seen before. Strong uniqueness guarantees come from UUIDs or cryptographically signed tokens that include campaign and ad identifiers embedded in the signature. Signature validation verifies that the click ID was genuinely generated by an impression render, preventing attackers from fabricating click IDs. When a valid click ID is present, the system can detect duplicates with high confidence and support consistent reprocessing during backfills or recovery scenarios.
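As an illustration, a signed click ID might be minted and verified like this. The token layout and truncated HMAC are assumptions for the sketch; real systems would manage keys through a KMS and support rotation:

```python
import hashlib
import hmac
import secrets
import time

SIGNING_KEY = b"server-side-secret"  # placeholder; real keys live in a KMS

def issue_click_id(campaign_id: str, ad_id: str) -> str:
    """Mint a click ID at impression-render time: a random nonce plus an
    HMAC binding it to the campaign and ad."""
    nonce = secrets.token_hex(8)
    ts = str(int(time.time()))
    body = f"{campaign_id}.{ad_id}.{ts}.{nonce}"
    sig = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{body}.{sig}"

def verify_click_id(click_id: str) -> bool:
    """Reject fabricated IDs whose signature does not match the body."""
    try:
        body, sig = click_id.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(sig, expected)
```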
When no client-provided ID exists, the system must derive one from available data using heuristics that balance false positive and false negative rates. Combining device ID with timestamp creates a reasonable approximation, though it fails when the same device genuinely clicks multiple ads in quick succession. URL parameters sometimes contain tracking identifiers from upstream systems that can serve as keys.
Fingerprinting techniques combine multiple fields like user agent, IP address, and referrer into a hash that identifies likely duplicates. These approaches are probabilistic and can produce false positives that incorrectly discard legitimate clicks. Production systems often use a hierarchy of methods, preferring explicit IDs when available and falling back to derived keys with more conservative matching otherwise.
Watch out: Deriving deduplication keys from mutable fields like IP address fails when users move between networks or use VPNs. Always prefer immutable identifiers assigned at click time over inferred keys that depend on connection state.
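A hedged sketch of that hierarchy, with illustrative field names, might look like the following. Bucketing the timestamp makes near-simultaneous retries collide on the same derived key:

```python
import hashlib

def derive_dedup_key(event: dict) -> str:
    """Fallback deduplication key for events lacking a client-supplied
    click ID. Prefers stable fields (device_id) over mutable ones."""
    if event.get("click_id"):            # explicit ID always wins
        return event["click_id"]
    ts_bucket = int(event["ts"]) // 10   # 10-second window
    parts = (
        event.get("device_id", ""),
        event.get("ad_id", ""),
        event.get("user_agent", ""),
        str(ts_bucket),
    )
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```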
Deduplication strategies by time horizon
Short-term deduplication handles the common case of immediate retries within seconds or minutes. High-speed stores like Redis or Memcached hold click IDs for a configurable window, typically 5 to 15 minutes. Checks against this cache must complete in microseconds to avoid becoming a bottleneck in the processing pipeline, requiring careful attention to connection pooling and cluster topology. The cache uses TTL-based expiration to automatically remove old entries, keeping memory usage bounded without explicit garbage collection. For extremely high throughput, sharding the cache across multiple nodes distributes both storage and lookup load, with consistent hashing ensuring that lookups for the same click ID always hit the same shard.
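Using Redis, the short-term check can be a single atomic operation. The key prefix and TTL below are illustrative:

```python
import redis  # redis-py; connection details are illustrative

r = redis.Redis(host="dedup-cache.internal", port=6379)

DEDUP_TTL_SECONDS = 15 * 60  # match the short-term window

def is_first_occurrence(click_id: str) -> bool:
    """SET NX EX is atomic: it returns True only for the first writer, so
    concurrent duplicates across processors resolve consistently, and the
    TTL bounds memory without explicit cleanup."""
    return bool(r.set(f"click:{click_id}", 1, nx=True, ex=DEDUP_TTL_SECONDS))
```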
Long-term deduplication addresses delayed duplicates that arrive hours or even days after the original event due to network issues or partner system delays. Raw event datasets stored in data warehouses can be deduplicated during batch processing using SQL queries with window functions or distinct operations that identify and remove duplicates. Billing workflows that run daily or weekly apply their own deduplication pass to ensure financial accuracy before invoices are generated. Backfill operations that reprocess historical data must also handle deduplication to avoid double-counting events during recovery scenarios. Tools like BigQuery or ClickHouse can efficiently deduplicate billions of records using unique constraints, merge operations, or hash comparisons.
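A typical batch pass keeps the earliest arrival per click ID using a window function. This BigQuery-style query (table and column names are assumptions) is shown here as a query string:

```python
# Keep the earliest arrival per click_id; later copies are duplicates.
DEDUP_QUERY = """
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY click_id        -- one survivor per click ID
      ORDER BY ingest_time ASC     -- prefer the first arrival
    ) AS rn
  FROM raw_events.clicks
  WHERE event_date = @run_date
)
WHERE rn = 1
"""
```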
Probabilistic techniques become necessary when exact deduplication is too memory-intensive for the event volume at hand. Bloom filters provide fast, memory-efficient approximate membership testing with a configurable false positive rate. A Bloom filter using 10 bits per element achieves roughly 1% false positive rate, meaning some duplicates slip through but the vast majority are caught.
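The standard sizing formulas make the memory trade-off concrete. This sketch computes the bit count and hash count for a target false positive rate:

```python
import math

def bloom_parameters(n_items: int, target_fp: float):
    """m bits and k hash functions for n items at a target false positive
    rate, via m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2."""
    m = math.ceil(-n_items * math.log(target_fp) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

# One billion click IDs per day at a 1% false positive rate:
m, k = bloom_parameters(1_000_000_000, 0.01)
print(f"{m / 8 / 2**30:.1f} GiB, {k} hash functions")  # ~1.1 GiB, 7 hashes
```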
HyperLogLog structures estimate cardinality for analytics without storing individual IDs, useful for approximate unique user counts across campaigns where exact precision is less critical than memory efficiency. Simhash algorithms detect near-duplicates where events are similar but not identical, useful for catching sophisticated bot traffic that varies fields slightly between requests to evade exact matching.
The following table compares deduplication approaches across key dimensions.
| Approach | Memory usage | Accuracy | Latency | Best for |
|---|---|---|---|---|
| Exact ID lookup (Redis) | High | Perfect | Microseconds | Short-term, billing-critical |
| Bloom filter | Low | ~99% | Microseconds | High-volume preprocessing |
| Batch SQL dedup | N/A (disk) | Perfect | Minutes | Historical reconciliation |
| HyperLogLog | Very low | ~98% | Microseconds | Unique count estimates |
Handling late and out-of-order events
Events rarely arrive in perfect chronological order due to network delays, client batching, and varying path lengths through the infrastructure. Late events must still be deduplicated against events that arrived earlier but represent the same user action. This requires maintaining deduplication state for a window extending beyond the expected latency, typically 24 to 72 hours for comprehensive coverage depending on partner SLAs and network conditions. Windowed aggregation must handle updates when late events modify previously computed totals, either by recomputing affected windows or by emitting correction records.
Watermarking is the standard technique for managing event-time processing with late arrivals in streaming systems. A watermark represents a timestamp threshold below which no more events are expected to arrive, advancing as the stream progresses. Events arriving after their window’s watermark closes can be routed to a side output for separate handling based on business requirements.
They can be discarded with logging for investigation, processed into correction streams that update downstream systems, or aggregated into late data buckets for reconciliation during billing cycles. The watermark advance rate balances latency against completeness. Aggressive watermarks provide faster results but miss more late events, while conservative watermarks wait longer but capture more data.
Pro tip: Configure different lateness tolerances for different use cases. Real-time dashboards might accept 5-minute watermarks for responsiveness, while billing pipelines use 24-hour watermarks to capture delayed partner callbacks before finalizing invoices.
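A hand-rolled sketch of the watermark mechanics, with illustrative window and lateness values, shows how late events get diverted to a side output:

```python
class WatermarkTracker:
    """Event-time watermark: the max event time seen so far minus an
    allowed lateness. All timestamps are epoch seconds."""
    def __init__(self, allowed_lateness_s: float):
        self.allowed_lateness = allowed_lateness_s
        self.max_event_time = float("-inf")

    def observe(self, event_time: float) -> float:
        self.max_event_time = max(self.max_event_time, event_time)
        return self.max_event_time - self.allowed_lateness

def route(event: dict, tracker: WatermarkTracker, window_s: int = 60):
    """Assign an event to its tumbling window, or divert it to a late-data
    side output if the watermark has already passed the window's end."""
    watermark = tracker.observe(event["ts"])
    window_end = (int(event["ts"]) // window_s + 1) * window_s
    if window_end <= watermark:
        return ("late_side_output", event)  # correction stream / reconciliation
    return ("window", window_end, event)
```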
The interplay between deduplication, late events, and windowed aggregation creates complex state management requirements that the streaming pipeline must handle correctly for billing accuracy. Understanding these mechanisms prepares us to examine how the streaming pipeline orchestrates all these operations in real time.
Streaming pipeline and real-time processing
The streaming pipeline transforms raw click events into structured, accurate analytics signals that power advertiser dashboards and billing systems. Once events pass through ingestion, they enter a high-throughput, fault-tolerant streaming system that performs validation, deduplication, enrichment, and aggregation in real time. A well-designed pipeline ensures low latency, provides exactly-once or at-least-once semantics depending on requirements, and makes data available to dashboards within the target SLA window of one minute or less.
The message queue serves as both a buffer and a durability layer between ingestion and processing. Systems like Kafka, Pulsar, or Kinesis absorb bursty traffic, persist events for replay capability, and provide distributed consumption for parallel processing. Even if processing nodes fail or slow down, the queue stores millions of events until capacity recovers without data loss. Partition count and replication factor configuration directly impacts both throughput and durability. More partitions enable higher parallelism by allowing more consumer instances, while higher replication factors protect against data loss during broker failures at the cost of additional storage and network bandwidth.
Ordering matters for several processing operations that depend on temporal relationships between events. Windowed aggregations require events to be grouped by time, and processing out-of-order events naively produces incorrect counts that drift from reality. Deduplication based on timestamps needs events from the same session to arrive at the same processor to check for duplicates correctly. Fraud detection that analyzes click sequences depends on seeing events in approximately the right order to identify impossible patterns. Partitioning by campaign ID, publisher ID, or a hash of the click ID ensures related events land on the same partition, enabling correct windowing and reducing complexity in stateful operators.
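A producer keyed by campaign ID, sketched with the kafka-python client (broker address and topic name are illustrative):

```python
import json

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
    acks="all",  # wait for in-sync replicas for durability
)

def publish_click(event: dict):
    # Keying by campaign_id sends all of a campaign's clicks to the same
    # partition, so one stateful processor sees them in order.
    producer.send("click-events", key=event["campaign_id"], value=event)
```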
Real-world context: Uber’s ad platform uses Apache Pinot for real-time analytics, handling millions of queries per day with p99 latency under 100 milliseconds. Their Flink-based stream processing layer partitions by rider and driver IDs to maintain session consistency for attribution.
Stream processing operations
Stream processors like Flink, Spark Streaming, or Kafka Streams perform the core transformation logic through a series of operators that can be either stateless or stateful. Validation checks that required fields are present and correctly typed, rejecting malformed events to a dead-letter queue for investigation and potential manual correction. Normalization standardizes timestamps to UTC, converts field formats to canonical representations, and ensures consistent schemas across different event sources.
Enrichment joins click events with external data like user profiles, device metadata, campaign information, or geographic details derived from IP addresses. These operations can be stateless, processing each event independently without maintaining information between events.
Deduplication requires stateful processing because the system must remember which click IDs have been seen recently to detect duplicates. State stores, often backed by RocksDB in Flink or internal topics in Kafka Streams, maintain this memory efficiently using LSM trees optimized for write-heavy workloads. Windowed aggregation is similarly stateful, accumulating counts within time windows and emitting results when windows close based on watermark advancement. The state must be checkpointed regularly to durable storage so that failures can recover without losing progress or double-counting events. Event-time windowing uses timestamps embedded in the events rather than processing time, ensuring correct aggregation even when events arrive delayed or out of order.
The distinction between stateless and stateful processing has significant operational implications that affect scaling and recovery strategies. Stateless operators scale trivially by adding more instances since there is no coordination required and any instance can process any event. Stateful operators require careful partition assignment to ensure state locality, state migration during scaling events, and checkpoint management for recovery. Production deployments often separate stateless preprocessing from stateful aggregation, allowing each stage to scale according to its specific bottlenecks and failure characteristics.
The following diagram illustrates how event-time windows handle late-arriving events through watermarking.
Processing guarantees and accuracy trade-offs
In advertising systems where metrics drive billing, processing guarantees directly impact financial accuracy and advertiser trust. Exactly-once semantics, where each event affects the output exactly once regardless of failures, provides the strongest guarantee but requires coordination between the streaming system and downstream sinks through two-phase commits or idempotent writes. At-least-once semantics, where events may be processed multiple times during recovery, is simpler to implement but requires idempotent downstream systems or post-hoc deduplication to correct overcounting. Best-effort delivery, where events may be lost during failures, is rarely acceptable for billing-critical ad data.
Most production systems use at-least-once semantics in the streaming layer combined with strong deduplication strategies and reconciliation processes to achieve practical exactly-once behavior. The streaming pipeline guarantees no events are lost through replicated queues and checkpointed processing, while deduplication filters out duplicates introduced by retries. Batch reconciliation jobs that run daily compare raw event logs against aggregated outputs, identifying and correcting any discrepancies before billing cycles complete. This layered approach provides practical accuracy without the complexity and performance overhead of true exactly-once processing throughout the pipeline.
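One common way to get practical exactly-once behavior is an idempotent sink: the processor writes the full recomputed window total as an upsert, so replays after a failure overwrite rather than double-count. A Postgres-style sketch with illustrative table and column names:

```python
UPSERT_SQL = """
INSERT INTO campaign_minute_counts (campaign_id, minute_bucket, click_count)
VALUES (%(campaign_id)s, %(minute_bucket)s, %(click_count)s)
ON CONFLICT (campaign_id, minute_bucket)
DO UPDATE SET click_count = EXCLUDED.click_count
"""

def flush_window(cursor, campaign_id, minute_bucket, click_count):
    # Writing the full recomputed total (not an increment) is what makes
    # this safe to replay under at-least-once delivery.
    cursor.execute(UPSERT_SQL, {
        "campaign_id": campaign_id,
        "minute_bucket": minute_bucket,
        "click_count": click_count,
    })
```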
Historical note: The streaming community debated exactly-once semantics extensively for years. The practical resolution came through recognizing that idempotent sinks and reconciliation provide equivalent guarantees with better performance characteristics than distributed transactions across the entire pipeline.
Understanding these trade-offs prepares you to discuss real-world system constraints during interviews and make informed decisions when building production systems. The next section covers how processed data flows into storage systems optimized for different access patterns and cost profiles.
Storage architecture and analytics
After processing and deduplication, events must be stored for analytics, dashboards, billing workflows, and attribution models. Storage design is crucial for achieving real-time query performance without sacrificing durability or cost efficiency. Different data types and access patterns require different storage strategies, leading to a tiered architecture that balances hot queryable storage against cold archival systems with automatic lifecycle transitions.
The following diagram illustrates how data flows through different storage tiers based on age and access patterns.
Real-time aggregation and hot storage
Stream processors compute rolling aggregates before writing to storage, reducing query load on downstream databases by pre-materializing common access patterns. Pre-computed metrics include clicks per ad, clicks per campaign, clicks per publisher, and time-bucketed counts at minute or hour granularity. When impression data is available, click-through rates can be calculated by joining the two streams. These aggregations must handle late events that update previously computed windows through upsert operations, state migration when processors scale to more or fewer instances, and large window sizes for historical trend views that span days or weeks.
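A minimal sketch of such a rolling aggregator: each event updates its (campaign, minute) cell and yields the cell's new total, so a late event naturally produces a corrective upsert for an already-written cell:

```python
from collections import defaultdict

class MinuteAggregator:
    """Tumbling one-minute click counts per (campaign, minute bucket)."""
    def __init__(self):
        self.counts = defaultdict(int)

    def add(self, event: dict):
        bucket = (event["campaign_id"], int(event["ts"]) // 60 * 60)
        self.counts[bucket] += 1
        return bucket, self.counts[bucket]  # upsert (cell, total) downstream

agg = MinuteAggregator()
cell, total = agg.add({"campaign_id": "c42", "ts": 1_700_000_042})
```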
Real-time dashboards require sub-second query latency on recent data, driving the need for specialized analytical databases designed for this access pattern. ClickHouse excels at fast analytical queries with excellent compression ratios and vectorized execution that processes columns efficiently. Apache Druid provides time-series optimized storage with automatic data tiering between segments. Apache Pinot offers real-time ingestion with consistent low-latency queries, making it popular for user-facing analytics where predictable performance matters. StarRocks and Apache Doris are newer entrants offering similar capabilities with different operational characteristics. Cassandra serves well for high-write workloads where the access pattern is primarily key-value lookups rather than complex aggregations.
The schema for hot storage typically includes campaign_id, ad_id, publisher_id, timestamp bucket at minute or hour granularity, and aggregated metrics like click_count, unique_users estimated via HyperLogLog, and revenue. Indexing and partitioning strategies are essential for low-latency queries that scan only relevant data. Time-based partitioning enables efficient range scans for dashboard time selectors. Secondary indexes on campaign or publisher IDs support slice-and-dice analytics across dimensions. Materialized views can pre-compute common query patterns, trading storage space for query speed on frequently accessed combinations.
Pro tip: When choosing between OLAP engines, benchmark with realistic query patterns and data distributions. The same engine can perform very differently depending on cardinality, data skew, and whether queries are time-bounded or full-table scans.
Raw event storage and cold archival
All raw events must be stored in durable object storage regardless of what aggregates are computed, serving multiple critical purposes beyond immediate analytics. Raw data enables recomputation when business logic changes or bugs are discovered in aggregation code. Fraud investigations require examining individual events to understand attack patterns. Billing audits need complete records to resolve disputes with advertisers. Machine learning pipelines train on historical patterns to improve fraud detection and attribution models. Storage systems like S3, GCS, or Azure Blob provide effectively unlimited capacity at low cost, with high durability guarantees through internal replication across availability zones.
Data formats for raw storage balance compression ratio against query flexibility for different access patterns. Parquet provides columnar storage with excellent compression and predicate pushdown for analytical queries that access only specific columns. ORC offers similar benefits with slightly different performance characteristics favored in Hive ecosystems. Avro stores data in row format with schema evolution support, useful when events are typically processed as complete records during backfills. Partitioning raw data by date and hour enables efficient access patterns for both batch processing and selective retrieval during investigations, allowing queries to skip irrelevant partitions entirely.
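A sketch of the partitioned write using pyarrow, with illustrative paths and fields (the S3 URI assumes pyarrow's filesystem support is configured):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A slice of normalized events, already deduplicated upstream.
table = pa.table({
    "event_date": ["2024-06-01", "2024-06-01"],
    "event_hour": [13, 14],
    "campaign_id": ["c42", "c42"],
    "click_id": ["a1", "b2"],
})

# Hive-style date/hour partitioning lets batch jobs and investigations
# read only the directories they need, skipping everything else.
pq.write_to_dataset(
    table,
    root_path="s3://raw-clicks-archive/",
    partition_cols=["event_date", "event_hour"],
)
```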
Storage tiering moves data between hot, warm, and cold tiers based on age and access patterns through automated lifecycle policies. Recent data stays in hot OLAP storage for fast dashboard queries with sub-second latency. Data older than a week migrates to warm storage in a data warehouse like BigQuery or Snowflake, where queries take seconds rather than milliseconds but costs are significantly lower. Historical data beyond 30–90 days moves to cold object storage, where retrieval takes minutes to hours but storage costs are minimal at roughly $0.01 per GB per month. Backfill pipelines can restore cold data to hot storage when needed for analysis or reprocessing.
Reconciliation and data quality
Reconciliation pipelines provide the critical function of verifying that derived aggregates match raw event logs, catching errors before they affect billing. Daily jobs compare the sum of raw clicks against aggregated totals for each campaign, flagging discrepancies that exceed configurable thresholds. Lineage tracking records which raw events contributed to each aggregate, enabling engineers to drill down into specific discrepancies and identify root causes. Drift detection monitors for gradual divergence between raw and aggregated data that might indicate subtle bugs in processing logic or deduplication failures.
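The core comparison is simple to sketch; the tolerance threshold and input shape here are assumptions:

```python
def reconcile(raw_counts: dict, aggregated_counts: dict, tolerance: float = 1e-4):
    """Compare per-campaign raw click totals against the aggregated table
    and flag cells whose relative error exceeds the threshold.
    Inputs are {campaign_id: count} maps."""
    discrepancies = []
    for campaign_id, raw in raw_counts.items():
        agg = aggregated_counts.get(campaign_id, 0)
        if raw and abs(raw - agg) / raw > tolerance:
            discrepancies.append((campaign_id, raw, agg))
    return discrepancies  # feed into alerting and lineage drill-down

# e.g. reconcile({"c42": 1_000_000}, {"c42": 999_400}) flags c42 (0.06% off)
```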
Ad formats and tracking requirements evolve constantly, requiring the storage layer to handle schema changes gracefully without breaking existing pipelines. Adding new fields should not break existing queries or processing jobs that don’t use those fields. Removing fields requires deprecation periods where both old and new formats are supported. Schema registries like Confluent Schema Registry or AWS Glue Schema Registry coordinate changes across producers and consumers, preventing incompatible schemas from entering the pipeline by validating compatibility before allowing registration.
The following table compares popular OLAP engines for ad click aggregation workloads across key operational dimensions.
| Engine | Ingestion latency | Query latency | Join support | Update/delete |
|---|---|---|---|---|
| ClickHouse | Seconds | Sub-second | Good | Limited |
| Apache Druid | Seconds | Sub-second | Limited | No |
| Apache Pinot | Sub-second | Sub-second | Limited | No |
| StarRocks | Seconds | Sub-second | Good | Yes |
| BigQuery | Seconds-minutes | Seconds | Excellent | Yes |
With storage architecture established, the final major concern is ensuring the system scales reliably under production load while remaining observable and debuggable through comprehensive monitoring.
Scaling, fault tolerance, and capacity planning
Scalability and reliability are essential because ad click aggregators operate under extreme load and must maintain high accuracy for billing purposes. Scaling must occur across ingestion, streaming, processing, and storage layers, each with unique challenges and solutions. Capacity planning ensures the system can handle expected growth and traffic spikes without degradation while controlling costs.
Horizontal scaling strategies
The ingestion layer scales by adding servers behind load balancers, with auto-scaling policies triggered by CPU utilization, network throughput, or request queue depth. Regional endpoints distribute geographic load, with each region scaling independently based on local traffic patterns and time-of-day effects. Connection pooling and keep-alive settings optimize resource utilization by reusing established connections. Health checks quickly remove unhealthy instances from rotation, preventing failed servers from receiving traffic and degrading user experience.
Streaming layers scale by adding partitions to increase parallelism, though partition count changes require careful coordination with consumers to avoid rebalancing disruptions. Replication factor increases protect against broker failures but consume additional storage and network bandwidth proportionally. Separating high-volume campaigns into dedicated topics prevents hot tenants from impacting others through resource contention. Topic compaction settings balance storage efficiency against retention requirements for different data types.
Processing layers scale by adding stream processor instances, with work automatically distributed across available workers through partition assignment protocols. Checkpointing frequency trades recovery time against checkpoint overhead, with more frequent checkpoints enabling faster recovery but consuming more resources for snapshot creation. State backend tuning, particularly RocksDB configuration in Flink including block cache size and compaction settings, significantly impacts performance for stateful operations like deduplication and windowing.
Watch out: When scaling stateful stream processors, increase parallelism gradually and monitor state migration progress. Sudden large increases can cause prolonged rebalancing periods where processing stalls and consumer lag grows, potentially triggering alerts or SLA violations.
Handling backpressure and hot partitions
Backpressure occurs when downstream systems cannot keep up with incoming event rates, creating a queue buildup that can cascade through the system. Without proper handling, backpressure causes memory exhaustion, cascading failures, and ultimately data loss. Solutions include batching events to amortize per-event overhead, adjusting producer flush intervals to smooth traffic bursts, temporarily throttling ingestion based on consumer lag metrics, and auto-scaling processors based on queue depth. Monitoring Kafka consumer lag provides early warning of backpressure building in the system, allowing proactive intervention before problems become visible to users.
Hot partitions arise when traffic skews heavily toward certain campaigns or publishers, overloading single partitions while others sit idle. This commonly occurs when a viral campaign generates orders of magnitude more traffic than average. Mitigation strategies include changing partition keys to distribute load more evenly across the cluster. Adding random suffixes to keys for high-volume entities spreads their traffic across multiple partitions. Splitting large campaigns across multiple logical partitions through virtual partitioning enables parallel processing. Dynamic rebalancing that moves partitions between brokers addresses hot spots after they develop.
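A sketch of key salting for hot campaigns follows. Salting by a hash of the click ID, rather than randomly, keeps retries of the same click on one partition so per-partition deduplication still works; the campaign set and fanout are illustrative:

```python
import hashlib

HOT_CAMPAIGNS = {"c42"}  # identified by traffic monitoring
SALT_FANOUT = 16

def partition_key(event: dict) -> str:
    """Spread a hot campaign across SALT_FANOUT sub-keys; everyone else
    keeps a stable key. Aggregators must later merge the c42#0..c42#15
    partials back into one campaign total."""
    campaign = event["campaign_id"]
    if campaign in HOT_CAMPAIGNS:
        salt = int(hashlib.sha256(event["click_id"].encode()).hexdigest(), 16) % SALT_FANOUT
        return f"{campaign}#{salt}"
    return campaign
```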
Capacity planning with real numbers
Production capacity planning requires realistic throughput and storage estimates based on current and projected traffic. A typical large ad platform might see 200,000 clicks per second on average, with peaks reaching 500,000 to 1,000,000 during major events like the Super Bowl or Black Friday. Each click event averages 500 bytes after compression, yielding approximately 8.5 TB of raw data daily at average load. Aggregated data is much smaller, perhaps 1–5% of raw volume depending on aggregation granularity and the number of dimensions being tracked.
Storage costs scale with retention policy and tier allocation. Keeping 7 days of raw data in hot storage at $0.10/GB/month costs approximately $6,000 monthly for the example volume. Moving older data to cold storage at $0.01/GB/month reduces costs by 90% for historical data, making long-term retention financially viable. Query costs in serverless systems like BigQuery add another dimension, where poorly optimized queries scanning large datasets can generate significant bills unexpectedly. Capacity planning must account for both storage and compute costs across all tiers, projecting forward based on traffic growth rates.
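The arithmetic behind these figures is worth keeping as a living script so the estimates update with traffic:

```python
avg_eps = 200_000   # average clicks per second
event_bytes = 500   # per event after compression

daily_tb = avg_eps * event_bytes * 86_400 / 1e12
print(f"raw data per day: {daily_tb:.1f} TB")  # ~8.6 TB, matching the ~8.5 TB estimate above

hot_days = 7
hot_cost_per_gb = 0.10    # $/GB/month, hot tier
cold_cost_per_gb = 0.01   # $/GB/month, cold tier

hot_gb = daily_tb * 1_000 * hot_days
print(f"hot tier (7 days): {hot_gb:,.0f} GB -> ${hot_gb * hot_cost_per_gb:,.0f}/month")
print(f"same data in cold storage        -> ${hot_gb * cold_cost_per_gb:,.0f}/month")
# ~60,000 GB -> ~$6,000/month hot, ~$600/month cold
```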
The following diagram shows key metrics that should be monitored for a healthy ad click aggregator deployment.
Failure recovery and observability
Systems must handle failures at every level including partition leader failures, region-wide outages, worker crashes, and network partitions that isolate components. Multi-region Kafka clusters with cross-datacenter replication protect against regional failures by maintaining synchronized copies. Active-active ingestion endpoints in multiple regions ensure availability during outages by automatically absorbing redirected traffic. Checkpoint-based recovery in stream processors restores state without data loss by replaying from the last consistent snapshot. Replay capability from raw event logs enables reprocessing when bugs are discovered or business logic changes require recalculation.
Essential monitoring metrics include ingestion requests per second segmented by region and source, Kafka consumer lag across all partitions to detect processing slowdowns, producer and consumer throughput to identify bottlenecks, deduplication cache hit ratio to ensure the cache is effective, stream processor latency percentiles to catch degradation early, error rates segmented by campaign and publisher to identify problematic integrations, and dashboard query latency to ensure user-facing performance. Alerting thresholds on these metrics prevent silent miscounts that could cause billing issues or SLA violations. Distributed tracing through events enables debugging specific click paths when problems are reported.
Real-world context: LinkedIn’s ad platform uses distributed tracing to follow individual click events through their entire lifecycle, enabling engineers to diagnose latency issues by examining the processing time at each pipeline stage and identifying bottlenecks quickly.
With the system architecture fully explored, the final section addresses how to present this knowledge effectively in interview settings and handle common deep-dive questions.
Interview strategies and common questions
The ad click aggregator problem appears frequently in System Design interviews because it tests your ability to balance correctness, scalability, efficiency, and real-time processing simultaneously. It reveals understanding of distributed streaming systems, which is essential for modern backend engineering roles. A strong answer demonstrates both breadth across system components and depth in areas where interviewers probe, showing that you can navigate trade-offs thoughtfully.
Structure your answer by walking through requirements first, distinguishing functional requirements like deduplication from non-functional requirements like latency targets. Then cover ingestion and throttling, explaining how you handle bursty traffic and protect downstream systems from overload. Move to the streaming pipeline, discussing partition strategies, stateful versus stateless processing, and handling late events through watermarking. Cover deduplication in depth since this is often a focus area that reveals your understanding of data quality challenges. Explain storage choices and why different data needs different systems based on access patterns. Address scaling and fault tolerance, including what happens when components fail. Finally, discuss trade-offs you would make and optimizations you would apply at different scales.
Common deep-dive questions include how you would guarantee exactly-once semantics and whether the complexity is worth it compared to at-least-once with reconciliation. Interviewers ask how you detect and handle late events arriving hours after they should have been processed, exploring your understanding of watermarking and side outputs. They want to know how you would shard partitions to balance traffic when some campaigns are much larger than others, testing your knowledge of hot key mitigation.
Questions about identifying suspicious or fraudulent click patterns are common, connecting to anomaly detection approaches. Expect questions about designing the deduplication layer to handle billions of IDs without running out of memory, where probabilistic data structures become relevant. Interviewers ask what happens if Kafka falls behind during peak traffic and how you keep dashboard metrics both fresh and consistent during backpressure scenarios.
Pro tip: When discussing trade-offs, frame them in terms of business impact. Instead of saying “exactly-once is hard,” explain that “at-least-once with reconciliation achieves equivalent billing accuracy at lower latency and operational complexity, which matters more for advertiser dashboards.”
Important trade-offs to discuss include at-least-once versus exactly-once semantics and when each is appropriate based on downstream requirements. Consider OLAP databases versus NoSQL stores for aggregated metrics, with the choice depending on query patterns. Discuss high availability versus cost optimization in storage and compute through tiering strategies. Compare probabilistic versus deterministic deduplication methods and when memory constraints force the probabilistic approach. Evaluate window size versus latency in real-time analytics, showing how business requirements drive technical choices. Demonstrating awareness of these trade-offs and the ability to make reasoned choices based on requirements shows senior engineering judgment that interviewers value.
Conclusion
Building an ad click aggregator requires mastering the full spectrum of distributed systems challenges that appear in modern data-intensive applications. High-throughput ingestion must absorb global traffic spikes through auto-scaling and geographic distribution. Streaming pipelines must deduplicate and aggregate in real time while handling late events through watermarking and side outputs. Storage architectures must balance query performance against cost through intelligent tiering that moves data based on age and access patterns. Operational practices must maintain accuracy under failure conditions through reconciliation, lineage tracking, and comprehensive monitoring that catches discrepancies before they affect billing.
The trajectory of ad tech systems points toward even tighter latency requirements as programmatic bidding demands sub-second feedback loops for real-time optimization. Privacy regulations like GDPR and the deprecation of third-party cookies are reshaping what data can be collected and how long it can be retained, adding new constraints to System Design that require careful attention to data governance. Machine learning models for fraud detection and attribution are becoming more sophisticated, requiring richer event data and more flexible processing pipelines that can evolve without breaking existing consumers.
Whether you encounter this problem in an interview or production, the core principles remain constant. Design for failure, measure everything, and make trade-offs explicit. The ad click aggregator serves as a template for any high-stakes, high-volume event processing system where accuracy and availability both matter, from financial transaction processing to IoT sensor aggregation to real-time gaming analytics.