Observability at scale: How we built a cutting-edge Dream11 monitoring ecosystem ?
~By Umesh Gadage (SRE) , Mihir Shah (SRE)
As the world’s largest fantasy sports platform, Dream11 has over 120 million sports fans participating in exciting fantasy sports contests to showcase their skill and knowledge in various sports! With so many users using the app at a given time, our traffic pattern can get very spiky in nature. It can go from thousands of concurrent users to millions in just a few minutes! To support this dynamic pattern, we use prediction-based scaling and scale-out infra well before the match.
The Indian Premier League (IPL) is one of the biggest sports events in India and possibly the world that is viewed by over 400 million cricket fans. Every season, we scale to handle a peak of about 100 million requests per minute (RPM) at edge services. All the microservices generate many Application Performance Monitoring (APM) traces, system metrics, network metrics, DNS metrics, and more. These in turn, generate around 50+ TeraBytes of data and 65+ billion ingested spans per day.
Our Scale At A Glance -
What were our challenges and how did we solve them?
Observability Lens from a single dashboard:
Monitoring a large infrastructure can be a difficult task. There are many moving components involved — from computing and networking, to application and business performance indicators. With a ‘top dashboard’ idea, only events that require attention and are actionable can be monitored during a fantasy sports contest. For example, the top API with response time of more than 200ms , a top API with 5xx, 4xx error of more than.01% , top RDS based on connection established or the CPU. This way, we created a ‘bird eye’ view of Dream11’s whole infrastructure within a single dashboard.
Using this bird’s-eye view of our footprint, we were able to quickly navigate through the issues and reduce the Mean Time To Detect (MTTD) and Mean Time To Resolution (MTTR). Our monitoring tool can build relationships between Cloudwatch metrics, APM metrics, network metrics as well as logs. We are easily able to navigate all of these, using a click of a button.
CloudWatch metrics lag and API rate limit issue:
Initially, we were sending metrics to our monitoring tool using an Amazon Web Services (AWS) integration. Our monitoring tool scraps CloudWatch data and stores it as an integration metric. During big events, we generate over 700 million Cloudwatch metrics per day. The AWS integration with the monitoring tool scraps all the metrics from Cloudwatch and sends it to the platform on a configurable regular interval of two to five minutes. Due to the huge volume of metrics, we end up with an API rate limit issue at the AWS end and also an approximate delay of 20 to 30 minutes in metrics. This leads to a lot of false and late alerts. AWS has come up with a CloudWatch Metric Streams solution to handle such scenarios. Using these streams, we can send metrics to any major monitoring solution provider in almost real-time with an open telemetry collector. After implementing this solution, the delay was reduced to almost one to two seconds which is negligible, and the improvement is significant with cost benefits. AWS Cloudwatch MetricStreamUsage is almost four times cheaper than AWS Cloudwatch GWD-Metrics ( Scrap ).
At Dream11 scale, network monitoring plays one of the most critical roles in the troubleshooting of our systems since our network is complex and distributed across microservices. Our monitoring tools help us in keeping track of our network Ingress/Egress architecture alongside the application, infrastructure, and DNS performance for faster troubleshooting.
Sometimes, we face intermittent issues with the AWS Network in one particular availability zone. Using our network monitoring, we quickly get to know if there is any abnormal activity in our network. A clear signal to validate, is to check the TCP retransmits by AZ. We have multiple slice-and-dice options available to see our network performance. This has capabilities to filter traffic across the availability zones, services, ENVs, domains, hosts, IP, VPCs, ports, region, IP type, etc. (public/private/local).
Upon observing these issues, we quickly removed the faulty AZ from our microservice Auto-Scaling Groups using custom in-house automation scripts. Since all our microservices have their own dedicated autoscaling group, removing one AZ can actually spin up new instances in another AZ using AZ Rebalance policy. Similarly, we removed all the faulty subnets from ALB/ELBs for that AZ.
Automatic Service Map:
Our APM tool provides end-to-end distributed tracing from frontend devices to databases — with no sampling. By seamlessly correlating distributed traces with frontend and backend data, our monitoring APM tool helps us to automatically monitor service dependencies, latency, and eliminate errors so that our users get the best possible experience. Distributed tracing is a technique that helps us address the problem of bringing visibility into the lifetime of a request across several systems which is very useful while debugging and figuring out where the application spends most of its time. We have a Service Map, where we can inspect all the services to get their RED metrics along with their dependencies and filter capabilities by Application Service, Databases, Caches, Lambda Functions and Custom Scripts. Also, this service map reflects almost real-time, and the monitoring agent sends data to our tool at an interval of 10 seconds. The service map shows all services in green when there is no issue, but turns red once anything goes wrong. This comes from the monitor ,which is configured for each service.
We also conduct predictive scaling based on our data science model. For more details on how we scale, you can read our blog post here. The data science model at Dream11 predict the concurrency expected for each match. Based on this prediction and our load test, we benchmark our numbers to scale over 100 microservices. Previously, we used ASG scales manually and validated them on a regular basis, which was a very complex and time-consuming task. But now, this process has been automated and our monitoring tool helps us to verify much faster than before. We have set up a few alerts which help us to identify where we need to look to resolve issues.
We send expected capacity data from our benchmark for an hour and from a cloudwatch to our monitoring system. The comparison between the two helps us to detect provisioning issues faster.
This is how we solved observability at our scale, to provide the best possible user experience to over 120 million sports fans! Keen to know more about how we solved complex problems and build our own solutions? Stay tuned for our upcoming blog posts on:
- Cost Wastage Portal
- Automated Scaling From data Science Prediction
- Adoption of Graviton & Granulate
- Migration from Gp2 → Gp3
We are hiring! Come, be a part of our dream team by applying here!