With our commitment to excellence in education and our dedication to empowering the youth of India, PhysicsWallah offers a plethora of courses designed to meet the diverse needs of our students. Our mission is to provide quality education that enhances the skills and knowledge of the next generation.
To achieve this goal, we conduct numerous live classes daily across various academic batches. We use AWS Elemental services for live streaming so that students nationwide get seamless access to high-quality instruction. Serving a large and diverse student base requires us to run many live classes simultaneously. In doing so, we ran into a recurring problem: after a live class ends, its MediaLive channel sometimes keeps running because of operational lapses. These idle channels have added substantial cost with no corresponding benefit.
The diagram gives an overview of our high-level live streaming architecture. OBS transmits live classes from the studio to AWS Elemental MediaLive. MediaLive handles encoding and transcoding and enables adaptive bitrate streaming by generating multiple renditions of the same video stream at different bitrates and resolutions. The processed stream then goes to MediaPackage, which applies HLS packaging and also harvests the class for live-to-video-on-demand (VOD) conversion. Given our scale, a content delivery network (CDN) is essential for reaching students everywhere, so we use Amazon CloudFront to deliver the live streams.
The problem is that MediaLive channels stay active after a live class has ended. We conduct many live classes throughout the day for different batches, using approximately 300 MediaLive channels daily. At that volume, it is difficult for the operations team to shut every channel down by hand, and in around 10% of cases the channel is not stopped after the class ends.
This operational challenge results in an unintended and substantial financial burden for us. The prolonged running of these channels post-live sessions contributes to an unnecessary increase in our AWS billing, without corresponding educational or operational benefits.
After considering multiple approaches, we settled on a solution built on Amazon CloudWatch, AWS Elemental MediaLive, Amazon SNS, and AWS Lambda. If a MediaLive channel stays active without receiving any input for more than 30 minutes, it is stopped automatically. The code that stops the channel runs in AWS Lambda.
Each MediaLive channel publishes a CloudWatch metric called “Active Alerts.” MediaLive raises alerts for conditions such as missing video or audio, or no input at all. We use this Active Alerts metric to define a custom CloudWatch alarm.
The alarm is configured so that if Active Alerts stays above the threshold for more than 30 minutes, the alarm moves to the “in-alarm” state. That state change triggers an action that publishes a notification to an Amazon Simple Notification Service (SNS) topic, and a Lambda function is subscribed to that topic.
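The alarm described above could be expressed roughly as follows. This is a minimal sketch, not our exact production configuration: the helper name `build_idle_alarm`, the alarm naming scheme, the `Maximum` statistic, and the one-minute period are assumptions chosen to match the "one data point per minute, 30 minutes" behaviour described in this post. The `AWS/MediaLive` namespace and the `ChannelId` and `Pipeline` dimensions are how we understand the `ActiveAlerts` metric to be published, so verify them against your own account.

```python
def build_idle_alarm(channel_id, pipeline, sns_topic_arn, idle_minutes=30):
    """Return keyword arguments for CloudWatch put_metric_alarm (hypothetical helper)."""
    return {
        "AlarmName": f"medialive-idle-{channel_id}-{pipeline}",  # naming scheme is an assumption
        "Namespace": "AWS/MediaLive",
        "MetricName": "ActiveAlerts",
        "Dimensions": [
            {"Name": "ChannelId", "Value": str(channel_id)},
            {"Name": "Pipeline", "Value": str(pipeline)},
        ],
        "Statistic": "Maximum",             # assumption: any alert in the minute counts
        "Period": 60,                       # one data point per minute
        "EvaluationPeriods": idle_minutes,  # look at the last 30 data points
        "DatapointsToAlarm": idle_minutes,  # all of them must breach
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],    # SNS topic the Lambda subscribes to
    }


params = build_idle_alarm("1234567", 0, "arn:aws:sns:ap-south-1:111111111111:example-topic")
print(params["EvaluationPeriods"], params["ComparisonOperator"])
```

With boto3 installed and credentials configured, the dictionary can be passed on as `boto3.client("cloudwatch").put_metric_alarm(**params)`. The channel ID and topic ARN shown here are placeholders.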
When it receives the SNS event, the Lambda function extracts the relevant MediaLive channel ID and stops the channel, giving us an automated response to channels that are running without input.
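A Lambda function along these lines could do the job. It is a sketch under stated assumptions rather than our production code: the function names and the optional `medialive` parameter (added so the logic can be exercised without AWS access) are hypothetical, and the parsing assumes the usual shape of a CloudWatch alarm notification delivered through SNS, where the alarm JSON sits in `Records[0].Sns.Message` and carries its dimensions under `Trigger`.

```python
import json


def extract_channel_id(event):
    """Pull the ChannelId dimension out of a CloudWatch alarm delivered via SNS."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    trigger = message["Trigger"]
    # Single-metric alarms list dimensions on the trigger itself; metric-math
    # alarms nest them under Metrics[].MetricStat.Metric (assumed layout).
    candidates = list(trigger.get("Dimensions", []))
    for metric in trigger.get("Metrics", []):
        candidates += metric.get("MetricStat", {}).get("Metric", {}).get("Dimensions", [])
    for dim in candidates:
        name = dim.get("name") or dim.get("Name")
        if name == "ChannelId":
            return dim.get("value") or dim.get("Value")
    raise ValueError("no ChannelId dimension in alarm notification")


def handler(event, context, medialive=None):
    """Lambda entry point: stop the channel named in the alarm."""
    if medialive is None:
        import boto3  # imported lazily so the parsing logic runs without boto3
        medialive = boto3.client("medialive")
    channel_id = extract_channel_id(event)
    medialive.stop_channel(ChannelId=channel_id)
    return {"stopped": channel_id}
```

Lambda only ever passes `event` and `context`, so the third parameter stays `None` in production and the real MediaLive client is created. In practice you would probably also want to handle the case where the channel is already stopped.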
While our production setup was live and operational, we encountered a specific edge case.
Let’s take a look at the edge case scenario:
We start a MediaLive channel at 11:00 am.
We had a 30-minute threshold for the channel, meaning that if it ran continuously without receiving any input for 30 minutes, it would be stopped automatically. Now consider the timeline. The channel started at 11:00 and ran without any input stream for the first 15 minutes, which produced 15 data points for the alarm. The channel was then stopped, so no data points were reported while it was down. Up until 11:17, the alarm had therefore accumulated only 15 data points where Active Alerts was greater than zero.
The channel then resumed and ran for another 13 minutes without any input stream, giving the alarm 13 new data points where Active Alerts was greater than zero. By 11:30, the alarm had received a total of 28 such data points, so it was still two short of the 30 it needed.
From 11:30, the channel ran without input for two more minutes, providing the two remaining data points. That completed the required 30 data points, and the channel was stopped at 11:32, well before the restarted channel had spent 30 continuous minutes without input.
The graph below shows that no Active Alerts data is reported while the channel is stopped, around 19:57 UTC. This missing data is what leads to the unexpected CloudWatch behavior.
On deeper investigation, we traced this to how the alarm’s “evaluation period” handles missing data. When data was missing within the 30-minute window configured for the CloudWatch alarm, for example because the channel had been stopped for 1–2 minutes in between, the alarm would go on to consider the next 3–4 data points instead of treating the gap as a break in the idle period.
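The timeline above can be reproduced with a toy model. To be clear, this is not CloudWatch's actual evaluation algorithm; it is a simplified sketch that assumes missing minutes are skipped and the alarm fires once the 30 most recent reported data points all breach. The function name and the `fill` parameter are hypothetical; `fill` models substituting a value for every missing minute.

```python
from collections import deque


def minutes_until_alarm(series, needed=30, fill=None):
    """Return the minute at which the toy alarm fires, or None if it never does.

    series holds one Active Alerts value per minute; None means no data point.
    """
    window = deque(maxlen=needed)
    for minute, value in enumerate(series, start=1):
        if value is None:
            if fill is None:
                continue  # gap is skipped, earlier breaching points stay counted
            value = fill  # gap is replaced with an explicit value
        window.append(value)
        if len(window) == needed and all(v > 0 for v in window):
            return minute
    return None


# 15 idle minutes, 2 minutes stopped (no data), then 13 + 2 idle minutes
timeline = [1] * 15 + [None] * 2 + [1] * 13 + [1] * 2

print(minutes_until_alarm(timeline))          # 32, i.e. 11:32
print(minutes_until_alarm(timeline, fill=0))  # None, the gap resets the count
```

When the gap is skipped, the two idle stretches are stitched together and the alarm fires 32 minutes after 11:00. When the gap is filled with zeros, the window never holds 30 breaching points in a row within this timeline.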
To address the issue stemming from missing data, we used a CloudWatch metric math feature, the FILL function (the “fill function”). It ensures that if MediaLive reports nothing for a given period, the number of active alerts is treated as 0. With this approach, CloudWatch always has a numeric value for active alerts for any given channel, which handled our edge case.
The graph below illustrates the fill function substituting zero for any missing Active Alerts data points.
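An alarm built on the fill function might look like the following. Treat it as a sketch: the helper name, alarm name, metric IDs `m1` and `e1`, the `Maximum` statistic, and the `AWS/MediaLive` namespace with `ChannelId` and `Pipeline` dimensions are assumptions for illustration, while `FILL(m1, 0)` is the CloudWatch metric math expression that replaces missing values of `m1` with 0.

```python
def build_filled_idle_alarm(channel_id, pipeline, sns_topic_arn, idle_minutes=30):
    """Return put_metric_alarm keyword arguments for a metric-math alarm (hypothetical helper)."""
    raw_metric = {
        "Id": "m1",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/MediaLive",
                "MetricName": "ActiveAlerts",
                "Dimensions": [
                    {"Name": "ChannelId", "Value": str(channel_id)},
                    {"Name": "Pipeline", "Value": str(pipeline)},
                ],
            },
            "Period": 60,       # one data point per minute
            "Stat": "Maximum",  # assumption
        },
        "ReturnData": False,    # the alarm watches the expression, not the raw metric
    }
    filled = {
        "Id": "e1",
        "Expression": "FILL(m1, 0)",  # missing minutes become 0 active alerts
        "Label": "ActiveAlertsFilled",
        "ReturnData": True,
    }
    return {
        "AlarmName": f"medialive-idle-filled-{channel_id}-{pipeline}",
        "Metrics": [raw_metric, filled],
        "EvaluationPeriods": idle_minutes,
        "DatapointsToAlarm": idle_minutes,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


params = build_filled_idle_alarm("1234567", 0, "arn:aws:sns:ap-south-1:111111111111:example-topic")
print(params["Metrics"][1]["Expression"])  # FILL(m1, 0)
```

Because the alarm evaluates the expression `e1`, a minute in which the channel is stopped shows up as an explicit 0 rather than a gap, so it breaks the run of breaching data points. The dictionary is intended to be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)`.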
This setup not only made our use of AWS Elemental MediaLive smoother but also saved us money. On average, we run 100 channels for 4 hours each day, for a total of 400 channel-hours. A 5% miss rate results in an additional 100 hours of unnecessary runtime, which is about 25% on top of those 400 hours. By eliminating those missed shutdowns, we effectively save approximately 25% of our total cost. We have also removed the need for manual intervention by the operations team, and with it the possibility of human error.
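The arithmetic behind that estimate can be laid out explicitly. The sketch below uses only the figures quoted above; the variable names are ours, and no hourly price is assumed, so the result is a ratio of channel-hours rather than a currency amount.

```python
channels_per_day = 100
hours_per_channel = 4
extra_idle_hours = 100  # extra runtime attributed to missed shutdowns, as quoted above

productive_hours = channels_per_day * hours_per_channel  # 400 channel-hours
billed_hours = productive_hours + extra_idle_hours        # 500 channel-hours

# Idle runtime relative to the hours we actually needed
overhead_vs_productive = extra_idle_hours / productive_hours
# Idle runtime as a share of everything that was billed
share_of_billed = extra_idle_hours / billed_hours

print(f"{overhead_vs_productive:.0%} on top of productive hours")  # 25%
print(f"{share_of_billed:.0%} of all billed hours")                # 20%
```

The 25% figure is the idle runtime measured against productive hours. Measured against everything that was billed, the same 100 hours comes to 20%.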
Even when we hit a surprise hiccup, we found a fix with CloudWatch’s “fill function” and kept the solution on track. This is not just about technology; it is also about our team staying resilient. As we keep improving, we are not only cutting costs but also showing how to stream smarter on AWS without breaking the bank. Our journey isn’t done: we’re committed to keeping things simple, smart, and budget-friendly in the ever-changing world of AWS live streaming.
Written by — Tejas Gupta, DevOps Team
Tejas Gupta works as a DevOps Engineer at PhysicsWallah and is also an AWS Community Builder. He actively contributes to the tech community. His knowledge spans multiple areas, with expertise in streaming, infrastructure, and security. He has been handling media and infrastructure at PW for the past year.