RCA of Subscription Outage in v2.9.0
In the tech world, when software defects (bugs) degrade a system's functionality and cause a severe event or outage, we do a root cause analysis (RCA). Tempo's v2.9.0 caused an outage for paid subscribers: the app did not recognize premium subscriptions, and no one could access premium features. It was a simple programming mistake, but a long time coming, and because it affects my long-term strategy, I want to share the details of this RCA with everyone.
I am sorry!
First, my apologies for the outage. This should not have happened, and I will strive to do better next time.
While I wish this had not happened, the support and kindness I received from all of you as we exchanged emails during the outage was really encouraging. Thank you for your patience and understanding. ♥️
What happened?
Tempo's premium subscription was not recognized by v2.9.0 of the app. Even attempting to restore a previous purchase did not fix the issue. This meant that users who had subscribed to premium could not access any premium features; Tempo could only be used in basic free mode.
The issue was discovered late on Wednesday night (Feb 18, 2020). I was able to troubleshoot and fix the bug overnight to submit v2.9.1 with a request for an expedited app review. The App Store team was kind enough to pick it up first thing in the morning (Thursday), and approved it quickly. The outage lasted for ~12 hours.
Why did it happen?
v2.9.0 introduced a software defect through code changes to how subscriptions (free trial and paid) are recognized and restored in Tempo.
Why did v2.9.0 change the code that handles subscriptions?
In v2.8.0, Tempo replaced a custom implementation of the free 14-day trial with the App Store's standard free trial implementation that is supported and managed by the App Store's payment system infrastructure.
When a subscription transitions from a free trial to paid (the customer has decided to continue with the subscription), the subscribed app (Tempo) receives a callback from the App Store payment system to notify the app that it should continue to recognize the subscription as active.
Without this callback, the app treats the free trial as over once the trial period ends and disables the subscribed features (Tempo premium in our case). Starting with v2.8.0, some folks reported a delay between the time the free trial ended and the time the paid subscription became active. To resolve this, I updated the code in the app that handles subscriptions and released v2.8.2. But the issue did not go away completely, and I was still hearing about the delay from some folks.
So in v2.9.0, I implemented a grace period that keeps Tempo premium enabled for a few more days after the subscription expires. With this change, if the app does not receive the update for the free-trial-to-paid transition in time, the grace period kicks in to prevent any disruption for the folks who have paid.
And yes, this also means that folks who do not continue with their subscription get some extra free time with premium, but I would rather allow extra days of free premium for folks who don't pay than have any kind of disruption for all you caring, paying patrons.
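The grace-period idea can be sketched in a few lines. This is only an illustration, not Tempo's actual code: the function name `premium_enabled` and the 3-day grace length are assumptions (the post only says "a few more days").

```python
from datetime import datetime, timedelta, timezone

# Hypothetical grace length; the post only says "a few more days".
GRACE_PERIOD = timedelta(days=3)


def premium_enabled(expires_at: datetime, now: datetime,
                    grace: timedelta = GRACE_PERIOD) -> bool:
    """Return True while the subscription, extended by the grace period, is still running."""
    # Order the two dates directly instead of reasoning about how far apart they are.
    return now < expires_at + grace


trial_end = datetime(2020, 2, 10, tzinfo=timezone.utc)
print(premium_enabled(trial_end, trial_end - timedelta(days=1)))  # True: trial still running
print(premium_enabled(trial_end, trial_end + timedelta(days=1)))  # True: inside the grace period
print(premium_enabled(trial_end, trial_end + timedelta(days=5)))  # False: grace period is over
```

If the renewal callback arrives during the grace period, the stored expiry date moves forward and the same check keeps returning True.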
Why was the defect caused by a code change not caught during testing, before affecting real customers in production?
I made a human error in the date comparison that identifies a date in the past. Instead of ordering the two dates, my careless code checked whether the time difference between them was less than 24 hours. This code change was tested against the sandbox environment of the in-app payment processing services, where test subscriptions expire within 24 hours. Because the 24-hour threshold in my code happened to match the sandbox's 24-hour expiry window, the bug went undiscovered. Also, testing all the subscription scenarios against a sandbox environment requires a lot of manual subscribe-wait-test-repeat cycles that can take hours (or a full day at times) and is, therefore, hard to automate.
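To illustrate why a check like this can pass in the sandbox and still fail in production, here is one hypothetical reading of the mistake. The actual Tempo code is not shown in this post, so both function names and the exact form of the faulty check are assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_past_buggy(date: datetime, now: datetime) -> bool:
    """Assumed form of the mistake: 'past' is inferred from the dates being under 24 hours apart."""
    return abs(now - date) < timedelta(hours=24)


def is_past(date: datetime, now: datetime) -> bool:
    """Ordering the dates is all that is needed."""
    return date < now


now = datetime(2020, 2, 19, tzinfo=timezone.utc)

# Sandbox-like case: everything happens within 24 hours, so both checks agree.
sandbox_date = now - timedelta(hours=1)
print(is_past_buggy(sandbox_date, now), is_past(sandbox_date, now))  # True True

# Production-like case: dates are weeks apart, and the two checks disagree.
production_date = now - timedelta(days=30)
print(is_past_buggy(production_date, now), is_past(production_date, now))  # False True
```

A test suite that only ever sees dates less than a day apart cannot tell these two functions apart, which is the trap described above.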
Why did a defect in processing subscription callbacks break access to premium features for customers who already had active subscriptions?
Tempo derives access to premium from the current subscription state (unsubscribed, active, or expired). This state is updated by payment transaction callbacks. The code change was meant to handle state transitions more gracefully, but in the process it also affected existing (unchanged) state. The resulting bug made the subscription state of every user appear expired.
What should be changed to avoid this?
The key issues that caused this outage are:
- The subscription state, which acts as the source of truth for access to premium, is coupled with the code that processes transactions. This state is managed on the device, and the state itself is mutable: it can change as a result of code changes.
- Testing changes to this code against the production environment before releasing to the App Store is difficult.
Tempo currently has no custom server-side services (a backend for Tempo), so all payment transactions are processed on the device, and the subscription state is also maintained on the customer's device. This code is complex and fragile, and adequately testing the app against a sandbox environment is challenging.
This needs to change to support the following:
- A separate, immutable source of truth for access to premium. This should be decoupled from the code that processes new transactions.
- Ability to test code in pre-released versions of the app.
- Ability to quickly fix any production outage with minimal reliance (none in most cases) on the App Store review team. In other words, while the app review team is awesome, I should not rely on or abuse the expedited app review request process.
A better architecture here would be a backend that maintains the source of truth for access to premium features in a database. This source of truth would still be updated by new transactions, but instead of modifying database records, it would maintain a list of updates/additions, similar to the structure the App Store API uses for payment transactions. This architecture would provide the following benefits:
- Enable better testing and provide ability to automate various scenarios.
- Unlike on the app side, server-side source-of-truth records will be permanent and cannot be deleted. This further reduces the risk that a future code bug (like the one on the app side) corrupts the source of truth or forces it to be recreated.
- Handle similar outages much faster: server-side code fixes won't require changes to the Tempo app, avoiding a full release and app approval cycle.
- More graceful handling of interruptions to the production environment of in-app payment processing services.
- Better security around subscription verifications.
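The append-only source of truth described above can be sketched roughly as follows. This is a minimal in-memory illustration under stated assumptions, not a design Tempo has committed to: the `Transaction` fields, the `Ledger` class, and `has_premium` are all hypothetical names, and a real backend would keep the records in a database.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Transaction:
    """One immutable payment record (field names are hypothetical)."""
    user_id: str
    expires_at: datetime


class Ledger:
    """Append-only store: records are added, never edited or deleted."""

    def __init__(self) -> None:
        self._records = []  # list of Transaction, in arrival order

    def append(self, tx: Transaction) -> None:
        self._records.append(tx)

    def has_premium(self, user_id: str, now: datetime) -> bool:
        # Access is derived on read from the full history, so a bug here
        # can be fixed without having to repair any stored data.
        return any(tx.user_id == user_id and now < tx.expires_at
                   for tx in self._records)


ledger = Ledger()
ledger.append(Transaction("runner-1", datetime(2020, 3, 1, tzinfo=timezone.utc)))
ledger.append(Transaction("runner-1", datetime(2020, 4, 1, tzinfo=timezone.utc)))  # renewal

print(ledger.has_premium("runner-1", datetime(2020, 3, 15, tzinfo=timezone.utc)))  # True
print(ledger.has_premium("runner-1", datetime(2020, 5, 1, tzinfo=timezone.utc)))   # False
print(ledger.has_premium("runner-2", datetime(2020, 3, 15, tzinfo=timezone.utc)))  # False
```

Because the stored records never change, the decision logic in `has_premium` can be covered by automated tests using fixed transaction histories, with no waiting on sandbox subscriptions to expire.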
Conclusion
Tempo's primary goal is to be the best running app out there. I have been obsessively focused on building all the features that runners need. It's gratifying to ship all these app-side features for runners (including myself), but in doing so, I haven't prioritized upgrading to a server-side architecture that would improve resiliency. Tempo started as a side project, and I have been delaying the server-side part until I got to the features on the roadmap that require a backend service. But the lack of a backend is now disrupting the quality experience that patrons expect and that I care deeply about delivering. Quality is at risk because of my focus on the quantity of app-side features. I have already started looking into building out the backend architecture for Tempo. After the current in-progress projects are shipped, porting in-app purchase processing over to the server side will become my top priority.
My thanks to Eric for helping craft this message so that it makes more sense than me just rambling.