The Paragon servers are still offline but there is now a full status update from Epic explaining what went wrong.
The outage appears to have started when Epic granted chests to existing players and inadvertently DDoSed themselves with the resulting load. Ouch! Work on their database is still ongoing: it has had to be upgraded and now requires more testing.
Fingers crossed Paragon won’t be down too much longer, but it sounds like the planning for this launch went terribly wrong. An update thread is now live on the forums, and it’s worth keeping an eye on that as well as the launcher.
Update: The Paragon servers appear to be back online as of 1:45 AM GMT on 9 August.
3:30 AM – planned outage for release of v42
5:30 AM – services back online after successful smoke test
5:3x AM – services buckle under load
By granting chests to existing players (several hundred for some), we inadvertently ended up DDoS’ing ourselves with a 200x increase in backend load due to the chest opening process.
We were unsuccessful in mitigating this issue with the services running and degradation of quality forced us to take the services back down.
7:20 AM – unable to mitigate issue while system is running resulting in us taking downtime
We changed code to bulk open chests drastically reducing backend load and deployed the change.
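To illustrate why bulk opening helps, here is a minimal sketch, not Epic's actual code. It assumes a backend where each chest opened individually costs one profile write, while a bulk open rolls all the loot first and writes once. `FakeProfileStore`, `roll_chest`, and both `open_chests_*` functions are hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeProfileStore:
    """Hypothetical stand-in for the profile backend; it only counts writes."""
    writes: int = 0
    inventories: dict = field(default_factory=dict)

    def save(self, player_id, items):
        # Every call models one round trip to the backend.
        self.writes += 1
        self.inventories.setdefault(player_id, []).extend(items)


def roll_chest(chest_id):
    # Placeholder loot roll, deterministic so the example is reproducible.
    return f"loot-from-{chest_id}"


def open_chests_one_by_one(store, player_id, chest_ids):
    # One backend write per chest: load grows with the number of chests.
    for chest_id in chest_ids:
        store.save(player_id, [roll_chest(chest_id)])


def open_chests_in_bulk(store, player_id, chest_ids):
    # Roll everything in memory, then persist the result with a single write.
    loot = [roll_chest(chest_id) for chest_id in chest_ids]
    if loot:
        store.save(player_id, loot)
```

Under these assumptions, a player holding 300 chests costs 300 writes one way and a single write the other, with the same resulting inventory.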
9:40 AM – services are back online
Service recovery was short lived and our DB setup got into a bad state between primary and replicas. Overall load was still too high for our infrastructure.
11:00 AM – services offline
We are using the downtime to upgrade the DB to the latest version, for an estimated 2x increase in the load we can handle. We are also bringing over recent optimizations for profile handling from our experience with Fortnite, and working to compress our profiles to counter the 2x increase in profile size introduced by v42.
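As a rough idea of what profile compression can look like, here is a small sketch. It assumes profiles are JSON-like documents and uses zlib from the Python standard library; the update does not say which format or codec Epic uses, and `pack_profile` / `unpack_profile` are hypothetical helper names.

```python
import json
import zlib


def pack_profile(profile: dict) -> bytes:
    # Serialize compactly, then compress. Level 6 is zlib's usual
    # speed/size trade-off; the right level is an assumption here.
    raw = json.dumps(profile, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return zlib.compress(raw, 6)


def unpack_profile(blob: bytes) -> dict:
    # Reverse of pack_profile: decompress, then parse the JSON.
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Profiles full of repeated keys and similar items tend to compress well, which means fewer bytes moving to and from the DB per read or write, at the cost of some CPU on each side.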
We expect this (*fingers crossed*) to allow us to handle the load, but also need to ensure we get the DB back into a synchronized state before we can go live.
We are running into a bottleneck where a single person is responsible for all the remaining work (no pressure…).
1:30 PM – technical update
My apologies for the delay in getting information out to everyone!
We are expecting the outage to persist for a while longer and will do a proper post mortem like we did with Fortnite’s recent outage here.
We are also accelerating work to shard the DB.
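For readers unfamiliar with sharding: the idea is to split profiles across several databases so that no single one carries the whole load. A minimal sketch of hash-based routing follows. The player-id key, the shard count, and the `shard_for` name are all assumptions for illustration; the update gives no detail on how Epic plans to shard.

```python
import hashlib


def shard_for(player_id: str, shard_count: int) -> int:
    """Map a player id to a shard index in [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # A stable hash (unlike Python's built-in hash()) gives the same
    # answer on every process and every restart.
    digest = hashlib.sha256(player_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Simple modulo routing like this moves most keys when the shard count changes, which is one reason real deployments often use consistent hashing or a lookup table instead.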
2:15 PM – technical update
We have a list of tasks to complete, but no good way to provide a meaningful (aka accurate) update. We are also talking about aggressively limiting rate of new users when we bring services online again as we are not sure whether our current changes will be sufficient.
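One common way to limit the rate of incoming users is a token bucket, sketched below. This is an illustration of the general technique, not a description of Epic's system; `AdmissionLimiter` and its parameters are hypothetical, and the caller passes in the current time so the behaviour is easy to test.

```python
class AdmissionLimiter:
    """Token bucket: admits at most `burst` users at once, refilling at
    `rate_per_sec` admissions per second."""

    def __init__(self, rate_per_sec: float, burst: int, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = now

    def try_admit(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the bucket size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # No token available: the caller should queue or retry later.
        return False
```

Lowering `rate_per_sec` while the backend is fragile and raising it as confidence grows is the kind of knob "aggressively limiting" suggests.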
3:15 PM – technical update
We are running into and working through issues getting the new MCP build deployed (it is currently failing a unit test) so that we can test the DB upgrade in our testing environment.
We have a few additional operational items in-progress as well.
It should be roughly 20 minutes to test that DB upgrade caused no harm after MCP deploy succeeds, and if testing is successful we will roll changes to Live (production) environment, sanity test, and start bringing folks back in slowly.
4:15 PM – technical update
MCP is being deployed to live testing environment. QA will sanity test changes there (20 min). Assuming nothing goes wrong (and it did previously) this will not be the longest pole.
DB update needs to finish in production environment (unknown), followed by ensuring we have a valid backup (unknown), and enabling of compression (seconds).
Once that is done we deploy to live (20 min) and do final testing (10 min).
The times aren’t additive, but expect at least 30 min after we have the DB updated and the backup verified.
5:30 PM – technical update
Sadly not much to update other than us trying to parallelize as much as possible to reduce time to being back up again.
5:45 PM – technical update
DB changes / updates are mostly done, now getting ready to do deploys and testing. If everything goes well (and it rarely does) we should be online in an hour.
6:00 PM – testing begins in staging / live testing environment
We are testing changes in our staging environment. Given the scope of the changes we made for v42, this is an important step to ensure that we are not breaking your profiles. There are some ongoing tweaks to the production DB; once testing in staging and those changes are done, we will deploy the MCP changes to the live / production environment.
6:30 PM – testing in staging successful
Testing was successful in staging environment. We are currently limited from verifying cross-play in this environment though.
We are in the process of doing last-minute tweaks to the DB, which will be followed by deployment of MCP (there is a dependency), testing, and enabling the waiting room to let players in. This has felt about 30 minutes out for 90 minutes, so I’m not sure how accurate my estimates are going to be here.