Problems When Scaling Fails–and Solutions

Everything is on fire, but that's okay...we have water

Originally posted on the 8th Light Blog


Things were working fine a few days ago... and then traffic spiked, for whatever reason, and the system could not handle it. Everything is slow or just not working. Maybe deploying new code to your servers isn't working consistently either. It's just a hot mess. The application was fine under its usual load, but the extra load made it crumble rather than absorb it as it had previously, or as you thought you had planned for.

This isn't an uncommon problem. Big websites have the issue all the time (cough, Apple when new things go on sale, Reddit, etc.). So you definitely aren't alone if you have experienced the above. But maybe you're alone in being responsible for actually fixing it. In which case, this situation can be very daunting. Where do you even begin to get things up and running to handle the traffic?

Before you can fix the problem, you'll need to know the problem. But knowing the problem is only half the battle (well, maybe not even half). So, below are common infrastructure problems, with short- and long-term solutions for each. The short-term solutions are for when you're trying to triage the system and get it back up and running. After that, you should prioritize the long-term solutions so it doesn't happen again. You need both to handle the situation.

The database is on fire

By on fire, I mean either the CPU is maxed out (or nearly there) or the connections are maxed out. Depending on the application, this could be due to zombie processes or runaway processes... in which case the nuclear fix is just restarting the servers responsible for them. A longer-term solution is making the app code more efficient, if possible, or just increasing the database's power. If you're using a serverless database, this shouldn't be a recurring or prolonged issue.
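One way app code can avoid maxing out connections is to cap how many it opens in the first place. The sketch below is an illustration rather than a fix this post prescribes: `BoundedPool` and `make_connection` are hypothetical names, and an in-memory SQLite connection stands in for whatever database driver you actually use.

```python
import queue
import sqlite3
from contextlib import contextmanager


class BoundedPool:
    """Hands out at most `size` connections; extra callers wait or time out."""

    def __init__(self, size, connect):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=2.0):
        # Raises queue.Empty if no connection frees up within `timeout` seconds.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back to the pool


def make_connection():
    # Hypothetical stand-in for a real database driver's connect() call.
    return sqlite3.connect(":memory:", check_same_thread=False)


pool = BoundedPool(size=2, connect=make_connection)

with pool.connection() as conn:
    print(conn.execute("SELECT 1").fetchone())
```

With a cap like this, a traffic spike makes requests wait (or fail fast) in the app instead of piling more connections onto a database that is already struggling.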

Short-Term Solutions

Long-Term Solutions

The servers are on fire

Similar to the database: too much traffic is slamming the servers, and they are bottlenecked on CPU or RAM. More servers could spread out the load; so could larger servers. Basically, either fewer servers with more power each, or more servers with less.

Short-Term Solutions

Long-Term Solutions

Disk space (or inodes) maxed out

Maybe log rotation isn't working, or your server volumes were tiny and more logging than expected filled them up. Or some file-upload flow went awry with massive, unchecked uploads. Regardless: the disk space is almost maxed out.

Another disk-related issue is inode utilization maxing out. This can cause the same kind of weirdness as running out of disk space. Zombie processes or lots of tiny files can be the culprit.
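Both numbers are cheap to check from a script. Here is a minimal sketch, assuming a Unix-like host (`os.statvfs` is not available on Windows); `disk_report` is a hypothetical helper name, and the percentages are computed from raw totals, so they may differ slightly from what `df` prints.

```python
import os
import shutil


def disk_report(path="/"):
    """Return the percent of bytes and inodes used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    bytes_pct = 100.0 * usage.used / usage.total if usage.total else 0.0

    # f_files is the total inode count and f_ffree the free inodes.
    # Some filesystems report 0 total inodes, so guard the division.
    st = os.statvfs(path)
    inodes_pct = 100.0 * (st.f_files - st.f_ffree) / st.f_files if st.f_files else 0.0

    return {"bytes_pct": round(bytes_pct, 1), "inodes_pct": round(inodes_pct, 1)}


print(disk_report("/"))
```

If `inodes_pct` is near 100 while `bytes_pct` looks healthy, go looking for directories full of tiny files rather than a few huge ones.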

Short-Term Solutions

Long-Term Solutions

Too Much Data

This is where serverless may not be the solution anyway. You are literally pulling in too much data for anything to handle at once. As in, not paginating. This can cause problems on both the client side and the server side (even if serverless).

Unfortunately, short-term solutions are hard to find for this one. You can probably see it coming, though, so hopefully it gets dealt with before it causes issues for everyone.

Long-Term Solutions
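Paginating is the obvious one. Below is a minimal sketch of keyset pagination, where each request asks for the rows after the last id it saw. The `items` table, the `fetch_page` helper, and the page size are all made up for illustration, and SQLite is only here so the example runs on its own.

```python
import sqlite3


def fetch_page(conn, after_id=0, page_size=100):
    """Return up to `page_size` rows with id > after_id, plus the next cursor."""
    rows = conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()
    # A short page means there is nothing left to fetch.
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [(f"item-{i}",) for i in range(250)],
)

cursor, total = 0, 0
while cursor is not None:
    rows, cursor = fetch_page(conn, after_id=cursor, page_size=100)
    total += len(rows)  # only one page is held in memory at a time

print(total)
```

Filtering on `id > ?` rather than using a growing `OFFSET` keeps each query roughly the same cost no matter how deep into the data you are.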

Conclusion

It's sad to say, but restarting servers is a quick way to clear out most problems in the short term so you can get breathing room. It won't be the long-term solution, though. A long-term solution involves a combination of optimization, rearchitecting, and throwing more power at the problem. You can either just beef up what you have or figure out ways to spread the load out so the system handles it better. Both are valid answers, depending on the context of the problem and the surge in traffic/usage.

But, going forward, you want to be able to nip these problems in the bud before they become critical fires. This can mean setting the site up ahead of expected traffic, adding more monitoring/alarm triggers (like a ladder of severity), and following some of the solutions noted above: CDN, cloud storage, proper app resource utilization, and keeping resources up to date.
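A ladder of severity can be as simple as a few thresholds checked from the top down. This is only a sketch: the thresholds, labels, and metric names are hypothetical and should be tuned to your own system.

```python
# Hypothetical rungs: (minimum percent used, severity label), highest first.
LADDER = [(95, "page"), (85, "warn"), (70, "notice")]


def severity(percent_used):
    """Map a utilization percentage to the highest rung it reaches."""
    for threshold, label in LADDER:
        if percent_used >= threshold:
            return label
    return "ok"


readings = {"db_cpu": 97.0, "disk": 88.5, "db_connections": 40.0}
for metric, value in readings.items():
    print(metric, severity(value))
```

The lower rungs exist so that someone hears about a problem while it is still a notice, well before it turns into a page.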