Problems When Scaling Fails–and Solutions

🗓 13 May 2020

In: 📁 DevOps

Things were working fine a few days ago... and then traffic spiked for whatever reason and the system could not handle it. Everything is slow or just not working. Maybe deploying new code to your servers isn't working consistently either. It's just a hot mess. The application was working fine as it was, but adding more load caused it to crumble rather than absorb it as it had previously—or as you thought you had planned for.

This isn't an uncommon problem. Big websites have the issue all the time (cough, Apple when new things go on sale, Reddit, etc.). So you definitely aren't alone if you have experienced the above. But maybe you're alone in being responsible for actually fixing it. In which case, this situation can be very daunting. Where do you even begin to get things up and running to handle the traffic?

Before you can fix the problem, you'll need to know the problem. But knowing the problem is only half the problem (well, maybe not even half). So, below are common infrastructure problems, with short- and long-term solutions for each. The short-term solutions are for when you're trying to triage the system and get it back up and running. After that, you should prioritize the long-term solutions so it doesn't happen again. You need both to handle the situation.

The database is on fire

By on fire, I mean either CPU is maxed out/nearly there, or that the connections are maxed out. Depending on the application, this could be due to zombie processes or runaway processes... in which case a nuclear fix is just restarting servers responsible for such things. A longer-term solution is fixing the app code to be more efficient, if possible, or just increasing the database power. If you're using a serverless database, this shouldn't be a recurring issue or a prolonged one.

Short-Term Solutions

To reiterate the above, a quick fix is restarting servers to kill any process hogging the CPU.
Depending on your definition of quick and tolerance for service disruption, you could increase the database tier-size so it has more CPU and RAM.

Long-Term Solutions

Verify you have proper indexes in place to speed up queries against commonly accessed data.
In addition to indexes, check that your database parameters are not artificially creating bottlenecks.
Make sure there are no outrageous N+1 queries on the main app pages. If slow query logs are available for your database (through RDS, for example), enable them to find the worst-performing queries.
Upgrading the database engine to the latest version.
Use a serverless database and set it so it can scale up to handle increased usage.
Build out read replicas so heavy read-operations are spread out and do not impact the core instance.
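
To make the index and N+1 points above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The authors/posts tables, the index name, and the function names are all hypothetical, made up for illustration. Your schema and database engine will differ, but the shape of the problem is usually the same: one query per row versus one query for everything.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical authors/posts tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            author_id INTEGER NOT NULL REFERENCES authors(id),
            title TEXT NOT NULL
        );
        -- Index the column used to look up posts by author.
        CREATE INDEX idx_posts_author_id ON posts(author_id);
        """
    )
    conn.executemany(
        "INSERT INTO authors (id, name) VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace"), (3, "Linus")],
    )
    conn.executemany(
        "INSERT INTO posts (author_id, title) VALUES (?, ?)",
        [(1, "Indexes"), (1, "Replicas"), (2, "Caching"), (3, "Swap")],
    )
    conn.commit()
    return conn


def titles_n_plus_one(conn: sqlite3.Connection):
    """N+1 pattern: one query for the authors, then one more per author."""
    queries = 1
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_join(conn: sqlite3.Connection):
    """Same data fetched with a single JOIN, regardless of author count."""
    rows = conn.execute(
        """
        SELECT a.name, p.title
        FROM authors a
        LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
        """
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors with no posts still get an empty list
            result[name].append(title)
    return result, 1
```

Both functions return the same mapping of author to titles. The difference is that the first one issues a query count that grows with the number of authors, which is exactly the kind of thing that looks fine in development and falls over under load.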

The servers are on fire

Similar to the database: the servers have too much traffic slamming them and are being bottlenecked by CPU or RAM. More servers could spread out the load, and so could larger servers. Either fewer, bigger servers or more, smaller ones, basically.

Short-Term Solutions

Restart the "problem" servers.
Increase size of the server (higher CPU, more RAM).
Add more servers to the load balancer and tweak auto-scaling triggers (perhaps switch to network traffic from CPU or vice-versa depending on what type of load is causing problems).
If memory is the issue and you cannot/do not want to increase server memory, add swap.

Long-Term Solutions

To reduce general load on the servers for pages that aren't dynamic, put a CDN such as CloudFront in front of them to cache static resources. This keeps the servers from doing work to serve up content that doesn't change.
You can even do caching at an application page level if you cannot use a CDN, depending on your framework/application.
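
As a rough idea of what application-level page caching can look like, here is a minimal in-memory sketch with a time-to-live. The PageCache class and its method names are hypothetical, not part of any framework. Most frameworks ship their own caching layer, and that is usually the better choice. This only illustrates the mechanism.

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Tiny in-memory cache for rendered pages, keyed by request path."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get_or_render(self, path: str, render: Callable[[], str]) -> str:
        """Return the cached page if it is still fresh, otherwise render it."""
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]
        html = render()  # the expensive part: templates, queries, and so on
        self._entries[path] = (now, html)
        return html
```

A call such as cache.get_or_render("/pricing", render_pricing_page) only runs the render function when the cached copy is missing or older than the TTL. Note that this sketch is per-process and unbounded. With several servers behind a load balancer, each one keeps its own copy, so a shared cache may be a better fit.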

Disk space (or inodes) maxed out

Maybe log rotation isn't working, or your server volumes were tiny and more logging than expected caused it to fill up. Or some file-upload flow went awry with massive, unchecked uploads. Regardless: the disk space is almost maxed out.

Another issue related to disk space is inode utilization maxing out. This can cause the same kind of weirdness/problems as running out of disk space. Zombie processes or many tiny files can be the culprit.

Short-Term Solutions

If your servers are expendable, create new ones and change the scaling config to bump the volume size going forward.
For inode issues, restarting the servers may resolve it if it's related to transient zombie processes.
Increase the size of the EBS volume. After the resize finishes, you may need to manually extend the partition and filesystem at the OS level so the new space is recognized. If you are able to, it may be faster to just create new servers/volumes than to wait for AWS to expand the EBS volume and then extend it yourself at the OS level.

Long-Term Solutions

More robust log rotation, if that's the root cause of the disk space being eaten up.
If files are legit being saved onto a server, move them to the cloud (as in, S3). If it was just a few massive uploads, potentially add a file-size limit to file uploads (if not, increase volume size to handle).
Add monitoring to track disk space (and inode usage) so you can increase capacity before it becomes an issue.
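
For the monitoring point, a basic check can be sketched with Python's standard library alone. The function names and the 70/90 percent thresholds below are assumptions picked for illustration, not recommendations. In practice you would likely feed these numbers into whatever monitoring service you already use. Note that os.statvfs is only available on Unix-like systems.

```python
import os
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Percentage of disk space used on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def inode_used_percent(path: str = "/") -> float:
    """Percentage of inodes used on the filesystem containing path (Unix only)."""
    stats = os.statvfs(path)
    if stats.f_files == 0:
        # Some filesystems do not report an inode total; treat that as 0% used.
        return 0.0
    return 100.0 * (stats.f_files - stats.f_ffree) / stats.f_files


def severity(percent: float, warn: float = 70.0, critical: float = 90.0) -> str:
    """Map a usage percentage onto a simple ladder of severity."""
    if percent >= critical:
        return "critical"
    if percent >= warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for label, value in (("disk", disk_used_percent("/")),
                         ("inodes", inode_used_percent("/"))):
        print(f"{label}: {value:.1f}% used ({severity(value)})")
```

Checking inodes separately matters because a volume can have plenty of free space and still be out of inodes, in which case a disk-space-only alarm would never fire.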

Too Much Data

This is where serverless may not be the solution anyway. You are literally pulling in too much data for anything to handle at once. As in, not paginating. This can cause problems on both the client side and the server side (even if serverless).

Unfortunately, short-term solutions are hard to find for this one. This one you can probably see coming, so hopefully this is dealt with before it causes issues for everyone.

Long-Term Solutions

Reduce payloads being returned from server and/or database.
Paginate if you aren't and you could be!
Back-end filtering/sorting rather than pulling in all data to be processed on the client-side.
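
Here is one way pagination on the back end might look, sketched with sqlite3. The items table and the fetch_page function are hypothetical. This version uses a cursor based on the last seen id rather than an offset, which is one common approach. A plain LIMIT/OFFSET works too and may be simpler if your data set is modest.

```python
import sqlite3
from typing import List, Optional, Tuple


def fetch_page(
    conn: sqlite3.Connection,
    after_id: int = 0,
    page_size: int = 50,
) -> Tuple[List[Tuple[int, str]], Optional[int]]:
    """Return one page of rows plus a cursor for the next page.

    Filtering, sorting, and limiting all happen in the database, so the
    server never loads (or sends) more than page_size rows at a time.
    """
    rows = conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()
    # A full page suggests there may be more rows; a short page means we are done.
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor
```

The client passes the returned cursor back as after_id to get the next page and stops when the cursor is None. If the total row count happens to be an exact multiple of the page size, the final request returns an empty page, which is harmless.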

Conclusion

It's sad to say, but restarting servers is a quick way to clear out most problems in the short term so you can get breathing room. It won't be the long-term solution, though. A long-term solution involves a combination of optimization, rearchitecting, and throwing more power at the problem. You can either just beef up what you have or figure out ways to spread it out and allow it to handle better. Both are valid answers depending on the context for a problem and the surge in traffic/usage.

But, going forward, you want to be able to nip these problems in the bud before they become critical fires. This can mean scaling the site up ahead of expected traffic, adding more monitoring/alarm triggers (like a ladder of severity), and following some of the solutions noted above: CDN, cloud storage, proper app resource utilization, and keeping resources up to date.