Need help scaling Spinster, seeking volunteers
alexgleason — 6 votes, 13 comments

spinster.xyz is currently the fastest growing fediverse instance. I’m having trouble keeping up with it. I think we need a load balancer and we need to split the application across multiple servers, but this is something I’ve never done before. I’d appreciate it if any volunteers would be willing to help me out! You can find me at:

Email: alex@alexgleason.me
Fediverse: @alex@spinster.xyz
Matrix: @alexgleason:matrix.org

Thank you!

the fastest growing fediverse instance

Sorry, but… how fast? As far as I can tell it hasn’t been open for very long. With a single server you should be able to serve thousands or tens of thousands of requests per second without much trouble. That’s a lot of users! I’ve never personally managed a Mastodon instance, but from what I know the first problem is going to be hard disk space; these instances use a lot of it. Before thinking about load balancers and multiple servers you should find the bottleneck. What server do you have now?

Thank you for the info. I think we might be getting DDoSed. Disk space isn’t an issue because we’re storing media on an S3-compatible service.

Our RAM usage is hovering around 4GB even though we have 16GB available. I installed pgBouncer. I think it could be a connection-limit issue. I’m really struggling to understand why we’re getting these failures.

Update, I found the error: “768 worker_connections are not enough”

I updated this value to 7680 and the errors seemed to stop.
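
In case anyone else hits this: the value lives in the events block of nginx.conf (I’m assuming the usual /etc/nginx/nginx.conf location here), so the relevant bits end up looking roughly like this:

worker_processes auto;
worker_rlimit_nofile 100480;

events {
    # max simultaneous connections per worker; total capacity is roughly
    # worker_processes x worker_connections, and a proxied request uses about two
    worker_connections 7680;
}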

Spinster LLC relies on volunteers?? They can’t pay a poor soul for their IT services?

What, because I paid $120 out of my pocket to get an LLC suddenly I’m rich or something?? I just got a donation for $100 and it might cover the server costs I’ve been paying from my credit card. Also my partner just lost her job, our neighbors almost shot us (with a gun), and now we have to move to Texas to live with my parents. ????? Yes I’m seeking volunteers dammit!

I thought you were working for the LLC; I didn’t know you owned it.

Sorry for my defensive reaction. This has been a very tough week.

1500 users and you have trouble keeping up? Multiple servers for text messages? This app will cost you a fortune.

Hiya, you said that your RAM is doing fine; what’s the CPU load like?

Seems like if your CPU and RAM are both fine, then you don’t need a second instance or load balancing. If CPU is maxed out, that’s another issue, though.

Assuming you’ve got CPU to spare, you can increase the number of threads, and you shouldn’t need a second instance yet. Where are you seeing delays? You need to determine whether you need more threads in Puma, the streaming server, or Sidekiq. You’ll also need to increase the number of PG connections to match.
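
For reference, these all live in .env.production in a stock Mastodon-style setup (the names below are the ones I mean; the values are just illustrative, not recommendations for your server):

# Puma worker processes - more of these means more simultaneous web requests
WEB_CONCURRENCY=2
# Threads per Puma worker; each thread can hold a db connection
MAX_THREADS=5
# Number of streaming API processes
STREAMING_CLUSTER_NUM=1
# Db connections Sidekiq may hold; keep this at least as big as its thread count
DB_POOL=25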

You also said that you’ve set up pgBouncer - were the PG connections all used up?

Hard to know what to advise without seeing what specific errors or delays you’re seeing, but it doesn’t sound like you’re in need of a second server as yet. Might be soon if it stays popular though! :-) Love the server, thanks for everything you’re doing.

Thanks so much for the advice. I installed pgBouncer while basically feeling around in the dark. I don’t think I needed it, but it’s still there.

My settings look like this now:

WEB_CONCURRENCY=16
MAX_THREADS=6
STREAMING_CLUSTER_NUM=3

Sidekiq
DB_POOL=100
100 threads

pgBouncer
max_client_conn = 1000
default_pool_size = 200

Postgres
max_connections = 200

Nginx
worker_processes auto;
worker_connections 7680;
worker_rlimit_nofile 100480;

I don’t really understand how the settings for pgBouncer and Postgres are supposed to relate. Like, should I increase my Postgres max_connections to match pgBouncer’s? The way it works is beyond me.

Even with these settings my CPU and RAM aren’t close to maxing out. I tried setting 1000 Sidekiq threads, but it seemed to actually slow down the site even though my CPU and RAM still didn’t max out. I feel like there’s some fundamental concept I’m missing.

I don’t really understand the relationship between workers, threads, and the physical hardware. I’d really appreciate it if you looked over my settings and told me whether they seem sane to you.

My machine has 16GB of RAM and 6 CPU cores. Thank you so much for your help.

Okay, TL;DR - afaik these settings look fine.

So you’re running (at least) four processes on your server, each of which can use CPU and RAM. They’re all talking to each other, and it’ll all run best when their workloads are balanced - you’re right that giving too many resources to one process can (paradoxically) slow everything down. It won’t always happen, but it absolutely can. So the key thing is knowing which symptoms indicate a problem in which setting, so you know what to tweak.

The Puma settings (WEB_CONCURRENCY, MAX_THREADS) affect people accessing via the web: if people are having trouble connecting, getting dropped connections, etc., it’s here that you’ll need to look. The setting you’d need to increase is WEB_CONCURRENCY - that will allow more users to access via the web. Whenever you increase this, you need to check that there are enough db connections available. You’ll need (WEB_CONCURRENCY x MAX_THREADS) db connections available just for the Puma part of the application. So, as of right now, 16 x 6 = 96. You shouldn’t need to increase MAX_THREADS.
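
To make that arithmetic concrete with your numbers (just a sketch, assuming each Puma thread holds at most one db connection, which is the usual Rails arrangement):

WEB_CONCURRENCY=16
MAX_THREADS=6
# worst-case db connections from Puma alone: 16 x 6 = 96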

pgBouncer is a db connection pooling service. When an app makes a database request, there’s quite a lot of overhead in creating the connection and destroying it afterwards. A pooling service skips this and keeps the connections open, passing them out to the services that request them on demand. This means that db access is quicker, so a larger number of requests can potentially be supported with the same number of open connections. The drawback is that we don’t know exactly how effective this is - it depends on the specific queries you’re executing. (Other Gab/Mastodon admins would be able to advise on this.) As a rule of thumb, longer, larger queries get less benefit. So you’ll need to keep an eye on how db access is doing and whether it’s a bottleneck. Right now, what your web connections need (96) + streaming (3) + Sidekiq (100) comes to 199, less than your Postgres connection limit (200) - but barely. So watch for problems as you increase, and see how well the app is coping. In theory, though, your pgBouncer settings give you headroom here. To answer your question: yes, if all the processes are using pgBouncer then default_pool_size can match max_connections in Postgres.
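
Roughly, here’s how the two config files relate, using your numbers (a sketch only - pool_mode is an assumption on my part, so check what yours is actually set to):

pgbouncer.ini
; connections the application side may open to pgBouncer
max_client_conn = 1000
; real Postgres connections pgBouncer will open, per database/user pair
default_pool_size = 200
; assumption: transaction pooling, the common choice for Rails apps
pool_mode = transaction

postgresql.conf
# should be at least as large as what pgBouncer might open; a little headroom for admin sessions doesn’t hurt
max_connections = 200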

If lots of people are connecting fine to the website, but not seeing the right content / seeing lags, delays or no updates, then you can probably ignore WEB_CONCURRENCY and look at increasing Sidekiq threads and the streaming cluster size. I’m not 100% sure of what the streaming is doing, so I need to look into that a bit more. But I’ve seen people saying that the site has improved today, so seems like you’re good for now :-)
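
If you do end up raising Sidekiq further, the one rule I’m fairly sure of is to keep its thread count and its DB_POOL in step, since each Sidekiq thread can hold a db connection. Assuming a stock Mastodon-style systemd unit (the unit name and bundle path below are my guess at your setup), that looks roughly like:

mastodon-sidekiq.service (excerpt)
Environment="DB_POOL=100"
ExecStart=/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 100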

Your way forward is to keep gradually upping the numbers and adding people a bit at a time, keeping an eye on the CPU and RAM. You also want to know the peaks and troughs in usage each day, so you can check what peak CPU and RAM usage look like. When those start to get loaded, you’ll need to look at multiple instances and load balancing. Seems like you’re okay for now.

TBH, it sounds to me like you know all the right things, but you’re just going through the process of learning the details by trial and error: that’s a really stressful situation to be in. Lots of sympathy, but it doesn’t seem like you’re really missing any fundamental knowledge: just regular thrown-in-at-the-deep-end stress. :-)
