Slurming It
As an academic researcher (with a strong computer science bent), I often have to strike a balance between performance and cost. Having the latest and greatest cloud computing infrastructure is great! It makes getting things done much less complicated because you don't have to fuss with hardware provisioning, system configuration, driver upkeep, security patches, and all the other shenanigans that come with bare metal. You just set up what you need and you're off to the races. Of course, this doesn't come cheap, and sometimes 'free credits' and 'trial' periods leave you wondering when the check will eventually come due. There is no such thing as a free lunch, right? But are there any other options available?
Enter the wonderful world of HPC - high-performance computing. For folks in particle physics (looking at you, CERN), big-data 'omics researchers, and other people pushing the boundaries of what's possible with massive amounts of data and equally massive amounts of RAM and CPU cores, HPC systems are your bread and butter. You set up big jobs that could run for days at a time. You get everything nice, neat, and clean. You debug your code in little batches until things look ready for the big run. Then you package it up and ship it to the HPC scheduler, a.k.a. SLURM. The system magically sorts out your code and data (if you were smart in setting it up) and parallelizes things to optimize for runtime and/or memory use. It's truly fascinating stuff that I greatly enjoyed learning about. Still, I never felt qualified to use these systems because I'm just an HCI researcher who builds interactive systems for people to use. I'm not doing anything with massive datasets. When would I need any sort of "supercomputer" to help me do my work?
As my work started to scale up with LLMs (large language models) and other GPU- and memory-heavy workflows like diffusion-based image generation and NLP (natural language processing), I started to run into bottlenecks, both individually and across my team. While I personally could head home and make use of my desktop (which has a GPU with 24 GB of VRAM) to test out ideas and explore concepts, there was practically no way for my team to make use of the same resource. A few students were willing to put down their own money to pay for infrastructure on AWS, but I didn't like the idea of them having to pay to do our work, so I set out to exhaust all the available options and find a solution. Our grant didn't have provisions for computing infrastructure, so I eventually circled back to freely available campus-based computing resources, which plopped me right back into the land of HPC. Little did I know, this would blossom into a beautiful adventure with some surprising twists and turns.
Off I rode, head held high. It was time to tame the supercomputer.
Our workflows were not designed for HPC in the least. Our compute demand was sporadic and short-lived: we could go from zero to 100 GB of VRAM utilization in a few minutes, drop back to zero just as quickly, and stay there for hours, even days. At first glance it seemed possible to send jobs to the HPC only when users needed those specific resources, but that plan quickly turned sour as soon as I tested the response time. Requesting a compute node with a GPU could take anywhere from a few seconds to several minutes, depending entirely on community demand at that particular time. And once a node was allocated and available, loading our models could take another 30-60 seconds. Then, even after minor optimization, the compute jobs themselves took anywhere from 4 to 8 seconds. This whole plan was not viable for any sort of real-time system. I really needed to get creative.
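(For the curious: measuring that allocation latency is really just timing a blocking `srun` call. Here is a minimal sketch of the idea; the partition name, GPU spec, and time limit are placeholders, not the exact settings on our cluster.)

```python
import subprocess
import time

# Placeholder partition/GPU request -- substitute whatever your cluster exposes.
SRUN_CMD = [
    "srun", "--partition=gpu", "--gres=gpu:1",
    "--time=00:05:00", "hostname",
]

start = time.monotonic()
# srun blocks until SLURM grants the allocation and `hostname` finishes,
# so on a busy cluster the elapsed time is dominated by queue wait.
result = subprocess.run(SRUN_CMD, capture_output=True, text=True, check=True)
elapsed = time.monotonic() - start

print(f"node: {result.stdout.strip()}  wait+run: {elapsed:.1f}s")
```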
Holding a lot of compute resources hostage for a long time just because we might need them at a moment's notice felt wrong (and was sure to raise red flags and get me in trouble). But holding a single core was not likely to do much to a supercomputer of 30,000+ cores. So a plan was hatched: build a chain of microservices, hosted on a more modestly equipped VM, to grant my team access to the HPC on demand.
First was resource allocation. Through some thoughtful Python-based SSHing (<3 you, Paramiko), I could request compute resources on the supercomputer and keep that allocation alive with idle commands (silly stuff like running `ls` every few minutes), while logging the hostname of the compute node to a file (yes, I'm using the filesystem as a DB). This compute node was air-gapped from the web, so a second service would first connect to the main HPC host (not air-gapped) and then SSH into the reserved compute node. There it would load any models or other resources I would need, getting those tasks out of the way early. Keeping things in memory and cache will be a topic for another day, but let me just say it proved valuable. Finally, this second service established a ZMQ server to enable interactive messaging between future clients and this compute host. Because the compute nodes were dynamic (and shared) resources, I would never know which port I could use, so a port was selected at random and, if available, logged to a file to tell clients how to connect later.
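To give a flavor of what that allocation service looked like, here is a heavily stripped-down sketch. The hostnames, file paths, sleep intervals, and the exact srun incantation are illustrative placeholders rather than my production setup, and all the error handling is omitted.

```python
import time
import paramiko

LOGIN_HOST = "hpc.example.edu"         # placeholder login node
LOCAL_NODE_FILE = "reserved_node.txt"  # the "filesystem as a DB"

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(LOGIN_HOST, username="me")  # key-based auth assumed

# Open an interactive shell on the login node and ask SLURM for a GPU node to sit on.
chan = client.invoke_shell()
chan.send("srun --gres=gpu:1 --time=08:00:00 --pty bash\n")
time.sleep(60)  # crude wait; real queue waits ranged from seconds to minutes

# Have the compute node record its own hostname (assuming a shared home
# filesystem), then copy that file down so clients know where to connect later.
chan.send("hostname > ~/reserved_node.txt\n")
time.sleep(2)
sftp = client.open_sftp()
sftp.get("reserved_node.txt", LOCAL_NODE_FILE)
sftp.close()

# Keep the session (and the allocation) looking busy with harmless idle commands.
while True:
    chan.send("ls > /dev/null\n")
    time.sleep(300)  # every five minutes
```

For what it's worth, on the ZMQ side pyzmq's `Socket.bind_to_random_port` can take care of the pick-a-free-port dance; whatever port it lands on just gets written to a similar file for clients to read.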
Ok, so now I had a compute resource with all models and resources loaded, waiting for work. I just needed to send work to the system and get the data back from an air-gapped compute node. This turned out to be possible thanks to a third service: a FastAPI server (running behind a reverse-proxied NGINX) paired with a subprocess that used a ZMQ client to send the job over to the compute node, wait for the output, then SFTP the data back to our VM and make it publicly available on the web (via yet another service that spawned an incredibly basic Python HTTP server). All in all, I actually feel proud of this system I built; as brittle and spaghetti-like as it may have felt, it also very practically worked! I had found a way to make HPC compute cores accessible and usable to my team via API calls and a few microservices. Furthermore, the turnaround time on our compute jobs was only a few seconds, meaning that real-time system evaluation was possible! Mission accomplished?
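To make the shape of that job-submission path concrete, here is a pared-down sketch: a FastAPI endpoint that looks up the reserved node and port from the files the earlier services wrote, ships a job over a ZMQ REQ socket, and hands the reply back. The route name, file paths, payload format, and timeout are placeholders, the SFTP and file-serving half is left out, and how the socket actually reaches the air-gapped node (direct campus routing versus an SSH tunnel through the login host) will depend on your network.

```python
# Pared-down sketch of the job-submission service; paths, route, and payload
# schema are placeholders, and error handling is omitted.
import zmq
from fastapi import FastAPI

app = FastAPI()

def read_target() -> tuple[str, int]:
    """Read the reserved node's hostname and ZMQ port from the files
    the allocation services wrote (the 'filesystem as a DB')."""
    with open("reserved_node.txt") as f:
        host = f.read().strip()
    with open("reserved_port.txt") as f:
        port = int(f.read().strip())
    return host, port

@app.post("/jobs")
def submit(job: dict):
    host, port = read_target()
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.RCVTIMEO, 30_000)  # give up after 30 s rather than hang
    sock.connect(f"tcp://{host}:{port}")
    sock.send_json(job)       # the worker on the compute node does the heavy lifting
    reply = sock.recv_json()  # blocks until the node answers (or times out)
    sock.close()
    return reply
```

The "incredibly basic Python HTTP server" on the serving side can be thought of as something on the order of `python -m http.server` pointed at the output directory.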
Eventually, I will open source this code because I think it will be useful to others doing this kind of work. Check back here at the end of April (or email me sooner if you are interested). We currently have a paper in review, so I will update this post after it gets published (or open source the code anyway if the publication route does not work out). I had a wonderful, candid chat with our directors of research computing after I had accomplished the aforementioned goals. I think I was looking for forgiveness, but they welcomed the effort with smiles and said they would have done the same in my position.
A future post will discuss my efforts to build out a sort of 'elastic compute' timeshare (or my attempt at a dynamic growth algorithm) to scale compute resource allocation up and down with changing user demand. Still, this has grown long enough for now, so I will close here.
Slurm on!