Run JupyterLab on OpenGPU Cluster with SLURM at UCI
As a graduate student in UCI's Informatics (IN4MATX) program, I have enjoyed exploring the resources listed on the Donald Bren School Computing Support page (wiki.ics.uci.edu). Most of the documentation is clear, but there is little detailed guidance on using the OpenGPU cluster. In this guide, I provide a step-by-step walkthrough to help you make effective use of its GPUs.
1. Prerequisites
- You will need an ICS account
- You need an SSH client installed on your system (a quick check is shown after this list)
- Familiarity with JupyterLab is a plus
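If you are not sure whether an SSH client is installed, a quick check from a terminal (this should work on Linux, macOS, and recent versions of Windows):
$ ssh -V
If it prints an OpenSSH version string, you are ready to go.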
2. SSH to ICS Openlab
SSH to the ICS OpenLab Linux host. Log in with your ICS account password (note: your ICS username and password are different from your UCInetID credentials).
$ ssh <ICSusername>@openlab.ics.uci.edu
After logging in you should see your ICS home directory.
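Optionally, you can add an entry to ~/.ssh/config on your laptop so that later commands (including the SSH tunnel in step 6) are shorter to type. This is just a convenience sketch, and the host alias "openlab" is my own choice:
Host openlab
    HostName openlab.ics.uci.edu
    User <ICSusername>
With that in place, "ssh openlab" logs you in directly.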
3. Create an sbatch File
Create a shell script (technically an sbatch script) using your favorite text editor.
$ vi jupyterLab.sh
Paste the following text into your sbatch script, and save the file.
#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=1-00:00:00
#SBATCH --mem=20GB
#SBATCH --output=/home/<ICSusername>/jupyter.log
cat /etc/hosts    # print the node's /etc/hosts so its IP address ends up in the log (needed later for the SSH tunnel)
jupyter lab --ip=0.0.0.0 --port=8888
Make sure to replace <ICSusername> in the #SBATCH --output=/home/<ICSusername>/jupyter.log line with your own ICS username.
This script tells Slurm to launch a JupyterLab server on a node with your requested resources, listening on port 8888.
4. Load Slurm
Now we need to interact with Slurm. The slurm-client is already installed on the machine; you can activate it with:
$ module load slurm
Now we can interact with the slurm queue.
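For example, you can list the available partitions and their time limits to double-check the partition name used in your sbatch script (sinfo is a standard Slurm command, but the exact partition names you see may differ from the ones in this guide):
$ sinfo -s
$ sinfo -o "%P %l %D"    # partition name, time limit, node count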
5. Submit the sbatch to Slurm
Let’s send the script we prepared as a job to Slurm.
$ sbatch jupyterLab.sh
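If the submission succeeds, sbatch replies with the job ID. Keep it handy, since you will need it later if you want to cancel the job:
Submitted batch job <JOBID>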
Now, you can check that your job is running:
$ squeue -l
$ squeue -u $USER    # or list only your own jobs
You should see something like this
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2977100 opengpu.p xxxxxx dddddddd R 10:01:13 1 poison
2910105 openlab.p yyyyyyyy eeeeeeee R 22-05:23:37 1 circinus-32
2961968 openlab.p zzzzzzzz fffff R 12-04:19:50 1 circinus-81
2976070 openlab.p aaa gggggggg R 6-07:19:49 1 circinus-21
2962837 openlab.p bbbbbbbb hhhhh R 11-12:14:34 1 circinus-17
2961916 openlab.p cccccccc hhhhh R 12-11:47:43 1 circinus-48
Once the job is in the running state (ST = R), you can continue to the next step. How long the job waits in the queue depends on how many other people are using the cluster; my first job did not start until after 5 or 6 days.
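If you prefer a compact view that shows only your own job's state and the node it landed on, squeue's format option can help (a sketch; the %-codes below are standard squeue format specifiers):
$ squeue -u $USER -o "%.10i %.11P %.10j %.8T %.12M %R"    # job id, partition, name, state, run time, node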
6. Create an SSH tunnel for access
Once the job is up and running on a GPU node, check the log output to find the address we need for the SSH tunnel:
$ cat ~/jupyter.log | grep http
http://poison.ics.uci.edu:8888/lab?token=64ddb49f56acc8569f2d237cce504e7edd1ef559e76a23e7
http://127.0.0.1:8888/lab?token=64ddb49f56acc8569f2d237cce504e7edd1ef559e76a23e7
Note the hostname in the first line (poison.ics.uci.edu here; yours will be different). The cat /etc/hosts line in the sbatch script printed that node's internal IP address near the top of the same log (10.1.80.10 in this example), and that is the address you will forward to.
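If you don't want to scroll through the log by hand, you can pull the matching /etc/hosts line for your node directly. This is a rough sketch; replace "poison" with your node's hostname, and note that the exact layout of the output depends on the node's /etc/hosts file:
$ grep poison ~/jupyter.log | head -1
The matching line should contain the node's internal IP address (10.1.80.10 in this example).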
Now, on your laptop, open a new terminal window and create the SSH tunnel (substituting your own node's IP address):
ssh -L8888:10.1.80.10:8888 <ICSusername>@openlab.ics.uci.edu
7. Open JupyterLab
Now, open a browser window on your laptop and navigate to:
http://127.0.0.1:8888/lab?token=64ddb49f56acc8569f2d237cce504e7edd1ef559e76a23e7
Voila! You now have a JupyterLab instance running on a GPU node that you can access from your local machine!
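To double-check that the GPU is actually visible from your session, open a terminal inside JupyterLab and run nvidia-smi (I'm assuming the standard NVIDIA driver tools are on the PATH, which should be the case on the GPU nodes):
$ nvidia-smi
You should see the GPU you requested listed, along with its current memory usage.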
For the remainder of your job's run, the IP and port will stay the same. If you close your laptop, you will need to recreate the SSH tunnel; you can reuse the "ssh -L8888" command above. If your job ends, resubmit your Slurm job and then update the SSH tunnel command with the new IP address. Jobs last for a maximum of 7 days (the #SBATCH --time line controls how long yours runs, up to that limit).
If port 8888 ever seems to be already in use, you can change the port (to any value above 1024) in your sbatch file and in the corresponding SSH tunnel command.
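For example, to switch to port 8890 (an arbitrary choice above 1024), you would change the last line of jupyterLab.sh and the tunnel command to match (as before, use your own node's IP address):
jupyter lab --ip=0.0.0.0 --port=8890
ssh -L8890:10.1.80.10:8890 <ICSusername>@openlab.ics.uci.edu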
8. Managing slurm jobs
Please don't keep occupying the GPUs once you are done with your work, so that other researchers can share these valuable resources. To cancel your job:
$ scancel <JOBID> # Find job id using squeue command
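If you have several jobs queued and want to clear them all at once, scancel also accepts a user filter:
$ scancel -u $USER    # cancels all of your own jobs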
More on how to manage slurm: stanford-rc.github.io/docs-earth/docs/slurm-basics
9. What can you do with the OpenGPU cluster?
As of 2023, the specs of the OpenGPU cluster are:
- GPU: 4 x NVIDIA Corporation GP102 [TITAN Xp], 12 GB memory each
- RAM: 128 GB
- Processor: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
With these specifications, training small to medium-sized deep neural networks is quite manageable. However, I believe that Google Colab offers greater convenience and more powerful processors, enabling you to accomplish even more.
My initial goal was to utilize the cluster for running large language models. Unfortunately, even the smallest of these models prove challenging to accommodate on this relatively modest cluster.
Conclusion
If you found this write-up helpful, please consider leaving a thumbs up or a positive comment below. I hope you found some valuable insights in this post.
Sharing is caring!