Benjamin Steenhoek

How to limit the number of Twitter posts in your timeline using JavaScript

2024-03-12T00:00:00-07:00

I have a bit of a problem scrolling too much of the awesome content on Twitter. There are just too many other people creating cool ideas and software! At first I justified it by the fact that I find academic papers (my feed is curated to mostly academic CS/ML/SE), but now I admit it’s too much! In a bid to waste more time in order to waste less time, I created a simple script that puts an extra UI onto each post in my Twitter feed:

For the first five posts, it numbers them to remind me how many I’ve browsed.
After that, it blocks them out with a reminder of the limit I set (currently I keep it at five posts).

Here’s what the result looks like:

Here’s the link to the script: JavaScript for Twitter Post Limiter · GitHub. I currently load it automatically with Violentmonkey. I plan to turn it into a Firefox extension (once conference deadline season is over 🙃). Hope it’s useful!

How to set up a shared cache for HuggingFace libraries

2024-01-16T00:00:00-08:00

TL;DR

I set up a shared cache for HuggingFace libraries like transformers and datasets. See the repository: https://github.com/bstee615/shared-hf-cache. To use it, create a shared directory that all interested users can write to, then point the libraries at it by setting the environment variable: export HF_HOME="/huggingface".

The problem: duplicated model checkpoints and datasets

My lab members and I use a shared machine to run, among other things, large language model inference using the transformers and datasets libraries. HuggingFace libraries download the model weights or datasets, and the downloaded files can be very large (over 50GB). By default, the weights and datasets are downloaded to some folders under ~/.cache/huggingface/. Different users will download copies of the same models. This causes the storage requirements to grow much larger than what is needed.
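To get a feel for how much space the duplicates take, something like the following sketch could be run by a user who can read everyone's home directory. It assumes home directories live under /home, which may not match your machine; `dir_size_bytes` and `cache_sizes` are hypothetical helper names, not part of any HuggingFace library.

```python
import os
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # The hub cache links snapshot files to blobs with symlinks;
            # skipping symlinks avoids counting the same blob twice.
            if path.is_symlink():
                continue
            total += path.stat().st_size
    return total


def cache_sizes(home_root: Path = Path("/home")) -> dict[str, int]:
    """Map each user directory name to the size of its HF cache, if it has one."""
    sizes = {}
    for home in sorted(home_root.iterdir()):
        cache = home / ".cache" / "huggingface"
        if cache.is_dir():
            sizes[home.name] = dir_size_bytes(cache)
    return sizes


if __name__ == "__main__":
    # Assumption: home directories are under /home and are readable by this user.
    for user, size in cache_sizes().items():
        print(f"{user}: {size / 1e9:.1f} GB")
```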

How to set up a shared HF cache

To cut down the amount of storage used, I set up a shared directory, /huggingface, so that all users download their models and datasets to the same folder. Users only need to set an environment variable using export HF_HOME="/huggingface"; the HF libraries will then download all files to the shared folder.

Here’s the script I used:

#!/bin/bash

# Create a group for permissions to the directory
sudo groupadd hf-users
sudo usermod -aG hf-users $USER

# Create shared directory and make it owned by the group
# Give the directory rwx for user and group, plus setgid (the "s") so that new files inherit the directory's group
sudo mkdir --mode=u+rwx,g+rwxs,o-rwx /huggingface
sudo chown $USER /huggingface/
sudo chgrp hf-users /huggingface/

# Add to .bashrc
cat <<EOF >> $HOME/.bashrc
export HF_HOME="/huggingface" # Download HF cache items to /huggingface
umask 002 # Give user and group rw/rwx by default
EOF

# Optional: join the group in this shell, or restart the shell
newgrp hf-users

Testing it out

The shared cache must be activated with these two commands. The script above adds these to your profile in ~/.bashrc.

export HF_HOME="/huggingface" # Download HF cache items to /huggingface
umask 002 # Give user and group rw/rwx by default
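If you would rather activate the shared cache from inside a Python program than from ~/.bashrc, a rough equivalent might look like the sketch below. `activate_shared_cache` is a hypothetical helper name, and the sketch assumes it runs before any HuggingFace library is imported, since the cache location is typically read when the library loads.

```python
import os

# Assumption: /huggingface is the shared directory created by the setup script.
SHARED_CACHE = "/huggingface"


def activate_shared_cache(cache_dir: str = SHARED_CACHE) -> int:
    """Point HF libraries at the shared cache and relax the umask.

    Returns the previous umask so the caller can restore it later.
    """
    os.environ["HF_HOME"] = cache_dir
    # 0o002 masks only the "other write" bit, so new files are created
    # rw-rw-r-- and new directories rwxrwxr-x (group-writable).
    return os.umask(0o002)


if __name__ == "__main__":
    activate_shared_cache()
    # Import HF libraries only after this point, for example:
    # from transformers import AutoModelForCausalLM
```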

Here’s the script I used to test it out, running the same script as two different users at the same time:

# Dependencies:
# pip install transformers torch accelerate bitsandbytes

from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Do inference with Mistral-7B. Load the model weights ten times in a row to simulate loading the weights at the same time as another user.
for i in range(10):
    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True)
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
    model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")
    print("Generating tokens...")
    generated_ids = model.generate(**model_inputs, do_sample=True, temperature=1.0, max_new_tokens=10)
    print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

Concurrency

The libraries include a file locking system to prevent concurrency issues from multiple users using the same files at the same time. When I tested it by trying to load the checkpoint with user B while downloading the checkpoint with user A, this gave me an error on user B like so, without interrupting user A:

PermissionError: [Errno 13] Permission denied: '/huggingface/hub/.locks/models--mistralai--Mistral-7B-v0.1/9742cb4764964155b7a5f35eefad651f590006091ddeb536863d6c5865cca1b9.lock'
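One plausible cause (an assumption on my part, not something I have confirmed) is that user A created the lock file without group write permission, for example because umask 002 was not active in their shell. A minimal sketch for spotting such files follows; `not_group_writable` is a hypothetical helper name.

```python
import os
import stat
from pathlib import Path


def not_group_writable(root: Path) -> list[Path]:
    """Return files and directories under root that lack the group-write bit."""
    offenders = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            mode = path.lstat().st_mode
            if stat.S_ISLNK(mode):
                continue  # the permission bits of a symlink itself are not used on Linux
            if not mode & stat.S_IWGRP:
                offenders.append(path)
    return sorted(offenders)


if __name__ == "__main__":
    # Directory taken from the error message above.
    for path in not_group_writable(Path("/huggingface/hub/.locks")):
        print(path)
```

Any path it prints is a candidate for chmod g+w by its owner.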

How to fix `docker-compose: command not found` error with newer versions of Docker

2023-12-20T00:00:00-08:00

The docker-compose command is missing from recent versions of Docker, replaced by a plugin built into Docker: docker compose. To restore compatibility with scripts which use docker-compose, we can create a wrapper script which forwards its arguments to docker compose. Here’s the script:

# Switch to root
sudo su -
# Write script to file
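# Quote 'EOF' so that $@ is written literally instead of being expanded (to nothing) by the current shell
cat << 'EOF' > /usr/local/bin/docker-compose
#!/bin/bash
docker compose "$@"
EOF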
# Make the script executable so that we can invoke it directly from the shell
chmod +x /usr/local/bin/docker-compose

This avoids the error docker-compose: command not found which I faced, for example, trying to install https://github.com/amithkoujalgi/ollama-pdf-bot.

How to use task-spooler as a shared queueing system

2023-12-16T00:00:00-08:00

TL;DR

I set up a wrapper around task-spooler. See the repository: https://github.com/bstee615/shared-task-spooler. To use it, create a file containing the below script and invoke it using the same arguments as task-spooler.

#!/bin/bash
# Dependency: sudo apt install -y task-spooler
TS_SOCKET=$(dirname -- "$0")/TS_SOCKET exec tsp -L "$USER" "$@"

Example usage:

alice@shared-box:~$ q echo hello
0
alice@shared-box:~$ q
ID   State      Output               E-Level  Times(r/u/s)   Command [run=0/1]
0    finished   /tmp/ts-out.t7NbRs   0        0.00/0.00/0.00 [alice]echo hello
alice@shared-box:~$ sudo su - bill
bill@shared-box:~$ q
ID   State      Output               E-Level  Times(r/u/s)   Command [run=0/1]
0    finished   /tmp/ts-out.t7NbRs   0        0.00/0.00/0.00 [alice]echo hello

Why?

I share a machine with several labmates. This machine has a single high-powered GPU which we share for our experiments. However, (usually) only one of us can use it at a time. This means that if someone else is running an experiment, I have to write down my command, wait for their experiment to finish (which can take several hours), get pinged by them, then check back and run my experiment. If I run my experiment without checking whether the GPU is in use, my program can hit an error, or worse, the other person’s in-progress program may hit an error and they’ll have to restart it. How can we efficiently run our experiments?

There are several shared queueing systems, such as Slurm or HTCondor or sqs, but these are overkill - they tend to require a lot of work to set up and administer. Additionally, they often require structuring your scripts around the format of the queueing tool, which slows down our work. Instead, I set up a shared queue using task-spooler! This solution generally works best when the users are somewhat technical and will not overrun the resources of the machine and interrupt other users’ projects. Also, it only works on one machine and can’t manage jobs distributed over a cluster. For more advanced systems which can limit resources or run on a cluster, see the alternatives listed above.

How I developed it

The task-spooler tool is used to spool shell commands, that is, to queue them up and execute them in sequence.

However, while it’s intended to be used from multiple terminals, it isn’t intended for multiple users - each user has their own queue of tasks. This is because a different socket is created for each user.

alice@shared-box:~$ tsp
ID   State      Output               E-Level  Times(r/u/s)   Command [run=0/1]
alice@shared-box:~$ tsp echo hello
0
alice@shared-box:~$ tsp
ID   State      Output               E-Level  Times(r/u/s)   Command [run=0/1]
0    finished   /tmp/ts-out.YZiMxq   0        0.00/0.00/0.00 echo hello
alice@shared-box:~$ sudo su - bill
bill@shared-box:~$ tsp
ID   State      Output               E-Level  Times(r/u/s)   Command [run=0/1]
bill@shared-box:~$ tsp echo hello
0
bill@shared-box:~$ tsp
ID   State      Output               E-Level  Times(r/u/s)   Command [run=0/1]
0    finished   /tmp/ts-out.LjNllC   0        0.00/0.00/0.00 echo hello

In order for two users to share a queue, they must share the same socket file, which is specified using the environment variable TS_SOCKET.

alice@shared-box:~$ TS_SOCKET=/tmp/shared-socket tsp echo hello
0
alice@shared-box:~$ TS_SOCKET=/tmp/shared-socket tsp
ID   State      Output               E-Level  Times(r/u/s)   Command [run=0/1]
0    finished   /tmp/ts-out.XItzuo   0        0.00/0.00/0.00 echo hello
alice@shared-box:~$ sudo su - bill
bill@shared-box:~$ TS_SOCKET=/tmp/shared-socket tsp
c: cannot connect to the server

Uh oh, now bill cannot access the same queue. Looking into the source code, we see this is because when bill’s tsp tries to open the socket file, it receives a permission error (EACCES), which produces the error message above.

Now we can make the socket file accessible by all users using chmod 777. If you want more restrictive permissions (for example, to restrict access to a specific group of users), you can use Linux permission groups.

alice@shared-box:~$ chmod 777 /tmp/shared-socket
alice@shared-box:~$ TS_SOCKET=/tmp/shared-socket tsp echo hello
0
alice@shared-box:~$ TS_SOCKET=/tmp/shared-socket tsp
ID   State      Output               E-Level  Times(r/u/s)   Command [run=0/1]
0    finished   /tmp/ts-out.t7NbRs   0        0.00/0.00/0.00 echo hello
alice@shared-box:~$ sudo su - bill
bill@shared-box:~$ TS_SOCKET=/tmp/shared-socket tsp
ID   State      Output               E-Level  Times(r/u/s)   Command [run=0/1]
0    finished   /tmp/ts-out.t7NbRs   0        0.00/0.00/0.00 echo hello

Works great!
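If you go the more restrictive route instead of chmod 777, it helps to check who can actually reach the queue. On Linux, connecting to a Unix domain socket requires write permission on the socket file, so a small sketch like this one reports which permission classes have it. `who_can_connect` is a hypothetical helper name.

```python
import os
import stat


def who_can_connect(socket_path: str) -> dict[str, bool]:
    """Report which permission classes have the write bit on a socket file."""
    mode = os.stat(socket_path).st_mode
    if not stat.S_ISSOCK(mode):
        raise ValueError(f"{socket_path} is not a socket")
    return {
        "owner": bool(mode & stat.S_IWUSR),
        "group": bool(mode & stat.S_IWGRP),
        "other": bool(mode & stat.S_IWOTH),
    }


if __name__ == "__main__":
    # Socket path used in the examples above.
    print(who_can_connect("/tmp/shared-socket"))
```

With group-only access (for example mode 660 and a shared group on the file), "other" would come back False and only group members could submit jobs.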

Finally, we want to distinguish jobs submitted by different users, so that each user can manage their own jobs in the shared queue. We can do this by giving a label to the jobs using the -L option (see man pages).

alice@shared-box:~$ TS_SOCKET=/tmp/shared-socket tsp -L "$USER" echo hello
0
alice@shared-box:~$ TS_SOCKET=/tmp/shared-socket tsp
ID   State      Output               E-Level  Times(r/u/s)   Command [run=0/1]
0    finished   /tmp/ts-out.t7NbRs   0        0.00/0.00/0.00 [alice]echo hello

The final version is wrapped into a convenient helper script q in this repository: https://github.com/bstee615/shared-task-spooler. To install it, you can clone it into a directory shared by all users (we have it in /opt/shared-queue), add the directory to the PATH variable, and start using q!
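For anyone who prefers Python to Bash, the same wrapper idea can be sketched as follows. This is not the script from the repository; `build_invocation` is a hypothetical helper, and the sketch assumes tsp is on the PATH and that the socket file sits next to the wrapper, as in the Bash version.

```python
#!/usr/bin/env python3
import getpass
import os
import sys
from pathlib import Path


def build_invocation(script_path, user, args, environ):
    """Return (env, argv) for running tsp against the shared socket."""
    # Keep the socket next to the wrapper script, like $(dirname -- "$0")/TS_SOCKET.
    socket_path = Path(script_path).parent / "TS_SOCKET"
    env = dict(environ)
    env["TS_SOCKET"] = str(socket_path)
    # -L labels the job with the submitting user's name.
    argv = ["tsp", "-L", user, *args]
    return env, argv


if __name__ == "__main__":
    env, argv = build_invocation(
        sys.argv[0], getpass.getuser(), sys.argv[1:], os.environ
    )
    # Replace this process with tsp so its output and exit code pass through.
    os.execvpe(argv[0], argv, env)
```

Passing the arguments as a list keeps arguments that contain spaces intact, which is the same reason the Bash version should forward "$@" in quotes.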

friendship ended with `earlyoom`, now `nohang` is my best friend

2023-02-13T00:00:00-08:00

I used to use earlyoom to ensure that my desktop PC will still keep running if a program hogs all the memory (e.g. loading a too-large dataset into memory). I recently couldn’t get earlyoom to work with Fedora 37, and while searching for a solution I found nohang. Here are some useful features of nohang which convinced me to switch:

Desktop notification when a program is killed.
Includes a demo command, nohang --memload, to safely test out a low-memory situation.
Easy to build from source and sensible default config.

Both are very useful and well-made programs, but nohang was better for my situation. Try them out!

Siamese network and triplet loss

2022-10-10T00:00:00-07:00

A Siamese network is an architecture which runs two networks with shared weights (effectively running the same network twice) on two different inputs simultaneously. It is commonly trained with a contrastive loss such as triplet loss in order to draw together the representations of similar inputs and push apart the representations of contrasting inputs.

Define distance as the squared norm of the difference between the two encodings: $d(x_i, x_j) = \|f(x_i) - f(x_j)\|^2$

Goal: learn parameters so that

$x_i, x_j$ are the same person -> $d(x_i, x_j)$ is small
$x_i, x_j$ are different people -> $d(x_i, x_j)$ is large

How to train? Triplet loss

Anchor $A$
Positive $P$
Negative $N$
Want:
- $d(A, P) \leq d(A, N)$
- Equivalently, $d(A, P) - d(A, N) \leq 0$
- This can be satisfied trivially by mapping every input to the same encoding, e.g. $f(x) = 0$, so that all distances are 0.
- To prevent the trivial solution, require the gap to be at least a margin $\alpha > 0$: $d(A, P) - d(A, N) + \alpha \leq 0$.

End up with triplet loss $\mathcal L(A, P, N) = \max(d(A, P) - d(A, N) + \alpha, 0)$.
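The loss can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the inputs are already-computed encodings $f(x)$ given as lists of numbers, and the function names and the margin value are made up for the example.

```python
def squared_distance(u, v):
    """d = ||u - v||^2 for two encodings of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.5):
    """max(d(A, P) - d(A, N) + alpha, 0) on precomputed encodings."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + alpha, 0.0)


if __name__ == "__main__":
    a, p, n = [0, 0], [0, 1], [3, 0]
    print(triplet_loss(a, p, n))  # positive is much closer than negative: 0.0
    print(triplet_loss(a, n, p))  # roles swapped, constraint violated: 8.5
    print(triplet_loss(a, a, a))  # collapsed encodings are penalized: 0.5
```

The last call shows why the margin matters: when every input maps to the same encoding, the loss is $\alpha$ rather than 0, so the trivial solution no longer minimizes it.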

Get filepath of Bash activation script

2022-07-25T00:00:00-07:00

Use ${BASH_SOURCE[0]} to reference the filepath of a Bash script. Unlike $0, this works both when the script is run via bash script.sh and when it is sourced via source script.sh.

Source: https://stackoverflow.com/a/8912075

Example:

# activate.sh

root=$(realpath "$(dirname "${BASH_SOURCE[0]}")")
source "$root/venv/bin/activate"
export PYTHONPATH="$root:$PYTHONPATH"

webcam-mods for Linux background blur & swap

2022-07-07T00:00:00-07:00

webcam-mods is the best method I have found for webcam background blur/swap on Linux. I use this for my meetings on Google Meet and Webex.

Repo: https://github.com/hamidzr/webcam-mods

Install globally:

pip install git+https://github.com/hamidzr/webcam-mods@master

Original

Cropped: webcam_mods crop-cam. I was impressed with the interactive cropping mode which allowed me to crop to my profile pretty easily. The crop settings are saved to disk for future runs.

Cropped and blurred: webcam_mods bg-blur

Cropped with bg: webcam_mods bg-swap

The video feed was displayed with ffplay. Runners-up that I tried:

Linux-Fake-Background-Webcam. I found its blur didn’t work quite as well (background and limbs pop in/out), so it was distracting in meetings.
fakecam. It was a bit difficult to install (see issue here).

Beware any vs len

2022-07-05T00:00:00-07:00

I fell into the habit of using any() to check whether a list is non-empty. It’s nice because it works for any iterable, including generators, even if len() is not defined. However, it has a pitfall: if the list is non-empty but contains only falsy values, any() returns False. For this reason, I advise using len() to check whether a list is empty.

In [1]: coll = [0]
   ...: if any(coll):
   ...:     print("has some!")
   ...: 

In [2]: coll = [0]
   ...: if len(coll) > 0:
   ...:     print("has some!")
   ...: 
has some!
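For generators, where len() is not defined, one option is to peek at the first element instead. The helper below is a hypothetical sketch, not a standard-library function. It returns a fresh iterator because peeking consumes the first item of the original.

```python
from itertools import chain

_SENTINEL = object()


def peek_nonempty(iterable):
    """Return (has_items, iterator) without losing the first element."""
    it = iter(iterable)
    first = next(it, _SENTINEL)
    if first is _SENTINEL:
        return False, iter(())
    # Put the peeked element back in front of the remaining items.
    return True, chain([first], it)


if __name__ == "__main__":
    has_items, items = peek_nonempty(x for x in [0])
    if has_items:
        print("has some!")
```

Unlike any(), this reports a generator that yields only 0 as non-empty, because it checks for the presence of an element rather than its truthiness.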

Use head -n -0 to get all items in list

2022-07-01T00:00:00-07:00

You may know that you can use head -n $n to get the first N lines of a file. But you may not know that you can supply n=-0 to get all the lines: with GNU head, a negative count -K means "everything except the last K lines", so -0 drops nothing.

I use this frequently when I want to test some preprocessing script on a sample of a dataset before running on the whole dataset. Here is a snippet I use sometimes. It gets the first N lines if an argument is given, otherwise all lines.

#!/bin/bash
n="$1"
if [ -z "$n" ]
then
    n="-0"
fi

head -n "$n" data.txt

Here is the snippet in action.

$ seq 1 10 > data.txt
$ bash head_script.sh 1
1
$ bash head_script.sh 5
1
2
3
4
5
$ bash head_script.sh
1
2
3
4
5
6
7
8
9
10
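When the preprocessing script itself is in Python, the same "first N or everything" pattern can be sketched with itertools.islice, which treats a stop value of None as "no limit". The `head` function name and the data.txt filename simply mirror the Bash example.

```python
import sys
from itertools import islice


def head(lines, n=None):
    """Yield the first n items of lines, or all of them when n is None."""
    return islice(lines, n)


if __name__ == "__main__":
    # Optional first argument: number of lines to keep.
    limit = int(sys.argv[1]) if len(sys.argv) > 1 else None
    with open("data.txt") as f:
        for line in head(f, limit):
            print(line, end="")
```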