Viswanath (Vish) Sivakumar

Neural Interfaces & LLMs: Tokenizing EMG via Residual Vector Quantization

2025-10-08

Electromyography (EMG) at the wrist surface is an emerging modality for human-machine input. EMG taps directly into the motor system and enables decoding human intent from electrical impulses. The recently launched Meta Neural Band relies on EMG as the primary modality for decoding gestures and handwritten text into inputs for AI glasses. A critical milestone in taking this technology from a research prototype into a consumer product, and a long-standing challenge in BCI, has been achieving out-of-the-box generalization across people. Anatomical, physiological, and behavioral variations across the population demand powerful models that can generalize broadly without requiring extensive calibration.

At the same time, these models must run in real time with extremely low latency. We are accustomed to the instantaneous feedback of keyboards and touchscreens, and any technology that aims to supersede them will be expected to have an equally snappy user experience. For neural interface decoders, the latency requirements are often far stricter than for speech. While speech-to-text systems can tolerate delays of a second or more (with the transcription updating every 2-3 words), gestures for UI control or fast typing require $\mathcal{O}(10\text{ms})$ feedback. Meeting such constraints means the models must run close to the user on edge devices. And because these wearables are expected to be worn and used all day, the models must also be highly efficient in terms of compute and battery usage.

How do we run powerful EMG models efficiently on all-day wearable devices with limited compute and battery? A pragmatic approach is to offload the bulk of the model inference from the wristband that senses EMG onto a companion device such as the glasses or phone, by streaming the signal over Bluetooth. But streaming EMG in its raw form is infeasible. For instance, the 16-channel sEMG research device described in Kaifosh et al. (Nature, 2025) has a sampling rate of 2000 Hz with 24-bit ADC resolution, resulting in a transmission rate of 768 kbps (2000 Hz x 16 channels x 24 bits/channel). This exceeds what Bluetooth Low Energy (BLE) can handle for frequent and continuous streaming, and would drain the battery while also making the signal vulnerable to interference and packet loss.

Instead of transmitting raw signals, a more efficient approach is to run a lightweight encoder on the wristband and stream intermediate representations over BLE. For example, with features downsampled to 50-100 Hz, a 256-dimensional embedding, and 8-bit quantization, the transmission rate drops to 100-200 kbps. This is an improvement, but still on the higher side for sustained BLE streaming. Given the high temporal resolution (2 kHz sampling) and moderate spatial resolution (16 electrode channels) of EMG relative to the discrete events being decoded (gestures, characters), intuition suggests we should be able to compress EMG even further without losing essential information.
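
To make the arithmetic explicit, here is a quick back-of-the-envelope calculation using the numbers above (the helper below is purely illustrative; actual frame rates and quantization depths will vary by design):

```python
def bitrate_kbps(frame_rate_hz, values_per_frame, bits_per_value):
    """Streaming bitrate in kbps."""
    return frame_rate_hz * values_per_frame * bits_per_value / 1000

# Raw sEMG: 2 kHz sampling, 16 channels, 24-bit ADC.
raw = bitrate_kbps(2000, 16, 24)       # 768.0 kbps

# Encoder features: 50-100 Hz frames, 256-dim embedding, 8-bit quantization.
feat_lo = bitrate_kbps(50, 256, 8)     # 102.4 kbps
feat_hi = bitrate_kbps(100, 256, 8)    # 204.8 kbps
```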

SoundStream and Residual Vector Quantization (RVQ)

Residual Vector Quantization (RVQ) was popularized by SoundStream: An End-to-End Neural Audio Codec as a scalable way to compress speech and audio. Unlike most prior codecs, SoundStream is trained end-to-end with an encoder-decoder architecture using a combination of reconstruction and adversarial losses. Vector Quantization (VQ) itself was first introduced to generative modeling by VQ-VAE, but a challenge for VQ in audio is that achieving high-fidelity reconstruction/generation often requires extremely large codebooks.

For example, assume spectral features are produced at 100 Hz and we target a compressed bitrate of 5000 bits/sec. That allows a budget of 50 bits/frame (i.e., per quantized embedding vector), requiring a VQ codebook with $2^{50}$ (≈1 quadrillion) entries. RVQ solves this scalability problem by stacking multiple vector quantizers, each with its own codebook, to sequentially quantize the residuals left over from the previous quantizers.

With RVQ, the 50 bits/frame budget can now be split across, say, 5 quantizers with 10 bits each ($2^{10} = 1024$ entries per codebook). With a total storage requirement of just $5 \times 2^{10} = 5120$ vectors, RVQ approximates the expressive capacity of a single codebook with $2^{50} \approx 1$ quadrillion vectors. This factorization of a single intractable lookup table into several tractable ones has led RVQ to be widely adopted for discretizing continuous latent representations in modern generative modeling.
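
Below is a minimal sketch of the greedy residual quantization step (codebooks are randomly initialized here purely to show the mechanics; in practice they are learned, as discussed in the next section):

```python
import torch

def rvq_encode(x, codebooks):
    """Greedy residual VQ: each codebook quantizes the residual left over
    by the previous levels. Returns per-level code indices and the
    quantized reconstruction."""
    residual = x
    x_hat = torch.zeros_like(x)
    codes = []
    for codebook in codebooks:
        # Nearest codebook entry to the current residual (Euclidean distance).
        dists = torch.cdist(residual, codebook)   # (batch, num_codes)
        idx = dists.argmin(dim=-1)                # (batch,)
        quantized = codebook[idx]                 # (batch, dim)
        codes.append(idx)
        x_hat = x_hat + quantized
        residual = residual - quantized
    return torch.stack(codes, dim=-1), x_hat

# 5 levels x 1024 codes = 50 bits/frame, while storing only 5120 vectors in total.
codebooks = [torch.randn(1024, 256) for _ in range(5)]
x = torch.randn(8, 256)                    # a batch of 256-dim embeddings
codes, x_hat = rvq_encode(x, codebooks)    # codes: (8, 5) integers in [0, 1024)
```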

RVQ on the emg2qwerty benchmark

We use the emg2qwerty benchmark to evaluate RVQ for compressing EMG signals. The dataset consists of sEMG recordings collected while users typed on a QWERTY keyboard, and the baseline model follows a streaming-friendly ASR-style architecture trained with CTC loss to predict keystrokes from EMG. Decoding performance is measured as Character Error Rate (CER), defined as the normalized edit distance between the predicted and ground-truth character sequences.
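
For reference, the metric itself reduces to a normalized Levenshtein distance; a minimal implementation is sketched below (the benchmark's own evaluation code handles decoding and bookkeeping details):

```python
def character_error_rate(prediction: str, target: str) -> float:
    """CER: Levenshtein edit distance between predicted and ground-truth
    character sequences, normalized by the ground-truth length."""
    prev = list(range(len(target) + 1))
    for i, p in enumerate(prediction, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (p != t)))  # substitution
        prev = curr
    return prev[-1] / max(len(target), 1)

character_error_rate("the quick brwn fox", "the quick brown fox")  # 1 edit / 19 chars ≈ 0.053
```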

To enable offloading the bulk of model inference from the wristband to a companion device, we introduce an RVQ bottleneck immediately after the first encoder block and train it end-to-end via CTC on the standard emg2qwerty task. Since EMG signals are not directly human-interpretable (unlike speech/audio), we do not train a generator to reconstruct the compressed EMG (as SoundStream does for speech). Otherwise, we follow the same strategy as SoundStream for learning the codebook: the codebook vectors are initialized using k-means clustering on the first training batch and updated using exponential moving average (EMA) per training step. Codebook vectors that are unused or have very few assignments (i.e., whose EMA cluster size falls below a certain threshold) are expired and replaced with randomly chosen entries from the current training batch.
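
A simplified sketch of the EMA update and expiry logic for one codebook level is shown below (the decay and threshold values are placeholders, not the ones used in the experiments; the actual implementation is linked next):

```python
import torch
import torch.nn.functional as F

def ema_codebook_update(codebook, cluster_size, embed_sum, assignments, inputs,
                        decay=0.99, expiry_threshold=2.0):
    """One EMA update of a single quantizer's codebook, with dead-code expiry.
    (Codebooks are initialized with k-means on the first training batch, not shown.)

    codebook:     (num_codes, dim)  current codebook vectors
    cluster_size: (num_codes,)      EMA of per-code assignment counts
    embed_sum:    (num_codes, dim)  EMA of the summed inputs assigned to each code
    assignments:  (batch,)          code index chosen for each input
    inputs:       (batch, dim)      encoder outputs being quantized
    """
    num_codes, _ = codebook.shape
    onehot = F.one_hot(assignments, num_codes).float()   # (batch, num_codes)

    # EMA of assignment counts and of the summed inputs per code.
    cluster_size.mul_(decay).add_(onehot.sum(dim=0), alpha=1 - decay)
    embed_sum.mul_(decay).add_(onehot.t() @ inputs, alpha=1 - decay)

    # Each codebook vector becomes the (EMA) mean of the inputs assigned to it.
    codebook.copy_(embed_sum / cluster_size.clamp(min=1e-5).unsqueeze(-1))

    # Expire codes whose EMA cluster size fell below the threshold and replace
    # them with randomly chosen inputs from the current batch.
    expired = cluster_size < expiry_threshold
    if expired.any():
        replacements = inputs[torch.randint(0, inputs.shape[0], (int(expired.sum()),))]
        codebook[expired] = replacements
        cluster_size[expired] = expiry_threshold
        embed_sum[expired] = replacements * expiry_threshold
```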

The RVQ implementation can be found in vector_quantizer.py and its hyperparams in residual_vq.yaml.

Considering my present GPU-poor conditions, the experiments below focus only on single-user personalized models in the emg2qwerty suite and do not include the larger generic models. In this setting, we achieve 400x compression by mapping 256-dimensional continuous float32 embeddings after the first encoder block to 20-bit discrete tokens using a 2-level RVQ consisting of 1024 codes per level, without degrading task performance measured by CER.
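
The 400x figure follows directly from the bit counts: each 256-dim float32 embedding occupies $256 \times 32 = 8192$ bits, while the 2-level RVQ token occupies $2 \times 10 = 20$ bits, giving $8192 / 20 \approx 410\times$.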

[Figure: emg2qwerty personalization CER with RVQ]
Fig 1: RVQ with 2x1024 codes indicates 2 codebook levels consisting of 1024 entries each (10 bits per level), compressing 256-dim float32 embeddings into 20-bit discrete tokens (400x compression). RVQ 2x1024 approximates the expressive capacity of vanilla VQ with $2^{20}$ (≈1 million) codebook entries, while only storing a total of 2048 codes.

The following illustrates the scalability advantage of RVQ over vanilla VQ.

[Figure: VQ vs RVQ comparison on a single user]
Fig 2: 4-level RVQ with 512 codes per level (2048 total codes) fully recovers baseline CER, compressing 256-dim float32 embeddings into 36-bit discrete tokens. In contrast, vanilla VQ with 4096 codes (2x more storage) fails to match baseline performance. This RVQ configuration approximates the expressive capacity of a vanilla VQ with $2^{36}$ (≈68 billion) codebook entries, while only storing 2048 codes.

Full repo on GitHub. Plots were generated using this Colab notebook.

EMG, LLMs, and Adaptive Neural Interfaces

So far, this discussion has focused on using RVQ to compress EMG signals. But the larger opportunity is that tokenizing EMG into discrete codes opens the door to leveraging LLM architectures for neural interfaces. Indeed, SoundStream and its successors are now used more widely as tokenizers/detokenizers for speech in multimodal LLMs than as standalone compression codecs. For EMG and related neural interfaces, leveraging LLMs goes beyond improving benchmark scores on tasks with pre-determined user behavior.

The promise of neuromotor interfaces is that machines can tap into the richness and fidelity of the human motor system and co-evolve to enable deeper personalization. Imagine the Meta Neural Band detecting gestures barely perceptible to the eye, or decoding text handwritten with increasing levels of hastiness, sloppiness, and imprecision. Decoding these reliably requires the model to understand user intent at a much richer level (“don’t listen to what I say; listen to what I mean”). LLMs excel at this: they respond accurately to vague or imprecise prompts, and conversational speech LLMs capture subtleties like emotions, interruptions, or backchanneling when speech tokens are fed natively, bypassing the textual information bottleneck.

With neural interfaces, the potential for adapting to fine-grained motor movements and for enabling human augmentation is far greater than with text or speech. Realizing this vision requires EMG to become a first-class, native modality for LLMs, and tokenization is the very first step.