Large Language Models (LLMs) are distinguished by their massive parameter counts, which typically result in significant redundancy. This work introduces MaskLLM, a learnable pruning method that establishes Semi-structured (or "N:M") Sparsity in LLMs, aimed at reducing computational overhead during inference. Instead of developing a new importance criterion, MaskLLM explicitly models N:M patterns as a learnable distribution through Gumbel Softmax sampling. This approach facilitates end-to-end training on large-scale datasets and offers two notable advantages: 1) High-quality Masks - our method effectively scales to large datasets and learns accurate masks; 2) Transferability - the probabilistic modeling of mask distribution enables the transfer learning of sparsity across domains or tasks. We assessed MaskLLM using 2:4 sparsity on various LLMs, including LLaMA-2, Nemotron-4, and GPT-3, with sizes ranging from 843M to 15B parameters, and our empirical results show substantial improvements over state-of-the-art methods. For instance, leading approaches achieve a perplexity (PPL) of 10 or greater on Wikitext compared to the dense model's 5.12 PPL, but MaskLLM achieves a significantly lower 6.72 PPL solely by learning the masks with frozen weights. Furthermore, MaskLLM's learnable nature allows customized masks for lossless application of 2:4 sparsity to downstream tasks or domains.
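To make the core idea concrete, below is a minimal PyTorch sketch of learning 2:4 masks with Gumbel Softmax: each block of 4 weights carries learnable logits over the 6 candidate 2:4 patterns, and a straight-through sample selects one pattern per block while keeping the selection differentiable. This is an illustrative sketch under our own assumptions, not the authors' released implementation; the class name `Learnable24Mask`, the zero-initialized logits, and the toy weight shapes are hypothetical.

```python
# Minimal sketch (assumptions, not the official MaskLLM code) of a learnable
# 2:4 mask parameterized by Gumbel Softmax over candidate patterns.
import itertools
import torch
import torch.nn.functional as F

# The 6 possible 2:4 patterns: exactly 2 of every 4 consecutive weights kept.
PATTERNS = torch.tensor(
    [[1.0 if i in keep else 0.0 for i in range(4)]
     for keep in itertools.combinations(range(4), 2)]
)  # shape (6, 4)

class Learnable24Mask(torch.nn.Module):
    def __init__(self, num_blocks: int):
        super().__init__()
        # One logit vector per block of 4 weights (zero init is an assumption).
        self.logits = torch.nn.Parameter(torch.zeros(num_blocks, 6))

    def forward(self, tau: float = 1.0, hard: bool = True) -> torch.Tensor:
        # Differentiable (straight-through) sample of one pattern per block.
        probs = F.gumbel_softmax(self.logits, tau=tau, hard=hard)  # (B, 6)
        return probs @ PATTERNS  # (B, 4): selected 2:4 pattern per block

# Usage: mask a toy weight matrix; weights stay frozen, only logits learn.
weight = torch.randn(8, 16)                      # dense weight (frozen)
masker = Learnable24Mask(num_blocks=weight.numel() // 4)
mask = masker(tau=2.0).reshape_as(weight)        # 2:4 sparse binary mask
sparse_weight = weight.detach() * mask           # gradients flow to the logits
```

In this sketch the masked weights can be plugged into an end-to-end language-modeling loss, so the mask distribution is trained on large-scale data rather than chosen by a hand-crafted importance criterion, mirroring the paper's motivation.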
@article{fang2024maskllm,
  title={MaskLLM: Learnable Semi-structured Sparsity for Large Language Models},
  author={Fang, Gongfan and Yin, Hongxu and Muralidharan, Saurav and Heinrich, Greg and Pool, Jeff and Kautz, Jan and Molchanov, Pavlo and Wang, Xinchao},
  journal={Advances in Neural Information Processing Systems},
  year={2024}
}