

#30 Pruning vs Quantization. Final showdown!

Table of contents
Introduction.
Pruning vs quantization. Which one is better?
Closing thoughts.
Introduction
I know what you might be thinking. I have already discussed quantization and pruning in two past articles:
Machine learning at scale: On-Device Machine Learning.
Compressing LLMs using novel quantization techniques
However, the paper [1], published recently on arXiv, really caught my eye: it is the first paper I have read that directly compares the two techniques to decide which one is best in different settings.
I found the results pretty interesting and I really wanted to share them with you!
Let's get started!
Pruning vs quantization. Which one is better?
The following methods have been compared:
Magnitude pruning.
Symmetric uniform quantization.
The methods have been evaluated using the common signal-to-noise ratio (SNR) measure; a minimal sketch of both operations follows below.
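To make the comparison concrete, here is a minimal sketch (my own illustration, not the paper's code) of magnitude pruning, symmetric uniform quantization, and the SNR measure on a flat weight vector:

```python
import numpy as np

def magnitude_prune(w, compression_ratio):
    """Keep only the largest-magnitude 1/compression_ratio of the weights,
    zeroing out the rest."""
    k = max(1, int(w.size / compression_ratio))   # number of weights to keep
    threshold = np.sort(np.abs(w))[-k]            # k-th largest magnitude
    return np.where(np.abs(w) >= threshold, w, 0.0)

def symmetric_uniform_quantize(w, num_bits):
    """Symmetric uniform quantization: round weights to a signed integer grid
    and map them back to floats (simulated quantization)."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(w)) / q_max             # grid step size
    return np.clip(np.round(w / scale), -q_max - 1, q_max) * scale

def snr_db(w, w_compressed):
    """Signal-to-noise ratio in dB: signal power over compression-error power."""
    error = w - w_compressed
    return 10.0 * np.log10(np.sum(w ** 2) / np.sum(error ** 2))
```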
The weights of many neural networks are roughly Gaussian-distributed, so the first distribution evaluated was the Gaussian.
As we can see from the figure, the errors of the two methods behave very differently. The quantization error oscillates between the quantization grid points and stays within a moderate range, while the pruning error effectively corresponds to rounding many weights to zero and is therefore larger. As figure 1 (right) shows, this results in a higher SNR for quantization.
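Assuming the helper functions sketched above, the qualitative behaviour is easy to reproduce on a synthetic Gaussian tensor (the exact numbers will of course differ from the paper's figures). Note that comparing 4-bit quantization against keeping one weight in eight is a simplification that ignores pruning's index overhead:

```python
rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=1_000_000)   # roughly Gaussian weights

# 4 bits per value vs keeping 1 weight in 8 (a comparable compression budget)
print("quantization SNR:", snr_db(w, symmetric_uniform_quantize(w, num_bits=4)))
print("pruning SNR:     ", snr_db(w, magnitude_prune(w, compression_ratio=8)))
```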
The second distribution evaluated is a heavy-tailed one: the Student-t distribution.
Despite the significant outliers and high kurtosis, quantization still achieves a higher SNR in most cases at moderate compression. Pruning wins, however, in the regime of a high clipping range and very high compression rates, e.g. 2-3 bits per value.
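The same toy comparison can be repeated with heavy-tailed Student-t samples (few degrees of freedom means high kurtosis); again, this is a synthetic illustration rather than the paper's exact setup:

```python
rng = np.random.default_rng(1)
w_heavy = rng.standard_t(df=3, size=1_000_000)   # heavy tails, high kurtosis

for bits in (2, 4, 8):
    q = snr_db(w_heavy, symmetric_uniform_quantize(w_heavy, num_bits=bits))
    p = snr_db(w_heavy, magnitude_prune(w_heavy, compression_ratio=32 // bits))
    print(f"{bits} bits/value: quantization {q:.1f} dB vs pruning {p:.1f} dB")
```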
Still, the previous results were mostly theoretical in nature. The paper goes on to compare all pre-trained models from the PyTorch model zoo:
The results from the theoretical section hold: kurtosis is indeed a good metric for predicting whether a tensor should be quantized or pruned for optimal accuracy.
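As a rough way to inspect this on a real checkpoint, one can compute the excess kurtosis of every weight tensor of a torchvision model. This is just my sketch of the idea (it assumes a recent torchvision with the `ResNet18_Weights` API and is not the paper's evaluation pipeline):

```python
from torchvision.models import resnet18, ResNet18_Weights
from scipy.stats import kurtosis

model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)

for name, param in model.named_parameters():
    if param.dim() < 2:            # skip biases and norm parameters
        continue
    w = param.detach().flatten().numpy()
    # High excess kurtosis (heavy tails) is the regime where pruning
    # starts to close the gap with quantization.
    print(f"{name:45s} kurtosis = {kurtosis(w):6.2f}")
```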
The paper goes on to discuss the per-layer comparison between post-training quantization and post-training pruning.
The rectangles indicate the full range of each pruning and quantization method between the heuristic solution and the error lower bound or the global solution. Whenever a rectangle intersects the diagonal line, the ranking of the two methods may depend on the optimization method; in cases entirely below or above the diagonal, the ranking is guaranteed regardless of the optimizer.
Quantization mostly outperforms pruning for moderate compression, while the methods become more comparable at higher compression ratios.
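A heuristic per-layer version of this comparison can be sketched by compressing each weight tensor independently (reusing the `model` and the helpers defined above) and recording which method gives the higher SNR; the paper's error lower bounds and globally optimized solutions are not reproduced here:

```python
def per_layer_winner(model, num_bits=4):
    """For each weight tensor, compare quantization and pruning SNR
    at a roughly matched compression budget and report the winner."""
    winners = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:        # only compress weight matrices / conv kernels
            continue
        w = param.detach().flatten().numpy()
        q = snr_db(w, symmetric_uniform_quantize(w, num_bits=num_bits))
        p = snr_db(w, magnitude_prune(w, compression_ratio=32 // num_bits))
        winners[name] = "quantization" if q > p else "pruning"
    return winners

print(per_layer_winner(model, num_bits=4))
```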
Closing thoughts
The conclusion is clear: under the paper's assumptions, quantization generally outperforms pruning for neural networks.
Next time you need to shrink a model's size, forget about pruning! ;)