Edge 432: NVIDIA Created Minitron by Distilling Llama 3.1

The two resulting models, of 8B and 4B parameters respectively, highlight the potential of distillation.


Minitron focuses on reducing the size of AI models through pruning and distillation, making them more efficient without sacrificing too much accuracy. Pruning reduces a model’s size by either cutting layers (depth pruning) or removing neurons, attention heads, or embedding channels (width pruning). To recover some lost accuracy, retraining is often necessary after pruning.
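Width pruning of the kind described above can be illustrated with a toy sketch. This is not NVIDIA's actual code; it just shows the core idea under simple assumptions: score each hidden neuron by an importance proxy (here, the L2 norm of its outgoing weights, a hypothetical choice), keep the top-k, and shrink both adjacent weight matrices. Depth pruning would instead drop entire layers from the stack.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer MLP: input (4) -> hidden (8 neurons) -> output (2)
W1 = rng.normal(size=(8, 4))   # hidden x input
W2 = rng.normal(size=(2, 8))   # output x hidden

def width_prune(W1, W2, keep):
    """Keep the `keep` most important hidden neurons (illustrative only)."""
    # Importance proxy: L2 norm of each neuron's outgoing weights.
    scores = np.linalg.norm(W2, axis=0)
    keep_idx = np.sort(np.argsort(scores)[-keep:])
    # Remove pruned neurons from both adjacent weight matrices,
    # so the network stays consistent end to end.
    return W1[keep_idx, :], W2[:, keep_idx]

W1_p, W2_p = width_prune(W1, W2, keep=4)
print(W1_p.shape, W2_p.shape)  # hidden width shrinks from 8 to 4
```

After a cut like this the pruned network's outputs drift from the original's, which is why retraining (often via distillation against the unpruned teacher) is used to recover accuracy.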

How did they do it?