LADE Seminar

Tomislav Šubić (Arctur)

Title: The effects of model scaling on quantization error

Abstract

The performance and robustness of current artificial intelligence and machine learning systems depend heavily on their scale and on the compute spent on training and inference. State-of-the-art models consume vast amounts of energy, and the compute behind them has become a notable contributor to greenhouse gas emissions. Moreover, because of the high cost of training and running these models, only a few global companies can afford to operate them. This puts smaller companies and individuals at a disadvantage and centralizes control of the technology; in response, adopting pre-trained models has become a widespread practice. Since most of the energy is spent on model inference, reducing its resource requirements is vital for running modern networks anywhere from workstations and dedicated servers to devices with strict power and compute limits, such as edge and IoT devices. Several techniques exist for reducing the memory footprint of neural networks, with the common goal of compressing a model without losing much of its accuracy. Quantization of neural networks is a very active research area, but researchers approach it with different goals: some convert networks into other numerical representations to extend the numerical range and improve accuracy; some use it in simulation to drive hardware development; others reduce a model's memory footprint and improve inference time by sacrificing accuracy. One question that remains unanswered is how scaling a neural network influences the accuracy and performance of its quantized version. The ratios are not well understood: does a 10B-parameter model represented in fp16 have the same accuracy as a quantized version of a scaled 20B-parameter model? Which of these models would have better performance and energy efficiency? How much, and in what way, does a model need to be scaled to achieve the same accuracy in its quantized form?
In this talk, we will take a look at the tradeoff between model scaling and quantization, present some papers that address these issues, and discuss the motivations behind optimizing this part of the AI/ML workflow.
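To make the notion of quantization error concrete, the sketch below (not taken from the talk; the symmetric per-tensor int8 scheme and the toy weight matrix are illustrative assumptions) quantizes fp32 weights to int8, dequantizes them, and measures the resulting error:

```python
import numpy as np

# Illustrative sketch: symmetric per-tensor int8 quantization of a toy
# fp32 weight matrix, and the quantization error it introduces.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)  # toy weights

scale = np.abs(w).max() / 127.0               # map [-max, max] onto the int8 range
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale  # reconstruct fp32 values

err = np.abs(w - w_dequant)
print(f"max abs error:  {err.max():.6f}")
print(f"mean abs error: {err.mean():.6f}")
```

With round-to-nearest, the per-weight error is bounded by half the scale; how such errors accumulate as a model is scaled up is exactly the open question the talk addresses.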

Date
Feb 6, 2024 11:30 AM — 12:30 PM
Event
LADE Seminar
Location
Sala Riunioni RIT, Area Science Park
Località Padriciano 99, Trieste, 34149
Area Science Park - RIT
Research Institute

The Institute of Research and Innovation Technology (RIT) at Area Science Park carries out cutting-edge research and provides services and consulting to public and private-sector users through its three laboratories equipped with state-of-the-art technology.