arxiv.org/abs/2210.17323

Preview meta tags from the arxiv.org website.

Search Engine Appearance

Google

https://arxiv.org/abs/2210.17323

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques are limited by the scale and complexity of GPT models. In this paper, we address this challenge and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously proposed one-shot quantization methods while preserving accuracy, allowing us for the first time to execute a 175-billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16 of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
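
The storage claim in the abstract can be checked with back-of-the-envelope arithmetic. The sketch below is not part of GPTQ or its repository; `weight_memory_gb` is a hypothetical helper that counts only the raw weight bits and ignores quantization metadata (scales, zero points), activations, and the key-value cache, so real footprints would be somewhat larger.

```python
def weight_memory_gb(n_params: int, bits_per_weight: float) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bitwidth and no
    per-group metadata is counted.
    """
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 175_000_000_000  # model size quoted in the abstract

# Compare the FP16 baseline with the 4-bit and 3-bit settings.
for bits in (16, 4, 3):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights take about 350 GB at FP16, 87.5 GB at 4 bits, and roughly 66 GB at 3 bits, which illustrates why a lower bitwidth is what brings a model of this size within reach of a single large-memory GPU.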

Bing

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

https://arxiv.org/abs/2210.17323

DuckDuckGo

https://arxiv.org/abs/2210.17323

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

General Meta Tags
18
- title
  [2210.17323] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- title
  open search
- title
  open navigation menu
- title
  contact arXiv
- title
  subscribe to arXiv mailings
Open Graph Meta Tags
10
- og:type
  website
- og:site_name
  arXiv.org
- og:title
  GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- og:url
  https://arxiv.org/abs/2210.17323v2
- og:image
  /static/browse/0.3.4/images/arxiv-logo-fb.png
Twitter Meta Tags
6
- twitter:site
  @arxiv
- twitter:card
  summary
- twitter:title
  GPTQ: Accurate Post-Training Quantization for Generative...
- twitter:description
  Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high...
- twitter:image
  https://static.arxiv.org/icons/twitter/arxiv-logo-twitter-square.png
Link Tags
13
- apple-touch-icon
  /static/browse/0.3.4/images/icons/apple-touch-icon.png
- canonical
  /abs/2210.17323
- icon
  /static/browse/0.3.4/images/icons/favicon-32x32.png
- icon
  /static/browse/0.3.4/images/icons/favicon-16x16.png
- manifest
  /static/browse/0.3.4/images/icons/site.webmanifest