PPoPP 2021
Sat 27 February - Wed 3 March 2021

The transformer is the most important algorithmic innovation in the Natural Language Processing (NLP) field in recent years. Unlike Recurrent Neural Network (RNN) models, transformers can process the sequence-length dimension in parallel, which leads to better accuracy on long sequences. However, efficiently deploying them for online services in GPU-equipped data centers is not easy. First, the additional computation introduced by transformer structures makes it harder to meet the latency and throughput constraints of serving. Second, NLP tasks take in sentences of variable length, and this variability of input dimensions poses a serious challenge for efficient memory management and serving optimization.
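To make the contrast concrete, here is a minimal PyTorch-style sketch (an illustration added for this post, not code from the paper): an RNN cell must walk the sequence one step at a time because each state depends on the previous one, while a self-attention layer updates every position at once with batched matrix products.

```python
import torch

seq_len, hidden = 128, 64
x = torch.randn(seq_len, 1, hidden)  # (time, batch=1, features)

# RNN-style processing: each step depends on the previous hidden state,
# so the model must walk the sequence-length dimension one step at a time.
rnn_cell = torch.nn.RNNCell(hidden, hidden)
h = torch.zeros(1, hidden)
for t in range(seq_len):
    h = rnn_cell(x[t], h)

# Transformer-style self-attention: every position attends to every other
# position through batched matrix products, so the whole sequence-length
# dimension is processed in parallel.
tokens = x.squeeze(1)                          # (seq_len, hidden)
scores = tokens @ tokens.T / hidden ** 0.5     # (seq_len, seq_len)
out = torch.softmax(scores, dim=-1) @ tokens   # all positions at once
```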

To address these challenges, this paper presents a transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework. Three innovative features make it stand out from similar works. First, an efficient parallel algorithm is proposed for GPU-based batch reduction operations, such as Softmax and LayerNorm, which are the major hot spots besides BLAS routines. Second, a memory allocation algorithm that better balances memory footprint against allocation/free efficiency is designed for variable-length inputs. Third, the serving framework is equipped with a new batch scheduler that uses dynamic programming to achieve optimal throughput on variable-length requests. The system achieves state-of-the-art transformer model serving performance on GPU platforms and can be seamlessly integrated into your PyTorch code with a few lines of code.
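As a rough illustration of the dynamic-programming batching idea (a sketch under assumed cost terms, not the paper's actual scheduler; the function name `schedule_batches` and the `launch_overhead` parameter are hypothetical), one can sort queued requests by length and choose where to cut the sorted list into contiguous batches so that padded computation plus a per-batch launch overhead is minimized:

```python
# Minimal sketch of a DP batch scheduler for variable-length requests.
# Cost model (assumed, not from the paper): a batch of requests i..j padded
# to its longest member costs (batch size) * (max length) + launch_overhead.
def schedule_batches(lengths, launch_overhead=100.0):
    lengths = sorted(lengths)
    n = len(lengths)

    def batch_cost(i, j):
        # Requests i..j (inclusive) padded to the longest one in the batch.
        return (j - i + 1) * lengths[j] + launch_overhead

    # dp[k] = minimal cost of scheduling the first k requests.
    dp = [0.0] + [float("inf")] * n
    cut = [0] * (n + 1)
    for k in range(1, n + 1):
        for i in range(k):
            cost = dp[i] + batch_cost(i, k - 1)
            if cost < dp[k]:
                dp[k], cut[k] = cost, i

    # Recover the chosen batches from the recorded cut points.
    batches, k = [], n
    while k > 0:
        batches.append(lengths[cut[k]:k])
        k = cut[k]
    return list(reversed(batches))

print(schedule_batches([5, 7, 8, 120, 128, 6, 130]))
```

Under this toy cost model, short requests share tight batches while the few very long requests form their own, which is the intuition behind scheduling variable-length requests for throughput.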

Wed 3 Mar

Displayed time zone: Eastern Time (US & Canada)

12:30 - 13:30
Session 10. Machine Learning and Software Engineering
Main Conference
Chair(s): Albert Cohen Google
12:30
15m
Talk
TurboTransformers: An Efficient GPU Serving System For Transformer Models
Main Conference
Jiarui Fang Tencent, Yang Yu, Chengduo Zhao Tencent, Jie Zhou Tencent
Link to publication
12:45
15m
Talk
Extracting Clean Performance Models from Tainted Programs
Main Conference
Marcin Copik ETH Zurich, Alexandru Calotoiu ETH Zurich, Tobias Grosser University of Edinburgh, Nicolas Wicki ETH Zurich, Felix Wolf TU Darmstadt, Torsten Hoefler ETH Zurich
Link to publication Pre-print
13:00
15m
Talk
Modernizing Parallel Code with Pattern Analysis
Main Conference
Roberto Castañeda Lozano University of Edinburgh, Murray Cole University of Edinburgh, Björn Franke University of Edinburgh
Link to publication
13:15
15m
Talk
DAPPLE: A Pipelined Data Parallel Approach for Training Large Models
Main Conference
Shiqing Fan Alibaba Group, Yi Rong Alibaba Group, Chen Meng Alibaba Group, ZongYan Cao Alibaba Group, Siyu Wang Alibaba Group, Zhen Zheng Alibaba Group, Chuan Wu The University of Hong Kong, Guoping Long Alibaba Group, Jun Yang Alibaba Group, LiXue Xia Alibaba Group, Lansong Diao Alibaba Group, Xiaoyong Liu Alibaba Group, Wei Lin Alibaba Group
Link to publication