Vol. 19, No. 5, May 31, 2025
10.3837/tiis.2025.05.003
Abstract
To address performance bottlenecks in GPTQ-quantized inference on heterogeneous platforms, we conducted an in-depth analysis of the vLLM architecture and identified linear algebra operations and fused operations as the components most critical to performance. Based on this observation, we propose TripleOptim, a comprehensive optimization framework designed to improve both inference speed and computational efficiency. TripleOptim combines three techniques. HalfMemOpt is a memory access optimization that uses half-precision numbers to reduce memory access latency. HighVecOpt is a high-precision vectorized floating-point arithmetic optimization that improves computational efficiency through parallelism. InstrMatOpt is an instruction-level optimization strategy that accelerates linear algebra operations, further streamlining the inference process. Experimental results show that TripleOptim increases throughput for all evaluated models, with gains ranging from 34% to 98%, demonstrating that these strategies substantially improve inference performance on heterogeneous platforms.
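As a rough illustration of why half-precision storage reduces memory traffic in quantized inference, the sketch below unpacks 4-bit weights and dequantizes them to half precision. It is not the paper's implementation. It assumes a layout common in GPTQ-style 4-bit checkpoints (eight 4-bit values per 32-bit word, with a scale and zero point), and the nibble order and all function names are hypothetical choices made for this example.

```python
import struct

def pack_int4(values):
    """Pack eight 4-bit integers (0..15) into one 32-bit word, lowest nibble first.

    The nibble order is an assumption made for this sketch.
    """
    assert len(values) == 8 and all(0 <= v <= 15 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word

def unpack_int4(word):
    """Recover the eight 4-bit integers from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def dequantize(word, scale, zero):
    """Dequantize as w = scale * (q - zero), rounding each result to half precision."""
    return [to_half(scale * (q - zero)) for q in unpack_int4(word)]

q = [0, 3, 7, 8, 9, 12, 15, 1]
word = pack_int4(q)
weights = dequantize(word, scale=0.5, zero=8)

# Bytes needed to hold the eight dequantized weights in each format.
half_bytes = len(weights) * struct.calcsize("<e")    # 2 bytes per value
single_bytes = len(weights) * struct.calcsize("<f")  # 4 bytes per value

print(weights)
print(half_bytes, single_bytes)
```

In this toy case the dequantized weights occupy half as many bytes in half precision as in single precision, which is the kind of saving a half-precision memory access optimization relies on. A real kernel would perform these steps on vectorized hardware types instead of Python lists.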
Cite this article
[IEEE Style]
W. Wang, L. Han, J. Zhou, J. Yu, J. Xie, C. Kong, Z. Gao, "TripleOptim: A Comprehensive Optimization Framework for GPTQ Quantization Inference on Heterogeneous Platforms," KSII Transactions on Internet and Information Systems, vol. 19, no. 5, pp. 1441-1458, 2025. DOI: 10.3837/tiis.2025.05.003.
[ACM Style]
Wei Wang, Lin Han, Jiehan Zhou, Jinling Yu, Jingming Xie, Chaowei Kong, and Zhenqi Gao. 2025. TripleOptim: A Comprehensive Optimization Framework for GPTQ Quantization Inference on Heterogeneous Platforms. KSII Transactions on Internet and Information Systems, 19, 5, (2025), 1441-1458. DOI: 10.3837/tiis.2025.05.003.
[BibTeX Style]
@article{tiis:102585, title="TripleOptim: A Comprehensive Optimization Framework for GPTQ Quantization Inference on Heterogeneous Platforms", author="Wei Wang and Lin Han and Jiehan Zhou and Jinling Yu and Jingming Xie and Chaowei Kong and Zhenqi Gao", journal="KSII Transactions on Internet and Information Systems", DOI={10.3837/tiis.2025.05.003}, volume={19}, number={5}, year="2025", month={May}, pages={1441-1458}}