Funding Information
This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2021-0-00779, Development of high-speed encrypted data processing technology with privacy-preserving hardware); by the National Research Foundation of Korea under the Next-Generation Intelligent Semiconductor Technology Development Program (Devices) funded by the government (Ministry of Science and ICT) (RS-2023-00258227); and by an IITP grant funded by the government (Ministry of Science and ICT) in 2023 (RS-2023-00229849, Development of an integrated MPU/Connectivity/lightweight-neural-network semiconductor based on an eFLASH foundry process for IoT Intelligence).