| Issue | JNWPU, Volume 40, Number 2, April 2022 |
|---|---|
| Page(s) | 344-351 |
| DOI | https://doi.org/10.1051/jnwpu/20224020344 |
| Published online | 03 June 2022 |
A reconfigurable processor for mix-precision CNNs on FPGA
1 School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China
2 School of Electronic Engineering, Xi'an University of Posts and Telecommunication, Xi'an 710121, China
Received: 15 July 2021
To solve the problem of the low computing efficiency of existing convolutional neural network (CNN) accelerators, which is caused by their inability to adapt to the computing and caching characteristics of mixed-precision quantized CNN models, we propose a reconfigurable CNN processor consisting of a reconfigurable computing unit, a flexible on-chip cache unit, and a macro-instruction set. The multi-core CNN processor can be reconfigured according to the structure of the CNN model and the constraints of the reconfigurable resources, improving the utilization of computing resources. The elastic on-chip buffer, together with a data-access approach that dynamically configures addresses, makes better use of on-chip memory. The macro-instruction set architecture (mISA) fully expresses the characteristics of mixed-precision CNN models and of the reconfigurable processor, reducing the complexity of mapping CNNs with different network structures and computing modes onto the reconfigurable CNN processor. For the well-known CNNs VGG-16 and ResNet-50, the proposed processor has been implemented on Ultra96-V2 and ZCU102 FPGAs. On Ultra96-V2 it achieves throughputs of 216.6 GOPS and 214 GOPS and computing efficiencies of 0.63 GOPS/DSP and 0.64 GOPS/DSP, respectively, better than CNN accelerators based on a fixed bit-width. On ZCU102, the throughput and computing efficiency for ResNet-50 reach 931.8 GOPS and 0.40 GOPS/DSP, respectively, up to 55.4% higher throughput than state-of-the-art CNN accelerators.
Abstract (translated from the Chinese)
To solve the low efficiency of existing convolutional neural network (CNN) accelerators, which cannot adapt to the computing mode and memory-access characteristics of mixed-quantization CNN models, we design a reconfigurable computing unit adapted to mixed-quantization models, an elastic on-chip cache unit, and a macro dataflow instruction set. A multi-core structure that can be reconfigured according to the CNN model structure improves the utilization of computing resources; an elastic storage structure and a tile-based dynamic cache-partitioning strategy improve on-chip data reuse; and a macro dataflow instruction set that effectively expresses the computation of mixed-precision CNN models and the characteristics of the reconfigurable processor reduces the complexity of the mapping strategy. On the Ultra96-V2 platform, VGG-16 and ResNet-50 achieve 216.6 and 214 GOPS with computing efficiencies of 0.63 and 0.64 GOPS/DSP. On the ZCU102 platform, ResNet-50 achieves 931.8 GOPS at 0.40 GOPS/DSP; compared with similar CNN accelerators, computing performance and computing efficiency are improved by up to 55.4% and 100%, respectively.
Key words: mixed-precision quantization / convolutional neural network accelerator / reconfigurable computing
© 2022 Journal of Northwestern Polytechnical University. All rights reserved.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.