FPGA Hardware Acceleration of Inception Style Parameter Reduced Convolution Neural Networks
KTH, School of Information and Communication Technology (ICT).
2016 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Some researchers have noted that the growth rate in the number of network parameters of many recently proposed state-of-the-art CNN topologies is placing unrealistic demands on hardware resources and limiting the practical applications of neural networks. This is particularly apparent considering that many of the projected applications (IoT, autonomous vehicles, etc.) use embedded systems with even greater restrictions on computation and memory bandwidth than the typical research-class computer cluster that the CNN was designed on.

In 2014, the GoogLeNet CNN proposed a new level of organization (“Inception Module”) that was demonstrated in competition to achieve similar or better performance while using an order of magnitude fewer network parameters than the other competing topologies.
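
The abstract does not say where the parameter saving comes from. One commonly cited source in inception-style modules is a 1x1 "reduce" convolution placed in front of a larger filter. The sketch below counts weights and biases for a single 5x5 branch with and without such a reduction; the channel counts and the `conv_params` helper are illustrative assumptions, not figures taken from the thesis.

```python
def conv_params(in_ch, out_ch, k):
    """Weights plus biases of a k x k convolution layer."""
    return in_ch * out_ch * k * k + out_ch


# Hypothetical channel sizes, chosen only for illustration.
in_ch, reduce_ch, out_ch = 192, 16, 32

# A single 5x5 convolution applied directly to the input.
direct = conv_params(in_ch, out_ch, 5)

# A 1x1 reduction to reduce_ch channels followed by the 5x5 convolution.
bottleneck = conv_params(in_ch, reduce_ch, 1) + conv_params(reduce_ch, out_ch, 5)

print(direct, bottleneck, round(direct / bottleneck, 2))
```

With these assumed sizes the reduced branch needs roughly a tenth of the parameters of the direct one, which is the kind of saving the abstract refers to. The price is an extra intermediate array between the two convolutions.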

This thesis explores the characteristics of the new GoogLeNet inception modules and the implications they present for current CNN accelerator architectures. A custom FPGA accelerator is proposed that offsets the inception module’s increased need to buffer large intermediate convolution arrays by partitioning arrays and cascading two convolution operations into a single pipeline pass.
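
The idea of cascading two convolutions can be illustrated in software. The sketch below is not the thesis's hardware design; it is a minimal pure-Python model that assumes a 1x1 convolution feeding a 3x3 convolution (no padding, stride 1), with all function names hypothetical. `two_pass` buffers the whole intermediate array, while `cascaded` keeps only a three-row window of it and produces the same result.

```python
from collections import deque


def pointwise_row(x, w1, r):
    """1x1 convolution of input row r. x: [C_in][H][W], w1: [C_mid][C_in]."""
    c_in, width = len(x), len(x[0][0])
    return [[sum(w1[m][c] * x[c][r][j] for c in range(c_in))
             for j in range(width)]
            for m in range(len(w1))]


def conv3x3_rows(rows, w3):
    """One output row of a 3x3 convolution from three intermediate rows.

    rows: three entries shaped [C_mid][W]; w3: [C_out][C_mid][3][3].
    """
    c_mid, width = len(rows[0]), len(rows[0][0])
    return [[sum(w3[o][m][di][dj] * rows[di][m][j + dj]
                 for m in range(c_mid) for di in range(3) for dj in range(3))
             for j in range(width - 2)]
            for o in range(len(w3))]


def two_pass(x, w1, w3):
    """Reference: materialise the full intermediate array, then convolve."""
    height = len(x[0])
    mid = [pointwise_row(x, w1, r) for r in range(height)]
    return [conv3x3_rows(mid[r:r + 3], w3) for r in range(height - 2)]


def cascaded(x, w1, w3):
    """Single pass: only the last three intermediate rows are kept."""
    window = deque(maxlen=3)  # sliding row buffer for the intermediate array
    out = []
    for r in range(len(x[0])):
        window.append(pointwise_row(x, w1, r))
        if len(window) == 3:
            out.append(conv3x3_rows(list(window), w3))
    return out
```

Both functions return the output as `[H_out][C_out][W_out]`. In the cascaded version the intermediate storage is bounded by three rows regardless of image height, which is the buffering saving the paragraph above describes.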

The architecture was implemented on a Xilinx Artix-7 FPGA, where it was able to continuously supply data to the 331 utilized DSP blocks (approx. half of the total available) while using only a quarter of the DDR bandwidth, achieving a peak throughput of 9.11 GFLOPS. The low utilization of the DDR bandwidth suggests that, with some optimization, the design can be scaled up to better utilize the available resources and increase throughput.

Place, publisher, year, edition, pages
2016. 54 p.
Series
TRITA-ICT-EX, 2016:188
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
URN: urn:nbn:se:kth:diva-205026
OAI: oai:DiVA.org:kth-205026
DiVA: diva2:1087367
Subject / course
Electrical Engineering
Educational program
Master of Science - System-on-Chip Design
Examiners
Available from: 2017-04-06
Created: 2017-04-06
Last updated: 2017-04-06
Bibliographically approved

Open Access in DiVA

fulltext (2075 kB), 152 downloads
File information
File name: FULLTEXT01.pdf
File size: 2075 kB
Checksum (SHA-512): 241614ea738ce66daef1b4c860d563cf707cc589fe98baf40046befd1a9c4525f752518908f30d2f2bfd859fd64352d85c1ff2de5c6d16d3d81e632517d70825
Type: fulltext
Mimetype: application/pdf

Total: 152 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 134 hits