Tuesday, January 14, 2020

| Time | Session |
|---|---|
| 9:00 - 10:30 | Opening and Keynote Session I |
| 10:45 - 12:00 | Parallel sessions (Rooms 310, 308, 307A, 307B) |
| 12:00 - 12:40 | Keynote Session II |
| 12:40 - 14:00 | |
| 14:00 - 15:40 | Parallel sessions (Rooms 310, 308, 307A, 307B) |
| 15:40 - 16:00 | |
| 16:00 - 17:15 | Parallel sessions (Rooms 310, 308, 307A, 307B) |

Wednesday, January 15, 2020

Thursday, January 16, 2020

| Time | Session |
|---|---|
| 9:00 - 10:00 | Keynote Session VI |
| 10:00 - 10:15 | |
| 10:15 - 11:30 | Parallel sessions (Rooms 310, 308, 307A, 307B) |
| 11:30 - 13:50 | |
| 13:50 - 15:30 | Parallel sessions (Rooms 310, 308, 307A, 307B) |
| 15:30 - 15:45 | |
| 15:45 - 17:00 | Parallel sessions (Rooms 310, 308, 307A, 307B) |
Tuesday, January 14, 2020
Title | (Keynote Address) Skin Electronics for Continuous Health Monitoring |
Author | Takao Someya (The University of Tokyo, Japan) |
Keyword | Keynote |
Abstract | Flexible and stretchable hybrid electronics are expected to open up a new class of applications ranging from healthcare, medicine, sports, and wellness to human-machine interfaces and new IT fashion. In particular, to expand emerging applications of wearable technologies, printed flexible biomedical sensors have attracted much attention recently. In order to minimize the discomfort of wearing sensors, it is highly desirable to use soft electronic materials, particularly for devices that come directly into contact with the skin and/or biological tissues. In this regard, electronics manufactured by printing on thin polymeric films, elastomeric substrates, and textile substrates are very attractive. In this talk, I will review recent progress on wearables, smart apparel, and artificial electronic skins (E-skins) in the context of high-precision and long-term vital-signal monitoring. Furthermore, the open issues and future prospects of wearables and beyond will be addressed. |
Title | Design of a Single-Stage Wireless Charger with 92.3%-Peak-Efficiency for Portable Devices Applications |
Author | *Lin Cheng (University of Science and Technology of China, China), Xinyuan Ge, Wai Chiu Ng, Wing-Hung Ki, Jiawei Zheng, Tsz Fai Kwok, Chi-Ying Tsui (The Hong Kong University of Science and Technology, China), Ming Liu (Institute of Microelectronics, Chinese Academy of Sciences, China) |
Page | pp. 1 - 2 |
Keyword | Wireless charging, single-stage, CC-CV charging, high efficiency |
Abstract | This summary presents a fully integrated wireless charger that achieves high efficiency with low cost and volume. The charger realizes power rectification, voltage regulation, and CC-CV charging in a single power stage. A bootstrapping technique is also designed for on-chip integration of the bootstrap capacitors. A chip prototype was fabricated in a standard 0.35µm CMOS process with a die area of 8mm2. The charger achieves peak efficiencies of 92.3% and 91.4% when the charging currents are 1A and 1.5A, respectively. |
Title | A Capacitance-to-Digital Converter with Differential Bondwire Accelerometer, On-chip Air Pressure and Humidity Sensor in 0.18 um CMOS |
Author | Sujin Park (Korea Advanced Institute of Science and Technology, Republic of Korea), Geon-Hwi Lee (Korea Advanced Institute of Science and Technology/SK Hynix, Republic of Korea), *Seungmin Oh, SeongHwan Cho (Korea Advanced Institute of Science and Technology, Republic of Korea) |
Page | pp. 3 - 4 |
Keyword | Capacitance-to-digital converter, Air pressure sensor, Relative humidity sensor, Accelerometer, Standard CMOS sensor |
Abstract | This paper presents a sensor front-end for an air pressure sensor, a relative humidity (RH) sensor, and an accelerometer in a standard CMOS process. For air pressure and RH, interdigitated top metals in air and in polyimide, respectively, are exploited, as they exhibit a change in dielectric constant. For acceleration, the separation among three bondwires is exploited. These sensing transducers induce a capacitance change that is quantized by a capacitance-to-digital converter (CDC) based on a dual-quantization architecture that employs a single-bit 1st-order delta-sigma modulator and a 7-bit SAR ADC. |
Title | A 28GHz CMOS Differential Bi-Directional Amplifier for 5G NR |
Author | *Zheng Li, Jian Pang, Ryo Kubozoe, Xueting Luo, Rui Wu, Yun Wang, Dongwon You, Ashbir Aviat Fadila, Joshua Alvin, Bangan Liu, Zheng Sun, Hongye Huang, Atsushi Shirane, Kenichi Okada (Tokyo Institute of Technology, Japan) |
Page | pp. 5 - 6 |
Keyword | 28GHz, bi-directional, amplifier, CMOS, 5G NR |
Abstract | A 28GHz differential bi-directional amplifier in a standard 65nm CMOS process is presented. This work is realized based on the neutralized bi-directional core together with the fully shared inter-stage matching networks. The core chip area is only 0.11mm2. At 28GHz, a 15.1-dBm saturation output power and a 4.2-dB noise figure are realized for PA mode and LNA mode, respectively. The DC power consumptions for PA mode and LNA mode are 149mW and 31mW, respectively, under 1-V DC supply. |
Title | A Quantity Evaluation and Reconfiguration Mechanism for Signal- and Power-Interconnections in 3D-Stacking System |
Author | *Ching-Hwa Cheng (Feng Chia University, Taiwan) |
Page | pp. 7 - 8 |
Keyword | 3D stacking system, Interconnection test, design for testability |
Abstract | Due to the high integration required by system applications, three-dimensional vertically stacked (3D-stacking) systems have been proposed to satisfy these requirements. However, a 3D-stacking system carries several design risks arising from its long inter-layer interconnections. For a 3D-stacking system, it is difficult to identify where, among the numerous power and signal interconnections, an open, short, or resistive-short fault has occurred. Therefore, solving these interconnection problems is necessary. A feasible interconnection quality-evaluation, fault-diagnosis, and connection-reconfiguration mechanism is proposed. The proposed interconnection-measurement-recovery (IMR) mechanism makes it easy to find interconnection faults and recover from them in 3D-stacking systems. The proposed IMR can detect interconnection open, short, bridge, and resistive defects with a path-reroute mechanism. Furthermore, the signal transmission quality can be measured; this measurement allows signal propagation to be monitored with picosecond accuracy. IMR incurs little extra area and power consumption overhead. The feasibility of the proposed mechanism has been demonstrated on both a 2D-chip and a 3D-stacking MorPack system. |
Title | An Inductively Coupled Wireless Bus for Chiplet-Based Systems |
Author | *Junichiro Kadomoto, Satoshi Mitsuno, Hidetsugu Irie, Shuichi Sakai (The University of Tokyo, Japan) |
Page | pp. 9 - 10 |
Keyword | chiplet, inductive coupling, wireless communication |
Abstract | A wireless bus for inter-chiplet communication is presented. Utilizing horizontal inductive coupling of on-chip coils, a wireless connection between chiplets is established. A test chip prototyped in 0.18 μm CMOS confirms 2.0 Gb/s bus communication between horizontally arranged coils with a BER of less than 10⁻¹². |
Title | FPGA-based Heterogeneous Solver for Three-Dimensional Routing |
Author | Kento Hasegawa, *Ryota Ishikawa, Makoto Nishizawa, Kazushi Kawamura, Masashi Tawada, Nozomu Togawa (Waseda University, Japan) |
Page | pp. 11 - 12 |
Keyword | heterogeneous, FPGA |
Abstract | Heuristic algorithms are a common approach to solving NP-hard problems. In order to enhance the capability of such a system, heterogeneous computing is often adopted. In this paper, we propose an FPGA-based heterogeneous solver for three-dimensional routing. The proposed system is implemented on multiple FPGA boards and a single-board computer. The experimental results demonstrate that the proposed system outperforms a single-FPGA system. |
Title | PowerNet: Transferable Dynamic IR Drop Estimation via Maximum Convolutional Neural Network |
Author | *Zhiyao Xie (Duke University, USA), Haoxing Ren, Brucek Khailany, Ye Sheng, Santosh Santosh (Nvidia, USA), Jiang Hu (TAMU, USA), Yiran Chen (Duke University, USA) |
Page | pp. 13 - 18 |
Keyword | IR drop, machine learning |
Abstract | IR drop is a fundamental constraint required by almost all chip designs. However, its evaluation usually takes a long time, which hinders the mitigation techniques used to fix its violations. In this work, we develop a fast dynamic IR drop estimation technique, named PowerNet, based on a convolutional neural network (CNN). It can handle both vector-based and vectorless IR analyses. Moreover, the proposed CNN model is general and transferable to different designs. This is in contrast to most existing machine learning (ML) approaches, where a model is applicable only to a specific design. Experimental results show that PowerNet outperforms the latest ML method by 9% in accuracy for the challenging case of vectorless IR drop and achieves a 30× speedup compared to an accurate commercial IR drop tool. Further, a mitigation tool guided by PowerNet reduces IR drop hotspots by 26% and 31% on two industrial designs, respectively, with very limited modification of their power grids. |
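As a rough illustration of the kind of model the abstract describes, the sketch below maps a tile of per-cell power-related feature maps to a predicted IR drop value with a small CNN. It is not the authors' PowerNet architecture; the channel counts, kernel sizes, and the four hypothetical feature maps are invented for the example.

```python
# Minimal sketch (not the authors' model): a small CNN that maps a tile of
# per-cell power features to the IR drop at the tile centre. Channel counts,
# kernel sizes and the 4 hypothetical feature maps are illustrative only.
import torch
import torch.nn as nn

class TinyIRDropCNN(nn.Module):
    def __init__(self, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # collapse the spatial tile
        )
        self.head = nn.Linear(32, 1)          # predicted IR drop (a scalar)

    def forward(self, x):                     # x: (batch, channels, H, W)
        return self.head(self.features(x).flatten(1))

# Example: 32x32 tiles with 4 power-related feature maps per cell.
model = TinyIRDropCNN()
tiles = torch.rand(8, 4, 32, 32)
print(model(tiles).shape)                     # torch.Size([8, 1])
```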
Title | FIST: A Feature-Importance Sampling and Tree-Based Method for Automatic Design Flow Parameter Tuning |
Author | *Zhiyao Xie (Duke University, USA), Guan-Qi Fang, Yu-Hung Huang (National Taiwan University of Science and Technology, Taiwan), Haoxing Ren, Yanqing Zhang, Brucek Khailany (Nvidia, USA), Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan), Jiang Hu (TAMU, USA), Yiran Chen (Duke University, USA), Erick Carvajal Barboza (TAMU, USA) |
Page | pp. 19 - 25 |
Keyword | automatic parameter tuning, design flow, machine learning |
Abstract | Design flow parameters are of utmost importance to chip design quality and require a painfully long time to evaluate their effects. In reality, flow parameter tuning is usually performed manually based on designers' experience in an ad hoc manner. In this work, we introduce a machine learning-based automatic parameter tuning methodology that aims to find the best design quality with a limited number of trials. Instead of merely plugging in machine learning engines, we develop clustering and approximate sampling techniques for improving tuning efficiency. The feature extraction in this method can reuse knowledge from prior designs. Furthermore, we leverage a state-of-the-art XGBoost model and propose a novel dynamic tree technique to overcome overfitting. Experimental results on benchmark circuits show that our approach achieves 25% improvement in design quality or 37% reduction in sampling cost compared to random forest method, which is the kernel of a highly cited previous work. Our approach is further validated on two industrial designs. By sampling less than 0.02% of possible parameter sets, it reduces area by 1.83% and 1.43% compared to the best solutions hand-tuned by experienced designers. |
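To make the tuning loop concrete, here is a minimal sketch of model-guided flow-parameter tuning under a limited trial budget. It is not the authors' FIST code: a scikit-learn gradient-boosted tree stands in for XGBoost, the feature-importance sampling and clustering steps are omitted, and evaluate_flow() is a made-up stand-in for actually running the design flow.

```python
# Minimal sketch of model-guided flow-parameter tuning in the spirit of FIST
# (not the authors' implementation). A gradient-boosted tree model stands in
# for XGBoost; evaluate_flow() is a hypothetical stand-in for the real flow.
import itertools, random
from sklearn.ensemble import GradientBoostingRegressor

def evaluate_flow(params):                 # hypothetical: returns a QoR cost
    return sum((p - 0.3 * i) ** 2 for i, p in enumerate(params))

space = list(itertools.product([0, 1], repeat=6))    # toy parameter space
random.seed(0)
explored = {p: evaluate_flow(p) for p in random.sample(space, 8)}

for _ in range(10):                        # limited trial budget
    model = GradientBoostingRegressor().fit(list(explored), list(explored.values()))
    candidates = [p for p in space if p not in explored]
    best = min(candidates, key=lambda p: model.predict([p])[0])
    explored[best] = evaluate_flow(best)   # run the flow on the chosen setting

print(min(explored.items(), key=lambda kv: kv[1]))   # best parameter set found
```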
Title | High-Definition Routing Congestion Prediction for Large-Scale FPGAs |
Author | Mohamed Baker Alawieh, Wuxi Li, *Yibo Lin (The University of Texas at Austin, USA), Love Singhal, Mahesh A. Iyer (Intel, USA), David Z. Pan (The University of Texas at Austin, USA) |
Page | pp. 26 - 31 |
Keyword | Congestion, Routing, FPGA, Machine Learning |
Abstract | To speed up FPGA placement and routing closure, we propose a novel approach to predict the routing congestion map for large-scale FPGA designs at the placement stage. After reformulating the problem into an image translation task, our proposed approach leverages recent advances in generative adversarial learning to address the task. In particular, state-of-the-art generative adversarial networks for high-resolution image translation are used along with well-engineered features extracted from the placement stage. Unlike available approaches, our novel framework demonstrates a capability of handling large-scale FPGA designs. With its superior accuracy, our proposed approach can be incorporated into the placement engine to provide congestion prediction, resulting in up to 7% reduction in routed wirelength for the most congested design in the ISPD 2016 benchmark suite. |
Title | Integrated Airgap Insertion and Layer Reassignment for Circuit Timing Optimization |
Author | *Younggwang Jung, Daijoon Hyun, Youngsoo Shin (Korea Advanced Institute of Science and Technology, Republic of Korea) |
Page | pp. 32 - 37 |
Keyword | Airgap, Airgap insertion, Layer reassignment, Timing optimization |
Abstract | An airgap is an intentional void formed in the inter-metal dielectric. It reduces coupling capacitance and so can be used to improve circuit timing. Airgaps can be utilized only in a limited number of metal layers due to their high process cost. For given airgap layers, two problems should be addressed to insert airgaps: relocating some metal segments in non-airgap layers into airgap layers (called layer reassignment) and determining the amount of airgap for each metal segment in airgap layers (airgap insertion). The two problems are solved together in this paper with the goal of maximizing setup total negative slack (TNS) while assuring no hold violations. The problem is formulated as mixed integer quadratically constrained programming (MIQCP); a heuristic algorithm is proposed for practical application, and its performance against MIQCP is experimentally assessed using small test circuits. Experiments demonstrate that TNS and WNS are improved by 35% and 10%, respectively, while a simple-minded approach achieves 6% and 4% less improvement than the proposed method. |
Title | An Adaptive Electromigration Assessment Algorithm for Full-chip Power/Ground Networks |
Author | *Shaobin Ma, Xiaoyi Wang (Beijing Engineering Research Center for IoT Software and Systems,Beijing University of Technology, China), Sheldon X.-D. Tan, Liang Chen (Department of Electrical and Computer Engineering, University of California, Riverside, USA), Jian He (Beijing Engineering Research Center for IoT Software and Systems,Beijing University of Technology, China) |
Page | pp. 38 - 43 |
Keyword | Electromigration, Power/Ground Networks, Eigenfunction |
Abstract | In this paper, an adaptive algorithm is proposed to perform electromigration (EM) assessment for full-chip power/ground networks. Based on eigenfunction solutions, the proposed method improves efficiency by properly selecting the eigenfunction terms and utilizing closed-form eigenfunctions for commonly seen interconnect wires such as T-shaped or cross-shaped wires. It is demonstrated that the proposed method trades off well among the accuracy, efficiency, and applicability of eigenfunction-based methods. The experimental results show that the proposed method is about three times faster than the finite difference method and other eigenfunction-based methods. |
Title | Template-based PDN Synthesis in Floorplan and Placement Using Classifier and CNN Techniques |
Author | *Vidya A. Chhabria (University of Minnesota, USA), Andrew B. Kahng, Minsoo Kim, Uday Mallappa (University of California, San Diego, USA), Sachin S. Sapatnekar (University of Minnesota, USA), Bangqi Xu (University of California, San Diego, USA) |
Page | pp. 44 - 49 |
Keyword | Power Delivery Network, Machine Learning |
Abstract | Designing an optimal power delivery network (PDN) is a time-intensive task that involves many iterations. This paper proposes a methodology that employs a library of predesigned, stitchable templates and uses machine learning (ML) to rapidly build a PDN with region-wise uniform pitches based on these templates. Our methodology is applicable at both the floorplan and placement stages of physical implementation. (i) At the floorplan stage, we synthesize an optimized PDN based on early estimates of current and congestion, using a simple multilayer perceptron classifier. (ii) At the placement stage, we incrementally optimize an existing PDN based on more detailed congestion and current distributions, using a convolutional neural network. At each stage, the neural network builds a safe-by-construction PDN that meets IR drop and electromigration (EM) specifications. On average, the optimization of the PDN brings an extra 3% (1,850 tracks) of routing resources, which corresponds to thousands of routing tracks in congestion-critical regions, when compared to a globally uniform PDN, while staying within the IR drop and EM limits. |
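For intuition on the floorplan-stage classification step, here is a small sketch of an MLP picking one of a few PDN templates per region from current and congestion estimates. The training data, the labelling rule, and the three template ids are synthetic assumptions, not taken from the paper.

```python
# Minimal sketch of the template-selection idea (not the authors' flow):
# an MLP classifier picks one of a few hypothetical PDN templates per region
# from early current/congestion estimates. Training data here is synthetic.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 2))                      # [current density, congestion] per region
# Hypothetical labelling rule: higher current demands a denser template (0..2).
y = np.digitize(0.7 * X[:, 0] + 0.3 * X[:, 1], [0.4, 0.7])

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
regions = np.array([[0.9, 0.2], [0.1, 0.8]])  # two example regions
print(clf.predict(regions))                   # chosen template id per region
```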
Title | Analyzing The Security of The Cache Side Channel Defences With Attack Graphs |
Author | *Limin Wang, Ziyuan Zhu, Zhanpeng Wang, Dan Meng (Institute of Information Engineering, Chinese Academy of Sciences, China) |
Page | pp. 50 - 55 |
Keyword | cache side-channel defences, micro-architecture, model checking, attack graph, early-stage design |
Abstract | Very limited work has been proposed to analyze the security of defenses against cache side-channel attacks at the micro-architecture level. In this paper, we propose a model-based method to generate a visual attack graph and analyze the security of micro-architectural security designs in the early stages of processor design. The experiments indicate that our method can identify the particular attack paths that some common security designs fail to defend against and show them in an attack graph. |
Title | iGPU Leak: An Information Leakage Vulnerability on Intel Integrated GPU |
Author | *Wenjian He, Wei Zhang (The Hong Kong University of Science and Technology, Hong Kong), Sharad Sinha (Indian Institute of Technology Goa, India), Sanjeev Das (University of North Carolina at Chapel Hill, USA) |
Page | pp. 56 - 61 |
Keyword | Security, Information Leakage, Integrated GPU |
Abstract | Hardware accelerators such as integrated graphics processing units (iGPUs) are increasingly prevalent in modern systems. They typically provide multiplexing support where several user applications can share the iGPU acceleration resources. However, security in this setting has not received sufficient consideration. In this work, we disclose a critical information leakage vulnerability due to defective GPU context management. In essence, residual register values and shared local memory in the iGPU are not cleared during a context switch. As a result, adversaries can recover the secret key of a cryptographic algorithm running on an iGPU from a single snapshot of the leaking channel. User privacy is also under threat due to browser activity eavesdropping through website-fingerprinting attack with high accuracy and resolution. Moreover, this vulnerability can constitute a covert channel with a bandwidth of up to 8 Gbps. |
Title | Design for EM Side-Channel Security through Quantitative Assessment of RTL Implementations |
Author | *Jiaji He (Tsinghua University, China), Haocheng Ma (Tianjin University, China), Xiaolong Guo (Kansas State University, USA), Yiqiang Zhao (Tianjin University, China), Yier Jin (University of Florida, USA) |
Page | pp. 62 - 67 |
Keyword | side-channel attack, design for side-channel security, T-test, RTL hardware implementation |
Abstract | Electromagnetic (EM) side-channel attacks aim at extracting secret information from cryptographic hardware implementations. Countermeasures have been proposed at the device level, register-transfer level (RTL), and layout level; although these are efficient, quantitative assessment of a hardware implementation's resistance against EM side-channel attacks is still needed. In this paper, we propose a design-for-EM-side-channel-security evaluation and optimization framework based on t-test evaluation results derived from RTL hardware implementations. Different implementations of the same cryptographic algorithm are evaluated under different hypothesized leakage models considering the driving capabilities of logic components, and the evaluation results are validated with side-channel attacks on an FPGA platform. Experimental results prove the feasibility of the proposed side-channel leakage evaluation method at the pre-silicon stage. Remedies and suggested security design rules are also discussed. |
Title | (Keynote Address) Edge-to-Cloud Innovations for Inclusive AI |
Author | Xiaoning Qi (Alibaba) |
Abstract | Technology has propelled us into an era of data and AI, and computing power is the force behind it all. At the core of computing power is the tiny yet mighty chip. T-Head has formed a full-stack chip system that facilitates edge-to-cloud integration, including processor IPs, SoC platforms, and AI chips. T-Head's success in hardware-software innovation is built on its self-developed chip structure and bolstered by Alibaba DAMO Academy's leading AI algorithms and AliOS operating system. |
Title | (Invited Paper) Impact of Self-Heating On Performance, Power and Reliability in FinFET Technology |
Author | Victor M. van Santen, Paul R. Genssler, Om Prakash, Simon Thomann, Jörg Henkel, *Hussam Amrouch (Karlsruhe Institute of Technology, Germany) |
Page | pp. 68 - 73 |
Keyword | Self Heating, Reliability, FinFET, Temperature |
Abstract | Self-heating is one of the biggest threats to reliability in current CMOS technologies such as FinFET and nanowire. Encapsulating the channel with the gate dielectric improves electrostatics but also thermally insulates the channel, resulting in elevated channel temperatures as the generated heat is trapped within the channel. Elevated channel temperatures lower performance, increase power consumption, and lower the reliability of circuits built from FinFET or nanowire transistors. This work provides an overview of self-heating with respect to circuit design. |
Title | (Invited Paper) Reliable Power Grid Network Design Framework Considering EM Immortalities for Multi-Segment Wires |
Author | Han Zhou, Shuyuan Yu, Zeyu Sun, *Sheldon X.-D. Tan (University of California, Riverside, USA) |
Page | pp. 74 - 79 |
Keyword | Power Grid, Electromigration, Immortality, Multi-Segment |
Abstract | This paper presents a new power grid network design and optimization technique that considers a new EM immortality constraint due to EM void saturation volume for multi-segment interconnects. A void may grow to its saturation volume without changing the wire resistance significantly. However, this phenomenon was ignored in existing EM-aware optimization methods. By considering this new effect, we can remove more conservativeness in EM-aware on-chip power grid design. Along with the recently proposed nucleation-phase immortality constraint for multi-segment wires, we show that both EM immortality constraints can be naturally integrated into the existing programming-based power grid optimization framework. To further mitigate the overly conservative nature of existing immortality-constrained optimization methods, we explore two strategies: first, we size up failed wires to meet one of the immortality conditions subject to design rules; second, we consider the EM-induced aging effects on power supply networks for a targeted lifetime, which allows some short-lifetime wires to fail and optimizes the rest of the wires. Numerical results on a number of IBM and self-generated power supply networks demonstrate that the new method can reduce power grid area further compared to the existing EM-immortality-constrained optimizations. Furthermore, the new method can optimize power grids with nucleated wires, which would not be possible with the existing methods. |
Title | (Invited Paper) Investigating the Inherent Soft Error Resilience of Embedded Applications by Full-System Simulation |
Author | Uzair Sharif, Daniel Müller-Gritschneder, *Ulf Schlichtmann (Technical University of Munich, Germany) |
Page | pp. 80 - 84 |
Keyword | Soft error resilience, silent data corruption, application resilience, safety critical embedded systems |
Abstract | It has long been acknowledged that some applications feature inherent resilience against soft errors; e.g., the impact of soft errors on multimedia applications is often not visible to humans. In this paper, we investigate the inherent resilience of two typical embedded applications using a case study of a control system and a robot arm. Both studies were enabled by our mixed-mode fault injection simulator ETISS-ML, which allows RTL-accurate fault injection while being able to simulate very long scenarios, e.g., robot movements of several seconds. Our results indicate that full simulation of the embedded system and its environment is required to classify whether the system can tolerate the impact of a soft error. This is because it is hard to predict the impact of a certain output deviation without investigating the change in system behavior while taking the control loop into account. Based on this classification method, we hope to be able to exploit this resilience for lowering the cost of error detection mechanisms in future research. |
Title | Co-Exploring Neural Architecture and Network-on-Chip Design for Real-Time Artificial Intelligence |
Author | *Lei Yang (University of Pittsburgh, USA), Weiwen Jiang (University of Notre Dame, USA), Weichen Liu (Nanyang Technological University, Singapore), Edwin H. M. Sha (East China Normal University, China), Yiyu Shi (University of Notre Dame, USA), Jingtong Hu (University of Pittsburgh, USA) |
Page | pp. 85 - 90 |
Keyword | NAS and NoC Co-Exploration |
Abstract | Hardware-aware Neural Architecture Search (NAS), which automatically finds an architecture that works best on a given hardware design, has prevailed in response to the ever-growing demand for real-time Artificial Intelligence (AI). However, in many situations, the underlying hardware is not pre-determined. We argue that simply assuming an arbitrary yet fixed hardware design will lead to inferior solutions, and it is best to co-explore neural architecture space and hardware design space for the best pair of neural architecture and hardware design. To demonstrate this, we employ Network-on-Chip (NoC) as the infrastructure and propose a novel framework, namely NANDS, to co-explore NAS space and NoC Design Search (NDS) space with the objective to maximize accuracy and throughput. Since two metrics are tightly coupled, we develop a multi-phase manager to guide NANDS to gradually converge to solutions with the best accuracy-throughput tradeoff. On top of it, we propose techniques to detect and alleviate timing performance bottleneck, which allows better and more efficient exploration of NDS space. |
Title | Thanos: High-Performance CPU-GPU Based Balanced Graph Partitioning Using Cross-Decomposition |
Author | Dae Hee Kim, Rakesh Nagi, *Deming Chen (University of Illinois at Urbana-Champaign, USA) |
Page | pp. 91 - 96 |
Keyword | Graph Partitioning, GPU, Cross-Decomposition, Acceleration |
Abstract | As graphs become larger and more complex, it is becoming nearly impossible to process them without graph partitioning. Graph partitioning creates many subgraphs which can be processed in parallel thus delivering high-speed computation results. However, graph partitioning is a difficult task. In this work, we introduce Thanos, a fast graph partitioning tool which uses the cross-decomposition algorithm that iteratively partitions a graph. It also produces balanced loads of partitions. The algorithm is well suited for parallel GPU programming which leads to fast and high-quality graph partitioning solutions. Experimental results show that we have achieved 30x speedup and 35% better edge cut reduction compared to the CPU version of the popular graph partitioner, METIS, on average. |
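To give a feel for balanced, cut-minimizing partitioning, here is a small sequential sketch: vertices are repeatedly moved to the side holding more of their neighbours as long as the two sides stay balanced. This is only a simplified CPU illustration of balanced bipartitioning, not the paper's cross-decomposition algorithm or its GPU kernels; the karate-club graph and the slack parameter are chosen just for the demo.

```python
# Minimal CPU sketch of an iterative, balance-aware bipartitioning pass
# (a simplified illustration, not the paper's cross-decomposition on GPU).
import networkx as nx

def balanced_bipartition(G, iters=10, slack=1):
    part = {v: i % 2 for i, v in enumerate(G.nodes)}          # initial split
    for _ in range(iters):
        for v in G.nodes:
            pull = sum(1 if part[u] else -1 for u in G.neighbors(v))
            want = 1 if pull > 0 else 0
            sizes = [sum(1 for p in part.values() if p == s) for s in (0, 1)]
            # move only if it improves locality and keeps the sides balanced
            if want != part[v] and sizes[want] - sizes[1 - want] < slack:
                part[v] = want
    return part

G = nx.karate_club_graph()
part = balanced_bipartition(G)
cut = sum(1 for u, v in G.edges if part[u] != part[v])
print("cut edges:", cut, "sizes:", sum(part.values()), len(G) - sum(part.values()))
```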
Title | Reutilization of Trace Buffers for Performance Enhancement of NoC based MPSoCs |
Author | *Sidhartha Sankar Rout, Badri M, Sujay Deb (Indraprastha Institute of Information Technology, Delhi, India) |
Page | pp. 97 - 102 |
Keyword | network on chip, design for debug, trace buffers, virtual channels, fair division |
Abstract | Contemporary networks-on-chip (NoCs) are so complex that capturing all network functional faults at the pre-silicon verification stage is nearly impossible. Therefore, on-chip design-for-debug (DfD) structures such as trace buffers are provided to assist in capturing escaped faults during post-silicon debug. Most of the DfD modules are left idle after the debug process. Reuse of such structures can compensate for the area overhead they introduce. In this work, the trace buffers are reutilized as extended virtual channels for the router nodes of an NoC during in-field execution. Optimal distribution of trace buffers among the routers is performed based on their load profiling. Experiments with several benchmarks on the proposed architecture show an average 11.36% increase in network throughput and 13.97% decrease in average delay. |
Title | Formal Semantics of Predictable Pipelines: a Comparative Study |
Author | Mathieu Jan, *Mihail Asavoae (CEA LIST, France), Martin Schoeberl (Technical University of Denmark, Denmark), Edward A. Lee (University of California at Berkeley, USA) |
Page | pp. 103 - 108 |
Keyword | real-time systems, timing anomalies, model-checking |
Abstract | Computer architectures used in safety-critical domains are subjected to worst-case execution time analysis. The presence of performance-driven microarchitectures may trigger undesired timing phenomena, called timing anomalies, and complicate the timing analysis. This paper investigates pipelines specifically designed to simplify worst-case execution time analysis (also called predictable pipelines). We propose formal and executable models of four research-oriented pipelines and one industrial pipeline to validate some of the claims related to their timing behavior. Via bounded model checking, we validate the absence of a type of timing anomaly called amplification timing anomalies, or identify their potential presence by establishing the prerequisites for situations in which they can occur. |
Title | Maximizing the Communication Parallelism for Wavelength-Routed Optical Networks-on-Chips |
Author | *Mengchu Li, Tsun-Ming Tseng (Technical University of Munich, Germany), Mahdi Tala (University of Ferrara, Italy), Ulf Schlichtmann (Technical University of Munich, Germany) |
Page | pp. 109 - 114 |
Keyword | WRONoC, Bit-Parallelism, ILP |
Abstract | Wavelength-routed optical networks-on-chips (WRONoCs) apply a passive routing mechanism that statically reserves all data transmission paths at design time, and are thus able to avoid the latency and energy overhead for arbitration. Current research mostly assumes that in WRONoCs, each initiator sends one bit at a time to a target. However, the communication parallelism can be increased by assigning multiple wavelengths to each path, which requires a systematic analysis of the physical parameters of silicon microring resonators and the wavelength usage among different paths. This work proposes a mathematical modeling method to maximize the communication parallelism of a given WRONoC topology, which provides a foundation for exploiting the bandwidth potential of WRONoCs. |
Title | Concurrency in DD-based Quantum Circuit Simulation |
Author | *Stefan Hillmich, Alwin Zulehner, Robert Wille (Johannes Kepler University Linz Institute for Integrated Circuits, Austria) |
Page | pp. 115 - 120 |
Keyword | quantum computing, decision diagrams, design automation |
Abstract | Despite recent progress in physical implementations of quantum computers, a significant amount of research still depends on simulating quantum computations on classical computers. Here, most state-of-the-art simulators rely on array-based approaches, which are perfectly suited for acceleration through concurrency using multi- or many-core processors. However, those methods have exponential memory complexity and, hence, become infeasible if the considered quantum circuits are too large. To address this drawback, complementary approaches based on decision diagrams (called DD-based simulation) have been proposed, which provide more compact representations in many cases. While this makes it possible to simulate quantum circuits that could not be simulated before, it is unclear whether DD-based simulation also allows for similar acceleration through concurrency as array-based approaches. In this work, we investigate this issue. The resulting findings provide a better understanding of when DD-based simulation can be accelerated through concurrent execution of sub-tasks and when it cannot. |
Title | Approximation of Quantum States Using Decision Diagrams |
Author | Alwin Zulehner, Stefan Hillmich (Johannes Kepler University Linz Institute for Integrated Circuits, Austria), Igor L. Markov (University of Michigan, USA), *Robert Wille (Johannes Kepler University Linz Institute for Integrated Circuits, Austria) |
Page | pp. 121 - 126 |
Keyword | quantum computing, decision diagrams, design automation |
Abstract | The computational power of quantum computers poses major challenges to new design tools since representing pure quantum states typically requires exponentially large memory. As shown previously, decision diagrams can reduce these memory requirements by exploiting redundancies. In this work, we demonstrate further reductions by allowing for small inaccuracies in the quantum state representation. Such inaccuracies are legitimate since quantum computers themselves experience gate and measurement errors and since quantum algorithms are somewhat resistant to errors (even without error correction). We develop four dedicated schemes that exploit these observations and effectively approximate quantum states represented by decision diagrams. We empirically show that the proposed schemes reduce the size of decision diagrams by up to several orders of magnitude while controlling the fidelity of approximate quantum state representations. |
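To make fidelity-controlled approximation concrete, here is a small state-vector sketch of the generic idea of dropping the smallest amplitudes within a fidelity budget and renormalizing. The paper's four schemes operate on decision diagrams, not explicit vectors; the example state and the fidelity threshold below are made up for illustration.

```python
# Minimal sketch of fidelity-controlled amplitude truncation on an explicit
# state vector (an illustration only; the paper works on decision diagrams).
import numpy as np

def approximate_state(psi, min_fidelity=0.99):
    order = np.argsort(np.abs(psi))             # smallest amplitudes first
    removed, approx = 0.0, psi.copy()
    for idx in order:
        if removed + np.abs(psi[idx]) ** 2 > 1.0 - min_fidelity:
            break                               # fidelity budget exhausted
        removed += np.abs(psi[idx]) ** 2
        approx[idx] = 0.0
    return approx / np.linalg.norm(approx)      # renormalise

# A toy state with 4 dominant and 12 tiny amplitudes.
psi = np.concatenate([np.full(4, 1.0), np.full(12, 0.05)]).astype(complex)
psi /= np.linalg.norm(psi)
phi = approximate_state(psi)
print("fidelity:", np.abs(np.vdot(psi, phi)) ** 2, "nonzeros:", np.count_nonzero(phi))
```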
Title | Improved DD-based Equivalence Checking of Quantum Circuits |
Author | *Lukas Burgholzer, Robert Wille (Johannes Kepler University Linz, Austria) |
Page | pp. 127 - 132 |
Keyword | quantum computing, equivalence checking, decision diagrams, reversible circuits |
Abstract | Quantum computing is gaining considerable momentum through the recent progress in physical realizations of quantum computers. This has led to rather sophisticated design flows in which the originally specified quantum functionality is compiled through different abstractions. This increasingly raises the question whether the resulting quantum circuits indeed realize the originally intended function. Accordingly, efficient methods for equivalence checking are gaining importance. However, existing solutions still suffer from significant shortcomings such as their exponential worst-case performance and an increased effort to obtain counterexamples in case of non-equivalence. In this work, we propose an improved DD-based equivalence checking approach which addresses these shortcomings. To this end, we utilize decision diagrams and exploit the fact that quantum operations are inherently reversible, allowing for dedicated strategies that keep the overhead moderate in many cases. Experimental results confirm that the proposed strategies lead to substantial speed-ups, allowing equivalence checking of quantum circuits to be performed factors or even orders of magnitude faster than the state of the art. |
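For intuition only, the tiny matrix-based sketch below illustrates the reversibility argument the abstract alludes to: two circuits are equivalent exactly when composing one with the inverse of the other yields the identity up to a global phase. The paper itself operates on decision diagrams rather than explicit matrices and handles multi-qubit circuits; the single-qubit gates here are just their standard definitions.

```python
# Minimal matrix-based sketch of the underlying equivalence check
# (the paper works on decision diagrams, not explicit unitaries).
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
S = np.diag([1, 1j])
T = np.diag([1, np.exp(1j * np.pi / 4)])

def circuit_unitary(gates):
    U = np.eye(2, dtype=complex)
    for g in gates:
        U = g @ U                      # later gates are applied on the left
    return U

def equivalent(gates1, gates2, tol=1e-9):
    # G followed by the inverse of G' must be the identity up to a phase.
    M = circuit_unitary(gates2).conj().T @ circuit_unitary(gates1)
    phase = M[0, 0] / abs(M[0, 0])     # strip the global phase
    return np.allclose(M / phase, np.eye(2), atol=tol)

print(equivalent([T, T], [S]))         # True: T followed by T equals S
print(equivalent([H, T], [T, H]))      # False: H and T do not commute
```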
Title | Equivalent Capacitance Guided Dummy Fill Insertion for Timing and Manufacturability |
Author | *Sheng-Jung Yu, Chen-Chien Kao, Chia-Han Huang, Iris Hui-Ru Jiang (National Taiwan University, Taiwan) |
Page | pp. 133 - 138 |
Keyword | CMP, Dummy Fill, Equivalent Capacitance |
Abstract | To improve manufacturability, dummy fill insertion is widely adopted for reducing the thickness variation after chemical mechanical polishing. However, inserted metal fills induce significant coupling to nearby signal nets, thus possibly incurring timing degradation. Existing timing-aware fill insertion strategies focus on optimizing induced coupling capacitance instead of resultant equivalent capacitance. Therefore, the impact on timing cannot be fully captured. In contrast, in this paper, we analyze equivalent capacitance friendly regions for dummy fills. The analysis can wisely guide dummy fill insertion to prevent unwanted and unnecessary increase in the resultant equivalent capacitance of timing critical nets. Experimental results based on the ICCAD 2018 CAD Contest benchmark suite show that our solution outperforms the contest winning teams and state-of-the-art work. Moreover, our analysis results are highly correlated to actual equivalent capacitance values and indeed provide accurate guidance for timing-aware dummy fill insertion. |
Title | Synthesis of Hardware Performance Monitoring and Prediction Flow Adapting to Near-Threshold Computing and Advanced Process Nodes |
Author | *Jeongwoo Heo (Seoul National University, Republic of Korea), Kwangok Jeong (Samsung Electronics Co., Ltd., Republic of Korea), Taewhan Kim, Kyumyung Choi (Seoul National University, Republic of Korea) |
Page | pp. 139 - 144 |
Keyword | monitoring, performance, variation, prediction |
Abstract | An elaborate hardware performance monitor (HPM) has become increasingly important for handling huge performance variation of near-threshold computing and recent process technologies. In this paper, we propose a new approach to the problem of predicting critical path delays (CPDs) using HPM. Precisely, for a target circuit or system, we formulate the problem of finding an efficient combination of ring oscillators (ROs) for accurate prediction of CPDs on the circuit as a mixed integer second-order cone programming and propose a method of minimizing the total number of ROs for a given pessimism level of prediction. Then, we propose a prediction flow of CPDs through statistical estimation of process parameters from measurements of the customized HPM and machine learning based delay mapping from the estimation. For a set of benchmark circuits tested using 28nm PDK and 0.6V operation, it is shown that our approach is very effective, reducing the pessimism of CPDs and minimum supply voltages by 6.7~52.9% and 20.6~50.8% over those of conventional approaches, respectively. |
Title | Enhancing Generalization of Wafer Defect Detection by Data Discrepancy-aware Preprocessing and Contrast-varied Augmentation |
Author | Chaofei Yang, Hai Li, *Yiran Chen (Duke University, USA), Jiang Hu (Texas A&M University, USA) |
Page | pp. 145 - 150 |
Keyword | Wafer, Defect, CNN, Preprocessing, Augmentation |
Abstract | Wafer inspection locates defects at early fabrication stages and traditionally focuses on pixel-level defects. However, there are very few solutions that can effectively detect large-scale defects. In this work, we leverage Convolutional Neural Networks (CNNs) to automate the wafer inspection process and propose several techniques to preprocess and augment wafer images for enhancing our model's generalization on unseen wafers (e.g., from other fabs). Cross-fab experimental results of both wafer-level and pixel-level detections show that the F1 score increases from 0.09 to 0.77 and the Precision-Recall area under curve (PR AUC) increases from 0.03 to 0.62 using our proposed method. |
Title | Exploring Graphical Models with Bayesian Learning and MCMC for Failure Diagnosis |
Author | *Hongfei Wang, Wenjie Cai, Jianwen Li, Kun He (Huazhong University of Science and Technology, China) |
Page | pp. 151 - 156 |
Keyword | Graphical Models, Diagnosis, machine learning, Bayesian methods, test |
Abstract | Graphical models are powerful machine learning techniques for data analytics. Being capable of statistical reasoning and probabilistic inference, graphical models have the advantages of encoding prior knowledge into the learning procedure and producing explainable models that can be understood and effectively tuned. In this work, we describe our exploration of using graphical models to improve circuit diagnosis results. A statistical framework is proposed for this aim, which builds Bayesian inference models using directed chain graphs and structural learning models using undirected tree graphs. As a generative model, the framework integrates a Markov chain Monte Carlo (MCMC) sampling algorithm to evaluate the quality of diagnostic results. It exploits maximum likelihood to estimate the underlying defect types, which can be informative for possible follow-up failure analysis. Five circuit examples demonstrate that the proposed framework achieves the same or better results compared to a state-of-the-art work. Moreover, our method also shows opportunities for dealing with missing features and locating root causes. |
Title | Mitigating Adversarial Attacks for Deep Neural Networks by Input Deformation and Augmentation |
Author | *Pengfei Qiu (Tsinghua University, China), Qian Wang (University of Maryland, College Park, USA), Dongsheng Wang, Yongqiang Lyu (Tsinghua University, China), Zhaojun Lu, Gang Qu (University of Maryland, College Park, USA) |
Page | pp. 157 - 162 |
Keyword | Deep Neural Network, Adversarial Attack, Input Deformation, Data Augmentation, Majority Voting |
Abstract | Typical Deep Neural Networks (DNN) are susceptible to adversarial attacks that add malicious perturbations to input to mislead the DNN model. Most of the state-of-the-art countermeasures concentrate on the defensive distillation or parameter re-training, which require prior knowledge of the target DNN and/or the attacking methods and hence greatly limit their generality and usability. In this paper, we propose to defend against adversarial attacks by utilizing the input deformation and augmentation techniques that are currently widely utilized to enlarge the dataset during DNN's training phase. This is based on the observation that certain input deformation and augmentation methods will have little or no impact on DNN model's accuracy, but the adversarial attacks will fail when the maliciously induced perturbations are randomly deformed. We also use the ensemble of decisions to further improve DNN model's accuracy and the effectiveness of defending various attacks. Our proposed mitigation method is model independent (i.e. it does not require additional training, parameter fine-tuning, or any structure modifications of the target DNN model) and attack independent (i.e., it does not require any knowledge of the adversarial attacks). So it has excellent generality and usability. We conduct experiments on standard CIFAR-10 dataset and three representative adversarial attacks: Fast Gradient Sign Method, Carlini and Wagner, and Jacobian-based Saliency Map Attack. Results show that the average success rate of the attacks can be reduced from 96.5% to 28.7% while the DNN model accuracy is improved by about 2%. |
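The sketch below illustrates the general defence pattern described in the abstract: classify several randomly deformed copies of the input and take a majority vote. The specific deformations (random shift and horizontal flip), the number of views, and the toy stand-in classifier are assumptions for the example, not the paper's exact choices.

```python
# Minimal sketch of deformation + majority-vote inference (illustration only;
# deformations are simplified and `toy_model` is a made-up stand-in classifier).
import numpy as np

def random_deform(img, rng, max_shift=3):
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(img, (dy, dx), axis=(0, 1))        # random translation
    if rng.random() < 0.5:
        out = out[:, ::-1]                           # random horizontal flip
    return out

def vote_predict(model, img, n_views=9, seed=0):
    rng = np.random.default_rng(seed)
    views = np.stack([random_deform(img, rng) for _ in range(n_views)])
    labels = model(views).argmax(axis=1)             # one label per deformed view
    return np.bincount(labels).argmax()              # majority vote

# Toy stand-in model: "classifies" by comparing mean intensities of two halves.
toy_model = lambda x: np.stack([x[:, :, :16].mean(axis=(1, 2)),
                                x[:, :, 16:].mean(axis=(1, 2))], axis=1)
print(vote_predict(toy_model, np.random.rand(32, 32)))
```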
Title | When Single Event Upset Meets Deep Neural Networks: Observations, Explorations, and Remedies |
Author | Zheyu Yan (Zhejiang University, China), Yiyu Shi (University of Notre Dame, USA), Wang Liao, Masanori Hashimoto (Osaka University, Japan), Xichuan Zhou (Chongqing University, China), *Cheng Zhuo (Zhejiang University, China) |
Page | pp. 163 - 168 |
Keyword | DNN, SEU, ECC |
Abstract | For Deep Neural Networks (DNNs) used in security-sensitive systems, we investigate, from a hardware perspective, the impact of Single Event Upset (SEU)-induced parameter perturbation (SIPP). We define the fault models of SEUs and provide a robustness measure for networks. We then analytically explore the impact of SIPP for different SEU patterns and networks. Finally, we propose remedy solutions to protect DNNs from SIPPs, mitigating accuracy degradation from 28% to 0.27% for ResNet with 25% SRAM area overhead. |
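To make the fault model tangible, here is a small numpy sketch of an SEU on a stored weight: a single bit of a float32 value is flipped by reinterpreting its bytes as an unsigned integer. This is a generic software-level model for illustration, not the authors' framework; the chosen index and bit position are arbitrary.

```python
# Minimal sketch of a single-event-upset fault model on stored float32 weights
# (illustration only, not the paper's evaluation framework).
import numpy as np

def flip_bit(weights, flat_index, bit):
    w = np.array(weights, dtype=np.float32)      # fresh copy of the weights
    view = w.view(np.uint32)                     # same bytes, integer view
    view.flat[flat_index] ^= np.uint32(1 << bit) # flip exactly one bit
    return w

w = np.array([[0.5, -1.25], [2.0, 0.1]], dtype=np.float32)
faulty = flip_bit(w, flat_index=2, bit=30)       # hit a high exponent bit of w[1, 0]
print(w[1, 0], "->", faulty[1, 0])               # a single upset can change the scale a lot
```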
Title | Concurrent Monitoring of Operational Health in Neural Networks Through Balanced Output Partitions |
Author | Elbruz Ozen, *Alex Orailoglu (University of California, San Diego, USA) |
Page | pp. 169 - 174 |
Keyword | fault-tolerance, deep neural networks, autonomous driving |
Abstract | The abundant usage of deep neural networks in safety-critical domains such as autonomous driving raises concerns regarding the impact of hardware-level faults on deep neural network computations. As a failure can prove to be disastrous, low-cost safety mechanisms are needed to check the integrity of the deep neural network computations. We embed safety checksums into deep neural networks by introducing a custom regularization term in the network training. We partition the outputs of each network layer into two groups and guide the network to balance the summation of these groups through an additional penalty term in the cost function. The proposed approach delivers twin benefits. While the embedded checksums deliver low-cost detection of computation errors upon violations of the trained equilibrium during network inference, the regularization term enables the network to generalize better during training by preventing overfitting, thus leading to significantly higher network accuracy. |
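Here is a minimal sketch of the checksum idea: a layer's outputs are split into two groups whose sums are trained to match, so a large imbalance at inference time flags a computation error. The half/half split, the penalty weight, and the runtime tolerance are simplifying assumptions for the example, not the paper's exact formulation.

```python
# Minimal sketch of a balanced-output checksum regulariser and runtime check
# (simplified relative to the paper; lam and tol are illustrative values).
import torch

def checksum_penalty(activations):            # activations: (batch, n_outputs)
    half = activations.shape[1] // 2
    diff = activations[:, :half].sum(dim=1) - activations[:, half:2 * half].sum(dim=1)
    return (diff ** 2).mean()                 # push the two group sums together

def total_loss(logits, targets, activations, lam=0.1):
    return torch.nn.functional.cross_entropy(logits, targets) + lam * checksum_penalty(activations)

def integrity_check(activations, tol=1.0):    # used during inference
    half = activations.shape[1] // 2
    diff = activations[:, :half].sum(dim=1) - activations[:, half:2 * half].sum(dim=1)
    return (diff.abs() < tol).all()           # False => possible hardware fault

acts = torch.randn(4, 10)
print(checksum_penalty(acts).item(), integrity_check(acts).item())
```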
Title | PARC: A Processing-in-CAM Architecture for Genomic Long Read Pairwise Alignment using ReRAM |
Author | *Fan Chen, Linghao Song, Hai "Helen" Li, Yiran Chen (Duke University, USA) |
Page | pp. 175 - 180 |
Keyword | memristor, DNA Alignment, Processing-in-Memory |
Abstract | Technological advances in long read sequences have greatly facilitated the development of genomics. However, managing and analyzing the raw genomic data that outpaces Moore’s Law requires extremely high computational efficiency. On the one hand, existing software solutions can take hundreds of CPU hours to complete human genome alignment. On the other hand, the recently proposed hardware platforms achieve low processing throughput with significant overhead. In this paper, we propose PARC, an Processing-in-Memory architecture for long read pair-wise alignment leveraging emerging resistive CAM (content-addressable memory) to accelerate the bottleneck chaining step in DNA alignment. Chaining takes 2-tupleanchorsas inputs and identifies a set of correlated anchors as potential alignment candidates. Unlike traditional main memory which organizes relational data structure in a linear address space, PARC stores tuples in two neighboring crossbar arrays with shared row decoder such that column-wise in-memory computational operations and row-wise memory accesses can be performed in-situ in a symmetric crossbar structure. Compared to both software tools and state-of-the-art accelerators, PARC shows significant improvement in alignment throughput and energy efficiency, thanks to the in-site computation capability and optimized data mapping. |
Title | RRAM-VAC: A Variability-Aware Controller for RRAM-based Memory Architectures |
Author | *Shikhar Tuli, Marco Rios, Alexandre Levisse, David Atienza (Swiss Federal Institute of Technology (EPFL), Switzerland) |
Page | pp. 181 - 186 |
Keyword | RRAM, controller, variability, edge computing, WBSN |
Abstract | The growing need for connected, smart and energy-efficient devices requires them to provide both ultra-low standby power and relatively high computing capabilities when awoken. In this context, emerging resistive memory technologies (RRAM) appear as a promising solution as they enable cheap fine-grain technology co-integration with CMOS, fast switching and non-volatile storage. However, RRAM technologies suffer from fundamental flaws such as a strong device-to-device and cycle-to-cycle variability which is worsened by aging, forcing designers to consider worst-case design conditions. In this work, we propose, for the first time, a circuit that can take advantage of recently published Write Termination (WT) circuits from both the energy and performance points of view. The proposed RRAM Variability Aware Controller (RRAM-VAC) stores and then coalesces the write requests from the processor before triggering the actual write process. By doing so, it averages the RRAM variability and enables the system to run at the mean of the memory programming time distribution rather than the worst-case tail. We explore the design space of the proposed solution for various RRAM variability specifications, benchmark the effect of the proposed memory controller with real application memory traces and show (for the considered RRAM technology specifications) 44% to 50% performance improvement and 10% to 85% energy gains depending on the application memory access patterns. |
Title | Defects Mitigation in Resistive Crossbars for Analog Vector/Matrix Multiplication |
Author | *Fan Zhang, Miao Hu (Binghamton University, USA) |
Page | pp. 187 - 192 |
Keyword | Resistive crossbar, memristor defect, matrix multiplication |
Abstract | With storage and computation happening at the same place, computing in resistive crossbars minimizes data movement and avoids the memory bottleneck issue. It leads to ultra-high energy efficiency for data-intensive applications. However, defects in crossbars severely affect computing accuracy. Existing solutions include re-training with defects and redundant designs, but they have limitations in practical implementations. In this work, we introduce row shuffling and output compensation to mitigate defects without re-training or redundant resistive crossbars. We also analyze the coupling effects of defects and circuit parasitics. Moreover, we study different combinations of methods to achieve the best trade-off between cost and performance. Our proposed methods can rescue up to 10% defects in the ResNet-20 application without performance degradation. |
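The sketch below illustrates the two ideas in a highly idealized setting: rows of the weight matrix are remapped so that known stuck-at-zero cells coincide with small weights, and the lost contribution of the defective cells is added back to the column outputs. The defect model, the 10% defect rate, and the exact digital compensation are assumptions made for the example; the paper's analysis additionally covers circuit parasitics, which are ignored here.

```python
# Minimal sketch of row shuffling and output compensation for a crossbar
# computing y = x @ W (idealized stuck-at-zero defect model, illustration only).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                  # logical weights (rows x columns)
stuck = rng.random(W.shape) < 0.1            # fixed physical cells stuck at zero

# Row shuffling: place logical rows with small weights onto the physical rows
# that contain the most stuck cells.
logical = np.argsort(np.abs(W).sum(axis=1))          # smallest-weight rows first
physical = np.argsort(-stuck.sum(axis=1))            # most-defective rows first
perm = np.empty(W.shape[0], dtype=int)
perm[physical] = logical                             # physical row i holds W[perm[i]]
Wp = W[perm]                                         # weights as programmed on chip

x = rng.random(W.shape[0])
ideal = x @ W
faulty_plain = x @ np.where(stuck, 0.0, W)           # no mitigation
faulty_shuffled = x[perm] @ np.where(stuck, 0.0, Wp) # row shuffling only
compensation = x[perm] @ np.where(stuck, Wp, 0.0)    # add back the lost terms
print("no mitigation:", np.abs(ideal - faulty_plain).sum(),
      "shuffled:", np.abs(ideal - faulty_shuffled).sum(),
      "shuffled + compensated:", np.abs(ideal - (faulty_shuffled + compensation)).sum())
```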
Title | S3DET: Detecting System Symmetry Constraints for Analog Circuits with Graph Similarity |
Author | Mingjie Liu, Wuxi Li, Keren Zhu, Biying Xu, *Yibo Lin, Linxiao Shen, Xiyuan Tang, Nan Sun, David Z. Pan (The University of Texas at Austin, USA) |
Page | pp. 193 - 198 |
Keyword | Analog, System, Symmetry, Graph, Similarity |
Abstract | Symmetry and matching between critical building blocks have a significant impact on analog system performance. However, there is limited research on generating system-level symmetry constraints. In this paper, we propose a novel method of detecting system symmetry constraints for analog circuits with graph similarity. Leveraging spectral graph analysis and graph centrality, the proposed algorithm can be applied to circuits and systems of large scale and different architectures. To the best of our knowledge, this is the first work on detecting system-level symmetry constraints for analog and mixed-signal (AMS) circuits. Experimental results show that the proposed method achieves a high accuracy of 88.3% with a low false alarm rate of less than 1.1% in large-scale AMS designs. |
Title | Establishing Reachset Conformance for the Formal Analysis of Analog Circuits |
Author | *Niklas Kochdumper (Technical University of Munich, Germany), Ahmad Tarraf (Goethe University Frankfurt, Germany), Malgorzata Rechmal, Markus Olbrich (Leibniz University Hannover, Germany), Lars Hedrich (Goethe University Frankfurt, Germany), Matthias Althoff (Technical University of Munich, Germany) |
Page | pp. 199 - 204 |
Keyword | reachset conformance, hybrid systems, analog circuits, linear abstraction |
Abstract | We present the first work on the automated generation of reachset conformant models for analog circuits. Our approach applies reachset conformant synthesis to add non-determinism to piecewise-linear circuit models so that they enclose all recorded behaviors of the real system. To achieve this, we present a novel technique to compute the required non-determinism for the piecewise-linear models. The effectiveness of our approach is demonstrated on a real analog circuit. Since the resulting models enclose all measurements, they can be used for formal verification. |
Title | Contention Minimized Bypassing in SMART NoC |
Author | *Peng Chen (Nanyang Technological University/Chongqing University, Singapore), Weichen Liu (Nanyang Technological University, Singapore), Mengquan Li (Nanyang Technological University/Chongqing University, Singapore), Lei Yang (University of Pittsburgh, USA), Nan Guan (The Hong Kong Polytechnic University, Hong Kong) |
Page | pp. 205 - 210 |
Keyword | SMART NoC, Routing Strategy, Contention Minimization |
Abstract | SMART, a recently proposed dynamically reconfigurable NoC, enables single-cycle long-distance communication by building single-bypass paths. However, such a single-cycle single-bypass path will be broken when contention occurs. Thus, lower-priority packets will be buffered at intermediate routers with blocking latency from higher-priority packets, and extra router-stage latency to rebuild remaining path, reducing the bypassing benefits that SMART offers. In this paper, we for the first time propose an effective routing strategy to achieve nearly contention-free bypassing in SMART NoC. Specifically, we identify two different routes for communication pairs: direct route, with which data can reach the destination in a single bypass; and indirect route, with which data can reach the destination in two bypasses via an intermediate router. If a direct route is not found, we would alternatively resort to an indirect route in advance to eliminate the blocking latency, at the cost of only one router-stage latency. Compared with the current routing, our new approach can effectively isolate conflicting communication pairs, greatly balance the traffic loads and fully utilize bypass paths. Experiments show that our approach makes 22.6% performance improvement on average in terms of communication latency. |
Title | FTT-NAS: Discovering Fault-Tolerant Neural Architecture |
Author | *Wenshuo Li, Xuefei Ning, Guangjun Ge (Tsinghua University, China), Xiaoming Chen (State Key Laboratory of Computer Architecture, Institute of Computing Technology, China), Yu Wang, Huazhong Yang (Tsinghua University, China) |
Page | pp. 211 - 216 |
Keyword | fault tolerance, neural architecture search |
Abstract | With the fast evolution of deep-learning-specific embedded computing systems, applications powered by deep learning are moving from the cloud to the edge. When deploying neural networks (NNs) onto edge devices under complex environments, there are various types of possible faults: soft errors caused by atmospheric neutrons and radioactive impurities, voltage instability, aging, temperature variations, and malicious attackers. Thus the safety risk of deploying neural networks on edge computing devices in safety-critical applications is now drawing much attention. In this paper, we implement the random bit-flip, Gaussian, and salt-and-pepper fault models and establish a multi-objective fault-tolerant neural architecture search framework. On top of the NAS framework, we propose Fault-Tolerant Neural Architecture Search (FT-NAS) to automatically discover convolutional neural network (CNN) architectures that are reliable under the various faults found in today's edge devices. We then incorporate fault-tolerant training (FTT) in the search process to achieve better results, which we call FTT-NAS. Experiments show that the discovered architectures FT-NAS-Net and FTT-NAS-Net outperform other hand-designed baseline architectures (58.1%/86.6% vs. 10.0%/52.2%), with comparable FLOPs and fewer parameters. What is more, architectures trained under a single fault model can also defend against other faults. By inspecting the discovered architectures, we find that redundant connections are learned to protect the sensitive paths. This insight can guide future fault-tolerant neural architecture design, and we verify it with a modification of ResNet-20, called ResNet-M. |
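As a small illustration of two of the fault models named in the abstract, the sketch below applies Gaussian and salt-and-pepper corruption to a feature map, as might be done when injecting faults during fault-tolerant training. The noise scale, fault ratio, and clamp values are illustrative assumptions, not the paper's settings (a bit-flip model would be handled analogously on stored values).

```python
# Minimal sketch of Gaussian and salt-and-pepper fault injection on a feature
# map (values such as sigma and ratio are illustrative only).
import torch

def gaussian_fault(x, sigma=0.1):
    return x + sigma * torch.randn_like(x)

def salt_and_pepper_fault(x, ratio=0.01, low=0.0, high=1.0):
    mask = torch.rand_like(x)
    out = x.clone()
    out[mask < ratio / 2] = low                  # "pepper" cells
    out[mask > 1 - ratio / 2] = high             # "salt" cells
    return out

feat = torch.rand(1, 8, 4, 4)                    # a toy feature map
for inject in (gaussian_fault, salt_and_pepper_fault):
    corrupted = inject(feat)
    print(inject.__name__, (corrupted - feat).abs().mean().item())
```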
Title | The Notion of Cross Coverage in AMS Design Verification |
Author | Sayandeep Sanyal, Aritra Hazra, *Pallab Dasgupta (Indian Institute of Technology Kharagpur, India), Scott Morrison (Texas Instruments, USA), Sudhakar Surendran, Lakshmanan Balasubramanian (Texas Instruments (India) Pvt. Ltd., India) |
Page | pp. 217 - 222 |
Keyword | Cross Coverage, AMS Coverage |
Abstract | Coverage monitoring is fundamental to design verification. Coverage artifacts are well developed for digital integrated circuits, where they aim to cover the discrete state space and logical behaviors of the design. Analog designers are similarly concerned with the operating regions of the design and its response to an infinite and dense input space. Analog variables can influence each other in far more complex ways than digital variables; consequently, the notion of cross coverage, introduced in the analog context for the first time in this paper, is of high importance in analog design verification. This paper presents the formal syntax and semantics of analog cross coverage artifacts, the methods for evaluating them using our tool kit, and, most importantly, the insights that can be gained from such cross coverage analysis. |
PDF file |
Title | Automated Test Generation for Activation of Assertions in RTL Models |
Author | *Yangdi Lyu, Prabhat Mishra (University of Florida, USA) |
Page | pp. 223 - 228 |
Keyword | Concolic Testing, RTL model, Assertions, Test Generation, Validation |
Abstract | A major challenge in assertion-based validation is how to activate the assertions to ensure that they are valid. While existing test generation using model checking is promising, it cannot generate directed tests for large designs due to state space explosion. We propose an automated and scalable mechanism to generate directed tests using a combination of symbolic execution and concrete simulation of RTL models. Experimental results show that the directed tests are able to activate assertions non-vacuously. |
PDF file |
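As a loose illustration of combining symbolic reasoning with concrete simulation to activate an assertion, the toy sketch below uses the Z3 SMT solver to find inputs satisfying a branch that guards an assertion. The guard condition and variable names are invented for illustration; the paper's tool operates on RTL models and is not reproduced here.

```python
# Minimal sketch (not the paper's tool): solve for an input assignment that
# drives a targeted branch so the guarded assertion is exercised non-vacuously.
from z3 import BitVec, Solver, sat

a = BitVec("a", 8)          # symbolic counterpart of an 8-bit input port (assumed)
b = BitVec("b", 8)

s = Solver()
# Hypothetical path condition collected from a concrete run that missed the
# assertion: suppose its guard requires (a > b) and (a - b == 4).
s.add(a > b, a - b == 4)

if s.check() == sat:
    m = s.model()
    print("directed test:", m[a], m[b])   # feed back into concrete simulation
```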
Wednesday, January 15, 2020 |
Title | (Keynote Address) Design Automation for Customizable Computing |
Author | Jason Cong (University of California, Los Angeles, USA) |
Abstract | With the large-scale deployment of FPGAs in both private and public clouds in the past few years, customizable computing is transitioning from advanced research into mainstream computing. Customized accelerators have demonstrated significant performance and energy-efficiency benefits for a wide range of applications. However, efficient design and implementation of various accelerators on FPGAs remains a formidable barrier to many software programmers, despite recent advances in high-level synthesis. This calls for a community-wide effort to “democratize customizable computing”. In this talk, I shall first discuss various research opportunities associated with design automation for customizable computing. Then, I shall highlight our recent progress on source-code-level transformation and optimization for customizable computing, including support of high-level domain-specific languages (DSLs) for deep learning (e.g. Caffe), image processing (e.g. Halide), and big-data processing (e.g. Spark), and support of automated compilation to customized microarchitecture templates, such as systolic arrays, stencils, and CPPs (composable parallel and pipelined). |
Title | (Designers' Forum) The Golden Age of EDA — Clock Design, Machine Learning and A-I Collaboration |
Author | Zhuo Li (Cadence Design Systems, USA) |
Abstract | At the 2018 Design Automation Conference, Dr. David Patterson said it is a new golden age for computer architecture. With the boom of electronic designs and systems for applications such as machine learning and AI, autonomous systems, 5G, cloud computing, and embedded systems, more domain-specific architectures are needed. At the same time, the advancement of technology nodes takes longer and requires much more investment. It is a new golden age for the EDA industry, which serves the increasing requirements of low power and high performance for both traditional and new architectures, and the pressure of time to market and design productivity. In this talk, I will focus on new trends and challenges in clock design and synthesis, front-to-back synthesis and optimization integration, and machine learning in EDA, as well as some design challenges in ML/AI chips. Finally, I will briefly discuss academic and industry collaboration during this new age. |
Title | (Designers' Forum) New Trend on High-Level Synthesis and Customized Compiler for Edge Intelligence |
Author | Deming Chen (UIUC, USA) |
Abstract | High-level synthesis (HLS) has gained significant traction recently for both FPGA and ASIC designs. In particular, as FPGAs move into cloud computing and develop into a commodity product, HLS becomes essential to making FPGAs more accessible and programmable. Meanwhile, AI computing on the edge also represents an important future trend due to its unique advantages, including faster speed, cheaper cost, and a higher level of privacy protection. Treating machine learning as a special domain, domain-specific HLS and customized compilers that map machine-learning algorithms to edge devices are therefore important future trends as well. In this talk, we will discuss some great future opportunities in these areas and also present some challenges we need to overcome in order to facilitate the solid growth of AI solutions for various smart applications. |
Title | (Designers' Forum) Data-driven Instant Model Synthesis Enhanced by Learning Algorithms For DTCO Enablement In the FinFET Era |
Author | Yanfeng Li (Platform Design Automation, Inc., China) |
Abstract | Faster and more accurate variation characterization and instant modeling of semiconductor devices and circuits are in great demand as process technologies scale down to the FinFET era; they are also crucial inputs for the Design Technology Co-Optimization (DTCO) methodology to work. Traditional methods with intensive data testing are extremely costly, and SPICE model generation is carried out manually by engineers, which often takes weeks and thus becomes the bottleneck of DTCO in practice. In this paper, we propose, for the first time, a complete ecosystem with super-fast device characterization capability and instant model generation enabled by learning algorithms. |
Title | Machine Learning Based Online Full-Chip Heatmap Estimation |
Author | Sheriff Sadiqbatcha, Yue Zhao, Jinwei Zhang (University of California, Riverside, USA), Hussam Amrouch, Joerg Henkel (Karlsruhe Institute of Technology, Germany), *Sheldon X.-D. Tan (University of California, Riverside, USA) |
Page | pp. 229 - 234 |
Keyword | Thermal Model, RNN, Infrared Imaging, Online Estimation, deep learning |
Abstract | Runtime power and thermal control is crucial in any modern processor. However, these control schemes require accurate real-time temperature information, ideally of the entire die area, in order to be effective. On-chip temperature sensors alone cannot provide full-chip temperature information, since the number of sensors that are typically available is very limited due to their high area and power overheads. Furthermore, as we will demonstrate, the peak locations within hot-spots are not stationary and are highly workload-dependent, making it difficult to rely on fixed temperature sensors alone. Therefore, we propose a novel machine-learning approach to real-time estimation of full-chip transient heatmaps for commercial processors. The model derived in this work supplements the temperature data sensed from the existing on-chip sensors, allowing for the development of more robust runtime power and thermal control schemes that can take advantage of additional thermal information that is otherwise not available. The new approach involves offline acquisition of accurate spatial and temporal heatmaps using an infrared thermal imaging setup while nominal working conditions are maintained on the chip. To build the dynamic thermal model, we apply Long Short-Term Memory (LSTM) neural networks with system-level variables such as chip frequency, instruction counts, and other performance metrics as inputs. To reduce the dimensionality of the model, a 2D spatial discrete cosine transformation (DCT) is first performed on the heatmaps so that they can be expressed with just their dominant DCT frequencies. Our study shows that only 6x6 DCT coefficients are required to maintain sufficient accuracy across a variety of workloads. Experimental results show that the proposed approach can estimate full-chip heatmaps with less than 1.4°C root-mean-square error and takes only ~19 ms per inference, which suits real-time use well. |
PDF file |
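The dimensionality-reduction step described above (keeping only a 6x6 block of low-frequency 2D DCT coefficients per heatmap) can be sketched in a few lines of Python. The synthetic hot-spot heatmap and its size are placeholders, not the paper's infrared measurements.

```python
# Sketch of 2D-DCT dimensionality reduction: keep the 6x6 dominant coefficients
# of a heatmap and reconstruct an approximation from them for inspection.
import numpy as np
from scipy.fft import dctn, idctn

y, x = np.mgrid[0:64, 0:64]
heatmap = 40 + 20 * np.exp(-((x - 20) ** 2 + (y - 44) ** 2) / 300.0)  # synthetic hot-spot

coeffs = dctn(heatmap, norm="ortho")      # 2D spatial DCT
k = 6
reduced = coeffs[:k, :k]                  # 6x6 dominant coefficients -> model target

padded = np.zeros_like(coeffs)            # zero-pad the kept block and invert
padded[:k, :k] = reduced
approx = idctn(padded, norm="ortho")
print("RMSE of 6x6 approximation:", np.sqrt(np.mean((heatmap - approx) ** 2)))
```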
Title | A Reconfigurable Approximate Multiplier for Quantized CNN Applications |
Author | *Chuliang Guo, Li Zhang, Xian Zhou (Zhejiang University, China), Weikang Qian (Shanghai Jiao Tong University, China), Cheng Zhuo (Zhejiang University, China) |
Page | pp. 235 - 240 |
Keyword | Approximate Computing, Approximate Multiplier, Quantization, Neural Network, Energy Efficiency |
Abstract | Quantized CNNs, featuring different bit-widths at different layers, have been widely deployed in mobile and embedded applications. The implementation of a quantized CNN may use multiple multipliers at different precisions with limited resource reuse, or one multiplier at higher precision than needed, causing area overhead. It is therefore highly desirable to design a multiplier that accounts for the characteristics of quantized CNNs to ensure both flexibility and energy efficiency. In this work, we present a reconfigurable approximate multiplier to support multiplications at various precisions, i.e., bit-widths. Moreover, unlike prior works assuming a uniform distribution with bit-wise independence, a quantized CNN may have a centralized weight distribution and hence follow a Gaussian-like distribution with correlated adjacent bits. Thus, a new block-based approximate adder is also proposed as part of the multiplier to ensure energy-efficient operation with awareness of bit-wise correlation. Our experimental results show that the proposed adder significantly reduces the error rate, by 76-98%, over a state-of-the-art approximate adder in such scenarios. Moreover, with the deployment of the proposed multiplier, which is 17% faster and 22% more power-saving than a Xilinx multiplier IP at the same precision, a quantized CNN implemented on an FPGA achieves a 17% latency reduction and 15% power saving compared with the full-precision case. |
PDF file |
Title | EFFORT: Enhancing Energy Efficiency and Error Resilience of a Near-Threshold Tensor Processing Unit |
Author | *Noel Daniel Gundi, Tahmoures Shabanian, Prabal Basu, Pramesh Pandey, Sanghamitra Roy, Koushik Chakraborty, Zhen Zhang (Utah State University, USA) |
Page | pp. 241 - 246 |
Keyword | Low Power, Error Resilience, DNN, Accelerator |
Abstract | Modern deep neural network (DNN) applications demand a remarkable processing throughput usually unmet by traditional Von Neumann architectures. Consequently, hardware accelerators, comprising a sea of multiply-and-accumulate (MAC) units, have recently gained prominence in accelerating DNN inference engines. For example, Tensor Processing Units (TPUs) account for a lion’s share of Google’s datacenter inference operations. The proliferation of real-time DNN predictions is accompanied by a tremendous energy budget. In a quest to trim the energy footprint of DNN accelerators, we propose EFFORT—an energy-optimized, yet high-performance TPU architecture operating in the Near-Threshold Computing (NTC) region. EFFORT promotes a better-than-worst-case design by operating the NTC TPU at a substantially higher frequency while keeping the voltage at the NTC nominal value. To tackle the timing errors due to such aggressive operation, we employ an opportunistic error mitigation strategy. Additionally, we implement an in-situ clock-gating architecture, drastically reducing the MACs’ dynamic power consumption. Compared to a cutting-edge error mitigation technique for TPUs, EFFORT enables up to 2.5× better performance at NTC with only a 2% average accuracy drop across 3 out of 4 DNN datasets. |
PDF file |
Title | Towards Efficient Kyber on FPGAs: A Processor for Vector of Polynomials |
Author | *Zhaohui Chen (School of Computer Science and Technology, University of Chinese Academy of Sciences, China), Yuan Ma, Tianyu Chen, Jingqiang Lin (State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, China), Jiwu Jing (School of Computer Science and Technology, University of Chinese Academy of Sciences, China) |
Page | pp. 247 - 252 |
Keyword | Post-quantum Cryptography, Security, Hardware implementation, Kyber, FPGA |
Abstract | Kyber is a promising candidate in the post-quantum cryptography standardization process. In this paper, we propose a targeted optimization strategy and implement a processor for Kyber on FPGAs. By merging operations, we cut 29.4% of clock cycles for Kyber512 and 33.3% for Kyber1024 compared with the textbook implementations. We utilize the Gentleman-Sande (GS) butterfly to optimize the Number-Theoretic Transform (NTT) implementation. The memory-access bottleneck is removed by taking advantage of a dual-column sequential scheme. We further propose a pipelined architecture for better performance. These optimizations help the processor achieve 31684 NTT operations per second using only 477 LUTs, 237 FFs and 1 DSP. Our strategy is at least 3x more efficient than the state-of-the-art NTT module at a similar security level. |
PDF file |
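For readers unfamiliar with the Gentleman-Sande (GS) butterfly mentioned above, the sketch below shows a textbook decimation-in-frequency NTT built from GS butterflies over Kyber's modulus q = 3329. Kyber's actual NTT is an incomplete, negacyclic 128-point transform with optimized modular reductions, so this only illustrates the butterfly structure; the small transform size in the demo is an arbitrary choice.

```python
def ntt_gs(a, q, root):
    """In-place decimation-in-frequency NTT using Gentleman-Sande butterflies.
    `root` is a primitive len(a)-th root of unity mod q; output is bit-reversed."""
    n = len(a)
    length = n // 2
    while length >= 1:
        w_step = pow(root, n // (2 * length), q)
        for start in range(0, n, 2 * length):
            w = 1
            for j in range(start, start + length):
                u, v = a[j], a[j + length]
                a[j] = (u + v) % q
                a[j + length] = (u - v) * w % q     # GS butterfly
                w = w * w_step % q
        length //= 2
    return a

q, zeta = 3329, 17                 # Kyber modulus; 17 is a primitive 256th root of unity mod q
n = 8                              # tiny demo size, not Kyber's polynomial degree
root = pow(zeta, 256 // n, q)      # derive a primitive n-th root of unity
print(ntt_gs(list(range(n)), q, root))
```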
Title | Efficient Subquadratic Space Complexity Digit-Serial Multipliers over GF(2m) based on Bivariate Polynomial Basis Representation |
Author | Chiou-Yng Lee (Lunghwa University of Science and Technology, Taiwan), *Jiafeng Xie (Villanova University, USA) |
Page | pp. 253 - 258 |
Keyword | Bivariate polynomial basis, digit-serial multiplier, Karatsuba algorithm block recombination, subquadratic space complexity, GF(2m) |
Abstract | Digit-serial finite field multipliers over GF(2m) with subquadratic space complexity are critical components in many applications such as elliptic curve cryptography. In this paper, we propose a pair of novel digit-serial multipliers based on the bivariate polynomial basis (BPB). First, we propose a novel digit-serial BPB multiplication algorithm based on a new decomposition strategy. Second, the proposed algorithm is properly mapped into a pair of pipelined and non-pipelined digit-serial multipliers. Lastly, through detailed complexity analysis and comparison, the proposed designs are found to have lower area-time complexities than the competing ones. |
PDF file |
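The subquadratic complexity referred to above ultimately comes from Karatsuba-style decomposition, which the following hedged sketch illustrates for GF(2) polynomial (carry-less) multiplication with bit-packed integers. It models only the arithmetic identity; the paper's bivariate polynomial basis, digit-serial scheduling, and the final reduction modulo the field polynomial of GF(2m) are not shown.

```python
# GF(2)[x] polynomials packed into Python ints (bit i = coefficient of x^i);
# addition is XOR, and Karatsuba trades one multiplication for extra additions.

def clmul(a, b):
    """Schoolbook carry-less (GF(2)[x]) multiplication."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_gf2(a, b, threshold=64):
    """Karatsuba carry-less multiplication of GF(2) polynomials."""
    n = max(a.bit_length(), b.bit_length())
    if n <= threshold:
        return clmul(a, b)
    m = n // 2
    mask = (1 << m) - 1
    a0, a1 = a & mask, a >> m
    b0, b1 = b & mask, b >> m
    lo = karatsuba_gf2(a0, b0, threshold)
    hi = karatsuba_gf2(a1, b1, threshold)
    mid = karatsuba_gf2(a0 ^ a1, b0 ^ b1, threshold) ^ lo ^ hi
    return (hi << (2 * m)) ^ (mid << m) ^ lo

assert karatsuba_gf2(0b1011, 0b110, threshold=2) == clmul(0b1011, 0b110)
```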
Title | Security Threats and Countermeasures for Approximate Arithmetic Computing |
Author | *Pruthvy Yellu, Mezanur Rahman Monjur, Timothy Kammerer, Dongpeng Xu, Qiaoyan Yu (University of New Hampshire, USA) |
Page | pp. 259 - 264 |
Keyword | Approximate computing, security, Hardware Trojan, ANN, attack model |
Abstract | Approximate computing (AC) emerges as a promising approach for energy-accuracy trade-offs in compute-intensive applications. However, recent work reveals that AC techniques could lead to new security vulnerabilities, which so far have been presented only as a visionary view. There is a lack of in-depth research on concrete attack models and on estimating the significance of attacks on approximate arithmetic computing systems. This work presents several practical attack examples and then proposes two attack models with quantitative analysis. Input integrity checking and exclusive-logic-based attack detection methods are proposed to address the attacks on AC systems. The experimental results show that the attack detection failure rate of our method is below 2.2×10−3 and the area and power overheads are less than 6.8% and 1.5%, respectively. |
PDF file |
Title | Broadcast Mechanism Based on Hybrid Wireless/Wired NoC for Efficient Barrier Synchronization in Parallel Computing |
Author | Hemanta Kumar Mondal (National Institute of Technology Durgapur, India), *Navonil Chatterjee, Rodrigo Cataldo, Jean-Philippe Diguet (Université de Bretagne Sud, France) |
Page | pp. 265 - 270 |
Keyword | NoC, Wireless, Broadcast, Parallel Computing, Barrier |
Abstract | Parallel computing is essential to achieving the performance potential of manycore architectures, since it exploits the parallel nature of the hardware. Parallel applications inevitably have to synchronize their execution, for instance through broadcast operations for barrier synchronization. Conventional network-on-chip architectures limit the performance of broadcast operations, as synchronization is affected significantly by critical-path communications that increase network latency and degrade performance drastically. A wireless network-on-chip offers a promising solution to reduce the critical-path communication bottlenecks of such conventional architectures by providing hardware broadcast support. We propose efficient barrier synchronization support using a hybrid wireless/wired NoC to reduce the cost of broadcast operations. The proposed architecture reduces the barrier synchronization cost by up to 42.79% in terms of network latency and saves up to 42.65% of communication energy consumption for a subset of applications from the PARSEC benchmark. |
PDF file |
Title | A Generic FPGA Accelerator for Minimum Storage Regenerating Codes |
Author | Mian Qin (Texas A&M University, USA), Joo Hwan Lee, Rekha Pitchumani, Yang Seok Ki (Samsung Semiconductor Inc., USA), Narasimha Reddy, *Paul V. Gratz (Texas A&M University, USA) |
Page | pp. 271 - 276 |
Keyword | Erasure codes, Minimum Storage Regenerating Codes, FPGA, accelerator |
Abstract | Erasure coding is widely used in storage systems to achieve fault tolerance while minimizing storage overhead. Recently, Minimum Storage Regenerating (MSR) codes are emerging to minimize repair bandwidth while maintaining storage efficiency. Traditionally, erasure coding is implemented in the storage software stack, which hinders normal operations and, due to poor cache performance and high CPU and memory utilization, ties up resources that could be serving other user needs. In this paper, we propose a generic FPGA accelerator for MSR code encoding/decoding that maximizes computation parallelism and minimizes data movement between off-chip DRAM and the on-chip SRAM buffers. To demonstrate the efficiency of our proposed accelerator, we implemented the encoding/decoding algorithms for a specific MSR code called the Zigzag code on a Xilinx VCU1525 acceleration card. Our evaluation shows that the proposed accelerator achieves ∼2.4-3.1x better throughput and ∼4.2-5.7x better power efficiency compared to a state-of-the-art multi-core CPU implementation, and ∼2.8-3.3x better throughput and ∼4.2-5.3x better power efficiency compared to a modern GPU accelerator. |
PDF file |
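For orientation only, the snippet below illustrates the basic erasure-coded repair idea with a single XOR parity chunk. Real MSR codes such as the Zigzag code use far richer Galois-field arithmetic (exactly the computation the paper offloads to the FPGA), so this is not the paper's code, just the general concept of rebuilding a lost chunk from survivors.

```python
# Toy single-parity erasure code (RAID-5 style), not an MSR/Zigzag code.
import os

k = 4
chunks = [bytearray(os.urandom(16)) for _ in range(k)]          # data chunks
parity = bytearray(16)
for c in chunks:
    parity = bytearray(x ^ y for x, y in zip(parity, c))        # encode

lost = 2                                                        # pretend chunk 2 failed
survivors = [c for i, c in enumerate(chunks) if i != lost] + [parity]
rebuilt = bytearray(16)
for c in survivors:
    rebuilt = bytearray(x ^ y for x, y in zip(rebuilt, c))      # repair by XOR
assert rebuilt == chunks[lost]
```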
Title | Parallel-Log-Single-Compaction-Tree: Flash-Friendly Two-Level Key-Value Management in KVSSDs |
Author | *Yen-Ting Chen (Department of Computer Science, National Tsing Hua University, Taiwan), Ming-Chang Yang (The Chinese University of Hong Kong, Hong Kong), Yuan-Hao Chang (Institute of Information Science, Academia Sinica, Taiwan), Wei-Kuan Shih (Department of Computer Science, National Tsing Hua University, Taiwan) |
Page | pp. 277 - 282 |
Keyword | key-value, flash storage system, LSM-tree, performance |
Abstract | Log-Structured Merge-tree (LSM-tree) based key-value store applications have gained popularity due to their high write performance. To further pursue better performance for key-value applications, various studies have been conducted that adopt or propose different flash-device architectures, such as the key-value solid-state drive (KVSSD). However, since LSM-trees were originally designed for the architecture of hard disk drives (HDDs), the true potential of SSDs cannot be fully exploited without re-designing the management strategy. In this work, we propose the Parallel-Log-Single-Compaction-Tree (PLSC-tree), a two-level and flash-friendly key-value management strategy specially tailored for KVSSDs. In particular, the first layer takes advantage of the massive internal parallelism of SSDs to maximize write performance via logging, while the second layer is designed to alleviate the internal recycling (i.e., compaction) overheads of the flash device to ultimately optimize the performance of managing key-value pairs. A series of experiments were conducted on a well-known SSD simulator with realistic workloads, and the results are very encouraging. |
PDF file |
Title | (Keynote Address) Huge Development of RISC-V Arising from IOT Spurt |
Author | Yingwu Zhang (GigaDevice, China) |
Abstract | The huge demand from IoT, wearable devices, AI, automotive, intelligent manufacturing and other emerging applications offers MCUs greater opportunities as well as more challenges. We must find optimized solutions and technologies for these obstacles in different scenarios, such as larger data processing and faster processing speed in automotive, ultra-low power in wearables and IoT, interconnection and data reliability, and the post-Moore era. As a leading company in the 32-bit general MCU market, GigaDevice provides low-power, connectivity and security designs in both ARM and RISC-V MCUs. In this speech, we will unveil our RISC-V core solutions and advanced design techniques, such as modular design, user extension instructions, ecological development and an active community, as well as our security designs focusing on code protection, data encryption, safe downloading, secure boot and reliability design. |
Title | Towards Design Methodology of Efficient Fast Algorithms for Accelerating Generative Adversarial Networks on FPGAs |
Author | Jung-Woo Chang, *Saehyun Ahn, Keon-Woo Kang, Suk-Ju Kang (Sogang University, Republic of Korea) |
Page | pp. 283 - 288 |
Keyword | Deep learning, Generative adversarial networks, FPGA, CNN, Accelerator |
Abstract | In this paper, we propose an efficient Winograd DeConv (deconvolution) accelerator for generative adversarial networks (GANs) on FPGAs that combines two orthogonal optimization approaches. First, we introduce a new class of fast algorithms for DeConv layers using Winograd minimal filtering. Since there are regular sparse patterns in Winograd filters, we further amortize the computational complexity by skipping zero weights. Second, we propose a new dataflow that prevents resource underutilization by reorganizing the filter layout in Winograd DeConv. Finally, we propose an efficient architecture for Winograd DeConv by designing the line buffer and exploring the design space. Experimental results on various GANs show that our accelerator achieves 1.78×~8.38× speedup over state-of-the-art DeConv accelerators. |
PDF file |
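The Winograd minimal-filtering idea underlying the accelerator can be sketched with the standard F(2,3) transform: two outputs of a 3-tap convolution are produced with four element-wise multiplications instead of six. The matrices are the commonly cited Lavin-Gray ones; the deconvolution-specific filter reorganization and zero-skipping described in the abstract are not shown.

```python
# 1-D Winograd F(2,3): y = A^T [(G g) * (B^T d)], with 4 multiplies for 2 outputs.
import numpy as np

BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G  = np.array([[1.0, 0.0, 0.0],
               [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0, 0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # 4-element input tile
g = np.array([0.5, -1.0, 2.0])       # 3-tap filter

m = (G @ g) * (BT @ d)               # 4 element-wise multiplies
y = AT @ m                           # 2 outputs
assert np.allclose(y, np.convolve(d, g[::-1], mode="valid"))   # matches direct correlation
```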
Title | Designing Efficient Shortcut Architecture for Improving the Accuracy of Fully Quantized Neural Networks Accelerator |
Author | *Baoting Li, Longjun Liu, Yanming Jin, Peng Gao, Hongbin Sun, Nanning Zheng (Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, China) |
Page | pp. 289 - 294 |
Keyword | DNN, Quantization, Shortcut, Hardware architecture, Accelerator |
Abstract | Network quantization is an effective solution to compress Deep Neural Networks (DNNs) so that they can be accelerated with custom circuits. However, existing quantization methods suffer from significant loss in accuracy. In this paper, we propose an efficient shortcut architecture to enhance the representational capability of DNNs between different convolution layers. We further implement the shortcut hardware architecture to effectively improve the accuracy of a fully quantized neural network accelerator. The experimental results show that our shortcut architecture can noticeably improve network accuracy while adding very few hardware resources (0.11× and 0.17× for LUTs and FFs, respectively) compared with the whole accelerator. |
PDF file |
Title | CRANIA: Unlocking Data and Value Reuse in Iterative Neural Network Architectures |
Author | Maedeh Hemmat, *Tejas Shah, Yuhua Chen, Joshua San Miguel (University of Wisconsin Madison, USA) |
Page | pp. 295 - 300 |
Keyword | iterative neural network architectures, temporal and spatial locality, input-dependent networks |
Abstract | A common inefficiency in traditional Convolutional Neural Network (CNN) architectures is that they do not adapt to variations in inputs. Not all inputs require the same amount of computation to be correctly classified, and not all of the weights in the network contribute equally to generate the output. Recent work introduces the concept of iterative inference, enabling per-input approximation. Such an iterative CNN architecture clusters weights based on their importance and saves significant power by incrementally fetching weights from off-chip memory until the classification result is accurate enough. Unfortunately, this comes at a cost of increased execution time since some inputs need to go through multiple rounds of inference, negating the savings in energy. We propose Cache Reuse Approximation for Neural Iterative Architectures (CRANIA) to overcome this inefficiency. We recognize that the re-execution and clustering built into these iterative CNN architectures unlock significant temporal data reuse and spatial value reuse, respectively. CRANIA introduces a lightweight cache+compression architecture customized to the iterative clustering algorithm, enabling up to 9x energy savings and speeding up inference by 5.8x with only 0.3% area overhead. |
PDF file |
Title | Tiny but Accurate: A Pruned, Quantized and Optimized Memristor Crossbar Framework for Ultra Efficient DNN Implementation |
Author | Xiaolong Ma, Geng Yuan, *Sheng Lin (Northeastern University, USA), Caiwen Ding (University of Connecticut, USA), Fuxun Yu (George Mason University, USA), Tao Liu (Florida International University, USA), Wujie Wen (Lehigh University, USA), Xiang Chen (George Mason University, USA), Yanzhi Wang (Northeastern University, USA) |
Page | pp. 301 - 306 |
Keyword | DNN, Memristor, Pruning, Quantization |
Abstract | The memristor crossbar array has emerged as an intrinsically suitable matrix-computation and low-power acceleration framework for DNN applications. Many techniques, such as memristor-based weight pruning and memristor-based quantization, have been studied. However, a high-accuracy solution for the above techniques remains to be found. In this paper, we propose a memristor-based DNN framework that combines structured weight pruning and quantization by incorporating the alternating direction method of multipliers (ADMM) algorithm for better pruning and quantization performance. We also discover the non-optimality of the ADMM solution in weight pruning and the unused data paths in a structured pruned model. Motivated by these discoveries, we design a software-hardware co-optimization framework which contains the first proposed Network Purification and Unused Path Removal algorithms, targeting the post-processing of a structured pruned model after the ADMM steps. By taking memristor hardware constraints into our whole framework, we achieve extremely high compression rates on state-of-the-art neural network structures with minimal accuracy loss. When quantizing the structured pruned model, our framework achieves nearly no accuracy loss after quantizing weights to an 8-bit memristor weight representation. We share our models at the anonymous link https://bit.ly/2VnMUy0. |
PDF file |
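A minimal NumPy sketch of the ADMM-based structured (column) pruning loop that the framework builds on is given below. The toy quadratic loss, layer shape, and hyperparameters are assumptions; the paper's quantization, crossbar mapping, Network Purification, and Unused Path Removal steps are not reproduced.

```python
import numpy as np

def project_columns(W, keep):
    """Z-update: keep the `keep` columns with largest L2 norm, zero the rest."""
    Z = np.zeros_like(W)
    idx = np.argsort(-np.linalg.norm(W, axis=0))[:keep]
    Z[:, idx] = W[:, idx]
    return Z

def admm_prune(W, grad_fn, keep, rho=1e-2, lr=1e-2, iters=200):
    Z, U = project_columns(W, keep), np.zeros_like(W)
    for _ in range(iters):
        # W-update: gradient step on loss + (rho/2)||W - Z + U||^2
        W = W - lr * (grad_fn(W) + rho * (W - Z + U))
        Z = project_columns(W + U, keep)        # Z-update: projection onto sparsity set
        U = U + W - Z                           # dual update
    return project_columns(W, keep)             # hard-prune at the end

# Toy usage: quadratic "loss" pulling W toward a random target matrix.
rng = np.random.default_rng(0)
target = rng.normal(size=(8, 16))
W = admm_prune(rng.normal(size=(8, 16)), lambda W: W - target, keep=4)
print("non-zero columns:", int((np.abs(W).sum(axis=0) > 0).sum()))
```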
Title | Towards Read-Intensive Key-Value Stores with Tidal Structure Based on LSM-Tree |
Author | *Yi Wang, Shangyu Wu, Rui Mao (Shenzhen University, China) |
Page | pp. 307 - 312 |
Keyword | Storage system, key-value store, LSM-tree, read amplifications, write amplifications |
Abstract | Key-value stores have played a critical role in many large-scale data storage applications. The log-structured merge-tree (LSM-tree) based key-value store achieves excellent performance on write-intensive workloads, which mainly benefits from the mechanism of converting a batch of random writes into sequential writes. However, the LSM-tree does not improve much on read-intensive workloads, which suffer higher latency. The main reason lies in the hierarchical search mechanism of the LSM-tree structure. The key challenge is how to propose new strategies, based on the existing LSM-tree structure, that improve read efficiency and reduce read amplification. This paper proposes Tidal-tree, a novel data structure in which data flows inside the LSM-tree like tidal waves. Tidal-tree targets improving read efficiency for read-intensive workloads. Tidal-tree allows frequently accessed files at the bottom of the LSM-tree to move to higher positions, thereby reducing read latency. Tidal-tree also gives the LSM-tree a variable shape to cater to workloads with different characteristics. To evaluate the performance of Tidal-tree, we conduct a series of experiments using standard benchmarks from YCSB. The experimental results show that Tidal-tree can significantly improve read efficiency and reduce read amplification compared with representative schemes. |
PDF file |
Title | A Flexible Processing-in-Memory Accelerator for Dynamic Channel-Adaptive Deep Neural Networks |
Author | Li Yang (Arizona State University, USA), Shaahin Angizi (University of Central Florida, USA), *Deliang Fan (Arizona State University, USA) |
Page | pp. 313 - 318 |
Keyword | Deep neural network, Processing in memory |
Abstract | With the success of deep neural networks (DNNs), many recent works have focused on developing hardware accelerators for power- and resource-limited embedded systems via model compression techniques such as quantization, pruning, and low-rank approximation. However, almost all existing DNN structures are fixed after deployment and cannot adapt at runtime to dynamic hardware resources, power budgets, throughput requirements, or workloads. Correspondingly, there is no runtime-adaptive hardware platform to support dynamic DNN structures. To address this problem, we first propose a dynamic channel-adaptive deep neural network (CA-DNN) which can adjust the involved convolution channels (i.e., model size and computing load) at run-time (i.e., at the inference stage without retraining) to dynamically trade off between power, speed, computing load and accuracy. Further, we utilize a knowledge distillation method to optimize the model and quantize it to 8 bits and 16 bits, respectively, for hardware-friendly mapping. We test the proposed model on the CIFAR-10 and ImageNet datasets using ResNet. Compared with individual models of the same size, our CA-DNN achieves better accuracy. Moreover, as far as we know, we are the first to propose a Processing-in-Memory accelerator for such adaptive neural network structures, based on Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) computational adaptive sub-arrays. We then comprehensively analyze the trade-off between accuracy and hardware parameters, e.g., energy, memory, and area overhead, for models with different channel widths. |
PDF file |
Title | Workload-aware Data-eviction Self-adjusting System of Multi-SCM Storage to Resolve Trade-off between SCM Data-retention Error and Storage System Performance |
Author | *Reika Kinoshita, Chihiro Matsui, Atsuya Suzuki, Shouhei Fukuyama, Ken Takeuchi (Chuo University, Japan) |
Page | pp. 319 - 324 |
Keyword | Storage Class Memory, ReRAM, Data-retention, Data management technique |
Abstract | A workload-aware data-eviction self-adjusting system is proposed to resolve the trade-off between data-retention reliability and system performance of Multi-SCM (storage class memory) storage that uses M-SCM (memory-type SCM) as an NV (non-volatile) cache. M-SCM such as MRAM may suffer data-retention errors at high temperatures. Therefore, data in M-SCM should be evicted to storage at short intervals, but frequent data eviction severely degrades system performance. The proposed system adjusts the data-eviction interval and improves data-retention reliability and system performance by up to 79% and 5.9 times, respectively. |
PDF file |
Title | An Energy-Efficient Quantized and Regularized Training Framework For Processing-In-Memory Accelerators |
Author | *Hanbo Sun, Zhenhua Zhu, Yi Cai (Tsinghua University, China), Xiaoming Chen (Chinese Academy of Sciences, China), Yu Wang, Huazhong Yang (Tsinghua University, China) |
Page | pp. 325 - 330 |
Keyword | Energy-Efficient, Quantized, Regularized, Processing-In-Memory Accelerators |
Abstract | Convolutional Neural Networks (CNNs) have made breakthroughs in various fields, while their energy consumption has become enormous. Processing-In-Memory (PIM) architectures based on emerging non-volatile memory (e.g., Resistive Random Access Memory, RRAM) have demonstrated great potential in improving the energy efficiency of CNN computing. However, there is still much room for improvement in the energy efficiency of existing PIM architectures. On the one hand, current work shows that high-resolution Analog-to-Digital Converters (ADCs) are required for maintaining computing accuracy, but they dominate more than 60% of the energy consumption of the entire system, damaging the energy-efficiency benefits of PIM. On the other hand, because PIM accelerators compute in the analog domain, the computing energy consumption is influenced by the specific input and weight values; however, as far as we know, there is no energy-efficiency optimization method based on this characteristic in existing work. To solve these problems, in this paper we propose an energy-efficient quantized and regularized training framework for PIM accelerators, which consists of a PIM-based non-uniform activation quantization scheme and an energy-aware weight regularization method. The proposed framework can improve the energy efficiency of PIM architectures by reducing the ADC resolution requirements and by training low-energy-consumption CNN models for PIM, with little accuracy loss. The experimental results show that the proposed training framework can reduce the resolution of ADCs by 2 bits and the computing energy consumption in the analog domain by 35%. The energy efficiency can therefore be enhanced by 3.4× with our proposed training framework. |
PDF file |
Title | Unified Redistribution Layer Routing for 2.5D IC Packages |
Author | Chun-Han Chiang, *Fu-Yu Chuang, Yao-Wen Chang (National Taiwan University, Taiwan) |
Page | pp. 331 - 337 |
Keyword | Redistribution Layer Routing, Package Routing, Bipartite Matching, Modulus-based Matrix Splitting Iteration Method |
Abstract | A 2.5-dimensional integrated circuit, which introduces an interposer as an interface between chips and a package, is one of the most popular integration technologies. Multiple chips can be mounted on an interposer, and inter-chip nets are routed on redistribution layers (RDLs). In traditional designs, the wire widths and spacings are uniform (i.e., grid-based). To improve circuit performance in modern designs, however, variable widths and spacings are also often adopted (i.e., gridless designs). In this paper, we propose the first unified routing framework that can handle both grid-based and gridless routing on RDLs based on the modulus-based matrix splitting iteration method (MMSIM) and bipartite matching. The MMSIM-based method assigns each wire a rough position while considering multiple design rules, and bipartite matching is applied to further refine those positions. We also prove the optimality of our RDL routing framework for grid-based designs and validate it empirically. Experimental results show that our framework can solve all the gridless and grid-based designs provided by industry effectively and efficiently. In particular, our framework is general and readily extends to other routing (and some quadratic optimization) problems. |
PDF file |
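The position-refinement step described above can be viewed as an assignment problem between wires and legal positions, which the illustrative sketch below solves with a min-cost bipartite matching (Hungarian algorithm via SciPy). The target positions, track coordinates, and displacement cost are made up; the MMSIM-based global step of the paper is not modeled.

```python
# Assign each wire (with a rough target position from the global step) to a
# distinct legal position, minimizing total displacement.
import numpy as np
from scipy.optimize import linear_sum_assignment

wire_targets = np.array([3.2, 7.9, 8.4, 15.0])      # rough positions from the global step
tracks = np.array([2.0, 6.0, 9.0, 12.0, 16.0])      # legal, design-rule-clean positions

cost = np.abs(wire_targets[:, None] - tracks[None, :])   # displacement cost matrix
rows, cols = linear_sum_assignment(cost)                 # min-cost bipartite matching
for w, t in zip(rows, cols):
    print(f"wire {w} -> position {tracks[t]}")
```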
Title | AIR: A Fast but Lazy Timing-Driven FPGA Router |
Author | *Kevin E. Murray, Shen Zhong, Vaughn Betz (University of Toronto, Canada) |
Page | pp. 338 - 344 |
Keyword | FPGA, Routing |
Abstract | Routing is a key step in the FPGA design process, which significantly impacts design implementation quality. Routing is also very time-consuming, and can scale poorly to very large designs. This paper describes the Adaptive Incremental Router (AIR), a high-performance timing-driven FPGA router. AIR dynamically adapts to the routing problem, which it solves ‘lazily’ to minimize work. Compared to the widely used VPR 7 router, AIR significantly reduces route-time (7.1x faster), while also improving quality (15% wirelength, and 18% critical path delay reductions). We also show how these techniques enable efficient incremental improvement of existing routing. |
PDF file |
Title | SP&R: Simultaneous Placement and Routing Framework for Standard Cell Synthesis in Sub-7nm |
Author | Dongwon Park, *Daeyeal Lee (University of California, San Diego, USA), Ilgweon Kang (Cadence, USA), Sicun Gao, Bill Lin, Chung-Kuan Cheng (University of California, San Diego, USA) |
Page | pp. 345 - 350 |
Keyword | Standard Cell, Synthesis, SMT, Placement, Routing |
Abstract | Standard cell synthesis requires careful engineering to ensure routability across various digital IC designs, since physical design for sub-7nm technology nodes demands holistic efforts to address urgent and nontrivial design challenges. Many conventional approaches have been suggested for improving transistor-level P&R and pin accessibility, but they remain insufficient because of their heuristic, divide-and-conquer nature. In this paper, we propose a novel framework that simultaneously solves P&R for standard-cell layout design, without any sequential procedures, by using dynamic pin-allocation-based cell synthesis. The proposed SP&R utilizes Optimization Modulo Theories (OMT), an extension of Satisfiability Modulo Theories (SMT), to obtain optimal standard cell layouts by virtue of fast SAT (Boolean Satisfiability) based reasoning. We validate that our SP&R framework achieves a 10.5% reduction in metal length on average compared to the sequential approach, through practical standard cell designs targeting sub-7nm technology nodes. |
PDF file |
Title | Chiplet-Package Co-Design For 2.5D Systems Using Standard ASIC CAD Tools |
Author | MD Arafat Kabir, *Yarui Peng (University of Arkansas, USA) |
Page | pp. 351 - 356 |
Keyword | 2.5D Design, Chip-Package co-design, Redistribution Layer Planning, Package Design |
Abstract | Chiplet integration using 2.5D packaging is gaining popularity, as it enables several interesting features such as heterogeneous integration and the drop-in design method. In the traditional die-by-die approach to designing a 2.5D system, each chiplet is designed independently without any knowledge of the package RDLs. In this paper, we propose a chip-package co-design flow for implementing 2.5D systems using existing commercial chip design tools. Our flow encompasses 2.5D-aware partitioning suitable for SoC design, chip-package floorplanning, and post-design analysis and verification of the entire 2.5D system. We also designed our own package planners to route RDL layers on top of the chiplet layers. We use an ARM Cortex-M0 SoC system to illustrate our flow and compare analysis results with a monolithic 2D implementation of the same system. We also compare two different 2.5D implementations of the same SoC system following the drop-in approach. Alongside the traditional die-by-die approach, our holistic flow enables design efficiency and flexibility with accurate cross-boundary parasitic extraction and design verification. |
PDF file |
Title | Event Delivery using Prediction for Faster Parallel SystemC Simulation |
Author | *Zhongqi Cheng, Emad Arasteh, Rainer Dömer (University of California, Irvine, USA) |
Page | pp. 357 - 362 |
Keyword | SystemC, PDES, Simulation, Event |
Abstract | Out-of-order Parallel Discrete Event Simulation (OoO PDES) is an advanced simulation approach that efficiently verifies and validates SystemC models. To preserve the simulation semantics, OoO PDES performs a conservative event delivery strategy which often postpones the execution of waiting threads due to unknown future behaviors of the model. In this paper, based on predicted behaviors of threads, we introduce a novel event delivery strategy that allows waiting threads to resume execution earlier, resulting in significantly increased simulation speed. Experimental results show that the proposed approach increases the OoO PDES simulation speed by up to 4.9x compared to the original one on a 4-core machine. |
PDF file |
Title | Standard-compliant Parallel SystemC simulation of Loosely-Timed Transaction Level Models |
Author | *Gabriel Busnot, Tanguy Sassolas, Nicolas Ventroux (CEA, LIST, Computing and Design Environment Laboratory, France), Matthieu Moy (Univ Lyon, EnsL, UCBL, CNRS, Inria, LIP, France) |
Page | pp. 363 - 368 |
Keyword | Parallel SystemC, Simulation, TLM |
Abstract | To face the growing complexity of Systems-on-Chip (SoCs) and their tight time-to-market constraints, Virtual Prototyping (VP) tools based on SystemC/TLM must get faster while maintaining accuracy. However, the Accellera SystemC reference implementation remains sequential and cannot leverage the multiple cores of modern workstations. In this paper, we present a new implementation of a parallel and standard-compliant SystemC kernel, reaching unprecedented performance. By coupling a parallel SystemC kernel with memory-access monitoring, we are able to keep SystemC atomic thread evaluation while leveraging the available host cores. Evaluations show a 19× speed-up compared to the Accellera SystemC kernel using 33 host cores, reaching speeds above 2000 Million simulated Instructions Per Second (MIPS). |
PDF file |
Title | JIT-Based Context-Sensitive Timing Simulation for Efficient Platform Exploration |
Author | *Alessandro Cornaglia, Md Shakib Hasan, Alexander Viehl (FZI Research Center for Information Technology, Germany), Oliver Bringmann, Wolfgang Rosenstiel (University of Tübingen, Germany) |
Page | pp. 369 - 374 |
Keyword | Software Timing Simulation, Embedded Systems, Hardware-Related Software, Early Design Exploration of Heterogeneous Platforms, Compiler optimizations effects |
Abstract | Fast and accurate predictions of a program’s execution time are essential during the design space exploration of embedded systems. In this paper, we present a novel approach for efficient context-sensitive timing simulation based on the LLVM IR code representation. Our approach allows multiple hardware platform configurations to be evaluated simultaneously with only one simulation run. State-of-the-art solutions are improved by speeding up the simulation throughput using the fast LLVM IR JIT execution engine. Results show over 94% prediction accuracy on average and a speedup of 200 times compared to interpretive simulation. The simulation performance reaches up to 300 MIPS when one HW configuration is assessed and grows to 1 GIPS when evaluating four configurations in parallel. Additionally, we show that our approach can be utilized to produce early timing estimates that support designers in mapping a system to heterogeneous hardware platforms. |
PDF file |
Title | Towards Automatic Hardware Synthesis from Formal Specification to Implementation |
Author | *Fritjof Bornebusch, Christoph Lüth (German Research Center for Artificial Intelligence (DFKI), Germany), Robert Wille (Johannes Kepler University Linz, Austria), Rolf Drechsler (University of Bremen, Germany) |
Page | pp. 375 - 380 |
Keyword | hardware synthesis, formal verification, functional hardware description |
Abstract | In this work, we sketch an automated design flow for hardware synthesis based on a formal specification. Verification results are propagated from the FSL level through the proposed flow to generate an ESL model as well as an RTL implementation automatically. In contrast, the established design flow relies on manual implementations at the ESL and RTL level. The proposed design flow combines proof assistants with functional hardware description languages. This combination decreases the implementation effort significantly and the generation of testbenches is no longer needed. We illustrate our design flow by specifying and synthesizing a set of benchmarks that contain sequential and combinational hardware designs. We compare them with implementations required by the established hardware design flow. |
PDF file |
Title | (Invited Paper) Emerging Non-Volatile Memories for Computation-in-Memory |
Author | *Bin Gao (Tsinghua University, China) |
Page | pp. 381 - 384 |
Keyword | NVM, CIM, RRAM |
Abstract | This talk will introduce the principles of different emerging NVM devices. The device structures, working mechanisms, as well as typical performance of these devices will be discussed. Then, different approaches to CIM based on emerging NVM will be presented, with a special focus on matrix-vector multiplication. Later, the talk will summarize the performance requirements and key challenges at the device level for realizing CIM. Finally, it will suggest some possible research directions for the future development of emerging NVM for CIM applications. |
PDF file |
Title | (Invited Paper) The Power of Computation-in-Memory Based on Memristive Devices |
Author | *Jintao Yu, Muath Abu Lebdeh, Hoang Anh Du Nguyen, Mottaqiallah Taouil, Said Hamdioui (Delft University of Technology, Netherlands) |
Page | pp. 385 - 392 |
Keyword | Computation-in-Memory, Memristive Devices, Classification |
Abstract | Conventional computing architectures, and the CMOS technology they are based on, are facing major challenges such as the memory bottleneck, which makes memory access for data transfer a major killer of energy and performance. The computation-in-memory (CIM) paradigm is seen as a potential alternative that could alleviate such problems by adding computational resources to the memory and significantly reducing communication. Memristive devices are promising enablers of such a CIM paradigm, as they are able to support both storage and computing. This paper shows the power of the memristive-device-based CIM paradigm in enabling new efficient application-specific architectures as well as efficient implementations of some known domain-specific architectures. In addition, the paper discusses potential applications that could benefit from such a paradigm and highlights the major challenges. |
PDF file |
Title | (Invited Paper) Tolerating Retention Failures in Neuromorphic Fabric based on Emerging Resistive Memories |
Author | Christopher Münch (Karlsruhe Institute of Technology, Germany), Rajendra Bishnoi (Delft University of Technology, Netherlands), *Mehdi B. Tahoori (Karlsruhe Institute of Technology, Germany) |
Page | pp. 393 - 400 |
Keyword | BNN, MTJ, retention |
Abstract | In this paper, we evaluate the retention issues of emerging resistive memories used as non-volatile weight storage for embedded NN. We exploit the asymmetric retention behavior of Spintronic based Magnetic Tunneling Junctions (MTJs), which is also present in other resistive memories like Phase-Change memory (PCM) and ReRAM, to optimize the retention of the NN accuracy over time. We propose mixed retention cell arrays and an adapted training scheme to achieve a trade-off between array size and the reliable long-term accuracy of NNs. The results of our proposed method save up to 24% of inference accuracy of an MNIST trained Multi-Layer-Perceptron on MTJ-based crossbars. |
PDF file |
Title | (Invited Paper) Ferroelectrics: From Memory to Computing |
Author | *Kai Ni (Rochester Institute of Technology, USA), Sourav Dutta, Suman Datta (University of Notre Dame, USA) |
Page | pp. 401 - 406 |
Keyword | Ferroelectric, Nonvolatile Memory, Synaptic Weight Cell, In-Memory Computing, Neuron |
Abstract | The discovery of ferroelectricity in HfO2 thin films has ignited tremendous activity in the exploration of ferroelectric FETs for a range of applications, from low-power logic to embedded non-volatile memory to in-memory compute kernels. In this paper, key milestones in the evolution of Ferroelectric Field Effect Transistors (FeFETs) and the emergence of a versatile ferroelectronic platform are presented. From all these developments, the ferroelectric emerges as a highly promising platform for various exciting applications. |
PDF file |
Title | (Invited Paper) Adaptive Circuit Approaches to Low-Power Multi-Level/Cell FeFET Memory |
Author | Juejian Wu, Yixin Xu, Bowen Xue, Yu Wang, Yongpan Liu, Huazhong Yang, *Xueqing Li (The Department of Electronic Engineering, Tsinghua University, China) |
Page | pp. 407 - 413 |
Keyword | Nonvolatile Memory, Multi-Level-Cell, Ferroelectric FET, Adaptive MLC Approaches, Computing-in-Memory |
Abstract | Ferroelectric FETs (FeFETs) have emerged as a promising multi-level/cell (MLC) nonvolatile memory (NVM) candidate for low-power applications. This originates from the advantages of both efficient memory access and intrinsic device-level in-memory computing flexibility. However, challenges still exist for FeFET MLC NVM: (i) high power consumption in read operations due to the high-gain requirement for sense amplifiers during sensing, and (ii) high latency and energy consumption in write operations with conventional recursive program-and-verify. Targeting lower power, lower latency, and higher density, this work investigates and optimizes the read and write approaches for MLC FeFET NVM design: (i) Adaptive FeFET memory State Mapping (ASM) between the FeFET drain-source current and the digital states to increase the sensing margin; (ii) Adaptive FeFET Gate Biasing (AGB) read methods that adopt an optimized FeFET gate voltage to boost the sensible dynamic range and store more levels of states per cell; (iii) Adaptive Prediction-based Direct (APD) write methods that minimize program-and-verify activities. Evaluations show significant latency and energy improvements. Furthermore, the number of sensible levels of states per cell is also increased thanks to an enhanced dynamic sensing range and an enhanced sensing margin. |
PDF file |
Title | (Invited Paper) Emerging Memories as Enablers for In-Memory Layout Transformation Acceleration and Virtualization |
Author | Minli Liao, *John (Jack) Sampson (The Pennsylvania State University, USA) |
Page | pp. 414 - 421 |
Keyword | Memory, In-cache layout transform |
Abstract | Recent works have shown that certain emerging memory technologies can inherently support dense multi-orientation memory (MOM) access, such as row-column memories. However, with few exceptions, these works have only considered MOMs and MOM-caching techniques that provide multiple views of a single memory region. This work explores the potential of MOMs to present concurrent views of data organization as a means to offload data layout transformations. We demonstrate the potential of MOM-offloading to substantially reduce data movement for select computation patterns. |
PDF file |
Title | (Invited Paper) Benchmark Non-volatile and Volatile Memory Based Hybrid Precision Synapses for In-situ Deep Neural Network Training |
Author | Yandong Luo, *Shimeng Yu (Georgia Institute of Technology, USA) |
Page | pp. 422 - 427 |
Keyword | deep learning, non-volatile memory, hardware accelerator, training |
Abstract | Compute-in-memory (CIM) with emerging non-volatile memories (eNVMs) is time and energy efficient for deep neural network (DNN) inference. However, challenges still remain for in-situ DNN training with eNVMs due to the asymmetric weight update behavior, high programming latency and energy consumption. To overcome these challenges, a hybrid precision synapse combining eNVMs with capacitor has been proposed. It leverages the symmetric and fast weight update in the volatile capacitor, as well as the non-volatility and large dynamic range of the eNVMs. In this paper, in-situ DNN training architecture with hybrid precision synapses is proposed and benchmarked with the modified NeuroSim simulator. |
PDF file |
Title | (Invited Paper) Capacitance Extraction and Power Grid Analysis Using Statistical and AI Methods |
Author | *Wenjian Yu, Ming Yang, Yao Feng, Ganqu Cui (Tsinghua University, China), Ben Gu (Cadence, USA) |
Page | pp. 428 - 433 |
Keyword | Capacitance extraction, Power grid simulation, Statistical method, Artificial intelligence, Classification problem |
Abstract | Capacitance extraction and power grid (PG) analysis for IC design involve large-scale numerical simulation problems. As process technology becomes more complicated and design margins are shrinking, capacitance field solvers and power-grid matrix solvers with high accuracy and the capability of handling large and complex structures are in high demand. In this invited paper, we present recent applications of statistical and AI methods in these two fields. A Markov-chain model and the relevant analysis are presented for developing an efficient technique for handling conformal dielectrics in floating-random-walk based capacitance extraction. Then, two approaches for reducing the computational cost of a domain-decomposition based power-grid solver are presented. One employs supervised machine learning, while the other is inspired by the A*-search algorithm. |
PDF file |
Title | (Invited Paper) VLSI Mask Optimization: From Shallow To Deep Learning |
Author | *Haoyu Yang (The Chinese University of Hong Kong, Hong Kong), Wei Zhong (Dalian University of Technology, China), Yuzhe Ma, Hao Geng, Ran Chen, Wanli Chen, Bei Yu (The Chinese University of Hong Kong, Hong Kong) |
Page | pp. 434 - 439 |
Keyword | OPC, Machine Learning |
Abstract | VLSI mask optimization is one of the most critical stages in manufacturability-aware design, and it is costly due to the complicated mask optimization and lithography simulation. Recent research has shown the prominent advantages of machine learning techniques in dealing with complicated and big-data problems, which brings the potential of dedicated machine learning solutions for DFM problems and facilitates the VLSI design cycle. In this paper, we focus on a heterogeneous OPC framework that assists mask layout optimization. Preliminary results show the efficiency and effectiveness of the proposed frameworks, which have the potential to be alternatives to existing EDA solutions. |
PDF file |
Title | (Invited Paper) Bayesian Methods for the Yield Optimization of Analog and SRAM Circuits |
Author | Shuhan Zhang, *Fan Yang (Fudan University, China), Dian Zhou (University of Texas at Dallas, USA), Xuan Zeng (Fudan University, China) |
Page | pp. 440 - 445 |
Keyword | Yield Optimization, Bayesian Optimization, Max-value Entropy Search |
Abstract | As the technology node shrinks to the nanometer scale, process variation becomes one of the most important issues in IC design. The industry calls for designs with high yield under process variations. Yield optimization is computationally intensive because it traditionally relies on Monte-Carlo yield estimation. In this paper, we first review Bayesian methods that reduce the computational cost of yield estimation and optimization. By applying Bayes’ theorem, maximizing the circuit yield is transformed into identifying the design parameters with maximal probability density, conditioned on the event that the corresponding circuit is a “pass”. It can thus avoid repetitive yield estimations during optimization. The computational cost can also be reduced by using a Bayesian optimization strategy. By using a Gaussian process surrogate model and adaptive yield estimation, Bayesian optimization can significantly reduce the number of simulations while achieving comparable yields for analog and SRAM circuits. We further propose a Bayesian optimization approach for yield optimization via max-value entropy search. The proposed max-value entropy search can better explore the state space and thus reduce the number of circuit simulations while achieving competitive results. |
PDF file |
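A minimal Bayesian-optimization loop with a Gaussian-process surrogate, in the spirit of the surrogate-based yield optimization reviewed above, is sketched below. The synthetic "yield" function stands in for expensive Monte-Carlo circuit simulation, and a simple upper-confidence-bound acquisition replaces the paper's max-value entropy search.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def fake_yield(x):
    """Placeholder for an expensive Monte-Carlo yield estimate at design point x."""
    return float(np.exp(-np.sum((x - 0.3) ** 2) * 8.0))

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(5, 2))         # initial samples over 2 design parameters
y = np.array([fake_yield(x) for x in X])

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    cand = rng.uniform(0, 1, size=(256, 2))            # random candidate designs
    mu, sigma = gp.predict(cand, return_std=True)
    x_next = cand[np.argmax(mu + 2.0 * sigma)]         # UCB acquisition
    X = np.vstack([X, x_next])
    y = np.append(y, fake_yield(x_next))

print("best estimated yield:", y.max(), "at", X[np.argmax(y)])
```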
Title | (Designers' Forum) Recent Advances in Hardware Security and Testing Tools |
Author | Junfeng Fan (Open Security Research, Inc, China) |
Abstract | This talk will give an overview of the hardware security design challenges faced by the ICT industry today and introduce three new research directions: instruction-set extensions for cryptographic algorithms, hardware-assisted software security, and system-level security modeling. The talk will also discuss the increasing importance of security testing tools in the full life-cycle of silicon chips. Two new tools, namely chip vulnerability scanning at the design phase and protocol implementation security analysis, will be introduced. |
Title | (Designers' Forum) Design of Energy-Efficient Dynamic Reconfigurable Cryptographic Chip |
Author | Jinjiang Yang (Tsinghua University, China) |
Abstract | With the increasing security demands of network servers and cloud servers, high-throughput cryptographic chip design has become a hot research topic. However, the power consumption of current solutions is too high when the throughput exceeds 10 Gbps. Among current solutions, ASIC accelerators have higher energy efficiency, but their lack of flexibility makes them unsuitable for servers that must implement a wide variety of cryptographic algorithms. General-purpose processors (GPPs) are widely used because of their ease of use, but instruction fetching and decoding induce inevitable power overhead. Reconfigurable processors use configuration streams instead of instructions to reduce this power overhead, while maintaining considerable flexibility thanks to their reconfiguration capability. We develop a coarse-grained dynamically reconfigurable cryptographic chip for high-throughput secure network processing and cloud computing. This chip implements international and national cryptographic algorithms (AES, 3DES, SHA256, SHA3, RSA, ECC, SM2, SM3, SM4, etc.) and supports new algorithms after silicon implementation. It supports key management and virtualization (SR-IOV) and provides considerable acceleration for security protocols such as IPsec and TLS. Thus, the chip can provide security for networking and cloud computing with high performance and energy efficiency. For example, the TLS handshake speed can reach 40 KQps @ ECDHE-RSA-WITH-AES256-GCM-SHA384 on a single chip. |
Title | (Designers' Forum) Cognitive SSD Controller: A Case for Agile Domain-Specific SoC Design |
Author | Ying Wang (Chinese Academy of Sciences, China) |
Abstract | Large-scale data analysis systems have been suffering from the overhead of the deep I/O stack and the rocketing cost of moving data across the storage and memory hierarchy. In-storage processing was proposed decades ago to address the off-device bandwidth, latency, and energy issues of conventional systems. With the growth of unstructured and irregular data such as video, images, and graphs, AI-driven data analysis is becoming prevalent in commercial data centers. To embrace AI-driven in-storage data processing, we propose Cognitive SSD, a flexible and energy-efficient solution for unstructured data analysis. In Cognitive SSD, the specialized deep-learning and graph-processing hardware is abstracted and exposed to users as library APIs via an NVMe command extension, enabling the free definition of data analysis functions inside the storage, such as image retrieval, graph search, and query services. In this talk, we present the architecture of Cognitive SSD and its programming interface, showcase a prototype data retrieval system, and discuss future directions of the Cognitive SSD project. |
Title | (Keynote Address) Emulation View of Synopsys Verification Continuum Platform |
Author | Michael Wang (Synopsys) |
Abstract | Increasing System-on-Chip (SoC) complexity and software content, combined with rising time-to-market pressure, are driving the need for a next-generation verification solution that spans pre-silicon verification, post-silicon validation, and early software bring-up. Synopsys' Verification Continuum platform, developed in collaboration with market leaders, unites Synopsys' best-in-class verification solutions, facilitating a seamless transition between them and improving SoC time-to-market by months. Verification Continuum is architected with FPGA-based emulation and prototyping, delivering the speed and scalability required for software bring-up and SoC verification. By natively integrating the industry's fastest emulator, ZeBu Server 4, with the other Synopsys verification engines in the Verification Continuum platform, such as Virtualizer virtual prototyping, VCS simulation, HAPS prototyping, SpyGlass static analysis, and Verdi debug, many effective emulation solutions are created that significantly improve design verification and software bring-up productivity. In addition, on top of the Verification Continuum platform, Synopsys develops domain-specific solutions to meet specific technical requirements from the networking, AI, and 5G sectors. All of the above emulation technologies and solutions will be discussed in this presentation. |
Thursday, January 16, 2020 |
Title | (Keynote Address) Explore the Next Tides of EDA |
Author | Lifeng Wu (Empyrean Software) |
Abstract | EDA, one of the most critical pillars of the semiconductor industry, has supported Moore's law for four decades. On the other hand, EDA growth in the last two decades has mostly been driven by applications rather than by fundamental breakthroughs in EDA research. What are the possible directions for future EDA tides? From our point of view, computing platforms (heterogeneous computing, cloud computing, ARM-based massive-threading architectures) and AI-based algorithms will provide more dimensions for EDA research. We will demonstrate some solutions powered by heterogeneous computing platforms and machine-learning algorithms. |
Title | Programmable Neuromorphic Circuit based on Printed Electrolyte-Gated Transistors |
Author | *Dennis D. Weller, Michael Hefenbrock, Mehdi B. Tahoori (Karlsruhe Institute of Technology, Germany), Jasmin Aghassi-Hagmann (Offenburg University of Applied Sciences, Germany), Michael Beigl (Karlsruhe Institute of Technology, Germany) |
Page | pp. 446 - 451 |
Keyword | Neuromorphic Computing, Printed Electronics, Programmable, Electrolyte-gated transistors, Inkjet Printing |
Abstract | Neuromorphic computing systems have demonstrated many advantages for popular classification problems with significantly fewer computational resources. We present in this paper the design, fabrication, and training of a programmable neuromorphic circuit based on printed electrolyte-gated field-effect transistors (EGFETs). Built on a printable neuron architecture involving several resistors and one transistor, the proposed circuit can realize multiply-add and activation functions. The functionality of the circuit, i.e., the weights of the neural network, can be set during a post-fabrication step by printing resistors onto the crossbar. Besides the fabrication of a programmable neuron, we also provide a learning algorithm, tailored to the requirements of the technology and the proposed programmable neuron design, which is verified through simulations. The proposed neuromorphic circuit operates at 5 V and occupies 385 mm2 of area. |
PDF file |
Title | HashHeat: An O(C) Complexity Hashing-based Filter for Dynamic Vision Sensor |
Author | Shasha Guo, *Ziyang Kang, Lei Wang, Shiming Li, Weixia Xu (National University of Defense Technology, China) |
Page | pp. 452 - 457 |
Keyword | DVS noise filtering, hash, memory |
Abstract | Neuromorphic event-based dynamic vision sensors (DVS) have much faster sampling rates and a higher dynamic range than frame-based imagers. However, they are sensitive to unwanted background activity (BA) events. We propose HashHeat, a hashing-based BA filter with O(C) complexity. It is the first spatiotemporal filter whose cost does not scale with the DVS output size N and that does not store 32-bit timestamps. HashHeat consumes 100x less memory and increases the signal-to-noise ratio by 15x compared to previous designs. |
PDF file |
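One plausible reading of a constant-memory, hashing-based background-activity filter is sketched below. The bucket count, cell sizes, hash, and support threshold are assumptions made for illustration; they do not describe HashHeat's actual design.

```python
# Constant-memory, hash-based background-activity filter for DVS events
# (illustrative sketch; all parameters below are invented).
NUM_BUCKETS = 4096      # fixed table size -> memory independent of sensor resolution
CELL = 4                # spatial cell size in pixels
TIME_BIN = 10_000       # temporal bin in microseconds

table = [0] * NUM_BUCKETS   # small counters instead of per-pixel 32-bit timestamps

def bucket(x, y, t):
    """Hash a coarse spatiotemporal cell into the fixed-size table."""
    key = (x // CELL, y // CELL, t // TIME_BIN)
    return hash(key) % NUM_BUCKETS

def is_signal(x, y, t, support=2):
    """Keep an event once its spatiotemporal cell has seen enough other events;
    background noise tends to be isolated in space and time. A real filter
    would also age or reset the bins as time advances."""
    b = bucket(x, y, t)
    keep = table[b] >= support
    table[b] = min(table[b] + 1, 255)      # saturating counter update
    return keep

# Later events of a correlated burst pass once support accumulates;
# the isolated event is dropped.
events = [(10, 12, 100), (11, 12, 150), (10, 13, 180), (300, 40, 90_000)]
print([is_signal(*e) for e in events])
```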
Title | A Tuning-Free Hardware Reservoir Based on MOSFET Crossbar Array for Practical Echo State Network Implementation |
Author | *Yuki Kume, Song Bian, Takashi Sato (Kyoto University, Japan) |
Page | pp. 458 - 463 |
Keyword | hardware implementation, echo state network, reservoir computing, recurrent neural network, weight tuning |
Abstract | The echo state network (ESN) is a class of recurrent neural network known for drastically reducing training time through the use of a reservoir, a random and fixed network serving as the input and middle layers. In this paper, we propose a hardware implementation of an ESN that uses a practical MOSFET-based reservoir. As opposed to existing reservoirs that require additional tuning of network weights for improved stability, our ESN requires no post-training parameter tuning. To achieve this, we apply the circular law of random matrices to sparse reservoirs to determine a stable and fixed feedback gain. In evaluations using the Mackey-Glass time-series dataset, the proposed ESN performs successful inference without post-training parameter tuning. |
PDF file |
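A minimal software sketch of the stability idea follows: the spectral radius of a sparse i.i.d. random reservoir is estimated analytically from the circular law and used to fix the feedback gain, so no eigenvalue computation or post-training tuning is needed. The network size, density, and toy task are assumptions; the MOSFET crossbar itself is not modeled.

```python
# Software echo state network with a circular-law-scaled sparse reservoir.
import numpy as np

rng = np.random.default_rng(1)
N, density, target_radius = 300, 0.05, 0.9

# Sparse reservoir with i.i.d. entries: by the circular law the spectral
# radius is roughly sqrt(N * density * var(entry)), so dividing by that
# estimate and multiplying by target_radius keeps the dynamics stable.
mask = rng.random((N, N)) < density
W = rng.normal(0.0, 1.0, (N, N)) * mask
W *= target_radius / np.sqrt(N * density * 1.0)

W_in = rng.uniform(-0.5, 0.5, (N, 1))

def run_reservoir(u):
    """Drive the reservoir with input sequence u and collect its states."""
    x = np.zeros(N)
    states = []
    for u_t in u:
        x = np.tanh(W @ x + W_in[:, 0] * u_t)
        states.append(x.copy())
    return np.array(states)

# Toy one-step-ahead prediction of a noisy sine (stand-in for Mackey-Glass).
t = np.arange(1200)
u = np.sin(0.1 * t) + 0.05 * rng.normal(size=t.size)
X = run_reservoir(u[:-1])[200:]              # drop the washout period
y = u[201:]                                  # next-step targets
W_out = np.linalg.lstsq(X, y, rcond=None)[0] # only the readout is trained
print("train MSE:", np.mean((X @ W_out - y) ** 2))
```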
Title | MindReading: An Ultra-Low-Power Photonic Accelerator for EEG-based Human Intention Recognition |
Author | *Qian Lou (Indiana University Bloomington, USA), Wenyang Liu, Weichen Liu (Nanyang Technological University, Singapore), Feng Guo, Lei Jiang (Indiana University Bloomington, USA) |
Page | pp. 464 - 469 |
Keyword | EEG, Photonic, Neural Network, Accelerator |
Abstract | A scalp-recording electroencephalography (EEG)-based brain-computer interface (BCI) system can greatly improve the quality of life for people who suffer from motor disabilities. Deep neural networks consisting of multiple convolutional, LSTM, and fully-connected layers are created to decode EEG signals and maximize the human intention recognition accuracy. However, prior FPGA, ASIC, ReRAM, and photonic accelerators cannot maintain sufficient battery lifetime when processing real-time intention recognition. In this paper, we propose an ultra-low-power photonic accelerator, MindReading, for human intention recognition using only low-bit-width addition and shift operations. Compared to prior neural network accelerators, while maintaining the real-time processing throughput, MindReading reduces the power consumption by 62.7% and improves the throughput per Watt by 168%. |
PDF file |
Title | LanCe: A Comprehensive and Lightweight CNN Defense Methodology against Physical Adversarial Attacks on Embedded Multimedia Applications |
Author | *Zirui Xu, Fuxun Yu, Xiang Chen (George Mason University, USA) |
Page | pp. 470 - 475 |
Keyword | Convolutional Neural Network, Physical Adversarial Attack, Image Classification, Speech Recognition |
Abstract | Recently, adversarial attacks have been applied in the physical world, causing practical issues for various applications powered by Convolutional Neural Networks (CNNs). Most existing defenses against physical adversarial attacks focus only on eliminating explicit perturbation patterns from the inputs, without interpreting the CNN's intrinsic vulnerability. Therefore, they lack versatility across different attacks and incur considerable data-processing costs. In this paper, we propose LanCe, a comprehensive and lightweight CNN defense methodology against different physical adversarial attacks. By interpreting the CNN's vulnerability, we find that non-semantic adversarial perturbations can activate the CNN with significantly abnormal activations and even overwhelm the activations of other semantic input patterns. We improve the CNN recognition process by adding a self-verification stage that detects potential adversarial inputs at the cost of only one additional CNN inference. Based on the detection result, we further propose a data recovery methodology to defend against physical adversarial attacks. We apply this defense methodology to both image and audio CNN recognition scenarios and analyze the computational complexity of each scenario. Experiments show that our methodology achieves an average 91% attack detection success rate and 89% accuracy recovery. Moreover, it is up to 3x faster than state-of-the-art defense methods, making it feasible for resource-constrained embedded systems such as mobile devices. |
PDF file |
Title | Towards Area-Efficient Optical Neural Networks: An FFT-based Architecture |
Author | *Jiaqi Gu, Zheng Zhao, Chenghao Feng, Mingjie Liu, Ray T. Chen, David Z. Pan (University of Texas at Austin, USA) |
Page | pp. 476 - 481 |
Keyword | FFT, Optical Neural Networks, Area-efficient |
Abstract | As a promising neuromorphic framework, the optical neural network (ONN) demonstrates ultra-high inference speed with low energy consumption. However, previous ONN architectures have high area overhead, which limits their practicality. In this paper, we propose an area-efficient ONN architecture based on structured neural networks, leveraging the optical fast Fourier transform for efficient computation. A two-phase software training flow with structured pruning is proposed to further reduce the optical component utilization. Experimental results demonstrate that the proposed architecture achieves a 2.2-3.7x area cost improvement over the previous singular-value-decomposition-based architecture with comparable inference accuracy. |
PDF file |
Title | Automated Trigger Activation by Repeated Maximal Clique Sampling |
Author | *Yangdi Lyu, Prabhat Mishra (University of Florida, USA) |
Page | pp. 482 - 487 |
Keyword | Trigger Activation, Clique Coverage, Hardware Trojan, Satisfiability |
Abstract | Hardware Trojans are a serious threat to the security and reliability of computing systems. It is hard to detect these malicious implants using traditional validation methods since an adversary is likely to hide them under rare trigger conditions. While existing statistical test generation methods are promising for Trojan detection, they are not suitable for activating the extremely rare trigger conditions of stealthy Trojans. To address the fundamental challenge of activating rare triggers, we propose a new test generation paradigm that maps the trigger activation problem to the clique cover problem. The basic idea is to utilize a satisfiability solver to construct a test corresponding to each maximal clique. This paper makes two fundamental contributions: 1) it proves that the trigger activation problem can be mapped to the clique cover problem, and 2) it proposes an efficient test generation algorithm that activates trigger conditions by repeated maximal clique sampling. Experimental results demonstrate that our approach is scalable and outperforms state-of-the-art approaches by several orders of magnitude in detecting stealthy Trojans. |
PDF file |
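The clique-cover intuition can be illustrated with a toy sketch. Here rare trigger conditions are plain (signal, value) literals and the SAT-based compatibility check is replaced by a trivial rule, so this is only a schematic of repeated maximal clique sampling, not the paper's algorithm.

```python
# Greedy "repeated maximal clique sampling" over a compatibility graph of
# rare conditions (toy sketch; the real method works on netlists with a SAT
# solver, here two conditions conflict only if they constrain the same
# hypothetical signal to different values).
import random
random.seed(0)

# Hypothetical rare trigger conditions: (signal index, required value).
conditions = [(0, 1), (1, 0), (1, 1), (2, 1), (3, 0), (4, 1), (0, 0)]

def compatible(a, b):
    return a[0] != b[0] or a[1] == b[1]

def sample_maximal_clique(remaining):
    """Grow a random clique until no remaining condition can be added."""
    pool = list(remaining)
    random.shuffle(pool)
    clique = []
    for c in pool:
        if all(compatible(c, d) for d in clique):
            clique.append(c)
    return clique

uncovered = set(conditions)
tests = []
while uncovered:
    clique = sample_maximal_clique(uncovered)
    tests.append(dict(clique))      # one test activates the whole clique
    uncovered -= set(clique)

print(f"{len(tests)} tests cover {len(conditions)} rare conditions:")
for t in tests:
    print("  set signals:", t)
```

Each sampled maximal clique becomes one test, so the number of tests tracks the clique cover rather than the number of rare conditions.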
Title | Audio Adversarial Examples Generation with Recurrent Neural Networks |
Author | Kuei-Huan Chang, *Po-Hao Huang (National Tsing Hua University, Taiwan), Honggang Yu, Yier Jin (University of Florida, USA), Ting-Chi Wang (National Tsing Hua University, Taiwan) |
Page | pp. 488 - 493 |
Keyword | Neural network security, Adversarial attack |
Abstract | Previous methods of performing adversarial attacks against speech recognition systems often treat the problem solely as an optimization problem and require iterative updates to generate optimal solutions. Although they can achieve a high success rate, the process is too computationally heavy even with GPU acceleration. In this paper, we introduce a new type of real-time adversarial attack methodology that applies Recurrent Neural Networks (RNNs) with a two-step training process to generate adversarial examples targeting a Keyword Spotting (KWS) system. We extend our attack to the physical world by adding extra constraints to eliminate real-world distortions. |
PDF file |
Title | Database and Benchmark for Early-stage Malicious Activity Detection in 3D Printing |
Author | *Xiaolong Ma (Northeastern University, USA), Zhe Li (Syracuse University, USA), Hongjia Li (Northeastern University, USA), Qiyuan An (Virginia Polytechnic Institute and State University, USA), Qinru Qiu (Syracuse University, USA), Wenyao Xu (The State University of New York at Buffalo, USA), Yanzhi Wang (Northeastern University, USA) |
Page | pp. 494 - 499 |
Keyword | DNN, 3D printing, Detection, Dataset |
Abstract | A growing number of malicious users have sought to leverage 3D printing technology to produce unlawful tools for criminal activities. It is therefore of vital importance to enable 3D printers to identify the objects to be printed and terminate printing at an early stage if illegal objects are detected. Deep learning yields significant performance gains in object recognition tasks; however, the lack of large-scale databases in the 3D printing domain stalls the advancement of automatic illegal weapon recognition. This paper presents a new 3D printing image database, named C3PO, which comprises two subsets for different system working scenarios. We extract images from the numerical control programming code files of 22 3D models and categorize the images into 10 distinct labels. The two subsets are designed for identifying (i) the printing knowledge source (G-code) at the beginning of manufacturing and (ii) the printing procedure during manufacturing. Importantly, we demonstrate that weapons can be recognized in either scenario using deep-learning-based approaches on our proposed database. The quantitative results are promising, and further exploration of the database and of crime prevention in 3D printing remain demanding tasks. |
PDF file |
Title | EA-HRT: An Energy-Aware scheduler for Heterogeneous Real-Time systems |
Author | *Sanjay Moulik, Rishabh Chaudhary, Zinea Das (IIIT Guwahati, India), Arnab Sarkar (IIT Guwahati, India) |
Page | pp. 500 - 505 |
Keyword | Fair Scheduling, Multicore, Heterogeneous |
Abstract | Developing energy-efficient schedulers for real-time heterogeneous platforms executing periodic tasks is an onerous and computationally challenging problem. This research presents a heuristic strategy, named EA-HRT, for DVFS-based energy-aware scheduling of a set of periodic tasks executing on a heterogeneous multicore platform. It first calculates the execution demand of every task on each of the different core types. Then, it simultaneously allocates each task to the available cores and selects operating frequencies for the concerned cores such that the execution demands of all tasks are met with minimal increase in the system's energy consumption. Experimental results show that our proposed strategy not only achieves appreciable energy savings with respect to the state-of-the-art (2% to 37% on average) but also enables significant improvement in resource utilization (as high as 57%). |
PDF file |
Title | Insights and Optimizations on IR-drop Induced Sneak-Path for RRAM Crossbar-based Convolutions |
Author | *Yujie Zhu, Xue Zhao, Keni Qiu (Capital Normal University, China) |
Page | pp. 506 - 511 |
Keyword | Sneak-Path, IR-drop, RRAM crossbar |
Abstract | The RRAM crossbar structure has been proposed to accelerate convolution computation in neural networks because its current-mode weighted summation operation intrinsically matches the dominant multiply-and-accumulate (MAC) operations. However, the RRAM crossbar suffers from an inevitable IR-drop problem, which may introduce sneak paths and thus reduce both the accuracy of neural network algorithms and system reliability. This work addresses the sneak-path problem caused by IR-drop in an RRAM crossbar. We first characterize the variation distribution of the sneak path through extensive experiments, taking into account RRAM cell resistance, input voltage, and cell location in the crossbar. We then propose optimization strategies from the hardware and software perspectives, respectively, to mitigate the variations resulting from the sneak path. The experimental results show that the proposed methods can recover the accuracy of the algorithms. |
PDF file |
Title | Boosting the Profitability of NVRAM-based Storage Devices via the Concept of Dual-Chunking Data Deduplication |
Author | *Shuo-Han Chen (Academia Sinica, Taiwan), Yu-Pei Liang (National Tsing Hua University, Taiwan), Yuan-Hao Chang (Academia Sinica, Taiwan), Hsin-Wen Wei (Tamkang University, Taiwan), Wei-Kuan Shih (National Tsing Hua University, Taiwan) |
Page | pp. 512 - 517 |
Keyword | deduplication, NVRAM, storage, profitability |
Abstract | With the latest advances in non-volatile random-access memory (NVRAM), NVRAM is widely considered a mainstream candidate for next-generation storage media. NVRAM has numerous attractive features, including byte addressability, low idle energy consumption, and fast read/write access. However, owing to the high manufacturing cost of NVRAM, the incentive to deploy NVRAM in consumer electronics is reduced by profitability considerations. To resolve this profitability issue and bring the benefits of NVRAM to the design of consumer electronics, avoiding the storage of duplicate data on NVRAM becomes a crucial task for lowering the demand for, and deployment cost of, NVRAM. This observation motivates us to propose a data deduplication extended file system design (DeEXT) that boosts the profitability of NVRAM via the concept of dual-chunking data deduplication while considering the characteristics of NVRAM and duplicate data content. The proposed DeEXT was evaluated with real-world data deduplication traces, with encouraging results. |
PDF file |
Title | Black Box Search Space Profiling for Accelerator-Aware Neural Architecture Search |
Author | *Shulin Zeng, Hanbo Sun (Tsinghua University, China), Yu Xing (Tsinghua University, Xilinx inc., China), Xuefei Ning (Tsinghua University, China), Yi Shan (Xilinx inc., China), Xiaoming Chen (Chinese Academy of Sciences, China), Yu Wang, Huazhong Yang (Tsinghua University, China) |
Page | pp. 518 - 523 |
Keyword | Search Space, AI, Accelerator, Neural Architecture Search |
Abstract | Neural Architecture Search (NAS) is a promising approach to discovering good neural network architectures for given applications. Among the three basic components of a NAS system (search space, search strategy, and evaluation), prior work has mainly focused on developing different search strategies and evaluation methods. As most previous hardware-aware search space designs target CPUs and GPUs, it remains a challenge to design a suitable search space for Deep Neural Network (DNN) accelerators. Besides, the architectures and compilers of DNN accelerators vary greatly, so it is difficult to obtain a unified and accurate evaluation of DNN latency across different platforms. To address these issues, we propose a black-box profiling-based search space tuning method and further improve the latency evaluation by introducing a layer-adaptive latency correction method. Used as the first stage in our general accelerator-aware NAS pipeline, the proposed methods provide a smaller, dynamic search space with a controllable trade-off between accuracy and latency for DNN accelerators. Experimental results on CIFAR-10 and ImageNet demonstrate that our search space is effective, with up to 12.7% improvement in accuracy and a 2.2x reduction in latency, and also efficient, reducing the search time and GPU memory by up to 4.35x and 6.25x, respectively. |
PDF file |
Title | Search-free Accelerator for Sparse Convolutional Neural Networks |
Author | *Bosheng Liu, Xiaoming Chen, Yinhe Han, Ying Wang, Jiajun Li, Haobo Xu, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China) |
Page | pp. 524 - 529 |
Keyword | accelerator, sparse neural network, search-free, energy efficiency |
Abstract | We propose a sparsity-aware architecture, called Swan, which eliminates the search process for sparse CNNs under limited interconnect and bandwidth resources. The architecture comprises two parts: a MAC unit that removes the search operation from the sparsity-aware MAC calculation, and a systolic compressive dataflow that suits the MAC architecture well and extensively reuses inputs to save interconnect and bandwidth. |
PDF file |
Title | NESTA: Hamming Weight Compression-Based Neural Proc. Engine |
Author | Ali Mirzaeian, Houman Homayoun (George Mason University, USA), *Avesta Sasan (Institute for Research in Fundamental Sciences, USA) |
Page | pp. 530 - 537 |
Keyword | Neural Network Accelerator, Convolutional Neural Network, Low Power Computation, MAC, Compressor |
Abstract | In this paper, we present NESTA, a specialized neural engine that significantly accelerates the computation of convolution layers in a deep convolutional neural network while reducing the computational energy. NESTA reformats convolutions into 3 × 3 batches and uses a hierarchy of Hamming weight compressors to process each batch. In addition, when processing the convolution across multiple channels, NESTA, rather than computing the precise result of a convolution per channel, quickly computes an approximation of its partial sum and a residual value such that, if added to the approximate partial sum, it generates the accurate output. Then, instead of immediately adding the residual, it consumes the residual when processing the next batch in the Hamming weight compressors with available capacity. This mechanism shortens the critical path by avoiding the need to propagate carry signals during each round of computation and speeds up the convolution of each channel. In the last stage of computation, when the partial sum of the last channel is computed, NESTA terminates by adding the residual bits to the approximate output to generate the correct result. |
PDF file |
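The "approximate partial sum plus residual" mechanism resembles carry-save accumulation, which the sketch below illustrates with a generic 3:2 compressor on Python integers; NESTA's actual 3 × 3 batching and Hamming-weight-compressor hierarchy are not reproduced.

```python
# Deferred-carry accumulation in the spirit of "approximate partial sum +
# residual": carries are not propagated per round, only once at the end.
def csa(a, b, c):
    """3:2 compressor: returns (sum_bits, carry_bits) with a + b + c == sum + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry

def accumulate(values):
    """'approx' is the approximate partial sum, 'residual' holds the deferred
    carries that are consumed in later compression rounds; a single
    carry-propagating add at the end recovers the exact result."""
    approx, residual = 0, 0
    for v in values:
        approx, residual = csa(approx, residual, v)
    return approx + residual

partial_products = [13, 7, 22, 5, 19, 40, 3]
assert accumulate(partial_products) == sum(partial_products)
print("exact result recovered after the final add:", accumulate(partial_products))
```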
Title | Representable Matrices: Enabling High Accuracy Analog Computation for Inference of DNNs using Memristors |
Author | Baogang Zhang, Necati Uysal (University of Central Florida, USA), Deliang Fan (Arizona State University, USA), *Rickard Ewetz (University of Central Florida, USA) |
Page | pp. 538 - 543 |
Keyword | memristor, DNNs |
Abstract | Analog computing based on memristor technology is a promising solution for accelerating the inference phase of deep neural networks (DNNs). A fundamental problem is to map an arbitrary matrix to a memristor crossbar array (MCA) while maximizing the resulting computational accuracy. The state-of-the-art mapping technique is based on a heuristic that only guarantees the correct output for two input vectors. In this paper, a technique that aims to produce the correct output for every input vector is proposed, which involves specifying the memristor conductance values and a scaling factor realized by the peripheral circuitry. The key insight of the paper is that the conductance matrix realized by an MCA is only required to be proportional to the target matrix. The selection of the scaling factor between the two regulates the utilization of the programmable memristor conductance range and the representability of the target matrix. Consequently, the scaling factor is set to balance precision and value-range errors. Moreover, a technique for converting conductance values into state variables and vice versa is proposed to handle memristors with non-ideal device characteristics. Compared with the state-of-the-art technique, the proposed mapping results in 4X-9X smaller errors. These improvements translate into the classification accuracy of a seven-layer convolutional neural network (CNN) on CIFAR-10 improving from 20.5% to 71.8%. |
PDF file |
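The role of the scaling factor can be illustrated with a toy sweep: a small scale under-utilizes the conductance range (quantization error dominates), while a large scale clips large entries (range error dominates). The conductance range, level count, and mapping below are assumptions for illustration, not the paper's formulation or its non-ideality handling.

```python
# Scaling-factor selection when mapping a weight matrix onto a crossbar
# (toy model: conductances are clipped and quantized to a hypothetical range).
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(0.0, 1.0, (16, 16))            # target matrix (toy)

G_MAX, LEVELS = 1.0, 64                       # assumed conductance range / precision

def map_with_scale(W, s):
    """Program G ~= s * |W| (clipped, quantized); the peripheral circuitry is
    assumed to divide the analog result back by s."""
    G = np.clip(np.abs(W) * s, 0.0, G_MAX)
    G = np.round(G / G_MAX * (LEVELS - 1)) / (LEVELS - 1) * G_MAX
    return np.sign(W) * G / s                 # effective realized matrix

# Sweep the scaling factor and pick the one balancing the two error sources.
scales = np.linspace(0.05, 2.0, 60)
errors = [np.linalg.norm(map_with_scale(W, s) - W) for s in scales]
best = scales[int(np.argmin(errors))]
print(f"best scaling factor ~ {best:.2f}, Frobenius error {min(errors):.3f}")
```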
Title | Reliability-Oriented IEEE Std. 1687 Network Design and Block-Aware High-Level Synthesis for MEDA Biochips |
Author | Zhanwei Zhong, Tung-Che Liang, *Krishnendu Chakrabarty (Duke University, USA) |
Page | pp. 544 - 549 |
Keyword | Biochip, MEDA, Reliability, Synthesis, IJTAG |
Abstract | A digital microfluidic biochip (DMFB) enables miniaturization of immunoassays, point-of-care clinical diagnostics, DNA sequencing, and other laboratory procedures in biochemistry. A recent generation of biochips uses a micro-electrode-dot-array (MEDA) architecture, which provides fine-grained control of droplets and seamlessly integrates microelectronics and microfluidics using CMOS technology. To ensure that bioassays are carried out on MEDA biochips efficiently, high-level synthesis algorithms have recently been proposed. However, as in the case of conventional DMFBs, microelectrodes are likely to fail when they are heavily utilized, and previous methods fail to consider reliability issues. In this paper, we present the design of an IEEE Std. 1687 (IJTAG) network and a block-aware high-level synthesis method that can effectively alleviate reliability problems in MEDA biochips. A comprehensive set of simulation results demonstrate the effectiveness of the proposed method. |
PDF file |
Title | Optimization of Fluid Loading on Programmable Microfluidic Devices for Bio-protocol Execution |
Author | Satoru Maruyama (Ritsumeikan University, Japan), Debraj Kundu (Indian Institute of Technology Roorkee, India), *Shigeru Yamashita (Ritsumeikan University, Japan), Sudip Roy (Indian Institute of Technology Roorkee, India) |
Page | pp. 550 - 555 |
Keyword | Programmable Microfluidic Device, Fluid Loading |
Abstract | Recently, the Programmable Microfluidic Device (PMD) has attracted the attention of the design automation community as a new type of microfluidic biochip. For the design of PMD chips, one of the important tasks is to minimize the number of flows required to load the reactant fluids into specific cells (by creating flows of the fluids) before the bio-protocol is executed. Despite its importance, this problem has received almost no attention. Thus, in this paper, we intensively study the fluid loading problem in PMD chips. First, we formulate the problem as a constraint satisfaction problem (CSP), solving it optimally for the first time. Then, we propose an efficient heuristic, called the Determining Flows from the Last (DFL) method, for larger problem instances. DFL is based on the novel idea that it is better to determine the flows starting from the last flow, unlike the state-of-the-art Fluid Loading Algorithm for PMD (FLAP) [Gupta et al., TODAES, 2019]. Simulation results confirm that the exact method finds optimal solutions for practical test cases, whereas our heuristic finds near-optimal solutions that are better than those obtained by FLAP. |
PDF file |
Title | An FPGA based Network Interface Card with Query Filter for Storage Nodes of Big Data Systems |
Author | Ying Li, *Jinyu Zhan, Wei Jiang, Junting Wu (University of Electronic Science and Technology of China, China), Jianping Zhu (Tencent Technology Shenzhen Co., Ltd, China) |
Page | pp. 556 - 561 |
Keyword | Storage and computing separated big data systems, Query filter, Network Interface Card, FPGA |
Abstract | In this paper, we are interested in improving data processing in Big Data systems with separated storage and computing. We propose a Field Programmable Gate Array (FPGA)-based Network Interface Card with Query Filter (NIC-QF) to accelerate the data query efficiency of storage nodes and reduce the workload of computing nodes and the communication overhead between them. NIC-QF, designed with a PCIe core, a query filter, and NIC communication logic, can filter the original data on storage nodes as an implicit coprocessor and directly send the filtered data to the computing nodes of Big Data systems. The filter units in the query filter can perform multiple SQL tasks in parallel, and each filter unit is internally pipelined, which further speeds up data processing. Filter units can be designed to support general SQL queries on different data formats, and we implement two schemes for TextFile and RCFile separately. Based on the TPC-H benchmark and a Tencent data set, we conduct extensive experiments to evaluate our design, which is on average up to 46.91% faster than the traditional approach. |
PDF file |
Title | Nonvolatile and Energy-Efficient FeFET-Based Multiplier for Energy-Harvesting Devices |
Author | *Mengyuan Li (University of Notre Dame, USA), Xunzhao Yin (Zhejiang University, China), Xiaobo Sharon Hu (University of Notre Dame, USA), Cheng Zhuo (Zhejiang University, China) |
Page | pp. 562 - 567 |
Keyword | FeFET, Multiplier, Nonvolatile |
Abstract | Energy-harvesting internet-of-things devices must deal with unstable power input, and nonvolatile processors (NVPs) can offer an effective solution. Compact and low-energy arithmetic circuits that can efficiently switch between computation and backup operations are highly desirable for NVP design. This paper introduces a nonvolatile FeFET-based multiplier with the ability to continue a calculation after a power outage. Simulation results show that the proposed design saves up to 21% and 19% area compared with conventional 4-bit and 8-bit CMOS-based sequential multipliers, respectively. |
PDF file |
Title | Modulo Scheduling with Rational Initiation Intervals in Custom Hardware Design |
Author | *Patrick Sittel (University of Kassel, Germany), John Wickerson (Imperial College London, UK), Martin Kumm (University of Applied Sciences Fulda, Germany), Peter Zipf (University of Kassel, Germany) |
Page | pp. 568 - 573 |
Keyword | Modulo Scheduling, High-level Synthesis, Design Space Exploration, Computer-aided Design |
Abstract | In modulo scheduling, the number of clock cycles between successive inputs (the initiation interval, II) is traditionally an integer, but in this paper, we explore the benefits of allowing it to be a rational number. This rational II can be interpreted as the average number of clock cycles between successive inputs. As the minimum rational II can be less than the minimum integer II, this translates to higher throughput. We formulate rational-II modulo scheduling as an integer linear programming (ILP) problem that is able to find latency-optimal schedules for a fixed rational II. We have applied our scheduler to a standard benchmark of hardware designs, and our results demonstrate a significant speedup compared to state-of-the-art integer-II and rational-II formulations. |
PDF file |
Title | HL-Pow: A Learning-Based Power Modeling Framework for High-Level Synthesis |
Author | *Zhe Lin, Jieru Zhao (Hong Kong University of Science and Technology, Hong Kong), Sharad Sinha (Indian Institute of Technology Goa, India), Wei Zhang (Hong Kong University of Science and Technology, Hong Kong) |
Page | pp. 574 - 580 |
Keyword | power modeling, design space exploration, machine learning, high-level synthesis |
Abstract | High-level synthesis (HLS) enables designers to customize hardware designs efficiently. However, it is still challenging to foresee the power consumption of HLS-based applications at an early design stage. To overcome this problem, we introduce HL-Pow, a power modeling framework for FPGA HLS based on state-of-the-art machine learning techniques. HL-Pow incorporates an automated feature construction flow that efficiently identifies and extracts features exerting a major influence on power consumption, based solely on HLS results, and a modeling flow that builds an accurate and generic power model applicable to a variety of HLS designs. By using HL-Pow, the power evaluation process for FPGA designs can be significantly expedited because the power inference of HL-Pow is based on HLS results instead of the time-consuming register-transfer-level (RTL) implementation flow. Experimental results demonstrate that HL-Pow achieves accurate power modeling that is only 4.67% (24.02 mW) away from onboard power measurements. To further facilitate power-oriented optimizations, we describe a novel design space exploration (DSE) algorithm built on top of HL-Pow to trade off latency against power consumption. This algorithm reaches a close approximation of the real Pareto frontier while only requiring the HLS flow to be run for 20% of the design points in the entire design space. |
PDF file |
Title | DRiLLS: Deep Reinforcement Learning for Logic Synthesis |
Author | *Abdelrahman Hosny, Soheil Hashemi (Brown University, USA), Mohamed Shalan (The American University in Cairo, Egypt), Sherief Reda (Brown University, USA) |
Page | pp. 581 - 586 |
Keyword | reinforcement learning, logic synthesis, parameter tuning, optimization |
Abstract | Logic synthesis requires extensive tuning of the synthesis optimization flow, where the quality of results (QoR) depends on the sequence of optimizations used. Efficient design space exploration is challenging due to the exponential number of possible optimization permutations. Therefore, automating the optimization process is necessary. In this work, we propose a novel reinforcement-learning-based methodology that navigates the optimization space without human intervention. We demonstrate the training of an Advantage Actor Critic (A2C) agent that seeks to minimize area subject to a timing constraint. Using the proposed methodology, designs can be optimized autonomously with no humans in the loop. Evaluation on the comprehensive EPFL benchmark suite shows that the agent outperforms existing exploration methodologies and improves QoR by an average of 13%. |
PDF file |
Title | Lightening Asynchronous Pipeline Controller Through Resynthesis and Optimization |
Author | *Jeongwoo Heo, Taewhan Kim (Seoul National University, Republic of Korea) |
Page | pp. 587 - 592 |
Keyword | Asynchronous, Resource, Synthesis, Optimization, Timing |
Abstract | A bundled-data asynchronous circuit is a promising alternative to a synchronous circuit for implementing high-performance, low-power systems, but it requires deploying special circuitry to support asynchronous communication between every pair of consecutive pipeline stages. This work addresses the problem of reducing the size of the asynchronous pipeline controller. Lightening the pipeline controller directly impacts two critical domains: (1) it mitigates the increase in controller area caused by high process-voltage-temperature variation in the circuit, and (2) it proportionally reduces the leakage power. (Note that a long delay in the circuit between pipeline stages requires a long chain of delay elements in the controller.) Specifically, we analyze the setup timing paths in the conventional asynchronous pipeline controller and (i) resynthesize new setup timing paths, which allows some of the expensive delay elements to be shared among the paths while assuring communication correctness. Then, we (ii) optimally solve the problem of minimizing the number of delay elements by formulating it as a linear program. For a set of test circuits with a 45nm standard cell library, our synthesis and optimization method reduces the total area of delay elements and the leakage power of the pipeline controller by 46.4% and 43.6% on average, respectively, while maintaining the same level of performance and dynamic power consumption. |
PDF file |
Title | WEID: Worst-case Error Improvement in Approximate Dividers |
Author | *Hassaan Saadat (University of New South Wales, Sydney, Australia), Haris Javaid (Xilinx, Singapore), Aleksandar Ignjatovic, Sri Parameswaran (University of New South Wales, Sydney, Australia) |
Page | pp. 593 - 598 |
Keyword | Approximate, Divider, Worst-case, Error |
Abstract | Approximate integer dividers suffer from unreasonably high worst-case relative errors (such as 50% or 100%), which can adversely affect the application-level output. In this paper, we propose WEID, which is a novel lightweight method to improve the worst-case relative errors in approximate integer dividers. We first present an in-depth analysis to gain insights into the cause of the high worst-case relative error. Based on our insights, we propose a novel method to detect when an error occurs in an approximate divider, and modify the output to reduce the error. Further, we present the hardware realization of WEID method and demonstrate that it can be generically coupled with several state-of-the-art approximate dividers. Our results show that for 32-by-16 dividers, WEID reduces worst-case relative errors from 100% to ~20%, while still achieving ~80% and ~70% reduction in delay and energy compared to an accurate array divider. |
PDF file |
Title | Small-Area and Low-Power FPGA-Based Multipliers using Approximate Elementary Modules |
Author | *Yi Guo, Heming Sun, Shinji Kimura (Waseda University, Japan) |
Page | pp. 599 - 604 |
Keyword | Approximate computing, Multiplier, FPGA-based, Low power, Small area |
Abstract | This paper presents a novel methodology for designing approximate multipliers on FPGA-based fabrics. Area and latency are significantly reduced by cutting the carry propagation path in the multiplier. Moreover, we explore higher-order multipliers in the architectural space by using our proposed small-size approximate multipliers as elementary modules. Eight configurations of the approximate 8×8 multiplier are discussed for different accuracy requirements. In terms of mean relative error distance (MRED), the accuracy loss of the proposed 8×8 multiplier is as low as 0.17%. Compared with the exact multiplier, our proposed design reduces area by 43.66% and power by 20.36%, and the critical-path latency reduction is up to 27.66%. The proposed multiplier design offers a better accuracy-hardware tradeoff than other designs with comparable accuracy. |
PDF file |
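The accuracy impact of cutting the carry propagation path can be illustrated with a generic carry-truncation model; the cut position and the OR-based compression below are assumptions for illustration and do not reproduce the paper's LUT/carry-chain elementary modules or its 0.17% MRED figure.

```python
# Generic "cut the carry propagation in the lower columns" approximate 8x8
# multiplier: lower partial-product columns are compressed with OR gates so
# no carries cross the cut; columns above the cut are summed exactly.
def approx_mult8(a, b, cut=8):
    pp_bit = lambda i, c: (a >> (c - i)) & (b >> i) & 1 if 0 <= c - i <= 7 else 0
    result = 0
    for c in range(15):                               # product columns 0..14
        bits = [pp_bit(i, c) for i in range(8)]
        if c < cut:
            result += (1 if any(bits) else 0) << c    # OR-compressed, carry-free
        else:
            result += sum(bits) << c                  # exact above the cut
    return result

# Mean relative error distance (MRED) over all nonzero operand pairs.
total, count = 0.0, 0
for a in range(1, 256):
    for b in range(1, 256):
        exact = a * b
        total += abs(approx_mult8(a, b) - exact) / exact
        count += 1
print(f"MRED of the toy carry-cut multiplier: {100 * total / count:.2f}%")
```

Moving the cut lower trades accuracy for fewer carry chains, which is the same knob the paper's eight 8×8 configurations expose.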
Title | LeAp: Leading-one Detection-based Softcore Approximate Multipliers with Tunable Accuracy |
Author | *Zahra Ebrahimi, Salim Ullah, Akash Kumar (Technische Universität Dresden, Germany) |
Page | pp. 605 - 610 |
Keyword | Field-Programmable Gate Arrays, Approximate Multiplier, Mitchell’s Algorithm, Energy-Efficiency, Area-Optimization |
Abstract | Approximate multipliers are ubiquitous on ASIC platforms used by diverse application domains. However, comparable resource gains are not obtained when these techniques are applied directly to FPGA platforms. We propose LeAp, an area-, throughput-, and energy-efficient approximate multiplier for FPGAs that efficiently utilizes 6-LUTs and fast carry chains to implement Mitchell's algorithm. Moreover, through three novel lightweight error-refinement schemes, we boost the accuracy to >99%. Experimental results from Vivado, an ANN, and image processing applications indicate 69.7%, 14.7%, 42.1%, and 37.1% improvements in area, throughput, power, and energy, respectively. |
PDF file |
Title | Scaled Population Arithmetic for Efficient Stochastic Computing |
Author | *He Zhou, Sunil P. Khatri, Jiang Hu (Texas A&M University, USA), Frank Liu (IBM Research - Austin, USA) |
Page | pp. 611 - 616 |
Keyword | scaled population, approximate computation |
Abstract | We propose a new Scaled Population (SP) based arithmetic computation approach that achieves considerable improvements over existing stochastic computing (SC) techniques. First, SP arithmetic introduces scaling operations that significantly reduce numerical errors compared to SC; experiments show accuracy improvements of 6.3x and 4.0x for a single multiplication and addition operation, respectively. Second, SP arithmetic removes the inherent serialization associated with stochastic computing, thereby significantly improving computational delay. We design each SP arithmetic operation to take O(1) gate delays and eliminate the need to iterate serially over the bits of the population vector. Our SP approach improves area, delay, and power compared with conventional stochastic computing on an FPGA-based implementation. We also apply our SP scheme to a handwritten digit recognition application (MNIST), improving the recognition accuracy by 32.79% compared to SC. |
PDF file |
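For context, the sketch below shows the conventional stochastic-computing baseline that SP improves upon: multiplication as a bitwise AND of random bitstreams and scaled addition as a multiplexer. The stream length and test values are arbitrary; SP's own scaling operations are not reproduced here.

```python
# Baseline stochastic-computing (SC) multiply and scaled add -- the scheme
# whose estimation error and bit-serial latency SP arithmetic targets.
import numpy as np

rng = np.random.default_rng(3)
N = 256                                   # bitstream / population length

def to_stream(p):
    """Unipolar encoding: each bit is 1 with probability p (value = mean of bits)."""
    return (rng.random(N) < p).astype(np.uint8)

def sc_mult(x, y):                        # AND of independent streams ~ product
    return x & y

def sc_add(x, y):                         # MUX selection ~ (x + y) / 2 (scaled add)
    sel = rng.random(N) < 0.5
    return np.where(sel, x, y)

a, b = 0.8, 0.4
sa, sb = to_stream(a), to_stream(b)
print("multiply:", sc_mult(sa, sb).mean(), "vs exact", a * b)
print("scaled add:", sc_add(sa, sb).mean(), "vs exact", (a + b) / 2)
# The random fluctuation around the exact values, and the need to stream N
# bits serially in hardware, are exactly what the SP approach reduces.
```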
Title | (Invited Paper) Soft Error and Its Countermeasures in Terrestrial Environment |
Author | *Masanori Hashimoto (Osaka University, Japan), Wang Liao (Kochi University of Technology, Japan) |
Page | pp. 617 - 622 |
Keyword | soft error, SRAM, ECC, processor, GPU |
Abstract | This paper discusses soft errors in digital chips consisting of SRAM, flip-flops, and combinational logic in the terrestrial environment. We review the effectiveness of error-correction coding (ECC) in processor systems and point out the importance of radiation-hardened flip-flops for further error mitigation. The discussion covers the difference between planar and FD-SOI transistors and the types of secondary cosmic rays, including neutrons and muons, using irradiation test results. The difficulty of characterizing the SER of a commercial GPU chip is also exemplified. |
PDF file |
Title | (Invited Paper) Timing Resilience for Efficient and Secure Circuits |
Author | Grace Li Zhang, Michaela Brunner, *Bing Li, Georg Sigl, Ulf Schlichtmann (Technical University of Munich, Germany) |
Page | pp. 623 - 628 |
Keyword | timing, process variations, circuit resilience, anti-counterfeiting, netlist security |
Abstract | In this paper, we will cover several techniques that can enhance the timing resilience of digital circuits. Using post-silicon tuning components, the clock arrival times at flip-flops can be modified after manufacturing to balance delays between flip-flops. The actual delay properties of flip-flops will be examined to exploit the natural flexibility of such components. Wave-pipelining paths spanning several flip-flop stages can be integrated into a synchronous design to improve circuit performance and reduce area. In addition, with this technique, it can no longer be taken for granted that all the combinational paths in a circuit work with respect to one clock period. Therefore, a netlist alone does not represent all the design information. This feature enables the potential to embed wave-pipelining paths into a circuit to increase the complexity of reverse engineering. In order to replicate a design, attackers then have to identify the locations of the wave-pipelining paths in addition to the netlist extracted from reverse engineering. Thus, the security of the circuit against counterfeiting can be improved. |
PDF file |
Title | (Invited Paper) Run-Time Enforcement of Non-Functional Application Requirements in Heterogeneous Many-Core Systems |
Author | Jürgen Teich, Behnaz Pourmohseni, *Oliver Keszocze, Jan Spieck, Stefan Wildermann (Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany) |
Page | pp. 629 - 636 |
Keyword | run-time enforcement, many-core systems, reliability, realtime |
Abstract | For many embedded applications, non-functional requirements such as safety, reliability, and execution time must be guaranteed within tight bounds on a given multi-core platform. Here, jitter in non-functional program execution qualities is caused either by external influences, such as faults injected by the environment, or by the system management software itself, including thread-to-core mapping, scheduling, and power management. A second large source of variability typically stems from data-dependent workloads. In this paper, we classify and present techniques to enforce non-functional execution properties on multi-core platforms. Based on a static design space exploration and an analysis of the influences of variability on non-functional properties, enforcement strategies are generated to guide the execution of periodically executed applications within given requirement corridors. Using the case study of a complex image streaming application, we show that by controlling the DVFS settings of cores proactively, not only tight execution times but also reliability requirements may be enforced dynamically while minimizing energy consumption. |
PDF file |
Title | (Invited Paper) NCFET to Rescue Technology Scaling: Opportunities and Challenges |
Author | *Hussam Amrouch, Victor M. van Santen (Karlsruhe Institute of Technology, Germany), Girish Pahwa (Indian Institute of Technology Kanpur, India), Yogesh Chauhan (Indian Institute of Technology Kanpur, India), Jörg Henkel (Karlsruhe Institute of Technology (KIT), Germany) |
Page | pp. 637 - 644 |
Keyword | Negative Capacitance, NCFET, Emerging technology, Beyond CMOS, FinFET |
Abstract | The Negative Capacitance Field Effect Transistor (NCFET) is one of the promising emerging technologies that may overcome the fundamental limits of conventional CMOS technology. Since NCFET features a ferroelectric (FE) layer within the transistor's gate, which internally amplifies the voltage, NCFET can operate at a lower voltage while sustaining performance, yielding considerable energy savings. In this work, we raise awareness that n- and p-NCFET transistors are asymmetrically affected by the FE layer and show, for the first time, how this asymmetry results in unbalanced circuit performance (e.g., longer fall than rise propagation delays and reduced noise margins). |
PDF file |
Title | (Invited Paper) Parallelism in Deep Learning Accelerators |
Author | *Linghao Song, Fan Chen, Yiran Chen, Hai (Helen) Li (Duke University, USA) |
Page | pp. 645 - 650 |
Keyword | Parallelism, Deep Learning Accelerators |
Abstract | Deep learning is the core of artificial intelligence, and it achieves state-of-the-art results in a wide range of applications. The intensity of computation and data in deep learning processing poses significant challenges to conventional computing platforms. Thus, specialized accelerator architectures have been proposed for the acceleration of deep learning. In this paper, we classify the design space of current deep learning accelerators into three levels: (1) processing engine, (2) memory, and (3) accelerator, and present a constructive view from the perspective of parallelism at these three levels. |
PDF file |
Title | (Invited Paper) Software-Based Memory Analysis Environments for In-Memory Wear-Leveling |
Author | *Christian Hakert, Kuan-Hsun Chen, Mikail Yayla, Georg von der Brüggen, Sebastian Blömeke, Jian-Jia Chen (TU Dortmund, Germany) |
Page | pp. 651 - 658 |
Keyword | non-volatile memory, wear-leveling, system simulation |
Abstract | Emerging non-volatile memory (NVM) architectures are considered a replacement for DRAM and storage in the near future, since NVMs provide low power consumption, fast access speed, and low unit cost. Due to the lower write endurance of NVMs, several in-memory wear-leveling techniques have been studied over the last years. Since most approaches propose or rely on specialized hardware, the techniques are often evaluated based on assumptions and in-house simulations rather than on real systems. To address this issue, we develop a setup consisting of a gem5 instance and an NVMain2.0 instance, which simulates an entire system (CPU, peripherals, etc.) together with an NVM plugged into the system. Taking memory access patterns recorded from low-level simulation into account when designing and optimizing wear-leveling techniques as operating-system services allows a cross-layer design of wear leveling. With the insights gathered by analyzing the recorded memory access patterns, we develop a software-only wear-leveling solution that does not require special hardware at all. This algorithm is then evaluated through full-system simulation. |
PDF file |
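A software-only wear-leveling policy can be as simple as periodically remapping hot logical pages to cold physical frames. The sketch below is a generic illustration with invented sizes and thresholds; it is not the algorithm derived from the gem5/NVMain2.0 analysis in the paper.

```python
# Minimal software-only wear leveling by periodic hot/cold page swapping
# (generic sketch; sizes, the swap period, and the workload are invented).
import random
random.seed(4)

PAGES = 64
mapping = list(range(PAGES))          # logical page -> physical frame
writes = [0] * PAGES                  # per-physical-frame write counter
SWAP_PERIOD = 500

def write(logical_page, tick):
    frame = mapping[logical_page]
    writes[frame] += 1
    if tick % SWAP_PERIOD == 0:
        rebalance()

def rebalance():
    """Swap the logical pages mapped to the most- and least-worn frames so
    future writes to the hot page land on the cold frame (data copy omitted)."""
    hot = writes.index(max(writes))
    cold = writes.index(min(writes))
    l_hot, l_cold = mapping.index(hot), mapping.index(cold)
    mapping[l_hot], mapping[l_cold] = cold, hot

# Skewed workload: 90% of writes hit roughly 10% of the logical pages.
for tick in range(1, 50_001):
    page = random.randrange(6) if random.random() < 0.9 else random.randrange(PAGES)
    write(page, tick)

print("max/mean frame wear:", max(writes), sum(writes) / PAGES)
```

Because only the logical-to-physical mapping table is touched, such a policy can live entirely in the operating system, which is the point of a software-only approach.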
Title | (Invited Paper) Theory of Ising Machines and a Common Software Platform for Ising Machines |
Author | *Shu Tanaka (Waseda University, Japan), Yoshiki Matsuda (Fixstars, Japan), Nozomu Togawa (Waseda University, Japan) |
Page | pp. 659 - 666 |
Keyword | Ising machine, Combinatorial Optimization Problem, Quantum annealing, Ising Model |
Abstract | Ising machines are a new type of non-von Neumann computer that specializes in solving combinatorial optimization problems efficiently. The input to an Ising machine is the energy function of the Ising model or its quadratic unconstrained binary optimization (QUBO) form, and the machine searches for a configuration that minimizes this energy function. We describe the theory of Ising machines, their present status, software for Ising machines, and applications using Ising machines. |
PDF file |
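The input form an Ising machine accepts, and the kind of search it performs, can be illustrated in software: the sketch below evaluates the Ising energy E(s) = -Σ_{i<j} J_ij s_i s_j - Σ_i h_i s_i for spins s_i ∈ {-1, +1} and minimizes it with a tiny simulated-annealing loop. The couplings and cooling schedule are arbitrary; no particular machine is modeled.

```python
# Ising energy function plus a tiny simulated-annealing search as a
# software stand-in for a hardware annealer (toy problem, random couplings).
import numpy as np

rng = np.random.default_rng(5)
n = 20
J = np.triu(rng.normal(0, 1, (n, n)), 1)       # couplings J_ij for i < j
h = rng.normal(0, 1, n)                        # local fields h_i

def energy(s):                                 # E(s) = -sum J_ij s_i s_j - sum h_i s_i
    return -(s @ J @ s) - h @ s

s = rng.choice([-1, 1], n)
cur_e = energy(s)
best_s, best_e = s.copy(), cur_e
steps = 20_000
for step in range(steps):
    T = max(2.0 * (1 - step / steps), 1e-3)    # linear cooling schedule
    i = rng.integers(n)
    s[i] *= -1                                 # propose a single spin flip
    new_e = energy(s)                          # recomputed naively for clarity
    if new_e <= cur_e or rng.random() < np.exp((cur_e - new_e) / T):
        cur_e = new_e                          # accept the flip
        if cur_e < best_e:
            best_s, best_e = s.copy(), cur_e
    else:
        s[i] *= -1                             # reject: undo the flip
print("best Ising energy found:", best_e)
```

A combinatorial problem is solved on such machines by encoding its constraints and objective into J and h and then reading the spin configuration returned by the search.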
Title | (Invited Paper) Digital Annealer for High-Speed Solving of Combinatorial Optimization Problems and Its Applications |
Author | *Satoshi Matsubara, Motomu Takatsu, Toshiyuki Miyazawa, Takayuki Shibasaki, Yasuhiro Watanabe, Kazuya Takemoto, Hirotaka Tamura (Fujitsu Laboratories LTD., Japan) |
Page | pp. 667 - 672 |
Keyword | combinatorial optimization problem, Ising model, Markov Chain Monte Carlo, Digital Annealer, benchmark |
Abstract | Digital Annealer is a dedicated architecture for high-speed solving of combinatorial optimization problems mapped to an Ising model. Digital Annealer uses Markov Chain Monte Carlo as a basic search mechanism, accelerated by the hardware implementation of multiple speed-enhancement techniques. It is currently being offered as a cloud service using a second-generation chip operating on a scale of 8,192 bits. This paper presents an overview of Digital Annealer, its performance against benchmarks, and application examples. |
PDF file |
Title | (Invited Paper) CMOS Annealing Machine: A Domain-Specific Architecture for Combinatorial Optimization Problem |
Author | *Chihiro Yoshimura, Masato Hayashi, Takashi Takemoto, Masanao Yamaoka (Hitachi, Ltd., Japan) |
Page | pp. 673 - 678 |
Keyword | Domain-specific architecture, Combinatorial optimization problem, In-memory computing, Ising model, FPGA |
Abstract | Domain-specific architectures are being studied to improve computer performance beyond the end of Moore's Law. Here, we propose a new computing architecture, the CMOS annealing machine, which provides a fast means of solving combinatorial optimization problems. Our architecture is based on in-memory computing and exploits the locality of interactions in the Ising model. The prototype presented in 2019 has two processors on a business-card-sized board and solves problems 55 times faster than conventional computers. |
PDF file |
Title | (Designers' Forum) AI Chips, What's Next: Architecture, Tools, and Methodology |
Author | Shan Tang (AI chip expert, China) |
Abstract | In recent years, AI chips have been developed by IC vendors, tech giants, and startups to meet the huge requirements of AI applications. To provide computation power more efficiently for the AI domain, different types of hardware architectures are being explored and optimized. At the same time, software technology is also evolving to make the best use of these monsters. As the first generation of AI chips matures, it is a good time to discuss what may happen in the coming years. |
Title | (Designers' Forum) Computing-in-Memory SoC Chip for Neural Network Inference |
Author | Shaodi Wang (Witin Tech, China) |
Abstract | Neural networks (NNs) have been widely employed in modern artificial intelligence (AI) systems due to their unprecedented capability in classification, recognition, and detection. However, the massive data communication between the processing units and the memory has proven to be the main bottleneck in improving the efficiency of NN-based hardware. Furthermore, the significant power demand of massive addition and multiplication limits adoption in edge devices, and cost is another major concern for an edge device. WITIN Tech has developed edge neural processing chips with analog computing-in-memory technology, simultaneously achieving low power, high performance, and low cost. The first two products, MemCore001 and MemCore101, were released to customers in November 2019. They achieve 8-bit 10 Gops performance at 1 mW of power, outperforming state-of-the-art AI voice chips on the market by 50X, and satisfy the urgent needs of the fast-growing IoT market. |
Title | (Designers' Forum) Enabling Data Center-Wide Accelerator Resource Pools for AI Applications |
Author | Kun Wang (VirtAI Tech, China) |
Abstract | As AI technologies evolve rapidly, the amount of AI compute has grown by more than 300,000x since 2012. Today the major providers of AI compute are accelerators such as GPUs, FPGAs, and ASICs. However, most users use these accelerators exclusively, which results in low accelerator utilization and high costs. With VirtAI Tech's innovative accelerator virtualization technologies, the OrionX Computing Platform (OXCP) helps customers build data-center-wide accelerator resource pools and enables customer applications to run on, and share, any accelerators on any servers in a data center. OXCP not only significantly increases accelerator utilization and reduces costs, but also makes application deployment much easier. |