Details of a Researcher - KIMURA, Shinji

KIMURA, Shinji

Scopus Paper Info

Paper Count: 122 Citation Count: 810 h-index: 15

The data was downloaded from the Scopus API on May 05, 2026, via http://api.elsevier.com and http://www.scopus.com.

Affiliation

Faculty of Science and Engineering, Graduate School of Information, Production, and Systems

Job title

Professor

Degree

Doctor of Engineering ( Kyoto University )

Homepage URL

http://www.f.waseda.jp/shinji_kimura/

Research Experience

2002 - Now: Professor at Waseda University
1993 - 2002: Associate Professor at Nara Institute of Science and Technology
1985 - 1993: Assistant Professor at Dept. of Electric Engineering, Kobe University

Education Background

Until 1985: Kyoto University, Graduate School of Engineering, Doctor Course on Information Engineering
Until 1984: Kyoto University, Graduate School of Engineering, Master Course on Information Engineering
Until 1982: Kyoto University, Faculty of Engineering

Professional Memberships

Association for Computing Machinery
IPSJ
IEICE
IEEE
The 14th Workshop on Synthesis And System Integration of Mixed Information technologies
The 15th Workshop on Synthesis And System Integration of Mixed Information technologies
VLSI Design Technologies WG, IEICE
Information Processing Society of Japan
International Conference on Computer Aided Design
Asia and South Pacific Design Automation Conference

Research Areas

Electron device and electronic equipment / Computer system

Research Interests

Logic Circuit Design and Verification, High-level Synthesis and Verification, Electronic Design Automation, LSI

Awards

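Certificate of Appreciation for Editorial Activities (2012.09)
Nikkei BP, LSI IP Design Award, IP Prize (2000)
Asian South-Pacific Design Automation Conference, University LSI Design Contest (2000)
Nikkei BP, LSI IP Design Award, IP Prize (1999)
Information Processing Society of Japan, 45th National Convention Encouragement Award (1993.03)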

Papers

Accuracy-Configurable Low-Power Approximate Floating-Point Multiplier Based on Mantissa Bit Segmentation.

Jie Li, Yi Guo, Shinji Kimura

2020 IEEE Region 10 Conference(TENCON) 1311 - 1316 2020

Citations (Scopus): 13
Approximate FPGA-Based Multipliers Using Carry-Inexact Elementary Modules.

Yi Guo, Heming Sun, Ping Lei, Shinji Kimura

IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 103-A ( 9 ) 1054 - 1062 2020

Citations (Scopus): 3
Small-Area and Low-Power FPGA-Based Multipliers using Approximate Elementary Modules.

Yi Guo, Heming Sun, Shinji Kimura

Proc. of ASP-DAC 2020 599 - 604 2020 [Refereed]

Citations (Scopus): 28
Energy-Efficient and High-Speed Approximate Signed Multipliers with Sign-Focused Compressors.

Yi Guo, Heming Sun, Shinji Kimura

Proc. of 2019 32nd IEEE International System-on-Chip Conference (SOCC) 330 - 335 2019 [Refereed]

Citations (Scopus): 7
Approximate Multiplier Using Reordered 4-2 Compressor with OR-based Error Compensation.

Yufeng Xu, Yi Guo, Shinji Kimura

Proc. of 2019 IEEE 13th International Conference on ASIC (ASICON) 1 - 4 2019 [Refereed]

Citations (Scopus): 8
Approximate DCT Design for Video Encoding Based on Novel Truncation Scheme.

Heming Sun, Zhengxue Cheng, Amir Masoud Gharehbaghi, Shinji Kimura, Masahiro Fujita

IEEE Trans. Circuits Syst. I Regul. Pap. 66-I ( 4 ) 1517 - 1530 2019 [Refereed]

Citations (Scopus): 35
Design of Low-Cost Approximate Multipliers Based on Probability-Driven Inexact Compressors.

Yi Guo, Heming Sun, Ping Lei, Shinji Kimura

IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 102-A ( 12 ) 1781 - 1791 2019 [Refereed]

Citations (Scopus): 1
Design of Power and Area Efficient Lower-Part-OR Approximate Multiplier.

Yi Guo, Heming Sun, Shinji Kimura

TENCON 2018 - 2018 IEEE Region 10 Conference(TENCON) 2110 - 2115 2018 [Refereed]

Citations (Scopus): 33
Energy-Efficient and High Performance Approximate Multiplier Using Compressors Based on Input Reordering.

Zhenhao Liu, Yi Guo, Xiaoting Sun, Shinji Kimura

TENCON 2018 - 2018 IEEE Region 10 Conference(TENCON) 545 - 550 2018 [Refereed]

Citations (Scopus): 1
Sparseness Ratio Allocation and Neuron Re-pruning for Neural Networks Compression.

Li Guo, Dajiang Zhou, Jinjia Zhou, Shinji Kimura

IEEE International Symposium on Circuits and Systems(ISCAS) 1 - 5 2018 [Refereed]

Citations (Scopus): 1
Embedded Frame Compression for Energy-Efficient Computer Vision Systems.

Li Guo, Dajiang Zhou, Jinjia Zhou, Shinji Kimura

IEEE International Symposium on Circuits and Systems(ISCAS) 1 - 5 2018 [Refereed]

Citations (Scopus): 2
A Radix-4 Partial Product Generation-Based Approximate Multiplier for High-speed and Low-power Digital Signal Processing.

Xiaoting Sun, Yi Guo, Zhenhao Liu, Shinji Kimura

25th IEEE International Conference on Electronics, Circuits and Systems(ICECS) 777 - 780 2018 [Refereed]

Citations (Scopus): 4
Sparse ternary connect: Convolutional neural networks using ternarized weights with enhanced sparsity.

Canran Jin, Heming Sun, Shinji Kimura

23rd Asia and South Pacific Design Automation Conference(ASP-DAC) 190 - 195 2018 [Refereed]

Citations (Scopus): 11
Quad-multiplier packing based on customized floating point for convolutional neural networks on FPGA.

Zhifeng Zhang, Dajiang Zhou, Shihao Wang, Shinji Kimura

23rd Asia and South Pacific Design Automation Conference(ASP-DAC) 184 - 189 2018 [Refereed]

Citations (Scopus): 6
Low-Cost Approximate Multiplier Design using Probability-Driven Inexact Compressors.

Yi Guo, Heming Sun, Li Guo, Shinji Kimura

2018 IEEE Asia Pacific Conference on Circuits and Systems(APCCAS) 291 - 294 2018 [Refereed]

Citations (Scopus): 41
Towards Ultrasound Everywhere: A Portable 3D Digital Back-End Capable of Zone and Compound Imaging.

Aya Ibrahim, Shuping Zhang, Federico Angiolini, Marcel Arditi, Shinji Kimura, Satoshi Goto, Jean-Philippe Thiran, Giovanni De Micheli

IEEE Trans. Biomed. Circuits Syst. 12 ( 5 ) 968 - 981 2018 [Refereed]

Citations (Scopus): 13
Lossy Compression for Embedded Computer Vision Systems.

Li Guo, Dajiang Zhou, Jinjia Zhou, Shinji Kimura, Satoshi Goto

IEEE Access 6 39385 - 39397 2018 [Refereed]

Citations (Scopus): 19
A Variable-Clock-Cycle-Path VLSI Design of Binary Arithmetic Decoder for H.265/HEVC.

Jinjia Zhou, Dajiang Zhou, Shuping Zhang, Shinji Kimura, Satoshi Goto

IEEE Trans. Circuits Syst. Video Technol. 28 ( 2 ) 556 - 560 2018

The next-generation 8K ultra-high-definition video format involves an extremely high bit rate, which imposes a high throughput requirement on the entropy decoder component of a video decoder. Context adaptive binary arithmetic coding (CABAC) is the entropy coding tool in the latest video coding standards including H.265/High Efficiency Video Coding and H.264/Advanced Video Coding. Due to critical data dependencies at the algorithm level, a CABAC decoder is difficult to be accelerated by simply leveraging parallelism and pipelining. This letter presents a new very-large-scale integration arithmetic decoder, which is the most critical bottleneck in CABAC decoding. Our design features a variable-clock-cycle-path architecture that exploits the differences in critical path delay and in probability of occurrence between various types of binary symbols (bins). The proposed design also incorporates a novel data-forwarding technique (rLPS forwarding) and a fast path-selection technique (coarse bin type decision), and is enhanced with the capability of processing additional bypass bins. As a result, its maximum throughput achieves 1010 Mbins/s in 90-nm CMOS, when decoding 0.96 bin per clock cycle at a maximum clock rate of 1053 MHz, which outperforms previous works by 19.1%.

Citations (Scopus): 8
Distortion control and optimization for lossy embedded compression in video codec system

Li Guo, Dajiang Zhou, Shinji Kimura, Satoshi Goto

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E100A ( 11 ) 2416 - 2424 2017.11

For mobile video codecs, the huge energy dissipation for external memory traffic is a critical challenge under the battery power constraint. Lossy embedded compression (EC), as a solution to this challenge, is considered in this paper. While previous studies in lossy EC mostly focused on algorithm optimization to reduce distortion, this work, to the best of our knowledge, is the first one that addresses the distortion control. Firstly, from both theoretical analysis and experiments for distortion optimization, a conclusion is drawn that, at the frame level, allocating memory traffic evenly is a reliable approximation to the optimal solution to minimize quality loss. Then, to reduce the complexity of decoding twice, the distortion between two sequences is estimated by a linear function of that calculated within one sequence. Finally, on the basis of even allocation, the distortion control is proposed to determine the amount of memory traffic according to a given distortion limitation. With the adaptive target setting and estimating function updating in each group of pictures (GOP), the scene change in video stream is supported without adding a detector or retraining process. From experimental results, the proposed distortion control is able to accurately fix the quality loss to the target. Compared to the baseline of negative feedback on non-referred B frames, it achieves about twice memory traffic reduction.

Fast Algorithm and VLSI Architecture of Rate Distortion Optimization in H.265/HEVC

Heming Sun, Dajiang Zhou, Landan Hu, Shinji Kimura, Satoshi Goto

IEEE TRANSACTIONS ON MULTIMEDIA 19 ( 11 ) 2375 - 2390 2017.11 [Refereed]

In H.265/high efficiency video coding (HEVC) encoding, rate distortion optimization (RDO) is an important cost function for mode decision and coding structure decision. Despite being near-optimum in terms of coding efficiency, RDO suffers from a high complexity. To address this problem, this paper presents a fast RDO algorithm and its very-large-scale integration (VLSI) implementation for both intra- and inter-frame coding. The proposed algorithm employs a quantization-free framework that significantly reduces the complexity for rate and distortion optimization. Meanwhile, it maintains a low degradation of coding efficiency by taking the syntax element organization and probability model of HEVC into consideration. The algorithm is also designed with hardware architecture in mind to support an efficient VLSI implementation. When implemented in the HEVC test model, the proposed algorithm achieves 62% RDO time reduction with 1.85% coding efficiency loss for the "all-intra" configuration. The hardware implementation achieves 1.6x higher normalized throughput relative to previous works, and it can support a throughput of 8k@30fps (for four fine-processed modes per prediction unit) with 256 k logic gates when working at 200 MHz.

Citations (Scopus): 22
Time-efficient and TSV-aware 3D gated clock tree synthesis based on self-tuning spectral clustering

Fan Yang, Minghao Lin, Heming Sun, Shinji Kimura

Midwest Symposium on Circuits and Systems 2017- 1200 - 1203 2017.09

3D gated clock tree synthesis (CTS) mainly consists of three steps: 1) abstract clock topology generation; 2) layer embedding for minimal TSV allocation; and 3) clock tree routing with gate and buffer insertion. In this paper, a self-tuning spectral clustering based nearest-neighbor selection (SSC-NNS) algorithm with parallel structure is proposed to achieve high time efficiency in clock tree topology generation. In addition, a postorder traversal based layer embedding (PTLE) strategy is adopted for determining the embedding layer of internal nodes with minimal TSV usage. Experimental results show that the proposed method achieves 32% and 82% runtime reduction on ISPD2009 and IBM benchmarks respectively compared with the state-of-the-art 3D work. Besides, the TSV count is also reduced by 46% on ISPD2009 benchmarks.

Citations (Scopus): 2
A low-cost approximate 32-point transform architecture

Heming Sun, Zhengxue Cheng, Amir Masoud Gharehbaghi, Shinji Kimura, Masahiro Fujita

Proceedings - IEEE International Symposium on Circuits and Systems 2017.09

This paper presents an area-efficient approximate method for 32-point transform which is one of the most area-consuming parts in High Efficiency Video Coding (HEVC) applications. Compared to prior literatures, this work reduces the hardware cost of transform by 1) eliminating all the arithmetic operations of 6 least significant bits (LSB), 2) presenting a low-delay method for generating carry propagation from the remaining 5 LSBs and 3) truncating the most significant bits (MSB) according to the position of component. In the implementation of a 32-point forward transform, the experimental results show that 27% area consumption can be saved and the coding efficiency loss aroused by the approximation is only 0.044% compared with the origin.

Citations (Scopus): 3
Effective write-reduction method for MLC non-volatile memory

Masashi Tawada, Shinji Kimura, Masao Yanagisawa, Nozomu Togawa

Proceedings - IEEE International Symposium on Circuits and Systems 2017.09

Recently, the demand for non-volatile memory in embedded systems has increased because it can be combined with normally-off and power-gating technologies. However, non-volatile memories have lower endurance than volatile memories. When data is appropriately encoded with a write-reduction code, the endurance of non-volatile memory can be enhanced by writing the encoded data into the memory. We propose a highly effective write-reduction method for a multi-level cell (MLC) non-volatile memory focusing on the write-reduction code (WRC) as the optimal bit-write reduction method. The WRC can be applied only to single-level cell non-volatile memory. The proposed method generates a cell-write reduction code based on the WRC; the cell has multiple bits as the holdable data. Our proposed method achieves a cell-write reduction of 31.6% compared to the conventional method.

A 7-Die 3D Stacked 3840 × 2160@120 fps motion estimation processor

Shuping Zhang, Jinjia Zhou, Dajiang Zhou, Shinji Kimura, Satoshi Goto

IEICE Transactions on Electronics E100C ( 3 ) 223 - 231 2017.03 [Refereed]

In this paper, a hamburger architecture with a 3D stacked reconfigurable memory is proposed for a 4K motion estimation (ME) processor. By positioning the memory dies on both the top and bottom sides of the processor die, the proposed hamburger architecture can reduce the usage of the signal through-silicon via (TSV), and balance the power delivery network and the clock tree of the entire system. It results in 1/3 reduction of the usage of signal TSVs. Moreover, a stacked reconfigurable memory architecture is proposed to reduce the fabrication complexity and further reduce the number of signal TSVs by more than 1/2. The reduction of signal TSVs in the entire design is 71.24%. Finally, we address unique issues that occur in electronic design automation (EDA) tools during 3D large-scale integration (LSI) designs. As a result, a 4K ME processor with 7-die stacking 3D system-on-chip design is implemented. The proposed design can support real time 3840 × 2160 @ 120 fps encoding at 130 MHz with less than 540 mW.

Accelerating HEVC inter prediction with improved merge mode handling

Zhengxue Cheng, Heming Sun, Dajiang Zhou, Shinji Kimura

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E100A ( 2 ) 546 - 554 2017.02 [Refereed]

High Efficiency Video Coding (HEVC/H.265) achieves a 50% bit rate reduction over the H.264/AVC standard with comparable quality, at the cost of high computational complexity. Merge mode is one of the most important new features introduced in HEVC's inter prediction. Merge mode and traditional inter mode consume about 90% of the total encoding time. To address this high complexity, this paper utilizes the merge mode to accelerate inter prediction by four strategies. 1) A merge candidate decision is proposed by the sum of absolute transformed difference (SATD) cost. 2) An early merge termination is presented with more than 90% accuracy. 3) Due to the compensation effect of merge candidates, symmetric motion partition (SMP) mode is disabled for non-8×8 coding units (CUs). 4) A fast coding unit filtering strategy is proposed to reduce the number of CUs which need to be fine-processed. Experimental results demonstrate that our fast strategies can achieve 35.4%-58.7% time reduction with 0.68%-1.96% BD-rate increment in RA case. Compared with similar works, the proposed strategies are not only among the best performing in average-case complexity reduction, but also notably outperforming in the worst cases.

Citations (Scopus): 4
Development of TOF-PET using Compton scattering by plastic scintillators

M. Kuramoto, T. Nakamori, S. Kimura, S. Gunji, M. Takakura, J. Kataoka

Nuclear Instruments and Methods in Physics Research, Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 845 668 - 672 2017.02

We propose a time-of-flight (TOF) technique using plastic scintillators, which have a fast decay time of a few ns, for positron emission tomography (PET). While the photoelectric absorption probability of the plastic for 511 keV gamma rays is extremely low due to its small density and effective atomic number, the cross section of Compton scattering is comparable to that of absorption by conventional inorganic scintillators. We thus propose TOF-PET using Compton scattering with plastic scintillators (Compton-PET), and performed fundamental experiments towards exploration of the Compton-PET capability. We demonstrated that the plastic scintillators achieved better time resolution than LYSO(Ce) and GAGG(Ce) scintillators. In addition, we evaluated the depth-of-interaction resolving capability with the plastic scintillators.

An 8K H.265/HEVC Video Decoder Chip With a New System Pipeline Design.

Dajiang Zhou, Shihao Wang, Heming Sun, Jian-Bin Zhou, Jiayi Zhu, Yijin Zhao, Jinjia Zhou, Shuping Zhang, Shinji Kimura, Takeshi Yoshimura, Satoshi Goto

J. Solid-State Circuits 52 ( 1 ) 113 - 126 2017 [Refereed]

8K ultra-HD is being promoted as the next-generation video specification. While the High Efficiency Video Coding (HEVC) standard greatly enhances the feasibility of 8K with a doubled compression ratio, its implementation is a challenge, owing to ultrahigh-throughput requirements and increased complexity per pixel. The latter comes from the new features of HEVC. At the system level, the most challenging of them is the enlarged and highly variable-size coding/prediction/transform units, which significantly increase the requirement for on-chip memory as pipeline buffers and the difficulty in maintaining pipeline utilization. This paper presents an HEVC decoder chip featuring a system pipeline that works at a nonunified and variable granularity. The pipeline saves on-chip memory with a novel block-in-block-out queue system and a parameter delivery network, while allowing overhead-free and fully pipelined operation of the processing components. With the system pipeline design combined with various component-level optimizations, the proposed decoder in 40 nm achieves a maximum throughput of 4 Gpixels/s or 8K 120 frames/s for the low-delay-P configuration of HEVC, 7.5-55 times faster than prior works. It supports 8K 60 frames/s for the low-delay and random-access configurations. In a normalized comparison, it also shows 3.1-3.6 times better area efficiency and 31%-55% superior energy efficiency.

Citations (Scopus): 26
A Low-Power VLSI Architecture for HEVC De-Quantization and Inverse Transform

Heming Sun, Dajiang Zhou, Shuping Zhang, Shinji Kimura

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E99A ( 12 ) 2375 - 2387 2016.12 [Refereed]

In this paper, we present a low-power system for the de-quantization and inverse transform of HEVC. Firstly, we present a low-delay circuit to process the coded results of the syntax elements, and then reduce the number of multipliers from 16 to 4 for the de-quantization process of each 4x4 block. Secondly, we give two efficient data mapping schemes for the memory between de-quantization and inverse transform, and the memory for transpose. Thirdly, the zero information is utilized through the whole system. For two memory parts, the write and read operation of zero blocks/rows/coefficients can all be skipped to save the power consumption. The results show that up to 86% power consumption can be saved for the memory part under the configuration of "Random-access" and common QPs. For the logical part, the proposed architecture for de-quantization can reduce 77% area consumption. Overall, our system can support real-time coding for 8K x 4K 120 fps video sequences and the normalized area consumption can be reduced by 68% compared with the latest work.

Citations (Scopus): 1
Merge mode based fast inter prediction for HEVC

Zhengxue Cheng, Heming Sun, Dajiang Zhou, Shinji Kimura

2015 Visual Communications and Image Processing, VCIP 2015 2016.04

The latest High Efficiency Video Coding (HEVC/H.265) obtains 50% bit rate reduction than H.264/AVC standard with comparable quality, but at the cost of high computational complexity. Inter prediction accounts for large complexity and merge mode is one of the most important new features introduced in HEVC. To address this issue, this paper utilizes the merge mode to accelerate inter prediction by three fast mode decision methods. 1) A merge candidate decision is proposed to select the best merge mode by Sum of Absolute Transformed Difference (SATD) cost to reduce the merge time. 2) An early merge termination is presented still based on SATD cost with more than 90% accuracy. 3) Based on efficient merge mode, symmetric motion partition (SMP) modes can be disabled for non-8 × 8 code units (CUs). Experimental results demonstrate that our work can achieve 53.1%-54.2% time reduction on average with 1.57%-2.30% BD-rate increment. Besides, our method achieves an improvement of 18%-30% time reduction with 0.89%-2.85% BD-rate increment when combined with other existing approaches.

Citations (Scopus): 1
A-6-3 Reduction of Rewriting Routing Switches for Reconfiguration of NanoBridge Based FPGA

Aoki Kohei, Yanagisawa Masao, Kimura Shinji

Proceedings of the IEICE Engineering Sciences Society/NOLTA Society Conference 2016 77 - 77 2016.03

A 4Gpixel/s 8/10b H.265/HEVC Video Decoder Chip for 8K Ultra HD Applications

Dajiang Zhou, Shihao Wang, Heming Sun, Jianbin Zhou, Jiayi Zhu, Yijin Zhao, Jinjia Zhou, Shuping Zhang, Shinji Kimura, Takeshi Yoshimura, Satoshi Goto

2016 IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE (ISSCC) 59 266 - U369 2016 [Refereed]

8K Ultra HD is being promoted as the next-generation digital video format. From a communication channel perspective, the latest high-efficiency video coding standard (H.265/HEVC) greatly enhances the feasibility of 8K by doubling the compression ratio. Implementation of such codecs is a challenge, owing to ultra-high throughput requirements and increased complexity per pixel. The former corresponds to up to 10b/pixel, 7680×4320 pixels/frame and 120 fps - 80× larger than 1080p HD. The latter comes from the new features of HEVC relative to its predecessor H.264/AVC. The most challenging of them is the enlarged and highly variable-size coding/prediction/transform units (CU/PU/TU), which significantly increase: 1) the requirement for on-chip memory as pipeline buffers, 2) the difficulty in maintaining pipeline utilization, and 3) the complexity of inverse transforms (IT). This paper presents an HEVC decoder chip supporting 8K Ultra HD, featuring a 16 pixel/cycle true-variable-block-size system pipeline. The pipeline: 1) saves on-chip memory with a novel block-in-block-out (BIBO) queue system and a parameter delivery network, and 2) allows high design efficiency and utilization of processing components through local synchronization. Key optimizations at the component level are also presented.

Citations (Scopus): 28
Frame-Level Quality and Memory Traffic Allocation for Lossy Embedded Compression in Video Codec Systems

Li Guo, Dajiang Zhou, Shinji Kimura, Satoshi Goto

2016 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW) 2016 [Refereed]

For mobile video codecs, the huge energy dissipation for external memory traffic is a critical challenge under the battery power constraint. Lossy embedded compression (EC), as a solution to this challenge, is considered in this paper. While previous studies in EC mostly focused on compression algorithms at the block level, this work, to the best of our knowledge, is the first one that addresses the allocation of video quality and memory traffic at the frame level. For lossy EC, a main difficulty of its application lies in the error propagation from quality degradation of reference frames. Instinctively, it is preferred to perform more lossy EC in non-reference frames to minimize the quality loss. The analysis and experiments in this paper, however, will show lossy EC should actually be distributed to more frames. Correspondingly, for hierarchical-B GOPs, we developed an efficient allocation that outperforms the non-reference-only allocation by up to 4.5 dB in PSNR. In comparison, the proposed allocation also delivers more consistent quality between frames by having lower PSNR fluctuation.

DOI

Scopus

2

Citation

(Scopus)
Power-Efficient and Slew-Aware Three Dimensional Gated Clock Tree Synthesis

Minghao Lin, Heming Sun, Shinji Kimura

2016 IFIP/IEEE INTERNATIONAL CONFERENCE ON VERY LARGE SCALE INTEGRATION (VLSI-SOC) 2016 [Refereed]

This paper presents a three-dimensional (3D) gated clock tree synthesis (CTS) approach, which consists of two steps: 1) abstract tree topology generation; and 2) 3D gated and buffered clock routing. A 3D Pair Matching (3D-PM) algorithm is proposed to generate the initial tree topology, and then the proposed TSV-minimization algorithm is applied to generate a TSV-aware tree topology. Based on the TSV-aware tree topology, 3D gated and buffered clock tree routing is done using the proposed 3D Gated and Buffered Deferred-Merge Embedding (3D-GB-DME) algorithm. Slew constraint satisfaction is considered and the clock skew is minimized in our approach. Experimental results show that the proposed method achieves 29.11% power reduction compared with the state-of-the-art 2D work.

DOI

Scopus

13

Citation

(Scopus)
CNN-MERP: An FPGA-Based Memory-Efficient Reconfigurable Processor for Forward and Backward Propagation of Convolutional Neural Networks

Xushen Han, Dajiang Zhou, Shihao Wang, Shinji Kimura

PROCEEDINGS OF THE 34TH IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD) 320 - 327 2016 [Refereed]

Large-scale deep convolutional neural networks (CNNs) are widely used in machine learning applications. While CNNs involve huge complexity, VLSI (ASIC and FPGA) chips that deliver high-density integration of computational resources are regarded as a promising platform for CNN implementation. With massive parallelism of computational units, however, the external memory bandwidth, which is constrained by the pin count of the VLSI chip, becomes the system bottleneck. Moreover, VLSI solutions are usually regarded as lacking the flexibility to be reconfigured for the various parameters of CNNs. This paper presents CNN-MERP to address these issues. CNN-MERP incorporates an efficient memory hierarchy that significantly reduces the bandwidth requirements through multiple optimizations including on/off-chip data allocation, data flow optimization, and data reuse. The proposed 2-level reconfigurability, which is based on the control logic and the multiboot feature of the FPGA, is utilized to enable fast and efficient reconfiguration. As a result, an external memory bandwidth requirement of 1.94MB/GFlop is achieved, which is 55% lower than prior art. Under limited DRAM bandwidth, a system throughput of 1244GFlop/s is achieved on the Virtex UltraScale platform, which is 5.48 times higher than the state-of-the-art FPGA implementations.

DOI

Scopus

36

Citation

(Scopus)
Optimization of Area and Power in Multi-Mode Power Gating Scheme for Static Memory Elements

Xing Su, Shinji Kimura

2016 IEEE ASIA PACIFIC CONFERENCE ON CIRCUITS AND SYSTEMS (APCCAS) 214 - 217 2016 [Refereed]

This paper presents a method for optimizing the area and power of static memory elements by using a multi-mode power gating (MMPG) scheme. A 2-transistor MMPG scheme replaces the usual 5-transistor one to effectively reduce on-chip area overhead and leakage power, and is combined with trimming circuits (TC) to guarantee safe data retention. When applying the proposed approach to a clean/dirty-cache (CD-cache), we can reduce area overhead and leakage power consumption. The simulation results show that the area overhead of SRAM with the proposed approach is reduced from 33.4% to 21.8% compared to that of SRAM with the usual MMPG. In addition, leakage power is reduced by 12.35% compared to SRAM with the usual MMPG and by 86.77% compared to SRAM without a power gating scheme. Moreover, the noise immunity of SRAM with the proposed approach can also be improved.

DOI

Scopus
ECC-Based Bit-Write Reduction Code Generation for Non-Volatile Memory

Masashi Tawada, Shinji Kimura, Masao Yanagisawa, Nozomu Togawa

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E98A ( 12 ) 2494 - 2504 2015.12 [Refereed]

Non-volatile memory has many advantages such as high density and low leakage power, but it consumes more writing energy than SRAM. It is therefore necessary to reduce writing energy in non-volatile memory design. In this paper, we propose write-reduction codes based on error-correcting codes and reduce writing energy in non-volatile memory by decreasing the number of written bits. When data is written into a memory cell, we do not write it directly but encode it into a codeword. In our write-reduction codes, every data value corresponds to an information vector in an error-correcting code, and an information vector corresponds not to a single codeword but to a set of write-reduction codewords. Given the data to be written and the current memory bits, we can deterministically select a particular write-reduction codeword corresponding to the data to be written, such that the maximum number of flipped bits is theoretically minimized. The number of bits written into memory cells is then also minimized. Experimental results demonstrate that we have achieved a written-bit reduction of 51% on average and an energy reduction of 33% on average compared to non-encoded memory.

DOI

Scopus

2

Citation

(Scopus)
An independent bandwidth reduction device for HEVC VLSI video system

Jiayi Zhu, Li Guo, Dajiang Zhou, Shinji Kimura, Satoshi Goto

Proceedings - IEEE International Symposium on Circuits and Systems 2015- 609 - 612 2015.07 [Refereed]

FRC (frame re-compression) is a widely used technique for reducing the SDRAM (synchronous dynamic random access memory) bandwidth of HEVC video systems. However, in previous research works, FRC imposes requirements on the access pattern, and hence its usage is limited to HEVC video codecs, while in a typical HEVC VLSI video system there exist many other video IPs with high bandwidth requirements. Therefore, in this article, we propose a new FRC architecture to overcome this limitation and make it applicable to all the video IPs in an HEVC VLSI video system, which raises the overall bandwidth reduction rate of the whole video system. Our proposal has two points: first, we propose a system-internal-bus-based FRC architecture, which is independent, transparent, and easily connected to all other video IPs. Second, we propose an FA (freely access) scheme to remove the requirements on the access pattern in previous work. By using this proposal, the bandwidth reduction rate in our VLSI video system model is raised from 92.4% to 69.6%.

DOI

Scopus

4

Citation

(Scopus)
Low-Power Motion Estimation Processor with 3D Stacked Memory

Shuping Zhang, Jinjia Zhou, Dajiang Zhou, Shinji Kimura, Satoshi Goto

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E98A ( 7 ) 1431 - 1441 2015.07 [Refereed]

Motion estimation (ME) is a key encoding component of almost all modern video coding standards. ME contributes significantly to video coding efficiency, but it also consumes the most power of any component in a video encoder. In this paper, an ME processor with a 3D stacked memory architecture is proposed to reduce memory and core power consumption. First, a memory die is designed and stacked with the ME die. By adding face-to-face (F2F) pads and through-silicon-via (TSV) definitions, 2D electronic design automation (EDA) tools can be extended to support the proposed 3D stacking architecture. Moreover, a special memory controller is applied to control data transmission and timing between the memory die and the ME processor die. Finally, a 3D physical design is completed for the entire system. This design includes TSV/F2F placement, floor plan optimization, and power network generation. Compared to 2D technology, the number of input/output (IO) pins is reduced by 77%. After optimizing the floor plan of the processor die and memory die, the routing wire lengths are reduced by 13.4% and 50%, respectively. The stacked static random access memory contributes the most to the power reduction in this work. The simulation results show that the design can support real-time 720p @ 60 fps encoding at 8 MHz using less than 65 mW of power, which is much better than the state-of-the-art ME processor.

DOI

Scopus

1

Citation

(Scopus)
Control Signal Extraction for Sequential Clock Gating Using Time Expansion of Sequential Circuits

2015 ( 6 ) 1 - 6 2015.05

Clock gating has recently come into wide use as a method for reducing the dynamic power of LSI. Clock gating can be inserted automatically by synthesis tools, but there are problems, such as the need for designers to specify the control signals. More aggressive and automatable clock gating techniques have therefore been proposed. In this study, a clock gating candidate extraction method for combinational clock gating is extended to sequential clock gating using time expansion of sequential circuits. Using time expansion and SAT-based detection, it is possible to find signals from multiple clock cycles in the past as candidates. The proposed method was applied to the ISCAS'89 benchmarks and yielded more control signal candidates.

CiNii
A Bit-Write Reduction Method based on Error-Correcting Codes for Non-Volatile Memories

Masashi Tawada, Shinji Kimura, Masao Yanagisawa, Nozomu Togawa

2015 20TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE (ASP-DAC) 496 - 501 2015 [Refereed]

Non-volatile memory has many advantages over SRAM. However, one of its largest problems is that it consumes a large amount of energy in writing. In this paper, we propose a bit-write reduction method based on error-correcting codes for non-volatile memories. When data is written into a memory cell, we do not write it directly but encode it into a codeword. We focus on error-correcting codes and generate new codes called write-reduction codes. In our write-reduction codes, each data value corresponds to an information vector in an error-correcting code, and an information vector corresponds not to a single codeword but to a set of write-reduction codewords. Given the data to be written and the current memory bits, we can deterministically select a particular write-reduction codeword corresponding to the data to be written, such that the maximum number of flipped bits is theoretically minimized. The number of bits written into memory cells is then also minimized. We perform several experimental evaluations and demonstrate up to 72% energy reduction.
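
The idea of mapping each data value to a set of codewords and picking the one closest to the current memory contents can be illustrated with a classic coset construction over the Hamming(7,4) parity check. This is a minimal sketch under stated assumptions, not the authors' actual code construction: the 3-bit data value is stored as the syndrome of a 7-bit memory word, and the names `syndrome`, `write`, and `flips` are hypothetical.

```python
def syndrome(word):
    """Decode: XOR of the 1-based positions of all set bits (Hamming(7,4) parity check)."""
    s = 0
    for i, bit in enumerate(word):
        if bit:
            s ^= i + 1
    return s

def write(word, data):
    """Store 3-bit `data` in a 7-bit word, flipping at most one cell.

    Every word whose syndrome equals `data` is a valid representation,
    so we pick the one nearest to the current contents.
    """
    assert len(word) == 7 and 0 <= data < 8
    diff = syndrome(word) ^ data
    new_word = list(word)
    if diff:                      # diff == 0 means the word already encodes `data`
        new_word[diff - 1] ^= 1   # flipping position `diff` changes the syndrome by `diff`
    return new_word

def flips(old, new):
    """Number of memory cells that actually change."""
    return sum(a != b for a, b in zip(old, new))

current = [1, 0, 1, 1, 0, 0, 1]
updated = write(current, 0b101)
print(updated, syndrome(updated), flips(current, updated))
```

In this toy code, writing a 3-bit value directly could flip up to three cells, whereas the encoded write flips at most one, at the cost of four redundant cells per word.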
HARDWARE-ORIENTED RATE-DISTORTION OPTIMIZATION ALGORITHM FOR HEVC INTRA-FRAME ENCODER

Landan Hu, Heming Sun, Dajiang Zhou, Shinji Kimura

2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW) 2015 [Refereed]

Digital video is widely used in mobile applications, where video compression technology is necessary to store or transmit the videos. High Efficiency Video Coding (HEVC) achieves the highest compression ratio, but at a huge computational cost, the majority of which is rate-distortion (RD) cost calculation. This paper presents a low-complexity RD estimation method for HEVC intra prediction using the following schemes. 1) The transformed coefficients rather than the quantized coefficients are used for the RD estimation. 2) For the rate part, the position after the last non-zero quantized coefficient is considered to improve the accuracy of estimation, and a header-bit estimation method is presented that saves about 82% of the complexity of header-bit calculation. 3) For the distortion part, the scaling parameter of quantization is modified to a power of two so that the bit depth of multiplication can be reduced from 15 to 5 in the worst case. 4) For the 4x4 transform unit, we consider the transform skip mode, which is neglected in prior research. Our proposal achieves a 72.22% time reduction of rate-distortion optimization (RDO) compared with the original HEVC Test Model, while the BD-rate is only 1.76%.

DOI

Scopus

6

Citation

(Scopus)
Fast SAO Estimation Algorithm and Its Implementation for 8 K x 4 K @ 120 FPS HEVC Encoding

Jiayi Zhu, Dajiang Zhou, Shinji Kimura, Satoshi Goto

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E97A ( 12 ) 2488 - 2497 2014.12 [Refereed]

High efficiency video coding (HEVC) is the new-generation video compression standard. Sample adaptive offset (SAO) is a new compression tool adopted in HEVC which reduces the distortion between original samples and reconstructed samples. SAO estimation is the process of determining SAO parameters in video encoding. It is divided into two phases: statistics collection and parameter determination. There are two difficulties in the VLSI implementation of SAO estimation. The first is that there is a huge number of samples to deal with in the statistics collection phase. The other is that the complexity of Rate Distortion Optimization (RDO) in the parameter determination phase is very high. In this article, a fast SAO estimation algorithm and its corresponding VLSI architecture are proposed. For the first difficulty, we use bitmaps to collect statistics of all 16 samples in one 4 x 4 block simultaneously. For the second difficulty, we simplify a series of complicated procedures in HM to balance the algorithm's complexity and BD-rate performance. Experimental results show that the proposed algorithm maintains the picture quality improvement. The VLSI design based on this algorithm can be implemented using 156.32 K gates and 8,832 bits of single-port RAM for the 8-bit depth case. It can be synthesized to 400 MHz in 65 nm technology and is capable of 8 K x 4 K @ 120 fps encoding.

DOI

Scopus

5

Citation

(Scopus)
Small-Sized Encoder/Decoder Circuit Design for Bit-Write Reduction Targeting Non-Volatile Memories

TAWADA Masashi, KIMURA Shinji, YANAGISAWA Masao, TOGAWA Nozomu

Technical report of IEICE. VLD 114 ( 328 ) 227 - 232 2014.11

Non-volatile memory has many advantages such as low leakage power and non-volatility. However, it has the problems that it consumes a large amount of energy in writing and that the maximum number of bit rewrites is limited. We have proposed a Hamming-code-based bit-write reduction method using data encoding/decoding, but its encoder/decoder becomes too large. In this paper, we propose a small-sized encoder/decoder circuit design for the bit-write reduction codes. In this design, we simplify data encoding/decoding by using code redundancy. Experimental results show the efficiency of our encoder/decoder design.

CiNii
Fast SAO estimation algorithm and its VLSI architecture

Jiayi Zhu, Dajiang Zhou, Shinji Kimura, Satoshi Goto

2014 IEEE International Conference on Image Processing, ICIP 2014 1278 - 1282 2014.01 [Refereed]

SAO estimation is the process of determining SAO parameters in video encoding. There are two difficulties in the VLSI implementation of SAO estimation. The first is that there is a huge number of samples to deal with in the statistics collection phase. The other is that the complexity of RDO in the parameter determination phase is very high. In this article, a fast SAO estimation algorithm and its corresponding VLSI architecture are proposed. For the first difficulty, we use bitmaps to collect statistics of all 16 samples in one 4×4 block simultaneously. For the second difficulty, we simplify a series of complicated procedures in HM to balance the complexity and BD-rate performance. Experimental results show that the proposed algorithm maintains the picture quality improvement. The VLSI design based on this algorithm can be implemented with 156.32K gates and 8832 bits of SPRAM at 400MHz in 65nm technology, and is capable of 8Kx4K @ 120fps encoding.

DOI

Scopus

15

Citation

(Scopus)
AN AREA-EFFICIENT 4/8/16/32-POINT INVERSE DCT ARCHITECTURE FOR UHDTV HEVC DECODER

Heming Sun, Dajiang Zhou, Jiayi Zhu, Shinji Kimura, Satoshi Goto

2014 IEEE VISUAL COMMUNICATIONS AND IMAGE PROCESSING CONFERENCE 197 - 200 2014 [Refereed]

This paper presents a new VLSI architecture for HEVC inverse discrete cosine transform (IDCT). Compared to prior art, this work reduces hardware cost by 1) reducing computational logic of 1-D IDCTs with a reordered parallel-in serial-out (RPISO) scheme that shares the inputs of the butterfly structure, and 2) reducing the area of the transpose buffer with a cyclic memory organization that achieves 100% I/O utilization of the SRAMs. In the implementation of a unified 4/8/16/32-point IDCT, the proposed schemes demonstrate 35% and 62% reduction of logic and memory costs, respectively. The IDCT implementation can support real-time decoding of 4Kx2K 60fps video with a total hardware cost of 357,250 µm² on 2-D IDCT and 80,988 µm² on transpose memory in a 90nm process.
A Reduction Method of Writing Operations to Non-volatile Memory by Keeping Data Difference for Low-Power Circuit Design

SHINOHARA Hiroyuki, YANAGISAWA Masao, KIMURA Shinji

Technical report of IEICE. VLD 113 ( 416 ) 167 - 172 2014.01

In order to reduce the power consumption of LSI, unnecessary parts should be powered off with fine granularity, and the current status data before power-off should be stored for the behavior after power-on. Next-generation non-volatile memory is expected to be used to store data during power-off. However, the writing power of non-volatile memory is about 10 times higher than that of CMOS memory, so reducing write operations is very important for reducing the total energy. This manuscript proposes a write reduction method that uses the difference between the original data and the new data, for monitoring data sequences such as those of wireless sensor nodes. By exploiting the redundancy between the difference and the original data, the number of bits written to these registers can be reduced. A modification system for the original and differential data registers has been developed and its power consumption has been evaluated. When applied to temperature monitoring, a 24% reduction in written bits and an 11% power reduction can be obtained.

CiNii
Dual-Stage Pseudo Power Gating with Advanced Clustering Algorithm for Gate Level Power Optimization

Yu Jin, Zhe Du, Shinji Kimura

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E96A ( 12 ) 2568 - 2575 2013.12 [Refereed]

Pseudo Power Gating (Pseudo PG) is a gate-level power reduction method for combinational circuits that stops unnecessary input changes of gates. In Pseudo PG, an extra control signal may be added to a gate, and the other input changes of the gate are deactivated when the control signal takes the controlling value. To improve the power reduction capability, this paper introduces dual-stage Pseudo PG with an advanced clustering algorithm, in which up to two extra control signals are added to a gate if effective. The advanced clustering algorithm selects the first control signal to be compatible with the second control signal based on the propagation of the controlling condition along a path, so that candidate controllable gates excluded by the maximum depth constraint can be controlled. Experimental results show that the proposed dual-stage Pseudo PG method obtains a 23.23% average power reduction with a 5.28% delay penalty with respect to the original circuits, and 10.46% more power reduction with a 2.75% delay penalty compared with circuits applying the original single-stage Pseudo PG.

DOI

Scopus
Power Reduction of Non-volatile Logic Circuits Using the Minimum Writing Power Cut-set of State Registers

ITOI Yudai, KIMURA Shinji

Technical report of IEICE. VLD 113 ( 320 ) 147 - 152 2013.11

Recently, next-generation non-volatile memories/registers using magnetic tunnel junction elements have attracted attention. Such devices can keep their data when powered off, can be integrated in CMOS LSI, and have fast access speed. By using such devices, we can apply fine-grained, low-overhead power control to CMOS LSI. The write energy of such devices, however, is about 10 times larger than that of a usual D flip-flop, so it is very important to reduce the write operations on such devices. We have therefore proposed a write reduction method for non-volatile registers, in which a minimum cut-set with the smallest switching activity is found by using the max-flow min-cut theorem and non-volatile registers are inserted at the cut-set. In this study, we also consider the overhead of the additional circuits for recovering and saving the state, in order to minimize the total power of the circuit. The method has been implemented and applied to the ISCAS'89 benchmarks. Compared with the case where non-volatile registers are inserted at the original positions, power reductions of 2.6% to 15.1% (8.34% on average) have been found.

CiNii
Energy Evaluation of Writing Reduction Method for Non-Volatile Memory

TAWADA Masashi, KIMURA Shinji, YANAGISAWA Masao, TOGAWA Nozomu

Technical report of IEICE. VLD 113 ( 320 ) 141 - 146 2013.11

Non-volatile memory has many advantages over SRAM, such as high density, low leakage power, and non-volatility. However, one of its largest problems is that it consumes a large amount of energy in writing. It is therefore necessary to reduce the number of written bits and thus decrease its writing energy. We have proposed a memory writing reduction method based on error-correcting codes. When data is written into a memory, we do not write it directly but encode it into a codeword. The number of bits written into memory is then limited in each data write. In this paper, we present several experimental evaluations from the viewpoint of energy reduction and discuss the effectiveness of our proposed write-reduction codes.

CiNii
Power Reduction of Non-volatile Logic Circuits Using the Minimum Writing Power Cut-set of State Registers

ITOI Yudai, KIMURA Shinji

IEICE technical report. Dependable computing 113 ( 321 ) 147 - 152 2013.11

Recently, next-generation non-volatile memories/registers using magnetic tunnel junction elements have attracted attention. Such devices can keep their data when powered off, can be integrated in CMOS LSI, and have fast access speed. By using such devices, we can apply fine-grained, low-overhead power control to CMOS LSI. The write energy of such devices, however, is about 10 times larger than that of a usual D flip-flop, so it is very important to reduce the write operations on such devices. We have therefore proposed a write reduction method for non-volatile registers, in which a minimum cut-set with the smallest switching activity is found by using the max-flow min-cut theorem and non-volatile registers are inserted at the cut-set. In this study, we also consider the overhead of the additional circuits for recovering and saving the state, in order to minimize the total power of the circuit. The method has been implemented and applied to the ISCAS'89 benchmarks. Compared with the case where non-volatile registers are inserted at the original positions, power reductions of 2.6% to 15.1% (8.34% on average) have been found.

CiNii
Energy Consumption Evaluation for Two-Level Cache with Non-Volatile Memory Targeting Mobile Processors

Shota Matsuno, Masashi Tawada, Masao Yanagisawa, Shinji Kimura, Tadahiko Sugibayashi, Nozomu Togawa

IEEK Transactions on Smart Processing and Computing Vol. 2 ( No. 4 ) 226 - 239 2013.08
Low Power Memory Based Design Method of Constant Multipliers for Digital Filters

KABASAWA Kosuke, SUGIBAYASHI Tadahiko, YANAGISAWA Masao, KIMURA Shinji

Technical report of IEICE. VLD 113 ( 119 ) 101 - 106 2013.07

Digital signal processing of sounds and images uses many digital filters, which compute the sum of products between a sequence of constants and a time sequence of inputs. In this manuscript, a memory-based design method for such constant multiplication is described. In the design, the trade-off between the size of the memory and that of the logic is considered, and speed and power consumption are optimized. The read power of a memory is independent of the output read from the memory, and a memory can encapsulate the toggles of logic gates that occur in gate-based designs. By separating an input into several parts and designing the resulting small multipliers using a memory, the memory size can be reduced drastically. The proposed constant multiplier has been implemented as an ASIC and shows a power reduction compared with the gate-level design.

CiNii
A non-volatile memory writing reduction method based on state encoding limiting maximum Hamming distance

TAWADA Masashi, KIMURA Shinji, YANAGISAWA Masao, TOGAWA Nozomu

Technical report of IEICE. VLD 113 ( 119 ) 95 - 100 2013.07

Non-volatile memory has many advantages over SRAM, such as high density, low leakage power, and non-volatility. However, one of its largest problems is that it consumes a large amount of energy in writing. It is therefore necessary to reduce the number of written bits and thus decrease its writing energy. In this paper, we propose a memory writing reduction method based on state encoding that limits the maximum Hamming distance. When data is written into a memory, we do not write it directly but encode it into a codeword, and then write the codeword into the memory. The encoding limits the maximum Hamming distance of each codeword from any other codeword. If the maximum Hamming distance is limited among all the codewords, the number of flipped bits is also limited, and the number of written bits is reduced. We show several experimental evaluations and discuss the effectiveness of our proposed algorithm.

CiNii
Evaluation of energy consumption for two-level cache using Non-Volatile Memory for IL1 and UL2 caches

MATSUNO Shota, TAWADA Masashi, YANAGISAWA Masao, KIMURA Shinji, TOGAWA Nozomu, SUGIBAYASHI Tadahiko

Technical report of IEICE. VLD 113 ( 119 ) 89 - 94 2013.07

A non-volatile memory has advantages such as low leakage energy and non-volatility compared with SRAM or DRAM, which have high leakage energy. Non-volatile memory is strongly expected to be used for realizing normally-off systems. A non-volatile memory, however, consumes more energy to write than SRAM or DRAM. In this paper, we evaluate the energy consumption of a cache memory with non-volatile memories in an embedded processor. In our evaluation, we assume that their write energy is 1.0x to 10.0x higher than that of SRAM. Experimental evaluations demonstrate that using non-volatile memories in a cache is the better choice in some cases, even when the write energy of non-volatile memories is 10.0x higher than that of SRAM.

CiNii
Write Control Method for Nonvolatile Flip-Flops Based on State Transition Analysis

Naoya Okada, Yuichi Nakamura, Shinji Kimura

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E96A ( 6 ) 1264 - 1272 2013.06 [Refereed]

Nonvolatile flip-flops enable leakage power reduction in logic circuits and quick return from standby mode. However, they have limited write endurance, and their power consumption for writing is larger than that of a conventional D flip-flop (DFF). For this reason, it is important to reduce the number of write operations. The write operations can be reduced by stopping the clock signal to synchronous flip-flops, because write operations are executed only when the clock is applied to the flip-flops. For such clock gating, a method using the Exclusive OR (XOR) of the current value and the new value as the control signal is well known. The XOR-based method is effective, but there are several cases where write operations can be omitted even if the current value and the new value are different. This paper proposes a method to detect such unnecessary write operations based on state transition analysis, and proposes a write control method to save the power consumption of nonvolatile flip-flops. In the method, redundant bits are detected to reduce the number of write operations. If the next state and the outputs do not depend on some current bit, the bit is redundant and does not need to be written. The method is based on Binary Decision Diagram (BDD) calculation. We construct write control circuits that stop the clock signal by converting BDDs representing the set of states where write operations are unnecessary. The proposed method can be combined with the XOR-based method to reduce the total number of write operations. We apply the combined method to some benchmark circuits and estimate the power consumption with Synopsys NanoSim. On average, power consumption is reduced by 15.0% compared with the XOR-based method alone.
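
The XOR-based baseline mentioned in the abstract (clock a flip-flop only when its new value differs from its current value) can be sketched as a simple write counter. This is an illustrative model only, with a hypothetical function name and made-up state sequence; it does not model the BDD-based redundant-bit detection that the paper adds on top.

```python
def count_writes(states, gated):
    """Count flip-flop write operations over a sequence of register states.

    states: list of equal-length bit tuples, one per clock cycle.
    gated:  if True, apply XOR-based clock gating per bit.
    """
    writes = 0
    current = states[0]
    for new in states[1:]:
        for cur_bit, new_bit in zip(current, new):
            # Ungated: every flip-flop is written on every clock edge.
            # Gated: the clock reaches a flip-flop only if cur XOR new is 1.
            if not gated or (cur_bit ^ new_bit):
                writes += 1
        current = new
    return writes

# Hypothetical 3-bit state sequence used only for illustration.
sequence = [(0, 0, 1), (0, 1, 1), (0, 1, 1), (1, 1, 0)]
ungated_writes = count_writes(sequence, gated=False)
gated_writes = count_writes(sequence, gated=True)
print(ungated_writes, gated_writes)
```

The paper's contribution goes further than this baseline by also skipping writes to bits whose values differ but do not affect the next state or the outputs.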

DOI

Scopus
A-3-7 REDUCING THE WRITING BITS TO NON-VOLATILE MEMORY BY HOLDING DATA DIFFERENCE

Shinohara Hiroyuki, Yanagisawa Masao, Kimura Shinji

Proceedings of the IEICE General Conference 2013 67 - 67 2013.03

CiNii
Controlling-value-based power gating considering controllability propagation and power-off probability

Zhe Du, Yu Jin, Shinji Kimura

Proceedings of International Conference on ASIC 2013 [Refereed]

　View Summary

Power gating technology is useful in reducing standby leakage current. Controlling value based power gating is a fine-grained power gating approach using the controlling value of logic elements. However, power saving capability suffers from the steady maximum depth constraint, which prohibits the power gating assignment when the control of a gate increases the critical path length. To increase power savings, this paper proposes a power gating control extraction method based on controllability propagation and power-off probability. Multiple power domains can be clustered by a smaller depth signal with the controllability propagation. Experimental results show that a 21.4% power reduction can be obtained on average, achieving an 8.5% improvement compared with the previous algorithm. © 2013 IEEE.

DOI

Scopus
Energy Evaluation for Two-level On-chip Cache with Non-Volatile Memory on Mobile Processors

Shota Matsuno, Masashi Tawada, Masao Yanagisawa, Shinji Kimura, Nozomu Togawa, Tadahiko Sugibayashi

2013 IEEE 10TH INTERNATIONAL CONFERENCE ON ASIC (ASICON) 2013 [Refereed]

　View Summary

As the leakage power of traditional SRAM becomes larger, the ratio of static energy to the total energy of the memory architecture also becomes larger. Non-volatile memory (NVM) has many advantages over SRAM, such as high density, low leakage power, and non-volatility, but consumes too much write energy. In this paper, we evaluate the energy consumption of a two-level cache that partly uses NVM on mobile processors and confirm that it effectively reduces energy consumption.
An exact approach for GPC-based compressor tree synthesis

Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E96-A ( 12 ) 2553 - 2560 2013

　View Summary

Multi-operand adders that calculate the summation of more than two operands usually consist of compressor trees, which reduce the number of operands to two without any carry propagation, and carry-propagate adders for the two operands in the ASIC implementation. Compressor trees that consist of full adders and half adders cannot be implemented efficiently on LUT-based FPGAs, and carry-chains or dedicated structures have been utilized to produce multi-operand adders on FPGAs. Recent studies indicate that compressor trees can be implemented efficiently on LUTs using Generalized Parallel Counters (GPCs) as the building blocks of compressor trees. This paper addresses the problem of synthesizing compressor trees based on GPCs. Based on the observation that characteristics such as the area, power, and delay correlate roughly to the total number and the maximum level of GPCs, the target problem can be regarded as a minimization problem for the total number of GPCs and the maximum levels of the GPCs, for which an ILP-based approach is proposed. The key point of our formulation is not to model the problem based on the structures of compressor trees like the existing approach, but instead the compression process itself is used to reduce the number of variables and constraints in the ILP formulation. The experimental results demonstrate the advantage of our formulation in terms of the quality and runtime. Copyright © 2013 The Institute of Electronics, Information and Communication Engineers.

DOI

Scopus

15

Citation

(Scopus)
On Gate Level Power Optimization of Combinational Circuits Using Pseudo Power Gating

Yu Jin, Shinji Kimura

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E95A ( 12 ) 2191 - 2198 2012.12 [Refereed]

　View Summary

In recent years, the demand for low-power design has remained undiminished. In this paper, a pseudo power gating (SPG) structure using a normal logic cell is proposed to extend the power gating to an ultrafine grained region at the gate level. In the proposed method, the controlling value of a logic element is used to control the switching activity of modules computing other inputs of the element. For each element, there exists a submodule controlled by an input to the element. Power reduction is maximized by controlling the order of the submodule selection. A basic algorithm and a switching activity first algorithm have been developed to optimize the power. In this application, a steady maximum depth constraint is added to prevent the depth increase caused by the insertion of the control signal. In this work, various factors affecting the power consumption of library level circuits with the SPG are determined. In such factors, the occurrence of glitches increases the power consumption and a method to reduce the occurrence of glitches is proposed by considering the parity of inverters. The proposed SPG method was evaluated through the simulation of the netlist extracted from the layout using the VDEC Rohm 0.18 μm process. Experiments on ISCAS'85 benchmarks show that the reduction in total power consumption achieved is 13% on average with a 2.5% circuit delay degradation. Finally, the effectiveness of the proposed method under different primary input statistics is considered.

DOI

Scopus

2

Citation

(Scopus)
Write Reduction for Non-volatile Registers Using the Max-flow Min-cut

ITOI Yudai, KIMURA Shinji

112 ( 247 ) 101 - 106 2012.10

CiNii
Automatic Multi-Stage Clock Gating Optimization Using ILP Formulation

Xin Man, Takashi Horiyama, Shinji Kimura

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E95A ( 8 ) 1347 - 1358 2012.08 [Refereed]

　View Summary

Clock gating is supported by commercial tools as a power optimization feature based on the guard signal described in HDL (structural method). However, the identification of control signals for gated registers is hard and designer-intensive work. Besides, since the clock gating cells also consume power, it is imperative to minimize the number of inserted clock gating cells and their switching activities for power optimization. In this paper, we propose an automatic multi-stage clock gating algorithm with ILP (Integer Linear Programming) formulation, including clock gating control candidate extraction, constraints construction and optimum control signal selection. By multi-stage clock gating, unnecessary clock pulses to clock gating cells can be avoided by other clock gating cells, so that the switching activity of clock gating cells can be reduced. We find that any multi-stage control signals are also single-stage control signals, and any combination of signals can be selected from single-stage candidates. The proposed method can be applied to 3 or more cascaded stages. The multi-stage clock gating optimization problem is formulated as constraints in LP format for the selection of the cascaded clock-gating order of multi-stage candidate combinations, and a commercial ILP solver (IBM CPLEX) is applied to obtain the control signals for each register with minimum switching activity. Those signals are used to generate a gate level description with guarded registers from the original design, and commercial synthesis and layout tools are applied to obtain the circuit with multi-stage clock gating. The proposed method is applied to a set of benchmark circuits and a Low Density Parity Check (LDPC) Decoder (6.6k gates, 212 F.F.s), and the actual power consumption is estimated using Synopsys NanoSim after layout. On average, a 31% actual power reduction has been obtained compared with original designs with structural clock gating, and more than 10% improvement has been achieved for some circuits compared with the single-stage optimization method. CPU time for optimum multi-stage control selection is several seconds for up to 25k variables in LP format. By applying the proposed clock gating, area can also be reduced since the multiplexors controlling register inputs are eliminated.

DOI

Scopus

1

Citation

(Scopus)
On gate level power optimization of combinational circuits using pseudo power gating

Yu Jin, Shinji Kimura

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E95-A ( 12 ) 2191 - 2198 2012

　View Summary

In recent years, the demand for low-power design has remained undiminished. In this paper, a pseudo power gating (SPG) structure using a normal logic cell is proposed to extend the power gating to an ultrafine grained region at the gate level. In the proposed method, the controlling value of a logic element is used to control the switching activity of modules computing other inputs of the element. For each element, there exists a submodule controlled by an input to the element. Power reduction is maximized by controlling the order of the submodule selection. A basic algorithm and a switching activity first algorithm have been developed to optimize the power. In this application, a steady maximum depth constraint is added to prevent the depth increase caused by the insertion of the control signal. In this work, various factors affecting the power consumption of library level circuits with the SPG are determined. In such factors, the occurrence of glitches increases the power consumption and a method to reduce the occurrence of glitches is proposed by considering the parity of inverters. The proposed SPG method was evaluated through the simulation of the netlist extracted from the layout using the VDEC Rohm 0.18 μm process. Experiments on ISCAS'85 benchmarks show that the reduction in total power consumption achieved is 13% on average with a 2.5% circuit delay degradation. Finally, the effectiveness of the proposed method under different primary input statistics is considered. Copyright © 2012 The Institute of Electronics, Information and Communication Engineers.

DOI

Scopus

2

Citation

(Scopus)
Multi-Operand Adder Synthesis Targeting FPGAs

Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E94A ( 12 ) 2579 - 2586 2011.12 [Refereed]

　View Summary

Multi-operand adders, which calculate the summation of more than two operands, usually consist of compressor trees, which reduce the number of operands to two without any carry propagation, and a carry-propagate adder for the two operands in ASIC implementation. The former part is usually realized using full adders or (3;2) counters like Wallace-trees in ASIC, while adder trees or dedicated hardware are used in FPGA. In this paper, an approach to realize compression trees on FPGAs is proposed. In the case of an FPGA with m-input LUTs, any counter with up to m inputs can be realized with one LUT per output. Our approach utilizes generalized parallel counters (GPCs) with up to m inputs and synthesizes high-performance compressor trees by setting some intermediate height limits in the compression process like Dadda's multipliers. Experimental results show that the number of GPCs is reduced by up to 22% compared to the existing heuristic. Its effectiveness in reducing delay is also shown against existing approaches on Altera's Stratix III.
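To make the building block in the summary above concrete, here is a minimal behavioral sketch of a generalized parallel counter, assuming the usual definition: input bits arrive in columns of increasing weight and the outputs are the binary encoding of their weighted sum. The function name `gpc` and its list-based interface are assumptions for illustration and do not come from the paper.

```python
def gpc(columns):
    """Behavioral model of a generalized parallel counter.

    columns[i] is a list of input bits (0 or 1) that each carry weight 2**i.
    Returns the output bits, least significant bit first, wide enough to
    hold the largest possible weighted sum of the inputs.
    """
    total = sum(bit << i for i, col in enumerate(columns) for bit in col)
    max_total = sum(len(col) << i for i, col in enumerate(columns))
    n_out = max(1, max_total.bit_length())
    return [(total >> k) & 1 for k in range(n_out)]


if __name__ == "__main__":
    # A (3;2) counter, i.e. a full adder: three bits of weight 1.
    print(gpc([[1, 1, 1]]))
    # A (6;3) counter: six bits of weight 1, three output bits.
    print(gpc([[1, 0, 1, 1, 0, 1]]))
    # A counter taking three weight-1 bits and two weight-2 bits.
    print(gpc([[1, 1, 1], [1, 1]]))
```

Under the summary's statement that each output of a counter with up to m inputs fits in one m-input LUT, each element of the returned list would correspond to one LUT; the sketch models only the function, not the mapping.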

DOI

Scopus

11

Citation

(Scopus)
Multi-Stage Power Gating Based on Controlling Values of Logic Gates

Yu Jin, Shinji Kimura

Proc. IEEE International Symposium on ASIC (ASICON) 87 - 90 2011.10
Low Power LSI Design Methods Based on Gating Technology

Shinji Kimura

Keynote Speech of IEEE International Conference on ASIC (ASICON) 2011.10
High-parallel LDPC decoder with power gating design

Ying Cui, Xiao Peng, Yu Jin, Peilin Liu, Shinji Kimura, Satoshi Goto

Proceedings of International Conference on ASIC 21 - 24 2011 [Refereed]

　View Summary

Leakage power is growing comparable to dynamic power dissipation as a result of technology trends, and thus it has become an important issue in low-power circuit design. As a popular technique for standby power reduction, power gating is applied to a high-parallel LDPC decoder for the WiMAX standard. The clustered-block processing engine (CBPE) array is divided into 9 power domains, which are switched on or off according to the different code lengths of the LDPC code defined in the WiMAX standard. As the CBPE array occupies about 70% of the decoder system, the dedicated power gating strategy is very effective in the shorter code length case since more power domains can be switched off. At the shortest code length, the power gating design brings about 55% power reduction compared to that of the longest code length. © 2011 IEEE.

DOI

Scopus
Power and delay aware synthesis of multi-operand adders targeting LUT-based FPGAs

Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

Proceedings of the International Symposium on Low Power Electronics and Design 217 - 222 2011

　View Summary

Recent research has indicated that multi-operand addition on FPGAs can be efficiently realized as an architecture consisting of a compressor tree, which reduces the number of operands, and a carry-propagate adder, as in ASIC, by utilizing generalized parallel counters (GPCs). This paper addresses power and delay aware synthesis of GPC-based compressor trees. Based on the observation that dynamic power would correlate to the number of GPCs and the levels of GPCs, our approach aims to minimize the maximum levels and the total number of GPCs, and an ILP-based algorithm and heuristic approaches are proposed. Several experiments targeting the Altera Stratix III architecture show that the proposed approach reduced the delay by up to 20% under a slight increase in total power dissipation. © 2011 IEEE.

DOI

Scopus

17

Citation

(Scopus)
Comparison of Optimized Multi-Stage Clock Gating with Structural Gating Approach

Xin Man, Shinji Kimura

2011 IEEE REGION 10 CONFERENCE TENCON 2011 651 - 656 2011 [Refereed]

　View Summary

Clock gating is a power efficient technique that switches off unnecessary clock signals to the registers. The condition under which the registers can be safely gated is checked using the EXOR of the current and the next state values. Due to the extra power consumed by clock gating logic consisting of a latch and an AND gate, we have proposed an optimum sharing method of gating controls based on BDD (Binary Decision Diagram) with single-stage clock gating for power optimization. In this paper, we enhance the optimization method to include multi-stage clock gating and compare it with the structural gating approach. By multi-stage clock gating, the activities of both registers and clock gating logic can be reduced. On a set of interface circuits, we have obtained a power reduction of 14.1% on average compared with the single-stage structural method and of 10.8% compared with the multi-stage structural gating approach. Our BDD based method is also fast and scalable through candidate pruning.
Power Optimization of Sequential Circuits Using Switching Activity Based Clock Gating

Xin Man, Takashi Horiyama, Shinji Kimura

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E93A ( 12 ) 2472 - 2480 2010.12 [Refereed]

　View Summary

Clock gating is the insertion of control signals for registers to switch off unnecessary clock signals selectively without violating the functional correctness of the original design, so as to reduce the dynamic power consumption. Commercial EDA tools usually have a mechanism to generate clock gating logic based on the structural method, where the control signals specified by designers are used and the effectiveness of the clock gating depends on the specified control signals. In this research, we focus on automatic clock gating logic generation and propose a method based on candidate extraction and control signal selection. We formalize the control signal selection using linear formulae and devise an optimization method based on BDD. The method is effective for circuits with many candidates shared by different registers. The method is applied to counter circuits, to check the correlation with power simulation results, and to a set of benchmark circuits. A 19.1-71.9% power reduction has been found on counter circuits after layout, and a 2.3-18.0% cost reduction on benchmark circuits.

DOI

Scopus

2

Citation

(Scopus)
Acceleration of a SAT Based Solver for Minimum Cost Satisfiability Problems Using Optimized Boolean Constraint Propagation

Xin Zhang, Peilin Liu, Shinji Kimura

Proc. of 16th Workshop on Synthesis And System Integration of Mixed Information Technologies 365 - 370 2010.10
The Sizing of Sleep Transistors In Controlling Value Based Power Gating

Lei Chen, Shinji Kimura

Proc. of 16th Workshop on Synthesis And System Integration of Mixed Information Technologies 202 - 207 2010.10
Automatic Clock Gating Generation through Power-optimal Control Signal Selection

MAN Xin, HORIYAMA Takashi, KIMURA Shinji

2010 ( 1 ) 1 - 6 2010.05

CiNii
Multi-Operand Adder Synthesis on FPGAs Using Generalized Parallel Counters

Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

2010 15TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE (ASP-DAC 2010) 332 - + 2010 [Refereed]

　View Summary

Multi-operand adders usually consist of compression trees, which reduce the number of operands per bit to two, and a carry-propagate adder for the two operands in ASIC implementation. The former part is usually realized using full adders or (3;2) counters like Wallace-trees in ASIC, while adder trees or dedicated hardware are used in FPGA. In this paper, an approach to realize compression trees on FPGAs is proposed. In the case of an FPGA with m-input LUTs, any counter with up to m inputs can be realized with one LUT per output. Our approach utilizes generalized parallel counters (GPCs) with up to m inputs and synthesizes high-performance compression trees by setting some intermediate height limits in the compression process like Dadda's multipliers. Experimental results show its effectiveness against existing approaches at GPC level and on Altera's Stratix III.
Optimizing Controlling-Value-Based Power Gating with Gate Count and Switching Activity

Lei Chen, Shinji Kimura

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E92A ( 12 ) 3111 - 3118 2009.12 [Refereed]

　View Summary

In this paper, a new heuristic algorithm is proposed to optimize the power domain clustering in controlling-value-based (CV-based) power gating technology. In this algorithm, both the switching activity of sleep signals (p) and the overall number of sleep gates (gate count, N) are considered, and the sum of the product of p and N is optimized. The algorithm effectively exerts the total power reduction obtained from the CV-based power gating. Even when the maximum depth is kept the same, the proposed algorithm can still achieve approximately 10% more power reduction than the prior algorithms. Furthermore, detailed comparisons between the proposed heuristic algorithm and other possible heuristic algorithms are also presented. HSPICE simulation results show that over 26% of total power reduction can be obtained by using the new heuristic algorithm. In addition, the effect of dynamic power reduction through the CV-based power gating method and the delay overhead caused by the switching of sleep transistors are also shown in this paper.

DOI

Scopus

4

Citation

(Scopus)
Framework for Parallel Prefix Adder Synthesis Considering Switching Activities

Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

IPSJ Trans. SLDM 212 - 221 2009.08
Finite Input-Memory Automaton Based Checker Synthesis of SystemVerilog Assertions for FPGA Prototyping

Chengjie Zang, Shinji Kimura

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E92A ( 6 ) 1454 - 1463 2009.06 [Refereed]

　View Summary

Checker synthesis for assertion based verification has become popular because of the recent progress in FPGA prototyping environments. In this paper, we propose a checker synthesis method based on the finite input-memory automaton, suitable for embedded RAM modules in FPGAs. There are more than 1 Mbit of memory in medium size FPGAs, and such embedded memory cells can be used as shift registers. The main idea is to construct a checker circuit using finite input-memory automata and implement the shift register chain with logic elements or embedded RAM modules. When using a RAM module, the method does not consume any logic element for storing the value. Note that the shift register chain of input memory can be shared among different assertions, so the hardware resources can be reduced significantly. We have checked the effectiveness of the proposed method using several assertions.

DOI

Scopus
Automatic pipeline generation for FPGA-based prototyping

W. Xing, K. Zheng, T. Kimura, S. Kuromaru, K. Kai, S. Kimura

Proc. 15th Workshop on Synthesis And System Integration of Mixed Information technologies 155 - 160 2009.03
Assertion checker synthesis for FPGA emulation

C. Zang, Q. Wei, S. Kimura

Proc. 15th Workshop on Synthesis And System Integration of Mixed Information technologies 149 - 154 2009.03
Fine-Grained Power Gating Based on the Controlling Value of Logic Elements

Lei Chen, Takashi Horiyama, Yuichi Nakamura, Shinji Kimura

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E91A ( 12 ) 3531 - 3538 2008.12 [Refereed]

　View Summary

Leakage power consumption of logic elements has become a serious problem, especially in the sub-100-nanometer process. In this paper, a novel power gating approach using the controlling value of logic elements is proposed. In the proposed method, sleep signals of the power-gated blocks are extracted completely from the original circuits without any extra logic element. A basic algorithm and a probability-based heuristic algorithm have been developed to implement the basic idea. The steady maximum delay constraint has also been introduced to handle the delay issues. Experiments on the ISCAS'85 benchmarks show that on average 15-36% of logic elements could be power gated at a time for random input patterns, and 3-31% of elements could be stopped under the steady maximum delay constraints. We also show a power optimization method for AND/OR tree circuits, in which more than 80% of gates can be power-gated.
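The idea of a controlling value in the summary above can be sketched as follows: when one input of a gate holds its controlling value, the output no longer depends on the other input, so the logic driving that other input is a candidate for power gating. This is only a minimal illustration for two-input gates; the names `CONTROLLING_VALUE`, `gate_output` and `sleep_signal` are hypothetical and are not taken from the paper.

```python
# Controlling values: an input at this value fixes the gate output
# regardless of the other input.
CONTROLLING_VALUE = {"AND": 0, "NAND": 0, "OR": 1, "NOR": 1}


def gate_output(gate_type, a, b):
    """Evaluate a two-input gate on 0/1 inputs."""
    if gate_type in ("AND", "NAND"):
        out = a & b
    else:
        out = a | b
    # NAND and NOR invert the AND / OR result.
    return out ^ 1 if gate_type in ("NAND", "NOR") else out


def sleep_signal(gate_type, control_input):
    """True when control_input holds the controlling value, meaning the
    sub-circuit computing the other input could be put to sleep."""
    return control_input == CONTROLLING_VALUE[gate_type]


if __name__ == "__main__":
    for gate in CONTROLLING_VALUE:
        c = CONTROLLING_VALUE[gate]
        # With the controlling value applied, the other input is irrelevant.
        print(gate, sleep_signal(gate, c),
              gate_output(gate, c, 0) == gate_output(gate, c, 1))
```

Because the sleep condition is just an existing input signal compared with a constant, no extra logic element is needed to derive it, which matches the property the summary highlights.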

DOI
Efficient Hybrid Grid Synthesis Method Based on Genetic Algorithm for Power/Ground Network Optimization with Dynamic Signal Consideration

Yun Yang, Shinji Kimura

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E91A ( 12 ) 3431 - 3442 2008.12 [Refereed]

　View Summary

This paper proposes an efficient design algorithm for power/ground (P/G) network synthesis with dynamic signal consideration, which is mainly caused by Ldi/dt noise and Cdv/dt decoupling capacitance (DE-CAP) current in the distribution network. To deal with the nonlinear global optimization under synthesis constraints directly, the genetic algorithm (GA) is introduced. The proposed GA-based synthesis method can avoid the linear transformation loss and the restraint condition complexity in current SLP, SQP, ICG, and random-walk methods. In the proposed Hybrid Grid Synthesis algorithm, the dynamic signal is simulated in the gene disturbance process, and the Trapezoidal Modified Euler (TME) method is introduced to realize the precise dynamic time step process. We also use a hybrid-SLP method to reduce the genetic execution time and increase the network synthesis efficiency. Experimental results on a given power distribution network show a reduction in layout area and execution time compared with current P/G network synthesis methods.

DOI
FPGA prototyping of a simultaneous multithreading processor

C. Zang, S. Imai, S. Kimura

Proc. 21st Workshop on Circuits and Systems in Karuizawa 219 - 224 2008.04
The Optimal Architecture Design of Two-Dimensional Matrix Multiplication

Y. Yang, S. Kimura

IEICE Trans. Fundamentals E91-A ( 4 ) 1101 - 1111 2008.04
Issue mechanism for embedded Simultaneous Multithreading processor

Chengjie Zang, Shigeki Imai, Steven Frank, Shinji Kimura

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E91A ( 4 ) 1092 - 1100 2008.04 [Refereed]

　View Summary

Simultaneous Multithreading (SMT) technology enhances instruction throughput by issuing multiple instructions from multiple threads within one clock cycle. With an in-order pipeline for each thread, SMT processors can issue a number of instructions close to or surpassing that of an out-of-order pipeline. In this work, we show an efficient issue logic for predicated instruction sequences with a parallel flag in each instruction, where the predicate register based issue control is adopted and the continuous instructions with the parallel flag of V are executed in parallel. The flag is pre-defined by a compiler. Instructions from different threads are issued based on the round-robin order. We also introduce an Instruction Queue skip mechanism for a thread if its queue is empty. Using this kind of issue logic, we designed a 6-thread, 7-stage, in-order pipeline processor. Based on this processor, we compare the round-robin issue policy (RR(T-1-T-n)) with other policies: thread one always has the highest priority (PR(T-1)), and thread one or thread n has the highest priority in turn (PR(T-1-T-n)). The results show that the RR(T-1-T-n) policy outperforms the others and PR(T-1-T-n) is almost the same as RR(T-1-T-n) from the point of view of the issued instructions per cycle.
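The round-robin order with empty-queue skipping described in the summary above can be sketched in a few lines. This is a simplified model under stated assumptions: at most one instruction is taken per thread per cycle, the starting thread advances by one each cycle, and the predicate and parallel-flag logic of the actual design is not modeled. The function name `round_robin_issue` and the issue `width` parameter are hypothetical.

```python
from collections import deque


def round_robin_issue(queues, start, width):
    """Select up to `width` instructions for one cycle.

    queues : list of per-thread instruction queues (deque objects)
    start  : thread index that has first turn in this cycle
    width  : maximum number of instructions issued per cycle
    Returns (issued, next_start), where issued is a list of
    (thread_id, instruction) pairs.
    """
    n = len(queues)
    issued = []
    for offset in range(n):
        if len(issued) == width:
            break
        tid = (start + offset) % n
        if queues[tid]:  # skip threads whose instruction queue is empty
            issued.append((tid, queues[tid].popleft()))
    return issued, (start + 1) % n


if __name__ == "__main__":
    qs = [deque(["a0", "a1"]), deque(), deque(["c0"])]
    first, nxt = round_robin_issue(qs, start=0, width=2)
    print(first, nxt)
    second, nxt = round_robin_issue(qs, start=nxt, width=2)
    print(second, nxt)
```

Skipping an empty queue means an idle thread does not waste an issue slot, which is the purpose of the Instruction Queue skip mechanism mentioned in the summary.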

DOI

Scopus

3

Citation

(Scopus)
Synthesis of Parallel Prefix Adders Considering Switching Activities

Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

2008 IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN 404 - + 2008 [Refereed]

　View Summary

This paper addresses parallel prefix adder synthesis which targets minimization of the total switching activities under bitwise timing constraints. This problem is treated as synthesis of prefix graphs which represent global structures of parallel prefix adders at the technology-independent level. An approach for timing-driven area minimization has been proposed which first finds the exact minimum solution on a specific subset of prefix graphs by dynamic programming, then restructures the result for further reduction by removing the restriction on the subset. This approach can be applied to switching cost minimization almost directly, though it is not as effective as for area minimization in some cases. In this paper, a heuristic is proposed which estimates the effect of the restructuring phase and improves cost calculation for some specific cases. Through various kinds of experiments, conditions under which this approach can be executed effectively are also discussed.
Resynthesis Method for Circuit Acceleration on LUT-based FPGA

Weijie Xing, Takashi Horiyama, Shunichi Kuromaru, Tomoo Kimura, Shinji Kimura

Proceedings of 14th Workshop on Synthesis And System Integration of Mixed Information technologies 375 - 380 2007.10
Active Mode Leakage Power Reduction Based on the Controlling Value of Logic Gates

Lei Chen, Shinji Kimura

Proceedings of 14th Workshop on Synthesis And System Integration of Mixed Information technologies 266 - 271 2007.10
Power-Conscious Synthesis of Parallel Prefix Adders under Bitwise Timing Constraints

Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

Proceedings of 14th Workshop on Synthesis And System Integration of Mixed Information technologies 7 - 14 2007.10
Optimal planar jumping systolic array design for matrix multiplication

Yun Yang, Shinji Kimura

Proceedings of 20th Workshop on Circuits and Systems in Karuizawa 343 - 348 2007.04
Issue Mechanism for Embedded Simultaneous Multithreading Processor

Chengjie Zang, Shigeki Imai, Shinji Kimura

Proceedings of 20th Workshop on Circuits and Systems in Karuizawa 325 - 330 2007.04
Coverage estimation using transition perturbation for symbolic model checking in hardware verification

Xingwen Xu, Shinji Kimura, Kazunari Horikawa, Takehiko Tsuchiya

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E89A ( 12 ) 3451 - 3457 2006.12 [Refereed]

　View Summary

Lack of complete formal specification is one of the major obstacles to the deployment of model checking. Coverage estimation addresses this issue by revealing the unverified part of the design according to the specified properties. In this paper we propose a new transition-based coverage metric to evaluate the completeness of properties for symbolic model checking. Our coverage metric pinpoints the transitions through which the values of signals are checked. An efficient symbolic algorithm is presented for computing the transition coverage for a subset of ACTL. Our coverage estimator has been applied to the model checking of a cache coherence protocol. We uncovered several coverage holes including one that eventually led to the discovery of a design bug.

DOI

Scopus
Bit-length optimization method for high-level synthesis based on non-linear programming technique

Nobuhiro Doi, Takashi Horiyama, Masaki Nakanishi, Shinji Kimura

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E89A ( 12 ) 3427 - 3434 2006.12 [Refereed]

　View Summary

High-level synthesis is a novel method to generate an RT-level hardware description automatically from a high-level language such as C, and is used in recent digital circuit design. Floating-point to fixed-point conversion with bit-length optimization is one of the key issues for area and speed optimization in high-level synthesis. However, the conversion task is rather tedious work for designers. This paper introduces an automatic bit-length optimization method for floating-point to fixed-point conversion in high-level synthesis. The method estimates computational errors statistically, and formalizes an optimization problem as a non-linear problem. The application of the NLP technique improves the balance between computational accuracy and total hardware cost. Various constraints, such as unit sharing and the maximum bit-length of function units, can also be modeled easily. Experimental results show that our method is fast compared with a typical one, and reduces the hardware area.

DOI

Scopus

3

Citation

(Scopus)
An Efficient Instruction Issue Mechanism for Simultaneous Multithreading Microprocessor

Taeseok Jeong, Chengjie Zang, Shinji Kimura

Proc. International SoC Design Conference (ISOCC2006) 533 - 536 2006.10
Performance and Energy Efficient Data Cache Architecture for Embedded Simultaneous Multithreading Microprocessor

Chengjie Zang, Shigeki Imai, Shinji Kimura

International SoC Design Conference (ISOCC2006) 351 - 354 2006.10
Performance and Energy Efficient Data Cache Architecture for Embedded Simultaneous Multithreading Microprocessor

Chengjie Zang, Shigeki Imai, Shinji Kimura

Proceedings of 13th Workshop on Synthesis And System Integration of Mixed Information technologies (SASIMI2006) 268 - 273 2006.04
Selective low-care coding: A means for test data compression in circuits with multiple scan chains

Youhua Shi, Nozomu Togawa, Shinji Kimura, Masao Yanagisawa, Tatsuo Ohtsuki

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E89-A ( 4 ) 996 - 1003 2006 [Refereed]

　View Summary

This paper presents a test input data compression technique, Selective Low-Care Coding (SLC), which can be used to significantly reduce input test data volume as well as the external test channel requirement for multiscan-based designs. In the proposed SLC scheme, we explored the linear dependencies of the internal scan chains, and instead of encoding all the specified bits in test cubes, only a smaller number of specified bits are selected for encoding, thus greater compression can be expected. Experiments on the larger benchmark circuits show that a drastic reduction in test data volume, with corresponding savings in test application time, can indeed be achieved even for a well-compacted test set. Copyright © 2006 The Institute of Electronics, Information and Communication Engineers.

DOI

Scopus

2

Citation

(Scopus)
FCSCAN: An efficient multiscan-based test compression technique for test cost reduction

Youhua Shi, Nozomu Togawa, Shinji Kimura, Masao Yanagisawa, Tatsuo Ohtsuki

ASP-DAC 2006: 11TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE, PROCEEDINGS 653 - 658 2006 [Refereed]

　View Summary

This paper proposes a new multiscan-based test input data compression technique that employs a Fan-out Compression Scan Architecture (FCSCAN) for test cost reduction. The basic idea of FCSCAN is to target the minority specified bits (either 1 or 0) in scan slices for compression. Due to the low specified-bit density in the test cube set, FCSCAN can significantly reduce input test data volume and the number of required test channels, and thereby reduce test cost. The FCSCAN technique is easy to implement with small hardware overhead and does not need any special ATPG for test generation. In addition, based on a theoretical analysis of compression efficiency, improved procedures are also proposed for FCSCAN to achieve further compression. Experimental results on both benchmark circuits and one real industrial design indicate that a drastic reduction in test cost can indeed be achieved.
Transition-based coverage estimation for symbolic model checking

Xingwen Xu, Shinji Kimura, Kazunari Horikawa, Takehiko Tsuchiya

ASP-DAC 2006: 11TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE, PROCEEDINGS 1 - 6 2006 [Refereed]

　View Summary

Lack of complete formal specification is one of the major obstacles for the deployment of model checking. Coverage estimation addresses this issue by revealing the unverified part of the design according to the specified properties. In this paper we propose a new transition-based coverage metric to evaluate the completeness of properties for symbolic model checking. It is more comprehensive and accurate than the existing coverage metrics for model checking. An efficient symbolic algorithm is presented for computing the transition coverage for a subset of ACTL. Our coverage estimator has been applied to the model checking of a cache coherence protocol. We uncovered several coverage holes including one that eventually led to the discovery of a design bug.
Functional State Coverage Estimation for CTL Model Checking

Xingwen Xu, Shinji Kimura, Kazunari Horikawa, Takehiko Tsuchiya

Proceeding of the 20th International Technical Conference on Circuits/Systems, Computers and Communications(ITC-CSCC2005) 1 - 2 2005.07
Low power test compression technique for designs with multiple scan chains

Youhua Shi, Nozomu Togawa, Shinji Kimura, Masao Yanagisawa, Tatsuo Ohtsuki

Proceedings of the Asian Test Symposium 2005 386 - 389 2005 [Refereed]

　View Summary

This paper presents a new DFT technique that can significantly reduce test data volume as well as scan-in power consumption for multiscan-based designs. It can also help to reduce test time and tester channel requirements with small hardware overhead. In the proposed approach, we start with a pre-computed test cube set and fill the don't-cares with proper values for joint reduction of test data volume and scan power consumption. In addition, we explore the linear dependencies of the scan chains to construct a fanout structure using only inverters to achieve further compression. Experimental results for the larger ISCAS'89 benchmarks show the efficiency of the proposed technique. © 2005 IEEE.

DOI

Scopus

17

Citation

(Scopus)
Special section on VLSI design and CAD algorithms

Shinji Kimura

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E88-A ( 12 ) 3273 2005 [Refereed]

DOI

Scopus
Extended abstract: Transition traversal coverage estimation for symbolic model checking

XW Xu, S Kimura, K Horikawa, T Tsuchiya

THIRD ACM & IEEE INTERNATIONAL CONFERENCE ON FORMAL METHODS AND MODELS FOR CO-DESIGN, PROCEEDINGS 259 - 260 2005 [Refereed]
Duplicated register file design for embedded simultaneous multithreading microprocessor

C Zang, S Imai, S Kimura

2005 6th International Conference on ASIC Proceedings, Books 1 and 2 160 - 163 2005 [Refereed]

　View Summary

In modern microprocessors, the access time of the register file becomes a critical part of the total delay. Instruction-level or thread-level parallelism improves Instructions Per Cycle (IPC) by executing multiple instructions in one cycle. Such instructions need to read from and write to the register file simultaneously, so a register file with sufficient ports must be designed. However, the area and access time of a register file with many ports increase sharply. A Duplicated Register File (DupRF) architecture can reduce access time by distributing read ports. In this paper, we propose a new kind of DupRF architecture for embedded Simultaneous Multithreading (SMT) microprocessors and estimate its effect on area and access time. In particular, we measure the product of area and access time as the computation cost. For an SMT microprocessor with 6 threads, 64-bit data width, and 6 function units, a 3-duplicate register file architecture can reduce access time by 12.61% with a slight increase in computation cost of 3.35% compared with the central register file architecture.
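Because the computation cost is defined as the product of area and access time, the two reported percentages imply an area overhead. The short calculation below works that out; the variable names are illustrative, and the resulting area figure is derived here rather than stated in the abstract.

```python
# Computation cost is defined in the abstract as area x access time.
access_time_ratio = 1 - 0.1261   # access time reduced by 12.61%
cost_ratio = 1 + 0.0335          # computation cost increased by 3.35%

# cost = area * access_time  =>  area_ratio = cost_ratio / access_time_ratio
area_ratio = cost_ratio / access_time_ratio

print(f"implied area increase: {(area_ratio - 1) * 100:.2f}%")
```

Under this reading, the 3-duplicate design trades roughly 18% more area for the shorter access time.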
Transition traversal coverage estimation for symbolic model checking

XW Xu, S Kimura, K Horikawa, T Tsuchiya

2005 6TH INTERNATIONAL CONFERENCE ON ASIC PROCEEDINGS, BOOKS 1 AND 2 850 - 853 2005 [Refereed]

　View Summary

Model checking can exhaustively verify a set of specified properties on a given implementation. However, it is very hard to determine whether sufficient properties have been specified or not. In this paper, we propose a transition traversal coverage method for a subset of CTL to evaluate the completeness of properties. With this method, we can detect the transitions which are not verified by any property. It is more comprehensive and accurate than state-based coverage metrics. We avoid generating the perturbed implementation by directly traversing transitions based on the semantics of CTL formulas. Experimental results show that the proposed method can discover subtle coverage holes with low computation cost.
A selective scan chain reconfiguration through run-length coding for test data compression and scan power reduction

Y Shi, S Kimura, M Yanagisawa, T Ohtsuki

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E87A ( 12 ) 3208 - 3215 2004.12 [Refereed]

　View Summary

Test data volume and power consumption for scan-based designs are two major concerns in system-on-a-chip testing. However, test set compaction by filling the don't-cares invariably increases the scan-in power dissipation for scan testing, so the goals of test data reduction and low-power scan testing appear to conflict. Therefore, in this paper we present a selective scan chain reconfiguration method for test data compression and scan-in power reduction. The proposed method analyzes the compatibility of the internal scan cells for a given test set and then divides the scan cells into compatible classes. After the scan chain reconfiguration, a dictionary is built to indicate the run-length of each compatible class, and only the scan-in data for each class needs to be transferred from the ATE to the CUT, which reduces test data volume. Experimental results for the larger ISCAS'89 benchmarks show that the proposed approach overcomes the limitations of traditional run-length coding techniques and leads to highly reduced test data volume with significant power savings during scan testing in all cases.
A hybrid dictionary test data compression for multiscan-based designs

Y Shi, S Kimura, M Yanagisawa, T Ohtsuki

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E87A ( 12 ) 3193 - 3199 2004.12 [Refereed]

　View Summary

In this paper, we present a test data compression technique to reduce test data volume for multiscan-based designs. In our method, the internal scan chains are divided into equal-sized groups and two dictionaries are built to encode either an entire slice or a subset of the slice. Depending on the codeword, the decompressor may load all scan chains or only a group of the scan chains, which enhances the effectiveness of dictionary-based compression. In contrast to previous dictionary coding techniques, even for a CUT with a large number of scan chains, the proposed approach can achieve a satisfactory reduction in test data volume with a reasonably small dictionary. Experimental results show that the proposed test scheme works particularly well for the large ISCAS'89 benchmarks.
Efficient Hardware Architecture of a New Simple Public-Key Cryptosystem for Real-Time Data Processing

C. Jin, N. Doi, H. Tanaka, S. Imai, S. Kimura

Proc. of Workshop on Synthesis and System Integration of Mixed Technologies (SASIMI'2004) 107 - 112 2004.10
An Optimization Method in Floating-point to Fixed-point Conversion using Positive and Negative Error Analysis and Sharing of Operations

N. Doi, T. Horiyama, M. Nakanishi, S. Kimura

Proc. of Workshop on Synthesis and System Integration of Mixed Technologies (SASIMI'2004) 466 - 471 2004.10
Reconfigurable Architecture for Bit-Level Data Processing

S. Kimura

Invited Talk of The 1st Silicon-Seabelt Workshop on VLSI Designs in National Taiwan University 2004.04
Alternative run-length coding through scan chain reconfiguration for joint minimization of test data volume and power consumption in scan test

Youhua Shi, Shinji Kimura, Nozomu Togawa, Masao Yanagisawa, Tatsuo Ohtsuki

Proceedings of the Asian Test Symposium 432 - 437 2004 [Refereed]

　View Summary

Test data volume and scan power are two major concerns in SoC test. In this paper we present an alternative run-length coding method through scan chain reconfiguration to reduce both test data volume and scan-in power consumption. The proposed method analyzes the compatibility of the internal scan cells for a given test set and then divides the scan cells into compatible classes. To extract the compatible scan cells we apply a heuristic algorithm by solving the graph coloring problem, and then a simple greedy algorithm is used to configure the scan chain for the minimization of scan power. Experimental results for the larger ISCAS'89 benchmarks show that the proposed approach leads to highly reduced test data volume with significant power savings during scan test.
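The grouping of scan cells into compatible classes can be viewed as colouring a conflict graph. The sketch below is a minimal illustration under stated assumptions, not the paper's heuristic: each scan cell is described by a string of per-pattern values (`0`, `1`, or `X` for don't-care), two cells conflict when some pattern gives them opposite specified values, and a plain highest-degree-first greedy colouring is used. The cell names and the helper functions `conflict` and `group_compatible_cells` are hypothetical.

```python
from itertools import combinations

def conflict(a, b):
    """Two scan cells conflict if some pattern assigns them opposite specified bits."""
    return any(x != y and 'X' not in (x, y) for x, y in zip(a, b))

def group_compatible_cells(cells):
    """Greedy colouring of the conflict graph; each colour is one compatible class."""
    names = list(cells)
    adj = {n: set() for n in names}
    for u, v in combinations(names, 2):
        if conflict(cells[u], cells[v]):
            adj[u].add(v)
            adj[v].add(u)
    colour = {}
    # Colour high-degree cells first (a common heuristic, assumed here).
    for n in sorted(names, key=lambda n: -len(adj[n])):
        used = {colour[m] for m in adj[n] if m in colour}
        colour[n] = next(c for c in range(len(names)) if c not in used)
    classes = {}
    for n, c in colour.items():
        classes.setdefault(c, []).append(n)
    return list(classes.values())

# Hypothetical test set: one character per test pattern for each scan cell.
cells = {"c0": "01X", "c1": "0X1", "c2": "10X", "c3": "X01"}
classes = group_compatible_cells(cells)
print(classes)
```

Cells in one class never disagree on a specified bit, so a single scan-in value per pattern can serve the whole class.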

DOI

Scopus

2

Citation

(Scopus)
Minimization of fractional wordlength on fixed-point conversion for high-level synthesis

N Doi, T Horiyama, M Nakanishi, S Kimura

ASP-DAC 2004: PROCEEDINGS OF THE ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE 80 - 85 2004 [Refereed]

　View Summary

In hardware synthesis from a high-level language such as C, the bit length of variables is one of the key issues for area and speed optimization. Usually, designers are required to specify the word length of each variable manually and to verify correctness by simulation on huge data sets. In this paper, we propose a method for optimizing the fractional word length of variables in floating-point to fixed-point conversion. The amount of round-off error is formulated with parameters and propagated via data flow graphs. Non-linear programming is used to solve the fractional word-length minimization problem. The method does not require simulation on huge data sets and is very fast compared with simulation-based methods. We have shown its effect on several programs.
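The relationship between fractional word length and round-off error that the optimization relies on can be seen in a toy setting. The sketch below is only an illustration: it rounds sample values to a given number of fractional bits and searches for the smallest word length that meets an error tolerance. This brute-force search over samples is an assumption made for the example; the paper instead formulates the errors analytically and solves a non-linear program. The functions `to_fixed` and `min_frac_bits` and the sample values are hypothetical.

```python
def to_fixed(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def min_frac_bits(values, tolerance, max_bits=32):
    """Smallest fractional word length whose rounding error stays within tolerance."""
    for bits in range(max_bits + 1):
        if all(abs(to_fixed(v, bits) - v) <= tolerance for v in values):
            return bits
    raise ValueError("tolerance not reachable within max_bits")

samples = [0.1, 0.7, -0.35, 1.2]
bits = min_frac_bits(samples, tolerance=0.01)
print(bits, [to_fixed(v, bits) for v in samples])
```

With rounding to nearest, the error for `f` fractional bits is at most `2**-(f + 1)`, which is the kind of parameterized bound an analytic method can propagate through a data flow graph.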
Reducing test data volume for multiscan-based designs through single/sequence mixed encoding

Y Shi, S Kimura, N Togawa, M Yanagisawa, T Ohtsuki

2004 47TH MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOL II, CONFERENCE PROCEEDINGS 445 - 448 2004 [Refereed]

　View Summary

This paper presents a new test data compression technique for multiscan-based designs through dictionary-based encoding of single scan inputs or sequences of scan inputs. In spite of its simplicity, it achieves a significant reduction in test data volume. Unlike some previous approaches to test data compression, our approach eliminates the need for additional synchronization and handshaking between the CUT and the ATE, so it is especially suitable for integration into a low-cost test scheme for SoC test. In addition, in contrast to previous dictionary-based coding techniques, even for a CUT with a small number of scan chains, the proposed approach can achieve a satisfactory reduction in test data volume. Experimental results show that the proposed test scheme works particularly well for the large ISCAS'89 benchmarks.
A built-in reseeding technique for LFSR-based test pattern generation

Y Shi, Z Zhang, S Kimura, M Yanagisawa, T Ohtsuki

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E86A ( 12 ) 3056 - 3062 2003.12 [Refereed]

　View Summary

Reseeding techniques have been proposed to improve the fault coverage in pseudo-random testing. However, most previous work on reseeding is based on storing the seeds in an external tester or in a ROM. In this paper we present a built-in reseeding technique for LFSR-based test pattern generation. The proposed structure can run both in pseudo-random mode and in reseeding mode. Moreover, our method requires no storage for the seeds, since in reseeding mode the seeds are generated automatically in hardware. We also propose an efficient grouping algorithm based on simulated annealing to optimize test vector grouping. Experimental results for benchmark circuits indicate the superiority of our technique over other reseeding methods with respect to test length and area overhead. Furthermore, since the theoretical properties of LFSRs are preserved, our method could be used beneficially in conjunction with other techniques proposed so far.
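For readers unfamiliar with LFSR-based pattern generation, the following minimal sketch shows a Fibonacci LFSR producing pseudo-random patterns and what reseeding means at the state level. The 4-bit width, the feedback polynomial x^4 + x^3 + 1, and the seed values are assumptions chosen for illustration; the abstract does not describe how the hardware derives its seeds, so here they are simply passed in by the caller.

```python
def lfsr_patterns(seed, count, width=4, taps=(4, 3)):
    """Yield `count` states of a Fibonacci LFSR (taps are 1-based bit positions)."""
    state = seed
    mask = (1 << width) - 1
    for _ in range(count):
        yield state
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask

# Pseudo-random mode: run from one seed through the full period.
first_run = list(lfsr_patterns(seed=0b0001, count=15))

# Reseeding mode: jump to a new state to reach patterns sooner.
reseeded = list(lfsr_patterns(seed=0b1010, count=3))
print(first_run, reseeded)
```

A maximal-length 4-bit LFSR visits all 15 non-zero states before repeating; reseeding lets a tester reach a hard-to-hit pattern without stepping through the whole sequence.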
Bit Length Optimization of Fractional Part on Floating to Fixed Point Conversion for High Level Synthesis

N. Doi, T. Horiyama, N. Nakanishi, S. Kimura, K. Watanabe

IEICE Trans. Fundamentals Vol. E86-A ( No. 12 ) 3176 - 3183 2003.12
Bit Length Optimization in High Level Synthesis Based on Analytical Methods (Invited Talk)

Shinji Kimura, Nobuhiro Doi

System on Chip Design Automation Conference 2003 at Korea 2003.11
Bit Length Optimization of Fractional Parts on Floating to Fixed Point Conversion for High-Level Synthesis

Nobuhiro Doi, Takashi Horiyama, Masaki Nakanishi, Shinji Kimura, Katsumasa Watanabe

Proc. of the Workshop on Synthesis and System Integration of Mixed Information technologies 129 - 136 2003.04
An on-chip high speed serial communication method based on independent ring oscillators

S Kimura, T Hayakawa, T Horiyama, M Nakanishi, K Watanabe

2003 IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE: DIGEST OF TECHNICAL PAPERS 46 ( 22.3 ) 390 - 391 2003 [Refereed]
Look up table compaction based on folding of logic functions

S Kimura, A Ishii, T Horiyama, M Nakanishi, H Kajihara, K Watanabe

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E85A ( 12 ) 2701 - 2707 2002.12 [Refereed]

　View Summary

The paper describes a folding method for logic functions that reduces the size of the memories storing the functions. The folding is based on relations between fractions of logic functions. If the logic function includes 2 or 3 identical parts, then only one part needs to be kept and the other parts can be omitted. We show that the logic function of 1-bit addition can be reduced to half size using the bit-wise NOT relation and the bit-wise OR relation. The paper also introduces 3-1 LUTs with the folding mechanism. A full adder can be implemented using only one 3-1 LUT with folding. Multi-bit AND and OR operations can be mapped to our LUTs without an extra cascading circuit by using the carry circuit for addition. We have also tested the capability of mapping 4-input functions to our 3-1 LUTs with the folding and carry propagation mechanisms. We have shown a reduction in area when using our LUTs compared with 4-1 LUTs on several benchmark circuits.
Folding of logic functions and its application to look up table compaction

S Kimura, T Horiyama, M Nakanishi, H Kajihara

IEEE/ACM INTERNATIONAL CONFERENCE ON CAD-02, DIGEST OF TECHNICAL PAPERS 694 - 697 2002 [Refereed]

　View Summary

The paper describes a folding method for logic functions that reduces the size of the memories storing the functions. The folding is based on relations between fractions of logic functions. We show that the fractions of the full adder function have the bit-wise NOT relation and the bit-wise OR relation, and that the memory size becomes half (8 bits). We propose a new 3-1 LUT with folding mechanisms which can implement a full adder with one LUT. A fast carry propagation line is introduced for multi-bit addition. The folding and fast carry propagation mechanisms are shown to be useful for implementing other multi-bit operations and general 4-input functions without extra hardware resources. The paper shows a reduction in area when using our LUTs compared with 4-1 LUTs on several benchmark circuits.
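One way to see how a full adder's 16 truth-table bits can fold to 8 is sketched below. It stores only the rows for `a = 0` (4 sum bits and 4 carry bits) and rebuilds the `a = 1` rows with a bit-wise NOT for the sum and a bit-wise OR for the carry. This particular split on input `a` is an assumption that is consistent with the abstract; the paper's actual folding circuit may be organized differently, and the names `SUM_A0`, `CARRY_A0`, and `folded_full_adder` are hypothetical.

```python
from itertools import product

# Stored half of the truth tables: the a = 0 rows only, indexed by (b, cin).
SUM_A0 = {(b, c): b ^ c for b, c in product((0, 1), repeat=2)}
CARRY_A0 = {(b, c): b & c for b, c in product((0, 1), repeat=2)}

def folded_full_adder(a, b, cin):
    """Return (sum, carry), rebuilding the a = 1 rows from the stored a = 0 rows."""
    s0, c0 = SUM_A0[(b, cin)], CARRY_A0[(b, cin)]
    if a == 0:
        return s0, c0
    # Sum half for a = 1 is the bit-wise NOT of the stored sum half;
    # carry half for a = 1 is the bit-wise OR of the stored carry and sum halves.
    return 1 - s0, c0 | s0

for a, b, cin in product((0, 1), repeat=3):
    print(a, b, cin, folded_full_adder(a, b, cin))
```

The stored tables hold 8 bits in total, matching the halved memory size mentioned in the abstract.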
A Real-Time User-Independent Eye Tracking LSI with Environment Adaptability

K. Nakamura, M. Nakanishi, T. Horiyama, M. Suzuki, S. Kimura, K. Watanabe

In Proc. of the 10th Workshop on Synthesis And System Integration of Mixed Technologies (SASIMI 2001) 357 - 361 2001.10
A New Symbolic Image Computation Algorithm Based on BDD Constrain Operator

S. Kimura, D. Dill, S. G. Govindaraju

In Proc. of the 10th Workshop on Synthesis And System Integration of Mixed Technologies (SASIMI 2001) 167 - 171 2001.10
Speech recognition chip for monosyllables

K Nakamura, Q Zhu, S Maruoka, T Horiyama, S Kimura, K Watanabe

PROCEEDINGS OF THE ASP-DAC 2001: ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE 2001 396 - 399 2001 [Refereed]

　View Summary

In this paper, we present a real-time speech recognition chip for monosyllables such as A, B, etc. The chip recognizes up to 64 monosyllables based on the Hidden Markov Model (HMM), which is a well-known speaker-independent recognition method. The chip accepts a short speech frame of 256 16-bit digitized samples, corresponding to an 11.6 msec period, and outputs the 6-bit symbol code of a monosyllable for 16 short frames (corresponding to 185.6 msec). A learning circuit to update the HMM parameters for the recognition chip has also been designed, and the recognition chip includes an interface to the learning circuit. We have fabricated the recognition chip in the VDEC Rohm 0.6 μm process on a 4.5 mm x 4.5 mm chip. We have also made a layout of the entire circuit, including the learning circuit, in the VDEC Rohm 0.35 μm process on a 4.9 mm x 4.9 mm chip.
A real-time 64-monosyllable recognition LSI with learning mechanism

K Nakamura, Q Zhu, S Maruoka, T Horiyama, S Kimura, K Watanabe

PROCEEDINGS OF THE ASP-DAC 2001: ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE 2001 31 - 32 2001 [Refereed]

　View Summary

In this paper, a real-time 64-monosyllable recognition LSI is presented. The LSI accepts an 11.6 msec speech frame and, in a pipelined manner, outputs a 6-bit symbol code for each frame by the end of the next frame. The recognition method is based on the Hidden Markov Model and is speaker-independent. An on-chip learning mechanism has also been designed, but the circuit is off-chip in the present implementation because of the restriction on LSI area. The LSI is fabricated in the VDEC Rohm 0.6 μm process on a 4.5 mm x 4.5 mm chip.
Multi-cycle path detection based on propositional satisfiability with CNF simplification using adaptive variable insertion

K Nakamura, S Maruoka, S Kimura, K Watanabe

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E83A ( 12 ) 2600 - 2607 2000.12 [Refereed]

　View Summary

Multi-cycle paths are paths between registers on which signals are allowed 2 or more clock cycles to propagate, and the detection of multi-cycle paths is important for deciding a proper clock period, for timing verification, and for logic optimization. This paper presents a satisfiability-based multi-cycle path detection method, in which the detection problems are reduced to CNF formulae and satisfiability is checked using SAT provers. We also show heuristics for the conversion from multi-level circuits into CNF formulae. We have applied our method to ISCAS'89 benchmarks and other sample circuits. Experimental results show remarkable improvements in the size of circuits that can be handled.
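The step of turning a multi-level circuit into a CNF formula can be illustrated with the standard gate-by-gate (Tseitin-style) encoding. The sketch below is an assumption-laden toy: it encodes AND and NOT gates as clauses over integer literals and checks satisfiability by brute force. It does not model the multi-cycle path conditions or the paper's simplification heuristics with adaptive variable insertion, and the helper names and the small example circuit are hypothetical.

```python
from itertools import product

def and_gate(out, a, b):
    """CNF clauses for out <-> (a AND b); literals are signed integers."""
    return [[-out, a], [-out, b], [out, -a, -b]]

def not_gate(out, a):
    """CNF clauses for out <-> NOT a."""
    return [[-out, -a], [out, a]]

def satisfiable(clauses, num_vars):
    """Brute-force SAT check, only meant for tiny examples."""
    for bits in product((False, True), repeat=num_vars):
        def value(lit):
            return bits[abs(lit) - 1] ^ (lit < 0)
        if all(any(value(l) for l in clause) for clause in clauses):
            return True
    return False

# Circuit: n3 = x1 AND x2, n4 = NOT x1, y5 = n3 AND n4 (y5 can never be 1).
cnf = and_gate(3, 1, 2) + not_gate(4, 1) + and_gate(5, 3, 4)
can_be_one = satisfiable(cnf + [[5]], num_vars=5)
can_be_zero = satisfiable(cnf + [[-5]], num_vars=5)
print(can_be_one, can_be_zero)
```

A property about the circuit is checked by adding a unit clause for the condition of interest and asking whether the combined formula is satisfiable; a real flow would hand the clauses to a SAT solver instead of enumerating assignments.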
Robust heuristics for multi-level logic simplification considering local circuit structure

Q Zhu, Y Matsunaga, S Kimura, K Watanabe

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E83A ( 12 ) 2520 - 2527 2000.12 [Refereed]

　View Summary

Combinational logic circuits are usually implemented as multi-level networks of logic nodes. Multi-level logic simplification using the don't cares on each node is widely used. Large don't cares give good simplification results, but suffer from huge memory usage and computation time. Extracting useful don't cares and reducing the size of the don't cares are important problems in simplification using don't cares. In this paper, we propose a new robust heuristic method for the selection of don't cares. We consider an adaptive subnetwork for each simplified node in the network and introduce a stepwise enhancement method for the subnetwork that takes the memory usage and the network structure into account. The don't cares extracted from the adaptive subnetworks are called local don't cares. We have implemented our method for satisfiability don't cares and observability don't cares. We have applied the method to MCNC89 benchmarks and compared the experimental results with those of the SIS system. The results demonstrate the superiority of our method in the quality of the results and in the size of applicable circuits.
Robust Heuristics for Multi-Level Logic Simplification Considering Local Circuit Structure

Q. Zhu, Y. Matsunaga, S. Kimura, K. Watanabe

In Proc. of the 9th Workshop on Synthesis And System Integration of Mixed Technologies (SASIMI 2000) 299 - 306 2000.04
An application specific Java processor with reconfigurabilities

Shinji Kimura, Hiroyuki Kida, Kazuyoshi Takagi, Tatsumori Abematsu, Katsumasa Watanabe

Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC 25 - 26 2000

　View Summary

The paper presents an application specific Java processor including reconfigurabilities, which is a DLX like pipeline processor with 5 stages and executes Java byte codes directly. Reconfigurabilities are the key technologies for application specific operations. We have introduced two reconfigurabilities: (1) a mechanism to override the control signals for a specific instruction, (2) external components can be attached with the same input and output ports as the internal ALU. © 2000 IEEE.

DOI

Scopus
Multi-clock path analysis using propositional satisfiability

Kazuhiro Nakamura, Shinji Maruoka, Shinji Kimura, Katsumasa Watanabe

Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC 81 - 86 2000

　View Summary

We present a satisfiability based multi-clock path analysis method. The method uses propositional satisfiability (SAT) in the detection of multi-clock paths. We show a method to reduce the multi-clock path detection problems to SAT problems. We also show heuristics on the conversion from multi-level circuits into CNF formulae. We have applied our method to ISCAS89 benchmarks and other sample circuits. Experimental results show the improvement on the manipulatable size of circuits by using SAT. © 2000 IEEE.

DOI

Scopus

3

Citation

(Scopus)
Exact minimization of free BDDs and its application to pass-transistor logic optimization

K Takagi, H Hatakeda, S Kimura, K Watanabe

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E82A ( 11 ) 2407 - 2413 1999.11 [Refereed]

　View Summary

In several design methods for Pass-transistor Logic (PTL) circuits, Boolean functions are expressed as OBDDs in decomposed form and then the component OBDDs are directly mapped to PTL cells. The total size of the OBDDs (number of nodes) corresponds to the circuit size. In this paper, we investigate a method for PTL synthesis based on exact minimization of Free BDDs (FBDDs). FBDDs are a well-studied extension of OBDDs with free variable ordering on each path. We present statistics showing that more than 56% of the 616126 NPN-equivalence classes of 5-variable Boolean functions have minimum FBDDs smaller than their OBDDs. This result can be used for PTL synthesis as libraries. We also applied the exact minimization algorithm for FBDDs to the minimization of subcircuits in the synthesis of MCNC benchmarks and found up to 5% size reduction.
Hardware synthesis from C programs with estimation of bit length of variables

O Ogawa, K Takagi, Y Itoh, S Kimura, K Watanabe

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E82A ( 11 ) 2338 - 2346 1999.11 [Refereed]

　View Summary

In hardware synthesis methods based on high-level languages such as C, the optimization quality of the compilers has a great influence on the area and speed of the synthesized circuits. Among the hardware-oriented optimization methods required in such compilers, minimization of the bit length of the data paths is one of the most important issues. In this paper, we propose an algorithm that estimates the necessary bit length of variables for this purpose. The algorithm analyzes the control/dataflow graph translated from C programs and decides the bit length of each variable. In several experiments, the bit length of variables was reduced by half with respect to the declared length. This method is effective not only for reducing the circuit area but also for reducing the delay of operation units such as adders.
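A simple way to picture bit-length estimation is to propagate value ranges through operations and size each variable for its range. The sketch below does this for straight-line integer code only. It is an assumption about how such an estimate could work, not the paper's control/dataflow graph algorithm, and the helper names and the example ranges are hypothetical.

```python
def bits_for_range(lo, hi):
    """Bits needed to hold every integer in [lo, hi] (two's complement if lo < 0)."""
    if lo >= 0:
        return max(1, hi.bit_length())
    return max((-lo - 1).bit_length(), hi.bit_length()) + 1

def add_range(a, b):
    """Range of x + y for x in a and y in b."""
    return (a[0] + b[0], a[1] + b[1])

def mul_range(a, b):
    """Range of x * y for x in a and y in b."""
    corners = [x * y for x in a for y in b]
    return (min(corners), max(corners))

# Hypothetical fragment: i in [0, 9], x in [-8, 7]; y = i * x + i
i_rng, x_rng = (0, 9), (-8, 7)
y_rng = add_range(mul_range(i_rng, x_rng), i_rng)
y_bits = bits_for_range(*y_rng)
print(y_rng, y_bits)
```

Here `y` would need far fewer bits than a declared 32-bit `int`, which is the kind of saving the abstract reports. Interval propagation is conservative, so the estimated range can be wider than the values that actually occur.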
Multi-Level Logic Simplification using Satisfiability Don't Cares

Q. Zhu, Y. Matsunaga, S. Kimura, K. Watanabe

Proceedings of Asia Pacific Conference on cHip Design Languages 127 - 131 1999.10
Timing verification of sequential logic circuits based on controlled multi-clock path analysis

K Nakamura, S Kimura, K Takagi, K Watanabe

IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E81A ( 12 ) 2515 - 2520 1998.12

　View Summary

This paper introduces a new kind of false path, which is sensitizable but does not affect the decision of the maximum clock frequency. Such false paths exist in multi-clock operations controlled by waiting states, and the delay time of these paths can be greater than the clock period. This paper proposes a method to detect these waiting false paths based on the symbolic state traversal. In this method, the maximum allowable clock cycle of each path is computed using update cycles of each register.

Books and Other Publications

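システムLSI設計工学 (System LSI Design Engineering)

Masahiro Fujita, Seiji Kajihara, Shinji Kimura, Hiroaki Takada, Kiyoharu Hamaguchi, Hiroyuki Tomiyama

Ohmsha 2006.10 ISBN: 4274202976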

Research Projects

Hardware-Trojan Detection Utilizing Machine-Learning Models Considering Privacies

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2025.04

-

2028.03
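Approximate Optimization Methods for Reconfigurable Accelerators

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research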

Project Year :

2023.04

-

2026.03

Shinji Kimura, Nozomu Togawa, Heming Sun

　View Summary

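This fiscal year, we surveyed the literature on data representations for reconfigurable accelerators, ad hoc design methods for approximate arithmetic units, and systematic synthesis methods for approximate circuits. Because approximate arithmetic units are optimized under a trade-off between error and metrics such as power, error evaluation is very important. For ad hoc approximate multiplier design, we focused on the compressors that accumulate each column of partial products in multiplication. We carried out Pareto optimization and evaluation of a multiplier that uses compressors producing two outputs of the same weight, a multiplier that uses a newly proposed partial-product compressor together with error reduction based on mixing different types of compressors, and an unbiased multiplier whose errors appear in the positive and negative directions with equal probability, and we presented the results at international conferences. We also designed and evaluated an 8-point approximate DCT (Discrete Cosine Transform) circuit based on the optimization of signed binary numbers, in which each digit carries a sign, and presented it at an international conference. In multiplication by the constant coefficients of the DCT, signed binary numbers allow multiplication by a number consisting of consecutive 1s to be converted into a single subtraction, so the hardware resources for the operations are greatly reduced by approximating each coefficient with as simple a signed binary number as possible, based on its influence on the output. Toward automatic synthesis of approximate circuits, we studied methods that synthesize a given logic function with the exactly minimum number of elements, and proposed an exact synthesis method for the majority operation, whose output is 1 when two or more of its three inputs are 1. Exact synthesis checks all structures, in increasing order of element count, to see whether the target logic function can be realized. When the element count is large, checking the structures cluster by cluster is more efficient than checking them all at once, so we proposed clustering by the levels of each element's inputs, showed that it reduces the total synthesis time compared with other clustering methods, and published the result in a journal.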
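The remark that signed binary digits turn a run of consecutive 1s into a single subtraction can be made concrete with the non-adjacent form (NAF), one standard signed-digit representation. The sketch below converts a positive constant to NAF and multiplies by it with shifts, additions, and subtractions only. NAF is used here as an assumed stand-in; the project's own coefficient approximation and choice of signed-digit form are not shown, and the function names are hypothetical.

```python
def to_naf(n):
    """Non-adjacent form of a positive integer, least significant digit first."""
    digits = []
    while n > 0:
        if n & 1:
            d = 2 - (n % 4)   # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def multiply_by_constant(x, c):
    """Multiply x by the positive constant c using only shifts, adds and subtracts."""
    acc = 0
    for shift, d in enumerate(to_naf(c)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc

print(to_naf(7), multiply_by_constant(13, 7))
```

For example, 7 is `111` in plain binary (two additions when multiplying) but `8 - 1` in signed digits, so multiplying by 7 needs one shift and one subtraction.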
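Design-Stage Hardware-Trojan Detection Using Attack-Resistant Machine-Learning Models

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research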

Project Year :

2022.04

-

2025.03

Nozomu Togawa, Shinji Kimura

　View Summary

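This research targets integrated circuit design data at the register-transfer level, the logic level, and similar levels. It aims to evolve machine-learning models by having them "learn" hardware Trojans, and to establish a technique that identifies each signal line as Trojan or non-Trojan in unknown design data, that is, design data containing unknown hardware Trojans or perturbed hardware Trojans. It further aims to clarify attacks that "deceive" the machine-learning model itself and to build a hardware-Trojan detection technique that is theoretically "hard to deceive".

To achieve this goal, in fiscal 2023 we built attack-resistant machine-learning models from the defender's standpoint, using the optimization of hardware-Trojan "features" carried out in fiscal 2022 and "perturbations" of hardware Trojans.

In fiscal 2023, building on these results, we added "perturbations" to design-stage circuit information. The perturbations (1) keep the circuit functionally equivalent and (2) change the signal-line features that make up a hardware Trojan. We confirmed that adding such perturbations degrades the classification performance of the machine-learning models. Next, for the circuit perturbations that degrade classification performance, we extracted the signal-line features that do not change under perturbation, that is, the perturbation-robust features, and built new machine-learning models based on them. In doing so, we devised model generation by data augmentation and an adversarial training method for hardware Trojans, and we built and evaluated machine-learning models that are theoretically resistant to attack. We also applied this process to the various hardware-Trojan big data held by the principal investigator's group and evaluated it.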
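Reliability and Performance Enhancement of Satellite Computing Systems

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research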

Project Year :

2021.09

-

2023.03

Shinji Kimura, Michael Meyer

　View Summary

Over the past 5 months, Photonic Networks-on-Chip have been studied, especially with respect to reliable routing algorithms. Photonic Networks-on-Chip are inherently more resilient to alpha particles because they use photons for communication, but they suffer from other forms of faults such as thermal variation. To control faults caused by thermal variation, the microring resonators are fabricated with a flattened coil that can heat them. The strain-based calculation can be improved by improving both sub-aspects of the algorithm. First, instead of using a threshold on the number of failed microring resonators, a performance factor is calculated based on every possible combination of failed microring resonators in the individual switch. This way, there is no guessing whether a message can make it through the switch if two microring resonators in the same location have failed. The second improvement is to skip the power-based temperature estimation and replace it by separating the network into segments of single nodes and measuring the temperature of each segment. The resulting publication is almost ready for submission. In the future, the control method might be applied to advanced networks such as satellite networks.
Hardware-Trojan Detection for Integrated Circuit Design Data based on Machine Learning

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2019.04

-

2022.03

Togawa Nozomu

　View Summary

Recently, as Internet of Things (IoT) devices become widespread, the demand for embedded hardware devices has been increasing. In order to produce embedded hardware devices more inexpensively, the manufacturing bases have been internationalized, and several processes in the IC design and manufacturing steps have been outsourced to third-party vendors. Under the circumstances, a hardware Trojan, which is a malicious function circuit inserted into a hardware device, may be inserted into IC products by the malicious third-party vendors, and therefore the risk of hardware Trojans has arisen. In this research, we have developed a machine-learning-based hardware Trojan detection method to detect known and unknown hardware Trojans effectively and efficiently.
Data Format Optimization and Accuracy Guarantee for Reconfigurable Accelerators

Project Year :

2018.04

-

2021.03
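Abstract LSI Model for Global Ultra-Low Energy and Integrated High-Level/Low-Level LSI Design Technology

Grants-in-Aid for Scientific Research (Waseda University), Grants-in-Aid for Scientific Research (Scientific Research (B))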

Project Year :

2013

-

2015

　View Summary

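In fiscal year 2013, we carried out research items (I) to (III), which form the basis of the overall research plan.
(I) Construction of the abstract LSI model: We adopted the abstract LSI model proposed in this research and carried out trial designs of actual applications. The trial designs confirmed that, for real large-scale application programs whose behavioral descriptions exceed several thousand lines, energy can potentially be reduced by enabling power supply control, clock control, and frequency control.
(II) Verification of the abstract LSI model: We formally verified the circuit behavior designed in (I). In particular, we verified that the abstract LSI model based on semantic connections and strong/weak physical connections is equivalent to conventional LSI design models. In addition, using the verification results, we studied control circuit partitioning that preserves equivalence, and studied its formulation as an algorithm in (III).
(III) Construction and verification of low-energy integrated automatic LSI design technology (Phase 1: power supply control): After the validity of the proposed abstract LSI model had been verified through (I) and (II), we constructed and verified an integrated automatic LSI design flow based on it. The basic idea is to perform a virtual physical design, that is, an ideal physical design from the viewpoint of the upper design stages in which the real physical constraints are relaxed, and to reduce the "distance" between it and the real physical design. The distance is defined as the sum of the differences between the positions of the functional modules, or as the sum of the squares of those differences. Targeting power-supply "semantic connection" modules as the "semantic connection", and assuming power gating, multiple supply voltage control, and substrate voltage control, we constructed and verified a low-energy-oriented integrated automatic LSI design technology. We further implemented it on a computer and evaluated it by applying it to several application programs.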
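The "distance" between the virtual and the real physical design mentioned in (III) can be illustrated with a minimal sketch. The summary only says that the distance is the sum of the position differences of the functional modules, or the sum of their squares; the function name `placement_distance`, the dictionary layout, and the choice of Manhattan distance for the plain sum are assumptions made here for illustration, not the project's actual formulation.

```python
def placement_distance(virtual, real, squared=False):
    """Distance between two placements given as {module_name: (x, y)}.

    Assumption: the plain variant sums Manhattan distances per module,
    the squared variant sums squared Euclidean distances per module.
    """
    total = 0.0
    for name, (vx, vy) in virtual.items():
        rx, ry = real[name]
        dx, dy = abs(vx - rx), abs(vy - ry)
        total += dx * dx + dy * dy if squared else dx + dy
    return total


# Hypothetical module positions, for illustration only.
virtual = {"alu": (0, 0), "regfile": (2, 1)}
real = {"alu": (1, 0), "regfile": (2, 3)}
print(placement_distance(virtual, real))                # sum of differences
print(placement_distance(virtual, real, squared=True))  # sum of squares
```

A design flow of the kind described would try to make this value small; the squared variant penalizes a few badly displaced modules more heavily than many slightly displaced ones.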
Abstract LSI Model and Its Associated High-Level Synthesis Algorithm for Deep Submicron Technologies

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2010

-

2012

TOGAWA Nozomu, KIMURA Shinji

　View Summary

In this research, we first developed an abstract LSI model in which we introduce "logical connection" and "physical connection" among registers, controllers, and functional units inside an LSI chip. Using our abstract LSI model, we obtain a well-defined interface between high-level design and physical-level design. Second, we developed a high-level synthesis algorithm for our abstract LSI model, which realizes physical-synthesis-aware high-level synthesis. Our simulation results demonstrate that our abstract LSI model and its associated high-level synthesis outperform several conventional LSI synthesis methods.
Research on design and implementation of Ultra Large scale LSI

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2008

-

2010

GOTO Satoshi, YOSHIMURA Takeshi, KIMURA Shinji

　View Summary

In this research, fundamental technologies ranging from architecture, circuit, and device design to package design have been developed to implement a 100-million-gate LSI with 1/5 of the development period, 1/10 of the fabrication cost, and 1/10 of the power consumption of conventional SoC or SiP technologies. In particular, through (1) research on large-scale system design methodologies, (2) research on large-scale design automation technologies, and (3) research on high-level verification technologies, we achieved a drastic reduction in design and fabrication cost while realizing ultra-low power and very-high-bandwidth communication.
High-level Hardware Verification Based on Equivalence Logic with Similarities

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2005

-

2007

KIMURA Shinji

　View Summary

For formal hardware verification at a high level, we have studied an equivalence checking system based on equivalence logic with uninterpreted functions and similarities. The original equivalence logic manipulates the equivalence of variables and has been shown to be effective for the verification of pipelined processors. Equivalence logic with similarities is a logic system that manipulates the similarity between variables. For example, if we design a circuit with a fixed-point number system and would like to show its correctness with respect to a C program using a floating-point number system, exact equivalence cannot be shown and we must deal with similarity. First, we developed a prototype system that converts Verilog descriptions to equivalence logic formulae, a prototype system that converts C descriptions to equivalence logic formulae, and a prototype equivalence checking system based on time expansion and published equivalence logic checking systems (such as CVCL/YICES). We tested the prototype system and found that the computation grows exponentially with the number of time expansions, and we worked on SAT-based equivalence checking and the transitivity constraints issue. For similarities, we are working on optimizing the number of bits of variables in floating-point to fixed-point conversion, on similarity based on the difference of values, and on similarity based on the difference with the values of other live variables. We have also applied the proposed equivalence checking to multi-threading processor design and to the acceleration of equivalence verification using the prototyping environment.
Hardware Verification with respect to Program Specification

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2002

-

2004

KIMURA Shinji

　View Summary

With the recent development of integrated circuit technology, we can integrate 1 million transistors in one chip. For the design of such huge circuits, high-level design methodologies have been developed and applied to many application-specific chips. In high-level design, programming languages are used to describe the functionality, and the description is automatically converted to hardware modules by high-level synthesis algorithms. Modification and verification should therefore be done at the programming level, and high-level verification methods are needed. In this research, we have developed several basic algorithms to show the correctness of hardware modules with respect to the program specification. First, we surveyed current research on equality with uninterpreted functions and its application to software and hardware verification. We also examined current equality systems such as SVC, CLVL, etc. We applied these systems to the verification of arithmetic circuits and showed their limitations. We also applied the equality checking systems to the verification of parallel and pipelined circuits. In equality checking, the algorithm uses logic formulae to represent and decide the equality. To accelerate the decision procedure, we proposed a prototyping system based on a new look-up-table architecture for Field Programmable Gate Arrays. We devised the architecture and proposed a mapping method for it. The architecture is more area-efficient and faster than the usual look-up-table architecture. For the program specification, we proposed data-path optimization methods based on control-data-flow graphs. In particular, we focused on the bit width of data paths and proposed an optimization method for integer operations and an error estimation method for floating-point operations.
With the optimization and estimation algorithms, we can verify application-specific circuits written as C programs. We have also worked on high-level test and proposed a test pattern compaction method with small area overhead for system-on-chip design.
Research on Design Technology for High-Performance Processors

Project Year :

2002

-

　
Research on Formal Verification Technology for Flexible IP

Project Year :

2002

-

　
Research on IP-Based System LSI Design Technology

Project Year :

2001

-

　
Implementation of Adaptable Hardware and Software for Changing Environment

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

1999

-

2001

WATANABE Katsumasa, HORIYAMA Takashi, TAKAGI Kazuyosi, KIMURA Shinji, NAKANISHI Masaki

　View Summary

The aim of our research is to find out how to construct hardware and software that adapt to a changing environment. For the design and implementation of new information systems, we study methods of constructing reconfigurable systems that depend on the changing environment from the combined viewpoints of hardware and software. Over three years, we studied the following theoretical and practical aspects. (1) From the viewpoint of adaptable software, we made proposals on the representation and construction of active software, the spontaneity and extensibility of objects in conversational programming, and an optimizing C compiler that generates variables of optimum bit length in VHDL. We then implemented some examples and showed the effectiveness of our proposals. (2) From the viewpoint of adaptable hardware, as examples of LSIs with reconfigurability, we designed and constructed a Java processor LSI with the ability to shorten instruction sequences dynamically, an LSI that estimates the eye track, and an LSI that determines the direction of a face person-independently. These LSIs use hardware-oriented algorithms and respond in real time. (3) For hardware synthesis and verification, we proposed a new symbolic image computation algorithm based on the BDD (Binary Decision Diagram) constrain operator. We then showed the good performance and effectiveness of the algorithm on large-scale circuits. (4) From the viewpoint of learning and knowledge acquisition for environmental adaptability, we proposed a method based on OBDDs (ordered BDDs). We then designed algorithms for mutual conversion between the conventional characteristic model and OBDDs. (5) We paid attention to quantum computation. Quantum computers can exploit quantum parallelism to recognize the dynamic characteristics of the environment.
We then studied non-deterministic quantum finite automata (QFA) and compared QFA with their classical counterparts. As a result of the research, we obtained mechanisms for constructing systems with environmental adaptability across hardware and software.
Research on Reconfigurable General Purpose Co-processor Systems and Their Optimized Hardware/Software Codesign Compiler

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

1995

-

1997

WATANABE Katsumasa, TAKAGI Kazuyoshi, KUNISHIMA Takeo, KIMURA Shinji

　View Summary

We have investigated computer systems with reconfigurable general-purpose co-processors, and the hardware/software codesign environment for such systems. The results of our research are as follows: 1. We have proposed a reconfigurable co-processor architecture made of FPGAs (Field Programmable Gate Arrays), a cache memory, and a bus interface. 2. We have designed and implemented a prototype of the co-processor for Sun workstations. The co-processor includes 4 FPGAs, a 1 MB cache memory, and a bus interface with a hardware queue. 3. We have proposed a hardware/software codesign environment for the computer system with the co-processor. We have investigated system description languages and the cooperation method between the main processor and the co-processor. 4. We have designed and implemented a codesign environment that starts from C programs for the co-processor system. The hardware/software codesign compiler accepts a C program and estimates the execution time and the hardware cost of each function when the function is implemented as hardware. The compiler also estimates the execution time of the function when implemented in software. The compiler then decides the implementation method of each function. 5. We have investigated optimization methods for C programs to be implemented as hardware modules on FPGAs. We have introduced hardware-independent optimization methods such as loop unrolling, variable bit-length reduction, and function expansion, as well as hardware-dependent optimization methods such as the 4-1 LUT (Look-Up Table) based hardware estimation method and the merge method for bit-level operations. 6. We have tested several algorithms on the prototype of the codesign system, including lexical analysis, sorting, and several graphics applications.
We have found that the FPGA-based co-processor is useful for the fast execution of programs when the program includes the parallel-if structure or bit-level operations. In the future, we would like to investigate context switching on the co-processor system and dynamic reconfigurability of the co-processor.
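Research on Automatic Synthesis of Logic Circuits Using Binary Decision Diagrams

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research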

Project Year :

1992

　

　

KIMURA Shinji
Studies on Digital-Controller Configuration Design and Its Synchronization Control Using Multiple Digital Signal Processors.

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

1990

-

1992

HANEDA Hiromasa, KIMURA Shinji, OHTA Yuzo

　View Summary

1. Modeling Multi-Processor Digital Controllers and Their Synchronization Scheme: Each processor is required to observe timing constraints to avoid command signal collision and is preferred not to have idling time intervals. Appropriate models have been investigated and selected for both feedback-type and program-controlled-type controllers, with an emphasis on the verification of synchronization capabilities. It was concluded that a discrete-time system representation is appropriate when viewed from the digital side. We have adopted a gain matrix model for state feedback and a discrete-time transfer matrix model for program control. As a by-product, it was learned that the robust stability compensation method is not mature enough for applications. Hence, we have investigated and developed a new and more useful stabilization method. 2. Minimum Throughput-Time Configuration and Synchronization Control: For single-input single-output controllers, a shared memory and bus configuration was proposed. We have investigated the computational loading mechanism among the processors and proposed a new synchronization that achieves the minimum throughput time. The results were also extended to multiple-input multiple-output cases. 3. Verifying the Proposed Scheme: Several supporting software tools have been developed to verify the proposed schemes and to be used in the design process: Digital Signal Processor Command Generator, Throughput Estimator, and Hybrid Simulator. Digital Signal Processor Command Generator produces a program list written in the processor's commands for given digital-controller characteristics, configuration, and synchronization control protocol. Throughput Estimator evaluates the throughput efficiency of a given control program. Hybrid Simulator simulates digital control systems that include an analog plant, A/D and D/A converters, and digital controllers.
Users can select different types of converters with different arithmetic, and plants can be modeled as a block diagram using elementary blocks.
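Research on Parallel Construction Algorithms for Binary Decision Diagrams and Their Application to Design Verification

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research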

Project Year :

1991

　

　

KIMURA Shinji
OPERATION ON SETS AND ITS APPLICATIONS TO COMPUTER AIDED DESIGN OF ROBUST CONTROL SYSTEMS

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

1988

-

1989

OHTA Yuzo, KIMURA Shinji, HANEDA Hiromasa

　View Summary

Because of variation and uncertainty of system parameters, mathematical models invariably give an imperfect description of real systems. Therefore, robustness is one of the most fundamental requirements for control systems. The main results of this research project may be summarized as follows: 1. Polygon Interval Arithmetic. To treat uncertainty, we defined operations (addition, multiplication, and reversion) on the set of all convex polygons, which we call polygon interval arithmetic. We derived several important properties of polygon interval arithmetic. We also proposed an efficient algorithm to calculate the addition of convex polygons. 2. Robust Stability Analysis. (1) The stability of feedback systems can be analyzed by examining the determinant of the return difference matrix at every frequency. A method based on the mapping theorem has been used to calculate it, but it is very time consuming. We proposed a method based on both polygon interval arithmetic and the mapping theorem to calculate the determinants. (2) The stability of (nonlinear) systems can be examined by using Liapunov functions. We proposed a method to construct a Liapunov function via a computational geometry technique for calculating convex hulls. 3. RSRD (Robust Sequential Return Difference) method. We proposed the RSRD method for designing robust control systems, which uses polygon interval arithmetic, makes it possible to design the controller of each loop "independently", and guarantees integrity. 4. CAD (Computer Aided Design) System. We developed a CAD system for designing robust control systems based on the RSRD method. We also implemented a program to calculate the stability margin of multi-input multi-output systems.
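The addition of convex polygons mentioned under polygon interval arithmetic can be sketched as follows, assuming that addition means the element-wise (Minkowski) sum of the two point sets. This is a brute-force illustration that takes the convex hull of all pairwise vertex sums; it is not the efficient algorithm proposed in the project, and the names `convex_hull` and `polygon_add` are hypothetical.

```python
from itertools import product


def cross(o, a, b):
    """Z component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_add(p, q):
    """Minkowski sum of two convex polygons given as vertex lists."""
    sums = [(a[0] + b[0], a[1] + b[1]) for a, b in product(p, q)]
    return convex_hull(sums)


unit_square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(polygon_add(unit_square, unit_square))
```

With n and m vertices this costs O(nm log nm); merging the two edge sequences by angle would bring the cost down to O(n + m), which is presumably closer in spirit to an efficient algorithm.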
Studies on Computer-Aided Design of Microprocessor Controlled Precise AC Servo Systems.

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

1984

-

1986

HANEDA Hiromasa, KIMURA Shinji

　View Summary

Robust and maintenance-free induction motors, used in every movable portion of engineering, have seen new drive technologies based on power electronics, control schemes, and microprocessor implementation techniques. This has led to an inevitable need for a computer-aided design environment that provides higher reliability and efficiency in a more complex design process. The following research has been carried out: 1. The design method of precise AC servo systems has been investigated. A theoretical basis for vector control is given in a circuit theory context, which is suitable for both qualitative analysis and computer-aided procedural applications. 2. Computer-aided design methods have been studied. General and efficient CAD methods based on sparsity, decomposition, and discretization techniques have been investigated to cope with the analysis of electronic-mechanical control systems. The proposed method has been implemented in a new CAD environment. 3. The CAD tool environment was applied to the real design of a precise AC servo system, and the result was verified experimentally.
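Construction Method for Evolutionary Software Adapting to Content

　View Summary

This research studies an evolutionary construction method for software whose processing targets have an application range that cannot be fixed at design and development time. In fiscal year 1997, we surveyed methods and case studies for constructing software in an evolutionary manner and, concretely, traced the process of constructing programs while extending the functions (specifications) of the software. As one result, we set out the policy of "Programming on cells" (Poc) and proceeded with building an editor for it. From the viewpoint of hardware/software co-design, we also examined hardware support for meta-level functions and the like. In programming on cells, we introduced four kinds of cells: data cells, start cells, name cells, and pattern cells. We also proposed making program behavior easier to understand by explicitly stating the precondition under which each cell becomes active and the postcondition resulting from its activity. Furthermore, to prepare an environment that supports programming with cells, we planned the structure of the Poc editor and implemented part of it. As a practical application of Poc, we proceeded with developing a program that converts descriptions of hand and finger motions into a 3D graphics display. As a result, by preparing basic syntactic forms corresponding to description sentences prepared in advance, a dictionary of words and phrases, and estimation rules for display parameters, we were able to create a program that converts the descriptions into an intermediate representation. Meanwhile, to support the execution environment of evolving software from the hardware side, we examined the usefulness of the already developed "general-purpose co-processor with an FPGA-based reconfigurable logic part". In the future, while enriching these results, we will continue to study "evolving software" from both the changing hardware and software sides.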
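Research on Parallelization of Implicit State Enumeration for Design Verification of Sequential Machines

　View Summary

In this research, we parallelized the enumeration of reachable states of sequential circuits using binary decision diagrams, an efficient representation of logic functions. The method is called implicit state enumeration and is used for the verification and test generation of sequential circuits. Implicit state enumeration is basically a method that exhaustively covers the set of states reachable from the initial states. In a sequential circuit, the function that determines the next state from the current state and the inputs is expressed as a logic function by binary encoding of the states. The set of states reached so far, and similar sets, are also represented by characteristic functions, that is, logic functions that take the value 1 for members of the set. We studied how to handle these logic functions with parallel binary decision diagram processing methods. Since the processing of these logic functions basically amounts to multiple sequences of logic operations, we studied, as a generalized problem, how to process a given sequence of logic operations at high speed. We studied three parallel processing methods: one using Shannon expansion, one that processes each output separately, and one that generalizes the Shannon expansion. We found that Shannon expansion is superior for logic functions requiring large amounts of main memory, and that output-wise partitioning is effective for general benchmark circuits. Research on the generalization of Shannon expansion is still ongoing. In experiments on the Fujitsu Laboratories AP1000, we achieved a speedup of about 130 times for multiplier processing using 512 processors, and for general benchmark circuits a speedup of about 27 times in good cases and about 13 times on average with 64 processors. In the future, the properties specific to implicit state enumeration need to be studied in more depth, and parallelization exploiting them needs to be considered.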
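The fixed-point iteration behind reachable-state enumeration can be sketched as below. To keep the example self-contained, plain Python sets stand in for the BDD-based characteristic functions described in the summary, so the enumeration here is explicit rather than implicit; the function name `reachable_states` and the example counter are assumptions for illustration only.

```python
def reachable_states(initial, next_state, inputs):
    """Return all states reachable from `initial`.

    next_state(state, x) gives the successor of `state` under input `x`.
    Sets are used here in place of BDD characteristic functions.
    """
    reached = set(initial)
    frontier = set(initial)
    while frontier:
        # Image computation: successors of the newly found states.
        image = {next_state(s, x) for s in frontier for x in inputs}
        frontier = image - reached
        reached |= frontier
    return reached


# Hypothetical 3-bit counter that counts modulo 6 when enabled.
def counter(state, enable):
    return (state + 1) % 6 if enable else state


print(sorted(reachable_states({0}, counter, [0, 1])))
```

In this example the encodings 6 and 7 are never reached, which is the kind of information that verification and test generation can exploit.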
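Research on Data Structures and Computation Models for Massively Parallel Algorithm Design

　View Summary

In the design of sequential algorithms, it is well known that careful design of data structures greatly affects the design of efficient algorithms, and the importance of data structures should naturally also be recognized in the design of massively parallel algorithms running on tens of thousands to millions of processors. This research aims to establish a computation model for the design of massively parallel algorithms and to clarify the design principles of data structures on that model. In particular, considering constraints on the amount of inter-processor communication, we aim to establish "locally computable data structures" suited to processing with limited communication. This fiscal year's research covered the following. 1) Data structures on multi-layer hierarchical mesh networks: To build a basic theory of the capability of the RDT network proposed in this priority-area research and of algorithm development on it, we defined the multi-layer hierarchical mesh network as a concept encompassing the RDT network, and investigated the relationship between the network structure and the overhead due to data structures and communication. As a result, we confirmed the effectiveness of multi-layer hierarchical mesh networks, including the RDT network, in parallel computers on the scale of tens of thousands of processors. 2) Locally computable encoding: Continuing from the previous fiscal year, we studied conditions on encodings that make all unary operations locally computable for a finite set on which multiple unary operations are defined, and obtained several theoretical results. 3) Parallel processing algorithms for binary decision diagrams: We studied parallel algorithms for binary decision diagrams, an important data structure in the field of combinatorial problems, and implemented them on an actual parallel computer to examine their performance. The program has also been applied to practical areas such as design verification. As described above, this fiscal year's research carried out theoretical studies on multi-layer hierarchical mesh networks and local computability in parallel with devising parallel processing algorithms for binary decision diagrams and implementing them as programs.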
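Research on Formal Parallel Design Verification Methods for Pipeline Processing

　View Summary

In this research, we studied formal parallel design verification methods for pipelined processing. In particular, focusing on the verification of the control scheme of pipelined processors, we presented a method, based on implicit state enumeration using binary decision diagrams, for determining whether hazards, that is, disturbances of the pipeline, occur when instructions are processed in the pipeline. Hazards are usually detected by simulation, and our method performs this simulation symbolically and exhaustively for all cases. Concretely, two consecutive instructions are given symbolically and executed symbolically. Instructions other than the two of interest are set to NOP instructions. At the same time, an instruction sequence in which an appropriate number of NOP instructions are inserted between the two instructions is executed symbolically and compared with the first instruction sequence, which detects whether a hazard occurs and what mechanisms the processor has for eliminating hazards. The symbolic execution uses the implicit state enumeration method for sequential circuits. Execution continues until the values of all registers, except the program counter, reach a steady state. The results of symbolic execution are represented as logic functions. Verification is performed by comparing, for each instruction sequence, the number of clock cycles until the steady state and whether the register values in the steady state are equal. To simplify the arithmetic circuit part of the circuit under symbolic execution, we proposed a new binary decision diagram called the residue BDD. Regarding parallelization, we presented a parallelization method for implicit state enumeration. This is a new method that regards the graph of a binary decision diagram itself as a data-flow graph and extracts parallelism from it. With it, we achieved a speedup of about 4 times with 10 CPUs. In the future, it will be necessary to apply this verification method to superscalar processors and to combine the parallelization method that uses the graph structure of binary decision diagrams with parallelization methods for ordinary binary decision diagram operations.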
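Research on Automatic Extraction of Reduced Models of Logic Circuits and Their Use in Design Verification of Large-Scale Logic Circuits

　View Summary

In this research, we studied the extraction of reduced models of logic circuits and the verification of large-scale logic circuits using them. First, we surveyed verification methods that use reduced models. Next, we studied binary decision diagrams (BDDs), which are used in many current logic design verification methods. In particular, we studied a method that reduces BDD size by treating the outputs of suitable internal logic gates as variables and that decides the equivalence of two circuits having different internal variables. For the equivalence check, we newly developed and used a method that converts the internal variables of one circuit into those of the other in polynomial time. Second, we proposed a method for suppressing the node explosion of binary decision diagrams for arithmetic circuits such as multipliers. The method is based on the residue number representation and, using the property that the inputs of arithmetic circuits correspond to binary numbers, bounds the number of BDD nodes by a polynomial in the number of input variables. The resulting bounded BDD is called a residue BDD. In verification, the circuit is verified separately for each of several moduli. As is known from the residue number representation, the set of residues of the original function represents the original function completely, so verifying each residue separately is sufficient. We first applied residue BDDs to the verification of combinational circuits and confirmed a certain degree of effectiveness. We also studied their application to the verification of sequential circuits containing arithmetic circuits such as multipliers. Third, we studied reduction methods based on circuit structure, which are important for processor verification and the like, and studied a method that regards a logic circuit as a graph and reduces parts with identical structure. Furthermore, we studied specification description methods based on temporal logic, and studied a method that, given a specification, reduces the circuit parts irrelevant to it.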
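The principle behind verifying a circuit separately for each modulus can be illustrated with a small sketch. It relies only on the Chinese remainder theorem: if the moduli are pairwise coprime and their product exceeds the largest possible output, agreement on every residue implies equality. The exhaustive loops below stand in for the residue BDDs of the summary, and all names (`moduli_cover`, `equal_modulo`, `verify_by_residues`) are hypothetical.

```python
from itertools import product
from math import gcd, prod


def moduli_cover(moduli, max_value):
    """True if the moduli are pairwise coprime and their product exceeds max_value."""
    coprime = all(gcd(m, n) == 1
                  for i, m in enumerate(moduli) for n in moduli[i + 1:])
    return coprime and prod(moduli) > max_value


def equal_modulo(impl, spec, n_bits, m):
    """Compare impl and spec modulo m over all pairs of n_bits-wide inputs."""
    values = range(1 << n_bits)
    return all(impl(a, b) % m == spec(a, b) % m
               for a, b in product(values, repeat=2))


def verify_by_residues(impl, spec, n_bits, moduli, max_value):
    """Check impl against spec one modulus at a time.

    Assumes both functions return values in the range 0..max_value.
    """
    if not moduli_cover(moduli, max_value):
        raise ValueError("moduli do not determine the result uniquely")
    return all(equal_modulo(impl, spec, n_bits, m) for m in moduli)


def spec(a, b):
    return a * b


def shift_add(a, b):
    """4-bit shift-and-add multiplier used as the implementation under test."""
    return sum(a << i for i in range(4) if (b >> i) & 1)


# 4-bit operands give products up to 15 * 15 = 225 < 3 * 5 * 17 = 255.
print(verify_by_residues(shift_add, spec, 4, [3, 5, 17], 225))
```

Each per-modulus check only has to track values smaller than the modulus, which is the property that keeps the corresponding decision diagrams small.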
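Research on Speeding Up Synthesis and Optimization Methods for Logic Circuits

　View Summary

In this research, we studied high-speed synthesis techniques for large-scale logic circuits. Many optimization problems in logic synthesis are NP-complete, and it is often difficult to devise efficient algorithms for them. We therefore studied methods that shorten logic synthesis time by avoiding unnecessary use of the optimization functions of logic synthesis. First, focusing on the bit width of the data path, we studied a method that reduces the time needed for logic optimization by making the bit width the minimum necessary. Concretely, we proposed a method that analyzes the function of a circuit described in VHDL, C, or a similar language, obtains the minimum and maximum values of the variables used in the functional description, and takes the logarithm of their difference to give each variable the minimum necessary bit width. We further proposed a method that reduces the bit widths of the associated arithmetic units, thereby reducing the total amount of hardware and the time needed for its synthesis. Bit-width reduction was observed for flag variables, loop control variables, and the like, and the amount of hardware was reduced by about 20%. For comparisons with constants and similar operations, circuits that perform the constant test at the gate level are generated automatically so that the optimization functions of the logic synthesis system are not used. The method works as a front end to an ordinary logic synthesis system and has the effect of reducing the application of logic optimization functions. We also studied a method for high-level timing analysis of the logic circuits generated by these methods. Furthermore, we studied the parallelization of the transduction method, one of the logic synthesis optimization methods, and proposed a method that performs circuit transformation and optimization in parallel. This parallelization method works effectively on shared-main-memory parallel computers and achieved a speedup of about 2 times with 4 processors. Finally, regarding the integration of logic synthesis algorithms and logic element assignment methods, we developed a method that performs logic element assignment for FPGA implementation at the VHDL level, centered on basic arithmetic units, and shortened the processing time of the logic synthesis system. These methods are currently being implemented and improved.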
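The bit-width rule described above (take the logarithm of the difference between a variable's minimum and maximum values) can be written as a short sketch. It assumes that a variable is stored as an offset from its minimum value, which the summary does not state, and the function name `min_bit_width` is hypothetical.

```python
def min_bit_width(lo, hi):
    """Smallest number of bits that can encode every integer in [lo, hi].

    Assumption: values are stored as offsets from `lo`, so the width is
    ceil(log2(hi - lo + 1)), with at least one bit.
    """
    if lo > hi:
        raise ValueError("empty value range")
    span = hi - lo
    return max(1, span.bit_length())


print(min_bit_width(0, 1))      # flag variable
print(min_bit_width(0, 9))      # loop counter running from 0 to 9
print(min_bit_width(100, 103))  # narrow range far from zero
```

A variable declared as a 32-bit `int` but used only as a loop counter from 0 to 9 would need 4 bits under this rule, which is where the hardware savings for flag and loop control variables come from.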
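Construction Method for Evolutionary Software Adapting to Content

　View Summary

This research studies an evolutionary construction method, including its implementation scheme, for software whose processing targets have an application range that cannot be fixed at design and development time. In fiscal year 1997, we traced the process of constructing programs while extending the functions (specifications) of the software and, as a result, set out the policy of "Programming on cells" (Poc). In programming with cells, a program is composed of a collection of cells. Its distinguishing feature is the introduction of active cells, which start themselves as soon as their preconditions are satisfied. In fiscal year 1998, we actually built the Poc editor. In addition to plain editor functions, it can gather groups of cells and combine them into a single C program. Using it, we wrote several programs, examined the problems, and evaluated the approach. From that experience, we proposed the "active computation model". The active computation model consists of functions that are actively started by their preconditions and a part that controls their activation; it has properties intermediate between fully autonomous functions and passive functions started by others. As efficient implementation mechanisms for Poc, we examined dynamic linking mechanisms and the organization of computers with a reconfigurable hardware part. Believing that, for software to be extended in an evolutionary manner, computer architecture mechanisms that support their realization are effective in addition to new concepts and new languages expressing them, we pursued research related to hardware/software co-design. We also extracted only the characteristics of the active cells of Poc to introduce the active computation model, and added to the C language the definition of active functions that start themselves under certain conditions. Based on this, we considered new algorithms and examined how multiple active functions operate in parallel on a parallel computer. In the future, we will develop a language processor for active programs and seek ways of constructing programs in an evolutionary manner for real problems such as analyzing descriptions of hand and finger motions and supporting the drafting of English contracts. In all of these problems, extension of the program specification is unavoidable. We will also examine new computer architectures suited to executing active programs. In this way, we will continue to study "evolving software" from both the changing hardware and software sides.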


Misc

Data Structure for Quantum Annealing Emulator

( 2019 ) 39 - 44 2019.08

CiNii
Implementation and Optimization of Parallel Prefix Adders Using Majority Function

117 ( 274 ) 109 - 114 2017.11

CiNii
Implementation and Optimization of Parallel Prefix Adders Using Majority Function

117 ( 273 ) 109 - 114 2017.11

CiNii
High Accuracy 8×8 Approximate Multiplier based on OR Operation (VLSI Design Technologies)

GUO Yi, SUN Heming, JIN Canran, KIMURA Shinji

IEICE technical report 116 ( 478 ) 19 - 24 2017.03

CiNii
MERP-CNN : A memory-efficient reconfigurable processor for convolutional neural networks based on FPGA (VLSI Design Technologies)

HAN Xushen, ZHOU Dajiang, KIMURA Shinji

IEICE technical report 116 ( 21 ) 47 - 52 2016.05

CiNii
Write-Reduction using Encoding data on MLC for Non-Volatile Memories

115 ( 398 ) 221 - 225 2016.01

CiNii
Control Signal Extraction for Backward Sequential Clock Gating

115 ( 398 ) 97 - 102 2016.01

CiNii
A Circuit Area-Aware Bit-Write Reduction Code Generation for Non-Volatile Memories

115 ( 338 ) 249 - 253 2015.12

CiNii
Small-Sized Encoder/Decoder Circuit Design for Bit-Write Reduction Targeting Non-Volatile Memories

2014 ( 35 ) 1 - 6 2014.11

CiNii
Write Reduction of Internal Registers for Non-volatile RISC Processors

GOTO Tomoya, YANAGISAWA Masao, KIMURA Shinji

Mathematical Systems Science and its Applications : IEICE technical report 114 ( 125 ) 213 - 218 2014.07

　View Summary

Recently, next-generation non-volatile memories based on MTJs (Magnetic Tunnel Junctions) have attracted attention because of their sufficient endurance and fast access speed. Their access speed is comparable with that of CMOS memory devices, but their write energy is far larger. Reducing write operations is therefore very important. In this study, we propose write-reduction methods that depend on the types of internal registers of RISC processors. By considering the types, the control circuit can be reduced. For the register file, write operations are reduced by using "write aware flags" and "sign extension flags". For the program counter, write operations are reduced by using "XOR-based comparison" and "carry detection". The proposed method is applied to the MIPS32 processor, and the write activity has been evaluated using a simulator. The write activity can be reduced by about 93.1-93.8% on the register file and by about 54.5-56.8% on the program counter.

CiNii
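As a rough illustration of the XOR-based comparison mentioned in the abstract above, the sketch below counts how many bit cells change when a register is updated, assuming that only bits whose value differs from the stored one are written. It covers the XOR part only and does not model the paper's flags or carry detection; the function names and the program counter trace are hypothetical.

```python
def changed_bits(old, new, width=32):
    """Bit mask of the positions where old and new differ."""
    return (old ^ new) & ((1 << width) - 1)


def count_bit_writes(values, width=32):
    """Bit writes for a sequence of register values when only changed bits are written."""
    writes = 0
    for old, new in zip(values, values[1:]):
        writes += bin(changed_bits(old, new, width)).count("1")
    return writes


# Hypothetical program counter trace that advances by 4 on each instruction.
pc_trace = [0, 4, 8, 12, 16]
selective = count_bit_writes(pc_trace)
naive = 32 * (len(pc_trace) - 1)  # every bit rewritten on every update
print(selective, naive)
```

For a sequentially incrementing program counter most high-order bits stay unchanged, so the selective count is far below the naive one.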
A Reduction Method of Writing Operations to Non-volatile Memory by Keeping Data Difference for Low-Power Circuit Design

2014 ( 30 ) 1 - 6 2014.01

CiNii
Energy Evaluation of Writing Reduction Method for Non-Volatile Memory

2013 ( 26 ) 1 - 6 2013.11

CiNii
Power Reduction of Non-volatile Logic Circuits Using the Minimum Writing Power Cut-set of State Registers

2013 ( 27 ) 1 - 6 2013.11

CiNii
Write Reduction for Non-volatile Registers Using the Max-flow Min-cut Theorem

2012 ( 19 ) 1 - 6 2012.10

CiNii
Write Control Method Based on State Transition for Magnetic Flip-Flop

OKADA Naoya, NAKAMURA Yuichi, KIMURA Shinji

Technical report of IEICE. VLD 112 ( 71 ) 13 - 18 2012.05

　View Summary

In this manuscript, we propose a write control method for the nonvolatile MFF (Magnetic Flip-Flop). The MFF enables leakage power reduction in logic circuits and quick return from standby mode. However, it consumes about 10 times as much power as a conventional DFF during write operations, so it is desirable to reduce redundant write operations. We focus on the state transitions of sequential circuits to detect them. If the next state and outputs do not depend on some current bit, the bit is redundant and need not be written. We propose a method to detect such bits. Our method can be combined with a reduction method based on the EXOR of the current value and the new value. When the combined method is applied to several benchmark circuits, up to 15.3% power reduction is achieved with an area overhead of 1.9% to 4.8% compared with the EXOR-based method alone.

CiNii
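The idea of a state bit on which the next state does not depend can be sketched with a simple exhaustive check. This is a simplified, global version of the condition (a bit is reported only if it never influences the next state for any state and input), it ignores circuit outputs, and it is not the detection method of the paper; `redundant_bits` and the example machine are hypothetical.

```python
def redundant_bits(next_state, n_bits, inputs):
    """State bits whose value never affects the next state.

    A bit is reported if flipping it leaves next_state unchanged for every
    state and every input (exhaustive check, small examples only).
    """
    redundant = []
    for i in range(n_bits):
        mask = 1 << i
        if all(next_state(s, x) == next_state(s ^ mask, x)
               for s in range(1 << n_bits) for x in inputs):
            redundant.append(i)
    return redundant


# Hypothetical 2-bit machine whose next state depends only on bit 0.
def toggle(state, x):
    return (state & 1) ^ x


print(redundant_bits(toggle, 2, [0, 1]))
```

Here bit 1 is reported as redundant: whatever value it holds, the machine's future behavior is the same, so writing it to a non-volatile cell would be wasted energy.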
A-3-10 A Control Circuit Based on Analysis of State Transition

Okada Naoya, Nakamura Yuichi, Kimura Shinji

Proceedings of the IEICE General Conference 2012 94 - 94 2012.03

CiNii
A-3-8 Memory-based Arithmetic Circuits on FPGA and Their Power Evaluation

Yu Xinmu, Hamaguchi Kiyoharu, Kimura Shinji

Proceedings of the IEICE General Conference 2012 92 - 92 2012.03

CiNii
ILP-based Multi-Operand Adder Synthesis on FPGAs using Generalized Parallel Counters

MATSUNAGA TAEKO, KIMURA SHINJI, MATSUNAGA YUSUKE

IEICE technical report 111 ( 40 ) 39 - 44 2011.05

　View Summary

Recent research suggests that multi-operand adders can be efficiently realized on FPGAs by using compressor trees, which reduce the number of operands without any carry propagation. This paper addresses compressor tree synthesis based on Generalized Parallel Counters (GPCs). The target problem is formulated as minimizing the total number of GPCs under the minimum number of GPC levels, and an ILP-based algorithm is proposed.

CiNii
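For readers unfamiliar with generalized parallel counters, the behavior of a single GPC can be sketched as follows: it takes input bits from several adjacent columns (weights 1, 2, 4, ...) and outputs their weighted sum in binary. This is a behavioral model only, with a hypothetical function name `gpc`; it says nothing about the ILP formulation or the FPGA mapping in the paper.

```python
def gpc(column_bits):
    """Behavioral model of a generalized parallel counter.

    column_bits[i] is the list of input bits with weight 2**i.
    Returns the output bits, least significant first.
    """
    total = sum(sum(bits) << i for i, bits in enumerate(column_bits))
    max_total = sum(len(bits) << i for i, bits in enumerate(column_bits))
    n_out = max(1, max_total.bit_length())
    return [(total >> j) & 1 for j in range(n_out)]


# A (2,3;3) counter: three bits of weight 1 and two bits of weight 2.
print(gpc([[1, 1, 1], [1, 1]]))
# A (3;2) counter, i.e. a full adder.
print(gpc([[1, 0, 1]]))
```

A compressor tree applies such counters level by level until only two rows of bits remain, which are then summed by one carry-propagate adder.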
Multi-Stage Power Gating Based on Controlling Values of Logic Gates

JIN Yu, KIMURA Shinji

IEICE technical report 111 ( 40 ) 33 - 38 2011.05

　View Summary

Controlling-value-based power gating is a fine-grained power gating approach that uses the controlling values of logic elements. In this method, when one input of a logic gate takes its controlling value, the logic block generating the other inputs is powered off. In this paper, we propose a multi-stage power gating method by considering the application of our method to the power-controlled blocks in active mode. Experimental results show that the proposed approach increases the number of powered-off elements by 10%-50% compared with single-stage methods that cluster the power-controlled blocks.

CiNii
Write Optimization for High-speed Non-volatile Memory Using Next State Function

OKADA Naoya, NAKAMURA Yuichi, KIMURA Shinji

IEICE technical report 110 ( 432 ) 165 - 170 2011.02

　View Summary

Non-volatile memory, such as MRAM and PCM, is attracting attention as a way to reduce power consumption. However, it consumes large write energy and has a limit on the number of write operations. It is therefore desirable to reduce redundant writes to non-volatile memory. In this manuscript, a method for detecting redundant writes is proposed based on the next-state function. If the next state does not depend on some current bit, the bit is redundant and need not be written. Experimental results on the ISCAS'89 benchmark circuits show that 0.45% to 50.78% of writes are redundant.

CiNii
Low power synthesis of multi-operand adders using carry-chain structures on FPGAs

IEICE technical report 110 ( 361 ) 93 - 98 2011.01

CiNii
Low power synthesis of multi-operand adders using carry-chain structures on FPGAs

IEICE technical report 110 ( 362 ) 93 - 98 2011.01

CiNii
Low power synthesis of multi-operand adders using carry-chain structures on FPGAs

IEICE technical report 110 ( 360 ) 93 - 98 2011.01

CiNii
Low power synthesis of multi-operand adders using carry-chain structures on FPGAs

2011 ( 16 ) 1 - 6 2011.01

CiNii
Sharing of Clock Gating Modules under Multi-Stage Clock Gating Control

MAN Xin, HORIYAMA Takashi, KIMURA Tomoo, KAI Koji, KIMURA Shinji

IEICE technical report 110 ( 316 ) 185 - 190 2010.11

　View Summary

Clock gating is an effective technique for reducing the dynamic power consumption of sequential circuits. This paper presents a method for sharing clock gating logic under multi-stage clock gating control. By sharing the clock gating logic, the total activity of registers and clock gating modules can be reduced. The method is implemented using BDDs and is applied to counters and a set of benchmark circuits. The proposed multi-stage clock gating generation method achieved a cost reduction of 23.0% on average. Power estimation using layout data is also shown.

CiNii
Advances in VLSI Technologies for Ultra-Low-Power Computing : Ultra Low Power SoC Design Technologies for Media Processing

GOTO Satoshi, IKENAGA Takeshi, YOSHIMURA Takeshi, KIMURA Shinji, TOGAWA Nozomu

IPSJ Magazine 51 ( 7 ) 837 - 845 2010.07

CiNii
FPGA-Based Prototyping Acceleration Using Automatic Pipeline Synthesis

ZHENG Kai, XING Weijie, KIMURA Tomoo, KAI Koji, KUROMARU Shun-ichi, KIMURA Shinji

2009 ( 4 ) 1 - 6 2009.05

CiNii
A New Heuristic for Autonomic Controlling Value Based Power Gating

CHEN LEI, KIMURA SHINJI

2009 ( 5 ) 1 - 6 2009.05

CiNii
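On the Synthesis of Partial Product Addition Circuits Targeting FPGAs

MATSUNAGA Taeko, KIMURA Shinji, MATSUNAGA Yusuke

IEICE Technical Report. IE, Image Engineering 108 ( 229 ) 59 - 63 2008.09

　View Summary

This paper describes a method for synthesizing the partial product addition circuit of a parallel multiplier for FPGAs using generalized counters. When a counter is implemented with library cells, its area and delay grow as the counter becomes larger, so the benefit of using large counters cannot be judged simply. However, for FPGAs composed of k-input LUTs, any counter with at most k inputs can be implemented at the same cost, so higher speed and smaller area can be expected by constructing the partial circuits from an appropriate combination of counters. The proposed method applies the concept of the Dadda tree to generalized counters, and experimental results confirm that it reduces area by about 10% compared with existing methods.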

CiNii
Fine-Grained Power Gating Based on the Controlling Value of Logic Gates

CHEN Lei, HORIYAMA Takashi, NAKAMURA Yuichi, KIMURA Shinji

IEICE technical report 108 ( 23 ) 19 - 24 2008.05

　View Summary

Leakage power dissipation of logic gates has become an increasingly important problem. A novel fine-grained power gating approach based on the controlling value of logic gates is proposed for leakage power reduction. In the method, sleep signals of the power-gated blocks are extracted based on the probability of the controlling value of logic gates without any extra control logic. A basic algorithm and a probability-based heuristic algorithm have been developed to implement this method. The steady maximum delay constraint has also been introduced to handle the delay overhead. Experiments on the ISCAS'85 benchmarks show the effectiveness of our algorithms and the effect on the extra delay.
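
The controlling-value idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' algorithm: the table and the function names are hypothetical, and only the basic property is shown, namely that one input at its controlling value fixes the gate output, so the logic driving the other inputs is not needed in that cycle.

```python
# Controlling values of basic gates: an input at this value fixes the output
# regardless of the other inputs (AND/NAND: 0, OR/NOR: 1).
# XOR/XNOR have no controlling value and are therefore absent.
CONTROLLING_VALUE = {"AND": 0, "NAND": 0, "OR": 1, "NOR": 1}


def can_sleep_other_inputs(gate_type, known_input):
    """Return True if known_input alone determines the gate output."""
    return CONTROLLING_VALUE.get(gate_type) == known_input


def sleep_probability(gate_type, p_one):
    """Probability that an input with P(input = 1) = p_one is controlling.

    Hypothetical helper: a probability-based heuristic could rank
    candidate sleep signals by a quantity of this kind.
    """
    cv = CONTROLLING_VALUE[gate_type]
    return p_one if cv == 1 else 1.0 - p_one
```

Under these assumptions, an input of an AND gate that is 1 only 25% of the time would allow the other input cone to sleep 75% of the time.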

CiNii
Checker Circuit Generation for SystemVerilog Assertions in Prototyping Verification

WANG Mengru, KIMURA Shinji

IEICE technical report 108 ( 22 ) 7 - 12 2008.05

　View Summary

Reducing the verification period is a crucial problem in recent LSI designs, and prototyping/emulation technologies are used for this purpose. Assertion-Based Verification (ABV) has attracted attention as a way to check design errors at run time in simulation, and combining ABV with prototyping has become important. In this manuscript, we discuss a method for generating checker circuits for SystemVerilog Assertions (SVAs). SVA is one of the standard methods for describing assertions in ABV. In the checker circuit generation, we focus on hardware cost reduction.

CiNii
Improvement of Switching Activity Aware Algorithm for Prefix Graph Synthesis

MATSUNAGA Taeko, KIMURA Shinji, MATSUNAGA Yusuke

IEICE technical report 108 ( 22 ) 31 - 36 2008.05

　View Summary

A prefix graph represents the global structure of a parallel prefix adder and can be used to explore various adder structures at a technology-independent level. An approach for timing-driven area minimization has been proposed that consists of two phases: dynamic programming based area minimization and area reduction with restructuring. This approach has also been applied to minimize the total switching activity, one of the factors that affect power consumption, though it is not as effective there as for area minimization. In this paper, an approach is proposed that integrates the effect of the restructuring phase into the dynamic programming phase to improve switching cost minimization. Effects and issues of our method are discussed through experimental results.

CiNii
Improvement of Switching Activity Aware Algorithm for Prefix Graph Synthesis

MATSUNAGA Taeko, KIMURA Shinji, MATSUNAGA Yusuke

2008 ( 38 ) 31 - 36 2008.05

　View Summary

A prefix graph represents the global structure of a parallel prefix adder and can be used to explore various adder structures at a technology-independent level. An approach for timing-driven area minimization has been proposed that consists of two phases: dynamic programming based area minimization and area reduction with restructuring. This approach has also been applied to minimize the total switching activity, one of the factors that affect power consumption, though it is not as effective there as for area minimization. In this paper, an approach is proposed that integrates the effect of the restructuring phase into the dynamic programming phase to improve switching cost minimization. Effects and issues of our method are discussed through experimental results.

CiNii
Fine-Grained Power Gating Based on the Controlling Value of Logic Gates

CHEN Lei, HORIYAMA Takashi, NAKAMURA Yuichi, KIMURA Shinji

2008 ( 38 ) 55 - 60 2008.05

　View Summary

Leakage power dissipation of logic gates has become an increasingly important problem. A novel fine-grained power gating approach based on the controlling value of logic gates is proposed for leakage power reduction. In the method, sleep signals of the power-gated blocks are extracted based on the probability of the controlling value of logic gates without any extra control logic. A basic algorithm and a probability-based heuristic algorithm have been developed to implement this method. The steady maximum delay constraint has also been introduced to handle the delay overhead. Experiments on the ISCAS'85 benchmarks show the effectiveness of our algorithms and the effect on the extra delay.

CiNii
Synthesis of parallel prefix adders based on Ling's carry computation

MATSUNAGA Taeko, KIMURA Shinji, MATSUNAGA Yusuke

2007 ( 114 ) 163 - 168 2007.11

　View Summary

Ling adders calculate carry propagation based on adjacent bit pairs, and can be formulated as parallel prefix adders. In this paper, our synthesis framework for usual parallel prefix adders based on carry-generate and propagate functions is extended to treat Ling's carry. Some experimental results are shown to discuss the effectiveness of integrating it into our framework.

CiNii
Synthesis of parallel prefix adders based on Ling's carry computation

MATSUNAGA Taeko, KIMURA Shinji, MATSUNAGA Yusuke

IEICE technical report 107 ( 336 ) 49 - 54 2007.11

　View Summary

Ling adders calculate carry propagation based on adjacent bit pairs, and can be formulated as parallel prefix adders. In this paper, our synthesis framework for usual parallel prefix adders based on carry-generate and propagate functions is extended to treat Ling's carry. Some experimental results are shown to discuss the effectiveness of integrating it into our framework.

CiNii
Acceleration of Prototyping Design Verification Using Circuit Modification

INOUE Keita, WEIJIE Xing, KIMURA Shinji

2007 ( 27 ) 113 - 118 2007.03

　View Summary

In recent SoC (System on Chip) design, more than 60% of the design period is spent on verification, so an efficient verification method is needed to reduce the verification time. Functional simulation is the main verification technique, and accelerating it by hardware emulation with FPGAs is considered effective. Emulation of large circuits, however, is rather slow, and a speed-up is needed to reduce the verification time. In this report, we show an acceleration method based on synchronous pipelining and a false-path based method for reducing combinational circuit delay. Synchronous pipelining is effective for one-dimensional processing circuits. Among the false-path based methods, we focus on the 0&1 skip method, in which the 0-signal and the 1-signal are propagated separately.

CiNii
Bit-Length Optimization Method for High-Level Synthesis Based on Non-linear Programming Technique

DOI Nobuhiro, HORIYAMA Takashi, NAKANISHI Masaki, KIMURA Shinji

IEICE Trans. Fundamentals, A 89 ( 12 ) 3427 - 3434 2006.12

　View Summary

High-level synthesis is a novel method for automatically generating an RT-level hardware description from a high-level language such as C, and is used in recent digital circuit design. Floating-point to fixed-point conversion with bit-length optimization is one of the key issues for area and speed optimization in high-level synthesis. However, the conversion is a rather tedious task for designers. This paper introduces an automatic bit-length optimization method for floating-point to fixed-point conversion in high-level synthesis. The method estimates computational errors statistically and formalizes the optimization as a non-linear programming (NLP) problem. Applying the NLP technique improves the balance between computational accuracy and total hardware cost. Various constraints, such as unit sharing and the maximum bit-length of function units, can also be modeled easily. Experimental results show that our method is fast compared with a typical one and reduces the hardware area.

CiNii
Software Defined Radio with Reconfigurable Processor Based on ALU Array Architecture

OZONE Makoto, HIRASE Katsunori, IIZUKA Kazuhisa, NAKAJIMA Hiroshi, HIRAMATSU Tatsuo, KIMURA Shinji

IEICE technical report 106 ( 188 ) 173 - 178 2006.07

　View Summary

Software defined radio is expected to be a next-generation radio system because it can provide various radio systems on a single piece of hardware by changing software. This paper presents the details of our original reconfigurable processor based on an ALU array architecture, its compiler, and a software defined radio prototype built with the processor. Our processor has a two-dimensional array of ALUs and originally developed limited interconnections between the ALUs. Our compiler converts C programs into configuration information for the processor with an original method using Data Flow Graphs. We have developed a software defined radio prototype using our reconfigurable processor. On the prototype, we have implemented Digital Terrestrial TV "One-Seg" reception and FM radio reception, and have realized real-time reception of both "One-Seg" and FM radio as well as switching between them by changing software. The prototype shows that our processor is effective for software defined radio.

CiNii
Dynamic Reconfigurable Wiring Architecture and Its Application to Hardware Mapping

KIMURA Shinji

2006 ( 41 ) 7 - 12 2006.05

　View Summary

Reconfigurable architecture is one of the key technologies for coping with bugs and specification changes in system LSIs. Dynamic reconfiguration in particular has attracted attention. In this paper, we consider a dynamic reconfigurable wiring architecture in FPGAs and its application to the mapping of logic circuits. With this architecture, we can map multiplexors to wiring resources in the FPGA with small extra area and no extra delay.

CiNii
Dynamic Reconfigurable Wiring Architecture and Its Application to Hardware Mapping

KIMURA Shinji

IEICE technical report 106 ( 31 ) 7 - 12 2006.05

　View Summary

Reconfigurable architecture is one of the key technologies for coping with bugs and specification changes in system LSIs. Dynamic reconfiguration in particular has attracted attention. In this paper, we consider a dynamic reconfigurable wiring architecture in FPGAs and its application to the mapping of logic circuits. With this architecture, we can map multiplexors to wiring resources in the FPGA with small extra area and no extra delay.

CiNii
Coarse-grained Reconfigurable Hardware with Mapping Mechanisms of Floating Point Operations and Chained Additions

AKUTSU Hidemi, KIMURA Shinji

IEICE technical report 105 ( 647 ) 43 - 48 2006.03

　View Summary

Recently, coarse-grained reconfigurable hardware has been studied intensively, since such architectures need less configuration information and are suitable for dynamic reconfiguration. A coarse-grained reconfigurable architecture can also implement the same functionality with smaller area, shorter delay, and lower power than fine-grained reconfigurable hardware such as FPGAs. Current coarse-grained hardware does not consider the chaining of consecutive additions/subtractions, which is effective for executing them quickly. It also does not consider floating-point operations, which may be dominant in the emulation of software algorithms. In this paper, we propose a new reconfigurable hardware architecture that considers the mapping of chained additions/subtractions and of floating-point operations. We also show an LSI implementation of the architecture.

CiNii
Conversion Method from High-level Hardware Description to Equivalence Logic Formulae

JUNG Kwanghoon, KIMURA Shinji

IEICE technical report 105 ( 646 ) 79 - 84 2006.03

　View Summary

With the enlargement of the integration size on one LSI chip, high-level design methods using C-language are applied to the design of LSI chips, and the requirement for high-level verification methods becomes large. At the high-level, the simulation speed is known to be high, but it is hard to verify the functionality with only the simulation, and formal methods are suitable especially for the comparison of the original and optimized circuits. In the paper, we focus on the equivalence checking methods based on equivalence logic such as CVC (Cooperating Validity Checker), and show a verification method of C/Verilog descriptions by generating CVC formulae from C/Verilog sources. Since the verification time using CVC is influenced by the size of formulae, we should devise conversion methods from C/Verilog sources. We introduce several conversion methods and show their effectiveness.

CiNii
Structural Coverage of Traversed Transitions for Symbolic Model Checking

XU Xingwen, KIMURA Shinji, HORIKAWA Kazunari, TSUCHIYA Takehiko

2005 ( 121 ) 197 - 202 2005.11

　View Summary

Coverage estimation for model checking has become an important issue in practical formal verification. Transition traversal coverage focuses on the transition characteristics of CTL operators and calculates which transitions are traversed during the model checking process of properties. One limitation of the method is the lack of the correspondence between the circuit structure and transitions. One transition might be covered no matter which part of the circuit is checked (or not) related to the transition. This leads to the overestimation of the coverability of properties. In this paper, we enhance the transition traversal coverage by analyzing the structural coverage of each traversed transition. We consider which variables are checked explicitly or implicitly on the traversed transitions. Thus, we deduce which part of the circuit is checked by properties for each traversed transition. The importance is that we can analyze which part of the circuit has not been verified. The accuracy of the transition traversal coverage is enhanced by our technique. We show the effectiveness of the proposed method by experiments.

CiNii
Bit-length Optimization Method for High-level Synthesis based on Non-linear Programming and Searching Integer Solutions

DOI Nobuhiro, HORIYAMA Takashi, NAKANISHI Masaki, KIMURA Shinji

2005 ( 27 ) 133 - 138 2005.03

　View Summary

This paper presents a bit-length optimization technique for high-level synthesis based on non-linear programming and searching integer solutions. The results of bit-length optimization based on non-linear programming are real values, and these values are converted to integers by rounding up for hardware implementation. In this paper, we show a method to search for integer solutions under the constraints of the real solution. The experimental results show the advantage of the search-based method.

CiNii
Bit-length Optimization Method for High-level Synthesis based on Non-linear Programming and Searching Integer Solutions

DOI Nobuhiro, HORIYAMA Takashi, NAKANISHI Masaki, KIMURA Shinji

IEICE technical report. Computer systems 104 ( 738 ) 43 - 48 2005.03

　View Summary

This paper presents a bit-length optimization technique for high-level synthesis based on non-linear programming and searching integer solutions. The results of bit-length optimization based on non-linear programming are real values, and these values are converted to integers by rounding up for hardware implementation. In this paper, we show a method to search for integer solutions under the constraints of the real solution. The experimental results show the advantage of the search-based method.

CiNii
A Reconfigurable Processor Based on ALU Array Architecture with Limitation on the Interconnection

OKADA Makoto, HIRAMATSU Tatsuo, NAKAJIMA Hiroshi, OZONE Makoto, HIRASE Katsunori, KIMURA Shinji

IEICE technical report. Computer systems 104 ( 591 ) 1 - 6 2005.01

　View Summary

A dynamic reconfigurable processor based on an ALU array architecture for consumer appliances is introduced. We propose an ALU array architecture with limited interconnections for area reduction. With the proposed architecture, we can reduce the gate size of the interconnections by 63%. In addition, we show that the performance of the proposed architecture is almost the same as that of one without the limitation. The proposed processor is a parallel processor that consists of a sequencer and an ALU array and adopts multi-threading technology. We develop compilation tools for the proposed processor that take source code written in C. We demonstrate the software model of the processor using an MPEG-4 video decoding application.

CiNii
A Selective Scan Chain Reconfiguration through Run-Length Coding for Test Data Compression and Scan Power Reduction

SHI Youhua, KIMURA Shinji, YANAGISAWA Masao, OHTSUKI Tatsuo

IEICE transactions on fundamentals of electronics, communications and computer sciences 87 ( 12 ) 3208 - 3215 2004.12

　View Summary

Test data volume and power consumption for scan-based designs are two major concerns in system-on-a-chip testing. However, test set compaction by filling the don't-cares will invariably increase the scan-in power dissipation for scan testing, so the goals of test data reduction and low-power scan testing appear to conflict. Therefore, in this paper we present a selective scan chain reconfiguration method for test data compression and scan-in power reduction. The proposed method analyzes the compatibility of the internal scan cells for a given test set and then divides the scan cells into compatible classes. After the scan chain reconfiguration, a dictionary is built to indicate the run-length of each compatible class, and only the scan-in data for each class needs to be transferred from the ATE to the CUT, which reduces test data volume. Experimental results for the larger ISCAS'89 benchmarks show that the proposed approach overcomes the limitations of traditional run-length coding techniques, and leads to highly reduced test data volume with significant power savings during scan testing in all cases.

CiNii
A Hybrid Dictionary Test Data Compression for Multiscan-Based Designs

SHI Youhua, KIMURA Shinji, YANAGISAWA Masao, OHTSUKI Tatsuo

IEICE transactions on fundamentals of electronics, communications and computer sciences 87 ( 12 ) 3193 - 3199 2004.12

　View Summary

In this paper, we present a test data compression technique to reduce test data volume for multiscan-based designs. In our method the internal scan chains are divided into equal sized groups and two dictionaries are built to encode either an entire slice or a subset of the slice. Depending on the codeword, the decompressor may load all scan chains or may load only a group of the scan chains, which can enhance the effectiveness of dictionary-based compression. In contrast to previous dictionary coding techniques, even for a CUT with a large number of scan chains, the proposed approach can achieve a satisfactory reduction in test data volume with a reasonably small dictionary. Experimental results show that the proposed test scheme works particularly well for the large ISCAS'89 benchmarks.

CiNii
Program Analysis Based on Abstract Interpretation and Its Application for Datapath Optimization

DOI Nobuhiro, HORIYAMA Takashi, NAKANISHI Masaki, KIMURA Shinji

2004 ( 56 ) 41 - 46 2004.05

　View Summary

Various optimization techniques such as bit-length optimization are required for hardware generation from C programs. The value range analysis and dataflow analysis are effective for such optimization and static program analysis methods have been used. The static methods, however, have several problems such as the preciseness, the overestimation, etc. In this paper, we describe a program analysis method based on abstract interpretation and its application for datapath optimization.

CiNii
Program Analysis Based on Abstract Interpretation and Its Application for Datapath Optimization

DOI Nobuhiro, HORIYAMA Takashi, NAKANISHI Masaki, KIMURA Shinji

Technical report of IEICE. VLD 104 ( 79 ) 7 - 12 2004.05

　View Summary

Various optimization techniques such as bit-length optimization are required for hardware generation from C programs. The value range analysis and dataflow analysis are effective for such optimization and static program analysis methods have been used. The static methods, however, have several problems such as the preciseness, the overestimation, etc. In this paper, we describe a program analysis method based on abstract interpretation and its application for datapath optimization.

CiNii
Reconfigurable Interconnection and Its Application to the Bit-Exchange Unit in a Processor

HARADA Yasunori, KIMURA Shinji, YANAGISAWA Masao

2004 ( 5 ) 1 - 6 2004.01

　View Summary

This paper proposes a reconfigurable interconnect unit for a general (and/or embedded) processor. The performance of a processor depends not only on the operational units but also on the interconnection between registers and operational units. When configuring a processor architecture, we usually focus on the application specific operational units, but there are quite a few applications in which the effect of the interconnection is larger than that of the operational units. So we focus on the reconfigurability in the interconnect architecture and we introduce a reconfigurable interconnect unit for the bit-level data processing. The unit corresponds to a switch-matrix in FPGA and is called a barrel-exchanger because of its similarity to a barrel-shifter. An n-bit barrel-exchanger has n inputs and n outputs, and any connection between inputs and outputs can be obtained. A processor with a barrel-exchanger gains more than 10 times speed-up for the bit substitution and for DES encryption. We also show the area estimation of 8, 16, 32 and 64 bit barrel exchangers.
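
As a behavioral illustration of the n-input, n-output unit described above, the following sketch models a barrel-exchanger as a table that says which input bit drives each output bit. The function name `barrel_exchange` and the `select` encoding are assumptions made for this example; the abstract does not specify the hardware interface.

```python
def barrel_exchange(value, select, n):
    """Route input bits to output bits.

    Output bit i takes the value of input bit select[i], so any
    connection pattern (permutation or fan-out) can be expressed.
    """
    if len(select) != n:
        raise ValueError("select must have one entry per output bit")
    out = 0
    for i, src in enumerate(select):
        if not 0 <= src < n:
            raise ValueError("source index out of range")
        out |= ((value >> src) & 1) << i
    return out


# Bit reversal of a 4-bit word, and a rotate-right by one position.
reversed_word = barrel_exchange(0b0001, [3, 2, 1, 0], 4)
rotated_word = barrel_exchange(0b0011, [1, 2, 3, 0], 4)
```

A barrel-shifter covers only the shift and rotate patterns, whereas a table of this kind also covers arbitrary bit substitutions such as the permutations used in DES.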

CiNii
Reconfigurable Interconnection and Its Application to the Bit-Exchange Unit in a Processor

HARADA Yasunori, KIMURA Shinji, YANAGISAWA Masao

Technical report of IEICE. VLD 103 ( 578 ) 1 - 6 2004.01

　View Summary

This paper proposes a reconfigurable interconnect unit for a general (and/or embedded) processor. The performance of a processor depends not only on the operational units but also on the interconnection between registers and operational units. When configuring a processor architecture, we usually focus on the application specific operational units, but there are quite a few applications in which the effect of the interconnection is larger than that of the operational units. So we focus on the reconfigurability in the interconnect architecture and we introduce a reconfigurable interconnect unit for the bit-level data processing. The unit corresponds to a switch-matrix in FPGA and is called a barrel-exchanger because of its similarity to a barrel-shifter. An n-bit barrel-exchanger has n inputs and n outputs, and any connection between inputs and outputs can be obtained. A processor with a barrel-exchanger gains more than 10 times speed-up for the bit substitution and for DES encryption. We also show the area estimation of 8, 16, 32 and 64 bit barrel exchangers.

CiNii
A Built-in Reseeding Technique for LFSR-Based Test Pattern Generation

SHI Youhua, ZHANG Zhe, KIMURA Shinji, YANAGISAWA Masao, OHTSUKI Tatsuo

IEICE transactions on fundamentals of electronics, communications and computer sciences, A 86 ( 12 ) 3056 - 3062 2003.12

　View Summary

Reseeding is a technique proposed to improve the fault coverage in pseudo-random testing. However, most previous work on reseeding is based on storing the seeds in an external tester or in a ROM. In this paper we present a built-in reseeding technique for LFSR-based test pattern generation. The proposed structure can run both in pseudorandom mode and in reseeding mode. Besides, our method requires no storage for the seeds since in reseeding mode the seeds can be generated automatically in hardware. In this paper we also propose an efficient grouping algorithm based on simulated annealing to optimize test vector grouping. Experimental results for benchmark circuits indicate the superiority of our technique over other reseeding methods with respect to test length and area overhead. Moreover, since the theoretical properties of LFSRs are preserved, our method could be beneficially used in conjunction with any other techniques proposed so far.
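
For readers unfamiliar with LFSR-based pattern generation, the following is a minimal software sketch of a Fibonacci LFSR. It does not model the built-in seed generation proposed in the paper; reseeding is represented only by restarting the generator with a different seed. The function name, the tap convention, and the 4-bit example are assumptions chosen for illustration.

```python
def lfsr_patterns(seed, taps, width, count):
    """Yield `count` successive states of a left-shifting Fibonacci LFSR.

    taps are 0-based bit positions that are XORed to form the
    feedback bit shifted into bit 0.
    """
    mask = (1 << width) - 1
    state = seed & mask
    for _ in range(count):
        yield state
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = ((state << 1) | feedback) & mask


# Pseudorandom mode: a 4-bit LFSR with taps at bits 3 and 2
# visits all 15 non-zero states before repeating.
patterns = list(lfsr_patterns(0b0001, (3, 2), 4, 16))

# Reseeding mode (modeled here simply as a restart from a new seed).
reseeded = list(lfsr_patterns(0b1010, (3, 2), 4, 3))
```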

CiNii
Area Efficient FPGA Architecture with Logic Function Folding

KAJIHARA Hirotsugu, NAKANISHI Masaki, HORIYAMA Takashi, KIMURA Shinji, WATANABE Katsumasa

2003 ( 7 ) 37 - 42 2003.01

　View Summary

The paper describes an area efficient FPGA architecture based on LUTs with logic function folding. Each LUT is a 3-1 LUT but is enhanced to implement a full adder function with only one LUT. The area of our 3-1 LUT is about 56 % compared to that of a simple 4-1 LUT. In the paper, we measure not only the LUT area but also the area of routing resource. We adopt the well-known island style-architecture for the routing mechanism, and find that the total FPGA area can be saved up to 32.4 % and on average 12 % by the experiments on several benchmark circuits compared to FPGA architecture based on 4-1 LUTs.

CiNii
Area Efficient FPGA Architecture with Logic Function Folding

KAJIHARA Hirotsugu, NAKANISHI Masaki, HORIYAMA Takashi, KIMURA Shinji, WATANABE Katsumasa

Technical report of IEICE. VLD 102 ( 608 ) 37 - 42 2003.01

　View Summary

The paper describes an area efficient FPGA architecture based on LUTs with logic function folding. Each LUT is a 3-1 LUT but is enhanced to implement a full adder function with only one LUT. The area of our 3-1 LUT is about 56 % compared to that of a simple 4-1 LUT. In the paper, we measure not only the LUT area but also the area of routing resource. We adopt the well-known island style-architecture for the routing mechanism, and find that the total FPGA area can be saved up to 32.4 % and on average 12 % by the experiments on several benchmark circuits compared to FPGA architecture based on 4-1 LUTs.

CiNii
Design and Evaluation of Java Processor with Dynamic Instruction Conversion Mechanism for Embedded Systems

SUZUKI MASATO, KIMURA SHINJI, WATANABE KATSUMASA

IEICE technical report. Computer systems 101 ( 671 ) 33 - 40 2002.02

　View Summary

Java processors are key for executing Java bytecodes in embedded systems, and are expected low hardware consumption and high executio is based on stack operations, and we can raise the efficiency by changing these codes into extended codes corresponding to RISC-like operations. Rewriting is done in cache and done in parallel with the direct execution of bytecode. By performing the execution and the conversion in parallel, we can manipulate complex conversions with low hardware cost. The paper shows the design and evaluation of the Java processor with the dynamic instruction conversion mechanism.

CiNii
A New Image Computation Method Based on Generalized Cofactor of Binary Decision Diagrams

KIMURA Shinji, DILL David, GOVINDARAJU Shankar

Technical report of IEICE. VLD 101 ( 467 ) 73 - 78 2001.11

　View Summary

In this paper, we show a new image computation method based on the BDD constrain operator. Image computation computes the next state set from the current state set using the logic functions, and is widely used in the formal verification of sequential circuits. We have shown a property on the relation between the constrain operator and the conjunction operator for transition relations of state variables. The constrain operator can reduce BDD node size compared with the conjunction operator. The new method outperforms recent conjunction-based methods on several ISCAS benchmarks.

CiNii
Design and Implementation of LSI for Speech Recognition with Learning Mechanism Using C Language

NAKAMURA Kazuhiro, ZHU Qiang, MARUOKA Shinji, HORIYAMA Takashi, KIMURA Shinji, WATANABE Katsumasa

Technical report of IEICE. VLD 100 ( 473 ) 125 - 130 2000.11

　View Summary

Speech recognition has become one of the popular human interfaces. We have designed a real-time speech recognition LSI using C language. The LSI recognizes up to 64 monosyllables (A, B, …, "A", "I", …, etc.) based on the Hidden Markov Model (HMM), which is a well known speaker-independent recognition method. The LSI also has an interface to an external learning circuit. In this paper, we describe speech recognition and learning algorithms, and we also describe a design method of LSI using C language. We used behavior level C to check the behavior of algorithms. We also checked clocked behavior and decided bit length of registers by using register-transfer level C. Then we translated register-transfer level C into VHDL.

CiNii
16-bit Pipelined Processor with CORDIC Unit based on Redundant Binary Representation

OTSUJI Takashi, HORIYAMA Takashi, KIMURA Shinji, WATANABE Katsumasa

Proceedings of the Society Conference of IEICE 2000 78 - 78 2000.09

CiNii
Design Verification of Arithmetic Circuits Using Residue BDD's

KIMURA Shinji

Technical report of IEICE. VLD 95 ( 171 ) 1 - 8 1995.07

　View Summary

A Binary Decision Diagram (BDD) is an acyclic graph representation of a logic function, and is widely used in logic synthesis systems and design verification systems. BDD's are a compact representation for usual logic functions, but for some logic functions BDD's cannot be used because of node explosion, where the number of nodes grows exponentially with respect to the number of input variables. Multiplication is an example of such functions. For this problem, we newly introduce a residue BDD for representing arithmetic functions. The residue BDD is based on residue arithmetic, and the node size is bounded by a polynomial of the input size. The paper describes the properties of the residue BDD's and the verification method using the residue BDD's.
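
The residue-arithmetic identity that the residue BDD builds on can be shown with a small numeric sketch. This is not a BDD implementation and not the verification method of the paper: it only checks, for sample inputs, that a multiplier's result agrees with the product of the operand residues modulo a few small moduli. The function name and the choice of moduli are assumptions.

```python
def residues_match(multiply, a, b, moduli=(3, 5, 7)):
    """Check multiply(a, b) against a * b modulo each modulus.

    Uses (a * b) mod m == ((a mod m) * (b mod m)) mod m, so only
    small residues are compared instead of the full-width product.
    """
    result = multiply(a, b)
    return all(result % m == ((a % m) * (b % m)) % m for m in moduli)


correct = residues_match(lambda x, y: x * y, 12, 34)
faulty = residues_match(lambda x, y: x * y + 1, 12, 34)
```

A residue check of this kind can miss errors that are multiples of every modulus used, which is one reason the choice of moduli matters.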

CiNii
Formal Timing Verification of Logic Circuits

KIMURA Shinji

IPSJ Magazine 35 ( 8 ) 726 - 735 1994.08

CiNii
Parallel Binary Decision Diagram Manipulation

KIMURA Shinji

IPSJ Magazine 34 ( 5 ) 624 - 630 1993.05

CiNii


Syllabus

Digital Circuits

Graduate School of Information, Production and Systems

2026 fall semester
High-Level Verification Technologies Research (Spring)

Graduate School of Information, Production and Systems

2026 spring semester
High-Level Verification Technologies Research (Doctor's Thesis)

Graduate School of Information, Production and Systems

2026 full year
High-Level Verification Technologies Research (Fall)

Graduate School of Information, Production and Systems

2026 fall semester
High-Level Verification Technologies

Graduate School of Information, Production and Systems

2026 fall semester
High-Level Verification Technologies Research (Fall)

Graduate School of Information, Production and Systems

2026 fall semester
High-Level Verification Technologies Research (Spring)

Graduate School of Information, Production and Systems

2026 spring semester
System LSI Architecture

Graduate School of Information, Production and Systems

2026 spring semester
Master's Thesis (Integrated Systems)(Fall)

Graduate School of Information, Production and Systems

2026 fall semester
Master's Thesis (Integrated Systems)(Spring)

Graduate School of Information, Production and Systems

2026 spring semester
High-Level Verification Technologies D

Graduate School of Information, Production and Systems

2026 fall semester
High-Level Verification Technologies C

Graduate School of Information, Production and Systems

2026 spring semester
High-Level Verification Technologies B

Graduate School of Information, Production and Systems

2026 spring semester
High-Level Verification Technologies A

Graduate School of Information, Production and Systems

2026 fall semester
Design for Testability

Graduate School of Information, Production and Systems

2026 fall semester
Topics in Fundamental Science and Engineering C

School of Fundamental Science and Engineering

2026 spring semester
Topics in Fundamental Science and Engineering C

School of Fundamental Science and Engineering

2026 spring semester
Topics in Fundamental Science and Engineering C

School of Fundamental Science and Engineering

2026 spring semester
Topics in Fundamental Science and Engineering C

School of Fundamental Science and Engineering

2026 spring semester
Topics in Fundamental Science and Engineering C

School of Fundamental Science and Engineering

2026 spring semester

Sub-affiliation

Faculty of Science and Engineering School of Fundamental Science and Engineering

Research Institute

2024

-

2026

Waseda Research Institute for Science and Engineering Concurrent Researcher

Internal Special Research Projects

Research on Methods for Preserving the Meaning of Digital Data Using a One-Instruction Computer

2016

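Digital data is a sequence of 0s and 1s that carries no meaning by itself, so the method for interpreting it must be stored together with the data. For character data, we had previously proposed attaching the number of bits per character, the minimal font data corresponding to each bit pattern, and the method of conversion to that font, so that the data can be converted into readable form. In this project, we defined preserving the meaning of image-compressed data as restoring it to viewable data, and worked on describing the semantics of programs. We studied a method that stores a description of the instruction interpretation mechanism of the one-instruction computer subleq together with the program written in subleq assembly, and the optimization of the description size in that case. subleq has only one kind of instruction, so its semantics are simple to describe and its interpretation mechanism is easy to emulate or reconstruct.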
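
As an illustration of why the interpretation mechanism of subleq is easy to re-create, the following is a minimal Python sketch of a subleq interpreter. It is not the description format studied in the project: the flat memory layout, the halting convention (a negative jump target), the step limit, and the name run_subleq are all assumptions made for this example.

```python
def run_subleq(memory, pc=0, max_steps=10_000):
    """Interpret a subleq program held in a flat list of integers.

    Each instruction occupies three cells (a, b, c):
        memory[b] -= memory[a]
        jump to c if the result is <= 0, otherwise fall through.
    A negative program counter halts the machine (assumed convention).
    """
    mem = list(memory)          # work on a copy
    steps = 0
    while 0 <= pc and pc + 2 < len(mem):
        if steps >= max_steps:
            raise RuntimeError("step limit exceeded")
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
        steps += 1
    return mem


# Hypothetical sample program: Y = Y + X, using a scratch cell Z.
program = [
    9, 11, 3,    # Z -= X        (Z becomes -X)
    11, 10, 6,   # Y -= Z        (Y becomes Y + X)
    11, 11, -1,  # Z -= Z, halt  (result 0 always jumps to -1)
    7, 5, 0,     # data cells: X = 7, Y = 5, Z = 0
]
result = run_subleq(program)
print(result[10])  # 12
```

The whole machine fits in a single loop, which is the property the summary relies on: a future reader only needs this loop, plus the stored program, to recover the behavior of the data decoder.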

Hardware Design Techniques for Utilizing Next-Generation Non-Volatile Devices

2013

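With the recent development and spread of ambient devices such as mobile terminals and wireless sensors, powering down in the idle state has become important for extending their operating time. The internal state must then be saved so that operation can resume after power is restored, and next-generation non-volatile devices, which retain data even when power is off, are attracting attention.
Next-generation non-volatile devices based on the MTJ (Magnetic Tunnel Junction) offer access speed comparable to ordinary CMOS SRAM and integration density as high as DRAM. However, writing a value requires controlling the direction of the magnetic field inside the MTJ, which takes about 10 times the write energy of ordinary SRAM, and reducing this energy is an urgent issue.
This research therefore studied design techniques for utilizing next-generation non-volatile devices, including the reduction of write energy. In addition to using the memory as a ROM that stores computation results without being rewritten, we studied methods that reduce the writes themselves. Because writing an MTJ consumes as much energy when the same value is written as when a different value is written, the basic approach is to suppress the write when the stored value and the value to be written are the same. Here we presented methods that are combined with this approach to reduce the number of writes further.
First, based on state-transition analysis of sequential circuits, we proposed a method for finding registers that need not be rewritten, automatically generated the stop-control circuit from the conditions under which rewriting can be stopped, and confirmed the power reduction.
Second, we studied methods for reducing the number of bits changed when a value changes. These include representing the new value as the difference between the old and new values to reduce the number of rewritten bits, and codes that limit the maximum number of changed bits.
Third, we studied memory-based computing, in which the inputs form the address and the computation result is the memory content. Since this basically requires capacity exponential in the number of inputs, we examined methods for multiplication and other operations that reduce the memory size by combining the memory with arithmetic units as needed.
Finally, we studied fine-grained run-time power gating that takes the propagation of controlling values of logic elements into account. The controlling value of a logic element is a value that determines the output from a single input alone. When one input takes the controlling value, the values of the other inputs become unnecessary, and the power of the part that computes them can be stopped. We presented a method that uses the propagation of controlling values through series connections to power down more elements.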
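
The basic write-suppression idea, skipping the write when the stored value already equals the new value, can be sketched with a simple cost model. This is a hypothetical model and not the circuit from the project: it assumes suppression is applied per bit, counts one unit of write energy per MTJ cell written, and the name bits_written is invented for the example.

```python
def bits_written(values, width, suppress_same=True, initial=0):
    """Count MTJ cell writes when a register takes `values` in sequence.

    Without suppression every cell is written on every update.
    With suppression only cells whose value changes are written,
    i.e. the Hamming distance between the old and new contents.
    """
    mask = (1 << width) - 1
    stored = initial & mask
    total = 0
    for value in values:
        value &= mask
        if suppress_same:
            total += bin(stored ^ value).count("1")  # changed bits only
        else:
            total += width                           # all cells written
        stored = value
    return total


updates = [5, 5, 7, 6]   # hypothetical sequence of 4-bit register values
print(bits_written(updates, 4, suppress_same=False))  # 16
print(bits_written(updates, 4, suppress_same=True))   # 4
```

Under this model the second update (5 to 5) costs nothing, and the later updates cost one cell each, which is the saving that the further techniques in the summary build on.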

Research on Run-Time Analysis and Optimization in System-on-Silicon

2012

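Under the theme of run-time analysis and optimization in system-on-silicon, we studied fine-grained dynamic clock gating and power gating, a method for handling Single Event Upset (SEU) errors using dynamic reconfiguration of circuits on FPGAs, memory-based computing, and the optimization of cache configurations. For fine-grained dynamic clock gating and power gating, we examined methods that control power at run time by dynamically switching the clock or power supply on and off using signals inside the circuit. We found that multi-stage clock gating and pseudo power gating can reduce power by about 10% to 20%. For dynamic reconfiguration on FPGAs, SEU errors change FPGA configuration bits so that the circuit no longer functions correctly. Against this, we proposed a quadruple modular structure that is safer than triple modular redundancy, together with a method that identifies the error when it occurs and restores the function by dynamically rewriting the faulty module. We implemented the proposed method using the dynamic partial reconfiguration feature of Xilinx FPGAs and evaluated its safety and area overhead. For memory-based computing, judging that the rewritability of the memory part is useful for run-time optimization, we studied implementations of basic arithmetic operations and of trigonometric functions, multiplication, and division by the CORDIC method. This approach realizes an arithmetic operation by using the operator's inputs as the address and storing the computation results in memory. Because the memory size is exponential in the address width, techniques such as splitting the inputs into several parts, implementing each part in memory, and feeding the memory outputs into arithmetic units were necessary. We also examined a method that stores the results of arithmetic units inside the hardware in memory in a cache-like manner, so that a memory access replaces recomputation. These memory-based computing methods reduce the dynamic power caused by changes in logic gate outputs, and proved effective for run-time power optimization. Furthermore, we examined power optimization of cache memories using next-generation non-volatile memory and found that making part of the L1 cache and the L2 cache non-volatile yields a large reduction in leakage power.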
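
The input-splitting technique for memory-based computing can be illustrated with multiplication. The sketch below assumes 8-bit unsigned operands split into 4-bit halves; the summary does not state the partitioning actually used, and the names LUT_4X4, lut_mul4, and mul8 are hypothetical. A 256-entry table replaces the 65,536-entry table that a direct 8x8 lookup would need, at the cost of four lookups plus shifts and additions.

```python
# Lookup memory for 4-bit x 4-bit products: address = (a << 4) | b.
LUT_4X4 = [a * b for a in range(16) for b in range(16)]  # 256 entries


def lut_mul4(a, b):
    """Multiply two 4-bit values by a single memory read."""
    return LUT_4X4[(a << 4) | b]


def mul8(x, y):
    """8-bit x 8-bit product from four table reads and an adder tree."""
    xh, xl = x >> 4, x & 0xF
    yh, yl = y >> 4, y & 0xF
    high = lut_mul4(xh, yh) << 8
    middle = (lut_mul4(xh, yl) + lut_mul4(xl, yh)) << 4
    low = lut_mul4(xl, yl)
    return high + middle + low


print(mul8(200, 123) == 200 * 123)  # True
```

The trade-off is the one the summary describes: splitting keeps the memory small, but the memory outputs must then pass through ordinary arithmetic units.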

Research on Run-Time Analysis and Optimization Methods for System-on-Silicon

2011 TOGAWA, Nozomu

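As research on run-time analysis and optimization for system-on-silicon, we carried out basic studies on run-time error detection using assertion checkers, on encryption and secure storage of the detected errors, and on tamper resistance. First, for assertion checkers, building on a method that uses input-storing automata, we showed that sharing the input storage part reduces hardware resources in FPGA implementations. Next, regarding the sufficiency of the assertion set needed for run-time analysis, we surveyed and examined mutant-based methods, which judge assertion sufficiency using mutants obtained by modifying part of the circuit. Mutant-based methods judge the sufficiency of the assertions by whether the assertions detect the introduced change, but which changes to introduce depends heavily on the kind of run-time analysis. In particular, we found that delay errors need to be discussed together with how they are described. For compression of error information, we examined an LFSR-based method, which has excellent compression capability. For run-time optimization, we examined methods that use the dynamic reconfiguration mechanism of FPGAs. In particular, we examined and prototyped a method that, while the embedded processor executes instructions, dynamically builds arithmetic units corresponding to those instructions, detects instruction sequences corresponding to loops, and passes the data through the dynamically built datapath. This amounts to performing high-level synthesis of hardware dynamically from the assembly level, but the loop detection part, the method for feeding data to the newly built datapath, and a method for fast dynamic reconfiguration of the FPGA still need to be examined. Optimization of the datapath is also future work, and we studied optimizations such as more efficient and lower-power memory-based arithmetic and the optimization of multi-operand addition, which performs several additions in succession. We also examined encryption of error information and tamper resistance against information leakage, and discussed tamper resistance when a scan path is present.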
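
LFSR-based compaction of error information can be sketched as a serial signature register. The summary does not give the register width or feedback polynomial, so the 16-bit polynomial x^16 + x^12 + x^5 + 1 used here is an assumption chosen only for illustration, as are the name lfsr_signature and the sample trace.

```python
def lfsr_signature(bits, poly=0x11021, width=16):
    """Compact a bit stream into a `width`-bit signature.

    The register holds the remainder of the input stream, read as a
    polynomial over GF(2), divided by `poly` (which includes the
    x^width term). The register starts at zero.
    """
    state = 0
    for bit in bits:
        state = (state << 1) | (bit & 1)  # shift in the next bit
        if state >> width:                # degree reached `width`
            state ^= poly                 # subtract the polynomial
    return state


# Hypothetical 64-bit error trace and a copy with one bit flipped.
trace = [1, 0, 1, 1, 0, 0, 1, 0] * 8
faulty = list(trace)
faulty[10] ^= 1

print(lfsr_signature(trace) < (1 << 16))                   # True
print(lfsr_signature(trace) != lfsr_signature(faulty))     # True
```

Because the polynomial has a non-zero constant term, any single-bit difference between two streams changes the signature. Multi-bit differences can alias to the same signature, which is the usual price of compaction.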

Power and Delay Optimization of VLSI Using Logic Controlling Values

2009

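Under the theme of power and delay optimization of VLSI using logic controlling values, we studied the optimization of VLSI gate-level circuits. First, for delay optimization, we studied the automatic generation of pipelined circuits, proposed a pipeline synthesis method for FPGAs, and obtained a 1.8 times higher clock frequency with a two-stage pipeline for adder and multiplier circuits. The algorithm and experimental results were presented orally at the IPSJ SLDM workshop and at the ASP-DAC Student Forum. Next, for power optimization, we proposed a fine-grained power gating method that shuts off power according to the controlling values of logic elements. The algorithm evaluates the product of the probability that a control signal takes its controlling value and the number of gates that can then be stopped, and inserts power gating in descending order of this score. It achieved an average power reduction of about 15%. The results were published in the IEICE English-language transactions. Furthermore, we studied optimal sharing in clock gating, which reduces dynamic power by stopping the clocks of registers in sequential circuits, and confirmed its effect on counters and the ISCAS 89 benchmark circuits. These results are scheduled for oral presentation at the IPSJ SLDM workshop in May 2010.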
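
The scoring step of the power gating algorithm, the product of the controlling-value probability and the number of gates that can be stopped, can be sketched as a simple ranking. The candidate data and the name rank_gating_candidates are hypothetical, and the sketch ignores details the summary does not describe, such as overlap between the gate sets of different candidates.

```python
def rank_gating_candidates(candidates):
    """Order power gating candidates by expected benefit.

    candidates: list of (signal_name, p_controlling, gates_stopped),
    where p_controlling is the probability that the control signal
    takes its controlling value. Score = p_controlling * gates_stopped.
    """
    return sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)


# Hypothetical candidates: scores are 5.0, 3.6, and 8.0 respectively.
candidates = [("s1", 0.5, 10), ("s2", 0.9, 4), ("s3", 0.2, 40)]
for name, p, gates in rank_gating_candidates(candidates):
    print(name, p * gates)
# Order printed: s3, s1, s2
```

A signal that rarely takes its controlling value can still rank first when it switches off a large block, as s3 does here.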

Power and Delay Optimization Based on Controlling Values of VLSI Logic Elements

2008

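To improve VLSI performance and reduce power consumption, we proposed methods that use the controlling values of logic elements and carried out basic experiments. For performance, using the fact that the controlling value of an AND gate is 0, we let a transition to 0 that would travel along the longest path of the logic circuit pass early through an AND gate, and derived a method for generating the control condition for this. Transitions to 1 are likewise passed early through an OR gate. Because transitions to 0 and to 1 are skipped (bypassed) separately, we call this the 01-skip method. We applied the method to simple circuits and confirmed the expected speedup. Building a tool and applying it to a variety of circuits remain future work, as does reducing the added circuitry by sharing control circuits. For power reduction, we used the fact that the controlling value of an AND gate is 0, so that when one input is 0 the output is unaffected even if the other input is undefined. On this basis we proposed a method that shuts off power to the block computing the other input, and confirmed its effect on simple circuits. The method has been confirmed to be effective for reducing leakage power, which increases substantially with process scaling, and also for reducing dynamic power. Building a tool, applying it to a variety of circuits, and evaluation with actual LSI prototypes remain future work.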
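
The power-reduction idea can be sketched behaviorally: when the control input of an AND gate is 0, the block feeding the other input is never evaluated. This is only a software analogy for power gating, and the names and_with_gating and parity are hypothetical; parity stands in for an arbitrary logic block.

```python
from itertools import product


def parity(bits):
    """Stand-in for the logic block that computes the other AND input."""
    return sum(bits) % 2


def and_with_gating(control, other_block, block_inputs):
    """Evaluate AND(control, other_block(block_inputs)).

    Returns (output, block_powered). With control == 0, the controlling
    value of AND, the output is 0 regardless of the block, so the block
    is left powered down.
    """
    if control == 0:
        return 0, False
    return other_block(block_inputs) & 1, True


powered = 0
for control, *bits in product((0, 1), repeat=4):
    out, on = and_with_gating(control, parity, bits)
    assert out == (control & parity(bits))  # function is unchanged
    powered += on
print(powered, "of 16 input patterns need the block powered")  # 8 of 16
```

With uniformly distributed inputs the block is idle half the time; in a real circuit the saving depends on how often the control signal takes the value 0.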

Hardware Design Verification Methods Using Programs as Specifications

2002

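In response to the rising abstraction level of hardware design, we studied methods for formally verifying hardware designs using programs as specifications. First, we surveyed current verification methods in journals, international conferences, and workshops. The survey showed that three methods are fundamental: exact verification of sequential circuits using binary decision diagrams, approximate verification based on SAT, and the logic of equality with uninterpreted functions. It also showed that hardware verification methods combining them are being actively studied. For methods that use programs as specifications, however, the emphasis has been mainly on faster simulation through direct execution of the program, and research and development of formal methods turned out to be insufficient. We therefore aimed to develop a method that applies the logic of equality with uninterpreted functions, which among these hardware methods is considered applicable to large circuits, and carried out basic research toward it. Because this logic can decide the equivalence of symbolic expressions, converting program assignments directly into formulas of the logic allows the equivalence of two programs to be decided as the equivalence of formulas. Specifically, targeting programs in C, we derived rules for converting them into formulas of the equality logic, applied the method to multi-byte arithmetic problems, and determined its effectiveness and its limits of applicability. For operations that include carry-select addition, as used in actual processors, verifying the equivalence of additions of about 64 bits turned out to be infeasible in terms of time, so further research is needed, including on the properties of the equality logic itself.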
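
The step of turning program assignments into symbolic expressions can be sketched as follows. This is a deliberately simplified illustration and not the conversion rules derived in the project: it handles only straight-line assignments, treats every operator as an uninterpreted function symbol, and compares the resulting terms syntactically. A full decision procedure for the logic of equality with uninterpreted functions would also reason about equalities between terms. The tuple encoding and the names symbolic_run and equivalent are assumptions.

```python
def symbolic_run(program, inputs):
    """Symbolically execute straight-line code.

    program: list of (target, op, arg, ...) assignments.
    Operators stay uninterpreted: each result is the term
    (op, arg_term, ...) built from the terms bound to the arguments.
    """
    env = {name: name for name in inputs}  # inputs stand for themselves
    for target, op, *args in program:
        env[target] = (op, *(env[a] for a in args))
    return env


def equivalent(p1, p2, inputs, output):
    """True if both programs bind `output` to the same symbolic term."""
    return symbolic_run(p1, inputs)[output] == symbolic_run(p2, inputs)[output]


spec = [("t", "add", "a", "b"), ("s", "sub", "a", "b"), ("r", "mul", "t", "s")]
impl = [("y", "sub", "a", "b"), ("x", "add", "a", "b"), ("r", "mul", "x", "y")]
swap = [("x", "add", "a", "b"), ("y", "sub", "a", "b"), ("r", "mul", "y", "x")]

print(equivalent(spec, impl, ["a", "b"], "r"))  # True
print(equivalent(spec, swap, ["a", "b"], "r"))  # False
```

The second result shows the conservative nature of the abstraction: since mul is uninterpreted, its commutativity is unknown, so programs that differ only in operand order are not judged equivalent.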
