Updated on 2022/06/28

 
KIMURA, Shinji
 
Affiliation
Faculty of Science and Engineering, Graduate School of Information, Production, and Systems
Job title
Professor

Concurrent Post

  • Faculty of Science and Engineering   Graduate School of Fundamental Science and Engineering

  • Faculty of Science and Engineering   School of Fundamental Science and Engineering

Research Institute

  • 2020
    -
    2022

    Research Institute for Science and Engineering   Concurrent Researcher

Education

  •  
    -
    1985

    Kyoto University   Graduate School of Engineering   Doctor Course on Information Engineering  

  •  
    -
    1984

    Kyoto University   Graduate School of Engineering   Master Course on Information Engineering  

  •  
    -
    1982

    Kyoto University   Faculty of Engineering  

Degree

  • Kyoto University   Doctor of Engineering

Research Experience

  • 2002
    -
    Now

    Professor at Waseda University

  • 1993
    -
    2002

    Associate Professor at Nara Institute of Science and Technology

  • 1985
    -
    1993

    Assistant Professor at Dept. of Electric Engineering, Kobe University

Professional Memberships

  •  
     
     

    Association for Computing Machinery

  •  
     
     

    IPSJ

  •  
     
     

    IEICE

  •  
     
     

    IEEE

  •  
     
     

    Information Processing Society of Japan

  •  
     
     

    The 14th Workshop on Synthesis And System Integration of Mixed Information technologies

  •  
     
     

    The 15th Workshop on Synthesis And System Integration of Mixed Information technologies

  •  
     
     

    Asia and South Pacific Design Automation Conference

  •  
     
     

    VLSI Design Technologies WG, IEICE

  •  
     
     

    International Conference on Computer Aided Design


Research Areas

  • Electron device and electronic equipment

  • Computer system

Research Interests

  • Logic Circuit Design and Verification, High-level Synthesis and Verification, Electronic Design Automation, LSI

Papers

  • Accuracy-Configurable Low-Power Approximate Floating-Point Multiplier Based on Mantissa Bit Segmentation.

    Jie Li, Yi Guo, Shinji Kimura

    2020 IEEE Region 10 Conference(TENCON)     1311 - 1316  2020

    DOI

  • Approximate FPGA-Based Multipliers Using Carry-Inexact Elementary Modules.

    Yi Guo, Heming Sun, Ping Lei, Shinji Kimura

    IEICE Trans. Fundam. Electron. Commun. Comput. Sci.   103-A ( 9 ) 1054 - 1062  2020

    DOI

  • Small-Area and Low-Power FPGA-Based Multipliers using Approximate Elementary Modules.

    Yi Guo, Heming Sun, Shinji Kimura

    Proc. of ASP-DAC 2020     599 - 604  2020  [Refereed]

    DOI

  • Energy-Efficient and High-Speed Approximate Signed Multipliers with Sign-Focused Compressors.

    Yi Guo, Heming Sun, Shinji Kimura

    Proc. of 2019 32nd IEEE International System-on-Chip Conference (SOCC)     330 - 335  2019  [Refereed]

    DOI

  • Approximate Multiplier Using Reordered 4-2 Compressor with OR-based Error Compensation.

    Yufeng Xu, Yi Guo, Shinji Kimura

    Proc. of 2019 IEEE 13th International Conference on ASIC (ASICON)     1 - 4  2019  [Refereed]

    DOI

  • Approximate DCT Design for Video Encoding Based on Novel Truncation Scheme.

    Heming Sun, Zhengxue Cheng, Amir Masoud Gharehbaghi, Shinji Kimura, Masahiro Fujita

    IEEE Trans. Circuits Syst. I Regul. Pap.   66-I ( 4 ) 1517 - 1530  2019  [Refereed]

    DOI

  • Design of Low-Cost Approximate Multipliers Based on Probability-Driven Inexact Compressors.

    Yi Guo, Heming Sun, Ping Lei, Shinji Kimura

    IEICE Trans. Fundam. Electron. Commun. Comput. Sci.   102-A ( 12 ) 1781 - 1791  2019  [Refereed]

    DOI

  • Design of Power and Area Efficient Lower-Part-OR Approximate Multiplier.

    Yi Guo, Heming Sun, Shinji Kimura

    TENCON 2018 - 2018 IEEE Region 10 Conference(TENCON)     2110 - 2115  2018  [Refereed]

    DOI

  • Energy-Efficient and High Performance Approximate Multiplier Using Compressors Based on Input Reordering.

    Zhenhao Liu, Yi Guo, Xiaoting Sun, Shinji Kimura

    TENCON 2018 - 2018 IEEE Region 10 Conference(TENCON)     545 - 550  2018  [Refereed]

    DOI

  • Sparseness Ratio Allocation and Neuron Re-pruning for Neural Networks Compression.

    Li Guo 0006, Dajiang Zhou, Jinjia Zhou, Shinji Kimura

    IEEE International Symposium on Circuits and Systems(ISCAS)     1 - 5  2018  [Refereed]

    DOI

  • Embedded Frame Compression for Energy-Efficient Computer Vision Systems.

    Li Guo 0006, Dajiang Zhou, Jinjia Zhou, Shinji Kimura

    IEEE International Symposium on Circuits and Systems(ISCAS)     1 - 5  2018  [Refereed]

    DOI

  • A Radix-4 Partial Product Generation-Based Approximate Multiplier for High-speed and Low-power Digital Signal Processing.

    Xiaoting Sun, Yi Guo, Zhenhao Liu, Shinji Kimura

    25th IEEE International Conference on Electronics, Circuits and Systems(ICECS)     777 - 780  2018  [Refereed]

    DOI

  • Sparse ternary connect: Convolutional neural networks using ternarized weights with enhanced sparsity.

    Canran Jin, Heming Sun, Shinji Kimura

    23rd Asia and South Pacific Design Automation Conference(ASP-DAC)     190 - 195  2018  [Refereed]

    DOI

  • Quad-multiplier packing based on customized floating point for convolutional neural networks on FPGA.

    Zhifeng Zhang, Dajiang Zhou, Shihao Wang, Shinji Kimura

    23rd Asia and South Pacific Design Automation Conference(ASP-DAC)     184 - 189  2018  [Refereed]

    DOI

  • Low-Cost Approximate Multiplier Design using Probability-Driven Inexact Compressors.

    Yi Guo, Heming Sun, Li Guo 0006, Shinji Kimura

    2018 IEEE Asia Pacific Conference on Circuits and Systems(APCCAS)     291 - 294  2018  [Refereed]

    DOI

  • Towards Ultrasound Everywhere: A Portable 3D Digital Back-End Capable of Zone and Compound Imaging.

    Aya Ibrahim, Shuping Zhang, Federico Angiolini, Marcel Arditi, Shinji Kimura, Satoshi Goto, Jean-Philippe Thiran, Giovanni De Micheli

    IEEE Trans. Biomed. Circuits Syst.   12 ( 5 ) 968 - 981  2018  [Refereed]

    DOI

  • Lossy Compression for Embedded Computer Vision Systems.

    Li Guo 0006, Dajiang Zhou, Jinjia Zhou, Shinji Kimura, Satoshi Goto

    IEEE Access   6   39385 - 39397  2018  [Refereed]

    DOI

  • A Variable-Clock-Cycle-Path VLSI Design of Binary Arithmetic Decoder for H.265/HEVC.

    Jinjia Zhou, Dajiang Zhou, Shuping Zhang, Shinji Kimura, Satoshi Goto

    IEEE Trans. Circuits Syst. Video Technol.   28 ( 2 ) 556 - 560  2018

    The next-generation 8K ultra-high-definition video format involves an extremely high bit rate, which imposes a high throughput requirement on the entropy decoder component of a video decoder. Context adaptive binary arithmetic coding (CABAC) is the entropy coding tool in the latest video coding standards including H.265/High Efficiency Video Coding and H.264/Advanced Video Coding. Due to critical data dependencies at the algorithm level, a CABAC decoder is difficult to accelerate by simply leveraging parallelism and pipelining. This letter presents a new very-large-scale integration arithmetic decoder, which is the most critical bottleneck in CABAC decoding. Our design features a variable-clock-cycle-path architecture that exploits the differences in critical path delay and in probability of occurrence between various types of binary symbols (bins). The proposed design also incorporates a novel data-forwarding technique (rLPS forwarding) and a fast path-selection technique (coarse bin type decision), and is enhanced with the capability of processing additional bypass bins. As a result, its maximum throughput reaches 1010 Mbins/s in 90-nm CMOS, when decoding 0.96 bin per clock cycle at a maximum clock rate of 1053 MHz, which outperforms previous works by 19.1%.

    DOI

  • Distortion control and optimization for lossy embedded compression in video codec system

    Li Guo, Dajiang Zhou, Shinji Kimura, Satoshi Goto

    IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences   E100A ( 11 ) 2416 - 2424  2017.11

    For mobile video codecs, the huge energy dissipation for external memory traffic is a critical challenge under the battery power constraint. Lossy embedded compression (EC), as a solution to this challenge, is considered in this paper. While previous studies in lossy EC mostly focused on algorithm optimization to reduce distortion, this work, to the best of our knowledge, is the first one that addresses the distortion control. Firstly, from both theoretical analysis and experiments for distortion optimization, a conclusion is drawn that, at the frame level, allocating memory traffic evenly is a reliable approximation to the optimal solution to minimize quality loss. Then, to reduce the complexity of decoding twice, the distortion between two sequences is estimated by a linear function of that calculated within one sequence. Finally, on the basis of even allocation, the distortion control is proposed to determine the amount of memory traffic according to a given distortion limitation. With the adaptive target setting and estimating function updating in each group of pictures (GOP), the scene change in video stream is supported without adding a detector or retraining process. From experimental results, the proposed distortion control is able to accurately fix the quality loss to the target. Compared to the baseline of negative feedback on non-referred B frames, it achieves about twice memory traffic reduction.

    DOI

  • Fast Algorithm and VLSI Architecture of Rate Distortion Optimization in H.265/HEVC

    Heming Sun, Dajiang Zhou, Landan Hu, Shinji Kimura, Satoshi Goto

    IEEE TRANSACTIONS ON MULTIMEDIA   19 ( 11 ) 2375 - 2390  2017.11  [Refereed]

    In H.265/high efficiency video coding (HEVC) encoding, rate distortion optimization (RDO) is an important cost function for mode decision and coding structure decision. Despite being near-optimum in terms of coding efficiency, RDO suffers from a high complexity. To address this problem, this paper presents a fast RDO algorithm and its very large scale implementation (VLSI) for both intra-and inter-frame coding. The proposed algorithm employs a quantization-free framework that significantly reduces the complexity for rate and distortion optimization. Meanwhile, it maintains a low degradation of coding efficiency by taking the syntax element organization and probability model of HEVC into consideration. The algorithm is also designed with hardware architecture in mind to support an efficient VLSI implementation. When implemented in the HEVC test model, the proposed algorithm achieves 62% RDO time reduction with 1.85% coding efficiency loss for the "all-intra" configuration. The hardware implementation achieves 1.6 x higher normalized throughput relative to previous works, and it can support a throughput of 8k@30fps (for four fine-processed modes per prediction unit) with 256 k logic gates when working at 200 MHz.

    DOI

  • Time-efficient and TSV-aware 3D gated clock tree synthesis based on self-tuning spectral clustering

    Fan Yang, Minghao Lin, Heming Sun, Shinji Kimura

    Midwest Symposium on Circuits and Systems   2017-   1200 - 1203  2017.09

    3D gated clock tree synthesis (CTS) mainly consists of three steps: 1) abstract clock topology generation; 2) layer embedding for minimal TSV allocation; and 3) clock tree routing with gate and buffer insertion. In this paper, a self-tuning spectral clustering based nearest-neighbor selection (SSC-NNS) algorithm with parallel structure is proposed to achieve high time efficiency in clock tree topology generation, with reduced runtime. In addition, a postorder traversal based layer embedding (PTLE) strategy is adopted for determining the embedding layer of internal nodes with minimal TSV usage. Experimental results show that the proposed method achieves 32% and 82% runtime reduction on ISPD2009 and IBM benchmarks respectively compared with the state-of-the-art 3D work. Besides, the TSV count is also reduced by 46% on ISPD2009 benchmarks.

    DOI

  • A low-cost approximate 32-point transform architecture

    Heming Sun, Zhengxue Cheng, Amir Masoud Gharehbaghi, Shinji Kimura, Masahiro Fujita

    Proceedings - IEEE International Symposium on Circuits and Systems    2017.09

    This paper presents an area-efficient approximate method for the 32-point transform, which is one of the most area-consuming parts in High Efficiency Video Coding (HEVC) applications. Compared to prior work, this work reduces the hardware cost of the transform by 1) eliminating all the arithmetic operations of 6 least significant bits (LSB), 2) presenting a low-delay method for generating carry propagation from the remaining 5 LSBs and 3) truncating the most significant bits (MSB) according to the position of component. In the implementation of a 32-point forward transform, the experimental results show that 27% area consumption can be saved and the coding efficiency loss caused by the approximation is only 0.044% compared with the original.

    DOI

  • Effective write-reduction method for MLC non-volatile memory

    Masashi Tawada, Shinji Kimura, Masao Yanagisawa, Nozomu Togawa

    Proceedings - IEEE International Symposium on Circuits and Systems    2017.09

    Recently, the demand for non-volatile memory in embedded systems has increased because it can be combined with normally-off and power gating technologies. However, non-volatile memories have a lower endurance than volatile memories. When data is appropriately encoded with a write-reduction code, the endurance of non-volatile memory can be enhanced by writing the encoded data into the memory. We propose a highly effective write-reduction method for a multi-level cell (MLC) non-volatile memory focusing on the write-reduction code (WRC) as the optimal bit-write reduction method. The WRC can be applied only to single-level cell non-volatile memory. The proposed method generates a cell-write reduction code based on the WRC, where each cell holds multiple bits of data. Our proposed method reduces cell writes by 31.6% compared to the conventional method.

    DOI

  • A 7-Die 3D Stacked 3840 × 2160@120 fps motion estimation processor

    Zhang, Shuping, Zhou, Jinjia, Zhou, Dajiang, Kimura, Shinji, Goto, Satoshi

    IEICE Transactions on Electronics   E100C ( 3 ) 223 - 231  2017.03

    In this paper, a hamburger architecture with a 3D stacked reconfigurable memory is proposed for a 4K motion estimation (ME) processor. By positioning the memory dies on both the top and bottom sides of the processor die, the proposed hamburger architecture can reduce the usage of the signal through-silicon via (TSV), and balance the power delivery network and the clock tree of the entire system. It results in 1/3 reduction of the usage of signal TSVs. Moreover, a stacked reconfigurable memory architecture is proposed to reduce the fabrication complexity and further reduce the number of signal TSVs by more than 1/2. The reduction of signal TSVs in the entire design is 71.24%. Finally, we address unique issues that occur in electronic design automation (EDA) tools during 3D large-scale integration (LSI) designs. As a result, a 4K ME processor with 7-die stacking 3D system-on-chip design is implemented. The proposed design can support real-time 3840 × 2160 @ 120 fps encoding at 130 MHz with less than 540 mW.

    DOI

  • Accelerating HEVC inter prediction with improved merge mode handling

    Cheng, Zhengxue, Sun, Heming, Zhou, Dajiang, Kimura, Shinji

    IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences   E100A ( 2 ) 546 - 554  2017.02

    High Efficiency Video Coding (HEVC/H.265) obtains 50% bit rate reduction than H.264/AVC standard with comparable quality at the cost of high computational complexity. Merge mode is one of the most important new features introduced in HEVC's inter prediction. Merge mode and traditional inter mode consume about 90% of the total encoding time. To address this high complexity, this paper utilizes the merge mode to accelerate inter prediction by four strategies. 1) A merge candidate decision is proposed by the sum of absolute transformed difference (SATD) cost. 2) An early merge termination is presented with more than 90% accuracy. 3) Due to the compensation effect of merge candidates, symmetric motion partition (SMP) mode is disabled for non-8×8 coding units (CUs). 4) A fast coding unit filtering strategy is proposed to reduce the number of CUs which need to be fine-processed. Experimental results demonstrate that our fast strategies can achieve 35.4%-58.7% time reduction with 0.68%-1.96% BD-rate increment in RA case. Compared with similar works, the proposed strategies are not only among the best performing in average-case complexity reduction, but also notably outperforming in the worst cases.

    DOI

  • Development of TOF-PET using Compton scattering by plastic scintillators

    Kuramoto, M, Nakamori, T, Kimura, S, Gunji, S, Takakura, M, Kataoka, J

    Nuclear Instruments and Methods in Physics Research, Section A: Accelerators, Spectrometers, Detectors and Associated Equipment   845   668 - 672  2017.02

    We propose a time-of-flight (TOF) technique for positron emission tomography (PET) using plastic scintillators, which have a fast decay time of a few ns. While the photoelectric absorption probability of the plastic for 511 keV gamma rays is extremely low due to its small density and effective atomic number, the cross section of Compton scattering is comparable to that of absorption by conventional inorganic scintillators. We thus propose TOF-PET using Compton scattering with plastic scintillators (Compton-PET), and performed fundamental experiments towards exploration of the Compton-PET capability. We demonstrated that the plastic scintillators achieved better time resolution than LYSO(Ce) and GAGG(Ce) scintillators. In addition, we evaluated the depth-of-interaction resolving capability with the plastic scintillators.

    DOI

  • Distortion Control and Optimization for Lossy Embedded Compression in Video Codec System

    GUO Li, ZHOU Dajiang, KIMURA Shinji, GOTO Satoshi

    IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences   100 ( 11 ) 2416 - 2424  2017

    For mobile video codecs, the huge energy dissipation for external memory traffic is a critical challenge under the battery power constraint. Lossy embedded compression (EC), as a solution to this challenge, is considered in this paper. While previous studies in lossy EC mostly focused on algorithm optimization to reduce distortion, this work, to the best of our knowledge, is the first one that addresses the distortion control. Firstly, from both theoretical analysis and experiments for distortion optimization, a conclusion is drawn that, at the frame level, allocating memory traffic evenly is a reliable approximation to the optimal solution to minimize quality loss. Then, to reduce the complexity of decoding twice, the distortion between two sequences is estimated by a linear function of that calculated within one sequence. Finally, on the basis of even allocation, the distortion control is proposed to determine the amount of memory traffic according to a given distortion limitation. With the adaptive target setting and estimating function updating in each group of pictures (GOP), the scene change in video stream is supported without adding a detector or retraining process. From experimental results, the proposed distortion control is able to accurately fix the quality loss to the target. Compared to the baseline of negative feedback on non-referred B frames, it achieves about twice memory traffic reduction.

    CiNii

  • A 7-Die 3D Stacked 3840×2160@120 fps Motion Estimation Processor.

    Shuping Zhang, Jinjia Zhou, Dajiang Zhou, Shinji Kimura, Satoshi Goto

    IEICE Trans. Electron.   100-C ( 3 ) 223 - 231  2017  [Refereed]

    In this paper, a hamburger architecture with a 3D stacked reconfigurable memory is proposed for a 4K motion estimation (ME) processor. By positioning the memory dies on both the top and bottom sides of the processor die, the proposed hamburger architecture can reduce the usage of the signal through-silicon via (TSV), and balance the power delivery network and the clock tree of the entire system. It results in 1/3 reduction of the usage of signal TSVs. Moreover, a stacked reconfigurable memory architecture is proposed to reduce the fabrication complexity and further reduce the number of signal TSVs by more than 1/2. The reduction of signal TSVs in the entire design is 71.24%. Finally, we address unique issues that occur in electronic design automation (EDA) tools during 3D large-scale integration (LSI) designs. As a result, a 4K ME processor with 7-die stacking 3D system-on-chip design is implemented. The proposed design can support real time 3840 x 2160 @ 120 fps encoding at 130 MHz with less than 540 mW.

    DOI CiNii

  • Accelerating HEVC Inter Prediction with Improved Merge Mode Handling.

    Zhengxue Cheng, Heming Sun, Dajiang Zhou, Shinji Kimura

    IEICE Trans. Fundam. Electron. Commun. Comput. Sci.   100-A ( 2 ) 546 - 554  2017  [Refereed]

    High Efficiency Video Coding (HEVC/H.265) obtains 50% bit rate reduction than H.264/AVC standard with comparable quality at the cost of high computational complexity. Merge mode is one of the most important new features introduced in HEVC's inter prediction. Merge mode and traditional inter mode consume about 90% of the total encoding time. To address this high complexity, this paper utilizes the merge mode to accelerate inter prediction by four strategies. 1) A merge candidate decision is proposed by the sum of absolute transformed difference (SATD) cost. 2) An early merge termination is presented with more than 90% accuracy. 3) Due to the compensation effect of merge candidates, symmetric motion partition (SMP) mode is disabled for non-8x8 coding units (CUs). 4) A fast coding unit filtering strategy is proposed to reduce the number of CUs which need to be fine-processed. Experimental results demonstrate that our fast strategies can achieve 35.4%-58.7% time reduction with 0.68%-1.96% BD-rate increment in RA case. Compared with similar works, the proposed strategies are not only among the best performing in average-case complexity reduction, but also notably outperforming in the worst cases.

    DOI CiNii

  • An 8K H.265/HEVC Video Decoder Chip With a New System Pipeline Design.

    Dajiang Zhou, Shihao Wang, Heming Sun, Jian-Bin Zhou, Jiayi Zhu, Yijin Zhao, Jinjia Zhou, Shuping Zhang, Shinji Kimura, Takeshi Yoshimura, Satoshi Goto

    J. Solid-State Circuits   52 ( 1 ) 113 - 126  2017  [Refereed]

    8K ultra-HD is being promoted as the next-generation video specification. While the High Efficiency Video Coding (HEVC) standard greatly enhances the feasibility of 8K with a doubled compression ratio, its implementation is a challenge, owing to ultrahigh-throughput requirements and increased complexity per pixel. The latter comes from the new features of HEVC. At the system level, the most challenging of them is the enlarged and highly variable-size coding/prediction/transform units, which significantly increase the requirement for on-chip memory as pipeline buffers and the difficulty in maintaining pipeline utilization. This paper presents an HEVC decoder chip featuring a system pipeline that works at a nonunified and variable granularity. The pipeline saves on-chip memory with a novel block-in-block-out queue system and a parameter delivery network, while allowing overhead-free and fully pipelined operation of the processing components. With the system pipeline design combined with various component-level optimizations, the proposed decoder in 40 nm achieves a maximum throughput of 4 Gpixels/s or 8K 120 frames/s for the low-delay-P configuration of HEVC, 7.5-55 times faster than prior works. It supports 8K 60 frames/s for the low-delay and random-access configurations. In a normalized comparison, it also shows 3.1-3.6 times better area efficiency and 31%-55% superior energy efficiency.

    DOI

  • A low-power VLSI architecture for HEVC de-quantization and inverse transform

    Sun, Heming, Zhou, Dajiang, Zhang, Shuping, Kimura, Shinji

    IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences   E99A ( 12 ) 2375 - 2387  2016.12

    In this paper, we present a low-power system for the de-quantization and inverse transform of HEVC. Firstly, we present a low-delay circuit to process the coded results of the syntax elements, and then reduce the number of multipliers from 16 to 4 for the de-quantization process of each 4x4 block. Secondly, we give two efficient data mapping schemes for the memory between de-quantization and inverse transform, and the memory for transpose. Thirdly, the zero information is utilized through the whole system. For two memory parts, the write and read operation of zero blocks/rows/coefficients can all be skipped to save the power consumption. The results show that up to 86% power consumption can be saved for the memory part under the configuration of "Random-access" and common QPs. For the logical part, the proposed architecture for de-quantization can reduce 77% area consumption. Overall, our system can support real-time coding for 8K x 4K 120 fps video sequences and the normalized area consumption can be reduced by 68% compared with the latest work.

    DOI

  • A Low-Power VLSI Architecture for HEVC De-Quantization and Inverse Transform

    Heming Sun, Dajiang Zhou, Shuping Zhang, Shinji Kimura

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E99A ( 12 ) 2375 - 2387  2016.12  [Refereed]

    In this paper, we present a low-power system for the de-quantization and inverse transform of HEVC. Firstly, we present a low-delay circuit to process the coded results of the syntax elements, and then reduce the number of multipliers from 16 to 4 for the de-quantization process of each 4x4 block. Secondly, we give two efficient data mapping schemes for the memory between de-quantization and inverse transform, and the memory for transpose. Thirdly, the zero information is utilized through the whole system. For two memory parts, the write and read operation of zero blocks/ rows/ coefficients can all be skipped to save the power consumption. The results show that up to 86% power consumption can be saved for the memory part under the configuration of Random-access and common QPs. For the logical part, the proposed architecture for de-quantization can reduce 77% area consumption. Overall, our system can support real-time coding for 8K x 4K 120fps video sequences and the normalized area consumption can be reduced by 68% compared with the latest work.

    DOI CiNii

  • Merge mode based fast inter prediction for HEVC

    Zhengxue Cheng, Heming Sun, Dajiang Zhou, Shinji Kimura

    2015 Visual Communications and Image Processing, VCIP 2015    2016.04

    The latest High Efficiency Video Coding (HEVC/H.265) obtains 50% bit rate reduction than H.264/AVC standard with comparable quality, but at the cost of high computational complexity. Inter prediction accounts for large complexity and merge mode is one of the most important new features introduced in HEVC. To address this issue, this paper utilizes the merge mode to accelerate inter prediction by three fast mode decision methods. 1) A merge candidate decision is proposed to select the best merge mode by Sum of Absolute Transformed Difference (SATD) cost to reduce the merge time. 2) An early merge termination is presented still based on SATD cost with more than 90% accuracy. 3) Based on efficient merge mode, symmetric motion partition (SMP) modes can be disabled for non-8 × 8 code units (CUs). Experimental results demonstrate that our work can achieve 53.1%-54.2% time reduction on average with 1.57%-2.30% BD-rate increment. Besides, our method achieves an improvement of 18%-30% time reduction with 0.89%-2.85% BD-rate increment when combined with other existing approaches.

    DOI

  • A-6-3 Reduction of Rewriting Routing Switches for Reconfiguration of NanoBridge Based FPGA

    Aoki Kohei, Yanagisawa Masao, Kimura Shinji

    Proceedings of the IEICE Engineering Sciences Society/NOLTA Society Conference   2016  2016.03

    CiNii

  • A 4Gpixel/s 8/10b H.265/HEVC Video Decoder Chip for 8K Ultra HD Applications

    Dajiang Zhou, Shihao Wang, Heming Sun, Jianbin Zhou, Jiayi Zhu, Yijin Zhao, Jinjia Zhou, Shuping Zhang, Shinji Kimura, Takeshi Yoshimura, Satoshi Goto

    2016 IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE (ISSCC)   59   266 - U369  2016  [Refereed]

    8K Ultra HD is being promoted as the next-generation digital video format. From a communication channel perspective, the latest high-efficiency video coding standard (H.265/HEVC) greatly enhances the feasibility of 8K by doubling the compression ratio. Implementation of such codecs is a challenge, owing to ultra-high throughput requirements and increased complexity per pixel. The former corresponds to up to 10b/pixel, 7680×4320 pixels/frame and 120 fps - 80× larger than 1080p HD. The latter comes from the new features of HEVC relative to its predecessor H.264/AVC. The most challenging of them is the enlarged and highly variable-size coding/prediction/transform units (CU/PU/TU), which significantly increase: 1) the requirement for on-chip memory as pipeline buffers, 2) the difficulty in maintaining pipeline utilization, and 3) the complexity of inverse transforms (IT). This paper presents an HEVC decoder chip supporting 8K Ultra HD, featuring a 16pixel/cycle true-variable-block-size system pipeline. The pipeline: 1) saves on-chip memory with a novel block-in-block-out (BIBO) queue system and a parameter delivery network, and 2) allows high design efficiency and utilization of processing components through local synchronization. Key optimizations at the component level are also presented.

    DOI

  • FRAME-LEVEL QUALITY AND MEMORY TRAFFIC ALLOCATION FOR LOSSY EMBEDDED COMPRESSION IN VIDEO CODEC SYSTEMS

    Li Guo, Dajiang Zhou, Shinji Kimura, Satoshi Goto

    2016 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW)    2016  [Refereed]

    For mobile video codecs, the huge energy dissipation for external memory traffic is a critical challenge under the battery power constraint. Lossy embedded compression (EC), as a solution to this challenge, is considered in this paper. While previous studies in EC mostly focused on compression algorithms at the block level, this work, to the best of our knowledge, is the first one that addresses the allocation of video quality and memory traffic at the frame level. For lossy EC, a main difficulty of its application lies in the error propagation from quality degradation of reference frames. Instinctively, it is preferred to perform more lossy EC in non-reference frames to minimize the quality loss. The analysis and experiments in this paper, however, will show lossy EC should actually be distributed to more frames. Correspondingly, for hierarchical-B GOPs, we developed an efficient allocation that outperforms the non-reference-only allocation by up to 4.5 dB in PSNR. In comparison, the proposed allocation also delivers more consistent quality between frames by having lower PSNR fluctuation.

    DOI

  • Power-Efficient and Slew-Aware Three Dimensional Gated Clock Tree Synthesis

    Minghao Lin, Heming Sun, Shinji Kimura

    2016 IFIP/IEEE INTERNATIONAL CONFERENCE ON VERY LARGE SCALE INTEGRATION (VLSI-SOC)    2016  [Refereed]

    This paper presents a three-dimensional (3D) gated clock tree synthesis (CTS) approach, which consists of two steps: 1) abstract tree topology generation; and 2) 3D gated and buffered clock routing. A 3D Pair Matching (3D-PM) algorithm is proposed to generate the initial tree topology, and then the proposed TSV-minimization algorithm is applied to generate a TSV-aware tree topology. Based on the TSV-aware tree topology, 3D gated and buffered clock tree routing is done using the proposed 3D Gated and Buffered Deferred-Merge Embedding (3D-GB-DME) algorithm. The slew constraint is satisfied and the clock skew is minimized in our approach. Experimental results show that the proposed method achieves 29.11% power reduction compared with the state-of-the-art 2D work.

    DOI

  • CNN-MERP: An FPGA-Based Memory-Efficient Reconfigurable Processor for Forward and Backward Propagation of Convolutional Neural Networks

    Xushen Han, Dajiang Zhou, Shihao Wang, Shinji Kimura

    PROCEEDINGS OF THE 34TH IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD)     320 - 327  2016  [Refereed]

    Large-scale deep convolutional neural networks (CNNs) are widely used in machine learning applications. While CNNs involve huge complexity, VLSI (ASIC and FPGA) chips that deliver high-density integration of computational resources are regarded as a promising platform for CNN implementation. With massive parallelism of computational units, however, the external memory bandwidth, which is constrained by the pin count of the VLSI chip, becomes the system bottleneck. Moreover, VLSI solutions are usually regarded as lacking the flexibility to be reconfigured for the various parameters of CNNs. This paper presents CNN-MERP to address these issues. CNN-MERP incorporates an efficient memory hierarchy that significantly reduces the bandwidth requirements through multiple optimizations including on/off-chip data allocation, data flow optimization, and data reuse. The proposed 2-level reconfigurability, which is based on the control logic and the multiboot feature of the FPGA, enables fast and efficient reconfiguration. As a result, an external memory bandwidth requirement of 1.94MB/GFlop is achieved, which is 55% lower than prior art. Under limited DRAM bandwidth, a system throughput of 1244GFlop/s is achieved on the Virtex UltraScale platform, which is 5.48 times higher than the state-of-the-art FPGA implementations.

    DOI

  • Optimization of Area and Power in Multi-Mode Power Gating Scheme for Static Memory Elements

    Xing Su, Shinji Kimura

    2016 IEEE ASIA PACIFIC CONFERENCE ON CIRCUITS AND SYSTEMS (APCCAS)     214 - 217  2016  [Refereed]

    This paper presents a method for optimizing the area and power of static memory elements by using a multi-mode power gating (MMPG) scheme. A 2-transistor MMPG scheme replaces the usual 5-transistor one to effectively reduce on-chip area overhead and leakage power, and is combined with trimming circuits (TC) to guarantee safe data retention. When the proposed approach is applied to a clean/dirty-cache (CD-cache), area overhead and leakage power consumption are reduced. The simulation results show that the area overhead of SRAM with the proposed approach is reduced from 33.4% to 21.8% compared to that of SRAM with the usual MMPG. Leakage power is reduced by 12.35% compared to SRAM with the usual MMPG and by 86.77% compared to SRAM without a power gating scheme. Moreover, the noise immunity of SRAM with the proposed approach is also improved.

    DOI

  • ECC-Based Bit-Write Reduction Code Generation for Non-Volatile Memory

    Masashi Tawada, Shinji Kimura, Masao Yanagisawa, Nozomu Togawa

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E98A ( 12 ) 2494 - 2504  2015.12  [Refereed]

    Non-volatile memory has many advantages such as high density and low leakage power, but it consumes more writing energy than SRAM. It is therefore necessary to reduce writing energy in non-volatile memory design. In this paper, we propose write-reduction codes based on error-correcting codes and reduce writing energy in non-volatile memory by decreasing the number of written bits. When data is written into a memory cell, we do not write it directly but encode it into a codeword. In our write-reduction codes, each data word corresponds to an information vector in an error-correcting code, and an information vector corresponds not to a single codeword but to a set of write-reduction codewords. Given the data to write and the current memory bits, we can deterministically select a particular write-reduction codeword corresponding to the data to be written, such that the maximum number of flipped bits is theoretically minimized. The number of bits written into memory cells is then also minimized. Experimental results demonstrate that we have achieved a reduction in written bits of 51% on average and an energy reduction of 33% on average compared to non-encoded memory.
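
    The idea of mapping one data word to a set of candidate codewords and choosing the one closest to the current memory contents can be illustrated with a textbook syndrome (coset) construction on the [7,4] Hamming code. This is a minimal sketch under that assumption and not necessarily the authors' code construction: 3 data bits are stored in 7 cells, the data is read back as the syndrome of the stored word, and any update flips at most one cell. All function names are hypothetical.

```python
def syndrome(word):
    """Syndrome of a 7-bit word under the [7,4] Hamming parity-check matrix
    whose i-th column is the binary representation of i (i = 1..7).
    Bit (i - 1) of `word` holds cell position i."""
    s = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s


def decode(word):
    """The stored 3-bit data value is the syndrome of the 7 memory cells."""
    return syndrome(word)


def write(current, data):
    """Return the 7-bit word closest to `current` whose syndrome equals `data`.
    Flipping cell position p changes the syndrome by XOR p, so at most one flip is needed."""
    diff = syndrome(current) ^ data
    if diff == 0:
        return current  # data already represented, nothing to write
    return current ^ (1 << (diff - 1))


def flipped_bits(old, new):
    """Number of cells whose value changes between two stored words."""
    return bin(old ^ new).count("1")


stored = 0b1011001
updated = write(stored, 0b101)
print(decode(updated), flipped_bits(stored, updated))
```

    Writing 3 raw bits directly can flip up to 3 cells; the sketch trades 4 extra cells for a worst case of 1 flipped cell per write.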

    DOI

  • An independent bandwidth reduction device for HEVC VLSI video system

    Jiayi Zhu, Li Guo, Dajiang Zhou, Shinji Kimura, Satoshi Goto

    Proceedings - IEEE International Symposium on Circuits and Systems   2015-   609 - 612  2015.07  [Refereed]

    Frame re-compression (FRC) is a widely used technique for reducing the SDRAM (synchronous dynamic random access memory) bandwidth of HEVC video systems. However, in previous research, FRC imposes requirements on the access pattern, and hence its usage is limited to HEVC video codecs. A typical HEVC VLSI video system, meanwhile, contains many other video IPs with high bandwidth requirements. Therefore, in this article, we propose a new FRC architecture that overcomes this limitation and is applicable to all the video IPs in an HEVC VLSI video system, which raises the overall bandwidth reduction rate of the whole video system. Our proposal has two points. First, we propose an FRC architecture based on the system internal bus, which is independent, transparent, and easily connected to all other video IPs. Second, we propose an FA (freely access) scheme to remove the requirements on access pattern found in previous work. By using this proposal, the bandwidth reduction rate in our VLSI video system model is raised from 92.4% to 69.6%.

    DOI

  • Low-Power Motion Estimation Processor with 3D Stacked Memory

    Shuping Zhang, Jinjia Zhou, Dajiang Zhou, Shinji Kimura, Satoshi Goto

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E98A ( 7 ) 1431 - 1441  2015.07  [Refereed]

    Motion estimation (ME) is a key encoding component of almost all modern video coding standards. ME contributes significantly to video coding efficiency, but it also consumes the most power of any component in a video encoder. In this paper, an ME processor with a 3D stacked memory architecture is proposed to reduce memory and core power consumption. First, a memory die is designed and stacked with the ME die. By adding face-to-face (F2F) pads and through-silicon-via (TSV) definitions, 2D electronic design automation (EDA) tools can be extended to support the proposed 3D stacking architecture. Moreover, a special memory controller is applied to control data transmission and timing between the memory die and the ME processor die. Finally, a 3D physical design is completed for the entire system. This design includes TSV/F2F placement, floor plan optimization, and power network generation. Compared to 2D technology, the number of input/output (IO) pins is reduced by 77%. After optimizing the floor plan of the processor die and memory die, the routing wire lengths are reduced by 13.4% and 50%, respectively. The stacked static random access memory contributes the most power reduction in this work. The simulation results show that the design can support real-time 720p @ 60 fps encoding at 8MHz using less than 65mW in power, which compares favorably with state-of-the-art ME processors.

    DOI

  • Control Signal Extraction for Sequential Clock Gating Using Time Expansion of Sequential Circuits

      2015 ( 6 ) 1 - 6  2015.05

    Clock gating is now widely used to reduce the dynamic power of LSIs. Synthesis tools can insert clock gating automatically, but they have drawbacks; for example, designers must specify the control signals. More aggressive and automatable clock gating techniques have therefore been proposed. In this study, a clock gating candidate extraction method for combinational clock gating is extended to sequential clock gating by using time expansion of sequential circuits. Using time expansion and SAT-based detection, signals from multiple clock cycles in the past can be found as candidates. The proposed method was applied to the ISCAS'89 benchmarks and found more control signal candidates.

    CiNii

  • A Bit-Write Reduction Method based on Error-Correcting Codes for Non-Volatile Memories

    Masashi Tawada, Shinji Kimura, Masao Yanagisawa, Nozomu Togawa

    2015 20TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE (ASP-DAC)     496 - 501  2015  [Refereed]

    Non-volatile memory has many advantages over SRAM. However, one of its largest problems is that it consumes a large amount of energy in writing. In this paper, we propose a bit-write reduction method based on error-correcting codes for non-volatile memories. When data is written into a memory cell, we do not write it directly but encode it into a codeword. We focus on error-correcting codes and generate new codes called write-reduction codes. In our write-reduction codes, each data word corresponds to an information vector in an error-correcting code, and an information vector corresponds not to a single codeword but to a set of write-reduction codewords. Given the data to write and the current memory bits, we can deterministically select a particular write-reduction codeword corresponding to the data to be written, such that the maximum number of flipped bits is theoretically minimized. The number of bits written into memory cells is then also minimized. We perform several experimental evaluations and demonstrate up to 72% energy reduction.

  • HARDWARE-ORIENTED RATE-DISTORTION OPTIMIZATION ALGORITHM FOR HEVC INTRA-FRAME ENCODER

    Landan Hu, Heming Sun, Dajiang Zhou, Shinji Kimura

    2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW)    2015  [Refereed]

    Digital video is widely used in mobile applications, where video compression technology is necessary to store or transmit the videos. High Efficiency Video Coding (HEVC) achieves the highest compression ratio at the cost of huge computational complexity, the majority of which comes from rate-distortion (RD) cost calculation. This paper presents a low-complexity RD estimation method for HEVC intra prediction based on the following schemes. 1) The transformed coefficients rather than the quantized coefficients are used for RD estimation. 2) For the rate part, the position after the last non-zero quantized coefficient is considered to improve the accuracy of estimation, and a header-bit estimation method is presented that saves about 82% of the complexity of header bit calculation. 3) For the distortion part, the scaling parameter of quantization is modified to a power of two so that the bit depth of multiplication can be reduced from 15 to 5 in the worst case. 4) For 4x4 transform units, we consider transform skip mode, which is neglected in prior research. Our proposal achieves a 72.22% time reduction in rate-distortion optimization (RDO) compared with the original HEVC Test Model, while the BD-rate is only 1.76%.

    DOI

  • Fast SAO Estimation Algorithm and Its Implementation for 8 K x 4 K @ 120 FPS HEVC Encoding

    Jiayi Zhu, Dajiang Zhou, Shinji Kimura, Satoshi Goto

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E97A ( 12 ) 2488 - 2497  2014.12  [Refereed]

    High efficiency video coding (HEVC) is the new-generation video compression standard. Sample adaptive offset (SAO) is a new compression tool adopted in HEVC that reduces the distortion between original samples and reconstructed samples. SAO estimation is the process of determining SAO parameters in video encoding. It is divided into two phases: statistics collection and parameter determination. There are two difficulties for VLSI implementation of SAO estimation. The first is that there is a huge number of samples to deal with in the statistics collection phase. The other is that the complexity of Rate Distortion Optimization (RDO) in the parameter determination phase is very high. In this article, a fast SAO estimation algorithm and its corresponding VLSI architecture are proposed. For the first difficulty, we use bitmaps to collect statistics of all the 16 samples in one 4 x 4 block simultaneously. For the second difficulty, we simplify a series of complicated procedures in HM to balance the algorithm's complexity and BD-rate performance. Experimental results show that the proposed algorithm maintains the picture quality improvement. The VLSI design based on this algorithm can be implemented using 156.32 K gates and 8,832 bits of single-port RAM for the 8-bit depth case. It can be synthesized to 400 MHz @ 65 nm technology and is capable of 8 K x 4 K @ 120 fps encoding.
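
    The statistics collection phase can be illustrated for the SAO edge-offset mode. The sketch below is a plain sample-by-sample software version along a single (horizontal) direction, not the bitmap-based parallel scheme of the paper; the function names and the sample values are made up for illustration.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def edge_category(left, cur, right):
    """HEVC SAO edge-offset category of `cur` relative to its two neighbors.
    1: local minimum, 2 and 3: edges, 4: local maximum, 0: no offset applied."""
    raw = 2 + sign(cur - left) + sign(cur - right)
    return {0: 1, 1: 2, 2: 0, 3: 3, 4: 4}[raw]


def collect_stats(orig, rec):
    """Accumulate, per category, the sample count and the sum of
    (original - reconstructed) differences over the interior samples of one row."""
    count = [0] * 5
    diff_sum = [0] * 5
    for i in range(1, len(rec) - 1):
        cat = edge_category(rec[i - 1], rec[i], rec[i + 1])
        count[cat] += 1
        diff_sum[cat] += orig[i] - rec[i]
    return count, diff_sum


# Hypothetical original and reconstructed sample rows.
orig_row = [10, 9, 10, 11, 12, 9]
rec_row = [10, 8, 10, 12, 12, 9]
count, diff_sum = collect_stats(orig_row, rec_row)
print(count, diff_sum)
```

    The per-category counts and difference sums are the inputs to the parameter determination phase, where the offsets are chosen by rate-distortion optimization.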

    DOI

  • Small-Sized Encoder/Decoder Circuit Design for Bit-Write Reduction Targeting Non-Volatile Memories

    TAWADA Masashi, KIMURA Shinji, YANAGISAWA Masao, TOGAWA Nozomu

    Technical report of IEICE. VLD   114 ( 328 ) 227 - 232  2014.11

    Non-volatile memory has many advantages such as low leakage power and non-volatility. However, non-volatile memory consumes a large amount of energy in writing, and the maximum number of bit rewrites is limited. We have proposed a Hamming-code based bit-write reduction method using data encoding/decoding, but its encoder/decoder becomes too large. In this paper, we propose a small-sized encoder/decoder circuit design for the bit-write reduction codes. In this design, we simplify data encoding/decoding by using code redundancy. Experimental results show the efficiency of our encoder/decoder design.

    CiNii

  • Fast SAO estimation algorithm and its VLSI architecture

    Jiayi Zhu, Dajiang Zhou, Shinji Kimura, Satoshi Goto

    2014 IEEE International Conference on Image Processing, ICIP 2014     1278 - 1282  2014.01  [Refereed]

    SAO estimation is the process of determining SAO parameters in video encoding. There are two difficulties for VLSI implementation of SAO estimation. The first is that there is a huge number of samples to deal with in the statistics collection phase. The other is that the complexity of RDO in the parameter determination phase is very high. In this article, a fast SAO estimation algorithm and its corresponding VLSI architecture are proposed. For the first difficulty, we use bitmaps to collect statistics of all the 16 samples in one 4×4 block simultaneously. For the second difficulty, we simplify a series of complicated procedures in HM to balance the complexity and BD-rate performance. Experimental results show that the proposed algorithm maintains the picture quality improvement. The VLSI design based on this algorithm can be implemented with 156.32K gates and 8832 bits of SPRAM, runs at 400MHz @ 65nm technology, and is capable of 8Kx4K @ 120fps encoding.

    DOI

  • AN AREA-EFFICIENT 4/8/16/32-POINT INVERSE DCT ARCHITECTURE FOR UHDTV HEVC DECODER

    Heming Sun, Dajiang Zhou, Jiayi Zhu, Shinji Kimura, Satoshi Goto

    2014 IEEE VISUAL COMMUNICATIONS AND IMAGE PROCESSING CONFERENCE     197 - 200  2014  [Refereed]

    This paper presents a new VLSI architecture for HEVC inverse discrete cosine transform (IDCT). Compared to prior art, this work reduces hardware cost by 1) reducing computational logic of 1-D IDCTs with a reordered parallel-in serial-out (RPISO) scheme that shares the inputs of the butterfly structure, and 2) reducing the area of the transpose buffer with a cyclic memory organization that achieves 100% I/O utilization of the SRAMs. In the implementation of a unified 4/8/16/32-point IDCT, the proposed schemes demonstrate 35% and 62% reduction of logic and memory costs, respectively. The IDCT implementation can support real-time decoding of 4Kx2K 60fps video with a total hardware cost of 357,250 µm² on 2-D IDCT and 80,988 µm² on transpose memory in a 90nm process.

  • A Reduction Method of Writing Operations to Non-volatile Memory by Keeping Data Difference for Low-Power Circuit Design

    SHINOHARA Hiroyuki, YANAGISAWA Masao, KIMURA Shinji

    Technical report of IEICE. VLD   113 ( 416 ) 167 - 172  2014.01

    In order to reduce the power consumption of LSI, unnecessary parts should be powered off with fine granularity, and the current status data before power-off should be stored to resume operation after power-on. Next-generation non-volatile memory is expected to be used to store data for power-off. However, the writing power of non-volatile memory is about 10 times higher than that of CMOS memory, so reducing write operations is very important for reducing the total energy. The manuscript proposes a write reduction method that uses the difference between the original data and the new data, targeting monitoring data sequences such as those of wireless sensor nodes. By exploiting the redundancy between the difference and the original data, the number of bits written to these registers can be reduced. The modification system for the original and differential data registers has been developed and its power consumption has been evaluated. When applied to temperature monitoring, a 24% reduction in written bits and an 11% power reduction are obtained.

    CiNii

  • Dual-Stage Pseudo Power Gating with Advanced Clustering Algorithm for Gate Level Power Optimization

    Yu Jin, Zhe Du, Shinji Kimura

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E96A ( 12 ) 2568 - 2575  2013.12  [Refereed]

    Pseudo Power Gating (Pseudo PG) is a gate-level power reduction method for combinational circuits that stops unnecessary input changes of gates. In Pseudo PG, an extra control signal may be added to a gate, and other input changes of the gate are deactivated when the control signal takes the controlling value. To improve the power reduction capability, this paper introduces dual-stage Pseudo PG with an advanced clustering algorithm, in which up to two extra control signals are added to a gate if effective. The advanced clustering algorithm selects the first control signal to be compatible with the second control signal based on the propagation of the controlling condition via a path, with which candidate controllable gates excluded by the maximum depth constraint can be controlled. Experimental results show that the proposed dual-stage Pseudo PG method obtains 23.23% average power reduction with 5.28% delay penalty with respect to the original circuits, and obtains 10.46% more power reduction with 2.75% delay penalty compared with circuits applying the original single-stage Pseudo PG.

    DOI

  • Power Reduction of Non-volatile Logic Circuits Using the Minimum Writing Power Cut-set of State Registers

    ITOI Yudai, KIMURA Shinji

    Technical report of IEICE. VLD   113 ( 320 ) 147 - 152  2013.11

    Recently, next-generation non-volatile memories/registers using magnetic tunnel junction elements have attracted attention. Such devices can keep their data when powered off, can be integrated in CMOS LSI, and have fast access speed. By using such devices, we can apply fine-grained and low-overhead power control to CMOS LSI. The write energy of such devices, however, is about 10 times larger than that of a usual D flip-flop, so it is very important to reduce the write operations on such devices. We have therefore proposed a write reduction method for non-volatile registers, in which a minimum cut-set that has the smallest switching activity is searched for by using the min-cut max-flow theorem and non-volatile registers are inserted at the cut-set. In this study, we also consider the overhead of additional circuits for recovering and saving the state to minimize the total power of the circuit. The method has been implemented and applied to the ISCAS 89 benchmarks. Compared with the case where non-volatile registers are inserted at the original positions, power reductions of 2.6% to 15.1% (8.34% on average) have been found.
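
    The cut-set search can be sketched with a generic max-flow computation. The example below is an illustration only, assuming a simplified model in which each signal is a directed edge whose capacity is its switching activity; the graph, its node names, and the capacities are hypothetical, and the modelling used in the paper may differ. It uses the Edmonds-Karp algorithm (shortest augmenting paths) and then reads the minimum cut from the residual graph.

```python
from collections import deque


def min_cut(capacity, source, sink):
    """capacity: dict {u: {v: cap}}. Returns (cut_value, sorted list of cut edges)."""
    # Residual graph; reverse edges start with zero capacity.
    residual = {source: {}, sink: {}}
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)

    def bfs():
        """Breadth-first search over edges with remaining capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs()
        if sink not in parent:
            break  # no augmenting path left: the flow is maximum
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck

    # Nodes still reachable from the source form one side of the minimum cut.
    reachable = set(parent)
    cut = sorted((u, v) for u, nbrs in capacity.items() for v in nbrs
                 if u in reachable and v not in reachable)
    return flow, cut


# Hypothetical signal graph; capacities stand for switching activities.
activity = {
    "in": {"a": 5, "b": 4},
    "a": {"c": 2},
    "b": {"c": 1},
    "c": {"out": 6},
}
cut_value, cut_edges = min_cut(activity, "in", "out")
print(cut_value, cut_edges)
```

    In this toy graph the registers would be placed on the two low-activity signals feeding node `c` rather than on the primary inputs or the output.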

    CiNii

  • Energy Evaluation of Writing Reduction Method for Non-Volatile Memory

    TAWADA Masashi, KIMURA Shinji, YANAGISAWA Masao, TOGAWA Nozomu

    Technical report of IEICE. VLD   113 ( 320 ) 141 - 146  2013.11

    Non-volatile memory has many advantages over SRAM, such as high density, low leakage power, and non-volatility. However, one of its largest problems is that it consumes a large amount of energy in writing. It is therefore necessary to reduce the number of written bits and thus decrease its writing energy. We have proposed a memory writing reduction method based on error-correcting codes. When data is written into a memory, we do not write it directly but encode it into a codeword. The number of bits written into memory is thereby limited. In this paper, we present several experimental evaluations from the viewpoint of energy reduction and discuss the effectiveness of our proposed writing-reduction codes.

    CiNii

  • Power Reduction of Non-volatile Logic Circuits Using the Minimum Writing Power Cut-set of State Registers

    ITOI Yudai, KIMURA Shinji

    IEICE technical report. Dependable computing   113 ( 321 ) 147 - 152  2013.11

    Recently, next-generation non-volatile memories/registers using magnetic tunnel junction elements have attracted attention. Such devices can keep their data when powered off, can be integrated in CMOS LSI, and have fast access speed. By using such devices, we can apply fine-grained and low-overhead power control to CMOS LSI. The write energy of such devices, however, is about 10 times larger than that of a usual D flip-flop, so it is very important to reduce the write operations on such devices. We have therefore proposed a write reduction method for non-volatile registers, in which a minimum cut-set that has the smallest switching activity is searched for by using the min-cut max-flow theorem and non-volatile registers are inserted at the cut-set. In this study, we also consider the overhead of additional circuits for recovering and saving the state to minimize the total power of the circuit. The method has been implemented and applied to the ISCAS 89 benchmarks. Compared with the case where non-volatile registers are inserted at the original positions, power reductions of 2.6% to 15.1% (8.34% on average) have been found.

    CiNii

  • Energy Consumption Evaluation for Two-Level Cache with Non-Volatile Memory Targeting Mobile Processors

    Shota Matsuno, Masashi Tawada, Masao Yanagisawa, Shinji Kimura, Tadahiko Sugibayashi, Nozomu Togawa

    IEEK Transactions on Smart Processing and Computing   Vol. 2 ( No. 4 ) 226 - 239  2013.08

  • Low Power Memory Based Design Method of Constant Multipliers for Digital Filters

    KABASAWA Kosuke, SUGIBAYASHI Tadahiko, YANAGISAWA Masao, KIMURA Shinji

    Technical report of IEICE. VLD   113 ( 119 ) 101 - 106  2013.07

    Digital signal processing of sounds and images uses many digital filters, which compute the sum of products between a sequence of constants and a time sequence of inputs. In this manuscript, a memory-based design method for such constant multiplication is described. In the design, the trade-off between the size of the memory and that of the logic is considered, and speed and power consumption are optimized. The read power of a memory is independent of the value read from it, and a memory can encapsulate the toggles that logic gates would incur in gate-based designs. By separating an input into several parts and implementing the resulting small multipliers with memories, the memory size can be reduced drastically. The proposed constant multiplier has been implemented as an ASIC and shows a power reduction compared with a gate-level design.
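
    The effect of splitting the input can be seen in a small software model. The sketch below assumes an unsigned 16-bit input, two 8-bit slices, and an arbitrary constant of 93; these values and the function names are illustrative and are not taken from the report. Each slice indexes a lookup table holding `constant * i`, and the partial products are shifted and added.

```python
def make_lut(constant, slice_bits):
    """Lookup table with one entry per slice value: entry i holds constant * i."""
    return [constant * i for i in range(1 << slice_bits)]


def lut_multiply(x, lut, slice_bits, input_bits):
    """Multiply unsigned x by the constant baked into `lut`,
    one slice of `slice_bits` bits at a time."""
    mask = (1 << slice_bits) - 1
    result = 0
    for shift in range(0, input_bits, slice_bits):
        part = (x >> shift) & mask
        result += lut[part] << shift  # partial product, aligned to the slice position
    return result


CONSTANT = 93        # hypothetical filter coefficient
INPUT_BITS = 16
SLICE_BITS = 8

lut = make_lut(CONSTANT, SLICE_BITS)
product = lut_multiply(1234, lut, SLICE_BITS, INPUT_BITS)

# Memory entries needed: one monolithic table versus one small table per slice.
full_entries = 1 << INPUT_BITS
split_entries = (INPUT_BITS // SLICE_BITS) * (1 << SLICE_BITS)
print(product, full_entries, split_entries)
```

    With these parameters the table shrinks from 65,536 entries to 512 entries, at the cost of one shift-and-add to combine the two lookups.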

    CiNii

  • A non-volatile memory writing reduction method based on state encoding limiting maximum Hamming distance

    TAWADA Masashi, KIMURA Shinji, YANAGISAWA Masao, TOGAWA Nozomu

    Technical report of IEICE. VLD   113 ( 119 ) 95 - 100  2013.07

    Non-volatile memory has many advantages over SRAM, such as high density, low leakage power, and non-volatility. However, one of its largest problems is that it consumes a large amount of energy in writing. It is therefore necessary to reduce the number of written bits and thus decrease its writing energy. In this paper, we propose a memory writing reduction method based on state encoding that limits the maximum Hamming distance. When data is written into a memory, we do not write it directly but encode it into a codeword, and then write the codeword into the memory. The encoding limits the maximum Hamming distance of each codeword from any other codeword. If the maximum Hamming distance is limited among all the codewords, the number of flipped bits is also limited, and the number of written bits is thereby reduced. We show several experimental evaluations and discuss the effectiveness of our proposed algorithm.

    CiNii

  • Evaluation of energy consumption for two-level cache using Non-Volatile Memory for IL1 and UL2 caches

    MATSUNO Shota, TAWADA Masashi, YANAGISAWA Masao, KIMURA Shinji, TOGAWA Nozomu, SUGIBAYASHI Tadahiko

    Technical report of IEICE. VLD   113 ( 119 ) 89 - 94  2013.07

    Non-volatile memory has advantages such as low leakage energy and non-volatility compared with SRAM or DRAM, which have high leakage energy. Non-volatile memory is strongly expected to be used for realizing normally-off systems. Non-volatile memory, however, consumes more energy to write than SRAM or DRAM. In this paper, we evaluate the energy consumption of a cache memory in an embedded processor with non-volatile memories. In our evaluation, we assume that their write energy is 1.0x to 10.0x higher than that of SRAM. Experimental evaluations demonstrate that using non-volatile memories in a cache is the better choice in some cases, even when the write energy of non-volatile memories is 10.0x higher than that of SRAM.

    CiNii

  • Write Control Method for Nonvolatile Flip-Flops Based on State Transition Analysis

    Naoya Okada, Yuichi Nakamura, Shinji Kimura

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E96A ( 6 ) 1264 - 1272  2013.06  [Refereed]

     View Summary

    A nonvolatile flip-flop enables leakage power reduction in logic circuits and quick return from standby mode. However, it has limited write endurance, and its power consumption for writing is larger than that of a conventional D flip-flop (DFF). For this reason, it is important to reduce the number of write operations. The write operations can be reduced by stopping the clock signal to synchronous flip-flops because write operations are executed only when the clock is applied to the flip-flops. In such clock gating, a method using the Exclusive OR (XOR) of the current value and the new value as the control signal is well known. The XOR based method is effective, but there are several cases where the write operations can be reduced even if the current value and the new value are different. The paper proposes a method to detect such unnecessary write operations based on state transition analysis, and proposes a write control method to save power consumption of nonvolatile flip-flops. In the method, redundant bits are detected to reduce the number of write operations. If the next state and the outputs do not depend on some current bit, the bit is redundant and need not be written. The method is based on Binary Decision Diagram (BDD) calculation. We construct write control circuits to stop the clock signal by converting BDDs representing a set of states where write operations are unnecessary. The proposed method can be combined with the XOR based method to reduce the total write operations. We apply the combined method to some benchmark circuits and estimate the power consumption with Synopsys NanoSim. On average, power consumption can be reduced by 15.0% compared with the XOR based method alone.

    DOI

  • A-3-7 REDUCING THE WRITING BITS TO NON-VOLATILE MEMORY BY HOLDING DATA DIFFERENCE

    Shinohara Hiroyuki, Yanagisawa Masao, Kimura Shinji

    Proceedings of the IEICE General Conference   2013  2013.03

    CiNii

  • Controlling-value-based power gating considering controllability propagation and power-off probability

    Zhe Du, Yu Jin, Shinji Kimura

    Proceedings of International Conference on ASIC    2013  [Refereed]

     View Summary

    Power gating technology is useful in reducing standby leakage current. Controlling value based power gating is a fine-grained power gating approach using the controlling value of logic elements. However, its power saving capability suffers from the steady maximum depth constraint, which prohibits the power gating assignment when the control of a gate increases the critical path length. To increase power savings, this paper proposes a power gating control extraction method based on controllability propagation and power-off probability. Multiple power domains can be clustered by a smaller depth signal with the controllability propagation. Experimental results show that a 21.4% power reduction can be obtained on average, an 8.5% improvement compared with the previous algorithm. © 2013 IEEE.

    DOI

  • Energy Evaluation for Two-level On-chip Cache with Non-Volatile Memory on Mobile Processors

    Shota Matsuno, Masashi Tawada, Masao Yanagisawa, Shinji Kimura, Nozomu Togawa, Tadahiko Sugibayashi

    2013 IEEE 10TH INTERNATIONAL CONFERENCE ON ASIC (ASICON)    2013  [Refereed]

     View Summary

    As the leakage power of traditional SRAM becomes larger, the ratio of static energy to the total energy of the memory architecture also becomes larger. Non-volatile memory (NVM) has many advantages over SRAM, such as high density, low leakage power, and non-volatility, but consumes too much write energy. In this paper, we evaluate the energy consumption of a two-level cache partly using NVM on mobile processors and confirm that it effectively reduces energy consumption.

  • An exact approach for gpc-based compressor tree synthesis

    Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

    IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences   E96-A ( 12 ) 2553 - 2560  2013

     View Summary

    Multi-operand adders that calculate the summation of more than two operands usually consist of compressor trees, which reduce the number of operands to two without any carry propagation, and carry-propagate adders for the two operands in the ASIC implementation. Compressor trees that consist of full adders and half adders cannot be implemented efficiently on LUT-based FPGAs, and carry-chains or dedicated structures have been utilized to produce multi-operand adders on FPGAs. Recent studies indicate that compressor trees can be implemented efficiently on LUTs using Generalized Parallel Counters (GPCs) as the building blocks of compressor trees. This paper addresses the problem of synthesizing compressor trees based on GPCs. Based on the observation that characteristics such as the area, power, and delay correlate roughly to the total number and the maximum level of GPCs, the target problem can be regarded as a minimization problem for the total number of GPCs and the maximum levels of the GPCs, for which an ILP-based approach is proposed. The key point of our formulation is not to model the problem based on the structures of compressor trees like the existing approach, but instead the compression process itself is used to reduce the number of variables and constraints in the ILP formulation. The experimental results demonstrate the advantage of our formulation in terms of the quality and runtime. Copyright © 2013 The Institute of Electronics, Information and Communication Engineers.

    DOI

  • On Gate Level Power Optimization of Combinational Circuits Using Pseudo Power Gating

    Yu Jin, Shinji Kimura

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E95A ( 12 ) 2191 - 2198  2012.12  [Refereed]

     View Summary

    In recent years, the demand for low-power design has remained undiminished. In this paper, a pseudo power gating (SPG) structure using a normal logic cell is proposed to extend the power gating to an ultrafine grained region at the gate level. In the proposed method, the controlling value of a logic element is used to control the switching activity of modules computing other inputs of the element. For each element, there exists a submodule controlled by an input to the element. Power reduction is maximized by controlling the order of the submodule selection. A basic algorithm and a switching activity first algorithm have been developed to optimize the power. In this application, a steady maximum depth constraint is added to prevent the depth increase caused by the insertion of the control signal. In this work, various factors affecting the power consumption of library level circuits with the SPG are determined. In such factors, the occurrence of glitches increases the power consumption and a method to reduce the occurrence of glitches is proposed by considering the parity of inverters. The proposed SPG method was evaluated through the simulation of the netlist extracted from the layout using the VDEC Rohm 0.18 μm process. Experiments on ISCAS'85 benchmarks show that the reduction in total power consumption achieved is 13% on average with a 2.5% circuit delay degradation. Finally, the effectiveness of the proposed method under different primary input statistics is considered.

    DOI

  • Write Reduction for Non-volatile Registers Using the Max-flow Min-cut

    ITOI Yudai, KIMURA Shinji

      112 ( 247 ) 101 - 106  2012.10

    CiNii

  • Automatic Multi-Stage Clock Gating Optimization Using ILP Formulation

    Xin Man, Takashi Horiyama, Shinji Kimura

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E95A ( 8 ) 1347 - 1358  2012.08  [Refereed]

     View Summary

    Clock gating is supported by commercial tools as a power optimization feature based on the guard signal described in HDL (structural method). However, the identification of control signals for gated registers is hard and designer-intensive work. Besides, since the clock gating cells also consume power, it is imperative to minimize the number of inserted clock gating cells and their switching activities for power optimization. In this paper, we propose an automatic multi-stage clock gating algorithm with ILP (Integer Linear Programming) formulation, including clock gating control candidate extraction, constraints construction and optimum control signal selection. By multi-stage clock gating, unnecessary clock pulses to clock gating cells can be avoided by other clock gating cells, so that the switching activity of clock gating cells can be reduced. We find that any multi-stage control signals are also single-stage control signals, and any combination of signals can be selected from single-stage candidates. The proposed method can be applied to 3 or more cascaded stages. The multi-stage clock gating optimization problem is formulated as constraints in LP format for the selection of cascaded clock-gating order of multi-stage candidate combinations, and a commercial ILP solver (IBM CPLEX) is applied to obtain the control signals for each register with minimum switching activity. Those signals are used to generate a gate level description with guarded registers from the original design, and commercial synthesis and layout tools are applied to obtain the circuit with multi-stage clock gating. For a set of benchmark circuits and a Low Density Parity Check (LDPC) Decoder (6.6k gates, 212 F.F.s), the proposed method is applied and actual power consumption is estimated using Synopsys NanoSim after layout. On average, 31% actual power reduction has been obtained compared with original designs with structural clock gating, and more than 10% improvement has been achieved for some circuits compared with the single-stage optimization method. CPU time for optimum multi-stage control selection is several seconds for up to 25k variables in LP format. By applying the proposed clock gating, area can also be reduced since the multiplexors controlling register inputs are eliminated.

    DOI

  • On gate level power optimization of combinational circuits using pseudo power gating

    Yu Jin, Shinji Kimura

    IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences   E95-A ( 12 ) 2191 - 2198  2012

     View Summary

    In recent years, the demand for low-power design has remained undiminished. In this paper, a pseudo power gating (SPG) structure using a normal logic cell is proposed to extend the power gating to an ultrafine grained region at the gate level. In the proposed method, the controlling value of a logic element is used to control the switching activity of modules computing other inputs of the element. For each element, there exists a submodule controlled by an input to the element. Power reduction is maximized by controlling the order of the submodule selection. A basic algorithm and a switching activity first algorithm have been developed to optimize the power. In this application, a steady maximum depth constraint is added to prevent the depth increase caused by the insertion of the control signal. In this work, various factors affecting the power consumption of library level circuits with the SPG are determined. In such factors, the occurrence of glitches increases the power consumption and a method to reduce the occurrence of glitches is proposed by considering the parity of inverters. The proposed SPG method was evaluated through the simulation of the netlist extracted from the layout using the VDEC Rohm 0.18 μm process. Experiments on ISCAS'85 benchmarks show that the reduction in total power consumption achieved is 13% on average with a 2.5% circuit delay degradation. Finally, the effectiveness of the proposed method under different primary input statistics is considered. Copyright © 2012 The Institute of Electronics, Information and Communication Engineers.

    DOI

  • Multi-Operand Adder Synthesis Targeting FPGAs

    Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E94A ( 12 ) 2579 - 2586  2011.12  [Refereed]

     View Summary

    Multi-operand adders, which calculate the summation of more than two operands, usually consist of compressor trees, which reduce the number of operands to two without any carry propagation, and a carry-propagate adder for the two operands in ASIC implementation. The former part is usually realized using full adders or (3;2) counters like Wallace-trees in ASIC, while adder trees or dedicated hardware are used in FPGA. In this paper, an approach to realize compression trees on FPGAs is proposed. In the case of an FPGA with m-input LUTs, any counter with up to m inputs can be realized with one LUT per output. Our approach utilizes generalized parallel counters (GPCs) with up to m inputs and synthesizes high-performance compressor trees by setting some intermediate height limits in the compression process like Dadda's multipliers. Experimental results show that the number of GPCs is reduced by up to 22% compared to the existing heuristic. Its effectiveness in reducing delay is also shown against existing approaches on Altera's Stratix III.

    DOI

  • Multi-Stage Power Gating Based on Controlling Values of Logic Gates

    Yu Jin, Shinji Kimura

    Proc. IEEE International Symposium on ASIC (ASICON)     87 - 90  2011.10

  • Low Power LSI Design Methods Based on Gating Technology

    Shinji Kimura

    Keynote Speech of IEEE International Conference on ASIC (ASICON)    2011.10

  • High-parallel LDPC decoder with power gating design

    Ying Cui, Xiao Peng, Yu Jin, Peilin Liu, Shinji Kimura, Satoshi Goto

    Proceedings of International Conference on ASIC     21 - 24  2011  [Refereed]

     View Summary

    Leakage power is becoming comparable to dynamic power dissipation as a result of technology trends, and thus it has become an important issue in low-power circuit design. As a popular technique for standby power reduction, power gating is applied to a high-parallel LDPC decoder for the WiMAX standard. The clustered-block processing engine (CBPE) array is divided into 9 power domains, and they are switched on or off according to the different code lengths of the LDPC code defined in the WiMAX standard. As the CBPE array occupies about 70% of the decoder system, the dedicated power gating strategy is very effective in the shorter code length case since more power domains can be switched off. At the shortest code length, the power gating design brings about a 55% power reduction compared to that of the longest code length. © 2011 IEEE.

    DOI

  • Power and delay aware synthesis of multi-operand adders targeting LUT-based FPGAs

    Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

    Proceedings of the International Symposium on Low Power Electronics and Design     217 - 222  2011

     View Summary

    Recent research has indicated that multi-operand addition on FPGAs can be efficiently realized, as in ASIC, with an architecture consisting of a compressor tree, which reduces the number of operands, and a carry-propagate adder, by utilizing generalized parallel counters (GPCs). This paper addresses power and delay aware synthesis of GPC-based compressor trees. Based on the observation that dynamic power would correlate to the number of GPCs and the levels of GPCs, our approach aims to minimize the maximum levels and the total number of GPCs, and an ILP-based algorithm and heuristic approaches are proposed. Several experiments targeting the Altera Stratix III architecture show that the proposed approach reduced the delay by up to 20% under a slight increase in total power dissipation. © 2011 IEEE.

    DOI

  • Comparison of Optimized Multi-Stage Clock Gating with Structural Gating Approach

    Xin Man, Shinji Kimura

    2011 IEEE REGION 10 CONFERENCE TENCON 2011     651 - 656  2011  [Refereed]

     View Summary

    Clock gating is a power-efficient technique that switches off unnecessary clock signals to the registers. The condition under which the registers can be safely gated is checked using the EXOR of the current and the next state values. Because of the extra power consumed by clock gating logic consisting of a latch and an AND gate, we have proposed an optimum sharing method of gating controls based on BDD (Binary Decision Diagram) with single-stage clock gating for power optimization. In this paper, we enhance the optimization method to include multi-stage clock gating and compare it with the structural gating approach. By multi-stage clock gating, the activities of both registers and clock gating logic can be reduced. On a set of interface circuits, we have obtained power reduction of 14.1% on average compared with the single-stage structural method and of 10.8% compared with the multi-stage structural gating approach. Our BDD based method is also fast and scalable owing to candidate pruning.

  • Power Optimization of Sequential Circuits Using Switching Activity Based Clock Gating

    Xin Man, Takashi Horiyama, Shinji Kimura

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E93A ( 12 ) 2472 - 2480  2010.12  [Refereed]

     View Summary

    Clock gating is the insertion of control signals for registers to switch off unnecessary clock signals selectively without violating the functional correctness of the original design, so as to reduce the dynamic power consumption. Commercial EDA tools usually have a mechanism to generate clock gating logic based on the structural method, where the control signals specified by designers are used and the effectiveness of the clock gating depends on the specified control signals. In this research, we focus on automatic clock gating logic generation and propose a method based on candidate extraction and control signal selection. We formalize the control signal selection using linear formulae and devise an optimization method based on BDD. The method is effective for circuits with a lot of candidates shared by different registers. The method is applied to counter circuits, to check the correlation with power simulation results, and to a set of benchmark circuits. A 19.1-71.9% power reduction has been found on counter circuits after layout, and a 2.3-18.0% cost reduction on benchmark circuits.

    DOI

  • Acceleration of a SAT Based Solver for Minimum Cost Satisfiability Problems Using Optimized Boolean Constraint Propagation

    Xin Zhang, Peilin Liu, Shinji Kimura

    Proc. of 16th Workshop on Synthesis And System Integration of Mixed Information Technologies     365 - 370  2010.10

  • The Sizing of Sleep Transistors In Controlling Value Based Power Gating

    Lei Chen, Shinji Kimura

    Proc. of 16th Workshop on Synthesis And System Integration of Mixed Information Technologies     202 - 207  2010.10

  • Automatic Clock Gating Generation through Power-optimal Control Signal Selection

    MAN Xin, HORIYAMA Takashi, KIMURA Shinji

      2010 ( 1 ) 1 - 6  2010.05

    CiNii

  • Multi-Operand Adder Synthesis on FPGAs Using Generalized Parallel Counters

    Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

    2010 15TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE (ASP-DAC 2010)     332 - +  2010  [Refereed]

     View Summary

    Multi-operand adders usually consist of compression trees, which reduce the number of operands per bit to two, and a carry-propagate adder for the two operands in ASIC implementation. The former part is usually realized using full adders or (3;2) counters like Wallace-trees in ASIC, while adder trees or dedicated hardware are used in FPGA. In this paper, an approach to realize compression trees on FPGAs is proposed. In the case of an FPGA with m-input LUTs, any counter with up to m inputs can be realized with one LUT per output. Our approach utilizes generalized parallel counters (GPCs) with up to m inputs and synthesizes high-performance compression trees by setting some intermediate height limits in the compression process like Dadda's multipliers. Experimental results show its effectiveness against existing approaches at GPC level and on Altera's Stratix III.

  • Optimizing Controlling-Value-Based Power Gating with Gate Count and Switching Activity

    Lei Chen, Shinji Kimura

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E92A ( 12 ) 3111 - 3118  2009.12  [Refereed]

     View Summary

    In this paper, a new heuristic algorithm is proposed to optimize the power domain clustering in controlling-value-based (CV-based) power gating technology. In this algorithm, both the switching activity of sleep signals (p) and the overall number of sleep gates (gate count, N) are considered, and the sum of the product of p and N is optimized. The algorithm effectively increases the total power reduction obtained from the CV-based power gating. Even when the maximum depth is kept the same, the proposed algorithm can still achieve approximately 10% more power reduction than the prior algorithms. Furthermore, detailed comparisons between the proposed heuristic algorithm and other possible heuristic algorithms are also presented. HSPICE simulation results show that over 26% total power reduction can be obtained by using the new heuristic algorithm. In addition, the effect of dynamic power reduction through the CV-based power gating method and the delay overhead caused by the switching of sleep transistors are also shown in this paper.

    DOI

  • Framework for Parallel Prefix Adder Synthesis Considering Switching Activities

    Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

    IPSJ Trans. SLDM     212 - 221  2009.08

  • Finite Input-Memory Automaton Based Checker Synthesis of SystemVerilog Assertions for FPGA Prototyping

    Chengjie Zang, Shinji Kimura

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E92A ( 6 ) 1454 - 1463  2009.06  [Refereed]

     View Summary

    Checker synthesis for assertion based verification has become popular because of the recent progress in FPGA prototyping environments. In this paper, we propose a checker synthesis method based on the finite input-memory automaton, suitable for embedded RAM modules in FPGAs. There are more than 1 Mbit of memory in medium size FPGAs, and such embedded memory cells can be used as shift registers. The main idea is to construct a checker circuit using finite input-memory automata and to implement the shift register chain with logic elements or embedded RAM modules. When using a RAM module, the method does not consume any logic element for storing the value. Note that the shift register chain of the input memory can be shared among different assertions, so we can reduce the hardware resources significantly. We have checked the effectiveness of the proposed method using several assertions.

    DOI

  • Automatic pipeline generation for fpga-based prototyping

    W. Xing, K. Zheng, T. Kimura, S. Kuromaru, K. Kai, S. Kimura

    Proc. 15th Workshop on Synthesis And System Integration of Mixed Information technologies     155 - 160  2009.03

  • Assertion checker synthesis for FPGA emulation

    C. Zang, Q. Wei, S. Kimura

    Proc. 15th Workshop on Synthesis And System Integration of Mixed Information technologies     149 - 154  2009.03

  • Fine-Grained Power Gating Based on the Controlling Value of Logic Elements

    Lei Chen, Takashi Horiyama, Yuichi Nakamura, Shinji Kimura

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E91A ( 12 ) 3531 - 3538  2008.12  [Refereed]

     View Summary

    Leakage power consumption of logic elements has become a serious problem, especially in the sub-100-nanometer process. In this paper, a novel power gating approach using the controlling value of logic elements is proposed. In the proposed method, sleep signals of the power-gated blocks are extracted completely from the original circuits without any extra logic element. A basic algorithm and a probability-based heuristic algorithm have been developed to implement the basic idea. The steady maximum delay constraint has also been introduced to handle the delay issues. Experiments on the ISCAS'85 benchmarks show that on average 15-36% of logic elements could be power gated at a time for random input patterns, and 3-31% of elements could be stopped under the steady maximum delay constraints. We also show a power optimization method for AND/OR tree circuits, in which more than 80% of gates can be power-gated.

    DOI

  • Efficient Hybrid Grid Synthesis Method Based on Genetic Algorithm for Power/Ground Network Optimization with Dynamic Signal Consideration

    Yun Yang, Shinji Kimura

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E91A ( 12 ) 3431 - 3442  2008.12  [Refereed]

     View Summary

    This paper proposes an efficient design algorithm for power/ground (P/G) network synthesis with dynamic signal consideration, which is mainly caused by Ldi/dt noise and Cdv/dt decoupling capacitance (DECAP) current in the distribution network. To deal with the nonlinear global optimization under synthesis constraints directly, the genetic algorithm (GA) is introduced. The proposed GA-based synthesis method can avoid the linear transformation loss and the restraint condition complexity of current SLP, SQP, ICG, and random-walk methods. In the proposed Hybrid Grid Synthesis algorithm, the dynamic signal is simulated in the gene disturbance process, and the Trapezoidal Modified Euler (TME) method is introduced to realize the precise dynamic time step process. We also use a hybrid-SLP method to reduce the genetic execution time and increase the network synthesis efficiency. Experimental results on a given power distribution network show reductions in layout area and execution time compared with current P/G network synthesis methods.

    DOI

  • FPGA prototyping of a simultaneous multithreading processor

    C. Zang, S. Imai, S. Kimura

    Proc. 21st Workshop on Circuits and Systems in Karuizawa     219 - 224  2008.04

  • The Optimal Architecture Design of Two-Dimensional Matrix Multiplication

    Y. Yang, S. Kimura

    IEICE Trans. Fundamentals   E91-A ( 4 ) 1101 - 1111  2008.04

  • Issue mechanism for embedded Simultaneous Multithreading processor

    Chengjie Zang, Shigeki Imai, Steven Frank, Shinji Kimura

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E91A ( 4 ) 1092 - 1100  2008.04  [Refereed]

     View Summary

    Simultaneous Multithreading (SMT) technology enhances instruction throughput by issuing multiple instructions from multiple threads within one clock cycle. With an in-order pipeline for each thread, SMT processors can issue a number of instructions close to or surpassing that of an out-of-order pipeline. In this work, we show an efficient issue logic for predicated instruction sequences with a parallel flag in each instruction, where predicate register based issue control is adopted and the continuous instructions with the parallel flag of V are executed in parallel. The flag is pre-defined by a compiler. Instructions from different threads are issued in round-robin order. We also introduce an Instruction Queue skip mechanism that skips a thread if its queue is empty. Using this kind of issue logic, we designed a 6-thread, 7-stage, in-order pipeline processor. Based on this processor, we compare the round-robin issue policy (RR(T-1-T-n)) with other policies: thread one always has the highest priority (PR(T-1)), and thread one or thread n has the highest priority in turn (PR(T-1-T-n)). The results show that the RR(T-1-T-n) policy outperforms the others and that PR(T-1-T-n) is almost the same as RR(T-1-T-n) from the point of view of the issued instructions per cycle.

    DOI

  • Synthesis of Parallel Prefix Adders Considering Switching Activities

    Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

    2008 IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN     404 - +  2008  [Refereed]

     View Summary

    This paper addresses parallel prefix adder synthesis which targets minimization of the total switching activities under bitwise timing constraints. This problem is treated as synthesis of prefix graphs which represent global structures of parallel prefix adders at the technology-independent level. An approach for timing-driven area minimization has been proposed which first finds the exact minimum solution on a specific subset of prefix graphs by dynamic programming, then restructures the result for further reduction by removing the restriction on the subset. This approach can be applied to switching cost minimization almost directly, though it is not as effective as for area minimization in some cases. In this paper, a heuristic is proposed which estimates the effect of the restructuring phase and improves cost calculation for some specific cases. Through various kinds of experiments, conditions where this approach can be executed effectively are also discussed.

  • Resynthesis Method for Circuit Acceleration on LUT-based FPGA

    Weijie Xing, Takashi Horiyama, Shunichi Kuromaru, Tomoo Kimura, Shinji Kimura

    Proceedings of 14th Workshop on Synthesis And System Integration of Mixed Information technologies     375 - 380  2007.10

  • Active Mode Leakage Power Reduction Based on the Controlling Value of Logic Gates

    Lei Chen, Shinji Kimura

    Proceedings of 14th Workshop on Synthesis And System Integration of Mixed Information technologies     266 - 271  2007.10

  • Power-Conscious Synthesis of Parallel Prefix Adders under Bitwise Timing Constraints

    Taeko Matsunaga, Shinji Kimura, Yusuke Matsunaga

    Proceedings of 14th Workshop on Synthesis And System Integration of Mixed Information technologies     7 - 14  2007.10

  • Optimal planar jumping systolic array design for matrix multiplication

    Yun Yang, Shinji Kimura

    Proceedings of 20th Workshop on Circuits and Systems in Karuizawa     343 - 348  2007.04

  • Issue Mechanism for Embedded Simultaneous Multithreading Processor

    Chengjie Zang, Shigeki Imai, Shinji Kimura

    Proceedings of 20th Workshop on Circuits and Systems in Karuizawa     325 - 330  2007.04

  • Coverage estimation using transition perturbation for symbolic model checking in hardware verification

    Xingwen Xu, Shinji Kimura, Kazunari Horikawa, Takehiko Tsuchiya

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E89A ( 12 ) 3451 - 3457  2006.12  [Refereed]

     View Summary

    Lack of complete formal specification is one of the major obstacles to the deployment of model checking. Coverage estimation addresses this issue by revealing the unverified part of the design according to the specified properties. In this paper we propose a new transition-based coverage metric to evaluate the completeness of properties for symbolic model checking. Our coverage metric pinpoints the transitions through which the values of signals are checked. An efficient symbolic algorithm is presented for computing the transition coverage for a subset of ACTL. Our coverage estimator has been applied to the model checking of a cache coherence protocol. We uncovered several coverage holes including one that eventually led to the discovery of a design bug.

    DOI

  • Bit-length optimization method for high-level synthesis based on non-linear programming technique

    Nobuhiro Doi, Takashi Horiyama, Masaki Nakanishi, Shinji Kimura

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E89A ( 12 ) 3427 - 3434  2006.12  [Refereed]

     View Summary

    High-level synthesis is a novel method to generate an RT-level hardware description automatically from a high-level language such as C, and is used in recent digital circuit design. Floating-point to fixed-point conversion with bit-length optimization is one of the key issues for area and speed optimization in high-level synthesis. However, the conversion task is rather tedious work for designers. This paper introduces an automatic bit-length optimization method for floating-point to fixed-point conversion in high-level synthesis. The method estimates computational errors statistically, and formalizes the optimization problem as a non-linear programming (NLP) problem. The application of the NLP technique improves the balance between computational accuracy and total hardware cost. Various constraints, such as unit sharing and the maximum bit-length of function units, can also be modeled easily. Experimental results show that our method is fast compared with a typical one, and reduces the hardware area.

    DOI

  • An Efficient Instruction Issue Mechanism for Simultaneous Multithreading Microprocessor

    Taeseok Jeong, Chengjie Zang, Shinji Kimura

    Proc. International SoC Design Conference (ISOCC2006)     533 - 536  2006.10

  • Performance and Energy Efficient Data Cache Architecture for Embedded Simultaneous Multithreading Microprocessor

    Chengjie Zang, Shigeki Imai, Shinji Kimura

    International SoC Design Conference (ISOCC2006)     351 - 354  2006.10

  • Performance and Energy Efficient Data Cache Architecture for Embedded Simultaneous Multithreading Microprocessor

    Chengjie Zang, Shigeki Imai, Shinji Kimura

    Proceedings of 13th Workshop on Synthesis And System Integration of Mixed Information technologies (SASIMI2006)     268 - 273  2006.04

  • Selective low-care coding: A means for test data compression in circuits with multiple scan chains

    Youhua Shi, Nozomu Togawa, Shinji Kimura, Masao Yanagisawa, Tatsuo Ohtsuki

    IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences   E89-A ( 4 ) 996 - 1003  2006  [Refereed]

     View Summary

    This paper presents a test input data compression technique, Selective Low-Care Coding (SLC), which can be used to significantly reduce input test data volume as well as the external test channel requirement for multiscan-based designs. In the proposed SLC scheme, we explore the linear dependencies of the internal scan chains, and instead of encoding all the specified bits in test cubes, only a smaller number of specified bits are selected for encoding, so greater compression can be expected. Experiments on the larger benchmark circuits show that a drastic reduction in test data volume, with corresponding savings in test application time, can indeed be achieved even for a well-compacted test set. Copyright © 2006 The Institute of Electronics, Information and Communication Engineers.

    DOI

  • FCSCAN: An efficient multiscan-based test compression technique for test cost reduction

    Youhua Shi, Nozomu Togawa, Shinji Kimura, Masao Yanagisawa, Tatsuo Ohtsuki

    ASP-DAC 2006: 11TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE, PROCEEDINGS     653 - 658  2006  [Refereed]

     View Summary

    This paper proposes a new multiscan-based test input data compression technique by employing a Fan-out Compression Scan Architecture (FCSCAN) for test cost reduction. The basic idea of FCSCAN is to target the minority specified bits (either 1 or 0) in scan slices for compression. Due to the low specified bit density in the test cube set, FCSCAN can significantly reduce input test data volume and the number of required test channels so as to reduce test cost. The FCSCAN technique is easy to implement with small hardware overhead and does not need any special ATPG for test generation. In addition, based on the theoretical compression efficiency analysis, improved procedures are also proposed for FCSCAN to achieve further compression. Experimental results on both benchmark circuits and one real industrial design indicate that a drastic reduction in test cost can indeed be achieved.

  • Transition-based coverage estimation for symbolic model checking

    Xingwen Xu, Shinji Kimura, Kazunari Horikawa, Takehiko Tsuchiya

    ASP-DAC 2006: 11TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE, PROCEEDINGS     1 - 6  2006  [Refereed]

     View Summary

    Lack of complete formal specification is one of the major obstacles for the deployment of model checking. Coverage estimation addresses this issue by revealing the unverified part of the design according to the specified properties. In this paper we propose a new transition-based coverage metric to evaluate the completeness of properties for symbolic model checking. It is more comprehensive and accurate than the existing coverage metrics for model checking. An efficient symbolic algorithm is presented for computing the transition coverage for a subset of ACTL. Our coverage estimator has been applied to the model checking of a cache coherence protocol. We uncovered several coverage holes including one that eventually led to the discovery of a design bug.

  • Functional State Coverage Estimation for CTL Model Checking

    Xingwen Xu, Shinji Kimura, Kazunari Horikawa, Takehiko Tsuchiya

    Proceeding of the 20th International Technical Conference on Circuits/Systems, Computers and Communications(ITC-CSCC2005)     1 - 2  2005.07

  • Low power test compression technique for designs with multiple scan chains

    Youhua Shi, Nozomu Togawa, Shinji Kimura, Masao Yanagisawa, Tatsuo Ohtsuki

    Proceedings of the Asian Test Symposium   2005   386 - 389  2005  [Refereed]

     View Summary

    This paper presents a new DFT technique that can significantly reduce test data volume as well as scan-in power consumption for multiscan-based designs. It can also help to reduce test time and tester channel requirements with small hardware overhead. In the proposed approach, we start with a pre-computed test cube set and fill the don't-cares with proper values for joint reduction of test data volume and scan power consumption. In addition, we explore the linear dependencies of the scan chains to construct a fanout structure using only inverters to achieve further compression. Experimental results for the larger ISCAS'89 benchmarks show the efficiency of the proposed technique. © 2005 IEEE.

    DOI

  • Special section on VLSI design and CAD algorithms

    Shinji Kimura

    IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences   E88-A ( 12 ) 3273  2005  [Refereed]

    DOI

  • Extended abstract: Transition traversal coverage estimation for symbolic model checking

    XW Xu, S Kimura, K Horikawa, T Tsuchiya

    THIRD ACM & IEEE INTERNATIONAL CONFERENCE ON FORMAL METHODS AND MODELS FOR CO-DESIGN, PROCEEDINGS     259 - 260  2005  [Refereed]

  • Duplicated register file design for embedded simultaneous multithreading microprocessor

    C Zang, S Imai, S Kimura

    2005 6th International Conference on ASIC Proceedings, Books 1 and 2     160 - 163  2005  [Refereed]

     View Summary

    In modern microprocessors, the access time of the register file becomes a critical part of the total delay. Instruction-level or thread-level parallelism improves Instructions Per Cycle (IPC) by executing multiple instructions in one cycle. Such multiple instructions need to read or write data from/to register files simultaneously. To satisfy that, a register file with sufficient ports should be designed. However, the area and access time of a register file with a large number of ports will increase sharply. Duplicated Register File (DupRF) architecture can reduce access time by distributing read ports. In this paper, we propose a new kind of DupRF architecture for embedded Simultaneous Multithreading (SMT) microprocessors and estimate the effect with respect to the area and access time. In particular, we measure the product of area and access time as computation cost. For an SMT microprocessor with 6 threads, 64-bit data width and 6 function units, a 3-duplicate register file architecture can reduce access time by 12.61% with a slight increase in computation cost of 3.35% compared with the central register file architecture.

  • Transition traversal coverage estimation for symbolic model checking

    XW Xu, S Kimura, K Horikawa, T Tsuchiya

    2005 6TH INTERNATIONAL CONFERENCE ON ASIC PROCEEDINGS, BOOKS 1 AND 2     850 - 853  2005  [Refereed]

     View Summary

    Model checking can exhaustively verify a set of specified properties on a given implementation. However, it is very hard to determine whether sufficient properties have been specified or not. In this paper, we propose a transition traversal coverage method for a subset of CTL to evaluate the completeness of properties. With this method, we can detect the transitions which are not verified by any property. It is more comprehensive and accurate than state-based coverage metric. We avoid generating the perturbed implementation by directly traversing transitions based on the semantics of CTL formulas. Experimental results show that the proposed method can discover subtle coverage holes with low computation cost.

  • A selective scan chain reconfiguration through run-length coding for test data compression and scan power reduction

    Y Shi, S Kimura, M Yanagisawa, T Ohtsuki

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E87A ( 12 ) 3208 - 3215  2004.12  [Refereed]

     View Summary

    Test data volume and power consumption for scan-based designs are two major concerns in system-on-a-chip testing. However, test set compaction by filling the don't-cares will invariably increase the scan-in power dissipation for scan testing, so the goals of test data reduction and low-power scan testing appear to conflict. Therefore, in this paper we present a selective scan chain reconfiguration method for test data compression and scan-in power reduction. The proposed method analyzes the compatibility of the internal scan cells for a given test set and then divides the scan cells into compatible classes. After the scan chain reconfiguration, a dictionary is built to indicate the run-length of each compatible class, and only the scan-in data for each class needs to be transferred from the ATE to the CUT so as to reduce test data volume. Experimental results for the larger ISCAS'89 benchmarks show that the proposed approach overcomes the limitations of traditional run-length coding techniques, and leads to highly reduced test data volume with significant power savings during scan testing in all cases.

  • A hybrid dictionary test data compression for multiscan-based designs

    Y Shi, S Kimura, M Yanagisawa, T Ohtsuki

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E87A ( 12 ) 3193 - 3199  2004.12  [Refereed]

     View Summary

    In this paper, we present a test data compression technique to reduce test data volume for multiscan-based designs. In our method, the internal scan chains are divided into equal-sized groups and two dictionaries are built to encode either an entire slice or a subset of the slice. Depending on the codeword, the decompressor may load all scan chains or may load only a group of the scan chains, which can enhance the effectiveness of dictionary-based compression. In contrast to previous dictionary coding techniques, even for a CUT with a large number of scan chains, the proposed approach can achieve a satisfactory reduction in test data volume with a reasonably small dictionary. Experimental results show that the proposed test scheme works particularly well for the large ISCAS'89 benchmarks.

  • Efficient Hardware Architecture of a New Simple Public-Key Cryptosystem for Real-Time Data Processing

    C. Jin, N. Doi, H. Tanaka, S. Imai, S. Kimura

    Proc. of Workshop on Synthesis and System Integration of Mixed Technologies (SASIMI'2004)     107 - 112  2004.10

  • An Optimization Method in Floating-point to Fixed-point Conversion using Positive and Negative Error Analysis and Sharing of Operations

    N. Doi, T. Horiyama, M. Nakanishi, S. Kimura

    Proc. of Workshop on Synthesis and System Integration of Mixed Technologies (SASIMI'2004)     466 - 471  2004.10

  • Reconfigurable Architecture for Bit-Level Data Processing

    S. Kimura

    Invited Talk of The 1st Silicon-Seabelt Workshop on VLSI Designs in National Taiwan University    2004.04

  • Alternative run-length coding through scan chain reconfiguration for joint minimization of test data volume and power consumption in scan test

    Youhua Shi, Shinji Kimura, Nozomu Togawa, Masao Yanagisawa, Tatsuo Ohtsuki

    Proceedings of the Asian Test Symposium     432 - 437  2004  [Refereed]

     View Summary

    Test data volume and scan power are two major concerns in SoC test. In this paper we present an alternative run-length coding method through scan chain reconfiguration to reduce both test data volume and scan-in power consumption. The proposed method analyzes the compatibility of the internal scan cells for a given test set and then divides the scan cells into compatible classes. To extract the compatible scan cells we apply a heuristic algorithm by solving the graph coloring problem, and then a simple greedy algorithm is used to configure the scan chain for the minimization of scan power. Experimental results for the larger ISCAS'89 benchmarks show that the proposed approach leads to highly reduced test data volume with significant power savings during scan test.

    DOI

  • Minimization of fractional wordlength on fixed-point conversion for high-level synthesis

    N Doi, T Horiyama, M Nakanishi, S Kimura

    ASP-DAC 2004: PROCEEDINGS OF THE ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE     80 - 85  2004  [Refereed]

     View Summary

    In hardware synthesis from a high-level language such as C, the bit length of variables is one of the key issues in area and speed optimization. Usually, designers are required to specify the word length of each variable manually, and verify the correctness by simulation on huge data. In this paper, we propose an optimization method for the fractional word length of floating-point variables in the floating-point to fixed-point conversion of variables. The amount of round-off error is formulated with parameters and propagated via data flow graphs. Non-linear programming is used to solve the fractional wordlength minimization problem. The method does not require simulation on huge data, and is very fast compared to simulation-based methods. We have shown the effect on several programs.

  • Reducing test data volume for multiscan-based designs through single/sequence mixed encoding

    Y Shi, S Kimura, N Togawa, M Yanagisawa, T Ohtsuki

    2004 47TH MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOL II, CONFERENCE PROCEEDINGS     445 - 448  2004  [Refereed]

     View Summary

    This paper presents a new test data compression technique for multiscan-based designs through dictionary-based encoding on single scan-inputs or sequences of scan-inputs. In spite of its simplicity, it achieves significant reduction in test data volume. Unlike some previous approaches to test data compression, our approach eliminates the need for additional synchronization and handshaking between the CUT and the ATE, so it is especially suitable for integration in a low-cost test scheme for SoC test. In addition, in contrast to previous dictionary-based coding techniques, even for a CUT with a small number of scan chains, the proposed approach can achieve a satisfactory reduction in test data volume. Experimental results show that the proposed test scheme works particularly well for the large ISCAS'89 benchmarks.

  • A built-in reseeding technique for LFSR-based test pattern generation

    Y Shi, Z Zhang, S Kimura, M Yanagisawa, T Ohtsuki

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E86A ( 12 ) 3056 - 3062  2003.12  [Refereed]

     View Summary

    Reseeding techniques have been proposed to improve the fault coverage in pseudo-random testing. However, most previous work on reseeding is based on storing the seeds in an external tester or in a ROM. In this paper we present a built-in reseeding technique for LFSR-based test pattern generation. The proposed structure can run both in pseudorandom mode and in reseeding mode. Moreover, our method requires no storage for the seeds since in reseeding mode the seeds can be generated automatically in hardware. In this paper we also propose an efficient grouping algorithm based on simulated annealing to optimize test vector grouping. Experimental results for benchmark circuits indicate the superiority of our technique over other reseeding methods with respect to test length and area overhead. Furthermore, since the theoretical properties of LFSRs are preserved, our method could be beneficially used in conjunction with any other techniques proposed so far.
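
    As background for the abstract above, the pseudorandom mode of an LFSR-based pattern generator can be sketched as follows. This shows only a plain Fibonacci LFSR started from a given seed; it does not model the built-in seed generation or the grouping algorithm of the paper. The function name `lfsr_patterns`, the 4-bit width, and the tap positions are illustrative assumptions.

```python
def lfsr_patterns(seed, count, width=4, taps=(0, 1)):
    """Return `count` successive states of a right-shifting Fibonacci LFSR.

    The feedback bit is the XOR of the state bits at the `taps` positions
    and is shifted in at the most significant position.
    Assumption: width=4 with taps (0, 1) is a small toy configuration.
    """
    mask = (1 << width) - 1
    state = seed & mask
    patterns = []
    for _ in range(count):
        patterns.append(state)
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = ((state >> 1) | (feedback << (width - 1))) & mask
    return patterns


# Starting from a non-zero seed, this toy LFSR visits all 15 non-zero
# 4-bit states before it repeats.
one_period = lfsr_patterns(seed=0b0001, count=15)

# "Reseeding" here simply means restarting the same register from a
# different seed, which enters the state sequence at a different point.
after_reseed = lfsr_patterns(seed=0b1011, count=3)
```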

  • Bit Length Optimization of Fractional Part on Floating to Fixed Point Conversion for High Level Synthesis

    N. Doi, T. Horiyama, M. Nakanishi, S. Kimura, K. Watanabe

    IEICE Trans. Fundamentals   Vol. E86-A ( No. 12 ) 3176 - 3183  2003.12

  • Bit Length Optimization in High Level Synthesis Based on Analytical Methods (Invited Talk)

    Shinji Kimura, Nobuhiro Doi

    System on Chip Design Automation Conference 2003 at Korea    2003.11

  • Bit Length Optimization of Fractional Parts on Floating to Fixed Point Conversion for High-Level Synthesis

    Nobuhiro Doi, Takashi Horiyama, Masaki Nakanishi, Shinji Kimura, Katsumasa Watanabe

    Proc. of the Workshop on Synthesis and System Integration of Mixed Information technologies     129 - 136  2003.04

  • An on-chip high speed serial communication method based on independent ring oscillators

    S Kimura, T Hayakawa, T Horiyama, M Nakanishi, K Watanabe

    2003 IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE: DIGEST OF TECHNICAL PAPERS   46 ( 22.3 ) 390 - 391  2003  [Refereed]

  • Look up table compaction based on folding of logic functions

    S Kimura, A Ishii, T Horiyama, M Nakanishi, H Kajihara, K Watanabe

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E85A ( 12 ) 2701 - 2707  2002.12  [Refereed]

     View Summary

    The paper describes a folding method for logic functions that reduces the size of the memories that store the functions. The folding is based on the relation of fractions of logic functions. If the logic function includes 2 or 3 identical parts, then only one part needs to be kept and the other parts can be omitted. We show that the logic function of 1-bit addition can be reduced to half size using the bit-wise NOT relation and the bit-wise OR relation. The paper also introduces 3-1 LUTs with the folding mechanism. A full adder can be implemented using only one 3-1 LUT with the folding. Multi-bit AND and OR operations can be mapped to our LUTs using not an extra cascading circuit but the carry circuit for addition. We have also tested the mapping capability of 4-input functions to our 3-1 LUTs with folding and carry propagation mechanisms. We have shown the reduction of the area consumption when using our LUTs compared to the case using 4-1 LUTs on several benchmark circuits.

  • Folding of logic functions and its application to look up table compaction

    S Kimura, T Horiyama, M Nakanishi, H Kajihara

    IEEE/ACM INTERNATIONAL CONFERENCE ON CAD-02, DIGEST OF TECHNICAL PAPERS     694 - 697  2002  [Refereed]

     View Summary

    The paper describes a folding method for logic functions that reduces the size of the memories that store the functions. The folding is based on the relation of fractions of logic functions. We show that the fractions of the full adder function have the bit-wise NOT relation and the bit-wise OR relation, and that the memory size becomes half (8-bit). We propose a new 3-1 LUT with folding mechanisms which can implement a full adder with one LUT. A fast carry propagation line is introduced for multi-bit addition. The folding and fast carry propagation mechanisms are shown to be useful for implementing other multi-bit operations and general 4-input functions without extra hardware resources. The paper shows the reduction of the area consumption when using our LUTs compared to the case using 4-1 LUTs on several benchmark circuits.
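
    The bit-wise NOT and OR relations mentioned in the abstract above can be checked on the full adder truth table with a short sketch. The split on the first input and the particular pairing of fractions used for the OR relation are one possible reading, assumed here for illustration; the abstract does not spell them out, and the helper name `truth_table` is hypothetical.

```python
def truth_table(f):
    """Return the 8-entry truth table of f(a, b, c), indexed by 4*a + 2*b + c."""
    return [f(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]


# Full adder outputs as functions of the inputs a, b and carry-in c.
sum_tt = truth_table(lambda a, b, c: a ^ b ^ c)
carry_tt = truth_table(lambda a, b, c: (a & b) | (b & c) | (a & c))

# Split each table into two 4-bit fractions: a = 0 (lo) and a = 1 (hi).
sum_lo, sum_hi = sum_tt[:4], sum_tt[4:]
carry_lo, carry_hi = carry_tt[:4], carry_tt[4:]

# Bit-wise NOT relation: the upper sum fraction is the complement of the lower.
not_relation = sum_hi == [1 - bit for bit in sum_lo]

# Bit-wise OR relation (assumed pairing): the upper carry fraction equals
# the lower carry fraction ORed with the lower sum fraction.
or_relation = carry_hi == [c | s for c, s in zip(carry_lo, sum_lo)]

# Only the two lower fractions need to be stored: 4 + 4 = 8 bits
# instead of the 16 bits of the two full tables.
stored_bits = len(sum_lo) + len(carry_lo)
```

    Under this reading, both relations hold and the stored table shrinks from 16 bits to 8 bits, which is consistent with the halving stated in the abstract.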

  • A Real-Time User-Independent Eye Tracking LSI with Environment Adaptability

    K. Nakamura, M. Nakanishi, T. Horiyama, M. Suzuki, S. Kimura, K. Watanabe

    In Proc. of the 10th Workshop on Synthesis And System Integration of Mixed Technologies (SASIMI 2001)     357 - 361  2001.10

  • A New Symbolic Image Computation Algorithm Based on BDD Constrain Operator

    S. Kimura, D. Dill, S. G. Govindaraju

    In Proc. of the 10th Workshop on Synthesis And System Integration of Mixed Technologies (SASIMI 2001)     167 - 171  2001.10

  • Speech recognition chip for monosyllables

    K Nakamura, Q Zhu, S Maruoka, T Horiyama, S Kimura, K Watanabe

    PROCEEDINGS OF THE ASP-DAC 2001: ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE 2001     396 - 399  2001  [Refereed]

     View Summary

    In the paper, we present a real-time speech recognition chip for monosyllables such as A, B, etc. The chip recognizes up to 64 monosyllables based on the Hidden Markov Model (HMM), which is a well-known speaker-independent recognition method. The chip accepts a short speech frame including 256 16-bit digitized samples corresponding to an 11.6 msec period, and outputs the 6-bit symbol code of monosyllables for 16 short frames (corresponding to 185.6 msec). A learning circuit to update HMM parameters for the recognition chip has also been designed, and the recognition chip includes an interface to the learning circuit. We have fabricated the recognition chip by the VDEC Rohm 0.6 μm process on a 4.5 mm x 4.5 mm chip. We have also made a layout of the entire circuit including the learning circuit by the VDEC Rohm 0.35 μm process on a 4.9 mm x 4.9 mm chip.

  • A real-time 64-monosyllable recognition LSI with learning mechanism

    K Nakamura, Q Zhu, S Maruoka, T Horiyama, S Kimura, K Watanabe

    PROCEEDINGS OF THE ASP-DAC 2001: ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE 2001     31 - 32  2001  [Refereed]

     View Summary

    In the paper, a real-time 64-monosyllable recognition LSI is presented. The LSI accepts an 11.6 msec speech frame and outputs a 6-bit symbol code for each frame by the end of the next frame in a pipelined manner. The recognition method is based on the Hidden Markov Model and is speaker-independent. An on-chip learning mechanism has also been designed, but the circuit is off-chip in the present implementation because of the restriction on LSI area. The LSI is fabricated by VDEC Rohm with the 0.6 μm process on a 4.5 mm x 4.5 mm chip.

  • Multi-cycle path detection based on propositional satisfiability with CNF simplification using adaptive variable insertion

    K Nakamura, S Maruoka, S Kimura, K Watanabe

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E83A ( 12 ) 2600 - 2607  2000.12  [Refereed]

     View Summary

    Multi-cycle paths are paths between registers where 2 or more clock cycles are allowed for signal propagation, and the detection of multi-cycle paths is important in deciding the proper clock period, timing verification and logic optimization. This paper presents a satisfiability-based multi-cycle path detection method, where the detection problems are reduced to CNF formulae and the satisfiability is checked using SAT provers. We also show heuristics for the conversion from multi-level circuits into CNF formulae. We have applied our method to ISCAS'89 benchmarks and other sample circuits. Experimental results show remarkable improvements in the size of manipulatable circuits.

  • Robust heuristics for multi-level logic simplification considering local circuit structure

    Q Zhu, Y Matsunaga, S Kimura, K Watanabe

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E83A ( 12 ) 2520 - 2527  2000.12  [Refereed]

     View Summary

    Combinational logic circuits are usually implemented as multi-level networks of logic nodes. Multi-level logic simplification using the don't cares on each node is widely used. Large don't cares give good simplification results, but suffer from huge memory area and computation time. Extraction of useful don't cares and reduction of the size of the don't cares are important problems in simplification using don't cares. In the paper, we propose a new robust heuristic method for the selection of don't cares. We consider an adaptive subnetwork for each simplified node in the network and introduce a stepwise enhancement method of the subnetwork considering the memory area and the network structure. The don't cares extracted from the adaptive subnetworks are called the local don't cares. We have implemented our method for satisfiability don't cares and observability don't cares. We have applied the method to MCNC89 benchmarks, and compared the experimental results with those of the SIS system. The results demonstrate the superiority of our method in the quality of the results and in the size of applicable circuits.

  • Robust Heuristics for Multi-Level Logic Simplification Considering Local Circuit Structure

    Q. Zhu, Y. Matsunaga, S. Kimura, K. Watanabe

    In Proc. of the 9th Workshop on Synthesis And System Integration of Mixed Technologies (SASIMI 2000)     299 - 306  2000.04

  • An application specific Java processor with reconfigurabilities

    Shinji Kimura, Hiroyuki Kida, Kazuyoshi Takagi, Tatsumori Abematsu, Katsumasa Watanabe

    Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC     25 - 26  2000

     View Summary

    The paper presents an application specific Java processor including reconfigurabilities, which is a DLX like pipeline processor with 5 stages and executes Java byte codes directly. Reconfigurabilities are the key technologies for application specific operations. We have introduced two reconfigurabilities: (1) a mechanism to override the control signals for a specific instruction, (2) external components can be attached with the same input and output ports as the internal ALU. © 2000 IEEE.

    DOI

  • Multi-clock path analysis using propositional satisfiability

    Kazuhiro Nakamura, Shinji Maruoka, Shinji Kimura, Katsumasa Watanabe

    Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC     81 - 86  2000

     View Summary

    We present a satisfiability based multi-clock path analysis method. The method uses propositional satisfiability (SAT) in the detection of multi-clock paths. We show a method to reduce the multi-clock path detection problems to SAT problems. We also show heuristics on the conversion from multi-level circuits into CNF formulae. We have applied our method to ISCAS89 benchmarks and other sample circuits. Experimental results show the improvement on the manipulatable size of circuits by using SAT. © 2000 IEEE.

    DOI

  • Exact minimization of free BDDs and its application to pass-transistor logic optimization

    K Takagi, H Hatakeda, S Kimura, K Watanabe

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E82A ( 11 ) 2407 - 2413  1999.11  [Refereed]

     View Summary

    In several design methods for Pass-transistor Logic (PTL) circuits, Boolean functions are expressed as OBDDs in decomposed form and then the component OBDDs are directly mapped to PTL cells. The total size of OBDDs (number of nodes) corresponds to the circuit size. In this paper, we investigate a method for PTL synthesis based on exact minimization of Free BDDs (FBDDs). FBDDs are a well-studied extension of OBDDs with free variable ordering on each path. We present statistics showing that more than 56% of the 616126 NPN-equivalence classes of 5-variable Boolean functions have minimum FBDDs of smaller size than their OBDDs. This result can be used for PTL synthesis as libraries. We also applied the exact minimization algorithm of FBDDs to the minimization of subcircuits in the synthesis for MCNC benchmarks and found up to 5% size reduction.

  • Hardware synthesis from C programs with estimation of bit length of variables

    O Ogawa, K Takagi, Y Itoh, S Kimura, K Watanabe

    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES   E82A ( 11 ) 2338 - 2346  1999.11  [Refereed]

     View Summary

    In hardware synthesis methods with high-level languages such as C, the optimization quality of the compilers has a great influence on the area and speed of the synthesized circuits. Among the hardware-oriented optimization methods required in such compilers, minimization of the bit length of the data-paths is one of the most important issues. In this paper, we propose an estimation algorithm for the necessary bit length of variables for this aim. The algorithm analyzes the control/dataflow graph translated from C programs and decides the bit length of each variable. In several experiments, the bit length of variables could be reduced by half with respect to the declared length. This method is effective not only for reducing the circuit area but also for reducing the delay of operation units such as adders.
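
    The idea of deciding the necessary bit length of each variable can be illustrated with a minimal range-propagation sketch. This is an assumption-laden illustration rather than the algorithm of the paper: it assumes that the value range of each input is known, propagates ranges through additions and multiplications only, and ignores control flow. All function names are hypothetical.

```python
def bits_for_range(lo, hi):
    """Minimum bit length that can hold every integer in [lo, hi].

    Non-negative ranges are treated as unsigned; ranges that include
    negative values are treated as two's complement.
    """
    if lo >= 0:
        return max(1, hi.bit_length())
    return max((-lo - 1).bit_length(), hi.bit_length()) + 1


def add_range(a, b):
    """Value range of x + y for x in range a and y in range b."""
    return (a[0] + b[0], a[1] + b[1])


def mul_range(a, b):
    """Value range of x * y for x in range a and y in range b."""
    products = [x * y for x in a for y in b]
    return (min(products), max(products))


# Hypothetical dataflow: two inputs known to lie in [0, 15],
# although they might be declared as 32-bit ints in the C source.
x = (0, 15)
y = (0, 15)
sum_bits = bits_for_range(*add_range(x, y))      # range [0, 30]
product_bits = bits_for_range(*mul_range(x, y))  # range [0, 225]
```

    In this toy dataflow the adder output needs 5 bits and the multiplier output 8 bits, which is the kind of reduction relative to the declared length that such an analysis looks for.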

  • Multi-Level Logic Simplification using Satisfiability Don't Cares

    Q.Zhu, Y.Matsunaga, S.Kimura, K.Watanabe

    Proceedings of Asia Pacific Conference on cHip Design Languages     127 - 131  1999.10


Books and Other Publications

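  • System LSI Design Engineering (システムLSI設計工学)

    Masahiro Fujita, Seiji Kajihara, Shinji Kimura, Hiroaki Takada, Kiyoharu Hamaguchi, Hiroyuki Tomiyama

    Ohmsha  2006.10 ISBN: 4274202976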

Misc

  • High Accuracy 8×8 Approximate Multiplier based on OR Operation (VLSI Design Technologies)

    GUO Yi, SUN Heming, JIN Canran, KIMURA Shinji

    IEICE Technical Report   116 ( 478 ) 19 - 24  2017.03

    CiNii

  • MERP-CNN : A memory-efficient reconfigurable processor for convolutional neural networks based on FPGA (VLSI Design Technologies)

    HAN Xushen, ZHOU Dajiang, KIMURA Shinji

    IEICE Technical Report   116 ( 21 ) 47 - 52  2016.05

    CiNii

  • Write Reduction of Internal Registers for Non-volatile RISC Processors

    GOTO Tomoya, YANAGISAWA Masao, KIMURA Shinji

    Mathematical Systems Science and its Applications : IEICE technical report   114 ( 125 ) 213 - 218  2014.07

     View Summary

    Recently, next-generation non-volatile memories based on MTJs (Magnetic Tunnel Junctions) have attracted attention because of their sufficient endurance and fast access speed. Their access speed is comparable to that of CMOS memory devices, but their write energy is far larger. Reducing the number of write operations is therefore very important. In this study, we propose write-reduction methods that depend on the type of internal register in RISC processors. Taking the register type into account allows the control circuit to be reduced. For the register file, write operations are reduced by using "write aware flags" and "sign extension flags". For the program counter, write operations are reduced by using "XOR-based comparison" and "carry detection". The proposed method is applied to the MIPS32 processor, and the write activity has been evaluated using a simulator. The write activity is reduced by about 93.1-93.8% for the register file and by about 54.5-56.8% for the program counter.

    CiNii

  • Write Control Method Based on State Transition for Magnetic Flip-Flop

    OKADA Naoya, NAKAMURA Yuichi, KIMURA Shinji

    Technical report of IEICE. VLD   112 ( 71 ) 13 - 18  2012.05

     View Summary

    In this manuscript, we propose a write control method for the nonvolatile MFF (Magnetic Flip-Flop). The MFF enables leakage power reduction in logic circuits and quick return from standby mode. However, it consumes about 10 times as much power as a conventional DFF during a write operation, so it is desirable to eliminate redundant write operations. We focus on the state transitions of the sequential circuit to detect them. If the next state and the outputs do not depend on some current bit, that bit is redundant and need not be written. We propose a method to detect such bits. Our method can be combined with a reduction method based on the EXOR of the current value and the new value. When the combined method is applied to several benchmark circuits, up to 15.3% power reduction is achieved with an area overhead of 1.9% to 4.8% compared with the EXOR-based method alone.

    CiNii
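    The EXOR-based reduction mentioned in the summary above can be pictured with a small software model: a bit is written only where the stored value and the new value differ. This is a minimal sketch under stated assumptions, not the circuit from the report; the function name `gated_write_count`, the per-bit gating granularity, and the example trace are all hypothetical.

```python
# Hypothetical model of EXOR-based write gating for a non-volatile register.
# Assumption: gating is applied per bit; the report may gate at another granularity.

def gated_write_count(values, width=8, initial=0):
    """Return (bit writes without gating, bit writes with XOR gating)."""
    stored = initial
    naive = gated = 0
    for new in values:
        naive += width                 # without gating, every bit is rewritten
        diff = stored ^ new            # 1 exactly where a bit changes
        gated += bin(diff).count("1")  # with gating, only changed bits are written
        stored = new
    return naive, gated

trace = [0b00000001, 0b00000001, 0b00000011, 0b00000011]
print(gated_write_count(trace))  # (32, 2)
```

    The state-transition method in the summary goes further than this model, since it also suppresses writes to bits whose value changes but does not affect the next state or the outputs.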

  • A-3-10 A Control Circuit Based on Analysis of State Transition

    Okada Naoya, Nakamura Yuichi, Kimura Shinji

    Proceedings of the IEICE General Conference   2012   94 - 94  2012.03

    CiNii

  • A-3-8 Memory-based Arithmetic Circuits on FPGA and Their Power Evaluation

    Yu Xinmu, Hamaguchi Kiyoharu, Kimura Shinji

    Proceedings of the IEICE General Conference   2012   92 - 92  2012.03

    CiNii


Awards

  • Certificate of Appreciation for Editorial Activities

    2012.09  

  • Nikkei BP, LSI IP Design Award, IP Award

    2000  

  • Asian South-Pacific Design Automation Conference, University LSI Design Contest

    2000  

  • Nikkei BP, LSI IP Design Award, IP Award

    1999  

  • IPSJ (Information Processing Society of Japan) 45th National Convention, Encouragement Award

    1993.03  

Research Projects

  • Hardware-Trojan Detection for Integrated Circuit Design Data based on Machine Learning

    Project Year :

    2019.04
    -
    2022.03
     

  • Data Format Optimization and Accuracy Guarantee for Reconfigurable Accelerators

    Project Year :

    2018.04
    -
    2021.03
     

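  • Abstract LSI Model and Integrated High-Level/Low-Level LSI Design Technology for Global Ultra-Low-Energy Operation

    Grants-in-Aid for Scientific Research (Waseda University)  Grants-in-Aid for Scientific Research (Scientific Research (B))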

    Project Year :

    2013
    -
    2015
     

     View Summary

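    In FY2013, we carried out research items (I) to (III), which form the foundation of the overall research plan.
    (I) Construction of the abstract LSI model: We adopted the abstract LSI model proposed in this research and carried out trial designs of actual applications. The trial designs confirmed that energy can potentially be reduced through power-supply control, clock control, and frequency control in realistic large-scale application programs whose behavioral descriptions exceed several thousand lines.
    (II) Verification of the abstract LSI model: We formally verified the behavior of the circuits designed in (I). In particular, we verified that the abstract LSI model based on semantic connections and strong/weak physical connections is equivalent to conventional LSI design models. In addition, using the verification results, we studied a partitioning of the control circuit that preserves equivalence, and examined in (III) how to turn it into an algorithm.
    (III) Construction and verification of low-energy integrated automatic LSI design technology (Phase 1: power-supply control): After the validity of the proposed abstract LSI model was verified through (I) and (II), we built and verified an integrated automatic LSI design flow based on it. The basic idea is to carry out, as a virtual physical design, an ideal physical design as seen from the upstream design stages with the actual physical constraints relaxed, and then to minimize the "distance" between this design and the actual physical design. The distance is defined as the sum, or the sum of squares, of the differences in the positions of the functional modules. Targeting power-supply "semantic connection" modules as the semantic connections, and assuming power gating, multiple supply-voltage control, and substrate-voltage control, we built and verified a low-energy-oriented integrated automatic LSI design technology. We further implemented it on a computer and evaluated it by applying it to several application programs.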

  • Abstract LSI Model and Its Associated High-Level Synthesis Algorithm for Deep Submicron Technologies

    Project Year :

    2010
    -
    2012
     

     View Summary

    In this research, we first developed an abstract LSI model that introduces "logical connections" and "physical connections" among the registers, controllers, and functional units inside an LSI chip. The abstract LSI model provides a well-defined interface between high-level design and physical-level design. Second, we developed a high-level synthesis algorithm for the abstract LSI model that realizes physical-synthesis-aware high-level synthesis. Our simulation results demonstrate that the abstract LSI model and its associated high-level synthesis outperform several conventional LSI synthesis methods.

  • Research on Design Technology for High-Performance Processors

    Project Year :

    2002
    -
     
     

  • Research on Formal Verification Technology for Flexible IP

    Project Year :

    2002
    -
     
     

  • Research on IP-Based System LSI Design Technology

    Project Year :

    2001
    -
     
     

  • Construction Methods for Evolutionary Software That Adapts to Content

     View Summary

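    This research studies evolutionary methods for constructing "software whose processing targets have a scope of application that cannot be fixed at design and development time." In FY1997, we surveyed methods and case studies for constructing software evolutionarily and, concretely, traced the process of building a program while extending the software's functions (specifications). As one result, we set out the approach of "programming on cells" (Poc) and proceeded with building an editor for it. In addition, from the viewpoint of hardware/software codesign, we examined hardware support for meta-level functions and the like. In programming on cells, we introduced four kinds of cells: data cells, start cells, name cells, and pattern cells. We also proposed making program behavior easier to understand by stating explicitly the precondition under which each cell becomes active and the postcondition that results from its activity. Furthermore, to provide an environment that supports programming with cells, we planned the structure of the Poc editor and implemented part of it. As a practical application of Poc, we worked on a program that converts sentences describing hand and finger motions into a 3D graphics display. As a result, by preparing basic syntactic forms corresponding to sentences prepared in advance, a dictionary of words and phrases, and rules for estimating display parameters, we were able to write a program that converts the sentences into an intermediate representation. Meanwhile, to support the execution environment of evolving software from the hardware side, we examined the usefulness of the previously developed "general-purpose coprocessor with a variable logic circuit part based on FPGAs." We will continue to enrich these results and to study "evolving software" from both sides of changing hardware and software.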

  • Research on design and implementation of Ultra Large scale LSI

     View Summary

    In this research, fundamental technologies ranging from architecture, circuit, and device design to package design have been developed to implement a 100-million-gate LSI with 1/5 the development period, 1/10 the fabrication cost, and 1/10 the power consumption of conventional SoC or SiP technologies. In particular, through (1) research on large-scale system design methodologies, (2) research on large-scale design automation technologies, and (3) research on high-level verification technologies, we achieved a drastic reduction in design and fabrication cost while realizing ultra-low power and very high-bandwidth communication.

  • Studies on Computer-Aided Design of Microprocessor Controlled Precise AC Servo Systems.

     View Summary

    Robust and maintenance-free induction motors, used in every moving part of engineering systems, have seen new drive technologies based on power electronics, control schemes, and microprocessor implementation techniques. This has led to an inevitable need for a computer-aided design environment that offers higher reliability and efficiency in an increasingly complex design process. The following research has been carried out: 1. A design method for precise AC servo systems has been investigated. A theoretical basis for vector control is given in a circuit-theory context, which is suitable for both qualitative analysis and computer-aided procedural applications. 2. A computer-aided design method has been studied. General and efficient CAD methods based on sparsity, decomposition, and discretization techniques have been investigated to cope with the analysis of electronic-mechanical control systems. The proposed method has been implemented in a new CAD environment. 3. The CAD tool environment was applied to the actual design of a precise AC servo system, and the result was verified experimentally.

  • OPERATION ON SETS AND ITS APPLICATIONS TO COMPUTER AIDED DESIGN OF ROBUST CONTROL SYSTEMS

     View Summary

    Because of variation and uncertainty in system parameters, mathematical models invariably give an imperfect description of real systems. Therefore, robustness is one of the most fundamental requirements for control systems. The main results of this research project may be summarized as follows: 1. Polygon Interval Arithmetic. To treat uncertainty, we defined operations (addition, multiplication, and reversion) on the set of all convex polygons, which we call polygon interval arithmetic. We derived several important properties of polygon interval arithmetic. We also proposed an efficient algorithm to calculate the addition of convex polygons. 2. Robust Stability Analysis. (1) The stability of feedback systems can be analyzed by examining the determinant of the return difference matrix at every frequency. A method based on the mapping theorem was used to calculate it, but it is very time-consuming. We proposed a method based on both polygon interval arithmetic and the mapping theorem to calculate the determinants. (2) The stability of (nonlinear) systems can be examined by using Liapunov functions. We proposed a method to construct a Liapunov function via a computational-geometry technique for calculating convex hulls. 3. RSRD (Robust Sequential Return Difference) method. We proposed the RSRD method for designing robust control systems, which uses polygon interval arithmetic, makes it possible to design the controller of each loop "independently", and guarantees integrity. 4. CAD (Computer Aided Design) System. We developed a CAD system for designing robust control systems based on the RSRD method. We also implemented a program to calculate the stability margin of multi-input multi-output systems.

  • Studies on Digital-Controller Configuration Design and Its Synchronization Control Using Multiple Digital Signal Processors.

     View Summary

    1. Modeling Multi-Processor Digital Controllers and Their Synchronization Scheme: Each processor is required to observe timing constraints to avoid command-signal collisions and should preferably have no idle time intervals. Appropriate models have been investigated and selected for both feedback-type and program-controlled-type controllers, with an emphasis on verifying synchronization capabilities. It was concluded that a discrete-time system representation is appropriate when viewed from the digital side. We adopted a gain matrix model for state feedback and a discrete-time transfer matrix model for program control. As a by-product, we learned that the robust stability compensation method is not mature enough for applications. Hence, we investigated and developed a new and more useful stabilization method. 2. Minimum Throughput-Time Configuration and Synchronization Control: For single-input, single-output controllers, a shared memory and bus configuration was proposed. We investigated the computational loading mechanism among the processors and proposed a new synchronization scheme that achieves the minimum throughput time. The results were also extended to multiple-input, multiple-output cases. 3. Verifying the Proposed Scheme: Several supporting software tools have been developed to verify the proposed schemes and to be used in the design process: Digital Signal Processor Command Generator, Throughput Estimator, and Hybrid Simulator. Digital Signal Processor Command Generator produces a program listing written in the processor's commands for given digital-controller characteristics, configuration, and synchronization control protocol. Throughput Estimator evaluates the throughput efficiency of a given control program. Hybrid Simulator simulates digital control systems that include an analog plant, A/D and D/A converters, and digital controllers. Users can select different types of converters with different arithmetic, and plants can be modeled as block diagrams using elementary blocks.

  • Research on Parallelization of Implicit State Enumeration for Design Verification of Sequential Machines

     View Summary

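    In this research, we parallelized a method for enumerating the reachable states of sequential circuits using binary decision diagrams, an efficient representation of logic functions. The method is called implicit state enumeration and is used for verification and test generation of sequential circuits. Implicit state enumeration is basically a method that exhaustively covers the set of states reachable from the initial state. In a sequential circuit, the function that determines the next state from the current state and the inputs is expressed as a logic function by encoding the states in binary. Sets such as the set of states reached so far are also expressed by characteristic functions, which are logic functions that evaluate to 1 for members of the set. We studied how to handle these logic functions with parallel binary decision diagram processing methods. Because the processing of these logic functions is basically a sequence of several logic operations, we studied the generalized problem of how to process a given sequence of logic operations as fast as possible. We studied three parallel processing methods: one based on Shannon expansion, one that processes each output separately, and one that generalizes Shannon expansion. We found that Shannon expansion is superior for logic functions that require a large amount of main memory, and that partitioning by output is effective for general benchmark circuits. Research on the generalization of Shannon expansion is still ongoing. In experiments on the AP1000 of Fujitsu Laboratories, we achieved a speedup of about 130 times for multiplier processing using 512 processors, and for general benchmark circuits with 64 processors a speedup of about 27 times in good cases and about 13 times on average. In the future, the properties specific to implicit state enumeration need to be studied in more depth, and parallelization that exploits them needs to be considered.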
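    The fixed-point iteration behind reachable-state enumeration can be sketched as follows. This is a minimal sketch under stated assumptions: Python sets stand in for the BDD-encoded characteristic functions, nothing here is parallel, and the names `reachable_states` and `step` as well as the modulo-6 counter are hypothetical illustrations rather than anything from the project.

```python
# Hypothetical sketch of reachable-state enumeration as a fixed-point iteration.
# Assumption: explicit Python sets replace the BDD-based (implicit) set representation.

def reachable_states(initial, inputs, next_state):
    """Return every state reachable from `initial` under any input sequence."""
    reached = set(initial)
    frontier = set(initial)
    while frontier:
        # Image computation: all successors of the newly found states.
        image = {next_state(s, x) for s in frontier for x in inputs}
        frontier = image - reached  # keep only states not seen before
        reached |= frontier
    return reached

def step(state, enable):
    """3-bit register that counts modulo 6 when enable is 1."""
    return (state + 1) % 6 if enable else state

states = reachable_states({0}, (0, 1), step)
print(sorted(states))  # [0, 1, 2, 3, 4, 5]
```

    Although the register has 8 encodings, states 6 and 7 are never reached. A verifier can use this kind of information to ignore behavior that only occurs in unreachable states.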

  • Research on Data Structures and Computation Models for Massively Parallel Algorithm Design

     View Summary

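    It is well known that, in designing sequential algorithms, the choice of data structure strongly affects how efficient the algorithm can be; the importance of data structures should naturally be recognized as well in designing massively parallel algorithms that run on tens of thousands to millions of processors. This research aims to establish a computation model for massively parallel algorithm design and to clarify the design principles of data structures on that model. In particular, taking into account constraints on the amount of inter-processor communication, we aim to establish "locally computable data structures" suited to processing with limited communication. This year's research was as follows. 1) Data structures on multi-layer hierarchical mesh networks: To build a basic theory of the capability of the RDT network proposed in this priority-area research and of algorithm development on it, we defined the multi-layer hierarchical mesh network as a concept that includes the RDT network, and examined the relationship between network structure and the overhead caused by data structures and communication. As a result, we confirmed the effectiveness of multi-layer hierarchical mesh networks, including the RDT network, for parallel computers with tens of thousands of processors. 2) Locally computable encodings: Continuing from last year, we studied the conditions on encodings under which all unary operations become locally computable for a finite set on which several unary operations are defined, and obtained several theoretical results. 3) Parallel processing algorithms for binary decision diagrams: We studied parallel algorithms for binary decision diagrams, an important data structure in the field of combinatorial problems, implemented them on an actual parallel computer, and examined their capability. The program has also been applied to practical fields such as design verification. Thus, this year's research combined theoretical work on multi-layer hierarchical mesh networks and local computability with the design and implementation of parallel processing algorithms for binary decision diagrams.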

  • Research on Formal Parallel Design Verification Methods for Pipeline Processing

     View Summary

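    In this research, we studied formal, parallel design verification methods for pipelined processing. Focusing in particular on verification of the control of pipelined processors, we presented a method, based on implicit state enumeration using binary decision diagrams, for deciding whether hazards (disturbances of the pipeline) occur when instructions are processed in the pipeline. Hazards are usually detected by simulation; our method performs this simulation symbolically and exhaustively for all cases. Specifically, two consecutive instructions are given symbolically and executed symbolically. All instructions other than the two under consideration are NOP instructions. At the same time, an instruction sequence in which a suitable number of NOP instructions is inserted between the two instructions is executed symbolically and compared with the first sequence, which detects whether a hazard occurs and what mechanism is provided to remove it. The symbolic execution uses implicit state enumeration for sequential circuits. Execution continues until the values of all registers except the program counter reach a steady state. The result of symbolic execution is expressed as logic functions. Verification is carried out by comparing, for each instruction sequence, the number of clock cycles needed to reach the steady state and whether the register values in the steady state are equal. To simplify the arithmetic part of the circuit under symbolic execution, we proposed a new binary decision diagram called the residue BDD. For parallelization, we presented a method for parallelizing implicit state enumeration. This is a new method that regards the binary decision diagram itself as a dataflow graph and extracts parallelism from it. With it we achieved a speedup of about 4 times on 10 CPUs. Future work includes applying the verification method to superscalar processors and combining the parallelization method based on the graph structure of binary decision diagrams with ordinary methods for parallelizing BDD operations.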

  • Research on Automatic Extraction of Reduced Models of Logic Circuits and Design Verification of Large-Scale Logic Circuits Using Them

     View Summary

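    In this research, we studied the extraction of reduced models of logic circuits and the verification of large-scale logic circuits using them. First, we surveyed verification methods that use reduced models. Next, we studied binary decision diagrams (BDDs), which are used in many current logic design verification methods. In particular, we studied a method that keeps BDDs small by treating the outputs of suitable internal logic gates as variables and that checks the equivalence of two circuits having different internal variables. For the equivalence check, we newly developed and used a method that converts the internal variables of one circuit into those of the other in polynomial time. Second, we proposed a method for suppressing the node explosion of binary decision diagrams in arithmetic circuits such as multipliers. The method is based on the residue number representation; it uses the property that the inputs of an arithmetic circuit correspond to binary numbers and bounds the number of BDD nodes by a polynomial in the number of input variables. The resulting bounded BDD is called a residue BDD. In verification, the circuit is verified separately for several moduli. As is known for residue number systems, the tuple of residues of the original function represents the original function completely, so verifying each residue separately suffices. We first applied residue BDDs to the verification of combinational circuits and confirmed a certain degree of effectiveness. We also studied their application to the verification of sequential circuits that contain arithmetic circuits such as multipliers. Third, we studied reduction methods based on circuit structure, which are important in processor verification and similar tasks, namely methods that regard a logic circuit as a graph and collapse parts that have the same structure. Furthermore, we studied specification description based on temporal logic and a method that, given a specification, removes the parts of the circuit that are irrelevant to it.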
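    The residue argument in this summary (verify the circuit separately for each modulus, then rely on the residue tuple determining the value) can be illustrated with a small exhaustive check. This is a minimal sketch under stated assumptions and does not build any BDDs: the 4-bit shift-and-add multiplier, the moduli (5, 7, 9), and the names `shift_add_multiply` and `agrees_modulo` are hypothetical choices for illustration.

```python
# Hypothetical sketch: verifying a multiplier one modulus at a time.
# Assumption: exhaustive simulation replaces the residue-BDD comparison.
from itertools import combinations, product
from math import gcd, prod

def shift_add_multiply(a, b, width=4):
    """Implementation under test: shift-and-add multiplication."""
    result = 0
    for i in range(width):
        if (b >> i) & 1:
            result += a << i
    return result

def agrees_modulo(impl, spec, modulus, width=4):
    """True if impl and spec have the same residue for every input pair."""
    values = range(1 << width)
    return all(impl(a, b) % modulus == spec(a, b) % modulus
               for a, b in product(values, values))

def spec(a, b):
    return a * b

moduli = (5, 7, 9)
# Pairwise coprime moduli whose product exceeds the largest output (15 * 15 = 225),
# so equal residues for all moduli imply equal values (Chinese remainder theorem).
assert all(gcd(p, q) == 1 for p, q in combinations(moduli, 2))
assert prod(moduli) > 15 * 15

verified = all(agrees_modulo(shift_add_multiply, spec, m) for m in moduli)
print(verified)  # True
```

    Each per-modulus check only has to track values smaller than the modulus, which is the property the residue BDD exploits to keep the node count bounded.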

  • Research on Reconfigurable General Purpose Co-processor Systems and Their Optimized Hardware/Software Codesign Compiler

     View Summary

    We have investigated computer systems with reconfigurable general-purpose co-processors and the hardware/software codesign environment for such systems. The results of our research are as follows: 1. We proposed a reconfigurable co-processor architecture made of FPGAs (Field Programmable Gate Arrays), a cache memory, and a bus interface. 2. We designed and implemented a prototype of the co-processor for Sun workstations. The co-processor includes 4 FPGAs, a 1 MB cache memory, and a bus interface with a hardware queue. 3. We proposed a hardware/software codesign environment for the computer system with the co-processor. We investigated system description languages and the method of cooperation between the main processor and the co-processor. 4. We designed and implemented the codesign environment, which starts from C programs, for the co-processor system. The hardware/software codesign compiler accepts a C program and estimates the execution time and hardware cost of each function when the function is implemented in hardware. The compiler also estimates the execution time of the function when it is implemented in software. The compiler then decides how each function is implemented. 5. We investigated optimization methods for C programs that are to be implemented as hardware modules on FPGAs. We introduced hardware-independent optimization methods such as loop unrolling, variable bit-length reduction, and function expansion, as well as optimization methods such as the 4-1 LUT (Look-Up Table) based hardware estimation method and the merge method for bit-level operations. 6. We tested several algorithms on the prototype of the codesign system, including lexical analysis, sorting, and several graphics applications. We found that the FPGA-based co-processor is useful for fast execution of programs when the program includes parallel-if structures or bit-level operations. In the future, we would like to investigate context switching on the co-processor system and dynamic reconfigurability of the co-processor.

  • Research on Speeding Up Synthesis and Optimization Methods for Logic Circuits

     View Summary

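    In this research, we studied fast synthesis techniques for large-scale logic circuits. Many optimization problems in logic synthesis are NP-complete, and efficient algorithms are often hard to obtain. We therefore studied methods that shorten logic synthesis time by avoiding unnecessary use of the synthesis optimization functions. First, focusing on the bit width of the datapath, we studied a method that reduces the time needed for logic optimization by keeping the width to the necessary minimum. Specifically, we proposed a method that analyzes the function of a circuit described in VHDL, C, or a similar language, finds the minimum and maximum values of each variable used in the functional description, and takes the logarithm of their difference to obtain a variable of the minimum necessary bit width. We further proposed a method that reduces the bit width of the associated operation units, which reduces the total amount of hardware and the time needed to synthesize it. The bit-width reduction was effective for flag variables, loop control variables, and the like, and a reduction of about 20% in hardware was observed. For comparisons with constants and similar operations, circuits that test for the constant are generated automatically at the gate level so that the optimization functions of the logic synthesis system are not needed. The method works as a front end to an ordinary logic synthesis system and reduces how often logic optimization is applied. We also studied high-level timing analysis of the logic circuits generated by these methods. In addition, we studied parallelization of the transduction method, one of the logic synthesis optimization methods, and proposed a method that performs circuit transformation and optimization in parallel. The parallelization works effectively on shared-main-memory parallel computers and achieved a speedup of about 2 times with 4 processors. Finally, regarding the integration of logic synthesis algorithms and logic element assignment, we developed a method that assigns logic elements for FPGA implementation at the VHDL level, centered on basic operation units, and shortened the processing time of the logic synthesis system. We are currently implementing and improving these methods.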

  • Construction Methods for Evolutionary Software That Adapts to Content

     View Summary

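    This research studies evolutionary methods, including their implementation, for constructing "software whose processing targets have a scope of application that cannot be fixed at design and development time." In FY1997, we traced the process of constructing a program while extending the software's functions (specifications) and, as a result, set out the approach of "programming on cells" (Poc). In programming with cells, a program is composed of a collection of cells. Its distinguishing feature is the introduction of active cells, which start themselves as soon as their preconditions are satisfied. In FY1998, we actually built the Poc editor. Besides plain editing functions, it can gather groups of cells and combine them into a single C program. We wrote several programs with it and examined and evaluated the problems that arose. From that experience, we proposed the "active computation model." The active computation model consists of functions that are started actively by their preconditions and a part that controls their activation; it has properties intermediate between fully autonomous functions and passive functions that are started by others. As efficient mechanisms for realizing Poc, we examined dynamic linking and computer organizations with a reconfigurable hardware part. Believing that, for software to be extended evolutionarily, computer architecture mechanisms that support new concepts and the new languages that express them are effective in addition to the concepts and languages themselves, we carried out research related to hardware/software codesign. We also extracted only the characteristics of Poc's active cells to introduce the active computation model, and added to the C language the definition of active functions that start themselves under given conditions. Based on this, we devised new algorithms and examined how multiple active functions run in parallel on a parallel computer. In the future, we will develop a language processor for active programs and seek ways of constructing programs evolutionarily on real problems such as analyzing descriptions of hand and finger motions and supporting the drafting of English contracts. In all of these problems, the program specification inevitably has to be extended. We will also study new computer architectures suited to executing active programs. In this way, we will continue to study "evolving software" from both sides of changing hardware and software.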

  • Implementation of Adaptable Hardware and Software for Changing Environment

     View Summary

    The aim of our research is to find out how to construct hardware and software that adapt to a changing environment. For the design and implementation of new information systems, we study methods of constructing reconfigurable systems that depend on the changing environment, from the combined viewpoint of hardware and software. Over 3 years, we studied the following theoretical and practical aspects. (1) From the viewpoint of adaptable software, we proposed the representation and construction of active software, the spontaneity and extensibility of objects in conversational programming, and an optimizing C compiler that generates variables of optimum bit length in VHDL. We then implemented some examples and showed the effectiveness of our proposals. (2) From the viewpoint of adaptable hardware, as examples of LSIs with reconfigurability, we designed and constructed a Java processor LSI with the ability to shorten instruction sequences dynamically, an LSI to estimate the eye track, and an LSI to determine face direction independently of the person. These LSIs use hardware-oriented algorithms and respond in real time. (3) For hardware synthesis and verification, we proposed a new symbolic image computation algorithm based on the BDD (Binary Decision Diagram) constrain operator. We then showed the good performance and effectiveness of the algorithm on large-scale circuits. (4) From the viewpoint of learning and knowledge acquisition for environmental adaptability, we proposed a method based on OBDDs (ordered BDDs). We then designed algorithms for mutual conversion between the conventional characteristic model and OBDDs. (5) We paid attention to quantum computation. Quantum computers can exploit quantum parallelism to recognize the dynamic characteristics of the environment. We therefore studied non-deterministic quantum finite automata (QFA) and compared QFA with their classical counterparts. As a result of the research, we obtained several mechanisms for constructing systems with environmental adaptability across hardware and software.

  • Hardware Verification with respect to Program Specification

     View Summary

    With the recent development of integrated circuit technology, 1 million transistors can be integrated on one chip. For the design of such huge circuits, high-level design methodologies have been developed and applied to many application-specific chips. In high-level design, programming languages are used to describe the functionality, and the description is automatically converted to hardware modules by high-level synthesis algorithms. Modification and verification should therefore be done at the programming level, and high-level verification methods are needed. In this research, we developed several basic algorithms to show the correctness of hardware modules with respect to the program specification. First, we surveyed current research on equality with uninterpreted functions and its application to software and hardware verification. We also examined current equality systems such as SVC and CVCL. We applied these systems to the verification of arithmetic circuits and showed their limitations. We also applied the equality checking systems to the verification of parallel and pipeline circuits. In equality checking, the algorithm uses logic formulae to represent and decide equality. To accelerate the decision procedure, we proposed a prototyping system based on a new look-up-table architecture for Field Programmable Gate Arrays. We devised the architecture and proposed a mapping method for it. The architecture is more area-efficient and faster than the usual look-up-table architecture. For the program specification, we proposed data-path optimization methods based on the control-data-flow graph. In particular, we focused on the bit width of data-paths and proposed an optimization method for integer operations and an error estimation method for floating-point operations. With the optimization and estimation algorithms, we can verify application-specific circuits written as C programs. We also worked on high-level test and proposed a test pattern compaction method with small area overhead for system-on-chip design.

  • High-level Hardware Verification Based on Equivalence Logic with Similarities

     View Summary

    For formal hardware verification at a high level, we studied an equivalence checking system based on equivalence logic with uninterpreted functions and similarities. The original equivalence logic manipulates the equivalence of variables and has been shown to be effective for the verification of pipeline processors. Equivalence logic with similarities is a logic system that manipulates the similarity between variables. For example, if we design a circuit with a fixed-point number system and would like to show its correctness with respect to a C program that uses floating-point numbers, exact equivalence cannot be shown and we must deal with similarity. First, we developed a prototype system that converts Verilog descriptions to equivalence logic formulae, a prototype system that converts C descriptions to equivalence logic formulae, and a prototype equivalence checking system based on time expansion and published equivalence logic checkers (such as CVCL/YICES). We tested the prototype system and found that the computation grows exponentially with the number of time expansions, and we worked on SAT-based equivalence checking and the transitivity constraints issue. For similarities, we are working on optimizing the number of bits of variables in floating-point to fixed-point conversion, on similarity based on the difference of the values, and on similarity based on the difference from the values of other live variables. We also applied the proposed equivalence checking to multi-threading processor design and to the acceleration of equivalence verification using the prototyping environment.


Specific Research

  • Research on Preserving the Meaning of Digital Data Using a One-Instruction Computer

    2016  

     View Summary

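    Digital data is a sequence of 0s and 1s that has no meaning by itself; the method for interpreting its meaning must be stored together with it. For character data, we have previously proposed attaching the number of bits per character, minimal font data corresponding to each bit pattern, and the method of converting to it, so that the data can be converted into readable form. This time, defining the preservation of the meaning of image-compressed data as restoring it to viewable data, we worked on describing the semantics of programs. We studied a method that preserves a program by means of a description of the instruction interpretation mechanism of the one-instruction computer subleq together with the program in subleq assembly, and we studied the optimization of the description size in that case. subleq has only one kind of instruction, so its semantics are simple to describe and its interpretation mechanism is easy to simulate and rebuild.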
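    To show how small a subleq interpretation mechanism can be, here is a minimal interpreter sketch. It follows the usual definition of subleq (subtract and branch if less than or equal to zero); the memory layout, the halting convention (a negative jump target), and the example program are assumptions made for illustration and are not taken from the project.

```python
# Minimal subleq interpreter (illustrative sketch; layout and names are assumptions).

def run_subleq(memory, pc=0, max_steps=10_000):
    """Run a subleq program and return the final memory image.

    Each instruction occupies three cells (a, b, c):
        mem[b] -= mem[a]; jump to c if the result is <= 0, else fall through.
    A program counter outside the memory (e.g. a negative jump target) halts.
    """
    mem = list(memory)
    for _ in range(max_steps):
        if not 0 <= pc <= len(mem) - 3:
            break
        a, b, c = mem[pc:pc + 3]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
    return mem

# Program: Y = Y + X, using Z as a scratch cell that starts at 0.
X, Y, Z = 9, 10, 11
program = [
    X, Z, 3,    # Z = Z - X  (Z becomes -X)
    Z, Y, 6,    # Y = Y - Z  (Y becomes Y + X)
    Z, Z, -1,   # Z = 0, then jump to -1 to halt
    7, 5, 0,    # data cells: X = 7, Y = 5, Z = 0
]
result = run_subleq(program)
print(result[Y])  # 12
```

    The whole interpreter is a handful of lines, which is consistent with the observation in the summary that the semantics of a one-instruction machine are easy to describe and rebuild.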

  • Hardware Design Technology for Exploiting Next-Generation Non-Volatile Devices

    2013  

     View Summary

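    With the recent development and spread of ambient devices such as mobile terminals and wireless sensors, shutting off power in the idle state has become important for extending their operating time. The internal state must then be saved so that operation can continue after power is restored, and next-generation non-volatile devices that retain their contents even when power is off are attracting attention. Next-generation non-volatile devices based on MTJs (Magnetic Tunnel Junctions) offer access speed comparable to ordinary CMOS SRAM and integration density as high as DRAM. Writing a value, however, requires controlling the direction of the magnetic field inside the MTJ and takes about 10 times the write energy of ordinary SRAM, so reducing it is an urgent issue. In this research, we therefore studied design techniques for exploiting next-generation non-volatile devices, including reduction of write energy. Besides using the memory as a ROM that stores computation results without being rewritten, we studied methods that reduce the writes themselves. Because rewriting an MTJ requires as much energy when the same value is written as when a different value is written, the basic approach is to suppress the write when the stored value and the value to be written are equal. Here we presented methods that are combined with this approach to reduce the number of writes further. First, based on state-transition analysis of sequential circuits, we proposed a method for finding registers that need not be rewritten, automatically generated the write-stop control circuit from the conditions for stopping writes, and confirmed the power reduction. Second, we studied methods that reduce the number of bits changed when a value changes, including a method that represents the new value as its difference from the old value to reduce the number of rewritten bits, and codes that limit the maximum number of changed bits. Third, we studied memory-based arithmetic, in which the inputs serve as the address and the computation results are the memory contents. Because this basically requires capacity exponential in the number of inputs, we examined methods that reduce the memory size for multiplication and similar operations by combining the memory with arithmetic units as needed. Finally, we studied fine-grained run-time power gating that takes into account the propagation of controlling values of logic elements. A controlling value of a logic element is a value on one input that alone determines the output. When an input takes the controlling value, the values of the other inputs are not needed, and the power to the part that computes them can be shut off. We presented a method that uses the propagation of controlling values through series connections to shut off power to more elements.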

  • Research on Run-Time Analysis and Optimization in System-on-Silicon

    2012  

     View Summary

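    Under the theme of run-time analysis and optimization in system-on-silicon, we studied fine-grained dynamic clock gating and power gating, a countermeasure against Single Event Upset (SEU) errors that uses dynamic rewriting of circuits on FPGAs, memory-based arithmetic, and optimization of cache organization. For fine-grained dynamic clock gating and power gating, we examined methods that control power at run time by dynamically switching clocks and power supplies on and off using signals inside the circuit. We found that multi-stage clock gating and pseudo power gating can reduce power by about 10% to 20%. For dynamic rewriting of circuits on FPGAs, addressing the phenomenon in which an SEU error changes the configuration bits of an FPGA so that the circuit no longer functions correctly, we proposed a quadruple-redundant structure that is safer than triple redundancy, together with a method that identifies the error when it occurs and restores the function by dynamically rewriting the faulty module. We implemented the proposed method using the dynamic partial reconfiguration feature of Xilinx FPGAs and evaluated its safety and area overhead. For memory-based arithmetic, judging that the rewritability of the memory part is useful for run-time optimization, we studied implementations of basic arithmetic operations and of trigonometric functions, multiplication, and division by the CORDIC method. Here an arithmetic operation is realized by using the inputs of the operation unit as the address and storing the computation results in memory. Because the memory size is exponential in the address width, techniques such as splitting the inputs into several parts, implementing each with memory, and feeding the memory outputs into arithmetic units were necessary. We also examined a method that stores the results of arithmetic units inside the hardware in memory, like a cache, so that a memory access replaces recomputation. We found that these memory-based arithmetic methods reduce the dynamic power caused by changes in logic gate outputs and are effective for run-time power optimization. Furthermore, we examined power optimization of cache memories using next-generation non-volatile memory and found that making part of the L1 cache and the L2 cache non-volatile yields a large reduction in leakage power.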
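    The idea of splitting the inputs so that the lookup memory stays small can be sketched for multiplication. This is a minimal sketch under stated assumptions, not the design evaluated in the project: the 4-bit split, the table name `LUT_4X4`, and the functions `lut_mul4` and `mul8_from_luts` are hypothetical, and Python lists stand in for memory blocks.

```python
# Hypothetical sketch of memory-based multiplication with split inputs.
# A direct 8x8 table would need 2**16 entries; a 4x4 table needs only 2**8.

# Memory contents: the address is the concatenation of two 4-bit operands.
LUT_4X4 = [a * b for a in range(16) for b in range(16)]

def lut_mul4(a, b):
    """4-bit x 4-bit product read from memory (address = a:b)."""
    return LUT_4X4[(a << 4) | b]

def mul8_from_luts(x, y):
    """8-bit x 8-bit product from four table reads plus shifts and adds."""
    xh, xl = x >> 4, x & 0xF
    yh, yl = y >> 4, y & 0xF
    return ((lut_mul4(xh, yh) << 8)
            + ((lut_mul4(xh, yl) + lut_mul4(xl, yh)) << 4)
            + lut_mul4(xl, yl))

print(mul8_from_luts(200, 123))  # 24600
print(len(LUT_4X4))              # 256
```

    The trade-off is the one the summary points to: the table shrinks from 65536 entries to 256, at the cost of adders that combine the partial products.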

  • Research on Run-Time Analysis and Optimization Methods for System-on-Silicon

    2011   Nozomu Togawa

     View Summary

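    As research on run-time analysis and optimization for system-on-silicon, we carried out basic studies on run-time error detection using assertion checkers, on encryption and secure storage of the detected errors, and on tamper resistance. For assertion checkers, based on a method that uses input-storing automata, we showed that sharing the input storage part reduces hardware resources in FPGA implementations. Next, regarding the sufficiency of the assertion set needed for run-time analysis, we surveyed and examined methods based on mutant-based sufficiency checking, in which part of the circuit is modified. In the mutant-based approach, the sufficiency of the assertions is judged by whether the assertions detect the introduced modifications, but which modifications to introduce depends strongly on the kind of run-time analysis. We found that delay errors in particular need to be discussed together with the description method. For compressing error information, we examined an LFSR-based method with high compression capability. For run-time optimization, we examined methods that use the dynamic reconfiguration mechanism of FPGAs. In particular, we examined and prototyped a method that, while the embedded processor is executing instructions, dynamically builds the operation units corresponding to those instructions, detects the instruction sequence corresponding to a loop, and passes the data through the dynamically built datapath. This amounts to performing high-level synthesis of hardware dynamically from the assembly level, but the loop detection part, the method of feeding data to the newly built datapath, and methods for fast dynamic reconfiguration of FPGAs still need to be examined. Optimization of the datapath is also future work; we studied optimizations such as more efficient and lower-power memory-based arithmetic and multi-operand addition, in which several additions are performed in succession. We also examined encryption of error information and tamper resistance against information leakage, and discussed tamper resistance when a scan path is present.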

  • Power and Delay Optimization of VLSI Using Logic Controlling Values

    2009  

     View Summary

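    Under the theme of power and delay optimization of VLSI using logic controlling values, we studied the optimization of VLSI gate-level circuits. For delay optimization, we studied automatic generation of pipelined circuits and proposed a pipeline synthesis method for FPGAs, obtaining a clock frequency 1.8 times higher with a two-stage pipeline for adder and multiplier circuits. The algorithm and experimental results were presented orally at the IPSJ SLDM workshop and at the ASP-DAC Student Forum. For power optimization, we proposed a fine-grained power gating method that shuts off power according to the controlling values of logic elements. An algorithm that evaluates, for each control signal, the product of the probability that it takes the controlling value and the number of gates that can then be shut off, and inserts power gating in descending order of this score, achieved an average power reduction of about 15%. The results were published in the English-language transactions of the IEICE. Furthermore, we studied optimal sharing in clock gating, which reduces dynamic power by stopping the clocks of registers in sequential circuits, and confirmed its effect on counters and ISCAS 89 benchmark circuits. These results are scheduled for oral presentation at the IPSJ SLDM workshop in May 2010.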

  • Power and Delay Optimization Based on Controlling Values of Logic Elements in VLSI

    2008  

     View Summary

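    To improve the performance and reduce the power consumption of VLSI, we proposed methods that use the controlling values of logic elements and carried out basic experiments. For performance improvement, using the fact that the controlling value of an AND gate is 0, we let a transition to 0 that travels along the longest path of a logic circuit pass early through an AND gate, and derived a method for generating the control condition for this. Transitions to 1 are likewise passed early through an OR gate. Because transitions to 0 and transitions to 1 are skipped (bypassed) separately, we call this the 01-skip method. We applied the method to simple circuits and confirmed the expected speedup. Building a tool and applying it to various circuits are future work, as is reducing the additional circuitry by sharing control circuits. For power reduction, using the fact that the controlling value of an AND gate is 0 and the property that when one input is 0 the output is unaffected even if the value of the other input is undefined, we proposed a method that shuts off power to the block that computes the other input, and confirmed its effect on simple circuits. The method has been confirmed to be effective for reducing leakage power, which is increasing sharply with process scaling, and also for reducing dynamic power. Building a tool, applying it to various circuits, and evaluating it with actual LSI prototypes are future work.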

  • Design Verification Methods for Hardware with Programs as Specifications

    2002  

     View Summary

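    In response to the move to higher levels of abstraction in hardware design, we studied methods that use a program as the specification and formally verify the hardware design. First, we surveyed current verification methods in journals, international conferences, and workshops. The survey showed that there are three basic methods, namely exact verification of sequential circuits using binary decision diagrams, approximate verification based on SAT, and equivalence logic based on uninterpreted functions, and that hardware verification methods combining them are being studied actively. It also became clear, however, that for approaches that take a program as the specification, the emphasis is mainly on speeding up simulation by executing the program directly, and that research and development of formal methods is insufficient. We therefore aimed to develop a method that applies equivalence logic based on uninterpreted functions, which among these hardware methods is considered applicable to large-scale circuits, and carried out basic research toward that end. Because equivalence logic based on uninterpreted functions can decide the equivalence of symbolic expressions, the equivalence of two programs can be decided as the equivalence of formulas by converting program assignments directly into formulas for equivalence checking. Specifically, targeting programs in the C language, we derived rules for converting them into formulas of the equivalence logic, applied them to multi-byte arithmetic problems, and determined the effectiveness and the limits of applicability of the method. We found that for operations that include carry-select addition, as used in actual processors, verifying the equivalence of additions of about 64 bits becomes infeasible in terms of time; further research, including on the properties of the equivalence logic itself, is needed.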


 

Syllabus
