Updated on 2024/11/28


 
KIMURA, Keiji
 
Affiliation
Faculty of Science and Engineering, School of Fundamental Science and Engineering
Job title
Professor
Degree
Doctor of Engineering ( Waseda University )

Research Experience

  • 2012
    -
     

    Professor, Department of Computer Science and Engineering, Waseda University

  • 2005
    -
    2012

    Associate Professor, Department of Computer Science, Waseda University

  • 2004
    -
    2005

    Assistant Professor, Department of Computer Science, Waseda University

  • 2002
    -
    2004

    Visiting Assistant Professor, Advanced Research Institute for Science and Engineering, Waseda University

  • 1999
    -
    2002

    Research Associate, Department of Electrical, Electronics and Computer Engineering, Waseda University

Education Background

  •  
    -
    1996

    Waseda University   Faculty of Science and Engineering   Department of Electronics  

Committee Memberships

  • 2022.04
    -
    2022.10

    The 31st International Conference on Parallel Architectures and Compilation Techniques (PACT 2022)

  • 2021
    -
     

    The 30th International Conference on Parallel Architectures and Compilation Techniques (PACT 2021)

  • 2021
    -
     

    The 34th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2021)

  • 2021
    -
     

    ACM Principles and Practice of Parallel Programming 2021 (PPoPP 2021), Extended Review Committee

  • 2020
    -
     

    The 26th IEEE International Symposium on High-Performance Computer Architecture  Program Committee

  • 2018
    -
    2020

    IEEE International Parallel & Distributed Processing Symposium (IPDPS 2018-2020)  Program Committee

  • 2019
    -
     

The 37th IEEE International Conference on Computer Design (ICCD 2019)  Program Track Chair (Processor Architecture)

  • 2019
    -
     

    24th Asia and South Pacific Design Automation Conference (ASP-DAC 2019)  Program Committee (On-chip Communication and Networks-on-Chip)

  • 2018
    -
     

    Principles and Practice of Parallel Programming 2018 (PPoPP 2018)  Publicity Chair

  • 2018
    -
     

    IEEE COMPSAC 2018  Computer Architecture and Platforms Co-Chairs

  • 2016
    -
     

    The 22nd IEEE International Conference on Parallel and Distributed Systems (ICPADS 2016)  Program Vice Chair (Parallel / Distributed Algorithms and Applications)

  • 2016
    -
     

    The 45th International Conference on Parallel Processing (ICPP-2016)  Program Committee (Programming Models, Languages and Compilers)

  • 2016
    -
     

The 3rd International Workshop on Software Engineering for Parallel Systems (SEPS 2016)  Program Committee

  • 2015
    -
     

    The 24th International Conference on Parallel Architectures and Compilation Techniques (PACT 2015)  Program Committee

  • 2015
    -
     

27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2015)  Program Committee (Software Track)

  • 2015
    -
     

    15th International Symposium on High-Performance Computer Architecture (HPCA-15)  Publicity Co-Chairs

  • 2010.04
    -
    2014.03

IPSJ Special Interest Group on Computer Architecture  Secretary

  • 2014
    -
     

    The 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS)  Program Committee

  • 2011
    -
    2014

The 24th--27th International Workshop on Languages and Compilers for Parallel Computing (LCPC)  Program Committee, Program Chair (2012)

  • 2010.04
    -
    2013.03

IPSJ Special Interest Group on Embedded Systems  Steering Committee Member

  • 2013
    -
     

The 13th International Forum on Embedded MPSoC and Multicore (MPSoC2013)  Finance Co-Chairs

  • 2013
    -
     

The 27th International Conference on Supercomputing (ICS 2013)  Program Committee

  • 2009
    -
    2013

    IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips XII--XVII)  Program Committee

  • 2009
    -
    2013

XXVII--XXXII IEEE International Conference on Computer Design (ICCD)  Program Committee (Computer System Design and Application Track)

  • 2012
    -
     

    The 12th International Forum on Embedded MPSoC and Multicore (MPSoC2012)  Program Co-Chairs

  • 2011
    -
     

Advanced Parallel Processing Technology Symposium (APPT)  Program Committee

  • 2011
    -
     

The 17th IEEE International Conference on Parallel and Distributed Systems (ICPADS)  Program Committee (Multicore Computing and Parallel / Distributed Architecture)

  • 2008.04
    -
    2010.03

IPSJ Special Interest Group on Computer Architecture  Steering Committee Member

  • 2010
    -
     

22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)  Program Committee (System Software Track)

  • 2010
    -
     

    IEEE International Symposium on Workload Characterization (IISWC-2010)  Program Committee

  • 2005.04
    -
    2009.03

IPSJ Magazine  Editorial Committee Member (SWG)

  • 2005.04
    -
    2009.03

IPSJ Special Interest Group on System LSI Design Methodology (SLDM)  Steering Committee Member

  • 2005
    -
    2009.03

IPSJ Transactions on Advanced Computing Systems (ACS)  Editorial Committee Member

  • 2009
    -
     

    The 38th International Conference on Parallel Processing (ICPP-2009)  Program Committee (Programming Models, Languages and Compilers)

  • 2006
    -
    2008

    IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips IX--XI)  Program Committee Vice Chair

  • 2006
    -
    2008

    IPSJ ComSys  Program Committee

  • 2007
    -
     

    IPSJ DA Symposium  University Chair

  • 2007
    -
     

    IPSJ SACSIS  Program Committee Vice Chair

  • 2006
    -
     

IPSJ SACSIS 2008--2013  Program Committee

  • 2003
    -
    2006

Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP)  Organizing Committee Member

  • 2001.04
    -
    2005.03

IPSJ Special Interest Group on System Software and Operating Systems  Steering Committee Member

  • 2001.04
    -
    2005.03

IPSJ Magazine  Editorial Committee Member, BWG (Chief in the final year)

  • 2004
    -
     

SACSIS - Symposium on Advanced Computing Systems and Infrastructures  Finance Chair / Program Committee Member


Professional Memberships

  •  
     
     

    ACM

  •  
     
     

    IEEE Computer Society

  •  
     
     

The Institute of Electronics, Information and Communication Engineers

  •  
     
     

    Information Processing Society of Japan

Research Areas

  • Computer system

Research Interests

  • Multiprocessor Architecture, Parallelizing Compiler

Awards

  • MEXT Award for Science and Technology (Research category)

2014.04   Ministry of Education, Culture, Sports, Science and Technology (MEXT)

 

Papers

  • Parallel Verification in RISC-V Secure Boot

    Akihiro Saiki, Yu Omori, Keiji Kimura

    2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)    2023.12

    DOI

  • Parallelizing Factory Automation Ladder Programs by OSCAR Automatic Parallelizing Compiler

Tohma Kawasumi, Yuta Tsumura, Hiroki Mikami, Tomoya Yoshikawa, Takero Hosomi, Shingo Oidate, Keiji Kimura, Hironori Kasahara

    Proc. of the 35th International Workshop on Languages and Compilers for Parallel Computing (LCPC2022)    2022.10  [Refereed]

  • Open-Source Hardware Memory Protection Engine Integrated With NVMM Simulator

    Yu Omori, Keiji Kimura

    IEEE Computer Architecture Letters   21 ( 2 ) 77 - 80  2022.08  [Refereed]

    Authorship:Last author

    DOI

  • Data stream clustering for low-cost machines

    Christophe Cérin, Keiji Kimura, Mamadou Sow

    Journal of Parallel and Distributed Computing   166   57 - 70  2022.08  [Refereed]

    DOI

    Scopus

    2 Citations (Scopus)

  • Open-Source RISC-V Linux-Compatible NVMM Emulator

    Yu Omori, Keiji Kimura

    Sixth Workshop on Computer Architecture Research with RISC-V (CARRV 2022)    2022.06  [Refereed]

    Authorship:Last author

  • Lightweight Array Contraction by Trace-Based Polyhedral Analysis

    Hugo Thievenaz, Keiji Kimura, Christophe Alias

    C3PO’22: Compiler-assisted Correctness Checking and Performance Optimization for HPC    2022.06  [Refereed]

  • Rephrasing polyhedral optimizations with trace analysis

    Hugo Thievenaz, Keiji Kimura, Christophe Alias

    12th International Workshop on Polyhedral Compilation Techniques (IMPACT 2022)    2022.06  [Refereed]

  • Accelerating Data Dependence Profiling Through Abstract Interpretation of Loop Instructions

    Mostafa Abbas, Mostafa I. Soliman, Sherif I. Rabia, Keiji Kimura, Ahmed El-Mahdy

    IEEE Access   10   31626 - 31640  2022  [Refereed]

    DOI

  • OSCAR Parallelizing and Power Reducing Compiler and API for Heterogeneous Multicores : (Invited Paper)

    Hironori Kasahara, Keiji Kimura, Toshiaki Kitamura, Hiroki Mikami, Kazutaka Morita, Kazuki Fujita, Kazuki Yamamoto, Tohma Kawasumi

    2021 IEEE/ACM Programming Environments for Heterogeneous Computing (PEHC)    2021.11  [Refereed]  [Invited]

    DOI

  • Parallelizing Compiler Translation Validation Using Happens-Before and Task-Set

    Jixin Han, Tomofumi Yuki, Michelle Mills Strout, Dan Umeda, Hironori Kasahara, Keiji Kimura

    2021 Ninth International Symposium on Computing and Networking Workshops (CANDARW)    2021.11  [Refereed]

    DOI

  • Performance Evaluation of OSCAR Multi-target Automatic Parallelizing Compiler on Intel, AMD, Arm and RISC-V Multicores

    Birk M. Magnussen, Tohma Kawasumi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    LCPC2021    2021.10  [Refereed]

  • Durable Queue Implementations Built on a Formally Defined Strand Persistency Model

    Jixin Han, Keiji Kimura

    Journal of Information Processing   29   823 - 838  2021  [Refereed]

    Authorship:Last author

    DOI

  • Secure Image Inference Using Pairwise Activation Functions

    Jonas T. Agyepong, Mostafa Soliman, Yasutaka Wada, Keiji Kimura, Ahmed El-Mahdy

    IEEE Access   9   118271 - 118290  2021  [Refereed]

    DOI

  • Non-Volatile Main Memory Emulator for Embedded Systems Employing Three NVMM Behaviour Models

    Yu OMORI, Keiji KIMURA

    IEICE TRANSACTIONS on Information and Systems   E104-D ( 5 ) 697 - 708  2021  [Refereed]

    Authorship:Last author

  • Scalable and Fast Lazy Persistency on GPUs

Ardhi Wiratama Baskara Yudha, Keiji Kimura, Huiyang Zhou, Yan Solihin

    2020 IEEE International Symposium on Workload Characterization (IISWC 2020)     252 - 263  2020.10  [Refereed]

  • Local Memory Mapping of Multicore Processors on an Automatic Parallelizing Compiler

    Yoshitake OKI, Yuto ABE, Kazuki YAMAMOTO, Kohei YAMAMOTO, Tomoya SHIRAKAWA, Akimasa YOSHIDA, Keiji KIMURA, Hironori KASAHARA

    IEICE TRANSACTIONS on Electronics   E103-C ( 3 ) 98 - 109  2020.03  [Refereed]

  • Compiler Software Coherent Control for Embedded High Performance Multicore

    Boma A. ADHI, Tomoya KASHIMATA, Ken TAKAHASHI, Keiji KIMURA, Hironori KASAHARA

    IEICE TRANSACTIONS on Electronics   E103-C ( 3 ) 85 - 97  2020.03  [Refereed]

  • Compiler-support for Critical Data Persistence in NVM

    Reem Elkhouly, Mohammad Alshboul, Akihiro Hayashi, Yan Solihin, Keiji Kimura

    ACM Transactions on Architecture and Code Optimization (TACO)   16 ( 4 )  2019.12  [Refereed]

    Authorship:Last author

  • Software Cache Coherent Control by Parallelizing Compiler

    Boma A. Adhi, Masayoshi Mase, Yuhei Hosokawa, Yohei Kishimoto, Taisuke Onishi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)   11403   17 - 25  2019.11  [Refereed]

  • Cascaded DMA Controller for Speedup of Indirect Memory Access in Irregular Applications

    Tomoya Kashimata, Toshiaki Kitamura, Keiji Kimura, Hironori Kasahara

    9th Workshop on Irregular Applications: Architectures and Algorithms     71 - 76  2019.11  [Refereed]

  • Performance of Static and Dynamic Task Scheduling for Real-Time Control System on Embedded Multicore Processor

    Yoshitake Oki, Hiroki Mikami, Hikaru Nishida, Dan Umeda, Keiji Kimura, Hironori Kasahara

    32nd International Workshop on Languages and Compilers for Parallel Computing(LCPC)    2019.10  [Refereed]

  • Performance Evaluation on NVMM Emulator Employing Fine-Grain Delay Injection

    Yu Omori, Keiji Kimura

    The 8th IEEE Non-Volatile Memory Systems and Applications Symposium (IEEE NVMSA 2019)     1 - 6  2019.08  [Refereed]

    Authorship:Last author

    DOI

    Scopus

    3 Citations (Scopus)

  • Fast and Highly Optimizing Separate Compilation for Automatic Parallelization

    Tohma Kawasumi, Ryota Tamura, Yuya Asada, Jixin Han, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    The 2019 International Conference on High Performance Computing & Simulation (HPCS 2019)     478 - 485  2019.07  [Refereed]

  • Efficient Checkpointing with Recompute Scheme for Non-volatile Main Memory

    Mohammad Alshboul, Hussein Elnawawy, Reem Elkhouly, Keiji Kimura, James Tuck, Yan Solihin

    ACM Transactions on Architecture and Code Optimization (TACO)   16 ( 2 )  2019.05  [Refereed]

  • Multicore Cache Coherence Control by a Parallelizing Compiler

    Hironori Kasahara, Keiji Kimura, Boma A. Adhi, Yuhei Hosokawa, Yohei Kishimoto, Masayoshi Mase

    Proceedings - International Computer Software and Applications Conference   1   492 - 497  2017.09  [Refereed]

     View Summary

    A recent development in multicore technology has enabled development of hundreds or thousands core processor. However, on such multicore processor, an efficient hardware cache coherence scheme will become very complex and expensive to develop. This paper proposes a parallelizing compiler directed software coherence scheme for shared memory multicore systems without hardware cache coherence control. The general idea of the proposed method is that an automatic parallelizing compiler analyzes the control dependency and data dependency among coarse grain task in the program. Then based on the obtained information, task parallelization, false sharing detection and data restructuration to prevent false sharing are performed. Next the compiler inserts cache control code to handle stale data problem. The proposed method is built on OSCAR automatic parallelizing compiler and evaluated on Renesas RP2 with 8 SH-4A cores processor. The hardware cache coherence scheme on the RP2 processor is only available for up to 4 cores and the hardware cache coherence can be completely turned off for non-coherence cache mode. Performance evaluation is performed using 10 benchmark program from SPEC2000, SPEC2006, NAS Parallel Benchmark (NPB) and Mediabench II. The proposed method performs as good as or better than hardware cache coherence scheme. For example, 4 cores with the hardware coherence mechanism gave us speed up of 2.52 times against 1 core for SPEC2000 'equake', 2.9 times for SPEC2006 'lbm', 3.34 times for NPB 'cg', and 3.17 times for MediaBench II MPEG2 Encoder. The proposed software cache coherence control gave us 2.63 times for 4 cores and 4.37 for 8 cores for 'equake', 3.28 times for 4 cores and 4.76 times for 8 cores for lbm, 3.71 times for 4 cores and 4.92 times for 8 cores for 'MPEG2 Encoder'.

    DOI

    Scopus

    7 Citations (Scopus)

  • Automatic Local Memory Management for Multicores Having Global Address Space

    Kouhei Yamamoto, Tomoya Shirakawa, Yoshitake Oki, Akimasa Yoshida, Keiji Kimura, Hironori Kasahara

    LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, LCPC 2016   10136   282 - 296  2017  [Refereed]

     View Summary

    Embedded multicore processors for hard real-time applications like automobile engine control require the usage of local memory on each processor core to precisely meet the real-time deadline constraints, since cache memory cannot satisfy the deadline requirements due to cache misses. To utilize local memory, programmers or compilers need to explicitly manage data movement and data replacement for local memory considering the limited size. However, such management is extremely difficult and time consuming for programmers. This paper proposes an automatic local memory management method by compilers through (i) multi-dimensional data decomposition techniques to fit working sets onto limited size local memory (ii) suitable block management structures, called Adjustable Blocks, to create application specific fixed size data transfer blocks (iii) multi-dimensional templates to preserve the original multi-dimensional representations of the decomposed multi-dimensional data that are mapped onto one-dimensional Adjustable Blocks (iv) block replacement policies from liveness analysis of the decomposed data, and (v) code size reduction schemes to generate shorter codes. The proposed local memory management method is implemented on the OSCAR multi-grain and multi-platform compiler and evaluated on the Renesas RP2 8 core embedded homogeneous multicore processor equipped with local and shared memory. Evaluations on 5 programs including multimedia and scientific applications show promising results. For instance, speedups on 8 cores compared to single core execution using off-chip shared memory on an AAC encoder program, a MPEG2 encoder program, Tomcatv, and Swim are improved from 7.14 to 20.12, 1.97 to 7.59, 5.73 to 7.38, and 7.40 to 11.30, respectively, when using local memory with the proposed method. These evaluations indicate the usefulness and the validity of the proposed local memory management method on real embedded multicore processors.

    DOI

    Scopus

    2 Citations (Scopus)

  • Architecture design for the environmental monitoring system over the winter season

    Koichiro Yamashita, Chen Ao, Takahisa Suzuki, Yi Xu, Hongchun Li, Jun Tian, Keiji Kimura, Hironori Kasahara

    MobiWac 2016 - Proceedings of the 14th ACM International Symposium on Mobility Management and Wireless Access, co-located with MSWiM 2016     27 - 34  2016.11  [Refereed]

     View Summary

    One of the applications as a source of big data, there is a sensor network for the environmental monitoring that is designed to detect the deterioration of the infrastructure, erosion control and so on. The specific targets are bridges, buildings, slopes and embankments due to the natural disasters or aging. Basic requirement of this monitoring system is to collect data over a long period of time from a large number of nodes that installed in a wide area. However, in order to apply a wireless sensor network (WSN), using wireless communication and energy harvesting, there are not many cases in the actual monitoring system design. Because of the system must satisfy various conditions measurement location and time specified by the civil engineering communication quality and topology obtained from the network technology the electrical engineering to solve the balance of weather environment and power consumption that depends on the above-mentioned conditions. We propose the whole WSN design methodology especially for the electrical architecture that is affected by the network behavior and the environmental disturbance. It is characterized by determining recursively mutual trade-off of a wireless simulation and a power architecture simulation of the node devices. Furthermore, the system allows the redundancy of the design. In addition, we deployed the actual slope monitoring WSN that is designed by the proposed method to the snow-covered area. A conventional similar monitoring WSN, with 7 Ah Li-battery, it worked only 129 days in a mild climate area. On the other hand, our proposed system, deployed in the heavy snow area has been working more than 6 months (still working) with 3.2 Ah batteries. Finally, it made a contribution to the civil engineering succeeded in the real time observation of the groundwater level displacement at the time of melting snow in the spring season.

    DOI

    Scopus

    2 Citations (Scopus)

  • Reducing parallelizing compilation time by removing redundant analysis

    Jixin Han, Rina Fujino, Ryota Tamura, Mamoru Shimaoka, Hiroki Mikami, Moriyuki Takamura, Sachio Kamiya, Kazuhiko Suzuki, Takahiro Miyajima, Keiji Kimura, Hironori Kasahara

    SEPS 2016 - Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems, co-located with SPLASH 2016     1 - 9  2016.10  [Refereed]

     View Summary

    Parallelizing compilers employing powerful compiler optimizations are essential tools to fully exploit performance from today's computer systems. These optimizations are supported by both highly sophisticated program analysis techniques and aggressive program restructuring techniques. However, the compilation time for such powerful compilers becomes larger and larger for real commercial application due to these strong program analysis techniques. In this paper, we propose a compilation time reduction technique for parallelizing compilers. The basic idea of the proposed technique is based on an observation that parallelizing compilers apply multiple program analysis passes and restructuring passes to a source program but all program analysis passes do not have to be applied to the whole source program. Thus, there is an opportunity for compilation time reduction by removing redundant program analysis. We describe the removing redundant program analysis techniques considering the inter-procedural propagation of analysis update information in this paper. We implement the proposed technique into OSCAR automatically multigrain parallelizing compiler. We then evaluate the proposed technique by using three proprietary large scale programs. The proposed technique can remove 37.7% of program analysis time on average for basic analysis includes def-use analysis and dependence calculation, and 51.7% for pointer analysis, respectively.

    DOI

    Scopus

    2 Citations (Scopus)

  • An Android Systrace Extension for Tracing Wakelocks

    Bui Duc Binh, Keiji Kimura

    IEEE International Conference on Embedded and Ubiquitous Computing (EUC 2016)     146 - 149  2016.08  [Refereed]

    Authorship:Corresponding author

  • Multigrain Parallelization Using Profile Information of Embedded Applications Generated by Model-based Development Tools on Multicore Processors

    Dan Umeda, Takahiro Suzuki, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    IPSJ Journal   57 ( 2 ) 1 - 12  2016.02  [Refereed]

  • Android video processing system combined with automatically parallelized and power optimized code by OSCAR compiler

    Bui Duc Binh, Tomohiro Hirano, Hiroki Mikami, Hideo Yamamoto, Keiji Kimura, Hironori Kasahara

    Journal of Information Processing   24 ( 3 ) 504 - 511  2016  [Refereed]

     View Summary

    The emergence of multi-core processors in smart devices promises higher performance and low power consumption. The parallelization of applications enables us to improve their performance. However, simultaneously utilizing many cores would drastically drain the device battery life. This paper shows a demonstration system of realtime video processing combined with power reduction controlled by the OSCAR automatic parallelization compiler on ODROID-X2, an open Android development platform based on Samsung Exynos4412 Prime with 4 ARM Cortext- A9 cores. In this paper, we exploited the DVFS framework, core partitioning, and profiling technique and OSCAR parallelization - power control algorithm to reduce the total consumption in a real-time video application. The demonstration results show that it can cut power consumption by 42.8% for MPEG-2 Decoder application and 59.8% for Optical Flow application by using 3 cores in both applications.

    DOI CiNii

    Scopus

  • Multigrain parallelization for model-based design applications using the OSCAR compiler

    Dan Umeda, Takahiro Suzuki, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)   9519   125 - 139  2016  [Refereed]

     View Summary

    Model-based design is a very popular software development method for developing a wide variety of embedded applications such as automotive systems, aircraft systems, and medical systems. Model-based design tools like MATLAB/Simulink typically allow engineers to graphically build models consisting of connected blocks for the purpose of reducing development time. These tools also support automatic C code generation from models with a special tool such as Embedded Coder to map models onto various kinds of embedded CPUs. Since embedded systems require real-time processing, the use of multi-core CPUs poses more opportunities for accelerating program execution to satisfy the real-time constraints. While prior approaches exploit parallelism among blocks by inspecting MATLAB/Simulink models, this may lose an opportunity for fully exploiting parallelism of the whole program because models potentially have parallelism within a block. To unlock this limitation, this paper presents an automatic parallelization technique for auto-generated C code developed by MATLAB/Simulink with Embedded Coder. Specifically, this work (1) exploits multi-level parallelism including inter-block and intra-block parallelism by analyzing the auto-generated C code, and (2) performs static scheduling to reduce dynamic overheads as much as possible. Also, this paper proposes an automatic profiling framework for the auto-generated code for enhancing static scheduling, which leads to improving the performance of MATLAB/Simulink applications. Performance evaluation shows 4.21 times speedup with six processor cores on Intel Xeon X5670 and 3.38 times speedup with four processor cores on ARM Cortex-A15 compared with uniprocessor execution for a road tracking application.

    DOI

    Scopus

    10 Citations (Scopus)

  • Coarse grain task parallelization of earthquake simulator GMS using OSCAR compiler on various Cc-NUMA servers

    Mamoru Shimaoka, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)   9519   238 - 253  2016  [Refereed]

     View Summary

    This paper proposes coarse grain task parallelization for a earthquake simulation program using Finite Difference Method to solve the wave equations in 3-D heterogeneous structure or the Ground Motion Simulator (GMS) on various cc-NUMA servers using IBM, Intel and Fujitsu multicore processors. The GMS has been developed by the National Research Institute for Earth Science and Disaster Prevention (NIED) in Japan. Earthquake wave propagation simulations are important numerical applications to save lives through damage predictions of residential areas by earthquakes. Parallel processing with strong scaling has been required to precisely calculate the simulations quickly. The proposed method uses the OSCAR compiler for exploiting coarse grain task parallelism efficiently to get scalable speed-ups with strong scaling. The OSCAR compiler can analyze data dependence and control dependence among coarse grain tasks, such as subroutines, loops and basic blocks. Moreover, locality optimizations considering the boundary calculations of FDM and a new static scheduler that enables more efficient task schedulings on cc-NUMA servers are presented. The performance evaluation shows 110 times speed-up using 128 cores against the sequential execution on a POWER7 based 128 cores cc-NUMA server Hitachi SR16000 VM1, 37.2 times speed-up using 64 cores against the sequential execution on a Xeon E7-8830 based 64 cores cc-NUMA server BS2000, 19.8 times speed-up using 32 cores against the sequential execution on a Xeon X7560 based 32 cores cc-NUMA server HA8000/RS440, 99.3 times speed-up using 128 cores against the sequential execution on a SPARC64 VII based 256 cores cc-NUMA server Fujitsu M9000, 9.42 times speed-up using 12 cores against the sequential execution on a POWER8 based 12 cores cc-NUMA server Power System S812L.

    DOI

    Scopus

  • 2-Step Power Scheduling with Adaptive Control Interval for Network Intrusion Detection Systems on Multicores

    Lau Phi Tuong, Keiji Kimura

    2016 IEEE 10TH INTERNATIONAL SYMPOSIUM ON EMBEDDED MULTICORE/MANY-CORE SYSTEMS-ON-CHIP (MCSOC)     69 - 76  2016  [Refereed]

    Authorship:Last author

     View Summary

    Network intrusion detection system (NIDS) is becoming an important element even in embedded systems as well as in data centers since embedded computers have been increasingly exposed to the Internet. The demand for power budget of these embedded systems is a critical issue in addition to that for performance. In this paper, we propose a technique to minimize power consumption in the NIDS by 2-step power scheduling with the adaptive control interval. In addition, we also propose a CPU-core controlling algorithm so that our scheduling technique can preserve the performance for other applications and NIDS assuming the cases of multiplexing NIDS and them simultaneously on the same device such as a home server or a mobile platform. We implement our 2-step algorithm into Suricata, which is a popular NIDS, as well as a 1-step algorithm with the adaptive interval, and a simple fixed-interval algorithm for evaluations. Experimental results show that our 2-step scheduling with both the adaptive and the fixed 30-millisecond interval achieve 75% power saving comparing with the Ondemand governor and 87% comparing with the Performance governor in Linux, respectively, without affecting their performance capability on four ARM Cortex-A15 cores at the network traffic of 1,000 packets/seconds. In contrast, when the network traffic reaches to 17,000 packets/seconds, our 2-step scheduling and the Ondemand as well as the Performance governor can maintain the packet processing capacity while the fixed 30-milliseconds interval processes only 50% packets with two and three cores, and about 80% packets on four cores.

    DOI

    Scopus

    1 Citation (Scopus)

  • Accelerating Multicore Architecture Simulation Using Application Profile

    Keiji Kimura, Gakuho Taguchi, Hironori Kasahara

    2016 IEEE 10TH INTERNATIONAL SYMPOSIUM ON EMBEDDED MULTICORE/MANY-CORE SYSTEMS-ON-CHIP (MCSOC)     177 - 184  2016  [Refereed]

    Authorship:Lead author

     View Summary

    Architecture simulators play an important role in exploring frontiers in the early stages of the architecture design. However, the execution time of simulators increases with an increase the number of cores. The sampling simulation technique that was originally proposed to simulate single-core processors is a promising approach to reduce simulation time. Two main hurdles for multi/many-core are preparing sampling points and thread skewing at functional simulation time. This paper proposes a very simple and low-error sampling-based acceleration technique for multi/many-core simulators. For a parallelized application, an iteration of a large loop including a parallelizable program part, is defined as a sampling unit. We apply X-means method to a profile result of the collection of iterations derived from a real machine to form clusters of those iterations. Multiple iterations are exploited as sampling points from these clusters. We execute the simulation along the sampling points and calculate the number of total execution cycles. Results from a 16-core simulation show that our proposed simulation technique gives us a maximum of 443x speedup with a 0.52% error and 218x speedup with 1.50% error on an average.

    DOI

    Scopus

    3 Citations (Scopus)

  • Annotatable systrace: An extended linux ftrace for tracing a parallelized program

    Daichi Fukui, Mamoru Shimaoka, Hiroki Mikami, Dominic Hillenbrand, Hideo Yamamoto, Keiji Kimura, Hironori Kasahara

    SEPS 2015 - Proceedings of the 2nd International Workshop on Software Engineering for Parallel Systems     21 - 25  2015.10  [Refereed]

     View Summary

    Investigation of the runtime behavior is one of the most important processes for performance tuning on a computer system. Profiling tools have been widely used to detect hot-spots in a program. In addition to them, tracing tools produce valuable information especially from parallelized programs, such as thread scheduling, barrier synchronizations, context switching, thread migration, and jitter by interrupts. Users can optimize a runtime system and hardware configuration in addition to a program itself by utilizing the attained information. However, existing tools provide information per process or per function. Finer information like task-or loop-granularity should be required to understand the program behavior more precisely. This paper has proposed a tracing tool, Annotatable Systrace, to investigate runtime execution behavior of a parallelized program based on an extended Linux ftrace. The Annotatable Systrace can add arbitrary annotations in a trace of a target program. The proposed tool exploits traces from 183.equake, 179.art, and mpeg2enc on Intel Xeon X7560 and ARMv7 as an evaluation. The evaluation shows that the tool enables us to observe load imbalance along with the program execution. It can also generate a trace with the inserted annotations even on a 32-core machine. The overhead of one annotation on Intel Xeon is 1.07 us and the one on ARMv7 is 4.44 us, respectively.

    DOI

    Scopus

    5 Citations (Scopus)

  • Evaluation of Automatic Power Reduction with OSCAR Compiler on Intel Haswell and ARM Cortex-A9 Multicores

    Tomohiro Hirano, Hideo Yamamoto, Shuhei Iizuka, Kohei Muto, Takashi Goto, Tamami Wake, Hiroki Mikami, Moriyuki Takamura, Keiji Kimura, Hironori Kasahara

    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)   8967   239 - 252  2015.05  [Refereed]

  • Automatic Parallelization of Designed Engine Control C Codes by MATLAB/Simulink

    Dan Umeda, Youhei Kanehagi, Hiroki Mikami, Akihiro Hayashi, Mitsuhiro Tani, Hiroshi Mori, Keiji Kimura, Hironori Kasahara

    IPSJ Journal   55 ( 8 ) 1817 - 1829  2014.08  [Refereed]

     View Summary

    Recently, more safety, comfort and environmental feasibility are required for the automobile. Accordingly, control systems need performance enhancement on microprocessors for real-time software which realize that. However, the improvement of clock frequency has been limited by power consumption and the performance of a single-core processor which controls power has reached the limits. For these factors, multi-core processors will be used for automotive control system. Recently Model-based Design by MATLAB and Simulink has been used for developing automobile systems because of elimination time of development and improvement of reliability. However, auto-generated-code from MATLAB and Simulink has been functioned on only single core processor so far. This paper proposes a parallelization method of engine control C codes for a multi-core processor generated from MATLAB and Simulink using Embedded Coder. The engine control C code which composed of many conditional branches and arithmetic assignment statements and are difficult to parallelize have been parallelized automatically using OSCAR automatic parallel compiler. In this result, it is succeeded to attain performance improvement on RP2 and V850E2R. Maximum 1.9x speedup on two cores and 3.76x speedup on four cores are attained.

    CiNii

  • Multicore Technologies Realizing Low-Power Computing

    Keiji Kimura, Hironori Kasahara

    The Journal of IEICE   97 ( 2 ) 133 - 139  2014.02  [Invited]

    Authorship:Lead author

    CiNii

  • OSCAR Compiler Controlled Multicore Power Reduction on Android Platform

    Hideo Yamamoto, Tomohiro Hirano, Kohei Muto, Hiroki Mikami, Takashi Goto, Dominic Hillenbrand, Moriyuki Takamura, Keiji Kimura, Hironori Kasahara

    LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, LCPC 2013   8664   155 - 168  2014  [Refereed]

     View Summary

    In recent years, smart devices are transitioning from single core processors to multicore processors to satisfy the growing demands of higher performance and lower power consumption. However, power consumption of multicore processors is increasing, as usage of smart devices become more intense. This situation is one of the most fundamental and important obstacle that the mobile device industries face, to extend the battery life of smart devices. This paper evaluates the power reduction control by the OSCAR Automatic Parallelizing Compiler on an Android platform with the newly developed precise power measurement environment on the ODROID-X2, a development platform with the Samsung Exynos4412 Prime, which consists of 4 ARM Cortex-A9 cores. The OSCAR Compiler enables automatic exploitation of multigrain parallelism within a sequential program, and automatically generates a parallelized code with the OSCAR Multi-Platform API power reduction directives for the purpose of DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating. The paper also introduces a newly developed micro second order pseudo clock gating method to reduce power consumption using WFI (Wait For Interrupt). By inserting GPIO (General Purpose Input Output) control functions into programs, signals appear on the power waveform indicating the point of where the GPIO control was inserted and provides a precise power measurement of the specified program area. The results of the power evaluation for real-time Mpeg2 Decoder show 86.7% power reduction, namely from 2.79[W] to 0.37[W] and for real-time Optical Flow show 86.5% power reduction, namely from 2.23[W] to 0.36[W] on 3 core execution.

    DOI

    Scopus

    3 Citations (Scopus)

  • Automatic Multicore Parallelization of Engine Control C Code Auto-Generated by Model-Based Design

    Dan Umeda, Yohei Kanehagi, Hiroki Mikami, Mitsuhiro Tani (DENSO), Hiroshi Mori (DENSO), Keiji Kimura, Hironori Kasahara

    Embedded Systems Symposium (ESS2013)    2013.10

  • OSCAR API v2.1: Extensions for an Advanced Accelerator Control Scheme to a Low-Power Multicore API

Keiji Kimura, Cecilia Gonzalez-Alvarez, Akihiro Hayashi, Hiroki Mikami, Mamoru Shimaoka, Jun Shirako, Hironori Kasahara

    17th Workshop on Compilers for Parallel Computing (CPC2013)    2013.07  [Refereed]

    Authorship:Lead author

  • Automatic Parallelization of Hand Written Automotive Engine Control Codes Using OSCAR Compiler

    Dan Umeda, Yohei Kanehagi, Hiroki Mikami, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

    17th Workshop on Compilers for Parallel Computing (CPC2013)    2013.07  [Refereed]

  • Evaluation of power consumption at execution of multiple automatically parallelized and power controlled media applications on the RP2 low-power multicore

    Hiroki Mikami, Shumpei Kitaki, Masayoshi Mase, Akihiro Hayashi, Mamoru Shimaoka, Keiji Kimura, Masato Edahiro, Hironori Kasahara

    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)   7146   31 - 45  2013

     View Summary

    This paper evaluates an automatic power reduction scheme of OSCAR automatic parallelizing compiler having power reduction control capability when multiple media applications parallelized by the OSCAR compiler are executed simultaneously on RP2, a 8-core multicore processor developed by Renesas Electronics, Hitachi, and Waseda University. OSCAR compiler enables the hierarchical multigrain parallel processing and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating and power gating for each processor core using the OSCAR multi-platform API. The RP2 has eight SH4A processor cores, each of which has power control mechanisms such as DVFS, clock gating and power gating. First, multiple applications with relatively light computational load are executed simultaneously on the RP2. The average power consumption of power controlled eight AAC encoder programs, each of which was executed on one processor, was reduced by 47%, (to 1.01W), against one AAC encoder execution on one processor (from 1.89W) without power control. Second, when multiple intermediate computational load applications are executed, the power consumptions of an AAC encoder executed on four processors with the power reduction control was reduced by 57% (to 0.84W) against an AAC encoder execution on one processor (from 1.95W). Power consumptions of one MPEG2 decoder on four processors with power reduction control was reduced by 49% (to 1.01W) against one MPEG2 decoder execution on one processor (from 1.99W). Finally, when a combination of a high computational load application program and an intermediate computational load application program are executed simultaneously, the consumed power reduced by 21% by using twice number of cores for each application. This paper confirmed parallel processing and power reduction by OSCAR compiler are efficient for multiple application executions. In execution of multiple light computational load applications, power consumption increases only 12% for one application. Parallel processing being applied to intermediate computational load applications, power consumption of executing one application on one processor core (1.49W) is almost same power consumption of two applications on eight processor cores (1.46W). © 2013 Springer-Verlag.

    DOI

    Scopus

    1 Citation (Scopus)

  • Automatic Design Exploration Framework for Multicores with Reconfigurable Accelerators

    Cecilia Gonzalez-Alvarez, Haruku Ishikawa, Akihiro Hayashi, Daniel Jimenez-Gonzalez, Carlos Alvarez, Keiji Kimura, Hironori Kasahara

Workshop on Reconfigurable Computing (WRC) 2013, held in conjunction with the HiPEAC 2013 conference    2013.01  [Refereed]

  • Parallelization of Automotive Engine Control Software On Embedded Multi-core Processor Using OSCAR Compiler

    Yohei Kanehagi, Dan Umeda, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

    2013 IEEE COOL CHIPS XVI (COOL CHIPS)    2013  [Refereed]

  • Automatic Parallelization, Performance Predictability and Power Control for Mobile-Applications

    Dominic Hillenbrand, Akihiro Hayashi, Hideo Yamamoto, Keiji Kimura, Hironori Kasahara

    2013 IEEE COOL CHIPS XVI (COOL CHIPS)    2013  [Refereed]

     View Summary

    Currently few mobile applications exploit the power- and performance capabilities of multi-core architectures. As the number of cores increases, the challenges become more pressing. We picked three challenges: application parallelization, performance-predictability/portability and power control for mobile devices. We tackled the challenges with our auto-parallelizing compiler and operating system enhancements.

  • Reconciling application power control and operating systems for optimal power and performance

    Dominic Hillenbrand, Yuuki Furuyama, Akihiro Hayashi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    2013 8th International Workshop on Reconfigurable and Communication-Centric Systems-on-Chip, ReCoSoC 2013    2013

     View Summary

    In the age of dark silicon on-chip power control is a necessity. Upcoming and state of the art embedded- and cloud computer system-on-chips (SoCs) already provide interfaces for fine grained power control. Sometimes both: core- and interconnect-voltage and frequency can be scaled for example. To further reduce power consumption SoCs often have specialized accelerators. Due to the rising specialization of hard- and software general purpose operating systems require changes to exploit the power saving opportunities provided by the hardware. However, they lack detailed hardware- and application-level-information. Application-level power control in turn is still very uncommon and difficult to realize. Now a days vendors of mobile devices are forced to tweak and patch system-level software to enhance the power efficiency of each individual product. This manual process is time consuming and must be re-iterated for each new product. In this paper we explore the opportunities and challenges of automatic application- level power control using compilers. © 2013 IEEE.

    DOI

    Scopus

    4 Citations (Scopus)

  • Parallel Processing of Multimedia Applications on TILEPro64 Using the OSCAR API for Embedded Multicores

Yohei Kishimoto, Hiroki Mikami, Keiichi Nakano, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

Embedded Systems Symposium (ESS2012)    2012.10

  • OSCAR Parallelizing Compiler and API for Real-time Low Power Heterogeneous Multicores

Akihiro Hayashi, Mamoru Shimaoka, Hiroki Mikami, Masayoshi Mase, Yasutaka Wada, Jun Shirako, Keiji Kimura, Hironori Kasahara

16th Workshop on Compilers for Parallel Computing (CPC2012)    2012.01  [Refereed]

  • Automatic Parallelization of a Dose Calculation Engine for Heavy-Ion Cancer Therapy

    林明宏, 松本卓司, 見神広紀, 木村啓二, 山本啓二, 崎浩典, 高谷保行, 笠原博徳

HPCS2012 - Symposium on High Performance Computing and Computational Science    2012.01

  • Enhancing the Performance of a Multiplayer Game by Using a Parallelizing Compiler

    Yasir I. M. Al-Dosary, Keiji Kimura, Hironori Kasahara, Seinosuke Narita

    2012 17TH INTERNATIONAL CONFERENCE ON COMPUTER GAMES (CGAMES)     67 - 75  2012  [Refereed]

     View Summary

    Video Games have been a very popular form of digital entertainment in recent years. They have been delivered in state of the art technologies that include multi-core processors that are known to be the leading contributor in enhancing the performance of computer applications. Since parallel programming is a difficult technology to implement, that field in Video Games is still rich with areas for advancements. This paper investigates performance enhancement in Video Games when using parallelizing compilers and the difficulties involved in achieving that. This experiment conducts several stages in attempting to parallelize a well-renowned sequentially written Video Game called ioquake3. First, the Game is profiled for discovering bottlenecks, then examined by hand on how much parallelism could be extracted from those bottlenecks, and what sort of hazards exist in delivering a parallel-friendly version of ioquake3. Then, the Game code is rewritten into a hazard-free version while also modified to comply with the Parallelizable-C rules, which crucially aid parallelizing compilers in extracting parallelism. Next, the program is compiled using a parallelizing compiler called OSCAR (Optimally Scheduled Advanced Multiprocessor) to produce a parallel version of ioquake3. Finally, the performance of the newly produced parallel version of ioquake3 on a Multi-core platform is analyzed.
    The following is found: (1) the parallelized game by the compiler from the revised sequential program of the game is found to achieve a 5.1 faster performance at 8-threads than original one on an IBM Power 5+ machine that is equipped with 8-cores, and (2) hazards are caused by thread contentions over globally shared data, and as well as thread private data, and (3) AI driven players are represented very similarly to Human players inside ioquake3 engine, which gives an estimation of the costs for parallelizing Human driven sessions, and (4) 70% of the costs of the experiment is spent in analyzing ioquake3 code, 30% in implementing the changes in the code.

  • Parallelizing Compiler Framework and API for Heterogeneous Multicores

    Akihiro Hayashi, Yasutaka Wada, Takeshi Watanabe, Takeshi Sekiguchi, Masayoshi Mase, Jun Shirako, Keiji Kimura, Hironori Kasahara

    IPSJ Transactions on Advanced Computing Systems   5 ( 1 ) 68 - 79  2011.11  [Refereed]

  • A 45-nm 37.3 GOPS/W Heterogeneous Multi-Core SOC with 16/32 Bit Instruction-Set General-Purpose Core

    Osamu Nishii, Yoichi Yuyama, Masayuki Ito, Yoshikazu Kiyoshige, Yusuke Nitta, Makoto Ishikawa, Tetsuya Yamada, Junichi Miyakoshi, Yasutaka Wada, Keiji Kimura, Hironori Kasahara, Hideo Maejima

    IEICE TRANSACTIONS ON ELECTRONICS   E94C ( 4 ) 663 - 669  2011.04  [Refereed]

     View Summary

    We built a 12.4 mm x 12.4 mm, 45-nm CMOS, chip that integrates eight 648-MHz general purpose cores, two matrix processor (MX-2) cores, four flexible engine (FE) cores and media IP (VPU5) to establish heterogeneous multi-core chip architecture. The general purpose core had its IPC (instructions per cycle) performance enhanced by adding 32-bit instructions to the existing 16-bit fixed-length instruction set and executing up to two 32-bit instructions per cycle. Considering these five-to-seven years of embedded LSI and increasing trend of access-master within LSI, we predict that the memory usage of single core will not exceed 32-bit physical area (i.e. 4 GB), but chip-total memory usage will exceed 4 GB. Based on this prediction, the physical address was expanded from 32-bit to 40-bit. The fabricated chip was tested and a parallel operation of eight general purpose cores and four FE cores and eight data transfer units (DTU) is obtained on AAC (Advanced Audio Coding) encode processing.

    DOI

    Scopus

  • Parallelizing Compiler Framework and API for Power Reduction and Software Productivity of Real-Time Heterogeneous Multicores

    Akihiro Hayashi, Yasutaka Wada, Takeshi Watanabe, Takeshi Sekiguchi, Masayoshi Mase, Jun Shirako, Keiji Kimura, Hironori Kasahara

    LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING   6548   184 - 198  2011  [Refereed]

     View Summary

    Heterogeneous multicores have been attracting much attention to attain high performance keeping power consumption low in wide spread of areas. However, heterogeneous multicores force programmers very difficult programming. The long application program development period lowers product competitiveness. In order to overcome such a situation, this paper proposes a compilation framework which bridges a gap between programmers and heterogeneous multicores. In particular, this paper describes the compilation framework based on OSCAR compiler. It realizes coarse grain task parallel processing, data transfer using a DMA controller, power reduction control from user programs with DVFS and clock gating on various heterogeneous multicores from different vendors. This paper also evaluates processing performance and the power reduction by the proposed framework on a newly developed 15 core heterogeneous multicore chip named RP-X integrating 8 general purpose processor cores and 3 types of accelerator cores which was developed by Renesas Electronics, Hitachi, Tokyo Institute of Technology and Waseda University. The framework attains speedups up to 32x for an optical flow program with eight general purpose processor cores and four DRP(Dynamically Reconfigurable Processor) accelerator cores against sequential execution by a single processor core and 80% of power reduction for the real-time AAC encoding.

  • A parallelizing compiler cooperative heterogeneous multicore processor architecture

    Yasutaka Wada, Akihiro Hayashi, Takeshi Masuura, Jun Shirako, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)   6760   215 - 233  2011

     View Summary

    Heterogeneous multicore architectures, integrating several kinds of accelerator cores in addition to general purpose processor cores, have been attracting much attention to realize high performance with low power consumption. To attain effective high performance, high application software productivity, and low power consumption on heterogeneous multicores, cooperation between an architecture and a parallelizing compiler is important. This paper proposes a compiler cooperative heterogeneous multicore architecture and parallelizing compilation scheme for it. Performance of the proposed scheme is evaluated on the heterogeneous multicore integrating Hitachi and Renesas' SH4A processor cores and Hitachi's FE-GA accelerator cores, using an MP3 encoder. The heterogeneous multicore gives us 14.34 times speedup with two SH4As and two FE-GAs, and 26.05 times speedup with four SH4As and four FE-GAs against sequential execution with a single SH4A. The cooperation between the heterogeneous multicore architecture and the parallelizing compiler enables to achieve high performance in a short development period. © 2011 Springer-Verlag Berlin Heidelberg.

    DOI

  • Parallelizable C and Its Performance on Low Power High Performance Multicore Processors

    Masayoshi Mase, Yuto Onozaki, Keiji Kimura, Hironori Kasahara

    Proc. of 15th Workshop on Compilers for Parallel Computing (CPC 2010)    2010.07  [Refereed]

  • Element-Sensitive Pointer Analysis for Automatic Parallelization

    Masayoshi Mase, Yuta Murata, Keiji Kimura, Hironori Kasahara

    IPSJ Transactions on Programming (PRO)   3 ( 2 ) 36 - 47  2010.03  [Refereed]

  • A 45nm 37.3GOPS/W heterogeneous multi-core SoC

    Yoichi Yuyama, Masayuki Ito, Yoshikazu Kiyoshige, Yusuke Nitta, Shigezumi Matsui, Osamu Nishii, Atsushi Hasegawa, Makoto Ishikawa, Tetsuya Yamada, Junichi Miyakoshi, Koichi Terada, Tohru Nojiri, Makoto Satoh, Hiroyuki Mizuno, Kunio Uchiyama, Yasutaka Wada, Keiji Kimura, Hironori Kasahara, Hideo Maejima

    Digest of Technical Papers - IEEE International Solid-State Circuits Conference   53   100 - 101  2010

     View Summary

    We develop a heterogeneous multi-core SoC for applications, such as digital TV systems with IP networks (IP-TV) including image recognition and database search. Figure 5.3.1 shows the chip features. This SoC is capable of decoding 1080i audio/video data using a part of SoC (one general-purpose CPU core, video processing unit called VPU5 and sound processing unit called SPU) [1]. Four dynamically reconfigurable processors called FE [2] are integrated and have a total theoretical performance of 41.5GOPS and power consumption of 0.76W. Two 1024-way matrix-processors called MX-2 [3] are integrated and have a total theoretical performance of 36.9GOPS and power consumption of 1.10W. Overall, the performance per watt of our SoC is 37.3GOPS/W at 1.15V, the highest among comparable processors [4-6] excluding special-purpose codecs. The operation granularity of the CPU, FE and MX-2 are 32bit, 16bit, and 4bit respectively, and thus we can assign the appropriate processor for each task in an effective manner. A heterogeneous multi-core approach is one of the most promising approaches to attain high performance with low frequency, or low power, for consumer electronics application and scientific applications, compared to homogeneous multi-core SoCs [4]. For example, for image-recognition application in the IP-TV system, the FEs are assigned to calculate optical flow operation [7] of VGA (640x480) size video data at 15fps, which requires 0.62GOPS. The MX-2s are used for face detection and calculation of the feature quantity of the VGA video data at 15fps, which requires 30.6GOPS. In addition, general-purpose CPU cores are used for database search using the results of the above operations, which requires further enhancement of CPU. The automatic parallelization compilers analyze parallelism of the data flow, generate coarse grain tasks, schedule tasks to minimize execution time considering data transfer overhead for general-purpose CPU and FE. ©2010 IEEE.

    DOI

    Scopus

    33 Citations (Scopus)

  • OSCAR API for Real-Time Low-Power Multicores and Its Performance on Multicores and SMP Servers

    Keiji Kimura, Masayoshi Mase, Hiroki Mikami, Takamichi Miyamoto, Jun Shirako, Hironori Kasahara

    LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING   5898   188 - 202  2010  [Refereed]

    Authorship:Lead author

     View Summary

    OSCAR (Optimally Scheduled Advanced Multiprocessor) API has been designed for real-time embedded low-power multicores to generate parallel programs for various multicores from different vendors by using the OSCAR parallelizing compiler. The OSCAR API has been developed by Waseda University in collaboration with Fujitsu Laboratory, Hitachi, NEC, Panasonic, Renesas Technology, and Toshiba in an METI/NEDO project entitled "Multicore Technology for Realtime Consumer Electronics." By using the OSCAR API as an interface between the OSCAR compiler and backend compilers, the OSCAR compiler enables hierarchical multigrain parallel processing with memory optimization under capacity restriction for cache memory, local memory, distributed shared memory, and on-chip/off-chip shared memory; data transfer using a DMA controller; and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating for various embedded multicores. In addition, a parallelized program automatically generated by the OSCAR, compiler with OSCAR API can be compiled by the ordinary OpenMP compilers since the OSCAR API is designed on a subset of the OpenMP. This paper describes the OSCAR API and its compatibility with the OSCAR compiler by showing code examples. Performance evaluations of the OSCAR compiler and the OSCAR. API are carried out using an IBM Power5+ workstation, an IBM Power6 high-end SMP server, and a newly developed consumer electronics multicore chip RP2 by Renesas, Hitachi and Waseda. From the results of scalability evaluation, it is found that on an average, the OSCAR compiler with the OSCAR API can exploit 5.8 times speedup over the sequential execution on the Power5+ workstation with eight cores and 2.9 times speedup on RP2 with four cores, respectively. In addition, the OSCAR compiler can accelerate an IBM XL Fortran compiler up to 3.3 times on the Power6 SMP server. Due to low-power optimization on RP2, the OSCAR compiler with the OSCAR API achieves a maximum power reduction of 84% in the real-time execution mode.

  • A Power Reduction Scheme of Parallelizing Compiler Using OSCAR API on Multicore Processor

    Masayoshi Mase, Ryo Nakagawa, Naoto Ohkuni, Jun Shirako, Keiji Kimura, Hironori Kasahara

    IPSJ Transactions on Advanced Computing Systems   2 ( 3 ) 96 - 106  2009.09  [Refereed]

  • A Power Reduction Scheme by a Parallelizing Compiler Using the OSCAR API on Multicores

    中川亮, 間瀬正啓, 大國直人, 白子準, 木村啓二, 笠原博徳

    Symposium on Advanced Computing Systems and Infrastructures (SACSIS2009)     3 - 10  2009.05

  • Performance of OSCAR Multigrain Parallelizing Compiler on Multicore Processors

    Hiroki Mikami, Jun Shirako, Masayoshi Mase, Takamichi Miyamoto, Hirofumi Nakano, Fumiyo Takano, Akihiro Hayashi, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

    Proc. of 14th Workshop on Compilers for Parallel Computing(CPC 2009)    2009.01  [Refereed]

  • Green multicore-SoC software-execution framework with timely-power-gating scheme

    Masafumi Onouchi, Keisuke Toyama, Toru Nojiri, Makoto Sato, Masayoshi Mase, Jun Shirako, Mikiko Sato, Masashi Takada, Masayuki Ito, Hiroyuki Mizuno, Mitaro Namiki, Keiji Kimura, Hironori Kasahara

    Proceedings of the International Conference on Parallel Processing     510 - 517  2009

     View Summary

    We are developing a software-execution framework based on an octo-core chip multiprocessor named RP2 and an automatic multigrain-parallelizing compiler named OSCAR. The main purpose of this framework is to maintain good speed scalability and power efficiency over the number of processor cores under severe hardware restrictions for embedded use. Key to the speed scalability is reduction of the communication overhead of parallelized tasks. A data-categorization scheme enables small-overhead cache-coherency maintenance by using directives and instructions from the compiler. In this scheme, the number of cache flushes is minimized and parallelized tasks are quickly synchronized by using flags in local memory. As regards power efficiency, power supply to processor cores waiting for other cores is cut off in a timely and frequent manner, even in the middle of an application, by using a timely-power-gating scheme. In this scheme, to achieve quick mode transition between "NORMAL" mode and "RESUME POWER-OFF" mode, register values of the processor core are stored in core-local memory, which remains active even in "RESUME POWER-OFF" mode and can be accessed in one or two clock cycles. Measured speed and power of an application show good speed scalability in execution time and high power efficiency simultaneously. In the case of a secure AAC-LC encoding program, execution speed with eight processor cores is 4.85 times that of sequential execution. Moreover, power consumption under the same condition is reduced by 51.0% by parallelization and timely power gating. The time for mode transition is less than 20 μsec, which is only 2.5% of the "RESUME POWER-OFF" period. © 2009 IEEE.

    DOI

    Scopus

    1
    Citation
    (Scopus)
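    A rough, portable-C model of the flag-based synchronization and timely power gating described above: a worker that finishes its parallel task spins briefly on a flag standing in for core-local memory and, if no new work arrives, "saves its context" and requests power-off. save_context_to_local_memory() and request_power_off() are hypothetical stubs for the hardware/runtime operations, and the spin count and timing are arbitrary.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    typedef struct {
        atomic_int work_ready;   /* flag another core sets when a task is queued */
    } core_local_t;              /* stands in for core-local memory              */

    static core_local_t core0_local;

    static void save_context_to_local_memory(void) { puts("core0: context saved to local RAM"); }
    static void request_power_off(void)            { puts("core0: entering RESUME POWER-OFF"); }

    static void *worker(void *arg) {
        core_local_t *lm = arg;
        for (int spin = 0; spin < 1000000; spin++)   /* short, cheap spin on the local flag */
            if (atomic_load_explicit(&lm->work_ready, memory_order_acquire)) {
                puts("core0: new task received, staying powered");
                return NULL;
            }
        save_context_to_local_memory();              /* local RAM stays powered while off   */
        request_power_off();                         /* cut power until the next task       */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        atomic_init(&core0_local.work_ready, 0);
        pthread_create(&t, NULL, worker, &core0_local);
        usleep(1000);                                /* no task arrives in time in this run */
        pthread_join(t, NULL);
        return 0;
    }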
  • An Evaluation of Parallelization with Automatic Parallelizing Compiler Generating Consumer Electronics Multicore API

    Takamichi Miyamoto, Saori Asaka, Hiroki Mikami, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

    IPSJ Transactions on Advanced Computing Systems   1 ( 3 ) 83 - 95  2008.12  [Refereed]

    CiNii

  • Parallelizing Compiler Cooperative Heterogeneous Multicore

    Yasutaka Wada, Akihiro Hayashi, Takeshi Masuura, Jun Shirako, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

    Proc. of Workshop on Software and Hardware Challenges of Manycore Platforms (SHCMP 2008)    2008.06  [Refereed]

  • Parallelization of MP3 Encoder using Static Scheduling on a Heterogeneous Multicore

    Yasutaka Wada, Akihiro Hayashi, Takeshi Masuura, Jun Shirako, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

    Trans. of IPSJ on Computing Systems   1 ( 1 ) 105 - 119  2008.06  [Refereed]

    CiNii

  • Compiler Parallelization of Multimedia Processing on Multicores for Consumer Electronics

    宮本孝道, 浅香沙織, 見神広紀, 間瀬正啓, 木村啓二, 笠原博徳

    SACSIS2008 - Symposium on Advanced Computing Systems and Infrastructures    2008.05

  • Power-aware compiler controllable chip multiprocessor

    Hiroaki Shikano, Jun Shirako, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

    IEICE TRANSACTIONS ON ELECTRONICS   E91C ( 4 ) 432 - 439  2008.04  [Refereed]

     View Summary

    A power-aware compiler controllable chip multiprocessor (CMP) is presented and its performance and power consumption are evaluated with the optimally scheduled advanced multiprocessor (OSCAR) parallelizing compiler. The CMP is equipped with power control registers that change clock frequency and power supply voltage to functional units including processor cores, memories, and an interconnection network. The OSCAR compiler carries out coarse-grain task parallelization of programs and reduces power consumption using architectural power control support and the compiler's power saving scheme. The performance evaluation shows that MPEG-2 encoding on the proposed CMP with four CPUs results in 82.6% power reduction in real-time execution mode with a deadline constraint on its sequential execution time. Furthermore, MP3 encoding on a heterogeneous CMP with four CPUs and four accelerators results in 53.9% power reduction at 21.1-fold speed-up in performance against its sequential execution in the fastest execution mode.

    DOI

    Scopus

    1
    Citation
    (Scopus)
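    The following sketch pictures compiler-directed DVFS through power-control registers in the spirit of the abstract above. The register layout, field encodings, and frequency codes are assumptions made for illustration, and the registers are modeled as a plain array rather than the chip's actual memory-mapped block.

    #include <stdint.h>
    #include <stdio.h>

    #define N_DOMAINS 4        /* e.g. processor cores, memories, interconnect */

    /* Simulated power-control register file: one 32-bit register per domain,
       low byte = frequency code, next byte = voltage code (hypothetical). */
    static volatile uint32_t power_ctrl_reg[N_DOMAINS];

    static void set_domain_fv(int domain, unsigned freq_code, unsigned volt_code) {
        power_ctrl_reg[domain] = (uint32_t)(volt_code << 8 | freq_code);
        printf("domain %d: freq code=%u, volt code=%u\n", domain, freq_code, volt_code);
    }

    int main(void) {
        /* A schedule a compiler might emit at a coarse-grain task boundary:
           run the critical core at full speed, slow one lightly loaded core,
           and gate the remaining domains until the deadline. */
        enum { F_OFF = 0, F_QUARTER = 1, F_HALF = 2, F_FULL = 3 };
        set_domain_fv(0, F_FULL, 3);
        set_domain_fv(1, F_HALF, 2);
        set_domain_fv(2, F_OFF,  0);
        set_domain_fv(3, F_OFF,  0);
        return 0;
    }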
  • Heterogeneous multi-core architecture that enables 54x AAC-LC stereo encoding

    Hiroaki Shikano, Masaki Ito, Masafumi Onouchi, Takashi Todaka, Takanobu Tsunoda, Tomoyuki Kodama, Kunio Uchiyama, Toshihiko Odaka, Tatsuya Kamei, Ei Nagahama, Manabu Kusaoke, Yusuke Nitta, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

    IEEE JOURNAL OF SOLID-STATE CIRCUITS   43 ( 4 ) 902 - 910  2008.04  [Refereed]

     View Summary

    This paper describes a heterogeneous multi-core processor (HMCP) architecture that integrates general-purpose processors (CPUs) and accelerators (ACCs) to achieve exceptional performance as well as low-power consumption for the SoCs of embedded systems. The memory architectures of CPUs and ACCs were unified to improve programming and compiling efficiency. Advanced audio codec-low complexity (AAC-LC) stereo audio encoding was parallelized on a heterogeneous multi-core having homogeneous processor cores and dynamically reconfigurable processor (DRP) ACC cores in a preliminary evaluation of the HMCP architecture. The performance evaluation revealed that 54x AAC encoding was achieved on the chip with two CPUs at 600 MHz and two DRPs at 300 MHz, which achieved encoding of an entire CD within 1-2 min.

    DOI

    Scopus

    16
    Citation
    (Scopus)
  • An 8 CPU SoC with Independent Power-off Control of CPUs and Multicore Software Debug Function

    Yutaka Yoshida, Masayuki Ito, Kiyoshi Hayase, Tomoichi Hayashi, Osamu Nishii, Toshihiro Hattori, Jun Sakiyama, Masashi Takada, Kunio Uchiyama, Jun Shirako, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

    Proc. of IEEE Cool Chips XI: Symposium on Low-Power and High-Speed Chips 2008    2008.04  [Refereed]

  • A 600MHz SoC with Compiler Power-off Control of 8 CPUs and 8 Onchip-RAMs

    Masayuki Ito, Toshihiro Hattori, Yutaka Yoshida, Kiyoshi Hayase, Tomoichi Hayashi, Osamu Nishii, Yoshihiko Yasu, Atsushi Hasegawa, Masashi Takada, Masaki Ito, Hiroyuki Mizuno, Kunio Uchiyama, Toshihiko Odaka, Jun Shirako, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

    Proc. of International Solid State Circuits Conference (ISSCC2008)     90 - 91  2008.02  [Refereed]

  • An 8640 MIPS SoC with independent power-off control of 8 CPUs and 8 RAMs by an automatic parallelizing compiler

    Masayuki Ito, Toshihiro Hattori, Yutaka Yoshida, Kiyoshi Hayase, Tomoichi Hayashi, Osamu Nishii, Yoshihiko Yasu, Atsushi Hasegawa, Masashi Takada, Masaki Ito, Hiroyuki Mizuno, Kunio Uchiyama, Toshihiko Odaka, Jun Shirako, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

    Digest of Technical Papers - IEEE International Solid-State Circuits Conference   51   81 - 598  2008  [Refereed]

     View Summary

    A 104.8mm2 90nm CMOS 600MHz SoC integrates 8 processor cores and 8 user RAMs in 17 separate power domains and delivers 33.6GFLOPS. An automatic parallelizing compiler assigns tasks to each CPU and controls its power mode including power supply in accordance with its processing load and status. The compiler also uses barrier registers to achieve fast and accurate CPU synchronization. ©2008 IEEE.

    DOI

    Scopus

    37
    Citation
    (Scopus)
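    A minimal software model of the barrier-register idea mentioned above: each core sets its bit in a shared word and waits until all bits are set. Here the "register" is a C11 atomic and the "cores" are threads; the bit layout and core count are illustrative assumptions, not the chip's actual barrier-register specification.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NCORES 4
    static atomic_uint barrier_reg;                  /* stands in for the hardware register */
    static const unsigned ALL = (1u << NCORES) - 1;

    static void *core(void *arg) {
        unsigned id = (unsigned)(size_t)arg;
        printf("core %u: task done\n", id);
        atomic_fetch_or_explicit(&barrier_reg, 1u << id, memory_order_release);
        while ((atomic_load_explicit(&barrier_reg, memory_order_acquire) & ALL) != ALL)
            ;                                        /* a hardware barrier would stall here */
        printf("core %u: past barrier\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t[NCORES];
        atomic_init(&barrier_reg, 0);
        for (size_t i = 0; i < NCORES; i++) pthread_create(&t[i], NULL, core, (void *)i);
        for (size_t i = 0; i < NCORES; i++) pthread_join(t[i], NULL);
        return 0;
    }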
  • Performance evaluation of compiler controlled power saving scheme

    Jun Shirako, Munehiro Yoshida, Naoto Oshiyama, Yasutaka Wada, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

    HIGH-PERFORMANCE COMPUTING   4759   480 - 493  2008  [Refereed]

     View Summary

    Multicore processors, or chip multiprocessors, which allow us to realize low power consumption, high effective performance, good cost performance and short hardware/software development period, are attracting much attention. In order to achieve full potential of multicore processors, cooperation with a parallelizing compiler is very important. The latest compiler extracts multilevel parallelism, such as coarse grain task parallelism, loop parallelism and near fine grain parallelism, to keep parallel execution efficiency high. It also controls voltage and clock frequency of processors carefully to reduce energy consumption during execution of an application program. This paper evaluates performance of compiler controlled power saving scheme which has been implemented in OSCAR multigrain parallelizing compiler. The developed power saving scheme realizes voltage/frequency control and power shutdown of each processor core during coarse grain task parallel processing. In performance evaluation, when static power is assumed as one-tenth of dynamic power, OSCAR compiler with the power saving scheme achieved 61.2 percent energy reduction for SPEC CFP95 applu without performance degradation on 4 processors and 87.4 percent energy reduction for mpeg2encode, 88.1 percent energy reduction for SPEC CFP95 tomcatv and 84.6 percent energy reduction for applu with real-time deadline constraint on 4 processors.

  • Software-cooperative power-efficient heterogeneous multi-core for media processing

    Hiroaki Shikano, Masaki Ito, Kunio Uchiyama, Toshihiko Odaka, Akihiro Hayashi, Takeshi Masuura, Masayoshi Mase, Jun Shirako, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

    2008 ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE, VOLS 1 AND 2     712 - +  2008  [Refereed]

     View Summary

    A heterogeneous multi-core processor (HMCP) architecture, which integrates general purpose processors (CPU) and accelerators (ACC) to achieve high-performance as well as low-power consumption with the support of a parallelizing compiler, was developed. The evaluation was performed using an MP3 audio encoder on a simulator that accurately models the HMCP, It showed that 16-frame encoding on the HMCP with four CPUs and four ACCs yielded 24.5-fold speed-up of performance against sequential execution on one CPU. Furthermore, power saving by the compiler reduced energy consumption of the encoding to 0.17 J, namely, by 28.4%.

  • Power Reduction Control for Multicores in OSCAR Multigrain Parallelizing Compiler

    Jun Shirako, Keiji Kimura, Hironori Kasahara

    ISOCC: 2008 INTERNATIONAL SOC DESIGN CONFERENCE, VOLS 1-3     50 - 55  2008  [Refereed]

     View Summary

    Multicore processors have become the mainstream computer architecture to go beyond the performance and power efficiency limits of single-core processors. To achieve low power consumption and high performance on multicores, parallelizing compilers take on an important role. This paper describes the performance of a compiler-based power reduction scheme cooperating with the OSCAR multigrain parallelizing compiler on a newly developed 8-way SH4A low power multicore chip for consumer electronics, which supports DVFS (Dynamic Voltage and Frequency Scaling) and clock/power gating. Using hardware parameters and parallelized program information, the OSCAR compiler determines a suitable voltage and frequency for each active processor core and an appropriate schedule of clock gating and power gating. Performance experiments show that the compiler reduces consumed power by 88.3%, namely from 5.68 W to 0.67 W, for real-time secure AAC encoding, and by 73.5%, namely from 5.73 W to 1.52 W, for real-time MPEG2 decoding on 8-core execution.

  • Parallelization with Automatic Parallelizing Compiler Generating Consumer Electronics Multicore API

    Takamichi Miyamoto, Saori Asaka, Hiroki Mikami, Masayoshi Mase, Yasutaka Wada, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

    PROCEEDINGS OF THE 2008 INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING WITH APPLICATIONS     600 - 607  2008  [Refereed]

     View Summary

    Multicore processors have been adopted for consumer electronics like portable electronics, mobile phones, car navigation systems, digital TVs and games to obtain high performance with low power consumption. The OSCAR automatic parallelizing compiler has been developed to utilize these multicores easily. Also, a new Consumer Electronics Multicore Application Program Interface (API) to use the OSCAR compiler with native sequential compilers for various kinds of multicores from different vendors has been developed in the NEDO (New Energy and Industrial Technology Development Organization) "Multicore Technology for Realtime Consumer Electronics" project with 6 Japanese IT companies. This paper evaluates the parallel processing performance of multimedia applications using this API and the OSCAR compiler on the FR1000 multicore processor with 4 VLIW cores developed by Fujitsu Ltd., and the RP1 multicore processor with 4 SH-4A cores jointly developed by Renesas Technology Corp., Hitachi Ltd. and Waseda University. As a result, the parallel codes generated by the OSCAR compiler using the API give us 3.27 times speedup on average using 4 cores against 1 core on the FR1000 multicore, and 3.31 times speedup on average using 4 cores against 1 core on the RP1 multicore.

    DOI

    Scopus

    6
    Citation
    (Scopus)
  • Multigrain Parallelization of Restricted C Programs in the SMP Execution Mode of Multicores for Consumer Electronics

    間瀬正啓, 馬場大介, 長山晴美, 田野裕秋, 益浦健, 宮本孝道, 白子準, 中野啓史, 木村啓二, 笠原博徳

    Embedded Systems Symposium 2007    2007.10

  • Performance Evaluation of MP3 Audio Encoder on OSCAR Heterogeneous Chip Multicore Processor

    Hiroaki Shikano, Yuki Suzuki, Yasutaka Wada, Jun Shirako, Keiji Kimura, Hironori Kasahara

    Trans. of IPSJ on Computing Systems   Vol. 48, No. SIG8(ACS18),   141 - 152  2007.05  [Refereed]

  • Power-aware compiler controllable chip multiprocessor

    Hiroaki Shikano, Jun Shirako, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

    Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT     427  2007  [Refereed]

    DOI

    Scopus

    1
    Citation
    (Scopus)
  • A 4320MIPS four-processor core SMP/AMP with individually managed clock frequency for low power consumption

    Yutaka Yoshida, Tatsuya Kamei, Kiyoshi Hayase, Shinichi Shibahara, Osamu Nishii, Toshihiro Hattori, Atsushi Hasegawa, Masashi Takada, Naohiko Irie, Kunio Uchiyama, Toshihiko Odaka, Kiwamu Takada, Keiji Kimura, Hironori Kasahara

    Digest of Technical Papers - IEEE International Solid-State Circuits Conference     95 - 590  2007

     View Summary

    A 4320MIPS four-core SoC that supports both SMP and AMP for embedded applications is designed in 90nm CMOS. Each processor-core can be operated with a different frequency dynamically including clock stop, while keeping data cache coherency, to maintain maximum processing performance and to reduce average operating power. The 97.6mm2 die achieves a floating-point performance of 16.8GFLOPS. © 2007 IEEE.

    DOI

    Scopus

    26
    Citation
    (Scopus)
  • Heterogeneous multiprocessor on a chip which enables 54x AAC-LC stereo encoding

    Masaki Ito, Takashi Todaka, Takanobu Tsunoda, Hiroshi Tanaka, Tomoyuki Kodama, Hiroaki Shikano, Masafumi Onouchi, Kunio Uchiyama, Toshihiko Odaka, Tatsuya Kamei, Ei Nagahama, Manabu Kusaoke, Yusuke Nitta, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

    2007 Symposium on VLSI Circuits, Digest of Technical Papers     18 - 19  2007  [Refereed]

     View Summary

    A heterogeneous multiprocessor on a chip has been designed and implemented. It consists of 2 CPUs and 2 DRPs (Dynamically Reconfigurable Processors). The DRP was designed to achieve high performance in a small area so that it can be integrated on an SoC for embedded systems. The memory architectures of the CPUs and DRPs were unified to improve programming and compiling efficiency. 54x AAC-LC stereo encoding has been enabled with 2 DRPs at 300MHz and 2 CPUs at 600MHz.

  • Compiler Control Power Saving Scheme for Multicore Processors

    Jun Shirako, Munehiro Yoshida, Naoto Oshiyama, Yasutaka Wada, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

    Trans. of IPSJ on Computing Systems   Vol. 47(ACS15)  2006.09  [Refereed]

  • A Compiler-Controlled Power Saving Scheme for Multicore Processors

    白子準, 吉田宗広, 押山直人, 和田康孝, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

    Symposium on Advanced Computing Systems and Infrastructures (SACSIS2006)   ( 467 ) 476  2006.05

  • Performance Evaluation of Heterogeneous Chip Multi-Processor with MP3 Audio Encoder

    Hiroaki Shikano, Yuki Suzuki, Yasutaka Wada, Jun Shirako, Keiji Kimura, Hironori Kasahara

    Proc. of IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips IX)     349 - 363  2006.05  [Refereed]

  • Compiler control power saving scheme for multi core processors

    Jun Shirako, Naoto Oshiyama, Yasutaka Wada, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)   4339   362 - 376  2006

     View Summary

    With the increase of transistors integrated onto a chip, multi core processor architectures have attracted much attention as a way to achieve high effective performance, shorten the development period and reduce power consumption. To this end, the compiler for a multi core processor is expected not only to parallelize programs effectively, but also to control the voltage and clock frequency of processors and storages carefully inside an application program. This paper proposes a compilation scheme for reduction of power consumption under a multigrain parallel processing environment that controls the voltage/frequency and power supply of each processor core on a chip. In the evaluation, the OSCAR compiler with the proposed scheme achieves 60.7 percent energy savings for SPEC CFP95 applu without performance degradation on 4 processors, 45.4 percent energy savings for SPEC CFP95 tomcatv with a real-time deadline constraint on 4 processors, and 46.5 percent energy savings for SPEC CFP95 swim with the deadline constraint on 4 processors. © 2006 Springer-Verlag Berlin Heidelberg.

    DOI

    Scopus

    18
    Citation
    (Scopus)
  • Programming for Multicore Systems

    Keiji Kimura, Hironori Kasahara

    IPSJ MAGAZINE   47 ( 1 ) 17 - 23  2006.01  [Invited]

    Authorship:Lead author

  • Multicores Emerge as Next Generation Microprocessors

    Hironori Kasahara, Keiji Kimura

    IPSJ MAGAZINE   47 ( 1 ) 10 - 16  2006.01  [Refereed]

  • Parallelizing Compilation Scheme for Reduction of Power Consumption of Chip Multiprocessors

    Jun Shirako, Naoto Oshiyama, Yasutaka Wada, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

    Proc. of 12th Workshop on Compilers for Parallel Computers (CPC 2006),    2006.01  [Refereed]

  • Data Localization on Multicore Processors

    中野啓文, 浅野尚一郎, 内藤陽介, 仁藤拓実, 田川友博, 宮本孝道, 小高剛, 木村啓二, 笠原博徳

    IPSJ SIG Technical Report   ARC2005-165-10  2005.12

  • Parallel Processing of MPEG2 Encoding on a Chip Multiprocessor Architecture

    Takeshi Kodaka, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

    Trans. of IPSJ   46 ( 9 ) 2311 - 2325  2005.09  [Refereed]

  • A Compiler-Controlled Power Saving Scheme for Homogeneous Multicores

    白子準, 押山直人, 和田康孝, 鹿野裕明, 木村啓二, 笠原博徳

    IPSJ SIG Technical Report   ARC2005-164-10 (SWoPP2005)  2005.08

  • Performance of OSCAR multigrain parallelizing compiler on SMP servers

    K Ishizaka, T Miyamoto, J Shirako, M Obata, K Kimura, H Kasahara

    LANGUAGES AND COMPILERS FOR HIGH PERFORMANCE COMPUTING   3602   319 - 331  2005  [Refereed]

     View Summary

    This paper describes the performance of the OSCAR multigrain parallelizing compiler on various SMP servers, such as IBM pSeries 690, Sun Fire V880, Sun Ultra 80, NEC TX7/i6010 and SGI Altix 3700. The OSCAR compiler hierarchically exploits the coarse grain task parallelism among loops, subroutines and basic blocks and the near fine grain parallelism among statements inside a basic block, in addition to the loop parallelism. Also, it allows global cache optimization over different loops, or coarse grain tasks, based on a data localization technique with inter-array padding to reduce memory access overhead. The current performance of the OSCAR compiler is evaluated on the above SMP servers. For example, the OSCAR compiler generating OpenMP parallelized programs from ordinary sequential Fortran programs gives us 5.7 times speedup, on the average of seven programs such as SPEC CFP95 tomcatv, swim, su2cor, hydro2d, mgrid, applu and turb3d, compared with the IBM XL Fortran compiler 8.1 on an IBM pSeries 690 24-processor SMP server. Also, it gives us 2.6 times speedup compared with the Intel Fortran Itanium Compiler 7.1 on an SGI Altix 3700 Itanium 2 16-processor server, 1.7 times speedup compared with the NEC Fortran Itanium Compiler 3.4 on a NEC TX7/i6010 Itanium 2 8-processor server, 2.5 times speedup compared with Sun Forte 7.0 on a Sun Ultra 80 UltraSPARC II 4-processor desktop workstation, and 2.1 times speedup compared with the Sun Forte compiler 7.1 on a Sun Fire V880 UltraSPARC III Cu 8-processor server.
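    The inter-array padding mentioned above can be pictured with the following sketch: two arrays that would otherwise be spaced by a multiple of the cache way size are separated by a small pad array, so streaming accesses that touch both no longer collide in the same cache sets. The 32 KiB way size, the pad amount, and the assumption that the arrays are laid out contiguously are all illustrative, not measurements from the paper.

    #include <stdio.h>

    #define N (32 * 1024 / sizeof(double))   /* one array spans one assumed 32 KiB cache way */

    /* Without the pad, a[i] and b[i] would tend to map to the same cache set
       whenever the two arrays are laid out back to back. */
    static double a[N];
    static double pad[16];                   /* inter-array padding: shifts b by 128 bytes   */
    static double b[N];

    int main(void) {
        double s = 0.0;
        for (unsigned i = 0; i < N; i++) {   /* streaming access touching both arrays        */
            a[i] = (double)i;
            b[i] = 2.0 * (double)i;
            s += a[i] + b[i];
        }
        printf("sum=%g (pad[0]=%g keeps the pad referenced)\n", s, pad[0]);
        return 0;
    }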

  • Multigrain parallel processing on compiler cooperative chip multiprocessor

    K Kimura, Y Wada, H Nakano, T Kodaka, J Shirako, K Ishizaka, H Kasahara

    9TH ANNUAL WORKSHOP ON INTERACTION BETWEEN COMPILERS AND COMPUTER ARCHITECTURES, PROCEEDINGS     11 - 20  2005  [Refereed]

    Authorship:Lead author

     View Summary

    This paper describes multigrain parallel processing on a compiler cooperative chip multiprocessor. The multigrain parallel processing hierarchically exploits multiple grains of parallelism such as coarse grain task parallelism, loop iteration level parallelism and statement level near-fine grain parallelism. The chip multiprocessor has been designed to attain high effective performance, cost effectiveness and high software productivity by supporting the optimizations of the multigrain parallelizing compiler, which was developed in the Japanese Millennium Project IT21 "Advanced Parallelizing Compiler". To achieve the full potential of multigrain parallel processing, the chip multiprocessor integrates simple single-issue processors having distributed shared data memory for both optimal use of data locality and scalar data transfer, and local data memory for processor private data, in addition to centralized shared memory for data shared among processors. This paper focuses on the scalability of the chip multiprocessor having up to eight processors on a chip by exploiting the multigrain parallelism of SPECfp95 programs. When a microSPARC-like simple processor core is used under the assumption of 90 nm technology and 2.8 GHz, the evaluation results show the speedups for eight processors and four processors reach 7.1 and 3.9, respectively. Similarly, when 400 MHz is assumed for embedded usage, the speedups reach 7.8 and 4.0, respectively.

  • Memory management for data localization on OSCAR chip multiprocessor

    H Nakano, T Kodaka, K Kimura, H Kasahara

    INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS, PROCEEDINGS     82 - 88  2004  [Refereed]

     View Summary

    Chip Multiprocessor (CMP) architecture has been attracting much attention as a next-generation microprocessor architecture and many kinds of CMP are widely being researched. However, CMP architectures have several difficulties in the effective use of memory, especially the cache or local memory near a processor core. The authors have proposed the OSCAR CMP architecture, which cooperatively works with a multigrain parallelizing compiler that gives us much higher parallelism than instruction level parallelism or loop level parallelism and high productivity of application programs. To support the compiler optimization for effective use of cache or local memory, OSCAR CMP has local data memory (LDM) for processor private data and distributed shared memory (DSM) for synchronization and fine grain data transfers among processors, in addition to centralized shared memory (CSM) to support dynamic task scheduling. This paper proposes a static coarse grain task scheduling scheme for data localization using live variable analysis. Furthermore, a remote memory data transfer scheduling scheme using the information of live variable analysis is also described. The proposed scheme is implemented on the OSCAR FORTRAN multigrain parallelizing compiler and is evaluated on OSCAR CMP using Tomcatv and Swim from the SPEC CFP95 benchmark.

  • Parallel processing using data localization for MPEG2 encoding on OSCAR chip multiprocessor

    T Kodaka, H Nakano, K Kimura, H Kasahara

    INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS, PROCEEDINGS     119 - 127  2004  [Refereed]

     View Summary

    Currently, many people are enjoying multimedia applications with image and audio processing on PCs, PDAs, mobile phones and so on. With the popularization of multimedia applications, the need for low cost, low power consumption and high performance processors has been increasing. To this end, chip multiprocessor architectures, which allow us to attain scalable performance improvement by using multigrain parallelism, are attracting much attention. However, in order to extract higher performance on a chip multiprocessor, more sophisticated software techniques are required, such as decomposing a program into tasks of adequate grain, and assigning them onto processors considering parallelism, data locality optimization and so on. This paper describes a parallel processing scheme for MPEG2 encoding using data localization, which improves execution efficiency by consecutively assigning coarse grain tasks sharing the same data to the same processor on a chip multiprocessor. The performance evaluation on the OSCAR chip multiprocessor architecture shows that the proposed scheme gives us 6.97 times speedup using 8 processors and 10.93 times speedup using 16 processors against sequential execution time, respectively. Moreover, the proposed scheme gives us 1.61 times speedup using 8 processors and 2.08 times speedup using 16 processors against loop parallel processing, which has been widely used for multiprocessor systems, using the same number of processors.

  • Static coarse grain task scheduling with cache optimization using OpenMP

    H Nakano, K Ishizaka, M Obata, K Kimura, H Kasahara

    INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING   31 ( 3 ) 211 - 223  2003.06  [Refereed]

     View Summary

    Effective use of cache memory is getting more important with the increasing gap between processor speed and memory access speed. Also, use of multigrain parallelism is getting more important to improve effective performance beyond the limitation of loop iteration level parallelism. Considering these factors, this paper proposes a coarse grain task static scheduling scheme considering cache optimization. The proposed scheme schedules coarse grain tasks to threads so that shared data among coarse grain tasks can be passed via cache, after task and data decomposition considering the cache size at compile time. It is implemented on the OSCAR Fortran multigrain parallelizing compiler and evaluated on a Sun Ultra80 four-processor SMP workstation using Swim and Tomcatv from the SPEC fp 95. As the results show, the proposed scheme gives us 4.56 times speedup for Swim and 2.37 times speedup for Tomcatv on 4 processors, respectively, against the Sun Forte HPC Ver. 6 update 1 loop parallelizing compiler.
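    A minimal sketch of the cache-conscious task and data decomposition described above: a producer loop and a consumer loop that share an array are split into blocks small enough for the cache, and each producer block is immediately followed by the matching consumer block so the shared data is passed via cache. The block and array sizes are illustrative assumptions.

    #include <stdio.h>

    #define N      (1 << 20)
    #define BLOCK  (1 << 12)     /* chosen so one block of 'shared' fits in cache */

    static double shared[N];
    static double result[N];

    int main(void) {
        for (int base = 0; base < N; base += BLOCK) {
            for (int i = base; i < base + BLOCK; i++)   /* producer task (one block)           */
                shared[i] = (double)i * 0.5;
            for (int i = base; i < base + BLOCK; i++)   /* consumer task; data is still cached */
                result[i] = shared[i] + 1.0;
        }
        printf("result[N-1] = %g\n", result[N - 1]);
        return 0;
    }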

  • Multigrain Parallel Processing on Compiler Cooperative OSCAR Chip Multiprocessor Architecture 'Jointly Worked'

    Keiji Kimura, Yasutaka Wada, Hirofumi Nakano, Takeshi Kodaka, Jun Shirako, Kazuhisa Ishizaka, Hironori Kasahara

    The IEICE Transactions on Electronics, Special Issue on High-Performance and Low-Power System LSIs and Related Technologies   E86-C ( 4 ) 570 - 579  2003.02  [Refereed]

    Authorship:Lead author

  • Multigrain parallel processing on OSCAR CMP

    K Kimura, T Kodaka, M Obata, H Kasahara

    INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS     56 - 65  2003  [Refereed]

    Authorship:Lead author

     View Summary

    It seems that the Instruction Level Parallelism (ILP) approach, which has been used by various superscalar processors and VLIW processors for a long time, is reaching its limit of performance improvement. To obtain scalable performance improvement, cost effectiveness and high productivity even in the era of one billion transistors, cooperative work between software and hardware is getting increasingly important. For this reason, the authors have developed OSCAR (Optimally SCheduled Advanced multiprocessoR) Chip Multiprocessor (OSCAR CMP) and the OSCAR multigrain compiler simultaneously. To preserve scalability in the future, OSCAR CMP has mechanisms for efficient use of parallelism and data locality, and for hiding data transfer overhead. These mechanisms can be fully controlled by the OSCAR multigrain compiler. In this paper, the authors focus on multigrain parallel processing on OSCAR CMP, which enables us to exploit loop iteration level parallelism and coarse grain task parallelism in addition to ILP from an entire program. The performance of multigrain parallel processing on the OSCAR CMP architecture is evaluated using the SPEC fp 2000/95 benchmark suites. When a microSPARC-like single-issue core is used, OSCAR CMP gives us from 1.77 to 3.96 times speedup for four processors against a single processor. In addition, OSCAR CMP is compared with a Sun UltraSPARC II-like processor to evaluate cost effectiveness. As a result, OSCAR CMP gives us 1.66 times better performance on average under the condition that OSCAR CMP and UltraSPARC II are built from almost the same number of transistors.

  • JPEG Encoding Using Multigrain Parallel Processing on a Single Chip Multiprocessor

    Takeshi Kodaka, Takayuki Uchida, Keiji Kimura, Hironori Kasahara

    Trans. of IPSJ on High Performance Computing Systems   43 ( Sig 6(HPS5) ) 153 - 162  2002.09  [Refereed]

     View Summary

    With the recent increase of multimedia contents using JPEG and MPEG, low cost, low power consumption and high performance processors for multimedia applications have been expected. In particular, single chip multiprocessor architectures having simple processor cores, which can attain scalability and good cost performance, are attracting much attention for developing such processors. Single chip multiprocessor architectures allow us to exploit coarse grain task level and loop level parallelism in addition to instruction level parallelism, so parallel processing technology is indispensable for obtaining scalable performance improvement. This paper describes a multigrain parallel processing scheme for JPEG encoding on a single chip multiprocessor and its performance. The evaluation shows that an OSCAR-type single chip multiprocessor having four single-issue simple processor cores gave us 3.59 times speed-up.

    CiNii

  • Multigrain Parallel Processing of JPEG Encoding on a Single Chip Multiprocessor (jointly worked)

    小高剛, 内田貴之, 木村啓二, 笠原博徳

    IPSJ Joint Symposium on Parallel Processing (JSPP2002)    2002.05

  • Static coarse grain task scheduling with cache optimization using openMP

    Hirofumi Nakano, Kazuhisa Ishizaka, Motoki Obata, Keiji Kimura, Hironori Kasahara

    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)   2327   479 - 489  2002

     View Summary

    Effective use of cache memory is getting more important with the increasing gap between processor speed and memory access speed. Also, use of multigrain parallelism is getting more important to improve effective performance beyond the limitation of loop iteration level parallelism. Considering these factors, this paper proposes a coarse grain task static scheduling scheme considering cache optimization. The proposed scheme schedules coarse grain tasks to threads so that shared data among coarse grain tasks can be passed via cache, after task and data decomposition considering the cache size at compile time. It is implemented on the OSCAR Fortran multigrain parallelizing compiler and evaluated on a Sun Ultra80 four-processor SMP workstation, using Swim and Tomcatv from the SPEC fp 95. As the results show, the proposed scheme gives us 4.56 times speedup for Swim and 2.37 times speedup for Tomcatv on 4 processors, respectively, against the Sun Forte HPC 6 loop parallelizing compiler. © 2002 Springer Berlin Heidelberg.

    DOI

    Scopus

    2
    Citation
    (Scopus)
  • Multigrain parallel processing for JPEG encoding on a single chip multiprocessor

    T Kodaka, K Kimura, H Kasahara

    INTERNATIONAL WORKSHOP ON INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS     57 - 63  2002  [Refereed]

     View Summary

    With the recent increase of multimedia contents using JPEG and MPEG, low cost, low power consumption and high performance processors for multimedia applications have been expected. In particular, single chip multiprocessor architecture having simple processor cores, which can attain good scalability and cost effectiveness, is attracting much attention. To exploit the full performance of single chip multiprocessor architecture, multigrain parallel processing, which exploits coarse grain task parallelism, loop parallelism and instruction level parallelism, is attractive. This paper describes a multigrain parallel processing scheme for JPEG encoding on a single chip multiprocessor and its performance. The evaluation shows that an OSCAR-type single chip multiprocessor having four single-issue simple processor cores gave us 3.59 times speed-up against sequential execution time.

  • Multigrain automatic parallelization in Japanese Millennium Project IT21 Advanced Parallelizing Compiler

    H Kasahara, M Obata, K Ishizaka, K Kimura, H Kaminaga, H Nakano, K Nagasawa, A Murai, H Itagaki, J Shirako

    PAR ELEC 2002: INTERNATIONAL CONFERENCE ON PARALLEL COMPUTING IN ELECTRICAL ENGINEERING     105 - 111  2002  [Refereed]

     View Summary

    This paper describes OSCAR multigrain parallelizing compiler which has been developed in Japanese Millennium Project IT21 "Advanced Parallelizing Compiler" project and its performance on SMP machines. The compiler realizes multigrain parallelization for chip-multiprocessors to high-end servers. It hierarchically exploits coarse grain task parallelism among loops, subroutines and basic blocks and near fine grain parallelism among statements inside a basic block in addition to loop parallelism. Also, it globally optimizes cache use over different loops, or coarse grain tasks, based on data localization technique to reduce memory access overhead Current performance of OSCAR compiler for SPEC95fp is evaluated on different SMPs. For example, it gives us 3.7 times speedup for HYDRO2D, 1.8 times for SWIM, 1.7 times for SU2COR, 2.0 times for MGRID, 3.3 times for TURB3D on 8 processor IBM RS6000, against XL Fortran compiler ver:7.1 and 4.2 times speedup for SWIM and 2.2 times speedup for TURB3D on 4 processor Sun Ultra80 workstation against Forte6 update 2.

  • Evaluation of Processor Core Architecture for Single Chip Multiprocessor with Near Fine Grain Parallel Processing

    K. Kimura, T. Kato, H. Kasahara

    Trans. of IPSJ   42 ( 4 ) 692 - 703  2001.04  [Refereed]

    Authorship:Lead author

  • Evaluation of Single Chip Multiprocessor Core Architecture with Near Fine Grain Parallel Processing

    Keiji Kimura, Hironori Kasahara

    Proc. of International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'01)    2001.01  [Refereed]

    Authorship:Lead author

  • Near fine grain parallel processing using static scheduling on single chip multiprocessors

    K Kimura, H Kasahara

    INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS     23 - 31  2000  [Refereed]

     View Summary

    With the increase of the number of transistors integrated on a chip, efficient use of transistors and scalable improvement of the effective performance of a processor are becoming important problems. However, it has been thought that popular superscalar and VLIW approaches would have difficulty obtaining scalable improvement of effective performance in the future because of the limitation of instruction level parallelism. To cope with this problem, a single chip multiprocessor (SCM) approach with multigrain parallel processing inside a chip, which hierarchically exploits loop parallelism and coarse grain parallelism among subroutines, loops and basic blocks in addition to instruction level parallelism, is thought to be one of the most promising approaches. This paper evaluates the effectiveness of single chip multiprocessor architectures with a shared cache, global registers, distributed shared memory and/or local memory for near fine grain parallel processing, as the first step of research on an SCM architecture to support multigrain parallel processing. The evaluation shows that the OSCAR (Optimally Scheduled Advanced Multiprocessor) architecture, having distributed shared memory and local memory in addition to centralized shared memory and the attachment of global registers, gives us significant speedups, such as 13.8% to 143.8% for four processors compared with the shared cache architecture, for applications from which it has been difficult to extract parallelism effectively.
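    The statement-level (near fine grain) static scheduling evaluated above can be pictured with the following sketch: the statements of one basic block are split between two processing elements at "compile time", and the single value that crosses the boundary is passed through a simulated distributed-shared-memory word guarded by a ready flag. The PE assignment and the flag protocol are illustrative assumptions, not the architecture's actual transfer mechanism.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static double dsm_word;            /* distributed-shared-memory word holding t1 */
    static atomic_int dsm_ready;       /* set by PE0 when dsm_word is valid          */

    static void *pe0(void *arg) {      /* statically assigned: t1 = a*b; send t1     */
        (void)arg;
        double a = 3.0, b = 4.0;
        dsm_word = a * b;
        atomic_store_explicit(&dsm_ready, 1, memory_order_release);
        return NULL;
    }

    static void *pe1(void *arg) {      /* statically assigned: t2 = c+d; x = t1+t2   */
        (void)arg;
        double c = 5.0, d = 6.0;
        double t2 = c + d;
        while (!atomic_load_explicit(&dsm_ready, memory_order_acquire))
            ;                          /* wait for the scalar transfer from PE0      */
        printf("x = %g\n", dsm_word + t2);
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, pe0, NULL);
        pthread_create(&t1, NULL, pe1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }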

  • Near Fine Grain Parallel Processing on Single Chip Multiprocessors

    K. Kimura, W. Ogata, M. Okamoto, H. Kasahara

    Trans. of IPSJ   40 ( 5 ) 1924 - 1934  1999.05  [Refereed]

    Authorship:Lead author

  • Near fine grain parallel processing using static scheduling on single chip multiprocessors

    Keiji Kimura, Hironori Kasahara

    Proceedings of the Innovative Architecture for Future Generation High-Performance Processors and Systems   1999-   23 - 31  1999  [Refereed]

    Authorship:Lead author

     View Summary

    With the increase of the number of transistors integrated on a chip, efficient use of transistors and scalable improvement of the effective performance of a processor are becoming important problems. However, it has been thought that popular superscalar and VLIW approaches would have difficulty obtaining scalable improvement of effective performance in the future because of the limitation of instruction level parallelism. To cope with this problem, a single chip multiprocessor (SCM) approach with multigrain parallel processing inside a chip, which hierarchically exploits loop parallelism and coarse grain parallelism among subroutines, loops and basic blocks in addition to instruction level parallelism, is thought to be one of the most promising approaches. This paper evaluates the effectiveness of single chip multiprocessor architectures with a shared cache, global registers, distributed shared memory and/or local memory for near fine grain parallel processing, as the first step of research on an SCM architecture to support multigrain parallel processing. The evaluation shows that the OSCAR (Optimally Scheduled Advanced Multiprocessor) architecture, having distributed shared memory and local memory in addition to centralized shared memory and the attachment of global registers, gives us significant speedups, such as 13.8% to 143.8% for four processors compared with the shared cache architecture, for applications from which it has been difficult to extract parallelism effectively.

    DOI

    Scopus

    7
    Citation
    (Scopus)
  • OSCAR multi-grain architecture and its evaluation

    H Kasahara, W Ogata, K Kimura, G Matsui, H Matsuzaki, M Okamoto, A Yoshida, H Honda

    INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS, PROCEEDINGS     106 - 115  1998  [Refereed]

     View Summary

    OSCAR (Optimally Scheduled Advanced Multiprocessor) was designed to efficiently realize multi-grain parallel processing using static and dynamic scheduling. It is a shared memory multiprocessor system having centralized and distributed shared memories, in addition to local memory on each processor with a data transfer controller for overlapping data transfer and task processing. Also, its Fortran multi-grain compiler hierarchically exploits coarse grain parallelism among loops, subroutines and basic blocks, conventional medium grain parallelism among loop iterations in a Doall loop, and near fine grain parallelism among statements. In the coarse grain parallel processing, data localization (automatic data distribution) has been employed to minimize data transfer overhead. In the near fine grain processing of a basic block, explicit synchronization can be removed by use of a clock-level accurate code scheduling technique with architectural supports. This paper describes OSCAR's architecture, its compiler and the performance for multi-grain parallel processing. OSCAR's architecture and compilation technology will be more important in future High Performance Computers and single chip multiprocessors.

  • Data-Localization among Doall and Sequential Loops in Coarse Grain Parallel Processing

    Akimasa Yoshida, Yasushi Ujigawa, Motoki Obata, Keiji Kimura, Hironori Kasahara

    Seventh Workshop on Compilers for Parallel Computers Linkoping Sweden     266 - 277  1998.01  [Refereed]

  • Near Fine Grain Parallel Processing without Explicit Synchronization on a Multiprocessor System

    Wataru Ogata, Akimasa Yoshida, Masami Okamoto, Keiji Kimura, Hironori Kasahara

    Proc. of Sixth Workshop on Compilers for Parallel Computers (Aachen Germany)    1996.12  [Refereed]


Presentations

  • Prototype Implementation of Non-Volatile Memory Support for RISC-V Keystone Enclave

    Lena Yu, Yu Omori, Keiji Kimura

    Presentation date: 2021.07

  • Acceleration of SpMM in Sparse Neural Networks by Parallelization and Vectorization

    田處 雄大, 木村 啓二, 笠原 博徳

    IPSJ Joint SIG Meeting of the 236th System Architecture, 194th System and LSI Design Methodology, and 56th Embedded Systems SIGs (ETNET2021)

    Presentation date: 2021.03

  • Implementation of a Non-Volatile Main Memory Emulator with an Integrity Tree and Encryption Mechanism

    林 知輝, 大森 侑, 木村 啓二

    IPSJ Joint SIG Meeting of the 236th System Architecture, 194th System and LSI Design Methodology, and 56th Embedded Systems SIGs (ETNET2021)

    Presentation date: 2021.03

  • Automatic Parallelization of MATLAB/Simulink Applications by the OSCAR Compiler

    古山 凌, 津村 雄太, 川角 冬馬, 仲田 優哉, 梅田 弾, 木村 啓二, 笠原 博徳

    IPSJ Joint SIG Meeting of the 236th System Architecture, 194th System and LSI Design Methodology, and 56th Embedded Systems SIGs (ETNET2021)

    Presentation date: 2021.03

  • Implementation of a RISC-V NVMM Emulator Capable of Running Linux

    大森 侑, 木村 啓二

    IPSJ Joint SIG Meeting of the 236th System Architecture, 194th System and LSI Design Methodology, and 56th Embedded Systems SIGs (ETNET2021)

    Presentation date: 2021.03

  • Automatic Vector-Parallelization by Collaboration of OSCAR Automatic Parallelizing Compiler and NEC Vectorizing Compiler

    Yuta Tadokoro, Hiroki Mikami, Takeo Hosomi, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2020-ARC-240  IPSJ

    Presentation date: 2020.03

  • Consideration of Accelerator Cost Estimation Method in Multi-Target Automatic Parallelizing Compiler

    Kazuki Yamamoto, Kazuki Fujita, Tomoya Kashimata, Ken Takahashi, Boma A. Adhi, Toshiaki Kitamura, Akihiro Kawashima, Akira Nodomi, Yuji Mori, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2020-ARC-240  IPSJ

    Presentation date: 2020.03

  • Extensions of OSCAR Compiler for Parallelizing C++ Programs

    Toma Kawasumi, Tilman Priesner, Masato Noguchi, Jixin Han, Hiroki Mikami, Takahiro Miyajima, Keishiro Tanaka, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2020-ARC-240  IPSJ

    Presentation date: 2020.03

  • NDCKPT: Transparent Check Pointing Mechanism on Non Volatile Memory by OS

    Hikaru Nishida, Keiji Kimura

    Technical Report of IEICE, CPSY2019-102  IEICE

    Presentation date: 2020.03

  • Investigation into Acceleration of Matrix-multiply in Homomorphic Encryption

    Tetsuya Makita, Teppei Shishido, Yasutaka Wada, Keiji Kimura

    Technical Report of IEICE, CPSY2019-96  IEICE

    Presentation date: 2020.03

  • Cascaded DMAC Enabling Efficient Data Transfer for Indirect Memory Access Applications

    Keiji Kimura  [Invited]

    RECS

    Presentation date: 2019.11

  • Automatic Parallelizing and Vectorizing Compiler Framework for OSCAR Vector Multicore Processor

    Kazuki Miyamoto, Tetsuya Makita, Ken Takahashi, Tomoya Kashimata, Takumi Kawada, Satoshi Karino, Toshiaki Kitamura, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2018-ARC-230  IPSJ

    Presentation date: 2018.03

  • Automatic Local Memory Management Using Hierarchical Adjustable Block for Multicores and Its Performance Evaluation

    Tomoya Shirakawa, Yuto Abe, Yoshitake Ooki, Akimasa Yoshida, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2017-ARC-220  IPSJ

    Presentation date: 2017.11

  • A Reproducible Full Computer System emulator

    Yuki Shimizu, Mineo Takai, Keiji Kimura

    Multimedia, Distributed, Cooperative, and Mobile Symposium(DICOMO 2017)  IPSJ

    Presentation date: 2017.07

  • Hierarchical Interconnection Network Extension for Gen 5 Simulator Considering Large Scale Systems

    Tatsuya Onoguchi, Ayane Hayashi, Katsuyuki Utaka, Yuichi Matsushima, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ,Vol.2017-ARC-221  IPSJ

    Presentation date: 2017.03

  • Parallel Processing of Automobile Real-time Control on Multicore System with Multiple Clusters

    Jin Miyata, Mamoru Shimaoka, Hiroki Mikami, Hirofumi Nishi, Hitoshi Suzuki, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ,Vol.2017-ARC-221  IPSJ

    Presentation date: 2017.03

  • Code Generating Method with Profile Feedback for Reducing Compilation Time of Automatic Parallelizing Compiler

    Rina Fujino, Jixin Han, Mamoru Shimaoka, Hiroki Mikami, Takahiro Miyajima, Moriyuki Takamura, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ,Vol.2017-ARC-221  IPSJ

    Presentation date: 2017.03

  • Development of Compilation Flow and Evaluation of OSCAR Vector Multicore Architecture

    Ken Takahashi, Satoshi Karino, Kazuki Miyamoto, Takumi Kawata, Tomoya Kashimata, Tetsuya Makita, Toshiaki Kitamura, Keiji Kimura, Hironori Kasahara

    Proc. 80th Annual Convention IPSJ  IPSJ

    Presentation date: 2017.03

  • FPGA Implementation of OSCAR Vector Accelerator

    Tomoya Kashimata, Satoshi Karino, Kazuki Miyamoto, Takumi Kawata, Ken Takahashi, Tetsuya Makita, Toshiaki Kitamura, Keiji Kimura, Hironori Kasahara

    Proc. 80th Annual Convention IPSJ  IPSJ

    Presentation date: 2017.03

  • A Compilation Framework for Multicores having Vector Accelerators using LLVM

    Akira Maruoka, Yuya Mushu, Satoshi Karino, Takashi Mochiyama, Toshiaki Kitamura, Sachio Kamiya, Moriyuki Takamura, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ,Vol.2017-ARC-221  IPSJ

    Presentation date: 2016.08

  • Multigrain Parallelization of Program for Medical Image Filtering

    Mariko Okumura, Tomoyuki Shibasaki, Kohei Kuwajima, Hiroki Mikami, Keiji Kimura, Kohei Kadoshita, Keiichi Nakano, Hironori Kasahara

    IPSJ SIG Technical Report Vol.2016-HPC-153  IPSJ

    Presentation date: 2016.03

  • Automatic Multigrain Parallel Processing for 3D Noise Reduction Using OSCAR Compiler

    Tomoyuki Shibasaki, Kohei Kuwajima, Mariko Okumura, Hiroki Mikami, Keiji Kimura, Kohei Kadoshita, Keiichi Nakano, Hironori Kasahara

    IPSJ SIG Technical Report Vol.2016-HPC-153  IPSJ

    Presentation date: 2016.03

  • A Parallelism Abstraction Method with Data Conversion at Analysis in the OSCAR Compiler

    Naoto Kageura, Tamami Wake, Jixin Han, Keiji Kimura, Hironori Kasahara

    IPSJ SIG Technical Report Vol.2016-HPC-153  IPSJ

    Presentation date: 2016.03

  • Multicore Local Memory Management Scheme using Data Multidimensional Aligned Decomposition

    Kohei Yamamoto, Tomoya Shirakawa, Akimasa Yoshida, Keiji Kimura, Hironori Kasahara

    IPSJ SIG Technical Report Vol.2016-SLDM-174  IPSJ

    Presentation date: 2016.01

  • An Evaluation of the Repeatability of Full Computer System Emulation

    Daichi Fukui, Teruhiro Mizumoto, Shinsuke Nishimoto, Shigeru Kaneda, Mineo Takai, Keiji Kimura

    Multimedia, Distributed, Cooperative, and Mobile Symposium(DICOMO 2015)  IPSJ

    Presentation date: 2015.07

  • Evaluation of Parallelization of video decoding on Intel and ARM Multicore

    Tamami Wake, Shuhei Iizuka, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    IPSJ SIG Technical Report Vol.2015-EMB-36  IPSJ

    Presentation date: 2015.03

  • Dynamic Scheduling Algorithm for Automatically Parallelized and Power Reduced Applications on Multicore Systems

    Takashi Goto, Kohei Muto, Tomohiro Hirano, Hiroki Mikami, Uichiro Takahashi (Fujitsu), Sakae Inoue (Fujitsu), Keiji Kimura, Hironori Kasahara

    IPSJ SIG Technical Report Vol.2015-SLDM-170  IPSJ

    Presentation date: 2015.03

  • Power Reduction of Real-time Dynamic Image Processing on Haswell Multicore Using OSCAR Compiler

    Shuhei Iizuka, Hideo Yamamoto, Tomohiro Hirano, Youhei Kishimoto, Takashi Goto, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    IPSJ SIG Technical Report Vol.2015-EMB-36  IPSJ

    Presentation date: 2015.03

  • Evaluation of Software Cache Coherency Control Scheme by an Automatic Parallelizing Compiler

    Yohei Kishimoto, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

    IPSJ SIG Technical Report Vol.2014-ARC-213 No.19  IPSJ

    Presentation date: 2014.12

  • Android Demonstration System of Automatic Parallelization and Power Optimization by OSCAR Compiler

    Bui Duc Binh, Tomohiro Hirano, Hiroki Mikami, Dominic Hillenbrand, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ,Vol.2014-ARC-211 No.6  IPSJ

    Presentation date: 2014.07

  • Tracing method of a parallelized program using Linux ftrace on a multicore processor

    Daichi Fukui, Mamoru Shimaoka, Hiroki Mikami, Dominic Hillenbrand, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ,Vol.2014-ARC-211 No.6  IPSJ

    Presentation date: 2014.07

  • A Latency Reduction Technique for Network Intrusion Detection System on Multicores

    Keiji Kimura  [Invited]

    MPSoC

    Presentation date: 2014.07

  • Automatic Parallelization of Small Point FFT on Multicore Processor

    Yuuki Furuyama, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    IPSJ SIG Technical Report Vol.2013-ARC-201  IPSJ

    Presentation date: 2014.03

  • A Latency Reduction Technique for IDS by Allocating Decomposed Signature on Multi-core

    Shohei Yamada, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    IPSJ SIG Technical Report Vol.2013-ARC-201  IPSJ

    Presentation date: 2014.03

  • A Parallelizing Compiler Cooperative Acceleration Technique of Multicore Architecture Simulation using a Statistical Method

    Gakuho Taguchi, Keiji Kimura, Hironori Kasahara

    IPSJ SIG Technical Report  IEICE

    Presentation date: 2014.03

  • Profile-Based Automatic Parallelization for Android 2D Rendering by Using OSCAR Compiler

    Takashi Goto, Kohei Muto, Hideo Yamamoto, Tomohiro Hirano, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, Vol.2013-ARC-207 No.12  IPSJ

    Presentation date: 2013.12

  • Automatic Parallelization of Automatically Generated Engine Control C Codes by Model-based Design

    Dan Umeda, Youhei Kanehagi, Hiroki Mikami, Mitsuhiro Tani (DENSO), Yuji Mori (DENSO), Keiji Kimura, Hironori Kasahara

    Embedded System Symposium 2013  IPSJ

    Presentation date: 2013.10

  • An Evaluation of Hardware Barrier Synchronization Mechanism Considering Hierarchical Processor Grouping using OSCAR API Standard Translator

    Akihiro Kawashima, Yohei Kanehagi, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, Vol.2013-ARC-206 No.16  IPSJ

    Presentation date: 2013.08

  • Automatic Power Control on Multicore Android Devices

    Tomohiro Hirano, Hideo Yamamoto, Kohei Muto, Hiroki Mikami, Takashi Goto, Dominic Hillenbrand, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, Vol.2013-ARC-206 No.23  IPSJ

    Presentation date: 2013.08

  • OSCAR API v2.1 with Flexible Accelerator Control Facilities

    Keiji Kimura  [Invited]

    MPSoC

    Presentation date: 2013.07

  • Fundamentals and Practical Examples of Parallelized Application Development for Multicores

    木村啓二  [Invited]

    ESEC 2013 Technical Seminar  Reed Exhibition Japan

    Presentation date: 2013.05

  • Enhancing the Performance of a Multiplayer Game by Using a Parallelizing Compiler

    Yasir I. M. Al-Dosary, Yuki Furuyama, Dominic Hillenbrand, Keiji Kimura, Hironori Kasahara, Seinosuke Narita

    Technical Report of IPSJ  IPSJ

    Presentation date: 2013.04

  • An Investigation of Parallelization and Evaluation on Commercial Multi-core Smart Device

    Hideo Yamamoto, Takashi Goto, Tomohiro Hirano, Kouhei Muto, Hiroki Mikami, Dominic Hillenbrand, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, Vol. 2013-OS-124 No. 000310  IPSJ

    Presentation date: 2013.02

  • Parallelization of Automobile Engine Control Software on Multicore Processor

    KANEHAGI YOUHEI, UMEDA DAN, MIKAMI HIROKI, HAYASHI AKIHIRO, SAWADA MITSUO, KIMURA KEIJI, KASAHARA HIRONORI

    Technical Report of IPSJ, Vol.2013-ARC-203 No.2  IPSJ

    Presentation date: 2013.01

  • An Acceleration Technique of Many-core Architecture Simulation with Parallelized Applications by Statistical Technique

    Abe Yoichi, Taguchi Gakuho, Kimura Keiji, Kasahara Hironori

    Technical Report of IPSJ, Vol.2012-ARC-203 No.13  IPSJ

    Presentation date: 2013.01

  • A Parallelizing Compiler Cooperative Multicore Architecture Simulator with Changeover Mechanism of Simulation Modes

    TAGUCHI GAKUHO, ABE YOUICHI, KIMURA KEIJI, KASAHARA HIRONORI

    Technical Report of IPSJ, Vol.2012-ARC-203 No.14  IPSJ

    Presentation date: 2013.01

  • Automatic parallelization with OSCAR API Analyzer: a cross-platform performance evaluation

    Cecilia Gonzalez-Alvarez, Youhei Kanehagi, Kosei Takemoto, Yohei Kishimoto, Kohei Muto, Hiroki Mikami, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, Vol.2012-ARC-202HPC137 No.10  IPSJ

    Presentation date: 2012.12

  • Automatic Parallelization of Ground Motion Simulator

    Mamoru Shimaoka, Hiroki Mikami, Akihiro Hayashi, Yasutaka Wada, Keiji Kimura, Hidekazu Morita (HITACHI), Kunio Uchiyama (HITACHI), Hironori Kasahara

    Technical Report of IPSJ, Vol.2012-ARC-202HPC137 No.11  IPSJ

    Presentation date: 2012.12

  • Opportunities and Challenges of Application-Power Control in the Age of Dark Silicon

    Dominic Hillenbrand, Yuuki Furuyama, Akihiro Hayashi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, Vol.2012-ARC-202HPC137 No.26  IPSJ

    Presentation date: 2012.12

  • Parallel processing of multimedia applications on TILEPro64 using OSCAR API for embedded multicore

    Yohei Kishimoto, Hiroki Mikami, Keiichi Nakano, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

    Embedded System Symposium 2012  IPSJ

    Presentation date: 2012.10

  • Parallelization of Basic Engine Control Software Model on Multicore Processor

    Dan Umeda, Youhei Kanehagi, Hiroki Mikami, Akihiro Hayashi, Mitsuhiro Tani, Yuji Mori, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, Vol.2012-ARC-201 No.22  IPSJ

    Presentation date: 2012.08

  • Realization of 1 Watt Web Service with RP-X Low-power Multicore Processor

    Yuuki Furuyama, Mamoru Shimaoka, Hiroki Mikami, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, Vol.2012-ARC-201 No.24  IPSJ

    Presentation date: 2012.08

  • OSCAR API for Low-Power Multicores and Manycores, and API Standard Translator

    Keiji Kimura  [Invited]

    MPSoC

    Presentation date: 2012.07

  • 並列化コンパイラを考慮したコーディング作法と並列化APIの現在

    木村啓二  [Invited]

    ESEC 2012 専門セミナー  Reed Exhibition Japan

    Presentation date: 2012.05

  • A Definition of Parallelizable C by JISX0180:2011 "Framework of establishing coding guidelines for embedded system development"

    KIMURA KEIJI, MASE MASAYOSHI, KASAHARA HIRONORI

    ETNET2012  IPSJ

    Presentation date: 2012.03

  • Inlining Analysis of Exception Flow and Fast Method Dispatch on Automatic Parallelization of Java

    Keiichi Tabata, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, Vol. 2012-ARC-199  IPSJ

    Presentation date: 2012.03

  • An Examination of Accelerating Many-core Architecture Simulation for Parallelized Media Applications

    Yoichi Abe, Ryo Ishizuka, Ryota Daigo, Gakuho Taguchi, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, Vol. 2012-ARC-199  IPSJ

    Presentation date: 2012.03

  • Automatic Parallelization of Dose Calculation Engine for A Particle Therapy

    Akihiro Hayashi, Takuji Matsumoto, Hiroki Mikami, Keiji Kimura, Keiji Yamamoto, Hironori Saki, Yasuyuki Takatani, Hironori Kasahara

    Symposium on High-Performance Computing and Computer Science (HPCS2012)  IPSJ

    Presentation date: 2012.01

  • Automatic Parallelization of Dose Calculation Engine for A Particle Therapy on SMP Servers

    Akihiro Hayashi, Takuji Matsumoto, Hiroki Mikami, Keiji Kimura, Keiji Yamamoto, Hironori Saki, Yasuyuki Takatani, Hironori Kasahara

    Technical Report of IPSJ, Vol.2011-ARC189HPC132-2  IPSJ

    Presentation date: 2011.11

  • Examination of Parallelization by CUDA in SPEC Benchmark Programs

    Yuki Taira, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2011-HPC-130-16  IPSJ

    Presentation date: 2011.07

  • An Evaluation of an Acceleration method of Many-core Architecture Simulation using Program Structures of Scientific Applications

    Ryo Ishizuka, Yoichi Abe, Ryota Daigo, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2011-ARC-196-14  IPSJ

    Presentation date: 2011.07

  • 並列化APIとコンパイラによるマルチコア用アプリケーションの開発

    木村啓二  [Invited]

    ESEC 2011 専門セミナー  Reed Exhibition Japan

    Presentation date: 2011.05

  • Hiding I/O overheads with Parallelizing Compiler for Media Applications

    Akihiro Hayashi, Takeshi Sekiguchi, Masayoshi Mase, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2011-ARC-195-14  IPSJ

    Presentation date: 2011.04

  • Evaluation of Power Consumption by Executing Media Applications on Low-power Multicore RP2

    Hiroki Mikami, Shumpei Kitaki, Takafumi Sato, Masayoshi Mase, Keiji Kimura, Kazuhisa Ishizaka, Junji Sakai, Masato Edahiro, Hironori Kasahara

    Technical Report of IPSJ, 2011-ARC-194-1  IPSJ

    Presentation date: 2011.03

  • Evaluation of Parallelizable C Programs by the OSCAR API Standard Translator

    SATO TAKUYA, MIKAMI HIROKI, HAYASHI AKIHIRO, MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

    Technical Report of IPSJ, 2010-ARC-191-2  IPSJ

    Presentation date: 2010.10

  • An Acceleration Technique of Many Core Architecture Simulator Considering Program Structure

    ISHIZUKA RYO, OOTOMO TOSHIYA, DAIGO RYOTA, KIMURA KEIJI, KASAHARA HIRONORI

    Technical Report of IPSJ, 2010-ARC-190 No. 20  IPSJ

    Presentation date: 2010.08

  • Performance of Power Reduction Scheme by a Compiler on Heterogeneous Multicore for Consumer Electronics "RP-X"

    WADA YASUTAKA, HAYASHI AKIHIRO, WATANABE TAKESHI, SEKIGUCHI TAKESHI, MASE MASAYOSHI, SHIRAKO JUN, KIMURA KEIJI, ITO MASAYUKI, HASEGAWA ATSUSHI, SATO MAKOTO, NOJIRI TOHRU, UCHIYAMA KUNIO, KASAHARA HIRONORI

    Technical Report of IPSJ, 2010-ARC-190 No. 8  IPSJ

    Presentation date: 2010.08

  • A Compiler Framework for Heterogeneous Multicores for Consumer Electronics

    HAYASHI AKIHIRO, WADA YASUTAKA, WATANABE TAKESHI, SEKIGUCHI TAKESHI, MASE MASAYOSHI, KIMURA KEIJI, ITO MASAYUKI, HASEGAWA ATSUSHI, SATO MAKOTO, NOJIRI TOHRU, UCHIYAMA KUNIO, KASAHARA HIRONORI

    Technical Report of IPSJ, 2010-ARC-190 No. 7  IPSJ

    Presentation date: 2010.08

  • 組込みマルチコア用並列化APIと並列化コンパイラの現在

    木村啓二  [Invited]

    ESEC 2010 専門セミナー  Reed Exhibition Japan

    Presentation date: 2010.05

  • Parallelizing Compiler Directed Software Coherence

    MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

    Technical Report of IPSJ, 2010-ARC-189, 2010-OS-114  IPSJ

    Presentation date: 2010.04

  • Multi Media Offload with Automatic Parallelization

    ISHIZAKA KAZUHISA, SAKAI JUNJI, EDAHIRO MASATO, MIYAMOTO TAKAMICHI, MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

    Technical Report of IPSJ, 2010-SLDM144, 2010-EMB16  IPSJ

    Presentation date: 2010.03

  • Processing Performance of Automatically Parallelized Applications on Embedded Multicore with Running Multiple Applications

    Takamichi Miyamoto, Masayoshi Mase, Keiji Kimura, Kazuhisa Ishizaka, Junji Sakai, Masato Edahiro

    Technical Report of IPSJ, 2010-ARC-188 No.9  IPSJ

    Presentation date: 2010.03

  • Hierarchical Parallel Processing of H.264/AVC Encoder on a Multicore Processor

    Hiroki Mikami, Takamichi Miyamoto, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ Vol.2010-ARC-187 No.22 Vol.2010-EMB-15 No.22  IPSJ

    Presentation date: 2010.01

  • Element-Sensitive Pointer Analysis for Automatic Parallelization

    Masayoshi Mase, Yuta Murata, Keiji Kimura, Hironori Kasahara

    IPSJ-SIGPRO  IPSJ

    Presentation date: 2009.10

  • メニーコア・プロセッサとそれを支える要素技術

    井上 弘士, 木村 啓二, 松谷 宏紀  [Invited]

    組込システムシンポジウム 2009  情報処理学会

    Presentation date: 2009.10

  • Automatic Parallelization of Parallelizable C Programs on Multicore Processors

    Masayoshi Mase, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2009-ARC-184-15  IPSJ

    Presentation date: 2009.08

  • 組込みソフトウェアの信頼性,開発効率向上のためのコーディングガイドライン

    木村啓二  [Invited]

    平成21年度 INSTAC成果報告会 

    Presentation date: 2009.07

  • A Power Reduction Scheme of Parallelizing Compiler Using OSCAR API on Multicore Processor

    Ryo Nakagawa, Masayoshi Mase, Naoto Ohkuni, Jun Shirako, Keiji Kimura, Hironori Kasahara

    Symposium on Advanced Computing Systems and Infrastructures (SACSIS 2009)  IPSJ

    Presentation date: 2009.05

  • 最新の組込みマルチコア用コンパイラ技術と並列API

    木村啓二  [Invited]

    ESEC 2009 専門セミナー  Reed Exhibition Japan

    Presentation date: 2009.05

  • Performance Evaluation of Minimum Execution Time Multiprocessor Scheduling Algorithms Using Standard Task Graph Set Ver3 Consider Parallelism of Task Graphs and Deviation of Task Execution Time

    Mamoru Shimaoka, Kazuhiro Imaizumi, Fumiyo Takano, Keiji Kimura, Hironori Kasahara

    Technical Report of IEICE  IPSJ

    Presentation date: 2009.02

  • A Power Saving Scheme on Multicore Processors Using OSCAR API

    Ryo Nakagawa, Masayoshi Mase, Jun Shirako, Keiji Kimura, Hironori Kasahara

    TECHNICAL REPORT OF IEICE. (ICD2008/145)  IEICE

    Presentation date: 2009.01

  • Local Memory Management Scheme by a Compiler for Multicore Processor

    Taku Momozono, Hirofumi Nakano, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

    TECHNICAL REPORT OF IEICE. (ICD2008/141)  IEICE

    Presentation date: 2009.01

  • Performance Evaluation of Parallelizing Compiler Cooperated Heterogeneous Multicore Architecture Using Media Applications

    Teruo Kamiyama, Yasutaka Wada, Akihiro Hayashi, Masayoshi Mase, Hirofumi Nakano, Takeshi Watanabe, Keiji Kimura, Hironori Kasahara

    TECHNICAL REPORT OF IEICE. (ICD2008/140)  IEICE

    Presentation date: 2009.01

  • マルチコアのソフトウェア開発

    木村啓二  [Invited]

    CEATEC JAPAN 2008 インダストリアルセッション(IS)  JEITA

    Presentation date: 2008.10

  • マルチコア用コンパイル技術の現在

    木村啓二  [Invited]

    第10回 組み込みシステム技術に関するサマーワークショップ (SWEST10)  情報処理学会

    Presentation date: 2008.09

  • マルチコアプロセッサのソフトウェア

    木村啓二  [Invited]

    第31回STARCアドバンスト講座 システムアーキテクチャ セミナー - SoCシステムアーキテクチャ -  STARC

    Presentation date: 2008.07

  • An Evaluation of Barrier Synchronization Mechanism Considering Hierarchical Processor Grouping

    Kaito Yamada, Masayoshi Mase, Jun Shirako, Keiji Kimura, Masayuki Ito, Toshihiro Hattori, Hiroyuki Mizuno, Kunio Uchiyama, Hironori Kasahara

    Technical Report of IPSJ,  IPSJ

    Presentation date: 2008.05

  • Automatic Parallelization of Restricted C Programs using Pointer Analysis

    Masayoshi Mase, Daisuke Baba, Harumi Nagayama, Yuta Murata, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2008  IPSJ

    Presentation date: 2008.05

  • Parallelization of Multimedia Applications by Compiler on Multicores for Consumer Electronics

    Takamichi Miyamoto, Saori Asaka, Hiroki Mikami, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

    Symposium on Advanced Computing Systems and Infrastructures (SACSIS 2008)  IPSJ

    Presentation date: 2008.05

  • Parallelization for Multimedia Processing on Multicore Processors

    Takamichi Miyamoto, Kei Tamura, Hiroaki Tano, Hiroki Mikami, Saori Asaka, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2007-ARC-175-05  IPSJ

    Presentation date: 2007.11

  • 最新の組み込みマルチコア用コンパイラ技術

    木村啓二  [Invited]

    システムLSIワークショップ  情報処理学会

    Presentation date: 2007.11

  • Multigrain Parallelization of Restricted C Programs in SMP Execution Mode of a Multicore for Consumer Electronics

    Masayoshi Mase, Daisuke Baba, Harumi Nagayama, Hiroaki Tano, Takeshi Masuura, Takamichi Miyamoto, Jun Shirako, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

    Embedded Systems Symposium 2007 (ESS 2007)  IPSJ

    Presentation date: 2007.10

  • Compiler Control Power Saving for Heterogeneous Multicore Processor

    Akihiro Hayashi, Taketo Iyoku, Ryo Nakagawa, Shigeru Matsumoto, Kaito Yamada, Naoto Oshiyama, Jun Shirako, Yasutaka Wada, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2007-ARC-174-18  IPSJ

    Presentation date: 2007.08

  • A Hierarchical Coarse Grain Task Static Scheduling Scheme on a Heterogeneous Multicore

    Yasutaka Wada, Akihiro Hayashi, Taketo Iyoku, Jun Shirako, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2007-ARC-174-17  IPSJ

    Presentation date: 2007.08

  • Evaluation of Heterogeneous Multicore Architecture with AAC-LC Stereo Encoding

    Hiroaki Shikano, Masaki Ito, Takashi Todaka, Takanobu Tsunoda, Tomoyuki Kodama, Masafumi Onouchi, Kunio Uchiyama, Toshihiko Odaka, Tatsuya Kamei, Ei Nagahama, Manabu Kusaoke, Yusuke Nitta, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

    TECHNICAL REPORT OF IEICE. (ICD2007-71)  IEICE

    Presentation date: 2007.08

  • マルチコア用コンパイラ技術

    木村啓二  [Invited]

    165委員会主催研究会第46回研究会 「マルチコアプロセッサSoCの現状と今後の展望」 

    Presentation date: 2007.07

  • 組込マルチコアの動向

    木村啓二  [Invited]

    JEITA 情報端末フェスティバル 2007  JEITA

    Presentation date: 2007.06

  • A 4320MIPS four Processor-core SMP/AMP with Individually Managed Clock Frequency for Low Power Consumption

    Kiyoshi Hayase, Yutaka Yoshida, Tatsuya Kamei, Shinichi Shibahara, Osamu Nishii, Toshihiro Hattori, Atsushi Hasegawa, Masashi Takada, Naohiko Irie, Kunio Uchiyama, Toshihiko Odaka, Kiwamu Takada, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2007-ARC-173-06  IPSJ

    Presentation date: 2007.05

  • Multigrain Parallel Processing in SMP Execution Mode on a Multicore for Consumer Electronics

    Masayoshi Mase, Daisuke Baba, Harumi Nagayama, Hiroaki Tano, Takeshi Masuura, Takamichi Miyamoto, Jun Shirako, Hirofumi Nakano, Keiji Kimura, Tatsuya Kamei, Toshihiro Hattori, Atsushi Hasegawa, Makoto Sato, Masaki Ito, Toshihiko Odaka, Hironori Kasahara

    Technical Report of IPSJ, 2007-ARC-173-05  IPSJ

    Presentation date: 2007.05

  • マルチコアプロセッサ活用の勘所

    木村啓二  [Invited]

    組み込みプロセッサ&プラットホームワークショップ 

    Presentation date: 2007.04

  • A Local Memory Management Scheme in Multigrain Parallelizing Compiler

    Miura Tsuyoshi, Tomohiro Tagawa, Yusuke Muramatsu, Akinori Ikemi, Masahiro Nakagawa, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2007-ARC-172-11  IPSJ

    Presentation date: 2007.03

  • Automatic Parallelization for Multimedia Applications on Multicore Processors

    Takamichi Miyamoto, Saori Asaka, Nobuhito Kamakura, Hiromasa Yamauchi, Masayoshi Mase, Jun Shirako, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2007-ARC-171-13  IPSJ

    Presentation date: 2007.01

  • Automatic Parallelization of Restricted C Programs in OSCAR Compiler

    Masayoshi Mase, Daisuke Baba, Harumi Nagayama, Hiroaki Tano, Takeshi Masuura, Koji Fukatsu, Takamichi Miyamoto, Jun Shirako, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2006-ARC-170-1  IPSJ

    Presentation date: 2006.11

  • Performance of OSCAR Multigrain Parallelizing Compiler on SMP Servers and Embedded Multicore

    Jun Shirako, Tomohiro Tagawa, Tsuyoshi Miura, Takamichi Miyamoto, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2006-ARC-170-2  IPSJ

    Presentation date: 2006.11

  • ソフトウェアもおもしろいこれからのプロセッサアーキテクチャ

    木村啓二  [Invited]

    FIT2006イベント企画「これからが面白いプロセッサアーキテクチャ」(パネル)  情報処理学会

    Presentation date: 2006.09

  • Local Memory Management on OSCAR Multicore

    Hirofumi Nakano, Takumi Nito, Takanori Maruyama, Masahiro Nakagawa, Yuki Suzuki, Yousuke Naito, Takamichi Miyamoto, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, 2006-ARC-169-28  IPSJ

    Presentation date: 2006.08

  • Compiler Control Power Saving Scheme for Multicore Processors

    Jun Shirako, Munehiro Yoshida, Naoto Oshiyama, Yasutaka Wada, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

    Proc. of Symposium on Advanced Computing Systems and Infrastructures (SACSIS2006)  IPSJ

    Presentation date: 2006.05

  • Data Transfer Overlap of Coarse Grain Task Parallel Processing on a Multicore Processor

    Takamichi Miyamoto, Masahiro Nakagawa, Shoichiro Asano, Yosuke Naito, Takumi Nito, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC-2006-167, HPC-2006-105  IPSJ

    Presentation date: 2006.02

  • A Static Scheduling Scheme for Coarse Grain Task on a Heterogeneous Chip Multi Processor

    Yasutaka Wada, Naoto Oshiyama, Yuki Suzuki, Jun Shirako, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC-2006-166  IPSJ

    Presentation date: 2006.01

  • Preliminary Evaluation of Heterogeneous Chip Multi-Processor with MP3 Audio Encoder

    Hiroaki Shikano, Yuki Suzuki, Yasutaka Wada, Jun Shirako, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC-2006-166  IPSJ

    Presentation date: 2006.01

  • Data Localization on a Multicore Processor

    Hirofumi Nakano, Shoichiro Asano, Yosuke Naito, Takumi Nito, Tomohiro Tagawa, Takamichi Miyamoto, Takeshi Kodaka, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC2005-165-10  IPSJ

    Presentation date: 2005.12

  • Compiler Control Power Saving Scheme for Homogeneous Multiprocessor

    Jun Shirako, Naoto Oshiyama, Yasutaka Wada, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC2005-164-10  IPSJ

    Presentation date: 2005.08

  • Performance of OSCAR Multigrain Parallelizing Compiler on Shared Memory Multiprocessor Servers

    Jun Shirako, Takamichi Miyamoto, Kazuhisa Ishizaka, Motoki Obata, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC2005-161-5  IPSJ

    Presentation date: 2005.01

  • Performance Evaluation of Electronic Circuit Simulation Using Code Generation Method without Array Indirect Access

    Akira Kuroda, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC2005-161-1  IPSJ

    Presentation date: 2005.01

  • Parallel Processing for MPEG2 Encoding on OSCAR Chip Multiprocessor

    Takeshi Kodaka, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC2004-160-10  IPSJ

    Presentation date: 2004.12

  • Data Localization using Data Transfer Unit on OSCAR Chip Multiprocessor

    Hirofumi Nakano, Yosuke Naito, Takahisa Suzuki, Takeshi Kodaka, Kazuhisa Ishizaka, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC2004-159-20  IPSJ

    Presentation date: 2004.08

  • Evaluation of Multigrain Parallelism on OSCAR Chip Multi Processor

    Yasutaka Wada, Jun Shirako, Kazuhisa Ishizaka, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC2004-159-11  IPSJ

    Presentation date: 2004.08

  • Evaluation of OSCAR Multigrain Automatic Parallelizing Compiler on IBM pSeries 690

    Kazuhisa Ishizaka, Jun Shirako, Motoki Obata, Keiji Kimura, Hironori Kasahara

    Proc. 66th Annual Convention IPSJ  IPSJ

    Presentation date: 2004.03

  • Parallel Processing for MPEG2 Encoding using Data Localization

    Takeshi Kodaka, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC2004-156-3  IPSJ

    Presentation date: 2004.02

  • The Data Prefetching of Coarse Grain Task Parallel Processing on Symmetric Multi Processor Machine

    Takamichi Miyamoto, Takahiro Yamaguchi, Takao Tobita, Kazuhisa Ishizaka, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC2003-155-06  IPSJ

    Presentation date: 2003.11

  • Data Localization Scheme using Static Scheduling on Chip Multiprocessor

    Hirofumi Nakano, Takeshi Kodaka, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC2003-154-14  IPSJ

    Presentation date: 2003.08

  • Parallel Processing on MPEG2 Encoding for OSCAR Chip Multiprocessor

    Takeshi Kodaka, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC2003-154-10  IPSJ

    Presentation date: 2003.08

  • Data Localization using Coarse Grain Task Parallelism on Chip Multiprocessor

    Hirofumi Nakano, Takeshi Kodaka, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC2003-151-3  IPSJ

    Presentation date: 2003.01

  • Multigrain Parallel Processing on Motion Vector Estimation for Single Chip Multiprocessor

    Takeshi Kodaka, Takahisa Suzuki, Keiji Kimura, Hironori Kasahara

    Technical Report of IPSJ, ARC2002-150-6  IPSJ

    Presentation date: 2002.11

  • Multigrain Parallel Processing on OSCAR Chip Multiprocessor

    Keiji Kimura, Takeshi Kodaka, Motoki Obata, Hironori Kasahara

    Technical Report of IPSJ, ARC2002-150-7  IPSJ

    Presentation date: 2002.11

  • Evaluation of Overhead with Coarse Grain Task Parallel Processing on SMP Machines

    Yasutaka Wada, Hirofumi Nakano, Keiji Kimura, Motoki Obata, Hironori Kasahara

    Technical Report of IPSJ, ARC2002-148-3  IPSJ

    Presentation date: 2002.05

  • JPEG Encoding using Multigrain Parallel Processing on a Single Chip Multiprocessor

    Takeshi Kodaka, Takayuki Uchida, Keiji Kimura, Hironori Kasahara

    Joint Symposium on Parallel Processing 2002 (JSPP2002)  IPSJ

    Presentation date: 2002.05

  • Multigrain Parallel Processing for JPEG Encoding Program on an OSCAR type Single Chip Multiprocessor

    T. Kodaka, T. Uchida, K. Kimura, H. Kasahara

    Technical Report of IPSJ, ARC2002-146-4  IPSJ

    Presentation date: 2002.02

  • Multigrain Parallel Processing on Single Chip Multiprocessor

    T. Uchida, T. Kodaka, K. Kimura, H. Kasahara

    Technical Report of IPSJ, ARC2002-146-3  IPSJ

    Presentation date: 2002.02

  • Near Fine Grain Parallel Processing on Multimedia Application for Single Chip Multiprocessor

    T. Kodaka, N. Miyashita, K. Kimura, H. Kasahara

    Technical Report of IPSJ, ARC2001-144-11  IPSJ

    Presentation date: 2001.11

  • A Static Scheduling Scheme for Coarse Grain Tasks considering Cache Optimization on SMP

    H. Nakano, K. Ishizaka, M. Obata, K. Kimura, H. Kasahara

    Technical Report of IPSJ, ARC2001-144-12  IPSJ

    Presentation date: 2001.08

  • A Static Scheduling Method for Coarse Grain Tasks considering Cache Optimization on Multiprocessor Systems

    H. Nakano, K. Ishizaka, M. Obata, K. Kimura, H. Kasahara

    Proc. 62nd Annual Convention IPSJ  IPSJ

    Presentation date: 2001.03

  • Near Fine Grain Parallel Processing on Multimedia Application for Single Chip Multiprocessor

    T. Kodaka, K. Kimura, N. Miyashita, H. Kasahara

    Proc. 62nd Annual Convention IPSJ  IPSJ

    Presentation date: 2001.03

  • Performance Evaluation of Single Chip Multiprocessor Memory Architecture for Near Fine Grain Parallel Processing

    N. Matsumoto, K. Kimura, H. Kasahara

    Proc. 62nd Annual Convention IPSJ  IPSJ

    Presentation date: 2001.03

  • A Data Transfer Unit on the Single Chip Multiprocessor for Multigrain Parallel Processing

    N. Miyashita, K. Kimura, T. Kodaka, H. Kasahara

    Proc. 62nd Annual Convention IPSJ  IPSJ

    Presentation date: 2001.03

  • Processor Core Architecture of Single Chip Multiprocessor for Near Fine Grain Parallel Processing

    K. Kimura, T. Uchida, T. Kato, H. Kasahara

    Technical Report of IPSJ, ARC-139-16  IPSJ

    Presentation date: 2000.08

  • Performance Evaluation of Single Chip Multiprocessor for Near Fine Grain Parallel Processing

    T. Kato, W. Ogata, K. Kimura, T. Uchida, H. Kasahara

    Proc. 60th Annual Convention IPSJ  IPSJ

    Presentation date: 2000.03

  • Memory access analyzer for a Multi-grain parallel processing

    K. Iwai, M. Obata, K. Kimura, H. Amano, H. Kasahara

    Technical Report of IEICE, CPSY99-62  IEICE

    Presentation date: 1999.08

  • Performance Evaluation of Near Fine Grain Parallel Processing on the Single Chip Multiprocessor

    K. Kimura, K. Manaka, W. Ogata, M. Okamoto, H. Kasahara

    Technical Report of IPSJ, ARC134-5  IPSJ

    Presentation date: 1999.08

  • A Cache Optimization Scheme Using Earliest Executable Condition Analysis

    D. Inaishi, K. Kimura, K. Fujimoto, W. Ogata, M. Okamoto, H. Kasahara

    Proc. 58th Annual Convention IPSJ  IPSJ

    Presentation date: 1999.03

  • A Cache Optimization with Earliest Executable Condition Analysis

    D. Inaishi, K. Kimura, K. Fujimoto, W. Ogata, M. Okamoto, H. Kasahara

    Technical Report of IPSJ, ARC-130-6  IPSJ

    Presentation date: 1998.08

  • Multigrain parallel Processing on the Single Chip Multiprocessor

    K. Kimura, W. Ogata, M. Okamoto, H. Kasahara

    Technical Report of IPSJ, ARC98-130-5  IPSJ

    Presentation date: 1998.08

  • A Multigrain Parallelizing Compiler and Its Architectural Support

    H. Kasahara, W. Ogata, K. Kimura, M. Obata, T. Tobita, D. Inaishi

    TECHNICAL REPORT OF IEICE. (ICD98-10, CPSY98-10, FTS98-10)  IEICE

    Presentation date: 1998.04

  • Implementation of FPGA Based Architecture Test Bed For Multi Processor System

    W. Ogata, T. Yamamoto, M. Mizuno, K. Kimura, H. Kasahara

    IPSJ SIG Notes, 98-ARC-128-14, HPC70-14  IPSJ

    Presentation date: 1998.03

  • Single Chip Multiprocessor Architecture for Multigrain Parallel Processing

    K. Kimura, W. Ogata, M. Okamoto, H. Kasahara

    Proc. 56th Annual Convention IPSJ  IPSJ

    Presentation date: 1998.03

  • A Cache Optimization with Macro-Task Earliest Execution Condition

    D. Inaishi, K. Kimura, W. Ogata, M. Okamoto, H. Kasahara

    Proc. 56th Annual Convention IPSJ  IPSJ

    Presentation date: 1998.03

  • Multi-processor system for Multi-grain Parallel Processing

    K. Iwai, T. Fujiwara, T. Morimura, H. Amano, K. Kimura, W. Ogata, H. Kasahara

    Technical Report of IEICE, CPSY97-46  IEICE

    Presentation date: 1997.08

  • A Macro Task Dynamic Scheduling Algorithm with Overlapping of Task Processing and Data Transfer

    K. Kimura, S. Hashimoto, M. Kogou, W. Ogata, H. Kasahara

    Technical Report of IEICE, CPSY97-40  IEICE

    Presentation date: 1997.08


Research Projects

  • A Study of Matrix Multiply by Homomorphic Encryption for Utilizing in Deep Learning Frameworks

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research

    Project Year :

    2018.06
    -
    2020.03
     

    Kimura Keiji

     View Summary

    This research aims at accelerating matrix multiply under homomorphic encryption so that it can be used in deep learning frameworks. Through the research, we obtained speedups of up to 5.53x and 3.73x for two important computational parts of the target encrypted matrix-multiply process. In addition, we developed a data transfer unit that can quickly provide the required data to accelerator hardware units. We also investigated and evaluated the relationship between computational precision and calculation time in order to reduce the calculation cost while keeping appropriate precision. As a result, parallel inference with eight smaller neural networks gave an 8-point accuracy improvement and a 54% speedup for image recognition at the same time. (An illustrative matrix-multiply sketch follows below.)
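
    As an illustration only (not code from this project), the following C sketch shows the kind of cache-blocked, thread-parallel matrix-multiply kernel that such acceleration work typically starts from; the homomorphic (encrypted) arithmetic itself is not shown, and the OpenMP pragma and the block size BS are assumptions.

        /* Illustrative only: plaintext, cache-blocked C = A x B kernel.
         * The encrypted (homomorphic) arithmetic of the project is not shown. */
        #include <stddef.h>

        #define BS 64                          /* block size: a tuning assumption */

        void matmul_blocked(size_t n, const double *A, const double *B, double *C)
        {
            for (size_t i = 0; i < n * n; i++) C[i] = 0.0;
            #pragma omp parallel for schedule(static)     /* each thread owns whole row blocks of C */
            for (size_t ib = 0; ib < n; ib += BS)
                for (size_t kb = 0; kb < n; kb += BS)
                    for (size_t i = ib; i < ib + BS && i < n; i++)
                        for (size_t k = kb; k < kb + BS && k < n; k++) {
                            double a = A[i * n + k];
                            for (size_t j = 0; j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
        }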

  • A research on a heterogeneous multicore that enables flexible cooperation among CPUs, accelerators and data transfer units on a chip

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research

    Project Year :

    2015.04
    -
    2018.03
     

    Kimura Keiji

     View Summary

    We developed a heterogeneous multicore architecture and its compiler flow, which enable flexible cooperation among CPUs, accelerator cores, and data transfer units (a kind of extended DMA controller) on a multicore chip. One of the main achievements of this research project is that a program parallelized by the developed compiler flow, including an LLVM backend for the accelerator core, obtains a 24.91x speedup on the heterogeneous multicore on an FPGA test bed, which was also developed in this research. (A double-buffering sketch follows below.)
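
    A minimal C sketch of the double-buffering pattern that such a data transfer unit enables, assuming a hypothetical dtu_start_get/dtu_wait interface (stubbed with memcpy here so the sketch runs); this is not the project's actual DTU API or compiler output.

        #include <stddef.h>
        #include <string.h>

        /* Hypothetical DTU interface, stubbed so the sketch is self-contained. */
        static void *p_dst; static const void *p_src; static size_t p_len;
        static void dtu_start_get(void *dst, const void *src, size_t len)
        { p_dst = dst; p_src = src; p_len = len; }                 /* "start" a transfer   */
        static void dtu_wait(void)
        { if (p_len) { memcpy(p_dst, p_src, p_len); p_len = 0; } } /* "complete" transfer  */

        #define BLK 1024

        void process(float *out, const float *remote_in, size_t nblocks)
        {
            static float buf[2][BLK];                     /* local double buffer          */
            if (nblocks == 0) return;
            dtu_start_get(buf[0], remote_in, BLK * sizeof(float));
            for (size_t b = 0; b < nblocks; b++) {
                dtu_wait();                               /* block b has arrived          */
                if (b + 1 < nblocks)                      /* prefetch block b+1           */
                    dtu_start_get(buf[(b + 1) & 1],
                                  remote_in + (b + 1) * BLK, BLK * sizeof(float));
                for (size_t i = 0; i < BLK; i++)          /* compute overlaps the transfer */
                    out[b * BLK + i] = buf[b & 1][i] * 2.0f;
            }
        }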

  • Real-Time Optimization Algorithms and Their Applications for Control of Large-Scale Nonlinear Spatiotemporal Patterns

    Project Year :

    2012.04
    -
    2016.03
     

     View Summary

    Fast algorithms for solving nonlinear optimal control problems were investigated to optimally control large-scale and complicated systems, and their applications to various fields were examined. Achievements in this research include, for example, development of efficient optimization algorithms for control of large-scale systems, systematic tuning methods of control responses, and a tool for automatic coding of the algorithms. The algorithms have been validated in various applications such as control of distributions of temperature and velocity in thermal fluid systems, suppression of quality dispersion in a steelmaking process, water quality control in advanced sewage treatment facilities, demand control in smart grids, control of power generation and attitude oscillation in floating offshore wind turbines, and so on.

  • A Study of Acceleration Technique for Many-core Architecture Simulation Considering Global Program Structure

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research

    Project Year :

    2011.04
    -
    2014.03
     

    KIMURA Keiji

     View Summary

    A fast and accurate architecture simulation technique for multi-core and many-core processors is proposed in this study. With the proposed technique, the architecture simulator switches its precision and simulation speed appropriately under the assumption that a parallelized application is executed on a multi-core or many-core processor. The evaluation results with four applications, each with different characteristics, show that a 16-core multicore simulation gives a 443 times speedup within 0.52% error at maximum, and a 218 times speedup within 2.75% error on average. (A mode-changeover sketch follows below.)
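
    A minimal C sketch of the precision/speed changeover idea, under assumed simulator hooks run_functional and run_cycle_accurate (hypothetical names): only pre-selected windows are simulated cycle-accurately, and the sampled CPI is scaled up to estimate the whole run.

        #include <stddef.h>
        #include <stdint.h>

        typedef struct { uint64_t start_insn; uint64_t len_insn; } sample_t;

        /* Hypothetical simulator hooks -- placeholders, not a real simulator API. */
        uint64_t run_functional(uint64_t n_insns);       /* fast, functional-only mode */
        uint64_t run_cycle_accurate(uint64_t n_insns);   /* returns simulated cycles   */

        double estimate_total_cycles(const sample_t *s, size_t nsamples,
                                     uint64_t total_insns)
        {
            uint64_t done = 0, sampled_insns = 0, sampled_cycles = 0;
            for (size_t i = 0; i < nsamples; i++) {
                run_functional(s[i].start_insn - done);          /* fast-forward        */
                sampled_cycles += run_cycle_accurate(s[i].len_insn);
                sampled_insns  += s[i].len_insn;
                done = s[i].start_insn + s[i].len_insn;
            }
            run_functional(total_insns - done);                  /* finish in fast mode */
            if (sampled_insns == 0) return 0.0;
            return (double)sampled_cycles / (double)sampled_insns
                   * (double)total_insns;                        /* scale sampled CPI   */
        }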

  • ソフトウェア協調整チップマルチプロセッサにおけるデータ利用最適化に関する研究

     View Summary

    本年度は、昨年度に引き続きソフトウェア協調動作型チップマルチプロセッサ用のデータローカリティ最適化およびデータ転送最適化に関する研究を行なった。本研究では、データを共有するタスク群に着目し、プロセッサコアローカルなキャッシュやローカルメモリのサイズを考慮してこれらのタスクを分割し各プロセッサコアに割り当て、キャッシュやローカルメモリの有効利用を図る。さらに、残存するデータ転送を、プロセッサコアに割り当てたタスクとオーバラップして行うことにより、データ転送オーバヘッドの隠蔽を図る。具体的には、MPEG2エンコーデイング処理やJPEG2000エンコーディング処理などのマルチメディアデプリケーションをターゲットとして、これらのアプリケーションに自動的にデータローカリティ最適化とデータ転送最適化手法を適用し、チップマルチプロセッサ上で効率よく動作させるためのソフトウェア・ハードウェア協調動作技術の開発とその評価を行なった。評価の結果、とりわけMPEG2エンコーディング処理では動作周波数400MHz時で逐次実行に対し8プロセッサ使用時で7.97倍、動作周波数2.8GHz時で逐次実行に対し8プロセッサ使用時で6.54倍の速度向上率を得られることが確認できた。MPEG2エンコーディングプログラムに対する本データローカリティ最適化およびデータ転送最適化は、自動並列化コンパイラによりほぼ自動的に行われる。より多くのアプリケーションに対して本手法を自動的に適用し対象アプリケーションを拡大することは今後の課題である

Misc

  • 自動並列化コンパイラのコンパイル時間短縮のための実行プロファイル・フィードバックを用いたコード生成手法 (コンピュータシステム) -- (組込み技術とネットワークに関するワークショップETNET2017)

    藤野 里奈, 韓 吉新, 島岡 護, 見神 広紀, 宮島 崇浩, 高村 守幸, 木村 啓二, 笠原 博徳

    電子情報通信学会技術研究報告 = IEICE technical report : 信学技報   116 ( 510 ) 207 - 212  2017.03

    CiNii

  • 自動車リアルタイム制御計算の複数クラスタ構成マルチコア上での並列化 (コンピュータシステム) -- (組込み技術とネットワークに関するワークショップETNET2017)

    宮田 仁, 島岡 護, 見神 広紀, 西 博史, 鈴木 均, 木村 啓二, 笠原 博徳

    電子情報通信学会技術研究報告 = IEICE technical report : 信学技報   116 ( 510 ) 177 - 182  2017.03

    CiNii

  • 大規模システムを想定したGem5シミュレータの階層的インターコネクションネットワーク拡張 (コンピュータシステム) -- (組込み技術とネットワークに関するワークショップETNET2017)

    小野口 達也, 林 綾音, 宇高 勝之, 松島 裕一, 木村 啓二, 笠原 博徳

    電子情報通信学会技術研究報告 = IEICE technical report : 信学技報   116 ( 510 ) 147 - 152  2017.03

    CiNii

  • LLVMを用いたベクトルアクセラレータ用コードのコンパイル手法 (コンピュータシステム)

    丸岡 晃, 無州 祐也, 狩野 哲史, 持山 貴司, 北村 俊明, 神谷 幸男, 高村 守幸, 木村 啓二, 笠原 博徳

    電子情報通信学会技術研究報告 = IEICE technical report : 信学技報   116 ( 177 ) 19 - 24  2016.08

    CiNii

  • Android Video Processing System Combined with Automatically Parallelized and Power Optimized Code by OSCAR Compiler

    Bui Duc Binh, Tomohiro Hirano, Hiroki Mikami, Hideo Yamamoto, Keiji Kimura, Hironori Kasahara

      57 ( 4 )  2016.04

    CiNii

  • Dynamic Scheduling Algorithm for Automatically Parallelized and Power Reduced Applications on Multicore Systems

    GOTO Takashi, MUTO Kohei, HIRANO Tomohiro, MIKAMI Hiroki, TAKAHASHI Uichiro, INOUE Sakae, KIMURA Keiji, KASAHARA Hironori

    IEICE technical report. Computer systems   114 ( 506 ) 95 - 100  2015.03

     View Summary

    This paper proposes a dynamic scheduling algorithm for multiple automatically parallelized or power-reduced applications on multicore smart devices, to gain higher performance and lower power consumption within each application's deadline. The scheduling algorithm uses information such as time, power, deadline, and the number of cores for each application, and is composed of three types of scheduling. Using media codec applications as a benchmark, the proposed scheduling gained an 18.5% speedup and a 28.8% power reduction compared to FIFO scheduling. (An illustrative core-allocation sketch follows below.)

    CiNii
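
    As one plausible core-allocation policy only (not the algorithm of this report), the C sketch below picks, from a per-application table of predicted time and power versus core count, the lowest-power configuration that still meets the deadline; the profile structure and MAX_CORES are assumptions.

        #include <stddef.h>

        #define MAX_CORES 4
        typedef struct {
            double time_ms[MAX_CORES + 1];   /* time_ms[c]: predicted runtime on c cores */
            double power_w[MAX_CORES + 1];   /* power_w[c]: predicted power on c cores   */
            double deadline_ms;
        } app_profile_t;

        /* Choose the lowest-power core count that meets the deadline;
         * fall back to the fastest configuration if none does. */
        int choose_cores(const app_profile_t *a, int free_cores)
        {
            int best = -1, fastest = 1;
            int limit = free_cores < MAX_CORES ? free_cores : MAX_CORES;
            for (int c = 1; c <= limit; c++) {
                if (a->time_ms[c] < a->time_ms[fastest]) fastest = c;
                if (a->time_ms[c] <= a->deadline_ms &&
                    (best < 0 || a->power_w[c] < a->power_w[best]))
                    best = c;
            }
            return best > 0 ? best : fastest;
        }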

  • OSCAR自動並列化コンパイラを用いたリアルタイム動画像アプリケーションのHaswellマルチコア上での低消費電力化 (コンピュータシステム)

    飯塚 修平, 山本 英雄, 平野 智大, 岸本 耀平, 後藤 隆志, 見神 広紀, 木村 啓二, 笠原 博徳

    電子情報通信学会技術研究報告 = IEICE technical report : 信学技報   114 ( 506 ) 219 - 224  2015.03

     View Summary

    スマートフォンやノートパソコンといったモバイル端末からデータセンタで利用されるサーバーマシンまで,あらゆる計算機において消費電力の削減が最重要課題となっている.これは、消費電力の削減によりモバイル機器においてはバッテリー持続時間の延長により利便性が大幅に向上し,またサーバーマシンにおいては膨大な電力コストや空調コストの削減が実現できるからである.これらの計算機は高性能かつ低消費電力を実現するためにマルチコアプロセッサを搭載したものが主流となっている.しかしながらマルチコアの資源を有効活用してこれらを実現するためには,プログラムの並列化が不可欠であり手動で行うには膨大な工数を必要とする.本稿では,医用・防犯・個人認証・車載などで広く利用されているリアルタイム物体認識処理に対して,OSCAR自動並列化コンパイラによるDVFS及びclock gatingによる電力制御を適用し,現在幅広く利用されているIntel Haswell Core i7-4770Kマルチコア上で評価した. Intel Haswellマルチコア上で,Webカメラからの画像の入力・人の顔の認識処理・画面描画というリアルタイムなシステム全域における消費電力の削減を行ったところ,1PE逐次実行では電力制御なしの場合の31.06[W]から電力制御ありの場合では28.74[W]に、3PEで並列化実行した場合では電力制御なし場合のの41.73[W]から電力制御の場合では17.78[W]に消費電力を削減したことが確認され,物体認識処理におけるマルチコア用のコンパイラ自動電力制御の有用性が確認できた.

    CiNii

  • OSCAR自動並列化コンパイラを用いたリアルタイム動画像アプリケーションのHaswellマルチコア上での低消費電力化 (ディペンダブルコンピューティング)

    飯塚 修平, 山本 英雄, 平野 智大, 岸本 耀平, 後藤 隆志, 見神 広紀, 木村 啓二, 笠原 博徳

    電子情報通信学会技術研究報告 = IEICE technical report : 信学技報   114 ( 507 ) 219 - 224  2015.03

     View Summary

    スマートフォンやノートパソコンといったモバイル端末からデータセンタで利用されるサーバーマシンまで,あらゆる計算機において消費電力の削減が最重要課題となっている.これは、消費電力の削減によりモバイル機器においてはバッテリー持続時間の延長により利便性が大幅に向上し,またサーバーマシンにおいては膨大な電力コストや空調コストの削減が実現できるからである.これらの計算機は高性能かつ低消費電力を実現するためにマルチコアプロセッサを搭載したものが主流となっている.しかしながらマルチコアの資源を有効活用してこれらを実現するためには,プログラムの並列化が不可欠であり手動で行うには膨大な工数を必要とする.本稿では,医用・防犯・個人認証・車載などで広く利用されているリアルタイム物体認識処理に対して,OSCAR自動並列化コンパイラによるDVFS及びclock gatingによる電力制御を適用し,現在幅広く利用されているIntel Haswell Core i7-4770Kマルチコア上で評価した. Intel Haswellマルチコア上で,Webカメラからの画像の入力・人の顔の認識処理・画面描画というリアルタイムなシステム全域における消費電力の削減を行ったところ, 1PE逐次実行では電力制御なしの場合の31.06[W]から電力制御ありの場合では28.74[W]に、3PEで並列化実行した場合では電力制御なし場合のの41.73[W]から電力制御の場合では17.78[W]に消費電力を削減したことが確認され,物体認識処理におけるマルチコア用のコンパイラ自動電力制御の有用性が確認できた.

    CiNii

  • Dynamic Scheduling Algorithm for Automatically Parallelized and Power Reduced Applications on Multicore Systems

    GOTO Takashi, MUTO Kohei, HIRANO Tomohiro, MIKAMI Hiroki, TAKAHASHI Uichiro, INOUE Sakae, KIMURA Keiji, KASAHARA Hironori

    IEICE technical report. Dependable computing   114 ( 507 ) 95 - 100  2015.03

     View Summary

    This paper proposes a dynamic scheduling algorithm for multiple automatically parallelized or power-reduced applications on multicore smart devices, to gain higher performance and lower power consumption within each application's deadline. The scheduling algorithm uses information such as time, power, deadline, and the number of cores for each application, and is composed of three types of scheduling. Using media codec applications as a benchmark, the proposed scheduling gained an 18.5% speedup and a 28.8% power reduction compared to FIFO scheduling.

    CiNii

  • 動画像デコーディングのIntelおよびARMマルチコア上での並列処理の評価 (ディペンダブルコンピューティング)

    和気 珠実, 飯塚 修平, 見神 広紀, 木村 啓二, 笠原 博徳

    電子情報通信学会技術研究報告 = IEICE technical report : 信学技報   114 ( 507 ) 263 - 268  2015.03

     View Summary

    本稿では,マルチコアプロセッサを用いて動画像デコーディング処理の高速化を実現する手法として2種類の並列化手法について性能評価を行った.1つ目の並列化手法は並列化対象ループにループスキューイング/ループインターチェンジを適用する手法,2つ目の並列化手法はwave-front手法を適用する手法であり,どちらの場合もマクロブロック間の依存関係を満たしつつこれらの間の並列性を利用することで並列処理が可能となる.評価に用いる動画像コーデックは,MPEG2と比較して約2倍の符号化効率を持ちワンセグ放送等に用いられているH.264/AVCと,H.264/AVCと同等の品質を持ちYoutube等でも採用されている動画規格であるWebMのビデオコーデックVP8である.これらの規格により動画像デコーディングを行うプログラムに対して,上記2つの並列化手法をそれぞれ適用した.Snapdragon APQ8064 Krait 4コアを搭載したNexus7上で評価を行った結果,ループスキューイング/ループインターチェンジ手法で並列化した場合,並列化箇所のみで逐次実行に比べ3コアで1.33倍速度向上し,その一方でwave-front手法では3コアで2.86倍の速度向上が得られた.同様にIntel(R) Xeon(R) CPU X5670プロセッサを搭載したマシンで評価を行った結果,ループスキューイング/ループインターチェンジ手法で並列化した場合,並列化箇所のみで逐次実行に比べ6コアで1.82倍速度向上し,一方でwave-front手法では6コアで4.61倍の速度向上が得られた.

    CiNii

  • OSCAR自動並列化コンパイラを用いたリアルタイム動画像アプリケーションのHaswellマルチコア上での低消費電力化

    飯塚 修平, 山本 英雄, 平野 智大, 岸本 耀平, 後藤 隆志, 見神 広紀, 木村 啓二, 笠原 博徳

    研究報告組込みシステム(EMB)   2015 ( 20 ) 1 - 6  2015.02

     View Summary

    スマートフォンやノートパソコンといったモバイル端末からデータセンタで利用されるサーバーマシンまで,あらゆる計算機において消費電力の削減が最重要課題となっている.これは,消費電力の削減によりモバイル機器においてはバッテリー持続時間の延長により利便性が大幅に向上し,またサーバーマシンにおいては膨大な電力コストや空調コストの削減が実現できるからである.これらの計算機は高性能かつ低消費電力を実現するためにマルチコアプロセッサを搭載したものが主流となっている.しかしながらマルチコアの資源を有効活用してこれらを実現するためには,プログラムの並列化が不可欠であり手動で行うには膨大な工数を必要とする.本稿では,医用・防犯・個人認証・車載などで広く利用されているリアルタイム物体認識処理に対して,OSCAR 自動並列化コンパイラによる DVFS 及び clock gating による電力制御を適用し,現在幅広く利用されている Intel Haswell Core i7-4770K マルチコア上で評価した.Intel Haswell マルチコア上で,Web カメラからの画像の入力・人の顔の認識処理・画面描画というリアルタイムなシステム全域における消費電力の削減を行ったところ,1PE 逐次実行では電力制御なしの場合の 31.06[W] から電力制御ありの場合では 28.74[W] に,3PE で並列化実行した場合では電力制御なし場合のの 41.73[W] から電力制御の場合では 17.78[W] に消費電力を削減したことが確認され,物体認識処理におけるマルチコア用のコンパイラ自動電力制御の有用性が確認できた.

    CiNii

  • 動画像デコーディングのIntelおよびARMマルチコア上での並列処理の評価

    和気 珠実, 飯塚 修平, 見神 広紀, 木村 啓二, 笠原 博徳

    研究報告組込みシステム(EMB)   2015 ( 35 ) 1 - 6  2015.02

     View Summary

    本稿では,マルチコアプロセッサを用いて動画像デコーディング処理の高速化を実現する手法として 2 種類の並列化手法について性能評価を行った.1 つ目の並列化手法は並列化対象ループにループスキューイング/ループインターチェンジを適用する手法,2 つ目の並列化手法は wave-front 手法を適用する手法であり,どちらの場合もマクロブロック間の依存関係を満たしつつこれらの間の並列性を利用することで並列処理が可能となる.評価に用いる動画像コーデックは,MPEG2 と比較して約 2 倍の符号化効率を持ちワンセグ放送等に用いられている H.264/AVC と,H.264/AVC と同等の品質を持ち Youtube 等でも採用されている動画規格である WebM のビデオコーデック VP8 である.これらの規格により動画像デコーディングを行うプログラムに対して,上記 2 つの並列化手法をそれぞれ適用した.Snapdragon APQ8064 Krait 4 コアを搭載した Nexus7 上で評価を行った結果,ループスキューイング/ループインターチェンジ手法で並列化した場合,並列化箇所のみで逐次実行に比べ 3 コアで 1.33 倍速度向上し,その一方で wave-front 手法では 3 コアで 2.86 倍の速度向上が得られた.同様に Intel(R) Xeon(R) CPU X5670 プロセッサを搭載したマシンで評価を行った結果,ループスキューイング/ループインターチェンジ手法で並列化した場合,並列化箇所のみで逐次実行に比べ 6 コアで 1.82 倍速度向上し,一方で wave-front 手法では 6 コアで 4.61 倍の速度向上が得られた.

    CiNii

  • Dynamic Scheduling Algorithm for Automatically Parallelized and Power Reduced Applications on Multicore Systems

      2015 ( 34 ) 1 - 6  2015.02

     View Summary

    This paper proposes a dynamic scheduling algorithm for multiple automatically parallelized or power-reduced applications on multicore smart devices, to gain higher performance and lower power consumption within each application's deadline. The scheduling algorithm uses information such as time, power, deadline, and the number of cores for each application, and is composed of three types of scheduling. Using media codec applications as a benchmark, the proposed scheduling gained an 18.5% speedup and a 28.8% power reduction compared to FIFO scheduling.

    CiNii

  • 自動並列化コンパイラによるソフトウェアキャッシュコヒーレンシ制御手法の評価

    岸本 耀平, 間瀬 正啓, 木村 啓二, 笠原 博徳

    研究報告ハイパフォーマンスコンピューティング(HPC)   2014 ( 19 ) 1 - 7  2014.12

     View Summary

    主記憶共有型マルチコアプロセッサにおいて,一般にキャッシュコヒーレンシ制御はハードウェアにより実現されている.今後のプロセッサコア数の増加に伴いキャッシュコヒーレンシハードウェアの回路規模は大きくなり,チップへの実装が困難になること,電力消費が大きくなること,設計期間及び開発費用が増大することが懸念されている.本稿ではこのハードウェアコヒーレンシ制御の問題を解決するために,ハードウェアコヒーレンシ制御機構を持たない主記憶共有型ノンコヒーレントキャッシュマルチコアに対して,並列化コンパイラがソフトウェアに対し自動的にコヒーレンシ制御を行う手法を提案する.本手法を実装した OSCAR 自動並列化コンパイラと,4 コアのクラスタを 2 つ持ちクラスタ間ではハードウェアコヒーレンシを持たない情報家電用マルチコア RP2 を用い性能評価を行った.9 つの科学技術計算アプリケーションを対象として評価を行ったところ,4 コアのハードウェアコヒーレンシ制御使用時の性能は平均で 1 コア性能の 2.80 倍であったのに対し,ハードウェアコヒーレンシを使用せず本手法を適用した 4 コア実行時の性能は平均で 1 コア性能の 2.61 倍となりほぼ同等の速度向上が得られ,さらに 8 コアハードウェアコヒーレンシ制御無効時には平均で 1 コア性能の 3.66 倍とスケールアップすることが確認できた.

    CiNii

  • Prospect of Green Computing

      4 ( 4 ) 3 - 8  2014.10

    CiNii

  • Linux ftraceを用いたマルチコアプロセッサ上での並列化プログラムのトレース手法

    福意 大智, 島岡 護, 見神 広紀, Dominic Hillenbrand, 木村 啓二, 笠原 博徳

    研究報告計算機アーキテクチャ(ARC)   2014 ( 6 ) 1 - 6  2014.07

     View Summary

    ソフトウェアの適切な並列化により,マルチコアを搭載したコンピュータシステム上でアプリケーションを高速に動作させることが可能である.並列化されたソフトウェアの挙動や性能を調査する手法として,ソースコードの解読や実行ダンプファイルの収集,プロファイラの利用,デバッガの利用といった方法が挙げられる.しかしこれらの手法ではどのようなタイミングにおいてコンテクストスイッチが発生したのか,システムで発生する事象に対してソフトウェアがどのような影響を受けているかといった情報を得ることは困難である.そこで,本稿では並列化されたプログラムが実際に並列実行される様子をソフトウェアからトレースに任意のアノテーションを挿入可能とする拡張を施した Linux ftrace を用いて解析する手法を提案する.提案手法を用いて,Intel Xeon X7560,ARMv7 の各々のプラットフォームにおいて equake,art,mpeg2enc というベンチマークのトレースを行い,これらのプログラムが実行時に OS からどのような影響を受けているか観測できることが確認できた.また,1 回のアノテーションの挿入を Intel Xeon で 1.07[us],ARMで4.44[us] で可能であることが確認できた.

    CiNii

  • 大規模無線センサネットワークにおける外乱を考慮したアーキテクチャ探索シミュレータの実装と評価

    山下浩一郎, 鈴木貴久, 栗原康志, 大友俊也, 木村啓二, 笠原博徳

    マルチメディア、分散協調とモバイルシンポジウム2014論文集   2014   1368 - 1377  2014.07

    CiNii

  • A Parallelizing Compiler Cooperative Acceleration Technique of Multicore Architecture Simulation using a Statistical Method

    TAGUCHI Gakuho, KIMURA Keiji, KASAHARA Hironori

    IEICE technical report. Dependable computing   113 ( 498 ) 289 - 294  2014.03

     View Summary

    A parallelizing compiler cooperative acceleration technique for multicore architecture simulation is proposed in this paper. Profile data from a sequential execution of the target application on a real machine is decomposed into multiple clusters by x-means clustering, and sampling points for the detailed simulation mode are then calculated within each cluster. In addition, the parallelizing compiler generates parallelized code from both the clustering information and the source code of the target application. The evaluation results show that, for a 16-core simulation, a 437 times speedup is achieved with 0.04% error for equake, and a 28 times speedup is achieved with 0.04% error for the mpeg2 encoder.

    CiNii

  • A Latency Reduction Technique for IDS by Allocating Decomposed Signature on Multi-core

    Shohei Yamada, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    IPSJ SIG Notes   2014 ( 2 ) 1 - 8  2014.02

     View Summary

    Cyber attacks targeting companies and government organizations have been increasing and becoming highly sophisticated. An Intrusion Detection System (IDS) is one of the efficient solutions for preventing such attacks. An IDS detects illegal network accesses in real time by monitoring the network and filtering suspicious IP packets, so large processing performance is required for IDSs to process a large number of IP packets in real time. In order to satisfy this requirement, a latency reduction technique for signature-based IDSs that allocates a decomposed signature set onto multicores is proposed in this paper. The proposed technique is implemented in Suricata, an open-source IDS, and evaluated with several data sets, such as the DARPA Intrusion Detection Evaluation Data Set. The evaluation results show that the proposed technique with four cores achieves a 3.22 times performance improvement at maximum compared with two cores without signature decomposition. (An illustrative decomposition sketch follows below.)

    CiNii
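
    A minimal pthreads sketch of the signature-decomposition idea: the rule set is split into disjoint shards, one per thread, and a packet payload is matched against all shards in parallel. The substring rules and one-thread-per-packet structure are simplifications for illustration; a real IDS such as Suricata uses a persistent thread pool and much richer rule matching.

        #include <pthread.h>
        #include <stddef.h>
        #include <string.h>

        #define NWORKERS 4

        typedef struct { const char *pattern; } rule_t;          /* simplified rule */

        typedef struct {
            const rule_t *rules; size_t nrules;   /* this worker's shard of the rule set */
            const char *payload;                  /* packet payload to inspect           */
            int hit;
        } shard_t;

        static void *scan_shard(void *arg)
        {
            shard_t *s = (shard_t *)arg;
            for (size_t i = 0; i < s->nrules; i++)
                if (strstr(s->payload, s->rules[i].pattern)) { s->hit = 1; break; }
            return NULL;
        }

        int match_packet(const rule_t *rules, size_t nrules, const char *payload)
        {
            pthread_t th[NWORKERS]; shard_t sh[NWORKERS];
            size_t per = (nrules + NWORKERS - 1) / NWORKERS;
            for (int w = 0; w < NWORKERS; w++) {
                size_t lo = (size_t)w * per;
                if (lo > nrules) lo = nrules;
                size_t hi = lo + per > nrules ? nrules : lo + per;
                sh[w] = (shard_t){ rules + lo, hi - lo, payload, 0 };
                pthread_create(&th[w], NULL, scan_shard, &sh[w]);
            }
            int hit = 0;
            for (int w = 0; w < NWORKERS; w++) {
                pthread_join(th[w], NULL);
                hit |= sh[w].hit;
            }
            return hit;                           /* nonzero if any shard matched */
        }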

  • Automatic Parallelization of Small Point FFT on Multicore Processor

    Yuuki Furuyama, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

    IPSJ SIG Notes   2014 ( 3 ) 1 - 8  2014.02

     View Summary

    Fast Fourier Transform (FFT) is one of the most frequently used algorithms for computing the Discrete Fourier Transform (DFT) in many applications, including digital signal processing and image processing. Although small-size FFT programs must be used in baseband signal processing such as LTE, it is difficult to use special hardware like DSPs for computing such small problems because of their relatively large data transfer and control overhead. This paper proposes an automatic parallelization method to generate low-overhead parallelized programs for small-size FFTs suited for shared-memory multicore processors, applying cache optimization to avoid false sharing between cores. The proposed method has been implemented in the OSCAR automatic parallelizing compiler; small-point FFT programs from 32 points to 256 points were parallelized and evaluated on the RP2 multicore processor, which has 8 SH-4A cores. It achieved a 1.97 times speedup on 2 SH-4A cores and a 3.9 times speedup on 4 SH-4A cores for a 256-point FFT program. In addition to the FFT programs, the proposed approach was applied to the Fast Hadamard Transform (FHT), whose computation is similar to the FFT; the results are a 1.91 times speedup on 2 SH-4A cores and a 3.32 times speedup on 4 SH-4A cores. This shows the effectiveness of the proposed method and the ease of applying it to many kinds of programs. (An illustrative padding sketch follows below.)

    CiNii
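
    A minimal C sketch of the cache-optimization idea mentioned above: giving each core its own cache-line-aligned work area so that adjacent partial results of a small FFT never share a cache line. The sizes and the GCC-style aligned attribute are assumptions, not OSCAR-generated code.

        #define CACHE_LINE 32                /* SH-4A uses 32-byte cache lines      */
        #define NPE        4                 /* number of processor elements        */
        #define NPOINTS    256               /* FFT size                            */

        /* Without alignment/padding, per-core partial results packed back to back
         * can fall on the same cache line and ping-pong between cores.            */
        typedef struct {
            float re[NPOINTS / NPE];
            float im[NPOINTS / NPE];
        } __attribute__((aligned(CACHE_LINE))) pe_buf_t;

        pe_buf_t work[NPE];                  /* each element starts on its own line */

        void scale_part(int pe, float s)     /* core `pe` touches only its element  */
        {
            for (int i = 0; i < NPOINTS / NPE; i++) {
                work[pe].re[i] *= s;
                work[pe].im[i] *= s;
            }
        }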

  • プロファイル情報を用いたAndroid 2D描画ライブラリSKIAのOSCARコンパイラによる並列化

    後藤隆志, 武藤康平, 山本英雄, 平野智大, 見神広紀, 木村啓二, 笠原博徳

    研究報告ハイパフォーマンスコンピューティング(HPC)   2013 ( 12 ) 1 - 7  2013.12

     View Summary

    本論文では,スマートフォンやタブレット等で広く用いられる Android において,従来マルチコアプロセッサ上での並列化が困難で,その高速化が望まれていた 2D 描画ライブラリ Skia を,OSCAR 自動並列化コンパイラにより,プロファイラ情報に基づいた自動並列化を行う手法を開発したのでその方法を説明する.OSCAR コンパイラは Parallelizable C により記述された逐次プログラムから様々な粒度で並列化解析を行い,自動的に並列化 C ソースを出力する.しかし,Skia は Android 内のライブラリであり,利用する描画命令ルーチンにより制御フローが大きく変化するため,最適な並列化解析を行うことが困難である.そこで,本論文では Skia のような制御フローがコンパイル時に特定できないプログラムに対し,Oprofile を用いて取得したプロファイル結果を OSCAR コンパイラにフィードバックすることで,並列化対象を特定の領域に絞り,高い性能向上が得られる手法を提案する.なお,並列化対象領域が Parallelizable C コードでない場合でも,解析結果により実行コストが大きい部分から Parallelizable C に変更し,チューニングを施すことで並列化が可能となる.本手法を,描画ベンチマークとして広く使われている 0xbench を NVIDIA Tegra3 チップ (ARM Cortex-A9 4 コア) を搭載した Nexus7 上で評価を行った.並列化 Skia の実行においては,並列化部分の速度向上を正確に評価するため, Android を core0 に割り当て,残り 3 コアを Skia が利用できる形とした.評価の結果として,DrawRect で従来の 1.91 倍である 43.57 [fps],DrawArc で 1.32 倍の 50.98[fps],DrawCircle2 では 1.5 倍の 50.77[fps] といずれも性能向上結果が得られた.

    CiNii

  • プロファイル情報を用いたAndroid 2D描画ライブラリSKIAのOSCARコンパイラによる並列化

    後藤隆志, 武藤康平, 山本英雄, 平野智大, 見神広紀, 木村啓二, 笠原博徳

    研究報告計算機アーキテクチャ(ARC)   2013 ( 12 ) 1 - 7  2013.12

     View Summary

    本論文では,スマートフォンやタブレット等で広く用いられる Android において,従来マルチコアプロセッサ上での並列化が困難で,その高速化が望まれていた 2D 描画ライブラリ Skia を,OSCAR 自動並列化コンパイラにより,プロファイラ情報に基づいた自動並列化を行う手法を開発したのでその方法を説明する.OSCAR コンパイラは Parallelizable C により記述された逐次プログラムから様々な粒度で並列化解析を行い,自動的に並列化 C ソースを出力する.しかし,Skia は Android 内のライブラリであり,利用する描画命令ルーチンにより制御フローが大きく変化するため,最適な並列化解析を行うことが困難である.そこで,本論文では Skia のような制御フローがコンパイル時に特定できないプログラムに対し,Oprofile を用いて取得したプロファイル結果を OSCAR コンパイラにフィードバックすることで,並列化対象を特定の領域に絞り,高い性能向上が得られる手法を提案する.なお,並列化対象領域が Parallelizable C コードでない場合でも,解析結果により実行コストが大きい部分から Parallelizable C に変更し,チューニングを施すことで並列化が可能となる.本手法を,描画ベンチマークとして広く使われている 0xbench を NVIDIA Tegra3 チップ (ARM Cortex-A9 4 コア) を搭載した Nexus7 上で評価を行った.並列化 Skia の実行においては,並列化部分の速度向上を正確に評価するため, Android を core0 に割り当て,残り 3 コアを Skia が利用できる形とした.評価の結果として,DrawRect で従来の 1.91 倍である 43.57 [fps],DrawArc で 1.32 倍の 50.98[fps],DrawCircle2 では 1.5 倍の 50.77[fps] といずれも性能向上結果が得られた.

    CiNii

  • OSCAR API標準解釈系を用いた階層グルーピング対応ハードウェアバリア同期機構の評価

    川島慧大, 金羽木洋平, 林明宏, 木村啓二, 笠原博徳

    研究報告計算機アーキテクチャ(ARC)   2013 ( 16 ) 1 - 6  2013.07

     View Summary

    1 チップ内に搭載されるコア数の増加に伴い,アプリケーションからより多くの並列性を抽出し,低オーバーヘッドで利用することがこれらのコアを有効利用するために重要となっている.OSCAR コンパイラによる自動並列化ではより多くの並列性を利用するため,ループやサブルーチン内部の粗粒度並列性を解析し,階層的にタスク定義を行う.この階層的に定義されたタスクをコアを階層的にグルーピングし,コアグループに対して割り当てることにより並列処理を実現する.この階層的なグループ間で独立かつ低コストでバリア同期を実現できるハードウェアが提案され,SH4A プロセッサ 8 コア搭載の情報家電用マルチコア RP2 に実装されている.本稿では,OSCAR API 標準解釈系の階層グループバリア同期 API を RP2 のハードウェアバリア同期機構に対応し評価を行った結果について述べる.8 コアを使用した SPEC CPU 2000 の ART による評価ではソフトウェアでのバリア同期に対し 1.16 倍の性能向上が得られた.

    CiNii

  • マルチコア商用スマートディバイスの評価と並列化の試み

    山本 英雄, 後藤 隆志, 平野 智大, 武藤 康平, 見神 広紀, Dominic Hillenbrand, 林 明宏, 木村 啓二, 笠原 博徳

    研究報告システムソフトウェアとオペレーティング・システム(OS)   2013 ( 2 ) 1 - 7  2013.02

     View Summary

    半導体プロセスの微細化に伴いスマートフォン,タブレットに代表される民生機器にも4コア程度のマルチコアSoCの採用が進んでいる.一方,ソフトウェアはマルチコアを活用するための並列化が十分に進んでおらず,対応が望まれている.本稿ではAndroidを搭載した商用スマートデバイスにおいて,一般的な利用範囲におけるマルチコアの活用状況を評価し,並列化されたベンチマークプログラムを用いて実行環境の課題と改善方式を述べた上で,標準APIの仕様を変更すること無く,アプリケーションがオフスクリーンバッファを描画バッファに書くBitBLT処理の並列化を試みた結果を報告する.この処理並列化の結果,アプリケーションから2D描画APIを呼び出すベンチマークテストで約3%のフレームレートの改善を確認した.

    CiNii

  • A Parallelizing Compiler Cooperative Multicore Architecture Simulator with Changeover Mechanism of Simulation Modes

    TAGUCHI GAKUHO, ABE YOUICHI, KIMURA KEIJI, KASAHARA HIRONORI

    Technical report of IEICE. ICD   112 ( 425 ) 65 - 71  2013.01

     View Summary

    A parallelizing compiler cooperative multicore architecture simulation framework, which enables reducing simulation time by a flexible simulation-mode changeover mechanism, is proposed. A multicore architecture simulator in this framework has two modes, namely a functional-and-fast simulation mode and a cycle-accurate-and-slow simulation mode. This framework generates appropriate sampling points for the cycle-accurate mode and a runtime for mode changeover of the simulator, depending on the parallelized application, by cooperating with a parallelizing compiler. The proposed framework is evaluated with EQUAKE from SPEC2000. The evaluation result shows that a 50 times to 500 times speedup can be achieved within 1.6% error.

    CiNii

  • An Acceleration Technique of Many-core Architecture Simulation with Parallelized Applications by Statistical Technique

    Abe Yoichi, Taguchi Gakuho, Kimura Keiji, Kasahara Hironori

    Technical report of IEICE. ICD   112 ( 425 ) 57 - 63  2013.01

     View Summary

    This paper proposes an automatic decision technique for the number of clusters and sampling points in an acceleration technique for many-core architecture simulation based on statistical methods. The technique first focuses on the structure of a benchmark program, especially its loops. The number of sampling points is derived from the iterations of a target loop by statistical methods; if the variation of the iteration costs is large, the iterations are grouped into clusters, so the technique enables higher estimation accuracy with fewer sampling points. However, the number of clusters had to be decided by hand in our previous work. An automatic decision technique for the number of clusters using x-means is proposed in this paper. As a preliminary evaluation of the proposed technique, sequential execution costs of several benchmark programs are estimated. As a result, for the MPEG2 encoder program with SIF16, which causes large variation among iteration costs, a 1.92% error is achieved with 14 of 450 iterations used as sampling points, as selected by x-means. (A worked sample-size formula follows below.)

    CiNii
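
    For reference, one standard way to turn iteration-cost variation into a number of sampling points is the textbook sample-size formula sketched below; it is assumed here as a typical criterion and may differ from the exact rule used in the report.

        /* Sample size for estimating a mean within relative error rel_err at the
         * confidence level implied by z (z = 1.96 for about 95%):
         *     n = (z * cv / rel_err)^2,   cv = stddev / mean of iteration costs. */
        #include <math.h>
        #include <stddef.h>

        size_t samples_needed(double mean, double stddev, double rel_err, double z)
        {
            double cv = stddev / mean;                     /* coefficient of variation */
            double n  = (z * cv / rel_err) * (z * cv / rel_err);
            return (size_t)ceil(n);
        }
        /* e.g., samples_needed(mu, sigma, 0.02, 1.96) for a ~2% target error. */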

  • Parallelization of Automobile Engine Control Software on Multicore Processor

    KANEHAGI YOUHEI, UMEDA DAN, MIKAMI HIROKI, HAYASHI AKIHIRO, SAWADA MITSUO, KIMURA KEIJI, KASAHARA HIRONORI

    Technical report of IEICE. ICD   112 ( 425 ) 3 - 10  2013.01

     View Summary

    The computational load of automobile control systems is increasing in order to achieve more safety, comfort, and energy saving. Accordingly, control processor cores need higher performance. However, improving the clock frequency of processor cores is difficult, and it is important to use multicore processors. When multicores are used for engine control, performance, development cost, and development period become problems, because it is difficult to parallelize the software. This paper proposes a parallelization method for automobile engine control software, which so far has run only on single-core processors, on a multicore processor. Concretely, the sequential program is restructured to extract more parallelism, for example by inlining functions and duplicating conditional branches, and the OSCAR compiler then performs automatic parallelization and generation of a parallel C program. Using the proposed method, the automobile engine control software, which is difficult to parallelize manually because it is very fine grained, is parallelized and gives a 1.71x speedup using 2 cores on the RP-X multicore. It is confirmed that parallelization of automobile engine control software is effective. (An illustrative branch-duplication sketch follows below.)

    CiNii
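
    A before/after C sketch of the branch-duplication restructuring mentioned above: duplicating a conditional removes the shared control dependence between two statement groups so they can be treated as independent coarse-grain tasks. The routine names are invented stand-ins, not the actual engine-control code.

        void task_a(void); void task_a_alt(void);    /* hypothetical control routines */
        void task_b(void); void task_b_alt(void);

        /* Before: one branch makes both groups depend on the same control flow. */
        void step_before(int mode)
        {
            if (mode) { task_a(); task_b(); }
            else      { task_a_alt(); task_b_alt(); }
        }

        /* After: the branch is duplicated per group; if the groups share no data,
         * each if-statement can be scheduled as its own coarse-grain task.        */
        void step_after(int mode)
        {
            if (mode) task_a(); else task_a_alt();   /* coarse-grain task 1 */
            if (mode) task_b(); else task_b_alt();   /* coarse-grain task 2 */
        }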

  • A Definition of Parallelizable C by JISX0180:2011 "Framework of establishing coding guidelines for embedded system development"

    KIMURA KEIJI, MASE MASAYOSHI, KASAHARA HIRONORI

    IEICE technical report. Dependable computing   111 ( 462 ) 127 - 132  2012.02

     View Summary

    JISX0180:2011 "Framework of establishing coding guidelines for embedded system development" was enacted to improve the quality of embedded systems. Parallelizable C has also been proposed to support the exploitation of parallelism by a parallelizing compiler. This paper proposes a definition of Parallelizable C based on JISX0180:2011, aiming at improving the productivity of embedded multicore developers who use parallelizing compilers. An evaluation has been carried out using programs rewritten according to the defined coding guideline on ordinary SMPs and a consumer electronics multicore. As a result, a 5.54x speedup on IBM p5 550Q (8 cores), a 2.42x speedup on Intel Core i7 960 (4 cores), and a 2.79x speedup on the Renesas/Hitachi/Waseda RP2 (4 cores) have been achieved, respectively. (An illustrative coding-style sketch follows below.)

    CiNii
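
    As an illustration of the spirit of such coding guidelines only (not the actual JISX0180-based rule text), the C sketch below contrasts a pointer-chasing loop with a restrict-qualified, array-indexed form that a parallelizing compiler can analyze.

        #include <stddef.h>

        /* Harder to analyze: pointer arithmetic plus possible aliasing of dst/src. */
        void scale_ptr(float *dst, const float *src, size_t n, float k)
        {
            while (n--) *dst++ = *src++ * k;
        }

        /* Friendlier to a parallelizing compiler: restrict rules out aliasing, and
         * a canonical counted loop with plain indexing exposes the parallelism.   */
        void scale_idx(float * restrict dst, const float * restrict src,
                       size_t n, float k)
        {
            for (size_t i = 0; i < n; i++)
                dst[i] = src[i] * k;
        }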

  • 科学技術計算プログラムの構造を利用したメニーコアアーキテクチャシミュレーション高速化手法の評価

    石塚亮, 阿部洋一, 大胡亮太, 木村啓二, 笠原博徳

    研究報告計算機アーキテクチャ(ARC)   2011 ( 14 ) 1 - 11  2011.07

     View Summary

    本稿ではキャッシュやパイプラインまでシミュレーションする詳細シミュレーションと命令実行のみの高速な機能シミュレーションの両方を用いたシミュレーション精度切り替えによるメニーコアシミュレータの高速化手法を提案する.本手法はメニーコアシミュレータ上で並列化プログラムを実行することを前提としており,このプログラムの一部のみを詳細シミュレーションを行うことにより高速化を図る.このとき,詳細シミュレーションを行うサンプリング部分を実機での逐次実行プロファイル情報とプログラム構造から判断し,その分量を統計的手法により決定する.本手法を比較的規則性の高い科学技術計算である SPEC CPU 95のTOMCATV,SWIM で及び SPEC CPU 2000 の ART,EQUAKE を用いて統計学的に算出したサンプリングサイズの値を堺に,実行サイクルが収束していくことを示した.これにより,評価したところ,64 コアかつ精度切換えを想定したシミュレーションで,各アプリケーションにおいて,誤差5%の範囲で約 100 倍の高速化が可能であることを示した.

    CiNii

  • Evaluation of Parallelizable C Programs by the OSCAR API Standard Translator

    SATO TAKUYA, MIKAMI HIROKI, HAYASHI AKIHIRO, MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

      2010 ( 2 ) 1 - 6  2010.10

    CiNii

  • Performance of Power Reduction Scheme by a Compiler on Heterogeneous Multicore for Consumer Electronics "RP-X"

    WADA YASUTAKA, HAYASHI AKIHIRO, WATANABE TAKESHI, SEKIGUCHI TAKESHI, MASE MASAYOSHI, SHIRAKO JUN, KIMURA KEIJI, ITO MASAYUKI, HASEGAWA ATSUSHI, SATO MAKOTO, NOJIRI TOHRU, UCHIYAMA KUNIO, KASAHARA HIRONORI

      2010 ( 8 ) 1 - 10  2010.07

    CiNii

  • An Acceleration Technique of Many Core Architecture Simulator Considering Program Structure

    ISHIZUKA RYO, OOTOMO TOSHIYA, DAIGO RYOTA, KIMURA KEIJI, KASAHARA HIRONORI

      2010 ( 20 ) 1 - 7  2010.07

    CiNii

  • A Compiler Framework for Heterogeneous Multicores for Consumer Electronics

    HAYASHI AKIHIRO, WADA YASUTAKA, WATANABE TAKESHI, SEKIGUCHI TAKESHI, MASE MASAYOSHI, KIMURA KEIJI, ITO MASAYUKI, HASEGAWA ATSUSHI, SATO MAKOTO, NOJIRI TOHRU, UCHIYAMA KUNIO, KASAHARA HIRONORI

      2010 ( 7 ) 1 - 9  2010.07

    CiNii

  • Parallelizing Compiler Directed Software Coherence

    MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

      2010 ( 7 ) 1 - 10  2010.04

    CiNii

  • Multi Media Offload with Automatic Parallelization

    ISHIZAKA KAZUHISA, SAKAI JUNJI, EDAHIRO MASATO, MIYAMOTO TAKAMICHI, MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

      2010 ( 59 ) 1 - 7  2010.03

    CiNii

  • Processing Performance of Automatically Parallelized Applications on Embedded Multicore with Running Multiple Applications

    MIYAMOTO TAKAMICHI, MASE MASAYOSHI, KIMURA KEIJI, ISHIZAKA KAZUHISA, SAKAI JUNJI, EDAHIRO MASATO, KASAHARA HIRONORI

      2010 ( 9 ) 1 - 8  2010.02

    CiNii

  • Hierarchical parallel processing of H.264/AVC encoder on a multicore processor

    IEICE technical report   109 ( 405 ) 121 - 126  2010.01

    CiNii

  • Hierarchical Parallel Processing of H.264/AVC Encoder on a Multicore Processor

    MIKAMI Hiroki, MIYAMOTO Takamichi, KIMURA Keiji, KASAHARA Hironori

      2010 ( 22 ) 1 - 6  2010.01

    CiNii

  • Green Multicore-SoC Software-Execution Framework with Timely-Power-Gating Scheme

    ONOUCHI Masafumi, TOYAMA Keisuke, NOJIRI Toru, SATO Makoto, MASE Masayoshi, SHIRAKO Jun, SATO Mikiko, TAKADA Masashi, ITO Masayuki, MIZUNO Hiroyuki, NAMIKI Mitaro, KIMURA Keiji, KASAHARA Hironori

    IEICE technical report   109 ( 367 ) 7 - 12  2010.01

     View Summary

    We developed a software-execution framework for scalable increases in execution speed and low power consumption based on an octo-core chip multiprocessor named RP2 and an automatic multigrain parallelizing compiler named OSCAR. The keys to improving performance are reducing the communication overhead among parallelized tasks and frequently shutting down waiting cores. For this framework, we developed two schemes: data mapping and timely power gating. Measurement of the performance of the conventional framework and our proposed framework showed that the normalized execution speedup reaches 5.00 when secure AAC-LC encoding is processed in 8-parallel execution. Moreover, applying our timely-power-gating scheme improves power efficiency by 10%. (An illustrative idle-core power-gating sketch follows below.)

    CiNii
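
    A minimal sketch of the general intuition behind a timely-power-gating decision, assuming a break-even threshold and placeholder calls; BREAK_EVEN_US, power_gate_core() and maybe_power_gate() are hypothetical names for illustration, not the RP2 framework's interface.

```c
/* Illustrative sketch only: power-gate a waiting core when the
 * predicted idle time amortizes the cost of the state transition. */
#include <stdbool.h>
#include <stdio.h>

#define BREAK_EVEN_US 50   /* assumed sleep-entry/exit overhead in microseconds */

/* Placeholder for the platform-specific power-gating call. */
static void power_gate_core(int core_id) { (void)core_id; }

/* Called when a core runs out of ready tasks; predicted_idle_us would
 * come from the scheduler's knowledge of the parallelized task graph. */
static bool maybe_power_gate(int core_id, unsigned predicted_idle_us)
{
    if (predicted_idle_us > BREAK_EVEN_US) {
        power_gate_core(core_id);        /* worth entering the sleep state   */
        return true;
    }
    return false;                        /* too short: keep waiting actively */
}

int main(void)
{
    printf("gate? %d\n", maybe_power_gate(3, 200));  /* long wait  -> 1 */
    printf("gate? %d\n", maybe_power_gate(3, 10));   /* short wait -> 0 */
    return 0;
}
```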

  • Automatic Parallelization of Parallelizable C Programs on Multicore Processors

    MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

      2009 ( 15 ) 1 - 10  2009.07

    CiNii

  • Performance Evaluation of Minimum Execution Time Multiprocessor Scheduling Algorithms Using Standard Task Graph Set Ver3 Consider Parallelism of Task Graphs and Deviation of Task Execution Time

    SHIMAOKA MAMORU, IMAIZUMI KAZUHIRO, TAKANO FUMIYO, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2009 ( 14 ) 127 - 132  2009.02

     View Summary

    This paper proposes the "Standard Task Graph Set Ver3" (STG Ver3) to evaluate the performance of heuristic and optimization algorithms for the minimum execution time multiprocessor scheduling problem, which is known to be a strongly NP-hard combinatorial optimization problem. The STG Ver2 was created with random task execution times and random predecessors. In addition, the STG Ver3 considers the parallelism of task graphs and the deviation of task execution times to let us understand the characteristics of algorithms. This paper describes evaluation results obtained by applying the STG Ver3 to several algorithms. The performance evaluation shows that DF/IHS gives optimal solutions for 87.25% of the task graphs, and PDF/IHS for 92.25%, within 600 seconds.

    CiNii

  • Local Memory Management Scheme by a Compiler for Multicore Processor

    MOMOZONO Taku, NAKANO Hirofumi, MASE Masayoshi, KIMURA Keiji, KASAHARA Hironori

    IEICE technical report   108 ( 375 ) 69 - 74  2009.01

     View Summary

    This paper proposes a local memory management scheme for an automatic parallelizing compiler to make effective use of a limited-size local memory. After loop-aligned decomposition and task scheduling considering data locality and parallelism, the compiler allocates data to the local memory effectively using the task scheduling result. This paper evaluates the proposed scheme on the RP2 multicore for consumer electronics, which has 8 SH4A processor cores. Each core integrates 32KB of local data memory and 64KB of distributed shared memory. As a result, the proposed scheme using 8 processors gives about a 6.20 times speedup for the MPEG2 encoding program, a 7.25 times speedup for the AAC encoding program and a 7.64 times speedup for susan against sequential execution.

    CiNii

  • A Power Saving Scheme on Multicore Processors Using OSCAR API

    NAKAGAWA Ryo, MASE Masayoshi, SHIRAKO Jun, KIMURA Keiji, KASAHARA Hironori

    IEICE technical report   108 ( 375 ) 93 - 98  2009.01

     View Summary

    Effective power reduction of an application program on multicore processors requires appropriate power control for each on-chip resource by compilers or users. These low-power techniques need an application program interface (API) to realize power control in a user program. This paper proposes a power saving scheme for multicore processors using the OSCAR API developed in the NEDO "Multicore for Realtime Consumer Electronics" project. The proposed scheme has been implemented in the OSCAR compiler to realize power reduction in the fastest execution mode, which minimizes power consumption without performance degradation, and in the realtime execution mode, which minimizes power consumption under realtime constraints. The proposed scheme is evaluated on the 8-core SH4A multicore processor RP2, newly developed for consumer electronics by Renesas Technology Corp., Hitachi, Ltd. and Waseda University in the above project. In the fastest execution mode, consumed energy was reduced by 13.05% for SPEC2000 art and 3.99% for SPEC2000 equake. In the realtime execution mode, consumed power was reduced by 87.9% for the AAC encoder and 73.2% for the MPEG2 decoder.

    CiNii
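
    A hedged sketch of the two execution modes described above, assuming a simple table of frequency/voltage states; the freq_mhz table and the helper functions are illustrative only and are not the OSCAR API.

```c
/* Sketch of frequency selection for the two policies: realtime mode
 * picks the lowest frequency that still meets a deadline, and
 * fastest-execution mode treats a task's slack against the critical
 * path as its deadline so performance is not degraded. */
#include <stdio.h>

static const double freq_mhz[] = { 600.0, 300.0, 150.0 }; /* assumed FV states */
#define N_FREQ (sizeof freq_mhz / sizeof freq_mhz[0])

/* Realtime mode: lowest frequency whose execution time for `cycles`
 * still fits within `deadline_us`. */
static double pick_freq_realtime(double cycles, double deadline_us)
{
    for (int i = (int)N_FREQ - 1; i >= 0; i--) {   /* slowest first      */
        double t_us = cycles / freq_mhz[i];        /* cycles / MHz = us  */
        if (t_us <= deadline_us)
            return freq_mhz[i];
    }
    return freq_mhz[0];                            /* fall back: fastest */
}

/* Fastest-execution mode: slow a task down only within its slack. */
static double pick_freq_fastest(double cycles, double slack_us)
{
    return pick_freq_realtime(cycles, cycles / freq_mhz[0] + slack_us);
}

int main(void)
{
    printf("realtime: %.0f MHz\n", pick_freq_realtime(3.0e6, 15000.0));
    printf("fastest : %.0f MHz\n", pick_freq_fastest(3.0e6, 5000.0));
    return 0;
}
```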

  • Performance Evaluation of Parallelizing Compiler Cooperated Heterogeneous Multicore Architecture Using Media Applications

    KAMIYAMA Teruo, WADA Yasutaka, HAYASHI Akihiro, MASE Masayoshi, NAKANO Hirohumi, WATANABE Takeshi, KIMURA Keiji, KASAHARA Hironori

    IPSJ SIG Notes   2009 ( 1 ) 63 - 68  2009.01

     View Summary

    This paper describes a heterogeneous multicore architecture having accelerator cores in addition to general purpose cores, an automatic parallelizing compiler that works cooperatively with the heterogeneous multicore, a heterogeneous multicore architecture simulation environment, and performance evaluation results obtained with the simulation environment. For the performance evaluation, multimedia applications written in C or Fortran and parallelized by the compiler are used. As a result, the evaluated heterogeneous multicore having two general purpose cores and two accelerator cores achieves a 9.82 times speedup for the MP3 encoder and a 14.64 times speedup for the JPEG2000 encoder.

    CiNii

  • Performance Evaluation of Parallelizing Compiler Cooperated Heterogeneous Multicore Architecture Using Media Applications

    KAMIYAMA Teruo, WADA Yasutaka, HAYASHI Akihiro, MASE Masayoshi, NAKANO Hirohumi, WATANABE Takeshi, KIMURA Keiji, KASAHARA Hironori

    IEICE technical report   108 ( 375 ) 63 - 68  2009.01

     View Summary

    This paper describes a heterogeneous multicore architecture having accelerator cores in addition to general purpose cores, an automatic parallelizing compiler that works cooperatively with the heterogeneous multicore, a heterogeneous multicore architecture simulation environment, and performance evaluation results obtained with the simulation environment. For the performance evaluation, multimedia applications written in C or Fortran and parallelized by the compiler are used. As a result, the evaluated heterogeneous multicore having two general purpose cores and two accelerator cores achieves a 9.82 times speedup for the MP3 encoder and a 14.64 times speedup for the JPEG2000 encoder.

    CiNii

  • Local Memory Management Scheme by a Compiler for Multicore Processor

    MOMOZONO Taku, NAKANO Hirofumi, MASE Masayoshi, KIMURA Keiji, KASAHARA Hironori

    IPSJ SIG Notes   2009 ( 1 ) 69 - 74  2009.01

     View Summary

    This paper proposes a local memory management scheme for an automatic parallelizing compiler to make effective use of a limited-size local memory. After loop-aligned decomposition and task scheduling considering data locality and parallelism, the compiler allocates data to the local memory effectively using the task scheduling result. This paper evaluates the proposed scheme on the RP2 multicore for consumer electronics, which has 8 SH4A processor cores. Each core integrates 32KB of local data memory and 64KB of distributed shared memory. As a result, the proposed scheme using 8 processors gives about a 6.20 times speedup for the MPEG2 encoding program, a 7.25 times speedup for the AAC encoding program and a 7.64 times speedup for susan against sequential execution.

    CiNii

  • A Power Saving Scheme on Multicore Processors Using OSCAR API

    NAKAGAWA Ryo, MASE Masayoshi, SHIRAKO Jun, KIMURA Keiji, KASAHARA Hironori

    IPSJ SIG Notes   2009 ( 1 ) 93 - 98  2009.01

     View Summary

    Effective power reduction of an application program on multicore processors requires appropriate power control for each on-chip resource by compilers or users. These low-power techniques need an application program interface (API) to realize power control in a user program. This paper proposes a power saving scheme for multicore processors using the OSCAR API developed in the NEDO "Multicore for Realtime Consumer Electronics" project. The proposed scheme has been implemented in the OSCAR compiler to realize power reduction in the fastest execution mode, which minimizes power consumption without performance degradation, and in the realtime execution mode, which minimizes power consumption under realtime constraints. The proposed scheme is evaluated on the 8-core SH4A multicore processor RP2, newly developed for consumer electronics by Renesas Technology Corp., Hitachi, Ltd. and Waseda University in the above project. In the fastest execution mode, consumed energy was reduced by 13.05% for SPEC2000 art and 3.99% for SPEC2000 equake. In the realtime execution mode, consumed power was reduced by 87.9% for the AAC encoder and 73.2% for the MPEG2 decoder.

    CiNii

  • An Evaluation of Barrier Synchronization Mechanism Considering Hierarchical Processor Grouping

    YAMADA Kaito, MASE Masayoshi, SHIRAKO Jun, KIMURA Keiji, ITO Masayuki, HATTORI Toshihiro, MIZUNO Hiroyuki, UCHIYAMA Kunio, KASAHARA Hironori

    IEICE technical report   108 ( 28 ) 19 - 24  2008.05

     View Summary

    In order to use a large number of processor cores in a chip, hierarchical coarse grain task parallel processing, which exploits whole-program parallelism by analyzing hierarchical coarse grain task parallelism inside loops and subroutines, has been proposed and implemented in the OSCAR automatic parallelizing compiler. This hierarchical coarse grain task parallel processing defines processor groups hierarchically and logically, and assigns hierarchical coarse grain tasks to each processor group. A lightweight and scalable barrier synchronization mechanism considering hierarchical processor grouping, which supports hierarchical coarse grain task parallel processing, was developed and implemented in the RP2 multicore processor, which has eight SH4A cores, with support from NEDO "Multicore Technology for Realtime Consumer Electronics". This barrier mechanism is proposed and evaluated in this paper. The evaluation using an AAC encoder program on 8 cores shows our barrier mechanism achieves 16% better performance than a software barrier.

    CiNii
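
    The following is a generic software sketch of a two-level, sense-reversing barrier that mirrors the idea of synchronizing within a processor group before the group leaders synchronize globally; it is not the RP2 hardware mechanism evaluated above, and the data layout is an assumption for illustration.

```c
/* Minimal two-level (hierarchical) sense-reversing barrier sketch:
 * cores first synchronize within their processor group, then only the
 * group leaders take part in the global barrier, and a second
 * group-level wait releases the non-leaders. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;     /* cores still to arrive               */
    int         total;     /* cores taking part in this barrier   */
    atomic_bool sense;     /* flipped when the last core arrives  */
} barrier_t;

static void barrier_wait(barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;                 /* sense expected here */
    if (atomic_fetch_sub(&b->count, 1) == 1) {    /* last arrival        */
        atomic_store(&b->count, b->total);        /* reset for reuse     */
        atomic_store(&b->sense, *local_sense);    /* release the waiters */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                     /* spin                */
    }
}

void hierarchical_barrier(barrier_t *group, barrier_t *global,
                          bool is_leader, bool *ls_group, bool *ls_global)
{
    barrier_wait(group, ls_group);       /* intra-group synchronization */
    if (is_leader)
        barrier_wait(global, ls_global); /* leaders synchronize globally */
    barrier_wait(group, ls_group);       /* release non-leaders          */
}
```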

  • Automatic Parallelization of Restricted C Programs using Pointer Analysis

    MASE Masayoshi, BABA Daisuke, NAGAYAMA Harumi, MURATA Yuta, KIMURA Keiji, KASAHARA Hironori

    IEICE technical report   108 ( 28 ) 69 - 74  2008.05

     View Summary

    This paper describes a restriction on pointer usage in C language for parallelism extraction by an automatic parallelizing compiler. By rewriting programs to satisfy the restriction, automatic parallelization using flow-sensitive, context-sensitive pointer analysis on an 8 cores SMP server achieved 3.80 times speedup for SPEC2000 art, 6.17 times speedup for SPEC2006 lbm and 5.14 times speedup for MediaBench mpeg2enc against the sequential execution, respectively.

    CiNii

  • Automatic Parallelization of Restricted C Programs using Pointer Analysis

    MASE Masayoshi, BABA Daisuke, NAGAYAMA Harumi, MURATA Yuta, KIMURA Keiji, KASAHARA Hironori

    IPSJ SIG Notes   2008 ( 39 ) 69 - 74  2008.05

     View Summary

    This paper describes a restriction on pointer usage in C language for parallelism extraction by an automatic parallelizing compiler. By rewriting programs to satisfy the restriction, automatic parallelization using flow-sensitive, context-sensitive pointer analysis on an 8 cores SMP server achieved 3.80 times speedup for SPEC2000 art, 6.17 times speedup for SPEC2006 lbm and 5.14 times speedup for MediaBench mpeg2enc against the sequential execution, respectively.

    CiNii

  • An Evaluation of Barrier Synchronization Mechanism Considering Hierarchical Processor Grouping

    YAMADA Kaito, MASE Masayoshi, SHIRAKO Jun, KIMURA Keiji, ITO Masayuki, HATTORI Toshihiro, MIZUNO Hiroyuki, UCHIYAMA Kunio, KASAHARA Hironori

    IPSJ SIG Notes   2008 ( 39 ) 19 - 24  2008.05

     View Summary

    In order to use a large number of processor cores in a chip, hierarchical coarse grain task parallel processing, which exploits whole-program parallelism by analyzing hierarchical coarse grain task parallelism inside loops and subroutines, has been proposed and implemented in the OSCAR automatic parallelizing compiler. This hierarchical coarse grain task parallel processing defines processor groups hierarchically and logically, and assigns hierarchical coarse grain tasks to each processor group. A lightweight and scalable barrier synchronization mechanism considering hierarchical processor grouping, which supports hierarchical coarse grain task parallel processing, was developed and implemented in the RP2 multicore processor, which has eight SH4A cores, with support from NEDO "Multicore Technology for Realtime Consumer Electronics". This barrier mechanism is proposed and evaluated in this paper. The evaluation using an AAC encoder program on 8 cores shows our barrier mechanism achieves 16% better performance than a software barrier.

    CiNii

  • Parallelization for Multimedia Processing on Multicore Processors

    MIYAMOTO TAKAMICHI, TAMURA KEI, TANO HIROAKI, MIKAMI HIROKI, ASAKA SAORI, MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2007 ( 115 ) 77 - 82  2007.11

     View Summary

    Multicore processors have attracted much attention to handle the increase of power consumption, the slowdown of improvement of processor clock speed, and the increase of hardware/software developing period. Also, speeding up multimedia applications is required with the progress of the consumer electronics devices like mobile phones, digital TV and games. This paper describes parallelization methods of multimedia applications on the multicore processors. Especially in this paper, MPEG2 encoding and MPEG2 decoding are selected as examples of video sequence processing, MP3 encoding is selected as an example of audio processing, JPEG 2000 encoding is selected as an example of picture processing. OSCAR multigrain parallelizing compiler parallelizes these media applications using newly developed multicore API. This paper evaluates parallel processing performances of these multimedia applications on the FR1000 multicore processor developed by Fujitsu Ltd, and the RP1 multicore processor jointly-developed by Waseda University, Renesas Technology Corp. and Hitachi Ltd.

    CiNii

  • Evaluation of Heterogeneous Multicore-Architecture with AAC-LC Stereo Encoding

    SHIKANO Hiroaki, ITO Masaki, TODAKA Takashi, TSUNODA Takanobu, KODAMA Tomoyuki, ONOUCHI Masafumi, UCHIYAMA Kunio, ODAKA Toshihiko, KAMEI Tatsuya, NAGAHAMA Ei, KUSAOKE Manabu, NITTA Yusuke, WADA Yasutaka, KIMURA Keiji, KASAHARA Hironori

    IEICE technical report   107 ( 194 ) 11 - 16  2007.08

     View Summary

    This paper describes a heterogeneous multi-core processor (HMCP) architecture which integrates general purpose processors (CPU) and accelerators (ACC) to achieve high-performance as well as low-power consumption for SoCs of embedded systems. Memory architecture of CPUs and ACCs were unified to improve programming and compiling efficiency. For preliminary evaluation of the HMCP architecture, AAC-LC stereo audio encoding is parallelized on a heterogeneous multi-core having homogeneous processor cores and dynamic reconfigurable processor (DRP) accelerator cores. The performance evaluation shows that 54x AAC encoding is achieved on the chip with two CPUs at 600MHz and two DRPs at 300MHz, which realizes encoding of a whole CD in 1-2 minutes.

    CiNii

  • A Hierarchical Coarse Grain Task Static Scheduling Scheme on a Heterogeneous Multicore

    WADA YASUTAKA, HAYASHI AKIHIRO, IYOKU TAKETO, MASUURA TAKESHI, SHIRAKO JUN, NAKANO HIROFUMI, SHIKANO HIROAKI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2007 ( 79 ) 97 - 102  2007.08

     View Summary

    This paper proposes a static scheduling scheme for hierarchical coarse grain task parallel processing on a heterogeneous multicore processor. A heterogeneous multicore processor integrates not only general purpose processors but also accelerators like dynamically reconfigurable processors (DRPs) or digital signal processors (DSPs). Effective use of these accelerators allows us to obtain high performance and low power consumption at the same time. In the proposed scheme, the compiler extracts parallelism using coarse grain parallel processing and assigns tasks considering the characteristics of each core to minimize the execution time of an application. The performance of the proposed scheme is evaluated on a heterogeneous multicore processor using an MP3 encoder. Heterogeneous configurations give us a 12.64 times speedup with two SH4As and two DRPs and a 24.48 times speedup with four SH4As and four DRPs against sequential execution with one SH4A core.

    CiNii
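
    As a sketch of the flavor of static scheduling on such a platform (a generic earliest-finish-time heuristic, not the authors' hierarchical scheduling algorithm), the example below assigns tasks, assumed to arrive in precedence order with profiled per-core costs, to the core that finishes them earliest. The cost table is invented for illustration.

```c
/* Generic earliest-finish-time sketch for heterogeneous cores
 * (two general purpose cores, two accelerators). Dependencies and
 * communication costs are deliberately omitted to keep it small. */
#include <stdio.h>

#define N_CORES 4
#define N_TASKS 3

int main(void)
{
    /* cost[t][c]: estimated cycles of task t on core c (assumed profile). */
    double cost[N_TASKS][N_CORES] = {
        { 100, 100,  20,  20 },   /* accelerator friendly          */
        {  60,  60, 200, 200 },   /* runs poorly on the accelerator */
        {  80,  80,  25,  25 },
    };
    double ready[N_CORES] = { 0 };        /* when each core becomes free */

    for (int t = 0; t < N_TASKS; t++) {   /* tasks in precedence order   */
        int best = 0;
        double best_finish = ready[0] + cost[t][0];
        for (int c = 1; c < N_CORES; c++) {
            double finish = ready[c] + cost[t][c];
            if (finish < best_finish) { best_finish = finish; best = c; }
        }
        ready[best] = best_finish;        /* commit the assignment       */
        printf("task %d -> core %d (finishes at %.0f)\n", t, best, best_finish);
    }
    return 0;
}
```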

  • Compiler Control Power Saving for Heterogeneous Multicore Processor

    HAYASHI AKIHIRO, IYOKU TAKETO, NAKAGAWA RYO, MASUURA TAKESHI, MATSUMOTO SHIGERU, YAMADA KAITO, OSHIYAMA NAOTO, SHIRAKO JUN, WADA YASUTAKA, NAKANO HIROFUMI, SHIKANO HIROAKI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2007 ( 79 ) 103 - 108  2007.08

     View Summary

    Multicore processors are being introduced for performance improvement and reduction of power dissipation in various IT fields, such as consumer electronics, PCs, servers and supercomputers. In particular, heterogeneous multicores have attracted much attention in consumer electronics to achieve higher performance per watt. In order to satisfy the demand for high performance, low power dissipation and high software productivity, parallelizing compilers that handle both parallelization and frequency/voltage control are required. This paper describes the evaluation results of compiler-controlled power saving for a heterogeneous multicore processor which integrates up to 4 general purpose embedded processors (Renesas SH4As) and 4 accelerator cores, namely dynamically reconfigurable processors (Hitachi FE-GAs). The performance evaluation shows the heterogeneous multicore gave us a 24.32 times speedup against sequential processing and 28.43% energy savings for an MP3 encoding program without performance degradation.

    CiNii

  • A 4320MIPS four Processor-core SMP/AMP with Individually Managed Clock Frequency for Low Power Consumption

    HAYASE Kiyoshi, YOSHIDA Yutaka, KAMEI Tatsuya, SHIBAHARA Shinichi, NISHII Osamu, HATTORI Toshihiro, HASEGAWA Atsushi, TAKADA Masashi, IRIE Naohiko, UCHIYAMA Kunio, ODAKA Toshihiko, TAKADA Kiwamu, KIMURA Keiji, KASAHARA Hironori

    IPSJ SIG Notes   2007 ( 55 ) 31 - 35  2007.05

     View Summary

    A 4320MIPS 4-processor SoC that provides low power consumption and high performance was designed using a 90nm process. A 32KB data cache is built into each processor, and a module to maintain the coherency of the data caches between processors is also built in. Low power consumption is achieved by controlling the frequency of each processor according to the amount of processing and by adopting a sleep mode that maintains data cache coherency between processors.

    CiNii

  • Multigrain Parallel Processing in SMP Execution Mode on a Multicore for Consumer Electronics

    MASE Masayoshi, BABA Daisuke, NAGAYAMA Harumi, TANO Hiroaki, MASUURA Takeshi, MIYAMOTO Takamichi, SHIRAKO Jun, NAKANO Hirofumi, KIMURA Keiji, KAMEI Tatsuya, HATTORI Toshihiro, HASEGAWA Atsushi, ITO Masaki, SATO Makoto, UCHIYAMA Kunio, ODAKA Toshihiko, KASAHARA Hironori

    IPSJ SIG Notes   2007 ( 55 ) 25 - 30  2007.05

     View Summary

    Currently, multicore processors are becoming ubiquitous in various computing domains, namely consumer electronics such as games, car navigation systems and mobile phones, PCs, and supercomputers. This paper describes parallelization of media processing programs written in restricted C language by OSCAR multigrain parallelizing compiler and SMP processing performance on RP1 4-core SH-4A (SH-X3) multicore processor developed by Renesas Technology Corp. and Hitachi, Ltd. based on standard OSCAR multicore memory architecture as a part of NEDO "Research and Development of Multicore Technology for Real Time Consumer Electronics Project". Performance evaluation shows OSCAR compiler achieved 3.34 times speedup using 4 cores against using 1 core for AAC audio encoder.

    CiNii

  • A 4320MIPS four Processor-core SMP/AMP with Individually Managed Clock Frequency for Low Power Consumption

    HAYASE Kiyoshi, YOSHIDA Yutaka, KAMEI Tatsuya, SHIBAHARA Shinichi, NISHII Osamu, HATTORI Toshihiro, HASEGAWA Atsushi, TAKADA Masashi, IRIE Naohiko, UCHIYAMA Kunio, ODAKA Toshihiko, TAKADA Kiwamu, KIMURA Keiji, KASAHARA Hironori

    IEICE technical report   107 ( 76 ) 31 - 35  2007.05

     View Summary

    A 4320MIPS 4-processor SoC that provides low power consumption and high performance was designed using a 90nm process. A 32KB data cache is built into each processor, and a module to maintain the coherency of the data caches between processors is also built in. Low power consumption is achieved by controlling the frequency of each processor according to the amount of processing and by adopting a sleep mode that maintains data cache coherency between processors.

    CiNii

  • Multigrain Parallel Processing in SMP Execution Mode on a Multicore for Consumer Electronics

    MASE Masayoshi, BABA Daisuke, NAGAYAMA Harumi, TANO Hiroaki, MASUURA Takeshi, MIYAMOTO Takamichi, SHIRAKO Jun, NAKANO Hirofumi, KIMURA Keiji, KAMEI Tatsuya, HATTORI Toshihiro, HASEGAWA Atsushi, ITO Masaki, SATO Makoto, UCHIYAMA Kunio, ODAKA Toshihiko, KASAHARA Hironori

    IEICE technical report   107 ( 76 ) 25 - 30  2007.05

     View Summary

    Currently, multicore processors are becoming ubiquitous in various computing domains, namely consumer electronics such as games, car navigation systems and mobile phones, PCs, and supercomputers. This paper describes parallelization of media processing programs written in restricted C language by OSCAR multigrain parallelizing compiler and SMP processing performance on RP1 4-core SH-4A (SH-X3) multicore processor developed by Renesas Technology Corp. and Hitachi, Ltd. based on standard OSCAR multicore memory architecture as a part of NEDO "Research and Development of Multicore Technology for Real Time Consumer Electronics Project". Performance evaluation shows OSCAR compiler achieved 3.34 times speedup using 4 cores against using 1 core for AAC audio encoder.

    CiNii

  • A Local Memory Management Scheme in Multigrain Parallelizing Compiler

    MIURA TSUYOSHI, TAGAWA TOMOHIRO, MURAMATSU YUSUKE, IKEMI AKINORI, NAKAGAWA MASAHIRO, NAKANO HIROFUMI, SHIRAKO JUN, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2007 ( 17 ) 61 - 66  2007.03

     View Summary

    Multicore systems have been attracting much attention for their performance, low power consumption and short hardware/software development period. To take full advantage of multiprocessor systems, parallelizing compilers serve important roles. On multicore processors, the memory wall caused by the speed gap between processor cores and memory is also a serious problem. Therefore, it is important for performance improvement to effectively use fast memories such as cache and local memory close to a processor. This paper proposes a local memory management scheme for coarse grain task parallel processing. In the evaluation using SPEC 95fp tomcatv, the proposed scheme using 8 processors achieved a 19.6 times speedup against sequential execution without the proposed scheme on the OSCAR multicore processor through the effective use of local memories.

    CiNii

  • Automatic Parallelization for Multimedia Applications on Multicore Processors

    MIYAMOTO TAKAMICHI, ASAKA SAORI, KAMAKURA NOBUHITO, YAMAUCHI HIROMASA, MASE MASAYOSHI, SHIRAKO JUN, NAKANO HIROFUMI, KIMURA KEIJI, KASAHARA HIRONORI

      2007 ( 4 ) 69 - 74  2007.01

     View Summary

    Multicore processors have attracted much attention to handle the increase of power consumption along with the increase of integration degree of semiconductor devices, the slowdown of improvement of processor clocks, and the increase of hardware/software developing period. Also, speeding up multimedia applications is required with the progress of the consumer electronics like mobile phones, digital TV and games. This paper describes parallelization methods of multimedia applications on the multicore processors. Especially in this paper, MPEG2 encoding and MPEG2 decoding are selected as examples of video sequence processing, MP3 encoding is selected as an example of audio processing, JPEG 2000 encoding is selected as an example of picture processing. OSCAR multigrain parallelizing compiler automatically parallelizes these media applications. This paper evaluates parallel processing performances of these multimedia applications on the OSCAR multicore processor, and the IBM p5 550Q Power5+ 8 processors SMP server. On the OSCAR multicore processor, the parallel execution with the proposed method of managing local memory and optimizing data transfer using 4 processors, gives us 3.81 times speedup for MPEG2 encoding, 3.04 times speedup for MPEG2 decoding, 3.09 times speedup for MP3 encoding, 3.79 times speedup for JPEG 2000 encoding against the sequential execution. On the IBM p5 550Q Power5+ 8 processors server, the parallel execution using 8 processors gives us 5.19 times speedup for MPEG2 encoding, 5.12 times speedup for MPEG2 decoding, 3.69 times speedup for MP3 encoding, 4.32 times speedup for JPEG 2000 encoding against the sequential execution.

    CiNii

  • Automatic Parallelization of Restricted C Programs in OSCAR Compiler

    MASE MASAYOSHI, BABA DAISUKE, NAGAYAMA HARUMI, TANO HIROAKI, MASUURA TAKESHI, FUKATSU KOJI, MIYAMOTO TAKAMICHI, SHIRAKO JUN, NAKANO HIROFUMI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2006 ( 127 ) 1 - 6  2006.11

     View Summary

    Along with the popularization of multiprocessors and multicore architectures, automatic parallelizing compilers, which can realize high effective performance and low power consumption, are becoming more and more important in various areas from high performance computing to embedded computing. The OSCAR compiler realizes multigrain automatic parallelization, which can exploit parallelism and data locality from the whole program. This paper describes C language support in the OSCAR compiler. For rapid support of C language, a restricted C language is proposed. In a preliminary performance evaluation of automatic parallelization using media applications such as MPEG2 encode, MP3 encode, AAC encode, Susan (smoothing) derived from MiBench, and Art from SPEC2000, the OSCAR compiler achieved at most a 7.49 times speedup for susan (smoothing) against sequential execution on an IBM p5 550 server having 8 processors, and at most a 3.75 times speedup, also for susan (smoothing), against sequential execution on a Sun Ultra80 workstation having 4 processors.

    CiNii

  • Performance of OSCAR Multigrain Parallelizing Compiler on SMP Servers and Embedded Multicore

    SHIRAKO JUN, TAGAWA TOMOHIRO, MIURA TSUYOSHI, MIYAMOTO TAKAMICHI, NAKANO HIROFUMI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2006 ( 127 ) 7 - 12  2006.11

     View Summary

    Currently, multiprocessor systems, especially multicore processors, are attracting much attention for performance, low power consumption and short hardware/software development period. To take the full advantage of multiprocessor systems, parallelizing compilers serve important roles. This paper describes the execution performance of OSCAR multigrain parallelizing compiler using coarse grain task parallelization and near fine grain parallelization in addition to loop parallelization, on the latest SMP servers and a SMP embedded multicore. The OSCAR compiler has realized the automatic determination of parallelizing layer, which decides the suitable number of processors and parallelizing technique for each nested part of the program, and global cache memory optimization over loops and coarse grain tasks. In the performance evaluation using 10 SPEC CFP95 benchmark programs and 4 SPEC CFP2000, OSCAR compiler gave us 2.74 times speedup compared with IBM XL Fortran compiler 10.1 on IBM p5 550Q Power5+8 processors server, 4.82 times speedup compared with IBM XL Fortran compiler 8.1 on IBM pSeries690 Power4 24 processors server. OSCAR compiler can be also applied for NEC/ARM MPCore ARMv6 4 processors low power embedded multicore, using subset of OpenMP libraries and g77 compiler. In the evaluation using SPEC CFP95 benchmarks with reduced data sets, OSCAR compiler achieved 4.08 times speedup for tomcatv, 3.90 times speedup for swim, 2.21 times speedup for su2cor, 3.53 times speedup for hydro2d, 3.85 times speedup for mgrid, 3.62 times speedup for applu and 3.20 times speedup for turb3d against the sequential execution.

    CiNii

  • Local Memory Management on OSCAR Multicore

    NAKANO HIROFUMI, NITO TAKUMI, MARUYAMA TAKANORI, NAKAGAWA MASAHIRO, SUZUKI YUKI, NAITO YOSUKE, MIYAMOTO TAKAMICHI, WADA YASUTAKA, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2006 ( 88 ) 163 - 168  2006.07

     View Summary

    Along with the advancement of semiconductor integration technology, multicore processors have attracted much attention as a next-generation microprocessor architecture to overcome the increase of power consumption, the slowdown of effective processor performance improvement, and the lengthening of hardware/software development periods. However, the memory wall caused by the gap between memory access speed and processor core speed is becoming a serious problem on multicore processors as well. Therefore, the effective use of fast memories like cache and local memory close to a processor is important. Considering these problems, the authors have proposed the OSCAR multicore processor architecture, which cooperates with the OSCAR multigrain parallelizing compiler and aims at a processor with high effective performance and good cost performance. The OSCAR multicore processor has local data memory (LDM) for processor-private data, distributed shared memory (DSM) having two ports for synchronization and data transfer among processor cores, centralized shared memory (CSM) to support dynamic task scheduling, and a data transfer unit (DTU) which transfers data asynchronously and aims at hiding data transfer overhead. This paper describes a data localization scheme that aims at improving the effective use of LDM and DSM using coarse grain task parallel processing and a compiler-controlled LDM and DSM management scheme. As the results, the proposed scheme automatically gives us a 7.1 times speedup for an MP3 encoding program, 6.3 for an MPEG2 encoding program and 3.8 for a JPEG2000 encoding program on 8 processors against sequential execution without the proposed scheme.

    CiNii

  • Data Transfer Overlap of Coarse Grain Task Parallel Processing on a Multicore Processor

    MIYAMOTO TAKAMICHI, NAKAGAWA MASAHIRO, ASANO SHOICHIRO, NAITO YOSUKE, NITO TAKUMI, NAKANO HIROFUMI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2006 ( 20 ) 55 - 60  2006.02

     View Summary

    Along with the increase of the integration degree of semiconductor devices, multicore processors have attracted much attention as a next-generation microprocessor architecture to overcome the increase of power consumption, the slowdown of effective processor performance improvement, and the lengthening of hardware/software development periods. However, the memory wall caused by the gap between memory access speed and processor core speed is still a serious problem on multicore processors. Therefore, the effective use of fast memories like cache and local memory close to a processor is important for reducing the large memory access overhead. Furthermore, hiding the overhead of data transfers among the local or distributed shared memories of processors and the centralized shared memory is important; on this memory architecture, the data transfers are specified explicitly. Considering these problems, the authors have proposed the OSCAR multicore processor architecture, which cooperates with the OSCAR multigrain parallelizing compiler and aims at a computer system with high effective performance and good cost performance. The OSCAR multicore processor has local data memory (LDM) for processor-private data, distributed shared memory (DSM) having two ports for synchronization and data transfer among processor cores, centralized shared memory (CSM) to support dynamic task scheduling, and a data transfer unit (DTU) which transfers data asynchronously and aims at overlapping data transfer overhead. This paper proposes and evaluates a static data transfer scheduling algorithm aiming at overlapping data transfer overhead. As the results, the proposed scheme controlled by the OSCAR compiler gives us a 2.86 times speedup using 4 processors for a JPEG2000 encoding program against ideal sequential execution assuming that all data can be assigned to the local memory.

    CiNii
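
    A double-buffering sketch of how asynchronous transfers can be overlapped with task execution; dtu_start_get(), dtu_wait() and process_block() are hypothetical placeholders standing in for an asynchronous transfer unit and a compute kernel, not the actual DTU interface.

```c
/* Double-buffering sketch: while block i is processed from one local
 * buffer, the transfer of block i+1 into the other buffer proceeds in
 * the background, hiding the transfer latency. */
#include <stddef.h>

#define BLK 256

extern void dtu_start_get(float *local_dst, const float *shared_src, size_t n);
extern void dtu_wait(void);                       /* wait for the last transfer */
extern void process_block(float *blk, size_t n);  /* the computation kernel     */

void process_all(const float *shared, size_t nblocks)
{
    static float buf[2][BLK];                      /* two local buffers   */
    int cur = 0;

    dtu_start_get(buf[cur], &shared[0], BLK);      /* prefetch first block */
    for (size_t i = 0; i < nblocks; i++) {
        dtu_wait();                                /* block i is now local */
        if (i + 1 < nblocks)                       /* start next transfer  */
            dtu_start_get(buf[cur ^ 1], &shared[(i + 1) * BLK], BLK);
        process_block(buf[cur], BLK);              /* compute overlaps DMA */
        cur ^= 1;                                  /* swap buffers         */
    }
}
```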

  • A Static Scheduling Scheme for Coarse Grain Tasks on a Heterogeneous Chip Multi Processor

    WADA YASUTAKA, OSHIYAMA NAOTO, SUZUKI YUKI, SHIRAKO JUN, NAKANO HIROFUMI, SHIKANO HIROAKI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2006 ( 8 ) 13 - 18  2006.01

     View Summary

    This paper proposes a static scheduling scheme for coarse grain tasks on a heterogeneous chip multiprocessor which integrates not only general purpose processors but also accelerators like DRPs or DSPs. A heterogeneous chip multiprocessor allows us to obtain high performance by using the accelerators and to save energy through frequency/voltage control by the compiler. In this scheme, the compiler aims to minimize the execution time of an application in consideration of the characteristics of each core. The performance of the proposed scheme is evaluated on a heterogeneous chip multiprocessor which has 4 general purpose processors and 2 accelerators using an MP3 encoder, and gives us an 8.8 times speedup against sequential execution without the proposed scheme.

    CiNii

  • Preliminary Evaluation of Heterogeneous Chip Multi-Processor with MP3 Audio Encoder

    SHIKANO Hiroaki, SUZUKI Yuki, WADA Yasutaka, SHIRAKO Jun, KIMURA Keiji, KASAHARA Hironori

    IPSJ SIG Notes   2006 ( 8 ) 1 - 6  2006.01

     View Summary

    This paper proposes a heterogeneous chip multi-processor (HCMP) that possesses different types of processing elements (PEs) such as CPUs as general-purpose processors, as well as digital signal processors or dynamic reconfigurable processors (DRPs) as special-purpose processors. The HCMP realizes higher performance than conventional single-core processors or even homogeneous multi-processors in some specific applications such as media processing, with low operating frequency supplied, which results in lower power consumption. In this paper, the performance of the HCMP is analyzed by studying parallelizing scheme and power control scheme of an MP3 audio encoding program and by scheduling the program onto the HCMP using these two schemes. As a result, it is confirmed that an HCMP, consisting of three CPUs and two DRPs, outperforms a single-core processor with one CPU by a speed-up factor of 16.3, and a homogeneous multi-processor with 5 CPUs by a speed-up factor of 4.0. It is also confirmed that the power control on the HCMP results in 24% power reduction.

    CiNii

  • Performance Evaluation of Electronic Circuit Simulation Using Code Generation Method without Array Indirect Access

    KURODA AKIRA, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2005 ( 7 ) 1 - 6  2005.01

     View Summary

    This paper evaluates the performance of a fast sequential circuit simulation scheme using loop-free code without array indirect accesses. This scheme allows us to obtain several tens of times higher processing performance than SPICE version 3f5 on a WS and a PC. The array indirect accesses used for the sparse matrix solution in SPICE have been one of the factors that prevent efficient processing. This paper describes the circuit simulation scheme using loop-free code without any array indirect accesses, and its performance evaluation shows the scheme gives us 2 to 110 times better performance than SPICE3f5 on a WS and a PC. The performance is obtained by significantly reducing the memory access overhead.

    CiNii
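
    The code-generation idea can be illustrated as follows: a generic sparse solver walks index arrays at run time, whereas generated loop-free code fixes every array offset at compile time. The small lower-triangular system below is invented purely for illustration and is not taken from the paper.

```c
/* Illustration of replacing indirect sparse-matrix accesses with
 * straight-line code whose offsets are compile-time constants. */
#include <stdio.h>

int main(void)
{
    /* A generic sparse forward substitution would look like:
     *   for (p = row_ptr[i]; p < row_ptr[i+1]; p++)
     *       x[i] -= val[p] * x[col_idx[p]];
     * i.e. every access goes through row_ptr[] / col_idx[].          */

    double a[6] = { 4.0, -1.0, 5.0, -2.0, -0.5, 3.0 };  /* nonzeros only */
    double x[3] = { 8.0, 7.0, 6.0 };                    /* right-hand side */

    /* Generated loop-free version for this fixed 3x3 lower-triangular
     * topology: offsets 0..5 are constants, so no indirection remains. */
    x[0] /= a[0];
    x[1] -= a[1] * x[0];  x[1] /= a[2];
    x[2] -= a[3] * x[0];  x[2] -= a[4] * x[1];  x[2] /= a[5];

    printf("x = %.3f %.3f %.3f\n", x[0], x[1], x[2]);
    return 0;
}
```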

  • Performance of OSCAR Multigrain Parallelizing Compiler on Shared Memory Multiprocessor Servers

    SHIRAKO JUN, MIYAMOTO TAKAMICHI, ISHIZAKA KAZUHISA, OBATA MOTOKI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2005 ( 7 ) 21 - 26  2005.01

     View Summary

    The need for automatic parallelizing compilers is getting larger with the wide use of multiprocessor systems. However, loop parallelization techniques are almost mature, and a new generation of parallelization methods like multigrain parallelization is required to achieve higher effective performance. This paper describes the performance of the OSCAR multigrain parallelizing compiler, which uses coarse grain task parallelization and near fine grain parallelization in addition to loop parallelization. The OSCAR compiler realizes the following two important techniques. The first is the automatic determination scheme of the parallelizing layer, which decides the number of processors and the parallelizing technique for each part of the program. The other is global cache memory optimization among loops and coarse grain tasks. In the evaluation using SPEC95FP benchmarks, the OSCAR compiler gave us a 4.78 times speedup compared with IBM XL Fortran compiler 8.1 on an IBM pSeries690 Power4 24-processor server, a 2.40 times speedup compared with Intel Fortran Itanium Compiler 7.1 on an SGI Altix3700 Itanium2 16-processor server, and a 1.90 times speedup compared with Sun Forte compiler 7.1 on a Sun Fire V880 UltraSPARC III Cu 8-processor server.

    CiNii

  • Parallel Processing for MPEG2 Encoding on OSCAR Chip Multiprocessor

    KODAKA TAKESHI, NAKANO HIROHUMI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2004 ( 123 ) 53 - 58  2004.12

     View Summary

    This paper proposes a coarse grain task parallel processing scheme for MPEG2 encoding on a chip multiprocessor using data localization, which improves execution efficiency by assigning coarse grain tasks accessing the same array data to the same processor consecutively, and a data transfer overlapping technique, which minimizes the data transfer overhead by overlapping task execution and data transfer. The performance of the proposed scheme is evaluated. On the OSCAR chip multiprocessor architecture, the proposed scheme gave us a 1.24 times speedup for 1 processor, 2.47 times for 2 processors, 4.57 times for 4 processors, 7.97 times for 8 processors and 11.93 times for 16 processors, respectively, against sequential execution on a single processor without the proposed scheme.

    CiNii

  • Data Localization using Data Transfer Unit on OSCAR Chip Multiprocessor

    NAKANO HIROFUMI, NAITO YOSUKE, SUZUKI TAKAHISA, KODAKA TAKESHI, ISHIZAKA KAZUHISA, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2004 ( 80 ) 115 - 120  2004.07

     View Summary

    Recently, Chip Multiprocessor (CMP) architecture has attracted much attention as a next-generation micro-processor architecture, and many kinds of CMP have widely developed. However, these CMP architectures still have the problem of effective use of memory system nearby processor cores such as cache and local memory. On the other hand, the authors have proposed OSCAR CMP, which cooperatively works with multigrain parallel processing, to achieve high effective performance and good cost effectiveness. To overcome the problem of effective use of cache and local memory, OSCAR CMP has local data memory (LDM) for processor private data and distributed shared memory (DSM) having two ports for synchronization and data transfer among processor cores, centralized shared memory (CSM) to support dynamic task scheduling, and data transfer unit(DTU) for asynchronous data transfer. The multigrain parallelizing compiler uses such memory architecture of OSCAR CMP with data localization scheme that fully uses compile time information. This paper proposes a coarse grain task static scheduling scheme considering data localization using live variable analysis. Data is transferred in burst mode using automatically generated DTU instructions.

    CiNii

  • Evaluation of Multigrain Parallelism on OSCAR Chip Multi Processor

    WADA YASUTAKA, SHIRAKO JUN, ISHIZAKA KAZUHISA, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2004 ( 80 ) 61 - 66  2004.07

     View Summary

    This paper describes performance of multigrain parallel processing of SPEC CFP 95 on OSCAR Chip Multi Processor (OSCAR CMP). OSCAR multigrain parallelizing compiler, which exploits statement level near-fine grain parallelism, loop iteration level parallelism and coarse grain parallelism hierarchically, allows us to fully control hardware on OSCAR CMP. Also, this cooperation realizes high software productivity and effective use of hardware resources. Performance of multigrain parallel processing of SPEC CFP 95 benchmark programs on OSCAR CMP with 8 processor cores and centralized shared memory were 2.03 to 7.79 times speedup against sequential execution using 400MHz clock cycles for embedded use and 1.89 to 7.05 times speedup against sequential execution using 2.8GHz clock cycles for high-end use.

    CiNii

  • Parallel Processing for MPEG2 Encoding using Data Localization

    KODAKA TAKESHI, NAKANO HIROHUMI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2004 ( 12 ) 13 - 18  2004.02

     View Summary

    Recently, many people are getting to enjoy multimedia applications with image and audio processing on PCs, mobile phones and PDAs. For this situation, development of low cost, low power consumption and high performance processors for multimedia applications has been expected. To satisfy these demands, chip multiprocessor architectures which allows us to attain scalability using coarse grain level parallelism and loop level parallelism in addition to instruction level parallelism are attracting much attention. However, in order to extract much performance from chip multiprocessor architectures efficiently, highly sophisticated technique is required such as decomposing a program into adequate grain of tasks and assigning them onto processors considering parallelism and data locality of target applications. This paper describes a parallel processing scheme for MPEG2 encoding using data localization which improve execution efficiency assigning coarse grain tasks sharing same data on a same processor consecutively for a chip multiprocessor, and evaluate its performance. As the evaluation result on OSCAR CMP using 8 processors, proposed scheme gives us 1.64 times speedup against loop parallel processing, and 6.82 times speedup against sequential execution time.

    CiNii

  • The Data Prefetching of Coarse Grain Task Parallel Processing on Symmetric Multi Processor Machine

    MIYAMOTO TAKAMICHI, YAMAGUCHI TAKAHIRO, TOBITA TAKAO, ISHIZAKA KAZUHISA, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2003 ( 119 ) 63 - 68  2003.11

     View Summary

    On the shared memory multiprocessor systems used in current computing servers, the increase of memory access overhead relative to CPU speed interferes with obtaining scalable performance improvement as the number of processors increases. In order to obtain scalable performance improvement, this paper proposes and evaluates a static scheduling algorithm which reduces the memory access overhead by using cache prefetch to overlap data transfer and task processing. The proposed algorithm is used in the static scheduling stage of a compiler; moreover, the compiler generates an OpenMP-parallelized Fortran program with prefetch directives for the Sun Forte compiler on a Sun Fire V880 server. The performance evaluation shows that the proposed algorithm gave us super-linear speedup compared with sequential processing without prefetching by the Sun Forte compiler, such as a 13.9 times speedup on 8 processors for the SPEC95fp tomcatv program and a 22.3 times speedup on 8 processors for the SPEC95fp swim program. Furthermore, compared with automatic prefetching by the Sun Forte compiler using the same number of processors, the algorithm shows a 1.1 times speedup on 1 processor and 3.86 times on 8 processors for SPEC95fp tomcatv, and a 1.44 times speedup on 1 processor and 1.85 times on 8 processors for SPEC95fp swim.

    CiNii
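
    A portable sketch of the idea of hiding memory latency by prefetching ahead of use; the work above emitted Sun Forte prefetch directives into OpenMP Fortran, whereas this example uses the GCC/Clang __builtin_prefetch intrinsic on a plain C loop, with the prefetch distance AHEAD as a tuning assumption.

```c
/* Issue prefetches a fixed distance ahead of the streaming accesses so
 * that the cache lines arrive before the computation needs them. */
#include <stdio.h>

#define N 4096
#define AHEAD 16          /* prefetch distance, in elements (tuning knob) */

int main(void)
{
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    for (int i = 0; i < N; i++) {
        if (i + AHEAD < N) {
            __builtin_prefetch(&b[i + AHEAD], 0, 1);  /* read, low reuse */
            __builtin_prefetch(&c[i + AHEAD], 0, 1);
        }
        a[i] = b[i] + c[i];   /* the prefetched lines arrive before use */
    }
    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```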

  • Data Localization Scheme using Static Scheduling on Chip Multiprocessor

    NAKANO HIROFUMI, KODAKA TAKESHI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2003 ( 84 ) 79 - 84  2003.08

     View Summary

    Recently, the chip multiprocessor architecture, which integrates multiple processors on a chip, has become a popular approach even in the commercial area. The authors have proposed the OSCAR chip multiprocessor (OSCAR CMP), which aims at exploiting multiple grains of parallelism hierarchically from a sequential program on a chip. OSCAR CMP has local data memory (LDM) for processor-private data and distributed shared memory having two ports for processor-shared data so that a compiler can control data allocation appropriately. This paper describes a data localization scheme for OSCAR CMP which exploits data locality by assigning coarse grain tasks sharing the same data to the same processor consecutively. In addition, OSCAR CMP using the data localization scheme is compared with a shared cache architecture and a snooping cache architecture. Then, the current naive code generation for OSCAR CMP is examined using the evaluation results.

    CiNii

  • Parallel Processing on MPEG2 Encoding for OSCAR Chip Multiprocessor

    KODAKA TAKESHI, NAKANO HIROHUMI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2003 ( 84 ) 55 - 60  2003.08

     View Summary

    Recently, multimedia applications with visual and sound processing are popular on mobile phones and PDAs. To satisfy the needs for efficient multimedia processing, development of low cost, low power consumption and high performance processors for multimedia applications has been expected. Chip multiprocessor architectures which allows us to attain scalability using coarse grain level parallelism and loop level parallelism in addition to instruction level parallelism are attracting much attention. However, to realize efficient processing on chip multiprocessor architectures, parallel processing techniques such as decomposing a program into adequate tasks considering characteristics of a program and assigning these tasks onto processors are essential. This paper describes a parallel processing scheme for MPEG2 encoding for a chip multiprocessor and its performance.

    CiNii

  • Data Localization using Coarse Grain Task Parallelization on Chip Multiprocessor

    NAKANO HIROFUMI, KODAKA TAKESHI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2003 ( 10 ) 13 - 18  2003.01

     View Summary

    Recently, the Chip Multiprocessor (CMP) architecture has attracted much attention as a next-generation microprocessor architecture, and many kinds of CMP have been widely developed. However, these CMP architectures still have the problem of effective use of the memory system near the processor cores, such as cache and local memory. The authors have proposed OSCAR CMP, which works cooperatively with multigrain parallel processing, to achieve high effective performance and good cost effectiveness. To overcome the problem of effective use of cache and local memory, OSCAR CMP has local data memory (LDM) for processor-private data and distributed shared memory (DSM) having two ports for synchronization and data transfer among processor cores, in addition to centralized shared memory (CSM). The multigrain parallelizing compiler uses this memory architecture of OSCAR CMP with a data localization scheme that fully uses compile-time information. This paper proposes a coarse grain task static scheduling scheme considering data localization using live variable analysis. Furthermore, a scheme for inserting data transfers between CSM and LDM using the information of live variable analysis is also described. This data localization scheme is implemented in the OSCAR FORTRAN multigrain parallelizing compiler and is evaluated on OSCAR CMP using Tomcatv from the SPEC fp 95 benchmark suite. As the results, the proposed scheme gives us about a 1.3 times speedup using 20 clocks as the access latency of CSM, and about 1.6 times using 40 clocks, respectively, against execution without the data localization scheme.

    CiNii

  • Multigrain Parallel Processing on OSCAR Chip Multiprocessor

    KIMURA KEIJI, KODAKA TAKESHI, OBATA MOTOKI, KASAHARA HIRONORI

    IPSJ SIG Notes   2002 ( 112 ) 29 - 34  2002.11

     View Summary

    This paper describes multigrain parallel processing on the OSCAR Chip Multiprocessor (OSCAR CMP). The aim of OSCAR CMP is to achieve both scalable performance improvement with effective use of the huge number of transistors on a chip and high efficiency of application development with compiler support. OSCAR CMP integrates simple single-issue processors having local data memory for private data recognized by the compiler and distributed shared data memory for optimal use of data locality over different loops, together with a compiler-controllable data transfer unit for overlapping data transfer; the multigrain parallelizing compiler, which exploits statement-level near fine grain parallelism, loop iteration level parallelism and coarse grain task parallelism hierarchically, fully controls this hardware. The performance of multigrain parallel processing on OSCAR CMP is evaluated using the SPEC fp 2000/95 benchmark suites. When a microSPARC-like single-issue core is used, OSCAR CMP having four CPU cores gives us a 2.98 times speedup in HYDRO2D, 3.84 times in TOMCATV, 3.84 times in MGRID, 3.97 times in SWIM, 2.36 times in FPPPP, 2.88 times in TURB3D, 2.64 times in SU2COR, 2.29 times in APPLU and 1.77 times in APSI.

    CiNii

  • Multigrain Parallel Processing on Motion Vector Estimation for Single Chip Multiprocessor

    KODAKA TAKESHI, SUZUKI TAKAHISA, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2002 ( 112 ) 23 - 28  2002.11

     View Summary

    With the recent increase of multimedia contents using JPEG and MPEG, low cost, low power consumption and high performance processors for multimedia application have been expected. Particularly, single chip multiprocessor architectures having simple processor cores that will be able to attain scalability and cost effectiveness are attracting much attention to develop such processors. Single chip multiprocessor architectures allow us to exploit coarse grain task level and loop level parallelism in addition to the instruction level parallelism, so parallel processing technology is indispensable to allow us scalable performance improvement. This paper describes a multigrain parallel processing scheme for motion vector estimation for a single chip multiprocessor and its performance is evaluated.

    CiNii

  • Evaluation of Overhead with Coarse Grain Task Parallel Processing on SMP Machines

    WADA YASUTAKA, NAKANO HIROFUMI, KIMURA KEIJI, OBATA MOTOKI, KASAHARA HIRONORI

    IPSJ SIG Notes   2002 ( 37 ) 13 - 18  2002.05

     View Summary

    Coarse grain task parallel processing, which exploits parallelism among loops, subroutines and basic blocks, is getting more important to attain performance improvement on multiprocessor architectures. To efficiently implement coarse grain task parallel processing, it is important to analyze various processor overheads quantitatively. This paper evaluates the overheads of barrier synchronization, thread fork/join and L2 cache miss penalty using performance measurement mechanisms to analyze the performance improvements by the OSCAR Fortran compiler on Sun Ultra80, IBM RS6000 and SGI Origin2000.

    CiNii

  • Multigrain Parallel Processing for JPEG Encoding Program on an OSCAR type Single Chip Multiprocessor

    KODAKA TAKESHI, UCHIDA TAKAYUKI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2002 ( 9 ) 19 - 24  2002.02

     View Summary

    With the recent increase of multimedia contents using JPEG and MPEG, low cost, low power consumption and high performance processors for multimedia have been expected. In particular, single chip multiprocessor architectures having simple processor cores are attracting much attention for developing such processors. This paper describes a multigrain parallel processing scheme for a JPEG encoding program on an OSCAR-type single chip multiprocessor and its performance. The evaluation shows an OSCAR-type single chip multiprocessor having four single-issue simple processor cores gave us a 3.59 times speedup over sequential execution and a 2.87 times speedup over an OSCAR-type single chip multiprocessor that has a four-issue UltraSPARC-II type superscalar processor core.

    CiNii

  • Near Fine Grain parallel Processing on Multimedia Application for Single Chip Multiprocessor

    KODAKA TAKESHI, MIYASHITA NAOHISA, KIMURA KEIJI, KASAHARA HIRONORI

    ARC   2001 ( 76 ) 61 - 66  2001.07

     View Summary

    With the recent increase of multimedia contents, such as JPEG and MPEG data, low-cost and low-power-consumption processors that can process these multimedia contents efficiently are expected. Among such microprocessors, single chip multiprocessor architectures having simple processor cores are attracting much attention. Considering the above, this paper evaluates a JPEG encoding program on an OSCAR type single chip multiprocessor architecture using near-fine-grain parallel processing of the 8×8 pixel blocks that are a fundamental part of the JPEG algorithm. The evaluation shows that an OSCAR type single chip multiprocessor having four single-issue simple processor cores gives 2.32 times speedup over a four-issue UltraSPARC-II type superscalar processor. (An illustrative sketch of intra-block parallelism follows this entry.)

    CiNii
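
    As a rough illustration of the kind of parallelism available inside one 8×8 block, the sketch below decomposes the 2-D DCT of a block into eight independent row transforms followed by eight independent column transforms and expresses them as OpenMP tasks. This is a generic example, not the OSCAR code generation or the exact granularity used in the paper.

        /* Generic illustration of near-fine-grain parallelism inside one 8x8 block
         * (not the OSCAR code generation): the 2-D DCT is split into eight
         * independent row transforms and eight independent column transforms. */
        #include <stdio.h>
        #include <math.h>
        #include <omp.h>
        #ifndef M_PI
        #define M_PI 3.14159265358979323846
        #endif

        static void dct8(const double in[8], double out[8])   /* 1-D 8-point DCT-II */
        {
            for (int k = 0; k < 8; k++) {
                double s = 0.0;
                for (int n = 0; n < 8; n++)
                    s += in[n] * cos(M_PI * (2 * n + 1) * k / 16.0);
                out[k] = s * (k == 0 ? sqrt(1.0 / 8.0) : sqrt(2.0 / 8.0));
            }
        }

        static void dct8x8(double blk[8][8], double out[8][8])
        {
            double tmp[8][8];
            #pragma omp parallel
            #pragma omp single
            {
                for (int r = 0; r < 8; r++) {                 /* independent row DCTs */
                    #pragma omp task firstprivate(r) shared(blk, tmp)
                    dct8(blk[r], tmp[r]);
                }
                #pragma omp taskwait
                for (int c = 0; c < 8; c++) {                 /* independent column DCTs */
                    #pragma omp task firstprivate(c) shared(tmp, out)
                    {
                        double col[8], res[8];
                        for (int r = 0; r < 8; r++) col[r] = tmp[r][c];
                        dct8(col, res);
                        for (int r = 0; r < 8; r++) out[r][c] = res[r];
                    }
                }
            }
        }

        int main(void)
        {
            double blk[8][8], out[8][8];
            for (int r = 0; r < 8; r++)
                for (int c = 0; c < 8; c++) blk[r][c] = r * 8 + c;
            dct8x8(blk, out);
            printf("DC coefficient = %f\n", out[0][0]);
            return 0;
        }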

  • A Static Scheduling Scheme for Coarse Grain Tasks considering Cache Optimization on SMP

    NAKANO HIROFUMI, ISHIZAKA KAZUHISA, OBATA MOTOKI, KIMURA KEIJI, KASAHARA HIRONORI

    IPSJ SIG Notes   2001 ( 76 ) 67 - 72  2001.07

     View Summary

    Effective use of cache memory based on data locality is getting more important with the increasing gap between processor speed and memory access speed. For parallel processing on multiprocessor systems, it seems difficult to achieve large performance improvement with conventional loop-iteration-level parallelism alone. This paper proposes a coarse-grain task static scheduling scheme considering cache optimization. The proposed scheme is based on macro data flow parallel processing, which uses coarse-grain task parallelism among tasks such as loop blocks, subroutines and basic blocks. It is implemented in the OSCAR Fortran multigrain parallelizing compiler and evaluated on a Sun Ultra80 four-processor SMP machine using swim and tomcatv from the SPEC fp 95 benchmark suite. As a result, the proposed scheme gives 4.56 times speedup for swim and 2.37 times for tomcatv, respectively, against the Sun Forte HPC 6 loop parallelizing compiler on 4 processors. (A simplified sketch of locality-aware static scheduling follows this entry.)

    CiNii
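
    The sketch below is a heavily simplified, assumed illustration of the general idea of cache-conscious static scheduling: a greedy list scheduler assigns ready coarse-grain tasks to the earliest-free processor, preferring a task that shares data with the task that processor ran last. The task graph, costs and sharing relation are invented; the paper's actual scheme is not reproduced here.

        /* Assumed, simplified illustration of locality-aware list scheduling of
         * coarse-grain tasks onto processors; the DAG, costs and sharing relation
         * are hypothetical. */
        #include <stdio.h>

        #define NT 6                    /* number of macrotasks in a hypothetical DAG */
        #define NP 2                    /* number of processors                       */

        static int cost[NT] = { 4, 3, 3, 2, 2, 5 };     /* estimated task costs       */
        static int ndep[NT] = { 0, 1, 1, 1, 1, 2 };     /* unscheduled predecessors   */
        static int succ[NT][NT];                        /* succ[i][j]: i precedes j   */
        static int share[NT][NT];                       /* 1 if two tasks share data  */

        int main(void)
        {
            /* hypothetical dependences: 0->1, 0->2, 1->3, 2->4, 3->5, 4->5 */
            succ[0][1] = succ[0][2] = succ[1][3] = succ[2][4] = succ[3][5] = succ[4][5] = 1;
            share[1][3] = share[3][1] = 1;      /* MT1 and MT3 touch the same array */
            share[2][4] = share[4][2] = 1;      /* MT2 and MT4 touch the same array */

            int scheduled[NT] = {0}, ready_at[NT] = {0};
            int free_at[NP] = {0}, last[NP] = { -1, -1 };

            for (int n = 0; n < NT; n++) {
                int p = 0;                              /* earliest-free processor */
                for (int i = 1; i < NP; i++) if (free_at[i] < free_at[p]) p = i;

                int pick = -1;                          /* prefer a ready task that */
                for (int t = 0; t < NT; t++) {          /* shares data with last[p] */
                    if (scheduled[t] || ndep[t] > 0) continue;
                    if (pick < 0) pick = t;
                    if (last[p] >= 0 && share[t][last[p]]) { pick = t; break; }
                }

                int start = free_at[p] > ready_at[pick] ? free_at[p] : ready_at[pick];
                printf("MT%d -> P%d  start=%d  finish=%d\n",
                       pick, p, start, start + cost[pick]);
                free_at[p] = start + cost[pick];
                last[p] = pick;
                scheduled[pick] = 1;
                for (int s = 0; s < NT; s++)
                    if (succ[pick][s]) {
                        ndep[s]--;                      /* successor gets one step closer to ready */
                        if (free_at[p] > ready_at[s]) ready_at[s] = free_at[p];
                    }
            }
            return 0;
        }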

  • Processor Core Architecture of Single Chip Multiprocessor for Near Fine Grain Parallel Processing

    KIMURA KEIJI, UCHIDA TAKAYUKI, KATO TAKAYUKI, KASAHARA HIRONORI

    IPSJ SIG Notes   2000 ( 74 ) 91 - 96  2000.08

     View Summary

    With the continuing increase in the number of transistors integrated onto a chip, how to achieve scalable performance improvement by using these transistors effectively has become a very important issue. In particular, exploiting different grains of parallelism in addition to instruction-level parallelism, and using this parallelism effectively on a single chip, is getting more important. To this end, single chip multiprocessor (SCM) architectures that contain multiple processor cores have attracted much attention. To determine a suitable SCM processor core architecture for multigrain parallel processing, this paper evaluates several SCM architectures that have different instruction issue widths and numbers of global shared register files for near-fine-grain parallel processing, which is one of the key issues in multigrain parallel processing.

    CiNii

  • Memory access analyzer for a Multi-grain parallel processing

    IWAI Keisuke, OBATA Motoki, KIMURA Keiji, AMANO Hideharu, KASAHARA Hironori

    IEICE technical report. Computer systems   99 ( 252 ) 1 - 8  1999.08

     View Summary

    Multi-grain parallel processing is proposed to exploit the inherent parallelism in application programs as much as possible. Although this method can be realized on various architectures, a dedicated multiprocessor architecture is required to achieve the maximum performance. The multiprocessor system ASCA (Advanced Scheduling oriented Computer Architecture) is proposed for efficient execution of multi-grain parallel processing. It provides various mechanisms for this purpose, including a dedicated memory structure used efficiently for multi-grain parallel processing. A memory access analyzer is developed for investigating memory access characteristics in multi-grain parallel processing. Based on the results of analyzing a real application with multi-grain parallel processing, an efficient memory structure for ASCA is discussed.

    CiNii

  • Performance Evaluation of Near Finegrain Parallel Processing on the Single Chip Multiprocessor

    KIMURA KEIJI, MANAKA KUNIYUKI, OGATA WATARU, OKAMOTO MASAMI, KASAHARA HIRONORI

    IPSJ SIG Notes   1999 ( 67 ) 19 - 24  1999.08

     View Summary

    Advances in semiconductor technology allow us to integrate many integer and floating-point execution units, memories or processors on a single chip. To use these resources effectively, much research on next-generation microprocessor architectures and their software, especially compilers, has been performed. Among these next-generation microprocessor architectures, a single chip multiprocessor (SCM) using multigrain parallel processing, which hierarchically exploits different levels of parallelism from a whole program, is one of the most promising architectures. This paper evaluates the performance of SCM architectures for near-fine-grain parallel processing, which is one of the key issues in multigrain parallel processing, using several real application programs.

    CiNii

  • Evaluation of Multigrain Parallelism using OSCAR FORTRAN Compiler

    OBATA Motoki, MATSUI Gantetsu, MATSUZAKI Hidenori, KIMURA Keiji, INAISHI Daisuke, UJIGAWA Yasushi, YAMAMOTO Terumasa, OKAMOTO Masami, KASAHARA Hironori

    IPSJ SIG Notes   1998 ( 70 ) 13 - 18  1998.08

     View Summary

    Currently, the peak performance of supercomputers reaches the TFLOPS order, and it seems that peak performance will continue to increase. However, supercomputers face the problem that enlarging their market is very difficult because of relatively low cost performance and difficulty of use. In microprocessors, the limits of instruction-level parallelism extraction by superscalar and VLIW architectures are becoming clear, and the single chip multiprocessor is receiving much attention as a next-generation processor architecture. In order to improve effective performance, cost performance and ease of use, the authors have been proposing a multigrain automatic parallelizing compilation scheme. Multigrain parallel processing is a method that extracts all parallelism from a program, such as coarse-grain parallelism among subroutines, loops and basic blocks, medium-grain parallelism among loop iterations, and fine-grain parallelism among instructions and statements. This paper shows the effectiveness of multigrain parallel processing by the OSCAR multigrain FORTRAN parallelizing compiler, using the fluid flow problem solver ARC2D (Perfect Benchmarks) as an example.

    CiNii

  • Multigrain Parallel Processing on the Single Chip Multiprocessor

    KIMURA KEIJI, OGATA WATARU, OKAMOTO MASAMI, KASAHARA HIRONORI

    IPSJ SIG Notes   1998 ( 70 ) 25 - 30  1998.08

     View Summary

    With the increase in the number of transistors integrated on a chip, how to use those transistors efficiently and improve the effective performance of a processor is becoming an important problem. However, superscalar and VLIW, which have been the popular architectures, are thought to have difficulty obtaining scalable improvement of effective performance because of the limits of instruction-level parallelism. To cope with this problem, the authors have been proposing a single chip multiprocessor (SCM) approach that uses multigrain parallelism inside a chip, hierarchically exploiting loop parallelism and coarse-grain parallelism among subroutines, loops and basic blocks in addition to instruction-level parallelism. This paper describes a preliminary evaluation of the effectiveness of a single chip multiprocessor architecture with a shared cache, global registers, distributed shared memory and/or local memory, as the first step of research on SCM architectures supporting effective realization of multigrain parallel processing.

    CiNii

  • A Cache Optimization with Earliest Executable Condition Analysis

    INAISHI Daisuke, KIMURA Keiji, FUJIMOTO Kensaku, OGATA Wataru, OKAMOTO Masami, KASAHARA Hironori

    IPSJ SIG Notes   1998 ( 70 ) 31 - 36  1998.08

     View Summary

    Cache optimizations by a compiler for a single-processor machine have mainly been applied to a single nested loop. In contrast, this paper proposes a cache optimization scheme using earliest executable condition analysis for FORTRAN programs on a single-processor system. The OSCAR FORTRAN multigrain automatic parallelizing compiler decomposes a FORTRAN program into three types of macrotasks (MTs), namely loops, subroutines and basic blocks, analyzes the earliest executable condition of each MT to extract coarse-grain parallelism among MTs, and generates a macrotask graph (MTG). The MTG represents data dependences and extended control dependences among MTs, together with information on data shared among MTs. Using this MTG, the compiler performs global code motion to use the cache effectively: an MT that accesses data accessed by a precedent MT on the MTG is moved to immediately after that precedent MT to increase the cache hit rate. This optimization is realized by using the OSCAR multigrain compiler as a preprocessor that outputs an optimized sequential FORTRAN code. A performance evaluation shows about 62% speedup compared with the original program on a 167 MHz UltraSPARC. (A simplified sketch of this code motion follows this entry.)

    CiNii
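
    The sketch below gives a simplified, assumed illustration of the code motion idea only (the earliest executable condition analysis itself is not reproduced): macrotasks of a small invented macrotask graph are emitted in an order that respects dependences but places a macrotask sharing data with the one just emitted immediately after it.

        /* Assumed, simplified illustration: emit macrotasks in dependence order,
         * preferring a macrotask that shares data with the one just emitted so the
         * shared data is reused while still in cache. MTG and sharing are invented. */
        #include <stdio.h>

        #define N 5
        static int dep[N][N];      /* dep[i][j] = 1 : MT j depends on MT i        */
        static int shares[N][N];   /* shares[i][j] = 1 : MT i and MT j share data */

        int main(void)
        {
            /* hypothetical MTG: 0->1, 0->2, 1->3, 2->4 */
            dep[0][1] = dep[0][2] = dep[1][3] = dep[2][4] = 1;
            shares[0][2] = shares[2][0] = 1;   /* MT0 and MT2 access the same array */
            shares[1][3] = shares[3][1] = 1;   /* MT1 and MT3 access the same array */

            int emitted[N] = {0}, last = -1;
            for (int count = 0; count < N; count++) {
                int pick = -1;
                for (int t = 0; t < N; t++) {
                    if (emitted[t]) continue;
                    int ready = 1;                     /* all predecessors emitted? */
                    for (int p = 0; p < N; p++)
                        if (dep[p][t] && !emitted[p]) ready = 0;
                    if (!ready) continue;
                    if (pick < 0) pick = t;
                    if (last >= 0 && shares[t][last]) { pick = t; break; }
                }
                printf("emit MT%d\n", pick);
                emitted[pick] = 1;
                last = pick;
            }
            return 0;
        }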

  • A Multigrain Parallelizing Compiler and Its Architectural Support

    KASAHARA Hironori, OGATA Wataru, KIMURA Keiji, OBATA Motoki, TOBITA Takao, INAISHI Daisuke

    Technical report of IEICE. ICD   98 ( 22 ) 71 - 76  1998.04

     View Summary

    Currently, the difficulty of enlarging the world market for supercomputers, caused by cost performance that does not look excellent in terms of real effective performance and by the high degree of experience required for parallel tuning, is becoming a problem. Also, in general-purpose microprocessors, the limits of instruction-level parallelism extraction by superscalar and VLIW architectures are becoming clear. This paper describes a multigrain compilation technology and architectural support for it as an approach to cope with the above difficulties and to develop user-friendly supercomputers and single chip multiprocessors with excellent cost performance.

    CiNii

  • Multi-processor system for Multi-grain Parallel Processing

    IWAI Keisuke, FUJIWARA Takashi, MORIMURA Tomohiro, AMANO Hideharu, KIMURA Keiji, OGATA Wataru, KASAHARA Hironori

    IEICE technical report. Computer systems   97 ( 225 ) 77 - 84  1997.08

     View Summary

    Multi-grain parallel processing is proposed to exploit the inherent parallelism in application programs as much as possible. Although this method can be realized on various architectures, a dedicated multiprocessor architecture is required to achieve the maximum performance. The multiprocessor system ASCA (Advanced Scheduling oriented Computer Architecture) is proposed for efficient execution of multi-grain parallel processing. It provides various mechanisms for this purpose, including a dedicated communication mechanism used efficiently for both coarse-grain and near-fine-grain parallel processing, and a custom-designed processor for static scheduling.

    CiNii

  • A Macro Task Dynamic Scheduling Algorithm with Overlapping of Task Processing and Data Transfer

    KIMURA KEIJI, HASHIMOTO SHIGERU, KOGOU MAKOTO, OGATA WATARU, KASAHARA HIRONORI

    CPSY97   97 ( 225 ) 33 - 38  1997.08

     View Summary

    Recently, multiprocessor systems having data transfer units that can transfer data asynchronously with the CPUs are becoming popular. Data transfer overhead can be hidden by using these data transfer units; however, it is difficult for users to write an optimized program that considers overlapping of data transfers and task processing. To hide the overhead caused by data transfers, this paper proposes a dynamic scheduling algorithm that considers data pre-loading and post-storing to overlap data transfers with task processing. Preliminary performance evaluations by simulation show that the proposed scheduling scheme can reduce execution time by 26% compared with a scheduling scheme without pre-loading and post-storing. (A simplified double-buffering sketch follows this entry.)

    CiNii
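
    The sketch below illustrates only the overlap pattern the proposed scheduling relies on, not the algorithm itself: double buffering in which a hypothetical dtu_load() models an asynchronous data transfer unit with an OpenMP task, so the pre-load of block i+1 can proceed while block i is being processed. Buffer sizes and data are invented.

        /* Illustration of overlapping data transfer with task processing via
         * double buffering; dtu_load() is a hypothetical stand-in for an
         * asynchronous data transfer unit. */
        #include <stdio.h>
        #include <string.h>
        #include <omp.h>

        #define NBLK   8
        #define BLKSZ  1024

        static double global_mem[NBLK][BLKSZ];     /* "centralized shared memory"   */
        static double local_buf[2][BLKSZ];         /* double buffer in local memory */

        static void dtu_load(int blk, int buf)     /* hypothetical async pre-load   */
        {
            memcpy(local_buf[buf], global_mem[blk], sizeof local_buf[buf]);
        }

        static double process(const double *d)     /* the task body                 */
        {
            double s = 0.0;
            for (int i = 0; i < BLKSZ; i++) s += d[i] * d[i];
            return s;
        }

        int main(void)
        {
            for (int b = 0; b < NBLK; b++)
                for (int i = 0; i < BLKSZ; i++) global_mem[b][i] = b + i * 1e-3;

            double total = 0.0;
            #pragma omp parallel
            #pragma omp single
            {
                dtu_load(0, 0);                             /* initial pre-load */
                for (int b = 0; b < NBLK; b++) {
                    int cur = b & 1, nxt = 1 - cur;
                    if (b + 1 < NBLK) {
                        #pragma omp task firstprivate(b, nxt)   /* overlaps with processing */
                        dtu_load(b + 1, nxt);
                    }
                    total += process(local_buf[cur]);       /* task processing */
                    #pragma omp taskwait                    /* pre-load done before reuse */
                }
            }
            printf("total = %f\n", total);
            return 0;
        }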

▼display all

Industrial Property Rights

  • Parallelizing Compiler, Parallelizing Compilation Apparatus, and Parallel Program Generation Method

    6600888

    笠原 博徳, 木村 啓二, 梅田 弾, 見神 広紀

    Patent

  • Multiprocessor System

    6335253

    笠原 博徳, 木村 啓二

    Patent

  • Multiprocessor System

    笠原 博徳, 木村 啓二

    Patent

  • Parallelizing Compilation Method, Parallelizing Compiler, Parallelizing Compilation Apparatus, and In-Vehicle Device

    6018022

    笠原 博徳, 木村 啓二, 林 明宏, 見神 広紀, 梅田 弾, 金羽木 洋平

    Patent

  • Parallelism Extraction Method and Program Creation Method

    6319880

    木村 啓二, 林 明宏, 笠原 博徳, 見神 広紀, 金羽木 洋平, 梅田 弾

    Patent

  • Multiprocessor System and Synchronization Method for Multiprocessor System

    笠原 博徳, 木村 啓二

    Patent

  • Processor System and Accelerator

    6103647

    木村 啓二, 笠原 博徳

    Patent

  • Method for Generating Processor-Executable Code, Storage Area Management Method, and Code Generation Program

    5283128

    笠原 博徳, 木村 啓二, 間瀬 正啓

    Patent

  • Multiprocessor

    笠原 博徳, 木村 啓二

    Patent

  • Multiprocessor System and Synchronization Method for Multiprocessor System

    笠原 博徳, 木村 啓二

    Patent

  • Multiprocessor

    4304347

    笠原 博徳, 木村 啓二

    Patent

  • Memory Management Method, Information Processing Apparatus, Program Creation Method, and Program

    5224498

    笠原 博徳, 木村 啓二, 中野 啓史, 仁藤 拓実, 丸山 貴紀, 三浦 剛, 田川 友博

    Patent

  • Multiprocessor and Multiprocessor System

    4784842

    笠原 博徳, 木村 啓二

    Patent

  • Processor and Data Transfer Unit

    4476267

    笠原 博徳, 木村 啓二

    Patent

  • Global Compiler for Heterogeneous Multiprocessors

    4784827

    笠原 博徳, 木村 啓二, 鹿野 裕明

    Patent

  • Control Method for Heterogeneous Multiprocessor System and Multigrain Parallelizing Compiler

    4936517

    笠原 博徳, 木村 啓二, 白子 準, 和田 康孝, 伊藤 雅樹, 鹿野 裕明

    Patent

  • Multiprocessor System and Multigrain Parallelizing Compiler

    笠原 博徳, 木村 啓二, 白子 準, 伊藤 雅樹, 鹿野 裕明

    Patent

  • Multiprocessor System and Multigrain Parallelizing Compiler

    4082706

    笠原 博徳, 木村 啓二, 白子 準, 伊藤 雅樹, 鹿野 裕明

    Patent

  • Multiprocessor

    4784792

    笠原 博徳, 木村 啓二

    Patent

▼display all

 

Syllabus

▼display all

 

Overseas Activities

  • Research on Software and Hardware Organization Considering New Memory Hierarchies

    2017.08
    -
    2018.02

    United States   North Carolina State University

Sub-affiliation

  • Faculty of Science and Engineering   Graduate School of Fundamental Science and Engineering

Research Institute

  • 2022
    -
    2024

    Waseda Research Institute for Science and Engineering   Concurrent Researcher

  • 2022
    -
    2024

    Waseda Center for a Carbon Neutral Society   Concurrent Researcher

Internal Special Research Projects

  • Research on Matrix Computation with Fully Homomorphic Encryption Aimed at Use in Deep Learning Frameworks

    2020  

     View Summary

    In FY2020, Microsoft Research's SEAL library was used as the base software for this research; the execution times of the various operations that make up a homomorphically encrypted matrix multiplication were measured, and their overhead and parallelism were investigated. First, the matrix multiplication was parallelized with OpenMP and executed on an 8-core Intel Xeon W-2145 (3.70 GHz), yielding about a 6x performance improvement over single-core execution. In addition, an attempt was made to accelerate the operations making up the encrypted matrix multiplication with SIMD instructions (AVX-512). As a result, by shrinking the basic data type used inside the library from 64 bits to 32 bits and widening the SIMD operations, a key matrix operation could be made 3.48 times faster than the original SIMD implementation.
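
    The sketch below shows only the OpenMP parallelization pattern of the matrix multiplication on plain floating-point data; the actual work parallelized SEAL homomorphic ciphertext operations, which are not reproduced here, and the matrix size is an assumed value.

        /* Parallelization pattern only: rows of the result are independent, so the
         * outer loop is a natural OpenMP target. Plain doubles stand in for the
         * SEAL ciphertext operations used in the actual work. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <omp.h>

        #define N 512

        int main(void)
        {
            static double a[N][N], b[N][N], c[N][N];
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    a[i][j] = (double)rand() / RAND_MAX;
                    b[i][j] = (double)rand() / RAND_MAX;
                }

            double t0 = omp_get_wtime();
            #pragma omp parallel for            /* one result row per iteration */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    double s = 0.0;
                    for (int k = 0; k < N; k++)
                        s += a[i][k] * b[k][j];
                    c[i][j] = s;
                }
            printf("%.3f s on %d threads, c[0][0]=%f\n",
                   omp_get_wtime() - t0, omp_get_max_threads(), c[0][0]);
            return 0;
        }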

  • Research on Heterogeneous Multicores in Which the CPU and Accelerators Cooperate via Flags

    2014  

     View Summary

    This research develops techniques that reduce the overhead of accelerator control and data transfer on heterogeneous multicores with accelerators. Specifically, it develops task partitioning and scheduling methods that can hide this overhead by executing the CPU, the data transfer unit (DTU) and the accelerator simultaneously, and implements them in an automatic parallelizing compiler. As this year's results, the basic specification of the accelerator assumed by this research was first decided. On top of that, a compiler module for this accelerator was developed, and an architecture simulator for the accelerator was also developed, establishing a basic evaluation environment for carrying out this research.

  • Research on Accelerating Multicore Simulation Using Compiler Analysis Information and Real-Machine Execution Information

    2009  

     View Summary

    In computer architecture research, software architecture simulation plays a major role in evaluating systems with various configurations. However, software simulators take thousands of times longer than real machines to execute a program, and such enormous evaluation time will be a major obstacle to future many-core research and development. This research studies methods for accelerating software simulation of multicores and many-cores to overcome this problem. In particular, for accelerating simulation for parallel architecture research, approaches that map the cores of the simulated virtual multicore or multiprocessor onto the cores of the real multiprocessor running the simulator have been proposed, but the parallel processing overhead on the real machine is large and no practical system has been realized so far. The distinguishing feature of this research is that it uses analysis information from a parallelizing compiler, such as loop structure and parallelization information, together with execution information of the target application on a real machine, to accelerate software simulation of multicores and many-cores. Using this information, the parts that must be simulated in detail and the parts that need not be are identified. By exploiting this additional information, which has not been used in conventional software simulation acceleration methods, accurate performance figures can be obtained at minimal execution cost. This year, preliminary experiments were carried out to examine the basic applicability of this acceleration method. Specifically, for two kinds of multicore architectures with up to 32 cores, the number of iterations of each benchmark program's main loop was varied, and it was investigated whether the performance at the original iteration count could be reproduced by the proposed performance estimation method. The SPEC95 benchmarks tomcatv and swim and an AAC encoding program, which is standard in audio compression, were used as benchmarks. As a result, for every combination of architecture, core count and benchmark, the performance of the original several hundred iterations could be predicted from the performance of only a few iterations with an error of at most about 2%. In the future, the target applications will be expanded and the system will be automated.
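
    The sketch below is a minimal, assumed illustration of the estimation idea: the cost of a few iterations of a main loop is measured and scaled to the full iteration count. A dummy kernel and wall-clock time stand in for detailed simulation, and the iteration counts are invented.

        /* Minimal illustration of extrapolating full-loop performance from a few
         * sampled iterations; a dummy kernel stands in for detailed simulation. */
        #include <stdio.h>
        #include <omp.h>

        #define N            (1 << 20)
        #define FULL_ITERS   400          /* "original" number of main-loop iterations */
        #define SAMPLE_ITERS 4            /* iterations actually measured in detail    */

        static double kernel(double *x)   /* stands in for one main-loop iteration */
        {
            double s = 0.0;
            for (int i = 0; i < N; i++) { x[i] = x[i] * 1.000001 + 1.0; s += x[i]; }
            return s;
        }

        int main(void)
        {
            static double x[N];
            double sink = 0.0;

            double t0 = omp_get_wtime();
            for (int it = 0; it < SAMPLE_ITERS; it++) sink += kernel(x);
            double per_iter = (omp_get_wtime() - t0) / SAMPLE_ITERS;

            printf("measured %d iters, estimated time for %d iters: %.3f s (sink=%g)\n",
                   SAMPLE_ITERS, FULL_ITERS, per_iter * FULL_ITERS, sink);
            return 0;
        }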

  • Research on Memory Optimization in Software-Cooperative Chip Multiprocessors

    2004  

     View Summary

    In this research, a multigrain parallelizing compiler and a chip multiprocessor architecture platform were first selected, and an evaluation infrastructure was prepared, as the base technology for data locality optimization and data transfer optimization. As the compiler, the OSCAR multigrain parallelizing compiler developed in the METI Millennium Project IT21 Advanced Parallelizing Compiler project was used as the core. As the chip multiprocessor architecture, an OSCAR-type chip multiprocessor was adopted, in which processing elements (PEs), each having a simple processor core, local data memory, a two-port distributed shared memory and a data transfer unit, are connected by an inter-PE network. In this research, a back end (code generator) for the OSCAR-type chip multiprocessor was additionally developed for the OSCAR multigrain parallelizing compiler. As a first step toward developing data locality and data transfer optimization technologies, the Tomcatv and Swim programs from the SPECfp95 benchmarks, typical examples of scientific computation, were chosen as target applications. Tasks (the unit of parallel processing) and data were scheduled onto the PEs considering both data locality and parallelism, and exchanges between shared memory and the processors' local memories (data local memory and distributed shared memory) were handled by the data transfer unit, which operates asynchronously with the processor, improving the use of data locality and the efficiency of data transfer processing. Evaluation on 8 PEs achieved 1.56 times speedup for Tomcatv and 1.38 times for Swim over the case without data locality optimization.
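
    The sketch below illustrates, in a heavily simplified and assumed form, the locality idea behind this evaluation: two consecutive loops are blocked identically and each block is bound to the same worker in both loops, so data touched in the first loop is reused from the same PE's local memory or cache in the second. It is not the OSCAR compiler's actual scheduling; the "PEs" are OpenMP threads and the arrays are invented.

        /* Assumed, simplified illustration of keeping the same data block on the
         * same "PE" (here an OpenMP thread) across two consecutive loops. */
        #include <stdio.h>
        #include <omp.h>

        #define N   (1 << 16)
        #define NPE 8                               /* assumed number of PEs */

        static double a[N], b[N];

        int main(void)
        {
            int chunk = N / NPE;
            #pragma omp parallel num_threads(NPE)
            {
                int pe = omp_get_thread_num();
                int lo = pe * chunk, hi = (pe == NPE - 1) ? N : lo + chunk;

                for (int i = lo; i < hi; i++)       /* loop 1: writes a[lo..hi) */
                    a[i] = i * 0.5;

                #pragma omp barrier                 /* marks the loop boundary (not strictly
                                                     * needed: each PE reads only its own block) */

                for (int i = lo; i < hi; i++)       /* loop 2 reuses the same block */
                    b[i] = a[i] * a[i] + 1.0;
            }
            printf("b[N-1] = %f\n", b[N - 1]);
            return 0;
        }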