Details of a Researcher - KIMURA, Keiji

写真a

KIMURA, Keiji

Scopus Paper Info

Paper Count: 79 Citation Count: 370 h-index: 10

Click to view the Scopus page. The data was downloaded from Scopus API in October 30, 2025, via http://api.elsevier.com and http://www.scopus.com .

Google Scholar Information (Citations per year)

Citation Count: 1295 h-index: 17 i10-index: 34

Click to view the Google Scholar page.

Scopus Information

Affiliation

Faculty of Science and Engineering, School of Fundamental Science and Engineering

Job title

Professor

Degree

博士(工学) ( 早稲田大学 )
Doctor of Engineering

Homepage URL

http://www.apal.cs.waseda.ac.jp/

Research Experience

2012

-

　

Professor, Department of Computer Science and Engineering, Waseda University
2005

-

2012

Associate Professor, Department of Computer Science, Waseda University
2004

-

2005

Assistant Professor, Department of Computer Science, Waseda University
2002

-

2004

Visiting Assistant Professor, Advanced Research Institute for Science and Engineering, Waseda University
1999

-

2002

Research Associate, Department of Electrical, Electronics and Computer Engineering, Waseda University

Education Background

1998.04

-

2001.03

Waseda University Graduate School of Science and Engineering Department of Electrical Engineering (Ph.D course)

Ph.D in Electrical Engineering
1996.04

-

1999.03

Waseda University Graduate School of Science and Engineering Department of Electrical Engineering (Master course)

M.E. in Electrical Engineering
1992.04

-

1996.03

Waseda University Faculty of Science and Engineering Department of Electronics

Committee Memberships

2022.04

-

2022.10

The 31st International Conference on Parallel Architectures and Compilation Techniques (PACT 2022)
2021

-

　

The 30th International Conference on Parallel Architectures and Compilation Techniques (PACT 2021)
2021

-

　

The 34th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2021)
2021

-

　

ACM Principles and Practice of Parallel Programming 2021 (PPoPP 2021), Extended Review Committee
2020

-

　

The 26th IEEE International Symposium on High-Performance Computer Architecture Program Committee
2018

-

2020

IEEE International Parallel & Distributed Processing Symposium (IPDPS 2018-2020) Program Committee
2019

-

　

The 37th IEEE International Conference on Computer Design (ICCD 2019) Program track Chair (Processor Architecture)
2019

-

　

24th Asia and South Pacific Design Automation Conference (ASP-DAC 2019) Program Committee (On-chip Communication and Networks-on-Chip)
2018

-

　

Principles and Practice of Parallel Programming 2018 (PPoPP 2018) Publicity Chair
2018

-

　

IEEE COMPSAC 2018 Computer Architecture and Platforms Co-Chairs
2016

-

　

The 22nd IEEE International Conference on Parallel and Distributed Systems (ICPADS 2016) Program Vice Chair (Parallel / Distributed Algorithms and Applications)
2016

-

　

The 45th International Conference on Parallel Processing (ICPP-2016) Program Committee (Programming Models, Languages and Compilers)
2016

-

　

The 3rd International Workshop on Software and Engineering for Parallel Sysmtems (SEPS 2016) Program Committee
2015

-

　

The 24th International Conference on Parallel Architectures and Compilation Techniques (PACT 2015) Program Committee
2015

-

　

27th International Symposium on Computer Architecture and High Performance Computing (SBAC PAD 2015) Program Committee (Software Track)
2015

-

　

15th International Symposium on High-Performance Computer Architecture (HPCA-15) Publicity Co-Chairs
2010.04

-

2014.03

情報処理学会計算機アーキテクチャ研究会幹事
2014

-

　

The 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS) Program Committee
2014

-

　

The 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS) Program Committee
2011

-

2014

The 24--27th International Workshop on Languages and Compilers for Parallel Computing (LCPC ) Program Committee, Program Chair (2012)
2010.04

-

2013.03

情報処理学会組込システム研究会運営委員
2013

-

　

The 13th International Forum on Embedded MPSoC and Multicore (MPSoC2013) Finace Co-Chairs
2013

-

　

The 27th Internationcal Conference on Supercomputing (ICS 2013) Program Committee
2009

-

2013

IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips XII--XVII) Program Committee
2009

-

2013

XXVII--XXXII IEEE International Conference on Computer Design (ICCD ) Program Committee (Computer System Design and Application Track)
2012

-

　

The 12th International Forum on Embedded MPSoC and Multicore (MPSoC2012) Program Co-Chairs
2011

-

　

Advanced Parallel Processing Technology Symposium (APPT ) Program Committee
2011

-

　

The 17th IEEE International Conference on Parallel and Distributed Systems (ICPADS ) Program Committee (Multicore Computing and Parallel / Distributed Architecture)
2008.04

-

2010.03

情報処理学会計算機アーキテクチャ研究会運営委員
2010

-

　

22nd International Symposium on Computer Architecture and High Performance Computing (SBAC PAD ) Program Committee (System Software Track)
2010

-

　

IEEE International Symposium on Workload Characterization (IISWC-2010) Program Committee
2005.04

-

2009.03

情報処理学会学会誌編集委員 SWG
2005.04

-

2009.03

情報処理学会システムLSI設計技術研究会（SLDM）運営委員
2005

-

2009.03

情報処理学会論文誌コンピューティングシステム ACS 論文誌編集委員会
2009

-

　

The 38th International Conference on Parallel Processing (ICPP-2009) Program Committee (Programming Models, Languages and Compilers)
2006

-

2008

IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips IX--XI) Program Committee Vice Chair
2006

-

2008

IPSJ ComSys Program Committee
2006

-

2008

ComSys - コンピュータシステムシンポジウムプログラム委員
2007

-

　

IPSJ DA Symposium University Chair
2007

-

　

情報処理学会 DAシンポジウム大学幹事
2007

-

　

IPSJ SACSIS Program Committee Vice Chair
2007

-

　

SACSIS 先進的計算基盤システムシンポジウムプログラム副委員長
2006

-

　

IPSJ SACSIS , 2008--2013 Program Committee
2006

-

　

SACSIS , 2008--2013 - 先進的計算基盤システムシンポジウムプログラム委員
2003

-

2006

並列/分散/協調処理に関するサマーワークショップ(SWoPP) 実行委員
2001.04

-

2005.03

情報処理学会システムソフトウェアとオペレーティング・システム研究会運営委員
2001.04

-

2005.03

情報処理学会学会誌編集委員 BWG, （最終年度主査）
2004

-

　

SACSIS 先進的計算基盤システムシンポジウム会計委員長・プログラム委員

▼display all

Professional Memberships

　

　

　

ACM
　

　

　

IEEE Computer Society
　

　

　

The institute of Electronics, Information and Communication Engineers
　

　

　

Information Processing Society of Japan

Research Areas

Computer system

Research Interests

Multiprocessor Architecture, Parallelizing Compiler, Secure Computer Systems

Awards

MEXT Award for Science and Technology (Research category)

2014.04 Ministry of Education、Culture、Sports、Science and Technology (MEXT)

Papers

Parallel Verification in RISC-V Secure Boot

Akihiro Saiki, Yu Omori, Keiji Kimura

2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) 2023.12

DOI
Parallelizing Factory Automation Ladder Programs by OSCAR Automatic Parallelizing Compiler

Tohma Kawasumi, Tsumura Yuta, Hiroki Mikami, Tomoya Yoshikawa, Takero Hosomi, Shingo Oidate, Keiji Kimura, Hironori Kasahara

Proc. of the 35th International Workshop on Languages and Compilers for Parallel Computing (LCPC2022) 2022.10 [Refereed]
Open-Source Hardware Memory Protection Engine Integrated With NVMM Simulator

Yu Omori, Keiji Kimura

IEEE Computer Architecture Letters 21 ( 2 ) 77 - 80 2022.08 [Refereed]

Authorship：Last author

DOI
Data stream clustering for low-cost machines

Christophe Cérin, Keiji Kimura, Mamadou Sow

Journal of Parallel and Distributed Computing 166 57 - 70 2022.08 [Refereed]

DOI

Scopus

3

Citation

(Scopus)
Open-Source RISC-V Linux-Compatible NVMM Emulator

Yu Omori, Keiji Kimura

Sixth Workshop on Computer Architecture Research with RISC-V (CARRV 2022) 2022.06 [Refereed]

Authorship：Last author
Lightweight Array Contraction by Trace-Based Polyhedral Analysis

Hugo Thievenaz, Keiji Kimura, Christophe Alias

C3PO’22: Compiler-assisted Correctness Checking and Performance Optimization for HPC 2022.06 [Refereed]
Rephrasing polyhedral optimizations with trace analysis

Hugo Thievenaz, Keiji Kimura, Christophe Alias

12th International Workshop on Polyhedral Compilation Techniques (IMPACT 2022) 2022.06 [Refereed]
Accelerating Data Dependence Profiling Through Abstract Interpretation of Loop Instructions

Mostafa Abbas, Mostafa I. Soliman, Sherif I. Rabia, Keiji Kimura, Ahmed El-Mahdy

IEEE Access 10 31626 - 31640 2022 [Refereed]

DOI
OSCAR Parallelizing and Power Reducing Compiler and API for Heterogeneous Multicores : (Invited Paper)

Hironori Kasahara, Keiji Kimura, Toshiaki Kitamura, Hiroki Mikami, Kazutaka Morita, Kazuki Fujita, Kazuki Yamamoto, Tohma Kawasumi

2021 IEEE/ACM Programming Environments for Heterogeneous Computing (PEHC) 2021.11 [Refereed] [Invited]

DOI
Parallelizing Compiler Translation Validation Using Happens-Before and Task-Set

Jixin Han, Tomofumi Yuki, Michelle Mills Strout, Dan Umeda, Hironori Kasahara, Keiji Kimura

2021 Ninth International Symposium on Computing and Networking Workshops (CANDARW) 2021.11 [Refereed]

DOI
Performance Evaluation of OSCAR Multi-target Automatic Parallelizing Compiler on Intel, AMD, Arm and RISC-V Multicores

Birk M. Magnussen, Tohma Kawasumi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

LCPC2021 2021.10 [Refereed]
Durable Queue Implementations Built on a Formally Defined Strand Persistency Model

Jixin Han, Keiji Kimura

Journal of Information Processing 29 823 - 838 2021 [Refereed]

Authorship：Last author

DOI
Secure Image Inference Using Pairwise Activation Functions

Jonas T. Agyepong, Mostafa Soliman, Yasutaka Wada, Keiji Kimura, Ahmed El-Mahdy

IEEE Access 9 118271 - 118290 2021 [Refereed]

DOI
Non-Volatile Main Memory Emulator for Embedded Systems Employing Three NVMM Behaviour Models

Yu OMORI, Keiji KIMURA

IEICE TRANSACTIONS on Information and Systems E104-D ( 5 ) 697 - 708 2021 [Refereed]

Authorship：Last author
Scalable and Fast Lazy Persistency on GPUs

Ardhi Wiratama, Baskara Yudha, Keiji Kimura, Huiyang Zhou, Yan Solihin

2020 IEEE International Symposium on Workload Characterization (IISWC 2020) 252 - 263 2020.10 [Refereed]
Local Memory Mapping of Multicore Processors on an Automatic Parallelizing Compiler

Yoshitake OKI, Yuto ABE, Kazuki YAMAMOTO, Kohei YAMAMOTO, Tomoya SHIRAKAWA, Akimasa YOSHIDA, Keiji KIMURA, Hironori KASAHARA

IEICE TRANSACTIONS on Electronics E103-C ( 3 ) 98 - 109 2020.03 [Refereed]
Compiler Software Coherent Control for Embedded High Performance Multicore

Boma A. ADHI, Tomoya KASHIMATA, Ken TAKAHASHI, Keiji KIMURA, Hironori KASAHARA

IEICE TRANSACTIONS on Electronics E103-C ( 3 ) 85 - 97 2020.03 [Refereed]
Compiler-support for Critical Data Persistence in NVM

Reem Elkhouly, Mohammad Alshboul, Akihiro Hayashi, Yan Solihin, Keiji Kimura

ACM Transactions on Architecture and Code Optimization (TACO) 16 ( 4 ) 2019.12 [Refereed]

Authorship：Last author
Software Cache Coherent Control by Parallelizing Compiler

Boma A. Adhi, Masayoshi Mase, Yuhei Hosokawa, Yohei Kishimoto, Taisuke Onishi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11403 17 - 25 2019.11 [Refereed]
Cascaded DMA Controller for Speedup of Indirect Memory Access in Irregular Applications

Tomoya Kashimata, Toshiaki Kitamura, Keiji Kimura, Hironori Kasahara

9th Workshop on Irregular Applications: Architectures and Algorithms 71 - 76 2019.11 [Refereed]
Performance of Static and Dynamic Task Scheduling for Real-Time Control System on Embedded Multicore Processor

Yoshitake Oki, Hiroki Mikami, Hikaru Nishida, Dan Umeda, Keiji Kimura, Hironori Kasahara

32nd International Workshop on Languages and Compilers for Parallel Computing(LCPC) 2019.10 [Refereed]
Performance Evaluation on NVMM Emulator Employing Fine-Grain Delay Injection

Yu Omori, Keiji Kimura

The 8th IEEE Non-Volatile Memory Systems and Applications Symposium (IEEE NVMSA 2019) 1 - 6 2019.08 [Refereed]

Authorship：Last author

DOI

Scopus

3

Citation

(Scopus)
Fast and Highly Optimizing Separate Compilation for Automatic Parallelization

Tohma Kawasumi, Ryota Tamura, Yuya Asada, Jixin Han, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

The 2019 International Conference on High Performance Computing & Simulation (HPCS 2019) 478 - 485 2019.07 [Refereed]
Efficient Checkpointing with Recompute Scheme for Non-volatile Main Memory

Mohammad Alshboul, Hussein Elnawawy, Reem Elkhouly, Keiji Kimura, James Tuck, Yan Solihin

ACM Transactions on Architecture and Code Optimization (TACO) 16 ( 2 ) 2019.05 [Refereed]
Multicore Cache Coherence Control by a Parallelizing Compiler

Hironori Kasahara, Keiji Kimura, Boma A. Adhi, Yuhei Hosokawa, Yohei Kishimoto, Masayoshi Mase

Proceedings - International Computer Software and Applications Conference 1 492 - 497 2017.09 [Refereed]

　View Summary

A recent development in multicore technology has enabled development of hundreds or thousands core processor. However, on such multicore processor, an efficient hardware cache coherence scheme will become very complex and expensive to develop. This paper proposes a parallelizing compiler directed software coherence scheme for shared memory multicore systems without hardware cache coherence control. The general idea of the proposed method is that an automatic parallelizing compiler analyzes the control dependency and data dependency among coarse grain task in the program. Then based on the obtained information, task parallelization, false sharing detection and data restructuration to prevent false sharing are performed. Next the compiler inserts cache control code to handle stale data problem. The proposed method is built on OSCAR automatic parallelizing compiler and evaluated on Renesas RP2 with 8 SH-4A cores processor. The hardware cache coherence scheme on the RP2 processor is only available for up to 4 cores and the hardware cache coherence can be completely turned off for non-coherence cache mode. Performance evaluation is performed using 10 benchmark program from SPEC2000, SPEC2006, NAS Parallel Benchmark (NPB) and Mediabench II. The proposed method performs as good as or better than hardware cache coherence scheme. For example, 4 cores with the hardware coherence mechanism gave us speed up of 2.52 times against 1 core for SPEC2000 'equake', 2.9 times for SPEC2006 'lbm', 3.34 times for NPB 'cg', and 3.17 times for MediaBench II MPEG2 Encoder. The proposed software cache coherence control gave us 2.63 times for 4 cores and 4.37 for 8 cores for 'equake', 3.28 times for 4 cores and 4.76 times for 8 cores for lbm, 3.71 times for 4 cores and 4.92 times for 8 cores for 'MPEG2 Encoder'.

DOI

Scopus

8

Citation

(Scopus)
Automatic Local Memory Management for Multicores Having Global Address Space

Kouhei Yamamoto, Tomoya Shirakawa, Yoshitake Oki, Akimasa Yoshida, Keiji Kimura, Hironori Kasahara

LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, LCPC 2016 10136 282 - 296 2017 [Refereed]

　View Summary

Embedded multicore processors for hard real-time applications like automobile engine control require the usage of local memory on each processor core to precisely meet the real-time deadline constraints, since cache memory cannot satisfy the deadline requirements due to cache misses. To utilize local memory, programmers or compilers need to explicitly manage data movement and data replacement for local memory considering the limited size. However, such management is extremely difficult and time consuming for programmers. This paper proposes an automatic local memory management method by compilers through (i) multi-dimensional data decomposition techniques to fit working sets onto limited size local memory (ii) suitable block management structures, called Adjustable Blocks, to create application specific fixed size data transfer blocks (iii) multi-dimensional templates to preserve the original multi-dimensional representations of the decomposed multi-dimensional data that are mapped onto one-dimensional Adjustable Blocks (iv) block replacement policies from liveness analysis of the decomposed data, and (v) code size reduction schemes to generate shorter codes. The proposed local memory management method is implemented on the OSCAR multi-grain and multi-platform compiler and evaluated on the Renesas RP2 8 core embedded homogeneous multicore processor equipped with local and shared memory. Evaluations on 5 programs including multimedia and scientific applications show promising results. For instance, speedups on 8 cores compared to single core execution using off-chip shared memory on an AAC encoder program, a MPEG2 encoder program, Tomcatv, and Swim are improved from 7.14 to 20.12, 1.97 to 7.59, 5.73 to 7.38, and 7.40 to 11.30, respectively, when using local memory with the proposed method. These evaluations indicate the usefulness and the validity of the proposed local memory management method on real embedded multicore processors.

DOI

Scopus

2

Citation

(Scopus)
Architecture design for the environmental monitoring system over the winter season

Koichiro Yamashita, Chen Ao, Takahisa Suzuki, Yi Xu, Hongchun Li, Jun Tian, Keiji Kimura, Hironori Kasahara

MobiWac 2016 - Proceedings of the 14th ACM International Symposium on Mobility Management and Wireless Access, co-located with MSWiM 2016 27 - 34 2016.11 [Refereed]

　View Summary

One of the applications as a source of big data, there is a sensor network for the environmental monitoring that is designed to detect the deterioration of the infrastructure, erosion control and so on. The specific targets are bridges, buildings, slopes and embankments due to the natural disasters or aging. Basic requirement of this monitoring system is to collect data over a long period of time from a large number of nodes that installed in a wide area. However, in order to apply a wireless sensor network (WSN), using wireless communication and energy harvesting, there are not many cases in the actual monitoring system design. Because of the system must satisfy various conditions measurement location and time specified by the civil engineering communication quality and topology obtained from the network technology the electrical engineering to solve the balance of weather environment and power consumption that depends on the above-mentioned conditions. We propose the whole WSN design methodology especially for the electrical architecture that is affected by the network behavior and the environmental disturbance. It is characterized by determining recursively mutual trade-off of a wireless simulation and a power architecture simulation of the node devices. Furthermore, the system allows the redundancy of the design. In addition, we deployed the actual slope monitoring WSN that is designed by the proposed method to the snow-covered area. A conventional similar monitoring WSN, with 7 Ah Li-battery, it worked only 129 days in a mild climate area. On the other hand, our proposed system, deployed in the heavy snow area has been working more than 6 months (still working) with 3.2 Ah batteries. Finally, it made a contribution to the civil engineering succeeded in the real time observation of the groundwater level displacement at the time of melting snow in the spring season.

DOI

Scopus

2

Citation

(Scopus)
Reducing parallelizing compilation time by removing redundant analysis

Jixin Han, Rina Fujino, Ryota Tamura, Mamoru Shimaoka, Hiroki Mikami, Moriyuki Takamura, Sachio Kamiya, Kazuhiko Suzuki, Takahiro Miyajima, Keiji Kimura, Hironori Kasahara

SEPS 2016 - Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems, co-located with SPLASH 2016 1 - 9 2016.10 [Refereed]

　View Summary

Parallelizing compilers employing powerful compiler optimizations are essential tools to fully exploit performance from today's computer systems. These optimizations are supported by both highly sophisticated program analysis techniques and aggressive program restructuring techniques. However, the compilation time for such powerful compilers becomes larger and larger for real commercial application due to these strong program analysis techniques. In this paper, we propose a compilation time reduction technique for parallelizing compilers. The basic idea of the proposed technique is based on an observation that parallelizing compilers apply multiple program analysis passes and restructuring passes to a source program but all program analysis passes do not have to be applied to the whole source program. Thus, there is an opportunity for compilation time reduction by removing redundant program analysis. We describe the removing redundant program analysis techniques considering the inter-procedural propagation of analysis update information in this paper. We implement the proposed technique into OSCAR automatically multigrain parallelizing compiler. We then evaluate the proposed technique by using three proprietary large scale programs. The proposed technique can remove 37.7% of program analysis time on average for basic analysis includes def-use analysis and dependence calculation, and 51.7% for pointer analysis, respectively.

DOI

Scopus

2

Citation

(Scopus)
An Android Systrace Extension for Tracing Wakelocks

Bui Duc Binh, Keiji Kimura

IEEE International Conference on Embedded and Ubiquitous Computing (EUC 2016) 146 - 149 2016.08 [Refereed]

Authorship：Corresponding author
Multigrain Parallelization Using Profile Information of Embedded Applications Generated by Model-based Development Tools on Multicore Processors

Dan Umeda, Takahiro Suzuki, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

IPSJ Journal 57 ( 2 ) 1 - 12 2016.02 [Refereed]

CiNii
Android video processing system combined with automatically parallelized and power optimized code by OSCAR compiler

Bui Duc Binh, Tomohiro Hirano, Hiroki Mikami, Hideo Yamamoto, Keiji Kimura, Hironori Kasahara

Journal of Information Processing 24 ( 3 ) 504 - 511 2016 [Refereed]

　View Summary

The emergence of multi-core processors in smart devices promises higher performance and low power consumption. The parallelization of applications enables us to improve their performance. However, simultaneously utilizing many cores would drastically drain the device battery life. This paper shows a demonstration system of realtime video processing combined with power reduction controlled by the OSCAR automatic parallelization compiler on ODROID-X2, an open Android development platform based on Samsung Exynos4412 Prime with 4 ARM Cortext- A9 cores. In this paper, we exploited the DVFS framework, core partitioning, and profiling technique and OSCAR parallelization - power control algorithm to reduce the total consumption in a real-time video application. The demonstration results show that it can cut power consumption by 42.8% for MPEG-2 Decoder application and 59.8% for Optical Flow application by using 3 cores in both applications.

DOI CiNii

Scopus
Multigrain parallelization for model-based design applications using the OSCAR compiler

Dan Umeda, Takahiro Suzuki, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9519 125 - 139 2016 [Refereed]

　View Summary

Model-based design is a very popular software development method for developing a wide variety of embedded applications such as automotive systems, aircraft systems, and medical systems. Model-based design tools like MATLAB/Simulink typically allow engineers to graphically build models consisting of connected blocks for the purpose of reducing development time. These tools also support automatic C code generation from models with a special tool such as Embedded Coder to map models onto various kinds of embedded CPUs. Since embedded systems require real-time processing, the use of multi-core CPUs poses more opportunities for accelerating program execution to satisfy the real-time constraints. While prior approaches exploit parallelism among blocks by inspecting MATLAB/Simulink models, this may lose an opportunity for fully exploiting parallelism of the whole program because models potentially have parallelism within a block. To unlock this limitation, this paper presents an automatic parallelization technique for auto-generated C code developed by MATLAB/Simulink with Embedded Coder. Specifically, this work (1) exploits multi-level parallelism including inter-block and intra-block parallelism by analyzing the auto-generated C code, and (2) performs static scheduling to reduce dynamic overheads as much as possible. Also, this paper proposes an automatic profiling framework for the auto-generated code for enhancing static scheduling, which leads to improving the performance of MATLAB/Simulink applications. Performance evaluation shows 4.21 times speedup with six processor cores on Intel Xeon X5670 and 3.38 times speedup with four processor cores on ARM Cortex-A15 compared with uniprocessor execution for a road tracking application.

DOI

Scopus

12

Citation

(Scopus)
Coarse grain task parallelization of earthquake simulator GMS using OSCAR compiler on various Cc-NUMA servers

Mamoru Shimaoka, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9519 238 - 253 2016 [Refereed]

　View Summary

This paper proposes coarse grain task parallelization for a earthquake simulation program using Finite Difference Method to solve the wave equations in 3-D heterogeneous structure or the Ground Motion Simulator (GMS) on various cc-NUMA servers using IBM, Intel and Fujitsu multicore processors. The GMS has been developed by the National Research Institute for Earth Science and Disaster Prevention (NIED) in Japan. Earthquake wave propagation simulations are important numerical applications to save lives through damage predictions of residential areas by earthquakes. Parallel processing with strong scaling has been required to precisely calculate the simulations quickly. The proposed method uses the OSCAR compiler for exploiting coarse grain task parallelism efficiently to get scalable speed-ups with strong scaling. The OSCAR compiler can analyze data dependence and control dependence among coarse grain tasks, such as subroutines, loops and basic blocks. Moreover, locality optimizations considering the boundary calculations of FDM and a new static scheduler that enables more efficient task schedulings on cc-NUMA servers are presented. The performance evaluation shows 110 times speed-up using 128 cores against the sequential execution on a POWER7 based 128 cores cc-NUMA server Hitachi SR16000 VM1, 37.2 times speed-up using 64 cores against the sequential execution on a Xeon E7-8830 based 64 cores cc-NUMA server BS2000, 19.8 times speed-up using 32 cores against the sequential execution on a Xeon X7560 based 32 cores cc-NUMA server HA8000/RS440, 99.3 times speed-up using 128 cores against the sequential execution on a SPARC64 VII based 256 cores cc-NUMA server Fujitsu M9000, 9.42 times speed-up using 12 cores against the sequential execution on a POWER8 based 12 cores cc-NUMA server Power System S812L.

DOI

Scopus
2-Step Power Scheduling with Adaptive Control Interval for Network Intrusion Detection Systems on Multicores

Lau Phi Tuong, Keiji Kimura

2016 IEEE 10TH INTERNATIONAL SYMPOSIUM ON EMBEDDED MULTICORE/MANY-CORE SYSTEMS-ON-CHIP (MCSOC) 69 - 76 2016 [Refereed]

Authorship：Last author

　View Summary

Network intrusion detection system (NIDS) is becoming an important element even in embedded systems as well as in data centers since embedded computers have been increasingly exposed to the Internet. The demand for power budget of these embedded systems is a critical issue in addition to that for performance. In this paper, we propose a technique to minimize power consumption in the NIDS by 2-step power scheduling with the adaptive control interval. In addition, we also propose a CPU-core controlling algorithm so that our scheduling technique can preserve the performance for other applications and NIDS assuming the cases of multiplexing NIDS and them simultaneously on the same device such as a home server or a mobile platform. We implement our 2-step algorithm into Suricata, which is a popular NIDS, as well as a 1-step algorithm with the adaptive interval, and a simple fixed-interval algorithm for evaluations. Experimental results show that our 2-step scheduling with both the adaptive and the fixed 30-millisecond interval achieve 75% power saving comparing with the Ondemand governor and 87% comparing with the Performance governor in Linux, respectively, without affecting their performance capability on four ARM Cortex-A15 cores at the network traffic of 1,000 packets/seconds. In contrast, when the network traffic reaches to 17,000 packets/seconds, our 2-step scheduling and the Ondemand as well as the Performance governor can maintain the packet processing capacity while the fixed 30-milliseconds interval processes only 50% packets with two and three cores, and about 80% packets on four cores.

DOI

Scopus

1

Citation

(Scopus)
Accelerating Multicore Architecture Simulation Using Application Profile

Keiji Kimura, Gakuho Taguchi, Hironori Kasahara

2016 IEEE 10TH INTERNATIONAL SYMPOSIUM ON EMBEDDED MULTICORE/MANY-CORE SYSTEMS-ON-CHIP (MCSOC) 177 - 184 2016 [Refereed]

Authorship：Lead author

　View Summary

Architecture simulators play an important role in exploring frontiers in the early stages of the architecture design. However, the execution time of simulators increases with an increase the number of cores. The sampling simulation technique that was originally proposed to simulate single-core processors is a promising approach to reduce simulation time. Two main hurdles for multi/many-core are preparing sampling points and thread skewing at functional simulation time. This paper proposes a very simple and low-error sampling-based acceleration technique for multi/many-core simulators. For a parallelized application, an iteration of a large loop including a parallelizable program part, is defined as a sampling unit. We apply X-means method to a profile result of the collection of iterations derived from a real machine to form clusters of those iterations. Multiple iterations are exploited as sampling points from these clusters. We execute the simulation along the sampling points and calculate the number of total execution cycles. Results from a 16-core simulation show that our proposed simulation technique gives us a maximum of 443x speedup with a 0.52% error and 218x speedup with 1.50% error on an average.

DOI

Scopus

4

Citation

(Scopus)
Annotatable systrace: An extended linux ftrace for tracing a parallelized program

Daichi Fukui, Mamoru Shimaoka, Hiroki Mikami, Dominic Hillenbrand, Hideo Yamamoto, Keiji Kimura, Hironori Kasahara

SEPS 2015 - Proceedings of the 2nd International Workshop on Software Engineering for Parallel Systems 21 - 25 2015.10 [Refereed]

　View Summary

Investigation of the runtime behavior is one of the most important processes for performance tuning on a computer system. Profiling tools have been widely used to detect hot-spots in a program. In addition to them, tracing tools produce valuable information especially from parallelized programs, such as thread scheduling, barrier synchronizations, context switching, thread migration, and jitter by interrupts. Users can optimize a runtime system and hardware configuration in addition to a program itself by utilizing the attained information. However, existing tools provide information per process or per function. Finer information like task-or loop-granularity should be required to understand the program behavior more precisely. This paper has proposed a tracing tool, Annotatable Systrace, to investigate runtime execution behavior of a parallelized program based on an extended Linux ftrace. The Annotatable Systrace can add arbitrary annotations in a trace of a target program. The proposed tool exploits traces from 183.equake, 179.art, and mpeg2enc on Intel Xeon X7560 and ARMv7 as an evaluation. The evaluation shows that the tool enables us to observe load imbalance along with the program execution. It can also generate a trace with the inserted annotations even on a 32-core machine. The overhead of one annotation on Intel Xeon is 1.07 us and the one on ARMv7 is 4.44 us, respectively.

DOI

Scopus

5

Citation

(Scopus)
Evaluation of Automatic Power Reduction with OSCAR Compiler on Intel Haswell and ARM Cortex-A9 Multicores

Tomohiro Hirano, Hideo Yamamoto, Shuhei Iizuka, Kohei Muto, Takashi Goto, Tamami Wake, Hiroki Mikami, Moriyuki Takamura, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8967 239 - 252 2015.05 [Refereed]
Automatic Parallelization of Designed Engine Control C Codes by MATLAB/Simulink

Dan Umeda, Youhei Kanehagi, Hiroki Mikami, Akihiro Hayashi, Mitsuhiro Tani, Hiroshi Mori, Keiji Kimura, Hironori Kasahara

IPSJ Journal 55 ( 8 ) 1817 - 1829 2014.08 [Refereed]

　View Summary

Recently, more safety, comfort and environmental feasibility are required for the automobile. Accordingly, control systems need performance enhancement on microprocessors for real-time software which realize that. However, the improvement of clock frequency has been limited by power consumption and the performance of a single-core processor which controls power has reached the limits. For these factors, multi-core processors will be used for automotive control system. Recently Model-based Design by MATLAB and Simulink has been used for developing automobile systems because of elimination time of development and improvement of reliability. However, auto-generated-code from MATLAB and Simulink has been functioned on only single core processor so far. This paper proposes a parallelization method of engine control C codes for a multi-core processor generated from MATLAB and Simulink using Embedded Coder. The engine control C code which composed of many conditional branches and arithmetic assignment statements and are difficult to parallelize have been parallelized automatically using OSCAR automatic parallel compiler. In this result, it is succeeded to attain performance improvement on RP2 and V850E2R. Maximum 1.9x speedup on two cores and 3.76x speedup on four cores are attained.

CiNii
Multicore Technologies Realizing Low-Power Computing

Keiji Kimura, Hironori Kasahara

The Journal of IEICE 97 ( 2 ) 133 - 139 2014.02 [Invited]

Authorship：Lead author

CiNii
OSCAR Compiler Controlled Multicore Power Reduction on Android Platform

Hideo Yamamoto, Tomohiro Hirano, Kohei Muto, Hiroki Mikami, Takashi Goto, Dominic Hillenbrand, Moriyuki Takamura, Keiji Kimura, Hironori Kasahara

LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, LCPC 2013 8664 155 - 168 2014 [Refereed]

　View Summary

In recent years, smart devices are transitioning from single core processors to multicore processors to satisfy the growing demands of higher performance and lower power consumption. However, power consumption of multicore processors is increasing, as usage of smart devices become more intense. This situation is one of the most fundamental and important obstacle that the mobile device industries face, to extend the battery life of smart devices. This paper evaluates the power reduction control by the OSCAR Automatic Parallelizing Compiler on an Android platform with the newly developed precise power measurement environment on the ODROID-X2, a development platform with the Samsung Exynos4412 Prime, which consists of 4 ARM Cortex-A9 cores. The OSCAR Compiler enables automatic exploitation of multigrain parallelism within a sequential program, and automatically generates a parallelized code with the OSCAR Multi-Platform API power reduction directives for the purpose of DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating. The paper also introduces a newly developed micro second order pseudo clock gating method to reduce power consumption using WFI (Wait For Interrupt). By inserting GPIO (General Purpose Input Output) control functions into programs, signals appear on the power waveform indicating the point of where the GPIO control was inserted and provides a precise power measurement of the specified program area. The results of the power evaluation for real-time Mpeg2 Decoder show 86.7% power reduction, namely from 2.79[W] to 0.37[W] and for real-time Optical Flow show 86.5% power reduction, namely from 2.23[W] to 0.36[W] on 3 core execution.

DOI

Scopus

3

Citation

(Scopus)
モデルベース設計により自動生成されたエンジン制御Cコードのマルチコア用自動並列化

梅田弾, 金羽木洋平, 見神広紀, 谷充弘(デンソー, 森裕司(デンソー, 木村啓二, 笠原博徳

組み込みシステムシンポジウム(ESS2013) 2013 104 - 113 2013.10

CiNii
OSAR API v2.1: Extensions for an Advanced Accelerator Control Scheme to a Low-Power Multicore API

Keiji Kimura, Cecilia Gonzales-Alvarez, Akihiro Hayashi, Hiroki Mikami, Mamoru Shimaoka, Jun Shirako, Hironori Kasahara

17th Workshop on Compilers for Parallel Computing (CPC2013) 2013.07 [Refereed]

Authorship：Lead author
Automatic Parallelization of Hand Written Automotive Engine Control Codes Using OSCAR Compiler

Dan Umeda, Yohei Kanehagi, Hiroki Mikami, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

17th Workshop on Compilers for Parallel Computing (CPC2013) 2013.07 [Refereed]
Evaluation of power consumption at execution of multiple automatically parallelized and power controlled media applications on the RP2 low-power multicore

Hiroki Mikami, Shumpei Kitaki, Masayoshi Mase, Akihiro Hayashi, Mamoru Shimaoka, Keiji Kimura, Masato Edahiro, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7146 31 - 45 2013

　View Summary

This paper evaluates an automatic power reduction scheme of OSCAR automatic parallelizing compiler having power reduction control capability when multiple media applications parallelized by the OSCAR compiler are executed simultaneously on RP2, a 8-core multicore processor developed by Renesas Electronics, Hitachi, and Waseda University. OSCAR compiler enables the hierarchical multigrain parallel processing and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating and power gating for each processor core using the OSCAR multi-platform API. The RP2 has eight SH4A processor cores, each of which has power control mechanisms such as DVFS, clock gating and power gating. First, multiple applications with relatively light computational load are executed simultaneously on the RP2. The average power consumption of power controlled eight AAC encoder programs, each of which was executed on one processor, was reduced by 47%, (to 1.01W), against one AAC encoder execution on one processor (from 1.89W) without power control. Second, when multiple intermediate computational load applications are executed, the power consumptions of an AAC encoder executed on four processors with the power reduction control was reduced by 57% (to 0.84W) against an AAC encoder execution on one processor (from 1.95W). Power consumptions of one MPEG2 decoder on four processors with power reduction control was reduced by 49% (to 1.01W) against one MPEG2 decoder execution on one processor (from 1.99W). Finally, when a combination of a high computational load application program and an intermediate computational load application program are executed simultaneously, the consumed power reduced by 21% by using twice number of cores for each application. This paper confirmed parallel processing and power reduction by OSCAR compiler are efficient for multiple application executions. In execution of multiple light computational load applications, power consumption increases only 12% for one application. Parallel processing being applied to intermediate computational load applications, power consumption of executing one application on one processor core (1.49W) is almost same power consumption of two applications on eight processor cores (1.46W). © 2013 Springer-Verlag.

DOI

Scopus

1

Citation

(Scopus)
Automatic Design Exploration Framework for Multicores with Reconfigurable Accelerators

Cecilia Gonzalez-Alvarez, Haruku Ishikawa, Akihiro Hayashi, Daniel Jimenez-Gonzalez, Carlos Alvarez, Keiji Kimura, Hironori Kasahara

th Workshop on Reconfigurable Computing (WRC) 2013, held in conjuction with HiPEAC conference 2013 2013.01 [Refereed]
Parallelization of Automotive Engine Control Software On Embedded Multi-core Processor Using OSCAR Compiler

Yohei Kanehagi, Dan Umeda, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

2013 IEEE COOL CHIPS XVI (COOL CHIPS) 2013 [Refereed]
Automatic Parallelization, Performance Predictability and Power Control for Mobile-Applications

Dominic Hillenbrand, Akihiro Hayashi, Hideo Yamamoto, Keiji Kimura, Hironori Kasahara

2013 IEEE COOL CHIPS XVI (COOL CHIPS) 2013 [Refereed]

　View Summary

Currently few mobile applications exploit the power- and performance capabilities of multi-core architectures. As the number of cores increases, the challenges become more pressing. We picked three challenges: application parallelization, performance-predictability/portability and power control for mobile devices. We tackled the challenges with our auto-parallelizing compiler and operating system enhancements.
Reconciling application power control and operating systems for optimal power and performance

Dominic Hillenbrand, Yuuki Furuyama, Akihiro Hayashi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

2013 8th International Workshop on Reconfigurable and Communication-Centric Systems-on-Chip, ReCoSoC 2013 2013

　View Summary

In the age of dark silicon on-chip power control is a necessity. Upcoming and state of the art embedded- and cloud computer system-on-chips (SoCs) already provide interfaces for fine grained power control. Sometimes both: core- and interconnect-voltage and frequency can be scaled for example. To further reduce power consumption SoCs often have specialized accelerators. Due to the rising specialization of hard- and software general purpose operating systems require changes to exploit the power saving opportunities provided by the hardware. However, they lack detailed hardware- and application-level-information. Application-level power control in turn is still very uncommon and difficult to realize. Now a days vendors of mobile devices are forced to tweak and patch system-level software to enhance the power efficiency of each individual product. This manual process is time consuming and must be re-iterated for each new product. In this paper we explore the opportunities and challenges of automatic application- level power control using compilers. © 2013 IEEE.

DOI

Scopus

4

Citation

(Scopus)
組込マルチコア用OSCAR APIを用いたTILEPro64上でのマルチメディアアプリケーションの並列処理

岸本耀平, 見神広紀, 中野恵一, 林明宏, 木村啓二, 笠原博徳

組み込みシステムシンポジウム(ESS2012) 2012 22 - 30 2012.10

CiNii
OSCAR Parallelizing Compiler and API for Real-time Low Power Heterogeneous Multicores

kihiro Hayashi, Mamoru Shimaoka, Hiroki Mikami, Masayoshi Mase, Yasutaka Wada, Jun Shirako, Keiji Kimura, Hironori Kasahara

6th Workshop on Compilers for Parallel Computing(CPC2012) 5 ( 1 ) 68 - 79 2012.01 [Refereed]

CiNii
重粒子線がん治療用線量計算エンジンの自動並列化

林明宏, 松本卓司, 見神広紀, 木村啓二, 山本啓二, 崎浩典, 高谷保行, 笠原博徳

HPCS2012 - ハイパフォーマンスコンピューティングと計算科学シンポジウム 2012 135 - 143 2012.01

CiNii
Enhancing the Performance of a Multiplayer Game by Using a Parallelizing Compiler

Yasir I. M. Al-Dosary, Keiji Kimura, Hironori Kasahara, Seinosuke Narita

2012 17TH INTERNATIONAL CONFERENCE ON COMPUTER GAMES (CGAMES) 67 - 75 2012 [Refereed]

　View Summary

Video Games have been a very popular form of digital entertainment in recent years. They have been delivered in state of the art technologies that include multi-core processors that are known to be the leading contributor in enhancing the performance of computer applications. Since parallel programming is a difficult technology to implement, that field in Video Games is still rich with areas for advancements. This paper investigates performance enhancement in Video Games when using parallelizing compilers and the difficulties involved in achieving that. This experiment conducts several stages in attempting to parallelize a well-renowned sequentially written Video Game called ioquake3. First, the Game is profiled for discovering bottlenecks, then examined by hand on how much parallelism could be extracted from those bottlenecks, and what sort of hazards exist in delivering a parallel-friendly version of ioquake3. Then, the Game code is rewritten into a hazard-free version while also modified to comply with the Parallelizable-C rules, which crucially aid parallelizing compilers in extracting parallelism. Next, the program is compiled using a parallelizing compiler called OSCAR (Optimally Scheduled Advanced Multiprocessor) to produce a parallel version of ioquake3. Finally, the performance of the newly produced parallel version of ioquake3 on a Multi-core platform is analyzed.
The following is found: (1) the parallelized game by the compiler from the revised sequential program of the game is found to achieve a 5.1 faster performance at 8-threads than original one on an IBM Power 5+ machine that is equipped with 8-cores, and (2) hazards are caused by thread contentions over globally shared data, and as well as thread private data, and (3) AI driven players are represented very similarly to Human players inside ioquake3 engine, which gives an estimation of the costs for parallelizing Human driven sessions, and (4) 70% of the costs of the experiment is spent in analyzing ioquake3 code, 30% in implementing the changes in the code.
Parallelizing Compiler Framework and API for Heterogeneous Multicores

Akihiro Hayashi, Yasutaka Wada, Takeshi Watanabe, Takeshi Sekiguchi, Masayoshi Mase, Jun Shirako, Keiji Kimura, Hironori Kasahara

IPSJ Transactions on Advanced Computing Systems 5 ( 1 ) 68 - 79 2011.11 [Refereed]
A 45-nm 37.3 GOPS/W Heterogeneous Multi-Core SOC with 16/32 Bit Instruction-Set General-Purpose Core

Osamu Nishii, Yoichi Yuyama, Masayuki Ito, Yoshikazu Kiyoshige, Yusuke Nitta, Makoto Ishikawa, Tetsuya Yamada, Junichi Miyakoshi, Yasutaka Wada, Keiji Kimura, Hironori Kasahara, Hideo Maejima

IEICE TRANSACTIONS ON ELECTRONICS E94C ( 4 ) 663 - 669 2011.04 [Refereed]

　View Summary

We built a 12.4 mm x 12.4 mm, 45-nm CMOS, chip that integrates eight 648-MHz general purpose cores, two matrix processor (MX-2) cores, four flexible engine (FE) cores and media IP (VPU5) to establish heterogeneous multi-core chip architecture. The general purpose core had its IPC (instructions per cycle) performance enhanced by adding 32-bit instructions to the existing 16-bit fixed-length instruction set and executing up to two 32-bit instructions per cycle. Considering these five-to-seven years of embedded LSI and increasing trend of access-master within LSI, we predict that the memory usage of single core will not exceed 32-bit physical area (i.e. 4 GB), but chip-total memory usage will exceed 4 GB. Based on this prediction, the physical address was expanded from 32-bit to 40-bit. The fabricated chip was tested and a parallel operation of eight general purpose cores and four FE cores and eight data transfer units (DTU) is obtained on AAC (Advanced Audio Coding) encode processing.

DOI

Scopus
Parallelizing Compiler Framework and API for Power Reduction and Software Productivity of Real-Time Heterogeneous Multicores

Akihiro Hayashi, Yasutaka Wada, Takeshi Watanabe, Takeshi Sekiguchi, Masayoshi Mase, Jun Shirako, Keiji Kimura, Hironori Kasahara

LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING 6548 184 - 198 2011 [Refereed]

　View Summary

Heterogeneous multicores have been attracting much attention to attain high performance keeping power consumption low in wide spread of areas. However, heterogeneous multicores force programmers very difficult programming. The long application program development period lowers product competitiveness. In order to overcome such a situation, this paper proposes a compilation framework which bridges a gap between programmers and heterogeneous multicores. In particular, this paper describes the compilation framework based on OSCAR compiler. It realizes coarse grain task parallel processing, data transfer using a DMA controller, power reduction control from user programs with DVFS and clock gating on various heterogeneous multicores from different vendors. This paper also evaluates processing performance and the power reduction by the proposed framework on a newly developed 15 core heterogeneous multicore chip named RP-X integrating 8 general purpose processor cores and 3 types of accelerator cores which was developed by Renesas Electronics, Hitachi, Tokyo Institute of Technology and Waseda University. The framework attains speedups up to 32x for an optical flow program with eight general purpose processor cores and four DRP(Dynamically Reconfigurable Processor) accelerator cores against sequential execution by a single processor core and 80% of power reduction for the real-time AAC encoding.
A parallelizing compiler cooperative heterogeneous multicore processor architecture

Yasutaka Wada, Akihiro Hayashi, Takeshi Masuura, Jun Shirako, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6760 215 - 233 2011

　View Summary

Heterogeneous multicore architectures, integrating several kinds of accelerator cores in addition to general purpose processor cores, have been attracting much attention to realize high performance with low power consumption. To attain effective high performance, high application software productivity, and low power consumption on heterogeneous multicores, cooperation between an architecture and a parallelizing compiler is important. This paper proposes a compiler cooperative heterogeneous multicore architecture and parallelizing compilation scheme for it. Performance of the proposed scheme is evaluated on the heterogeneous multicore integrating Hitachi and Renesas' SH4A processor cores and Hitachi's FE-GA accelerator cores, using an MP3 encoder. The heterogeneous multicore gives us 14.34 times speedup with two SH4As and two FE-GAs, and 26.05 times speedup with four SH4As and four FE-GAs against sequential execution with a single SH4A. The cooperation between the heterogeneous multicore architecture and the parallelizing compiler enables to achieve high performance in a short development period. © 2011 Springer-Verlag Berlin Heidelberg.

DOI
Parallelizable C and Its Performance on Low Power High Performance Multicore Processors

Masayoshi Mase, Yuto Onozaki, Keiji Kimura, Hironori Kasahara

Proc. of 15th Workshop on Compilers for Parallel Computing (CPC 2010) 2010.07 [Refereed]

CiNii
Element-Sensitive Pointer Analysis for Automatic Parallelization

Masayoshi Mase, Yuta Murata, Keiji Kimura, Hironori Kasahara

IPSJ Transactions on Programming (PRO) 3 ( 2 ) 36 - 47 2010.03 [Refereed]
A 45nm 37.3GOPS/W heterogeneous multi-core SoC

Yoichi Yuyama, Masayuki Ito, Yoshikazu Kiyoshige, Yusuke Nitta, Shigezumi Matsui, Osamu Nishii, Atsushi Hasegawa, Makoto Ishikawa, Tetsuya Yamada, Junichi Miyakoshi, Koichi Terada, Tohru Nojiri, Makoto Satoh, Hiroyuki Mizuno, Kunio Uchiyama, Yasutaka Wada, Keiji Kimura, Hironori Kasahara, Hideo Maejima

Digest of Technical Papers - IEEE International Solid-State Circuits Conference 53 100 - 101 2010

　View Summary

We develop a heterogeneous multi-core SoC for applications, such as digital TV systems with IP networks (IP-TV) including image recognition and database search. Figure 5.3.1 shows the chip features. This SoC is capable of decoding 1080i audio/video data using a part of SoC (one general-purpose CPU core, video processing unit called VPU5 and sound processing unit called SPU) [1]. Four dynamically reconfigurable processors called FE [2] are integrated and have a total theoretical performance of 41.5GOPS and power consumption of 0.76W. Two 1024-way matrix-processors called MX-2 [3] are integrated and have a total theoretical performance of 36.9GOPS and power consumption of 1.10W. Overall, the performance per watt of our SoC is 37.3GOPS/W at 1.15V, the highest among comparable processors [4-6] excluding special-purpose codecs. The operation granularity of the CPU, FE and MX-2 are 32bit, 16bit, and 4bit respectively, and thus we can assign the appropriate processor for each task in an effective manner. A heterogeneous multi-core approach is one of the most promising approaches to attain high performance with low frequency, or low power, for consumer electronics application and scientific applications, compared to homogeneous multi-core SoCs [4]. For example, for image-recognition application in the IP-TV system, the FEs are assigned to calculate optical flow operation [7] of VGA (640x480) size video data at 15fps, which requires 0.62GOPS. The MX-2s are used for face detection and calculation of the feature quantity of the VGA video data at 15fps, which requires 30.6GOPS. In addition, general-purpose CPU cores are used for database search using the results of the above operations, which requires further enhancement of CPU. The automatic parallelization compilers analyze parallelism of the data flow, generate coarse grain tasks, schedule tasks to minimize execution time considering data transfer overhead for general-purpose CPU and FE. ©2010 IEEE.

DOI

Scopus

33

Citation

(Scopus)
OSCAR API for Real-Time Low-Power Multicores and Its Performance on Multicores and SMP Servers

Keiji Kimura, Masayoshi Mase, Hiroki Mikami, Takamichi Miyamoto, Jun Shirako, Hironori Kasahara

LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING 5898 188 - 202 2010 [Refereed]

Authorship：Lead author

　View Summary

OSCAR (Optimally Scheduled Advanced Multiprocessor) API has been designed for real-time embedded low-power multicores to generate parallel programs for various multicores from different vendors by using the OSCAR parallelizing compiler. The OSCAR API has been developed by Waseda University in collaboration with Fujitsu Laboratory, Hitachi, NEC, Panasonic, Renesas Technology, and Toshiba in an METI/NEDO project entitled "Multicore Technology for Realtime Consumer Electronics." By using the OSCAR API as an interface between the OSCAR compiler and backend compilers, the OSCAR compiler enables hierarchical multigrain parallel processing with memory optimization under capacity restriction for cache memory, local memory, distributed shared memory, and on-chip/off-chip shared memory; data transfer using a DMA controller; and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating for various embedded multicores. In addition, a parallelized program automatically generated by the OSCAR, compiler with OSCAR API can be compiled by the ordinary OpenMP compilers since the OSCAR API is designed on a subset of the OpenMP. This paper describes the OSCAR API and its compatibility with the OSCAR compiler by showing code examples. Performance evaluations of the OSCAR compiler and the OSCAR. API are carried out using an IBM Power5+ workstation, an IBM Power6 high-end SMP server, and a newly developed consumer electronics multicore chip RP2 by Renesas, Hitachi and Waseda. From the results of scalability evaluation, it is found that on an average, the OSCAR compiler with the OSCAR API can exploit 5.8 times speedup over the sequential execution on the Power5+ workstation with eight cores and 2.9 times speedup on RP2 with four cores, respectively. In addition, the OSCAR compiler can accelerate an IBM XL Fortran compiler up to 3.3 times on the Power6 SMP server. Due to low-power optimization on RP2, the OSCAR compiler with the OSCAR API achieves a maximum power reduction of 84% in the real-time execution mode.
A Power Reduction Scheme of Parallelizing Compiler Using OSCAR API on Multicore Processor

Masayoshi Mase, Ryo Nakagawa, Naoto Ohkuni, Jun Shirako, Keiji Kimura, Hironori Kasahara

IPSJ Transactions on Advanced Computing Systems 2 ( 3 ) 96 - 106 2009.09 [Refereed]
マルチコア上でのOSCAR APIを用いた並列化コンパイラによる低消費電力化手法

中川亮, 間瀬正啓, 大國直人, 白子準, 木村啓二, 笠原博徳

先進的計算基盤システムシンポジウム(SACSIS2009) 3 - 10 2009.05
Performance of OSCAR Multigrain Parallelizing Compiler on Multicore Processors

Hiroki Mikami, Jun Shirako, Masayoshi Mase, Takamichi Miyamoto, Hirofumi Nakano, Fumiyo Takano, Akihiro Hayashi, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

Proc. of 14th Workshop on Compilers for Parallel Computing(CPC 2009) 2009.01 [Refereed]
Green multicore-SoC software-execution framework with timely-power-gating scheme

Masafumi Onouchi, Keisuke Toyama, Toru Nojiri, Makoto Sato, Masayoshi Mase, Jun Shirako, Mikiko Sato, Masashi Takada, Masayuki Ito, Hiroyuki Mizuno, Mitaro Namiki, Keiji Kimura, Hironori Kasahara

Proceedings of the International Conference on Parallel Processing 510 - 517 2009

　View Summary

We are developing a software-execution framework based on an octo-core chip multiprocessor named RP2 and an automatic multigrain-parallelizing compiler named OSCAR. The main purpose of this framework is to maintain good speed scalability and power efficiency over the number of processor cores under severe hardware restrictions for embedded use. Key to the speed scalability is reduction of a communication overhead with parallelized tasks. A data-categorization scheme enables small-overhead cache-coherency maintenance by using directives and instructions from the compiler. In this scheme, the number of cache-flushing time is minimized and parallelized tasks are quickly synchronized by using flags in local memory. As regards power efficiency, to reduce power consumption, power supply to processor cores waiting for other cores is timely and frequently cut off, even in the middle of an application, by using a timelypower- gating scheme. In this scheme, to achieve quick mode transition between "NORMAL" mode and "RESUME POWEROFF" mode, register values of the processor core are stored in core-local memory, which is active even in "RESUME POWEROFF" mode and can be accessed in one or two clock cycles. Measured speed and power of an application show good speed scalability in execution time and high power efficiency, simultaneously. In the case of a secure AAC-LC encoding program, execution speed when eight processor cores are used can be increased by 4.85 times compared to that of sequential execution. Moreover, power consumption under the same condition can be reduced by 51.0% by parallelizing and timely-power gating. The time for mode transition is less than 20 μsec, which is only 2.5% of the "RESUME POWER-OFF" period. © 2009 IEEE.

DOI

Scopus

1

Citation

(Scopus)
An Evaluation of Parallelization with Automatic Parallelizing Compiler Generating Consumer Electronics Multicore API

Takamichi Miyamoto, Saori Asaka, Hiroki Mikami, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

IPSJ Transactions on Advanced Computing Systems 1 ( 3 ) 83 - 95 2008.12 [Refereed]

CiNii
Parallelizing Compiler Cooperative Heterogeneous Multicore

Yasutaka Wada, Akihiro Hayashi, Takeshi Masuura, Jun Shirako, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Proc. of Workshop on Software and Hardware Challenges of Manycore Platforms (SHCMP 2008) 2008.06 [Refereed]
Parallelization of MP3 Encoder using Static Scheduling on a Heterogeneous Multicore

Yasutaka Wada, Akihiro Hayashi, Takeshi Masuura, Jun Shirako, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

Trans. of IPSJ on Computing Systems 1 ( 1 ) 105 - 119 2008.06 [Refereed]

CiNii
情報家電用マルチコア上におけるマルチメディア処理のコンパイラによる並列化

宮本孝道, 浅香沙織, 見神広紀, 間瀬正啓, 木村啓二, 笠原博徳

SACSIS2008 - 先進的計算基盤システムシンポジウム 2008.05

CiNii
Power-aware compiler controllable chip multiprocessor

Hiroaki Shikano, Jun Shirako, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

IEICE TRANSACTIONS ON ELECTRONICS E91C ( 4 ) 432 - 439 2008.04 [Refereed]

　View Summary

A power-aware compiler controllable chip multiprocessor (CMP) is presented and its performance and power consumption are evaluated with the optimally scheduled advanced multiprocessor (OSCAR) parallelizing compiler. The CMP is equipped with power control registers that change clock frequency and power supply voltage to functional units including processor cores, memories, and an interconnection network. The OSCAR compiler carries out coarse-grain task parallelization of programs and reduces power consumption using architectural power control support and the compiler's power saving scheme. The performance evaluation shows that MPEG-2 encoding on the proposed CMP with four CPUs results in 82.6% power reduction in real-time execution mode with a deadline constraint on its sequential execution time. Furthermore, MP3 encoding on a heterogeneous CMP with four CPUs and four accelerators results in 53.9% power reduction at 21.1-fold speed-up in performance against its sequential execution in the fastest execution mode.

DOI

Scopus

1

Citation

(Scopus)
Heterogeneous multi-core architecture that enables 54x AAC-LC stereo encoding

Hiroaki Shikano, Masaki Ito, Masafumi Onouchi, Takashi Todaka, Takanobu Tsunoda, Tomoyuki Kodama, Kunio Uchiyama, Toshihiko Odaka, Tatsuya Kamei, Ei Nagahama, Manabu Kusaoke, Yusuke Nitta, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

IEEE JOURNAL OF SOLID-STATE CIRCUITS 43 ( 4 ) 902 - 910 2008.04 [Refereed]

　View Summary

This paper describes a heterogeneous multi-core processor (HMCP) architecture that integrates general-purpose processors (CPUs) and accelerators (ACCs) to achieve exceptional performance as well as low-power consumption for the SoCs of embedded systems. The memory architectures of CPUs and ACCs were unified to improve programming and compiling efficiency. Advanced audio codec-low complexity (AAC-LC) stereo audio encoding was parallelized on a heterogeneous multi-core having homogeneous processor cores and dynamically reconfigurable processor (DRP) ACC cores in a preliminary evaluation of the HMCP architecture. The performance evaluation revealed that 54x AAC encoding was achieved on the chip with two CPUs at 600 MHz and two DRPs at 300 MHz, which achieved encoding of an entire CD within 1-2 min.

DOI

Scopus

16

Citation

(Scopus)
An 8 CPU SoC with Independent Power-off Control of CPUs and Multicore Software Debug Function

Yutaka Yoshida, Masayuki Ito, Kiyoshi Hayase, Tomoichi Hayashi, Osamu Nishii, Toshihiro Hattori, Jun Sakiyama, Masashi Takada, Kunio Uchiyama, Jun Shirako, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

Proc. of IEEE Cool Chips XI: Symposium on Low-Power and High-Speed Chips 2008 2008.04 [Refereed]
A 600MHz SoC with Compiler Power-off Control of 8 CPUs and 8 Onchip-RAMs

Masayuki Ito, Toshihiro Hattori, Yutaka Yoshida, Kiyoshi Hayase, Tomoichi Hayashi, Osamu Nishii, Yoshihiko Yasu, Atsushi Hasegawa, Masashi Takada, Masaki Ito, Hiroyuki Mizuno, Kunio Uchiyama, Toshihiko Odaka, Jun Shirako, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

Proc. of International Solid State Circuits Conference (ISSCC2008) 90 - 91 2008.02 [Refereed]
An 8640 MIPS SoC with independent power-off control of 8 CPUs and 8 RAMs by an automatic parallelizing compiler

Masayuki Ito, Toshihiro Hattori, Yutaka Yoshida, Kiyoshi Hayase, Tomoichi Hayashi, Osamu Nishii, Yoshihiko Yasu, Atsushi Hasegawa, Masashi Takada, Masaki Ito, Hiroyuki Mizuno, Kunio Uchiyama, Toshihiko Odaka, Jun Shirako, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

Digest of Technical Papers - IEEE International Solid-State Circuits Conference 51 81 - 598 2008 [Refereed]

　View Summary

A 104.8mm2 90nm CMOS 600MHz SoC integrates 8 processor cores and 8 user RAMs in 17 separate power domains and delivers 33.6GFLOPS. An automatic parallelizing compiler assigns tasks to each CPU and controls its power mode including power supply in accordance with its processing load and status. The compiler also uses barrier registers to achieve fast and accurate CPU synchronization. ©2008 IEEE.

DOI

Scopus

37

Citation

(Scopus)
Performance evaluation of compiler controlled power saving scheme

Jun Shirako, Munehiro Yoshida, Naoto Oshiyama, Yasutaka Wada, Hirofurni Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

HIGH-PERFORMANCE COMPUTING 4759 480 - 493 2008 [Refereed]

　View Summary

Multicore processors, or chip multiprocessors, which allow us to realize low power consumption, high effective performance, good cost performance and short hardware/software development period, are attracting much attention. In order to achieve full potential of multicore processors, cooperation with a parallelizing compiler is very important. The latest compiler extracts multilevel parallelism, such as coarse grain task parallelism, loop parallelism and near fine grain parallelism, to keep parallel execution efficiency high. It also controls voltage and clock frequency of processors carefully to reduce energy consumption during execution of an application program. This paper evaluates performance of compiler controlled power saving scheme which has been implemented in OSCAR multigrain parallelizing compiler. The developed power saving scheme realizes voltage/frequency control and power shutdown of each processor core during coarse grain task parallel processing. In performance evaluation, when static power is assumed as one-tenth of dynamic power, OSCAR compiler with the power saving scheme achieved 61.2 percent energy reduction for SPEC CFP95 applu without performance degradation on 4 processors and 87.4 percent energy reduction for mpeg2encode, 88.1 percent energy reduction for SPEC CFP95 tomcatv and 84.6 percent energy reduction for applu with real-time deadline constraint on 4 processors.
Software-cooperative power-efficient heterogeneous multi-core for media processing

Hiroaki Shikano, Masaki Ito, Kunio Uchiyama, Toshihiko Odaka, Akihiro Hayashi, Takeshi Masuura, Masayoshi Mase, Jun Shirako, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

2008 ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE, VOLS 1 AND 2 712 - + 2008 [Refereed]

　View Summary

A heterogeneous multi-core processor (HMCP) architecture, which integrates general purpose processors (CPU) and accelerators (ACC) to achieve high-performance as well as low-power consumption with the support of a parallelizing compiler, was developed. The evaluation was performed using an MP3 audio encoder on a simulator that accurately models the HMCP, It showed that 16-frame encoding on the HMCP with four CPUs and four ACCs yielded 24.5-fold speed-up of performance against sequential execution on one CPU. Furthermore, power saving by the compiler reduced energy consumption of the encoding to 0.17 J, namely, by 28.4%.
Power Reduction Controll for Multicores in OSCAR Multigrain Parallelizing Compiler

Jun Shirako, Keiji Kimura, Hironori Kasahara

ISOCC: 2008 INTERNATIONAL SOC DESIGN CONFERENCE, VOLS 1-3 50 - 55 2008 [Refereed]

　View Summary

Multicore processors have become mainstream computer architecture to go beyond the performance and power efficiency limits of single-core processors. To achieve low power consumption and high performance on multicores, parallelizing compilers take on an important role. This paper describes the performance of a compiler-based power reduction scheme cooperating with OSCAR multigrain parallelizing compiler on a newly developed 8-way SH4A low power multicore chip for consumer electronics, which supports DVFS (Dynamic Voltage and Frequency Scaling) and Clock/Power Gating. Using hardware parameters and parallelized program information, OSCAR compiler determines suitable voltage and frequency of each active processor core and appropriate schedule of clock gating and power gating. Performance experiments shows the compiler reduces consumed power by 88.3%, namely from 5.68 W to 0.67 W, for real-time secure AAC Encoding and 73.5%, namely from 5.73 W to 1.52 W, for real-time MPEG2 Decoding on 8 core execution.
Parallelization with Automatic Parallelizing Compiler Generating Consumer Electronics Multicore API

Takamichi Miyamoto, Saori Asaka, Hiroki Mikami, Masayoshi Mase, Yasutaka Wada, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

PROCEEDINGS OF THE 2008 INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING WITH APPLICATIONS 600 - 607 2008 [Refereed]

　View Summary

Multicore processors have been adopted for consumer electronics like portable electronics, mobile phones, car navigation systems, digital TVs and games to obtain high performance with low power consumption. The OSCAR automatic parallelizing compiler has been developed to utilize these multicores easily. Also, a new Consumer Electronics Multicore Application Program Interface (API) to use the OSCAR compiler with native sequential compilers for various kinds of multicores from different vendors has been developed in NEDO (New Energy and Industrial Technology Development Organization) "Multicore Technology for Realtime Consumer Electronics" project with Japanese 6 IT companies. This paper evaluates the parallel processing performance of multimedia applications using this API by the OSCAR compiler on the FR1000 4 VLIW cores multicore processor developed by Fujitsu Ltd, and the RP1 4 SH-4A cores multicore processor jointly-developed by Renesas Technology Corp., Hitachi Ltd. and Waseda University. As the results, the parallel codes generated by the OSCAR compiler using the API give us 3.27 times speedup on average using 4 cores against 1 core on the FR1000 multicore, and 3.31 times speedup on average using 4 cores against 1 core on the RP1 multicore.

DOI

Scopus

6

Citation

(Scopus)
情報家電用マルチコアSMP実行モードにおける制約付きCプログラムのマルチグレイン並列化

間瀬正啓, 馬場大介, 長山晴美, 田野裕秋, 益浦健, 宮本孝道, 白子準, 中野啓史, 木村啓二, 笠原博徳

組込みシステムシンポジウム2007 2007.10

CiNii
Performance Evaluation of MP3 Audio Encoder on OSCAR Heterogeneous Chip Multicore Processor

Hiroaki Shikano, Yuki Suzuki, Yasutaka Wada, Jun Shirako, Keiji Kimura, Hioronori Kasahara

Trans. of IPSJ on Computing Systems Vol. 48, No. SIG8(ACS18), 141 - 152 2007.05 [Refereed]
Power-aware compiler controllable chip multiprocessor

Hiroaki Shikano, Jun Shirako, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT 427 2007 [Refereed]

DOI

Scopus

1

Citation

(Scopus)
A 4320MIPS four-processor core SMP/AMP with individually managed clock frequency for low power consumption

Yutaka Yoshida, Tatsuya Kamei, Kiyoshi Hayase, Shinichi Shibahara, Osamu Nishii, Toshihiro Hattori, Atsushi Hasegawa, Masashi Takada, Naohiko Irie, Kunio Uchiyama, Toshihiko Odaka, Kiwamu Takada, Keiji Kimura, Hironori Kasahara

Digest of Technical Papers - IEEE International Solid-State Circuits Conference 95 - 590 2007

　View Summary

A 4320MIPS four-core SoC that supports both SMP and AMP for embedded applications is designed in 90nm CMOS. Each processor-core can be operated with a different frequency dynamically including clock stop, while keeping data cache coherency, to maintain maximum processing performance and to reduce average operating power. The 97.6mm2 die achieves a floating-point performance of 16.8GFLOPS. © 2007 IEEE.

DOI

Scopus

26

Citation

(Scopus)
Heterogeneous multiprocessor on a chip which enables 54x AAC-LC stereo encoding

Masaki Ito, Takashi Todaka, Takanobu Tsunoda, Hiroshi Tanaka, Tomoyuki Kodama, Hiroaki Shikano, Masafumi Onouchi, Kunio Uchiyama, Toshihiko Odaka, Tatsuya Kamei, Ei Nagahama, Manabu Kusaoke, Yusuke Nitta, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

2007 Symposium on VLSI Circuits, Digest of Technical Papers 18 - 19 2007 [Refereed]

　View Summary

A heterogeneous multiprocessor on a chip has been designed and implemented. It consists of 2 CPUs and 2 DRPs (Dynamic Reconfigurable Processors). The design of DRP was intended to achieve high-performance in a small area to be integrated on a SoC for embedded systems. Memory architecture of CPUs and DRPs were unified to improve programming and compiling efficiency. 54x AAC-LC stereo encoding has been enabled with 2 DRPs at 300MHz and 2 CPUs at 600MHz.
Compiler Control Power Saving Scheme for Multicore Processors

Jun Shirako, Munehiro Yoshida, Naoto Oshiyama, Yasutaka Wada, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Trans. of IPSJ on Computing Systems Vol. 47(ACS15) 2006.09 [Refereed]
マルチコアプロセッサにおけるコンパイラ制御低消費電力化手法

白子準, 吉田宗広, 押山直人, 和田康孝, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

先進的計算基盤システムシンポジウム(SACSIS2006) ( 467 ) 476 2006.05

CiNii
Performance Evaluation of Heterogeneous Chip Multi-Processor with MP3 Audio Encoder

Hiroaki Shikano, Yuki Suzuki, Yasutaka Wada, Jun Shirako, Keiji Kimura, Hironori Kasahara

Proc. of IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips IX) 349 - 363 2006.05 [Refereed]

CiNii
Programing for Multicore Systems

Keiji Kimura, Hironori Kasahara

IPSJ MAGAZINE 47 ( 1 ) 17 - 23 2006.01 [Invited]

Authorship：Lead author
Multicores Emerge as Next Generation Microprocessors

Hironori Kasahara, Keiji Kimura

IPSJ MAGAZINE 47 ( 1 ) 10 - 16 2006.01 [Refereed]
Parallelizing Compilation Scheme for Reduction of Power Consumption of Chip Multiprocessors

Jun Shirako, Naoto Oshiyama, Yasutaka Wada, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Proc. of 12th Workshop on Compilers for Parallel Computers (CPC 2006), 2006.01 [Refereed]
Compiler control power saving scheme for multi core processors

Jun Shirako, Naoto Oshiyama, Yasutaka Wada, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 4339 362 - 376 2006

　View Summary

With the increase of transistors integrated onto a chip, multi core processor architectures have attracted much attention to achieve high effective performance, shorten development period and reduce the power consumption. To this end, the compiler for a multi core processor is expected not only to parallelize program effectively, but also to control the voltage and clock frequency of processors and storages carefully inside an application program. This paper proposes a compilation scheme for reduction of power consumption under the multigrain parallel processing environment that controls Voltage/Frequency and power supply of each processor core on a chip. In the evaluation, the OSCAR compiler with the proposed scheme achieves 60.7 percent energy savings for SPEC CFP95 applu without performance degradation on 4 processors, and 45.4 percent energy savings for SPEC CFP95 tomcatv with real-time deadline constraint on 4 processors, and 46.5 percent energy savings for SPEC CFP95 swim with the deadline constraint on 4 processors. © 2006 Springer-Verlag Berlin Heidelberg.

DOI

Scopus

19

Citation

(Scopus)
マルチコアプロセッサ上でのデータローカライゼーション

中野啓文, 浅野尚一郎, 内藤陽介, 仁藤拓実, 田川友博, 宮本孝道, 小高剛, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2005-165-10 2005.12
arallel Processing of MPEG2 Encoding on a Chip Multiprocessor Architecture

Takeshi Kodaka, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

Trans. of IPSJ 46 ( 9 ) 2311 - 2325 2005.09 [Refereed]
ホモジニアスマルチコアにおけるコンパイラ制御低消費電力化手法

白子準, 押山直人, 和田康孝, 鹿野裕明, 木村啓二, 笠原博徳

情報処理学会研究報告 ARC2005-164-10 (SWoPP205) 2005.08
Performance of OSCAR multigrain parallelizing compiler on SMP servers

K Ishizaka, T Miyamoto, J Shirako, M Obata, K Kimura, H Kasahara

LANGUAGES AND COMPILERS FOR HIGH PERFORMANCE COMPUTING 3602 319 - 331 2005 [Refereed]

　View Summary

This paper describes performance of OSCAR multigrain parallelizing compiler on various SMP servers, such as IBM pSeries 690, Sun Fire V880, Sun Ultra 80, NEC TX7/i6010 and SGI Altix 3700. The OSCAR compiler hierarchically exploits the coarse grain task parallelism among loops, subroutines and basic blocks and the near fine grain parallelism among statements inside a basic block in addition to the loop parallelism. Also, it allows us global cache optimization over different loops, or coarse grain tasks, based on data localization technique with interarray padding to reduce memory access overhead. Current performance of OSCAR compiler is evaluated on the above SMP servers. For example, the OSCAR compiler generating OpenMP parallelized programs from ordinary sequential Fortran programs gives us 5.7 times speedup, in the average of seven programs, such as SPEC CFP95 tomcatv, swim, su2cor, hydro2d, mgrid, applu and turb3d, compared with IBM XL Fortran compiler 8.1 on IBM pSeries 690 24 processors SMP server. Also, it gives us 2.6 times speedup compare with Intel Fortran Itanium Compiler 7.1 on SGI Altix 3700 Itanium 2 16 processors server, 1.7 times speedup compared with NEC Fortran Itanium Compiler 3.4 on NEC TX7/i6010 Itanium 2 8 processors server, 2.5 times speedup compared with Sun Forte 7.0 on Sun Ultra 80 UltraSPARC II4 processors desktop work-station, and 2.1 times speedup compare with Sun Forte compiler 7.1 on Sun Fire V880 UltraSPARC III Cu 8 processors server.
Multigrain parallel processing on compiler cooperative chip multiprocessor

K Kimura, Y Wada, H Nakano, T Kodaka, J Shirako, K Ishizaka, H Kasahara

9TH ANNUAL WORKSHOP ON INTERACTION BETWEEN COMPILERS AND COMPUTER ARCHITECTURES, PROCEEDINGS 11 - 20 2005 [Refereed]

Authorship：Lead author

　View Summary

This paper describes multigrain parallel processing on a compiler cooperative chip multiprocessor The multigrain parallel processing hierarchically exploits multiple grains of parallelism such as coarse grain task parallelism, loop iteration level parallelism and statement level near-fine grain parallelism. The chip multiprocessor has been designed to attain high effective peformance, cost effectiveness and high software productivity by supporting the optimizations of the multigrain parallelizing compiler, which is developed by Japanese Millennium Project IT21 "Advance Parallelizing Compiler". To achieve full potential of multigrain parallel processing, the chip multiprocessor integrates simple single-issue processors having distributed shared data memory for both optimal use of data locality and scalar data transfer local data memory for processor private data, in addition to centralized shared memory for shared data among processors. This paper focuses on the scalability of the chip multiprocessor having up to eight processors on a chip by exploiting of the multigrain parallelism from SPECfp95 programs. When microSPARC like the simple processor core is used under assumption of 90 nm technology and 2.8 GHz, the evaluation results show the speedups for eight processors and four processors reach 7.1 and 3.9, respectively. Similarly, when 400 MHz is assumed for embedded usage, the speedups reach 7.8 and 4.0, respectively.
Memory management for data localization on OSCAR chip multiprocessor

H Nakano, T Kodaka, K Kimura, H Kasahara

INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS, PROCEEDINGS 82 - 88 2004 [Refereed]

　View Summary

Chip Multiprocessor (CMP) architecture has attracting much attention as a next-generation microprocessor architecture and many kinds of CMP are widely being researched. However, CMP architectures several difficulties for effective use of memory, especially cache or local memory near a processor core. The authors have proposed OSCAR CMP architecture, which cooperatively works with multigrain parallelizing compiler which gives us much higher parallelism than instruction level parallelism or loop level parallelism and high productivity of application programs. To support the compiler optimization for effective use of cache or local memory, OSCAR CMP has local data memory (LDM) for processor private data and distributed shared memory (DSM) for synchronization and fine grain data transfers among processors, in addition to centralized shared memory (CSM) to support dynamic task scheduling. This paper proposes a static coarse grain task scheduling scheme for data localization using live variable analysis. Furthermore, remote memory data transfer scheduling scheme using information of live variable analysis is also described. The proposed scheme is implemented on OSCAR FORTRAN multigrain parallelizing compiler and is evaluated on OSCAR CMP using Tomcatv and Swim in SPEC CFP 95 benchmark.
Parallel processing using data localization for MPEG2 encoding on OSCAR chip multiprocessor

T Kodaka, H Nakano, K Kimura, H Kasahara

INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS, PROCEEDINGS 119 - 127 2004 [Refereed]

　View Summary

Currently, many people are enjoying multimedia applications with image and audio processing on PCs, PDAs, mobile phones and so on. With the popularization of the multimedia applications, needs for low cost, low power consumption and high performance processors has been increasing. To this end, chip multiprocessor architectures which allow us to attain scalable performance improvement by using multigrain parallelism are attracting much attention. However, in order to extract higher performance on a chip multiprocessor, more sophisticated software techniques are required, such as decomposing a program into adequate grain of tasks, assigning them onto processors considering parallelism, data locality optimization and so on. This paper describes a parallel processing scheme for MPEG2 encoding using data localization which improve execution efficiency assigning coarse grain tasks sharing same data on a same processor consecutively for a chip multiprocessor. The performance evaluation on OSCAR chip multiprocessor architecture shows that proposed scheme gives us 6.97 times speedup using 8 processors and 10.93 times speedup using 16 processors against sequential execution time respectively. Moreover, the proposed scheme gives us 1.61 times speedup using 8 processors and 2.08 times speedup using 16 processors against loop parallel processing which has been widely used for multiprocessor systems using the same number of processors.
Static coarse grain task scheduling with cache optimization using OpenMP

H Nakano, K Ishizaka, M Obata, K Kimura, H Kasahara

INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING 31 ( 3 ) 211 - 223 2003.06 [Refereed]

　View Summary

Effective use of cache memory is getting more important with increasing gap between the processor speed and memory access speed. Also, use of multigrain parallelism is getting more important to improve effective performance beyond the limitation of loop iteration level parallelism. Considering these factors, this paper proposes a coarse grain task static scheduling scheme considering cache optimization. The proposed scheme schedules coarse grain tasks to threads so that shared data among coarse grain tasks can be passed via cache after task and data decomposition considering cache size at compile time. It is implemented on OSCAR Fortran multigrain parallelizing compiler and evaluated on Sun Ultra80 four-processor SMP workstation using Swim and Tomcatv from the SPEC fp 95. As the results, the proposed scheme gives us 4.56 times speedup for Swim and 2.37 times on 4 processors for Tomcatv respectively against the Sun Forte HPC Ver. 6 update 1 loop parallelizing compiler.
Multigrain Parallel Processing on Compiler Cooperative OSCAR Chip Multiprocessor Architecture 'Jointly Worked'

Keiji Kimura, Yasutaka Wada, Hirofumi Nakano, Takeshi Kodaka, Jun Shirako, Kazuhisa Ishizaka, Hironori Kasahara

The IEICE Transactions on Electronics, Special Issue on High-Performance and Low-Power System LSIs and Related Technologies E86-C ( 4 ) 570 - 579 2003.02 [Refereed]

Authorship：Lead author
Multigrain parallel processing on OSCAR CMP

K Kimura, T Kodaka, M Obata, H Kasahara

INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS 56 - 65 2003 [Refereed]

Authorship：Lead author

　View Summary

It seems that Instruction Level Parallelism (ILP) approach, which has been used by various superscalar processors and VLIW processors for a long time, reaches its limitation of performance improvement. To obtain scalable performance improvement, cost effectiveness and high productivity even in the era of one billion transistors, the cooperative work between software and hardware is getting increasingly important. For this reason, the authors have developed OSCAR (Optimally SCheduled Advanced multiprocessoR) Chip Multiprocessor (OSCAR CMP) and OSCAR multigrain compiler simultaneously. To preserve the scalability in the future, OSCAR CMP has mechanisms for efficient use of parallelism and data locality, and for hiding data transfer overhead. These mechanisms can be fully controlled by the OSCAR multigrain compiler In this paper, the authors focus on multigrain parallel processing on OSCAR CMP, which enables us to exploit loop iteration level parallelism and coarse grain task parallelism in addition to ILP from the entire of a program. Performance of multigrain parallel processing on OSCAR CMP architecture is evaluated using SPEC fp 2000195 benchmark suite. When microSPARC like single issue core is used, OSCAR CMP gives us from 1.77 to 3.96 times speedup for four processors against single processor In addition, OSCAR CMP is compared with Sun UltraSPARC II like processor to evaluate cost effectiveness. As a result, OSCAR CMP gives us 1.66 times better performance on the average under the condition that OSCAR CMP and UltraSPARC II are built from almost same number of transistors.
JPEG Encoding Using Multigrain Parallel Processing on a Single Chip Multiprocessor

Takeshi Kodaka, Takayuki Uchida, Keiji Kimura, Hironori Kasahara

Trans. of IPSJ on High Performance Computing Systems 43 ( Sig 6(HPS5) ) 153 - 162 2002.09 [Refereed]

　View Summary

With the recent increase of multimedia contents using JPEG and MPEG, low cost, low power consumption and high performance processors for multimedia application have been expected. Particularly, single chip multiprocessor architectures having simple processor cores that will attain scalability and cost performance are attracting much attention to develop such processors. Single chip multiprocessor architectures allow us to exploit coarse grain task level and loop level parallelism in addition to the instruction level parallelism, so parallel processing technology is indispensable to get us scalable performance improvement. This paper describes a multigrain parallel processing scheme for the JPEG encoding for a single chip multiprocessor and its performance. The evaluation shows an OSCAR type single chip multiprocessor having four single-issue simple processor cores gave us 3.59 times speed-up.

CiNii
シングルチップマルチプロセッサにおける JPEGエンコーディングのマルチグレイン並列処理 (共著)

小高剛, 内田貴之, 木村啓二, 笠原博徳

情報処理学会並列処理シンポジウム(JSPP2002) 2002.05
Static coarse grain task scheduling with cache optimization using openMP

Hirofumi Nakano, Kazuhisa Ishizaka, Motoki Obata, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2327 479 - 489 2002

　View Summary

Effective use of cache memory is getting more important with increasing gap between the processor speed and memory access speed. Also, use of multigrain parallelism is getting more important to improve effective performance beyond the limitation of loop iteration level parallelism. Considering these factors, this paper proposes a coarse grain task static scheduling scheme considering cache optimization. The proposed scheme schedules coarse grain tasks to threads so that shared data among coarse grain tasks can be passed via cache after task and data decomposition considering cache size at compile time. It is implemented on OSCAR Fortran multigrain parallelizing compiler and evaluated on Sun Ultra80 four-processor SMP workstation, using Swim and Tomcatv from the SPEC fp 95. As the results, the proposed scheme gives us 4.56 times speedup for Swim and 2.37 times on 4 processors for Tomcatv respectively against the Sun Forte HPC 6 loop parallelizing compiler. © 2002 Springer Berlin Heidelberg.

DOI

Scopus

2

Citation

(Scopus)
Multigrain parallel processing for JPEG encoding on a single chip multiprocessor

T Kodaka, K Kimura, H Kasahara

INTERNATIONAL WORKSHOP ON INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS 57 - 63 2002 [Refereed]

　View Summary

With the recent increase of multimedia contents using JPEG and MPEG, low cost, low power consumption and high performance processors for multimedia application have been expected. Particularly, single chip multiprocessor architecture having simple processor cores that will attain good scalability and cost effectiveness is attracting much attention. To exploit full performance of single chip multiprocessor architecture, multigrain parallel processing, which exploits coarse grain task parallelism, loop parallelism and instruction level parallelism, is attractive. This paper describes a multigrain parallel processing scheme for the JPEG encoding on a single chip multiprocessor and its performance. The evaluation shows an OSCAR type single chip multiprocessor having four single-issue simple processor cores gave us 3.59 times speed-up against sequential execution time.
Multigrain automatic parallelization in Japanese Millennium Project IT21 Advanced Parallelizing Compiler

H Kasahara, M Obata, K Ishizaka, K Kimura, H Kaminaga, H Nakano, K Nagasawa, A Murai, H Itagaki, J Shirako

PAR ELEC 2002: INTERNATIONAL CONFERENCE ON PARALLEL COMPUTING IN ELECTRICAL ENGINEERING 105 - 111 2002 [Refereed]

　View Summary

This paper describes OSCAR multigrain parallelizing compiler which has been developed in Japanese Millennium Project IT21 "Advanced Parallelizing Compiler" project and its performance on SMP machines. The compiler realizes multigrain parallelization for chip-multiprocessors to high-end servers. It hierarchically exploits coarse grain task parallelism among loops, subroutines and basic blocks and near fine grain parallelism among statements inside a basic block in addition to loop parallelism. Also, it globally optimizes cache use over different loops, or coarse grain tasks, based on data localization technique to reduce memory access overhead Current performance of OSCAR compiler for SPEC95fp is evaluated on different SMPs. For example, it gives us 3.7 times speedup for HYDRO2D, 1.8 times for SWIM, 1.7 times for SU2COR, 2.0 times for MGRID, 3.3 times for TURB3D on 8 processor IBM RS6000, against XL Fortran compiler ver:7.1 and 4.2 times speedup for SWIM and 2.2 times speedup for TURB3D on 4 processor Sun Ultra80 workstation against Forte6 update 2.
Evaluation of Processor Core Architecture for Single Chip Multiprocessor with Near Fine Grain Parallel Processing

K. Kimura, T. Kato, H. Kasahara

Trans. of IPSJ 42 ( 4 ) 692 - 703 2001.04 [Refereed]

Authorship：Lead author
Evaluation of Single Chip Multiprocessor Core Architecture with Near Fine Grain Parallel Processing

Keiji Kimura, Hironori Kasahara

Proc. of International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'01) 2001.01 [Refereed]

Authorship：Lead author
Near fine grain parallel processing using static scheduling on single chip multiprocessors

K Kimura, H Kasahara

INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS 23 - 31 2000 [Refereed]

　View Summary

With the increase of the number of transistors integrated on a chip, efficient use of transistors and scalable improvement of effective performance of a processor are getting important problems. However it has been thought that popular superscalar and VLIW would have difficulty, to obtain scalable improvement of effective performance in future because of the limitation of instruction level parallelism. To cope with this problem, a single chip multiprocessor (SCM) approach,vith multi grain parallelprocessing inside a chip, which hierarchically exploits loop parallelism and coarse grain parallelism among subroutines, loops and basic blocks in addition to instruction level parallelism, is thought one of the most promising approaches. This paper evaluates effectiveness of the single chip multiprocessor architectures with a shared cache, global registers, distributed shared memory and/or local memory for near fine grain parallel processing as the first step of research on SCM architecture to support multi grain parallel processing. The evaluation shows OSCAR (Optimally Scheduled Advanced Multiprocessor architecture having distributed shared memory and local memory in addition to centralized shared memory and attachment of global register gives us significant speed up such as 13.8% to 143.8% for four processors compared with shared cache architecture for applications which have been difficult to extract parallelism effectively.
Near Fine Grain Parallel Processing on Single Chip Multiprocessors

K. Kimura, W. Ogata, M. Okamoto, H. Kasahara

Trans. of IPSJ 40 ( 5 ) 1924 - 1934 1999.05 [Refereed]

Authorship：Lead author
Near fine grain parallel processing using static scheduling on single chip multiprocessors

Keiji Kimura, Hironori Kasahara

Proceedings of the Innovative Architecture for Future Generation High-Performance Processors and Systems 1999- 23 - 31 1999 [Refereed]

Authorship：Lead author

　View Summary

With the increase of the number of transistors integrated on a chip, efficient use of transistors and scalable improvement of effective performance of a processor are getting im-portant problems. However, it has been thought that popular superscalar and VLIW would have difficulty to obtain scalable improvement of effective performance in future because of the limitation of instruction level parallelism. To cope with this problem, a single chip multiprocessor (SCM) approach with multi grain parallel processing inside a chip, which hierarchically exploits loop parallelism and coarse grain parallelism among subroutines, loops and basic blocks in addition to instruction level parallelism, is thought one of the most promising approaches. This paper evaluates effectiveness of the single chip multiprocessor architectures with a shared cache, global registers, distributed shared memory and/or local memory for near fine grain parallel processing as the first step of research on SCM architecture to support multi grain parallel processing. The evaluation shows OSCAR (Optimally Scheduled Advanced Multiprocessor) architecture having distributed shared memory and local memory in addition to centralized shared memory and attachment of global register gives us significant speed up such as 13.8% to 143.8% for four pro-cessors compared with shared cache architecture for applications which have been difficult to extract parallelism effectively.

DOI

Scopus

7

Citation

(Scopus)
OSCAR multi-grain architecture and its evaluation

H Kasahara, W Ogata, K Kimura, G Matsui, H Matsuzaki, M Okamoto, A Yoshida, H Honda

INNOVATIVE ARCHITECTURE FOR FUTURE GENERATION HIGH-PERFORMANCE PROCESSORS AND SYSTEMS, PROCEEDINGS 106 - 115 1998 [Refereed]

　View Summary

OSCAR (Optimally Scheduled Advanced Multiprocessor) was designed to efficiently realize multi-grain parallel processing using static and dynamic scheduling. It is a shared memory multiprocessor system having centralized and distributed shared memories in addition to local memory on each processor with data transfer controller for overlapping of data transfer and task processing. Also, its Fortran multi-grain compiler hierarchically exploits coarse grain parallelism among loops, subroutines and basic blocks, conventional medium grain parallelism among loop-iterations in a Doall loop and near fine grain parallelism among statements. At the coarse grain parallel processing, data localization (automatic data distribution) have been employed to minimize data transfer overhear. In the near fine grain processing of a basic block, explicit synchronization can be removed by use of a clock level accurate code scheduling technique with architectural supports. This paper describes OSCAR's architecture, its compiler and the performance for the multi-grain parallel processing. OSCAR's architecture and compilation technology will be more important in future High Performance Computers and single chip multiprocessors.
Data-Localization among Doall and Sequential Loops in Coarse Grain Parallel Processing

Akimasa Yoshida, Yasushi Ujigawa, Motoki Obata, Keiji Kimura, Hironori Kasahara

Seventh Workshop on Compilers for Parallel Computers Linkoping Sweden 266 - 277 1998.01 [Refereed]
Near Fine Grain Parallel Processing without Explicit Synchronization on a Multiprocessor System

Wataru Ogata, Akimasa Yoshida, Masami Okamoto, Keiji Kimura, Hironori Kasahara

Proc. of Sixth Workshop on Compilers for Parallel Computers (Aachen Germany) 1996.12 [Refereed]

CiNii

▼display all

Presentations

Prototype Implementation of Non-Volatile Memory Support for RISC-V Keystone Enclave

Lena Yu, Yu Omori, Keiji Kimura

Presentation date： 2021.07
Sparse Neural NetworkにおけるSpMMの並列/ベクトル化による高速化

田處雄大, 木村啓二, 笠原博徳

情報処理学会第236回システム・アーキテクチャ・第194回システムとLSIの設計技術・第56回組込みシステム合同研究発表会(ETNET2021)

Presentation date： 2021.03
整合性ツリーおよび暗号化機構を持つ不揮発性メインメモリエミュレータの実装

林知輝, 大森侑, 木村啓二

情報処理学会第236回システム・アーキテクチャ・第194回システムとLSIの設計技術・第56回組込みシステム合同研究発表会(ETNET2021)

Presentation date： 2021.03
OSCARコンパイラによるMATLAB/Simulinkアプリケーションの自動並列化

古山凌, 津村雄太, 川角冬馬, 仲田優哉, 梅田弾, 木村啓二, 笠原博徳

情報処理学会第236回システム・アーキテクチャ・第194回システムとLSIの設計技術・第56回組込みシステム合同研究発表会(ETNET2021)

Presentation date： 2021.03
Linuxが動作可能なRISC-V NVMMエミュレータの実装

大森侑, 木村啓二

情報処理学会第236回システム・アーキテクチャ・第194回システムとLSIの設計技術・第56回組込みシステム合同研究発表会(ETNET2021)

Presentation date： 2021.03
Automatic Vector-Parallelization by Collaboration of Oscar Automatic Parallelizing Compiler and NEC Vectorizing Compiler

Yuta Tadokoro, Hiroki Mikami, Takeo Hosomi, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2020-ARC-240 IPSJ

Presentation date： 2020.03
Consideration of Accelerator Cost Estimation Method in Multi-Target Automatic Parallelizing Compiler

Kazuki Yamamoto, Kazuki Fujita, Tomoya Kashimata, Ken Takahashi, Boma A. Adhi, Toshiaki Kitamura, Akihiro Kawashima, Akira Nodomi, Yuji Mori, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2020-ARC-240 IPSJ

Presentation date： 2020.03
Extensions of OSCAR Compiler for Parallelizing C++ Programs

Toma Kawasumi, Tilman Priesner, Masato Noguchi, Jixin Han, Hiroki Mikami, Takahiro Miyajima, Keishiro Tanaka, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2020-ARC-240 IPSJ

Presentation date： 2020.03
NDCKPT: Transparent Check Pointing Mechanism on Non Volatile Memory by OS

Hikaru Nishida, Keiji Kimura

Technical Report of IEICE, CPSY2019-102 IEICE

Presentation date： 2020.03
Investigation into Acceleration of Matrix-multiply in Homomorphic Encryption

Tetsuya Makita, Teppei Shishido, Yasutaka Wada, Keiji Kimura

Technical Report of IEICE, CPSY2019-96 IEICE

Presentation date： 2020.03
Cascaded DMAC Enabling Efficient Data Transfer for Indirect Memory Access Applications

Keiji Kimura [Invited]

RECS

Presentation date： 2019.11
Automatic parallelizing and vectorizing compiler framework for OSCAR vector multicore processor.

Kazuki Miyamoto, Tetsuya Makita, Ken Takahashi, Tomoya Kashimata, Takumi Kawada, Satoshi Karino, Toshiaki Kitamura, Keiji Kimura, Hironori kasahara

Technical Report of IPSJ, 2018-ARC-230 IPSJ

Presentation date： 2018.03
Automatic Local Memory Management Using Hierarchical Adjustable Block for Multicores and Its Performance Evaluation

Tomoya Shirakawa, Yuto abe, Yoshitake Ooki, Akimasa Yoshida, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2017-ARC-220 IPSJ

Presentation date： 2017.11
A Reproducible Full Computer System emulator

Yuki Shimizu, Mineo Takai, Keiji Kimura

Multimedia, Distributed, Cooperative, and Mobile Symposium(DICOMO 2017) IPSJ

Presentation date： 2017.07
Hierarchical Interconnection Network Extension for Gen 5 Simulator Considering Large Scale Systems

Tatsuya Onoguchi, Ayane Hayashi, Katsuyuki Utaka, Yuichi Matsushima, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ,Vol.2017-ARC-221 IPSJ

Presentation date： 2017.03
Parallel Processing of Automobile Real-time Control on Multicore System with Multiple Clusters

Jin Miyata, Mamoru Shimaoka, Hiroki Mikami, Hirofumi Nishi, Hitoshi Suzuki, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ,Vol.2017-ARC-221 IPSJ

Presentation date： 2017.03
Code Generating Method with Profile Feedback for Reducing Compilation Time of Automatic Parallelizing Compiler

Rina Fujino, Jixin Han, Mamoru Shimaoka, Hiroki Mikami, Takahiro Miyajima, Moriyuki Takamura, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ,Vol.2017-ARC-221 IPSJ

Presentation date： 2017.03
Development of Compilation Flow and Evaluation of OSCAR Vector Multicore Architecture

Ken Takahashi, Satoshi Karino, Kazuki Miyamoto, Takumi Kawata, Tomoya Kashimata, Tetsuya Makita, Toshiaki Kitamura, Keiji Kimura, Hironori Kasahara

Proc. 80th Annual Convention IPSJ IPSJ

Presentation date： 2017.03
FPGA implementation of OSCAR Vector Accelerator

Tomoya Kashimata, Satoshi Karino, Kazuki Miyamoto, Takumi Kawata, Ken Takahashi, Tetsuya Makita, Toshiaki Kitamura, Keiji Kimura, Hironori Kasahara

Proc. 80th Annual Convention IPSJ IPSJ

Presentation date： 2017.03
A Compilation Framework for Multicores having Vector Accelerators using LLVM

Akira Maruoka, Yuya Mushu, Satoshi Karino, Takashi Mochiyama, Toshiaki Kitamura, Sachio Kamiya, Moriyuki Takamura, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ,Vol.2017-ARC-221 IPSJ

Presentation date： 2016.08
Multigrain Parallelization of Program for Medical Image Filtering

Mariko Okumura, Tomoyuki Shibasaki, Kohei Kuwajima, Hiroki Mikami, Keiji Kimura, Kohei Kadoshita, Keiichi Nakano, Hironori Kasahara

IPSJ SIG Technical Report Vol.2016-HPC0153 IPSJ

Presentation date： 2016.03
Automatic Multigrain Parallel Processing for 3D Noise Reduction Using OSCAR Compiler

Tomoyuki Shibasaki, Kohei Kuwajima, Mariko Okumura, Hiroki Mikami, Keiji Kimura, Kohei Kadoshita, Keiichi Nakano, Hironori Kasahara

IPSJ SIG Technical Report Vol.2016-HPC0153 IPSJ

Presentation date： 2016.03
The parallelism abstraction method with a data conversion at analysis in a OSCAR compiler

Naoto Kageura, Tamami Wake, Ji Xin Han, Keiji Kimura, Hironori Kasahara

IPSJ SIG Technical Report Vol.2016-HPC0153 IPSJ

Presentation date： 2016.03
Multicore Local Memory Management Scheme using Data Multidimensional Aligned Decomposition

Kohei Yamamoto, Tomoya Shirakawa, Akimasa Yoshida, Keiji Kimura, Hironori Kasahara

IPSJ SIG Technical Report Vol.2016-SLDM-174 IPSJ

Presentation date： 2016.01
An Evaluation of the Repeatability of Full Computer System Emulation

Daichi Fukui, Teruhiro Mizumoto, Shinsuke Nishimoto, Shigeru Kaneda, Mineo Takai, Keiji Kimura

Multimedia, Distributed, Cooperative, and Mobile Symposium(DICOMO 2015) IPSJ

Presentation date： 2015.07
Evaluation of Parallelization of video decoding on Intel and ARM Multicore

Tamami Wake, Shuhei Iizuka, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

IPSJ SIG Technical Report Vol.2015-EMB-36 IPSJ

Presentation date： 2015.03
Dynamic Scheduling Algorithm for Automatically Parallelized and Power Reduced Applications on Multicore Systems

Takashi Goto, Kohei Muto, Tomohiro Hirano, Hiroki Mikami, Uichiro Takahashi(Fujitsu, Sakae Inoue(Fujitsu, Keiji Kimura, Hironori Kasahara

IPSJ SIG Technical Report Vol.2015-SLDM-170 IPSJ

Presentation date： 2015.03
Power Reduction of Real-time Dynamic Image Processing on Haswell Multicore Using OSCAR Compiler

Shuhei Iizuka, Hideo Yamamoto, Tomohiro Hirano, Youhei Kishimoto, Takashi Goto, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

IPSJ SIG Technical Report Vol.2015-EMB-36 IPSJ

Presentation date： 2015.03
Evaluation of Software Cashe Coherency Cotrol Scheme by an Automatic Parallelizing Compiler

Yohei Kishimoto, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

IPSJ SIG Technical Report Vol.2014-ARC-213 No.19 IPSJ

Presentation date： 2014.12
Android Demonstration System of Automatic Parallelization and Power Optimization by OSCAR Compiler

Bui Duc Binh, Tomohiro Hirano, Hiroki Mikami, Dominic Hillenbrand, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ,Vol.2014-ARC-211 No.6 IPSJ

Presentation date： 2014.07
Tracing method of a parallelized program using Linux ftrace on a multicore processor

Daichi Fukui, Mamoru Shimaoka, Hiroki Mikami, Dominic Hillenbrand, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ,Vol.2014-ARC-211 No.6 IPSJ

Presentation date： 2014.07
A Latency Reduction Technique for Network Intrusion Detection System on Multicores

Keiji Kimura [Invited]

MPSoC

Presentation date： 2014.07
Automatic Parallelization of Small Point FFT on Multicore Processor

Yuuki Furuyama, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

IPSJ SIG Technical Report Vol.2013-ARC-201 IPSJ

Presentation date： 2014.03
A Latency Reduction Technique for IDS by Allocating Decomposed Signature on Multi-core

Shohei Yamada, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

IPSJ SIG Technical Report Vol.2013-ARC-201 IPSJ

Presentation date： 2014.03
A Parallelizing Compiler Cooperative Acceleration Technique of Multicore Architecture Simulation using a Statistical Method

TAGUCHI Gakuho, KIMURA Keiji, KASAHARA Hironori

IPSJ SIG Technical Report IEICE

Presentation date： 2014.03
Profile-Based Automatic Parallelization for Android 2D Rendering by Using OSCAR Compiler

Takashi Goto, Kohei Muto, Hideo Yamamoto, Tomohiro Hirano, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, Vol.2013-ARC-207 No.12 IPSJ

Presentation date： 2013.12
Automatic Parallelization of Automatically Generated Engine Control C Codes by Model-based Design

Dan Umeda, Youhei Kanehagi, Hiroki Mikami, Mitsuhiro Tani(DENSO, Yuji Mori(DENSO, Keiji Kimura, Hironori Kasahara

Embedded System Symposium2013 IPSJ

Presentation date： 2013.10
An Evaluation of Hardware Barrier Synchronization Mechanism Considering Hierarchical Processor Grouping using OSCAR API Standard Translator

Akihiro Kawashima, Yohei Kanehagi, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, Vol.2013-ARC-206 No.16 IPSJ

Presentation date： 2013.08
Automatic Power Control on Multicore Android Devices

Tomohiro Hirano, Hideo Yamamoto, Kohei Muto, Hiroki Mikami, Takashi Goto, Dominic Hillenbrand, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, Vol.2013-ARC-206 No.23 IPSJ

Presentation date： 2013.08
OSCAR API v2.1 with Flexible Accelerator Control Facilities

Keiji Kimura [Invited]

MPSoC

Presentation date： 2013.07
マルチコア用並列化アプリケーション開発の基礎と実例

木村啓二 [Invited]

ESEC 2013 専門セミナー Reed Exhibition Japan

Presentation date： 2013.05
Enhancing the Performance of a Multiplayer Game by Using a Parallelizing Compiler

Yasir I. M. Al-Dosary, Yuki Furuyama, Dominic Hillenbrand, Keiji Kimura, Hironori Kasahara, Seinosuke Narita

Technical Report of IPSJ IPSJ

Presentation date： 2013.04
An Investigation of Parallelization and Evaluation on Commercial Multi-core Smart Device

Hideo Yamamoto, Takashi Goto, Tomohiro Hirano, Kouhei Muto, Hiroki Mikami, Dominic Hillenbrand, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, Vol. 2013-OS-124 No. 000310 IPSJ

Presentation date： 2013.02
Parallelization of Automobile Engine Control Software on Multicore Processor

KANEHAGI YOUHEI, UMEDA DAN, MIKAMI HIROKI, HAYASHI AKIHIRO, SAWADA MITSUO, KIMURA KEIJI, KASAHARA HIRONORI

Technical Report of IPSJ, Vol.2013-ARC-203 No.2 IPSJ

Presentation date： 2013.01
An Accelemtion Technique of Many-core Architecture Simulation with Parallelized Applications by Statistical Technique

Abe Yoichi, Taguchi Gakuho, Kimura Keiji, Kasahara Hironori

Technical Report of IPSJ, Vol.2012-ARC-203 N0.13 IPSJ

Presentation date： 2013.01
A Parallelizing Compiler Cooperative Multicore Architecture Simulator with Changeover Mechanism of Simulation Modes

TAGUCHI GAKUHO, ABE YOUICHI, KIMURA KEIJI, KASAHARA HIRONORI

Technical Report of IPSJ, Vol.2012-ARC-203 N0.14 IPSJ

Presentation date： 2013.01
Automatic parallelization with OSCAR API Analyzer: a cross-platform performance evaluation

Cecilia Gonzalez-Alvarez, Youhei Kanehagi, Kosei Takemoto, Yohei Kishimoto, Kohei Muto, Hiroki Mikami, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, Vol.2012-ARC-202HPC137 No.10 IPSJ

Presentation date： 2012.12
Automatic Parallelization of Ground Motion Simulator

Mamoru Shimaoka, Hiroki Mikami, Akihiro Hayashi, Yasutaka Wada, Keiji Kimura, Hidekazu Morita, HITACHI, Kunio Uchiyama, HITACHI, Hironori Kasahara

Technical Report of IPSJ, Vol.2012-ARC-202HPC137 No.11 IPSJ

Presentation date： 2012.12
Opportunities and Challenges of Application-Power Control in the Age of Dark Silicon

Dominic Hillenbrand, Yuuki Furuyama, Akihiro Hayashi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, Vol.2012-ARC-202HPC137 No.26 IPSJ

Presentation date： 2012.12
Parallel processing of multimedia applications on TILEPro64 using OSCAR API for embedded multicore

Yohei Kishimoto, Hiroki Mikami, Keiichi Nakano, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

Embedded System Symposium2012 IPSJ

Presentation date： 2012.10
Parallelization of Basic Engine Controll Software Model on Multicore Processor

Dan Umeda, Youhei Kanehagi, Hiroki Mikami, Akihiro Hayashi, Mituhiro Tani, Yuji Mori, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, Vol.2012-ARC-201 No.22 IPSJ

Presentation date： 2012.08
Realization of 1 Watt Web Service with RP-X Low-power Multicore Processor

Yuuki Furuyama, Mamoru Shimaoka, Hiroki Mikami, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, Vol.2012-ARC-201 No.24 IPSJ

Presentation date： 2012.08
OSCAR API for Low-Power Multicores and Manycores, and API Standard Translator

Keiji Kimura [Invited]

MPSoC

Presentation date： 2012.07
並列化コンパイラを考慮したコーディング作法と並列化APIの現在

木村啓二 [Invited]

ESEC 2012 専門セミナー Reed Exhibition Japan

Presentation date： 2012.05
A Definition of Parallelizable C by JISX0180:2011 "Framework of establishing coding guidelines for embedded system development"

KIMURA KEIJI, MASE MASAYOSHI, KASAHARA HIRONORI

ETNET2012 IPSJ

Presentation date： 2012.03
Inlining Analysis of Exception Flow and Fast Method Dispatch on Automatic Parallelization of Java

Keiichi Tabata, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, Vol. 2012-ARC-199 IPSJ

Presentation date： 2012.03
An Examination of Accelerating Many-core Architecture Simulation for Parallelized Media Applications

Yoichi Abe, Ryo Ishizuka, Ryota Daigo, Gakuho Taguchi, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, Vol. 2012-ARC-199 IPSJ

Presentation date： 2012.03
Automatic Parallelization of Dose Calculation Engine for A Particle Therapy

Akihiro Hayashi, Takuji Matsumoto, Hiroki Mikami, Keiji Kimura, Keiji Yamamoto, Hironori Saki, Yasuyuki Takatani, Hironori Kasahara

Symposium on High-Performance Computing and Computer Science(HPCS2012) IPSJ

Presentation date： 2012.01
Automatic Parallelization of Dose Calculation Engine for A Particle Therapy on SMP Servers

Akihiro Hayashi, Takuji Matsumoto, Hiroki Mikami, Keiji Kimura, Keiji Yamamoto, Hironori Saki, Yasuyuki Takatani, Hironori Kasahara

Technical Report of IPSJ, Vol.2011-ARC189HPC132-2 IPSJ

Presentation date： 2011.11
Examination of Parallelization by CUDA in SPEC Benchmark Programs

Yuki Taira, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2011-HPC-130-16 IPSJ

Presentation date： 2011.07
An Evaluation of an Acceleration method of Many-core Architecture Simulation using Program Structures of Scientific Applications

Ryo Ishizuka, Yoichi Abe, Ryota Daigo, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2011-ARC-196-14 IPSJ

Presentation date： 2011.07
並列化APIとコンパイラによるマルチコア用アプリケーションの開発

木村啓二 [Invited]

ESEC 2011 専門セミナー Reed Exhibition Japan

Presentation date： 2011.05
Hiding I/O overheads with Parallelizing Compiler for Media Applications

Akihiro Hayashi, Takeshi Sekiguchi, Masayoshi Mase, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2011-ARC-195-14 IPSJ

Presentation date： 2011.04
Evaluation of Power Consumption by Executing Media Applications on Low-power Multicore RP2

Hiroki Mikami, Shumpei Kitaki, Takafumi Sato, Masayoshi Mase, Keiji Kimura, Kazuhisa Ishizaka, Junji Sakai, Masato Edahiro, Hironori Kasahara

Technical Report of IPSJ, 2011-ARC-194-1 IPSJ

Presentation date： 2011.03
Evaluation of Parallelizable C Programs by the OSCAR API Standard Translator

SATO TAKUYA, MIKAMI HIROKI, HAYASHI AKIHIRO, MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

Technical Report of IPSJ, 2010-ARC-191-2 IPSJ

Presentation date： 2010.10
An Acceleration Technique of Many Core Architecture Simulator Considering Program Structure

ISHIZUKA RYO, OOTOMO TOSHIYA, DAIGO RYOTA, KIMURA KEIJI, KASAHARA HIRONORI

Technical Report of IPSJ, 2010-ARC-190 No. 20 IPSJ

Presentation date： 2010.08
Performance of Power Reduction Scheme by a Compiler on Heterogeneous Multicore for Consumer Electronics "RP-X"

WADA YASUTAKA, HAYASHI AKIHIRO, WATANABE TAKESHI, SEKIGUCHI TAKESHI, MASE MASAYOSHI, SHIRAKO JUN, KIMURA KEIJI, ITO MASAYUKI, HASEGAWA ATSUSHI, SATO MAKOTO, NOJIRI TOHRU, UCHIYAMA KUNIO, KASAHARA HIRONORI

Technical Report of IPSJ, 2010-ARC-190 No. 8 IPSJ

Presentation date： 2010.08
A Compiler Framework for Heterogeneous Multicores for Consumer Electronics

HAYASHI AKIHIRO, WADA YASUTAKA, WATANABE TAKESHI, SEKIGUCHI TAKESHI, MASE MASAYOSHI, KIMURA KEIJI, ITO MASAYUKI, HASEGAWA ATSUSHI, SATO MAKOTO, NOJIRI TOHRU, UCHIYAMA KUNIO, KASAHARA HIRONORI

Technical Report of IPSJ, 2010-ARC-190 No. 7 IPSJ

Presentation date： 2010.08
組込みマルチコア用並列化APIと並列化コンパイラの現在

木村啓二 [Invited]

ESEC 2010 専門セミナー Reed Exhibition Japan

Presentation date： 2010.05
Parallelizing Compiler Directed Software Coherence

MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

Technical Report of IPSJ, 2010-ARC-189, 2010-OS-114 IPSJ

Presentation date： 2010.04
Multi Media Offload with Automatic Parallelization

ISHIZAKA KAZUHISA, SAKAI JUNJI, EDAHIRO MASATO, MIYAMOTO TAKAMICHI, MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

Technical Report of IPSJ, 2010-SLDM144, 2010-EMB16 IPSJ

Presentation date： 2010.03
Processing Performance of Automatically Parallelized Applications on Embedded Multicore with Running Multiple Applications

Takamichi Miyamoto, Masayoshi Mase, Keiji Kimura, Kazuhisa Ishizaka, Junji Sakai, Masato Edahiro

Technical Report of IPSJ, 2010-ARC-188 No.9 IPSJ

Presentation date： 2010.03
Hierarchical Parallel Processing of H.264/AVC Encoder on an Multicore Processeor

Hiroki Mikami, Takamichi Miyamoto, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ Vol.2010-ARC-187 No.22 Vol.2010-EMB-15 No.22 IPSJ

Presentation date： 2010.01
Element-Sensitive Pointer Analysis for Automatic Parallelization

Masayoshi Mase, Yuta Murata, Keiji Kimura, Hironori Kasahara

IPSJ-SIGPRO IPSJ

Presentation date： 2009.10
メニーコア・プロセッサとそれを支える要素技術

井上弘士, 木村啓二, 松谷宏紀 [Invited]

組込システムシンポジウム 2009 情報処理学会

Presentation date： 2009.10
Automatic Parallelization of Parallelizable C Programs on Multicore Processors

Masayoshi Mase, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2009-ARC-184-15 IPSJ

Presentation date： 2009.08
組込みソフトウェアの信頼性，開発効率向上のためのコーディングガイドライン

木村啓二 [Invited]

平成21年度 INSTAC成果報告会

Presentation date： 2009.07
A Power Reduction Scheme of Parallelizing Compiler Using OSCAR API on Multicore Processor

Ryo Nakagawa, Masayoshi Mase, Naoto Ohkuni, Jun Shirako, Keiji Kimura, Hironori Kasahara

Symposium on Advanced Computing Systems and Infrastructures (SACSIS 2009) IPSJ

Presentation date： 2009.05
最新の組込みマルチコア用コンパイラ技術と並列API

木村啓二 [Invited]

ESEC 2009 専門セミナー Reed Exhibition Japan

Presentation date： 2009.05
Performance Evaluation of Minimum Execution Time Multiprocessor Scheduling Algorithms Using Standard Task Graph Set Ver3 Consider Parallelism of Task Graphs and Deviation of Task Execution Time

Mamoru Shimaoka, Kazuhiro Imaizumi, Fumiyo Takano, Keiji Kimura, Hironori Kasahara

Technical Report of IEICE IPSJ

Presentation date： 2009.02
A Power Saving Scheme on Multicore Processors Using OSCAR API

Ryo Nakagawa, Masayoshi Mase, Jun Shirako, Keiji Kimura, Hironori Kasahara

TECHNICAL REPORT OF IEICE. (ICD2008/145) IEICE

Presentation date： 2009.01
Local Memory Management Scheme by a Compiler for Multicore Processor

Taku Momozono, Hirofumi Nakano, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

TECHNICAL REPORT OF IEICE. (ICD2008/141) IEICE

Presentation date： 2009.01
Performance Evaluation of Parallelizing Compiler Cooperated Heterogeneous Multicore Architecture Using Media Applications

Teruo Kamiyama, Yasutaka Wada, Akihiro Hayashi, Masayoshi Mase, Hirofumi Nakano, Takeshi Watanabe, Keiji Kimura, Hironori Kasahara

TECHNICAL REPORT OF IEICE. (ICD2008/140) IEICE

Presentation date： 2009.01
マルチコアのソフトウェア開発

木村啓二 [Invited]

CEATEC JAPAN 2008 インダストリアルセッション(IS) JEITA

Presentation date： 2008.10
マルチコア用コンパイル技術の現在

木村啓二 [Invited]

第10回組み込みシステム技術に関するサマーワークショップ (SWEST10) 情報処理学会

Presentation date： 2008.09
マルチコアプロセッサのソフトウェア

木村啓二 [Invited]

第31回STARCアドバンスト講座システムアーキテクチャセミナー - SoCシステムアーキテクチャ - STARC

Presentation date： 2008.07
An Evaluation of Barrier Synchronization Mechanism Considering Hierarchical Processor Grouping

Kaito Yamada, Masayoshi Mase, Jun Shirako, Keiji Kimura, Masayuki Ito, Toshihiro Hattori, Hiroyuki Mizuno, Kunio Uchiyama, Hironori Kasahara

Technical Report of IPSJ, IPSJ

Presentation date： 2008.05
Automatic Parallelization of Restricted C Programs using Pointer Analysis

Masayoshi Mase, Daisuke Baba, Harumi Nagayama, Yuta Murata, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2008 IPSJ

Presentation date： 2008.05
Parallelization of Multimedia Applications by Compiler on Multicores for Consumer Electronics

Takamichi Miyamoto, Saori Asaka, Hiroki Mikami, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

Symposium on Advanced Computing Systems and Infrastructures (SACSIS 2008) IPSJ

Presentation date： 2008.05
Parallelization for Multimedia Processing on Multicore Processors

Takamichi Miyamoto, Kei Tamura, Hiroaki Tano, Hiroki Mikami, Saori Asaka, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2007-ARC-175-05 IPSJ

Presentation date： 2007.11
最新の組み込みマルチコア用コンパイラ技術

木村啓二 [Invited]

システムLSIワークショップ情報処理学会

Presentation date： 2007.11
Multigrain Parallelization of Restricted C Programs in SMP Execution Mode of a Multicore for Consumer Electronics

Masayoshi Mase, Daisuke Baba, Harumi Nagayama, Hiroaki Tano, Takeshi Masuura, Takamichi Miyamoto, Jun Shirako, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

Embedded Systems Symposium 2007 （ESS 2007) IPSJ

Presentation date： 2007.10
Compiler Control Power Saving for Heterogeneous Multicore Processor

Akihiro Hayashi, Taketo Iyoku, Ryo Nakagawa, Shigeru Matsumoto, Kaito Yamada, Naoto Oshiyama, Jun Shirako, Yasutaka Wada, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2007-ARC-174-18 IPSJ

Presentation date： 2007.08
A Hierarchical Coarse Grain Task Static Scheduling Scheme on a Heterogeneous Multicore

Yasutaka Wada, Akihiro Hayashi, Taketo Iyoku, Jun Shirako, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2007-ARC-174-17 IPSJ

Presentation date： 2007.08
Evaluation of Heterogeneous Multicore Architecture with AAC-LC Stereo Encoding

Hiroaki Shikano, Masaki Ito, Takashi Todaka, Takanobu Tsunoda, Tomoyuki Kodama, Masafumi Onouchi, Kunio Uchiyama, Toshihiko Odaka, Tatsuya Kamei, Ei Nagahama, Manabu Kusaoke, Yusuke Nitta, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

TECHNICAL REPORT OF IEICE. (ICD2007-71) IEICE

Presentation date： 2007.08
マルチコア用コンパイラ技術

木村啓二 [Invited]

165委員会主催研究会第46回研究会「マルチコアプロセッサSoCの現状と今後の展望」

Presentation date： 2007.07
組込マルチコアの動向

木村啓二 [Invited]

JEITA 情報端末フェスティバル 2007 JEITA

Presentation date： 2007.06
A 4320MIPS four Processor-core SMP/AMP with Individually Managed Clock Frequency for Low Power Consumption

Kiyoshi hayase, Yutaka Yoshida, Tatsuya Kamei, Shinichi Shibahara, Osamu Nishii, Toshihiro Hattori, Atsushi Hasegawa, Masashi Takada, Naohiko Irie, Kunio Uchiyama, Toshihiko Odaka, Kiwamu Takada, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2007-ARC-173-06 IPSJ

Presentation date： 2007.05
Mutligrain Parallel Processing in SMP Execution Mode on a Multicore for Consumer Electronics

Masayoshi Mase, Daisuke Baba, Harumi Nagayama, Hiroaki Tano, Takeshi Masuura, Takamichi Miyamoto, Jun Shirako, Hirofumi Nakano, Keiji Kimura, Tatsuya Kamei, Toshihiro Hattori, Atsushi Hasegawa, Makoto Sato, Masaki Ito, Toshihiko Odaka, Hironori Kasahara

Technical Report of IPSJ, 2007-ARC-173-05 IPSJ

Presentation date： 2007.05
マルチコアプロセッサ活用の勘所

木村啓二 [Invited]

組み込みプロセッサ＆プラットホームワークショップ

Presentation date： 2007.04
A Local Memory Management Scheme in Multigrain Parallelizing Compiler

Miura Tsuyoshi, Tomohiro Tagawa, Yusuke Muramatsu, Akinori Ikemi, Masahiro Nakagawa, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2007-ARC-172-11 IPSJ

Presentation date： 2007.03
Automatic Parallelization for Multimedia Applications on Multicore Processors

Takamichi Miyamoto, Saori Asaka, Nobuhito Kamakura, Hiromasa Yamauchi, Masayoshi Mase, Jun Shirako, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2007-ARC-171-13 IPSJ

Presentation date： 2007.01
Automati Parallelization of Restricted C Ptrograms in OSCAR Compiler

Masayoshi Mase, Daisuke Baba, Harumi Nagayama, Hiroaki Tano, Takeshi Masuura, Koji Fukatsu, Takamichi Miyamoto, Jun Shirako, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2006-ARC-170-1 IPSJ

Presentation date： 2006.11
Performance of OSCAR Multigrain Parallelizaing Compiler on SMP Servers and Embedded Multicore

Jun Shirako, Tomohiro Tagawa, Tsuyoshi Miura, Takamichi Miyamoto, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2006-ARC-170-2 IPSJ

Presentation date： 2006.11
ソフトウェアもおもしろいこれからのプロセッサアーキテクチャ

木村啓二 [Invited]

FIT2006イベント企画「これからが面白いプロセッサアーキテクチャ」（パネル）情報処理学会

Presentation date： 2006.09
Local Memory Management on OSCAR Multicore

Hirofumi Nakano, Takumi Nito, Takanori Maruyama, Masahiro Nakagawa, Yuki Suzuki, Yousuke Naito, Takamichi Miyamoto, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, 2006-ARC-169-28 IPSJ

Presentation date： 2006.08
Compiler Control Power Saving Scheme for Multicore Processors

un Shirako, Munehiro Yoshida, Naoto Oshiyama, Yasutaka Wada, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Proc. of Symposium on Advanced Computing Systems and Infrastructures (SACSIS2006) IPSJ

Presentation date： 2006.05
Data Transfer Overlap of Coarse Grain Task Parallel Processing on a Multicore Processor

Takamichi Miyamoto, Masahiro Nakagawa, Shoichiro Asano, Yosuke Naito, Takumi Nito, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC-2006-167, HPC-2006-105 IPSJ

Presentation date： 2006.02
A Static Scheduling Scheme for Coarse Grain Task on a Heterogeneous Chip Multi Processor

Yasutaka Wada, Naoto Oshiyama, Yuki Suzuki, Jun Shirako, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC-2006-166 IPSJ

Presentation date： 2006.01
Preliminary Evaluation of Heterogeneous Chip Multi-Processor with MP3 Audio Encoder

Hiroaki Shikano, Yuki Suzuki, Yasutaka Wada, Jun Shirako, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC-2006-166 IPSJ

Presentation date： 2006.01
Data Localization on a Multicore Processor

Hirofumi Nakano, Shoichiro Asano, Yosuke Naito, Takumi Nito, Tomohiro Tagawa, Takamichi Miyamoto, Takeshi Kodaka, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC2005-165-10 IPSJ

Presentation date： 2005.12
Compiler Control Power Saving Scheme for Homogeneous Multiprocessor

Jun Shirako, Naoto Oshiyama, Yasutaka Wada, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC2005-164-10 IPSJ

Presentation date： 2005.08
Performance of OSCAR Multigrain Parallelizing Compiler on Shared Memory Multiprocessor Serers

Jun Shirako, Takamichi Miyamoto, Kazuhisa Ishizaka, Motoki Obata, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC2005-161-5 IPSJ

Presentation date： 2005.01
Performance Evaluation of Electronic Circuit Simulation Using Code Generation Method without Array Indirect Access

Akira Kuroda, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC2005-161-1 IPSJ

Presentation date： 2005.01
Parallel Processing for MPEG2 Encoding on OSCAR Chip Multiprocessor

Takeshi Kodaka, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC2004-160-10 IPSJ

Presentation date： 2004.12
Data Localization using Data Transfer Unit on OSCAR Chip Multiprocessor

Hirofumi Nakano, Yosuke Naito, Takahisa Suzuki, Takeshi Kodaka, Kazuhisa Ishizaka, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC2004-159-20 IPSJ

Presentation date： 2004.08
Evaluation of Multigrain Parallelism on OSCAR Chip Multi Processor

Yasutaka Wada, Jun Shirako, Kazuhisa Ishizaka, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC2004-159-11 IPSJ

Presentation date： 2004.08
Evaluation of OSCAR Multigrain Automatic Parallelizing Compiler on IBM pSeries 690

Kazuhisa Ishizaka, Jun Shirako, Motoki Obata, Keiji Kimura, Hironori Kasahara

Proc. 66th Annual Convention IPSJ IPSJ

Presentation date： 2004.03
Parallel Processing for MPEG2 Encoding using Data Localization

Takeshi Kodaka, Hirohumi Nakano, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC2004-156-3 IPSJ

Presentation date： 2004.02
The Data Prefetching of Coarse Grain Task Parallel Processing on Symmetric Multi Proc essor Machine

akamichi Miyamoto, Takahiro Yamaguchi, Takao Tobita, Kazuhisa Ishizaka, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC2003-155-06 IPSJ

Presentation date： 2003.11
Data Localization Scheme using Static Scheduling on Chip Multiprocessor

Hirofumi Nakano, Takeshi Kodaka, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC2003-154-14 IPSJ

Presentation date： 2003.08
Parallel Processing on MPEG2 Encoding for OSCAR Chip Multiprocessor

Takeshi Kodaka, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC2003-154-10 IPSJ

Presentation date： 2003.08
Data Localization using Coarse Grain Task Parallelism on Chip Multiprocessor

Hirofumi Nakano, Takeshi Kodaka, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC2003-151-3 IPSJ

Presentation date： 2003.01
Multigrain Parallel Processing on Motion Vector Estimation for Single Chip Multiprocessor

Takeshi Kodaka, Takahisa Suzuki, Keiji Kimura, Hironori Kasahara

Technical Report of IPSJ, ARC2002-150-6 IPSJ

Presentation date： 2002.11
Multigrain Parallel Processing on OSCAR Chip Multiprocessor

Keiji Kimura, Takeshi Kodaka, Motoki Obata, Hironori Kasahara

Technical Report of IPSJ, ARC2002-150-7 IPSJ

Presentation date： 2002.11
Evaluation of Overhead with Coarse Grain Task Parallel Processing on SMP Machines

Yasutaka Wada, Hirofumi Nakano, Keiji Kimura, Motoki Obata, Hironori Kasahara

Technical Report of IPSJ, ARC2002-148-3 IPSJ

Presentation date： 2002.05
PEG Encoding using Multigrain Parallel Processing on a Shingle Chip Multiprocessor

Takeshi Kodaka, Takayuki Uchida, Keiji Kimura, Hironori Kasahara

Joint Symposium on Parallel Processing 2002 (JSPP2002) IPSJ

Presentation date： 2002.05
Multigrain Parallel Processing for JPEG Encoding Program on an OSCAR type Single Chip Multiprocessor

T. Kodaka, T. Uchida, K. Kimura, H. Kasahara

Technical Report of IPSJ, ARC2002-146-4 IPSJ

Presentation date： 2002.02
Multigrain Parallel Processing on Single Chip Multiprocessor

T. Uchida, T. Kodaka, K. Kimura, H. Kasahara

Technical Report of IPSJ, ARC2002-146-3 IPSJ

Presentation date： 2002.02
Near Fine Grain Parallel Processing on Multimedia Application for Single Chip Multiprocessor

T. Kodaka, N. Miyashita, K. Kimura, H. Kasahara

Technical Report of IPSJ, ARC2001-144-11 IPSJ

Presentation date： 2001.11
A Static Scheduling Scheme for Coarse Grain Tasks considering Cache Optimization on SMP

H. Nakano, K. Ishizaka, M. Obata, K. Kimura, H. Kasahara

Technical Report of IPSJ, ARC2001-144-12 IPSJ

Presentation date： 2001.08
A Static Scheduling Method for Coarse Grain Tasks considering Cache Optimization on Multiprocessor Systems

H. Nakano, K. Ishizaka, M. Obata, K. Kimura, H. Kasahara

Proc. 62nd Annual Convention IPSJ IPSJ

Presentation date： 2001.03
Near Fine Grain Parallel Processing on Multimedia Application for Single Chip Multiprocessor

T. Kodaka, K. Kimura, N. Miyashita, H. Kasahara

Proc. 62nd Annual Convention IPSJ IPSJ

Presentation date： 2001.03
Performance Evaluation of Single Chip Multiprocessor Memory Architecture for Near Fine Grain Parallel Processing

N. Matsumoto, K. Kimura, H. Kasahara

Proc. 62nd Annual Convention IPSJ IPSJ

Presentation date： 2001.03
A Data Transfer Unit on the Single Chip Multiprocessor for Multigrain Prallel Processing

N. Miyashita, K. Kimura, T. Kodaka, H. Kasahara

Proc. 62nd Annual Convention IPSJ IPSJ

Presentation date： 2001.03
Processor Core Architecture of Single Chip Multiprocessor for Near Fine Grain Parallel Processing

K. Kimura, T. Uhida, T. Kato, H. Kasahara

Technical Report of IPSJ, ARC-139-16 IPSJ

Presentation date： 2000.08
Performance Evaluation of Single Chip Multiprocessor for Near Fine Grain Parallel Processing

T. Kato, W. Ogata, K. Kimura, T. Uchida, H. Kasahara

Proc. 60th Annual Convention IPSJ IPSJ

Presentation date： 2000.03
Memory access analyzer for a Multi-grain parallel processing

K. Iwai, M. Obata, K. Kimura, H. Amano, H. Kasahara

Technical Report of IEICE, CPSY99-62 IEICE

Presentation date： 1999.08
Performance Evaluation of Near Fine Grain Parallel Processing on the Single Chip Multiprocessor

K. Kimura, K. Manaka, W. Ogata, M. Okamoto, H. Kasahara

Technical Report of IPSJ, ARC134-5 IPSJ

Presentation date： 1999.08
A Cache Optimization Scheme Using Earliest Executable Condition Analysis

D. Inaishi, K. Kimura, K. Fujimoto, W. Ogata, M. Okamoto, H. Kasahara

Proc. 58th Annual Convention IPSJ IPSJ

Presentation date： 1999.03
A Cache Optimization with Earliest Executable Condition Analysis

D. Inaishi, K. Kimura, K. Fujimoto, W. Ogata, M. Okamoto, H. Kasahara

Technical Report of IPSJ, ARC-130-6 IPSJ

Presentation date： 1998.08
Multigrain parallel Processing on the Single Chip Multiprocessor

K. Kimura, W. Ogata, M. Okamoto, H. Kasahara

Technical Report of IPSJ, ARC98-130-5 IPSJ

Presentation date： 1998.08
A Multigrain Parallelizing Compiler and Its Architectural Support

H. Kasahara, W. Ogata, K. Kimura, M. Obata, T. Tobita, D. Inaishi

TECHNICAL REPORT OF IEICE. (ICD98-10, CPSY98-10, FTS98-10) IEICE

Presentation date： 1998.04
Implementation of FPGA Based Architecture Test Bed For Multi Processor System

W. Ogata, T. Yamamoto, M. Mizuno, K. Kimura, H. Kasahara

IPSJ SIG Notes, 98-ARC-128-14, HPC70-14 IPSJ

Presentation date： 1998.03
Single Chip Multiprocessor Architecture for Multigrain Parallel Processing

K. Kimura, W. Ogata, M. Okamoto, H. Kasahara

Proc. 56th Annual Convention IPSJ IPSJ

Presentation date： 1998.03
A Cache Optimization with Macro-Task Earliest Execution Condition

D. Inaishi, K. Kimura, W. Ogata, M. Okamoto, H. Kasahara

Proc. 56th Annual Convention IPSJ IPSJ

Presentation date： 1998.03
Multi-processor system for Multi-grain Parallel Processing

K. Iwai, T. Fujiwara, T. Morimura, H. Amano, K. Kimura, W. Ogata, H. Kasahara

Technical Report of IEICE, CPSY97-46 IEICE

Presentation date： 1997.08
A Macro Task Dynamic Scheduling Algorithm with Overlapping of Task Processing and Data Transfer

K. Kimura, S. Hashimoto, M. Kogou, W. Ogata, H. Kasahara

Technical Report of IEICE, CPSY97-40 IEICE

Presentation date： 1997.08

▼display all

Research Seeds

Method for speeding up simulations for assessing many-core performance

Information Communication
Parallelization of multimedia applications (MPEG2 encoders and decoders) by using the OSCAR compiler

Information Communication

Research Projects

Highly Convenient and Low Overhead Trusted Execution Environment

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2023.04

-

2026.03
A Study of Matrix Multiply by Homomorphic Encryption for Utilizing in Deep Learning Frameworks

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2018.06

-

2020.03

Kimura Keiji

　View Summary

This research aims at accelerating matrix-multiply in homomorphic encryption toward utilizing it in deep learning frameworks. Through the research, we obtained 5.53x and 3.73x speedups in maximum for two important computational parts in the target encrypted matrix-multiply process. In addition, we have developed a data transfer unit, which can quickly provide required data to accelerator hardware units. We also investigated and evaluated the relationship between the precision of computations and calculation time to reduce the calculation cost while keeping the appropriate precision. As a result, we obtained 8 points accuracy improvement and 54% speedup for image recognition at the same time by parallel inference with eight smaller neural networks.
A research on a heterogeneous multicore that enables flexible cooperation among CPUs, accelerators and data transfer units on a chip

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2015.04

-

2018.03

Kimura Keiji

　View Summary

We developed a heterogeneous multicore architecture and its compiler flow, which enable flexible cooperation among CPUs, accelerator cores, and data transfer units, which is a kind of extended DMA controller, in a multicore chip. One of the main achievements in this research project is that a program parallelized by the developed compiler flow including LLVM backend for the accelerator core obtains 24.91x speedup on the heterogeneous multicore on an FPGA test bed, which is also developed in this research
Real-Time Optimization Algorithms and Their Applications for Control of Large-Scale Nonlinear Spatiotemporal Patterns

Project Year :

2012.04

-

2016.03

　View Summary

Fast algorithms for solving nonlinear optimal control problems were investigated to optimally control large-scale and complicated systems, and their applications to various fields were examined. Achievements in this research include, for example, development of efficient optimization algorithms for control of large-scale systems, systematic tuning methods of control responses, and a tool for automatic coding of the algorithms. The algorithms have been validated in various applications such as control of distributions of temperature and velocity in thermal fluid systems, suppression of quality dispersion in a steel making process, water quality control in advanced sewage treatment facilities, demand control in smart grids, control of power generation and attitude oscillation in floating off-shore wind turbines, and so on
A Study of Acceleration Technique for Many-core Architecture Simulation Considering Global Program Structure

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research

Project Year :

2011.04

-

2014.03

KIMURA Keiji

　View Summary

A fast and high accuracy architecture simulation technique for multi-core and many-core processors are proposed in this study. By this proposed technique, an architecture simulator changes its precision and simulation speed appropriately under the assumption that a parallelized application is executed on a multi-core or a many-core.The evaluation results with four applications each of which has different characteristics show the 16-core multicore simulation gives 443 times speedup within 0.52% error in maximum, and 218 times speedup within 2.75% error on average
ソフトウェア協調整チップマルチプロセッサにおけるデータ利用最適化に関する研究

　View Summary

本年度は、昨年度に引き続きソフトウェア協調動作型チップマルチプロセッサ用のデータローカリティ最適化およびデータ転送最適化に関する研究を行なった。本研究では、データを共有するタスク群に着目し、プロセッサコアローカルなキャッシュやローカルメモリのサイズを考慮してこれらのタスクを分割し各プロセッサコアに割り当て、キャッシュやローカルメモリの有効利用を図る。さらに、残存するデータ転送を、プロセッサコアに割り当てたタスクとオーバラップして行うことにより、データ転送オーバヘッドの隠蔽を図る。具体的には、MPEG2エンコーデイング処理やJPEG2000エンコーディング処理などのマルチメディアデプリケーションをターゲットとして、これらのアプリケーションに自動的にデータローカリティ最適化とデータ転送最適化手法を適用し、チップマルチプロセッサ上で効率よく動作させるためのソフトウェア・ハードウェア協調動作技術の開発とその評価を行なった。評価の結果、とりわけMPEG2エンコーディング処理では動作周波数400MHz時で逐次実行に対し8プロセッサ使用時で7.97倍、動作周波数2.8GHz時で逐次実行に対し8プロセッサ使用時で6.54倍の速度向上率を得られることが確認できた。MPEG2エンコーディングプログラムに対する本データローカリティ最適化およびデータ転送最適化は、自動並列化コンパイラによりほぼ自動的に行われる。より多くのアプリケーションに対して本手法を自動的に適用し対象アプリケーションを拡大することは今後の課題である

▼display all

Misc

自動並列化コンパイラのコンパイル時間短縮のための実行プロファイル・フィードバックを用いたコード生成手法 (コンピュータシステム) -- (組込み技術とネットワークに関するワークショップETNET2017)

藤野里奈, 韓吉新, 島岡護, 見神広紀, 宮島崇浩, 高村守幸, 木村啓二, 笠原博徳

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 116 ( 510 ) 207 - 212 2017.03

CiNii
自動車リアルタイム制御計算の複数クラスタ構成マルチコア上での並列化 (コンピュータシステム) -- (組込み技術とネットワークに関するワークショップETNET2017)

宮田仁, 島岡護, 見神広紀, 西博史, 鈴木均, 木村啓二, 笠原博徳

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 116 ( 510 ) 177 - 182 2017.03

CiNii
大規模システムを想定したGem5シミュレータの階層的インターコネクションネットワーク拡張 (コンピュータシステム) -- (組込み技術とネットワークに関するワークショップETNET2017)

小野口達也, 林綾音, 宇高勝之, 松島裕一, 木村啓二, 笠原博徳

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 116 ( 510 ) 147 - 152 2017.03

CiNii
LLVMを用いたベクトルアクセラレータ用コードのコンパイル手法 (コンピュータシステム)

丸岡晃, 無州祐也, 狩野哲史, 持山貴司, 北村俊明, 神谷幸男, 高村守幸, 木村啓二, 笠原博徳

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 116 ( 177 ) 19 - 24 2016.08

CiNii
Android Video Processing System Combined with Automatically Parallelized and Power Optimized Code by OSCAR Compiler

Bui Duc Binh, Tomohiro Hirano, Hiroki Mikami, Hideo Yamamoto, Keiji Kimura, Hironori Kasahara

57 ( 4 ) 2016.04

CiNii
Dynamic Scheduling Algorithm for Automatically Parallelized and Power Reduced Applications on Multicore Systems

GOTO Takashi, MUTO Kohei, HIRANO Tomohiro, MIKAMI Hiroki, TAKAHASHI Uichiro, INOUE Sakae, KIMURA Keiji, KASAHARA Hironori

IEICE technical report. Computer systems 114 ( 506 ) 95 - 100 2015.03

　View Summary

This paper proposes a dynamic scheduling algorithm for multiple automatically parallelized or power reduced applications on a multicore smart devices to gain higher performance and lower power comsumption within the application's deadline. This scheduling algorithm uses the information such as time, power, deadline and number of cores for each application, and is composed of three type of scheduling. Using media codec applications as a benchmark, the proposed scheduling gained 18.5% speedup and 28.8% power reduction compared to FIFO scheduling.

CiNii
OSCAR自動並列化コンパイラを用いたリアルタイム動画像アプリケーションのHaswellマルチコア上での低消費電力化 (コンピュータシステム)

飯塚修平, 山本英雄, 平野智大, 岸本耀平, 後藤隆志, 見神広紀, 木村啓二, 笠原博徳

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 114 ( 506 ) 219 - 224 2015.03

　View Summary

スマートフォンやノートパソコンといったモバイル端末からデータセンタで利用されるサーバーマシンまで,あらゆる計算機において消費電力の削減が最重要課題となっている.これは、消費電力の削減によりモバイル機器においてはバッテリー持続時間の延長により利便性が大幅に向上し,またサーバーマシンにおいては膨大な電力コストや空調コストの削減が実現できるからである.これらの計算機は高性能かつ低消費電力を実現するためにマルチコアプロセッサを搭載したものが主流となっている.しかしながらマルチコアの資源を有効活用してこれらを実現するためには,プログラムの並列化が不可欠であり手動で行うには膨大な工数を必要とする.本稿では,医用・防犯・個人認証・車載などで広く利用されているリアルタイム物体認識処理に対して,OSCAR自動並列化コンパイラによるDVFS及びclock gatingによる電力制御を適用し,現在幅広く利用されているIntel Haswell Core i7-4770Kマルチコア上で評価した. Intel Haswellマルチコア上で,Webカメラからの画像の入力・人の顔の認識処理・画面描画というリアルタイムなシステム全域における消費電力の削減を行ったところ,1PE逐次実行では電力制御なしの場合の31.06[W]から電力制御ありの場合では28.74[W]に、3PEで並列化実行した場合では電力制御なし場合のの41.73[W]から電力制御の場合では17.78[W]に消費電力を削減したことが確認され,物体認識処理におけるマルチコア用のコンパイラ自動電力制御の有用性が確認できた.

CiNii
OSCAR自動並列化コンパイラを用いたリアルタイム動画像アプリケーションのHaswellマルチコア上での低消費電力化 (ディペンダブルコンピューティング)

飯塚修平, 山本英雄, 平野智大, 岸本耀平, 後藤隆志, 見神広紀, 木村啓二, 笠原博徳

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 114 ( 507 ) 219 - 224 2015.03

　View Summary

スマートフォンやノートパソコンといったモバイル端末からデータセンタで利用されるサーバーマシンまで,あらゆる計算機において消費電力の削減が最重要課題となっている.これは、消費電力の削減によりモバイル機器においてはバッテリー持続時間の延長により利便性が大幅に向上し,またサーバーマシンにおいては膨大な電力コストや空調コストの削減が実現できるからである.これらの計算機は高性能かつ低消費電力を実現するためにマルチコアプロセッサを搭載したものが主流となっている.しかしながらマルチコアの資源を有効活用してこれらを実現するためには,プログラムの並列化が不可欠であり手動で行うには膨大な工数を必要とする.本稿では,医用・防犯・個人認証・車載などで広く利用されているリアルタイム物体認識処理に対して,OSCAR自動並列化コンパイラによるDVFS及びclock gatingによる電力制御を適用し,現在幅広く利用されているIntel Haswell Core i7-4770Kマルチコア上で評価した. Intel Haswellマルチコア上で,Webカメラからの画像の入力・人の顔の認識処理・画面描画というリアルタイムなシステム全域における消費電力の削減を行ったところ, 1PE逐次実行では電力制御なしの場合の31.06[W]から電力制御ありの場合では28.74[W]に、3PEで並列化実行した場合では電力制御なし場合のの41.73[W]から電力制御の場合では17.78[W]に消費電力を削減したことが確認され,物体認識処理におけるマルチコア用のコンパイラ自動電力制御の有用性が確認できた.

CiNii
Dynamic Scheduling Algorithm for Automatically Parallelized and Power Reduced Applications on Multicore Systems

GOTO Takashi, MUTO Kohei, HIRANO Tomohiro, MIKAMI Hiroki, TAKAHASHI Uichiro, INOUE Sakae, KIMURA Keiji, KASAHARA Hironori

IEICE technical report. Dependable computing 114 ( 507 ) 95 - 100 2015.03

　View Summary

This paper proposes a dynamic scheduling algorithm for multiple automatically parallelized or power reduced applications on a multicore smart devices to gain higher performance and lower power comsumption within the application's deadline. This scheduling algorithm uses the information such as time, power, deadline and number of cores for each application, and is composed of three type of scheduling. Using media codec applications as a benchmark, the proposed scheduling gained 18.5% speedup and 28.8% power reduction compared to FIFO scheduling.

CiNii
動画像デコーディングのIntelおよびARMマルチコア上での並列処理の評価 (ディペンダブルコンピューティング)

和気珠実, 飯塚修平, 見神広紀, 木村啓二, 笠原博徳

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 114 ( 507 ) 263 - 268 2015.03

　View Summary

本稿では,マルチコアプロセッサを用いて動画像デコーディング処理の高速化を実現する手法として2種類の並列化手法について性能評価を行った.1つ目の並列化手法は並列化対象ループにループスキューイング/ループインターチェンジを適用する手法,2つ目の並列化手法はwave-front手法を適用する手法であり,どちらの場合もマクロブロック間の依存関係を満たしつつこれらの間の並列性を利用することで並列処理が可能となる.評価に用いる動画像コーデックは,MPEG2と比較して約2倍の符号化効率を持ちワンセグ放送等に用いられているH.264/AVCと,H.264/AVCと同等の品質を持ちYoutube等でも採用されている動画規格であるWebMのビデオコーデックVP8である.これらの規格により動画像デコーディングを行うプログラムに対して,上記2つの並列化手法をそれぞれ適用した.Snapdragon APQ8064 Krait 4コアを搭載したNexus7上で評価を行った結果,ループスキューイング/ループインターチェンジ手法で並列化した場合,並列化箇所のみで逐次実行に比べ3コアで1.33倍速度向上し,その一方でwave-front手法では3コアで2.86倍の速度向上が得られた.同様にIntel(R) Xeon(R) CPU X5670プロセッサを搭載したマシンで評価を行った結果,ループスキューイング/ループインターチェンジ手法で並列化した場合,並列化箇所のみで逐次実行に比べ6コアで1.82倍速度向上し,一方でwave-front手法では6コアで4.61倍の速度向上が得られた.

CiNii
OSCAR自動並列化コンパイラを用いたリアルタイム動画像アプリケーションのHaswellマルチコア上での低消費電力化

飯塚修平, 山本英雄, 平野智大, 岸本耀平, 後藤隆志, 見神広紀, 木村啓二, 笠原博徳

研究報告組込みシステム（EMB） 2015 ( 20 ) 1 - 6 2015.02

　View Summary

スマートフォンやノートパソコンといったモバイル端末からデータセンタで利用されるサーバーマシンまで，あらゆる計算機において消費電力の削減が最重要課題となっている．これは，消費電力の削減によりモバイル機器においてはバッテリー持続時間の延長により利便性が大幅に向上し，またサーバーマシンにおいては膨大な電力コストや空調コストの削減が実現できるからである．これらの計算機は高性能かつ低消費電力を実現するためにマルチコアプロセッサを搭載したものが主流となっている．しかしながらマルチコアの資源を有効活用してこれらを実現するためには，プログラムの並列化が不可欠であり手動で行うには膨大な工数を必要とする．本稿では，医用・防犯・個人認証・車載などで広く利用されているリアルタイム物体認識処理に対して，OSCAR 自動並列化コンパイラによる DVFS 及び clock gating による電力制御を適用し，現在幅広く利用されている Intel Haswell Core i7-4770K マルチコア上で評価した．Intel Haswell マルチコア上で，Web カメラからの画像の入力・人の顔の認識処理・画面描画というリアルタイムなシステム全域における消費電力の削減を行ったところ，1PE 逐次実行では電力制御なしの場合の 31.06[W] から電力制御ありの場合では 28.74[W] に，3PE で並列化実行した場合では電力制御なし場合のの 41.73[W] から電力制御の場合では 17.78[W] に消費電力を削減したことが確認され，物体認識処理におけるマルチコア用のコンパイラ自動電力制御の有用性が確認できた．

CiNii
動画像デコーディングのIntelおよびARMマルチコア上での並列処理の評価

和気珠実, 飯塚修平, 見神広紀, 木村啓二, 笠原博徳

研究報告組込みシステム（EMB） 2015 ( 35 ) 1 - 6 2015.02

　View Summary

本稿では，マルチコアプロセッサを用いて動画像デコーディング処理の高速化を実現する手法として 2 種類の並列化手法について性能評価を行った．1 つ目の並列化手法は並列化対象ループにループスキューイング/ループインターチェンジを適用する手法，2 つ目の並列化手法は wave-front 手法を適用する手法であり，どちらの場合もマクロブロック間の依存関係を満たしつつこれらの間の並列性を利用することで並列処理が可能となる．評価に用いる動画像コーデックは，MPEG2 と比較して約 2 倍の符号化効率を持ちワンセグ放送等に用いられている H.264/AVC と，H.264/AVC と同等の品質を持ち Youtube 等でも採用されている動画規格である WebM のビデオコーデック VP8 である．これらの規格により動画像デコーディングを行うプログラムに対して，上記 2 つの並列化手法をそれぞれ適用した．Snapdragon APQ8064 Krait 4 コアを搭載した Nexus7 上で評価を行った結果，ループスキューイング/ループインターチェンジ手法で並列化した場合，並列化箇所のみで逐次実行に比べ 3 コアで 1.33 倍速度向上し，その一方で wave-front 手法では 3 コアで 2.86 倍の速度向上が得られた．同様に Intel(R) Xeon(R) CPU X5670 プロセッサを搭載したマシンで評価を行った結果，ループスキューイング/ループインターチェンジ手法で並列化した場合，並列化箇所のみで逐次実行に比べ 6 コアで 1.82 倍速度向上し，一方で wave-front 手法では 6 コアで 4.61 倍の速度向上が得られた．

CiNii
Dynamic Scheduling Algorithm for Automatically Parallelized and Power Reduced Applications on Multicore Systems

2015 ( 34 ) 1 - 6 2015.02

　View Summary

This paper proposes a dynamic scheduling algorithm for multiple automatically parallelized or power reduced applications on a multicore smart devices to gain higher performance and lower power comsumption within the application's deadline. This scheduling algorithm uses the information such as time, power, deadline and number of cores for each application, and is composed of three type of scheduling. Using media codec applications as a benchmark, the proposed scheduling gained 18.5% speedup and 28.8% power reduction compared to FIFO scheduling.

CiNii
自動並列化コンパイラによるソフトウェアキャッシュコヒーレンシ制御手法の評価

岸本耀平, 間瀬正啓, 木村啓二, 笠原博徳

研究報告ハイパフォーマンスコンピューティング（HPC） 2014 ( 19 ) 1 - 7 2014.12

　View Summary

主記憶共有型マルチコアプロセッサにおいて，一般にキャッシュコヒーレンシ制御はハードウェアにより実現されている．今後のプロセッサコア数の増加に伴いキャッシュコヒーレンシハードウェアの回路規模は大きくなり，チップへの実装が困難になること，電力消費が大きくなること，設計期間及び開発費用が増大することが懸念されている．本稿ではこのハードウェアコヒーレンシ制御の問題を解決するために，ハードウェアコヒーレンシ制御機構を持たない主記憶共有型ノンコヒーレントキャッシュマルチコアに対して，並列化コンパイラがソフトウェアに対し自動的にコヒーレンシ制御を行う手法を提案する．本手法を実装した OSCAR 自動並列化コンパイラと，4 コアのクラスタを 2 つ持ちクラスタ間ではハードウェアコヒーレンシを持たない情報家電用マルチコア RP2 を用い性能評価を行った．9 つの科学技術計算アプリケーションを対象として評価を行ったところ，4 コアのハードウェアコヒーレンシ制御使用時の性能は平均で 1 コア性能の 2.80 倍であったのに対し，ハードウェアコヒーレンシを使用せず本手法を適用した 4 コア実行時の性能は平均で 1 コア性能の 2.61 倍となりほぼ同等の速度向上が得られ，さらに 8 コアハードウェアコヒーレンシ制御無効時には平均で 1 コア性能の 3.66 倍とスケールアップすることが確認できた．

CiNii
Android Movie Player System Combined with Automatically Parallelized and Power Optimized Code by OSCAR Compiler

2014 55 - 62 2014.10

CiNii
Prospect of Green Computing

4 ( 4 ) 3 - 8 2014.10

CiNii
OSCARコンパイラを用いたH.264/AVCデコーダのAndroidマルチコアでの低消費電力化

飯塚修平, 山本英雄, 平野智大, 後藤隆志, 見神広紀, 高橋宇一郎, 井上栄, 高村守幸, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC） 2014 ( 9 ) 1 - 1 2014.09

　View Summary

スマートフォンの普及と移動通信の高速化に伴い，モバイル端末における動画再生の頻度が増加している．H.264/AVC は高い圧縮率を実現することからワンセグ放送や YouTube など現在のメディア処理に広く利用されている動画像圧縮符号標準であるが，モバイル端末では動画再生時の膨大な演算に対する消費電力の増大がバッテリー持続時間の低下を招き，問題となっている．この問題に対して，現在では専用ハードウェアが用いられているが，モバイル端末に求められる多様なコーデックやアップデートへの柔軟な対応を考慮すると，今後ソフトウェアによる解決手法が有用であると考えられる．本研究では H.264/AVC デコーダのプログラムのうち最も負荷が大きいフレーム間予測及び，デブロッキングフィルタの処理に対して並列化を行った上で電力制御を適用し，ソフトウェアによる消費電力削減の有用性を検証した．OSCAR 自動並列化コンパイラを用いて LoopSkewing のアクセス順序からマクロブロックレベルでの並列性を抽出し，リアルタイム制約の保証内での DVFS 及び WFI を用いた擬似クロックゲーティングを適用した．Android 端末の開発ボードである ODROID-X2 の上で電力値の評価を行ったところ，1PE で 1.07[W] から 0.79[W] に，2PE で 1.69[W] から 0.57[W] に，3PE で 2.45[W] から 0.51[W] に消費電力を削減したことが確認された．

CiNii
Automatic Parallelization of Designed Engine Control C Codes by MATLAB/Simulink

55 ( 8 ) 1817 - 1829 2014.08

CiNii
Linux ftraceを用いたマルチコアプロセッサ上での並列化プログラムのトレース手法

福意大智, 島岡護, 見神広紀, Dominic Hillenbrand, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC） 2014 ( 6 ) 1 - 6 2014.07

　View Summary

ソフトウェアの適切な並列化により，マルチコアを搭載したコンピュータシステム上でアプリケーションを高速に動作させることが可能である．並列化されたソフトウェアの挙動や性能を調査する手法として，ソースコードの解読や実行ダンプファイルの収集，プロファイラの利用，デバッガの利用といった方法が挙げられる．しかしこれらの手法ではどのようなタイミングにおいてコンテクストスイッチが発生したのか，システムで発生する事象に対してソフトウェアがどのような影響を受けているかといった情報を得ることは困難である．そこで，本稿では並列化されたプログラムが実際に並列実行される様子をソフトウェアからトレースに任意のアノテーションを挿入可能とする拡張を施した Linux ftrace を用いて解析する手法を提案する．提案手法を用いて，Intel Xeon X7560，ARMv7 の各々のプラットフォームにおいて equake，art，mpeg2enc というベンチマークのトレースを行い，これらのプログラムが実行時に OS からどのような影響を受けているか観測できることが確認できた．また，1 回のアノテーションの挿入を Intel Xeon で 1.07[us]，ARMで4.44[us] で可能であることが確認できた．

CiNii
Android Demonstration System of Automatic Parallelization and Power Optimization by OSCAR Compiler

Bui DucBinh, Tomohiro Hirano, Dominic Hillenbrand, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

IPSJ SIG Notes 2014 ( 7 ) 1 - 6 2014.07

　View Summary

The emergence of multicore processors in smart devices promises higher performance and better user experience. The parallelization of application enables us to improve the application performance, however, simultaneously utilizing many cores would drastically drain the device battery life. Therefore, power saving technology has become important. This report shows a realtime video demonstration system for power reduction controlled by OSCAR Automatic parallelization Compiler on ODROID-X2, an open Android development platform based on Samsung Exynos4412 Prime. The demonstration results show that it can save 18.2% power consumption for MPEG-2 Decoder application and 56.6% power consumption for Optical Flow application by using 2 cores in both applications.

CiNii
大規模無線センサネットワークにおける外乱を考慮したアーキテクチャ探索シミュレータの実装と評価

山下浩一郎, 鈴木貴久, 栗原康志, 大友俊也, 木村啓二, 笠原博徳

マルチメディア、分散協調とモバイルシンポジウム2014論文集 2014 1368 - 1377 2014.07

CiNii
A Parallelizing Compiler Cooperative Acceleration Technique of Multicore Architecture Simulation using a Statistical Method

TAGUCHI Gakuho, KIMURA Keiji, KASAHARA Hironori

IEICE technical report. Dependable computing 113 ( 498 ) 289 - 294 2014.03

　View Summary

A parallelizing compiler cooperative acceleration technique for multicore architecture simulation is proposed in this paper. Profile data of a sequential execution of a target application on a real machine is decomposed into multiple clusters by x-means clustering. Then, sampling points for a detail simulation mode in each cluster are calculated. In addition, a parallelizing compiler generates a parallelized code by taking both of the clustering information and the source code of the target application. The evaluation results show, in the case of the simulation for 16 cores, 437 times speedup is achieved with 0.04% error for equake, and 28 times speedup is achieved with 0.04% error for mpeg2 encoder.

CiNii
A Parallelizing Compiler Cooperative Acceleration Technique of Multicore Architecture Simulation using a Statistical Method

2014 ( 49 ) 1 - 6 2014.03

CiNii
A Parallelizing Compiler Cooperative Acceleration Technique of Multicore Architecture Simulation using a Statistical Method

2014 ( 49 ) 1 - 6 2014.03

CiNii
A Latency Reduction Technique for IDS by Allocating Decomposed Signature on Multi-core

Shohei Yamada, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

IPSJ SIG Notes 2014 ( 2 ) 1 - 8 2014.02

　View Summary

Cyber attacks targeting on companies and government organizations have been increasing and highly developed. An Intrusion Detection System (IDS) is one of efficient solutions to prevent those attacks. An IDS detects illegal network accesses in realtime by monitoring the network and filtering suspicious IP packets. Large processing performance is required for IDSs to process a large number of IP packets in realtime. In order to satisfy this requirement, a latency reduction technique for signature-based IDSs by allocating decomposed signature on multicores is proposed in this paper. The proposed technique is implemented in Suricata, which is an open source IDS, and evaluated it with several data sets, such as DARPA Intrusion Detection Evaluation Data Set. The evaluation results show the proposed techniques with four cores achieves 3.22 times performance improvement in maximum comparing with two cores without signature decomposition.

CiNii
Automatic Parallelization of Small Point FFT on Multicore Processor

Yuuki Furuyama, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

IPSJ SIG Notes 2014 ( 3 ) 1 - 8 2014.02

　View Summary

Fast Fourier Transorm (FFT) is one of the most frequently used algorihtms in many applications including digital signal processing and image processing to compute Descrite Fourier Transform (DFT). Although small size FFT programs must be used in baseband signal processing such as LTE and so on, it's difficult to use special hardwares like DSPs for computing such a small problem because of their relatively large data transfer and control overhead. This paper proposes an automatic parallelization method to generate parallelized programs with low overhead for small size FFTs suited for shared memory multicore processor by applying cache optimization to avoide false sharing between cores. The proposed method has been implemented in OSCAR automatic parallelizing compiler, parallelized small point FFT programs from 32 points to 256 points and evaluated them on RP2 multicore processor having 8 SH-4A cores. It achieved 1.97 times speedup on 2 SH-4A cores and 3.9 times speedup on 4 SH-4A cores in a 256 points FFT program. In addition to the FFT programs, the proposed approach is applied to Fast Hadamard Transform (FHT) which has similar computation to the FFT. The results are 1.91 times speedup on 2 SH-4A cores and 3.32 times speedup on 4 SH-4A cores. It shows effectiveness of the proposed method and easiness of applying the method to many kinds of programs.

CiNii
プロファイル情報を用いたAndroid 2D描画ライブラリSKIAのOSCARコンパイラによる並列化

後藤隆志, 武藤康平, 山本英雄, 平野智大, 見神広紀, 木村啓二, 笠原博徳

研究報告ハイパフォーマンスコンピューティング（HPC） 2013 ( 12 ) 1 - 7 2013.12

　View Summary

本論文では，スマートフォンやタブレット等で広く用いられる Android において，従来マルチコアプロセッサ上での並列化が困難で，その高速化が望まれていた 2D 描画ライブラリ Skia を，OSCAR 自動並列化コンパイラにより，プロファイラ情報に基づいた自動並列化を行う手法を開発したのでその方法を説明する．OSCAR コンパイラは Parallelizable C により記述された逐次プログラムから様々な粒度で並列化解析を行い，自動的に並列化 C ソースを出力する．しかし，Skia は Android 内のライブラリであり，利用する描画命令ルーチンにより制御フローが大きく変化するため，最適な並列化解析を行うことが困難である．そこで，本論文では Skia のような制御フローがコンパイル時に特定できないプログラムに対し，Oprofile を用いて取得したプロファイル結果を OSCAR コンパイラにフィードバックすることで，並列化対象を特定の領域に絞り，高い性能向上が得られる手法を提案する．なお，並列化対象領域が Parallelizable C コードでない場合でも，解析結果により実行コストが大きい部分から Parallelizable C に変更し，チューニングを施すことで並列化が可能となる．本手法を，描画ベンチマークとして広く使われている 0xbench を NVIDIA Tegra3 チップ（ARM Cortex-A9 4 コア）を搭載した Nexus7 上で評価を行った．並列化 Skia の実行においては，並列化部分の速度向上を正確に評価するため， Android を core0 に割り当て，残り 3 コアを Skia が利用できる形とした．評価の結果として，DrawRect で従来の 1.91 倍である 43.57 [fps]，DrawArc で 1.32 倍の 50.98[fps]，DrawCircle2 では 1.5 倍の 50.77[fps] といずれも性能向上結果が得られた．

CiNii
プロファイル情報を用いたAndroid 2D描画ライブラリSKIAのOSCARコンパイラによる並列化

後藤隆志, 武藤康平, 山本英雄, 平野智大, 見神広紀, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC） 2013 ( 12 ) 1 - 7 2013.12

　View Summary

本論文では，スマートフォンやタブレット等で広く用いられる Android において，従来マルチコアプロセッサ上での並列化が困難で，その高速化が望まれていた 2D 描画ライブラリ Skia を，OSCAR 自動並列化コンパイラにより，プロファイラ情報に基づいた自動並列化を行う手法を開発したのでその方法を説明する．OSCAR コンパイラは Parallelizable C により記述された逐次プログラムから様々な粒度で並列化解析を行い，自動的に並列化 C ソースを出力する．しかし，Skia は Android 内のライブラリであり，利用する描画命令ルーチンにより制御フローが大きく変化するため，最適な並列化解析を行うことが困難である．そこで，本論文では Skia のような制御フローがコンパイル時に特定できないプログラムに対し，Oprofile を用いて取得したプロファイル結果を OSCAR コンパイラにフィードバックすることで，並列化対象を特定の領域に絞り，高い性能向上が得られる手法を提案する．なお，並列化対象領域が Parallelizable C コードでない場合でも，解析結果により実行コストが大きい部分から Parallelizable C に変更し，チューニングを施すことで並列化が可能となる．本手法を，描画ベンチマークとして広く使われている 0xbench を NVIDIA Tegra3 チップ（ARM Cortex-A9 4 コア）を搭載した Nexus7 上で評価を行った．並列化 Skia の実行においては，並列化部分の速度向上を正確に評価するため， Android を core0 に割り当て，残り 3 コアを Skia が利用できる形とした．評価の結果として，DrawRect で従来の 1.91 倍である 43.57 [fps]，DrawArc で 1.32 倍の 50.98[fps]，DrawCircle2 では 1.5 倍の 50.77[fps] といずれも性能向上結果が得られた．

CiNii
OSCAR API標準解釈系を用いた階層グルーピング対応ハードウェアバリア同期機構の評価

川島慧大, 金羽木洋平, 林明宏, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC） 2013 ( 16 ) 1 - 6 2013.07

　View Summary

1 チップ内に搭載されるコア数の増加に伴い，アプリケーションからより多くの並列性を抽出し，低オーバーヘッドで利用することがこれらのコアを有効利用するために重要となっている．OSCAR コンパイラによる自動並列化ではより多くの並列性を利用するため，ループやサブルーチン内部の粗粒度並列性を解析し，階層的にタスク定義を行う．この階層的に定義されたタスクをコアを階層的にグルーピングし，コアグループに対して割り当てることにより並列処理を実現する．この階層的なグループ間で独立かつ低コストでバリア同期を実現できるハードウェアが提案され，SH4A プロセッサ 8 コア搭載の情報家電用マルチコア RP2 に実装されている．本稿では，OSCAR API 標準解釈系の階層グループバリア同期 API を RP2 のハードウェアバリア同期機構に対応し評価を行った結果について述べる．8 コアを使用した SPEC CPU 2000 の ART による評価ではソフトウェアでのバリア同期に対し 1.16 倍の性能向上が得られた．

CiNii
マルチコア商用スマートディバイスの評価と並列化の試み

山本英雄, 後藤隆志, 平野智大, 武藤康平, 見神広紀, Dominic Hillenbrand, 林明宏, 木村啓二, 笠原博徳

研究報告システムソフトウェアとオペレーティング・システム（OS） 2013 ( 2 ) 1 - 7 2013.02

　View Summary

半導体プロセスの微細化に伴いスマートフォン，タブレットに代表される民生機器にも４コア程度のマルチコアSoCの採用が進んでいる．一方，ソフトウェアはマルチコアを活用するための並列化が十分に進んでおらず，対応が望まれている．本稿ではAndroidを搭載した商用スマートデバイスにおいて，一般的な利用範囲におけるマルチコアの活用状況を評価し，並列化されたベンチマークプログラムを用いて実行環境の課題と改善方式を述べた上で，標準APIの仕様を変更すること無く，アプリケーションがオフスクリーンバッファを描画バッファに書くBitBLT処理の並列化を試みた結果を報告する．この処理並列化の結果，アプリケーションから2D描画APIを呼び出すベンチマークテストで約3%のフレームレートの改善を確認した．

CiNii
A Parallelizing Compiler Cooperative Multicore Architecture Simulator with Changeover Mechanism of Simulation Modes

TAGUCHI GAKUHO, ABE YOUICHI, KIMURA KEIJI, KASAHARA HIRONORI

Technical report of IEICE. ICD 112 ( 425 ) 65 - 71 2013.01

　View Summary

A parallelizing compiler cooperative multicore architecture simulation framework, which enables reducing simulation time by a flexible simulation-mode changeover mechanism, is proposed A multicore architecture simulator in this framework has two modes, namely, functional-and-fast simulation mode and cycle-accurate-and slow simulation modes This framework generates appropriate sampling points for cycle-accurate mode and runtime for mode changeover of the simulator depending on a parallelized application by cooperating with a parallelizing compiler The proposed framework is evaluated with EQUAKE from SPEC2000 The evaluation result shows 50 times to 500 times speedup can be achieved within 1 6% error

CiNii
An Acceleration Technique of Many-core Architecture Simulation with Parallelized Applications by Statistical Technique

Abe Yoichi, Taguchi Gakuho, Kimura Keiji, Kasahara Hironori

Technical report of IEICE. ICD 112 ( 425 ) 57 - 63 2013.01

　View Summary

This paper proposes an automatic decision technique of the number of clusters and samplmg points for an acceleration technique of many-core architecture simulation by statistical methods This techinque, firstly, focuses on a structure of a benchmark program, especially loops The number of sampling points is exploited from iterations of a target loop by statistical methods If the variation of the cost of the iterations is large, these iterations are grouped into clusters Thus, this technique enables higher estimation accuracy with fewer sampling points However, the number of clusters must be decided by hand in our previous works The automatic decision technique of the number of clusters by "x means" is proposed m this paper As a preliminary evaluation of the proposed technique, sequential execution costs of several benchmark programs are estimated As a result, when MPEG2 encoder program with SIF16, which causes large variation among the cost of iterations, is used, 1 92% error is achieved with 14 iterations as sampling pomts of 450 iterations exploited by x-means

CiNii
Parallelization of Automobile Engine Control Software on Multicore Processor

KANEHAGI YOUHEI, UMEDA DAN, MIKAMI HIROKI, HAYASHI AKIHIRO, SAWADA MITSUO, KIMURA KEIJI, KASAHARA HIRONORI

Technical report of IEICE. ICD 112 ( 425 ) 3 - 10 2013.01

　View Summary

The calculation load in the automobile control system is increasing to achive more safety, comfort and energy-saving Accordingly, control processor cores needs high performance However, the improvement of clock frequency in processor cores is difficult, and it is important to use multicore processor Using the multicore for the engine control, performance, development cost, development period, etc are problems be-cause it is difficult to parallelize softwares This paper proposes a parallelization method of the automobile engine control software on the multicore processor, which has only functioned on single-core processors Con-cretely, it is applied restructuring the sequential program for extracting more parallelism, for example inlining functions and duplicating conditional branches, and the OSCAR compiler allows us perform automatic par-allelization and generation of a parallel C program Using proposed method, the automobile engine control software, which is difficult to parallelize manually because of very fine-grained program, is parallelized and give us 1 71x speedup using 2 cores on RP-X multicore It is confirmed that parallehzation of the automobile engine control software is effective

CiNii
Automatic parallelization with OSCAR API Analyzer: a cross-platform performance evaluation

2012 ( 10 ) 1 - 8 2012.12

CiNii
Automatic Parallelization of Ground Motion Simulator

2012 ( 11 ) 1 - 8 2012.12

CiNii
Automatic parallelization with OSCAR API Analyzer: a cross-platform performance evaluation

2012 ( 10 ) 1 - 8 2012.12

CiNii
Parallelization of Basic Engine Controll Software Model on Multicore Processor

2012 ( 22 ) 1 - 7 2012.07

CiNii
Realization of 1 Watt Web Service with RP-X Low-power Multicore Processor

2012 ( 24 ) 1 - 6 2012.07

CiNii
Inlining Analysis of Exception Flow and Method Dispatch on Automatic Parallelization of Java

2012 ( 9 ) 1 - 6 2012.03

CiNii
An Examination of Accelerating Many-core Architecture Simulation for Parallelized Media Applications

2012 ( 3 ) 1 - 4 2012.03

CiNii
A Definition of Parallelizable C by JISX0180:2011 "Framework of establishing coding guidelines for embedded system development"

2012 ( 22 ) 1 - 6 2012.02

CiNii
A Definition of Parallelizable C by JISX0180:2011 "Framework of establishing coding guidelines for embedded system development"

2012 ( 22 ) 1 - 6 2012.02

CiNii
A Definition of Parallelizable C by JISX0180:2011 "Framework of establishing coding guidelines for embedded system development"

KIMURA KEIJI, MASE MASAYOSHI, KASAHARA HIRONORI

IEICE technical report. Dependable computing 111 ( 462 ) 127 - 132 2012.02

　View Summary

JISX0180:2011 "Framework of establishing coding guidelines for embedded system development" was decided to improve the quality of embeded systems. Parallelizable C has bee also proposed to support exploitation of parallelism by a parallelizing compiler. This paper proposes a definition of Parallelizable C by JISX0180:2011 aiming at the improvement of productivity for embeded multicore developers with parallelizing compilers. An evaluation has been carried out using rewritten programs by the defined coding guideline on ordinary SMPs and a consumer electronics multicore. As the result, 5.54x speedup on IBM p5 550Q (8core), 2.42x speedup on Intel Core i7 960 (4core), and 2.79x speedup on Renesas/Hitachi/Waseda RP2 (4core) have been achieved, respectively.

CiNii
Automatic Parallelization of Dose Calculation Engine for A Particle Therapy on SMP Servers

2011 ( 2 ) 1 - 9 2011.11

CiNii
Examination of Parallelization by CUDA in SPEC benchmark program

2011 ( 16 ) 1 - 6 2011.07

CiNii
科学技術計算プログラムの構造を利用したメニーコアアーキテクチャシミュレーション高速化手法の評価

石塚亮, 阿部洋一, 大胡亮太, 木村啓二, 笠原博徳

研究報告計算機アーキテクチャ（ARC） 2011 ( 14 ) 1 - 11 2011.07

　View Summary

本稿ではキャッシュやパイプラインまでシミュレーションする詳細シミュレーションと命令実行のみの高速な機能シミュレーションの両方を用いたシミュレーション精度切り替えによるメニーコアシミュレータの高速化手法を提案する．本手法はメニーコアシミュレータ上で並列化プログラムを実行することを前提としており，このプログラムの一部のみを詳細シミュレーションを行うことにより高速化を図る．このとき，詳細シミュレーションを行うサンプリング部分を実機での逐次実行プロファイル情報とプログラム構造から判断し，その分量を統計的手法により決定する．本手法を比較的規則性の高い科学技術計算である SPEC CPU 95のTOMCATV，SWIM で及び SPEC CPU 2000 の ART，EQUAKE を用いて統計学的に算出したサンプリングサイズの値を堺に，実行サイクルが収束していくことを示した．これにより，評価したところ，64 コアかつ精度切換えを想定したシミュレーションで，各アプリケーションにおいて，誤差5%の範囲で約 100 倍の高速化が可能であることを示した．

CiNii
Hiding I/O overheads with Parallelizing Compiler for Media Applications

2011 ( 14 ) 1 - 7 2011.04

CiNii
Evaluation of Power Consumption by Executing Media Applications on Low-power Multicore RP2

2011 ( 1 ) 1 - 8 2011.03

CiNii
Evaluation of Parallelizable C Programs by the OSCAR API Standard Translator

SATO TAKUYA, MIKAMI HIROKI, HAYASHI AKIHIRO, MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

2010 ( 2 ) 1 - 6 2010.10

CiNii
Performance of Power Reduction Scheme by a Compiler on Heterogeneous Multicore for Consumer Electronics "RP-X"

WADA YASUTAKA, HAYASHI AKIHIRO, WATANABE TAKESHI, SEKIGUCHI TAKESHI, MASE MASAYOSHI, SHIRAKO JUN, KIMURA KEIJI, ITO MASAYUKI, HASEGAWA ATSUSHI, SATO MAKOTO, NOJIRI TOHRU, UCHIYAMA KUNIO, KASAHARA HIRONORI

2010 ( 8 ) 1 - 10 2010.07

CiNii
An Acceleration Technique of Many Core Architecture Simulator Considering Program Structure

ISHIZUKA RYO, OOTOMO TOSHIYA, DAIGO RYOTA, KIMURA KEIJI, KASAHARA HIRONORI

2010 ( 20 ) 1 - 7 2010.07

CiNii
A Compiler Framework for Heterogeneous Multicores for Consumer Electronics

HAYASHI AKIHIRO, WADA YASUTAKA, WATANABE TAKESHI, SEKIGUCHI TAKESHI, MASE MASAYOSHI, KIMURA KEIJI, ITO MASAYUKI, HASEGAWA ATSUSHI, SATO MAKOTO, NOJIRI TOHRU, UCHIYAMA KUNIO, KASAHARA HIRONORI

2010 ( 7 ) 1 - 9 2010.07

CiNii
Parallelizing Compiler Directed Software Coherence

MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

2010 ( 7 ) 1 - 10 2010.04

CiNii
Multi Media Offload with Automatic Parallelization

ISHIZAKA KAZUHISA, SAKAI JUNJI, EDAHIRO MASATO, MIYAMOTO TAKAMICHI, MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

2010 ( 59 ) 1 - 7 2010.03

CiNii
Processing Performance of Automatically Parallelized Applications on Embedded Multicore with Running Multiple Applications

MIYAMOTO TAKAMICHI, MASE MASAYOSHI, KIMURA KEIJI, ISHIZAKA KAZUHISA, SAKAI JUNJI, EDAHIRO MASATO, KASAHARA HIRONORI

2010 ( 9 ) 1 - 8 2010.02

CiNii
Hierarchical parallel processing of H.264/AVC encoder on an multicore processor

IEICE technical report 109 ( 405 ) 121 - 126 2010.01

CiNii
Hierarchical Parallel Processing of H.264/AVC Encoder on an Multicore Processor

MIKAMI Hiroki, MIYAMOTO Takamichi, KIMURA Keiji, KASAHARA Hironori

2010 ( 22 ) 1 - 6 2010.01

CiNii
Green Multicore-SoC Software-Execution Framework with Timely-Power-Gating Scheme

ONOUCHI Masafumi, TOYAMA Keisuke, NOJIRI Toru, SATO Makoto, MASE Masayoshi, SHIRAKO Jun, SATO Mikiko, TAKADA Masashi, ITO Masayuki, MIZUNO Hiroyuki, NAMIKI Mitaro, KIMURA Keiji, KASAHARA Hironori

IEICE technical report 109 ( 367 ) 7 - 12 2010.01

　View Summary

We developed a software-execution framework for scalable increase of execution speed and low-power consumption based on an octo-core chip multiprocessor named RP2 and an automatic multigrain-parallelizing compiler named OSCAR. Keys to improvement of the performance are reduction of a communication overhead with parallelized tasks and frequent shutdown to waiting cores. For this framework, we developed two schemes: data mapping and timely-power gating. Measurement of the performance for the conventional framework and our proposed framework showed that normalized execution speedup becomes 5.00 when secure AAC-LC encoding is processed in 8-parallel execution. Moreover, applying our timely-power-gating scheme improves power efficiency by 10%.

CiNii
Automatic Parallelization of Parallelizable C Programs on Multicore Processors

MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

2009 ( 15 ) 1 - 10 2009.07

CiNii
Local Memory Management Scheme by a Compiler on a Multicore Processor for Coarse Grain Task Parallel Processing

2 ( 2 ) 63 - 74 2009.07

CiNii
Performance Evaluation of Minimum Execution Time Multiprocessor Scheduling Algorithms Using Standard Task Graph Set Ver3 Consider Parallelism of Task Graphs and Deviation of Task Execution Time

SHIMAOKA MAMORU, IMAIZUMI KAZUHIRO, TAKANO FUMIYO, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2009 ( 14 ) 127 - 132 2009.02

　View Summary

This paper proposes the "Standard Task Graph Set Ver3" (STG Ver3) to evaluate performance of heuristic and optimization algorithms for the minimum execution time multiprocessor scheduling problem. The minimum execution time multiprocessor scheduling problem is known as a strong NP-hard combinational optimization problem to the public. The STG Ver2 was created by random task execution times and random predecessors. In addition, the STG Ver3 considers parallelism of task graphs and deviation of task execution times to let us understand characteristics of algrithms. This paper describes evaluation results by applying the STG Ver3 to several algorithms. Performance evaluation show that DF/IHS can give us optimal solutions for 87.25%, and PDF/IHS 92.25% within 600 seconds.

CiNii
Performance Evaluation of Minimum Execution Time Multiprocessor Scheduling Algorithms Using Standard Task Graph Set Ver3 Consider Parallelism of Task Graphs and Deviation of Task Execution Time

SHIMAOKA MAMORU, IMAIZUMI KAZUHIRO, TAKANO FUMIYO, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2009 ( 14 ) 127 - 132 2009.02

　View Summary

This paper proposes the "Standard Task Graph Set Ver3" (STG Ver3) to evaluate performance of heuristic and optimization algorithms for the minimum execution time multiprocessor scheduling problem. The minimum execution time multiprocessor scheduling problem is known as a strong NP-hard combinational optimization problem to the public. The STG Ver2 was created by random task execution times and random predecessors. In addition, the STG Ver3 considers parallelism of task graphs and deviation of task execution times to let us understand characteristics of algrithms. This paper describes evaluation results by applying the STG Ver3 to several algorithms. Performance evaluation show that DF/IHS can give us optimal solutions for 87.25%, and PDF/IHS 92.25% within 600 seconds.

CiNii
Local Memory Management Scheme by a Compiler for Multicore Processor

MOMOZONO Taku, NAKANO Hirofumi, MASE Masayoshi, KIMURA Keiji, KASAHARA Hironori

IEICE technical report 108 ( 375 ) 69 - 74 2009.01

　View Summary

This paper proposes a local memory management scheme for an automatic parallelizing compiler to realize effective use of a limited size of local memory. After the loop aligned decomposition and task scheduling considering data locality and parallelism, the compiler allocates data to the local memory effectively using the task scheduling result. This paper evaluates the proposed scheme on RP2 multicore for consumer electronics which has 8 SH4A processor cores. Each core integrates 32KB of local data memory and 64KB of distributed shared memory. As the results, the proposed scheme using 8 processors gives us about 6.20 times speedup for MPEG2 encoding program, 7.25 times speedup for AAC encoding program and 7.64 times speedup for susan against the sequential execution.

CiNii
A Power Saving Scheme on Multicore Processors Using OSCAR API

NAKAGAWA Ryo, MASE Masayoshi, SHIRAKO Jun, KIMURA Keiji, KASAHARA Hironori

IEICE technical report 108 ( 375 ) 93 - 98 2009.01

　View Summary

Effective power reduction of an application program on multicore processors requires appropriate power control for each on-chip resource by compilers or users. These low power techniques need an application program interface (API) to realize power control in a user program. This paper proposes a power saving scheme for multicore processors using OSCAR API developed in NEDO "Multicore for Realtime Consumer Electronics" project. The proposed scheme has been implemented in OSCAR compiler to realize the power reduction for fastest execution mode, which minimizes power consumption without performance degradation, and the realtime execution mode to minimize power consumption under realtime constrains. The proposed scheme is evaluated on an 8 cores SH4A multicore processor RP2, newly developed for consumer electronics by Renesas Technology Corp., Hitachi, Ltd. and Waseda University in the above project. For the fastest execution mode, consumed energy was reduced by 13.05% for SPEC2000 art and 3.99% for SPEC2000 equake. Also, for the realtime execution mode, consumed power was reduced by 87.9% for AAC encoder and 73.2% for MPEG2 decoder.

CiNii
Performance Evaluation of Parallelizing Compiler Cooperated Heterogeneous Multicore Architecture Using Media Applications

KAMIYAMA Teruo, WADA Yasutaka, HAYASHI Akihiro, MASE Masayoshi, NAKANO Hirohumi, WATANABE Takeshi, KIMURA Keiji, KASAHARA Hironori

IPSJ SIG Notes 2009 ( 1 ) 63 - 68 2009.01

　View Summary

This paper describes a heterogeneous multicore architecture having accelerator cores in addition to general purpose cores, an automatic parallelizing compiler that cooperatively works with the heterogeneous multicore, a heterogeneous multicore architecture simulation environment, and performance evaluation results with the simulation environment. For the performance evaluation, multimedia applications written in C or Fortran, considered with parallelization by the compiler, are used. As a result, the evaluated heterogeneous multicore having two general purpose cores and two accelerator cores achieves 9.82 times speedup from MP3 encoder. This architecture also achieves 14.64 times speedup from JPEG2000 encoder.

CiNii
Performance Evaluation of Parallelizing Compiler Cooperated Heterogeneous Multicore Architecture Using Media Applications

KAMIYAMA Teruo, WADA Yasutaka, HAYASHI Akihiro, MASE Masayoshi, NAKANO Hirohumi, WATANABE Takeshi, KIMURA Keiji, KASAHARA Hironori

IEICE technical report 108 ( 375 ) 63 - 68 2009.01

　View Summary

This paper describes a heterogeneous multicore architecture having accelerator cores in addition to general purpose cores, an automatic parallelizing compiler that cooperatively works with the heterogeneous multicore, a heterogeneous multicore architecture simulation environment, and performance evaluation results with the simulation environment. For the performance evaluation, multimedia applications written in C or Fortran, considered with parallelization by the compiler, are used. As a result, the evaluated heterogeneous multicore having two general purpose cores and two accelerator cores achieves 9.82 times speedup from MP3 encoder. This architecture also achieves 14.64 times speedup from JPEG2000 encoder.

CiNii
Performance Evaluation of Parallelizing Compiler Cooperated Heterogeneous Multicore Architecture Using Media Applications

KAMIYAMA Teruo, WADA Yasutaka, HAYASHI Akihiro, MASE Masayoshi, NAKANO Hirohumi, WATANABE Takeshi, KIMURA Keiji, KASAHARA Hironori

2009 ( 1 ) 63 - 68 2009.01

　View Summary

This paper describes a heterogeneous multicore architecture having accelerator cores in addition to general purpose cores, an automatic parallelizing compiler that cooperatively works with the heterogeneous multicore, a heterogeneous multicore architecture simulation environment, and performance evaluation results with the simulation environment. For the performance evaluation, multimedia applications written in C or Fortran, considered with parallelization by the compiler, are used. As a result, the evaluated heterogeneous multicore having two general purpose cores and two accelerator cores achieves 9.82 times speedup from MP3 encoder. This architecture also achieves 14.64 times speedup from JPEG2000 encoder.

CiNii
A Power Saving Scheme on Multicore Processors Using OSCAR API

NAKAGAWA Ryo, MASE Masayoshi, SHIRAKO Jun, KIMURA Keiji, KASAHARA Hironori

2009 ( 1 ) 93 - 98 2009.01

　View Summary

Effective power reduction of an application program on multicore processors requires appropriate power control for each on-chip resource by compilers or users. These low power techniques need an application program interface (API) to realize power control in a user program. This paper proposes a power saving scheme for multicore processors using OSCAR API developed in NEDO "Multicore for Realtime Consumer Electronics" project. The proposed scheme has been implemented in OSCAR compiler to realize the power reduction for fastest execution mode, which minimizes power consumption without performance degradation, and the realtime execution mode to minimize power consumption under realtime constrains. The proposed scheme is evaluated on an 8 cores SH4A multicore processor RP2, newly developed for consumer electronics by Renesas Technology Corp., Hitachi, Ltd. and Waseda University in the above project. For the fastest execution mode, consumed energy was reduced by 13.05% for SPEC2000 art and 3.99% for SPEC2000 equake. Also, for the realtime execution mode, consumed power was reduced by 87.9% for AAC encoder and 73.2% for MPEG2 decoder.

CiNii
Local Memory Management Scheme by a Compiler for Multicore Processor

MOMOZONO Taku, NAKANO Hirofumi, MASE Masayoshi, KIMURA Keiji, KASAHARA Hironori

IPSJ SIG Notes 2009 ( 1 ) 69 - 74 2009.01

　View Summary

This paper proposes a local memory management scheme for an automatic parallelizing compiler to realize effective use of a limited size of local memory. After the loop aligned decomposition and task scheduling considering data locality and parallelism, the compiler allocates data to the local memory effectively using the task scheduling result. This paper evaluates the proposed scheme on RP2 multicore for consumer electronics which has 8 SH4A processor cores. Each core integrates 32KB of local data memory and 64KB of distributed shared memory. As the results, the proposed scheme using 8 processors gives us about 6.20 times speedup for MPEG2 encoding program, 7.25 times speedup for AAC encoding program and 7.64 times speedup for susan against the sequential execution.

CiNii
A Power Saving Scheme on Multicore Processors Using OSCAR API

NAKAGAWA Ryo, MASE Masayoshi, SHIRAKO Jun, KIMURA Keiji, KASAHARA Hironori

IPSJ SIG Notes 2009 ( 1 ) 93 - 98 2009.01

　View Summary

Effective power reduction of an application program on multicore processors requires appropriate power control for each on-chip resource by compilers or users. These low power techniques need an application program interface (API) to realize power control in a user program. This paper proposes a power saving scheme for multicore processors using OSCAR API developed in NEDO "Multicore for Realtime Consumer Electronics" project. The proposed scheme has been implemented in OSCAR compiler to realize the power reduction for fastest execution mode, which minimizes power consumption without performance degradation, and the realtime execution mode to minimize power consumption under realtime constrains. The proposed scheme is evaluated on an 8 cores SH4A multicore processor RP2, newly developed for consumer electronics by Renesas Technology Corp., Hitachi, Ltd. and Waseda University in the above project. For the fastest execution mode, consumed energy was reduced by 13.05% for SPEC2000 art and 3.99% for SPEC2000 equake. Also, for the realtime execution mode, consumed power was reduced by 87.9% for AAC encoder and 73.2% for MPEG2 decoder.

CiNii
Local Memory Management Scheme by a Compiler for Multicore Processor

MOMOZONO Taku, NAKANO Hirofumi, MASE Masayoshi, KIMURA Keiji, KASAHARA Hironori

2009 ( 1 ) 69 - 74 2009.01

　View Summary

This paper proposes a local memory management scheme for an automatic parallelizing compiler to realize effective use of a limited size of local memory. After the loop aligned decomposition and task scheduling considering data locality and parallelism, the compiler allocates data to the local memory effectively using the task scheduling result. This paper evaluates the proposed scheme on RP2 multicore for consumer electronics which has 8 SH4A processor cores. Each core integrates 32KB of local data memory and 64KB of distributed shared memory. As the results, the proposed scheme using 8 processors gives us about 6.20 times speedup for MPEG2 encoding program, 7.25 times speedup for AAC encoding program and 7.64 times speedup for susan against the sequential execution.

CiNii
An Evaluation of Barrier Synchronization Mechanism Considering Hierarchical Processor Grouping

YAMADA Kaito, MASE Masayoshi, SHIRAKO Jun, KIMURA Keiji, ITO Masayuki, HATTORI Toshihiro, MIZUNO Hiroyuki, UCHIYAMA Kunio, KASAHARA Hironori

IEICE technical report 108 ( 28 ) 19 - 24 2008.05

　View Summary

In order to use a large number of processor cores in a chip, hierarchical coarse grain task parallel processing, which exploits whole program parallelism by analyzing hierarchical coarse grain task parallelism inside loops and subroutines, has been proposed and implemented in OSCAR automatic parallelizing compiler. This hierarchical coarse grain task parallel processing defines processor groups hierarchically and logically, and assigns hierarchical coarse grain tasks to each processor group. A light-weight and scalable barrier synchronization mechanism considering hierarchical processor grouping, which supports hierarchical coarse grain task parallel processing, is developed and implemented into RP2 multicore processor having eight SH4A cores with support by NEDO "Multicore Technology for Realtime Consumer Electronics". This barrier mechanism is proposed and evaluated in this paper. The evaluation using AAC encoder program by 8 cores shows our barrier mechanism achieves 16% better performance than software barrier.

CiNii
Automatic Parallelization of Restricted C Programs using Pointer Analysis

MASE Masayoshi, BABA Daisuke, NAGAYAMA Harumi, MURATA Yuta, KIMURA Keiji, KASAHARA Hironori

IEICE technical report 108 ( 28 ) 69 - 74 2008.05

　View Summary

This paper describes a restriction on pointer usage in C language for parallelism extraction by an automatic parallelizing compiler. By rewriting programs to satisfy the restriction, automatic parallelization using flow-sensitive, context-sensitive pointer analysis on an 8 cores SMP server achieved 3.80 times speedup for SPEC2000 art, 6.17 times speedup for SPEC2006 lbm and 5.14 times speedup for MediaBench mpeg2enc against the sequential execution, respectively.

CiNii
Automatic Parallelization of Restricted C Programs using Pointer Analysis

MASE Masayoshi, BABA Daisuke, NAGAYAMA Harumi, MURATA Yuta, KIMURA Keiji, KASAHARA Hironori

IPSJ SIG Notes 2008 ( 39 ) 69 - 74 2008.05

　View Summary

This paper describes a restriction on pointer usage in C language for parallelism extraction by an automatic parallelizing compiler. By rewriting programs to satisfy the restriction, automatic parallelization using flow-sensitive, context-sensitive pointer analysis on an 8 cores SMP server achieved 3.80 times speedup for SPEC2000 art, 6.17 times speedup for SPEC2006 lbm and 5.14 times speedup for MediaBench mpeg2enc against the sequential execution, respectively.

CiNii
An Evaluation of Barrier Synchronization Mechanism Considering Hierarchical Processor Grouping

YAMADA Kaito, MASE Masayoshi, SHIRAKO Jun, KIMURA Keiji, ITO Masayuki, HATTORI Toshihiro, MIZUNO Hiroyuki, UCHIYAMA Kunio, KASAHARA Hironori

IPSJ SIG Notes 2008 ( 39 ) 19 - 24 2008.05

　View Summary

In order to use a large number of processor cores in a chip, hierarchical coarse grain task parallel processing, which exploits whole program parallelism by analyzing hierarchical coarse grain task parallelism inside loops and subroutines, has been proposed and implemented in OSCAR automatic parallelizing compiler. This hierarchical coarse grain task parallel processing defines processor groups hierarchically and logically, and assigns hierarchical coarse grain tasks to each processor group. A light-weight and scalable barrier synchronization mechanism considering hierarchical processor grouping, which supports hierarchical coarse grain task parallel processing, is developed and implemented into RP2 multicore processor having eight SH4A cores with support by NEDO "Multicore Technology for Realtime Consumer Electronics". This barrier mechanism is proposed and evaluated in this paper. The evaluation using AAC encoder program by 8 cores shows our barrier mechanism achieves 16% better performance than software barrier.

CiNii
Parallelization for Multimedia Processing on Multicore Processors

MIYAMOTO TAKAMICHI, TAMURA KEI, TANO HIROAKI, MIKAMI HIROKI, ASAKA SAORI, MASE MASAYOSHI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2007 ( 115 ) 77 - 82 2007.11

　View Summary

Multicore processors have attracted much attention to handle the increase of power consumption, the slowdown of improvement of processor clock speed, and the increase of hardware/software developing period. Also, speeding up multimedia applications is required with the progress of the consumer electronics devices like mobile phones, digital TV and games. This paper describes parallelization methods of multimedia applications on the multicore processors. Especially in this paper, MPEG2 encoding and MPEG2 decoding are selected as examples of video sequence processing, MP3 encoding is selected as an example of audio processing, JPEG 2000 encoding is selected as an example of picture processing. OSCAR multigrain parallelizing compiler parallelizes these media applications using newly developed multicore API. This paper evaluates parallel processing performances of these multimedia applications on the FR1000 multicore processor developed by Fujitsu Ltd, and the RP1 multicore processor jointly-developed by Waseda University, Renesas Technology Corp. and Hitachi Ltd.

CiNii
Evaluation of Heterogeneous Multicore-Architecture with AAC-LC Stereo Encoding

SHIKANO Hiroaki, ITO Masaki, TODAKA Takashi, TSUNODA Takanobu, KODAMA Tomoyuki, ONOUCHI Masafumi, UCHIYAMA Kunio, ODAKA Toshihiko, KAMEI Tatsuya, NAGAHAMA Ei, KUSAOKE Manabu, NITTA Yusuke, WADA Yasutaka, KIMURA Keiji, KASAHARA Hironori

IEICE technical report 107 ( 194 ) 11 - 16 2007.08

　View Summary

This paper describes a heterogeneous multi-core processor (HMCP) architecture which integrates general purpose processors (CPU) and accelerators (ACC) to achieve high-performance as well as low-power consumption for SoCs of embedded systems. Memory architecture of CPUs and ACCs were unified to improve programming and compiling efficiency. For preliminary evaluation of the HMCP architecture, AAC-LC stereo audio encoding is parallelized on a heterogeneous multi-core having homogeneous processor cores and dynamic reconfigurable processor (DRP) accelerator cores. The performance evaluation shows that 54x AAC encoding is achieved on the chip with two CPUs at 600MHz and two DRPs at 300MHz, which realizes encoding of a whole CD in 1-2 minutes.

CiNii
A Hierarchical Coarse Grain Task Static Scheduling Scheme on a Heterogeneous Multicore

WADA YASUTAKA, HAYASHI AKIHIRO, IYOKU TAKETO, MASUURA TAKESHI, SHIRAKO JUN, NAKANO HIROFUMI, SHIKANO HIROAKI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2007 ( 79 ) 97 - 102 2007.08

　View Summary

This paper proposes a static scheduling scheme for hierarchical coarse grain task parallel processing on a heterogeneous multicore processor. A heterogeneous multicore processor integrates not only general purpose processors but also accelerators like dynamically reconfigurable processors (DRPs) or digital signal processors (DSPs). Effective usage of these accelerators allows us to get high performance and low power consumption at the same time. In the proposed scheme, the compiler extracts parallelism using coarse grain parallel processing and assigns tasks considering characteristics of each core to minimize the execution time of an application. Performance of the proposed scheme is evaluated on a heterogeneous multicore processor using MP3 encoder. Hetero-geneous configurations give us 12.64 times speedup with two SH4As and two DRPs and 24.48 times speedup with four SH4As and four DRPs against sequential execution with one SH4A core.

CiNii
Compiler Control Power Saving for Heterogeneous Multicore Processor

HAYASHI AKIHIRO, IYOKU TAKETO, NAKAGAWA RYO, MASUURA TAKESHI, MATSUMOTO SHIGERU, YAMADA KAITO, OSHIYAMA NAOTO, SHIRAKO JUN, WADA YASUTAKA, NAKANO HIROFUMI, SHIKANO HIROAKI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2007 ( 79 ) 103 - 108 2007.08

　View Summary

Multicore processors are getting introduced for performance improvement and reduction of power dissipation in various IT fields, such as consumer electronics, PCs, servers and super-computers. Especially, heterogeneous multicores have attracted much attention in consumer electronics to achieve higher performance per watt. In order to satisfy the demand for the high performance, low power dissipation and high software productivity, Parallelizing compilers for both parallelization and Frequency and Voltage control are required. This paper describes the evaluation results of compiler control power saving for a heterogeneous multicore processor which integrates upto 4 general purpose embedded processor Renesas SH4As and 4 accelerator core like dynamically reconfigureable processors Hitachi FE-GAs. Performance evaluation shows the heterogeneous multicore gave us 24.32 times speed up against sequential processing and 28.43% energy savings for MP3 encoding program without performance degradation.

CiNii
A 4320MIPS four Processor-core SMP/AMP with Individually Managed Clock Frequency for Low Power Consumption

HAYASE Kiyoshi, YOSHIDA Yutaka, KAMEI Tatsuya, SHIBAHARA Shinichi, NISHII Osamu, HATTORI Toshihiro, HASEGAWA Atsushi, TAKADA Masashi, IRIE Naohiko, UCHIYAMA Kunio, ODAKA Toshihiko, TAKADA Kiwamu, KIMURA Keiji, KASAHARA Hironori

IPSJ SIG Notes 2007 ( 55 ) 31 - 35 2007.05

　View Summary

4320MIPS 4-processor SoC that provides with low power consumption and high performance was designed using 90nm process. The 32KB-data cache is built into each processor, and the module to maintain the coherency of the data cache between processors is built into. A low electric power is achieved by frequency control of each processor according to amount of processing and adopting sleep mode that maintains coherency of the data cache between processors.

CiNii
Multigrain Parallel Processing in SMP Execution Mode on a Multicore for Consumer Electronics

MASE Masayoshi, BABA Daisuke, NAGAYAMA Harumi, TANO Hiroaki, MASUURA Takeshi, MIYAMOTO Takamichi, SHIRAKO Jun, NAKANO Hirofumi, KIMURA Keiji, KAMEI Tatsuya, HATTORI Toshihiro, HASEGAWA Atsushi, ITO Masaki, SATO Makoto, UCHIYAMA Kunio, ODAKA Toshihiko, KASAHARA Hironori

IPSJ SIG Notes 2007 ( 55 ) 25 - 30 2007.05

　View Summary

Currently, multicore processors are becoming ubiquitous in various computing domains, namely consumer electronics such as games, car navigation systems and mobile phones, PCs, and supercomputers. This paper describes parallelization of media processing programs written in restricted C language by OSCAR multigrain parallelizing compiler and SMP processing performance on RP1 4-core SH-4A (SH-X3) multicore processor developed by Renesas Technology Corp. and Hitachi, Ltd. based on standard OSCAR multicore memory architecture as a part of NEDO "Research and Development of Multicore Technology for Real Time Consumer Electronics Project". Performance evaluation shows OSCAR compiler achieved 3.34 times speedup using 4 cores against using 1 core for AAC audio encoder.

CiNii
A 4320MIPS four Processor-core SMP/AMP with Individually Managed Clock Frequency for Low Power Consumption

HAYASE Kiyoshi, YOSHIDA Yutaka, KAMEI Tatsuya, SHIBAHARA Shinichi, NISHII Osamu, HATTORI Toshihiro, HASEGAWA Atsushi, TAKADA Masashi, IRIE Naohiko, UCHIYAMA Kunio, ODAKA Toshihiko, TAKADA Kiwamu, KIMURA Keiji, KASAHARA Hironori

IEICE technical report 107 ( 76 ) 31 - 35 2007.05

　View Summary

4320MIPS 4-processor SoC that provides with low power consumption and high performance was designed using 90nm process. The 32KB-data cache is built into each processor, and the module to maintain the coherency of the data cache between processors is built into. A low electric power is achieved by frequency control of each processor according to amount of processing and adopting sleep mode that maintains coherency of the data cache between processors.

CiNii
Multigrain Parallel Processing in SMP Execution Mode on a Multicore for Consumer Electronics

MASE Masayoshi, BABA Daisuke, NAGAYAMA Harumi, TANO Hiroaki, MASUURA Takeshi, MIYAMOTO Takamichi, SHIRAKO Jun, NAKANO Hirofumi, KIMURA Keiji, KAMEI Tatsuya, HATTORI Toshihiro, HASEGAWA Atsushi, ITO Masaki, SATO Makoto, UCHIYAMA Kunio, ODAKA Toshihiko, KASAHARA Hironori

IEICE technical report 107 ( 76 ) 25 - 30 2007.05

　View Summary

Currently, multicore processors are becoming ubiquitous in various computing domains, namely consumer electronics such as games, car navigation systems and mobile phones, PCs, and supercomputers. This paper describes parallelization of media processing programs written in restricted C language by OSCAR multigrain parallelizing compiler and SMP processing performance on RP1 4-core SH-4A (SH-X3) multicore processor developed by Renesas Technology Corp. and Hitachi, Ltd. based on standard OSCAR multicore memory architecture as a part of NEDO "Research and Development of Multicore Technology for Real Time Consumer Electronics Project". Performance evaluation shows OSCAR compiler achieved 3.34 times speedup using 4 cores against using 1 core for AAC audio encoder.

CiNii
A Local Memory Management Scheme in Multigrain Parallelizing Compiler

MIURA TSUYOSHI, TAGAWA TOMOHIRO, MURAMATSU YUSUKE, IKEMI AKINORI, NAKAGAWA MASAHIRO, NAKANO HIROFUMI, SHIRAKO JUN, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2007 ( 17 ) 61 - 66 2007.03

　View Summary

Multicore systems have been attracting much attention for performance, low power consumption and short hardware/software development period. To take the full advantage of multiprocessor systems, parallelizing compilers serve important roles. On multicore processor, a memory wall caused by the speed gap between processor core and memory is also serious problem. Therefore, it is important for performance improvement to use fast memolies like cache and local memory nearby a processor effectively. This paper proposes a local memory management scheme for coarse grain task parallel processing. In the evaluation using SPeC 95fp tomcatv, the proposed scheme using 8 processors achieved 19.6 times speedup against the sequantial execution without the proposed scheme on the OSCAR multicore processor by the effective use of local memories.

CiNii
Automatic Parallelization for Multimedia Applications on Multicore Processors

MIYAMOTO TAKAMICHI, ASAKA SAORI, KAMAKURA NOBUHITO, YAMAUCHI HIROMASA, MASE MASAYOSHI, SHIRAKO JUN, NAKANO HIROFUMI, KIMURA KEIJI, KASAHARA HIRONORI

2007 ( 4 ) 69 - 74 2007.01

　View Summary

Multicore processors have attracted much attention to handle the increase of power consumption along with the increase of integration degree of semiconductor devices, the slowdown of improvement of processor clocks, and the increase of hardware/software developing period. Also, speeding up multimedia applications is required with the progress of the consumer electronics like mobile phones, digital TV and games. This paper describes parallelization methods of multimedia applications on the multicore processors. Especially in this paper, MPEG2 encoding and MPEG2 decoding are selected as examples of video sequence processing, MP3 encoding is selected as an example of audio processing, JPEG 2000 encoding is selected as an example of picture processing. OSCAR multigrain parallelizing compiler automatically parallelizes these media applications. This paper evaluates parallel processing performances of these multimedia applications on the OSCAR multicore processor, and the IBM p5 550Q Power5+ 8 processors SMP server. On the OSCAR multicore processor, the parallel execution with the proposed method of managing local memory and optimizing data transfer using 4 processors, gives us 3.81 times speedup for MPEG2 encoding, 3.04 times speedup for MPEG2 decoding, 3.09 times speedup for MP3 encoding, 3.79 times speedup for JPEG 2000 encoding against the sequential execution. On the IBM p5 550Q Power5+ 8 processors server, the parallel execution using 8 processors gives us 5.19 times speedup for MPEG2 encoding, 5.12 times speedup for MPEG2 decoding, 3.69 times speedup for MP3 encoding, 4.32 times speedup for JPEG 2000 encoding against the sequential execution.

CiNii
Automatic Parallelization of Restricted C Programs in OSCAR Compiler

MASE MASAYOSHI, BABA DAISUKE, NAGAYAMA HARUMI, TANO HIROAKI, MASUURA TAKESHI, FUKATSU KOJI, MIYAMOTO TAKAMICHI, SHIRAKO JUN, NAKANO HIROFUMI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2006 ( 127 ) 1 - 6 2006.11

　View Summary

Along with the popularization of multiprocessors and multicore architectures, automatic parallelizing compiler, which can realize high effective performance and low power comsumption, becomes more and more important in various areas from high performance computing to embedded computing. OSCAR compiler realizes multigrain automatic parallelization, which can exploit parallelism and data locality from the whole of the program. This paper describes C language support in OSCAR compiler. For rapid support of C language, restricted C language is proposed. In the preliminary performance evaluation of automatic parallelization using following media applications as MPEG2 encode, MP3 encode, and AAC encode, Susan (smoothing) derived from MiBench, and Art from SPEC2000, OSCAR compiler achieved 7.49 times speed up in maximum for susan (smoothing) against sequential execution on IBM p5 550 server having 8 processors, and 3.75 times speed up in maximum for susan (smoothing) too against sequential execution on Sun Ultra80 workstation having 4 processors.

CiNii
Performance of OSCAR Multigrain Parallelizing Compiler on SMP Servers and Embedded Multicore

SHIRAKO JUN, TAGAWA TOMOHIRO, MIURA TSUYOSHI, MIYAMOTO TAKAMICHI, NAKANO HIROFUMI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2006 ( 127 ) 7 - 12 2006.11

　View Summary

Currently, multiprocessor systems, especially multicore processors, are attracting much attention for performance, low power consumption and short hardware/software development period. To take the full advantage of multiprocessor systems, parallelizing compilers serve important roles. This paper describes the execution performance of OSCAR multigrain parallelizing compiler using coarse grain task parallelization and near fine grain parallelization in addition to loop parallelization, on the latest SMP servers and a SMP embedded multicore. The OSCAR compiler has realized the automatic determination of parallelizing layer, which decides the suitable number of processors and parallelizing technique for each nested part of the program, and global cache memory optimization over loops and coarse grain tasks. In the performance evaluation using 10 SPEC CFP95 benchmark programs and 4 SPEC CFP2000, OSCAR compiler gave us 2.74 times speedup compared with IBM XL Fortran compiler 10.1 on IBM p5 550Q Power5+8 processors server, 4.82 times speedup compared with IBM XL Fortran compiler 8.1 on IBM pSeries690 Power4 24 processors server. OSCAR compiler can be also applied for NEC/ARM MPCore ARMv6 4 processors low power embedded multicore, using subset of OpenMP libraries and g77 compiler. In the evaluation using SPEC CFP95 benchmarks with reduced data sets, OSCAR compiler achieved 4.08 times speedup for tomcatv, 3.90 times speedup for swim, 2.21 times speedup for su2cor, 3.53 times speedup for hydro2d, 3.85 times speedup for mgrid, 3.62 times speedup for applu and 3.20 times speedup for turb3d against the sequential execution.

CiNii
Local Memory Management on OSCAR Multicore

NAKANO HIROFUMI, NITO TAKUMI, MARUYAMA TAKANORI, NAKAGAWA MASAHIRO, SUZUKI YUKI, NAITO YOSUKE, MIYAMOTO TAKAMICHI, WADA YASUTAKA, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2006 ( 88 ) 163 - 168 2006.07

　View Summary

Along with the advancement of integration technology of semiconductor devices, to overcome the increase of power consumption, the slowdown of processor effective performance improvement rate, and the increase of period for hardware/software developing transistors integrated on to a chip, multicore processors have attracted much attention as a next-generation microprocessor architecture. However, the memory wall caused by the gap between memory access speed and processor core speed is getting a serious problem also on the multicore processors. Therefore, the effective use of fast memories like cache and local memory nearby a processor is important. Considering these problems, the authors have proposed the OSCAR multicore processor architecture which cooperates with OSCAR multigrain parallelizing compiler and aims at developing a processor with high effective performance and good cost performance. The OSCAR multicore processor has local data memory (LDM) for processor private data, distributed shared memory (DSM) having two ports for synchronization and data transfer among processor cores, centralized shared memory (CSM) to support dynamic task scheduling, and data transfer unit (DTU) which transfers data asynchronously and aims at overlapping data transfer overhead. This paper describes data localization scheme that aimed at improving the effective use of LDM and DSM using coarse grain task parallel processing and compiler-controlled LDM and DSM management scheme. As the results, the proposed scheme gives us 7.1 times speedup for MP3 encoding program, 6.3 for MPEG2 encoding program and 3.8 for JPEG2000 encoding program against the sequential execution without the proposed scheme on 8 processors automatically.

CiNii
Data Transfer Overlap of Coarse Grain Task Parallel Processing on a Multicore Processor

MIYAMOTO TAKAMICHI, NAKAGAWA MASAHIRO, ASANO SHOICHIRO, NAITO YOSUKE, NITO TAKUMI, NAKANO HIROFUMI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2006 ( 20 ) 55 - 60 2006.02

　View Summary

Along with the increase of integration degree of semiconductor devices, to overcome the increase of power consumption, the slowdown of improvement of processor effective performance, and the increase of period for hardware/software developing transistors integrated on to a chip, multicore processors, have attracted much attention as a next-generation microprocessor architecture. However, the memory wall caused by the gap between memory access speed and processor core speed is still a serious problem also on the multicore processors. Therefore, the effective use of fast memories like cache and local memory nearby processor is important for reducing large memory access overhead. Futhermore, hiding data transfer overhead among local or distributed shared memories of processors and centralized shared memory is important. On the memory architechture, the data transfer is specified. Considering these problems, the authors have proposed the OSCAR multicore processor architecture which cooperates with OSCAR multigrain parallelizing compiler and aims at developing a processor with high effective performance and good cost performance computer system. The OSCAR multicore processor has local data memory (LDM) for processor private data, distributed shared memory (DSM) having two ports for synchronization and data transfer among processor cores, centralized shared memory (CSM) to support dynamic task scheduling, and data transfer unit (DTU) which transfers data asynchronously and aims at overlapping data transfer overhead. This paper proposes and evaluates a static data transfer scheduling algorithm aiming at overlapping data transfer overhead. As the results, the proposed scheme controlled by OSCAR compiler gives us 2.86 times speedup using 4 processors for JPEG2000 encoding program against the ideal sequential execution assuming that the all data can be assigned to the local memory.

CiNii
A Static Scheduling Scheme for Coarse Grain Tasks on a Heterogeneous Chip Multi Processor

WADA YASUTAKA, OSHIYAMA NAOTO, SUZUKI YUKI, SHIRAKO JUN, NAKANO HIROFUMI, SHIKANO HIROAKI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2006 ( 8 ) 13 - 18 2006.01

　View Summary

This paper proposes a static scheduling scheme for coarse grain tasks on a heterogeneous chip multi processor which integrates not only general purpose processors but also accelerators like DRP or DSP. A heterogeneous chip multi processor allows us to get high performance by using the accelerators and to save energy by frequency/voltage control by the compiler. In this scheme, the compiler aim to minimize the execution time of an application in consideration of the characteristic in each core. Performance of the proposed scheme is evaluated on a heterogeneous chip multi processor which has 4 general purpose processors and 2 accelerators using MP3 encoder and gives us 8.8 times speedup against sequencial execution without the proposed scheme.

CiNii
Preliminary Evaluation of Heterogeneous Chip Multi-Processor with MP3 Audio Encoder

SHIKANO Hiroaki, SUZUKI Yuki, WADA Yasutaka, SHIRAKO Jun, KIMURA Keiji, KASAHARA Hironori

IPSJ SIG Notes 2006 ( 8 ) 1 - 6 2006.01

　View Summary

This paper proposes a heterogeneous chip multi-processor (HCMP) that possesses different types of processing elements (PEs) such as CPUs as general-purpose processors, as well as digital signal processors or dynamic reconfigurable processors (DRPs) as special-purpose processors. The HCMP realizes higher performance than conventional single-core processors or even homogeneous multi-processors in some specific applications such as media processing, with low operating frequency supplied, which results in lower power consumption. In this paper, the performance of the HCMP is analyzed by studying parallelizing scheme and power control scheme of an MP3 audio encoding program and by scheduling the program onto the HCMP using these two schemes. As a result, it is confirmed that an HCMP, consisting of three CPUs and two DRPs, outperforms a single-core processor with one CPU by a speed-up factor of 16.3, and a homogeneous multi-processor with 5 CPUs by a speed-up factor of 4.0. It is also confirmed that the power control on the HCMP results in 24% power reduction.

CiNii
Beyond the Computer - Publicity Activity of Department of Computer Science, Waseda University -

KIMURA Keiji

IPSJ Magazine 46 ( 10 ) 1155 - 1157 2005.10

CiNii
Performance Evaluation of Electronic Circuit Simulation Using Code Generation Method without Array Indirect Access

KURODA AKIRA, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2005 ( 7 ) 1 - 6 2005.01

　View Summary

This paper evaluates performance of a fast sequential circuit simulation scheme using the loop free code without the array indirect accesses. This scheme allows us to get several tens of times higher processing performance than SPICE version 3f5 on a WS and a PC. The array indirect accesses for the sparse matrix solution in SPICE have been one of the factors that prevents from efficient processing. This paper describes the circuit simulation scheme using loop free code without any array indirect accesses and its performance evaluation shows the scheme gives us 2 to 110 times better performance than SPICE3f5 on a WS and a PC. The performance by reducing the memory accesses overhead significantly.

CiNii
Performance of OSCAR Multigrain Parallelizing Compiler on Shared Memory Multiprocessor Servers}

SHIRAKO JUN, MIYAMOTO TAKAMICHI, ISHIZAKA KAZUHISA, OBATA MOTOKI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2005 ( 7 ) 21 - 26 2005.01

　View Summary

The needs for automatic parallelizing compilers are getting larger with widly use of multiprocessor systems. However, the loop parallelization techniques are almost matured and new generation of parallelization methods like multi-grain parallelization are required to achieve higher effective performance. This paper describes the performance of OSCAR multigrain parallelizing compiler that uses the coarse grain task parallelization and the near fine grain parallelization in addition to the loop parallelization. OSCAR compiler realizes the following two important techniques. The first is the automatic determination scheme of parallelizing layer, which decides the number of processors and parallelizing technique for each part of the program. The other is global cache memory optimization among loops and coarse grain tasks. In the evaluation using SPEC95FP benchmarks, OSCAR compiler gave us 4.78 times speedup compared with IBM XL Fortran compiler 8.1 on IBM pSeries690 Power4 24 processors server, 2.40 times speedup compared with Intel Fortran Itanium Compiler 7.1 on SGI Altix3700 Itanium2 16 processors server, 1.90 times speedup compared with Sun Forte compiler 7.1 on Sun Fire V880 Ultra SPARC III Cu 8 processors server.

CiNii
Parallel Processing for MPEG2 Encoding on OSCAR Chip Multiprocessor

KODAKA TAKESHI, NAKANO HIROHUMI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2004 ( 123 ) 53 - 58 2004.12

　View Summary

This paper proposes a coarse grain task parallel processing scheme for MPEG2 encoding using data localization which optimizes execution efficiency assigning coarse grain tasks accessing the same array data on the same processor consecutively on a chip multiprocessor and data transfer overlapping technique which minimize the data transfer overhead by overlapping task execution and data transfer. Performance of the proposed scheme is evaluated. As the evaluation result on an OSCAR chip multiprocessor architecture, the proposed scheme gave us 1.24 times speedup for 1 processor, 2.47 times speedup for 2 processors. 4.57 times speedup for 4 processors, 7.97 times speedup for 8 processors and 11.93 times speedup for 16 processors respectively against the sequential execution on a single processor without the proposed scheme.

CiNii
Data Localization using Data Transfer Unit on OSCAR Chip Multiprocessor

NAKANO HIROFUMI, NAITO YOSUKE, SUZUKI TAKAHISA, KODAKA TAKESHI, ISHIZAKA KAZUHISA, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2004 ( 80 ) 115 - 120 2004.07

　View Summary

Recently, Chip Multiprocessor (CMP) architecture has attracted much attention as a next-generation micro-processor architecture, and many kinds of CMP have widely developed. However, these CMP architectures still have the problem of effective use of memory system nearby processor cores such as cache and local memory. On the other hand, the authors have proposed OSCAR CMP, which cooperatively works with multigrain parallel processing, to achieve high effective performance and good cost effectiveness. To overcome the problem of effective use of cache and local memory, OSCAR CMP has local data memory (LDM) for processor private data and distributed shared memory (DSM) having two ports for synchronization and data transfer among processor cores, centralized shared memory (CSM) to support dynamic task scheduling, and data transfer unit(DTU) for asynchronous data transfer. The multigrain parallelizing compiler uses such memory architecture of OSCAR CMP with data localization scheme that fully uses compile time information. This paper proposes a coarse grain task static scheduling scheme considering data localization using live variable analysis. Data is transferred in burst mode using automatically generated DTU instructions.

CiNii
Evaluation of Multigrain Parallelism on OSCAR Chip Multi Processor

WADA YASUTAKA, SHIRAKO JUN, ISHIZAKA KAZUHISA, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2004 ( 80 ) 61 - 66 2004.07

　View Summary

This paper describes performance of multigrain parallel processing of SPEC CFP 95 on OSCAR Chip Multi Processor (OSCAR CMP). OSCAR multigrain parallelizing compiler, which exploits statement level near-fine grain parallelism, loop iteration level parallelism and coarse grain parallelism hierarchically, allows us to fully control hardware on OSCAR CMP. Also, this cooperation realizes high software productivity and effective use of hardware resources. Performance of multigrain parallel processing of SPEC CFP 95 benchmark programs on OSCAR CMP with 8 processor cores and centralized shared memory were 2.03 to 7.79 times speedup against sequential execution using 400MHz clock cycles for embedded use and 1.89 to 7.05 times speedup against sequential execution using 2.8GHz clock cycles for high-end use.

CiNii
Parallel Processing for MPEG2 Encoding using Data Localization

KODAKA TAKESHI, NAKANO HIROHUMI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2004 ( 12 ) 13 - 18 2004.02

　View Summary

Recently, many people are getting to enjoy multimedia applications with image and audio processing on PCs, mobile phones and PDAs. For this situation, development of low cost, low power consumption and high performance processors for multimedia applications has been expected. To satisfy these demands, chip multiprocessor architectures which allows us to attain scalability using coarse grain level parallelism and loop level parallelism in addition to instruction level parallelism are attracting much attention. However, in order to extract much performance from chip multiprocessor architectures efficiently, highly sophisticated technique is required such as decomposing a program into adequate grain of tasks and assigning them onto processors considering parallelism and data locality of target applications. This paper describes a parallel processing scheme for MPEG2 encoding using data localization which improve execution efficiency assigning coarse grain tasks sharing same data on a same processor consecutively for a chip multiprocessor, and evaluate its performance. As the evaluation result on OSCAR CMP using 8 processors, proposed scheme gives us 1.64 times speedup against loop parallel processing, and 6.82 times speedup against sequential execution time.

CiNii
The Data Prefetching of Coarse Grain Task Parallel Processing on Symmetric Multi Processor Machine

MIYAMOTO TAKAMICHI, YAMAGUCHI TAKAHIRO, TOBITA TAKAO, ISHIZAKA KAZUHISA, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2003 ( 119 ) 63 - 68 2003.11

　View Summary

On the shared multi processor system used in current computing servers, the increase of memory access overhead with the speedup of CPU interfere to get the scalable performance improvement with the increase of the processors. In order to get scalable performance improvement, this paper proposes and evaluates the static scheduling algorithm which reduces the memory access overhead by using cache prefetch to overlap of data transfer and task processing. The proposed algorithm is used in static scheduling stage in a compiler, moreover the compiler generates a OpenMP parallelized Fortran program with prefetch directives for SUN Forte compiler for Sun Fire V880 server. Performance evaluation shows that the proposed algorithm gave us super linear speedup compared with sequential processing without prefetching by Sun Forte compiler such as 13.9 times speedup on 8 processors for SPEC95fp tomcatv program and 22.3 times speedup on 8 processors for SPEC95fp swim program. Futhermore, compared with automatic prefetching by SUN Forte compiler using the same number of processors, this algorithm shows that 1.1 times speedup on 1 processor, 3.86 times speedup on 8 processors for SPEC95fp tomcatv and 1.44 times speedup on 1processor, 1.85 times speedup on 8 processors for SPEC95fp swim.

CiNii
Data Localization Scheme using Static Scheduling on Chip Multiprocessor

NAKANO HIROFUMI, KODAKA TAKESHI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2003 ( 84 ) 79 - 84 2003.08

　View Summary

Recently, chip multiprocessor architecture that contains multiple processors on a chip becomes popular approach even in commercial area. The authors have proposed OSCAR chip multiprocessor (OSCAR CMP) that is aimed at exploiting multiple grains of parallelism hierarchically from a sequential program on a chip. OSCAR CMP has local data memory (LDM) for processor private data and distributed shared memory having two ports for processor shared data to control data allocation by a compiler appropriately. This paper describes data localization scheme for OSCAR CMP which exploits data locality by assigning coarse grain tasks sharing same data on a same processor consecutively. In addition, OSCAR CMP using data localization scheme is compared with shared cache architecture and snooping cache architecture. Then, current naive code generation for OSCAR CMP is considered using evaluation results.

CiNii
Parallel Processing on MPEG2 Encoding for OSCAR Chip Multiprocessor

KODAKA TAKESHI, NAKANO HIROHUMI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2003 ( 84 ) 55 - 60 2003.08

　View Summary

Recently, multimedia applications with visual and sound processing are popular on mobile phones and PDAs. To satisfy the needs for efficient multimedia processing, development of low cost, low power consumption and high performance processors for multimedia applications has been expected. Chip multiprocessor architectures which allows us to attain scalability using coarse grain level parallelism and loop level parallelism in addition to instruction level parallelism are attracting much attention. However, to realize efficient processing on chip multiprocessor architectures, parallel processing techniques such as decomposing a program into adequate tasks considering characteristics of a program and assigning these tasks onto processors are essential. This paper describes a parallel processing scheme for MPEG2 encoding for a chip multiprocessor and its performance.

CiNii
Data Localization using Coarse Grain Task Parallelization on Chip Multiprocessor

NAKANO HIROFUMI, KODAKA TAKESHI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2003 ( 10 ) 13 - 18 2003.01

　View Summary

Recently. Chip Multiprocessor(GMP)architecture has attracted much attention as a next-generation microprocessor architecture. and many kinds of GMP have widely developed. However, these GMP architectures still have the problem of effective use of memory system nearby processor cores such as cache and local memory. On the other hand, the authors have proposed OSCAR GMP. which cooperatively works with multigrain parallel processing, to achieve high effective performance and good cost effectiveness. To overcome the problem of effective use of cache and local memory. OSCAR GMP has local data memory(LDM)for processor private data and distributed shared memory(DSN) having two por for synchronization and data transfer among processor cores, in addition to centralized shared memory (CSM). The multigrain parallelizing compiler uses such memory architecture of OSCAR GMP with data localization scheme that fully uses compile time information. This paper proposes a coarse grain task static scheduling scheme considering data localization using live variable analysis. Furthermore, data transfer between CSM and LDM insertion scheme using information of live variable analysis is also described. This data localization scheme is implemented on OSCAR FORTRAN multigrain parallelizing compiler and is evaluated on OSCAR GMP using Tomcatv form SPEC fp 95 benchmark suite. As the results, the proposed scheme gives us about 1.3 times speedup using 20 clocks as the access latency of CSM, and about 1.6 times using 40 clocks as the access latency of CSM respectively against without data localization scheme.

CiNii
Multigrain Parallel Processing on OSCAR Chip Multiprocessor

KIMURA KEIJI, KODAKA TAKESHI, OBATA MOTOKI, KASAHARA HIRONORI

IPSJ SIG Notes 2002 ( 112 ) 29 - 34 2002.11

　View Summary

This paper describes multigrain parallel processing on OSCAR Chip Multiprocessor (OSCAR CMP). The aim of OSCAR CMP is to achieve both of scalable performance improvement with effective use of huge number of transistors on a chip and high efficiency of application development with compiler supports. OSCAR CMP integrates simple single issue processors having local data memory for private data recognized by compiler, distributed shared data memory for optimal use of data locality over different loops. The compiler controllable data transfer unit for overlapping data transfer, and the multigrain parallelizing compiler, which exploits statement level near-fine grain parallelism, loop iteration level parallelism and coarse grain task parallelism hierarchically, fully controls these hardwares. Performance of multigrain parallel processing on OSCAR CMP is evaluated using SPEC fp 2000/95 benchmark suite. When microSPARC like single issue core is used, OSCAR CMP having four CPU cores gives us 2.98 times speedup in HYDRO2D, 3.84 times in TOMCATV, 3.84 times in MGRID, 3.97 times in SWIM, 2.36 times in FPPPP, 2.88 times in TURB3D, 2.64 times in SU2COR, 2.29 times in APPLU and 1.77 times in APSI.

CiNii
Multigrain Parallel Processing on Motion Vector Estimation for Single Chip Multiprocessor

KODAKA TAKESHI, SUZUKI TAKAHISA, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2002 ( 112 ) 23 - 28 2002.11

　View Summary

With the recent increase of multimedia contents using JPEG and MPEG, low cost, low power consumption and high performance processors for multimedia application have been expected. Particularly, single chip multiprocessor architectures having simple processor cores that will be able to attain scalability and cost effectiveness are attracting much attention to develop such processors. Single chip multiprocessor architectures allow us to exploit coarse grain task level and loop level parallelism in addition to the instruction level parallelism, so parallel processing technology is indispensable to allow us scalable performance improvement. This paper describes a multigrain parallel processing scheme for motion vector estimation for a single chip multiprocessor and its performance is evaluated.

CiNii
Evaluation of Overhead with Coarse Grain Task Parallel Processing on SMP Machines

WADA YASUTAKA, NAKANO HIROFUMI, KIMURA KEIJI, OBATA MOTOKI, KASAHARA HIRONORI

IPSJ SIG Notes 2002 ( 37 ) 13 - 18 2002.05

　View Summary

Coarse grain task parallel processing, which exploits parallelism among loops, subroutines and basic blocks, is getting more important to attain performance improvement on multiprocessor architectures. To efficiently implement the coarse grain task parallel processing. it is important to analyze various processor overhead quantitatively. This paper evaluates overheads of barrier synchronization, thread fork/join and L2 cache miss penalty are using performance measurement mechanisms to analyze the performance improvements by OSCAR Fortran compiler on Sun Ultra80, IBM RS6000 and SGI Origin2000.

CiNii
Multigrain Parallel Processing for JPEG Encoding Program on an OSCAR type Single Chip Multiprocessor

KODAKA TAKESHI, UCHIDA TAKAYUKI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2002 ( 9 ) 19 - 24 2002.02

　View Summary

With the recent increase of multimedia contests using JPEG and MPEG, low cost, low power consumption and high performance processors for multimedia have been expected. Particularly, single chip multiprocessor architecture having simple processor cores is attracting much attention to develop such processors. This paper describes multigrain parallel processing scheme for a JPEG encoding program for OSCAR type single chip multiprocessor and its performance. The evaluation shows an OSCAR type single chip multiprocessor having four single-issue simple processor cores gave us 3.59 times speed-up than sequencial execution and 2.87 times speed-up than OSCAR type single chip multiprocessor that has a four-issue UltraSPARC-II type super-scaler processor core.

CiNii
Near Fine Grain parallel Processing on Multimedia Application for Single Chip Multiprocessor

KODAKA TAKESHI, MIYASHITA NAOHISA, KIMURA KEIJI, KASAHARA HIRONORI

ARC 2001 ( 76 ) 61 - 66 2001.07

　View Summary

With the recent increase of multimedia contents, such as JPEG and MPEG data, low cost and low power consumption processors that can process these multimedia contents efficiently are expected. In such microprocessors, single chip multiprocessor architecture having simple processor cores is attracting much attention. Considering the above facts, this paper evaluate a JPEG encoding program on OSCAR type single chip multiprocessor architecture using near fine grain parallel processing for 8×8 pixel block that is a fundamental part of JPEG algorithm. The evaluation shows an OSCAR type single chip multiprocessor having four single-issue simple processor cores gives 2.32 times speedup than four-issue UltraSPARC-II type super-scaler processor.

CiNii
A Static Scheduling Scheme for Coarse Grain Tasks considering Cache Optimization on SMP

NAKANO HIROFUMI, ISHIZAKA KAZUHISA, OBATA MOTOKI, KIMURA KEIJI, KASAHARA HIRONORI

IPSJ SIG Notes 2001 ( 76 ) 67 - 72 2001.07

　View Summary

Effective use of cache memory based on data locality is getting more important with increasing gap between the processor speed and memory access speed. As to parallel processing on multiprocessor systems, it seems to be difficult to achieve large performance improvement only with the conventional loop iteration level parallelism. This paper proposes a coarse grain task static scheduling scheme considering cache optimization. The proposed scheme is based on the macro data flow parallel processing that uses coarse grain task parallelism among tasks such as loop blocks, subroutines and basic blocks. It is implemented on OSCAR Fortran multigrain parallelizing compiler and evaluated on Sun Ultra80 four-processor SMP machine, using swim and tomcatv from the SPEC fp 95 benchmark suite. As the results, the proposed scheme gives us 4.56 times speedup for swim and 2.37 times for tomcatv respectively against the Sun Forte HPC 6 loop parallelizing compiler on 4 processors.

CiNii
Distributed Application Server implemented on distributed objects

YAMASHITA Sekiya, KIMURA Keiji

IPSJ SIG Notes 2001 ( 39 ) 1 - 6 2001.05

　View Summary

Since the Internet became popular, especially for utilization of WWW service to increase rapidly, server software has attracted a great deal of attention. Server software's purposes are efficiency of www service's development and maintenance as well as load balancing of WWW site's congestion. Such server software is called application servers. In this study, we propose distributed application servers that distributed software architecture are applied to application servers. Distributed software architecture is composed of distributed objects. Our goal is the development of application serves that has high scalability and load tolerance. In this paper, we describe the proposal, design policy and preliminary assessment's result of distributed application servers.

CiNii
Processor Core Architecture of Single Chip Multiprocessor for Near Fine Grain Parallel Processing

KIMURA KEIJI, UCHIDA TAKAYUKI, KATO TAKAYUKI, KASAHARA HIRONORI

IPSJ SIG Notes 2000 ( 74 ) 91 - 96 2000.08

　View Summary

With continuously increase of transistors integrated onto a chip, it has been a very important how to achieve scalable performance improvement using these transistors effectively. Especially, exploiting different grains of parallelism in addition to instruction level parallelism and effective use of this parallelism in a single chip is getting more important. To this end, a single chip multiprocessor(SCM)architecture that contains multiple processor cores has been attracted much attention. To decide suitable SCM processor core architecture for multigrain parallel processing, this paper evaluates several SCM architectures which have different instruction issue widths and numbers of global shared register file for near fine grain parallel processing, which is one of the key issues in multigrain parallel processing.

CiNii
Memory access analyzer for a Multi-grain parallel processing

IWAI Keisuke, OBATA Motoki, KIMURA Keiji, AMANO Hideharu, KASAHARA Hironori

IEICE technical report. Computer systems 99 ( 252 ) 1 - 8 1999.08

　View Summary

Multi-grain parallel processing is proposed to exploit an inherent parallelism in application programs as much as possible. Although this method can be realized on various architectures, a dedicated multiprocessor architecture is required for achieving the maximum performance. Multiprocessor system ASCA (Advanced Scheduling oriented Computer Architecture) is proposed for efficient execution of multi-grain parallel processing. It provides various mechanisms for this purpose including a dedicated memory structure which is used efficiently both for multi-grain parallel processing A memory access analyzer is developed for investigating memory access characteristics in multi-grain parallel processing. Based on a result of analysis on a real application with multi-grain parallel processing efficient memory structure for ASCA is discussed.

CiNii
Performance Evaluation of Near Finegrain Parallel Processing on the Single Chip Multiprocessor

KIMURA KEIJI, MANAKA KUNIYUKI, OGATA WATARU, OKAMOTO MASAMI, KASAHARA HIRONORI

IPSJ SIG Notes 1999 ( 67 ) 19 - 24 1999.08

　View Summary

Advances in semiconductor technology allows us to integrate a lot of integer and floating point execution units, memory or processors on a single chip. To use these resources effectively, many researches on next generation microprocessor architectures and its software, especially compilers have been performed. In these next generation microprocessor architectures, a single chip multiprocessor (SCM) using multigrain parallel processing, which hierarchically exploits different level of parallelism from the whole program, is one of the most promising architectures. This paper evaluates performance of the SCM architectures for near fine grain parallel processing, which is one of the key issues in multigrain parallel processing, using several real application programs.

CiNii
A Cache Optimization Scheme Using Earliest Executable Condition Analysis

58 177 - 178 1999.03

CiNii
Evaluation of Multigrain Parallelism using OSCAR FORTRAN Compiler

OBATA Motoki, MATSUI Gantetsu, MATSUZAKI Hidenori, KIMURA Keiji, INAISHI Daisuke, UJIGAWA Yasushi, YAMAMOTO Terumasa, OKAMOTO Masami, KASAHARA Hironori

IPSJ SIG Notes 1998 ( 70 ) 13 - 18 1998.08

　View Summary

Currently, peak performances of supercomputers attain TELOPS order. It seems that the peak performances will continue by increase.However, supercomputers have a problem that enlargement of the world is very difficult because of relatively low cost performance and difficulty of use. In microprocessor, limitations of extraction of instruction level parallelism being used by super scalar and VLIW architecture are getting clear and single chip multiprocessor is received much attention as one of next generation processor architechture.In order to improve effective performance or cost performance, and ease of use, the authors have been proposing a Multigrain Automatic Parallelizing Compilation scheme. The multigrain parallel processing is a method which extract all parallelism from a program, such as coarse grain parallelism among subroutines, loops, and basic blocks, medium grain parallelism among loop iterations, and fine grain parallelism among instructions and statements.This paper shows effectiveness of multigrain parallel processing using OSCAR multigrain FORTRAN parallelization compiler using fluid flow problem solver ARC2D(Perfect Benchmark)as an example.

CiNii
Multigrain Parallel Processing on the Single Chip Multiprocessor

KIMURA KEIJI, OGATA WATARU, OKAMOTO MASAMI, KASAHARA HIRONORI

IPSJ SIG Notes 1998 ( 70 ) 25 - 30 1998.08

　View Summary

With the increase of the number of transistors integrated on a chip, how to use transistors efficiently and improve effective performance of a processor is getting an important problem. However, it has been thought that superscalar and VLIW which have been popular architectures would have difficulty to obtain scalable improvement of effective performance because of limitation of instruction level parallelism.To cope with this problem, the authors have been proposing a single chip multiprocessor(SCM)approach to use multi grain parallelism inside a chip, which hierarchicaly exproits loop parallelism with large parallelism and coarse grain parallelism among subroutines, loops and basic blocks in addition to instruction level parallelism.This paper describes preliminary evaluation of effectiveness of single chip multiprocessor architecture with a shared cache, global registers, distributed shared memory and/or local memory as the first step of research on SCM architecture for supporting effective realization of multi grain parallel processing.

CiNii
A Cache Optimization with Earliest Executable Condition Analysis

INAISHI Daisuke, KIMURA Keiji, FUJIMOTO Kensaku, OGATA Wataru, OKAMOTO Masami, KASAHARA Hironori

IPSJ SIG Notes 1998 ( 70 ) 31 - 36 1998.08

　View Summary

Cache optimizations by a compiler for a single processor machine have been mainly applied to a singlenested loop.On the contrary, this paper proposes a cache optimization scheme using earliest executable condition analysis for FORTRAN programs on a single processor system.OSCAR FORTRAN multi-grain automatic parallelizing compiler decomposes a FORTRAN program into three types of macrotasks(MT), such as loops, subroutines and basic blocks, and analyzes the earliest executable condition of each MT to extract coarse grain parallelism among MTs and generates a macrotask graph(MTG).The MTG represents data dependence and extended control dependence among MTs and an information of shared data among MTs.By using this MTG, a compiler realizes global code motion to use cache effectively.The code motion technique moves a MT, which accesses data accessed by a precedent MT on MTG, immediately after the precedent MT to increase a cache hit rate. This optimization is realized using OSCAR multi-grain compiler as a preprocessor to output an optimized sequential FORTRAN code.A performance evaluation shows about 62% speed up compared with original program on 167MHz UltraSPARC.

CiNii
A Multigrain Parallelizing Compiler and Its Architectural Support

KASAHARA Hironori, OGATA Wataru, KIMURA Keiji, OBATA Motoki, TOBITA Takao, INAISHI Daisuke

Technical report of IEICE. ICD 98 ( 22 ) 71 - 76 1998.04

　View Summary

Currently, difficulty of enlargement of the world market for supercomputers caused by cost-performance, which does not seem excellent for real effective performance, and need of high experience for parallel tuning is getting a problem. Also, in general purpose microprocessors, limitations of extraction of instruction level parallelism being used by super-scalar and VLIW architectures are getting clear. This paper describes a multigrain compilation technology and architectural support for it as an approach to cope with the above difficulites and develop user friendly and excellent cors performance supercomputers and single chip multiprocesors.

CiNii
Single Chip Multiprocessor Architecture for Multigrain Parallel Processing

56 101 - 102 1998.03

CiNii
A Cache Optimization with Macto-Task Earliest Execution Condition

56 303 - 304 1998.03

CiNii
Evaluation of Multi-Grain Parallelism in Scientific Programs

56 305 - 306 1998.03

CiNii
Implementation of FPGA Based Architecture Test Bed For Multi Processor System

1998 ( 18 ) 79 - 84 1998.03

CiNii
Multi-processor system for Multi-grain Parallel Processing

IWAI Keisuke, FUJIWARA Takashi, MORIMURA Tomohiro, AMANO Hideharu, KIMURA Keiji, OGATA Wataru, KASAHARA Hironori

IEICE technical report. Computer systems 97 ( 225 ) 77 - 84 1997.08

　View Summary

Multi-grain parallel processing is proposed to exploit an inherent parallelism in application programs as much as possible. Although this method can be realized on various architectures, a dedicated multiprocessor architecture is required for achieving the maximum performance. Multiprocessor system ASCA (Advanced Scheduling oriented Computer Architecture) is proposed for efficient execution of multigrain parallel processing. It provides various mechanisms for this purpose including a dedicated communication mechanism which is used efficiently both for coarse grain and near fine grain parallel processing, and a custom designed processor for static scheduling.

CiNii
A Macro Task Dynamic Scheduling Algorithm with Overlapping of Task Processing and Data Transfer

KIMURA KEIJI, HASHIMOTO SHIGERU, KOGOU MAKOTO, OGATA WATARU, KASAHARA HIRONORI

CPSY97 97 ( 225 ) 33 - 38 1997.08

　View Summary

Resently, multiprocessor systems having data transfer unit which can transfer data asynchronously with CPU are getting popular. We can hide data transfer overhead by using these data transfer unit. However, it is difficult by users to write a optimized program considering overlapping of data transfers and task processing. To hide overhead caused by data transfers, this paper proposes dynamic scheduling algorithm considering data pre-loading and post-storing for overlapping of data transfers and task processing. Preliminary performance evaluations by simulation show that the proposed scheduling scheme can reduce execution time 26% comparing with the scheduling scheme without pre-loading and post-storing.

CiNii

▼display all

Industrial Property Rights

並列化コンパイラ、並列化コンパイル装置、及び並列プログラムの生成方法

6600888

笠原博徳, 木村啓二, 梅田弾, 見神広紀

Patent
マルチプロセッサシステム

6335253

笠原博徳, 木村啓二

Patent
マルチプロセッサシステム

笠原博徳, 木村啓二

Patent
並列化コンパイル方法、並列化コンパイラ、並列化コンパイル装置、及び、車載装置

6018022

笠原博徳, 木村啓二, 林明宏, 見神広紀, 梅田弾, 金羽木洋平

Patent
並列性の抽出方法及びプログラムの作成方法

6319880

木村啓二, 林明宏, 笠原博徳, 見神広紀, 金羽木洋平, 梅田弾

Patent
マルチプロセッサシステムおよびマルチプロセッサシステムの同期方法

笠原博徳, 木村啓二

Patent
プロセッサシステム及びアクセラレータ

6103647

木村啓二, 笠原博徳

Patent
プロセッサによって実行可能なコードの生成方法、記憶領域の管理方法及びコード生成プログラム

5283128

笠原博徳, 木村啓二, 間瀬正啓

Patent
マルチプロセッサ

笠原博徳, 木村啓二

Patent
マルチプロセッサシステムおよびマルチプロセッサシステムの同期方法

笠原博徳, 木村啓二

Patent
マルチプロセッサ

4304347

笠原博徳, 木村啓二

Patent
メモリ管理方法、情報処理装置、プログラムの作成方法及びプログラム

5224498

笠原博徳, 木村啓二, 中野啓史, 仁藤拓実, 丸山貴紀, 三浦剛, 田川友博

Patent
マルチプロセッサ及びマルチプロセッサシステム

4784842

笠原博徳, 木村啓二

Patent
プロセッサ及びデータ転送ユニット

4476267

笠原博徳, 木村啓二

Patent
ヘテロジニアスマルチプロセッサ向けグローバルコンパイラ

4784827

笠原博徳, 木村啓二, 鹿野裕明

Patent
ヘテロジニアス・マルチプロセッサシステムの制御方法及びマルチグレイン並列化コンパイラ

4936517

笠原博徳, 木村啓二, 白子準, 和田康孝, 伊藤雅樹, 鹿野裕明

Patent
マルチプロセッサシステム及びマルチグレイン並列化コンパイラ

笠原博徳, 木村啓二, 白子準, 伊藤雅樹, 鹿野裕明

Patent
マルチプロセッサシステム及びマルチグレイン並列化コンパイラ

4082706

笠原博徳, 木村啓二, 白子準, 伊藤雅樹, 鹿野裕明

Patent
マルチプロセッサ

4784792

笠原博徳, 木村啓二

Patent

▼display all

Syllabus

Master's Thesis (Department of Computer Science and Communications Engineering)

Graduate School of Fundamental Science and Engineering

2025 full year
Master's Thesis (Department of Computer Science and Communications Engineering)

Graduate School of Fundamental Science and Engineering

2025 full year
IoT System Design

Graduate School of Fundamental Science and Engineering

2025 spring semester
Seminar on Advanced Processor Architecture D

Graduate School of Fundamental Science and Engineering

2025 fall semester
Seminar on Advanced Processor Architecture C

Graduate School of Fundamental Science and Engineering

2025 spring semester
Seminar on Advanced Processor Architecture B

Graduate School of Fundamental Science and Engineering

2025 fall semester
Seminar on Advanced Processor Architecture A

Graduate School of Fundamental Science and Engineering

2025 spring semester
Special Laboratory B in Computer Science and Communications Engineering

Graduate School of Fundamental Science and Engineering

2025 fall semester
Special Laboratory A in Computer Science and Communications Engineering

Graduate School of Fundamental Science and Engineering

2025 spring semester
Research on Advanced Processor Architecture

Graduate School of Fundamental Science and Engineering

2025 full year
Seminar on Advanced Processor Architecture D

Graduate School of Fundamental Science and Engineering

2025 fall semester
Seminar on Advanced Processor Architecture C

Graduate School of Fundamental Science and Engineering

2025 spring semester
Seminar on Advanced Processor Architecture B

Graduate School of Fundamental Science and Engineering

2025 fall semester
Seminar on Advanced Processor Architecture A

Graduate School of Fundamental Science and Engineering

2025 spring semester
Special Laboratory B in Computer Science and Communications Engineering

Graduate School of Fundamental Science and Engineering

2025 fall semester
Special Laboratory A in Computer Science and Communications Engineering

Graduate School of Fundamental Science and Engineering

2025 spring semester
Advanced Processor Architecture

Graduate School of Fundamental Science and Engineering

2025 spring semester
Advanced Processor Architecture

Graduate School of Fundamental Science and Engineering

2025 spring semester
Research on Advanced Processor Architecture

Graduate School of Fundamental Science and Engineering

2025 full year
IoT System Design

Graduate School of Creative Science and Engineering

2025 spring semester
Research on Advanced Processor Architecture

Graduate School of Fundamental Science and Engineering

2025 full year
Special Seminar A in Computer Science and Communications Engineering

Graduate School of Fundamental Science and Engineering

2025 spring semester
Special Seminar B in Computer Science and Communications Engineering

Graduate School of Fundamental Science and Engineering

2025 fall semester
IoT System Design

Graduate School of Advanced Science and Engineering

2025 spring semester
IoT System Design

School of Fundamental Science and Engineering

2025 spring semester
Advanced Processor Architecture Technology

School of Fundamental Science and Engineering

2025 spring semester
Project Research B

School of Fundamental Science and Engineering

2025 fall semester
Project Research A

School of Fundamental Science and Engineering

2025 spring semester
Bachelor Thesis B（Spring Semester）

School of Fundamental Science and Engineering

2025 spring semester
Language Processors [S Grade]

School of Fundamental Science and Engineering

2025 spring semester
Language Processors

School of Fundamental Science and Engineering

2025 spring semester
Bachelor Thesis A (Intensive Course)

School of Fundamental Science and Engineering

2025 an intensive course(spring and fall)
Bachelor Thesis B

School of Fundamental Science and Engineering

2025 fall semester
Bachelor Thesis A（Fall Semester）

School of Fundamental Science and Engineering

2025 fall semester
Bachelor Thesis B

School of Fundamental Science and Engineering

2025 fall semester
Bachelor Thesis A

School of Fundamental Science and Engineering

2025 spring semester
Bachelor Thesis A [S Grade]

School of Fundamental Science and Engineering

2025 spring semester
Bachelor Thesis B [S Grade]

School of Fundamental Science and Engineering

2025 fall semester
Bachelor Thesis A（Fall Semester）

School of Fundamental Science and Engineering

2025 fall semester
Bachelor Thesis B（Spring Semester）

School of Fundamental Science and Engineering

2025 spring semester
Computer Architecture B [S Grade]

School of Fundamental Science and Engineering

2025 fall semester
Bachelor Thesis A

School of Fundamental Science and Engineering

2025 spring semester
Computer Architecture B

School of Fundamental Science and Engineering

2025 fall semester
Computer Science and Engineering Laboratory B

School of Fundamental Science and Engineering

2025 spring semester
Computer Science and Engineering Laboratory A (2)

School of Fundamental Science and Engineering

2025 fall semester
Computer Architecture A [S Grade]

School of Fundamental Science and Engineering

2025 fall semester
Computer Architecture A

School of Fundamental Science and Engineering

2025 fall semester
Computer Science and Engineering Laboratory B [S Grade]

School of Fundamental Science and Engineering

2025 spring semester
Computer Science and Engineering Laboratory A [S Grade]

School of Fundamental Science and Engineering

2025 fall semester
Communications and Computer Engineering Laboratory B

School of Fundamental Science and Engineering

2025 spring semester
Bachelor Thesis B (Spring Semester)

School of Fundamental Science and Engineering

2025 spring semester
Bachelor Thesis B

School of Fundamental Science and Engineering

2025 fall semester
Bachelor Thesis A [S Grade]

School of Fundamental Science and Engineering

2025 spring semester
Bachelor Thesis B

School of Fundamental Science and Engineering

2025 fall semester
Bachelor Thesis A (Intensive Course)

School of Fundamental Science and Engineering

2025 an intensive course(spring and fall)
Bachelor Thesis A (Fall Semester)

School of Fundamental Science and Engineering

2025 fall semester
Bachelor Thesis A

School of Fundamental Science and Engineering

2025 spring semester
Bachelor Thesis A (Fall Semester)

School of Fundamental Science and Engineering

2025 fall semester
Bachelor Thesis A

School of Fundamental Science and Engineering

2025 spring semester
IoT System Design

School of Fundamental Science and Engineering

2025 spring semester
Project Research Fall

School of Fundamental Science and Engineering

2025 fall semester
Computer Science and Communications Engineering Laboratory A [S Grade]

School of Fundamental Science and Engineering

2025 fall semester
Computer Science and Communications Engineering Laboratory A

School of Fundamental Science and Engineering

2025 fall semester
Computer Architecture

School of Fundamental Science and Engineering

2025 fall semester
Introduction to Computers and Networks

School of Fundamental Science and Engineering

2025 spring semester
Graduation Thesis A (Fall) [S Grade]

School of Fundamental Science and Engineering

2025 fall semester
Project Research Spring

School of Fundamental Science and Engineering

2025 spring semester
Graduation Thesis A (Fall)

School of Fundamental Science and Engineering

2025 fall semester
Graduation Thesis A (Spring)

School of Fundamental Science and Engineering

2025 spring semester
Bachelor Thesis B [S Grade]

School of Fundamental Science and Engineering

2025 fall semester
Graduation Thesis B (Fall) [S Grade]

School of Fundamental Science and Engineering

2025 fall semester
Advanced Processor Architecture Technology

School of Fundamental Science and Engineering

2025 spring semester
Graduation Thesis A (Spring) [S Grade]

School of Fundamental Science and Engineering

2025 spring semester
Computer Science and Communications Engineering Laboratory B

School of Fundamental Science and Engineering

2025 spring semester
Graduation Thesis B (Spring) [S Grade]

School of Fundamental Science and Engineering

2025 spring semester
Graduation Thesis B (Fall)

School of Fundamental Science and Engineering

2025 fall semester
Graduation Thesis B (Spring)

School of Fundamental Science and Engineering

2025 spring semester
Advanced Processor Architecture Technology

School of Fundamental Science and Engineering

2025 spring semester
IoT System Design

School of Fundamental Science and Engineering

2025 spring semester
Project Research B

School of Fundamental Science and Engineering

2025 fall semester
Communications and Computer Engineering Laboratory A [S Grade]

School of Fundamental Science and Engineering

2025 fall semester
Computer Architecture B

School of Fundamental Science and Engineering

2025 fall semester
Communications and Computer Engineering Laboratory A

School of Fundamental Science and Engineering

2025 fall semester
Project Research A

School of Fundamental Science and Engineering

2025 spring semester
Bachelor Thesis B (Spring Semester)

School of Fundamental Science and Engineering

2025 spring semester
Computer Architecture A [S Grade]

School of Fundamental Science and Engineering

2025 fall semester
Computer Architecture A

School of Fundamental Science and Engineering

2025 fall semester
Language Processors

School of Fundamental Science and Engineering

2025 spring semester
Communications and Computer Engineering Laboratory B [S Grade]

School of Fundamental Science and Engineering

2025 spring semester

▼display all

Overseas Activities

新しいメモリ階層を考慮したソフトウェア・ハードウェアの構成法に関する研究

2017.08

-

2018.02

アメリカ North Carolina State University

Sub-affiliation

Faculty of Science and Engineering Graduate School of Fundamental Science and Engineering

Research Institute

2024

-

2026

Waseda Research Institute for Science and Engineering Concurrent Researcher
2024

-

2026

Waseda Center for a Carbon Neutral Society Concurrent Researcher

Internal Special Research Projects

深層学習フレームワークでの利用を目指した完全準同型暗号による行列計算に関する研究

2020

　View Summary

2020年度は、研究のベースとなるソフトウェアとして、Microsoft ResearchのSEAを利用し、これによる行列積演算を構成する各種処理の時間を測定し、そのオーバーヘッドと並列性の調査を行った。まず、行列積計算をOpenMPで並列化し、8コア搭載のIntel Xeon W2145(3.70GHz)で実行した結果、1コア実行時に対して約6倍の性能向上を得ることが出来た。さらに、準同型暗号による行列積演算を構成する処理をSIMD演算（AVX512）により高速化することを試みた。その結果、ライブラリ内部で使用する基本データ型を64bitから32bitに縮小しかつSIMD演算幅を増やすことで、行列演算の重要処理をSIMDオリジナルの実装に対して3.48倍高速化可能となった。 
フラグによりCPUとアクセラレータが連係するヘテロジニアスマルチコアに関する研究

2014

　View Summary

本研究は、アクセラレータを持つヘテロジニアスマルチコアに対して、アクセラレータの制御やデータ転送に要するオーバーヘッドを削減する技術の開発する。具体的にはCPU、データ転送ユニット（DTU）、及びアクセラレータを同時実行させることで上記オーバーヘッドを隠蔽可能とするタスク分割及びスケジューリング手法を開発し、自動並列化コンパイラに実装する。本年度の成果としては、まず本研究が前提とするアクセラレータの基本仕様を決定した。その上で、本アクセラレータ用のコンパイラモジュールを開発し、さらにアクセラレータのアーキテクチャシミュレータを開発することにより、本研究を行う上での基本的な評価環境を整備した。
コンパイラ解析情報と実機実行情報を利用したマルチコアシミュレーション高速化の研究

2009

　View Summary

計算機アーキテクチャの研究では、様々な構成のシステム評価を行うため、ソフトウェアによるアーキテクチャシミュレーションが大きな役割を果たしている。しかしながら、ソフトウェアシミュレータはプログラムの実行に実機の数千倍の時間がかかる。このような膨大な評価時間は今後のメニーコアの研究・開発の大きな妨げになる。本研究では、このような問題を克服するための、マルチコア・メニーコアのソフトウェアシミュレーション高速化手法の研究を行う。特に並列アーキテクチャ研究のためのシミュレーション高速化の研究に関しては、これまでミュレーションによる実験対象となる仮想のマルチコアやマルチプロセッサのコアを、シミュレータを実行する実際のマルチプロセッサのコアに割り当てるという方法が提案されてきたが、実機上の並列処理オーバーヘッドが大きく、実用的なシステムはこれまで実現されていない。本研究の特徴は、マルチコア・メニーコアのソフトウェアシミュレーションの高速化に、ループ構造や並列化情報などの並列化コンパイラによる解析情報と、評価対象アプリケーションの実機での実行情報を利用することである。これらの情報を利用し、詳細にシミュレーションする必要がある箇所とそうでない箇所を特定する。従来のソフトウェアシミュレーション高速化手法では利用されてこなかったこれらの付加的な情報を利用することで、精度の高い性能値を最小の実行コストで得ることができる。本年度は、本高速化手法の基本的な適用可能性を検討するための予備実験を行った。具体的には、二種類のマルチコアアーキテクチャのコア数を32コアまで変化させ、ベンチマークプログラムのメインループの回転数を変化させ本研究による性能値推定手法により本来のループ回転数における性能値を再現できるか調査した。ベンチマークプログラムとしてSPEC95ベンチマークのtomcatvとswim、および音声圧縮で標準的に使われているAACエンコーディングプログラムを用いた。評価の結果、いずれのアーキテクチャ、コア数、ベンチマークプログラムの組み合わせにおいても、わずか数回転分の性能値から本来の数百回転分の性能値を高々2%程度の誤差で予測することができた。今後は適用アプリケーションの拡大ならびにシステムの自動化を行う予定である。
ソフトウェア協調型チップマルチプロセッサにおけるメモリ最適化に関する研究

2004

　View Summary

本研究ではまず、データローカリティ最適化およびデータ転送最適化の基盤技術となるマルチグレイン並列化コンパイラとチップマルチプロセッサアーキテクチャプラットフォームの選定および評価基盤の整備を行った。コンパイラとしては、経済産業省ミレニアムプロジェクトIT21 アドバンスト並列化コンパイラで開発されたOSCARマルチグレイン並列化コンパイラをコアとした。また、チップマルチプロセッサアーキテクチャとしては、簡素なプロセッサコア、ローカルデータメモリ、２ポート構成の分散共有メモリ、およびデータ転送ユニットを持つプロセッシングエレメント（PE）をPE間ネットワークで接続したOSCAR型チップマルチプロセッサとした。本研究では、OSCARマルチグレイン並列化コンパイラに対してOSCAR型チップマルチプロセッサ用のバックエンド（コード生成器）を追加開発した。データローカリティ最適化およびデータ転送最適化技術開発の第一歩として、ターゲットアプリケーションには、SPECfp95ベンチマークより科学技術計算の典型例であるTomcatvとSwimプログラムを選んだ。本研究では、これらに対してタスク（並列処理の単位）とデータをデータローカリティと並列性の両方を考慮しながらPEへスケジューリングし、さらに共有メモリとプロセッサのローカルメモリ（データローカルメモリおよび分散共有メモリ）とのやり取りをプロセッサと非同期で動作するデータ転送ユニットにより処理させることにより、データローカリティ利用とデータ転送処理の効率化を行った。８PEで評価を行った結果、データローカリティ最適化を適用していない場合に対してTomcatvで1.56倍、Swimで1.38倍の速度向上を得ることができた。