Updated on 2022/09/25


MIMORI, Takahiro
Faculty of Science and Engineering, Waseda Research Institute for Science and Engineering
Job title
Junior Researcher (Assistant Professor)

Research Experience

  • 2022.04

    Waseda University   Research Institute for Science and Engineering

  • 2019.01



Research Areas

  • Life, health and medical informatics


  • Novel metric for hyperbolic phylogenetic tree embeddings.

    Hirotaka Matsumoto, Takahiro Mimori, Tsukasa Fukunaga

    Biology methods & protocols   6 ( 1 ) bpab006  2021  [International journal]


    Advances in experimental technologies, such as DNA sequencing, have opened up new avenues for the applications of phylogenetic methods to various fields beyond their traditional application in evolutionary investigations, extending to the fields of development, differentiation, cancer genomics, and immunogenomics. Thus, the importance of phylogenetic methods is increasingly being recognized, and the development of a novel phylogenetic approach can contribute to several areas of research. Recently, the use of hyperbolic geometry has attracted attention in artificial intelligence research. Hyperbolic space can better represent a hierarchical structure compared to Euclidean space, and can therefore be useful for describing and analyzing a phylogenetic tree. In this study, we developed a novel metric that considers the characteristics of a phylogenetic tree for representation in hyperbolic space. We compared the performance of the proposed hyperbolic embeddings, general hyperbolic embeddings, and Euclidean embeddings, and confirmed that our method could be used to more precisely reconstruct evolutionary distance. We also demonstrate that our approach is useful for predicting the nearest-neighbor node in a partial phylogenetic tree with missing nodes. Furthermore, we proposed a novel approach based on our metric to integrate multiple trees for analyzing tree nodes or imputing missing distances. This study highlights the utility of adopting a geometric approach for further advancing the applications of phylogenetic methods.

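    For readers unfamiliar with hyperbolic embeddings, the following is a minimal sketch of the standard distance in the Poincaré ball model, a common choice for "general hyperbolic embeddings". It is not the novel metric proposed in the paper, which the abstract does not specify; the function name `poincare_distance` is an illustrative assumption.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    Standard textbook formula, shown only for illustration; it is NOT the
    phylogeny-aware metric proposed in the paper.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)
```

    Distances grow rapidly as points approach the boundary of the ball, which is why tree-like (hierarchical) structures tend to fit this geometry better than Euclidean space.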

  • Diagnostic Uncertainty Calibration: Towards Reliable Machine Predictions in Medical Domain.

    Takahiro Mimori, Keiko Sasada, Hirotaka Matsui, Issei Sato

    International Conference on Artificial Intelligence and Statistics, AISTATS 2021     3664 - 3672  2021  [Refereed]

    Authorship:Lead author

  • CD45+CD326+ Cells are Predictive of Poor Prognosis in Non-Small Cell Lung Cancer Patients.

    Kota Ishizawa, Mie Yamanaka, Yuriko Saiki, Eisaku Miyauchi, Shinichi Fukushige, Tetsuya Akaishi, Atsuko Asao, Takahiro Mimori, Ryota Saito, Yutaka Tojo, Riu Yamashita, Michiaki Abe, Akira Sakurada, Nhu-An Pham, Ming Li, Yoshinori Okada, Tadashi Ishii, Naoto Ishii, Seiichi Kobayashi, Masao Nagasaki, Masakazu Ichinose, Ming-Sound Tsao, Akira Horii

    Clinical cancer research : an official journal of the American Association for Cancer Research   25 ( 22 ) 6756 - 6763  2019.11  [International journal]


    PURPOSE: The epithelial-to-mesenchymal transition, the major process by which some cancer cells convert from an epithelial phenotype to a mesenchymal one, has been suggested to drive chemo-resistance and/or metastasis in patients with cancer. However, only a few studies have demonstrated the presence of CD45/CD326 doubly-positive cells (CD45/CD326 DPC) in cancer. We deployed a combination of cell surface markers to elucidate the phenotypic heterogeneity in non-small cell lung cancer (NSCLC) cells and identified a new subpopulation that is doubly-positive for epithelial and non-epithelial cell-surface markers in both NSCLC cells and patients' malignant pleural effusions. EXPERIMENTAL DESIGN: We procured a total of 39 patients' samples, solid fresh lung cancer tissues from 21 patients and malignant pleural effusion samples from 18 others, and used FACS and fluorescence microscopy to check their surface markers. We also examined the EGFR mutations in patients with known acquired EGFR mutations. RESULTS: Our data revealed that 0.4% to 17.9% of the solid tumor tissue cells and a higher percentage of malignant pleural effusion cells harbored CD45/CD326 DPC expressing both epithelial and nonepithelial surface markers. We selected 3 EGFR mutation patients and genetically confirmed that the newly identified cell population really originated from cancer cells. We also found that higher proportions of CD45/CD326 DPC are significantly associated with poor prognosis. CONCLUSIONS: In conclusion, varying percentages of CD45/CD326 DPC exist in both solid cancer tissue and malignant pleural effusion in patients with NSCLC. This CD45/CD326 doubly-positive subpopulation can be an important key to clinical management of patients with NSCLC.


  • Construction of full-length Japanese reference panel of class I HLA genes with single-molecule, real-time sequencing.

    Takahiro Mimori, Jun Yasuda, Yoko Kuroki, Tomoko F Shibata, Fumiki Katsuoka, Sakae Saito, Naoki Nariai, Akira Ono, Naomi Nakai-Inagaki, Kazuharu Misawa, Keiko Tateno, Yosuke Kawai, Nobuo Fuse, Atsushi Hozawa, Shinichi Kuriyama, Junichi Sugawara, Naoko Minegishi, Kichiya Suzuki, Kengo Kinoshita, Masao Nagasaki, Masayuki Yamamoto

    The pharmacogenomics journal   19 ( 2 ) 136 - 146  2019.04  [International journal]


    Human leukocyte antigen (HLA) is a gene complex known for its exceptional diversity across populations, importance in organ and blood stem cell transplantation, and associations of specific alleles with various diseases. We constructed a Japanese reference panel of class I HLA genes (ToMMo HLA panel), comprising a distinct set of HLA-A, HLA-B, HLA-C, and HLA-H alleles, by single-molecule, real-time (SMRT) sequencing of 208 individuals included in the 1070 whole-genome Japanese reference panel (1KJPN). For high-quality allele reconstruction, we developed a novel pipeline, Primer-Separation Assembly and Refinement Pipeline (PSARP), in which the SMRT sequencing and additional short-read data were used. The panel consisted of 139 alleles, which were all extended from known IPD-IMGT/HLA sequences, contained 40 with novel variants, and captured more than 96.5% of allelic diversity in 1KJPN. These newly available sequences would be important resources for research and clinical applications including high-resolution HLA typing, genetic association studies, and analyses of cis-regulatory elements.


  • Maternity Log study: a longitudinal lifelog monitoring and multiomics analysis for the early prediction of complicated pregnancy.

    Junichi Sugawara, Daisuke Ochi, Riu Yamashita, Takafumi Yamauchi, Daisuke Saigusa, Maiko Wagata, Taku Obara, Mami Ishikuro, Yoshiki Tsunemoto, Yuki Harada, Tomoko Shibata, Takahiro Mimori, Junko Kawashima, Fumiki Katsuoka, Takako Igarashi-Takai, Soichi Ogishima, Hirohito Metoki, Hiroaki Hashizume, Nobuo Fuse, Naoko Minegishi, Seizo Koshiba, Osamu Tanabe, Shinichi Kuriyama, Kengo Kinoshita, Shigeo Kure, Nobuo Yaegashi, Masayuki Yamamoto, Satoshi Hiyama, Masao Nagasaki

    BMJ open   9 ( 2 ) e025939  2019.02  [International journal]


    PURPOSE: A prospective cohort study for pregnant women, the Maternity Log study, was designed to construct a time-course high-resolution reference catalogue of bioinformatic data in pregnancy and explore the associations between genomic and environmental factors and the onset of pregnancy complications, such as hypertensive disorders of pregnancy, gestational diabetes mellitus and preterm labour, using continuous lifestyle monitoring combined with multiomics data on the genome, transcriptome, proteome, metabolome and microbiome. PARTICIPANTS: Pregnant women were recruited at the time of their first routine antenatal visit at Tohoku University Hospital, Sendai, Japan, between September 2015 and November 2016. Of the eligible women who were invited, 65.4% agreed to participate, and a total of 302 women were enrolled. The inclusion criteria were age ≥20 years and the ability to access the internet using a smartphone in the Japanese language. FINDINGS TO DATE: Study participants uploaded daily general health information including quality of sleep, condition of bowel movements and the presence of nausea, pain and uterine contractions. Participants also collected physiological data, such as body weight, blood pressure, heart rate and body temperature, using multiple home healthcare devices. The mean upload rate for each lifelog item ranged from 67.4% (fetal movement) to 85.3% (physical activity), and the total number of data points was over 6 million. Biospecimens, including maternal plasma, serum, urine, saliva, dental plaque and cord blood, were collected for multiomics analysis. FUTURE PLANS: Lifelog and multiomics data will be used to construct a time-course high-resolution reference catalogue of pregnancy. The reference catalogue will allow us to discover relationships among multidimensional phenotypes and novel risk markers in pregnancy for the future personalised early prediction of pregnancy complications.


  • Genome analyses for the Tohoku Medical Megabank Project towards establishment of personalized healthcare.

    Jun Yasuda, Kengo Kinoshita, Fumiki Katsuoka, Inaho Danjoh, Mika Sakurai-Yageta, Ikuko N Motoike, Yoko Kuroki, Sakae Saito, Kaname Kojima, Matsuyuki Shirota, Daisuke Saigusa, Akihito Otsuki, Junko Kawashima, Yumi Yamaguchi-Kabata, Shu Tadaka, Yuichi Aoki, Takahiro Mimori, Kazuki Kumada, Jin Inoue, Satoshi Makino, Miho Kuriki, Nobuo Fuse, Seizo Koshiba, Osamu Tanabe, Masao Nagasaki, Gen Tamiya, Ritsuko Shimizu, Takako Takai-Igarashi, Soichi Ogishima, Atsushi Hozawa, Shinichi Kuriyama, Junichi Sugawara, Akito Tsuboi, Hideyasu Kiyomoto, Tadashi Ishii, Hiroaki Tomita, Naoko Minegishi, Yoichi Suzuki, Kichiya Suzuki, Hiroshi Kawame, Hiroshi Tanaka, Yasuyuki Taki, Nobuo Yaegashi, Shigeo Kure, Fuji Nagami, Kenjiro Kosaki, Yoichi Sutoh, Tsuyoshi Hachiya, Atsushi Shimizu, Makoto Sasaki, Masayuki Yamamoto

    Journal of biochemistry   165 ( 2 ) 139 - 158  2019.02  [International journal]


    Personalized healthcare (PHC) based on an individual's genetic make-up is one of the most advanced, yet feasible, forms of medical care. The Tohoku Medical Megabank (TMM) Project aims to combine population genomics, medical genetics and prospective cohort studies to develop a critical infrastructure for the establishment of PHC. To date, a TMM CommCohort (adult general population) and a TMM BirThree Cohort (birth+three-generation families) have conducted recruitments and baseline surveys. Genome analyses as part of the TMM Project will aid in the development of a high-fidelity whole-genome Japanese reference panel, in designing custom single-nucleotide polymorphism (SNP) arrays specific to Japanese, and in estimation of the biological significance of genetic variations through linked investigations of the cohorts. Whole-genome sequencing from >3,500 unrelated Japanese and establishment of a Japanese reference genome sequence from long-read data have been done. We next aim to obtain genotype data for all TMM cohort participants (>150,000) using our custom SNP arrays. These data will help identify disease-associated genomic signatures in the Japanese population, while genomic data from TMM BirThree Cohort participants will be used to improve the reference genome panel. Follow-up of the cohort participants will allow us to test the genetic markers and, consequently, contribute to the realization of PHC.


  • Construction of JRG (Japanese reference genome) with single-molecule real-time sequencing.

    Masao Nagasaki, Yoko Kuroki, Tomoko F Shibata, Fumiki Katsuoka, Takahiro Mimori, Yosuke Kawai, Naoko Minegishi, Atsushi Hozawa, Shinichi Kuriyama, Yoichi Suzuki, Hiroshi Kawame, Fuji Nagami, Takako Takai-Igarashi, Soichi Ogishima, Kaname Kojima, Kazuharu Misawa, Osamu Tanabe, Nobuo Fuse, Hiroshi Tanaka, Nobuo Yaegashi, Kengo Kinoshita, Shiego Kure, Jun Yasuda, Masayuki Yamamoto

    Human genome variation   6   27 - 27  2019  [International journal]


    In recent genome analyses, population-specific reference panels have proven important. However, reference panels based on short-read sequencing data do not sufficiently cover long insertions. Therefore, the nature of long insertions has not been well documented. Here, we assembled a Japanese genome using single-molecule real-time sequencing data and characterized insertions found in the assembled genome. We identified 3691 insertions ranging from 100 bps to ~10,000 bps in the assembled genome relative to the international reference sequence (GRCh38). To validate and characterize these insertions, we mapped short reads from 1070 Japanese individuals and 728 individuals from eight other populations to insertions integrated into GRCh38. With this result, we constructed JRGv1 (Japanese Reference Genome version 1) by integrating the 903 verified insertions, totaling 1,086,173 bases, shared by at least two Japanese individuals into GRCh38. We also constructed decoyJRGv1 by concatenating 3559 verified insertions, totaling 2,536,870 bases, shared by at least two Japanese individuals or by six other assemblies. This assembly improved the alignment ratio by 0.4% on average. These results demonstrate the importance of refining the reference assembly and creating a population-specific reference genome. JRGv1 and decoyJRGv1 are available at the JRG website.


  • HLA-VBSeq v2: improved HLA calling accuracy with full-length Japanese class-I panel.

    Yen-Yen Wang, Takahiro Mimori, Seik-Soon Khor, Olivier Gervais, Yosuke Kawai, Yuki Hitomi, Katsushi Tokunaga, Masao Nagasaki

    Human genome variation   6   29 - 29  2019  [International journal]


    HLA-VBSeq is an HLA calling tool developed to infer the most likely HLA types from high-throughput sequencing data. However, there is still room for improvement in specific genetic groups because of the diversity of HLA alleles in human populations. Here, we present HLA-VBSeq v2, a software application that makes use of a new Japanese HLA reference panel to enhance calling accuracy for Japanese HLA class-I genes. Our analysis showed significant improvements in calling accuracy in all HLA regions, with prediction accuracies achieving over 99.0, 97.8, and 99.8% in HLA-A, B and C, respectively.


  • STR-realigner: a realignment method for short tandem repeat regions.

    Kaname Kojima, Yosuke Kawai, Kazuharu Misawa, Takahiro Mimori, Masao Nagasaki

    BMC genomics   17 ( 1 ) 991 - 991  2016.12  [International journal]


    BACKGROUND: In the estimation of repeat numbers in a short tandem repeat (STR) region from high-throughput sequencing data, two types of strategies are mainly taken: a strategy based on counting repeat patterns included in sequence reads spanning the region and a strategy based on estimating the difference between the actual insert size and the insert size inferred from paired-end reads. The quality of sequence alignment is crucial, especially for the former strategy, although standard alignment methods have difficulty in STR regions due to insertions and deletions caused by variations in repeat numbers. RESULTS: We proposed a new dynamic programming based realignment method named STR-realigner that considers repeat patterns in STR regions as prior knowledge. By allowing the size change of repeat patterns with low penalty in STR regions, accurate realignment is expected. For the performance evaluation, publicly available STR variant calling tools were applied to four types of aligned reads: synthetically generated sequencing reads aligned with BWA-MEM, those realigned with STR-realigner, those realigned with ReviSTER, and those realigned with GATK IndelRealigner. From the comparison of root mean squared errors between estimated and true STR region size, the results for the dataset realigned with STR-realigner are better than those for other cases. For real data analysis, we used a real sequencing dataset from Illumina HiSeq 2000 for a parent-offspring trio. RepeatSeq and lobSTR were applied to the sequence reads for these individuals aligned with BWA-MEM, and to those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner. STR-realigner shows the best performance in terms of consistency of the size of estimated STR regions in Mendelian inheritance. Root mean squared error values were also calculated from the comparison of these estimated results with STR region sizes obtained from high coverage PacBio sequencing data, and the results from the sequencing data realigned with STR-realigner showed the lowest (best) root mean squared error value. CONCLUSIONS: The effectiveness of the proposed realignment method for STR regions was verified from the comparison with an existing method on both simulation datasets and a real whole-genome sequencing dataset.


  • AP-SKAT: highly-efficient genome-wide rare variant association test.

    Takanori Hasegawa, Kaname Kojima, Yosuke Kawai, Kazuharu Misawa, Takahiro Mimori, Masao Nagasaki

    BMC genomics   17 ( 1 ) 745 - 745  2016.09  [International journal]


    BACKGROUND: Genome-wide association studies have revealed associations between single-nucleotide polymorphisms (SNPs) and phenotypes such as disease symptoms and drug tolerance. To address the small sample size for rare variants, association studies tend to group gene or pathway level variants and evaluate the effect on the set of variants. One such strategy, known as the sequence kernel association test (SKAT), is a widely used collapsing method. However, the reported p-values from SKAT tend to be biased because the asymptotic property of the statistic is used to calculate the p-value. Although this bias can be corrected by applying permutation procedures for the test statistics, the computational cost of obtaining p-values with high resolution is prohibitive. RESULTS: To address this problem, we devise an adaptive SKAT procedure termed AP-SKAT that efficiently classifies significant SNP sets and ranks them according to the permuted p-values. Our procedure adaptively stops the permutation test when the significance level is outside some confidence interval of the estimated p-value for a binomial distribution. To evaluate the performance, we first compare the power and sample size calculations and the type I error rate estimates of SKAT, SKAT-O, and the proposed procedure using genotype data in the SKAT R package and from the 1000 Genomes Project. Through computational experiments using whole genome sequencing and SNP array data, we show that our proposed procedure is highly efficient and has comparable accuracy to the standard procedure. CONCLUSIONS: For several types of genetic data, the developed procedure could achieve competitive power and sample size under small and large sample size conditions while controlling type I error rates, and estimate p-values of significant SNP sets that are consistent with those estimated by the standard permutation test within a realistic time. This demonstrates that the procedure is sufficiently powerful for recent whole genome sequencing and SNP array data with increasing numbers of phenotypes. Additionally, this procedure can be used in other association tests by employing alternative methods to calculate the statistics.
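    The adaptive stopping rule described in the abstract can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the AP-SKAT implementation: the Wilson score interval, the batch size, and the names `wilson_interval`, `adaptive_permutation_test`, and `draw_null_stat` are choices made here for the example, and any test statistic can stand in for SKAT.

```python
import math
import random


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n (assumed interval type)."""
    p_hat = k / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def adaptive_permutation_test(observed, draw_null_stat, alpha=0.05,
                              batch=100, max_perms=100_000):
    """Permute in batches and stop once alpha falls outside the interval.

    draw_null_stat is a hypothetical callable returning one test statistic
    computed on permuted data. Returns (estimated p-value, permutations used).
    """
    k = n = 0
    while n < max_perms:
        for _ in range(batch):
            if draw_null_stat() >= observed:
                k += 1
        n += batch
        low, high = wilson_interval(k, n)
        if alpha < low or alpha > high:
            break  # the significance decision is already clear
    # Add-one correction keeps the estimated p-value strictly positive.
    return (k + 1) / (n + 1), n
```

    Clearly non-significant sets stop after very few permutations, so most of the computational budget goes to sets whose p-values lie near the significance level.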


  • Short tandem repeat number estimation from paired-end reads for multiple individuals by considering coalescent tree.

    Kaname Kojima, Yosuke Kawai, Naoki Nariai, Takahiro Mimori, Takanori Hasegawa, Masao Nagasaki

    BMC genomics   17 Suppl 5   494 - 494  2016.08  [International journal]


    BACKGROUND: Two types of approaches are mainly considered for the repeat number estimation in short tandem repeat (STR) regions from high-throughput sequencing data: approaches directly counting repeat patterns included in sequence reads spanning the region and approaches based on detecting the difference between the insert size inferred from aligned paired-end reads and the actual insert size. Although the accuracy of repeat numbers estimated with the former approaches is high, the size of target STR regions is limited to the length of sequence reads. On the other hand, the latter approaches can handle STR regions longer than the length of sequence reads. However, repeat numbers estimated with the latter approaches are less accurate than those with the former approaches. RESULTS: We proposed a new statistical model named coalescentSTR that estimates repeat numbers from paired-end read distances for multiple individuals simultaneously by connecting the read generative model for each individual with their genealogy. In the model, the genealogy is represented by handling coalescent trees as hidden variables, and the summation of the hidden variables is taken on coalescent trees sampled based on phased genotypes located around a target STR region with Markov chain Monte Carlo. In the sampled coalescent trees, repeat number information from insert size data is propagated, and more accurate estimation of repeat numbers is expected for STR regions longer than the length of sequence reads. For finding the repeat numbers maximizing the likelihood of the model on the estimation of repeat numbers, we proposed a state-of-the-art belief propagation algorithm on sampled coalescent trees. CONCLUSIONS: We verified the effectiveness of the proposed approach from the comparison with existing methods by using simulation datasets and real whole genome and whole exome data for HapMap individuals analyzed in the 1000 Genomes Project.


  • [Construction of 1070 Whole-genome Japanese Reference Panel and Bioinformatics].

    Masao Nagasaki, Yosuke Kawai, Kaname Kojima, Takahiro Mimori, Yumi Yamaguchi-Kabata

    Seikagaku. The Journal of Japanese Biochemical Society   88 ( 1 ) 15 - 24  2016.02  [Domestic journal]


  • A Bayesian approach for estimating allele-specific expression from RNA-Seq data with diploid genomes.

    Naoki Nariai, Kaname Kojima, Takahiro Mimori, Yosuke Kawai, Masao Nagasaki

    BMC genomics   17 Suppl 1   2 - 2  2016.01  [International journal]


    BACKGROUND: RNA-sequencing (RNA-Seq) has become a popular tool for transcriptome profiling in mammals. However, accurate estimation of allele-specific expression (ASE) based on alignments of reads to the reference genome is challenging, because it contains only one allele on a mosaic haploid genome. Even with the information of diploid genome sequences, precise alignment of reads to the correct allele is difficult because of the high similarity between the corresponding allele sequences. RESULTS: We propose a Bayesian approach to estimate ASE from RNA-Seq data with diploid genome sequences. In the statistical framework, the haploid choice is modeled as a hidden variable and estimated simultaneously with isoform expression levels by variational Bayesian inference. Through the simulation data analysis, we demonstrate the effectiveness of the proposed approach in terms of identifying ASE compared to the existing approach. We also show that our approach enables better quantification of isoform expression levels compared to the existing methods, TIGAR2, RSEM and Cufflinks. In the real data analysis of the human reference lymphoblastoid cell line GM12878, some autosomal genes were identified as ASE genes, and skewed paternal X-chromosome inactivation in GM12878 was identified. CONCLUSIONS: The proposed method, called ASE-TIGAR, enables accurate estimation of gene expression from RNA-Seq data in an allele-specific manner. Our results show the effectiveness of utilizing personal genomic information for accurate estimation of ASE. An implementation of our method is available at http://nagasakilab.csml.org/ase-tigar .


  • Japonica array: improved genotype imputation by designing a population-specific SNP array with 1070 Japanese individuals.

    Yosuke Kawai, Takahiro Mimori, Kaname Kojima, Naoki Nariai, Inaho Danjoh, Rumiko Saito, Jun Yasuda, Masayuki Yamamoto, Masao Nagasaki

    Journal of human genetics   60 ( 10 ) 581 - 7  2015.10  [International journal]


    The Tohoku Medical Megabank Organization constructed the reference panel (referred to as the 1KJPN panel), which contains >20 million single nucleotide polymorphisms (SNPs), from whole-genome sequence data from 1070 Japanese individuals. The 1KJPN panel contains the largest number of haplotypes of Japanese ancestry to date. Here, from the 1KJPN panel, we designed a novel custom-made SNP array, named the Japonica array, which is suitable for whole-genome imputation of Japanese individuals. The array contains 659,253 SNPs, including tag SNPs for imputation, SNPs of the Y chromosome and mitochondria, and SNPs related to previously reported genome-wide association studies and pharmacogenomics. The Japonica array provides better imputation performance for Japanese individuals than the existing commercially available SNP arrays with both the 1KJPN panel and the International 1000 genomes project panel. For common SNPs (minor allele frequency (MAF) > 5%), the genomic coverage of the Japonica array (r² > 0.8) was 96.9%, that is, almost all common SNPs were covered by this array. Nonetheless, the coverage of low-frequency SNPs (0.5% < MAF ⩽ 5%) of the Japonica array reached 67.2%, which is higher than those of the existing arrays. In addition, we confirmed the high-quality genotyping performance of the Japonica array using the 288 samples in 1KJPN; the average call rate was 99.7% and the average concordance rate with the genotypes obtained from a high-throughput sequencer was 99.7%. As demonstrated in this study, the creation of custom-made SNP arrays based on a population-specific reference panel is a practical way to facilitate further association studies through genome-wide genotype imputations.


  • Rare variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals.

    Masao Nagasaki, Jun Yasuda, Fumiki Katsuoka, Naoki Nariai, Kaname Kojima, Yosuke Kawai, Yumi Yamaguchi-Kabata, Junji Yokozawa, Inaho Danjoh, Sakae Saito, Yukuto Sato, Takahiro Mimori, Kaoru Tsuda, Rumiko Saito, Xiaoqing Pan, Satoshi Nishikawa, Shin Ito, Yoko Kuroki, Osamu Tanabe, Nobuo Fuse, Shinichi Kuriyama, Hideyasu Kiyomoto, Atsushi Hozawa, Naoko Minegishi, James Douglas Engel, Kengo Kinoshita, Shigeo Kure, Nobuo Yaegashi, Masayuki Yamamoto

    Nature communications   6   8018 - 8018  2015.08  [International journal]


    The Tohoku Medical Megabank Organization reports the whole-genome sequences of 1,070 healthy Japanese individuals and construction of a Japanese population reference panel (1KJPN). Here we identify, through this high-coverage sequencing (32.4× on average), 21.2 million single-nucleotide variants (SNVs), including 12 million novel ones, at an estimated false discovery rate of <1.0%. This detailed analysis detected signatures for purifying selection on regulatory elements as well as coding regions. We also catalogue structural variants, including 3.4 million insertions and deletions, and 25,923 genic copy-number variants. The 1KJPN was effective for imputing genotypes of the Japanese population genome-wide. These data demonstrate the value of high-coverage sequencing for constructing population-specific variant panels, which cover 99.0% of SNVs of minor allele frequency ≥0.1%, and its value for identifying causal rare variants of complex human disease phenotypes in genetic association studies.


  • Estimating copy numbers of alleles from population-scale high-throughput sequencing data.

    Takahiro Mimori, Naoki Nariai, Kaname Kojima, Yukuto Sato, Yosuke Kawai, Yumi Yamaguchi-Kabata, Masao Nagasaki

    BMC bioinformatics   16 Suppl 1   S4  2015  [International journal]


    BACKGROUND: With the recent development of microarray and high-throughput sequencing (HTS) technologies, a number of studies have revealed catalogs of copy number variants (CNVs) and their association with phenotypes and complex traits. In parallel, a number of approaches to predict CNV regions and genotypes have been proposed for both microarray and HTS data. However, only a few approaches focus on haplotyping of CNV loci. RESULTS: We propose a novel approach to infer copy unit alleles and their numbers in each sample simultaneously from population-scale HTS data by variational Bayesian inference on a generative probabilistic model inspired by latent Dirichlet allocation, which is a well-studied model for document classification problems. In simulation studies, we evaluated concordance between inferred and true copy unit alleles for lower-, middle-, and higher-copy number datasets, in which precision and recall were ≥ 0.9 for data with mean coverage ≥ 10× per copy unit. We also applied the approach to HTS data of 1123 samples at the highly variable salivary amylase gene locus and a pseudogene locus, and confirmed consistency of the estimated alleles within samples belonging to a trio of CEPH/Utah pedigree 1463 with 11 offspring. CONCLUSIONS: Our proposed approach enables detailed analysis of copy number variations, such as association studies between copy unit alleles and phenotypes or biological features including human diseases.


  • HLA-VBSeq: accurate HLA typing at full resolution from whole-genome sequencing data.

    Naoki Nariai, Kaname Kojima, Sakae Saito, Takahiro Mimori, Yukuto Sato, Yosuke Kawai, Yumi Yamaguchi-Kabata, Jun Yasuda, Masao Nagasaki

    BMC genomics   16 Suppl 2   S7  2015  [International journal]


    BACKGROUND: Human leucocyte antigen (HLA) genes play an important role in determining the outcome of organ transplantation and are linked to many human diseases. Because of the diversity and polymorphisms of HLA loci, HLA typing at high resolution is challenging even with whole-genome sequencing data. RESULTS: We have developed a computational tool, HLA-VBSeq, to estimate the most probable HLA alleles at full (8-digit) resolution from whole-genome sequence data. HLA-VBSeq simultaneously optimizes read alignments to HLA allele sequences and abundance of reads on HLA alleles by variational Bayesian inference. We show the effectiveness of the proposed method over other methods through the analysis of predicting HLA types for HLA class I (HLA-A, -B and -C) and class II (HLA-DQA1,-DQB1 and -DRB1) loci from the simulation data of various depth of coverage, and real sequencing data of human trio samples. CONCLUSIONS: HLA-VBSeq is an efficient and accurate HLA typing method using high-throughput sequencing data without the need of primer design for HLA loci. Moreover, it does not assume any prior knowledge about HLA allele frequencies, and hence HLA-VBSeq is broadly applicable to human samples obtained from a genetically diverse population.


  • SUGAR: graphical user interface-based data refiner for high-throughput DNA sequencing.

    Yukuto Sato, Kaname Kojima, Naoki Nariai, Yumi Yamaguchi-Kabata, Yosuke Kawai, Mamoru Takahashi, Takahiro Mimori, Masao Nagasaki

    BMC genomics   15   664 - 664  2014.08  [International journal]


    BACKGROUND: Next-generation sequencers (NGSs) have become one of the main tools for current biology. To obtain useful insights from the NGS data, it is essential to control low-quality portions of the data affected by technical errors such as air bubbles in sequencing fluidics. RESULTS: We developed a software tool, SUGAR (subtile-based GUI-assisted refiner), which can handle ultra-high-throughput data through a user-friendly graphical user interface (GUI) and interactive analysis capability. SUGAR generates high-resolution quality heatmaps of the flowcell, enabling users to find possible signals of technical errors during the sequencing. The sequencing data generated from the error-affected regions of a flowcell can be selectively removed by automated analysis or GUI-assisted operations implemented in SUGAR. The automated data-cleaning function based on sequence read quality (Phred) scores was applied to public whole human genome sequencing data, and we confirmed that the overall mapping quality improved. CONCLUSION: The detailed data evaluation and cleaning enabled by SUGAR would reduce technical problems in sequence read mapping, improving subsequent variant analyses that require high-quality sequence data and mapping results. Therefore, the software will be especially useful for controlling the quality of variant calls for low-population cells, e.g., cancer cells, in a sample affected by technical errors in sequencing procedures.

  • HapMonster: A Statistically Unified Approach for Variant Calling and Haplotyping Based on Phase-Informative Reads

    Kaname Kojima, Naoki Nariai, Takahiro Mimori, Yumi Yamaguchi-Kabata, Yukuto Sato, Yosuke Kawai, Masao Nagasaki


    Haplotype phasing is essential for identifying disease-causing variants with phase-dependent interactions as well as for the coalescent-based inference of demographic history. One of approaches for estimating haplotypes is to use phase-informative reads, which span multiple heterozygous variant positions. Although the quality of estimated variants is crucial in haplotype phasing, accurate variant calling is still challenging due to errors on sequencing and read mapping. Since some of such errors can be corrected by considering haplotype phasing, simultaneous estimation of variants and haplotypes is important. Thus, we propose a statistically unified approach for variant calling and haplotype phasing named HapMonster, where haplotype phasing information is used for improving the accuracy of variant calling and the improved variant calls are used for more accurate haplotype phasing. From the comparison with other existing methods on simulation and real sequencing data, we confirm the effectiveness of HapMonster in both variant calling and haplotype phasing.

  • SVEM: A Structural Variant Estimation Method Using Multi-mapped Reads on Breakpoints

    Tomohiko Ohtsuki, Naoki Nariai, Kaname Kojima, Takahiro Mimori, Yukuto Sato, Yosuke Kawai, Yumi Yamaguchi-Kabata, Tetsuo Shibuya, Masao Nagasaki


    Recent development of next generation sequencing (NGS) technologies has led to the identification of structural variants (SVs) of genomic DNA existing in the human population. Several SV detection methods utilizing NGS data have been proposed. However, there are several difficulties in analysis of NGS data, particularly with regard to handling reads from duplicated loci or low-complexity sequences of the human genome. In this paper, we propose SVEM, a novel statistical method to detect SVs with a single nucleotide resolution that can utilize multi-mapped reads on breakpoints. SVEM estimates the amount of reads on breakpoints as parameters and mapping states as latent variables using the expectation maximization algorithm. This framework enables us to handle ambiguous mapping of reads without discarding information for SV detection. SVEM is applied to simulation data and real data, and it achieves better performance than existing methods in terms of precision and recall.

  • TIGAR2: sensitive and accurate estimation of transcript isoform expression with longer RNA-Seq reads.

    Naoki Nariai, Kaname Kojima, Takahiro Mimori, Yukuto Sato, Yosuke Kawai, Yumi Yamaguchi-Kabata, Masao Nagasaki

    BMC genomics   15 Suppl 10   S5  2014  [International journal]

    BACKGROUND: High-throughput RNA sequencing (RNA-Seq) enables quantification and identification of transcripts at single-base resolution. Recently, longer sequence reads become available thanks to the development of new types of sequencing technologies as well as improvements in chemical reagents for the Next Generation Sequencers. Although several computational methods have been proposed for quantifying gene expression levels from RNA-Seq data, they are not sufficiently optimized for longer reads (e.g. >250 bp). RESULTS: We propose TIGAR2, a statistical method for quantifying transcript isoforms from fixed and variable length RNA-Seq data. Our method models substitution, deletion, and insertion errors of sequencers based on gapped-alignments of reads to the reference cDNA sequences so that sensitive read-aligners such as Bowtie2 and BWA-MEM are effectively incorporated in our pipeline. Also, a heuristic algorithm is implemented in variational Bayesian inference for faster computation. We apply TIGAR2 to both simulation data and real data of human samples and evaluate performance of transcript quantification with TIGAR2 in comparison to existing methods. CONCLUSIONS: TIGAR2 is a sensitive and accurate tool for quantifying transcript isoform abundances from RNA-Seq data. Our method performs better than existing methods for the fixed-length reads (100 bp, 250 bp, 500 bp, and 1000 bp of both single-end and paired-end) and variable-length reads, especially for reads longer than 250 bp.

  • A statistical variant calling approach from pedigree information and local haplotyping with phase informative reads.

    Kaname Kojima, Naoki Nariai, Takahiro Mimori, Mamoru Takahashi, Yumi Yamaguchi-Kabata, Yukuto Sato, Masao Nagasaki

    Bioinformatics (Oxford, England)   29 ( 22 ) 2835 - 43  2013.11  [International journal]

    MOTIVATION: Variant calling from genome-wide sequencing data is essential for the analysis of disease-causing mutations and elucidation of disease mechanisms. However, variant calling in low coverage regions is difficult due to sequence read errors and mapping errors. Hence, variant calling approaches that are robust to low coverage data are demanded. RESULTS: We propose a new variant calling approach that considers pedigree information and haplotyping based on sequence reads spanning two or more heterozygous positions termed phase informative reads. In our approach, genotyping and haplotyping by the assignment of each read to a haplotype based on phase informative reads are simultaneously performed. Therefore, positions with low evidence for heterozygosity are rescued by phase informative reads, and such rescued positions contribute to haplotyping in a synergistic way. In addition, pedigree information supports more accurate haplotyping as well as genotyping, especially in low coverage regions. Although heterozygous positions are useful for haplotyping, homozygous positions are not informative and weaken the information from heterozygous positions, as majority of positions are homozygous. Thus, we introduce latent variables that determine zygosity at each position to filter out homozygous positions for haplotyping. In performance evaluation with a parent-offspring trio sequencing data, our approach outperforms existing approaches in accuracy on the agreement with single nucleotide polymorphism array genotyping results. Also, performance analysis considering distance between variants showed that the use of phase informative reads is effective for accurate variant calling, and further performance improvement is expected with longer sequencing data. CONTACT: kojima@megabank.tohoku.ac.jp .

  • iSVP: an integrated structural variant calling pipeline from high-throughput sequencing data.

    Takahiro Mimori, Naoki Nariai, Kaname Kojima, Mamoru Takahashi, Akira Ono, Yukuto Sato, Yumi Yamaguchi-Kabata, Masao Nagasaki

    BMC systems biology   7 Suppl 6   S8  2013  [International journal]

    BACKGROUND: Structural variations (SVs), such as insertions, deletions, inversions, and duplications, are a common feature in human genomes, and a number of studies have reported that such SVs are associated with human diseases. Although the progress of next generation sequencing (NGS) technologies has led to the discovery of a large number of SVs, accurate and genome-wide detection of SVs remains challenging. Thus far, various calling algorithms based on NGS data have been proposed. However, their strategies are diverse and there is no tool able to detect a full range of SVs accurately. RESULTS: We focused on evaluating the performance of existing deletion calling algorithms for various spanning ranges from low- to high-coverage simulation data. The simulation data was generated from a whole genome sequence with artificial SVs constructed based on the distribution of variants obtained from the 1000 Genomes Project. From the simulation analysis, deletion calls of various deletion sizes were obtained with each caller, and it was found that the performance was quite different according to the type of algorithms and targeting deletion size. Based on these results, we propose an integrated structural variant calling pipeline (iSVP) that combines existing methods with a newly devised filtering and merging processes. It achieved highly accurate deletion calling with >90% precision and >90% recall on the 30× read data for a broad range of size. We applied iSVP to the whole-genome sequence data of a CEU HapMap sample, and detected a large number of deletions, including notable peaks around 300 bp and 6,000 bp, which corresponded to Alus and long interspersed nuclear elements, respectively. In addition, many of the predicted deletions were highly consistent with experimentally validated ones by other studies. CONCLUSIONS: We present iSVP, a new deletion calling pipeline to obtain a genome-wide landscape of deletions in a highly accurate manner. From simulation and real data analysis, we show that iSVP is broadly applicable to human whole-genome sequencing data, which will elucidate relationships between SVs across genomes and associated diseases or biological functions.
