Abstract
With advances in high-throughput genotyping technologies, the rate-limiting step of large-scale genetic investigations has become the collection of sensitive and specific phenotype information in large samples of study participants. Clinicians play a pivotal role for successful genetic studies because sound clinical acumen can substantially increase study power by reducing measurement error and improving diagnostic precision for translational research. Phenomics is the systematic measurement and analysis of qualitative and quantitative traits, including clinical, biochemical, and imaging methods, for the refinement and characterization of a phenotype. Phenomics requires deep phenotyping, the collection of a wide breadth of phenotypes with fine resolution, and phenomic analysis, composed of constructing heat maps, cluster analysis, text mining, and pathway analysis. In this article, we review the components of phenomics and provide examples of their application to genomic studies, specifically for implicating novel disease processes, reducing sample heterogeneity, hypothesis generation, integration of multiple types of data, and as an extension of Mendelian randomization studies.
Genome-wide association studies (GWASs), examining a wide range of discrete and quantitative phenotypes, have now been performed in many large cohorts. Mega meta-analyses, including samples of more than 75,000 subjects and often more than 100 authors, represent Herculean collaborative efforts to identify phenotype-associated genetic variants.1 Although the associated loci may be biologically valid and important, the frequency and effect size of the variants that remain to be identified will become smaller, and very large samples are required to confidently detect such phenotype-genotype association signals. Thus, precision in subject ascertainment and phenotype definition should be key elements in the investigation of the genetic basis of complex diseases.2-4
What are the main obstacles to successful identification of genetic contributors to disease? Major culprits for reducing the signal-to-noise ratio and obscuring phenotype-genotype associations are measurement error and study heterogeneity.5,6 Geneticists are now accustomed to technologies with extraordinarily low imprecision: genotype miscall rates are typically less than 0.5%.7 However, measurement error in phenotypes has received substantially less attention. Confounding factors in genetic studies, such as pleiotropy, incomplete penetrance, epistasis, and allelic and locus heterogeneity, are unavoidable and need to be considered in study design.5 However, study heterogeneity originating from clinical sources, such as phenocopy or multiple pathways leading to a common disease phenotype, can potentially be mitigated. Careful measurement not only of the major phenotype of interest but also of environmental exposures and subphenotypes, also known as biomarkers, endophenotypes, subclinical traits, or attributes, can provide valuable clues to the pathophysiology of each individual case.
Phenomics is the systematic measurement and analysis of qualitative and quantitative traits, including clinical, biochemical, and imaging methods, for the refinement and characterization of a phenotype.8 The importance of accurate phenotype determination in association studies has already been proven theoretically,6,9 and yet phenomics has not generally received the same scientific and technological attention as genomic analysis. As attention shifts from increasing sample size to increasing diagnostic precision in the postgenomic era, the clinician, in particular, can play a pivotal role in the collection of accurate and complete phenotype information to ensure that genetic investigations have maximal power to identify consistent genotype-phenotype associations. Here, we further define and describe applications of "phenomics" and provide examples demonstrating its utility.
PHENOMICS
In its simplest form, a retrospective genetic association study design involves the comparison of allele frequencies between a collection of affected cases and unaffected controls.3 Because additional covariates can increase the likelihood of an individual becoming a case, attempts to match the cases and controls for covariates such as age and sex are common. By extending and increasing the sophistication of techniques for describing and quantifying the phenotype of cases and controls, phenomics holds the promise of similarly helping to reduce study heterogeneity. Current evidence suggests that common complex disorders are in truth extremes of quantitative traits, in which additional information can be gained through precise, quantitative phenotypic description.10 Phenomics is composed of 2 separate components: "deep phenotyping," referring to a strategic and comprehensive approach toward data acquisition, and "phenomic analysis," referring to the evaluation of patterns and relationships between individuals with related phenotypes and between genotype-phenotype associations.2,11
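The basic case-control comparison of allele frequencies can be illustrated with a short sketch. The allele counts below are hypothetical, and the Pearson chi-square test on a 2x2 table of allele counts is only one of several tests that could be used for this comparison; it is shown here as an assumption, not as the method of any cited study. The P value uses the identity that, for 1 degree of freedom, the chi-square survival function equals erfc(sqrt(x/2)).

```python
import math


def allele_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 1,000 cases and 1,000 controls, 2 alleles per subject.
chi2, p = allele_chi_square(case_alt=460, case_ref=1540, ctrl_alt=400, ctrl_ref=1600)
print(f"case freq={460 / 2000:.3f}, control freq={400 / 2000:.3f}")
print(f"chi2={chi2:.2f}, p={p:.4f}")
```

Matching or adjusting for covariates such as age and sex would require a regression model rather than this unadjusted table.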
DEEP PHENOTYPING
Deep phenotyping involves the development of a complete picture of each study participant through the strategic collection of a broad range of high-resolution phenotypes (Table 1). Phenotype resolution, or granularity, is the level of detail afforded to phenotypic definition.12 The goal of deep phenotyping is to characterize further as many of the contributing factors to the "case" definition as possible, which may allow for the removal or correction of heterogeneity among research subjects. In general, continuous quantitative phenotypes can better differentiate between marginal and severe cases and generally allow for more powerful statistical comparisons than do qualitative traits. Deep phenotyping begins with the collection of a thorough medical history, including a detailed account of environmental exposures, a complete review of systems, and collection of family history. Measurement of disease progression, an extensive panel of risk factors, and alternative measures of shared disease pathways should be used to provide the most detailed phenotype possible. By extending the detail, accuracy, and context of trait acquisition, and then linking components, a more complete phenotype can be generated for each individual.11
In their day-to-day work focusing on diagnosis, clinicians use technologies that probe intermediate markers in pathways underlying illness, including biochemical, serological, histopathological, and noninvasive imaging methods. These assays can be static, such as biochemical analysis of plasma or tissue samples, or dynamic, such as provocative tests followed by serial sampling performed in clinical investigation units. For instance, in cardiovascular research, static biochemical phenotypes include serum concentrations of insulin, fasting glucose, triglycerides, low-density lipoprotein (LDL) and high-density lipoprotein cholesterol levels. Dynamic phenotypes include challenge by a stimulus or provocation of a response, such as postprandial excursion of plasma metabolites or measurement of insulin sensitivity using a euglycemic insulin clamp. Using serial measurements to monitor progression or regression of the phenotype, either in response to treatment or simply over time, becomes a further dimension of the phenotype. Thus, the same principles clinicians use to develop specific diagnoses can be applied to phenomic research applications.
In addition to data quality and granularity, deep phenotyping requires optimal data-gathering conditions to reduce technical variability and thus maximize the chance that the observed variation is biological in origin. Whenever possible, quantitative phenotyping methods should be performed using standard operating procedures and be validated against reference standards, with clear performance metrics, such as measurement reliability and reproducibility. Replicate phenotypes from an individual can be averaged, when appropriate. Multiple measurement modalities, for instance, quantifying carotid atherosclerosis using both ultrasound and magnetic resonance imaging, may also bolster phenotype accuracy and precision.13 Overall, deep phenotyping should include reliable, comprehensive, and high-resolution assessment of the known components of the phenotype of interest.
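As a minimal sketch of averaging replicates while tracking reproducibility, the example below reports each subject's mean and within-subject coefficient of variation (CV). The measurements, subject labels, and the 10% CV threshold are all hypothetical and chosen only for illustration; an appropriate threshold depends on the assay.

```python
from statistics import mean, stdev


def summarize_replicates(replicates):
    """Return per-subject mean and within-subject coefficient of variation (%)."""
    summary = {}
    for subject, values in replicates.items():
        avg = mean(values)
        # Sample SD across replicates taken under the same conditions
        # is used here as a proxy for technical variability.
        cv_percent = 100.0 * stdev(values) / avg
        summary[subject] = {"mean": avg, "cv_percent": cv_percent}
    return summary


# Hypothetical triplicate fasting triglyceride measurements (mmol/L).
replicates = {
    "subject_A": [1.50, 1.55, 1.45],
    "subject_B": [3.10, 2.60, 3.60],
}
summary = summarize_replicates(replicates)
for subject, stats in summary.items():
    # The 10% cutoff is an arbitrary illustrative threshold.
    flag = "review assay" if stats["cv_percent"] > 10.0 else "ok"
    print(f"{subject}: mean={stats['mean']:.2f}, CV={stats['cv_percent']:.1f}% ({flag})")
```

Averaging is only appropriate when the replicates measure the same underlying state; serial measurements intended to capture progression should not be collapsed in this way.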
PHENOMIC ANALYSIS
Development of deep phenotypes creates difficult-to-interpret multidimensional data. Luckily, visualization and analysis techniques already exist for drawing conclusions from multidimensional data (Table 2). Remarkably, Sneath14 described a computational method to catalog and score similarities between bacterial species to create a taxonomic classification more than 50 years ago. Eisen et al.15 described a now widely used method, using heat maps and clustering to identify groups of coexpressing genes from highly multidimensional microarray expression data. These techniques can easily be adapted for the visual representation and analysis of human deep phenotypes.
The construction of a phenotype heat map consists of placing either the major phenotype classes or individual subjects in rows, and the quantitative or qualitative subphenotypes in columns of a grid. In a 1-color heat map, black and white boxes indicate the presence and absence, respectively, of an unambiguous discrete or qualitative phenotype or trait. For quantitative traits, grayscale can be used to indicate degree of affection. In a 2-color heat map, one color can be used to indicate a positive fold change from a normative value, whereas the other color indicates a negative fold change from the normative value; greater intensities indicate larger fold changes.
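The 2-color encoding can be sketched as follows, with subjects in rows and subphenotypes in columns. The trait names, normative values, measurements, the red/green color choice, and the cap on intensity at a 4-fold change are all assumptions made for this illustration.

```python
import math


def two_color_cell(value, normative, max_abs_log2=2.0):
    """Map a measurement to (color, intensity) from its log2 fold change."""
    log2_fc = math.log2(value / normative)
    # Intensity is scaled to [0, 1] and capped at max_abs_log2.
    intensity = min(abs(log2_fc), max_abs_log2) / max_abs_log2
    if log2_fc > 0:
        color = "red"      # positive fold change from the normative value
    elif log2_fc < 0:
        color = "green"    # negative fold change from the normative value
    else:
        color = "none"
    return color, round(intensity, 2)


def build_heat_map(subjects, normative):
    """Rows are subjects; columns follow the key order of `normative`."""
    traits = list(normative)
    grid = [
        [two_color_cell(row[trait], normative[trait]) for trait in traits]
        for row in subjects.values()
    ]
    return traits, grid


# Hypothetical normative values and subject measurements.
normative = {"triglycerides": 1.5, "hdl_cholesterol": 1.2, "insulin": 60.0}
subjects = {
    "P1": {"triglycerides": 6.0, "hdl_cholesterol": 0.6, "insulin": 60.0},
    "P2": {"triglycerides": 1.5, "hdl_cholesterol": 2.4, "insulin": 480.0},
}
traits, grid = build_heat_map(subjects, normative)
for subject, row in zip(subjects, grid):
    print(subject, dict(zip(traits, row)))
```

A plotting library would normally render the grid as colored boxes; the tuples here simply make the encoding explicit.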
Once heat maps of quantitative or qualitative phenotype data are generated, cluster analysis can help group similar observations into subgroups for identifying potential relationships between data subsets. Software such as Hierarchical Cluster Explorer (www.cs.umd.edu/hcil/hce/) can assist in constructing and analyzing heat maps.16 There are many types of cluster analysis, although biologists are likely most familiar with hierarchical cluster analysis from sequence and phylogenetic evaluations.15 The output of hierarchical clustering is a dendrogram, or relationship tree, in which all observations are leaves and more closely related observations emanate from more proximal branch points. Hierarchical clustering is a sequential procedure that can be either agglomerative, which begins by joining the 2 closest leaves and then repeatedly joins the next closest leaves or clusters, or divisive, which begins with all observations in a single cluster and iteratively splits off the most distant ones.17 The major limitation of cluster analysis is its dependence on the similarity metric (a measure of correlation) that is used to calculate "closeness" between any 2 observations. For example, the similarity metric could include a weighting factor because 1 subphenotype measure may contribute more information toward the "closeness" of 2 major phenotypes. Hence, different weighting strategies could lead to different conclusions. Nonetheless, cluster analysis is a tool to extract solutions from complex multidimensional data, including phenomic data.
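The dependence of the result on the weighting strategy can be demonstrated with a small agglomerative clustering sketch. It is written under stated assumptions: a weighted Euclidean distance stands in for the similarity metric, average linkage is used to compare clusters, and the 4 subjects, 2 subphenotype scores, and weights are hypothetical.

```python
import math
from itertools import combinations


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two subphenotype vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def average_linkage(left, right, profiles, weights):
    """Mean pairwise distance between the members of two clusters."""
    dists = [
        weighted_distance(profiles[i], profiles[j], weights)
        for i in left
        for j in right
    ]
    return sum(dists) / len(dists)


def agglomerate(profiles, weights):
    """Agglomerative clustering; returns the sequence of merged cluster pairs."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters under the chosen weighting.
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], profiles, weights),
        )
        merges.append((left, right))
        clusters = [c for c in clusters if c not in (left, right)] + [left + right]
    return merges


# Hypothetical standardized scores: (fat loss, insulin resistance).
profiles = {"A": (0.0, 0.0), "B": (0.8, 0.0), "C": (0.0, 2.0), "D": (1.0, 2.0)}

merges_equal = agglomerate(profiles, weights=(1.0, 1.0))
merges_weighted = agglomerate(profiles, weights=(4.0, 0.04))
print("equal weights:   ", merges_equal)
print("fat-loss weighted:", merges_weighted)
```

With equal weights the subjects pair up by insulin resistance score, whereas up-weighting the fat loss score pairs them by fat loss, so the same data support two different subgroupings.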
Text mining and pathway analysis strategies have also been proposed for further refining deep phenotypes and uncovering new gene-phenotype associations.12 In addition, text mining techniques could be used for the development of phenotypes from electronic medical records. Many well-known databases, such as Online Mendelian Inheritance in Man (OMIM) and PhenoGO, already house phenotype- and gene-disease associations.12 Gene Ontology (GO) and pathway analysis techniques, also borrowed from gene expression studies, could provide new insight into GWASs for hypothesis generation.18
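A deliberately simple sketch of deriving phenotype labels from free-text clinical notes is given below. The vocabulary, the phenotype labels, the negation cues, and the example note are hypothetical, and production text mining of medical records would require a curated terminology and far more robust handling of negation and context than this keyword approach provides.

```python
import re

# Hypothetical vocabulary mapping free-text phrases to phenotype labels.
VOCABULARY = {
    "insulin resistance": "insulin_resistance",
    "hypertension": "hypertension",
    "high blood pressure": "hypertension",
    "dyslipidemia": "dyslipidemia",
}

# A negation cue followed by at most 2 words immediately before the phrase.
NEGATION = re.compile(r"\b(no|denies|without)\s+(\w+\s+){0,2}$", re.IGNORECASE)


def extract_phenotypes(note):
    """Return the set of phenotype labels asserted (not negated) in a note."""
    found = set()
    for phrase, label in VOCABULARY.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, note, re.IGNORECASE):
            preceding = note[: match.start()]
            if not NEGATION.search(preceding):
                found.add(label)
    return found


note = "Patient has high blood pressure and dyslipidemia. No history of insulin resistance."
labels = extract_phenotypes(note)
print(sorted(labels))
```

Mapping synonyms such as "high blood pressure" and "hypertension" to one label is what allows notes written by different clinicians to contribute to the same column of a phenotype heat map.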
ADVANCING GENOMIC STUDIES THROUGH PHENOMICS
Implicating a Novel Disease Mechanism
We described Old Order Amish patients with a neonatal syndrome of endocrine gland hypoplasia, cerebral anomalies, and severe osteodysplasia.19 Informed consent for study was provided by the parents of affected children, and the study was approved by the Office of Research Ethics at the University of Western Ontario. Initial examination of the affected infants suggested similarities to Majewski syndrome (OMIM: 263520) and hydrolethalus syndrome (OMIM: 236680). Collection of approximately 70 phenotype observations per patient and hierarchical cluster analysis indicated that the infants were affected with a novel disorder, subsequently named endocrine-cerebro-osteodysplasia (Fig. 1), and further suggested that a novel molecular mechanism could be responsible.19 Autozygosity mapping and targeted sequencing identified a rare mutation, proven through biochemical studies to be disease causing, in the gene encoding intestinal cell kinase. Thus, phenomic analysis to demonstrate both phenotype homogeneity among affected children and the presence of a constellation of phenotypes in a new syndrome of unknown etiology were important initial steps in this study.
In biomedicine, there are numerous examples of insight into normal human physiology gained from studying rare conditions.20 Similarly, studies of induced mutant mice have repeatedly shown the benefits of focusing on the phenotypic differences resulting from altered function or expression of a single gene. In humans, studies of naturally occurring rare diseases using linkage and autozygosity mapping strategies have become less common because attention has shifted toward common complex conditions. However, studies of well-defined rare syndromes still have their place, and with the reduced cost of high-density genotyping arrays, mapping studies are feasible in a laboratory with modest resources for even the rarest orphan disease. Furthermore, with the proliferation of next-generation genomic sequencing efforts, phenomic analysis of mutation carriers will be essential to guide evaluation of the underlying pathophysiology resulting from a novel mutation, following the "reverse genetics" paradigm.
Increasing Sample Homogeneity
We have quantified adipose tissue depots in patients with lipodystrophy using magnetic resonance imaging to develop highly resolved phenotypes.21-23 Using phenomics, such quantitative data have made clear the phenotypic distinctions between individuals with different genetic forms of familial partial lipodystrophy (FPLD), once considered to be a single entity (Fig. 2).24,25 Familial partial lipodystrophy type 3 (FPLD3) has milder adipose tissue atrophy but more severe metabolic complications including insulin resistance, hypertension, dyslipidemia, and earlier-onset type 2 diabetes when compared with FPLD type 2 (FPLD2).25 In addition, FPLD3 patients experience blunted response to thiazolidinediones compared with FPLD2 patients.26
Finding analogous phenotypic distinctions in common complex diseases could improve the signal-to-noise ratio for genotype-phenotype association analysis. Again, using cardiovascular disease as an example, it has been suggested that the "distance" between underlying atherogenic mechanisms and disease end points may contribute to study heterogeneity.27 Through phenomics, the identification of patient subgroups with a specific mechanism leading to cardiovascular disease could improve the discriminatory power for genetic association studies, although making these distinctions and creating patient subgroups simultaneously reduces sample size.
Generating Molecular Hypotheses
Pleiotropy, in which multiple disease phenotypes are caused by mutations in the same gene, is illustrated by the family of disorders called "laminopathies," which are due to a range of mutations within LMNA, which encodes nuclear lamin A/C.8 Laminopathies include a diverse range of phenotypes including partial lipodystrophy, dilated cardiomyopathy, muscular dystrophy, and premature aging syndromes.8 Careful analysis of the phenotypes in carriers of LMNA mutations indicates some commonalities across this family of widely disparate disorders.28 Application of phenomic heat mapping and hierarchical clustering analysis identified 2 main classes of laminopathies based on organ system involvement. The distribution of mutations across the lamin A/C domains was nonrandom across the 2 laminopathy classes, suggesting that mutation position relative to the nuclear localization signal domain may be an important determinant of the resultant complex phenotype.28
Merging of Clinical, Biochemical, and Genomic Information for Diagnosis Refinement
Incorporating genetic, biochemical, and phenotypic information into a single analysis can further refine classic clinical phenotypes. For example, the Fredrickson hyperlipoproteinemia (HLP) phenotypes are defined by the quality and quantity of plasma lipid subfractions after ultracentrifugation and have been used clinically for decades.29 However, molecular genetics research has uncovered the molecular pathways underlying many of the HLP phenotypes.30 For example, mutations in the LDL receptor (LDLR), proprotein convertase subtilisin/kexin-type 9 (PCSK9), and apolipoprotein B (APOB) can all produce HLP type 2A, each by impairing different steps in the LDL cholesterol metabolism pathway.30 Both rare mutations, such as those in genes encoding lipoprotein lipase (LPL) and apolipoprotein C2 (APOC2), and common single-nucleotide polymorphisms, such as those in glucokinase regulatory protein (GCKR) and tribbles homolog 1 (TRIB1), contribute to overall HLP susceptibility.31,32 Genomic resequencing and association studies show that severe hypertriglyceridemia (HLP type 5) is a mosaic of both common and rare genetic variants with a wide range of effect sizes (Fig. 3).30-33 Perhaps the traditional classification of lipid phenotypes will need to be readdressed in light of discoveries into the disease-causing mechanisms gleaned from genetic investigations. Knowledge of the exact disease-causing mechanism in a specific patient from clinical, biochemical, and genomic investigations may lead to a refinement of diagnosis and identify patients who are most likely to benefit from a specific pharmacological intervention.
Phenomics and Mendelian Randomization
Mendelian randomization (MR) uses the random assortment of alleles at meiosis to evaluate a causal relationship between an intermediate biomarker and a disease end point.34,35 If a genetic variant is associated with the circulating concentration of a biomarker, and the biomarker is causal for the disease end point, the genetic variant should also be associated with the disease end point. There are several caveats to the interpretation of MR studies, but in theory, MR can provide support for causal relationships.36 Efforts to curate databases of genotype-phenotype associations are already underway for identifying networks of interrelated findings.12 Even if a genetic variant-phenotype association does not surpass the stringent significance threshold required in GWASs owing to multiple testing, confidence would be increased if the genetic variant is consistently associated with multiple intermediate steps in a particular pathogenic pathway.18 A network of intermediate steps could be characterized in the living study participant through deep phenotyping and phenomic analysis.
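The logic of the MR triangle can be made concrete with a small numerical sketch. It uses the Wald ratio, a commonly used single-variant MR estimator that is introduced here for illustration and is not described in the text above; all effect sizes are invented, and the calculation ignores the caveats of MR (for example, pleiotropy) as well as statistical uncertainty.

```python
def expected_variant_disease_effect(beta_variant_biomarker, beta_biomarker_disease):
    """Variant-disease effect expected if the biomarker is causal for disease."""
    return beta_variant_biomarker * beta_biomarker_disease


def wald_ratio(beta_variant_biomarker, beta_variant_disease):
    """Biomarker-disease effect implied by a single variant (Wald ratio)."""
    if beta_variant_biomarker == 0:
        raise ValueError("the variant must be associated with the biomarker")
    return beta_variant_disease / beta_variant_biomarker


# Hypothetical per-allele effects (all values invented for illustration).
beta_gx = 0.25  # variant -> biomarker, mmol/L per allele
beta_xd = 0.40  # biomarker -> disease, log odds per mmol/L (observational)
beta_gd = 0.09  # variant -> disease, log odds per allele (observed)

expected = expected_variant_disease_effect(beta_gx, beta_xd)
implied = wald_ratio(beta_gx, beta_gd)
print(f"expected variant-disease log odds: {expected:.3f}")
print(f"observed variant-disease log odds: {beta_gd:.3f}")
print(f"implied biomarker-disease log odds per mmol/L: {implied:.3f}")
```

Agreement between the expected and observed variant-disease effects is consistent with, but does not prove, a causal role for the biomarker; repeating the comparison across several intermediate steps of a pathway is one way deep phenotyping could strengthen the inference.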
SUMMARY
Owing to recent technological advances, the bar has been raised for data quality and quantity in genetic association studies. To tease out the small genetic effects operating in complex diseases, clinicians will be required to collect precise deep phenotypes to reduce measurement error, reduce study heterogeneity, and further refine study populations. Through analysis of individuals with rare mutations identified through next-generation sequencing, phenomics will play an important role in hypothesis generation. Phenomics will create new diagnostic criteria through the integration of clinical, diagnostic, and genetic information. Finally, confidence in the biological meaning of genetic associations will be improved by assessing the genetic variant's effect on multiple components of a complex pathway.
ACKNOWLEDGMENTS
The authors thank Dr. Tisha Joy for providing data included in Figure 2. R.A.H. is a Career Investigator of the Heart and Stroke Foundation of Ontario, holds the Edith Schulich Vinet Canada Research Chair (Tier I) in Human Genetics, the Martha G. Blackburn Chair in Cardiovascular Research, and the Jacob J. Wolfe Distinguished Medical Research Chair at the University of Western Ontario.