The Human Phenotype Ontology

Robinson PN, Mundlos S. The Human Phenotype Ontology.

In 1981, Victor McKusick anticipated the sequencing of the human genome, 'perhaps by the year 2000', and noted that the determination of the sequence itself would be unlikely to be a scientific priority. 'Even when the anatomy of the human genome is known down to the last nucleotide, we will not know the function of all parts of that DNA. . .' (1). The subsequent decades have witnessed remarkable progress in understanding the functional correlates of DNA sequences and the molecular underpinnings of human disease. Currently, newer methods based on massively parallel sequencing are poised to replace Sanger's DNA sequencing method (2) and to revolutionize our understanding of biology and human genetics (3). It is now in principle possible to identify disease genes by exome sequencing of a small number of unrelated affected individuals (4), and similar methodologies are likely to transform investigations of cancer (5) and potentially even polygenic disorders.
As exciting as these perspectives are, without clear clinical descriptions of the affected individuals, the value of the molecular data and its relevance for understanding, diagnosing, and treating human disease will be diminished. Accurate and clear clinical descriptions of which tissues are involved in a disease, the time of onset of individual disease manifestations, and what complications occur can yield important clues as to molecular pathophysiology (6). However, the terms that clinicians use to describe phenotypic manifestations have evolved in a haphazard and uncoordinated manner (7,8), and databases of systematically collected phenotypic information about humans with hereditary disease do not exist (9,10).
Therefore, there is a clear need not only for uniform and internationally accepted terms to describe the human phenotype (8), but also for resources that will allow human phenotype data to be used for computational analysis of the role of DNA sequence variants in human health and disease.
In this article, we describe the Human Phenotype Ontology (HPO) project, which was developed by our group in order to provide a standardized basis for computational analysis of human disease manifestations (11).

Medical terminologies
Coding and classification systems have a long history in medicine. One of the most important early figures was William Farr (1807-1883), who became 'chief of abstracts' (chief statistician) of the Office of the Registrar General in London. Farr became internationally recognized for his work that revealed that there was less mortality from cholera during a 19th-Century epidemic in London, the further removed a district was from the Thames River, the source of the epidemic. This result set the stage for Koch's discovery 17 years later that the cholera bacillus was the cause of cholera. Farr also compiled a 'statistical nosology' which not only defined disease categories but also the 'synonymes' and 'provincial terms' by which the diseases were known locally (12). In his first Annual Report of the Registrar General, he noted (13): The advantages of a uniform statistical nomenclature, however imperfect, are so obvious, that it is surprising that no attention has been paid to its enforcement in Bills of Mortality. Each disease has, in many instances, been denoted by three or four terms, and each term has been applied to as many different diseases: vague, inconvenient names have been employed, or complications have been registered instead of primary diseases. The nomenclature is of as much importance in this department of enquiry as weights and measures in the physical sciences, and should be settled without delay.
While many of the medical terminologies developed over the last two centuries aim to describe disease entities, more recent efforts to use computational algorithms to analyze human phenotypic features have additionally used vocabularies that define the signs, symptoms, and other manifestations of the diseases. In the field of human genetics, the most important source of information about hereditary diseases is the Online Mendelian Inheritance in Man (OMIM) database, which was developed over decades by Professor Victor McKusick and many colleagues at Johns Hopkins University (14). OMIM represents a monumental achievement that is used for the daily work of geneticists around the globe. However, OMIM does not yet use a controlled vocabulary to describe the phenotypic features in its clinical synopsis section, which has hindered the use of the information in OMIM for computational analysis and also can lead to inconsistencies in search results. For instance, OMIM uses the synonymous descriptions 'generalized amyotrophy', 'generalized muscular atrophy', and 'muscular atrophy, generalized' in the clinical synopsis sections of different diseases. Although a human user might have no trouble in recognizing that the three descriptions refer to the same thing, commonly used computer search routines do not. In fact, a search for 'generalized muscle atrophy' on the OMIM website led to 174 hits, whereas a search for 'muscular atrophy, generalized' yielded only 96 results. Similarly, searching for synonyms such as 'heart attack' and 'myocardial infarction' in PubMed or Google will also return differing results.
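The search mismatch described above can be illustrated with a minimal sketch. The synonym table and the choice of preferred label are illustrative assumptions, not the actual contents of OMIM or of any controlled vocabulary; the point is only that mapping each synonym to one preferred label makes the three descriptions compare as equal.

```python
# Hypothetical synonym table: each free-text variant maps to one preferred label.
SYNONYMS = {
    "generalized amyotrophy": "generalized muscular atrophy",
    "generalized muscular atrophy": "generalized muscular atrophy",
    "muscular atrophy, generalized": "generalized muscular atrophy",
}


def normalize(description: str) -> str:
    """Return the preferred label, or the cleaned-up input if it is unknown."""
    key = " ".join(description.lower().split())
    return SYNONYMS.get(key, key)


descriptions = [
    "Generalized amyotrophy",
    "generalized muscular atrophy",
    "Muscular atrophy, generalized",
]

# Plain string comparison sees three different descriptions ...
naive_distinct = len({d.lower() for d in descriptions})
# ... whereas comparison after normalization sees a single concept.
normalized_distinct = len({normalize(d) for d in descriptions})
```

With plain string matching the three descriptions count as three distinct items; after normalization they collapse to one.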
Other terminologies have been developed for clinical diagnostics in human genetics, including the London Dysmorphology Database (LDDB) (15), POSSUM (16), and Orphanet (17). The vocabularies developed for these databases were not explicitly designed for bioinformatic analysis and are not available under an open-source license. On the one hand, the vocabularies do not comprise nearly enough items to allow a precise description of the spectrum of phenotypic abnormalities that can occur in human hereditary and other diseases. On the other hand, some of the items in the vocabularies are less than ideal for computational analysis because they refer to multiple or compound phenotypic manifestations. For instance, the category 'Asternia or Bifid sternum' in LDDB actually comprises two distinct phenotypic abnormalities, viz. 'Asternia' and 'Bifid sternum'. Such 'bundled' terms can lead to confusion because they may suggest a common pathogenesis or association which may or may not be correct (8).
In publications about new disease genes, gene mutations, and genotype-phenotype correlations, descriptions of human phenotypes have been written almost exclusively using free text. This has meant that it is difficult to compare the phenotypes described in even two different papers, and it is essentially impossible to perform a computational analysis involving human phenotypes reported in the hundreds or thousands of articles published about, say, diseases such as Marfan syndrome or neurofibromatosis. As a result, integrative analysis in the style that has now become routine in the molecular biology community thanks to the unified vocabulary provided by Gene Ontology (GO) (18) has not been possible in the human genetics community.

Ontologies
Ontology is the philosophical discipline which studies the nature of existence and aims to understand how things in the world are divided into categories and how these categories are related to one another. In computer science, the word ontology is used with a related meaning to describe a structured, automated representation of the knowledge within a certain domain in fields such as science, government, industry, and healthcare (19). An ontology provides a classification of the entities within a domain. Each entity is said to make up a term of the ontology. Furthermore, an ontology must specify the semantic relationships between the entities. Thus, an ontology can be used to define a standard, controlled vocabulary for a scientific field. In biomedical research, the most widely used ontologies are represented in the form of directed acyclic graphs (DAGs). A graph is a set of nodes (also called vertices) and edges (also called links) between the nodes. In directed graphs, the edges are one-way and go from one node to another. A cycle in a directed graph is a path along a series of two or more edges that leads back to the initial node in the path. Therefore, a DAG is a directed graph that has no cycles. The nodes of the DAG, which correspond to the terms of the ontology, are assigned to entities in the domain and the edges between the nodes represent semantic relationships. Ontologies are designed such that terms closer to the root are more general than their descendant terms. For the HPO and many other ontologies in biology and medicine, the true-path rule applies, that is, entities are annotated to the most specific term possible but are assumed to be implicitly annotated to all ancestors of that term (this is one reason why cycles are not permitted in graphs used to represent ontologies). Figure 1 shows an example of an ontology displayed in the form of a DAG.
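The DAG structure described above can be sketched in a few lines. The excerpt below is a toy graph: the term names follow the HPO examples used in this article, but the root term and several of the edges are assumptions made for illustration and should not be read as the actual HPO hierarchy.

```python
from typing import Dict, List, Set

# Toy ontology excerpt: each term maps to its parent terms (subclass edges).
PARENTS: Dict[str, List[str]] = {
    "abnormality of the heart": [],  # hypothetical root of this excerpt
    "abnormality of the cardiac septa": ["abnormality of the heart"],
    "abnormality of the cardiac atria": ["abnormality of the heart"],
    # A term in a DAG may have more than one parent.
    "abnormality of the atrial septum": [
        "abnormality of the cardiac septa",
        "abnormality of the cardiac atria",
    ],
    "abnormality of the ventricular septum": ["abnormality of the cardiac septa"],
    "atrial septal defect": ["abnormality of the atrial septum"],
    "ventricular septal defect": ["abnormality of the ventricular septum"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every term reachable by following parent edges from `term`."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def is_acyclic(parents: Dict[str, List[str]]) -> bool:
    """A directed graph is acyclic if no term is its own ancestor."""
    return all(term not in ancestors(term, parents) for term in parents)
```

In this toy graph, `ancestors("atrial septal defect", PARENTS)` collects the four more general terms above it, and `is_acyclic` would reject a graph in which two terms were declared to be parents of each other.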
Ontologies have been developed for a large number of domains in biomedical research. The most widely used ontology is the GO, which provides structured, controlled vocabularies for several domains of molecular and cellular biology and is structured into three domains, molecular function, biological process, and cellular component (18). Other biomedical ontologies of broad interest include the Mammalian Phenotype Ontology (20), the Foundational Model of Anatomy (FMA) ontology (21), the Sequence Ontology (22), the Cell-Type ontology (23), the Chemical Entities of Biological Interest (ChEBI) ontology (24), and the mouse pathology (MPATH) ontology (25), among many others (26).

The Human Phenotype Ontology
The HPO currently contains over 9500 terms describing phenotypic features. About 50,000 annotations to HPO terms for 4779 diseases listed at the OMIM database are also available at the HPO website (11). The HPO was originally constructed using data from OMIM, whereby synonyms were merged and semantic links were created between the terms to create the ontological structure, and nearly all of the annotations were derived from OMIM. The HPO is now being refined, corrected, and expanded by a process of manual curation in which term definitions are being created, and terms for concepts not originally found in the OMIM data are being made. The annotations available via the clinical synopsis section of OMIM are not always comprehensive (27). Therefore, the HPO project is endeavoring to provide more comprehensive annotations on the basis of literature data.
Each term in the HPO describes a distinct phenotypic abnormality such as ventricular septal defect. Diseases are annotated to terms of the HPO, meaning that HPO terms are used to describe all the signs, symptoms, and other phenotypic manifestations that characterize the disease in question. For instance, Canavan disease is currently annotated to 16 HPO terms including macrocephaly, optic atrophy, and hypoplasia of the corpus callosum.
The terms of the HPO are arranged in a hierarchical structure representing subclass relationships. For instance, ventricular septal defect is a subclass of its parent term abnormality of the ventricular septum in the sense that ventricular septal defect is a kind of abnormality of the ventricular septum and every person with a ventricular septal defect can also be said to have an abnormality of the ventricular septum. Thus, the true-path rule states that if a disease is annotated to a term of the HPO, then the disease is implicitly annotated to all of the ancestors of the term in the ontology (Fig. 2).
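The true-path rule stated above can be made concrete with a small sketch. The diseases and their annotations are hypothetical, and the graph is a toy excerpt whose edges are partly assumed; this is not the HPO project's own code. Each disease is annotated directly to its most specific terms, and the implied annotations are obtained by adding all ancestors, so that a query on a general term also finds diseases annotated to its more specific descendants.

```python
from typing import Dict, List, Set

# Toy excerpt: child term -> parent terms. Edges not stated in the text are assumptions.
PARENTS: Dict[str, List[str]] = {
    "abnormality of the cardiac septa": [],
    "abnormality of the ventricular septum": ["abnormality of the cardiac septa"],
    "abnormality of the atrial septum": ["abnormality of the cardiac septa"],
    "ventricular septal defect": ["abnormality of the ventricular septum"],
    "atrial septal defect": ["abnormality of the atrial septum"],
    "macrocephaly": [],
}

# Hypothetical diseases with their direct (most specific) annotations.
ANNOTATIONS: Dict[str, Set[str]] = {
    "disease A": {"ventricular septal defect"},
    "disease B": {"atrial septal defect"},
    "disease C": {"abnormality of the cardiac septa"},
    "disease D": {"macrocephaly"},
}


def term_and_ancestors(term: str) -> Set[str]:
    """Return the term together with all of its ancestors."""
    found: Set[str] = {term}
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(PARENTS[current])
    return found


def implied_annotations(disease: str) -> Set[str]:
    """Apply the true-path rule: direct annotations plus all their ancestors."""
    implied: Set[str] = set()
    for term in ANNOTATIONS[disease]:
        implied |= term_and_ancestors(term)
    return implied


def search(query_term: str) -> List[str]:
    """Return diseases annotated, directly or implicitly, to the query term."""
    return sorted(d for d in ANNOTATIONS if query_term in implied_annotations(d))
```

Here `search("abnormality of the cardiac septa")` finds diseases A, B, and C, whereas `search("ventricular septal defect")` finds only disease A.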
One obvious advantage of capturing phenotypic information in the form of an ontology is that search routines can be designed to exploit the semantic relationships between terms. For instance, the search procedure can be designed such that a search on abnormality of the cardiac septa will not just return all diseases annotated to this term, but also all diseases annotated to related terms such as ventricular septal defect or atrial septal defect. We have implemented such a search algorithm in the PhenExplorer, a browser for HPO terms and annotated diseases that is available at the HPO homepage.
Fig. 1. An excerpt of the Gene Ontology (GO) showing some of the biological process terms. The links between the terms represent semantic relationships. For instance, the term rRNA processing is a specific instance of the term RNA processing. GO has been used to perform overrepresentation analysis in order to provide an indication of the terms which best describe the biological characteristics of high-throughput experiments such as microarray hybridizations (33). In this figure, the results of analysis of a microarray dataset using the Ontologizer (66,67) are shown, in which GO terms with statistically significant overrepresentation are displayed in green.

The HPO and clinical diagnostics in human genetics
As an example of how the HPO can be used in practical applications, we now describe a procedure we developed for using an ontological semantic similarity analysis for clinical diagnostics. The differential diagnosis in clinical genetics can often be challenging (20,28,29), for which reason a number of commercial and freely available computational tools have been developed including the LDDB (15), POSSUM (16), OMIM (14), and Orphanet (17).
Our procedure makes use of the semantic structure of the HPO in order to weight the importance of the query and disease terms according to their clinical specificity. Intuitively, if a physician enters the query term downward slanting palpebral fissures, the amount of clinical information about the patient is higher than if the physician enters the term abnormality of the eyelid. We designed a search procedure that takes this into account by weighting the best match between a query term and the terms of any given disease according to the information content of that term (Fig. 3). We additionally developed a statistical model that uses the distribution of semantic similarity scores that would be obtained by searching with randomly chosen HPO terms in order to assign a p-value to the results of each search, and thereby provide not only a ranking but also a significance threshold for search results (30,31). This distinguishes our search procedure from other search routines commonly used in clinical genetics that are designed to show all diseases that are characterized by a certain number of query terms.
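The weighting idea can be sketched as follows. The information content of a term is taken as the negative logarithm of the fraction of diseases annotated to it, and the similarity of two terms as the information content of their most informative common ancestor. The toy graph, the two hypothetical diseases X and Y, and the choice of averaging the best match per query term are assumptions made for this illustration; the actual procedure is described in Refs (30,31).

```python
import math
from typing import Dict, List, Set

# Toy excerpt: child term -> parent terms (several edges are assumptions).
PARENTS: Dict[str, List[str]] = {
    "phenotypic abnormality": [],
    "abnormality of the eye": ["phenotypic abnormality"],
    "abnormality of the eyelid": ["abnormality of the eye"],
    "downward slanting palpebral fissures": ["abnormality of the eyelid"],
    "hypertelorism": ["abnormality of the eye"],
    "telecanthus": ["abnormality of the eye"],
}

# Partial, illustrative annotations; diseases X and Y are hypothetical.
ANNOTATIONS: Dict[str, Set[str]] = {
    "GCPS": {"downward slanting palpebral fissures", "hypertelorism"},
    "OFD2": {"downward slanting palpebral fissures", "telecanthus"},
    "disease X": {"abnormality of the eyelid"},
    "disease Y": {"phenotypic abnormality"},
}


def closure(term: str) -> Set[str]:
    """The term itself plus all of its ancestors."""
    found: Set[str] = {term}
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(PARENTS[current])
    return found


def information_content() -> Dict[str, float]:
    """IC(t) = -log(fraction of diseases annotated, directly or implicitly, to t)."""
    counts = {term: 0 for term in PARENTS}
    for terms in ANNOTATIONS.values():
        implied: Set[str] = set()
        for term in terms:
            implied |= closure(term)
        for term in implied:
            counts[term] += 1
    total = len(ANNOTATIONS)
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


IC = information_content()


def similarity(term_a: str, term_b: str) -> float:
    """IC of the most informative common ancestor of the two terms."""
    common = closure(term_a) & closure(term_b)
    return max(IC[t] for t in common)


def score(query: List[str], disease: str) -> float:
    """Average, over the query terms, of the best match among the disease terms."""
    best = [max(similarity(q, d) for d in ANNOTATIONS[disease]) for q in query]
    return sum(best) / len(best)


query = ["downward slanting palpebral fissures", "hypertelorism"]
gcps_score = score(query, "GCPS")
ofd2_score = score(query, "OFD2")
```

In this toy setting the query matches both GCPS terms exactly, whereas for OFD2 the term hypertelorism is matched only through the more general abnormality of the eye, so the score for GCPS is higher than that for OFD2.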
In the 14th Century, the English logician and Franciscan friar William of Occam posited 'Pluralitas non est ponenda sine necessitate' (Plurality should not be posited without necessity). This principle came to be known as Occam's razor and is commonly applied in the setting of differential diagnosis in the sense that if a single diagnosis can be found that explains all of a patient's signs and symptoms, it is likely to be the correct diagnosis. However, a counterargument attributed to John Hickam, MD, states that 'A patient can have as many diagnoses as he darn well pleases' (32). In the field of medical genetics, Hickam's dictum would apply to situations in which an individual has multiple manifestations of some underlying hereditary disease in addition to one or more unrelated manifestations. Another common problem in medical diagnostics is that physicians may either not know the correct name of a disease manifestation or not have performed the necessary diagnostic tests to show it. For instance, if a physician is not able to describe a cone-shaped epiphysis of the middle phalanx of the third finger using this phrase but instead enters abnormality of the epiphysis of the middle phalanx of the third finger, search routines based on text matching or exact matches of features will not recognize any similarity between the two phrases. On the other hand, we showed using extensive simulations that ontological methods are particularly robust in the face of the kind of phenotypic noise described by Hickam's dictum as well as imprecision in the choice of the search term (30).
Fig. 2. The Human Phenotype Ontology (HPO) is arranged as a directed acyclic graph (DAG) in which the terms represent subclasses (more specific instances) of their parent term. A term in a DAG can have more than one parent term, which in the case of the HPO means that a given phenotypic feature can be considered to be a more specific aspect of more than one parent term. In the excerpt of the HPO shown in this figure, abnormality of the atrial septum is a subclass of both abnormality of the cardiac septa and abnormality of the cardiac atria. Terms that are located close to the root of the graph are less specific than terms that are farther away from it. Specificity is quantified in the HPO as the information content of a term ($-\log p_i$, where $p_i$ represents the frequency of the phenotypic manifestation $i$ among all diseases in the database). Intuitively speaking, a term such as mental retardation, which is a common phenotypic manifestation of many hereditary diseases, is less clinically specific (has less information content) than a feature such as calcific stippling.
Fig. 3. [...] Note that only some of the many annotations of these syndromes are shown in the figure. The amount of phenotypic similarity (shown in yellow) is less for OFD2 than for GCPS because the former is not annotated to the HPO term hypertelorism. The implications of this for the calculation of the semantic similarity are shown in (c). For GCPS, there is a perfect match for both query terms. In contrast, for OFD2, the best match to the query term hypertelorism is with the term telecanthus. The most specific common ancestor of hypertelorism and telecanthus is the term abnormality of the eye, and the similarity between hypertelorism and telecanthus is therefore calculated as the information content of the term abnormality of the eye. Therefore, a search with the query terms downward slanting palpebral fissures and hypertelorism yields a higher score for GCPS than for OFD2. The mathematical details of the procedure are explained in Refs (30,31).
The use of p-values to rank the results of diagnostic queries also has the advantage of assigning an estimation of the plausibility of a given result. If the best result or results receive a statistically significant p-value, then this can be interpreted as a suggestion that the differential diagnoses are plausible and should be considered further by the clinician. If on the other hand, no differential diagnosis achieves a significant p-value, then this can be an indication that the combination of clinical signs and symptoms is not specific enough to enable a diagnosis, or that the correct diagnosis is not present in the database. We have implemented our methods in a web-based program called the Phenomizer, which additionally implements a number of routines that may aid physicians in differential diagnostic considerations (Fig. 4).
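The idea of assigning a p-value by comparison with randomly chosen query terms can be sketched as a simple Monte Carlo estimate. This is a generic illustration, not the statistical model of Refs (30,31): the function name `empirical_p_value`, the toy overlap score, the term pool, and the add-one correction that keeps the estimate above zero are all assumptions of this sketch.

```python
import random
from typing import Callable, List, Sequence


def empirical_p_value(
    observed: float,
    score: Callable[[List[str]], float],
    term_pool: Sequence[str],
    query_size: int,
    n_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Estimate P(score of a random query >= observed score) by sampling."""
    rng = random.Random(seed)
    at_least_as_high = 0
    for _ in range(n_samples):
        random_query = rng.sample(list(term_pool), query_size)
        if score(random_query) >= observed:
            at_least_as_high += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (at_least_as_high + 1) / (n_samples + 1)


# Toy setting: 50 hypothetical terms, and a disease annotated to three of them.
TERM_POOL = [f"term{i}" for i in range(50)]
DISEASE_TERMS = {"term0", "term1", "term2"}


def overlap_score(query: List[str]) -> float:
    """Stand-in for a semantic similarity score: fraction of query terms matched."""
    return len(set(query) & DISEASE_TERMS) / len(query)


# A query that matches the disease perfectly is rarely equalled by random queries.
p_specific = empirical_p_value(
    overlap_score(["term0", "term1", "term2"]), overlap_score, TERM_POOL, 3
)
# A score of zero is reached by every random query, so its p-value is 1.
p_uninformative = empirical_p_value(0.0, overlap_score, TERM_POOL, 3)
```

A small p-value for the best-scoring disease suggests that the match is unlikely to arise from randomly chosen terms, whereas a p-value near 1 indicates that the score carries little diagnostic information.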

Computational phenotype research
The GO (18) has been extensively adopted by the molecular biology community as a kind of lingua franca for describing the biological function of gene products in humans and model organisms using a consistent and computable language. Although the GO was originally developed primarily to provide a means for integration, retrieval, and computation of data (33), it is now commonly used to help understand the results of high-throughput expression profiling experiments (34), as well as for network modeling (35), analysis of semantic similarity (36), and many other applications.
Clinical medicine and research have not yet embraced ontologies and information technology to the same extent. In fact, the clinical research process has been termed 'antiquated', and it has been estimated that the time required to go from initial studies on effectiveness and safety of new drugs and to translate this knowledge into accepted treatments is approximately 17 years (37). Clearly, improving knowledge transfer between researchers and practicing physicians as well as improving the exchange of information among clinicians themselves are essential measures for streamlining research, improving clinical decision making based on current research findings and making optimal use of available information to improve the quality of patient care.
Our goal in developing the HPO is to provide researchers and clinicians in the field of human genetics with a shared and well-defined ontological framework with which to share clinical data and knowledge in a standardized, computer-readable way. Just as GO is still a work in progress today, a decade after it was first published, the HPO is still undergoing active development. The HPO is adopting definitions from the Elements of Morphology series (8,38-43) for the relevant HPO terms, and the HPO team will welcome contributions from the community on new terms, term definitions, or annotations of diseases. In the coming years, the HPO will be refined and improved and the coverage of specific clinical areas will be extended in collaboration with clinical experts. In addition, we plan on extending the annotations to other forms of hereditary disease such as microdeletions and other copy-number variation diseases and chromosomal aberrations. Data regarding the frequency of individual manifestations among persons affected with a disease will be extracted from the medical literature and clinical practice. The HPO and the PATO teams are currently developing computer-interpretable logical definitions for the terms of the HPO using PATO, the ontology of phenotypic qualities (44-47), to link terms of the HPO to the anatomic and other entities that are affected by abnormal phenotypic qualities (48). The data and algorithms provided by the HPO project will form a basis for incorporating the human phenome into large-scale computational analysis of gene expression patterns and other cellular phenomena associated with human disease.
One of the important questions that can be addressed with a tool for describing human phenotypic features and measuring their similarity to one another is determining the relationship between phenotypes and biochemical or other networks in the cell. The fact that groups of genetic diseases that are associated with mutations in genes encoding proteins that interact in biochemical pathways or protein complexes also often display overlapping phenotypic features led to the development of the concept of disease-gene families by Jürgen Spranger, Han Brunner, and others (49-52). Parallel advances in high-throughput technologies in molecular genetics and the computational analysis of networks led to the concept of the diseasome (53), a term which refers to complex relationships between biochemical, genetic, cellular, phenotypic, and other networks that together underlie human disease. The importance of diseasomics has become ever more obvious in the field of human genetics with the identification of biochemical or protein interaction networks whose dysfunction underlies groups of phenotypically related diseases (54,55). Recent computational projects have shown the potential of incorporating human phenotypic data into the analysis of cellular networks (56-61).
One prerequisite for achieving the full promise of phenotype analysis in human genetics and in other fields of medicine is quite simply data availability. It seems clear that it will be important for the field of human genetics to create a comprehensive, central genotype-phenotype database (9,62,63), even though there are still enormous obstacles. The HPO project intends to provide a standardized vocabulary for computational and clinical research involving human phenotypic abnormalities that can be used as a common vocabulary to link mutation data, locus-specific databases (64,65), and potentially large centralized genotype-phenotype databases in a way that will offer a platform for data interchange between clinical researchers and allow the systematic and accurate use of phenotypic data in computational analysis.
Fig. 4. [...] In this example, we have entered several phenotypic features of brachydactyly type C. The selected query terms are shown in the right-hand part of the window. Each feature can be marked as mandatory if the clinician is sure that it is definitely related to the underlying disease (this might be the case if multiple affected members of a family all show the feature in question). If a feature is marked as mandatory, then all diseases that are not annotated to the feature are filtered out. After the clinical features have been entered, a click on the Get diagnosis button will cause the Phenomizer to display a list of differential diagnoses ranked according to p-value. (b) The best matches are listed ranked according to their p-values. The Phenomizer also offers an Improve differential diagnosis function that can help to identify additional clinical features that, if present, would best distinguish between a list of candidate differential diagnoses. Users need to activate the check box next to diseases that should be included (e.g. physicians can click on the top 5 or 10 candidates or may be able to use prior knowledge to exclude some of the candidates). The Phenomizer then generates a list of phenotypic features that are associated with about half of the candidate diagnoses (binary search), or that are relatively specific for single diagnoses (specific search). Users can get access to further information about any of the differential diagnoses by using the context menu (right mouse click): for instance, a list of all phenotypic features of the disease (Show annotations), links to the Entrez Gene (68) entry for genes associated with the disease (Show known genes), a link to the OMIM (14) entry for the disease (Show OMIM entry), and a display of the overlap between the query terms and all annotated terms (Display overlap). The results of the search can be exported in CSV format or as a PDF document. The PDF document can be used to document the differential diagnostic process in patient charts and also contains a list of all phenotypic features that are specific for each disease among all diseases considered for the differential diagnosis, which may be used to help plan a continued differential diagnostic workup. The Phenomizer is freely available at http://compbio.charite.de/phenomizer.