Personal genome project (PGP) - Predict individuals' phenotypes

Please go to the prediction submission form to submit your predictions for the PGP dataset (to submit, you need to be logged into your account).

NOTE added on Nov 29: When submitting predictions for phenotypes #41-45, please submit all the subtests of each of these phenotypes in the order in which they appear on the test website. For more information, please see the phenotype list.

Background: The first ten genomes or exomes of participants in the PGP project are now publicly available. The public profiles of these persons are available at

Statement on selecting the phenotypes for the PGP challenge of pre-pro-CAGI

The pre-pro-CAGI list of phenotypes was built to include both mostly Mendelian diseases and more difficult “complex” disorders. Several of these have significant environmental components, and many phenotypes are difficult or impossible to predict with current scientific knowledge. This was a conscious decision, made with the aim of provoking discussion on how the phenotypes should be selected in further rounds of CAGI. The selected phenotypes provide a measure of the current ability to predict a broad range of phenotypes. They also provide a baseline for long-term progress in predictive ability.

We are asking participants to predict the probability of each of the PGP10 individuals having any of the binary phenotypes on our list, and to predict a numeric value for the continuous characters (e.g., LDL) along with a confidence interval. We also invite predictions on any additional genetic phenotypes.
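The paragraph above asks for a confidence interval, while the submission format asks for a standard deviation. As a minimal sketch, the two can be related if the prediction error is assumed to be normally distributed; that assumption, the function name confidence_interval, and the example numbers are illustrative and not part of the challenge specification.

```python
from statistics import NormalDist


def confidence_interval(mean: float, sd: float, level: float = 0.95) -> tuple[float, float]:
    """Return a two-sided interval around `mean`.

    Assumption (not from the challenge text): the prediction error is
    normally distributed with standard deviation `sd`.
    """
    if sd < 0:
        raise ValueError("standard deviation must be non-negative")
    if not 0 < level < 1:
        raise ValueError("level must be strictly between 0 and 1")
    # Quantile of the standard normal for the requested coverage.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * sd, mean + z * sd


# Illustrative values only: a predicted mean of 100 with SD 10.
low, high = confidence_interval(100.0, 10.0)
print(f"95% interval: {low:.1f} to {high:.1f}")
```

Under the same assumption, a predictor who starts from an interval can recover the standard deviation by dividing the half-width of the interval by the same quantile.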

Most of these phenotypes have not yet been made public by the PGP10 on the project website. As predictors make their predictions, the PGP10 participants will be asked to provide these phenotypes so that assessors can use them to determine the accuracy of the predictions. If a phenotype is already listed in the public profile of an individual (e.g., migraine), the prediction of that phenotype for that individual will not be assessed. However, we will assess predictions of such a phenotype for other individuals, as long as the phenotype is not listed in their public profiles.

At the pre-pro-CAGI workshop, we will present the assessment of how well the phenotypes were predicted, based on the answers from the PGP10. Most importantly, however, we will encourage a discussion on what kinds of phenotypes should be included in the next challenge. For example, some of the phenotypes in the pre-pro-CAGI list are periodic, and we currently do not specify whether these should be measured at a single time point or over the whole lifetime (in which case they may not have manifested yet). Another question is whether the signal in the PGP10 sample of 10 genomes is large enough to predict any of the rarer diseases.

We look forward to your ideas and comments on how to select predictable and assessable phenotypes for the next CAGI challenge.

Dataset: The list of phenotypes for the PGP Challenge may be found here or downloaded in pdf format here. NOTE: This file was updated on 29 Nov 2010 to reflect the subtests under the phenotypes (#41-45).

Prediction challenge: Submit predictions of the above phenotypes for the 10 PGP individuals (data available online at: For binary traits, submit the probability of a person having the phenotype. For numerical traits, submit the numerical mean and standard deviation. Predictors are also welcome to name additional phenotypes that they discover from the genomes, including rare diseases. We will ask the PGP10 participants to report their phenotypes for these traits when possible.

Prediction submission format: The prediction submission is a simple text file. The organizers provide a file template, which should be used for submission. In the submitted file, each line should include the following columns:

1) The phenotype number (use the order from the phenotype list provided here); for additional phenotypes, write the name of the phenotype
2) Prediction (for phenotypes 1-32, provide the probability; for phenotypes 33-45, provide the numerical mean; for phenotype 46, write the sign)
3) Standard deviation of the prediction in column 2
4) Raw output data from your prediction algorithm

In the template file, columns 2-4 are marked with an “*”. Submit your predictions by replacing each “*” with your prediction value. If a prediction cannot be submitted for a specific phenotype, leave the “*” in these columns.
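The four-column layout described above can be sketched as a small helper that writes one submission line and checks the value in column 2. This is only an illustration: the tab separator, the function names format_line and check_prediction, and the example values are assumptions, and the official template file remains the authoritative format. The sketch also ignores the subtests of phenotypes 41-45 and does not restrict the sign for phenotype 46, since its allowed values are not given here.

```python
import math

PLACEHOLDER = "*"  # marker the template uses for columns 2-4


def format_line(phenotype, prediction=PLACEHOLDER, sd=PLACEHOLDER,
                raw=PLACEHOLDER, sep="\t"):
    """Join the four columns into one line.

    Assumption: columns are tab-separated; use whatever separator the
    official template uses. Missing values keep the "*" placeholder.
    """
    return sep.join(str(value) for value in (phenotype, prediction, sd, raw))


def check_prediction(phenotype_number: int, prediction: str) -> bool:
    """Check that column 2 matches the kind of value asked for."""
    if prediction == PLACEHOLDER:
        return True  # no prediction submitted for this phenotype
    if 1 <= phenotype_number <= 45:
        try:
            value = float(prediction)
        except ValueError:
            return False
        if 1 <= phenotype_number <= 32:
            return 0.0 <= value <= 1.0  # binary traits: a probability
        return math.isfinite(value)     # phenotypes 33-45: a numerical mean
    if phenotype_number == 46:
        return prediction.strip() != ""  # a sign; allowed values not specified here
    return False


# Illustrative values only.
print(format_line(1, 0.25, 0.1, "score=0.25"))
print(format_line(40))  # nothing predicted: placeholders stay in place
```

Keeping the placeholder as the default means that a line for an unpredicted phenotype comes out exactly as it appears in the template.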

In addition, your submission should include a detailed description of the method used to make the predictions. This information will be submitted as a separate file.

Please go to the prediction submission form to submit your predictions for the PGP dataset (to submit, you need to be logged into your account).

Dataset provided by George Church, Harvard Medical School. Phenotypes proposed by CAGI organizers in consultation with George Church.