The 1000 Genomes Project

Overview of the 1000 Genomes Project

The goal of the 1000 Genomes Project was to find common genetic variants with frequencies of at least 1% in the populations studied.

The 1000 Genomes Project took advantage of developments in sequencing technology, which sharply reduced the cost of sequencing. It was the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. Data from the 1000 Genomes Project was quickly made available to the worldwide scientific community through freely accessible public databases.

Sequencing remained too expensive to deeply sequence the many samples being studied in the project. However, any particular region of the genome generally contains a limited number of haplotypes, so data was combined across samples to allow efficient detection of most of the variants in a region. The project planned to sequence each sample to 4x genomic coverage; at this depth, sequencing cannot discover all variants in each sample, but can detect most variants with frequencies as low as 1%. In the final phase of the project, data from 2,504 samples was combined to allow highly accurate assignment of the genotypes in each sample at all the variant sites the project discovered. The multi-sample approach, combined with genotype imputation, allowed the project to determine a sample’s genotype even at variant sites not covered by sequencing reads in that sample.
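The arithmetic behind combining low-coverage data across samples can be illustrated with a rough sketch. It assumes that read depth at a site follows a Poisson distribution, that a heterozygous carrier receives half of the reads on the variant allele, and that a variant needs at least two supporting reads to be noticed. None of these assumptions comes from the project's own methods; the threshold `MIN_READS` and the helper names are hypothetical and chosen only for illustration.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))


def prob_at_least(min_reads, mean):
    """P(X >= min_reads) for X ~ Poisson(mean)."""
    if min_reads <= 0:
        return 1.0
    return 1.0 - poisson_cdf(min_reads - 1, mean)


DEPTH = 4.0          # planned per-sample coverage, from the text
N_SAMPLES = 2504     # samples in the final phase, from the text
ALLELE_FREQ = 0.01   # target variant frequency, from the text
MIN_READS = 2        # hypothetical evidence threshold (assumption)

# A heterozygous carrier has roughly half the reads on the variant allele.
per_allele_depth = DEPTH / 2

# Chance that one carrier, taken alone, shows enough variant reads.
single_sample = prob_at_least(MIN_READS, per_allele_depth)

# Expected number of variant allele copies among all sampled chromosomes.
expected_copies = 2 * N_SAMPLES * ALLELE_FREQ

# Pooling reads over all carriers multiplies the expected supporting depth.
pooled_mean = expected_copies * per_allele_depth
pooled = prob_at_least(MIN_READS, pooled_mean)

print(f"single carrier: {single_sample:.3f}")
print(f"expected allele copies: {expected_copies:.2f}")
print(f"pooled across samples: {pooled:.6f}")
```

Under these simplified assumptions, a single carrier shows enough evidence only about 59% of the time, while a 1% variant is expected to appear on about 50 chromosomes in the full sample set, so the pooled evidence is close to certain. Real variant calling also has to account for sequencing error, mapping quality and sampling variation in the number of carriers, which this sketch ignores.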

The contribution of the 1000 Genomes Project to genomics was summarised in Nature in the issue containing the final publications from the main project.

Design of the 1000 Genomes Project

The project was planned during a meeting at the Wellcome Genome Campus in September 2007. You can read the original plan in the meeting report. Once underway, the project was conducted in four stages: a pilot phase and three phases of the main project. In the main project, phases one and three produced data, with phase two concentrating on technical development.

Pilot project

Three pilot studies provided data to inform the design of the full-scale project:

Pilot | Purpose | Coverage | Strategy | Status
--- | --- | --- | --- | ---
1 - low coverage | Assess strategy of sharing data across samples | 2-4X | Whole-genome sequencing of 180 samples | Sequencing completed October 2008
2 - trios | Assess coverage and platforms and centres | 20-60X | Whole-genome sequencing of 2 mother-father-adult child trios | Sequencing completed October 2008
3 - gene regions | Assess methods for gene-region-capture | 50X | 1000 gene regions in 900 samples | Sequencing completed June 2009

Data from the pilot projects was analysed to determine whether the strategy of 4x coverage was adequate to meet the goals of the project.

Main project

Sequencing was carried out in phases one and three of the main project, with data releases and analysis corresponding to each. The final data freeze, associated with the third and final phase, took place on 2 May 2013. This data set (defined in the 20130502.sequence.index file) is the finalised data set on which the phase three analysis was based, and it supersedes previous data releases. Analysis methods were further developed during the course of the project, so the phase three analysis also replaces earlier versions.

The final data set contains data for 2,504 individuals from 26 populations. Low coverage and exome sequence data are present for all of these individuals; 24 individuals were also sequenced to high coverage for validation purposes.
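Counts like these can be checked against a sequence index file with a short script. The sketch below assumes the index is tab-delimited with a header row and that it has columns named `SAMPLE_NAME` and `POPULATION`; confirm those names against the file you download. The function name and the example rows are made up for illustration and are not real project data.

```python
import csv
import io
from collections import defaultdict


def samples_per_population(handle, sample_col="SAMPLE_NAME", population_col="POPULATION"):
    """Count distinct samples per population in a tab-delimited index.

    The column names are assumptions; pass different ones if the file differs.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    samples = defaultdict(set)
    for row in reader:
        # A sample usually has several sequencing runs, so use a set
        # to count each sample only once.
        samples[row[population_col]].add(row[sample_col])
    return {pop: len(names) for pop, names in samples.items()}


# Made-up rows for illustration only (not real project data).
EXAMPLE = (
    "SAMPLE_NAME\tPOPULATION\tRUN_ID\n"
    "SAMPLE_A\tPOP_X\tRUN_1\n"
    "SAMPLE_A\tPOP_X\tRUN_2\n"
    "SAMPLE_B\tPOP_X\tRUN_3\n"
    "SAMPLE_C\tPOP_Y\tRUN_4\n"
)

counts = samples_per_population(io.StringIO(EXAMPLE))
print(counts)
print("samples:", sum(counts.values()), "populations:", len(counts))
```

To run it on a downloaded file, open the file in text mode and pass the handle in place of the `io.StringIO` object.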

Analyses covered both short variants (up to 50 base pairs in length) and structural variants. These analyses were published at the conclusion of the project in 2015. A list of our main publications can be seen below.

Publications

1000 Genomes Project samples and data

The 1000 Genomes Project developed guidelines on ethical considerations for investigators doing sampling, outlined in the Informed Consent Background Document and the Informed Consent Form Template. All collections included in the Project followed these ethical guidelines and model informed consent language. The 1000 Genomes Project Steering Committee, with input from the Project’s Samples and ELSI Group, made final decisions about which populations and sample sets to include in the Project.

Data from the 1000 Genomes Project is available without embargo, following the final publications from the project. Use of the data should be cited in the usual way; current citation details are available in the FAQs, which also give further guidance on using 1000 Genomes Project data. Additional information on using data provided by IGSR is also available and should be consulted.

The available data from the 1000 Genomes Project can be explored on our data page, alongside other data in IGSR. Cell lines and DNA are available for all 1000 Genomes samples and can be obtained from the Coriell Institute. A complete list of the populations available can be found on our Cell lines and DNA page.

The samples for the 1000 Genomes Project are anonymous and have no associated medical or phenotype data. The project holds self-reported ethnicity and gender. All participants declared themselves to be healthy at the time the samples were collected.

Please email questions about any of the above to info@1000genomes.org.