Earlier this year, the New York Genome Center (NYGC) released high-coverage (30x) data for an additional 698 samples from the 1000 Genomes Project sample collections. These 698 samples are related to samples in the original set of 2,504 previously sequenced by NYGC. Those 2,504 samples, which are unrelated to each other, made up the panel used by the 1000 Genomes Project in its third (and final) phase. This brings the total number of samples sequenced to high coverage by NYGC to 3,202, in work funded by NHGRI.
NYGC aligned the data to the GRCh38 reference assembly, and the resulting CRAM files have been shared and are listed in our data portal. These files can be accessed from FTP sites hosted by EMBL-EBI and NCBI, and are also hosted on AWS and AnVIL. Details on accessing and using the data can be found on our page for this data collection.
This high-coverage data adds to the previous data sets, giving us:
The phase three 1000 Genomes Project low-coverage and exome data on GRCh37, as used for the 1000 Genomes Project phase three analysis published in 2015
The phase three 1000 Genomes Project low-coverage and exome data realigned to GRCh38 (used to support recalling from the data against GRCh38)
30x high-coverage data from NYGC on GRCh38, for which an integrated call set is being produced and preliminary call sets have been shared
These data collections, covering large numbers of samples, are supplemented by other data collections in IGSR where a wider range of technologies has been applied to subsets of the samples. Genomic sequence data is also available for samples that were not part of the 1000 Genomes Project. Our data portal can be used to explore the main data sets in IGSR, with additional (and preliminary) data sets available via our FTP site.