The New York Genome Center (NYGC), funded by NHGRI, has sequenced 3202 samples from the 1000 Genomes Project sample collection to 30x coverage. The 2504 unrelated samples from the 1000 Genomes Project phase three panel were sequenced first, followed by an additional 698 samples that are related to samples in the 2504 panel. NYGC aligned the data to GRCh38, and those alignments are publicly available along with a data reuse statement. Details, including URLs for the data in ENA, are in our data portal (below) and are listed on our FTP site. The alignments can be accessed at the following locations:
- ENA - a tab-delimited file list is available for the 2504 panel, or see the 2504 ENA Study at https://www.ebi.ac.uk/ena/data/view/PRJEB31736; a tab-delimited file list is also available for the 698 related samples, or see the 698 ENA Study at https://www.ebi.ac.uk/ena/browser/view/PRJEB36890
- AnVIL - https://app.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019
- AWS - s3://1000genomes/1000G_2504_high_coverage/ and s3://1000genomes/1000G_2504_high_coverage/additional_698_related/
- NCBI FTP - ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/1000G_2504_high_coverage/ and ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/1000G_2504_high_coverage/additional_698_related/
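For scripted access, it can help to split the AWS locations above into a bucket name and a key prefix, which is the form most S3 client libraries expect. The following is a minimal sketch rather than an official access method: the helper names `split_s3_uri` and `s3_to_https` are hypothetical, and the HTTPS form assumes the bucket permits anonymous reads through the standard virtual-hosted-style S3 endpoint.

```python
from urllib.parse import urlparse

# Locations copied verbatim from the AWS entry in the list above.
S3_LOCATIONS = {
    "2504_panel": "s3://1000genomes/1000G_2504_high_coverage/",
    "698_related": "s3://1000genomes/1000G_2504_high_coverage/additional_698_related/",
}


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # The path keeps its leading slash; S3 key prefixes do not have one.
    return parsed.netloc, parsed.path.lstrip("/")


def s3_to_https(uri):
    """Build a virtual-hosted-style HTTPS URL for the same location.

    Assumption: the bucket allows anonymous reads over HTTPS.
    """
    bucket, prefix = split_s3_uri(uri)
    return f"https://{bucket}.s3.amazonaws.com/{prefix}"


if __name__ == "__main__":
    for label, uri in S3_LOCATIONS.items():
        bucket, prefix = split_s3_uri(uri)
        print(label, bucket, prefix, s3_to_https(uri), sep="\t")
```

The bucket and prefix pair can then be passed to whichever S3 client is in use; no network access is needed to run the sketch itself.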
NYGC has also performed variant calling on the data, and the resulting call sets are available on our FTP site. These include:
- Genotype VCFs - these include the genotypes for all samples in the “recalibrated_variants.vcf.gz” files (also listed below)
- Phased VCFs - the phased SNV/INDEL/SV calls for the 3202 samples
- Structural Variation VCFs - structural variation calls for the 3202 samples
- Pedigree info file - sample metadata file listing pedigree and sex information for the 3202 1000 Genomes Project samples
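As an illustration of how the pedigree info file might be used, the sketch below summarises a whitespace-delimited pedigree table. The layout is an assumption, not something stated here: the column names (`sampleID`, `fatherID`, `motherID`, `sex`), the use of `0` for an unknown parent, and the function name `summarize_pedigree` are all hypothetical and should be checked against the actual file before use. The sample rows are made up for the example.

```python
from collections import Counter


def summarize_pedigree(text, id_col="sampleID", father_col="fatherID",
                       mother_col="motherID", sex_col="sex", missing="0"):
    """Summarise a whitespace-delimited pedigree table held in a string.

    Assumptions (hypothetical, verify against the real file): the first
    non-empty line is a header with the given column names, and `missing`
    marks an unknown parent.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, records = rows[0], rows[1:]
    # Look up each column by name so the column order does not matter.
    col = {name: header.index(name)
           for name in (id_col, father_col, mother_col, sex_col)}

    by_sex = Counter(rec[col[sex_col]] for rec in records)
    with_both_parents = sum(
        1 for rec in records
        if rec[col[father_col]] != missing and rec[col[mother_col]] != missing
    )
    return {
        "n_samples": len(records),
        "by_sex": dict(by_sex),
        "n_with_both_parents": with_both_parents,
    }


# Made-up rows for illustration only; these are not real sample IDs.
EXAMPLE = """\
sampleID fatherID motherID sex
CHILD1 DAD1 MOM1 1
DAD1 0 0 1
MOM1 0 0 2
"""

if __name__ == "__main__":
    print(summarize_pedigree(EXAMPLE))
```

Looking columns up by header name rather than by position keeps the sketch usable if the real file orders its columns differently, provided the names are adjusted to match.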
The initial GATK call set for the 2504 unrelated samples remains available.
A preprint describing this work is available and can be used for citation purposes.