Are there any statistics about how much sequence data has been generated by the 1000 Genomes Project?

Statistics on how much data the 1000 Genomes Project produced are available in several forms. Documentation for some of the file formats used is available on the FTP site.

For raw data, a sequence.index file contains base and read counts for each of the active FASTQ files.
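As a rough illustration of how these per-file counts could be totalled, the sketch below sums read and base counts over a tab-delimited index. It assumes the file has a header row (possibly prefixed with `#`) and that the count columns are named `READ_COUNT` and `BASE_COUNT`; check both against the format documentation for the index you are using. The function name and the example rows are made up for illustration.

```python
import csv
import io


def summarize_index(handle, read_col="READ_COUNT", base_col="BASE_COUNT"):
    """Sum read and base counts over the rows of a tab-delimited index.

    Column names are assumptions; pass different ones if the header differs.
    """
    header = None
    totals = {"files": 0, "reads": 0, "bases": 0}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("##"):
            continue  # skip blank lines and '##' metadata lines
        if header is None:
            # First remaining line is treated as the header row.
            header = [name.lstrip("#").strip() for name in row]
            continue
        record = dict(zip(header, row))
        reads = record.get(read_col, "")
        bases = record.get(base_col, "")
        if not (reads.isdigit() and bases.isdigit()):
            continue  # skip rows without usable counts
        totals["files"] += 1
        totals["reads"] += int(reads)
        totals["bases"] += int(bases)
    return totals


# Made-up example rows, not real project data.
example = (
    "FASTQ_FILE\tREAD_COUNT\tBASE_COUNT\n"
    "a.fastq.gz\t100\t10000\n"
    "b.fastq.gz\t50\t5000\n"
)
print(summarize_index(io.StringIO(example)))
```

For a real index, open the downloaded file and pass the file handle in place of the `io.StringIO` object.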

During the 1000 Genomes Project, summary statistics were provided in a sequence indices directory, which is now located with the project's historical data. This contains four summary files, two for the exome analysis group and two for the low coverage analysis group. Each analysis group has a .stats file, which gives the number of runs withdrawn for different reasons, base count and coverage statistics for each study, and population level summaries. Each also has a stats.csv file, which compares the index to the previous one in terms of number of runs, bases and similar metrics. From late 2012, the 1000 Genomes Project also produced analysis.sequence.index files, which only consider Illumina runs with a read length of 70bp or longer; these also have statistics files.

For the aligned data, all BAM and CRAM files have associated BAS files, which contain read group level statistics for the alignment. We also provide this information in collected form in alignment index files. The alignment indices for the alignments of the 1000 Genomes Project data to GRCh38 are available on the FTP site. There is also a historic alignment indices directory, which contains a .hsmetrics file with the results of the Picard tool CalculateHsMetrics for all the exome alignments, along with summary files that compare statistics between old and new alignment releases during the 1000 Genomes Project.
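Because BAS files report statistics per read group, a per-sample figure has to be aggregated from several rows. The sketch below shows one way this might be done, computing the fraction of mapped bases per sample. It assumes a tab-delimited file with a header row and columns named `sample`, `#_total_bases` and `#_mapped_bases`; these names are assumptions to verify against the BAS format documentation, and the function name and example rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def mapped_fraction_by_sample(
    handle,
    sample_col="sample",
    total_col="#_total_bases",
    mapped_col="#_mapped_bases",
):
    """Aggregate read group rows into a mapped-base fraction per sample.

    Column names are assumptions; override them if the header differs.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    sums = defaultdict(lambda: [0, 0])  # sample -> [total bases, mapped bases]
    for row in reader:
        total = row.get(total_col) or ""
        mapped = row.get(mapped_col) or ""
        if not (total.isdigit() and mapped.isdigit()):
            continue  # skip rows without usable counts
        sample = row.get(sample_col) or "unknown"
        sums[sample][0] += int(total)
        sums[sample][1] += int(mapped)
    return {
        sample: (mapped / total if total else 0.0)
        for sample, (total, mapped) in sums.items()
    }


# Made-up example: two read groups belonging to one hypothetical sample.
example = (
    "sample\treadgroup\t#_total_bases\t#_mapped_bases\n"
    "SAMPLE_A\trg1\t1000\t900\n"
    "SAMPLE_A\trg2\t1000\t700\n"
)
print(mapped_fraction_by_sample(io.StringIO(example)))
```

Summing the counts before dividing weights each read group by its size, which is usually what is wanted for a per-sample figure; averaging the per-read-group fractions would give a different number when read groups differ in size.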
