Our sequence files are distributed in FASTQ format. Some are hosted on our own FTP site and some by the Sequence Read Archive.
We use Sanger-style Phred-scaled quality encoding.
The files are all gzip-compressed, and the format looks like this, with a four-line repeating pattern:
@ERR059938.60 HS9_6783:8:2304:19291:186369#7/2
GTCTCCGGGGGCTGGGGGAACCAGGGGTTCCCACCAACCACCCTCACTCAGCCTTTTCCCTCCAGGCATCTCTGGGAAAGGACATGGGGCTGGTGCGGGG
+
7?CIGJB:D:-F7LA:GI9FDHBIJ7,GHGJBKHNI7IN,EML8IFIA7HN7J6,L6686LCJE?JKA6G7AK6GK5C6@6IK+++?5+=<;227*6054
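As a minimal sketch of reading this four-line pattern, assuming Python and a gzipped FASTQ file on local disk: Sanger-style encoding stores each quality as the character whose ASCII code is the Phred score plus 33, so decoding is a subtraction. The function names `read_fastq` and `read_fastq_gz` are illustrative and not part of any project tool.

```python
import gzip
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a text handle of FASTQ records."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record or not record[0]:
            return  # end of input (or a trailing blank line)
        if len(record) < 4 or not record[0].startswith("@") or not record[2].startswith("+"):
            raise ValueError("malformed FASTQ record")
        name, sequence, _, quality = record
        # Sanger encoding: Phred score = ASCII code - 33.
        yield name[1:], sequence, [ord(char) - 33 for char in quality]


def read_fastq_gz(path):
    """Open a gzipped FASTQ file (any local path) and yield its records."""
    with gzip.open(path, "rt") as handle:
        yield from read_fastq(handle)
```

Under this encoding the character `7` at the start of the quality line above corresponds to a Phred score of 22, and `+` to a score of 10.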
Many of our individuals have multiple FASTQ files. This is because many of our individuals were sequenced using more than one run of a sequencing machine.
Each set of files named like ERR001268_1.filt.fastq.gz, ERR001268_2.filt.fastq.gz and ERR001268.filt.fastq.gz represents all the sequence from a sequencing run.
The labels _1 and _2 mark paired-end files: mate1 is found in the file labelled _1 and mate2 is found in the file labelled _2. The files which do not have a number in their name hold single-ended reads. This can be for two reasons. First, some sequencing early in the project was single-ended. Second, because we filter our FASTQ files as described in our README, if one read of a pair is rejected the other read is placed in the single file.
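The naming convention above can be sketched as a small classifier. This is an illustration only: the regular expression is inferred from the three example names quoted above, the three-letter-prefix form of the run accession is an assumption, and `classify_fastq` is a hypothetical helper.

```python
import posixpath
import re

# Inferred from names such as ERR001268_1.filt.fastq.gz (assumed accession shape).
_FASTQ_NAME = re.compile(r"^(?P<run>[A-Z]{3}\d+)(?:_(?P<mate>[12]))?\.filt\.fastq\.gz$")


def classify_fastq(path):
    """Return (run_accession, role); role is 'mate1', 'mate2' or 'single'."""
    filename = posixpath.basename(path)  # drop any leading directories
    match = _FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised FASTQ file name: {filename}")
    mate = match.group("mate")
    role = f"mate{mate}" if mate else "single"
    return match.group("run"), role
```

For example, `classify_fastq("ERR001268_2.filt.fastq.gz")` gives `("ERR001268", "mate2")`, and the file without a number gives `("ERR001268", "single")`.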
When an individual has many files with different run accessions (e.g. ERR001268), this means it was sequenced multiple times. This can be for the same experiment, as some centres used multiplexing to have better control over their coverage levels for the low coverage sequencing, or because it was sequenced using different protocols or on different platforms.
For a full description of the sequencing conducted for the project, please look at our sequence.index file.
We describe our sequence metadata in sequence index files. The index for data from the 1000 Genomes Project can be found in the 1000 Genomes data collection directory. Additional indices are present for data in other data collections. Our old index files, which describe the data used in the main project, can be found in the historical_data directory.
Sequence index files are tab-delimited files and frequently contain these columns:
Column | Title | Description |
---|---|---|
1 | FASTQ_FILE | path to fastq file on ftp site or ENA ftp site |
2 | MD5 | md5sum of file |
3 | RUN_ID | SRA/ERA run accession |
4 | STUDY_ID | SRA/ERA study accession |
5 | STUDY_NAME | Name of study |
6 | CENTER_NAME | Submission centre name |
7 | SUBMISSION_ID | SRA/ERA submission accession |
8 | SUBMISSION_DATE | Date sequence submitted, YYYY-MM-DD |
9 | SAMPLE_ID | SRA/ERA sample accession |
10 | SAMPLE_NAME | Sample name |
11 | POPULATION | Sample population, this is a 3 letter code as defined in README_populations.md |
12 | EXPERIMENT_ID | Experiment accession |
13 | INSTRUMENT_PLATFORM | Type of sequencing machine |
14 | INSTRUMENT_MODEL | Model of sequencing machine |
15 | LIBRARY_NAME | Library name |
16 | RUN_NAME | Name of machine run |
17 | RUN_BLOCK_NAME | Name of machine run sector (This is no longer recorded so this column is entirely null, it was left in so as not to disrupt existing sequence index parsers) |
18 | INSERT_SIZE | Submitter specified insert size |
19 | LIBRARY_LAYOUT | Library layout, this can be either PAIRED or SINGLE |
20 | PAIRED_FASTQ | Name of mate pair file if exists (Runs with failed mates will have a library layout of PAIRED but no paired fastq file) |
21 | WITHDRAWN | 0/1 to indicate if the file has been withdrawn, only present if a file has been withdrawn |
22 | WITHDRAWN_DATE | This is generally the date the file is generated on |
23 | COMMENT | comment about reason for withdrawal |
24 | READ_COUNT | read count for the file |
25 | BASE_COUNT | basepair count for the file |
26 | ANALYSIS_GROUP | the analysis group of the sequence, this reflects sequencing strategy. For 1000 Genomes Project data, this includes low coverage, high coverage, exon targeted and exome to reflect the two non low coverage pilot sequencing strategies and the two main project sequencing strategies used by the 1000 Genomes Project. |
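
A minimal sketch of reading such an index, assuming Python and the 26-column layout in the table above. Two things here are assumptions rather than documented behaviour: that comment or header lines start with `#`, and that an unprefixed header row repeats the title of column 1. Since the table says files "frequently" contain these columns, check a real index before relying on the positions. `read_sequence_index` is a hypothetical helper name.

```python
import csv

# Column titles in the order given in the table above.
COLUMNS = [
    "FASTQ_FILE", "MD5", "RUN_ID", "STUDY_ID", "STUDY_NAME", "CENTER_NAME",
    "SUBMISSION_ID", "SUBMISSION_DATE", "SAMPLE_ID", "SAMPLE_NAME", "POPULATION",
    "EXPERIMENT_ID", "INSTRUMENT_PLATFORM", "INSTRUMENT_MODEL", "LIBRARY_NAME",
    "RUN_NAME", "RUN_BLOCK_NAME", "INSERT_SIZE", "LIBRARY_LAYOUT", "PAIRED_FASTQ",
    "WITHDRAWN", "WITHDRAWN_DATE", "COMMENT", "READ_COUNT", "BASE_COUNT",
    "ANALYSIS_GROUP",
]


def read_sequence_index(handle):
    """Yield one dict per data row of a tab-delimited sequence index."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumed conventions: '#' marks comments/headers; a bare header row
        # starts with the title of column 1.
        if not row or row[0].startswith("#") or row[0] == COLUMNS[0]:
            continue
        # Pad short rows so trailing empty columns still map to a key.
        row = row + [""] * (len(COLUMNS) - len(row))
        yield dict(zip(COLUMNS, row))
```

Each yielded dict can then be queried by column title, for example `row["SAMPLE_NAME"]` or `row["LIBRARY_LAYOUT"]`.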
The sequence.index file contains a list of all the sequence data produced by the project, pointers to the file locations on the FTP site, and all the metadata associated with each sequencing run.
For the phase 3 analysis the consortium has decided to only use Illumina platform sequence data with reads of 70 base pairs or longer. The analysis.sequence.index file contains only the active runs which match this criterion. There are withdrawn runs in this index. These runs are withdrawn for one of the following reasons:

* They have insufficient raw sequence to meet our 3x non duplicated aligned coverage criteria for low coverage alignments.
* After the alignment has been run they have failed our post alignment quality controls for short indels.
* Contamination.
* They do not meet our coverage criteria.
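
The platform and read-length rule described above can be approximated per index row. This is a rough sketch and not the consortium's actual selection procedure. It assumes each row is a dict keyed by the column titles of the sequence index, that the platform is recorded as `ILLUMINA`, that a `WITHDRAWN` value of `1` marks a withdrawn file, and that mean read length (`BASE_COUNT` divided by `READ_COUNT`) is an acceptable stand-in for the read-length criterion, since the index has no read-length column. `passes_phase3_criteria` is a hypothetical name.

```python
def passes_phase3_criteria(row, min_read_length=70):
    """Rough check of one index row (dict keyed by column title)."""
    if row.get("WITHDRAWN", "").strip() == "1":
        return False
    if row.get("INSTRUMENT_PLATFORM", "").strip().upper() != "ILLUMINA":
        return False
    try:
        reads = int(row["READ_COUNT"])
        bases = int(row["BASE_COUNT"])
    except (KeyError, ValueError):
        return False  # counts missing or not numeric: read length cannot be judged
    # Assumption: mean read length stands in for the read-length rule.
    return reads > 0 and bases / reads >= min_read_length
```

The coverage, indel quality control and contamination reasons for withdrawal cannot be derived from the index columns alone, so this check only covers the platform and read-length part.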
Since the alignment release based on 20120522, we have only released alignments based on the analysis.sequence.index file.
The 1000 Genomes Project created what they defined as accessibility masks for the pilot phase, phase one and phase three of the Project. Some other studies have similar files.
In phase three of the 1000 Genomes Project, using the pilot criteria 95.9% of the genome was found to be accessible. For the stricter mask created during phase three, 76.9% was found to be accessible. A detailed description of the accessibility masks created during phase three, the final phase of the Project, can be found in section 9.2 of the supplementary material for the main publication. The percentages quoted are for non-N bases.
While the above was generated on GRCh37, similar files were created on GRCh38 for the reanalysis of the 1000 Genomes Project data on GRCh38. HGSVC2 also have files listing regions of the genome that were not analysed.
The easiest way to find the sequence files you’re looking for is with the Data Portal. You can search for individuals, populations and data collections, and filter the files by data type and technologies. This will give you locations of the files, which you can use to download directly, or to export a list to use with a download manager.
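
Once a file has been downloaded, it can be checked against the MD5 column of the sequence index. The following is a minimal sketch using Python's standard `hashlib`; `md5_matches` is a hypothetical helper, and the path and expected digest are whatever you obtained from your own download and index lookup.

```python
import hashlib


def md5_matches(path, expected_md5, chunk_size=1 << 20):
    """Return True if the file at `path` has the given hex MD5 digest."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large FASTQ files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.strip().lower()
```

A mismatch usually means the download was truncated or corrupted, in which case fetching the file again is the simplest remedy.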