About alignment files (BAM and CRAM)

Answer:

All our alignment files are in BAM or CRAM format. BAM is a standard alignment format which was defined by the 1000 Genomes consortium and has since seen wide community adoption, whereas CRAM is a compressed version of this. This compression is driven by the reference the sequence data is aligned to.

The CRAM file format was designed by the EBI to reduce the disk footprint of alignment data in these days of ever-increasing data volumes.

The CRAM files the 1000 Genomes project distributes are lossy cram files which reduce the base quality scores using the Illumina 8-bin compression scheme as described in the lossy compression section on the cram usage page

There is a github page where the format of CRAM file is discussed and help can be found.

CRAM files can be read using many Picard tools and work is being done to ensure samtools can also read the file format natively.

BAM file names

The bam file names look like:

NA00000.location.platform.population.analysis_group.YYYYMMDD.bam

The bai index and bas statistics files are also named in the same way.

The name includes the individual sample ID, where the sequence is mapped to, if the file has only contains mapping to a particular chromosome that is what the name contains otherwise, mapped means the whole genome mapping and unmapped means the reads which failed to map to the reference (pairs where one mate mapped and the other didn’t stay in the mapped file), the sequencing platform, the ethnicity of the sample using our three letter population code, the sequencing strategy. The date matches the date of the sequence used to build the bams and can also be found in the sequence.index filename.

Unmapped bams

The unmapped bams contain all the reads for the given individual which could not be placed on the reference genome. It contains no mapping information

Please note that any paired end sequence where one end successfully maps but the other does not both reads are found in the mapped bam

Bas files

Bas files are statistics we generate for our alignment files which we distribute alongside our alignment files.

These are readgroup level statistics in a tab delimited manner and are described in this README

Each mapped and unmapped bam file has an associated bas file and we provide them collected together into a single file in the alignment_indices directory, dated to match the alignment release.

Related questions:

About FASTQ sequence read files

Answer:

Our sequence files are distributed in FASTQ format. Some are hosted on our own FTP site and some by the sequence read archive.

Format

We use Sanger style phred scaled quality encoding.

The files are all gzipped compressed and the format looks like this, with a four-line repeating pattern

@ERR059938.60 HS9_6783:8:2304:19291:186369#7/2
GTCTCCGGGGGCTGGGGGAACCAGGGGTTCCCACCAACCACCCTCACTCAGCCTTTTCCCTCCAGGCATCTCTGGGAAAGGACATGGGGCTGGTGCGGGG
+
7?CIGJB:D:-F7LA:GI9FDHBIJ7,GHGJBKHNI7IN,EML8IFIA7HN7J6,L6686LCJE?JKA6G7AK6GK5C6@6IK+++?5+=<;227*6054

Files for each individual

Many of our individuals have multiple fastq files. This is because many of our individual were sequenced using more than one run of a sequencing machine.

Each set of files named like ERR001268_1.filt.fastq.gz, ERR001268_2.filt.fastq.gz and ERR001268.filt.fastq.gz represent all the sequence from a sequencing run.

The labels with _1 and _2 represent paired-end files; mate1 is found in a file labelled _1 and mate2 is found in the file labelled _2. The files which do not have a number in their name are singled ended reads, this can be for two reasons, some sequencing early in the project was singled ended, also, as we filter our fastq files as described in our README if one of a pair of reads gets rejected the other read gets placed in the single file.

When a individual has many files with different run accessions (e.g ERR001268), this means it was sequenced multiple times. This can either be for the same experiment, some centres used multiplexing to have better control over their coverage levels for the low coverage sequencing, or because it was sequenced using different protocols or on different platforms.

For a full description of the sequencing conducted for the project please look at our sequence.index file

Related questions:

Can I convert VCF files to PLINK/PED format?

Answer:

We provide a VCF to PED tool to convert from VCF to PLINK PED format. This tool has documentation for both the web interface and the Perl script.

An example Perl command to run the script would be:

perl vcf_to_ped_converter.pl -vcf ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr13.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf.gz
    -sample_panel_file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/phase1_integrated_calls.20101123.ALL.sample_panel
    -region 13:32889611-32973805 -population GBR -population FIN

Related questions:

What are your filename conventions?

Answer:

Our filename conventions depend on the data format being named. This is described in more detail below.

FASTQ

Our sequence files are distributed in gzipped fastq format

Our files are named with the SRA run accession E?SRR000000.filt.fastq.gz. All the reads in the file also hold this name. The files with _1 and _2 in their names are associated with paired end sequencing runs. If there is also a file with no number it is name this represents the fragments where the other end failed qc. The .filt in the name represents the data in the file has been filtered after retrieval from the archive. This filtering process is described in a README.

VCF

Our variant files are distributed in vcf format, a format initially designed for the 1000 Genomes Project which has seen wider community adoption.

The majority of our vcf files are named in the form:

ALL.chrN|wgs|wex.2of4intersection.20100804.snps|indels|sv.genotypes.analysis_group.vcf.gz

This name starts with the population that the variants were discovered in, if ALL is specifed it means all the individuals available at that date were used. Then the region covered by the call set, this can be a chromosome, wgs (which means the file contains at least all the autosomes) or wex (this represents the whole exome) and a description of how the call set was produced or who produced it, the date matches the sequence and alignment freezes used to generate the variant call set. Next a field which describes what type of variant the file contains, then the analysis group used to generate the variant calls, this should be low coverage, exome or integrated and finally we have either sites or genotypes. A sites file just contains the first eight columns of the vcf format and the genotypes files contain individual genotype data as well.

Release directories should also contain panel files which also describe what individuals the variants have genotypes for and what populations those individuals are from.

Related questions: