How are allele frequencies calculated?

Answer:

Our standard AF values are allele frequencies rounded to two decimal places calculated using allele count (AC) and allele number (AN) values.

LDAF is an allele frequency value in the info column of our phase 1 VCF files. LDAF is the allele frequency as inferred from the haplotype estimation. You will note that LDAF does sometimes differ from the AF calculated on the basis of allele count and allele number. This generally means there are many uncertain genotypes for this site. This is particularly true close to the ends of the chromosomes.

Genotype Dosage

The phase 1 data set also contains Genotype Dosage values. This comes from Mach/Thunder, imputation engine used for genotype refinement in the phase 1 data set.

The Dosage represents the predicted dosage of the non reference allele given the data available, it will always have a value between 0 and 2.

The formula is Dosage = Pr(Het|Data) + 2*Pr(Alt|Data)

The dosage value gives an indication of how well the genotype is supported by the imputation engine. The genotype likelihood gives an indication of how well the genotype is supported by the sequence data.

Related questions:

Is there gene expression and/or functional annotation available for the samples?

Answer:

Functional annotation

As part of our phase 1 analysis we performed functional annotation of our phase 1 variants with respect to both coding and non-coding annotation from GENCODE and the ENCODE project respectively.

This functional annotation can be found in our phase 1 analysis results directory. We present both the annotation we compared the variants to and VCF files which contain the functional consequences for each variant.

Gene expresssion

The most important available existing expression datasets involving 1000 Genomes individuals are probably the following:

Pre-publication RNA-sequencing data from the Geuvadis project is available

http://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/samples.html
http://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-2/samples.html

http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-197

http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-198
http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-264

http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-19480

References

  1. Reference:Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010 Apr 1;464(7289):773-7. Epub 2010 Mar 10.
  2. Reference: Stranger,B.E S.B. Montgomery, A.S. Dimas, L. Parts, O. Stegle, C.E. Ingle, M. Sekowska, G. Davey Smith, D. Evans, M. Gutierrez-Arcelus, A. Price, T. Raj J. Nisbett, A.C. Nica, C. Beazley, R. Durbin, P. Deloukas, E.T. Dermitzakis. Patterns of cis regulatory variation in diverse human populations. PLoS Genetics in press
  3. Reference: Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010 Apr 1;464(7289):768-72. Epub 2010 Mar 10.

Related questions:

What are the different data collections available for 1000 Genomes?

Answer:

In IGSR, data is organised into collections that roughly correspond to studies or projects.

The samples collected by the 1000 Genomes Project have now been used in many different studies, some generating new data and others reanalysing existing data.

The final phase of the 1000 Genomes Project was phase 3 and represents 2504 samples on GRCh37.

The data from phase three of the 1000 Genomes Project was subsequently reanalysed on GRCh38.

Following this work, the samples have been resequenced to high-coverage, with additional related samples being sequenced, bringing the total number of samples up to 3,202. This data was analysed on GRCh38.

Further studies have also generated data on samples from the 1000 Genomes Project, including work by the Human Genome Structural Variation Consortium (HGSVC).

These data collections are listed in our data portal.

Related questions:

Why are there differences between the different analyses of the 1000 Genomes samples?

Answer:

The phase 1 variants list released in 2012 and the phase 3 variants list released in 2014 overlap but phase 3 is not a complete superset of phase 1. The variant positions between phase 3 and phase 1 releases were compared using their positions. This shows that 2.3M phase 1 sites are not present in phase 3. Of the 2.3M sites, 1.92M are SNPs, the rest are either indels or structural variations (SVs).

The difference between the two lists can be explained by a number of different reasons.

  1. Some phase 1 samples were not used in phase 3 for various reasons. If a sample was not part of phase 3, variants private to this sample are not be part of the phase 3 set.

  2. Our input sequence data is different. In phase 1 we had a mixture of both read lengths 36bp to >100bp and a mixture of sequencing platforms, Illumina, ABI SOLiD and 454. In phase 3 we only used data from the Illumina sequencing platform and we only used read lengths of 70bp+. We believe that these calls are higher quality, and that variants excluded this way were probably not real.

  3. The first two reasons listed explain 548k missing SNPs, leaving 1.37M SNPs still to be explained.

    The phase 1 and phase 3 variant calling pipelines are different. Phase 3 had an expanded set of variant callers, used haplotype aware variant callers and variant callers that used de novo assembly. It considered low coverage and exome sequence together rather than independently. Our genotype calling was also different using ShapeIt2 and MVNcall, allowing integration of multi allelic variants and complex events that weren’t possible in phase 1.

    891k of the 1.37M sites missing from phase 1 were not identified by any phase 3 variant caller. These 891k SNPs have relatively high Ts/Tv ratio (1.84), which means these were likely missed in phase 3 because they are very rare, not because they are wrong; the increase in sample number in phase 3 made it harder to detect very rare events especially if the extra 1400 samples in phase 3 did not carry the alternative allele.

    481k of these SNPs were initially called in phase 3. 340k of them failed our initial SVM filter so were not included in our final merged variant set. 57k overlapped with larger variant events so were not accurately called. 84k sites did not make it into our final set of genotypes due to losses in our pipeline. Some of these sites will be false positives but we have no strong evidence as to which of these sites are wrong and which were lost for other reasons.

  4. The reference genomes used for our alignments are different. Phase 1 alignments were aligned to the standard GRCh37 primary reference including unplaced contigs. In phase 3 we added EBV and a decoy set to the reference to reduce mismapping. This will have reduced our false positive variant calling as it will have reduced mismapping leading to false SNP calls. We cannot quantify this effect.

We have made no attempt to eludcidate why our SV and indel numbers changed. Since the release of phase 1 data, the algorithms to detect and validate indels and SVs have improved dramatically. By and large, we assume the indels and SVs in phase 1 that are missing from phase 3 are false positive in phase 1.

You can get more details about our comparison from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/phase1_sites_missing_in_phase3/

Related questions: