When studies are published, their variant call sets are submitted to the archives (dbSNP, DGVa, EVA, etc.).
The 1000 Genomes Project SNPs and short indels were all submitted to dbSNP, and the longer structural variants to the DGVa.
The accessions for data sets in the archives can be found in the accompanying publications (listed alongside the data collections).
Our VCF files contain global and super population alternative allele frequencies. You can see this in our most recent release. For multi-allelic variants, the alternative allele frequencies are presented as a comma-separated list, one value per alternative allele.
An example VCF line whose INFO column contains this information looks like:
1 15211 rs78601809 T G 100 PASS AC=3050;AF=0.609026;AN=5008;NS=2504;DP=32245;EAS_AF=0.504;AMR_AF=0.6772;AFR_AF=0.5371;EUR_AF=0.7316;SAS_AF=0.6401;AA=t|||;VT=SNP
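As a rough illustration of reading these values, the following is a minimal shell sketch. It assumes a tab-delimited VCF data line with the INFO column in the eighth field; the helper name `info_value` is hypothetical and is not part of vcftools or any 1000 Genomes tooling.

```shell
# Hypothetical helper (not part of vcftools): print the value of one INFO key
# for every data line of a VCF read from standard input.
info_value() {
  awk -F'\t' -v key="$1" '
    !/^#/ {                               # skip header lines
      n = split($8, fields, ";")          # INFO is the 8th column
      for (i = 1; i <= n; i++) {
        split(fields[i], kv, "=")         # KEY=VALUE pairs
        if (kv[1] == key) print kv[2]
      }
    }'
}

# The example line shown above, rebuilt with tab separators.
line=$(printf '1\t15211\trs78601809\tT\tG\t100\tPASS\tAC=3050;AF=0.609026;AN=5008;NS=2504;DP=32245;EAS_AF=0.504;AMR_AF=0.6772;AFR_AF=0.5371;EUR_AF=0.7316;SAS_AF=0.6401;AA=t|||;VT=SNP')

# Print the European super population allele frequency for that line.
printf '%s\n' "$line" | info_value EUR_AF
```

Because the key is matched exactly, asking for `AF` returns only the global frequency and not `EAS_AF` or the other super population fields.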
If you want population-specific allele frequencies, you have three options:
* For a single variant, you can look at the population genetics page for that variant in the Ensembl browser. This gives you pie charts and a table for a single site.
* For a genomic region, you can use our allele frequency calculator tool, which gives a set of allele frequencies for selected populations.
* If you would like sub-population allele frequencies for a whole file, it is best to use the vcftools command line tool.
This is done using a combination of two vcftools commands, vcf-subset and fill-an-ac.
An example command set using files from our phase 1 release would look like:
grep CEU integrated_call_samples.20101123.ALL.panel | cut -f1 > CEU.samples.list
vcf-subset -c CEU.samples.list ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz | fill-an-ac |
bgzip -c > CEU.chr13.phase1.vcf.gz
Once you have this file you can calculate your frequency by dividing AC (allele count) by AN (allele number).
Please note that some early VCF files from the main project used LD information and other variables to help estimate the allele frequency. This means in these files the AF does not always equal AC/AN. In the phase 1 and phase 3 releases, AC/AN should always match the allele frequency quoted.
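The AC/AN division described above can be sketched in shell as follows. This is an illustration only: it assumes tab-delimited VCF input with `AC` and `AN` present in the INFO column, and the helper name `ac_over_an` is hypothetical rather than a vcftools command. For multi-allelic sites it prints one frequency per alternative allele, separated by commas.

```shell
# Hypothetical helper (not part of vcftools): for each VCF data line on
# standard input, print CHROM, POS, ID and AC/AN for every alternative allele.
ac_over_an() {
  awk -F'\t' '
    !/^#/ {                                    # skip header lines
      ac = ""; an = ""
      n = split($8, fields, ";")               # INFO is the 8th column
      for (i = 1; i <= n; i++) {
        if (fields[i] ~ /^AC=/) ac = substr(fields[i], 4)
        if (fields[i] ~ /^AN=/) an = substr(fields[i], 4)
      }
      # Skip lines without usable counts to avoid dividing by zero.
      if (ac == "" || an == "" || (an + 0) == 0) next
      m = split(ac, counts, ",")               # one AC per alternative allele
      out = ""
      for (j = 1; j <= m; j++)
        out = out (j > 1 ? "," : "") sprintf("%.6f", counts[j] / an)
      print $1 "\t" $2 "\t" $3 "\t" out
    }'
}

# Demonstration on a single line using the AC and AN values from the example record.
printf '1\t15211\trs78601809\tT\tG\t100\tPASS\tAC=3050;AN=5008\n' | ac_over_an
```

For a whole file, something like `gzip -dc CEU.chr13.phase1.vcf.gz | ac_over_an` should work, since bgzip output can be read by standard gzip tools.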
You can get information about a list of variant identifiers using Ensembl’s BioMart.
This YouTube video gives a tutorial on how to do it.
The basic steps are:
If you would like the coordinates on GRCh38, use the main Ensembl site. If you would like the coordinates on GRCh37, use the dedicated GRCh37 site instead.
As described in “Why are the coordinates of some variants different to what is displayed in other databases?”, different analyses may produce different results.
The publications accompanying the data collections list the accessions for the variant calls in the variation archives.