Can I get phased genotypes and haplotypes for the individual genomes?

Answer:

Phased variant call sets are described in “Are the variant calls in IGSR phased?”.

You can obtain individual phased genotypes through either the Ensembl Data Slicer or using a combination of tabix and VCFtools allows you to sub sample VCF files for a particular individual or list of individuals.

The Data Slicer has both filter by individual and population options. The individual filter takes the individual names in the VCF header and presents them as a list before giving you the final file. If you wish to filter by population, you also must provide a panel file which pairs individuals with populations, again you are presented with a list to select from before being given the final file, both lists can have multiple elements selected.

To use tabix you must also use a VCFtools Perl script called vcf-subset. The command line would look like:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c /tmp/HG00098.20100804.genotypes.vcf.gz

Please also note that some studies, such as the second phase of the Human Genome Structural Variation Consortium (HGSVC), are now producing haplotype resolved asssemblies.

Related questions:

Can I use the IGSR data for imputation?

Answer:

The developers of Beagle, Mach and Impute2 have all created data sets based on the 1000 Genomes data to use for imputation.

Please look at the software’s website to find those files.

Related questions:

How was imputation used in 1000 Genomes to fill in gaps in sequencing?

Answer:

In the original phase 1 and 3 sequencing of the 1000 Genomes individuals, many genomes were only sequenced in full at low coverage, so some individuals some genotypes will be based on imputation.

This means that if an individual has no coverage at a particular location but overall we have been able to determine there is variation at that location then we can statistically infer the genotype for that variant in that individual using haplotype information. This means we are able to provide complete haplotypes for all the variation we discover.

The process used to create our genotypes first gave our merged sites and genotype likelihoods sets to Beagle to generate initial haplotypes (using 50 interations across all samples) and these were refined using a modified version of Thunder (it used 300 states chosen by longest matching haplotype at each iteration in addition to 100 randomly chosen states).

This process means we are unable to precisely identify which sites used imputation to generate their genotype. Without this process the approximate error rate for our heterozygous sites would be 20% so you can estimate that 20% of our heterozgous sites will have been changed on the basis of imputation. The sites covered by our exome sequencing represent our highest accuracy sites and these are the least likely to have been changed by this process. The converse is also true any site without any sequence alignment will have been imputed. You can find the depth of coverage at any site using our bam files. Other sites may have been given greater evidence on the basis of the imputation and refinement process.

You can find out more about this in our Phase 1 paper.

Related questions: