This is the documentation for our Variation Pattern Finder tool.
The variation data discovered by the 1000 Genomes Project are organized in VCF files. The Variation Pattern Finder allows one to look for patterns of shared variation between individuals in the same VCF file. More specifically, within any user-specified chromosomal region, different samples may carry different combinations of variations. The Finder looks for the distinct variation combinations within the region, as well as the individuals associated with each combination. The Finder only considers variations that change protein-coding sequences, such as non-synonymous coding SNPs and splice site changes.
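The idea of grouping individuals by their combination of variations can be sketched in a few lines of Python. This is only an illustration of the concept, not the Finder's actual implementation; the function name, sample names and genotypes below are hypothetical.

```python
from collections import defaultdict


def group_by_pattern(genotypes):
    """Map each distinct genotype combination to the samples that carry it.

    genotypes: dict of sample name -> sequence of genotype strings, one per
    site, with the sites in the same order for every sample.
    """
    patterns = defaultdict(list)
    for sample, calls in genotypes.items():
        patterns[tuple(calls)].append(sample)
    return dict(patterns)


# Hypothetical genotypes at two sites in a region (illustrative data only).
example = {
    "sample_A": ("0|0", "0|1"),
    "sample_B": ("0|0", "0|1"),
    "sample_C": ("1|1", "0|0"),
}

for pattern, samples in group_by_pattern(example).items():
    print(pattern, samples)
```

Here sample_A and sample_B share one combination and sample_C carries another, so two distinct patterns are reported.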
The finder requires two input files to function.
The first is a remotely accessible, tabix-indexed VCF file. VCF is a tab-delimited format for presenting variation sites and genotype data, and is described at http://vcftools.sourceforge.net/specs.html. This tool accepts both VCF 4.0 and VCF 4.1 files.
The second file, which must also be remotely accessible, describes which samples belong to which populations. Each 1000 Genomes release should have such a file associated with it. This file allows the output samples to be organized by population.
The format is:
These lines should be tab separated.
An example can be found on the 1000 Genomes FTP site: 20100804.ALL.panel
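As a rough sketch of how such a tab-separated file could be used to organize samples by population, the Python below groups sample names under their population label. The column order (sample name first, population second) is an assumption made for this illustration and should be checked against the actual panel file; the sample and population names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def samples_by_population(panel_text):
    """Group sample names by population from tab-separated panel text.

    Assumption: column 1 is the sample name and column 2 is the population.
    Any further columns are ignored, and lines with fewer than two columns
    are skipped.
    """
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(panel_text), delimiter="\t"):
        if len(row) < 2:
            continue
        sample, population = row[0], row[1]
        groups[population].append(sample)
    return dict(groups)


# Hypothetical panel content (illustrative only).
panel = "sample_A\tPOP1\nsample_B\tPOP1\nsample_C\tPOP2\n"
print(samples_by_population(panel))
```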
The interface for the Finder can be reached either from the Tools link, which should be in the top right-hand corner of each page below the logo, or from any view page via the “Manage your data” link in the left-hand menu.
When you reach the Pattern Finder's interface you will be presented with a form in which to enter your data. The form itself has 3 input boxes.
The Finder offers a collapsed view and an expanded view. The collapsed view does not distinguish homozygous reference sites from sites with no data, so the number of distinct variation combinations is minimized; it offers a simplified, clear picture of the variation landscape in the region. The expanded view treats homozygous reference sites and sites with no genotype data differently, which allows one to see the data with more accuracy. The two views have the same layout, as explained below.
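The difference between the two views can be illustrated with a small Python sketch. It is not the Finder's own code: the genotype encodings treated as "homozygous reference" and "no data", the placeholder label and the sample data are all assumptions chosen for the example.

```python
# Assumed genotype encodings for this illustration.
HOM_REF = {"0/0", "0|0"}
NO_DATA = {"./.", ".|.", "."}


def collapse(calls):
    """Replace homozygous reference and no-data calls with one shared label."""
    return tuple(
        "ref_or_missing" if c in HOM_REF or c in NO_DATA else c for c in calls
    )


def count_patterns(genotypes, collapsed):
    """Count distinct genotype combinations in collapsed or expanded mode."""
    seen = set()
    for calls in genotypes.values():
        seen.add(collapse(calls) if collapsed else tuple(calls))
    return len(seen)


# Hypothetical genotypes at two sites (illustrative data only).
example_views = {
    "sample_A": ("0/0", "0/1"),
    "sample_B": ("./.", "0/1"),
    "sample_C": ("1/1", "0/0"),
}

print("expanded:", count_patterns(example_views, collapsed=False))
print("collapsed:", count_patterns(example_views, collapsed=True))
```

In this example sample_A and sample_B differ only by a missing call at the first site, so they count as two patterns in the expanded mode and merge into one in the collapsed mode.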
The picture shows a snapshot of a result page. The left panel shows the functional variations found in the region with individual genotypes; the variations are sorted by chromosomal coordinate and their functional consequences are annotated in the headers. The right panel shows the individual samples carrying each combination of variations, organized by population. The panels can be scrolled to view more data. The results can be exported in either CSV or Excel format. Sections annotated by red numbers are described in greater detail below.
The Variation Pattern Finder will work with any publicly visible remote (over http or ftp) VCF file that also has a tabix index. For more information about creating tabix indexes, please see Tabix: fast retrieval of sequence features from generic TAB-delimited files.
In addition to using the Finder to mine a VCF file, you may look into the VCF file directly. Rather than downloading the entire VCF file for the whole genome, you can slice out the piece that contains data in a user-specified chromosomal region using another tool called Data Slicer. Data Slicer can also slice BAM files. Please see more instructions here.
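To make the notion of slicing a region out of a VCF file concrete, here is a minimal Python sketch that keeps the header lines and only those records falling inside a given region. It is a plain linear scan written for illustration and says nothing about how Data Slicer itself works; the function name and the example records are hypothetical. It relies only on the VCF convention that the first two tab-separated columns are the chromosome and the 1-based position.

```python
def slice_vcf(lines, chrom, start, end):
    """Keep header lines and records on `chrom` with start <= POS <= end."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            # Meta-information and column header lines are always kept.
            kept.append(line)
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            kept.append(line)
    return kept


# Hypothetical VCF content (illustrative only).
vcf_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "1\t250\t.\tC\tT\t.\tPASS\t.",
    "2\t150\t.\tG\tA\t.\tPASS\t.",
]

for line in slice_vcf(vcf_lines, "1", 90, 200):
    print(line)
```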