FORGE Analysis Tool
The FORGE tool performs Functional element Overlap analysis of the Results of Genome Wide Association Study (GWAS) Experiments, to identify tissue specific signals within the set of GWAS SNPs.
This page refers to the updated version of Forge (1.1). Version 1.1 is the result of a number of criticisms and suggestions from reviewers after an initial manuscript was originally submitted to Genome Medicine. The major changes are:
- Forge now reports enrichments as -log10 of the binomial P value.
- By default Forge filters the test SNPs for LD at r2 >= 0.8, and retains only one test SNP of the LD cluster. This option can be turned off (warning - can lead to over-estimation of enrichment) or set at a lower threshold (r«sup>2</sup> >= 0.1).
The adjusted thresholds on the binomial P value are, I think, now rather stringent and so effects observed with these thresholds are strong.
Unfortunately in the meantime I have taken up another position that is occupying all my time, and so have only just got around to making the appropriate revisions to the manuscript which has been uploaded in revised form to bioRxiv. Given the timing and work constraints I do not think I will get around to submitting this to a conventional journal.
It is now recognised that many of the significant SNPs discovered in GWAS studies lie outside of the exons of protein coding genes, and that many of these hits may overlap regulatory regions within the genome (Maurano et al, 2012, ENCODE Consortium, 2012). The FORGE tool provides a method to view the tissue specific regulatory component of a set of GWAS SNPs.
FORGE analysis takes a set of SNPs, such as those hits above the genome wide significance threshold in a GWAS study, and analyses whether there is enrichment for overlap of putative functional elements compared to matched background SNPs. It assesses enrichment on a per cell type basis since functional elements are differentially active in different cell types, and hence can expose tissue specific signals of enrichment within the test SNP set. This can reveal the sites of action underlying the GWAS signal, and provide confirmation of the validity of the GWAS where a tissue specific mechanism is known or expected for the phenotype. Conversely unknown tissue involvements can also be revealed.
In the initial implementation, the functional elements considered are DNAse1 hotspots from either the ENCODE (Thurman et al, 2012) or Roadmap Epigenomics projects generated by the Hotspot method. The hotspots are regions of general DNase1 sensitivity (rather than peaks which are more similar to DNase hypersensitive sites) and are used because, although the method works with peaks, hotspots reveal more tissue specific signal.
For each set of test SNPs, an overlap analysis is performed against the functional elements from either data source for each cell sample separately (125 samples for ENCODE, 299 for Roadmap), and the number of overlaps is counted. A background distribution of the expected overlap counts for this SNP set is obtained by picking sets of the same number of SNPs as the test SNP set, matched in decile bins for G+C content (GC), minor allele frequency (maf) and distance to the nearest transcription start site (TSS). The matched background sets are then overlapped with the functional elements and the background distribution of overlaps determined. By default 100 matched sets are used, but this can be increased. More accurate results will be obtained with larger numbers of background sets, but the tool will take longer to run. The enrichment value for the test SNP set versus the background distribution is expressed as the -log10 of the binomial P value. Enrichments at P less than or equal to 0.05 and 0.01 are considered indicative and significant, respectively, corrected for multiple testing over the range of equivalent tissues. A schematic of the analysis is shown below.
FORGE Analysis Strategy
The results are presented by cell sample in either graphic (interactive Dimple chart or static pdf) or tabular (interactive DataTables table or tab separated file) forms. Typical results may show an enrichment of overlap (red or pink points) for the GWAS SNP set in a tissue of mechanistic relevance to the phenotype under analysis, for instance fetal lung tissue and lung cell lines for Pulmonary function SNPs.
Alternatively there may be no enrichment and all points will be blue below the -log10 P value thresholds. This could be because there is no regulatory component underlying the GWAS association, or because the relevant tissue is not present in the available functional element datasets, or for other technical reasons (e.g. too few overlaps or wrong genome build).
You can supply a list of SNPs by their dbSNP RefSNP (rs) IDs, one SNP per line in the Paste data: box. For example, you can run the SNPs already in the paste box which are a set of SNPs from GWAS studies on Pulmonary_function (Soler Artigas et al, 2011; Hancock et al, 2010; Repapi et al, 2010) obtained by parsing the NCBI GWAS catalog [Accessed 28/03/2014].
Alternatively you can upload a file from disk or via a URL. The file should contain
In fact any BED format with 3 columns (0 based, chrN) will work e.g.
chr21 31812007 31812008
If you experiment you’ll find that FORGE will also accept chromosome names as just N (Ensembl) and 1-based format where begin and end position are the same. Note that if you provide locations rather than RefSNP IDs they should currently be on build GRCh37 (hg19). RefSNP ids should work independent of genome build. One characteristic result when providing erroneous mappings is apparent depletion of the overlap statistic.
SNPs have to be present in the 1000 genomes phase 1 integrated call data set to work in the analysis. Any SNPs provided not in this set will be excluded from the analysis and this is reported to you. Also note that at least 5 input SNPs are required in the set for the web based analysis. There is no maximum, but very large sets will be slow to analyse. See below for false discovery rate analysis.
Name for this data
Give an optional name for your SNP set. This name will be used in the title of the pdf output image.
The data set to analyse your SNP set against. Currently this is DNase1 hotspots from either the ENCODE (Thurman et al, 2012) or Roadmap Epigenomics projects generated by the Hotspot method. The hotspots are regions of general DNase1 sensitivity (rather than peaks which are more similar to DNase hypersensitive siites) and are used because, although the method works with peaks, it reveals more tissue specific signal with hotspots.
Generate a table of the overlaps of the SNPs with Dnase hypersensitive sites. The minimum snp count restriction is ignored if this option is selected, and no enrichment analysis is done. Useful if you just want to identify overlaps with DNase 1 hypersensitive sites across ENOCDE and Epigenome Roadmap data.
Specify whether the background matches should be picked from general set of arrays used in GWAS (GWAS typing arrays) or from the Illumina_HumanOmni2.5 (Omni array SNPs). The GWAS typing arrays include the following arrays as provided by Ensembl:
In both cases SNPs have to be both present on the arrays AND in the 1000 genomes phase 1 integrated call data set to be included in the background calculation.
Inclusion of multiple SNPs that are within tight linkage disequilibrium (LD) from the same association signal will result in over-estimation of the degree of enrichment for a specific regulatory signal, since there are likely to be multiple tissue specific DNase 1 sites in a region, and the background is not regionalised. To avoid this an LD filter is set by default to remove at random all but one SNP within an LD cluster at r2 >= 0.8. SNPs that are removed are reported in the results. The filter can be turned off, or be set at r2 >= 0.1 for a very stringent filter. All results here are generated with the LD filter set to the default. The way the filter works is influenced by the order in which SNPs in LD appear in the list you supply because the first occuring SNP is selected as the exemplar of the LD block. It would be worth experrimenting with alternate orderings and without filtering to see the effect of LD.
Define the number of matching background SNP sets to analyse for the background overlap distribution. By default this is set to 100 background SNP sets, but larger numbers (e.g. up to 1000) will reduce false positive effects, at the expense of analysis speed.
These are the binomial P value thresholds used to colour the points red (by default P <= 0.01) or pink (P <=0.05). We estimate false positive rates at around less than 4% and 0.75% at the P <= 0.05 and P <= 0.01 thresholds, respectively. Increasing the thresholds will reduce false positive rates as shown in the plot below. In the analysis the thresholds are corrected for multiple testing by multiplying by the number of distinct tissues analysed, and expressed as -log10 P.
Several outputs are produced and provided from the results page.
Each of the graphics presents the -log10 P value by cell sample. Cells are grouped alphabetically by tissue and then organised alphabetically by cell name. In each of the graphics the colouring is consistent, blue (P > 0.05), pink (0.05 => P > 0.01), and red (P <= 0.01), before multiple testing correction, by default.
See the following graphics/links for examples of the output for the default Pulmonary_function SNPs:
The interactive chart presents the P values by tissue. Each point is the -log10 binomial P value for a cell type with the different cells organised alphabetically within the tissues. Where the same cell type is assayed (either the same cell line, or the same sample type from different individuals) the results are stacked at the same x axis position which provides useful validation of results where there are replicate or equivalent cell samples. Mouseover the individual points will show a tooltip giving the information from the results table. The fields called Class and Number are for plotting purposes only and should be ignored.
The pdf output presents the log10 binomial P values in a similar way organised alphabetically by Tissue and Cell, but without stacking duplicate samples. For guidance the Tissues are divided by the brown vertical lines. Horizontal pink lines show the P value thresholds. A thumbnail example is shown below but the original is available here. Threshold lines are not necessarily plotted if the results are all below threshold.
The results are also available as a tab separated value file (TSV) with columns as follows:
More example analyses in pdf format are available in the GWAS catalog examples.
ENCODE consortium hotspots were obtained from ftp://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/openchrom/jan2011/combined_hotspots/
Roadmap Epigenome DNAse I sequencing tag alignments were obtained from http://www.genboree.org/EdaccData/Current-Release/experiment-sample/Chromatin_Accessibility/. These were processed by the Hotspot method to give hotspot files using the default parameters.
DNase I hotspot data in bed/wig format was formatted for overlap analysis using Tabix. All SNPs in the 1000 genomes phase 1 integrated call data set were downloaded and overlaps with the DNase I elements by cell type for every SNP were calculated using Tabix in a distributed approach on the EBI compute farm. Overlaps were stored as a binary string for each data set (ENCODE or Roadmap) for each SNP in an indexed sqlite database, forge.db.
For a given set of SNPs FORGE queries the sqlite database and retrieves all overlap bit strings. It then unpacks the bitstrings for each SNP and counts the overlaps per cell type. It next identifies matching SNPs based on the GC, maf and TSS distance, and repeats the overlap analysis for each background set. The final step is a simple calculation of the binomial P value of the test overlap count versus the background distribution. An enrichment is considered significant or indicative if the binomial P value is lower than or equal to the thresholds after correction for multiple testing. The default P value thresholds (0.01 and 0.05, respectively) are corrected by multiplication by the number of groups of related samples (tissues) tested.
Estimating False Positive Rates by SNP count in the Test SNP Set
To estimate false positive rates, 1000 randomly chosen SNP sets for each of a series of SNP counts between 10 and 300 SNPs were analysed using FORGE on the Roadmap and ENCODE data. The false positive rate was calculated as the number of cell enrichments greater than the two standard thresholds used by FORGE expressed as the proportion of the total number of cell overlap tests performed (424000).
This plot suggests that a threshold of P <= 0.01 (maroon line) before correction for multiple testing maintains the false positive rate below around 0.0075 (0.75%). The red line is P <= 0.001.
The source code for FORGE is available on Github at https://github.com/iandunham/Forge. It has been installed and run on Mac OSX 10.8.4, 10.8.5 and Red Hat Linux. To run you also need to download the Forge.db database and background selection hash tables from
If you find the tool useful, please cite the website at http://browser.1000genomes.org/Homo_sapiens/UserData/Forge. For use in publications cite the website and the preprint at bioRxiv : Dunham, I., Kulesha, E., Iotchkova, V., Morganella, S. & Birney, E. FORGE : A tool to discover cell specific enrichments of GWAS associated SNPs in regulatory regions. (http://biorxiv.org/content/early/2014/12/20/013045).
Ian Dunham (firstname.lastname@example.org)
Thanks to Jan Quell for initial implementation of the pdf graphic and to Ramnath Vaidyanathan and @timelyportfolio for assistance with rCharts. We thank Prof Ajay Shah and Marc-Philip Hitz for comments on the results of cardiac phenotypes, and Graham Ritchie and Nicole Soranzo for discussions.