Forge Analysis 1.1 | 1000 Genomes

FORGE2

Please note that, since the conclusion of the 1000 Genomes Project, FORGE has been updated to FORGE2, which is hosted at the Altius Institute.

FORGE Analysis Tool

The FORGE tool performs Functional element Overlap analysis of the Results of Genome Wide Association Study (GWAS) Experiments, to identify tissue specific signals within the set of GWAS SNPs.

This page refers to the updated version of Forge (1.1). Version 1.1 is the result of a number of criticisms and suggestions from reviewers after an initial manuscript was originally submitted to Genome Medicine. The major changes are:

- Forge now reports enrichments as -log10 of the binomial P value.
- By default Forge filters the test SNPs for LD at r² >= 0.8, and retains only one test SNP of the LD cluster. This option can be turned off (warning - can lead to over-estimation of enrichment) or set at a lower threshold (r«sup>2</sup> >= 0.1).

The adjusted thresholds on the binomial P value are, I think, now rather stringent and so effects observed with these thresholds are strong.

Unfortunately in the meantime I have taken up another position that is occupying all my time, and so have only just got around to making the appropriate revisions to the manuscript which has been uploaded in revised form to bioRxiv. Given the timing and work constraints I do not think I will get around to submitting this to a conventional journal.

Return to FORGE tool.

Sections

Resources
Overview
Inputs
Options
Outputs
Output Examples
Methods
Source Code
Citation
Contact/Acknowledgements

Other Resources

Overview

It is now recognised that many of the significant SNPs discovered in GWAS studies lie outside of the exons of protein coding genes, and that many of these hits may overlap regulatory regions within the genome (Maurano et al, 2012, ENCODE Consortium, 2012). The FORGE tool provides a method to view the tissue specific regulatory component of a set of GWAS SNPs.

FORGE analysis takes a set of SNPs, such as those hits above the genome wide significance threshold in a GWAS study, and analyses whether there is enrichment for overlap of putative functional elements compared to matched background SNPs. It assesses enrichment on a per cell type basis since functional elements are differentially active in different cell types, and hence can expose tissue specific signals of enrichment within the test SNP set. This can reveal the sites of action underlying the GWAS signal, and provide confirmation of the validity of the GWAS where a tissue specific mechanism is known or expected for the phenotype. Conversely unknown tissue involvements can also be revealed.

In the initial implementation, the functional elements considered are DNAse1 hotspots from either the ENCODE (Thurman et al, 2012) or Roadmap Epigenomics projects generated by the Hotspot method. The hotspots are regions of general DNase1 sensitivity (rather than peaks which are more similar to DNase hypersensitive sites) and are used because, although the method works with peaks, hotspots reveal more tissue specific signal.

For each set of test SNPs, an overlap analysis is performed against the functional elements from either data source for each cell sample separately (125 samples for ENCODE, 299 for Roadmap), and the number of overlaps is counted. A background distribution of the expected overlap counts for this SNP set is obtained by picking sets of the same number of SNPs as the test SNP set, matched in decile bins for G+C content (GC), minor allele frequency (maf) and distance to the nearest transcription start site (TSS). The matched background sets are then overlapped with the functional elements and the background distribution of overlaps determined. By default 100 matched sets are used, but this can be increased. More accurate results will be obtained with larger numbers of background sets, but the tool will take longer to run. The enrichment value for the test SNP set versus the background distribution is expressed as the -log₁₀ of the binomial P value. Enrichments at P less than or equal to 0.05 and 0.01 are considered indicative and significant, respectively, corrected for multiple testing over the range of equivalent tissues. A schematic of the analysis is shown below.

FORGE Analysis Strategy

Forge Strategy

The results are presented by cell sample in either graphic (interactive Dimple chart or static pdf) or tabular (interactive DataTables table or tab separated file) forms. Typical results may show an enrichment of overlap (red or pink points) for the GWAS SNP set in a tissue of mechanistic relevance to the phenotype under analysis, for instance fetal lung tissue and lung cell lines for Pulmonary function SNPs.

Alternatively there may be no enrichment and all points will be blue below the -log₁₀ P value thresholds. This could be because there is no regulatory component underlying the GWAS association, or because the relevant tissue is not present in the available functional element datasets, or for other technical reasons (e.g. too few overlaps or wrong genome build).

A set of example analyses of GWAS studies from the [NCBI GWAS catalog Accessed 28/03/2014 are available in the GWAS catalog examples.

Inputs

You can supply a list of SNPs by their dbSNP RefSNP (rs) IDs, one SNP per line in the Paste data: box. For example, you can run the SNPs already in the paste box which are a set of SNPs from GWAS studies on Pulmonary_function (Soler Artigas et al, 2011; Hancock et al, 2010; Repapi et al, 2010) obtained by parsing the NCBI GWAS catalog [Accessed 28/03/2014].

Alternatively you can upload a file from disk or via a URL. The file should contain

a list of SNPs by RefSNP ID, one per line as above (example rsid format file),
SNPs in VCF (Variant Call Format) format (example VCF format file),
SNPs in BED (e.g. Personal Genome SNP format) format, (example BED format file).

In fact any BED format with 3 columns (0 based, chrN) will work e.g.

chr21 31812007 31812008

If you experiment you’ll find that FORGE will also accept chromosome names as just N (Ensembl) and 1-based format where begin and end position are the same. Note that if you provide locations rather than RefSNP IDs they should currently be on build GRCh37 (hg19). RefSNP ids should work independent of genome build. One characteristic result when providing erroneous mappings is apparent depletion of the overlap statistic.

SNPs have to be present in the 1000 genomes phase 1 integrated call data set to work in the analysis. Any SNPs provided not in this set will be excluded from the analysis and this is reported to you. Also note that at least 5 input SNPs are required in the set for the web based analysis. There is no maximum, but very large sets will be slow to analyse. See below for false discovery rate analysis.

Options

Name for this data

Give an optional name for your SNP set. This name will be used in the title of the pdf output image.

Analysis data

The data set to analyse your SNP set against. Currently this is DNase1 hotspots from either the ENCODE (Thurman et al, 2012) or Roadmap Epigenomics projects generated by the Hotspot method. The hotspots are regions of general DNase1 sensitivity (rather than peaks which are more similar to DNase hypersensitive siites) and are used because, although the method works with peaks, it reveals more tissue specific signal with hotspots.

Overlap

Generate a table of the overlaps of the SNPs with Dnase hypersensitive sites. The minimum snp count restriction is ignored if this option is selected, and no enrichment analysis is done. Useful if you just want to identify overlaps with DNase 1 hypersensitive sites across ENOCDE and Epigenome Roadmap data.

Background selection

Specify whether the background matches should be picked from general set of arrays used in GWAS (GWAS typing arrays) or from the Illumina_HumanOmni2.5 (Omni array SNPs). The GWAS typing arrays include the following arrays as provided by Ensembl:

Affy_GeneChip_100K_Array
Affy_GeneChip_500K_Array
Affy_SNP6
HumanCNV370-Quadv3
HumanHap300v2
HumanHap550v3.0
Illumina_Cardio_Metabo
Illumina_Human1M-duoV3
Illumina_Human660W-quad

In both cases SNPs have to be both present on the arrays AND in the 1000 genomes phase 1 integrated call data set to be included in the background calculation.

LD Filter

Inclusion of multiple SNPs that are within tight linkage disequilibrium (LD) from the same association signal will result in over-estimation of the degree of enrichment for a specific regulatory signal, since there are likely to be multiple tissue specific DNase 1 sites in a region, and the background is not regionalised. To avoid this an LD filter is set by default to remove at random all but one SNP within an LD cluster at r² >= 0.8. SNPs that are removed are reported in the results. The filter can be turned off, or be set at r² >= 0.1 for a very stringent filter. All results here are generated with the LD filter set to the default. The way the filter works is influenced by the order in which SNPs in LD appear in the list you supply because the first occuring SNP is selected as the exemplar of the LD block. It would be worth experrimenting with alternate orderings and without filtering to see the effect of LD.

Background repetitions

Define the number of matching background SNP sets to analyse for the background overlap distribution. By default this is set to 100 background SNP sets, but larger numbers (e.g. up to 1000) will reduce false positive effects, at the expense of analysis speed.

Significance thresholds

These are the binomial P value thresholds used to colour the points red (by default P <= 0.01) or pink (P <=0.05). We estimate false positive rates at around less than 4% and 0.75% at the P <= 0.05 and P <= 0.01 thresholds, respectively. Increasing the thresholds will reduce false positive rates as shown in the plot below. In the analysis the thresholds are corrected for multiple testing by multiplying by the number of distinct tissues analysed, and expressed as -log₁₀ P.

Outputs

Several outputs are produced and provided from the results page.

An interactive table of the overlaps by cell is given by default on the results page.
The PDF link provides a base R graphics pdf chart of the FORGE analysis useful for printing, presentations and publications.
The Interactive chart link provides an interactive chart to explore the data.
The TSV link provides the FORGE analysis results in tab separated format.

Each of the graphics presents the -log₁₀ P value by cell sample. Cells are grouped alphabetically by tissue and then organised alphabetically by cell name. In each of the graphics the colouring is consistent, blue (P > 0.05), pink (0.05 => P > 0.01), and red (P <= 0.01), before multiple testing correction, by default.

Example Outputs

See the following graphics/links for examples of the output for the default Pulmonary_function SNPs:

Interactive Chart

The interactive chart presents the P values by tissue. Each point is the -log10 binomial P value for a cell type with the different cells organised alphabetically within the tissues. Where the same cell type is assayed (either the same cell line, or the same sample type from different individuals) the results are stacked at the same x axis position which provides useful validation of results where there are replicate or equivalent cell samples. Mouseover the individual points will show a tooltip giving the information from the results table. The fields called Class and Number are for plotting purposes only and should be ignored.

PDF

The pdf output presents the log10 binomial P values in a similar way organised alphabetically by Tissue and Cell, but without stacking duplicate samples. For guidance the Tissues are divided by the brown vertical lines. Horizontal pink lines show the P value thresholds. A thumbnail example is shown below but the original is available here. Threshold lines are not necessarily plotted if the results are all below threshold.

Pulmonary_function

TSV file

The results are also available as a tab separated value file (TSV) with columns as follows:

Zscore - The Zscore of the test data enrichment count versus the background
Pvalue - The -log10 binomial P value for the enrichment versus a binomial model of the background probability.
Cell - The cell for which the enrichment is calculated
Tissue - The tissue of the Cell according to either ENCODE or BioSamples
File - Filename of the functional annotation analysed
SNPs - The SNPs that overlap features in that file
Number - An irrelevant number used to help plotting.
Accession - The GEO accession number of the data set.

More example analyses in pdf format are available in the GWAS catalog examples.

Methods

Data

ENCODE consortium hotspots were obtained from ftp://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/openchrom/jan2011/combined_hotspots/

Roadmap Epigenome DNAse I sequencing tag alignments were obtained from http://www.genboree.org/EdaccData/Current-Release/experiment-sample/Chromatin_Accessibility/. These were processed by the Hotspot method to give hotspot files using the default parameters.

Cell and Tissue assignments were obtained by custom perl scripts using data from ENCODE Data Coordination Center tables or BioSamples

Overlaps

DNase I hotspot data in bed/wig format was formatted for overlap analysis using Tabix. All SNPs in the 1000 genomes phase 1 integrated call data set were downloaded and overlaps with the DNase I elements by cell type for every SNP were calculated using Tabix in a distributed approach on the EBI compute farm. Overlaps were stored as a binary string for each data set (ENCODE or Roadmap) for each SNP in an indexed sqlite database, forge.db.

FORGE analysis

For a given set of SNPs FORGE queries the sqlite database and retrieves all overlap bit strings. It then unpacks the bitstrings for each SNP and counts the overlaps per cell type. It next identifies matching SNPs based on the GC, maf and TSS distance, and repeats the overlap analysis for each background set. The final step is a simple calculation of the binomial P value of the test overlap count versus the background distribution. An enrichment is considered significant or indicative if the binomial P value is lower than or equal to the thresholds after correction for multiple testing. The default P value thresholds (0.01 and 0.05, respectively) are corrected by multiplication by the number of groups of related samples (tissues) tested.

Outputs

The TSV output is generated by FORGE during processing. The graphic and tabular outputs are all generated via R code embedded in FORGE. The pdf output is generated with base R graphics. The interactive table is produced using the Datatables plug-in for the jQuery Javascript library accessed through the rCharts package. The interactive graphic is a dimple d3 interactive graphic again using the rCharts package.

Estimating False Positive Rates by SNP count in the Test SNP Set

To estimate false positive rates, 1000 randomly chosen SNP sets for each of a series of SNP counts between 10 and 300 SNPs were analysed using FORGE on the Roadmap and ENCODE data. The false positive rate was calculated as the number of cell enrichments greater than the two standard thresholds used by FORGE expressed as the proportion of the total number of cell overlap tests performed (424000).

False Positive Rate by SNP set Count

FP rate This plot suggests that a threshold of P <= 0.01 (maroon line) before correction for multiple testing maintains the false positive rate below around 0.0075 (0.75%). The red line is P <= 0.001.

Source Code

The source code for FORGE is available on Github at https://github.com/iandunham/Forge. It has been installed and run on Mac OSX 10.8.4, 10.8.5 and Red Hat Linux. To run you also need to download the Forge.db database and background selection hash tables from

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/browser/forge.

Citation

If you find the tool useful, please cite the website at http://browser.1000genomes.org/Homo_sapiens/UserData/Forge. For use in publications cite the website and the preprint at bioRxiv : Dunham, I., Kulesha, E., Iotchkova, V., Morganella, S. & Birney, E. FORGE : A tool to discover cell specific enrichments of GWAS associated SNPs in regulatory regions. (http://biorxiv.org/content/early/2014/12/20/013045).

Contact

Ian Dunham (dunham@ebi.ac.uk)

Acknowledgements

Thanks to Jan Quell for initial implementation of the pdf graphic and to Ramnath Vaidyanathan and @timelyportfolio for assistance with rCharts. We thank Prof Ajay Shah and Marc-Philip Hitz for comments on the results of cardiac phenotypes, and Graham Ritchie and Nicole Soranzo for discussions.

IGSR: The International Genome Sample Resource

Supporting open human variation data