Help
Functional Annotation Tool
Introduction
Data Input
Gene-Enrichment and Functional Annotation Analysis
Some Terminology in the DAVID System
Introduction
The tool suite, introduced in the first version of DAVID, mainly
provides typical batch annotation and gene-GO term enrichment analysis
to highlight the most relevant GO terms associated with a given gene
list. This version of the tool keeps the same enrichment algorithm but
extends the annotation content coverage from GO alone in the original
version of DAVID to over 40 annotation categories, including GO terms,
protein-protein interactions, protein functional domains, disease
associations, bio-pathways, general sequence features, homologies, gene
functional summaries, gene tissue expression, literature, etc. The
improved annotation coverage alone gives investigators much more power
to analyze their genes from many different biological aspects in a
single space. Flexible options are provided to display results in an
individual annotation chart report or a combined chart report. In
addition, with its improved computational power, this version accepts
customized gene backgrounds, an option rarely found in other Web-based,
high-throughput annotation tools for typical gene-annotation enrichment
analysis. This feature was added to meet users' requirements for the
best analytical results more specifically.
The DAVID Functional Annotation Clustering function uses a novel
algorithm to measure relationships among annotation terms based on the
degree to which their associated genes overlap, and groups similar,
redundant, and heterogeneous annotation content from the same or
different resources into annotation groups. This reduces the burden of
associating similar, redundant terms and focuses the biological
interpretation at the group level. The tool also shows the internal
relationships of the clustered terms, in contrast to the typical
linear, redundant term report, in which similar annotation terms may be
scattered among hundreds or thousands of other terms. In addition, to
take full advantage of the well-known KEGG and BioCarta pathways, the
DAVID Pathway Viewer, another feature of the DAVID Functional
Annotation Tool, can display genes from a user's list on pathway maps
to facilitate biological interpretation in a network context.
Data Input
Please see the Universal Gene List Manager.
Gene-Enrichment and Functional Annotation Analysis
1. A Typical Analysis Flow
Load Gene List -> View Summary Page -> Explore details through Chart Report, Table Report, Clustering Report, etc. -> Export and Save Results
2. EASE Score, a modified Fisher Exact P-Value
When members of two independent groups can fall into one of two
mutually exclusive categories, Fisher's Exact test is used to determine whether the proportions of those falling into each category differ by
group. In DAVID, Fisher's Exact test is adopted to measure gene enrichment in annotation terms.
Fisher's Exact p-values are computed by summing the probabilities p over a defined set of tables A: $\text{Prob} = \sum_{A} p$.
For 2 x 2 tables, one-sided p-values for Fisher's Exact test are defined in terms of the frequency of the cell in the first row and first column of the table, the (1,1) cell. Denote the observed (1,1) cell frequency by n11. For a right-sided alternative hypothesis, A is the set of tables where the frequency of the (1,1) cell is greater than or equal to n11. A small right-sided p-value supports the alternative that the probability of the first cell is greater than expected under the null hypothesis that the two variables are independent, which leads to the conclusion that there is an association between the row and column variables.
A Hypothetical Example:
In the human genome background (30,000 genes total; Population Total (PT)), 40 genes
are involved in the p53 signaling pathway (Population Hits (PH)). In a given gene list
of 300 genes (List Total (LT)), three genes (List Hits (LH)) belong to the p53 signaling pathway. The question is
whether 3/300 is more than expected by random chance compared to the
human background of 40/30000.
A 2x2 contingency table is built based on the above numbers:
List Hits (LH) = 3
List Total (LT) = 300
Population Hits (PH) = 40
Population Total (PT) = 30,000
|                | User Genes | Genome        | Total |
| In Pathway     | LH         | PH-LH         | PH    |
| Not In Pathway | LT-LH      | PT-LT-(PH-LH) | PT-PH |
| Total          | LT         | PT-LT         | PT    |

|                | User Genes | Genome | Total |
| In Pathway     | 3          | 37     | 40    |
| Not In Pathway | 297        | 29663  | 29960 |
| Total          | 300        | 29700  | 30000 |
Exact P-Value = 0.007. Since the P-Value < 0.05, this user's gene list is more specifically associated with (enriched in) the p53 signaling pathway than expected by random chance.
What about the EASE Score?
The EASE Score is more conservative: it subtracts one gene from the List Hits (LH), as seen below. If LH=1 (only one gene in the user's list is annotated to the term), the EASE Score is automatically set to 1.
|                | User Genes | Genome        | Total |
| In Pathway     | LH-1       | PH-LH+1       | PH    |
| Not In Pathway | LT-LH      | PT-LT-(PH-LH) | PT-PH |
| Total          | LT-1       | PT-LT+1       | PT    |

|                | User Genes | Genome  | Total |
| In Pathway     | 3-1        | 37+1    | 40    |
| Not In Pathway | 297        | 29663   | 29960 |
| Total          | 300-1      | 29700+1 | 30000 |
For our hypothetical example involving the p53 signaling pathway, the EASE Score is more conservative with a P-value = 0.06 (using 3-1 instead of 3). Since the P-Value > 0.05, this user's gene list is not considered specifically associated (enriched) in the p53 signaling pathway any more than by random chance.
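The two p-values in this example can be reproduced with a short sketch. It uses only the Python standard library and treats the right-sided Fisher Exact p-value as a hypergeometric tail sum. The function names `right_tail_p` and `ease_score` are illustrative and are not part of DAVID; the code is one reading of the tables above, not DAVID's own implementation.

```python
from math import comb


def right_tail_p(list_hits, list_total, pop_hits, pop_total):
    """Right-sided Fisher Exact p-value: probability of seeing at least
    `list_hits` pathway genes in a list of `list_total` genes drawn from
    a population of `pop_total` genes containing `pop_hits` pathway genes."""
    upper = min(list_total, pop_hits)
    # Count the tables whose (1,1) cell is >= the observed value ...
    tail = sum(
        comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
        for k in range(list_hits, upper + 1)
    )
    # ... and divide by the number of all possible lists of this size.
    return tail / comb(pop_total, list_total)


def ease_score(list_hits, list_total, pop_hits, pop_total):
    """EASE Score as described above: remove one gene from the List Hits
    (and therefore from the List Total) before running the same test."""
    if list_hits <= 1:
        return 1.0  # a single annotated gene is never called enriched
    return right_tail_p(list_hits - 1, list_total - 1, pop_hits, pop_total)


# The p53 example: LH=3, LT=300, PH=40, PT=30,000
print(f"Fisher Exact: {right_tail_p(3, 300, 40, 30000):.3f}")
print(f"EASE Score:   {ease_score(3, 300, 40, 30000):.3f}")
```

The first value falls below 0.05 and the second above it, which matches the conclusions drawn for the two tables.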
3. Bonferroni, Benjamini, and FDR
An adjusted p-value is defined as the smallest significance level for which the given hypothesis would be rejected, when the entire family of tests is considered.
Bonferroni: The Bonferroni value in DAVID is the Bonferroni Šidák p-value (Šidák 1967), a technique slightly less conservative than the standard Bonferroni correction.
Šidák, Z. (1967). "Rectangular Confidence Regions for the Means of Multivariate Normal Distributions." Journal of the American Statistical Association 62:626-633.
Benjamini: The Benjamini value in DAVID is the adjusted p-value obtained with the linear step-up method of Benjamini and Hochberg (1995).
Benjamini, Y., and Hochberg, Y. (1995). "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing." Journal of the Royal Statistical Society, Series B (Methodological) 57(1):289-300.
FDR: The FDR value in DAVID is the adaptive linear step-up adjusted p-value for approximate control of the false discovery rate, as discussed in Benjamini and Hochberg (2000). The lowest slope method is used to estimate the number of true null hypotheses.
Benjamini, Y., and Hochberg, Y. (2000). "On the Adaptive Control of the False Discovery Rate in Multiple Testing with Independent Statistics." Journal of Educational and Behavioral Statistics 25:60-83.
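As a rough illustration of the first two adjustments, the sketch below applies the textbook Šidák formula, 1 - (1 - p)^m, and the Benjamini-Hochberg linear step-up procedure to a small list of p-values. The function names are hypothetical, and the code follows the published definitions rather than DAVID's source; the adaptive method of Benjamini and Hochberg (2000) is not covered here.

```python
def sidak_adjust(pvalues):
    """Sidak-adjusted p-values: 1 - (1 - p)^m for a family of m tests."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]


def benjamini_hochberg_adjust(pvalues):
    """Linear step-up (Benjamini-Hochberg 1995) adjusted p-values."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
print(sidak_adjust(pvals))
print(benjamini_hochberg_adjust(pvals))
```

For any p-value the Šidák result is never larger than the plain Bonferroni product p × m, which is the sense in which it is slightly less conservative.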
4. Functional Annotation Summary
5. Functional Annotation Chart Report
Functional Annotation Chart:
Chart Report is an annotation-term-focused view that lists annotation terms and their associated genes under study. To avoid overcounting duplicated genes, the Fisher Exact statistics are calculated on the corresponding DAVID gene IDs, which removes all redundancies in the original IDs. Every result in the Chart Report has to pass the thresholds set in the Chart Option section (by default, Max.Prob.<=0.1 and Min.Count>=2) so that only statistically significant terms are displayed.
EASE Score Threshold (Maximum Probability):
The threshold of the EASE Score, a modified Fisher Exact P-Value, for gene-enrichment analysis. It ranges from 0 to 1. A Fisher Exact P-Value of 0 represents perfect enrichment. Usually a P-Value equal to or smaller than 0.05 is considered strongly enriched in the annotation categories. Default is 0.1.
Count Threshold (Minimum Count):
The threshold for the minimum number of genes belonging to an annotation term. It has to be equal to or greater than 0. Default is 2. In short, a term with only one gene involved is not trusted.
RT (Related Term Search):
Related Term Search can identify other similar terms.
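The effect of the two chart thresholds can be pictured with a minimal sketch. The row layout (dictionaries with `term`, `count`, and `ease` keys) and the sample values are invented for illustration and do not reflect DAVID's export format.

```python
def filter_chart(rows, max_prob=0.1, min_count=2):
    """Keep rows whose EASE Score is at most `max_prob` and whose gene
    count is at least `min_count` (defaults mirror Max.Prob. and Min.Count)."""
    return [
        row for row in rows
        if row["ease"] <= max_prob and row["count"] >= min_count
    ]


# Hypothetical chart rows, for illustration only.
rows = [
    {"term": "term A", "count": 5, "ease": 0.002},
    {"term": "term B", "count": 1, "ease": 0.010},  # fails Min.Count
    {"term": "term C", "count": 4, "ease": 0.300},  # fails Max.Prob.
]
print([row["term"] for row in filter_chart(rows)])
```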
6. Functional Annotation Clustering Report
Functional Annotation Clustering:
Due to the redundant nature of annotations, the Functional Annotation Chart presents similar/relevant annotations
repeatedly, which dilutes the biological focus of the report. To reduce the redundancy, the Functional Annotation Clustering report groups and displays similar annotations together, which makes the
biology clearer and more focused than in the traditional chart report. The grouping algorithm is based on the hypothesis that similar
annotations should have similar gene members. Functional Annotation Clustering uses the same techniques as the Gene Functional Classification Tool: Kappa statistics to measure the degree of common genes between two
annotations, and fuzzy heuristic clustering to classify groups of similar
annotations according to their kappa values. In this sense, the more common genes annotations share, the higher the chance they will be grouped together.
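To make the kappa measurement concrete, the sketch below computes Cohen's kappa for two annotation terms from their gene memberships. It assumes the comparison is made over a fixed set of genes (here called `universe`, for example the user's gene list); that choice, the function name, and the gene labels are assumptions for illustration, and the fuzzy heuristic clustering step is not shown.

```python
def kappa(genes_a, genes_b, universe):
    """Cohen's kappa for the agreement between two annotation terms,
    each viewed as a yes/no label on every gene in `universe`."""
    universe = set(universe)
    a = set(genes_a) & universe
    b = set(genes_b) & universe
    n = len(universe)
    both = len(a & b)                  # genes annotated to both terms
    only_a = len(a - b)
    only_b = len(b - a)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n    # observed agreement
    # Agreement expected by chance from the marginal totals.
    expected = (len(a) * len(b) + (n - len(a)) * (n - len(b))) / (n * n)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is an arbitrary choice
    return (observed - expected) / (1.0 - expected)


universe = [f"g{i}" for i in range(1, 11)]       # 10 hypothetical genes
print(kappa(["g1", "g2", "g3"], ["g1", "g2", "g3"], universe))  # identical terms
print(kappa(["g1", "g2", "g3"], ["g4", "g5", "g6"], universe))  # no shared genes
```

Terms with identical gene members reach the maximum of 1, while terms with no genes in common fall below 0, so higher kappa values correspond to more shared genes.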
The p-values associated with each annotation term inside each cluster have exactly the same meaning and values as the p-values (Fisher Exact/EASE Score) shown in the regular chart report for the same terms.
The Group Enrichment Score, the geometric mean (in -log scale) of the members' p-values in a corresponding annotation cluster, is used to rank their biological significance. Thus, the top-ranked annotation groups most likely have consistently lower p-values for their annotation members.
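A minimal sketch of this score follows, assuming that "-log scale" means base-10 logarithms (the text does not state the base). Averaging the -log10 values of the member p-values is the same as taking -log10 of their geometric mean. The function name is illustrative.

```python
from math import log10


def group_enrichment_score(pvalues):
    """Mean of -log10(p) over the member terms of one annotation cluster.
    Assumes every p-value is greater than 0."""
    return sum(-log10(p) for p in pvalues) / len(pvalues)


# Two member terms with p = 0.01 and p = 0.0001:
# their geometric mean is 0.001, so the score is about 3.
print(group_enrichment_score([0.01, 0.0001]))
```

Under the base-10 assumption, a higher score means a smaller typical p-value among the cluster's members.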
Options:
The options follow the same idea as the options in Gene Functional Classification.
7. Other reports/views
Functional Annotation Table is a gene-centric view that lists the genes and their associated annotation terms (selected only). No statistics are applied in this report.
Gene Report is a highly integrated view of a single gene and its general annotations/accessions from multiple resources. It can quickly give a global idea about the gene. The hyperlinks throughout the report lead users to the original resources for further details.
DAVID Pathway Viewer displays user genes on static pathway maps generated by BioCarta and KEGG.
Some Terminology in the DAVID System
Annotation Category:
A group of annotation sources addressing similar biological questions. For example, "Pathways" is an annotation category consisting of BioCarta, KEGG, etc.
Annotation Source:
An independent database in a category, such as BioCarta Pathways.
Term:
A detailed item in an annotation source, such as the p53 signaling pathway in BioCarta.
Hierarchical Structure: Category -> Annotation Source -> Term
Pathways -> BioCarta -> p53 signaling pathway
DAVID Gene ID:
An internal ID generated from the "DAVID Gene Concept" in the DAVID system. One DAVID gene ID represents one unique gene cluster belonging to a single gene entry.
Gene-Enrichment:
A set of the user's input genes is highly associated with certain terms, which is statistically measured by Fisher Exact in the DAVID system.
EASE Score:
The modified Fisher Exact statistic used in the DAVID system: a one-tailed Fisher Exact probability value, made more conservative by removing one gene from the List Hits, used for gene-enrichment analysis.
DAVID Id %:
After user input gene IDs are converted to the corresponding DAVID gene IDs, this is the percentage of DAVID genes in the list associated with a particular
annotation term. Since a DAVID gene ID is unique per gene, DAVID ID% presents the gene-annotation association more accurately, because it
removes any redundancy in the user gene list (i.e., two user IDs that represent the same gene).
DAVID Knowledgebase:
The DAVID Oracle databases, which collect a large volume of annotation information from a wide range of public bioinformatics resources.
It is probably the largest and most comprehensive integrated database in the field.
Last Edit: Sep. 2020