Gene Functional Classification
1. Introduction
2. General Analysis Data Flow
3. Options
4. View Results in Text Mode
5. View Gene-annotation Association on 2D View
6. Heuristic Multiple Linkage Clustering

1. Introduction
Grouping genes by functional similarity can systematically enhance the biological interpretation of large gene lists derived from high-throughput studies. The Functional Classification Tool generates a gene-to-gene similarity matrix based on shared functional annotation, using thousands of annotation terms from 12 functional annotation categories. Our novel clustering algorithm then classifies highly related genes into functionally related groups. Tools are provided to explore each functional gene cluster further, including a listing of the 'consensus terms' shared by the genes in the cluster, a display of enriched terms, and a heat map visualization of gene-to-term relationships. A global view of relationships is provided by a fuzzy heat map visualization. Summary information provided by the Functional Classification Tool is extensively linked to the DAVID Functional Annotation Tools and to external databases, allowing detailed exploration of gene and term information. The Functional Classification Tool provides a rapid means of organizing large gene lists into functionally related groups to help unravel the biological content captured by high-throughput technologies.
2. General Analysis Data Flow




3. Options
Standard Options



Clustering Stringency (lowest → highest): A single high-level control that sets the group of detailed parameters used by the functional classification algorithm. In general, a higher stringency setting generates fewer functional groups with more tightly associated genes in each group, so more genes will go unclustered. The default setting is Medium, which gives balanced results for most cases based on our studies. Customization allows you to control the Advanced Options.

Advanced Options



Similarity Term Overlap (any value ≥ 0; default = 4): The minimum number of annotation terms that two genes must share in order to qualify for the kappa calculation. This parameter maintains the statistical power needed to make the kappa value meaningful. The higher the value, the more meaningful the result.

Similarity Threshold (any value between 0 and 1; default = 0.35): The minimum kappa value to be considered significant. A higher setting causes more genes to go unclustered, which gives a higher quality functional classification result with fewer groups and fewer gene members. Based on our genome-wide distribution study, a kappa value of 0.3 is where results start to be biologically meaningful. Anything below 0.3 is likely to be noise.
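
The two similarity parameters above can be illustrated with a minimal sketch. It assumes the kappa value is the standard Cohen's kappa computed over binary annotation profiles (a term is either attached to a gene or not) and that "overlap" means the number of terms shared by both genes. The function name, its arguments, and the choice to return None for unqualified pairs are hypothetical and are not taken from the DAVID implementation.

```python
def gene_kappa(terms_a, terms_b, n_terms, min_overlap=4):
    """Cohen's kappa between two genes' annotation term sets (illustrative).

    terms_a, terms_b: sets of annotation term IDs attached to each gene.
    n_terms: total number of annotation terms considered (assumed known).
    Returns None when the pair shares fewer than min_overlap terms.
    """
    both = len(terms_a & terms_b)        # terms attached to both genes
    if both < min_overlap:
        return None                      # not enough shared terms to qualify
    only_a = len(terms_a - terms_b)
    only_b = len(terms_b - terms_a)
    neither = n_terms - both - only_a - only_b
    if neither < 0:
        raise ValueError("n_terms is smaller than the number of distinct terms")
    observed = (both + neither) / n_terms
    expected = ((both + only_a) * (both + only_b)
                + (only_b + neither) * (only_a + neither)) / n_terms ** 2
    if expected == 1:
        return None                      # kappa is undefined in this case
    return (observed - expected) / (1 - expected)


def is_significant(kappa_value, threshold=0.35):
    """Apply the Similarity Threshold to a kappa value."""
    return kappa_value is not None and kappa_value >= threshold
```

In this sketch a gene pair that fails the term-overlap check is simply treated as unrelated, which is one plausible reading of "qualified for kappa calculation".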

Initial Group Members (any value ≥ 2; default = 4): The minimum number of genes in a seeding group, which affects the minimum size of each functional group in the final result. In general, a lower value includes more genes in functional groups and may generate many small groups.

Final Group Members (any value ≥ 2; default = 4): The minimum number of genes in a final group after the 'cleanup' procedure. In general, a lower value includes more genes in functional groups and may generate many small groups. It works together with the previous parameters to control the minimum size of functional groups. If you are interested in functional groups containing only 2 or 3 genes, you need to set it to a very low value. Otherwise, the small groups will not be displayed and their genes will go unclustered.

Multi-linkage Threshold (any value between 0% and 100%; default = 50%): This parameter controls how seeding groups merge with each other: two groups that share more than this percentage of their gene members become one group. A higher percentage, in general, gives sharper separation (i.e. it generates more final functional groups with more tightly associated genes in each group). Changing this parameter does not cause additional genes to go unclustered.
4. View Results in Text Mode



Gene(s) not in the output: Any genes in the user's list that are NOT mapped to any of the functional groups (i.e. orphan genes or irrelevant genes). The possible reasons are: (1) the gene has no relationship above the similarity threshold with any other gene, (2) the gene is related to a few other genes, but together they do not have enough members to form a functional group under the minimum Final Group Members setting, and (3) a false negative. We know that our current algorithm can have a false negative rate of up to 2%. If you believe this has happened with your list, please report it to us.

Enriched Term in Group (T): Submits the gene members in the group to our functional annotation engine. The result of the DAVID chart report tries to highlight the most likely biology associated with the group.

Group Enrichment Score: Ranks the biological significance of gene groups based on the overall EASE scores of all enriched annotation terms: (step 1) run the user's gene list through the DAVID Functional Annotation Chart to get a p-value (EASE score) for each enriched annotation term, and (step 2) calculate the geometric mean of the EASE scores of the terms involved in this gene group.
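
Step 2 can be sketched as follows. This is a minimal illustration, not the DAVID code: the function name and the `floor` guard against zero p-values are hypothetical, and returning the result on a -log10 scale alongside the plain geometric mean is an assumption made so that larger numbers indicate stronger enrichment.

```python
import math


def group_enrichment_score(ease_scores, floor=1e-300):
    """Geometric mean of a group's EASE scores (illustrative sketch).

    ease_scores: p-values (EASE scores) of the enriched terms in one group.
    floor: hypothetical guard so that a p-value of 0 does not break log10.
    Returns (geometric_mean, minus_log10_of_geometric_mean).
    """
    if not ease_scores:
        raise ValueError("at least one EASE score is required")
    logs = [math.log10(max(p, floor)) for p in ease_scores]
    mean_log = sum(logs) / len(logs)     # mean in log space = geometric mean
    return 10 ** mean_log, -mean_log
```

For example, two terms with EASE scores 0.01 and 0.0001 have a geometric mean of 0.001, which corresponds to 3 on the -log10 scale.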

Related Genes (RG): Summarizes the consensus annotation term profile of the functional group based on term frequency and asks the question, "Which other genes have a similar annotation term profile?". The function allows a user to search within their own list or within a defined genome, e.g. Homo sapiens.

2D View: Allows the user to see gene members and their associated annotation terms in a heat-map style view, so that the gene-gene and term-term relationships within a group can be explored further. The terms displayed in the map must pass the term frequency setting in the options section (by default, a term must be associated with 50% of the genes in the group). See the 2D View section for more details.
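
The term frequency filter can be illustrated with a short sketch. It assumes that "passing" means the term is attached to at least the given fraction of the group's genes; whether the real cutoff is inclusive is not stated here, and the function and argument names are hypothetical.

```python
from collections import Counter


def terms_for_2d_view(gene_terms, min_frequency=0.5):
    """Select the terms to display for one functional group (illustrative).

    gene_terms: dict mapping each gene in the group to its set of terms.
    min_frequency: fraction of genes that must carry a term (assumed >=).
    """
    n_genes = len(gene_terms)
    counts = Counter(t for terms in gene_terms.values() for t in set(terms))
    return sorted(t for t, c in counts.items() if c / n_genes >= min_frequency)
```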
5. View Gene-annotation Association on 2D View




6. Heuristic Multiple Linkage Clustering
We developed a novel heuristic partitioning procedure that allows an object (gene) to participate in more than one cluster. Grouping related genes this way better reflects the nature of biology, in that a given gene may be associated with more than one functional group of genes. The algorithm has two additional advantages: (1) the optimal number of clusters (K) is determined automatically, and (2) members (genes) that have weak relationships to other members are excluded. Users may change the default parameters to set the stringency of cluster membership similarity. Fuzzy heuristic partitioning of a gene list yields high quality clusters of highly related genes, with some genes participating in more than one functional cluster.

Algorithm:
  • Fuzzy seeding by allowing each gene to serve as a medoid (number of neighbors > 4 and cross relevance > 50%)
  • Merge seeding clusters by multiple linkage
  • Repeat the last step until no further merging is needed


Graphical illustration of the heuristic fuzzy partition algorithm

A: Hypothetically each element (gene) can be positioned in a virtual two-dimensional space based on its characteristics (annotation terms). The distance represents the degree of the relationship (kappa score) among the genes.

B: Every gene has a chance to serve as a medoid and form an initial seeding group. Only the initial groups with enough closely related members (e.g. members > 3 & kappa score ≥ 0.4) qualify (solid circles). Unqualified groups are shown in dashed circles. Importantly, genes not covered by any qualified initial seeding group are considered outliers; they are carried along but do not participate in the next steps.

C: The qualified initial seeding groups are iteratively merged with one another to form larger groups based on the multi-linkage rule, i.e. sharing 50% or more of their members, until all secondary clusters (thicker ovals) are stable.

D: Finally, three final groups (thicker ovals) are formed because they can no longer be merged with any other group. One gene (in red) belongs to two groups, which represents the fuzziness of the algorithm. Outliers (gray in C) are removed for clearer presentation.
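
Steps B to D can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the DAVID implementation: the pairwise kappa scores are supplied in a hypothetical lookup dict, "cross relevance" is interpreted as the fraction of member pairs in a seed whose kappa passes the threshold, and the shared-member percentage is measured against the smaller of the two groups. The function and parameter names are hypothetical; the default values mirror the options described in Section 3.

```python
from itertools import combinations


def heuristic_cluster(genes, kappa, threshold=0.35, initial_members=4,
                      final_members=4, linkage=0.5):
    """Fuzzy multiple-linkage clustering sketch.

    kappa: dict mapping frozenset({gene1, gene2}) to a kappa score
           (hypothetical input format; missing pairs count as unrelated).
    Returns (final_groups, unclustered_genes).
    """
    def related(a, b):
        return kappa.get(frozenset((a, b)), 0.0) >= threshold

    # B: every gene acts as a medoid and collects its close neighbors.
    seeds = set()
    for medoid in genes:
        group = frozenset([medoid] + [g for g in genes
                                      if g != medoid and related(medoid, g)])
        if len(group) < max(initial_members, 2):
            continue                     # too few members to seed a group
        pairs = list(combinations(group, 2))
        close = sum(related(a, b) for a, b in pairs)
        if close / len(pairs) > 0.5:     # assumed reading of cross relevance
            seeds.add(group)

    # C: merge groups that share enough members until nothing changes.
    groups = list(seeds)
    merged = True
    while merged:
        merged = False
        for a, b in combinations(groups, 2):
            if len(a & b) / min(len(a), len(b)) >= linkage:
                groups.remove(a)
                groups.remove(b)
                groups.append(a | b)
                merged = True
                break                    # restart the scan on the new list

    # D: drop groups that are too small; the remaining genes are outliers.
    final = [g for g in groups if len(g) >= final_members]
    clustered = set().union(*final) if final else set()
    return final, set(genes) - clustered
```

Because a gene may appear in several seeds that end up in different final groups, the result is fuzzy: the returned groups can overlap. The merge order in this sketch is arbitrary, so inputs with many partially overlapping seeds may give slightly different groupings from run to run.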

A hypothetical step-by-step example: DAVID_clustering_example.doc
Last edited on January 19, 2021