1. Introduction
2. General Analysis Data Flow
3. Options
4. View Results in Text Mode
5. View Gene-annotation on 2-D View
6. Introduction of heuristic fuzzy
Grouping genes based on
functional similarity can systematically enhance the biological
interpretation of large gene lists derived from high-throughput
studies. The Functional Classification Tool generates a gene-to-gene
similarity matrix based on shared functional annotation, using over 75,000
terms from 14 functional annotation sources. Our novel clustering
algorithm classifies highly related genes into functionally related
groups. Tools are provided to
explore each functional gene cluster, including a listing of the
"consensus terms" shared by the genes in the cluster, a display of
enriched terms, and a heat map visualization of gene-to-term
relationships. A global view of cluster-to-cluster relationships is
provided using a fuzzy heat map visualization. Summary information
provided by the Functional Classification Tool is extensively linked to
DAVID Functional Annotation Tools and to external databases, allowing
further detailed exploration of gene and term information. The Functional Classification Tool
provides a rapid means to organize large lists of genes into
functionally related groups to help unravel the biological content
captured by high-throughput technologies.
2. General Analysis Data Flow
3. Options
Clustering Stringency (lowest
-> highest): a single high-level control that sets the detailed
parameters used by the functional classification algorithm. In
general, a higher stringency setting generates fewer functional groups
with more tightly associated genes in each group, so more genes
are treated as "irrelevant" and placed in the unclustered group. The default
setting is Medium, which gives balanced results in most cases based on
our studies. Customize allows you to set each of the following
parameters individually:
Similarity Term Overlap (any value
>=0; default = 4): the minimum number of
annotation terms that two genes must share to qualify
for kappa calculation. This parameter maintains the
statistical power needed to make the kappa value meaningful. The higher
the value, the more meaningful the result.
Similarity Threshold (any value
between 0 and 1; default = 0.35): the minimum
kappa value considered biologically significant. The higher
the setting, the more genes are put into the unclustered group, which leads
to a higher-quality functional classification result with fewer
groups and fewer gene members. Based on our genome-wide distribution
study, kappa values of 0.3 and above start to give meaningful biology.
Anything below 0.3 has a great chance of being noise.
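
To illustrate how the two similarity parameters interact, here is a minimal sketch that computes a standard Cohen's kappa between the binary annotation profiles of two genes and then applies the term-overlap and kappa cutoffs. The function names (kappa, similarity), the set-based data layout, and the choice to treat both cutoffs as inclusive minimums are assumptions made for this example, not the tool's actual implementation.

```python
def kappa(terms_a, terms_b, all_terms):
    """Cohen's kappa between two genes' binary annotation profiles.

    terms_a, terms_b: sets of terms annotated to each gene (subsets of all_terms).
    all_terms: the full set of annotation terms under consideration.
    """
    n = len(all_terms)
    both = len(terms_a & terms_b)        # terms annotated to both genes
    only_a = len(terms_a - terms_b)      # terms annotated to gene A only
    only_b = len(terms_b - terms_a)      # terms annotated to gene B only
    neither = n - both - only_a - only_b
    observed = (both + neither) / n      # observed agreement
    p_a = (both + only_a) / n            # fraction of terms on gene A
    p_b = (both + only_b) / n            # fraction of terms on gene B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if expected == 1.0:
        return 0.0                       # degenerate profiles carry no information
    return (observed - expected) / (1 - expected)


def similarity(terms_a, terms_b, all_terms, min_overlap=4, threshold=0.35):
    """Return kappa if the pair passes both cutoffs, otherwise None."""
    if len(terms_a & terms_b) < min_overlap:
        return None                      # too few shared terms to trust kappa
    k = kappa(terms_a, terms_b, all_terms)
    return k if k >= threshold else None
```

With 20 terms in total and two genes sharing 5 of their 6 terms each, this kappa is about 0.76 and the pair passes both default cutoffs; a pair sharing only 2 terms is rejected before kappa is even computed.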
Initial Group Members (any value
>=2; default = 4): the minimum number of genes
in a seeding group, which affects the minimum size of each
functional group in the final result. In general, a lower value
includes more genes in functional groups and, in particular, generates
many small groups.
Final Group Members (any value
>=2; default = 4): the minimum number of genes
in a final group after the "cleanup" procedure. In general, a
lower value includes more genes in functional groups and,
in particular, generates many small groups. It works together with the
previous parameters to control the minimum size of functional groups.
If you are interested in functional groups containing only 2 or 3
genes, you need to set it to a very low value. Otherwise, small
groups will not be displayed and their genes will be put into the
unclustered group.
Multi-linkage Threshold (any value
between 0% and 100%; default = 50%): controls
how seeding groups merge with each other, i.e. two groups sharing
more than this percentage of gene members become one group. A
higher percentage, in general, gives sharper separation, i.e. it
generates more final functional groups with more tightly associated
genes in each group. Changing this parameter does not
move extra genes into the unclustered group.
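
The effect of the multi-linkage threshold can be sketched with a small merging loop. This is only an illustration under stated assumptions: the function names (share_ratio, merge_groups) are hypothetical, the shared percentage is measured against the smaller of the two groups, and "over the percentage" is read as a strict comparison. The tool's own definition may differ.

```python
def share_ratio(g1, g2):
    """Fraction of the smaller group's members that also appear in the other group.

    Using the smaller group as the denominator is an assumption of this sketch.
    """
    return len(g1 & g2) / min(len(g1), len(g2))


def merge_groups(seeds, threshold=0.5):
    """Repeatedly merge any two groups whose shared membership exceeds threshold."""
    groups = [set(s) for s in seeds]
    merged = True
    while merged:                        # stop once a full pass finds nothing to merge
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if share_ratio(groups[i], groups[j]) > threshold:
                    groups[i] |= groups[j]   # absorb group j into group i
                    del groups[j]
                    merged = True
                    break
            if merged:
                break                    # restart the scan after every merge
    return groups


seeds = [
    {"A", "B", "C", "D"},
    {"B", "C", "D", "E"},
    {"D", "E", "F", "G"},
    {"W", "X", "Y", "Z"},
]
print(len(merge_groups(seeds, threshold=0.5)))
print(len(merge_groups(seeds, threshold=0.8)))
```

In this toy input the first two seeds share 75% of their members, so they merge at the 50% setting but stay separate at 80%, giving more final groups at the higher threshold. Every gene remains in at least one group at either setting, and genes D and E end up in two groups.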
4. View Results in Text Mode
Gene(s) not in the output: genes in the user's list that are not
mapped to any of the functional groups, i.e. orphan genes or irrelevant
genes. The possible reasons are: 1. The gene does not have a
relationship above the similarity threshold with any other gene. 2. The gene
has relationships with a few other genes, but together they do not have enough
members to form a functional group based on the minimum final group
members setting. 3. False negative. Our current algorithm could have a
false negative rate of up to 2%. If you believe this has happened with your list,
please report it to us.
Enriched Term in Group (T):
submits the gene members of the
group to our functional
annotation engine. The resulting DAVID chart report highlights
the most likely biology associated with the group.
2-D View: lets the user see gene members
and their associated
annotation terms in a heat-map view, so that the user can further
explore the gene-gene and term-term relationships within a group. The terms displayed in the map have to pass
the term frequency setting in the options section, i.e. by default a
term must be associated with at least 50% of the genes in the group.
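
As a rough illustration of the term frequency filter, the sketch below keeps only the terms annotated to at least a given fraction of the genes in a group. The function name frequent_terms, the dictionary layout, and the inclusive 50% boundary are assumptions made for this example.

```python
def frequent_terms(gene_terms, min_fraction=0.5):
    """Return the terms associated with at least min_fraction of the genes.

    gene_terms: dict mapping each gene in the group to its set of annotation terms.
    """
    counts = {}
    for terms in gene_terms.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1   # number of genes carrying the term
    n_genes = len(gene_terms)
    return sorted(t for t, c in counts.items() if c / n_genes >= min_fraction)


group = {
    "gene1": {"kinase", "ATP-binding"},
    "gene2": {"kinase", "membrane"},
    "gene3": {"kinase", "ATP-binding"},
    "gene4": {"nucleus"},
}
print(frequent_terms(group))
```

Here "kinase" (3 of 4 genes) and "ATP-binding" (2 of 4 genes) pass the default setting, while "membrane" and "nucleus" (1 of 4 genes each) would not be displayed.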
Group Enrichment Score: ranks the
biological significance of gene groups based on
the overall EASE scores of all enriched annotation terms. In other
words: step 1, run the user's gene list through the DAVID functional annotation
chart to get a p-value (EASE score) for each enriched annotation term;
step 2, calculate the geometric mean of the EASE scores of the terms involved
in this gene group.
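
Step 2 can be sketched in a few lines. The function name group_enrichment_score and the example p-values are hypothetical; the sketch only covers the geometric mean described above and assumes every EASE score is greater than zero.

```python
import math


def group_enrichment_score(ease_scores):
    """Geometric mean of the EASE scores (p-values) of a group's enriched terms.

    Computed in log space to avoid underflow when many small p-values are multiplied.
    """
    if not ease_scores:
        raise ValueError("at least one EASE score is required")
    log_sum = sum(math.log(p) for p in ease_scores)   # requires every p > 0
    return math.exp(log_sum / len(ease_scores))


# Hypothetical EASE scores for the enriched terms of one gene group.
print(group_enrichment_score([0.01, 0.0001]))
```

Because EASE scores are p-values, a smaller geometric mean indicates a more strongly enriched group.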
Search Related Genes (RG): summarizes the
common (consensus) annotation term profile
of the functional group based on term frequency and asks the question
"which other genes have a similar annotation term profile?". The function
allows the user to search within the user's list or defined genomes, e.g. homo
2-D View: lets
users examine the commonalities and differences in annotations across the
group's gene members. See the 2-D section for more details.
Gene-Annotation Association on 2-D View
Multiple Linkage Clustering
We developed a novel
heuristic partitioning procedure that allows an object (gene) to
participate in more than one cluster. Using this method to
group related genes better reflects the nature of biology, in that a
given gene may be associated with more than one functional group of
genes. Two additional advancements included in this algorithm are: 1)
the automatic determination of the optimal number of clusters (K), and
2) the exclusion of members (genes) that have weak relationships to
other members. Users are permitted to
change the default parameters to set cluster membership similarity
stringencies. Fuzzy Heuristic Partitioning
of a gene list yields high-quality clusters of highly related genes,
with some genes participating in more than one functional cluster.
1) Fuzzy seeding by
allowing each gene to serve as a medoid (# neighbors > 4 &&
cross relevance > 50%)
2) Merge seeding clusters by multiple linkage
3) Repeat step 2 until no more merges are needed
A hypothetical step-by-step
illustration of the heuristic fuzzy
partition algorithm. A. Hypothetically, each element (gene) can be positioned in
a virtual two-dimensional space based on its characteristics (annotation terms); the
distance represents the degree of relationship (kappa score)
among the genes. B. Any gene has a chance to serve as a medoid
and form an initial seeding group. Only initial groups with enough
closely related members (e.g. members > 3 and kappa
score >= 0.4) are qualified (solid circles). Conversely, unqualified groups are shown
in dashed circles. Importantly, genes not covered by any qualified
seeding group are considered outliers (gray), which are carried along
but do not participate in the next steps. C.
Every qualified initial seeding group is iteratively merged with the
others to form
larger groups based on the multi-linkage rule, i.e. sharing 50% or more of their
memberships, until all secondary clusters (thicker ovals) are stable. D.
Three final groups (thicker ovals) are formed because they can no longer be
merged with any other group. One gene (in red) belonging to two groups
represents the fuzziness capability of the algorithm. Outliers (in gray) are
removed for clearer presentation.
Last Edit: Jan. 2007