Some key concepts to understand with respect to the over-representation calculation:
The "population" of genes is all genes assayed (e.g. all genes on the microarray) AND ANNOTATED within a given system of classifying genes (e.g. the 'Molecular Function' branch of the Gene Ontology). The population can therefore change from one system to the next. The "Population Total" reported by EASE for "Molecular Function" is thus the number of genes on the microarray that are annotated with some Gene Ontology molecular function category. This method of "system-specific populations" is critical for enabling side-by-side comparisons of gene classifications derived from systems that have good coverage (e.g. the Gene Ontology) and those with poor coverage (e.g. systems based on known regulation by transcription factors).
"Hits" refers to genes falling within the gene category in question. Therefore "Population Hits" for the Biological Process "apoptosis" refers to the number of genes falling within the category "apoptosis" out of all genes in the population annotated with a Biological Process. Similarly, "List Hits" refers to the number of genes in the gene list that fall within a specific category.
EASE first maps all gene identifiers in the population to "primary gene identifiers". The default "primary gene identifier" in EASE is the LocusLink number. This step guards against multiple identifiers on the list referring to the same gene (typical of Genbank accessions), which would otherwise give that gene multiple spurious "votes" for its categories in the over-representation analysis. The primary gene identifiers are then mapped to gene categories within various categorical systems, the "Population Total" is determined for each system of gene categorization, and the "Population Hits" is determined for every category within those systems.
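The mapping and counting steps described above can be sketched roughly as follows. This is only an illustration, not EASE's actual code: the lookup tables, accession strings, LocusLink numbers and function names are all made up for the example.

```python
from collections import defaultdict

# Hypothetical lookup tables (not real EASE data files).
# Two accessions (AA001, AA002) refer to the same gene, 101.
accession_to_locus = {"AA001": 101, "AA002": 101, "AA003": 102,
                      "AA004": 103, "AA005": 104}
# Annotations for one system only; gene 104 has no annotation in it.
locus_to_categories = {101: {"apoptosis"},
                       102: {"apoptosis", "cell cycle"},
                       103: {"cell cycle"}}

def to_primary_ids(identifiers, id_map):
    """Collapse identifiers to a set of primary IDs so each gene counts once."""
    return {id_map[i] for i in identifiers if i in id_map}

def population_counts(primary_ids, annotations):
    """Return (Population Total, {category: Population Hits}) for one system."""
    annotated = {g for g in primary_ids if annotations.get(g)}
    hits = defaultdict(int)
    for gene in annotated:
        for category in annotations[gene]:
            hits[category] += 1
    return len(annotated), dict(hits)

population = to_primary_ids(accession_to_locus.keys(), accession_to_locus)
population_total, population_hits = population_counts(population, locus_to_categories)
print(population_total, population_hits)
```

In this toy data, five accessions collapse to four genes, and only the three genes annotated in the system count toward its Population Total.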
Given a gene list that represents some subset of the population genes, the "List Total" and "List Hits" counts can be determined. The probability of seeing the number of "List Hits" in the "List Total", given the frequency of "Population Hits" in the "Population Total", is then calculated as the Fisher exact probability. EASE can also calculate another metric known as the "EASE score", which is the upper bound of the distribution of Jackknife Fisher exact probabilities. The EASE score is essentially a sliding-scale, conservative adjustment of the Fisher exact probability that strongly penalizes the significance of categories supported by few genes and negligibly penalizes categories supported by many genes. Rankings based on it are therefore more robust: a category is less likely to rank highly on the strength of one or two genes alone. The EASE score is the default metric used by EASE to rank categories of genes by over-representation.
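A minimal sketch of the two metrics, using only the Python standard library, might look like the following. It rests on two assumptions that the text above does not spell out: the Fisher exact probability is taken as the one-tailed (over-representation) hypergeometric tail, and the jackknife upper bound is taken to be the probability recomputed after removing one category gene from the list (List Hits and List Total each reduced by one). The function names and example counts are illustrative.

```python
from math import comb

def fisher_upper_tail(list_hits, list_total, pop_hits, pop_total):
    """P(at least list_hits category genes in a random list of list_total genes)."""
    first = max(list_hits, 0)
    last = min(list_total, pop_hits)
    favourable = sum(
        comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
        for k in range(first, last + 1)
    )
    return favourable / comb(pop_total, list_total)

def ease_score(list_hits, list_total, pop_hits, pop_total):
    """Assumed jackknife form: drop one category gene from the list, then recompute."""
    if list_hits < 1:
        return 1.0
    return fisher_upper_tail(list_hits - 1, list_total - 1, pop_hits, pop_total)

# Same enrichment (20% of the list vs 1% of the population), different support.
few = (2, 10, 50, 5000)     # List Hits, List Total, Population Hits, Population Total
many = (20, 100, 50, 5000)
for counts in (few, many):
    print(counts, fisher_upper_tail(*counts), ease_score(*counts))
```

Under these assumptions the score is never smaller than the Fisher exact probability, and a category with a single List Hit always scores 1.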
Definitions of default fields in the results:
System = the system of categorizing genes
Gene Category = the specific category of genes within the System
List Hits = number of genes in the gene list that belong to the Gene Category
List Total = number of genes in the gene list
Population Hits = number of genes in the total group of genes assayed that belong to the specific Gene Category
Population Total = number of genes in the total group of genes assayed that belong to any Gene Category within the System
EASE score = The upper bound of the distribution of Jackknife Fisher exact probabilities given the List Hits, List Total, Population Hits and Population Total
Bonferroni = This is a conservative adjustment to the EASE score that multiplies it by the number of Gene Categories for which over-representation was calculated in order to control for the multiple comparison effect.
Gene identifiers = List of LocusLink numbers (or whatever the custom schema is using as the "primary gene identifier") from the gene list that fall into the Gene Category.
Genbank accessions = List of identifiers (in this case: Genbank accessions) from the gene list that fall into the Gene Category.
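The Bonferroni field defined above amounts to a single multiplication per category. The sketch below assumes the adjusted value is capped at 1, which the definition does not state. The category names and scores are invented for illustration.

```python
def bonferroni(scores):
    """Multiply each score by the number of categories tested.

    Capping the result at 1.0 is an assumption made here so that the
    adjusted value still reads as a probability.
    """
    n_categories = len(scores)
    return {category: min(1.0, score * n_categories)
            for category, score in scores.items()}

# Hypothetical EASE scores for three categories tested together.
ease_scores = {"apoptosis": 0.01, "cell cycle": 0.2, "transport": 0.5}
adjusted = bonferroni(ease_scores)
print(adjusted)
```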