Gene selection based on consistency modelling, algorithms and applications
Consistency modelling for gene selection is a new topic emerging from recent cancer bioinformatics research. The results of classification or clustering on a training set are often found to differ markedly from those of the same operations on a testing set. Here, the issue is addressed as a consistency problem. In practice, the inconsistency of microarray datasets prevents many typical gene selection methods from working properly for cancer diagnosis and prognosis. In an attempt to deal with this problem, a new concept of performance-based consistency is proposed in this thesis.

An interesting finding in our previous experiments is that, by using a proper set of informative genes, we significantly improved the consistency characteristic of microarray data. How to select genes in terms of consistency modelling therefore becomes an interesting topic. Many previously published gene selection methods perform well in the cancer diagnosis domain, but questions have been raised about the irreproducibility of their experimental results. Motivated by this, two new gene selection methods based on the proposed performance-based consistency concept were developed in this study: GAGSc (Genetic Algorithm Gene Selection method in terms of consistency) and LOOLSc (Leave-one-out Least-Square bound method with consistency measurement). Their purpose is to identify a set of informative genes that yields replicable results in microarray data analysis.

The proposed consistency concept was investigated on eight benchmark microarray and proteomic datasets. The experimental results show that different microarray datasets have different consistency characteristics, and that better consistency can lead to an unbiased and reproducible outcome with good disease prediction accuracy.

As an implementation of the proposed performance-based consistency, GAGSc and LOOLSc are capable of providing a small set of informative genes.
Compared with traditional gene selection methods that do not use a consistency measurement, GAGSc and LOOLSc provide more accurate classification results. More importantly, GAGSc and LOOLSc have demonstrated that gene selection with the proposed consistency measurement is able to enhance reproducibility in microarray diagnosis experiments.