Meta learning on string kernel SVMs for string categorization
Even though Support Vector Machines (SVMs) are capable of identifying patterns in high dimensional spaces that are presented by kernels without a computational decrease, their performance is determined by two main factors: SVM cost param- eter and kernel parameters. It is empirically proven that the optimum parameter combination for both SVM cost and kernel yield high pattern recognition accuracies in SVMs. Various methods are discussed in the literature to optimize these SVM parameters, namely trial and error method, grid optimization, leave-one-out cross validation, generalization error estimation using gradient descent, and evolutionary algorithms. However, these optimization methods have downfalls: the trial and error method is considered to be imprecise and unreliable (Zhang & He, 2010), the grid parameter search requires complex computation (Friedrichs & Igel, 2005), leave-one- out cross validation is considered to be computationally expensive, and the gradient descent method can fall into a local `minima' (Zhang & He, 2010). Also, these meth- ods mainly optimize numeric kernels. However, when optimizing string kernel SVMs in string classi¯cation, researchers are confronted with two main obstacles. Firstly, there is little or no literature pertaining to string kernel SVM optimization at the moment. Secondly, according to the experiments: Rieck and Laskov (2007), Lodhi, Saunders, Shawe-Taylor, Cristianini, and Watkins (2002), Sharma, Girolami, and Sventek (2007) and S.Sonnenburg (2006), string dataset characteristics in°uence the optimum parameter combination of string kernel SVMs. The current string kernel SVM optimization methods do not take string dataset characteristics into account. Baring this in mind, the thesis initially attempts to identify a mechanism to extract string dataset characteristics from a string dataset. It then, attempts to derive a string kernel SVM optimization method which uses these extracted string dataset characteristics. The initial objective is achieved by explaining that a sting dataset is presentable in token-frequency space (where each dimension is represented by a token) and then computing string meta-features by the points of this token-frequency space. Now for a machine learning algorithm, a meta model is trained using computed string meta-features for each dataset in a string dataset pool, learning algorithm parameters, and accuracy information. This trained meta model helps to predict string classi¯cation accuracy for a new string dataset by computing relevant string meta-features. This principle is employed to optimize string kernel SVMs in the proposed meta learning algorithm (this ful¯lls the second objective of the thesis) where it only computes the relevant string meta-features to yield optimum parameter combination for the machine learning algorithm on the new string dataset. In the experiments, three string kernel SVMs, edit-distance SVM, bag-of-words SVM and n-gram SVM, were optimized using the proposed algorithm on four string datasets: spam, Reuters-21578, Network Application Detection and e-News Cate- gorization. The experiment results revealed that the algorithm was able to produce parameter combinations which yield good string classi¯cation accuracies for string kernel SVMs on most string datasets. It is also revealed that some string kernel SVMs may not be suitable on certain string datasets. As future work, one could research for more string meta-features that help to employ meta learning on string classi¯cation more e®ectively. Also, one could introduce upper and lower bounds to the proposed algorithm, which help the predicted string classi¯cation accuracy to be within the range (0,100). Further experiments on more string datasets will help to identify the robustness of the algorithm.