Malware motif identification using Bio-inspired Data Mining
The application of data mining techniques to biological data is well established. The aim of this thesis is to explore the effects of giving an amino acid representation to data that are problematic for machine learning, and to evaluate the benefits of supplementing traditional data mining techniques with bioinformatics tools, techniques and databases. The research focuses on methods for identifying patterns in the computer malware signatures typically used in current anti-virus software. In total, 60 computer virus signatures and 60 worm signatures were converted into amino acid representations and then aligned to produce fixed-length sequences as input to data mining techniques for classification and prediction. Standard protein databases and modellers were also used to give a biological interpretation of the polypeptide representations of the malware signatures and to find their biological analogues. Protein modelling of the consensus sequences produced through sequence alignment, and of the meta-signatures extracted through data mining, provides novel ways of looking at malware signatures and their possible structure and function. However, the results varied with the method of biological representation used, and further work is needed to determine the advantages and disadvantages of different methods for representing data as artificial polypeptide sequences.
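To make the conversion step concrete, the following is a minimal sketch of one way a hexadecimal signature could be turned into an artificial polypeptide and brought to a fixed length. It is not the representation used in the thesis: the byte-modulo-20 mapping, the function names `hex_to_polypeptide` and `pad_to_fixed_length`, and the use of right-padding with gap characters in place of a true sequence alignment are all assumptions made purely for illustration.

```python
# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def hex_to_polypeptide(signature_hex: str) -> str:
    """Map each byte of a hex signature to one amino acid letter.

    Hypothetical scheme: byte value modulo 20. This is lossy (several
    byte values share a letter) and is only one of many possible
    representations.
    """
    data = bytes.fromhex(signature_hex)
    return "".join(AMINO_ACIDS[b % len(AMINO_ACIDS)] for b in data)


def pad_to_fixed_length(sequences, gap="-"):
    """Right-pad sequences with gap characters to a common length.

    A crude stand-in for multiple sequence alignment, which would
    instead insert gaps wherever they best line up shared motifs.
    """
    width = max(len(s) for s in sequences)
    return [s.ljust(width, gap) for s in sequences]


if __name__ == "__main__":
    # Made-up hex strings, not real malware signatures.
    peptides = [hex_to_polypeptide(h) for h in ("00011314", "0001")]
    for row in pad_to_fixed_length(peptides):
        print(row)
```

Because the mapping is lossy, distinct signatures can collapse to the same sequence. That is one reason the choice of representation could plausibly affect downstream classification results.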