Data mining log file streams for the detection of anomalies
Log files play an important part in the day-to-day running of many systems and services, allowing administrators and other users to gain insight into operational, performance and even security issues; however, the volume of log data generated today makes manual examination impractical. Existing tools in this space largely work either by detecting anomalies in log files that have already been stored or by comparing entries against known errors (signatures). Data mining log file streams for anomalies instead allows administrators to reduce detection time significantly, with no signatures or complex settings to maintain. This paper presents the experimental work undertaken to define a generic, practical and scalable method for anomaly detection in streaming log files, based on detecting changes in the mix of log events occurring. A modified CRISP-DM (Cross Industry Standard Process for Data Mining) methodology was followed, enabling a broader, more flexible approach to the data mining process. The resulting solution combines common log file features with a weighted earth mover's distance metric, yielding a framework that can be applied broadly to many log file types. By setting a simple percentile threshold indicating an acceptable level of change, anomaly detection in streaming log files can be achieved.
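The core idea can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: it assumes events arrive pre-grouped into windows, uses a unit ground distance between event types (under which a weighted earth mover's distance reduces to a weighted half-L1 distance between event mixes), and a nearest-rank percentile of past inter-window distances as the threshold. The function names (`event_mix`, `weighted_distance`, `detect`) are hypothetical.

```python
from collections import Counter

def event_mix(events):
    """Normalise a window's event-type counts into a probability distribution."""
    counts = Counter(events)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

def weighted_distance(p, q, weights=None):
    """Weighted earth mover style distance between two categorical event
    mixes; with a unit ground distance this is a weighted half-L1 distance."""
    weights = weights or {}
    keys = set(p) | set(q)
    return 0.5 * sum(weights.get(k, 1.0) * abs(p.get(k, 0.0) - q.get(k, 0.0))
                     for k in keys)

def percentile(values, pct):
    """Simple nearest-rank percentile of a list of distances."""
    s = sorted(values)
    return s[min(len(s) - 1, int(pct / 100 * len(s)))]

def detect(windows, pct=95, weights=None):
    """Flag each window whose change from the previous window's event mix
    exceeds the chosen percentile of the distances seen so far."""
    history, flags, prev = [], [], None
    for win in windows:
        mix = event_mix(win)
        if prev is not None:
            d = weighted_distance(prev, mix, weights)
            threshold = percentile(history, pct) if history else float("inf")
            flags.append(d > threshold)
            history.append(d)
        prev = mix
    return flags
```

A window with a stable mix of INFO and WARN events yields near-zero distances; a window suddenly dominated by a new event type produces a large distance that exceeds the percentile threshold and is flagged.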