In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behaviour. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.
Anomaly detection finds application in many domains including cyber security, medicine, machine vision, statistics, neuroscience, law enforcement and financial fraud to name only a few. Anomalies were initially searched for clear rejection or omission from the data to aid statistical analysis, for example to compute the mean or standard deviation. They were also removed to better predictions from models such as linear regression, and more recently their removal aids the performance of machine learning algorithms. However, in many applications anomalies themselves are of interest and are the observations most desirous in the entire data set, which need to be identified and separated from noise or irrelevant outliers.
Three broad categories of anomaly detection techniques exist.Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involves training a classifier. However, this approach is rarely used in anomaly detection due to the general unavailability of labelled data and the inherent unbalanced nature of the classes. Semi-supervised anomaly detection techniques assume that some portion of the data is labelled. This may be any combination of the normal or anomalous data, but more often than not the techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance to be generated by the model. Unsupervised anomaly detection techniques assume the data is unlabelled and are by far the most commonly used due to their wider and relevant application.
Many attempts have been made in the statistical and computer science communities to define an anomaly. The most prevalent ones include:
An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.
Anomalies are instances or collections of data that occur very rarely in the data set and whose features differ significantly from most of the data.
An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.
An anomaly is a point or collection of points that is relatively distant from other points in multi-dimensional space of features.
Anomalies are patterns in data that do not conform to a well defined notion of normal behaviour.
Let T be observations from a univariate Gaussian distribution and O a point from T. Then the z-score for O is greater than a pre-selected threshold if and only O is an outlier.
Anomaly detection is applicable in a very large number and variety of domains, and is an important subarea of unsupervised machine learning. As such it has applications in cyber-security intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, detecting ecosystem disturbances, defect detection in images using machine vision, medical diagnosis and law enforcement.
It is often used in preprocessing to remove anomalous data from the dataset. This is done for a number of reasons. Statistics of data such as the mean and standard deviation are more accurate after the removal of anomalies, and the visualisation of data can also be improved. In supervised learning, removing the anomalous data from the dataset often results in a statistically significant increase in accuracy. Anomalies are also often the most important observations in the data to be found such as in intrusion detection or detecting abnormalities in medical images.
Many anomaly detection techniques have been proposed in literature. Some of the popular techniques are:
The performance of methods depends on the data set and parameters, and methods have little systematic advantages over another when compared across many data sets and parameters.
Application to data securityEdit
Anomaly detection was proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986. Anomaly detection for IDS is normally accomplished with thresholds and statistics, but can also be done with soft computing, and inductive learning. Types of statistics proposed by 1999 included profiles of users, workstations, networks, remote hosts, groups of users, and programs based on frequencies, means, variances, covariances, and standard deviations. The counterpart of anomaly detection in intrusion detection is misuse detection.
In data pre-processingEdit
In supervised learning, anomaly detection is often an important step in data pre-processing to provide the learning algorithm a proper dataset to learn on. This is also known as Data cleansing. After detecting anomalous samples classifiers remove them, however, at times corrupted data can still provide useful samples for learning. A common method for finding appropriate samples to use is identifying Noisy data. One approach to find noisy values is to create a probabilistic model from data using models of uncorrupted data and corrupted data.
Below is an example of the Iris flower data set with an anomaly added. With an anomaly included, classification algorithm may have difficulties properly finding patterns, or run into errors.
Fischer's Iris Data with an Anomaly
By removing the anomaly, training will be enabled to find patterns in classifications more easily.
In data mining, high-dimensional data will also propose high computing challenges with intensely large sets of data. By removing numerous samples that can find itself irrelevant to a classifier or detection algorithm, runtime can be significantly reduced on even the largest sets of data.
ELKI is an open-source Java data mining toolkit that contains several anomaly detection algorithms, as well as index acceleration for them.
PyOD is an open-source Python library developed specifically for anomaly detection.
Scikit-Learn is an open-source Python library that has built functionality to provide unsupervised anomaly detection.
^Smith, M. R.; Martinez, T. (2011). "Improving classification accuracy by identifying and removing instances that should be misclassified" (PDF). The 2011 International Joint Conference on Neural Networks. p. 2690. CiteSeerX10.1.1.221.1371. doi:10.1109/IJCNN.2011.6033571. ISBN 978-1-4244-9635-8. S2CID 5809822.
^Zimek, Arthur; Filzmoser, Peter (2018). "There and back again: Outlier detection between statistical reasoning and data mining algorithms" (PDF). Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 8 (6): e1280. doi:10.1002/widm.1280. ISSN 1942-4787.
^Knorr, E. M.; Ng, R. T.; Tucakov, V. (2000). "Distance-based outliers: Algorithms and applications". The VLDB Journal the International Journal on Very Large Data Bases. 8 (3–4): 237–253. CiteSeerX10.1.1.43.1842. doi:10.1007/s007780050006. S2CID 11707259.
^Ramaswamy, S.; Rastogi, R.; Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. Proceedings of the 2000 ACM SIGMOD international conference on Management of data – SIGMOD '00. p. 427. doi:10.1145/342009.335437. ISBN 1-58113-217-4.
^Angiulli, F.; Pizzuti, C. (2002). Fast Outlier Detection in High Dimensional Spaces. Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science. Vol. 2431. p. 15. doi:10.1007/3-540-45681-3_2. ISBN 978-3-540-44037-6.
^Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; Sander, J. (2000). LOF: Identifying Density-based Local Outliers(PDF). Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. SIGMOD. pp. 93–104. doi:10.1145/335191.335388. ISBN 1-58113-217-4.
^Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (December 2008). Isolation Forest. 2008 Eighth IEEE International Conference on Data Mining. pp. 413–422. doi:10.1109/ICDM.2008.17. ISBN 9780769535029. S2CID 6505449.
^Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (March 2012). "Isolation-Based Anomaly Detection". ACM Transactions on Knowledge Discovery from Data. 6 (1): 1–39. doi:10.1145/2133360.2133363. S2CID 207193045.
^Schubert, E.; Zimek, A.; Kriegel, H. -P. (2012). "Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection". Data Mining and Knowledge Discovery. 28: 190–237. doi:10.1007/s10618-012-0300-z. S2CID 19036098.
^Kriegel, H. P.; Kröger, P.; Schubert, E.; Zimek, A. (2009). Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data. Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science. Vol. 5476. p. 831. doi:10.1007/978-3-642-01307-2_86. ISBN 978-3-642-01306-5.
^Kriegel, H. P.; Kroger, P.; Schubert, E.; Zimek, A. (2012). Outlier Detection in Arbitrarily Oriented Subspaces. 2012 IEEE 12th International Conference on Data Mining. p. 379. doi:10.1109/ICDM.2012.21. ISBN 978-1-4673-4649-8.
^Fanaee-T, H.; Gama, J. (2016). "Tensor-based anomaly detection: An interdisciplinary survey". Knowledge-Based Systems. 98: 130–147. doi:10.1016/j.knosys.2016.01.027.
^Zimek, A.; Schubert, E.; Kriegel, H.-P. (2012). "A survey on unsupervised outlier detection in high-dimensional numerical data". Statistical Analysis and Data Mining. 5 (5): 363–387. doi:10.1002/sam.11161.
^ abcHawkins, Simon; He, Hongxing; Williams, Graham; Baxter, Rohan (2002). "Outlier Detection Using Replicator Neural Networks". Data Warehousing and Knowledge Discovery. Lecture Notes in Computer Science. Vol. 2454. pp. 170–180. CiteSeerX10.1.1.12.3366. doi:10.1007/3-540-46145-0_17. ISBN 978-3-540-44123-6.
^J. An and S. Cho, "Variational autoencoder based anomaly detection using reconstruction probability", 2015.
^Malhotra, Pankaj; Vig, Lovekesh; Shroff, Gautman; Agarwal, Puneet (22–24 April 2015). Long Short Term Memory Networks for Anomaly Detection in Time Series. European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium).
^He, Z.; Xu, X.; Deng, S. (2003). "Discovering cluster-based local outliers". Pattern Recognition Letters. 24 (9–10): 1641–1650. CiteSeerX10.1.1.20.4242. doi:10.1016/S0167-8655(03)00003-5.
^Campello, R. J. G. B.; Moulavi, D.; Zimek, A.; Sander, J. (2015). "Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection". ACM Transactions on Knowledge Discovery from Data. 10 (1): 5:1–51. doi:10.1145/2733381. S2CID 2887636.
^Lazarevic, A.; Kumar, V. (2005). Feature bagging for outlier detection. Proc. 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. pp. 157–166. CiteSeerX10.1.1.399.425. doi:10.1145/1081870.1081891. ISBN 978-1-59593-135-1. S2CID 2054204.
^Nguyen, H. V.; Ang, H. H.; Gopalkrishnan, V. (2010). Mining Outliers with Ensemble of Heterogeneous Detectors on Random Subspaces. Database Systems for Advanced Applications. Lecture Notes in Computer Science. Vol. 5981. p. 368. doi:10.1007/978-3-642-12026-8_29. ISBN 978-3-642-12025-1.
^Schubert, E.; Wojdanowski, R.; Zimek, A.; Kriegel, H. P. (2012). On Evaluation of Outlier Rankings and Outlier Scores. Proceedings of the 2012 SIAM International Conference on Data Mining. pp. 1047–1058. doi:10.1137/1.9781611972825.90. ISBN 978-1-61197-232-0.
^Zimek, A.; Campello, R. J. G. B.; Sander, J. R. (2014). "Ensembles for unsupervised outlier detection". ACM SIGKDD Explorations Newsletter. 15: 11–22. doi:10.1145/2594473.2594476. S2CID 8065347.
^Zimek, A.; Campello, R. J. G. B.; Sander, J. R. (2014). Data perturbation for outlier detection ensembles. Proceedings of the 26th International Conference on Scientific and Statistical Database Management – SSDBM '14. p. 1. doi:10.1145/2618243.2618257. ISBN 978-1-4503-2722-0.
^Campos, Guilherme O.; Zimek, Arthur; Sander, Jörg; Campello, Ricardo J. G. B.; Micenková, Barbora; Schubert, Erich; Assent, Ira; Houle, Michael E. (2016). "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study". Data Mining and Knowledge Discovery. 30 (4): 891. doi:10.1007/s10618-015-0444-8. ISSN 1384-5810. S2CID 1952214.
^Teng, H. S.; Chen, K.; Lu, S. C. (1990). Adaptive real-time anomaly detection using inductively generated sequential patterns(PDF). Proceedings of the IEEE Computer Society Symposium on Research in Security and Privacy. pp. 278–284. doi:10.1109/RISP.1990.63857. ISBN 978-0-8186-2060-7. S2CID 35632142.
^Jones, Anita K.; Sielken, Robert S. (1999). "Computer System Intrusion Detection: A Survey". Technical Report, Department of Computer Science, University of Virginia, Charlottesville, VA. CiteSeerX10.1.1.24.7802.
^Kubica, J.; Moore, A. (2003). "Probabilistic noise identification and data cleaning". Third IEEE International Conference on Data Mining. IEEE Comput. Soc: 131–138. doi:10.1109/icdm.2003.1250912. ISBN 0-7695-1978-4.
^Zhao, Yue; Nasrullah, Zain; Li, Zheng (2019). "Pyod: A python toolbox for scalable outlier detection". Journal of Machine Learning Research.