Top-k Outlier Detection from Uncertain Data

Salman Ahmed Shaikh Hiroyuki

Salman Ahmed, Shaikh Hiroyuki. Top-k Outlier Detection from Uncertain Data[J]. 国际自动化与计算杂志(英)/International Journal of Automation and Computing, 2014, 11(2): 128-142. doi: 10.1007/s11633-014-0775-8
引用本文: Salman Ahmed, Shaikh Hiroyuki. Top-k Outlier Detection from Uncertain Data[J]. 国际自动化与计算杂志(英)/International Journal of Automation and Computing, 2014, 11(2): 128-142. doi: 10.1007/s11633-014-0775-8
Salman Ahmed and Shaikh Hiroyuki. Top-k Outlier Detection from Uncertain Data. International Journal of Automation and Computing, vol. 11, no. 2, pp. 128-142, 2014. doi: 10.1007/s11633-014-0775-8
Citation: Salman Ahmed and Shaikh Hiroyuki. Top-k Outlier Detection from Uncertain Data. International Journal of Automation and Computing, vol. 11, no. 2, pp. 128-142, 2014. doi: 10.1007/s11633-014-0775-8

Top-k Outlier Detection from Uncertain Data

doi: 10.1007/s11633-014-0775-8
基金项目: 

This work was partly supported by Grant-in-Aid for Scientific Research(A)(#24240015A).

详细信息
    作者简介:

    Hiroyuki Kitagawa received his B.Sc. degree in physics and his M.Sc. and Dr.Sc. degrees in computer science, all from the University of Tokyo, in 1978, 1980 and 1987, respectively. He is currently a full professor at Faculty of Engineering, Information and Systems and at Center for Computational Sciences, University of Tsukuba, Japan. He is a member of ACM, IEEE Computer Society, the Database Society of Japan, IEICE, IPSJ, and JSSST. He is now vice president of the Database Society of Japan, an IEICE Fellow, an IPSJ Fellow, and an associate member of the Science Council of Japan. His research interests include data integration, stream processing, data intensive computing, data mining, and information retrieval. E-mail: kitagawa@cs.tsukuba.ac.jp

Top-k Outlier Detection from Uncertain Data

Funds: 

This work was partly supported by Grant-in-Aid for Scientific Research(A)(#24240015A).

  • 摘要: Uncertain data are common due to the increasing usage of sensors, radio frequency identification (RFID), GPS and similar devices for data collection. The causes of uncertainty include limitations of measurements, inclusion of noise, inconsistent supply voltage and delay or loss of data in transfer. In order to manage, query or mine such data, data uncertainty needs to be considered. Hence, this paper studies the problem of top-k distance-based outlier detection from uncertain data objects. In this work, an uncertain object is modelled by a probability density function of a Gaussian distribution. The naive approach of distance-based outlier detection makes use of nested loop. This approach is very costly due to the expensive distance function between two uncertain objects. Therefore, a populated-cells list (PC-list) approach of outlier detection is proposed. Using the PC-list, the proposed top-k outlier detection algorithm needs to consider only a fraction of dataset objects and hence quickly identifies candidate objects for top-k outliers. Two approximate top-k outlier detection algorithms are presented to further increase the efficiency of the top-k outlier detection algorithm. An extensive empirical study on synthetic and real datasets is also presented to prove the accuracy, efficiency and scalability of the proposed algorithms.
  • [1] A. Elías, A. Ochoa-Zezzatti, A. Padilla, J. Ponce. Outlier analysis for plastic card fraud detection a hybridized and multi-objective approach. Hybrid Artificial Intelligent Systems, Berlin, Heidelberg: Springer, pp.1-9, 2011.
    [2] M. V. Mahoney, P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In Proceedings of the 3rd IEEE International Conference on Data Mining, IEEE, Melbourne, FL, USA, pp.601-604, 2003.
    [3] G. Manson, G. Pierce, K. Worden. On the long-term stability of normal condition for damage detection in a composite panel. Key Engineering Materials, vol.204-205, pp.359-370, 2001.
    [4] H. Garces, D. Sbarbaro. Outliers detection in environmental monitoring databases. Engineering Applications of Artificial Intelligence, vol.24, no.2, pp.341-349, 2011.
    [5] N. Alaydie, F. Fotouhi, C. K. Reddy, H. Soltanian-Zadeh. Noise and outlier filtering in heterogeneous medical data sources. In Proceedings of Workshops on Database and Expert Systems Applications, IEEE, Bilbao, Spain, pp.115-119, 2010.
    [6] D. M. Hawkins. Identification of Outliers, London: Chapman and Hall, 1980.
    [7] V. Barnett, T. Lewis. Outliers in Statistical Data, New York: Wiley, 1994.
    [8] O. Z. Maimon, L. Rokach. Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Norwell: Kluwer Academic, 2005.
    [9] C. C. Aggarwal. Outlier Analysis, New York: Springer-Verlag, 2013.
    [10] E. M. Knorr, R. T. Ng, V. Tucakov. Distance-based outliers: Algorithms and applications. The VLDB Journal, vol.8, no.3-4, pp.237-253, 2000.
    [11] E. M. Knorr, R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of 24th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp.392-403, 1998.
    [12] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the 19th International Conference on Data Engineering, IEEE, Bangalore, India, pp.315-326, 2003.
    [13] V. Kumar. Parallel and distributed computing for cybersecurity. IEEE Distributed Systems Online, vol.6, no.10, pp.1-9, 2005.
    [14] G. H. Orair, C. H. C. Teixeira, W. Meira, Y. Wang, S. Parthasarathy. Distance-based outlier detection: Consolidation and renewed bearing. In Proceedings of the VLDB Endowment, vol.3, no.1-2, pp.1469-1480, 2010.
    [15] B. Wang, G. Xiao, H. Yu, X. C. Yang. Distance-based outlier detection on uncertain data. In Proceedings of the 9th IEEE International Conference on Computer and Information Technology, IEEE, Xiamen, China, pp.293-298, 2009.
    [16] C. Zhu, H. Kitagawa, S. Papadimitriou, C. Faloutsos. Outlier detection by example. Journal of Intelligent Information Systems, vol.36, no.2, pp.217-247, 2011.
    [17] A. B. Sharma, L. Golubchik, R. Govindan. Sensor faults: Detection methods and prevalence in real-world datasets. ACM Transactions on Sensor Networks, vol.6, no.3, pp.1-39, 2010.
    [18] I. Helm, L. Jalukse, I. Leito. Measurement uncertainty estimation in amperometric sensors: A tutorial review. Sensors, vol.10, no.5, pp.4430-4455, 2010.
    [19] Y. Diao, B. D. Li, A. N. Liu, L. P. Peng, C. Sutton, T. Tran, M. Zink. Capturing data uncertainty in high-volume stream processing. In Proceedings of the 4th Biennial Conference on Innovative Data Systems Research, Asilomar, California, USA, 2009.
    [20] A. A. Omer, J. P. Thomas, L. Zhu. Mutual authentication protocols for RFID systems. International Journal of Automation and Computing, vol.5, no.4, pp.348-365, 2008.
    [21] J. Nievergelt, H. Hinterberger, K. C. Sevick. The Grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, vol.9, no.1, 38-71, 1984.
    [22] S. Ramaswamy, R. Rastogi, K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, pp.427-438, 2000.
    [23] F. Angiulli, C. Pizzuti. Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference, PKDD 2002, Springer, Helsinki, Finland, pp.15-26, 2002.
    [24] F. Angiulli, F. Fassetti. Detecting distance-based outliers in streams of data. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, ACM, New York, NY, USA, pp.811-820, 2007.
    [25] M. Kontaki, A. Gounaris, A. N. Papadopoulos, K. Tsichlas, Y. Manolopoulos. Continuous monitoring of distance-based outliers over data streams. In Proceedings of the 27th IEEE International Conference on Data Engineering, IEEE, Hannover, pp.135-146, 2011.
    [26] K. Ishida, H. Kitagawa. Detecting current outliers: Continuous outlier detection over time-series data streams. In Proceedings of the 19th International Conference Database and Expert Systems Applications, Springer, Berlin, Heidelberg, pp.255-268, 2008.
    [27] C. C. Aggarwal, P. S. Yu. Outlier detection with uncertain data. In Proceedings of the SIAM International Conference on Data Mining, pp.483-493, 2008.
    [28] S. A. Shaikh, H. Kitagawa. Distance-based outlier detection on uncertain data of Gaussian distribution. In Proceedings of the 14th Asia-Pacific International Conference on Web Technologies and Applications, Springer-Verlag, Berlin, Heidelberg, pp.109-121, 2012.
    [29] S. A. Shaikh, H. Kitagawa. Efficient distance-based outlier detection on uncertain datasets of Gaussian distribution. World Wide Web, 2013. (Online first).
    [30] S. A. Shaikh, H. Kitagawa. Fast top-k distance-based outlier detection on uncertain data. In Proceedings of the 14th International Conference on Web-age Information Management, Springer, Berlin, Heidelberg, pp.301-313, 2013.
    [31] M. M. Breunig, H. P. Kriegel, R. T. Ng, J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, pp.93-104, 2000.
    [32] Z. Y. He, X. F. Xu, S. C. Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, vol.24, no.9-10, pp.1641-1650, 2003.
    [33] B. Jiang, J. Pei. Outlier detection on uncertain data: Objects, instances, and inferences. In Proceedings of the 27th IEEE International Conference on Data Engineering, IEEE, Hannover, pp.422-433, 2011.
    [34] P. Bajorski. Statistics for Imaging, Optics, and Photonics, New York: John Wiley & Sons Publication, 2012.
    [35] F. Pukelsheim. The three sigma rule. The American Statistician, vol.48, no.2, pp.88-91, 1994.
    [36] Y. F. Tao, X. K. Xiao, R. Cheng. Range search on multidimensional uncertain data. ACM Transactions on Database Systems, vol.32, no.3, pp.1-54, 2007.
    [37] W. J. Thistleton, J. A. Marsh, K. Nelson, C. Tsallis. Generalized Box-Müller method for generating q-Gaussian random deviates. IEEE Transactions on Information Theory, vol.53, no.12, pp.4805-4810, 2007.
  • [1] Xi-Wen Wu, Chen Li, Xuan Wang, Hong-Ji Yang.  A Creative Approach to Reducing Ambiguity in Scenario-based Software Architecture Analysis . International Journal of Automation and Computing, 2019, 16(2): 248-260. doi: 10.1007/s11633-017-1102-y
    [2] Chen Li, Hong-Ji Yang, Hua-Xiao Liu.  An Approach to Modelling and Analysing Reliability of Breeze/ADL-based Software Architecture . International Journal of Automation and Computing, 2017, 14(3): 275-284. doi: 10.1007/s11633-016-1044-9
    [3] S. Nagaraju, Manish Kashyap, Mahua Bhattachraya.  An Effective Density Based Approach to Detect Complex Data Clusters Using Notion of Neighborhood Difference . International Journal of Automation and Computing, 2017, 14(1): 57-67. doi: 10.1007/s11633-016-1038-7
    [4] Hfaïedh Kaïs, Dahech Karim, Damak Tarak.  A Sliding Mode Observer for Uncertain Nonlinear Systems Based on Multiple Models Approach . International Journal of Automation and Computing, 2017, 14(2): 202-212. doi: 10.1007/s11633-016-0970-x
    [5] Dong-Feng Fang Zhou Su Qi-Chao Xu Ze-Jun Xu.  Multi-characteristics Based Data Scheduling Over the Smart Grid . International Journal of Automation and Computing, 2016, 13(2): 151-158. doi: 10.1007/s11633-016-0959-5
    [6] He Deng, Huang Yao, Yan-Tao Wei, Ji-Dong Chen.  A Parzen Window-based Approach to Detection of Infrared Dim Target Under Sea-sky Background . International Journal of Automation and Computing, 2016, 13(5): 447-456. doi: 10.1007/s11633-016-0976-4
    [7] Hui Guan, Hongji Yang, Jun Wang.  An Ontology-based Approach to Security Pattern Selection . International Journal of Automation and Computing, 2016, 13(2): 168-182. doi: 10.1007/s11633-016-0950-1
    [8] Jia-Can Geng, Zhe Cui, Xing-Sheng Gu.  Scatter Search Based Particle Swarm Optimization Algorithm for Earliness/Tardiness Flowshop Scheduling with Uncertainty . International Journal of Automation and Computing, 2016, 13(3): 285-295. doi: 10.1007/s11633-016-0964-8
    [9] Zhen-Guo Liu, Jin-Ming Huang.  A New Adaptive Tracking Control Approach for Uncertain Flexible Joint Robot System . International Journal of Automation and Computing, 2015, 12(5): 559-566. doi: 10.1007/s11633-015-0898-6
    [10] Quan Quan, Kai-Yuan Cai.  Repetitive Control for TORA Benchmark: An Additive-state-decomposition-based Approach . International Journal of Automation and Computing, 2015, 12(3): 289-296. doi: 10.1007/s11633-015-0885-y
    [11] Zeineb Lassoued, Kamel Abderrahim.  New Results on PWARX Model Identification Based on Clustering Approach . International Journal of Automation and Computing, 2014, 11(2): 180-188. doi: 10.1007/s11633-014-0779-4
    [12] L. Balaji, K. K. Thyagharajan.  H.264/SVC Mode Decision Based on Mode Correlation and Desired Mode List . International Journal of Automation and Computing, 2014, 11(5): 510-516. doi: 10.1007/s11633-014-0830-5
    [13] Lei Liu, Feng Yang, Peng Zhang, Jing-Yi Wu, Liang Hu.  SVM-based Ontology Matching Approach . International Journal of Automation and Computing, 2012, 9(3): 306-314. doi: 10.1007/s11633-012-0649-x
    [14] Qi Wang,  Zhong-Wei Ren,  Zhong-Feng Guo.  XML-based Data Processing in Network Supported Collaborative Design . International Journal of Automation and Computing, 2010, 7(3): 330-335. doi: 10.1007/s11633-010-0511-y
    [15] Yang Yi, Hong Shen, Lei Guo.  Statistic PID Tracking Control for Non-Gaussian Stochastic Systems Based on T-S Fuzzy Model . International Journal of Automation and Computing, 2009, 6(1): 81-87. doi: 10.1007/s11633-009-0081-z
    [16] Francisco Flórez-Revuelta, José Manuel Casado-Díaz, Lucas Martínez-Bernabeu.  An Evolutionary Approach to the Delineation of Functional Areas Based on Travel-to-work Flows . International Journal of Automation and Computing, 2008, 5(1): 10-21. doi: 10.1007/s11633-008-0010-6
    [17] Sing Kiong Nguang, Ping Zhang, Steven X. Ding.  Parity Relation Based Fault Estimation for Nonlinear Systems: An LMI Approach . International Journal of Automation and Computing, 2007, 4(2): 164-168. doi: 10.1007/s11633-007-0164-7
    [18] Marek Kowal,  Józef Korbicz.  Fault Detection under Fuzzy Model Uncertainty . International Journal of Automation and Computing, 2007, 4(2): 117-124. doi: 10.1007/s11633-007-0117-1
    [19] Ping Zhang,  Steven X. Ding.  A Model-free Approach to Fault Detection of Continuous-time Systems Based on Time Domain Data . International Journal of Automation and Computing, 2007, 4(2): 189-194. doi: 10.1007/s11633-007-0189-y
    [20] Xiao-Feng Wu,  Zi-Qiang Lang,  Stephen A Billings.  An Orthogonal Least Squares Based Approach to FIR Designs . International Journal of Automation and Computing, 2005, 2(2): 163-170. doi: 10.1007/s11633-005-0163-5
  • 加载中
计量
  • 文章访问数:  6121
  • HTML全文浏览量:  23
  • PDF下载量:  2118
  • 被引次数: 0
出版历程
  • 收稿日期:  2013-07-31
  • 修回日期:  2013-11-11

Top-k Outlier Detection from Uncertain Data

doi: 10.1007/s11633-014-0775-8
    基金项目:

    This work was partly supported by Grant-in-Aid for Scientific Research(A)(#24240015A).

    作者简介:

    Hiroyuki Kitagawa received his B.Sc. degree in physics and his M.Sc. and Dr.Sc. degrees in computer science, all from the University of Tokyo, in 1978, 1980 and 1987, respectively. He is currently a full professor at Faculty of Engineering, Information and Systems and at Center for Computational Sciences, University of Tsukuba, Japan. He is a member of ACM, IEEE Computer Society, the Database Society of Japan, IEICE, IPSJ, and JSSST. He is now vice president of the Database Society of Japan, an IEICE Fellow, an IPSJ Fellow, and an associate member of the Science Council of Japan. His research interests include data integration, stream processing, data intensive computing, data mining, and information retrieval. E-mail: kitagawa@cs.tsukuba.ac.jp

摘要: Uncertain data are common due to the increasing usage of sensors, radio frequency identification (RFID), GPS and similar devices for data collection. The causes of uncertainty include limitations of measurements, inclusion of noise, inconsistent supply voltage and delay or loss of data in transfer. In order to manage, query or mine such data, data uncertainty needs to be considered. Hence, this paper studies the problem of top-k distance-based outlier detection from uncertain data objects. In this work, an uncertain object is modelled by a probability density function of a Gaussian distribution. The naive approach of distance-based outlier detection makes use of nested loop. This approach is very costly due to the expensive distance function between two uncertain objects. Therefore, a populated-cells list (PC-list) approach of outlier detection is proposed. Using the PC-list, the proposed top-k outlier detection algorithm needs to consider only a fraction of dataset objects and hence quickly identifies candidate objects for top-k outliers. Two approximate top-k outlier detection algorithms are presented to further increase the efficiency of the top-k outlier detection algorithm. An extensive empirical study on synthetic and real datasets is also presented to prove the accuracy, efficiency and scalability of the proposed algorithms.

English Abstract

Salman Ahmed, Shaikh Hiroyuki. Top-k Outlier Detection from Uncertain Data[J]. 国际自动化与计算杂志(英)/International Journal of Automation and Computing, 2014, 11(2): 128-142. doi: 10.1007/s11633-014-0775-8
引用本文: Salman Ahmed, Shaikh Hiroyuki. Top-k Outlier Detection from Uncertain Data[J]. 国际自动化与计算杂志(英)/International Journal of Automation and Computing, 2014, 11(2): 128-142. doi: 10.1007/s11633-014-0775-8
Salman Ahmed and Shaikh Hiroyuki. Top-k Outlier Detection from Uncertain Data. International Journal of Automation and Computing, vol. 11, no. 2, pp. 128-142, 2014. doi: 10.1007/s11633-014-0775-8
Citation: Salman Ahmed and Shaikh Hiroyuki. Top-k Outlier Detection from Uncertain Data. International Journal of Automation and Computing, vol. 11, no. 2, pp. 128-142, 2014. doi: 10.1007/s11633-014-0775-8
参考文献 (37)

目录

    /

    返回文章
    返回