Volume 17 Number 5
September 2020
Cite as: Ziheng Chen, Hongshik Ahn. Item Response Theory Based Ensemble in Machine Learning. International Journal of Automation and Computing, 2020, 17(5): 621-636. doi: 10.1007/s11633-020-1239-y

Item Response Theory Based Ensemble in Machine Learning

Author Biography:
  • Ziheng Chen received the B.Sc. degree in statistics from Renmin University of China, China in 2016. He is currently a Ph.D. candidate in the Department of Applied Mathematics and Statistics, Stony Brook University, USA. His research interests include reinforcement learning, recommender systems, tree-structured models and ensemble learning theory. E-mail: ziheng.chen@stonybrook.edu (Corresponding author). ORCID iD: 0000-0002-2585-637X

    Hongshik Ahn received the B.Sc. degree in mathematics from Seoul National University, South Korea, and the Ph.D. degree in statistics from the University of Wisconsin-Madison, USA in 1992. From 1992 to 1996, he was a mathematical statistician at the National Center for Toxicological Research, U.S. Food and Drug Administration, and he has been a faculty member in the Department of Applied Mathematics and Statistics at Stony Brook University, USA since 1996. He served as the first Vice President of SUNY Korea for two years from 2012. Currently, he is a professor at Stony Brook University. He has published 2 books, 3 book chapters, over 70 papers in peer-reviewed journals, and 25 conference papers. His research interests include classification of high-dimensional data, tree-structured regression modeling, survival analysis, and multi-step batch testing for infectious diseases. E-mail: hongshik.ahn@stonybrook.edu

  • Received: 2020-02-16
  • Accepted: 2020-06-04
  • Published Online: 2020-09-09
  • [1] Z. H. Zhou. Ensemble learning. Encyclopedia of Biometrics, S. Z. Li, Ed., Berlin, Germany: Springer, pp. 411–416, 2009.
    [2] L. Lam, S. Y. Suen.  Application of majority voting to pattern recognition: an analysis of its behavior and performance[J]. IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, 1997, 27(5): 553-568. doi: 10.1109/3468.618255
    [3] A. F. R. Rahman, H. Alam, M. C. Fairhurst. Multiple classifier combination for character recognition: revisiting the majority voting system and its variations. In Proceedings of the 5th International Workshop on Document Analysis Systems, pp. 167–178, Springer, Princeton, USA, 2002.
    [4] H. Kim, H. Kim, H. Moon, H. Ahn.  A weight-adjusted voting algorithm for ensembles of classifiers[J]. Journal of the Korean Statistical Society, 2011, 40(4): 437-449. doi: 10.1016/j.jkss.2011.03.002
    [5] S. E. Embretson, S. P. Reise. Item Response Theory, New York, USA: Psychology Press, 2013.
    [6] F. Martínez-Plumed, R. B. C. Prudêncio, A. Martínez-Usó, J. Hernández-Orallo.  Item response theory in AI: Analysing machine learning classifiers at the instance level[J]. Artificial Intelligence, 2019, 271: 18-42. doi: 10.1016/j.artint.2018.09.004
    [7] L. Breiman.  Bagging predictors[J]. Machine Learning, 1996, 24(2): 123-140. doi: 10.1007/BF00058655
    [8] I. Gandhi, M. Pandey. Hybrid ensemble of classifiers using voting. In Proceedings of International Conference on Green Computing and Internet of Things, IEEE, Noida, India, pp. 399–404, 2015. DOI: 10.1109/ICGCIoT.2015.7380496.
    [9] A. Rojarath, W. Songpan, C. Pong-Inwong. Improved ensemble learning for classification techniques based on majority voting. In Proceedings of the 7th IEEE International Conference on Software Engineering and Service Science, IEEE, Beijing, China, pp. 107–110, 2016. DOI: 10.1109/ICSESS.2016.7883026.
    [10] C. Cornelio, M. Donini, A. Loreggia, M. S. Pini, F. Rossi. Voting with random classifiers (vorace). arXiv: 1909.08996, 2019. https://arxiv.org/abs/1909.08996.
    [11] X. B. Liu, Z. T. Liu, G. J. Wang, Z. H. Cai, H. Zhang.  Ensemble transfer learning algorithm[J]. IEEE Access, 2017, 6: 2389-2396. doi: 10.1109/ACCESS.2017.2782884
    [12] S. J. Winham, R. R. Freimuth, J. M. Biernacka.  A weighted random forests approach to improve predictive performance[J]. Statistical Analysis and Data Mining, 2013, 6(6): 496-505. doi: 10.1002/sam.11196
    [13] Y. C. Chen, H. Ahn, J. J. Chen.  High-dimensional canonical forest[J]. Journal of Statistical Computation and Simulation, 2017, 87(5): 845-854. doi: 10.1080/00949655.2016.1231191
    [14] H. F. Zhou, X. Z. Zhao, X. Wang.  An effective ensemble pruning algorithm based on frequent patterns[J]. Knowledge-Based Systems, 2014, 56: 79-85. doi: 10.1016/j.knosys.2013.10.024
    [15] Y. Zhang, S. Burer, W. N. Street.  Ensemble pruning via semidefinite programming[J]. Journal of Machine Learning Research, 2006, 7(1): 1315-1338.
    [16] L. I. Kuncheva, J. J. Rodríguez.  A weighted voting framework for classifiers ensembles[J]. Knowledge and Information Systems, 2014, 38(2): 259-275. doi: 10.1007/s10115-012-0586-6
    [17] A. Kabir, C. Ruiz, S. A. Alvarez. Mixed bagging: a novel ensemble learning framework for supervised classification based on instance hardness. In Proceedings of IEEE International Conference on Data Mining, IEEE, Singapore, Singapore, pp. 1073–1078, 2018. DOI: 10.1109/ICDM.2018.00137.
    [18] L. V. Utkin, M. S. Kovalev, A. A. Meldo.  A deep forest classifier with weights of class probability distribution subsets[J]. Knowledge-Based Systems, 2019, 173: 15-27. doi: 10.1016/j.knosys.2019.02.022
    [19] H. Reddy, N. Raj, M. Gala, A. Basava.  Text-mining-based fake news detection using ensemble methods[J]. International Journal of Automation and Computing, 2020, 17(2): 210-221. doi: 10.1007/s11633-019-1216-5
    [20] W. G. Yi, J. Duan, M. Y. Lu.  Double-layer Bayesian classifier ensembles based on frequent itemsets[J]. International Journal of Automation and Computing, 2012, 9(2): 215-220. doi: 10.1007/s11633-012-0636-2
    [21] G. Wang, J. X. Hao, J. Ma, H. B. Jiang.  A comparative assessment of ensemble learning for credit scoring[J]. Expert Systems with Applications, 2011, 38(1): 223-230. doi: 10.1016/j.eswa.2010.06.048
    [22] F. Martínez-Plumed, R. B. Prudêncio, A. Martínez-Usó, J. Hernández-Orallo. Making sense of item response theory in machine learning. In Proceedings of the 22nd European Conference on Artificial Intelligence, IOS Press, The Hague, The Netherlands, pp. 1140–1148, 2016. DOI: 10.3233/978-1-61499-672-9-1140.
    [23] C. Zanon, C. S. Hutz, H. Yoo, R. K. Hambleton.  An application of item response theory to psychological test development[J]. Psicologia: Reflexão e Crítica, 2016, 29(1): 18. doi: 10.1186/s41155-016-0040-x
    [24] H. L. Fu, G. Manogaran, K. Wu, M. Cao, S. Jiang, A. M. Yang.  Intelligent decision-making of online shopping behavior based on internet of things[J]. International Journal of Information Management, 2020, 50: 515-525. doi: 10.1016/j.ijinfomgt.2019.03.010
    [25] W. R. Gilks, S. Richardson, D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Boca Raton, USA: Chapman & Hall, CRC, 1995.
    [26] Y. Chen, T. S. Filho, R. B. C. Prudencio, T. Diethe, P. Flach. β3-IRT: a new item response model and its applications. arXiv: 1903.04016, 2019. https://arxiv.org/abs/1903.04016.
    [27] B. W. Junker, R. J. Patz, N. M. VanHoudnos. Markov chain Monte Carlo for item response models. Handbook of Item Response Theory, Volume Two: Statistical Tools, W. J. van der Linden, Ed., Boca Raton, USA: Chapman and Hall, CRC, pp. 271–325, 2016.
    [28] J. S. Kim, D. M. Bolt.  Estimating item response theory models using Markov chain Monte Carlo methods[J]. Educational Measurement: Issues and Practice, 2007, 26(4): 38-51. doi: 10.1111/j.1745-3992.2007.00107.x
    [29] M. A. Tanner, W. H. Wong.  The calculation of posterior distributions by data augmentation[J]. Journal of the American Statistical Association, 1987, 82(398): 528-540. doi: 10.1080/01621459.1987.10478458
    [30] J. H. Albert.  Bayesian estimation of normal ogive item response curves using Gibbs sampling[J]. Journal of Educational Statistics, 1992, 17(3): 251-269. doi: 10.3102/10769986017003251
    [31] Y. Y. Sheng.  Markov chain Monte Carlo estimation of normal ogive IRT models in matlab[J]. Journal of Statistical Software, 2008, 25(8): 1-15. doi: 10.18637/jss.v025.i08
    [32] Y. Y. Sheng.  Bayesian estimation of the four-parameter IRT model using Gibbs sampling[J]. International Journal of Quantitative Research in Education, 2015, 2(3–4): 194-212. doi: 10.1504/IJQRE.2015.071736
    [33] Y. Noel, B. Dauvier.  A beta item response model for continuous bounded responses[J]. Applied Psychological Measurement, 2007, 31(1): 47-73. doi: 10.1177/0146621605287691
    [34] J. C. Xu, Q. W. Ren, Z. Z. Shen.  Prediction of the strength of concrete radiation shielding based on LS-SVM[J]. Annals of Nuclear Energy, 2015, 85: 296-300. doi: 10.1016/j.anucene.2015.05.030
    [35] S. Borman. The expectation maximization algorithm: a short tutorial. Submitted for publication, vol. 41, 2004.
    [36] W. Deng, H. M. Zhao, L. Zou, G. Y. Li, X. H. Yang, D. Q. Wu.  A novel collaborative optimization algorithm in solving complex optimization problems[J]. Soft Computing, 2017, 21(15): 4387-4398. doi: 10.1007/s00500-016-2071-8
    [37] M. H. Fang, X. H. Hu, T. T. He, Y. Wang, J. M. Zhao, X. J. Shen, J. Yuan. Prioritizing disease-causing genes based on network diffusion and rank concordance. In Proceedings of IEEE International Conference on Bioinformatics and Biomedicine, IEEE, Belfast, UK, pp. 242–247, 2014. DOI: 10.1109/BIBM.2014.6999162.
    [38] S. R. Safavian, D. Landgrebe.  A survey of decision tree classifier methodology[J]. IEEE Transactions on Systems, Man, and Cybernetics, 1991, 21(3): 660-674. doi: 10.1109/21.97458
    [39] A. Liaw, M. Wiener.  Classification and regression by randomForest[J]. R News, 2002, 2(3): 18-22.
    [40] J. H. Friedman.  Stochastic gradient boosting[J]. Computational Statistics & Data Analysis, 2002, 38(4): 367-378. doi: 10.1016/S0167-9473(01)00065-2
    [41] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, K. R. Mullers. Fisher discriminant analysis with kernels. In Proceedings of IEEE Signal Processing Society Workshop, IEEE, Madison, USA, pp. 41–48, 1999. DOI: 10.1109/NNSP.1999.788121.
    [42] J. A. K. Suykens, J. Vandewalle.  Least squares support vector machine classifiers[J]. Neural Processing Letters, 1999, 9(3): 293-300. doi: 10.1023/A:1018628609742
    [43] E. Bauer, R. Kohavi.  An empirical comparison of voting classification algorithms: bagging, boosting, and variants[J]. Machine Learning, 1999, 36(1–2): 105-139. doi: 10.1023/A:1007515423169
    [44] H. Li, F. D. Chen, K. W. Cheng, Z. Z. Zhao, D. Z. Yang.  Prediction of zeta potential of decomposed peat via machine learning: comparative study of support vector machine and artificial neural networks[J]. International Journal of Electrochemical Science, 2015, 10(8): 6044-6056.
    [45] Y. C. Chen, H. Ha, H. Kim, H. Ahn.  Canonical forest[J]. Computational Statistics, 2014, 29(3–4): 849-867. doi: 10.1007/s00180-013-0466-x


Abstract: In this article, we propose a novel probabilistic framework to improve the accuracy of a weighted majority voting algorithm. In order to assign higher weights to the classifiers which can correctly classify hard-to-classify instances, we introduce the item response theory (IRT) framework to evaluate the samples′ difficulty and classifiers′ ability simultaneously. We assign weights to the classifiers based on their abilities. Three models are created with different assumptions suitable for different cases. When making inference, we keep a balance between accuracy and complexity. In our experiment, all the base models are constructed by single trees via bootstrap. To explain the models, we illustrate how the IRT ensemble model constructs the classification boundary. We also compare their performance with other widely used methods and show that our model performs well on 19 datasets.

    • Classification ensembles have been gaining increasing attention in machine learning, especially as a means of improving accuracy. The most important feature distinguishing ensemble learning from other types of learning is that it combines the predictions from a group of classifiers rather than depending on a single classifier[1]. It has been shown in many cases that aggregated methods, such as bagging, boosting and incremental learning, outperform methods without a collective decision strategy.

      If one had to identify an idea as central and novel to ensemble learning, it would be the combination rule, which comes in two main forms: simple majority voting and weighted majority voting. Simple majority voting is a decision rule that combines the decisions of the classifiers in the ensemble[1]. It is widely applied in ensemble learning due to its simplicity and applicability[2]. Weighted majority voting multiplies the decision of each classifier by a weight reflecting that classifier's ability, and then makes the final decision by combining the weighted decisions[3]. Both methods utilize the ability of classifiers based on their performance on the training data. Thus they do not require any parameter tuning once the individual classifiers have been trained.
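      To make the distinction concrete, below is a minimal sketch of both rules for binary labels in {0, 1}; the function names are illustrative, not part of the original formulation.

```python
import numpy as np

def simple_majority_vote(decisions):
    """decisions: (n_classifiers, n_samples) array of 0/1 votes."""
    return (decisions.mean(axis=0) > 0.5).astype(int)

def weighted_majority_vote(decisions, weights):
    """Each classifier's vote is multiplied by its (normalized) weight
    before the votes are combined into a final decision."""
    w = np.asarray(weights, dtype=float) / np.sum(weights)
    return (w @ decisions > 0.5).astype(int)
```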

      Here we propose a novel probabilistic framework for the weighted voting classification ensemble. We treat each data point as a problem and each classifier as a student taking an exam. As we know, the performance of a student on a problem depends on two major factors: the difficulty of the problem and the competence of the student[4]. In the training data, some cases have significant features and are easy to classify, whereas some are difficult because they are near class boundaries. Thus, similar to an exam, we define the competence of a classifier as its capability of correctly classifying difficult cases, rather than the number of correctly classified cases. For instance, suppose one classifier correctly classifies some easy cases but fails on difficult cases, while another classifier correctly classifies some difficult cases but misclassifies easy ones. Then it makes sense to give a higher weight to the second classifier than to the first.

      In this paper, we propose a method which can simultaneously evaluate the ability of a classifier and the difficulty of classifying a case. Here, we employ the item response theory (IRT) framework[5, 6], which is widely applied in psychological and educational research, to estimate the latent ability of classifiers.

    • Classifier $i$'s ability is defined by the parameter $\theta_i$, which measures its capacity to handle different samples. Not only the number of cases it can correctly classify, but also the difficulty of each case is considered when estimating this parameter. The classifier's ability is directly connected with the weight assigned to it in the ensemble, which is a real value between 0 and 1. A classifier with a highly negative ability receives a weight close to 0, while the opposite is true for a classifier with a highly positive ability. Outliers, observations near the boundary and observations surrounded by multiple observations from the other class can usually be correctly classified only by a classifier with a high ability value.

    • Item response theory considers a set of models that relate the responses given to items to the latent abilities of the respondents[5]. It is widely applied in educational testing and psychological evaluation. In this framework, the probability of a response is a function of the classifier's ability and the observation's difficulty. As a classifier either correctly or incorrectly classifies an observation in our case, we focus only on dichotomous models. In the original model, the relationship can be described as

      $P(y_{ij} = 1|\theta_i,\beta_j) = \Phi(\theta_i-\beta_j) = \int_{-\infty}^{\theta_i-\beta_j}\frac{1}{\sqrt{2\pi}}{\rm{e}}^{-\frac{t^2}{2}}\,{\rm{d}}t$

      where $y_{ij}$ is the binary response of classifier $i$ to observation $j$: $y_{ij} = 1$ for a correct classification and $y_{ij} = 0$ otherwise. The parameter $\theta_i$ is the latent ability parameter of classifier $i$, and $\beta_j$ is the difficulty of observation $j$. As the model contains only one item parameter $\beta_j$, it is called the one-parameter normal ogive (1PNO) model. The 2PNO and 3PNO models introduce two and three item parameters, respectively. Our first method uses the 3PNO as the basic framework; in our second method, we design a 2PNO variant.
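      For concreteness, the 1PNO response probability can be computed directly from the standard normal CDF. The following is a minimal sketch assuming SciPy is available; the function name is ours, not from the model specification.

```python
from scipy.stats import norm

def p_correct_1pno(theta: float, beta: float) -> float:
    """1PNO: probability that a classifier of ability theta correctly
    classifies an observation of difficulty beta (normal ogive)."""
    return norm.cdf(theta - beta)

# An able classifier on an average-difficulty observation:
print(p_correct_1pno(1.0, 0.0))  # ~0.84
```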

    • Many prior works focus on assigning weights to classifiers constructed by bootstrapping[7]. A hybrid voting method combining the results from different classifiers is proposed in [8]. Rojarath et al.[9] suggest a method to improve each classifier's ability based on the results of the others. A ranking measurement to evaluate an instance's preference toward the classifiers is defined in [10]. Although fast and efficient, all these methods fail to consider the classifiers' abilities. The problem of training and testing data having different distributions is tackled via transfer learning[11]; however, this requires a powerful pre-trained model. The weighted ensemble is applied to high-dimensional cases in [12] and [13]. Ensemble pruning algorithms, which prune the ensembles based on significant features, are considered in [14] and [15].

      Recently, probabilistic models have also been adopted in ensemble learning. A comprehensive weighted majority vote built on the naive Bayes combiner is presented in [16]. Based on that probabilistic framework, four ensemble methods are derived by progressively relaxing the assumptions. They construct the model directly from the misclassification probability, rather than from the difficulty of the observations and the ability of the classifier. In [17], although instance difficulty is taken into consideration, the detailed decision mechanism between the classifiers and the observations is not discussed. In [18], weights are assigned to the class distribution at leaf nodes, which requires a large number of parameters as the feature dimension increases.

      As a fundamental method to improve classification performance, ensemble learning is used in many different tasks. Text mining methods are combined with majority voting to detect fake news on websites[19]. A double-layer Bayesian classifier ensemble is designed and applied to multiple datasets[20]. Bagging, boosting and stacking methods are used to assess the credit scores of applicants[21].

      We propose the IRT ensemble to evaluate the difficulty of samples and the ability of classifiers simultaneously. To the best of our knowledge, our method is the first approach introducing IRT into ensemble learning and giving a statistical explanation of the samples' difficulty and the classifiers' ability. This work is similar to WAVE[4]; however, although WAVE uses a similar idea, it does not statistically explain the weights. Our proposed methods model the weight assignment in a probabilistic way and can explain the corresponding relationship between the classifiers' ability and the observations' difficulty.

    • In this section, we introduce the proposed method in detail. We treat the classifiers as competitors and want to evaluate their performance by considering the accuracy in classifying hard-to-classify samples. The basic rule is to assign a higher weight to the classifiers with higher classification accuracy on such samples. Thus, we adopt the item response theory framework[22], which is widely applied in educational testing to evaluate the items and the examinees simultaneously[23, 24]. We first describe our model and explain why it works. Then we introduce two methods for making inference. Throughout the paper, vectors and matrices are denoted in boldface.

    • We consider a set of classifiers $\Omega = \{C_1,C_2, \cdots, C_n\}$ and a set of data points $\Phi = \{S_1,S_2, \cdots, S_m\}$. For each classifier $C_i$, a corresponding parameter $\theta_i$ denotes the ability of the classifier. Similarly, we assign a difficulty parameter $\beta_j$ to each data point $S_j$. In the IRT framework, the probability of a response for a data point is a function of the classifier's ability and the difficulty of classifying the case. Although various models have been developed within the IRT framework, we focus on the basic unidimensional IRT model because it models the classifier-sample interaction by two single unified traits, $\theta$ and $\beta$. In this model, the response generated from the interaction has only two outcomes: success or failure. In our problem, success means that the classifier recognizes the label of a sample correctly.

      Now we can formulate the model. Suppose we have $n$ classifiers and $m$ data points in a training set. We denote the $n \times m$ performance matrix by Y. Each element $Y_{ij}$ is 1 if classifier $i$ correctly classifies sample $j$, and 0 otherwise. We also define l as an $m \times 1$ vector with all elements equal to 1, and I as an $n \times n$ identity matrix. For ease of description, we first state Assumptions 1 and 2.

      Assumption 1. The performance parameter of a classifier and the difficulty parameter of a sample are real numbers, and the probability of a correct classification can be expressed through a cumulative distribution function (CDF).

      Assumption 2. The classifiers give their decisions independently conditioned upon the training data.

      For each classifier $C_i$, the probability of a correct classification of sample point $S_j$ is

      $P(Y_{ij} = 1) = \phi(\alpha_j\theta_i-\beta_j)+\gamma_j(1-\phi(\alpha_j\theta_i-\beta_j))$

      where $\phi(x)$ is a CDF. We now explain the extra parameters and their original purposes.

      1) The discrimination parameter $\alpha_j$ of case $S_j$ reflects the steepness of the probability function. If we set $\gamma_j = 0$ and differentiate the function $\phi$ with respect to $\theta_i$, the derivative is $\alpha_j\phi'(\alpha_j\theta_i-\beta_j)$, where $\alpha_j$ serves as a multiplier. The larger the value of $\alpha_j$, the steeper the probability curve. Hence, around some point, any small improvement in the classifier's ability can make a large difference in the response, which can be used to detect subtle differences in the abilities of classifiers.

      2) We define $\gamma_j$ as a guessing parameter: there is a small probability that a classifier correctly classifies a case without really learning the features from the training data.

      We can now see the advantage of the model. The performance of a classifier is estimated based on its responses to discriminating items with different levels of difficulty, rather than by raw accuracy. Classifiers which correctly classify the difficult cases receive a high estimated value of the performance parameter, and hard-to-classify data points tend to be correctly classified by highly performing classifiers.
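      As an illustration, the 3PNO response probability above can be evaluated directly; the following minimal sketch (our own naming, SciPy assumed) makes the roles of the three item parameters explicit.

```python
from scipy.stats import norm

def p_correct_3pno(theta: float, alpha: float, beta: float, gamma: float) -> float:
    """3PNO: discrimination alpha, difficulty beta, guessing gamma.
    Equivalent to gamma + (1 - gamma) * Phi(alpha * theta - beta)."""
    base = norm.cdf(alpha * theta - beta)
    return base + gamma * (1.0 - base)

# Even a weak classifier succeeds on a hard item with probability >= gamma:
print(p_correct_3pno(theta=-2.0, alpha=1.5, beta=1.0, gamma=0.1))
```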

    • According to Assumption 2, we can write the likelihood function as

      $f({{y}}|\Theta) = \prod\limits_{i = 1}^n\prod\limits_{j = 1}^m P(Y_{ij} = 1)^{Y_{ij}}(1-P(Y_{ij} = 1))^{1-Y_{ij}}.$

      If we directly optimize the likelihood function, the optimization is very unstable due to the non-convexity of the function. Thus, we use Markov chain Monte Carlo (MCMC) to estimate the parameters[25, 26]. The basic idea of MCMC is to construct a Markov chain, via a transition kernel, whose stationary distribution is the target posterior, and to use its samples to estimate the parameters. For concise notation, we denote the set of all parameters by $\Theta$, which includes the discrimination parameters $\alpha_j$, the guessing parameters $\gamma_j$, the difficulty parameters $\beta_j$ and the performance parameters $\theta_i$. For the training data, we denote the sample space by $\Pi$. Thus the joint probability function can be written as $f(\Pi|\Theta)$.

      1) Metropolis-Hastings algorithm. The Metropolis-Hastings (M-H) algorithm is a basic method in the MCMC family, which constructs the Markov chain from the parameters' posterior distribution with a proposal kernel[27]. If we supply a prior distribution $f(\Theta)$ for the parameters, we can write the joint distribution of the parameters and data as $f(\Pi,\Theta) = f(\Pi|\Theta) f(\Theta)$, and following the Bayes rule, the posterior distribution of $\Theta$ is

      $p(\Theta|\Pi) = \frac{f(\Pi|\Theta)f(\Theta)}{\int f(\Pi|\Theta)f(\Theta){\rm{d}}\Theta}\propto f(\Pi|\Theta)f(\Theta).$

      Thus, we obtain the posterior distribution of the parameters $\Theta$.

      We then grow the Markov chain by sampling from the posterior distribution in multiple steps[28]. At step $i$ of the algorithm, where the current value is $\Theta^{i}$, we draw a sample $\Theta'$ from the proposal kernel $q(\Theta'|\Theta^{i})$. The probability of accepting it as the value of $\Theta^{i+1}$ in the next iteration is

      $\alpha = \min\left\{\frac{f(\Theta^{'}|{\rm{rest}})\,q(\Theta^{i}|\Theta^{'})}{f(\Theta^{i}|{\rm{rest}})\,q(\Theta^{'}|\Theta^{i})},1\right\}.$

      Here, the remaining parameters are denoted as rest. We then generate a random number $u$ from the uniform distribution U(0,1). If $u\leq \alpha$, we accept the proposal and set $\Theta^{i+1} = \Theta'$; otherwise, we reject the proposal and keep $\Theta^{i+1} = \Theta^{i}$.

      The above is a brief summary of the M-H algorithm. In practice, we have to estimate an array of parameters rather than a single parameter, so we decompose the parameter vector into components and update them one by one through the M-H algorithm, which slows the algorithm down. Besides, the prior distributions of the parameters should be carefully chosen with the effectiveness of the rejection process in mind. Last but not least, the initial values also play a fundamental role in the efficiency of the estimation.

      In this model, we assign the parameters with the following priors:

      $\begin{aligned} &\theta _i \sim N(0,\sigma _i^2)\\ &\log (\alpha _j) \sim N(\mu _a,\sigma _a^2)\quad (\text{we constrain } \alpha _j \ge 0)\\ &\beta _j \sim N(0,\sigma ^2)\\ &\sigma _i^2 \sim \mathrm{IG}(\alpha _\theta ,\beta _\theta ). \end{aligned}$

      Then we derive the posterior probability density function from the given priors. We can use PyStan to construct the model, assign the priors to the parameters, and run any of the inference algorithms it provides. We cannot obtain a concise form of the posterior distribution because, except in the last step, the posterior does not belong to any exponential family. Thus, we resort to the M-H algorithm to update the parameters $\theta_i$, $\alpha_j$, $\beta_j$. Finally, we use the updated inverse gamma distribution to update $\sigma_i$. Although straightforward, this algorithm is very inefficient and time consuming because of the low acceptance rate.
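      As a concrete illustration of one such update, the sketch below performs a single random-walk M-H step for one ability parameter $\theta_i$, holding the item parameters fixed. The function names and the symmetric random-walk proposal (for which the $q$ terms cancel) are our own choices for illustration.

```python
import numpy as np
from scipy.stats import norm

def log_post_theta(theta, y_i, alpha, beta, gamma, sigma=1.0):
    """Log posterior of one ability theta given that classifier's row y_i of
    the performance matrix and fixed item parameters (3PNO likelihood)."""
    p = gamma + (1.0 - gamma) * norm.cdf(alpha * theta - beta)
    p = np.clip(p, 1e-12, 1 - 1e-12)                 # numerical safety
    loglik = np.sum(y_i * np.log(p) + (1 - y_i) * np.log(1 - p))
    return loglik + norm.logpdf(theta, 0.0, sigma)   # N(0, sigma^2) prior

def mh_step(theta, y_i, alpha, beta, gamma, step=0.5, rng=np.random.default_rng()):
    """One random-walk Metropolis update; the symmetric proposal cancels q."""
    prop = theta + step * rng.standard_normal()
    log_a = (log_post_theta(prop, y_i, alpha, beta, gamma)
             - log_post_theta(theta, y_i, alpha, beta, gamma))
    return prop if np.log(rng.random()) < log_a else theta
```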

      2) Gibbs sampling algorithm. We employ the Gibbs sampling algorithm, a special case of the M-H algorithm with a proposal acceptance rate of 1, in the hope of improving efficiency. The probability of a correct classification stays the same:

      $P(Y_{ij} = 1) = \phi(\alpha_j\theta_i-\beta_j)+\gamma_j(1-\phi(\alpha_j\theta_i-\beta_j)).$

      To make inference easier, we apply the data augmentation method[29] to the likelihood function. In many cases where the likelihood function cannot be closely approximated by a normal likelihood, data augmentation can simplify it by introducing a series of latent variables[30, 31].

      Introducing latent variables

      In our problem, we define two $n$ by $m$ matrix variables ${{W}},\;{{Z}}$ which are associated with the generating process of the performance matrix. Before explaining them, we make Assumption 3.

      Assumption 3. If a sample point falls in the stable region, which is constructed by the classifier, and is not close to the boundary, the classifier can correctly classify the sample with probability 1.

      $W_{ij} = \begin{cases} 1, & \text{if sample } j \text{ falls in the stable region of classifier } i\\ 0, & \text{if sample } j \text{ falls in the unstable boundary region of classifier } i. \end{cases}$

      ${Z_{ij}}\sim N({\eta _{ij}},1)$, where ${\eta _{ij}} = {\alpha _j}{\theta _i} - {\beta _j}$. Z serves as the latent continuous variable underlying the indicator W.

      For the relation between W and Z, we have the following definition:

      $W_{ij} = \begin{cases} 1, & \text{if } Z_{ij} \ge 0\\ 0, & \text{if } Z_{ij} < 0. \end{cases}$

      Prior for normal parameters

      For the classifier's ability $\theta_i$, we assign a normal prior $\theta_i \sim N \left( \mu,\sigma^2 \right)$[32]. As $\alpha_j$ and $\beta_j$ are the parameters describing the discrimination and difficulty of problem $j$, we can stack $\alpha_j$ and $\beta_j$ for simplicity and denote the resulting vector by $\phi_{{j}}$. In accordance with $\theta_i$, we assign this vector a multivariate normal prior[31]:

      $\Phi = \left[ \phi_{1}, \phi_{2}, \ldots, \phi_{m} \right], \qquad \Sigma_\phi = \begin{bmatrix} \sigma_\alpha^2 & \rho\sigma_\alpha\sigma_\beta\\ \rho\sigma_\alpha\sigma_\beta & \sigma_\beta^2 \end{bmatrix}, \qquad \phi_{j} = \begin{bmatrix} \alpha_j\\ \beta_j \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_{\alpha_j}\\ \mu_{\beta_j} \end{bmatrix},\; \Sigma_\phi \right).$

      We further place a prior on the mean vector:

      $M_{j} = \begin{bmatrix} \mu_{\alpha_j}\\ \mu_{\beta_j} \end{bmatrix} \sim N\left( \begin{bmatrix} \tau_{\alpha_j}\\ \tau_{\beta_j} \end{bmatrix},\; I \right).$

      For the hyperparameters $\tau_{\alpha_j}$ and $\tau_{\beta_j}$, we assign values related to the number of classifiers that correctly classify sample $j$.

      Finally, for the guessing parameter $\gamma_j$, we assign a beta prior:

      $\gamma_j\quad\sim\quad {\rm{Beta}}(s,t).$

      Thus, the joint posterior distribution of $({{\Theta }},{{\Phi }},{{M}},{{\Gamma }},{{W}},{{Z}}|{{y}})$ is

      $\begin{split} &P({{\Theta }},{{\Phi }},{{M}},{{\Gamma }},{{W}},{{Z}}|{{y}}) \propto \\ & \quad P(y|{{W}},{{\Gamma }})P({{W}}|{{Z}})P({{Z}}|{{\Theta }},{{\Phi }},{{M}})P({{\Theta }})P({{\Phi }}|{{M}})P({{M}})P({{\Gamma }}). \end{split}$

      Parameter inference

      According to the Gibbs algorithm, we can generate the posterior samples by sweeping through each variable. In each iteration, we sample each variable from its conditional distribution with the remaining variables fixed at their current values.

      Inference on latent variables

      The conditional distribution of $W_{ij}$ can be derived as follows:

      ${W_{ij}}\;\sim\; \begin{cases} {\rm{Bernoulli}}\left( \dfrac{\Phi(\eta_{ij})}{\gamma_j + (1 - \gamma_j)\Phi(\eta_{ij})} \right), & {\rm{if}}\;y_{ij} = 1\\[2mm] 0, & {\rm{if}}\;y_{ij} = 0. \end{cases}$

      Similarly, we can obtain the conditional distribution of Z as

      ${Z_{ij}}\;\sim\; \begin{cases} N(\eta_{ij},1)\,I(Z_{ij} \ge 0), & {\rm{if}}\;W_{ij} = 1\\ N(\eta_{ij},1)\,I(Z_{ij} < 0), & {\rm{if}}\;W_{ij} = 0. \end{cases}$

      Inference on normal variables

      1) For $\theta_i$

      Because of the conjugate prior, we have

      $\theta_i \sim N \left(\dfrac{\sum_{j = 1}^m(z_{ij}+\beta_j)\alpha_j +\frac{\mu}{\sigma^2}}{\frac{1}{\sigma^2}+\sum_{j = 1}^{m}\alpha_j^2}, \frac{1}{\frac{1}{\sigma^2}+\sum_{j = 1}^m\alpha_j^2}\right).$

      2) For ${{{\Phi }}_{ \cdot {j}}},$

      We denote $\overline{X}$ as follows:

      $\bar X = \left[ {\begin{array}{*{20}{c}} {{\theta _1}}&{ - 1}\\ {{\theta _2}}&{ - 1}\\ \vdots & \vdots \\ {{\theta _n}}&{ - 1} \end{array}} \right].$

      Then we have

      $\bar X\Phi = \left[ {\begin{array}{*{20}{c}} {{\eta _{11}}}&{{\eta _{12}}}& \cdots &{{\eta _{1m}}}\\ {{\eta _{21}}}&{{\eta _{22}}}& \cdots &{{\eta _{2m}}}\\ \vdots & \vdots & \ddots & \vdots \\ {{\eta _{n1}}}&{{\eta _{n2}}}& \cdots &{{\eta _{nm}}} \end{array}} \right].$

      Thus, we have

      $\begin{array}{l} {{P}}({{{\Phi }}_{ \cdot {j}}}|\sim )\sim N\left( {{\mu _{{\Phi _{ \cdot j}}}},{\sigma _{{\Phi _{ \cdot j}}}}} \right) \end{array}$

      where ${\mu _{{\Phi _{ \cdot j}}}} = {\left( {{{\bar X}^{\rm T}}\bar X + \Sigma _\phi ^{ - 1}} \right)^{ - 1}}\left( {{{\bar X}^{\rm T}}{Z_{ \cdot j}} + \Sigma _\phi ^{ - 1}{M_j}} \right), {\sigma _{{\Phi _{ \cdot j}}}} = {\left( {{{\bar X}^{\rm T}}\bar X + \Sigma _\phi ^{ - 1}} \right)^{ - 1}}.$

      3) For ${{{M}}_{{j}}},$

      $\begin{array}{l} {{P}}({{{M}}_{j}}|\sim )\sim N\left( {{\mu _{{M_j}}},{\sigma _{{M_j}}}} \right) \end{array}$

      where ${\mu _{{M_j}}} = {\left( {\Sigma _\phi ^{ - 1} + I} \right)^{ - 1}}\left( {\Sigma _\phi ^{ - 1}{\Phi _{ \cdot j}} + {T_j}} \right), {\sigma _{{M_j}}} = {\left( {\Sigma _\phi ^{ - 1} + I} \right)^{ - 1}}. $

      4) For ${{{\gamma }}_{j}},$

      ${\gamma _j}\sim {\rm{Beta}}\left( {{\lambda _1},{\lambda _2}} \right)$

      where ${\lambda _1} = \sum\limits_{i = 1}^n I({W_{ij}} = 0,{y_{ij}} = 1) + s$ and ${\lambda _2} = \sum\limits_{i = 1}^n I({W_{ij}} = 0,{y_{ij}} = 0) + t.$
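      A minimal sketch of the augmented sampling steps is given below, assuming NumPy/SciPy: it draws W from its Bernoulli conditional, Z from the corresponding truncated normals, and $\theta$ from its conjugate normal conditional. The function names are ours, and the remaining updates ($\Phi$, M, $\gamma$) follow the same pattern.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def sample_w(y, eta, gamma, rng):
    """W_ij = 1 w.p. Phi(eta)/(gamma + (1-gamma)Phi(eta)) when y_ij = 1, else 0."""
    p1 = norm.cdf(eta)
    prob = p1 / (gamma + (1.0 - gamma) * p1)
    return np.where(y == 1, rng.binomial(1, prob), 0)

def sample_z(w, eta, rng):
    """Z_ij ~ N(eta,1) truncated to [0,inf) if W_ij = 1, to (-inf,0) otherwise."""
    a = np.where(w == 1, -eta, -np.inf)   # standardized lower bounds
    b = np.where(w == 1, np.inf, -eta)    # standardized upper bounds
    return truncnorm.rvs(a, b, loc=eta, scale=1.0, random_state=rng)

def sample_theta(z, alpha, beta, mu=0.0, sigma2=1.0, rng=None):
    """Conjugate normal update of each theta_i given Z and item parameters."""
    prec = 1.0 / sigma2 + np.sum(alpha ** 2)
    mean = ((z + beta) @ alpha + mu / sigma2) / prec
    return (rng or np.random.default_rng()).normal(mean, np.sqrt(1.0 / prec))
```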

    • In the previous model, we quantified the classifier's ability by defining the probability of a correct classification, and made assumptions about the parameters to simplify the calculations in the MCMC approach. However, this method is limited in that it constrains the distribution of a correct classification, which cannot be directly measured from the performance matrix. To obtain a closed form, we introduced many normality assumptions for the relationships between the latent variables. Besides, Gibbs sampling is also time consuming. Thus, instead of assuming a certain distribution for the latent variables, we propose a new model in which the constraints on the latent parameters are relaxed. In this model, the successful prediction of sample $j$ is again determined by its difficulty and the ability of classifier $i$. Now, we consider the generating process of each element $Y_{ij}$ of the performance matrix $Y$:

      $ \begin{array}{l} Y_{ij} \sim {\rm{Bernoulli}}(P_{ij}) \\ P_{ij} \sim {\rm{Beta}}(m_{ij},n_{ij}) . \end{array}$

      Based on the Bernoulli-Beta conjugacy, the value of $Y_{ij}$ is governed by the probability $P_{ij}$, which we call the success parameter. Clearly, the larger $P_{ij}$ is, the higher the probability that $Y_{ij} = 1$, i.e., that classifier $i$ correctly classifies sample $j$. Their relationship is

      $ \begin{split} & P(Y_{ij} = 1) = P_{ij} \\ & f(P_{ij}) = \frac{\Gamma(m_{ij}+n_{ij})} {\Gamma(m_{ij})\Gamma(n_{ij})}P_{ij}^{m_{ij}-1}(1-P_{ij})^{n_{ij}-1} . \end{split}$

      Thus, we want the success parameter to be an increasing function of $\Delta_{ij} = \theta_i-\beta_j$ while keeping the parameters $m_{ij}$ and $n_{ij}$ positive. We therefore need a link between them and $\theta_i$, $\beta_j$, and we construct the functions:

      $\begin{split}&m_{ij} = \exp\bigg\{\frac{\alpha_j\theta_i-\beta_j}{2}\bigg\}\\ & n_{ij} = \exp\bigg\{\frac{-\alpha_j\theta_i+\beta_j}{2}\bigg\}.\end{split}$

      The exponential function ensures that $m_{ij}$ and $n_{ij}$ are positive. With this structure, the success probability is indirectly affected by the classifier's ability and the sample's difficulty. Most importantly, no distributional assumptions are made for the classifier's ability[33].

      We can also view the relationship from a different perspective. If we take the expectation of $P_{ij}$ under this assumption and expand it as a function of the parameters $\theta_i$ and $\beta_j$, the following sigmoid function appears:

      $E(P_{ij}) = \frac{m_{ij}}{n_{ij}+m_{ij}} = \frac{1}{1+\exp\bigg\{-\alpha_j\theta_i+\beta_j\bigg\}}.$

      Clearly, we link the latent parameters $P_{ij}$, $\theta_i$ and $\beta_j$ in a more flexible way than in the previous model. Their relation is no longer defined by a deterministic equation, but by a distribution whose mean is a sigmoid function. Although this relation has no direct distributional interpretation, it quantifies how the distance between $\theta_i$ and $\beta_j$ contributes to the success probability, which in turn generates a prediction result[34]. Moreover, $\theta_i$ and $\beta_j$ are free variables without strong distributional constraints. Fig. 1 displays the expected response function for different difficulties and discriminations.

      Figure 1.  The expected response function for different difficulties and discriminations. The higher the discrimination, the steeper the curve. The difficulty affects the probability of correctly classifying the sample.

      Now we can summarize the full model log joint probability as follows:

      $L(Y,P,M,N) = \sum\limits_{i = 1}^n \sum\limits_{j = 1}^m \left[ (y_{ij} + m_{ij} - 1)\ln P_{ij} + (n_{ij} - y_{ij})\ln(1 - P_{ij}) + \ln\left( \frac{\Gamma(m_{ij} + n_{ij})}{\Gamma(m_{ij})\Gamma(n_{ij})} \right) \right] = \sum\limits_{i = 1}^n \sum\limits_{j = 1}^m L_{ij}.$

      We denote all the parameters $m_{ij}$ and $n_{ij}$ as

      $\Theta = \bigcup\limits_{i,j} \left(m_{ij},n_{ij}\right).$

      Now the model is constructed.
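      To make the link concrete, the sketch below (our own naming) computes $m_{ij}$, $n_{ij}$ and the resulting expected success probability, which is exactly the sigmoid above.

```python
import numpy as np

def beta_params(theta_i, beta_j, alpha_j=1.0):
    """Map ability/difficulty to the Beta parameters; exp keeps them positive."""
    m = np.exp((alpha_j * theta_i - beta_j) / 2.0)
    n = np.exp((-alpha_j * theta_i + beta_j) / 2.0)
    return m, n

def expected_success(theta_i, beta_j, alpha_j=1.0):
    """E[P_ij] = m/(m+n) = sigmoid(alpha_j * theta_i - beta_j)."""
    m, n = beta_params(theta_i, beta_j, alpha_j)
    return m / (m + n)

# The farther the ability exceeds the difficulty, the closer E[P_ij] is to 1:
print(expected_success(2.0, 0.0), expected_success(0.0, 2.0))
```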

    • A more difficult, but common, situation is that we introduce the latent parameter $P_{ij}$ while making no distributional assumptions about the parameters $\theta_i$ and $\beta_j$. To solve this problem, we adopt the expectation maximization (EM) algorithm[35], which is a standard tool for maximum likelihood estimation with latent variables[36]. We notice that the parameters are redundant because the likelihood function depends only on the distance between $\theta_i$ and $\beta_j$. Thus, a constraint on the parameters is necessary: we set the mean of the $\beta_j$ equal to 0, i.e., $\sum_{j = 1}^m\beta_j = 0$.

      E-step

      The E-step, at the ($k+1$)-th iteration, requires the calculation of

      $Q(\Theta |{\Theta ^k}) = E_{\Theta ^k}[L(\Theta ,P|Y)] = \sum\limits_{i = 1}^n \sum\limits_{j = 1}^m Q({L_{ij}}|m_{ij}^k,n_{ij}^k,{y_{ij}}) = \sum\limits_{i = 1}^n \sum\limits_{j = 1}^m E_{m_{ij}^k,n_{ij}^k}[{L_{ij}}|{y_{ij}}] = \sum\limits_{i = 1}^n \sum\limits_{j = 1}^m Q_{ij}.$

      From the full probability function above, we can derive the posterior distribution of $P_{ij}$:

      $P({P_{ij}}|{m_{ij}},{n_{ij}},{y_{ij}}) \propto P_{ij}^{y_{ij} + m_{ij}^k - 1}(1 - P_{ij})^{n_{ij}^k - y_{ij}} \;\Rightarrow\; P_{ij}\,|\,y_{ij} \sim {\rm{Beta}}\left( y_{ij} + m_{ij}^k,\; n_{ij}^k - y_{ij} + 1 \right).$

      Therefore, $Q$ can be expressed as

      $\begin{split} Q(L|M,N,Y) &= \sum\limits_{i = 1}^n \sum\limits_{j = 1}^m Q_{ij} = \sum\limits_{i = 1}^n \sum\limits_{j = 1}^m E_{m_{ij}^k,n_{ij}^k}[{L_{ij}}|{y_{ij}}]\\ &= \sum\limits_{i = 1}^n \sum\limits_{j = 1}^m \Big[ (y_{ij} + m_{ij} - 1)\,\psi( y_{ij} + m_{ij}^k ) + (n_{ij} - y_{ij})\,\psi( n_{ij}^k - y_{ij} + 1 ) - (m_{ij} + n_{ij} - 1)\,\psi( n_{ij}^k + m_{ij}^k + 1 ) \Big] \end{split}$

      where $\psi(x) = \frac{\Gamma'(x)}{\Gamma(x)}$ is the digamma function.

      M-step

      As we have the constraint $\sum_{j = 1}^m\beta_j = 0$, we can express the Q function as follows:

      $\begin{split} Q(L|M,N,Y) =\;& \sum\limits_{i = 1}^n \sum\limits_{j = 1}^{m - 1} m_{ij}\,SM_{ij} + \sum\limits_{i = 1}^n SM_{im}\exp\left( \frac{\alpha_m\theta_i + \sum\nolimits_{j = 1}^{m - 1}\beta_j}{2} \right)+\\ & \sum\limits_{i = 1}^n \sum\limits_{j = 1}^{m - 1} n_{ij}\,SN_{ij} + \sum\limits_{i = 1}^n SN_{im}\exp\left( \frac{-\alpha_m\theta_i - \sum\nolimits_{j = 1}^{m - 1}\beta_j}{2} \right) \end{split}$

      where $SM_{ij} = \psi(y_{ij} + m_{ij}^k) - \psi(n_{ij}^k + m_{ij}^k + 1)$ and $SN_{ij} = \psi(n_{ij}^k - y_{ij} + 1) - \psi(n_{ij}^k + m_{ij}^k + 1).$

      To estimate the parameters, we iterate until they converge. Different optimization methods, such as gradient ascent, RMSProp or Adam, can be used; here we show the gradient ascent algorithm. The partial derivatives of $Q$ with respect to $\theta_i$ and $\beta_j$ are as follows:

      $\begin{aligned} \frac{\partial Q}{\partial \theta_i} &= \sum\limits_{j = 1}^m \left( \frac{1}{2}m_{ij}SM_{ij} - \frac{1}{2}n_{ij}SN_{ij} \right)\\ \frac{\partial^2 Q}{\partial \theta_i^2} &= \sum\limits_{j = 1}^m \left( \frac{1}{4}m_{ij}SM_{ij} + \frac{1}{4}n_{ij}SN_{ij} \right)\\ \frac{\partial Q}{\partial \beta_j} &= \sum\limits_{i = 1}^n \left( -\frac{1}{2}m_{ij}SM_{ij} + \frac{1}{2}m_{im}SM_{im} + \frac{1}{2}n_{ij}SN_{ij} - \frac{1}{2}n_{im}SN_{im} \right)\\ \frac{\partial^2 Q}{\partial \beta_j^2} &= \sum\limits_{i = 1}^n \left( \frac{1}{4}m_{ij}SM_{ij} + \frac{1}{4}m_{im}SM_{im} + \frac{1}{4}n_{ij}SN_{ij} + \frac{1}{4}n_{im}SN_{im} \right). \end{aligned}$

      Based on the partial derivatives, we can use the gradient ascent algorithm to update the parameters $\Theta$ in each iteration.

      $\theta_i^{k+1} = \theta_i^{k} + \alpha\left.\frac{\partial Q}{\partial \theta_i}\right|_{\Theta^{k}}, \qquad \beta_j^{k+1} = \beta_j^{k} + \alpha\left.\frac{\partial Q}{\partial \beta_j}\right|_{\Theta^{k}}.$
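      The sketch below implements one EM sweep under simplifying assumptions: the discrimination $\alpha_j$ is fixed at 1, and the sum-to-zero constraint is re-imposed by centering $\beta$ after the gradient step, which approximates the constrained M-step above. The function name is ours.

```python
import numpy as np
from scipy.special import digamma

def em_iteration(y, theta, beta, lr=0.05):
    """One EM sweep for the Bernoulli-Beta model (simplified sketch).
    y: n x m performance matrix, theta: (n,) abilities, beta: (m,) difficulties."""
    eta = theta[:, None] - beta[None, :]            # eta_ij with alpha_j = 1
    m = np.exp(eta / 2.0)                           # m_ij
    n = np.exp(-eta / 2.0)                          # n_ij
    # E-step: E[ln P_ij] and E[ln(1 - P_ij)] under Beta(y + m, n - y + 1)
    sm = digamma(y + m) - digamma(m + n + 1)        # SM_ij
    sn = digamma(n - y + 1) - digamma(m + n + 1)    # SN_ij
    # M-step: one gradient-ascent step following the derivatives in the text
    grad_theta = 0.5 * np.sum(m * sm - n * sn, axis=1)
    grad_beta = 0.5 * np.sum(-m * sm + n * sn, axis=0)
    theta = theta + lr * grad_theta
    beta = beta + lr * grad_beta
    return theta, beta - beta.mean()                # re-impose sum(beta) = 0
```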

    • As the estimates of $\theta$ reflect the classifiers' abilities, we assign classifier $i$ a weight according to the estimate of $\theta_i$. The estimate, however, can be either positive or negative. Thus, we employ the following transformation to obtain the weights:

      $w_i = \frac{{\rm e}^{\theta_i}}{\sum_{k = 1}^n {\rm e}^{\theta_k}}.$

      The weights preserve the ordering of the classifiers' abilities and are normalized by the above formula.
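      This is the usual softmax transform; a minimal sketch (our naming), using the standard max-subtraction trick for numerical stability, which leaves the weights unchanged:

```python
import numpy as np

def classifier_weights(theta):
    """Softmax of the estimated abilities: positive, order-preserving weights
    that sum to 1. Subtracting the maximum only avoids overflow."""
    e = np.exp(theta - np.max(theta))
    return e / e.sum()
```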

    • A simulation study is presented below to show the estimation of the parameters from two 10 by 1000 datasets generated under different parameter settings; the difference lies in the parameters' distributions. For simplicity, we denote the normal IRT model estimated by the M-H algorithm, the normal IRT model estimated by the Gibbs sampling algorithm, and the Bernoulli-Beta IRT model as Models 1, 2 and 3, respectively; our experiments focus more on Models 2 and 3. First, we apply the three models to estimate the parameters of the two datasets generated from the two different sets of parameters. Both settings contain 10 samples and 1000 classifiers with fixed difficulties and abilities. In the first dataset, all parameters are sampled from a normal distribution, so they meet the assumptions of the first two models. In the second dataset, however, the distributions of the parameters vary: the difficulty is manually set to have a large variance and a relatively large range, and the ability is sampled from a highly skewed gamma distribution, so the second dataset does not satisfy the first two models' assumptions. Thus, we can compare the accuracies of the models in different situations.

      Four measures are computed to evaluate the performance of the three models, where $\theta$ is the real parameter and $\theta^{\rm{est}}$ is its estimate:

      1) Correlation:

      $ {\rm{Corr}} = \frac{N\sum_{i = 1}^{N}\theta_i^{\rm{est}}\theta_i -\sum_{i = 1}^{N}\theta_i^{\rm{est}}\sum_{i = 1}^{N}\theta_i}{N^2\sigma_{\rm{est}}\sigma}$

      2) Mean square error:

      ${\rm{ MSE} } = \frac{\sum_{i = 1}^{N}(\theta_{i}^{\rm{est}}-\theta_{i})^2}{N}$

      3) Mean absolute error:

      ${\rm{MAE}} = \frac{\sum_{i = 1}^{N}|\theta_{i}^{\rm{est}}-\theta_{i}|}{N}$

      4) Variance ratio:

      $ \;{\rm{VR}} = \frac{\sigma_{\rm{est}}^2}{\sigma^2}.$

      The correlation tests the linear relationship between the parameter estimates and the real parameters. The MSE and MAE measure the precision of the estimation, and the variance ratio illustrates the comparative stability. In Tables 1 and 3, we list these four measures of the estimators (classifiers' abilities and samples' difficulties) obtained from the different models for datasets 1 and 2, respectively. In Tables 2 and 4, we show the real parameters and their estimates from the three models.
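      These four measures are straightforward to compute; a minimal sketch (our naming) using NumPy:

```python
import numpy as np

def measures(est, true):
    """Correlation, MSE, MAE and variance ratio between estimates and truth."""
    return {
        "Corr": np.corrcoef(est, true)[0, 1],
        "MSE": np.mean((est - true) ** 2),
        "MAE": np.mean(np.abs(est - true)),
        "VR": np.var(est) / np.var(true),
    }
```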

      Model     Parameter                   Correlation   MSE     MAE    VR
      Model 1   θ (classifiers′ ability)    0.86          0.97    0.96   0.97
      Model 2   θ (classifiers′ ability)    0.88          0.31    0.37   0.88
      Model 3   θ (classifiers′ ability)    0.87          0.91    0.96   1.9
      Model 1   β (samples′ difficulty)     0.97          0.007   0.16   1.15
      Model 2   β (samples′ difficulty)     0.99          0.003   0.05   0.99
      Model 3   β (samples′ difficulty)     0.99          0.016   0.12   1.36

      Table 1.  Measures for dataset 1

      Parameter   True value   Model 1 estimate   Model 2 estimate   Model 3 estimate
      β1          −1.024       −0.94              −0.953             −1.107
      β2          −0.934       −0.99              −0.91              −1.08
      β3          −0.694       −0.732             −0.709             −0.869
      β4          −0.464       −0.56              −0.514             −0.579
      β5          0.356        0.472              0.408              0.503
      β6          0.336        0.336              0.319              0.382
      β7          0.616        0.694              0.576              0.722
      β8          0.806        0.85               0.757              0.956
      β9          −0.074       −0.171             −0.145             −0.172
      β10         1.076        1.23               1.141              1.243

      Table 2.  Real and estimated parameters for dataset 1

      Model     Parameter                   Correlation   MSE     MAE    VR
      Model 1   θ (classifiers′ ability)    0.63          23      4.2    0.87
      Model 2   θ (classifiers′ ability)    0.87          21.22   3.9    0.81
      Model 3   θ (classifiers′ ability)    0.93          19.33   3.1    0.95
      Model 1   β (samples′ difficulty)     0.86          15.75   3.53   0.8
      Model 2   β (samples′ difficulty)     0.93          12.64   2.58   0.18
      Model 3   β (samples′ difficulty)     0.98          1.716   1.13   1.08

      Table 3.  Measures for dataset 2

      Parameter   True value   Model 1 estimate   Model 2 estimate   Model 3 estimate
      β1          −9.075       −4.35              −3.256             −9.824
      β2          −2.485       −1.3               −1.296             −2.68
      β3          −3.395       −1.2               −2.24              −3.85
      β4          −1.015       −2.2               −0.192             −0.203
      β5          −0.075       0.461              1.593              1.248
      β6          0.165        1.36               0.96               1.628
      β7          4.035        3.194              3.045              5.591
      β8          6.485        3.85               3.783              6.998
      β9          −6.395       −3.171             −2.904             −8.462
      β10         11.755       3.23               3.534              9.553

      Table 4.  Real and estimated parameters for dataset 2

      As we can see from the results, Models 1 and 2 perform well on the first dataset, while Model 3 beats the others on the second dataset. The reason is that the first two models tend to give relatively stable solutions due to the normal prior, which has the advantage of limiting the range of the estimates. Thus, when the parameters are generated from a normal distribution, they fit well. In Model 3, we do not make such an assumption, so the parameters can take any value that minimizes the loss function, and their range may be larger than that of the first two estimations. It clearly produces a better estimation when the ranges of the real parameters are large. To compare the accuracy of the estimation, we calculate the error ratio for each parameter in both datasets:

      $ \begin{split} & {\rm{Error}}\;{\rm{ ratio}} = \frac{(\theta^{\rm{est}}-\theta)^2} {{\rm{Average}}\;{\rm{ MSE}}}\\ & {\rm{Average}}\;{\rm{ MSE}} = \frac{ {\rm{MSE}}_1+{\rm{MSE}}_2+{\rm{MSE}}_3}{3}. \end{split}$

      In Fig. 2, we use a bar plot to compare the error ratios for all models on two datasets.

      Figure 2.  Error ratio for the two datasets

      From the results, we can conclude that the real parameters and the parameter estimates are highly correlated. For both datasets, the correlations are almost always higher than 0.85, implying that the parameter estimates hold a strong linear relationship with the true parameter values. Thus, it makes sense to keep the weights of the different classifiers ordinally consistent with the classifiers' abilities[37].

    • As mentioned before, an IRT ensemble can evaluate the classifiers' abilities and the samples' difficulties simultaneously. To better understand the model, we first show what the parameters reflect and how they affect the classification. Then we illustrate the performance of our method on different datasets.

      We depict the sample points of the chess board dataset to show how the IRT ensemble model evaluates the samples' difficulty. Figs. 3 and 4 illustrate the dual character of location and difficulty, with difficulty shown by the size of the sample points: the larger the point, the larger the difficulty parameter. Points in the same block share the same color in the original chess board. We first show how the estimated values change as the number of iterations increases. In Fig. 3, we constructed 500 base classifiers. As we increase the number of iterations, the differences between the points gradually grow: after 100 iterations the point sizes are still similar, while after 500 iterations they vary considerably, meaning that each sample is well distinguished by its difficulty as evaluated by the ensemble of classifiers. Zooming in on Fig. 3, we can find some patterns. The IRT ensemble method tends to assign a higher difficulty to the points that are closer to the boundary, and these points support the decision boundary in return. Clearly, when points are close to their counterparts with a distinct label, they are more likely to be misclassified, and only classifiers which are powerful enough can correctly classify them. Thus, it makes sense to rely on such classifiers' capability when constructing the classification boundary.

Figure 3.  500 classifiers with different numbers of iterations. The X and Y axes give the positions of the points, and different class labels are shown by distinct shapes.

Figure 4.  Different numbers of classifiers. The X and Y axes give the positions of the points, and different class labels are shown by distinct shapes.

For other ensemble methods, it has been shown that increasing the number of classifiers can improve the performance, and this also applies to the IRT ensemble method. In Fig. 4, we fixed the number of iterations at 2 000 and changed the number of classifiers. According to the experiment, as the number of classifiers increases, the sample points that support the boundary within each block stand out, while those away from the boundary shrink to dots. In order to distinguish the important sample points from the others, more classifiers should be included to make a joint decision. An interesting phenomenon appears when the number of classifiers becomes large enough: in the last two subplots, the sizes of the points barely change. In many cases, increasing the ensemble size cannot guarantee improved performance; once there are sufficiently many classifiers, the method reaches a bottleneck.
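The sketch below shows how such a plot can be generated. The chessboard data and the difficulty values here are stand-ins: in our experiments the difficulties come from the IRT estimation, not from the geometric proxy used here:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)

# Illustrative 4x4 chessboard: label = parity of the block indices.
X = rng.uniform(0, 4, size=(2000, 2))
y = (X[:, 0].astype(int) + X[:, 1].astype(int)) % 2

# Stand-in difficulty: proximity to the nearest block boundary.
frac = np.minimum(X % 1, 1 - X % 1).min(axis=1)
difficulty = 1 - frac / frac.max()

# Point size encodes difficulty; points near boundaries appear larger.
plt.scatter(X[:, 0], X[:, 1], s=5 + 50 * difficulty, c=y, alpha=0.6)
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("Point size proportional to estimated difficulty")
plt.show()
```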

• We collected 15 real datasets and 2 artificial datasets to compare our Model 2 with a single tree[38], random forest (RF)[39], bagging[7], gradient boosting decision tree (GBDT)[40], linear discriminant analysis (LDA)[41], and support vector machine (SVM)[42]. We did not run experiments for Model 1 because it is too time-consuming. For all the ensemble methods, we used a single classification tree as the base classifier. For a standalone tree, pruning is necessary to prevent overfitting. However, we did not prune the base classifiers in bagging, gradient boosting, or random forest, because pruning decreases the variance but increases the bias. For all the ensemble algorithms, the bootstrap can be used to construct diverse base tree structures, which reduces the variance effectively. Thus, in all the ensemble models, we used unpruned trees as base classifiers to keep the bias low, and relied on the bootstrap to reduce the variance.
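A minimal sketch of this base-learner construction (unpruned CART trees fitted on bootstrap resamples), assuming scikit-learn and array inputs; the helper name is ours:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_base_classifiers(X, y, n_trees=500, seed=0):
    """Fit unpruned classification trees on bootstrap resamples of (X, y)."""
    rng = np.random.RandomState(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, n, size=n)   # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier()   # grown fully, i.e., no pruning
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees
```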

Most of the datasets summarized in Table 5 are from the UCI repository. As our model is built on top of base classifiers, it is suitable for any kind of feature as long as the base model is adequate for the data. To demonstrate the generality of our method, we intentionally selected some datasets containing both categorical and continuous features. One-hot encoding is applied to all the nominal features. We conducted all the experiments on the Python 3.6 platform. To compare the accuracy of these methods, we randomly split each dataset into 10 folds with the test set proportion set to 0.3. Each setting was simulated 30 times for each dataset. To compare the accuracies of the various methods, we set the number of trees in each ensemble algorithm to 500.
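A hedged sketch of the evaluation loop described above (repeated random splits with a 0.3 test proportion, accuracies averaged over 30 runs); the function and variable names are ours, not from the original implementation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def mean_accuracy(make_model, X, y, n_repeats=30, test_size=0.3):
    """Average test accuracy over repeated random splits."""
    accs = []
    for rep in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=rep)
        model = make_model()
        model.fit(X_tr, y_tr)
        accs.append(accuracy_score(y_te, model.predict(X_te)))
    return np.mean(accs)

# Example: a 500-tree random forest baseline.
# from sklearn.ensemble import RandomForestClassifier
# mean_accuracy(lambda: RandomForestClassifier(n_estimators=500), X, y)
```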

Dataset | Observations | Continuous features | Discrete features | Classes | Source
IRIS | 150 | 4 | 0 | 3 | UCI
Bld | 345 | 6 | 0 | 2 | UCI
Spe | 267 | 44 | 0 | 2 | UCI
Glass | 214 | 9 | 0 | 6 | UCI
Veh | 846 | 18 | 0 | 4 | UCI
Checkboard | 160 000 | 2 | 0 | 2 | Artificial
BTD | 106 | 9 | 0 | 4 | UCI
IPLD | 583 | 8 | 1 | 2 | UCI
Haberman | 306 | 3 | 0 | 2 | UCI
Ionos | 351 | 32 | 2 | 2 | UCI
Multiangle | 160 000 | 2 | 0 | 2 | Tensorflow
Balance | 625 | 0 | 4 | 2 | UCI
AUS | 690 | 5 | 9 | 2 | UCI
ECOLI | 336 | 7 | 0 | 6 | UCI
LEN | 24 | 0 | 3 | 3 | UCI
TAE | 151 | 0 | 5 | 3 | UCI
LC | 33 | 0 | 56 | 3 | UCI
LSVT | 126 | 310 | 0 | 2 | UCI
SCADI | 70 | 205 | 0 | 8 | UCI

      Table 5.  Dataset information

The average accuracies are reported in Table 6, where the best two results and the worst result for each dataset were highlighted. From Table 6, Model 2, random forest and gradient boosting perform well in general. However, for some kinds of datasets, gradient boosting fails to recognize the pattern and yields the worst result[43]. This weakness of gradient boosting has been reported in several papers. The performance of SVM fluctuates greatly[44].

Dataset | Model 2 | RF | GBDT | SVM | Single tree | Bagging | LDA
IRIS | 0.967 | 0.962 | 0.945 | 0.932 | 0.947 | 0.954 | 0.88
Bld | 0.703 | 0.69 | 0.693 | 0.68 | 0.62 | 0.66 | 0.59
Spe | 0.91 | 0.91 | 0.88 | 0.828 | 0.861 | 0.911 | 0.865
Glass | 0.729 | 0.766 | 0.77 | 0.63 | 0.72 | 0.72 | 0.59
Veh | 0.74 | 0.73 | 0.75 | 0.76 | 0.63 | 0.68 | 0.53
Checkboard | 0.88 | 0.82 | 0.96 | 0.5 | 0.58 | 0.87 | 0.5
BTD | 0.85 | 0.875 | 0.88 | 0.89 | 0.82 | 0.86 | 0.85
IPLD | 0.723 | 0.712 | 0.7 | 0.71 | 0.68 | 0.7 | 0.68
Haberman | 0.677 | 0.668 | 0.65 | 0.74 | 0.71 | 0.73 | 0.73
Ionos | 0.928 | 0.926 | 0.92 | 0.886 | 0.888 | 0.923 | 0.855
Multiangle | 0.92 | 0.9 | 0.96 | 0.51 | 0.83 | 0.83 | 0.49
Balance | 0.815 | 0.849 | 0.623 | 0.78 | 0.59 | 0.65 | 0.73
AUS | 0.858 | 0.865 | 0.81 | 0.76 | 0.82 | 0.83 | 0.68
ECOLI | 0.833 | 0.755 | 0.65 | 0.87 | 0.8 | 0.81 | 0.829
LEN | 0.738 | 0.706 | 0.725 | 0.717 | 0.708 | 0.72 | 0.725
TAE | 0.565 | 0.468 | 0.582 | 0.511 | 0.37 | 0.43 | 0.44
LC | 0.5 | 0.484 | 0.5 | 0.45 | 0.43 | 0.43 | 0.2

Table 6.  Average accuracy

Bagging generally performs better than the single tree model. Although bagging is not the best, it is more stable than gradient boosting in some cases. Note that Model 2 is a weighted voting model that extends the bagging strategy, which explains why Model 2 is more stable than the gradient boosting method.

A win table[45] summarizing the comparison is given in Table 7. In the win table, $a_{i,j}$ gives the number of datasets on which method $i$ achieves an accuracy at least as high as method $j$. For instance, $a_{1,3} = 11$ means that, out of the 17 comparisons between Model 2 and gradient boosting, Model 2 produces a result at least as accurate as gradient boosting on 11 datasets. This table shows every pairwise comparison in detail. To rank the methods, we calculate the goal difference from the win table, subtracting the number of losses from the number of wins for each model. The number of wins is obtained by summing within a model's row, and the number of losses by summing within its column; the goal difference follows directly. The result is shown in Table 8. It is clear that Model 2 has an overwhelming superiority over the others. This table provides another way to show the consistently high accuracy of our model compared to the other methods in Table 6.

Method | Model 2 | RF | GBDT | SVM | Single tree | Bagging | LDA
Model 2 | 0 | 13 | 11 | 13 | 16 | 14 | 16
RF | 5 | 0 | 8 | 11 | 14 | 12 | 14
GBDT | 7 | 9 | 0 | 11 | 13 | 10 | 14
SVM | 4 | 6 | 6 | 0 | 10 | 9 | 15
Tree | 1 | 3 | 4 | 7 | 0 | 3 | 10
Bagging | 3 | 5 | 8 | 8 | 17 | 0 | 13
LDA | 2 | 3 | 4 | 3 | 8 | 5 | 0

      Table 7.  Win table

Models | Win − loss | Win | Loss
Model 2 | 61 | 83 | 22
RF | 25 | 64 | 39
GBDT | 23 | 64 | 41
SVM | −3 | 50 | 53
Tree | −50 | 28 | 78
Bagging | 1 | 54 | 53
LDA | −57 | 25 | 82

      Table 8.  Summary of win table
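The win table and its summary can be reproduced mechanically from an accuracy matrix. A sketch (helper names are ours), assuming rows are datasets and columns are methods:

```python
import numpy as np

def win_table(acc):
    """acc: (n_datasets, n_methods) accuracy matrix.
    w[i, j] counts datasets where method i is at least as accurate as
    method j; ties therefore count for both sides, as in Table 7."""
    n = acc.shape[1]
    w = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i, j] = int(np.sum(acc[:, i] >= acc[:, j]))
    return w

def goal_difference(w):
    """Wins are row sums, losses are column sums (cf. Table 8)."""
    wins, losses = w.sum(axis=1), w.sum(axis=0)
    return wins - losses, wins, losses
```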

We also conducted an experiment to investigate how the ensemble size affects the prediction, using 13 datasets (4 datasets were discarded because the computation failed when the ensemble size was too small). We used the same procedure for measuring accuracy, conducting 20 repetitions for each ensemble size and averaging them to calculate the t-statistics. Fig. 5 shows the boxplots. Comparing our method with random forest, gradient boosting and SVM, we conclude that the gain of our model tends to grow as the ensemble size increases. Our model thus appears to have good potential to improve the accuracy when the ensemble size is large enough.

Figure 5.  Comparison with LDA, SVM, single tree, bagging, gradient boosting and random forest
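One plausible realization of the significance check mentioned above is a paired t-test on the per-repetition accuracies; the data in this sketch are synthetic placeholders:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.RandomState(0)
# Synthetic per-repetition accuracies at one ensemble size (20 repetitions).
acc_ours = 0.85 + 0.02 * rng.randn(20)
acc_rf = 0.83 + 0.02 * rng.randn(20)

t_stat, p_value = ttest_rel(acc_ours, acc_rf)  # paired comparison
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```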

Model 3 has a pronounced advantage on certain datasets. We illustrate the cumulative accuracy in 4 such cases in Fig. 6. In these datasets, the samples contain many attributes relative to the sample size, which means they include a lot of redundant information. Consequently, the samples' difficulties vary, and there may exist a small subsample that contributes heavily to constructing the decision boundary. It is hard for most of the classifiers to detect the important variables; only a small subset of classifiers are powerful. Thus, the distribution of the classifiers' abilities is no longer symmetric, and its variance increases. Model 3 is more powerful than Model 2 when the variance of the parameters is large enough, because higher weights are assigned to the stronger classifiers; hence it can outperform the other methods in these cases. We found that Model 3 consistently produces a higher accuracy than random forest and gradient boosting here. However, when the difficulty of each sample is similar, Model 2 tends to perform better.

      Figure 6.  Cumulative accuracy for Model 3 on 4 different datasets

• In this paper, we proposed the IRT ensemble, a weighted majority voting method that, by adopting item response theory, focuses on the classifiers that can correctly handle hard-to-classify instances. The classification boundaries are supported by the points that are frequently misclassified, and higher weights are assigned to the classifiers with higher abilities. We also proposed three models to estimate the ability parameters and stated the assumptions behind each model.

We analyzed the performance of the models in two stages. First, we evaluated their accuracy in parameter estimation. We concluded that Models 1 and 2 perform well when the variance of the parameters is small, while Model 3 is more suitable when the parameters vary widely. We also explained how the lengths of the Markov chains and the number of classifiers affect the estimation of the samples' difficulty. The chessboard dataset provides an intuitive illustration of the idea behind the IRT ensemble algorithm. We then carried out an experiment with Model 2 using 19 datasets and compared its performance with other classification methods. We showed that the advantage of Model 2 over LDA, SVM, a single tree, bagging and gradient boosting is enhanced as the ensemble size increases, and that it shows comparable performance to random forest. Finally, we found that Model 3 has an edge on high-dimensional datasets.

Future work includes combining Model 3 with kernel methods. Another possible modification is to introduce the Beta model, which is widely used in network analysis.
