Citation: | Amina Benabid, Dan Su, Dao-Hong Xiang. LARGE MARGIN UNIFIED MACHINES WITH NON-I.I.D. PROCESS[J]. Journal of Applied Analysis & Computation, 2022, 12(5): 2110-2132. doi: 10.11948/20220222 |
In this paper, we investigate the convergence theory of large margin unified machines (LUMs) under non-i.i.d. sampling. We decompose the total error into a sample error, a regularization error and a drift error, the last of which arises from the non-identical sampling. Under suitable mixing conditions, we construct sequences of independent blocks that reduce the analysis of the dependent samples to that of independent blocks. To handle the non-identical sampling, we additionally assume polynomial convergence of the marginal distributions. A novel projection operator is introduced to overcome the technical difficulty caused by the unbounded target function. Learning rates are derived explicitly under mild conditions on the approximation ability and capacity of the reproducing kernel Hilbert space.
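To fix ideas, truncation-type projection operators of the following form are standard in this line of analysis (see, e.g., Cucker and Zhou [6]); this is only a sketch, and both the level $B > 0$ and the paper's own (novel) operator may differ:

$$\pi_B(f)(x) = \begin{cases} B, & \text{if } f(x) > B, \\ f(x), & \text{if } -B \le f(x) \le B, \\ -B, & \text{if } f(x) < -B. \end{cases}$$

Since $\operatorname{sign}(\pi_B(f)(x)) = \operatorname{sign}(f(x))$ whenever $f(x) \neq 0$, the projection leaves the induced classifier unchanged while restoring the boundedness needed for the concentration estimates behind the sample-error bound.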
[1] | P. L. Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Trans. Inform. Theory, 1998, 44(2), 525–536. doi: 10.1109/18.661502 |
[2] | A. Benabid, J. Fan and D. Xiang, Comparison theorems on large-margin learning, Int. J. Wavelets Multiresolution Inf. Process., 2021, 19(05), 2150015 (18 pages). doi: 10.1142/S0219691321500156 |
[3] | B. E. Boser, I. M. Guyon and V. N. Vapnik, A training algorithm for optimal margin classifiers, in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992, 144–152. |
[4] | D. Chen, Q. Wu, Y. Ying and D. Zhou, Support vector machine soft margin classifiers: error analysis, J. Mach. Learn. Res., 2004, 5, 1143–1175. |
[5] | C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, 1995, 20, 273–297. |
[6] | F. Cucker and D. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press, 2007. |
[7] | D. E. Edmunds and H. Triebel, Function Spaces, Entropy Numbers, Differential Operators, 180, Cambridge University Press, 1996. |
[8] | J. Fan and D. Xiang, Quantitative convergence analysis of kernel based large-margin unified machines, Commun. Pure Appl. Anal., 2020, 19(8), 4069–4083. doi: 10.3934/cpaa.2020180 |
[9] | Y. Feng, J. Fan and J. Suykens, A statistical learning approach to modal regression, J. Mach. Learn. Res., 2020, 21(2), 1–35. |
[10] | Q. Guo and P. Ye, Error analysis of least-squares $l^q$-regularized regression learning algorithm with the non-identical and dependent samples, IEEE Access, 2018, 6, 43824–43829. doi: 10.1109/ACCESS.2018.2863600 |
[11] | X. Guo, T. Hu, Q. Wu, et al., Distributed minimum error entropy algorithms, J. Mach. Learn. Res., 2020, 21, 1–31. |
[12] | X. Guo, L. Li and Q. Wu, Modeling interactive components by coordinate kernel polynomial models, Math. Found. Comput., 2020, 3(4), 263–277. doi: 10.3934/mfc.2020010 |
[13] | Z. Guo and L. Shi, Classification with non-i.i.d. sampling, Math. Comput. Modelling, 2011, 54(5–6), 1347–1364. doi: 10.1016/j.mcm.2011.03.042 |
[14] | T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, New York, 2001. |
[15] | A. Khaleghi and G. Lugosi, Inferring the mixing properties of an ergodic process, arXiv: 2106.07054, 2021. |
[16] | Y. Liu, H. Zhang and Y. Wu, Hard or soft classification? Large-margin unified machines, J. Am. Stat. Assoc., 2011, 106(493), 166–177. doi: 10.1198/jasa.2011.tm10319 |
[17] | J. S. Marron, M. J. Todd and J. Ahn, Distance-weighted discrimination, J. Am. Stat. Assoc., 2007, 102(480), 1267–1271. doi: 10.1198/016214507000001120 |
[18] | L. Peng, Y. Zhu and W. Zhong, Lasso regression in sparse linear model with $\varphi$-mixing errors, Metrika, 2022, 1–26. |
[19] | S. Smale and D. Zhou, Online learning with Markov sampling, Anal. Appl., 2009, 7(01), 87–113. doi: 10.1142/S0219530509001293 |
[20] | I. Steinwart and A. Christmann, Fast learning from non-i.i.d. observations, Adv. Neural Inf. Process. Syst., 2009, 22, 1–9. |
[21] | I. Steinwart and A. Christmann, Estimating conditional quantiles with the help of the pinball loss, Bernoulli, 2011, 17(1), 211–225. |
[22] | I. Steinwart and C. Scovel, Fast rates for support vector machines using Gaussian kernels, Ann. Stat., 2007, 35(2), 575–607. |
[23] | H. Sun and Q. Wu, Regularized least square regression with dependent samples, Adv. Comput. Math., 2010, 32(2), 175–189. doi: 10.1007/s10444-008-9099-y |
[24] | R. C. Williamson, A. J. Smola and B. Schölkopf, Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators, IEEE Trans. Inform. Theory, 2001, 47(6), 2516–2532. doi: 10.1109/18.945262 |
[25] | K. Wong, Z. Li and A. Tewari, Lasso guarantees for $\beta$-mixing heavy-tailed time series, Ann. Stat., 2020, 48(2), 1124–1142. |
[26] | Q. Wu, Y. Ying and D. Zhou, Learning rates of least-square regularized regression, Found. Comput. Math., 2006, 6(2), 171–192. doi: 10.1007/s10208-004-0155-9 |
[27] | Q. Wu, Y. Ying and D. Zhou, Multi-kernel regularized classifiers, J. Complex., 2007, 23(1), 108–134. doi: 10.1016/j.jco.2006.06.007 |
[28] | D. Xiang, Logistic classification with varying Gaussians, Comput. Math. Appl., 2011, 61(2), 397–407. doi: 10.1016/j.camwa.2010.11.016 |
[29] | D. Xiang, Conditional quantiles with varying Gaussians, Adv. Comput. Math., 2013, 38(4), 723–735. doi: 10.1007/s10444-011-9257-5 |
[30] | D. Xiang and D. Zhou, Classification with Gaussians and convex loss, J. Mach. Learn. Res., 2009, 10, 1447–1468. |
[31] | Y. Xu and D. Chen, Learning rates of regularized regression for exponentially strongly mixing sequence, J. Stat. Plan. Inference, 2008, 138(7), 2180–2189. doi: 10.1016/j.jspi.2007.09.003 |
[32] | B. Yu, Rates of convergence for empirical processes of stationary mixing sequences, Ann. Probab., 1994, 22, 94–116. |
[33] | D. Zhou, The covering number in learning theory, J. Complex., 2002, 18(3), 739–767. doi: 10.1006/jcom.2002.0635 |
[34] | D. Zhou, Capacity of reproducing kernel spaces in learning theory, IEEE Trans. Inform. Theory, 2003, 49(7), 1743–1752. doi: 10.1109/TIT.2003.813564 |