Article Information

  • Title: Provable Boolean interaction recovery from tree ensemble obtained via random forests
  • Authors: Merle Behr; Yu Wang; Xiao Li
  • Journal: Proceedings of the National Academy of Sciences
  • Print ISSN: 0027-8424
  • Electronic ISSN: 1091-6490
  • Publication year: 2022
  • Volume: 119
  • Issue: 22
  • DOI: 10.1073/pnas.2118636119
  • Language: English
  • Publisher: The National Academy of Sciences of the United States of America
  • Significance: Random Forests (RFs) are among the most successful machine-learning algorithms in terms of prediction accuracy. In many domain problems, however, the primary goal is not prediction but understanding the data-generation process, in particular finding important features and feature interactions. There exists strong empirical evidence that RF-based methods, in particular iterative RF (iRF), are very successful at detecting feature interactions. In this work, we propose a biologically motivated Boolean interaction model. Using this model, we complement the existing empirical evidence with theoretical evidence for the ability of iRF-type methods to select desirable interactions. Our theoretical analysis also yields deeper insights into the general interaction selection mechanism of decision-tree algorithms and the importance of feature subsampling.
  • Abstract: Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequently, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
  • Keywords: decision trees; interaction selection; ensemble methods; consistency; interpretable machine learning
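
The abstract describes the LSS regression function as a linear combination of piecewise constant Boolean interaction terms over signed features. A minimal sketch of what such a function could look like follows. It is an illustration based only on that one-sentence description, not the paper's formal definition: the names `lss_regression`, `simulate`, and `EXAMPLE_TERMS`, the uniform feature distribution, the Gaussian noise, and all coefficients and thresholds are assumptions made here.

```python
import random


def lss_regression(x, intercept, terms):
    """Evaluate a linear combination of Boolean interaction terms at x.

    terms is a list of (coefficient, interaction) pairs, where each
    interaction is a list of (feature_index, threshold, sign) triples.
    sign > 0 requires x[feature_index] > threshold; otherwise the
    feature must be at or below the threshold. An interaction is active
    only if every one of its signed features satisfies its condition.
    """
    value = intercept
    for coef, interaction in terms:
        active = all(
            (x[k] > thr) if sign > 0 else (x[k] <= thr)
            for k, thr, sign in interaction
        )
        if active:
            value += coef
    return value


def simulate(n, p, intercept, terms, noise_sd=0.1, seed=0):
    """Draw n samples with p uniform(0, 1) features and noisy responses.

    The uniform features and Gaussian noise are illustrative choices.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.random() for _ in range(p)]
        y = lss_regression(x, intercept, terms) + rng.gauss(0.0, noise_sd)
        data.append((x, y))
    return data


# Hypothetical example: two interactions on disjoint feature sets.
EXAMPLE_TERMS = [
    (2.0, [(0, 0.5, +1), (1, 0.5, +1)]),  # X0 > 0.5 AND X1 > 0.5
    (1.0, [(2, 0.3, -1), (3, 0.7, +1)]),  # X2 <= 0.3 AND X3 > 0.7
]
```

Calling `simulate(1000, 10, 0.0, EXAMPLE_TERMS)` would produce a data set in which only features 0 to 3 matter and the response jumps by a fixed amount whenever all thresholds of an interaction are crossed, which is the kind of discontinuous, thresholded behavior the abstract refers to.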