Predicting Changes in Earnings: A Walk Through a Random Forest


University of Arkansas, Fayetteville
ScholarWorks@UARK
Theses and Dissertations
8-2018

Predicting Changes in Earnings: A Walk Through a Random Forest
Joshua Hunt, University of Arkansas, Fayetteville

Follow this and additional works at: http://scholarworks.uark.edu/etd
Part of the Accounting Commons

Recommended Citation:
Hunt, Joshua, "Predicting Changes in Earnings: A Walk Through a Random Forest" (2018). Theses and Dissertations. 2856. http://scholarworks.uark.edu/etd/2856

This dissertation is brought to you for free and open access by ScholarWorks@UARK. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of ScholarWorks@UARK. For more information, please contact scholar@uark.edu or ccmiddle@uark.edu.

Predicting Changes in Earnings: A Walk Through a Random Forest

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Business Administration with a concentration in Accounting

by

Joshua O'Donnell Sebastian Hunt
Louisiana Tech University, Bachelor of Science in Mathematics, 2007
Louisiana Tech University, Master of Arts in Teaching, 2011
University of Arkansas, Master of Accountancy, 2013
University of Arkansas, Master of Science in Statistics and Analytics, 2017

August 2018
University of Arkansas

This dissertation is approved for recommendation to the Graduate Council.

Vern Richardson, Ph.D., Dissertation Director
James Myers, Ph.D., Committee Member
Cory Cassell, Ph.D., Committee Member
David Douglass, Ph.D., Committee Member

Abstract

This paper investigates whether the accuracy of models used in accounting research to predict categorical dependent variables (classification) can be improved by using a data analytics approach. This topic is important because accounting research makes extensive use of classification in many different research streams that are likely to benefit from improved accuracy. Specifically, this paper investigates whether the out-of-sample accuracy of models used to predict future changes in earnings can be improved by considering whether the assumptions of the models are likely to be violated and whether alternative techniques have strengths that are likely to make them a better choice for the classification task. I begin my investigation using logistic regression to predict positive changes in earnings using a large set of independent variables. Next, I implement two separate modifications to the standard logistic regression model, stepwise logistic regression and elastic net, and examine whether these modifications improve the accuracy of the classification task. Lastly, I relax the logistic regression parametric assumption and examine whether random forest, a nonparametric machine learning technique, improves the accuracy of the classification task. I find little difference in the accuracy of the logistic regression-based models; however, I find that random forest has consistently higher out-of-sample accuracy than the other models. I also find that a hedge portfolio formed on predicted probabilities using random forest earns larger abnormal returns than hedge portfolios formed using the logistic regression-based models. In subsequent analysis, I consider whether the documented improvements exist in an alternative classification setting: financial misstatements.
I find that random forest's out-of-sample area under the receiver operating characteristic curve (AUC) is significantly higher than that of the logistic-based models. Taken together, my findings suggest that the accuracy of classification models used in accounting research can be improved by considering the strengths and weaknesses of different classification models and considering whether machine learning models are appropriate.

Acknowledgements

I would like to thank my mother, Catherine Hunt, who not only taught me how to read, but also instilled in me the importance of education and cultivated my love of learning from an early age.

Table of Contents

Introduction
Algorithms
    Logistic Regression
    Stepwise Logistic Regression
    Elastic Net
    Cross-Validation
    Random Forest
Data and Methods
Results
    Main Analyses
    Additional Analyses
    Additional Misstatement Analyses
Conclusion
References
Appendices
Tables
Figures

1. Introduction

The goal of this paper is to show that accounting researchers can improve the accuracy of classification (using models to predict categorical dependent variables) by considering whether the assumptions of a particular classification technique are likely to be violated and whether an alternative classification technique has strengths that are likely to make it a better choice for the classification task. Accounting research makes extensive use of classification in a variety of research streams. One of the most common classification techniques used in accounting research is logistic regression. However, logistic regression is not the only classification technique available, and each technique has its own set of assumptions and its own strengths and weaknesses. Using a data analytics approach, I investigate whether the out-of-sample accuracy of predicting changes in earnings can be improved by considering limitations found in a logistic regression model and addressing those limitations with alternative classification techniques.

I begin my investigation by predicting positive versus negative changes in earnings for several reasons. First, prior accounting research uses statistical approaches to predict changes in earnings that focus on methods rather than theory, providing an intuitive starting point for my investigation (Ou and Penman 1989a, 1989b; Holthausen and Larcker 1992). While data analytics has advanced since the time of these papers, the statistical nature of their approach fits well with a data analytics approach, which tends to take a more statistical, results-driven approach to prediction tasks relative to traditional accounting research. Second, changes in earnings form a more balanced dataset with respect to the dependent variable than many of the other binary dependent variables in the accounting literature (e.g., the incidence of fraud, misstatements, going concerns, or bankruptcy). Positive earnings changes range from 40 to 60 percent prevalence in a given year in my dataset. Logistic regression can achieve high accuracy in unbalanced datasets, but this accuracy may have little meaning because of the nature of the data. For example, in a dataset of 100 observations with only 5 occurrences of a positive outcome, one can achieve high accuracy (95 percent in this example) without correctly classifying any of the positive outcomes. Third, focusing on predicting changes in earnings allows me to use a large dataset which, in turn, allows me to use a large set of independent variables. Lastly, changes in earnings are also likely to be of interest to investors and regulators because of their relationship to abnormal returns (Ou and Penman 1989b; Abarbanell and Bushee 1998).
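The class-imbalance concern above can be made concrete with a toy calculation. The following sketch simply restates the 100-observation example in code; the counts come from the example itself, and nothing here is specific to the thesis's data:

```python
import numpy as np

# 100 observations, 5 positives: a classifier that always predicts
# "no positive change" is right 95 times out of 100.
y_true = np.array([1] * 5 + [0] * 95)
y_pred = np.zeros(100, dtype=int)                # always predict the majority class

accuracy = (y_true == y_pred).mean()             # 0.95
true_positive_rate = y_pred[y_true == 1].mean()  # 0.0: no positives classified correctly
print(accuracy, true_positive_rate)
```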
Logistic regression is the first algorithm I investigate because of its prevalent use in the accounting literature. Logistic regression uses a maximum likelihood estimator, an iterative process, to find the parameter estimates. Logistic regression has several assumptions. [1] First, logistic regression requires a binary dependent variable. Second, logistic regression requires that the model be correctly specified, meaning that no important variables are excluded from the model and no extraneous variables are included in the model. Third, logistic regression is a parametric classification algorithm, meaning that the log odds of the dependent variable must be linear in the parameters. I use a large number of independent variables chosen because of their use in prior literature. [2] This makes it more likely that extraneous variables are included in the model, violating the second logistic regression assumption.

[1] I only discuss a limited number of the assumptions for logistic regression here. More detail is provided on all of the assumptions in the logistic regression section.
[2] Ou and Penman (1989b) begin with 68 independent variables and Holthausen and Larcker (1992) use 60 independent variables. My independent variables are based on these as well as 11 from Abarbanell and Bushee (1998).
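As a concrete reference point, here is a minimal sketch of fitting a logistic regression by iterative maximum likelihood, using simulated data. The predictors and coefficients are placeholders, not the thesis's actual variables; the point is the parametric form, in which the log odds are linear in the parameters:

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-ins for financial-statement predictors; the thesis's
# actual variable set is far larger.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
log_odds = 0.8 * X[:, 0] - 0.5 * X[:, 1]   # linear in the parameters
p = 1.0 / (1.0 + np.exp(-log_odds))
y = rng.binomial(1, p)                     # binary dependent variable

# Logit estimates the coefficients by maximum likelihood,
# an iterative (Newton-type) optimization.
model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(model.params)                        # estimates on the log-odds scale
```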
To address this potential problem, I implement stepwise logistic regression, following prior literature (Ou and Penman 1989b; Holthausen and Larcker 1992; Dechow, Ge, Larson, and Sloan 2011). The model begins with all the input variables and each variable is dropped one at a time. The Akaike information criterion (AIC) is used to test whether dropping a variable results in an insignificant change in model fit, and if so, the variable is permanently deleted. This is repeated until the model contains only variables that change the model fit significantly when dropped. [3]

While stepwise logistic regression makes it less likely that extraneous variables are included in the model, it has several weaknesses. First, the stepwise procedure performs poorly in the presence of collinear variables (Judd and McClelland 1989). This can be a concern with a large set of independent variables. Second, the resulting coefficients are inflated, which may affect out-of-sample predictions (Tibshirani 1996). Third, the measures of overall fit, z-statistics, and confidence intervals are biased (Pope and Webster 1972; Wilkinson 1979; Whittingham, Stephens, Bradbury, and Freckleton 2001). [4]

I implement elastic net to address the first two weaknesses of stepwise logistic regression (multicollinearity and inflated coefficients). Elastic net is a logistic regression with added constraints. It combines the Least Absolute Shrinkage and Selection Operator (lasso) and ridge regression constraints. Lasso is an L1 penalty function that selects important variables by shrinking coefficients toward zero (Tibshirani 1996). [5] Ridge regression also shrinks coefficients, but uses an L2 penalty function and does not zero out coefficients (Hoerl and Kennard 1970). [6] Lasso performs poorly with collinear variables while ridge regression does not. Elastic net combines the L1 and L2 penalties, essentially performing ridge regression to overcome lasso's weaknesses and then lasso to eliminate irrelevant variables.

[3] This is an example of backward elimination. Stepwise logistic regression can also use forward elimination or a combination of backward and forward elimination. I use backward elimination because it is similar to what has been used in prior literature (Ou and Penman 1989b; Holthausen and Larcker 1992; Dechow et al. 2011).
[4] Coefficients tend to be inflated because the stepwise procedure overfits the model to the data. The procedure attempts to ensure that only variables that improve fit are included based on the current dataset, and this causes the coefficients to be larger than their true parameter estimates. Similarly, the model fit statistics are inflated. The z-statistics and confidence intervals tend to be incorrectly specified due to degrees-of-freedom errors and because these statistical tests are classical statistics that do not take into account prior runs of the model.
[5] An L1 penalty function penalizes the model for complexity based on the absolute value of the coefficients.
[6] An L2 penalty function penalizes the model for complexity based on the sum of the squared coefficients.
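A minimal sketch of both variable-selection approaches follows, assuming a pandas DataFrame X of predictors and a binary Series y. The function name, stopping rule details, and elastic net tuning values are illustrative assumptions, not the thesis's specification (in practice the penalty parameters would be tuned by cross-validation):

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

def backward_eliminate(X: pd.DataFrame, y: pd.Series) -> list:
    """Backward elimination by AIC: repeatedly drop the variable whose
    removal lowers AIC the most; stop when no single drop lowers AIC."""
    cols = list(X.columns)
    best_aic = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0).aic
    while len(cols) > 1:
        trial_aics = {
            c: sm.Logit(y, sm.add_constant(X[[k for k in cols if k != c]])).fit(disp=0).aic
            for c in cols
        }
        drop = min(trial_aics, key=trial_aics.get)
        if trial_aics[drop] >= best_aic:   # no deletion improves fit; stop
            break
        cols.remove(drop)
        best_aic = trial_aics[drop]
    return cols

# Elastic net logistic regression: L1 and L2 penalties in one estimator.
# l1_ratio mixes the penalties (0 = pure ridge, 1 = pure lasso);
# C is the inverse penalty strength.
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=10_000)
```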
Logistic regression, stepwise logistic regression, and elastic net are all parametric models subject to the assumption that the independent variables are linearly related to the log odds of the dependent variable (the third logistic regression assumption). Given that increasing (decreasing) a particular financial ratio may not equate to a linear increase (decrease) in the log odds of a positive change in earnings, it is not clear that the relationship is linear. To address this potential weakness, I implement random forest, a nonparametric model. The basic idea of random forest was first introduced by Ho (1995), and the algorithm now known as random forest was implemented by Breiman (2001). Since then it has been used in biomedical research, chemical research, genetic research, and many other fields (Díaz-Uriarte and De Andres 2006; Svetnik, Liaw, Tong, Culberson, Sheridan, and Feuston 2003; Palmer, O'Boyle, Glen, and Mitchell 2007; Bureau, Dupuis, Falls, Lunetta, Hayward, Keith, and Van Eerdewegh 2005).

Random forest is a decision tree-based algorithm that averages multiple decision trees. Decision trees are formed on random samples of the training dataset, and random subsets of the independent variables are used in forming the individual decision trees. [7] Many decision trees are formed with different predictor variables, and these trees remain unpruned. [8] Each tree is formed on a different bootstrapped sample of the training data. These procedures help ensure that the decision trees are not highly correlated and reduce variability; highly correlated decision trees in the forest would make the estimation less reliable. A brief sketch of this procedure follows the notes below.

[7] A training dataset refers to the in-sample dataset used to form estimates that are tested on the out-of-sample dataset. In my setting, I use rolling five-year windows as the training set and test out-of-sample accuracy on the sixth year.
[8] Pruning a decision tree refers to removing branches that have little effect on overall accuracy. This helps reduce overfitting.
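The sketch below assumes a pandas DataFrame panel with a year column, a list of predictor columns features, and a binary indicator up for positive earnings changes; these names and the hyperparameter values are illustrative assumptions, not the thesis's tuned specification. It pairs the forest with the rolling five-year training windows described in note [7]:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rolling_forest_accuracy(panel: pd.DataFrame, features: list, target: str = "up") -> dict:
    """Train on rolling five-year windows and score on the sixth year."""
    accuracies = {}
    years = sorted(panel["year"].unique())
    for test_year in years[5:]:
        train = panel[(panel["year"] >= test_year - 5) & (panel["year"] < test_year)]
        test = panel[panel["year"] == test_year]
        rf = RandomForestClassifier(
            n_estimators=500,      # many trees, left unpruned (no max_depth)
            max_features="sqrt",   # random predictor subset at each split
            bootstrap=True,        # each tree sees its own bootstrap sample
            n_jobs=-1,
            random_state=0,
        )
        rf.fit(train[features], train[target])
        accuracies[test_year] = rf.score(test[features], test[target])
        # rf.predict_proba(test[features])[:, 1] would supply the predicted
        # probabilities on which a hedge portfolio could be formed.
    return accuracies
```

The random predictor subsets and per-tree bootstrap samples are what decorrelate the trees, which is the property the preceding paragraph identifies as the source of the variance reduction.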