Interval estimation of construction cost at completion using least squares support vector machine.
Cheng, Min-Yuan ; Hoang, Nhat-Duc
Introduction
In the construction industry, project success is constantly threatened by
the high uncertainty of the operational environment. Thus, it is not
surprising that construction projects frequently suffer cost overruns
(Nassar et al. 2005). In order to operate profitably, construction
companies must frequently evaluate project cost at completion to detect
deviations and to carry out appropriate responses. However, construction
firms typically focus on budget planning during the initial project
stage, which practically ignores the impact of engineering cost changes
and information updates during construction (Cheng et al. 2010). This
fact prevents effective project cost control and detection of potential
problems. Therefore, cost estimation is a crucial task and it needs to
be carried out at various stages of a project (Liu, Zhu 2007). Moreover,
the accuracy of construction cost estimation is a critical factor in the
success of the project (Kim et al. 2004). Poor cost estimation may
result in profit loss and occasionally leads to project failure.
Due to its importance, various predictive methods have been
proposed for cost estimation. Approaches that are applicable to cost
estimation range from statistics based multivariable regression analysis
to machine-learning techniques such as Classification and Regression
Trees (CART), the M5 model tree (M5-MT), Artificial Neural Network (ANN),
Support Vector Machines (SVM), and Least Squares Support Vector Machine
(LS-SVM).
Multivariable regression analysis is a very powerful statistical
tool that can be used as both an analytical and a predictive technique
in assessing the contribution of potential new items to the overall
estimation, although it is limited in modeling non-linear relationships
(Kim et al. 2004). In addition, when the number of input variables
becomes considerably large, the prediction performance of this method
often deteriorates significantly.
CART (Breiman et al. 1984) is a classification method which
utilizes historical data to construct decision trees. A CART model that
forecasts the value of continuous variables from a set of input
variables is known as a regression-type model (Razi, Athappilly 2005).
One major advantage of the decision tree based model is its ability to
handle small-size data set. Moreover, CART can mitigate the negative
effect of outliers because the model is capable of isolating the
outliers in a separate node. However, one disadvantage of CART is that
it may produce unstable decision trees (Timofeev 2004). The reason is
that an insignificant modification of the learning sample can result in
radical changes in the decision tree. In addition, previous works (Razi,
Athappilly 2005) have indicated that prediction performance of CART can
be inferior to ANN.
A model tree (MT) is similar to a decision tree, but includes the
multivariate linear regression functions at the leaves and is able to
predict continuous numeric value attributes (Shrestha, Solomatine 2006;
Witten, Frank 2000; Kaluzny et al. 2011). The algorithm separates the
parameter space into subspaces and constructs a local linear regression
model in each of them. Thus, MT is, to some degree, similar to a
piecewise linear function. In the M5-MT, the nodes of the tree are
selected over the attribute that maximizes the expected error reduction
as a function of the standard deviation of output parameter (Bonakdar,
Etemad-Shahidi 2011). MT has proved capable of learning efficiently and
can tackle regression tasks with high dimensionality. Compared to other
machine learning techniques, the MT training process is relatively fast
and the results are interpretable
(Shrestha, Solomatine 2006).
ANN is a viable alternative for forecasting construction costs and
in practice, it has been used to construct various cost prediction
models (Hegazy, Ayed 1998; Zhu et al. 2010; Sonmez 2011). This method
eliminates the need to find a mapping relationship that mathematically
describes the construction cost as a function of input variables. When
the influence factors and the structure of ANN are all specified, the
task boils down to collecting a reasonable number of data to train the
ANN. However, the training process of ANN based models is often
time-consuming; and ANN also suffers from difficulties in selecting a
large number of controlling parameters such as hidden layer size,
learning rate, and momentum term (Bao et al. 2005).
Furthermore, one major disadvantage of ANN is that its training
process is achieved through a gradient descent algorithm on the error
space, which can be very complex and may contain many local minima
(Kiranyaz et al. 2009). Thus, the training of ANN is likely to be
trapped into a local minimum and this definitely hinders the forecasting
capability. To overcome such issue, evolutionary algorithms, such as
Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), can be
used to train the ANN model (Nasseri et al. 2008; Zhang et al. 2007). It
is because these advanced optimization techniques can significantly
reduce the chance of getting trapped in local minima. Hence, the
training process possibly settles in an optimum solution; nevertheless,
this cannot be guaranteed (Kiranyaz et al. 2009).
In the construction field, SVM has been utilized in cost estimation
(Cheng et al. 2010; Kong et al. 2008; An et al. 2007; Hongwei 2009). The
principles of SVM are based on the structural risk minimization and
statistical learning theory. The SVM based models also involve
identification of influence factors, collection of data sample, and
training/testing process. After the mapping function has been
established, the model is capable of predicting the future value of
project cost. The advantages of SVM are widely known including strong
inference capacity, excellent generalization, and accurate prediction
ability (Lam et al. 2009; Huang et al. 2004). Nevertheless, the SVM
training process entails solving a quadratic programming problem subject
to inequality constraints. Consequently, training SVM on large data sets
is computationally expensive (Guo, Bai 2009).
To overcome the drawback of SVM, LS-SVM has been proposed recently
by Suykens et al. (2002), Gestel et al. (2004), and Brabanter et al.
(2010). LS-SVM is a modified version of SVMs to alleviate the burden of
computational cost. In LS-SVM's training process, a least squares
cost function is proposed to obtain a linear set of equations in the
dual space. Consequently, to derive the solution, it is required to
solve a set of linear equations, instead of the quadratic programming as
in standard SVM. This linear system can be solved efficiently by
iterative methods such as conjugate gradient (Wang, Hu 2005). Studies
have been carried out to demonstrate the excellent generalization,
prediction accuracy, and fast computation of LS-SVM (Yu et al. 2009;
Samui, Kothari 2011; Chen et al. 2005). Despite its superiority,
the application of LS-SVM to construction cost estimation is still very
limited.
Additionally, when applying LS-SVM, it is recognizable that the
tuning parameters, namely regularization and kernel function parameters,
play an important role in establishing the predictive model (Yu et al.
2009; Suykens et al. 2002). These parameters control the model's
complexity, and they need to be determined properly via
cross-validation. In doing so, the main objective is to obtain an
optimal model that can explore the underlying input-output mapping
function and is capable of producing the best predictive performance on
new data (Bishop 2006). In this study, DE, a population-based stochastic
search engine proposed by Storn and Price (Price et al. 2005), is
employed in the cross-validation process to achieve such objective.
In practice, cost estimation in construction industry is often
stated in the form of a point forecast (Trost, Oberlender 2003;
Iranmanesh et al. 2007; Cheng et al. 2010; Zhu et al. 2010). However,
decision makers require not only accurate forecasting of certain
variables but also the uncertainty associated with the forecasts. Point
estimation does not take into account the various sources of uncertainty
that stem from the model itself, input variables, and tuning parameters.
Thus, incorporating prediction uncertainty into deterministic forecasts
can help improve the reliability and the credibility of the model
outputs (Shrestha, Solomatine 2006).
Various approaches (Wonnacott, T. H., Wonnacott, R. J. 1996;
Heskes 1997; Mencar et al. 2005; Brabanter et al. 2011) have been
introduced to achieve interval estimation. However, existing methods
also have many limitations such as requiring the prior distributions of
the uncertain input parameters or data and involving certain assumptions
about the data and error distribution. The accuracy and the credibility
of those approaches rely significantly on the precision of prior
information and their assumptions. Another class of methods for deriving
prediction intervals (PIs) relies on re-sampling, or the bootstrap. Although
bootstrap-based methods (Sonmez 2011; Stine 1985; Lam, Veall 2002) can
yield accurate predictions, they are notably characterized
by high computational cost.
Recently, a new framework for estimation of PI which is based on
machine learning technique has been established by Shrestha and
Solomatine (2006). The proposed method does not require any assumption
and prior knowledge of input data or model error distribution. Moreover,
it also does not demand intensive computational cost as in bootstrap
based methods. In their research work (Shrestha, Solomatine 2006), the
superiority of machine-learning based interval estimation (MLIE) over
traditional methods is exhibited.
Therefore, this study aims to propose an artificial intelligence
model, named EAC-LSPIM, that hybridizes several advanced techniques,
including LS-SVM, MLIE, and DE, to support project managers in construction
cost prediction. The newly built model incorporates the strengths and
mitigates the weaknesses of each individual technique. The research goal
is to build a model that can operate automatically without human
intervention and can deliver accurate and reliable forecasting results.
Equipped with this tool, it is expected that the tasks of cost control
and cost planning in the construction industry can be carried out
effectively.
The remaining part of this paper is organized as follows. The
second section of this paper reviews related research works on
estimating of construction cost at completion, LS-SVM, techniques for
achieving prediction intervals, and DE. In the third section, the
DE-based cross-validation process is introduced. The fourth section
describes the framework of the newly proposed model in detail.
Simulation and result comparison of the model are demonstrated in the
fifth section.
1. Review of pertinent literature
1.1. Estimate of project cost at completion
In construction management, estimating cost of work at completion
is vitally important for project success. To achieve this, project
managers often rely on Earned Value Management (EVM) methodology. EVM is
widely known as a management technique that relates resource planning to
schedules and to technical performance requirements (Abba 1997).
EVM comprises three essential components that support project
control: Planned Value (PV) or Budgeted Cost of Work Scheduled (BCWS),
Earned Value (EV) or Budgeted Cost of Work Performed (BCWP), and Actual
Cost (AC) or Actual Cost of Work Performed (ACWP). In the construction
industry, project managers emphasize the application of EVM as it
provides a tool for tracking project status and for measuring project
performance.
EVM is a systematic approach to forecast Estimate at Completion
(EAC). The role of EAC is accentuated due to the fact that managers or
planners can appraise the total project cost based on the estimated
value of EAC. Iranmanesh et al. (2007) point out that a correct and
timely EAC is essential for preventive responses during project
execution. If the EAC indicates a cost overrun, project managers can
use proper strategies to adjust the construction cost. In a situation of
cost overrun, project managers may carry out a value engineering
program for cost reduction, in which the scope or quality of some sections
of the project is decreased. Another option is to request an additional
budget to offset the overrun.
At every completion period, managers can extract data from progress
reports, calculate the project Earned Value (EV), and predict the EAC. The
EAC can be computed by formulas using cost management data provided by the
contractor in the Cost Performance Report or the Cost/Schedule Status
Report. The reliability of these reports depends on the degree to which
the contractor adheres to internal controls involving measuring
performance on a contract (Christensen 1993).
According to previous works by Christensen (Christensen 1993;
Christensen et al. 1995) and Chen (2008), determining an appropriate
estimation of EAC is an arduous task. To obtain EAC, managers need to
collect voluminous cost management data provided by the contractor in
progress reports, usually monthly reports. For the contractors, in order
to form the periodic report to the owner, their site engineers must
gather sufficient data summarized in the daily man-hours summary, daily
material summary, and daily equipment summary. Finally, one can use
various formulas to compute EAC based on the combination of several data
elements presented in the report: BCWS, BCWP, and ACWP.
To forecast the EAC, numerous index-based formulas have been
utilized. These formulas are divided into three categories: the
non-performance method, the performance method, and the composite method
(Christensen et al. 1995; Chen 2008; Cheng et al. 2010). Based on a
survey carried out by Christensen et al. (1995), the accuracy of
index-based formulas depends significantly on the type of system, and
the stage and phase of the project. This explains why the performance of a
particular formula might be quite acceptable in a certain case, while it
could be much worse in others (Cheng et al. 2010). Project planners
must employ their own judgment to ascertain the most appropriate EAC or a
range of reasonable EACs. Currently, there is no official guidance on
how to choose a suitable EAC formula for a specific
setting.
Besides index-based formulas, other EAC prediction methods are
based on regression analysis (Iranmanesh et al. 2007; Christensen et al.
1995). The regression-based formulas are typically derived using linear
or nonlinear univariate regression analysis (Christensen 1993). However,
methods based on traditional regression analysis also have disadvantages
such as their limitations in describing nonlinear relationships (An et
al. 2007). In addition, the number of influence factors for construction
cost estimate can be appreciable (Trost, Oberlender 2003; Cheng et al.
2010) and the underlying regression function is possibly very intricate.
That fact explains why EAC estimation based on traditional regression
analysis is not widely used in the industry (Christensen 1993).
Needless to say, the EAC prediction problem is complicated since it
involves voluminous construction data, a considerable number of influence
factors, and a complicated regression function. Thus, it is reasonable for
planners or managers to resort to more advanced methods, specifically
Artificial Intelligence (AI) methods, such as Artificial Neural Network
(ANN) and Least Squares Support Vector Machine (LS-SVM).
1.2. Least squares support vector machine for regression analysis
This section is dedicated to describing the LS-SVM's
mathematical formulation. Consider the following model of interest,
which underlies the functional relationship between a response variable
and one or more independent variables (Suykens et al. 2002; Wang, Hu
2005):
$y(x) = w^{T} \varphi(x) + b$, (1)

where: $x \in \mathbb{R}^{n}$, $y \in \mathbb{R}$, and
$\varphi(x): \mathbb{R}^{n} \rightarrow \mathbb{R}^{n_h}$ is the mapping to
the high-dimensional feature space. In LS-SVM for regression analysis,
given a training dataset $\{x_k, y_k\}_{k=1}^{N}$, the
optimization problem is formulated as follows:

minimize $J_p(w, e) = \frac{1}{2} w^{T} w + \gamma \frac{1}{2} \sum_{k=1}^{N} e_k^{2}$; (2)

subject to $y_k = w^{T} \varphi(x_k) + b + e_k$, $k = 1, \dots, N$,

where: $e_k \in \mathbb{R}$ are error variables and $\gamma > 0$
denotes a regularization constant.
In the above optimization problem, it is noted that the objective
function includes a sum of squared fitting errors and a regularization
term. This cost function is similar to the standard procedure for training
feedforward neural networks and is related to ridge regression (Wang,
Hu 2005). However, when w becomes infinite-dimensional, one cannot solve
this primal problem. Therefore, it is necessary to construct the
Lagrangian and derive the dual problem (Suykens et al. 2002).
The Lagrangian is given by:

$L(w, b, e; \alpha) = J_p(w, e) - \sum_{k=1}^{N} \alpha_k \{ w^{T} \varphi(x_k) + b + e_k - y_k \}$, (3)

where: $\alpha_k$ are Lagrange multipliers. The conditions for
optimality are given by:

$\frac{\partial L}{\partial w} = 0 \rightarrow w = \sum_{k=1}^{N} \alpha_k \varphi(x_k); \quad \frac{\partial L}{\partial b} = 0 \rightarrow \sum_{k=1}^{N} \alpha_k = 0; \quad \frac{\partial L}{\partial e_k} = 0 \rightarrow \alpha_k = \gamma e_k; \quad \frac{\partial L}{\partial \alpha_k} = 0 \rightarrow y_k = w^{T} \varphi(x_k) + b + e_k, \; k = 1, \dots, N$. (4)

After elimination of e and w, the following linear system is
obtained:

$\begin{bmatrix} 0 & 1_v^{T} \\ 1_v & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}$, (5)

where: $y = [y_1; \dots; y_N]$, $1_v = [1; \dots; 1]$, and
$\alpha = [\alpha_1; \dots; \alpha_N]$.

The kernel trick is then applied as follows:

$\Omega_{kl} = \varphi(x_k)^{T} \varphi(x_l) = K(x_k, x_l)$. (6)

The resulting LS-SVM model for function estimation is expressed as:

$y(x) = \sum_{k=1}^{N} \alpha_k K(x, x_k) + b$, (7)

where: $\alpha_k$ and $b$ are the solution of the linear system
(5). The kernel function most often utilized is the Radial Basis Function
(RBF) kernel, given as follows:

$K(x_k, x_l) = \exp(-\|x_k - x_l\|^{2} / 2\sigma^{2})$, (8)

where: $\sigma$ is the kernel function parameter.
In the case of the Radial Basis Function kernel, two
tuning parameters ($\gamma$, $\sigma$) need to be determined in
LS-SVM. The regularization parameter $\gamma$ controls the penalty
imposed on data points that deviate from the regression function.
Meanwhile, the kernel parameter $\sigma$ affects the smoothness of the
regression function. It is worth noting that a proper setting of these
tuning parameters is required to ensure desirable performance of the
prediction model (Suykens et al. 2002).
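To make the formulation concrete, the following is a minimal NumPy sketch of LS-SVM training and prediction via the dual system of Eqn (5) and the RBF kernel of Eqn (8). It is an illustration rather than the authors' implementation; the synthetic data and the values gamma = 100 and sigma = 1 are arbitrary choices.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    # Eqn (8): K(x_k, x_l) = exp(-||x_k - x_l||^2 / (2 sigma^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_train(X, y, gamma, sigma):
    # Solve the dual linear system of Eqn (5):
    # [ 0    1_v^T           ] [ b     ]   [ 0 ]
    # [ 1_v  Omega + I/gamma ] [ alpha ] = [ y ]
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                      # b, alpha

def lssvm_predict(X_train, alpha, b, X_new, sigma):
    # Eqn (7): y(x) = sum_k alpha_k K(x, x_k) + b
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Toy usage on synthetic one-dimensional data
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(50)
b, alpha = lssvm_train(X, y, gamma=100.0, sigma=1.0)
print(lssvm_predict(X, alpha, b, np.array([[0.5]]), sigma=1.0))
```

A direct solve is used here for brevity; as noted above, the same system can be solved iteratively (e.g. by conjugate gradient) for large N.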
1.3. Regression analysis with prediction intervals
1.3.1. Background
Regression analysis is the study of the function that underlies the
relation between the dependent variable Y and a vector x as the
independent variable (Olive 2007). A typical regression model can be
expressed as follows:
$Y_i = m(x_i) + e_i, \quad i = 1, \dots, n$, (9)

where: $m$ denotes a function of $x$ and $e_i$ is the prediction
error.
Various methods are used to find an estimate $\hat{m}$ of m. These
methods range from traditional techniques, such as the multiple linear
regression model and many time series, nonlinear, nonparametric, and
semiparametric models (Olive 2007), to various machine learning
techniques, such as M5-MT (Bhattacharya, Solomatine 2005; Jekabsons
2010), ANN (Zhu et al. 2010; Wong et al. 1997), SVM (Cheng et al. 2010;
Lu et al. 2009), and LS-SVM (Suykens et al. 2002; Brabanter et al.
2010).
Once the mapping function is obtained, the primary task is to
predict the future value of Y when a specific input x is presented to
the system. In point estimation, Y is expressed as a single value. On
the contrary, in interval estimation, the prediction result is given in
the form of an interval of possible values. In many situations, interval
estimation draws more attention than point estimation. The reason is
that decision makers require not only an accurate
forecast but also a measure of the inherent uncertainty of the forecasts
(Shrestha, Solomatine 2006).
[FIGURE 1 OMITTED]
Interval estimation provides the upper and lower limits between
which a pointwise value of the response variable is expected to lie with a
certain level of confidence (usually 95%). The range bounded by those
limits is known as the prediction interval (PI) (Fig. 1). Prediction
intervals as outputs are desirable since they provide a range of values
that most likely includes the actual value of the predicted variable.
In addition, one can employ prediction intervals to discern the accuracy
of the estimation provided by the model, and then decide to keep or
reject the result (Mencar et al. 2005).
1.3.2. Evaluating performance of prediction interval
Once the output with interval has been obtained, the Prediction
Interval Coverage Probability (PICP) can be utilized for performance
evaluation (Shrestha, Solomatine 2006; Khosravi et al. 2010). PICP
measures the proportion of data points lying within the PIs. In some
cases, the empirical PICP can be much lower than the pre-specified level
of confidence; this indicates that the derived PIs are not
reliable (Khosravi et al. 2011). Hence, the PICP is oftentimes expected to
be equal to or greater than the level of confidence, since this reflects
the reliability of the prediction results.
However, PICP is not the only metric for evaluating PIs. The reason
is that one can simply construct a very large PI to achieve the maximum
reliability of the prediction outcomes (e.g. 100%). Nevertheless,
extremely large PIs, in practice, reduce the usability of forecasting
results because the interval estimation does not convey any valuable
information for the decision-makers (Khosravi et al. 2011). Hence, to
guarantee the usability of the interval estimation, the Mean Width of
Prediction Interval (MPI) (Khosravi et al. 2010; Shrestha, Solomatine
2006), which measures the average width of the PIs, also needs to be
considered. Accordingly, a well-constructed PI should achieve a
balance between reliability and usability. Put differently, it is
desirable to obtain a high PICP together with a narrow MPI (Khosravi
et al. 2010, 2011). Nevertheless, these two criteria oftentimes conflict
with each other, and this makes interval estimation a challenging
problem. Due to its importance and challenge, studies have been dedicated
to establishing PIs for a variety of prediction models.
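As an illustration of the two metrics, the short sketch below computes PICP and MPI from arrays of observations and prediction limits; the numbers are invented for the example.

```python
import numpy as np

def picp(y_true, lower, upper):
    # Percentage of observations falling inside their prediction intervals
    inside = (y_true >= lower) & (y_true <= upper)
    return 100.0 * inside.mean()

def mpi(lower, upper):
    # Mean width of the prediction intervals
    return (upper - lower).mean()

y_obs = np.array([10.2, 11.0, 9.7, 10.8])
lo    = np.array([ 9.5, 10.1, 9.0, 10.0])
up    = np.array([11.0, 11.5, 10.5, 11.4])
print(picp(y_obs, lo, up), mpi(lo, up))   # 100.0 and 1.45
```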
1.3.3. Previous works on prediction interval estimation
Sonmez (2011) integrated neural networks with bootstrap prediction
intervals for range estimation of construction costs. In this approach,
neural networks are used for modeling the mapping function between the
influence factors and costs. Bootstrap method is utilized to quantify
the level of variability included in the estimated costs. However,
constructing interval estimates based on the bootstrap, although it
possibly produces accurate intervals, requires heavy computational expense
(Brabanter et al. 2011).
Mencar et al. (2005) proposed a method for estimating prediction
intervals for neuro-fuzzy networks such that the system provides an
estimate of the uncertainty associated with predicted output values.
This method does not require any strict assumption on the unknown
distribution of the data. However, the derived intervals are constant
throughout the input domain. This feature might not reflect the true
phenomenon in real-world time series data, in which the
inherent uncertainty may be distributed unequally over different periods
of time (Cheng, Roy 2011).
[FIGURE 2 OMITTED]
Another method for constructing PIs, based on a machine
learning approach, was established by Shrestha and Solomatine (2006). In
their study, the authors presented a method to estimate PIs via the
uncertainty of the model output. The crucial idea herein is that the
historical residuals between the model outputs and the corresponding
observed data can serve as quantitative indicators of the difference
between the model and the modeled real-world system, and thus provide
valuable information for evaluating the model uncertainty.
The machine learning based interval estimation (MLIE) approach
(Shrestha, Solomatine 2006) can be divided into five main steps (Fig.
2). First, the point estimation process is carried out. A regression
technique is employed to learn the underlying mapping function between
input data and outputs. Second, the input data points are separated into
different clusters that have similar historical residuals, which are
obtained from point estimation process, using fuzzy c-means clustering.
In the third step, prediction limits (PLs) for each cluster are computed
based on empirical distributions of the errors associated with all data
points of one cluster. In the next step, the PLs for each training data
point are then calculated according to their membership grades in each
cluster. In the final step, a machine learning (ML) technique can be
deployed to learn the underlying functions between the input data and
the computed PLs for training data. PLs for testing data can be inferred
using those underlying functions.
Another advantage of the MLIE method is its independence of the machine
learning technique. However, this approach lacks a mechanism for
selecting the tuning parameters of the regression machine appropriately.
Additionally, the performance of the proposed MLIE, in terms of
prediction accuracy and computational cost, can be enhanced
significantly by using a superior technique such as LS-SVM.
1.4. Differential Evolution optimization algorithm
This section describes the standard algorithm of Differential
Evolution (DE) proposed by Storn and Price (Price et al. 2005; Storn,
Price 1997). The algorithm (Fig. 3) consists of five main stages:
initialization, mutation, crossover, selection, and stopping condition
verification. Given that the problem at hand is to minimize a cost
function f(X), where the number of decision variables is D, we can
describe each stage of DE in detail.
1.4.1. Initialization
DE commences the search process by randomly generating NP
D-dimensional parameter vectors $X_{i,g}$, where $i = 1, 2, \dots, NP$ and g
denotes the current generation. In the original DE algorithm, NP does not
change during the optimization process (Storn, Price 1997). Moreover,
the initial population (at g = 0) ought to cover the entire search space
in a uniform manner. Thus, we can simply generate these individuals as
follows:
[FIGURE 3 OMITTED]
$X_{i,0} = LB + rand[0,1] \times (UB - LB)$, (10)

where: $X_{i,0}$ is decision vector i at the first
generation; $rand[0,1]$ denotes a uniformly distributed random number
between 0 and 1; LB and UB are the vectors of lower and upper
bounds for the decision variables.
1.4.2. Mutation
A vector in the current population (or parent) is called a target
vector. Hereafter, the terms parent and target vector are used
interchangeably. For each target vector, a mutant vector is created
according to the following equation (Storn, Price 1997):
$V_{i,g+1} = X_{r_1,g} + F(X_{r_2,g} - X_{r_3,g})$, (11)

where: $r_1$, $r_2$, and $r_3$ are three random indexes
lying between 1 and NP. These three randomly chosen integers are also
selected to be different from the index i of the target vector. F
denotes the mutation scale factor, which controls the amplification of
the differential variation between $X_{r_2,g}$ and $X_{r_3,g}$.
$V_{i,g+1}$ represents the newly created mutant vector.
1.4.3. Crossover
The crossover stage aims to diversify the current population by
exchanging components of target vector and mutant vector. In this stage,
a new vector, called the trial vector, is created. The trial vector is
also called the offspring, and it can be formed as follows:
$U_{j,i,g+1} = \begin{cases} V_{j,i,g+1} & \text{if } rand_j \le Cr \text{ or } j = rnb(i) \\ X_{j,i,g} & \text{otherwise} \end{cases}$, (12)

where: $U_{i,g+1}$ is the trial vector; j denotes the index of an
element of a vector; $rand_j$ is a uniform random number lying between 0
and 1; Cr is the crossover probability, which needs to be determined
by the user; rnb(i) is a randomly chosen index from {1, 2, ..., D} which
guarantees that at least one parameter from the mutant vector
($V_{j,i,g+1}$) is copied to the trial vector ($U_{j,i,g+1}$).
1.4.4. Selection
In this stage, the trial vector is compared to the target vector
(Price et al. 2005). If the trial vector can yield a lower objective
function value than its parent, then the trial vector replaces the
position of the target vector. The selection operator is expressed as
follows:
$X_{i,g+1} = \begin{cases} U_{i,g+1} & \text{if } f(U_{i,g+1}) \le f(X_{i,g}) \\ X_{i,g} & \text{otherwise} \end{cases}$. (13)
1.4.5. Stopping criterion verification
The optimization process terminates when the stopping criterion
is met. The user can set the type of this condition. Commonly, the maximum
number of generations ($G_{max}$) or the maximum number of function
evaluations (NFE) is used as the stopping condition. When the
optimization process terminates, the final solution is presented to the user.
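For concreteness, the following is a compact sketch of the DE/rand/1/bin scheme described by Eqns (10)-(13); the control parameter values (NP, F, Cr, G_max), the bound handling, and the toy objective are illustrative only.

```python
import numpy as np

def de_minimize(f, lb, ub, NP=20, F=0.8, Cr=0.9, G_max=100, seed=0):
    rng = np.random.default_rng(seed)
    D = len(lb)
    # Initialization, Eqn (10)
    X = lb + rng.random((NP, D)) * (ub - lb)
    fx = np.array([f(x) for x in X])
    for g in range(G_max):
        for i in range(NP):
            # Mutation, Eqn (11): three distinct indexes different from i
            r1, r2, r3 = rng.choice([k for k in range(NP) if k != i],
                                    size=3, replace=False)
            V = X[r1] + F * (X[r2] - X[r3])
            # Crossover, Eqn (12): binomial, with one guaranteed component
            j_rand = rng.integers(D)
            mask = rng.random(D) <= Cr
            mask[j_rand] = True
            U = np.where(mask, V, X[i])
            U = np.clip(U, lb, ub)        # keep the trial inside the bounds
            # Selection, Eqn (13)
            fu = f(U)
            if fu <= fx[i]:
                X[i], fx[i] = U, fu
    best = fx.argmin()
    return X[best], fx[best]

# Toy usage: minimize the sphere function in two dimensions
x_best, f_best = de_minimize(lambda x: (x ** 2).sum(),
                             lb=np.array([-5.0, -5.0]),
                             ub=np.array([5.0, 5.0]))
print(x_best, f_best)
```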
2. Differential evolution based cross-validation
As mentioned earlier, in machine learning, one important objective
is to construct a prediction model that can deliver the best
generalization. The reason is that the performance on the training data
set is not necessarily a good indicator of predictive performance on the
testing data due to the problem of over-fitting (Bishop 2006).
Over-fitting arises when a regression model fits the training set very
well, but performs poorly on the new data set.
[FIGURE 4 OMITTED]
Hence, to build a desirable prediction model, one commonly used
technique is S-fold cross-validation (Bishop 2006; Samarasinghe 2006;
Suykens et al. 2002). The training data is divided into S folds, and this
allows a proportion (S - 1)/S of the available data to be used for
training, while the remaining portion is used for assessing model
performance. However, one major disadvantage of cross-validation is that
the number of training runs that must be performed is increased by a
factor of S, and this can impose difficulty for models with high
computational expense in the training process (Bishop 2006). Moreover,
another challenge is that there might be infinitely many combinations of
the model's parameters. Thus, designing the combinations of parameters
for the cross-validation process is problematic and time-consuming.
Since our study employs LS-SVM as the regression machine, two
parameters need to be determined, namely the regularization parameter
$\gamma$ and the RBF kernel parameter $\sigma$. To avoid over-fitting and the
drawbacks of the traditional cross-validation approach, the new model
utilizes DE (Price et al. 2005) to automatically explore various
combinations of ($\gamma$, $\sigma$) and to identify the optimal set of
these tuning parameters. In the following section, the DE-based
cross-validation (Fig. 4) is described in detail.
In the data processing step, the training data set is divided
into S folds (e.g. 5 folds). In each run, one fold is used as a
validating set; meanwhile, the other folds are used for training the
model (Fig. 5). The tuning parameters of LS-SVM are initialized randomly
using Eqn (10). The lower bounds for $\gamma$ and $\sigma$ are both 0.001.
Meanwhile, the upper bounds for $\gamma$ and $\sigma$ are specified to be
10000 and 100, respectively.
[FIGURE 5 OMITTED]
In the LS-SVM training step, LS-SVM learns the regression
function between inputs and output for each run. These regression
functions take the form of Eqn (7). After the training
process, LS-SVM is applied to predict the outputs of the validating sets.
In order to determine the optimal set of tuning parameters, the
following objective function is used in the fitness evaluation
step:
$F_{fitness} = \sum_{k=1}^{5} \frac{E_{tr}^{k}}{5} + \sum_{k=1}^{5} \frac{E_{va}^{k}}{5}$, (14)

where: $E_{tr}^{k}$ and $E_{va}^{k}$ denote the training
error and the validating error, respectively, for the kth run. The training
and validating errors herein are Root Mean Squared Errors calculated as
follows:

$RMSE = \sqrt{\frac{\sum_{j=1}^{N} (Y_P^{j} - Y_A^{j})^{2}}{N}}$, (15)

where: $Y_P^{j}$ and $Y_A^{j}$ denote the predicted and
actual values for the jth output, and N is the number of training
data in each run.
The fitness function, in essence, represents the trade-off between
model generalization and model complexity. It is worth noting that a
close fit to the training set may reflect high model complexity, and
complex models tend to suffer from over-fitting (Bishop 2006;
Suykens et al. 2002). Thus, incorporating the error on the validating
data helps identify a model that balances a small
training error against good generalization.
In each generation, the DE optimization carries out the mutation,
crossover, and selection processes to guide the initial population toward
the final optimal solution. The search terminates when the current
generation g reaches the maximum number of generations $G_{max}$.
After being optimized, the prediction model is ready to be used in the
next step.
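A sketch of how the fitness function of Eqns (14)-(15) might be evaluated is given below. It reuses the lssvm_train/lssvm_predict functions from the earlier LS-SVM sketch and assumes a simple sequential fold split, which may differ from the authors' setup.

```python
import numpy as np

def rmse(y_pred, y_true):
    # Eqn (15)
    return np.sqrt(((y_pred - y_true) ** 2).mean())

def de_cv_fitness(params, X, y, S=5):
    gamma, sigma = params
    idx = np.arange(len(y))
    folds = np.array_split(idx, S)               # simple sequential split
    e_tr, e_va = [], []
    for k in range(S):
        va = folds[k]
        tr = np.setdiff1d(idx, va)
        b, alpha = lssvm_train(X[tr], y[tr], gamma, sigma)
        e_tr.append(rmse(lssvm_predict(X[tr], alpha, b, X[tr], sigma), y[tr]))
        e_va.append(rmse(lssvm_predict(X[tr], alpha, b, X[va], sigma), y[va]))
    # Eqn (14): mean training error plus mean validating error
    return np.mean(e_tr) + np.mean(e_va)

# Coupled with the DE sketch, using the bounds stated above:
# best, f_best = de_minimize(lambda p: de_cv_fitness(p, X, y),
#                            lb=np.array([0.001, 0.001]),
#                            ub=np.array([10000.0, 100.0]))
```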
3. Interval estimation of construction cost at completion using
LS-SVM inference model (EAC-LSPIM)
Figure 6 provides the overall picture of the model EAC-LSPIM.
Before describing the model in detail, it is noted that our study
benefits from previous research works of Chen (2008) and Cheng et al.
(2010) in identifying the influence factors for EAC prediction (Table
1), and of Shrestha and Solomatine (2006) in establishing the MLIE.
3.1. Input data
Historical data sets (Table 2) used in this paper were collected
from 13 reinforced concrete building projects executed between 2000 and
2007 by one construction company headquartered in Taipei City, Taiwan.
Building heights ranged from 9 to 17 stories (including underground
floors). Contract values ranged from NT$80 million to NT$1.1 billion. Total
floor areas for the projects ranged from 2,094 m² to 31,797
m². Construction durations varied between 15 and 63
months. The historical data sets were separated into training sets (projects 1
to 11) and testing sets (12 and 13). The training and testing data sets
consist of 262 and 44 data cases respectively. Table 3 provides
descriptive statistics of influencing factors as well as desired output
of the historical data. In Table 4, a sample of the 10 input variables
from project 2, which had 24 completion periods, is used to illustrate
the data set.
[FIGURE 6 OMITTED]
3.2. LS-SVM for point estimation of Estimate to Completion
Herein, LS-SVM is employed to learn the mapping function between the
model's inputs and output. Each 1 x 10 vector of influence factors
acts as an input for LS-SVM. The input vectors and observed values of
Estimate to Completion (ETC) serve as training data for obtaining the
prediction model. LS-SVM uses the regularization parameter ($\gamma$) and
RBF kernel parameter ($\sigma$) chosen by the DE-based
cross-validation process. After the training process, the model is
capable of inferring the unknown ETC value whenever new input information
is presented.
3.3. Estimate at Completion calculation
In this step, the actual cost percentage (AC) of completed jobs is
added to the Estimate to Completion (ETC) in order to obtain the
Estimate at Completion (EAC) value, as defined in Eqn (16):

$EAC_P = AC_P + ETC_P$, (16)

where: $EAC_P$ denotes the point estimation of EAC; $ETC_P$
represents the point estimation of ETC; $AC_P$ is the actual cost
percentage.
ETC is a value used to determine forecasted expenditures necessary
to complete remaining project work. AC percentage is a known value
defined as the ratio of actual construction cost (AC) value to the
Budget at Completion (BAC). It is noted that the BAC itself is the cost
of the project when all contracted works are completed. EAC replaces BAC
for the predicted total cost of the project at a specific period during
construction.
3.4. Estimation of EAC prediction interval
Prediction interval (PI) estimation is carried out in four steps.
First, the input dataset is separated into a certain number of clusters,
according to the distributions of the historical residuals obtained from
the point estimation, using the fuzzy c-means clustering algorithm (FCMC)
(Bezdek 1981). FCMC is an unsupervised machine learning technique
employed to separate data into different clusters. Notably, using this
technique, a data point can belong to many clusters, and the degrees of
belonging are quantified by fuzzy membership grades. In FCMC, the number
of clusters needs to be specified by the user. Commonly, the optimal
number of clusters is selected as the one that yields the clustering
performance with the smallest Xie-Beni index. For details of
FCMC and the selection of the cluster number, readers are referred to the
previous works of Xie and Beni (1991), and of Oliveira and Pedrycz (2007).
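For illustration, the following is a bare-bones fuzzy c-means iteration (with the common fuzzifier m = 2); the random initialization and fixed iteration count are simplifications relative to a production implementation, and the Xie-Beni based selection of c is omitted.

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                   # memberships of each point sum to 1
    for _ in range(n_iter):
        Um = U ** m
        # Cluster centers as membership-weighted means
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Squared distances of every point to every cluster center
        d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2) + 1e-12
        # Standard membership update for fuzzifier m
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)
    return centers, U

# Toy usage: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
centers, U = fuzzy_cmeans(X, c=2)
print(centers)                            # roughly [0, 0] and [3, 3]
```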
The next step is to compute the lower and upper PIs for each
cluster. Given a certain level of confidence (e.g. 95%, or $\alpha$ of
5%), the PIs for each cluster are calculated from the empirical
distributions of the corresponding historical residuals (e). To construct
a $(100 - \alpha)\%$ PI, the $(\alpha/2) \times 100$ and
$(1 - \alpha/2) \times 100$ percentile
values are taken from the empirical distribution of residuals for the
lower and upper PI, respectively (Fig. 7). The mathematical expressions for
calculating the lower and upper PIs for cluster i ($PI_{ci}^{L}$ and
$PI_{ci}^{U}$) are given as follows:

$PI_{ci}^{L} = e_j, \quad j: \sum_{k=1}^{j} \mu_{i,k} < \frac{\alpha}{2} \sum_{j=1}^{n} \mu_{i,j}$; (17)

$PI_{ci}^{U} = e_j, \quad j: \sum_{k=1}^{j} \mu_{i,k} > \left(1 - \frac{\alpha}{2}\right) \sum_{j=1}^{n} \mu_{i,j}$, (18)

where: j is the index of the sorted data point that satisfies the
corresponding inequality; $e_j$ denotes the historical residual of
sorted data point j; and $\mu_{i,j}$ is the membership grade of data
point j to cluster i.
[FIGURE 7 OMITTED]
In the third step, the PI for each training data point is
calculated as the membership-weighted mean of the cluster PIs:

$PI_{j}^{L} = \sum_{i=1}^{c} \mu_{i,j} \times PI_{ci}^{L}$; (19)

$PI_{j}^{U} = \sum_{i=1}^{c} \mu_{i,j} \times PI_{ci}^{U}$, (20)

where: $PI_{j}^{L}$ and $PI_{j}^{U}$ are the lower and upper
prediction intervals for data point j.
Finally, the prediction limits of EAC for each data point are computed
as follows:

$EAC_{j}^{L} = y_j + PI_{j}^{L}$; (21)

$EAC_{j}^{U} = y_j + PI_{j}^{U}$, (22)

where: $EAC_{j}^{L}$ and $EAC_{j}^{U}$ are the lower and upper
prediction limits of EAC for input data point j, and $y_j$ is its point
estimate.
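The sketch below illustrates Eqns (17)-(22) under the assumptions above: residuals are sorted, cluster limits are taken as weighted percentiles, and per-point limits are membership-weighted means. The variable names (resid, U, y_hat) are illustrative, and the percentile look-up is an approximation of the strict inequalities in Eqns (17)-(18).

```python
import numpy as np

def cluster_limits(resid, U, alpha=0.05):
    # Weighted empirical percentiles of the residuals per cluster,
    # approximating Eqns (17)-(18); U has shape (clusters, points)
    order = np.argsort(resid)
    e, W = resid[order], U[:, order]
    cum = np.cumsum(W, axis=1) / W.sum(axis=1, keepdims=True)
    c = len(U)
    lower = np.array([e[np.searchsorted(cum[i], alpha / 2.0)]
                      for i in range(c)])
    upper = np.array([e[np.searchsorted(cum[i], 1.0 - alpha / 2.0)]
                      for i in range(c)])
    return lower, upper

def point_limits(U, lower_c, upper_c):
    # Eqns (19)-(20): membership-weighted means of the cluster limits
    return U.T @ lower_c, U.T @ upper_c

# Toy usage with random residuals and memberships
rng = np.random.default_rng(2)
resid = rng.normal(0.0, 1.0, 200)
U = rng.random((3, 200)); U /= U.sum(axis=0)
lo_c, up_c = cluster_limits(resid, U)
pi_lower, pi_upper = point_limits(U, lo_c, up_c)
# Eqns (21)-(22): shift the limits by the point estimates y_hat, e.g.
# eac_lower = y_hat + pi_lower; eac_upper = y_hat + pi_upper
print(lo_c, up_c)
```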
3.5. LS-SVM for inference of EAC prediction limits
Once the prediction limits for each training data point are
obtained, LS-SVM is utilized to establish two regression functions that
model the relationship between the input data and their corresponding
prediction limits. The tuning parameters of LS-SVM in this step are also
selected via DE-based cross-validation. When the training process
finishes, the model is then capable of estimating lower and upper
prediction limits for new instances of input data.
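Assuming the earlier LS-SVM sketch and the per-point limits pi_lower and pi_upper from step 3.4, this step could be sketched as two further LS-SVM fits; all variable and parameter names here are hypothetical.

```python
# Assumes lssvm_train / lssvm_predict from the earlier sketch, the training
# inputs X_train, the limits pi_lower / pi_upper from step 3.4, and tuning
# parameters (hypothetical names) chosen by DE-based cross-validation.
b_L, alpha_L = lssvm_train(X_train, pi_lower, gamma_L, sigma_L)
b_U, alpha_U = lssvm_train(X_train, pi_upper, gamma_U, sigma_U)
# Prediction limits for new input data X_new
pl_lower = lssvm_predict(X_train, alpha_L, b_L, X_new, sigma_L)
pl_upper = lssvm_predict(X_train, alpha_U, b_U, X_new, sigma_U)
```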
3.6. Interval estimation of project cost at completion
In this step, the final model outputs ($EAC_P$,
$EAC_{j}^{L}$, and $EAC_{j}^{U}$) are presented. The interval
estimation of total cost is then available for the decision-making process.
Planners or managers can anticipate the cost required to complete the
project, together with its uncertainty, described in the form of
prediction intervals.
4. Simulation result and comparisons
After the training process, the proposed model, EAC-LSPIM, is
utilized to predict two testing projects (12 and 13). Projects 12 and 13
consist of 27 and 17 completion periods, respectively. To achieve
an interval forecast of EAC, the level of confidence is set at 95%, which
corresponds to an $\alpha$ of 5%. In order to evaluate the accuracy of EAC
point estimation, Root Mean Square Error (RMSE), Mean Absolute
Percentage Error (MAPE), and Mean Absolute Error (MAE) are employed. In
addition, to assess the performance of EAC interval estimation, PICP and
MPI are utilized.
The prediction results of EAC-LSPIM for the two testing projects are
illustrated in Tables 5 and 6, and Figures 8 and 9. In these tables and
figures, $EAC_A$ denotes the actual EAC. Meanwhile, $EAC_L$, $EAC_P$, and
$EAC_U$ represent the lower prediction limit, point estimation, and upper
prediction limit of EAC, respectively. Deviation is the error between the
point estimate of EAC and the actual EAC.
In the experiment in which projects 12 and 13 serve as testing
cases, the RMSE, MAPE, and MAE of the point estimates are 0.044, 3.741, and
0.034, respectively. The PICP and the MPI derived from EAC-LSPIM are
97.73% and 19.22, respectively. Since the level of confidence is set at
95%, the derived PICP is desirable, and this demonstrates the
reliability of the prediction results. Meanwhile, it can be observed
that the width of the PIs yielded by the proposed model is acceptable. On
average, the range of predicted EAC is 19.22%, and this is relatively
satisfactory in the operational environment of the construction industry,
which is often beset by uncertainty.
In order to validate the superiority of EAC-LSPIM, its performance
is compared to other benchmark approaches. It is noted that the newly
developed model is composed of LS-SVM, MLIE, and DE-based
cross-validation. For comparison, various machine learning techniques,
namely M5-MT, ANN, and LS-SVM, have been integrated with MLIE and applied
to interval prediction of EAC. For LS-SVM, the selection of tuning
parameters is achieved via the grid search approach (Suykens et al.
2002; Hsu et al. 2010). Using this approach, various pairs of
($\gamma$, $\sigma$) are tried, and the one with the best
cross-validation accuracy is chosen. Accordingly, the values of $\gamma$
and $\sigma$ obtained from the grid search method, for point estimation
of EAC, are 256 and 2.8, respectively. Meanwhile, the optimal values of
$\gamma$ and $\sigma$ found by DE are 251.4 and 3.9, respectively.
The result comparison is shown in Table 7. From Table 7, it is
observable that the proposed model, EAC-LSPIM, has achieved the best
result in point estimation of EAC, having the smallest RMSE, MAPE, and MAE
on the testing data. Moreover, the model also yields the most desirable
performance in interval estimation of project cost. Its prediction
interval has the highest PICP value (97.73%) with a relatively narrow MPI
(19.22) compared to the other outcomes.
[FIGURE 8 OMITTED]
[FIGURE 9 OMITTED]
To better demonstrate the performance of each benchmark model,
results for different combinations of the 13 projects for training and
testing have been added. In this experiment, 13 cases are
carried out; in each case, one project serves as the testing set and the
rest serve as training sets. As shown in Table 8, based on the average
prediction results, the proposed model achieves the most desirable outcome.
For point estimation, the RMSE, MAPE, and MAE of EAC-LSPIM are 3.812, 0.035,
and 0.042, respectively. Meanwhile, for interval estimation, the PICP
and MPI of the proposed model are 98.43% and 21.89. Observably,
EAC-LSPIM is the most accurate model in point forecasting of project cost.
Moreover, it also achieves the highest PICP value corresponding to a
relatively small value of MPI. These facts strongly demonstrate the
superiority of the new model over the other benchmark approaches.
Conclusion
This study proposes a new prediction model, named EAC-LSPIM, to
assist project managers in construction cost planning and monitoring. To
address the uncertainty in construction cost forecasting, this study
incorporates LS-SVM, MLIE, and DE to achieve interval forecasting of
construction project cost.
In EAC-LSPIM, the utilization of LS-SVM is twofold. First, LS-SVM
is used to infer the underlying function between input data and point
estimation of ETC. Second, it is employed to model the mapping
relationship between the input data and the prediction limits of EAC.
Moreover, by using MLIE, the new model derives the prediction
interval by evaluating the uncertainty inherent in the data set, without
any assumption or prior knowledge about the model's error distribution.
In order to avoid over-fitting, our study employs a DE search engine
in the cross-validation process. The DE-based cross-validation
successfully identifies the most appropriate set of tuning parameters
and eliminates the need for expertise or a trial-and-error process in
parameter setting.
Consequently, the proposed model has the capacity to operate
automatically without human intervention and domain knowledge. In
addition, simulation and performance comparison have demonstrated the
accuracy, the reliability, and the usability of EAC-LSPIM prediction.
Therefore, the newly established model has a great potential to assist
decision-makers in the field of construction management.
References
Abba, W. F. 1997. Earned value management: reconciling government
and commercial practices, Program Manager 26(1): 58-63.
An, S.-H.; Park, U. Y.; Kang, K. I.; Cho, M. Y.; Cho, H. H. 2007.
Application of support vector machines in assessing conceptual cost
estimates, Journal of Computing in Civil Engineering 21(4): 259-264.
http://dx.doi.org/10.1061/(ASCE)0887-3801(2007)21: 4(259)
Bao, Y. K.; Liu, Z. T.; Guo, L.; Wang, W. 2005. Forecasting stock
composite index by fuzzy Support Vector Machine regression, in Proc. of
the 4th International Conference on Machine Learning and Cybernetics,
18-21 August, 2005, Guangzhou, China, Vol. 6: 3535-3540.
http://dx.doi.org/10.1109/ICMLC.2005.1527554
Bezdek, J. C. 1981. Pattern recognition with fuzzy objective
function algorithms. Kluwer Academic Publishers Norwell, MA, USA. 256 p.
http://dx.doi.org/10.1007/978-1-4757-0450-1
Bhattacharya, B.; Solomatine, D. P. 2005. Neural networks and M5
model trees in modelling water level-discharge relationship,
Neurocomputing 63: 381-396.
http://dx.doi.org/10.1016/j.neucom.2004.04.016
Bishop, C. 2006. Pattern recognition and machine learning.
Singapore: Springer Science+Business Media, LLC.
Bonakdar, L.; Etemad-Shahidi, A. 2011. Predicting wave run-up on
rubble-mound structures using M5 model tree, Ocean Engineering 38(1):
111-118. http://dx.doi.org/10.1016/j.oceaneng.2010.09.015
Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. 1984.
Classification and regression trees. Chapman and Hall/CRC.
Chen, A.-L.; Wang, M.-L.; Liu, K. 2005. Prediction of the flow
stress for 30 MnSi steel using evolutionary least squares support vector
machine and mathematical models, in Proc. of the IEEE International
Conference on Industrial Technology ICIT, 14-17 December, 2005, Hong
Kong, China, 8968302.
Chen, T. L. 2008. Estimate at completion for construction projects
using evolutionary fuzzy neural inference model: MS Thesis. Department
of Construction Engineering, National Taiwan University of Science and
Technology.
Cheng, M.-Y.; Peng, H.-S.; Wu, Y.-W.; Chen, T.-L. 2010. Estimate at
completion for construction projects using evolutionary support vector
machine inference model, Automation in Construction 19(5): 619-629.
http://dx.doi.org/10.1016/j.autcon.2010.02.008
Cheng, M.-Y.; Roy, A. F. V. 2011. Evolutionary fuzzy decision model
for cash flow prediction using time-dependent support vector machines,
International Journal of Project Management 29(1): 56-65.
http://dx.doi.org/10.1016/j.ijproman.2010.01.004
Christensen, D. S. 1993. Determining an accuracy estimate at
completion, National Contract Management Journal 25: 17-25.
Christensen, D. S.; Antolini, R. C.; McKinney, J. W. 1995. A review
of estimate at completion research, Journal of Cost Analysis and
Management, 41-62.
De Brabanter, K.; De Brabanter, J.; Suykens, J. A. K.; De Moor, B.
2011. Approximate confidence and prediction intervals for least squares
support vector regression, IEEE Transactions on Neural Networks 22(1):
110-120. http://dx.doi.org/10.1109/TNN.2010.2087769
De Brabanter, K.; Karsmakers, P.; Ojeda, F.; Alzate, C.; De
Brabanter, J.; Pelckmans, K.; De Moor, B.; Vandewalle, J.; Suykens, J.
A. K. 2010. LS-SVMlab Toolbox User's Guide version 1.8. Internal
Report 10-146, ESAT-SISTA, K.U. Leuven (Leuven, Belgium).
Gestel, T. V.; Suykens, J. A. K.; Baesens, B.; Viaene, S.;
Vanthienen, J.; Dedene, G.; De Moor, B.; Vandewalle, J. 2004.
Benchmarking least squares support vector machine classifiers, Machine
Learning 54(1): 5-32.
http://dx.doi.org/10.1023/B:MACH.0000008082.80494.e0
Guo, Z.; Bai, G. 2009. Application of least squares support vector
machine for regression to reliability analysis, Chinese Journal of
Aeronautics 22(2): 160-166.
http://dx.doi.org/10.1016/s1000-9361(08)60082-5
Hegazy, T.; Ayed, A. 1998. Neural network model for parametric cost
estimation of highway projects, Journal of Construction Engineering and
Management 124(3): 210-218.
http://dx.doi.org/10.1061/(asce)0733-9364(1998)124: 3(210)
Heskes, T. 1997. Practical confidence and prediction intervals,
in Advances in Neural Information Processing Systems 9. Cambridge: MIT
Press, 176-182.
Huang, Z.; Chen, H.; Hsu, C. J.; Chen, W. H.; Wu, S. 2004. Credit
rating analysis with support vector machines and neural networks: a
market comparative study, Decision Support System 37(4): 543-558.
http://dx.doi.org/10.1016/S0167-9236(03)00086-1
Iranmanesh, H.; Mojir, N.; Kimiagari, S. 2007. A new formula to
"Estimate at Completion" of a project's time to improve
"Earned Value Management System", in Proc. of the IEEE
International Conference on Industrial Engineering and Engineering
Management, 2-4 December, 2007, Singapore, 1014-1017.
Jekabsons, G. 2010. M5 regression tree and model tree toolbox for
Matlab. Technical Report, Institute of Applied Computer Systems, Riga
Technical University.
Kaluzny, B. L.; Barbici, S.; Berg, G.; Chiomento, R.; Derpanis, D.;
Jonsson, U.; Shaw, R. H. A. D.; Smit, M. C.; Ramaroson, F. 2011. An
application of data mining algorithms for shipbuilding cost estimation,
Journal of Cost Analysis and Parametrics 4(1): 2-30.
http://dx.doi.org/10.1080/1941658x.2011.585336
Khosravi, A.; Nahavandi, S.; Creighton, D. 2010. Construction of
optimal prediction intervals for load forecasting problems, IEEE
Transactions on Power Systems 25(3): 1496-1503.
http://dx.doi.org/10.1109/TPWRS.2010.2042309
Khosravi, A.; Nahavandi, S.; Creighton, D.; Atiya, A. F. 2011.
Lower upper bound estimation method for construction of neural
network-based prediction intervals, IEEE Transactions on Neural Network
22(3): 337-346. http://dx.doi.org/10.1109/TNN.2010.2096824
Kim, G.-H.; An, S.-H.; Kang, K.-I. 2004. Comparison of construction
cost estimating models based on regression analysis, neural networks,
and case-based reasoning, Building and Environment 39(10): 1235-1242.
http://dx.doi.org/10.1016/j.buildenv.2004.02.013
Kiranyaz, S.; Ince, T.; Yildirim, A.; Gabbouj, M. 2009.
Evolutionary artificial neural networks by multi-dimensional particle
swarm optimization, Neural Networks 22(10): 1448-1462.
http://dx.doi.org/10.1016/j.neunet.2009.05.013
Kong, F.; Wu, X.-J.; Cai, L.-Y. 2008. A novel approach based on
support vector machine to forecasting the construction project cost, in
Proc. of the International Symposium on Computational Intelligence and
Design, 17-18 October, 2008, Wuhan, China, 1045-1076.
Lam, J. P.; Veall, M. R. 2002. Bootstrap prediction intervals for
single period regression forecasts, International Journal of Forecasting
18(1): 125-130. http://dx.doi.org/10.1016/s0169-2070(01)00112-1
Lam, K. C.; Palaneeswaran, E.; Yu, C.-Y. 2009. A support vector
machine model for contractor prequalification, Automation in
Construction 18(3): 321-329. http://dx.doi.org/10.1016/j.autcon.2008.09.007
Yu, L.; Chen, H.; Wang, S.; Lai, K. K. 2009. Evolving least squares
support vector machines for stock market trend mining, IEEE Transactions
on Evolutionary Computation 13(1): 87-102.
http://dx.doi.org/10.1109/TEVC.2008.928176
Liu, L.; Zhu, K. 2007. Improving cost estimates of construction
projects using phased cost factors, Journal of Construction Engineering
and Management 133(1): 91-95.
http://dx.doi.org/10.1061/(asce)0733-9364(2007)133: 1(91)
Lu, C.-J.; Lee, T.-S.; Chiu, C.-C. 2009. Financial time series
forecasting using independent component analysis and support vector
regression, Decision Support Systems 47(2): 115-125.
http://dx.doi.org/10.1016/j.dss.2009.02.001
Hongwei, M. 2009. An improved support vector machine based on rough
set for construction cost prediction, in Proc. of the International
Forum on Computer Science-Technology and Applications, 25-27 December,
2009, Chongqing, China, Vol. 2: 3-6.
http://dx.doi.org/10.1109/IFCSTA.2009.123
Mencar, C.; Castellano, G.; Fanelli, A. M. 2005. Deriving
prediction intervals for neuro-fuzzy networks, Mathematical and Computer
Modelling 42(7-8): 719-726. http://dx.doi.org/10.1016/j.mcm.2005.09.001
Nassar, K. M.; Nassar, W. M.; Hegab, M. Y. 2005. Evaluating cost
overruns of asphalt paving project using statistical process control
methods, Journal of Construction Engineering and Management 131(11):
1173-1178. http://dx.doi.org/10.1061/(ASCE)0733-9364(2005) 131:11(1173)
Nasseri, M.; Asghari, K.; Abedini, M. J. 2008. Optimized scenario
for rainfall forecasting using genetic algorithm coupled with artificial
neural network, Expert Systems with Applications 35(3): 1415-1421.
http://dx.doi.org/10.1016/j.eswa.2007.08.033
Olive, D. J. 2007. Prediction intervals for regression models,
Computational Statistics & Data Analysis 51(6): 3115-3122.
http://dx.doi.org/10.1016/j.csda.2006.02.006
Oliveira, J. V. D.; Pedrycz, W. 2007. Advances in fuzzy clustering
and its applications. John Wiley & Sons Ltd. 434 p.
http://dx.doi.org/10.1002/9780470061190
Price, K. V.; Storn, R. M.; Lampinen, J. A. 2005. Differential
evolution: a practical approach to global optimization. Berlin,
Heidelberg: Springer-Verlag.
Razi, M. A.; Athappilly, K. 2005. A comparative predictive analysis
of neural networks (NNs), nonlinear regression and classification and
regression tree (CART) models, Expert Systems with Applications 29(1):
65-74. http://dx.doi.org/10.1016/j.eswa.2005.01.006
Samarasinghe, S. 2006. Neural networks for applied sciences and
engineering. USA: Taylor & Francis Group, LLC.
Samui, P.; Kothari, D. P. 2011. Utilization of a least square
support vector machine (LSSVM) for slope stability analysis, Scientia
Iranica 18(1): 53-58. http://dx.doi.org/10.1016/j.scient.2011.03.007
Shrestha, D. L.; Solomatine, D. P. 2006. Machine learning
approaches for estimation of prediction interval for the model output,
Neural Networks 19(2): 225-235.
http://dx.doi.org/10.1016/j.neunet.2006.01.012
Hsu, C.-W.; Chang, C.-C.; Lin, C.-J. 2010. A practical guide to
support vector classification. Technical Report. Department of Computer
Science, National Taiwan University.
Sonmez, R. 2011. Range estimation of construction costs using
neural networks with bootstrap prediction intervals, Expert Systems with
Applications 38(8): 9913-9917.
http://dx.doi.org/10.1016/j.eswa.2011.02.042
Stine, R. A. 1985. Bootstrap prediction intervals for regression,
Journal of the American Statistical Association 80(392): 1026-1031.
http://dx.doi.org/10.2307/2288570
Storn, R.; Price, K. 1997. Differential evolution: a simple and
efficient heuristic for global optimization over continuous spaces,
Journal of Global Optimization 11(4): 341-359.
http://dx.doi.org/10.1023/A:1008202821328
Suykens, J. A. K.; Van Gestel, T.; De Brabanter, J.; De Moor, B.;
Vandewalle, J. 2002. Least squares support vector machines. Singapore:
World Scientific Publishing Co. Pte. Ltd.
Timofeev, R. 2004. Classification and Regression Trees (CART):
theory and applications: Master's Thesis. Center of Applied
Statistics and Economics, Humboldt University, Berlin.
Trost, S. M.; Oberlender, G. D. 2003. Predicting accuracy of early
cost estimates using factor analysis and multivariate regression,
Journal of Construction Engineering and Management 129(2): 198-204.
http://dx.doi.org/10.1061/(asce)0733-9364(2003)129: 2(198)
Wang, H.; Hu, D. 2005. Comparison of SVM and LS-SVM for regression,
in Proc. of the International Conference on Neural Networks and Brain
(ICNNB), 13-15 October, 2005, Beijing, China, 279-283.
http://dx.doi.org/10.1109/ICNNB.2005.1614615
Witten, I. H.; Frank, E. 2000. Data mining: practical machine learning
tools and techniques with Java implementations. USA: Morgan Kaufmann.
Wong, B. K.; Bodnovich, T. A.; Selvi, Y. 1997. Neural network
applications in business: a review and analysis of the literature
(1988-1995), Decision Support Systems 19(4): 301-320.
http://dx.doi.org/10.1016/s0167-9236(96)00070-x
Wonnacott, T. H.; Wonnacott, R. J. 1996. Introductory statistics.
New York: Wiley.
Xie, X. L.; Beni, G. 1991. A validity measure for fuzzy clustering,
IEEE Transactions on Pattern Analysis and Machine Intelligence 13(8):
841-847. http://dx.doi.org/10.1109/34.85677
Zhang, J. R.; Zhang, J.; Lok, T. M.; Lyu, M. R. 2007. A hybrid
particle swarm optimization--back-propagation algorithm for feedforward
neural network training, Applied Mathematics and Computation 185(2):
1026-1037. http://dx.doi.org/10.1016/j.amc.2006.07.025
Zhu, W.-J.; Feng, W.-F.; Zhou, Y.-G. 2010. The application of
genetic fuzzy neural network in project cost estimate, in Proc. of the
International Conference on E-Product E-Service and E-Entertainment
(ICEEE), 1-4 November, 2010, Henan, China, 1-k
http://dx.doi.org/10.1109/ICEEE.2010.5660115
Min-Yuan CHENG (a), Nhat-Duc HOANG (b)
(a) Department of Construction Engineering, National Taiwan
University of Science and Technology, Taiwan
(b) Faculty of Building and Industrial Construction, National
University of Civil Engineering, Hanoi, Vietnam
Received 28 Oct 2011; accepted 08 Jun 2012
Corresponding author: Nhat-Duc Hoang
E-mail:
[email protected],
[email protected]
Min-Yuan CHENG is currently a Professor at the Department of
Construction Engineering, National Taiwan University of Science and
Technology. He lectures on Construction Automation and
Construction Process Re-engineering. He has published many papers in
various international journals such as Automation in Construction,
Journal of Construction Engineering and Management, and Expert Systems
with Applications. His research interests include management information
systems, applications of artificial intelligence, and construction
management process reengineering.
Nhat-Duc HOANG is currently a lecturer at the Department of Technology
and Construction Management, Faculty of Building and Industrial
Construction, National University of Civil Engineering, Hanoi, Vietnam.
He received his MSc and PhD degrees from National Taiwan University of Science
and Technology, Taipei, Taiwan. His research focuses on applications of
artificial intelligence in construction engineering and management. His
research articles have been published in Journal of Computing in Civil
Engineering (ASCE), Engineering Applications of Artificial Intelligence,
and Automation in Construction.
Table 1. EAC prediction's influencing factors

No.    Influence Factor (IF)             Index                          Definition
IF1    Construction duration             Construction progress (%)      Duration to date / revised contract duration
IF2    Actual cost                       AC_p                           Actual Cost / Budget at Completion
IF3    Planned cost                      EV_p                           Earned Value / Budget at Completion
IF4    Cost management                   CPI                            Earned Value / Actual Cost
IF5    Time management                   SPI                            Earned Value / Planned Value
IF6    Subcontractor management          Subcontractor billed index     Subcontractor billed amount / Actual Cost
IF7    Contract payment                  Owner billed index             Owner billed amount / Earned Value
IF8    Change order                      Change order index             Revised contract amount / Budget at Completion
IF9    Construction price fluctuation    CCI                            Construction material price index of that month / construction material price index of initial stage
IF10   Number of rainy days              Climate effect index           (Revised project duration - number of rainy days) / revised project duration
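For readers who wish to reproduce the input preparation, the sketch below computes the ten influencing factors exactly as defined in Table 1. It is a minimal illustration rather than the authors' implementation: the record field names are hypothetical, and the zero-denominator default of 1.0 (mirroring the first period of Table 4, where IF4-IF7 equal 1 while costs are still zero) is an assumption.

    # Minimal sketch (Python): computing IF1-IF10 from earned-value data.
    # Field names are hypothetical; the default of 1.0 for zero denominators
    # is an assumption based on the first row of Table 4.
    from dataclasses import dataclass

    @dataclass
    class PeriodRecord:
        duration_to_date: float              # elapsed duration (days)
        revised_duration: float              # revised contract duration (days)
        actual_cost: float                   # AC, cumulative
        earned_value: float                  # EV, cumulative
        planned_value: float                 # PV, cumulative
        budget_at_completion: float          # BAC
        subcontractor_billed: float          # cumulative subcontractor billings
        owner_billed: float                  # cumulative owner billings
        revised_contract_amount: float
        material_price_index_now: float      # index of the current month
        material_price_index_initial: float  # index of the initial stage
        rainy_days: float                    # number of rainy days

    def _ratio(num: float, den: float, default: float = 1.0) -> float:
        """Safe ratio for early periods in which the denominator is zero."""
        return num / den if den != 0 else default

    def influence_factors(r: PeriodRecord) -> dict:
        """IF1-IF10 as defined in Table 1."""
        return {
            "IF1": r.duration_to_date / r.revised_duration,       # progress
            "IF2": r.actual_cost / r.budget_at_completion,        # AC_p
            "IF3": r.earned_value / r.budget_at_completion,       # EV_p
            "IF4": _ratio(r.earned_value, r.actual_cost),         # CPI
            "IF5": _ratio(r.earned_value, r.planned_value),       # SPI
            "IF6": _ratio(r.subcontractor_billed, r.actual_cost),
            "IF7": _ratio(r.owner_billed, r.earned_value),
            "IF8": r.revised_contract_amount / r.budget_at_completion,
            "IF9": r.material_price_index_now / r.material_price_index_initial,
            "IF10": (r.revised_duration - r.rainy_days) / r.revised_duration,
        }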
Table 2. Project information

Project   Total area   Underground   Ground   Buildings   Start date   Finish date   Duration   Contract amount   Prediction
          (m²)         floors        floors                                           (days)     (NTD)             periods
1         12622        2             9        1           2003/12/1    2005/8/22     630        289,992,000       29
2         4919         3             11       1           2003/12/13   2005/11/10    689        149,300,000       24
3         19205        5             8        1           2000/5/20    2002/5/19     729        332,800,000       20
4         5358         3             9        1           2000/11/15   2002/11/14    729        199,600,000       25
5         27468        2             11       3           1999/12/16   2001/12/3     718        1,142,148,388     26
6         31797        2             9        4           2001/7/4     2003/3/31     635        530,000,000       20
7         7707         2             14       1           2001/11/24   2003/10/20    695        153,500,000       22
8         10087        3             14       1           2002/6/18    2004/7/6      749        216,000,000       27
9         3479         1             10       1           2003/6/2     2004/9/30     486        85,714,286        18
10        7289         2             8        1           2005/6/15    2006/9/15     457        190,844,707       20
11        6352         4             11       1           2004/3/5     2006/2/18     715        202,241,810       31
12        4774         2             11       1           2004/2/21    2006/2/20     730        145,377,589       27
13        3094         2             7        1           2005/10/1    2007/2/28     515        102,500,000       17
Table 3. Descriptive statistics of historical data

            Mean    Median   Minimum   Maximum   Std. Dev.
IF1 (%)     65.48   65.30    2.40      130.60    32.92
IF2 (%)     51.62   49.25    0.00      132.70    31.50
IF3 (%)     60.66   58.05    0.00      141.60    35.72
IF4         1.20    1.16     0.34      2.33      0.23
IF5         1.00    1.00     0.40      1.62      0.09
IF6         1.06    1.08     0.00      1.90      0.26
IF7         0.89    0.91     0.00      1.57      0.22
IF8         1.03    1.00     0.87      1.42      0.10
IF9         1.05    1.03     0.97      1.20      0.06
IF10        0.89    0.90     0.70      1.00      0.07
ETC (%)     42.93   42.00    0.00      108.70    30.81
EAC (%)     93.87   91.70    73.20     132.70    14.84
Table 4. Input data for project 2 with 24 completion periods

Period   IF1 (%)   IF2 (%)   IF3 (%)   IF4   IF5   IF6   IF7   IF8   IF9   IF10
1        2.4       0         0         1     1.0   1.0   1.0   1.0   1.0   1.0
2        6.9       2.5       0         1.8   1.0   1.2   1.0   1.0   1.0   1.0
3        11        6.3       11        1.3   1.0   1.8   1.0   1.0   1.1   1.0
4        15.5      9.6       12.7      1     1.0   1.3   1.0   1.0   1.1   1.0
5        19.9      12.6      14.8      1.1   1.0   1.0   0.9   1.0   1.1   1.0
6        24.2      14.3      16.4      1     1.0   0.9   0.8   1.0   1.1   1.0
7        28.7      16.5      16.4      1     1.0   1.4   1.4   1.0   1.1   0.9
8        33        19.4      18.8      1.2   1.0   1.2   1.2   1.0   1.1   0.9
9        37.4      22.3      26.5      1.2   1.0   1.4   1.2   1.0   1.1   0.9
10       41.8      25.3      30.4      1     1.0   1.4   1.1   1.0   1.1   0.9
11       46.1      29.2      29.7      1     1.0   1.3   1.3   1.0   1.1   0.9
12       50.6      32.6      33.1      1     1.0   1.2   1.2   1.0   1.1   0.9
13       54.9      36.4      37        1     1.0   1.3   1.2   1.0   1.1   0.9
14       59.3      40.6      41.5      1     1.0   1.1   1.1   1.0   1.1   0.9
15       63.5      43.8      44.8      1     1.0   1.0   1.0   1.0   1.1   0.9
16       72.2      54.6      55.8      1     1.0   1.1   1.1   1.0   1.1   0.9
17       76.5      61.5      62.9      1     1.0   1.1   1.0   1.0   1.1   0.8
18       81        66.1      67.6      1     0.9   1.0   1.0   1.0   1.1   0.8
19       85.2      72.5      82.5      1.1   1.0   1.1   1.0   1.0   1.1   0.8
20       89.7      78.5      92        1.2   1.0   1.1   1.0   1.0   1.1   0.8
21       94.1      79.6      94.4      1.2   1.0   1.2   1.0   1.0   1.1   0.8
22       98.4      81.6      99        1.2   1.0   1.2   1.0   1.0   1.1   0.8
23       102.9     84.5      100.5     1.2   1.0   1.2   1.0   1.0   1.1   0.8
24       107.2     91.2      100.5     1.1   1.0   1.1   1.0   1.0   1.1   0.8
Table 5. Interval estimation of EAC for project 12
(EAC_A: actual value; EAC_P: point prediction; EAC_L, EAC_U: lower and upper interval bounds; Deviation = |EAC_P - EAC_A|.)

No.   EAC_A    EAC_L    EAC_P    EAC_U    Deviation
1 91.69 74.45 84.79 93.77 6.90
2 91.69 69.06 79.41 88.39 12.28
3 91.69 79.11 89.33 98.31 2.36
4 91.69 78.35 88.74 97.85 2.95
5 91.69 76.39 86.75 95.73 4.94
6 91.69 73.08 83.34 92.39 8.35
7 91.69 73.86 84.01 93.11 7.68
8 91.69 73.02 82.82 91.71 8.87
9 91.69 78.20 87.94 96.78 3.75
10 91.69 82.36 92.59 101.57 0.90
11 91.69 80.73 91.00 99.98 0.69
12 91.69 80.89 91.26 100.29 0.43
13 91.69 80.14 90.55 99.57 1.14
14 91.69 84.09 94.48 103.46 2.79
15 91.69 80.85 91.25 100.22 0.44
16 91.69 83.75 94.12 103.09 2.43
17 91.69 80.40 90.74 99.72 0.95
18 91.69 80.69 91.03 100.01 0.66
19 91.69 75.33 85.69 94.68 6.00
20 91.69 77.65 88.02 97.00 3.67
21 91.69 78.10 88.48 97.47 3.21
22 91.69 80.08 90.47 99.46 1.22
23 91.69 77.24 87.62 96.61 4.07
24 91.69 82.37 92.73 101.71 1.04
25 91.69 82.57 92.48 101.29 0.79
26 91.69 82.12 91.70 100.44 0.01
27 91.69 83.60 93.24 102.06 1.55
Table 6. Interval estimation of EAC for project 13

No.   EAC_A    EAC_L    EAC_P    EAC_U    Deviation
1 92.49 83.98 94.33 103.31 1.84
2 92.49 86.85 97.42 106.39 4.93
3 92.49 78.87 89.27 98.25 3.22
4 92.49 78.16 88.54 97.52 3.95
5 92.49 76.97 87.32 96.30 5.17
6 92.49 83.88 94.24 103.22 1.75
7 92.49 90.91 101.26 110.24 8.77
8 92.49 91.51 101.82 110.79 9.33
9 92.49 86.05 95.73 104.49 3.24
10 92.49 81.05 90.32 98.91 2.17
11 92.49 80.64 90.69 99.54 1.80
12 92.49 84.49 94.79 104.04 2.30
13 92.49 79.70 90.23 99.39 2.26
14 92.49 78.55 89.39 98.61 3.10
15 92.49 77.98 88.68 97.90 3.81
16 92.49 81.06 91.12 100.08 1.37
17 92.49 84.62 94.86 103.97 2.37
Table 7. Prediction result comparison for 2 testing projects

Prediction model       M5-MT   ANN    LS-SVM   EAC-LSPIM
Training
  MAPE                 3.16    3.11   1.81     2.36
  MAE                  0.03    0.03   0.02     0.02
  RMSE                 0.04    0.05   0.03     0.03
Testing
  MAPE                 7.63    3.83   3.90     3.74
  MAE                  0.07    0.04   0.04     0.03
  RMSE                 0.09    0.06   0.05     0.04
Interval estimation
  PICP (%)             81.8    90.9   93.2     97.7
  MPI                  27.9    23.0   20.8     19.2
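The error and interval-quality measures reported in Tables 7 and 8 can be computed as sketched below, under the usual definitions: MAPE, MAE, and RMSE for the point predictions, PICP as the percentage of actual values falling inside their prediction intervals, and MPI taken here to be the mean prediction interval width (cf. Shrestha, Solomatine 2006). This is a minimal illustration, not the authors' code; the function names are ours.

    # Minimal sketch (Python/NumPy): the evaluation measures of Tables 7-8.
    # MPI is assumed to denote the mean prediction interval width.
    import numpy as np

    def point_metrics(y_true, y_pred):
        """MAPE (in %), MAE, and RMSE for point predictions."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        err = y_pred - y_true
        return {
            "MAPE": 100.0 * float(np.mean(np.abs(err) / np.abs(y_true))),
            "MAE": float(np.mean(np.abs(err))),
            "RMSE": float(np.sqrt(np.mean(err ** 2))),
        }

    def interval_metrics(y_true, lower, upper):
        """PICP (in %) and MPI for interval predictions."""
        y_true = np.asarray(y_true, dtype=float)
        lower = np.asarray(lower, dtype=float)
        upper = np.asarray(upper, dtype=float)
        covered = (y_true >= lower) & (y_true <= upper)  # hit within interval
        return {
            "PICP": 100.0 * float(covered.mean()),
            "MPI": float(np.mean(upper - lower)),
        }

As a check against Table 5: the actual EAC of project 12 (91.69) lies inside 26 of the 27 intervals (only period 2, whose upper bound is 88.39, misses), so PICP = 26/27 ≈ 96.3%, which matches the EAC-LSPIM entry for project 12 in Table 8.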
Table 8. Prediction result comparison for 13 projects

                     Project
Model      Metric    1      2      3      4      5      6      7      8      9      10     11     12     13     Average
M5-MT      MAPE      4.3    10.6   9.7    7.3    1.7    6.3    5.0    4.9    8.1    8.0    10.3   6.8    8.2    7.0
           MAE       0.0    0.1    0.1    0.1    0.0    0.1    0.0    0.0    0.1    0.1    0.1    0.1    0.1    0.1
           RMSE      0.1    0.1    0.1    0.1    0.0    0.1    0.1    0.0    0.1    0.1    0.1    0.1    0.1    0.1
           PICP      93.1   79.2   90.0   88.0   100.0  95.0   86.4   96.3   94.4   87.1   80.7   85.2   82.4   89.1
           MPI       21.2   22.1   27.5   26.4   14.0   24.1   21.1   25.2   36.5   26.4   27.4   28.4   25.0   25.0
ANN        MAPE      7.2    7.7    6.4    4.5    4.1    6.5    3.2    7.6    5.6    3.1    8.8    4.7    5.7    5.8
           MAE       0.1    0.1    0.1    0.0    0.1    0.1    0.0    0.1    0.0    0.0    0.1    0.0    0.1    0.1
           RMSE      0.1    0.1    0.1    0.0    0.1    0.1    0.0    0.1    0.1    0.0    0.1    0.1    0.1    0.1
           PICP      89.7   100.0  100.0  100.0  84.6   85.0   100.0  92.6   88.9   100.0  93.6   92.6   88.2   93.5
           MPI       28.5   31.6   24.1   23.1   25.6   28.6   30.4   22.7   25.4   27.1   30.7   23.4   21.5   26.4
LS-SVM     MAPE      4.0    4.0    4.0    6.4    5.3    7.3    6.6    3.9    5.5    6.8    5.6    5.0    4.3    5.3
           MAE       0.0    0.0    0.0    0.1    0.1    0.1    0.1    0.0    0.0    0.1    0.1    0.0    0.0    0.1
           RMSE      0.1    0.0    0.0    0.1    0.1    0.1    0.1    0.0    0.1    0.1    0.1    0.1    0.1    0.1
           PICP      96.6   100.0  100.0  100.0  84.6   100.0  95.5   100.0  83.3   87.1   96.8   92.6   94.1   94.7
           MPI       21.6   22.0   26.2   34.0   19.7   36.4   32.1   35.6   22.8   24.6   21.4   22.4   22.1   26.2
EAC-LSPIM  MAPE      3.5    3.7    4.3    6.5    2.8    6.7    1.7    4.2    3.9    2.2    2.5    3.6    3.9    3.8
           MAE       0.0    0.0    0.0    0.1    0.0    0.1    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
           RMSE      0.0    0.0    0.0    0.1    0.0    0.1    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
           PICP      96.6   100.0  100.0  100.0  92.3   100.0  100.0  100.0  94.4   100.0  100.0  96.3   100.0  98.4
           MPI       17.8   24.3   16.6   25.1   17.7   29.1   20.2   28.1   20.8   21.2   25.4   19.0   19.3   21.9