Article information

Title: Using fuzzy classification for chronic disease management.
Author: Ghosh, Biswadip
Journal: Indian Journal of Economics and Business
Print ISSN: 0972-5784
Year of publication: 2012
Issue: March
Language: English
Publisher: Indian Journal of Economics and Business
Abstract: Clinicians and hospital administrators rely on information models for healthcare management and decision-making. However, healthcare data is problematic for multiple reasons. As a result, many statistical modeling approaches can break down with healthcare data. Other more resilient algorithms, such as Fuzzy Composite Programming (FCP), hold promise to address these issues. FCP allows the integration of qualitative and quantitative data and can better handle conflicts and variability in the data. This paper applies FCP to build a diabetes classifier using the PIMA diabetes dataset. The performance of the fuzzy classifier was found to be better than that of a logistic regression classifier.
Keywords: Chronic diseases; Decision making; Diabetes; Diabetes mellitus; Health services administration

Keywords: Healthcare Disease Management, Multiple Criteria Decision Making (MCDM), Fuzzy Composite Programming (FCP)

I. INTRODUCTION

The recent increase in the use of Electronic Health Record (EHR) systems in health care facilities has resulted in a large amount of clinical data being collected and made available online. Such data presents opportunities for creating information systems for various healthcare organizational management and decision-making purposes (Figure 1). Disease management is defined as a system of coordinated interventions aimed at improving patients' self-management of their health. Reports indicate that several healthcare organizations are introducing evidence-based medicine and disease management practices by implementing information systems based on this clinical information (McGrath, Hendy, Klecun and Young, 2008). Personnel at multiple levels in a healthcare organization can rely on such information systems to create and deploy statistical models that facilitate decision-making. For example, medical chiefs and hospital directors need to track resource utilization and the outcomes of selected treatments and procedures, and to plan unit-based resource allocation and standardized procedures (Epstein, 2006). Healthcare system policy makers also need information from across a healthcare network to make strategic decisions on the standardization of treatment protocols and procedures. Clinicians need historical patient outcome information to facilitate decisions on elective treatment (e.g. elective surgeries) and to judge the suitability of treatment options and medical procedures for a presenting patient.

[FIGURE 1 OMITTED]

Disease management is particularly important for chronic diseases such as diabetes. Diabetes mellitus (MEL-ih-tus), or simply diabetes, is a group of diseases characterized by high blood glucose levels that result from defects in the body's ability to produce and/or use insulin. Diabetes is a common disease with high prevalence rates. According to the ADA (American Diabetes Association), 7.8% of the US population have diabetes (ADA, 2011). Worldwide, more than 220 million people are estimated to have diabetes. The World Health Organization (WHO) estimates that over 1.1 million people die from diabetes every year and that the number of deaths will double by the year 2030. Additionally, diabetes can cause severe physical complications in patients, leading to heart disease and stroke, high blood pressure, blindness, kidney disease, nervous system disease (neuropathy) and amputations. The economic impacts of diabetes can be severe for an individual and a country. The cost of diabetes in the US alone was estimated at $174 billion in 2007, of which $116 billion was for direct medical costs and $58 billion for indirect costs (disability, work loss and premature mortality) (ADA, 2011).

The prevalence rates of diabetes can vary greatly across nationalities and racial groups. For certain high-risk groups, such as Asian populations or the Native Indian population in the US, diabetes rates may be higher due to heredity and lifestyle. Diabetes is a silent disease for an extended period of time. A patient in the high-risk domain with "hidden" diabetes that is not identified and treated may suffer severe consequences, as the disease progresses rapidly. Hence, it is important to identify patients at elevated risk of developing diabetes even before full-blown diabetes develops and is diagnosed, so that lifestyle management can be established to delay or even eliminate the onset of the disease.

Finding better and more effective classifiers for diabetes continues to be an ongoing research challenge. Data in healthcare organizations is problematic for reasons that include (1) the technological differences in medical protocols across facilities, (2) organizational differences in clinical personnel and their interpretation of patient variables, (3) the lack of standardization of information recording practices, and (4) poor data integration across care facilities in the network. Fuzzy logic is a useful modeling platform for constructing classifiers in the healthcare environment, as it allows for the use of different types of data that have large variability, conflicts and ambiguity (Chen, 2003). It is therefore important to apply fuzzy logic to construct a classifier for diabetes and evaluate its performance against more traditional statistical classifiers (e.g. logistic regression). Such a study has not been attempted in the research literature and may provide tools for better classification of diabetes and other chronic diseases in general. Receiver Operating Characteristic (ROC) analysis is an established approach for comparing classifiers and models for disease management in healthcare, and the technique is adopted for this study (Linden, 2004).

The goals of this research are as follows:

1. Use Fuzzy Composite Programming (FCP) to build a classifier for Diabetes detection.

2. Evaluate the performance of the Fuzzy classifier using Receiver Operating Characteristic (ROC) Curves.

3. Compare the performance of the Fuzzy classifier against a classifier based on Logistic Regression.

This research utilizes the Pima Native American diabetes dataset, which was collected by the National Institute of Diabetes and Digestive and Kidney Diseases and donated for public use by Sigillito (1990). The data reports the diabetic status of 768 women (500 with no disease, 268 with disease), along with data for 8 health-status variables. The population lives near Phoenix, Arizona, USA.

II. RESEARCH BACKGROUND

Predictive models are typically used in the domain of disease management in order to assess a patient's risk for a disease and to allow healthcare organizations to forecast future loads and utilization of healthcare services. Information systems that facilitate the processes of decision making are referred to as Decision Support Systems (DSS). Such systems include data warehouses that aggregate the data needed for decision making over several dimensions, and data mining applications that analyze data for patterns to help find the knowledge needed to build classifiers that assist in decision making. Most DSS offer functionality intended to support all phases of decision making: intelligence, design, choice and implementation (Simon, 1977). DSS technologies support (1) the general goals of reducing the uncertainty in the decision making process, such as framing the right questions and problem(s) to solve, (2) building a model to evaluate choices and estimating the impact of the choices on one or more objectives and (3) the capability to evaluate the effect of changes in assumptions, model inputs and parameter values on a chosen decision. All activities involve the efficient and accurate collection, management, processing and application of data/information to the decision making process steps.

Fuzzy logic is a useful modeling platform for constructing classifiers for complex decision making scenarios, as it allows for the use of different types of data that have large variability in the data set (Chen, 2003). Real-life situations such as chronic disease management are often difficult to model because the actual values of the selected measurement criteria may exhibit variability as well as imprecision in the way they are collected. Statistical data analysis techniques are able to account for variability but may not work well with imprecision, or with criteria that are not statistically independent (e.g. blood glucose levels and insulin levels). With fuzzy logic, an area or volume is used to represent each scenario, instead of a single point (the statistical approach), to get a more complete classification of each patient scenario under variability. This leads to better decision making in imprecise domains such as chronic disease management scenarios in healthcare.

(A) Evaluating Classifiers

Four counts typically help in evaluating classifiers: the cases the classifier got correct (true positives and true negatives) and the cases it got incorrect (false positives and false negatives). A cutoff value is chosen by the decision maker for the classifier. Any scenario whose score is lower than the cutoff value is marked negative, while any scenario whose score is greater than the cutoff is marked positive. Sensitivity is the proportion of actual positives that the classifier correctly identifies, while specificity is the proportion of actual negatives that it correctly identifies. A perfect classifier would have 100% sensitivity and 100% specificity, thereby correctly identifying everyone having the condition (positives) and never mislabeling people who do not have the condition (negatives). In most cases the classifier is not perfect, and a tradeoff has to be made by selecting the cutoff value that gives the most desirable combination of sensitivity and specificity. Receiver operating characteristic (ROC) analysis is useful in evaluating classifiers, as a ROC curve plots 1 minus specificity on the X-axis and sensitivity on the Y-axis for all possible cutoff values. The area under a ROC curve provides a numeric measure of the quality of the classifier. An area under the curve (AUC) of 1.0 indicates a perfect classifier, while an area of 0.5 indicates a non-discriminating classifier (Fawcett, 2003). Two classifiers can be compared using their ROC curves and the calculated areas under the curves.
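The evaluation measures described above can be sketched in a few lines of Python. This is a minimal illustration rather than the tooling used in the study: the function names and the six example scores and labels are hypothetical, and the rule that a score greater than the cutoff is marked positive follows the convention stated in the text.

```python
def confusion_counts(scores, labels, cutoff):
    """Count TP, FP, TN, FN when scores greater than the cutoff are marked positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > cutoff
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(scores, labels, cutoff):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_counts(scores, labels, cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs over all distinct cutoffs."""
    cutoffs = [float("inf")] + sorted(set(scores), reverse=True) + [float("-inf")]
    points = set()
    for cutoff in cutoffs:
        tp, fp, tn, fn = confusion_counts(scores, labels, cutoff)
        points.add((fp / (fp + tn), tp / (tp + fn)))
    return sorted(points)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical classifier scores and true labels (1 = disease, 0 = no disease).
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
example_labels = [1, 1, 0, 1, 0, 0]
example_auc = auc(roc_points(example_scores, example_labels))
```

Lowering the cutoff moves along the curve toward higher sensitivity and lower specificity, which is the tradeoff the decision maker has to settle.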

III. FUZZY COMPOSITE PROGRAMMING MODEL

(A) Fuzzy Composite Index

FCP is one of the MCDM techniques. It can handle mixed indicator data (quantitative and qualitative) and can also work with conflicting, uncertain and hierarchical criteria. The FCP methodology was developed by Bardossy and Duckstein (1992). There have been many successful applications of FCP in the DSS literature (Lee, Dahab and Bogardi, 1992; Hagemeister, Jones and Woldt, 1996; Prodanovic and Simonovic, 2002; Sadip and Veitch, 2002). Basic indicator values are normalized using their best and worst values, as described by the following equations (Lee, Dahab and Bogardi, 1992):

$$\beta_{ij} = \frac{f_{ij} - f_{ij}^{-}}{f_{ij}^{+} - f_{ij}^{-}} \quad \text{(when } f_{ij}^{+} \text{ is best)} \qquad (1)$$

or

$$\beta_{ij} = \frac{f_{ij}^{+} - f_{ij}}{f_{ij}^{+} - f_{ij}^{-}} \quad \text{(when } f_{ij}^{-} \text{ is best)} \qquad (2)$$

FCP is based on a Fuzzy Composite Index (FCI). The equation is:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3)

where $L_j$ is the Fuzzy Composite Index for the level B+1 group $j$ of level B indicators;

$w_{ij}$ is the weight of the level B indicators in group $j$;

$p_j$ is the balancing factor among indicators for group $j$;

$f_{ij}^{+}$ is the best value of the $i$th fuzzy indicator for group $j$;

$f_{ij}^{-}$ is the worst value of the $i$th fuzzy indicator for group $j$;

$f_{ij}$ is the value of the $i$th fuzzy indicator for group $j$.

The final fuzzy composite index, which is used for ranking, is obtained by calculating the FCI from basic level to top level.
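As a rough guide to how the quantities defined above fit together, the composite index is commonly written in the FCP literature as a weighted power form of the normalized indicators. The expression below is a sketch under that assumption and is not a reproduction of Equation (3); the symbol $n_j$ (the number of indicators in group $j$) is introduced here for illustration only.

```latex
L_j = \left[ \sum_{i=1}^{n_j} w_{ij} \, \beta_{ij}^{\,p_j} \right]^{1/p_j}
```

With $p_j = 1$ this form reduces to a weighted sum of the normalized indicators, so a low value in one indicator can be fully offset by a high value in another.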

The weight parameters for indicators at different levels ($w_{ij}$) are established based on the degree of importance that decision makers feel each indicator has relative to other indicators of the same group (Bardossy and Duckstein, 1992).

The balancing factors ($p_j$) reflect the importance of maximal deviations between indicators in the same group, and determine the degree of substitution between indicators of the same group. Low balancing factors (equal to 1) are used for a high level of allowable substitution. High balancing factors (equal to 3) are used for minimal substitution (Bardossy and Duckstein, 1992). The best value ($f_{ij}^{+}$) stands for the maximum possible value of the indicator, and the worst value ($f_{ij}^{-}$) stands for the minimum possible value of the indicator.
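The effect of the weights and balancing factors can be illustrated with a short Python sketch. The normalization follows Equations (1) and (2); the composite step assumes the weighted power form commonly given for FCP, L_j = (sum_i w_ij * beta_ij^p_j)^(1/p_j), which is an assumption here because Equation (3) is not reproduced in the text. The function names and the two-indicator example are hypothetical.

```python
def normalize(value, lowest, highest, highest_is_best):
    """Equations (1) and (2): map an indicator value onto the range [0, 1]."""
    span = highest - lowest
    if highest_is_best:
        return (value - lowest) / span   # Equation (1)
    return (highest - value) / span      # Equation (2)


def composite_index(betas, weights, p):
    """Assumed FCI form: weighted power combination of normalized indicators."""
    total = sum(w * (b ** p) for w, b in zip(weights, betas))
    return total ** (1.0 / p)


# Two hypothetical normalized indicators with equal weights.
betas = [0.9, 0.1]
weights = [0.5, 0.5]

index_p1 = composite_index(betas, weights, p=1)  # full substitution allowed
index_p3 = composite_index(betas, weights, p=3)  # minimal substitution
```

With a balancing factor of 1 the two indicators simply average out, whereas with a factor of 3 the larger value dominates the result, which is the sense in which higher balancing factors limit substitution between indicators.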

(B) Most Likely Interval (MLI) and Largest Likely Interval (LLI)

MLI reflects the most likely range for an indicator value, and LLI reflects the largest possible range for an indicator value. MLI consists of a low bound (LMLI) and a high bound (HMLI). LLI consists of a low bound (LLLI) and a high bound (HLLI). The low bound and the high bound of an interval can be the same. Hagemeister, Jones and Woldt (1996) give such an example: water volume has a low bound of 310,000 m³ for LLLI, a high bound of 410,000 m³ for HLLI, and 340,000 m³ for both LMLI and HMLI. For our classifier, blood glucose has a low bound of 120 mg/dL for LLLI, a high bound of 146 mg/dL for HLLI, and 178 mg/dL for both LMLI and HMLI.

Normally, the following relationship exists among the above values in a typical fuzzy logic scenario: LLLI ≤ LMLI ≤ HMLI ≤ HLLI.
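One way to hold the four bounds together and enforce the ordering above is a small record type. The sketch below is illustrative: the class name is hypothetical, and the trapezoidal membership function (1 inside the MLI, falling linearly to 0 at the LLI bounds) is a standard fuzzy-number assumption rather than something spelled out in the text. The example values are the water volume bounds quoted from Hagemeister, Jones and Woldt (1996).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzyInterval:
    """One fuzzy indicator value described by its LLI and MLI bounds."""
    llli: float  # low bound of the Largest Likely Interval
    lmli: float  # low bound of the Most Likely Interval
    hmli: float  # high bound of the Most Likely Interval
    hlli: float  # high bound of the Largest Likely Interval

    def __post_init__(self):
        # Enforce LLLI <= LMLI <= HMLI <= HLLI.
        if not (self.llli <= self.lmli <= self.hmli <= self.hlli):
            raise ValueError("expected LLLI <= LMLI <= HMLI <= HLLI")

    def membership(self, x):
        """Assumed trapezoidal membership of x in this fuzzy value."""
        if x < self.llli or x > self.hlli:
            return 0.0
        if self.lmli <= x <= self.hmli:
            return 1.0
        if x < self.lmli:
            return (x - self.llli) / (self.lmli - self.llli)
        return (self.hlli - x) / (self.hlli - self.hmli)


# Water volume example (m^3): LMLI and HMLI coincide at 340,000.
water_volume = FuzzyInterval(310_000, 340_000, 340_000, 410_000)
```

When LMLI equals HMLI, as in this example, the trapezoid collapses to a triangle with its peak at the single most likely value.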

(C) Ideal Point and Worst Point of FCP

The normalized Ideal Point of FCP is located where all indexes are 1, and the normalized Worst Point of FCP is located where all indexes are 0. The real scenarios scatter inside those boundaries. The closer the values are to the Ideal Point, the better the scenario is. Note that in this study, each patient being assessed for diabetes is considered a scenario.

(D) FCP Computation Steps

Based on Bardossy and Duckstein (1992), FCP computation is divided into the following steps:

First step, compute the fuzzy composite index (FCI) for the low bound and high bound of both the Most Likely Interval (MLI) and the Largest Likely Interval (LLI) for each scenario.

Second step, based on the fuzzy composite indexes computed for LMLI, HMLI, LLLI and HLLI in the first step, compute the final fuzzy composite index for each scenario using the formula below.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where i is the scenario number;

min = [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], which is the minimum LLLI over all scenarios;

max = [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], which is the maximum HLLI over all scenarios.

Third step, compare the final FCI for all scenarios.

The ranking rule is: the larger the final FCI value, the higher the chance of onset of diabetes for the patient.
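The third step and the ranking rule can be sketched as follows, assuming the final FCI of each scenario has already been computed by the earlier steps. The patient identifiers, the FCI values and the cutoff of 0.5 are hypothetical and serve only to show the mechanics.

```python
def rank_scenarios(final_fci):
    """Order scenario ids by final FCI, largest (highest assessed risk) first."""
    return sorted(final_fci, key=final_fci.get, reverse=True)


def classify_scenarios(final_fci, cutoff):
    """Mark a scenario positive when its final FCI is greater than the cutoff."""
    return {name: value > cutoff for name, value in final_fci.items()}


# Hypothetical final FCI values for three patient scenarios.
final_fci = {"patient_a": 0.41, "patient_b": 0.73, "patient_c": 0.58}

ranking = rank_scenarios(final_fci)
flags = classify_scenarios(final_fci, cutoff=0.5)
```

Sweeping the cutoff over the range of final FCI values is what produces the points of a ROC curve for the fuzzy classifier.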

(E) FuzzyDeciMaker

The FuzzyDeciMaker tool was developed by the Civil Engineering Department of the University of Nebraska at Lincoln. It is a software tool that implements FCP functions: it supports building a tree data structure, inputting data, calculating the Fuzzy Composite Index for different levels and ranking different scenarios. The indicators used in the measurement of outcomes were based on a combination of quantitative and qualitative data from the PIMA American Indian diabetes dataset. The qualitative measures came from the diabetes pedigree function, which provides an estimate of the hereditary factors for the patient.

IV. RESEARCH METHODOLOGY

The research methodology consists of creating a patient scenario in the FuzzyDeciMaker software to represent each patient's measurements from the PIMA American Indian diabetes dataset. For this research, 50 patients (25 that were diabetes positive and 25 with no diabetes) were randomly selected from the dataset of 768 cases (500 with no disease and 268 with disease). The dataset contains the following variables:

1. Number of times pregnant

2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test

3. Diastolic blood pressure (mm Hg)

4. Triceps skin fold thickness (mm)

5. 2-Hour serum insulin (μU/mL)

6. Body mass index (weight in kg/(height in m)^2)

7. Diabetes pedigree function

8. Age (years)

(A) Fuzzy Classification Using FuzzyDeciMaker

Since substitution is allowed for all indicators, the balancing factors for all indicators are set to 1.

The values of the worst score and the best score achieved for each indicator are shown in Table 1. For this dataset, for each scenario, the maximum value in the dataset is set as HLLI, the minimum value in the dataset as LLLI, and the 95% confidence limits of the average value of that scenario as LMLI and HMLI.
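The rule for setting the four bounds can be sketched in Python. The text does not say how the 95% confidence limits were computed, so the normal approximation (mean plus or minus 1.96 standard errors) is an assumption, as is clamping the limits to the observed range so that the ordering LLLI ≤ LMLI ≤ HMLI ≤ HLLI always holds. The function name and the five sample readings are hypothetical.

```python
import math
import statistics


def interval_bounds(values):
    """Return (LLLI, LMLI, HMLI, HLLI) for one indicator from observed values."""
    lowest, highest = min(values), max(values)
    mean = statistics.mean(values)
    # Assumed normal-approximation 95% confidence limits of the mean.
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    # Clamp to the observed range (an added safeguard for small samples).
    lmli = max(lowest, mean - half_width)
    hmli = min(highest, mean + half_width)
    return lowest, lmli, hmli, highest


# Hypothetical glucose readings, for illustration only.
readings = [88, 110, 120, 146, 173]
llli, lmli, hmli, hlli = interval_bounds(readings)
```

A narrow confidence interval yields an MLI close to a single point, while the LLI always spans the full observed range.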

(B) Logistic Regression Classifier

A classifier was produced using Logistic Regression with a training/test split of 469 cases (301 with no disease and 168 with disease) in the training set and 50 cases (25 with no disease and 25 with disease) in the test set. To classify the data without introducing any bias, a probability cutoff of 0.5 was used. The accuracy, specificity and sensitivity measures for the classification were obtained from SPSS. The coefficients of the classifier are shown in Table 2.
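To show how the fitted coefficients translate into a classification, the sketch below applies the standard logistic function to the B values reported in Table 2 and uses a cutoff of 0.5. The function names and the two example patients are hypothetical; the study itself obtained its classifications from SPSS.

```python
import math

# B coefficients and constant as reported in Table 2.
COEFFICIENTS = {
    "timesPreg": 0.096,
    "BMI": 0.085,
    "Pedigree": 0.930,
    "Age": 0.011,
    "PlsGlucose": 0.031,
}
CONSTANT = -8.370


def predicted_probability(patient):
    """Logistic model: p = 1 / (1 + exp(-(constant + sum of B * x)))."""
    logit = CONSTANT + sum(b * patient[name] for name, b in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-logit))


def classify(patient, cutoff=0.5):
    """Mark the patient positive when the predicted probability exceeds the cutoff."""
    return predicted_probability(patient) > cutoff


# Two hypothetical patients differing only in plasma glucose.
patient_low = {"timesPreg": 2, "BMI": 30, "Pedigree": 0.5, "Age": 35, "PlsGlucose": 150}
patient_high = dict(patient_low, PlsGlucose=180)
```

Because Exp(B) for PlsGlucose is 1.031, each additional unit of plasma glucose multiplies the predicted odds of diabetes by about 1.03 under this model.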

V. RESULTS

Figure 2 shows the ROC curves based on the classification results of the Logistic Regression classifier and the Fuzzy Composite Programming (FCP) classifier for each of the 50 patients, and allows the performance of the two classifiers to be compared. The results indicate that the FCP classifier is better when the calculated AUC values are compared. The classifier based on logistic regression has an AUC of 0.648, while the classifier based on FCP has an AUC of 0.722. The AUC for the fuzzy classifier is thus 0.074 greater than the AUC for the logistic regression classifier for the task of classifying the randomly picked 50-patient sample from the Pima Diabetes dataset. Hence the fuzzy classifier performs better than the logistic regression classifier on this sample.

VI. CONCLUSIONS

This study aimed to build a multi-criteria decision making model using fuzzy composite programming to assess diabetes and then compare the performance of that fuzzy classifier to the performance of a logistic regression classifier using ROC curve analysis. Drawing on the PIMA American Indian diabetes dataset, criteria were selected to build the final FCP model. Both quantitative data and qualitative data were used in the hierarchical model. As seen from this research, Fuzzy Composite Programming (FCP) is an appropriate decision making model for working with mixed indicator data (quantitative and qualitative), as well as with conflicting, uncertain and hierarchical criteria. The results of this study motivate further research into the use of FCP in constructing and testing classifiers for the management of chronic conditions, such as stress management, hypertension and mental health. These medical areas have an abundance of mixed data, with some survey data that is qualitative and other lab data that is quantitative, and hence there is potential for the fuzzy classifier to perform better.

[FIGURE 2 OMITTED]

References

ADA (2011), "Diabetes Statistics", http://www.diabetes.org/diabetes-basics/diabetes-statistics, accessed Jan 11, 2011.

Bardossy, A. and Duckstein, L. (1992), "Analysis of a Karstic Aquifer Management by Fuzzy Composite Programming", Water Resources Bulletin, 28 (1), pp. 63-73.

Chen, Z. (2003), "Computational Intelligence for Decision Support", CRC Press.

Epstein, A. J. (2006), "Do Cardiac Surgery Report Cards Reduce Mortality? Assessing the Evidence", Medical Care Research and Review, 63 (4), pp. 403-426.

Fawcett, T. (2003), "ROC Graphs: Notes and Practical Considerations for Data Mining Researchers", HP Labs Technical Report HPL-2003-4.

Hagemeister, M., Jones, D. and Woldt, W. (1996), "Hazard Ranking of Landfills Using Fuzzy Composite Programming", Journal of Environmental Engineering, April 1996, pp. 248-258.

Lee, Y. L., Dahab, M. and Bogardi, I. (1992), "Nitrate Risk Assessment under Uncertainty", Journal of Water Resources Planning and Management, 118 (2), pp. 151-165.

Linden, A. (2004), "Measuring Diagnostic and Predictive Accuracy in Disease Management: An Introduction to Receiver Operating Characteristic (ROC) Analysis", Journal of Evaluation in Clinical Practice, 12 (2), pp. 132-139.

McGrath, K., Hendy, J., Klecun, E. and Young, T. (2008), "The Vision and Reality of 'Connecting For Health': Tensions, Opportunities, and Policy Implications of the UK National Programme", Communications of the Association for Information Systems, 23 (33).

Prodanovic, P. and Simonovic, S. (2002), "Comparison of Fuzzy Set Ranking Methods for Implementation in Water Resources Decision Making", Canadian Journal of Civil Engineering, 29, pp. 692-701.

Sadip, R. and Veitch, B. (2002), "An Integrated Approach to Environmental Decision-making for Offshore Oil and Gas Operations", Canada-Brazil Oil & Gas HSE Seminar and Workshop, March 11-12, 2002.

Sigillito, V. (1990), "Pima Indians Diabetes Database", http://www.ics.uci.edu/~mlearn/databases/pima-indians-diabetes/pima-indians-diabetes.names.

Simon, H. A. (1977), "The New Science of Management Decision", Prentice-Hall, Englewood Cliffs, NJ.

Note

(1.) There was high correlation between glucose and insulin variables in the data set. This created ambiguity in the regression model when both were included in the model. Hence the insulin variable was dropped from the regression models.

BISWADIP GHOSH, Computer Information Systems, Metropolitan State College of Denver, Denver, USA

Table 1
Fuzzy Weights, Best Value and Worst Value

Indicator            Weight   Balancing Factor   Best Value   Worst Value

Times Pregnant       0.15     1                  1            9
BMI                  0.14     1                  19           58
Pedigree Function    0.15     1                  0.27         1.60
Blood Glucose        0.14     1                  88           173
Age                  0.14     1                  21           81
Insulin Level        0.14     1                  111          540
Skin Fold            0.14     1                  23           38

Table 2
Logistic Regression Coefficients

Variable      B        S.E.   Wald     df   Sig.   Exp(B)

timesPreg     .096     .047   4.192    1    .041   1.101
BMI           .085     .018   23.334   1    .000   1.088
Pedigree      .930     .355   6.851    1    .009   2.535
Age           .011     .011   1.083    1    .298   1.011
PlsGlucose    .031     .004   52.167   1    .000   1.031
Constant      -8.370   .863   94.134   1    .000   .000