Using fuzzy classification for chronic disease management.
Ghosh, Biswadip
Abstract
Clinicians and hospital administrators rely on information models
for healthcare management and decision-making. However, healthcare data
is problematic for several reasons, including inconsistent recording
practices and poor integration across facilities. As a result, many
statistical modeling approaches can break down with healthcare data. More
resilient algorithms, such as Fuzzy Composite Programming (FCP), hold
promise to address these issues. FCP allows the integration of
qualitative and quantitative data and can better handle conflicts and
variability in the data. This paper applies FCP to build a diabetes
classifier using the PIMA diabetes dataset. The performance of the fuzzy
classifier was found to be better than that of a logistic regression classifier.
Keywords: Healthcare Disease Management, Multiple Criteria
Decision Making (MCDM), Fuzzy Composite Programming (FCP)
I. INTRODUCTION
The recent increase in the use of Electronic Health Record (EHR)
systems in health care facilities has resulted in a huge amount of
clinical data being collected and available online. Such data is
presenting opportunities for creating information systems for various
healthcare organizational management and decision making purposes
(Figure 1). Disease management is defined as a system of coordinated
interventions aimed at improving patients' self-management of
their health. Reports indicate that several healthcare organizations are
proceeding to introduce evidence-based medicine and disease management
practices by implementing information systems based on this clinical
information (McGrath, Hendy, Klecun and Young, 2008). Personnel at
multiple levels in a healthcare organization can rely on such
information systems to create and deploy statistical models that
facilitate decision-making. For example, medical chiefs and hospital
directors need to track resource utilization and the outcomes of selected
treatments and procedures, and to plan unit-based resource allocation and
standardized procedures (Epstein, 2006). Healthcare system policy makers
also need information from across a healthcare network to make strategic
decisions on standardization of treatment protocols and procedures.
Clinicians need historical patient outcome information to facilitate
decisions on elective treatment (e.g. elective surgeries) and judge the
suitability of treatment options and medical procedures for a presenting
patient.
[FIGURE 1 OMITTED]
Disease management is particularly important for chronic diseases
such as diabetes. Diabetes mellitus (MEL-ih-tus), or simply, diabetes,
is a group of diseases characterized by high blood glucose levels that
result from defects in the body's ability to produce and/or use
insulin. Diabetes is a common disease with high prevalence rates.
According to the ADA (American Diabetes Association), 7.8% of the US
population have diabetes (ADA, 2011). Worldwide, more than 220 million
people are estimated to have diabetes. The World Health Organization
(WHO) estimates that over 1.1 million people die from diabetes every
year and that the number of deaths will double by year 2030.
Additionally, diabetes can cause severe physical complications in
patients, including heart disease and stroke, high blood pressure,
blindness, kidney disease, nervous system disease (neuropathy) and
amputations. The economic impact of diabetes can be severe for both an
individual and a country. The cost of diabetes in the US alone was
estimated at $174 billion in 2007, of which $116 billion was for direct
medical costs and $58 billion for indirect costs (disability, work loss
and premature mortality) (ADA, 2011).
The prevalence of diabetes can vary greatly across
nationalities and racial groups. In certain high-risk groups, such as
Asian populations or Native American populations in the US, diabetes
rates may be higher due to heredity and lifestyle. Diabetes can also be a
silent disease for an extended period of time. A high-risk patient with
"hidden" diabetes that is not identified and treated may face
severe consequences, as the disease progresses rapidly. Hence, it is
important to identify patients at elevated risk of developing diabetes
even before full-blown diabetes develops and is diagnosed, so that
lifestyle management can be established to delay or even prevent the
onset of the disease.
Finding better and more effective classifiers for diabetes
remains an ongoing research challenge. Data in healthcare
organizations is problematic for reasons that include (1)
technological differences in medical protocols across facilities, (2)
organizational differences in clinical personnel and their
interpretation of patient variables, (3) the lack of standardization of
information recording practices, and (4) poor data integration across
care facilities in the network. Fuzzy logic is a useful modeling
platform for constructing classifiers in the healthcare environment, as
it allows the use of different types of data that exhibit large
variability, conflicts and ambiguity (Chen, 2003). It is therefore
important to apply fuzzy logic to construct a classifier for diabetes
and evaluate its performance against more traditional statistical
classifiers (e.g., logistic regression). Such a study has not been
attempted in the research literature and may provide tools for better
classification of diabetes and other chronic diseases in general.
Receiver Operating Characteristics (ROC) analysis is an established
approach in comparing classifiers and models for disease management in
healthcare and the technique is adopted for this study (Linden, 2004).
The goals of this research are as follows:
1. Use Fuzzy Composite Programming (FCP) to build a classifier for
Diabetes detection.
2. Evaluate the performance of the Fuzzy classifier using Receiver
Operating Characteristic (ROC) Curves.
3. Compare the performance of the Fuzzy classifier against a
classifier based on Logistic Regression.
This research utilizes the Pima Native American diabetes dataset,
which was collected by the National Institute of Diabetes and Digestive
and Kidney Diseases and was donated for public use by Sigillito (1990).
The data reports the diabetic status (diabetic or non-diabetic) of 768
women (500 without the disease and 268 with it), along with data for 8
health-status variables. The population lives near Phoenix, Arizona, USA.
II. RESEARCH BACKGROUND
Predictive models are typically used in the domain of disease
management to assess a patient's risk for a disease and to
allow healthcare organizations to forecast future loads and utilization
of healthcare services. Information systems that facilitate the
processes of decision making are referred to as Decision Support Systems
(DSS). Such systems include data warehouses, which aggregate the data
needed for decision making over several dimensions, and data mining
applications, which analyze data for patterns that provide the knowledge
needed to build classifiers. Most DSS offer
functionality intended to support all phases of decision
making: intelligence, design, choice and implementation (Simon, 1977).
DSS technologies support (1) the general goal of reducing
uncertainty in the decision making process, such as by framing the right
questions and problems to solve, (2) building a model to evaluate
choices and estimate the impact of those choices on one or more
objectives, and (3) evaluating the effect of changes in assumptions,
model inputs and parameter values on a chosen decision. All of these
activities involve the efficient and accurate collection, management,
processing and application of data and information to the steps of the
decision making process.
Fuzzy logic is a useful modeling platform for constructing
classifiers for complex decision making scenarios, as it allows for use
of different types of data which have large variability in the data set
(Chen, 2003). Real-life situations such as chronic disease management
are often difficult because the actual values of the selected
measurement criteria may exhibit variability as well as imprecision in
the way they are collected. Statistical data analysis techniques are
able to account for variability but may not work well with imprecision,
or with criteria that are not statistically independent (e.g. blood
glucose levels and insulin levels). With fuzzy logic, each scenario is
represented by an area or volume instead of a single point (as in the
statistical approach), which gives a more complete classification of each
patient scenario under variability. This leads to better decision making
in imprecise domains, such as chronic disease management scenarios
in healthcare.
(A) Evaluating Classifiers
Four measures typically help in evaluating classifiers: the numbers
of cases the classifier got correct (true positives and true negatives)
and the numbers of cases the classifier got incorrect (false positives
and false negatives). A cutoff value is chosen by the decision maker for
the classifier. Any scenario that scores lower than the cutoff value is
marked negative, while any scenario that scores greater than the cutoff
is marked positive. Sensitivity is the proportion of actual positives that
were correctly identified by the classifier, while specificity is the
proportion of actual negatives that were correctly identified by the
classifier. A perfect classifier would have 100% sensitivity and 100%
specificity, thereby correctly identifying everyone who has the condition
(positives) and never mislabeling people who do not have the condition
(negatives). In most cases the classifier is not perfect, and a tradeoff
has to be made by selecting the cutoff value that gives the most
desirable combination of sensitivity and specificity. Receiver operating
characteristic (ROC) analysis is useful in evaluating classifiers, as a
ROC curve plots 1 minus specificity on the X-axis and sensitivity on the
Y-axis for all possible cutoff values. The area under a ROC curve provides a
numeric measure of the quality of the classifier. An area under the curve
(AUC) of 1.0 indicates a perfect classifier, while an area of 0.5
indicates a non-discriminating classifier (Fawcett, 2003). Two
classifiers can be compared using their ROC curves and the calculated
areas under the curves.
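The measures described above can be illustrated with a short Python sketch. It is a minimal illustration rather than the procedure used in this study: the function names and the six example scores and labels are hypothetical, and the AUC is approximated with the trapezoidal rule.

```python
def confusion_counts(scores, labels, cutoff):
    """Count TP, FP, TN, FN when scores above the cutoff are marked positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > cutoff
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(scores, labels, cutoff):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_counts(scores, labels, cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(scores, labels):
    """Return sorted (1 - specificity, sensitivity) pairs, one per cutoff."""
    # -inf marks every case positive; the largest score marks none positive.
    cutoffs = [float("-inf")] + sorted(set(scores))
    points = []
    for cutoff in cutoffs:
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        points.append((1.0 - spec, sens))
    return sorted(points)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical classifier scores and true labels (1 = disease, 0 = no disease).
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
example_labels = [1, 1, 0, 1, 0, 0]
print(sensitivity_specificity(example_scores, example_labels, cutoff=0.5))
print(auc(roc_points(example_scores, example_labels)))
```

Lowering the cutoff moves a classifier up and to the right along its ROC curve, trading specificity for sensitivity.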
III. FUZZY COMPOSITE PROGRAMMING MODEL
(A) Fuzzy Composite Index
FCP is one of the MCDM techniques. It can handle mixed indicator
data (quantitative and qualitative) and can also work with conflicting,
uncertain and hierarchical criteria. The FCP methodology was developed by
Bardossy and Duckstein (1992). FCP has been applied successfully in
the DSS literature (Lee, Dahab and Bogardi, 1992;
Hagemeister, Jones and Woldt, 1996; Prodanovic and Simonovic, 2002;
Sadip and Veitch, 2002). Each basic indicator value is first normalized
using the best and worst values of that indicator, as described by the
following equations (Lee, Dahab and Bogardi, 1992):
$$\beta_{ij} = \frac{f_{ij} - f_{ij}^{-}}{f_{ij}^{+} - f_{ij}^{-}} \quad \text{(when } f_{ij}^{+} \text{ is best)} \qquad (1)$$
or
$$\beta_{ij} = \frac{f_{ij}^{+} - f_{ij}}{f_{ij}^{+} - f_{ij}^{-}} \quad \text{(when } f_{ij}^{-} \text{ is best)} \qquad (2)$$
FCP is based on a Fuzzy Composite Index (FCI). The equation is:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3)
where $L_j$ is the Fuzzy Composite Index for the level B+1 group j of
level B indicators;
$w_{ij}$ is the weight of the ith level B indicator in group j;
$p_j$ is the balancing factor among the indicators in group j;
$f_{ij}^{+}$ is the best value of the ith fuzzy indicator in
group j;
$f_{ij}^{-}$ is the worst value of the ith fuzzy indicator in
group j;
$f_{ij}$ is the value of the ith fuzzy indicator in group j.
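The expression for equation (3) did not survive in this text. For orientation only, a composite index of this kind is commonly written in the composite programming literature as a weighted power form of the normalized indicator values from equations (1) and (2). The form below is an assumed reconstruction from the symbol definitions above, not the author's recovered equation. The symbol $n_j$, the number of indicators in group j, is introduced here for illustration.

```latex
L_j = \left[ \sum_{i=1}^{n_j} w_{ij} \, \beta_{ij}^{\,p_j} \right]^{1/p_j}
```

Under this assumed form, a balancing factor of $p_j = 1$ reduces the index to a weighted sum of the normalized indicators, so a poor value of one indicator can be offset by a good value of another.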
The final fuzzy composite index, which is used for ranking, is
obtained by calculating the FCI from basic level to top level.
The weight parameters for indicators at different levels
($w_{ij}$) are established based on the degree of importance that
decision makers feel each indicator has relative to the other indicators in
the same group (Bardossy and Duckstein, 1992).
The balancing factors ($p_j$) reflect the importance of maximal
deviations between indicators in the same group, and determine the
degree of substitution allowed between indicators of the same group. Low
balancing factors (equal to 1) are used for a high level of allowable
substitution. High balancing factors (equal to 3) are used for minimal
substitution (Bardossy and Duckstein, 1992). The best value
($f_{ij}^{+}$) stands for the maximum possible value of the
indicator, and the worst value ($f_{ij}^{-}$) stands for the minimum
possible value of the indicator.
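The roles of the normalization, the weights and the balancing factor can be sketched in Python. This is a minimal sketch under stated assumptions: the composite index is assumed to take the weighted power form commonly used in composite programming (the paper's own equation (3) is not reproduced in the text), the function names are hypothetical, and the two-indicator group with equal weights is invented for illustration. The best and worst values for blood glucose and BMI are taken from Table 1; the patient values are made up.

```python
def normalize(value, best, worst):
    """Scale an indicator so that the best value maps to 1 and the worst to 0.

    This single expression covers both equations (1) and (2): it gives
    equation (1) when the best value is the larger one and equation (2)
    when the best value is the smaller one.
    """
    return (value - worst) / (best - worst)


def composite_index(betas, weights, p):
    """ASSUMED form of the composite index: (sum of w * beta**p) ** (1 / p)."""
    return sum(w * b ** p for w, b in zip(weights, betas)) ** (1.0 / p)


# Hypothetical patient: blood glucose 120 and BMI 30.
glucose_beta = normalize(120, best=88, worst=173)  # best/worst from Table 1
bmi_beta = normalize(30, best=19, worst=58)        # best/worst from Table 1

# Hypothetical group of two indicators with equal weights.
betas = [glucose_beta, bmi_beta]
weights = [0.5, 0.5]

print(composite_index(betas, weights, p=1))  # full substitution allowed
print(composite_index(betas, weights, p=3))  # less substitution allowed
```

With p = 1 the index is a plain weighted average, while a larger p gives more influence to the indicator with the larger normalized value.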
(B) Most Likely Interval (MLI) and Largest Likely Interval (LLI)
The MLI reflects the most likely range for an indicator value, and the LLI
reflects the largest possible range for an indicator value. The MLI
consists of a low bound (LMLI) and a high bound (HMLI). The LLI consists
of a low bound (LLLI) and a high bound (HLLI). The low bound and the high
bound of an interval can be the same. Hagemeister, Jones and Woldt (1996) give such an
example: water volume has the low bound of 310,000 m^3 for LLLI,
the high bound of 410,000 m^3 for HLLI, and 340,000 m^3 for
both LMLI and HMLI. For our classifier, blood glucose has the low bound
of 120 mg/dL for LLLI, the high bound of 178 mg/dL for HLLI,
and 146 mg/dL for both LMLI and HMLI.
Normally, the following relationship exists among the above values
in a typical fuzzy logic scenario: LLLI ≤ LMLI ≤ HMLI ≤ HLLI.
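The four bounds and the ordering among them can be represented directly in code. The sketch below is illustrative only: the class name is hypothetical, and the trapezoidal membership function (full membership inside the MLI, falling linearly to zero at the LLI bounds) is an assumption commonly made for fuzzy numbers of this kind rather than something stated in this paper. The example uses the water volume figures quoted from Hagemeister, Jones and Woldt (1996).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzyIndicator:
    """An indicator described by its LLI and MLI bounds."""
    llli: float  # low bound of the Largest Likely Interval
    lmli: float  # low bound of the Most Likely Interval
    hmli: float  # high bound of the Most Likely Interval
    hlli: float  # high bound of the Largest Likely Interval

    def __post_init__(self):
        # Enforce LLLI <= LMLI <= HMLI <= HLLI.
        if not (self.llli <= self.lmli <= self.hmli <= self.hlli):
            raise ValueError("expected LLLI <= LMLI <= HMLI <= HLLI")

    def membership(self, x):
        """ASSUMED trapezoidal membership of x in this fuzzy indicator."""
        if self.lmli <= x <= self.hmli:
            return 1.0
        if x <= self.llli or x >= self.hlli:
            return 0.0
        if x < self.lmli:
            return (x - self.llli) / (self.lmli - self.llli)
        return (self.hlli - x) / (self.hlli - self.hmli)


# Water volume example: LMLI and HMLI coincide, so the shape is a triangle.
water = FuzzyIndicator(llli=310_000, lmli=340_000, hmli=340_000, hlli=410_000)
print(water.membership(340_000))
print(water.membership(325_000))
```

When LMLI and HMLI are equal, as in the water volume example, the trapezoid collapses to a triangle with its peak at the most likely value.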
(C) Ideal Point and Worst Point of FCP
The normalized Ideal Point of FCP is located where all indexes are
1, and the normalized Worst Point of FCP is located where all indexes
are 0. The real scenarios scatter inside those boundaries. The closer
the values are to the Ideal Point, the better the scenario is. Note that
in this study, each patient being assessed for diabetes is considered a
scenario.
(D) FCP Computation Steps
Based on Bardossy and Duckstein (1992), FCP computation is divided
into the following steps:
First step: compute the fuzzy composite index (FCI) for the low bound and
the high bound of both the Most Likely Interval (MLI) and the Largest
Likely Interval (LLI) for each scenario.
Second step: based on the fuzzy composite indexes of LMLI, HMLI, LLLI and
HLLI computed in the first step, compute the final fuzzy composite index
for each scenario:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
where i is the scenario number;
min = [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], which is the
minimum LLLI over all scenarios;
max = [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], which is the
maximum HLLI over all scenarios.
Third step: compare the final FCI for all scenarios.
The ranking rule is: the larger the final FCI value, the higher the
chance of onset of diabetes for the patient.
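The three steps can be outlined in Python as follows. The expression for the final index is not reproduced in this text, so the sketch substitutes a stand-in that is an assumption, not the method of Bardossy and Duckstein (1992): it averages the four bound indexes of a scenario and rescales the result using the minimum LLLI and maximum HLLI over all scenarios. All names and the three example scenarios are hypothetical.

```python
def final_index(bounds, global_min, global_max):
    """Collapse the four bound indexes of one scenario into one number.

    ASSUMPTION: a stand-in for the paper's unrecoverable expression. It takes
    the mean of the four bound indexes and rescales it by the global range.
    """
    llli, lmli, hmli, hlli = bounds
    centre = (llli + lmli + hmli + hlli) / 4.0
    return (centre - global_min) / (global_max - global_min)


def rank_scenarios(scenarios):
    """Rank scenarios from the largest final index to the smallest.

    scenarios maps a name to its (LLLI, LMLI, HMLI, HLLI) composite indexes,
    i.e. the output of the first step.
    """
    global_min = min(bounds[0] for bounds in scenarios.values())  # min LLLI
    global_max = max(bounds[3] for bounds in scenarios.values())  # max HLLI
    scored = {
        name: final_index(bounds, global_min, global_max)
        for name, bounds in scenarios.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical first-step output for three patients.
example = {
    "patient_a": (0.30, 0.40, 0.45, 0.60),
    "patient_b": (0.50, 0.60, 0.65, 0.80),
    "patient_c": (0.20, 0.25, 0.30, 0.40),
}
for name, score in rank_scenarios(example):
    print(name, round(score, 3))
```

The sketch requires at least two scenarios with different bounds, since the rescaling divides by the difference between the global maximum and minimum.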
(E) FuzzyDeciMaker
The FuzzyDeciMaker tool was developed by the Civil Engineering
Department of the University of Nebraska at Lincoln. It is a software
tool that implements FCP functions: it supports building a tree data
structure, inputting data, calculating the Fuzzy Composite Index at
different levels and ranking different scenarios. The indicators used in
the measurement of outcomes were based on a combination of
quantitative and qualitative data from the PIMA American Indian Diabetes
dataset. The qualitative measure was the diabetes pedigree
function, which provides an estimate of the hereditary factors for the
patient.
IV. RESEARCH METHODOLOGY
The research methodology consists of creating a
patient scenario in the FuzzyDeciMaker software to represent each
patient's measurements from the PIMA American Indian diabetes
dataset. For this research, 50 patients (25 that were diabetes positive
and 25 with no diabetes) were randomly selected from the dataset of 768
cases (500 with no disease and 268 with disease). The dataset contains
the following variables:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose
tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
(A) Fuzzy Classification Using FuzzyDeciMaker
Since substitution is allowed among all indicators, the
balancing factors for all indicators are set to 1.
The worst and best values for each
indicator are shown in Table 1. For this dataset, for each scenario, the
maximum value in the dataset is set as the HLLI, the minimum value in the
dataset as the LLLI, and the 95% confidence limits of the average value of
that scenario as the LMLI and HMLI.
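One way to derive the four bounds from a set of observed values is sketched below. The paper does not state how the 95% confidence limits were computed, so the normal approximation used here (mean plus or minus 1.96 standard errors) is an assumption, as are the function name and the example readings.

```python
import math
import statistics


def interval_bounds(values, z=1.96):
    """Return (LLLI, LMLI, HMLI, HLLI) for a list of observed values.

    LLLI and HLLI are the minimum and maximum of the data. LMLI and HMLI are
    ASSUMED to be the normal-approximation 95% confidence limits of the mean.
    """
    n = len(values)
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(n)
    return min(values), mean - half_width, mean + half_width, max(values)


# Hypothetical blood glucose readings.
readings = [88, 110, 120, 135, 146, 173]
llli, lmli, hmli, hlli = interval_bounds(readings)
print(llli, round(lmli, 1), round(hmli, 1), hlli)
```

For very small or highly skewed samples the confidence limits can fall outside the observed range, so the ordering of the four bounds is worth checking before they are used.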
(B) Logistic Regression Classifier
A logistic regression classifier was produced using a training/test split
with 469 cases (301 with no disease and 168 with disease) in the training
set and 50 cases (25 with no disease and 25 with disease) in the test set.
To classify the data without introducing any bias, a probability cutoff
of 0.5 was used. The accuracy, specificity and sensitivity measures for the
classification were obtained from SPSS. The coefficients of the
classifier are shown in Table 2.
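To show how the fitted model turns patient measurements into a classification, the sketch below applies the coefficients reported in Table 2. The coefficient values come from that table; the function names and both example patients are hypothetical, and the cutoff of 0.5 follows the description above.

```python
import math

# Coefficients (column B) as reported in Table 2.
COEFFICIENTS = {
    "timesPreg": 0.096,
    "BMI": 0.085,
    "Pedigree": 0.930,
    "Age": 0.011,
    "PlsGlucose": 0.031,
}
CONSTANT = -8.370


def predicted_probability(patient):
    """Logistic model: p = 1 / (1 + exp(-(constant + sum of B * x)))."""
    logit = CONSTANT + sum(
        COEFFICIENTS[name] * patient[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-logit))


def classify(patient, cutoff=0.5):
    """Mark the patient positive when the probability exceeds the cutoff."""
    return predicted_probability(patient) > cutoff


# Two hypothetical patients who differ only in plasma glucose.
patient_low = {"timesPreg": 2, "BMI": 30, "Pedigree": 0.5, "Age": 40,
               "PlsGlucose": 150}
patient_high = dict(patient_low, PlsGlucose=180)

for patient in (patient_low, patient_high):
    print(round(predicted_probability(patient), 3), classify(patient))
```

Varying the cutoff passed to `classify` over its full range is what traces out the ROC curve for this classifier.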
V. RESULTS
Figure 2 shows the ROC curves based on classification results of
the classifier based on Logistic Regression and the classifier based on
the Fuzzy Composite Programming (FCP) for each of the 50 patients.
Figure 2 allows the performance of the FCP classifier
and the logistic regression classifier to be compared. The
classifier based on logistic regression has an AUC of 0.648, while
the classifier based on FCP has an AUC of 0.722. The AUC for the fuzzy
classifier is therefore 0.074 greater than the AUC for the logistic
regression classifier for the task of classifying the randomly picked
50-patient sample from the Pima Diabetes dataset. Hence, on this sample,
the performance of the fuzzy classifier is better than that of the
logistic regression classifier.
VI. CONCLUSIONS
This study aimed to build a multi-criteria decision making model
using fuzzy composite programming to assess diabetes and then compare
the performance of that fuzzy classifier to the performance of a
logistic regression classifier using ROC curve analysis. Drawing on
the PIMA American Indian diabetes dataset, criteria were selected to build
the final FCP model. Both quantitative data and qualitative data were
used in the hierarchical model. As seen from this research, Fuzzy
Composite Programming (FCP) is an appropriate decision making model for
working with mixed indicator data (quantitative and qualitative), as well
as with conflicting, uncertain and hierarchical criteria. The results of
this study motivate further research into the use of FCP in constructing
and testing classifiers for the management of chronic conditions such as
stress, hypertension and mental health. These medical areas have an
abundance of mixed data, with some survey data that is qualitative and
other lab data that is quantitative, and hence there is potential for the
fuzzy classifier to perform better.
[FIGURE 2 OMITTED]
References
ADA (2011), "Diabetes Statistics",
http://www.diabetes.org/diabetes-basics/diabetes-statistics, accessed
Jan 11, 2011.
Bardossy, A. and Duckstein, L. (1992), "Analysis of a Karstic
Aquifer Management by Fuzzy Composite Programming", Water Resources
Bulletin (28:1), 1992, pp. 63-73.
Chen, Z. (2003), "Computational Intelligence for Decision
Support", CRC Press.
Epstein, A. J. (2006), Do Cardiac Surgery Report Cards Reduce
Mortality? Assessing the Evidence. Medical Care Research and Review, 63
(4), 403-426.
Fawcett, T. (2003), "ROC Graphs: Notes and Practical
Considerations for Data Mining Researchers", HP Labs Technical Report
HPL-2003-4.
Hagemeister, M., Jones, D. and Woldt, W. (1996), "Hazard
Ranking of Landfills Using Fuzzy Composite Programming", Journal of
Environmental Engineering, April, 1996, pp. 248-258.
Lee, Y. L., Dahab, M. and Bogardi, I. (1992), "Nitrate Risk
Assessment under Uncertainty", Journal of Water Resources Planning
and Management (118:2), 1992, pp. 151-165.
Linden, A. (2004), "Measuring Diagnostic and Predictive
Accuracy in Disease Management: An Introduction to Receiver Operating
Characteristic (ROC) Analysis", Journal of Evaluation in Clinical
Practice, 12 (2), pp. 132-139.
McGrath, K., Hendy, J., Klecun, E. and Young, T. (2008), The Vision
and Reality of 'Connecting For Health': Tensions,
Opportunities, and Policy Implications of the UK National Programme.
Communications of the Association for Information Systems, 23 (33).
Prodanovic, P. and Simonovic, S. (2002), "Comparison of
Fuzzy Set Ranking Methods for Implementation in Water Resources Decision
Making", Canadian Journal of Civil Engineering (29), 2002, pp.
692-701.
Sadip R. and Veitch, B. (2002), "An Integrated Approach to
Environmental Decision-making for offshore Oil and Gas Operations",
Canada-Brazil Oil & Gas HSE seminar and Workshop, March 11-12, 2002.
Sigillito, V. (1990), Pima Indians Diabetes Database,
http://www.ics.uci.edu/~mlearn/databases/pima-indians-diabetes/pima-indians-diabetes.names
(1990).
Simon, H. A. (1977), "The New Science of Management Decision",
Prentice-Hall, Englewood Cliffs, NJ.
Note
(1.) There was high correlation between glucose and insulin
variables in the data set. This created ambiguity in the regression
model when both were included in the model. Hence the insulin variable
was dropped from the regression models.
BISWADIP GHOSH, Computer Information Systems, Metropolitan State
College of Denver, Denver, USA
Table 1
Fuzzy Weights, Best Value and Worst Value
Indicator           Weight   Balancing Factor   Best Value   Worst Value
Times Pregnant      0.15     1                  1            9
BMI                 0.14     1                  19           58
Pedigree Function   0.15     1                  0.27         1.60
Blood Glucose       0.14     1                  88           173
Age                 0.14     1                  21           81
Insulin Level       0.14     1                  111          540
Skin Fold           0.14     1                  23           38
Table 2
Logistic Regression Coefficients
Variable      B        S.E.   Wald     df   Sig.   Exp(B)
timesPreg     .096     .047   4.192    1    .041   1.101
BMI           .085     .018   23.334   1    .000   1.088
Pedigree      .930     .355   6.851    1    .009   2.535
Age           .011     .011   1.083    1    .298   1.011
PlsGlucose    .031     .004   52.167   1    .000   1.031
Constant      -8.370   .863   94.134   1    .000   .000