Benchmark experiments are the method of choice to compare
learning algorithms empirically. For collections of data sets, the empirical
performance distributions of a set of learning algorithms are estimated, compared,
and ordered. Usually this is done for each data set separately. The
present manuscript extends this single-data-set approach to a joint analysis of the complete collection, the so-called problem domain. This makes it possible to decide which algorithms to deploy in a specific application, or to compare newly developed algorithms with well-known algorithms on established problem domains.
Specialized visualization methods allow for the easy exploration of large amounts
of benchmark data. Furthermore, we take the benchmark experiment design
into account and use mixed-effects models to provide a formal statistical analysis.
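To make the modeling idea concrete, the following is a minimal sketch of fitting a mixed-effects model to domain-wide benchmark results. The data frame, the column names (dataset, algorithm, accuracy), the toy values, and the simple random-intercept specification are assumptions for illustration only and do not reproduce the manuscript's exact model or software.

```python
# Hedged sketch: a mixed-effects model for benchmark results pooled over a
# problem domain. Column names and values are illustrative assumptions.
import pandas as pd
import statsmodels.formula.api as smf

# Long-format benchmark results: one row per (data set, algorithm, replication).
# Accuracy values below are made up for the example.
results = pd.DataFrame({
    "dataset":   ["iris", "iris", "sonar", "sonar", "glass", "glass"] * 2,
    "algorithm": ["svm", "rf"] * 6,
    "accuracy":  [0.95, 0.94, 0.82, 0.86, 0.68, 0.72,
                  0.96, 0.93, 0.80, 0.87, 0.70, 0.71],
})

# Fixed effect for the algorithm, random intercept per data set: the data sets
# of the domain are treated as a sample of problems, reflecting the benchmark
# experiment design described above.
model = smf.mixedlm("accuracy ~ algorithm",
                    data=results,
                    groups=results["dataset"])
fit = model.fit()
print(fit.summary())
```

In such a specification, the estimated fixed effects describe algorithm performance across the whole domain, while the random intercepts absorb data-set-specific difficulty.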
Two domain-based benchmark experiments demonstrate our methods:
the UCI domain, a well-known domain used when developing new algorithms;
and the Grasshopper domain, where we want to find the best learning algorithm
for a prediction component in an enterprise application software system.