Data is currently being collected and
accumulated at a dramatic pace in a number of different
scientific areas. This data accumulation can vary from the
long-term archiving of the entire collection of raw data, to
the persistent storage of summary statistics only. The type
of data being analysed can also vary in content from
text-based data streams to numeric data (and increasingly
image-based data), managed in distributed file systems or
structured databases. There is often a distinction made
between machine learning algorithms/statistical analysis and
data mining; the former is seen as the set of theories and
computational methods needed to deal with a variety of
different analysis problems, whereas the latter is seen as a
means to encode such algorithms in a form that can be
efficiently used in real world applications. Often data
mining applications and toolkits contain a variety of
machine learning algorithms that can be used alongside a
number of other components, such as those needed to sample a
data set, read/write output from/to data sources, and
visualise the outcome of analysis algorithms in some
meaningful way.
Visualisation is also often seen as a key component within
many data mining applications, as the results of
data mining applications/toolkits are often used by
individuals not fully conversant with the details of the
algorithm deployed for analysis. Further, users of the results
of data mining are generally domain experts (and
not algorithm experts), and often some (albeit limited)
support is needed to allow such a user to choose an
algorithm.
The basic problem addressed by the data mining process is
one of mapping low-level data (which are
typically too voluminous to understand) into other forms
that might be more compact (for example, a short
report), more abstract (for example, a descriptive
approximation or model of the process that generated the
data), or more useful (for example, a predictive model for
estimating the value of future cases). At the core
of the process is the application of specific data-mining
methods for pattern discovery and extraction. This
process is often structured as a set of interactive and
iterative stages within a discovery pipeline/workflow. At
these different stages of the discovery pipeline, a user
needs to access, integrate and analyse data from
disparate sources, to use data patterns and models generated
through intermediate stages, and to feed those
models to further stages in the pipeline. Consider, for
instance, a breast-cancer data set acquired by a cancer
research centre, where a physician carries out a series of
experiments on breast cancer cases and records the
results in a database. The data now needs to be analysed to
discover knowledge of the possible causes (or trends) of
breast cancer. One approach is to use a
classification algorithm. However, applying an
appropriate classification algorithm requires some
preliminary understanding of the approach used in the
classification algorithm and, where the size of the data is
large, requires the processing to be
carried out on computational resources suitable for handling
the large volume of data.
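The staged pipeline described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the toolkit's implementation: the records are made-up toy values (not clinical data), the function names are hypothetical, and a simple one-rule classifier stands in for whichever classification algorithm a user might select.

```python
from collections import Counter, defaultdict

# Hypothetical toy records (NOT real clinical data): (attributes, class label).
RECORDS = [
    ({"size": "large", "shape": "irregular"}, "malignant"),
    ({"size": "large", "shape": "regular"}, "malignant"),
    ({"size": "small", "shape": "regular"}, "benign"),
    ({"size": "small", "shape": "irregular"}, "benign"),
    ({"size": "small", "shape": "regular"}, "benign"),
    ({"size": "large", "shape": "irregular"}, "malignant"),
]


def access_stage():
    """Stage 1: obtain the data (an in-memory list stands in for a database)."""
    return list(RECORDS)


def train_stage(records):
    """Stage 2: build a model with a one-rule classifier.

    For each attribute, map every value to its majority class, then keep
    the attribute whose rule makes the fewest errors on the training data.
    """
    best = None
    for attr in sorted(records[0][0]):
        counts = defaultdict(Counter)
        for features, label in records:
            counts[features[attr]][label] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for f, label in records if rule[f[attr]] != label)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    attr, rule, _ = best
    # Fallback class for attribute values never seen during training.
    default = Counter(label for _, label in records).most_common(1)[0][0]
    return {"attribute": attr, "rule": rule, "default": default}


def predict_stage(model, features):
    """Stage 3: apply the model produced by the previous stage to a new case."""
    return model["rule"].get(features[model["attribute"]], model["default"])


def run_pipeline(new_case):
    """Chain the stages: the output of each stage feeds the next one."""
    records = access_stage()
    model = train_stage(records)
    return model, predict_stage(model, new_case)


model, prediction = run_pipeline({"size": "large", "shape": "regular"})
print(model["attribute"], prediction)
```

Keeping each stage as a separate function mirrors the workflow view: any stage could in principle be replaced by a remote service without changing the stages around it.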
The availability of Web Service standards (such as WSDL and
SOAP), and their adoption by a number of
communities, including the Grid community as part of the Web
Services Resource Framework (WSRF),
indicates that development of a data mining toolkit based on
Web Services is likely to be useful to a
significant user community. Providing data mining Web
Services also enables these to be integrated with
other third party services, allowing data mining algorithms
to be embedded within existing applications.
The
project presents a data mining toolkit that makes use of Web
Services composition, with the widely deployed
Triana workflow
environment. Most of the Web Services are derived from the
WEKA data
mining library of algorithms, and contain approximately 75
different algorithms (primarily classifiers, clustering
algorithms and association rules). Additional capability is
provided to support attribute search and selection within a
numeric data set, and 20 different approaches are provided
to achieve this (such as a genetic search operator).
Visualisation capability is provided by wrapping the GNUPlot
software; additional capability is supported through the
deployment of a Mathematica Web Service (developed using the
MathLink software). Other visualisation routines include a
decision tree and a cluster visualiser.