 

 

 
 



 
 
Data is currently being collected and accumulated at a dramatic pace across many scientific areas. This accumulation ranges from long-term archiving of the entire collection of raw data to the persistent storage of summary statistics only. The data being analysed also varies in content, from text-based data streams to numeric data (and, increasingly, image-based data), and may be managed in distributed file systems or structured databases. A distinction is often made between machine learning algorithms/statistical analysis and data mining: the former is seen as the set of theories and computational methods needed to deal with a variety of analysis problems, whereas the latter is seen as a means of encoding such algorithms in a form that can be used efficiently in real-world applications. Data mining applications and toolkits often contain a variety of machine learning algorithms that can be used alongside other components, such as those needed to sample a data set, read from and write to data sources, and visualise the outcome of analysis algorithms in some meaningful way.

Visualisation is often seen as a key component of data mining applications, because the results produced by such applications and toolkits are frequently used by individuals who are not fully conversant with the details of the algorithm deployed for analysis. Users of data mining results are generally domain experts rather than algorithm experts, and some (albeit limited) support is often needed to help such a user choose an algorithm.

The basic problem addressed by the data mining process is one of mapping low-level data (which are typically too voluminous to understand) into other forms that might be more compact (for example, a short report), more abstract (for example, a descriptive approximation or model of the process that generated the data), or more useful (for example, a predictive model for estimating the value of future cases). At the core of the process is the application of specific data mining methods for pattern discovery and extraction. The process is often structured as a set of interactive and iterative stages within a discovery pipeline or workflow. At the different stages of this pipeline, a user needs to access, integrate and analyse data from disparate sources, to use the data patterns and models generated by intermediate stages, and to feed those models to later stages in the pipeline.

Consider, for instance, a breast cancer data set acquired by a cancer research centre, where a physician carries out a series of experiments on breast cancer cases and records the results in a database. The data now needs to be analysed to discover knowledge of the possible causes (or trends) of breast cancer. One approach is to use a classification algorithm. However, applying an appropriate classification algorithm requires some preliminary understanding of the approach it uses and, where the data set is large, requires the processing to be carried out on computational resources capable of handling that volume of data.
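As a rough illustration of the classification step described above, the following sketch trains a very simple "one rule" classifier: for each attribute it builds a rule mapping every attribute value to its majority class, then keeps the attribute whose rule makes the fewest training errors. This is a stand-in chosen for brevity, not the algorithm used by the toolkit. The attribute names (`size`, `age_band`, `outcome`) and the records are invented toy values, not real clinical data.

```python
from collections import Counter, defaultdict


def train_one_rule(records, target):
    """Return (attribute, rule, errors) for the single attribute whose
    value -> majority-class rule makes the fewest training errors."""
    best = None
    attributes = [name for name in records[0] if name != target]
    for attribute in attributes:
        # Count how often each class occurs for each value of this attribute.
        counts = defaultdict(Counter)
        for record in records:
            counts[record[attribute]][record[target]] += 1
        # The rule predicts the majority class for each attribute value.
        rule = {value: classes.most_common(1)[0][0]
                for value, classes in counts.items()}
        errors = sum(1 for r in records if rule[r[attribute]] != r[target])
        if best is None or errors < best[2]:
            best = (attribute, rule, errors)
    return best


def predict(model, record, default=None):
    """Apply the learned rule; unseen attribute values fall back to `default`."""
    attribute, rule, _ = model
    return rule.get(record[attribute], default)


# Invented toy records (hypothetical attribute names and values).
TOY_RECORDS = [
    {"size": "large", "age_band": "50+", "outcome": "recurrence"},
    {"size": "large", "age_band": "under50", "outcome": "recurrence"},
    {"size": "small", "age_band": "50+", "outcome": "no_recurrence"},
    {"size": "small", "age_band": "under50", "outcome": "no_recurrence"},
    {"size": "small", "age_band": "50+", "outcome": "no_recurrence"},
]

model = train_one_rule(TOY_RECORDS, "outcome")
new_case = {"size": "large", "age_band": "50+"}
prediction = predict(model, new_case)
```

The learned model is a compact, human-readable rule, which matches the idea of mapping voluminous low-level data into a more abstract and more useful form. A domain expert would still need guidance on whether such a simple model is adequate for the data at hand.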

The availability of Web Service standards (such as WSDL and SOAP), and their adoption by a number of communities, including the Grid community as part of the Web Services Resource Framework (WSRF), indicates that a data mining toolkit based on Web Services is likely to be useful to a significant user community. Providing data mining algorithms as Web Services also allows them to be integrated with third-party services and embedded within existing applications.
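To make the Web Services idea concrete, the sketch below builds a SOAP 1.1 request envelope for a data mining operation using only the Python standard library. The service namespace (`urn:example:datamining`), the operation name (`classify`), and the parameter names (`algorithm`, `datasetUrl`) are assumptions made for illustration; the toolkit's actual WSDL interface is not described here and may differ.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical namespace for an illustrative data mining service.
SERVICE_NS = "urn:example:datamining"


def build_request(operation, params):
    """Serialise a SOAP 1.1 envelope whose body calls `operation` with `params`."""
    ET.register_namespace("soapenv", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes a child element of the operation element.
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Illustrative call: the algorithm name and data set URL are placeholders.
request_xml = build_request(
    "classify",
    {"algorithm": "J48", "datasetUrl": "http://example.org/data.arff"},
)
```

Because the request is plain XML sent over a standard protocol, any third-party application able to produce such a message could, in principle, invoke the service without linking against the data mining library itself.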

The project presents a data mining toolkit based on Web Services composition, using the widely deployed Triana workflow environment. Most of the Web Services are derived from the WEKA data mining library and together provide approximately 75 different algorithms (primarily classifiers, clustering algorithms and association rules). Additional capability is provided to support attribute search and selection within a numeric data set, with 20 different approaches available (such as a genetic search operator). Visualisation is provided by wrapping the GNUPlot software, with additional capability supported through a Mathematica Web Service (developed using the MathLink software). Other visualisation routines include a decision tree visualiser and a cluster visualiser.
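The idea behind attribute search and selection on a numeric data set can be sketched as follows. This example uses greedy forward selection scored by the leave-one-out accuracy of a 1-nearest-neighbour classifier; it is a deliberately simple substitute for the search strategies mentioned above (it is not a genetic search, and it is not taken from the toolkit). The function names and the toy data are invented for illustration.

```python
def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier that
    measures squared Euclidean distance over the columns in `subset` only."""
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # never compare a row with itself
            dist = sum((row[k] - other[k]) ** 2 for k in subset)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        if best_label == labels[i]:
            correct += 1
    return correct / len(rows)


def forward_select(rows, labels):
    """Greedily add the column that most improves accuracy; stop when
    no remaining column gives a strict improvement."""
    selected, best_score = [], 0.0
    remaining = list(range(len(rows[0])))
    while remaining:
        scored = [(loo_accuracy(rows, labels, selected + [k]), k)
                  for k in remaining]
        score, k = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break
        selected.append(k)
        remaining.remove(k)
        best_score = score
    return selected, best_score


# Invented toy data: column 0 separates the classes, column 1 does not.
ROWS = [[1.0, 5.0], [1.2, 1.0], [0.9, 3.0],
        [3.0, 4.8], [3.1, 1.1], [2.9, 3.2]]
LABELS = ["a", "a", "a", "b", "b", "b"]

selected, best_score = forward_select(ROWS, LABELS)
```

Greedy search evaluates far fewer subsets than an exhaustive search but can miss attributes that are only useful in combination, which is one reason a toolkit might offer several alternative search strategies.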
 


 
Contents Copyright ©  2005. All rights reserved.