Summary of:
"Deployable Suite of Data Mining Web Services for Online Science Data Repositories", 23rd Conference on IIPS, (2007).
Background
I read this article because I wanted to find an example of someone using web services for something that (a) is specific and real, (b) is compelling (to me), and (c) involves legacy systems and data stores. Most of my reading about web services has been generic, hypothetical, or simply uninteresting (for example, "retrieve purchase orders" or "process inventory". Yawn.).
Summary
NASA is focused on developing a distributed service-oriented architecture to enable remote data processing of its huge data stores. This paper describes a NASA ACCESS project called “Deployable Suite of Data Mining Web Services for Online Science Data Repositories”, which is aimed at developing web service based technology to allow scientists to locally define analysis workflows that can directly use data residing in online repositories. Furthermore, it aims to allow scientists both within and outside of NASA to "combine persistent distributed data processing services and make them available to other users over the internet" (p. 1).
This project chose web service technology because it "provides a standard way to remotely interface with programmatic components and to orchestrate the chaining of services through standardized service descriptions" (p. 2). The solution under development uses the standard XML/SOAP/WSDL stack, plus BPEL4WS for process orchestration. The project will use an existing XML schema for data transfer (ESML, the Earth Science Markup Language). It will also adapt an existing data mining toolkit, ADaM (Algorithm Development and Mining), so that existing mining algorithms can be reused and new ones developed, and it will encapsulate existing data access and analysis applications such as the Simple, Scalable Script-based Science Processor for Measurements (S4PM).
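To make the quoted idea of chaining services "through standardized service descriptions" concrete, here is a toy sketch in Python. It is not BPEL4WS, WSDL, or anything from the paper: the ServiceDescription class, the compose function, and the three example services are all hypothetical. The only point is that when every service declares its input and output types, a composer can check that a chain fits together before running it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ServiceDescription:
    """Toy stand-in for a service description: a name plus input/output types."""
    name: str
    input_type: str
    output_type: str
    invoke: Callable[[object], object]


def compose(services: List[ServiceDescription]) -> Callable[[object], object]:
    """Chain services after checking that adjacent descriptions are compatible."""
    for upstream, downstream in zip(services, services[1:]):
        if upstream.output_type != downstream.input_type:
            raise TypeError(
                f"{upstream.name} produces {upstream.output_type}, "
                f"but {downstream.name} expects {downstream.input_type}"
            )

    def workflow(payload: object) -> object:
        # Feed each service's output into the next service in the chain.
        for service in services:
            payload = service.invoke(payload)
        return payload

    return workflow


# Hypothetical services: subset a grid, threshold it, count the hits.
subset = ServiceDescription("subset", "grid", "grid", lambda values: values[:3])
threshold = ServiceDescription("threshold", "grid", "mask",
                               lambda values: [v > 4 for v in values])
count = ServiceDescription("count", "mask", "number", lambda mask: sum(mask))

workflow = compose([subset, threshold, count])
result = workflow([1, 5, 9, 7])  # subset -> [1, 5, 9]; mask -> [False, True, True]; count -> 2
```

Composing subset directly with count raises a TypeError, because a "grid" output does not match a "mask" input. In the real project that compatibility information would presumably come from the WSDL descriptions rather than from hand-written strings.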
A major goal of this project is to expose both legacy and future data sets and databases to scientists worldwide, and to let them share data mining and analysis solutions. Previously these data were inaccessible outside of NASA computers: researchers had to request access and then either travel to the NASA site or transport huge quantities of data from NASA to their own site for processing. Any analysis solution a researcher developed would typically remain isolated at that researcher's site, inaccessible to other scientists.
The article also describes a prototype architecture and its first application.
Issues
WSDL records
This paper contains the first description I've found of in-situ use of WSDL, as well as workflows:
"A workflow is generated within the Workflow Composer [part of ActiveBPEL], based on experimentation in the Sand Box. This workflow is then deployed to a BPEL engine, which returns a URL pointing to the WSDL for that workflow. This URL is then transmitted to the GES DISC via a Web Services request along with a specification of the data to be mined, such as the dataset to be mined and temporal or spatial constraints" (p. 6).
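The last step in that quote, sending the workflow's WSDL URL to the repository together with a description of the data to be mined, could look roughly like the sketch below. The paper does not give the actual message format, so everything specific here is my assumption: the MINING_NS namespace, the element names (MiningRequest, WorkflowWsdl, Dataset, TemporalRange, BoundingBox), and the build_mining_request function are hypothetical. Only the SOAP 1.1 envelope namespace is standard. The sketch builds the request and does not send it anywhere.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 envelope namespace
MINING_NS = "http://example.org/mining"  # hypothetical namespace, not from the paper


def build_mining_request(workflow_wsdl_url, dataset, start, end, bbox):
    """Build a SOAP envelope naming a deployed workflow and the data to mine.

    bbox is (west, south, east, north) in degrees. All element names under
    MINING_NS are illustrative assumptions.
    """
    ET.register_namespace("soap", SOAP_NS)
    ET.register_namespace("m", MINING_NS)

    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{MINING_NS}}}MiningRequest")

    # URL returned by the BPEL engine when the workflow was deployed.
    ET.SubElement(request, f"{{{MINING_NS}}}WorkflowWsdl").text = workflow_wsdl_url

    # Specification of the data to be mined: dataset plus temporal/spatial constraints.
    ET.SubElement(request, f"{{{MINING_NS}}}Dataset").text = dataset
    ET.SubElement(request, f"{{{MINING_NS}}}TemporalRange", start=start, end=end)
    west, south, east, north = bbox
    ET.SubElement(
        request, f"{{{MINING_NS}}}BoundingBox",
        west=str(west), south=str(south), east=str(east), north=str(north),
    )
    return ET.tostring(envelope, encoding="unicode")


# Placeholder values only; none of these come from the paper.
xml_text = build_mining_request(
    "http://example.org/workflows/demo?wsdl",
    "EXAMPLE_DATASET",
    "2005-01-01",
    "2005-01-31",
    (-100.0, 25.0, -80.0, 40.0),
)
```

What I find interesting is that the request carries a pointer to the workflow rather than the workflow itself: the repository fetches the WSDL and runs the mining next to the data, so the bulk data never has to travel.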
NASA ACCESS
NASA ACCESS seems to be a NASA-sponsored program for creating computational infrastructure that will allow scientists to share data and processing resources more easily. Other NASA ACCESS projects I found on the Internet are the Data and Information Application Layer (DIAL) project, A Distributed Knowledge Extraction Framework Based on Semantic Web Services (SKIF), and Modeling and On-the-fly Solutions for Solid Earth Sciences (MOSES).