This is an updated version of the original "How-To:Create CaGrid Workflow Using Taverna 1.7".
The purpose of this guide is to provide caGrid users who have little or no knowledge of the Taverna Workbench with a basic understanding of how to discover caGrid services and compose them in the Taverna GUI.
The Taverna Workbench allows users to construct complex workflows that consist of multiple types of components. Each type of component is called an "Activity". Once combined in a workflow, these components, which may be located on different machines, are orchestrated by Taverna which then gathers the results and displays them in the workbench interface.
The 2.0 release of Taverna supports many types of activities, including: apiconsumer activity, beanshell activity, biomart activity, biomoby activity, java activity, soaplab activity, and wsdl activity, among others.
The caGrid Workflow team has built two plug-ins into Taverna 2.0 that help facilitate the discovery, invocation, and execution of caGrid services through the Taverna Workbench. Those plug-ins are:
- cagrid discoverer: This plug-in is used to query caGrid services from the caGrid Index Service. This allows you to easily find the available caGrid services, and select and leverage them as needed for your workflow.
- workflow-execution service: This workflow-execution plug-in passes a workflow definition file, including the appropriate inputs, to a generic caGrid service where the workflow is then executed. This is useful in instances where the workflow is long-running and relies on constant access to the Grid to execute properly.
Taverna 2 also supports grid services built using WSRF specifications.
The caGrid Workflow team would like to thank the entire Taverna team for their efforts in identifying ways to use and extend the Taverna tool for use with caGrid. In particular we recognize the contributions of Stian Soiland-Reyes, Tom Oinn, Alan Williams, and many others for their great effort. We also thank Mr. Alan Williams for his comments toward improving this manual.
For additional reference, the list below provides links to some useful web resources that are mentioned throughout this guide:
- Taverna: http://www.taverna.org.uk
- caBIG project: https://cabig-kc.nci.nih.gov/CaGrid/KC/index.php/Main_Page
- caGrid Plug-ins download: http://software.cagrid.org/taverna2/
The first step to using Taverna is to install Taverna 2.0 (also referred to as the Taverna Workbench).
Before installing the program, however, be sure to review the system requirements for installation. That information is available at the following URL: http://www.taverna.org.uk/download/workbench/system-requirements/.
For the most part, Taverna 2.0 works with Java 1.5 or higher, although it is not fully compatible with some Java 1.6 versions.
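As a quick pre-install check, the Python sketch below parses a Java version string and reports whether it meets the 1.5 minimum stated above. The function name and the sample strings are illustrative assumptions, and the sketch does not try to detect the specific Java 1.6 versions that are not fully compatible.

```python
import re

def java_version_ok(version_text, minimum=(1, 5)):
    """Return True if the version found in `version_text` is >= `minimum`.

    `version_text` is expected to resemble the first line printed by
    `java -version`, for example: java version "1.6.0_45"
    """
    match = re.search(r'"(\d+)(?:\.(\d+))?', version_text)
    if match is None:
        return False  # no recognizable version string
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return (major, minor) >= minimum


if __name__ == "__main__":
    # Sample strings are hypothetical; paste your own `java -version` output.
    print(java_version_ok('java version "1.6.0_45"'))
    print(java_version_ok('java version "1.4.2_19"'))
```

You can paste the first line of your own `java -version` output into the function to check it.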
The Taverna 2.0 Workbench program is available for download from the following URL: http://sourceforge.net/projects/taverna/files/taverna/2.0.0. This page also provides instructions for installing the program once you have downloaded it.
If you are using Linux, you must install the graphviz package so that Taverna can find the "dot" executable. The graphviz package can be found at the following URL: http://www.graphviz.org.
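To confirm that the "dot" executable is visible before you start Taverna, a minimal Python sketch such as the following can help. It only searches the directories on your PATH; whether Taverna looks in exactly the same places is an assumption.

```python
import shutil

def find_dot(path=None):
    """Return the full path of the Graphviz `dot` executable, or None.

    `path` is an optional search path string; by default the PATH
    environment variable is used.
    """
    return shutil.which("dot", path=path)


if __name__ == "__main__":
    location = find_dot()
    if location:
        print("dot found at", location)
    else:
        print("dot not found; install the graphviz package")
```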
If this is your first time using Taverna 2.0, be advised that the program may take some time to start, since it needs to fetch components from the web.
If you have used previous versions of Taverna (e.g., Taverna 1.x), you may want to review the Quick Guide for Taverna 1 Users. This guide provides a basic feature comparison between Taverna 1.7 and 2.0 and identifies those features that are not provided in 2.0. The quick guide also lists the restrictions for upgrading your existing 1.x Taverna workflows to 2.0 workflows.
The quick guide is available at the following URL:
You can find more complete Taverna 2.0 user documentation at the following URL:
The caGrid Workflow team has developed two plug-ins for connecting Taverna 2.0 and caGrid. In order to use these plug-ins, you must first download them using the Taverna interface.
To download the caGrid plug-ins:
1. Open the Taverna 2.0 Workbench.
2. From the Advanced option on the main menu, select Software Updates.
3. In the Plugin Manager dialog box that appears, click Find New Plugins located at the bottom of the dialog box.
4. In the Plugin Sites dialog box that appears, click Add Plugin site located at the bottom of the dialog box.
5. A dialog box appears that allows you to identify the site from which you want to download the plug-in(s). Enter the name and the URL of the site where the plug-ins are located and click OK.
For the caGrid plug-ins, the Site Name is caGrid and the Site URL should be: http://software.cagrid.org/taverna2/.
When the Plugin Sites dialog box reappears, it should show the two caGrid plug-ins found at this site.
6. Make sure the checkboxes for both plug-ins are checked and click Install.
7. After the plug-in installation is complete, shut down Taverna 2.0 and restart it to allow the installation to take effect.
When the Taverna Workbench reappears, you can check the interface to verify the installation of both plug-ins.
The first thing you should notice is a new CaGrid Service icon on the toolbar. The appearance of this icon indicates that the caGrid workflow execution service plug-in is installed. This plug-in allows you to submit a workflow to a caGrid service for execution.
Next, click Activities from the main menu and select New Activity. Within the New Activity sub-menu, you should see a CaGrid option. The appearance of this option means that the caGrid discoverer plug-in is installed. This plug-in allows you to query caGrid services from the caGrid Index Service.
The caGrid Service Discoverer plug-in (identified as cagrid-activity-ui in the Plugin Sites dialog box seen during download/install) acts as a semantic, metadata-based caGrid service search tool.
Often the URL of a caGrid service of interest is not a well-known value to end users. To locate available caGrid services, the service discoverer plug-in integrates the caGrid discovery API to assist users in finding the appropriate caGrid service to use in their workflow.
The simplest discovery option for finding caGrid services is to query the Index Service for all active services for all of caGrid.
When you select this option, the discoverer plug-in returns a full list of caGrid URLs, along with the operations belonging to each URL. Each URL represents a valid caGrid service and can be used by Taverna to create activities that invoke operations on the corresponding service. Details for how to create activities and execute them on the corresponding grid services are provided in the Workflow Modeling section of this guide.
To retrieve a list of all active caGrid services:
1. From the Taverna Workbench main menu, select Activities > New Activity > caGrid.
2. At the top of the Add Your Custom Service Query dialog box, use the drop-down list to select the Location (URL) of the index service you want to query.
To obtain the active caGrid services, select the default caGrid Index Service: http://cagrid-index.nci.nih.gov:8080/wsrf/services/DefaultIndexService.
3. Leave the other fields in the dialog box unchanged, and click Send Service Query.
When the query is finished, Taverna Workbench displays all of the registered services together with their operations. The list of caGrid services appears under the CaGrid Services folder in the Available activities node in the Workbench panel.
These services can be sorted by URL, as shown in the figure below on the left, or by provider, as shown in the figure below on the right. You can toggle between the two views using the arrow located on the right side of the list-type display button.
Regardless of the selected display type, you can view the operations available for each service by double-clicking the entry. The operations that appear can later be added into workflows.
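The two views organize the same query results in different ways. The Python sketch below illustrates the idea with hypothetical data: the service URLs, provider names, and operation names are made up for illustration, and the functions are not part of the discoverer plug-in.

```python
from collections import defaultdict

# Hypothetical query results: (service URL, provider, operation name).
RESULTS = [
    ("http://host-a.example.org/wsrf/services/cagrid/ServiceOne", "Center A", "query"),
    ("http://host-a.example.org/wsrf/services/cagrid/ServiceOne", "Center A", "getResourceProperty"),
    ("http://host-b.example.org/wsrf/services/cagrid/ServiceTwo", "Center B", "query"),
]

def group_by_url(records):
    """Map each service URL to the sorted list of its operations."""
    grouped = defaultdict(list)
    for url, _provider, operation in records:
        grouped[url].append(operation)
    return {url: sorted(ops) for url, ops in sorted(grouped.items())}

def group_by_provider(records):
    """Map each provider to its services, each with its operations."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[1]].append(record)
    return {provider: group_by_url(recs) for provider, recs in sorted(grouped.items())}


if __name__ == "__main__":
    print(group_by_url(RESULTS))
    print(group_by_provider(RESULTS))
```

In both groupings the operations remain attached to their service, which is why the same operations are available for a workflow regardless of the view you choose.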
The second discovery option, beyond retrieving a list of all active services, is to query the Index Service using specific search criteria.
The caGrid discovery APIs provide a wide range of discovery capabilities, from full text search suitable for a free-form webpage-like interface, to simple text-based criteria such as specifying operation names or concept codes, to more complex criteria (i.e., query by example) such as specification of point-of-contact information or UML class criteria.
The discoverer plug-in supports the creation of a semantic-based service query. The Add Custom Service Query dialog box allows for the input of up to three service query criteria and corresponding values.
To create a custom query for caGrid Services:
1. From the Taverna Workbench main menu, select Activities > New Activity > caGrid.
2. At the top of the Add Your Custom Service Query dialog box, use the drop-down list to select the Location (URL) of the index service you want to query.
To query active caGrid services, select the default caGrid Index Service: http://cagrid-index.nci.nih.gov:8080/wsrf/services/DefaultIndexService.
3. Initially the Service Query Criteria list only shows one criteria drop-down list. Click the Add Service Query button to add more criteria to the list (up to three).
4. Use the Service Query Criteria drop-down list(s) and the corresponding Service Query Value drop-down list(s) to provide your query information.
For example, in the figure below we are querying for all active caGrid services where the Research Center name is CBIIT (NCI Center for Biomedical Informatics and Information Technology) and where the Service Name is CaDSRDataService.
5. When the appropriate criteria and values appear, click Send Service Query.
When the query is finished, Taverna Workbench displays all of the services that match the entered criteria, together with their operations. The list of caGrid services appears under the caGrid Services folder in the Available Activities node in the workbench panel, as shown in the figure below.
Since the example query was fairly specific, the figure shows only one result: the CaDSRDataService on caGrid. Depending on the query you construct, however, you may retrieve several services. The retrieved services can be sorted by URL or by provider (as shown in the previous section). You can toggle between the two views using the arrow located on the right side of the list-type display button at the top of the list.
Regardless of the selected display type, you can view the operations available for each service by double-clicking the entry. The operations that appear can later be added into Taverna workflows.
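The effect of combining criteria can be sketched in Python as follows. This is only an illustration: the service records are hypothetical, the field names mirror the criteria used in the example above, and the sketch assumes that all criteria must match (a logical AND), which the example query suggests but the plug-in's actual matching logic may differ.

```python
MAX_CRITERIA = 3  # the dialog box accepts at most three criteria

# Hypothetical service metadata records.
SERVICES = [
    {"Service Name": "CaDSRDataService", "Research Center": "CBIIT",
     "url": "http://host-a.example.org/wsrf/services/cagrid/CaDSRDataService"},
    {"Service Name": "OtherService", "Research Center": "CBIIT",
     "url": "http://host-a.example.org/wsrf/services/cagrid/OtherService"},
    {"Service Name": "CaDSRDataService", "Research Center": "Elsewhere",
     "url": "http://host-b.example.org/wsrf/services/cagrid/CaDSRDataService"},
]

def match_services(services, criteria):
    """Return the services whose metadata satisfies every criterion."""
    if len(criteria) > MAX_CRITERIA:
        raise ValueError("at most %d criteria are supported" % MAX_CRITERIA)
    return [
        service for service in services
        if all(service.get(field) == value for field, value in criteria.items())
    ]


if __name__ == "__main__":
    hits = match_services(
        SERVICES,
        {"Research Center": "CBIIT", "Service Name": "CaDSRDataService"},
    )
    for hit in hits:
        print(hit["url"])
```

Adding a criterion can only narrow the result list, which is why the two-criteria example above returns a single service.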
A Taverna workflow is made up of a combination of:
- activities
- inputs and outputs
- XML splitters, which aggregate/split the input/output data for the activities
- data links
- control links
The purpose of this section is to use a sample to demonstrate the modeling steps needed to create a workflow in Taverna 2.0. The sample workflow is made up of the following items:
- two activities - queryProject and queryClass
- one input - cqlClause
- one output port - classInformation
- some XML splitters to process the input/output.
- a beanshell script activity to transform the output of queryProject to the input of queryClass.
The purpose of the workflow built in this section is straightforward: (1) use a CQL clause to query the CaDSRDataService in order to get a list of projects related to the context 'caBIG', and (2) use the first project object obtained in step 1 to find all of the UML classes in it.
Because the output of the first step does not exactly match the input expected by the second, we must add a beanshell activity to transform the output of queryProject into properly formatted input for the queryClass activity. In addition, we add several XML splitters that help compose complex XML elements or extract child elements automatically.
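For readers unfamiliar with CQL, the Python sketch below builds a small CQL-style clause with a single attribute filter. It is an illustration only, not the clause used by the sample workflow: the target class name and the attribute name are assumptions, and XML namespace declarations are omitted.

```python
import xml.etree.ElementTree as ET

# Assumed class name in the caDSR domain model; verify before use.
PROJECT_CLASS = "gov.nih.nci.cadsr.umlproject.domain.Project"

def cql_clause(target_class, attribute, value, predicate="EQUAL_TO"):
    """Build a CQL-style query that filters one attribute of a target class."""
    query = ET.Element("CQLQuery")
    target = ET.SubElement(query, "Target", name=target_class)
    ET.SubElement(target, "Attribute",
                  name=attribute, value=value, predicate=predicate)
    return ET.tostring(query, encoding="unicode")


if __name__ == "__main__":
    # "shortName" is a placeholder attribute chosen for illustration.
    print(cql_clause(PROJECT_CLASS, "shortName", "caBIG"))
```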
In the figures shown in the example, the different items are indicated with different colors. The XML splitters are shown in purple, the activities (which represent caGrid services) are shown in green, the beanshell activities are shown in yellow, and input/output ports are in blue.
The first step for creating our sample workflow is to add a new activity into an empty workflow.
To add an activity to a workflow:
- If necessary, in the Available activities list, double-click the service to open the list of available operations for that service.
- Find the operation you want to use in your workflow.
- Use the mouse to click and drag the operation in the workflow drawing pane (middle area) of the Taverna Workbench window.
Once an operation appears in the workflow drawing pane, you can click the Display all processor ports button at the top of the middle pane to see all of the input/output ports for the activity. You can also right-click the activity and use the "Rename Processor..." option to rename the activity to "queryProject".
The figure below shows the queryProject activity added to the workflow, with Display all processor ports selected. As the figure shows, the queryProject activity comes from a WSRF service, so there is an additional EndPointReference port.
In Taverna, while it is possible to directly provide the XML data needed by WSDL services, some users may find that some XML data elements are too verbose to handle.
Taverna provides XML splitters, which interrogate the data structure and present the user with the internal data elements. One XML splitter can be used to resolve the input XML data structure at a single level, so multiple splitters might be needed if the XML data contains multiple-level complex types.
For example, the parameters XML element of the input for the queryProject activity contains a <CQLQuery/> node as a sub-element. By adding two consecutive XML splitters to the input port of the queryProject activity, the user can directly input the CQL clause value for the <CQLQuery/> element.
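Conceptually, each input XML splitter assembles one level of nesting around the value you supply. The Python sketch below mimics that behavior for two levels. The wrapper element names are assumptions chosen for illustration, and the function is not how Taverna implements splitters.

```python
import xml.etree.ElementTree as ET

def wrap_levels(leaf_xml, levels):
    """Nest `leaf_xml` inside one wrapper element per name in `levels`.

    Each wrapper corresponds to what a single input XML splitter
    assembles; `levels` is listed from the outermost element inward.
    """
    node = ET.fromstring(leaf_xml)
    for name in reversed(levels):
        parent = ET.Element(name)
        parent.append(node)
        node = parent
    return ET.tostring(node, encoding="unicode")


if __name__ == "__main__":
    # Two levels of wrapping, so two splitters; the names are hypothetical.
    print(wrap_levels("<CQLQuery/>", ["parameters", "cqlQuery"]))
```

With the splitters in place, you only supply the innermost value and the enclosing levels are built for you.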
To add an XML splitter to the workflow:
1. In the workflow drawing pane, click on the appropriate activity to display the activity details in the left pane.
2. In the contextual view (left pane) of the workbench window, click the Add input XML splitter button, located at the bottom of the pane.
3. In the pop-up dialog box that appears, use the drop-down list to select the input port for which the XML splitter is to be added, and then click OK.
The XML splitter appears in the workflow drawing pane of the workbench window.
For this example, the XML splitter is added to the queryProject activity, and once added, it appears in the workflow drawing area with a data link to the activity. Similarly, add another XML splitter on top of the newly added XML splitter.
This completes the activity for step 1. Next, add another activity from the "query" operation of the CaDSRDataService, name it "queryClass", and give it two XML splitters.
Our purpose is to connect the output of queryProject to the input of queryClass (i.e., we would like to first get a project from CaDSR and then get all the UML classes in this project). However, the output format of queryProject does not exactly fit the input of queryClass. Taverna provides a beanshell activity that lets users embed a snippet of Java code to perform a customized data transformation. A beanshell activity can be added from the "Available activities" panel, in the same way you added the caGrid activities earlier. As the figure below shows, a beanshell activity named "Beanshell" has been added.
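To make the mismatch concrete, here is a minimal Python sketch of the kind of transformation involved: it takes the first project from a query result and builds a new query for the classes of that project. It is not the Beanshell script used in the sample workflow. The result layout, the class names, the "id" attribute, and the "project" role name are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed class names in the caDSR domain model; verify before use.
PROJECT_TYPE = "gov.nih.nci.cadsr.umlproject.domain.Project"
CLASS_TYPE = "gov.nih.nci.cadsr.umlproject.domain.UMLClassMetadata"

def local_name(tag):
    """Strip an ElementTree '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]

def first_project_id(results_xml):
    """Return the id attribute of the first Project element in the result."""
    root = ET.fromstring(results_xml)
    for element in root.iter():  # document order
        if local_name(element.tag) == "Project":
            return element.get("id")
    raise ValueError("no Project element in the query result")

def class_query_for(project_id):
    """Build a CQL-style query for classes associated with one project."""
    query = ET.Element("CQLQuery")
    target = ET.SubElement(query, "Target", name=CLASS_TYPE)
    association = ET.SubElement(target, "Association",
                                name=PROJECT_TYPE, roleName="project")
    ET.SubElement(association, "Attribute",
                  name="id", value=project_id, predicate="EQUAL_TO")
    return ET.tostring(query, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical result document with two projects.
    sample = ('<CQLQueryResults><ObjectResult><Project id="P-1"/></ObjectResult>'
              '<ObjectResult><Project id="P-2"/></ObjectResult></CQLQueryResults>')
    print(class_query_for(first_project_id(sample)))
```

In the workflow itself, the equivalent logic runs inside the beanshell activity, reading from its input port and writing to its output port.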
Now you need to configure the beanshell activity. Add both an input and output port. First, choose the beanshell activity in the workflow diagram and click the Configure button from the Contextual View on the bottom left.
In the dialog box that appears, choose the Ports tab and use the Add Port button to add an Input port named "in" and an Output port named "out". Remember to click the OK button to save the configuration in each step.
Afterward we switch to the Script tab view to add the Java code to transform data. Please copy and paste the code snippet below and click the OK button to save.
Data links exist between workflow inputs, activities, and workflow outputs.
For example, a data link between activity A and B feeds the output from activity A to the input of activity B.
As we have seen in the last sub-section, you can have many data links, and each of these data links is added automatically when you add XML splitters for activities.
Data links can also be added manually.
For example, if we want to feed the output of queryProject (port parameters) to the input (port in) of Beanshell, we can draw a line between them to add the appropriate data link between the two items.
Similarly, we add another data link from the output of Beanshell (i.e., port out) to the input of queryClass_cqlQuery (i.e., port CQLQuery). The figure below shows that, by adding two data links, we connect the beanshell activity with other activities in the workflow.
Control links represent the control flow between activities. The target activity of a control link cannot start until the source activity completes.
To add a control link, right-click the activity at the end of the link and select the "Coordinate from" option in the right-click menu. From the list that appears, select the controlling activity for the link.
More specific instructions are not provided in this document because our sample workflow does not contain a control link. However, you can find instructions for creating a control link in the Taverna 2.0 user documentation, including examples of workflows containing control links.
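The ordering constraint that links impose can be sketched in Python as a dependency graph. This is a simplified illustration with hypothetical activity names: it treats both data links and control links as "must come after" constraints and ignores pipelined execution. It requires Python 3.9 or later for the standard `graphlib` module.

```python
from graphlib import TopologicalSorter

def execution_order(data_links, control_links):
    """Return one valid start order for the activities.

    Each link is a (source, target) pair; the target is scheduled
    after the source.
    """
    predecessors = {}
    for source, target in list(data_links) + list(control_links):
        predecessors.setdefault(source, set())
        predecessors.setdefault(target, set()).add(source)
    return list(TopologicalSorter(predecessors).static_order())


if __name__ == "__main__":
    # Hypothetical activities: "notify" consumes no data from "analyze",
    # but a control link still makes it wait for "analyze".
    data_links = [("load", "analyze")]
    control_links = [("analyze", "notify")]
    print(execution_order(data_links, control_links))
```

Without the control link, nothing would constrain when "notify" runs relative to the other two activities.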
Workflow input and Workflow output nodes are used to create workflow inputs and outputs, respectively.
To create input and output nodes, right-click anywhere in the empty space of the workflow diagram pane to see the Create New Input and Create New Output buttons.
Once these nodes are created, you can connect them to the activities or XML splitters by adding data links between them.
When completed, the sample workflow appears as shown in the figure below.
In the sample, you added an input node, cqlClause, to help the user input the CQL clause to query all caBIG related projects in CaDSRDataService. You also created an output node, classInformation, to store the classes' information obtained.
Once your workflow is built, you can execute the workflow using Taverna or through the caGrid workflow execution service.
To execute a workflow using Taverna 2.0:
1. From the Taverna main menu, select File > Run workflow.
2. In the Workflow input builder dialog box, you must provide a value for each input port. Use the buttons at the top of the dialog box to select the type of value that must be entered, and then enter the value for the input in the text box.
The sample workflow contains only one input port, which, as indicated in the previous section, is a string that represents a CQL query clause. Copy and paste the value shown below.
3. When all of the necessary inputs have been entered, click the Launch workflow button located at the top of the dialog box.
The Taverna Workbench window changes to show the Results perspective. The Results pane shows the execution trace of the workflow.
When the workflow execution terminates, the result of the workflow appears, which can then be saved into a file.
There are circumstances where you want your workflow to be executed on the Grid instead of on your local Taverna workbench. This is especially true in cases where the execution of your workflow may take a long time and your machine may need to be shut down. The workflow may take more computational processing or data handling than your local Taverna workbench can afford.
caGrid provides a workflow execution service so that the actual orchestration of the workflow does not rely on a local engine inside the end-user's Taverna workbench installation, but on a remote engine that exposes its capability as a caGrid service.
In principle, this service can be invoked by any WSRF-compliant service client. However, for easier access by Taverna users, there is a plug-in called t2-cagrid that allows you to invoke this service from inside the Taverna Workbench. This is one of the two caGrid plug-ins downloaded and installed in the Download Plug-ins section of this guide.
As stated in that download section, if you see a caGrid Service icon in the Taverna Workbench interface panel, this means that the plug-in was successfully installed.
This button provides an easy method for running a Taverna workflow on the Grid.
To execute a workflow through caGrid workflow execution service:
1. If necessary, open the workflow you want to run.
2. Click the caGrid Service button. If you do not see a CaGrid Service button, review the instructions located in the Download Plug-ins section of this guide.
The Taverna workbench automatically switches to the CaGrid Service perspective. At the top of this view, you should see a caGrid Service URI address box.
3. In the caGrid Service URI address field, enter the URL of the caGrid Service to which you want to submit the currently open workflow.
As caGrid services are configured and deployed to the caGrid Production environment, there should be more caGrid Service URLs available for use.
4. Click the Run Workflow button located below the Grid Service address. An input dialog box appears, allowing you to enter input node values as needed.
5. In the Workflow Input Builder dialog box, you must provide a value for each input port. Use the buttons at the top of the dialog box to select the type of value that must be entered, and then enter the value for the input in the text box.
As noted previously, the sample workflow contains only one input port, a string value of the <CQLQuery/> element.
6. When all of the necessary inputs have been entered, click the Launch workflow button located at the top of the dialog box.
The workflow is submitted to the target URL for execution. When finished, the result is returned and appears in the output panel. While our sample workflow only took several seconds to complete, the time needed to complete the execution of your workflow will vary. The results from our sample workflow are shown in the figure below.
The Taverna 2.0 Workbench, while a very good tool for creating and executing workflows, does have some limitations about which you should be aware.
- The current Taverna release (Taverna 2.0) does not allow for workflows to be paused and resumed.
- The current release of Taverna Workbench does not support caGrid security mechanisms. The caGrid team is actively working with the Taverna team on this issue and we expect caGrid security to be enabled in a future release of Taverna.
In addition to the items noted above, Taverna uses a "pipeline" execution mode to improve execution efficiency. While not specifically a limitation, this mode ensures that activities are executed as soon as possible, even when the upstream activities have not completed execution and have only generated a partial set of data. Keep this in mind as it may affect the final results of your workflow.
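The following Python sketch illustrates the general idea of pipelined execution using generators: the downstream step starts consuming items before the upstream step has finished producing them. It is an analogy only, with hypothetical step names, and does not reflect how Taverna implements its execution engine.

```python
def upstream(items, log):
    """Produce items one at a time, logging each step."""
    for item in items:
        log.append("upstream produced " + item)
        yield item
    log.append("upstream finished")

def downstream(stream, log):
    """Consume items as soon as they arrive, logging each step."""
    results = []
    for item in stream:
        log.append("downstream consumed " + item)
        results.append(item.upper())
    return results


if __name__ == "__main__":
    log = []
    results = downstream(upstream(["a", "b"], log), log)
    for line in log:
        print(line)
    print(results)
```

The log shows the two steps interleaving, so any downstream activity that needs the complete upstream result must be designed to wait for it.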
The CaDSRDataService workflow example used in this article can be downloaded from the myExperiment web site:
gRAVI (Grid Remote Application Virtualization Interface)
Here are a few publications that are related to caGrid and Taverna:
Wei Tan, Ian Foster, Ravi Madduri. Scientific workflows that enable Web-scale collaboration: combining the power of Taverna and caGrid. IEEE Internet Computing. 2008, vol.12, no.6: 30-37
Wei Tan, Ravi Madduri, Kiran Keshav, Baris E. Suzek, Scott Oster, Ian Foster. Orchestrating caGrid Services in Taverna. IEEE International Conference on Web Services, Sept, 2008.
Wei Tan, Paolo Missier, Ravi Madduri, Ian Foster. Building Scientific Workflow with Taverna and BPEL: a Comparative Study in caGrid. 4th International Workshop on Engineering Service-Oriented Applications (WESOA'08), in conjunction with 6th International Conference on Service Oriented Computing, Dec, 2008.
Kyle Chard, Cem Onyuksel, Wei Tan, Dinanath Sulakhe, Ravi Madduri, Ian Foster. Build Grid Enabled Scientific Workflows using gRAVI and Taverna. SWBES08: Challenging Issues in Workflow Applications Workshop, in conjunction with 4th IEEE International Conference on e-Science, Dec, 2008.
If you have any questions, concerns, or suggestions regarding the workflow tools, please do not hesitate to contact the members of the caGrid Workflow team:
- Wei Tan: email@example.com
- Ravi Madduri: firstname.lastname@example.org
- Dinanath Sulakhe: email@example.com