Scientific experimental data generated by all the bionomic technologies is characterized by heterogeneity in its representation formats, constituents, and generation processes and, therefore, also in its usage. Using the proteomics domain we demonstrate the important role of provenance information o manage, interpret and analyze experimental data. We present a novel approach that employs an ontology as a knowledge model to automatically create semantic provenance information for high-throughput mass spectrometry (MS) data in the glycoproteomics domain. The Semantic Provenance Annotation of Data in protEomics (SPADE) implementation is based on the ProPreO ontology, a large-process ontology ( ~500 classes, 40 named relationships with 170 class-level restrictions, and 3.1 million instances) that models the complete experimental protocol for MS-based glycoproteomics data analysis. The semantic provenance information created in SPADE enables biologists to query over the semantic provenance information and retrieve exact data using “train-of-thought” expressive queries in SPARQL query language. We also discuss our current work in extending the ProPreO ontology to support toxicological metabolomics experimentation using Nuclear Magnetic Resonance (NMR) spectroscopy. Our strategic goal is to use Semantic Provenance information by pattern recognition and data mining algorithms for comparative or correlation analysis of Liquid Chromatography MS (LCMS) and NMR spectroscopy experimental data as part of toxicological metabolomics studies.
Available at: http://works.bepress.com/michael_raymer/42/