Wednesday 19 January 2011

Semantic Access to ChEMBL

ChEMBL is a database of bioactive drug-like small molecules curated by the European Bioinformatics Institute (EBI). EBI makes relational database versions (MySQL and Oracle) available for download. As part of a recent project I wrapped the ChEMBL MySQL database into OWL using TopBraid Composer's D2RQ support. The steps I took were as follows:
  1. created an Ubuntu Server (10.04 x64) LAMP Virtual Machine using VMWare Fusion on an iMac
  2. installed phpmyadmin on the VM to allow access to MySQL from my Web browser, and tested this to be sure I could access MySQL in the VM guest from Firefox on the VM host
  3. uncompressed and unarchived the chembl_08_mysql.tar.gz download file in the VM
  4. the resulting folder contains an INSTALL file which tells you to create a database and execute the chembl_08.mysqldump.sql file which loads the database using CREATE TABLE and INSERT statements
  5. logged into phpmyadmin from the VM host Firefox and checked that I could browse the ChEMBL MySQL data ... it looked fine, there's a screenshot below showing there are 7 million records in the tables as well as a screenshot of the MySQL data model






  6. for licensing reasons, TopBraid cannot distribute the MySQL JDBC driver so I downloaded the mysql-connector.jar and placed it in the TBC workspace root folder (TBC help explained this and included links to the MySQL Web site)
  7. started TopBraid Composer and created a new project called ChEMBL
  8. Selected the project and Import, Create Mapping Files for D2RQ Relational Database Interface
  9. filled in the filename of chembl_08, base URIs of http://www.ebi.ac.uk/chembl_08-xxx.owl, database URL/database user/password and MySQL JDBC driver class



  10. Selected Finish and held my breath ... and after a couple of minutes ... Success!




I won't pretend it's fast to do general instance data browsing - I set my TBC to bring back 30,000 triples at a time to get lots of data. SPARQL queries, however, run pretty quickly. I'm now off to modify the D2RQ mapping to fit with another ontology, but the D2RQ approach to data integration is clearly a powerful and useful capability.



The default mapping creates usable URIs that enable SPARQL queries across classes, so I'm off to a good start. I'll update this post if the D2RQ mapping changes turn out to be interesting.
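
As a rough sketch, a query against the default mapping might look like the one below. Everything named in it is an assumption, not taken from the actual generated mapping: the vocab: namespace is a placeholder, and the class and property names assume the D2RQ convention of one class per table and one property per table_column, applied to a molecule_dictionary table with a pref_name column. Check the generated mapping file for the real names.

```sparql
# Hypothetical namespace: replace with the vocabulary namespace in your mapping file.
PREFIX vocab: <http://example.org/chembl_08/vocab#>
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# Assumed names: a molecule_dictionary table exposed as a class,
# and its pref_name column exposed as a property.
SELECT ?molecule ?name
WHERE {
  ?molecule rdf:type vocab:molecule_dictionary .
  ?molecule vocab:molecule_dictionary_pref_name ?name .
}
LIMIT 25
```

The LIMIT keeps the result small, which matters here because D2RQ translates each query into SQL against the live database.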

Friday 6 August 2010

Modeling with Range-less OWL Properties

For those with a more traditional data modeling background, the idea of a property (aka attribute) with no range or specified data value space seems counter-intuitive. Attributes must be integers or real numbers or pointers to other instances/individuals. However, the OWL language allows range-less properties ... and that can be quite powerful when used in combination with the OWL allValuesFrom restriction.

Take the concept of identification or identifiers as an example. In simple cases, a string is all that's required for an identifier. In other cases, the organization owning the identifier and the specification of its format are very important. So, in some cases a property with a string data type suffices while in others a reference to an instance/individual with properties of its own (e.g. owningOrganization) is required. Here's a diagram of this example:



The key is to use rdf:Property for the range-less identification property and then to use subClassOf restrictions to associate the property with an OWL class.
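
A minimal sketch of this pattern follows, written as a SPARQL Update so the triples can be loaded into a store. The ex: namespace and the names ex:identification, ex:Identifier, ex:Document and ex:Part are hypothetical, chosen only for illustration. ex:Document gets by with a plain string identifier, while ex:Part requires an identifier individual that can carry properties of its own, such as an owning organization.

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
# Hypothetical namespace for the illustration.
PREFIX ex:   <http://example.org/identification#>

INSERT DATA {
  # The range-less property: a plain rdf:Property with no rdfs:range.
  ex:identification rdf:type rdf:Property .

  # Class of identifier individuals that have properties of their own.
  ex:Identifier rdf:type owl:Class .
  ex:owningOrganization rdf:type rdf:Property .

  # Simple case: every identification value of a Document is a string.
  ex:Document rdf:type owl:Class ;
    rdfs:subClassOf [ rdf:type owl:Restriction ;
                      owl:onProperty ex:identification ;
                      owl:allValuesFrom xsd:string ] .

  # Richer case: every identification value of a Part is an Identifier individual.
  ex:Part rdf:type owl:Class ;
    rdfs:subClassOf [ rdf:type owl:Restriction ;
                      owl:onProperty ex:identification ;
                      owl:allValuesFrom ex:Identifier ] .
}
```

Because the property is neither an owl:DatatypeProperty nor an owl:ObjectProperty, each class can constrain its values locally without the two uses conflicting.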

Tuesday 2 March 2010

SPARQL for SysML/AP233 Transformations

I've had a little play with SPARQL to see how powerful a transformation engine it can be given the CONSTRUCT capability. It's actually pretty powerful! In a past life I was involved in the OMG/INCOSE/ISO effort to create a SysML-to-AP233 mapping spec. I thought I'd try a little of that as a test case - here's a SysML Block with name mapped to AP233. It's pretty clear, computer interpretable and requires no fancy MOF metamodel, unlike the TGG tools (e.g. MOFLON). Worth considering ...

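# The ap233: and sysml: prefixes must also be declared for your vocabularies.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# Blank nodes in the template are created fresh for each matching Block.
CONSTRUCT {
  _:s rdf:type ap233:System .
  _:s ap233:name ?name .
  _:sv rdf:type ap233:SystemVersion .
  _:sv ap233:ofProduct _:s .
  _:sv ap233:id "1" .
  _:svd rdf:type ap233:SystemViewDefinition .
  _:svd ap233:ofVersion _:sv .
  _:ca rdf:type ap233:ClassificationAssignment .
  _:ca ap233:items _:s .
  _:ca ap233:externalClass ap233:rd_block .
}
WHERE {
  ?subject rdf:type sysml:Block .
  ?subject sysml:name ?name .
}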

Saturday 27 February 2010

Working In The Semantic Web Now

I've moved to TopQuadrant to work in the field of Semantic Web technology. A particular interest of mine is applying that technology to engineering data. More to come as projects emerge...

Wednesday 18 November 2009

Trying Nokogiri XML Parser for Ruby

Bumped into a few blogs talking about how Nokogiri was faster than REXML for Ruby XML parsing. My OMG SysML XMI to ISO STEP AP233 XML converter demonstrations are taking so long that I can't run them live during the demo - up to 8 minutes for a large example SysML diagram. So, I'm modifying the converter to use Nokogiri in the hopes that I'll be able to do the demos live in the future. My first small test showed an 80 percent improvement ... fingers crossed that holds for the real converter.

Update - I'm having problems with XML namespaces. Only the xpath method seems to understand them, so after converting from REXML to Nokogiri my converter actually takes longer to run.

Saturday 14 November 2009

IDIOM - An ontology-centric IT Framework

I've been working in an ISO committee for many years and have been trying to convince it to adopt the same IT that the rest of industry uses. It appears that's finally coming to pass (small hurrah heard here).

The committee is called Industrial Data so I came up with the name 'Industrial Data Integrated Ontologies and Models' or IDIOM. Actually, I came up with the name 3-4 years ago on one of my many cross-Atlantic flights when my laptop battery died. The core point of the whole exercise is to base a suite of inter-related IT capabilities (process models, SOA, etc) around a core of concepts that are formally specified in logic-based ontologies.

We've done the basic development on the futurearch wiki (at wikispaces.com, which is a nice free service). Note that although 'Industrial Data' is in the name, there is absolutely nothing limiting how this IT Framework can be used.

More to come ...

Friday 28 August 2009

VMWare Fusion - Reusing Core Ubuntu Install

When testing out new software (MonoDevelop this week), I want to keep my daily Ubuntu VMWare guest clean until I decide whether I'll use the new tool, so I use a separate install. I don't want to keep lots of full Ubuntu installs around, though, so cloning/disk reuse is interesting. I found a good article explaining how to build a core Ubuntu OS and then extend it. Basically, you create a core OS install guest, then share its disk with the other testing guests after taking a snapshot in each testing guest. Taking a snapshot keeps the core disk unwritable, since the guest's later writes go to the snapshot's delta file rather than the shared disk ... nice.