Experimenting with Linked Open Data about FLOSS projects: matching Debian upstream projects

I’ve been experimenting with Linked Open Data about FLOSS projects harvested from different sources of DOAP or ADMS.SW descriptions. I’ve tried to match the upstream projects of Debian packages with upstream projects hosted at Apache, GNOME, or Alioth.debian.org, or catalogued on PyPI.

I’m matching them on identical values of the Homepage field (comparing the Homepage Control field set by Debian packagers with the doap:homepage meta-data in the RDF documents harvested from the upstream project catalogues).
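As an illustration, here is a minimal sketch of that matching step. It assumes the homepages have already been extracted into two plain dictionaries; the names `debian_homepages` and `upstream_homepages` and the sample data are made up for the example, whereas the real experiment works on harvested RDF documents.

```python
from collections import defaultdict


def match_by_homepage(debian_homepages, upstream_homepages):
    """Pair Debian source packages with upstream projects sharing a homepage.

    Both arguments map a name to a homepage URL string. Matching is done on
    strictly identical strings, with no normalization at all.
    """
    # Index upstream projects by homepage; several projects may share one URL.
    by_homepage = defaultdict(list)
    for project, homepage in upstream_homepages.items():
        by_homepage[homepage].append(project)

    matches = []
    for package, homepage in debian_homepages.items():
        for project in by_homepage.get(homepage, []):
            matches.append((package, project, homepage))
    return matches


# Hypothetical sample data, not taken from the real data set.
debian_homepages = {
    "python-foo": "http://example.org/foo",
    "bar": "http://example.org/bar/",
}
upstream_homepages = {
    "Foo": "http://example.org/foo",
    "bar": "http://example.org/bar",  # differs by a trailing slash: no match
}

print(match_by_homepage(debian_homepages, upstream_homepages))
```

Only the first pair is reported here: the second one differs by a trailing slash, which a strict string comparison treats as a different homepage.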

Here are the initial results of my little experiment, giving the number of matched projects per catalogue and how often the project names also match:

| Upstream catalogue | Total matching projects | Exact same project name | Same project name (case-independent) |
|---|---|---|---|
| apache | 31 | 0 (0 %) | 0 (0 %) |
| alioth | 16 | 13 (81 %) | 13 (81 %) |
| pypi | 439 | 217 (49 %) | 273 (62 %) |
| gnome | 21 | 0 (0 %) | 7 (33 %) |
| Total | 507 | 230 (45 %) | 293 (58 %) |

The data set contains tens of thousands of projects, probably with many duplicates, but among all of these, only 507 share a homepage with a Debian source package.

As you can see, in some cases, the Debian source package names match the upstream project name (sometimes with lower/upper case variants), but in general, the project names aren’t identical, so it is interesting to try and match them by homepage.
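The name comparison behind those figures can be sketched as follows. This is only an illustration under assumptions: the function name `name_similarity_stats` and the sample pairs are hypothetical, and the input is taken to be a list of (Debian source package name, upstream project name) pairs already matched by homepage.

```python
def name_similarity_stats(pairs):
    """Count exact and case-insensitive name matches among matched pairs.

    `pairs` is a list of (debian_source_package_name, upstream_project_name).
    Returns (total, exact, same_ignoring_case).
    """
    total = len(pairs)
    exact = sum(1 for deb, up in pairs if deb == up)
    same_ignoring_case = sum(
        1 for deb, up in pairs if deb.casefold() == up.casefold()
    )
    return total, exact, same_ignoring_case


# Hypothetical pairs, for illustration only.
pairs = [
    ("requests", "requests"),  # exact match
    ("django", "Django"),      # matches only when ignoring case
    ("python-foo", "foo"),     # no match either way
]

total, exact, same_ignoring_case = name_similarity_stats(pairs)
print(f"{total} matched, {exact} exact, {same_ignoring_case} ignoring case")
```

Exact matches are by construction a subset of the case-insensitive ones, which is why the second count in the table is never lower than the first.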

For the curious ones, the Apache, GNOME and PyPI project catalogues have been providing RDF meta-data for quite some time. More recently, we introduced ADMS.SW meta-data for Debian source packages, and even more recently for the Alioth projects (through the ADMS.SW exporter plugin for FusionForge).

There is still room for improvement, for instance by normalizing homepage URLs, which tend to vary (trailing slashes, or different HTTP/HTTPS schemes).
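Such a normalization could look like the sketch below. The rules are my own assumptions for the sake of the example (treat http and https as equivalent, lowercase the host name, drop trailing slashes and fragments); the helper name `normalize_homepage` is hypothetical and none of this is applied in the experiment reported above.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_homepage(url):
    """Return a canonical form of a homepage URL, for comparison only."""
    parts = urlsplit(url.strip())
    # Assumption: http and https variants point at the same homepage.
    scheme = "http" if parts.scheme in ("http", "https") else parts.scheme
    # Host names are case-insensitive, so compare them in lowercase.
    netloc = parts.netloc.lower()
    # "http://example.org/foo/" and "http://example.org/foo" become equal.
    path = parts.path.rstrip("/")
    # Assumption: the fragment never identifies a different project.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(normalize_homepage("https://Example.org/foo/"))
```

The normalized form is meant as a comparison key, not as a URL to publish: the original homepage value should be kept alongside it.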

Stay tuned for more details.

New paper “Authoritative linked data descriptions of debian source packages using ADMS.SW” accepted at OSS 2013

I’ll be presenting “Authoritative linked data descriptions of debian source packages using ADMS.SW” at OSS 2013.

Here’s the abstract:

The Debian Package Tracking System is a Web dashboard for Debian contributors and advanced users. This central tool publishes the status of subsequent releases of source packages in the Debian distribution.

It has been improved to generate RDF meta-data documenting the source packages, their releases and links to other packaging artifacts, using the ADMS.SW 1.0 model. This constitutes an authoritative source of machine-readable Debian “facts” and proposes a reference URI naming scheme for Linked Data resources about Debian packages.

This should enable the interlinking of these Debian package descriptions with other ADMS.SW or DOAP descriptions of FLOSS projects available on the Semantic Web also using Linked Data principles. This will be particularly interesting for traceability with upstream projects whose releases are packaged in Debian, derivative distributions reusing Debian source packages, or with other FLOSS distributions.

Update: If you are interested, a preprint is available here in HTML form. See also previous installments on ADMS.SW in this blog.

Update: The slides of the presentation I made at Isola are here.

The Debian Package Tracking System now publishes Turtle RDF meta-data

The Debian PTS now speaks the Turtle representation format for the export of RDF meta-data about Debian source packages.

This means that a new flavour of RDF is now available, alongside the HTML pages for humans and the RDF/XML export that had already been added.

The Turtle format offers the benefits of machine-readable meta-data in a textual form that is also somewhat human-readable.

For instance, you may check the apache2 Turtle meta-data from the command line with:
$ curl -L -s -H "Accept: text/turtle" http://packages.qa.debian.org/apache2

Here’s a link to a colorized HTML preview of http://packages.qa.debian.org/a/apache2.ttl.

Under the hood, the XSLT stylesheets of the PTS have been reworked to produce the Turtle format by default, which is later converted to RDF/XML.

Every Debian source package then has a reference URI in the Linked Data world, in the form http://packages.qa.debian.org/PACKAGE_NAME, which redirects, through proper content negotiation (the HTTP Accept header), to the HTML, RDF/XML or Turtle documents. For apache2, these are respectively at http://packages.qa.debian.org/a/apache2.html, http://packages.qa.debian.org/a/apache2.rdf and http://packages.qa.debian.org/a/apache2.ttl.
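For those who prefer a script to curl, the same content negotiation can be sketched in Python. This is an illustration, not part of the PTS: the helper names are hypothetical, `text/turtle` is the type used in the curl example, `application/rdf+xml` is the standard RDF/XML media type, and I am assuming the server honours these Accept values as described above.

```python
import urllib.request

PTS_BASE = "http://packages.qa.debian.org"

# Media types sent in the Accept header for each flavour (assumed mapping).
ACCEPT_TYPES = {
    "html": "text/html",
    "rdf": "application/rdf+xml",
    "turtle": "text/turtle",
}


def build_pts_request(package, flavour="turtle"):
    """Build a request for a package's reference URI with an Accept header."""
    return urllib.request.Request(
        f"{PTS_BASE}/{package}",
        headers={"Accept": ACCEPT_TYPES[flavour]},
    )


def fetch_pts_metadata(package, flavour="turtle"):
    """Fetch the negotiated document; urlopen follows redirects like curl -L."""
    with urllib.request.urlopen(build_pts_request(package, flavour)) as response:
        return response.read().decode("utf-8")


# Building the request needs no network access:
request = build_pts_request("apache2")
print(request.full_url, request.get_header("Accept"))
# Needs network access:
# print(fetch_pts_metadata("apache2"))
```

Asking for the reference URI and letting the server redirect keeps the client independent of the exact layout of the `.html`, `.rdf` and `.ttl` documents.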

The meta-data uses the model of the ADMS.SW ontology (1.0), and the content has also been slightly updated to make it more conformant to the ADMS.SW specifications (checks done with the ADMS.SW validator).

Let’s hope this makes RDF more familiar to Debian folks, and allows more Linked Data interlinking with other resources about FLOSS packages.

Presented “Generating Linked Data descriptions of Debian packages in the Debian PTS” at the Paris Mini DebConf

I gave a presentation at the Paris Mini DebConf 2012 about the work I’ve done to bring more semantic meta-data to the Debian PTS (see previous posts).

Here are my slides:

Also available here as PDF.