Robots exclusion and Spatineo

Robots.txt refers to the file name specified in the unofficial robots exclusion “standard”. It is used to tell automatic web crawlers which parts of a server should not be crawled. You can also specify different rules for different crawlers. The standard is not a technical barrier for crawlers but a gentlemen’s agreement that automated processes should respect, and generally do.

A website may define robots exclusion information by publishing a robots.txt file in the root path of the service. For example, http://www.spatineo.com/robots.txt is the exclusion information for our website.
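
As a minimal sketch of the “root path” rule, the following Python snippet derives the robots.txt location from an arbitrary service URL. The function name and the example URL are illustrative assumptions, not part of any Spatineo API.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(service_url):
    """Return the robots.txt URL that governs the given service URL."""
    parts = urlsplit(service_url)
    # Keep only the scheme and host (with port); robots.txt lives at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL used only for illustration.
print(robots_txt_url("http://www.spatineo.com/some/page?x=1"))
```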

More on this specification can be found on robotstxt.org.

Spatineo Monitor

Spatineo Monitor adheres to the exclusion rules and thus does not monitor web services that are disallowed via this mechanism. Spatineo does, however, load service descriptions despite robots.txt in the following cases, where we think it is nevertheless appropriate:

  • A user may request to update or add a service to our registry. This is a user-initiated operation, and thus robots.txt does not apply to this situation.
  • We attempt to update every service once per week. This is because we want to avoid Spatineo Directory containing outdated or incorrect information about other service providers (you, perhaps?). One request per week should not cause performance issues for anyone.

“Why is there no availability information for my service?”

It is common practice for IT maintenance to disallow all crawling for web services. This is usually done by having a catch-all disallow-all robots.txt on the server in question. The aim is to prevent generic web crawlers from inadvertently causing load peaks and performance issues on the servers. While it is true that typical search engine spiders will usually only be confused by web service descriptions and operations, Spatineo Monitor is created specifically to understand these services. As such, allowing Spatineo to crawl the service will not cause performance issues.

We recommend you make sure that your current robots.txt is truly appropriate for your server. Broad exclusion of crawlers will mean that your users may never find interesting information you have published on the server. Generally, when you publish something online, you want that to be found.

The easiest change you can make to allow Spatineo monitoring (besides completely removing robots.txt) is to add the following lines to your robots.txt, before all other content:


User-agent: spatineo
Allow: /

Please note that both “User-agent” and “spatineo” here are case sensitive. Also, our monitoring follows the first ruleset that matches our user agent.
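
The “first matching ruleset” behaviour can be illustrated with a small Python sketch. This is not Spatineo's actual implementation. It assumes that a ruleset matches when its User-agent value is exactly “spatineo” (case sensitive) or the wildcard “*”, and the function name is hypothetical.

```python
def first_matching_group(robots_txt, agent="spatineo"):
    """Return the (field, value) rules of the first group matching the agent.

    Assumption: a group matches when one of its User-agent values is exactly
    the agent name (case sensitive) or the wildcard "*".
    """
    groups = []                      # (agents, rules) pairs in file order
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field == "User-agent":                # case-sensitive field name
            if rules:                            # rules seen: a new group starts
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif field in ("Allow", "Disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    for group_agents, group_rules in groups:
        if agent in group_agents or "*" in group_agents:
            return group_rules
    return []


allow_first = "User-agent: spatineo\nAllow: /\n\nUser-agent: *\nDisallow: /\n"
print(first_matching_group(allow_first))
```

With the spatineo ruleset placed first, the catch-all ruleset that follows is never consulted for our user agent. If the two rulesets are swapped, the catch-all wins under this model.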

“I want you to stop monitoring my service”

If monitoring is causing performance issues for you, we recommend you first take a look at how your service is built and configured. We monitor services once every 5 minutes and this should not cause noticeable load on any web service. If performance issues are not the reason you want to stop our monitoring, then I urge you to reconsider: Does monitoring take anything away from you? Do your users appreciate having availability statistics publicly available? If you have a good reason for us to not monitor you besides performance, I ask you to comment on this post and we can discuss your case.

In case your mind is made up, you can forbid us from monitoring your service. You can either upload a catch-all disallow-all robots.txt on your server, or place the following directives in your robots.txt:


User-agent: spatineo
Disallow: /

Please note that both “User-agent” and “spatineo” here are case sensitive and should be written as in the example above. Also keep in mind that directives are read in order and robots use only the first matching directive. So place the above directive first, or at least before “User-agent: *”.

If you think you have already set up blocking correctly, but we are still monitoring your service, please do the following:

  • Make sure the character cases in your robots.txt match the above example (User-agent != User-Agent).
  • Check that your robots.txt does not have conflicting rules which would specifically allow our monitoring.
  • If you only just changed the file, you can update our records manually: enter the complete URL to your service into our search engine. This will update the records for that service and monitoring will cease.
  • In case this does not stop the requests, please post below or contact us via this page.
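
The first checks in this list can be automated with a short script. The following Python sketch flags wrongly cased “User-agent” or “spatineo” spellings and a catch-all ruleset placed before the spatineo one. The function name and the exact wording of the messages are assumptions made for illustration; the sketch does not look for conflicting Allow rules.

```python
def check_robots_txt(robots_txt, agent="spatineo"):
    """Return a list of human-readable problems found in a robots.txt text."""
    problems = []
    wildcard_line = None   # line number of the first "User-agent: *"
    agent_line = None      # line number of the first correctly spelled agent line
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() != "user-agent":
            continue
        if field != "User-agent":
            problems.append(f"line {number}: write 'User-agent', not '{field}'")
        if value.lower() == agent.lower() and value != agent:
            problems.append(f"line {number}: write '{agent}', not '{value}'")
        if value == "*" and wildcard_line is None:
            wildcard_line = number
        if field == "User-agent" and value == agent and agent_line is None:
            agent_line = number
    if agent_line is None:
        problems.append(f"no exact 'User-agent: {agent}' line found")
    elif wildcard_line is not None and wildcard_line < agent_line:
        problems.append(
            f"'User-agent: *' (line {wildcard_line}) precedes "
            f"'User-agent: {agent}' (line {agent_line})"
        )
    return problems


for problem in check_robots_txt("User-Agent: Spatineo\nDisallow: /\n"):
    print(problem)
```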

Spatial web services & data journalism, the Talvivaara case

We had an interesting real-world case of using open environmental data for journalism a couple of weeks ago in Finland. In the early hours of Saturday the 10th of November, Yle, the Finnish public broadcasting company, published a background news item on their site related to the continued pollution leakage at the Talvivaara mining site in Sotkamo, Finland.

In the post “Kaikki Talvivaaran alueesta” (“All about Talvivaara area”) they point to the interactive mashup map of the mining area, including natural protection areas, mining reservations etc., aggregated at the Paikkatietoikkuna geoportal of the Finnish National Land Survey.

A few hours later the map was rendered practically useless because of the serious performance problems of the background WMS services providing the data.

The map window application at Paikkatietoikkuna makes it possible for any user to aggregate and publish web maps with their preferred selection of visualized geospatial data layers provided by the various Finnish governmental organizations. The data layers are served by the WMS servers hosted by the organizations; the application only provides an interactive graphical user interface for displaying them as a mashup. In this case Yle reporters had been able to make an up-to-date, interactive map covering soil types, lakes and rivers, ground water reserves, land claims for mining and natural protection areas just by selecting the layers and publishing the link pointing at the map in their news item.

The data layers in the mashup were provided by the Geological Survey of Finland (soil types), the Finnish Environment Institute (rivers, lakes, natural water reserves and natural protection areas) and the Finnish Ministry of Employment and the Economy (the mining-related information). The attached report from our Spatineo Monitor clearly shows the increased response times for all the WMS servers providing the selected data layers starting in the morning of 10th Nov 2012. At 04 UTC (06 local time) the Soil type service was struggling with the first traffic peak, and by 06 UTC the server was unresponsive. The situation started to improve only in the evening, at about 17 UTC.

The one-month time series of one of the services (Soil data) shows that the average response times on 10th Nov. were considerably above normal for that service:

It seems that journalists are really starting to take advantage of the public open geospatial data resources and easily available web map tools like Paikkatietoikkuna, but the data providers are not very well prepared for even pretty minor “slashdot effects” caused by a sudden increase in traffic to their services.

We at Spatineo are quite glad to be able to report things like this based on our continuous monitoring of thousands of spatial web services around the world. It confirms to us that our proactive monitoring strategy is the right one: in most cases we have already been collecting the performance data before our customers experience performance problems in their spatial web services.

OGC to switch to W3C XLink in July 2012

The Open Geospatial Consortium (OGC) will make a backwards-incompatible change to the XML Schema files of a large part of its standards on July 21st 2012. This change is made as a global corrigendum to move to the W3C XLink version 1.1 schema instead of the OGC-specific XLink XML Schema implementation. See my previous post for details on the reasons behind this pretty large-scale change.

Basically the change is quite a simple one:

  • all existing OGC standards that reference the OGC XLink shall be updated to reference the W3C XLink 1.1 schema and
  • going forward any new standards work shall only reference the W3C XLink schema.

By far the most used XLink attribute in OGC schemas is the locator attribute xlink:href, which contains a URL pointing to the linked resource and thus creates a link between two XML documents. In the XML Schema documents, the XLink href attribute is usually included in a complex type by adding an attribute group named simpleLink. In schemas using GML this is often done indirectly by using the pre-defined gml:AssociationAttributeGroup:

<complexType name="ReferenceType">
  <annotation>
    <documentation>
    gml:ReferenceType is intended to be used in application schemas directly,
    if a property element shall use a "by-reference only" encoding.
    </documentation>
  </annotation>
  <sequence/>
  <attributeGroup ref="gml:OwnershipAttributeGroup"/>
  <attributeGroup ref="gml:AssociationAttributeGroup"/>
</complexType>

The gml:AssociationAttributeGroup in GML 3.2.1 (before the XLink corrigendum) in turn refers to the simpleLink attribute group defined in the XLink namespace:

<attributeGroup name="AssociationAttributeGroup">
  <annotation>
    <documentation>
    XLink components are the standard method to support hypertext referencing in XML. An XML Schema 
    attribute group, gml:AssociationAttributeGroup, is provided to support the use of Xlinks as 
    the method for indicating the value of a property by reference in a uniform manner in GML.
    </documentation>
  </annotation>
  <attributeGroup ref="xlink:simpleLink"/>
  <attribute name="nilReason" type="gml:NilReasonType"/>
  <attribute ref="gml:remoteSchema">
    <annotation>
      <appinfo>deprecated</appinfo>
    </annotation>
  </attribute>
</attributeGroup>

In non-corrected GML 3.2.1 schema files the XLink namespace is imported from the OGC version of the XLink schema:

<import namespace="http://www.w3.org/1999/xlink" schemaLocation="http://schemas.opengis.net/xlink/1.0.0/xlinks.xsd"/>

In this file the simpleLink attributeGroup is defined like this:

<attribute name="href" type="anyURI"/>
...
<attributeGroup name="simpleLink">
  <attribute name="type" type="string" fixed="simple" form="qualified"/>
  <attribute ref="xlink:href" use="optional"/>
  <attribute ref="xlink:role" use="optional"/>
  <attribute ref="xlink:arcrole" use="optional"/>
  <attribute ref="xlink:title" use="optional"/>
  <attribute ref="xlink:show" use="optional"/>
  <attribute ref="xlink:actuate" use="optional"/>
</attributeGroup>

What will change in July 2012 is that the schema files of all affected OGC standards will be modified to point to the official W3C XLink 1.1 schema available at http://www.w3.org/XML/2008/06/xlink.xsd. The href attribute definition in the W3C XLink schema is only slightly different from the OGC version:

<xs:attribute name="href" type="xlink:hrefType"/>
<xs:simpleType name="hrefType">
  <xs:restriction base="xs:anyURI"/>
</xs:simpleType>
...
<xs:attributeGroup name="simpleAttrs">
  <xs:attribute ref="xlink:type" fixed="simple"/>
  <xs:attribute ref="xlink:href"/>
  <xs:attribute ref="xlink:role"/>
  <xs:attribute ref="xlink:arcrole"/>
  <xs:attribute ref="xlink:title"/>
  <xs:attribute ref="xlink:show"/>
  <xs:attribute ref="xlink:actuate"/>
</xs:attributeGroup>

This means that all XML files using the xlink:href attribute that are valid against the OGC XLink schema are also valid against the W3C XLink 1.1 schema. However, because the attribute group called “simpleLink” in the OGC schema is called “simpleAttrs” in the W3C schema, the XML Schema files using this attribute group will no longer be valid after the change. To fix this, all the schema files using the “simpleLink” attribute group will have to be changed to use “simpleAttrs” instead.
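
For your own schemas, the required edit can be sketched as a plain text substitution in Python. This is only an illustration, not the OGC's tooling. It assumes that the XLink namespace is bound to the prefix “xlink” and that the import uses the absolute OGC schemaLocation shown earlier; relative import paths and other prefixes are not handled. The function and constant names are hypothetical.

```python
import re

OGC_XLINK_XSD = "http://schemas.opengis.net/xlink/1.0.0/xlinks.xsd"
W3C_XLINK_XSD = "http://www.w3.org/XML/2008/06/xlink.xsd"

# Matches ref="xlink:simpleLink" with either quote style.
# Assumption: the XLink namespace prefix is literally "xlink".
SIMPLE_LINK_REF = re.compile(r"""(ref\s*=\s*["'])xlink:simpleLink(["'])""")


def migrate_schema_text(xsd_text):
    """Point a schema at the W3C XLink 1.1 schema and rename the attribute group."""
    migrated = xsd_text.replace(OGC_XLINK_XSD, W3C_XLINK_XSD)
    return SIMPLE_LINK_REF.sub(r"\g<1>xlink:simpleAttrs\g<2>", migrated)


before = (
    '<import namespace="http://www.w3.org/1999/xlink" '
    'schemaLocation="http://schemas.opengis.net/xlink/1.0.0/xlinks.xsd"/>\n'
    '<attributeGroup ref="xlink:simpleLink"/>'
)
print(migrate_schema_text(before))
```

Working on the text rather than on a parsed tree keeps the namespace prefix declarations untouched, which matters because the ref values are prefixed names inside attribute content.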

This change has to be done simultaneously to as many schema files as possible, because the XML validators become confused if they encounter two different schema versions of the same XML namespace. In addition to the OGC’s schema files, the same change should also be done to any other schemas using the OGC version of the XLink schema available at http://schemas.opengis.net/xlink/1.0.0/xlinks.xsd. To force the users to do this change, the OGC Architecture Board has decided to remove the OGC XLink schema file along with the other schema changes.

According to a mailing list post by Carl Reed, the CTO of the OGC, on 12th April 2012, at least the following OGC standards are affected by this change:

  • All versions of WM context
  • All versions of GML since version 2.0.0
  • All profiles of GML since 2.0.0
  • Image CRSs
  • All versions of OpenLS since version 1.1.0
  • All versions of OWS Common since 1.0.0
  • Symbology Encoding 1.0
  • All versions of SLD since 1.0.0
  • All versions of SensorML (including 2.0)
  • All versions of SWE Common
  • Table Join Service
  • All versions of Web Coverage Service
  • Web Feature Service 2.0
  • Web Map Service 1.3
  • WMTS
  • Web Processing Service

There are probably other affected schemas and standards in addition to this list, because the schemas are inter-linked. In particular, the different versions of GML are used in many other OGC schemas.

Further quoting the announcement from Carl Reed about the OGC actions to be taken:

The target date for implementing change is the weekend of July 21, 2012.

The process will be:

  • Scan schema repository for import of xlink to find a list of standards that use xlink.
  • Also scan for strings such as Gml:ReferenceType to find other possible places that xlink is required.
  • Whatever schema uses any of XLink schema components will need to replace the schema location. We need to do this for all schemas that import xlink. All these changes will be done to a copy of the existing OGC schema repository.
  • For software developers, they need to patch their products to use the revised OGC schemas.
  • Everyone will need to delete local copies, get a new copy from the OGC schema repository, and use the new schemas. There is also the possibility to use a tool such as the OASIS XML Catalogue to override the required change and to continue using the old XLink.
  • In July, we will then issue one global corrigendum for all the affected standards. Essentially, the current OGC schema repository will be replaced with the schemas that have been changed (and tested). The actual standards documents will not change – only the schemas. OGC policy is that the schemas are normative and that if there are differences between a standards document and a schema, then the schemas are normative.
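
The first step of the process above, scanning for imports of XLink, is also useful for checking your own schema collection. A minimal Python sketch, assuming the schemas are stored as .xsd files under one local directory (the function name and directory layout are hypothetical):

```python
import xml.etree.ElementTree as ET
from pathlib import Path

XSD_NS = "http://www.w3.org/2001/XMLSchema"
XLINK_NS = "http://www.w3.org/1999/xlink"


def schemas_importing_xlink(repository_root):
    """Yield (path, schemaLocation) for every .xsd file that imports XLink."""
    for path in sorted(Path(repository_root).rglob("*.xsd")):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        for imp in root.iter(f"{{{XSD_NS}}}import"):
            if imp.get("namespace") == XLINK_NS:
                yield path, imp.get("schemaLocation")
```

Calling `list(schemas_importing_xlink("/path/to/schemas"))` returns the files whose import needs to be checked. Schemas that use XLink only indirectly, for example through gml:ReferenceType, are not found by this scan.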

This is pretty much the approach I expected the OGC to take when I wrote about this in January.

If you are running or developing software dealing with OGC-compliant data or services, you really should check that it will still work with the modified versions of the schema files. You can begin testing your software as soon as the modified OGC schema files are made available in the alternative OGC schema repository. One of the simplest ways to test this is to use the OASIS XML Catalog to temporarily redirect the requests for the schema files of the modified standards’ namespaces to the alternative OGC schema locations. If your software supports the XML Catalog, a catalog.xml file with directives something like the following should do the trick (assuming that the modified OGC schemas are made available under the domain alternative.schemas.opengis.net):

<!DOCTYPE catalog
  PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
         "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"
         prefer="public">
  <rewriteURI uriStartString="http://schemas.opengis.net/gml/"
              rewritePrefix="http://alternative.schemas.opengis.net/gml/" />
  <rewriteURI uriStartString="http://schemas.opengis.net/wfs/"
              rewritePrefix="http://alternative.schemas.opengis.net/wfs/" />
  <!-- ... etc. for all affected standards -->
</catalog>

When an XML validator using this catalog needs to fetch any XML files from URLs beginning with “http://schemas.opengis.net/gml/”, it will try to fetch them from “http://alternative.schemas.opengis.net/gml/” instead. The benefit of this approach is that you will be able to simulate the schema switch-over well before the actual change in July without making any changes to your code or data files.
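
The effect of the rewriteURI directives can be modelled in a few lines of Python. This is a simplified sketch of the prefix substitution only, not a catalog resolver. It assumes the hypothetical alternative.schemas.opengis.net domain used above and picks the longest matching uriStartString when several rules match.

```python
REWRITE_RULES = [
    ("http://schemas.opengis.net/gml/", "http://alternative.schemas.opengis.net/gml/"),
    ("http://schemas.opengis.net/wfs/", "http://alternative.schemas.opengis.net/wfs/"),
]


def rewrite_uri(uri, rules=REWRITE_RULES):
    """Apply the rewriteURI rule with the longest matching uriStartString."""
    matches = [(start, prefix) for start, prefix in rules if uri.startswith(start)]
    if not matches:
        return uri  # no rule applies, so the URI is fetched as-is
    start, prefix = max(matches, key=lambda rule: len(rule[0]))
    # Replace only the matched start string and keep the rest of the path.
    return prefix + uri[len(start):]


print(rewrite_uri("http://schemas.opengis.net/gml/3.2.1/gml.xsd"))
```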

You can also use the XML Catalog if you find that you must delay the schema changes for your local system. To do this, you can take local copies of the unmodified OGC schema files and create another set of rewriteURI directives. Assuming that the local schema files are stored under /etc/xml/schemas/original/ogc/:

<!DOCTYPE catalog
  PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
         "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"
         prefer="public">
  <rewriteURI uriStartString="http://schemas.opengis.net/gml/"
              rewritePrefix="file:///etc/xml/schemas/original/ogc/gml/" />
  <rewriteURI uriStartString="http://schemas.opengis.net/wfs/"
              rewritePrefix="file:///etc/xml/schemas/original/ogc/wfs/" />
  <!-- ... -->
</catalog>
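
Writing one rewriteURI entry per affected standard by hand gets tedious, so the catalog can also be generated. The Python sketch below builds such a file from a list of directory names. The function name is hypothetical, and the list of names is an assumption you have to supply yourself: only gml and wfs are taken from the examples above.

```python
from xml.sax.saxutils import quoteattr

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def build_catalog(standards, target_base):
    """Return catalog XML that redirects each standard's schemas to target_base."""
    lines = [
        '<?xml version="1.0"?>',
        f'<catalog xmlns="{CATALOG_NS}" prefer="public">',
    ]
    for name in standards:
        start = f"http://schemas.opengis.net/{name}/"
        prefix = f"{target_base.rstrip('/')}/{name}/"
        # quoteattr adds the surrounding quotes and escapes special characters.
        lines.append(
            f"  <rewriteURI uriStartString={quoteattr(start)} "
            f"rewritePrefix={quoteattr(prefix)}/>"
        )
    lines.append("</catalog>")
    return "\n".join(lines) + "\n"


print(build_catalog(["gml", "wfs"], "file:///etc/xml/schemas/original/ogc/"))
```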