Robots exclusion and Spatineo

Robots.txt refers to the file name specified in the unofficial robots exclusion “standard”. It is used to tell automatic web crawlers which parts of a server should not be crawled. You can also specify different rules for different crawlers. This standard is not a technical barrier for crawlers but a gentlemen’s agreement that automated processes should, and generally do, respect.

A website may define robots exclusion information by publishing a robots.txt file in the root path of the service. For example, http://www.spatineo.com/robots.txt is the exclusion information for our website.

More on this specification can be found on robotstxt.org.
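
As a rough illustration of how a crawler might consult these rules, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content, the example.com URLs and the "somebot" agent name are made up for the example. The standard-library parser is used only for illustration, and its matching rules may differ in detail from those of our crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is kept out of /private/,
# while the "spatineo" user agent gets its own, separate rule group.
ROBOTS_TXT = """\
User-agent: spatineo
Disallow: /admin/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network access

# The "spatineo" group applies to our agent, so only /admin/ is off limits.
spatineo_private = parser.can_fetch("spatineo", "http://example.com/private/wms")
spatineo_admin = parser.can_fetch("spatineo", "http://example.com/admin/")

# Any other agent falls back to the catch-all "*" group.
other_private = parser.can_fetch("somebot", "http://example.com/private/wms")

print(spatineo_private, spatineo_admin, other_private)
```

Note that the rules only describe what a crawler is asked to do. Nothing in the file prevents a badly behaved crawler from ignoring it.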

Spatineo Monitor

Spatineo Monitor adheres to the exclusion rules and thus does not monitor web services that are disallowed via this mechanism. Spatineo does, however, load service descriptions despite robots.txt in the following cases, where we think it is nevertheless appropriate:

  • A user may request to update or add a service to our registry. This is a user-initiated operation and thus robots.txt does not apply to this situation.
  • We attempt to update every service once per week. This is because we want to avoid Spatineo Directory containing outdated or incorrect information about other service providers (you, perhaps?). One request per week should not cause performance issues for anyone.

“Why is there no availability information for my service?”

It is common practice for IT maintenance to disallow all crawling for web services. This is usually done by having a catch-all disallow-all robots.txt on the server in question, to prevent generic web crawlers from inadvertently causing load peaks and performance issues on the servers. While it is true that typical search engine spiders will usually only be confused by web service descriptions and operations, Spatineo Monitor is created specifically to understand these services. As such, allowing Spatineo to crawl the service will not cause performance issues.

We recommend you make sure that your current robots.txt is truly appropriate for your server. Broad exclusion of crawlers will mean that your users may never find interesting information you have published on the server. Generally, when you publish something online, you want that to be found.

The easiest change (besides completely removing robots.txt) you can make to allow Spatineo Monitoring is to add the following lines in your robots.txt, before all other content:


User-agent: spatineo
Allow: /

Please note that both “User-agent” and “spatineo” here are case sensitive. Also, our monitoring follows the first ruleset that matches our user agent.
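
To make the ordering and case-sensitivity points concrete, here is a small Python sketch of a first-match, case-sensitive reading of robots.txt. It is only an illustration of the behaviour described above, not our actual implementation. The function names and the sample files are hypothetical.

```python
def first_matching_group(robots_txt, agent):
    """Return the (field, prefix) rules of the first group that matches agent.

    Assumptions of this sketch: field names and agent names are compared
    case-sensitively, and "*" matches any agent.
    """
    groups = []              # (agents, rules) pairs in file order
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field == "User-agent":
            if rules:                         # rules seen: a new group starts
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif field in ("Allow", "Disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    for group_agents, group_rules in groups:
        if agent in group_agents or "*" in group_agents:
            return group_rules
    return []


def is_allowed(robots_txt, agent, path):
    """Apply the first rule whose path prefix matches; allow by default."""
    for field, prefix in first_matching_group(robots_txt, agent):
        if prefix and path.startswith(prefix):
            return field == "Allow"
    return True


GOOD = "User-agent: spatineo\nAllow: /\n\nUser-agent: *\nDisallow: /\n"
BAD_ORDER = "User-agent: *\nDisallow: /\n\nUser-agent: spatineo\nAllow: /\n"
WRONG_CASE = "User-Agent: spatineo\nAllow: /\n\nUser-agent: *\nDisallow: /\n"

for name, text in [("good", GOOD), ("bad order", BAD_ORDER), ("wrong case", WRONG_CASE)]:
    print(name, is_allowed(text, "spatineo", "/wms"))
```

Under these assumptions, only the first file lets the "spatineo" agent in. In the second the catch-all group matches first, and in the third the misspelled "User-Agent" line is not recognised, so the catch-all group is the first one that matches.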

“I want you to stop monitoring my service”

If monitoring is causing performance issues for you, we recommend you first take a look at how your service is built and configured. We monitor services once every 5 minutes and this should not cause noticeable load on any web service. If performance issues are not the reason you want to stop our monitoring, then I urge you to reconsider: Does monitoring take anything away from you? Do your users appreciate having availability statistics publicly available? If you have a good reason for us to not monitor you besides performance, I ask you to comment on this post and we can discuss your case.

In case your mind is made up, you can forbid us from monitoring your service. You can either upload a catch-all disallow-all robots.txt on your server, or place the following directives in your robots.txt:


User-agent: spatineo
Disallow: /

Please note that both “User-agent” and “spatineo” here are case sensitive and should be written as in the example above. Also keep in mind that directives are read in order and robots use only the first matching directive, so place the above directive first, or at least before “User-agent: *”.

If you think you have already set up blocking correctly, but we are still monitoring your service, please do the following:

  • Make sure the character cases in your robots.txt match the above example (User-agent != User-Agent).
  • Check that your robots.txt does not have conflicting rules which would specifically allow our monitoring.
  • If you only just changed the file, you can update our records manually: enter the complete URL to your service into our search engine. This will update the records for that service and monitoring will cease.
  • In case this does not stop the requests, please post below or contact us via this page.
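
The first two checks in the list above can be partly automated. The following Python sketch scans a robots.txt for wrongly cased “User-agent” or “spatineo” spellings and for a spatineo group placed after the catch-all group. The function name and the sample file are hypothetical, and the sketch does not detect conflicting Allow rules, which still need a manual look.

```python
def lint_robots_txt(robots_txt: str) -> list[str]:
    """Report spelling and ordering problems that affect the spatineo group."""
    problems = []
    seen_catch_all = False
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() != "user-agent":
            continue                          # only user-agent lines matter here
        if field != "User-agent":
            problems.append(f"line {number}: write 'User-agent', not '{field}'")
        if value == "*":
            seen_catch_all = True
        elif value.lower() == "spatineo":
            if value != "spatineo":
                problems.append(f"line {number}: write 'spatineo', not '{value}'")
            if seen_catch_all:
                problems.append(
                    f"line {number}: spatineo group comes after 'User-agent: *'"
                )
    return problems


# Hypothetical file with all three problems in its spatineo group.
SAMPLE = "User-agent: *\nDisallow:\n\nUser-Agent: Spatineo\nDisallow: /\n"
for problem in lint_robots_txt(SAMPLE):
    print(problem)
```

An empty result only means that none of these particular problems were found. It does not prove that the file does what you intend.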

7 Comments

Ilkka Rinne

Hi,

Sorry for a late answer, your comment had ended in the spam moderation queue.

It makes no sense to request robots.txt again on every monitoring request, so it may take a few days for us to notice a changed robots.txt file. If you need quicker action in the future, please contact us by email at support(at)spatineo.com, through our website contact form, or on Twitter (@spatineo). Sorry for the inconvenience.

Cheers,

Ilkka Rinne
Head of User Experience and Interoperability
Spatineo

Nick Massaro

Our site infrastructure is not capable of spinning up all the services we host at once. Will Spatineo honor the following robots.txt setting?
>>>
User-agent: spatineo
Disallow: */FeatureServer
Disallow: */GPServer
Disallow: */MapServer

Sampo Savolainen

We discussed your case via email, but I realized that it makes sense to answer this question here so others can find the answer as well. Unfortunately, the robots.txt specification does not allow for this kind of wildcarding. The allow/disallow rules may only contain a path prefix, which matches request paths that start with or are equal to it.

(http://www.robotstxt.org/norobots-rfc.txt)

Joanne McGraw

Yesterday, in a 12 hour period we were hit with over 22,000 requests with the Spatineo Web Bot. That was out of 26,000+. Please stop monitoring our services as it is now impacting our users’ ability to work with our applications.
We have edited the robots as suggested. It can be accessed at http://www.agr.gc.ca/robots.txt and requests that spatineo User-agents no longer monitor services found at http://www.agr.gc.ca/atlas.
Your earliest attention to this would be appreciated.

Sampo Savolainen

Dear Joanne and James,

Our systems update the robots.txt information with a slight delay. We would be making even more requests if we checked robots.txt for changes every time before sending a monitoring request. We have, however, hastened this update for your servers, and monitoring is now disabled as per the instructions in the file.

Please note we have monitored your services at a near constant rate, with about the same number of requests per day, since January 2016. This means we sent roughly the same number of requests to your server yesterday as we did on the days before that for months. Therefore I find it improbable that it was indeed our monitoring that caused your issues. But as I said in the beginning, our systems have now stopped monitoring your services.

I hope you can resolve the problems with your services so you can go back to business as usual.
