Fixing the robots.txt content type of Redmine at CodeTRAX

This post describes how I addressed an issue with the content type of the robots.txt file of my Redmine installation at CodeTRAX, so that indexing bots can properly evaluate it and index only those sections of the web site that are meant to be indexed.

Yesterday, a long-standing issue with my Redmine installation at CodeTRAX was hopefully addressed. For a very long time, I had been wondering why all indexing bots, both the well-known ones and the less popular ones, were ignoring the Disallow rules defined in the robots.txt file and kept indexing the sections of the web site that were excluded. My explanation so far had been that, since the web site had lacked a proper robots.txt file for many years, the indexing services simply needed a lot of time before taking the new rules into consideration and adjusting their bots’ behavior. Meanwhile, by trying to index sections such as the issue tracker, the calendar or the repository, which provide many different ways of displaying and sorting the available data through query arguments, the bots were causing a tremendous increase in server load.

Yesterday, I found some time to look into this issue again and, after thoroughly checking the response headers returned when the robots.txt file was requested, I noticed a small detail that I had overlooked in the past: the Content-Type of the resource was set to text/html in the HTTP response headers instead of the correct text/plain. This was a little strange, so I decided to look into it more closely.
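
For anyone who wants to reproduce such a check, here is a minimal sketch using Ruby's Net::HTTP; the URL below is just a placeholder and should be replaced with the address of the actual installation:

require 'net/http'
require 'uri'

# Placeholder URL; replace it with the actual Redmine site.
uri = URI('https://www.example.org/robots.txt')

Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  response = http.head(uri.path)
  # Print only the header we are interested in.
  puts response['Content-Type']
end

A HEAD request is enough here, since only the response headers are of interest; any HTTP client that displays the headers would do just as well.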

Redmine’s robots.txt file is generated dynamically at run time from the template file at app/views/welcome/robots.html.erb. I had created a custom plugin in which, among other things, I override this template in order to extend the exclusion rules to cover more sections of the web site, such as each project’s issue tracker, time tracker, repository, calendar, Gantt chart, etc. These parts of a Redmine site tend to generate a huge number of web pages due to the various query arguments that are available for sorting and for alternate data views. My robots.html.erb looks like this:

User-agent: *
<% @projects.each do |p| -%>
Disallow: /projects/<%= p.to_param %>/time_entries.csv
Disallow: /projects/<%= p.to_param %>/activity
Disallow: /projects/<%= p.to_param %>/activity.atom
Disallow: /projects/<%= p.to_param %>/roadmap
Disallow: /projects/<%= p.to_param %>/issues
Disallow: /projects/<%= p.to_param %>/issues.atom
Disallow: /projects/<%= p.to_param %>/issues.pdf
Disallow: /projects/<%= p.to_param %>/issues.csv
Disallow: /projects/<%= p.to_param %>/issues/calendar
Disallow: /projects/<%= p.to_param %>/issues/gantt
Disallow: /projects/<%= p.to_param %>/issues/report
Disallow: /projects/<%= p.to_param %>/time_entries
Disallow: /projects/<%= p.to_param %>/time_entries.atom
Disallow: /projects/<%= p.to_param %>/time_entries.csv
Disallow: /projects/<%= p.to_param %>/wiki/Wiki/history
Disallow: /projects/<%= p.to_param %>/wiki/date_index
Disallow: /projects/<%= p.to_param %>/repository
Disallow: /projects/<%= p.to_param %>/repository/annotate
Disallow: /projects/<%= p.to_param %>/repository/diff
Disallow: /projects/<%= p.to_param %>/repository/statistics
<% end -%>
Disallow: /issues
Disallow: /issues.atom
Disallow: /issues.pdf
Disallow: /issues.csv
Disallow: /issues/gantt
Disallow: /issues/calendar
Disallow: /activity
Disallow: /activity.atom
Disallow: /time_entries
Disallow: /time_entries.atom
Disallow: /time_entries.csv
Disallow: /login
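
For reference, the override works because Redmine adds each plugin’s app/views directory to the template lookup path ahead of its own, so a plugin can replace a stock template simply by shipping a file at the same relative path. Assuming a plugin directory named codetrax_tweaks (a hypothetical name used only for illustration), the two files would be:

app/views/welcome/robots.html.erb                            (Redmine core)
plugins/codetrax_tweaks/app/views/welcome/robots.html.erb    (plugin override)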

After experimenting with it for a while, I came to the conclusion that, for some odd reason, my Redmine application always returned the robots.txt file with the text/html content type. I tried clearing the cache in the application’s tmp/cache/ directory, restarting the application server, and clearing Varnish‘s and my browser’s caches, but the content type of the file was still returned as:

...
Content-Type: text/html; charset=utf-8
...

At this point, I cannot safely tell what the real cause of this issue is. It could be a bug in Redmine, which I really doubt, or it could be a problem with the rather complicated and experimental server configuration I currently use on the box where CodeTRAX is hosted. Time permitting, I am going to do some more research on this in the coming months.

As I needed a quick resolution, I implemented the following workarounds, which enforce the correct content type whenever the robots.txt file is requested.

I added the following to the Varnish configuration (I’ll post more information about running Redmine behind Varnish in a future post. I’m still experimenting with it!):

sub vcl_backend_response {
    if (bereq.url == "/robots.txt") {
        # Make sure robots.txt has correct content type.
        set beresp.http.Content-Type = "text/plain; charset=utf-8";
        # Force a caching timeout (TTL) of 1 hour.
        set beresp.ttl = 1h;
    }
}

Also, to be on the safe side, I added the following to the Apache configuration:

<Files "robots.txt">
    ForceType "text/plain"
</Files>
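
With both of these in place, the response headers for robots.txt should look along these lines:

...
Content-Type: text/plain; charset=utf-8
...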

So, robots.txt is now always returned to the HTTP client with the correct text/plain content type. My guess is that, until now, the indexing bots could not properly evaluate the contents of this file due to the wrong content type they were given, and ended up ignoring all the rules inside it. I will have to wait several weeks before I can be certain that the wrong content type was the actual reason the bots completely ignored the indexing rules, but the more I think about it, the more certain I am that this small detail has been the cause of the problem.

Update (Sep 23, 2016): The robots.html.erb template has been revised.

