How to fix the WordPress database’s character-set issue

WordPress 2.2 and newer let the user define the MySQL database character set and collation (get familiar with these terms) in wp-config.php, through the DB_CHARSET and DB_COLLATE constants. Today, after upgrading to the newest version of WordPress, I decided to update this file as well and added the statement that sets the database encoding to ‘utf8’. But as soon as I started validating the RSS feed, as part of a general test of the new WordPress version, I noticed some weird characters that caused several warnings and errors in the feed validator’s output. This seemed strange, as I was certain that my data was being stored in UTF-8! I then spent over two hours dumping the WordPress database, performing all the required conversions and re-importing the fixed dump into MySQL with every possible combination of character sets, and I started to think that I had made some very serious mistake while configuring MySQL itself or WordPress. Fortunately, this was not the case…
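Weird characters of this kind typically appear when UTF-8 bytes are interpreted one byte at a time as latin1. Here is a minimal Python sketch of that effect. It is only an illustration and not anything WordPress or MySQL runs: the function names are made up, and Python's latin-1 codec only approximates MySQL's latin1 character set.

```python
def read_as_latin1(text: str) -> str:
    """Simulate UTF-8 bytes being treated as latin1, one character per byte."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo the misinterpretation: recover the raw bytes, decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "café"
garbled = read_as_latin1(original)

print(garbled)          # cafÃ©
print(repair(garbled))  # café
```

The point of the second function is that nothing is lost: the bytes are still the original UTF-8 bytes, they are just labelled with the wrong character set.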

Beginning with version 2.2, fresh installations of WordPress use UTF-8 as the default encoding for the database, its tables and their text/string fields, while older versions created latin1-encoded tables (with the latin1_swedish_ci collation) by default. This means that user data, regardless of its actual encoding, was being stored in latin1 tables and columns, which has eventually caused extra trouble for all old (pre-v2.2) WordPress users, especially those who write in their national language. The inevitable changes in version 2.2 give old WP users two choices:

  1. Either continue storing their data in latin1 MySQL tables, regardless of the actual encoding of that data, which means repeating the same mistake forever,
  2. or follow the painful procedure outlined in the Codex to convert the character set of the database, tables and fields to the appropriate one, without altering the already encoded user data.
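The idea behind the second choice can be sketched as follows. This assumes the usual approach of passing each text column through its binary counterpart, so that MySQL relabels the stored bytes instead of converting them. The Python below only builds and prints the SQL statements. The helper names are hypothetical, the table and column are just an example, and the sketch ignores indexes and column attributes, so treat it as an outline rather than a ready-to-run migration (and take a backup first).

```python
# Text types and their binary counterparts (an illustrative subset).
BINARY_TYPE = {
    "CHAR": "BINARY",
    "VARCHAR": "VARBINARY",
    "TINYTEXT": "TINYBLOB",
    "TEXT": "BLOB",
    "MEDIUMTEXT": "MEDIUMBLOB",
    "LONGTEXT": "LONGBLOB",
}


def binary_type(col_type: str) -> str:
    """Map e.g. 'VARCHAR(255)' to 'VARBINARY(255)', keeping any length."""
    base, paren, rest = col_type.partition("(")
    return BINARY_TYPE[base.strip().upper()] + paren + rest


def conversion_statements(table: str, column: str, col_type: str,
                          charset: str = "utf8") -> list:
    """Build the three ALTER statements for one text column.

    NOTE: MODIFY redefines the whole column, so a real run would also have
    to restate attributes such as NOT NULL or DEFAULT; they are left out here.
    """
    return [
        # 1. Switch to a binary type so MySQL stops interpreting the bytes.
        f"ALTER TABLE `{table}` MODIFY `{column}` {binary_type(col_type)};",
        # 2. Change the table's default character set.
        f"ALTER TABLE `{table}` DEFAULT CHARACTER SET {charset};",
        # 3. Switch back to the text type, now labelled with the new charset.
        f"ALTER TABLE `{table}` MODIFY `{column}` {col_type} CHARACTER SET {charset};",
    ]


# Hypothetical example: the post body column of a default-prefix install.
for statement in conversion_statements("wp_posts", "post_content", "LONGTEXT"):
    print(statement)
```

Going through a binary type is what keeps the user data untouched: a direct change from latin1 to utf8 would make MySQL convert the text, and data that is already UTF-8 would end up encoded twice.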

Of course, I chose solution No. 2, to get rid of this idiotic way of storing my data once and for all! The real problem is that the second solution is only provided in a “theoretical” form: there is no official database converter. Fortunately, a heroic WP user has coded a small “UTF-8 database converter”, which is installed like any other plugin and does the dirty job in a few clicks. Although this plugin has not been tested with the newest WordPress 2.5, I checked the part of the code that performs the actual conversion, tried it, and I think it works just fine. Afterwards, I checked the encodings of the WP tables (through a phpMyAdmin installation) and it seems that the plugin has done a good job. Also, the text is displayed correctly throughout G-Loaded.eu and its feeds, so I recommend it…

I should state that this issue is completely unrelated to the WordPress 2.5 release, which is probably one of the best releases I’ve seen so far. I just happened to try to resolve the database character-set issue today. Somehow, this post reminds me of the issue with the backslashes inside pre HTML tags that I wrote about in the past.

As always, if you notice any peculiar behaviour throughout the website, especially weird characters, feel free to contact me by email or use the forums.

How to fix the WordPress database’s character-set issue by George Notaras is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright © 2008 - Some Rights Reserved
