6. Using Unicode with MySQL++

6.1. A Short History of Unicode

...with a focus on relevance to MySQL++

In the old days, computer operating systems only dealt with 8-bit character sets. This only gives you 256 possible characters, but the modern Western languages have more than that by themselves. Add in all the other languages of the world, plus the various symbols people use, and you have a real mess! Since no standards body held sway over things like international character encoding in the early days of computing, many different character sets were invented. These character sets weren't even standardized between operating systems, so heaven help you if you needed to move localized Greek text on a Windows machine to a Russian Macintosh! The only way we got any international communication done at all was to build standards on the common 7-bit ASCII subset. Either people used approximations like a plain "c" instead of the French "ç", or they invented things like HTML entities ("&ccedil;" in this case) to encode these additional characters using only 7-bit ASCII.

Unicode solves this problem. It encodes every character in the world, using up to 4 bytes per character. The subset covering the most economically valuable cases takes two bytes per character, so most Unicode-aware programs limit themselves to this set, for efficiency.

Unfortunately, Unicode came about 20 years too late for Unix and C. Converting the Unix system call interface to Unicode would break all existing programs. The ISO lashed a wide character sidecar onto C in 1995, but in common practice C is still tied to 8-bit characters.

As Unicode began to take off in the early 1990s, it became clear that some sort of accommodation with Unicode was needed in legacy systems like Unix and C. During the development of the Plan 9 operating system (a kind of successor to Unix) Ken Thompson invented the UTF-8 encoding. Since UTF-8 is a superset of 7-bit ASCII, many programs that deal in text actually get by okay without any explicit support for UTF-8.
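To make the "superset of 7-bit ASCII" property concrete, here is a minimal sketch of a UTF-8 encoder. The function name encode_utf8() is hypothetical and is not part of MySQL++ or any library mentioned here; it only illustrates the standard bit layout, in which code points below 0x80 come out as the same single byte that ASCII uses, and everything else takes two to four bytes.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper for illustration only: encode one Unicode code
// point as UTF-8.  Returns an empty string for values that are not
// valid code points (UTF-16 surrogates, or anything above U+10FFFF).
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 7-bit ASCII: one byte, unchanged
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;     // surrogates are not characters
        }
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0x10FFFF) {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, a plain "c" (U+0063) encodes to the single byte 0x63, while "ç" (U+00E7) encodes to the two bytes 0xC3 0xA7. Every byte of a multi-byte sequence has its high bit set, which is why code that only looks for ASCII characters such as '\0' or '/' is not confused by UTF-8 text.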

The MySQL database server comes out of the Unix/C tradition, so it only supports 8-bit characters natively. UTF-8 data is compatible with C strings, so all versions of MySQL could store UTF-8 data, but sometimes the database actually needs to understand the data, such as when sorting. To support this, explicit UTF-8 support was added to MySQL in version 4.1.

Because MySQL++ does not need to know anything about the data flowing through it, it doesn't have explicit UTF-8 support. C++'s std::string stores UTF-8 data just fine. But, your program probably does care about the data coming from MySQL++. The remainder of this chapter covers the choices you have for dealing with UTF-8 encoded Unicode data.
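One place your program is likely to care is string length: std::string counts bytes, not characters. The following is a minimal sketch, assuming the string holds well-formed UTF-8. The function name utf8_length() is hypothetical, not a MySQL++ or standard library function. It counts characters by skipping UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only: count the Unicode code
// points in a std::string, assuming it holds well-formed UTF-8.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new character.
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the sample database's "Nürnberger Brats" stored as UTF-8, std::string::size() reports 17 bytes, while this function reports 16 characters, because "ü" occupies two bytes. Keep that difference in mind when sizing buffers or truncating strings.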

6.2. Unicode and Unix

Modern Unices support UTF-8 natively. Red Hat Linux, for instance, has had system-wide UTF-8 support since version 8. This continues in the commercial and Fedora forks of Red Hat Linux, of course.

On such a Unix, the terminal I/O code understands UTF-8 encoded data, so your program doesn't require any special code to correctly display a UTF-8 string. If you aren't sure whether your system supports UTF-8 natively, just run the simple1 example: if the first item has two high-ASCII characters in place of the "ü" in "Nürnberger Brats", you know it's not handling UTF-8.

If your Unix doesn't support UTF-8 natively, it likely doesn't support any form of Unicode at all, for the historical reasons I gave above. Therefore, you will have to convert the UTF-8 data to the local 8-bit character set. The standard Unix function iconv() can help here. If your system doesn't have the iconv() facility, there is a free implementation available from the GNU Project. Another library you might check out is IBM's ICU. This is rather heavy-weight, so if you just need basic conversions, iconv() should suffice.
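As a rough sketch of the iconv() approach, the function below converts a UTF-8 std::string to a named target character set. The function name convert_from_utf8() and the choice to report errors with exceptions are my own assumptions for illustration; only iconv_open(), iconv() and iconv_close() come from the iconv facility itself. The character set names your system accepts vary, so treat "ISO-8859-1" in the usage note as an example.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration only: convert UTF-8 text to
// another character set using iconv().
std::string convert_from_utf8(const std::string& in, const char* to_charset)
{
    iconv_t cd = iconv_open(to_charset, "UTF-8");
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open() failed");
    }

    // iconv() takes non-const pointers, so work on a copy of the input.
    // Note: some iconv implementations declare the input argument as
    // const char** instead of char**; adjust the type if yours does.
    std::string src(in);
    char* in_ptr = &src[0];
    std::size_t in_left = src.size();

    // Generous output buffer; 8-bit targets need no more than the
    // input size, but wider targets can need more.
    std::string out(in.size() * 4 + 4, '\0');
    char* out_ptr = &out[0];
    std::size_t out_left = out.size();

    std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1) {
        // Typically invalid input, or a character that the target
        // character set cannot represent
        throw std::runtime_error("iconv() failed");
    }

    out.resize(out.size() - out_left);  // trim the unused tail
    return out;
}
```

Calling convert_from_utf8(item, "ISO-8859-1") on a UTF-8 "Nürnberger" should yield a string in which "ü" is the single byte 0xFC. Characters that don't exist in the target character set make the conversion fail, so a real program needs to decide how to handle that case.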

6.3. Unicode and Win32

Each Win32 API function that takes a string actually has two versions. One version supports only 1-byte "ANSI" characters (a superset of ASCII); the names of these functions end in 'A'. Win32 also supports the 2-byte subset of Unicode called UCS-2. Some call these "wide" characters, so the names of the other set of functions end in 'W'. The MessageBox() API, for instance, is actually a macro, not a real function. If you define the UNICODE macro when building your program, the MessageBox() macro evaluates to MessageBoxW(); otherwise, to MessageBoxA().

Since MySQL uses UTF-8 and Win32 uses UCS-2, you must convert data going between the Win32 API and MySQL++. Since there's no point in trying for portability — no other OS I'm aware of uses UCS-2 — you might as well use native Win32 functions for doing this translation. The following code is distilled from utf8_to_win32_ansi() in examples/util.cpp:

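void utf8_to_win32_ansi(const char* utf8_str, char* ansi_str, int ansi_len)
{
    // Fixed-size scratch buffer; input that does not fit makes
    // MultiByteToWideChar() fail
    wchar_t ucs2_buf[100];
    static const int ub_chars = sizeof(ucs2_buf) / sizeof(ucs2_buf[0]);

    // Step 1: UTF-8 to UCS-2; -1 means utf8_str is null-terminated
    MultiByteToWideChar(CP_UTF8, 0, utf8_str, -1, ucs2_buf, ub_chars);

    // Step 2: UCS-2 to the OEM code page, which the console uses by default
    CPINFOEX cpi;
    GetCPInfoEx(CP_OEMCP, 0, &cpi);
    WideCharToMultiByte(cpi.CodePage, 0, ucs2_buf, -1,
            ansi_str, ansi_len, 0, 0);
}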

To see this in action, uncomment "#define USE_WIN32_UCS2" at the top of util.cpp, build the example programs, and run simple1 in a console window (a.k.a. "DOS box"). The first item should be "Nürnberger Brats". If not, see the last paragraph in this section.

utf8_to_win32_ansi() converts utf8_str from UTF-8 to UCS-2, and from there to the local code page. "Waitaminnit," you shout! "I thought we were trying to get away from the problem of local code pages!" The console is one of the few Win32 facilities that doesn't support UCS-2 by default. It can be put into UCS-2 mode, but that seems like more work than we'd like to go to in a portable example program. Since the default code page in most versions of Windows includes the "ü" character used in the sample database, this conversion works out fine for our purposes.

If your program is using the GUI to display text, you don't need the second conversion. Prove this to yourself by adding the following to utf8_to_win32_ansi() after the MultiByteToWideChar() call:

	MessageBoxW(0, ucs2_buf, L"UCS-2 version of Item", MB_OK);

All of this assumes you're using Windows NT or one of its direct descendants: Windows 2000, Windows XP, Windows 2003 Server, and someday "Longhorn". Windows 95/98/ME and Windows CE do not support UCS-2. They still have the 'W' APIs for compatibility, but they just smash the data down to 8-bit and call the 'A' version for you.

6.4. For More Information

The Unicode FAQs page has copious information on this complex topic.

When it comes to Unix and UTF-8 specific items, the UTF-8 and Unicode FAQ for Unix/Linux is a quicker way to find basic information.