How do I convert Chinese characters to their Latin equivalents?
Author: Deron Eriksson
Description: This Java tutorial describes how to convert Chinese characters to their Latin equivalents using ICU4J.
Tutorial created using: Windows XP || JDK 1.6.0_24 || Eclipse Java EE IDE for Web Developers, Indigo



The ICU4J library supports many different types of character conversions. In this tutorial, we'll look at how to convert Chinese characters to their romanized (Latin) equivalents. First off, I need some Chinese characters, and I don't know Chinese, so I'll go to Google Translate and translate the English phrase "large toad" into Chinese. This yields three Chinese characters.

Chinese characters for 'large toad'

In ICU4J, the Transliterator class is responsible for the transliteration (fancy word, huh?) of text from one form to another. We need to obtain a Transliterator object to perform a Chinese to Latin conversion. We can do that by calling Transliterator's static getInstance method with the String ID "Han-Latin", meaning we want to take Chinese (Han) characters and convert them to Latin characters. The resulting String is pinyin with tone marks, so it will include accented characters.

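When writing web applications, we often run into situations where we want plain characters with no accents. We can obtain a Transliterator that converts Chinese to Latin characters with the accents removed by passing "Han-Latin; nfd; [:nonspacing mark:] remove; nfc" to Transliterator's getInstance method. This is a compound ID whose steps run from left to right: convert Han to Latin, decompose each accented letter into a base letter plus combining marks (nfd), remove the nonspacing marks, and recompose what remains (nfc). If you're interested in the specifics of what's going on here, you can read more about it in the ICU4J documentation.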

ChineseToLatin.java

package demo;

import org.apache.commons.lang.StringEscapeUtils;

import com.ibm.icu.text.Transliterator;

public class ChineseToLatin {

	// Transliterator IDs: plain Han-to-Latin, and a compound ID that also strips accents
	public static final String CHINESE_TO_LATIN = "Han-Latin";
	public static final String CHINESE_TO_LATIN_NO_ACCENTS = "Han-Latin; nfd; [:nonspacing mark:] remove; nfc";

	public static void main(String[] args) {
		String chineseString = "大蟾蜍"; // "large toad"

		String unicodeCodes = StringEscapeUtils.escapeJava(chineseString);
		System.out.println("Unicode codes:" + unicodeCodes);

		Transliterator chineseToLatinTrans = Transliterator.getInstance(CHINESE_TO_LATIN);
		String result1 = chineseToLatinTrans.transliterate(chineseString);
		System.out.println("Chinese to Latin:" + result1);
		
		Transliterator chineseToLatinNoAccentsTrans = Transliterator.getInstance(CHINESE_TO_LATIN_NO_ACCENTS);
		String result2 = chineseToLatinNoAccentsTrans.transliterate(chineseString);
		System.out.println("Chinese to Latin (no accents):" + result2);
	}

}

The ChineseToLatin class demonstrates this. First off, I copy the three Chinese characters that represent "large toad" and paste them into the ChineseToLatin class as the value of the chineseString variable. The source file needs to be saved in an encoding that can represent these characters, such as UTF-8.

For informational purposes, I'll display the Unicode escape codes of these characters with the help of the Commons Lang StringEscapeUtils.escapeJava method.
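If you would rather not pull in Commons Lang just for this step, the idea can be sketched with the JDK alone. The class and method names below (UnicodeEscapes, toJavaEscapes) are hypothetical, and the sample input is an assumption: the escapes \u5927\u87FE\u870D stand for the three characters 大蟾蜍. Unlike escapeJava, this sketch only handles non-ASCII characters and does not escape quotes or control characters.

```java
final class UnicodeEscapes {

    // Hypothetical helper (not part of Commons Lang): renders each non-ASCII
    // UTF-16 code unit as a backslash, the letter u, and four uppercase hex
    // digits. ASCII characters are copied through unchanged.
    static String toJavaEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample input, written with escapes so that the encoding of
        // this source file cannot corrupt it.
        String sample = "\u5927\u87FE\u870D";
        System.out.println("Unicode codes:" + toJavaEscapes(sample));
    }
}
```

Writing the input as escapes also sidesteps the problem of Chinese characters turning into question marks when a source file is saved or compiled with the wrong encoding.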

Now, let's convert our Chinese characters to Latin, with accents. We do this by obtaining a "Han-Latin" Transliterator. We pass in chineseString to the Transliterator's transliterate method, and our romanized text is returned as a String.

After this, we convert from Chinese to Latin and remove the accents, using a "Han-Latin; nfd; [:nonspacing mark:] remove; nfc" Transliterator. We pass chineseString to this Transliterator's transliterate method, which returns the romanized text with the accents removed as a String.
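To get a feel for what the "nfd; [:nonspacing mark:] remove; nfc" part of that ID does, here is a rough sketch of the same three steps using only java.text.Normalizer from the JDK. It is a stand-in for illustration, not a description of how ICU4J works internally, and the names AccentStripper and stripAccents are hypothetical. The input is assumed to be text that is already romanized, such as "dà chán chú", written with escapes in the code.

```java
import java.text.Normalizer;

final class AccentStripper {

    // Hypothetical helper mirroring the three steps after "Han-Latin":
    // 1. NFD: split each accented letter into a base letter plus combining marks
    // 2. remove every nonspacing mark (Unicode general category Mn)
    // 3. NFC: recompose whatever is left
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String withoutMarks = decomposed.replaceAll("\\p{Mn}", "");
        return Normalizer.normalize(withoutMarks, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Accented pinyin sample, written with escapes to avoid any
        // dependence on the encoding of this source file.
        String accented = "d\u00E0 ch\u00E1n ch\u00FA";
        System.out.println(stripAccents(accented)); // prints "da chan chu"
    }
}
```

This only covers the accent-removal half. The Han-to-Latin conversion itself still needs ICU4J's Transliterator.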

When we execute the ChineseToLatin class, we see our expected results in the console.

Converting Chinese Characters to their Latin Equivalents, with and without Accents

Our results match the romanized Google Translation of "large toad".

Google Translation of 'large toad'

Here are the Maven dependencies for the ICU4J and Commons Lang libraries used in this tutorial:

Maven Dependency for ICU4J


<dependency>
	<groupId>com.ibm.icu</groupId>
	<artifactId>icu4j</artifactId>
	<version>4.8.1.1</version>
</dependency>

Maven Dependency for Commons Lang


<dependency>
	<groupId>commons-lang</groupId>
	<artifactId>commons-lang</artifactId>
	<version>2.6</version>
</dependency>