Converting Unicode to string

I thought it'd be cool to share a little idea I had in mind a few days ago. I was working on something related to strings in JavaScript. It sounds like nothing interesting, since almost everyone knows strings in JavaScript, just not in depth, and that's ok. No one really cares that much in day-to-day life or work.

What caught my attention when dealing with strings is that I accidentally put a special symbol inside a string. All special symbols are technically represented in the form of Unicode in JavaScript, AFAIK. Here's a definition of Unicode from speakingjs.com:

Unicode represents the characters it supports via numbers called code points. The hexadecimal range of code points is 0x0 to 0x10FFFF (17 times 16 bits).

That sounds cool but this is not what we're going to deep dive into.

A JavaScript code unit is 16 bits wide (0x0000 - 0xffff). To convert a character in a string to its corresponding integer, we can use String.prototype.charCodeAt.

/** UTF-16 -> integer */
const input = '∆';
const charCode = input.charCodeAt(0);

console.log(charCode); // 8710

The triangle symbol returns an integer value of 8710, which fits comfortably within the 16-bit code unit range. We can convert the integer value back to its symbol to prove the conversion is correct.

/** integer -> UTF-16 */
const charCode = 8710;
const hex = charCode.toString(16); // '2206' - the hex form used in the \u escape below

console.log('\u2206' === String.fromCharCode(charCode)); // Returns true. Both methods work!
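
As a quick aside, the hex string computed above can be parsed back into an integer with parseInt, which closes the round trip in the other direction. A minimal, self-contained sketch (variable names are mine):

/** hex string -> integer -> UTF-16 (a sketch) */
const hexForm = (8710).toString(16); // '2206'
const backToInt = parseInt(hexForm, 16); // 8710

console.log(String.fromCharCode(backToInt)); // ∆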

Beautiful! How about Unicode like emoji characters? Let's give it a try.

/** Unicode -> integer */
const input = '😄';
const charCode = input.charCodeAt(0); // 55357

console.log(input === String.fromCharCode(charCode)); // Returns false. What!?

It looks like its integer is still within the 16-bit code unit range, but the inverse conversion returns a different result. What is going on here?

According to the description of String.prototype.charCodeAt, it always returns a value that is less than 65536. In order to examine the actual value of the emoji, we have to use another method introduced in ES6 - String.prototype.codePointAt.

The modified code will be as follows:

/** Unicode -> integer (with String.prototype.codePointAt) */
const input = '😄';
const codePoint = input.codePointAt(0); // 128516, i.e. 0x1f604

You can clearly see that the actual integer value is now larger than what a single 16-bit code unit can hold - which is exactly why charCodeAt couldn't report it earlier.
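
As a side note, ES6 also added a code point escape syntax, \u{...}, which accepts the full value directly instead of squeezing it into a 4-digit \uXXXX escape. A quick sketch:

/** ES6 code point escape (a sketch) */
console.log('\u{1f604}' === '😄'); // true
console.log('\u{1f604}'.codePointAt(0).toString(16)); // '1f604'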

By taking a closer look at the composition of the emoji, you will find an interesting fact.

console.log('😄'.split('')); // Array [ "\ud83d", "\ude04" ]

Now you can see how things work behind the scenes. Instead of a single code unit, the emoji is made up of two code units (a surrogate pair) in order to store such a large character code. That makes some sense now. Besides that, there is another useful method introduced in ES6 to convert the integer back to its actual representation - String.fromCodePoint.

console.log(String.fromCodePoint(128516)); // 😄
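
If you're curious how the two code units from the split combine into 128516, here's the surrogate pair arithmetic as a rough sketch (variable names are mine; the 0xd800, 0xdc00 and 0x10000 constants come from the way UTF-16 encodes characters beyond the 16-bit range):

/** surrogate pair -> code point (a sketch of the UTF-16 decoding math) */
const high = '😄'.charCodeAt(0); // 0xd83d
const low = '😄'.charCodeAt(1); // 0xde04
const codePoint = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;

console.log(codePoint === 0x1f604); // true
console.log(String.fromCodePoint(codePoint)); // 😄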

Wrap up

Every JavaScript code unit is 16 bits wide. In the early days, the commonly used Unicode values could all be represented in a single 16-bit number, but that no longer fits today's use cases. Therefore, String.fromCodePoint and String.prototype.codePointAt were introduced in ES6 to deal with the more complete range of Unicode characters, which can consist of more than one code unit.
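
One last practical note: the ES6 string iterator is code-point aware, so for...of and the spread operator walk a string character by character rather than code unit by code unit. A small sketch (variable names are mine):

/** code unit vs code point iteration (a sketch) */
const text = 'JS 😄';

console.log(text.length); // 5 - .length counts code units, so the emoji counts twice
console.log([...text].length); // 4 - the spread operator iterates by code points

for (const ch of text) {
  console.log(ch.codePointAt(0)); // 74, 83, 32, 128516
}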

That is it. Have a wonderful day ahead and I'll see you in the upcoming blog post. Peace! ✌️

References