Converting Unicode code points to strings
Unicode represents the characters it supports via numbers called code points. Code points range from 0x0 to 0x10FFFF, which covers 17 planes of 65,536 (2^16) code points each.
That sounds cool, but it's not what we're going to deep dive into here.
JavaScript strings are sequences of UTF-16 code units, each of which is an integer in the range 0x0000 to 0xffff. To convert a character of a string to its corresponding integer, we can use String.prototype.charCodeAt.
```javascript
/** UTF-16 -> integer */
const input = '∆';
const charCode = input.charCodeAt(0);
console.log(charCode); // 8710
```
The triangle symbol returns an integer value of 8710, which fits within the range of a single UTF-16 code unit. We can convert the integer back to its symbol to prove the correctness.
```javascript
/** integer -> UTF-16 */
const charCode = 8710;
const hex = charCode.toString(16); // '2206', so the escape form is '\u2206'
console.log('\u2206' === String.fromCharCode(8710)); // Returns true. Both methods work!
```
Beautiful! How about Unicode characters like emoji? Let's give it a try.
```javascript
/** Unicode -> integer */
const input = '😄';
const charCode = input.charCodeAt(0); // 55357
console.log(input === String.fromCharCode(charCode)); // Returns false. What!?
```
Its integer value is still within the UTF-16 range, but the inverse conversion returns a different result. What is going on here?
According to the description of the method String.prototype.charCodeAt, it always returns a value that is less than 65536. In order to examine the actual value of the emoji, we have to use another method introduced in ES6: String.prototype.codePointAt.
The modified code will be as follows:
```javascript
/** Unicode -> integer (with String.prototype.codePointAt) */
const input = '😄';
const charCode = input.codePointAt(0); // 128516, or 0x1f604
```
You can clearly see that the actual code point is larger than what a single UTF-16 code unit can represent, which seems to contradict what was mentioned earlier.
By taking a closer look at the composition of the emoji, you will find an interesting fact.
```javascript
console.log('😄'.split('')); // Array [ "\ud83d", "\ude04" ]
```
Now you can see how things work behind the scenes. Instead of one code unit, the emoji is stored as two, known as a surrogate pair, in order to hold such a large character code. That makes sense now. Besides that, there is another useful method introduced in ES6 to convert the integer back to its actual representation: String.fromCodePoint.
```javascript
console.log(String.fromCodePoint(128516)); // 😄
```
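The mapping between a code point above 0xFFFF and its two UTF-16 code units (the surrogate pair) is simple arithmetic. Here is a minimal sketch of that computation; the function name toSurrogatePair is made up for illustration:

```javascript
// Convert a code point above 0xFFFF into its UTF-16 surrogate pair.
// (toSurrogatePair is a hypothetical helper name, not a built-in.)
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000; // leaves 20 bits to split in half
  const high = 0xd800 + (offset >> 10);  // top 10 bits -> high surrogate
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f604);
console.log(high.toString(16), low.toString(16)); // d83d de04
console.log(String.fromCharCode(high, low)); // 😄
```

This reproduces exactly the two code units the split above revealed.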
To summarize: UTF-16 code units are 16-bit. In the early days, commonly used Unicode characters could be represented with a single 16-bit number, but that no longer fits today's use cases. Therefore, String.fromCodePoint and String.prototype.codePointAt were introduced in ES6 to deal with the full range of Unicode characters, which can consist of more than one code unit.
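One practical consequence worth knowing: index-based string operations count UTF-16 code units, while ES6 iteration protocols count code points. A quick sketch to illustrate the difference:

```javascript
const text = 'a😄b';

// .length counts UTF-16 code units, so the emoji counts twice.
console.log(text.length); // 4

// The spread operator (and for...of) iterate by code point instead.
console.log([...text].length); // 3

for (const ch of text) {
  console.log(ch, ch.codePointAt(0)); // each full character with its code point
}
```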
That is it. Have a wonderful day ahead and I'll see you in the upcoming blog post. Peace! ✌️