Chinese Character Count in PHP


I just spent five hours banging my head against the wall trying to figure this out. In hopes that I can prevent someone else from suffering the same fate, I decided to share this.

Background & Issues:

Chinese characters represent words or parts of words rather than letters, and Chinese text has no spaces between words. A space-based word count therefore doesn't work.

Instead you have to do a character count. Unfortunately, a plain substring (PHP: substr()) won't work, because substr() counts bytes and a UTF-8 encoded Chinese character takes up more than one byte.

Keep in mind that the $content string is UTF-8 encoded. Each Chinese character is stored as multiple bytes (three for every character in this example), so the first ten bytes of the string hold only three complete Chinese characters plus the first byte of a fourth.

$content = '有史以來最好的網站';
// URL-encoded, the UTF-8 bytes of this string are '%E6%9C%89%E5%8F%B2%E4%BB%A5%E4%BE%86%E6%9C%80%E5%A5%BD%E7%9A%84%E7%B6%B2%E7%AB%99'

$excerpt = substr($content, 0, 10);
// substr() counts bytes, not characters, so the cut lands inside the fourth character and leaves a broken character at the end of the excerpt.

echo $excerpt;

The above snippet will output:

有史以�

(The black-diamond-question-mark symbol is the Unicode replacement character, U+FFFD. It is displayed in place of the leftover byte of the broken character.)
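To see where the byte count and the character count diverge, here is a minimal sketch that compares the two for the same string. It assumes the mbstring extension is enabled and that the file is saved as UTF-8; the variable names are only for illustration.

```php
<?php
// Assumption: mbstring is enabled and this file is saved as UTF-8.
$content = '有史以來最好的網站';

// strlen() counts bytes; mb_strlen() counts characters.
$byteCount = strlen($content);
$charCount = mb_strlen($content, 'UTF-8');

echo $byteCount, "\n"; // 27 (nine characters, three bytes each)
echo $charCount, "\n"; // 9

// The first ten bytes: three whole characters plus one stray lead byte (e4).
$firstTenBytes = bin2hex(substr($content, 0, 10));
echo $firstTenBytes, "\n"; // e69c89e58fb2e4bba5e4

// The truncated string is no longer valid UTF-8.
$isValid = mb_check_encoding(substr($content, 0, 10), 'UTF-8');
var_dump($isValid); // bool(false)
```

The trailing e4 is the first byte of the fourth character, which is what the browser renders as the broken symbol.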

Solution:

The multi-byte string cut function, mb_strcut(), takes its start and length in bytes just like substr(), but it recognizes multi-byte characters and drops any partially cut character from the return value.

$content = '有史以來最好的網站';

$excerpt = mb_strcut($content, 0, 10, 'UTF-8');

echo $excerpt;

The above snippet will output:

有史以
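Because mb_strcut() still measures its length in bytes, ten bytes yields three characters here. If you would rather cut by an actual number of characters, mb_substr() is one option. The sketch below contrasts the two; excerpt_chars() is a hypothetical helper written for this example (it is not a built-in PHP function), and the code assumes PHP 7 or later with mbstring enabled.

```php
<?php
// Assumption: mbstring is enabled and this file is saved as UTF-8.
$content = '有史以來最好的網站';

// mb_strcut(): length is in bytes, trimmed back to a character boundary.
$byBytes = mb_strcut($content, 0, 10, 'UTF-8');

// mb_substr(): length is in characters.
$byChars = mb_substr($content, 0, 5, 'UTF-8');

echo $byBytes, "\n"; // 有史以
echo $byChars, "\n"; // 有史以來最

// Hypothetical helper: keep at most $maxChars characters and append
// a suffix only when the text was actually shortened.
function excerpt_chars(string $text, int $maxChars, string $suffix = '…'): string
{
    if (mb_strlen($text, 'UTF-8') <= $maxChars) {
        return $text;
    }
    return mb_substr($text, 0, $maxChars, 'UTF-8') . $suffix;
}

echo excerpt_chars($content, 5), "\n";  // 有史以來最…
echo excerpt_chars($content, 20), "\n"; // 有史以來最好的網站
```

mb_strcut() is the better fit when the limit is a storage size in bytes (a database column, for example); mb_substr() is the better fit when the limit is what a reader sees.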
