I just spent five hours banging my head against the wall trying to figure this out. In hopes that I can prevent someone else from suffering the same fate, I decided to share this.
Background & Issues:
Chinese characters are word symbols rather than letters, and written Chinese puts no spaces between words, so you can't build an excerpt by counting words split on whitespace.
Instead you have to count characters. Unfortunately a plain substring (PHP: substr()) won’t work, because substr() counts bytes rather than characters, and in UTF-8 a Chinese character takes more than one byte.
Keep in mind that the $content string is UTF-8 encoded. Each Chinese character in it is stored as three bytes, so a ten-byte cut holds only three complete characters plus one stray byte of the fourth.
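To make the byte-versus-character difference concrete, here is a minimal sketch. The variable names are mine, not part of the original problem, and it assumes the mbstring extension is enabled and the file is saved as UTF-8.

```php
<?php
// Hypothetical sample text: three Chinese characters, saved as UTF-8.
$sample = '有史以';

// strlen() counts bytes; mb_strlen() counts characters in the given encoding.
$byteCount = strlen($sample);
$charCount = mb_strlen($sample, 'UTF-8');

echo $byteCount, " bytes, ", $charCount, " characters\n";
```

On a UTF-8 source file this should report 9 bytes and 3 characters, which is why byte-based offsets don't line up with what you see on screen.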
$content = '有史以來最好的網站';
// Percent-encoded, the UTF-8 bytes of this string are '%E6%9C%89%E5%8F%B2%E4%BB%A5%E4%BE%86%E6%9C%80%E5%A5%BD%E7%9A%84%E7%B6%B2%E7%AB%99'
$excerpt = substr($content, 0, 10);
// substr() cuts by byte and is unaware of multi-byte characters, so the excerpt ends in a broken character.
echo $excerpt;
The above snippet will output:
(The black-diamond-question-mark symbol denotes a broken Unicode character.)
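If you want to confirm in code that a cut went wrong, rather than spotting it in the browser, something like the following should work. It is only a sketch: it rebuilds the sample string from the percent-encoded bytes quoted above, so it doesn't depend on how the file is saved, and it assumes mbstring is available.

```php
<?php
// Sample string from this post, decoded from its percent-encoded UTF-8 bytes.
$content = rawurldecode('%E6%9C%89%E5%8F%B2%E4%BB%A5%E4%BE%86%E6%9C%80%E5%A5%BD%E7%9A%84%E7%B6%B2%E7%AB%99');

// Byte-based cut: nine bytes of whole characters plus one leftover byte.
$excerpt = substr($content, 0, 10);

// Show the raw bytes; the trailing "e4" is the first byte of the fourth character.
echo bin2hex($excerpt), "\n";

// mb_check_encoding() returns false when the string is not valid UTF-8.
$isValid = mb_check_encoding($excerpt, 'UTF-8');
var_dump($isValid);
```

The hex dump ends in a lone e4, and the validity check comes back false, which is the broken character the browser renders as the replacement symbol.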
Solution:
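The multi-byte string cut function, mb_strcut(), recognizes encoded characters and removes partially encoded characters from the return value. Its start and length arguments are still byte counts, so a length of 10 here returns the three complete characters that fit within ten bytes.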
$excerpt = mb_strcut($content, 0, 10, 'UTF-8');
echo $excerpt;
The above snippet will output:
有史以
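One thing worth knowing: if you would rather specify the excerpt length in characters than in bytes, mb_substr() is the character-based counterpart. The sketch below compares the two on the same sample string; the variable names and the length of 5 are just illustrative choices, and mbstring is assumed.

```php
<?php
// Sample string from this post, decoded from its percent-encoded UTF-8 bytes.
$content = rawurldecode('%E6%9C%89%E5%8F%B2%E4%BB%A5%E4%BE%86%E6%9C%80%E5%A5%BD%E7%9A%84%E7%B6%B2%E7%AB%99');

// mb_strcut(): start and length are in bytes; a cut that would split a
// character is moved back to the previous character boundary.
$byBytes = mb_strcut($content, 0, 10, 'UTF-8');

// mb_substr(): start and length are in characters.
$byChars = mb_substr($content, 0, 5, 'UTF-8');

echo mb_strlen($byBytes, 'UTF-8'), " characters in ", strlen($byBytes), " bytes\n";
echo mb_strlen($byChars, 'UTF-8'), " characters in ", strlen($byChars), " bytes\n";
```

mb_strcut() is handy when the limit is a storage size (a database column measured in bytes, say), while mb_substr() fits better when the limit is a visible number of characters.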
Thanks a lot for saving me 5 hours
Ditto
Good job !
Thank you, you save my ass!
thank you, thank you, thank you! appreciate the hours saved