PHP and Unicode: just a well-known secret.
My story began with Solr DIH (the DataImportHandler). It was way too slow. So I ended up building another tool to replace DIH, something friendly to CPU and memory. I did it. Not.
After indexing I realized that my text was full of ???????. WTF. Yeah, it’s an encoding problem. So I spent a day trying to solve this thing. What worked for me was this advice from 2005:
PHP can input and output Unicode, but a little different from what Microsoft means: when Microsoft says “Unicode”, it unexplicitly means little-endian UTF-16 with BOM (FF FE = chr(255).chr(254)), whereas PHP’s “UTF-16” means big-endian with BOM. For this reason, PHP does not seem to be able to output Unicode CSV file for Microsoft Excel. Solving this problem is quite simple: just put BOM in front of UTF-16LE string.
Example:
$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding($utf8_str, 'UTF-16LE', 'UTF-8');
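Here is the same trick wrapped in a small function, as a rough sketch only. The name toExcelUtf16 is made up for illustration, it assumes the mbstring extension is loaded, and the validity check is my own addition, not part of the quoted advice.

```php
<?php
// Hypothetical helper: convert a UTF-8 string to UTF-16LE and prepend
// the little-endian BOM (FF FE), as the quoted advice describes.
function toExcelUtf16(string $utf8): string
{
    // Refuse input that is not valid UTF-8 instead of converting garbage.
    if (!mb_check_encoding($utf8, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return "\xFF\xFE" . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}

// "\xC3\xA9" is the UTF-8 encoding of U+00E9 (e with acute accent).
echo bin2hex(toExcelUtf16("\xC3\xA9")), "\n"; // fffee900
```

The hex dump shows the BOM (ff fe) followed by the character in little-endian order (e9 00).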
I get no ??? chars anymore. I don’t know if it is the proper way to do it, and I still get the occasional htmlspecialchars “invalid multibyte sequence” warning. I think I’ll classify this solution as a “miracle”.
When will PHP 6 finally come?
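For the htmlspecialchars complaint, one possible approach (a sketch, not something I had at the time) is to name the encoding explicitly and let PHP substitute bad bytes instead of giving up. It assumes PHP 5.4 or later for ENT_SUBSTITUTE plus the mbstring extension, and escapeUtf8 is a hypothetical helper name.

```php
<?php
// Hypothetical helper: escape text for HTML, treating it as UTF-8 and
// replacing invalid byte sequences with U+FFFD instead of failing.
function escapeUtf8(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// "\xE9" on its own is Latin-1 for e-acute, which is not valid UTF-8.
$broken = "caf\xE9";

// Detect the problem up front ...
var_dump(mb_check_encoding($broken, 'UTF-8'));

// ... or escape anyway and get a valid UTF-8 string back.
$safe = escapeUtf8($broken);
var_dump(mb_check_encoding($safe, 'UTF-8'));
```

Seeing invalid sequences at this point usually means the bytes were already in the wrong encoding before they reached the template, so the substitution only hides the symptom.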
Update:
CRAP. DOES NOT WORK!
Update of Update:
SET NAMES utf8
I missed this statement when initializing the Zend_Db connection. (MySQL spells the charset utf8, without the hyphen.)
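A minimal sketch of how the connection setup could look, assuming MySQL and Zend Framework 1 with the Pdo_Mysql adapter. The helper utf8DbConfig and all connection values are placeholders for illustration. As far as I know, the Pdo_Mysql adapter reads a 'charset' option and sends SET NAMES on connect, but treat that as an assumption and check it against your Zend Framework version.

```php
<?php
// Hypothetical helper: add a UTF-8 charset to Zend_Db connection options
// unless the caller has already chosen one.
function utf8DbConfig(array $params): array
{
    // MySQL spells the charset "utf8", without the hyphen.
    return $params + ['charset' => 'utf8'];
}

// Placeholder connection values, not real ones.
$config = utf8DbConfig([
    'host'     => 'localhost',
    'username' => 'user',
    'password' => 'secret',
    'dbname'   => 'mydb',
]);

// With Zend Framework 1 loaded, the connection would then be created as:
//   $db = Zend_Db::factory('Pdo_Mysql', $config);
// or, to send the statement by hand right after connecting:
//   $db->query('SET NAMES utf8');
```

Either way, the point is that the connection charset is set once, at initialization, before any text is read from the database.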