Given how easy it is to post on Facebook, I convinced myself to change this blog’s theme to P2. Open the homepage and I’m ready to post. No need to open wp-admin and click New Post, and no more distractions. Just a white textbox.
-
Akhmad Fathonih
#php #unicode #insertcursewordhere
PHP and Unicode is just, well, a well-known secret.
My story began with SOLR DIH. It was way too slow. So, I ended up building another tool to replace DIH. Something friendly to CPU and memory. I did it. Not.
After indexing I realized that my text was full of ???????. WTF. Yeah, it’s an encoding problem. So I spent a day trying to solve this thing. What worked for me was this advice from 2005:
PHP can input and output Unicode, but a little different from what Microsoft means: when Microsoft says “Unicode”, it unexplicitly means little-endian UTF-16 with BOM(FF FE = chr(255).chr(254)), whereas PHP’s “UTF-16” means big-endian with BOM. For this reason, PHP does not seem to be able to output Unicode CSV file for Microsoft Excel. Solving this problem is quite simple: just put BOM infront of UTF-16LE string.
Example:
$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding($utf8_str, 'UTF-16LE', 'UTF-8');
I get no ??? chars anymore. I don’t know if this is the proper way to do it, and I still get the occasional htmlspecialchars “invalid multibyte sequence” warning. I think I’ll classify this solution as a “miracle”.
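The htmlspecialchars warning usually means the string is not valid UTF-8 to begin with. Here is a minimal sketch of checking for that before escaping. It assumes the stray bytes are ISO-8859-1, which is only a guess since the real source encoding isn’t known here, and `ensure_utf8` is a hypothetical helper name, not something from PHP or Zend:

```php
<?php
// Hypothetical helper (not part of PHP or Zend): returns valid UTF-8.
// ASSUMPTION: input that is not valid UTF-8 is really ISO-8859-1.
function ensure_utf8(string $text): string
{
    // mb_check_encoding() is true only if every byte sequence is valid UTF-8.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise re-encode from the assumed source encoding.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";            // "café" in ISO-8859-1; \xE9 alone is invalid UTF-8
$fixed  = ensure_utf8($latin1);

echo bin2hex($fixed), PHP_EOL;  // prints 636166c3a9
echo htmlspecialchars($fixed, ENT_QUOTES, 'UTF-8'), PHP_EOL;
```

If the bytes are in some other encoding, this produces wrong characters instead of a warning, so it only helps once the actual source encoding is known.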
When will PHP 6 finally come?
Update:
CRAP. DOES NOT WORK!
Update of Update:
SET NAMES UTF-8
I missed this statement when initializing the Zend_Db connection.
-
Akhmad Fathonih
Learning Mahout
My brain exploded. That’s pretty much my limit.
So, yes, I’ve been interested in SOLR, Apache Tika, and of course Mahout. The promise of classifying and clustering data is enough to persuade me to dig up examples about Mahout. So far, what really helps is the Seinfeld demo example. It gives me a proper example to try. We can replace the data with our own to get the gist of how Mahout works.
However, I haven’t gotten the gist yet. So far, I’ve tried to cluster two data sources. One of them is the blog posts from navinot.com. Here’s an excerpt from cluster-dump:
C-18 [Ponsel, Mobile, Internet, Mobile internet, Iphone]
- /6 Hal Tentang Mobile Internet.txt
- /Mobile Application_ Masa Depan Yang Ditunggu?.txt
- /Netbook_ Bakal Lenyap Seperti PDA?.txt
- /Premium Mobile Internet?.txt
- /The Gaps in Indonesian Internet.txt
- /iPhone & Telkomsel_ Deal or No Deal?.txt
I was imagining Mahout would cluster it into similarity groups. My guess is that it was clustered by keyword. I was using k-means.
Anyway, obviously we need to filter out stopwords. Mahout can read directly from a SOLR/Lucene index, but I didn’t have much luck with it. Something to do with empty terms or whatever. Feeding my raw data to SOLR and then querying it back out as text files will probably make a decent workaround.
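Another option is to strip stopwords from the text files before Mahout ever sees them. This is only a sketch: `remove_stopwords` is a hypothetical helper, and the three-word Indonesian stopword list is purely illustrative, nowhere near a complete one.

```php
<?php
// Hypothetical helper: drops stopwords from whitespace-separated text.
// ASSUMPTION: splitting on whitespace is good enough for tokenising here.
function remove_stopwords(string $text, array $stopwords): string
{
    // Split on runs of whitespace; the /u flag treats the input as UTF-8.
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // Lowercased stopwords as array keys for fast lookup.
    $lookup = array_flip(array_map('mb_strtolower', $stopwords));

    // Keep only the words whose lowercased form is not a stopword.
    $kept = array_filter($words, function ($word) use ($lookup) {
        return !isset($lookup[mb_strtolower($word)]);
    });

    return implode(' ', $kept);
}

// Illustrative list only; a real Indonesian stopword list is much longer.
$stopwords = ['yang', 'di', 'dan'];

echo remove_stopwords('Masa Depan Yang Ditunggu', $stopwords), PHP_EOL;
// prints: Masa Depan Ditunggu
```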
That’s a wrap for today. Time for Pocket Legend!
indobeta 10:29 am on 2/9/2012
Thanks for the info.