About Akhmad Fathonih

geek wannabe :)

#php #unicode #insertcursewordhere

PHP and Unicode: it’s just a well-known secret.

My story began with SOLR DIH. It was way too slow. So, I ended up building another tool to replace DIH. Something friendly to CPU and memory. I did it. Not.

After indexing I realized that my text was full of ???????. WTF. Yeah, it’s an encoding problem. So I spent a day trying to solve this thing. What worked for me was this advice from 2005:

PHP can input and output Unicode, but a little differently from what Microsoft means: when Microsoft says “Unicode”, it implicitly means little-endian UTF-16 with BOM (FF FE = chr(255).chr(254)), whereas PHP’s “UTF-16” means big-endian with BOM. For this reason, PHP does not seem to be able to output a Unicode CSV file for Microsoft Excel. Solving this problem is quite simple: just put the BOM in front of a UTF-16LE string.

Example:

$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding($utf8_str, 'UTF-16LE', 'UTF-8');

I don’t get ??? characters anymore. I don’t know if this is the proper way to do it, and I still get the occasional htmlspecialchars “invalid multibyte sequence” warning. I think I’ll classify this solution as a “miracle”.
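
The quoted advice can be wrapped in a small helper. This is only a sketch: the function name and the sample string are made up for illustration, and it assumes the mbstring extension is loaded and the input really is valid UTF-8.

```php
<?php
// Hypothetical helper, not part of PHP: prepend the little-endian BOM
// (FF FE) to a UTF-8 string converted to UTF-16LE.
function utf8_to_excel_utf16le(string $utf8_str): string
{
    // chr(255).chr(254) is the byte order mark from the quoted advice.
    $bom = chr(255) . chr(254);
    return $bom . mb_convert_encoding($utf8_str, 'UTF-16LE', 'UTF-8');
}

// Made-up sample text; "\xC3\xA9" is the UTF-8 encoding of "é".
$utf8_str = "caf\xC3\xA9";
$unicode_str_for_Excel = utf8_to_excel_utf16le($utf8_str);

// To save it for Excel, something like this would do:
// file_put_contents('export.csv', $unicode_str_for_Excel);
```

Every character takes at least two bytes after the conversion, so the result is the 2-byte BOM plus twice the number of characters for plain text like this.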

When will PHP 6 finally come?

Update:

CRAP. DOES NOT WORK!

Update of Update:

SET NAMES UTF-8

I missed this statement when initializing the Zend_Db connection.
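
Roughly, the fix looks like the sketch below. It assumes MySQL (where the character set is spelled utf8, without the hyphen) and Zend Framework 1. The host, credentials and database name are placeholders, not real settings.

```php
<?php
// Placeholder connection settings: every value here is made up.
$params = array(
    'host'     => 'localhost',
    'username' => 'dbuser',
    'password' => 'secret',
    'dbname'   => 'blog',
    // As far as I know, the Pdo_Mysql adapter turns this option into a
    // SET NAMES statement when it connects.
    'charset'  => 'utf8',
);

// Only try to connect when Zend Framework 1 is actually loaded.
if (class_exists('Zend_Db')) {
    $db = Zend_Db::factory('Pdo_Mysql', $params);
    // The explicit version of the same thing, run right after connecting.
    $db->query('SET NAMES utf8');
}
```

Either the charset option or the explicit query should be enough on its own; the point is that the connection has to be told about UTF-8 before any text is read or written.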

Learning Mahout

My brain exploded. That’s pretty much my limit.

So, yes, I’ve been interested in SOLR, Apache Tika, and of course Mahout. The promise of classifying and clustering data is enough to persuade me to dig up examples of Mahout in use. So far, what really helps is the Seinfeld demo. It gives me a proper example to try. We can replace the data with our own to get the gist of how Mahout works.

However, I haven’t got the gist yet. So far, I’ve tried to cluster two data sources. One of them is the blog posts from navinot.com. Here’s an excerpt from the cluster dump:

C-18 [Ponsel, Mobile, Internet, Mobile internet, Iphone]
- /6 Hal Tentang Mobile Internet.txt
- /Mobile Application_ Masa Depan Yang Ditunggu?.txt
- /Netbook_ Bakal Lenyap Seperti PDA?.txt
- /Premium Mobile Internet?.txt
- /The Gaps in Indonesian Internet.txt
- /iPhone & Telkomsel_ Deal or No Deal?.txt

I imagined Mahout would cluster the posts into groups of similar documents. My guess is that it clustered them by keyword. I was using k-means.

Anyway, obviously we need to filter out stopwords. Mahout can read directly from a SOLR/Lucene index, but I didn’t have much luck with that. Something to do with empty terms or whatever. Feeding my raw data to SOLR and then querying it back out into text files would probably make a decent workaround.

That’s a wrap for today. Time for Pocket Legend!

 

How to use Lucene 3.4 with Mahout 0.5

As you may have found out the frustrating way, Mahout 0.5 was built against Lucene 3.1. How on earth can we use Lucene 3.4 then? My SOLR is 3.4, and I want to use its index to play with Mahout.

Fear not. Just download Mahout 0.5, both source and binaries. Extract them; they will end up in the same folder, i.e. mahout-distribution-0.5. Now, open up the pom.xml there. Find the lucene entries and replace 3.1.0 with 3.4.0. I reckon there are only 4 of them. Then run mvn install. You may want to skip tests with: mvn -DskipTests=true install.
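
If editing by hand feels error-prone, the version bump can be scripted. This is a rough sketch, not the official way: the function name is made up, and it assumes each Lucene entry in pom.xml is a dependency block with groupId org.apache.lucene and its own version tag, which may not match the real file exactly.

```php
<?php
// Hypothetical helper: swap the version only inside <dependency> blocks
// that belong to Lucene, so other 3.1.0 versions are left alone.
function bump_lucene_version(string $pom, string $from = '3.1.0', string $to = '3.4.0'): string
{
    // Handle each <dependency>...</dependency> block separately.
    return preg_replace_callback(
        '#<dependency>.*?</dependency>#s',
        function (array $m) use ($from, $to): string {
            // Leave non-Lucene dependencies untouched.
            if (strpos($m[0], '<groupId>org.apache.lucene</groupId>') === false) {
                return $m[0];
            }
            return str_replace("<version>$from</version>", "<version>$to</version>", $m[0]);
        },
        $pom
    );
}

// Usage idea (keep a backup of pom.xml first):
// file_put_contents('pom.xml', bump_lucene_version(file_get_contents('pom.xml')));
```

It is worth diffing the result against the original pom.xml to confirm that only the Lucene entries changed.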

Once done, do: export MAHOUT_CORE=1

Then run mahout from the mahout-distribution-0.5/bin folder.

I don’t get the index incompatibility error anymore. But I keep getting a complaint about documents not having enough term vectors, even though I’ve updated schema.xml and reindexed my docs.

Will write more once I get past it.