My brain exploded. That’s pretty much my limit.
So, yes, I’ve been interested in SOLR, Apache Tika, and of course Mahout. The promise of classifying and clustering data is enough to persuade me to dig up examples of Mahout. So far, what has really helped is the Seinfeld demo example. It gives me a proper example to try. We can replace the data with our own to get the gist of how Mahout works.
However, I haven’t gotten the gist yet. So far, I’ve tried to cluster two data sources. One of them is the blog posts from navinot.com. Here’s an excerpt from the cluster-dump:
C-18 [Ponsel, Mobile, Internet, Mobile internet, Iphone]
- /6 Hal Tentang Mobile Internet.txt
- /Mobile Application_ Masa Depan Yang Ditunggu?.txt
- /Netbook_ Bakal Lenyap Seperti PDA?.txt
- /Premium Mobile Internet?.txt
- /The Gaps in Indonesian Internet.txt
- /iPhone & Telkomsel_ Deal or No Deal?.txt
I was imagining Mahout would cluster the posts into groups of similar content. My guess is that they were clustered by keyword. I was using k-means.
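To convince myself why k-means ends up grouping by keyword, here is a tiny sketch in plain Python. It is not Mahout's code or API; the toy documents, the function names, and the naive "first k documents" initialization are all my own assumptions. Each document becomes a normalized word-count vector, so two documents are only "close" when they share words.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and keep purely alphabetic tokens (a simplification).
    return [w for w in text.lower().split() if w.isalpha()]

def vectorize(docs):
    # Build one unit-length word-count vector per document.
    vocab = sorted({w for d in docs for w in tokenize(d)})
    vectors = []
    for d in docs:
        counts = Counter(tokenize(d))
        vec = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def distance(a, b):
    # Plain Euclidean distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iterations=10):
    # Naive initialization (assumption): the first k documents are the centroids.
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every document to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Made-up toy documents, not the real navinot.com posts.
docs = [
    "mobile internet ponsel mobile",
    "resep nasi goreng pedas",
    "iphone mobile internet",
    "resep sambal pedas",
]
vocab, vectors = vectorize(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

Documents that share words like "mobile" and "internet" land in the same cluster, which matches what the C-18 cluster above looks like. It also hints at why common filler words would pull unrelated posts together.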
Anyway, obviously we need to filter out stopwords. Mahout can read directly from a SOLR/Lucene index, but I didn’t have much luck with it. Something to do with empty terms or whatever. Feeding my raw data to SOLR and then querying it back out into text files would probably make a decent workaround.
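For the stopword part, something as simple as the sketch below might do as a pre-processing step before handing text files to Mahout. This is only my own assumption of how it could look: the stopword set is a tiny made-up sample (a real list for Indonesian and English would be much longer), and `remove_stopwords` is a hypothetical helper, not anything from Mahout or SOLR.

```python
import re

# Illustrative sample only; a real stopword list would be far larger.
STOPWORDS = {"yang", "di", "dan", "tentang", "the", "in", "or", "no"}

def remove_stopwords(text, stopwords=STOPWORDS):
    # Split into lowercase alphanumeric tokens, then drop the stopwords.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords("The Gaps in Indonesian Internet"))
```

Running it over each post before vectorizing should leave only the words that actually say something about the topic.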
That’s a wrap for today. Time for Pocket Legend!