Learning Mahout

My brain exploded. That’s pretty much my limit.

So, yes, I’ve been interested in SOLR, Apache Tika, and of course Mahout. The promise of classifying and clustering data is enough to persuade me to dig up examples of Mahout. So far, what has really helped is the Seinfeld demo example. It gives me a proper example to try. We can replace the data with our own to get the gist of how Mahout works.

However, I haven’t gotten the gist yet. So far, I’ve tried to cluster 2 data sources. One of them is the blog posts from navinot.com. Here’s an excerpt from the cluster dump:

C-18 [Ponsel, Mobile, Internet, Mobile internet, Iphone]
- /6 Hal Tentang Mobile Internet.txt
- /Mobile Application_ Masa Depan Yang Ditunggu?.txt
- /Netbook_ Bakal Lenyap Seperti PDA?.txt
- /Premium Mobile Internet?.txt
- /The Gaps in Indonesian Internet.txt
- /iPhone & Telkomsel_ Deal or No Deal?.txt

I was imagining Mahout would cluster the posts into groups of similar documents. My guess is that it clustered them by keyword. I was using k-means.

Anyway, obviously we need to filter out stopwords. Mahout can read directly from a SOLR/Lucene index, but I didn’t have much luck with it. Something to do with empty terms or whatever. Feeding my raw data to SOLR and then querying it back out as text files would probably make a decent workaround.
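
For the stopword part, here is a minimal sketch of stripping stopwords from a text file before handing it to Mahout. It is not something Mahout ships; the function name strip_stopwords and the stopword file (one lowercase word per line) are my own assumptions. It also lowercases everything and throws away punctuation, and non-ASCII characters may behave differently depending on your locale.

```shell
# Hypothetical helper: read text on stdin, print it lowercased with
# stopwords removed. $1 is a file with one lowercase stopword per line.
strip_stopwords() {
  tr -cs '[:alnum:]' '\n' |          # one word per line, punctuation dropped
    tr '[:upper:]' '[:lower:]' |     # normalise case
    grep -vxFf "$1" |                # drop lines that exactly match a stopword
    sed '/^$/d' |                    # drop empty lines
    paste -sd ' ' -                  # join the remaining words with spaces
}

# Example usage (stop.txt and the file names are placeholders):
#   strip_stopwords stop.txt < "post.txt" > "post.clean.txt"
```

Since the navinot.com posts mix Indonesian and English, the stopword file would probably need words from both languages.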

That’s a wrap for today. Time for Pocket Legend!

 

How to use Lucene 3.4 with Mahout 0.5

As you may have found to your frustration, Mahout 0.5 was built with Lucene 3.1 dependencies. How on earth can we use Lucene 3.4 then? My SOLR is 3.4, and I want to use its index to play with Mahout.

Fear not. Just download Mahout 0.5, both source and binaries. Extract them; they will end up in the same folder, i.e. mahout-distribution-0.5. Now, open up the pom.xml in that folder. Find the Lucene dependencies and replace 3.1.0 with 3.4.0. I reckon there are only 4 of them. Then run mvn install. You may want to skip the tests with: mvn -DskipTests=true install.
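
If you would rather not edit the file by hand, the version bump described above can be sketched as a small shell function. The name bump_lucene is made up, and the sketch assumes that each Lucene dependency in pom.xml lists its org.apache.lucene groupId before its version line inside one dependency element. Check the result with grep before building.

```shell
# Hypothetical helper: bump_lucene POM FROM TO
# Rewrites FROM to TO only on <version> lines that belong to an
# org.apache.lucene dependency. The original is kept as POM.bak.
bump_lucene() {
  pom="$1"; from="$2"; to="$3"
  cp "$pom" "$pom.bak"
  awk -v from="$from" -v to="$to" '
    /org\.apache\.lucene/ { in_lucene = 1 }   # entered a Lucene dependency
    /<\/dependency>/      { in_lucene = 0 }   # left the dependency element
    in_lucene && /<version>/ {
      idx = index($0, from)                   # plain substring match, no regex
      if (idx > 0)
        $0 = substr($0, 1, idx - 1) to substr($0, idx + length(from))
    }
    { print }
  ' "$pom.bak" > "$pom"
}

# Example usage, run from inside mahout-distribution-0.5:
#   bump_lucene pom.xml 3.1.0 3.4.0
#   grep -n '3.4.0' pom.xml          # should show the lines that changed
#   mvn -DskipTests=true install
```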

Once done, do: export MAHOUT_CORE=1

Run mahout from mahout-distribution-0.5/bin folder.

I don’t get the index incompatibility error anymore. But I keep getting an error about documents not having enough term vectors, even though I’ve updated schema.xml and reindexed my docs.

Will write more once I pass it.

bacula-fd authentication failed

So, I’ve been trying to set up a two-tier Bacula. I got stuck on a “cannot connect to client” error.

To grab more clues, run this line on the bacula-fd machine:

sudo /usr/sbin/bacula-fd -f -d100 -c /etc/bacula/bacula-fd.conf

Then do the bconsole dance on the bacula-dir machine. Use the “status” command to test the connection to the client. If you see cram-md5 authentication failed in the bacula-fd output, then you have the same problem as I did. Otherwise, check the connection between bacula-dir and bacula-fd.

Here’s the solution:

in bacula-fd.conf:

Director {
  Name = bacula-director
  Password = "remote-fd-passwd"
}

“Name” should be your bacula-dir Name. You can find this in bacula-dir.conf. See below:

Director {                            # define myself
  Name = bacula-director
  DIRport = 9101                # where we listen for UA connections
  QueryFile = "/etc/bacula/scripts/query.sql"
  WorkingDirectory = "/var/lib/bacula"
  PidDirectory = "/var/run/bacula"
  Maximum Concurrent Jobs = 1
  Password = "blahblahblah"         # Console password
  Messages = Daemon
  DirAddress = 127.0.0.1
}

Then the password in bacula-fd.conf should match the one in your client definition in bacula-dir.conf, e.g.:

Client {
  Name = remote-fd
  Address = remote.fd.ip
  FDPort = 9102
  Catalog = MyCatalog
  Password = "remote-fd-passwd"          # password for FileDaemon
  File Retention = 30 days            # 30 days
  Job Retention = 6 months            # six months
  AutoPrune = yes                     # Prune expired Jobs/Files
}
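
To double-check that the two passwords really match, something like the sketch below could help. The function bacula_password is my own invention, not a Bacula tool, and it is a very naive parser: it assumes one directive per line, a resource that opens with its type followed by { on the same line, and passwords without spaces. You need a copy of both conf files on the same machine to compare them.

```shell
# Hypothetical helper: bacula_password FILE TYPE NAME
# Prints the Password (quotes included) of the resource of the given
# TYPE (e.g. Director, Client) whose Name equals NAME.
bacula_password() {
  awk -v type="$2" -v name="$3" '
    $1 == type && $2 == "{"                { inblk = 1; n = ""; p = ""; next }
    inblk && $1 == "Name" && $2 == "="     { n = $3 }
    inblk && $1 == "Password" && $2 == "=" { p = $3 }
    inblk && $1 == "}"                     { if (n == name) print p; inblk = 0 }
  ' "$1"
}

# Example usage with the names from this post:
#   fd_pw=$(bacula_password bacula-fd.conf Director bacula-director)
#   dir_pw=$(bacula_password bacula-dir.conf Client remote-fd)
#   [ "$fd_pw" = "$dir_pw" ] && echo "passwords match" || echo "passwords differ"
```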

Don’t forget to restart bacula-dir and bacula-fd after modifying the conf files. Good luck!