Updates from May, 2008

  • Akhmad Fathonih 10:14 pm on 5/30/2008
    Tags: opensource, Technology, vertical search

    Going Vertical with SOLR: What Is SOLR, Anyway?

    Hehehe, I have to admit that I left out something important in my previous post. Maybe that is what plunged the post into the abyss of shame, with zero comments. Nyahahahahah.

    To tell you the truth, SOLR is great. SOLR is actually similar to a flat database optimized for searching. Just like a database, SOLR has what are called fields. Where a common DBMS can hold many tables, SOLR lets you create only one “table”. So how does it differ from an ordinary database?

    As in an ordinary database, fields in SOLR can be indexed. What sets SOLR apart is that indexing follows an algorithm we define ourselves. For example, we can index with whitespace removed so that a record matches the keywords “PowerShot”, “Power-shot”, “power shot”, or “power/shot”. With an ordinary database you can certainly simulate the same thing, but you would have to process the keyword yourself before forwarding it to the database as a query. You won’t need such activity when dealing with SOLR. In the SOLR world, keywords are analyzed by SOLR itself. That process can be exactly the same as the one used at indexing time, or entirely different. We can define the procedure to suit our needs. For example, take this definition from the SOLR schema file:

    A text field that uses WordDelimiterFilter to enable splitting and matching of  words on case-change, alpha numeric boundaries, and non-alphanumeric chars, so that a query of “wifi” or “wi fi” could match a document containing “Wi-Fi”.

    Synonyms and stopwords are customized by external files, and stemming is enabled. Duplicate tokens at the same position (which may result from Stemmed Synonyms or WordDelim parts) are removed.

    I guess the quote above explains how interesting a SOLR field is. Hehehe. The complete example schema can be seen here. The excerpt that matches the quote above looks like this:

    <!-- A text field that uses WordDelimiterFilter to enable splitting and matching of
            words on case-change, alpha numeric boundaries, and non-alphanumeric chars,
            so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
            Synonyms and stopwords are customized by external files, and stemming is enabled.
            Duplicate tokens at the same position (which may result from Stemmed Synonyms or
            WordDelim parts) are removed.
            -->
        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <!-- in this example, we will only use synonyms at query time
            <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
            -->
            <!-- Case insensitive stop word removal.
                 enablePositionIncrements=true ensures that a 'gap' is left to
                 allow for accurate phrase queries.
            -->
            <filter class="solr.StopFilterFactory"
                    ignoreCase="true"
                    words="stopwords.txt"
                    enablePositionIncrements="true"
                    />
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>
        </fieldType>

    See, there is a separate section for the analysis done at indexing time, and a separately defined procedure for processing the queries entered by users.
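
    For comparison, here is a minimal Python sketch of the hand-rolled approach you would need with an ordinary database: the same normalization is applied when a record is indexed and when a keyword arrives. This is not how SOLR works internally; the names `normalize`, `build_index`, and `search` and the sample records are assumptions made up for illustration.

```python
from __future__ import annotations

import re


def normalize(text: str) -> str:
    """Lowercase the text and drop every character that is not a-z or 0-9."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def build_index(records: dict[int, str]) -> dict[str, set[int]]:
    """Index-time step: map each normalized field value to its record ids."""
    index: dict[str, set[int]] = {}
    for record_id, value in records.items():
        index.setdefault(normalize(value), set()).add(record_id)
    return index


def search(index: dict[str, set[int]], keyword: str) -> set[int]:
    """Query-time step: normalize the keyword the same way, then look it up."""
    return index.get(normalize(keyword), set())


# Hypothetical sample records: id -> product name field.
records = {1: "PowerShot", 2: "Wi-Fi"}
index = build_index(records)

for keyword in ["PowerShot", "Power-shot", "power shot", "power/shot", "wi fi"]:
    print(keyword, "->", search(index, keyword))
```

    With SOLR, both steps live in the field type definition instead of in application code, and the index-time and query-time chains are free to differ.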

    So what is the advantage? It is clearly better because this processing has already been “refactored” out: you no longer need to handle it yourself, as you would with a conventional database solution. That means you can offer a different, better experience. Of course, it can also mean your content becomes more discoverable and carries more value than plain text alone.

    Vertical search is the closest application we can think of. Not like Google as it is today (yes, Google may well have the data to do vertical search too), but perhaps more like the AOL example I gave yesterday.

    Photo source: Flickr, author jurvetson.

    • KaiToU 10:46 pm on 5/30/2008

      Yeah ;)

    • andrew 1:45 pm on 6/12/2008

      How does it compare with an XML database using XQL? Which is better, SOLR or XMLDB?

    • Akhmad Fathonih 10:55 am on 6/13/2008

      @andrew
      SOLR/Lucene does not store its data as XML, and you do not need XQL/XPath to query it.

      I don’t know (much) about XMLDB yet, but its end goal seems to differ from SOLR’s. SOLR is search optimized, while XMLDB may not be.

  • Akhmad Fathonih 11:42 am on 2/21/2008

    Database Scalability (and Drupal) 

    Disclaimer:

    1. I am not a DBA, nor do I hold any other database expert title.
    2. I am not a Drupal expert either.

    Having spent some good days browsing the net and reading like crazy, I managed to write up the following.

    Database Scalability options:

    • Scale up
      • more powerful hardware: RAM, processors, storage
    • Scale out
      • Federation
        • MySQL 5.x supports federated tables (remote tables). However, it still has some issues: heavy traffic between federated servers
        • Benefits: storage is spread out, and load on the main server is relatively reduced (delegated to the federated servers)
        • Disadvantages: heavy network traffic between federated servers
        • Issues: network connection capacity between federated servers
      • Sharding (Partitioning)
        • Benefits: server load and storage are spread out
        • Disadvantages: relatively complicated, since it involves application-layer changes
        • Issues: involves the application layer

    Federation is more transparent to the developer, as the application can be totally unaware of the federation setup. However, its bottleneck (the traffic between federated servers) surely puts a limit on scaling (out).
    More to explore:

    • Replicating a federated database.
    • Load balancing a federated database.

    Sharding, on the other hand, while giving enormous flexibility in scaling (out) options, likely requires a ‘built-from-scratch’ application setup/environment. Special data hashing/sharding logic must be incorporated into the application layer in order to implement sharding, which sometimes goes against the intention of some frameworks.
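
    To give a feel for the hashing/sharding logic the application layer has to carry, here is a minimal Python sketch of hash-based shard routing. It is only one possible scheme, not anything Drupal or MySQL provides; the host names in `SHARDS`, the `shard_for` helper, and the `user:N` key format are all hypothetical.

```python
from __future__ import annotations

import zlib

# Hypothetical shard hosts; a real deployment would load these from configuration.
SHARDS = ["db0.example.internal", "db1.example.internal", "db2.example.internal"]


def shard_for(key: str, shards: list[str] = SHARDS) -> str:
    """Pick a shard by hashing the key with CRC32, which is stable across processes."""
    return shards[zlib.crc32(key.encode("utf-8")) % len(shards)]


# Every query for a given key must be routed through the same function.
for user_key in ["user:1", "user:2", "user:3"]:
    print(user_key, "->", shard_for(user_key))
```

    Plain modulo hashing like this is easy to follow, but changing the number of shards remaps most keys, which is part of why sharding tends to shape the application from the start.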

    What about Drupal then? Drupal depends heavily on the node table. Though developers can create new node types that use external tables (for extra attributes), those nodes will still be tied to the core node table (an API restriction, e.g. node_load, node_save, and the other node_xxx functions). From these points, the only available scaling options for Drupal would be scaling up or using the federated option (which somehow feels like scaling up, eventually).

    There has been a rumour about ASQL (an automated sharding proxy), but I haven’t found any available code yet.

    PS:

    1. For load balancing, there are many options, including hardware (BIG-IP) or software (mysqlproxy, sql-relay, sequoia).
    2. The sharding approach is similar to the BigTable approach, which spreads storage to localize (spread) load via a column-based database structure.
    3. The more you want to scale (your app/database), the more you will love data redundancy and giving up normalization.

    References:
    [1] http://en.wikipedia.org/wiki/Federated_database_system
    [2] http://dev.mysql.com/tech-resources/articles/mysql-federated-storage.html
    [3] http://www.onlamp.com/pub/a/databases/2006/08/10/mysql-federated-tables.html
    [4] http://buytaert.net/scaling-with-mysql-replication
    [5] http://www.johnandcailin.com/blog/john/scaling-drupal-step-four-database-segmentation-using-mysql-proxy
    [6] http://mysqldba.blogspot.com/2006/11/unorthodox-approach-to-database-design.html
    [7] http://sequoia.continuent.org/HomePage

     
  • Akhmad Fathonih 8:19 am on 6/19/2007

    Officially on bcm43xx (not ndiswrapper) 

    [  211.580000] bcm43xx driver
    [  211.580000] ACPI: PCI Interrupt 0000:06:06.0[A] -> GSI 18 (level, low) -> IRQ 19
    [  270.632000] ADDRCONF(NETDEV_UP): eth1: link is not ready
    [  273.804000] ieee80211_crypt: registered algorithm 'WEP'
    [  273.960000] SoftMAC: Open Authentication completed with 00:80:48:25:81:c6
    [  273.968000] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready

    Ok, now I’m officially on the broadcom43xx driver, not ndiswrapper anymore. There’s still a configuration item left, as the module isn’t loaded automatically. wlan0 is no more; eth1 is the new wlan0. We’ll see if this driver can handle the weirdness I ran into when hanging around the hotspot I mentioned before.
