Apr 19 2011

Episode 117: Full Text Search


Add enterprise-level search to your site.

News and Follow-ups – 01:00

Geek Tools – 14:13

  • Yikerz! – Super fun magnet game

Webapps – 16:12

Full Text Search – 22:11

  • Options
    • Google Custom Search
      • Commercial
      • Benefits
        • Super fast to set up
        • Easy to implement
        • Ability to add AdSense to search results
      • Downsides
        • Unable to adjust content ranking or do custom integration
        • Mainly just for indexing HTML pages, not search queries or other text.
    • Sphinx
      • “Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.”
      • Open source with commercial support
      • Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
      • The search service daemon (searchd) is pretty low on memory usage – and you can set limits on how much memory the indexer process uses too.
      • API for:
        • Java, PHP, Python, Ruby, Perl, C, and other languages.
      • Written in C++
      • Stats
        • 60+ MB/sec per server
        • 500+ queries/sec
        • Biggest known Sphinx cluster indexes 5 billion documents, resulting in over 6 TB of data. Busiest known one is, unsurprisingly, Craigslist, which serves 50+ million search queries/day.
      • Companies using Sphinx
    • Lucene
      • Developed by the Apache Software Foundation
      • Open source
      • Written in Java
      • Search types
        • ranked searching — best results returned first
        • many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
        • fielded searching (e.g., title, author, contents)
        • date-range searching
        • sorting by any field
        • multiple-index searching with merged results
        • allows simultaneous update and searching
      • Stats
        • over 95GB/hour on modern hardware
        • small RAM requirements — only 1MB heap
        • index size roughly 20-30% the size of text indexed
    • Solr
      • Lucene is a library, whereas Solr is a server built on Lucene that supports XML and REST
      • Benefits over Sphinx
        • Solr is easily embeddable in Java applications.
        • Solr can be integrated with Hadoop to build distributed applications
        • Solr can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can’t.
      • Companies using Solr
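
To make a few of the ideas above concrete (ranked searching, fielded searching, and giving specific fields higher weightings), here is a minimal Python sketch of a toy in-memory inverted index. It is not the Sphinx, Lucene, or Solr API; the names `TinyIndex`, `FIELD_WEIGHTS`, and `tokenize`, the 3:1 title-to-body weighting, and the simple TF-IDF style score are all assumptions made for illustration. Real engines use more sophisticated ranking and on-disk index structures.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical weights: a hit in the title counts three times as much
# as a hit in the body. Fields not listed here default to 1.0.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy in-memory inverted index with fielded, ranked search."""

    def __init__(self):
        # postings[field][term] maps doc_id -> occurrence count
        self.postings = defaultdict(lambda: defaultdict(Counter))
        self.doc_ids = set()

    def add(self, doc_id, fields):
        """Index a document given as a {field_name: text} mapping."""
        self.doc_ids.add(doc_id)
        for field, text in fields.items():
            for term in tokenize(text):
                self.postings[field][term][doc_id] += 1

    def search(self, query, field=None):
        """Return (doc_id, score) pairs, best match first.

        If `field` is given, only that field is searched; otherwise
        every field named in FIELD_WEIGHTS is searched.
        """
        fields = [field] if field else list(FIELD_WEIGHTS)
        scores = Counter()
        for f in fields:
            for term in tokenize(query):
                hits = self.postings[f].get(term, {})
                if not hits:
                    continue
                # Terms that appear in fewer documents score higher.
                idf = math.log(1 + len(self.doc_ids) / len(hits))
                for doc_id, tf in hits.items():
                    scores[doc_id] += FIELD_WEIGHTS.get(f, 1.0) * tf * idf
        # Highest score first; ties broken by doc_id for stable output.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    idx = TinyIndex()
    idx.add(1, {"title": "Full text search", "body": "Sphinx and Solr compared"})
    idx.add(2, {"title": "Magnet games", "body": "Just search for fun, then search again"})
    for doc_id, score in idx.search("search"):
        print(doc_id, round(score, 3))
```

Because of the title weighting, a document with a single title match can outrank one that mentions the term twice in its body, which is the same kind of tuning that per-field weights allow in a real engine.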