Episode 117: Full Text Search


Add enterprise-level search to your site.

News and Follow/Ups – 01:00

Geek Tools – 14:13

Yikerz! – Super fun magnet game

Webapps – 16:12

Surfboard – Flipboard as a web app
InstaLyrics – Find lyrics quickly

Full Text Search – 22:11

Options
- Google Custom Search
  - Commercial
  - Benefits
    - Super fast to set up
    - Easy to implement
    - Ability to add AdSense to search results
  - Downsides
    - Cannot adjust content ranking or do custom integration
    - Mainly just indexes HTML pages, not search queries or other text
- Sphinx
  - “Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.”
  - Open source with commercial support
  - Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
  - The search service daemon (searchd) is pretty low on memory usage – and you can set limits on how much memory the indexer process uses too.
  - API for:
    - Java, PHP, Python, Ruby, Perl, C, and other languages.
  - Written in C++
  - Stats
    - 60+ MB/sec per server
    - 500+ queries/sec
    - The biggest known Sphinx cluster indexes 5 billion documents, resulting in over 6 TB of data. The busiest known one is, unsurprisingly, Craigslist, which serves 50+ million search queries/day.
  - Companies using Sphinx
    - Craigslist
    - Slashdot
    - Mozilla
    - WordPress.org
- Lucene
  - Developed by the Apache Software Foundation
  - Open source
  - Written in Java
  - Search types
    - ranked searching — best results returned first
    - many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
    - fielded searching (e.g., title, author, contents)
    - date-range searching
    - sorting by any field
    - multiple-index searching with merged results
    - allows simultaneous update and searching
  - Stats
    - over 95GB/hour on modern hardware
    - small RAM requirements — only 1MB heap
    - index size roughly 20-30% the size of text indexed
- Solr
  - Lucene is a library, whereas Solr is a search server built on Lucene that supports XML and REST
  - Benefits over Sphinx
    - Solr is easily embeddable in Java applications.
    - Solr can be integrated with Hadoop to build distributed applications.
    - Solr can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can’t.
  - Companies using Solr
    - eHarmony
    - Ticketmaster
    - Digg
    - AOL
    - Zappos
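
To make the Lucene query types and Solr's HTTP interface a little more concrete, here is a rough Python sketch that builds Solr select URLs for a few of the search types listed above. It only constructs the URLs and does not contact a server. The host, the core name "articles", and the field names (title, author, contents, published) are all made up for illustration; the q, rows, fl, sort, and wt parameters and the query syntax follow the standard Solr/Lucene query parser, but check the documentation for your version before relying on any of it.

```python
from urllib.parse import urlencode

# Assumption for illustration: a local Solr instance with a core named "articles".
SOLR_SELECT_URL = "http://localhost:8983/solr/articles/select"

# Lucene query parser syntax for several of the search types listed above.
# All field names here are hypothetical.
EXAMPLE_QUERIES = {
    "fielded": "title:search AND author:smith",
    "phrase": 'contents:"full text search"',
    "wildcard": "title:sph*",
    "proximity": 'contents:"sphinx lucene"~5',  # terms within 5 positions
    "range": "published:[2011-01-01T00:00:00Z TO 2011-04-19T00:00:00Z]",
}


def build_select_url(query, rows=10, fields="id,title,score", sort=None):
    """Return a Solr select URL for the given Lucene query string."""
    params = {
        "q": query,      # the Lucene query
        "rows": rows,    # maximum number of results to return
        "fl": fields,    # fields to include in each result
        "wt": "xml",     # response format (XML, as mentioned above)
    }
    if sort:
        params["sort"] = sort  # e.g. "published desc"; default is by relevance
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    for name, query in EXAMPLE_QUERIES.items():
        print(name, "->", build_select_url(query))
```

Leaving sort unset keeps the default relevance ranking (best results first); passing something like sort="published desc" corresponds to the "sorting by any field" item above.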

Tags: aol, craigslist, digg, eharmony, google, lucene, mozilla, slashdot, solr, sphinx, ticketmaster, Wordpress, zappos

This entry was posted on Tuesday, April 19th, 2011 at 3:00 am.

2 Responses to “Episode 117: Full Text Search”

Daniel Gafitescu Says:
April 19th, 2011 at 6:30 am
Thanks for this, already looking for Sphinx integration.
Jade Robbins Says:
April 20th, 2011 at 10:08 am
Glad to help! Keep us in the loop if you find anything interesting!
