Apr 19, 2011
Episode 117: Full Text Search
Add enterprise-level search to your site.
News and Follow-ups – 01:00
Geek Tools – 14:13
- Yikerz! – Super fun magnet game
Webapps – 16:12
- Surfboard – Flipboard as a web app
- InstaLyrics – Find lyrics quickly
Full Text Search – 22:11
- Options
- Google Custom Search
- Commercial
- Benefits
- Super fast to set up
- Easy to implement
- Ability to add AdSense to search results
- Downsides
- Cannot adjust content ranking or do custom integration
- Mainly just for indexing HTML pages, not search queries and other text.
- Sphinx
- “Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.”
- Open source with commercial support
- Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
- The search service daemon (searchd) is pretty low on memory usage – and you can set limits on how much memory the indexer process uses too.
- API for:
- Java, PHP, Python, Ruby, Perl, C, and other languages.
- Written in C++
- Stats
- 60+ MB/sec per server
- 500+ queries/sec
- Biggest known Sphinx cluster indexes 5 billion documents, resulting in over 6 TB of data. Busiest known one is, unsurprisingly, Craigslist, which serves 50+ million search queries/day.
- Companies using Sphinx
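
To make the SphinxQL and field-weighting points above more concrete, here is a minimal sketch that only builds the query text. It assumes a hypothetical index named `articles` with `title` and `body` fields (neither comes from the episode). The resulting string would be sent to searchd through an ordinary MySQL client, since SphinxQL speaks the MySQL wire protocol; nothing is executed here.

```python
def build_sphinxql(index, query, field_weights=None, limit=10):
    """Return a SphinxQL SELECT statement as a string (nothing is executed)."""
    # Naive escaping of backslashes and single quotes for the string literal.
    # A real application should rely on its client library's escaping instead.
    escaped = query.replace("\\", "\\\\").replace("'", "\\'")
    sql = f"SELECT * FROM {index} WHERE MATCH('{escaped}') LIMIT {int(limit)}"
    if field_weights:
        # Higher weight means matches in that field count more toward relevance.
        weights = ", ".join(f"{name}={int(w)}" for name, w in field_weights.items())
        sql += f" OPTION field_weights=({weights})"
    return sql


# Hypothetical index and fields: title matches weigh ten times body matches.
example = build_sphinxql("articles", "full text search", {"title": 10, "body": 1})
print(example)
```

With no explicit ORDER BY, results come back in relevance order, which matches the "relevance ranking is the default" note above.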
- Lucene
- Developed by the Apache Software Foundation
- Open source
- Written in Java
- Search types
- ranked searching — best results returned first
- many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
- fielded searching (e.g., title, author, contents)
- date-range searching
- sorting by any field
- multiple-index searching with merged results
- allows simultaneous update and searching
- Stats
- over 95GB/hour on modern hardware
- small RAM requirements — only 1MB heap
- index size roughly 20-30% the size of text indexed
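
The "ranked" and "fielded" search types listed above can be illustrated with a toy inverted index. This is not Lucene's API or its scoring formula, only a pure-Python sketch of the idea: each (field, term) pair maps to the documents containing it, and results are ranked by raw term frequency. The class name `TinyIndex` and the sample documents are made up for illustration.

```python
from collections import Counter, defaultdict


class TinyIndex:
    """Toy fielded inverted index; ranks by summed term frequency."""

    def __init__(self):
        # postings[(field, term)] maps doc_id -> number of occurrences
        self.postings = defaultdict(Counter)

    def add(self, doc_id, **fields):
        # Index each field separately so queries can target one field.
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[(field, term)][doc_id] += 1

    def search(self, field, query):
        # Sum term frequencies per document, then return best matches first.
        scores = Counter()
        for term in query.lower().split():
            for doc_id, tf in self.postings.get((field, term), {}).items():
                scores[doc_id] += tf
        return [doc_id for doc_id, _ in scores.most_common()]


idx = TinyIndex()
idx.add(1, title="Full text search", contents="search engines index text")
idx.add(2, title="Cooking pasta", contents="boil water then search the cupboard for salt")
idx.add(3, title="Search search search", contents="nothing here")

print(idx.search("title", "search"))          # [3, 1]
print(idx.search("contents", "search text"))  # [1, 2]
```

A real engine adds much more on top of this (stemming, phrase and proximity matching, better relevance scoring), but the index-then-rank structure is the same basic shape.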
- Solr
- Lucene is a library, whereas Solr is a search server built on Lucene that you talk to over HTTP (XML, REST-style requests)
- Benefits over Sphinx
- Solr is easily embeddable in Java applications.
- Solr can be integrated with Hadoop to build distributed applications
- Solr can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can’t.
- Companies using Solr
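
Because Solr is a server rather than a library, a search is just an HTTP request. The sketch below only builds the request URL and does not contact anything. The base URL is an assumption (a local default install; newer Solr versions include a core or collection name in the path), and the helper name `solr_select_url` is hypothetical. The `q`, `rows`, and `wt` parameters are standard Solr query parameters.

```python
from urllib.parse import urlencode


def solr_select_url(query, rows=10, base="http://localhost:8983/solr/select"):
    """Build a Solr select URL; `base` is an assumed local install."""
    # q = the query (here in field:term form), rows = page size,
    # wt = response writer (JSON instead of the default XML).
    params = {"q": query, "rows": rows, "wt": "json"}
    return base + "?" + urlencode(params)


url = solr_select_url("title:search")
print(url)
```

The URL could then be fetched with any HTTP client, which is what makes Solr easy to call from languages other than Java.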