Google Cloud Search vs. Apache Solr

SADA Says | Cloud Computing Blog

By Chad Johnson | Director, Google Cloud Search

More and more organizations are realizing that an effective enterprise search solution can provide a myriad of benefits, including more informed decision-making, easier access to institutional knowledge, improved data governance, and better customer service and sales. However, not all enterprise search solutions are created equal. Google Cloud Search (GCS) and Apache Solr are both text-based search engines with a query interface for finding relevant information across structured and unstructured content, but they differ in several key ways. Here are the top six reasons why GCS is superior to Solr:

1. “Optimal” Settings

Most search engines have a wide variety of settings and tuning capabilities. For example, search engines allow you to adjust the way text is parsed and interpreted depending on its language or context. And they allow you to control or influence the way documents are scored and ranked.

This represents one of the first major differences between GCS and Solr. While both engines rely on very complex algorithms, GCS has built-in “optimal” settings for typical use, whereas Solr requires configuration and tuning in almost every respect. GCS makes useful assumptions about content and language models and implements a well-balanced scoring algorithm out of the box. Those models and algorithms are derived from Google’s extensive search engine experience on google.com and from diligent design and testing across a wide variety of situations.

Solr, by contrast, makes almost no assumptions. Everything is adjustable and there is no one-size-fits-all default mode. That means Solr can be tuned for many unique, specialized situations, but it also means that it requires tuning for even the “everyday” scenarios. If you start using Solr without a good understanding of all the possible settings and customizations, or without a strong background in search engine concepts and algorithms, you will likely get poor-quality results. Solr can deliver very high-quality results, but doing so requires significant expertise, the kind of expertise that Google has built into GCS by default.
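To make the tuning point concrete, here is a minimal sketch of the textbook BM25 scoring formula, the family of ranking function that recent Solr versions use by default. It is an illustration only, not Solr's actual implementation: the function name, the two sample documents and the corpus statistics are all made up for this example. It shows how a single setting (b, the length normalization) can flip which document ranks first.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term in one document.

    k1 controls how quickly repeated terms stop adding score;
    b controls how strongly long documents are penalized (0 = not at all).
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Hypothetical documents: a short memo with 2 matches, a long report with 4.
short_doc = dict(tf=2, doc_len=50, avg_doc_len=200, n_docs=10_000, doc_freq=100)
long_doc = dict(tf=4, doc_len=800, avg_doc_len=200, n_docs=10_000, doc_freq=100)

for b in (0.0, 0.75):
    short_score = bm25_term_score(**short_doc, b=b)
    long_score = bm25_term_score(**long_doc, b=b)
    winner = "short memo" if short_score > long_score else "long report"
    print(f"b={b}: short={short_score:.2f} long={long_score:.2f} -> {winner} ranks first")
```

With length normalization turned off, the long report wins on raw term count; with the common default of b=0.75, the short memo wins. Which answer is “right” depends on your content, and that is exactly the kind of decision Solr leaves to you.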

2. Scale and Maintenance

The second major area of difference is the software architecture and maintenance. Solr is highly scalable and can handle very large workloads, but it must be installed, configured and maintained by the customer. Solr supports multi-node installations for scale (called sharding), but they must be statically set up and managed. If your needs change in the future, it will require complex DevOps procedures to rebuild the nodes and shards. There is no automatic rebalancing or resizing. There are hosted versions of Solr available in the marketplace, but not with the scale and reliability of Google’s offering.
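A rough sketch of why resizing a statically sharded index is painful: if each document is assigned to a shard based on a hash of its ID, then changing the number of shards changes where most documents belong. The code below is a deliberately simplified illustration using hash-modulo routing. It is not how Solr routes documents internally, and the document IDs and shard counts are invented for the example.

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Assign a document to a shard using a stable hash of its ID (simplified)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical corpus of 1,000 document IDs.
doc_ids = [f"doc-{i}" for i in range(1000)]

# Count how many documents would live on a different shard after growing
# the cluster from 4 shards to 5.
moved = sum(1 for d in doc_ids if shard_for(d, 4) != shard_for(d, 5))
print(f"{moved} of {len(doc_ids)} documents change shards when going from 4 to 5 shards")
```

Under this simple scheme, the large majority of documents have to be moved or reindexed when a shard is added. With a managed service, that work is someone else's problem.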

GCS is a managed service that can accommodate customers with 100,000 documents just as easily as customers with 100,000,000 documents. Google has engineered GCS to accommodate billions of documents, and no end-user adjustments are required to achieve that scale. GCS performs very well across the entire range. There is never any need to manage your own backups, adjust nodes or shards, or tune any other architecture or performance settings.

3. Query Features

Solr is used across a wide variety of scenarios. Therefore, it has a very broad set of query features. Some customers use it to search large volumes of unstructured data in documents while others use it to analyze structured data, similar to SQL but with performance advantages. You will find more advanced features in the Solr query language, such as being able to specify proximity factors (nearness of terms), wildcard searches, fuzzy searches, and dynamic relevance boosts. Solr also supports queries across denormalized documents (joining) and spatial search on longitude and latitude data. The query parser is extensible, supporting very specialized queries, if you have the expertise to develop them.
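For a feel of what those advanced features look like in practice, here is a small sketch that builds query strings in the syntax of Solr's standard query parser: a proximity phrase, a fuzzy term, a wildcard and a boosted clause. The helper names, the `title` field, and the host and collection in the URL are assumptions made up for this example; only the operators themselves (`~`, `*`, `^`) come from the query syntax.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host and collection name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/docs/select"

def proximity(phrase: str, slop: int) -> str:
    """Phrase whose terms must appear within `slop` positions of each other."""
    return f'"{phrase}"~{slop}'

def fuzzy(term: str, max_edits: int = 2) -> str:
    """Term that also matches words up to `max_edits` character edits away."""
    return f"{term}~{max_edits}"

def boost(clause: str, factor: float) -> str:
    """Clause whose matches count `factor` times as much toward the score."""
    return f"{clause}^{factor}"

def select_url(query: str, rows: int = 10) -> str:
    """URL-encode the query into a request against the hypothetical endpoint."""
    return f"{SOLR_SELECT_URL}?{urlencode({'q': query, 'rows': rows})}"

query = " OR ".join([
    proximity("enterprise search", 3),  # "enterprise search"~3
    fuzzy("serch", 1),                  # serch~1 (tolerates the typo)
    "govern*",                          # wildcard: governance, governing, ...
    boost("title:cloud", 2.0),          # title:cloud^2.0
])
print(query)
print(select_url(query))
```

Every one of those operators is something the person writing the query, or the application in front of Solr, has to choose and tune by hand.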

GCS, on the other hand, implements a simplified query protocol that is very similar to the search interface on google.com. Google keeps the search interface simple and uses machine learning and natural language processing under the hood to derive additional intent from the query. For example, while Solr lets you adjust proximity and fuzziness manually, GCS applies those adjustments for you based on data-driven analytics and feedback loops. If GCS believes that your query will deliver more relevant results with more or less fuzziness, it makes that change automatically. Google does not reveal all the adjustments it can make, but we have observed adjustments to synonym expansion, fuzziness, proximity, filters and ranking, all derived from basic keyword and phrase searches.

4. Indexing Features

Solr and GCS both support structured and unstructured data. Both search engines support indexing via API or wrappers written in popular programming languages. GCS will natively accept binary file formats (like PDF or Word), whereas with Solr binary files must first be converted to text, typically through an extraction step such as Apache Tika.

GCS supports row-level security natively, meaning that every record in the index can have different access control permissions. Solr does not support any native security trimming. In highly secure environments, Solr can represent a security gap, because it relies on external applications or metadata filters to control access to sensitive content. GCS enforces security permissions at the record level regardless of how the data is accessed. There is no back door around the permissions.
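The “metadata filter” approach to security trimming can be sketched as follows: the application looks up the user's groups and attaches a filter query (`fq`) to every search so that only documents tagged with one of those groups come back. This is a minimal sketch under stated assumptions. The `acl_groups` field and the helper names are hypothetical, and a real deployment would need its own way of resolving group membership.

```python
def quote_term(value: str) -> str:
    """Quote a value so it is treated as a single literal term in the query."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def acl_filter(user_groups: list[str]) -> str:
    """Build a filter query limiting results to the user's groups.

    Assumes each indexed document carries a hypothetical multi-valued
    `acl_groups` field listing the groups allowed to see it.
    """
    if not user_groups:
        # Deny by default: never run an unfiltered search for an unknown user.
        raise ValueError("user has no groups; refusing to build a search request")
    return "acl_groups:(" + " OR ".join(quote_term(g) for g in user_groups) + ")"

def secured_params(user_query: str, user_groups: list[str]) -> dict:
    """Request parameters with the access filter attached alongside the query."""
    return {"q": user_query, "fq": acl_filter(user_groups)}

print(secured_params("quarterly forecast", ["sales", "finance"]))
```

The weakness is visible in the code itself: the filter only protects content when the application remembers to add it. Any script or tool that queries the index directly, without going through `secured_params`, sees everything.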

5. Performance

In general, Solr and GCS are both capable of operating in high-performance environments. Because it is a managed service, GCS performs within designed limits under almost all conditions. If those limits are exceeded, there are no user-serviceable adjustments to make; Google resolves the issue.

Solr also performs very well when appropriately sized and tuned. However, Solr query performance is known to degrade during periods of heavy indexing activity or configuration changes. GCS does not have any downtime during indexing activity or configuration changes.

6. Google Innovations

GCS is a SaaS solution that will continue to evolve with new features and capabilities over time. New features will be available to both new and existing customers, many with no intervention required. For example, during the first year, Cloud Search added:

  • Built-in optical character recognition (at no additional charge)
  • Advanced spell-checking and correction
  • Dynamic query expansion (with google.com derived synonyms for common terms and concepts)
  • Natural language query interpretation
  • Wildcard search
  • Content-aware, secure type-ahead suggestions

Over the next year, the roadmap includes innovative features such as “answer” cards, knowledge graphs and ontologies, additional optimizations to ranking and relevance, increased understanding of natural language in queries, people search and more.
