Semantic search has gained immense popularity in the wake of the recent Generative AI wave, and it is especially useful when we want to retrieve contextual information from a large corpus of documents. So, let us first understand what semantic search is.
What is Semantic Search?
Semantic search is an advanced search technique that uses the intent and contextual meaning behind a query to deliver highly relevant results, rather than just matching the search keywords. This approach lets users find what they are looking for even when the exact keywords do not appear in the results. As a result, semantic search improves the overall search experience by returning more targeted and useful information.
But why did we need to shift from traditional keyword search to these advanced techniques? Let us look at how search technology evolved over the years and the limitations of each stage.
Evolution of Search Techniques:
Keyword Search (1970s): Keyword search has been around for a long time; it was introduced around the 1970s, well before the invention of the World Wide Web. It works like the index of a book: all the words of a document are indexed, and results are delivered by simple keyword matching against your query.
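The book-index idea can be sketched as an inverted index, which maps each word to the documents that contain it. This is a minimal illustration rather than how any particular search engine is built; the function names and the three sample documents are made up for the example.

```python
from collections import defaultdict

PUNCTUATION = ".,!?;:\"'()"

def tokenize(text):
    # Lowercase, split on whitespace, and strip basic punctuation.
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def build_index(docs):
    # Map each word to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def keyword_search(index, query):
    # Return ids of documents that contain every query word (AND semantics).
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Illustrative documents (invented for this sketch).
docs = {
    1: "Snow fell on the mountain road.",
    2: "The road was closed because of cold weather.",
    3: "A warm day at the beach.",
}
index = build_index(docs)
print(keyword_search(index, "road"))       # documents 1 and 2
print(keyword_search(index, "cold road"))  # document 2 only
print(keyword_search(index, "frozen"))     # empty: the exact word never appears
```

The last query shows the limitation: document 1 is about snow, but a search for “frozen” finds nothing because matching is purely literal.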
Later, statistical algorithms like “TF-IDF” were introduced to improve search accuracy. TF-IDF weighs how often your search words occur in a document (term frequency) against how many documents in the collection contain those words (document frequency), so that words which are rare across the collection count for more than very common ones. However, users still had to create synonym libraries and rules, and use additional metadata or keywords, to improve accuracy.
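To make the TF-IDF idea concrete, here is a small sketch in plain Python. It uses a smoothed IDF formula, which is only one of several common formulations and is an assumption of this example; the sample documents and the function name are also made up.

```python
import math

def tf_idf_scores(docs, query):
    # Score each document against the query with a simple TF-IDF sum.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    query_terms = query.lower().split()
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        score = 0.0
        for term in query_terms:
            # Term frequency: share of this document's words equal to the term.
            tf = tokens.count(term) / len(tokens)
            # Document frequency: number of documents containing the term.
            df = sum(1 for other in tokenized if term in other)
            # Smoothed inverse document frequency (assumed variant).
            idf = math.log((1 + n_docs) / (1 + df)) + 1
            score += tf * idf
        scores.append(score)
    return scores

# Illustrative documents (invented for this sketch).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
scores = tf_idf_scores(docs, "cat")
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

Note that the third document scores zero even though it mentions “cats”: the plural form is a different string, which is exactly the gap the NLP techniques in the next section try to close.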
Introduction of NLP (beginning of 1980s): Statistical ranking was useful but not enough to understand related words accurately. For example, “snow” and “cold” are related even though the words themselves do not match. Likewise, singular/plural forms (e.g. woman, women) and present/past forms (go, went) do not match exactly, yet they are variants of the same word.
So there was a need for a search technology that could understand related words the way a human does, at least to some extent, and this is how Natural Language Processing (NLP) came into the picture to manage the complexity of language. Various techniques were developed for NLP, some of which are:
- Stemming – Stemming is an NLP technique that reduces words to a base form by stripping affixes, most commonly suffixes. For example, “extract”, “extraction” and “extracting” are all reduced to the root form “extract”.
- Lemmatization – Like stemming, lemmatization is a text pre-processing technique, but it maps a word to its dictionary base form (its lemma) using vocabulary and grammar rather than just trimming characters. For example, the word “better” would be converted to its lemma “good”.
- Part-of-speech tagging – Part-of-speech (PoS) tagging is an NLP technique that labels each word in a text as a noun, verb, adjective, etc. for more accurate query processing. Knowing each word’s grammatical role helps the system interpret the relationships between the words in a sentence.
- Entity extraction – Entity extraction is one of the most popular NLP techniques, especially for voice search. It detects the different entities in a query (places, people, dates, quantities, etc.) to understand the information inside it.
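The difference between the first two techniques is easy to see in a toy sketch. The suffix list and the lemma table below are tiny hand-written assumptions for illustration only; real stemmers and lemmatizers use far richer rules and dictionaries.

```python
# Suffixes to strip, longest first (a deliberately tiny, assumed rule set).
SUFFIXES = ("ing", "ion", "ed", "s")

def naive_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written lemma dictionary (assumed; covers only the article's examples).
LEMMAS = {"better": "good", "went": "go", "women": "woman"}

def naive_lemmatize(word):
    # Look the word up in the dictionary; fall back to the word itself.
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ("extract", "extraction", "extracting"):
    print(w, "->", naive_stem(w))

for w in ("better", "went", "women"):
    print(w, "->", naive_lemmatize(w), "| stem:", naive_stem(w))
```

Suffix stripping handles the regular “extract” family, but it leaves “better”, “went” and “women” unchanged; only the dictionary lookup maps those irregular forms to their base words.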
Knowledge Graphs (2005): A knowledge graph, also referred to as a semantic network, is a powerful representation of real-world entities, such as objects, events, situations, or concepts, and their intricate relationships. This valuable information is typically stored in a graph database and visualized as a structured graph, giving rise to the term “knowledge graph.”
Google Knowledge Graph, launched in 2012, is an integral part of Google’s Search Engine Results Pages (SERPs), offering relevant information tailored to people’s search queries. At launch it encompassed over 500 million interconnected objects, drawing data from various sources like Freebase, Wikipedia, the CIA World Factbook, and others.
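At its simplest, a knowledge graph can be pictured as a list of (subject, relationship, object) triples. The sketch below is only an in-memory illustration with a handful of hand-picked facts and hypothetical relationship names; it does not reflect how Google or any graph database stores its data.

```python
# Each triple is (subject, relationship, object). Relationship names are
# hypothetical labels chosen for this example.
triples = [
    ("Eiffel Tower", "located_in", "Paris"),
    ("Paris", "capital_of", "France"),
    ("France", "located_in", "Europe"),
]

def objects_of(subject, relationship):
    # Follow one edge type outward from a subject.
    return [o for s, r, o in triples if s == subject and r == relationship]

def related_entities(entity):
    # Every entity directly connected to the given one, in either direction.
    outgoing = {o for s, _, o in triples if s == entity}
    incoming = {s for s, _, o in triples if o == entity}
    return outgoing | incoming

# Two-hop question: the Eiffel Tower is in a city that is the capital of what?
city = objects_of("Eiffel Tower", "located_in")[0]
country = objects_of(city, "capital_of")[0]
print(city, "->", country)
print(sorted(related_entities("Paris")))
```

Chaining lookups like this is what lets a search engine answer a question about an entity instead of merely returning pages that mention it.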
AI Ranking (2007): AI ranking was the first search technique to incorporate Artificial Intelligence, using a variety of machine learning algorithms. It uses reinforcement learning to provide more relevant information to users; the idea is to leverage user feedback to steer the ranking toward better outcomes.
AI ranking was good at ordering search results, but it did not help identify the correct records in the first place, because it still relied on keyword matching. This fueled the need for vector search to overcome the limitations of keyword matching.
Vector Search (2013): Vector representation of text is a very old technique. Its roots go back to the 1950s, and there were several major advancements over the decades. The Transformer architecture, introduced by Google researchers in 2017, and transformer models such as BERT, released by Google in 2018, were a breakthrough in this space, and vector search started to become popular after this.
Vector search, commonly known as semantic search, is an advanced search technique that identifies related objects with similar characteristics. This innovative approach involves converting both the query and the text database into vectors, which represent the semantic meaning of the texts instead of relying solely on keywords. The search process revolves around finding vectors that exhibit similar semantic meaning.
For instance, the system can intelligently recognize the connection between words like “cold,” “snow,” and “frozen,” even when there is no exact keyword match. This capability effectively addresses the limitations associated with traditional keyword matching methods.
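Here is a minimal sketch of how “finding vectors with similar meaning” can work, using cosine similarity. The three-dimensional vectors are invented by hand for illustration and do not come from any real model; real embeddings typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (assumed values, not produced by a model).
# Imagined dimensions: [coldness, weather, food].
embeddings = {
    "snow":   [0.9, 0.8, 0.0],
    "frozen": [1.0, 0.5, 0.1],
    "pizza":  [0.0, 0.0, 1.0],
}

def vector_search(query_vector, embeddings, top_k=2):
    # Rank every stored vector by similarity to the query and keep the best.
    scored = [(cosine_similarity(query_vector, vec), word)
              for word, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_k]]

cold = [1.0, 0.5, 0.0]  # toy vector standing in for the query "cold"
print(vector_search(cold, embeddings))  # "frozen" and "snow" outrank "pizza"
```

No string in the index equals “cold”, yet the two weather-related words are returned because their vectors point in nearly the same direction as the query.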
Hybrid Search (2022 and future): In various scenarios, such as product searches on e-commerce sites, keyword-based search tends to produce better results than semantic search. For instance, when a user searches for “Nike” shoes, a keyword search will display only the “Nike” shoes. On the other hand, if we rely solely on semantic search, it may end up showing shoes from other brands as well, because the semantic meaning of all shoe products is very similar.
To address this limitation and optimize search results for various use-cases, there arises a necessity to merge both keyword and semantic search methodologies. This has paved the way for the development of hybrid search technology, where keyword and semantic search complement each other, working side by side to provide more accurate and relevant search outcomes.
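One common way to merge the two result lists is reciprocal rank fusion (RRF). The article does not name a specific merging method, so RRF is an assumption here, as are the product ids and the constant k=60 (a conventional default for this technique).

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Each ranking is a list of ids ordered best-first. An item earns
    # 1 / (k + rank) from every list it appears in; the sums are compared.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for the query "Nike shoes".
keyword_ranking = ["nike_air", "nike_run"]                            # exact brand match
semantic_ranking = ["adidas_run", "nike_run", "nike_air", "puma_run"]  # "shoe-like" items

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(fused)
```

Items that appear in both lists collect score from each, so the two Nike products rise above the other brands even though the semantic ranking alone put a different brand first.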
Semantic search has emerged as a highly popular choice across various industries, primarily due to the recent advancements in Generative AI and the introduction of transformer models like OpenAI GPT. Embedding models from this family, available through APIs such as OpenAI’s, make it easy to convert text into vectors. As a result, businesses are increasingly adopting this technology, applying it in diverse scenarios ranging from customer service chatbot applications to healthcare apps capable of generating responses from a collection of medical documents.
One of the key advantages of semantic search in these applications lies in its ability to mitigate the risk of providing incorrect or misleading information, which is called “hallucination” in the AI world. By restricting the answer to a specific set of retrieved documents, the chances of generating false answers are reduced. If the required information is not found within the referenced documents, the chatbot can gracefully respond by stating its inability to provide an answer.
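The “answer only from the documents, otherwise decline” behaviour can be sketched as a similarity threshold on the retrieved passages. The threshold of 0.75, the function names, and the sample passages are all hypothetical; in a real application the selected passages would be handed to a language model rather than returned directly.

```python
FALLBACK = "I could not find an answer in the provided documents."

def select_context(retrieved, min_score=0.75, top_k=2):
    # Keep only passages whose similarity to the query clears the threshold.
    # `retrieved` is a list of (passage_text, similarity_score) pairs.
    relevant = [(score, text) for text, score in retrieved if score >= min_score]
    relevant.sort(reverse=True)
    return [text for _, text in relevant[:top_k]]

def respond(retrieved, min_score=0.75):
    context = select_context(retrieved, min_score)
    if not context:
        # Nothing in the document set is close enough: decline to answer.
        return FALLBACK
    # Placeholder for the generation step (assumed, not implemented here).
    return "Based on the documents: " + " ".join(context)

# Hypothetical retrieval results for two different questions.
good_hits = [("The store opens at 9 am.", 0.91), ("Parking is free.", 0.40)]
weak_hits = [("Parking is free.", 0.35)]

print(respond(good_hits))
print(respond(weak_hits))
```

The right threshold depends on the embedding model and the data, so it would normally be tuned on real queries rather than fixed in advance.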