A Study In Japanese Language Search
Disclaimer: I am a beginner in the Japanese language and am not familiar with how such features are actually implemented in production.
Whenever I interact with Japanese websites, one thing usually catches my attention: the search functionality. Inevitably, I start wondering how it is implemented. For context, the Japanese language uses three sets of characters: hiragana, katakana, and kanji. So how does search work across these different character sets?
*There’s a simple demo at the end to apply what I’ve learnt.
Yomi Data
Although each is used in different contexts, they share one characteristic: the phonetic reading. For example, 晩餐歌 is read as ばんさんか. In search, the phonetic reading is often used instead of the kanji themselves. Hence, rather than matching characters, matching the phonetic reading 「読み」(yomi) is much more useful. This is especially true in cases where the same kanji has multiple readings.
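As a small illustration of matching by reading: analyzers typically return readings in katakana, so one common trick is to normalize katakana to hiragana before comparing against a hiragana query. A minimal sketch in JavaScript (the function name is mine; the conversion simply shifts code points between the two kana blocks, which sit a fixed offset apart in Unicode):

```javascript
// Readings from a morphological analyzer are often returned in katakana.
// To compare them against hiragana input, shift each katakana code point
// (U+30A1..U+30F6) down by 0x60 into the hiragana block.
function kataToHira(text) {
  return text.replace(/[\u30a1-\u30f6]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0x60)
  );
}

// "バンサンカ" (katakana) normalizes to "ばんさんか" (hiragana),
// so a hiragana query can match a katakana reading.
console.log(kataToHira("バンサンカ"));
```

With both sides normalized this way, the query ばんさんか matches the reading of 晩餐歌 regardless of which kana script either is written in.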
Morphological Analysis
Looking into this topic, I discovered that the difficulty of this problem comes from deciphering the words in the input. Unlike English, Japanese does not use spaces to delimit words; words are identified by structural context instead. For example, 「私を雇ってください」 ("please hire me") → 「私 を 雇って ください」. This is where Morphological Analysis comes in.
Morphological Analysis is a complex topic and a technique used in Natural Language Processing. Unfortunately, it is not something I am able to implement myself right now. Thankfully, there are several open-source Morphological Analyzers we can use — e.g. Kuromoji, MeCab.
Generally, a Morphological Analyzer first splits the text into tokens (morphemes); there are various strategies for this, such as different levels of segmentation granularity. Using its dictionaries, the analyzer then retrieves information about each token, such as the Reading (読み) mentioned earlier.
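To get an intuition for dictionary-driven segmentation, here is a toy greedy longest-match tokenizer over a hand-made dictionary. Real analyzers like Kuromoji and MeCab are far more sophisticated: they build a lattice of all candidate segmentations and pick the lowest-cost path, so treat this purely as a sketch of the idea, with a dictionary I made up for the example sentence:

```javascript
// A toy dictionary mapping surface forms to readings. Real analyzer
// dictionaries have hundreds of thousands of entries plus cost data.
const DICT = {
  "私": "わたし",
  "を": "を",
  "雇って": "やとって",
  "ください": "ください",
};

// Greedy longest-match segmentation: at each position, take the longest
// dictionary entry that matches. (Kuromoji/MeCab instead build a lattice
// of candidate segmentations and choose the lowest-cost path.)
function tokenize(text) {
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    let match = null;
    for (let len = Math.min(8, text.length - i); len > 0; len--) {
      const candidate = text.slice(i, i + len);
      if (candidate in DICT) { match = candidate; break; }
    }
    if (match) {
      tokens.push({ surface: match, reading: DICT[match] });
      i += match.length;
    } else {
      tokens.push({ surface: text[i], reading: null }); // unknown character
      i += 1;
    }
  }
  return tokens;
}

// Segments 「私を雇ってください」 into 私 / を / 雇って / ください
console.log(tokenize("私を雇ってください").map((t) => t.surface));
```

Greedy matching falls apart quickly on real text (which is exactly why the lattice-and-cost approach exists), but it shows how a dictionary turns an unspaced string into tokens carrying readings.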
Full-Text Search
Full-Text Search is a technique that lets us efficiently search through large collections of text. There are several approaches to it, each with different benefits and levels of relevance in the results. This is how text search is performed on English texts, so we can adapt the same approach for Japanese using the knowledge covered earlier. In Japanese, Full-Text Search is 全文検索.
I am keeping this section short as the focus is on what makes Japanese unique in the text search problem. In the demo below, I have noted the aspects of full-text search I incorporated.
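One of the most common building blocks of full-text search is the inverted index: a map from each term to the documents containing it. A minimal sketch, assuming the documents have already been tokenized (by the morphological analyzer, in the Japanese case):

```javascript
// Minimal inverted index: each term maps to the set of document ids
// that contain it.
function buildIndex(docs) {
  const index = new Map();
  docs.forEach((terms, docId) => {
    for (const term of terms) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(docId);
    }
  });
  return index;
}

// Lookup is a single hash retrieval per query term instead of a scan
// over every document.
function search(index, term) {
  return [...(index.get(term) ?? [])];
}

const index = buildIndex([
  ["私", "を", "雇って", "ください"], // doc 0
  ["歌", "を", "歌う"],               // doc 1
]);
console.log(search(index, "を"));   // both documents contain を
console.log(search(index, "歌う")); // only doc 1
```

The key property is that query cost depends on the number of matching documents, not the total amount of indexed text.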
Search Demo
Tech stack for this demo — Vite + Tailwind CSS + Shadcn + (Framer) Motion. It’s a bit overkill for this but it’s for practice.
A simple attempt at implementing search for Japanese-language text. This demo visualizes the tokenization and part-of-speech identification of your search input, which is then used to filter the relevant sentences below.
[Using Morphological Analyzer] The Kuromoji morphological analyzer mentioned earlier is written in Java. However, there is a JavaScript port of Kuromoji written by takuyaa. I will be using a fork by aiktb as it provides additional conveniences.
[Implementing Full-Text Search] In this project, I’ve implemented some basic Full-Text Search techniques to improve search performance.
- Stemming — reducing words to their root form, using the data provided by the Morphological Analyzer
- Stop word removal — removing words that carry little meaning, using the stopwords-ja list for filtering
- Inverted indexing — mapping terms to the texts that contain them for quick retrieval
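The stemming and stop-word steps above can be sketched as follows. The token objects mimic the shape kuromoji.js returns (`surface_form`, `basic_form`, `pos`); the sample analysis of 「映画を見ました」 is hand-written for illustration, and the tiny stop list is a stand-in, not the real stopwords-ja list:

```javascript
// Token objects shaped like kuromoji.js output: `basic_form` is the
// dictionary (root) form, which gives us stemming essentially for free.
const tokens = [
  { surface_form: "映画", basic_form: "映画", pos: "名詞" },
  { surface_form: "を",   basic_form: "を",   pos: "助詞" },
  { surface_form: "見",   basic_form: "見る", pos: "動詞" },
  { surface_form: "まし", basic_form: "ます", pos: "助動詞" },
  { surface_form: "た",   basic_form: "た",   pos: "助動詞" },
];

// A tiny stand-in for a stop word list (the real stopwords-ja list is
// much larger and should be tailored to the project).
const STOP_WORDS = new Set(["を", "ます", "た"]);

// Stemming: take the basic_form of each token.
// Stop word removal: drop any term found in the stop list.
function toSearchTerms(tokens) {
  return tokens
    .map((t) => t.basic_form)
    .filter((term) => !STOP_WORDS.has(term));
}

// The inflected 見/まし/た collapses so only 映画 and 見る remain,
// which are the terms worth putting into the inverted index.
console.log(toSearchTerms(tokens));
```

Because both documents and queries pass through the same pipeline, a query containing 見ます can match a sentence containing 見ました: both reduce to the root 見る.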
Final Reflection
This implementation brings the benefit of searching by the meaning of words instead of just pattern matching. However, Morphological Analysis is not a catch-all, and real systems likely adopt additional strategies. Some of the issues I’ve noticed are:
- The search sometimes seems inaccurate: a sentence contains the same character as the query yet is not matched, because the morphological analysis interpreted that character as part of a different word.
- When searching with kana (hiragana, katakana), there is sometimes insufficient context for the analyzer to tokenize the words correctly.
To make the search feel more intuitive, I’ve added an option to search wider, checking for parts of words within the tokens. Unfortunately, this mode cannot use constant-time hash-table retrievals the way the standard mode does.
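The wider mode can be sketched as a substring scan over the indexed terms. Unlike an exact lookup, it has to visit every term in the index, which is why it loses the hash-table speedup (this is my own sketch of the idea, not the demo's actual code):

```javascript
// Standard mode: exact term lookup, a single O(1) hash retrieval.
// Wider mode: scan every indexed term and match substrings, so a query
// like 「雇」 can hit the token 「雇って」. Cost grows with index size.
function widerSearch(index, query) {
  const hits = new Set();
  for (const [term, docIds] of index) {
    if (term.includes(query)) {
      for (const id of docIds) hits.add(id);
    }
  }
  return [...hits];
}

// A small hand-built index for illustration: term -> set of doc ids.
const index = new Map([
  ["雇って", new Set([0])],
  ["ください", new Set([0])],
  ["歌う", new Set([1])],
]);
console.log(widerSearch(index, "雇")); // matches inside 雇って
```

A common way to get substring matching back to index speed is to index character n-grams alongside the analyzer's tokens, at the cost of a larger index.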
For the stop word removal step, it is better in real projects to use a custom list rather than pulling a generic one. Different use cases require different levels of filtering; it is not one-size-fits-all. In this project, I removed 「私」 from the filter.
When I first started learning about this topic, I had intended to build a simple demo where I search directly through the results of the morphological analyzer. However, learning about it led me to topics like Full-Text Search and database indexing, so I thought it was a good opportunity to apply them for a more complete text search simulation.
This topic is complex in the technical sense, but it also requires strong knowledge of the Japanese language, which I lack. Life is continuous learning, so perhaps I will revisit this topic further down the line.