A Study In Japanese Language Search
Disclaimer: I am a beginner in the Japanese language and am not familiar with how such features are actually implemented in production.
Whenever I interact with Japanese websites, one thing usually catches my attention: the search functionality. Inevitably, I start wondering how it is implemented. For context, the Japanese language uses three sets of characters: hiragana, katakana, and kanji. So how does search work across these different character sets?
*There’s a simple demo at the end to apply what I’ve learnt.
Yomi Data
Although each is used in different contexts, they share one characteristic: the phonetic reading. For example, 晩餐歌 is read as ばんさんか. In search, the phonetic reading is often used instead of the kanji themselves. Hence, rather than matching characters, matching the phonetic reading 「読み」(yomi) is much more useful. This is especially true in cases where the same kanji has multiple readings.
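As a small illustration of matching by reading: analyzers typically return readings in katakana, so one common trick is to normalize katakana to hiragana before comparing against a hiragana query. A minimal sketch in JavaScript (the function name is mine; the conversion simply shifts code points between the two kana blocks, which sit a fixed offset apart in Unicode):

```javascript
// Readings from a morphological analyzer are often returned in katakana.
// To compare them against hiragana input, shift each katakana code point
// (U+30A1..U+30F6) down by 0x60 into the hiragana block.
function kataToHira(text) {
  return text.replace(/[\u30a1-\u30f6]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0x60)
  );
}

// "バンサンカ" (katakana) normalizes to "ばんさんか" (hiragana),
// so a hiragana query can match a katakana reading.
console.log(kataToHira("バンサンカ"));
```

With both sides normalized this way, the query ばんさんか matches the reading of 晩餐歌 regardless of which kana script either is written in.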
Morphological Analysis
Looking into this topic, I discovered that the difficulty of this problem comes from deciphering the words in the input. Unlike English, Japanese does not use spaces to delimit words; words are identified by structural context instead. For example, 「私を雇ってください」 ("please hire me") → 「私 を 雇って ください」. This is where Morphological Analysis comes in.
Morphological Analysis is a complex topic and a technique used in Natural Language Processing. Unfortunately, it is not something I am able to implement myself right now. Thankfully, there are several open-source Morphological Analyzers we can use — e.g. Kuromoji, MeCab.
Generally, a Morphological Analyzer first splits the text into tokens (morphemes); there are various strategies for this, such as different levels of segmentation granularity. Using its dictionaries, the analyzer then retrieves information about each token, such as the Reading (読み) mentioned earlier.
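To get an intuition for dictionary-driven segmentation, here is a toy greedy longest-match tokenizer over a hand-made dictionary. Real analyzers like Kuromoji and MeCab are far more sophisticated: they build a lattice of all candidate segmentations and pick the lowest-cost path, so treat this purely as a sketch of the idea, with a dictionary I made up for the example sentence:

```javascript
// A toy dictionary mapping surface forms to readings. Real analyzer
// dictionaries have hundreds of thousands of entries plus cost data.
const DICT = {
  "私": "わたし",
  "を": "を",
  "雇って": "やとって",
  "ください": "ください",
};

// Greedy longest-match segmentation: at each position, take the longest
// dictionary entry that matches. (Kuromoji/MeCab instead build a lattice
// of candidate segmentations and choose the lowest-cost path.)
function tokenize(text) {
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    let match = null;
    for (let len = Math.min(8, text.length - i); len > 0; len--) {
      const candidate = text.slice(i, i + len);
      if (candidate in DICT) { match = candidate; break; }
    }
    if (match) {
      tokens.push({ surface: match, reading: DICT[match] });
      i += match.length;
    } else {
      tokens.push({ surface: text[i], reading: null }); // unknown character
      i += 1;
    }
  }
  return tokens;
}

// Segments 「私を雇ってください」 into 私 / を / 雇って / ください
console.log(tokenize("私を雇ってください").map((t) => t.surface));
```

Greedy matching falls apart quickly on real text (which is exactly why the lattice-and-cost approach exists), but it shows how a dictionary turns an unspaced string into tokens carrying readings.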
Full-Text Search
Full-Text Search is a technique that lets us efficiently search through large collections of text. There are several approaches to it, each with different benefits and levels of relevance in the results. This is how text search is performed on English texts, so we can adapt the same approach for Japanese using the knowledge covered earlier. In Japanese, Full-Text Search is 全文検索.
I am keeping this section short as the focus is on what makes Japanese unique in the text search problem. In the demo below, I have noted the aspects of full-text search I incorporated.
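One of the most common building blocks of full-text search is the inverted index: a map from each term to the documents containing it. A minimal sketch, assuming the documents have already been tokenized (by the morphological analyzer, in the Japanese case):

```javascript
// Minimal inverted index: each term maps to the set of document ids
// that contain it.
function buildIndex(docs) {
  const index = new Map();
  docs.forEach((terms, docId) => {
    for (const term of terms) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(docId);
    }
  });
  return index;
}

// Lookup is a single hash retrieval per query term instead of a scan
// over every document.
function search(index, term) {
  return [...(index.get(term) ?? [])];
}

const index = buildIndex([
  ["私", "を", "雇って", "ください"], // doc 0
  ["歌", "を", "歌う"],               // doc 1
]);
console.log(search(index, "を"));   // both documents contain を
console.log(search(index, "歌う")); // only doc 1
```

The key property is that query cost depends on the number of matching documents, not the total amount of indexed text.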
Search Demo
Tech stack for this demo — Vite + Tailwind CSS + Shadcn + (Framer) Motion. It’s a bit overkill for this but it’s for practice.
A simple attempt at implementing search for Japanese-language text. This demo visualizes the tokenization and part-of-speech identification of your search input, which is then used to filter the relevant sentences below.
[Using Morphological Analyzer] The Kuromoji morphological analyzer mentioned earlier is written in Java. However, there is a JavaScript port of Kuromoji written by takuyaa. I will be using a fork by aiktb as it provides additional conveniences.
[Implementing Full-Text Search] In this project, I’ve implemented some basic Full-Text Search techniques to improve search performance.
- Stemming — reducing words to their root form, using the data provided by the Morphological Analyzer
- Stop word removal — removing words that carry little meaning, using the stopwords-ja list for filtering
- Inverted indexing — mapping terms to the texts that contain them for quick retrieval
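The stemming and stop-word steps above can be sketched as follows. The token objects mimic the shape kuromoji.js returns (`surface_form`, `basic_form`, `pos`); the sample analysis of 「映画を見ました」 is hand-written for illustration, and the tiny stop list is a stand-in, not the real stopwords-ja list:

```javascript
// Token objects shaped like kuromoji.js output: `basic_form` is the
// dictionary (root) form, which gives us stemming essentially for free.
const tokens = [
  { surface_form: "映画", basic_form: "映画", pos: "名詞" },
  { surface_form: "を",   basic_form: "を",   pos: "助詞" },
  { surface_form: "見",   basic_form: "見る", pos: "動詞" },
  { surface_form: "まし", basic_form: "ます", pos: "助動詞" },
  { surface_form: "た",   basic_form: "た",   pos: "助動詞" },
];

// A tiny stand-in for a stop word list (the real stopwords-ja list is
// much larger and should be tailored to the project).
const STOP_WORDS = new Set(["を", "ます", "た"]);

// Stemming: take the basic_form of each token.
// Stop word removal: drop any term found in the stop list.
function toSearchTerms(tokens) {
  return tokens
    .map((t) => t.basic_form)
    .filter((term) => !STOP_WORDS.has(term));
}

// The inflected 見/まし/た collapses so only 映画 and 見る remain,
// which are the terms worth putting into the inverted index.
console.log(toSearchTerms(tokens));
```

Because both documents and queries pass through the same pipeline, a query containing 見ます can match a sentence containing 見ました: both reduce to the root 見る.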
Final Reflection
This implementation brings the benefit of searching by the meaning of words instead of just pattern matching. However, Morphological Analysis is not a catch-all, and real systems likely adopt additional strategies. Some of the issues I’ve noticed are:
- The search sometimes seems inaccurate: a sentence contains the same character as the query yet is not matched, because the morphological analysis interpreted that character as part of a different word.
- When searching with kana (hiragana, katakana), there is sometimes insufficient context for the analyzer to tokenize the words correctly.
To make the search feel more intuitive, I’ve added an option to search wider, checking for parts of words within the tokens. Unfortunately, this mode cannot use constant-time hash-table retrievals the way the standard mode does.
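The wider mode can be sketched as a substring scan over the indexed terms. Unlike an exact lookup, it has to visit every term in the index, which is why it loses the hash-table speedup (this is my own sketch of the idea, not the demo's actual code):

```javascript
// Standard mode: exact term lookup, a single O(1) hash retrieval.
// Wider mode: scan every indexed term and match substrings, so a query
// like 「雇」 can hit the token 「雇って」. Cost grows with index size.
function widerSearch(index, query) {
  const hits = new Set();
  for (const [term, docIds] of index) {
    if (term.includes(query)) {
      for (const id of docIds) hits.add(id);
    }
  }
  return [...hits];
}

// A small hand-built index for illustration: term -> set of doc ids.
const index = new Map([
  ["雇って", new Set([0])],
  ["ください", new Set([0])],
  ["歌う", new Set([1])],
]);
console.log(widerSearch(index, "雇")); // matches inside 雇って
```

A common way to get substring matching back to index speed is to index character n-grams alongside the analyzer's tokens, at the cost of a larger index.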
For the stop word removal step, it is better in real projects to use a custom list rather than pulling a generic one. Different use cases require different levels of filtering; it is not one-size-fits-all. In this project, I removed 「私」 from the filter.
When I first started learning about this topic, I had intended to build a simple demo where I search directly through the results of the morphological analyzer. However, learning about it led me to topics like Full-Text Search and database indexing, so I thought it was a good opportunity to apply them for a more complete text search simulation.
This topic is complex in the technical sense, but it also requires strong knowledge of the Japanese language, which I lack. Life is continuous learning, so perhaps I will revisit this topic further down the line.