# Aozora Bunko jukugo frequency by kanji

Here you can find, for each kanji, the compound words (or jukugo 熟語) in which it appears in the Aozora Bunko digital library collection of texts, and the frequency of each of these words.

► Download data for Mac/Unix/Linux (.tar.bz2, UTF-8 text files with LF line endings)
► Download data for Windows (.7z, UTF-16 text files with CRLF line endings)

Purpose

Beginner or intermediate learners of Japanese can use this data to optimize their studies by learning the most common words first. That said, (a) lists of common words already exist, some of them sorted by frequency, and (b) the source I used, Aozora Bunko, skews the data for common kanji: it consists mostly of old books, which don’t always reflect contemporary usage in everyday contexts (conversation, news, contemporary novels…).
Advanced learners of Japanese, and perhaps even native speakers, who are learning rare kanji are more likely to find this data useful. For example, say you encounter the kanji 耆 and decide to learn its meanings and the words in which it appears. Japanese-English dictionaries such as the widely used EDICT won’t be of much use, since they include few or no entries for rare kanji. You can use an online monolingual Japanese dictionary such as Kotobank, but you will find more than 60 entries containing 耆, with no information about their relative frequencies. By looking at the data provided here instead, you will see that 耆 appears in 14 different jukugo in Aozora Bunko, the most common being 伯耆, 耆宿 and 耆婆.

Sources and methodology

The 14,000+ text files of the Aozora Bunko digital library were used for the corpus.

A large list of word entries was made by combining the EDICT dictionary, a list of yojijukugo extracted from 四字熟語辞典 ONLINE, a list of entries from the Dai Kan-Wa Jiten and a list of entries from the Hanyu Da Cidian, the last two made available by the Kanji Database Project.

I made a script which performs the following steps:

For each kanji found in the corpus (including kanji outside the JIS X 0208 set), it searches for all sequences of one or more kanji containing that kanji (々 and 〆 are treated as valid kanji). Occurrences of 々 and 々々 are expanded into the kanji they stand for, so that no jukugo is missed: for many jukugo with repeated kanji, the dictionaries used include both an entry with 々 and an entry without, but not systematically.
For each sequence, the script looks for the longest substring which can be found in the dictionary entries.
Statistics are computed using the number of texts in which a word is found rather than by counting the total number of occurrences of that word in the whole corpus. This methodology is similar to the one I used in this project where I explain why it gives more representative results.
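
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actual script: the names `KANJI_RUN`, `expand_iteration_marks`, `longest_entry` and `document_frequency` are hypothetical, the Unicode ranges are an approximation of "kanji", the rule used to expand 々 (a run of n marks repeats the n preceding characters) is an assumption, and so is the requirement that the matched substring contain the kanji being looked up. The toy dictionary and texts are made up for the example.

```python
import re
from collections import Counter

# Approximate definition of "kanji": main CJK blocks plus 々 (U+3005) and 〆 (U+3006).
KANJI_RUN = re.compile(
    r"[\u3005\u3006\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF\U00020000-\U0002FFFF]+"
)

def expand_iteration_marks(run: str) -> str:
    """Assumed rule: a run of n 々 repeats the n characters before it."""
    out = []
    i = 0
    while i < len(run):
        if run[i] == "々":
            j = i
            while j < len(run) and run[j] == "々":
                j += 1
            n = j - i
            if len(out) >= n:
                out.extend(out[-n:])   # e.g. 一歩々々 becomes 一歩一歩
            else:
                out.extend("々" * n)   # nothing to repeat, keep the marks
            i = j
        else:
            out.append(run[i])
            i += 1
    return "".join(out)

def longest_entry(run: str, pos: int, dictionary: set):
    """Longest substring of `run` that covers index `pos` and is a dictionary entry.
    Ties are resolved in favour of the leftmost candidate (an arbitrary choice)."""
    best = None
    for start in range(pos + 1):
        for end in range(len(run), pos, -1):   # longest candidate first
            candidate = run[start:end]
            if candidate in dictionary:
                if best is None or len(candidate) > len(best):
                    best = candidate
                break
    return best

def document_frequency(texts, kanji: str, dictionary: set) -> Counter:
    """Count, for each word containing `kanji`, the number of texts it appears in."""
    df = Counter()
    for text in texts:
        found = set()                          # each word counts once per text
        for match in KANJI_RUN.finditer(text):
            run = expand_iteration_marks(match.group())
            for pos, ch in enumerate(run):
                if ch == kanji:
                    word = longest_entry(run, pos, dictionary)
                    if word is not None:
                        found.add(word)
        df.update(found)
    return df

# Toy data, invented for this example.
dictionary = {"伯耆", "耆宿", "国"}
texts = ["伯耆の国に耆宿あり。耆宿は語る。", "伯耆へ行く。"]
df = document_frequency(texts, "耆", dictionary)
print(df.most_common())
```

In this toy run, 耆宿 appears twice in the first text but is counted once, because the statistic is the number of texts containing the word rather than the total number of occurrences.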

Caveats

Jukugo which are absent from the dictionary entries are not reported in the data, since the software has no way of knowing whether it has encountered a legitimate jukugo or merely a juxtaposition of several words (e.g. when two or more nouns are combined to form a new noun, or when a jukugo is used as an adverb).
Sometimes a compound word can be either a Sino-Japanese jukugo read in on’yomi, or a native Japanese word read in kun’yomi and sometimes accompanied by okurigana. For example, 蹌踉 can either be a taru-adjective or to-adverb of Chinese origin read そうろう, or the root of a Japanese verb whose dictionary form is 蹌踉めく and which is read よろめく. Keep in mind that the program I wrote doesn’t parse kana and doesn’t try to disambiguate kanji readings. Consequently, occurrences of 蹌踉 read そうろう and 蹌踉 read よろ aren’t distinguished and are grouped together in the statistics. So if you look at the data for the kanji 蹌, the line corresponding to 蹌踉 covers all occurrences of 蹌踉 in the corpus, whatever their readings.
Due to the parsing method used and the imperfect nature of word segmentation algorithms for Chinese characters, there is a small (negligible but non-zero) number of false positives and missed words.
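
To illustrate the caveat about readings, here is a small Python sketch showing why the two uses of 蹌踉 cannot be told apart once kana are ignored. The name `kanji_runs`, the simplified character range and the two sample phrases are assumptions made for the example, not part of the actual program.

```python
import re

# Simplified kanji range for illustration: main CJK block plus 々 and 〆.
KANJI_RUN = re.compile(r"[\u3005\u3006\u4E00-\u9FFF]+")

def kanji_runs(text: str) -> list:
    """Return the maximal kanji-only sequences in `text`, dropping all kana."""
    return KANJI_RUN.findall(text)

# Sample phrases invented for this example.
on_reading = kanji_runs("蹌踉として歩く")    # 蹌踉 read そうろう
kun_reading = kanji_runs("蹌踉めく足どり")   # 蹌踉 as the root of よろめく

print(on_reading, kun_reading)
```

Both phrases yield the same kanji sequence 蹌踉, so the okurigana that would reveal the reading is no longer available by the time the dictionary lookup happens.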