Check false matches by * in Japanese #28

koheiw · 2019-04-19T09:14:45Z

"タイ*" for Thailand produces a lot of false matches. For example, "タイヤ" (tire), "タイム" (time), "タイミング" (timing), "タイプ" (type), "タイトル" (title), "タイガー" (tiger).

This is a good reminder that we have to be careful with wildcards. We need to check the words for other countries too.
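The false matches listed above can be reproduced with a minimal base R sketch. It does not use the newsmap dictionary or quanteda's matching; the `words` vector is a hypothetical sample built from the examples in this comment, and `glob2rx()` from the utils package stands in for glob matching.

```r
# Hypothetical word list: the country name plus the loanwords listed above
words <- c("タイ", "タイヤ", "タイム", "タイミング", "タイプ", "タイトル", "タイガー")

# glob2rx() (utils, attached by default) converts a glob into an anchored regex
pattern <- glob2rx("タイ*")

# Every word starts with "タイ", so the glob pattern matches all of them
glob_hits <- words[grepl(pattern, words)]

# An exact comparison keeps only the country name itself
fixed_hits <- words[words == "タイ"]

print(glob_hits)
print(fixed_hits)
```

Only the first element is the country name, so in this sample six of the seven glob matches are false positives.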

koheiw · 2019-04-20T23:39:50Z

The chance of false matches increases when we use *, but whether a wildcard is needed depends on how Japanese words are segmented in tokenization. Below is code to test whether country names are separated from the elements that follow them. For example, we need * for Japan because tokens() does not split "日本人" into "日本" and "人", while "アメリカ人" becomes "アメリカ" and "人".

require(quanteda)
#> Loading required package: quanteda
#> Package version: 1.4.4
#> Parallel computing: 2 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View
require(newsmap)
#> Loading required package: newsmap
require(stringi)
#> Loading required package: stringi

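# take the first seed word of each country (dictionary flattened to level 3)
# and strip its trailing wildcard, if any
lis <- as.list(data_dictionary_newsmap_ja, TRUE, 3) %>% 
       lapply(function(x) stri_replace_last_fixed(x[1], "*", ""))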

# followed by kanji (country names as part of demonym)
people_fixed <- unlist(lis) %>% 
    paste0("人") %>% 
    tokens() %>% 
    tokens_lookup(dictionary(lis)) %>% 
    ntoken()

people_glob <- unlist(lis) %>% 
    paste0("人") %>% 
    tokens() %>% 
    tokens_lookup(data_dictionary_newsmap_ja) %>% 
    ntoken()

(missed_people <- names(lis)[people_glob > 0 & people_fixed == 0])
#> [1] "CD" "CG" "ST" "PM" "KG" "JP" "MP"

# followed by katakana (country names as adjectives)
team_fixed <- unlist(lis) %>% 
    paste0("チーム") %>% 
    tokens() %>% 
    tokens_lookup(dictionary(lis)) %>% 
    ntoken()

team_glob <- unlist(lis) %>% 
    paste0("チーム") %>% 
    tokens() %>% 
    tokens_lookup(data_dictionary_newsmap_ja) %>% 
    ntoken()

(missed_team <- names(lis)[team_glob > 0 & team_fixed == 0])
#>  [1] "MG" "YT" "CD" "CG" "ST" "AI" "BQ" "PM" "KG" "GG" "MP" "NU" "TK"

union(missed_people, missed_team) # countries that need wildcard
#>  [1] "CD" "CG" "ST" "PM" "KG" "JP" "MP" "MG" "YT" "AI" "BQ" "GG" "NU" "TK"

Interestingly, this is not specific to tokens(): MeCab segments these words in a similar manner.

日本人
日本人  名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン

アメリカ人
アメリカ        名詞,固有名詞,地域,国,*,*,アメリカ,アメリカ,アメリカ
人      名詞,接尾,一般,*,*,*,人,ジン,ジン

koheiw · 2019-04-21T03:48:52Z

@ClaudeGrasland here is the comparison between the new and old versions.

There is a large increase for small island countries in the new version because I treated their names as phrases. The increase for Madagascar and Germany is due to wrong translations in the old version.
Removing the wildcard has little effect when tokens are not compounded, but a large effect when they are. For Thailand, for example, the change is -0.02% with non-compounded tokens but -24% with compounded tokens. This is because, without the wildcard, "タイ" matches the separate tokens "タイ" "軍" (Thai military) but not the compounded token "タイ軍". This is a tricky issue.

> diff["kh"]
          kh 
-0.000264131 
> diff2["kh"]
        kh 
-0.2463005

I produced this plot with the script at https://github.com/koheiw/newsmap/blob/issue-28/tests/misc/comapre-dictionaries.R
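The interaction between wildcards and compounding described in this comment can be illustrated with a small base R sketch. The token vectors and the helpers `match_fixed()` and `match_glob()` are hypothetical and only mimic exact and prefix matching on single tokens; they are not how quanteda's tokens_lookup() is implemented.

```r
# Hypothetical token vectors for the same text, segmented in two ways
separate   <- c("タイ", "軍")  # country name and suffix as separate tokens
compounded <- c("タイ軍")      # the two tokens joined into a single token

# Fixed pattern: a token matches only if it is identical to the seed word
match_fixed <- function(tokens, seed) tokens == seed

# Glob pattern "seed*": a token matches if it starts with the seed word
match_glob <- function(tokens, seed) startsWith(tokens, seed)

result <- c(
  fixed_separate   = any(match_fixed(separate, "タイ")),
  fixed_compounded = any(match_fixed(compounded, "タイ")),
  glob_separate    = any(match_glob(separate, "タイ")),
  glob_compounded  = any(match_glob(compounded, "タイ"))
)
print(result)
```

In this sketch only the fixed pattern on the compounded token fails to match, which is the case that the large drop with compounded tokens reflects.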

koheiw added bug dictionary labels Apr 19, 2019

koheiw self-assigned this Apr 19, 2019

koheiw mentioned this issue Apr 19, 2019

Add more seed dictionaries #6

Open

16 tasks

koheiw changed the title ~~Check false maches by * in Japanese~~ Check false matches by * in Japanese Apr 19, 2019

chainsawriot mentioned this issue Jul 21, 2022

Ambiguity in Chinese seed words for CF and MN #67

Open
