Check false matches by * in Japanese #28

Open
koheiw opened this issue Apr 19, 2019 · 2 comments

koheiw commented Apr 19, 2019

"タイ*" for Thailand produces a lot of false matches. For example, "タイヤ" (tire), "タイム" (time), "タイミング" (timing), "タイプ" (type), "タイトル" (title), "タイガー" (tiger).

This is a good reminder that we have to be careful with wildcards. We need to check the words for other countries too.
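
The false matches are easy to reproduce outside the dictionary. Below is a minimal sketch in base R, assuming the glob "タイ*" behaves like the anchored regular expression that glob2rx() generates. The word list is taken from the examples above plus "タイ" itself, and the variable names are only for illustration.

```r
# Words from the comment above, plus the intended match "タイ"
words <- c("タイ", "タイヤ", "タイム", "タイミング", "タイプ", "タイトル", "タイガー")

# Convert the glob pattern to a regular expression
pattern <- glob2rx("タイ*")

# Glob matching: every word starting with "タイ" is matched
glob_match <- grepl(pattern, words)

# Fixed matching: only the exact word is matched
fixed_match <- words == "タイ"

# False matches are words caught by the glob but not by the exact pattern
false_matches <- words[glob_match & !fixed_match]
print(false_matches)
```

Every loanword in the list is caught by the glob, so the wildcard is only safe when the tokenizer leaves no other way to reach the country name.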

koheiw self-assigned this Apr 19, 2019
koheiw mentioned this issue Apr 19, 2019
koheiw changed the title from "Check false maches by * in Japanese" to "Check false matches by * in Japanese" Apr 19, 2019

koheiw commented Apr 20, 2019

The chance of false matches increases when we use *, but whether a wildcard is needed depends on how Japanese words are segmented during tokenization. Below is code to test whether country names are separated from the elements that follow them. For example, we need * for Japan because tokens() does not split "日本人" into "日本" and "人", while "アメリカ人" becomes "アメリカ" and "人".

require(quanteda)
#> Loading required package: quanteda
#> Package version: 1.4.4
#> Parallel computing: 2 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View
require(newsmap)
#> Loading required package: newsmap
require(stringi)
#> Loading required package: stringi

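# take the first pattern of each country (level 3) and strip the trailing wildcard
lis <- as.list(data_dictionary_newsmap_ja, flatten = TRUE, levels = 3) %>% 
       lapply(function(x) stri_replace_last_fixed(x[1], "*", ""))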

# followed by kanji (country names as part of demonym)
people_fixed <- unlist(lis) %>% 
    paste0("人") %>% 
    tokens() %>% 
    tokens_lookup(dictionary(lis)) %>% 
    ntoken()

people_glob <- unlist(lis) %>% 
    paste0("人") %>% 
    tokens() %>% 
    tokens_lookup(data_dictionary_newsmap_ja) %>% 
    ntoken()

(missed_people <- names(lis)[people_glob > 0 & people_fixed == 0])
#> [1] "CD" "CG" "ST" "PM" "KG" "JP" "MP"

# followed by katakana (country names as adjectives)
team_fixed <- unlist(lis) %>% 
    paste0("チーム") %>% 
    tokens() %>% 
    tokens_lookup(dictionary(lis)) %>% 
    ntoken()

team_glob <- unlist(lis) %>% 
    paste0("チーム") %>% 
    tokens() %>% 
    tokens_lookup(data_dictionary_newsmap_ja) %>% 
    ntoken()

(missed_team <- names(lis)[team_glob > 0 & team_fixed == 0])
#>  [1] "MG" "YT" "CD" "CG" "ST" "AI" "BQ" "PM" "KG" "GG" "MP" "NU" "TK"

union(missed_people, missed_team) # countries that need wildcard
#>  [1] "CD" "CG" "ST" "PM" "KG" "JP" "MP" "MG" "YT" "AI" "BQ" "GG" "NU" "TK"

Interestingly, this is not specific to tokens(): MeCab segments these words in a similar way.

日本人
日本人  名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン

アメリカ人
アメリカ        名詞,固有名詞,地域,国,*,*,アメリカ,アメリカ,アメリカ
人      名詞,接尾,一般,*,*,*,人,ジン,ジン
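
The check above boils down to asking whether the bare country name survives as a token of its own. Below is a minimal sketch in base R, with the segmentation hard-coded from the MeCab output above rather than produced by a tokenizer. `needs_wildcard()` is a hypothetical helper for illustration, not part of newsmap or quanteda.

```r
# Segmentation copied by hand from the MeCab output above
segmented <- list(
    "日本" = c("日本人"),             # not split: country name is fused with "人"
    "アメリカ" = c("アメリカ", "人")  # split: country name is a separate token
)

# Hypothetical helper: a wildcard is needed when no token equals the bare
# country name but at least one token starts with it
needs_wildcard <- function(name, tokens) {
    exact <- any(tokens == name)
    prefix <- any(startsWith(tokens, name))
    prefix && !exact
}

# Apply the helper to each country name and its tokens
result <- mapply(needs_wildcard, names(segmented), segmented)
print(result)
```

Under this segmentation only "日本" needs the wildcard, which agrees with "JP" appearing in the list of missed countries.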


koheiw commented Apr 21, 2019

@ClaudeGrasland here is the comparison between new and old.

image

There are large increases for small island countries in the new version because I treated their names as phrases. The increases for Madagascar and Germany are due to wrong translations in the old version.
Removing the wildcard has little effect when tokens are not compounded, but the impact is large when they are compounded. For Thailand, for example, the change is -0.02% with non-compounded tokens but -24% with compounded tokens. This is because "タイ" matches only "タイ" "軍" (Thai military), not "タイ軍". This is a tricky issue.

> diff["kh"]
          kh 
-0.000264131 
> diff2["kh"]
        kh 
-0.2463005 
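
Why compounding matters can be shown with a toy case. Below is a minimal sketch in base R, assuming fixed matching compares whole tokens and glob matching anchors the pattern at the start of a token. The two token vectors are hypothetical renderings of the same text, not output from tokens().

```r
# The same phrase, tokenized without and with compounding (hypothetical)
separate <- c("タイ", "軍")  # "タイ" stands alone
compound <- c("タイ軍")      # "タイ" is fused with "軍"

# Count tokens matched by the fixed pattern "タイ" and by the glob "タイ*"
count_fixed <- function(tokens) sum(tokens == "タイ")
count_glob <- function(tokens) sum(grepl(glob2rx("タイ*"), tokens))

# Rows are tokenizations, columns are matching types
counts <- rbind(
    separate = c(fixed = count_fixed(separate), glob = count_glob(separate)),
    compound = c(fixed = count_fixed(compound), glob = count_glob(compound))
)
print(counts)
```

Dropping the wildcard loses the match only in the compounded case, which is consistent with the much larger decrease reported above for compounded tokens.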

I produced this plot in https://github.com/koheiw/newsmap/blob/issue-28/tests/misc/comapre-dictionaries.R
