Web page in multiple languages #13

Open
xfq opened this issue Jul 26, 2022 · 3 comments

Comments

@xfq
Member

xfq commented Jul 26, 2022

What's the expected behaviour if a web page contains multiple languages? For example, if a page contains Chinese and Japanese, the segmentation process and full-text indexes could be different. Even the same code point sequence may be segmented differently depending on whether the text is ja or zh.

@r12a
Contributor

r12a commented Jul 26, 2022

@xfq I'm not sure what the problem is here.

@aphillips
Contributor

@xfq If one does true full-text search on a page in multiple languages (as opposed to substring matching, which is the primary topic of our document), then the segmentation, stemming, and other processing (such as named entity recognition) of the corpus should be matched to the language of each block of text. For example, word segmentation for ja is different from that for zh.
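To make the per-block idea concrete, here is a minimal Python sketch, not a real segmenter. It assumes each block of text already carries a language tag, and it picks a word list by that tag before running a greedy longest-match pass. The names `TOY_DICTIONARIES`, `longest_match_segment`, `segment_block`, and `index_page` are hypothetical, and the word lists are invented purely to force a visible difference; they make no linguistic claim about Japanese or Chinese.

```python
from typing import Dict, List, Set, Tuple

# Toy word lists, invented for illustration only (an assumption, not real lexicons).
TOY_DICTIONARIES: Dict[str, Set[str]] = {
    "ja": {"大学", "生活"},
    "zh": {"大学", "大学生", "生活"},
}


def longest_match_segment(text: str, dictionary: Set[str]) -> List[str]:
    """Greedy forward longest-match; unknown characters become single tokens."""
    tokens: List[str] = []
    max_len = max((len(word) for word in dictionary), default=1)
    i = 0
    while i < len(text):
        # Try the longest candidate first and fall back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def segment_block(text: str, lang: str) -> List[str]:
    """Pick the segmenter by the block's language tag, not the page's."""
    dictionary = TOY_DICTIONARIES.get(lang)
    if dictionary is None:
        # Fallback for languages without a toy dictionary: whitespace split.
        return text.split()
    return longest_match_segment(text, dictionary)


def index_page(blocks: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """blocks is a list of (lang, text) pairs; each block is segmented on its own."""
    return [(lang, segment_block(text, lang)) for lang, text in blocks]


if __name__ == "__main__":
    page = [("ja", "大学生活"), ("zh", "大学生活"), ("en", "full text search")]
    for lang, tokens in index_page(page):
        print(lang, tokens)
```

With these toy word lists the same four code points yield two tokens of two characters each under the `ja` tag, but a three-character token plus a single character under the `zh` tag, so the two blocks would produce different index entries.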

When search terms are entered against a multilingual index, it may be necessary to do "explosive stemming" (multiple stemming processes using the rules for the various languages in the corpus) or other types of processing to try to match the search terms against the indices.
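One way to read "explosive stemming" is sketched below in Python, under stated assumptions: the index keeps separate postings per language, and a query term is stemmed once per language present in the corpus before lookup. The suffix rules in `TOY_SUFFIXES` and the helpers `toy_stem`, `build_index`, and `explosive_search` are hypothetical stand-ins; real stemmers are far more involved, and this is not presented as how any particular search engine does it.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical suffix-stripping rules, invented for illustration only.
TOY_SUFFIXES: Dict[str, List[str]] = {
    "en": ["ing", "es", "s"],
    "de": ["en", "er", "e"],
}


def toy_stem(word: str, lang: str) -> str:
    """Strip the first matching suffix for lang, keeping at least 3 characters."""
    word = word.lower()
    for suffix in TOY_SUFFIXES.get(lang, []):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: List[Tuple[int, str, str]]) -> Dict[str, Dict[str, Set[int]]]:
    """docs holds (doc_id, lang, text); postings are kept separately per language."""
    index: Dict[str, Dict[str, Set[int]]] = {}
    for doc_id, lang, text in docs:
        for word in text.split():
            stem = toy_stem(word, lang)
            index.setdefault(lang, {}).setdefault(stem, set()).add(doc_id)
    return index


def explosive_search(term: str, index: Dict[str, Dict[str, Set[int]]]) -> Set[int]:
    """Stem the term under every language in the index and union the hits."""
    hits: Set[int] = set()
    for lang, postings in index.items():
        hits |= postings.get(toy_stem(term, lang), set())
    return hits


if __name__ == "__main__":
    idx = build_index([(1, "en", "searching pages"), (2, "de", "Seiten suchen")])
    print(explosive_search("searches", idx))
    print(explosive_search("suche", idx))
```

The trade-off in this sketch is recall over precision: stemming a term with rules for a language it does not belong to can produce spurious matches, which is part of why the lookup is kept per language rather than merged into one set of postings.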

FTS is complicated.

As @r12a asks, what is the problem here (with respect to our text)? 😉 Happy to accept suggestions.

@xfq
Member Author

xfq commented Jul 26, 2022

I think it should be pointed out that if a piece of text contains multiple languages, then the search for the text needs to be adapted to support multiple languages, not just the primary language.
