Add option to remove POS tagging before input to word cloud #991

Open
wvdvegte opened this issue Aug 1, 2023 · 7 comments

Comments

@wvdvegte

wvdvegte commented Aug 1, 2023

Is your feature request related to a problem? Please describe.
In a workflow where I applied POS tagging so that I could select (for instance) just nouns and verbs, followed by Bag of Words, Distances and Hierarchical Clustering, and then visualized the clusters in Word Cloud, the word cloud shows all words with their POS tags, and words that occur with different tags are shown multiple times:
[image: Word Cloud screenshot]
Instead I would like to be able to see each word in Word Cloud only once, without POS tagging.
Unlike Bag of Words, widgets with similar functionality such as Document Embedding or Similarity Hashing do not produce output with POS tagging.

Describe the solution you'd like
I think there are different options:

  • add an option to Bag of Words to remove POS tagging from output if present, and to merge multiple occurrences of the same word
  • add an option to Word Cloud to ignore POS tagging and to merge multiple occurrences of the same word
  • offer a separate widget that allows removing tags and/or tokens
  • add an option to Preprocess Text to remove POS tagging after filtering based on it.
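
To make "remove POS tagging and merge multiple occurrences" concrete, here is a minimal sketch in plain Python. It is not Orange's API: the function name `merge_tagged_counts` is hypothetical, and it assumes that tagged features look like `word_TAG` with an underscore separator, which may not match the actual output format.

```python
from collections import Counter


def merge_tagged_counts(tagged_counts, sep="_"):
    """Strip a trailing POS tag from each key and sum the counts per word.

    Assumption: keys look like "word_TAG". Keys without the separator
    are treated as untagged and kept unchanged.
    """
    merged = Counter()
    for feature, count in tagged_counts.items():
        # rpartition splits on the LAST separator, so a word that itself
        # contains the separator keeps everything before the final one.
        word, found, _tag = feature.rpartition(sep)
        merged[word if found else feature] += count
    return merged


# Hypothetical counts, with the same word appearing under different tags.
counts = {
    "practitioner_NN": 3,
    "practitioner_NNS": 2,
    "design_VB": 4,
    "design_NN": 1,
    "untagged": 5,
}
merged = merge_tagged_counts(counts)
print(merged["practitioner"], merged["design"], merged["untagged"])
```

With a merge step like this, each word would reach the visualization once, with the weights of its tagged variants summed.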

Describe alternatives you've considered
Couldn't find any

@wvdvegte
Author

wvdvegte commented Aug 2, 2023

Two small corrections:

  • POS tagging in the output of Document Embedding is indeed removed in input to Word Cloud, but not in input to MDS or t-SNE (and to a subsequent Annotated Corpus Map)
  • There is a workaround, although a very cumbersome one: after doing the machine learning, merge with the original Corpus (from before preprocessing) into a new Corpus, remove the sparse features with Select Columns, and preprocess again without POS tagging.

@PrimozGodec
Collaborator

Thank you for the report. I think we should discuss the best solution to this issue internally. Is there any other situation where you would like to have POS tags and then have them removed later, besides the following two:

  • filtering according to POS tags (and later processing without POS tags)
  • visualising with Word Cloud (POS tags needed before but omitted for visualisation).
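
The first case (filter on POS tags, then continue without them) could be sketched roughly as follows. This is plain Python for illustration only, not how Preprocess Text is implemented: the function name `filter_then_drop_tags` is hypothetical, and the Penn Treebank style tags (`NN`, `VBZ`, ...) are an assumption about the tagger's output.

```python
def filter_then_drop_tags(tagged_tokens, keep_prefixes=("NN", "VB")):
    """Keep tokens whose tag starts with one of keep_prefixes, then drop the tags.

    tagged_tokens is a list of (word, tag) pairs. The tags are used only
    for the filtering decision and do not appear in the result.
    """
    # str.startswith accepts a tuple of prefixes, so "NNS" matches "NN"
    # and "VBZ" matches "VB".
    return [word for word, tag in tagged_tokens if tag.startswith(keep_prefixes)]


# Hypothetical tagged document: keep only nouns and verbs.
doc = [
    ("the", "DT"),
    ("practitioner", "NN"),
    ("designs", "VBZ"),
    ("quickly", "RB"),
    ("practitioners", "NNS"),
]
plain_tokens = filter_then_drop_tags(doc)
print(plain_tokens)
```

The point of the ordering is that the tags are still available when the filter runs, and everything downstream sees plain tokens.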

@wvdvegte
Author

wvdvegte commented Aug 7, 2023

Yes, I assume the POS tags (if present) make a difference not only in filtering but in any type of analysis (classification, clustering, network analysis, ...), but I'd like to have the choice not to show them in any type of visualization - not only Word Cloud but also, for instance, Annotated Corpus Map and even in Data Table. There, I think it also makes sense to merge different 'versions' of a word, like 'practitioner' in my screenshot above.

@wvdvegte
Author

wvdvegte commented Aug 7, 2023

BTW, Annotated Corpus Map is clustering and visualization in one widget. It seems to make sense to consider the POS tags for clustering but not for the visualization.

@ajdapretnar
Collaborator

This is a bit of a stale issue but I gave it some thought. Word Cloud currently doesn't show POS tags anymore. However, it would not merge two words with the same name into one.
I propose adding an option to remove POS tags in Preprocess Text. It makes the most sense to me. That said - where in Preprocess Text? As a final option in POS Tagger? As in "POS tag or remove any tags"?

@wvdvegte
Author

I agree this could best be added to Preprocess Text. However, if you add it to POS Tagger, you have to activate POS Tagger twice: once before and once after Filtering. Perhaps it makes more sense as a final option in Filtering, where the current final option is filtering based on POS tags?

@ajdapretnar
Collaborator

Duh, how did this not occur to me? 🤦‍♀️
Filtering it is.


3 participants