compare biterm topic modelling to rainette, LDA, coclustering, structural topic model, embedding clustering, autoencoders #9

Open
jwijffels opened this issue Jun 26, 2019 · 12 comments

Comments

@jwijffels
Collaborator

Looking for some typical, interesting open data with short texts, in order to compare clustering methods (BTM / LDA / stm / coclustering / Reinert text clustering / embedding clustering / autoencoder).
@datasculptor / @manuelbickel do you know of any interesting open data?

@rdatasculptor

I have never used (or even taken a look at) this dataset before, but it may be interesting: https://registry.opendata.aws/amazon-reviews/

@jwijffels
Collaborator Author

Interesting and huge dataset, but unfortunately the license of that data is too restrictive.

@rdatasculptor

You are right.
How about this list of tweet collections: https://www.docnow.io/catalog/

@jwijffels
Collaborator Author

I would prefer to use data which can be shared.

@rdatasculptor

Sorry for not checking before giving the link.

@manuelbickel

manuelbickel commented Jun 27, 2019 via email

@jwijffels
Collaborator Author

No problem.
Japanese Haiku, yes, why not :)

@rdatasculptor

Could this be interesting? https://www.linkedin.com/feed/update/urn:li:activity:6553904839447973888
Not that I am a fan or something :-)

@jwijffels
Collaborator Author

I'm sure you are a fan :)

@rdatasculptor

Also this one could be interesting: https://github.com/EmilHvitfeldt/textdata

@msaeltzer

You could look at manifestos. manifestoR is an API to coded political text in several languages.
https://github.com/ManifestoProject/manifestoR
While manifestos are (very) long texts, they are coded here as quasi-sentences: statements at the sentence or sub-sentence level. These make up short micro-texts on specific topics. While the coding is useful, it is far from perfect. It gives an idea of the number of topics in the text but is not conclusive, as the codes can be aggregated to higher categories like issues and domains.
I am working with them right now, using BTM.

@jwijffels
Collaborator Author

Interesting. Didn't know these political party manifestos existed.
