serialisation and topWords info #127

Open
wants to merge 31 commits into base: master

cbadenes

Minor changes to the serialisation process, plus a new method to get the top words and their weights per topic.

@mimno
Owner

mimno commented Apr 25, 2018

Thank you for all of this! Some comments:

Could you say more about the Lexer -> Pattern shift in CharSequence2TokenSequence?

It looks like the validateTopics function is adding stopwords during training? Is there a reference for this? I'm reluctant to make something available without fully understanding when users should and shouldn't use it.

I'm planning to release the HPPC version as 2.1, and I'd like to see this as part of it.

@cbadenes
Author

Hi David,

To make the CharSequence2TokenSequence class thread-safe when a pipe is built, a new CharSequenceLexer has to be created for each instance carried through the pipe. The regex pattern should therefore be the only class attribute of the CharSequence2TokenSequence object.
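
The Lexer -> Pattern reasoning above can be illustrated with plain `java.util.regex`. This is only a minimal sketch, not the code in this PR: the class name `PatternTokenizer` and its `tokenize` method are hypothetical stand-ins, and a `Matcher` plays the role of the stateful lexer. The point it shows is that a compiled `Pattern` is immutable and can be shared between threads, while the object that holds scanning state is created fresh on every call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a tokenizing pipe; not MALLET's actual class.
final class PatternTokenizer {
    // The only field: a compiled Pattern, which is immutable and thread-safe.
    private final Pattern tokenPattern;

    PatternTokenizer(Pattern tokenPattern) {
        this.tokenPattern = tokenPattern;
    }

    // A fresh Matcher (the stateful part) is created on every call,
    // so concurrent calls never share scanning state.
    List<String> tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = tokenPattern.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }
}
```

Keeping a stateful scanner as a field instead would mean two threads piping different instances could advance the same scanner, which is the kind of sharing the change is meant to avoid.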

As for the validateTopics function, the idea is to build a list of stopwords iteratively from the words that appear as top words in multiple topics. This is similar to applying TF-IDF to topics instead of documents.
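
One step of that idea can be sketched as follows. This is an illustration under assumptions, not the validateTopics implementation from this PR: the class `SharedTopWords`, the method `findSharedTopWords`, and the `minTopics` threshold are all hypothetical. It counts in how many topics each word appears as a top word (a topic-level document frequency) and returns the words that reach the threshold.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; names and threshold are assumptions for illustration.
final class SharedTopWords {
    // Returns the words that occur among the top words of at least minTopics topics.
    static Set<String> findSharedTopWords(List<List<String>> topWordsPerTopic, int minTopics) {
        Map<String, Integer> topicFrequency = new HashMap<>();
        for (List<String> topWords : topWordsPerTopic) {
            // De-duplicate within a topic so each topic counts a word once.
            for (String word : new HashSet<>(topWords)) {
                topicFrequency.merge(word, 1, Integer::sum);
            }
        }
        Set<String> shared = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : topicFrequency.entrySet()) {
            if (entry.getValue() >= minTopics) {
                shared.add(entry.getKey());
            }
        }
        return shared;
    }
}
```

In an iterative scheme, the returned words would presumably be added to the stopword list before the next round of training, and the check repeated until no word is shared by too many topics.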

I hope this helps.

2 participants