serialisation and topWords info #127

cbadenes · 2018-02-14T16:18:12Z

Minor changes to the serialisation process, plus a new method that returns the top words of each topic along with their weights.

mimno · 2018-04-25T18:19:02Z

Thank you for all of this! Some comments:

Could you say more about the Lexer -> Pattern shift in CharSequence2TokenSequence?

It looks like the validateTopics function is adding stopwords during training? Is there a reference for this? I'm reluctant to make something available without fully understanding when users should and shouldn't use it.

I'm planning to release the HPPC version as 2.1, and I'd like to see this as part of it.

cbadenes · 2018-04-26T15:02:55Z

Hi David,

To make the CharSequence2TokenSequence class thread-safe when a pipe is being built, a new CharSequenceLexer instance is required for each instance carried in a pipe. The regex pattern should therefore be the only class attribute of the CharSequence2TokenSequence object.
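As a rough illustration of that idea, here is a minimal sketch using only `java.util.regex`. The class `PatternTokenizer` and its `tokenize` method are hypothetical names made up for this example; they are not the actual CharSequence2TokenSequence or CharSequenceLexer code. The point is only that a compiled `Pattern` is immutable and can be shared between threads, while the stateful matching object is created fresh on every call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a tokenising pipe; NOT MALLET's real class.
final class PatternTokenizer {
    // The compiled Pattern is immutable, so sharing it across threads is safe.
    // It is the only field the object keeps.
    private final Pattern pattern;

    PatternTokenizer(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // A Matcher holds mutable state, so every call builds its own
    // instead of reusing one stored in a field.
    List<String> tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }
}
```

With this layout, several threads can call `tokenize` on the same object at once, because no mutable state is shared between calls.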

As for the validateTopics function, the idea is to build a list of stopwords iteratively, based on the words that appear as top words in multiple topics. This is similar to applying TF/IDF to topics instead of documents.

I hope this helps.
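One round of that idea could be sketched as follows. This is an assumption-laden illustration, not the validateTopics code from this PR: the class `TopicStopwordSketch`, the method `sharedTopWords`, and the `minTopics` threshold are all hypothetical. It counts in how many topics each word shows up as a top word (a topic-level analogue of document frequency) and flags the words that reach the threshold.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; names and threshold are assumptions for illustration.
final class TopicStopwordSketch {

    // Returns the words that appear among the top words of at least
    // minTopics different topics.
    static Set<String> sharedTopWords(List<List<String>> topWordsPerTopic, int minTopics) {
        Map<String, Integer> topicFrequency = new HashMap<>();
        for (List<String> topWords : topWordsPerTopic) {
            // De-duplicate so a word is counted at most once per topic.
            for (String word : new HashSet<>(topWords)) {
                topicFrequency.merge(word, 1, Integer::sum);
            }
        }
        Set<String> candidates = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : topicFrequency.entrySet()) {
            if (entry.getValue() >= minTopics) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }
}
```

In an iterative setup, the returned words would be added to the stoplist and the top words recomputed on the next pass, repeating until no new candidates appear. How the PR actually schedules these passes during training is not shown here.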

cbadenes added 11 commits February 14, 2018 17:09

alpha serialized as Double

56cd820

labelAlphabet serialized as Object

e016fc8

top words and weights per topic

8ad0352

return label alphabet as Alphabet

411cb68

no static fields to avoid memory leaks

dec98c8

no static fields to avoid memory leaks

c40bcfd

handle pipe in a parallel way

4e41765

handle multi-thread operations

c9ed380

parallel processing in LabeledLDA

5a9cbae

handle illegal argument

4d0e450

initialize vocabulary in a parallel way

e03d723

cbadenes added 17 commits June 21, 2018 15:44

fix feature index type

891ba96

added word to stoplist

2ec98df

added word to stoplist

deea9f9

allow clone a model inferencer

c214a8e

entries as ArrayList

d56706a

minor changes

56208d7

thread-safe iterator

d31d472

parse out as csv or json

b77933f

print progress

450c9e3

skip tests

f785b9b

topic co-occurrence calculated after a model is trained

b6ce6de

avoid null instances

2eb0c42

handle invalid data

e2342ce

handle instance error

5502c83

parallel read csv

09ea40b

maintain topic assignments

d92d595

parallel execution

f97745d

cbadenes added 3 commits December 17, 2018 12:49

updated version

7dc3455

fixed data alphabet

b2b2fdc

added concurrent hashmap for stopwords list

e4e1501
