aggregation on clustering k-means #31852

snozwoz · 2024-04-02T21:52:18Z

An unusual requirement, this one, if you can solve it.
I have a database of unstructured, chatty social media posts (no hashtags, even) and want to identify trending stories, whether up or down. I want to embed each story, group the embeddings into clusters of similar stories, and identify whether the clusters are shrinking or growing over time, similar to an SQL GROUP BY query with COUNT.
I am looking at k-means scatter graphs and thinking this clustering is perfect. If I can check over time which clusters are growing or shrinking, then I can look at posts near the centroids to identify the themes of the clusters.
I am not looking for a scatter chart visualisation though, just a list of movers, e.g. "vector db milvus up 12% on last week".
Knowing that a Milvus index stores vectors as k-means clusters, is there a way I can pull the information directly from the index and aggregate the counts per cluster, or something along those lines? This may be a bad idea; I'm open to suggestions.
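
The "list of movers" described above can be sketched independently of how the clusters are produced. This is only an illustration: the function name `cluster_movers` and the sample labels are hypothetical, and it assumes each post already has a cluster label that means the same thing in both weeks (for example, because both weeks were clustered together in one run).

```python
from collections import Counter

def cluster_movers(labels_last_week, labels_this_week):
    """Compare cluster sizes between two periods.

    Returns (cluster, count_before, count_after, pct_change) tuples, largest
    absolute change first. pct_change is None for clusters with no posts in
    the earlier period, since a percentage is undefined there.
    """
    before = Counter(labels_last_week)
    after = Counter(labels_this_week)
    movers = []
    for cluster in before.keys() | after.keys():
        b, a = before[cluster], after[cluster]
        pct = None if b == 0 else 100.0 * (a - b) / b
        movers.append((cluster, b, a, pct))
    # Sort by size of the change; break ties by name so the order is stable.
    movers.sort(key=lambda m: (-abs(m[2] - m[1]), str(m[0])))
    return movers

# Hypothetical labels: one entry per post, grouped by week.
last_week = ["milvus"] * 25 + ["faiss"] * 10
this_week = ["milvus"] * 28 + ["faiss"] * 5 + ["new-topic"] * 3

for cluster, b, a, pct in cluster_movers(last_week, this_week):
    change = "new" if pct is None else f"{pct:+.0f}%"
    print(f"{cluster}: {b} -> {a} ({change})")
```

This is the GROUP BY / COUNT part only; the labels themselves still have to come from a clustering step.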

xiaofan-luan · 2024-04-02T21:56:28Z

Maintainer

If I understand correctly, you are looking to run DBSCAN on Milvus and cluster the whole dataset. (K-means needs you to specify K, but in your case you don't know how many categories you have.)

How many embeddings do you have? If fewer than 10M, using Faiss on a single machine could simply solve this problem. If the number of vectors is huge, we can definitely help with that.

snozwoz Apr 2, 2024
Author

Correct. I don't know how many categories I have. I don't know what the topics are, and certainly don't know what tomorrow's news is. :-)

Less than 10m.

Can you explain your idea in more detail - I'm a bit of a noob on this idea.

xiaofan-luan Apr 2, 2024
Maintainer

Check whether https://scikit-learn.org/stable/modules/clustering.html#dbscan is what you need.

snozwoz Apr 2, 2024
Author

Just read up on DBSCAN; yes, it sounds more suitable than k-means. But how do I use DBSCAN with Milvus or Zilliz to pull the information out?

xiaofan-luan · 2024-04-02T23:39:38Z

Maintainer

We are working on a distributed DBSCAN, but it is really best suited to large datasets.
For a smaller dataset, I guess simply pulling the data out and running it on your local machine would work?
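
A minimal sketch of the "pull the data out and run it locally" route, assuming scikit-learn and NumPy are installed and the embeddings have already been exported into an array. The helper name `cluster_sizes`, the synthetic data, and the `eps`, `min_samples`, and cosine-metric choices are assumptions for illustration and would need tuning on real embeddings.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_sizes(embeddings, eps=0.05, min_samples=5):
    """Run DBSCAN and count the points in each cluster.

    DBSCAN marks points it cannot assign to any cluster with the label -1;
    those are left out of the counts.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(embeddings)
    sizes = Counter(int(label) for label in labels if label != -1)
    return labels, sizes

# Synthetic stand-in for exported embeddings: two tight groups of vectors
# pointing in different directions (20 and 30 points respectively).
rng = np.random.default_rng(0)
topic_a = np.array([1.0, 0.0, 0.0]) + 0.01 * rng.normal(size=(20, 3))
topic_b = np.array([0.0, 1.0, 0.0]) + 0.01 * rng.normal(size=(30, 3))
X = np.vstack([topic_a, topic_b])

labels, sizes = cluster_sizes(X)
print(sizes)
```

Unlike k-means, no cluster count is passed in; the number of clusters falls out of the density parameters.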

jnt0rrente Sep 19, 2024

Sorry for the slight necro: what is the recommended way of reading all data out of a Milvus instance? I have the exact same use case as OP, but can't find any documentation on how to export Milvus data programmatically, other than query/limit. I need to pass it to scikit-learn. Thanks!
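
One possible approach, sketched under assumptions rather than confirmed in this thread: recent pymilvus releases (2.3 and later, as far as I know) expose `Collection.query_iterator`, which pages through a collection in batches. The helper name `export_all`, the collection name, the field names, and the connection address below are all hypothetical.

```python
def export_all(collection, output_fields, batch_size=1000, expr=None):
    """Drain a collection batch by batch and return every row as a dict.

    `collection` is expected to behave like pymilvus.Collection, i.e. to offer
    query_iterator(...) returning an object with next() and close().
    """
    kwargs = {"batch_size": batch_size, "output_fields": output_fields}
    if expr is not None:
        kwargs["expr"] = expr  # optional filter expression
    iterator = collection.query_iterator(**kwargs)
    rows = []
    try:
        while True:
            batch = iterator.next()
            if not batch:  # an empty batch signals the end of the data
                break
            rows.extend(batch)
    finally:
        iterator.close()
    return rows

# Usage against a live instance (not run here; all names are assumptions):
# from pymilvus import connections, Collection
# connections.connect(uri="http://localhost:19530")
# posts = Collection("posts")
# rows = export_all(posts, ["id", "embedding"])
# X = [row["embedding"] for row in rows]  # ready to hand to scikit-learn
```

Holding every row in memory is fine for a few million vectors on one machine; for larger collections the batches could be written to disk as they arrive instead.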
