Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregated Throughput Anomaly detection #184

Merged
merged 4 commits into from
Jul 12, 2023

Conversation

tushartathgur
Copy link
Contributor

This PR does the following:

  • Implements argument "agg-flow" and "p2p-label" for aggregated flow.
  • Aggregated flow contains Pods to external, pods to pods based of labels and pod to service flows.
  • New retrieve table has been added for aggregated TAD.
  • Modified retrieve table for TAD, new fields include agg_type and algo_type for better understanding.
  • TAD delete command can now take multiple tad ids to delete.

@codecov
Copy link

codecov bot commented Mar 17, 2023

Codecov Report

Merging #184 (e305fdd) into main (31f4752) will decrease coverage by 0.30%.
The diff coverage is 64.32%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #184      +/-   ##
==========================================
- Coverage   66.54%   66.25%   -0.30%     
==========================================
  Files          38       38              
  Lines        4783     5049     +266     
==========================================
+ Hits         3183     3345     +162     
- Misses       1453     1548      +95     
- Partials      147      156       +9     
Flag Coverage Δ
python-coverage 56.37% <55.73%> (-1.14%) ⬇️
unit-tests 69.90% <74.21%> (+0.20%) ⬆️
Impacted Files Coverage Δ
plugins/anomaly-detection/anomaly_detection.py 46.52% <51.20%> (-2.51%) ⬇️
pkg/theia/commands/anomaly_detection_run.go 84.81% <60.97%> (-12.00%) ⬇️
pkg/controller/anomalydetector/controller.go 73.76% <70.96%> (+1.49%) ⬆️
pkg/theia/commands/anomaly_detection_delete.go 94.00% <100.00%> (+1.14%) ⬆️
pkg/theia/commands/anomaly_detection_retrieve.go 89.65% <100.00%> (+3.50%) ⬆️
...lugins/anomaly-detection/anomaly_detection_test.py 94.23% <100.00%> (+1.20%) ⬆️

... and 1 file with indirect coverage changes

@tushartathgur tushartathgur force-pushed the agg_flow_part2 branch 6 times, most recently from 66bd428 to 1518a84 Compare March 21, 2023 22:55
@tushartathgur tushartathgur force-pushed the agg_flow_part2 branch 5 times, most recently from abcf5bc to 842d8e8 Compare March 27, 2023 20:18
algoCalc,
anomaly
FROM tadetector WHERE id = (?);`,
aggtadpodQuery: `
SELECT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code formatter missing in .go files?

Copy link
Contributor Author

@tushartathgur tushartathgur Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, make fmt does cover rest files
tathgurt@tathgurtFLVDL theia % find . -type d -name '.cache' -prune -o -type f -name '*.go' -print | grep rest ./pkg/apiserver/registry/intelligence/throughputanomalydetector/rest.go ./pkg/apiserver/registry/intelligence/throughputanomalydetector/rest_test.go ./pkg/apiserver/registry/intelligence/networkpolicyrecommendation/rest.go ./pkg/apiserver/registry/intelligence/networkpolicyrecommendation/rest_test.go ./pkg/apiserver/registry/system/supportbundle/rest.go ./pkg/apiserver/registry/system/supportbundle/rest_test.go ./pkg/apiserver/registry/stats/clickhouse/rest.go ./pkg/apiserver/registry/stats/clickhouse/rest_test.go

@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

algoType String,
algoCalc Float64,
throughput Float64,
anomaly String,
id String
) engine=ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/{table}', '{replica}')
ORDER BY (flowStartSeconds);
ORDER BY (flowEndSeconds);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for curiosity, why do we change this from flowStartSeconds to flowEndSeconds?

Copy link
Contributor Author

@tushartathgur tushartathgur Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is because we are reusing the same table for aggregated flow output and it will not have flowStartSeconds col in aggregated output, but the common column in both the outputs is flowEndSeconds

Comment on lines 75 to 80
SELECT
id,
sourcePodNamespace,
sourcePodLabels,
flowEndSeconds,
throughput,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's the mix of spaces and tabs makes the indentation looks strange here. Could you unify them?

Comment on lines 37 to 39
aggtadpodQuery
aggtadpod2podQuery
aggtadpod2svcQuery
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep the camel-case naming to make them aggTadPodQuery, aggTadPod2PodQuery, aggTadPod2SvcQuery

)
}

func deleteTADid(cmd *cobra.Command, tadName string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func deleteTADid(cmd *cobra.Command, tadName string) error {
func deleteTADId(cmd *cobra.Command, tadName string) error {

Comment on lines 110 to 113
case "e2e":
fmt.Fprintf(w, "id\tsourceIP\tsourceTransportPort\tdestinationIP\tdestinationTransportPort\tflowStartSeconds\tflowEndSeconds\tthroughput\taggType\talgoType\talgoCalc\tanomaly\n")
for _, p := range tad.Stats {
fmt.Fprintf(w, "%v\t%v\t%v\t%v\t%v\t%v\t%v\t%v\t%v\t%v\t%v\t%v\n", p.Id, p.SourceIP, p.SourceTransportPort, p.DestinationIP, p.DestinationTransportPort, p.FlowStartSeconds, p.FlowEndSeconds, p.Throughput, p.AggType, p.AlgoType, p.AlgoCalc, p.Anomaly)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feels this is not quite readable. Is it possible to construct a matrix and reuse the TableOutput function here?

Comment on lines 262 to 270
throughputAnomalyDetectionAlgoCmd.Flags().String(
"agg-flow",
"",
`Specifies which aggregated flow to perform anomaly detection on, options are pods/pod2pod/pod2svc`,
)
throughputAnomalyDetectionAlgoCmd.Flags().String(
"p2p-label",
"",
`On choosing agg-flow as pod2pod, user need to mention labels for inbound/outbound throughput`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the newly added functionality, could you include them in documentation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment, I have added it in the other PR 192

f.col("new.anomaly").alias("anomaly"))
def filter_df_with_true_anomalies(spark, plotDF, algo_type, agg_flow=None):
if agg_flow:
plotDF = plotDF.withColumn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend merging the shared operations found in lines 317-343. Looks like the different operations are selecting different columns in 322-324 and 336-338.

f.collect_list("flowEndSeconds").alias("flowEndSeconds"),
f.stddev_samp("max(throughput)").alias("throughputStandardDeviation"),
f.collect_list(f.struct(["Diff_Secs", "max(throughput)"])).alias(
"Diff_Secs, Throughput"))
f.collect_list(f.struct(["max(throughput)"])).alias("Throughput"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed last Friday, this should be sum(throughput) for aggregated TAD, I noticed you put that changes on #192, can you move all functional code changes inside this PR and let #192 have test codes only?

@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

Copy link
Contributor

@dreamtalen dreamtalen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main issue is that the implementation has deviated from our initial design. If you prefer, we can discuss this further offline.

result = append(result, []string{p.Id, p.SourcePodNamespace, p.SourcePodLabels, p.DestinationPodNamespace, p.DestinationPodLabels, p.FlowEndSeconds, p.Throughput, p.AggType, p.AlgoType, p.AlgoCalc, p.Anomaly})
}
case "pod_to_svc":
result = append(result, []string{"id", "sourcePodNamespace", "sourcePodLabels", "destinationServicePortName", "flowEndSeconds", "throughput", "aggType", "algoType", "algoCalc", "anomaly"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the 'pod_to_svc' type aligns with our design. Our goal was to monitor the aggregated throughput of all traffic directed to a Service, so the 'sourcePodNamespace' and 'sourcePodLabels' parameters are not relevant in this case.

for _, p := range tad.Stats {
result = append(result, []string{p.Id, p.SourceIP, p.SourceTransportPort, p.DestinationIP, p.DestinationTransportPort, p.FlowStartSeconds, p.FlowEndSeconds, p.Throughput, p.AggType, p.AlgoType, p.AlgoCalc, p.Anomaly})
}
case "pod_to_external":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern for the 'pod_to_external' type. I think our goal was to monitor the aggregated throughput of all traffic directed to an external IP.

for _, p := range tad.Stats {
result = append(result, []string{p.Id, p.SourcePodNamespace, p.SourcePodLabels, p.FlowEndSeconds, p.Throughput, p.AggType, p.AlgoType, p.AlgoCalc, p.Anomaly})
}
case "pod_to_pod":
Copy link
Contributor

@dreamtalen dreamtalen Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"pod_to_pod" seems a fair usecase here, just want to point out our initial goal was to monitor the aggregated in/out bound throughput of a set of specific pod labels.

throughputAnomalyDetection.AggregatedFlow = aggregatedFlow
throughputAnomalyDetection.ExternalIP = externalIp
case "svc":
throughputAnomalyDetection.AggregatedFlow = aggregatedFlow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we require users input Pod labels and IP for "pod" and "external" cases but not require a service name for the "svc" case?
I feel Pod labels and external IP are optional input, if user doesn't provide this info, we will considering all possible pod labels and external IPs. Like what we have done in the "svc" case.

Copy link
Contributor Author

@tushartathgur tushartathgur Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can add them as optional parameters, just wanted to confirm if the user doesn’t provide labels or external ip, should we have a check if the corresponding columns are empty or should we even include those columns in DF ?

Copy link
Contributor

@dreamtalen dreamtalen Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example if user chooses "external" agg type and doesn't provided any input IP, we will consider all toExternal flows, group by the destinationIP, aggregated throughput for each destinationIP, and find anomalies if any.
Hope this exmaple answered your question.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for “external” case, it may work as we have flow type to define if the traffic is for external, but there could be pods with no labels, and pods with any type of label, should we add both of them? they both would have different kind of queries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add both the cases of empty and non empty case for labels, also we can add svc name as argument for user to provide a specific service name, but in case of service age_type, we should only consider non empty cases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we can consider the non empty case only for now if the empty case needs big changes.
Maybe you can add another parameter "podname" in the pod agg case, it will help cover the cases of pods with no labels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, we should not consider the empty cases, as we look for pods based of labels, if there are no labels it would be little misleading to still collect them based of their name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was to have a user input parameter 'podname' that would cover cases where the user knows the pods they're interested in don't have labels.
Just a nice-to-have.

@tushartathgur tushartathgur force-pushed the agg_flow_part2 branch 5 times, most recently from 4c2299f to ed6a2b4 Compare March 30, 2023 19:36
@@ -198,3 +198,17 @@ ALTER TABLE flows
ALTER TABLE flows_local
DROP COLUMN egressName,
DROP COLUMN egressIP;
ALTER TABLE tadetector
DROP COLUMN podNamespace;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be a comma instead of semicolon between DROP COLUMNs, please check my previous comment: #184 (comment)

@@ -137,3 +137,17 @@ ALTER TABLE flows
ALTER TABLE flows_local
ADD COLUMN egressName String,
ADD COLUMN egressIP String;
ALTER TABLE tadetector
ADD COLUMN podNamespace;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same above, use comma and add datatype in ADD COLUMNs, please check my previous comment: #184 (comment)

@tushartathgur tushartathgur force-pushed the agg_flow_part2 branch 2 times, most recently from 512a38a to 11741fa Compare July 8, 2023 00:13
@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

2 similar comments
@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

@elton-furtado
Copy link
Contributor

/theia-test-e2e

@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

2 similar comments
@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

Signed-off-by: Tushar Tathgur <[email protected]>
@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

@tushartathgur tushartathgur merged commit aa09d11 into antrea-io:main Jul 12, 2023
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants