antrea-io / theia
Network observability for Kubernetes
License: Apache License 2.0
Currently, the snowflake folder dependencies are not automatically checked for vulnerabilities, nor are PRs made by Dependabot to fix any vulnerabilities. This is because Dependabot hasn't been set up for the snowflake folder yet.
Please refer to https://github.com/antrea-io/theia/actions/workflows/clair.yml
Detect throughput anomalies in Antrea flows using well-known algorithms.
The feature should be able to perform detection for individual flows, or for a given Pod's traffic (source and destination).
The feature should provide:
The PR from Dependabot to bump Antrea from 1.8.0 to 1.11.1 cannot be merged directly due to errors in E2E tests: #267. There are some API changes, including breaking changes. We need to resolve those errors manually and do the upgrade.
Expand the scope of throughput anomaly detection beyond individual endpoint-to-endpoint flows, for example:
Notes:
In the current dependency plugin, individual Pod traffic is shown. In instances where multiple replicas of multiple Pods exist, the graph can easily become cluttered and unreadable. To remedy this, I suggest we add a toggle to the Network Topology Dashboard that allows users to group Pods by Pod labels.
Currently, PR #171 only includes support for the unit-tests flag. Additional configuration is needed, besides changes to codecov.yml and kind.yml, to include support.
Saw a kind e2e test failure at ClickHouseMonitor:
=== RUN TestFlowVisibility/IPv4/ClickHouseMonitor
I0130 20:07:47.602139 21006 flowvisibility_test.go:629] Generating flow records to exceed monitor threshold...
I0130 20:08:05.824923 21006 flowvisibility_test.go:634] Waiting for the flows to be exported...
I0130 20:08:35.960261 21006 flowvisibility_test.go:641] Waiting for the monitor to detect and clean up the ClickHouse storage
flowvisibility_test.go:678:
Error Trace: flowvisibility_test.go:678
flowvisibility_test.go:643
flowvisibility_test.go:519
Error: Max difference between 17877 and 9189.5 allowed is 2681.5499999999997, but difference was 8687.5
Test: TestFlowVisibility/IPv4/ClickHouseMonitor
Messages: Difference between expected and actual number of deleted Records should be lower than 15%
https://github.com/antrea-io/theia/actions/runs/4047547236/jobs/6961777226
Currently, policy recommendation Spark jobs are written in Python and lack unit test coverage. Since Python is an interpreted language, unit test coverage is important to ensure correct functionality.
We need to add unit tests for this code, and Python CI on GitHub to automatically run the unit tests when Python code changes are detected.
In the FlowVisibility test, we use a time sleep to wait for the Flow Aggregator (FA) to set up the connection with the ClickHouse DB. It would make more sense to instead wait until the FA Pod has been in the "Running" state for a period of time, e.g. 20s, so that we know the Pod did not crash due to the 10s ping timeout and has successfully set up the connection.
An alternative is to add a log line in the ClickHouse client source code once it has successfully connected to the ClickHouse DB, and check the Pod log in the FlowVisibility test.
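A minimal sketch of the first option, assuming client-go is available to the e2e test helpers; the polling interval and the restart-count check are illustrative, not the existing test code:

```go
package e2etests

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodStable polls the FA Pod and only returns once it has stayed in the
// Running phase, with no new container restarts, for stableFor (e.g. 20s).
func waitForPodStable(client kubernetes.Interface, namespace, name string, stableFor, timeout time.Duration) error {
	var runningSince time.Time
	var lastRestarts int32
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // transient API errors: keep polling
		}
		var restarts int32
		for _, cs := range pod.Status.ContainerStatuses {
			restarts += cs.RestartCount
		}
		if pod.Status.Phase != corev1.PodRunning || restarts != lastRestarts {
			lastRestarts = restarts
			runningSince = time.Time{} // not stable yet, reset the timer
			return false, nil
		}
		if runningSince.IsZero() {
			runningSince = time.Now()
		}
		return time.Since(runningSince) >= stableFor, nil
	})
}
```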
Background
Code provided by us / target code to be tested:
All dashboard JSON files, including panel configuration and ClickHouse SQL queries.
Aspects that can be tested:
Grafana data logic flow:
Solution 1 - Grafana HTTP API
The Grafana backend exposes an HTTP API, which is the same API that is used by the frontend to do everything from saving dashboards and creating users to updating data sources.
Capabilities relevant to our use case:
Limitation
However, the Grafana HTTP API does not include an endpoint that can execute a query and return the query result.
Ref issue1: https://community.grafana.com/t/backend-api-to-get-query-result/67293/2
Ref issue2: https://community.grafana.com/t/grafana-http-api-to-get-panel-json-data/63901/3
Ref issue3: https://community.grafana.com/t/dashboard-api-returns-query-results-as-well/5556/7
If we want to verify the query result, one alternative is: send a request to the Grafana dashboard API, get the dashboard JSON, extract the query from the dashboard JSON, and run the query independently against the datasource.
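A rough sketch of that alternative in Go, assuming a Grafana API token and a ClickHouse datasource reachable through database/sql; the dashboard JSON field names (panels, targets, rawSql) follow the usual dashboard schema but should be verified against our exported dashboards, and any Grafana template variables in the queries would still need to be substituted before running them:

```go
package dashboardtest

import (
	"database/sql"
	"encoding/json"
	"fmt"
	"net/http"

	_ "github.com/ClickHouse/clickhouse-go/v2" // registers the "clickhouse" driver
)

type dashboardResponse struct {
	Dashboard struct {
		Panels []struct {
			Title   string `json:"title"`
			Targets []struct {
				RawSQL string `json:"rawSql"` // field name to be checked against our dashboards
			} `json:"targets"`
		} `json:"panels"`
	} `json:"dashboard"`
}

// queriesFromDashboard fetches a dashboard over the HTTP API and returns the SQL
// queries found in its panel targets.
func queriesFromDashboard(grafanaURL, apiToken, uid string) ([]string, error) {
	req, err := http.NewRequest("GET", grafanaURL+"/api/dashboards/uid/"+uid, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiToken)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var d dashboardResponse
	if err := json.NewDecoder(resp.Body).Decode(&d); err != nil {
		return nil, err
	}
	var queries []string
	for _, p := range d.Dashboard.Panels {
		for _, t := range p.Targets {
			if t.RawSQL != "" {
				queries = append(queries, t.RawSQL)
			}
		}
	}
	return queries, nil
}

// runAgainstClickHouse executes each extracted query directly against the datasource.
func runAgainstClickHouse(dsn string, queries []string) error {
	db, err := sql.Open("clickhouse", dsn)
	if err != nil {
		return err
	}
	defer db.Close()
	for _, q := range queries {
		rows, err := db.Query(q)
		if err != nil {
			return fmt.Errorf("query %q failed: %w", q, err)
		}
		rows.Close() // a real test would compare the returned rows against expected data
	}
	return nil
}
```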
Solution 2 - Grafana e2e package
It provides a package built on top of Cypress, which allows us to define actions on the app and the corresponding expected outcomes, e.g. open a dashboard and check that the correct panels are displayed. The package does not seem to have clear documentation, but there are some example test suites to get started with.
Limitation
If the Theia manager is restarted due to events such as errors, eviction, or user actions, we must make sure it synchronizes with the latest state of both ClickHouse and Spark.
From a scale perspective, we should ensure this synchronization does not add significant time to the startup process when there is a large data set to validate.
All of our Spark applications, including Policy Recommendation and Throughput Anomaly Detection, are built on the gcr.io/spark-operator/spark-py Docker image and have their dependent libraries installed on top.
However, this image has a relatively large size of around 1 GB. To save disk space on the user's node and prepare for the addition of more Spark applications in the future, we aim to create a unified Docker image for all of these applications.
Currently, our e2e test for policy recommendation only covers a single execution of the run/status/retrieve CLI commands, and only a single Pod-to-Pod flow is generated for the test of the retrieve command.
To have better e2e test coverage, we are planning to add these test cases:
list and delete commands (#49)
Because a recommendation job may take several minutes to complete, to minimize the test running time, I'm considering adding the Pod-to-Service and Pod-to-External flows into the current test of the retrieve command instead of creating separate test cases.
To test the failed cases, we could simulate a test case where the Driver Pod is destroyed and unavailable while the policy recommendation job is still running. Then we could check the Status and FailedReason of this job through the status command.
> theia pr list
CreateTime CompleteTime ID Status
2022-06-17 18:33:15 N/A 2cf13427-cbe5-454c-b9d3-e1124af7baa2 RUNNING
2022-06-17 18:06:56 2022-06-17 18:08:37 69e2e543-60e9-4d45-97a1-d56337966579 COMPLETED
2022-06-16 23:41:43 2022-06-16 23:43:15 a65daf22-8e7e-4479-9f4e-edc1d99716ff COMPLETED
N/A 2022-06-13 22:19:17 749ecc41-bf5e-4d08-88ef-fb66b60bf1fb COMPLETED
N/A 2022-06-15 21:41:16 1e7ffc6d-2321-422d-b982-0ffca2d7987f COMPLETED
We will fetch and display all sparkapplication resources from the K8s API server first, then we will check the recommendation result table of the ClickHouse DB for additional completed jobs that are not in the K8s API server (probably they were deleted by users manually). CompleteTime of uncompleted jobs will show as N/A, and CreateTime of jobs fetched from the ClickHouse result table will show as N/A since we didn't save that info in the DB.
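A rough sketch of that merge logic, assuming a dynamic client for the SparkApplication CRD (sparkoperator.k8s.io/v1beta2) and the clickhouse-go driver; the recommendations table and id column names are placeholders to be checked against the actual schema:

```go
package listjobs

import (
	"context"
	"database/sql"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var sparkAppGVR = schema.GroupVersionResource{
	Group: "sparkoperator.k8s.io", Version: "v1beta2", Resource: "sparkapplications",
}

// listJobIDs merges job IDs known to the K8s API server with completed jobs that
// only exist in the ClickHouse result table (their SparkApplication may have been
// deleted manually); CreateTime is unknown for the latter, so the CLI shows N/A.
func listJobIDs(ctx context.Context, dyn dynamic.Interface, db *sql.DB, namespace string) (map[string]bool, error) {
	ids := map[string]bool{}
	apps, err := dyn.Resource(sparkAppGVR).Namespace(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for _, app := range apps.Items {
		// in practice the job ID may be derived from the SparkApplication name
		ids[app.GetName()] = true
	}
	rows, err := db.QueryContext(ctx, "SELECT DISTINCT id FROM recommendations")
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids[id] = true
	}
	return ids, rows.Err()
}
```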
If the status of a job is not completed (e.g. running, failed), the sparkapplication behind the job will be deleted. Otherwise, both the sparkapplication behind the job and the recommendation result in the database will be deleted. e.g.:
> theia pr delete 1e7ffc6d-2321-422d-b982-0ffca2d7987f
Successfully deleted policy recommendation job with ID 1e7ffc6d-2321-422d-b982-0ffca2d7987f
Do we need to add a status column in the recommendations table?
Currently, statuses of policy recommendation jobs are obtained from the k8s API server. Only completed jobs will write results into the recommendations table.
May need to think about how to sync status between the k8s API server and database. (Should be handled in the middle layer application later)
For now, let's not add more columns to the recommendation result table.
Are there other columns we would like to add to the list command result?
Job parameters, number of flows, etc.
(For now we don't plan to show job parameters in the list command since they are too much to display in a table; users can see them by describing the driver Pod.)
Check the failure reason for failed jobs and write it into the database.
Could check the debug APIs of the Spark operator first.
There is an Error Message field in a failed sparkapplication; we could display it in the status command of the CLI. Writing it into the database should be handled in the middle layer application later.
Currently, to triage issues in Theia, logs need to be individually collected from each component. It would be useful to have a workflow to capture a support bundle for all Theia components.
Flow Aggregator: /var/log/antrea/flow-aggregator.
ClickHouse: /var/log/clickhouse-server. clickhouse-server.log contains all operational logs and access/query logs, and a separate error log, clickhouse-server.err.log, is dedicated to errors. The entire directory should be captured.
Grafana: the server is started as grafana-server --homepath=/usr/share/grafana --config=/etc/grafana/grafana.ini --packaging=docker cfg:default.log.mode=console cfg:default.paths.data=/var/lib/grafana, so its logs go to the console and can be retrieved with kubectl, but they are currently not universally persisted at a single place.
We may consider adding the log collection functionality to the pr common workflow, and calling it as part of log bundle collection. Collecting console logs (kubectl logs) should also be fine at the start.
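A rough sketch of that console-log collection with client-go; the flow-visibility namespace and the output layout are assumptions, and files on disk inside the containers (e.g. the ClickHouse log directory) would still need to be copied out separately:

```go
package supportbundle

import (
	"context"
	"io"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// collectPodLogs streams the logs of every container in the given namespace
// (e.g. flow-visibility) into one file per container under outDir.
func collectPodLogs(client kubernetes.Interface, namespace, outDir string) error {
	pods, err := client.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		for _, c := range pod.Spec.Containers {
			req := client.CoreV1().Pods(namespace).GetLogs(pod.Name, &corev1.PodLogOptions{Container: c.Name})
			stream, err := req.Stream(context.TODO())
			if err != nil {
				continue // skip containers whose logs cannot be fetched
			}
			f, err := os.Create(filepath.Join(outDir, pod.Name+"_"+c.Name+".log"))
			if err != nil {
				stream.Close()
				return err
			}
			_, _ = io.Copy(f, stream)
			f.Close()
			stream.Close()
		}
	}
	return nil
}
```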
Besides the logs of each component, running status and configuration files may also be collected as part of the support bundle. At a lower priority, the following should be considered to be collected as well:
In workflows, the set-output command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information, see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/
hack/generate-manifest.sh --mode antrea-e2e is used to generate the YAML manifest for e2e tests in the Antrea repository, which should only include a ClickHouse server with default credentials. We need the script to create the YAML manifest because it makes it easier to keep the ClickHouse data schema in Theia and Antrea in sync.
The current script is a bit out-of-date: since we added new features to Theia, running the command gives us a manifest with many more resources than the required ClickHouse server. We would like to remove all the unnecessary resources from the generated manifest, including theia-manager, zookeeper, and theia-cli. We would also like to replace projects.registry.vmware.com/antrea/theia-clickhouse-server with projects.registry.vmware.com/antrea/clickhouse-server in the generated manifest, as the Antrea e2e test does not need to run with Theia-built images.
Currently, the graph displayed on the network topology dashboard uses default Mermaid.js colors. As a result, the graph looks monotonous, and it is hard to differentiate between Pods and Services. To fix this, we could implement some simple theming and coloring that matches the other dashboards.
Issue description: Names for columns and filters are exactly the ones returned by queries
Improvement: Have friendly names for columns used in tables and filters. For instance "flowEndSecondsFromSourceNode" could be displayed as "end of flow as reported by source"
In the current Theia powered by Snowflake, we are using an opensource Snowflake datasource plugin: https://github.com/michelin/snowflake-grafana-datasource. You can find the deployment documentation at: https://github.com/antrea-io/theia/tree/main/snowflake#deployments
Compared to regular Theia powered by ClickHouse, one feature we are missing in the Snowflake datasource plugin is the ad-hoc filter. You can find example usage of the filters in our documentation. I previously opened a feature request issue in the plugin repo, but have not seen any updates on it. We should expect to add this feature on our own.
Goal: visualize how pods send/receive traffic to/from other pods or services.
This should be done with a connected graph where every node is a Pod or a Service, and edges between nodes represent the amount of traffic sent/received in the selected interval.
The ClickHouse cluster needs ZooKeeper for deletion. But when uninstalling Theia as a whole with the YAML file or Helm chart, ZooKeeper is always deleted before ClickHouse. As a result, the ClickHouse deletion gets stuck, while the ClickHouse Operator keeps trying to connect to ZooKeeper for around 4 minutes and eventually fails.
Currently we instruct the user in the docs to manually delete ClickHouse first and then uninstall the rest of Theia. But for usability, we need an operator to take care of the installation and uninstallation order of the ClickHouse cluster and ZooKeeper.
Currently, Python files, specifically the Spark jobs, are not included in code coverage reporting. The setup required to generate the Python coverage already exists; Codecov just needs to point to the generated file.
Go 1.20 comes with a variety of useful features and updates. A guide describing all the changes can be found here.
Easier Coverage Calculation
The primary reason for updating to 1.20 is that coverage is easier to generate for end-to-end tests. Previously, using instrumented binaries with a coverage collector was the only way to ensure full and correct coverage calculations. This process is slightly cumbersome and requires setup in several different files. Go 1.20 changes this by allowing the go build command itself to generate the instrumented binaries necessary for testing, by providing the -cover argument. This instrumented binary is then run multiple times across the test cases and generates several reports, which can be merged into a larger report to encompass the kind-e2e-tests flag.
WithCancelCause()
This function allows the goroutine that calls cancel to pass an error that describes the reason for the cancel. The cancel function is used in several places throughout Theia code.
errors.Join()
This function allows multiple errors to be joined and returned as one. It doesn't seem like Theia attempts to return multiple errors anywhere.
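For reference, a minimal illustration of both Go 1.20 APIs mentioned above (not Theia code):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var errClickHouse = errors.New("clickhouse query failed")
var errSpark = errors.New("spark job submission failed")

func main() {
	// context.WithCancelCause: the caller of cancel records why it cancelled,
	// and other goroutines can read the reason back with context.Cause.
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errSpark)
	fmt.Println(ctx.Err())          // context canceled
	fmt.Println(context.Cause(ctx)) // spark job submission failed

	// errors.Join: combine several errors into one; errors.Is still matches each of them.
	err := errors.Join(errClickHouse, errSpark)
	fmt.Println(errors.Is(err, errClickHouse)) // true
}
```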
stdlib is now a package
The standard library will no longer be pre-compiled into $GOROOT/pkg; its packages will instead be built as needed and cached in the build cache.
math/rand.Seed() and math/rand.Read() deprecated
The previous global random number generator used a workaround to receive a truly random seed at program start. With 1.20, the generator is now automatically seeded with a random value. Theia uses this workaround in several places, all of which can easily be updated.
http.ResponseWriter can now implement optional interfaces, though they are not discoverable. Perhaps later development of this feature will make it more useful for Theia use cases.
There are other minor changes, such as easier slice-to-array conversion, but nothing that affects the Theia codebase significantly. Feel free to comment feedback regarding potential features with these updates or changes to the estimated scope of each feature change.
Instead of requiring manual triggering of throughput anomaly detection jobs by users, we could implement real-time detection through a continuously running Spark streaming job.
Subramanian has already implemented a Spark streaming job that can be found at https://github.com/antrea-io/theia/blob/b1bcc1a2b48b1d2617afbd6f5fca53ac716f08f5/plugins/anomaly-detection/SparkStreaming.py.
To fully realize this feature, we need to undertake the following steps:
Currently, Dependabot can only raise security alerts at: https://github.com/antrea-io/antrea/security/dependabot .
Expected:
Dependabot should be able to automatically raise PRs to update dependencies, like what Antrea has: antrea-io/antrea#3442
Describe what you are trying to solve:
Currently, if we want to acquire metrics from the ClickHouse database, we need to go to each shard (ClickHouse cluster) and send queries to get the information we need, which is time consuming.
Describe the solution you have in mind :
Use the CLI to directly send built-in queries to retrieve metrics from the ClickHouse database.
Describe how your solution impacts user flows :
Users can acquire basic information (storage usage, table info, current insertion rate) for each shard.
Users can choose whether to print the raw 2-d array or a formatted table of the returned metrics by using the --print-table flag.
For example, when a user types
theia get clickhouse --diskInfo --print-table
Shard | Name | Path | Free | Total | Used Percentage |
---|---|---|---|---|---|
1 | default | /var/lib/clickhouse/ | 1.75 GiB | 1.84 GiB | 5.04 |
2 | default | /var/lib/clickhouse/ | 1.75 GiB | 1.84 GiB | 4.81 |
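A minimal sketch of the kind of built-in query that could back --diskInfo, assuming the clickhouse-go v2 driver and the standard system.disks table; per-shard iteration and table formatting are omitted:

```go
package chmetrics

import (
	"database/sql"
	"fmt"

	_ "github.com/ClickHouse/clickhouse-go/v2" // registers the "clickhouse" driver
)

// printDiskInfo queries disk usage from one ClickHouse shard and prints
// name, path, free space, total space, and used percentage.
func printDiskInfo(dsn string) error {
	db, err := sql.Open("clickhouse", dsn)
	if err != nil {
		return err
	}
	defer db.Close()
	rows, err := db.Query("SELECT name, path, free_space, total_space FROM system.disks")
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var name, path string
		var free, total uint64
		if err := rows.Scan(&name, &path, &free, &total); err != nil {
			return err
		}
		fmt.Printf("%s %s %d %d %.2f%%\n", name, path, free, total,
			float64(total-free)/float64(total)*100)
	}
	return rows.Err()
}
```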
The current overall code coverage in Theia is 50%; we need to improve the coverage to at least 60% in the next release.
The codecov bot report is also missing from every PR; this needs to be investigated further and resolved.
Currently there is no way to detect port scanning attacks on Pods. The current infrastructure of both the flow exporter and the aggregator does not allow for detection of such attacks, as they only focus on complete flows. The goal is to leverage the failed connection requests that occur as a result of such attacks to identify and highlight potentially malicious connection requests to the user.
In order to implement a port scan attack detector, several tasks exist:
incomplete_flows instead of flows
After fleshing out the design more thoroughly and getting feedback from team members, I will comment a design document under this issue.
It seems that Theia supports multiple clusters, i.e. different Antrea Flow Aggregator instances connecting from different clusters.
I believe that the documentation should be updated to reflect that (I could not find a reference to this?), and we should probably provide information on how to use it.
IMO, there is a key missing feature for multi-cluster support: TLS support for the connection between the Flow Aggregator and ClickHouse server. See antrea-io/antrea#4902.
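For reference, a rough sketch of what a TLS-enabled Flow Aggregator to ClickHouse connection could look like with the clickhouse-go v2 driver; the CA file handling and the secure port are assumptions, not Theia's current configuration:

```go
package chconn

import (
	"crypto/tls"
	"crypto/x509"
	"database/sql"
	"os"

	"github.com/ClickHouse/clickhouse-go/v2"
)

// openSecureClickHouse connects over TLS, trusting the CA certificate in caFile.
func openSecureClickHouse(addr, user, password, caFile string) (*sql.DB, error) {
	caCert, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caCert)
	return clickhouse.OpenDB(&clickhouse.Options{
		Addr: []string{addr}, // e.g. the ClickHouse secure native port, typically 9440
		Auth: clickhouse.Auth{Username: user, Password: password},
		TLS:  &tls.Config{RootCAs: pool},
	}), nil
}
```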
There is a problem reported by a user when deploying ClickHouse with PV and the default ZooKeeper.
How to reproduce the problem
After deploying ClickHouse with PV and the default ZooKeeper, if the ZooKeeper Pod crashes first, followed by a crash of the ClickHouse Pod, the bug will show up. E.g., if we run the following commands, the ClickHouse Pod will keep crashing.
kubectl delete pod zookeeper-0 -n flow-visibility
kubectl delete pod chi-clickhouse-clickhouse-0-0-0 -n flow-visibility
We will be able to see the following in the ClickHouse Pod log:
...
2023.07.20 23:05:30.492951 [ 106 ] {} <Warning> default.flows_local (b0182a24-d467-4e9e-8f4a-908322270b64): No metadata in ZooKeeper for /clickhouse/tables/0/default/flows_local: table will be in readonly mode.
...
2023.07.20 23:05:31.086664 [ 52 ] {93600396-a3b1-42ca-acbb-5545f10eca29} <Error> TCPHandler: Code: 242. DB::Exception: Table is in readonly mode (replica path: /clickhouse/tables/0/default/flows_local/replicas/chi-clickhouse-clickhouse-0-0). (TABLE_IS_READ_ONLY), Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xb8a4c1a in /usr/bin/clickhouse
1. DB::Exception::Exception<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&>(int, fmt::v8::basic_format_string<char, fmt::v8::type_identity<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&>::type>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0xb90e6d8 in /usr/bin/clickhouse
2. DB::StorageReplicatedMergeTree::alter(DB::AlterCommands const&, std::__1::shared_ptr<DB::Context const>, std::__1::unique_lock<std::__1::timed_mutex>&) @ 0x16904fac in /usr/bin/clickhouse
3. DB::InterpreterAlterQuery::executeToTable(DB::ASTAlterQuery const&) @ 0x161a1fe8 in /usr/bin/clickhouse
4. DB::InterpreterAlterQuery::execute() @ 0x1619ff72 in /usr/bin/clickhouse
5. ? @ 0x1656ce56 in /usr/bin/clickhouse
6. DB::executeQuery(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x1656a675 in /usr/bin/clickhouse
7. DB::TCPHandler::runImpl() @ 0x1714c8ea in /usr/bin/clickhouse
8. DB::TCPHandler::run() @ 0x1715f1d9 in /usr/bin/clickhouse
9. Poco::Net::TCPServerConnection::start() @ 0x19dc77f3 in /usr/bin/clickhouse
10. Poco::Net::TCPServerDispatcher::run() @ 0x19dc8b71 in /usr/bin/clickhouse
11. Poco::PooledThread::run() @ 0x19f79e3b in /usr/bin/clickhouse
12. Poco::ThreadImpl::runnableEntry(void*) @ 0x19f77540 in /usr/bin/clickhouse
13. ? @ 0x7fe5224f7609 in ?
14. __clone @ 0x7fe52241c133 in ?
Received exception from server (version 22.6.9):
Code: 242. DB::Exception: Received from 127.0.0.1:9000. DB::Exception: Table is in readonly mode (replica path: /clickhouse/tables/0/default/flows_local/replicas/chi-clickhouse-clickhouse-0-0). (TABLE_IS_READ_ONLY)
(query: ALTER TABLE flows_local MODIFY TTL timeInserted + INTERVAL 12 HOUR;)
What we want to do
It turns out the root cause is that the default ZooKeeper deployment does not use a PV, so the data in ZooKeeper is completely lost when it crashes. ClickHouse converts the table to READONLY in this case.
Here we have several strategies we can consider:
https://clickhouse.com/docs/en/sql-reference/statements/system#restore-replica
If we need to use this strategy, we will need to investigate how to use it during Pod initialization.
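For illustration only, a minimal sketch of issuing the documented statement from Go during Pod initialization; whether additional steps (e.g. SYSTEM RESTART REPLICA) are required first still needs to be investigated:

```go
package chinit

import (
	"database/sql"

	_ "github.com/ClickHouse/clickhouse-go/v2" // registers the "clickhouse" driver
)

// restoreReplica attempts to restore the readonly replicated table on the local server.
func restoreReplica(dsn string) error {
	db, err := sql.Open("clickhouse", dsn)
	if err != nil {
		return err
	}
	defer db.Close()
	_, err = db.Exec("SYSTEM RESTORE REPLICA flows_local")
	return err
}
```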
There might be a risk here if the data in the ZooKeeper PV is lost.
IMO, it would be good to add the ZooKeeper PV first. We may also want to investigate how to avoid the crash if the data is lost on the ZooKeeper side, so that users can still recover through the ClickHouse strategies. Basically, this error may not appear if we remove the ALTER TABLE clause for the TTL. We may need to investigate a better way to update the TTL.
Throughput reported in Grafana is not correct, which is caused by the incorrect StopTime.
In the FlowAggregator (FA), throughput is calculated as deltaBytes / (latest_flow_end_time - previous_flow_end_time).
In the FA, the latest_flow_end_time is reported by the FlowExporter (FE), and the FE always sets the flow StopTime to time.Now(). Since the default pollInterval in the FE is 5s, the StopTime of the current and previous records always has a gap of up to 5s, which means there is a time difference (up to 5s) between the real StopTime and the reported StopTime.
We need to improve the accuracy of the StopTime (flowEndTime) to get a correct throughput value.
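For illustration, a small example of the calculation above and how an error of up to 5s in the reported flowEndTime skews the result (the numbers are made up):

```go
package main

import (
	"fmt"
	"time"
)

// throughput in bytes/s, following the formula described above.
func throughput(deltaBytes uint64, previousEnd, latestEnd time.Time) float64 {
	return float64(deltaBytes) / latestEnd.Sub(previousEnd).Seconds()
}

func main() {
	start := time.Now()
	// With a real interval of 30s, 3 MB of delta bytes means 100 kB/s.
	fmt.Println(throughput(3_000_000, start, start.Add(30*time.Second)))
	// If the reported flowEndTime is off by 5s, the same flow appears ~20% faster.
	fmt.Println(throughput(3_000_000, start, start.Add(25*time.Second)))
}
```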
Up to Theia 0.3 the full output of a Spark Policy Recommendation job is stored in a single record.
We should have distinct records for each recommended policy, and handle corresponding data migrations.
The Theia manager controller for recommended policies should also be amended to reflect this change.
From looking here - https://github.com/antrea-io/theia/blob/main/build/charts/theia/templates/clickhouse/clickhouseinstallation.yaml#L12 - there is only the ability to create the clickhouse_operator user. It would be great if we could create additional users for read-only access. From looking at the profiles that get created, there is a default readonly profile, which could be assigned to a read-only user.
Antrea supports L7 Network Policy with the HTTP protocol.
In Theia, we want to support L7 visibility: the L7 flows collected from Antrea shall be stored in ClickHouse and then displayed to the user in the Grafana UI.
Add capabilities that might give users insights oriented toward understanding how applications deployed in a K8s cluster communicate.
P0 items:
P1 items:
Describe what you are trying to solve
Currently we only support a single-node ClickHouse server, which lacks reliability and the ability to scale horizontally.
Describe the solution you have in mind
Support ClickHouse cluster deployment with shards for horizontal scaling and replicas for reliability.
Describe how your solution impacts user flows
A ClickHouse cluster can manage flow records in a larger network and provide reliability for users.
Describe the main design/architecture of your solution
Refer to the ClickHouse Operator docs to enable the ClickHouse cluster deployment. Add corresponding values in Theia Helm Chart to make the deployment user-friendly.
Describe the bug
Recently I have noticed that the kind e2e test sometimes fails with the errors below. The error may not show up after re-running the test, but I'm not sure what the root cause is.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m13s default-scheduler Successfully assigned flow-aggregator/flow-aggregator-95f95b7bd-4lxj6 to kind-worker2
Normal Pulling 4m37s (x4 over 6m13s) kubelet Pulling image "projects.registry.vmware.com/antrea/flow-aggregator:latest"
Warning Failed 4m37s (x4 over 6m12s) kubelet Failed to pull image "projects.registry.vmware.com/antrea/flow-aggregator:latest": rpc error: code = NotFound desc = failed to pull and unpack image "projects.registry.vmware.com/antrea/flow-aggregator:latest": failed to copy: httpReadSeeker: failed open: content at https://projects.registry.vmware.com/v2/antrea/flow-aggregator/manifests/sha256:b3594bd6fa1a8c5b12f2bdda5a4ddc5f7903f892660a1b4f832b245cf6082f77 not found: not found
Warning Failed 4m37s (x4 over 6m12s) kubelet Error: ErrImagePull
Warning Failed 4m23s (x6 over 6m12s) kubelet Error: ImagePullBackOff
Normal BackOff 67s (x20 over 6m12s) kubelet Back-off pulling image "projects.registry.vmware.com/antrea/flow-aggregator:latest"
Here are some cases for this failure:
The Traffic Drop Detector is used to detect an unreasonable amount of flows dropped or blocked by Network Policies for each endpoint, and to report alerts to admins. It helps identify potential issues with Network Policies, as well as potential security threats, so that appropriate action can be taken to mitigate them.
P0 items:
P1 items:
Currently, Theia does not have any integration tests in place. Adding integration tests requires some research into how they are implemented in Antrea, an outline of which components could be tested, and the necessary steps required to test each of those components. Additionally, integration tests will need to be run as part of CI.