antrea-io / theia
Network observability for Kubernetes
License: Apache License 2.0
Currently, the snowflake folder dependencies are not automatically checked for vulnerabilities, nor are PRs made by Dependabot to fix any vulnerabilities. This is because Dependabot hasn't been set up for the snowflake folder yet.
Please refer to https://github.com/antrea-io/theia/actions/workflows/clair.yml
Detect throughput anomalies in Antrea flows using well-known algorithms.
The feature should be able to perform detection for individual flows, or for a given Pod's traffic (source and destination).
The feature should provide:
The PR from Dependabot to bump Antrea from 1.8.0 to 1.11.1 cannot be merged directly due to errors in E2E tests: #267. There are some API changes, including breaking changes. We need to resolve those errors manually and do the upgrade.
Expand the scope of throughput anomaly detection beyond individual endpoint-to-endpoint flows, for example:
Notes:
In the current dependency plugin, individual Pod traffic is shown. In instances where multiple replicas of multiple Pods exist, the graph can easily become cluttered and unreadable. To remedy this, I suggest we add a toggle to the Network Topology Dashboard that allows users to group Pods by Pod labels.
Currently, PR #171 only includes support for the unit-tests flag. Additional configuration is needed, besides changes to codecov.yml and kind.yml, to include support.
Saw a kind e2e test failure at ClickHouseMonitor:
=== RUN TestFlowVisibility/IPv4/ClickHouseMonitor
I0130 20:07:47.602139 21006 flowvisibility_test.go:629] Generating flow records to exceed monitor threshold...
I0130 20:08:05.824923 21006 flowvisibility_test.go:634] Waiting for the flows to be exported...
I0130 20:08:35.960261 21006 flowvisibility_test.go:641] Waiting for the monitor to detect and clean up the ClickHouse storage
flowvisibility_test.go:678:
Error Trace: flowvisibility_test.go:678
flowvisibility_test.go:643
flowvisibility_test.go:519
Error: Max difference between 17877 and 9189.5 allowed is 2681.5499999999997, but difference was 8687.5
Test: TestFlowVisibility/IPv4/ClickHouseMonitor
Messages: Difference between expected and actual number of deleted Records should be lower than 15%
https://github.com/antrea-io/theia/actions/runs/4047547236/jobs/6961777226
Currently, policy recommendation Spark jobs are written in Python and lack unit test coverage. Since Python is an interpreted language, unit test coverage is important to ensure correct functionality.
We need to add unit tests for this code, and Python CI on GitHub to automatically run the unit tests when Python code changes are detected.
In the FlowVisibility test, we use a time sleep to wait for the Flow Aggregator (FA) to set up the connection with the ClickHouse DB. It would make more sense to instead wait until the FA Pod has been in the "Running" state for a period of time, e.g. 20s, so that we know the Pod did not crash due to the 10s ping timeout and has successfully set up the connection.
An alternative is to add a log line in the ClickHouse client source code once it has successfully connected to the ClickHouse DB, and check the Pod log in the FlowVisibility test.
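A minimal sketch of the first option, assuming client-go is available to the e2e test helpers; the polling interval and the restart-count check are illustrative, not the existing test code:

```go
package e2etests

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodStable polls the FA Pod and only returns once it has stayed in the
// Running phase, with no new container restarts, for stableFor (e.g. 20s).
func waitForPodStable(client kubernetes.Interface, namespace, name string, stableFor, timeout time.Duration) error {
	var runningSince time.Time
	var lastRestarts int32
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // transient API errors: keep polling
		}
		var restarts int32
		for _, cs := range pod.Status.ContainerStatuses {
			restarts += cs.RestartCount
		}
		if pod.Status.Phase != corev1.PodRunning || restarts != lastRestarts {
			lastRestarts = restarts
			runningSince = time.Time{} // not stable yet, reset the timer
			return false, nil
		}
		if runningSince.IsZero() {
			runningSince = time.Now()
		}
		return time.Since(runningSince) >= stableFor, nil
	})
}
```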
Background
Code provided by us / target code to be tested:
All dashboard JSON files, including panel configuration and ClickHouse SQL queries.
Aspects that can be tested:
Grafana data logic flow:
Solution 1 - Grafana HTTP API
The Grafana backend exposes an HTTP API, which is the same API that is used by the frontend to do everything from saving dashboards and creating users to updating data sources.
Capabilities relevant to our use case:
Limitation
However, the Grafana HTTP API does not include an endpoint that can execute a query and return the query result.
Ref issue1: https://community.grafana.com/t/backend-api-to-get-query-result/67293/2
Ref issue2: https://community.grafana.com/t/grafana-http-api-to-get-panel-json-data/63901/3
Ref issue3: https://community.grafana.com/t/dashboard-api-returns-query-results-as-well/5556/7
If we want to verify the query result, one alternative is: send a request to the Grafana dashboard API, get the dashboard JSON, extract the query from the dashboard JSON, and run the query independently against the datasource.
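A rough sketch of that alternative in Go, assuming a Grafana API token and a ClickHouse datasource reachable through database/sql; the dashboard JSON field names (panels, targets, rawSql) follow the usual dashboard schema but should be verified against our exported dashboards, and any Grafana template variables in the queries would still need to be substituted before running them:

```go
package dashboardtest

import (
	"database/sql"
	"encoding/json"
	"fmt"
	"net/http"

	_ "github.com/ClickHouse/clickhouse-go/v2" // registers the "clickhouse" driver
)

type dashboardResponse struct {
	Dashboard struct {
		Panels []struct {
			Title   string `json:"title"`
			Targets []struct {
				RawSQL string `json:"rawSql"` // field name to be checked against our dashboards
			} `json:"targets"`
		} `json:"panels"`
	} `json:"dashboard"`
}

// queriesFromDashboard fetches a dashboard over the HTTP API and returns the SQL
// queries found in its panel targets.
func queriesFromDashboard(grafanaURL, apiToken, uid string) ([]string, error) {
	req, err := http.NewRequest("GET", grafanaURL+"/api/dashboards/uid/"+uid, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiToken)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var d dashboardResponse
	if err := json.NewDecoder(resp.Body).Decode(&d); err != nil {
		return nil, err
	}
	var queries []string
	for _, p := range d.Dashboard.Panels {
		for _, t := range p.Targets {
			if t.RawSQL != "" {
				queries = append(queries, t.RawSQL)
			}
		}
	}
	return queries, nil
}

// runAgainstClickHouse executes each extracted query directly against the datasource.
func runAgainstClickHouse(dsn string, queries []string) error {
	db, err := sql.Open("clickhouse", dsn)
	if err != nil {
		return err
	}
	defer db.Close()
	for _, q := range queries {
		rows, err := db.Query(q)
		if err != nil {
			return fmt.Errorf("query %q failed: %w", q, err)
		}
		rows.Close() // a real test would compare the returned rows against expected data
	}
	return nil
}
```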
Solution 2 - Grafana e2e package
It provides a package built on top of Cypress, which allows us to define actions on the app and the corresponding expected outcomes, e.g. open a dashboard and check that the correct panels are displayed. The package does not seem to have clear documentation, but there are some example test suites to get started with.
Limitation
If the Theia manager is restarted due to events such as errors, eviction, or user actions, we must make sure it synchronizes with the latest state of both ClickHouse and Spark.
From a scale perspective, we should ensure this synchronization does not add significant time to the startup process when there is a large data set to validate.
All of our Spark applications, including Policy Recommendation and Throughput Anomaly Detection, are built on the gcr.io/spark-operator/spark-py Docker image and have their dependent libraries installed on top.
However, this image has a relatively large size of around 1 GB. To save disk space on the user's node and prepare for the addition of more Spark applications in the future, we aim to create a unified Docker image for all of these applications.
Currently, our e2e test for policy recommendation only covers a single execution of the run/status/retrieve CLI commands, and only a single Pod-to-Pod flow is generated for the test of the retrieve command.
To have better e2e test coverage, we are planning to add these test cases:
list and delete commands (#49)
Because a recommendation job may take several minutes to complete, to minimize the test running time, I'm considering adding the Pod-to-Service and Pod-to-External flows into the current test of the retrieve command instead of creating separate test cases.
To test the failed cases, we could simulate a test case where the Driver Pod is destroyed and unavailable while the policy recommendation job is still running. Then we could check the Status and FailedReason of this job through the status command.
> theia pr list
CreateTime CompleteTime ID Status
2022-06-17 18:33:15 N/A 2cf13427-cbe5-454c-b9d3-e1124af7baa2 RUNNING
2022-06-17 18:06:56 2022-06-17 18:08:37 69e2e543-60e9-4d45-97a1-d56337966579 COMPLETED
2022-06-16 23:41:43 2022-06-16 23:43:15 a65daf22-8e7e-4479-9f4e-edc1d99716ff COMPLETED
N/A 2022-06-13 22:19:17 749ecc41-bf5e-4d08-88ef-fb66b60bf1fb COMPLETED
N/A 2022-06-15 21:41:16 1e7ffc6d-2321-422d-b982-0ffca2d7987f COMPLETED
We will fetch and display all sparkapplication resources from the K8s API server first, then we will check the recommendation result table of the ClickHouse DB for additional completed jobs that are not in the K8s API server (probably they were deleted by users manually). CompleteTime of uncompleted jobs will show as N/A, and CreateTime of jobs fetched from the ClickHouse result table will show as N/A since we didn't save that info in the DB.
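A rough sketch of that merge logic, assuming a dynamic client for the SparkApplication CRD (sparkoperator.k8s.io/v1beta2) and the clickhouse-go driver; the recommendations table and id column names are placeholders to be checked against the actual schema:

```go
package listjobs

import (
	"context"
	"database/sql"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var sparkAppGVR = schema.GroupVersionResource{
	Group: "sparkoperator.k8s.io", Version: "v1beta2", Resource: "sparkapplications",
}

// listJobIDs merges job IDs known to the K8s API server with completed jobs that
// only exist in the ClickHouse result table (their SparkApplication may have been
// deleted manually); CreateTime is unknown for the latter, so the CLI shows N/A.
func listJobIDs(ctx context.Context, dyn dynamic.Interface, db *sql.DB, namespace string) (map[string]bool, error) {
	ids := map[string]bool{}
	apps, err := dyn.Resource(sparkAppGVR).Namespace(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for _, app := range apps.Items {
		// in practice the job ID may be derived from the SparkApplication name
		ids[app.GetName()] = true
	}
	rows, err := db.QueryContext(ctx, "SELECT DISTINCT id FROM recommendations")
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids[id] = true
	}
	return ids, rows.Err()
}
```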
If the status of a job is not completed (e.g. running, failed), the sparkapplication behind the job will be deleted. Otherwise, both the sparkapplication behind the job and the recommendation result in the database will be deleted. e.g.:
> theia pr delete 1e7ffc6d-2321-422d-b982-0ffca2d7987f
Successfully deleted policy recommendation job with ID 1e7ffc6d-2321-422d-b982-0ffca2d7987f
Do we need to add a status column in the recommendations table?
Currently, statuses of policy recommendation jobs are obtained from the k8s API server. Only completed jobs will write results into the recommendations table.
May need to think about how to sync status between the k8s API server and database. (Should be handled in the middle layer application later)
For now, let's not add more columns to the recommendation result table.
Are there other columns we would like to add to the list command result?
Job parameters, number of flows, etc.
(For now we don't plan to show job parameters in the list command since they are too much to display in a table; users can see them by describing the driver Pod.)
Check the failure reason for failed jobs and write it into the database.
Could check the debug APIs of the Spark operator first.
There is an Error Message field in a failed sparkapplication; we could display it in the status command of the CLI. Writing it into the database should be handled in the middle layer application later.
Currently, to triage issues in Theia, logs need to be individually collected from each component. It would be useful to have a workflow to capture a support bundle for all Theia components.
Flow Aggregator: /var/log/antrea/flow-aggregator.
ClickHouse: /var/log/clickhouse-server. clickhouse-server.log contains all operational logs and access/query logs, and a separate error log, clickhouse-server.err.log, is dedicated to errors. The entire directory should be captured.
Grafana: the server is started as grafana-server --homepath=/usr/share/grafana --config=/etc/grafana/grafana.ini --packaging=docker cfg:default.log.mode=console cfg:default.paths.data=/var/lib/grafana, so its logs go to the console and can be retrieved with kubectl, but they are currently not universally persisted at a single place.
We may consider adding the log collection functionality to the pr common workflow, and calling it as part of log bundle collection. Collecting console logs (kubectl logs) should also be fine at the start.
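A rough sketch of that console-log collection with client-go; the flow-visibility namespace and the output layout are assumptions, and files on disk inside the containers (e.g. the ClickHouse log directory) would still need to be copied out separately:

```go
package supportbundle

import (
	"context"
	"io"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// collectPodLogs streams the logs of every container in the given namespace
// (e.g. flow-visibility) into one file per container under outDir.
func collectPodLogs(client kubernetes.Interface, namespace, outDir string) error {
	pods, err := client.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		for _, c := range pod.Spec.Containers {
			req := client.CoreV1().Pods(namespace).GetLogs(pod.Name, &corev1.PodLogOptions{Container: c.Name})
			stream, err := req.Stream(context.TODO())
			if err != nil {
				continue // skip containers whose logs cannot be fetched
			}
			f, err := os.Create(filepath.Join(outDir, pod.Name+"_"+c.Name+".log"))
			if err != nil {
				stream.Close()
				return err
			}
			_, _ = io.Copy(f, stream)
			f.Close()
			stream.Close()
		}
	}
	return nil
}
```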
Besides the logs of each component, running status and configuration files may also be collected as part of the support bundle. At a lower priority, the following should be considered to be collected as well:
In workflows, the set-output command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information, see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/
hack/generate-manifest.sh --mode antrea-e2e is used to generate the YAML manifest for e2e tests in the Antrea repository, which should only include a ClickHouse server with default credentials. We need the script to create the YAML manifest because it makes it easier to keep the ClickHouse data schema in Theia and Antrea in sync.
The current script is a bit out-of-date: since we added new features to Theia, running the command gives us a manifest with many more resources than the required ClickHouse server. We would like to remove all the unnecessary resources from the generated manifest, including theia-manager, zookeeper, and theia-cli. We would also like to replace projects.registry.vmware.com/antrea/theia-clickhouse-server with projects.registry.vmware.com/antrea/clickhouse-server in the generated manifest, as the Antrea e2e test does not need to run with Theia-built images.
Currently, the graph displayed on the network topology dashboard uses default Mermaid.js colors. As a result, the graph looks monotonous, and it is hard to differentiate between Pods and Services. To fix this, we could implement some simple theming and coloring that matches the other dashboards.
Issue description: Names for columns and filters are exactly the ones returned by queries
Improvement: Have friendly names for columns used in tables and filters. For instance "flowEndSecondsFromSourceNode" could be displayed as "end of flow as reported by source"
In the current Theia powered by Snowflake, we are using an opensource Snowflake datasource plugin: https://github.com/michelin/snowflake-grafana-datasource. You can find the deployment documentation at: https://github.com/antrea-io/theia/tree/main/snowflake#deployments
Compared to regular Theia powered by ClickHouse, one feature we are missing in the Snowflake datasource plugin is the ad-hoc filter. You can find example usage of the filters in our documentation. I previously opened a feature request issue in the plugin repo, but have not seen any updates on it. We should expect to add this feature on our own.
Goal: visualize how pods send/receive traffic to/from other pods or services.
This should be done with a connected graph where every node is a Pod or a Service, and edges between nodes represent the amount of traffic sent/received in the selected interval.
The ClickHouse cluster needs ZooKeeper for deletion. But when uninstalling Theia as a whole with the YAML file or Helm chart, ZooKeeper is always deleted before ClickHouse. As a result, the ClickHouse deletion gets stuck, while the ClickHouse Operator keeps trying to connect to ZooKeeper for around 4 minutes and eventually fails.
Currently we instruct the user in the docs to manually delete ClickHouse first and then uninstall the rest of Theia. But for usability, we need an operator to take care of the installation and uninstallation order of the ClickHouse cluster and ZooKeeper.
Currently, Python files, specifically the Spark jobs, are not included in code coverage reporting. The setup required to generate the Python coverage already exists; Codecov just needs to point to the generated file.
Go 1.20 comes with a variety of useful features and updates. A guide describing all the changes can be found here.
Easier Coverage Calculation
The primary reason for updating to 1.20 is that coverage is easier to generate for end-to-end tests. Previously, using instrumented binaries with a coverage collector was the only way to ensure full and correct coverage calculations. This process is slightly cumbersome and requires setup in several different files. Go 1.20 changes this by allowing the go build command itself to generate the instrumented binaries necessary for testing, by providing the -cover argument. This instrumented binary is then run multiple times across the test cases and generates several reports, which can be merged into a larger report to encompass the kind-e2e-tests flag.
WithCancelCause()
This function allows the goroutine that calls cancel to pass an error that describes the reason for the cancel. The cancel function is used in several places throughout Theia code.
errors.Join()
This function allows multiple errors to be joined and returned as one. It doesn't seem like Theia attempts to return multiple errors anywhere.
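For reference, a minimal illustration of both Go 1.20 APIs mentioned above (not Theia code):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var errClickHouse = errors.New("clickhouse query failed")
var errSpark = errors.New("spark job submission failed")

func main() {
	// context.WithCancelCause: the caller of cancel records why it cancelled,
	// and other goroutines can read the reason back with context.Cause.
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errSpark)
	fmt.Println(ctx.Err())          // context canceled
	fmt.Println(context.Cause(ctx)) // spark job submission failed

	// errors.Join: combine several errors into one; errors.Is still matches each of them.
	err := errors.Join(errClickHouse, errSpark)
	fmt.Println(errors.Is(err, errClickHouse)) // true
}
```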
stdlib is now a package
The standard library will no longer be pre-compiled into $GOROOT/pkg; its packages will instead be built as needed and cached in the build cache.
math/rand.Seed() and math/rand.Read() deprecated
The previous global random number generator used a workaround to receive a truly random seed at program start. With 1.20, the generator is now automatically seeded with a random value. Theia uses this workaround in several places, all of which can easily be updated.
http.ResponseWriter can now implement optional interfaces, though they are not discoverable. Perhaps later development of this feature will make it more useful for Theia use cases.
There are other minor changes, such as easier slice-to-array conversion, but nothing that affects the Theia codebase significantly. Feel free to comment feedback regarding potential features with these updates or changes to the estimated scope of each feature change.
Instead of requiring manual triggering of throughput anomaly detection jobs by users, we could implement real-time detection through a continuously running Spark streaming job.
Subramanian has already implemented a Spark streaming job that can be found at https://github.com/antrea-io/theia/blob/b1bcc1a2b48b1d2617afbd6f5fca53ac716f08f5/plugins/anomaly-detection/SparkStreaming.py.
To fully realize this feature, we need to undertake the following steps:
Currently, Dependabot can only raise security alerts at: https://github.com/antrea-io/antrea/security/dependabot .
Expected:
Dependabot should be able to automatically raise PRs to update dependencies, like what Antrea has: antrea-io/antrea#3442
Describe what you are trying to solve:
Currently, if we want to acquire metrics from the ClickHouse database, we need to go to each shard (ClickHouse cluster) and send queries to get the information we need, which is time consuming.
Describe the solution you have in mind :
Use the CLI to directly send built-in queries to retrieve metrics from the ClickHouse database.
Describe how your solution impacts user flows :
Users can acquire basic information (storage usage, table info, current insertion rate) for each shard.
Users can choose whether to print the raw 2-d array or a formatted table of the returned metrics by using the --print-table flag.
For example, when a user types
theia get clickhouse --diskInfo --print-table
Shard | Name | Path | Free | Total | Used Percentage |
---|---|---|---|---|---|
1 | default | /var/lib/clickhouse/ | 1.75 GiB | 1.84 GiB | 5.04 |
2 | default | /var/lib/clickhouse/ | 1.75 GiB | 1.84 GiB | 4.81 |
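A minimal sketch of the kind of built-in query that could back --diskInfo, assuming the clickhouse-go v2 driver and the standard system.disks table; per-shard iteration and table formatting are omitted:

```go
package chmetrics

import (
	"database/sql"
	"fmt"

	_ "github.com/ClickHouse/clickhouse-go/v2" // registers the "clickhouse" driver
)

// printDiskInfo queries disk usage from one ClickHouse shard and prints
// name, path, free space, total space, and used percentage.
func printDiskInfo(dsn string) error {
	db, err := sql.Open("clickhouse", dsn)
	if err != nil {
		return err
	}
	defer db.Close()
	rows, err := db.Query("SELECT name, path, free_space, total_space FROM system.disks")
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var name, path string
		var free, total uint64
		if err := rows.Scan(&name, &path, &free, &total); err != nil {
			return err
		}
		fmt.Printf("%s %s %d %d %.2f%%\n", name, path, free, total,
			float64(total-free)/float64(total)*100)
	}
	return rows.Err()
}
```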
The current overall code coverage in Theia is 50%; we need to improve the coverage to at least 60% in the next release.
The codecov bot report is also missing from every PR; this needs to be investigated further and resolved.
Currently there is no way to detect port scanning attacks on Pods. The current infrastructure of both the flow exporter and the aggregator does not allow for detection of such attacks, as they only focus on complete flows. The goal is to leverage the failed connection requests that occur as a result of such attacks to identify and highlight potentially malicious connection requests to the user.
In order to implement a port scan attack detector, several tasks exist:
incomplete_flows instead of flows
After fleshing out the design more thoroughly and getting feedback from team members, I will comment a design document under this issue.
It seems that Theia supports multiple clusters, i.e. different Antrea Flow Aggregator instances connecting from different clusters.
I believe that the documentation should be updated to reflect that (I could not find a reference to this?), and we should probably provide information on how to use it.
IMO, there is a key missing feature for multi-cluster support: TLS support for the connection between the Flow Aggregator and ClickHouse server. See antrea-io/antrea#4902.
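For reference, a rough sketch of what a TLS-enabled Flow Aggregator to ClickHouse connection could look like with the clickhouse-go v2 driver; the CA file handling and the secure port are assumptions, not Theia's current configuration:

```go
package chconn

import (
	"crypto/tls"
	"crypto/x509"
	"database/sql"
	"os"

	"github.com/ClickHouse/clickhouse-go/v2"
)

// openSecureClickHouse connects over TLS, trusting the CA certificate in caFile.
func openSecureClickHouse(addr, user, password, caFile string) (*sql.DB, error) {
	caCert, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caCert)
	return clickhouse.OpenDB(&clickhouse.Options{
		Addr: []string{addr}, // e.g. the ClickHouse secure native port, typically 9440
		Auth: clickhouse.Auth{Username: user, Password: password},
		TLS:  &tls.Config{RootCAs: pool},
	}), nil
}
```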
There is a problem reported by a user when deploying ClickHouse with PV and the default ZooKeeper.
How to reproduce the problem
After deploying ClickHouse with PV and the default ZooKeeper, if the ZooKeeper Pod crashes first, followed by a crash of the ClickHouse Pod, the bug will show up. E.g., if we run the following commands, the ClickHouse Pod will keep crashing.
kubectl delete pod zookeeper-0 -n flow-visibility
kubectl delete pod chi-clickhouse-clickhouse-0-0-0 -n flow-visibility
We will be able to see the following in the ClickHouse Pod log:
...
2023.07.20 23:05:30.492951 [ 106 ] {} <Warning> default.flows_local (b0182a24-d467-4e9e-8f4a-908322270b64): No metadata in ZooKeeper for /clickhouse/tables/0/default/flows_local: table will be in readonly mode.
...
2023.07.20 23:05:31.086664 [ 52 ] {93600396-a3b1-42ca-acbb-5545f10eca29} <Error> TCPHandler: Code: 242. DB::Exception: Table is in readonly mode (replica path: /clickhouse/tables/0/default/flows_local/replicas/chi-clickhouse-clickhouse-0-0). (TABLE_IS_READ_ONLY), Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xb8a4c1a in /usr/bin/clickhouse
1. DB::Exception::Exception<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&>(int, fmt::v8::basic_format_string<char, fmt::v8::type_identity<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&>::type>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0xb90e6d8 in /usr/bin/clickhouse
2. DB::StorageReplicatedMergeTree::alter(DB::AlterCommands const&, std::__1::shared_ptr<DB::Context const>, std::__1::unique_lock<std::__1::timed_mutex>&) @ 0x16904fac in /usr/bin/clickhouse
3. DB::InterpreterAlterQuery::executeToTable(DB::ASTAlterQuery const&) @ 0x161a1fe8 in /usr/bin/clickhouse
4. DB::InterpreterAlterQuery::execute() @ 0x1619ff72 in /usr/bin/clickhouse
5. ? @ 0x1656ce56 in /usr/bin/clickhouse
6. DB::executeQuery(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x1656a675 in /usr/bin/clickhouse
7. DB::TCPHandler::runImpl() @ 0x1714c8ea in /usr/bin/clickhouse
8. DB::TCPHandler::run() @ 0x1715f1d9 in /usr/bin/clickhouse
9. Poco::Net::TCPServerConnection::start() @ 0x19dc77f3 in /usr/bin/clickhouse
10. Poco::Net::TCPServerDispatcher::run() @ 0x19dc8b71 in /usr/bin/clickhouse
11. Poco::PooledThread::run() @ 0x19f79e3b in /usr/bin/clickhouse
12. Poco::ThreadImpl::runnableEntry(void*) @ 0x19f77540 in /usr/bin/clickhouse
13. ? @ 0x7fe5224f7609 in ?
14. __clone @ 0x7fe52241c133 in ?
Received exception from server (version 22.6.9):
Code: 242. DB::Exception: Received from 127.0.0.1:9000. DB::Exception: Table is in readonly mode (replica path: /clickhouse/tables/0/default/flows_local/replicas/chi-clickhouse-clickhouse-0-0). (TABLE_IS_READ_ONLY)
(query: ALTER TABLE flows_local MODIFY TTL timeInserted + INTERVAL 12 HOUR;)
What we want to do
It turns out the root cause is that the default ZooKeeper deployment does not use a PV, so the data in ZooKeeper is completely lost when it crashes. ClickHouse converts the table to READONLY in this case.
Here we have several strategies we can consider:
https://clickhouse.com/docs/en/sql-reference/statements/system#restore-replica
If we need to use this strategy, we will need to investigate how to use it during Pod initialization.
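For illustration only, a minimal sketch of issuing the documented statement from Go during Pod initialization; whether additional steps (e.g. SYSTEM RESTART REPLICA) are required first still needs to be investigated:

```go
package chinit

import (
	"database/sql"

	_ "github.com/ClickHouse/clickhouse-go/v2" // registers the "clickhouse" driver
)

// restoreReplica attempts to restore the readonly replicated table on the local server.
func restoreReplica(dsn string) error {
	db, err := sql.Open("clickhouse", dsn)
	if err != nil {
		return err
	}
	defer db.Close()
	_, err = db.Exec("SYSTEM RESTORE REPLICA flows_local")
	return err
}
```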
There might be a risk here if the data in the ZooKeeper PV is lost.
IMO, it would be good to add the ZooKeeper PV first. We may also want to investigate how to avoid the crash if the data is lost on the ZooKeeper side, so that users can still recover through the ClickHouse strategies. Basically, this error may not appear if we remove the ALTER TABLE clause for the TTL. We may need to investigate a better way to update the TTL.
Throughput reported in Grafana is not correct, which is caused by the incorrect StopTime.
In the FlowAggregator (FA), throughput is calculated as deltaBytes / (latest_flow_end_time - previous_flow_end_time).
In the FA, the latest_flow_end_time is reported by the FlowExporter (FE), and the FE always sets the flow StopTime to time.Now(). Since the default pollInterval in the FE is 5s, the StopTime of the current and previous records always has a gap of up to 5s, which means there is a time difference (up to 5s) between the real StopTime and the reported StopTime.
We need to improve the accuracy of the StopTime (flowEndTime) to get a correct throughput value.
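For illustration, a small example of the calculation above and how an error of up to 5s in the reported flowEndTime skews the result (the numbers are made up):

```go
package main

import (
	"fmt"
	"time"
)

// throughput in bytes/s, following the formula described above.
func throughput(deltaBytes uint64, previousEnd, latestEnd time.Time) float64 {
	return float64(deltaBytes) / latestEnd.Sub(previousEnd).Seconds()
}

func main() {
	start := time.Now()
	// With a real interval of 30s, 3 MB of delta bytes means 100 kB/s.
	fmt.Println(throughput(3_000_000, start, start.Add(30*time.Second)))
	// If the reported flowEndTime is off by 5s, the same flow appears ~20% faster.
	fmt.Println(throughput(3_000_000, start, start.Add(25*time.Second)))
}
```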
Up to Theia 0.3 the full output of a Spark Policy Recommendation job is stored in a single record.
We should have distinct records for each recommended policy, and handle corresponding data migrations.
The Theia manager controller for recommended policies should also be amended to reflect this change.
From looking here - https://github.com/antrea-io/theia/blob/main/build/charts/theia/templates/clickhouse/clickhouseinstallation.yaml#L12 - there is only the ability to create the clickhouse_operator user. It would be great if we could create additional users for read-only access. From looking at the profiles that get created, there is a default readonly profile, which could be assigned to a read-only user.
Antrea supports L7 Network Policy with the HTTP protocol.
In Theia, we want to support L7 visibility: the L7 flows collected from Antrea shall be stored in ClickHouse and then displayed to the user in the Grafana UI.
Add capabilities that might give users insights oriented toward understanding how applications deployed in a K8s cluster communicate.
P0 items:
P1 items:
Describe what you are trying to solve
Currently we only support a single-node ClickHouse server, which lacks reliability and the ability to scale horizontally.
Describe the solution you have in mind
Support ClickHouse cluster deployment with shards for horizontal scaling and replicas for reliability.
Describe how your solution impacts user flows
A ClickHouse cluster can manage flow records in a larger network and provide reliability for users.
Describe the main design/architecture of your solution
Refer to the ClickHouse Operator docs to enable the ClickHouse cluster deployment. Add corresponding values in Theia Helm Chart to make the deployment user-friendly.
Describe the bug
Recently I have noticed that the kind e2e test sometimes fails with the errors below. The error may not show up after re-running the test, but I'm not sure what the root cause is.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m13s default-scheduler Successfully assigned flow-aggregator/flow-aggregator-95f95b7bd-4lxj6 to kind-worker2
Normal Pulling 4m37s (x4 over 6m13s) kubelet Pulling image "projects.registry.vmware.com/antrea/flow-aggregator:latest"
Warning Failed 4m37s (x4 over 6m12s) kubelet Failed to pull image "projects.registry.vmware.com/antrea/flow-aggregator:latest": rpc error: code = NotFound desc = failed to pull and unpack image "projects.registry.vmware.com/antrea/flow-aggregator:latest": failed to copy: httpReadSeeker: failed open: content at https://projects.registry.vmware.com/v2/antrea/flow-aggregator/manifests/sha256:b3594bd6fa1a8c5b12f2bdda5a4ddc5f7903f892660a1b4f832b245cf6082f77 not found: not found
Warning Failed 4m37s (x4 over 6m12s) kubelet Error: ErrImagePull
Warning Failed 4m23s (x6 over 6m12s) kubelet Error: ImagePullBackOff
Normal BackOff 67s (x20 over 6m12s) kubelet Back-off pulling image "projects.registry.vmware.com/antrea/flow-aggregator:latest"
Here are some cases for this failure:
The Traffic Drop Detector is used to detect an unreasonable amount of flows dropped or blocked by Network Policies for each endpoint, and to report alerts to admins. It helps identify potential issues with Network Policies, as well as potential security threats, so that appropriate action can be taken to mitigate them.
P0 items:
P1 items:
Currently, Theia does not have any integration tests in place. Adding integration tests requires some research into how they are implemented in Antrea, an outline of which components could be tested, and the necessary steps required to test each of those components. Additionally, integration tests will need to be run as part of CI.