@hhlee445 @chrisjrd @maseca - FYI and to provide guidance to @philipjyoon.
from opera-sds-int.
A lot of these metrics would be directly impacted by the number of autoscaling fleets and their maximum sizes. Do we want to standardize these?
Some ideas from @niarenaw:
- Accumulated size (in bytes) of a given AWS S3 bucket over a given time frequency (down to minutes)
  - can do this programmatically by running the following command before and after the test and taking the difference: aws s3 ls s3://$BUCKET --recursive --summarize --human-readable
  - can also compute S3 size on the AWS S3 console
- Throughput (in bytes/sec) of a given AWS S3 bucket over a given time frequency (down to minutes)
  - can derive from the previous metric and the total length of the load test
  - better granularity with the Metrics tab on the AWS S3 console
- Elasticsearch statistics (num docs, query time, etc.) for a given index over a given time frequency (down to minutes)
  - can use the Elasticsearch SDK or the web UI to generate queries and filter by time range
  - I’m pretty horrible at the Elasticsearch DSL syntax, but it might be time I learn it properly
- PCM queue sizes (QUEUED / PENDING jobs especially) over a given time frequency (down to minutes)
  - probably easiest to get these using Figaro and Lucene queries (ex. “job_queue:<> AND timestamp:<>” for each queue)
  - can make these programmatic by querying ES directly instead
- AWS EC2 Spot errors (insufficient capacity/terminations)
  - using the AWS CloudTrail console, can search for BidEvictedEvent events in a given time range
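The S3 size and throughput items can be derived from two listings taken before and after a run. Below is a minimal Python sketch, assuming the output of aws s3 ls s3://$BUCKET --recursive --summarize has been captured as text: it reads the "Total Size:" summary line from each listing and divides the difference by the elapsed time. Dropping --human-readable gives exact byte counts; with it, the CLI prints rounded binary units (KiB, MiB, ...), which this parser also accepts at reduced precision. The function names are hypothetical.

```python
import re
from datetime import datetime
from typing import Tuple

# Binary units printed by the AWS CLI when --human-readable is passed.
_UNITS = {"Byte": 1, "Bytes": 1, "KiB": 1024, "MiB": 1024**2,
          "GiB": 1024**3, "TiB": 1024**4, "PiB": 1024**5}

# Matches the summary line, e.g. "   Total Size: 1234" or "   Total Size: 1.2 GiB".
_TOTAL_SIZE = re.compile(r"Total Size:[ \t]*([\d.]+)[ \t]*([A-Za-z]+)?")


def total_size_bytes(listing: str) -> int:
    """Return the 'Total Size' from `aws s3 ls ... --summarize` output, in bytes."""
    match = _TOTAL_SIZE.search(listing)
    if match is None:
        raise ValueError("no 'Total Size:' line in listing")
    value, unit = match.groups()
    if unit is not None and unit not in _UNITS:
        raise ValueError(f"unrecognized size unit: {unit}")
    factor = _UNITS[unit] if unit else 1
    return int(float(value) * factor)


def accumulation(before: str, after: str,
                 start: datetime, end: datetime) -> Tuple[int, float]:
    """Bytes added between two listings and the mean throughput in bytes/sec."""
    seconds = (end - start).total_seconds()
    if seconds <= 0:
        raise ValueError("end must be later than start")
    added = total_size_bytes(after) - total_size_bytes(before)
    return added, added / seconds
```

Taking a listing once a minute and applying accumulation to each consecutive pair would give a per-minute series; for large buckets, a single listing may itself take longer than a minute, which limits how fine the sampling can be.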
Thoughts on S3 size: What Nick has found seems to be the only way we can get near-real-time, high-frequency metrics on S3 bucket size. However, it can get very slow for large buckets, as well as costly. I think LIST requests are billed at something like $0.005 per 1,000 requests, with at most 1,000 objects returned per request?
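To put rough numbers on the cost concern, here is a small Python sketch, assuming S3 Standard LIST requests are billed at $0.005 per 1,000 requests (the us-east-1 list price; check current pricing for the bucket's region) and that each LIST request returns at most 1,000 keys, so a full recursive listing needs one request per 1,000 objects. list_cost_usd is a hypothetical helper.

```python
import math

# Assumed pricing: S3 Standard LIST requests at $0.005 per 1,000 requests
# (us-east-1 list price; verify against current pricing for your region).
PRICE_PER_1000_LIST_REQUESTS = 0.005
MAX_KEYS_PER_LIST = 1000  # a single LIST request returns at most 1,000 keys


def list_cost_usd(num_objects: int, snapshots: int = 1) -> float:
    """Estimated LIST-request cost of `snapshots` full recursive listings."""
    # Even an empty bucket costs one request per listing.
    pages = max(1, math.ceil(num_objects / MAX_KEYS_PER_LIST))
    return snapshots * pages * PRICE_PER_1000_LIST_REQUESTS / 1000
```

Under these assumptions, a 10-million-object bucket listed once a minute for an hour is 600,000 requests, or roughly $3. The slowness is a separate issue: the pages of one listing are fetched one after another, so listing time grows with object count.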
I did find an alternative using CloudWatch, but it only works at daily frequency, so it is not very useful to us:
aws --profile saml-pub cloudwatch get-metric-statistics --namespace AWS/S3 --start-time 2022-06-08T23:22:00 --end-time 2022-06-08T23:59:00 --period 86400 --statistics Average --metric-name BucketSizeBytes --dimensions Name=BucketName,Value=opera-dev-isl-fwd-pyoon Name=StorageType,Value=StandardStorage
Perhaps there are other metrics we can measure instead that give us the same or similar insight into what's happening in the PCM and where the bottlenecks lie. If we want to see whether the ingest workers are lagging behind the download workers (which is what a high-frequency accumulated ISL S3 size would tell us), we could just measure the length of the queue that the ingest workers consume from. I don't know whether these queue entries include file size; however, at least for HLSS and HLSL data, file sizes seem to be quite uniform.
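For the queue-length idea, a minimal Python sketch of sampling the ingest queue's queued-job count once a minute and turning it into an approximate byte backlog under the uniform-file-size assumption. The field names job_queue and status and the status value job-queued are assumptions (job_queue follows the Lucene example earlier in the thread; the others mirror HySDS-style job documents), and the term filters assume keyword-mapped fields. count_fn is a hypothetical callback that POSTs the body to the job index's Elasticsearch _count endpoint and returns the response's count field.

```python
import time
from typing import Callable, List, Tuple


def queued_count_body(queue: str, status: str = "job-queued") -> dict:
    """Elasticsearch _count body matching jobs in `queue` whose status is `status`.

    Field names and the status value are assumptions; adjust to the job index.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"job_queue": queue}},
                    {"term": {"status": status}},
                ]
            }
        }
    }


def sample_backlog(
    count_fn: Callable[[dict], int],
    queue: str,
    typical_file_bytes: int,
    samples: int,
    interval_s: float = 60.0,
) -> List[Tuple[float, int, int]]:
    """Poll the queued-job count `samples` times, `interval_s` seconds apart.

    Returns (unix_time, queued_jobs, estimated_backlog_bytes) per sample; the
    byte estimate assumes one file of roughly `typical_file_bytes` per job.
    """
    body = queued_count_body(queue)
    series = []
    for i in range(samples):
        queued = count_fn(body)
        series.append((time.time(), queued, queued * typical_file_bytes))
        if i < samples - 1:  # no need to wait after the last sample
            time.sleep(interval_s)
    return series
```

This polls the current status rather than filtering on a timestamp range: a job document's status reflects its state now, so a time-range filter would count jobs submitted in that window that are still queued, not the queue length at that time.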