Comments (14)
Happened to us again today:
Terminating EC2 instance: i-0c4267833705d8d9c At 2024-05-22T13:59:51Z an instance was taken out of service in response to an EC2 instance status checks failure. 2024 May 22, 05:59:51 PM +04:00
from aws-fpga.
Again:
Terminating EC2 instance: i-0a74638e8ea8a0844 At 2024-05-23T12:09:03Z an instance was taken out of service in response to an EC2 instance status checks failure. 2024 May 23, 04:09:03 PM +04:00
from aws-fpga.
Again:
Terminating EC2 instance: i-05023a2fab96c88f9 At 2024-05-25T16:10:40Z an instance was taken out of service in response to an EC2 instance status checks failure.
from aws-fpga.
Again:
Terminating EC2 instance: i-0dbc92d566e694e79 At 2024-05-27T04:21:18Z an instance was taken out of service in response to an EC2 instance status checks failure.
from aws-fpga.
Terminating EC2 instance: i-00f1d4ba1fa292dbd At 2024-05-31T22:30:42Z an instance was taken out of service in response to an EC2 instance status checks failure.
It seems like there's a 1-2% chance of failure per 24 hours of uptime recently. I can't imagine this level of unreliability would be tolerated with any other instance type.
This is happening across multiple accounts, multiple zones in us-east-1, two different bitfiles, single software architecture / fpga drivers.
The error message suggests a pure AWS issue. It wouldn't surprise me if we're contributing to the failures somehow, but there's nothing to go on based on the error message.
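To put the 1-2% per day figure in perspective, here is a minimal sketch of the arithmetic. It assumes failures are independent and that the per-day probability is constant, neither of which the thread confirms; the function names and the fleet size parameter are hypothetical.

```cpp
#include <cmath>

// Probability that one instance sees at least one failure within `days`,
// ASSUMING independent failures with a constant per-day probability.
double failure_probability(double per_day, double days) {
    return 1.0 - std::pow(1.0 - per_day, days);
}

// Expected number of failures per day across a fleet of `instances`
// (hypothetical fleet size, same independence assumption).
double expected_daily_failures(double per_day, int instances) {
    return per_day * instances;
}
```

Under those assumptions, `failure_probability(0.01, 30)` is about 0.26 and `failure_probability(0.02, 30)` is about 0.45, so a single instance left running for a month would have roughly a one-in-four to one-in-two chance of being terminated.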
from aws-fpga.
Hello,
Thanks for reaching out with this issue. We've been internally monitoring the issue and will report back soon. In the meantime, have you been able to follow some of the AWS EC2 troubleshooting steps?
from aws-fpga.
Looked over the link with devops and most of it doesn't make sense in our context because it's primarily about configuration problems, whereas this happens after instances have run for extended periods. It looks like a more specific cause may be available in the instance's details on the EC2 console, but the instances are in an autoscaling group, so they get terminated automatically and their details quickly become inaccessible on the console.
Questions I have are:
- Are the status checks failures unique to my accounts?
- Do you know the specific reason for the status checks failure (e.g., a networking problem)?
- Is there anything we could be doing with the FPGA that can cause a status checks failure (e.g., hanging the PCI bus)?
Happy for you to close the support ticket if it helps your KPIs. I don't need this solved urgently but it is problematic from the perspective of our customers so I don't want it ignored either.
from aws-fpga.
One thing that may be contributing to the instance instability is PCIe/AXI errors on the bus. Can you provide the shell timeout data immediately prior to the instance failures? You can find more information on collecting this data with the SDK here: https://github.com/aws/aws-fpga/blob/863d963308231d0789a48f8840ceb1141368b34a/hdk/docs/HOWTO_detect_shell_timeout.md
Gathering the data above will help us narrow down the issue as "hanging the PCI bus" is the most likely root cause of the issue. Don't worry about closing the support tickets, it helps us collect data and gain visibility on the issues!
from aws-fpga.
Added the code below to the cleanup hook, which might provide some insight. I know OCL reads are working fine immediately prior to the status checks failure.
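// print fpga metrics from the aws-fpga cli tool, which checks for shell timeouts
// (needs <cstdlib>, <fstream>, <iostream>; "> file 2>&1" is used instead of "&>"
// because std::system runs /bin/sh, which is not always bash)
std::system("sudo fpga-describe-local-image --headers --metrics --fpga-image-slot 0 > fpga-metrics.log 2>&1");
std::cout << std::ifstream("fpga-metrics.log").rdbuf();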
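One way to collect the same output without a temporary file is to read it through a pipe. The sketch below is an illustration, not what was deployed: `run_and_capture` is a hypothetical helper name, and it relies on `popen`, which is POSIX rather than standard C++.

```cpp
#include <array>
#include <cstdio>
#include <stdexcept>
#include <string>

// Run a shell command and return everything it wrote to stdout and stderr.
// Hypothetical helper; popen/pclose are POSIX, not standard C++.
std::string run_and_capture(const std::string& command) {
    // Wrap the command in a subshell so stderr of the whole command is merged
    // into stdout, using the portable "2>&1" form understood by any /bin/sh.
    const std::string full = "( " + command + " ) 2>&1";
    FILE* pipe = popen(full.c_str(), "r");
    if (pipe == nullptr) {
        throw std::runtime_error("popen failed for: " + command);
    }
    std::string output;
    std::array<char, 4096> buffer{};
    std::size_t n = 0;
    // Read until the command closes its end of the pipe.
    while ((n = std::fread(buffer.data(), 1, buffer.size(), pipe)) > 0) {
        output.append(buffer.data(), n);
    }
    pclose(pipe);
    return output;
}
```

In the cleanup hook this could be called as `std::cout << run_and_capture("sudo fpga-describe-local-image --headers --metrics --fpga-image-slot 0");`, which also leaves the text available as a string for logging elsewhere.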
from aws-fpga.
Could you share what shell interfaces your workload is exercising at the time of failures?
from aws-fpga.
We didn't change anything except for rebuilding the image to add the above code, and we haven't experienced the problem in the last two weeks, so it might be gone.
OCL and DMA_PCIS. We do DMA writes to two DDRs from the processor, read from two DDRs to the FPGA, and reads & writes between the processor and FPGA with OCL registers.
from aws-fpga.
We're glad to hear you're no longer experiencing the issue. If you ever do experience the failure again, please reach out with any information you have!
from aws-fpga.
Autoscaling group:
Terminating EC2 instance: i-09fcd9b811c50579c
At 2024-07-21T18:55:33Z an instance was taken out of service in response to an EC2 instance status checks failure.
EC2 console:
Instance reachability check failed
fpga-describe-local-image:
Error: (21) afi-command-malformed
A malformed response from the FPGA API can indicate that the FPGA has
stopped behaving correctly and the instance will need to be stopped and
and restarted. If this continues to happen (after an instance restart),
this may be an indication that your AFI is exceeding allowed power
consumption limits.
I find the generic reasons for a reachability check failure implausible given OCL reads and DMA writes aren't erroring immediately prior to the fpga-describe-local-image call.
When there's a status checks failure, there's usually an associated issue from within our logic: a processing operation times out. I know about the timeout because the OCL reads that report it succeed. We've had these timeouts happen before with functional problems in our design, but I struggle to see how anything we do in the CL or with DDR could cause a reachability failure.
Representative power consumption from a working instance:
Power consumption (Vccint):
Last measured: 9 watts
Average: 36 watts
Max measured: 54 watts
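For tracking these figures over time rather than pasting them by hand, a minimal sketch of a parser follows. It assumes the tool's output keeps exactly the line format shown above; the struct and function names are hypothetical, and it needs C++17 for `std::optional`.

```cpp
#include <optional>
#include <regex>
#include <string>

// The three Vccint readings, in watts (hypothetical struct for illustration).
struct VccintPower {
    int last_watts;
    int average_watts;
    int max_watts;
};

// Extract the Vccint figures from text in the format pasted above.
// Returns std::nullopt if any of the three lines is missing, for example
// when the tool printed an error instead of metrics.
std::optional<VccintPower> parse_vccint_power(const std::string& text) {
    static const std::regex last_re(R"(Last measured:\s*(\d+)\s*watts)");
    static const std::regex avg_re(R"(Average:\s*(\d+)\s*watts)");
    static const std::regex max_re(R"(Max measured:\s*(\d+)\s*watts)");
    std::smatch last, avg, max;
    if (!std::regex_search(text, last, last_re) ||
        !std::regex_search(text, avg, avg_re) ||
        !std::regex_search(text, max, max_re)) {
        return std::nullopt;
    }
    return VccintPower{std::stoi(last[1].str()),
                       std::stoi(avg[1].str()),
                       std::stoi(max[1].str())};
}
```

Logging the parsed values on every cleanup would show whether the maximum creeps upward before a failure, which is the one hypothesis the error text itself raises.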
Any further thoughts?
from aws-fpga.
Autoscaling group:
Terminating EC2 instance: i-067ca8584abf6b911
At 2024-07-29T17:20:00Z an instance was taken out of service in response to an EC2 instance status checks failure.
EC2 console:
Instance reachability check failed
fpga-describe-local-image:
Error: (21) afi-command-malformed
A malformed response from the FPGA API can indicate that the FPGA has
stopped behaving correctly and the instance will need to be stopped and
and restarted. If this continues to happen (after an instance restart),
this may be an indication that your AFI is exceeding allowed power
consumption limits.
Same deal.
from aws-fpga.