Comments (12)
Just an FYI, I’ve got a half-implemented change for the gather_facts timeout, so there may be some duplication of work if you start implementing that now.
from ansible.windows.
This change to pywinrm should prevent the infinite loop on locked files or a corrupt registry.
diyan/pywinrm#297
I will look at how best to get this integrated into ansible next. It does not quite match the gather_timeout interface.
Unfortunately this isn't something that should be handled in pywinrm; we need something that is connection agnostic. It can easily be done in a module through PowerShell, since we have something similar for async. It just needs to be implemented.
Do you have a suggestion as to how to handle the infinite loop waiting on the host, in this case, the ansible host? It is a pretty common occurrence when you have a non-trivial number of Windows hosts; for us it happens daily. It can't be easily handled in ansible, as pywinrm blocks indefinitely.
The infinite waiting loop is how WinRM works: it will continue to poll the host until the WSMan response indicates the process has finished. If something is still keeping it open on the remote side, you need to determine what that is. This is somewhat documented in the Receive processing rules for MS-WSMV.
A client SHOULD immediately issue a Receive message when a command is launched, whether or not it will be sending input using Send messages. To prevent deadlock, livelock, or time-out situations, the server can return Receive messages with empty string content, but typically it will delay responding until output is available, providing that wsman:OperationTimeout rules are not violated. If no output is available before the wsman:OperationTimeout expires, the server MUST return a WSManFault with the Code attribute equal to "2150858793". When the client receives this fault, it SHOULD issue another Receive request. The client SHOULD continue to issue Receive messages as soon as the previous ReceiveResponse has been received.
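The Receive rules quoted above can be sketched as a small client loop. This is a minimal simulation in Python, not pywinrm's actual code: `FakeShell`, `ReceiveResponse`, `WSManFault` and `read_all_output` are hypothetical names invented for illustration, and only the fault code comes from the quoted text.

```python
from dataclasses import dataclass
from typing import List, Union

# Fault code quoted from MS-WSMV: sent when wsman:OperationTimeout
# expires before any output is available.
OPERATION_TIMEOUT_FAULT = "2150858793"


class WSManFault(Exception):
    """Hypothetical exception carrying the fault's Code attribute."""

    def __init__(self, code: str) -> None:
        super().__init__(f"WSManFault {code}")
        self.code = code


@dataclass
class ReceiveResponse:
    """Hypothetical ReceiveResponse: an output chunk plus a completion flag."""
    stdout: str
    done: bool


class FakeShell:
    """Hypothetical stand-in for a remote shell that replays scripted replies."""

    def __init__(self, script: List[Union[WSManFault, ReceiveResponse]]) -> None:
        self._script = iter(script)
        self.receive_calls = 0

    def receive(self) -> ReceiveResponse:
        self.receive_calls += 1
        step = next(self._script)
        if isinstance(step, WSManFault):
            raise step
        return step


def read_all_output(shell: FakeShell) -> str:
    """Issue Receive requests until the command reports completion."""
    chunks = []
    while True:
        try:
            response = shell.receive()
        except WSManFault as fault:
            if fault.code == OPERATION_TIMEOUT_FAULT:
                continue  # no output yet: issue another Receive
            raise  # any other fault is a real error
        chunks.append(response.stdout)
        if response.done:
            return "".join(chunks)


shell = FakeShell([
    WSManFault(OPERATION_TIMEOUT_FAULT),
    ReceiveResponse("hello ", done=False),
    WSManFault(OPERATION_TIMEOUT_FAULT),
    ReceiveResponse("world", done=True),
])
output = read_all_output(shell)
```

Nothing in this loop bounds the total wait: if the remote side never reports completion, the client keeps issuing Receive requests indefinitely.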
If you are experiencing this daily, you will need to figure out:
- Whether it blocks on a common scenario, e.g. running a particular command
- Whether it always affects the same hosts
- If not, whether there is a common denominator between the hosts, e.g. OS version, PS version, etc.
- Any event logs on the Windows host that indicate that something was killed or some other error
- Running procexp to look at the process tree of the WinRM process and see what is still running and the commands it is running
- Reducing the number of forks and seeing if that reduces the occurrence
These issues are hard to track down but having something get stuck in this case is usually the sign of a deeper problem and setting an arbitrary limit on commands is not the way to go. The best way to solve this problem is finding a way to reliably replicate the issue and then drilling down into what it does for each step. We have lots of tools available to help debug these problems but they are only really useful if we can replicate the problem.
As I said, this is with a non-trivial number of hosts (many thousands). We have investigated this extensively and have multiple tickets open with MS; the most common cause seems to be registry corruption. There is no doubt it's a Windows problem. What I need to work around is the Windows host never returning. MS may say 'MUST', but that is not my reality: the server does not.
This said, if you don't want to do any workaround, I'll close my MR and change it to measure the time looped and use the existing so I can use a fork in the standard ansible.
You're more than welcome to continue with that PR; it will be up to the maintainer there to review it and proceed further. I just wanted to let you know it's not something we can take advantage of in Ansible. What I'm trying to find out is where this problem occurs: is it a problem when running the fact gathering, or does it occur at other points when using Ansible?
There's a chance that something we do in this module causes some hang somewhere and we need to fix the problem there but we would need to find out more info to try and track it down. I'm not trying to say there isn't a bug here, I'm just trying to determine where the problem truly lies. If you are just seeing blocks on any module being executed by Ansible then there is a problem deeper in the system that we may not be able to handle in Ansible, i.e. the underlying problem needs to be fixed.
If you have a Red Hat subscription with Ansible this would definitely be something I would talk to them about. They can help try and track down these problems and work towards a solution for fixing it.
It is sometimes a problem with fact gathering, but more often when we do software inventory and access the registry a little more.
As I said before, there is nothing wrong with ansible at all in this case. There is nothing fundamentally wrong with pywinrm, except that it does not handle this Windows failure, which is common at scale. What I would like is a way for ansible not to get blocked forever but to report a failure after an unreasonable timeout when talking to a Windows machine. The only change to ansible would be the additional configuration variable to define 'reasonable' for a particular scenario.
(It still amuses me, you can scan 20K Linux hosts with little trouble, but Windows is a fact of life in corporate computing).
If gather_facts hangs, then it should be detected and handled. The host will be automatically evicted from inventory and the playbook won't hang.
The pywinrm team thinks that this should not be implemented in pywinrm, but at a higher level: diyan/pywinrm#274 (comment)
Implementing the gather_timeout option should probably be done in ansible/modules/windows/setup.ps1 or the script that calls setup.ps1.
But the higher level (ansible) never finds out as the winrm layer just sits there blocked, forever. Gather facts does not detect anything. I come in in the morning and see it stopped at 200 of 20,000 hosts.
It can't be done at a lower level unless the ps1 script is wrapped in a ps1 script to check if the lower level script times out. Is this what you are suggesting? I am not averse to doing that as well as having pywinrm sensibly handle a remote lock.
@mianos, this is what I meant: the setup.ps1 script needs to be wrapped in a ps1 script that will handle a timeout.
OK, I'll also look at doing a generic wrapper, using Start-Job and Wait-Job -Timeout.
This may be the proper solution, but it's still not really acceptable for ansible to just hang if this fails as well.
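The wrapper idea (start the work, wait with a timeout, report failure if the wait expires) might look roughly like the following. This is a Python analog of the pattern, not the PowerShell Start-Job / Wait-Job wrapper itself and not anything in setup.ps1; `run_with_timeout` and the sample commands are hypothetical, and it assumes the wrapped work runs as a child process on the same machine as the wrapper.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_with_timeout(command: List[str],
                     timeout_seconds: float) -> Tuple[bool, Optional[str]]:
    """Run a command and return (timed_out, stdout).

    stdout is None when the timeout expired before the command finished.
    """
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising,
        # so the wrapper itself always returns.
        return True, None
    return False, completed.stdout


# A command that finishes quickly and one that outlives its time budget.
fast = [sys.executable, "-c", "print('facts gathered')"]
slow = [sys.executable, "-c", "import time; time.sleep(30)"]

fast_result = run_with_timeout(fast, timeout_seconds=20)
slow_result = run_with_timeout(slow, timeout_seconds=1)
```

A wrapper like this only bounds the wrapped command. As the comment above points out, it does not help if the wrapper itself never reports back to the controller.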