Comments (12)

jborean93 commented on July 17, 2024

Just an FYI, I’ve got a half-implemented change for the gather_facts timeout, so there may be some duplication of work if you start to implement that now.

from ansible.windows.

mianos commented on July 17, 2024

This change to winrm should prevent the infinite loop on locked files or corrupt registry.
diyan/pywinrm#297
I will look at how best to get this integrated into ansible next. It does not quite match the gather_timeout interface.

jborean93 commented on July 17, 2024

Unfortunately this isn't something that should be handled in pywinrm; we need something that is connection agnostic. It can easily be done in a module through PowerShell, as we have something similar for async; it just needs to be implemented.

mianos commented on July 17, 2024

Do you have a suggestion as to how to handle the infinite loop waiting on the host, in this case, the ansible host? It is a pretty common occurrence when you have a non-trivial number of Windows hosts, daily for us. It can't be easily handled in ansible as pywinrm blocks indefinitely.

jborean93 commented on July 17, 2024

The infinite loop waiting is how WinRM works: it will continue to poll the host until the WSMan response indicates the process has finished. If something is still keeping it open on the remote side, you need to determine what that is. This is somewhat documented in the Receive processing rules for MS-WSMV.

A client SHOULD immediately issue a Receive message when a command is launched, whether or not it will be sending input using Send messages. To prevent deadlock, livelock, or time-out situations, the server can return Receive messages with empty string content, but typically it will delay responding until output is available, providing that wsman:OperationTimeout rules are not violated. If no output is available before the wsman:OperationTimeout expires, the server MUST return a WSManFault with the Code attribute equal to "2150858793". When the client receives this fault, it SHOULD issue another Receive request. The client SHOULD continue to issue Receive messages as soon as the previous ReceiveResponse has been received.
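
The Receive rules quoted above amount to a polling loop on the client. A minimal sketch in Python, for illustration only: `receive`, `OperationTimeoutFault`, and `CommandDeadlineExceeded` are hypothetical names, not the pywinrm API, and the optional overall deadline is the kind of limit discussed in this thread rather than anything MS-WSMV defines.

```python
import time

# Fault code MS-WSMV specifies when wsman:OperationTimeout expires with no output.
OPERATION_TIMEOUT_FAULT = "2150858793"


class OperationTimeoutFault(Exception):
    """Hypothetical stand-in for a WSManFault carrying the code above."""


class CommandDeadlineExceeded(Exception):
    """Raised by this sketch when the overall deadline passes (not part of MS-WSMV)."""


def receive_until_done(receive, deadline_seconds=None, clock=time.monotonic):
    """Keep issuing Receive requests until the command reports it has finished.

    `receive` is a hypothetical callable that returns (output, done) or raises
    OperationTimeoutFault when the server had no output before its timeout.
    With deadline_seconds=None the loop never gives up, which is the
    behaviour described in this thread.
    """
    start = clock()
    chunks = []
    while True:
        if deadline_seconds is not None and clock() - start > deadline_seconds:
            raise CommandDeadlineExceeded(
                f"command did not finish within {deadline_seconds}s"
            )
        try:
            output, done = receive()
        except OperationTimeoutFault:
            # Per the quoted rules, the client issues another Receive on this fault.
            continue
        chunks.append(output)
        if done:
            return "".join(chunks)
```

Note that a deadline checked between requests only helps if each Receive call itself returns or faults; a request that never comes back would also need a transport-level read timeout.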

If you are experiencing this daily you will need to figure out

  • Whether it blocks on a common scenario, i.e. running particular command
  • Whether it always affects the same hosts
  • If not, whether there is a common denominator between the hosts, i.e. OS version, PS version, etc
  • Any event logs on the Windows host that indicate that something was killed or some other error
  • Running procexp to look at the process tree of the WinRM process and see what is still running and the commands it is running
  • Reducing the number of forks and seeing if that reduces the occurrence

These issues are hard to track down but having something get stuck in this case is usually the sign of a deeper problem and setting an arbitrary limit on commands is not the way to go. The best way to solve this problem is finding a way to reliably replicate the issue and then drilling down into what it does for each step. We have lots of tools available to help debug these problems but they are only really useful if we can replicate the problem.

mianos commented on July 17, 2024

As I said, this is with a non-trivial number (many thousands) of hosts. We have investigated this extensively and we have multiple tickets open with MS; the most common cause seems to be registry corruption. There is no doubt it's a Windows problem. What I need to work around is the Windows host never returning. The spec may say 'MUST', but that is not my reality: the host does not return the fault.
This said, if you don't want to do any workaround, I'll close my MR and change it to measure the time looped and use the existing so I can use a fork in the standard ansible.

jborean93 commented on July 17, 2024

You're more than welcome to continue with that PR; it will be up to the maintainer there to review it and proceed further. I just wanted to let you know it's not something we can take advantage of in Ansible. I guess what I'm trying to find out is where this problem occurs: is it a problem when running the fact gathering, or does it occur at other points in time when using Ansible?

There's a chance that something we do in this module causes some hang somewhere and we need to fix the problem there but we would need to find out more info to try and track it down. I'm not trying to say there isn't a bug here, I'm just trying to determine where the problem truly lies. If you are just seeing blocks on any module being executed by Ansible then there is a problem deeper in the system that we may not be able to handle in Ansible, i.e. the underlying problem needs to be fixed.

If you have a Red Hat subscription with Ansible this would definitely be something I would talk to them about. They can help try and track down these problems and work towards a solution for fixing it.

mianos commented on July 17, 2024

It is sometimes a problem with fact gathering, but more often when we do software inventory and access the registry a little more.

As I said before, there is nothing wrong with ansible at all in this case. There is nothing fundamentally wrong with pywinrm, except it does not handle this common, at scale, Windows failure. What I would like to get is a way for ansible not to get blocked forever but to report a failure on an unreasonable timeout when talking to a Windows machine. The only change to ansible would be the additional configuration variable to define 'reasonable' for a particular scenario.

(It still amuses me, you can scan 20K Linux hosts with little trouble, but Windows is a fact of life in corporate computing).

ronansalmon commented on July 17, 2024

If gather_facts hangs, then it should be detected and handled. The host will be automatically evicted from inventory and the playbook won't hang.

The pywinrm team thinks that this should not be implemented in winrm, but at a higher level: diyan/pywinrm#274 (comment)

Implementing the gather_timeout option should probably be done in ansible/modules/windows/setup.ps1 or the script that calls setup.ps1

mianos commented on July 17, 2024

But the higher level (ansible) never finds out as the winrm layer just sits there blocked, forever. Gather facts does not detect anything. I come in in the morning and see it stopped at 200 of 20,000 hosts.
It can't be done at a lower level unless the ps1 script is wrapped in a ps1 script to check if the lower level script times out. Is this what you are suggesting? I am not averse to doing that as well as having pywinrm sensibly handle a remote lock.

ronansalmon commented on July 17, 2024

@mianos, this is what I meant: the setup.ps1 script needs to be wrapped in a ps1 that will handle a timeout.

mianos commented on July 17, 2024

OK, I'll also look at doing a generic wrapper, using Start-Job and Wait-Job -Timeout.
This may be the proper solution, but it's still not really acceptable for ansible to just hang if this fails as well.
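
The wrapper idea (run the work separately, wait with a limit, report a failure instead of blocking) can be sketched outside PowerShell too. The following is a rough Python analogue for illustration only, not the setup.ps1 wrapper itself; `run_with_timeout` and the keys of the result dict are hypothetical names chosen for this sketch.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_seconds):
    """Run argv in a child process and return a result dict instead of blocking forever."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising TimeoutExpired;
        # any grandchildren the child spawned are not cleaned up by this sketch.
        return {
            "failed": True,
            "timed_out": True,
            "msg": f"timed out after {timeout_seconds}s",
        }
    return {
        "failed": proc.returncode != 0,
        "timed_out": False,
        "stdout": proc.stdout,
    }
```

For example, `run_with_timeout([sys.executable, "-c", "import time; time.sleep(60)"], 1)` comes back after about a second with `timed_out` set, where a plain blocking call would wait the full minute. As noted above, this only helps if the layer that launches the wrapper can itself return.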
