tobixen / thrash-protect

Simple-Stupid user-space program doing "kill -STOP" and "kill -CONT" to protect from thrashing

License: GNU General Public License v3.0

Makefile 6.58% Puppet 1.47% HTML 1.70% Shell 3.06% Python 87.19%

thrash-protect's People

Contributors

bodqhrohro, chriscz, matthew-sharp, pizzonia, questandachievement7developer, riccardobl, tobixen, wyg3958

thrash-protect's Issues

Ubuntu 14.04 error

I just installed it on my machine (uname -a gives Linux userPC 3.19.0-32-generic #37~14.04.1-Ubuntu SMP Thu Oct 22 09:41:40 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux) with sudo make install. /lib/systemd/system/thrash-protect.service and /usr/sbin/thrash-protect have been installed as expected, but when executed the latter gives this output:

WARNING:root:failed to do mlockall() - this makes the program vulnerable of being swapped out in an extreme thrashing event
Traceback (most recent call last):
  File "/usr/sbin/thrash-protect", line 517, in thrash_protect
    assert(not ctypes.cdll.LoadLibrary('libc.so.6').mlockall(ctypes.c_int(3)))
AssertionError

I've seen that the script's shebang is #!/usr/bin/python, but a few lines later it states that the script is for Python 3. I'd suggest replacing the shebang with #!/usr/bin/env python3. I made this change because the previous shebang resolved to Python 2.7.6, but I still get the same error with Python 3.

Here are the Python versions used:

 ~/w/thrash-protect   master   /usr/bin/python
Python 2.7.6 (default, Nov 23 2017, 15:49:48) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
 ~/w/thrash-protect   master   /usr/bin/env python3
Python 3.5.2 (default, Mar 22 2017, 12:47:19) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

Wrong order of selectors?

## sorted from cheap to expensive. Also, it is surely smart to be quick on refreezing a recently unfrozen process if host starts thrashing again.
self.collection = [LastFrozenProcessSelector(), PageFaultingProcessSelector(), OOMScoreProcessSelector()]

Is this order right? The README seems to suggest that OOMScoreProcessSelector() should be before PageFaultingProcessSelector().

[Query] Is there a way to skip console apps from being STOPPED

Hi,

thrash-protect is helping me avoid system freezes due to swapping on my old laptop with 4 GB RAM and an HDD. I have one issue: it keeps freezing "ncdu" (a curses-based du utility) running in xfce terminal. I have tried "fg" after that, but arrow keys and commands do not work.

I tried whitelisting it in the script and restarting the service (I am on Manjaro Linux), but it still freezes ncdu.

I have also tried running it with "ionice -c3 ncdu -x /home", but still no luck.

Please help

Edit: how do I gracefully stop this script/service when it has a few processes STOPPED? Can I just kill the script and send the SIGCONT signal to the processes it has STOPPED?
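As a rough illustration of the "kill the script, then send SIGCONT" idea from the edit above, here is a minimal sketch that looks for processes in state T under /proc and resumes them. The function names are hypothetical and not part of thrash-protect. Note that it also resumes anything you stopped on purpose (for example with Ctrl-Z).

```python
import os
import signal


def parse_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the LAST closing parenthesis.
    """
    after_comm = stat_line[stat_line.rfind(")") + 1:]
    return after_comm.split()[0]


def stopped_pids(proc_root="/proc"):
    """Yield pids whose state is 'T' (stopped by a signal such as SIGSTOP)."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as stat_file:
                if parse_state(stat_file.read()) == "T":
                    yield int(entry)
        except OSError:
            # The process exited between listdir() and open()/read().
            continue


def resume_all_stopped():
    """Send SIGCONT to every stopped process; return the pids signalled."""
    resumed = []
    for pid in stopped_pids():
        try:
            os.kill(pid, signal.SIGCONT)
            resumed.append(pid)
        except (ProcessLookupError, PermissionError):
            # Gone already, or owned by another user and we are not root.
            continue
    return resumed
```

Running resume_all_stopped() as root after stopping the service should wake up whatever was left in the STOPPED state.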

Frequent audio stuttering?

Been trying to use this program but whenever I start the service audio starts stuttering every few minutes or so. More specifically it seems to happen when ram is limited, even casually using the computer - not doing anything that should require excessive swapping. This usually doesn't happen until the system starts thrashing badly. Is there a way to prevent this?

Is it possible to have a portable way to trigger mouse cursor to change state on whatever process thrash-protect is throttling.

On Windows, when heavy swapping stops windows from responding for more than a couple of seconds, the cursor usually changes to a "busy" variant to show that the computer has not broken and that something is still working. With thrash-protect, it would be nice to have some way to detect that a process is being throttled and trigger a cursor state change when the mouse is over that process's window (or during system-wide throttling in general). That would make it less jarring when windows become unresponsive, and it would indicate whether an application is thrashing or crashing (no indicator meaning the latter). I think at least the latter use would be the most useful: there have been countless times when thrash-protect kicked in and my heart stopped a little because I wasn't sure whether X or some other application holding important temporary data was crashing or thrashing. A loading/processing cursor would have greatly alleviated that.

protect against thrashing not caused by swapping?

I had heard from multiple sources (including your README) that turning off swap can prevent thrashing, but this is not true. The executable files (and some data files) of processes have to be cached by the OS for the processes to run. If there is not enough physical memory and swap is off, the OS has to discard and refill large amounts of cache as it switches between processes, which can cause thrashing.

I observed this issue on my laptop, which has 4 GB of memory and swap turned off. I monitored I/O with atop/iotop during the thrashing and found that firefox, thunderbird, eclipse, amule etc. generated an enormous amount of reads, and the disk stayed 100% busy.

Currently thrash-protect does not seem able to handle this situation. I suggest sending kill -STOP to some processes if the disk has been 100% busy for a while.
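To make the "disk has been 100% busy" condition concrete, here is a minimal sketch that estimates how busy a device was between two snapshots of /proc/diskstats. It relies on the "time spent doing I/Os (ms)" counter, the tenth counter after the device name. The function names and the sampling approach are my own assumptions, not something thrash-protect does today.

```python
def io_ticks_ms(diskstats_text, device):
    """Return the 'time spent doing I/Os' counter (ms) for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, then the per-device counters;
        # the tenth counter (index 12) is milliseconds spent doing I/O.
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)


def busy_fraction(before_text, after_text, device, elapsed_seconds):
    """Fraction of the sampling window during which the device was busy."""
    delta_ms = io_ticks_ms(after_text, device) - io_ticks_ms(before_text, device)
    return min(1.0, max(0.0, delta_ms / (elapsed_seconds * 1000.0)))
```

A caller would read /proc/diskstats twice, a known number of seconds apart, and treat a fraction close to 1.0 over several consecutive samples as "100% busy for a while".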

How to use

How do I use thrash-protect? The README gives no pointers on how to install, configure, or run it.

Full /tmp seems to cause thrash-protect to fail unfreezing processes

A while ago, I experienced severe thrashing on my workstation, and thrash-protect apparently did not help. I should do more research into it and see if I can reproduce it.

  1. Memory setup: 8G memory, 1.5G swap, chromium holding significant amounts of the available memory.
  2. I piped some big file to /tmp, which is on tmpfs. 4G limit.
  3. I resized /tmp to 6G since it was too small to hold my file
  4. I lost control of the box

Consider storing temp files on /dev/shm instead of /tmp

ref #22

Changing the directory breaks backward compatibility a bit - for instance, I have set up monitoring towards this file on multiple production servers - hence I don't want to do this change unless it has significant benefits. Eventually, it would be nice to do research to see how much performance impact it has to write the pid-set to /tmp on a system with /tmp set up on the same physical disk as the swap partition, compared to writing the pid-set to /dev/shm.
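Before comparing the two locations, it may help to check which filesystem each path actually lives on. The sketch below parses the contents of /proc/mounts and returns the filesystem type of the longest matching mount point. The function name is hypothetical, and it ignores symlinks and the octal escapes /proc/mounts uses for unusual characters.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount point that contains path."""
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Prefer the longest mount point; on ties the later entry wins,
        # since it was mounted on top of the earlier one.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type
```

Calling it with the text of /proc/mounts and /tmp/thrash-protect-frozen-pid-list shows whether the pid list is written to tmpfs or to a disk-backed filesystem.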

A proper fix for job control and parent process getting frozen?

Suspending a child process causes side-effects for the parent sometimes (notably, bash job control - that's the only confirmed case I have so far, though I haven't done much research on this). Also, sometimes the parent process automatically gets suspended (notably, sudo).

Two work-arounds have been implemented so far. The first thing I did was to always resume the session process id and the group process id (I think the parent process id was not easily available from the scope where I did this). The second was to always stop the parent before the child if the parent process name was equal to "bash". In my upcoming commit, "sudo" has been added to this list.

I have come to think that the proper fix for both issues may be to always freeze the parent process before suspending a child (possibly recursively, but obviously never attempting to freeze pid 1). I need to think a bit and do some research before going this route.
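The "parent before child, recursively, but never pid 1" idea could be sketched as follows. This is only an illustration under my own assumptions: the function names are hypothetical, and a real implementation would probably stop climbing at whitelisted ancestors rather than go all the way up.

```python
def freeze_order(pid, get_ppid):
    """Return the pids to stop, outermost ancestor first, ending with pid.

    get_ppid maps a pid to its parent pid. pid 1 and pid 0 are never
    included, and a repeated pid ends the walk to avoid looping forever.
    """
    chain = []
    seen = set()
    current = pid
    while current is not None and current > 1 and current not in seen:
        seen.add(current)
        chain.append(current)
        current = get_ppid(current)
    chain.reverse()
    return chain


def read_ppid(pid):
    """Read the parent pid from /proc/<pid>/stat (Linux only)."""
    with open("/proc/%d/stat" % pid) as stat_file:
        stat_line = stat_file.read()
    # After the parenthesised command name come: state, ppid, ...
    return int(stat_line[stat_line.rfind(")") + 1:].split()[1])
```

freeze_order(pid, read_ppid) gives the order in which SIGSTOP would be sent; resuming would go through the same list in reverse.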

make install doesn't work on ubuntu 16.04

~/Downloads/thrash-protect-master$ sudo make install
[sudo] password for user: 
install "thrash-protect.py" """/usr/sbin/thrash-protect"
if [ -d """/lib/systemd/system" ]; then install systemd/thrash-protect.service """/lib/systemd/system" ; \
        elif [ -d """/usr/lib/systemd/system" ]; then install systemd/thrash-protect.service """/usr/lib/systemd/system" ; fi
if [ -d """/etc/init" ]; then install upstart/thrash-protect.conf """/etc/init/thrash-protect.conf" ; fi
[ -d """/usr/lib/systemd/system" ] || [ -d """/etc/init" ] || [ -d """/lib/systemd/system" ] || install systemv/thrash-protect """/etc/init.d/thrash-protect"

Exception not handled

ERROR:root:red alert!  unacceptable time delta observed! interval: 0.5 cooldown_counter: 1 expected delay: 0 delta: 0.0539078712463 time: 1552620438.78 frozen pids: [(27290, 27295), (27290, 27765), (1042,), (1747,), (27240,)]
ERROR:root:Could not fetch process user information
Traceback (most recent call last):
  File "./thrash-protect.py", line 409, in get_process_info
    info = check_output("ps -p %d uf" % pid, shell = True).decode('utf-8')
  File "/usr/lib/python2.7/subprocess.py", line 219, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command 'ps -p 27295 uf' returned non-zero exit status 1
ERROR:root:red alert!  unacceptable time delta observed! interval: 0.5 cooldown_counter: 0 expected delay: 0 delta: 0.149600982666 time: 1552620439.35 frozen pids: [(27290, 27765), (1042,), (1747,), (27240,)]
ERROR:root:red alert!  unacceptable time delta observed! interval: 0.5 cooldown_counter: 1 expected delay: 0 delta: 0.0599908828735 time: 1552620440.44 frozen pids: [(27290, 27765), (1042,), (1747,), (27240,)]
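The traceback above comes from ps exiting with a non-zero status, which is what happens when the pid has already gone away. A minimal sketch of a more forgiving get_process_info (Python 3; the fallback text is my own invention) could look like this:

```python
import subprocess


def get_process_info(pid):
    """Return `ps` output for pid, or a placeholder if it cannot be fetched."""
    try:
        output = subprocess.check_output(
            ["ps", "-p", str(pid), "uf"], stderr=subprocess.DEVNULL
        )
    except (subprocess.CalledProcessError, OSError):
        # ps exits non-zero when the pid no longer exists; OSError covers
        # the case where the ps binary itself cannot be started.
        return "pid %d: no process information available" % pid
    return output.decode("utf-8", "replace")
```

Passing the arguments as a list also avoids going through a shell for every log line.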

unicode issues in the logging

thrash-protect froze my browser and crashed:

ERROR:root:red alert!  unacceptable time delta observed! interval: 0.5 cooldown_counter: 1 expected delay: 0 delta: 0.0579028129578 time: 1552623249.83 frozen pids: [(28299,)]
ERROR:root:red alert!  unacceptable time delta observed! interval: 0.5 cooldown_counter: 3 expected delay: 0 delta: 0.0423080921173 time: 1552623249.87 frozen pids: [(28299,), (28928,)]
ERROR:root:red alert!  unacceptable time delta observed! interval: 0.5 cooldown_counter: 5 expected delay: 0 delta: 0.0360288619995 time: 1552623250.0 frozen pids: [(28299,), (28928,), (29415,)]
Traceback (most recent call last):
  File "./thrash-protect.py", line 560, in <module>
    main()
  File "./thrash-protect.py", line 556, in main
    thrash_protect(args)
  File "./thrash-protect.py", line 531, in thrash_protect
    current.unfrozen_pid = unfreeze_something()
  File "./thrash-protect.py", line 505, in unfreeze_something
    log_unfrozen(pid_to_unfreeze)
  File "./thrash-protect.py", line 435, in log_unfrozen
    logfile.write("%s - unfrozen   pid %5s - %s - list: %s\n" % (get_date_string(), str(pid), get_process_info(pid), frozen_pids))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 107-113: ordinal not in range(128)
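The 'ascii' codec error suggests the log file is being written with an ASCII default encoding while the process information contains non-ASCII characters. One possible way around it, assuming Python 3 (the helper name is hypothetical), is to open the log file with an explicit encoding:

```python
def append_log_line(path, line):
    """Append one line to a log file, tolerating non-ASCII characters."""
    # UTF-8 can encode any ordinary text; backslashreplace covers the rare
    # leftovers (such as lone surrogates) instead of raising an exception.
    with open(path, "a", encoding="utf-8", errors="backslashreplace") as logfile:
        logfile.write(line + "\n")
```

With this, a command line containing, say, Cyrillic or accented characters ends up in the log instead of crashing the daemon.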

Rewrite mlockall logic?

  1. Use /dev/shm instead of /tmp.
    /dev/shm always uses tmpfs.

  2. Use mlockall() (works with python3):

from ctypes import CDLL

def mlockall():
    """Lock all memory to prevent swapping process."""

    MCL_CURRENT = 1
    MCL_FUTURE = 2
    MCL_ONFAULT = 4

    libc = CDLL('libc.so.6', use_errno=True)

    result = libc.mlockall(
        MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT
    )
    if result != 0:
        result = libc.mlockall(
            MCL_CURRENT | MCL_FUTURE
        )
        if result != 0:
            print('Cannot lock all memory')
        else:
            print('All memory locked with MCL_CURRENT | MCL_FUTURE')
    else:
        print('All memory locked with MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT')

mlockall()

Configuration for systems using ZRAM and/or SSD swap

The default settings work badly for me: TP often stops multiple processes. How do I configure TP?
How can I limit the number of stopped processes? How can I change the threshold at which TP starts stopping processes? How can I make TP less sensitive? (Perhaps the reason is that I use ZRAM and have fast swap.)

thrash-protect gets killed

It often gets killed with the below error:

$sudo thrash-protect
...
WARNING:root:relatively big time delta observed. interval: 0.5 cooldown_counter: 0 expected delay: 0 max acceptable delta: 0.16210890375625014 delta: 0.6467616558074951 time: 1621679841.8632085 frozen pids: [(2839,), (3728,)].  (this message is to be expected every now and then as the max acceptable delta parameter is autotuned)
WARNING:root:relatively big time delta observed. interval: 0.5 cooldown_counter: 2 expected delay: 0 max acceptable delta: 0.16210890375625014 delta: 0.8485774993896484 time: 1621679842.7120879 frozen pids: [(2839,), (3728,), (101220,)].  (this message is to be expected every now and then as the max acceptable delta parameter is autotuned)
Traceback (most recent call last):
  File "/usr/sbin/thrash-protect", line 607, in main
    thrash_protect(args)
  File "/usr/sbin/thrash-protect", line 553, in thrash_protect
    freeze_something()
  File "/usr/sbin/thrash-protect", line 470, in freeze_something
    pids_to_freeze = pids_to_freeze or global_process_selector.scan()
  File "/usr/sbin/thrash-protect", line 397, in scan
    ret = self.collection[self.scan_method_count % len(self.collection)].scan()
  File "/usr/sbin/thrash-protect", line 258, in scan
    stats = self.readStat(pid)
  File "/usr/sbin/thrash-protect", line 212, in readStat
    stats_tx = stat_file.read().decode('utf-8', 'ignore')
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/sbin/thrash-protect", line 616, in <module>
    main()
  File "/usr/sbin/thrash-protect", line 612, in main
    kill(pid_to_unfreeze, signal.SIGCONT)
TypeError: an integer is required (got type tuple)
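The second traceback shows kill() being handed a tuple, which matches the frozen pid list in the log being a list of tuples. A minimal sketch of a cleanup step that copes with that shape, and with pids that have vanished in the meantime, might be (the function name and the injectable kill parameter are my own assumptions):

```python
import os
import signal


def unfreeze_all(frozen_pids, kill=os.kill):
    """Send SIGCONT to every pid in a list of pid tuples.

    Returns the pids that no longer existed when they were signalled.
    """
    missing = []
    for pid_group in frozen_pids:
        # Each entry is a tuple such as (2839,) or (27290, 27295).
        for pid in pid_group:
            try:
                kill(pid, signal.SIGCONT)
            except ProcessLookupError:
                missing.append(pid)
    return missing
```

Calling something like this from the exception handler would leave no process frozen even if one of them exited during the scan.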

If condition is not working while writing to log file

I have noticed something strange: in the log_frozen and log_unfrozen functions, the test on the config.log_user_data_on_freeze variable does not seem to work properly. My Python version is 3.6.4:

[manjaro@manj-pc thrash-protect]$ python
Python 3.6.4 (default, Jan  5 2018, 02:35:40) 
[GCC 7.2.1 20171224] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from os import getenv, kill, getpid, unlink, getpgid, getsid
>>> class config:
...     log_user_data_on_unfreeze = int(getenv('THRASH_PROTECT_LOG_USER_DATA_ON_UNFREEZE', '1'))
... 
>>> if config.log_user_data_on_unfreeze:
...     print ("Log user data")
... else:
...     print ("No User data")
... 
Log user data

Is this a Python-version-specific issue, or do I need to set some environment variables? (I have both Python 3.6 and 2.7.)
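For what it's worth, the session above tests log_user_data_on_unfreeze with a default of '1' and no environment variable set, so int('1') is truthy and "Log user data" is the expected output. A small sketch of the same pattern, with a hypothetical helper name and the default taken from the snippet above:

```python
import os


def read_flag(name, default, environ=None):
    """Read an integer on/off flag from the environment."""
    environ = os.environ if environ is None else environ
    # An unset variable falls back to the default string; "0" becomes 0,
    # which is falsy, while "1" becomes 1, which is truthy.
    return int(environ.get(name, default))
```

So the branch should only switch to "No User data" when the variable is explicitly exported as 0 in the environment the service runs in.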

Also, how do I make the script less aggressive? By increasing THRASH_PROTECT_INTERVAL to 2 seconds and THRASH_PROTECT_SWAP_PAGE_THRESHOLD to 16?

Pids sometimes getting stuck in the frozen pid list

I've noticed this a few times on some specific RHEL VMs with too little memory installed: /tmp/thrash-protect-frozen-pid-list gets created and stays there with one pid. The pid is also on the list of frozen processes in /var/log/thrash-protect. In two cases the process was (IIRC) /sbin/portreserve and the process was running. In the third case the process didn't exist.

Said systems are running version 0.11.4, upgrading should be the first priority. If I haven't rediscovered this issue one year after upgrading, I'll just close this issue.

Thrash-protect may stop itself

I tried to start thrash-protect in a terminal window, using a non-whitelisted terminal program. The terminal window got frozen, and so did thrash-protect. I should look more into this.

Use mlock(all) to prevent swapout

You can use mlock() to lock specific memory regions, and mlockall() to lock an entire process's memory so that it won't be swapped out.

This would probably be more ideal in a C implementation, but it can be done in Python with ctypes.
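A minimal sketch of the ctypes route, which also reports why the call failed (typically missing privileges or a too-low memlock limit). The flag values are the usual Linux ones on x86 and ARM and are an assumption here, as is the function name.

```python
import ctypes
import errno
import os

MCL_CURRENT = 1  # assumed Linux value: lock pages that are mapped now
MCL_FUTURE = 2   # assumed Linux value: also lock pages mapped later


def try_mlockall(flags=MCL_CURRENT | MCL_FUTURE):
    """Call mlockall(); return (True, None) or (False, reason)."""
    # CDLL(None) exposes the symbols already loaded into this process,
    # which includes the C library's mlockall().
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(ctypes.c_int(flags)) == 0:
        return True, None
    code = ctypes.get_errno()
    name = errno.errorcode.get(code, str(code))
    return False, "%s: %s" % (name, os.strerror(code))
```

Logging the returned reason instead of asserting on the result would keep the program running when it is started without the needed privileges.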

Too many "unacceptable time delta observed!" error messages

I am using thrash-protect with default settings, except for the whitelist and logging user data on freeze (the config file is provided below).

I have been seeing a LOT of "unacceptable time delta observed!" messages in the journal logs. Is there a way to resolve this issue?

My laptop is an old Inspiron 1520 with a Core 2 processor, 4 GB RAM and a 160 GB HDD at 5400 rpm.

The journal log below is a one-minute snapshot; such messages appear every minute, filling my journal log.

System Details:

CPU~Dual core Intel Core2 Duo T7300 (-MCP-) 
speed/max~1572/2001 MHz 
Kernel~4.14.27-1-MANJARO x86_64

free command output:
              total        used        free      shared  buff/cache   available
Mem:           3947        3585         113          23         248         159
Swap:          8191        1776        6415

thrash-protect log:

2018-03-22 07:47:52 - frozen   pid  5993 - u:   manjaro  CPU:  1.3%  MEM:  4.2%  CMD: /usr/lib/chromium/chromium - list: [(5993,)]
2018-03-22 07:47:53 - frozen   pid  6388 - u:   manjaro  CPU:  0.7%  MEM:  6.0%  CMD: /usr/lib/chromium/chromium - list: [(5993,), (6388,)]
2018-03-22 07:47:54 - frozen   pid  6007 - u:   manjaro  CPU:  1.1%  MEM:  5.5%  CMD: /usr/lib/chromium/chromium - list: [(5993,), (6388,), (6007,)]
2018-03-22 07:47:54 - frozen   pid  6372 - u:   manjaro  CPU:  0.6%  MEM:  4.0%  CMD: /usr/lib/chromium/chromium - list: [(5993,), (6388,), (6007,), (6372,)]
2018-03-22 07:47:57 - unfrozen pid 6372
2018-03-22 07:47:57 - unfrozen pid 6007
2018-03-22 07:47:58 - unfrozen pid 5993
2018-03-22 07:47:58 - unfrozen pid 6388
2018-03-22 07:48:11 - frozen   pid  5993 - u:   manjaro  CPU:  1.3%  MEM:  4.0%  CMD: /usr/lib/chromium/chromium - list: [(5993,)]
2018-03-22 07:48:12 - frozen   pid  6388 - u:   manjaro  CPU:  0.7%  MEM:  6.0%  CMD: /usr/lib/chromium/chromium - list: [(5993,), (6388,)]
2018-03-22 07:48:13 - unfrozen pid 6388
2018-03-22 07:48:13 - unfrozen pid 5993

journal log:

Mar 22 07:47:30 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:45 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:48 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:49 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:49 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:49 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:49 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:50 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:51 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:52 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:52 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:52 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:53 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:53 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:54 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:54 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:54 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:54 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:55 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:55 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:55 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:47:59 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!
Mar 22 07:48:03 manjaro-pc thrash-protect[15846]: ERROR:root:red alert!  unacceptable time delta observed!

thrash-protect environment variable config:

cat /etc/systemd/system/thrash-protect.service.d/override.conf
[Service]
Environment="THRASH_PROTECT_CMD_WHITELIST=sshd bash -bash sudo xinit X SCREEN ssh xterm xfce4-terminal Xorg xfwm4 systemd-journal journalctl i3lock xautolock ncdu vim Thunar  xfce4-power-manager NetworkManager"
Environment="THRASH_PROTECT_LOG_USER_DATA_ON_FREEZE=1"
#Environment="THRASH_PROTECT_SWAP_PAGE_THRESHOLD=8"
#Environment="THRASH_PROTECT_DATE_HUMAN_READABLE=1"
#Environment="THRASH_PROTECT_LOG_USER_DATA_ON_UNFREEZE=0"
#Environment="THRASH_PROTECT_INTERVAL=1"
