tobixen / thrash-protect
Simple-Stupid user-space program doing "kill -STOP" and "kill -CONT" to protect from thrashing
License: GNU General Public License v3.0
I just installed it on my machine (uname -a gives Linux userPC 3.19.0-32-generic #37~14.04.1-Ubuntu SMP Thu Oct 22 09:41:40 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux) with sudo make install. /lib/systemd/system/thrash-protect.service and /usr/sbin/thrash-protect have been installed as expected, but when executed the latter gives this output:
WARNING:root:failed to do mlockall() - this makes the program vulnerable of being swapped out in an extreme thrashing event
Traceback (most recent call last):
File "/usr/sbin/thrash-protect", line 517, in thrash_protect
assert(not ctypes.cdll.LoadLibrary('libc.so.6').mlockall(ctypes.c_int(3)))
AssertionError
I've seen that the script's shebang is #!/usr/bin/python, but a few lines further down it states that the script is for Python 3. I'd suggest replacing the shebang with #!/usr/bin/env python3. I made this change because the previous shebang resolved to Python 2.7.6, but I still get the same error with Python 3.
Here are the Python versions used:
~/w/thrash-protect master /usr/bin/python
Python 2.7.6 (default, Nov 23 2017, 15:49:48)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
~/w/thrash-protect master /usr/bin/env python3
Python 3.5.2 (default, Mar 22 2017, 12:47:19)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
Do you plan to migrate to python3?
thrash-protect/thrash-protect.py
Lines 377 to 378 in 90c0a7e
Is this order right? The README seems to suggest that OOMScoreProcessSelector() should be before PageFaultingProcessSelector().
Hi,
thrash-protect is helping me avoid system freezing due to swapping on my old 4GB ram and HDD based laptop. I have one issue, it keeps freezing "ncdu" (curses based du utility) running in xfce terminal. I have tried "fg" after that, but arrow keys and commands are not working.
I tried white listing it in the script and re-running the service ( I am on manjaro linux), but still it freezes ncdu.
I have also tried running it with "ionice -c3 ncdu -x /home", but still no luck.
Please help
edit: how do I gracefully stop this script / service, when it has few processes STOPPED (can I just kill the script and send SIGCONT signal to processes it has STOPPED)
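One possible way to recover after killing the script is to look for processes in the stopped state and send them SIGCONT. The sketch below is not part of thrash-protect; the helper names (parse_state, stopped_pids, resume_stopped) are hypothetical. Note that it resumes every stopped process it finds, including jobs you suspended on purpose with Ctrl-Z.

```python
import os
import signal

def parse_state(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the last ')'.
    return stat_text[stat_text.rindex(")") + 1:].split()[0]

def stopped_pids(proc_root="/proc"):
    """Yield pids of processes currently in the stopped ('T') state."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                if parse_state(f.read()) == "T":
                    yield int(entry)
        except (IOError, OSError):
            continue  # the process exited while we were scanning

def resume_stopped(pids, kill=os.kill):
    """Send SIGCONT to each pid; skip pids that vanished or are not ours."""
    resumed = []
    for pid in pids:
        try:
            kill(pid, signal.SIGCONT)
            resumed.append(pid)
        except OSError:
            pass
    return resumed

# Usage (run as root, after stopping the service):
#     resume_stopped(stopped_pids())
```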
Been trying to use this program but whenever I start the service audio starts stuttering every few minutes or so. More specifically it seems to happen when ram is limited, even casually using the computer - not doing anything that should require excessive swapping. This usually doesn't happen until the system starts thrashing badly. Is there a way to prevent this?
Recently I made a mistake with fork, causing my laptop to freeze up completely.
Thrash-protect should be more aggressive in suspending parent processes in such cases.
What do you think of releasing the project on pypi, if there isn't a C rewrite in the foreseeable future?
On Windows, when heavy swapping stops windows from responding for more than a couple of seconds, the cursor usually changes to a busy variant to show that the computer didn't break and that something is still working. With thrash-protect, it would be nice if there were some way to detect that a process is being throttled and trigger a cursor change when the mouse is over that process (or during system-wide throttling in general), so that it's less jarring when windows become unresponsive. It would also indicate whether the application is thrashing or crashing (no indicator would mean the latter). I think the latter use would be the most useful: there have been countless times where thrash-protect kicked in and my heart stopped a little because I wasn't sure whether X or some other application with important temporary data was crashing or thrashing. A busy cursor would have greatly alleviated that.
This is a request, which helps using this on laptops / workstations.
Is it possible to check if the process about to be frozen, is a foreground process and if so, skip to next PID in queue (like we do for the self PID)
I had heard from multiple sources (including your README) that turning off swap can prevent thrashing, but this is not true. The executable files (and some data files) of processes have to be cached by the OS to allow them to run. If there is not enough physical memory and swap is off, the OS has to discard and refill a huge amount of cache as processes are scheduled, which can cause thrashing.
I observed this issue on my laptop with 4GB of memory and swap turned off. I monitored I/O with atop/iotop during the thrashing and found that firefox, thunderbird, eclipse, amule etc. generated an enormous amount of reads, and the disk stayed 100% busy.
Currently thrash-protect does not seem able to handle this situation. I suggest sending kill -STOP to some processes if the disk has been 100% busy for a while.
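The "disk 100% busy" condition can be estimated from /proc/diskstats, where the tenth statistics field of each device line is the number of milliseconds spent doing I/O. The following is only a sketch of that measurement, not thrash-protect code; the function names are hypothetical.

```python
def io_ticks_ms(diskstats_text, device):
    """Return the milliseconds spent doing I/O for `device`.

    `diskstats_text` is the content of /proc/diskstats. Each line holds
    major, minor and device name, followed by the statistics fields, so
    the "time spent doing I/Os" counter is at index 12.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) > 12 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)

def utilization(ticks_before, ticks_after, elapsed_seconds):
    """Fraction of wall-clock time the device was busy between two samples."""
    return (ticks_after - ticks_before) / (elapsed_seconds * 1000.0)
```

Sampling io_ticks_ms twice, a known interval apart, and checking whether utilization stays close to 1.0 over several samples would give the "busy for a while" signal suggested above.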
I just ran thrash-protect and got the following output:
http://okturing.com/src/5948/body
It was an unpleasant experience: it stopped my browser and virtualbox, then I had to kill thrash-protect, and manually send SIGCONT to stopped processes.
How do I use thrash-protect? There is no pointer in the readme how to install, configure, run, ... it.
A while ago, I did experience severe thrashing on my workstation, and thrash-protect apparently did not help. I should do more research into it and see if I can reproduce it.
ref #22
Changing the directory breaks backward compatibility a bit - for instance, I have set up monitoring towards this file on multiple production servers - hence I don't want to do this change unless it has significant benefits. Eventually, it would be nice to do research to see how much performance impact it has to write the pid-set to /tmp on a system with /tmp set up on the same physical disk as the swap partition, compared to writing the pid-set to /dev/shm.
The tests fail due to a failed import of a Mock library
Suspending a child process sometimes causes side effects for the parent (notably with bash job control; that's the only confirmed case I have so far, though I haven't done much research on this). Also, sometimes the parent process automatically gets suspended (notably sudo).
Two workarounds have been implemented so far. The first was to always resume the session process id and the group process id (I think the parent process id was not easily available from the scope where I did this); the second was to always stop the parent before the child if the parent process name was equal to "bash". In my upcoming commit, "sudo" has been added to this list.
I came to think that the proper fix for both issues may be to always freeze the parent process before suspending a child (possibly recursively, but obviously never attempting to freeze pid 1). I need to think a bit and do some research before going this route.
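The recursive "parent before child" idea could be sketched roughly as follows. This is an illustration, not the project's implementation: freeze_order, read_ppid and parse_ppid are hypothetical names, and a real version would probably need a whitelist, since walking all the way up the tree would also stop things like the session manager.

```python
def parse_ppid(stat_text):
    """Return the parent pid from the contents of /proc/<pid>/stat."""
    # Split after the last ')' because the command name may contain spaces.
    return int(stat_text[stat_text.rindex(")") + 1:].split()[1])

def read_ppid(pid):
    """Look up the parent pid of a live process via /proc."""
    with open("/proc/%d/stat" % pid) as f:
        return parse_ppid(f.read())

def freeze_order(pid, get_ppid=read_ppid):
    """Return the pids to stop, outermost ancestor first.

    pid 1 (and the kernel's pid 0) are never included.
    """
    chain = []
    current = pid
    while current > 1 and current not in chain:
        chain.append(current)
        current = get_ppid(current)
    return list(reversed(chain))
```

Resuming would then walk the same list in the opposite order, child first.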
~/Downloads/thrash-protect-master$ sudo make install
[sudo] password for user:
install "thrash-protect.py" """/usr/sbin/thrash-protect"
if [ -d """/lib/systemd/system" ]; then install systemd/thrash-protect.service """/lib/systemd/system" ; \
elif [ -d """/usr/lib/systemd/system" ]; then install systemd/thrash-protect.service """/usr/lib/systemd/system" ; fi
if [ -d """/etc/init" ]; then install upstart/thrash-protect.conf """/etc/init/thrash-protect.conf" ; fi
[ -d """/usr/lib/systemd/system" ] || [ -d """/etc/init" ] || [ -d """/lib/systemd/system" ] || install systemv/thrash-protect """/etc/init.d/thrash-protect"
ERROR:root:red alert! unacceptable time delta observed! interval: 0.5 cooldown_counter: 1 expected delay: 0 delta: 0.0539078712463 time: 1552620438.78 frozen pids: [(27290, 27295), (27290, 27765), (1042,), (1747,), (27240,)]
ERROR:root:Could not fetch process user information
Traceback (most recent call last):
File "./thrash-protect.py", line 409, in get_process_info
info = check_output("ps -p %d uf" % pid, shell = True).decode('utf-8')
File "/usr/lib/python2.7/subprocess.py", line 219, in check_output
raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command 'ps -p 27295 uf' returned non-zero exit status 1
ERROR:root:red alert! unacceptable time delta observed! interval: 0.5 cooldown_counter: 0 expected delay: 0 delta: 0.149600982666 time: 1552620439.35 frozen pids: [(27290, 27765), (1042,), (1747,), (27240,)]
ERROR:root:red alert! unacceptable time delta observed! interval: 0.5 cooldown_counter: 1 expected delay: 0 delta: 0.0599908828735 time: 1552620440.44 frozen pids: [(27290, 27765), (1042,), (1747,), (27240,)]
thrash-protect froze my browser and crashed:
ERROR:root:red alert! unacceptable time delta observed! interval: 0.5 cooldown_counter: 1 expected delay: 0 delta: 0.0579028129578 time: 1552623249.83 frozen pids: [(28299,)]
ERROR:root:red alert! unacceptable time delta observed! interval: 0.5 cooldown_counter: 3 expected delay: 0 delta: 0.0423080921173 time: 1552623249.87 frozen pids: [(28299,), (28928,)]
ERROR:root:red alert! unacceptable time delta observed! interval: 0.5 cooldown_counter: 5 expected delay: 0 delta: 0.0360288619995 time: 1552623250.0 frozen pids: [(28299,), (28928,), (29415,)]
Traceback (most recent call last):
File "./thrash-protect.py", line 560, in <module>
main()
File "./thrash-protect.py", line 556, in main
thrash_protect(args)
File "./thrash-protect.py", line 531, in thrash_protect
current.unfrozen_pid = unfreeze_something()
File "./thrash-protect.py", line 505, in unfreeze_something
log_unfrozen(pid_to_unfreeze)
File "./thrash-protect.py", line 435, in log_unfrozen
logfile.write("%s - unfrozen pid %5s - %s - list: %s\n" % (get_date_string(), str(pid), get_process_info(pid), frozen_pids))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 107-113: ordinal not in range(128)
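The UnicodeEncodeError above comes from writing a command line containing non-ASCII characters to a log file that uses the ASCII codec. One way around it, sketched here under the assumption that the log is opened per write (append_log_line is a hypothetical helper, not a function in thrash-protect), is to open the file with an explicit UTF-8 encoding:

```python
import io

def append_log_line(path, line):
    """Append one line to the log, encoded as UTF-8 regardless of locale."""
    # Under Python 2, ps output arrives as a byte string; decode it first.
    if isinstance(line, bytes):
        line = line.decode("utf-8", "replace")
    with io.open(path, "a", encoding="utf-8") as logfile:
        logfile.write(line + u"\n")
```

io.open accepts the encoding argument on both Python 2.7 and Python 3, so the same call works with either interpreter.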
Use /dev/shm instead of /tmp. /dev/shm always uses tmpfs.
Use mlockall() (works with python3):
from ctypes import CDLL

def mlockall():
    """Lock all memory to prevent swapping process."""
    MCL_CURRENT = 1
    MCL_FUTURE = 2
    MCL_ONFAULT = 4
    libc = CDLL('libc.so.6', use_errno=True)
    result = libc.mlockall(
        MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT
    )
    if result != 0:
        result = libc.mlockall(
            MCL_CURRENT | MCL_FUTURE
        )
        if result != 0:
            print('Cannot lock all memory')
        else:
            print('All memory locked with MCL_CURRENT | MCL_FUTURE')
    else:
        print('All memory locked with MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT')

mlockall()
The default settings are bad for me: TP often stops multiple processes. How do I configure TP?
How to limit the number of stopped processes? How to change the threshold at which TP starts to stop the processes? How to make TP less sensitive? (Perhaps the fact is that I use ZRAM and have a fast swap.)
It often gets killed with the below error:
$sudo thrash-protect
...
WARNING:root:relatively big time delta observed. interval: 0.5 cooldown_counter: 0 expected delay: 0 max acceptable delta: 0.16210890375625014 delta: 0.6467616558074951 time: 1621679841.8632085 frozen pids: [(2839,), (3728,)]. (this message is to be expected every now and then as the max acceptable delta parameter is autotuned)
WARNING:root:relatively big time delta observed. interval: 0.5 cooldown_counter: 2 expected delay: 0 max acceptable delta: 0.16210890375625014 delta: 0.8485774993896484 time: 1621679842.7120879 frozen pids: [(2839,), (3728,), (101220,)]. (this message is to be expected every now and then as the max acceptable delta parameter is autotuned)
Traceback (most recent call last):
File "/usr/sbin/thrash-protect", line 607, in main
thrash_protect(args)
File "/usr/sbin/thrash-protect", line 553, in thrash_protect
freeze_something()
File "/usr/sbin/thrash-protect", line 470, in freeze_something
pids_to_freeze = pids_to_freeze or global_process_selector.scan()
File "/usr/sbin/thrash-protect", line 397, in scan
ret = self.collection[self.scan_method_count % len(self.collection)].scan()
File "/usr/sbin/thrash-protect", line 258, in scan
stats = self.readStat(pid)
File "/usr/sbin/thrash-protect", line 212, in readStat
stats_tx = stat_file.read().decode('utf-8', 'ignore')
ProcessLookupError: [Errno 3] No such process
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/sbin/thrash-protect", line 616, in <module>
main()
File "/usr/sbin/thrash-protect", line 612, in main
kill(pid_to_unfreeze, signal.SIGCONT)
TypeError: an integer is required (got type tuple)
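Both tracebacks look avoidable: the first is a process disappearing between the scan and the read of its stat file, and the second is a tuple from the frozen-pid list being passed straight to kill(). A minimal Python 3 sketch of handling both follows; read_stat and resume_all are hypothetical names, and the resume order (last frozen first, child before parent) is an assumption.

```python
import os
import signal

def read_stat(pid):
    """Return the text of /proc/<pid>/stat, or None if the process is gone."""
    try:
        with open("/proc/%d/stat" % pid, "rb") as stat_file:
            return stat_file.read().decode("utf-8", "ignore")
    except (FileNotFoundError, ProcessLookupError):
        return None

def resume_all(frozen_pids, kill=os.kill):
    """Send SIGCONT to every pid in a list of tuples like [(27290, 27295), (1042,)]."""
    resumed = []
    for group in reversed(frozen_pids):
        # Each entry is a tuple, so unpack it instead of passing it to kill().
        for pid in reversed(group):
            try:
                kill(pid, signal.SIGCONT)
                resumed.append(pid)
            except ProcessLookupError:
                pass  # already exited; nothing to resume
    return resumed
```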
I have noticed a strange thing: in the log_frozen and log_unfrozen functions, the test for the config.log_user_data_on_freeze variable is not working properly. My Python version is 3.6.4:
[manjaro@manj-pc thrash-protect]$ python
Python 3.6.4 (default, Jan 5 2018, 02:35:40)
[GCC 7.2.1 20171224] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from os import getenv, kill, getpid, unlink, getpgid, getsid
>>> class config:
...     log_user_data_on_unfreeze = int(getenv('THRASH_PROTECT_LOG_USER_DATA_ON_UNFREEZE', '1'))
...
>>> if config.log_user_data_on_unfreeze:
...     print ("Log user data")
... else:
...     print ("No User data")
...
Log user data
Is this a Python-version-specific issue, or do I need to set any environment variables? (I have both Python 3.6 and 2.7.)
Also, how do I make the script less aggressive? By increasing THRASH_PROTECT_INTERVAL to 2 seconds and THRASH_PROTECT_SWAP_PAGE_THRESHOLD to 16?
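In the interpreter session above, the fallback passed to getenv is '1', so the flag evaluates as true whenever the variable is not set in the environment; that alone would explain the "Log user data" output on any Python version. The small sketch below illustrates this behaviour; env_flag is a hypothetical helper, not a function in the script.

```python
import os

def env_flag(name, default, environ=os.environ):
    """Read an integer-valued environment variable as a boolean flag."""
    return bool(int(environ.get(name, default)))

# With nothing set, the default of '1' wins and the flag is on.
unset = env_flag("THRASH_PROTECT_LOG_USER_DATA_ON_UNFREEZE", "1", environ={})

# Setting the variable to 0 for the process turns the flag off.
disabled = env_flag(
    "THRASH_PROTECT_LOG_USER_DATA_ON_UNFREEZE", "1",
    environ={"THRASH_PROTECT_LOG_USER_DATA_ON_UNFREEZE": "0"},
)
```

For the systemd service, the variable has to be set in the unit's environment (for example with an Environment= line in an override file), not just in an interactive shell.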
I've noticed this a few times on some specific RHEL VMs with too little memory installed: /tmp/thrash-protect-frozen-pid-list gets created and stays there with one pid. The pid is also on the list of frozen processes in /var/log/thrash-protect. In two cases the process was (IIRC) /sbin/portreserve and the process was running. In the third case the process didn't exist.
Said systems are running version 0.11.4, upgrading should be the first priority. If I haven't rediscovered this issue one year after upgrading, I'll just close this issue.
IMHO PSI is probably the best metric for detecting thrashing.
https://lwn.net/Articles/759658/
https://facebookmicrosites.github.io/psi/
You can try to use it to detect thrashing instead of vmstat.
PSI file example (/proc/pressure/memory):
some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
Use the total metrics.
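A rough sketch of reading these values: total is a cumulative stall time in microseconds, so the difference between two samples divided by the elapsed time gives the fraction of time tasks were stalled on memory. The function names below are hypothetical.

```python
def parse_pressure(text):
    """Parse the content of /proc/pressure/memory into nested dicts."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        values = {}
        for item in parts[1:]:
            key, value = item.split("=", 1)
            # 'total' is an integer counter; the averages are percentages.
            values[key] = int(value) if key == "total" else float(value)
        result[parts[0]] = values
    return result

def stall_fraction(total_before_us, total_after_us, elapsed_seconds):
    """Fraction of the interval spent stalled, from two 'total' samples."""
    return (total_after_us - total_before_us) / (elapsed_seconds * 1e6)
```

A caller would sample parse_pressure(open("/proc/pressure/memory").read())["full"]["total"] at a fixed interval and treat a high stall_fraction as thrashing; the threshold to use is left open here.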
I tried to start thrash-protect in a terminal window, with a non-whitelisted terminal program. The terminal window got frozen, and so did thrash-protect. Should look more into this.
You can use mlock() to lock specific memory regions, and mlockall() to lock a process's entire memory so that it won't be swapped out.
This would probably be better suited to a C implementation, but it can be done in Python with ctypes.
I am using thrash-protect with default settings, except for the whitelist and log user data on freeze (the config file is provided below).
I have been seeing a lot of "unacceptable time delta observed!" messages in the journal logs. Is there a way to resolve this issue?
My laptop is an old Inspiron 1520 with a Core 2 processor, 4GB RAM and a 160 GB HDD at 5400 rpm.
The journal log below is a snapshot of one minute; such messages appear every minute, filling my journal log.
System Details:
CPU~Dual core Intel Core2 Duo T7300 (-MCP-)
speed/max~1572/2001 MHz
Kernel~4.14.27-1-MANJARO x86_64
free command output:
              total        used        free      shared  buff/cache   available
Mem:           3947        3585         113          23         248         159
Swap:          8191        1776        6415
thrash-protect log:
2018-03-22 07:47:52 - frozen pid 5993 - u: manjaro CPU: 1.3% MEM: 4.2% CMD: /usr/lib/chromium/chromium - list: [(5993,)]
2018-03-22 07:47:53 - frozen pid 6388 - u: manjaro CPU: 0.7% MEM: 6.0% CMD: /usr/lib/chromium/chromium - list: [(5993,), (6388,)]
2018-03-22 07:47:54 - frozen pid 6007 - u: manjaro CPU: 1.1% MEM: 5.5% CMD: /usr/lib/chromium/chromium - list: [(5993,), (6388,), (6007,)]
2018-03-22 07:47:54 - frozen pid 6372 - u: manjaro CPU: 0.6% MEM: 4.0% CMD: /usr/lib/chromium/chromium - list: [(5993,), (6388,), (6007,), (6372,)]
2018-03-22 07:47:57 - unfrozen pid 6372
2018-03-22 07:47:57 - unfrozen pid 6007
2018-03-22 07:47:58 - unfrozen pid 5993
2018-03-22 07:47:58 - unfrozen pid 6388
2018-03-22 07:48:11 - frozen pid 5993 - u: manjaro CPU: 1.3% MEM: 4.0% CMD: /usr/lib/chromium/chromium - list: [(5993,)]
2018-03-22 07:48:12 - frozen pid 6388 - u: manjaro CPU: 0.7% MEM: 6.0% CMD: /usr/lib/chromium/chromium - list: [(5993,), (6388,)]
2018-03-22 07:48:13 - unfrozen pid 6388
2018-03-22 07:48:13 - unfrozen pid 5993
journal log:
Mar 22 07:47:30 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:45 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:48 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:49 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:49 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:49 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:49 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:50 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:51 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:52 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:52 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:52 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:53 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:53 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:54 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:54 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:54 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:54 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:55 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:55 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:55 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:47:59 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
Mar 22 07:48:03 manjaro-pc thrash-protect[15846]: ERROR:root:red alert! unacceptable time delta observed!
thrash-protect environment variable config:
cat /etc/systemd/system/thrash-protect.service.d/override.conf
[Service]
Environment="THRASH_PROTECT_CMD_WHITELIST=sshd bash -bash sudo xinit X SCREEN ssh xterm xfce4-terminal Xorg xfwm4 systemd-journal journalctl i3lock xautolock ncdu vim Thunar xfce4-power-manager NetworkManager"
Environment="THRASH_PROTECT_LOG_USER_DATA_ON_FREEZE=1"
#Environment="THRASH_PROTECT_SWAP_PAGE_THRESHOLD=8"
#Environment="THRASH_PROTECT_DATE_HUMAN_READABLE=1"
#Environment="THRASH_PROTECT_LOG_USER_DATA_ON_UNFREEZE=0"
#Environment="THRASH_PROTECT_INTERVAL=1"
Around line 138, the single "/" should be a double slash "//" for integer division (without that, it keeps stopping processes irrespective of the threshold setting).
((self.swapcount[0]-prev.swapcount[0])//config.swap_page_threshold + 1.0) *
((self.swapcount[1]-prev.swapcount[1])//config.swap_page_threshold + 1.0)
Reference: https://stackoverflow.com/a/39332574
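To illustrate why the operator matters under Python 3: with true division, any nonzero swap count pushes each factor above 1.0, whereas floor division leaves it at 1.0 until the count reaches the threshold. The function below is a standalone sketch of that arithmetic only; busy_factor and its parameters are hypothetical, and how the script uses the resulting product is not shown here.

```python
def busy_factor(swapin_delta, swapout_delta, threshold, floor=True):
    """Product of the two per-direction factors from the expression above."""
    if floor:
        # Integer division: counts below the threshold contribute nothing.
        return (swapin_delta // threshold + 1.0) * (swapout_delta // threshold + 1.0)
    # True division (Python 3 '/'): every page swapped raises the factor.
    return (swapin_delta / threshold + 1.0) * (swapout_delta / threshold + 1.0)

below_threshold_floor = busy_factor(3, 3, 4)              # 1.0
below_threshold_true = busy_factor(3, 3, 4, floor=False)  # 3.0625
```

Under Python 2 the "/" operator already performs floor division on integers, which may be why the difference only shows up after switching interpreters.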