Comments (8)
Sounds plausible. Under which circumstances did you observe this? How do you test the iscsi session recovery?
As for the patch, I'd rather stuff the big if/fi code into a function, named say discover_portal. Further, the $portal will need to be removed from the status/monitor functions as it's not going to be available.
from resource-agents.
2012/6/26 Dejan Muhamedagic [email protected]:
> Sounds plausible. Under which circumstances did you observe this? How do you test the iscsi session recovery?
Declaring the resource failed because the "discovery" operation could
not be performed would be a mistake, because that is not a conclusion
reached by the iSCSI layer.
You can check the status of the iSCSI session using the iscsiadm
command (as shown by the patch).
Under what circumstances would I see this?
Consider a virtual machine that depends on an iSCSI resource. The iSCSI
resource fails temporarily (overload, or a switch to other
high-availability hardware). (Note that in these circumstances the
"discovery" operation may fail.) The iSCSI layer pauses I/O operations
temporarily, and applications in the virtual machine do not see a fault,
only a longer wait. If the iSCSI resource recovers within the next few
seconds (the limit is configurable by the administrator in
/etc/iscsi/iscsid.conf), nothing happens to the applications in the
virtual machine.
Conversely, if the session is considered dead because the "discovery"
operation failed, the machine can be configured to be turned off (which
fails for lack of its disk) or hot-migrated (losing the iSCSI commands
still in the queue). Both result in data corruption.
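The recovery window mentioned above lives in iscsid.conf. As a minimal sketch, assuming the relevant open-iscsi setting is node.session.timeo.replacement_timeout (the thread does not name it), a hypothetical helper get_replacement_timeout could read the configured value like this:

```shell
#!/bin/sh
# Hypothetical helper (not part of the iscsi RA): print the value of
# node.session.timeo.replacement_timeout from an iscsid.conf-style file.
# Assumption: this is the "configurable" recovery window the comment
# refers to; the thread itself does not name the setting.
get_replacement_timeout() {
    conf=${1:-/etc/iscsi/iscsid.conf}
    # Commented-out lines do not match; if the key appears more than
    # once, the last occurrence is printed.
    sed -n 's/^[[:space:]]*node\.session\.timeo\.replacement_timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
        "$conf" 2>/dev/null | tail -n 1
}
```

Calling get_replacement_timeout with no argument reads /etc/iscsi/iscsid.conf; it prints nothing if the file or the key is missing.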
> As for the patch, I'd rather stuff the big if/fi code into a function, named say discover_portal. Further, the $portal will need to be removed from the status/monitor functions as it's not going to be available.
Yes, true, but looking closely at the big if/fi code I see the following:
- The block contains code related to the "status" and "monitor"
operations that is no longer needed.
- When the operation is "stop", it returns $OCF_SUCCESS, regardless
of the value returned by $discovery.
- Only when the operation is "start" is it necessary to consider the
value returned by $discovery.
What do you think about my suggestion in this new patch?
I do not understand why $portal should be removed from the
status/monitor functions. It could be done using
the variable, as shown in this new patch.
Sorry about the patch format; the GitHub web interface is not useful for this.
--- iscsi.orig 2012-06-27 12:21:09.330289384 -0300
+++ iscsi 2012-06-27 12:21:57.714286736 -0300
@@ -257,7 +257,21 @@
     $iscsiadm -m node -p $1 -T $2 -u
 }
 open_iscsi_status() {
-    $iscsiadm -m session 2>/dev/null | grep -qs "$2$"
+    # According to RFC 3720 and open-iscsi, the session transitions are:
+    #   FREE -> ACTIVE -> LOGGED_IN -> FAILED -> FREE
+    # There are various configurable timeouts between the state transitions
+    #   "LOGGED_IN <-> FAILED" and "FAILED <-> FREE".
+    # We consider the disk lost if the session reaches the state "FREE".
+    local session_id=`$iscsiadm -m session 2>/dev/null | \
+        grep -E -s "\b$1,[0-9]+ +$2$" | sed -rn -e 's/.*\[([0-9]+)\].*/\1/p'`
+    [ -z "$session_id" ] && return 1
+    local session_state=`$iscsiadm -m session -r $session_id -P 1 | \
+        sed -rn -e 's/.*iSCSI +Session +State *: *([A-Z_]+)/\1/ip'`
+    case $session_state in
+    ACTIVE|LOGGED_IN) return 0;; # all OK
+    FAILED) return 0;; # there is still a chance to recover
+    FREE|*) return 1;;
+    esac
 }
 #
@@ -376,40 +390,30 @@
     exit $OCF_ERR_PERM
 fi
-discovery_type=${OCF_RESKEY_discovery_type}
-udev=${OCF_RESKEY_udev}
-$discovery # discover and setup the real portal string (address)
-case $? in
-0) ;;
-1) [ "$1" = stop ] && exit $OCF_SUCCESS
-    [ "$1" = monitor ] && exit $OCF_NOT_RUNNING
-    [ "$1" = status ] && exit $LSB_STATUS_STOPPED
-    exit $OCF_ERR_GENERIC
-;;
-2) [ "$1" = stop ] && {
-        iscsi_monitor || exit $OCF_SUCCESS
-    }
-    ocf_is_probe && {
-        iscsi_monitor; exit
-    }
-    exit $OCF_ERR_GENERIC
-;;
-3) ocf_is_probe && exit $OCF_NOT_RUNNING
-    if ! is_iscsid_running; then
-        [ $setup_rc -eq 1 ] &&
-            ocf_log warning "iscsid.startup probably not correctly set in /etc/iscsi/iscsid.conf"
-        [ "$1" = stop ] && exit $OCF_SUCCESS
-        exit $OCF_ERR_INSTALLED
-    fi
-    exit $OCF_ERR_GENERIC
-;;
-esac
-
 # which method was invoked?
 case "$1" in
-    start) iscsi_start
+    start)
+        # discover and set up the real portal string (address)
+        discovery_type=${OCF_RESKEY_discovery_type}
+        udev=${OCF_RESKEY_udev}
+        $discovery || case $? in
+        1) exit $OCF_ERR_GENERIC;; # target not found
+        2) exit $OCF_ERR_GENERIC;; # target found but can't connect to it unambiguously
+        3) # iscsiadm returned an error
+            if ! is_iscsid_running; then
+                [ $setup_rc -eq 1 ] &&
+                    ocf_log warning "iscsid.startup probably not correctly set in /etc/iscsi/iscsid.conf"
+                exit $OCF_ERR_INSTALLED
+            fi
+            exit $OCF_ERR_GENERIC
+        ;;
+        esac
+        iscsi_start
     ;;
-    stop) iscsi_stop
+    stop)
+        # discover and set up the real portal string (address)
+        $discovery || exit $OCF_SUCCESS
+        iscsi_stop
     ;;
     status) if iscsi_status
         then
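The state check in the first hunk can be tried without an iSCSI initiator. The following is only a sketch under assumptions: parse_session_state and session_state_ok are hypothetical names, the input is assumed to look like the output of `iscsiadm -m session -r <sid> -P 1` (a line of the form "iSCSI Session State: LOGGED_IN"), and the state-to-status mapping simply mirrors the case statement proposed in the patch.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not taken from the RA.

# Print the value of the first "iSCSI Session State:" line read from stdin.
# Assumes text shaped like `iscsiadm -m session -r <sid> -P 1` output.
parse_session_state() {
    sed -n 's/.*iSCSI[[:space:]][[:space:]]*Session[[:space:]][[:space:]]*State[[:space:]]*:[[:space:]]*\([A-Z_][A-Z_]*\).*/\1/p' |
        head -n 1
}

# Map a session state to a shell status, mirroring the patch:
# 0 = session usable or possibly recoverable, 1 = treated as lost.
session_state_ok() {
    case "$1" in
        ACTIVE|LOGGED_IN) return 0 ;;  # all OK
        FAILED)           return 0 ;;  # may still recover
        *)                return 1 ;;  # FREE, empty, or anything else
    esac
}
```

For example, `iscsiadm -m session -r "$sid" -P 1 | parse_session_state` would print the state for session `$sid`, and `session_state_ok "$state"` turns it into a return code.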
On Wed, Jun 27, 2012 at 08:51:53AM -0700, liandros wrote:
> 2012/6/26 Dejan Muhamedagic [email protected]:
> > Sounds plausible. Under which circumstances did you observe this? How do you test the iscsi session recovery?
> Declaring the resource failed because the "discovery" operation could
> not be performed would be a mistake, because that is not a conclusion
> reached by the iSCSI layer.
> You can check the status of the iSCSI session using the iscsiadm
> command (as shown by the patch).
> [...]
> case $session_state in
> ACTIVE|LOGGED_IN) return 0;; #All ok
> FAILED) return 0;; #still there's a chance to recover
> FREE|*) return 1;;
> esac
Even if the session gets into the FREE state it may recover. At
least that's what I observed here on SLE11SP2.
Returning success in case the session is in the FAILED state
won't do, because the RA cannot lie. What it can do is wait a
bit and see if the state changes.
Now, this may be considered an improvement, but it is certainly
a considerable change in behaviour. I'd suggest introducing a
try_recovery parameter, which would modify the monitor action
appropriately. The change would be to loop indefinitely in case
the session state is different from "ACTIVE|LOGGED_IN". The loop
would be stopped on monitor action timeout.
As for the discovery, we can move it out of the monitor path.
This should be two different patches.
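The proposed monitor behaviour can be sketched roughly as follows. This is an illustration, not the RA's code: get_session_state is a stub that replays states from a file named in the hypothetical variable STATE_FEED, monitor_with_recovery is a made-up name, and a plain string comparison stands in for whatever truth test the RA would use.

```shell
#!/bin/sh
# Hypothetical sketch of the try_recovery idea; no name here comes from the RA.

# Stub for a real session-state query: return the next state listed in the
# file named by $STATE_FEED and drop it from the file, so the loop can be
# exercised without an iSCSI initiator.
get_session_state() {
    state=$(head -n 1 "$STATE_FEED")
    sed '1d' "$STATE_FEED" > "$STATE_FEED.tmp" && mv "$STATE_FEED.tmp" "$STATE_FEED"
    echo "$state"
}

# Succeed at once if the session is healthy. Otherwise fail at once unless
# try_recovery is "true", in which case keep polling. The loop has no exit
# of its own: per the proposal, the monitor operation timeout ends it.
monitor_with_recovery() {
    try_recovery=$1
    poll_interval=${2:-1}
    while :; do
        case "$(get_session_state)" in
            ACTIVE|LOGGED_IN) return 0 ;;
        esac
        [ "$try_recovery" = true ] || return 1
        sleep "$poll_interval"
    done
}
```

With try_recovery set to anything other than "true" the function keeps the fail-fast behaviour, which is why making the loop opt-in limits the change for existing configurations.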
I did some patches in the meantime, the road to recovery implementation should be clear now. I'll write the patch for that too, but would like to give you credit. Can you please send me whatever you want to be put into the commit message (name, email address, etc). Thanks!
2012/7/17 Dejan Muhamedagic [email protected]:
> I did some patches in the meantime, the road to recovery implementation should be clear now. I'll write the patch for that too, but would like to give you credit. Can you please send me whatever you want to be put into the commit message (name, email address, etc). Thanks!
Thanks, that is very kind.
My identity: Leandro Santinelli [email protected]
I do not know which part of my suggestion was accepted, so please
suggest the commit message.
On Wed, Jul 18, 2012 at 05:35:41AM -0700, liandros wrote:
> 2012/7/17 Dejan Muhamedagic [email protected]:
> > I did some patches in the meantime, the road to recovery implementation should be clear now. I'll write the patch for that too, but would like to give you credit. Can you please send me whatever you want to be put into the commit message (name, email address, etc). Thanks!
> Thanks, that is very kind.
> My identity: Leandro Santinelli [email protected]
> I do not know which part of my suggestion was accepted, so please
> suggest the commit message.
All of it accepted, just with some modifications. Discovery is
now only in the start operation. The status is expanded to check
the connection status. The recovery will be tried if the
try_recovery parameter is set. I'd appreciate it if you'd take a
look and give it a try. Thanks!
2012/7/18 Dejan Muhamedagic [email protected]:
> All of it accepted, just with some modifications. Discovery is
> now only in the start operation. The status is expanded to check
> the connection status. The recovery will be tried if the
> try_recovery parameter is set. I'd appreciate it if you'd take a
> look and give it a try. Thanks!
I have checked the new code, and it works perfectly.
For cosmetic reasons, I think a "session restored" message in the logs would be nice:
@@ -299,8 +299,14 @@
     # some drivers don't return connection state, in that case
     # we'll assume that we're still connected
     case "$conn_state" in
-    "LOGGED IN") return 0;;
-    "Unknown"|"") return 0;; # this is also probably OK
+    "LOGGED IN")
+        [ -n "$msg_logged" ] &&
+            ocf_log info "connection state $conn_state. Session restored."
+        return 0;;
+    "Unknown"|"") # this is also probably OK
+        [ -n "$msg_logged" ] &&
+            ocf_log info "connection state $conn_state. Session restored."
+        return 0;;
     *) # failed
         if [ "$__OCF_ACTION" != stop ] && ! ocf_is_probe && ocf_is_true $recov; then
             if [ -z "$msg_logged" ]; then
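The logging idea in this hunk, warn once on failure and report once on recovery, can be shown in isolation. This is a sketch under assumptions: ocf_log is stubbed so the snippet runs outside a cluster, report_state is a hypothetical name, the warning text is invented for illustration, and resetting msg_logged after the restore message is an addition the patch does not make.

```shell
#!/bin/sh
# Hypothetical sketch; ocf_log is a stub so this runs outside a cluster.
ocf_log() { echo "$1: $2"; }

msg_logged=""

# Warn once when a bad connection state is first seen, and log
# "Session restored." once when a good state follows a logged failure.
report_state() {
    case "$1" in
        "LOGGED IN"|"Unknown"|"")
            if [ -n "$msg_logged" ]; then
                ocf_log info "connection state $1. Session restored."
                msg_logged=""   # assumption: reset so the message is not repeated
            fi
            return 0 ;;
        *)
            if [ -z "$msg_logged" ]; then
                ocf_log warning "connection state $1, waiting for recovery"
                msg_logged=1
            fi
            return 1 ;;
    esac
}
```

Calling report_state repeatedly with a bad state prints a single warning; the first good state after that prints the restore message, and later good states print nothing.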
Patch applied. Many thanks for the contribution! Closing.