When volumes are attached to a server, the operation occasionally results in:

```
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. <class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500)
```

The volume stays in state `available`. Further attachment attempts result in:

```
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. <class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500)
```

Note that the first error is a `MessagingTimeout`, while the second is a `RemoteError`. From this point on the volume cannot be attached again; it must be deleted and recreated.
This error happens quite frequently, possibly amplified by retry loops in automation (and especially by Kubernetes). We see around 1500 of these errors per hour.
Trace
The error happens on the Nova Compute agents. They complain about:

```
DBDuplicateEntry (psycopg2.IntegrityError) duplicate key value violates unique constraint "block_device_mapping_instance_uuid_volume_id_deleted_idx"
```
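Judging by its name, the violated index is a unique index over `(instance_uuid, volume_id, deleted)`. The failure mode is easy to demonstrate standalone (a sketch with sqlite3 standing in for PostgreSQL and the schema reduced to the indexed columns):

```python
# Sketch of the DBDuplicateEntry failure mode. sqlite3 stands in for
# PostgreSQL; the column set is inferred from the index name in the trace.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE block_device_mapping (
        id            INTEGER PRIMARY KEY,
        instance_uuid TEXT,
        volume_id     TEXT,
        deleted       INTEGER DEFAULT 0
    )""")
db.execute("""
    CREATE UNIQUE INDEX block_device_mapping_instance_uuid_volume_id_deleted_idx
    ON block_device_mapping (instance_uuid, volume_id, deleted)""")

row = ("cf9fc220-ec34-42ff-ac63-81f6266882a2",  # instance uuid (from the logs below)
       "59f311dd-1a2e-4f6e-b391-a045d3852181",  # volume uuid
       0)                                       # deleted = 0, i.e. a live row

insert = ("INSERT INTO block_device_mapping "
          "(instance_uuid, volume_id, deleted) VALUES (?, ?, ?)")

# First attach attempt: the agent creates the mapping -- this succeeds.
db.execute(insert, row)

# Retry after the API-side timeout: same instance/volume pair -- raises
# sqlite3.IntegrityError, the analogue of the psycopg2.IntegrityError above.
try:
    db.execute(insert, row)
except sqlite3.IntegrityError as exc:
    print("duplicate insert rejected:", exc)
```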
The responsible code in the agent creates a database entry to store the block device mapping:
```python
def reserve_block_device_name(self, context, instance, device,
                              volume_id, disk_bus, device_type):
    @utils.synchronized(instance.uuid)
    def do_reserve():
        bdms = (
            objects.BlockDeviceMappingList.get_by_instance_uuid(
                context, instance.uuid))

        # NOTE(ndipanov): We need to explicitly set all the fields on the
        #                 object so that obj_load_attr does not fail
        new_bdm = objects.BlockDeviceMapping(
            context=context,
            source_type='volume', destination_type='volume',
            instance_uuid=instance.uuid, boot_index=None,
            volume_id=volume_id,
            device_name=device, guest_format=None,
            disk_bus=disk_bus, device_type=device_type)

        new_bdm.device_name = self._get_device_name_for_instance(
            instance, bdms, new_bdm)

        # NOTE(vish): create bdm here to avoid race condition
        new_bdm.create()
        return new_bdm

    return do_reserve()
```
When a volume is attached, this method is called via AMQP RPC from the Nova API:
```python
def _create_volume_bdm(self, context, instance, device, volume_id,
                       disk_bus, device_type, is_local_creation=False):
    if is_local_creation:
        # when the creation is done locally we can't specify the device
        # name as we do not have a way to check that the name specified is
        # a valid one.
        # We leave the setting of that value when the actual attach
        # happens on the compute manager
        volume_bdm = objects.BlockDeviceMapping(
            context=context,
            source_type='volume', destination_type='volume',
            instance_uuid=instance.uuid, boot_index=None,
            volume_id=volume_id or 'reserved',
            device_name=None, guest_format=None,
            disk_bus=disk_bus, device_type=device_type)
        volume_bdm.create()
    else:
        # NOTE(vish): This is done on the compute host because we want
        #             to avoid a race where two devices are requested at
        #             the same time. When db access is removed from
        #             compute, the bdm will be created here and we will
        #             have to make sure that they are assigned atomically.
        volume_bdm = self.compute_rpcapi.reserve_block_device_name(
            context, instance, device, volume_id, disk_bus=disk_bus,
            device_type=device_type)
    return volume_bdm
```
Reproduction
One of our users was able to reproduce the problem quite reliably. Upon further inspection we found that it happens when multiple volumes are attached to the same instance in short succession. The use case here is a Kubernetes Pod that references two volumes: whenever the Pod is created, the volumes are attached almost simultaneously. Curiously, the problem already occurs with just 2 volumes.
To take Kubernetes out of the equation, I created a script to test this suspicion:
https://gist.github.com/BugRoger/a24d616912ede75b5ce17a53ef0b6614#file-volumes-sh
It works like this:
- Create 10 volumes
- Wait for the volumes to become `available`
- Attach all volumes simultaneously to the same instance
- Observe `MessagingTimeout`
- Try to attach again
- Observe `RemoteError`
In this scenario there are no retries: for each volume, only a single attach call is made. The volumes are fresh and have never been attached before, and everything is sanely ordered: all volumes are `available` before the attach is attempted. A minimal Python equivalent of the script is sketched below.
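Roughly, the shell script amounts to the following (a sketch using openstacksdk; the cloud name, volume size, and volume names are placeholders, and error handling is reduced to printing):

```python
# Sketch of the reproduction in Python rather than shell, using openstacksdk.
import threading
import openstack

conn = openstack.connect(cloud="mycloud")  # placeholder cloud name
server = conn.compute.get_server("cf9fc220-ec34-42ff-ac63-81f6266882a2")

# Steps 1+2: create 10 volumes and wait for each to become 'available'.
volumes = [conn.block_storage.create_volume(size=1, name=f"repro-{i}")
           for i in range(10)]
volumes = [conn.block_storage.wait_for_status(v, status="available")
           for v in volumes]

# Step 3: fire all attach calls at (almost) the same time, one thread per
# volume, mirroring the parallel attach calls in the shell script.
def attach(volume):
    print(f"Attaching {volume.id} to {server.id}")
    try:
        conn.compute.create_volume_attachment(server, volume_id=volume.id)
    except Exception as exc:  # expect the MessagingTimeout-wrapping HTTP 500
        print(f"{volume.id}: {exc}")

threads = [threading.Thread(target=attach, args=(v,)) for v in volumes]
for t in threads:
    t.start()
for t in threads:
    t.join()
```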
The test confirmed the suspicion:
```
Attaching 59f311dd-1a2e-4f6e-b391-a045d3852181 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching 7a3f88e7-00ca-4822-b5ab-f3e1c6e1c027 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching f90cc31a-1d44-4d82-83b2-cf889c78a7a2 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching 9b609d59-0a6d-492d-a166-312921dc05f4 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching d6265622-7681-49ca-9ad2-72a32b516dcb to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching 55121026-5938-4af0-ba17-003f924a99aa to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching 0d3a1991-590c-4ef1-9ed8-390d7eefa956 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching a72f8847-b49b-4d7c-b898-87b05c8a2a73 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching e6d22e21-7283-499f-b077-d954dc72ec35 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching d72f723e-e438-4d5b-a741-18a5104acc9c to cf9fc220-ec34-42ff-ac63-81f6266882a2
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-b9e5db43-1150-48b9-9cad-3a4ff1dbacc3)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-2283a46b-1cb3-4527-b8ba-9bc12cbed344)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-c4c67966-2c01-44fe-b1fc-83c546a1b29d)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-4f8ae364-5fc9-4946-9bff-bae41441bf92)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-18c2cb27-da42-498c-aa41-33932ce92692)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-ed9f451c-23d9-47f2-9c45-75c67198f470)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-d44b5a42-df3f-4f91-a21e-6383fbe83899)
```
The API call was successful for 3 of the 10 volumes; these went into state `attaching`. The `MessagingTimeout` occurred for the remaining 7 volumes.
Attaching the remaining volumes a second time:
```
Attaching 7a3f88e7-00ca-4822-b5ab-f3e1c6e1c027 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching f90cc31a-1d44-4d82-83b2-cf889c78a7a2 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching d6265622-7681-49ca-9ad2-72a32b516dcb to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching 0d3a1991-590c-4ef1-9ed8-390d7eefa956 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching a72f8847-b49b-4d7c-b898-87b05c8a2a73 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching e6d22e21-7283-499f-b077-d954dc72ec35 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching d72f723e-e438-4d5b-a741-18a5104acc9c to cf9fc220-ec34-42ff-ac63-81f6266882a2
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-daddccd2-59d6-4da4-b715-3f6bda2cf4c4)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-aff03669-4787-42cf-8966-56124e2fb16c)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-4336c024-b383-48a9-8169-0c0507ac0cda)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-e9e8ebb9-b28a-4911-abd7-80d88d2f3a39)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-eb414cef-2062-4d42-aaab-d92444626fc7)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-66eff8c2-6a48-45c6-be80-9a272378f78b)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-988019e9-bea8-42eb-92da-5d1226ae38ae)
```
Each of them failed with the `RemoteError` and a corresponding `DBDuplicateEntry`. It is very likely that attaching multiple volumes to the same instance at the same time renders the volumes unusable due to the unique key constraint violation in the database.
Theory
The Nova API creates the `BlockDeviceMapping` via an RPC call to the Nova Compute agent. That code is synchronised on `instance.uuid` and blocks until the database entry has been written. In the above scenario we therefore have 10 RPC calls queued up, each waiting for its mapping to be created.
The creation of each mapping seems to take a non-trivial amount of time; my guesstimate is 20-30 seconds. It might also be that this specific operation is quick, but that another method (possibly an earlier `attach_volume`) also synchronises on `instance.uuid` and holds the lock for longer.
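If this theory holds, the arithmetic lines up with the observed results (a back-of-the-envelope model; the per-creation time is an assumption, chosen slightly below the guesstimate so that exactly 3 of 10 calls fit into the timeout window):

```python
# Back-of-the-envelope model of the theory. CREATE_SECONDS is an assumption:
# ~15s per BDM creation is what would let exactly 3 of 10 serialized calls
# finish within the API's 50s RPC timeout, matching the observation above.
N_CALLS = 10
CREATE_SECONDS = 15  # assumed time each call holds the per-instance lock
RPC_TIMEOUT = 50     # nova-api gave up after ~50s (see the log line below)

for k in range(1, N_CALLS + 1):
    # Calls serialize on @utils.synchronized(instance.uuid), so the k-th
    # call's BDM is only created after roughly k * CREATE_SECONDS.
    finished_at = k * CREATE_SECONDS
    outcome = "ok" if finished_at <= RPC_TIMEOUT else "MessagingTimeout"
    print(f"call {k:2d}: BDM created at t={finished_at:3d}s -> {outcome}")
```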
The Nova API gives up waiting for the RPC call after 50s:

```
INFO nova.osapi_compute.wsgi.server "POST /v2/5d725ddf97664a16b011e8a8dd75772b/servers/cf9fc220-ec34-42ff-ac63-81f6266882a2/os-volume_attachments HTTP/1.1" status: 500 len: 442 time: 50.1058869
```
The Nova Compute agent nevertheless creates all 10 `BlockDeviceMapping` entries; it never aborts processing. The Nova API, however, considers the attach operation failed and returns the volume to `available`.
A second attempt to attach the same volume now fails instantly, because the first attempt partially succeeded and already created the `BlockDeviceMapping` DB entry.
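The leftover rows can be confirmed on a live system directly in the Nova database (a read-only sketch; the connection DSN is a placeholder, while table and column names follow the unique constraint named in the traceback):

```python
# Read-only sketch: look for the leftover block_device_mapping row that the
# timed-out first attempt created.
import psycopg2

conn = psycopg2.connect("dbname=nova user=nova host=localhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, instance_uuid, volume_id, deleted
        FROM block_device_mapping
        WHERE instance_uuid = %s AND volume_id = %s AND deleted = 0
        """,
        ("cf9fc220-ec34-42ff-ac63-81f6266882a2",   # instance from the test above
         "7a3f88e7-00ca-4822-b5ab-f3e1c6e1c027"))  # one of the failing volumes
    for row in cur.fetchall():
        # Any live row here collides with the unique index on the next attach.
        print(row)
```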