sapcc / nova

This project forked from openstack/nova

OpenStack Compute (Nova)

Home Page: http://openstack.org

License: Apache License 2.0

Python 97.80% Shell 0.11% Smarty 2.07% NASL 0.01% Mako 0.01%

nova's Introduction

OpenStack Nova

OpenStack Nova provides a cloud computing fabric controller, supporting a wide variety of compute technologies, including: libvirt (KVM, Xen, LXC and more), Hyper-V, VMware, OpenStack Ironic and PowerVM.

Use the following resources to learn more.

API

To learn how to use Nova's API, consult the documentation available online at:

For more information on OpenStack APIs, SDKs and CLIs in general, refer to:

Operators

To learn how to deploy and configure OpenStack Nova, consult the documentation available online at:

In the unfortunate event that bugs are discovered, they should be reported to the appropriate bug tracker. If you obtained the software from a 3rd party operating system vendor, it is often wise to use their own bug tracker for reporting problems. In all other cases use the master OpenStack bug tracker, available at:

Developers

For information on how to contribute to Nova, please see the contents of the CONTRIBUTING.rst.

Any new code must follow the development guidelines detailed in the HACKING.rst file, and pass all unit tests.

Further developer focused documentation is available at:

Other Information

During each Summit and Project Team Gathering, we agree on what the whole community wants to focus on for the upcoming release. The plans for nova can be found at:

nova's People

Contributors

ameade, berrange, cdent, comstud, dprince, edleafe, gkotton, gmannos, jaypipes, jichenjc, jkoelker, jogo, justinsb, kk7ds, lyarwood, markmc, mdbooth, mikalstill, mriedem, natsumetakashi, rconradharris, russellb, sdague, sleepsonthefloor, soulxu, stephenfin, throughnothing, tr3buchet, vishvananda, xtoddx

nova's Issues

Volume Attachment Errors

When volumes are attached to a server, the operation occasionally results in:

Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. <class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500)

The volume stays in state available. Further attachment attempts result in:

Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. <class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500)

Note that the first error is MessagingTimeout while the second error is RemoteError. From this point on, the volume cannot be attached again. It must be deleted and recreated.

This error happens quite frequently, possibly due to retry loops in automation (especially Kubernetes). We see around 1500 errors per hour.

Trace

The error happens on the Nova Compute agents. They complain about:

DBDuplicateEntry (psycopg2.IntegrityError) duplicate key value violates unique constraint "block_device_mapping_instance_uuid_volume_id_deleted_idx"

The responsible code in the agent is creating a database entry for storing the block device mapping:

nova/nova/compute/manager.py

Lines 4721 to 4744 in c4a099d

    def reserve_block_device_name(self, context, instance, device,
                                  volume_id, disk_bus, device_type):
        @utils.synchronized(instance.uuid)
        def do_reserve():
            bdms = (
                objects.BlockDeviceMappingList.get_by_instance_uuid(
                    context, instance.uuid))
            # NOTE(ndipanov): We need to explicitly set all the fields on the
            #                 object so that obj_load_attr does not fail
            new_bdm = objects.BlockDeviceMapping(
                    context=context,
                    source_type='volume', destination_type='volume',
                    instance_uuid=instance.uuid, boot_index=None,
                    volume_id=volume_id,
                    device_name=device, guest_format=None,
                    disk_bus=disk_bus, device_type=device_type)
            new_bdm.device_name = self._get_device_name_for_instance(
                    instance, bdms, new_bdm)
            # NOTE(vish): create bdm here to avoid race condition
            new_bdm.create()
            return new_bdm

When a volume is attached, this method is called via AMQP RPC from the Nova API:

nova/nova/compute/api.py

Lines 3102 to 3127 in c4a099d

    def _create_volume_bdm(self, context, instance, device, volume_id,
                           disk_bus, device_type, is_local_creation=False):
        if is_local_creation:
            # when the creation is done locally we can't specify the device
            # name as we do not have a way to check that the name specified is
            # a valid one.
            # We leave the setting of that value when the actual attach
            # happens on the compute manager
            volume_bdm = objects.BlockDeviceMapping(
                context=context,
                source_type='volume', destination_type='volume',
                instance_uuid=instance.uuid, boot_index=None,
                volume_id=volume_id or 'reserved',
                device_name=None, guest_format=None,
                disk_bus=disk_bus, device_type=device_type)
            volume_bdm.create()
        else:
            # NOTE(vish): This is done on the compute host because we want
            #             to avoid a race where two devices are requested at
            #             the same time. When db access is removed from
            #             compute, the bdm will be created here and we will
            #             have to make sure that they are assigned atomically.
            volume_bdm = self.compute_rpcapi.reserve_block_device_name(
                context, instance, device, volume_id, disk_bus=disk_bus,
                device_type=device_type)
        return volume_bdm

Reproduction

One of our users was able to reproduce the problem quite reliably. Upon further inspection we found that it happens when multiple volumes are attached to the same instance in short succession. The use case here is a Kubernetes Pod that references two volumes. Whenever the pod gets created the volumes are attached almost simultaneously. It is curious that it already happens for 2 volumes.

To take Kubernetes out of the equation, I created a script to test this suspicion:
https://gist.github.com/BugRoger/a24d616912ede75b5ce17a53ef0b6614#file-volumes-sh

It works like this:

  1. Create 10 volumes
  2. Wait for volumes to be available
  3. Attach all volumes simultaneously to the same instance
  4. Observe MessagingTimeout
  5. Try to attach again
  6. Observe RemoteError

In this scenario there are no retries. For each volume only a single attach call was made. The volumes are fresh and haven't been attached before. Everything is sanely ordered: all volumes are available before the attachment is done.
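The "attach all volumes simultaneously" step can be sketched in Python as follows. This is not the linked script, only a minimal illustration of the idea. The function names `attach_with_cli` and `attach_all` are made up for this example, and the sketch assumes the `openstack` command line client is installed with credentials already present in the environment.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def attach_with_cli(server_id, volume_id):
    """Attach one volume via the OpenStack CLI.

    Returns (volume_id, succeeded, combined output).
    Assumption: the `openstack` client is on PATH and already authenticated.
    """
    proc = subprocess.run(
        ["openstack", "server", "add", "volume", server_id, volume_id],
        capture_output=True,
        text=True,
    )
    return volume_id, proc.returncode == 0, (proc.stdout + proc.stderr).strip()


def attach_all(server_id, volume_ids, attach=attach_with_cli):
    """Issue exactly one attach call per volume, all at (nearly) the same time."""
    if not volume_ids:
        return []
    # One worker per volume so that no call waits for another on the client side.
    with ThreadPoolExecutor(max_workers=len(volume_ids)) as pool:
        futures = [pool.submit(attach, server_id, v) for v in volume_ids]
        # Results come back in the order the volumes were given.
        return [future.result() for future in futures]
```

Calling `attach_all(server_id, volume_ids)` with the ten fresh volume IDs makes a single attempt per volume, so any error it reports cannot be blamed on client-side retries. The `attach` parameter exists only so the function can be exercised with a fake in place of the real CLI call.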

It did confirm the suspicion:

Attaching 59f311dd-1a2e-4f6e-b391-a045d3852181 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching 7a3f88e7-00ca-4822-b5ab-f3e1c6e1c027 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching f90cc31a-1d44-4d82-83b2-cf889c78a7a2 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching 9b609d59-0a6d-492d-a166-312921dc05f4 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching d6265622-7681-49ca-9ad2-72a32b516dcb to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching 55121026-5938-4af0-ba17-003f924a99aa to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching 0d3a1991-590c-4ef1-9ed8-390d7eefa956 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching a72f8847-b49b-4d7c-b898-87b05c8a2a73 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching e6d22e21-7283-499f-b077-d954dc72ec35 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching d72f723e-e438-4d5b-a741-18a5104acc9c to cf9fc220-ec34-42ff-ac63-81f6266882a2
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-b9e5db43-1150-48b9-9cad-3a4ff1dbacc3)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-2283a46b-1cb3-4527-b8ba-9bc12cbed344)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-c4c67966-2c01-44fe-b1fc-83c546a1b29d)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-4f8ae364-5fc9-4946-9bff-bae41441bf92)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-18c2cb27-da42-498c-aa41-33932ce92692)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-ed9f451c-23d9-47f2-9c45-75c67198f470)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'> (HTTP 500) (Request-ID: req-d44b5a42-df3f-4f91-a21e-6383fbe83899)

The API call to attach 3 of 10 volumes was successful. These volumes went into state attaching. The MessagingTimeout occurred for the remaining 7 volumes.

Attempting to attach the remaining volumes a second time:

Attaching 7a3f88e7-00ca-4822-b5ab-f3e1c6e1c027 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching f90cc31a-1d44-4d82-83b2-cf889c78a7a2 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching d6265622-7681-49ca-9ad2-72a32b516dcb to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching 0d3a1991-590c-4ef1-9ed8-390d7eefa956 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching a72f8847-b49b-4d7c-b898-87b05c8a2a73 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching e6d22e21-7283-499f-b077-d954dc72ec35 to cf9fc220-ec34-42ff-ac63-81f6266882a2
Attaching d72f723e-e438-4d5b-a741-18a5104acc9c to cf9fc220-ec34-42ff-ac63-81f6266882a2
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-daddccd2-59d6-4da4-b715-3f6bda2cf4c4)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-aff03669-4787-42cf-8966-56124e2fb16c)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-4336c024-b383-48a9-8169-0c0507ac0cda)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-e9e8ebb9-b28a-4911-abd7-80d88d2f3a39)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-eb414cef-2062-4d42-aaab-d92444626fc7)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-66eff8c2-6a48-45c6-be80-9a272378f78b)
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.rpc.client.RemoteError'> (HTTP 500) (Request-ID: req-988019e9-bea8-42eb-92da-5d1226ae38ae)

Each of them failed with the RemoteError and the corresponding DBDuplicateEntry.

It is very likely that attaching multiple volumes to the same instance at the same time leaves the volumes in a state where any further attach attempt violates the unique key constraint in the database.

Theory

Nova API creates the BlockDeviceMapping via an RPC call to the Nova Compute agent. The code is synchronised on the instance.uuid, so each call blocks until the database entry has been written. In the above scenario we now have 10 RPC calls waiting for the mapping to be created.
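To make the locking behaviour concrete, here is a minimal sketch of a decorator that serializes calls sharing the same name. It is a stand-in written for illustration and not nova's `utils.synchronized`. The names `synchronized` and `reserve_all`, and the short sleep that mimics the database write, are assumptions of this example.

```python
import threading
import time
from collections import defaultdict

_locks = defaultdict(threading.Lock)   # one lock per name
_locks_guard = threading.Lock()        # protects the dictionary itself


def synchronized(name):
    """Serialize all calls decorated with the same name (illustrative stand-in)."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            with _locks_guard:
                lock = _locks[name]
            with lock:
                return func(*args, **kwargs)
        return wrapper
    return decorator


def reserve_all(instance_uuid, volume_ids, write_seconds=0.01):
    """Start one reservation per volume concurrently.

    Returns (peak number of reservations running at once, completion order).
    """
    active = 0
    peak = 0
    counter_guard = threading.Lock()
    order = []

    @synchronized(instance_uuid)
    def do_reserve(volume_id):
        nonlocal active, peak
        with counter_guard:
            active += 1
            peak = max(peak, active)
        time.sleep(write_seconds)      # hypothetical stand-in for the DB write
        order.append(volume_id)
        with counter_guard:
            active -= 1

    threads = [threading.Thread(target=do_reserve, args=(v,)) for v in volume_ids]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return peak, order
```

Because every call for the same instance takes the same lock, the peak never exceeds one: ten requests that arrive together are processed strictly one after another, and the last one waits for all the others.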

The creation of that mapping seems to take a non-trivial amount of time. My guesstimate is 20-30 seconds. It might also be that this specific operation is quick but another method (could be an earlier attach_volume) is also synchronising on instance.uuid blocking for longer.

The Nova API gives up waiting for the RPC call after 50s:

INFO nova.osapi_compute.wsgi.server "POST /v2/5d725ddf97664a16b011e8a8dd75772b/servers/cf9fc220-ec34-42ff-ac63-81f6266882a2/os-volume_attachments HTTP/1.1" status: 500 len: 442 time: 50.1058869

The Nova Agent does not abort processing and successfully creates all 10 BlockDeviceMappings. Nova API, however, considers the attach operation failed and returns the volume to available.
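A back-of-the-envelope model helps check whether the numbers fit together. The sketch below assumes the simplest reading of the theory: all requests arrive at the same moment, each one holds the lock for the same amount of time, and the caller gives up after a fixed timeout. None of this is measured, and the function names are made up for the example.

```python
def completion_times(n_calls, hold_seconds):
    """Finish time of each queued call when calls run strictly one at a time."""
    return [hold_seconds * (i + 1) for i in range(n_calls)]


def count_within_timeout(n_calls, hold_seconds, timeout_seconds):
    """Number of calls whose reply arrives before the caller gives up."""
    return sum(
        1 for finished in completion_times(n_calls, hold_seconds)
        if finished <= timeout_seconds
    )


def hold_range_for(successes, timeout_seconds):
    """Hold times (low, high] that yield exactly `successes` timely replies.

    Exactly k calls finish in time when k * hold <= timeout < (k + 1) * hold.
    """
    return timeout_seconds / (successes + 1), timeout_seconds / successes


# With a 50 s timeout, a 15 s hold time lets 3 of 10 calls answer in time.
print(count_within_timeout(10, 15, 50))   # 3
# Hold times that match "3 of 10 succeeded" under this model:
print(hold_range_for(3, 50))              # (12.5, 16.666...)
```

Under these assumptions, 3 successful calls out of 10 correspond to a hold time between roughly 12.5 and 16.7 seconds per call, while a hold time of 20 to 30 seconds would let only one or two calls through. The real system need not behave this uniformly, so this is a plausibility check rather than a measurement.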

A second attempt to attach the same volumes now fails instantly, because the first attempt was partly successful and created the BlockDeviceMapping DB entry.
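The instant failure on the second attempt can be reproduced in miniature with an in-memory SQLite database. The table below is heavily simplified and is not nova's schema. The index name is taken from the error message, and the assumption that it covers the columns `instance_uuid`, `volume_id` and `deleted` is inferred from that name only.

```python
import sqlite3


def make_db():
    """Create a toy block_device_mapping table with a unique index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE block_device_mapping ("
        " id INTEGER PRIMARY KEY,"
        " instance_uuid TEXT NOT NULL,"
        " volume_id TEXT NOT NULL,"
        " deleted INTEGER NOT NULL DEFAULT 0)"
    )
    # Assumed column list, derived from the index name in the error message.
    conn.execute(
        "CREATE UNIQUE INDEX"
        " block_device_mapping_instance_uuid_volume_id_deleted_idx"
        " ON block_device_mapping (instance_uuid, volume_id, deleted)"
    )
    return conn


def reserve(conn, instance_uuid, volume_id):
    """Insert a mapping row; return False if a live row already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO block_device_mapping (instance_uuid, volume_id)"
                " VALUES (?, ?)",
                (instance_uuid, volume_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = make_db()
print(reserve(conn, "instance-1", "volume-1"))  # True: first attempt writes the row
print(reserve(conn, "instance-1", "volume-1"))  # False: the leftover row blocks the retry
```

The first call corresponds to the attach that timed out on the API side but still wrote its row on the compute side. The second call corresponds to the retry, which hits the unique index immediately.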
