
Ceph OSD docker bug about ceph-container HOT 35 CLOSED

ceph avatar ceph commented on June 13, 2024
Ceph OSD docker bug

from ceph-container.

Comments (35)

 avatar commented on June 13, 2024 1

The connections are valid, full connections that properly open, communicate, and close. One OSD must be asking the OSDs on a separate host whether they can see the other OSDs on the same host before reporting it down. It only starts getting reported down after the conntrack table fills up and packets start getting dropped.

I really just need to spend some time to see how OSDs on the same host communicate in order to give a proper solution. It's possible we can do some socket-sharing trickery to get it to work without exposing the host PID namespace.
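
For reference, a quick way to confirm the conntrack table is the culprit (standard Linux sysctls and kernel log messages; not specific to this setup):

$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
$ dmesg | grep conntrack    # look for "nf_conntrack: table full, dropping packet"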

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

I can affirm that this occurs even on physical machines, as long as the OSDs are running in Docker. Host networking does not resolve the issue. Splitting the OSD data/sync network from the access network does not resolve the issue, either.

This likely shares a root cause with another problem of mine: I cannot run more than one RBD-backed KVM instance on a single host without all the ports on the machine being exhausted by Ceph traffic.

Interestingly, in that case, if I run the KVM instances from systemd-nspawn instead of Docker, I do not have the problem. I suspect the same would be true if I ran the OSD from systemd-nspawn, as well.

I'm also quite interested to see if the problem exists with rocket. When I get some time, I'll throw together an ACI for the OSDs and try that.

I should also note that I have tried a number of options to mitigate this (a rough ceph.conf sketch of these settings follows the list).

  • tcp nodelay = false : No difference
  • various rbd cache settings: No difference
  • splitting the network: No difference
  • reducing the TTL for connections: No difference (not expected, since the rate at which the ports exhaust is incredible)
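
For anyone wanting to reproduce the attempts above, the ceph.conf tweaks looked roughly like the following. The option names are my best mapping to the standard Ceph ones (ms tcp nodelay, ms tcp read timeout, rbd cache); the exact values varied between runs.

[global]
# disable TCP_NODELAY on messenger sockets ("tcp nodelay = false" above)
ms tcp nodelay = false
# shorten the idle-connection timeout ("reducing the TTL for connections")
ms tcp read timeout = 60

[client]
# one of the various rbd cache settings tried
rbd cache = false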

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

More discussion on this, from earlier:
Ulexus/docker-ceph #8

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

@bobrik Anything to add to this?

from ceph-container.

bobrik avatar bobrik commented on June 13, 2024

@Ulexus not really, you know even more than I do.

from ceph-container.

peterrosell avatar peterrosell commented on June 13, 2024

I'm also struggling with this problem. My experience is that as soon as I have two OSDs on the same host and a third OSD is started, the problem occurs. It doesn't matter if the third OSD is running on the same host or on another host. I can see in the monitor's log that the OSDs start reporting each other as down.

I have been thinking that sharing the host network between the OSDs (--net=host) is what makes this problem occur. I haven't been able to test it yet. I have an idea of running Docker 1.5 and setting up an IPv6 network that the Ceph containers can communicate over. By doing that, each container gets its own IP address and I don't need --net=host.
I read somewhere that you didn't get Ceph to work without --net=host and --privileged, but I don't understand the reason.
Do you think it's worth my trying to run with IPv6?

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

I think it's a reasonable hypothesis that --net=host is related. The OSD-to-mon communication really doesn't like NAT, which is the reason for --net=host.

I don't think --privileged is required for either the mon or the OSD. My mons are running without it, but I do notice that my OSDs are running with it. I'll move one off --privileged and make sure it still works.

I've been very curious to see if running Docker on IPv6 will work, too. However, last time I looked at the IPv6 support in 1.5.0, it appeared they were still doing port address translation even with IPv6, which seemed absurd. (I hope I am wrong about that.)

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

I can confirm that --privileged is NOT required. That was, I'm sure, left over from my attempts to work within the ceph-deploy tool, where access to the raw device was needed (to format the OSD partition).
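
For reference, here is roughly what one of my OSD containers looks like without --privileged (image name, paths, and the OSD ID are illustrative; this assumes the data directory is already prepared and mounted on the host, so no raw device access is needed):

docker run -d --name osd-0 --net=host \
    -v /etc/ceph:/etc/ceph \
    -v /osd/ceph-0:/var/lib/ceph/osd/ceph-0 \
    ceph/osd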

from ceph-container.

dmitryint avatar dmitryint commented on June 13, 2024

I successfully run multiple OSDs on the same host.
When you start all the ceph-osd daemons in one container, everything works fine.

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

@dmitryint That is definitely interesting to know, and it is consistent with my expectations (though I certainly had not thought of trying it). Thanks!

I'd prefer to be able to run each OSD as a separate container, but we could modify the OSD container to handle multiple local OSDs.

from ceph-container.

dmitryint avatar dmitryint commented on June 13, 2024

Here is a prototype that allows running multiple OSDs in one container:
dmitryint@ae8909a

docker run -it --rm --name osd \
    -v /osd/ceph-0:/var/lib/ceph/osd/ceph-0 \
    -v /osd/ceph-1:/var/lib/ceph/osd/ceph-1 \
    -v /osd/ceph-2:/var/lib/ceph/osd/ceph-2 \
    -v /osd/journal:/var/lib/ceph/journal \
    -v /etc/ceph:/etc/ceph \
    -v /dev:/dev \
    --net host --privileged \
    ceph/osd

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

I like the concept. The only problem I see is maintaining backward compatibility in the single-OSD use case. Presently, the default journal location is /var/lib/ceph/osd-${OSD_ID}/journal. In this version, the default journal location would be /var/lib/ceph/journal/journal.${OSD_ID}. I think if you leave JOURNAL alone and use something like JOURNAL_DIR instead, we should, with some additional logic, be able to support both the legacy and the new/multiple layouts.
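
Something along these lines in the entrypoint is roughly what I have in mind (a sketch only; JOURNAL_DIR is the proposed variable, and the defaults follow the two layouts described above):

# pick the journal path: new shared-directory layout if JOURNAL_DIR is set,
# otherwise fall back to the legacy per-OSD location (or an explicit JOURNAL)
if [ -n "${JOURNAL_DIR}" ]; then
    OSD_JOURNAL="${JOURNAL_DIR}/journal.${OSD_ID}"
else
    OSD_JOURNAL="${JOURNAL:-/var/lib/ceph/osd-${OSD_ID}/journal}"
fi
exec ceph-osd -i "${OSD_ID}" --osd-journal "${OSD_JOURNAL}" -f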

Would you mind opening up a PR to track this? It isn't really related to this issue; it's more of a workaround... but a good one.

from ceph-container.

peterrosell avatar peterrosell commented on June 13, 2024

I can confirm that running three different OSD containers on the same host works when using IPv6. I removed --net=host and also --privileged. The monitor was also running on the same host, in a separate container. I used a /80 network on the host and each container got its own IP address. The host handles the routing, which was enabled with these commands (change the network to your correct one):

$ ip -6 route add 2001:db8:1::/64 dev docker0
$ sysctl net.ipv6.conf.default.forwarding=1
$ sysctl net.ipv6.conf.all.forwarding=1
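
For completeness, the Docker 1.5 daemon needs IPv6 enabled for the containers to get addresses at all; I started it with flags along these lines (substitute your own prefix; Docker wants at least a /80 so it can derive container addresses from their MACs):

$ docker -d --ipv6 --fixed-cidr-v6=2001:db8:1::/80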

I did the tests half-manually, just to get something to verify. I guess there is some work to do if you want to be able to use these containers and scripts with IPv6. Each time you restart a container you will get a new IP address; it is based on the MAC address, which increments for each container that is started.

I guess that Ceph has some problem when the two OSDs bind to the same IP address but can't see each other due to running in separate containers. When I run on IPv4 with your containers and start two OSDs on the same host, with a third OSD also running, the network traffic goes up like crazy. I get over 8000 TCP connections after a few seconds, but almost all of them are in the TIME_WAIT state, which means they are closed. It's like the third OSD contacts one of the OSDs running on the same host, finds out that it's the wrong OSD_ID, disconnects, and then starts over again. The OSDs then start reporting each other as down to the monitor. I'm not sure if this is the real reason, but it might be interesting to hear what someone from the Ceph community thinks about this theory. Maybe detailed logging of the intercommunication between the OSDs can be activated.
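
If anyone wants to reproduce the observation, counting those connections is easy with standard tools (not necessarily the exact commands I ran):

$ ss -tan state time-wait | wc -l
$ netstat -ant | awk '$6 == "TIME_WAIT"' | wc -l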

I hope I will be able to push my setup later, but it will be next week.
Update: My setup is running Vagrant (VirtualBox) and CoreOS (alpha channel, 598.0.0) with Docker 1.5.0.

from ceph-container.

bobrik avatar bobrik commented on June 13, 2024

@peterrosell can you also copy your observations to http://tracker.ceph.com/issues/10763?

from ceph-container.

peterrosell avatar peterrosell commented on June 13, 2024

Will do later today.

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

@peterrosell Thanks for the results.

You might be on to something here. Since the OSDs need to run with --net=host, though (in normal, NAT'ed IPv4 Docker), they should be able to "see" each other: they are all in the same network namespace. Still, I don't know how the ports are allocated. If it is a filesystem thing instead of a network-discovery thing, that may very well explain it.

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

Updated ceph-osd README to note these two workarounds

from ceph-container.

 avatar commented on June 13, 2024

This issue does not happen when sharing PID namespaces. Currently, Docker does not support sharing PID namespaces between containers, but it does work when the namespace is shared with the host: --pid=host

This obviously has the downside of exposing all of the host's processes; paired with --net=host, we basically lose the inherent isolation of containers. But it still allows a prepackaged container with Ceph installed to work correctly.

I will cross post to the ceph bug report.

EDIT: Here is a Docker feature request asking for shared PID namespaces; it is not in Docker as of 1.5.0: moby/moby#10163
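
Concretely, the workaround is just adding --pid=host (alongside --net=host) to the OSD run command; adapting the single-OSD example from earlier in this thread (paths are illustrative):

docker run -d --name osd-0 --net=host --pid=host \
    -v /etc/ceph:/etc/ceph \
    -v /osd/ceph-0:/var/lib/ceph/osd/ceph-0 \
    ceph/osd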

from ceph-container.

peterrosell avatar peterrosell commented on June 13, 2024

Great finding! I will try it myself as soon as I can.
I wonder why the OSDs need to see each other's PIDs, but I guess Ceph has its reasons.

from ceph-container.

 avatar commented on June 13, 2024

My thoughts were that the OSDs on the same host don't communicate over TCP but rather via some form of interprocess communication. I am unsure, though.

from ceph-container.

peterrosell avatar peterrosell commented on June 13, 2024

Sounds possible. If so, it shouldn't be that hard to force an OSD to use the network path instead. It's a bit strange that it generates so many network connections when they can't see the PIDs. Hopefully someone who knows the OSD code can explain it.

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

I'm going to close this, as the current OSD implementation seems to work around this problem very well.

from ceph-container.

peterrosell avatar peterrosell commented on June 13, 2024

@Ulexus Do you mean that with the current OSD implementation we don't need to use --pid=host? If so, which version do you refer to?
If not, I guess this problem goes away if you are running the Ceph cluster on flannel or a similar network tool.

from ceph-container.

bobrik avatar bobrik commented on June 13, 2024

@Ulexus please update this section with --pid host:

https://github.com/ceph/ceph-docker/tree/master/osd#multiple-osds

The section on running multiple OSDs in a single container can be removed; too tricky, if you ask me.

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

Sorry, all; that was stupid.

from ceph-container.

hookenz avatar hookenz commented on June 13, 2024

@Ulexus - In your recent CoreOS Fest presentation, you mentioned that all Ceph containers needed --pid=host or had to run inside the same Docker image, or all the IP ports would be used up within a matter of seconds. Does that go for ceph-mds running alongside a ceph-mon, for example? I thought this only affected ceph-mon? I haven't done too much testing here, but I am running ceph-mon and ceph-mds on the same host... or at least ceph-mds is being scheduled with fleet right now and ends up on a ceph-mon host.

from ceph-container.

leseb avatar leseb commented on June 13, 2024

Following @SamYaple's advice on the Ceph tracker issue, it looks like using --pid=host fixes the conntrack_max issue.
No more kernel logs and no more flappy OSDs.
I think we can close this issue.

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

I'm good with that.

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

@bobrik - Thanks; docs updated

from ceph-container.

tangzhankun avatar tangzhankun commented on June 13, 2024

hi @Ulexus,
I saw this issue was closed, but I still want to make this clearer.
I am using macvlan to create different (IPv4) IPs for the OSD containers. It is working.
In the current ceph/osd documentation, the two workarounds are "--pid=host" and running all OSDs in one container (I don't think the latter will be adopted by many people, because it's weird and hard to manage). Considering all workarounds, I think there are two kinds in essence:

  1. OSD containers on the same host share the PID namespace; then no matter what the networking looks like, it will be fine.
  2. OSD containers on the same host don't share the PID namespace; then we need to isolate them in different network namespaces to make it work.

So here come my questions:
What are the pros and cons of OSD containers sharing the PID namespace with the host OS?

One con I can see is that there may be a security issue if they share the host's PID namespace. I hope to hear your ideas.

from ceph-container.

 avatar commented on June 13, 2024

@tangzhankun

Keep in mind, shared container PID namespaces are coming in Docker. This will allow all the OSDs to share one PID namespace, solving the problem without the issue of sharing the host PID namespace.

That being said, I think we can hold out for that feature and live with the current situation, but I have no vested interest here so I am ok with whatever anyone decides. I am here to share info about the issue only.

Point of clarification on your second point: the different network namespaces aren't the issue; it's two OSDs that are reachable at the same IP address (IPv4 or IPv6) that cannot communicate with each other. I do not know how this same-host communication works, whether it is a simple PID check or something more, but sharing the PID namespace allows them to see each other as alive and healthy.
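
For what it's worth, the feature in question would let one container join another container's PID namespace rather than the host's; the syntax that later shipped in Docker looks something like this (container names and paths are illustrative, and it did not exist as of 1.5.0):

docker run -d --name osd-0 --net=host \
    -v /etc/ceph:/etc/ceph -v /osd/ceph-0:/var/lib/ceph/osd/ceph-0 ceph/osd
docker run -d --name osd-1 --net=host --pid=container:osd-0 \
    -v /etc/ceph:/etc/ceph -v /osd/ceph-1:/var/lib/ceph/osd/ceph-1 ceph/osd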

from ceph-container.

tangzhankun avatar tangzhankun commented on June 13, 2024

Hi @SamYaple,
Thanks for your quick reply.
SamYaple => "solving the problem without the issue of sharing the host PID namespace."
I double-checked the Docker documentation, but there "--pid=host" seems to make the container share the host OS's PID namespace, so that a container can see the system processes of the host OS. Is the documentation wrong?
Refer to here: https://docs.docker.com/reference/run/

SamYaple => "it's two OSDs that are reachable at the same IP address (IPv4 or IPv6) that cannot communicate with each other."
Yes, I agree with that. I chose different IPs for different OSDs to avoid this issue, because "--pid=host" wasn't a known workaround about 3 months ago. I was wondering whether to change to "--pid=host", so I asked about the pros and cons of "--pid=host" here to make a decision.

PS: Another thing I have always been curious about: the issue page in the Ceph tracker was closed 6 months ago because someone said this seems to be a Docker issue. What's the root cause exactly, and who should fix it? Any ideas?

from ceph-container.

Ulexus avatar Ulexus commented on June 13, 2024

@tangzhankun
That is what --pid=host does, yes. What @SamYaple is talking about is a future feature, which would allow a set of containers to share a pid namespace together but which is separate from the host pid namespace.

from ceph-container.

 avatar commented on June 13, 2024

@tangzhankun
I will ask around about the bug in the Ceph tracker and see if this is something Ceph as a whole may be able to fix. Last time I looked at this issue, I could not find anyone interested in reopening or looking into it.

from ceph-container.

tangzhankun avatar tangzhankun commented on June 13, 2024

@Ulexus
OK, thanks for the clarification. Not being separated from the host PID namespace may be a security issue, but that was already the case for traditional Ceph before we used Docker, right?

@SamYaple
Ok. That's kind of you. Thanks very much!

from ceph-container.
