
Comments (13)

borkd commented on June 13, 2024

Could be a stupid question, but are all servers and clients equipped with ECC RAM?

oszafraniec commented on June 13, 2024

@borkd not so stupid, but the problem started some time ago and there was no change in MFS version etc. in between.
Clients are VMs, running on hosts with ECC RAM.
Masters have ECC.
Chunkservers... I need to check them all, but I think they all have ECC too (Supermicro MBs and Xeon CPUs).

chogata commented on June 13, 2024

Are any of your machines 2-processor ones?

oszafraniec commented on June 13, 2024

@chogata yes, two of them are: mfs-master (2x CPU) and vm-host (2x CPU).
Chunkservers have only 1 CPU.

Just checked a file that I was checking a few days ago... The file was not modified in between... Just to show you what I described above.

root@mfsmaster01:/moose/# zip -T 12101483.zip.bad
error: invalid zip file with overlapped components (possible zip bomb)
test of 12101483.zip.bad FAILED

zip error: Zip file invalid, could not spawn unzip, or wrong unzip (original files unmodified)

and after some time...

root@mfsmaster01:/moose/# zip -T 12101483.zip.bad
test of 12101483.zip.bad OK

chogata commented on June 13, 2024

I'm inclined to blame your vm-host machine.

We had 2 confirmed cases with our clients, we researched this a bit, and we found other people having the same problem with different software. Basically, there is a rare but real problem where a 2-processor machine fails to refresh a processor (high-level) cache in time. When this failure happens, a process reading data from a cache cell reads a previous value, not the current one, which leads to all sorts of strange and unexpected behaviours in software.

For some reason it happens more often (or even exclusively?) with processes that use "a lot" of RAM. I'm no expert, but I suspect that when a process has a lot of memory allocated, with part of that memory in the address space managed by the 1st CPU and another part in the space managed by the 2nd CPU, the kernel is inclined to switch the process between CPUs more often (following the memory, perhaps?), thus increasing the chance of a failed processor cache synchronisation.

Our 2 clients with 2-processor machines had them running different modules. In one case it was a chunkserver, which failed spectacularly with core dumps; in the other it was a client machine, like yours, with data problems but no failure of the process itself. Note that the client with the chunkserver had more than one 2-processor machine, but only one of them failed at regular intervals, which leads me to believe, personally, that this might be a rarely spotted hardware issue. But it could also be software (kernel) related, as we never had a case of 2 machines with absolutely identical system and kernel versions where only one of them failed. But in both cases we investigated long and hard using debugging software, and the results were indisputable: one thread puts a value (that we know) in a certain memory cell; right after, another thread reads that cell and behaves in a way that tells us it MUST have read something different from what the first thread wrote. And the previous value residing in that memory cell always fit the bill (i.e., it explained the otherwise unexpected behaviour).

It also aligns with what you wrote about remounting helping initially: just after a remount your mount uses less memory (it has not allocated all those caches yet ;) ) and that memory is probably "orderly" and in "one piece". Over time it grows and also starts to fragment.
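
If cross-socket migration is indeed the trigger, one cheap way to test that hypothesis would be to pin the mount process to a single socket and see whether the corruption stops. A minimal sketch in C, assuming Linux and assuming socket 0 owns CPUs 0-7 (check /sys/devices/system/node/node0/cpulist for the real mapping):

/* Pin the current process to the CPUs of one socket (Linux-specific).
 * The 0-7 range for socket 0 is an assumption for illustration. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++)   /* CPUs of socket 0 (assumed) */
        CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... run or exec the suspect workload from here ... */
    return 0;
}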

borkd commented on June 13, 2024

Having a "fingerprint" of the systems where such rare but significant issues were found might help with efforts to reproduce:

  • master/cs/client
  • OS / distro
  • kernel version
  • mainboard model/type
  • BIOS / firmware
  • RAM make/type and amount
  • storage controllers
  • NICs (offload, trunking, ...)
  • number of CPUs and their model/type
  • CPU firmware
  • vulnerability mitigation measures enabled in the running kernel or injected via 3rd party code

inkdot7 commented on June 13, 2024

> But in both cases we investigated long and hard using debugging software, and the results were indisputable: one thread puts a value (that we know) in a certain memory cell; right after, another thread reads that cell and behaves in a way that tells us it MUST have read something different from what the first thread wrote. And the previous value residing in that memory cell always fit the bill (i.e., it explained the otherwise unexpected behaviour).

@chogata Were there enough memory fence instructions to ensure ordering between the loads and stores on each CPU?

Do I understand correctly:

  1. at start, the memory location has the value 'c'
  2. thread A writes a value 'a' to a memory location
  3. thread B reads from that memory location, and gets 'c'

Something more is needed here: what tells us that 3 happens after 2? I.e., with respect to what should the memory ordering instructions ensure ordering?
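
For concreteness, that scenario as plain C might look like the sketch below (a minimal sketch; the names are made up). Without atomics or fences this is a data race, so the standard places no constraint at all on what thread B observes:

/* Minimal sketch of the scenario above (names are hypothetical).
 * Plain int accesses like these are a data race in C11: neither the
 * compiler nor the CPU is obliged to make A's write visible to B in
 * any particular order. Build with: cc -pthread race.c */
#include <pthread.h>
#include <stdio.h>

int cell = 'c';                      /* 1. the location starts as 'c' */

void *thread_a(void *arg) {
    (void)arg;
    cell = 'a';                      /* 2. A writes 'a'               */
    return NULL;
}

void *thread_b(void *arg) {
    (void)arg;
    printf("B read '%c'\n", cell);   /* 3. B may still observe 'c'    */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}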

chogata commented on June 13, 2024

@inkdot7 In the case of the crashing chunkserver we traced core dumps; they showed us what happened, instruction by instruction, in the moments before the process crashed. So yes, those things happened in that order. It was more like:

  1. at start, a memory location has value 'x'
  2. thread A writes a value 'y' to this memory location
  3. thread B reads a value from this memory location and behaves unexpectedly: for sure NOT as if it had read y, but IF it had read x, the unexpected behaviour would make sense (although a host of other values, neither x nor y, could also cause that behaviour)

We concluded that it must have read x after reading the available material on the net about similar problems.

To kinda answer your question: what should ensure the ordering? The compiler. We investigated this path too, but cases of bad cache refreshing are rare enough that it's hard to blame the code. We've also found some claims that using mmap may cause the problem; we don't really see how, but we reverted to malloc, just in case.

BTW, I forgot one more case: a two-processor client machine had a very frequent problem with data integrity (file length values) that could only be logically explained by the "cache refresh problem". We described the possible cause to the company that owned this particular machine and they investigated. It turned out one of the coolers in a sophisticated cooling system was malfunctioning and the processors' temperature was higher than usual, but not high enough to cause an emergency shutdown, just some log messages (that nobody read ;) ). When they replaced the cooler, the machine "went back to normal", AKA they never had the problem with incorrect values again on this machine.

inkdot7 commented on June 13, 2024

@chogata Ordering by the compiler alone is not enough. The compiler also needs to emit instructions that keep the CPU from doing things out of order.

Consider a shared memory area with two locations, a and b. E.g. a could be a flag or counter telling whether the value in b is valid for use.

At start, both locations are '0'.

Thread A does the following:

  1. Write '2' to location b.
  2. Write '1' to location a. (Telling that b is now valid.)

Thread B does the following:

  1. Read location a. (To check if b is valid.)
  2. Read location b.

And then it would e.g. only use the value from location b if the value from location a is '1'.

What are the possible read outcomes for thread B?

If running before A, it would get a=0,b=0.

If running after A, it would get a=1,b=2.

If thread B runs the code around the same time as thread A, then it could get a=0,b=2.

But it can also get a=1,b=0, even if the compiler has made sure to put the writes in A and the reads in B in the given order. The processor's memory model typically gives it freedom to reorder memory operations on the fly, which also means the caches between the CPUs are not forced to immediately reflect updates in order. (Strongly ordered architectures like x86 forbid most of these reorderings for ordinary memory; weakly ordered ones like ARM allow them.)

On e.g. x86(64) there are the mfence, lfence and sfence instructions, which tell the processor that all memory accesses (mfence), or just the reads or just the writes (lfence or sfence), need to be performed in order. So if the code above is changed to:

Thread A:

  1. Write '2' to location b.
  2. Execute sfence instruction.
  3. Write '1' to location a.

Thread B:

  1. Read location a.
  2. Execute lfence instruction.
  3. Read location b.

Then B will never see the case a=1,b=0.

I recently ran into a problem where an if-statement checking the value read from a before possibly doing the read of b was not enough: without a memory barrier instruction, the ARM M2 CPU had sometimes already speculatively done the b read before the a read.
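
In portable C11 the same fix is usually written with release/acquire atomics rather than raw fence instructions. A minimal sketch of the a/b example above (assuming C11 and pthreads; variable names as in the example):

/* The release store to 'a' pairs with the acquire load of 'a': once B
 * sees a == 1, it is guaranteed to also see b == 2, so the a=1,b=0
 * outcome is impossible. Build with: cc -pthread fence.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int a = 0;
int b = 0;

void *thread_a(void *arg) {
    (void)arg;
    b = 2;                                               /* write b    */
    atomic_store_explicit(&a, 1, memory_order_release);  /* publish it */
    return NULL;
}

void *thread_b(void *arg) {
    (void)arg;
    if (atomic_load_explicit(&a, memory_order_acquire) == 1)
        printf("b = %d\n", b);       /* if reached, always prints 2 */
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, thread_a, NULL);
    pthread_create(&tb, NULL, thread_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}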

chogata commented on June 13, 2024

Okay, but "good coding practice" requires you to use locks in scenarios like the above. So it would be:

Thread A:

  1. obtain lock x
  2. write to location b (some value, maybe 2)
  3. write to location a (value 1 to say location b can be read now)
  4. release lock x

Thread B:

  1. obtain lock x
  2. if there is 1 in a, read b
  3. release lock x

And one would expect the compiler to make sure the locks are honoured. MooseFS code always uses locks when it writes to a memory fragment that can potentially be accessed by other threads. Maybe I did not state it clearly, but there are other operations in between the read and the write (very few, otherwise the cache would have been refreshed). Besides, when you trace the core dump you see exactly what the processors did, instruction after instruction, so even if we did not use locks, we would see that certain operations were swapped. We traced the operations that had happened and KNOW the order. And yet the value read was not valid.
I know it's hard to believe; two of us sat for 2 days analysing one core dump, because we could not believe it either at the start :)
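
For reference, that locked pattern in pthreads would look roughly like the sketch below. POSIX requires pthread_mutex_lock/unlock to synchronize memory, so a reader that takes the lock after the writer has released it cannot observe the stale values:

/* Sketch of the locked pattern described above. The mutex acts as a
 * full memory barrier: if B acquires the lock after A released it,
 * B must see both of A's writes. Build with: cc -pthread lock.c */
#include <pthread.h>
#include <stdio.h>

pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER;
int a = 0, b = 0;

void *thread_a(void *arg) {
    (void)arg;
    pthread_mutex_lock(&x);
    b = 2;                       /* some value                  */
    a = 1;                       /* says location b can be read */
    pthread_mutex_unlock(&x);
    return NULL;
}

void *thread_b(void *arg) {
    (void)arg;
    pthread_mutex_lock(&x);
    if (a == 1)
        printf("b = %d\n", b);   /* if reached, always prints 2 */
    pthread_mutex_unlock(&x);
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, thread_a, NULL);
    pthread_create(&tb, NULL, thread_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}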

oszafraniec commented on June 13, 2024

As a wrap-up of this issue...

@chogata just FYI, remounting solves the problem, as you can see below. Only dropping the file cache via sync; echo 3 > /proc/sys/vm/drop_caches doesn't solve it. Modifying the file (regenerating the zip file in our case) also helps. I will try to provide you with any feedback I can from a user perspective. We now have extra tests for files and can catch errors like this. Still, the scale of the problem is tiny, around 0.04%, but noticeable in our case (roughly 100 bad files out of 250k file downloads per month).

For now we have found a way to live with this and have added some error handling on our side. Let's hope it will go away after OS/HW/MFS/etc. upgrades in the future ;)

(the file was read from the MFS mount earlier and shows up as corrupted)
root@ftpgw:~# 
root@ftpgw:~# zip -T /mnt/12368428.zip 
error: invalid zip file with overlapped components (possible zip bomb)
test of /mnt/12368428.zip FAILED

zip error: Zip file invalid, could not spawn unzip, or wrong unzip (original files unmodified)
root@ftpgw:~# 
root@ftpgw:~# umount -v /mnt && mount -av
umount: /mnt (mfsmaster:9421) unmounted
/                        : ignored
none                     : ignored
/mnt                     : successfully mounted
root@ftpgw:~# 
root@ftpgw:~# zip -T /mnt/12368428.zip 
test of /mnt/12368428.zip OK
root@ftpgw:~# 
(now the same file is OK)

borkd commented on June 13, 2024

@oszafraniec - can you share how your VM clients are configured (qemu config strings), and maybe some details of the hypervisor config, including networking?

chogata commented on June 13, 2024

Regarding hardware, I forgot to add that little tidbit earlier: one of the clients that had the biggest problem with inconsistent data pointing to the cache refreshing problem checked out the machine that gave faulty readings (they had only one that always generated the problem). It turned out one of the cooler fans was broken and the inside temperature was a few degrees higher than normal. Not enough to shut down the machine, just enough to output some error messages in the logs (which nobody bothered to read ;) ). When they replaced this one fan, the machine "went back to normal", aka it never gave a faulty readout of data again... Better cooling, and the cache refreshing problem went away.
