
nlknguyen / alpine-mpich


MPI Cluster Automation Solution using Docker, based on Alpine Linux with MPICH (see IEEE paper)

Home Page: https://github.com/NLKNguyen/alpine-mpich

License: MIT License

Languages: Shell 77.39%, Dockerfile 15.19%, C 7.42%

Topics: automation, cluster, docker, mpi

alpine-mpich's People

Contributors

gentooza, nbp-lbl, nlknguyen, simonholgate

alpine-mpich's Issues

make problem with mpich v3.3

Dear Nikyle,
I'm trying to build a customized version of your Dockerfile using an updated version of MPICH, 3.3 (released 21-Nov-2018), instead of the version you are using (3.2).
The build works with MPICH 3.2, but with 3.3 it stops during make with this error:
(screenshot of the error; image not preserved)
Do you have any idea what the cause is, and how to fix it, if it is worth fixing?
Thanks in advance,
s.-

README instructions not working

Hi,

The instructions in the README fail right after cloning the repository:

$ docker build -t nlknguyen/alpine-mpich base/
[+] Building 1.4s (10/21)                                        docker:default
 => [internal] load .dockerignore                                          0.0s
 => => transferring context: 2B                                            0.0s
 => [internal] load build definition from Dockerfile                       0.0s
 => => transferring dockerfile: 1.84kB                                     0.0s
 => [internal] load metadata for docker.io/library/alpine:3.4              0.9s
 => [internal] load build context                                          0.0s
 => => transferring context: 118B                                          0.0s
 => [ 1/18] FROM docker.io/library/alpine:3.4@sha256:b733d4a32c4da6a00a84  0.0s
 => CACHED [ 2/18] RUN apk update && apk upgrade       && apk add --no-ca  0.0s
 => CACHED [ 3/18] RUN apk update && apk add ca-certificates && update-ca  0.0s
 => CACHED [ 4/18] RUN mkdir /tmp/mpich-src                                0.0s
 => CACHED [ 5/18] WORKDIR /tmp/mpich-src                                  0.0s
 => ERROR [ 6/18] RUN wget http://www.mpich.org/static/downloads/3.2/mpic  0.4s
------                                                                          
 > [ 6/18] RUN wget http://www.mpich.org/static/downloads/3.2/mpich-3.2.tar.gz       && tar xfz mpich-3.2.tar.gz        && cd mpich-3.2        && ./configure --disable-fortran        && make ${MPICH_MAKE_OPTIONS} && make install       && rm -rf /tmp/mpich-src:
0.260 Connecting to www.mpich.org (172.64.150.140:80)
0.316 Connecting to www.mpich.org (172.64.150.140:443)
0.402 wget: error getting response: Connection reset by peer
------
Dockerfile:27
--------------------
  26 |     WORKDIR /tmp/mpich-src
  27 | >>> RUN wget http://www.mpich.org/static/downloads/${MPICH_VERSION}/mpich-${MPICH_VERSION}.tar.gz \
  28 | >>>       && tar xfz mpich-${MPICH_VERSION}.tar.gz  \
  29 | >>>       && cd mpich-${MPICH_VERSION}  \
  30 | >>>       && ./configure ${MPICH_CONFIGURE_OPTIONS}  \
  31 | >>>       && make ${MPICH_MAKE_OPTIONS} && make install \
  32 | >>>       && rm -rf /tmp/mpich-src
  33 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c wget http://www.mpich.org/static/downloads/${MPICH_VERSION}/mpich-${MPICH_VERSION}.tar.gz       && tar xfz mpich-${MPICH_VERSION}.tar.gz        && cd mpich-${MPICH_VERSION}        && ./configure ${MPICH_CONFIGURE_OPTIONS}        && make ${MPICH_MAKE_OPTIONS} && make install       && rm -rf /tmp/mpich-src" did not complete successfully: exit code: 1
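
The failing step is the wget download, so one way to narrow the problem down is to probe the tarball URL before starting the build. The following is a minimal sketch, not part of the project: mpich_url and check_mpich_url are hypothetical helper names, the URL pattern is copied from the log above, and it assumes a wget that supports --spider (GNU wget does).

```shell
#!/bin/sh
# Hypothetical helper: build the tarball URL for a given MPICH version,
# following the URL pattern shown in the build log.
mpich_url() {
    version="$1"
    printf 'http://www.mpich.org/static/downloads/%s/mpich-%s.tar.gz\n' \
        "$version" "$version"
}

# Hypothetical helper: probe the URL without downloading the tarball.
# Returns non-zero when the server cannot be reached or rejects the request.
check_mpich_url() {
    wget --spider -q "$(mpich_url "$1")"
}

# Example (not run automatically):
#   check_mpich_url 3.2 && docker build -t nlknguyen/alpine-mpich base/
```

If the probe fails on the host as well, the problem lies with the download location or the network rather than with the Dockerfile itself.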

standard_init_linux.go:195: exec user process caused "exec format error"

Hi,

I have a 3-node swarm cluster.
I am trying to follow the steps to deploy mpi over the swarm.

After I issue this command:
./swarm.sh up size=3

I get the following output with the error:

===> CLEAN UP CLUSTER

         __v_
        (.___\/{
~^~^~^~^~^~^~^~^~^~^~^~^~
$ docker service rm my-mpi-project-master my-mpi-project-worker

Error: No such service: my-mpi-project-master
Error: No such service: my-mpi-project-worker
=> No problem


===> REMOVE NETWORK

         __v_
        (.___\/{
~^~^~^~^~^~^~^~^~^~^~^~^~
$ docker network rm mpi-network

Error: No such network: mpi-network
=> No problem


===> BUILD IMAGE

         __v_
        (.___\/{
~^~^~^~^~^~^~^~^~^~^~^~^~
$ docker build -t "nlknguyen/mpi" .

Sending build context to Docker daemon  41.47kB
Step 1/3 : FROM nlknguyen/alpine-mpich:onbuild
# Executing 5 build triggers
 ---> Using cache
 ---> Using cache
 ---> Running in f0e72562d307
standard_init_linux.go:195: exec user process caused "exec format error"
The command '/bin/sh -c cat ${SSHDIR}/*.pub >> ${SSHDIR}/authorized_keys' returned a non-zero code: 1
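
An "exec format error" often means that the binaries in an image were built for a different CPU architecture than the node running them. The issue does not confirm that this is the cause here, so the following is only a diagnostic sketch. docker_arch_for and check_image_arch are hypothetical helper names; the check relies on the Architecture field reported by docker image inspect.

```shell
#!/bin/sh
# Hypothetical helper: map `uname -m` output to the architecture
# names that Docker reports for images.
docker_arch_for() {
    case "$1" in
        x86_64)        echo amd64 ;;
        aarch64|arm64) echo arm64 ;;
        armv7l|armv6l) echo arm ;;
        *)             echo "$1" ;;
    esac
}

# Hypothetical helper: compare this node's architecture with an image's.
# Requires a working docker CLI and a locally available image.
check_image_arch() {
    image="$1"
    host_arch="$(docker_arch_for "$(uname -m)")"
    image_arch="$(docker image inspect --format '{{.Architecture}}' "$image")" || return 2
    if [ "$host_arch" != "$image_arch" ]; then
        echo "mismatch: host=$host_arch image=$image_arch" >&2
        return 1
    fi
}

# Example (not run automatically):
#   check_image_arch nlknguyen/alpine-mpich:onbuild
```

Running the check on every swarm node would show whether any node differs from the architecture the published image was built for.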

What Docker Version?

Hello, I am interested in using this project; could you please confirm and document in the readme the recommended versions of Docker and Docker compose?

Not all containers of a service are listed in the /etc/opts/hosts file

I have created a service with 16 containers and am running an MPI task from the master node. I noticed that not all of the service containers take on load. I then opened the /etc/opts/hosts file, which is supposed to list all service containers, and found that most of the time 2-3 containers are missing from it.

I have figured out that this is an issue with the "netstat -t" command inside get_hosts, which cannot resolve all container names and hence returns fewer addresses most of the time.
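
One possible alternative to parsing netstat output is to ask Docker's built-in DNS for the service's tasks: in swarm mode, the name tasks.<service-name> resolves to the addresses of all tasks of that service on the shared overlay network. The sketch below is an assumption-laden illustration, not the project's get_hosts script. ipv4_only and list_task_ips are hypothetical names, and it assumes dig is available in the container.

```shell
#!/bin/sh
# Hypothetical helper: keep only IPv4 addresses from stdin,
# sorted and de-duplicated, one per line.
ipv4_only() {
    grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$' | sort -u
}

# Hypothetical helper: list the task IPs of a swarm service through
# Docker's internal DNS. Must run inside a container on the same overlay network.
list_task_ips() {
    dig +short "tasks.$1" | ipv4_only
}

# Example (not run automatically), using a service name from the logs on this page:
#   list_task_ips my-mpi-project-worker > machinefile
```

Filtering to IPv4 lines also discards any unrelated names that a lookup might return.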

Please update the published image

One of the last updates to the code seems to be a fix to get_hosts. However, if you just follow the instructions in https://github.com/NLKNguyen/alpine-mpich/wiki/Multi-Host-Orchestration without building the images first and just run ./swarm.sh up size=5, it seemingly pulls a published 'onbuild' image from a Docker registry. This image contains an old version of get_hosts that does not seem to work in my setup. The new version works fine, but it took some time to figure out what was going on. Thanks!

No such file or directory when do mpirun

I have placed my MPI source file (test.c) in the project folder, but this is what happens when I run mpirun:
[proxy:0:0@069be636be9a] HYDU_create_process (utils/launch/launch.c:75): execvp error on file ./test (No such file or directory)
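
The error says that no file named ./test exists where mpirun tries to execute it. A likely cause, though the issue does not confirm it, is that test.c was never compiled into a binary. The sketch below compiles the source with mpicc when the binary is missing and then launches it. binary_name and build_and_run are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: derive the binary name from a C source file name
# by stripping the .c suffix (test.c -> test).
binary_name() {
    printf '%s\n' "${1%.c}"
}

# Hypothetical helper: compile the source with mpicc if no executable
# exists yet, then start it with mpirun.
build_and_run() {
    src="$1"
    bin="$(binary_name "$src")"
    if [ ! -x "$bin" ]; then
        mpicc -o "$bin" "$src" || return 1
    fi
    mpirun "./$bin"
}

# Example (not run automatically):
#   build_and_run test.c
```

In a multi-container cluster the binary also has to exist at the same path on every host that mpirun launches it on.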

get_hosts script is not working for me

When executing a cluster of size 4, for example, get_hosts returns the 4 machines and also:

a.root-servers.net

I have to remove this line manually, save the result in a machinefile, and run:

mpirun -f machinefile ./mpi_hello_world

to be able to execute programs.

I think it could be fixed by removing the last dig command in the script, but I suppose it was useful for something.

cheers!
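
The manual workaround described in this issue can be scripted. This is a minimal sketch that filters the stray root-server entry out of the host list before writing the machinefile. clean_hosts is a hypothetical name, and the sketch assumes get_hosts prints one host per line.

```shell
#!/bin/sh
# Hypothetical helper: drop any line naming a DNS root server
# (for example a.root-servers.net, with or without a trailing dot).
clean_hosts() {
    grep -v -E 'root-servers\.net\.?$'
}

# Example (not run automatically):
#   get_hosts | clean_hosts > machinefile
#   mpirun -f machinefile ./mpi_hello_world
```

This leaves the get_hosts script untouched, which sidesteps the question of what its last dig command was meant for.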

Host detection not finding second container

Hello, I have recently been trying to use this project on two Docker Engine host VMs in a private cloud. However, the get_hosts script, which runs netstat -t, does not discover any other containers, and executing mpirun hostname only shows the MPI master container.

For testing purposes I used the Docker and Docker Compose versions recommended for the cluster setup, opened all ports between the Docker Engine hosts on the private cloud, and was able to run a Docker Swarm service with multiple basic nginx image containers on each host, attached to a Docker Swarm overlay network, much like your setup. I was also able to attach a second Docker Swarm service using the nginx image to the same swarm overlay network, and I observed all of these containers communicating fine. In particular, within these nginx containers I was able to ping, curl, and telnet between containers using the overlay network IP address of each container, which can be found by running docker network list and then docker network ps network_id on each Docker host VM.

When I install these utilities in containers running my private build of alpine-mpich and run ping and telnet between them, I get no response. As I understand it, containers for services attached to a swarm overlay network should be able to communicate freely without specifying additional ports. Should docker ps on either of the two Docker host VMs show that the master or worker alpine-mpich container is using port 22/tcp?

I would appreciate any help debugging or setup advice you can provide.
