
Container database metadata extraction and data-container builder

Home Page: https://vsoch.github.io/cdb/

License: Mozilla Public License 2.0


cdb's Introduction

Container Database (cdb)

This is the Python support tool for containerdb, used to generate data containers. Python makes it easier to generate arbitrary data structures and is popular in the data science community, so I chose it for metadata generation instead of GoLang.


Have your data and use it too!


For documentation and full examples see vsoch.github.io/cdb. These examples are also available in the examples folder.

Getting Started

What is a Data Container?

A data container is generally an operating-system-less container that is optimized to provide data, either for query/search, or binding for analysis. The qualities of the data container should be:

  1. It can be mounted to containers with operating systems to run analysis
  2. It can be interacted with on its own to search metadata about the data
  3. It should not have an operating system.

How do we generate one?

The generation is fairly simple! It comes down to a three-step multistage build:

  1. We install cdb and use it to generate a GoLang template for an in-memory database for our data.
  2. We compile the template into a binary entrypoint.
  3. We add the data and the binary entrypoint to a scratch container (no operating system).

And then we interact with it! This tutorial will show you the basic steps to perform the multistage build using a simple Dockerfile along with the data folder. The Dockerfile in the base of the repository is also a good example.

Usage

Docker Usage

The intended usage is via Docker, so you don't need to worry about installing Python or GoLang yourself. The multistage build takes care of the following:

  1. Generate a db.go template
  2. Compile it
  3. Add the compiled binary to a scratch image, along with the data, as the data container entrypoint.

Thus, to run the dummy example here using the Dockerfile:

$ docker build -t data-container .

We then have a simple way to do the following:

metadata

If we just run the container, we get a listing of all metadata alongside the key.

$ docker run data-container
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
/data/tomato.txt {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}
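The size and sha256 values in this listing are per-file metadata. As a rough illustration of how such values can be computed in Python, here is a minimal sketch. It is not cdb's actual code, and the function name `file_metadata` is hypothetical.

```python
import hashlib
import os


def file_metadata(path, chunk_size=65536):
    """Return the size in bytes and the sha256 hex digest of a file.

    Hypothetical helper for illustration; not part of the cdb API.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files are never fully loaded into memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {"size": os.path.getsize(path), "sha256": digest.hexdigest()}
```

Calling `file_metadata` on each file in a data folder would give a dictionary shaped like the values printed by the container.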

list

We can also list just the data files with -ls:

$ docker run data-container -ls
/data/avocado.txt
/data/tomato.txt

orderby

Or we can list ordered by one of the metadata items:

$ docker run data-container -metric size
Order by size
/data/tomato.txt: {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}
/data/avocado.txt: {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
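Ordering by a metric amounts to sorting the metadata entries on one of their keys. The sketch below reproduces that idea in Python using the two entries shown in this README. The `order_by` function is a hypothetical name, and skipping entries that lack the metric is an assumption rather than documented behavior.

```python
import json

# Metadata copied from the listing in this README
metadata = {
    "/data/avocado.txt": {
        "size": 9,
        "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4",
    },
    "/data/tomato.txt": {
        "size": 8,
        "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816",
    },
}


def order_by(entries, metric):
    """Return (path, metadata) pairs sorted by one metric (ascending).

    Entries that do not define the metric are skipped (an assumption).
    """
    present = [(path, meta) for path, meta in entries.items() if metric in meta]
    return sorted(present, key=lambda item: item[1][metric])


for path, meta in order_by(metadata, "size"):
    print(f"{path}: {json.dumps(meta)}")
```

With the data above, tomato.txt (size 8) is printed before avocado.txt (size 9).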

search

Or search a specific metric by value:

$ docker run data-container -metric size -search 8
/data/tomato.txt 8

$ docker run data-container -metric sha256 -search 8
/data/avocado.txt 327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4
/data/tomato.txt 3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816
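The two results above are consistent with a substring match on the string form of the metric value: only the size "8" contains an 8, while both sha256 digests do. The following Python sketch shows that matching rule. Treat the rule itself as an assumption inferred from the output, and `search` as a hypothetical function name.

```python
# Metadata copied from the listing in this README
metadata = {
    "/data/avocado.txt": {
        "size": 9,
        "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4",
    },
    "/data/tomato.txt": {
        "size": 8,
        "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816",
    },
}


def search(entries, metric, query):
    """Return (path, value) pairs whose metric value contains the query string."""
    results = []
    for path, meta in sorted(entries.items()):
        if metric not in meta:
            continue
        value = str(meta[metric])  # compare as text so numbers and hashes behave alike
        if query in value:
            results.append((path, value))
    return results


print(search(metadata, "size", "8"))
print([path for path, _ in search(metadata, "sha256", "8")])
```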

get

Or we can get the metadata for a particular file by its name:

$ docker run data-container -get /data/avocado.txt
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}

or a partial match:

$ docker run data-container -get /data/
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
/data/tomato.txt {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}
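A lookup that accepts either a full name or a partial one can be sketched as an exact key match with a substring fallback. This is only an illustration of the behavior shown above: the function name `get` and the fallback order are assumptions, and the metadata is abbreviated to the size field.

```python
# Abbreviated metadata (sizes only) for the two files in this README
metadata = {
    "/data/avocado.txt": {"size": 9},
    "/data/tomato.txt": {"size": 8},
}


def get(entries, key):
    """Return the entry for an exact key, else all entries whose path contains it."""
    if key in entries:
        return {key: entries[key]}
    return {path: meta for path, meta in entries.items() if key in path}


print(get(metadata, "/data/avocado.txt"))  # exact match: one entry
print(get(metadata, "/data/"))             # partial match: both entries
```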

start

The -start flag keeps the container running, which is useful when it is managed by an orchestrator.

$ docker run data-container -start

Orchestration

It's more likely that you'll want to interact with files in the container via some analysis, or more generally, another container. Let's put together a quick docker-compose.yml to do exactly that.

version: "3"
services:
  base:
    restart: always
    image: busybox
    entrypoint: ["tail", "-f", "/dev/null"]
    volumes:
      - data-volume:/data

  data:
    restart: always
    image: data-container
    command: ["-start"]
    volumes:
      - data-volume:/data

volumes:
  data-volume:

Notice that the command for the data-container to start is -start, which is important to keep it running. After building our data-container, we can then bring these containers up:

$ docker-compose up -d
Starting docker-simple_base_1   ... done
Recreating docker-simple_data_1 ... done
$ docker-compose ps
        Name                Command         State   Ports
---------------------------------------------------------
docker-simple_base_1   tail -f /dev/null    Up           
docker-simple_data_1   /entrypoint -start   Up           

We can then shell inside and see our data!

$ docker exec -it docker-simple_base_1 sh
/ # ls /data/
avocado.txt  tomato.txt

The metadata is still available for query by interacting with the data-container entrypoint:

$ docker exec docker-simple_data_1 /entrypoint -ls
/data/avocado.txt
/data/tomato.txt

Depending on your use case, you could easily make this available inside the other container. This is very simple usage, but the idea is powerful! We can interact with the dataset and search it without needing an operating system. It follows that we can develop customized data-containers based on the format / organization of the data inside (e.g., a data-container that knows how to expose inputs, outputs, etc.)
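If another program needs the metadata, one option is to parse the text printed by the entrypoint. The sketch below assumes each line has the form `<path> <json>` (optionally with a colon after the path, as in the ordered listing). The name `parse_listing` is hypothetical, and the sample text is copied from the listing earlier in this README rather than captured from a running container.

```python
import json


def parse_listing(text):
    """Parse lines of the form '<path> <json>' into a {path: metadata} dict."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Split where the JSON object starts, so spaces in paths are tolerated
        path, separator, rest = line.partition(" {")
        if not separator:
            continue  # blank lines and headers such as "Order by size"
        entries[path.rstrip(":")] = json.loads("{" + rest)
    return entries


sample = """\
/data/avocado.txt {"size": 9, "sha256": "327bf8231c9572ecdfdc53473319699e7b8e6a98adf0f383ff6be5b46094aba4"}
/data/tomato.txt {"size": 8, "sha256": "3b7721618a86990a3a90f9fa5744d15812954fba6bb21ebf5b5b66ad78cf5816"}
"""

for path, meta in parse_listing(sample).items():
    print(path, meta["size"])
```

In practice the text could come from running the entrypoint through `subprocess.run` with `capture_output=True`, which is left out here so the example runs without Docker.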

Python Usage

The above doesn't require you to install the Container Database (cdb) metadata generator. However, if you want to install it (to develop or otherwise interact with it), you can do the following. First, install cdb from PyPI or a local repository:

$ pip install cdb

or

git clone git@github.com:vsoch/cdb
cd cdb
pip install -e .

Command Line

The next step is to generate the GoLang file to compile. First, change directory to somewhere you have a dataset folder. For example, in tests we have a dummy "data" folder.

cd tests/

We might then run cdb generate to create the GoLang source file for our container's entrypoint, targeting the tests/data folder:

$ cdb generate data --out db.go

The db.go file is then in the present working directory. You can either build it during a multistage build as is done in the Dockerfile, or do it locally with your own GoLang install and then add to the container. For example, to compile:

go get github.com/vsoch/containerdb && \
GOOS=linux GOARCH=amd64 go build -ldflags="-w -s" -o /db -i /db.go

And then a very basic Dockerfile would need to add the data at the path specified, and the compiled entrypoint.

FROM scratch
WORKDIR /data
COPY data/ .
COPY db /db
CMD ["/db"]

A more useful entrypoint will be developed soon! This is just a very basic start to the library.

Python

You can run the same generation functions interactively with Python.

from cdb.main import ContainerDatabase
db = ContainerDatabase(path="data")
# <cdb.main.ContainerDatabase at 0x7fcaa9cb8950>

You can see that there is a files generator at db.files:

db.files
# <generator object recursive_find at 0x7fcaaa4ae950>
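The generator's name, recursive_find, suggests that files are discovered lazily by walking the data folder. A minimal sketch of such a generator is shown below. It is an assumption about the general technique, not a copy of cdb's implementation.

```python
import os


def recursive_find(base):
    """Lazily yield the path of every file under base, including subfolders.

    Illustrative sketch only; cdb's own recursive_find may differ.
    """
    for root, _dirs, filenames in os.walk(base):
        for filename in sorted(filenames):
            yield os.path.join(root, filename)
```

Because it is a generator, nothing is read from disk until you iterate over it, for example with `list(recursive_find("data"))`.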

And then generate! If you don't provide an output file, a string will be returned. Otherwise, the output file name is returned.

output = db.generate(output="db.go", force=True)

Currently, functions for parsing metadata are defined in cdb/functions.py, however you can also define a custom import path. This has not yet been tested and will be soon. We will also be adding more real-world examples soon.

License

  • Free software: MPL 2.0 License


cdb's Issues

COPY failed: file not found in build context or excluded by .dockerignore: stat data: file does not exist

  • cdb version: 0.0.1
  • Python version: Python 3.9.1 (default, Dec 28 2020, 11:24:06)
  • Operating System: macOS Catalina 10.15.7

    Davids-MBP:hello davidacasciotti$ docker build -t data-container .
    Sending build context to Docker daemon 1.402MB
    Step 1/17 : FROM bitnami/minideb:stretch as generator
    ---> 14c86ccce4ab
    Step 2/17 : ENV PATH /opt/conda/bin:${PATH}
    ---> Using cache
    ---> 7e4f109e5f14
    Step 3/17 : ENV LANG C.UTF-8
    ---> Using cache
    ---> 6f55f435cbd7
    Step 4/17 : RUN /bin/bash -c "install_packages wget git ca-certificates && wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda && rm Miniconda3-latest-Linux-x86_64.sh"
    ---> Using cache
    ---> b5f409e8833b
    Step 5/17 : RUN pip install cdb==0.0.1
    ---> Using cache
    ---> 25017e24efcf
    Step 6/17 : WORKDIR /data
    ---> Using cache
    ---> fb2bed680f30
    Step 7/17 : COPY ./data .
    COPY failed: file not found in build context or excluded by .dockerignore: stat data: file does not exist


What I Did

Used the Dockerfile from the cdb GitHub repository:

# docker build -t data-container .
FROM bitnami/minideb:stretch as generator
ENV PATH /opt/conda/bin:${PATH}
ENV LANG C.UTF-8
RUN /bin/bash -c "install_packages wget git ca-certificates && \
    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda && \
    rm Miniconda3-latest-Linux-x86_64.sh"

# install cdb (update version if needed)
RUN pip install cdb==0.0.1

WORKDIR /data
COPY ./data .
RUN cdb generate /data --out /entrypoint.go

FROM golang:1.13-alpine3.10 as builder
COPY --from=generator /entrypoint.go /entrypoint.go
COPY --from=generator /data /data

# Dependencies
RUN apk add git && \
    go get github.com/vsoch/containerdb && \
    GOOS=linux GOARCH=amd64 go build -ldflags="-w -s" -o /entrypoint -i /entrypoint.go

FROM scratch
LABEL MAINTAINER @vsoch
COPY --from=builder /data /data
COPY --from=builder /entrypoint /entrypoint

ENTRYPOINT ["/entrypoint"]
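The build log shows the failure at Step 7, where `COPY ./data .` expects a `data` directory in the build context and none was found. A small pre-flight check along these lines can catch that before building. This is a hypothetical helper, not part of cdb or Docker. It only handles the simple shell form of COPY, without wildcards or the JSON array form.

```python
import os


def missing_copy_sources(dockerfile_text, context_dir):
    """List COPY sources expected in the build context that do not exist."""
    missing = []
    for line in dockerfile_text.splitlines():
        parts = line.strip().split()
        if len(parts) < 3 or parts[0].upper() != "COPY":
            continue
        args = parts[1:]
        # COPY --from=<stage> reads from an earlier stage, not the build context
        if any(arg.startswith("--from=") for arg in args):
            continue
        args = [arg for arg in args if not arg.startswith("--")]
        for source in args[:-1]:  # the last argument is the destination
            if not os.path.exists(os.path.join(context_dir, source)):
                missing.append(source)
    return missing
```

For example, `missing_copy_sources(open("Dockerfile").read(), ".")` returns `["./data"]` when the folder is absent from the directory where `docker build` is run.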
