This is a FastAPI application that supports TCP and HTTP clients. The server can be run locally or in a Docker container. This server is used to serve different gen-AI models that require a GPU for inference.
This server uses a queue to manage GPU consumption. The queue accepts the websocket connection and processes the request in the order it was received. The server can be scaled horizontally by running multiple instances of the server.
This is currently setup to only run a single type of model at a time. Please see the Adding New Models section for more information on how to add new models.
The following models are currently supported. Use the MODEL
environment variable to specify which model to load by including the value of the key value pair of the supported models below.
- Image Generation
- MODEL:
PixArt-alpha/PixArt-XL-2-1024-MS
-- SHORTNAME :pixart
- MODEL:
Running PyTorch with CUDA on Windows can be a bit tricky and the steps may vary based on your system configuration. The following steps should help you get started.
conda activate base
conda create --name your_environment_name python=3.11
conda activate your_environment_name
conda install cuda --channel nvidia/label/cuda-12.4.0
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
conda env update -f environment.yml
pip install -r requirements.txt
- You can test your local PyTorch/CUDA installation by using the
utils/torch_test.ipynb
notebook.
- The model files are too large to store on github and are downloaded at the start up of the Docker container.
- Additionally, the model files are too large to store more than one model at a time in a Docker container. Refer to
download.py
for logic on how the model files are checked and downloaded on server start up. - When developing locally, you can download the model files using the
utils/dl_model_files.py
script or just have the start up lifecycle do it for you (This will remove any existing files in model_files).
- You can run the server locally using the
server/main.py
script.
python server/main.py
- You can specify the host and port using the
--host
and--port
flags.
python main.py --host "127.0.0.1" --port 5000
- Server runs on port
localhost:8888
unless otherwise specified.
docker build -t remote-client-server .
If you run the container without a volume attached, make sure the model files are downloaded in the model_files
directory.
docker run -p 8888:8888 -e MODEL=pixart -e HOST=0.0.0.0 -e PORT=8888 --gpus all --name remote-client-server remote-client-server
Run the container with a volume attached with the model files.
docker run --rm -p 8888:8888 -e MODEL=pixart -e HOST=0.0.0.0 -e PORT=8888 --gpus all --name remote-client-server -v pixart-volume:/app/model_files remote-client-server
- You can use Docker volumes to store the model files.
- The volume should contain root directories for each model which should be named by the model short name.
- The volume is attached to the container at
/app/model_files
.
http://127.0.0.1:8888/docs
for Swagger UI documentation.http://127.0.0.1:8888/redoc
for ReDoc documentation.
-
ws://localhost:8888/api/generate
- Gen-AI WebSocket endpoint. -
http://localhost:8888/api/health
- Health check endpoint. -
http://localhost:8888/api/status
- Returns an object with values for the current model, queue size, GPU utilization and server status. -
http://localhost:8888/api/models/{model}
- Takes a model short name as a parameter and returns whether the correct model files are present in the model_files directory.
- Add a new file and class to the
app/gaas
directory to support the new model. - In
model_utils/model_config.py
, add the model config to theSUPPORTED_MODELS
object. - The expected_model_class can be found in the
model_index.json
of the downloaded model files. - If you are adding a new
TYPE
of model, update themodel_switch()
method in the QueueManager class to support the new model. - You can enforce type checking with pydantic by adding a new class to the
server/pydantic_models
directory.
- This project uses the Black code formatter. Please install the Black formatter in your IDE to ensure consistent code formatting before submitting a PR.
- Update ImageGen class to use generic pipeline and abstract class for different image generation models.
- Update the generation route for dynamically type checking the request for different models.
- Add semaphore and Docker env for setting the number of conncurrent operations utilzing GPU (currently set to 1).