Comments (4)
Hello @gigony , thanks for putting this together.
We would love to see a GPU solution to load a multiband geotiff file (CUCIM), convert to a CUPY array so we can use CUML for processing, and back to GEOTIFF so we can present it on a map base. @jkatagi did a great job here but he was limited by CPU technology. Still a great piece of code.
I've tested CUCIM with a 700mb-235 bands GEOTTIF on a state if art MacBook pro, taking 5 minutes to load the file against 300us on a GTX 1080 equipped Linux machine. Clear way to go. We are stuck on converting the DNs into a cupy array to do our CUML stuff. Building a constellation of Hyperspectral imaging satellites, we need GPUs manipulating the data from end to end.
https://gist.github.com/jkatagi/a1207eee32463efd06fb57676dcf86c8
from cucim.
Hello @gigony, to your questions:
1- Yes. Please find here a Hyperspectral Geotiff from Hyperion. 242 bands.
2- Yes, although we are looking for a complete GPU end-to-end pipeline, for now, we are trying to figure out the red arrow A in the attached reference pipeline. To be clear, we need a solution to read the geotiff file in memory into a cupy array through GPU.
3- I will follow up with the metadata soon, which refers to red arrow B, a different beast that requires some deep thinking.
from cucim.
Hi @Terra4m, I wonder if there is a sample GEOTIFF file available public that can represent images you are handling. Would like to use it as a reference image to work on.
Looks like what your team is focusing on is loading(tif2array()
) part of the code.
Is it correct?
(Update: looks like you also want to convert the array into GEOTTIFF).
I wonder if there are other GEOTIFF-specific metadata/geotransform that you would like cuCIM to handle and which compression method (jpeg/deflate/LZW/packbits) do you usually use in GEOTIFF file?
def tif2array(input_file, calc_gain=True):
"""
read GeoTiff and convert to numpy.ndarray.
Inputs:
input_file (str) : the name of input GeoTiff file.
calc_gain (bool) : wheter calc GAIN to DN or not (defaul:True).
return:
image(np.array) : image for each bands
dataset : for gdal's data drive.
"""
dataset = gdal.Open(input_file, gdal.GA_ReadOnly)
# Allocate our array using the first band's datatype
image_datatype = dataset.GetRasterBand(1).DataType
image = np.zeros((dataset.RasterYSize, dataset.RasterXSize, dataset.RasterCount),
dtype=float)
if calc_gain == True:
# get gain
gain = get_gain_band(input_file)
# Loop over all bands in dataset
for b in range(dataset.RasterCount):
# Remember, GDAL index is on 1, but Python is on 0 -- so we add 1 for our GDAL calls
band = dataset.GetRasterBand(b + 1)
# Read in the band's data into the third dimension of our array
if calc_gain == True:
# calc gain value for each bands
image[:, :, b] = band.ReadAsArray() * gain[b]
else:
image[:, :, b] = band.ReadAsArray()
return image, dataset
from cucim.
Hi @Terra4m
I checked the file you provided. and it has a single image plane and 242 samples per pixel with no compression used.
- Image Width: 1341 Image Length: 3461
- Bits/Sample: 16
- Sample Format: signed integer
- Compression Scheme: None
- Photometric Interpretation: min-is-black
- Samples/Pixel: 242
- Rows/Strip: 1
- Planar Configuration: single image plane
TIFF data with no compression is what cuFile (GPUDirectStorage) can help if the image file is in a GDS-supported storage device (NVMe SSD, etc.).
Even with GDS, its performance would not that much. You can expect 15%~30% gain to directly load the 3d array into GPU memory (CuPy Array)
Could you please share your use cases regarding loading GeoTIFF image?
For example.
- Reading whole image into the memory and process the image (usually inference use case when image size fit to the system/GPU memory)
- Reading image patches from random files and image locations into the memory and feed into AI model (usually Training use case)
Looks like the part of the code fits to case 1.
If it is the case of 1), there is also another solution you can exploit -- using tifffile (github). It can load basic GeoTIFF into numpy array and can read basic metadata from GeoTIFF.
Using tifffile and CuPy, I was able to get 10x gain compared with the existing solution (to execute tif2array
method)
- 4.41s (including 1.1s to load/convert whole image to a numpy array. the rest time is for converting to CuPy array and applying gain) vs 48.1s, on my local machine[AMD Ryzen 5 3600XT 6-Core Processor 3.80 GHz, GTX 3090] on NVMe SSD )
And I would like to mention that, cuCIM team can schedule to implement the feature that loads GeoTIFF images with multi-samples per pixel
or multi bands(channels)
and that supports both use cases (reading whole image and part of images). However, the team needs to discuss when we start implement the feature because there are other tasks scheduled (we will publish the roadmap once determined). I think you can expect ~15% improvements with GDS for this sample image and ~10% additional gain by using cuCIM instead of tifffile because there is not much optimized regarding loading RAW/uncompressed image data.
Let me show you some example of using tifffile
and CuPy for def tif2array(input_file, calc_gain=True)
.
In the example code, the bottle neck looks like band.ReadAsArray() * gain[b]
in the original code.
(The image's shape is (3461, 1341, 242) and rasterio tries to create a separate (3461,1341,1) array for each band and merge bands array into one array, whereas tifffile would just load the image as a whole.)
5098 function calls in 47.740 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 47.740 47.740 {built-in method builtins.exec}
1 0.129 0.129 47.740 47.740 <string>:5(<module>)
1 8.046 8.046 47.610 47.610 <ipython-input-3-23d104f12ec8>:12(tif2array)
242 0.002 0.000 39.556 0.163 gdal.py:3824(ReadAsArray)
242 0.003 0.000 39.553 0.163 gdal_array.py:356(BandReadAsArray)
242 0.001 0.000 39.544 0.163 gdal_array.py:152(BandRasterIONumPy)
242 39.543 0.163 39.543 0.163 {built-in method osgeo._gdal_array.BandRasterIONumPy}
243 0.001 0.000 0.007 0.000 gdal.py:2238(GetRasterBand)
243 0.003 0.000 0.006 0.000 {built-in method osgeo._gdal.Dataset_GetRasterBand}
242 0.004 0.000 0.004 0.000 {built-in method numpy.empty}
243 0.000 0.000 0.003 0.000 gdal.py:3495(<lambda>)
484 0.002 0.000 0.002 0.000 gdal_array.py:216(flip_code)
244 0.000 0.000 0.002 0.000 gdal.py:70(_swig_setattr)
244 0.002 0.000 0.002 0.000 gdal.py:51(_swig_setattr_nondynamic)
242 0.000 0.000 0.001 0.000 gdal_array.py:235(NumericTypeCodeToGDALTypeCode)
242 0.000 0.000 0.001 0.000 gdal_array.py:240(GDALTypeCodeToNumericTypeCode)
1 0.000 0.000 0.001 0.001 gdal.py:4331(Open)
Hope this helps and please give us your feedback on the use cases and features needed.
from tifffile import TiffFile
tif = TiffFile('hyperion_allbands.tif')
tif.pages[0].dtype
# dtype('int16')
tif.geotiff_metadata
# {'KeyDirectoryVersion': 1,
# 'KeyRevision': 1,
# 'KeyRevisionMinor': 0,
# 'GTModelTypeGeoKey': <ModelType.Projected: 1>,
# 'GTRasterTypeGeoKey': <RasterPixel.IsArea: 1>,
# 'GTCitationGeoKey': 'WGS 84 / UTM zone 13N',
# 'GeogCitationGeoKey': 'WGS 84',
# 'GeogAngularUnitsGeoKey': <Angular.Degree: 9102>,
# 'ProjectedCSTypeGeoKey': <PCS.WGS84_UTM_zone_13N: 32613>,
# 'ProjLinearUnitsGeoKey': <Linear.Meter: 9001>,
# 'ModelPixelScale': [29.977628635346758, 29.991331984975442, 0.0],
# 'ModelTiepoint': [0.0, 0.0, 0.0, 405600.0, 6018600.0, 0.0]}
Using CuPy + tifffile
from tifffile import TiffFile
import cupy as cp
def get_gain_band_cupy(input_file):
"""
Dummy method
"""
return cp.ones((242,)) * 1.15
def tif2array_cupy(input_file, calc_gain=True):
"""
read GeoTiff and convert to numpy.ndarray.
Inputs:
input_file (str) : the name of input GeoTiff file.
calc_gain (bool) : wheter calc GAIN to DN or not (defaul:True).
return:
image(np.array) : image for each bands
dataset : for gdal's data drive.
"""
tif = TiffFile(input_file)
tif_image = tif.asarray() # 'maxworkers' parameter is None which means 'up to half the CPU cores are used.'
image = cp.asarray(tif_image)
# (Y, X, S) = (3461, 1341, 242). Not used here as we are using dummy `get_gain_band_cupy()` whose length matches `raster_count`.
raster_count = image.shape[2]
if calc_gain == True:
# get gain
gain = get_gain_band_cupy(input_file)
image = image * gain
return image, tif
%%timeit
out_image, tif = tif2array_cupy('hyperion_allbands.tif')
del out_image
tif.close()
4.41 s ± 289 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using existing code using GDAL
import numpy as np
from osgeo import gdal
from osgeo import gdal_array
from osgeo import osr
def get_gain_band(input_file):
"""
Dummy method
"""
return [1.15] * 242
def tif2array(input_file, calc_gain=True):
"""
read GeoTiff and convert to numpy.ndarray.
Inputs:
input_file (str) : the name of input GeoTiff file.
calc_gain (bool) : wheter calc GAIN to DN or not (defaul:True).
return:
image(np.array) : image for each bands
dataset : for gdal's data drive.
"""
dataset = gdal.Open(input_file, gdal.GA_ReadOnly)
# Allocate our array using the first band's datatype
image_datatype = dataset.GetRasterBand(1).DataType
image = np.zeros((dataset.RasterYSize, dataset.RasterXSize, dataset.RasterCount),
dtype=float)
if calc_gain == True:
# get gain
gain = get_gain_band(input_file)
# Loop over all bands in dataset
for b in range(dataset.RasterCount):
# Remember, GDAL index is on 1, but Python is on 0 -- so we add 1 for our GDAL calls
band = dataset.GetRasterBand(b + 1)
# Read in the band's data into the third dimension of our array
if calc_gain == True:
# calc gain value for each bands
image[:, :, b] = band.ReadAsArray() * gain[b]
else:
image[:, :, b] = band.ReadAsArray()
return image, dataset
%%timeit
out_image, dataset = tif2array('hyperion_allbands.tif')
del out_image
del dataset
48.1 s ± 561 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
from cucim.
Related Issues (20)
- [BUG] fix isolated test failure in `test_cache_hit_miss` HOT 2
- Using `rapids-cmake`
- Changing default CMake build mode to release (instead of debug)
- Cannot install cucim by using conda or pip HOT 2
- 23.12 nightly wheels missing Python minor version HOT 1
- Evaluate using `scikit-build-core` HOT 13
- Cleanup old build files
- Cache Notebooks HOT 1
- Cleanup `python/cucim/docs`
- Cleanup `python/cucim/ci`
- Review & cleanup `python/cucim`
- Drop `.cookiecutterrc`
- Drop `.idea`
- Validate wheel on CI
- Test CuPy 12 & 13 on CI
- Devendor CMake dependencies
- Drop Python pre-3.9 logic HOT 4
- Slowness seen using PiecewiseAffineTransform compared to scikit-image version HOT 2
- Enable `libcufile` on `linux_aarch64` (w/CUDA 12.2+) HOT 1
- [BUG] `click` is listed only in test requirements but is a required in order to use `cucim.clara` HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from cucim.