
On 09/12/10 16:29, Phil Endecott wrote:
Domagoj Saric wrote:
I have a 1e12 pixel image which is supplied as a few thousand 5000x5000 TIFF files.
AFAIU, you retile this coverage, cutting it into smaller tiles of 256x256 each. Is that correct? (I'm sorry if you've explained it already, but I've been disconnected for a couple of days.)

It could be interesting to see how the GDAL Raster I/O engine would manage. There is a thin script which can retile existing tile coverages:
http://gdal.org/gdal_retile.html
http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/python/scripts/gdal_retil...
All the work is done by the GDAL engine, written in C/C++.

BTW, are these 5000x5000 tiles from the Ordnance Survey dataset?
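Just to illustrate the kind of access the GDAL engine gives you, here is a minimal sketch of reading an arbitrary 256x256 window with GDALRasterBand::RasterIO, independently of how the file is organised internally; the file name, band choice and offsets below are made-up placeholders:

  #include "gdal_priv.h"
  #include <vector>

  int main()
  {
      GDALAllRegister();

      // Hypothetical input tile and window offsets, for illustration only.
      GDALDataset* ds =
          static_cast<GDALDataset*>(GDALOpen("input_tile.tif", GA_ReadOnly));
      if (!ds) return 1;

      GDALRasterBand* band = ds->GetRasterBand(1);  // assumes one 8-bit band
      std::vector<GByte> buf(256 * 256);

      // Read a 256x256 window starting at pixel (1024, 2048).
      band->RasterIO(GF_Read, 1024, 2048, 256, 256,
                     buf.data(), 256, 256, GDT_Byte, 0, 0);

      GDALClose(ds);
      return 0;
  }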
...As I don't see what it is actually trying to do with the input data, I cannot know whether you actually need to load entire rows of tiles (the 1400 files), but doesn't such an approach defeat the purpose of tiles in the first place?
No. Hmm, I thought this code was fairly obvious but maybe I'm making assumptions.
There is a balance between
- Buffer memory size
- Number of open files
I assume you mean "number of open files at the same time"? Obviously, the total number of open files will be huge.
- Code complexity
Options include:

1. Read the entire input, then write the entire output. This uses an enormous amount of memory, but has only one file open at any time, and is very simple.

2. Read and write a row at a time (as shown). This uses a very modest amount of memory, but requires a very large number of files to be open at the same time. It's still reasonably simple.
You mean a row as in a raster scanline, not a row of tiles in the grid of tiles, right?
3. Read and write 256 rows at a time. This uses an acceptable amount of memory (less than 1 GB), and requires 1400 input files to be open, but only 1 output file. The complexity starts to increase in this case.
It would be 256 x the width of an input raster scanline (5000), as the whole scanline needs to be decoded.
4. Read and write 5000 rows at a time. This requires a lot more RAM (15 GB) but I can have only 1 file open at a time. This is getting rather complex as there is some wrap-around to manage because 5000%256!=0.
It's possible to read in stripes (e.g. 8 scanlines at once): 8 * 5000 * 3 bytes (assuming RGB). Now, if the read block size could be optimised to 32x32, then a single read operation would require decoding four such stripes, and the operation would be repeated 8 times to generate a single 256x256 output.

There are raster backends that can perform some caching, so the same scanlines/blocks are not decoded more than once when decoding subsequent blocks along a strip of 8 (or more) scanlines. GDAL provides such a mechanism:
http://www.osgeo.org/pipermail/gdal-dev/2001-September/003165.html
http://www.osgeo.org/pipermail/gdal-dev/2001-September/003166.html
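For example, the stripe reading above could be sketched with libtiff roughly as follows (a sketch only: the file name is a placeholder and the actual chunking of each stripe is left out):

  #include <tiffio.h>
  #include <algorithm>
  #include <cstdint>
  #include <vector>

  int main()
  {
      TIFF* tif = TIFFOpen("input_tile.tif", "r");  // placeholder file name
      if (!tif) return 1;

      uint32_t height = 0;
      TIFFGetField(tif, TIFFTAG_IMAGELENGTH, &height);

      // One scanline is width * samples bytes, e.g. 5000 * 3 for RGB.
      std::size_t const line_size =
          static_cast<std::size_t>(TIFFScanlineSize(tif));
      uint32_t const stripe = 8;                    // 8 scanlines per read
      std::vector<uint8_t> buf(line_size * stripe);

      for (uint32_t row = 0; row < height; row += stripe)
      {
          uint32_t const n = std::min(stripe, height - row);
          for (uint32_t r = 0; r < n; ++r)
              TIFFReadScanline(tif, buf.data() + r * line_size, row + r, 0);
          // ... cut the 8-scanline stripe into 256 (or 32x32) wide chunks here ...
      }

      TIFFClose(tif);
      return 0;
  }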
5. Lots of schemes that involve closing, re-opening and seeking within the images. These will all be unacceptably slow.
Closing and (re)opening of input and output files seems to be orthogonal to the raster format. Seeking is only as efficient as the I/O of the particular format allows. Some formats allow scanline-based access, some allow strip-based access, and some allow tile-based access (regardless of whether the raster file is physically organised in tiles).

There is a variation on the concept of efficient access called Region of Interest (ROI), but this requires a) support specified by the format and implemented by the format access library, and b) preprocessing of the data to define the ROIs. I assume ROIs are as useless for Phil's use case as the tiled TIFFs are.

A bit of brainstorming: assuming the coverage of 5000x5000 tiles is cut into 256x256 tiles, the maximum number of input tiles per single output tile is 4: the output raster is generated by stitching the four input rasters (see the arithmetic sketch in the P.S. below). If the format backend performs scanline-based access, then for each 256-pixel-wide output scanline written, two 5000-pixel-wide scanlines are accessed. This is a real limitation that needs to be balanced against the number of 256x256 output rasters generated at the same time, etc.

Anyway, my understanding is that Boost.GIL IO takes the access options provided by the format libraries as they are specified and allows some or all of them to be utilised. Hopefully it supports the most popular access strategies for the popular formats. Thus, I don't think it's possible for Boost.GIL IO to address such specific and advanced problems as Phil's.

However, I think Boost.GIL IO backends could be not only format-specific but problem-specific as well. Perhaps Phil's problem qualifies to be solved with a new GIL IO backend providing efficient block-based access under some assumptions:
- one file open at a time
- calculation of a memory-efficient block size based on the scanline size and the maximum number of scanlines, so as not to open too many files at once
- merging of read blocks into a single 256x256 tile written to a single file
- a blocks/scanlines caching strategy (see the GDAL case above)
- etc.
I believe it would be feasible with Boost.GIL and the IO extensions, as a specialised IO driver.

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
Charter Member of OSGeo, http://osgeo.org
Member of ACCU, http://accu.org
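P.S. A minimal sketch of the tile-overlap arithmetic from the brainstorming above, showing why a single 256x256 output tile touches at most four 5000x5000 input tiles (the output tile index used here is a made-up example):

  #include <cstdio>

  int main()
  {
      int const in_size  = 5000;   // input tiles are 5000x5000 pixels
      int const out_size = 256;    // output tiles are 256x256 pixels

      int const tx = 39;                            // example output tile column
      int const first_px = tx * out_size;           // 9984
      int const last_px  = first_px + out_size - 1; // 10239
      int const first_in = first_px / in_size;      // input tile column 1
      int const last_in  = last_px  / in_size;      // input tile column 2

      // The same arithmetic applies to rows, so a single output tile spans
      // at most 2 input columns x 2 input rows = 4 input tiles.
      std::printf("output tile column %d spans input columns %d..%d\n",
                  tx, first_in, last_in);
      return 0;
  }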