
On 09/12/10 16:29, Phil Endecott wrote:
Domagoj Saric wrote:
I have a 1e12 pixel image which is supplied as a few thousand 5000x5000 TIFF files.
AFAIU, you retile this coverage, cutting it into smaller tiles of 256x256 each. Is that correct? (I'm sorry if you've explained it already, but I've been disconnected for a couple of days.)

It could be interesting to see how the GDAL Raster I/O engine would manage. There is a thin script which can retile existing tile coverages:
http://gdal.org/gdal_retile.html
http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/python/scripts/gdal_retil...
All the work is done by the GDAL engine, written in C/C++.

BTW, are these 5000x5000 tiles from the Ordnance Survey dataset?
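Just to illustrate the kind of access the GDAL engine gives you, here is a minimal sketch of reading an arbitrary 256x256 window with GDALRasterBand::RasterIO, independently of how the file is organised internally; the file name, band choice and offsets below are made-up placeholders:

  #include "gdal_priv.h"
  #include <vector>

  int main()
  {
      GDALAllRegister();

      // Hypothetical input tile and window offsets, for illustration only.
      GDALDataset* ds =
          static_cast<GDALDataset*>(GDALOpen("input_tile.tif", GA_ReadOnly));
      if (!ds) return 1;

      GDALRasterBand* band = ds->GetRasterBand(1);  // assumes one 8-bit band
      std::vector<GByte> buf(256 * 256);

      // Read a 256x256 window starting at pixel (1024, 2048).
      band->RasterIO(GF_Read, 1024, 2048, 256, 256,
                     buf.data(), 256, 256, GDT_Byte, 0, 0);

      GDALClose(ds);
      return 0;
  }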
...As I don't see what it is actually trying to do with the input data, I cannot know whether you actually need to load entire rows of tiles (the 1400 files), but doesn't such an approach defeat the purpose of tiles in the first place?
No. Hmm, I thought this code was fairly obvious but maybe I'm making assumptions.
There is a balance between
- Buffer memory size
- Number of open files
I assume you mean "number of open files at the same time"? Obviously, the total number of open files will be huge.
- Code complexity
Options include:

1. Read the entire input, then write the entire output. This uses an enormous amount of memory, but has only one file open at any time, and is very simple.

2. Read and write a row at a time (as shown). This uses a very modest amount of memory, but requires a very large number of files to be open at the same time. It's still reasonably simple.
You mean a row as in a raster scanline, not a row of tiles in the grid of tiles, right?
3. Read and write 256 rows at a time. This uses an acceptable amount of memory (less than 1 GB), and requires 1400 input files to be open, but only 1 output file. The complexity starts to increase in this case.
It would be 256 x the width of an input raster scanline (5000), as the whole scanline needs to be decoded.
4. Read and write 5000 rows at a time. This requires a lot more RAM (15 GB) but I can have only 1 file open at a time. This is getting rather complex as there is some wrap-around to manage because 5000%256!=0.
It's possible to read in stripes (e.g. 8 scanlines at once): 8 * 5000 * 3 bytes (assuming RGB). Now, if the read block size could be optimised to 32x32, then a single read operation would require decoding four such stripes, and the operation would be repeated 8 times to generate a single 256x256 output.

There are raster backends that can perform some caching, so the same scanlines/blocks are not decoded more than once when decoding subsequent blocks along a strip of 8 (or more) scanlines. GDAL provides such a mechanism:
http://www.osgeo.org/pipermail/gdal-dev/2001-September/003165.html
http://www.osgeo.org/pipermail/gdal-dev/2001-September/003166.html
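For example, the stripe reading above could be sketched with libtiff roughly as follows (a sketch only: the file name is a placeholder and the actual chunking of each stripe is left out):

  #include <tiffio.h>
  #include <algorithm>
  #include <cstdint>
  #include <vector>

  int main()
  {
      TIFF* tif = TIFFOpen("input_tile.tif", "r");  // placeholder file name
      if (!tif) return 1;

      uint32_t height = 0;
      TIFFGetField(tif, TIFFTAG_IMAGELENGTH, &height);

      // One scanline is width * samples bytes, e.g. 5000 * 3 for RGB.
      std::size_t const line_size =
          static_cast<std::size_t>(TIFFScanlineSize(tif));
      uint32_t const stripe = 8;                    // 8 scanlines per read
      std::vector<uint8_t> buf(line_size * stripe);

      for (uint32_t row = 0; row < height; row += stripe)
      {
          uint32_t const n = std::min(stripe, height - row);
          for (uint32_t r = 0; r < n; ++r)
              TIFFReadScanline(tif, buf.data() + r * line_size, row + r, 0);
          // ... cut the 8-scanline stripe into 256 (or 32x32) wide chunks here ...
      }

      TIFFClose(tif);
      return 0;
  }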
5. Lots of schemes that involve closing, re-opening and seeking within the images. These will all be unacceptably slow.
Closing and (re)opening of input and output files seems to be orthogonal to the raster format. Seeking is only as efficient as the I/O of the particular format allows. Some formats allow scanline-based access, some allow strip-based access, and some allow tile-based access (regardless of whether the raster file is physically organised in tiles).

There is a variation on the concept of efficient access called Region of Interest (ROI), but this requires a) support specified by the format and implemented by the format access library, and b) preprocessing of the data to define the ROIs. I assume ROIs are as useless for Phil's use case as the tiled TIFFs are.

A bit of brainstorming: assuming the coverage of 5000x5000 tiles is cut into 256x256 tiles, the maximum number of input tiles per single output tile is 4: the output raster is generated by stitching the four input rasters (see the arithmetic sketch in the P.S. below). If the format backend performs scanline-based access, then for each 256-pixel-wide output scanline written, two 5000-pixel-wide scanlines are accessed. This is a real limitation that needs to be balanced against the number of 256x256 output rasters generated at the same time, etc.

Anyway, my understanding is that Boost.GIL IO takes the access options provided by the format libraries as they are specified and allows some or all of them to be utilised. Hopefully it supports the most popular access strategies for the popular formats. Thus, I don't think it's possible for Boost.GIL IO to address such specific and advanced problems as Phil's.

However, I think Boost.GIL IO backends could be not only format-specific but problem-specific as well. Perhaps Phil's problem qualifies to be solved with a new GIL IO backend providing efficient block-based access under some assumptions:
- one file open at a time
- calculation of a memory-efficient block size based on the scanline size and the maximum number of scanlines, so as not to open too many files at once
- merging of read blocks into a single 256x256 tile written to a single file
- a blocks/scanlines caching strategy (see the GDAL case above)
- etc.
I believe it would be feasible with Boost.GIL and the IO extensions, as a specialised IO driver.

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
Charter Member of OSGeo, http://osgeo.org
Member of ACCU, http://accu.org
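P.S. A minimal sketch of the tile-overlap arithmetic from the brainstorming above, showing why a single 256x256 output tile touches at most four 5000x5000 input tiles (the output tile index used here is a made-up example):

  #include <cstdio>

  int main()
  {
      int const in_size  = 5000;   // input tiles are 5000x5000 pixels
      int const out_size = 256;    // output tiles are 256x256 pixels

      int const tx = 39;                            // example output tile column
      int const first_px = tx * out_size;           // 9984
      int const last_px  = first_px + out_size - 1; // 10239
      int const first_in = first_px / in_size;      // input tile column 1
      int const last_in  = last_px  / in_size;      // input tile column 2

      // The same arithmetic applies to rows, so a single output tile spans
      // at most 2 input columns x 2 input rows = 4 input tiles.
      std::printf("output tile column %d spans input columns %d..%d\n",
                  tx, first_in, last_in);
      return 0;
  }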