Dynamic ambient occlusion, based on GPU Gems 2, chapter 14, online at
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter14.html
(PDF version at
http://http.download.nvidia.com/developer/GPU_Gems_2/GPU_Gems2_ch14.pdf).

Idea:

- Crude assumption: the model is actually a number of elements, each
  element corresponding to a vertex of the original model, with normal and
  area correctly calculated. See menu "Program -> Show Elements".

- Encode the elements' info into a texture.

- For each vertex, calculate AO using the elements' info (conceptually in
  the vertex shader; see below for why it actually lands in the fragment
  shader).

Practice:

- First of all, this AO will have too much shadow (multiple shadows from
  the same item). So calculate a 2nd time, multiplying by the 1st result.
  The key observation is that "if an element is already in shadow, then it
  also doesn't contribute anything to the other shadows". The 2nd pass
  will make the result too light (because we multiplied by the results of
  the 1st pass, which were too dark...). Theoretically, more passes are
  needed. In practice, the correct result lies between the 1st and 2nd
  pass, so at the end of the 2nd pass simply average the two results.

- Note that you cannot actually do this in the vertex shader: on shader
  generation <= 3 hardware, texture2D calls in the vertex shader are
  either not allowed or horribly, terribly slow. And you want to read the
  texture a lot, to get all the elements' info. (I tried a first test
  implementation of one pass in the vertex shader: it simply doesn't work
  on older NVidia, and runs at 1 frame per couple of seconds (even with an
  element count of 1!) on fglrx.)

- So do this in the fragment shader. Also, render such that 1 screen pixel
  = 1 element, and simply write the result back to a buffer. Using a
  texture, the 1st shader passes its results to the 2nd, and the 2nd
  passes them to the CPU. For communication between shaders, texture data
  doesn't have to go through the CPU --- glCopyTexImage2D is perfectly
  designed for such tricks. For the actual rendering, I grab the colors to
  the CPU --- while it would be possible to avoid this, I would then run
  into trouble trying to access textures from the vertex shader, which is
  practically prohibited for shader generations <= 3, see the note above.

- Shader optimization notes:

  - Remember to expand constant values into the shader source, instead of
    making them uniforms (see the sketch below). This is important not
    only for speed, but also for correctness: older GPUs need to unroll
    the "for" loop (up to elements_count) at compilation. (I didn't expand
    all possible vars, because fglrx doesn't tolerate floats with a
    fractional part in GLSL code; so, to be safe for fglrx too, only ints
    are expanded this way.)

  - The above unrolling also means that this may simply be too much for
    older GPUs with larger models: too many instructions after unrolling.
    E.g. my NVidia GPU (kocury) "GeForce FX 5200/AGP/SSE2/3DNOW!" handles
    smaller models (simplico), but fails with "error: too many
    instructions" on larger ones (peach, chinchilla). More tests show that
    the border is somewhere between 24 (Ok) and 42 (not Ok) verts (for
    this GPU, of course; a much newer Radeon on a MacBookPro can handle at
    least 20k, although the naive implementation gets really slow then).
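To make the above concrete, here is a minimal sketch of the 2nd-pass
fragment shader. This is not the actual source: the sampler names and the
element packing (position + area in one RGBA texture, normals in another)
are my assumptions for illustration, and $elements_count$ / $tex_size$
mark the integer constants that get substituted into the source before
compilation, as described above.

  uniform sampler2D tex_elements_pos_area; /* xyz = position, w = area */
  uniform sampler2D tex_elements_normal;   /* xyz = normal */
  uniform sampler2D tex_pass_1;            /* AO results of the 1st pass */

  void main(void)
  {
    /* 1 screen pixel = 1 element: gl_FragCoord tells us which element
       (the "receiver") we are calculating AO for. */
    vec2 my_st = gl_FragCoord.xy / float($tex_size$);
    vec3 position = texture2D(tex_elements_pos_area, my_st).xyz;
    vec3 normal = texture2D(tex_elements_normal, my_st).xyz;

    float color = 1.0;
    /* $elements_count$ is a literal, so this unrolls at compilation. */
    for (int i = 0; i < $elements_count$; i++)
    {
      vec2 st = (vec2(mod(float(i), float($tex_size$)),
                      float(i / $tex_size$)) + 0.5) / float($tex_size$);
      vec3 e_position = texture2D(tex_elements_pos_area, st).xyz;
      float element_area = texture2D(tex_elements_pos_area, st).w;
      vec3 e_normal = texture2D(tex_elements_normal, st).xyz;

      vec3 dir = e_position - position;
      float sqr_distance = dot(dir, dir);
      if (sqr_distance != 0.0) /* skip the receiver itself */
      {
        dir /= sqrt(sqr_distance);
        float cos_emitter_angle = max(dot(e_normal, -dir), 0.0);
        float cos_current_angle = max(dot(normal, dir), 0.0);

        /* The shadowing term itself: my solid angle approximation,
           discussed below. Scaled by the emitter's 1st-pass result:
           an element already in shadow doesn't cast shadow. */
        color -= texture2D(tex_pass_1, st).x *
          element_area * cos_emitter_angle * cos_current_angle /
          sqr_distance;
      }
    }

    /* The truth lies between pass 1 (too dark) and pass 2 (too light),
       so average them. */
    color = (color + texture2D(tex_pass_1, my_st).x) * 0.5;
    gl_FragColor = vec4(vec3(max(color, 0.0)), 1.0);
  }

The 1st-pass shader would be the same, minus the two tex_pass_1 reads
(every emitter counts as fully visible, and nothing is averaged).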
Note that the GPU Gems text gives two (different) equations for
calculating element-to-element shadowing, both of them wrong IMO... Yes, I
did implement and try them; it's just a matter of changing the
"color -= ..." line above. The math equation on page 3 is equivalent to

  color -= sqrt(sqr_distance) * cos_emitter_angle *
    max(1.0, 4.0 * cos_current_angle) /
    sqrt(element_area / pi + sqr_distance);

The code on page 4 is (possibly --- not sure, as they already put the
"1 - " part there, so I don't see a reasonable way to use it?) equivalent
to

  color -= inversesqrt((element_area / pi) / sqr_distance + 1.0) *
    cos_emitter_angle * min(1.0, 4.0 * cos_current_angle);

If you grok the reasoning behind their equations, and/or there's an error
in my versions above, mail me. My implementation follows the usual
approximation of the solid angle subtended by the emitter element:

  element_area * cos_emitter_angle / sqr_distance

We additionally multiply by cos_current_angle: light coming in at an angle
is scaled down, so a blocker for which cos_current_angle is small blocks
less light.

Advantages:

- Since all the work is done on the fly, the scene may be absolutely
  dynamic. (Although on a change the list of elements must be updated,
  which costs CPU work.) For examples, see data/chinchilla_awakens
  (TimeSensor animation) and data/dynamic_world_ifs (interactive world
  changing: press left mouse button down, use keys werxdf, uiojkl).

- Can make bent normals and indirect lighting almost for free (a sketch of
  the bent-normal accumulation is at the end of these notes).

- Although the work is done in shaders, the resulting colors can be
  grabbed to the CPU. While this is not efficient, it allowed a simple
  implementation, and made it easy to debug too.

- Works with practically any 3D model. For example, check out
  castle/data/levels/fountain/fountain_final.wrl

Disadvantages:

- Like PRT and shadow fields, this requires a large number of vertexes
  (elements) to look good, and the number of vertexes is also the key
  factor determining speed. (On the up side, it should not be difficult to
  make lower-LOD versions of the elements, since they are so simple; the
  hierarchy idea from the chapter would also help a lot.) On dynamic
  models with too few vertexes, you can see that the moving shadows are
  "sticky", glued to the vertexes.

- Requires a good GPU. The fragment shader must do a *lot* of work, with a
  lot of texture lookups. Simply not possible on older GPUs.

TODO: our current method of generating elements works only for nodes with
explicit vertexes, so it will not work for X3D primitives (sphere, cone
and such).
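As for the bent normals mentioned above: a minimal sketch of how the
accumulation could be added to the pass-2 loop from the earlier sketch.
This is my assumption of one standard way to do it (the GPU Gems chapter
describes the idea), not the code of this program: the bent normal is the
average unoccluded direction, so start from the surface normal and push it
away from each occluder's direction, weighted by how much it blocks.

  /* Inside the loop, replacing the "color -= ..." line: */
  float shadow = texture2D(tex_pass_1, st).x *
    element_area * cos_emitter_angle * cos_current_angle / sqr_distance;
  color -= shadow;
  bent_normal -= dir * shadow; /* bent_normal starts equal to normal */

  /* After the loop: */
  bent_normal = normalize(bent_normal);
  /* Use bent_normal instead of normal for environment map lookups,
     or for a cheap approximation of indirect lighting. */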