"Super Imposter" shader: my journey to make a high quality imposter shader

Landon Townsend
Feb 9, 2022

Updated: Feb 16, 2022

The Goal

First, a definition of what I was trying to accomplish. Rendering with polygons and rendering with voxels are two very well known methods of rendering 3D objects. The challenge I gave myself, since as far as I know this is not well explored territory, was to render a somewhat complex 3D object where the only data given is 2D textures, alongside some other simple data structures like matrices and vectors. There is no polygonal data aside from the cuboid representing the bounds of the object, which the object will be rendered onto. Using "imposters" (static images of a 3D object from different angles, where the image is chosen based on the closest angle) is a well known technique; I wanted to make a version of this idea that is even higher quality, to the point of being nearly impossible to distinguish from real polygonal geometry even close up. This started as a simple test of graphics techniques and linear algebra and evolved into something much more interesting as I found better and better methods to achieve this.

Part 1: Setting Up The Data

I determined this would be the data needed in order to begin work.

Color textures from multiple camera renders (I settled on orthographic cameras for the sake of simplicity)
0 to 1 linear depth textures from the associated camera renders
The properties of the camera (their orthonormal world-to-camera transformation matrices, their near and far clip which the depth textures will be relative to, and their sizes)
The object to world matrix of the source object (so that we have a baseline for the local space the cameras will be in relative to the object, and we can draw the object at the center of the bounding cube we're rendering on)

I also found I would need the following data, at least for early work:

Hand-tuned values for the approximate bounding box of the source object, so it can be fit properly into the bounds of the cube

Part 2: Preparing the Starting Ray Position and Direction

In order to properly represent the source object in a cube that can be rotated, scaled, and translated, which will ideally rotate, scale, and translate the perceived 3D object, I did the following:

I captured the view direction from the camera to the cuboid pixel, and the camera position, both in world space
I multiplied them both by the cuboid's world to object matrix; this put them in the cuboid's object space
I passed in the source object's object-to-world space matrix from a script. I also passed in an X, Y and Z scale meant to represent the source object's bounding box. Multiplying the matrix's first 3 columns by these values "scaled" the matrix so that the values in local space represent the full size of the source object.
Multiplying the cuboid object space view vector and camera position by the above matrix transformed them into a modified world space, where the view vector and camera position would be if you took the cuboid and moved it to the position and rotation, and scaled it to the bounds of the source object.
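The steps above can be sketched on the CPU side in Python (not the shader code itself). This is a minimal sketch assuming row-major 4x4 matrices stored as nested lists; the names `cuboid_world_to_object`, `source_object_to_world` and `bounds_scale` are placeholders for illustration, not names from the original shader.

```python
def mat_vec(m, v, w):
    """Multiply a row-major 4x4 matrix by (v.x, v.y, v.z, w) and return xyz."""
    x, y, z = v
    return tuple(m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3] * w
                 for r in range(3))

def scale_columns(m, scale):
    """Scale the first three columns of a 4x4 matrix by (sx, sy, sz)."""
    return [[m[r][0] * scale[0], m[r][1] * scale[1], m[r][2] * scale[2], m[r][3]]
            for r in range(4)]

def ray_in_source_space(cam_pos_ws, view_dir_ws, cuboid_world_to_object,
                        source_object_to_world, bounds_scale):
    # World space -> cuboid object space (w=1 for points, w=0 for directions).
    pos_os = mat_vec(cuboid_world_to_object, cam_pos_ws, 1.0)
    dir_os = mat_vec(cuboid_world_to_object, view_dir_ws, 0.0)
    # Cuboid object space -> the "modified world space" of the source object.
    m = scale_columns(source_object_to_world, bounds_scale)
    return mat_vec(m, pos_os, 1.0), mat_vec(m, dir_os, 0.0)
```

Using w=0 for the view direction keeps the translation from being applied to it, which is why the position and the direction are transformed separately.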

At this point I was all set up to do whatever ray tracing or ray marching algorithm I wanted, with the translation, rotation and scale of the cuboid represented in the rendering of the "contained" 3D object.

Part 3: A Naïve Solution

The most obvious approach to rendering a 3D object from color and depth texture sources was to do something similar to Parallax Occlusion Mapping. In its simplest form, POM marches a ray, based on view direction and position, through a height map step by step until the current position lands under the height map; that position is considered the point of collision, and the UV at that point is used as the source UV for the color/normal/etc. samples. If you're unfamiliar with parallax occlusion mapping, this link has pictures and details on potential implementations.

https://learnopengl.com/Advanced-Lighting/Parallax-Mapping
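For readers who prefer code, here is a minimal Python sketch of the step-and-check idea, reduced to a 1D height map so it stays short. It assumes a height convention (0 means empty, larger values are closer to the camera) and a ray that counts as collided once its z drops below the stored height; `march_fixed_steps` and its parameters are illustrative names only.

```python
import math

def march_fixed_steps(heights, x0, z0, dx, dz, steps):
    """March a ray over a 1D height map in equal steps.

    heights: one float in [0, 1] per texel, 0 meaning "nothing here".
    (x0, z0): ray start, x in texel units, z as height above the base.
    (dx, dz): total displacement covered by all steps together.
    Returns the texel index of the first sample under the surface, or None.
    """
    for i in range(1, steps + 1):
        t = i / steps
        x = x0 + dx * t
        z = z0 + dz * t
        texel = math.floor(x)
        if 0 <= texel < len(heights) and z < heights[texel]:
            return texel
    return None
```

With a one-texel spike such as `[0, 0, 0, 0.9, 0, 0, 0, 0]`, a ray from `(0.5, 1.0)` with displacement `(7.0, -1.0)` finds the spike at 100 steps but steps straight over it at 2 steps, which is the weakness discussed later in this part.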

I started with the results from a single camera after implementing a simple POM algorithm. I am demonstrating this with the following shoe model, as it is a good example of the complexity I am trying to support. I originally used a simpler lion statue model, but I feel it is better to use the same model throughout this post, so I will use the shoe for demonstration.

https://sketchfab.com/3d-models/pb170-sneaker-hi-7fc3af4900a44c2c8c6d4b4f7d164659

The model is moderately complex and extremely high poly so it will make a great demonstration of the advantages of my final result.

With our step-and-check technique, the result for 100 steps with a single camera was this.

You'll notice the stretching of the object, especially the shoelaces; this is because the ray has passed under the depth texture and recorded a collision, returning the UV of that point.

We actually want this: this "stretched" area represents important information about where the object could potentially be, and in the worst case scenario can potentially be drawn if the shader detects no better camera data at that particular spot.

As with POM, there is a problem when viewing the object at a steep angle:

The reason this happens is that the steps are taken at fixed intervals of distance in the Z direction. When the view direction is near parallel to the camera plane, its Z value is very low, so stepping that far in the Z direction results in a HUGE step in the XY direction, missing the entire object in a single step. We can't have this, because missing the entire object in one camera affects the data about where the object could be when other cameras are factored in.

To fix this, we can take the start and end positions of our ray march and move them, in the direction of the view direction, towards the camera view bounds; the object is guaranteed to be inside those bounds, so we will still have an accurate representation about what information is in that direction.

Finally, after we've collided, we can back up a step and lower the step size, and continue; this will result in a narrowing down on detail after we detect a collision, resulting in a substantial increase in the quality of the final result.
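One simple way to "back up and lower the step size" is to bisect between the last sample above the surface and the first sample below it. The sketch below uses plain bisection, which may differ from the exact refinement used in the shader; `height_at` is a hypothetical callback that returns the surface height at a given x.

```python
def refine_hit(height_at, x0, z0, dx, dz, t_prev, t_hit, rounds=8):
    """Narrow down a collision found by a coarse march.

    t_prev: ray parameter of the last sample above the surface.
    t_hit:  ray parameter of the first sample below the surface.
    Each round halves the interval, keeping the crossing point inside it.
    """
    for _ in range(rounds):
        t_mid = 0.5 * (t_prev + t_hit)
        x = x0 + dx * t_mid
        z = z0 + dz * t_mid
        if z < height_at(x):
            t_hit = t_mid    # still under the surface: move the hit earlier
        else:
            t_prev = t_mid   # above the surface: move the safe point later
    return t_hit
```

Eight rounds shrink the uncertainty by a factor of 256, which is far cheaper than taking 256 times as many coarse steps.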

Once we factor in other cameras, it's time to introduce a final element: disparity. When you detect a collision, you can check the distance between the Z of the step position and the sampled depth; this tells you how far you are into the "stretched" area extending past the actual camera data. We want that to be as small as possible, so when you've run the steps on all the cameras, the "source of truth" should be the camera where the "disparity" is the lowest.
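As a rough illustration of choosing the "source of truth", the Python sketch below picks the camera with the smallest disparity. It assumes each camera has already produced a ray z and a sampled depth at its collision point; the function name and inputs are made up for this example.

```python
def pick_camera(ray_zs, sampled_depths):
    """Return (camera_index, disparity) for the camera with the lowest disparity.

    ray_zs[i]:         z of the stepped position in camera i's space.
    sampled_depths[i]: depth texture value camera i sampled at that position.
    """
    best_index, best_disparity = None, float("inf")
    for i, (z, depth) in enumerate(zip(ray_zs, sampled_depths)):
        disparity = abs(z - depth)  # how far we are into the "stretched" area
        if disparity < best_disparity:
            best_index, best_disparity = i, disparity
    return best_index, best_disparity
```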

From this point you can start moving forward with resolving some of the artifacts and resolving whether the pixels should be mapped as transparent. In the below picture, I went with a very simple solution where the object is opaque only on the pixels where all cameras have collided.

There are some complexities with making this fully work and some artifacts that are not easily sorted out (don't worry, we'll get to those later), but our main concern is that, with this technique, balance of quality over performance is terrible. At a hundred steps the shader runs extremely poorly, and if you reduce the number of steps (and to an extent, even if you don't) there is a problem where stepping through the depth map will step over things (this is a well known problem with Parallax Occlusion Mapping as well.) Notice the holes in the shoelace:

When the number of steps is reduced for better performance, the problem is even worse.

Sorting out the artifacts will not solve these problems, and there is no way to "dial in" on the detail if we miss the collision in the first place, so we need to find a better solution.

Part 4: Ideas That Did Not Pan Out

Understanding the things in this section is not important, so I will not include pictures; if you want to see the solution that worked, skip to the next part.

1. Better bounds detection

On each camera, along with the near and far clip and storing the depth of the pixel closest to the camera, I began looking into ways to better narrow down the bounds containing the object. I realized that if I had a way to, in a single step, calculate the distance to the actual part of the depth texture containing the values, I could combine the results for all the cameras and get the smallest possible area that needs to be stepped through. To do this, I generated a pre-baked texture for every camera where the X position is the point along the outside of the depth texture and the Y position is the angle of the view direction locked to the XY dimensions, and the output is the distance traveled before hitting a value on the depth texture that wasn't 0.

To an extent this worked and I gained a much better start and end point to march through: this resulted in a significant improvement in quality per steps.

This had the following problems:

It requires a whole extra set of textures, one for each camera, to be passed to the shader.
The process of taking the ray direction and position and converting it to data that could be fed into the XY coordinate of the texture to get the distance, was convoluted enough and expensive enough to prefer to do without it if we had a better solution.
The biggest problem: the increase in quality per amount of steps was good, and mitigated, but did not eliminate, the problem where you could step past geometry and end up with holes in your render. I wanted to completely eliminate this if I could.

2. Recursive narrowing of the stepping range

Combined with the above idea, I found a way to bake custom mip maps / LODs into my depth textures. I turned the depth texture into a 2 color texture, and for each pixel on each lower resolution mip map, I stored the max and min depth values of every point that's within that pixel on the full resolution depth texture. My idea was to take the start and end point and step through these custom mip maps in a way where I could get a new range for the start and end points to step through, and then repeat the process until I narrowed down in fine detail the correct position.

This was ultimately a failure; many pixels reached an equilibrium where the min and max depth were so far apart that the recursive step stopped narrowing down the range, so the range failed to shrink exponentially and converge. However, part of this solution, namely the custom mip maps, became key to the solution I finally arrived at.

Part 4.5: Computing our start and end point

One thing that did pan out, to replace the "better bounds checking" above, was a loop that did a simple box collision check on all the cameras to find the start and end collision of our bounds.

clipData.z is the orthographic camera size, and clipData.x and clipData.y are the near and far clip planes of the camera. rayDir is the view direction, normalized, in local camera transform space.

The end result of running this function on every camera is getting the entry and exit position of the ray along the view direction through the "union" of the camera bounds cuboids. maxCloseDist starts at 0. If minFarDist ends up being below 0, we missed one of the cubes and can early out, but the important part of this is having a very cheap way to get a good start point for our march, as well as an endpoint to check against to see if we've escaped the bounds of the object.
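The original shader code for this check is not reproduced here, so the following Python sketch is only a guess at its shape, using a standard ray/box slab test. It assumes each camera looks down +z and that its bounds span [-size, size] in x and y and [near, far] in z; the function names are hypothetical, while `clip_data`, `max_close_dist` and `min_far_dist` mirror the names used above.

```python
import math

def ray_box_distances(ray_pos, ray_dir, clip_data):
    """Entry and exit distance of a ray through one camera's bounds."""
    near, far, size = clip_data
    box_min = (-size, -size, near)
    box_max = (size, size, far)
    close_dist, far_dist = -math.inf, math.inf
    for p, d, lo, hi in zip(ray_pos, ray_dir, box_min, box_max):
        if d == 0.0:
            if p < lo or p > hi:
                return math.inf, -math.inf  # parallel to this slab and outside it
            continue
        t0, t1 = (lo - p) / d, (hi - p) / d
        close_dist = max(close_dist, min(t0, t1))
        far_dist = min(far_dist, max(t0, t1))
    return close_dist, far_dist

def march_range(rays):
    """rays: one (ray_pos, ray_dir, clip_data) tuple per camera, in that camera's space."""
    max_close_dist, min_far_dist = 0.0, math.inf
    for ray_pos, ray_dir, clip_data in rays:
        close_dist, far_dist = ray_box_distances(ray_pos, ray_dir, clip_data)
        max_close_dist = max(max_close_dist, close_dist)
        min_far_dist = min(min_far_dist, far_dist)
    return max_close_dist, min_far_dist
```

Because the ray direction is normalized in every (orthonormal) camera space, the distances from different cameras are directly comparable, so a single max and a single min are enough to combine them.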

Part 5: Foolproofing Our Ray March

I began the process of coming to this solution by saying: what is really happening when we "miss" existing geometry while raymarching?

What has happened is that one of our steps has passed over a point in the depth buffer that would have been detected if more steps had been taken. However, more steps means worse performance. Is there a way that, at the points A or B, we could know that peak is there and account for it by shortening our steps?

For a previous attempt at an optimization in part 4, I wrote a script and a shader to generate custom mipmaps for the depth texture. At each point of a higher mip / lower resolution, it samples the 4 pixels from the lower mip / higher resolution, and uses the max, instead of the average (which is what mip maps use by default).
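The mip generation itself is easy to express outside a shader. Below is a small Python sketch that builds such a max-based mip chain from a square, power-of-two depth map stored as nested lists; it illustrates the idea and is not the script or shader I used.

```python
def build_max_mips(depth):
    """Build a mip chain where each texel stores the max of its 2x2 children.

    depth: square 2D list whose side length is a power of two.
    Returns [mip0, mip1, ...] with mip0 being the full resolution input.
    """
    mips = [depth]
    while len(mips[-1]) > 1:
        prev = mips[-1]
        half = len(prev) // 2
        mips.append([[max(prev[2 * y][2 * x], prev[2 * y][2 * x + 1],
                          prev[2 * y + 1][2 * x], prev[2 * y + 1][2 * x + 1])
                      for x in range(half)]
                     for y in range(half)])
    return mips
```

A single high texel survives all the way up the chain, so every coarser level still "knows" that something tall lives inside that region.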

Since these custom mip maps encode the max height of any pixel on the original depth map within a region, we can solve the earlier problem: a correct mip map sample will warn us of upcoming potential geometry that can be collided with:

When stepping forward, if our Z value at point B is below the height at the mip (in red), we can back up and try again, this time at the higher resolution / lower mip level (in purple).

Continuous application of this will narrow down and eventually find the collision.

On the other hand, what happens if we miss the peak but still hit the mip?

As we continue to either collide and lower the mip level, or not collide and move forward, we will eventually pass the peak:

Now the problem is, once we've passed a peak like this, our step size is too low, and making it through the rest of the depth map will take too many steps. To mitigate this, we can simply increase the step size / decrease the mip resolution each time we don't collide with the mip. To summarize:

Step forward to the edge of our current mip pixel.

If our Z value is under the depth value at that mip pixel, back up, lower the step size / increase the resolution and try again.

If it's above the depth value at that mip pixel, move forward and increase the step size / lower the resolution.

The result is that in areas where there are potential collisions, we will slow down, and in areas that are safe, we will speed up. With this strategy, we don't need to take enough steps to guarantee we get to the end; we can simply start with a fairly large step and run this for a while, and assume it will eventually narrow down to the proper collision.
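To make the summary concrete, here is a 1D Python sketch of the whole loop. It assumes the same height convention as the max mips (0 is empty, the ray collides when its z is below the stored height) and a ray that moves along x; the names and the small `eps` nudge past each edge are choices made for this sketch rather than details of the shader.

```python
import math

def build_max_mips_1d(heights):
    mips = [list(heights)]
    while len(mips[-1]) > 1:
        prev = mips[-1]
        mips.append([max(prev[2 * i], prev[2 * i + 1]) for i in range(len(prev) // 2)])
    return mips

def adaptive_march(heights, x0, z0, dx, dz, max_steps=64, eps=1e-6):
    """Return the texel index of the collision, or None if the ray escapes."""
    if dx == 0:
        raise ValueError("this 1D sketch needs a ray that moves along x")
    mips = build_max_mips_1d(heights)   # len(heights) must be a power of two
    top = len(mips) - 1
    level, t = top, 0.0
    for _ in range(max_steps):
        x = x0 + dx * t
        if not 0.0 <= x < len(heights):
            return None                 # left the depth map without a hit
        width = 2 ** level
        cell = math.floor(x / width)
        edge = (cell + 1) * width if dx > 0 else cell * width
        t_edge = (edge - x0) / dx
        # Lowest z the ray reaches while inside this mip pixel.
        lowest_z = min(z0 + dz * t, z0 + dz * t_edge)
        if lowest_z < mips[level][cell]:
            if level == 0:
                return cell             # collided at full resolution
            level -= 1                  # possible hit: stay put, refine
        else:
            t = t_edge + eps            # safe: cross the pixel, then coarsen
            level = min(level + 1, top)
    return None                         # out of steps before converging
```

On the spike `[0, 0, 0, 0.9, 0, 0, 0, 0]` with a ray from `(0.5, 1.0)` and displacement `(7.0, -1.0)`, this finds texel 3 in eight iterations, where a fixed-step march needs dozens of steps to avoid skipping it.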

To make this absolutely fool proof, we will need to step through the mip pixels with a pixel stepping algorithm. This is fairly cheap to implement:

This code, when given "pixelDist", which is the width of the mip pixel in the camera's local space (meaning, scaled with the size of the camera), will get the location of the closest edge of the pixel in the direction of the view direction (rayDir), and compare the Z of that point to the current depth sample (or the Z of the previous position, if rayDir.z < 0 and we are stepping towards the camera instead of away from it). If the Z is past the depth, we collided, so we will back up and switch to a higher resolution mip level (we already start our step before this function by dropping the mip level to a higher resolution by 1 step). If it's not past the depth, we did not collide, so we go 2 steps to a lower mip resolution (ending the function 1 step lower than we started).
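The edge-finding part of that description can be sketched in a few lines of Python. This is a reconstruction from the text, not the shader function itself: `distance_to_pixel_edge` is a hypothetical name, `pixel_dist` plays the role of "pixelDist", and the pixel grid is assumed to be aligned to the origin of the camera's local XY plane.

```python
import math

def distance_to_pixel_edge(pos_xy, dir_xy, pixel_dist):
    """Distance along the ray to the nearest edge of the mip pixel containing pos_xy.

    The result is in multiples of dir_xy, so the edge point is
    pos_xy + dir_xy * distance.
    """
    best = math.inf
    for p, d in zip(pos_xy, dir_xy):
        if d == 0:
            continue  # never crosses an edge on this axis
        cell = math.floor(p / pixel_dist)
        edge = (cell + 1) * pixel_dist if d > 0 else cell * pixel_dist
        best = min(best, (edge - p) / d)
    return best
```

The z of the edge point is then the ray's z advanced by the same distance, which is the value compared against the depth sample.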

Pictured is a possible result of the pixel stepping:

I've profiled many, many alterations on this algorithm, and between the conciseness / inexpensiveness of the per-step code, and the general number of steps in most cases, this particular algorithm has turned out to have the best performance for a "foolproof" (no chance of missing detail) algorithm. In most cases this code will narrow down a collision to the lowest mip level in less than 30 steps due to its tendency to "speed up" in areas of low detail and "slow down" to approach a surface.

Alternatives explored include:

Different amounts of mip levels jumped when colliding and when not colliding

Storing multiple mip levels in the depth texture in different color channels and vectorizing multiple checks at once (the code for this was pretty cool but in the end it simply ended up adding a bit too much extra math to the main loop to perform better)

Starting at various mip levels (starting at 2 or 3 LOD levels below the lowest possible resolution tends to give the best performance consistently)

Here's the result on one camera with 80 steps:

Note how there is no chance of this algorithm missing detail. The holes are gone:

Six cameras:

We aren't getting any holes now, but we still need some work. First step is figuring out what exactly is happening at the points where we are not getting a correct result.

Part 6: Syncing rays

Let's take a point on the shoe that is giving us an incorrect result and figure out what it's doing, and what it should be doing.

Let's look at the current camera views and figure out which ones should be able to reach the surface there.

Of this there are a few cameras that should definitely be able to see the proper side of the shoe, most noticeably numbers 3 and 6. So let's isolate those and see what is happening.

Camera 3 should be able to see it, but it is hitting the point in the depth map under the shoelace and stopping.

Camera 6 is getting the entirety of the information it should have blocked by the view vector hitting the shoelace and, again, stopping.

Now, since all of these cameras are running view vectors with the same start and end point, and the steps never fully "enter" the depth map (they step back whenever they hit a potential collision) it's actually not only safe, but better for performance if we simply sync the cameras up position wise every iteration. Unfortunately, they're all marching in different camera spaces. But the spaces are orthonormal and the view direction vector is the same world space vector in all the different spaces, and we can save the original ray position from the start of our function after transforming them into the different matrices (we'll call that mainCameraPos). Because we've kept our camera spaces orthonormal, we can actually simply keep track of the max distance we've gotten from the main camera after every iteration for every camera:

(in this case "i" is iterating through all the cameras)

Then we can just apply it to all the current stepped positions for all the cameras:

With this, none of the cameras can get "stuck" on the edge of the depth map because after each iteration, it will be "caught up" with any camera that has gotten further.
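A CPU-side Python sketch of the syncing step is shown below. It assumes per-camera lists of start positions (the saved mainCameraPos in each camera space), current stepped positions, and normalized ray directions; the function name is made up for the example.

```python
import math

def sync_positions(main_camera_positions, stepped_positions, ray_dirs):
    """Move every camera's stepped position to the furthest distance reached.

    All three lists hold one 3D tuple per camera, expressed in that camera's
    own orthonormal space, so distances are comparable between cameras.
    """
    max_dist = 0.0
    for start, pos in zip(main_camera_positions, stepped_positions):
        max_dist = max(max_dist, math.dist(start, pos))
    synced = [tuple(s + d * max_dist for s, d in zip(start, direction))
              for start, direction in zip(main_camera_positions, ray_dirs)]
    return max_dist, synced
```

This only works because the camera spaces are orthonormal: a distance of 3 units along the ray means the same thing in every camera space.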

Implementing this solves all of the artifacts with our current method.

Part 7: Early exits

With any raymarching algorithm, you need an early exit condition. The general idea of an early exit is a branch that, under a certain condition, skips the bulk of the work and does a small amount of computation afterwards before completely exiting the shader. With branching, all threads running in parallel are only as fast as the slowest thread, and cycles will be wasted on other threads if one thread takes too long. However, with raymarching and other loop-based graphics techniques, it is very common that all of the threads being run in parallel will early exit at some point, so branching is still recommended and will still lead to a significant performance improvement.

Our two early exit conditions are convergence (a full collision, which will cause all the rays to repeatedly collide and move up to the highest resolution mip level) and divergence (going past the object into the "black"). We're already keeping track of the max distance traveled, and we have the minFarDist from the camera cuboid union check earlier, so if our max distance is past that, we can just discard the pixel and stop.

Our second exit condition, convergence, happens when all the mip levels for every camera reach zero. At that point they have all collided and have the maximum accuracy for UV and disparity values.
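Both checks are cheap. As a sketch in Python (with hypothetical names, and `min_level` standing in for the lowest mip level the march is allowed to reach):

```python
def early_exit(max_dist, min_far_dist, mip_levels, min_level=0):
    """Return "diverged", "converged", or None if the march should continue.

    max_dist:     furthest distance any camera's ray has traveled.
    min_far_dist: exit distance from the camera bounds check.
    mip_levels:   current mip level of each camera's march.
    """
    if max_dist > min_far_dist:
        return "diverged"    # past the bounds: discard the pixel
    if all(level <= min_level for level in mip_levels):
        return "converged"   # every camera is at full resolution
    return None
```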

This early exit doubles the framerate with a negligible effect on quality.

Here's a debug shader showcasing the early exits (only white pixels run the full iterations)

Part 8: Additional improvements

You may have noticed that our disparity system creates a bit of visual noise when the same sections of the source cameras have different lighting conditions due to specular / reflections:

This is actually pretty simple to fix. Instead of choosing the UV and index of the smallest disparity, calculate the min disparity and sample every color, adding each color multiplied by the disparity minus the minimum disparity, run in a smoothstep function with the min being a property you can add (blending sharpness) and the max being zero. Keep track of the "total color" (add the value being multiplied by the color to a "totalColorAmount" float and divide the summed color by that amount after the loop is over, for a weighted average.) This will allow you to reduce the noise:
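The weighted average described above can be sketched in Python as follows. The `smoothstep` here follows the usual shader definition (clamped Hermite interpolation), and `blend_sharpness` is the property mentioned in the text; the rest of the names are placeholders, and the sketch assumes `blend_sharpness` is greater than zero.

```python
def smoothstep(edge0, edge1, x):
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)

def blend_colors(colors, disparities, blend_sharpness):
    """Weighted average of per-camera colors, favoring low disparity.

    A camera at the minimum disparity gets weight 1; the weight falls to 0
    once its disparity exceeds the minimum by blend_sharpness.
    """
    min_disparity = min(disparities)
    total_color = [0.0, 0.0, 0.0]
    total_color_amount = 0.0
    for color, disparity in zip(colors, disparities):
        weight = smoothstep(blend_sharpness, 0.0, disparity - min_disparity)
        total_color = [acc + c * weight for acc, c in zip(total_color, color)]
        total_color_amount += weight
    return tuple(c / total_color_amount for c in total_color)
```

The camera with the minimum disparity always contributes a weight of 1, so the final division never divides by zero.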

But in fact you can do one better. You have a value ranging from -1 to 1 that tracks how aligned you are with each camera's forward vector: the z value of the view direction in each camera space. Add a shininess property to the shader, raise that property to the power of that z value, and factor the result into the value being multiplied by the color for each camera. This will weight the average towards the view from the camera you are most aligned with.

The results are fantastic:

Our final improvement involves how the shader deals with missing information. There is missing information inside of the shoe, which the existing disparity system simply replaces with whatever is closest:

What we'd like to happen is 1. not have the outside of the shoe "bleed in" to the inside, and 2. have the insides of the shoe "blend" between the two pixels that are being stretched between.

To achieve this, once we reach the end of the iterations, we can actually step back one step using our pixel stepping algorithm and get the "previous" depth and UV value. This will tell us two things:

1. if the two depth values, at both locations, are close together, but the disparity is high, we have a "bleeding through the edges" situation and we don't want to use that camera's data.

2. If the two depth values are far apart and the actual step position is in between them, you have a min, max, and t value for an inverse lerp. When you're doing your color samples and your blending at the end you can use the result of that inverse lerp to generate a gradient from one result to the other.

In fact, you can do both at once: instead of disparity being the z value minus the depth, have it be the inverse lerp function itself:

(the max and the dispBlendDist are there to prevent artifacts that show up when prevDepth - depth becomes a very low value; just set dispBlendDist to something like 0.05 or make it a property in your shader).
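Since the shader snippet is not reproduced here, the following Python sketch is a reconstruction from the description only: an inverse lerp of the ray's z between the current and previous depth samples, with the denominator clamped by `disp_blend_dist`. The exact form and sign convention in the shader may differ.

```python
def inverse_lerp_disparity(ray_z, depth, prev_depth, disp_blend_dist=0.05):
    """Disparity expressed as how far ray_z sits between depth and prev_depth.

    The max() keeps the denominator from collapsing toward zero when the two
    depth samples are nearly equal.
    """
    return (ray_z - depth) / max(prev_depth - depth, disp_blend_dist)
```

When the two depth samples are close together but the ray is far from them, the clamp yields a large but bounded value, which lets the same number act both as a rejection signal and as a blend factor.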

You can use this both the way you were using the disparity previously, and as your blend value.

Lots of areas with missing information will have an improved appearance:

Part 9: LODs

If your object is far away, you want to trade some quality for better performance. There are two ways we can do this: we can lower the maximum number of steps, and we can raise the minimum LOD level used (both the lowest we'll drop to when stepping through the shader and the one that makes us early exit when all cameras have dropped to it). Both were pretty trivial to implement the way you typically would for levels of detail, factoring in camera FOV, object size, and distance. However, the LODs came with quality problems that took a bit of work to fix:

Problem one: lowered step numbers lead to black outlines

This is a problem with the way we're storing our reference images. The areas surrounding the actual shoe in the color textures are black; when all the steps are exhausted and we don't reach the highest resolution mip map, sometimes those black areas end up getting sampled.

We can fix this by running some code on the color textures to "flood" the pixels around our shoe:

This way, when exhausted steps cause our shoe to pull from around the edges of our color texture, we'll get more "normal" results.
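One possible way to do this flooding, sketched in Python, is to repeatedly grow the covered region by one texel, giving each newly covered texel the average color of its already covered neighbors. This is just one dilation approach and not necessarily the one I used; `mask` is a hypothetical per-texel flag marking where the object was actually rendered.

```python
def flood_empty_pixels(colors, mask, passes):
    """Grow the colored region outward by one texel per pass.

    colors: 2D list of RGB tuples.
    mask:   2D list of bools, True where the object covers the texel.
    """
    height, width = len(colors), len(colors[0])
    colors = [row[:] for row in colors]
    mask = [row[:] for row in mask]
    for _ in range(passes):
        new_colors = [row[:] for row in colors]
        new_mask = [row[:] for row in mask]
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    continue
                neighbors = [colors[ny][nx]
                             for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                             if 0 <= ny < height and 0 <= nx < width and mask[ny][nx]]
                if neighbors:
                    new_colors[y][x] = tuple(sum(c) / len(neighbors)
                                             for c in zip(*neighbors))
                    new_mask[y][x] = True
        colors, mask = new_colors, new_mask
    return colors
```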

Problem two: in "stretched" sections, a minimum LOD level higher than 0 will produce lines due to the difference in resolution between the depth texture and color texture (some lines are sampling the white at the top while others are sampling the black on the bottom of the inside of the shoe).

The way to prevent this is to create custom LODs for the color texture as well: make the custom LODs for the color texture take the same pixels that the depth texture LODs take, and then sample that LOD:
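My reading of "take the same pixels" is that each color mip texel copies the color of whichever child texel supplied the max depth. The Python sketch below builds one such matched mip level under that assumption; the function name is hypothetical.

```python
def build_matched_mip(depth, color):
    """Build the next mip level of depth (max) and color (same source texel).

    depth, color: square 2D lists of equal, even side length.
    Returns (depth_mip, color_mip), each half the input resolution.
    """
    half = len(depth) // 2
    depth_mip, color_mip = [], []
    for y in range(half):
        depth_row, color_row = [], []
        for x in range(half):
            block = [(2 * y, 2 * x), (2 * y, 2 * x + 1),
                     (2 * y + 1, 2 * x), (2 * y + 1, 2 * x + 1)]
            # Pick the child texel that holds the max depth...
            sy, sx = max(block, key=lambda p: depth[p[0]][p[1]])
            depth_row.append(depth[sy][sx])
            # ...and take the color from that same texel.
            color_row.append(color[sy][sx])
        depth_mip.append(depth_row)
        color_mip.append(color_row)
    return depth_mip, color_mip
```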

Here's a low LOD close up, showcasing the clean reduction in detail with no artifacts:

Part 10: Conclusions, pros and cons, and things to work on

This was a very interesting experiment and the results turned out better than I could have hoped, but I still do not think this shader is quite ready for use in a real world scenario. Hopefully the pros and cons will explain why.

Pros:

Extremely high level of visual quality, with no holes and few flaws in our render, given the provided data.
Increasing the number of steps does not significantly decrease performance for most moderately complex objects.
Very high poly geometry recreated in high detail with an extremely low amount of polygons used. This makes the shader work surprisingly well with things like GPU Instancing.

GPU instancing loop contains quick, simple algorithm to render front to back.

Scales well with less complex objects and fewer camera angles; a simpler, more matte object with simpler geometry might get away with 4 cameras for even better performance, and simpler geometry makes the object less likely to reach the worst case scenario for steps. Simpler geometry does not mean fewer polygons; it simply means fewer concavities and details for the raymarcher to slow down around, or that would require more cameras to properly capture.

Cons:

Scales very poorly with decreased quality. Lowering LOD levels, texture sizes, and number of steps does not lead to a linear increase in performance, creating a problem with trying to use this shader on lower end devices, or in situations where you need to save on performance. Because of this, this shader may not be ready for real world use case scenarios.
The performance is adversely affected by very complex objects or when viewed from some angles; worst case scenarios end up with fewer steps skipped which can lead to framerate drops (although in isolation, no angle on the shoe model dropped below 100 FPS on the final build in fullscreen)
Certain concave features, like the inside of the shoe, are impossible to accurately capture with the current system.
Objects with both high amounts of geometric complexity and complex 3D information, like trees with lots of leaves, do not work well with this algorithm, in terms of both performance and quality. This is a fundamental problem: you simply cannot capture enough information about the state of such a complex object with static camera angles to accurately reproduce it. I don't believe this problem is solvable with the current data model, unless an expensive technique like a neural network is used to fill in the missing information. However, there may be ways to mitigate the quality loss and generate a better looking approximation, which would be worth looking into in the future since trees are a prime candidate for imposters.
Shader is unlit and recreates the lighting situations the original object was in. No system currently exists for physically based lighting (yet). Missing information due to camera views being obscured leads to "shadows" on specular (you may have noticed that from earlier images) but this will not happen with matte objects.

Potential improvements for the future:

I have improved the balance between performance and quality many times throughout the course of working on this, but I would like to improve it even further if possible, especially in regards to making it more scalable. If I continued I would experiment further with trying to find a way to make the performance increase linearly with the reduction of quality so that scalability will improve.
Some way to use reflection probes / cubemap cameras in order to properly capture concavities such as the inside of the shoe. Currently this would take too much time to implement and I'm afraid the implementation would be too Unity specific (I have attempted to use HLSL and handwritten shaders as much as possible so the techniques will be portable to other engines with a relatively small amount of work). I'm also concerned about the potential effect on performance.
Capturing the world normals and UVs of the object in the source images instead of the color information so they could be used for a full quality recreation with completely accurate specular and reflections. This is what I would be most likely to work on next if I had time, but I need to move on to the job search now that I feel like I've done a satisfactory amount of work on this project.
A modified algorithm specific to rendering trees and other objects with huge amounts of geometric complexity. A few things would need to be accounted for, but generally the fact that with such a complex object the majority of pixels will not early exit, would require the algorithm to be better optimized for worst case scenarios, as well as changing how missing data is handled to better represent objects that are less "solid".

Update: More work done on February 12/13

I had a lot of free time over the weekend so I began the work of adding a PBR variation of my shader.

Current features include:

Accurate replication of the world normals of the original object
Pixel world space position modification for accurate shadow sampling and receiving point lights
Shadow caster pass that is semi-accurate (no depth offset yet)
All of the above works both with a perspective and an orthographic camera

Features for the future (if I have more time):

Sampling baked UV textures and using those UVs to sample from the original object's textures (smoothness, metallic, albedo, etc.) for a full PBR recreation of the source object
Depth offsets for completely accurate 3D intersections and self-shadowing

This resource was used to make a handwritten shader for Unity's Universal Rendering Pipeline. It's a very helpful resource so I'm happy to share it.

https://gist.github.com/phi-lira/225cd7c5e8545be602dca4eb5ed111ba