OpenMediaLib User and Development Guide
- OpenMediaLib User Development Guide
- Introduction
- High Level Use
- Reverse Polish Notation
- Applying RPN to Video/Audio
- Clip Modifications
- Compositing
- Playlists
- Stack Manipulations
- Advanced Stack Usage
- Aspect Ratio Considerations
- The Encoding Filter Graph
- Compositing Revisited
- Really, Really Advanced Stack Usage
- General Audio Issues
- Python
- Interpolation
- Threading
Aspect Ratio Considerations
Throughout the document so far, we have used the by now familiar:
colour: <input> filter:composite slot=1
stack entries, but other than stating that it's an example of a binary operator, no explanation has been offered as to what it really does, or what you can do with it.
This is possibly one of the most critical structures in the field of video editing and encoding. To truly understand what it's all about, we need to dig into the dirty laundry of the video world...
The problem is that the samples of a video image typically don't have the same shape as a pixel on a computer monitor, which is typically square. Most computer images also have square pixels – for example, if you take a JPEG from a digital camera and its size is 640x480, then to display it fully you need precisely 640x480 pixels of screen real estate.
In the world of video, the rules change – typically, pixels are rectangular. The dimensions of that rectangle need to be applied to each 'sample' in order for them to be displayed correctly. This additional information is known as 'sample aspect ratio' or 'sar'. In OML, sar consists of a pair of numbers and is annotated as num:den (or numerator and denominator). These numbers dictate the ratio of width to height of each sample.
Hence, the JPEG from the digital camera has a sar of 1:1. Note that it also has an 'aspect ratio' of 4:3, being the ratio of width to height.
NB: To allow for non-square pixel monitors, there really should be a 'display aspect ratio' or 'dar' num:den pair. For the purposes of simplification, we will consider non-square dar's as a specific problem of the user of the frame, though if you follow the logic here, you'll see that the treatment is identical.
So, what values of sar's are used?
Unfortunately, that is entirely dependent on the video input. But for the purposes of this document, we will focus on some specific cases. These are the image dimensions and sar for DV flavours:
DV PAL 4:3 – 720x576 @ 59:54
DV PAL 16:9 – 720x576 @ 118:81
DV NTSC 4:3 – 720x480 @ 10:11
DV NTSC 16:9 – 720x480 @ 40:33
And yes, those are correct – 4:3 and 16:9 have precisely the same number of samples per image – the difference is purely in the way that they are captured by the camera and presented. In the wide screen case, the samples are simply stretched horizontally.
So how do we use this information?
Let's assume that our only requirement at this point is simply to display the image on a computer monitor – in order to do that, we need to convert the dimensions to a 1:1 image.
The following equation can be used:
realwidth = width * sar_num / sar_den
Note that we scale horizontally, so the height given is the height used.
This gives us some interesting results, and shows another flaw in the system. First of all, the PAL DV 4:3 figures give us:
realwidth = 720 * 59 / 54 = 786.6666666 = approx 787
This is at odds with what we might expect from the 4:3 description – the 4:3 there should denote the ratio between width and height, therefore, we would expect that the sar usage and the following would match:
realwidth = height * 4 / 3 = 576 * 4 / 3 = 768
The reasons behind this discrepancy are twofold – first up, the sar is actually an approximation – in this case it's closer to 1094:1000 which gives us:
realwidth = 720 * 1094 / 1000 = 787.68 = approx 788
We now have a 20 pixel discrepancy. Oddly, that is exactly correct, which brings us to the second point: a TV screen does not show the outer 10 pixels on each side, so only the central 768 are visible.
Similar rules and approximations apply to the other examples above.
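As a rough illustration, the realwidth equation can be applied to all four DV flavours listed earlier. This is a minimal Python sketch, not part of OML; the names DV_FORMATS and real_width are invented for the example, and rounding to the nearest whole pixel is an assumption.

```python
from fractions import Fraction

# Frame sizes and sample aspect ratios for the DV flavours listed above:
# name -> (width, height, sar_num, sar_den)
DV_FORMATS = {
    "DV PAL 4:3": (720, 576, 59, 54),
    "DV PAL 16:9": (720, 576, 118, 81),
    "DV NTSC 4:3": (720, 480, 10, 11),
    "DV NTSC 16:9": (720, 480, 40, 33),
}


def real_width(width, sar_num, sar_den):
    """Width in square pixels: width * sar_num / sar_den, rounded to nearest."""
    return round(Fraction(width * sar_num, sar_den))


for name, (w, h, num, den) in DV_FORMATS.items():
    # Only the width is rescaled; the height is used as given.
    print(f"{name}: {w}x{h} @ {num}:{den} -> {real_width(w, num, den)}x{h}")
```

For DV PAL 4:3 this reproduces the 787 figure computed above.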
NB: OML ignores the pixel discrepancy and ends up with an ever so slightly distorted image. It could be easily corrected if the sar's were more accurately provided, so the view taken is simply that – correction is made from the outside or, more bluntly, Garbage In, Garbage Out.
How does all of this relate to the original colour: <input> composite?
All frames in a video need to conform to a particular resolution and sample aspect ratio. For example, if you want to generate a PAL DV stream, all images must be scaled to 720x576 @ 59:54.
So, if we take our 640x480 1:1 JPEG, we obviously need to scale it – but if we just blindly scale to 720x576, it's going to be completely wrong. Ultimately, we want the 640x480 image which has a 4:3 width/height ratio to fill the central 768x576 pixels of the TV – right?
In our examples thus far, preserving aspect ratio is essentially what the composite filter does.
Providing the target resolution and sar is the role of the colour: input (or, in fact, of any other background used in its place).
The colour: input has properties which allow these settings to be specified – namely, width, height, sar_num and sar_den. These default to the PAL 720x576 59:54 settings. It also has properties to allow the specification of the colour itself – obviously, in this case, we want black which is also the default.
To calculate the dimensions of the foreground such that it fits vertically within the dimensions of the image, we apply the following:
w = ( bg_h * fg_w * fg_sar_num * bg_sar_den ) / ( fg_h * fg_sar_den * bg_sar_num )
h = bg_h
Hence in our 640x480 1:1 foreground on to a 720x576 59:54 background, we get:
w = ( 576 * 640 * 1 * 54 ) / ( 480 * 1 * 59 ) = 702.9152 = approx 703
h = 576
This mode of compositing is known as 'pillarbox'.
To calculate the dimensions of the foreground such that it fits horizontally within the dimensions of the image, we apply the following:
w = bg_w
h = ( bg_w * fg_h * fg_sar_den * bg_sar_num ) / ( fg_w * fg_sar_num * bg_sar_den )
Again, using our 640x480 1:1 foreground on to a 720x576 59:54 background, we get:
w = 720
h = ( 720 * 480 * 1 * 59 ) / ( 640 * 1 * 54 ) = 590
This compositing is known as 'letterbox'.
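The pillarbox and letterbox calculations can be sketched in Python as follows. This is an illustration of the arithmetic only, not OML's implementation; the function names and the (num, den) tuple convention are assumptions made for the example, as is rounding to the nearest pixel.

```python
def pillarbox(fg_w, fg_h, fg_sar, bg_w, bg_h, bg_sar):
    """Fit vertically: use the background height and derive the width."""
    fg_num, fg_den = fg_sar
    bg_num, bg_den = bg_sar
    w = (bg_h * fg_w * fg_num * bg_den) / (fg_h * fg_den * bg_num)
    return round(w), bg_h


def letterbox(fg_w, fg_h, fg_sar, bg_w, bg_h, bg_sar):
    """Fit horizontally: use the background width and derive the height."""
    fg_num, fg_den = fg_sar
    bg_num, bg_den = bg_sar
    h = (bg_w * fg_h * fg_den * bg_num) / (fg_w * fg_num * bg_den)
    return bg_w, round(h)


# The 640x480 1:1 foreground on a 720x576 59:54 background.
print(pillarbox(640, 480, (1, 1), 720, 576, (59, 54)))  # (703, 576)
print(letterbox(640, 480, (1, 1), 720, 576, (59, 54)))  # (720, 590)
```

Note that the letterbox height of 590 exceeds the 576 lines of the background, so for this particular foreground a letterbox would crop the image vertically.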
The default behaviour of the composite is to rescale the foreground and centre it on the background such that it is cropped neither vertically nor horizontally. This is done by calculating one of the above and checking that the computed value is less than or equal to the corresponding dimension of the background; if it's greater, the other computation is used instead. Finally, the image is centred and composited.
OML dubs this mode of compositing 'fill'.
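The selection logic for this mode might look like the Python sketch below. It is an assumption-laden illustration rather than OML's code: fill_geometry is a hypothetical name, the letterbox computation is tried first (the text only says "one of the above"), and the centring offsets use integer division.

```python
from fractions import Fraction


def fill_geometry(fg_w, fg_h, fg_sar, bg_w, bg_h, bg_sar):
    """Return (x, y, w, h) of the foreground placed uncropped on the background."""
    fg_num, fg_den = fg_sar
    bg_num, bg_den = bg_sar

    # Try the letterbox computation first: full width, derived height.
    w = Fraction(bg_w)
    h = Fraction(bg_w * fg_h * fg_den * bg_num, fg_w * fg_num * bg_den)

    if h > bg_h:
        # Too tall for the background, so use the pillarbox computation.
        h = Fraction(bg_h)
        w = Fraction(bg_h * fg_w * fg_num * bg_den, fg_h * fg_den * bg_num)

    w, h = round(w), round(h)

    # Centre the scaled foreground on the background.
    x = (bg_w - w) // 2
    y = (bg_h - h) // 2
    return x, y, w, h


# The 640x480 1:1 foreground on a 720x576 59:54 background.
print(fill_geometry(640, 480, (1, 1), 720, 576, (59, 54)))  # (8, 0, 703, 576)
```

In this example the letterbox height of 590 is greater than 576, so the pillarbox result is used and the image is offset 8 samples from the left edge.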
The 'mode' of compositing is also a property of the composite filter. It can take the values fill, letterbox, pillarbox, native and distort. The last two preserve the dimensions of the original (and hence crop or contain padding where necessary) and distort the foreground to fill the background, respectively. Any value other than those causes it to behave as a distort.
References:
http://www.bbc.co.uk/commissioning/tvbranding/picturesize.shtml
http://lipas.uwasa.fi/~f76998/video/conversion/
