OpenMediaLib User and Development Guide
- OpenMediaLib User and Development Guide
- Introduction
- High Level Use
- Reverse Polish Notation
- Applying RPN to Video/Audio
- Clip Modifications
- Compositing
- Playlists
- Stack Manipulations
- Advanced Stack Usage
- Aspect Ratio Considerations
- The Encoding Filter Graph
- Compositing Revisited
- Really, Really Advanced Stack Usage
- General Audio Issues
- Python
- Interpolation
- Threading
General Audio Issues
There is a reason why multimedia frameworks tend to focus on image manipulations and audio is a rather secondary concern – images are simply easier.
There's nothing particularly complex about audio either, but in combination, things start getting messy. Traditional frameworks tend to treat the two as parallel filter graphs and sync between the two is handled by the component which uses them.
OML neither forces nor blocks that arrangement. It takes a view inspired by DV: the image and the audio for that image can be grouped together and delivered in a single packet. This makes audio sync easier to achieve, especially when arbitrary seeking is used or when audio visualisation is required.
Regardless of whether two graphs are used or not, there must be some kind of sync resolution to ensure that audio is mixed and otherwise operated on in parallel with the video – and the mechanism provided is to quantise audio into frame sized chunks (and in the split case, it's not strictly necessary for the 'frame sized chunk' to be the same for both audio and video).
For now, this document will focus on the parallel delivery of images and their respective audio samples via the single frame mechanism.
What do we need to be aware of?
Typically, the complexities associated with this are encapsulated within the OML plugin implementations rather than exposed to whoever uses them, but there are some exceptions where a combined filter graph needs to be aware of the audio issues. There is a point where the complexity of the requirements will make splitting the audio processing a more attractive proposition, but typically, for playout and encoding, we don't need to split.
So, what are the issues?
Well, let's take a simple example first – let's assume that we want to generate a PAL video with 48 kHz stereo audio. This means we need 48,000 audio samples per channel per second. Since one second is 25 frames, that means we should expect each frame to contain 48000 / 25 samples per channel – this comes to 1920.
This particular setup doesn't present too much of an issue – given any input, it assumes that the delivery of samples is contiguous for the entire duration and that we have an exact number of samples to cover the duration of the input.
Now consider a movie framerate of 24 fps at 44.1 kHz – this gives us 44100 / 24, which means that each frame should deliver 1837.5 samples per channel. Obviously, that presents a problem – half a sample is nonsensical.
We can rectify the situation by saying that 1837.5 is the average number of samples in each frame of the input. Then it becomes clearer – half the frames have 1837 and the other half have 1838. And it should be obvious that we would like to distribute evenly, hence we define the number of frames required to spread the samples as a 'cycle' – the cycle here being 2.
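The average and the cycle can be worked out with exact rational arithmetic. The following is a minimal sketch only: the function names are hypothetical and are not part of OML. It assumes the cycle is simply the denominator of the reduced samples-per-frame fraction.

```python
from fractions import Fraction


def average_samples_per_frame(frequency, fps_num, fps_den=1):
    """Exact average number of samples per channel in one frame."""
    # frequency / (fps_num / fps_den), kept as a reduced fraction
    return Fraction(frequency * fps_den, fps_num)


def cycle_length(frequency, fps_num, fps_den=1):
    """Smallest number of frames that holds a whole number of samples."""
    return average_samples_per_frame(frequency, fps_num, fps_den).denominator


film = average_samples_per_frame(44100, 24)
print(film, cycle_length(44100, 24))  # 3675/2 2
print(cycle_length(48000, 25))        # 1
```

The PAL case reports a cycle of 1 because every frame carries the same 1920 samples.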
Determinism is also an important point which should not be ignored here – if we seek to an arbitrary even or odd frame offset, the samples in the generated frame should be identical to those that would have been received had the video been played in frame order. Hence, whilst it doesn't matter whether the odd or even offset carries the additional sample, it should always be consistent.
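One way to get this consistency is to derive the sample count purely from the frame index, so that seeking and sequential playback cannot disagree. The sketch below is an assumption about how this could be done, not a description of OML's implementation, and `samples_for_frame` is a hypothetical name. With this particular rounding the odd frames happen to carry the extra sample.

```python
import math
from fractions import Fraction


def samples_for_frame(index, frequency, fps_num, fps_den=1):
    """Samples per channel carried by the frame at a 0-based index."""
    per_frame = Fraction(frequency * fps_den, fps_num)
    # Whole samples delivered before this frame starts and after it ends
    start = math.floor(index * per_frame)
    end = math.floor((index + 1) * per_frame)
    return end - start


# Played in order, 24 fps at 44.1 kHz
print([samples_for_frame(i, 44100, 24) for i in range(4)])
# [1837, 1838, 1837, 1838]

# Seeking straight to an odd frame gives the same answer as playing up to it
print(samples_for_frame(1001, 44100, 24))  # 1838
```

Because `start` is also the offset of the frame's first sample in the audio stream, the same calculation tells a reader where to seek in the audio.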
Before we present the mechanisms which OML provides for this, we will consider a mind-blowing example which is far harder to visualise, but is a mainstay of the DV world from which all of this originates – NTSC at 48 kHz.
Keep in mind that we want to deliver 48000 samples per channel per second, and that one second in NTSC terms is 30000 / 1001 frames. That is approximately 29.97 frames per second, or rather an average of 29.97 frames per second: some seconds deliver 29 images to the display, and others deliver 30.
Computers don't deal with this kind of situation very well – a floating point value can only approximate 30000 / 1001, so the rate is best kept as the exact ratio of two integers.
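The difference between the exact ratio and a rounded value is easy to see. This is an illustrative sketch using Python's standard `fractions` module. The variable names are ours and have nothing to do with OML.

```python
from fractions import Fraction

exact_fps = Fraction(30000, 1001)  # the NTSC rate as a ratio of integers
rounded_fps = 29.97                # the commonly quoted approximation

# The exact rate is slightly higher than 29.97
print(float(exact_fps))

# Samples per frame at 48 kHz: exact with the ratio, slightly off with 29.97
print(48000 / exact_fps)    # 8008/5
print(48000 / rounded_fps)
```

The error from the rounded value is tiny per frame, but it accumulates over a long programme, which is why the ratio form is preferable when computing sample counts.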
However, it should be clear from the above that the mechanism to determine the number of audio samples per frame is the same – we start by dividing the frequency by the frame rate – 48000 / ( 30000 / 1001 ) in this case, and that gives us 1601.6 – again, that's an average, so a logical distribution of the audio samples means that we get 2 frames in 5 having 1601 samples and the remaining 3 having 1602. Thus, we have a cycle of 5 here.
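The split between short and long frames falls out of the same fraction: the numerator is the number of samples in one cycle, and its remainder modulo the cycle length is the number of frames that need the extra sample. A small sketch, assuming a hypothetical helper `cycle_layout` that is not an OML function:

```python
from fractions import Fraction


def cycle_layout(frequency, fps_num, fps_den=1):
    """Return (cycle, short_count, short_frames, long_frames).

    short_count is the smaller per-frame sample count. Long frames carry
    short_count + 1 samples.
    """
    per_frame = Fraction(frequency * fps_den, fps_num)
    cycle = per_frame.denominator
    total = per_frame.numerator      # samples per channel in one full cycle
    short_count = total // cycle
    long_frames = total % cycle      # frames that need one extra sample
    return cycle, short_count, cycle - long_frames, long_frames


print(cycle_layout(48000, 30000, 1001))  # (5, 1601, 2, 3)
```

The result reads as: a cycle of 5 frames, of which 2 carry 1601 samples and 3 carry 1602, for 8008 samples in total.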
Other cycles can be longer – another example from the DV world is NTSC at 32 kHz, which averages 32000 / ( 30000 / 1001 ), or roughly 1067.73, samples per frame. It has a cycle of 15 carrying 16016 samples in total, and specifically, we spread the samples as follows:
1068, 1068, 1067, 1068, 1068, 1068, 1067, 1068, 1068, 1068, 1067, 1068, 1068, 1068, 1067
The proof of this is left as an exercise for the reader.
This is generally known as 'Locked Audio'.
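A candidate cycle can be checked mechanically: it must carry exactly the number of samples that its frames cover, and each frame may hold only the rounded-down average or one more. The checker below is a hedged sketch with a hypothetical name, `is_locked_cycle`. For NTSC at 32 kHz the average is 16016 / 15, so a 15 frame cycle must sum to 16016 using only 1067 and 1068.

```python
from fractions import Fraction


def is_locked_cycle(counts, frequency, fps_num, fps_den=1):
    """True if counts is a whole cycle carrying exactly the right samples."""
    per_frame = Fraction(frequency * fps_den, fps_num)
    total = per_frame * len(counts)
    low = per_frame.numerator // per_frame.denominator
    return (
        total.denominator == 1                        # whole number of samples
        and sum(counts) == total                      # nothing lost or gained
        and all(c in (low, low + 1) for c in counts)  # evenly spread
    )


candidate = [1068, 1068, 1067, 1068, 1068, 1068, 1067, 1068,
             1068, 1068, 1067, 1068, 1068, 1068, 1067]
print(is_locked_cycle(candidate, 32000, 30000, 1001))  # True
```

Where the shorter frames sit within the cycle is a matter of convention. The check only confirms that the totals are right, which is the part that keeps audio and video locked together.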
