Siyuan Wu · 3 min read

Recently, I have been developing a WebAssembly-based FFmpeg library, FrameFlow. It directly uses the low-level C APIs of FFmpeg's libav* libraries to give more power to the web browser. I want to share some experience of working with those C APIs.

There are two main ways to use FFmpeg: the command-line program or the C API. (The command-line program itself is built on the C API.) When you learn these APIs for the first time, it can be confusing why creating a single thing takes multiple steps. C only has functions, so why not just use one function to initialize something? Here is an example (C++) from encode.cpp.

auto codec = avcodec_find_encoder_by_name(info.codec_name.c_str());  // find the "class"
auto codec_ctx = avcodec_alloc_context3(codec);                       // allocate the "instance"
set_avcodec_context_from_streamInfo(info, codec_ctx);                 // pass the "constructor" parameters
auto ret = avcodec_open2(codec_ctx, codec, NULL);                     // "call the constructor"

This is the minimum required to create an encoder. Let me explain it line by line. First, avcodec_find_encoder_by_name finds the codec by its name. This AVCodec is like a class: you cannot change any value in it. It gives you some metadata about the codec (such as the libx264 codec) and also holds pointers to the functions that do the encoding, for example. Second, avcodec_alloc_context3 simply mallocs a memory block, with every field in the struct set to its default value. The result is called codec_ctx (codec context); the name is a convention in FFmpeg because its type is AVCodecContext. This is like using new to create an object (instance). The third line sets all values from info, which I defined earlier; it is my own helper function, so don't worry about it. This step is like passing parameters to the constructor of the class. The last line, avcodec_open2, initializes the object (instance), just like calling the constructor of the class.

So although FFmpeg is written in pure C, it actually uses an object-oriented style to organize the codebase. You can also see similar examples for the demuxer, muxer, and decoder in my project; for instance, a decoder follows the same pattern, as sketched below.
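
A minimal sketch of that pattern for a decoder, assuming a file has already been demuxed and an AVStream* is at hand. The helper name is mine, not from the project, and error checks are omitted:

extern "C" {
#include <libavcodec/avcodec.h>
#include <libavformat/avformat.h>
}

AVCodecContext* open_decoder(AVStream* stream) {
    // the "class": look the codec up by the id stored in the stream
    const AVCodec* codec = avcodec_find_decoder(stream->codecpar->codec_id);
    // "new": allocate a context filled with default values
    AVCodecContext* codec_ctx = avcodec_alloc_context3(codec);
    // "constructor parameters": copy the stream parameters into the context
    avcodec_parameters_to_context(codec_ctx, stream->codecpar);
    // the "constructor call": actually initialize the decoder
    avcodec_open2(codec_ctx, codec, NULL);
    return codec_ctx;
}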

Changes after init

Decoder: time_base

In my experience, some bugs seem weird at first glance and are annoying to track down. Once you understand the init process explained above, there is one key step to pay attention to: the last one, avcodec_open2. Because it runs the "constructor" and does the initialization, it may change fields that you set in the previous step. For example, when you call avcodec_open2, the specified codec algorithm performs its own initialization, and time_base is often changed to another value. That can be surprising: the time_base of any output frame follows the new value, not the one you set. So after calling avcodec_open2, you may need to read the current time_base back from codec_ctx before doing further work. By the way, you may wonder what time_base is. It probably deserves its own blog post; simply put, it is a time unit, like a second or a microsecond.
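
Here is a minimal sketch of that habit, assuming an already allocated context and codec (the requested time base is just an example):

extern "C" {
#include <libavcodec/avcodec.h>
}
#include <cstdio>

void open_and_check(AVCodecContext* codec_ctx, const AVCodec* codec) {
    codec_ctx->time_base = AVRational{1, 1000};  // what we ask for
    avcodec_open2(codec_ctx, codec, NULL);       // the codec may override it
    // from here on, trust codec_ctx->time_base, not the value set above
    printf("time_base after open: %d/%d\n",
           codec_ctx->time_base.num, codec_ctx->time_base.den);
}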

Encoder: format (pixel format / sample format)...

Here is another example. For an encoder, the pixel format (video) or sample format (audio) may be changed by the specified codec algorithm. After init, the encoder may only accept frames in that other pixel format. So before encoding, you need to rescale video frames to the required pixel format, or resample audio frames to the required sample format.
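
For video, that rescaling usually goes through libswscale. Below is a minimal sketch of the idea, assuming the encoder context is already opened; the helper name is mine and error checks are omitted:

extern "C" {
#include <libavcodec/avcodec.h>
#include <libavutil/frame.h>
#include <libswscale/swscale.h>
}

AVFrame* to_encoder_format(const AVCodecContext* enc_ctx, const AVFrame* src) {
    AVFrame* dst = av_frame_alloc();
    dst->format = enc_ctx->pix_fmt;   // the pixel format chosen during avcodec_open2
    dst->width  = enc_ctx->width;
    dst->height = enc_ctx->height;
    av_frame_get_buffer(dst, 0);      // allocate the destination planes

    SwsContext* sws = sws_getContext(
        src->width, src->height, (AVPixelFormat)src->format,
        dst->width, dst->height, (AVPixelFormat)dst->format,
        SWS_BILINEAR, NULL, NULL, NULL);
    sws_scale(sws, src->data, src->linesize, 0, src->height,
              dst->data, dst->linesize);
    sws_freeContext(sws);
    return dst;
}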

Conclusion

Overall, an object-oriented view makes those C APIs easier to understand. You can see all the C++ code in the FrameFlow-0.1.1 release.

Siyuan Wu · 7 min read

Background

Several years ago, I was developing a video editor. As a fan of front-end development, my first choice was to build it as a web page. However, several components I needed made me drop that idea. One of them was the video processing tool: I needed FFmpeg, which cannot run in the browser directly. So I had to use Electron.

That seemed feasible, but it exhausted me, since I relied heavily on FFmpeg. To use FFmpeg in Electron (really in Node.js), you have to start a child process through Node's API and send commands to it, which feels like remote-controlling something. The node-fluent-ffmpeg package made this easier, but under the hood it still uses the command-line interface.

FFmpeg's command-line interface looks very easy to use at first glance. However, in my use case I needed to trim several recorded audio files, concatenate them, and merge them with video. On the command line, that meant learning how to use filter_complex to build a complex graph, which cost a lot of time.

After all that, exporting video worked. Since my editor generated video frames from a canvas-like source, I used a ReadableStream to feed images into the FFmpeg process. But because the process was a black box, I could not optimize the speed any further.

During development, I also found that there is another way to use FFmpeg: calling the low-level C APIs from FFmpeg's libav* libraries. FFmpeg is actually a collection of several tools (the FFmpeg command-line program, FFprobe, FFplay), all built on the libav* libraries. These APIs are flexible enough when the command-line way is not satisfying, but the learning curve is steep: you need to learn how video processing fundamentally works.

Inspiration from FFmpeg.wasm

One day, I stumbled upon the fact that FFmpeg had been ported to WebAssembly, and it worked. I was excited about that open source project and hoped it could eventually let my project move to the web page. However, I found that it only allows processing after an entire video file has been loaded into memory, so my stream of input images did not fit.

Solution: Custom I/O

After a while, a discussion in an FFmpeg.wasm issue gave me a better solution: use WebAssembly to call the libav APIs directly and, in other words, reimplement input and output. That way, wasm-based FFmpeg can interact with any JavaScript data. This gives us enough flexibility and really fits the browser environment.
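
On the libav side, custom input boils down to an AVIOContext with a read callback, so the bytes can come from wherever JS puts them. Here is a minimal sketch of that idea (not FrameFlow's actual code; names are illustrative and error checks are omitted):

extern "C" {
#include <libavformat/avformat.h>
#include <libavutil/mem.h>
}
#include <cstring>

// Feed FFmpeg from a plain memory buffer; in FrameFlow the buffer would be
// filled from JavaScript.
struct MemoryReader { const uint8_t* data; size_t size; size_t pos; };

static int read_packet(void* opaque, uint8_t* buf, int buf_size) {
    auto* r = (MemoryReader*)opaque;
    size_t left = r->size - r->pos;
    if (left == 0) return AVERROR_EOF;                // tell the demuxer we are done
    int n = (int)(left < (size_t)buf_size ? left : (size_t)buf_size);
    memcpy(buf, r->data + r->pos, n);
    r->pos += n;
    return n;                                         // number of bytes delivered
}

AVFormatContext* open_from_memory(MemoryReader* reader) {
    const int buf_size = 32 * 1024;
    uint8_t* io_buf = (uint8_t*)av_malloc(buf_size);
    AVIOContext* avio = avio_alloc_context(io_buf, buf_size, 0 /* read only */,
                                           reader, read_packet, NULL, NULL);
    AVFormatContext* fmt_ctx = avformat_alloc_context();
    fmt_ctx->pb = avio;                               // plug in the custom I/O
    avformat_open_input(&fmt_ctx, NULL, NULL, NULL);  // demux through the callback
    return fmt_ctx;
}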

Another project, BeamCoder, gave me guidance on how to wrap those low-level APIs and expose them to JS. In my case, I use C++ to wrap the C API and use Emscripten embind to expose self-defined FFmpeg classes to the JS world, so I can build a video processing workflow in a JS worker.
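
For the embind part, the wrapping looks roughly like this minimal sketch; the Encoder class here is made up for illustration, not FrameFlow's real API:

#include <emscripten/bind.h>
#include <string>
extern "C" {
#include <libavcodec/avcodec.h>
}

// Thin C++ wrapper over the C structs (illustrative only, no error handling).
class Encoder {
public:
    explicit Encoder(const std::string& codec_name) {
        const AVCodec* codec = avcodec_find_encoder_by_name(codec_name.c_str());
        ctx_ = avcodec_alloc_context3(codec);
    }
    int width() const { return ctx_->width; }
private:
    AVCodecContext* ctx_ = nullptr;
};

EMSCRIPTEN_BINDINGS(frameflow_sketch) {
    emscripten::class_<Encoder>("Encoder")
        .constructor<std::string>()
        .function("width", &Encoder::width);
}

On the JS side, this shows up as something like new Module.Encoder('libx264'), which is what lets the processing workflow live in a JS worker.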

Inspired by TensorFlow / PyTorch

Initially, I just wanted to build something similar to BeamCoder. But maybe we can go further. I know from experience that learning FFmpeg's basic concepts and API is painful: container / codec, demux / decode, encode / mux, plus pts, dts, time_base, and so on. If this project abstracts those concepts away while keeping the same flexibility, others can avoid the same headache.

Then I figured out that we can build the library like a machine learning framework (TensorFlow, PyTorch). Each frame of a video, whether compressed (a packet) or not (a frame), can be viewed as a tensor, and the entire process runs through a graph of tensors: first build the graph, then for each iteration of processing, feed data (images/file/stream), execute the graph, and get the output (images/chunks).

Additional gain

So it is just a for...loop. We can let it loop until the end, break in the middle, or skip iterations with continue. Here is an example using the FrameFlow API.

for await (let chunk of target) {
    if (/* cond */)
        continue
    else if (/* cond */)
        break

    // do something regularly ...
}

API design logic

The goal of FrameFlow is to stay flexible and simple. It is designed to support all kinds of JavaScript data, all processed in a streaming way. A video, an audio track, or a sequence of images can be viewed as an array, where each element is an image (frame). Sometimes it is compressed, sometimes uncompressed, and sometimes multiple arrays (video track / audio track) are zipped together, like using Python's zip function.

Building a filter graph should also be done the JS way. Here is a comparison between the FFmpeg command-line style and the FrameFlow JS style.

CMD style

ffmpeg -i test.mp4 -filter_complex \
'[0:v] trim=start=5:end=10, setpts=PTS-STARTPTS [v0]; \
[0:a]atrim=start=5:end=10,asetpts=PTS-STARTPTS [a0]'\
-map [v0] -map [a0] output.mp4

FrameFlow style

let video = await fflow.source('./test.mp4')
await video.trim({start: 5, duration: 5}).exportTo('./output.mp4')

You don't need to understand why we need both trim and atrim, what setpts is, or what -map does... Internally, FrameFlow converts the JS style into the FFmpeg style to build a filter graph, roughly as sketched below.
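
Under the hood, that comes down to libavfilter parsing the same kind of textual description the CLI takes. A minimal sketch of the idea (not FrameFlow's actual code; buffer source/sink setup and error handling are left out):

extern "C" {
#include <libavfilter/avfilter.h>
}

// Parse a textual filter description, e.g. "trim=start=5:end=10,setpts=PTS-STARTPTS",
// into a graph (illustrative only).
AVFilterGraph* parse_filters(const char* desc) {
    AVFilterGraph* graph = avfilter_graph_alloc();
    AVFilterInOut* inputs = nullptr;
    AVFilterInOut* outputs = nullptr;
    avfilter_graph_parse2(graph, desc, &inputs, &outputs);
    // `inputs` / `outputs` now describe the graph's dangling pads; a real pipeline
    // links them to buffer / buffersink filters and then calls avfilter_graph_config().
    avfilter_inout_free(&inputs);
    avfilter_inout_free(&outputs);
    return graph;
}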

Problems of FrameFlow

After talking about the advantages of FrameFlow, let's talk about some critical problems that still exist in the project.

Speed

Speed is the top issue that decides how impactful the project will be: is it just a toy, a demo, or a killer app? Although WebAssembly is designed for near-native performance, in reality things are not that simple. The current version has no optimization yet, so it is roughly 10x slower than native FFmpeg. Let me explain why.

After some initial speed tests, I found that the bottlenecks are the encode and decode phases, especially encoding. The gap between FrameFlow and FFmpeg comes from three aspects:

  • WebAssembly is a little slower than native code, especially when there are many interactions between JS and Wasm; in that case, speed can drop to about half of native.
  • FFmpeg has multi-threading enabled; FrameFlow currently does not.
  • FFmpeg has SIMD optimizations for various CPU architectures; FrameFlow does not.

Solutions

Here are some solutions to the problems above.

  • Since FrameFlow directly manipulates FFmpeg's low-level API, there is no need to mount any Emscripten FS. Every interaction between JS and Wasm is under our control, even logging to stderr, so we can optimize it where needed.

  • Enable multi-threading, which requires SharedArrayBuffer and cross-origin isolation. Most cases are fine with that, except for a few use cases.

  • Write WebAssembly SIMD code. FFmpeg's SIMD optimizations are written in assembly, which cannot be ported to Wasm, because Emscripten only allows C intrinsics (see the sketch after this list). So rewriting all the optimization code would take a lot of time. Also, Safari currently doesn't fully support it; check the browser compatibility.

  • Additionally, use the WebCodecs API when the browser supports a specific codec. This gives direct access to native encode and decode power, which should mean near-native speed without the limitations above. But not all browsers support it; check the compatibility.
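
To give a taste of what "C intrinsics" means here, below is a tiny sketch using Emscripten's wasm_simd128.h (built with -msimd128). It is unrelated to FFmpeg's actual kernels and only shows the style of code that would have to be written:

#include <wasm_simd128.h>   // Emscripten's WebAssembly SIMD intrinsics
#include <cstdint>

// Add two int32 buffers four lanes at a time (illustrative only).
void add_i32(const int32_t* a, const int32_t* b, int32_t* out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        v128_t va = wasm_v128_load(a + i);
        v128_t vb = wasm_v128_load(b + i);
        wasm_v128_store(out + i, wasm_i32x4_add(va, vb));
    }
    for (; i < n; ++i)
        out[i] = a[i] + b[i];   // scalar tail
}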

Altogether, no single solution solves the issue perfectly, but together they will speed things up a lot, which should satisfy most of our daily use cases.

Package size

FrameFlow relies heavily on FFmpeg as its basic component. However, the FFmpeg library itself is huge from a web developer's perspective, so FrameFlow's current wasm build is about 22 MB. Why does this matter? On the web, code is normally downloaded and run on demand, but here the whole FFmpeg build, dozens of MB, must be downloaded before anything runs. If you build the library without any components (encoders/decoders, demuxers/muxers, and filters), its size shrinks to 1~2 MB, or even under 1 MB.

Solutions

  • Do we really need all those components? No; most of the time we only need a very small fraction of them. So why not download on demand, like streaming media? In the future, we can try Emscripten's split feature to lazily load each fragment on demand.

  • The current version has a loadWASM API, which can preload the wasm binary.

Conclusion

FrameFlow is designed to support any JavaScript data in a streaming way. It will do most of what FFmpeg can do, with a friendlier API than either the FFmpeg command line or the low-level C API. Your words will shape the future of FrameFlow.