
Why FrameFlow

Siyuan Wu

Background

Several years ago, I was developing a video editor. As a fan of front-end development, my first choice was to build it as a web page. However, several components I needed forced me to drop the idea. One of them was a video processing tool: I needed FFmpeg, which cannot run in the browser directly. So I had to use Electron.

That seemed feasible, but it exhausted me, since I relied heavily on FFmpeg. To use FFmpeg in Electron (actually Node.js), we need to start a child process through the Node API and send commands to it. It feels like remote-controlling something. With the help of node-fluent-ffmpeg everything became easier, but under the hood it still uses the command-line interface.

The FFmpeg command-line interface looks easy to use at first glance. However, in my use case I needed to trim several recorded audio files, concatenate them, and merge the result with the video. On the command line, that meant learning how to use filter_complex to build a complex graph, which cost a lot of time.
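To give a feel for that use case, here is a minimal sketch of how such a command might be assembled in Node. The helper name buildConcatArgs, the file names, and the clip times are all made up for illustration; only the FFmpeg filters (atrim, asetpts, concat) and options are real. The resulting array would be what gets passed to something like child_process.spawn('ffmpeg', args).

```javascript
// Hypothetical helper (not part of any library): builds ffmpeg arguments that
// trim each audio clip, concatenate the trimmed clips, and mux them with the video.
function buildConcatArgs(videoPath, clips, outputPath) {
  const inputs = ['-i', videoPath];
  const chains = [];
  const labels = [];

  clips.forEach((clip, i) => {
    inputs.push('-i', clip.path);
    const label = `[a${i}]`;
    // Input 0 is the video, so audio clip i is input i + 1.
    chains.push(
      `[${i + 1}:a]atrim=start=${clip.start}:end=${clip.end},asetpts=PTS-STARTPTS${label}`
    );
    labels.push(label);
  });

  // Join all trimmed clips into one audio stream labeled [aout].
  chains.push(`${labels.join('')}concat=n=${clips.length}:v=0:a=1[aout]`);

  return [
    ...inputs,
    '-filter_complex', chains.join(';'),
    '-map', '0:v', '-map', '[aout]',
    outputPath,
  ];
}

// Example file names and times are assumptions for illustration only.
const args = buildConcatArgs('video.mp4', [
  { path: 'intro.wav', start: 0, end: 3 },
  { path: 'outro.wav', start: 1, end: 4 },
], 'output.mp4');
console.log(args.join(' '));
```

Even for two clips, the graph string already needs stream specifiers, labels, and timestamp resets, which is where most of the learning time went.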

Once that was done, exporting video worked. Since my editor generated video frames from a canvas-like surface, I used a ReadableStream to feed images into the FFmpeg process. But because the process was a black box, I could not optimize the speed any further.

During development, I also found another way to use FFmpeg: calling the low-level C API in the FFmpeg libav* folders. FFmpeg is actually a collection of several tools (the ffmpeg command-line program, ffprobe, and ffplay), all built on the libav* libraries. These APIs are flexible enough for cases where the command line falls short. But the learning curve is steep, because we need to learn how video processing fundamentally works.

Inspiration from FFmpeg.wasm

One day, I happened to learn that FFmpeg had been ported to WebAssembly, and it worked. I was excited about this open source project and hoped it could eventually let my project move to a web page. However, I found that it only allows processing after the entire video file has been loaded into memory, so it could not handle my stream of input images.

Solution: Custom I/O

After a while, a discussion in an FFmpeg.wasm issue gave me a better solution: use WebAssembly to call the libav APIs directly, in other words, reimplement input and output. That way, a wasm-based FFmpeg can interact with any JavaScript data. This gives us enough flexibility and really fits the browser environment.

Another project, BeamCoder, showed me how to wrap those low-level APIs and expose them to JS. In my case, I use C++ to wrap the C API and Emscripten embind to expose self-defined FFmpeg classes to the JS world, so that I can build a video processing workflow in a JS worker.

Inspired by TensorFlow / PyTorch

Initially, I just wanted to build something similar to BeamCoder. But maybe we can take it further. I know that learning FFmpeg's basic concepts and API is painful: container / codec, demux / decode, encode / mux, pts, dts, time_base, and so on. If this project abstracts those concepts away while keeping the same flexibility, others can avoid the same headache.

Then I realized that we can build a library that works like a machine learning framework (TensorFlow, PyTorch). Each frame of a video, whether compressed (a packet) or not (a frame), can be viewed as a tensor, and the entire process is a graph of tensors. First we build a graph. Then, when processing a video, each iteration feeds in data (images/file/stream), executes the graph, and gets the output (images/chunks).
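The "build a graph first, then iterate" idea can be sketched with plain async generators. This is a toy model only, not FrameFlow's implementation: the node names source, trim, and resetPts, and the frame shape { pts, data }, are assumptions made up for this example.

```javascript
// Toy model only: each "node" is an async generator that transforms a stream.
async function* source(frames) {
  for (const frame of frames) yield frame;
}

// Hypothetical node: keep frames whose timestamp lies in [start, end).
async function* trim(input, { start, end }) {
  for await (const frame of input) {
    if (frame.pts >= end) return; // nothing later can match, stop early
    if (frame.pts >= start) yield frame;
  }
}

// Hypothetical node: shift timestamps so the output starts at zero.
async function* resetPts(input) {
  let first = null;
  for await (const frame of input) {
    if (first === null) first = frame.pts;
    yield { ...frame, pts: frame.pts - first };
  }
}

async function run() {
  const frames = Array.from({ length: 10 }, (_, i) => ({ pts: i, data: `frame-${i}` }));
  // Build the graph first; no frame is processed until the loop pulls one.
  const target = resetPts(trim(source(frames), { start: 5, end: 8 }));
  const out = [];
  for await (const frame of target) out.push(frame);
  return out;
}

run().then((out) => console.log(out));
```

Because every node pulls from the one before it, each iteration moves exactly one piece of data through the whole graph, which is what makes streaming input possible.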

Additional gain

So it is just a for loop. We can keep looping until the end, break in the middle, or skip an iteration with continue. Here is an example using the FrameFlow API.

for await (let chunk of target) {
    if (/* cond */)
        continue
    else if (/* cond */)
        break

    // do something regularly ...
}
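One reason early exit is practical with this loop style: in JavaScript, breaking out of for await...of calls the iterator's return(), so an async generator gets a chance to clean up in a finally block. The sketch below shows this with a stand-in target. createTarget and the chunk shape are hypothetical and are not FrameFlow's API.

```javascript
// Hypothetical stand-in for a FrameFlow target: yields numbered chunks and
// records when it is cleaned up.
function createTarget(total, log) {
  return (async function* () {
    try {
      for (let index = 0; index < total; index++) yield { index };
    } finally {
      log.push('released'); // runs on normal completion and on early break
    }
  })();
}

async function consume() {
  const log = [];
  for await (const chunk of createTarget(10, log)) {
    if (chunk.index % 2 === 1) continue; // skip odd chunks
    if (chunk.index >= 6) break;         // stop early
    log.push(`chunk ${chunk.index}`);
  }
  return log;
}

consume().then((log) => console.log(log));
// logs [ 'chunk 0', 'chunk 2', 'chunk 4', 'released' ]
```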

API design logic

The goal of FrameFlow is to stay flexible and simple. It is designed to support all kinds of JavaScript data, all processed as streams. A video, an audio clip, or a sequence of images can be viewed as an array in which each element is an image (frame). Sometimes the elements are compressed, sometimes uncompressed, and sometimes multiple arrays (a video track and an audio track) are zipped together, as with Python's zip function.

Building a filter graph should also be done the JS way. Here is an example comparing the FFmpeg command-line style with the FrameFlow JS style.

CMD style

ffmpeg -i test.mp4 -filter_complex \
    '[0:v]trim=start=5:end=10,setpts=PTS-STARTPTS[v0];
     [0:a]atrim=start=5:end=10,asetpts=PTS-STARTPTS[a0]' \
    -map '[v0]' -map '[a0]' output.mp4

FrameFlow style

let video = await fflow.source('./test.mp4')
await video.trim({start: 5, duration: 5}).exportTo('./output.mp4')

You don't need to understand why we need both trim and atrim, what setpts is, or what -map does. Internally, FrameFlow converts the JS style to the FFmpeg style to build a filter graph.
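As a rough illustration of that conversion, here is a sketch that turns the trim options from the JS example into the filter graph string from the CMD example. It is not FrameFlow's actual internal code. The function name trimToFilterGraph and its return shape are assumptions for this example.

```javascript
// Hypothetical translation step, not FrameFlow's actual internals:
// turns { start, duration } into the filter_complex string and -map labels.
function trimToFilterGraph({ start, duration }) {
  const end = start + duration;
  // Video and audio need separate filters, each followed by a timestamp reset.
  const video = `[0:v]trim=start=${start}:end=${end},setpts=PTS-STARTPTS[v0]`;
  const audio = `[0:a]atrim=start=${start}:end=${end},asetpts=PTS-STARTPTS[a0]`;
  return { filterComplex: `${video};${audio}`, maps: ['[v0]', '[a0]'] };
}

const graph = trimToFilterGraph({ start: 5, duration: 5 });
console.log(graph.filterComplex);
```

The point is that one trim call on the JS side fans out into a per-track filter chain, so the user never has to write the labels and timestamp resets by hand.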

Problems of FrameFlow

Having covered the advantages of FrameFlow, let's talk about some critical problems that still exist in the project.

Speed

Speed is the top issue, because it decides how impactful the project will be: is it just a toy, a demo, or a killer app? WebAssembly is designed to run at near-native speed, but in reality things are not that simple. The current version has not been optimized at all, and it is roughly 10x slower than native FFmpeg. Let me explain why.

After some initial speed tests, I found that the bottlenecks are the encode and decode phases, especially encoding. The gap between FrameFlow and FFmpeg comes from three sources.

  • WebAssembly is a bit slower than native code, especially when there are many interactions between JS and Wasm. In that case it can drop to about half of native speed.
  • FFmpeg has multithreading enabled; FrameFlow currently does not.
  • FFmpeg has SIMD optimizations for various CPU architectures; FrameFlow does not.

Solutions

Here are some solutions for each of the problems above.

  • Since FrameFlow calls FFmpeg's low-level API directly, there is no need to mount any Emscripten FS. Every interaction between JS and Wasm is under our control, even logging to stderr, so we can optimize it where needed.

  • Enable multithreading, which requires SharedArrayBuffer and cross-origin isolation. Most use cases are fine with that requirement, with a few exceptions.

  • Write WebAssembly SIMD code. FFmpeg's SIMD code is written in assembly, which cannot be ported to wasm because Emscripten only accepts C intrinsics. Rewriting all of the optimized code would take a lot of time. Also, Safari does not fully support it yet, so check the browser compatibility.

  • Additionally, use the WebCodecs API when the browser supports the specific codec. This gives direct access to native encoding and decoding, which is expected to run at near-native speed without those limitations. But not all browsers support it, so check the compatibility.

Altogether, no single solution solves the issue perfectly, but together they will speed things up a lot. That should be enough to satisfy most everyday use cases.
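Since two of these options depend on the environment, a runtime check is a natural first step. The sketch below is an assumption about how such a check might look, not code from FrameFlow. detectCapabilities is a hypothetical name, while SharedArrayBuffer, crossOriginIsolated, VideoEncoder, and VideoDecoder are the standard browser globals. Cross-origin isolation itself is something the server opts into, typically with the response headers Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp.

```javascript
// Sketch: report which acceleration paths look available in a given global scope.
// The scope parameter exists only so the check can be tried with mock objects.
function detectCapabilities(scope = globalThis) {
  // Wasm threads need SharedArrayBuffer, which browsers only allow
  // on cross-origin isolated pages.
  const threads =
    typeof scope.SharedArrayBuffer === 'function' && scope.crossOriginIsolated === true;

  // WebCodecs exposes native encoders and decoders as global classes.
  const webCodecs =
    typeof scope.VideoEncoder === 'function' && typeof scope.VideoDecoder === 'function';

  return { threads, webCodecs };
}

console.log(detectCapabilities());
```

This only tells us that WebCodecs exists at all. Whether a specific codec configuration is supported would still need a per-codec check such as VideoEncoder.isConfigSupported(config), with a fallback to the wasm path when it is not.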

Package size

FrameFlow relies heavily on FFmpeg as its basic component. However, the FFmpeg library is huge from a web developer's perspective: the current FrameFlow wasm build is about 22 MB. Why does that matter? In web development, code is usually downloaded and run when needed, but the FFmpeg build must be downloaded in full, dozens of MB, before anything runs. If you build the library without any components (encoders/decoders, demuxers/muxers, and filters), its size drops to 1~2 MB, or even below 1 MB.

Solutions

  • So do we really need all those components? No. Most of the time we only need a very small fraction of them. So why not download on demand, like streaming media? In the future, we can try Emscripten's split feature to lazily load each fragment on demand.

  • The current version has a loadWASM API, which can preload the wasm binary.
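To show what preloading buys in general, here is a small sketch of a memoized loader built on the standard WebAssembly.compileStreaming and fetch APIs. It is not how loadWASM is implemented, and its signature is not shown here. createWasmPreloader and the URL in the usage comment are hypothetical.

```javascript
// Hypothetical preloader: starts the download and compile at most once and
// hands the same promise to every caller.
function createWasmPreloader(
  url,
  compile = (u) => WebAssembly.compileStreaming(fetch(u))
) {
  let pending = null;
  return function preload() {
    if (pending === null) {
      pending = compile(url).catch((err) => {
        pending = null; // allow a retry after a failed download
        throw err;
      });
    }
    return pending;
  };
}

// Usage sketch (the URL is an assumption for illustration):
// const preload = createWasmPreloader('/assets/ffmpeg_built.wasm');
// preload();                      // kick off early, e.g. on page load
// const wasmModule = await preload(); // later, reuses the same in-flight work
```

Calling preload early hides the download behind other page work, so the large binary is more likely to be ready by the time the user starts processing.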

Conclusion

FrameFlow is designed to support any JavaScript data as a stream, to do most of what FFmpeg can do, and to offer a friendlier API than either the FFmpeg command line or the low-level C API. Your feedback will shape the future of FrameFlow.