In our definition, live streaming is a feature of an application that lets you transmit video over the network while you are still shooting it, without having to wait until recording is complete. In this post, we are going to discuss how we tackled this problem while developing the Together video camera app for iOS.
At first glance, supporting live streaming is not a big deal. There are lots of well-known applications and services across a variety of platforms that use this feature; for some of them, e.g. Skype, Chatroulette, Livestream, FaceTime, and many others, it is the main feature. However, the standard iOS development tools and libraries do not provide direct access to the hardware video encoder, which rules out an optimal implementation of this feature.
From the implementation viewpoint, live streaming can be broken down into the following sub-tasks:
- Get the stream data while shooting.
- Parse the stream data.
- Convert the stream into the format supported by the server.
- Deliver the data to the server.
A live-streaming app has to meet the following basic requirements:
- Ensure minimal delay between shooting the video and displaying it to the viewer.
- Minimize the amount of data sent over the network while maintaining acceptable picture and sound quality.
- Enable optimal utilization of CPU, memory, and storage capacity of the shooting device.
- Minimize the battery drain.
Depending on the purpose of live streaming in the app, certain requirements may dominate. For instance, minimizing battery drain conflicts with minimizing delay: sending large chunks of data over the network can be more energy efficient than keeping a connection open and constantly exchanging data while shooting. Our Together app is not focused on getting real-time feedback from viewers, but at the same time we want to let you share the videos you shoot as soon as possible. Minimizing battery drain has therefore become our top priority.
The iOS SDK includes a fairly rich set of features for interacting with the camera and handling video and audio data. In our previous posts, we have already told you about the CoreMedia and AVFoundation frameworks. Classes such as AVCaptureAudioDataOutput and AVCaptureVideoDataOutput, combined with AVCaptureSession, can retrieve the picture from the camera frame by frame as uncompressed video and audio buffers (CMSampleBuffer). The SDK ships with sample apps (AVCamDemo, RosyWriter) illustrating how to work with these classes. Before transmitting buffers obtained from the camera to the server, we have to compress them with a codec. Video encoding at a decent compression ratio usually takes substantial CPU time and, accordingly, drains the battery. iOS devices have dedicated hardware for fast video compression with the H.264 codec, but the SDK exposes it through only two classes, which differ in ease of use and flexibility:
- AVCaptureMovieFileOutput outputs the camera image directly to a MOV or MP4 file.
- AVAssetWriter also saves video to a file, but in addition it can use frames supplied by the developer as the source picture; such frames can either be obtained from the camera as CMSampleBuffer objects or generated programmatically.
When sending video over a network, the ideal option would be to skip saving data to disk and receive the compressed data directly in the application's memory. Unfortunately, the SDK does not provide for this, so the only option left is to read the compressed data back from the file the system writes to.
The easiest way to read from a file that is still being appended to is to run a loop that keeps reading newly appended data, ignoring the end-of-file condition, until writing to the output file is complete.
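Here is a minimal sketch of such a read loop (in Python for brevity; the real implementation reads the MOV file that AVAssetWriter appends to, and all names here are illustrative, not an actual SDK API):

```python
import os
import time

def read_growing_file(path, is_finished, on_data, poll_interval=0.1):
    """Keep reading data appended to `path`, ignoring end-of-file,
    until `is_finished()` reports that the writer is done."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(65536)
            if chunk:
                on_data(chunk)                # hand the new bytes to the next stage
            elif is_finished():
                break                         # writer finished and the tail is drained
            else:
                time.sleep(poll_interval)     # no new data yet: poll again shortly

# Simulate a finished recording; in the app, a writer appends concurrently.
with open("out.mov", "wb") as w:
    w.write(b"mdat-bytes")
chunks = []
read_growing_file("out.mov", lambda: True, chunks.append)
os.remove("out.mov")
# chunks now holds everything appended to the file
```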
Such an approach is not very efficient in terms of resource utilization: the application does not know when new data has been appended to the file, and has to access the file constantly, whether or not new data has appeared. This downside can be overcome with a GCD dispatch source (dispatch_source), which can notify the application asynchronously when the file grows.
The resulting data arrives in the same format iOS saves it in, i.e. the MOV or MP4 container format.
Most players cannot play back a video stream as it arrives from an incomplete MOV file. Moreover, to optimize transmission to the server and data storage, the video stream should be converted into another format, or at least parsed into individual packets. Transcoding can be delegated either to the server side or to the client side. In the Together camera we chose the latter: transcoding does not take much computing power, yet it simplifies and offloads the server architecture. The application immediately transcodes the stream into an MPEG TS container and cuts it into 8-second segments, which are easily transmitted to the server in a plain HTTP POST request with a multipart/form-data body. Such segments can be used immediately, without further processing, to build a broadcast playlist via HTTP Live Streaming.
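The upload step can be sketched as follows: a small Python illustration of assembling a multipart/form-data body for one MPEG TS segment. The field, file, and boundary names are invented for the example; they are not the values the app actually uses:

```python
def multipart_body(field_name, file_name, payload, boundary="SegmentBoundary"):
    """Assemble a multipart/form-data request body carrying one segment."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{file_name}"\r\n'
        f"Content-Type: video/MP2T\r\n\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + payload + tail

# An MPEG TS stream is a sequence of 188-byte packets, each starting with
# the sync byte 0x47; four dummy packets stand in for an 8-second segment.
segment = (b"\x47" + b"\x00" * 187) * 4
body = multipart_body("segment", "segment-000.ts", segment)
```

The server only needs to store each segment as it arrives and reference it from the playlist, which is what keeps the server side simple.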
The atom structure of an ordinary MOV file makes the format unsuitable for streaming: to decode any part of a MOV file, the complete file must be available, because decoding-critical metadata is stored at the end of the file. As a workaround, an MP4 format extension was proposed that records a movie as a sequence of fragments, each containing its own block of video stream metadata. For AVAssetWriter to write a fragmented MOV, it is sufficient to set the movieFragmentInterval property to the desired fragment duration.
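To see why fragmentation helps, it is enough to look at the top-level atom layout. Below is a small Python sketch that walks the atoms ("boxes") of a synthetic fragmented file; in a fragmented movie, each metadata atom (moof) sits next to its media data (mdat), so a reader never has to wait for metadata at the end of the file:

```python
import struct

def top_level_atoms(data):
    """List the types of the top-level atoms in a MOV/MP4 byte stream;
    each atom starts with a 4-byte big-endian size and a 4-byte type."""
    atoms, pos = [], 0
    while pos + 8 <= len(data):
        size, kind = struct.unpack_from(">I4s", data, pos)
        atoms.append(kind.decode("ascii"))
        pos += size
    return atoms

def atom(kind, payload=b""):
    # Minimal atom: size field covers the 8-byte header plus the payload.
    return struct.pack(">I4s", 8 + len(payload), kind) + payload

fragmented = (atom(b"ftyp") + atom(b"moov") +
              atom(b"moof") + atom(b"mdat", b"xx") +
              atom(b"moof") + atom(b"mdat", b"yy"))
print(top_level_atoms(fragmented))
# → ['ftyp', 'moov', 'moof', 'mdat', 'moof', 'mdat']
```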
For video stream transcoding we used the most obvious solution: the open-source FFmpeg libraries. With FFmpeg we managed to parse the fragmented MOV file and repackage its packets into an MPEG TS container. FFmpeg can fairly easily parse a file in almost any format, convert it to another format, and even transmit it over the network via any of the supported protocols. For example, for live streaming you could use FLV as the output format and RTMP or RTP as the protocol.
However, FFmpeg does not support the model we had to implement in Together out of the box. So that reading from the file being recorded would not stop at the end of the file, we wrote a protocol module for FFmpeg called pipelike. To slice segments for HTTP Live Streaming, we took one of the earliest versions of libav's hlsenc module as a basis and reworked it, fixing bugs and adding the ability to pass the output data and the module's events directly to other parts of the application through callbacks.
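The callback-driven segmenter can be illustrated with a simplified model: group timestamped packets into fixed-duration segments and hand each finished segment to a callback. This is a sketch of the idea only; the names and types are made up, and none of this is the actual FFmpeg code:

```python
def segment_stream(packets, segment_duration, on_segment):
    """Group (timestamp, payload) packets into fixed-duration segments and
    pass each finished segment to `on_segment`."""
    current, start = [], None
    for ts, payload in packets:
        if start is None:
            start = ts
        if ts - start >= segment_duration and current:
            on_segment(current)            # a segment is complete: emit it
            current, start = [], ts
        current.append(payload)
    if current:
        on_segment(current)                # flush the final, possibly short segment

# 40 packets at 0.5-second intervals, sliced into 8-second segments:
segments = []
packets = [(i * 0.5, f"pkt{i}") for i in range(40)]
segment_stream(packets, 8.0, segments.append)
print([len(s) for s in segments])
# → [16, 16, 8]
```

Delivering segments through a callback, rather than having the muxer write files itself, is what lets the rest of the application start an HTTP upload the moment a segment is complete.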
The resulting solution has the following advantages and disadvantages:
- Optimal battery consumption, on par with Apple's own applications. This is achieved by using the platform's hardware encoder.
- Only the standard iOS SDK is used, with no private APIs, which makes the solution fully App Store compliant.
- Simplified server infrastructure, since the transcoding into HTTP Live Streaming video segments happens on the client side.
- Delay from shooting to playback is about 90 seconds.
- The video file is written to a disk, meaning that:
- A streamed video cannot exceed free space available on the flash drive.
- During streaming, the flash memory wears out a little with every write.
- Disk space that could be used for other purposes is taken up.
Here is what contributes to the delay:
- Transfer a frame from the camera to the application: milliseconds.
- Write the buffer to the file: milliseconds.
- Read the buffer from the file (including the delay caused by OS-level file write buffering and the video encoding algorithm used): 1-3 seconds.
- Fragment read completion: up to 2 seconds (fragment duration).
- Segment generation completion: up to 8 seconds (segment length).
- Send a segment to the server: 2-10 seconds.
- Generate a playlist: 2 seconds.
- Client’s delay before requesting a new playlist: 9 seconds (Target Duration value set in the playlist).
- Load and buffer segments on the client: 27 seconds (three Target Durations).
In total, about 60 seconds. The remaining delay in Together's live streaming is caused by implementation specifics:
- The next segment is read only after the previous one has been fully sent to the server.
- The time elapsed between the start of recording and the moment the user launches streaming (the file is always streamed from the beginning).
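Much of the delay budget above is dictated by the playlist mechanics of HTTP Live Streaming. Here is a minimal Python sketch of the kind of sliding playlist the server publishes, assuming our 8-second segments and the Target Duration of 9 mentioned above; the function and file names are illustrative:

```python
def live_playlist(media_sequence, segment_names, target_duration=9, segment_duration=8):
    """Render a sliding-window HTTP Live Streaming playlist; the client
    re-requests this file roughly once per Target Duration."""
    lines = ["#EXTM3U",
             "#EXT-X-VERSION:3",
             f"#EXT-X-TARGETDURATION:{target_duration}",
             f"#EXT-X-MEDIA-SEQUENCE:{media_sequence}"]
    for name in segment_names:
        lines.append(f"#EXTINF:{segment_duration:.1f},")
        lines.append(name)
    return "\n".join(lines) + "\n"

playlist = live_playlist(0, ["segment-000.ts", "segment-001.ts", "segment-002.ts"])
print(playlist)
```

Because the client polls this playlist and typically buffers several Target Durations before starting playback, segment length directly drives the end-to-end delay.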
Most of the factors affecting the delay stem from the HTTP Live Streaming protocol. Most live streaming applications use protocols such as RTMP or RTP, which are better tailored to streaming with minimal delay. We also looked into a few other applications with live streaming, such as Skype and Ustream. Skype has a minimal delay of about one second while utilizing 50% of the CPU, which leads us to conclude that it uses a proprietary protocol and its own algorithm to compress video data. Ustream uses RTMP, has a delay of 10 seconds or less, and shows minimal CPU utilization, just as you would expect with hardware video encoding.
All in all, we ended up with an acceptable live streaming implementation that satisfies the requirements of the Together Video Camera.
Here are some other developments that use a similar approach to live streaming:
- Livu is an application that can stream video from an iOS device to an RTP server. As it is not a full-scale streaming service, you will have to provide your own video server. On GitHub you can find a fairly old version of the app's streaming component. Livu's developer, Steve McFarlin, frequently answers questions about video app development on Stack Overflow.
- Hardware Video Encoding on iPhone – RTSP Server example: here you can find the source code of an application implementing live streaming from iOS.