In this post, we will walk through the design of the video platform for our Together project. From a technical viewpoint, the project is notable for covering the entire content lifecycle, from creation on mobile devices to distribution and viewing. While designing the platform, we aimed for flexibility and cost-efficiency. The new video platform lets you receive, store and share videos, and all video management is built on Apple HLS.
Problem statement Design a video platform that enables online broadcasting. The platform must:
1) Record content from a variety of mobile devices (iOS/Android smartphones and tablets)
2) View content from a variety of devices (MultiScreening) – iOS / Android / PC.
An important requirement is publishing over an unreliable mobile connection, with broadcast fault recovery and broadcast pause. Just as important, in the event of a connection failure no footage is ever lost, and broadcasting fully resumes after recovery. The point is to ensure that "it just works", regardless of connection instability.
In other words, this is an ordinary video camera that can publish your recordings online, whatever the bandwidth or connection quality.
Probably the simplest and most extreme example of how this should work is filming a subway trip. The connection is continuously lost underground, but the goal is to deliver the whole recording to the viewer, from beginning to end, even with pauses. For convenience, let’s refer to the people who shoot video as social journalists, or just journalists.
Here’s how it looks from the viewer’s point of view. When the recording starts, the journalist publishes a link to the video broadcast on Twitter / Facebook / G+, etc. The viewer gets the link along with the ability to repost, retweet and/or join the viewing. We have spent a lot of time developing embed video players that make the video available directly on the social networks, so that the audience can join more easily. In our opinion, a live video broadcast shown directly in the Twitter client looks very impressive.
If the journalist loses the signal, the player shows the viewer an indicator or a message that "the broadcast has been interrupted, please wait, the journalist will soon be back…", along with an offline timer.
After the publisher returns with the offline content, the player offers options: the viewer can either continue watching from the interruption point or jump to the live stream.
We also need to stitch the broadcast into a single file so it can be uploaded to YouTube or Vimeo. This is really helpful when, right after a live conference broadcast, the video is immediately available as VOD. All this makes for quite a convenient model, both for the journalist and for the viewer. You can make one long broadcast of your whole journey; you can shoot over an unstable connection and lose nothing. Such a camera can be helpful in lots of situations.
Video platform So, when you design a video platform for this service, you need a clear picture of the problems it is meant to solve, for instance:
- Receiving a video stream: in our case, the HLS stream is generated by the mobile application
- Balancing the publishing clients between the receiving nodes
- Fault tolerance, i.e. the capability to continue broadcast from any other node in the event of a node failure
- Recording, i.e. saving the broadcast to the data storage
- Storage, i.e. fault-tolerant distributed storage of video content
- Generating and updating broadcast metadata, i.e. its status and duration
- Video delivery, i.e. streaming of broadcasts
- Balancing of video content viewers
- Caching of popular content.
Generally speaking, there are two entities: video broadcasting and video reception. To keep the video platform as simple, stable and easily scalable as possible, we transmit the video using Apple HLS. Mobile devices broadcast video in the Apple HLS format; the video platform receives the chunks and stores them; viewers watch the videos over Apple HLS (for the Flash player we have developed an HLS plugin). Everything fits together nicely: the same kind of chunks everywhere, no conversions, the simplicity of HTTP, and MPEG-DASH expected shortly 🙂
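For reference, a live HLS broadcast is nothing more than an m3u8 playlist pointing at short MPEG-TS chunks that the player fetches over plain HTTP (the segment names and durations below are illustrative, not our actual naming scheme):

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:120
#EXTINF:10.0,
chunk_120.ts
#EXTINF:10.0,
chunk_121.ts
#EXTINF:10.0,
chunk_122.ts
```

For a live stream the playlist has no `#EXT-X-ENDLIST` tag, so the player keeps re-requesting it to pick up new chunks.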
Tools Let’s take it for granted that we already have a Web portal offering an API to the client.
The client application uploads content over a certain period of time, and every incoming chunk has to be verified. This means all servers must be aware of all broadcasts currently being published. To achieve this, a server can either query the external environment whenever a new chunk arrives, or skip the check and delegate it to some other entity. Obviously, the latter is not a good idea. Either way, you need a service that is notified of incoming content and coordinates its transfer to the data storage.
So, for ease of implementation it is desirable that all data for a given broadcast is published to the same server. The client learns which server to start publishing to when the publishing clients are balanced.
We put the entire responsibility for informing the external system (the portal) of the broadcast status on the service that accepts chunks (we also call it the recorder, as it records video from the client). In our view this is logical: it allows the service to keep data only on the broadcasts currently being published to it, within a certain timeout. The service just has to push information about each broadcast update (i.e., which chunk has arrived).
In the event of a client failover, the new server only has to check the broadcast status once and inform everyone that it has taken over the publishing.
Reception and balancing of published content Many balancing methods and tools exist, but first of all we have to define the balancing criteria. For us, the main ones are:
- Current geographical location of the user
- Traffic processed by our servers in the region
- Guaranteed availability of servers at the current moment.
For regional balancing we could use ELB, but that would make us dependent on AWS: their traffic is immensely costly, and traffic through ELB is billed as well. DNS load balancing has many downsides, mentioned in plenty of sources. It can only be used for balancing across macro regions, with an entry point that then balances the client across the end nodes.
Another option is to use algorithms based on the Least Connections / RR / WRR methods. None of these is the best choice on its own, as we need to pin a client to a given server for a certain period of time.
So the solution is to add IP-hash balancing to any of the above algorithms and be happy. HAProxy lets you redirect traffic and pass it through. But an important criterion here is the server load, by which I mean the load on the CPU / disk I/O / network interfaces. Here you could invent new services and try to interface them with HAProxy.
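To show the idea, here is a minimal sketch of IP-hash balancing in the rendezvous (highest-random-weight) flavor: a client IP keeps mapping to the same recorder as long as that recorder stays in the pool and is not overloaded. The function and server names are ours, for illustration only; our actual daemon also weighs in the statistics described above.

```python
import hashlib

def pick_server(client_ip, servers, overloaded):
    """Pick a recorder for a publishing client.

    Rendezvous hashing: each (client, server) pair gets a pseudo-random
    score, and the highest-scoring server wins. Removing one server only
    remaps the clients that were pinned to it.
    """
    # Skip overloaded servers; if everything is overloaded, use the full pool.
    candidates = [s for s in servers if s not in overloaded] or servers

    def score(server):
        key = f"{client_ip}|{server}".encode()
        return int.from_bytes(hashlib.md5(key).digest(), "big")

    return max(candidates, key=score)
```

The useful property over a plain `hash(ip) % len(servers)` is stability: taking a node out of rotation does not reshuffle every other client's assignment.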
But we chose another way. The publishing solution consists of:
- NGINX upload module
- C++ daemon.
The daemon receives data about arriving chunks from NGINX, applies some processing, moves the chunks to the storage and, if necessary, updates the broadcast status in the portal API as well as the broadcast metadata. The module also collects statistics on the broadcasts streamed through that server. So, in the end, this is the same service that receives and registers chunks, and additionally collects server statistics.
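Conceptually, the per-chunk work of the recorder looks something like the sketch below. All names are hypothetical, and plain dicts stand in for Redis and the statistics store; the real daemon is C++ and talks to the actual storage and portal API.

```python
import os
import time

def handle_chunk(broadcast_id, seq, data, storage_root, meta, stats):
    """Store one incoming HLS chunk and update broadcast metadata/stats.

    `meta` and `stats` are in-memory stand-ins for Redis and the
    recorder's statistics collector.
    """
    bdir = os.path.join(storage_root, broadcast_id)
    os.makedirs(bdir, exist_ok=True)
    path = os.path.join(bdir, f"chunk_{seq}.ts")
    with open(path, "wb") as f:
        f.write(data)                       # move the chunk to storage

    entry = meta.setdefault(broadcast_id, {"chunks": [], "last_update": 0.0})
    entry["chunks"].append(seq)             # broadcast grows chunk by chunk
    entry["last_update"] = time.time()      # lets us time out stale broadcasts

    stats["bytes_in"] = stats.get("bytes_in", 0) + len(data)
    return path
```

The `last_update` timestamp is what makes the timeout mentioned above possible: a broadcast that has not pushed a chunk recently can be treated as interrupted.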
To balance the publishing process, another C++ daemon is used. It collects data from the entire pool of servers that receive chunks (the recorders).
It is this service that the mobile publishing client communicates with, and it decides where best to redirect the client based on all the factors and statistics mentioned above.
Data Storage So, what are the storage requirements? The storage must:
- Not be a cloud-based service
- Be open source (GPL / Apache license, etc.)
- Support data replication across N nodes, including geographic replication
- Allow reading directly from N nodes
- Keep performing when a node fails
- Provide a POSIX interface to the data
- Support ACLs
- Support disk space limits on each node
- Be fast and reliable :).
Such solutions exist, and a Google search is replete with names like Ceph / GlusterFS / parallel NFS / GFS / GPFS / Lustre, etc.
But we decided to explore the solution proposed by the Yandex team, namely elliptics + pohmelfs.
Well, we managed to build and run it successfully, though certain manual adjustments were required. But since the pohmelfs module came out of the staging tree and has not been merged into mainline yet, it was quite risky to depend on a specific kernel version. At the moment, our cloud hosting provider does not support loading custom kernels, so it could update the kernel at any time, with very unhealthy consequences for us. This was the reason to drop this solution for our cloud-based deployment.
Still, we very much hope that the Yandex team updates the documentation and gets pohmelfs into the mainline kernel. It was also disconcerting that only two production installations were reported at the time, one of them inside Yandex itself.
In elliptics, we liked the absence of metadata servers, the key-value model, and the "automatic" rebalancing of nodes. So we were eager to find an FS with similar niceties.
The choice was not large, though: GlusterFS. It offers only one of those niceties, the lack of metadata servers, and we made it run. One issue worth mentioning: in 3.3.1 the disk limit (quota) did not work. We also found that we could not make one large volume spanning all data centers: the high latency between DCs (ping avg 100 ms, max 700 ms; jitter avg 30 ms, stdev 40 ms; loss 0.1%) caused a significant drop in performance (from 150 Mbps to 40-50 Mbps).
As a result, we settled on the following scheme. There are two kinds of volumes: the archive volume and the hot data volumes. Each data center has its own hot data volume hosted on a few nodes with SSD drives. The archive volume is located in a DC with a high-level SLA, but runs on servers with conventional large SATA drives (1-3 TB). Data synchronization runs periodically, moving finished broadcasts to the archive storage.
The data access interfaces on the archive storage and the fast storage are identical, which provides transparent access to the data without any search-related hassle.
Generation and storage of broadcast metadata Metadata is the data about a video stream that the player needs in order to play it back, plus the data the recorders need to handle the broadcasts. To store the metadata we use Redis, though we did not arrive at it immediately. At first Couchbase looked very attractive, but we dropped the idea, as its production prices are quite high. In Redis, the data on each chunk is saved as soon as it arrives.
A small service generates m3u8 playlists from this metadata: on a client request, it fetches the necessary data from Redis and builds a playlist. Such a service scales horizontally without difficulty. Fault tolerance in Redis is implemented with replication + Redis Sentinel.
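The playlist-building step itself is trivial, which is why the service scales so easily. A minimal sketch (the function name is ours, and a list of pairs stands in for the per-chunk records we read from Redis):

```python
def make_playlist(chunks, target_duration=10, live=True):
    """Build an HLS media playlist from per-chunk metadata.

    `chunks` is a list of (uri, duration) pairs, as read from the
    metadata store. For a live broadcast we omit #EXT-X-ENDLIST, so
    the player keeps polling the playlist for new chunks.
    """
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target_duration}",
        "#EXT-X-MEDIA-SEQUENCE:0",
    ]
    for uri, duration in chunks:
        lines.append(f"#EXTINF:{duration:.1f},")
        lines.append(uri)
    if not live:
        lines.append("#EXT-X-ENDLIST")  # VOD playlist: broadcast finished
    return "\n".join(lines) + "\n"
```

The same code path serves both cases from the problem statement: a live playlist while the journalist is publishing, and a closed VOD playlist once the broadcast is over.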
Balancing of viewing requests and caching of popular content Why do we need this? First of all, to make the cache on the edge servers effective. The viewer should be sent to the edge server where the content is already cached and being viewed, as it is very likely that this content is in the OS page cache. This way we reduce the number of random disk reads and increase the edge server efficiency.
For caching we decided to use plain NGINX. The issue is that files in the NGINX cache have completely meaningless alphanumeric names. To learn the KEY (actually, the URL) of a cached file, we need to read the first four lines of the file, which is relatively expensive in terms of file handling.
By making some modifications to NGINX, we made it possible to obtain this information without significant overhead. But getting the information is not enough: you need to aggregate the data per content item based on the number of chunks in it, taking into account which broadcasts are currently being watched and the load on the disk / CPU / network interfaces.
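For the curious, the naive approach we wanted to avoid looks like this: NGINX stores a plain-text `KEY: <url>` line in each cache file's header, so mapping a cache file back to its URL means opening and reading the head of every file. A sketch, assuming that on-disk layout (our actual solution patches NGINX to export this instead):

```python
def cache_key(path, max_header=4096):
    """Extract the cache KEY (the original URL) from an NGINX cache file.

    Reads only the first few KB of the file and looks for the
    plain-text 'KEY: ' line that NGINX writes into the header.
    Returns None if no KEY line is found.
    """
    with open(path, "rb") as f:
        header = f.read(max_header)
    for line in header.split(b"\n"):
        if line.startswith(b"KEY: "):
            return line[5:].decode("utf-8", "replace")
    return None
```

Cheap for one file, but doing this for every file in a large cache directory on every scan is exactly the per-file overhead that made us patch NGINX instead.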
To collect all these statistics and manage the cache (i.e., add/remove content), we wrote another C++ daemon.
As with recording, there is a central daemon that collects statistics from all the edge servers and consolidates them. This is the daemon the client player communicates with when making a viewing request. The edge server is assigned based on characteristics such as:
- Geographical location of the user
- Availability of requested content in the cache
- Whether the content is currently viewed somewhere else
- System parameters of the edge servers (CPU / IO / eth)
- Total allowable bandwidth of the video platform.
- Whether requests for a given content unit have to be re-directed to an external CDN
- Broken edges (the client cannot connect to a specific server even though it is running)
After that, an optimal server is provided to the client.
Such a video platform can be hosted on dedicated servers or in the cloud, without being tied to a specific provider.
Last but not least, we would like you to try this video service: it is free and can be very helpful if you often shoot videos on your phone.