Beginnings of a platform - 3D Streaming Toolkit

Opportunity - unmet needs
In February 2016, we got in contact with AVEVA. They were an early HoloLens solution developer, and they'd hit a wall. In their Everything3D product, their customers develop extremely large, detailed BIM plans for power plants and large marine vessels. The fidelity of these models is critical to their applications.
So they posed the question - can we render this remotely from a datacenter to the device?
AVEVA specifically wanted an integrated product experience, not just remote-desktop or remote-app projection in AR.
Challenges
- Why hasn't anyone else done this yet?
- Will this make people sick?
- Do solutions work well on form factors beyond HoloLens?
- Can this approach scale efficiently?
Before we started, I built an iterative roadmap alongside Chase Laurendine from AVEVA.
The plan
- Deep dive with AVEVA to understand their product, technologies, and customer needs.
- Come together for a hackfest to attempt to build a functional proof of concept.
- If we succeed in a proof of concept, then determine requirements for a commercially viable solution.
- Develop an MVP and then test and validate viability with AVEVA.
- If successful, publish the toolkit as an open-source project and begin onboarding more customers.
Preparing for the hackfest
We had one week in a room to build a functional prototype. This was one of the bedrock tenets of the group I worked with - solving hard problems completely takes a long time, but if you can't show a glimmer of hope in a week or two, there are probably much larger roadblocks (technological, organizational, regulatory, cultural) that need to be overcome first.
Due to the limited window of opportunity, I started researching. I knew we wouldn't have the time or bandwidth to build an end-to-end prototype from scratch.
I knew we would need three critical components to prove this could work (a rough sketch of the data involved follows the list):
- A low-latency method to capture a frame, screen, or buffer and rehydrate it on the HoloLens
- A transport protocol and architecture capable of working across local and wide area networks that can carry multiple data streams (buffers, inputs, and metadata) with good fault tolerance and congestion fallback.
- A method to display the content on HoloLens with pose prediction and image stabilization.
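To make the first two requirements concrete, here is a rough sketch of the kind of non-video data that has to travel with the stream. The struct names and fields are illustrative assumptions for this write-up only, not the toolkit's actual wire format.

```cpp
// Hypothetical message layout for the non-video streams.
// Names and fields are illustrative, not the toolkit's real protocol.
#include <cstdint>

// Sent server -> client alongside each encoded frame, so the client knows
// which camera pose the frame was rendered from.
struct FrameMetadata {
    uint32_t frameId;          // matches the encoded video frame
    int64_t  captureTimeUs;    // server render timestamp, for latency tracking
    float    viewMatrix[16];   // camera pose the frame was rendered from
    float    projMatrix[16];   // projection used for the render
};

// Sent client -> server every tick: the predicted head pose the server
// should render the next frame from, plus any user input.
struct ClientUpdate {
    int64_t  predictedDisplayTimeUs;
    float    predictedViewMatrix[16];
    uint8_t  inputEventCount;  // gaze/gesture/controller events follow
};
```

The point of sketching it this way is that the video frames, the pose metadata they were rendered from, and the client's input all have to stay associated and arrive with low latency - which is what drove the transport requirements above.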
After speaking briefly with some colleagues on the HoloLens side of Microsoft, I knew that #3 was possible, but would require a significant amount of organizational effort, so we kicked that can down the road.
Low-latency transcoding
This is not a new problem, and it wasn't in 2016 either. Fortunately, Microsoft had an amazing solution - Remote Desktop. I reached out to colleagues who had worked on RDP and learned that it had just received a massive update enabling much higher resolutions, faster framerates, and better image quality.
Unfortunately, the code wasn't accessible - internal only, no published APIs, no integrations, and definitely not open-sourceable.
There were a number of hardware encoding devices on the market that could do "real-time" processing, but none of them were available on any large cloud service provider, and it would have taken weeks to get one on site for even a test.
So next I looked at market solutions - specifically game streaming. NVIDIA had recently launched GeForce Now into public beta on their Shield device, and v5.0 of the NVENC (NVIDIA video encoder) API had added a low-latency encoding option in late 2014 - making it possible for anyone to use their hardware for "real-time" video encoding.
We all had NVIDIA GPUs with NVENC cores available, and most importantly, both Azure (N-series) and AWS (EC2 GPU) offered large deployments of GPU instances.
So I went with NVENC - it had the highest likelihood of quick success.
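To give a feel for what that low-latency path looks like, here is a minimal sketch of opening an NVENC H.264 session tuned for streaming, using the public NVENC C API. The structure and field names come from the NVENC SDK headers, but preset GUIDs and defaults have shifted across SDK releases, and device creation, error handling, and buffer management are all omitted - treat it as a sketch, not the toolkit's actual encoder code.

```cpp
// Minimal sketch: open an NVENC session and configure it for low-latency
// H.264 encoding. Assumes an existing D3D11 device; error handling omitted.
#include <nvEncodeAPI.h>

bool InitLowLatencyEncoder(void* d3d11Device, uint32_t width, uint32_t height) {
    // Load the NVENC entry points.
    NV_ENCODE_API_FUNCTION_LIST nvenc = { NV_ENCODE_API_FUNCTION_LIST_VER };
    if (NvEncodeAPICreateInstance(&nvenc) != NV_ENC_SUCCESS) return false;

    // Open an encode session on the GPU that renders the frames,
    // so frames never leave video memory before being encoded.
    NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS sessionParams = { NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER };
    sessionParams.deviceType = NV_ENC_DEVICE_TYPE_DIRECTX;
    sessionParams.device = d3d11Device;
    sessionParams.apiVersion = NVENCAPI_VERSION;
    void* encoder = nullptr;
    if (nvenc.nvEncOpenEncodeSessionEx(&sessionParams, &encoder) != NV_ENC_SUCCESS) return false;

    // Start from the low-latency preset, then tighten it for streaming:
    // constant bitrate, no B-frames, no periodic I-frames.
    NV_ENC_PRESET_CONFIG presetConfig = { NV_ENC_PRESET_CONFIG_VER };
    presetConfig.presetCfg.version = NV_ENC_CONFIG_VER;
    nvenc.nvEncGetEncodePresetConfig(encoder, NV_ENC_CODEC_H264_GUID,
                                     NV_ENC_PRESET_LOW_LATENCY_HQ_GUID, &presetConfig);
    NV_ENC_CONFIG encodeConfig = presetConfig.presetCfg;
    encodeConfig.rcParams.rateControlMode = NV_ENC_PARAMS_RC_CBR;
    encodeConfig.rcParams.averageBitRate  = 5 * 1000 * 1000;  // ~5 Mbps, tune per network
    encodeConfig.gopLength       = NVENC_INFINITE_GOPLENGTH;  // I-frames only on demand
    encodeConfig.frameIntervalP  = 1;                         // IP-only, no B-frame reordering delay

    NV_ENC_INITIALIZE_PARAMS initParams = { NV_ENC_INITIALIZE_PARAMS_VER };
    initParams.encodeGUID   = NV_ENC_CODEC_H264_GUID;
    initParams.presetGUID   = NV_ENC_PRESET_LOW_LATENCY_HQ_GUID;
    initParams.encodeWidth  = width;
    initParams.encodeHeight = height;
    initParams.frameRateNum = 60;
    initParams.frameRateDen = 1;
    initParams.enablePTD    = 1;   // let the encoder pick picture types
    initParams.encodeConfig = &encodeConfig;
    return nvenc.nvEncInitializeEncoder(encoder, &initParams) == NV_ENC_SUCCESS;
}
```

The choices that matter for latency are the ones in the middle: CBR rate control, no B-frames, and an effectively infinite GOP so the encoder only emits a keyframe when the client actually needs one.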
Transport
For the transport protocol, I again looked at several candidates, as this was another technology area that was already mature at the time.
The first candidate was using existing Microsoft technologies - especially Skype. I had several teammates who had come over from the Skype org, so it was an easy conversation to have. They all pointed me away from this direction pretty quickly. Much like RDP - Skype didn't have a great developer story, and the codebase (especially the infrastructure side) was in a lot of flux as they were consolidating and modernizing with the Teams platform.
Fortunately, multiple colleagues pointed me to the same alternative - WebRTC.
I also looked at WebSockets and SignalR (a layer on top of WebSockets), but neither had any affordances for video streaming - we'd have had to build that from the ground up.
The biggest drawback at the time for WebRTC was the logistics of building it for WinRT - a requirement for HoloLens client apps. Again, fortunately for me (this will become a trend), I was in the right place at the right time. A small team elsewhere in Microsoft had already been tackling this problem for a year and had published WebRTC-UWP, a fork of WebRTC with the intent to merge everything back into Google's codebase when stable.
Thanks to those guys (big shout out to James Cadd, Bernard Aboba, and Robin Raymond) - I was able to get a running demo in a day or two.
So we had a transport protocol and framework in WebRTC.
Stable holograms
The last piece of the puzzle was frankly the easiest for the hackfest, while being the most difficult long-term problem we faced on the way to a usable solution for customers.
I was able to speak with the HoloLens team who built the Holographic Remoting Player, to seek their help and advice on the larger project of remote rendering we were embarking on.
They gave us two pieces of advice:
- Don't bother, it won't work.
- If it does work, it will make people sick.
Not the most cheerful advice, but this came from a group who had worked hard on this problem for some time and really knew both the HoloLens and AR technology as a whole much better than any of us. They confirmed that yes, there were APIs for camera pose prediction, and that it would be technically possible to leverage those APIs to provide stable remotely rendered video as a binocular image.
The challenge was that all of these APIs were private and closely guarded Microsoft IP. Zero chance this was going to be allowed into an open-source project.
With that knowledge in tow, I made the call to punt this problem down the road. If we couldn't succeed with a PoC, the stability wouldn't matter. And if we did succeed, real customers with real money are a great way to put pressure on organizational obstacles.
Proof of Concept
In short, it worked. In 5 days in a room, alongside a couple of engineers from AVEVA, and with the genius help of Jason Fox, we got a demo running from server to headset.

This was very, very basic - but we proved that a solution was at least technically possible. It took a little time to put together the proposal, but within a month, I got the green light to go for it.
Building an open-source product
I put together a small and mighty team to tackle this project. We only succeeded due to the unique combination of experience, skills, and perspective of each team member.
Andrei Ermilov, Chase Laurendine, Phong Cao, Anastasia Linzbach, Ben Greenier.
Andrei published an amazing retrospective on the full architecture of the V2 toolkit here on the ISE Developer Blog - Real-time streaming of 3D enterprise applications from the cloud to low-powered devices.
The article details all of the major contributions we made back to WebRTC to enable it as a robust transport for real-time interactive experiences, as well as the challenges of getting NVEnc working in a scaled production architecture.
What I will cover in more depth are some aspects of the project that still make it uniquely differentiated from other commercial remote rendering solutions - aspects that were extremely important to customer adoption.
Outcomes
Many commercial customers adopted the toolkit to build out their own platforms, including AVEVA, Intel, BMW, Medivis, and Kognitiv Spark.
What was most surprising was how the core technology of this project found legs far beyond the world of HoloLens and spatial computing. There turned out to be many customers who needed ways to take a remote compute resource and project a running application or environment into a highly integrated client experience.
The success of the toolkit was truly realized when the same HoloLens team that had sometimes begrudgingly helped us along the way came back asking about our growth and success. Kudos to them - they always helped us and our customers, even when we disagreed on approach or outcome - one of the things I cherished about working at Microsoft.
When we found success, the door opened. I wrote a business opportunity brief for the organization, and a year or two later, they officially launched Azure Remote Rendering.
Similarly, because we had customers using the toolkit commercially, they all came back to us with the same question - does Microsoft offer WebRTC infrastructure as a service? Initially we pointed them to some great partners like SignalWire. However, we had built up a great relationship with folks in Microsoft Teams and Skype along the way, so I thought it was worth a conversation.
It turned out the timing was right once again. There was willingness to explore opening Skype's infrastructure into an Azure service, and it wouldn't be a huge lift to put a WebRTC-capable gateway like Janus in front to make it all work.
Around the same time that ARR launched, so did Azure Communication Services.