Last month, Primary Data announced that it had raised $40 million — including $20 million from investment partners and a new $20 million line of credit — in a second round of venture financing to support its DataSphere software platform. The company describes DataSphere as an “enterprise metadata engine,” meaning it leverages machine-learning techniques to improve data management capabilities. We asked Primary Data CEO Lance Smith to fill us in.
“We collect telemetry from the applications themselves as to how they are using their data,” Smith explains. “We help the application become storage-aware.” In short, DataSphere collects a range of storage metadata: IOPS, latency, bandwidth, and availability. It then analyzes that telemetry against business objectives and figures out, in real time, how data can be moved around for the greatest efficiency without disrupting operations, all while maintaining compliance with security restrictions and other requirements.
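To make that concrete, here is a minimal sketch in Python of the kind of per-application telemetry record described above, checked against a simple objective. The field names, thresholds, and the meets_objective helper are illustrative assumptions, not Primary Data's actual data model.

    from dataclasses import dataclass

    @dataclass
    class StorageTelemetry:
        # The four attributes the article names, as observed by one application.
        store_id: str
        iops: float
        latency_ms: float
        bandwidth_mbps: float
        availability: float  # fraction of time the store was reachable

    def meets_objective(t: StorageTelemetry, max_latency_ms: float,
                        min_bandwidth_mbps: float) -> bool:
        # Hypothetical check of one business objective against live telemetry;
        # the real objective model is presumably much richer than this.
        return (t.latency_ms <= max_latency_ms
                and t.bandwidth_mbps >= min_bandwidth_mbps)

    # Example: a store that is fast enough for an editing objective.
    t = StorageTelemetry("flash-01", iops=90_000, latency_ms=0.4,
                         bandwidth_mbps=2_000, availability=0.9999)
    print(meets_objective(t, max_latency_ms=1.0, min_bandwidth_mbps=1_200))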
“When DataSphere is collecting that telemetry from the client, it’s from the client’s perspective,” Smith says, meaning DataSphere can detect and attempt to correct for bottlenecks in the infrastructure. “There may be a different store that’s better for one client than it is for another. Let’s say you’ve got a fast flash filer next to your laptop, and there’s another one just like it sitting somewhere in London. One of those is going to be faster for you than the other. Those kinds of things will be detected by us.”
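Because telemetry is gathered from each client's perspective, the same set of stores can rank differently for different clients. A minimal sketch of that per-client ranking, with invented latency numbers standing in for the local-flash-versus-London example:

    # Measured round-trip latency (ms) from each client to each store;
    # the values are made up to mirror Smith's local-vs-London example.
    latency_ms = {
        "client-nyc": {"flash-nyc": 0.3, "flash-london": 75.0},
        "client-lon": {"flash-nyc": 75.0, "flash-london": 0.3},
    }

    def best_store_for(client: str) -> str:
        # Pick the store with the lowest latency as seen by this client.
        return min(latency_ms[client], key=latency_ms[client].get)

    print(best_store_for("client-nyc"))  # flash-nyc
    print(best_store_for("client-lon"))  # flash-london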
Analysis Before Action
Smith suggested DataSphere can act in part as a bulwark against hasty decisions made by IT personnel who don't necessarily have access to relevant information about a facility's workflow requirements, including where the bottlenecks really reside. As an example, he cites an imagined IT manager who specs out a generous deployment of flash storage in an attempt to relieve stress points. If that storage is overprovisioned, it wastes a lot of money; if it's underprovisioned, the problem may not impact performance until it's too late for an easy fix. Primary Data is trying to help facilities land in the Goldilocks zone, with just the right amount of storage at every tier.
“You can give us any storage you want,” Smith says. “We’ll look at three attributes: performance, reliability, and cost. We are a metadata engine, but we have machine learning. So we collect telemetry, and we learn from it. The typical rule for ‘hotness’ of data is 80/20: 80 percent of it is cold and still sitting in areas that are expensive from an IT perspective. We can push that off. We can support more workflows and more artists working simultaneously in existing infrastructure, and we can allow them to expand without jumping through hoops.”
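As a rough illustration of the 80/20 point, here is a sketch of the simplest possible hot/cold split, demoting files untouched for 30 days to a cheaper tier. The 30-day cutoff and the per-GB prices are assumptions for the example; DataSphere's learned placement is far more nuanced than a fixed rule.

    import time

    DAY = 86_400
    COLD_AFTER_S = 30 * DAY  # assumed cutoff; a learning system would tune this

    # (path, size_gb, last_access_epoch) tuples; values are invented.
    files = [
        ("/projects/show_a/frame_0001.dpx", 0.05, time.time() - 2 * DAY),
        ("/projects/show_b/archive.mov", 120.0, time.time() - 200 * DAY),
    ]

    FLASH_PER_GB, OBJECT_PER_GB = 0.50, 0.02  # assumed monthly $/GB

    now = time.time()
    cold = [f for f in files if now - f[2] > COLD_AFTER_S]
    savings = sum(size for _, size, _ in cold) * (FLASH_PER_GB - OBJECT_PER_GB)
    print(f"demote {len(cold)} file(s), save ${savings:.2f}/month")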
After DataSphere analyzes the data and determines what should go where, the actual task of moving the data falls on DSX, or DataSphere Extended Services. DSX can be installed on physical or virtual machines and is used to move data without disrupting applications, to bring block storage — HDDs, SSDs, or NVMe — into the global DataSphere namespace, and to connect to S3-compatible cloud storage. For example, a backup snapshot from a storage system can be easily deduped and compressed before DSX moves it to the cloud.
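A minimal sketch of that snapshot path follows: content-addressed dedup, then compression, before the chunks are shipped to the cloud. The fixed 4 MB chunking, the in-memory chunk index, and the flow itself are assumptions for illustration; a production mover like DSX would do all of this far more carefully.

    import hashlib
    import zlib

    CHUNK = 4 * 1024 * 1024  # assumed fixed 4 MB chunks; real dedup is smarter

    def dedupe_and_compress(snapshot: bytes, seen: dict) -> list:
        # Split a snapshot into chunks, skip chunks already stored,
        # and compress the rest before they are shipped to the cloud.
        out = []
        for i in range(0, len(snapshot), CHUNK):
            chunk = snapshot[i:i + CHUNK]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in seen:  # dedup: store each unique chunk once
                seen[digest] = True
                out.append((digest, zlib.compress(chunk)))
        return out

    seen: dict = {}
    snap = b"frame data " * 1_000_000  # stand-in snapshot payload
    new_chunks = dedupe_and_compress(snap, seen)
    # Each (digest, payload) pair would then be PUT to an S3-compatible
    # bucket by the data mover; repeated snapshots upload only new chunks.
    print(len(new_chunks), "chunk(s) to upload")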
Meeting the Needs of M&E
Especially in the media and entertainment market, it can be difficult to correctly spec a storage system, notes Primary Data VP of products Douglas Fallstrom. “Rendering requires a huge amount of metadata,” he says. “There are thousands and thousands of symlinks and tiny files. Color and editing require large frames with sequential throughput. There’s not a single system that does all of that effectively in the entire world today. But as soon as the customer is forced to buy two different systems for two purposes, you end up with storage silos and inefficiency. At the scale of large companies, that can cost millions of dollars.”
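To see why one system struggles, consider the two IO profiles Fallstrom contrasts side by side. A toy classifier that routes by profile; the workload numbers and thresholds are purely illustrative:

    # Two of the workload shapes Fallstrom describes; numbers are invented.
    workloads = {
        "rendering": {"avg_file_kb": 4, "files": 2_000_000, "sequential": False},
        "color":     {"avg_file_kb": 50_000, "files": 10_000, "sequential": True},
    }

    def suggest_tier(w: dict) -> str:
        # Millions of tiny files want a low-latency metadata tier;
        # big sequential frames want streaming throughput.
        if w["sequential"] and w["avg_file_kb"] > 10_000:
            return "high-throughput sequential tier"
        return "low-latency small-file/metadata tier"

    for name, w in workloads.items():
        print(name, "->", suggest_tier(w))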
That’s where DataSphere can really prove its worth, Fallstrom says. The software is vendor-agnostic, so it can be installed on existing infrastructure hardware, where it can analyze the storage configuration from the point of view of multiple clients on the network and recommend fairly sweeping changes to improve performance and efficiency.
“We had a client who was doing 4K workflow — color editing and rendering — and they needed 24 fps, 1.2 GB/sec throughput,” recalls Fallstrom. “They ended up with bottlenecks because they couldn’t easily predict the strain on their infrastructure based on who was accessing a stream at any given time. We were able to give them better data placement so they didn’t end up with a hot bottleneck. We also helped them use storage closer to the actual endpoint to reduce the amount of network traffic that went across the building to the data center by load-balancing across existing infrastructure and reusing server infrastructure they already had closer to where the client was sitting.”
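The throughput figure checks out with simple arithmetic: at 24 fps, 1.2 GB/sec works out to roughly 50 MB per frame, consistent with uncompressed 4K frame sizes. A one-line sanity check (the per-frame size is derived, not quoted):

    FPS = 24
    THROUGHPUT_GB_S = 1.2

    frame_mb = THROUGHPUT_GB_S * 1000 / FPS  # about 50 MB per 4K frame
    print(f"{frame_mb:.0f} MB/frame at {FPS} fps")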
The latest release of DataSphere, v2.0, adds support for the SMB 2.1 and SMB 3.1 protocols and for Active Directory. It supports Windows and Linux (NFS) clients and handles security and file permissions across NFS and SMB. The UI has been updated with a new dashboard for viewing metadata related to applications and stores, along with better insight into activity in cloud storage. Snapshot capabilities have been improved to allow data to be moved to the cloud without impacting overall capacity. And Primary Data has developed a new set of Objective Expressions that enable finer-grained control in the latest version.
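Primary Data hasn't published the Objective Expression syntax here, so the following is a purely hypothetical sketch of what declaring objectives for two shares might look like, expressed as Python data rather than the product's real language:

    # Hypothetical objectives for two shares; neither the field names nor
    # the structure come from Primary Data's actual Objective Expressions.
    objectives = {
        "/shares/editorial": {
            "min_bandwidth_mbps": 1_200,   # sustained 4K playback
            "max_latency_ms": 2.0,
            "placement": "flash",
        },
        "/shares/archive": {
            "min_availability": 0.999,
            "placement": "s3-compatible",  # cold data moves to object storage
        },
    }

    for path, obj in objectives.items():
        print(path, "->", obj["placement"])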