30 Mar The Fundamentals Of Data Gravity With Dave McCrory
Data gravity is the ability of data to attract applications, services, and other data to it. Dave McCrory coined the term in 2010 to describe this trait of data and its impact on businesses. In this episode, Tad Lebeck and Dave discuss data gravity for us. Dave looks at the evolution of how companies use data and how this concept will change how we handle and consume data. Tune in to learn more about this phenomenon and how it will change things for every company.
Our guest is Dave McCrory, Global Head of Insights and Analytics at Digital Realty. He is a great guest for our initial series. He’s a co-owner of the original patent on cloud computing but more importantly, he’s the one who coined the term data gravity. Dave, I would like to understand how you got to this term. Everyone has taken it on as their own and defined it in their own terms. What do you mean by it as you use the term?
First, thanks for having me on. Data gravity for me came as I was looking at what early cloud companies were doing. By cloud, I mean companies like Salesforce and such. At the time, this was circa 2010, they were making acquisitions of companies. I was trying to understand why they were making decisions and such. I realized that it was the access to the data and the closer that you could get to the data, the more advantaged you would be.
As I thought about that, I thought of it as being like a planet having a gravitational pull, the pull being on applications and services. That became the birth of data gravity: data has an attractive force to it, and that attractive force is caused by access to the data, either with lower latency, so you get faster access to the data, or with more bandwidth, so you can access more of the data at a time.
With that, I began blogging about it, chatting on social media and interacting with other industry experts. Over time, my views on data gravity have evolved. When I think about data gravity, I think of two things. I think of data having mass and growing. When you have a sufficient amount of data, it ends up with an attractive force, just as a large enough amount of mass, say a planet, has an attractive force. At the same time, there’s the concept of activity. When you access data, that can also have an attractive effect. If you have a small amount of content but it’s highly desired, it will still have that same attractive force even though it’s not a large amount of data.
You could think about a very simple example. Say there was a new Avengers movie and it was available on Netflix. That file, or in reality the stream, might only be 2 to 4 gigs in size. Not that much data. However, the number of people and the amount of activity around that data would be incredibly high. Therefore, it would have a high attractive force because people would want to watch this movie in as high a resolution as they possibly could, which would mean you need higher bandwidth and ideally you would want lower latency. You have two things. You have this activity and this mass. These things are the fundamental components in my mind of what data gravity is all about and then everything builds upon that.
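To make the mass-and-activity idea concrete, here is a minimal sketch in Python. The `Dataset` class, the `attraction_score` function, the multiplication of size by request rate, and every number in it are assumptions made up for illustration. None of them come from the conversation, and the score is not a real metric.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_gb: float           # "mass": how much data there is
    requests_per_sec: float  # "activity": how often the data is accessed


def attraction_score(ds: Dataset) -> float:
    """Toy score: mass times activity (an illustrative assumption only)."""
    return ds.size_gb * ds.requests_per_sec


# Hypothetical numbers: a huge but rarely read archive versus a
# small but heavily streamed movie file.
archive = Dataset("cold log archive", size_gb=50_000, requests_per_sec=0.01)
movie = Dataset("popular movie stream", size_gb=4, requests_per_sec=200_000)

# Rank the datasets from strongest to weakest pull under the toy score.
ranked = sorted([archive, movie], key=attraction_score, reverse=True)
for ds in ranked:
    print(f"{ds.name}: {attraction_score(ds):,.0f}")
```

Under these made-up numbers the small movie file outranks the much larger archive, which is the point of the example: activity alone can produce a strong pull.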
I had not thought about the attractive force of more activity. Put it in the context of enterprises. What does this mean to them? How can they drive value on it? How do they mitigate some of the problems with it?
That’s a multi-part answer. The enterprises need to take stock of where they are. What are their data gravity issues? Usually, enterprises either only deal with fighting fires or they are not looking at things from a high level. They are looking at more of a persistent basis, which is fine. There’s nothing wrong with it but to understand where things are and where they are going, you have to consider data gravity.
Whether you are doing projects with something like IoT or other edge computing, doing work in the core, or running hybrid between your own facility or a colo and cloud, all of those things are going to be affected by, “Where is your data? Where should data be created? Where is it being stored and processed? Who’s consuming it?” You have to consider all of those things.
It’s easy if you put them into those buckets and you don’t get down deep into the specifics of an individual system, which all of us in IT are guilty of, for a number of reasons many of them good. The only way you are going to mitigate the effects is to understand how the data flows, what’s happening with the data and who’s consuming it. Otherwise, you are making guesses on what’s happening with your data.
People often think of data gravity as a big central mass but what you described is more of a planetary system where there are different gravitational forces, whether it’s on the edge or other things that drive that. There’s going to be a problem, not only in terms of how you access it but how you think about what the gravity effects are. One of the problems that always comes in with data is you want to ensure privacy on that. Now that you have got these different centers of gravity, how do you do things like data protection or GDPR?
One of the new wrinkles in the overall data gravity problem is how do you deal with GDPR or with other local laws in data retention, auditing, security and embrace both the benefits of data gravity, as well as try to avoid the dangers of data gravity. Things like GDPR and requiring data to be held locally in country or other data sovereignty laws require a lot more work.
You have to keep the raw data locally, within country, but maybe you need a small amount of that data elsewhere. Let’s say you are storing the data in Germany and you need the data in Australia. Between Germany and Australia, bandwidth is low and latency is high. You want to have a copy of that data in Australia but you can’t copy the data.
What you can do is some processing of the data and either anonymize it or do other work so that it is able to be copied over to the Australia location, where you can gain the benefits. That effort may or may not be worth it. If it requires an incredible amount of processing, it takes a lot of time, there’s a large amount of delay in between and you need access to the data, it may be easier to go across the network and directly access the raw data. Maybe you query the raw data through a service that runs locally in Germany and provides all the necessary encryption and other capabilities so that you are not violating a specific data sovereignty law.
You have to understand what the laws are and then you have to figure out what is the most effective and efficient way of achieving whatever your goals are. If you had hundreds of thousands of people that needed access to this data, you are not going to want to access it remotely from Australia in Germany. That does not make sense. You are going to have to figure out a way to replicate that data in one form or another, even if it’s anonymized and there are a lot of other steps involved. It’s simply going to become a necessity.
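The trade-off described above, querying the raw data remotely versus anonymizing it once and replicating it, can be sketched as a rough cost comparison. This is a minimal sketch under stated assumptions: the function names, the simple "latency plus transfer time" model, and every number are hypothetical and chosen only for illustration. It deliberately ignores legal review, concurrency, and keeping the copy up to date.

```python
def remote_access_seconds(requests: int, payload_mb: float,
                          latency_s: float, bandwidth_mbps: float) -> float:
    """Total time if every request crosses the long-distance link."""
    transfer_s = payload_mb * 8 / bandwidth_mbps  # megabytes to megabits
    return requests * (latency_s + transfer_s)


def replicate_seconds(requests: int, payload_mb: float, dataset_mb: float,
                      anonymize_s: float, wan_bandwidth_mbps: float,
                      local_latency_s: float,
                      local_bandwidth_mbps: float) -> float:
    """One-time anonymize-and-copy cost plus cheap local requests."""
    one_time = anonymize_s + dataset_mb * 8 / wan_bandwidth_mbps
    per_request = local_latency_s + payload_mb * 8 / local_bandwidth_mbps
    return one_time + requests * per_request


def cheaper_option(requests: int) -> str:
    """Compare both approaches using assumed, made-up link numbers."""
    remote = remote_access_seconds(requests, payload_mb=1.0,
                                   latency_s=0.3, bandwidth_mbps=100.0)
    replicated = replicate_seconds(requests, payload_mb=1.0,
                                   dataset_mb=100_000.0, anonymize_s=3600.0,
                                   wan_bandwidth_mbps=100.0,
                                   local_latency_s=0.005,
                                   local_bandwidth_mbps=1000.0)
    return "remote" if remote < replicated else "replicate"


print(cheaper_option(100))      # a handful of users
print(cheaper_option(500_000))  # hundreds of thousands of requests
```

With these assumed numbers, a small number of requests favors going across the network, while hundreds of thousands of requests favor paying the one-time anonymization and copy cost.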
The regulatory aspect provides another element of the gravitational force of the data, in the sense of where it has to be kept. If I understand you correctly, there are different ways of cracking this data gravity. One is you move the processing closer to the data and the other is you move the data closer to the processing. Is that a fair analysis?
That’s fair. It’s about not only data to processing and processing to data but also where the data is being created and where it’s being consumed. It may make sense to move the data to the processing or the processing to the data, but where’s the data being consumed? Ultimately, your processing, analytics, machine learning or other activities that need to happen with the initial data or raw data may be different from how the data is being consumed on the other side.
For data science or analytics, it might be being consumed where the processing is happening but that may not be true of eCommerce, some type of planning, or teams such as those in life sciences. There you will see one team working on generating or creating the data. The data’s processed in a different location, perhaps because of the expertise there in doing the processing, but then the data either needs to be sent back or it needs to be distributed to many sites for them to do their own analysis or their own aggregation with other data sets. The only way to do that is to move the data around pre and post. It may be more efficient to do the processing in one place but you may not have the expertise to do that for all of the cases.
Your thinking about data and data gravity has evolved over the years. Where do you think we are going with this?
We are headed to a state in, say, the next five years where we will see data creation, processing and storage having more focus but happening at what I would call all of the points. There will be storage, processing, activity, creation and consumption at the edge, whether it be with IoT devices or things like wearables and such. All of these things are generating data. There is at once a need and a desire for people to have access to more information. That’s always initially in the form of data being generated. At the same time, many systems need either a local or regional facility to do some type of pre-processing, aggregation, filtering or other activities. If we move up, then you either have a core facility, a cloud, an enterprise data center or some combination of those things.
Data and processing are going to be occurring at all of those. That, at least for the next 10 to 20 years, is going to be the state that everything operates in. In the beginning, everything was consolidated. We had a few mainframe computers. Everything was centralized. We started to expand out. We saw the concepts of personal computers, local networks and all of these things. We ended up with this distribution where everything was stored on the separate endpoints. Then we said, “We need cloud so everything’s going to be more centralized in the cloud.” Now we are moving to, “Some things are better in the cloud or in the enterprise data center. Some things are better at the edge or at a local or regional level.”
We are starting to use all of them. The speed of the networks, latency, processing and storage, all of these things are now evolving to a point where they can facilitate some level of that occurring. That’s why we are getting to that place but it’s also making the data gravity dance more important than it was before where you were either distributed or you were centralized. Now you are both. Now you have to make intelligent choices about what to do.
The first phase of this to address data gravity is around content distribution networks because data was always in some central spot. We wanted people to consume it in different locations. What you are saying is there’s almost an inversion of that that we have to have a more intelligent way of not only getting the data out but bringing the data back in and having this intermediary processing so that we are not moving bulk data, we are moving information. Is that a fair analysis?
We are moving a little bit into one of the other areas that I’m focused on as of late but we are talking about the data versus information. You talked about an inversion of the CDN. That’s very true but I don’t know if it’s an inversion or more of both at the same time. You still need the CDN but you need the inversion of a CDN, a content reception network or the equivalent where you are able to take in data quickly, make decisions, move it to the right place back up. That’s critical with what’s happening. I think about my own home and the number of connected devices I have and the amount of data that’s generated passively, which is something I should bring up about activity as well. Every time you interact with data, you create more data.
The amount of activity that we have with our own data, even as individuals, let alone businesses, is increasing as well. Think about something as simple as a transaction: you buy something online from a retailer. It does not matter which one. The purchase you are making involves more than, “I have selected that new t-shirt,” or whatever you are buying.
It involves inventory checks and usually dozens or more API calls in backend systems. It involves you transacting and providing payment. It involves providing your shipping, delivery information and all of those things. Every one of those systems as you touch it for that transaction is creating log files, an audit path and a set of records in other departments. It’s updating inventory. Every interaction you are having is generating even more data.
We have this exponential growth of data that is occurring out there. Some of it may not seem all that valuable at this moment in time but could be incredibly valuable at a later moment in time. Maybe it’s a month or a year. Sometimes you can consolidate and normalize or compress the data but even if you do that, you still need to be able to gain access to that data in some reasonable amount of time, at least to be able to do an analysis or take action.
What you are saying basically is that with this explosion of data, people have no idea how much data they are generating with a simple, “I’m going for a walk with my Apple Watch,” and what that entails behind it. More importantly, for the systems of record on that going backward, you want to keep that around because you don’t know what you want to analyze until you want to analyze it. Is that what you are saying?
That’s right. There are companies that are focused on how you shrink that amount of data as much as possible. You can do intelligent things to make it smaller but at some point, you still need to retain the data. Otherwise, when you do want to go and do that analysis, if the data does not exist anymore you can’t do the analysis. That’s why so many companies are heavily invested in data lakes. They are trying to keep the data as long as they can because they don’t, at least now, have a way of knowing what’s going to be valuable and what won’t, outside of cases like a temperature sensor that’s reading the same temperature 100 times.
Can you consolidate that? Yes, you can consolidate that data very easily but if you have something like an Apple Watch that’s doing many different sensor readings all the time and it’s looking at patterns and such, it’s very difficult to consolidate that data past a certain point. That’s true with machines and people and gathering all of that sensor data. It’s a challenge but it’s a challenge that businesses have to deal with. Otherwise, they risk missing out at some point in the future.
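The temperature sensor case can be illustrated with a simple run-length consolidation. This is a minimal sketch in Python, not a description of how any particular product compresses sensor data. The function names and the sample readings are made up for illustration.

```python
from itertools import groupby


def consolidate(readings):
    """Collapse consecutive identical readings into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(readings)]


def expand(pairs):
    """Rebuild the original readings from (value, count) pairs."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical data: a sensor stuck on one temperature versus a
# constantly changing stream such as heart-rate samples.
steady = [21.5] * 100
varied = [72, 75, 71, 90, 88, 93]

print(len(consolidate(steady)), "record(s) for", len(steady), "steady readings")
print(len(consolidate(varied)), "record(s) for", len(varied), "varied readings")
```

The 100 identical readings collapse into a single record with nothing lost, while the varied stream does not shrink at all, which mirrors the point that consolidation only helps up to a certain point.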
This has been a fantastic conversation. I appreciate it. Do you have any closing comments you want to bring to the end here?
My closing comment would be that enterprises need to be focused on what their data gravity footprint looks like. They also need to be planning for the future. Know your present state and look at what your future state is going to be. Otherwise, you risk being caught in a constant game of catch-up and never being able to get into a spot where you can take advantage of data gravity versus suffering from all of the ill effects of not being properly prepared.
Thank you so much. This has been enlightening for me. I hope our audience enjoyed it as much as I did. Thanks a lot again, Dave.
About Dave McCrory
Dave McCrory is a senior technology executive with experience in building and managing teams, managing the implementation and operations of production platforms at scale, and helping architect technical solutions that improve both operational efficiency and profits. He is an acknowledged subject matter expert for his innovative approaches in Cloud Computing, Virtualization, and Data.