Frictionless Data Video

SEPTEMBER 4, 2017

Our mission is to make it radically easier to make data used and useful – our immediate goal is to make it as simple as possible to get the data you want into the tool of your choice. – Open Knowledge International

Transcript

We want insight, whether it is how to stop global warming or just the quickest way to get to work. Now, to get insight, we need to use data. For example, if we’re thinking about global warming: how much CO2 is emitted, what renewables could we use, and how much do they cost? But there’s a problem: rather than it being easy and quick to get the data we need to start our analysis and create our insight, it can take weeks or even months to get that data together, rather than a matter of minutes or hours.

To understand that problem a little more deeply, I want you to think about an analogy. I want you to imagine that insight is like a cake. Just as we combine ingredients – flour, sugar, eggs, maybe some butter – to make a delicious cake, we want to make an ‘insight’ cake. But rather than bringing together sugar, flour and butter, we want to bring together datasets, maps, statistics and other pieces of data.

Imagine, though, that to make a cake you first had to gather the ingredients yourself. You would have to actually go to the farm, collect the corn and mill it, and then cart it back to your house. Only then could you start baking. Rather than a quick trip to the store and a couple of hours in the kitchen, making a cake would take weeks or even months of effort. And this is where we are with data today.

With data, we spend most of our time collecting and preparing it, and are left with only a small fraction of the time to actually turn it into insight. There is a huge amount of friction in getting and using data, and that stops us from gaining insight and solving important problems, whether that’s climate change or the quickest way to do your commute.

So, what can we do about it? There are lots of sources of friction, and lots of things we could improve. There are legal barriers to getting data, or it is very expensive: you can’t get the data from a corporation at all, or you have to pay large amounts of money for it. Or maybe it is data quality: the data is there, but it is all on stacks of printed paper that you have to type in by hand before you can use it. However, we want to focus on just one source of friction, which we’ll call ‘data logistics’. We want to eliminate the friction in getting data from A to B, from tool A to tool B – from that database online into the tool on your desktop, so you can start your analysis. That is our fundamental goal: ‘frictionless data’.

We want to do for data what containerization did for the shipping of goods: dramatically cutting costs, allowing for massive automation and economies of scale, and making things fast, efficient and cheap. So I want to take you back to what shipping used to be like. It was manual, it was slow and it was costly. Individual stevedores would carry sacks of flour or bunches of bananas onto a ship, store them in the hold, and then, when the ship arrived somewhere, reverse the whole process. But today, shipping is containerized. Instead of individual people having to carry sacks of flour, big machines load standardized steel containers on and off the ship. It’s automated, it’s fast and it’s cheap. And it’s safer, by the way, than the old way. Just to give you a sense of how big a difference this made to shipping and transporting goods: it used to be that 80% of the cost of shipping something from, let’s say, America to London was the cost of loading it at either end. Containerization reduced that cost seven thousand times – or, conversely, made it seven thousand times more productive.

Now, can we do the same thing for data? Yes, we can. We can do containerization for data, and we call it ‘data packages’. Data packages are containers for data – it is that simple. Just as we put bananas inside a steel container so that we can load them onto ships massively more efficiently, we want to take data, like a spreadsheet, and put it inside a virtual container, a ‘data package’. That will allow us to load that data in and out of our tools massively more efficiently. The container is important not just because it is a container, but because of the tooling it allows. With ships, it isn’t the fact that we have a steel box; it’s that, by having the steel box, we can have massive cranes that zoom around and load those containers onto the ship really quickly, or trucks that are built for those steel containers, or even railway cars. And it’s the same thing for data. Once we have our standardized virtual box, our ‘data package’, we get all kinds of tools we can use with it. We can validate the data automatically, we can store and search that data in standard ways, and we can import that data into your specific tool, or export it from it, automatically.
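To make that a little more concrete, here is a minimal sketch (not part of the video) of what such a virtual container can look like in practice, using the datapackage Python library published by the Frictionless Data project. The descriptor, the emissions.csv file and its field names are hypothetical examples, not a prescribed layout.

```python
from datapackage import Package  # pip install datapackage

# A hypothetical descriptor: the 'virtual container' that wraps a plain CSV
# file with a name and a schema so that generic tooling can handle it.
descriptor = {
    'name': 'co2-emissions',
    'resources': [{
        'name': 'emissions',
        'path': 'emissions.csv',  # hypothetical data file next to this script
        'schema': {
            'fields': [
                {'name': 'country', 'type': 'string'},
                {'name': 'year', 'type': 'integer'},
                {'name': 'co2_tonnes', 'type': 'number'},
            ]
        }
    }]
}

package = Package(descriptor)

# The same generic tooling then works for any package: check the container,
# then read the data out in a standard form, ready for whatever tool you use.
print(package.valid)                                    # True if the descriptor is well-formed
rows = package.get_resource('emissions').read(keyed=True)
```

The point is the one made above: once the data sits inside this standard wrapper, validating, storing and moving it between tools can be automated instead of done by hand.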

You might not know that the containers you may have seen on trucks or in container ports actually come in different shapes and sizes – some longer, some shorter, some wider – and it’s the same for our data packages. You can have different kinds of data packages, customized to your needs. Just to take one example, we have Tabular Data Packages which, as the name suggests, are designed especially for tabular data like spreadsheets.
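As a rough illustration of what that tabular flavour adds (again an assumption for the sake of example, not something shown in the video), the companion tableschema Python library can infer the column schema that a Tabular Data Package records for a spreadsheet-style CSV, and then cast each cell to its declared type as it reads.

```python
from tableschema import Table  # pip install tableschema

# Hypothetical CSV from the sketch above; any tabular file would do.
table = Table('emissions.csv')

# Infer field names and types ('country' -> string, 'year' -> integer, ...)
# from the data itself; a Tabular Data Package stores this schema alongside
# the file so every consumer agrees on what the columns mean.
table.infer()
print(table.schema.descriptor)

# Reading through the schema casts each cell to its declared type, so '2017'
# comes back as the integer 2017 and malformed values are reported as errors.
rows = table.read(keyed=True)
```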

Let’s recap and summarize. To go all the way back to the beginning: we want to turn data into insight, and we want to do that quickly, easily and reliably. But today there is a huge amount of friction, and you can spend weeks preparing data just to do a few hours of analysis. However, by containerizing data – putting it into ‘data packages’ – we can dramatically cut the cost of acquiring and integrating data, and create a world of frictionless information. More insight, more efficiently!