TnM2: Big Data

Part 2 of Technology En Masse

Let's throw in another buzzword: Big Data. In the previous edition of Technology en Masse we featured a post about the Internet of Things, in which we explained that more and more devices are connected to the internet. By 2020 a total of 28 billion ‘things’ are expected to be connected to the internet, and they send and receive a lot of data. However, Big Data does not only cover this particular segment of data transmission; it also incorporates today's large social media streams. Still, what does this mean? How ‘big’ is Big Data? What are the dangers? And how can we use it from a software development perspective?

How ‘big’ is Big Data?

The term Big Data has been used a lot over the past couple of years. It is meant to describe the large data streams generated not only by humans (likes, selfies, blog posts) but also by the growing number of devices connected to the internet. Think not only of devices used directly by us humans (FitBits, smart meters, smart refrigerators) but also of devices we use indirectly, such as industrial machines (gas turbines, weather stations, stock market machines). Even though not all of this data is open to the general public, businesses may be dealing with tons and tons of (company-owned) data. The question that arises is: how are we going to deal with this vat of uncategorized information? And how can we use it to benefit ourselves (in general) and/or our company?

Before we can even try to formulate an answer to these questions, we first have to get the gist of how ‘big’ big data really is. We are talking about amounts of data that are hardly imaginable. For good measure we have therefore included a table of units of information translated to gigabytes (in base 10, i.e. decimal). The gigabyte gives a clearer idea of the size, since nowadays it is pretty much the standard unit for describing hard drive sizes. As a side note: these units use the standard SI prefixes, as most physical units do!

Data measurement	in Gigabytes
1 Kilobyte (KB)	0.000001 GB
1 Megabyte (MB)	0.001 GB
1 Gigabyte (GB)	1 GB
1 Terabyte (TB)	1,000 GB
1 Petabyte (PB)	1,000,000 GB
1 Exabyte (EB)	1,000,000,000 GB
1 Zettabyte (ZB)	1,000,000,000,000 GB
1 Yottabyte (YB)	1,000,000,000,000,000 GB
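
The table can be reproduced with a few lines of code. Below is a minimal sketch in Python, assuming the decimal SI prefixes used in the table (each step is a factor of 1,000). The name `to_gigabytes` is our own, chosen for illustration only.

```python
from fractions import Fraction

# Decimal (SI) units in ascending order; each step is a factor of 1,000.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_gigabytes(amount, unit):
    """Convert an amount in the given unit to gigabytes (base 10)."""
    # KB = 10^3 bytes, MB = 10^6 bytes, ... and GB = 10^9 bytes.
    exponent = 3 * (UNITS.index(unit) + 1)
    # Fractions keep the small values exact (no floating point rounding).
    return Fraction(amount) * Fraction(10) ** (exponent - 9)

for unit in UNITS:
    print(f"1 {unit} = {float(to_gigabytes(1, unit)):g} GB")
```

Exact fractions are used here so that the small values (a kilobyte is one millionth of a gigabyte) are not affected by rounding.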

It is estimated that an average internet user produces about 500 MB per day. With around 4 billion users worldwide, that comes down to an impressive 2 exabytes (2 billion GB) of data per day globally. This number solely covers our human footprint. If we take into account all 28 billion sensors and devices, both publicly and privately owned, we arrive at a total of 163 zettabytes per year by 2025. It is not so much our social interactions that contribute to this number (our 500 million tweets a day total 130 GB) as these devices and all of their sensors (one gas turbine has 200 sensors creating 600 GB of data per day). So that’s a treasure up for grabs, we hear you say!
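
The arithmetic behind these numbers is easy to check. The sketch below uses only the figures quoted above. The variable names are ours, and setting today's human footprint against the 2025 projection assumes a year of 365 days and should be read as a rough indication, not a precise comparison.

```python
MB_PER_GB = 1_000
GB_PER_EB = 1_000_000_000
EB_PER_ZB = 1_000

# Figures quoted in the text.
mb_per_user_per_day = 500
users = 4_000_000_000
tweets_gb_per_day = 130
turbine_gb_per_day = 600
projected_zb_per_year_2025 = 163

# Human footprint per day, in GB and in EB.
human_gb_per_day = mb_per_user_per_day * users // MB_PER_GB
human_eb_per_day = human_gb_per_day / GB_PER_EB

# Rough yearly human footprint in ZB (assumption: 365 days per year).
human_zb_per_year = human_eb_per_day * 365 / EB_PER_ZB

print(f"Human footprint: {human_gb_per_day:,} GB/day = {human_eb_per_day} EB/day")
print(f"Human footprint per year: {human_zb_per_year:.2f} ZB")
print(f"Share of the 2025 projection: {human_zb_per_year / projected_zb_per_year_2025:.2%}")
print(f"One turbine vs. all tweets: {turbine_gb_per_day / tweets_gb_per_day:.1f}x")
```

Under these assumptions our human footprint stays well below one zettabyte per year, while a single gas turbine already produces several times the daily volume of all tweets combined.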

We will have to burst that bubble immediately. There might be a lot of useful data available, but the hard part is finding the ‘usable’ part of it and then making sense of it. You should therefore consider it more of a hidden treasure. Here is why: as Maria Fasli explained in a TED talk in 2014, big data does not necessarily equal big knowledge. Data by itself is just a large heap of unusable ones and zeros. It is only when we turn it into information that we can try to derive knowledge from it. Information is data molded and grouped into understandable and analyzable chunks, on which we can train models and algorithms in order to understand it and look for (hidden) relations between different datasets. That is the challenge.
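
To make the step from data to information a bit more concrete, here is a minimal sketch in Python. The sensor names and temperature readings are made up for illustration, and `summarize` is a hypothetical helper of our own. It groups raw readings per sensor and reduces each group to a few numbers a human (or a model) can work with.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw readings: (sensor id, temperature in degrees Celsius).
raw_readings = [
    ("turbine-1", 512.0),
    ("turbine-2", 498.5),
    ("turbine-1", 515.0),
    ("turbine-2", 501.5),
    ("turbine-1", 518.0),
]

def summarize(readings):
    """Group raw readings per sensor and compute simple statistics."""
    grouped = defaultdict(list)
    for sensor_id, value in readings:
        grouped[sensor_id].append(value)
    return {
        sensor_id: {
            "count": len(values),
            "mean": mean(values),
            "max": max(values),
        }
        for sensor_id, values in grouped.items()
    }

summary = summarize(raw_readings)
for sensor_id, stats in summary.items():
    print(sensor_id, stats)
```

A summary like this is still far from knowledge, but it is the kind of structured input an analysis or a model can start from.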

We have arrived at a moment in time where we are able to develop systems capable of using information to our advantage. Just one example: oncologists are using computer systems that can prescribe medicines beneficial to the patient, based on the records of thousands of previous patients and their treatments. These systems currently make a better judgement than the oncologist 85% of the time. The records analysed by the system hold the medicines prescribed, the food choices made, and patient measurements such as weight. Imagine if we added ancestry, personal lifestyle and daily activities to this. The system would be able to advise on more than just medicine: it could also detect cancer at an early stage or give tips to prevent it. Sounds good, right? However, it comes with a risk.

Privacy

All this data is very private, and most people don’t want their personal information to be available for everyone to see. You can make data anonymous and it is still usable in cases such as the one described in the previous paragraph. However, people want to have a choice and want to know where and when their specific data is used. In a perfect world that would be the case, but large players in the field of social networking have misused their large databases of personal information and have meanwhile broken people's trust in online service operators.
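
As an illustration of what making data less identifiable can look like in code, here is a minimal sketch in Python. It is our own example, not a method taken from the systems mentioned above: the record fields, the `pseudonymize` helper and the secret key are all hypothetical. Strictly speaking this is pseudonymization, which is weaker than full anonymization, because the remaining fields may still allow someone to be re-identified.

```python
import hashlib
import hmac

# Fields that identify a person directly (assumption for this example).
DIRECT_IDENTIFIERS = {"name", "email"}

def pseudonymize(record, secret_key):
    """Return a copy of the record without direct identifiers.

    The patient id is replaced by a keyed hash (HMAC-SHA256), so records
    of the same patient can still be linked without exposing the id.
    """
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    token = hmac.new(
        secret_key,
        str(record["patient_id"]).encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    cleaned["patient_id"] = token
    return cleaned

# Hypothetical patient record.
record = {
    "patient_id": 1042,
    "name": "Jane Doe",
    "email": "jane@example.com",
    "weight_kg": 68,
    "medicine": "medicine-A",
}

print(pseudonymize(record, secret_key=b"keep-this-key-secret"))
```

The key has to be kept secret and stored separately from the data; anyone holding it can recompute the tokens for known patient ids.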

In this second chapter we want to dive deeper into the possibilities of using psychometrics on big datasets. The scene we set in the coming paragraphs might come across as eerie, which is not our goal per se. Our main motivation for using this example is, firstly, to inform you about the current possibilities and, secondly, to investigate possible applications for big data in other markets.

50 billion devices (https://spectrum.ieee.org/tech-talk/telecom/internet/popular-internet-of-things-forecast-of-50-billion-devices-by-2020-is-outdated)

GoldmanSachs (http://www.goldmansachs.com/our-thinking/pages/internet-of-things/iot-report.pdf)

https://www.volkskrant.nl/opinie/big-data-is-big-business-geworden-betoogt-jan-kuitenbrouwer~a4583368/

MTheelen
