What is Big Data?

Are you curious? Have you ever asked yourself: all day I use social media, like posts, add comments, and download and upload images, so where is all this data stored? Big companies receive millions of pieces of data per day, so where do they store it?

Don't worry, you will get all your answers here. Just give it a read.

India had 574 million active internet users as of 2019, making it the second-largest online market, behind China. It is estimated that by December 2020 there will be around 639 million active internet users in India. The majority of India’s internet users are mobile phone internet users, who take advantage of cheap alternatives to expensive broadband/Wi-Fi connections that require a PC, laptop, and other equipment. Indian mobile data users consume 11 gigabytes (GB) of data each month on average, the highest globally, ahead of markets like China, the US, France, South Korea, Japan, Germany, and Spain. The overall data traffic in India increased by 47% in 2019, driven by continued 4G consumption. 4G constituted 96% of the total data traffic consumed across the country, while 3G data traffic registered its highest-ever decline of 30%. Indians held 1.2 billion mobile phone subscriptions in 2019.

Social media traffic from India:

Indians now download more apps than residents of any other country: over 19 billion apps were downloaded by Indian users in 2019, a 195% growth over 2016. The average Indian social media user spends 17 hours on the platforms each week, more than social media users in China and the United States. Indian internet users are fond of social media. In 2021, it is estimated that there will be around 448 million social network users in India, a significant increase from 2019, when the figure stood at 351 million. Facebook is the most popular social networking site in the country. There were about 270 million Facebook users in India as of 2019, making India the country with the largest Facebook user base in the world.

The IPL, one of many cricketing events followed religiously in India, had the highest attendance among all cricket leagues worldwide. Apart from the attendance, fans seemed to be keen on updates about their favorite teams. The IPL teams registered over 59 million likes on Facebook alone and more than 81 million followers on Twitter. Most of the Facebook usage came from the younger generation, aged 18-24 to be precise, with over 97 million users in 2018. The increased availability of internet connections and access in recent years, propelled by the central government’s Digital India initiative, went hand in hand with the growth of social media users.

Storing such huge amounts of data permanently is a great challenge for companies. "Data can be your likes on social media, comments, images, audio, video, documents: anything that takes up space."

No single storage unit can store this enormous amount of data. This problem is known as Big Data.

What is Big Data?

To really understand big data, it’s helpful to have some historical background. Here is Gartner’s definition, circa 2001 (which is still the go-to definition): Big data is data that contains greater variety arriving in increasing volumes and with ever-higher velocity. This is known as the three Vs.

The 3 Vs of Big Data:

Volume: The amount of data matters. With big data, you’ll have to process high volumes of low-density, unstructured data. This can be data of unknown value, such as Twitter data feeds, clickstreams on a webpage or a mobile app, or sensor-enabled equipment. For some organizations, this might be tens of terabytes of data. For others, it may be hundreds of petabytes.

Velocity: Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the highest-velocity data streams directly into memory rather than being written to disk. Some internet-enabled smart products operate in real time or near real time and require real-time evaluation and action.

Variety: Variety refers to the many types of data that are available. Traditional data types were structured and fit neatly in a relational database. With the rise of big data, data comes in new unstructured data types. Unstructured and semi-structured data types, such as text, audio, and video, require additional preprocessing to derive meaning and support metadata.

Two more Vs have emerged over the past few years: value and veracity.

Data has intrinsic value. But it’s of no use until that value is discovered. Equally important: How truthful is your data—and how much can you rely on it?

Today, big data has become capital. Think of some of the world’s biggest tech companies. A large part of the value they offer comes from their data, which they’re constantly analyzing to produce more efficiency and develop new products.

Recent technological breakthroughs have exponentially reduced the cost of data storage and compute, making it easier and less expensive to store more data than ever before. With an increased volume of big data now cheaper and more accessible, you can make more accurate and precise business decisions.

Finding value in big data isn’t only about analyzing it (which is a whole other benefit). It’s an entire discovery process that requires insightful analysts, business users, and executives who ask the right questions, recognize patterns, make informed assumptions, and predict behavior.

The History of Big Data:

Although the concept of big data itself is relatively new, the origins of large data sets go back to the 1960s and '70s when the world of data was just getting started with the first data centers and the development of the relational database.

Around 2005, people began to realize just how much data users generated through Facebook, YouTube, and other online services. Hadoop (an open-source framework created specifically to store and analyze big data sets) was developed that same year. NoSQL also began to gain popularity during this time.

The development of open-source frameworks, such as Hadoop (and more recently, Spark) was essential for the growth of big data because they make big data easier to work with and cheaper to store. In the years since then, the volume of big data has skyrocketed. Users are still generating huge amounts of data—but it’s not just humans who are doing it.

With the advent of the Internet of Things (IoT), more objects and devices are connected to the internet, gathering data on customer usage patterns and product performance. The emergence of machine learning has produced still more data.

While big data has come far, its usefulness is only just beginning. Cloud computing has expanded big data possibilities even further. The cloud offers truly elastic scalability, where developers can simply spin up ad hoc clusters to test a subset of data.

Benefits of Big Data and Data Analytics:

Big data makes it possible for you to gain more complete answers because you have more information.
More complete answers mean more confidence in the data, which means a completely different approach to tackling problems.

"The solution to the Big Data problem is the distributed storage concept."

Distributed Storage Concept:

A distributed storage system is an infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.

Distributed storage is the basis for massively scalable cloud storage systems like Amazon S3 and Microsoft Azure Blob Storage, as well as on-premise distributed storage systems like Cloudian Hyperstore.

Distributed storage systems can store several types of data:

Files—a distributed file system allows devices to mount a virtual drive, with the actual files distributed across several machines.
Block storage—a block storage system stores data in volumes known as blocks. This is an alternative to a file-based structure that provides higher performance. A common distributed block storage system is a Storage Area Network (SAN).
Objects—a distributed object storage system wraps data into objects, identified by a unique ID or hash.
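As a rough illustration of the object model described above, here is a minimal Python sketch that stores each piece of data under an ID derived from its SHA-256 hash. The class name ObjectStore and its put/get methods are hypothetical and chosen only for this example; real object stores expose different APIs and keep the data on many machines rather than in one in-memory dictionary.

```python
import hashlib


class ObjectStore:
    """Toy in-memory object store (hypothetical, for illustration only)."""

    def __init__(self):
        self._objects = {}  # object ID -> raw bytes

    def put(self, data: bytes) -> str:
        # The object ID is the SHA-256 hex digest of the content itself.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


store = ObjectStore()
oid = store.put(b"a photo uploaded by a user")
print(oid[:12], len(store.get(oid)))
```

Because the ID is computed from the content, storing the same bytes twice yields the same ID, which is one reason hash-based IDs are a common choice.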

Distributed storage systems have several advantages:

Scalability—the primary motivation for distributing storage is to scale horizontally, adding more storage space by adding more storage nodes to the cluster.
Redundancy—distributed storage systems can store more than one copy of the same data, for high availability, backup, and disaster recovery purposes.
Cost—distributed storage makes it possible to use cheaper, commodity hardware to store large volumes of data at a low cost.
Performance—distributed storage can offer better performance than a single server in some scenarios, for example, it can store data closer to its consumers, or enable massively parallel access to large files.
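The scalability and redundancy points above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the function place_blocks, the node names, and the round-robin placement rule are all hypothetical, and real systems use far more sophisticated placement and synchronization.

```python
from collections import defaultdict


def place_blocks(data: bytes, block_size: int, nodes: list, replicas: int = 2):
    """Split data into fixed-size blocks and assign each block to `replicas` nodes.

    Round-robin placement is an assumption made for this sketch.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = defaultdict(list)  # node name -> list of (block index, block bytes)
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for index, block in enumerate(blocks):
        for r in range(replicas):
            # Consecutive nodes hold the copies of one block.
            node = nodes[(index + r) % len(nodes)]
            placement[node].append((index, block))
    return blocks, dict(placement)


blocks, placement = place_blocks(
    b"0123456789", block_size=4, nodes=["node-a", "node-b", "node-c"]
)
for node, held in sorted(placement.items()):
    print(node, [index for index, _ in held])
```

With two copies of every block on different nodes, any single node can fail without losing data, and adding more nodes to the list adds capacity.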

To implement the concept of distributed storage, we use Hadoop.

What is HADOOP?

Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.

Hadoop consists of four main modules:

Hadoop Distributed File System (HDFS) – A distributed file system that runs on standard or low-end hardware. HDFS provides better data throughput than traditional file systems, in addition to high fault tolerance and native support of large datasets.

Yet Another Resource Negotiator (YARN) – Manages and monitors cluster nodes and resource usage. It schedules jobs and tasks.

MapReduce – A framework that helps programs perform parallel computation on data. The map task takes input data and converts it into a dataset of key-value pairs. The output of the map task is consumed by reduce tasks, which aggregate it to produce the desired result.

Hadoop Common – Provides common Java libraries that can be used across all modules.
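To make the map and reduce steps described above more concrete, here is a minimal word-count sketch in plain Python. It runs in a single process and does not use Hadoop's own APIs; the function names map_phase, shuffle, and reduce_phase are hypothetical and only mirror the stages of the model.

```python
from collections import defaultdict


def map_phase(line: str):
    # Emit a (word, 1) key-value pair for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group values by key, as the framework would do between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Aggregate all values for one key into a single result.
    return key, sum(values)


lines = ["big data is big", "data is everywhere"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster, the map calls would run in parallel on the nodes that hold each block of input, and the grouped pairs would be sent over the network to the reduce tasks.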

"A report from McKinsey & Co. stated that by 2009, companies with more than 1,000 employees already had more than 200 terabytes of data of their customer’s lives stored"

Big Companies that use Hadoop:

1. Facebook:

Arguably the world’s most popular social media network with more than two billion monthly active users worldwide, Facebook stores enormous amounts of user data, making it a massive data wonderland. Consider adding that startling amount of stored data to the rapid growth of data provided to social media platforms since then. There are trillions of tweets, billions of Facebook likes, and other social media sites like Snapchat, Instagram, and Pinterest are only adding to this social media data deluge. Every 60 seconds, 136,000 photos are uploaded, 510,000 comments are posted, and 293,000 status updates are posted. That is a LOT of data.

There are over 290 million Facebook users in India alone, making it the leading country in terms of Facebook audience size. To put this into context, if India’s Facebook audience were a country then it would be ranked fourth in terms of the largest population worldwide. Apart from India, there are several other markets with more than 100 million Facebook users each: The United States, Indonesia, and Brazil with 190 million, 140 million, and 130 million Facebook users respectively.

Facebook usage statistics:

30% of internet users use Facebook more than once a day.

45% of people get news from Facebook.

40% of people said they would share their health data with Facebook.

There are an estimated 81 million fake Facebook profiles.

The most popular page is Facebook’s main page with 213m likes. Samsung is second with 159m, while Cristiano Ronaldo is third with 122m.

Facebook accounts for 62% of social logins made by consumers to sign into the apps and websites of publishers and brands.

200 million people use Facebook Lite – the app for the developing world’s slow connections.

Facebook takes up 22% of the internet time Americans spend on mobile devices, compared with 11% on Google search and YouTube combined.

Users spend an average of 20 minutes per day on the site.

In a month, the average user likes 10 posts, makes 4 comments, and clicks on 8 ads.

Hive is Facebook’s data warehouse, with 300 petabytes of data.

Facebook generates 4 new petabytes of data per day.

Facebook now sees 100 million hours of daily video watch time.

Users generate 4 million likes every minute.

More than 250 billion photos have been uploaded to Facebook.

This equates to 350 million photos per day.

Facebook relies heavily on technologies like Hadoop. It runs a massive installation of Hadoop software, a highly scalable open-source framework that uses bundles of low-cost servers to solve problems. The company even designs its own hardware in-house for this purpose. Mr. Rudin says, “The analytic process at Facebook begins with a 300 petabyte data analysis warehouse. To answer a specific query, data is often pulled out of the warehouse and placed into a table so that it can be studied. The team also built a search engine that indexes data in the warehouse. These are just some of the many technologies that Facebook uses to manage and analyze information.”

“Facebook runs the world’s largest Hadoop cluster," says Jay Parikh, Vice President of Infrastructure Engineering, Facebook.

Facebook runs the biggest Hadoop cluster, which spans more than 4,000 machines and stores hundreds of millions of gigabytes. This extensive cluster provides some key abilities to developers:

The developers can freely write map-reduce programs in any language.
SQL has been integrated to process extensive data sets, as most of the data in Hadoop’s file system are in table format. Hence, it becomes easily accessible to developers with small subsets of SQL.

Hadoop provides a common infrastructure for Facebook with efficiency and reliability. From searching, log processing, recommendation systems, and data warehousing to video and image analysis, Hadoop empowers this social networking platform in every way possible. Facebook developed its first user-facing application, Facebook Messenger, based on the Hadoop database, i.e., Apache HBase, which has a layered architecture that supports a plethora of messages in a single day.

2. Google:

More than 3.7 billion humans use the internet (that’s a growth rate of 7.5 percent over 2016).
On average, Google now processes more than 40,000 searches EVERY second (3.5 billion searches per day)!
While 77% of searches are conducted on Google, it would be remiss not to remember other search engines are also contributing to our daily data generation. Worldwide there are 5 billion searches a day.
Google handles a staggering 1.2 trillion searches every year.
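The per-second, per-day, and per-year search figures above can be cross-checked with a little arithmetic. The short Python sketch below assumes a constant rate of 40,000 searches per second all year round, which is a simplification.

```python
SECONDS_PER_DAY = 24 * 60 * 60

searches_per_second = 40_000  # figure quoted in the text
searches_per_day = searches_per_second * SECONDS_PER_DAY
searches_per_year = searches_per_day * 365  # assumes a constant rate all year

print(f"{searches_per_day / 1e9:.2f} billion searches per day")
print(f"{searches_per_year / 1e12:.2f} trillion searches per year")
```

The results land close to the quoted 3.5 billion searches per day and 1.2 trillion searches per year, so the three figures are consistent with one another.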

Google invented the software stack that Hadoop reimplemented and made open source. So naturally, for any use case where one might want to use Hadoop, Google already has an internal alternative that leverages its existing infrastructure. Google also offers cloud-based managed services that are built on top of Hadoop.

3. IBM:

According to IBM, 2.5 quintillion bytes of data are created every day, so much that 90% of the data in the world today has been created in the last 2 years alone.

IBM InfoSphere BigInsights makes it simpler for people to use Hadoop and build big data applications. It enhances this open-source technology to withstand the demands of your enterprise, adding administrative, discovery, development, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research. The result is that you get a more developer- and user-friendly solution for complex, large-scale analytics.

IBM offers cloud-based managed services that are built on top of Hadoop.

4. Microsoft:

There are more than 1 billion devices running Windows 10.
Every day, Microsoft analyzes over 6.5 trillion signals in order to identify emerging threats and protect customers.

“Quickly build a Hadoop cluster in minutes when you need it, and delete it when your work is done. Choose the right cluster size to optimize for time to insight or cost. Seamlessly integrate HDInsight into your existing analysis workflows with Windows Azure PowerShell and Windows Azure Command-Line Interface.” Microsoft offers cloud-based managed services that are built on top of Hadoop.

5. Amazon:

Source: J. Clement, “Total global visitor traffic to Amazon.com 2020,” published Jun 18, 2020.

During May 2020, Amazon.com had over 2.5 billion combined desktop and mobile visits, up from 2.01 billion visits in February 2020. The platform is by far the most visited e-commerce property in the United States.

Online retail in the United States is constantly growing. In the first quarter of 2020, e-commerce sales accounted for 11.8 percent of retail sales in the United States. During that quarter, U.S. retail e-commerce sales amounted to over 160.33 billion U.S. dollars. Some retail categories are performing stronger in terms of e-commerce sales than others: in the apparel and accessories segment, 26.1 percent of total retail revenue was generated online. However, digital-only accounted for two percent of U.S. grocery revenues.

As the most popular online shopping platform, Amazon’s influence on consumers’ shopping behavior extends beyond its own website. According to a February 2019 survey of U.S. Amazon users, 66 percent of respondents stated that they started their online product research on Amazon. Of course, Amazon is popular not only for product research but ultimately also for making the purchase. The most important factors driving users to purchase via Amazon are pricing and low shipping costs.

Combined desktop and mobile visits to Amazon.com from May 2019 to May 2020

Amazon EMR programmatically installs and configures applications in the Hadoop project, including Hadoop MapReduce, YARN, HDFS, and Apache Tez across the nodes in your cluster.

6. YouTube:

YouTube usage more than tripled from 2014 to 2016, with users uploading 400 hours of new video each minute of every day! Now, in 2019, users are watching 4,333,560 videos every minute.
300 hours of video are uploaded to YouTube every minute!
Viewers watch 1 billion hours of content on the platform every day according to official YouTube statistics from February 2017.
YouTube accounts for an astonishing 25% of global mobile traffic (Facebook only manages 17%), and 15% of broadband traffic.

Hours of video uploaded to YouTube every minute as of May 2019

YouTube is one of the best examples of a service that produces a massive amount of data in a brief period. Hadoop and MapReduce are used to extract and analyze this data and to measure performance. Hadoop offers consistent storage through HDFS (Hadoop Distributed File System) and analysis through MapReduce. MapReduce is a programming model and a corresponding implementation for processing large data sets. Big Data on YouTube can be analyzed using these Hadoop and MapReduce techniques.

And many more companies use Hadoop.

Big Stats and Facts About Big Data (Editor’s Choice)

At the beginning of 2020, the digital universe was estimated to consist of 44 zettabytes of data.
By 2025, approximately 463 exabytes would be created every 24 hours worldwide.
As of June 2019, there were more than 4.5 billion people online.
80% of digital content is unavailable in nine out of every ten languages.
In 2019, Google processed 3.7 million queries, Facebook saw one million logins, and YouTube recorded 4.5 million videos viewed every 60 seconds.
Netflix’s content volume in 2019 outnumbered that of the US TV industry in 2005.
By 2025, there would be 75 billion Internet-of-Things (IoT) devices in the world.
By 2030, nine in every ten people aged six and above would be digitally active.

How big is Big Data?

1. As of 2013, experts believed that 90% of the world’s data was generated from 2011 to 2012.

This is still one of the most mind-blowing data stats to date. It does justice to the explosion of data growth in the blink of an eye since the beginning of the Information Age.

2. In 2018, more than 2.5 quintillion bytes of data were created every day.

It was the year when Americans used over 3.1 million gigabytes of internet data and 1.25 new bitcoins were “minted” every minute.

3. The amount of data in the world was estimated to be 44 zettabytes at the dawn of 2020.

To put things into perspective, a zettabyte is 1,000 to the seventh power bytes. In other words, one zettabyte is a 1 followed by 21 zeroes.

Such an insane number was attainable only by adding up the total amount of data generated each day by social media sites, financial institutions, medical facilities, shopping platforms, automakers, and others.

4. At the beginning of 2020, the number of bytes in the digital universe was 40 times more than the number of stars in the observable universe.

The exponential growth of big data still does not compare to the Big Bang, but it is spectacular. How much data is produced every day in 2019? Read the statistic above again, and let it sink in for a minute.

5. By 2025, the amount of data generated each day is expected to reach 463 exabytes globally.

An exabyte is 1,000 to the sixth power bytes. Good luck doing the math to figure out and wrap your head around the overall amount of data on the internet that would be created five years from now.

6. Google, Facebook, Microsoft, and Amazon are custodians of at least 1,200 petabytes of people’s information.

Some of them could acquire more global data created daily as they branch out. For instance, Facebook wants to establish a new financial world order with the launch of Libra, a stablecoin-based payment network.

In the event that the social media titan’s plan comes to fruition, it could rival the power of central banks. Facebook could then exercise some monetary policies as it sees fit in order to manipulate and stabilize the value of its own cryptocurrency.

7. As of June 2019, there were over 4.5 billion internet users in the world.

In other words, nearly 60% of all the people on the planet at that time were digitally active. The internet penetration rates in North America and Europe were both almost 90%, but the largest group of people on the Web came from Asia, even though only 50.7% of all Asians have gone online! Africa, the planet’s second-most-populous continent, has been exhibiting the fastest growth, with internet usage growing by 11,481% from 2000 to 2019.

With these statistics in mind, it is no wonder why the amount of data being created daily is getting harder to comprehend.
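For readers who want to check the zettabyte and exabyte figures quoted above, here is a small Python sketch of the unit arithmetic. It assumes decimal (SI) units, where each step up is a factor of 1,000; the variable names are made up for this example.

```python
# Decimal (SI) byte units: each step up is a factor of 1,000.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]
BYTES_PER = {name: 1000 ** (i + 1) for i, name in enumerate(UNITS)}

digital_universe_2020 = 44 * BYTES_PER["zettabyte"]  # 44 ZB, figure from the text
daily_data_2025 = 463 * BYTES_PER["exabyte"]         # 463 EB per day, figure from the text

# Number of zeroes in one zettabyte written out in bytes.
print(len(str(BYTES_PER["zettabyte"])) - 1)

# Zettabytes per year if 463 EB were created every single day.
print(daily_data_2025 * 365 / BYTES_PER["zettabyte"])
```

Python integers have arbitrary precision, so even numbers with more than twenty digits are computed exactly here.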

8. 80% of online content is available in just one-tenth of all languages.

One of the many reasons why billions of people are still offline is that they struggle to find content they can understand to begin with. As a result, the generation of data on the internet has not been as big as it should be.

9. Google handles a staggering 1.2 trillion searches every year.

So, how much data does Google process every day? Research stats show that it is more or less 3.5 billion queries every 24 hours. Although the leading search engine seems invincible at this point, it is surprisingly not peerless. Amazon’s ad revenue share in the US is poised to reach 15.9% by 2021 at the expense of Google.

10. Despite processing just 6.2% of all searches in the US, Bing makes almost $5 billion in ad revenues.

This is nearly three times Twitter’s advertising profit.

While making a dent in the amount of data generated by Google daily is still a pipe dream, Microsoft’s search engine has yet to fade into oblivion. Apple devices may have switched back to Google, but Bing has been the default search engine of most internet properties and pieces of hardware the market leader could not control.

Bing’s story underscores the eye-popping monetary value of data generated daily in the eyes of marketers.

11. The number of apps downloaded from Google Play Store and App Store every 60 seconds jumped to 390,030 in 2019 from 375,000 in 2018.

This stat shows the sources of digital data generated each day continue to multiply. Additionally, app user segmentation is becoming more pronounced, and the lives of app developers and mobile marketers are getting much harder.

12. The world spends almost $1 billion per minute on commodities on the Internet.

This telling statistic demonstrates how much data is created every day without further explanation. A buyer’s online journey, from initial site visit to purchase, is typically well documented.

13. In 2019, the number of emails sent every minute was 188 million.

The figure was an improvement from 2018 when 181 million emails were sent every 60 seconds. While it is still an indicator of information growth, it is evident that this part of the Web is nearing a plateau.

14. Last year, Google tallied 3.7 million queries, Facebook had one million logins, and YouTube saw 4.5 million videos viewed every minute.

All of these figures were higher than the year prior, but the increases were not that significant. Despite not having any shortage of fresh data generated every day, such platforms were probably close to their peak in terms of usage.

These online brands may seem untouchable, but no one is safe in the adapt-or-perish world of the internet. Even they have to continue evolving to keep growing if they are to consistently record a higher amount of data generated every day over time.

15. In 2019, nearly 695,000 hours’ worth of Netflix content was watched per minute across the world.

The number was just 266,000 in 2018. This stat represents the intensifying love of internet users for media streaming. Considering the rise of the company’s share of the total amount of data in the world in 2019, Netflix has further cemented its position as a force to be reckoned with in Hollywood.

16. Netflix released almost 55% more originals in 2019 than it did in 2018.

In the hope of preserving its market share and seeing more data created every day, the streaming-giant-cum-video-production-company added an amazing 371 new movies and TV shows to its vast content collection, excluding children’s shows and films with short theatrical runs. As a matter of fact, Netflix’s 2019 content output beat that of the entire US TV industry in 2005.

17. As of May 2019, 500 hours of video were uploaded to YouTube every minute.

So, how much content is created every day? The simplest answer is countless, as everybody can be a content creator these days and financially succeed as a reviewer, analyst, actor, or any other profession on YouTube.

18. Four petabytes is the estimated amount of new data being generated by Facebook daily.

Instagram’s parent company is still growing on its own. In fact, its number of active users has been on an upward trend since Q1 2011. As of Q3 2019, more than 1.6 billion people logged in on the most popular social networking site every 24 hours. It is safe to presume that the amount of data created on Facebook every day will not go down any time soon.

19. Every 24 hours, 500 million tweets are tweeted on Twitter.

Despite the microblogging site’s character limit, that number still translates to an enormous amount of big data generated daily. Twitter may not have seen a steady rise in monthly active users of late, but the service is still America’s favorite social media outside the Facebook-owned brands.

20. Snaps created on Snapchat fell from 2.4 million per minute in 2018 to 2.1 million in 2019.

The multimedia messaging app had a disappointing year, probably due to its criticized redesign. Moreover, Snapchat usage in the US is projected not to recover in the coming years. Still, the platform could bounce back and increase its amount of data being generated daily per person if it could attract more users beyond North America.

21. More than 347,222 users were scrolling Instagram every 60 seconds in 2019.

Snapchat’s loss was Instagram’s gain. The photo- and video-sharing social networking service had a big year. In 2018, just 174,000 people were using the app per minute. Such a feat helped boost Facebook’s data production daily.

22. 18.1 million text messages were sent every minute through LINE last year.

Such a figure means that the amount of text data created every day hardly increased from 2018, when the app processed 18 million messages.

23. Game streaming has become a global phenomenon, attracting over one billion internet users.

As of January 2019, 30% of internet users played games streamed live online, 23% watched live streams of other games, and 16% watched esports tournaments every month. The amount of data generated daily by passive and active gaming enthusiasts has become gold to developers.

24. As of January 2019, more than 26% of the US population owned a smart speaker.

The smart speaker ownership in America grew by over 40% from the prior 12-month period. Although the country is still not considered among the most “connected” nations in the world, the daily US data production will steadily grow as more and more Americans warm to IoT devices.

25. 5G can elevate data transmission speed by up to 100 times and reduce latency from about 20 milliseconds to one millisecond.

In a future where the adoption of 5G cellular connection is pervasive, the amount of data that could be produced daily is nothing short of unfathomable. Imagine how much content you could consume if you could download an entire season of a TV series in less than a minute.

Right now, 5.6 gigabytes is the average amount of data used per month worldwide. But the arrival of 5G technology has already driven up smartphone data usage dramatically in areas where it is available.

As of June 2019, the average 5G customer in South Korea used 24GB a month. Such data usage is more than twice what typical 4G subscribers consume. Faster internet speeds will almost certainly cause the amount of data created daily to skyrocket.

26. By 2025, there will be 75 billion IoT devices.

Presently, there are just over 26 billion. In 2019, about 180 smart speakers were being shipped every 60 minutes. In 2018, just 67 voice-first devices were. The trend is going to persist.

If you can’t work out how much data is created every day now, you had better learn about and get used to zettabytes and yottabytes. The proliferation of IoT products will flood the digital universe with pieces of data that have never been available before.

Compared to the current data generated each day, more frequent conversations between pieces of connected hardware will form colossal mountains and deep seas of insightful information, since IoT devices automatically produce a digital footprint.

27. By 2030, 90% of people at least six years old on the planet will be online.

With the aid of 5G networks and IoT devices, worldwide internet penetration will continue to surge. One estimate has revealed that, as of January 2019, more than one million new people came online every day, which was a historic first. Naturally, it translated to a greater amount of data generated daily.
