What is Big Data?
Two more Vs have emerged over the past few years: value and veracity.
Data has intrinsic value. But it’s of no use until that value is discovered. Equally important: How truthful is your data—and how much can you rely on it?
Today, big data has become capital. Think of some of the world’s biggest tech companies. A large part of the value they offer comes from their data, which they’re constantly analyzing to produce more efficiency and develop new products.
Recent technological breakthroughs have exponentially reduced the cost of data storage and compute, making it easier and less expensive to store more data than ever before. With an increased volume of big data now cheaper and more accessible, you can make more accurate and precise business decisions.
Finding value in big data isn’t only about analyzing it (which is a whole other benefit). It’s an entire discovery process that requires insightful analysts, business users, and executives who ask the right questions, recognize patterns, make informed assumptions, and predict behavior.
The History of Big Data:
Although the concept of big data itself is relatively new, the origins of large data sets go back to the 1960s and '70s when the world of data was just getting started with the first data centers and the development of the relational database.
Around 2005, people began to realize just how much data users generated through Facebook, YouTube, and other online services. Hadoop (an open-source framework created specifically to store and analyze big data sets) was developed that same year. NoSQL also began to gain popularity during this time.
The development of open-source frameworks such as Hadoop (and, more recently, Spark) was essential for the growth of big data because they make big data easier to work with and cheaper to store. In the years since then, the volume of big data has skyrocketed. Users are still generating huge amounts of data, but it’s not just humans who are doing it.
With the advent of the Internet of Things (IoT), more objects and devices are connected to the internet, gathering data on customer usage patterns and product performance. The emergence of machine learning has produced still more data.
While big data has come far, its usefulness is only just beginning. Cloud computing has expanded big data possibilities even further. The cloud offers truly elastic scalability, where developers can simply spin up ad hoc clusters to test a subset of data.
Benefits of Big Data and Data Analytics:
- Big data makes it possible for you to gain more complete answers because you have more information.
- More complete answers mean more confidence in the data, which means a completely different approach to tackling problems.
"The solution to the Big Data problem is Distributed storage concept."
Distributed Storage Concept:
A distributed storage system is an infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.
Distributed storage is the basis for massively scalable cloud storage systems like Amazon S3 and Microsoft Azure Blob Storage, as well as on-premises distributed storage systems like Cloudian Hyperstore.
Distributed storage systems can store several types of data:
- Files—a distributed file system allows devices to mount a virtual drive, with the actual files distributed across several machines.
- Block storage—a block storage system stores data in volumes known as blocks. This is an alternative to a file-based structure that provides higher performance. A common distributed block storage system is a Storage Area Network (SAN).
- Objects—a distributed object storage system wraps data into objects, identified by a unique ID or hash.
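The object model in the last bullet can be illustrated with a minimal sketch. It assumes a single in-memory dictionary stands in for the cluster and that the object ID is a SHA-256 hash of the content; the `ObjectStore` class and its method names are hypothetical, not the API of any real storage product.

```python
import hashlib


class ObjectStore:
    """Toy in-memory object store (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps object ID (hex digest) to the raw bytes of the object.
        self._objects = {}

    def put(self, data):
        """Store the bytes and return an ID derived from their content."""
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = data
        return object_id

    def get(self, object_id):
        """Return the bytes stored under the given ID."""
        return self._objects[object_id]


store = ObjectStore()
oid = store.put(b"hello big data")
print(oid[:12], store.get(oid))
```

Because the ID is derived from the content, storing the same bytes twice yields the same ID, which is one reason hash-based identifiers are a common choice for object stores.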
Distributed storage systems have several advantages:
- Scalability—the primary motivation for distributing storage is to scale horizontally, adding more storage space by adding more storage nodes to the cluster.
- Redundancy—distributed storage systems can store more than one copy of the same data, for high availability, backup, and disaster recovery purposes.
- Cost—distributed storage makes it possible to use cheaper, commodity hardware to store large volumes of data at a low cost.
- Performance—distributed storage can offer better performance than a single server in some scenarios, for example, it can store data closer to its consumers, or enable massively parallel access to large files.
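The scalability and redundancy points above can be sketched in a few lines. This is a simplified illustration, assuming fixed-size blocks and a round-robin placement rule; the function names, node names, and placement policy are assumptions made for the example and do not describe how any particular system places its data.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replicas):
    """Assign each block index to `replicas` distinct nodes, round-robin."""
    if replicas > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    placement = {}
    for block in range(num_blocks):
        # Consecutive offsets modulo the node count are always distinct nodes.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replicas)]
    return placement


data = b"x" * 10
blocks = split_into_blocks(data, 4)
nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
placement = place_blocks(len(blocks), nodes, replicas=2)

for block, holders in placement.items():
    print(block, len(blocks[block]), holders)
```

Adding a node to the `nodes` list adds capacity (horizontal scaling), and raising `replicas` means each block survives the loss of more machines, at the cost of extra storage.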
Big Companies that use Hadoop:
1. Facebook:
There are over 290 million Facebook users in India alone, making it the leading country in terms of Facebook audience size. To put this into context, if India’s Facebook audience were a country then it would be ranked fourth in terms of the largest population worldwide. Apart from India, there are several other markets with more than 100 million Facebook users each: The United States, Indonesia, and Brazil with 190 million, 140 million, and 130 million Facebook users respectively.
Facebook usage statistics:
30% of internet users use Facebook more than once a day.
45% of people get news from Facebook.
40% of people said they would share their health data with Facebook.
There are an estimated 81 million fake Facebook profiles.
The most popular page is Facebook’s main page with 213m likes. Samsung is second with 159m, while Cristiano Ronaldo is third with 122m.
Facebook accounts for 62% of social logins made by consumers to sign into the apps and websites of publishers and brands.
200 million people use Facebook Lite – the app for the developing world’s slow connections.
Facebook takes up 22% of the internet time Americans spend on mobile devices, compared with 11% on Google search and YouTube combined.
Users spend an average of 20 minutes per day on the site.
In a month, the average user likes 10 posts, makes 4 comments, and clicks on 8 ads.
Hive is Facebook’s data warehouse, with 300 petabytes of data.
Facebook generates 4 new petabytes of data per day.
Facebook now sees 100 million hours of daily video watch time.
Users generate 4 million likes every minute.
More than 250 billion photos have been uploaded to Facebook.
This equates to 350 million photos per day.
Facebook relies heavily on technologies like Hadoop. It runs a massive installation of Hadoop software, a highly scalable open-source framework that uses clusters of low-cost servers to solve problems. The company even designs its own in-house hardware for this purpose. Mr. Rudin says, “The analytic process at Facebook begins with a 300 petabyte data analysis warehouse. To answer a specific query, data is often pulled out of the warehouse and placed into a table so that it can be studied. The team also built a search engine that indexes data in the warehouse. These are just some of the many technologies that Facebook uses to manage and analyze information.”
“Facebook runs the world’s largest Hadoop cluster,” says Jay Parikh, Vice President of Infrastructure Engineering at Facebook.
Facebook’s Hadoop cluster spans more than 4,000 machines and stores hundreds of millions of gigabytes of data. This extensive cluster provides some key abilities to developers:
- The developers can freely write map-reduce programs in any language.
- SQL has been integrated to process extensive data sets, as most of the data in Hadoop’s file system are in table format. Hence, it becomes easily accessible to developers with small subsets of SQL.
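To make the map-reduce idea in the list above concrete, here is a minimal single-process sketch of a word count. It only imitates the map, shuffle, and reduce phases in plain Python; it does not use Hadoop or any Hadoop API, and the function names are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is capital"])
print(counts)
```

In a real cluster the map calls run in parallel on the machines that hold the input blocks, and the shuffle step moves data across the network; the sketch keeps everything in one process so the data flow is easy to follow.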
Hadoop provides a common infrastructure for Facebook with efficiency and reliability. From search, log processing, recommendation systems, and data warehousing to video and image analysis, Hadoop powers this social networking platform in many ways. Facebook developed its first user-facing application, Facebook Messenger, on Apache HBase, the Hadoop database, which has a layered architecture that supports a plethora of messages in a single day.
2. Google:
- More than 3.7 billion humans use the internet (that’s a growth rate of 7.5 percent over 2016).
- On average, Google now processes more than 40,000 searches EVERY second (3.5 billion searches per day)!
- While 77% of searches are conducted on Google, it would be remiss not to remember other search engines are also contributing to our daily data generation. Worldwide there are 5 billion searches a day.
- Google handles a staggering 1.2 trillion searches every year.
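The per-second, per-day, and per-year figures above can be cross-checked with simple arithmetic. The sketch below assumes a constant rate of 40,000 searches per second and a 365-day year, which are simplifications.

```python
SEARCHES_PER_SECOND = 40_000          # figure quoted in the list above
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

per_day = SEARCHES_PER_SECOND * SECONDS_PER_DAY
per_year = per_day * 365              # assumes a non-leap year

print(f"{per_day:,} searches per day")    # about 3.5 billion
print(f"{per_year:,} searches per year")  # about 1.2 to 1.3 trillion
```

The results, roughly 3.46 billion per day and 1.26 trillion per year, are in line with the rounded figures of 3.5 billion and 1.2 trillion quoted above.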
3. IBM:
- According to IBM, 2.5 quintillion bytes of data are created every day, so much that 90% of the data in the world today has been created in the last 2 years alone.
4. Microsoft:
- There are more than 1 billion devices running Windows 10.
- Every day, Microsoft analyzes over 6.5 trillion signals in order to identify emerging threats and protect customers.
“Quickly build a Hadoop cluster in minutes when you need it, and delete it when your work is done. Choose the right cluster size to optimize for time to insight or cost. Seamlessly integrate HDInsight into your existing analysis workflows with Windows Azure PowerShell and Windows Azure Command-Line Interface.” Microsoft offers cloud-based managed services that are built on top of Hadoop.
5. Amazon:
- During May 2020, Amazon.com had over 2.5 billion combined desktop and mobile visits, up from 2.01 billion visits in February 2020. The platform is by far the most visited e-commerce property in the United States.
- Online retail in the United States is constantly growing. In the first quarter of 2020, e-commerce sales accounted for 11.8 percent of retail sales in the United States. During that quarter, U.S. retail e-commerce sales amounted to over 160.33 billion U.S. dollars. Some retail categories are performing stronger in terms of e-commerce sales than others: in the apparel and accessories segment, 26.1 percent of total retail revenue was generated online. However, digital-only accounted for two percent of U.S. grocery revenues.
- As the most popular online shopping platform, Amazon’s influence on consumers’ shopping behavior extends beyond its own website. According to a February 2019 survey of U.S. Amazon users, 66 percent of respondents stated that they started their online product research on Amazon. Of course, Amazon is not only popular for product research but ultimately also for making the purchase. The most important factors driving users to purchase via Amazon are pricing and low shipping costs.
Amazon EMR programmatically installs and configures applications in the Hadoop project, including Hadoop MapReduce, YARN, HDFS, and Apache Tez across the nodes in your cluster.
6. YouTube:
- YouTube usage more than tripled from 2014 to 2016, with users uploading 400 hours of new video each minute of every day. In 2019, users were watching 4,333,560 videos every minute.
- 300 hours of video are uploaded to YouTube every minute!
- Viewers watch 1 billion hours of content on the platform every day, according to official YouTube statistics from February 2017.
- YouTube accounts for an astonishing 25% of global mobile traffic (Facebook only manages 17%) and 15% of broadband traffic.
- At the beginning of 2020, the digital universe was estimated to consist of 44 zettabytes of data.
- By 2025, approximately 463 exabytes would be created every 24 hours worldwide.
- As of June 2019, there were more than 4.5 billion people online.
- 80% of digital content is unavailable in nine out of every ten languages.
- In 2019, Google processed 3.7 million queries, Facebook saw one million logins, and YouTube recorded 4.5 million videos viewed every 60 seconds.
- Netflix’s content volume in 2019 outnumbered that of the US TV industry in 2005.
- By 2025, there would be 75 billion Internet-of-Things (IoT) devices in the world.
- By 2030, nine in every ten people aged six and above would be digitally active.
1. As of 2013, experts believed that 90% of the world’s data was generated from 2011 to 2012.
This is still one of the most mind-blowing data stats to date. It does justice to the explosion of data growth in the blink of an eye since the beginning of the Information Age.
2. In 2018, more than 2.5 quintillion bytes of data were created every day.
It was the year when Americans used over 3.1 million gigabytes of internet data and 1.25 new bitcoins were “minted” every minute.
3. The amount of data in the world was estimated to be 44 zettabytes at the dawn of 2020.
To put things into perspective, a zettabyte is 1,000 to the seventh power bytes. In other words, one zettabyte has 21 zeroes.
Such an insane number was attainable only by adding up the total amount of data generated each day by social media sites, financial institutions, medical facilities, shopping platforms, automakers, and others.
4. At the beginning of 2020, the number of bytes in the digital universe was 40 times more than the number of stars in the observable universe.
The exponential growth of big data still does not compare to the Big Bang, but it is spectacular. How much data is produced every day in 2019? Read the statistic above again, and let it sink in for a minute.
5. By 2025, the amount of data generated each day is expected to reach 463 exabytes globally.
An exabyte is 1,000 to the sixth power bytes. Good luck doing the math to figure out and wrap your head around the overall amount of data on the internet that would be created five years from now.
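The unit arithmetic behind the zettabyte and exabyte figures can be worked through in a short sketch. It assumes decimal (SI) units, where each step up is a factor of 1,000, and it uses the 44-zettabyte and 463-exabyte estimates quoted in this article; the final comparison is only an illustration of scale, not a forecast.

```python
# Decimal (SI) byte units: each unit is 1,000 times the previous one.
EXABYTE = 1000 ** 6     # 10**18 bytes
ZETTABYTE = 1000 ** 7   # 10**21 bytes

digital_universe_2020 = 44 * ZETTABYTE   # estimate quoted for early 2020
daily_output_2025 = 463 * EXABYTE        # estimate quoted for 2025

# Count the zeroes in one zettabyte written out in full.
zeroes_in_zettabyte = len(str(ZETTABYTE)) - 1

# Days of 2025-level output needed to match the whole 2020 digital universe.
days_to_match_2020 = digital_universe_2020 / daily_output_2025

print(zeroes_in_zettabyte)
print(round(days_to_match_2020))
```

Under these assumptions, about 95 days of data creation at the projected 2025 rate would equal everything estimated to exist at the start of 2020.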
6. Google, Facebook, Microsoft, and Amazon are custodians of at least 1,200 petabytes of people’s information.
Some of them could acquire more global data created daily as they branch out. For instance, Facebook wants to establish a new financial world order with the launch of Libra, a stablecoin-based payment network.
In the event that the social media titan’s plan comes to fruition, it could rival the power of central banks. Facebook could then exercise some monetary policies as it sees fit in order to manipulate and stabilize the value of its own cryptocurrency.
7. As of June 2019, there were over 4.5 billion internet users in the world.
In other words, nearly 60% of all the people on the planet at that time were digitally active. The internet penetration rates in North America and Europe were both almost 90%, but the largest group of people on the Web came from Asia, even though only 50.7% of all Asians have gone online! Africa, the planet’s second-most-populous continent, has been exhibiting the fastest growth, with internet usage growing by 11,481% from 2000 to 2019.
With these statistics in mind, it is no wonder why the amount of data being created daily is getting harder to comprehend.
8. 80% of online content is available in just one-tenth of all languages.
One of the many reasons why billions of people are still offline is that they struggle to find content they can understand to begin with. As a result, the generation of data on the internet has not been as big as it should be.
9. Google handles a staggering 1.2 trillion searches every year.
So, how much data does Google process every day? Research stats show that it is more or less 3.5 billion queries every 24 hours. Although the leading search engine seems invincible at this point, it is surprisingly not peerless. Amazon’s ad revenue share in the US is poised to reach 15.9% by 2021 at the expense of Google.
10. Despite processing just 6.2% of all searches in the US, Bing makes almost $5 billion in ad revenues.
This is nearly three times Twitter’s advertising profit.
While making a dent on the amount of data generated by Google daily is still a pipedream, Microsoft’s search engine has yet to fade into oblivion. Apple devices may have switched back to Google, but Bing has been the default search engine of most internet properties and pieces of hardware the market leader could not control.
Bing’s story underscores the eye-popping monetary value of data generated daily in the eyes of marketers.
11. The number of apps downloaded from Google Play Store and App Store every 60 seconds jumped to 390,030 in 2019 from only 375,000 in 2018.
This stat shows the sources of digital data generated each day continue to multiply. Additionally, app user segmentation is becoming more pronounced, and the lives of app developers and mobile marketers are getting much harder.
12. The world spends almost $1 billion per minute on commodities on the Internet.
This telling statistic demonstrates how much data is created every day without further explanation. A buyer’s online journey, from initial site visit to purchase, is typically well documented.
13. In 2019, the number of emails sent every minute was 188 million.
The figure was an improvement from 2018 when 181 million emails were sent every 60 seconds. While it is still an indicator of information growth, it is evident that this part of the Web is nearing a plateau.
14. Last year, Google tallied 3.7 million queries, Facebook had one million logins, and YouTube saw 4.5 million videos viewed every minute.
All of these figures were higher than the year prior, but the increases were not that significant. Despite not having any shortage of fresh data generated every day, such platforms were probably close to their peak in terms of usage.
These online brands may seem untouchable, but no one is safe in the adapt-or-perish world of the internet. Even they have to continue evolving to keep growing if they are to consistently record a higher amount of data generated every day over time.
15. In 2019, nearly 695,000 hours’ worth of Netflix content was watched per minute across the world.
The number was just 266,000 in 2018. This stat represents the intensifying love of internet users for media streaming. Considering the rise of the company’s share of the total amount of data in the world in 2019, Netflix has further cemented its position as a force to be reckoned with in Hollywood.
16. Netflix released almost 55% more originals in 2019 than it did in 2018.
In the hope of preserving its market share and seeing more data created every day, the streaming-giant-cum-video-production-company added an amazing 371 new movies and TV shows to its vast content collection, excluding children’s shows and films with short theatrical runs. As a matter of fact, Netflix’s 2019 content output beat that of the entire US TV industry in 2005.
17. As of May 2019, 500 hours of video were uploaded to YouTube every minute.
So, how much content is created every day? The simplest answer is countless, as everybody can be a content creator these days and financially succeed as a reviewer, analyst, actor, or any other profession on YouTube.
18. Four petabytes is the estimated amount of new data being generated by Facebook daily.
Instagram’s parent company is still growing on its own. In fact, its number of active users has been on an upward trend since Q1 2011. As of Q3 2019, more than 1.6 billion people logged in on the most popular social networking site every 24 hours. It is safe to presume that the amount of data created on Facebook every day will not go down any time soon.
19. Every 24 hours, 500 million tweets are tweeted on Twitter.
Despite the microblogging site’s character limit, that number still translates to an enormous amount of big data generated daily. Twitter may not have seen a steady rise in monthly active users of late, but the service is still America’s favorite social media outside the Facebook-owned brands.
20. Snaps created on Snapchat fell from 2.4 million per minute in 2018 to 2.1 million in 2019.
The multimedia messaging app had a disappointing year, probably due to its criticized redesign. Moreover, Snapchat usage in the US is projected not to recover in the coming years. Still, the platform could bounce back and increase its amount of data being generated daily per person if it could attract more users beyond North America.
21. More than 347,222 users were scrolling Instagram every 60 seconds in 2019.
Snapchat’s loss was Instagram’s gain. The photo- and video-sharing social networking service had a big year. In 2018, just 174,000 people were using the app per minute. Such a feat helped boost Facebook’s data production daily.
22. 18.1 million text messages were sent every minute through LINE last year.
Such a figure means that the amount of text data created every day hardly increased from 2018, when the app processed 18 million messages.
23. Game streaming has become a global phenomenon, attracting over one billion internet users.
As of January 2019, 30% of internet users played games streamed live online, 23% watched live streams of other games, and 16% watched esports tournaments every month. The amount of data generated daily by passive and active gaming enthusiasts has become gold to developers.
24. As of January 2019, more than 26% of the US population owned a smart speaker.
The smart speaker ownership in America grew by over 40% from the prior 12-month period. Although the country is still not considered among the most “connected” nations in the world, the daily US data production will steadily grow as more and more Americans warm to IoT devices.