The impact of Big Data goes far beyond web or IT applications, and its influence on society keeps growing. The capacity to store and analyze data enables the emergence of new information processing and analysis technologies, and it drives progress in many research fields: medical, pharmaceutical, computing, banking or electromechanical (Internet of Things, automotive, etc.).
What is Big Data?
The concept was created to cope with the constant increase in the amount of data. Because today's data volumes are so huge, new storage and analysis solutions have become necessary. According to Gartner, Big Data addresses the issue of the 3 Vs: Volume of data, Variety of data (coming from diverse sources) and Velocity of data collection, storage and analysis. These new requirements led giants such as Google, Yahoo and Facebook to create new technologies.
How has Big Data Emerged?
Three major technical developments have allowed Big Data to emerge and grow. The first is the evolution of storage hardware, with ever larger capacities packed into smaller and smaller devices, and the evolution of storage models, from internal servers inside the enterprise to so-called “cloud” servers, which often offer far more storage capacity than in-house company servers. The second is the shift to clusters of small, replaceable servers forming a distributed system that is resistant to failures. This paradigm was popularized by Google in the early 2000s and is at the origin of the first open source Big Data framework, released about ten years ago: Hadoop. The third revolution, which began in 2009, is the explosion of tools for analyzing, extracting and processing unstructured data, from NoSQL databases to new frameworks linked to the Hadoop ecosystem.
Traditional databases are no longer able to manage such high volumes of information, so big web players such as Facebook, Google, Yahoo or LinkedIn have created frameworks to manage and process large amounts of data, for example through data lakes, where all data from various sources is stored. This data is then split, or separated, to be processed in parallel in order to lighten the computation (in the old model, processing steps ran one after another in a queue) and then reassembled to give the final result. This is the technology that allows fast processing of large volumes of data. Originally developed by Google, it now lives under the Apache flag and is called MapReduce. Here is how the processing breaks down:
Take the classic example of counting how many times each word appears in a text. The MapReduce algorithm distributes the data (here character strings, or words) across several nodes (splitting), each node performs its calculations separately (counting words: Mapping, then Shuffling to group identical keys) and finally the Reduce step consolidates the results of each calculation to produce the final output. Smart!
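To make the idea concrete, here is a minimal Python sketch that simulates the three phases on a single machine; the splits, function names and sample text are illustrative assumptions, not a real distributed implementation.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    # Mapping: each node emits a (word, 1) pair for every word in its split.
    return [(word, 1) for word in split.split()]

def shuffle_phase(pairs):
    # Shuffling: intermediate pairs are grouped by key so all counts for a word end up together.
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return groups

def reduce_phase(groups):
    # Reduce: partial counts for each word are consolidated into the final result.
    return {word: sum(ones) for word, ones in groups.items()}

# Splitting: the input text is divided into chunks handled by different nodes.
splits = ["the quick brown fox jumps over", "the lazy dog chased the fox"]
mapped = list(chain.from_iterable(map_phase(s) for s in splits))
print(reduce_phase(shuffle_phase(mapped)))  # {'the': 3, 'fox': 2, ...}
```

In a real cluster, each phase runs on different machines and the framework takes care of moving the intermediate data between them.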
Why is Big Data Useful?
Today its applications have been widely developed to address the needs quoted above, and you may wonder “Why is it so important?”. Because it makes it possible to tackle several issues such as predictive analysis, i.e. predictive maintenance, sales forecasting or stock management. Real-time analysis is another Big Data application.
Let’s focus on some technologies!
Apache Hadoop
The first and most popular technology is Apache Hadoop, a framework widely used to process large volumes of data. Hadoop includes several components: a storage system, HDFS; a resource and job scheduler, YARN; and a processing framework, MapReduce. One of the most famous use cases of Hadoop is the data lake.
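To sketch how these pieces fit together, here is the same word count expressed as two Hadoop Streaming scripts (Hadoop Streaming lets plain Python programs act as the mapper and reducer); every file name below is an illustrative assumption.

```python
#!/usr/bin/env python3
# mapper.py -- reads raw lines from stdin and emits "word<TAB>1" for each word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- receives mapper output sorted by key and sums the counts per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Such scripts are typically submitted with the hadoop-streaming jar that ships with Hadoop, with input and output pointing at HDFS directories while YARN schedules the tasks across the cluster; they can also be tested locally with `cat data.txt | ./mapper.py | sort | ./reducer.py`.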
Batch processing
Batch processing handles a bounded set of data: the job runs until there is no more input to consume. Incremental batch runs make it possible for the architecture to take new data into account without reprocessing what has already been handled. The results only appear at the end of the processing. MapReduce (in its Hadoop implementation) and Apache Spark are examples of batch processing engines.
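As a sketch of a batch job, here is a small PySpark program that reads a bounded set of files, computes word counts and only writes the result once the whole input has been consumed; the input and output paths are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lower, split

spark = SparkSession.builder.appName("batch-wordcount").getOrCreate()

# Bounded input: every file that currently exists under this (hypothetical) path.
lines = spark.read.text("hdfs:///datalake/articles/*.txt")

words = lines.select(explode(split(lower(col("value")), r"\s+")).alias("word"))
counts = words.where(col("word") != "").groupBy("word").count()

# Results only materialize at the end of the job, once all input has been processed.
counts.write.mode("overwrite").parquet("hdfs:///datalake/word_counts")
spark.stop()
```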
Streaming processing
This is the opposite approach to batch processing. With this method, results are accessible before the processing ends, as the data flows in. Stream processing is an easy-to-deploy solution that improves processing latency, and it is often used to build solutions that need to evolve and scale.
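As a sketch, assuming Spark Structured Streaming and a text stream arriving on a local TCP socket (both assumptions for this demo), the same word count can emit intermediate results continuously instead of waiting for the input to end:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

# Unbounded input: lines arriving on a TCP socket (hypothetical host/port for the demo).
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Updated counts are printed as data arrives, long before the stream ends (if it ever does).
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()
```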
Lambda architecture
It is a mix of batch and stream processing. By relying on batch processing, this architecture balances latency, throughput and fault tolerance: it provides condensed, precomputed views of the historical data and merges them with real-time data at query time.
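A minimal, purely illustrative Python sketch of the idea, assuming page-view counts as the metric: a precomputed batch view is merged with a small real-time view when a query comes in.

```python
from collections import Counter

# Batch layer: condensed view precomputed periodically over the full master dataset
# (the figures below are made up for the example).
batch_view = Counter({"page_a": 10452, "page_b": 7301})

# Speed layer: incremental counts for events that arrived after the last batch run.
realtime_view = Counter()

def on_event(page: str) -> None:
    # Hypothetical hook called by the streaming pipeline for each incoming event.
    realtime_view[page] += 1

def query(page: str) -> int:
    # Serving layer: merge the stable batch view with the fresh real-time delta.
    return batch_view[page] + realtime_view[page]

on_event("page_a")
print(query("page_a"))  # 10453: the batch result plus the events seen since the last batch
```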
NoSQL database
Compared with classical relational databases, which struggle to store data at large scale while keeping processing fast, NoSQL databases offer a new approach to data storage that is more flexible, easier to evolve and less sensitive to failures.
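To illustrate that flexibility, here is a short sketch using MongoDB (one possible NoSQL document store, chosen here only as an example) through the pymongo driver; the connection string and collection names are assumptions.

```python
from pymongo import MongoClient

# Hypothetical local MongoDB instance, used only for illustration.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Documents in the same collection do not need a fixed, pre-declared schema.
orders.insert_one({"customer": "alice", "total": 42.0, "items": ["book", "pen"]})
orders.insert_one({"customer": "bob", "total": 10.5, "channel": "mobile"})  # extra field, no migration

for doc in orders.find({"total": {"$gt": 20}}):
    print(doc["customer"], doc["total"])
```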
Cassandra and HBase
They are database management systems that are efficient at reading and writing huge amounts of data. This kind of database can absorb progressive increases in stored data without altering existing functionality.
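A minimal sketch with the DataStax Python driver for Cassandra, assuming a local single-node cluster; the keyspace, table and replication settings are illustrative only.

```python
from datetime import datetime, timezone
from cassandra.cluster import Cluster  # DataStax Python driver for Apache Cassandra

# Hypothetical single-node cluster; real deployments spread partitions over many nodes.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.execute(
    "CREATE TABLE IF NOT EXISTS demo.readings ("
    " sensor_id text, ts timestamp, value double,"
    " PRIMARY KEY (sensor_id, ts))"
)

# Rows are partitioned by sensor_id, so writes and reads stay fast as the table grows.
session.execute(
    "INSERT INTO demo.readings (sensor_id, ts, value) VALUES (%s, %s, %s)",
    ("sensor-1", datetime.now(timezone.utc), 21.7),
)
for row in session.execute("SELECT * FROM demo.readings WHERE sensor_id = %s", ("sensor-1",)):
    print(row.sensor_id, row.ts, row.value)
```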
Cloud computing
Cloud computing is the preferred way of deploying Big Data technologies. Indeed, these technologies demand huge storage and processing capacities, and the Cloud is today the solution best able to handle these volumes: more powerful and less expensive than a classical on-premise setup.
Big Data Use Cases
Big Data finds a wide variety of business applications, but some industries use it more than others.
Insurance companies and banks hold record amounts of customer data (who doesn’t have a bank account?). This makes it possible to analyze customer data (subscriptions, cancellations, geographical locations, cultural variables, gender and so on) in order to predict the population or typical client profile most likely to leave the bank, and thus take steps to reduce the churn rate among these populations.
The same goes for aviation, and mechanical industries in general. Saagie is currently conducting predictive maintenance for a client, predicting when sensitive parts are likely to break or how parts essential to the proper functioning of an airplane, a car or a luggage conveyor will deteriorate, making it possible to raise an alert and inspect the part before it fails (prescriptive analysis).
In the pharmaceutical and medical fields, take for example the analysis of dormant data from scans of patients suffering from cancer: collecting this data, analyzing the medical opinions for each case and applying artificial intelligence to support the doctor’s decision-making is made possible by Saagie. Based on the analysis of hundreds of thousands of scans and the opinions of hundreds of doctors, the system can learn ever more precisely which opinion and treatment is likely to be most effective, relying on high-resolution scan analysis.
Big Data also makes it possible to optimize, both quantitatively and qualitatively, the profile of trial patients in the pharmaceutical industry: based on past data, it can determine the typical profiles most likely to stay in a drug trial until the end. This can help reduce the time to market for some drugs that are currently held back by a lack of trial patients.
The Future of Big Data
Because the Big Data industry is very recent, processing and storage systems are constantly evolving, and technologies appear and disappear at an impressive speed. The MapReduce algorithm was created in 2004 by Google and has been widely used, for example by Yahoo in its Nutch project. In 2008 it became an Apache project as part of Hadoop, but because of its relative slowness, even on modest-sized datasets, its use is progressively being abandoned.
Since the second version of Hadoop, its modular architecture has been able to accept new computation engines alongside the Hadoop Distributed File System (HDFS) and MapReduce. This is how Apache Spark, younger than MapReduce, is gradually overtaking it. Spark can run on top of Hadoop as well as on numerous NoSQL databases. The project has developed rapidly in recent years and has won the approval of a large part of the developer community.
The Main Actors of Big Data
Google and Facebook faced data volume problems very early on, which is why they quite naturally became two structuring players in the field. They are the actors most capable of processing these volumes of data correctly and quickly. From the beginning, Big Data has also interested the giants of the IT sector: software publishers and the historical integrators of software on company servers. These “early adopters” include, for example, Oracle, SAP, IBM and Microsoft, which (given the potential of this market) certainly started a little later than Google and Facebook, but still benefit from the Big Data growth wave.
Hortonworks, Cloudera and MapR
They are the vendors of the main Big Data distributions. Cloudera counts among its team one of the creators of Hadoop, Doug Cutting. Hortonworks is a spin-off of Yahoo and has the strongest open source positioning. MapR takes another approach: the storage and computation engines were replaced, but the Hadoop APIs were preserved in order to ensure compatibility with the existing ecosystem.
Google
Google remains the precursor and mastodon of Big Data technologies, with the development of MapReduce in 2004, for example. Google makes extensive use of this technology for its search-engine indexing algorithms, Google Translate or its satellite imagery, relying on specific mechanisms for load balancing, parallelization and recovery in case of server failure. Today Google uses MapReduce less and less and is moving very strongly towards streaming (real-time processing): it released the open source version of Google Dataflow as Apache Beam.
Amazon
Amazon became one of the biggest players in the field when, in 2009, it launched on Amazon Web Services a technology comparable to Google’s, called Elastic MapReduce. Its advantage is that it separates data exploration from the implementation, management and tuning of Hadoop clusters. The advent of Cloud Computing, launched by Amazon, has made the brand more powerful in the sector by massively democratizing it. But the cost of migration, and of any exit strategy, is very high.
IBM
IBM, like the other big players of the Web, started exploring Big Data by integrating Hadoop and MapReduce processing building blocks into its services.
ODPi
The Open Data Platform Initiative brings together Hortonworks, IBM and Pivotal in order to set standards for Big Data platform implementations. The goal is to give users reversibility guarantees, but it is not yet a success because Cloudera and MapR have not joined the movement.