Even though the terms Big Data, Data Science and Artificial Intelligence keep gaining in popularity, few data initiatives actually materialize. Many use cases are explored, but they struggle to reach industrialization. The Data Fabric concept looks like a promising way to get data projects into production. Data Fabrics have recently appeared in the specialized press (Forbes, NetworkWorld…), each outlet offering its own definition. In this article we will explain how Data Fabrics accelerate AI and Big Data projects.
What is a data fabric?
Due to a lack of time, expertise, technologies or resources, only a few companies are able to manage the entire data life cycle. There are, however, isolated initiatives and a shared feeling that data is a key component of company strategy. That’s where the Data Fabric fits in, providing a way to govern, access and secure data for analytical purposes. In addition, a Data Fabric should also provide the capacity to develop and operationalize enterprise-ready AI business applications.
A Data Fabric is a software solution for managing data. Deployable both in the Cloud and on-premises, it accelerates digital transformation by operationalizing data projects. Sitting at the intersection of a Data Management platform, a Data Science platform and a Data Lake, it presents a coherent set of software solutions and applications, independently of the selected infrastructure. A Data Fabric offers a full-stack solution for managing a project throughout the data life cycle: collection, storage, processing, modelling, deployment, supervision and governance. Whatever the origin of the data, a Data Fabric offers a full range of technologies addressing a wide range of needs.
Beyond the plumbing aspects, a Data Fabric can also expose datasets and applications to various teams. Self-service applications allow business users to gain insights and create business value from data. By opening up access to data and by providing the appropriate tools, companies can become much more data-driven.
Not every platform can be a data fabric
According to Dan Kusnetzky, author of the NetworkWorld article, a Data Fabric should satisfy several criteria:
- Make data available to applications, whatever their size and requirements, while ensuring speed and reliability
- Provide access to data from multiple locations, whether the company’s Data Center, systems at the edge of the network or Cloud computing environments
- Offer a unified data environment: easy access to documents, security, and adapted storage capacity
A data fabric is not a data science platform
Although easily confused at first sight, a Data Fabric and a Data Science platform are two different things. Put simply, a Data Science platform aims at developing algorithms to carry out artificial intelligence projects, including Machine Learning and Deep Learning. Such a platform is not necessarily adapted to business users, who prefer to exploit the results of algorithms within their day-to-day business applications.
On the other hand, a Data Fabric is a full data management ecosystem combining extraction, processing and data consumption. The main objective is to put data projects into production. Since all technologies are unified, different types of users can access the platform, from business users who get easy access to datasets, to technical users who can work in their favorite programming language (R, Python…). To sum up, a Data Science platform is a component of a Data Fabric.
Why opt for a data fabric?
First of all, for its technology scope. For instance, the Saagie Data Fabric supports a wide range of technologies including HDFS, Impala, Hive, Drill, Spark, Sqoop, Talend, Java, Scala, R, Python, Jupyter, Docker, Zeppelin, MongoDB, Elasticsearch and MySQL. By supporting different versions of these technologies and by regularly updating these frameworks, it reduces overall complexity as well as the challenges of maintaining and administering a big data ecosystem. Not to mention that new frameworks are added all the time.
A Data Fabric brings cohesion between the different tools by means of a data pipeline
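To make the pipeline idea concrete, here is a minimal, purely illustrative sketch of three jobs (collect, cleanse, publish) chained into one ordered flow. The function and job names are hypothetical and do not reflect the Saagie API; in a real platform each step might be a Spark, Sqoop or Python job that the fabric orchestrates for you.

```python
# Illustrative sketch only: a few independent jobs chained into one pipeline.
# All names are hypothetical and do not correspond to any specific product API.
from typing import Callable, Dict, List


def collect(ctx: Dict) -> Dict:
    """Gather raw records from a source system (placeholder data)."""
    ctx["raw"] = [{"customer_id": 1, "amount": 42.0}, {"customer_id": 2, "amount": -1.0}]
    return ctx


def cleanse(ctx: Dict) -> Dict:
    """Filter out invalid records before they reach downstream users."""
    ctx["clean"] = [r for r in ctx["raw"] if r["amount"] > 0]
    return ctx


def publish(ctx: Dict) -> Dict:
    """Expose the prepared dataset to analysts and data scientists."""
    print(f"Publishing {len(ctx['clean'])} prepared records")
    return ctx


def run_pipeline(jobs: List[Callable[[Dict], Dict]]) -> Dict:
    """Run each job in order, passing a shared context from step to step."""
    ctx: Dict = {}
    for job in jobs:
        ctx = job(ctx)
    return ctx


if __name__ == "__main__":
    run_pipeline([collect, cleanse, publish])
```

The point is not the code itself but the cohesion: each technology handles one step, and the pipeline ties them together so the whole flow can be scheduled, monitored and governed as a single project.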
A Data Fabric thus appears as a viable alternative to Data Management Platforms. Data processing is made possible independently of where the data is stored (in the Cloud, onsite, on Azure or AWS…), which makes the tool well suited to a wide variety of use cases.
For governance: the standardization of data processing makes it possible to grant access to projects and teams, and to share code, scripts, datasets and applications. By adding organizational processes around the tools, we can start speaking of governance. Governance is often associated with GDPR, that is, securing personal data and controlling its processing and access, but it also carries economic value: governance guarantees the value of information within the enterprise.
Finally, a Data Fabric creates a data community by facilitating collaboration between the members of a data team (data engineers, data scientists, business analysts, data stewards and IT/Ops) and by providing them with the tools to get their projects up and running.
- Data Engineers: the possibility to create jobs and pipelines to collect, cleanse and process data, and make the appropriate datasets available for modeling by data scientists
- Data Scientists: access to the latest programming languages and the capacity to train models on large datasets (a minimal sketch of this hand-off follows the list)
- Data Analysts: easy access to data to create new business views
- Data Stewards: tools to document data, declare personal data and apply tags
- IT/Ops: a secure environment to industrialize and manage data access
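As a rough illustration of the hand-off between roles, the sketch below shows a data scientist training a model on a dataset prepared upstream by a data engineering pipeline. The file path and column names are hypothetical examples, not actual platform artifacts.

```python
# Hypothetical hand-off: a data engineer publishes a prepared dataset,
# a data scientist trains a churn model on it. Path and columns are examples.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Dataset prepared upstream by a data engineering pipeline (hypothetical path).
df = pd.read_csv("prepared/churn_features.csv")

X = df.drop(columns=["churned"])   # features built during data preparation
y = df["churned"]                  # label the business wants to predict

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The trained model (and its score) can then be exposed to business users
# through a self-service application rather than staying in a notebook.
print("Held-out accuracy:", model.score(X_test, y_test))
```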
And to do what?
A great number of use cases can be addressed. At Saagie we help reduce churn, optimize the supply chain and much more. Digital transformation, AI-first… whatever the term used, companies are changing and need to put data at the heart of their strategy, and time is short. To leverage their data, they need a complete yet simple solution. By tackling the DevOps for Data Science challenge, a Data Fabric enables fast decision making by letting a wide range of business users explore, select and analyze data, increasingly in a self-service manner.