It is easy to get lost in the data technology ecosystem. The range of data management solutions is very (perhaps too) rich, and the right choice depends on your needs, data sources, industry, infrastructure, skills, and existing technology stack. This is why we offer a review of the main tools and some advice on how to choose your analysis stack.
Data engineers or data scientists?
Data engineers work with technologies that concern the infrastructure and the overall ecosystem: they need in-depth knowledge of SQL databases, they must be able to configure Spark clusters, and so on. They typically use Linux and Git for development, Hadoop and Spark for the Big Data environment, possibly MapReduce as a computation model, HDFS for distributed storage, and MongoDB and Cassandra on the NoSQL side.
Data scientists, on the other hand, rely on tools more focused on developing machine learning applications: Python, R, Jupyter, TensorFlow, Pandas, etc. The boundary between the two profiles is of course not completely clear-cut, and it can therefore be hard to pin down exactly what differentiates the Data Engineer from the Data Scientist.
Programming languages
When it comes to programming languages, R and Python remain the most widely used, and neither is new. The “war” between R and Python is far from over: both have their strengths.
Historically, R is closer to the statistician community, while Python has been used more by computer scientists. For good reason: R is considered more specialized in statistical analysis, while Python is a general-purpose language, which makes it more popular than its counterpart.
R has a steeper learning curve and fewer options for integration with web and database applications, but it is more efficient for data pre-processing. Concretely, R is better suited to analyzing and exploring datasets, while Python is more efficient at manipulating them.
With Python, you get almost the opposite of R: a more modern syntax that lets you develop applications more quickly, but a slightly less rich statistical ecosystem. For example, the Pandas library is very powerful for processing CSV, JSON, or TSV files, or the result of a SQL query, and converting them into easily manipulated Python objects.
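As a minimal sketch of what that looks like in practice (the file name "sales.csv" is a placeholder), Pandas can load such a file and hand the data back as plain Python objects:

```python
import pandas as pd

# Load a CSV file into a DataFrame (a TSV file would use sep="\t";
# pd.read_json or pd.read_sql cover the JSON and SQL cases).
df = pd.read_csv("sales.csv")

# Convert the DataFrame into plain Python objects that are easy to iterate over
records = df.to_dict(orient="records")   # list of {column: value} dicts

# Or serialize it straight back to JSON
json_payload = df.to_json(orient="records")
```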
Other languages are also used: Scala, Perl, C#, Java…
Analysis
Despite its significant decline, Hadoop remains a standard when it comes to analysis. Massively used, it is today practically synonymous with Big Data thanks to its MapReduce architecture, and Hive adds the ability to run SQL-type queries on top of it. Although Hadoop is losing momentum, it is still recommended when a large amount of data needs to be analyzed without time constraints. If your project requires real-time analysis, choose Spark or Storm instead.
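To illustrate the kind of SQL-type query Hive enables, here is a hedged sketch that goes through Spark's Hive integration from Python; it assumes a configured Hive metastore and a table named "events", both of which are hypothetical:

```python
from pyspark.sql import SparkSession

# Hypothetical batch job: table name and column names are illustrative.
spark = (SparkSession.builder
         .appName("hive-batch-report")
         .enableHiveSupport()          # lets Spark query tables declared in Hive
         .getOrCreate())

# A SQL-type query over data stored in HDFS, run as a batch job
daily_counts = spark.sql("""
    SELECT event_date, COUNT(*) AS nb_events
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")

daily_counts.show()
```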
Development environment
An essential tool for any data scientist is Jupyter. This free web application lets you develop “notebooks” in which you write your code, often without having to install many local libraries. Notebooks make it very easy to share work, collaborate on the same project, and present results as graphs. They also integrate easily with other tools such as Spark, Pandas, and TensorFlow.
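As an illustration, a typical notebook cell might load a dataset with Pandas and render a graph directly under the cell (the file name "metrics.csv" and the column names are placeholders):

```python
import pandas as pd

# In a Jupyter cell: the plot is rendered inline, right below the code
df = pd.read_csv("metrics.csv", parse_dates=["date"])
df.plot(x="date", y="value", title="Daily metric")
```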
Libraries
TensorFlow is one of the most important libraries in today’s data science environments. Released by Google in 2015, it specializes in machine learning and deep learning. Its computational model is a dataflow graph in which the nodes correspond to mathematical operations and the links carry tensors, i.e. multidimensional arrays of data.
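A minimal sketch of that dataflow model, with two small tensors flowing through a matrix-multiplication node:

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])          # a 2x2 tensor
w = tf.constant([[0.5], [0.25]])       # a 2x1 tensor

y = tf.matmul(x, w)                    # a node performing a mathematical operation
print(y.numpy())                       # [[1.0], [2.5]]
```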
Other interesting libraries are NLTK and Keras. NLTK is a standard Python library for natural language processing that offers many useful operations: it can tokenize and categorize sentences, analyze sentiment, recognize named entities, and so on.
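A short sketch of some of those operations (tokenization, part-of-speech tagging, sentiment analysis); the example sentence is arbitrary, and the exact corpus names to download can vary slightly between NLTK versions:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time downloads of the required resources
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("vader_lexicon")

sentence = "The new release is surprisingly fast and pleasant to use."

tokens = nltk.word_tokenize(sentence)       # split into words
tags = nltk.pos_tag(tokens)                 # part-of-speech tags
score = SentimentIntensityAnalyzer().polarity_scores(sentence)  # sentiment

print(tags[:3], score["compound"])
```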
As for Keras, it can be seen as a high-level layer for deep learning, able to run on top of TensorFlow and other deep learning back ends. Its concise syntax makes it very easy to prototype applications.
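For instance, a small feed-forward classifier can be defined and compiled in a handful of lines (the layer sizes here are purely illustrative):

```python
from tensorflow import keras

# A tiny binary classifier over 20 input features
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# Training would then be a single call: model.fit(X_train, y_train, epochs=10)
```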
Finally, Apache Spark is the go-to tool for data processing in a Big Data context. Task execution is fast and efficient, and it makes it possible to explore very large volumes of data in record time. Its native language is Scala, but you can also develop in Java, Python, and R.
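A brief PySpark sketch of that kind of exploration; the HDFS path and column names are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("exploration").getOrCreate()

# Distributed read of JSON logs stored on HDFS (placeholder path)
logs = spark.read.json("hdfs:///data/logs/*.json")

# Transformations are lazy; the work is distributed across the cluster at .show()
(logs
 .filter(F.col("status") >= 500)
 .groupBy("service")
 .count()
 .orderBy(F.desc("count"))
 .show(10))
```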
Continuous integration and collaboration
An important aspect of any data project is collaboration and continuous integration. You cannot do without tools like Jenkins for automating and scheduling tasks, Docker for building containers and deploying applications, Kubernetes for deploying applications in the cloud and orchestrating containers, and long-established tools like Maven and Git for dependency management and version control.
The data ecosystem is complex and very rich. To meet these varied needs, the current trend is towards unification, making it easier for different profiles to develop and work together. A significant effort is being made to orchestrate all project steps, from data collection to production, within a single tool. This is, for example, what Saagie’s DataOps platform aims for: a meta-technology that brings most of these technologies together within the same interface, simplifying the development of data projects and making it much more intuitive.