Once you decide to develop Data Science projects, you will break new ground, and you will need to get it right from the start. Of course, there are many technological challenges on this path, but you will also run into cultural issues that make collaboration hard. Today, we will share some of our own best practices for starting your Data Science journey. These practices will help you move Data Science projects forward without having to rework them later. Mark Twain said: “Habit is habit, and not to be thrown out the window by any man; it must be coaxed downstairs, slowly, one step at a time.” We believe that changing a habit takes more time than learning it well from the beginning. In short: starting from a good foundation is essential. Let’s take a look at five practices that we can apply to any mainstream Machine Learning scenario.
#1 - Capitalizing on Code
From the beginning, think about code reusability. Your first Data Science projects will allow you to test different methods and approaches, and select the best ones to be used in production. As soon as possible, start capitalizing on those snippets of code using a component approach (*), without waiting for definitive proof of efficacy. This approach helps capitalize on Data Science expertise across large organizations.
(*) “A component is a part that combines with other parts to form something bigger” (Cambridge Dictionary). Models built from small building blocks are easier to manage and scale faster.
Little by little, you will build up a code library that you can reuse and share in future use cases.
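As a rough illustration of the component approach, here is a minimal Python sketch. The function names (`drop_missing`, `min_max_scale`, `compose`) are hypothetical and chosen only for this example; the point is that each step is a small, independently reusable part that combines with others into a bigger pipeline.

```python
from typing import Callable, List, Optional

# A "component" in this sketch is simply a function that takes a list and returns a list.
Component = Callable[[list], list]


def drop_missing(values: List[Optional[float]]) -> List[float]:
    """Remove None entries (a hypothetical cleaning component)."""
    return [v for v in values if v is not None]


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values to the [0, 1] range (a hypothetical scaling component)."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def compose(*components: Component) -> Component:
    """Chain components from left to right into one reusable pipeline."""
    def pipeline(values: list) -> list:
        for component in components:
            values = component(values)
        return values
    return pipeline


# Components can be recombined freely for the next use case.
preprocess = compose(drop_missing, min_max_scale)
scaled = preprocess([2.0, None, 4.0, 6.0])
print(scaled)
```

Because every component has the same shape (list in, list out), a new project can reuse `min_max_scale` on its own or plug a different cleaning step into `compose` without touching the rest.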
#2 - From Exploration to Production
Exploration and production architectures must speak the same language, so it is important to limit the duplication of code between the two. It is worth selecting technology that can be moved easily between environments. The goal is to offer the best of modern Data Science while smoothly fitting into your company’s existing IT environment. This way, any business function that follows this architectural pattern will be able to go into production at a marginal cost. Make sure that your choice of technologies does not constrain your deployment: choose your tools wisely.
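One possible way to limit duplication is to keep the scoring logic in a single shared function and pass the fitted model between environments in a portable format. The sketch below assumes a deliberately trivial "model" (predict the mean) and uses JSON as the exchange format; all function names are hypothetical and stand in for whatever your stack provides.

```python
import json
from typing import Dict, List


def fit_mean_model(targets: List[float]) -> Dict[str, float]:
    """Exploration side: 'train' a trivial baseline that always predicts the mean."""
    return {"mean": sum(targets) / len(targets)}


def predict(model: Dict[str, float], n_rows: int) -> List[float]:
    """Shared scoring code, imported unchanged by exploration and production."""
    return [model["mean"]] * n_rows


def export_model(model: Dict[str, float]) -> str:
    """Serialize the model to a portable, language-neutral format."""
    return json.dumps(model)


def import_model(payload: str) -> Dict[str, float]:
    """Production side: restore the model from the exported payload."""
    return json.loads(payload)


model = fit_mean_model([1.0, 2.0, 3.0])
payload = export_model(model)      # handed over by the exploration environment
restored = import_model(payload)   # loaded in the production environment

# Both environments call the same predict function on the same parameters.
print(predict(restored, 2) == predict(model, 2))
```

The design choice here is that only the parameters travel between environments, while `predict` lives in one place, so there is no second implementation to keep in sync.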
#3 - Integrate Software Engineering Best Practices
From day one when you embark on a Data Science project, think of how you will expose your model – the sooner, the better. When you focus on building the best Machine Learning model, it’s very easy to forget that you are simply writing normal code. Software engineering practices such as Continuous Development, Continuous Integration, Unit Testing, Monitoring, Clean Coding, and others will make your code easier to maintain.
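To make the unit testing point concrete, here is a small sketch using Python's built-in `unittest` module. The function under test, `clip_outliers`, is a hypothetical feature-cleaning step invented for this example; the idea is simply that data preparation code can be tested like any other code.

```python
import unittest
from typing import List


def clip_outliers(values: List[float], lower: float, upper: float) -> List[float]:
    """Clamp each value into [lower, upper] (a hypothetical cleaning step)."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return [min(max(v, lower), upper) for v in values]


class ClipOutliersTest(unittest.TestCase):
    def test_values_are_clamped(self):
        self.assertEqual(clip_outliers([-5.0, 0.5, 9.0], 0.0, 1.0), [0.0, 0.5, 1.0])

    def test_invalid_bounds_raise(self):
        with self.assertRaises(ValueError):
            clip_outliers([1.0], 2.0, 1.0)


# Run the tests programmatically; in a real project a test runner
# or a Continuous Integration job would typically do this step.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClipOutliersTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests like these are cheap to write while the logic is fresh, and they can later run automatically on every change as part of Continuous Integration.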
#4 - Develop Code-Driven Machine Learning Models
Implementing a Machine Learning algorithm by hand from scratch can teach you a lot about the algorithm and how it works. The more algorithms you implement, the faster and more efficient you become at it, and the more you will develop and customize your own process. Here are some downsides to this approach to keep in mind:
- Redundancy: Many algorithms already have implementations, some of them very robust, that have been used by hundreds or thousands of researchers and practitioners around the world. Your implementation may be redundant, a duplication of effort already invested and solved by the community.
- Bugs: New code that has few users is more likely to have bugs, even with a skilled programmer and unit tests. Using a standard library will generally reduce the likelihood of having bugs in the algorithm implementation.
There are plenty of open source implementations of algorithms that you can review, diagram, internalize and implement in another language. You will find it beneficial to start with a standard implementation before considering how to change it to be programmatically less elegant, but computationally more efficient.
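The following sketch illustrates one way to apply this advice: use a standard implementation as the reference, then check a custom, more efficient variant against it. Variance is used here only as a small stand-in for a real Machine Learning algorithm. The hand-written version uses Welford's single-pass algorithm, and the reference is `statistics.pvariance` from the Python standard library; the function name `streaming_variance` is our own.

```python
import math
import statistics
from typing import Iterable


def streaming_variance(values: Iterable[float]) -> float:
    """Population variance in a single pass over the data (Welford's algorithm)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count == 0:
        raise ValueError("variance requires at least one value")
    return m2 / count


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

reference = statistics.pvariance(data)  # standard-library baseline
custom = streaming_variance(data)       # less elegant, but works on streams

# The custom version is only trusted once it agrees with the reference.
matches = math.isclose(custom, reference, rel_tol=1e-9)
print(matches)
```

The single-pass version is harder to read than the textbook formula, but it never needs the whole dataset in memory. Keeping the standard implementation around as a reference makes it possible to catch bugs introduced by that kind of optimization.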
#5 - Engage with Open Source
Open source is becoming widely accepted and used in large organizations. Engaging with open source can not only speed up development and improve the quality of your models, but also create an environment that fosters cross-team cooperation.
Open source is much more than just a license and an account on GitHub; it is a culture. When done properly and with a deep understanding, it has broad implications across your organization. It can:
- lead to better communication
- become a new mode to engage employees and customers
- improve engineering quality
- favour access to a broader talent pool.
For instance, PayPal’s path to InnerSource involved a series of corporate decisions to induce a shift in the choice of tools and company culture.
To sum up, these five practices create a foundation for successful data science projects.