Workflow in a data science project

The goal of a data science project is always to identify as much value as possible.

Algorithms improve every day, new analysis possibilities emerge, and self-driving cars may become the new normal, but that doesn’t mean the latest algorithms and robots are the right thing for you. At the bottom line, the value created with data science is measured by the money that models and algorithms save or generate for your company.

That is why we at Borbaki follow a structured seven-step workflow, which keeps us focused on identifying as much value as possible.

To keep an eye on the goal, every one of our data science projects goes through these seven steps:

  1. Business Understanding
  2. Data mining
  3. Data cleaning
  4. Data exploration
  5. Feature engineering
  6. Predictive modeling
  7. Data visualization

This article explains what each step contains and why it is essential for a data science project. It is written for business executives who want to learn about handling a data science project and its important sub-goals. Therefore, it will not go into detail about the different algorithms, programming languages, and programs used.

1. Business Understanding

Understanding the industry, the company’s workflow, and the challenges it faces is the first place to start. The beginning of a data science journey is all about asking the ‘whys’. Asking the ‘whys’ ensures that concrete data supports the company’s strategic decisions, which makes it far more likely that those decisions achieve results.

The Microsoft Azure blog has divided the why questions into five different question types:

  1. How much or how many? (Regression)
  2. Which category? (Classification)
  3. Which group? (Clustering)
  4. Is this weird? (Anomaly detection)
  5. Which option should be taken? (Recommendation)

In step one, we identify the central objectives of the project by finding out what the company really wants to know about itself, its competitors, or its customers. Depending on the type of question asked, the focus of the analysis will differ.

If the question is of the “how much or how many” type, the goal is typically sales forecasting or improving marketing spending. With classification and clustering questions, by contrast, the target would be to generate insight about customer profiles and customer lifetime value.

There is no rule about the type of question, but the most common are questions like “What is each of my customers worth?”, “How can I improve my customer lifetime value?”, or “How can I optimize the workflow in the company?”

Do you want more specific examples of questions you can ask your data? Then our blog post “How to ask the right questions to your data” can hopefully give you the answers.

2. Data mining

When the goal is defined by asking the right question, it is time to collect the necessary data.
In the data mining step, we collect and gather the available data, e.g. COGS and POS data.

There is a close relation between data mining and data cleaning (step 3). Because Borbaki’s primary service is to give companies the possibility to outsource their data science department, the data mining procedure takes longer than if the analysis were done in-house.

At the data mining step, our data scientist considers what data needs to be present to answer the question defined in step 1. Then, once the data requirements are specified based on the available data, we locate the data, figure out a way to obtain it, and decide on the most efficient way to store and access it. This is all done in collaboration with the customer.

3. Data cleaning

As the name indicates, it’s cleaning time. Data cleaning is usually the most time-consuming task in the whole process.

The reason is simply the many possible scenarios that could necessitate cleaning. E.g., there may be an inconsistency within the same column of a dataset: some rows may be labeled 0 or 1 while others are labeled no or yes, which makes it impossible for the algorithm to read the column consistently.
The data in the rows could also be inconsistent in type, meaning some of the 0’s might be integers, whereas others could be strings [1].
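
For readers who want to see what such a cleaning step can look like, here is a minimal sketch in Python. It is an illustration only: the function name normalize_flag, the sample column, and the list of accepted spellings are our own assumptions, not a fixed rule.

```python
def normalize_flag(value):
    """Map 0/1 and yes/no variants (as int or str) to True/False.

    Returns None for anything unrecognized, so it can be reviewed by hand.
    The accepted spellings below are an assumption for this example.
    """
    text = str(value).strip().lower()
    if text in {"1", "yes", "y", "true"}:
        return True
    if text in {"0", "no", "n", "false"}:
        return False
    return None


# Hypothetical column mixing integers, strings, and yes/no labels
raw_column = [0, 1, "no", "yes", "0", " Yes ", "maybe"]
cleaned = [normalize_flag(v) for v in raw_column]
print(cleaned)
```

Values that cannot be interpreted are kept as None instead of being guessed, so they can be discussed with the customer.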

4. Data exploration

Once all errors and inconsistencies in the data are cleaned, we can start analyzing the dataset. The data exploration step is where the brainstorming happens: how can we analyze the data, what overview do we have of the data, and how can we use the data as well as possible?

Of course, all of these thoughts have come up in steps 2 and 3, but now it’s time to take the data through a funnel, so that we are left with only the data best fitted to answer the questions asked in step 1.

It involves pulling up and analyzing a random subset of the data, or creating a histogram or distribution curve, to get a clear overview of the general trends in the data. Based on this new information, we narrow down which data will give the best answers to the question asked.
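
As a rough illustration of these two exploration techniques, the sketch below draws a random subset and prints a simple text histogram. The order values, the sample size, and the bin width of 25 are made up for the example.

```python
import random
from collections import Counter

# Hypothetical order values from a POS system (illustrative numbers only)
order_values = [12, 15, 18, 22, 25, 27, 31, 33, 35, 38,
                41, 44, 47, 52, 58, 63, 71, 86, 94, 120]

# Pull a random subset to inspect by hand (fixed seed so the run is repeatable)
rng = random.Random(42)
sample = rng.sample(order_values, 5)
print("Random subset:", sample)

# Count how many values fall into each bin of width 25
bin_width = 25
histogram = Counter((v // bin_width) * bin_width for v in order_values)

# Print one bar per bin, lowest bin first
for bin_start in sorted(histogram):
    bin_end = bin_start + bin_width - 1
    print(f"{bin_start:>3}-{bin_end:<3} {'#' * histogram[bin_start]}")
```

In a real project a plotting library would normally draw the histogram, but the idea is the same: see where most of the values lie before deciding what to analyze further.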

5. Feature engineering

‘Features’ are the different data categories you combine to identify patterns.

E.g., a study of Israeli judges found them to be more severe in their decisions before lunch and more lenient in granting parole after having a good meal [AI Superpowers by Kai-Fu Lee].
The features in this analysis would be:

  • The sentence given
  • Time of the trial
  • The judge’s lunch

This step is about using domain knowledge to transform the raw data into informative features that can help answer the question from step 1, and not being afraid of using unusual data to identify the patterns.

“Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.” (Andrew Ng, machine learning expert)

There are typically two types of tasks in feature engineering: feature selection and feature construction.

Feature selection 

You decide which features make sense to use and which features create more ‘noise’ than information.


Feature construction

Feature construction involves creating new features from the already existing ones. E.g., if one of the features is age, but the model only cares about whether the person is an adult or a minor, you could set a threshold at 18 and assign different categories to instances above and below that threshold.
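
The age example can be sketched in a few lines of Python. The customer records and the names ADULT_AGE and age_group are hypothetical, and we assume here that a person who is exactly 18 counts as an adult.

```python
ADULT_AGE = 18  # threshold from the example; exactly 18 counts as adult (our assumption)


def age_group(age):
    """Construct a new categorical feature from the existing numeric age."""
    return "adult" if age >= ADULT_AGE else "minor"


# Hypothetical customer records
customers = [
    {"id": 1, "age": 17},
    {"id": 2, "age": 18},
    {"id": 3, "age": 42},
]

# Add the constructed feature next to the original one
for customer in customers:
    customer["age_group"] = age_group(customer["age"])

print(customers)
```

The original age is kept alongside the new category, so a later model can still use it if that turns out to be useful.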

6. Predictive modeling

At this step, machine learning is added to the data science project. Based on the question asked in step 1 and the data from steps 4 and 5, our data scientist decides which model is best suited to answer it.

This decision is not easy because there is never a single correct answer. The models that we choose to train will depend on the size, type, and quality of the data and on the time available.

Once the model is trained, the most important thing is to evaluate its performance, to make sure the model is as accurate as possible.

Based on knowledge and experience, our data scientist has an idea of which models to use when starting the project, but postponing the decision until step 6 creates an agile way of working and helps ensure that the model we end up with is the best suited for the job.
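
One simple way to evaluate a model is to compare its predictions with known outcomes that were held back from training. The sketch below computes plain accuracy; the churn labels are invented for the example, and in practice other measures may suit the question better.

```python
def accuracy(actual, predicted):
    """Share of predictions that match the known outcome (0.0 to 1.0)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on empty data")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical held-back outcomes and the model's predictions for them
actual = ["stay", "churn", "stay", "stay", "churn"]
predicted = ["stay", "churn", "churn", "stay", "churn"]

score = accuracy(actual, predicted)
print(f"Accuracy: {score:.0%}")
```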

7. Data visualization

Once the intended insights from the model are identified, our focus is on presenting the results so that all key stakeholders at the company can understand them and get value from them.

Because the results of the analysis aren’t worth a dime if we don’t manage to deliver them as actionable insights, we make sure that the report highlights the most important discoveries in a way that CEOs, middle managers, and executives can understand and act on.

Bringing the results into action

Now that the analysis is sealed and delivered, we have reached the most important part: implementing the results in the business workflow.

At this step, we evaluate the results and the understanding they have created for the company. The companies are responsible for applying the results in their daily workflow, but we always keep close contact with our clients to make sure that they never feel uncertain about the results or how to gain value from them.

Data science can create value in all industries and for all businesses, no matter the size; it is all about asking the right question. At Borbaki, we believe that the future is data-science-driven. The sooner today’s companies start using data science to make strategic business decisions, the sooner they will take their company to new heights.

You can save a lot of money and time by outsourcing your data science department. Read about the benefits of outsourcing your data science department here and start your data journey today.