Why you should avoid these 4 mistakes during the data discovery phase of an ML project

Prasenjit Poddar
10 min read · Apr 26, 2021

Avoid these four mistakes during the data discovery phase to head off some serious problems later in an ML project.

Source : blissquotes.com

I come across many friends, colleagues and techies who express an interest in Data Science, Machine Learning or Deep Learning, and who see a career in Machine Learning (ML) as a dream job.

I say, “Why not?” ML is one of the areas most techies want to explore and work in nowadays, and there is nothing wrong with that. With the increasing volume of data that organisations store, there is plenty of scope to explore the data and find interesting insights that help solve business problems.

Many of them ask me, “What challenges do you face in a real-world ML project?” Today I will discuss, from my experience, a few of the challenges faced in the data discovery phase of a project. If you want to know what these challenges are and how to address them, you are in the right place to have at least some of your questions answered.

This article discusses a few challenges faced during the data discovery phase of an ML project, and ways to mitigate them. Avoiding these mistakes early will help you steer clear of some of the mess you might otherwise land in at later stages of the project.

What is the data discovery phase of an ML project?

In a typical ML project, we identify three main phases: discovery, development and deployment. In the discovery phase, we need to identify the business need and draw up a clear roadmap of what an ML model will help the business achieve. This phase is crucial, as we establish the problem statement and how solving it with an ML model will impact the business and the users consuming the product.

In this phase, we identify what datasets are needed, whether the required data is available and sufficient to build and train a model, whether external datasets would be beneficial, and how to acquire them.

The data discovery phase is essential because in it we establish the problem statement and how solving it with an ML model will impact the business and the users who consume the product built out of the ML project.

Challenges in the data discovery phase

Now that you know what the data discovery phase is and why it is essential for any ML project, I’ll discuss a few of the challenges that practitioners come across in a typical real-world ML project.

Here are the challenges in the data discovery phase:

If you don’t have checks in place to confirm that the collected data is complete and aligns with the agreed data scope of the project

In a real-world project, the very first step is to collect the data required for the analysis and modelling exercise. In some cases, you may have a separate data engineering team that does this for you.

Often, data scientists need to collect the in-scope data from different sources and stage it in order to carry out preprocessing, exploratory data analysis and modelling. Most of the time we do have some checks to see that the collected data is complete. However, it is just as important to have checks or test cases to confirm that the data collected aligns with the business objective or the agreed data scope of the project.

In one project, we were solving a time-series forecasting problem for a client. A separate data engineering team had set up the ETL pipelines that loaded new data daily into the source table on which our analysis and modelling depended. The data engineering team had put no checks on these ETL pipelines. During our analysis, we found duplicate records in the source table, because the ETL pipelines had been run multiple times on a few occasions. On some days the ETL pipelines failed, so there was no data for those dates. Despite all this data repair work, we did a good enough job of building the time-series forecast model and moved it to deployment. We were happy with the effort we had put in as a team.

However, we did not expect what came next: the testing team raised an issue saying that we had missed one main product category, for which we had done no forecasting at all. When we traced the issue, we found that the ETL pipelines had never pulled that product category’s data into the source table. The stored procedure (SP) had a bug, which was then fixed. We had to make code changes, rebuild and fine-tune the model, and deploy it to production. It created a panic in the final days of the project.

I hope you won’t find yourself in such circumstances, now that you know how important it is that the collected data is complete and aligns with the agreed data scope.

Follow these steps to avoid such circumstances:
1. Put checks or test cases in place to confirm that the data is complete and aligns with the agreed data scope. Even if a separate data engineering team does the data pull for you, as a data scientist it is your responsibility to have additional checks in place.
2. Check the data quality periodically when you know new data is being appended to the source table on a daily, weekly, monthly or other periodic basis.
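
The kind of checks described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the checks we used on the project. The column names (`date`, `category`, `sales`), the sample rows and the `check_source_table` helper are all hypothetical; in practice the rows would come from the source table and the expected categories from the agreed data scope.

```python
from collections import Counter
from datetime import date, timedelta


def check_source_table(rows, expected_categories, start, end):
    """Return a dict of data-quality findings for a daily source table.

    rows: list of dicts with hypothetical keys "date" and "category".
    expected_categories: categories named in the agreed data scope.
    start, end: first and last date (inclusive) the table should cover.
    """
    # Duplicates: the same (date, category) key loaded more than once,
    # for example because an ETL job ran twice.
    key_counts = Counter((r["date"], r["category"]) for r in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)

    # Missing dates: days in the expected range with no rows at all,
    # for example because an ETL job failed.
    seen_dates = {r["date"] for r in rows}
    n_days = (end - start).days + 1
    expected_dates = (start + timedelta(days=i) for i in range(n_days))
    missing_dates = [d for d in expected_dates if d not in seen_dates]

    # Scope: categories agreed with the business but absent from the data.
    seen_categories = {r["category"] for r in rows}
    missing_categories = sorted(set(expected_categories) - seen_categories)

    return {
        "duplicates": duplicates,
        "missing_dates": missing_dates,
        "missing_categories": missing_categories,
    }


# Made-up sample: one duplicated load, one failed day, one category never pulled.
rows = [
    {"date": date(2021, 1, 1), "category": "A", "sales": 10},
    {"date": date(2021, 1, 1), "category": "B", "sales": 7},
    {"date": date(2021, 1, 2), "category": "A", "sales": 12},
    {"date": date(2021, 1, 2), "category": "A", "sales": 12},  # duplicate
    {"date": date(2021, 1, 4), "category": "B", "sales": 9},
]
report = check_source_table(
    rows,
    expected_categories=["A", "B", "C"],
    start=date(2021, 1, 1),
    end=date(2021, 1, 4),
)
for name, findings in report.items():
    print(name, findings)
```

A check along these lines, rerun whenever new data lands in the source table, is one way a missing product category or a failed load could be surfaced during discovery rather than during testing.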

If you don’t have the required business knowledge

If you are on a client project, before you start anything, make sure you understand the client’s business: how it is structured and operated. You may need to go the extra mile to understand the dynamics of the business, such as who their primary customers are, what products, support and services they provide, how their websites are designed, the various components of the business, the significance of each component, how the components connect with each other, and so on.

It is not necessary to understand every aspect of how their business model works. However, you should focus on the parts of the business that relate to the problem you are solving through analytics or data science, and it is important to have deep knowledge of those parts.

If you don’t have the required business knowledge, you won’t be able to create a relevant set of hypotheses with which to analyse the data and retrieve potential insights, you won’t have the upper hand when validating the data against business concepts, and you will struggle to build the storytelling part.

Make sure you know the client’s business and its dynamics: how it is structured and operated. It is important to have a thorough understanding of the part of the business that relates to the problem you are solving through data science.

If you don’t have a clear understanding of the business problem statement that you are solving

It is very important to establish the business problem statement before you fit any ML model. You need to be very clear on the business goals you are pursuing. In my experience, the business goals in most real-world projects are very complex, and a lot of decoding is required to understand the underlying goals, unlike ML projects in MOOC courses, where the target goals are clear and straightforward.

Let me narrate a past experience in which the business problem we were solving for a client was very complex. The problem statement was: “Based on the customer interactions on their websites, what would be the optimal page component design that would help them drive more visitors in terms of content consumption and conversions?” At first glance, you may feel the business problem is quite clear, but there is a lot more to it when you dive deep into the problem statement.
When you dive deep, you may have to break the problem down into smaller business goals which, taken together, accomplish the overall business goal. You then need to focus on formulating the smaller goals rather than the bigger one. So it is very important to understand the business problem statement, whether it is the bigger goal or the smaller formulated goals, and to be clear on how the smaller goals add up to the bigger one. Some may consider this part of the solution phase, but you still have to be clear on how the problem statement is established. If you are not, you won’t be able to fit any mathematical model that helps you achieve the business goals.

It is advisable to engage in an iterative discussion with the client on the scope of the project and the business problems until you believe you can establish the problem. Sometimes, because of a lack of available data or a need for external data, you may have to tweak the scope of the problem statement to the nearest business ask that would still show potential benefit to the business.

I have seen many lead data science practitioners who, even though they are not very clear on the business problems, are reluctant to reach out to the client to discuss the scope beyond the interactions that happened during the solution or POC phase of the project. They follow a typical path: assume something with an added perspective of their own, analyse and build some solutions, go to the client for feedback, and iteratively work to fix the solution. This works sometimes and not at other times. Yet they remain reluctant to discuss with the client the problem statement they are solving. I am not sure what the reasons are. Maybe they are under the impression that the client would take it negatively if they went back again and again to understand the business problems, and that this might raise questions about their efficiency, so they choose not to approach the client.

It is advisable to engage as much as possible in discussion with the client in order to establish the problem statement you are solving. Do this iteratively, using the discussions to gain knowledge of the industry, how the client’s business operates, and any other useful information related to the business. This will help you establish the problem statement, build the storytelling part, and validate the hypotheses you may be working on in later stages of the project, because many discussions with the client will give you rich knowledge of the client’s business and help you in the long run. Please keep in mind that this is an iterative activity!

If you don’t have an initial set of hypotheses to analyse the data

This is one of the crucial steps of any typical ML project. Before you dive deep into exploratory data analysis (EDA), you should build an initial set of relevant hypotheses that align with the business problem and will provide meaningful insights. When you start analysing the data with basic summary statistics and then move to basic and advanced EDA, you may come up with N insights based on the data explored. However, it may happen that only a few of those insights are meaningful, and the rest are ignored or not very useful. Most typical real-world ML projects are short, say 12–16 weeks, with typically 2–3 weeks allocated to EDA, so you may not get enough time to find the whole universe of insights in the data. It is therefore very important to have initial hypotheses ready that are specific, align with the business problem and point to potential insights. As you go deeper into EDA, more hypotheses related to the business problem may be added to the list, but to get started you should have initial hypotheses handy before you do any sort of EDA.

How to develop the initial set of hypotheses?

It largely depends on the knowledge you have acquired about the client’s business dynamics, how the client’s business model works, knowledge of the domain in which the client operates, external research on the same industry, prior experience working in the same domain or on similar projects, and expert opinion.

At a high level, examples of initial hypotheses may look like this:

  1. More than x% traffic to the “ABC” website comes through Organic Search and Paid Search
  2. Asia Pacific market shows y% potential increase in revenue in 2020 compared to 2019
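
A hypothesis written this way can be turned directly into a check against the data. The sketch below is an illustration for the first example only, and everything in it is assumed: the channel names other than Organic Search and Paid Search, the session counts, the 50% value standing in for x, and the hypothetical `channel_share` helper.

```python
def channel_share(sessions_by_channel, channels):
    """Return the percentage of total sessions that came from `channels`."""
    total = sum(sessions_by_channel.values())
    if total == 0:
        raise ValueError("no sessions recorded")
    selected = sum(sessions_by_channel.get(c, 0) for c in channels)
    return 100.0 * selected / total


# Made-up session counts per acquisition channel for the "ABC" website.
sessions_by_channel = {
    "Organic Search": 450,
    "Paid Search": 250,
    "Direct": 200,
    "Referral": 100,
}

x = 50.0  # placeholder threshold; the real x comes from the hypothesis
share = channel_share(sessions_by_channel, ["Organic Search", "Paid Search"])
hypothesis_holds = share > x
print(f"Search share: {share:.1f}% -> hypothesis holds: {hypothesis_holds}")
```

Writing each hypothesis as a small, testable statement like this keeps the EDA focused on the business problem when only a few weeks are available.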

Conclusion

Now you know what the data discovery phase is and why it is important. I have discussed the various challenges associated with it, along with ways to work around them. The mitigation plan discussed here will help you identify the early mistakes that ML practitioners commonly make, and avoid the mess you might otherwise land in at later stages of the ML project lifecycle. Once you have mitigated these challenges in the data discovery phase, you are ready to move to the development phase, and then to the deployment phase, of the ML project.

To conclude, please do not ignore the mistakes in data discovery described above. Cheers!

As a next step, a piece of advice for ML practitioners, based on my experience:

Start by thinking like a business analyst, dive deep into the data with a data analyst’s perspective, and build and tune the model like a statistical modeller.
This roadmap will make you a better data scientist.
