• Russell

The Problems of Missing Data: A Data Science Perspective


A typical issue in data science is the absence of required information. It can occur when information is not collected or not correctly stored. Missing data can significantly impact the accuracy and reliability of analyses.

When data scientists analyze large data sets, they may encounter instances where certain pieces of information are missing. This makes it harder to draw conclusions and to improve the accuracy of forecasts built on the data. In some cases, the problem is caused by errors in the data collection process, while in others, it is due to limitations on the available data. Either way, addressing the problem can help improve the accuracy and effectiveness of data analysis.

Data scientists spend countless hours acquiring, cleaning, and organizing data to extract every scrap of insight. However, it's not always possible to collect all the desired data: one may be missing measurements from a specific time or a low-level observation in an experimental setup. There are several ways to deal with missing data, but each has its risks and benefits.

This article will go through some of the problems that can arise when little or no data is available and how to avoid them when designing a study.

Missing data

Missing data is a common source of failure in many data science workflows. In the classical workflow, missing information is treated as an unforeseen problem that can be avoided by careful design and planning. Unfortunately, this approach works only under some circumstances. A common case of missing data in a study arises when you collect observations as part of a survey or experiment but have not yet decided on the final mapping between variables and questions.

Types of Missing Data

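Missing data is usually classified by the mechanism behind it, that is, by what the probability of a value being missing depends on. The three categories below follow this classification.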

Missing at Random (MAR): Missing at Random means that the probability of a value being missing depends on the observed data, but has nothing to do with the specific values that were omitted. For example, older respondents may skip an income question more often regardless of what they earn, while age is recorded for everyone. Data is typically missing not in all observations but only in a portion of the data set. Because the missingness is explained by observed variables, it is possible to predict the missing data based on the existing data.

Missing Completely at Random (MCAR): Missing Completely at Random is the case where every observation has the same probability of being missing. The missingness has nothing to do with either the observed data or the values that are missing, as when a test tube breaks by accident. The complete cases are then effectively a random subsample of the full data, so dropping incomplete cases reduces the sample size but does not by itself bias the results.

Missing Not at Random (MNAR): MNAR refers to missing data whose missingness depends on the unobserved values themselves, so the gaps have a structure that the observed data cannot explain. There are usually plausible explanations for the lack of information. A group of people, perhaps women ages 45 to 55, may not have answered a question in a survey because of what their answer would have been. In this situation you cannot tell what is going on based solely on the data that has already been collected. Modeling the missingness itself is essential to obtain an accurate estimate. Simply removing observations with missing data can produce a biased model.
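The difference between the three mechanisms can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not a recipe for diagnosing real data sets: the age and income variables, the missingness probabilities, and the helper names are all assumptions made for illustration. It uses only the Python standard library and marks missing values with None.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic data (assumed for illustration): income in thousands rises with age.
n = 20_000
ages = [rng.uniform(20, 70) for _ in range(n)]
incomes = [20 + 0.8 * age + rng.gauss(0, 5) for age in ages]

def mask(values, drop_prob):
    """Replace values[i] with None with probability drop_prob(i)."""
    return [None if rng.random() < drop_prob(i) else v
            for i, v in enumerate(values)]

# MCAR: every income has the same 30% chance of being missing.
mcar = mask(incomes, lambda i: 0.3)
# MAR: missingness depends on age, which is observed for everyone.
mar = mask(incomes, lambda i: 0.6 if ages[i] > 45 else 0.1)
# MNAR: missingness depends on the income value that goes missing.
mnar = mask(incomes, lambda i: 0.6 if incomes[i] > 56 else 0.1)

def observed_mean(values):
    """Mean of the values that are not missing."""
    return mean(v for v in values if v is not None)

true_mean = mean(incomes)
mcar_mean = observed_mean(mcar)
mar_mean = observed_mean(mar)
mnar_mean = observed_mean(mnar)

# Under MAR, comparing like with like on the observed variable (age)
# removes most of the distortion.
older_true = mean(v for a, v in zip(ages, incomes) if a > 45)
older_mar = mean(v for a, v in zip(ages, mar) if a > 45 and v is not None)

print(f"true mean income:       {true_mean:.1f}")
print(f"observed mean, MCAR:    {mcar_mean:.1f}")
print(f"observed mean, MAR:     {mar_mean:.1f}")
print(f"observed mean, MNAR:    {mnar_mean:.1f}")
print(f"age > 45, true vs MAR:  {older_true:.1f} vs {older_mar:.1f}")
```

In this setup the MCAR mean stays close to the true mean, while the MAR and MNAR means are pulled downward because higher incomes go missing more often. Within an age group the MAR values remain representative, which is why observed variables can be used to correct for MAR but not for MNAR.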

Causes and Problems of Missing Data

Causes of Missing Data

There are a variety of reasons why a survey may have missing data.

Privacy concerns are one reason why people may decline to answer a question. Alternatively, the respondent may not understand the survey question. The respondent may have been willing to answer, but the answer they would have given isn't one of the options presented. The questionnaire may also have been too time-consuming, or the respondent may have lost interest.

Every unanswered question in a survey is a data point that cannot be used. Research outside of surveys is susceptible to missing data as well. Human error can lead to omissions: if a researcher forgets to record a vital sign such as the patient's pulse, that value is lost to the analysis. A measurement may have to be thrown out if a test tube breaks. Databases, too, have gaps in their information. When databases are combined and their variables don't match, the result will contain missing values.

Suppose, for example, that a database analyst works with sales databases that have been consolidated from three different geographic areas. If one variable, such as the educational background of sales representatives, was not recorded in every area, the consolidated database will have gaps for that variable.
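The consolidation scenario can be sketched in a few lines. The three regional tables, their field names, and their values are made up for illustration; the point is only that a field absent from one source turns into missing values once the sources are combined.

```python
# Hypothetical regional sales tables; the "west" region never recorded education.
north = [{"rep": "A", "sales": 120, "education": "BSc"}]
south = [{"rep": "B", "sales": 95, "education": "MBA"}]
west = [{"rep": "C", "sales": 143}]

combined = north + south + west

# Use the union of all field names as the columns of the consolidated table.
columns = sorted({key for row in combined for key in row})

# dict.get returns None for a field that a row does not have.
table = [{col: row.get(col) for col in columns} for row in combined]

missing_by_column = {
    col: sum(row[col] is None for row in table) for col in columns
}
print(missing_by_column)
```

Checking the count of missing values per column right after a merge is a cheap way to notice that two sources did not record the same variables.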

Problems of Missing Data

Identifying the source of data loss is a difficult task. You can't always tell in advance whether missing data will cause problems, because your results may or may not be affected. There are also times when a lack of data is not immediately apparent. There may be only a few missing data points for each question or variable, but the total number of missing data points could be quite large.
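A quick count makes this kind of hidden gap visible. The sketch below uses a made-up table of ten respondents and five questions (the names q1 to q5 are placeholders) in which each question is missing only one answer.

```python
# Made-up survey table: 10 respondents, 5 questions, None marks a missing answer.
columns = ["q1", "q2", "q3", "q4", "q5"]
rows = [{col: 1 for col in columns} for _ in range(10)]

# Each question is unanswered by a different respondent.
for row_index, col in zip([0, 2, 4, 6, 8], columns):
    rows[row_index][col] = None

# Missing answers per question: small for every single column.
missing_per_column = {col: sum(row[col] is None for row in rows) for col in columns}

# Respondents with at least one missing answer: much larger in proportion.
incomplete_rows = [row for row in rows if any(v is None for v in row.values())]

print("missing per question:", missing_per_column)
print("respondents with a gap:", len(incomplete_rows), "of", len(rows))
```

Each question looks 90% complete on its own, yet half of the respondents have at least one gap.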

Whether missing data is a problem can often be determined only after a thorough analysis. For a long time, this analysis has been laborious, prone to mistakes, and inefficient. It's not uncommon for missing data to lead to major issues. First, many statistical procedures automatically exclude any case that has a missing value.

As a result, there's a chance you won't have enough information to complete your analysis. A factor analysis, for example, cannot be run reliably on just a few cases. Another possibility is that the analysis will not produce statistically significant results because the sample is too small. If the cases that remain are not representative of all cases, your findings could be misleading.
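A rough calculation shows how quickly complete cases can run out when incomplete cases are excluded. The sketch below assumes that every variable is missing independently and at the same rate, which real data rarely satisfies exactly; the function name is made up for this example.

```python
def expected_complete_share(missing_rate, n_variables):
    """Expected share of fully complete cases, assuming each variable is
    missing independently at the same rate (a simplifying assumption)."""
    return (1 - missing_rate) ** n_variables

for n_variables in (5, 10, 20):
    share = expected_complete_share(0.05, n_variables)
    print(f"{n_variables} variables, 5% missing each: "
          f"about {share:.0%} of cases are complete")
```

Under these assumptions, a data set with 20 variables that are each 95% complete would keep only a bit more than a third of its cases.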

Missing data can also introduce bias, leading to inaccurate results. Bias arises whenever the recorded values differ systematically from the true ones. For example, a person's self-reported height may be influenced by their impression of ideal or average height while answering a question about height. A self-reported value for height may therefore differ from an objectively measured one.

Dealing with Missing Data

There are two broad ways data scientists can handle the problem of missing data:

· Imputation

· Removal of Data.

Imputation: The imputation method fills in reasonable guesses for the values that are missing. It is most useful when there aren't a lot of gaps in the information. If a lot of data is missing, the imputed data set doesn't have enough natural variation to make a good model.
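The loss of natural variation is easy to see with mean imputation, one of the simplest imputation methods. The sketch below is only an illustration on made-up numbers, and mean imputation is chosen for brevity rather than as a recommendation.

```python
from statistics import mean, variance

# Made-up measurements; None marks a missing value.
values = [10, 12, None, 14, None, 16, 18]

observed = [v for v in values if v is not None]
fill_value = mean(observed)

# Mean imputation: every gap receives the mean of the observed values.
imputed = [fill_value if v is None else v for v in values]

observed_variance = variance(observed)  # sample variance of the observed values
imputed_variance = variance(imputed)    # sample variance after imputation

print("imputed values:", imputed)
print("variance before imputation:", observed_variance)
print("variance after imputation: ", round(imputed_variance, 2))
```

The mean is unchanged, but the variance shrinks because every imputed value sits exactly at the center of the distribution. The more values are imputed this way, the more the data understates its real spread.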

Removal of Data: When values are missing and cannot be recovered, the affected observations can be deleted. The missing values are then not included in any further analysis. This avoids bias only when the data is missing completely at random; otherwise the remaining observations may no longer be representative. When a variable is missing for most observations, it can be easier to remove that variable than the observations. However, if removal would leave too few observations for a sound analysis, it is better not to remove the data. In some cases, you may also need to look for the specific events or factors that caused the values to be missing.
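Two removal strategies can be compared on a small table: deleting every incomplete row, or first dropping a variable that is mostly empty. The table, its field names, and the helper functions below are made up for illustration.

```python
# Made-up records; None marks a missing value. "fax" is almost never filled in.
rows = [
    {"age": 34, "income": 52, "fax": None},
    {"age": 41, "income": None, "fax": None},
    {"age": 29, "income": 48, "fax": "555-0100"},
    {"age": 57, "income": 61, "fax": None},
    {"age": 45, "income": 58, "fax": None},
    {"age": 38, "income": 50, "fax": None},
]

def drop_incomplete_rows(table):
    """Keep only rows in which no value is missing."""
    return [row for row in table if all(v is not None for v in row.values())]

def drop_column(table, name):
    """Return a copy of the table without the given column."""
    return [{k: v for k, v in row.items() if k != name} for row in table]

# Strategy 1: remove every row that has any gap.
rows_only = drop_incomplete_rows(rows)

# Strategy 2: remove the mostly empty column first, then the incomplete rows.
column_first = drop_incomplete_rows(drop_column(rows, "fax"))

print("rows kept when deleting incomplete rows:", len(rows_only))
print("rows kept when dropping 'fax' first:   ", len(column_first))
```

Which strategy is appropriate depends on whether the sparse variable matters for the analysis; the sketch only shows how different the remaining sample sizes can be.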


Missing data is one of the most serious problems in data science and can create problems for both researchers and participants. These problems range from inaccurate results to ethical concerns. However, because missing data is so widespread, it should not discourage anyone from working with large data sets. It calls for careful planning instead. Researchers should aim to collect data that is as complete and accurate as possible to limit these complications. Participants should also be made aware, before they agree to take part in a study, of how unanswered questions can affect its results.
