- What do we understand by data quality and what is its importance?
Data quality refers to how accurate, complete, and consistent an organization's data is, for example the data it holds about its clients. Maintaining it can seem tedious, but software tools such as Experian Data Quality make collecting accurate data much easier. In the long run, the goal of data quality is to collect precise data that gives a significant advantage in improving profitability or completing the task at hand. It is important because poor-quality data makes even basic work, such as contacting customers, a real pain. Today, in 2018, we are in a data-driven age in which it is much easier to find key information about different customer types, such as current customers and potential customers. With reliable information, a user can market effectively and encourage loyalty.
- Discuss one of the factors comprising data quality and provide examples.
One factor comprising data quality is data accuracy, which refers to whether the collected data is correct. For example, suppose an inspection has been completed on a vehicle, but the inspector accidentally fails to report the vehicle's current hour meter reading. The inspection is then incomplete, and its value is reduced because this one important piece of information was left out. By contrast, if the operator performs an inspection, enters a value for every field, spells the vehicle type correctly, and records the correct units of measurement, the user has complete, consistent data.
- How can the data be preprocessed in order to help improve its quality?
Several methods can be used to preprocess data and improve its quality. Data cleaning is one of them. Data that is to be analyzed with data mining techniques can sometimes be incomplete: many tuples may have no recorded value for several attributes. These missing values can be handled in various ways. We can (1) ignore the tuple, (2) fill in the missing value manually, (3) use a global constant to fill in the missing value, (4) use the attribute mean to fill in the missing value, (5) use the attribute mean of all samples belonging to the same class as the given tuple, or (6) use the most probable value to fill in the missing value. Methods 3 through 6 all bias the data, because the filled-in value may be incorrect.
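As a rough illustration of the automatic strategies above (ignoring the tuple, a global constant, the attribute mean, and the class-wise attribute mean), here is a minimal Python sketch. The `records` data, the `-1.0` constant, and all function names are made up for this example; they are assumptions, not taken from any particular tool.

```python
from statistics import mean

# Hypothetical tuples: (customer_class, income); None marks a missing value.
records = [
    ("gold", 90.0),
    ("gold", None),
    ("gold", 110.0),
    ("basic", 40.0),
    ("basic", None),
    ("basic", 60.0),
]

def ignore_tuples(rows):
    """Drop every tuple whose value is missing."""
    return [r for r in rows if r[1] is not None]

def fill_constant(rows, constant=-1.0):
    """Replace missing values with one global constant (an arbitrary choice here)."""
    return [(c, constant if v is None else v) for c, v in rows]

def fill_attribute_mean(rows):
    """Replace missing values with the mean of all known values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_class_mean(rows):
    """Replace missing values with the mean of known values in the same class.

    Assumes every class has at least one known value; otherwise the lookup
    below raises KeyError.
    """
    known = {}
    for c, v in rows:
        if v is not None:
            known.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in known.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]

print(fill_attribute_mean(records))
print(fill_class_mean(records))
```

With this sample, the attribute mean fills both gaps with 75.0, while the class-wise mean fills the "gold" gap with 100.0 and the "basic" gap with 50.0. Either way the filled-in numbers are estimates, which is where the bias comes from.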
- Please discuss the meaning of noise in data sets and the methods that can be used to remove the noise (smooth out the data).
Noise is random error or variance in a measured variable. It can be smoothed out with a few methods: binning, clustering, combined computer and human inspection, and regression. Binning methods smooth a sorted data value by consulting the values around it. The sorted values are distributed into buckets, or bins, and each value is then adjusted using the other values in its bin. Because binning methods consult the neighbors of each value, they perform local smoothing. Clustering detects outliers: similar values are organized into groups, and values that are not close to any group are treated as outliers. Combined computer and human inspection is another way to identify outliers, with the computer flagging suspicious values and a person reviewing them. Finally, regression smooths the data by fitting it to a function. This involves finding the "best" line to fit two variables so that one can be used to predict the other.
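Two of these methods, binning and regression, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses equal-size bins smoothed by the bin mean (one of several binning variants) and an ordinary least-squares line, and the function names and sample numbers are hypothetical.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    each value with the mean of its bin (local smoothing)."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bucket = ordered[start:start + bin_size]
        smoothed.extend([mean(bucket)] * len(bucket))
    return smoothed

def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept through paired data.

    Assumes xs contains at least two distinct values.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Made-up sample values for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

For the sample prices, the three bins are (4, 8, 15), (21, 21, 24), and (25, 28, 34), so every value is replaced by its bin mean of 9, 22, or 29. The fitted line can then be used to predict one variable from the other, with the predicted values serving as the smoothed data.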