How to Find Outliers With IQR Using Python – Built In

Every data set has issues, or points that don’t make sense. These points, referred to as outliers, can either show issues in the data collection process or real phenomena that are not representative of what typically happens. Here are a few examples of outliers that I’ve seen in real data sets: 
Including outliers in your data analysis can skew your data set and negatively impact the results of your analysis. Therefore it’s important to identify the outliers in your data set and make sure the analysis only uses the realistic data.
Let’s talk about how to do that using IQR (interquartile ranges).
 
Before talking through the details of how to write Python code removing outliers, it’s important to mention that removing outliers is more of an art than a science. You need to carefully determine what is an outlier and what is not based on the context of your project. Here are a few examples using the outliers described above:
 
With that word of caution in mind, one common way of identifying outliers is based on analyzing the statistical spread of the data set. In this method you identify the range of the data you want to use and exclude the rest. To do so you: 
Here’s a Python-based example using NumPy to exclude the highest and lowest five percent of all data points from a data set.
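The listing itself is not reproduced in this copy of the article, so what follows is a minimal sketch reconstructed from the line-by-line walkthrough. The seed value and the use of np.random.rand are assumptions, so the printed percentile values will not necessarily match the ones the author saw.

```python
import numpy as np
np.random.seed(42)  # assumed seed; any fixed value makes the run repeatable
random_data = np.random.rand(100)  # assumed generator: 100 uniform values in [0, 1)
p95, p5 = np.percentile(random_data, [95, 5])  # values at the 95th and 5th percentiles
print(p95, p5)
print(len(random_data))
random_data = random_data[random_data < p95]  # keep values below the upper bound
random_data = random_data[random_data > p5]  # keep values above the lower bound
print(len(random_data))
```

The comments are kept on the same lines as the statements so that the "first line", "second line" references in the walkthrough still line up with this sketch.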
That code yields the following outputs:
The first line in the above code imports the NumPy package for use in the analysis process. If you want to do data science with Python I recommend getting very, very familiar with NumPy. 
The second line sets a seed for NumPy’s random number generator, telling it to return the same pseudo-random numbers every time the code runs.
The third line creates a new variable, random_data, which is an array of 100 random values between 0 and 1.
The fourth line is where the magic starts. That line calls the NumPy percentile function to identify the value of the data at the ninety-fifth percentile and the fifth percentile, then stores those values in the p95 and p5 variables. These values set the bounds that will later be used to limit the data set.
The next two lines are print statements showing what’s happening. The first prints p95 and p5 so that you can see the values at those percentiles. They’re quite close to 0.95 and 0.05, which is what we expect when the values are spread evenly between 0 and 1 rather than normally distributed. The second prints the length of random_data, showing that it still contains the 100 values that were originally entered.
The following two lines both reduce the data set based on the bounds specified above. The first line modifies random_data to only include the values that are less than p95, and the following line adjusts it again to include only values that are greater than p5. In this way, the data set is reduced to include only values within the bounds set by the fifth and ninety-fifth percentiles of the data set. This reduces the data set to 90 percent of the total values, and is equivalent to declaring that the largest and smallest five percent are all outliers.
The final line prints the length of random_data after modification, and we can see that it’s now reduced to 90 data points as expected.
 
In order to limit the data set based on the percentiles you must first decide what range of the data set you want to keep. One way to examine the data is to limit it based on the IQR. The IQR is the difference between the third quartile (the 75th percentile) and the first quartile (the 25th percentile), so it describes the spread of the middle 50 percent of the data around the median. The IQR is commonly used when people want to examine what the middle group of a population is doing. For instance, we often see IQR used to understand a school’s SAT or state standardized test scores. 
When using the IQR to remove outliers you remove all points that lie below the first quartile minus 1.5 * IQR or above the third quartile plus 1.5 * IQR. For example, consider the following calculations.
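The calculations themselves did not survive in this copy, so here is a small worked sketch of the arithmetic. The quartile values of 10 and 20 are hypothetical, chosen only to keep the numbers easy to follow.

```python
# Hypothetical quartiles, not taken from any real data set.
q1 = 10.0  # first quartile (25th percentile)
q3 = 20.0  # third quartile (75th percentile)

iqr = q3 - q1                 # spread of the middle 50 percent
lower_bound = q1 - 1.5 * iqr  # anything below this is treated as an outlier
upper_bound = q3 + 1.5 * iqr  # anything above this is treated as an outlier

print(iqr, lower_bound, upper_bound)  # 10.0 -5.0 35.0
```

With these values, a point at 36 would be rejected while a point at 34 would be kept.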
The following code shows an example of using IQR to identify and remove outliers.
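As with the earlier example, the original listing is missing here. This is a minimal sketch based on the description of the outputs further down: 100,000 points from NumPy’s standard_normal function, quartiles from the percentile function, and bounds at 1.5 * IQR beyond the quartiles. The seed and the variable names are assumptions, so the exact counts will differ slightly from the figures quoted in the article.

```python
import numpy as np

np.random.seed(42)  # assumed seed; the original value is not known
random_data = np.random.standard_normal(100_000)

# Third and first quartiles (75th and 25th percentiles).
q75, q25 = np.percentile(random_data, [75, 25])
print(q75, q25)
print(len(random_data))

# Bounds sit 1.5 * IQR above the third quartile and below the first.
iqr = q75 - q25
upper_bound = q75 + 1.5 * iqr
lower_bound = q25 - 1.5 * iqr
print(iqr, upper_bound, lower_bound)

# Keep only the points that fall strictly inside the bounds.
random_data = random_data[(random_data > lower_bound) & (random_data < upper_bound)]
print(len(random_data))
```

The two filters are combined into a single boolean mask here; applying them one after the other, as in the percentile example, gives the same result.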
Which returns the following outputs:
The code rejecting outliers using IQR differs from the prior example code in the following ways:
The outputs show that the code follows the same processes with the new requirements. It prints that the third quartile is at approximately 0.68, and the first quartile is at approximately -0.67. Given that NumPy’s standard_normal function uses a mean of 0 and a standard deviation of 1, these numbers are almost exactly as expected. The code then prints that the total data set holds 100,000 points. The IQR is identified at 1.34, which leads to upper and lower bounds of 2.69 and -2.68. Filtering the data to only values within those two thresholds yields a data set of 99,249 points, indicating that 751 were outside of that range and removed.
And that’s how you do it! You can now think about how to identify outliers through both a practical and statistical approach, how to write generic code to remove outliers and how to evaluate your data set using the common interquartile range metric.