Identifying Outliers and How to Remove Them: A Simple Guide for Beginners

Identifying Outliers

Outlier detection may also be referred to as anomaly detection. This is one of the most important steps in data mining to help identify events, data points, or observations deviating from the dataset’s normal behavior.

Identifying an outlier or anomaly in a dataset can point to a critical event, such as a change in consumer behavior or a technical glitch. For this reason, machine learning is used extensively to automate anomaly detection.

How will you define an anomaly?

Most companies now use several management tools with analytics built in, which makes it much easier to measure every aspect of their business activity. This includes evaluating key performance indicators (KPIs) and the operational performance of infrastructure and applications. With millions of metrics available to track, companies end up producing impressive datasets that they can use to improve their business’s performance.

In such a dataset, there is a high probability that you will find unexpected changes or events that do not conform to the expected data pattern. This is what you can call an anomaly. Simply put, an anomaly or outlier is something that deviates from business as usual. But what do we actually mean by ‘business as usual’ when talking about business metrics? It does not mean that the metrics are constant or unchanging; it means that they follow a pattern you would expect.

For instance, an e-commerce company collecting unusually large revenue in a single day is not an anomaly if that day is Cyber Monday. The business receives a high volume of sales every Cyber Monday, so the peak is a well-established part of its natural cycle.

Impact of outliers in data science projects

Outliers give marketers a reason to pause. They are crucial because they can tell you more about the data gathered, how it was gathered, and what is in it. Examining them lets marketers assess the entire dataset with their marketing goals in mind.

Dealing with anomalies in data science projects is essential, because only then can you expect accurate results from your data.

Outliers and anomalies are a part of a data scientist’s daily agenda. Not to mention, even 5 percent of high-quality data is likely to have some sort of anomaly. To be precise, an outlier or an anomaly is an observation that is abnormal when compared to the other values.

Let’s take a hypothetical example:

A researcher analyzes the number of cups of coffee a student consumes each day. In a sample of 30 students, 12 students are said to consume 1 cup of coffee per day, 13 students consume 2 cups per day, 2 students consume 3 cups, 2 students consume 4 cups, and 1 student consumes 100 cups of coffee per day.

Well, in this analysis, the last student is obviously an anomaly.
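To see why this one student matters, here is a minimal sketch that compares the mean and the median of the coffee data with and without the 100-cup value. It uses only the numbers from the example above; the variable names are illustrative.

```python
from statistics import mean, median

# Cups of coffee per day for the 30 students in the example above
cups = [1] * 12 + [2] * 13 + [3] * 2 + [4] * 2 + [100]

# The same data without the 100-cup student
cups_without_outlier = [c for c in cups if c != 100]

mean_all = mean(cups)                          # pulled upward by the single extreme value
mean_trimmed = mean(cups_without_outlier)
median_all = median(cups)                      # barely affected by the extreme value
median_trimmed = median(cups_without_outlier)

print(f"mean with outlier:      {mean_all:.2f}")
print(f"mean without outlier:   {mean_trimmed:.2f}")
print(f"median with outlier:    {median_all}")
print(f"median without outlier: {median_trimmed}")
```

A single value moves the mean from about 1.79 cups to about 5.07 cups, while the median stays at 2 cups in both cases.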

Outliers affect the results and the decisions we make based on them. They must be handled properly so that marketers make the right business decisions.

How to identify outliers and how to remove them?

If you observe anything out of the ordinary while working with data science projects, that is your anomaly.

Even once you have identified an anomaly and want to remove it, doing so isn’t always easy, since the right way to remove outliers varies with the type of dataset. We will therefore discuss several methods of removing outliers. The examples given here are demonstrated on the Airbnb New York City dataset.

Some of the methods include:

  • Scatter/box plots

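import matplotlib.pyplot as plt

# df is assumed to be a pandas DataFrame of the Airbnb listings with a 'price' column

plt.scatter(df.index, df['price'], color='red')  # one point per listing

plt.title('Price of accommodation')

plt.xlabel('indices')

plt.ylabel('Price')

plt.show()

[Output: scatter plot of price against row index]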

# upper_bound and lower_bound are the percentile limits computed in the Percentile section

# rows priced above the upper bound

x_upper = list(df[df['price'] > upper_bound].index)

y_upper = df[df['price'] > upper_bound]

# rows priced below the lower bound

x_lower = list(df[df['price'] < lower_bound].index)

y_lower = df[df['price'] < lower_bound]

# inliers: rows priced strictly between the two bounds

x_inlier = list(df[(df['price'] < upper_bound) & (df['price'] > lower_bound)].index)

y_inlier = df[(df['price'] < upper_bound) & (df['price'] > lower_bound)]

print(x_inlier)

print(y_inlier)

[Output: the printed inlier indices and rows]

# plot the three groups in different colors

plt.scatter(x_upper, y_upper['price'], color='black', marker='d', label='Above upper bound')

plt.scatter(x_lower, y_lower['price'], color='red', label='Below lower bound')

plt.scatter(x_inlier, y_inlier['price'], color='green', label='Inlier')

plt.title('Price of accommodation')

plt.xlabel('indices')

plt.ylabel('Price')

plt.legend()

plt.show()

[Output: scatter plot of price against row index, with the three groups in different colors]

In the final graph, the red data points are the ones that fall below the lower bound. They are hard to see because the far more numerous green inlier points overlap them, and because the difference between the lower bound and the normal range is too small to be clearly visible at this scale.

This example is taken from Analytics India Mag.

  • Standard deviation

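This method flags values that lie more than a chosen number of standard deviations from the mean. It is simple to apply, which makes it ideal for beginners working on data science projects.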
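As a rough illustration, the sketch below applies a standard deviation cutoff to the coffee data from the earlier example. The cutoff of three standard deviations is a common convention and is an assumption here, as are the variable names.

```python
from statistics import mean, stdev

# Cups of coffee per day for the 30 students in the earlier example
cups = [1] * 12 + [2] * 13 + [3] * 2 + [4] * 2 + [100]

k = 3                      # assumed cutoff: number of standard deviations
mu = mean(cups)
sigma = stdev(cups)        # sample standard deviation

lower_limit = mu - k * sigma
upper_limit = mu + k * sigma

# Values outside [lower_limit, upper_limit] are treated as outliers
outliers = [c for c in cups if c < lower_limit or c > upper_limit]
cleaned = [c for c in cups if lower_limit <= c <= upper_limit]

print(f"limits: {lower_limit:.2f} to {upper_limit:.2f}")
print("outliers:", outliers)
print("remaining values:", len(cleaned))
```

Only the 100-cup student falls outside the limits, so 29 of the 30 values remain. Note that the outlier itself inflates the standard deviation, which is one reason the cutoff should be chosen with care.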

  • Z test

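The formula used here is:

# z-score: distance of each height from the mean, in units of standard deviation

df['zscore'] = (df.Height - df.Height.mean()) / df.Height.std()

df.head(5)

[Output: the first five rows of df, including the new zscore column]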

The goal is to find the rows whose z-score is greater than 3 or lower than -3.

These rows are the outliers, and they can be selected as follows:

#the outliers

df[(df.zscore<-3) | (df.zscore>3)]

This example is taken from Analytics India Mag.
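For readers who want to run the z-score filter end to end, here is a self-contained sketch. The heights are made-up values (assumed to be in feet) chosen so that one entry is clearly implausible; only the Height and zscore column names come from the example above.

```python
import pandas as pd

# Hypothetical data: 50 ordinary heights plus one implausible entry
heights = [5.5 + 0.01 * i for i in range(50)] + [9.0]
df = pd.DataFrame({"Height": heights})

# z-score of each row: (value - mean) / standard deviation
df["zscore"] = (df["Height"] - df["Height"].mean()) / df["Height"].std()

# Rows with a z-score below -3 or above 3 are treated as outliers
outliers = df[(df["zscore"] < -3) | (df["zscore"] > 3)]

# Keep everything else
df_clean = df[df["zscore"].abs() <= 3]

print(outliers)
print(len(df), "rows before,", len(df_clean), "rows after")
```

Only the 9.0 entry is flagged, so the cleaned DataFrame keeps the other 50 rows.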

  •  Percentile

# set the limits, i.e. our criteria for an item to be called an outlier

upper_bound = df['price'].quantile(0.9995)   # value at the 99.95th percentile

print('Upper bound:', upper_bound)

lower_bound = df['price'].quantile(0.0005)   # value at the 0.05th percentile

print('Lower bound:', lower_bound)

max_price = df['price'].max()

print(max_price)

min_price = df['price'].min()

print(min_price)

df[df['price'] > upper_bound]   # rows above the upper bound

print(len(df[df['price'] > upper_bound]))

df[df['price'] < lower_bound]   # rows below the lower bound

print(len(df[df['price'] < lower_bound]))

[Output: the two bounds, the maximum and minimum price, and the number of rows outside each bound]

# remove outliers by keeping only the rows strictly between the two bounds

df_percentile = df[(df['price'] < upper_bound) & (df['price'] > lower_bound)]

print(df_percentile)

[Output: the filtered DataFrame]

After the outliers are removed, 48841 of the original 48895 entries remain.

This example is taken from Analytics India Mag.

These are some of the easiest ways to detect outliers, and they help beginners understand what an outlier is and how to remove it. Other methods include the interquartile range (IQR) and hypothesis tests, but for a beginner, the methods above will suffice.
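For readers who want to go one step further, the IQR rule can be sketched in a few lines. This is only an illustration on the coffee data from the earlier example; the multiplier of 1.5 is the conventional choice and is an assumption here, as are the variable names.

```python
from statistics import quantiles

# Cups of coffee per day for the 30 students in the earlier example
cups = [1] * 12 + [2] * 13 + [3] * 2 + [4] * 2 + [100]

# quantiles(..., n=4) returns the first quartile, the median, and the third quartile
q1, _, q3 = quantiles(cups, n=4)
iqr = q3 - q1                      # interquartile range

lower_fence = q1 - 1.5 * iqr       # assumed multiplier of 1.5
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are treated as outliers
outliers = [c for c in cups if c < lower_fence or c > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: {lower_fence} to {upper_fence}")
print("outliers:", outliers)
```

On this data the rule flags the 100-cup student and also the two students who drink 4 cups a day, which shows that different methods can disagree about what counts as an outlier.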
