python - how to use pandas filter with IQR

ID : 131363

viewed : 10

Tags : python, pandas, data-processing, iqr

Top 5 Answers for python - how to use pandas filter with IQR

Votes : 97

As far as I know, the most compact way to write this filter is with the DataFrame.query method.

    import numpy as np
    import pandas as pd

    # Some test data
    np.random.seed(33454)
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
    df = pd.concat(
        [
            # Uniform random integers in [0, 100)
            pd.DataFrame({'nb': np.random.randint(0, 100, 20)}),
            # Adding some outliers
            pd.DataFrame({'nb': np.random.randint(100, 200, 2)}),
        ],
        # Resetting the index
        ignore_index=True,
    )

    # Computing IQR
    Q1 = df['nb'].quantile(0.25)
    Q3 = df['nb'].quantile(0.75)
    IQR = Q3 - Q1

    # Filtering values between Q1 - 1.5*IQR and Q3 + 1.5*IQR
    filtered = df.query('(@Q1 - 1.5 * @IQR) <= nb <= (@Q3 + 1.5 * @IQR)')

Then we can plot the result to check the difference. We observe that the outlier in the left boxplot (the cross at 183) does not appear anymore in the filtered series.

    # Plotting the result to check the difference
    df.join(filtered, rsuffix='_filtered').boxplot()

[Figure: comparison before and after filtering]
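The same check can be done numerically instead of with a plot. The following is a minimal sketch on a small hand-made dataset (the values and the names lower, upper and removed are illustrative assumptions, not part of the original answer); it lists the rows that the query filter drops.

```python
import pandas as pd

# Hypothetical data: the values 1..10 plus one obvious outlier
df = pd.DataFrame({'nb': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]})

Q1 = df['nb'].quantile(0.25)
Q3 = df['nb'].quantile(0.75)
IQR = Q3 - Q1

# Precompute the fences so the query string stays short
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

filtered = df.query('@lower <= nb <= @upper')

# Rows present in df but absent from filtered are the removed outliers
removed = df.loc[~df.index.isin(filtered.index), 'nb'].tolist()
print(lower, upper, removed)
```

Precomputing the bounds is only a readability choice; the one-line query from the answer behaves the same way.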

Since writing this answer, I've written a post on this topic where you may find more information.

Votes : 80

Another approach using Series.between():

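    # inclusive='both' needs pandas >= 1.3; the boolean form inclusive=True was removed in pandas 2.0
    iqr = df['col'][df['col'].between(df['col'].quantile(.25), df['col'].quantile(.75), inclusive='both')]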

Drawn out:

    # Select the first quartile
    q1 = df['col'].quantile(.25)

    # Select the third quartile
    q3 = df['col'].quantile(.75)

    # Create a mask for values between q1 and q3 (bounds included)
    mask = df['col'].between(q1, q3, inclusive='both')

    # Filter the initial dataframe with the mask
    iqr = df.loc[mask, 'col']
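Note that a mask between Q1 and Q3 keeps only the middle half of the data, which is stricter than the usual outlier rule. As a hedged sketch (the sample values and the names inside_iqr and inside_fences are assumptions for illustration), the same between call can be given the 1.5*IQR fences instead. It relies on the default of between, which includes both bounds.

```python
import pandas as pd

# Hypothetical data: the values 1..10 plus one obvious outlier
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100], name='col')

q1 = s.quantile(.25)
q3 = s.quantile(.75)
iqr = q3 - q1

# Values strictly inside the interquartile range [q1, q3]
inside_iqr = s[s.between(q1, q3)]

# Values inside the 1.5*IQR fences, i.e. everything except outliers
inside_fences = s[s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(inside_iqr.tolist())
print(inside_fences.tolist())
```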

Votes : 78

This gives you the subset of df whose values in the given column lie between Q1 - whisker_width*IQR and Q3 + whisker_width*IQR:

    def subset_by_iqr(df, column, whisker_width=1.5):
        """Remove outliers from a dataframe by column, including optional
           whiskers, removing rows for which the column value is
           less than Q1 - whisker_width*IQR or greater than
           Q3 + whisker_width*IQR.
        Args:
            df (`:obj:pd.DataFrame`): A pandas dataframe to subset
            column (str): Name of the column to calculate the subset from.
            whisker_width (float): Optional, loosen the IQR filter by a
                                   factor of `whisker_width` * IQR.
        Returns:
            (`:obj:pd.DataFrame`): Filtered dataframe
        """
        # Calculate Q1, Q3 and IQR
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        # Apply filter with respect to IQR, including optional whiskers
        # (named mask to avoid shadowing the built-in filter)
        mask = (df[column] >= q1 - whisker_width*iqr) & (df[column] <= q3 + whisker_width*iqr)
        return df.loc[mask]

    # Example for whiskers = 1.5, as requested by the OP
    df_filtered = subset_by_iqr(df, 'column_name', whisker_width=1.5)
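To see what whisker_width changes, here is a small sketch that restates the function in compact form and runs it on made-up data (the column name x and the sample values are assumptions for illustration only). A larger width keeps more of the tails, and a width of 0 keeps only the interquartile range itself.

```python
import pandas as pd

def subset_by_iqr(df, column, whisker_width=1.5):
    # Compact restatement of the function above
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = df[column].between(q1 - whisker_width * iqr,
                              q3 + whisker_width * iqr)
    return df.loc[mask]

# Hypothetical data: a mild outlier (18) and an extreme one (100)
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 18, 100]})

default_width = subset_by_iqr(df, 'x')['x'].tolist()       # drops 18 and 100
wide = subset_by_iqr(df, 'x', whisker_width=3.0)['x'].tolist()   # drops only 100
iqr_only = subset_by_iqr(df, 'x', whisker_width=0)['x'].tolist() # middle half only
print(default_width, wide, iqr_only)
```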

Votes : 62

Find the 1st and 3rd quartiles using quantile, then apply a boolean mask to the dataframe. To remove the outliers, use no_outliers; invert the condition in the mask to get the outliers themselves.

    Q1 = df.col.quantile(0.25)
    Q3 = df.col.quantile(0.75)
    IQR = Q3 - Q1
    # The same column (col) is used throughout
    no_outliers = df.col[(Q1 - 1.5*IQR < df.col) & (df.col < Q3 + 1.5*IQR)]
    outliers = df.col[(Q1 - 1.5*IQR >= df.col) | (df.col >= Q3 + 1.5*IQR)]
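Rather than writing the inverted condition by hand, the mask can be built once and negated with ~. This is a minimal sketch on made-up data (the sample values are an assumption). One caveat: a NaN fails every comparison, so with ~mask a missing value ends up in outliers, whereas the explicit condition above would leave it out of both results.

```python
import pandas as pd

# Hypothetical data: the values 1..10 plus one obvious outlier
df = pd.DataFrame({'col': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]})

Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1

# Build the mask once, then reuse it in both directions
mask = (Q1 - 1.5 * IQR < df['col']) & (df['col'] < Q3 + 1.5 * IQR)
no_outliers = df.loc[mask, 'col']
outliers = df.loc[~mask, 'col']

print(outliers.tolist())
```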

Votes : 50

Another approach uses Series.clip:

    q = s.quantile([.25, .75])
    s = s[~s.clip(*q).isin(q)]

Here are the details:

    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.randn(100))
    q = s.quantile([.25, .75])  # calculate lower and upper bounds
    s = s.clip(*q)  # assigns values outside boundary to boundary values
    s = s[~s.isin(q)]  # take only observations within bounds

Using it to filter a whole dataframe df is straightforward:

    def iqr(df, colname, bounds=[.25, .75]):
        s = df[colname]
        q = s.quantile(bounds)
        return df[~s.clip(*q).isin(q)]

Note: the method excludes the boundaries themselves.
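The boundary behaviour is easiest to see on a tiny series whose quartiles coincide with actual data points. The sketch below (sample values and variable names are assumptions for illustration) compares the clip approach with Series.between, which includes both bounds by default.

```python
import pandas as pd

# Hypothetical data: the quartiles of 1..5 are exactly 2.0 and 4.0
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
q = s.quantile([.25, .75])

# clip + isin: values equal to a bound are dropped along with the clipped ones
clip_kept = s[~s.clip(*q).isin(q)]

# between: values equal to a bound are kept
between_kept = s[s.between(q.iloc[0], q.iloc[1])]

print(clip_kept.tolist())
print(between_kept.tolist())
```

If observations that sit exactly on a quantile should survive the filter, the between form is the safer choice.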
