python - how to use pandas filter with IQR

Top 5 Answers for python - how to use pandas filter with IQR

97

As far as I know, the `query` method offers the most compact notation.

```python
import numpy as np
import pandas as pd

# Some test data
np.random.seed(33454)
df = pd.concat(
    [
        # Uniformly distributed integers in [0, 100)
        pd.DataFrame({'nb': np.random.randint(0, 100, 20)}),
        # Adding some outliers
        pd.DataFrame({'nb': np.random.randint(100, 200, 2)}),
    ],
    # Resetting the index (pd.concat replaces DataFrame.append, removed in pandas 2.0)
    ignore_index=True,
)

# Computing IQR
Q1 = df['nb'].quantile(0.25)
Q3 = df['nb'].quantile(0.75)
IQR = Q3 - Q1

# Keeping values between Q1 - 1.5*IQR and Q3 + 1.5*IQR
filtered = df.query('(@Q1 - 1.5 * @IQR) <= nb <= (@Q3 + 1.5 * @IQR)')
```

Then we can plot the result to check the difference. We observe that the outlier in the left boxplot (the cross at 183) does not appear anymore in the filtered series.

```python
# Plotting the result to check the difference (requires matplotlib)
df.join(filtered, rsuffix='_filtered').boxplot()
```
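If a plot is not convenient, the same check can be done numerically. The following is a minimal sketch on hypothetical toy data (the values and the `removed` variable are assumptions made for illustration, not part of the answer): rows whose index is missing from `filtered` are the ones the `query` filter dropped.

```python
import pandas as pd

# Hypothetical toy data: eight ordinary values plus one obvious outlier
df = pd.DataFrame({'nb': [10, 12, 14, 15, 16, 18, 20, 22, 183]})

Q1 = df['nb'].quantile(0.25)
Q3 = df['nb'].quantile(0.75)
IQR = Q3 - Q1

filtered = df.query('(@Q1 - 1.5 * @IQR) <= nb <= (@Q3 + 1.5 * @IQR)')

# Rows present in df but absent from filtered are the removed outliers
removed = df.loc[df.index.difference(filtered.index)]
print(removed)
```

Here only the row holding 183 ends up in `removed`.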

80

Another approach using Series.between():

```python
iqr = df['col'][df['col'].between(df['col'].quantile(.25), df['col'].quantile(.75), inclusive='both')]
```

Spelled out step by step:

```python
# Select the first quartile
q1 = df['col'].quantile(.25)

# Select the third quartile
q3 = df['col'].quantile(.75)

# Create a mask for values between q1 and q3
# (inclusive='both' replaces inclusive=True, which pandas 2.0 no longer accepts)
mask = df['col'].between(q1, q3, inclusive='both')

# Filter the initial dataframe with the mask
iqr = df.loc[mask, 'col']
```
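The snippet above keeps only values between Q1 and Q3. If the goal is instead to drop outliers beyond 1.5 × IQR, as in the other answers, `between` can take the fences as bounds. This is a sketch under that assumption, using hypothetical toy data and illustrative names (`lower`, `upper`, `no_outliers`):

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'col': [10, 12, 14, 15, 16, 18, 20, 22, 183]})

q1 = df['col'].quantile(.25)
q3 = df['col'].quantile(.75)
iqr = q3 - q1

# Fences: 1.5 * IQR beyond each quartile
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# between() includes both endpoints by default
mask = df['col'].between(lower, upper)
no_outliers = df.loc[mask, 'col']
```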

78

This returns the subset of `df` whose values in `column` lie within `whisker_width` × IQR of the quartiles:

```python
def subset_by_iqr(df, column, whisker_width=1.5):
    """Remove outliers from a dataframe by column, including optional
    whiskers: drop rows whose column value is less than
    Q1 - whisker_width*IQR or greater than Q3 + whisker_width*IQR.

    Args:
        df (pd.DataFrame): A pandas dataframe to subset.
        column (str): Name of the column to calculate the subset from.
        whisker_width (float): Optional, loosen the IQR filter by a
            factor of `whisker_width` * IQR.

    Returns:
        pd.DataFrame: Filtered dataframe.
    """
    # Calculate Q1, Q3 and IQR
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # Apply filter with respect to IQR, including optional whiskers
    mask = (df[column] >= q1 - whisker_width * iqr) & (df[column] <= q3 + whisker_width * iqr)
    return df.loc[mask]


# Example for whiskers = 1.5, as requested by the OP
df_filtered = subset_by_iqr(df, 'column_name', whisker_width=1.5)
```
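To see what `whisker_width` changes, here is a small sketch on hypothetical toy data. It repeats the function above in condensed form (using `between`, which is equivalent to the two inclusive comparisons) so that the example runs on its own. The column name `value` and the result names are assumptions for illustration.

```python
import pandas as pd

def subset_by_iqr(df, column, whisker_width=1.5):
    # Condensed copy of the function above, repeated so the example is self-contained
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = df[column].between(q1 - whisker_width * iqr, q3 + whisker_width * iqr)
    return df.loc[mask]

# Hypothetical toy data
df = pd.DataFrame({'value': [10, 12, 14, 15, 16, 18, 20, 22, 183]})

within_whiskers = subset_by_iqr(df, 'value', whisker_width=1.5)  # drops only 183
within_iqr = subset_by_iqr(df, 'value', whisker_width=0)         # keeps Q1..Q3 only
```

With `whisker_width=0` the filter collapses to the interquartile range itself, so only the middle values (14 through 20 here) survive.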

62

Find the first and third quartiles with `Series.quantile`, then apply a boolean mask to the column. `no_outliers` keeps the values strictly inside the fences; inverting the condition in the mask gives `outliers`.

```python
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1

# Values strictly inside the fences
no_outliers = df['col'][(Q1 - 1.5 * IQR < df['col']) & (df['col'] < Q3 + 1.5 * IQR)]
# Values on or beyond the fences
outliers = df['col'][(Q1 - 1.5 * IQR >= df['col']) | (df['col'] >= Q3 + 1.5 * IQR)]
```
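Since the two conditions are exact complements, one option is to build the mask once and negate it with `~`. A minimal sketch on hypothetical toy data (the name `inside` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'col': [10, 12, 14, 15, 16, 18, 20, 22, 183]})

Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1

# True for values strictly inside the fences
inside = (Q1 - 1.5 * IQR < df['col']) & (df['col'] < Q3 + 1.5 * IQR)

no_outliers = df.loc[inside, 'col']
outliers = df.loc[~inside, 'col']  # ~ inverts the boolean mask
```

Every row lands in exactly one of the two results, so their lengths add up to `len(df)`.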

50

Another approach uses Series.clip:

```python
q = s.quantile([.25, .75])
s = s[~s.clip(*q).isin(q)]
```

Here are the details:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(100))
q = s.quantile([.25, .75])  # calculate lower and upper bounds
s = s.clip(*q)  # assigns values outside boundary to boundary values
s = s[~s.isin(q)]  # take only observations within bounds
```

Using it to filter a whole dataframe `df` is straightforward:

```python
def iqr(df, colname, bounds=(.25, .75)):
    # A tuple default avoids sharing a mutable list between calls
    s = df[colname]
    q = s.quantile(list(bounds))
    return df[~s.clip(*q).isin(q)]
```

Note: the method excludes the boundaries themselves.
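The boundary exclusion is easy to see on a tiny series. In this sketch (hypothetical data, with `kept` as an illustrative name), the quartiles coincide with actual values, and those values are dropped along with everything outside them:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
q = s.quantile([.25, .75])  # Q1 = 2.0, Q3 = 4.0

# clip() maps 1 -> 2 and 5 -> 4; isin(q) then flags every value equal to a bound
kept = s[~s.clip(*q).isin(q)]
print(kept.tolist())
```

Only the value 3 remains: 2 and 4 sit exactly on the bounds, so they are removed as well.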