go_utils.filtering
Overview
This submodule contains code to facilitate the general filtering of data. The following sections discuss some of the logic and context behind these methods.
Methods
Filter Invalid Coords
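Certain entries in the GLOBE Database have latitudes and longitudes that don't exist, i.e. values outside the range of -90 to 90 degrees latitude or -180 to 180 degrees longitude. This filter removes them.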
Filter Duplicates
For reasons such as GLOBE Observer trainings, the same entry is often recorded multiple times. This can reduce data quality, and this utility helps mitigate it. Groups of entries that share the same MGRS Latitude, MGRS Longitude, measured date, and other dataset-specific attributes (e.g. water source) are likely duplicate entries. In Low et al. (https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2021GH000436), Mosquito Habitat Mapper duplicates are removed by dropping groups of size greater than 10 that share MGRS Latitude, MGRS Longitude, measuredDate, Water source, and Sitename values. Note, however, that by default the filter keeps the first entry of each duplicate group, unlike the procedure in Low et al., where all duplicate entries were dropped.
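The grouping logic can be illustrated with plain pandas. This is only a minimal sketch, not the library's implementation: the column names, the data, and the threshold of 3 are hypothetical and chosen for illustration.

```python
import pandas as pd

# Hypothetical column names and data, chosen only for illustration.
df = pd.DataFrame(
    {
        "mgrs_lat": [10.0, 10.0, 10.0, 20.0, 30.0],
        "mgrs_lon": [50.0, 50.0, 50.0, 60.0, 70.0],
        "measured_date": ["2021-01-01"] * 3 + ["2021-01-02", "2021-01-03"],
        "water_source": ["pond", "pond", "pond", "tire", "pond"],
    }
)
key_cols = ["mgrs_lat", "mgrs_lon", "measured_date", "water_source"]
group_size = 3  # groups with at least this many rows are treated as duplicates

# Number of rows in the group each row belongs to
sizes = df.groupby(key_cols)[key_cols[0]].transform("size")
suspect = sizes >= group_size

# Procedure described for Low et al.: drop every row of a suspect group
dropped_all = df[~suspect]

# Default behaviour described above: keep the first row of each suspect group
is_first = ~df.duplicated(subset=key_cols, keep="first")
kept_first = df[~suspect | is_first]

print(list(dropped_all.index))  # rows 3 and 4 remain
print(list(kept_first.index))   # rows 0, 3 and 4 remain
```

The only difference between the two variants is whether one representative of each suspect group survives.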
Filter Poor Geolocational Data
Recorded geolocational data is not always accurate, so this filter runs a relatively naive check to remove poor geolocational data. Specifically, an entry is considered poor quality if its MGRS coordinates exactly match its GPS coordinates or if either GPS coordinate is a whole number.
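As a rough illustration of that check, the sketch below builds the same two conditions with vectorized pandas comparisons. The column names and values are hypothetical, and this is an equivalent formulation rather than the module's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and values, for illustration only.
df = pd.DataFrame(
    {
        "gps_lat": [12.3456, 45.0, 33.3, 33.9],
        "gps_lon": [98.7654, 7.1234, 44.4, 18.4],
        "mgrs_lat": [12.3461, 45.0012, 33.3, 33.9012],
        "mgrs_lon": [98.7649, 7.1229, 44.4, 18.4011],
    }
)
lat, lon = df["gps_lat"], df["gps_lon"]

# Condition 1: MGRS coordinates are identical to the GPS coordinates
matches_mgrs = (lat == df["mgrs_lat"]) & (lon == df["mgrs_lon"])

# Condition 2: either GPS coordinate is a whole number
whole_number = (lat == np.trunc(lat)) | (lon == np.trunc(lon))

poor = matches_mgrs | whole_number
good = df[~poor]
print(list(good.index))  # rows 0 and 3 pass the check
```

Row 1 is flagged because its GPS latitude is a whole number, and row 2 because its MGRS and GPS coordinates coincide.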
```python
def filter_out_entries(df, mask, include, inplace):
    """
    Filters out or selects target entries of a DataFrame using a mask. Mainly serves as a utility function for the other filters.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to filter
    mask : 1D np.array of bools
        The mask to apply to the DataFrame
    include : bool
        True to only select the masked values, False to exclude the masked values
    inplace : bool
        Whether to perform the operation in place. If True, no DataFrame copy is returned and `df` itself is modified.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the mask filter applied. If `inplace=True` it returns None.
    """
    if include:
        final_mask = mask
    else:
        final_mask = ~mask
    filtered_df = df[final_mask]
    if not inplace:
        return filtered_df
    else:
        # Blank out every row that did not survive the filter, then drop those rows
        df.mask(~df.isin(filtered_df), inplace=True)
        df.dropna(how="all", inplace=True)
        # Masking can change column dtypes (e.g. int to float), so restore them
        for col in df.columns:
            if df[col].dtype != filtered_df[col].dtype:
                df[col] = df[col].astype(filtered_df[col].dtype)
```
Filters out or selects target entries of a DataFrame using a mask. Mainly serves as a utility function for the other filters.
Parameters
- df (pd.DataFrame): The DataFrame to filter
- mask (1D np.array of bools): The mask to apply to the DataFrame
- include (bool): True to only select the masked values, False to exclude the masked values
- inplace (bool): Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame itself is modified.
Returns
- pd.DataFrame or None: A DataFrame with the mask filter applied. If `inplace=True` it returns None.
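The `include` flag simply decides whether the mask or its negation is applied. The sketch below shows that with plain boolean indexing on made-up data; the in-place variant uses `DataFrame.drop` as a simpler stand-in and is not how this function is implemented.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"value": [1, 2, 3, 4]})
mask = np.array([True, False, True, False])

selected = df[mask]   # corresponds to include=True: keep the masked rows
excluded = df[~mask]  # corresponds to include=False: drop the masked rows

# A stand-in for the in-place case: the caller's object is modified
# and the call itself returns None.
in_place = df.copy()
result = in_place.drop(index=in_place.index[~mask], inplace=True)

print(selected["value"].tolist(), excluded["value"].tolist(), result)
```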
```python
def filter_invalid_coords(
    df, latitude_col, longitude_col, inclusive=False, inplace=False
):
    """
    Filters latitude and longitude of a DataFrame to lie within the latitude range of [-90, 90] or (-90, 90) and longitude range of [-180, 180] or (-180, 180)

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to filter
    latitude_col : str
        The name of the column that contains latitude values
    longitude_col : str
        The name of the column that contains longitude values
    inclusive : bool, default=False
        True if you would like the bounds of the latitude and longitude to be inclusive e.g. [-90, 90]. Do note that these bounds may not work with certain GIS software and projections.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned and `df` itself is modified.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with invalid latitude and longitude entries removed. If `inplace=True` it returns None.
    """
    if not inplace:
        df = df.copy()

    # Build a mask of rows whose coordinates fall inside the valid range
    if inclusive:
        mask = (
            (df[latitude_col] >= -90)
            & (df[latitude_col] <= 90)
            & (df[longitude_col] <= 180)
            & (df[longitude_col] >= -180)
        )
    else:
        mask = (
            (df[latitude_col] > -90)
            & (df[latitude_col] < 90)
            & (df[longitude_col] < 180)
            & (df[longitude_col] > -180)
        )

    return filter_out_entries(df, mask, True, inplace)
```
Filters a DataFrame so that latitude lies within [-90, 90] or (-90, 90) and longitude lies within [-180, 180] or (-180, 180), depending on whether the bounds are inclusive.
Parameters
- df (pd.DataFrame): The DataFrame to filter
- latitude_col (str): The name of the column that contains latitude values
- longitude_col (str): The name of the column that contains longitude values
- inclusive (bool, default=False): True if you would like the bounds of the latitude and longitude to be inclusive e.g. [-90, 90]. Do note that these bounds may not work with certain GIS software and projections.
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame itself is modified.
Returns
- pd.DataFrame or None: A DataFrame with invalid latitude and longitude entries removed. If `inplace=True` it returns None.
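The effect of the inclusive bounds is easiest to see on boundary values. The helper below, `valid_coords_mask`, is a hypothetical name used for this sketch, as are the column names and data; it mirrors the comparisons described above on a small example.

```python
import pandas as pd

# Hypothetical data: row 1 sits on the latitude bound, row 3 on the longitude bound
df = pd.DataFrame(
    {
        "lat": [45.0, 90.0, -91.0, 10.0],
        "lon": [100.0, 0.0, 20.0, 180.0],
    }
)


def valid_coords_mask(df, lat_col, lon_col, inclusive=False):
    """Return a boolean Series marking rows with in-range coordinates."""
    lat, lon = df[lat_col], df[lon_col]
    if inclusive:
        return (lat >= -90) & (lat <= 90) & (lon >= -180) & (lon <= 180)
    return (lat > -90) & (lat < 90) & (lon > -180) & (lon < 180)


exclusive_rows = df[valid_coords_mask(df, "lat", "lon")]
inclusive_rows = df[valid_coords_mask(df, "lat", "lon", inclusive=True)]

print(list(exclusive_rows.index))  # only row 0
print(list(inclusive_rows.index))  # rows 0, 1 and 3
```

Row 2 is rejected in both modes because a latitude of -91 does not exist.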
```python
import numpy as np


def filter_duplicates(df, columns, group_size, keep_first=True, inplace=False):
    """
    Filters possible duplicate data by grouping together suspiciously similar entries.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to filter
    columns : list of str
        The names of the columns that duplicate data would share. This can include things such as MGRS Latitude, MGRS Longitude, measured date, and other fields (e.g. mosquito water source for Mosquito Habitat Mapper).
    group_size : int
        The number of duplicate entries in a group needed to classify the group as duplicate data.
    keep_first : bool, default=True
        Whether to keep the first entry of each duplicate group. If False, every entry of a duplicate group is removed.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned and `df` itself is modified.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with duplicate data removed. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()

    # Collect every row that belongs to a group of at least `group_size` entries
    suspect_df = df.groupby(by=columns).filter(lambda x: len(x) >= group_size)
    if keep_first:
        # Keep the first row of each group by only marking the later rows as suspect
        suspect_df = suspect_df.groupby(by=columns, as_index=False).nth[1:]
    # Mark the rows of `df` that also appear in `suspect_df`
    suspect_mask = df.isin(suspect_df)
    suspect_mask = np.any(suspect_mask, axis=1)

    return filter_out_entries(df, suspect_mask, False, inplace)
```
Filters possible duplicate data by grouping together suspiciously similar entries.
Parameters
- df (pd.DataFrame): The DataFrame to filter
- columns (list of str): The names of the columns that duplicate data would share. This can include things such as MGRS Latitude, MGRS Longitude, measured date, and other fields (e.g. mosquito water source for Mosquito Habitat Mapper).
- group_size (int): The number of duplicate entries in a group needed to classify the group as duplicate data.
- keep_first (bool, default=True): Whether to keep the first entry of each duplicate group. If False, every entry of a duplicate group is removed.
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame itself is modified.
Returns
- pd.DataFrame or None: A DataFrame with duplicate data removed. If `inplace=True` it returns None.
```python
import numpy as np


def filter_poor_geolocational_data(
    df,
    latitude_col,
    longitude_col,
    mgrs_latitude_col,
    mgrs_longitude_col,
    inplace=False,
):
    """
    Filters out entries of a DataFrame whose latitude and longitude have poor geolocational quality.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to filter
    latitude_col : str
        The name of the column that contains latitude values
    longitude_col : str
        The name of the column that contains longitude values
    mgrs_latitude_col : str
        The name of the column that contains MGRS latitude values
    mgrs_longitude_col : str
        The name of the column that contains MGRS longitude values
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned and `df` itself is modified.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with bad latitude and longitude entries removed. If `inplace=True` it returns None.
    """

    def geolocational_filter(gps_lat, gps_lon, recorded_lat, recorded_lon):
        # Poor quality: MGRS matches GPS exactly, or either GPS value is a whole number
        return (
            (recorded_lat == gps_lat and recorded_lon == gps_lon)
            or gps_lat == int(gps_lat)
            or gps_lon == int(gps_lon)
        )

    if not inplace:
        df = df.copy()

    # Apply the scalar check element-wise across the four coordinate columns
    vectorized_filter = np.vectorize(geolocational_filter)
    bad_data = vectorized_filter(
        df[latitude_col].to_numpy(),
        df[longitude_col].to_numpy(),
        df[mgrs_latitude_col].to_numpy(),
        df[mgrs_longitude_col].to_numpy(),
    )

    return filter_out_entries(df, bad_data, False, inplace)
```
Filters out entries of a DataFrame whose latitude and longitude have poor geolocational quality.
Parameters
- df (pd.DataFrame): The DataFrame to filter
- latitude_col (str): The name of the column that contains latitude values
- longitude_col (str): The name of the column that contains longitude values
- mgrs_latitude_col (str): The name of the column that contains MGRS latitude values
- mgrs_longitude_col (str): The name of the column that contains MGRS longitude values
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame itself is modified.
Returns
- pd.DataFrame or None: A DataFrame with bad latitude and longitude entries removed. If `inplace=True` it returns None.
```python
import numpy as np
from pandas.api.types import is_hashable


def filter_by_globe_team(
    df, globe_teams_column, target_teams, exclude=False, inplace=False
):
    """
    Finds or filters out specific GLOBE teams.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to filter
    globe_teams_column : str
        The column containing the GLOBE teams.
    target_teams : list of str
        The names of the GLOBE teams to be used.
    exclude : bool, default=False
        Whether to exclude the specified teams from the dataset.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned and `df` itself is modified.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with only the specified GLOBE teams (if exclude is False) or without the specified GLOBE teams (if exclude is True). If `inplace=True` it returns None.
    """

    def is_desired_team(team_list):
        # Hashable values (e.g. NaN for a missing team list) never count as desired
        if not exclude:
            return any(
                [
                    team in team_list if not is_hashable(team_list) else False
                    for team in target_teams
                ]
            )
        else:
            return all(
                [
                    team not in team_list if not is_hashable(team_list) else False
                    for team in target_teams
                ]
            )

    desired_team_filter = np.vectorize(is_desired_team)
    desired_data_mask = desired_team_filter(df[globe_teams_column].to_numpy())

    return filter_out_entries(df, desired_data_mask, True, inplace)
```
Finds or filters out specific GLOBE teams.
Parameters
- df (pd.DataFrame): The DataFrame to filter
- globe_teams_column (str): The column containing the GLOBE teams.
- target_teams (list of str): The names of the GLOBE teams to be used.
- exclude (bool, default=False): Whether to exclude the specified teams from the dataset.
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame itself is modified.
Returns
- pd.DataFrame or None: A DataFrame with only the specified GLOBE teams (if exclude is False) or without the specified GLOBE teams (if exclude is True). If `inplace=True` it returns None.
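The sketch below illustrates the two modes on made-up data, assuming the teams column holds a list of team names per entry and NaN where no teams were recorded. The column name, team names, and helper functions are hypothetical. As in the source listed above, entries without a team list are dropped in both modes.

```python
import pandas as pd

# Hypothetical data: each entry holds a list of team names, or NaN if none
df = pd.DataFrame(
    {
        "teams": [["Team A", "Team B"], ["Team C"], float("nan"), ["Team B"]],
    }
)
target_teams = ["Team B"]


def is_team_list(value):
    """True when the entry actually holds a list of team names."""
    return isinstance(value, list)


def has_target(value):
    """True when the entry lists at least one of the target teams."""
    return is_team_list(value) and any(team in value for team in target_teams)


has_list = df["teams"].apply(is_team_list)
has_team = df["teams"].apply(has_target)

only_targets = df[has_team]                # like exclude=False
without_targets = df[has_list & ~has_team]  # like exclude=True

print(list(only_targets.index))     # rows 0 and 3
print(list(without_targets.index))  # row 1 only; the NaN entry is dropped as well
```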