go_utils.filtering

Overview

This submodule contains code to facilitate the general filtering of data. The following sections discuss some of the logic and context behind these methods.

Methods

Filter Invalid Coords

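Certain entries in the GLOBE Database have latitudes and longitudes that don't exist, i.e. latitudes outside the range -90 to 90 or longitudes outside the range -180 to 180. This filter removes them.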

Filter Duplicates

Because of GLOBE Observer trainings, among other reasons, the same entry is often recorded multiple times. This lowers data quality, and this utility can be used to reduce the problem. Groups of entries that share the same MGRS Latitude, MGRS Longitude, measured date, and other dataset-specific attributes (e.g. water source) are likely duplicates. In Low et al., Mosquito Habitat Mapper duplicates are identified as groups of more than 10 entries sharing MGRS Latitude, MGRS Longitude, measuredDate, Water source, and Sitename values. Note, however, that by default the filter keeps the first entry of each duplicate group, unlike the procedure in Low et al., where all duplicate entries were dropped.

Filter Poor Geolocational Data

Geolocational data is not always accurate, so this filter runs a relatively naive check to remove poor geolocational data. More specifically, an entry is considered poor quality if its MGRS coordinates exactly match its GPS coordinates or if either GPS coordinate is a whole number.

View Source

import numpy as np
from pandas.api.types import is_hashable

__doc__ = """
# Overview
This submodule contains code to facilitate the general filtering of data.
The following sections discuss some of the logic and context behind these methods.

# Methods

## [Filter Invalid Coords](#filter_invalid_coords)
Certain entries in the GLOBE Database have latitudes and longitudes that don't exist.

## [Filter Duplicates](#filter_duplicates)
Because of GLOBE Observer trainings, among other reasons, the same entry is often recorded multiple times. This lowers data quality, and this utility can be used to reduce the problem. Groups of entries that share the same MGRS Latitude, MGRS Longitude, measured date, and other dataset-specific attributes (e.g. water source) are likely duplicates. In [Low et al.](https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2021GH000436), Mosquito Habitat Mapper duplicates are identified as groups of more than 10 entries sharing MGRS Latitude, MGRS Longitude, measuredDate, Water source, and Sitename values.
Note, however, that by default the filter keeps the first entry of each duplicate group, unlike the procedure in Low et al., where all duplicate entries were dropped.

## [Filter Poor Geolocational Data](#filter_poor_geolocational_data)
Geolocational data is not always accurate, so this filter runs a relatively naive check to remove poor geolocational data. More specifically, an entry is considered poor quality if its MGRS coordinates exactly match its GPS coordinates or if either GPS coordinate is a whole number.
"""


def filter_out_entries(df, mask, include, inplace):
    """
    Filters out or selects target entries of a DataFrame using a mask. Mainly serves as a utility function for the other filters.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to filter
    mask : 1D np.array of bools
        The mask to apply to the DataFrame
    include : bool
        True to select only the masked values; False to exclude the masked values.
    inplace : bool
        Whether to perform the operation in place. If True, no DataFrame copy is returned; if False, a new DataFrame is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the mask filter applied. If `inplace=True` it returns None.
    """
    if include:
        final_mask = mask
    else:
        final_mask = ~mask
    filtered_df = df[final_mask]
    if not inplace:
        return filtered_df
    else:
        # Blank out every cell that is not part of the filtered result,
        # then drop the rows that became entirely NaN.
        df.mask(~df.isin(filtered_df), inplace=True)
        df.dropna(how="all", inplace=True)
        # Masking with NaN can change column dtypes, so restore the originals.
        for col in df.columns:
            if df[col].dtype != filtered_df[col].dtype:
                df[col] = df[col].astype(filtered_df[col].dtype)


def filter_invalid_coords(
    df, latitude_col, longitude_col, inclusive=False, inplace=False
):
    """
    Filters a DataFrame so that latitudes lie within (-90, 90) and longitudes within (-180, 180), or within [-90, 90] and [-180, 180] when `inclusive=True`.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to filter
    latitude_col : str
        The name of the column that contains latitude values
    longitude_col : str
        The name of the column that contains longitude values
    inclusive : bool, default=False
        True to make the latitude and longitude bounds inclusive, e.g. [-90, 90]. Note that inclusive bounds may not work with certain GIS software and projections.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned; if False, a new DataFrame is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with invalid latitude and longitude entries removed. If `inplace=True` it returns None.
    """
    if not inplace:
        df = df.copy()

    if inclusive:
        mask = (
            (df[latitude_col] >= -90)
            & (df[latitude_col] <= 90)
            & (df[longitude_col] <= 180)
            & (df[longitude_col] >= -180)
        )
    else:
        mask = (
            (df[latitude_col] > -90)
            & (df[latitude_col] < 90)
            & (df[longitude_col] < 180)
            & (df[longitude_col] > -180)
        )

    return filter_out_entries(df, mask, True, inplace)


def filter_duplicates(df, columns, group_size, keep_first=True, inplace=False):
    """
    Filters possible duplicate data by grouping together suspiciously similar entries.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to filter
    columns : list of str
        The names of the columns that duplicate data would share. This can include things such as MGRS Latitude, MGRS Longitude, measured date, and other fields (e.g. mosquito water source for Mosquito Habitat Mapper).
    group_size : int
        The minimum number of entries in a group needed to classify the group as duplicate data.
    keep_first : bool, default=True
        Whether to keep the first entry of each duplicate group. If False, every entry in a duplicate group is dropped.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned; if False, a new DataFrame is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with duplicate data removed. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()

    # groups / filters suspected events
    suspect_df = df.groupby(by=columns).filter(lambda x: len(x) >= group_size)
    if keep_first:
        # Skip the first row of each group so that it is not marked for removal.
        suspect_df = suspect_df.groupby(by=columns, as_index=False).nth[1:]
    suspect_mask = df.isin(suspect_df)
    suspect_mask = np.any(suspect_mask, axis=1)

    return filter_out_entries(df, suspect_mask, False, inplace)


def filter_poor_geolocational_data(
    df,
    latitude_col,
    longitude_col,
    mgrs_latitude_col,
    mgrs_longitude_col,
    inplace=False,
):
    """
    Filters out entries of a DataFrame whose latitude and longitude are of poor geolocational quality.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to filter
    latitude_col : str
        The name of the column that contains latitude values
    longitude_col : str
        The name of the column that contains longitude values
    mgrs_latitude_col : str
        The name of the column that contains MGRS latitude values
    mgrs_longitude_col : str
        The name of the column that contains MGRS longitude values
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned; if False, a new DataFrame is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with bad latitude and longitude entries removed. If `inplace=True` it returns None.
    """

    def geolocational_filter(gps_lat, gps_lon, recorded_lat, recorded_lon):
        # Poor quality: the MGRS values equal the GPS values, or either GPS value is a whole number.
        return (
            (recorded_lat == gps_lat and recorded_lon == gps_lon)
            or gps_lat == int(gps_lat)
            or gps_lon == int(gps_lon)
        )

    if not inplace:
        df = df.copy()

    vectorized_filter = np.vectorize(geolocational_filter)
    bad_data = vectorized_filter(
        df[latitude_col].to_numpy(),
        df[longitude_col].to_numpy(),
        df[mgrs_latitude_col].to_numpy(),
        df[mgrs_longitude_col].to_numpy(),
    )

    return filter_out_entries(df, bad_data, False, inplace)


def filter_by_globe_team(
    df, globe_teams_column, target_teams, exclude=False, inplace=False
):
    """
    Finds or filters out specific GLOBE teams.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to filter
    globe_teams_column : str
        The column containing the GLOBE teams.
    target_teams : list of str
        The names of the GLOBE teams to be used.
    exclude : bool, default=False
        Whether to exclude the specified teams from the dataset.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned; if False, a new DataFrame is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with only the specified GLOBE teams (if exclude is False) or without the specified GLOBE teams (if exclude is True). If `inplace=True` it returns None.
    """

    def is_desired_team(team_list):
        # Hashable values (e.g. NaN for a missing team list) never match.
        if not exclude:
            return any(
                [
                    team in team_list if not is_hashable(team_list) else False
                    for team in target_teams
                ]
            )
        else:
            return all(
                [
                    team not in team_list if not is_hashable(team_list) else False
                    for team in target_teams
                ]
            )

    desired_team_filter = np.vectorize(is_desired_team)
    desired_data_mask = desired_team_filter(df[globe_teams_column].to_numpy())

    return filter_out_entries(df, desired_data_mask, True, inplace)

def filter_out_entries(df, mask, include, inplace)

Filters out or selects target entries of a DataFrame using a mask. Mainly serves as a utility function for the other filters.

Parameters

df (pd.DataFrame): The DataFrame to filter
mask (1D np.array of bools): The mask to apply to the DataFrame
include (bool): True to select only the masked values; False to exclude the masked values.
inplace (bool): Whether to perform the operation in place. If True, no DataFrame copy is returned; if False, a new DataFrame is returned.

Returns

pd.DataFrame or None: A DataFrame with the mask filter applied. If inplace=True it returns None.
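
The meaning of the include flag is easy to misread, so here is a minimal, pandas-free sketch of the same selection rule on a plain list. The helper `select_by_mask` and the sample values are hypothetical and are not part of go_utils; the sketch only illustrates the include/exclude semantics, not the in-place handling.

```python
def select_by_mask(records, mask, include):
    """Keep flagged records when include is True, unflagged ones otherwise."""
    if len(records) != len(mask):
        raise ValueError("mask needs exactly one flag per record")
    return [rec for rec, flag in zip(records, mask) if bool(flag) == include]


# Hypothetical sample data: flags mark entries "a" and "c".
entries = ["a", "b", "c", "d"]
mask = [True, False, True, False]

selected = select_by_mask(entries, mask, include=True)    # flagged entries only
remaining = select_by_mask(entries, mask, include=False)  # everything else

print(selected)
print(remaining)
```

The other filters in this submodule build a mask and then pass either True (keep what the mask marks) or False (drop what the mask marks) for include.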

def filter_invalid_coords(df, latitude_col, longitude_col, inclusive=False, inplace=False)

Filters a DataFrame so that latitudes lie within (-90, 90) and longitudes within (-180, 180), or within [-90, 90] and [-180, 180] when inclusive=True.

Parameters

df (pd.DataFrame): The DataFrame to filter
latitude_col (str): The name of the column that contains latitude values
longitude_col (str): The name of the column that contains longitude values
inclusive (bool, default=False): True to make the latitude and longitude bounds inclusive, e.g. [-90, 90]. Note that inclusive bounds may not work with certain GIS software and projections.
inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned; if False, a new DataFrame is returned.

Returns

pd.DataFrame or None: A DataFrame with invalid latitude and longitude entries removed. If inplace=True it returns None.
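
The difference between the exclusive and inclusive bounds only shows up at the poles and the antimeridian. The following sketch applies the same comparisons to plain (latitude, longitude) tuples. The function name `has_valid_coords` and the sample points are assumptions made for illustration; they are not part of go_utils.

```python
def has_valid_coords(lat, lon, inclusive=False):
    """Return True if the point lies inside the valid latitude/longitude range."""
    if inclusive:
        return -90 <= lat <= 90 and -180 <= lon <= 180
    return -90 < lat < 90 and -180 < lon < 180


# Hypothetical sample points: one ordinary point, one at the North Pole,
# one with an impossible longitude, and one with an impossible latitude.
points = [(45.0, 120.0), (90.0, 0.0), (12.5, 181.0), (-91.0, 10.0)]

strict = [p for p in points if has_valid_coords(*p)]
loose = [p for p in points if has_valid_coords(*p, inclusive=True)]

print(strict)
print(loose)
```

With the default exclusive bounds the pole is dropped along with the impossible coordinates; with inclusive=True it is kept.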

def filter_duplicates(df, columns, group_size, keep_first=True, inplace=False)

Filters possible duplicate data by grouping together suspiciously similar entries.

Parameters

df (pd.DataFrame): The DataFrame to filter
columns (list of str): The names of the columns that duplicate data would share. This can include things such as MGRS Latitude, MGRS Longitude, measured date, and other fields (e.g. mosquito water source for Mosquito Habitat Mapper).
group_size (int): The minimum number of entries in a group needed to classify the group as duplicate data.
keep_first (bool, default=True): Whether to keep the first entry of each duplicate group. If False, every entry in a duplicate group is dropped.
inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned; if False, a new DataFrame is returned.

Returns

pd.DataFrame or None: A DataFrame with duplicate data removed. If inplace=True it returns None.
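
To make the interaction between group_size and keep_first concrete, here is a rough sketch of the grouping rule on a list of dictionaries, using only the standard library. It assumes, following the source listing, that a group counts as duplicate data once it has at least group_size members. The helper `drop_duplicate_groups`, the field names, and the sample records are hypothetical.

```python
from collections import defaultdict


def drop_duplicate_groups(records, columns, group_size, keep_first=True):
    """Drop records belonging to groups with at least group_size members."""
    # Collect the positions of all records that share the same key values.
    groups = defaultdict(list)
    for position, record in enumerate(records):
        key = tuple(record[col] for col in columns)
        groups[key].append(position)

    # Mark suspect positions, optionally sparing the first record of each group.
    to_drop = set()
    for positions in groups.values():
        if len(positions) >= group_size:
            to_drop.update(positions[1:] if keep_first else positions)

    return [rec for pos, rec in enumerate(records) if pos not in to_drop]


# Hypothetical records: ids 1-3 share location and date, id 4 does not.
records = [
    {"id": 1, "lat": 10, "lon": 20, "date": "2021-01-01"},
    {"id": 2, "lat": 10, "lon": 20, "date": "2021-01-01"},
    {"id": 3, "lat": 10, "lon": 20, "date": "2021-01-01"},
    {"id": 4, "lat": 11, "lon": 20, "date": "2021-01-01"},
]
shared = ["lat", "lon", "date"]

kept_first = [r["id"] for r in drop_duplicate_groups(records, shared, 3)]
kept_none = [r["id"] for r in drop_duplicate_groups(records, shared, 3, keep_first=False)]

print(kept_first)
print(kept_none)
```

Setting keep_first=False corresponds to the procedure described for Low et al., where every entry of a duplicate group is dropped.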

def filter_poor_geolocational_data(df, latitude_col, longitude_col, mgrs_latitude_col, mgrs_longitude_col, inplace=False)

Filters out entries of a DataFrame whose latitude and longitude are of poor geolocational quality.

Parameters

df (pd.DataFrame): The DataFrame to filter
latitude_col (str): The name of the column that contains latitude values
longitude_col (str): The name of the column that contains longitude values
mgrs_latitude_col (str): The name of the column that contains MGRS latitude values
mgrs_longitude_col (str): The name of the column that contains MGRS longitude values
inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned; if False, a new DataFrame is returned.

Returns

pd.DataFrame or None: A DataFrame with bad latitude and longitude entries removed. If inplace=True it returns None.
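
The quality check is a per-entry predicate, which the sketch below restates on plain numbers. It assumes finite numeric inputs, since int() raises on NaN. The name `is_poor_geolocation` and the sample coordinates are made up for illustration.

```python
def is_poor_geolocation(gps_lat, gps_lon, mgrs_lat, mgrs_lon):
    """Return True if an entry would be treated as poor geolocational data."""
    # Case 1: the MGRS values are identical to the GPS values.
    matches_mgrs = gps_lat == mgrs_lat and gps_lon == mgrs_lon
    # Case 2: either GPS coordinate is a whole number.
    whole_number = gps_lat == int(gps_lat) or gps_lon == int(gps_lon)
    return matches_mgrs or whole_number


# Hypothetical entries: (gps_lat, gps_lon, mgrs_lat, mgrs_lon)
plausible = (35.1234, -80.5678, 35.1230, -80.5670)
whole_lat = (35.0, -80.5678, 35.1230, -80.5670)
identical = (35.1234, -80.5678, 35.1234, -80.5678)

for entry in (plausible, whole_lat, identical):
    print(entry, is_poor_geolocation(*entry))
```

Because the check is naive, a genuine measurement that happens to fall on a whole-degree latitude or longitude is also flagged.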

def filter_by_globe_team(df, globe_teams_column, target_teams, exclude=False, inplace=False)

Finds or filters out specific GLOBE teams.

Parameters

df (pd.DataFrame): The DataFrame to filter
globe_teams_column (str): The column containing the GLOBE teams.
target_teams (list of str): The names of the GLOBE teams to be used.
exclude (bool, default=False): Whether to exclude the specified teams from the dataset.
inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned; if False, a new DataFrame is returned.

Returns

pd.DataFrame or None: A DataFrame with only the specified GLOBE teams (if exclude is False) or without the specified GLOBE teams (if exclude is True). If inplace=True it returns None.
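
Judging from the source listing, an entry whose team value is hashable (for example a missing value) never matches, so it is dropped in both modes. The sketch below illustrates that behavior on plain dictionaries. It is a simplification: it treats only Python lists as team collections, whereas the source bases the check on hashability. The helper name, team names, and records are hypothetical.

```python
def is_desired_team(team_list, target_teams, exclude=False):
    """Decide whether an entry's team list passes the team filter."""
    # Simplification: anything that is not a list (e.g. None for a missing
    # value) is treated as non-matching, so the entry is dropped.
    if not isinstance(team_list, list):
        return False
    if exclude:
        return all(team not in team_list for team in target_teams)
    return any(team in team_list for team in target_teams)


# Hypothetical entries with made-up team names.
rows = [
    {"id": 1, "teams": ["Team A", "Team B"]},
    {"id": 2, "teams": ["Team C"]},
    {"id": 3, "teams": None},
]
targets = ["Team A"]

only_targets = [r["id"] for r in rows if is_desired_team(r["teams"], targets)]
without_targets = [
    r["id"] for r in rows if is_desired_team(r["teams"], targets, exclude=True)
]

print(only_targets)
print(without_targets)
```

Note that the entry without a team list appears in neither result, which may be surprising when exclude=True.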