go_utils.cleanup

Overview

This submodule contains several methods to assist with data cleanup. The following sections discuss some of the decisions behind these methods and their role in a larger data cleanup pipeline.

Methods

Remove Redundant/Homogenous Columns:

This method identifies and removes columns where all values are the same. If the logging level is set to INFO, the method will also print the names of the dropped columns and their respective singular values as Python dictionaries. For the raw mosquito habitat mapper data, the following were dropped:

  • {'protocol': 'mosquito_habitat_mapper'}
  • {'ExtraData': None}
  • {'MosquitoEggCount': None}
  • {'DataSource': 'GLOBE Observer App'}

For raw landcover data, the following were dropped:

  • {'protocol': 'land_covers'}
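
The dropping behavior can be sketched in plain pandas (using a hypothetical two-column frame; the actual implementation is the remove_homogenous_cols method shown below):

```python
import pandas as pd

# Hypothetical raw data: "protocol" holds a single repeated value
df = pd.DataFrame({
    "protocol": ["land_covers", "land_covers"],
    "Latitude": [10.12345, 20.54321],
})

# Drop every column that has exactly one unique value
homogenous = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
df = df.drop(columns=homogenous)
# df now retains only the "Latitude" column
```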

Rename columns:

Differentiating between MGRS and GPS Columns:

The GLOBE API data for MosquitoHabitatMapper and LandCovers report each observation’s Military Grid Reference System (MGRS) coordinates in the latitude and longitude fields. The GPS coordinates are stored in the MeasurementLatitude and MeasurementLongitude fields. To avoid confusion between these coordinate systems, this method renames latitude and longitude to MGRSLatitude and MGRSLongitude, respectively, and MeasurementLatitude and MeasurementLongitude to Latitude and Longitude, respectively. Now, the official Latitude and Longitude columns are more intuitively named.
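
In pandas terms, the renaming amounts to the following (a hypothetical empty frame with only the four coordinate columns):

```python
import pandas as pd

# Hypothetical raw GLOBE columns: MGRS coordinates in latitude/longitude,
# GPS coordinates in MeasurementLatitude/MeasurementLongitude
df = pd.DataFrame(columns=[
    "latitude", "longitude", "MeasurementLatitude", "MeasurementLongitude"
])

# The same mapping rename_latlon_cols applies
df = df.rename(columns={
    "latitude": "MGRSLatitude",
    "longitude": "MGRSLongitude",
    "MeasurementLatitude": "Latitude",
    "MeasurementLongitude": "Longitude",
})
```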

Protocol Abbreviation:

To better support future cross-protocol analysis and data enrichment, this method applies the following naming scheme to all column names: protocolAbbreviation_columnName, where protocolAbbreviation is the abbreviation for the protocol (mhm for mosquito habitat mapper and lc for land cover) and columnName is the original name of the column. For example, the mosquito habitat mapper “MGRSLongitude” column was renamed “mhm_MGRSLongitude” and the corresponding land cover column was renamed “lc_MGRSLongitude”.

Do note that if you would like to use the previously mentioned mhm and lc naming scheme for your data, the go_utils.mhm and go_utils.lc submodules each have a method called cleanup_column_prefix which uses the previously mentioned naming scheme, as opposed to the replace_column_prefix method, which requires that you specify the current prefix and the desired prefix.
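
The prefix replacement itself is a simple column-name rewrite; this sketch mirrors the list comprehension replace_column_prefix uses, with a hypothetical raw column name:

```python
import pandas as pd

# Hypothetical raw column name carrying the full protocol prefix
df = pd.DataFrame(columns=["mosquitohabitatmapperMGRSLongitude"])

current_prefix, replacement_text = "mosquitohabitatmapper", "mhm"
df.columns = [
    f"{replacement_text}_{column.replace(current_prefix, '')}"
    for column in df.columns
]
# Columns are now ["mhm_MGRSLongitude"]
```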

Standardize no-data values:

The GLOBE API CSVs lack standardization in indicating no data. Indicators range from Python's None, to “null”, to an empty cell, to NaN (np.nan). To improve the computational efficiency of future mathematical algorithms on the GLOBE datasets, this method converts all no-data indicators to np.nan (NumPy’s float representation of no data). Do note that later, in Round Appropriate Columns, all numerical extraneous values are converted from np.nan to -9999. Thus, users will receive the pre-processed GLOBE API Mosquito Habitat Mapper and Land Cover data in accordance with the standards described by Cook et al. (2018).
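
As a sketch of the conversion (a hypothetical column mixing the indicators listed above):

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing None, "null", and an empty cell with real data
df = pd.DataFrame({"MosquitoEggCount": [None, "null", "", 25]})

# Map the text indicators to np.nan, then fill remaining missing values
df = df.replace({"null": np.nan, "": np.nan, "NaN": np.nan, "nan": np.nan})
df = df.fillna(np.nan)
# The first three entries are now missing values; 25 is untouched
```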

Round Appropriate Columns

This method does the following:

  1. Identifies all numerical columns (e.g. float64, float, int, int64).
  2. Rounds latitude and longitude columns to 5 decimal places. This reduces data density while still corresponding to about a meter of accuracy; any larger number of decimal places would consume unnecessary amounts of storage, as the GLOBE Observer app cannot attain such precision.
  3. Converts other numerical data to integers. To improve the datasets’ memory footprint and performance, non-latitude/longitude numerical values were converted to integers in the remaining columns, including Id, MeasurementElevation, and elevation. This is appropriate since ids are always discrete values, and MeasurementElevation and elevation are imprecise estimates from third-party sources, rendering additional precision an unnecessary waste of memory. However, converting these values to integers means np.nan, a float, can no longer denote extraneous/empty values; for integer columns, -9999 denotes extraneous/empty values instead.

Note: Larvae counts and land classification column percentages were also converted to integers, reducing data density. This logic is further discussed in go_utils.mhm.larvae_to_num for mosquito habitat mapper and go_utils.lc.unpack_classifications for land cover.
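
The three steps above can be sketched as follows (hypothetical single-row data; the actual implementation is the round_cols method below):

```python
import numpy as np
import pandas as pd

# Hypothetical numerical columns: a coordinate, an elevation, and a count
df = pd.DataFrame({
    "mhm_Latitude": [10.123456789],
    "mhm_elevation": [12.7],
    "mhm_LarvaeCount": [np.nan],
})

for col in df.columns:
    df[col] = df[col].fillna(-9999)  # np.nan cannot live in an int column
    if "latitude" in col.lower() or "longitude" in col.lower():
        df[col] = df[col].round(5)  # ~1 m of precision
    else:
        df[col] = df[col].astype(int)  # truncates to whole numbers
```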

import logging

import numpy as np
import pandas as pd
from pytz import timezone
from timezonefinder import TimezoneFinder

__doc__ = """

# Overview
This submodule contains several methods to assist with data cleanup.
The following sections discuss some of the decisions behind these methods and their role in a larger data cleanup pipeline.

# Methods

## Remove Redundant/Homogenous Columns:
[This method](#remove_homogenous_cols) identifies and removes columns where all values are the same. If the logging level is set to INFO, the method will also print the names of the dropped columns and their respective singular values as Python dictionaries.
For the raw mosquito habitat mapper data, the following were dropped:
- {'protocol': 'mosquito_habitat_mapper'}
- {'ExtraData': None}
- {'MosquitoEggCount': None}
- {'DataSource': 'GLOBE Observer App'}

For raw landcover data, the following were dropped:
- {'protocol': 'land_covers'}

## Rename columns:

### Differentiating between MGRS and GPS Columns:
The GLOBE API data for `MosquitoHabitatMapper` and `LandCovers` report each observation’s Military Grid Reference System (MGRS) coordinates in the `latitude` and `longitude` fields. The GPS coordinates are stored in the `MeasurementLatitude` and `MeasurementLongitude` fields.
To avoid confusion between these coordinate systems, [this method](#rename_latlon_cols) renames `latitude` and `longitude` to `MGRSLatitude` and `MGRSLongitude`, respectively, and `MeasurementLatitude` and `MeasurementLongitude` to `Latitude` and `Longitude`, respectively. Now, the official `Latitude` and `Longitude` columns are more intuitively named.

### Protocol Abbreviation:
To better support future cross-protocol analysis and data enrichment, [this method](#replace_column_prefix) applies the following naming scheme to all column names: `protocolAbbreviation_columnName`, where `protocolAbbreviation` is the abbreviation for the protocol (`mhm` for mosquito habitat mapper and `lc` for land cover) and `columnName` is the original name of the column. For example, the mosquito habitat mapper “MGRSLongitude” column was renamed “mhm_MGRSLongitude” and the corresponding land cover column was renamed “lc_MGRSLongitude”.

Do note that if you would like to use the previously mentioned `mhm` and `lc` naming scheme for your data, the `go_utils.mhm` and `go_utils.lc` submodules each have a method called `cleanup_column_prefix` which uses the previously mentioned naming scheme, as opposed to the `replace_column_prefix` method, which requires that you specify the current prefix and the desired prefix.

## Standardize no-data values:
The GLOBE API CSVs lack standardization in indicating no data. Indicators range from Python's `None`, to `"null"`, to an empty cell, to `NaN` (`np.nan`). To improve the computational efficiency of future mathematical algorithms on the GLOBE datasets, [this method](#standardize_null_values) converts all no-data indicators to np.nan (NumPy’s float representation of no data). Do note that later, in Round Appropriate Columns, all numerical extraneous values are converted from np.nan to -9999. Thus, users will receive the pre-processed GLOBE API Mosquito Habitat Mapper and Land Cover data in accordance with the standards described by Cook et al. (2018).

## Round Appropriate Columns
[This method](#round_cols) does the following:
1. Identifies all numerical columns (e.g. `float64`, `float`, `int`, `int64`).
2. Rounds latitude and longitude columns to 5 decimal places. This reduces data density while still corresponding to about a meter of accuracy; any larger number of decimal places would consume unnecessary amounts of storage, as the GLOBE Observer app cannot attain such precision.
3. Converts other numerical data to integers. To improve the datasets’ memory footprint and performance, non-latitude/longitude numerical values were converted to integers in the remaining columns, including `Id`, `MeasurementElevation`, and `elevation`. This is appropriate since ids are always discrete values, and `MeasurementElevation` and `elevation` are imprecise estimates from third-party sources, rendering additional precision an unnecessary waste of memory. However, converting these values to integers means np.nan, a float, can no longer denote extraneous/empty values; for integer columns, -9999 denotes extraneous/empty values instead.

**Note**: Larvae counts and land classification column percentages were also converted to integers, reducing data density. This logic is further discussed in go_utils.mhm.larvae_to_num for mosquito habitat mapper and go_utils.lc.unpack_classifications for land cover.

"""


def adjust_timezones(df, time_col, latitude_col, longitude_col, inplace=False):
    """
    Calculates the timezone offset and adjusts date columns accordingly. This is done because GLOBE data uses UTC times, and it can be useful to have the time adjusted to the local observation time.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to adjust time zones for
    time_col : str
        The column that contains the time data
    latitude_col : str
        The column that contains latitude data
    longitude_col : str
        The column that contains longitude data
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with its time entries adjusted to their local timezones. If `inplace=True` it returns None.
    """
    tf = TimezoneFinder()

    def convert_timezone(time, latitude, longitude):
        utc_tz = pd.to_datetime(time, utc=True)
        local_time_zone = timezone(tf.timezone_at(lng=longitude, lat=latitude))
        return utc_tz.astimezone(local_time_zone)

    time_zone_converter = np.vectorize(convert_timezone)

    if not inplace:
        df = df.copy()

    df[time_col] = time_zone_converter(
        df[time_col].to_numpy(),
        df[latitude_col].to_numpy(),
        df[longitude_col].to_numpy(),
    )

    if not inplace:
        return df


def remove_homogenous_cols(df, exclude=[], inplace=False):
    """
    Removes columns from a DataFrame if they contain only one unique value.

    If `inplace=True`, the original `df` variable that was passed is updated with these columns dropped.

    If you would like to see the columns that are dropped, set the logging level to INFO.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame that will be modified
    exclude : list of str, default=[]
        A list of any columns that should be excluded from this removal.
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with homogenous columns removed. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()

    for column in df.columns:
        try:
            if column not in exclude and len(pd.unique(df[column])) == 1:
                # Log the dropped column name and its singular value as a dict
                logging.info("Dropped: %s", {column: df[column].iloc[0]})
                df.drop(column, axis=1, inplace=True)
        except TypeError:
            continue

    if not inplace:
        return df


def replace_column_prefix(df, current_prefix, replacement_text, inplace=False):
    """
    Replaces the protocol prefix (e.g. mosquito_habitat_mapper/mosquitohabitatmapper) in the column names with another prefix in the format of `newPrefix_columnName`.

    If you are interested in replacing the prefixes for the raw mosquito habitat mapper and landcover datasets, use the go_utils.lc.cleanup_column_prefix and go_utils.mhm.cleanup_column_prefix methods.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame you would like updated
    current_prefix : str
        A string representing the current protocol prefix.
    replacement_text : str
        A string representing the desired prefix for the column name.
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the column prefixes replaced. If `inplace=True` it returns None.
    """
    if not inplace:
        df = df.copy()
    df.columns = [
        f"{replacement_text}_{column.replace(current_prefix, '')}"
        for column in df.columns
    ]
    if not inplace:
        return df


def find_column(df, keyword):
    """Finds the first column that contains a certain keyword. Mainly intended to be a utility function for some of the other methods.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing the columns that need to be searched.
    keyword : str
        The keyword that needs to be present in the desired column.
    """

    return [column for column in df.columns if keyword in column][0]


def camel_case(string, delimiters=[" "]):
    """Converts a string into camel case

    Parameters
    ----------
    string : str
        The string to convert
    delimiters : list of str, default=[" "]
        The characters that denote separate words
    """
    for delimiter in delimiters:
        str_list = [s[0].upper() + s[1:] for s in string.split(delimiter)]
        string = "".join(str_list)
    return string


def rename_latlon_cols(
    df,
    gps_latitude="",
    gps_longitude="",
    mgrs_latitude="latitude",
    mgrs_longitude="longitude",
    inplace=False,
):
    """Renames the latitude and longitude columns of **raw** GLOBE Observer Data to make the naming intuitive.

    [This](#differentiating-between-mgrs-and-gps-columns) explains the motivation behind the method.

    Example usage:
    ```python
    from go_utils.cleanup import rename_latlon_cols
    rename_latlon_cols(df)
    ```

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame whose columns require renaming.
    gps_latitude : str, default=""
        The name of the GPS latitude column. If empty, the first column containing "MeasurementLatitude" is used.
    gps_longitude : str, default=""
        The name of the GPS longitude column. If empty, the first column containing "MeasurementLongitude" is used.
    mgrs_latitude : str, default="latitude"
        The name of the MGRS latitude column.
    mgrs_longitude : str, default="longitude"
        The name of the MGRS longitude column.
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the updated Latitude and Longitude column names. If `inplace=True` it returns None.
    """
    if not inplace:
        df = df.copy()

    if not gps_latitude:
        gps_latitude = find_column(df, "MeasurementLatitude")
    if not gps_longitude:
        gps_longitude = find_column(df, "MeasurementLongitude")
    df.rename(
        {
            gps_latitude: "Latitude",
            gps_longitude: "Longitude",
            mgrs_latitude: "MGRSLatitude",
            mgrs_longitude: "MGRSLongitude",
        },
        axis=1,
        inplace=True,
    )

    if not inplace:
        return df


def round_cols(df, inplace=False):
    """This rounds columns in the DataFrame. More specifically, latitude and longitude data is rounded to 5 decimal places, other numerical fields are converted to integers, and null values (for the integer columns) are set to -9999.

    See [here](#round-appropriate-columns) for more information.

    Example usage:
    ```python
    from go_utils.cleanup import round_cols
    round_cols(df)
    ```

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame that requires rounding.
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the rounded values. If `inplace=True` it returns None.
    """
    if not inplace:
        df = df.copy()
    # Identify all the numerical columns
    number_cols = [
        df.columns[i]
        for i in range(len(df.dtypes))
        if (df.dtypes[i] == "float64")
        or (df.dtypes[i] == "float")
        or (df.dtypes[i] == "int")
        or (df.dtypes[i] == "int64")
    ]

    # Round columns appropriately
    column_round = np.vectorize(lambda x, digits: round(x, digits))
    for name in number_cols:
        df[name] = df[name].fillna(-9999)
        if ("latitude" in name.lower()) or ("longitude" in name.lower()):
            logging.info(f"Rounded to 5 decimals: {name}")
            df[name] = column_round(df[name].to_numpy(), 5)
        else:
            logging.info(f"Converted to integer: {name}")
            df[name] = df[name].to_numpy().astype(int)

    if not inplace:
        return df


def standardize_null_vals(df, null_val=np.nan, inplace=False):
    """
    This method standardizes the null values of **raw** GLOBE Observer Data.

    Example usage:
    ```python
    from go_utils.cleanup import standardize_null_vals
    standardize_null_vals(df)
    ```

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame that needs null value standardization
    null_val : obj, default=np.nan
        The value that all null values should be set to
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the standardized null values. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()

    # Replace null values with null_val
    df.fillna(null_val, inplace=True)

    # Replace any text null values
    df.replace(
        {
            "null": null_val,
            "": null_val,
            "NaN": null_val,
            "nan": null_val,
            None: null_val,
        },
        inplace=True,
    )

    if not inplace:
        return df
def adjust_timezones(df, time_col, latitude_col, longitude_col, inplace=False)

Calculates the timezone offset and adjusts date columns accordingly. This is done because GLOBE data uses UTC times, and it can be useful to have the time adjusted to the local observation time.

Parameters
  • df (pd.DataFrame): The DataFrame to adjust time zones for
  • time_col (str): The column that contains the time data
  • latitude_col (str): The column that contains latitude data
  • longitude_col (str): The column that contains longitude data
  • inplace (bool, default=False): Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.
Returns
  • pd.DataFrame or None: A DataFrame with its time entry adjusted to its local timezone. If inplace=True it returns None.
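
To illustrate the conversion this performs, here is a minimal standard-library sketch; a fixed zone name stands in for the one TimezoneFinder would derive from the observation's coordinates:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A UTC timestamp as GLOBE would report it
utc_time = datetime(2021, 7, 1, 15, 0, tzinfo=timezone.utc)

# Assume the coordinates resolved to America/New_York (UTC-4 in July)
local_time = utc_time.astimezone(ZoneInfo("America/New_York"))
# local_time is 11:00 local (EDT)
```
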
def remove_homogenous_cols(df, exclude=[], inplace=False)

Removes columns from a DataFrame if they contain only one unique value.

If inplace=True, the original df variable that was passed is updated with these columns dropped.

If you would like to see the columns that are dropped, set the logging level to INFO.

Parameters
  • df (pd.DataFrame): The DataFrame that will be modified
  • exclude (list of str, default=[]): A list of any columns that should be excluded from this removal.
  • inplace (bool, default=False): Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.
Returns
  • pd.DataFrame or None: A DataFrame with homogenous columns removed. If inplace=True it returns None.
def replace_column_prefix(df, current_prefix, replacement_text, inplace=False)

Replaces the protocol prefix (e.g. mosquito_habitat_mapper/mosquitohabitatmapper) for the column names with another prefix in the format of newPrefix_columnName.

If you are interested in replacing the prefixes for the raw mosquito habitat mapper and landcover datasets, use the go_utils.lc.cleanup_column_prefix and go_utils.mhm.cleanup_column_prefix methods.

Parameters
  • df (pd.DataFrame): The DataFrame you would like updated
  • current_prefix (str): A string representing the current protocol prefix.
  • replacement_text (str): A string representing the desired prefix for the column name.
  • inplace (bool, default=False): Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.
Returns
  • pd.DataFrame or None: A DataFrame with the column prefixes replaced. If inplace=True it returns None.
def find_column(df, keyword)

Finds the first column that contains a certain keyword. Mainly intended to be a utility function for some of the other methods.

Parameters
  • df (pd.DataFrame): The DataFrame containing the columns that need to be searched.
  • keyword (str): The keyword that needs to be present in the desired column.
def camel_case(string, delimiters=[' '])

Converts a string into camel case.

Parameters
  • string (str): the string to convert
  • delimiters (list of str, default=[' ']): the characters that denote separate words
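
A self-contained restatement of the function's logic, for illustration:

```python
def camel_case(string, delimiters=[" "]):
    """Capitalize the first letter of each delimited word, then join them."""
    for delimiter in delimiters:
        parts = [s[0].upper() + s[1:] for s in string.split(delimiter)]
        string = "".join(parts)
    return string

camel_case("land cover")                   # -> "LandCover"
camel_case("land-cover type", [" ", "-"])  # -> "LandCoverType"
```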

def rename_latlon_cols( df, gps_latitude='', gps_longitude='', mgrs_latitude='latitude', mgrs_longitude='longitude', inplace=False)

Renames the latitude and longitude columns of raw GLOBE Observer Data to make the naming intuitive.

The Differentiating between MGRS and GPS Columns section above explains the motivation behind the method.

Example usage:

from go_utils.cleanup import rename_latlon_cols
rename_latlon_cols(df)
Parameters
  • df (pd.DataFrame): The DataFrame whose columns require renaming.
  • inplace (bool, default=False): Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.
Returns
  • pd.DataFrame or None: A DataFrame with the updated Latitude and Longitude column names. If inplace=True it returns None.
def round_cols(df, inplace=False)

This rounds columns in the DataFrame. More specifically, latitude and longitude data is rounded to 5 decimal places, other numerical fields are converted to integers, and null values (for the integer columns) are set to -9999.

See the Round Appropriate Columns section above for more information.

Example usage:

from go_utils.cleanup import round_cols
round_cols(df)
Parameters
  • df (pd.DataFrame): The DataFrame that requires rounding.
  • inplace (bool, default=False): Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.
Returns
  • pd.DataFrame or None: A DataFrame with the rounded values. If inplace=True it returns None.
def standardize_null_vals(df, null_val=nan, inplace=False)

This method standardizes the null values of raw GLOBE Observer Data.

Example usage:
from go_utils.cleanup import standardize_null_vals
standardize_null_vals(df)
Parameters
  • df (pd.DataFrame): The DataFrame that needs null value standardization
  • null_val (obj, default=np.nan): The value that all null values should be set to
  • inplace (bool, default=False): Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.
Returns
  • pd.DataFrame or None: A DataFrame with the standardized null values. If inplace=True it returns None.