go_utils.cleanup
Overview
This submodule contains several methods to assist with data cleanup. The following sections discuss some of the decisions behind these methods and their role in a larger data cleanup pipeline.
Methods
Remove Redundant/Homogenous Columns:
This method identifies and removes columns in which every value is the same. If the logging level is set to INFO, the method also prints the name of each dropped column and its single value as a Python dictionary. For the raw mosquito habitat mapper data, the following were dropped:
- {'protocol': 'mosquito_habitat_mapper'}
- {'ExtraData': None}
- {'MosquitoEggCount': None}
- {'DataSource': 'GLOBE Observer App'}
For raw land cover data, the following were dropped:
- {'protocol': 'land_covers'}
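As a sketch of the idea (using plain pandas rather than the library itself; the sample column values are hypothetical), the homogenous-column removal amounts to:

```python
import pandas as pd

# Hypothetical sample mirroring a few raw mosquito habitat mapper columns.
df = pd.DataFrame(
    {
        "protocol": ["mosquito_habitat_mapper"] * 3,
        "DataSource": ["GLOBE Observer App"] * 3,
        "LarvaeCount": [1, 4, 2],
    }
)

# Find columns with a single unique value and record {name: value} pairs.
homogenous = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
dropped = {c: df[c].iloc[0] for c in homogenous}
df = df.drop(columns=homogenous)

print(dropped)           # {'protocol': 'mosquito_habitat_mapper', 'DataSource': 'GLOBE Observer App'}
print(list(df.columns))  # ['LarvaeCount']
```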
Rename columns:
Differentiating between MGRS and GPS Columns:
The GLOBE API data for `MosquitoHabitatMapper` and `LandCovers` report each observation's Military Grid Reference System (MGRS) coordinates in the `latitude` and `longitude` fields. The GPS coordinates are stored in the `MeasurementLatitude` and `MeasurementLongitude` fields.
To avoid confusion between these coordinate systems, this method renames `latitude` and `longitude` to `MGRSLatitude` and `MGRSLongitude`, and `MeasurementLatitude` and `MeasurementLongitude` to `Latitude` and `Longitude`, respectively. Now the official `Latitude` and `Longitude` columns are more intuitively named.
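The rename can be sketched with plain pandas (toy values; column names taken from the description above):

```python
import pandas as pd

# Toy row: raw GLOBE data stores MGRS coordinates in latitude/longitude
# and GPS coordinates in MeasurementLatitude/MeasurementLongitude.
df = pd.DataFrame(
    {
        "latitude": [10.0],
        "longitude": [20.0],
        "MeasurementLatitude": [10.00001],
        "MeasurementLongitude": [20.00001],
    }
)

df = df.rename(
    columns={
        "latitude": "MGRSLatitude",
        "longitude": "MGRSLongitude",
        "MeasurementLatitude": "Latitude",
        "MeasurementLongitude": "Longitude",
    }
)
print(list(df.columns))  # ['MGRSLatitude', 'MGRSLongitude', 'Latitude', 'Longitude']
```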
Protocol Abbreviation:
To better support future cross-protocol analysis and data enrichment, this method applies the following naming scheme to all column names: `protocolAbbreviation_columnName`, where `protocolAbbreviation` is the abbreviation for the protocol (`mhm` for mosquito habitat mapper and `lc` for land cover) and `columnName` is the original name of the column. For example, the mosquito habitat mapper "MGRSLongitude" column was renamed "mhm_MGRSLongitude" and the corresponding land cover column was renamed "lc_MGRSLongitude".
Do note that if you would like to use the previously mentioned `mhm` and `lc` naming scheme for your data, the `go_utils.mhm` and `go_utils.lc` submodules each have a method called `cleanup_column_prefix` that applies this naming scheme, as opposed to the `replace_column_prefix` method, which requires that you specify the current prefix and the desired prefix.
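For illustration, the prefix replacement reduces to a list comprehension over the column names (a minimal sketch; the raw prefix `mosquitohabitatmapper` follows the example given in `replace_column_prefix`'s docstring, and the sample columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    columns=["mosquitohabitatmapperMGRSLongitude", "mosquitohabitatmapperLarvaeCount"]
)

# Strip the old prefix, then prepend the abbreviation with an underscore.
current_prefix, replacement_text = "mosquitohabitatmapper", "mhm"
df.columns = [
    f"{replacement_text}_{column.replace(current_prefix, '')}" for column in df.columns
]
print(list(df.columns))  # ['mhm_MGRSLongitude', 'mhm_LarvaeCount']
```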
Standardize no-data values:
The GLOBE API CSVs lack standardization in indicating No Data. Indicators range from Python's `None`, to `"null"`, to an empty cell, to `NaN` (`np.nan`). To improve the computational efficiency of future mathematical algorithms on the GLOBE datasets, this method converts all No Data indicators to `np.nan` (NumPy's floating-point representation of No Data). Do note that later, in Round Appropriate Columns, all numerical no-data values are converted from `np.nan` to -9999. Thus, users receive the pre-processed GLOBE API Mosquito Habitat Mapper and Land Cover data in accordance with the standards described by Cook et al. (2018).
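A minimal sketch of the conversion with plain pandas/NumPy (sample values chosen to cover each indicator):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"MosquitoEggCount": [None, "null", "", "NaN", 25]})

# Map every textual no-data indicator to np.nan; None already reads as missing.
df = df.replace({"null": np.nan, "": np.nan, "NaN": np.nan, "nan": np.nan})

print(df["MosquitoEggCount"].isna().sum())  # 4
```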
Round Appropriate Columns
This method does the following:
1. Identifies all numerical columns (e.g. `float64`, `float`, `int`, `int64`).
2. Rounds latitude and longitude columns to 5 decimal places. To reduce data density, all latitude and longitude values are rounded to 5 decimal places, which corresponds to roughly a meter of accuracy. Any larger number of decimal places would consume unnecessary amounts of storage, as the GLOBE Observer app cannot attain such precision.
3. Converts other numerical data to integers. To improve the datasets' memory use and performance, the remaining numerical (non-latitude/longitude) columns, including the `Id`, `MeasurementElevation`, and `elevation` columns, are converted to integers. This is appropriate since ids are always discrete values, and `MeasurementElevation` and `elevation` are imprecise estimates from third-party sources, rendering additional precision an unnecessary waste of memory. However, by converting these values to integers, we can no longer use `np.nan`, a float, to denote extraneous/empty values. Thus, for integer columns, -9999 denotes extraneous/empty values.

Note: Larvae counts and land classification column percentages were also converted to integers, reducing data density. This logic is further discussed in go_utils.mhm.larvae_to_num for mosquito habitat mapper and go_utils.lc.unpack_classifications for land cover.
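The rounding rules above can be sketched as follows (a simplified stand-in for `round_cols`, with made-up sample values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Latitude": [10.1234567, np.nan],
        "LarvaeCount": [3.0, np.nan],
    }
)

for name in df.columns:
    df[name] = df[name].fillna(-9999)  # integer columns cannot hold np.nan
    if "latitude" in name.lower() or "longitude" in name.lower():
        df[name] = df[name].round(5)   # ~1 m of precision
    else:
        df[name] = df[name].astype(int)

print(df["LarvaeCount"].tolist())  # [3, -9999]
```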
The submodule's imports:

```python
import logging

import numpy as np
import pandas as pd
from pytz import timezone
from timezonefinder import TimezoneFinder
```
adjust_timezones

```python
def adjust_timezones(df, time_col, latitude_col, longitude_col, inplace=False):
    """
    Calculates the timezone offset and adjusts date columns accordingly. This is done because GLOBE data uses UTC timestamps, and it can be useful to have the time adjusted to the local observation time.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to adjust time zones for
    time_col : str
        The column that contains the time data
    latitude_col : str
        The column that contains latitude data
    longitude_col : str
        The column that contains longitude data
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with its time entries adjusted to the local timezone. If `inplace=True`, it returns None.
    """
    tf = TimezoneFinder()

    def convert_timezone(time, latitude, longitude):
        utc_tz = pd.to_datetime(time, utc=True)
        local_time_zone = timezone(tf.timezone_at(lng=longitude, lat=latitude))
        return utc_tz.astimezone(local_time_zone)

    time_zone_converter = np.vectorize(convert_timezone)

    if not inplace:
        df = df.copy()

    df[time_col] = time_zone_converter(
        df[time_col].to_numpy(),
        df[latitude_col].to_numpy(),
        df[longitude_col].to_numpy(),
    )

    if not inplace:
        return df
```
Calculates the timezone offset and adjusts date columns accordingly. This is done because GLOBE data uses UTC timestamps, and it can be useful to have the time adjusted to the local observation time.
Parameters
- df (pd.DataFrame): The DataFrame to adjust time zones for
- time_col (str): The column that contains the time data
- latitude_col (str): The column that contains latitude data
- longitude_col (str): The column that contains longitude data
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned.
Returns
- pd.DataFrame or None: A DataFrame with its time entries adjusted to the local timezone. If `inplace=True`, it returns None.
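As a small illustration of the underlying conversion (using a fixed IANA zone directly, whereas `adjust_timezones` looks the zone up from the coordinates via `timezonefinder`):

```python
import pandas as pd

# 15:00 UTC on a summer date lands at 11:00 in America/New_York (EDT, UTC-4).
utc_time = pd.to_datetime("2021-06-01 15:00", utc=True)
local_time = utc_time.tz_convert("America/New_York")
print(local_time.hour)  # 11
```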
remove_homogenous_cols

```python
def remove_homogenous_cols(df, exclude=[], inplace=False):
    """
    Removes columns from a DataFrame if they contain only 1 unique value.

    If `inplace=True`, the original `df` variable that was passed is updated with these columns dropped.

    If you would like to see the columns that are dropped, set the logging level to INFO.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame that will be modified
    exclude : list of str, default=[]
        A list of any columns that should be excluded from this removal.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with homogenous columns removed. If `inplace=True`, it returns None.
    """

    if not inplace:
        df = df.copy()

    for column in df.columns:
        try:
            if column not in exclude and len(pd.unique(df[column])) == 1:
                # Log the dropped column name and its single value as a dictionary.
                dropped = {column: df[column].iloc[0]}
                logging.info(f"Dropped: {dropped}")
                df.drop(column, axis=1, inplace=True)
        except TypeError:
            continue

    if not inplace:
        return df
```
Removes columns from a DataFrame if they contain only 1 unique value.
If `inplace=True`, the original `df` variable that was passed is updated with these columns dropped.
If you would like to see the columns that are dropped, set the logging level to INFO.
Parameters
- df (pd.DataFrame): The DataFrame that will be modified
- exclude (list of str, default=[]): A list of any columns that should be excluded from this removal.
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned.
Returns
- pd.DataFrame or None: A DataFrame with homogenous columns removed. If `inplace=True`, it returns None.
replace_column_prefix

```python
def replace_column_prefix(df, current_prefix, replacement_text, inplace=False):
    """
    Replaces the protocol prefix (e.g. mosquito_habitat_mapper/mosquitohabitatmapper) in the column names with another prefix in the format of `newPrefix_columnName`.

    If you are interested in replacing the prefixes for the raw mosquito habitat mapper and land cover datasets, use the go_utils.lc.cleanup_column_prefix and go_utils.mhm.cleanup_column_prefix methods.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame you would like updated
    current_prefix : str
        A string representing the current protocol prefix.
    replacement_text : str
        A string representing the desired prefix for the column names.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the column prefixes replaced. If `inplace=True`, it returns None.
    """
    if not inplace:
        df = df.copy()
    df.columns = [
        f"{replacement_text}_{column.replace(current_prefix, '')}"
        for column in df.columns
    ]
    if not inplace:
        return df
```
Replaces the protocol prefix (e.g. mosquito_habitat_mapper/mosquitohabitatmapper) in the column names with another prefix in the format of `newPrefix_columnName`.
If you are interested in replacing the prefixes for the raw mosquito habitat mapper and land cover datasets, use the `go_utils.lc.cleanup_column_prefix` and `go_utils.mhm.cleanup_column_prefix` methods.
Parameters
- df (pd.DataFrame): The DataFrame you would like updated
- current_prefix (str): A string representing the current protocol prefix.
- replacement_text (str): A string representing the desired prefix for the column names.
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned.
Returns
- pd.DataFrame or None: A DataFrame with the column prefixes replaced. If `inplace=True`, it returns None.
find_column

```python
def find_column(df, keyword):
    """Finds the first column that contains a certain keyword. Mainly intended to be a utility function for some of the other methods.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing the columns that need to be searched.
    keyword : str
        The keyword that needs to be present in the desired column.
    """

    return [column for column in df.columns if keyword in column][0]
```
Finds the first column that contains a certain keyword. Mainly intended to be a utility function for some of the other methods.
Parameters
- df (pd.DataFrame): The DataFrame containing the columns that need to be searched.
- keyword (str): The keyword that needs to be present in the desired column.
camel_case

```python
def camel_case(string, delimiters=[" "]):
    """Converts a string into camel case

    Parameters
    ----------
    string : str
        The string to convert
    delimiters : list of str, default=[" "]
        The characters that denote separate words
    """
    for delimiter in delimiters:
        # Capitalize the first letter of each delimited word, then rejoin.
        str_list = [s[0].upper() + s[1:] for s in string.split(delimiter)]
        string = "".join(str_list)
    return string
```
Converts a string into camel case
Parameters
- string (str): The string to convert
- delimiters (list of str, default=[" "]): The characters that denote separate words
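For instance, applying the helper (redefined here so the snippet is self-contained):

```python
def camel_case(string, delimiters=[" "]):
    # Capitalize the first letter of each delimited word, then rejoin.
    for delimiter in delimiters:
        str_list = [s[0].upper() + s[1:] for s in string.split(delimiter)]
        string = "".join(str_list)
    return string

print(camel_case("mosquito habitat mapper"))        # 'MosquitoHabitatMapper'
print(camel_case("land_covers", delimiters=["_"]))  # 'LandCovers'
```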
rename_latlon_cols

````python
def rename_latlon_cols(
    df,
    gps_latitude="",
    gps_longitude="",
    mgrs_latitude="latitude",
    mgrs_longitude="longitude",
    inplace=False,
):
    """Renames the latitude and longitude columns of **raw** GLOBE Observer data to make the naming intuitive.

    [This section](#differentiating-between-mgrs-and-gps-columns) explains the motivation behind the method.

    Example usage:
    ```python
    from go_utils.cleanup import rename_latlon_cols
    rename_latlon_cols(df)
    ```

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame whose columns require renaming.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the updated Latitude and Longitude column names. If `inplace=True`, it returns None.
    """
    if not inplace:
        df = df.copy()

    if not gps_latitude:
        gps_latitude = find_column(df, "MeasurementLatitude")
    if not gps_longitude:
        gps_longitude = find_column(df, "MeasurementLongitude")
    df.rename(
        {
            gps_latitude: "Latitude",
            gps_longitude: "Longitude",
            mgrs_latitude: "MGRSLatitude",
            mgrs_longitude: "MGRSLongitude",
        },
        axis=1,
        inplace=True,
    )

    if not inplace:
        return df
````
Renames the latitude and longitude columns of raw GLOBE Observer data to make the naming intuitive.
The section "Differentiating between MGRS and GPS Columns" above explains the motivation behind the method.
Example usage:
```python
from go_utils.cleanup import rename_latlon_cols
rename_latlon_cols(df)
```
Parameters
- df (pd.DataFrame): The DataFrame whose columns require renaming.
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned.
Returns
- pd.DataFrame or None: A DataFrame with the updated Latitude and Longitude column names. If `inplace=True`, it returns None.
round_cols

````python
def round_cols(df, inplace=False):
    """This rounds columns in the DataFrame. More specifically, latitude and longitude data is rounded to 5 decimal places, other numerical fields are rounded to integers, and null values (for the integer columns) are set to -9999.

    See [Round Appropriate Columns](#round-appropriate-columns) for more information.

    Example usage:
    ```python
    from go_utils.cleanup import round_cols
    round_cols(df)
    ```

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame that requires rounding.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the rounded values. If `inplace=True`, it returns None.
    """
    if not inplace:
        df = df.copy()
    # Identifies all the numerical columns
    number_cols = [
        df.columns[i]
        for i in range(len(df.dtypes))
        if (df.dtypes[i] == "float64")
        or (df.dtypes[i] == "float")
        or (df.dtypes[i] == "int")
        or (df.dtypes[i] == "int64")
    ]

    # Rounds columns appropriately
    column_round = np.vectorize(lambda x, digits: round(x, digits))
    for name in number_cols:
        df[name] = df[name].fillna(-9999)
        if ("latitude" in name.lower()) or ("longitude" in name.lower()):
            logging.info(f"Rounded to 5 decimals: {name}")
            df[name] = column_round(df[name].to_numpy(), 5)
        else:
            logging.info(f"Converted to integer: {name}")
            df[name] = df[name].to_numpy().astype(int)

    if not inplace:
        return df
````
This rounds columns in the DataFrame. More specifically, latitude and longitude data is rounded to 5 decimal places, other numerical fields are rounded to integers, and null values (for the integer columns) are set to -9999.
See "Round Appropriate Columns" above for more information.
Example usage:
```python
from go_utils.cleanup import round_cols
round_cols(df)
```
Parameters
- df (pd.DataFrame): The DataFrame that requires rounding.
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned.
Returns
- pd.DataFrame or None: A DataFrame with the rounded values. If `inplace=True`, it returns None.
standardize_null_vals

````python
def standardize_null_vals(df, null_val=np.nan, inplace=False):
    """
    This method standardizes the null values of **raw** GLOBE Observer data.

    Example usage:
    ```python
    from go_utils.cleanup import standardize_null_vals
    standardize_null_vals(df)
    ```

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame that needs null value standardization
    null_val : obj, default=np.nan
        The value that all null values should be set to
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the standardized null values. If `inplace=True`, it returns None.
    """

    if not inplace:
        df = df.copy()

    # Replace null values with null_val
    df.fillna(null_val, inplace=True)

    # Replace any text null values
    df.replace(
        {
            "null": null_val,
            "": null_val,
            "NaN": null_val,
            "nan": null_val,
            None: null_val,
        },
        inplace=True,
    )

    if not inplace:
        return df
````
This method standardizes the null values of raw GLOBE Observer data.
Example usage:
```python
from go_utils.cleanup import standardize_null_vals
standardize_null_vals(df)
```
Parameters
- df (pd.DataFrame): The DataFrame that needs null value standardization
- null_val (obj, default=np.nan): The value that all null values should be set to
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned.
Returns
- pd.DataFrame or None: A DataFrame with the standardized null values. If `inplace=True`, it returns None.