go_utils.mhm
Mosquito Specific Cleanup Procedures
Converting Larvae Data to Integers
Larvae Data is stored as a string in the raw GLOBE Observer dataset. To facilitate analysis, this method converts this data to numerical data.
It needs to account for 4 types of data:
- Regular Data: Converts it to a number
- Excessively large data ($> 100$, as it's hard to count more than that amount accurately): To maintain the information from that entry, the `LarvaeCountMagnitude` flag is used to indicate the real value
- Ranges (e.g. "25-50"): Chooses the lower bound and sets the `LarvaeCountIsRangeFlag` to true.
- Null Values: Sets null values to $-9999$
It generates the following flags:
- `LarvaeCountMagnitude`: The integer flag contains the order of magnitude (0-4) by which the larvae count exceeds the maximum Larvae Count of 100. For counts above 100, this is calculated by $\min(1 + \lfloor \log_{10}{\frac{num}{100}} \rfloor, 4)$. As a result:
  - `0`: Corresponds to a Larvae Count $\leq 100$
  - `1`: Corresponds to a Larvae Count greater than $100$ and up to $999$
  - `2`: Corresponds to a Larvae Count between $1000$ and $9999$
  - `3`: Corresponds to a Larvae Count between $10,000$ and $99,999$
  - `4`: Corresponds to a Larvae Count $\geq 100,000$
- `LarvaeCountIsRangeFlag`: Either a $1$, which indicates the entry was a range (e.g. 25-50), or a $0$, which indicates the entry wasn't a range.
Additionally, the raw data contained extremely large values written in scientific notation (e.g. `1e+27`) that could not be processed directly, so an initial preprocessing step sets those entries to 100000 (which corresponds to the maximum magnitude flag).
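The magnitude rule described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the function name `magnitude_flag` and its `cap` and `max_flag` parameters are hypothetical, chosen here to mirror the thresholds listed in the flag table.

```python
import math


def magnitude_flag(count, cap=100, max_flag=4):
    """Return the order-of-magnitude flag (0-4) for a larvae count.

    Hypothetical helper: counts at or below `cap` get 0; larger counts get
    1 + floor(log10(count / cap)), limited to `max_flag`.
    """
    if count <= cap:
        return 0
    return min(1 + math.floor(math.log10(count / cap)), max_flag)


# One sample count from each band of the flag table, plus an extreme value.
samples = [50, 100, 101, 999, 1000, 99999, 100000, 1e27]
flags = [magnitude_flag(n) for n in samples]
print(flags)
```

Limiting the result with `min` is what keeps implausibly large entries in the top band instead of producing flags above 4.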
import math
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from go_utils.cleanup import (
    rename_latlon_cols,
    replace_column_prefix,
    round_cols,
    standardize_null_vals,
)
from go_utils.plot import completeness_histogram, plot_freq_bar, plot_int_distribution

__doc__ = r"""

## Mosquito Specific Cleanup Procedures

### Converting Larvae Data to Integers
Larvae Data is stored as a string in the raw GLOBE Observer dataset. To facilitate analysis, [this method](#larvae_to_num) converts this data to numerical data.

It needs to account for 4 types of data:
1. Regular Data: Converts it to a number
2. Excessively large data ($> 100$, as it's hard to count more than that amount accurately): To maintain the information from that entry, the `LarvaeCountMagnitude` flag is used to indicate the real value
3. Ranges (e.g. "25-50"): Chooses the lower bound and sets the `LarvaeCountIsRangeFlag` to true.
4. Null Values: Sets null values to $-9999$


It generates the following flags:
- `LarvaeCountMagnitude`: The integer flag contains the order of magnitude (0-4) by which the larvae count exceeds the maximum Larvae Count of 100. For counts above 100, this is calculated by $\min(1 + \lfloor \log_{10}{\frac{num}{100}} \rfloor, 4)$. As a result:
    - `0`: Corresponds to a Larvae Count $\leq 100$
    - `1`: Corresponds to a Larvae Count greater than $100$ and up to $999$
    - `2`: Corresponds to a Larvae Count between $1000$ and $9999$
    - `3`: Corresponds to a Larvae Count between $10,000$ and $99,999$
    - `4`: Corresponds to a Larvae Count $\geq 100,000$
- `LarvaeCountIsRangeFlag`: Either a $1$, which indicates the entry was a range (e.g. 25-50), or a $0$, which indicates the entry wasn't a range.

Additionally, the raw data contained extremely large values written in scientific notation (e.g. `1e+27`) that could not be processed directly, so an initial preprocessing step sets those entries to 100000 (which corresponds to the maximum magnitude flag).
"""


def cleanup_column_prefix(df, inplace=False):
    """Method for shortening raw mosquito habitat mapper column names.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing raw mosquito habitat mapper data.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame is modified directly.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the cleaned up column prefixes. If `inplace=True` it returns None.
    """

    return replace_column_prefix(df, "mosquitohabitatmapper", "mhm", inplace=inplace)


def _entry_to_num(entry):
    # Returns a (count, magnitude flag, range flag) tuple for one raw entry.
    try:
        if entry == "more than 100":
            return 101, 1, 1
        if pd.isna(entry):
            return -9999, 0, 0
        elif float(entry) > 100:
            return 101, min(math.floor(math.log10(float(entry) / 100)) + 1, 4), 0
        return float(entry), 0, 0
    except ValueError:
        # Ranges such as "25-50" cannot be parsed directly: keep the lower bound.
        return float(re.sub(r"-.*", "", entry)), 0, 1


def larvae_to_num(
    mhm_df,
    larvae_count_col="mhm_LarvaeCount",
    magnitude="mhm_LarvaeCountMagnitude",
    range_flag="mhm_LarvaeCountIsRangeFlag",
    inplace=False,
):
    """Converts the Larvae Count of the Mosquito Habitat Mapper Dataset from being stored as a string to integers.

    See [here](#converting-larvae-data-to-integers) for more information.

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame of Mosquito Habitat Mapper data that needs the larvae counts to be set to numbers
    larvae_count_col : str, default="mhm_LarvaeCount"
        The name of the column storing the larvae count. **Note**: The columns will be output in the format: `prefix_ColumnName` where `prefix` is all the characters that precede the words `LarvaeCount` in the specified name.
    magnitude: str, default="mhm_LarvaeCountMagnitude"
        The name of the column which will store the generated LarvaeCountMagnitude output
    range_flag : str, default="mhm_LarvaeCountIsRangeFlag"
        The name of the column which will store the generated LarvaeCountIsRange flag
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame is modified directly.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the larvae count as integers. If `inplace=True` it returns None.
    """

    if not inplace:
        mhm_df = mhm_df.copy()
    # Preprocessing step to remove extremely erroneous values
    for i in mhm_df.index:
        count = mhm_df[larvae_count_col][i]
        if not pd.isna(count) and type(count) is str and "e+" in count:
            mhm_df.at[i, larvae_count_col] = "100000"

    larvae_conversion = np.vectorize(_entry_to_num)
    (
        mhm_df[larvae_count_col],
        mhm_df[magnitude],
        mhm_df[range_flag],
    ) = larvae_conversion(mhm_df[larvae_count_col].to_numpy())

    if not inplace:
        return mhm_df


def has_genus_flag(df, genus_col="mhm_Genus", bit_col="mhm_HasGenus", inplace=False):
    """
    Creates a bit flag: `mhm_HasGenus` where 1 denotes a recorded Genus and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    genus_col : str, default="mhm_Genus"
        The name of the column in the mosquito habitat mapper DataFrame that contains the genus records.
    bit_col : str, default="mhm_HasGenus"
        The name of the column which will store the generated HasGenus flag
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame is modified directly.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the HasGenus flag. If `inplace=True` it returns None.
    """
    if not inplace:
        df = df.copy()
    df[bit_col] = (~pd.isna(df[genus_col].to_numpy())).astype(int)

    if not inplace:
        return df


def infectious_genus_flag(
    df, genus_col="mhm_Genus", bit_col="mhm_IsGenusOfInterest", inplace=False
):
    """
    Creates a bit flag: `mhm_IsGenusOfInterest` where 1 denotes a Genus of an infectious mosquito and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    genus_col : str, default="mhm_Genus"
        The name of the column in the mosquito habitat mapper DataFrame that contains the genus records.
    bit_col : str, default="mhm_IsGenusOfInterest"
        The name of the column which will store the generated IsGenusOfInterest flag
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame is modified directly.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the IsGenusOfInterest flag. If `inplace=True` it returns None.
    """
    if not inplace:
        df = df.copy()
    infectious_genus_flag = np.vectorize(
        lambda genus: genus in ["Aedes", "Anopheles", "Culex"]
    )
    df[bit_col] = infectious_genus_flag(df[genus_col].to_numpy()).astype(int)

    if not inplace:
        return df


def is_container_flag(
    df,
    watersource_col="mhm_WaterSourceType",
    bit_col="mhm_IsWaterSourceContainer",
    inplace=False,
):
    """
    Creates a bit flag: `mhm_IsWaterSourceContainer` where 1 denotes if a watersource is a container (e.g. ovitrap, pots, tires, etc.) and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_col : str, default="mhm_WaterSourceType"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource type records.
    bit_col : str, default="mhm_IsWaterSourceContainer"
        The name of the column which will store the generated IsWaterSourceContainer flag
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame is modified directly.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the IsContainer flag. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()

    mark_containers = np.vectorize(
        lambda container: not pd.isna(container) and "container" in container
    )
    df[bit_col] = mark_containers(df[watersource_col].to_numpy()).astype(int)

    if not inplace:
        return df


def has_watersource_flag(
    df, watersource_col="mhm_WaterSource", bit_col="mhm_HasWaterSource", inplace=False
):
    """
    Creates a bit flag: `mhm_HasWaterSource` where 1 denotes if there is a watersource and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_col : str, default="mhm_WaterSource"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource records.
    bit_col : str, default="mhm_HasWaterSource"
        The name of the column which will store the generated HasWaterSource flag
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame is modified directly.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the HasWaterSource flag. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()
    has_watersource = np.vectorize(lambda watersource: int(not pd.isna(watersource)))
    df[bit_col] = has_watersource(df[watersource_col].to_numpy())

    if not inplace:
        return df


def photo_bit_flags(
    df,
    watersource_photos="mhm_WaterSourcePhotoUrls",
    larvae_photos="mhm_LarvaFullBodyPhotoUrls",
    abdomen_photos="mhm_AbdomenCloseupPhotoUrls",
    photo_count="mhm_PhotoCount",
    rejected_count="mhm_RejectedCount",
    pending_count="mhm_PendingCount",
    photo_bit_binary="mhm_PhotoBitBinary",
    photo_bit_decimal="mhm_PhotoBitDecimal",
    inplace=False,
):
    """
    Creates the following flags:
    - `PhotoCount`: The number of valid photos per record.
    - `RejectedCount`: The number of photos that were rejected per record.
    - `PendingCount`: The number of photos that are pending approval per record.
    - `PhotoBitBinary`: A string that represents the presence of a photo in the order of watersource, larvae, and abdomen. For example, if the entry is `110`, that indicates that there is a water source photo and a larvae photo, but no abdomen photo.
    - `PhotoBitDecimal`: The numerical representation of the mhm_PhotoBitBinary string.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_photos : str, default="mhm_WaterSourcePhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource photo url records.
    larvae_photos : str, default="mhm_LarvaFullBodyPhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the larvae photo url records.
    abdomen_photos : str, default="mhm_AbdomenCloseupPhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the abdomen photo url records.
    photo_count : str, default="mhm_PhotoCount"
        The name of the column that will store the PhotoCount flag.
    rejected_count : str, default="mhm_RejectedCount"
        The name of the column that will store the RejectedCount flag.
    pending_count : str, default="mhm_PendingCount"
        The name of the column that will store the PendingCount flag.
    photo_bit_binary : str, default="mhm_PhotoBitBinary"
        The name of the column that will store the PhotoBitBinary flag.
    photo_bit_decimal : str, default="mhm_PhotoBitDecimal"
        The name of the column that will store the PhotoBitDecimal flag.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame is modified directly.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the photo flags. If `inplace=True` it returns None.
    """

    def pic_data(*args):
        pic_count = 0
        rejected_count = 0
        pending_count = 0
        valid_photo_bit_mask = ""

        # bit_power = len(args) - 1
        # For url string -- if we see ANY http, add 1
        # also count all valid photos, rejected photos,
        # If there are NO http then add 0, to empty photo field
        for url_string in args:
            if not pd.isna(url_string):
                if "http" not in url_string:
                    valid_photo_bit_mask += "0"
                else:
                    valid_photo_bit_mask += "1"

                pic_count += url_string.count("http")
                pending_count += url_string.count("pending")
                rejected_count += url_string.count("rejected")
            else:
                valid_photo_bit_mask += "0"

        return (
            pic_count,
            rejected_count,
            pending_count,
            valid_photo_bit_mask,
            int(valid_photo_bit_mask, 2),
        )

    if not inplace:
        df = df.copy()

    get_photo_data = np.vectorize(pic_data)
    (
        df[photo_count],
        df[rejected_count],
        df[pending_count],
        df[photo_bit_binary],
        df[photo_bit_decimal],
    ) = get_photo_data(
        df[watersource_photos].to_numpy(),
        df[larvae_photos].to_numpy(),
        df[abdomen_photos].to_numpy(),
    )

    if not inplace:
        return df


def completion_score_flag(
    df,
    photo_bit_binary="mhm_PhotoBitBinary",
    has_genus="mhm_HasGenus",
    sub_completeness="mhm_SubCompletenessScore",
    completeness="mhm_CumulativeCompletenessScore",
    inplace=False,
):
    """
    Adds the following completeness score flags:
    - `SubCompletenessScore`: The percentage of the watersource photos, larvae photos, abdomen photos, and genus columns that are filled out.
    - `CumulativeCompletenessScore`: The percentage of non null values out of all the columns.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame with the [`PhotoBitDecimal`](#photo_bit_flags) and [`HasGenus`](#has_genus_flag) flags.
    photo_bit_binary: str, default="mhm_PhotoBitBinary"
        The name of the column in the mosquito habitat mapper DataFrame that contains the PhotoBitBinary flag.
    has_genus : str, default="mhm_HasGenus"
        The name of the column in the mosquito habitat mapper DataFrame that contains the HasGenus flag.
    sub_completeness : str, default="mhm_SubCompletenessScore"
        The name of the column in the mosquito habitat mapper DataFrame that will contain the generated SubCompletenessScore flag.
    completeness : str, default="mhm_CumulativeCompletenessScore"
        The name of the column in the mosquito habitat mapper DataFrame that will contain the generated CumulativeCompletenessScore flag.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame is modified directly.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with completion score flags. If `inplace=True` it returns None.
    """

    def sum_bit_mask(bit_mask="0"):
        total = 0.0
        for char in bit_mask:
            total += int(char)
        return total

    if not inplace:
        df = df.copy()

    scores = {}
    scores["sub_score"] = []
    # Cumulative Completion Score
    scores["cumulative_score"] = round(df.count(axis=1) / len(df.columns), 2)
    # Sub-Score
    for index in df.index:
        bit_mask = df[photo_bit_binary][index]
        sub_score = df[has_genus][index] + sum_bit_mask(bit_mask=bit_mask)
        sub_score /= 4.0
        scores["sub_score"].append(sub_score)

    df[sub_completeness], df[completeness] = (
        scores["sub_score"],
        scores["cumulative_score"],
    )

    if not inplace:
        return df


def apply_cleanup(mhm_df):
    """Applies a full cleanup procedure to the mosquito habitat mapper data. Only returns a copy.
    It follows the following steps:
    - Removes Homogenous Columns
    - Renames Latitude and Longitudes
    - Cleans the Column Naming
    - Converts Larvae Count to Numbers
    - Rounds Columns
    - Standardizes Null Values

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame containing **raw** Mosquito Habitat Mapper Data from the API.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the cleaned up Mosquito Habitat Mapper Data
    """
    mhm_df = mhm_df.copy()

    rename_latlon_cols(mhm_df, inplace=True)
    cleanup_column_prefix(mhm_df, inplace=True)
    larvae_to_num(mhm_df, inplace=True)
    round_cols(mhm_df, inplace=True)
    standardize_null_vals(mhm_df, inplace=True)
    return mhm_df


def add_flags(mhm_df):
    """Adds the following flags to the Mosquito Habitat Mapper Data:
    - Has Genus
    - Is Infectious Genus/Genus of Interest
    - Is Container
    - Has WaterSource
    - Photo Bit Flags
    - Completion Score Flag

    This returns a copy of the original DataFrame with the flags added onto it.

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame containing cleaned up Mosquito Habitat Mapper Data ideally from the method.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the flagged Mosquito Habitat Mapper Data
    """
    mhm_df = mhm_df.copy()
    has_genus_flag(mhm_df, inplace=True)
    infectious_genus_flag(mhm_df, inplace=True)
    is_container_flag(mhm_df, inplace=True)
    has_watersource_flag(mhm_df, inplace=True)
    photo_bit_flags(mhm_df, inplace=True)
    completion_score_flag(mhm_df, inplace=True)
    return mhm_df


def plot_valid_entries(df, bit_col, entry_type):
    """
    Plots the number of entries with photos and the number of entries without photos

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing Mosquito Habitat Mapper Data with the PhotoBitDecimal Flag.
    """
    plt.figure()
    num_valid = len(df[df[bit_col] > 0])
    plt.title(f"Entries with {entry_type} vs No {entry_type}")
    plt.ylabel("Number of Entries")
    plt.bar(entry_type, num_valid, color="#e34a33")
    plt.bar(f"No {entry_type}", len(df) - num_valid, color="#fdcc8a")


def photo_subjects(mhm_df):
    """
    Plots the amount of photos for each photo area (Larvae, Abdomen, Watersource)

    Parameters
    ----------
    mhm_df : pd.DataFrame
        The DataFrame containing Mosquito Habitat Mapper Data with the PhotoBitDecimal Flag.
    """

    total_dict = {"Larvae Photos": 0, "Abdomen Photos": 0, "Watersource Photos": 0}

    for number in mhm_df["mhm_PhotoBitDecimal"]:
        total_dict["Watersource Photos"] += number & 4
        total_dict["Larvae Photos"] += number & 2
        total_dict["Abdomen Photos"] += number & 1

    for key in total_dict.keys():
        if total_dict[key] != 0:
            total_dict[key] = math.log10(total_dict[key])
        else:
            total_dict[key] = 0
    plt.figure(figsize=(10, 5))
    plt.title("Mosquito Habitat Mapper - Photo Subject Frequencies (Log Scale)")
    plt.xlabel("Photo Type")
    plt.ylabel("Frequency (Log Scale)")
    plt.bar(total_dict.keys(), total_dict.values(), color="lightblue")


def diagnostic_plots(mhm_df):
    """
    Generates (but doesn't display) diagnostic plots to gain insight into the current data.

    Plots:
    - Larvae Count Distribution (where a negative entry denotes null data)
    - Photo Subject Distribution
    - Number of valid photos vs no photos
    - Completeness Score Distribution
    - Subcompleteness Score Distribution

    Parameters
    ----------
    mhm_df : pd.DataFrame
        The DataFrame containing Flagged and Cleaned Mosquito Habitat Mapper Data.
    """
    plot_int_distribution(mhm_df, "mhm_LarvaeCount", "Larvae Count")
    photo_subjects(mhm_df)
    plot_freq_bar(mhm_df, "Mosquito Habitat Mapper", "mhm_Genus", "Genus Types")
    plot_valid_entries(mhm_df, "mhm_HasGenus", "Genus Classifications")
    plot_valid_entries(mhm_df, "mhm_PhotoBitDecimal", "Valid Photos")
    completeness_histogram(
        mhm_df,
        "Mosquito Habitat Mapper",
        "mhm_CumulativeCompletenessScore",
        "Cumulative Completeness",
    )
    completeness_histogram(
        mhm_df,
        "Mosquito Habitat Mapper",
        "mhm_SubCompletenessScore",
        "Sub Completeness",
    )


def qa_filter(
    mhm_df,
    has_genus=False,
    min_larvae_count=-9999,
    has_photos=False,
    is_container=False,
):
    """
    Can filter a cleaned and flagged mosquito habitat mapper DataFrame based on the following criteria:
    - `Has Genus`: If the entry has an identified genus
    - `Min Larvae Count` : Minimum larvae count needed for an entry
    - `Has Photos` : If the entry contains valid photo entries
    - `Is Container` : If the entry's watersource was a container

    Returns a copy of the DataFrame

    Parameters
    ----------
    has_genus : bool, default=False
        If True, only entries with an identified genus will be returned.
    min_larvae_count : int, default=-9999
        Only entries with a larvae count greater than or equal to this parameter will be included.
    has_photos : bool, default=False
        If True, only entries with recorded photos will be returned
    is_container : bool, default=False
        If True, only entries with containers will be returned

    Returns
    -------
    pd.DataFrame
        A DataFrame of the applied filters.
    """

    mhm_df = mhm_df[mhm_df["mhm_LarvaeCount"] >= min_larvae_count]

    if has_genus:
        mhm_df = mhm_df[mhm_df["mhm_HasGenus"] == 1]
    if has_photos:
        mhm_df = mhm_df[mhm_df["mhm_PhotoBitDecimal"] > 0]
    if is_container:
        mhm_df = mhm_df[mhm_df["mhm_IsWaterSourceContainer"] == 1]

    return mhm_df
def cleanup_column_prefix(df, inplace=False):
Method for shortening raw mosquito habitat mapper column names.
Parameters
- df (pd.DataFrame): The DataFrame containing raw mosquito habitat mapper data.
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame is modified directly.
Returns
- pd.DataFrame or None: A DataFrame with the cleaned up column prefixes. If `inplace=True` it returns None.
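As a rough illustration of the renaming, the sketch below shortens a prefix on a plain list of column names. It does not call `replace_column_prefix`, whose exact behavior is not shown on this page. The helper `shorten_prefix`, the sample raw column names, and the underscore added after the new prefix are all assumptions, inferred only from default column names such as `mhm_LarvaeCount`.

```python
import re


def shorten_prefix(columns, old="mosquitohabitatmapper", new="mhm"):
    """Hypothetical helper: swap a leading `old` prefix for `new` plus '_'."""
    pattern = re.compile(rf"^{re.escape(old)}")
    return [pattern.sub(f"{new}_", name) for name in columns]


# Sample column names are invented for illustration.
raw_columns = [
    "mosquitohabitatmapperLarvaeCount",
    "mosquitohabitatmapperGenus",
    "siteName",
]
short_columns = shorten_prefix(raw_columns)
print(short_columns)
```

Anchoring the pattern at the start of the name leaves columns without the prefix, such as `siteName`, unchanged.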
def larvae_to_num(mhm_df, larvae_count_col="mhm_LarvaeCount", magnitude="mhm_LarvaeCountMagnitude", range_flag="mhm_LarvaeCountIsRangeFlag", inplace=False):
Converts the Larvae Count of the Mosquito Habitat Mapper Dataset from being stored as a string to integers.
See Converting Larvae Data to Integers above for more information.
Parameters
- mhm_df (pd.DataFrame): A DataFrame of Mosquito Habitat Mapper data that needs the larvae counts to be set to numbers
- larvae_count_col (str, default="mhm_LarvaeCount"): The name of the column storing the larvae count. Note: The columns will be output in the format: `prefix_ColumnName` where `prefix` is all the characters that precede the words `LarvaeCount` in the specified name.
- magnitude (str, default="mhm_LarvaeCountMagnitude"): The name of the column which will store the generated LarvaeCountMagnitude output
- range_flag (str, default="mhm_LarvaeCountIsRangeFlag"): The name of the column which will store the generated LarvaeCountIsRange flag
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame is modified directly.
Returns
- pd.DataFrame or None: A DataFrame with the larvae count as integers. If `inplace=True` it returns None.
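To make the per-entry behavior concrete, here is a pandas-free sketch of how a single raw value might map to the three outputs (count, magnitude, range flag). The function `convert_entry` is a hypothetical stand-in written for this example, and `None` is used in place of a pandas null; it follows the rules documented above but is not the library's implementation.

```python
import math
import re


def convert_entry(entry):
    """Return (count, magnitude, is_range) for one raw larvae-count entry."""
    if entry is None:  # stand-in for a pandas null value
        return -9999, 0, 0
    if entry == "more than 100":
        return 101, 1, 1
    if "e+" in entry:  # preprocessing: cap scientific-notation outliers
        entry = "100000"
    try:
        value = float(entry)
    except ValueError:
        # A range such as "25-50": keep the lower bound and set the range flag.
        return float(re.sub(r"-.*", "", entry)), 0, 1
    if value > 100:
        return 101, min(math.floor(math.log10(value / 100)) + 1, 4), 0
    return value, 0, 0


# Sample entries are invented to cover each documented case.
for raw in ["12", "25-50", None, "2500", "1e+27"]:
    print(repr(raw), "->", convert_entry(raw))
```

Note that counts above 100 are stored as 101, so the magnitude column is the only place the original scale survives.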
def has_genus_flag(df, genus_col="mhm_Genus", bit_col="mhm_HasGenus", inplace=False):
Creates a bit flag: `mhm_HasGenus` where 1 denotes a recorded Genus and 0 denotes the contrary.
Parameters
- df (pd.DataFrame): A mosquito habitat mapper DataFrame
- genus_col (str, default="mhm_Genus"): The name of the column in the mosquito habitat mapper DataFrame that contains the genus records.
- bit_col (str, default="mhm_HasGenus"): The name of the column which will store the generated HasGenus flag
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame is modified directly.
Returns
- pd.DataFrame or None: A DataFrame with the HasGenus flag. If `inplace=True` it returns None.
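The flag is a simple presence check. A minimal sketch on a plain list, assuming `None` marks a missing genus (the real function works on a DataFrame column and pandas nulls); `presence_flags` is a hypothetical name used only here:

```python
def presence_flags(values):
    """Return 1 for each recorded entry and 0 for each missing (None) entry."""
    return [0 if value is None else 1 for value in values]


# Sample genus records, invented for illustration.
genus_records = ["Aedes", None, "Culex"]
genus_flags = presence_flags(genus_records)
print(genus_flags)
```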
def infectious_genus_flag(df, genus_col="mhm_Genus", bit_col="mhm_IsGenusOfInterest", inplace=False):
Creates a bit flag: `mhm_IsGenusOfInterest` where 1 denotes a Genus of an infectious mosquito and 0 denotes the contrary.
Parameters
- df (pd.DataFrame): A mosquito habitat mapper DataFrame
- genus_col (str, default="mhm_Genus"): The name of the column in the mosquito habitat mapper DataFrame that contains the genus records.
- bit_col (str, default="mhm_IsGenusOfInterest"): The name of the column which will store the generated IsGenusOfInterest flag
- inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned and the DataFrame is modified directly.
Returns
- pd.DataFrame or None: A DataFrame with the IsGenusOfInterest flag. If `inplace=True` it returns None.
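The source for this function tests membership in the three genera `Aedes`, `Anopheles`, and `Culex`. The sketch below reproduces that check on a plain list; the helper name `genus_of_interest_flags` and the sample records are assumptions made for this example.

```python
# The three genera checked by the module source.
GENERA_OF_INTEREST = {"Aedes", "Anopheles", "Culex"}


def genus_of_interest_flags(genera):
    """Return 1 where the genus is one of the genera of interest, else 0."""
    return [int(genus in GENERA_OF_INTEREST) for genus in genera]


# Sample records, invented for illustration; None stands in for a missing genus.
records = ["Aedes", "Toxorhynchites", None, "Culex"]
interest_flags = genus_of_interest_flags(records)
print(interest_flags)
```

The comparison is an exact, case-sensitive string match, so a record spelled `aedes` would not be flagged.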
def is_container_flag(df, watersource_col="mhm_WaterSourceType", bit_col="mhm_IsWaterSourceContainer", inplace=False):
Creates a bit flag: mhm_IsWaterSourceContainer
where 1 denotes if a watersource is a container (e.g. ovitrap, pots, tires, etc.) and 0 denotes the contrary.
Parameters
- df (pd.DataFrame): A mosquito habitat mapper DataFrame
- watersource_col (str, default="mhm_WaterSourceType"): The name of the column in the mosquito habitat mapper DataFrame that contains the watersource type records.
- bit_col (str, default="mhm_IsWaterSourceContainer"): The name of the column which will store the generated IsWaterSourceContainer flag
- inplace (bool, default=False): Whether to perform the operation in place. If True, the DataFrame is modified directly and None is returned; otherwise a modified copy is returned.
Returns
- pd.DataFrame: A DataFrame with the IsWaterSourceContainer flag. If inplace=True it returns None.
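The flagging rule can be illustrated without pandas. The following is a minimal sketch, not the library's implementation: the helper name `is_container` and the sample water source strings are assumptions made for illustration, and `None` stands in for a null DataFrame entry.

```python
from typing import Optional


def is_container(water_source_type: Optional[str]) -> int:
    """Return 1 when the water source type mentions a container, else 0."""
    # Missing values (None here, NaN in a DataFrame) never count as containers.
    if water_source_type is None:
        return 0
    return int("container" in water_source_type)


# Hypothetical sample values; real records may use different category strings.
records = ["container: artificial", "still: lake/pond/swamp", None]
flags = [is_container(value) for value in records]
print(flags)  # [1, 0, 0]
```

The check is a plain substring test, so any type string containing the word "container" is flagged.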
def has_watersource_flag(
    df, watersource_col="mhm_WaterSource", bit_col="mhm_HasWaterSource", inplace=False
):
    """
    Creates a bit flag: `mhm_HasWaterSource` where 1 denotes that there is a watersource and 0 denotes that there is none.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_col : str, default="mhm_WaterSource"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource records.
    bit_col : str, default="mhm_HasWaterSource"
        The name of the column which will store the generated HasWaterSource flag
    inplace : bool, default=False
        Whether to perform the operation in place. If True, the DataFrame is modified directly and None is returned; otherwise a modified copy is returned.

    Returns
    -------
    pd.DataFrame
        A DataFrame with the HasWaterSource flag. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()
    # 1 for any non-null watersource entry, 0 for null entries
    has_watersource = np.vectorize(lambda watersource: int(not pd.isna(watersource)))
    df[bit_col] = has_watersource(df[watersource_col].to_numpy())

    if not inplace:
        return df
Creates a bit flag: mhm_HasWaterSource where 1 denotes that there is a watersource and 0 denotes that there is none.
Parameters
- df (pd.DataFrame): A mosquito habitat mapper DataFrame
- watersource_col (str, default="mhm_WaterSource"): The name of the column in the mosquito habitat mapper DataFrame that contains the watersource records.
- bit_col (str, default="mhm_HasWaterSource"): The name of the column which will store the generated HasWaterSource flag
- inplace (bool, default=False): Whether to perform the operation in place. If True, the DataFrame is modified directly and None is returned; otherwise a modified copy is returned.
Returns
- pd.DataFrame: A DataFrame with the HasWaterSource flag. If inplace=True it returns None.
def photo_bit_flags(
    df,
    watersource_photos="mhm_WaterSourcePhotoUrls",
    larvae_photos="mhm_LarvaFullBodyPhotoUrls",
    abdomen_photos="mhm_AbdomenCloseupPhotoUrls",
    photo_count="mhm_PhotoCount",
    rejected_count="mhm_RejectedCount",
    pending_count="mhm_PendingCount",
    photo_bit_binary="mhm_PhotoBitBinary",
    photo_bit_decimal="mhm_PhotoBitDecimal",
    inplace=False,
):
    """
    Creates the following flags:
    - `PhotoCount`: The number of valid photos per record.
    - `RejectedCount`: The number of photos that were rejected per record.
    - `PendingCount`: The number of photos that are pending approval per record.
    - `PhotoBitBinary`: A string that represents the presence of a photo in the order of watersource, larvae, and abdomen. For example, if the entry is `110`, that indicates that there is a water source photo and a larvae photo, but no abdomen photo.
    - `PhotoBitDecimal`: The numerical representation of the mhm_PhotoBitBinary string.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_photos : str, default="mhm_WaterSourcePhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource photo url records.
    larvae_photos : str, default="mhm_LarvaFullBodyPhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the larvae photo url records.
    abdomen_photos : str, default="mhm_AbdomenCloseupPhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the abdomen photo url records.
    photo_count : str, default="mhm_PhotoCount"
        The name of the column that will store the PhotoCount flag.
    rejected_count : str, default="mhm_RejectedCount"
        The name of the column that will store the RejectedCount flag.
    pending_count : str, default="mhm_PendingCount"
        The name of the column that will store the PendingCount flag.
    photo_bit_binary : str, default="mhm_PhotoBitBinary"
        The name of the column that will store the PhotoBitBinary flag.
    photo_bit_decimal : str, default="mhm_PhotoBitDecimal"
        The name of the column that will store the PhotoBitDecimal flag.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, the DataFrame is modified directly and None is returned; otherwise a modified copy is returned.

    Returns
    -------
    pd.DataFrame
        A DataFrame with the photo flags. If `inplace=True` it returns None.
    """

    def pic_data(*args):
        pic_count = 0
        rejected_count = 0
        pending_count = 0
        valid_photo_bit_mask = ""

        # For each url string: if it contains ANY "http", append a 1 bit
        # and count the valid photos. Pending and rejected photos are
        # counted for every non-null string.
        # If there is NO "http" (or the field is empty), append a 0 bit.
        for url_string in args:
            if not pd.isna(url_string):
                if "http" not in url_string:
                    valid_photo_bit_mask += "0"
                else:
                    valid_photo_bit_mask += "1"

                pic_count += url_string.count("http")
                pending_count += url_string.count("pending")
                rejected_count += url_string.count("rejected")
            else:
                valid_photo_bit_mask += "0"

        return (
            pic_count,
            rejected_count,
            pending_count,
            valid_photo_bit_mask,
            int(valid_photo_bit_mask, 2),
        )

    if not inplace:
        df = df.copy()

    get_photo_data = np.vectorize(pic_data)
    (
        df[photo_count],
        df[rejected_count],
        df[pending_count],
        df[photo_bit_binary],
        df[photo_bit_decimal],
    ) = get_photo_data(
        df[watersource_photos].to_numpy(),
        df[larvae_photos].to_numpy(),
        df[abdomen_photos].to_numpy(),
    )

    if not inplace:
        return df
Creates the following flags:
- PhotoCount: The number of valid photos per record.
- RejectedCount: The number of photos that were rejected per record.
- PendingCount: The number of photos that are pending approval per record.
- PhotoBitBinary: A string that represents the presence of a photo in the order of watersource, larvae, and abdomen. For example, if the entry is 110, that indicates that there is a water source photo and a larvae photo, but no abdomen photo.
- PhotoBitDecimal: The numerical representation of the mhm_PhotoBitBinary string.
Parameters
- df (pd.DataFrame): A mosquito habitat mapper DataFrame
- watersource_photos (str, default="mhm_WaterSourcePhotoUrls"): The name of the column in the mosquito habitat mapper DataFrame that contains the watersource photo url records.
- larvae_photos (str, default="mhm_LarvaFullBodyPhotoUrls"): The name of the column in the mosquito habitat mapper DataFrame that contains the larvae photo url records.
- abdomen_photos (str, default="mhm_AbdomenCloseupPhotoUrls"): The name of the column in the mosquito habitat mapper DataFrame that contains the abdomen photo url records.
- photo_count (str, default="mhm_PhotoCount"): The name of the column that will store the PhotoCount flag.
- rejected_count (str, default="mhm_RejectedCount"): The name of the column that will store the RejectedCount flag.
- pending_count (str, default="mhm_PendingCount"): The name of the column that will store the PendingCount flag.
- photo_bit_binary (str, default="mhm_PhotoBitBinary"): The name of the column that will store the PhotoBitBinary flag.
- photo_bit_decimal (str, default="mhm_PhotoBitDecimal"): The name of the column that will store the PhotoBitDecimal flag.
- inplace (bool, default=False): Whether to perform the operation in place. If True, the DataFrame is modified directly and None is returned; otherwise a modified copy is returned.
Returns
- pd.DataFrame: A DataFrame with the photo flags. If inplace=True it returns None.
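To make the five photo flags concrete, here is a small pure-Python sketch of the per-record computation. It is an illustration under assumptions rather than the library's code: the function name `photo_summary`, the example URLs, and the `"; "` separator are hypothetical, and `None` stands in for a null entry.

```python
from typing import Optional, Tuple


def photo_summary(*url_strings: Optional[str]) -> Tuple[int, int, int, str, int]:
    """Return (photo count, rejected, pending, binary mask, decimal mask)."""
    photo_count = rejected = pending = 0
    bits = ""
    for urls in url_strings:
        if urls is None:
            # Empty photo field: contributes a 0 bit and no counts.
            bits += "0"
            continue
        # One bit per photo field: 1 if at least one URL is present.
        bits += "1" if "http" in urls else "0"
        photo_count += urls.count("http")
        pending += urls.count("pending")
        rejected += urls.count("rejected")
    return photo_count, rejected, pending, bits, int(bits, 2)


# Order of arguments: watersource, larvae, abdomen (hypothetical values).
result = photo_summary(
    "https://example.org/a.jpg; https://example.org/b.jpg",
    "pending approval",
    None,
)
print(result)  # (2, 0, 1, '100', 4)
```

Because the leftmost bit is the watersource field, a decimal value of 4 or more always means a watersource photo is present.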
def completion_score_flag(
    df,
    photo_bit_binary="mhm_PhotoBitBinary",
    has_genus="mhm_HasGenus",
    sub_completeness="mhm_SubCompletenessScore",
    completeness="mhm_CumulativeCompletenessScore",
    inplace=False,
):
    """
    Adds the following completeness score flags:
    - `SubCompletenessScore`: The fraction (0 to 1) of the watersource photos, larvae photos, abdomen photos, and genus columns that are filled out.
    - `CumulativeCompletenessScore`: The fraction (0 to 1) of non-null values out of all the columns.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame with the [`PhotoBitBinary`](#photo_bit_flags) and [`HasGenus`](#has_genus_flags) flags.
    photo_bit_binary : str, default="mhm_PhotoBitBinary"
        The name of the column in the mosquito habitat mapper DataFrame that contains the PhotoBitBinary flag.
    has_genus : str, default="mhm_HasGenus"
        The name of the column in the mosquito habitat mapper DataFrame that contains the HasGenus flag.
    sub_completeness : str, default="mhm_SubCompletenessScore"
        The name of the column in the mosquito habitat mapper DataFrame that will contain the generated SubCompletenessScore flag.
    completeness : str, default="mhm_CumulativeCompletenessScore"
        The name of the column in the mosquito habitat mapper DataFrame that will contain the generated CumulativeCompletenessScore flag.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, the DataFrame is modified directly and None is returned; otherwise a modified copy is returned.

    Returns
    -------
    pd.DataFrame
        A DataFrame with completion score flags. If `inplace=True` it returns None.
    """

    def sum_bit_mask(bit_mask="0"):
        total = 0.0
        for char in bit_mask:
            total += int(char)
        return total

    if not inplace:
        df = df.copy()

    scores = {}
    scores["sub_score"] = []
    # Cumulative Completion Score
    scores["cumulative_score"] = round(df.count(axis=1) / len(df.columns), 2)
    # Sub-Score: three photo bits plus the genus bit, out of four
    for index in df.index:
        bit_mask = df[photo_bit_binary][index]
        sub_score = df[has_genus][index] + sum_bit_mask(bit_mask=bit_mask)
        sub_score /= 4.0
        scores["sub_score"].append(sub_score)

    df[sub_completeness], df[completeness] = (
        scores["sub_score"],
        scores["cumulative_score"],
    )

    if not inplace:
        return df
Adds the following completeness score flags:
- SubCompletenessScore: The fraction (0 to 1) of the watersource photos, larvae photos, abdomen photos, and genus columns that are filled out.
- CumulativeCompletenessScore: The fraction (0 to 1) of non-null values out of all the columns.
Parameters
- df (pd.DataFrame): A mosquito habitat mapper DataFrame with the PhotoBitBinary and HasGenus flags.
- photo_bit_binary (str, default="mhm_PhotoBitBinary"): The name of the column in the mosquito habitat mapper DataFrame that contains the PhotoBitBinary flag.
- has_genus (str, default="mhm_HasGenus"): The name of the column in the mosquito habitat mapper DataFrame that contains the HasGenus flag.
- sub_completeness (str, default="mhm_SubCompletenessScore"): The name of the column in the mosquito habitat mapper DataFrame that will contain the generated SubCompletenessScore flag.
- completeness (str, default="mhm_CumulativeCompletenessScore"): The name of the column in the mosquito habitat mapper DataFrame that will contain the generated CumulativeCompletenessScore flag.
- inplace (bool, default=False): Whether to perform the operation in place. If True, the DataFrame is modified directly and None is returned; otherwise a modified copy is returned.
Returns
- pd.DataFrame: A DataFrame with completion score flags. If inplace=True it returns None.
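The arithmetic behind the two scores is simple enough to check by hand. The sketch below is an illustration only: the helper names and the sample record are hypothetical, and a plain dict with `None` values stands in for one DataFrame row.

```python
from typing import Dict, Optional


def sub_completeness(photo_bit_binary: str, has_genus: int) -> float:
    """Fraction of the four tracked fields (3 photo bits + genus) that are set."""
    filled = has_genus + sum(int(bit) for bit in photo_bit_binary)
    return filled / 4.0


def cumulative_completeness(record: Dict[str, Optional[object]]) -> float:
    """Fraction of non-null values across all fields, rounded to 2 places."""
    non_null = sum(value is not None for value in record.values())
    return round(non_null / len(record), 2)


# Watersource and larvae photos present, no abdomen photo, genus identified.
sub_score = sub_completeness("110", has_genus=1)

# Hypothetical three-field record with one missing value.
cumulative_score = cumulative_completeness(
    {"mhm_Genus": "Aedes", "mhm_LarvaeCount": 12, "mhm_Comments": None}
)
print(sub_score, cumulative_score)  # 0.75 0.67
```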
def apply_cleanup(mhm_df):
    """Applies a full cleanup procedure to the mosquito habitat mapper data and returns a cleaned copy; the input DataFrame is not modified.
    It performs the following steps:
    - Removes Homogeneous Columns
    - Renames Latitude and Longitudes
    - Cleans the Column Naming
    - Converts Larvae Count to Numbers
    - Rounds Columns
    - Standardizes Null Values

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame containing **raw** Mosquito Habitat Mapper Data from the API.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the cleaned up Mosquito Habitat Mapper Data
    """
    mhm_df = mhm_df.copy()

    rename_latlon_cols(mhm_df, inplace=True)
    cleanup_column_prefix(mhm_df, inplace=True)
    larvae_to_num(mhm_df, inplace=True)
    round_cols(mhm_df, inplace=True)
    standardize_null_vals(mhm_df, inplace=True)
    return mhm_df
Applies a full cleanup procedure to the mosquito habitat mapper data and returns a cleaned copy; the input DataFrame is not modified. It performs the following steps:
- Removes Homogeneous Columns
- Renames Latitude and Longitudes
- Cleans the Column Naming
- Converts Larvae Count to Numbers
- Rounds Columns
- Standardizes Null Values
Parameters
- mhm_df (pd.DataFrame): A DataFrame containing raw Mosquito Habitat Mapper Data from the API.
Returns
- pd.DataFrame: A DataFrame containing the cleaned up Mosquito Habitat Mapper Data
def add_flags(mhm_df):
    """Adds the following flags to the Mosquito Habitat Mapper Data:
    - Has Genus
    - Is Infectious Genus/Genus of Interest
    - Is Container
    - Has WaterSource
    - Photo Bit Flags
    - Completion Score Flag

    This returns a copy of the original DataFrame with the flags added onto it.

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame containing cleaned up Mosquito Habitat Mapper Data, ideally from the `apply_cleanup` method.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the flagged Mosquito Habitat Mapper Data
    """
    mhm_df = mhm_df.copy()
    has_genus_flag(mhm_df, inplace=True)
    infectious_genus_flag(mhm_df, inplace=True)
    is_container_flag(mhm_df, inplace=True)
    has_watersource_flag(mhm_df, inplace=True)
    photo_bit_flags(mhm_df, inplace=True)
    completion_score_flag(mhm_df, inplace=True)
    return mhm_df
Adds the following flags to the Mosquito Habitat Mapper Data:
- Has Genus
- Is Infectious Genus/Genus of Interest
- Is Container
- Has WaterSource
- Photo Bit Flags
- Completion Score Flag
This returns a copy of the original DataFrame with the flags added onto it.
Parameters
- mhm_df (pd.DataFrame): A DataFrame containing cleaned up Mosquito Habitat Mapper Data, ideally from the apply_cleanup method.
Returns
- pd.DataFrame: A DataFrame containing the flagged Mosquito Habitat Mapper Data
def plot_valid_entries(df, bit_col, entry_type):
    """
    Plots the number of entries whose flag column is greater than 0 (for example, entries with photos) against the number of entries without.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing Mosquito Habitat Mapper Data with the flag column to plot (e.g. the PhotoBitDecimal Flag).
    bit_col : str
        The name of the flag column; entries with a value greater than 0 are counted as valid.
    entry_type : str
        The label used in the plot title and bar names (e.g. "Valid Photos").
    """
    plt.figure()
    num_valid = len(df[df[bit_col] > 0])
    plt.title(f"Entries with {entry_type} vs No {entry_type}")
    plt.ylabel("Number of Entries")
    plt.bar(entry_type, num_valid, color="#e34a33")
    plt.bar(f"No {entry_type}", len(df) - num_valid, color="#fdcc8a")
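Plots the number of entries whose flag column is greater than 0 (for example, entries with photos) against the number of entries without.
Parameters
- df (pd.DataFrame): The DataFrame containing Mosquito Habitat Mapper Data with the flag column to plot (e.g. the PhotoBitDecimal Flag).
- bit_col (str): The name of the flag column; entries with a value greater than 0 are counted as valid.
- entry_type (str): The label used in the plot title and bar names (e.g. "Valid Photos").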
def photo_subjects(mhm_df):
    """
    Plots the number of entries with photos for each photo area (Larvae, Abdomen, Watersource)

    Parameters
    ----------
    mhm_df : pd.DataFrame
        The DataFrame containing Mosquito Habitat Mapper Data with the PhotoBitDecimal Flag.
    """

    total_dict = {"Larvae Photos": 0, "Abdomen Photos": 0, "Watersource Photos": 0}

    for number in mhm_df["mhm_PhotoBitDecimal"]:
        # Shift each masked bit down so a set bit adds exactly 1
        # (adding `number & 4` directly would add 4 per entry).
        total_dict["Watersource Photos"] += (number & 4) >> 2
        total_dict["Larvae Photos"] += (number & 2) >> 1
        total_dict["Abdomen Photos"] += number & 1

    for key in total_dict.keys():
        if total_dict[key] != 0:
            total_dict[key] = math.log10(total_dict[key])
        else:
            total_dict[key] = 0
    plt.figure(figsize=(10, 5))
    plt.title("Mosquito Habitat Mapper - Photo Subject Frequencies (Log Scale)")
    plt.xlabel("Photo Type")
    plt.ylabel("Frequency (Log Scale)")
    plt.bar(total_dict.keys(), total_dict.values(), color="lightblue")
Plots the number of entries with photos for each photo area (Larvae, Abdomen, Watersource)
Parameters
- mhm_df (pd.DataFrame): The DataFrame containing Mosquito Habitat Mapper Data with the PhotoBitDecimal Flag.
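As an illustration of how the PhotoBitDecimal flag can be decoded per subject, the sketch below tallies how many records have a photo of each type. It is a hypothetical helper, not part of the library: `count_photo_subjects` is an assumed name, and it counts records with a set bit rather than individual photos.

```python
from typing import Dict, Iterable


def count_photo_subjects(photo_bit_decimals: Iterable[int]) -> Dict[str, int]:
    """Count records that have a photo for each subject."""
    totals = {"Watersource Photos": 0, "Larvae Photos": 0, "Abdomen Photos": 0}
    for number in photo_bit_decimals:
        # Bit layout (most to least significant): watersource, larvae, abdomen.
        # Shifting before masking makes each set bit contribute exactly 1.
        totals["Watersource Photos"] += (number >> 2) & 1
        totals["Larvae Photos"] += (number >> 1) & 1
        totals["Abdomen Photos"] += number & 1
    return totals


# 6 = 110, 4 = 100, 1 = 001, 0 = 000
counts = count_photo_subjects([6, 4, 1, 0])
print(counts)
# {'Watersource Photos': 2, 'Larvae Photos': 1, 'Abdomen Photos': 1}
```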
def diagnostic_plots(mhm_df):
    """
    Generates (but doesn't display) diagnostic plots to gain insight into the current data.

    Plots:
    - Larvae Count Distribution (where a negative entry denotes null data)
    - Photo Subject Distribution
    - Number of valid photos vs no photos
    - Completeness Score Distribution
    - Subcompleteness Score Distribution

    Parameters
    ----------
    mhm_df : pd.DataFrame
        The DataFrame containing Flagged and Cleaned Mosquito Habitat Mapper Data.
    """
    plot_int_distribution(mhm_df, "mhm_LarvaeCount", "Larvae Count")
    photo_subjects(mhm_df)
    plot_freq_bar(mhm_df, "Mosquito Habitat Mapper", "mhm_Genus", "Genus Types")
    plot_valid_entries(mhm_df, "mhm_HasGenus", "Genus Classifications")
    plot_valid_entries(mhm_df, "mhm_PhotoBitDecimal", "Valid Photos")
    completeness_histogram(
        mhm_df,
        "Mosquito Habitat Mapper",
        "mhm_CumulativeCompletenessScore",
        "Cumulative Completeness",
    )
    completeness_histogram(
        mhm_df,
        "Mosquito Habitat Mapper",
        "mhm_SubCompletenessScore",
        "Sub Completeness",
    )
Generates (but doesn't display) diagnostic plots to gain insight into the current data.
Plots:
- Larvae Count Distribution (where a negative entry denotes null data)
- Photo Subject Distribution
- Number of valid photos vs no photos
- Completeness Score Distribution
- Subcompleteness Score Distribution
Parameters
- mhm_df (pd.DataFrame): The DataFrame containing Flagged and Cleaned Mosquito Habitat Mapper Data.
def qa_filter(
    mhm_df,
    has_genus=False,
    min_larvae_count=-9999,
    has_photos=False,
    is_container=False,
):
    """
    Can filter a cleaned and flagged mosquito habitat mapper DataFrame based on the following criteria:
    - `Has Genus`: If the entry has an identified genus
    - `Min Larvae Count` : Minimum larvae count needed for an entry
    - `Has Photos` : If the entry contains valid photo entries
    - `Is Container` : If the entry's watersource was a container

    Returns a copy of the DataFrame

    Parameters
    ----------
    has_genus : bool, default=False
        If True, only entries with an identified genus will be returned.
    min_larvae_count : int, default=-9999
        Only entries with a larvae count greater than or equal to this parameter will be included.
    has_photos : bool, default=False
        If True, only entries with recorded photos will be returned
    is_container : bool, default=False
        If True, only entries with containers will be returned

    Returns
    -------
    pd.DataFrame
        A DataFrame of the applied filters.
    """

    mhm_df = mhm_df[mhm_df["mhm_LarvaeCount"] >= min_larvae_count]

    if has_genus:
        mhm_df = mhm_df[mhm_df["mhm_HasGenus"] == 1]
    if has_photos:
        mhm_df = mhm_df[mhm_df["mhm_PhotoBitDecimal"] > 0]
    if is_container:
        mhm_df = mhm_df[mhm_df["mhm_IsWaterSourceContainer"] == 1]

    return mhm_df
Can filter a cleaned and flagged mosquito habitat mapper DataFrame based on the following criteria:
- Has Genus: If the entry has an identified genus
- Min Larvae Count: Minimum larvae count needed for an entry
- Has Photos: If the entry contains valid photo entries
- Is Container: If the entry's watersource was a container
Returns a copy of the DataFrame
Parameters
- has_genus (bool, default=False): If True, only entries with an identified genus will be returned.
- min_larvae_count (int, default=-9999): Only entries with a larvae count greater than or equal to this parameter will be included.
- has_photos (bool, default=False): If True, only entries with recorded photos will be returned
- is_container (bool, default=False): If True, only entries with containers will be returned
Returns
- pd.DataFrame: A DataFrame of the applied filters.
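The way the filters combine can be sketched without pandas. The example below is a rough illustration, assuming each record is a plain dict keyed by the default flag column names; `qa_filter_records` and the sample records are hypothetical and not part of the library.

```python
from typing import Dict, List


def qa_filter_records(
    records: List[Dict[str, int]],
    has_genus: bool = False,
    min_larvae_count: int = -9999,
    has_photos: bool = False,
    is_container: bool = False,
) -> List[Dict[str, int]]:
    """Apply the same criteria in sequence; each enabled filter narrows the result."""
    # The larvae count threshold is always applied; the default of -9999
    # (the null marker) keeps every record.
    kept = [r for r in records if r["mhm_LarvaeCount"] >= min_larvae_count]
    if has_genus:
        kept = [r for r in kept if r["mhm_HasGenus"] == 1]
    if has_photos:
        kept = [r for r in kept if r["mhm_PhotoBitDecimal"] > 0]
    if is_container:
        kept = [r for r in kept if r["mhm_IsWaterSourceContainer"] == 1]
    return kept


records = [
    {"mhm_LarvaeCount": 12, "mhm_HasGenus": 1,
     "mhm_PhotoBitDecimal": 6, "mhm_IsWaterSourceContainer": 1},
    {"mhm_LarvaeCount": -9999, "mhm_HasGenus": 0,
     "mhm_PhotoBitDecimal": 4, "mhm_IsWaterSourceContainer": 1},
    {"mhm_LarvaeCount": 3, "mhm_HasGenus": 1,
     "mhm_PhotoBitDecimal": 0, "mhm_IsWaterSourceContainer": 0},
]

with_genus = qa_filter_records(records, has_genus=True)
counted_with_photos = qa_filter_records(records, min_larvae_count=0, has_photos=True)
print(len(with_genus), len(counted_with_photos))  # 2 1
```

Setting `min_larvae_count=0` is one way to drop records whose larvae count is the null marker, since those are stored as -9999.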