go_utils.lc
Unpacking the Landcover Classification Data
The classification data for each entry is condensed into several entries separated by semicolons. This method identifies and parses Land Cover Classifications and percentages to create new columns. The columns are also reordered to better group directional information together.
The end result is a DataFrame that contains columns for every unique Land Cover Classification (per direction) and its respective percentage for each entry.
There are four main steps to this procedure:
1. Identifying Land Cover Classifications for each Cardinal Direction: An internal method returns the unique descriptions (e.g. HerbaceousGrasslandTallGrass) listed in a column. This method is run for all 4 cardinal directions to obtain all unique classifications per direction.
2. Creating empty columns for each Classification from each Cardinal Direction: Using the newly identified classifications, new columns are made for each unique classification. These columns are initialized with the default float64 value of 0.0. By initializing all the classification column values to 0.0, we ensure no empty values are set to -9999 in the round_cols(df) method (discussed in General Cleanup Procedures - Round Appropriate Columns). This step eases future numerical analysis.
3. Grouping and Alphabetically Sorting Directional Column Information: To better organize the DataFrame, columns containing any of the following directional substrings: "downward", "upward", "west", "east", "north", "south" (case insensitive) are identified and alphabetically sorted. Then, given an internal method called move_cols, the specified column headers to move (direction_data_cols), and the column before the desired point of insertion, the program returns a reordered DataFrame where all directional columns are grouped together. This greatly improves the Land Covers dataset's organization and accessibility.
4. Adding Classification Percentages to their respective Land Cover Classification Columns: To fill in each classification column with its respective percentage, an internal method is applied to each row of the DataFrame. This method iterates through each directional classification column (e.g. "lc_EastClassifications") and sets each identified Classification column to its respective percentage.
NOTE: After these procedures, the original directional classification columns (e.g. “lc_EastClassifications”) are not dropped.
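The semicolon-separated format described above can be parsed with two small regular expressions; a minimal standalone sketch of the parsing in steps 1 and 4 (the first sample entry comes from the docstrings in this module; the second is made up for illustration):

```python
import re

# Sample raw classification string: two classifications separated by a semicolon.
info = (
    "60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]; "
    "40% MUC 12 (c) [Grassland, Tall Grass]"
)

def classification_dict(info):
    """Parse 'NN% MUC .. [Description]' entries into {description: percentage}."""
    result = {}
    for entry in info.split(";"):
        # Description sits between the square brackets
        name = re.search(r"(?<=\[).*(?=\])", entry).group()
        # Percentage is everything before the '%'
        percent = float(re.search(r".*(?=%)", entry).group())
        result[name] = percent
    return result

print(classification_dict(info))
```

This mirrors how the module's `extract_classification_name` and `extract_classification_percentage` helpers split an entry into its description and percentage.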
1import math 2import re 3 4import matplotlib.pyplot as plt 5import numpy as np 6import pandas as pd 7import seaborn as sns 8 9from go_utils.cleanup import ( 10 camel_case, 11 remove_homogenous_cols, 12 rename_latlon_cols, 13 replace_column_prefix, 14 round_cols, 15 standardize_null_vals, 16) 17from go_utils.plot import completeness_histogram, multiple_bar_graph, plot_freq_bar 18 19__doc__ = """ 20 21## Unpacking the Landcover Classification Data 22The classification data for each entry is condensed into several entries separated by a semicolon. [This method](#unpack_classifications) identifies and parses Land Cover Classifications and percentages to create new columns. The columns are also reordered to better group directional information together. 23 24The end result is a DataFrame that contains columns for every Unique Landcover Classification (per direction) and its respective percentages for each entry. 25 26There are four main steps to this procedure: 271.Identifying Land Cover Classifications for each Cardinal Direction: An internal method returns the unique description (e.g. HerbaceousGrasslandTallGrass) listed in a column. This method is run for all 4 cardinal directions to obtain the all unique classifications per direction. 282. Creating empty columns for each Classification from each Cardinal Direction: Using the newly identified classifications new columns are made for each unique classification. These columns initially contained the default float64 value of 0.0. By initializing all the classification column values to 0.0, we ensure no empty values are set to -9999 in the round_cols(df) method (discussed in General Cleanup Procedures - Round Appropriate Columns). This step eases future numerical analysis. 293. 
Grouping and Alphabetically Sorting Directional Column Information: To better organize the DataFrame, columns containing any of the following directional substrings: "downward", "upward", "west", "east", "north", "south" (case insensitive) are identified and alphabetically sorted. Then an internal method called move_cols, specified column headers to move (direction_data_cols), and the location before the desired point of insertion, the program returns a reordered DataFrame, where all directional columns are grouped together. This greatly improves the Land Covers dataset’s organization and accessibility. 304. Adding Classification Percentages to their respective Land Cover Classification Columns - To fill in each classification column with their respective percentages, an internal method is applied to each row in the dataframe. This method iterates through each classification direction (ie “lc_EastClassifications”) and sets each identified Classification column with its respective percentage. 31 32NOTE: After these procedures, the original directional classification columns (e.g. “lc_EastClassifications”) are not dropped. 33""" 34 35classifications = [] 36 37 38def cleanup_column_prefix(df, inplace=False): 39 """Method for shortening raw landcover column names. 40 41 The df object will now replace the verbose `landcovers` prefix in some of the columns with `lc_` 42 43 Parameters 44 ---------- 45 df : pd.DataFrame 46 The DataFrame containing raw landcover data. The DataFrame object itself will be modified. 47 inplace : bool, default=False 48 Whether to return a new DataFrame. If True then no DataFrame copy is not returned and the operation is performed in place. 49 50 Returns 51 ------- 52 pd.DataFrame or None 53 A DataFrame with the cleaned up column prefixes. If `inplace=True` it returns None. 
54 """ 55 56 if not inplace: 57 df = df.copy() 58 59 replace_column_prefix(df, "landcovers", "lc", inplace=True) 60 61 if not inplace: 62 return df 63 64 65def extract_classification_name(entry): 66 """ 67 Extracts the name (landcover description) of a singular landcover classification. For example in the classification of `"60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]"`, the `"Trees, Closely Spaced, Deciduous - Broad Leaved"` is extracted. 68 69 Parameters 70 ---------- 71 entry : str 72 A single landcover classification. 73 74 Returns 75 ------- 76 str 77 The Landcover description of a classification 78 """ 79 80 return re.search(r"(?<=\[).*(?=\])", entry).group() 81 82 83def extract_classification_percentage(entry): 84 """ 85 Extracts the percentage of a singular landcover classification. For example in the classification of `"60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]"`, the `60` is extracted. 86 87 Parameters 88 ---------- 89 entry : str 90 A single landcover classification. 91 92 Returns 93 ------- 94 float 95 The percentage of a landcover classification 96 """ 97 98 return float(re.search(".*(?=%)", entry).group()) 99 100 101def _extract_landcover_items(func, info): 102 entries = info.split(";") 103 return [func(entry) for entry in entries] 104 105 106def extract_classifications(info): 107 """Extracts the name/landcover description (see [here](#extract_classification_name) for a clearer definition) of a landcover classification entry in the GLOBE Observer Data. 108 109 Parameters 110 ---------- 111 info : str 112 A string representing a landcover classification entry in the GLOBE Observer Datset. 113 114 Returns 115 ------- 116 list of str 117 The different landcover classifications stored within the landcover entry. 
118 """ 119 return _extract_landcover_items(extract_classification_name, info) 120 121 122def extract_percentages(info): 123 """Extracts the percentages (see [here](#extract_classification_percentage) for a clearer definition) of a landcover classification in the GLOBE Observer Datset. 124 125 Parameters 126 ---------- 127 info : str 128 A string representing a landcover classification entry in the GLOBE Observer Datset. 129 130 Returns 131 ------- 132 list of float 133 The different landcover percentages stored within the landcover entry. 134 """ 135 136 return _extract_landcover_items(extract_classification_percentage, info) 137 138 139def extract_classification_dict(info): 140 """Extracts the landcover descriptions and percentages of a landcover classification entry as a dictionary. 141 142 Parameters 143 ---------- 144 info : str 145 A string representing a landcover classification entry in the GLOBE Observer Datset. 146 147 Returns 148 ------- 149 dict of str, float 150 The landcover descriptions and percentages stored as a dict in the form: `{"description" : percentage}`. 
151 """ 152 153 entries = info.split(";") 154 return { 155 extract_classification_name(entry): extract_classification_percentage(entry) 156 for entry in entries 157 } 158 159 160def _get_classifications_for_direction(df, direction_col_name): 161 list_of_land_types = [] 162 for info in df[direction_col_name]: 163 # Note: Sometimes info = np.nan, a float -- In that case we do NOT parse/split 164 if type(info) == str: 165 [ 166 list_of_land_types.append(camel_case(entry, [" ", ",", "-", "/"])) 167 for entry in extract_classifications(info) 168 ] 169 return np.unique(list_of_land_types).tolist() 170 171 172def _move_cols(df, cols_to_move=[], ref_col=""): 173 col_names = df.columns.tolist() 174 index_before_desired_loc = col_names.index(ref_col) 175 176 cols_before_index = col_names[: index_before_desired_loc + 1] 177 cols_at_index = cols_to_move 178 179 cols_before_index = [i for i in cols_before_index if i not in cols_at_index] 180 cols_after_index = [ 181 i for i in col_names if i not in cols_before_index + cols_at_index 182 ] 183 184 return df[cols_before_index + cols_at_index + cols_after_index] 185 186 187def unpack_classifications( 188 lc_df, 189 north="lc_NorthClassifications", 190 east="lc_EastClassifications", 191 south="lc_SouthClassifications", 192 west="lc_WestClassifications", 193 ref_col="lc_pid", 194 unpack=True, 195): 196 """ 197 Unpacks the classification data in the *raw* GLOBE Observer Landcover data. This method assumes that the columns have been renamed with accordance to the [column cleanup](#cleanup_column_prefix) method. 198 199 This returns a copy of the dataframe. 200 201 See [here](#unpacking-the-landcover-classification-data) for more information. 202 203 *Note:* The returned DataFrame will have around 250 columns. 204 205 Parameters 206 ---------- 207 lc_df : pd.DataFrame 208 A DataFrame containing Raw GLOBE Observer Landcover data that has had the column names simplified. 
209 north: str, default="lc_NorthClassifications" 210 The name of the column which contains the North Classifications 211 east: str, default="lc_EastClassifications" 212 The name of the column which contains the East Classifications 213 south: str, default="lc_SouthClassifications" 214 The name of the column which contains the South Classifications 215 west: str, default="lc_WestClassifications" 216 The name of the column which contains the West Classifications 217 ref_col: str, default="lc_pid" 218 The name of the column which all of the expanded values will be placed after. For example, if the columns were `[1, 2, 3, 4]` and you chose 3, the new columns will now be `[1, 2, 3, (all classification columns), 4]`. 219 unpack: bool, default=True 220 True if you want to unpack the directional classifications, False if you only want overall classifications 221 222 Returns 223 ------- 224 pd.DataFrame 225 A DataFrame with the unpacked classification columns. 226 list 227 A list containing all the generated overall Land Cover column names (mainly for testing purposes). 228 list 229 A list containing all the generated directional Land Cover column names (mainly for testing purposes). 
230 """ 231 232 classifications = [north, east, south, west] 233 234 def set_directions(row): 235 for classification in classifications: 236 if not pd.isnull(row[classification]): 237 entries = row[classification].split(";") 238 for entry in entries: 239 percent, name = ( 240 extract_classification_percentage(entry), 241 extract_classification_name(entry), 242 ) 243 name = camel_case(name, [" ", ",", "-", "/"]) 244 classification = classification.replace("Classifications", "_") 245 overall = re.sub( 246 r"(north|south|east|west).*", 247 "Overall_", 248 key, 249 flags=re.IGNORECASE, 250 ) 251 row[f"{classification}{name.strip()}"] = percent 252 row[f"{overall}{name.strip()}"] += percent 253 return row 254 255 land_type_columns_to_add = { 256 classification: _get_classifications_for_direction(lc_df, classification) 257 for classification in classifications 258 } 259 overall_columns = set() 260 direction_cols = set() 261 for key, values in land_type_columns_to_add.items(): 262 direction_name = key.replace("Classifications", "_") 263 overall = re.sub( 264 r"(north|south|east|west).*", "Overall_", key, flags=re.IGNORECASE 265 ) 266 for value in values: 267 direction_cols.add(direction_name + value) 268 overall_columns.add(overall + value) 269 overall_columns = list(overall_columns) 270 direction_cols = list(direction_cols) 271 direction_data_cols = sorted(overall_columns + direction_cols) 272 273 # Creates a blank DataFrame and concats it to the original to avoid iteratively growing the LC DataFrame 274 blank_df = pd.DataFrame( 275 np.zeros((len(lc_df), len(direction_data_cols))), columns=direction_data_cols 276 ) 277 278 lc_df = pd.concat([lc_df, blank_df], axis=1) 279 280 lc_df = _move_cols(lc_df, cols_to_move=direction_data_cols, ref_col=ref_col) 281 lc_df = lc_df.apply(set_directions, axis=1) 282 for column in overall_columns: 283 lc_df[column] /= 4 284 285 if not unpack: 286 lc_df = lc_df.drop(columns=direction_cols) 287 return lc_df, overall_columns, 
direction_cols 288 289 290def photo_bit_flags( 291 df, 292 up="lc_UpwardPhotoUrl", 293 down="lc_DownwardPhotoUrl", 294 north="lc_NorthPhotoUrl", 295 south="lc_SouthPhotoUrl", 296 east="lc_EastPhotoUrl", 297 west="lc_WestPhotoUrl", 298 photo_count="lc_PhotoCount", 299 rejected_count="lc_RejectedCount", 300 pending_count="lc_PendingCount", 301 empty_count="lc_EmptyCount", 302 bit_binary="lc_PhotoBitBinary", 303 bit_decimal="lc_PhotoBitDecimal", 304 inplace=False, 305): 306 """ 307 Creates the following flags: 308 - `PhotoCount`: The number of valid photos per record. 309 - `RejectedCount`: The number of photos that were rejected per record. 310 - `PendingCount`: The number of photos that are pending approval per record. 311 - `PhotoBitBinary`: A string that represents the presence of a photo in the Up, Down, North, South, East, and West directions. For example, if the entry is `110100`, that indicates that there is a valid photo for the Up, Down, and South Directions but no valid photos for the North, East, and West Directions. 312 - `PhotoBitDecimal`: The numerical representation of the lc_PhotoBitBinary string. 313 314 Parameters 315 ---------- 316 df : pd.DataFrame 317 A land cover DataFrame 318 up : str, default="lc_UpwardPhotoUrl" 319 The name of the column in the land cover DataFrame that contains the url for the upwards photo. 320 down : str, default="lc_DownwardPhotoUrl" 321 The name of the column in the land cover DataFrame that contains the url for the downwards photo. 322 north : str, default="lc_NorthPhotoUrl" 323 The name of the column in the land cover DataFrame that contains the url for the north photo. 324 south : str, default="lc_SouthPhotoUrl" 325 The name of the column in the land cover DataFrame that contains the url for the south photo. 326 east : str, default="lc_EastPhotoUrl" 327 The name of the column in the land cover DataFrame that contains the url for the east photo. 
328 west : str, default="lc_WestPhotoUrl" 329 The name of the column in the land cover DataFrame that contains the url for the west photo. 330 photo_count : str, default="lc_PhotoCount" 331 The name of the column that will be storing the PhotoCount flag. 332 rejected_count : str, default="lc_RejectedCount" 333 The name of the column that will be storing the RejectedCount flag. 334 pending_count : str, default="lc_PendingCount" 335 The name of the column that will be storing the PendingCount flag. 336 empty_count : str, default="lc_EmptyCount" 337 The name of the column that will be storing the EmptyCount flag. 338 bit_binary : str, default="lc_PhotoBitBinary" 339 The name of the column that will be storing the PhotoBitBinary flag. 340 bit_decimal : str, default="lc_PhotoBitDecimal" 341 The name of the column that will be storing the PhotoBitDecimal flag. 342 inplace : bool, default=False 343 Whether to return a new DataFrame. If True then no DataFrame copy is not returned and the operation is performed in place. 344 345 Returns 346 ------- 347 pd.DataFrame or None 348 A DataFrame with the photo bit flags. If `inplace=True` it returns None. 
349 """ 350 351 def pic_data(*args): 352 pic_count = 0 353 rejected_count = 0 354 pending_count = 0 355 empty_count = 0 356 valid_photo_bit_mask = "" 357 358 for entry in args: 359 if not pd.isna(entry) and "http" in entry: 360 valid_photo_bit_mask += "1" 361 pic_count += entry.count("http") 362 else: 363 valid_photo_bit_mask += "0" 364 if pd.isna(entry): 365 empty_count += 1 366 else: 367 pending_count += entry.count("pending") 368 rejected_count += entry.count("rejected") 369 return ( 370 pic_count, 371 rejected_count, 372 pending_count, 373 empty_count, 374 valid_photo_bit_mask, 375 int(valid_photo_bit_mask, 2), 376 ) 377 378 if not inplace: 379 df = df.copy() 380 381 get_photo_data = np.vectorize(pic_data) 382 ( 383 df[photo_count], 384 df[rejected_count], 385 df[pending_count], 386 df[empty_count], 387 df[bit_binary], 388 df[bit_decimal], 389 ) = get_photo_data( 390 df[up].to_numpy(), 391 df[down].to_numpy(), 392 df[north].to_numpy(), 393 df[south].to_numpy(), 394 df[east].to_numpy(), 395 df[west].to_numpy(), 396 ) 397 398 if not inplace: 399 return df 400 401 402def classification_bit_flags( 403 df, 404 north="lc_NorthClassifications", 405 south="lc_SouthClassifications", 406 east="lc_EastClassifications", 407 west="lc_WestClassifications", 408 classification_count="lc_ClassificationCount", 409 bit_binary="lc_ClassificationBitBinary", 410 bit_decimal="lc_ClassificationBitDecimal", 411 inplace=False, 412): 413 """ 414 Creates the following flags: 415 - `ClassificationCount`: The number of classifications per record. 416 - `BitBinary`: A string that represents the presence of a classification in the North, South, East, and West directions. For example, if the entry is `1101`, that indicates that there is a valid classification for the North, South, and West Directions but no valid classifications for the East Direction. 417 - `BitDecimal`: The number of photos that are pending approval per record. 
418 419 Parameters 420 ---------- 421 df : pd.DataFrame 422 A land cover DataFrame 423 north : str, default="lc_NorthClassifications" 424 The name of the column in the land cover DataFrame that contains the north classification. 425 south : str, default="lc_SouthClassifications" 426 The name of the column in the land cover DataFrame that contains the south classification. 427 east : str, default="lc_EastClassifications" 428 The name of the column in the land cover DataFrame that contains the east classification. 429 west : str, default="lc_WestClassifications" 430 The name of the column in the land cover DataFrame that contains the west classification. 431 classification_count : str, default="lc_ClassificationCount" 432 The name of the column that will store the ClassificationCount flag. 433 bit_binary : str, default="lc_ClassificationBitBinary" 434 The name of the column that will store the BitBinary flag. 435 bit_decimal : str, default="lc_ClassificationBitDecimal" 436 The name of the column that will store the BitDecimal flag. 437 inplace : bool, default=False 438 Whether to return a new DataFrame. If True then no DataFrame copy is not returned and the operation is performed in place. 439 440 Returns 441 ------- 442 pd.DataFrame or None 443 A DataFrame with the classification bit flags. If `inplace=True` it returns None. 
444 """ 445 446 def classification_data(*args): 447 classification_count = 0 448 classification_bit_mask = "" 449 for entry in args: 450 if pd.isna(entry) or entry is np.nan: 451 classification_bit_mask += "0" 452 else: 453 classification_count += 1 454 classification_bit_mask += "1" 455 return ( 456 classification_count, 457 classification_bit_mask, 458 int(classification_bit_mask, 2), 459 ) 460 461 if not inplace: 462 df = df.copy() 463 get_classification_data = np.vectorize(classification_data) 464 465 ( 466 df[classification_count], 467 df[bit_binary], 468 df[bit_decimal], 469 ) = get_classification_data( 470 df[north], 471 df[south], 472 df[east], 473 df[west], 474 ) 475 if not inplace: 476 return df 477 478 479def completion_scores( 480 df, 481 photo_bit_binary="lc_PhotoBitBinary", 482 classification_binary="lc_ClassificationBitBinary", 483 sub_completeness="lc_SubCompletenessScore", 484 completeness="lc_CumulativeCompletenessScore", 485 inplace=False, 486): 487 """ 488 Adds the following completness score flags: 489 - `SubCompletenessScore`: The percentage of valid landcover classifications and photos that are filled out. 490 - `CumulativeCompletenessScore`: The percentage of non null values out of all the columns. 491 492 Parameters 493 ---------- 494 df : pd.DataFrame 495 A landcover DataFrame with the [`PhotoBitBinary`](#photo_bit_flags) and [`ClassificationBitBinary`](#classification_bit_flags) flags. 496 photo_bit_binary : str, default="lc_PhotoBitBinary" 497 The name of the column that stores the PhotoBitBinary flag. 498 classification_binary : str, default="lc_PhotoBitBinary" 499 The name of the column that stores the ClassificationBitBinary flag. 500 sub_completeness : str, default="lc_PhotoBitBinary" 501 The name of the column that will store the generated SubCompletenessScore flag. 502 completeness : str, default="lc_PhotoBitBinary" 503 The name of the column that will store the generated CompletenessScore flag. 
504 inplace : bool, default=False 505 Whether to return a new DataFrame. If True then no DataFrame copy is not returned and the operation is performed in place. 506 507 Returns 508 ------- 509 pd.DataFrame or None 510 A DataFrame with the completeness score flags. If `inplace=True` it returns None. 511 """ 512 513 def sum_bit_mask(bit_mask="0"): 514 sum = 0.0 515 for char in bit_mask: 516 sum += int(char) 517 return sum 518 519 if not inplace: 520 df = df.copy() 521 522 scores = {} 523 scores["sub_score"] = [] 524 # Cummulative Completion Score 525 scores["cumulative_score"] = round(df.count(1) / len(df.columns), 2) 526 # Sub-Score 527 for index in df.index: 528 bit_mask = df[photo_bit_binary][index] + df[classification_binary][index] 529 sub_score = round(sum_bit_mask(bit_mask=bit_mask), 2) 530 sub_score /= len(bit_mask) 531 scores["sub_score"].append(sub_score) 532 533 df[sub_completeness], df[completeness] = ( 534 scores["sub_score"], 535 scores["cumulative_score"], 536 ) 537 538 if not inplace: 539 return df 540 541 542def apply_cleanup(lc_df, unpack=True): 543 """Applies a full cleanup procedure to the landcover data. 544 It follows the following steps: 545 - Removes Homogenous Columns 546 - Renames Latitude and Longitudes 547 - Cleans the Column Naming 548 - Unpacks landcover classifications 549 - Rounds Columns 550 - Standardizes Null Values 551 552 This returns a copy 553 554 Parameters 555 ---------- 556 lc_df : pd.DataFrame 557 A DataFrame containing **raw** Landcover Data from the API. 558 unpack : bool 559 If True, the Landcover data will expand the classifications into separate columns (results in around 300 columns). If False, it will just unpack overall landcover. 
560 561 Returns 562 ------- 563 pd.DataFrame 564 A DataFrame containing the cleaned Landcover Data 565 """ 566 lc_df = lc_df.copy() 567 568 remove_homogenous_cols(lc_df, inplace=True) 569 rename_latlon_cols(lc_df, inplace=True) 570 cleanup_column_prefix(lc_df, inplace=True) 571 lc_df, overall_cols, directional_cols = unpack_classifications(lc_df, unpack=unpack) 572 573 round_cols(lc_df, inplace=True) 574 standardize_null_vals(lc_df, inplace=True) 575 return lc_df 576 577 578def add_flags(lc_df): 579 """Adds the following flags to the landcover data: 580 - Photo Bit Flags 581 - Classification Bit Flags 582 - Completeness Score Flags 583 584 Returns a copy of the DataFrame 585 586 Parameters 587 ---------- 588 lc_df : pd.DataFrame 589 A DataFrame containing cleaned up Landcover Data ideally from the [apply_cleanup](#apply_cleanup) method. 590 591 Returns 592 ------- 593 pd.DataFrame 594 A DataFrame containing the Land Cover flags. 595 """ 596 lc_df = lc_df.copy() 597 photo_bit_flags(lc_df, inplace=True) 598 classification_bit_flags(lc_df, inplace=True) 599 get_main_classifications(lc_df, inplace=True) 600 completion_scores(lc_df, inplace=True) 601 return lc_df 602 603 604def direction_frequency(lc_df, direction_list, bit_binary, entry_type): 605 """ 606 Plots the amount of a variable of interest for each direction. 607 608 Parameters 609 ---------- 610 lc_df : pd.DataFrame 611 The DataFrame containing Land Cover Data. 612 direction_list : list of str 613 The column names of the different variables of interest for each direction. 614 bit_binary: str 615 The Bit Binary Flag associated with the variable of interest. 616 entry_type: str 617 The variable of interest (e.g. 
Photos or Classifications) 618 """ 619 direction_photos = pd.DataFrame() 620 direction_photos["category"] = direction_list 621 direction_counts = [0 for i in range(len(direction_photos))] 622 for mask in lc_df[bit_binary]: 623 for i in range(len(mask) - 1, -1, -1): 624 direction_counts[i] += int(mask[i]) 625 direction_counts 626 direction_photos["count"] = [math.log10(value) for value in direction_counts] 627 direction_photos 628 629 plt.figure(figsize=(15, 6)) 630 title = f"Land Cover -- {entry_type} Direction Frequency (Log Scale)" 631 plt.title(title) 632 plt.ylabel("Count (Log Scale)") 633 sns.barplot(data=direction_photos, x="category", y="count", color="lightblue") 634 635 636def diagnostic_plots( 637 lc_df, 638 up_url="lc_UpwardPhotoUrl", 639 down_url="lc_DownwardPhotoUrl", 640 north_url="lc_NorthPhotoUrl", 641 south_url="lc_SouthPhotoUrl", 642 east_url="lc_EastPhotoUrl", 643 west_url="lc_WestPhotoUrl", 644 photo_bit="lc_PhotoBitBinary", 645 north_classification="lc_NorthClassifications", 646 south_classification="lc_SouthClassifications", 647 east_classification="lc_EastClassifications", 648 west_classification="lc_WestClassifications", 649 classification_bit="lc_ClassificationBitBinary", 650): 651 """ 652 Generates (but doesn't display) diagnostic plots to gain insight into the current data. 653 654 Plots: 655 - Valid Photo Count Distribution 656 - Photo Distribution by direction 657 - Classification Distribution by direction 658 - Photo Status Distribution 659 - Completeness Score Distribution 660 - Subcompleteness Score Distribution 661 662 Parameters 663 ---------- 664 lc_df : pd.DataFrame 665 The DataFrame containing Flagged and Cleaned Land Cover Data. 
666 """ 667 plot_freq_bar( 668 lc_df, "Land Cover", "lc_PhotoCount", "Valid Photo Count", log_scale=True 669 ) 670 direction_frequency( 671 lc_df, 672 [up_url, down_url, north_url, south_url, east_url, west_url], 673 photo_bit, 674 "Photo", 675 ) 676 direction_frequency( 677 lc_df, 678 [ 679 north_classification, 680 south_classification, 681 east_classification, 682 west_classification, 683 ], 684 classification_bit, 685 "Classification", 686 ) 687 multiple_bar_graph( 688 lc_df, 689 "Land Cover", 690 ["lc_PhotoCount", "lc_RejectedCount", "lc_EmptyCount"], 691 "Photo Summary", 692 log_scale=True, 693 ) 694 695 completeness_histogram( 696 lc_df, "Land Cover", "lc_CumulativeCompletenessScore", "Cumulative Completeness" 697 ) 698 completeness_histogram( 699 lc_df, "Land Cover", "lc_SubCompletenessScore", "Sub Completeness" 700 ) 701 702 703def qa_filter( 704 lc_df, 705 has_classification=False, 706 has_photo=False, 707 has_all_photos=False, 708 has_all_classifications=False, 709): 710 """ 711 Can filter a cleaned and flagged mosquito habitat mapper DataFrame based on the following criteria: 712 - `Has Classification`: If the entry has atleast one direction classified 713 - `Has Photo` : If the entry has atleast one photo taken 714 - `Has All Photos` : If the entry has all photos taken (up, down, north, south, east, west) 715 - `Has All Classifications` : If the entry has all directions classified 716 717 Returns a copy of the DataFrame 718 719 Parameters 720 ---------- 721 has_classification : bool, default=False 722 If True, only entries with atleast one classification will be included. 723 has_photo : bool, default=False 724 If True, only entries with atleast one photo will be included. 725 has_all_photos : bool, default=False 726 If True, only entries with all photos will be included. 727 has_all_classifications : bool, default=False 728 If True, only entries with all classifications will be included. 
729 730 Returns 731 ------- 732 pd.DataFrame 733 A DataFrame of the applied filters. 734 """ 735 736 if has_classification and not has_all_classifications: 737 lc_df = lc_df[lc_df["lc_ClassificationBitDecimal"] > 0] 738 elif has_all_classifications: 739 lc_df = lc_df[lc_df["lc_ClassificationBitDecimal"] == 15] 740 if has_photo and not has_all_photos: 741 lc_df = lc_df[lc_df["lc_PhotoBitDecimal"] > 0] 742 elif has_all_photos: 743 lc_df = lc_df[lc_df["lc_PhotoBitDecimal"] == 63] 744 745 return lc_df 746 747 748def _accumulate_ties(classification_list): 749 classifications = list() 750 i = 0 751 while i < len(classification_list) - 1: 752 if classification_list[i][1] == classification_list[i + 1][1]: 753 classifications.append(classification_list[i][0]) 754 classifications.append(classification_list[i + 1][0]) 755 i += 1 756 else: 757 break 758 759 output = ", ".join([classification for classification in classifications]) 760 if not output: 761 if len(classification_list) != 0: 762 output = classification_list[0][0] 763 else: 764 output = "NA" 765 # TODO replace w regex methods 766 return output, i + 1 767 768 769def _rank_direction(classification_dict, direction_classifications): 770 if pd.isna(direction_classifications): 771 return "NA", "NA" 772 classifications_list = [] 773 classifications = direction_classifications.split(";") 774 for classification_data in classifications: 775 percent = extract_classification_percentage(classification_data) 776 classification = extract_classification_name(classification_data) 777 if classification in classification_dict: 778 classification_dict[classification] += percent 779 else: 780 classification_dict[classification] = percent 781 classifications_list.append((classification, percent)) 782 classifications_list = sorted( 783 classifications_list, key=lambda x: x[1], reverse=True 784 ) 785 if len(classifications_list) < 2: 786 return classifications_list[0][0], "NA" 787 788 primary_classification, i = 
```python
    # (tail of the private tie-accumulation helper, truncated above)
    secondary_classification, temp = _accumulate_ties(classifications_list[i:])

    return primary_classification, secondary_classification


def _rank_classifications(*args):
    classification_dict = {}
    rank_directions = [
        classification
        for arg in args
        for classification in _rank_direction(classification_dict, arg)
    ]
    primary, secondary = ("NA", 0), ("NA", 0)
    if classification_dict:
        if len(classification_dict) < 2:
            primary = (
                list(classification_dict.keys())[0],
                list(classification_dict.values())[0],
            )
        else:
            sorted_classifications = sorted(
                classification_dict.items(), key=lambda x: x[1], reverse=True
            )
            primary, i = _accumulate_ties(sorted_classifications)
            primary = primary, sorted_classifications[0][1]
            if i < len(sorted_classifications):
                secondary, temp = _accumulate_ties(sorted_classifications[i:])
                secondary = secondary, sorted_classifications[i][1]
    return (
        *rank_directions,
        primary[0],
        secondary[0],
        primary[1] / len(args),
        secondary[1] / len(args),
    )


def get_main_classifications(
    lc_df,
    north_classification="lc_NorthClassifications",
    east_classification="lc_EastClassifications",
    south_classification="lc_SouthClassifications",
    west_classification="lc_WestClassifications",
    north_primary="lc_NorthPrimary",
    north_secondary="lc_NorthSecondary",
    east_primary="lc_EastPrimary",
    east_secondary="lc_EastSecondary",
    south_primary="lc_SouthPrimary",
    south_secondary="lc_SouthSecondary",
    west_primary="lc_WestPrimary",
    west_secondary="lc_WestSecondary",
    primary_classification="lc_PrimaryClassification",
    secondary_classification="lc_SecondaryClassification",
    primary_percentage="lc_PrimaryPercentage",
    secondary_percentage="lc_SecondaryPercentage",
    inplace=False,
):
    if not inplace:
        lc_df = lc_df.copy()
    vectorized_rank = np.vectorize(_rank_classifications)
    (
        lc_df[north_primary],
        lc_df[north_secondary],
        lc_df[east_primary],
        lc_df[east_secondary],
        lc_df[south_primary],
        lc_df[south_secondary],
        lc_df[west_primary],
        lc_df[west_secondary],
        lc_df[primary_classification],
        lc_df[secondary_classification],
        lc_df[primary_percentage],
        lc_df[secondary_percentage],
    ) = vectorized_rank(
        lc_df[north_classification].to_numpy(),
        lc_df[east_classification].to_numpy(),
        lc_df[south_classification].to_numpy(),
        lc_df[west_classification].to_numpy(),
    )

    if not inplace:
        return lc_df
```
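The pooling behind `_rank_classifications` can be illustrated with a minimal standalone sketch. `rank_pooled` below is a hypothetical stand-in (the real helpers `_rank_direction` and `_accumulate_ties` live earlier in this module and also handle ties): it sums each classification's percentage across the four directions, then divides by the number of directions to get the overall primary and secondary percentages.

```python
def rank_pooled(direction_dicts):
    """Pool {classification: percent} dicts and return (primary, secondary)."""
    pooled = {}
    for d in direction_dicts:
        for name, pct in d.items():
            pooled[name] = pooled.get(name, 0) + pct
    # Sort by pooled percentage, highest first.
    ranked = sorted(pooled.items(), key=lambda kv: kv[1], reverse=True)
    n = len(direction_dicts)
    primary = (ranked[0][0], ranked[0][1] / n) if ranked else ("NA", 0)
    secondary = (ranked[1][0], ranked[1][1] / n) if len(ranked) > 1 else ("NA", 0)
    return primary, secondary


primary, secondary = rank_pooled(
    [
        {"Trees": 60.0, "Grass": 40.0},
        {"Trees": 80.0, "Grass": 20.0},
        {"Trees": 100.0},
        {"Grass": 100.0},
    ]
)
# Trees pooled: 240 over 4 directions -> 60.0; Grass pooled: 160 / 4 -> 40.0
```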
```python
def cleanup_column_prefix(df, inplace=False):
    """Method for shortening raw landcover column names.

    The verbose `landcovers` prefix in some of the columns is replaced with `lc_`.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing raw landcover data.
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no copy is made and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the cleaned up column prefixes. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()

    replace_column_prefix(df, "landcovers", "lc", inplace=True)

    if not inplace:
        return df
```
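As a rough illustration of the renaming (using a plain string replace on the column labels rather than the module's `replace_column_prefix` helper; the column names below are made up):

```python
import pandas as pd

df = pd.DataFrame(columns=["landcoversNorthClassifications", "landcoversMucCode"])
# Swap the verbose "landcovers" prefix for the shorter "lc_" prefix.
df.columns = [c.replace("landcovers", "lc_", 1) for c in df.columns]
```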
```python
def extract_classification_name(entry):
    """
    Extracts the name (landcover description) of a singular landcover classification. For example, in the classification `"60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]"`, the `"Trees, Closely Spaced, Deciduous - Broad Leaved"` is extracted.

    Parameters
    ----------
    entry : str
        A single landcover classification.

    Returns
    -------
    str
        The landcover description of a classification.
    """

    return re.search(r"(?<=\[).*(?=\])", entry).group()
```
```python
def extract_classification_percentage(entry):
    """
    Extracts the percentage of a singular landcover classification. For example, in the classification `"60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]"`, the `60` is extracted.

    Parameters
    ----------
    entry : str
        A single landcover classification.

    Returns
    -------
    float
        The percentage of a landcover classification.
    """

    return float(re.search(".*(?=%)", entry).group())
```
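Both extractors can be exercised on a sample entry; the regexes below mirror the ones in the source (a lookbehind/lookahead pair for the bracketed description, and a lookahead on `%` for the percentage):

```python
import re

entry = "60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]"
name = re.search(r"(?<=\[).*(?=\])", entry).group()       # text inside the brackets
percentage = float(re.search(r".*(?=%)", entry).group())  # digits before the "%"
```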
```python
def extract_classifications(info):
    """Extracts the name/landcover description (see [here](#extract_classification_name) for a clearer definition) of a landcover classification entry in the GLOBE Observer Data.

    Parameters
    ----------
    info : str
        A string representing a landcover classification entry in the GLOBE Observer Dataset.

    Returns
    -------
    list of str
        The different landcover classifications stored within the landcover entry.
    """
    return _extract_landcover_items(extract_classification_name, info)
```
```python
def extract_percentages(info):
    """Extracts the percentages (see [here](#extract_classification_percentage) for a clearer definition) of a landcover classification entry in the GLOBE Observer Dataset.

    Parameters
    ----------
    info : str
        A string representing a landcover classification entry in the GLOBE Observer Dataset.

    Returns
    -------
    list of float
        The different landcover percentages stored within the landcover entry.
    """

    return _extract_landcover_items(extract_classification_percentage, info)
```
```python
def extract_classification_dict(info):
    """Extracts the landcover descriptions and percentages of a landcover classification entry as a dictionary.

    Parameters
    ----------
    info : str
        A string representing a landcover classification entry in the GLOBE Observer Dataset.

    Returns
    -------
    dict of str, float
        The landcover descriptions and percentages stored as a dict in the form: `{"description" : percentage}`.
    """

    entries = info.split(";")
    return {
        extract_classification_name(entry): extract_classification_percentage(entry)
        for entry in entries
    }
```
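Applied to a two-part entry, the dictionary form looks like this (the dict comprehension below inlines the same regexes the two extractors use):

```python
import re

info = (
    "60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved];"
    "40% MUC 09 (a) [Herbaceous, Grassland, Tall Grass]"
)
# Same logic as extract_classification_dict: split on ";" and parse each piece.
classification_dict = {
    re.search(r"(?<=\[).*(?=\])", e).group(): float(re.search(r".*(?=%)", e).group())
    for e in info.split(";")
}
```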
```python
def unpack_classifications(
    lc_df,
    north="lc_NorthClassifications",
    east="lc_EastClassifications",
    south="lc_SouthClassifications",
    west="lc_WestClassifications",
    ref_col="lc_pid",
    unpack=True,
):
    """
    Unpacks the classification data in the *raw* GLOBE Observer Landcover data. This method assumes that the columns have been renamed in accordance with the [column cleanup](#cleanup_column_prefix) method.

    This returns a copy of the dataframe.

    See [here](#unpacking-the-landcover-classification-data) for more information.

    *Note:* The returned DataFrame will have around 250 columns.

    Parameters
    ----------
    lc_df : pd.DataFrame
        A DataFrame containing raw GLOBE Observer Landcover data that has had the column names simplified.
    north : str, default="lc_NorthClassifications"
        The name of the column which contains the North Classifications.
    east : str, default="lc_EastClassifications"
        The name of the column which contains the East Classifications.
    south : str, default="lc_SouthClassifications"
        The name of the column which contains the South Classifications.
    west : str, default="lc_WestClassifications"
        The name of the column which contains the West Classifications.
    ref_col : str, default="lc_pid"
        The name of the column after which all of the expanded values will be placed. For example, if the columns were `[1, 2, 3, 4]` and you chose `3`, the new columns will be `[1, 2, 3, (all classification columns), 4]`.
    unpack : bool, default=True
        True if you want to unpack the directional classifications, False if you only want overall classifications.

    Returns
    -------
    pd.DataFrame
        A DataFrame with the unpacked classification columns.
    list
        A list containing all the generated overall Land Cover column names (mainly for testing purposes).
    list
        A list containing all the generated directional Land Cover column names (mainly for testing purposes).
    """

    classifications = [north, east, south, west]

    def set_directions(row):
        for classification in classifications:
            if not pd.isnull(row[classification]):
                entries = row[classification].split(";")
                for entry in entries:
                    percent, name = (
                        extract_classification_percentage(entry),
                        extract_classification_name(entry),
                    )
                    name = camel_case(name, [" ", ",", "-", "/"])
                    # Derive both prefixes from the current column name. (The
                    # original code referenced the loop variable `key` from the
                    # enclosing scope here, which only ever holds the last
                    # direction by the time this closure runs.)
                    direction_prefix = classification.replace("Classifications", "_")
                    overall = re.sub(
                        r"(north|south|east|west).*",
                        "Overall_",
                        classification,
                        flags=re.IGNORECASE,
                    )
                    row[f"{direction_prefix}{name.strip()}"] = percent
                    row[f"{overall}{name.strip()}"] += percent
        return row

    land_type_columns_to_add = {
        classification: _get_classifications_for_direction(lc_df, classification)
        for classification in classifications
    }
    overall_columns = set()
    direction_cols = set()
    for key, values in land_type_columns_to_add.items():
        direction_name = key.replace("Classifications", "_")
        overall = re.sub(
            r"(north|south|east|west).*", "Overall_", key, flags=re.IGNORECASE
        )
        for value in values:
            direction_cols.add(direction_name + value)
            overall_columns.add(overall + value)
    overall_columns = list(overall_columns)
    direction_cols = list(direction_cols)
    direction_data_cols = sorted(overall_columns + direction_cols)

    # Creates a blank DataFrame and concats it to the original to avoid
    # iteratively growing the LC DataFrame
    blank_df = pd.DataFrame(
        np.zeros((len(lc_df), len(direction_data_cols))), columns=direction_data_cols
    )

    lc_df = pd.concat([lc_df, blank_df], axis=1)

    lc_df = _move_cols(lc_df, cols_to_move=direction_data_cols, ref_col=ref_col)
    lc_df = lc_df.apply(set_directions, axis=1)
    # Each overall column accumulated percentages from four directions.
    for column in overall_columns:
        lc_df[column] /= 4

    if not unpack:
        lc_df = lc_df.drop(columns=direction_cols)
    return lc_df, overall_columns, direction_cols
```
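The column naming used above can be sketched in isolation: each direction column spawns `lc_<Direction>_<Name>` columns, and all four directions share `lc_Overall_<Name>` columns. The classification names below are shortened placeholders:

```python
import re

directions = ["lc_NorthClassifications", "lc_EastClassifications"]
names = ["Trees", "Grass"]

direction_cols, overall_cols = set(), set()
for col in directions:
    prefix = col.replace("Classifications", "_")  # e.g. "lc_North_"
    # The direction word and everything after it collapse to "Overall_".
    overall = re.sub(r"(north|south|east|west).*", "Overall_", col, flags=re.IGNORECASE)
    for name in names:
        direction_cols.add(prefix + name)
        overall_cols.add(overall + name)
```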
```python
def photo_bit_flags(
    df,
    up="lc_UpwardPhotoUrl",
    down="lc_DownwardPhotoUrl",
    north="lc_NorthPhotoUrl",
    south="lc_SouthPhotoUrl",
    east="lc_EastPhotoUrl",
    west="lc_WestPhotoUrl",
    photo_count="lc_PhotoCount",
    rejected_count="lc_RejectedCount",
    pending_count="lc_PendingCount",
    empty_count="lc_EmptyCount",
    bit_binary="lc_PhotoBitBinary",
    bit_decimal="lc_PhotoBitDecimal",
    inplace=False,
):
    """
    Creates the following flags:

    - `PhotoCount`: The number of valid photos per record.
    - `RejectedCount`: The number of photos that were rejected per record.
    - `PendingCount`: The number of photos that are pending approval per record.
    - `EmptyCount`: The number of missing photo entries per record.
    - `PhotoBitBinary`: A string that represents the presence of a photo in the Up, Down, North, South, East, and West directions. For example, if the entry is `110100`, that indicates that there is a valid photo for the Up, Down, and South directions but no valid photos for the North, East, and West directions.
    - `PhotoBitDecimal`: The numerical representation of the lc_PhotoBitBinary string.

    Parameters
    ----------
    df : pd.DataFrame
        A land cover DataFrame
    up : str, default="lc_UpwardPhotoUrl"
        The name of the column in the land cover DataFrame that contains the url for the upward photo.
    down : str, default="lc_DownwardPhotoUrl"
        The name of the column in the land cover DataFrame that contains the url for the downward photo.
    north : str, default="lc_NorthPhotoUrl"
        The name of the column in the land cover DataFrame that contains the url for the north photo.
    south : str, default="lc_SouthPhotoUrl"
        The name of the column in the land cover DataFrame that contains the url for the south photo.
    east : str, default="lc_EastPhotoUrl"
        The name of the column in the land cover DataFrame that contains the url for the east photo.
    west : str, default="lc_WestPhotoUrl"
        The name of the column in the land cover DataFrame that contains the url for the west photo.
    photo_count : str, default="lc_PhotoCount"
        The name of the column that will store the PhotoCount flag.
    rejected_count : str, default="lc_RejectedCount"
        The name of the column that will store the RejectedCount flag.
    pending_count : str, default="lc_PendingCount"
        The name of the column that will store the PendingCount flag.
    empty_count : str, default="lc_EmptyCount"
        The name of the column that will store the EmptyCount flag.
    bit_binary : str, default="lc_PhotoBitBinary"
        The name of the column that will store the PhotoBitBinary flag.
    bit_decimal : str, default="lc_PhotoBitDecimal"
        The name of the column that will store the PhotoBitDecimal flag.
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no copy is made and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the photo bit flags. If `inplace=True` it returns None.
    """

    def pic_data(*args):
        pic_count = 0
        rejected_count = 0
        pending_count = 0
        empty_count = 0
        valid_photo_bit_mask = ""

        for entry in args:
            if not pd.isna(entry) and "http" in entry:
                valid_photo_bit_mask += "1"
                pic_count += entry.count("http")
            else:
                valid_photo_bit_mask += "0"
                if pd.isna(entry):
                    empty_count += 1
                else:
                    pending_count += entry.count("pending")
                    rejected_count += entry.count("rejected")
        return (
            pic_count,
            rejected_count,
            pending_count,
            empty_count,
            valid_photo_bit_mask,
            int(valid_photo_bit_mask, 2),
        )

    if not inplace:
        df = df.copy()

    get_photo_data = np.vectorize(pic_data)
    (
        df[photo_count],
        df[rejected_count],
        df[pending_count],
        df[empty_count],
        df[bit_binary],
        df[bit_decimal],
    ) = get_photo_data(
        df[up].to_numpy(),
        df[down].to_numpy(),
        df[north].to_numpy(),
        df[south].to_numpy(),
        df[east].to_numpy(),
        df[west].to_numpy(),
    )

    if not inplace:
        return df
```
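The mask construction inside `pic_data` can be reproduced standalone. The sample URL values below are made up; `None` stands in for a missing entry, and a direction counts as valid only when its field contains `"http"`:

```python
urls = ["https://a/photo.jpg", None, "rejected", "https://b/photo.jpg", None, "pending"]

mask = ""
for entry in urls:
    # "rejected"/"pending" entries and missing entries all yield a 0 bit.
    mask += "1" if entry is not None and "http" in entry else "0"

decimal = int(mask, 2)  # the PhotoBitDecimal equivalent
```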
```python
def classification_bit_flags(
    df,
    north="lc_NorthClassifications",
    south="lc_SouthClassifications",
    east="lc_EastClassifications",
    west="lc_WestClassifications",
    classification_count="lc_ClassificationCount",
    bit_binary="lc_ClassificationBitBinary",
    bit_decimal="lc_ClassificationBitDecimal",
    inplace=False,
):
    """
    Creates the following flags:

    - `ClassificationCount`: The number of classifications per record.
    - `BitBinary`: A string that represents the presence of a classification in the North, South, East, and West directions. For example, if the entry is `1101`, that indicates that there is a valid classification for the North, South, and West directions but no valid classification for the East direction.
    - `BitDecimal`: The numerical representation of the BitBinary string.

    Parameters
    ----------
    df : pd.DataFrame
        A land cover DataFrame
    north : str, default="lc_NorthClassifications"
        The name of the column in the land cover DataFrame that contains the north classification.
    south : str, default="lc_SouthClassifications"
        The name of the column in the land cover DataFrame that contains the south classification.
    east : str, default="lc_EastClassifications"
        The name of the column in the land cover DataFrame that contains the east classification.
    west : str, default="lc_WestClassifications"
        The name of the column in the land cover DataFrame that contains the west classification.
    classification_count : str, default="lc_ClassificationCount"
        The name of the column that will store the ClassificationCount flag.
    bit_binary : str, default="lc_ClassificationBitBinary"
        The name of the column that will store the BitBinary flag.
    bit_decimal : str, default="lc_ClassificationBitDecimal"
        The name of the column that will store the BitDecimal flag.
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no copy is made and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the classification bit flags. If `inplace=True` it returns None.
    """

    def classification_data(*args):
        classification_count = 0
        classification_bit_mask = ""
        for entry in args:
            # pd.isna covers both None and np.nan entries.
            if pd.isna(entry):
                classification_bit_mask += "0"
            else:
                classification_count += 1
                classification_bit_mask += "1"
        return (
            classification_count,
            classification_bit_mask,
            int(classification_bit_mask, 2),
        )

    if not inplace:
        df = df.copy()
    get_classification_data = np.vectorize(classification_data)

    (
        df[classification_count],
        df[bit_binary],
        df[bit_decimal],
    ) = get_classification_data(
        df[north],
        df[south],
        df[east],
        df[west],
    )
    if not inplace:
        return df
```
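A quick standalone walk-through of `classification_data`'s logic on made-up entries (`None` standing in for a missing classification):

```python
classifications = ["60% MUC 02 (b) [Trees]", None, "40% MUC 09 (a) [Grass]", None]

count, mask = 0, ""
for entry in classifications:
    if entry is None:
        mask += "0"          # no classification for this direction
    else:
        count += 1
        mask += "1"

decimal = int(mask, 2)       # the ClassificationBitDecimal equivalent
```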
```python
def completion_scores(
    df,
    photo_bit_binary="lc_PhotoBitBinary",
    classification_binary="lc_ClassificationBitBinary",
    sub_completeness="lc_SubCompletenessScore",
    completeness="lc_CumulativeCompletenessScore",
    inplace=False,
):
    """
    Adds the following completeness score flags:

    - `SubCompletenessScore`: The percentage of valid landcover classifications and photos that are filled out.
    - `CumulativeCompletenessScore`: The percentage of non-null values out of all the columns.

    Parameters
    ----------
    df : pd.DataFrame
        A landcover DataFrame with the [`PhotoBitBinary`](#photo_bit_flags) and [`ClassificationBitBinary`](#classification_bit_flags) flags.
    photo_bit_binary : str, default="lc_PhotoBitBinary"
        The name of the column that stores the PhotoBitBinary flag.
    classification_binary : str, default="lc_ClassificationBitBinary"
        The name of the column that stores the ClassificationBitBinary flag.
    sub_completeness : str, default="lc_SubCompletenessScore"
        The name of the column that will store the generated SubCompletenessScore flag.
    completeness : str, default="lc_CumulativeCompletenessScore"
        The name of the column that will store the generated CompletenessScore flag.
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no copy is made and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the completeness score flags. If `inplace=True` it returns None.
    """

    def sum_bit_mask(bit_mask="0"):
        total = 0.0
        for char in bit_mask:
            total += int(char)
        return total

    if not inplace:
        df = df.copy()

    scores = {}
    scores["sub_score"] = []
    # Cumulative Completion Score
    scores["cumulative_score"] = round(df.count(axis=1) / len(df.columns), 2)
    # Sub-Score
    for index in df.index:
        bit_mask = df[photo_bit_binary][index] + df[classification_binary][index]
        sub_score = round(sum_bit_mask(bit_mask=bit_mask), 2)
        sub_score /= len(bit_mask)
        scores["sub_score"].append(sub_score)

    df[sub_completeness], df[completeness] = (
        scores["sub_score"],
        scores["cumulative_score"],
    )

    if not inplace:
        return df
```
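The sub-completeness score is just the fraction of set bits across the two concatenated masks. The mask values below are examples:

```python
photo_bits = "110100"         # e.g. an lc_PhotoBitBinary value
classification_bits = "1101"  # e.g. an lc_ClassificationBitBinary value

combined = photo_bits + classification_bits
# Fraction of the ten photo/classification slots that are filled in.
sub_score = sum(int(ch) for ch in combined) / len(combined)
```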
```python
def apply_cleanup(lc_df, unpack=True):
    """Applies a full cleanup procedure to the landcover data.
    It follows these steps:

    - Removes Homogeneous Columns
    - Renames Latitude and Longitude Columns
    - Cleans the Column Naming
    - Unpacks landcover classifications
    - Rounds Columns
    - Standardizes Null Values

    This returns a copy.

    Parameters
    ----------
    lc_df : pd.DataFrame
        A DataFrame containing **raw** Landcover Data from the API.
    unpack : bool
        If True, the Landcover data will expand the classifications into separate columns (results in around 300 columns). If False, it will just unpack overall landcover.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the cleaned Landcover Data
    """
    lc_df = lc_df.copy()

    remove_homogenous_cols(lc_df, inplace=True)
    rename_latlon_cols(lc_df, inplace=True)
    cleanup_column_prefix(lc_df, inplace=True)
    lc_df, overall_cols, directional_cols = unpack_classifications(lc_df, unpack=unpack)

    round_cols(lc_df, inplace=True)
    standardize_null_vals(lc_df, inplace=True)
    return lc_df
```
```python
def add_flags(lc_df):
    """Adds the following flags to the landcover data:

    - Photo Bit Flags
    - Classification Bit Flags
    - Primary/Secondary Classification Flags
    - Completeness Score Flags

    Returns a copy of the DataFrame.

    Parameters
    ----------
    lc_df : pd.DataFrame
        A DataFrame containing cleaned up Landcover Data, ideally from the [apply_cleanup](#apply_cleanup) method.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the Land Cover flags.
    """
    lc_df = lc_df.copy()
    photo_bit_flags(lc_df, inplace=True)
    classification_bit_flags(lc_df, inplace=True)
    get_main_classifications(lc_df, inplace=True)
    completion_scores(lc_df, inplace=True)
    return lc_df
```
```python
def direction_frequency(lc_df, direction_list, bit_binary, entry_type):
    """
    Plots the amount of a variable of interest for each direction.

    Parameters
    ----------
    lc_df : pd.DataFrame
        The DataFrame containing Land Cover Data.
    direction_list : list of str
        The column names of the different variables of interest for each direction.
    bit_binary : str
        The Bit Binary Flag associated with the variable of interest.
    entry_type : str
        The variable of interest (e.g. Photos or Classifications)
    """
    direction_photos = pd.DataFrame()
    direction_photos["category"] = direction_list
    direction_counts = [0 for i in range(len(direction_photos))]
    # Tally set bits per direction across all records.
    for mask in lc_df[bit_binary]:
        for i in range(len(mask) - 1, -1, -1):
            direction_counts[i] += int(mask[i])
    direction_photos["count"] = [math.log10(value) for value in direction_counts]

    plt.figure(figsize=(15, 6))
    title = f"Land Cover -- {entry_type} Direction Frequency (Log Scale)"
    plt.title(title)
    plt.ylabel("Count (Log Scale)")
    sns.barplot(data=direction_photos, x="category", y="count", color="lightblue")
```
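The per-direction tally above can be checked on a few example masks; each bit position corresponds to one direction (e.g. Up, Down, North, South, East, West for photos):

```python
bit_masks = ["110100", "100000", "110110"]  # example PhotoBitBinary values

direction_counts = [0] * 6
for mask in bit_masks:
    for i, bit in enumerate(mask):
        direction_counts[i] += int(bit)
```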
```python
def diagnostic_plots(
    lc_df,
    up_url="lc_UpwardPhotoUrl",
    down_url="lc_DownwardPhotoUrl",
    north_url="lc_NorthPhotoUrl",
    south_url="lc_SouthPhotoUrl",
    east_url="lc_EastPhotoUrl",
    west_url="lc_WestPhotoUrl",
    photo_bit="lc_PhotoBitBinary",
    north_classification="lc_NorthClassifications",
    south_classification="lc_SouthClassifications",
    east_classification="lc_EastClassifications",
    west_classification="lc_WestClassifications",
    classification_bit="lc_ClassificationBitBinary",
):
    """
    Generates (but doesn't display) diagnostic plots to gain insight into the current data.

    Plots:

    - Valid Photo Count Distribution
    - Photo Distribution by direction
    - Classification Distribution by direction
    - Photo Status Distribution
    - Completeness Score Distribution
    - Subcompleteness Score Distribution

    Parameters
    ----------
    lc_df : pd.DataFrame
        The DataFrame containing Flagged and Cleaned Land Cover Data.
    """
    plot_freq_bar(
        lc_df, "Land Cover", "lc_PhotoCount", "Valid Photo Count", log_scale=True
    )
    direction_frequency(
        lc_df,
        [up_url, down_url, north_url, south_url, east_url, west_url],
        photo_bit,
        "Photo",
    )
    direction_frequency(
        lc_df,
        [
            north_classification,
            south_classification,
            east_classification,
            west_classification,
        ],
        classification_bit,
        "Classification",
    )
    multiple_bar_graph(
        lc_df,
        "Land Cover",
        ["lc_PhotoCount", "lc_RejectedCount", "lc_EmptyCount"],
        "Photo Summary",
        log_scale=True,
    )

    completeness_histogram(
        lc_df, "Land Cover", "lc_CumulativeCompletenessScore", "Cumulative Completeness"
    )
    completeness_histogram(
        lc_df, "Land Cover", "lc_SubCompletenessScore", "Sub Completeness"
    )
```
Generates (but doesn't display) diagnostic plots to gain insight into the current data.
Plots:
- Valid Photo Count Distribution
- Photo Distribution by direction
- Classification Distribution by direction
- Photo Status Distribution
- Completeness Score Distribution
- Subcompleteness Score Distribution
Parameters
- lc_df (pd.DataFrame): The DataFrame containing Flagged and Cleaned Land Cover Data.
```python
def qa_filter(
    lc_df,
    has_classification=False,
    has_photo=False,
    has_all_photos=False,
    has_all_classifications=False,
):
    """
    Can filter a cleaned and flagged land cover DataFrame based on the following criteria:
    - `Has Classification`: If the entry has at least one direction classified
    - `Has Photo`: If the entry has at least one photo taken
    - `Has All Photos`: If the entry has all photos taken (up, down, north, south, east, west)
    - `Has All Classifications`: If the entry has all directions classified

    Returns a copy of the DataFrame

    Parameters
    ----------
    has_classification : bool, default=False
        If True, only entries with at least one classification will be included.
    has_photo : bool, default=False
        If True, only entries with at least one photo will be included.
    has_all_photos : bool, default=False
        If True, only entries with all photos will be included.
    has_all_classifications : bool, default=False
        If True, only entries with all classifications will be included.

    Returns
    -------
    pd.DataFrame
        A DataFrame with the requested filters applied.
    """

    if has_classification and not has_all_classifications:
        lc_df = lc_df[lc_df["lc_ClassificationBitDecimal"] > 0]
    elif has_all_classifications:
        lc_df = lc_df[lc_df["lc_ClassificationBitDecimal"] == 15]
    if has_photo and not has_all_photos:
        lc_df = lc_df[lc_df["lc_PhotoBitDecimal"] > 0]
    elif has_all_photos:
        lc_df = lc_df[lc_df["lc_PhotoBitDecimal"] == 63]

    return lc_df
```
Can filter a cleaned and flagged land cover DataFrame based on the following criteria:
- `Has Classification`: If the entry has at least one direction classified
- `Has Photo`: If the entry has at least one photo taken
- `Has All Photos`: If the entry has all photos taken (up, down, north, south, east, west)
- `Has All Classifications`: If the entry has all directions classified
Returns a copy of the DataFrame
Parameters
- has_classification (bool, default=False): If True, only entries with at least one classification will be included.
- has_photo (bool, default=False): If True, only entries with at least one photo will be included.
- has_all_photos (bool, default=False): If True, only entries with all photos will be included.
- has_all_classifications (bool, default=False): If True, only entries with all classifications will be included.
Returns
- pd.DataFrame: A DataFrame with the requested filters applied.
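The bitmask thresholds follow directly from the direction counts: all four classification bits set gives decimal 15, all six photo bits set gives 63. A minimal sketch of the same selection logic on plain dicts (toy data with hypothetical bit-decimal values) shows how each flag narrows the result:

```python
# Toy entries mimicking the two bit-decimal columns qa_filter checks.
entries = [
    {"lc_ClassificationBitDecimal": 15, "lc_PhotoBitDecimal": 63},  # complete entry
    {"lc_ClassificationBitDecimal": 5,  "lc_PhotoBitDecimal": 0},   # partial classifications
    {"lc_ClassificationBitDecimal": 0,  "lc_PhotoBitDecimal": 0},   # empty entry
]


def qa_filter_sketch(rows, has_classification=False, has_photo=False,
                     has_all_photos=False, has_all_classifications=False):
    """Mirror of qa_filter's selection logic on a list of dicts."""
    if has_classification and not has_all_classifications:
        rows = [r for r in rows if r["lc_ClassificationBitDecimal"] > 0]
    elif has_all_classifications:
        # 15 == 0b1111: all four classification directions present
        rows = [r for r in rows if r["lc_ClassificationBitDecimal"] == 15]
    if has_photo and not has_all_photos:
        rows = [r for r in rows if r["lc_PhotoBitDecimal"] > 0]
    elif has_all_photos:
        # 63 == 0b111111: all six photo directions present
        rows = [r for r in rows if r["lc_PhotoBitDecimal"] == 63]
    return rows
```

Note that `has_all_classifications` takes precedence over `has_classification` (and likewise for photos), since a complete entry trivially satisfies the weaker condition.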
```python
def get_main_classifications(
    lc_df,
    north_classification="lc_NorthClassifications",
    east_classification="lc_EastClassifications",
    south_classification="lc_SouthClassifications",
    west_classification="lc_WestClassifications",
    north_primary="lc_NorthPrimary",
    north_secondary="lc_NorthSecondary",
    east_primary="lc_EastPrimary",
    east_secondary="lc_EastSecondary",
    south_primary="lc_SouthPrimary",
    south_secondary="lc_SouthSecondary",
    west_primary="lc_WestPrimary",
    west_secondary="lc_WestSecondary",
    primary_classification="lc_PrimaryClassification",
    secondary_classification="lc_SecondaryClassification",
    primary_percentage="lc_PrimaryPercentage",
    secondary_percentage="lc_SecondaryPercentage",
    inplace=False,
):
    # Work on a copy unless the caller requests an in-place update.
    if not inplace:
        lc_df = lc_df.copy()
    # Apply the internal ranking helper element-wise across the four
    # directional classification columns; each call yields the per-direction
    # primary/secondary classifications plus the overall ones.
    vectorized_rank = np.vectorize(_rank_classifications)
    (
        lc_df[north_primary],
        lc_df[north_secondary],
        lc_df[east_primary],
        lc_df[east_secondary],
        lc_df[south_primary],
        lc_df[south_secondary],
        lc_df[west_primary],
        lc_df[west_secondary],
        lc_df[primary_classification],
        lc_df[secondary_classification],
        lc_df[primary_percentage],
        lc_df[secondary_percentage],
    ) = vectorized_rank(
        lc_df[north_classification].to_numpy(),
        lc_df[east_classification].to_numpy(),
        lc_df[south_classification].to_numpy(),
        lc_df[west_classification].to_numpy(),
    )

    if not inplace:
        return lc_df
```
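The heavy lifting happens in the internal `_rank_classifications` helper, whose body is not shown here. Based on the semicolon-separated "description plus percentage" format described in the unpacking procedure above, a hedged sketch of ranking a single direction's string might look like the following; the `'Description 60%'` entry format and the `rank_one_direction` name are assumptions for illustration, not the library's actual implementation.

```python
def rank_one_direction(classifications):
    """Return ((primary, pct), (secondary, pct)) for one direction's
    semicolon-separated classification string, ranked by percentage.
    Assumes entries look like 'Trees 60%; Grass 40%' -- the real
    separator and percent format are assumptions for this sketch."""
    ranked = []
    for part in classifications.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split the trailing percentage off the description.
        desc, pct = part.rsplit(" ", 1)
        ranked.append((desc.strip(), float(pct.rstrip("%"))))
    # Highest percentage first; primary is the top entry.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    primary = ranked[0] if ranked else (None, 0.0)
    secondary = ranked[1] if len(ranked) > 1 else (None, 0.0)
    return primary, secondary
```

Wrapping such a tuple-returning function in `np.vectorize`, as `get_main_classifications` does, turns each tuple position into its own output array, which is why the single call above can assign twelve columns at once.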