go_utils.lc

Unpacking the Landcover Classification Data

The classification data for each entry is condensed into a single string of entries separated by semicolons. This method identifies and parses the land cover classifications and percentages to create new columns. The columns are also reordered so that directional information is grouped together.

The end result is a DataFrame that contains a column for every unique land cover classification (per direction) and its respective percentage for each entry.

There are four main steps to this procedure:

  1. Identifying Land Cover Classifications for each Cardinal Direction: An internal method returns the unique descriptions (e.g. HerbaceousGrasslandTallGrass) listed in a column. This method is run for all 4 cardinal directions to obtain all unique classifications per direction.
  2. Creating empty columns for each Classification from each Cardinal Direction: Using the newly identified classifications, new columns are made for each unique classification. These columns are initialized to the default float64 value of 0.0. Initializing all the classification column values to 0.0 ensures that no empty values are set to -9999 by the round_cols(df) method (discussed in General Cleanup Procedures - Round Appropriate Columns). This step eases future numerical analysis.
  3. Grouping and Alphabetically Sorting Directional Column Information: To better organize the DataFrame, columns containing any of the following directional substrings: "downward", "upward", "west", "east", "north", "south" (case insensitive) are identified and alphabetically sorted. Then, given the column headers to move (direction_data_cols) and the column before the desired point of insertion, an internal method (move_cols) returns a reordered DataFrame in which all directional columns are grouped together. This greatly improves the Land Covers dataset's organization and accessibility.
  4. Adding Classification Percentages to their respective Land Cover Classification Columns: To fill in each classification column with its respective percentage, an internal method is applied to each row of the DataFrame. This method iterates through each directional classification column (e.g. "lc_EastClassifications") and sets each identified classification column to its respective percentage.

NOTE: After these procedures, the original directional classification columns (e.g. "lc_EastClassifications") are not dropped.
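The extraction in steps 1 and 4 relies on two regular expressions, the same ones used by `extract_classification_name` and `extract_classification_percentage` in the source below. Here is a minimal, self-contained sketch of how a raw entry is parsed; the two-part sample string is hypothetical, modeled on the single-entry example in the docstrings:

```python
import re

# A hypothetical raw classification entry: sub-entries are separated by ";",
# each with a leading percentage and a bracketed land cover description.
raw = (
    "60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]; "
    "40% MUC 01 (a) [Herbaceous, Grassland, Tall Grass]"
)


def classification_dict(info):
    """Map each land cover description to its percentage."""
    result = {}
    for entry in info.split(";"):
        # Text between the square brackets is the description.
        name = re.search(r"(?<=\[).*(?=\])", entry).group()
        # Everything before the "%" is the percentage.
        percent = float(re.search(r".*(?=%)", entry).group())
        result[name] = percent
    return result


print(classification_dict(raw))
# {'Trees, Closely Spaced, Deciduous - Broad Leaved': 60.0, 'Herbaceous, Grassland, Tall Grass': 40.0}
```

The camel-cased versions of these descriptions then become the per-direction column names, with the percentage as the cell value.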

import math
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from go_utils.cleanup import (
    camel_case,
    remove_homogenous_cols,
    rename_latlon_cols,
    replace_column_prefix,
    round_cols,
    standardize_null_vals,
)
from go_utils.plot import completeness_histogram, multiple_bar_graph, plot_freq_bar

__doc__ = """

## Unpacking the Landcover Classification Data
The classification data for each entry is condensed into a single string of entries separated by semicolons. [This method](#unpack_classifications) identifies and parses the land cover classifications and percentages to create new columns. The columns are also reordered so that directional information is grouped together.

The end result is a DataFrame that contains a column for every unique land cover classification (per direction) and its respective percentage for each entry.

There are four main steps to this procedure:
1. Identifying Land Cover Classifications for each Cardinal Direction: An internal method returns the unique descriptions (e.g. HerbaceousGrasslandTallGrass) listed in a column. This method is run for all 4 cardinal directions to obtain all unique classifications per direction.
2. Creating empty columns for each Classification from each Cardinal Direction: Using the newly identified classifications, new columns are made for each unique classification. These columns are initialized to the default float64 value of 0.0. Initializing all the classification column values to 0.0 ensures that no empty values are set to -9999 by the round_cols(df) method (discussed in General Cleanup Procedures - Round Appropriate Columns). This step eases future numerical analysis.
3. Grouping and Alphabetically Sorting Directional Column Information: To better organize the DataFrame, columns containing any of the following directional substrings: "downward", "upward", "west", "east", "north", "south" (case insensitive) are identified and alphabetically sorted. Then, given the column headers to move (direction_data_cols) and the column before the desired point of insertion, an internal method (move_cols) returns a reordered DataFrame in which all directional columns are grouped together. This greatly improves the Land Covers dataset's organization and accessibility.
4. Adding Classification Percentages to their respective Land Cover Classification Columns: To fill in each classification column with its respective percentage, an internal method is applied to each row of the DataFrame. This method iterates through each directional classification column (e.g. "lc_EastClassifications") and sets each identified classification column to its respective percentage.

NOTE: After these procedures, the original directional classification columns (e.g. "lc_EastClassifications") are not dropped.
"""

classifications = []


def cleanup_column_prefix(df, inplace=False):
    """Method for shortening raw landcover column names.

    Replaces the verbose `landcovers` prefix in the relevant columns with the shorter `lc_` prefix.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing raw landcover data.
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no copy is returned and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the cleaned up column prefixes. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()

    replace_column_prefix(df, "landcovers", "lc", inplace=True)

    if not inplace:
        return df


def extract_classification_name(entry):
    """
    Extracts the name (landcover description) of a singular landcover classification. For example, in the classification `"60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]"`, the `"Trees, Closely Spaced, Deciduous - Broad Leaved"` is extracted.

    Parameters
    ----------
    entry : str
        A single landcover classification.

    Returns
    -------
    str
        The landcover description of a classification.
    """

    return re.search(r"(?<=\[).*(?=\])", entry).group()


def extract_classification_percentage(entry):
    """
    Extracts the percentage of a singular landcover classification. For example, in the classification `"60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]"`, the `60` is extracted.

    Parameters
    ----------
    entry : str
        A single landcover classification.

    Returns
    -------
    float
        The percentage of a landcover classification.
    """

    return float(re.search(r".*(?=%)", entry).group())


def _extract_landcover_items(func, info):
    entries = info.split(";")
    return [func(entry) for entry in entries]


def extract_classifications(info):
    """Extracts the names/landcover descriptions (see [here](#extract_classification_name) for a clearer definition) of a landcover classification entry in the GLOBE Observer Dataset.

    Parameters
    ----------
    info : str
        A string representing a landcover classification entry in the GLOBE Observer Dataset.

    Returns
    -------
    list of str
        The different landcover classifications stored within the landcover entry.
    """
    return _extract_landcover_items(extract_classification_name, info)


def extract_percentages(info):
    """Extracts the percentages (see [here](#extract_classification_percentage) for a clearer definition) of a landcover classification entry in the GLOBE Observer Dataset.

    Parameters
    ----------
    info : str
        A string representing a landcover classification entry in the GLOBE Observer Dataset.

    Returns
    -------
    list of float
        The different landcover percentages stored within the landcover entry.
    """

    return _extract_landcover_items(extract_classification_percentage, info)


def extract_classification_dict(info):
    """Extracts the landcover descriptions and percentages of a landcover classification entry as a dictionary.

    Parameters
    ----------
    info : str
        A string representing a landcover classification entry in the GLOBE Observer Dataset.

    Returns
    -------
    dict of str, float
        The landcover descriptions and percentages stored as a dict in the form: `{"description" : percentage}`.
    """

    entries = info.split(";")
    return {
        extract_classification_name(entry): extract_classification_percentage(entry)
        for entry in entries
    }


def _get_classifications_for_direction(df, direction_col_name):
    list_of_land_types = []
    for info in df[direction_col_name]:
        # Note: Sometimes info = np.nan, a float -- in that case we do NOT parse/split
        if isinstance(info, str):
            for entry in extract_classifications(info):
                list_of_land_types.append(camel_case(entry, [" ", ",", "-", "/"]))
    return np.unique(list_of_land_types).tolist()


def _move_cols(df, cols_to_move=[], ref_col=""):
    col_names = df.columns.tolist()
    index_before_desired_loc = col_names.index(ref_col)

    cols_before_index = col_names[: index_before_desired_loc + 1]
    cols_at_index = cols_to_move

    cols_before_index = [i for i in cols_before_index if i not in cols_at_index]
    cols_after_index = [
        i for i in col_names if i not in cols_before_index + cols_at_index
    ]

    return df[cols_before_index + cols_at_index + cols_after_index]


def unpack_classifications(
    lc_df,
    north="lc_NorthClassifications",
    east="lc_EastClassifications",
    south="lc_SouthClassifications",
    west="lc_WestClassifications",
    ref_col="lc_pid",
    unpack=True,
):
    """
    Unpacks the classification data in the *raw* GLOBE Observer Landcover data. This method assumes that the columns have been renamed in accordance with the [column cleanup](#cleanup_column_prefix) method.

    This returns a copy of the DataFrame.

    See [here](#unpacking-the-landcover-classification-data) for more information.

    *Note:* The returned DataFrame will have around 250 columns.

    Parameters
    ----------
    lc_df : pd.DataFrame
        A DataFrame containing raw GLOBE Observer Landcover data that has had the column names simplified.
    north : str, default="lc_NorthClassifications"
        The name of the column which contains the North Classifications.
    east : str, default="lc_EastClassifications"
        The name of the column which contains the East Classifications.
    south : str, default="lc_SouthClassifications"
        The name of the column which contains the South Classifications.
    west : str, default="lc_WestClassifications"
        The name of the column which contains the West Classifications.
    ref_col : str, default="lc_pid"
        The name of the column after which all of the expanded values will be placed. For example, if the columns were `[1, 2, 3, 4]` and you chose 3, the new columns would be `[1, 2, 3, (all classification columns), 4]`.
    unpack : bool, default=True
        True if you want to unpack the directional classifications, False if you only want overall classifications.

    Returns
    -------
    pd.DataFrame
        A DataFrame with the unpacked classification columns.
    list
        A list containing all the generated overall Land Cover column names (mainly for testing purposes).
    list
        A list containing all the generated directional Land Cover column names (mainly for testing purposes).
    """

    classifications = [north, east, south, west]

    def set_directions(row):
        for classification in classifications:
            if not pd.isnull(row[classification]):
                entries = row[classification].split(";")
                for entry in entries:
                    percent, name = (
                        extract_classification_percentage(entry),
                        extract_classification_name(entry),
                    )
                    name = camel_case(name, [" ", ",", "-", "/"])
                    direction = classification.replace("Classifications", "_")
                    overall = re.sub(
                        r"(north|south|east|west).*",
                        "Overall_",
                        direction,
                        flags=re.IGNORECASE,
                    )
                    row[f"{direction}{name.strip()}"] = percent
                    row[f"{overall}{name.strip()}"] += percent
        return row

    land_type_columns_to_add = {
        classification: _get_classifications_for_direction(lc_df, classification)
        for classification in classifications
    }
    overall_columns = set()
    direction_cols = set()
    for key, values in land_type_columns_to_add.items():
        direction_name = key.replace("Classifications", "_")
        overall = re.sub(
            r"(north|south|east|west).*", "Overall_", key, flags=re.IGNORECASE
        )
        for value in values:
            direction_cols.add(direction_name + value)
            overall_columns.add(overall + value)
    overall_columns = list(overall_columns)
    direction_cols = list(direction_cols)
    direction_data_cols = sorted(overall_columns + direction_cols)

    # Creates a blank DataFrame and concats it to the original to avoid iteratively growing the LC DataFrame
    blank_df = pd.DataFrame(
        np.zeros((len(lc_df), len(direction_data_cols))), columns=direction_data_cols
    )

    lc_df = pd.concat([lc_df, blank_df], axis=1)

    lc_df = _move_cols(lc_df, cols_to_move=direction_data_cols, ref_col=ref_col)
    lc_df = lc_df.apply(set_directions, axis=1)
    for column in overall_columns:
        lc_df[column] /= 4

    if not unpack:
        lc_df = lc_df.drop(columns=direction_cols)
    return lc_df, overall_columns, direction_cols


def photo_bit_flags(
    df,
    up="lc_UpwardPhotoUrl",
    down="lc_DownwardPhotoUrl",
    north="lc_NorthPhotoUrl",
    south="lc_SouthPhotoUrl",
    east="lc_EastPhotoUrl",
    west="lc_WestPhotoUrl",
    photo_count="lc_PhotoCount",
    rejected_count="lc_RejectedCount",
    pending_count="lc_PendingCount",
    empty_count="lc_EmptyCount",
    bit_binary="lc_PhotoBitBinary",
    bit_decimal="lc_PhotoBitDecimal",
    inplace=False,
):
    """
    Creates the following flags:
    - `PhotoCount`: The number of valid photos per record.
    - `RejectedCount`: The number of photos that were rejected per record.
    - `PendingCount`: The number of photos that are pending approval per record.
    - `EmptyCount`: The number of missing photo entries per record.
    - `PhotoBitBinary`: A string that represents the presence of a photo in the Up, Down, North, South, East, and West directions. For example, if the entry is `110100`, that indicates that there is a valid photo for the Up, Down, and South directions but no valid photos for the North, East, and West directions.
    - `PhotoBitDecimal`: The numerical representation of the PhotoBitBinary string.

    Parameters
    ----------
    df : pd.DataFrame
        A land cover DataFrame.
    up : str, default="lc_UpwardPhotoUrl"
        The name of the column in the land cover DataFrame that contains the url for the upward photo.
    down : str, default="lc_DownwardPhotoUrl"
        The name of the column in the land cover DataFrame that contains the url for the downward photo.
    north : str, default="lc_NorthPhotoUrl"
        The name of the column in the land cover DataFrame that contains the url for the north photo.
    south : str, default="lc_SouthPhotoUrl"
        The name of the column in the land cover DataFrame that contains the url for the south photo.
    east : str, default="lc_EastPhotoUrl"
        The name of the column in the land cover DataFrame that contains the url for the east photo.
    west : str, default="lc_WestPhotoUrl"
        The name of the column in the land cover DataFrame that contains the url for the west photo.
    photo_count : str, default="lc_PhotoCount"
        The name of the column that will store the PhotoCount flag.
    rejected_count : str, default="lc_RejectedCount"
        The name of the column that will store the RejectedCount flag.
    pending_count : str, default="lc_PendingCount"
        The name of the column that will store the PendingCount flag.
    empty_count : str, default="lc_EmptyCount"
        The name of the column that will store the EmptyCount flag.
    bit_binary : str, default="lc_PhotoBitBinary"
        The name of the column that will store the PhotoBitBinary flag.
    bit_decimal : str, default="lc_PhotoBitDecimal"
        The name of the column that will store the PhotoBitDecimal flag.
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no copy is returned and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the photo bit flags. If `inplace=True` it returns None.
    """

    def pic_data(*args):
        pic_count = 0
        rejected_count = 0
        pending_count = 0
        empty_count = 0
        valid_photo_bit_mask = ""

        for entry in args:
            if not pd.isna(entry) and "http" in entry:
                valid_photo_bit_mask += "1"
                pic_count += entry.count("http")
            else:
                valid_photo_bit_mask += "0"
            if pd.isna(entry):
                empty_count += 1
            else:
                pending_count += entry.count("pending")
                rejected_count += entry.count("rejected")
        return (
            pic_count,
            rejected_count,
            pending_count,
            empty_count,
            valid_photo_bit_mask,
            int(valid_photo_bit_mask, 2),
        )

    if not inplace:
        df = df.copy()

    get_photo_data = np.vectorize(pic_data)
    (
        df[photo_count],
        df[rejected_count],
        df[pending_count],
        df[empty_count],
        df[bit_binary],
        df[bit_decimal],
    ) = get_photo_data(
        df[up].to_numpy(),
        df[down].to_numpy(),
        df[north].to_numpy(),
        df[south].to_numpy(),
        df[east].to_numpy(),
        df[west].to_numpy(),
    )

    if not inplace:
        return df


def classification_bit_flags(
    df,
    north="lc_NorthClassifications",
    south="lc_SouthClassifications",
    east="lc_EastClassifications",
    west="lc_WestClassifications",
    classification_count="lc_ClassificationCount",
    bit_binary="lc_ClassificationBitBinary",
    bit_decimal="lc_ClassificationBitDecimal",
    inplace=False,
):
    """
    Creates the following flags:
    - `ClassificationCount`: The number of classifications per record.
    - `ClassificationBitBinary`: A string that represents the presence of a classification in the North, South, East, and West directions. For example, if the entry is `1101`, that indicates that there are valid classifications for the North, South, and West directions but no valid classification for the East direction.
    - `ClassificationBitDecimal`: The numerical representation of the ClassificationBitBinary string.

    Parameters
    ----------
    df : pd.DataFrame
        A land cover DataFrame.
    north : str, default="lc_NorthClassifications"
        The name of the column in the land cover DataFrame that contains the north classification.
    south : str, default="lc_SouthClassifications"
        The name of the column in the land cover DataFrame that contains the south classification.
    east : str, default="lc_EastClassifications"
        The name of the column in the land cover DataFrame that contains the east classification.
    west : str, default="lc_WestClassifications"
        The name of the column in the land cover DataFrame that contains the west classification.
    classification_count : str, default="lc_ClassificationCount"
        The name of the column that will store the ClassificationCount flag.
    bit_binary : str, default="lc_ClassificationBitBinary"
        The name of the column that will store the ClassificationBitBinary flag.
    bit_decimal : str, default="lc_ClassificationBitDecimal"
        The name of the column that will store the ClassificationBitDecimal flag.
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no copy is returned and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the classification bit flags. If `inplace=True` it returns None.
    """

    def classification_data(*args):
        classification_count = 0
        classification_bit_mask = ""
        for entry in args:
            if pd.isna(entry):
                classification_bit_mask += "0"
            else:
                classification_count += 1
                classification_bit_mask += "1"
        return (
            classification_count,
            classification_bit_mask,
            int(classification_bit_mask, 2),
        )

    if not inplace:
        df = df.copy()
    get_classification_data = np.vectorize(classification_data)

    (
        df[classification_count],
        df[bit_binary],
        df[bit_decimal],
    ) = get_classification_data(
        df[north],
        df[south],
        df[east],
        df[west],
    )
    if not inplace:
        return df


def completion_scores(
    df,
    photo_bit_binary="lc_PhotoBitBinary",
    classification_binary="lc_ClassificationBitBinary",
    sub_completeness="lc_SubCompletenessScore",
    completeness="lc_CumulativeCompletenessScore",
    inplace=False,
):
    """
    Adds the following completeness score flags:
    - `SubCompletenessScore`: The percentage of valid landcover classifications and photos that are filled out.
    - `CumulativeCompletenessScore`: The percentage of non-null values out of all the columns.

    Parameters
    ----------
    df : pd.DataFrame
        A landcover DataFrame with the [`PhotoBitBinary`](#photo_bit_flags) and [`ClassificationBitBinary`](#classification_bit_flags) flags.
    photo_bit_binary : str, default="lc_PhotoBitBinary"
        The name of the column that stores the PhotoBitBinary flag.
    classification_binary : str, default="lc_ClassificationBitBinary"
        The name of the column that stores the ClassificationBitBinary flag.
    sub_completeness : str, default="lc_SubCompletenessScore"
        The name of the column that will store the generated SubCompletenessScore flag.
    completeness : str, default="lc_CumulativeCompletenessScore"
        The name of the column that will store the generated CumulativeCompletenessScore flag.
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no copy is returned and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the completeness score flags. If `inplace=True` it returns None.
    """

    def sum_bit_mask(bit_mask="0"):
        total = 0.0
        for char in bit_mask:
            total += int(char)
        return total

    if not inplace:
        df = df.copy()

    scores = {}
    scores["sub_score"] = []
    # Cumulative Completeness Score
    scores["cumulative_score"] = round(df.count(axis=1) / len(df.columns), 2)
    # Sub-Score
    for index in df.index:
        bit_mask = df[photo_bit_binary][index] + df[classification_binary][index]
        sub_score = round(sum_bit_mask(bit_mask=bit_mask), 2)
        sub_score /= len(bit_mask)
        scores["sub_score"].append(sub_score)

    df[sub_completeness], df[completeness] = (
        scores["sub_score"],
        scores["cumulative_score"],
    )

    if not inplace:
        return df


def apply_cleanup(lc_df, unpack=True):
    """Applies a full cleanup procedure to the landcover data.
    It performs the following steps:
    - Removes homogenous columns
    - Renames latitude and longitude columns
    - Cleans the column naming
    - Unpacks landcover classifications
    - Rounds columns
    - Standardizes null values

    This returns a copy of the DataFrame.

    Parameters
    ----------
    lc_df : pd.DataFrame
        A DataFrame containing **raw** Landcover Data from the API.
    unpack : bool
        If True, the Landcover data will expand the classifications into separate columns (results in around 300 columns). If False, it will just unpack overall landcover.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the cleaned Landcover Data.
    """
    lc_df = lc_df.copy()

    remove_homogenous_cols(lc_df, inplace=True)
    rename_latlon_cols(lc_df, inplace=True)
    cleanup_column_prefix(lc_df, inplace=True)
    lc_df, overall_cols, directional_cols = unpack_classifications(lc_df, unpack=unpack)

    round_cols(lc_df, inplace=True)
    standardize_null_vals(lc_df, inplace=True)
    return lc_df


def add_flags(lc_df):
    """Adds the following flags to the landcover data:
    - Photo Bit Flags
    - Classification Bit Flags
    - Completeness Score Flags

    Returns a copy of the DataFrame.

    Parameters
    ----------
    lc_df : pd.DataFrame
        A DataFrame containing cleaned up Landcover Data, ideally from the [apply_cleanup](#apply_cleanup) method.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the Land Cover flags.
    """
    lc_df = lc_df.copy()
    photo_bit_flags(lc_df, inplace=True)
    classification_bit_flags(lc_df, inplace=True)
    get_main_classifications(lc_df, inplace=True)
    completion_scores(lc_df, inplace=True)
    return lc_df


def direction_frequency(lc_df, direction_list, bit_binary, entry_type):
    """
    Plots the amount of a variable of interest for each direction.

    Parameters
    ----------
    lc_df : pd.DataFrame
        The DataFrame containing Land Cover Data.
    direction_list : list of str
        The column names of the different variables of interest for each direction.
    bit_binary : str
        The Bit Binary Flag associated with the variable of interest.
    entry_type : str
        The variable of interest (e.g. Photos or Classifications).
    """
    direction_photos = pd.DataFrame()
    direction_photos["category"] = direction_list
    direction_counts = [0 for i in range(len(direction_photos))]
    for mask in lc_df[bit_binary]:
        for i in range(len(mask) - 1, -1, -1):
            direction_counts[i] += int(mask[i])
    direction_photos["count"] = [math.log10(value) for value in direction_counts]

    plt.figure(figsize=(15, 6))
    title = f"Land Cover -- {entry_type} Direction Frequency (Log Scale)"
    plt.title(title)
    plt.ylabel("Count (Log Scale)")
    sns.barplot(data=direction_photos, x="category", y="count", color="lightblue")


def diagnostic_plots(
    lc_df,
    up_url="lc_UpwardPhotoUrl",
    down_url="lc_DownwardPhotoUrl",
    north_url="lc_NorthPhotoUrl",
    south_url="lc_SouthPhotoUrl",
    east_url="lc_EastPhotoUrl",
    west_url="lc_WestPhotoUrl",
    photo_bit="lc_PhotoBitBinary",
    north_classification="lc_NorthClassifications",
    south_classification="lc_SouthClassifications",
    east_classification="lc_EastClassifications",
    west_classification="lc_WestClassifications",
    classification_bit="lc_ClassificationBitBinary",
):
    """
    Generates (but doesn't display) diagnostic plots to gain insight into the current data.

    Plots:
    - Valid Photo Count Distribution
    - Photo Distribution by direction
    - Classification Distribution by direction
    - Photo Status Distribution
    - Completeness Score Distribution
    - Subcompleteness Score Distribution

    Parameters
    ----------
    lc_df : pd.DataFrame
        The DataFrame containing Flagged and Cleaned Land Cover Data.
    """
    plot_freq_bar(
        lc_df, "Land Cover", "lc_PhotoCount", "Valid Photo Count", log_scale=True
    )
    direction_frequency(
        lc_df,
        [up_url, down_url, north_url, south_url, east_url, west_url],
        photo_bit,
        "Photo",
    )
    direction_frequency(
        lc_df,
        [
            north_classification,
            south_classification,
            east_classification,
            west_classification,
        ],
        classification_bit,
        "Classification",
    )
    multiple_bar_graph(
        lc_df,
        "Land Cover",
        ["lc_PhotoCount", "lc_RejectedCount", "lc_EmptyCount"],
        "Photo Summary",
        log_scale=True,
    )

    completeness_histogram(
        lc_df, "Land Cover", "lc_CumulativeCompletenessScore", "Cumulative Completeness"
    )
    completeness_histogram(
        lc_df, "Land Cover", "lc_SubCompletenessScore", "Sub Completeness"
    )


703def qa_filter(
704    lc_df,
705    has_classification=False,
706    has_photo=False,
707    has_all_photos=False,
708    has_all_classifications=False,
709):
710    """
711    Can filter a cleaned and flagged mosquito habitat mapper DataFrame based on the following criteria:
712    - `Has Classification`: If the entry has atleast one direction classified
713    - `Has Photo` : If the entry has atleast one photo taken
714    - `Has All Photos` : If the entry has all photos taken (up, down, north, south, east, west)
715    - `Has All Classifications` : If the entry has all directions classified
716
717    Returns a copy of the DataFrame
718
719    Parameters
720    ----------
721    has_classification : bool, default=False
722        If True, only entries with at least one classification will be included.
723    has_photo : bool, default=False
724        If True, only entries with at least one photo will be included.
725    has_all_photos : bool, default=False
726        If True, only entries with all photos will be included.
727    has_all_classifications : bool, default=False
728        If True, only entries with all classifications will be included.
729
730    Returns
731    -------
732    pd.DataFrame
733        A copy of the DataFrame with the requested filters applied.
734    """
735
736    if has_classification and not has_all_classifications:
737        lc_df = lc_df[lc_df["lc_ClassificationBitDecimal"] > 0]
738    elif has_all_classifications:
739        lc_df = lc_df[lc_df["lc_ClassificationBitDecimal"] == 15]
740    if has_photo and not has_all_photos:
741        lc_df = lc_df[lc_df["lc_PhotoBitDecimal"] > 0]
742    elif has_all_photos:
743        lc_df = lc_df[lc_df["lc_PhotoBitDecimal"] == 63]
744
745    return lc_df
746
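The bitmask comparisons above can be illustrated on a toy frame. The values below are hypothetical stand-ins for the flag columns produced by the bit-flag helpers: 15 is `0b1111` (all four directions classified) and 63 is `0b111111` (all six photos present).

```python
import pandas as pd

# Toy flag values; 15 == 0b1111 (all classifications), 63 == 0b111111 (all photos).
toy = pd.DataFrame({
    "lc_ClassificationBitDecimal": [0, 9, 15],
    "lc_PhotoBitDecimal": [0, 5, 63],
})

# Mirrors qa_filter's masks: any classification vs. all photos present.
has_any_classification = toy[toy["lc_ClassificationBitDecimal"] > 0]
has_all_photos = toy[toy["lc_PhotoBitDecimal"] == 63]
```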
747
748def _accumulate_ties(classification_list):
749    classifications = list()
750    i = 0
751    while i < len(classification_list) - 1:
752        if classification_list[i][1] == classification_list[i + 1][1]:
753            classifications.append(classification_list[i][0])
754            classifications.append(classification_list[i + 1][0])
755            i += 1
756        else:
757            break
758
759    output = ", ".join(classifications)
760    if not output:
761        if len(classification_list) != 0:
762            output = classification_list[0][0]
763        else:
764            output = "NA"
765    # TODO replace w regex methods
766    return output, i + 1
767
768
769def _rank_direction(classification_dict, direction_classifications):
770    if pd.isna(direction_classifications):
771        return "NA", "NA"
772    classifications_list = []
773    classifications = direction_classifications.split(";")
774    for classification_data in classifications:
775        percent = extract_classification_percentage(classification_data)
776        classification = extract_classification_name(classification_data)
777        if classification in classification_dict:
778            classification_dict[classification] += percent
779        else:
780            classification_dict[classification] = percent
781        classifications_list.append((classification, percent))
782    classifications_list = sorted(
783        classifications_list, key=lambda x: x[1], reverse=True
784    )
785    if len(classifications_list) < 2:
786        return classifications_list[0][0], "NA"
787
788    primary_classification, i = _accumulate_ties(classifications_list)
789    secondary_classification, temp = _accumulate_ties(classifications_list[i:])
790
791    return primary_classification, secondary_classification
792
793
794def _rank_classifications(*args):
795    classification_dict = {}
796    rank_directions = [
797        classification
798        for arg in args
799        for classification in _rank_direction(classification_dict, arg)
800    ]
801    primary, secondary = ("NA", 0), ("NA", 0)
802    if classification_dict:
803        if len(classification_dict) < 2:
804            primary = (
805                list(classification_dict.keys())[0],
806                list(classification_dict.values())[0],
807            )
808        else:
809            sorted_classifications = sorted(
810                classification_dict.items(), key=lambda x: x[1], reverse=True
811            )
812            primary, i = _accumulate_ties(sorted_classifications)
813            primary = primary, sorted_classifications[0][1]
814            if i < len(sorted_classifications):
815                secondary, temp = _accumulate_ties(sorted_classifications[i:])
816                secondary = secondary, sorted_classifications[i][1]
817    return (
818        *rank_directions,
819        primary[0],
820        secondary[0],
821        primary[1] / len(args),
822        secondary[1] / len(args),
823    )
824
825
826def get_main_classifications(
827    lc_df,
828    north_classification="lc_NorthClassifications",
829    east_classification="lc_EastClassifications",
830    south_classification="lc_SouthClassifications",
831    west_classification="lc_WestClassifications",
832    north_primary="lc_NorthPrimary",
833    north_secondary="lc_NorthSecondary",
834    east_primary="lc_EastPrimary",
835    east_secondary="lc_EastSecondary",
836    south_primary="lc_SouthPrimary",
837    south_secondary="lc_SouthSecondary",
838    west_primary="lc_WestPrimary",
839    west_secondary="lc_WestSecondary",
840    primary_classification="lc_PrimaryClassification",
841    secondary_classification="lc_SecondaryClassification",
842    primary_percentage="lc_PrimaryPercentage",
843    secondary_percentage="lc_SecondaryPercentage",
844    inplace=False,
845):
846    if not inplace:
847        lc_df = lc_df.copy()
848    vectorized_rank = np.vectorize(_rank_classifications)
849    (
850        lc_df[north_primary],
851        lc_df[north_secondary],
852        lc_df[east_primary],
853        lc_df[east_secondary],
854        lc_df[south_primary],
855        lc_df[south_secondary],
856        lc_df[west_primary],
857        lc_df[west_secondary],
858        lc_df[primary_classification],
859        lc_df[secondary_classification],
860        lc_df[primary_percentage],
861        lc_df[secondary_percentage],
862    ) = vectorized_rank(
863        lc_df[north_classification].to_numpy(),
864        lc_df[east_classification].to_numpy(),
865        lc_df[south_classification].to_numpy(),
866        lc_df[west_classification].to_numpy(),
867    )
868
869    if not inplace:
870        return lc_df
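`get_main_classifications` relies on `np.vectorize` returning a tuple of arrays when the wrapped function returns a tuple, which is what lets all twelve output columns be assigned in a single call. A minimal sketch of that pattern, using a hypothetical `rank` stand-in rather than the real ranking logic:

```python
import numpy as np
import pandas as pd

def rank(cell):
    # Hypothetical stand-in: first listed classification and how many were listed.
    parts = cell.split(";")
    return parts[0], len(parts)

vectorized_rank = np.vectorize(rank)
df = pd.DataFrame({"lc_NorthClassifications": ["Trees;Grass", "Water"]})

# Tuple of arrays unpacks directly into multiple new columns.
df["primary"], df["count"] = vectorized_rank(df["lc_NorthClassifications"].to_numpy())
```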
def cleanup_column_prefix(df, inplace=False)
39def cleanup_column_prefix(df, inplace=False):
40    """Method for shortening raw landcover column names.
41
42    Replaces the verbose `landcovers` prefix in the affected column names with `lc_`.
43
44    Parameters
45    ----------
46    df : pd.DataFrame
47        The DataFrame containing raw landcover data. The object itself is modified only when `inplace=True`.
48    inplace : bool, default=False
49        Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.
50
51    Returns
52    -------
53    pd.DataFrame or None
54        A DataFrame with the cleaned up column prefixes. If `inplace=True` it returns None.
55    """
56
57    if not inplace:
58        df = df.copy()
59
60    replace_column_prefix(df, "landcovers", "lc", inplace=True)
61
62    if not inplace:
63        return df

Method for shortening raw landcover column names.

Replaces the verbose landcovers prefix in the affected column names with lc_.

Parameters
  • df (pd.DataFrame): The DataFrame containing raw landcover data. The object itself is modified only when inplace=True.
  • inplace (bool, default=False): Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.
Returns
  • pd.DataFrame or None: A DataFrame with the cleaned up column prefixes. If inplace=True it returns None.
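A minimal sketch of the prefix replacement using `DataFrame.rename` (`replace_column_prefix` is the library's own helper and may differ in detail):

```python
import pandas as pd

df = pd.DataFrame(columns=["landcoversMeasuredAt", "landcoversNorthClassifications"])

# Swap the leading "landcovers" prefix for the shorter "lc_".
df = df.rename(columns=lambda col: col.replace("landcovers", "lc_", 1))
```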
def extract_classification_name(entry)
66def extract_classification_name(entry):
67    """
68    Extracts the name (landcover description) of a singular landcover classification. For example in the classification of `"60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]"`, the `"Trees, Closely Spaced, Deciduous - Broad Leaved"` is extracted.
69
70    Parameters
71    ----------
72    entry : str
73        A single landcover classification.
74
75    Returns
76    -------
77    str
78        The Landcover description of a classification
79    """
80
81    return re.search(r"(?<=\[).*(?=\])", entry).group()

Extracts the name (landcover description) of a singular landcover classification. For example in the classification of "60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]", the "Trees, Closely Spaced, Deciduous - Broad Leaved" is extracted.

Parameters
  • entry (str): A single landcover classification.
Returns
  • str: The Landcover description of a classification
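The lookbehind/lookahead pair in the source grabs everything between the square brackets, as shown on the example entry from the docstring:

```python
import re

entry = "60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]"
# (?<=\[) requires a "[" before the match; (?=\]) requires a "]" after it.
name = re.search(r"(?<=\[).*(?=\])", entry).group()
```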
def extract_classification_percentage(entry)
84def extract_classification_percentage(entry):
85    """
86    Extracts the percentage of a singular landcover classification. For example in the classification of `"60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]"`, the `60` is extracted.
87
88    Parameters
89    ----------
90    entry : str
91        A single landcover classification.
92
93    Returns
94    -------
95    float
96        The percentage of a landcover classification
97    """
98
99    return float(re.search(".*(?=%)", entry).group())

Extracts the percentage of a singular landcover classification. For example in the classification of "60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]", the 60 is extracted.

Parameters
  • entry (str): A single landcover classification.
Returns
  • float: The percentage of a landcover classification
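The lookahead stops the match just before the percent sign, so only the leading number is captured:

```python
import re

entry = "60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved]"
# .*(?=%) matches everything up to (but not including) the "%".
percentage = float(re.search(r".*(?=%)", entry).group())
```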
def extract_classifications(info)
107def extract_classifications(info):
108    """Extracts the name/landcover description (see [here](#extract_classification_name) for a clearer definition) of a landcover classification entry in the GLOBE Observer Data.
109
110    Parameters
111    ----------
112    info : str
113        A string representing a landcover classification entry in the GLOBE Observer Dataset.
114
115    Returns
116    -------
117    list of str
118        The different landcover classifications stored within the landcover entry.
119    """
120    return _extract_landcover_items(extract_classification_name, info)

Extracts the name/landcover description (see here for a clearer definition) of a landcover classification entry in the GLOBE Observer Data.

Parameters
  • info (str): A string representing a landcover classification entry in the GLOBE Observer Dataset.
Returns
  • list of str: The different landcover classifications stored within the landcover entry.
def extract_percentages(info)
123def extract_percentages(info):
124    """Extracts the percentages (see [here](#extract_classification_percentage) for a clearer definition) of a landcover classification in the GLOBE Observer Datset.
125
126    Parameters
127    ----------
128    info : str
129        A string representing a landcover classification entry in the GLOBE Observer Dataset.
130
131    Returns
132    -------
133    list of float
134        The different landcover percentages stored within the landcover entry.
135    """
136
137    return _extract_landcover_items(extract_classification_percentage, info)

Extracts the percentages (see here for a clearer definition) of a landcover classification in the GLOBE Observer Dataset.

Parameters
  • info (str): A string representing a landcover classification entry in the GLOBE Observer Dataset.
Returns
  • list of float: The different landcover percentages stored within the landcover entry.
def extract_classification_dict(info)
140def extract_classification_dict(info):
141    """Extracts the landcover descriptions and percentages of a landcover classification entry as a dictionary.
142
143    Parameters
144    ----------
145    info : str
146        A string representing a landcover classification entry in the GLOBE Observer Dataset.
147
148    Returns
149    -------
150    dict of str, float
151        The landcover descriptions and percentages stored as a dict in the form: `{"description" : percentage}`.
152    """
153
154    entries = info.split(";")
155    return {
156        extract_classification_name(entry): extract_classification_percentage(entry)
157        for entry in entries
158    }

Extracts the landcover descriptions and percentages of a landcover classification entry as a dictionary.

Parameters
  • info (str): A string representing a landcover classification entry in the GLOBE Observer Dataset.
Returns
  • dict of str, float: The landcover descriptions and percentages stored as a dict in the form: {"description" : percentage}.
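Combining the two extraction regexes over a semicolon-split entry yields the dictionary; the two-classification string below is a hypothetical example entry:

```python
import re

# Hypothetical two-part classification entry, semicolon-separated.
info = (
    "60% MUC 02 (b) [Trees, Closely Spaced, Deciduous - Broad Leaved];"
    "40% MUC 91 (a) [Herbaceous Vegetation, Grassland, Tall Grass]"
)
result = {
    re.search(r"(?<=\[).*(?=\])", e).group(): float(re.search(r".*(?=%)", e).group())
    for e in info.split(";")
}
```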
def unpack_classifications( lc_df, north='lc_NorthClassifications', east='lc_EastClassifications', south='lc_SouthClassifications', west='lc_WestClassifications', ref_col='lc_pid', unpack=True)
188def unpack_classifications(
189    lc_df,
190    north="lc_NorthClassifications",
191    east="lc_EastClassifications",
192    south="lc_SouthClassifications",
193    west="lc_WestClassifications",
194    ref_col="lc_pid",
195    unpack=True,
196):
197    """
198    Unpacks the classification data in the *raw* GLOBE Observer Landcover data. This method assumes that the columns have been renamed in accordance with the [column cleanup](#cleanup_column_prefix) method.
199
200    This returns a copy of the dataframe.
201
202    See [here](#unpacking-the-landcover-classification-data) for more information.
203
204    *Note:* The returned DataFrame will have around 250 columns.
205
206    Parameters
207    ----------
208    lc_df : pd.DataFrame
209        A DataFrame containing Raw GLOBE Observer Landcover data that has had the column names simplified.
210    north: str, default="lc_NorthClassifications"
211        The name of the column which contains the North Classifications
212    east: str, default="lc_EastClassifications"
213        The name of the column which contains the East Classifications
214    south: str, default="lc_SouthClassifications"
215        The name of the column which contains the South Classifications
216    west: str, default="lc_WestClassifications"
217        The name of the column which contains the West Classifications
218    ref_col: str, default="lc_pid"
219        The name of the column which all of the expanded values will be placed after. For example, if the columns were `[1, 2, 3, 4]` and you chose 3, the new columns will now be `[1, 2, 3, (all classification columns), 4]`.
220    unpack: bool, default=True
221        True if you want to unpack the directional classifications, False if you only want overall classifications
222
223    Returns
224    -------
225    pd.DataFrame
226        A DataFrame with the unpacked classification columns.
227    list
228        A list containing all the generated overall Land Cover column names (mainly for testing purposes).
229    list
230        A list containing all the generated directional Land Cover column names (mainly for testing purposes).
231    """
232
233    classifications = [north, east, south, west]
234
235    def set_directions(row):
236        for classification in classifications:
237            if not pd.isnull(row[classification]):
238                entries = row[classification].split(";")
239                for entry in entries:
240                    percent, name = (
241                        extract_classification_percentage(entry),
242                        extract_classification_name(entry),
243                    )
244                    name = camel_case(name, [" ", ",", "-", "/"])
245                    classification = classification.replace("Classifications", "_")
246                    overall = re.sub(
247                        r"(north|south|east|west).*",
248                        "Overall_",
249                        classification,
250                        flags=re.IGNORECASE,
251                    )
252                    row[f"{classification}{name.strip()}"] = percent
253                    row[f"{overall}{name.strip()}"] += percent
254        return row
255
256    land_type_columns_to_add = {
257        classification: _get_classifications_for_direction(lc_df, classification)
258        for classification in classifications
259    }
260    overall_columns = set()
261    direction_cols = set()
262    for key, values in land_type_columns_to_add.items():
263        direction_name = key.replace("Classifications", "_")
264        overall = re.sub(
265            r"(north|south|east|west).*", "Overall_", key, flags=re.IGNORECASE
266        )
267        for value in values:
268            direction_cols.add(direction_name + value)
269            overall_columns.add(overall + value)
270    overall_columns = list(overall_columns)
271    direction_cols = list(direction_cols)
272    direction_data_cols = sorted(overall_columns + direction_cols)
273
274    # Creates a blank DataFrame and concats it to the original to avoid iteratively growing the LC DataFrame
275    blank_df = pd.DataFrame(
276        np.zeros((len(lc_df), len(direction_data_cols))), columns=direction_data_cols
277    )
278
279    lc_df = pd.concat([lc_df, blank_df], axis=1)
280
281    lc_df = _move_cols(lc_df, cols_to_move=direction_data_cols, ref_col=ref_col)
282    lc_df = lc_df.apply(set_directions, axis=1)
283    for column in overall_columns:
284        lc_df[column] /= 4
285
286    if not unpack:
287        lc_df = lc_df.drop(columns=direction_cols)
288    return lc_df, overall_columns, direction_cols

Unpacks the classification data in the raw GLOBE Observer Landcover data. This method assumes that the columns have been renamed in accordance with the column cleanup method.

This returns a copy of the dataframe.

See here for more information.

Note: The returned DataFrame will have around 250 columns.

Parameters
  • lc_df (pd.DataFrame): A DataFrame containing Raw GLOBE Observer Landcover data that has had the column names simplified.
  • north (str, default="lc_NorthClassifications"): The name of the column which contains the North Classifications
  • east (str, default="lc_EastClassifications"): The name of the column which contains the East Classifications
  • south (str, default="lc_SouthClassifications"): The name of the column which contains the South Classifications
  • west (str, default="lc_WestClassifications"): The name of the column which contains the West Classifications
  • ref_col (str, default="lc_pid"): The name of the column which all of the expanded values will be placed after. For example, if the columns were [1, 2, 3, 4] and you chose 3, the new columns will now be [1, 2, 3, (all classification columns), 4].
  • unpack (bool, default=True): True if you want to unpack the directional classifications, False if you only want overall classifications
Returns
  • pd.DataFrame: A DataFrame with the unpacked classification columns.
  • list: A list containing all the generated overall Land Cover column names (mainly for testing purposes).
  • list: A list containing all the generated directional Land Cover column names (mainly for testing purposes).
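The directional and overall column prefixes are derived from the classification column names, and the new columns are created in a single blank DataFrame rather than one by one. A sketch of that derivation plus the concat step (the `"Trees"` suffix is a hypothetical generated classification name):

```python
import re

import numpy as np
import pandas as pd

col = "lc_NorthClassifications"
direction_prefix = col.replace("Classifications", "_")
overall_prefix = re.sub(r"(north|south|east|west).*", "Overall_", col, flags=re.IGNORECASE)

# Hypothetical generated column names for one classification.
new_cols = [direction_prefix + "Trees", overall_prefix + "Trees"]
lc_df = pd.DataFrame({"lc_pid": [101, 102]})

# Concat a zero-filled frame instead of growing lc_df column by column.
blank = pd.DataFrame(np.zeros((len(lc_df), len(new_cols))), columns=new_cols)
lc_df = pd.concat([lc_df, blank], axis=1)
```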
def photo_bit_flags( df, up='lc_UpwardPhotoUrl', down='lc_DownwardPhotoUrl', north='lc_NorthPhotoUrl', south='lc_SouthPhotoUrl', east='lc_EastPhotoUrl', west='lc_WestPhotoUrl', photo_count='lc_PhotoCount', rejected_count='lc_RejectedCount', pending_count='lc_PendingCount', empty_count='lc_EmptyCount', bit_binary='lc_PhotoBitBinary', bit_decimal='lc_PhotoBitDecimal', inplace=False)
291def photo_bit_flags(
292    df,
293    up="lc_UpwardPhotoUrl",
294    down="lc_DownwardPhotoUrl",
295    north="lc_NorthPhotoUrl",
296    south="lc_SouthPhotoUrl",
297    east="lc_EastPhotoUrl",
298    west="lc_WestPhotoUrl",
299    photo_count="lc_PhotoCount",
300    rejected_count="lc_RejectedCount",
301    pending_count="lc_PendingCount",
302    empty_count="lc_EmptyCount",
303    bit_binary="lc_PhotoBitBinary",
304    bit_decimal="lc_PhotoBitDecimal",
305    inplace=False,
306):
307    """
308    Creates the following flags:
309    - `PhotoCount`: The number of valid photos per record.
310    - `RejectedCount`: The number of photos that were rejected per record.
311    - `PendingCount`: The number of photos that are pending approval per record.
312    - `PhotoBitBinary`: A string that represents the presence of a photo in the Up, Down, North, South, East, and West directions. For example, if the entry is `110100`, that indicates that there is a valid photo for the Up, Down, and South Directions but no valid photos for the North, East, and West Directions.
313    - `PhotoBitDecimal`: The numerical representation of the lc_PhotoBitBinary string.
314
315    Parameters
316    ----------
317    df : pd.DataFrame
318        A land cover DataFrame
319    up : str, default="lc_UpwardPhotoUrl"
320        The name of the column in the land cover DataFrame that contains the url for the upwards photo.
321    down : str, default="lc_DownwardPhotoUrl"
322        The name of the column in the land cover DataFrame that contains the url for the downwards photo.
323    north : str, default="lc_NorthPhotoUrl"
324        The name of the column in the land cover DataFrame that contains the url for the north photo.
325    south : str, default="lc_SouthPhotoUrl"
326        The name of the column in the land cover DataFrame that contains the url for the south photo.
327    east : str, default="lc_EastPhotoUrl"
328        The name of the column in the land cover DataFrame that contains the url for the east photo.
329    west : str, default="lc_WestPhotoUrl"
330        The name of the column in the land cover DataFrame that contains the url for the west photo.
331    photo_count : str, default="lc_PhotoCount"
332        The name of the column that will be storing the PhotoCount flag.
333    rejected_count : str, default="lc_RejectedCount"
334        The name of the column that will be storing the RejectedCount flag.
335    pending_count : str, default="lc_PendingCount"
336        The name of the column that will be storing the PendingCount flag.
337    empty_count : str, default="lc_EmptyCount"
338        The name of the column that will be storing the EmptyCount flag.
339    bit_binary : str, default="lc_PhotoBitBinary"
340        The name of the column that will be storing the PhotoBitBinary flag.
341    bit_decimal : str, default="lc_PhotoBitDecimal"
342        The name of the column that will be storing the PhotoBitDecimal flag.
343    inplace : bool, default=False
344        Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.
345
346    Returns
347    -------
348    pd.DataFrame or None
349        A DataFrame with the photo bit flags. If `inplace=True` it returns None.
350    """
351
352    def pic_data(*args):
353        pic_count = 0
354        rejected_count = 0
355        pending_count = 0
356        empty_count = 0
357        valid_photo_bit_mask = ""
358
359        for entry in args:
360            if not pd.isna(entry) and "http" in entry:
361                valid_photo_bit_mask += "1"
362                pic_count += entry.count("http")
363            else:
364                valid_photo_bit_mask += "0"
365            if pd.isna(entry):
366                empty_count += 1
367            else:
368                pending_count += entry.count("pending")
369                rejected_count += entry.count("rejected")
370        return (
371            pic_count,
372            rejected_count,
373            pending_count,
374            empty_count,
375            valid_photo_bit_mask,
376            int(valid_photo_bit_mask, 2),
377        )
378
379    if not inplace:
380        df = df.copy()
381
382    get_photo_data = np.vectorize(pic_data)
383    (
384        df[photo_count],
385        df[rejected_count],
386        df[pending_count],
387        df[empty_count],
388        df[bit_binary],
389        df[bit_decimal],
390    ) = get_photo_data(
391        df[up].to_numpy(),
392        df[down].to_numpy(),
393        df[north].to_numpy(),
394        df[south].to_numpy(),
395        df[east].to_numpy(),
396        df[west].to_numpy(),
397    )
398
399    if not inplace:
400        return df

Creates the following flags:

  • PhotoCount: The number of valid photos per record.
  • RejectedCount: The number of photos that were rejected per record.
  • PendingCount: The number of photos that are pending approval per record.
  • PhotoBitBinary: A string that represents the presence of a photo in the Up, Down, North, South, East, and West directions. For example, if the entry is 110100, that indicates that there is a valid photo for the Up, Down, and South Directions but no valid photos for the North, East, and West Directions.
  • PhotoBitDecimal: The numerical representation of the lc_PhotoBitBinary string.
Parameters
  • df (pd.DataFrame): A land cover DataFrame
  • up (str, default="lc_UpwardPhotoUrl"): The name of the column in the land cover DataFrame that contains the url for the upwards photo.
  • down (str, default="lc_DownwardPhotoUrl"): The name of the column in the land cover DataFrame that contains the url for the downwards photo.
  • north (str, default="lc_NorthPhotoUrl"): The name of the column in the land cover DataFrame that contains the url for the north photo.
  • south (str, default="lc_SouthPhotoUrl"): The name of the column in the land cover DataFrame that contains the url for the south photo.
  • east (str, default="lc_EastPhotoUrl"): The name of the column in the land cover DataFrame that contains the url for the east photo.
  • west (str, default="lc_WestPhotoUrl"): The name of the column in the land cover DataFrame that contains the url for the west photo.
  • photo_count (str, default="lc_PhotoCount"): The name of the column that will be storing the PhotoCount flag.
  • rejected_count (str, default="lc_RejectedCount"): The name of the column that will be storing the RejectedCount flag.
  • pending_count (str, default="lc_PendingCount"): The name of the column that will be storing the PendingCount flag.
  • empty_count (str, default="lc_EmptyCount"): The name of the column that will be storing the EmptyCount flag.
  • bit_binary (str, default="lc_PhotoBitBinary"): The name of the column that will be storing the PhotoBitBinary flag.
  • bit_decimal (str, default="lc_PhotoBitDecimal"): The name of the column that will be storing the PhotoBitDecimal flag.
  • inplace (bool, default=False): Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.
Returns
  • pd.DataFrame or None: A DataFrame with the photo bit flags. If inplace=True it returns None.
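The bit mask is built most-significant-bit first in the order Up, Down, North, South, East, West. A condensed sketch of the per-row mask logic, with hypothetical URL values:

```python
import pandas as pd

# Toy URL cells in Up, Down, North, South, East, West order (hypothetical values).
urls = [
    "https://example.com/up.jpg", float("nan"), float("nan"),
    "https://example.com/south.jpg", float("nan"), float("nan"),
]

# "1" where a valid photo URL is present, "0" otherwise; then read as binary.
mask = "".join("1" if not pd.isna(u) and "http" in u else "0" for u in urls)
decimal = int(mask, 2)
```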
def classification_bit_flags( df, north='lc_NorthClassifications', south='lc_SouthClassifications', east='lc_EastClassifications', west='lc_WestClassifications', classification_count='lc_ClassificationCount', bit_binary='lc_ClassificationBitBinary', bit_decimal='lc_ClassificationBitDecimal', inplace=False)
403def classification_bit_flags(
404    df,
405    north="lc_NorthClassifications",
406    south="lc_SouthClassifications",
407    east="lc_EastClassifications",
408    west="lc_WestClassifications",
409    classification_count="lc_ClassificationCount",
410    bit_binary="lc_ClassificationBitBinary",
411    bit_decimal="lc_ClassificationBitDecimal",
412    inplace=False,
413):
414    """
415    Creates the following flags:
416    - `ClassificationCount`: The number of classifications per record.
417    - `BitBinary`: A string that represents the presence of a classification in the North, South, East, and West directions. For example, if the entry is `1101`, that indicates that there is a valid classification for the North, South, and West Directions but no valid classifications for the East Direction.
418    - `BitDecimal`: The numerical representation of the ClassificationBitBinary string.
419
420    Parameters
421    ----------
422    df : pd.DataFrame
423        A land cover DataFrame
424    north : str, default="lc_NorthClassifications"
425        The name of the column in the land cover DataFrame that contains the north classification.
426    south : str, default="lc_SouthClassifications"
427        The name of the column in the land cover DataFrame that contains the south classification.
428    east : str, default="lc_EastClassifications"
429        The name of the column in the land cover DataFrame that contains the east classification.
430    west : str, default="lc_WestClassifications"
431        The name of the column in the land cover DataFrame that contains the west classification.
432    classification_count : str, default="lc_ClassificationCount"
433        The name of the column that will store the ClassificationCount flag.
434    bit_binary : str, default="lc_ClassificationBitBinary"
435        The name of the column that will store the BitBinary flag.
436    bit_decimal : str, default="lc_ClassificationBitDecimal"
437        The name of the column that will store the BitDecimal flag.
438    inplace : bool, default=False
439        Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.
440
441    Returns
442    -------
443    pd.DataFrame or None
444        A DataFrame with the classification bit flags. If `inplace=True` it returns None.
445    """
446
447    def classification_data(*args):
448        classification_count = 0
449        classification_bit_mask = ""
450        for entry in args:
451            if pd.isna(entry):
452                classification_bit_mask += "0"
453            else:
454                classification_count += 1
455                classification_bit_mask += "1"
456        return (
457            classification_count,
458            classification_bit_mask,
459            int(classification_bit_mask, 2),
460        )
461
462    if not inplace:
463        df = df.copy()
464    get_classification_data = np.vectorize(classification_data)
465
466    (
467        df[classification_count],
468        df[bit_binary],
469        df[bit_decimal],
470    ) = get_classification_data(
471        df[north],
472        df[south],
473        df[east],
474        df[west],
475    )
476    if not inplace:
477        return df

Creates the following flags:

  • ClassificationCount: The number of classifications per record.
  • BitBinary: A string that represents the presence of a classification in the North, South, East, and West directions. For example, if the entry is 1101, that indicates that there is a valid classification for the North, South, and West Directions but no valid classifications for the East Direction.
  • BitDecimal: The numerical representation of the ClassificationBitBinary string.
Parameters
  • df (pd.DataFrame): A land cover DataFrame
  • north (str, default="lc_NorthClassifications"): The name of the column in the land cover DataFrame that contains the north classification.
  • south (str, default="lc_SouthClassifications"): The name of the column in the land cover DataFrame that contains the south classification.
  • east (str, default="lc_EastClassifications"): The name of the column in the land cover DataFrame that contains the east classification.
  • west (str, default="lc_WestClassifications"): The name of the column in the land cover DataFrame that contains the west classification.
  • classification_count (str, default="lc_ClassificationCount"): The name of the column that will store the ClassificationCount flag.
  • bit_binary (str, default="lc_ClassificationBitBinary"): The name of the column that will store the BitBinary flag.
  • bit_decimal (str, default="lc_ClassificationBitDecimal"): The name of the column that will store the BitDecimal flag.
  • inplace (bool, default=False): Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.
Returns
  • pd.DataFrame or None: A DataFrame with the classification bit flags. If inplace=True it returns None.
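Going the other way, a stored BitDecimal can be decoded back into directions (bit order North, South, East, West, matching the docstring's `1101` example):

```python
decimal = 13                   # 0b1101
mask = format(decimal, "04b")  # zero-pad back to four direction bits

# Keep the directions whose bit is set.
directions = [
    d for d, bit in zip(["North", "South", "East", "West"], mask) if bit == "1"
]
```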
def completion_scores( df, photo_bit_binary='lc_PhotoBitBinary', classification_binary='lc_ClassificationBitBinary', sub_completeness='lc_SubCompletenessScore', completeness='lc_CumulativeCompletenessScore', inplace=False)
def completion_scores(
    df,
    photo_bit_binary="lc_PhotoBitBinary",
    classification_binary="lc_ClassificationBitBinary",
    sub_completeness="lc_SubCompletenessScore",
    completeness="lc_CumulativeCompletenessScore",
    inplace=False,
):
    """
    Adds the following completeness score flags:
    - `SubCompletenessScore`: The percentage of valid landcover classifications and photos that are filled out.
    - `CumulativeCompletenessScore`: The percentage of non-null values out of all the columns.

    Parameters
    ----------
    df : pd.DataFrame
        A landcover DataFrame with the [`PhotoBitBinary`](#photo_bit_flags) and [`ClassificationBitBinary`](#classification_bit_flags) flags.
    photo_bit_binary : str, default="lc_PhotoBitBinary"
        The name of the column that stores the PhotoBitBinary flag.
    classification_binary : str, default="lc_ClassificationBitBinary"
        The name of the column that stores the ClassificationBitBinary flag.
    sub_completeness : str, default="lc_SubCompletenessScore"
        The name of the column that will store the generated SubCompletenessScore flag.
    completeness : str, default="lc_CumulativeCompletenessScore"
        The name of the column that will store the generated CumulativeCompletenessScore flag.
    inplace : bool, default=False
        Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the completeness score flags. If `inplace=True` it returns None.
    """

    def sum_bit_mask(bit_mask="0"):
        # Count the set bits in the mask (renamed to avoid shadowing the builtin sum).
        total = 0.0
        for char in bit_mask:
            total += int(char)
        return total

    if not inplace:
        df = df.copy()

    scores = {}
    scores["sub_score"] = []
    # Cumulative Completeness Score
    scores["cumulative_score"] = round(df.count(axis=1) / len(df.columns), 2)
    # Sub-Score
    for index in df.index:
        bit_mask = df[photo_bit_binary][index] + df[classification_binary][index]
        sub_score = round(sum_bit_mask(bit_mask=bit_mask), 2)
        sub_score /= len(bit_mask)
        scores["sub_score"].append(sub_score)

    df[sub_completeness], df[completeness] = (
        scores["sub_score"],
        scores["cumulative_score"],
    )

    if not inplace:
        return df

Adds the following completeness score flags:

  • SubCompletenessScore: The percentage of valid landcover classifications and photos that are filled out.
  • CumulativeCompletenessScore: The percentage of non-null values out of all the columns.
Parameters
  • df (pd.DataFrame): A landcover DataFrame with the PhotoBitBinary and ClassificationBitBinary flags.
  • photo_bit_binary (str, default="lc_PhotoBitBinary"): The name of the column that stores the PhotoBitBinary flag.
  • classification_binary (str, default="lc_ClassificationBitBinary"): The name of the column that stores the ClassificationBitBinary flag.
  • sub_completeness (str, default="lc_SubCompletenessScore"): The name of the column that will store the generated SubCompletenessScore flag.
  • completeness (str, default="lc_CumulativeCompletenessScore"): The name of the column that will store the generated CumulativeCompletenessScore flag.
  • inplace (bool, default=False): Whether to return a new DataFrame. If True, no DataFrame copy is returned and the operation is performed in place.
Returns
  • pd.DataFrame or None: A DataFrame with the completeness score flags. If inplace=True it returns None.
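The sub-completeness logic can be sketched in isolation: the photo and classification bit masks are concatenated, and the fraction of set bits is the score. `sub_completeness` below is a hypothetical helper mirroring the bit counting in `sum_bit_mask`, not the library's API.

```python
def sub_completeness(photo_bits, classification_bits):
    """Fraction of set bits across the concatenated photo and
    classification masks, rounded to two decimal places."""
    mask = photo_bits + classification_bits
    return round(sum(int(c) for c in mask) / len(mask), 2)

# All 6 photo bits set, 3 of 4 classification bits set -> 9/10
print(sub_completeness("111111", "1101"))  # 0.9
```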
def apply_cleanup(lc_df, unpack=True)
def apply_cleanup(lc_df, unpack=True):
    """Applies a full cleanup procedure to the landcover data.
    It performs the following steps:
    - Removes Homogenous Columns
    - Renames Latitude and Longitudes
    - Cleans the Column Naming
    - Unpacks landcover classifications
    - Rounds Columns
    - Standardizes Null Values

    This returns a copy of the DataFrame.

    Parameters
    ----------
    lc_df : pd.DataFrame
        A DataFrame containing **raw** Landcover Data from the API.
    unpack : bool
        If True, the Landcover data will expand the classifications into separate columns (results in around 300 columns). If False, it will just unpack overall landcover.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the cleaned Landcover Data
    """
    lc_df = lc_df.copy()

    remove_homogenous_cols(lc_df, inplace=True)
    rename_latlon_cols(lc_df, inplace=True)
    cleanup_column_prefix(lc_df, inplace=True)
    lc_df, overall_cols, directional_cols = unpack_classifications(lc_df, unpack=unpack)

    round_cols(lc_df, inplace=True)
    standardize_null_vals(lc_df, inplace=True)
    return lc_df

Applies a full cleanup procedure to the landcover data. It performs the following steps:

  • Removes Homogenous Columns
  • Renames Latitude and Longitudes
  • Cleans the Column Naming
  • Unpacks landcover classifications
  • Rounds Columns
  • Standardizes Null Values

This returns a copy of the DataFrame.

Parameters
  • lc_df (pd.DataFrame): A DataFrame containing raw Landcover Data from the API.
  • unpack (bool): If True, the Landcover data will expand the classifications into separate columns (results in around 300 columns). If False, it will just unpack overall landcover.
Returns
  • pd.DataFrame: A DataFrame containing the cleaned Landcover Data
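The cleanup helpers called by apply_cleanup share a common copy/in-place convention: copy unless inplace=True, mutate, and return the copy only when one was made. The sketch below illustrates that pattern; `example_cleanup` and its column-stripping step are illustrative stand-ins, not part of go_utils.

```python
import pandas as pd

def example_cleanup(df, inplace=False):
    # Copy unless the caller asked for an in-place operation.
    if not inplace:
        df = df.copy()
    # Stand-in cleanup step: strip whitespace from column names.
    df.columns = [c.strip() for c in df.columns]
    # Return the copy only when one was made; in-place calls return None.
    if not inplace:
        return df

raw = pd.DataFrame({" lc_Latitude ": [1.0]})
cleaned = example_cleanup(raw)
print(list(cleaned.columns))  # ['lc_Latitude']
print(list(raw.columns))      # [' lc_Latitude '] (original untouched)
```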
def add_flags(lc_df)
def add_flags(lc_df):
    """Adds the following flags to the landcover data:
    - Photo Bit Flags
    - Classification Bit Flags
    - Completeness Score Flags

    Returns a copy of the DataFrame

    Parameters
    ----------
    lc_df : pd.DataFrame
        A DataFrame containing cleaned up Landcover Data, ideally from the [apply_cleanup](#apply_cleanup) method.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the Land Cover flags.
    """
    lc_df = lc_df.copy()
    photo_bit_flags(lc_df, inplace=True)
    classification_bit_flags(lc_df, inplace=True)
    get_main_classifications(lc_df, inplace=True)
    completion_scores(lc_df, inplace=True)
    return lc_df

Adds the following flags to the landcover data:

  • Photo Bit Flags
  • Classification Bit Flags
  • Completeness Score Flags

Returns a copy of the DataFrame

Parameters
  • lc_df (pd.DataFrame): A DataFrame containing cleaned up Landcover Data, ideally from the apply_cleanup method.
Returns
  • pd.DataFrame: A DataFrame containing the Land Cover flags.
def direction_frequency(lc_df, direction_list, bit_binary, entry_type)
def direction_frequency(lc_df, direction_list, bit_binary, entry_type):
    """
    Plots the frequency of a variable of interest for each direction on a log scale.

    Parameters
    ----------
    lc_df : pd.DataFrame
        The DataFrame containing Land Cover Data.
    direction_list : list of str
        The column names of the different variables of interest for each direction.
    bit_binary : str
        The Bit Binary Flag associated with the variable of interest.
    entry_type : str
        The variable of interest (e.g. Photos or Classifications).
    """
    direction_photos = pd.DataFrame()
    direction_photos["category"] = direction_list
    direction_counts = [0 for i in range(len(direction_photos))]
    # Tally the set bits per direction across all entries.
    for mask in lc_df[bit_binary]:
        for i in range(len(mask) - 1, -1, -1):
            direction_counts[i] += int(mask[i])
    direction_photos["count"] = [math.log10(value) for value in direction_counts]

    plt.figure(figsize=(15, 6))
    title = f"Land Cover -- {entry_type} Direction Frequency (Log Scale)"
    plt.title(title)
    plt.ylabel("Count (Log Scale)")
    sns.barplot(data=direction_photos, x="category", y="count", color="lightblue")

Plots the frequency of a variable of interest for each direction on a log scale.

Parameters
  • lc_df (pd.DataFrame): The DataFrame containing Land Cover Data.
  • direction_list (list of str): The column names of the different variables of interest for each direction.
  • bit_binary (str): The Bit Binary Flag associated with the variable of interest.
  • entry_type (str): The variable of interest (e.g. Photos or Classifications)
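The per-direction tally that direction_frequency performs before plotting can be sketched on hypothetical bit masks: each position in a mask corresponds to one direction, and the counts are accumulated position by position.

```python
# Hypothetical 6-bit photo masks (up, down, north, south, east, west).
masks = ["110100", "111111", "100000"]

# Accumulate the set bits per position across all masks.
counts = [0] * 6
for mask in masks:
    for i, bit in enumerate(mask):
        counts[i] += int(bit)

print(counts)  # [3, 2, 1, 2, 1, 1]
```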
def diagnostic_plots( lc_df, up_url='lc_UpwardPhotoUrl', down_url='lc_DownwardPhotoUrl', north_url='lc_NorthPhotoUrl', south_url='lc_SouthPhotoUrl', east_url='lc_EastPhotoUrl', west_url='lc_WestPhotoUrl', photo_bit='lc_PhotoBitBinary', north_classification='lc_NorthClassifications', south_classification='lc_SouthClassifications', east_classification='lc_EastClassifications', west_classification='lc_WestClassifications', classification_bit='lc_ClassificationBitBinary')
def diagnostic_plots(
    lc_df,
    up_url="lc_UpwardPhotoUrl",
    down_url="lc_DownwardPhotoUrl",
    north_url="lc_NorthPhotoUrl",
    south_url="lc_SouthPhotoUrl",
    east_url="lc_EastPhotoUrl",
    west_url="lc_WestPhotoUrl",
    photo_bit="lc_PhotoBitBinary",
    north_classification="lc_NorthClassifications",
    south_classification="lc_SouthClassifications",
    east_classification="lc_EastClassifications",
    west_classification="lc_WestClassifications",
    classification_bit="lc_ClassificationBitBinary",
):
    """
    Generates (but doesn't display) diagnostic plots to gain insight into the current data.

    Plots:
    - Valid Photo Count Distribution
    - Photo Distribution by direction
    - Classification Distribution by direction
    - Photo Status Distribution
    - Completeness Score Distribution
    - Subcompleteness Score Distribution

    Parameters
    ----------
    lc_df : pd.DataFrame
        The DataFrame containing Flagged and Cleaned Land Cover Data.
    """
    plot_freq_bar(
        lc_df, "Land Cover", "lc_PhotoCount", "Valid Photo Count", log_scale=True
    )
    direction_frequency(
        lc_df,
        [up_url, down_url, north_url, south_url, east_url, west_url],
        photo_bit,
        "Photo",
    )
    direction_frequency(
        lc_df,
        [
            north_classification,
            south_classification,
            east_classification,
            west_classification,
        ],
        classification_bit,
        "Classification",
    )
    multiple_bar_graph(
        lc_df,
        "Land Cover",
        ["lc_PhotoCount", "lc_RejectedCount", "lc_EmptyCount"],
        "Photo Summary",
        log_scale=True,
    )

    completeness_histogram(
        lc_df, "Land Cover", "lc_CumulativeCompletenessScore", "Cumulative Completeness"
    )
    completeness_histogram(
        lc_df, "Land Cover", "lc_SubCompletenessScore", "Sub Completeness"
    )

Generates (but doesn't display) diagnostic plots to gain insight into the current data.

Plots:

  • Valid Photo Count Distribution
  • Photo Distribution by direction
  • Classification Distribution by direction
  • Photo Status Distribution
  • Completeness Score Distribution
  • Subcompleteness Score Distribution
Parameters
  • lc_df (pd.DataFrame): The DataFrame containing Flagged and Cleaned Land Cover Data.
def qa_filter( lc_df, has_classification=False, has_photo=False, has_all_photos=False, has_all_classifications=False)
def qa_filter(
    lc_df,
    has_classification=False,
    has_photo=False,
    has_all_photos=False,
    has_all_classifications=False,
):
    """
    Can filter a cleaned and flagged land cover DataFrame based on the following criteria:
    - `Has Classification`: If the entry has at least one direction classified
    - `Has Photo`: If the entry has at least one photo taken
    - `Has All Photos`: If the entry has all photos taken (up, down, north, south, east, west)
    - `Has All Classifications`: If the entry has all directions classified

    Returns a copy of the DataFrame

    Parameters
    ----------
    lc_df : pd.DataFrame
        A cleaned and flagged land cover DataFrame.
    has_classification : bool, default=False
        If True, only entries with at least one classification will be included.
    has_photo : bool, default=False
        If True, only entries with at least one photo will be included.
    has_all_photos : bool, default=False
        If True, only entries with all photos will be included.
    has_all_classifications : bool, default=False
        If True, only entries with all classifications will be included.

    Returns
    -------
    pd.DataFrame
        A DataFrame with the requested filters applied.
    """

    if has_classification and not has_all_classifications:
        lc_df = lc_df[lc_df["lc_ClassificationBitDecimal"] > 0]
    elif has_all_classifications:
        lc_df = lc_df[lc_df["lc_ClassificationBitDecimal"] == 15]
    if has_photo and not has_all_photos:
        lc_df = lc_df[lc_df["lc_PhotoBitDecimal"] > 0]
    elif has_all_photos:
        lc_df = lc_df[lc_df["lc_PhotoBitDecimal"] == 63]

    return lc_df

Can filter a cleaned and flagged land cover DataFrame based on the following criteria:

  • Has Classification: If the entry has at least one direction classified
  • Has Photo: If the entry has at least one photo taken
  • Has All Photos: If the entry has all photos taken (up, down, north, south, east, west)
  • Has All Classifications: If the entry has all directions classified

Returns a copy of the DataFrame

Parameters
  • lc_df (pd.DataFrame): A cleaned and flagged land cover DataFrame.
  • has_classification (bool, default=False): If True, only entries with at least one classification will be included.
  • has_photo (bool, default=False): If True, only entries with at least one photo will be included.
  • has_all_photos (bool, default=False): If True, only entries with all photos will be included.
  • has_all_classifications (bool, default=False): If True, only entries with all classifications will be included.
Returns
  • pd.DataFrame: A DataFrame with the requested filters applied.
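Because 0b1111 = 15 and 0b111111 = 63, the "all classifications" and "all photos" filters reduce to simple equality checks on the decimal flags. A small sketch with hypothetical flagged data:

```python
import pandas as pd

# Hypothetical flagged entries: 15 means all four classification bits
# are set; 63 means all six photo bits (up, down, N, S, E, W) are set.
df = pd.DataFrame({
    "lc_ClassificationBitDecimal": [15, 8, 0],
    "lc_PhotoBitDecimal": [63, 5, 0],
})

has_any_classification = df[df["lc_ClassificationBitDecimal"] > 0]
has_all_photos = df[df["lc_PhotoBitDecimal"] == 63]
print(len(has_any_classification), len(has_all_photos))  # 2 1
```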
def get_main_classifications( lc_df, north_classification='lc_NorthClassifications', east_classification='lc_EastClassifications', south_classification='lc_SouthClassifications', west_classification='lc_WestClassifications', north_primary='lc_NorthPrimary', north_secondary='lc_NorthSecondary', east_primary='lc_EastPrimary', east_secondary='lc_EastSecondary', south_primary='lc_SouthPrimary', south_secondary='lc_SouthSecondary', west_primary='lc_WestPrimary', west_secondary='lc_WestSecondary', primary_classification='lc_PrimaryClassification', secondary_classification='lc_SecondaryClassification', primary_percentage='lc_PrimaryPercentage', secondary_percentage='lc_SecondaryPercentage', inplace=False)
def get_main_classifications(
    lc_df,
    north_classification="lc_NorthClassifications",
    east_classification="lc_EastClassifications",
    south_classification="lc_SouthClassifications",
    west_classification="lc_WestClassifications",
    north_primary="lc_NorthPrimary",
    north_secondary="lc_NorthSecondary",
    east_primary="lc_EastPrimary",
    east_secondary="lc_EastSecondary",
    south_primary="lc_SouthPrimary",
    south_secondary="lc_SouthSecondary",
    west_primary="lc_WestPrimary",
    west_secondary="lc_WestSecondary",
    primary_classification="lc_PrimaryClassification",
    secondary_classification="lc_SecondaryClassification",
    primary_percentage="lc_PrimaryPercentage",
    secondary_percentage="lc_SecondaryPercentage",
    inplace=False,
):
    if not inplace:
        lc_df = lc_df.copy()
    vectorized_rank = np.vectorize(_rank_classifications)
    (
        lc_df[north_primary],
        lc_df[north_secondary],
        lc_df[east_primary],
        lc_df[east_secondary],
        lc_df[south_primary],
        lc_df[south_secondary],
        lc_df[west_primary],
        lc_df[west_secondary],
        lc_df[primary_classification],
        lc_df[secondary_classification],
        lc_df[primary_percentage],
        lc_df[secondary_percentage],
    ) = vectorized_rank(
        lc_df[north_classification].to_numpy(),
        lc_df[east_classification].to_numpy(),
        lc_df[south_classification].to_numpy(),
        lc_df[west_classification].to_numpy(),
    )

    if not inplace:
        return lc_df
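The _rank_classifications helper is internal and not shown here; the real one also aggregates across the four directions. The sketch below is a hypothetical stand-in that ranks a single semicolon-separated classification string, assuming an illustrative "60% Trees; 40% Grass" format.

```python
import re

def rank_classifications(entry):
    """Hypothetical sketch: pick the primary and secondary classification
    from a semicolon-separated string such as "60% Trees; 40% Grass"."""
    if not entry:
        return None, None
    pairs = []
    for part in entry.split(";"):
        # Parse "<percentage>% <classification>" segments.
        match = re.match(r"\s*(\d+)%\s*(.+)", part)
        if match:
            pairs.append((int(match.group(1)), match.group(2).strip()))
    # Highest percentage first.
    pairs.sort(reverse=True)
    primary = pairs[0][1] if pairs else None
    secondary = pairs[1][1] if len(pairs) > 1 else None
    return primary, secondary

print(rank_classifications("60% Trees; 40% Grass"))  # ('Trees', 'Grass')
```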