
Mosquito Specific Cleanup Procedures

Converting Larvae Data to Integers

Larvae Data is stored as a string in the raw GLOBE Observer dataset. To facilitate analysis, this method converts this data to numerical data.

It needs to account for 4 types of data:

  1. Regular Data: Converts it to a number
  2. Excessively large data (> 100, as it's hard to count more than that amount accurately): The count is capped at 101 and, to maintain the information from that entry, the LarvaeCountMagnitude flag records its order of magnitude
  3. Ranges (e.g. "25-50"): Chooses the lower bound and sets the LarvaeCountIsRangeFlag to true.
  4. Null Values: Sets null values to -9999
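
The four cases above can be sketched as a small parser. This is an illustrative sketch rather than the library's implementation: `parse_larvae_entry`, `NULL_SENTINEL`, and `CAP` are hypothetical names, and the sketch assumes that missing entries arrive as `None` or NaN and that counts above 100 are capped at 101.

```python
import math

NULL_SENTINEL = -9999  # value assigned to null entries
CAP = 101              # assumed cap for counts above 100


def parse_larvae_entry(entry):
    """Hypothetical helper: return (count, is_range) for one raw entry."""
    # Case 4: null values (None or NaN) become the sentinel
    if entry is None or (isinstance(entry, float) and math.isnan(entry)):
        return NULL_SENTINEL, 0
    text = str(entry).strip()
    # Case 3: ranges such as "25-50" keep only the lower bound
    if "-" in text[1:]:
        lower = text[: text.index("-", 1)]
        return float(lower), 1
    value = float(text)
    # Case 2: counts above 100 are capped
    if value > 100:
        return CAP, 0
    # Case 1: regular data is converted as-is
    return value, 0


for raw in ["12", "250", "25-50", None]:
    print(repr(raw), "->", parse_larvae_entry(raw))
```

The sketch leaves out the magnitude flag so that each of the four input types maps to exactly one branch.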

It generates the following flags:

  • LarvaeCountMagnitude: The integer flag contains the order of magnitude (0-4) by which the larvae count exceeds the maximum Larvae Count of 100. This is calculated by 1 + ⌊log10(num / 100)⌋, capped at 4. As a result:
    • 0: Corresponds to a Larvae Count ≤ 100
    • 1: Corresponds to a Larvae Count between 101 and 999
    • 2: Corresponds to a Larvae Count between 1000 and 9999
    • 3: Corresponds to a Larvae Count between 10,000 and 99,999
    • 4: Corresponds to a Larvae Count ≥ 100,000
  • LarvaeCountIsRange: Either a 1 which indicates the entry was a range (e.g. 25-50) or 0 which indicates the entry wasn't a range.
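
As a worked example of the magnitude flag, the sketch below computes 1 + floor(log10(num / 100)) and caps the result at 4. The function name `larvae_count_magnitude` is hypothetical; treating counts of 100 or fewer as magnitude 0 follows the list above.

```python
import math


def larvae_count_magnitude(num):
    """Hypothetical helper: order-of-magnitude flag for a larvae count."""
    # Counts that fit under the maximum of 100 need no magnitude flag
    if num <= 100:
        return 0
    # 1 + floor(log10(num / 100)), capped at the maximum flag value of 4
    return min(1 + math.floor(math.log10(num / 100)), 4)


for count in [50, 150, 5000, 50000, 250000]:
    print(count, "->", larvae_count_magnitude(count))
```

For instance, a count of 5000 gives 5000 / 100 = 50, whose base-10 logarithm lies between 1 and 2, so the flag is 1 + 1 = 2.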

Additionally, some entries held extremely large values (e.g. 1e+27) that Python was unable to process, so an initial preprocessing step sets those entries to 100000 (which corresponds to the maximum magnitude flag).

import math
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from go_utils.cleanup import (
    rename_latlon_cols,
    replace_column_prefix,
    round_cols,
    standardize_null_vals,
)
from go_utils.plot import completeness_histogram, plot_freq_bar, plot_int_distribution

__doc__ = r"""

## Mosquito Specific Cleanup Procedures

### Converting Larvae Data to Integers
Larvae Data is stored as a string in the raw GLOBE Observer dataset. To facilitate analysis, [this method](#larvae_to_num) converts this data to numerical data.

It needs to account for 4 types of data:
1. Regular Data: Converts it to a number
2. Excessively large data ($> 100$, as it's hard to count more than that amount accurately): The count is capped at $101$ and, to maintain the information from that entry, the `LarvaeCountMagnitude` flag records its order of magnitude
3. Ranges (e.g. "25-50"): Chooses the lower bound and sets the `LarvaeCountIsRangeFlag` to true.
4. Null Values: Sets null values to $-9999$


It generates the following flags:
- `LarvaeCountMagnitude`: The integer flag contains the order of magnitude (0-4) by which the larvae count exceeds the maximum Larvae Count of 100. This is calculated by $1 + \lfloor \log_{10}{\frac{num}{100}} \rfloor$, capped at $4$. As a result:
    - `0`: Corresponds to a Larvae Count $\leq 100$
    - `1`: Corresponds to a Larvae Count between $101$ and $999$
    - `2`: Corresponds to a Larvae Count between $1000$ and $9999$
    - `3`: Corresponds to a Larvae Count between $10,000$ and $99,999$
    - `4`: Corresponds to a Larvae Count $\geq 100,000$
- `LarvaeCountIsRange`: Either a $1$ which indicates the entry was a range (e.g. 25-50) or $0$ which indicates the entry wasn't a range.

Additionally, some entries held extremely large values (e.g. `1e+27`) that Python was unable to process, so an initial preprocessing step sets those entries to 100000 (which corresponds to the maximum magnitude flag).
"""


def cleanup_column_prefix(df, inplace=False):
    """Method for shortening raw mosquito habitat mapper column names.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing raw mosquito habitat mapper data.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the cleaned up column prefixes. If `inplace=True` it returns None.
    """

    return replace_column_prefix(df, "mosquitohabitatmapper", "mhm", inplace=inplace)
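

def _entry_to_num(entry):
    # Returns a (count, magnitude, is_range) tuple for one raw larvae count entry
    try:
        if entry == "more than 100":
            return 101, 1, 1
        if pd.isna(entry):
            # Null entries use the -9999 sentinel
            return -9999, 0, 0
        elif float(entry) > 100:
            # Cap the count at 101 and record its order of magnitude (at most 4)
            return 101, min(math.floor(math.log10(float(entry) / 100)) + 1, 4), 0
        return float(entry), 0, 0
    except ValueError:
        # Ranges such as "25-50" keep only the lower bound
        return float(re.sub(r"-.*", "", entry)), 0, 1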


def larvae_to_num(
    mhm_df,
    larvae_count_col="mhm_LarvaeCount",
    magnitude="mhm_LarvaeCountMagnitude",
    range_flag="mhm_LarvaeCountIsRangeFlag",
    inplace=False,
):
    """Converts the Larvae Count of the Mosquito Habitat Mapper Dataset from being stored as a string to integers.

    See [here](#converting-larvae-data-to-integers) for more information.

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame of Mosquito Habitat Mapper data that needs the larvae counts to be set to numbers
    larvae_count_col : str, default="mhm_LarvaeCount"
        The name of the column storing the larvae count. **Note**: The columns will be output in the format: `prefix_ColumnName` where `prefix` is all the characters that precede the words `LarvaeCount` in the specified name.
    magnitude: str, default="mhm_LarvaeCountMagnitude"
        The name of the column which will store the generated LarvaeCountMagnitude output
    range_flag : str, default="mhm_LarvaeCountIsRangeFlag"
        The name of the column which will store the generated LarvaeCountIsRange flag
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the larvae count as integers. If `inplace=True` it returns None.
    """

    if not inplace:
        mhm_df = mhm_df.copy()
    # Preprocessing step to remove extremely erroneous values
    for i in mhm_df.index:
        count = mhm_df[larvae_count_col][i]
        if isinstance(count, str) and "e+" in count:
            mhm_df.at[i, larvae_count_col] = "100000"

    larvae_conversion = np.vectorize(_entry_to_num)
    (
        mhm_df[larvae_count_col],
        mhm_df[magnitude],
        mhm_df[range_flag],
    ) = larvae_conversion(mhm_df[larvae_count_col].to_numpy())

    if not inplace:
        return mhm_df


def has_genus_flag(df, genus_col="mhm_Genus", bit_col="mhm_HasGenus", inplace=False):
    """
    Creates a bit flag: `mhm_HasGenus` where 1 denotes a recorded Genus and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    genus_col : str, default="mhm_Genus"
        The name of the column in the mosquito habitat mapper DataFrame that contains the genus records.
    bit_col : str, default="mhm_HasGenus"
        The name of the column which will store the generated HasGenus flag
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the HasGenus flag. If `inplace=True` it returns None.
    """
    if not inplace:
        df = df.copy()
    df[bit_col] = (~pd.isna(df[genus_col].to_numpy())).astype(int)

    if not inplace:
        return df


def infectious_genus_flag(
    df, genus_col="mhm_Genus", bit_col="mhm_IsGenusOfInterest", inplace=False
):
    """
    Creates a bit flag: `mhm_IsGenusOfInterest` where 1 denotes a Genus of an infectious mosquito and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    genus_col : str, default="mhm_Genus"
        The name of the column in the mosquito habitat mapper DataFrame that contains the genus records.
    bit_col : str, default="mhm_IsGenusOfInterest"
        The name of the column which will store the generated IsGenusOfInterest flag
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the IsGenusOfInterest flag. If `inplace=True` it returns None.
    """
    if not inplace:
        df = df.copy()
    infectious_genus_flag = np.vectorize(
        lambda genus: genus in ["Aedes", "Anopheles", "Culex"]
    )
    df[bit_col] = infectious_genus_flag(df[genus_col].to_numpy()).astype(int)

    if not inplace:
        return df


def is_container_flag(
    df,
    watersource_col="mhm_WaterSourceType",
    bit_col="mhm_IsWaterSourceContainer",
    inplace=False,
):
    """
    Creates a bit flag: `mhm_IsWaterSourceContainer` where 1 denotes if a watersource is a container (e.g. ovitrap, pots, tires, etc.) and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_col : str, default="mhm_WaterSourceType"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource type records.
    bit_col : str, default="mhm_IsWaterSourceContainer"
        The name of the column which will store the generated IsWaterSourceContainer flag
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the IsContainer flag. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()

    mark_containers = np.vectorize(
        lambda container: not pd.isna(container) and "container" in container
    )
    df[bit_col] = mark_containers(df[watersource_col].to_numpy()).astype(int)

    if not inplace:
        return df


def has_watersource_flag(
    df, watersource_col="mhm_WaterSource", bit_col="mhm_HasWaterSource", inplace=False
):
    """
    Creates a bit flag: `mhm_HasWaterSource` where 1 denotes if there is a watersource and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_col : str, default="mhm_WaterSource"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource records.
    bit_col : str, default="mhm_HasWaterSource"
        The name of the column which will store the generated HasWaterSource flag
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the HasWaterSource flag. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()
    has_watersource = np.vectorize(lambda watersource: int(not pd.isna(watersource)))
    df[bit_col] = has_watersource(df[watersource_col].to_numpy())

    if not inplace:
        return df


def photo_bit_flags(
    df,
    watersource_photos="mhm_WaterSourcePhotoUrls",
    larvae_photos="mhm_LarvaFullBodyPhotoUrls",
    abdomen_photos="mhm_AbdomenCloseupPhotoUrls",
    photo_count="mhm_PhotoCount",
    rejected_count="mhm_RejectedCount",
    pending_count="mhm_PendingCount",
    photo_bit_binary="mhm_PhotoBitBinary",
    photo_bit_decimal="mhm_PhotoBitDecimal",
    inplace=False,
):
    """
    Creates the following flags:
    - `PhotoCount`: The number of valid photos per record.
    - `RejectedCount`: The number of photos that were rejected per record.
    - `PendingCount`: The number of photos that are pending approval per record.
    - `PhotoBitBinary`: A string that represents the presence of a photo in the order of watersource, larvae, and abdomen. For example, if the entry is `110`, that indicates that there is a water source photo and a larvae photo, but no abdomen photo.
    - `PhotoBitDecimal`: The numerical representation of the mhm_PhotoBitBinary string.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_photos : str, default="mhm_WaterSourcePhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource photo url records.
    larvae_photos : str, default="mhm_LarvaFullBodyPhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the larvae photo url records.
    abdomen_photos : str, default="mhm_AbdomenCloseupPhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the abdomen photo url records.
    photo_count : str, default="mhm_PhotoCount"
        The name of the column that will store the PhotoCount flag.
    rejected_count : str, default="mhm_RejectedCount"
        The name of the column that will store the RejectedCount flag.
    pending_count : str, default="mhm_PendingCount"
        The name of the column that will store the PendingCount flag.
    photo_bit_binary : str, default="mhm_PhotoBitBinary"
        The name of the column that will store the PhotoBitBinary flag.
    photo_bit_decimal : str, default="mhm_PhotoBitDecimal"
        The name of the column that will store the PhotoBitDecimal flag.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the photo flags. If `inplace=True` it returns None.
    """

    def pic_data(*args):
        pic_count = 0
        rejected_count = 0
        pending_count = 0
        valid_photo_bit_mask = ""

        # For each URL string: append "1" to the bit mask if it contains any
        # "http" and "0" otherwise (including null fields). Also tally the
        # photo, pending, and rejected counts.
        for url_string in args:
            if not pd.isna(url_string):
                if "http" not in url_string:
                    valid_photo_bit_mask += "0"
                else:
                    valid_photo_bit_mask += "1"

                pic_count += url_string.count("http")
                pending_count += url_string.count("pending")
                rejected_count += url_string.count("rejected")
            else:
                valid_photo_bit_mask += "0"

        return (
            pic_count,
            rejected_count,
            pending_count,
            valid_photo_bit_mask,
            int(valid_photo_bit_mask, 2),
        )

    if not inplace:
        df = df.copy()

    get_photo_data = np.vectorize(pic_data)
    (
        df[photo_count],
        df[rejected_count],
        df[pending_count],
        df[photo_bit_binary],
        df[photo_bit_decimal],
    ) = get_photo_data(
        df[watersource_photos].to_numpy(),
        df[larvae_photos].to_numpy(),
        df[abdomen_photos].to_numpy(),
    )

    if not inplace:
        return df


def completion_score_flag(
    df,
    photo_bit_binary="mhm_PhotoBitBinary",
    has_genus="mhm_HasGenus",
    sub_completeness="mhm_SubCompletenessScore",
    completeness="mhm_CumulativeCompletenessScore",
    inplace=False,
):
    """
    Adds the following completeness score flags:
    - `SubCompletenessScore`: The percentage of the watersource photos, larvae photos, abdomen photos, and genus columns that are filled out.
    - `CumulativeCompletenessScore`: The percentage of non null values out of all the columns.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame with the [`PhotoBitBinary`](#photo_bit_flags) and [`HasGenus`](#has_genus_flag) flags.
    photo_bit_binary: str, default="mhm_PhotoBitBinary"
        The name of the column in the mosquito habitat mapper DataFrame that contains the PhotoBitBinary flag.
    has_genus : str, default="mhm_HasGenus"
        The name of the column in the mosquito habitat mapper DataFrame that contains the HasGenus flag.
    sub_completeness : str, default="mhm_SubCompletenessScore"
        The name of the column in the mosquito habitat mapper DataFrame that will contain the generated SubCompletenessScore flag.
    completeness : str, default="mhm_CumulativeCompletenessScore"
        The name of the column in the mosquito habitat mapper DataFrame that will contain the generated CumulativeCompletenessScore flag.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no DataFrame copy is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with completion score flags. If `inplace=True` it returns None.
    """

    def sum_bit_mask(bit_mask="0"):
        # Counts the set bits in a binary string such as "110"
        total = 0.0
        for char in bit_mask:
            total += int(char)
        return total

    if not inplace:
        df = df.copy()

    scores = {}
    scores["sub_score"] = []
    # Cumulative Completion Score
    scores["cumulative_score"] = round(df.count(axis=1) / len(df.columns), 2)
    # Sub-Score
    for index in df.index:
        bit_mask = df[photo_bit_binary][index]
        sub_score = df[has_genus][index] + sum_bit_mask(bit_mask=bit_mask)
        sub_score /= 4.0
        scores["sub_score"].append(sub_score)

    df[sub_completeness], df[completeness] = (
        scores["sub_score"],
        scores["cumulative_score"],
    )

    if not inplace:
        return df


def apply_cleanup(mhm_df):
    """Applies a full cleanup procedure to the mosquito habitat mapper data. Only returns a copy.
    It follows the following steps:
    - Removes Homogenous Columns
    - Renames Latitude and Longitudes
    - Cleans the Column Naming
    - Converts Larvae Count to Numbers
    - Rounds Columns
    - Standardizes Null Values

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame containing **raw** Mosquito Habitat Mapper Data from the API.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the cleaned up Mosquito Habitat Mapper Data
    """
    mhm_df = mhm_df.copy()

    rename_latlon_cols(mhm_df, inplace=True)
    cleanup_column_prefix(mhm_df, inplace=True)
    larvae_to_num(mhm_df, inplace=True)
    round_cols(mhm_df, inplace=True)
    standardize_null_vals(mhm_df, inplace=True)
    return mhm_df


def add_flags(mhm_df):
    """Adds the following flags to the Mosquito Habitat Mapper Data:
    - Has Genus
    - Is Infectious Genus/Genus of Interest
    - Is Container
    - Has WaterSource
    - Photo Bit Flags
    - Completion Score Flag

    This returns a copy of the original DataFrame with the flags added onto it.

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame containing cleaned up Mosquito Habitat Mapper Data, ideally from the `apply_cleanup` method.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the flagged Mosquito Habitat Mapper Data
    """
    mhm_df = mhm_df.copy()
    has_genus_flag(mhm_df, inplace=True)
    infectious_genus_flag(mhm_df, inplace=True)
    is_container_flag(mhm_df, inplace=True)
    has_watersource_flag(mhm_df, inplace=True)
    photo_bit_flags(mhm_df, inplace=True)
    completion_score_flag(mhm_df, inplace=True)
    return mhm_df


def plot_valid_entries(df, bit_col, entry_type):
    """
    Plots the number of entries whose flag value is greater than 0 against the number of entries whose flag value is not (e.g. entries with photos vs entries without photos).

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing Mosquito Habitat Mapper Data with the flag column (e.g. the PhotoBitDecimal Flag).
    bit_col : str
        The name of the flag column to count.
    entry_type : str
        The label used for the flagged entries in the plot title and bar names.
    """
    plt.figure()
    num_valid = len(df[df[bit_col] > 0])
    plt.title(f"Entries with {entry_type} vs No {entry_type}")
    plt.ylabel("Number of Entries")
    plt.bar(entry_type, num_valid, color="#e34a33")
    plt.bar(f"No {entry_type}", len(df) - num_valid, color="#fdcc8a")


def photo_subjects(mhm_df):
    """
    Plots the amount of photos for each photo area (Larvae, Abdomen, Watersource)

    Parameters
    ----------
    mhm_df : pd.DataFrame
        The DataFrame containing Mosquito Habitat Mapper Data with the PhotoBitDecimal Flag.
    """

    total_dict = {"Larvae Photos": 0, "Abdomen Photos": 0, "Watersource Photos": 0}

    for number in mhm_df["mhm_PhotoBitDecimal"]:
        # Add 1 per set bit; adding the raw mask value (4 or 2) would inflate the totals
        total_dict["Watersource Photos"] += int(bool(number & 4))
        total_dict["Larvae Photos"] += int(bool(number & 2))
        total_dict["Abdomen Photos"] += int(bool(number & 1))

    # Convert the totals to a log scale, leaving empty categories at 0
    for key in total_dict.keys():
        if total_dict[key] != 0:
            total_dict[key] = math.log10(total_dict[key])
        else:
            total_dict[key] = 0
    plt.figure(figsize=(10, 5))
    plt.title("Mosquito Habitat Mapper - Photo Subject Frequencies (Log Scale)")
    plt.xlabel("Photo Type")
    plt.ylabel("Frequency (Log Scale)")
    plt.bar(total_dict.keys(), total_dict.values(), color="lightblue")


def diagnostic_plots(mhm_df):
    """
    Generates (but doesn't display) diagnostic plots to gain insight into the current data.

    Plots:
    - Larvae Count Distribution (where a negative entry denotes null data)
    - Photo Subject Distribution
    - Genus Type Frequencies
    - Number of entries with genus classifications vs without
    - Number of valid photos vs no photos
    - Completeness Score Distribution
    - Subcompleteness Score Distribution

    Parameters
    ----------
    mhm_df : pd.DataFrame
        The DataFrame containing Flagged and Cleaned Mosquito Habitat Mapper Data.
    """
    plot_int_distribution(mhm_df, "mhm_LarvaeCount", "Larvae Count")
    photo_subjects(mhm_df)
    plot_freq_bar(mhm_df, "Mosquito Habitat Mapper", "mhm_Genus", "Genus Types")
    plot_valid_entries(mhm_df, "mhm_HasGenus", "Genus Classifications")
    plot_valid_entries(mhm_df, "mhm_PhotoBitDecimal", "Valid Photos")
    completeness_histogram(
        mhm_df,
        "Mosquito Habitat Mapper",
        "mhm_CumulativeCompletenessScore",
        "Cumulative Completeness",
    )
    completeness_histogram(
        mhm_df,
        "Mosquito Habitat Mapper",
        "mhm_SubCompletenessScore",
        "Sub Completeness",
    )


def qa_filter(
    mhm_df,
    has_genus=False,
    min_larvae_count=-9999,
    has_photos=False,
    is_container=False,
):
    """
    Can filter a cleaned and flagged mosquito habitat mapper DataFrame based on the following criteria:
    - `Has Genus`: If the entry has an identified genus
    - `Min Larvae Count` : Minimum larvae count needed for an entry
    - `Has Photos` : If the entry contains valid photo entries
    - `Is Container` : If the entry's watersource was a container

    Returns a copy of the DataFrame

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A cleaned and flagged mosquito habitat mapper DataFrame.
    has_genus : bool, default=False
        If True, only entries with an identified genus will be returned.
    min_larvae_count : int, default=-9999
        Only entries with a larvae count greater than or equal to this parameter will be included.
    has_photos : bool, default=False
        If True, only entries with recorded photos will be returned
    is_container : bool, default=False
        If True, only entries with containers will be returned

    Returns
    -------
    pd.DataFrame
        A DataFrame of the applied filters.
    """

    mhm_df = mhm_df[mhm_df["mhm_LarvaeCount"] >= min_larvae_count]

    if has_genus:
        mhm_df = mhm_df[mhm_df["mhm_HasGenus"] == 1]
    if has_photos:
        mhm_df = mhm_df[mhm_df["mhm_PhotoBitDecimal"] > 0]
    if is_container:
        mhm_df = mhm_df[mhm_df["mhm_IsWaterSourceContainer"] == 1]

    return mhm_df

def cleanup_column_prefix(df, inplace=False)

Method for shortening raw mosquito habitat mapper column names.

  • df (pd.DataFrame): The DataFrame containing raw mosquito habitat mapper data.
  • inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned.
  • pd.DataFrame or None: A DataFrame with the cleaned up column prefixes. If inplace=True it returns None.
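
A minimal sketch of the idea, assuming raw column names begin with `mosquitohabitatmapper` and that the shortened names take the form `mhm_ColumnName` (as the default column names elsewhere in this module suggest). `shorten_prefix` and the sample columns are hypothetical; this is not the implementation of `replace_column_prefix`.

```python
import pandas as pd


def shorten_prefix(df, old="mosquitohabitatmapper", new="mhm"):
    """Hypothetical stand-in for the prefix replacement; returns a renamed copy."""

    def rename(col):
        # Only touch columns that start with the old prefix
        if col.startswith(old):
            return f"{new}_{col[len(old):]}"
        return col

    return df.rename(columns=rename)


# Made-up sample data: one prefixed column and one unrelated column
raw = pd.DataFrame({"mosquitohabitatmapperLarvaeCount": ["3"], "siteName": ["A"]})
print(list(shorten_prefix(raw).columns))
```

Columns without the prefix pass through unchanged, so the sketch is safe to run on a DataFrame that has already been shortened.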
def larvae_to_num(mhm_df, larvae_count_col='mhm_LarvaeCount', magnitude='mhm_LarvaeCountMagnitude', range_flag='mhm_LarvaeCountIsRangeFlag', inplace=False)

Converts the Larvae Count of the Mosquito Habitat Mapper Dataset from being stored as a string to integers.

See the "Converting Larvae Data to Integers" section above for more information.

  • mhm_df (pd.DataFrame): A DataFrame of Mosquito Habitat Mapper data that needs the larvae counts to be set to numbers
  • larvae_count_col (str, default="mhm_LarvaeCount"): The name of the column storing the larvae count. Note: The columns will be output in the format: prefix_ColumnName where prefix is all the characters that precede the words LarvaeCount in the specified name.
  • magnitude (str, default="mhm_LarvaeCountMagnitude"): The name of the column which will store the generated LarvaeCountMagnitude output
  • range_flag (str, default="mhm_LarvaeCountIsRangeFlag"): The name of the column which will store the generated LarvaeCountIsRange flag
  • inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned.
  • pd.DataFrame or None: A DataFrame with the larvae count as integers. If inplace=True it returns None.
def has_genus_flag(df, genus_col='mhm_Genus', bit_col='mhm_HasGenus', inplace=False)

Creates a bit flag: mhm_HasGenus where 1 denotes a recorded Genus and 0 denotes the contrary.

  • df (pd.DataFrame): A mosquito habitat mapper DataFrame
  • genus_col (str, default="mhm_Genus"): The name of the column in the mosquito habitat mapper DataFrame that contains the genus records.
  • bit_col (str, default="mhm_HasGenus"): The name of the column which will store the generated HasGenus flag
  • inplace (bool, default=False): Whether to perform the operation in place. If True, no DataFrame copy is returned.
  • pd.DataFrame: A DataFrame with the HasGenus flag. If inplace=True it returns None.
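
The flag can be illustrated on a small made-up DataFrame. This sketch uses the pandas `Series.notna` method as an equivalent way to mark recorded genera; the sample values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up sample data: the second record has no genus
df = pd.DataFrame({"mhm_Genus": ["Aedes", np.nan, "Culex"]})

# 1 where a genus is recorded, 0 where it is null
df["mhm_HasGenus"] = df["mhm_Genus"].notna().astype(int)
print(df["mhm_HasGenus"].tolist())  # [1, 0, 1]
```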
def infectious_genus_flag(df, genus_col='mhm_Genus', bit_col='mhm_IsGenusOfInterest', inplace=False)

Creates a bit flag: mhm_IsGenusOfInterest where 1 denotes the genus of an infectious mosquito (Aedes, Anopheles, or Culex) and 0 denotes the contrary.

Parameters

  • df (pd.DataFrame): A mosquito habitat mapper DataFrame
  • genus_col (str, default="mhm_Genus"): The name of the column in the mosquito habitat mapper DataFrame that contains the genus records.
  • bit_col (str, default="mhm_IsGenusOfInterest"): The name of the column which will store the generated IsGenusOfInterest flag
  • inplace (bool, default=False): Whether to perform the operation in place. If True, no copy of the DataFrame is made and None is returned.

Returns

  • pd.DataFrame: A DataFrame with the IsGenusOfInterest flag. If inplace=True it returns None.
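
The membership test behind this flag can be illustrated without pandas. The following is a minimal sketch, not the library's implementation: `GENERA_OF_INTEREST` and `genus_of_interest_bits` are hypothetical names introduced here, and the input is a plain list rather than a DataFrame column.

```python
# Genera that the flag treats as "of interest" (taken from the source above).
GENERA_OF_INTEREST = {"Aedes", "Anopheles", "Culex"}


def genus_of_interest_bits(genera):
    """Return 1 for each genus in GENERA_OF_INTEREST, else 0 (hypothetical helper)."""
    return [int(genus in GENERA_OF_INTEREST) for genus in genera]


# Missing values (None or NaN) are not members of the set, so they map to 0.
bits = genus_of_interest_bits(["Aedes", "Culiseta", None, float("nan"), "Culex"])
print(bits)  # [1, 0, 0, 0, 1]
```

The comparison is an exact, case-sensitive string match, so a differently capitalized genus name would be flagged as 0.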
def is_container_flag(df, watersource_col='mhm_WaterSourceType', bit_col='mhm_IsWaterSourceContainer', inplace=False)
def is_container_flag(
    df,
    watersource_col="mhm_WaterSourceType",
    bit_col="mhm_IsWaterSourceContainer",
    inplace=False,
):
    """
    Creates a bit flag: `mhm_IsWaterSourceContainer` where 1 denotes that the watersource is a container (e.g. ovitrap, pots, tires, etc.) and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_col : str, default="mhm_WaterSourceType"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource type records.
    bit_col : str, default="mhm_IsWaterSourceContainer"
        The name of the column which will store the generated IsWaterSourceContainer flag
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no copy of the DataFrame is made and None is returned.

    Returns
    -------
    pd.DataFrame
        A DataFrame with the IsWaterSourceContainer flag. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()

    # A record is a container if its (non-null) type string mentions "container".
    mark_containers = np.vectorize(
        lambda container: not pd.isna(container) and "container" in container
    )
    df[bit_col] = mark_containers(df[watersource_col].to_numpy()).astype(int)

    if not inplace:
        return df

Creates a bit flag: mhm_IsWaterSourceContainer where 1 denotes that the watersource is a container (e.g. ovitrap, pots, tires, etc.) and 0 denotes the contrary.

Parameters

  • df (pd.DataFrame): A mosquito habitat mapper DataFrame
  • watersource_col (str, default="mhm_WaterSourceType"): The name of the column in the mosquito habitat mapper DataFrame that contains the watersource type records.
  • bit_col (str, default="mhm_IsWaterSourceContainer"): The name of the column which will store the generated IsWaterSourceContainer flag
  • inplace (bool, default=False): Whether to perform the operation in place. If True, no copy of the DataFrame is made and None is returned.

Returns

  • pd.DataFrame: A DataFrame with the IsWaterSourceContainer flag. If inplace=True it returns None.
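
The container check is a substring test on the water source type string, with missing values treated as non-containers. A minimal sketch of that rule follows; `is_container_bit` is a hypothetical helper, and the sample category strings are made up for illustration and may not match the real dataset's values.

```python
def is_container_bit(watersource_type):
    """Return 1 if the water source type mentions "container", else 0 (hypothetical helper)."""
    # Missing values may arrive as None or as a float NaN; neither is a string.
    if not isinstance(watersource_type, str):
        return 0
    return int("container" in watersource_type)


# Illustrative values only; the real category names in the dataset may differ.
samples = ["container: artificial", "still: lake/pond/swamp", None, float("nan")]
container_bits = [is_container_bit(sample) for sample in samples]
print(container_bits)  # [1, 0, 0, 0]
```

As in the source listing, the match is case-sensitive, so a value such as "Container" would not be flagged.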
def has_watersource_flag(df, watersource_col='mhm_WaterSource', bit_col='mhm_HasWaterSource', inplace=False)
def has_watersource_flag(
    df, watersource_col="mhm_WaterSource", bit_col="mhm_HasWaterSource", inplace=False
):
    """
    Creates a bit flag: `mhm_HasWaterSource` where 1 denotes that a watersource is recorded and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_col : str, default="mhm_WaterSource"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource records.
    bit_col : str, default="mhm_HasWaterSource"
        The name of the column which will store the generated HasWaterSource flag
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no copy of the DataFrame is made and None is returned.

    Returns
    -------
    pd.DataFrame
        A DataFrame with the HasWaterSource flag. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()
    # 1 for any non-null watersource entry, 0 otherwise.
    has_watersource = np.vectorize(lambda watersource: int(not pd.isna(watersource)))
    df[bit_col] = has_watersource(df[watersource_col].to_numpy())

    if not inplace:
        return df

Creates a bit flag: mhm_HasWaterSource where 1 denotes that a watersource is recorded and 0 denotes the contrary.

Parameters

  • df (pd.DataFrame): A mosquito habitat mapper DataFrame
  • watersource_col (str, default="mhm_WaterSource"): The name of the column in the mosquito habitat mapper DataFrame that contains the watersource records.
  • bit_col (str, default="mhm_HasWaterSource"): The name of the column which will store the generated HasWaterSource flag
  • inplace (bool, default=False): Whether to perform the operation in place. If True, no copy of the DataFrame is made and None is returned.

Returns

  • pd.DataFrame: A DataFrame with the HasWaterSource flag. If inplace=True it returns None.
def photo_bit_flags(df, watersource_photos='mhm_WaterSourcePhotoUrls', larvae_photos='mhm_LarvaFullBodyPhotoUrls', abdomen_photos='mhm_AbdomenCloseupPhotoUrls', photo_count='mhm_PhotoCount', rejected_count='mhm_RejectedCount', pending_count='mhm_PendingCount', photo_bit_binary='mhm_PhotoBitBinary', photo_bit_decimal='mhm_PhotoBitDecimal', inplace=False)
def photo_bit_flags(
    df,
    watersource_photos="mhm_WaterSourcePhotoUrls",
    larvae_photos="mhm_LarvaFullBodyPhotoUrls",
    abdomen_photos="mhm_AbdomenCloseupPhotoUrls",
    photo_count="mhm_PhotoCount",
    rejected_count="mhm_RejectedCount",
    pending_count="mhm_PendingCount",
    photo_bit_binary="mhm_PhotoBitBinary",
    photo_bit_decimal="mhm_PhotoBitDecimal",
    inplace=False,
):
    """
    Creates the following flags:
    - `PhotoCount`: The number of valid photos per record.
    - `RejectedCount`: The number of photos that were rejected per record.
    - `PendingCount`: The number of photos that are pending approval per record.
    - `PhotoBitBinary`: A string that represents the presence of a photo in the order of watersource, larvae, and abdomen. For example, if the entry is `110`, that indicates that there is a water source photo and a larvae photo, but no abdomen photo.
    - `PhotoBitDecimal`: The numerical representation of the mhm_PhotoBitBinary string.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_photos : str, default="mhm_WaterSourcePhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource photo url records.
    larvae_photos : str, default="mhm_LarvaFullBodyPhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the larvae photo url records.
    abdomen_photos : str, default="mhm_AbdomenCloseupPhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the abdomen photo url records.
    photo_count : str, default="mhm_PhotoCount"
        The name of the column that will store the PhotoCount flag.
    rejected_count : str, default="mhm_RejectedCount"
        The name of the column that will store the RejectedCount flag.
    pending_count : str, default="mhm_PendingCount"
        The name of the column that will store the PendingCount flag.
    photo_bit_binary : str, default="mhm_PhotoBitBinary"
        The name of the column that will store the PhotoBitBinary flag.
    photo_bit_decimal : str, default="mhm_PhotoBitDecimal"
        The name of the column that will store the PhotoBitDecimal flag.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no copy of the DataFrame is made and None is returned.

    Returns
    -------
    pd.DataFrame
        A DataFrame with the photo flags. If `inplace=True` it returns None.
    """

    def pic_data(*args):
        pic_count = 0
        rejected_count = 0
        pending_count = 0
        valid_photo_bit_mask = ""

        # For each URL string, append "1" to the bit mask if it contains any
        # "http" and "0" otherwise (null fields also count as "0").
        # Along the way, tally the valid ("http"), pending, and rejected photos.
        for url_string in args:
            if not pd.isna(url_string):
                if "http" not in url_string:
                    valid_photo_bit_mask += "0"
                else:
                    valid_photo_bit_mask += "1"

                pic_count += url_string.count("http")
                pending_count += url_string.count("pending")
                rejected_count += url_string.count("rejected")
            else:
                valid_photo_bit_mask += "0"

        return (
            pic_count,
            rejected_count,
            pending_count,
            valid_photo_bit_mask,
            int(valid_photo_bit_mask, 2),
        )

    if not inplace:
        df = df.copy()

    get_photo_data = np.vectorize(pic_data)
    (
        df[photo_count],
        df[rejected_count],
        df[pending_count],
        df[photo_bit_binary],
        df[photo_bit_decimal],
    ) = get_photo_data(
        df[watersource_photos].to_numpy(),
        df[larvae_photos].to_numpy(),
        df[abdomen_photos].to_numpy(),
    )

    if not inplace:
        return df

Creates the following flags:

  • PhotoCount: The number of valid photos per record.
  • RejectedCount: The number of photos that were rejected per record.
  • PendingCount: The number of photos that are pending approval per record.
  • PhotoBitBinary: A string that represents the presence of a photo in the order of watersource, larvae, and abdomen. For example, if the entry is 110, that indicates that there is a water source photo and a larvae photo, but no abdomen photo.
  • PhotoBitDecimal: The numerical representation of the mhm_PhotoBitBinary string.

Parameters

  • df (pd.DataFrame): A mosquito habitat mapper DataFrame
  • watersource_photos (str, default="mhm_WaterSourcePhotoUrls"): The name of the column in the mosquito habitat mapper DataFrame that contains the watersource photo url records.
  • larvae_photos (str, default="mhm_LarvaFullBodyPhotoUrls"): The name of the column in the mosquito habitat mapper DataFrame that contains the larvae photo url records.
  • abdomen_photos (str, default="mhm_AbdomenCloseupPhotoUrls"): The name of the column in the mosquito habitat mapper DataFrame that contains the abdomen photo url records.
  • photo_count (str, default="mhm_PhotoCount"): The name of the column that will store the PhotoCount flag.
  • rejected_count (str, default="mhm_RejectedCount"): The name of the column that will store the RejectedCount flag.
  • pending_count (str, default="mhm_PendingCount"): The name of the column that will store the PendingCount flag.
  • photo_bit_binary (str, default="mhm_PhotoBitBinary"): The name of the column that will store the PhotoBitBinary flag.
  • photo_bit_decimal (str, default="mhm_PhotoBitDecimal"): The name of the column that will store the PhotoBitDecimal flag.
  • inplace (bool, default=False): Whether to perform the operation in place. If True, no copy of the DataFrame is made and None is returned.

Returns

  • pd.DataFrame: A DataFrame with the photo flags. If inplace=True it returns None.
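
To make the five flags concrete, here is a small worked example of the per-record logic on a single made-up record. It is a sketch under stated assumptions: `photo_summary` is a hypothetical stand-in for the inner helper, and the field contents (URL format, the wording of pending entries) are invented for illustration rather than taken from the GLOBE data.

```python
def photo_summary(*url_strings):
    """Summarise photo URL fields given in the order watersource, larvae, abdomen."""
    pic_count = rejected = pending = 0
    bit_mask = ""
    for url_string in url_strings:
        # Treat anything that is not a string (None, NaN) as an empty field.
        if isinstance(url_string, str):
            bit_mask += "1" if "http" in url_string else "0"
            pic_count += url_string.count("http")
            pending += url_string.count("pending")
            rejected += url_string.count("rejected")
        else:
            bit_mask += "0"
    # int(bit_mask, 2) reads the mask as a base-2 number, e.g. "100" -> 4.
    return pic_count, rejected, pending, bit_mask, int(bit_mask, 2)


# Made-up field values: two watersource photos, one pending larvae photo,
# and no abdomen field at all.
result = photo_summary(
    "https://example.org/a.jpg; https://example.org/b.jpg",
    "pending approval for photo 1",
    None,
)
print(result)  # (2, 0, 1, '100', 4)
```

Note that a field holding only pending or rejected entries contributes a 0 bit, because the mask tracks fields that contain at least one "http" substring.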
def completion_score_flag(df, photo_bit_binary='mhm_PhotoBitBinary', has_genus='mhm_HasGenus', sub_completeness='mhm_SubCompletenessScore', completeness='mhm_CumulativeCompletenessScore', inplace=False)
def completion_score_flag(
    df,
    photo_bit_binary="mhm_PhotoBitBinary",
    has_genus="mhm_HasGenus",
    sub_completeness="mhm_SubCompletenessScore",
    completeness="mhm_CumulativeCompletenessScore",
    inplace=False,
):
    """
    Adds the following completeness score flags:
    - `SubCompletenessScore`: The fraction (0 to 1) of the watersource photos, larvae photos, abdomen photos, and genus columns that are filled out.
    - `CumulativeCompletenessScore`: The fraction (0 to 1) of non-null values out of all the columns, rounded to two decimal places.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame with the [`PhotoBitBinary`](#photo_bit_flags) and [`HasGenus`](#has_genus_flag) flags.
    photo_bit_binary : str, default="mhm_PhotoBitBinary"
        The name of the column in the mosquito habitat mapper DataFrame that contains the PhotoBitBinary flag.
    has_genus : str, default="mhm_HasGenus"
        The name of the column in the mosquito habitat mapper DataFrame that contains the HasGenus flag.
    sub_completeness : str, default="mhm_SubCompletenessScore"
        The name of the column in the mosquito habitat mapper DataFrame that will contain the generated SubCompletenessScore flag.
    completeness : str, default="mhm_CumulativeCompletenessScore"
        The name of the column in the mosquito habitat mapper DataFrame that will contain the generated CumulativeCompletenessScore flag.
    inplace : bool, default=False
        Whether to perform the operation in place. If True, no copy of the DataFrame is made and None is returned.

    Returns
    -------
    pd.DataFrame
        A DataFrame with completion score flags. If `inplace=True` it returns None.
    """

    def sum_bit_mask(bit_mask="0"):
        # Count the "1" characters in a bit string such as "110".
        total = 0.0
        for char in bit_mask:
            total += int(char)
        return total

    if not inplace:
        df = df.copy()

    scores = {}
    scores["sub_score"] = []
    # Cumulative completeness score: non-null cells per row over the column count
    scores["cumulative_score"] = round(df.count(axis=1) / len(df.columns), 2)
    # Sub-score: (photo bits set + HasGenus) out of the 4 tracked fields
    for index in df.index:
        bit_mask = df[photo_bit_binary][index]
        sub_score = df[has_genus][index] + sum_bit_mask(bit_mask=bit_mask)
        sub_score /= 4.0
        scores["sub_score"].append(sub_score)

    df[sub_completeness], df[completeness] = (
        scores["sub_score"],
        scores["cumulative_score"],
    )

    if not inplace:
        return df

Adds the following completeness score flags:

  • SubCompletenessScore: The fraction (0 to 1) of the watersource photos, larvae photos, abdomen photos, and genus columns that are filled out.
  • CumulativeCompletenessScore: The fraction (0 to 1) of non-null values out of all the columns, rounded to two decimal places.

Parameters

  • df (pd.DataFrame): A mosquito habitat mapper DataFrame with the PhotoBitBinary and HasGenus flags.
  • photo_bit_binary (str, default="mhm_PhotoBitBinary"): The name of the column in the mosquito habitat mapper DataFrame that contains the PhotoBitBinary flag.
  • has_genus (str, default="mhm_HasGenus"): The name of the column in the mosquito habitat mapper DataFrame that contains the HasGenus flag.
  • sub_completeness (str, default="mhm_SubCompletenessScore"): The name of the column in the mosquito habitat mapper DataFrame that will contain the generated SubCompletenessScore flag.
  • completeness (str, default="mhm_CumulativeCompletenessScore"): The name of the column in the mosquito habitat mapper DataFrame that will contain the generated CumulativeCompletenessScore flag.
  • inplace (bool, default=False): Whether to perform the operation in place. If True, no copy of the DataFrame is made and None is returned.

Returns

  • pd.DataFrame: A DataFrame with completion score flags. If inplace=True it returns None.
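
The arithmetic behind the two scores can be checked by hand on a single record. The sketch below uses plain Python values instead of a DataFrame; `sub_completeness_score` and `cumulative_completeness_score` are hypothetical names, and the record is represented as a dict whose missing values are `None`, which is an assumption made for illustration.

```python
def sub_completeness_score(photo_bit_binary, has_genus):
    """Fraction of the four tracked fields (3 photo types + genus) that are filled."""
    filled_photos = sum(int(bit) for bit in photo_bit_binary)
    return (filled_photos + has_genus) / 4.0


def cumulative_completeness_score(record):
    """Fraction of non-null values in a record, rounded to two decimals."""
    non_null = sum(value is not None for value in record.values())
    return round(non_null / len(record), 2)


# Watersource and larvae photos present, no abdomen photo, genus recorded.
print(sub_completeness_score("110", 1))  # 0.75

# Two of the three fields are filled in.
print(cumulative_completeness_score({"a": 1, "b": None, "c": "x"}))  # 0.67
```

Both scores are fractions between 0 and 1, so multiply by 100 if a percentage is needed for display.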
def apply_cleanup(mhm_df)
def apply_cleanup(mhm_df):
    """Applies a full cleanup procedure to the mosquito habitat mapper data. It always returns a copy and leaves the input DataFrame unchanged.
    It performs the following steps:
    - Renames Latitude and Longitudes
    - Cleans the Column Naming
    - Converts Larvae Count to Numbers
    - Rounds Columns
    - Standardizes Null Values

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame containing **raw** Mosquito Habitat Mapper Data from the API.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the cleaned up Mosquito Habitat Mapper Data
    """
    mhm_df = mhm_df.copy()

    rename_latlon_cols(mhm_df, inplace=True)
    cleanup_column_prefix(mhm_df, inplace=True)
    larvae_to_num(mhm_df, inplace=True)
    round_cols(mhm_df, inplace=True)
    standardize_null_vals(mhm_df, inplace=True)
    return mhm_df

Applies a full cleanup procedure to the mosquito habitat mapper data. It always returns a copy and leaves the input DataFrame unchanged. It performs the following steps:

  • Renames Latitude and Longitudes
  • Cleans the Column Naming
  • Converts Larvae Count to Numbers
  • Rounds Columns
  • Standardizes Null Values

Parameters

  • mhm_df (pd.DataFrame): A DataFrame containing raw Mosquito Habitat Mapper Data from the API.

Returns

  • pd.DataFrame: A DataFrame containing the cleaned up Mosquito Habitat Mapper Data
def add_flags(mhm_df)
def add_flags(mhm_df):
    """Adds the following flags to the Mosquito Habitat Mapper Data:
    - Has Genus
    - Is Infectious Genus/Genus of Interest
    - Is Container
    - Has WaterSource
    - Photo Bit Flags
    - Completion Score Flag

    This returns a copy of the original DataFrame with the flags added onto it.

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame containing cleaned up Mosquito Habitat Mapper Data, ideally from the [`apply_cleanup`](#apply_cleanup) method.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the flagged Mosquito Habitat Mapper Data
    """
    mhm_df = mhm_df.copy()
    has_genus_flag(mhm_df, inplace=True)
    infectious_genus_flag(mhm_df, inplace=True)
    is_container_flag(mhm_df, inplace=True)
    has_watersource_flag(mhm_df, inplace=True)
    photo_bit_flags(mhm_df, inplace=True)
    completion_score_flag(mhm_df, inplace=True)
    return mhm_df

Adds the following flags to the Mosquito Habitat Mapper Data:

  • Has Genus
  • Is Infectious Genus/Genus of Interest
  • Is Container
  • Has WaterSource
  • Photo Bit Flags
  • Completion Score Flag

This returns a copy of the original DataFrame with the flags added onto it.

Parameters

  • mhm_df (pd.DataFrame): A DataFrame containing cleaned up Mosquito Habitat Mapper Data, ideally from the apply_cleanup method.

Returns

  • pd.DataFrame: A DataFrame containing the flagged Mosquito Habitat Mapper Data
def plot_valid_entries(df, bit_col, entry_type)
def plot_valid_entries(df, bit_col, entry_type):
    """
    Plots the number of entries whose `bit_col` flag is greater than 0 against the number of entries where it is not (e.g. entries with photos vs. entries without photos).

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing Mosquito Habitat Mapper Data with the flag column named by `bit_col` (e.g. the PhotoBitDecimal Flag).
    bit_col : str
        The name of the flag column; an entry counts as valid when its value is greater than 0.
    entry_type : str
        The label for the kind of entry being counted, used in the plot title and bar labels.
    """
    plt.figure()
    num_valid = len(df[df[bit_col] > 0])
    plt.title(f"Entries with {entry_type} vs No {entry_type}")
    plt.ylabel("Number of Entries")
    plt.bar(entry_type, num_valid, color="#e34a33")
    plt.bar(f"No {entry_type}", len(df) - num_valid, color="#fdcc8a")

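Plots the number of entries whose bit_col flag is greater than 0 against the number of entries where it is not (e.g. entries with photos vs. entries without photos).

Parameters

  • df (pd.DataFrame): The DataFrame containing Mosquito Habitat Mapper Data with the flag column named by bit_col (e.g. the PhotoBitDecimal Flag).
  • bit_col (str): The name of the flag column; an entry counts as valid when its value is greater than 0.
  • entry_type (str): The label for the kind of entry being counted, used in the plot title and bar labels.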
def photo_subjects(mhm_df)
def photo_subjects(mhm_df):
    """
    Plots, on a log scale, the number of entries that have a photo for each photo area (Larvae, Abdomen, Watersource)

    Parameters
    ----------
    mhm_df : pd.DataFrame
        The DataFrame containing Mosquito Habitat Mapper Data with the PhotoBitDecimal Flag.
    """

    total_dict = {"Larvae Photos": 0, "Abdomen Photos": 0, "Watersource Photos": 0}

    # Shift each subject's bit down to position 0 so every entry adds 0 or 1
    # (adding `number & 4` directly would add 4 per watersource entry).
    for number in mhm_df["mhm_PhotoBitDecimal"]:
        total_dict["Watersource Photos"] += (number >> 2) & 1
        total_dict["Larvae Photos"] += (number >> 1) & 1
        total_dict["Abdomen Photos"] += number & 1

    # Convert the counts to log10, leaving empty categories at 0.
    for key in total_dict.keys():
        if total_dict[key] != 0:
            total_dict[key] = math.log10(total_dict[key])
        else:
            total_dict[key] = 0
    plt.figure(figsize=(10, 5))
    plt.title("Mosquito Habitat Mapper - Photo Subject Frequencies (Log Scale)")
    plt.xlabel("Photo Type")
    plt.ylabel("Frequency (Log Scale)")
    plt.bar(total_dict.keys(), total_dict.values(), color="lightblue")

Plots, on a log scale, the number of entries that have a photo for each photo area (Larvae, Abdomen, Watersource)

Parameters

  • mhm_df (pd.DataFrame): The DataFrame containing Mosquito Habitat Mapper Data with the PhotoBitDecimal Flag.
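
Because PhotoBitDecimal packs three presence bits (watersource = 4, larvae = 2, abdomen = 1), counting entries per subject comes down to reading one bit at a time. The following is a minimal sketch of that decoding with a hypothetical helper, `count_photo_subjects`, operating on a plain list of integers; it shifts each bit to position 0 so that every entry contributes at most 1 to each subject.

```python
def count_photo_subjects(photo_bit_decimals):
    """Count how many entries have each photo subject, given PhotoBitDecimal values."""
    totals = {"Watersource Photos": 0, "Larvae Photos": 0, "Abdomen Photos": 0}
    for number in photo_bit_decimals:
        totals["Watersource Photos"] += (number >> 2) & 1  # bit value 4
        totals["Larvae Photos"] += (number >> 1) & 1  # bit value 2
        totals["Abdomen Photos"] += number & 1  # bit value 1
    return totals


# 6 = "110", 4 = "100", 1 = "001", 0 = "000", 7 = "111"
subject_totals = count_photo_subjects([6, 4, 1, 0, 7])
print(subject_totals)
# {'Watersource Photos': 3, 'Larvae Photos': 2, 'Abdomen Photos': 2}
```

These are counts of entries with at least one photo of a subject, not counts of individual photos; the latter is what PhotoCount tracks.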
def diagnostic_plots(mhm_df)
def diagnostic_plots(mhm_df):
    """
    Generates (but doesn't display) diagnostic plots to gain insight into the current data.

    Plots:
    - Larvae Count Distribution (where a negative entry denotes null data)
    - Photo Subject Distribution
    - Genus Type Frequencies
    - Number of entries with genus classifications vs no genus classifications
    - Number of valid photos vs no photos
    - Completeness Score Distribution
    - Subcompleteness Score Distribution

    Parameters
    ----------
    mhm_df : pd.DataFrame
        The DataFrame containing Flagged and Cleaned Mosquito Habitat Mapper Data.
    """
    plot_int_distribution(mhm_df, "mhm_LarvaeCount", "Larvae Count")
    photo_subjects(mhm_df)
    plot_freq_bar(mhm_df, "Mosquito Habitat Mapper", "mhm_Genus", "Genus Types")
    plot_valid_entries(mhm_df, "mhm_HasGenus", "Genus Classifications")
    plot_valid_entries(mhm_df, "mhm_PhotoBitDecimal", "Valid Photos")
    completeness_histogram(
        mhm_df,
        "Mosquito Habitat Mapper",
        "mhm_CumulativeCompletenessScore",
        "Cumulative Completeness",
    )
    completeness_histogram(
        mhm_df,
        "Mosquito Habitat Mapper",
        "mhm_SubCompletenessScore",
        "Sub Completeness",
    )

Generates (but doesn't display) diagnostic plots to gain insight into the current data.

Plots:

  • Larvae Count Distribution (where a negative entry denotes null data)
  • Photo Subject Distribution
  • Genus Type Frequencies
  • Number of entries with genus classifications vs no genus classifications
  • Number of valid photos vs no photos
  • Completeness Score Distribution
  • Subcompleteness Score Distribution

Parameters

  • mhm_df (pd.DataFrame): The DataFrame containing Flagged and Cleaned Mosquito Habitat Mapper Data.
def qa_filter(mhm_df, has_genus=False, min_larvae_count=-9999, has_photos=False, is_container=False)
def qa_filter(
    mhm_df,
    has_genus=False,
    min_larvae_count=-9999,
    has_photos=False,
    is_container=False,
):
    """
    Filters a cleaned and flagged mosquito habitat mapper DataFrame based on the following criteria:
    - `Has Genus`: If the entry has an identified genus
    - `Min Larvae Count` : Minimum larvae count needed for an entry
    - `Has Photos` : If the entry contains valid photo entries
    - `Is Container` : If the entry's watersource was a container

    Returns a copy of the DataFrame

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A cleaned and flagged Mosquito Habitat Mapper DataFrame.
    has_genus : bool, default=False
        If True, only entries with an identified genus will be returned.
    min_larvae_count : int, default=-9999
        Only entries with a larvae count greater than or equal to this parameter will be included.
    has_photos : bool, default=False
        If True, only entries with recorded photos will be returned
    is_container : bool, default=False
        If True, only entries with containers will be returned

    Returns
    -------
    pd.DataFrame
        A DataFrame containing only the entries that pass the applied filters.
    """

    mhm_df = mhm_df[mhm_df["mhm_LarvaeCount"] >= min_larvae_count]

    if has_genus:
        mhm_df = mhm_df[mhm_df["mhm_HasGenus"] == 1]
    if has_photos:
        mhm_df = mhm_df[mhm_df["mhm_PhotoBitDecimal"] > 0]
    if is_container:
        mhm_df = mhm_df[mhm_df["mhm_IsWaterSourceContainer"] == 1]

    return mhm_df

Filters a cleaned and flagged mosquito habitat mapper DataFrame based on the following criteria:

  • Has Genus: If the entry has an identified genus
  • Min Larvae Count: Minimum larvae count needed for an entry
  • Has Photos: If the entry contains valid photo entries
  • Is Container: If the entry's watersource was a container

Returns a copy of the DataFrame

Parameters

  • mhm_df (pd.DataFrame): A cleaned and flagged Mosquito Habitat Mapper DataFrame.
  • has_genus (bool, default=False): If True, only entries with an identified genus will be returned.
  • min_larvae_count (int, default=-9999): Only entries with a larvae count greater than or equal to this parameter will be included.
  • has_photos (bool, default=False): If True, only entries with recorded photos will be returned
  • is_container (bool, default=False): If True, only entries with containers will be returned

Returns

  • pd.DataFrame: A DataFrame containing only the entries that pass the applied filters.
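
The way the four criteria combine can be seen in a pandas-free sketch. Here `qa_filter_records` is a hypothetical stand-in that applies the same conditions to a list of dicts; the records are made up, with -9999 standing for a null larvae count as the default `min_larvae_count` suggests.

```python
def qa_filter_records(
    records,
    has_genus=False,
    min_larvae_count=-9999,
    has_photos=False,
    is_container=False,
):
    """Keep only the records that satisfy every requested criterion."""
    kept = []
    for record in records:
        if record["mhm_LarvaeCount"] < min_larvae_count:
            continue
        if has_genus and record["mhm_HasGenus"] != 1:
            continue
        if has_photos and record["mhm_PhotoBitDecimal"] <= 0:
            continue
        if is_container and record["mhm_IsWaterSourceContainer"] != 1:
            continue
        kept.append(record)
    return kept


# Made-up records for illustration only.
records = [
    {"mhm_LarvaeCount": 10, "mhm_HasGenus": 1, "mhm_PhotoBitDecimal": 4, "mhm_IsWaterSourceContainer": 1},
    {"mhm_LarvaeCount": -9999, "mhm_HasGenus": 0, "mhm_PhotoBitDecimal": 0, "mhm_IsWaterSourceContainer": 0},
    {"mhm_LarvaeCount": 3, "mhm_HasGenus": 1, "mhm_PhotoBitDecimal": 0, "mhm_IsWaterSourceContainer": 1},
]

print(len(qa_filter_records(records)))  # 3
print(len(qa_filter_records(records, min_larvae_count=1)))  # 2
print(len(qa_filter_records(records, has_genus=True, has_photos=True)))  # 1
```

With the defaults nothing is filtered out, because a null larvae count of -9999 still satisfies the `>= -9999` threshold; raising `min_larvae_count` to 0 or above is what excludes null counts.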