go_utils.mhm

Mosquito Specific Cleanup Procedures

Converting Larvae Data to Integers

Larvae data is stored as a string in the raw GLOBE Observer dataset. To facilitate analysis, this method converts it to numerical data.

It needs to account for four types of data:

  1. Regular data: Converted to a number.
  2. Excessively large data ($> 100$, as it is hard to count more than that amount accurately): The count is stored as $101$ and, to preserve the information from that entry, the LarvaeCountMagnitude flag indicates the order of magnitude of the real value.
  3. Ranges (e.g. "25-50"): The lower bound is kept and the LarvaeCountIsRangeFlag is set to $1$.
  4. Null values: Set to $-9999$.

It generates the following flags:

  • LarvaeCountMagnitude: An integer flag containing the order of magnitude (0-4) by which the larvae count exceeds the maximum Larvae Count of 100. For counts above 100, it is calculated as $\min(1 + \lfloor \log_{10}{\frac{num}{100}} \rfloor, 4)$. As a result:
    • 0: Corresponds to a Larvae Count $\leq 100$
    • 1: Corresponds to a Larvae Count between $101$ and $999$
    • 2: Corresponds to a Larvae Count between $1000$ and $9999$
    • 3: Corresponds to a Larvae Count between $10,000$ and $99,999$
    • 4: Corresponds to a Larvae Count $\geq 100,000$
  • LarvaeCountIsRangeFlag: Either $1$, which indicates the entry was a range (e.g. 25-50), or $0$, which indicates the entry wasn't a range.
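
The magnitude thresholds listed above can be checked with a minimal sketch. The helper name `larvae_count_magnitude` is hypothetical (it is not part of `go_utils`); it only restates the documented formula with a base-10 logarithm and a cap at 4.

```python
import math


def larvae_count_magnitude(count):
    """Hypothetical helper: magnitude flag for a numeric larvae count."""
    if count <= 100:
        # Counts up to the maximum of 100 carry no magnitude information.
        return 0
    # 1 + floor(log10(count / 100)), capped at the maximum flag value of 4.
    return min(1 + math.floor(math.log10(count / 100)), 4)


for count in (42, 100, 101, 999, 5000, 25000, 1e27):
    print(count, larvae_count_magnitude(count))
```

The cap means that every count of 100,000 or more, however large, maps to the same flag value.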

Additionally, the raw data contained extremely large values in scientific notation (e.g. 1e+27) that Python was unable to process, so an initial preprocessing step sets those entries to 100000 (which corresponds to the maximum magnitude flag).
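
As a rough illustration of that preprocessing step, the sketch below caps scientific-notation strings in a small, made-up Series. It uses vectorized pandas string matching rather than the row-by-row loop in the module source, so it should be read as an equivalent alternative under that assumption, not as the library's implementation.

```python
import pandas as pd

# Made-up sample of raw larvae count strings (illustrative only).
raw = pd.Series(
    ["12", "1e+27", None, "25-50", "more than 100"], name="mhm_LarvaeCount"
)

# Flag entries written in scientific notation; missing values count as False.
is_scientific = raw.str.contains("e+", regex=False, na=False)

# Replace flagged entries with "100000", which maps to the maximum magnitude flag.
capped = raw.mask(is_scientific, "100000")
print(capped.tolist())
```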

import math
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from go_utils.cleanup import (
    rename_latlon_cols,
    replace_column_prefix,
    round_cols,
    standardize_null_vals,
)
from go_utils.plot import completeness_histogram, plot_freq_bar, plot_int_distribution

__doc__ = r"""

## Mosquito Specific Cleanup Procedures

### Converting Larvae Data to Integers
Larvae data is stored as a string in the raw GLOBE Observer dataset. To facilitate analysis, [this method](#larvae_to_num) converts it to numerical data.

It needs to account for four types of data:
1. Regular data: Converted to a number.
2. Excessively large data ($> 100$, as it is hard to count more than that amount accurately): The count is stored as $101$ and, to preserve the information from that entry, the `LarvaeCountMagnitude` flag indicates the order of magnitude of the real value.
3. Ranges (e.g. "25-50"): The lower bound is kept and the `LarvaeCountIsRangeFlag` is set to $1$.
4. Null values: Set to $-9999$.


It generates the following flags:
- `LarvaeCountMagnitude`: An integer flag containing the order of magnitude (0-4) by which the larvae count exceeds the maximum Larvae Count of 100. For counts above 100, it is calculated as $\min(1 + \lfloor \log_{10}{\frac{num}{100}} \rfloor, 4)$. As a result:
    - `0`: Corresponds to a Larvae Count $\leq 100$
    - `1`: Corresponds to a Larvae Count between $101$ and $999$
    - `2`: Corresponds to a Larvae Count between $1000$ and $9999$
    - `3`: Corresponds to a Larvae Count between $10,000$ and $99,999$
    - `4`: Corresponds to a Larvae Count $\geq 100,000$
- `LarvaeCountIsRangeFlag`: Either $1$, which indicates the entry was a range (e.g. 25-50), or $0$, which indicates the entry wasn't a range.

Additionally, the raw data contained extremely large values in scientific notation (e.g. `1e+27`) that Python was unable to process, so an initial preprocessing step sets those entries to 100000 (which corresponds to the maximum magnitude flag).
"""


def cleanup_column_prefix(df, inplace=False):
    """Method for shortening raw mosquito habitat mapper column names.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing raw mosquito habitat mapper data.
    inplace : bool, default=False
        If False, a modified copy is returned. If True, the operation is performed in place and nothing is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the cleaned up column prefixes. If `inplace=True` it returns None.
    """

    return replace_column_prefix(df, "mosquitohabitatmapper", "mhm", inplace=inplace)


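def _entry_to_num(entry):
    """Converts one raw larvae count entry to (count, magnitude, is_range)."""
    try:
        if entry == "more than 100":
            return 101, 1, 1
        if pd.isna(entry):
            return -9999, 0, 0
        elif float(entry) > 100:
            # Counts above 100 are stored as 101; the magnitude flag keeps the scale
            return 101, min(math.floor(math.log10(float(entry) / 100)) + 1, 4), 0
        return float(entry), 0, 0
    except ValueError:
        # Ranges such as "25-50" fail float(); keep only the lower bound
        return float(re.sub(r"-.*", "", entry)), 0, 1

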
def larvae_to_num(
    mhm_df,
    larvae_count_col="mhm_LarvaeCount",
    magnitude="mhm_LarvaeCountMagnitude",
    range_flag="mhm_LarvaeCountIsRangeFlag",
    inplace=False,
):
    """Converts the Larvae Count of the Mosquito Habitat Mapper Dataset from being stored as a string to integers.

    See [here](#converting-larvae-data-to-integers) for more information.

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame of Mosquito Habitat Mapper data that needs the larvae counts to be set to numbers
    larvae_count_col : str, default="mhm_LarvaeCount"
        The name of the column storing the larvae count. **Note**: The columns will be output in the format: `prefix_ColumnName` where `prefix` is all the characters that precede the words `LarvaeCount` in the specified name.
    magnitude : str, default="mhm_LarvaeCountMagnitude"
        The name of the column which will store the generated LarvaeCountMagnitude output
    range_flag : str, default="mhm_LarvaeCountIsRangeFlag"
        The name of the column which will store the generated LarvaeCountIsRangeFlag output
    inplace : bool, default=False
        If False, a modified copy is returned. If True, the operation is performed in place and nothing is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the larvae count as integers. If `inplace=True` it returns None.
    """

    if not inplace:
        mhm_df = mhm_df.copy()
    # Preprocessing step: cap entries written in scientific notation (e.g. "1e+27")
    for i in mhm_df.index:
        count = mhm_df.at[i, larvae_count_col]
        if isinstance(count, str) and "e+" in count:
            mhm_df.at[i, larvae_count_col] = "100000"

    larvae_conversion = np.vectorize(_entry_to_num)
    (
        mhm_df[larvae_count_col],
        mhm_df[magnitude],
        mhm_df[range_flag],
    ) = larvae_conversion(mhm_df[larvae_count_col].to_numpy())

    if not inplace:
        return mhm_df


def has_genus_flag(df, genus_col="mhm_Genus", bit_col="mhm_HasGenus", inplace=False):
    """
    Creates a bit flag: `mhm_HasGenus` where 1 denotes a recorded Genus and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    genus_col : str, default="mhm_Genus"
        The name of the column in the mosquito habitat mapper DataFrame that contains the genus records.
    bit_col : str, default="mhm_HasGenus"
        The name of the column which will store the generated HasGenus flag
    inplace : bool, default=False
        If False, a modified copy is returned. If True, the operation is performed in place and nothing is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the HasGenus flag. If `inplace=True` it returns None.
    """
    if not inplace:
        df = df.copy()
    df[bit_col] = (~pd.isna(df[genus_col].to_numpy())).astype(int)

    if not inplace:
        return df


def infectious_genus_flag(
    df, genus_col="mhm_Genus", bit_col="mhm_IsGenusOfInterest", inplace=False
):
    """
    Creates a bit flag: `mhm_IsGenusOfInterest` where 1 denotes a Genus of an infectious mosquito and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    genus_col : str, default="mhm_Genus"
        The name of the column in the mosquito habitat mapper DataFrame that contains the genus records.
    bit_col : str, default="mhm_IsGenusOfInterest"
        The name of the column which will store the generated IsGenusOfInterest flag
    inplace : bool, default=False
        If False, a modified copy is returned. If True, the operation is performed in place and nothing is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the IsGenusOfInterest flag. If `inplace=True` it returns None.
    """
    if not inplace:
        df = df.copy()
    # Named differently from the enclosing function to avoid shadowing it
    is_genus_of_interest = np.vectorize(
        lambda genus: genus in ["Aedes", "Anopheles", "Culex"]
    )
    df[bit_col] = is_genus_of_interest(df[genus_col].to_numpy()).astype(int)

    if not inplace:
        return df


def is_container_flag(
    df,
    watersource_col="mhm_WaterSourceType",
    bit_col="mhm_IsWaterSourceContainer",
    inplace=False,
):
    """
    Creates a bit flag: `mhm_IsWaterSourceContainer` where 1 denotes that a watersource is a container (e.g. ovitrap, pots, tires, etc.) and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_col : str, default="mhm_WaterSourceType"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource type records.
    bit_col : str, default="mhm_IsWaterSourceContainer"
        The name of the column which will store the generated IsWaterSourceContainer flag
    inplace : bool, default=False
        If False, a modified copy is returned. If True, the operation is performed in place and nothing is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the IsWaterSourceContainer flag. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()

    mark_containers = np.vectorize(
        lambda container: not pd.isna(container) and "container" in container
    )
    df[bit_col] = mark_containers(df[watersource_col].to_numpy()).astype(int)

    if not inplace:
        return df


def has_watersource_flag(
    df, watersource_col="mhm_WaterSource", bit_col="mhm_HasWaterSource", inplace=False
):
    """
    Creates a bit flag: `mhm_HasWaterSource` where 1 denotes that there is a watersource and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_col : str, default="mhm_WaterSource"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource records.
    bit_col : str, default="mhm_HasWaterSource"
        The name of the column which will store the generated HasWaterSource flag
    inplace : bool, default=False
        If False, a modified copy is returned. If True, the operation is performed in place and nothing is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the HasWaterSource flag. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()
    has_watersource = np.vectorize(lambda watersource: int(not pd.isna(watersource)))
    df[bit_col] = has_watersource(df[watersource_col].to_numpy())

    if not inplace:
        return df


def photo_bit_flags(
    df,
    watersource_photos="mhm_WaterSourcePhotoUrls",
    larvae_photos="mhm_LarvaFullBodyPhotoUrls",
    abdomen_photos="mhm_AbdomenCloseupPhotoUrls",
    photo_count="mhm_PhotoCount",
    rejected_count="mhm_RejectedCount",
    pending_count="mhm_PendingCount",
    photo_bit_binary="mhm_PhotoBitBinary",
    photo_bit_decimal="mhm_PhotoBitDecimal",
    inplace=False,
):
    """
    Creates the following flags:
    - `PhotoCount`: The number of valid photos per record.
    - `RejectedCount`: The number of photos that were rejected per record.
    - `PendingCount`: The number of photos that are pending approval per record.
    - `PhotoBitBinary`: A string that represents the presence of a photo in the order of watersource, larvae, and abdomen. For example, if the entry is `110`, that indicates that there is a water source photo and a larvae photo, but no abdomen photo.
    - `PhotoBitDecimal`: The numerical representation of the `PhotoBitBinary` string.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_photos : str, default="mhm_WaterSourcePhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource photo url records.
    larvae_photos : str, default="mhm_LarvaFullBodyPhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the larvae photo url records.
    abdomen_photos : str, default="mhm_AbdomenCloseupPhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the abdomen photo url records.
    photo_count : str, default="mhm_PhotoCount"
        The name of the column that will store the PhotoCount flag.
    rejected_count : str, default="mhm_RejectedCount"
        The name of the column that will store the RejectedCount flag.
    pending_count : str, default="mhm_PendingCount"
        The name of the column that will store the PendingCount flag.
    photo_bit_binary : str, default="mhm_PhotoBitBinary"
        The name of the column that will store the PhotoBitBinary flag.
    photo_bit_decimal : str, default="mhm_PhotoBitDecimal"
        The name of the column that will store the PhotoBitDecimal flag.
    inplace : bool, default=False
        If False, a modified copy is returned. If True, the operation is performed in place and nothing is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with the photo flags. If `inplace=True` it returns None.
    """

    def pic_data(*args):
        pic_count = 0
        rejected_count = 0
        pending_count = 0
        valid_photo_bit_mask = ""

        # For each url string: append "1" to the mask if it contains ANY "http",
        # otherwise append "0" (also for empty photo fields). Along the way,
        # count all valid, pending, and rejected photos.
        for url_string in args:
            if pd.isna(url_string):
                valid_photo_bit_mask += "0"
                continue

            valid_photo_bit_mask += "1" if "http" in url_string else "0"
            pic_count += url_string.count("http")
            pending_count += url_string.count("pending")
            rejected_count += url_string.count("rejected")

        return (
            pic_count,
            rejected_count,
            pending_count,
            valid_photo_bit_mask,
            int(valid_photo_bit_mask, 2),
        )

    if not inplace:
        df = df.copy()

    get_photo_data = np.vectorize(pic_data)
    (
        df[photo_count],
        df[rejected_count],
        df[pending_count],
        df[photo_bit_binary],
        df[photo_bit_decimal],
    ) = get_photo_data(
        df[watersource_photos].to_numpy(),
        df[larvae_photos].to_numpy(),
        df[abdomen_photos].to_numpy(),
    )

    if not inplace:
        return df


def completion_score_flag(
    df,
    photo_bit_binary="mhm_PhotoBitBinary",
    has_genus="mhm_HasGenus",
    sub_completeness="mhm_SubCompletenessScore",
    completeness="mhm_CumulativeCompletenessScore",
    inplace=False,
):
    """
    Adds the following completeness score flags:
    - `SubCompletenessScore`: The percentage of the watersource photos, larvae photos, abdomen photos, and genus columns that are filled out.
    - `CumulativeCompletenessScore`: The percentage of non null values out of all the columns.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame with the [`PhotoBitBinary`](#photo_bit_flags) and [`HasGenus`](#has_genus_flag) flags.
    photo_bit_binary : str, default="mhm_PhotoBitBinary"
        The name of the column in the mosquito habitat mapper DataFrame that contains the PhotoBitBinary flag.
    has_genus : str, default="mhm_HasGenus"
        The name of the column in the mosquito habitat mapper DataFrame that contains the HasGenus flag.
    sub_completeness : str, default="mhm_SubCompletenessScore"
        The name of the column in the mosquito habitat mapper DataFrame that will contain the generated SubCompletenessScore flag.
    completeness : str, default="mhm_CumulativeCompletenessScore"
        The name of the column in the mosquito habitat mapper DataFrame that will contain the generated CumulativeCompletenessScore flag.
    inplace : bool, default=False
        If False, a modified copy is returned. If True, the operation is performed in place and nothing is returned.

    Returns
    -------
    pd.DataFrame or None
        A DataFrame with completion score flags. If `inplace=True` it returns None.
    """

    def sum_bit_mask(bit_mask="0"):
        total = 0.0
        for char in bit_mask:
            total += int(char)
        return total

    if not inplace:
        df = df.copy()

    scores = {}
    scores["sub_score"] = []
    # Cumulative Completion Score
    scores["cumulative_score"] = round(df.count(axis=1) / len(df.columns), 2)
    # Sub-Score: three photo bits plus the genus bit, averaged over four fields
    for index in df.index:
        bit_mask = df[photo_bit_binary][index]
        sub_score = df[has_genus][index] + sum_bit_mask(bit_mask=bit_mask)
        sub_score /= 4.0
        scores["sub_score"].append(sub_score)

    df[sub_completeness], df[completeness] = (
        scores["sub_score"],
        scores["cumulative_score"],
    )

    if not inplace:
        return df


def apply_cleanup(mhm_df):
    """Applies a full cleanup procedure to the mosquito habitat mapper data. Only returns a copy.
    It performs the following steps:
    - Renames Latitude and Longitudes
    - Cleans the Column Naming
    - Converts Larvae Count to Numbers
    - Rounds Columns
    - Standardizes Null Values

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame containing **raw** Mosquito Habitat Mapper Data from the API.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the cleaned up Mosquito Habitat Mapper Data
    """
    mhm_df = mhm_df.copy()

    rename_latlon_cols(mhm_df, inplace=True)
    cleanup_column_prefix(mhm_df, inplace=True)
    larvae_to_num(mhm_df, inplace=True)
    round_cols(mhm_df, inplace=True)
    standardize_null_vals(mhm_df, inplace=True)
    return mhm_df


def add_flags(mhm_df):
    """Adds the following flags to the Mosquito Habitat Mapper Data:
    - Has Genus
    - Is Infectious Genus/Genus of Interest
    - Is Container
    - Has WaterSource
    - Photo Bit Flags
    - Completion Score Flag

    This returns a copy of the original DataFrame with the flags added onto it.

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame containing cleaned up Mosquito Habitat Mapper Data, ideally from the [apply_cleanup](#apply_cleanup) method.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the flagged Mosquito Habitat Mapper Data
    """
    mhm_df = mhm_df.copy()
    has_genus_flag(mhm_df, inplace=True)
    infectious_genus_flag(mhm_df, inplace=True)
    is_container_flag(mhm_df, inplace=True)
    has_watersource_flag(mhm_df, inplace=True)
    photo_bit_flags(mhm_df, inplace=True)
    completion_score_flag(mhm_df, inplace=True)
    return mhm_df


def plot_valid_entries(df, bit_col, entry_type):
    """
    Plots the number of entries whose flag is set against the number of entries whose flag is not set (e.g. entries with photos vs. entries without photos).

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing Mosquito Habitat Mapper Data with the flag to plot (e.g. the PhotoBitDecimal Flag).
    bit_col : str
        The name of the flag column. Entries with a value greater than 0 are counted as valid.
    entry_type : str
        The label describing the flagged entries, used in the title and bar labels.
    """
    plt.figure()
    num_valid = len(df[df[bit_col] > 0])
    plt.title(f"Entries with {entry_type} vs No {entry_type}")
    plt.ylabel("Number of Entries")
    plt.bar(entry_type, num_valid, color="#e34a33")
    plt.bar(f"No {entry_type}", len(df) - num_valid, color="#fdcc8a")


def photo_subjects(mhm_df):
    """
    Plots, on a log scale, the number of entries that include a photo for each photo area (Larvae, Abdomen, Watersource)

    Parameters
    ----------
    mhm_df : pd.DataFrame
        The DataFrame containing Mosquito Habitat Mapper Data with the PhotoBitDecimal Flag.
    """

    total_dict = {"Larvae Photos": 0, "Abdomen Photos": 0, "Watersource Photos": 0}

    for number in mhm_df["mhm_PhotoBitDecimal"]:
        # Shift each bit down to 0 or 1 so every entry adds at most 1 per photo type
        total_dict["Watersource Photos"] += (number >> 2) & 1
        total_dict["Larvae Photos"] += (number >> 1) & 1
        total_dict["Abdomen Photos"] += number & 1

    for key in total_dict.keys():
        if total_dict[key] != 0:
            total_dict[key] = math.log10(total_dict[key])
        else:
            total_dict[key] = 0
    plt.figure(figsize=(10, 5))
    plt.title("Mosquito Habitat Mapper - Photo Subject Frequencies (Log Scale)")
    plt.xlabel("Photo Type")
    plt.ylabel("Frequency (Log Scale)")
    plt.bar(total_dict.keys(), total_dict.values(), color="lightblue")


def diagnostic_plots(mhm_df):
    """
    Generates (but doesn't display) diagnostic plots to gain insight into the current data.

    Plots:
    - Larvae Count Distribution (where a negative entry denotes null data)
    - Photo Subject Distribution
    - Genus Type Frequencies
    - Number of entries with a genus classification vs without
    - Number of valid photos vs no photos
    - Completeness Score Distribution
    - Subcompleteness Score Distribution

    Parameters
    ----------
    mhm_df : pd.DataFrame
        The DataFrame containing Flagged and Cleaned Mosquito Habitat Mapper Data.
    """
    plot_int_distribution(mhm_df, "mhm_LarvaeCount", "Larvae Count")
    photo_subjects(mhm_df)
    plot_freq_bar(mhm_df, "Mosquito Habitat Mapper", "mhm_Genus", "Genus Types")
    plot_valid_entries(mhm_df, "mhm_HasGenus", "Genus Classifications")
    plot_valid_entries(mhm_df, "mhm_PhotoBitDecimal", "Valid Photos")
    completeness_histogram(
        mhm_df,
        "Mosquito Habitat Mapper",
        "mhm_CumulativeCompletenessScore",
        "Cumulative Completeness",
    )
    completeness_histogram(
        mhm_df,
        "Mosquito Habitat Mapper",
        "mhm_SubCompletenessScore",
        "Sub Completeness",
    )


def qa_filter(
    mhm_df,
    has_genus=False,
    min_larvae_count=-9999,
    has_photos=False,
    is_container=False,
):
    """
    Filters a cleaned and flagged mosquito habitat mapper DataFrame based on the following criteria:
    - `Has Genus`: If the entry has an identified genus
    - `Min Larvae Count` : Minimum larvae count needed for an entry
    - `Has Photos` : If the entry contains valid photo entries
    - `Is Container` : If the entry's watersource was a container

    Returns a copy of the DataFrame

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A cleaned and flagged mosquito habitat mapper DataFrame.
    has_genus : bool, default=False
        If True, only entries with an identified genus will be returned.
    min_larvae_count : int, default=-9999
        Only entries with a larvae count greater than or equal to this parameter will be included.
    has_photos : bool, default=False
        If True, only entries with recorded photos will be returned
    is_container : bool, default=False
        If True, only entries with containers will be returned

    Returns
    -------
    pd.DataFrame
        A DataFrame with the filters applied.
    """

    mhm_df = mhm_df[mhm_df["mhm_LarvaeCount"] >= min_larvae_count]

    if has_genus:
        mhm_df = mhm_df[mhm_df["mhm_HasGenus"] == 1]
    if has_photos:
        mhm_df = mhm_df[mhm_df["mhm_PhotoBitDecimal"] > 0]
    if is_container:
        mhm_df = mhm_df[mhm_df["mhm_IsWaterSourceContainer"] == 1]

    return mhm_df
def cleanup_column_prefix(df, inplace=False)

Method for shortening raw mosquito habitat mapper column names.

Parameters
  • df (pd.DataFrame): The DataFrame containing raw mosquito habitat mapper data.
  • inplace (bool, default=False): If False, a modified copy is returned. If True, the operation is performed in place and nothing is returned.
Returns
  • pd.DataFrame or None: A DataFrame with the cleaned up column prefixes. If inplace=True it returns None.
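
To make the effect of the prefix shortening concrete, here is a minimal sketch built on plain pandas. The helper `shorten_prefix` is hypothetical and does not call `replace_column_prefix`. The raw column names (e.g. `mosquitohabitatmapperGenus`) and the underscore in the resulting `mhm_` prefix are assumptions for illustration.

```python
import pandas as pd


def shorten_prefix(df, old="mosquitohabitatmapper", new="mhm_"):
    """Hypothetical stand-in: swap a long column prefix for a short one."""

    def rename(col):
        # Only touch columns that actually start with the long prefix.
        return new + col[len(old):] if col.startswith(old) else col

    # rename returns a new DataFrame, so the input is left untouched.
    return df.rename(columns=rename)


# Assumed raw column naming, for illustration only.
raw = pd.DataFrame(
    {
        "mosquitohabitatmapperGenus": ["Aedes"],
        "mosquitohabitatmapperLarvaeCount": ["12"],
        "siteName": ["Example Site"],
    }
)
cleaned = shorten_prefix(raw)
print(list(cleaned.columns))
```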
def larvae_to_num(mhm_df, larvae_count_col='mhm_LarvaeCount', magnitude='mhm_LarvaeCountMagnitude', range_flag='mhm_LarvaeCountIsRangeFlag', inplace=False)

Converts the Larvae Count of the Mosquito Habitat Mapper Dataset from being stored as a string to integers.

See the "Converting Larvae Data to Integers" section above for more information.

Parameters
  • mhm_df (pd.DataFrame): A DataFrame of Mosquito Habitat Mapper data that needs the larvae counts to be set to numbers
  • larvae_count_col (str, default="mhm_LarvaeCount"): The name of the column storing the larvae count. Note: The columns will be output in the format: prefix_ColumnName where prefix is all the characters that precede the words LarvaeCount in the specified name.
  • magnitude (str, default="mhm_LarvaeCountMagnitude"): The name of the column which will store the generated LarvaeCountMagnitude output
  • range_flag (str, default="mhm_LarvaeCountIsRangeFlag"): The name of the column which will store the generated LarvaeCountIsRange flag
  • inplace (bool, default=False): If False, a modified copy is returned. If True, the operation is performed in place and nothing is returned.
Returns
  • pd.DataFrame or None: A DataFrame with the larvae count as integers. If inplace=True it returns None.
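
The sketch below walks a few made-up raw entries through the documented conversion rules and collects the three output columns. `convert_entry` is an illustrative re-statement of those rules, not the library function, and the sample values are invented.

```python
import math
import re

import pandas as pd


def convert_entry(entry):
    """Illustrative only: returns (count, magnitude, is_range) for one raw entry."""
    if pd.isna(entry):
        return -9999, 0, 0
    if entry == "more than 100":
        return 101, 1, 1
    try:
        value = float(entry)
    except ValueError:
        # Ranges such as "25-50" are not valid floats: keep the lower bound.
        return float(re.sub(r"-.*", "", entry)), 0, 1
    if value > 100:
        # Large counts are stored as 101; the magnitude flag records the scale.
        return 101, min(math.floor(math.log10(value / 100)) + 1, 4), 0
    return value, 0, 0


raw = pd.DataFrame({"mhm_LarvaeCount": ["7", "25-50", None, "2500", "more than 100"]})
converted = pd.DataFrame(
    [convert_entry(entry) for entry in raw["mhm_LarvaeCount"]],
    columns=[
        "mhm_LarvaeCount",
        "mhm_LarvaeCountMagnitude",
        "mhm_LarvaeCountIsRangeFlag",
    ],
    index=raw.index,
)
print(converted)
```

Under these rules, the `"more than 100"` entry is stored as 101 with both the magnitude and the range flag set to 1.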
def has_genus_flag(df, genus_col='mhm_Genus', bit_col='mhm_HasGenus', inplace=False)

Creates a bit flag: mhm_HasGenus where 1 denotes a recorded Genus and 0 denotes the contrary.

Parameters
  • df (pd.DataFrame): A mosquito habitat mapper DataFrame
  • genus_col (str, default="mhm_Genus"): The name of the column in the mosquito habitat mapper DataFrame that contains the genus records.
  • bit_col (str, default="mhm_HasGenus"): The name of the column which will store the generated HasGenus flag
  • inplace (bool, default=False): If False, a modified copy is returned. If True, the operation is performed in place and nothing is returned.
Returns
  • pd.DataFrame or None: A DataFrame with the HasGenus flag. If inplace=True it returns None.
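
A minimal sketch of the same flag using pandas' own null check, on made-up data. It should be equivalent to the NumPy-based expression in the source for ordinary string and null entries, but it is shown only as an illustration.

```python
import pandas as pd

# Made-up genus records; None and NaN both represent a missing genus.
df = pd.DataFrame({"mhm_Genus": ["Aedes", None, "Culex", float("nan")]})

# 1 where a genus was recorded, 0 where it is missing.
df["mhm_HasGenus"] = df["mhm_Genus"].notna().astype(int)
print(df["mhm_HasGenus"].tolist())
```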
def infectious_genus_flag(df, genus_col='mhm_Genus', bit_col='mhm_IsGenusOfInterest', inplace=False)

Creates a bit flag: mhm_IsGenusOfInterest where 1 denotes a Genus of an infectious mosquito and 0 denotes the contrary.

Parameters
  • df (pd.DataFrame): A mosquito habitat mapper DataFrame
  • genus_col (str, default="mhm_Genus"): The name of the column in the mosquito habitat mapper DataFrame that contains the genus records.
  • bit_col (str, default="mhm_IsGenusOfInterest"): The name of the column which will store the generated IsGenusOfInterest flag
  • inplace (bool, default=False): Whether to modify df in place. If True, no copy is made and the operation is performed on df directly.
Returns
  • pd.DataFrame: A DataFrame with the IsGenusOfInterest flag. If inplace=True it returns None.
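The membership test behind this flag can be sketched without pandas. The helper name `is_genus_of_interest` and the sample values are illustrative assumptions, not part of go_utils; only the three genus names come from the listing above.

```python
# Genera treated as "of interest", taken from the source listing above.
GENERA_OF_INTEREST = ("Aedes", "Anopheles", "Culex")


def is_genus_of_interest(genus):
    """Return 1 if genus is one of the listed genera, else 0 (hypothetical helper)."""
    # Missing values (None or float NaN) never equal a string, so they map to 0.
    return int(genus in GENERA_OF_INTEREST)


# Illustrative inputs: two listed genera, one unlisted genus, and two missing values.
samples = ["Aedes", "Culex", "Mansonia", None, float("nan")]
flags = [is_genus_of_interest(g) for g in samples]
print(flags)  # [1, 1, 0, 0, 0]
```

The comparison is exact and case-sensitive, so a differently cased entry such as "aedes" would be flagged 0.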
def is_container_flag( df, watersource_col='mhm_WaterSourceType', bit_col='mhm_IsWaterSourceContainer', inplace=False)
def is_container_flag(
    df,
    watersource_col="mhm_WaterSourceType",
    bit_col="mhm_IsWaterSourceContainer",
    inplace=False,
):
    """
    Creates a bit flag: `mhm_IsWaterSourceContainer` where 1 denotes if a watersource is a container (e.g. ovitrap, pots, tires, etc.) and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_col : str, default="mhm_WaterSourceType"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource type records.
    bit_col : str, default="mhm_IsWaterSourceContainer"
        The name of the column which will store the generated IsWaterSourceContainer flag
    inplace : bool, default=False
        Whether to modify `df` in place. If True, no copy is made and the operation is performed on `df` directly.

    Returns
    -------
    pd.DataFrame
        A DataFrame with the IsWaterSourceContainer flag. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()

    # A record is a container if its type is not null and contains the text "container".
    mark_containers = np.vectorize(
        lambda container: not pd.isna(container) and "container" in container
    )
    df[bit_col] = mark_containers(df[watersource_col].to_numpy()).astype(int)

    if not inplace:
        return df

Creates a bit flag: mhm_IsWaterSourceContainer where 1 denotes if a watersource is a container (e.g. ovitrap, pots, tires, etc.) and 0 denotes the contrary.

Parameters
  • df (pd.DataFrame): A mosquito habitat mapper DataFrame
  • watersource_col (str, default="mhm_WaterSourceType"): The name of the column in the mosquito habitat mapper DataFrame that contains the watersource type records.
  • bit_col (str, default="mhm_IsWaterSourceContainer"): The name of the column which will store the generated IsWaterSourceContainer flag
  • inplace (bool, default=False): Whether to modify df in place. If True, no copy is made and the operation is performed on df directly.
Returns
  • pd.DataFrame: A DataFrame with the IsWaterSourceContainer flag. If inplace=True it returns None.
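The container check is a plain substring test guarded against missing values. A minimal sketch follows; the helper name `is_container` and the sample category strings are assumptions made for illustration, and real category names in the dataset may differ.

```python
import math


def is_container(watersource_type):
    """Return 1 if the water source type text contains "container" (hypothetical helper)."""
    # Treat None and float NaN as missing, mirroring the null guard in the listing.
    if watersource_type is None:
        return 0
    if isinstance(watersource_type, float) and math.isnan(watersource_type):
        return 0
    return int("container" in watersource_type)


# Illustrative values only.
samples = ["container: artificial", "still: lake/pond/swamp", "Container: natural", None]
container_flags = [is_container(s) for s in samples]
print(container_flags)  # [1, 0, 0, 0]
```

Because the substring test is case-sensitive, a capitalized "Container" is not matched; lowercasing the text first would be one way to relax that, if the data required it.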
def has_watersource_flag( df, watersource_col='mhm_WaterSource', bit_col='mhm_HasWaterSource', inplace=False)
def has_watersource_flag(
    df, watersource_col="mhm_WaterSource", bit_col="mhm_HasWaterSource", inplace=False
):
    """
    Creates a bit flag: `mhm_HasWaterSource` where 1 denotes if there is a watersource and 0 denotes the contrary.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_col : str, default="mhm_WaterSource"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource records.
    bit_col : str, default="mhm_HasWaterSource"
        The name of the column which will store the generated HasWaterSource flag
    inplace : bool, default=False
        Whether to modify `df` in place. If True, no copy is made and the operation is performed on `df` directly.

    Returns
    -------
    pd.DataFrame
        A DataFrame with the HasWaterSource flag. If `inplace=True` it returns None.
    """

    if not inplace:
        df = df.copy()
    # 1 for any non-null watersource entry, 0 for null entries.
    has_watersource = np.vectorize(lambda watersource: int(not pd.isna(watersource)))
    df[bit_col] = has_watersource(df[watersource_col].to_numpy())

    if not inplace:
        return df

Creates a bit flag: mhm_HasWaterSource where 1 denotes if there is a watersource and 0 denotes the contrary.

Parameters
  • df (pd.DataFrame): A mosquito habitat mapper DataFrame
  • watersource_col (str, default="mhm_WaterSource"): The name of the column in the mosquito habitat mapper DataFrame that contains the watersource records.
  • bit_col (str, default="mhm_HasWaterSource"): The name of the column which will store the generated HasWaterSource flag
  • inplace (bool, default=False): Whether to modify df in place. If True, no copy is made and the operation is performed on df directly.
Returns
  • pd.DataFrame: A DataFrame with the HasWaterSource flag. If inplace=True it returns None.
def photo_bit_flags( df, watersource_photos='mhm_WaterSourcePhotoUrls', larvae_photos='mhm_LarvaFullBodyPhotoUrls', abdomen_photos='mhm_AbdomenCloseupPhotoUrls', photo_count='mhm_PhotoCount', rejected_count='mhm_RejectedCount', pending_count='mhm_PendingCount', photo_bit_binary='mhm_PhotoBitBinary', photo_bit_decimal='mhm_PhotoBitDecimal', inplace=False)
def photo_bit_flags(
    df,
    watersource_photos="mhm_WaterSourcePhotoUrls",
    larvae_photos="mhm_LarvaFullBodyPhotoUrls",
    abdomen_photos="mhm_AbdomenCloseupPhotoUrls",
    photo_count="mhm_PhotoCount",
    rejected_count="mhm_RejectedCount",
    pending_count="mhm_PendingCount",
    photo_bit_binary="mhm_PhotoBitBinary",
    photo_bit_decimal="mhm_PhotoBitDecimal",
    inplace=False,
):
    """
    Creates the following flags:
    - `PhotoCount`: The number of valid photos per record.
    - `RejectedCount`: The number of photos that were rejected per record.
    - `PendingCount`: The number of photos that are pending approval per record.
    - `PhotoBitBinary`: A string that represents the presence of a photo in the order of watersource, larvae, and abdomen. For example, if the entry is `110`, that indicates that there is a water source photo and a larvae photo, but no abdomen photo.
    - `PhotoBitDecimal`: The numerical representation of the mhm_PhotoBitBinary string.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame
    watersource_photos : str, default="mhm_WaterSourcePhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the watersource photo url records.
    larvae_photos : str, default="mhm_LarvaFullBodyPhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the larvae photo url records.
    abdomen_photos : str, default="mhm_AbdomenCloseupPhotoUrls"
        The name of the column in the mosquito habitat mapper DataFrame that contains the abdomen photo url records.
    photo_count : str, default="mhm_PhotoCount"
        The name of the column that will store the PhotoCount flag.
    rejected_count : str, default="mhm_RejectedCount"
        The name of the column that will store the RejectedCount flag.
    pending_count : str, default="mhm_PendingCount"
        The name of the column that will store the PendingCount flag.
    photo_bit_binary : str, default="mhm_PhotoBitBinary"
        The name of the column that will store the PhotoBitBinary flag.
    photo_bit_decimal : str, default="mhm_PhotoBitDecimal"
        The name of the column that will store the PhotoBitDecimal flag.
    inplace : bool, default=False
        Whether to modify `df` in place. If True, no copy is made and the operation is performed on `df` directly.

    Returns
    -------
    pd.DataFrame
        A DataFrame with the photo flags. If `inplace=True` it returns None.
    """

    def pic_data(*args):
        pic_count = 0
        rejected_count = 0
        pending_count = 0
        valid_photo_bit_mask = ""

        # Build one bit per URL field: "1" if the field contains at least one
        # "http" URL, otherwise "0" (including when the field is null).
        # Also tally the photo URLs and the "pending" and "rejected" markers.
        for url_string in args:
            if not pd.isna(url_string):
                if "http" not in url_string:
                    valid_photo_bit_mask += "0"
                else:
                    valid_photo_bit_mask += "1"

                pic_count += url_string.count("http")
                pending_count += url_string.count("pending")
                rejected_count += url_string.count("rejected")
            else:
                valid_photo_bit_mask += "0"

        return (
            pic_count,
            rejected_count,
            pending_count,
            valid_photo_bit_mask,
            int(valid_photo_bit_mask, 2),
        )

    if not inplace:
        df = df.copy()

    get_photo_data = np.vectorize(pic_data)
    (
        df[photo_count],
        df[rejected_count],
        df[pending_count],
        df[photo_bit_binary],
        df[photo_bit_decimal],
    ) = get_photo_data(
        df[watersource_photos].to_numpy(),
        df[larvae_photos].to_numpy(),
        df[abdomen_photos].to_numpy(),
    )

    if not inplace:
        return df

Creates the following flags:

  • PhotoCount: The number of valid photos per record.
  • RejectedCount: The number of photos that were rejected per record.
  • PendingCount: The number of photos that are pending approval per record.
  • PhotoBitBinary: A string that represents the presence of a photo in the order of watersource, larvae, and abdomen. For example, if the entry is 110, that indicates that there is a water source photo and a larvae photo, but no abdomen photo.
  • PhotoBitDecimal: The numerical representation of the mhm_PhotoBitBinary string.
Parameters
  • df (pd.DataFrame): A mosquito habitat mapper DataFrame
  • watersource_photos (str, default="mhm_WaterSourcePhotoUrls"): The name of the column in the mosquito habitat mapper DataFrame that contains the watersource photo url records.
  • larvae_photos (str, default="mhm_LarvaFullBodyPhotoUrls"): The name of the column in the mosquito habitat mapper DataFrame that contains the larvae photo url records.
  • abdomen_photos (str, default="mhm_AbdomenCloseupPhotoUrls"): The name of the column in the mosquito habitat mapper DataFrame that contains the abdomen photo url records.
  • photo_count (str, default="mhm_PhotoCount"): The name of the column that will store the PhotoCount flag.
  • rejected_count (str, default="mhm_RejectedCount"): The name of the column that will store the RejectedCount flag.
  • pending_count (str, default="mhm_PendingCount"): The name of the column that will store the PendingCount flag.
  • photo_bit_binary (str, default="mhm_PhotoBitBinary"): The name of the column that will store the PhotoBitBinary flag.
  • photo_bit_decimal (str, default="mhm_PhotoBitDecimal"): The name of the column that will store the PhotoBitDecimal flag.
  • inplace (bool, default=False): Whether to modify df in place. If True, no copy is made and the operation is performed on df directly.
Returns
  • pd.DataFrame: A DataFrame with the photo flags. If inplace=True it returns None.
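To make the photo flags concrete, the per-record logic can be traced on a single hypothetical entry. This is a minimal pandas-free sketch: the helper name `photo_flags`, the example URLs, and the "; " separator are assumptions made for illustration, and None stands in for a null field.

```python
def photo_flags(*url_fields):
    """Summarize URL fields given in the order watersource, larvae, abdomen (hypothetical helper)."""
    photo_count = rejected_count = pending_count = 0
    bit_mask = ""
    for field in url_fields:
        # A missing field, or a field without any URL, contributes a 0 bit.
        if field is None or "http" not in field:
            bit_mask += "0"
        else:
            bit_mask += "1"
        if field is not None:
            photo_count += field.count("http")
            pending_count += field.count("pending")
            rejected_count += field.count("rejected")
    return photo_count, rejected_count, pending_count, bit_mask, int(bit_mask, 2)


# Illustrative record: two watersource photos, a rejected larvae photo, no abdomen photo.
watersource = "https://example.org/a.jpg; https://example.org/b.jpg"
larvae = "rejected"
abdomen = None
result = photo_flags(watersource, larvae, abdomen)
print(result)  # (2, 1, 0, '100', 4)
```

Here the bit string `100` records that only the watersource field holds a photo, and its decimal value 4 is what would be stored as PhotoBitDecimal.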
def completion_score_flag( df, photo_bit_binary='mhm_PhotoBitBinary', has_genus='mhm_HasGenus', sub_completeness='mhm_SubCompletenessScore', completeness='mhm_CumulativeCompletenessScore', inplace=False)
def completion_score_flag(
    df,
    photo_bit_binary="mhm_PhotoBitBinary",
    has_genus="mhm_HasGenus",
    sub_completeness="mhm_SubCompletenessScore",
    completeness="mhm_CumulativeCompletenessScore",
    inplace=False,
):
    """
    Adds the following completeness score flags:
    - `SubCompletenessScore`: The fraction (0 to 1) of the watersource photos, larvae photos, abdomen photos, and genus columns that are filled out.
    - `CumulativeCompletenessScore`: The fraction (0 to 1) of non-null values out of all the columns, rounded to two decimal places.

    Parameters
    ----------
    df : pd.DataFrame
        A mosquito habitat mapper DataFrame with the [`PhotoBitBinary`](#photo_bit_flags) and [`HasGenus`](#has_genus_flag) flags.
    photo_bit_binary : str, default="mhm_PhotoBitBinary"
        The name of the column in the mosquito habitat mapper DataFrame that contains the PhotoBitBinary flag.
    has_genus : str, default="mhm_HasGenus"
        The name of the column in the mosquito habitat mapper DataFrame that contains the HasGenus flag.
    sub_completeness : str, default="mhm_SubCompletenessScore"
        The name of the column in the mosquito habitat mapper DataFrame that will contain the generated SubCompletenessScore flag.
    completeness : str, default="mhm_CumulativeCompletenessScore"
        The name of the column in the mosquito habitat mapper DataFrame that will contain the generated CumulativeCompletenessScore flag.
    inplace : bool, default=False
        Whether to modify `df` in place. If True, no copy is made and the operation is performed on `df` directly.

    Returns
    -------
    pd.DataFrame
        A DataFrame with completion score flags. If `inplace=True` it returns None.
    """

    def sum_bit_mask(bit_mask="0"):
        # Count the "1" characters in the bit string.
        total = 0.0
        for char in bit_mask:
            total += int(char)
        return total

    if not inplace:
        df = df.copy()

    scores = {}
    scores["sub_score"] = []
    # Cumulative Completion Score
    scores["cumulative_score"] = round(df.count(axis=1) / len(df.columns), 2)
    # Sub-Score
    for index in df.index:
        bit_mask = df[photo_bit_binary][index]
        sub_score = df[has_genus][index] + sum_bit_mask(bit_mask=bit_mask)
        sub_score /= 4.0
        scores["sub_score"].append(sub_score)

    df[sub_completeness], df[completeness] = (
        scores["sub_score"],
        scores["cumulative_score"],
    )

    if not inplace:
        return df

Adds the following completeness score flags:

  • SubCompletenessScore: The fraction (0 to 1) of the watersource photos, larvae photos, abdomen photos, and genus columns that are filled out.
  • CumulativeCompletenessScore: The fraction (0 to 1) of non-null values out of all the columns, rounded to two decimal places.
Parameters
  • df (pd.DataFrame): A mosquito habitat mapper DataFrame with the PhotoBitBinary and HasGenus flags.
  • photo_bit_binary (str, default="mhm_PhotoBitBinary"): The name of the column in the mosquito habitat mapper DataFrame that contains the PhotoBitBinary flag.
  • has_genus (str, default="mhm_HasGenus"): The name of the column in the mosquito habitat mapper DataFrame that contains the HasGenus flag.
  • sub_completeness (str, default="mhm_SubCompletenessScore"): The name of the column in the mosquito habitat mapper DataFrame that will contain the generated SubCompletenessScore flag.
  • completeness (str, default="mhm_CumulativeCompletenessScore"): The name of the column in the mosquito habitat mapper DataFrame that will contain the generated CumulativeCompletenessScore flag.
  • inplace (bool, default=False): Whether to modify df in place. If True, no copy is made and the operation is performed on df directly.
Returns
  • pd.DataFrame: A DataFrame with completion score flags. If inplace=True it returns None.
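As a worked example of the two scores, the arithmetic can be reproduced on plain Python values. The helper names and the sample record below (including the `mhm_Comments` column) are hypothetical, and None stands in for a null value.

```python
def sub_completeness(photo_bit_binary, has_genus):
    """Fraction of the four tracked fields that are filled in (hypothetical helper)."""
    # Each "1" in the bit string marks one photo type that is present.
    photos_present = sum(int(bit) for bit in photo_bit_binary)
    return (has_genus + photos_present) / 4.0


def cumulative_completeness(record):
    """Fraction of non-null values in a record, rounded to two decimals (hypothetical helper)."""
    filled = sum(value is not None for value in record.values())
    return round(filled / len(record), 2)


# Watersource and larvae photos present, genus identified: 3 of 4 fields filled.
sub_score = sub_completeness("110", 1)
print(sub_score)  # 0.75

# Two of three columns are non-null.
record = {"mhm_Genus": "Aedes", "mhm_LarvaeCount": 12, "mhm_Comments": None}
cumulative_score = cumulative_completeness(record)
print(cumulative_score)  # 0.67
```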
def apply_cleanup(mhm_df)
def apply_cleanup(mhm_df):
    """Applies a full cleanup procedure to the mosquito habitat mapper data and returns a cleaned copy; the input DataFrame is not modified.
    It performs the following steps:
    - Renames Latitude and Longitudes
    - Cleans the Column Naming
    - Converts Larvae Count to Numbers
    - Rounds Columns
    - Standardizes Null Values

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame containing **raw** Mosquito Habitat Mapper Data from the API.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the cleaned up Mosquito Habitat Mapper Data
    """
    mhm_df = mhm_df.copy()

    rename_latlon_cols(mhm_df, inplace=True)
    cleanup_column_prefix(mhm_df, inplace=True)
    larvae_to_num(mhm_df, inplace=True)
    round_cols(mhm_df, inplace=True)
    standardize_null_vals(mhm_df, inplace=True)
    return mhm_df

Applies a full cleanup procedure to the mosquito habitat mapper data and returns a cleaned copy; the input DataFrame is not modified. It performs the following steps:

  • Renames Latitude and Longitudes
  • Cleans the Column Naming
  • Converts Larvae Count to Numbers
  • Rounds Columns
  • Standardizes Null Values
Parameters
  • mhm_df (pd.DataFrame): A DataFrame containing raw Mosquito Habitat Mapper Data from the API.
Returns
  • pd.DataFrame: A DataFrame containing the cleaned up Mosquito Habitat Mapper Data
def add_flags(mhm_df)
def add_flags(mhm_df):
    """Adds the following flags to the Mosquito Habitat Mapper Data:
    - Has Genus
    - Is Infectious Genus/Genus of Interest
    - Is Container
    - Has WaterSource
    - Photo Bit Flags
    - Completion Score Flag

    This returns a copy of the original DataFrame with the flags added onto it.

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A DataFrame containing cleaned up Mosquito Habitat Mapper Data, ideally from the `apply_cleanup` method.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the flagged Mosquito Habitat Mapper Data
    """
    mhm_df = mhm_df.copy()
    has_genus_flag(mhm_df, inplace=True)
    infectious_genus_flag(mhm_df, inplace=True)
    is_container_flag(mhm_df, inplace=True)
    has_watersource_flag(mhm_df, inplace=True)
    photo_bit_flags(mhm_df, inplace=True)
    # Must run last: it reads the PhotoBitBinary and HasGenus flags created above.
    completion_score_flag(mhm_df, inplace=True)
    return mhm_df

Adds the following flags to the Mosquito Habitat Mapper Data:

  • Has Genus
  • Is Infectious Genus/Genus of Interest
  • Is Container
  • Has WaterSource
  • Photo Bit Flags
  • Completion Score Flag

This returns a copy of the original DataFrame with the flags added onto it.

Parameters
  • mhm_df (pd.DataFrame): A DataFrame containing cleaned up Mosquito Habitat Mapper Data, ideally from the apply_cleanup method.
Returns
  • pd.DataFrame: A DataFrame containing the flagged Mosquito Habitat Mapper Data
def plot_valid_entries(df, bit_col, entry_type)
def plot_valid_entries(df, bit_col, entry_type):
    """
    Plots the number of entries for which a flag is set against the number of entries for which it is not (e.g. entries with photos vs. entries without photos).

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing flagged Mosquito Habitat Mapper Data.
    bit_col : str
        The name of the flag column. An entry counts as valid when its value is greater than 0 (e.g. "mhm_PhotoBitDecimal" or "mhm_HasGenus").
    entry_type : str
        The label used for the valid entries in the plot title and bar names (e.g. "Valid Photos").
    """
    plt.figure()
    num_valid = len(df[df[bit_col] > 0])
    plt.title(f"Entries with {entry_type} vs No {entry_type}")
    plt.ylabel("Number of Entries")
    plt.bar(entry_type, num_valid, color="#e34a33")
    plt.bar(f"No {entry_type}", len(df) - num_valid, color="#fdcc8a")

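Plots the number of entries for which a flag is set against the number of entries for which it is not (e.g. entries with photos vs. entries without photos).

Parameters
  • df (pd.DataFrame): The DataFrame containing flagged Mosquito Habitat Mapper Data.
  • bit_col (str): The name of the flag column. An entry counts as valid when its value is greater than 0 (e.g. "mhm_PhotoBitDecimal" or "mhm_HasGenus").
  • entry_type (str): The label used for the valid entries in the plot title and bar names (e.g. "Valid Photos").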
def photo_subjects(mhm_df)
def photo_subjects(mhm_df):
    """
    Plots, on a log scale, the number of entries that have a photo for each photo area (Larvae, Abdomen, Watersource)

    Parameters
    ----------
    mhm_df : pd.DataFrame
        The DataFrame containing Mosquito Habitat Mapper Data with the PhotoBitDecimal Flag.
    """

    total_dict = {"Larvae Photos": 0, "Abdomen Photos": 0, "Watersource Photos": 0}

    for number in mhm_df["mhm_PhotoBitDecimal"]:
        # Shift each masked bit down to 0 or 1 so every entry is counted once
        # (a bare `number & 4` would add 4 per entry rather than 1).
        total_dict["Watersource Photos"] += (number & 4) >> 2
        total_dict["Larvae Photos"] += (number & 2) >> 1
        total_dict["Abdomen Photos"] += number & 1

    for key in total_dict.keys():
        if total_dict[key] != 0:
            total_dict[key] = math.log10(total_dict[key])
        else:
            total_dict[key] = 0
    plt.figure(figsize=(10, 5))
    plt.title("Mosquito Habitat Mapper - Photo Subject Frequencies (Log Scale)")
    plt.xlabel("Photo Type")
    plt.ylabel("Frequency (Log Scale)")
    plt.bar(total_dict.keys(), total_dict.values(), color="lightblue")

Plots, on a log scale, the number of entries that have a photo for each photo area (Larvae, Abdomen, Watersource)

Parameters
  • mhm_df (pd.DataFrame): The DataFrame containing Mosquito Habitat Mapper Data with the PhotoBitDecimal Flag.
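Counting entries per photo subject amounts to reading the three bits of PhotoBitDecimal. A masked value such as `number & 4` is either 0 or 4, so the bit has to be shifted down to position 0 before it is added to a tally. The sketch below illustrates this; the helper name `decode_photo_bits` and the sample values are assumptions.

```python
def decode_photo_bits(photo_bit_decimal):
    """Split a PhotoBitDecimal value into per-subject 0/1 flags (hypothetical helper)."""
    return {
        "Watersource Photos": (photo_bit_decimal >> 2) & 1,
        "Larvae Photos": (photo_bit_decimal >> 1) & 1,
        "Abdomen Photos": photo_bit_decimal & 1,
    }


# Illustrative flag values: binary 110, 100, 001, and 000.
values = [6, 4, 1, 0]
totals = {"Watersource Photos": 0, "Larvae Photos": 0, "Abdomen Photos": 0}
for value in values:
    for subject, flag in decode_photo_bits(value).items():
        totals[subject] += flag
print(totals)  # {'Watersource Photos': 2, 'Larvae Photos': 1, 'Abdomen Photos': 1}
```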
def diagnostic_plots(mhm_df)
def diagnostic_plots(mhm_df):
    """
    Generates (but doesn't display) diagnostic plots to gain insight into the current data.

    Plots:
    - Larvae Count Distribution (where a negative entry denotes null data)
    - Photo Subject Distribution
    - Number of valid photos vs no photos
    - Completeness Score Distribution
    - Subcompleteness Score Distribution

    Parameters
    ----------
    mhm_df : pd.DataFrame
        The DataFrame containing Flagged and Cleaned Mosquito Habitat Mapper Data.
    """
    plot_int_distribution(mhm_df, "mhm_LarvaeCount", "Larvae Count")
    photo_subjects(mhm_df)
    plot_freq_bar(mhm_df, "Mosquito Habitat Mapper", "mhm_Genus", "Genus Types")
    plot_valid_entries(mhm_df, "mhm_HasGenus", "Genus Classifications")
    plot_valid_entries(mhm_df, "mhm_PhotoBitDecimal", "Valid Photos")
    completeness_histogram(
        mhm_df,
        "Mosquito Habitat Mapper",
        "mhm_CumulativeCompletenessScore",
        "Cumulative Completeness",
    )
    completeness_histogram(
        mhm_df,
        "Mosquito Habitat Mapper",
        "mhm_SubCompletenessScore",
        "Sub Completeness",
    )

Generates (but doesn't display) diagnostic plots to gain insight into the current data.

Plots:

  • Larvae Count Distribution (where a negative entry denotes null data)
  • Photo Subject Distribution
  • Number of valid photos vs no photos
  • Completeness Score Distribution
  • Subcompleteness Score Distribution
Parameters
  • mhm_df (pd.DataFrame): The DataFrame containing Flagged and Cleaned Mosquito Habitat Mapper Data.
def qa_filter( mhm_df, has_genus=False, min_larvae_count=-9999, has_photos=False, is_container=False)
def qa_filter(
    mhm_df,
    has_genus=False,
    min_larvae_count=-9999,
    has_photos=False,
    is_container=False,
):
    """
    Filters a cleaned and flagged mosquito habitat mapper DataFrame based on the following criteria:
    - `Has Genus`: If the entry has an identified genus
    - `Min Larvae Count` : Minimum larvae count needed for an entry
    - `Has Photos` : If the entry contains valid photo entries
    - `Is Container` : If the entry's watersource was a container

    Returns a copy of the DataFrame

    Parameters
    ----------
    mhm_df : pd.DataFrame
        A cleaned and flagged mosquito habitat mapper DataFrame.
    has_genus : bool, default=False
        If True, only entries with an identified genus will be returned.
    min_larvae_count : int, default=-9999
        Only entries with a larvae count greater than or equal to this parameter will be included.
    has_photos : bool, default=False
        If True, only entries with recorded photos will be returned
    is_container : bool, default=False
        If True, only entries with containers will be returned

    Returns
    -------
    pd.DataFrame
        A DataFrame of the applied filters.
    """

    # The default of -9999 matches the null larvae count, so null counts are kept by default.
    mhm_df = mhm_df[mhm_df["mhm_LarvaeCount"] >= min_larvae_count]

    if has_genus:
        mhm_df = mhm_df[mhm_df["mhm_HasGenus"] == 1]
    if has_photos:
        mhm_df = mhm_df[mhm_df["mhm_PhotoBitDecimal"] > 0]
    if is_container:
        mhm_df = mhm_df[mhm_df["mhm_IsWaterSourceContainer"] == 1]

    return mhm_df

Filters a cleaned and flagged mosquito habitat mapper DataFrame based on the following criteria:

  • Has Genus: If the entry has an identified genus
  • Min Larvae Count : Minimum larvae count needed for an entry
  • Has Photos : If the entry contains valid photo entries
  • Is Container : If the entry's watersource was a container

Returns a copy of the DataFrame

Parameters
  • mhm_df (pd.DataFrame): A cleaned and flagged mosquito habitat mapper DataFrame.
  • has_genus (bool, default=False): If True, only entries with an identified genus will be returned.
  • min_larvae_count (int, default=-9999): Only entries with a larvae count greater than or equal to this parameter will be included.
  • has_photos (bool, default=False): If True, only entries with recorded photos will be returned
  • is_container (bool, default=False): If True, only entries with containers will be returned
Returns
  • pd.DataFrame: A DataFrame of the applied filters.
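The filters combine as a logical AND over the flag columns. The sketch below builds the equivalent boolean mask by hand on a small made-up DataFrame, as a stand-in for calling `qa_filter(mhm_df, has_genus=True, min_larvae_count=1, has_photos=True)`; the data values are purely illustrative.

```python
import pandas as pd

# Illustrative records using the flag columns named in the listing above.
mhm_df = pd.DataFrame(
    {
        "mhm_LarvaeCount": [-9999, 5, 40, 120],
        "mhm_HasGenus": [0, 1, 1, 0],
        "mhm_PhotoBitDecimal": [0, 4, 0, 7],
        "mhm_IsWaterSourceContainer": [1, 1, 0, 1],
    }
)

# Keep entries with at least one larva, an identified genus, and at least one photo.
mask = (
    (mhm_df["mhm_LarvaeCount"] >= 1)
    & (mhm_df["mhm_HasGenus"] == 1)
    & (mhm_df["mhm_PhotoBitDecimal"] > 0)
)
filtered = mhm_df[mask]
print(filtered.index.tolist())  # [1]
```

Only the second record passes all three conditions: the first has a null larvae count, the third has no photos, and the fourth has no genus.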