Loading Part 2: Sizer Data#

Welcome to Part 2 of our data loading series! This section builds upon the concepts and techniques introduced in Part 1. If you haven’t gone through the first part, we highly recommend you do so to get a firm foundation. Here, we’ll dive into handling 2-dimensional data, such as size distributions, which are common in fields like environmental science and engineering.

Setting Up Your Working Path#

Before we begin, let’s set up the working path, which is the location on your computer where your data files are stored. In this example, we’ll use data provided in the current directory of this notebook, but you can point the path to any directory containing your files. For instance, if a project called “Campaign2023_of_awesome” keeps its files in a “data” folder on a network drive, you would set the path like this:

path = "U:\\data\\processing\\Campaign2023_of_awesome\\data"

Let’s start by importing the necessary Python libraries and modules. We’ll explain each one as we use them throughout this notebook.

import os  # For handling file and directory paths
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For plotting and visualizing data
# Particula package components
from particula.data import loader, loader_interface, settings_generator
# For accessing example data
from particula.data.tests.example_data.get_example_data import get_data_folder

Now, we’ll determine the current working directory and set the path for the data folder. This step is essential for ensuring that our scripts know where to look for the data files.

# Retrieving and printing the current working directory of this script
current_path = os.getcwd()
print('Current path for this script:')
print(current_path.rsplit('particula')[-1])

# Setting and printing the path to the data folder
path = get_data_folder()
print('Path to data folder:')
print(path.rsplit('particula')[-1])
Current path for this script:
/docs/examples/streamlake
Path to data folder:
/data/tests/example_data

Load the Data#

Now that we’ve set our working directory, the next step is to load the data. We’ll be using the loader module from the Particula package for this task. The function loader.data_raw_loader() is specifically designed to read data from a file, which we’ll specify using its file path. This approach is straightforward and efficient for loading data into Python for analysis.

# Constructing the full file path for the data file
# We're joining the path set earlier with the specific file we want to load
data_file = os.path.join(
    path,  # The base path we set earlier
    'SMPS_data',  # The subdirectory where the data file is located
    '2022-07-07_095151_SMPS.csv')  # The name of the data file

# Optional: Print the file path to confirm it's correct
# print("Data file path:", data_file)

# Using the loader module to load the data from the file
# The data_raw_loader function takes the file path as its argument and
# reads the data
raw_data = loader.data_raw_loader(data_file)

# Printing a snippet of the loaded data for a quick preview
# This helps us to confirm that the data is loaded and to get an idea of
# its structure
print("Preview of loaded data:")
for row in raw_data[22:30]:  # Displaying rows 22 through 29 as a sample
    print(row)
Preview of loaded data:
Units,dW/dlogDp
Weight,Number
Sample #,Date,Start Time,Sample Temp (C),Sample Pressure (kPa),Relative Humidity (%),Mean Free Path (m),Gas Viscosity (Pa*s),Diameter Midpoint (nm),20.72,21.10,21.48,21.87,22.27,22.67,23.08,23.50,23.93,24.36,24.80,25.25,25.71,26.18,26.66,27.14,27.63,28.13,28.64,29.16,29.69,30.23,30.78,31.34,31.91,32.49,33.08,33.68,34.29,34.91,35.55,36.19,36.85,37.52,38.20,38.89,39.60,40.32,41.05,41.79,42.55,43.32,44.11,44.91,45.73,46.56,47.40,48.26,49.14,50.03,50.94,51.86,52.80,53.76,54.74,55.73,56.74,57.77,58.82,59.89,60.98,62.08,63.21,64.36,65.52,66.71,67.93,69.16,70.41,71.69,72.99,74.32,75.67,77.04,78.44,79.86,81.31,82.79,84.29,85.82,87.38,88.96,90.58,92.22,93.90,95.60,97.34,99.10,100.90,102.74,104.60,106.50,108.43,110.40,112.40,114.44,116.52,118.64,120.79,122.98,125.21,127.49,129.80,132.16,134.56,137.00,139.49,142.02,144.60,147.22,149.89,152.61,155.38,158.20,161.08,164.00,166.98,170.01,173.09,176.24,179.43,182.69,186.01,189.38,192.82,196.32,199.89,203.51,207.21,210.97,214.80,218.70,222.67,226.71,230.82,235.01,239.28,243.62,248.05,252.55,257.13,261.80,266.55,271.39,276.32,281.33,286.44,291.64,296.93,302.32,307.81,313.40,319.08,324.88,330.77,336.78,342.89,349.12,355.45,361.90,368.47,375.16,381.97,388.91,395.96,403.15,410.47,417.92,425.51,433.23,441.09,449.10,457.25,465.55,474.00,482.61,491.37,500.29,509.37,518.61,528.03,537.61,547.37,557.31,567.42,577.72,588.21,598.89,609.76,620.82,632.09,643.57,655.25,667.14,679.25,691.58,704.14,716.92,729.93,743.18,756.67,770.40,784.39,Scan Time (s),Retrace Time (s),Scan Resolution (Hz),Scans Per Sample,HV Polarity,Sheath Flow (L/min),Aerosol Flow (L/min),Bypass Flow (L/min),Low Voltage (V),High Voltage (V),Lower Size (nm),Upper Size (nm),Density (g/cm³),td + 0.5 (s),tf (s),D50 (nm),Median (nm),Mean (nm),Geo. Mean (nm),Mode (nm),Geo. Std. Dev.,Total Conc. (#/cm³),Neutralizer Status,Dilution Factor,Test Name,Test Description,Dataset Name,Dataset Description,Instrument Errors
1,07/07/2022,08:49:17,23.7,101.2,61.9,6.75690e-8,1.83579e-5,,6103.186,2832.655,4733.553,4765.944,5960.964,4475.806,4412.044,5853.069,4832.167,3781.343,3675.830,3271.549,3084.392,3668.269,4116.143,3310.157,3978.368,4151.566,2515.995,3755.837,2776.663,5032.745,3775.426,2818.553,2641.302,2636.806,3079.759,2606.094,2317.234,3192.346,2226.703,2484.878,3394.395,1762.834,3172.359,2919.533,2452.013,3403.780,2360.277,2543.386,2563.290,2649.769,1375.374,1364.046,1446.529,2068.167,1336.070,1542.077,1707.249,1482.481,2272.182,1754.409,2472.438,1191.563,2221.825,1635.293,2548.571,1991.926,2546.956,1790.114,2115.075,1138.769,1934.746,2163.955,1613.179,2132.750,1654.348,1698.154,2403.529,1222.983,1829.254,1197.162,1638.797,1248.565,2417.521,1130.421,1429.423,1694.923,1658.378,1443.393,1731.346,1277.799,1089.149,1072.630,1205.387,1693.146,1109.648,915.428,491.529,881.028,1218.297,755.658,714.301,686.247,790.943,398.805,1043.226,1298.495,1548.704,1070.899,846.596,938.241,232.947,926.941,837.452,794.492,254.455,392.637,353.144,872.576,693.986,1544.164,657.340,546.445,311.890,365.934,616.794,610.810,938.786,815.964,593.441,939.634,188.115,1077.429,1213.142,737.913,1876.626,735.779,996.521,1098.601,1166.494,962.551,1392.535,947.504,655.459,993.819,682.087,852.503,601.057,733.860,529.122,960.578,687.512,839.973,652.820,289.921,623.835,453.604,588.057,856.253,283.994,282.839,365.801,200.382,365.756,146.548,306.730,373.162,114.272,0.000,182.692,260.788,164.857,19.851,89.612,0.000,181.974,0.000,53.276,20.016,0.000,0.000,95.433,96.222,0.000,0.000,197.307,0.000,100.336,0.000,102.072,0.000,0.000,209.529,0.000,213.227,107.561,0.000,218.965,220.930,111.453,0.000,113.460,0.000,115.513,0.000,0.000,0.000,0.000,28.221,93.413,122.992,0.000,75,4,50,1,Negative,2.000,0.300,0.00,10.07,9863.01,20.5,791.5,1.0,1.81,10.79,1000.0,41.562,74.959,52.078,20.721,2.179,2.16900e+3,ON,1,TRACER-CAT,,2022 07 07 09_51,,Detector aerosol flow rate error;Incomplete Scan
2,07/07/2022,08:50:48,23.6,101.2,61.7,6.75401e-8,1.83531e-5,,5621.118,5867.747,6233.403,3453.156,4484.307,5468.148,4725.052,4689.983,3661.759,4356.725,4292.911,7728.414,5112.679,4746.084,3957.005,3472.977,3496.697,4674.202,4188.868,2868.559,3375.113,4306.112,5191.077,4732.512,4566.029,3514.167,5172.877,3825.270,5323.756,2327.737,3846.602,2347.097,3182.011,1876.273,2952.863,2831.255,2497.869,4158.061,3828.510,3199.720,2309.195,2462.550,3060.240,1086.744,1476.289,2069.774,1727.787,2710.631,2067.327,2619.082,2345.026,2362.235,1429.749,2557.408,2660.327,1209.933,1590.320,1696.569,2236.773,1499.046,1922.632,1650.213,3147.351,2201.919,1622.954,2198.739,1800.998,1429.621,1426.761,1923.931,1262.939,1745.284,1458.571,1523.548,1920.108,1382.558,2211.525,2571.277,1979.297,1562.697,1741.573,1307.680,967.481,838.919,1502.136,1301.401,1011.619,829.770,973.269,1100.004,1152.808,749.250,1187.900,806.256,111.008,297.062,809.059,1361.412,779.536,535.087,881.522,1307.518,800.804,1053.953,182.381,1042.830,673.021,646.171,825.612,963.187,748.743,540.954,769.157,788.222,825.566,236.537,865.009,289.185,803.098,398.510,446.847,439.645,1118.961,1003.003,924.180,745.149,430.134,415.522,805.970,790.348,998.975,1043.136,604.082,1004.545,1082.455,1312.781,1447.390,872.420,398.380,695.719,857.412,645.872,691.129,623.007,471.728,641.049,1023.693,394.611,475.599,446.076,657.686,313.003,136.395,248.550,579.894,336.126,485.938,298.810,0.000,227.571,104.550,157.583,289.697,0.229,0.000,0.000,217.592,67.816,24.067,0.000,0.000,0.000,0.000,0.000,97.009,0.000,0.000,0.000,0.000,0.000,0.000,205.900,0.000,104.761,0.000,0.000,0.000,108.509,0.000,110.461,0.000,0.000,0.000,114.471,115.506,116.540,0.000,118.660,0.000,0.000,0.000,0.000,75.377,75,4,50,1,Negative,2.000,0.300,0.00,10.07,9863.01,20.5,791.5,1.0,1.81,10.79,1000.0,39.458,69.080,49.198,25.255,2.101,2.39408e+3,ON,1,TRACER-CAT,,2022 07 07 09_51,,Detector aerosol flow rate error;Incomplete Scan
3,07/07/2022,08:52:19,23.7,101.2,61.5,6.75690e-8,1.83579e-5,,5165.139,4969.987,4312.386,6939.394,4680.764,3224.473,4999.149,3653.002,4241.532,3928.137,2718.607,3363.947,4863.410,5338.452,4659.515,3430.329,3997.386,4644.421,4943.511,3883.970,3212.310,4445.981,2349.435,3605.419,4366.557,4969.924,4880.573,3186.281,3089.412,2724.537,3195.740,4277.947,4864.436,4263.532,2100.807,1967.634,3283.337,3268.660,3001.917,2781.549,1879.354,1376.083,2051.524,2165.874,2012.210,2923.129,1575.515,1544.252,1610.635,1572.609,1299.370,1549.832,1145.100,2897.864,1839.992,2351.579,2102.027,1543.106,953.811,2073.610,2317.378,2087.617,1586.363,1897.860,2456.722,1647.781,1013.534,1734.023,1633.021,1841.697,2193.442,2714.856,1396.336,2264.046,1671.363,1538.012,1257.148,1423.316,1217.281,1745.437,1787.473,1284.774,1534.815,1274.852,1438.025,1199.602,964.066,862.098,685.995,679.146,879.775,806.703,979.672,894.103,1379.499,1112.031,744.999,580.777,1241.262,960.784,750.484,908.236,957.901,652.265,1200.515,429.487,347.453,552.393,617.871,652.163,709.227,788.963,1499.238,627.895,1315.208,976.800,555.360,440.680,1182.819,863.800,362.530,942.047,460.380,1222.507,678.820,1006.555,319.371,91.941,761.841,205.384,449.120,751.217,572.530,350.734,295.089,413.379,612.088,474.457,678.504,490.408,751.536,400.656,585.567,676.707,364.052,124.385,631.790,788.487,566.062,390.904,141.751,256.369,366.589,528.781,512.078,257.120,393.412,350.601,361.659,65.138,348.203,326.629,329.714,175.810,111.365,74.091,103.212,0.000,0.000,47.532,0.000,166.826,0.000,96.217,388.070,97.832,98.649,99.490,200.678,202.399,0.000,102.953,0.000,0.000,105.683,106.611,33.630,183.108,2.602,218.305,222.901,0.000,226.925,0.000,0.000,116.553,0.000,118.661,119.732,120.801,0.000,122.992,124.085,75,4,50,1,Negative,2.000,0.300,0.00,10.07,9863.01,20.5,791.5,1.0,1.81,10.79,1000.0,39.324,72.102,50.019,21.870,2.136,2.27861e+3,ON,1,TRACER-CAT,,2022 07 07 09_51,,Detector aerosol flow rate error;Incomplete Scan
4,07/07/2022,08:53:50,23.8,101.2,61.4,6.75979e-8,1.83627e-5,,5814.745,5937.421,5542.118,7127.484,5341.069,4793.690,4938.844,5721.541,4877.746,5900.250,5104.984,4914.366,4891.892,6655.579,4431.173,3389.961,4947.809,3115.245,4138.126,5421.474,4589.063,4007.156,2524.137,5009.064,4780.963,4959.096,3648.285,4148.676,4270.099,2229.465,3043.487,5618.376,3689.188,4700.549,2535.915,1754.223,2560.335,2853.385,2454.711,2515.907,3015.370,1502.864,2344.161,2761.448,2047.076,1542.531,2151.757,2365.884,2330.816,2585.566,1431.955,2391.335,2097.717,1891.014,2211.815,2071.479,2188.302,2475.058,1906.364,1781.793,2356.998,1527.723,2609.446,1644.771,1917.624,1843.984,2418.197,1385.516,1263.621,2155.939,2083.223,1765.167,957.777,2077.747,1667.811,1122.065,1579.113,1709.471,1604.406,686.151,390.075,1194.313,1657.144,1462.232,1870.846,1012.132,847.165,1248.528,1039.604,779.076,1375.101,1058.272,1013.378,1211.420,1641.490,979.146,835.539,763.524,951.720,1270.393,1308.492,1056.486,1715.924,657.112,1475.767,235.866,827.129,1266.089,1080.958,1246.249,1147.116,840.719,1560.246,1201.554,1743.366,1233.526,1166.422,1068.551,1047.492,787.018,759.836,491.419,714.111,460.361,681.068,767.815,654.715,501.038,357.016,575.937,613.281,851.029,583.739,475.691,431.584,616.144,744.932,409.334,984.682,371.750,613.130,757.474,637.077,441.004,609.132,380.961,595.419,565.033,566.955,332.402,450.524,139.761,430.419,443.058,558.628,158.467,271.708,346.807,57.637,148.050,226.825,353.827,77.661,0.000,0.000,74.100,0.000,250.296,117.433,93.156,187.816,0.000,95.443,0.000,0.000,293.505,0.000,99.496,100.342,0.000,102.078,102.959,0.000,0.000,0.000,106.622,322.709,0.000,328.474,0.000,67.473,44.378,0.000,0.000,115.519,0.000,0.000,118.668,0.000,0.000,0.000,0.000,0.000,75,4,50,1,Negative,2.000,0.300,0.00,10.07,9863.01,20.5,791.5,1.0,1.81,10.79,1000.0,37.995,68.796,48.896,21.870,2.107,2.51144e+3,ON,1,TRACER-CAT,,2022 07 07 09_51,,Detector aerosol flow rate error;Incomplete Scan
5,07/07/2022,08:55:21,24.0,101.1,61.4,6.77227e-8,1.83722e-5,,8034.425,6317.981,6972.600,4577.324,6488.519,4985.397,5484.518,7295.312,3449.590,4261.716,4259.456,6124.670,4418.824,5418.742,3311.293,3548.897,4940.747,6738.536,3377.823,3309.433,5322.339,4148.187,3387.285,3967.636,5064.382,4573.259,3896.245,4006.531,3769.030,4129.946,4678.454,3121.839,3888.625,2443.782,1947.617,2321.130,1845.465,2833.269,2745.881,3262.145,4055.876,2319.187,3397.282,2596.623,2935.256,1508.733,1555.232,3184.200,2683.631,2158.530,2303.663,2739.336,2714.276,2536.377,2051.076,2063.667,2074.972,2852.267,2366.702,2135.668,1500.801,2228.817,2220.527,1501.131,2354.567,2072.434,2547.917,2111.890,1474.809,1561.614,1334.889,1100.318,1077.335,1470.618,1377.825,1684.933,1093.441,1596.409,1456.255,1543.298,1116.499,984.258,1294.805,1586.816,723.664,1709.369,1060.965,1415.310,1611.158,1791.258,1098.238,1513.790,1335.019,1178.572,1538.772,477.803,1130.380,1596.999,652.664,1098.951,1384.104,772.285,788.185,1432.363,773.331,729.470,819.882,979.684,925.309,753.771,706.255,659.741,1026.707,818.647,1205.428,940.460,906.655,758.763,811.344,1123.245,520.356,1009.392,651.265,735.336,209.657,549.624,537.181,841.849,483.705,713.011,497.248,743.196,556.459,953.140,847.692,614.097,423.810,816.193,627.059,453.998,976.898,592.170,548.197,535.480,667.837,312.390,476.781,369.028,451.687,432.520,1001.512,312.053,498.408,198.771,399.968,363.778,403.848,381.782,223.839,227.667,212.819,101.097,164.909,359.326,285.450,0.000,44.177,0.000,158.441,220.559,81.404,49.687,95.468,0.000,194.095,391.452,98.679,0.000,0.000,0.000,0.000,102.990,103.881,104.800,0.000,106.644,0.000,108.554,0.000,110.496,0.000,112.492,113.494,0.000,115.548,0.000,0.000,0.000,239.531,241.683,0.000,0.000,248.252,75,4,50,1,Negative,2.000,0.300,0.00,10.07,9863.01,20.5,791.5,1.0,1.81,10.79,1000.0,39.214,69.960,48.959,20.721,2.123,2.56068e+3,ON,1,TRACER-CAT,,2022 07 07 09_51,,Detector aerosol flow rate error;Incomplete Scan
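
As the preview suggests, loader.data_raw_loader returns the file contents as a list of line strings, which is why we can slice and print rows directly. A quick sanity check might look like this (a sketch):

# raw_data holds one string per line of the file
print(type(raw_data), len(raw_data))
print(raw_data[0][:80])  # first 80 characters of the first line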

Formatting the Data#

When dealing with 2-dimensional data, such as size distributions, the formatting process is more involved than for 1-dimensional data. For our data, we need to extract the size bins and use them as headers for our dataset. This involves specifying the start and end points within the header row that delimit the size bins. In our specific example, the start point is indicated by the first size-bin label, “20.72”, and the end point by the last one, “784.39”. Understanding where your data starts and ends is crucial for accurate formatting and analysis.
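
To see the idea behind this keyword-based extraction, here is a minimal sketch (an illustration only, not the actual Particula implementation): locate the first and last size-bin labels in the header row, then slice everything in between.

# Minimal sketch of keyword-based bin extraction (illustration only)
header_cells = raw_data[24].split(',')   # header row sits at index 24
start = header_cells.index('20.72')      # first size-bin label
end = header_cells.index('784.39')       # last size-bin label
size_bins = header_cells[start:end + 1]  # inclusive slice of bin labels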

# Formatting the data for time series analysis
# The sizer_data_formatter function from the loader module is used for
# this purpose

epoch_time, data, header = loader.sizer_data_formatter(
    data=raw_data,  # The raw data that was loaded earlier
    data_checks={  # Performing checks on the data
        "characters": [250],  # Expected character count per line
        # Number of rows to skip at the beginning (headers, etc.)
        "skip_rows": 25,
        "skip_end": 0,  # Number of rows to skip at the end
        "char_counts": {"/": 2, ":": 2}  # Ensuring line formats are consistent
    },
    data_sizer_reader={  # Reading the size distribution data
        'Dp_start_keyword': '20.72',  # Starting keyword for size bins
        'Dp_end_keyword': '784.39',  # Ending keyword for size bins
        'convert_scale_from': 'dw/dlogdp'  # Input scale; data are converted to dN/dlogDp
    },
    time_column=[1, 2],  # Columns that contain the time data
    time_format="%m/%d/%Y %H:%M:%S",  # Format of the time data
    delimiter=",",  # Delimiter used in the data file
    header_row=24)  # Row number that contains the header

# Printing a preview of the formatted data to confirm successful formatting
print('Epoch time (First 5 Entries):')
print(epoch_time[:5])  # Displaying the first 5 time entries
print('Data shape:')
print(data.shape)  # Showing the shape of the data array
print('Header (First 10 Entries):')
print(header[:10])  # Displaying the first 10 headers
Epoch time (First 5 Entries):
[1.65718376e+09 1.65718385e+09 1.65718394e+09 1.65718403e+09
 1.65718412e+09]
Data shape:
(2854, 203)
Header (First 10 Entries):
['20.72', '21.10', '21.48', '21.87', '22.27', '22.67', '23.08', '23.50', '23.93', '24.36']
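
The header entries are strings. For numerical work, such as plotting against diameter, convert them to float diameters in nm (the contour-plot example later does exactly this):

# Convert the string bin labels to float diameters in nm
diameters = np.array(header).astype(float)
print(diameters[:5])  # e.g. [20.72 21.1  21.48 21.87 22.27]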

Pause to Plot#

Visualizing your data is a crucial step in the data analysis process. It allows you to see patterns, trends, and potential anomalies that might not be evident from the raw data alone. Now that we have formatted the data and extracted the time information, let’s create a plot. This will help us get a visual sense of the data’s characteristics, such as the concentration of particles in different size bins over time.

# Creating a plot using matplotlib
fig, ax = plt.subplots()  # Creating a figure and axis for the plot

# Plotting data from a specific size bin against time
# 'epoch_time' is used on the x-axis (time data)
# 'data[:, 50]' selects the bin at index 50 (as an example) for the y-axis
ax.plot(epoch_time,
        data[:, 50],  # Selecting the bin at index 50 to plot
        label=f'Bin {header[50]} nm',  # Adding a label with the bin size
        )

# Setting labels for the x-axis and y-axis
ax.set_xlabel("Time (epoch)")  # Label for the x-axis
ax.set_ylabel("Bin Concentration (#/cm³)")  # Label for the y-axis

# Adding a legend to the plot for clarity
ax.legend()

# Adjusting the layout to ensure all plot elements are visible and
# well-arranged
fig.tight_layout()

# Displaying the plot
plt.show()
[Figure: concentration in the selected size bin plotted against epoch time]

Dates in Plots#

When working with time-series data, it’s often helpful to have dates on the x-axis of your plots for better readability and understanding. However, to display dates effectively in plots using matplotlib, we need to convert our time data into a format that matplotlib can recognize and work with.

One common format for this purpose is np.datetime64. This format represents dates and times in a way that is compatible with numpy arrays, making it ideal for plotting time-related data. In our case, we can convert our epoch time (seconds since a fixed reference point, typically January 1, 1970, UTC) to np.datetime64 using the datetime64_from_epoch_array function from particula.util.time_manage.
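
For reference, plain numpy can perform a similar cast, since epoch seconds map directly onto datetime64[s] (a rough sketch; the Particula helper used below may additionally handle details such as timezones):

# Rough numpy-only equivalent: cast float epoch seconds to datetime64[s]
time_alt = epoch_time.astype('int64').astype('datetime64[s]')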

Additionally, to make the plot more readable, especially when there are many data points, it’s a good practice to rotate the x-axis labels. This prevents overlapping and makes each date and time label clear. We can achieve this rotation using plt.xticks(rotation=45).

# Importing the necessary function for converting epoch time to datetime64
from particula.util.time_manage import datetime64_from_epoch_array

# Converting the epoch time to datetime64 format
time_in_datetime64 = datetime64_from_epoch_array(epoch_time)

# Creating a plot
fig, ax = plt.subplots()

# Plotting the data with time in datetime64 format on the x-axis
ax.plot(time_in_datetime64,
        data[:, 50],  # Selecting the bin at index 50
        label=f'Bin {header[50]} nm',  # Label for the data series
        )

# Rotating the x-axis labels to 45 degrees for better readability
plt.xticks(rotation=45)

# Setting the x-axis and y-axis labels
ax.set_xlabel("Time (UTC)")  # Updated label to indicate the time format
ax.set_ylabel("Bin Concentration (#/cm³)")

# Adding a legend to the plot
ax.legend()

# Adjusting the layout for a neat presentation
fig.tight_layout()

# Displaying the plot
plt.show()
[Figure: bin concentration plotted against time in UTC]

Contour Plot of Data#

Contour plots are a powerful tool for visualizing how data changes over time and space. In the context of size distribution data, a contour plot can effectively show the variation in particle concentration across different sizes over time. It’s like looking at a topographic map where different colors or shades represent varying concentrations of particles at different sizes and times.

Preparing the Data for Contour Plotting#

Before we plot, it’s a good practice to set limits on our data to ensure that extreme values don’t skew the visualization. This helps in highlighting the relevant ranges of our data. Here’s how we do it:

  1. Setting Lower and Upper Limits: We impose a lower limit to avoid plotting extremely low concentrations (which might be less relevant or below detection limits) and an upper limit to avoid letting very high concentrations dominate the plot.

  2. Option for Logarithmic Scale: For data with a wide range of values, using a logarithmic scale (e.g., np.log10(concentration)) can make the plot more informative by compressing the scale and emphasizing the variations across orders of magnitude.
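
Both limits can be applied in a single step with np.clip, a compact alternative to the pair of np.where calls used in the plotting code below:

# Compact alternative: clip the data to [1e-5, 1e5] in one call
concentration = np.clip(data, 1e-5, 1e5)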

Creating the Contour Plot#

With our data prepared, we can now create the contour plot. This type of plot will use different colors or shades to represent the concentration of particles at various sizes and times.

  • X-Axis (Epoch Time): Represents the time dimension of our data.

  • Y-Axis (Diameter in nm): Represents the different size bins of particles.

  • Color Intensity: Indicates the concentration of particles at each size and time.

Using plt.contourf, we create a filled contour plot with a logarithmic y-scale, which is particularly useful for size distribution data that typically spans several orders of magnitude in particle sizes.

import numpy as np
import matplotlib.pyplot as plt

# Setting limits on the concentration data to improve plot readability
concentration = data
concentration = np.where(
    concentration < 1e-5,
    1e-5,
    concentration)  # Setting a lower limit
concentration = np.where(
    concentration > 10**5,
    10**5,
    concentration)  # Setting an upper limit
# Uncomment the next line to plot concentration in logarithmic scale
# concentration = np.log10(concentration)

# Creating a figure and axis for the contour plot
fig, ax = plt.subplots(1, 1)

# Creating the contour plot
plt.contourf(
    epoch_time,  # X-axis: Time data in epoch format
    # Y-axis: Particle sizes converted to float
    np.array(header).astype(float),
    concentration.T,  # Transposed concentration data for correct orientation
    cmap=plt.cm.PuBu_r,  # Color map for the plot
    levels=50  # Number of levels in the contour plot
)

# Setting the y-axis to logarithmic scale for better visualization of size
# distribution
plt.yscale('log')

# Setting labels for the x-axis and y-axis
ax.set_xlabel('Epoch Time')  # Label for the x-axis
ax.set_ylabel('Diameter (nm)')  # Label for the y-axis

# Adding a color bar to the plot, indicating concentration levels
plt.colorbar(label='Concentration dN/dlogDp [#/cm³]', ax=ax)

# Adjusting the layout for a better presentation of the plot elements
fig.tight_layout()

# Displaying the plot
plt.show()
[Figure: contour plot of the size distribution over epoch time]

Simplifying Data Import with the Settings Generator#

In the same way we handled 1-dimensional (1d) data, we can also streamline the import process for 2-dimensional (2d) data using the settings generator. This tool is particularly useful for creating a structured approach to loading complex datasets. By using the settings_generator.for_general_sizer_1d_2d_load() function, we can generate a comprehensive settings dictionary that directs how data should be imported and formatted.

Understanding the Settings Generator Function#

This function is designed to be flexible and to accommodate a wide range of data types and formats. It offers numerous arguments for specifying details like data checks, column information, and time format, tailored to your specific dataset. Don’t be overwhelmed by these options, though: you rarely need to set all of them.

Using Default Settings#

For many users, especially those just starting out or working with standard data formats, the defaults of for_general_sizer_1d_2d_load often work as-is. These defaults are configured to match the example data shipped with the Particula package, so if your data structure is similar to the example data, you can call the function without passing any arguments and it will set up the settings for you.

This approach not only saves time but also reduces the potential for errors in the data import process, making it a quick and reliable way to get your data ready for analysis.
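
In that case, the call can be as simple as this (a sketch relying on those defaults):

from particula.data import settings_generator

# Relying on the defaults, which match the bundled example SMPS data
settings_1d, settings_2d = settings_generator.for_general_sizer_1d_2d_load()

The next cell instead spells out every argument explicitly, which also documents what can be customized.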

# Importing the necessary module for settings generation
from particula.data import settings_generator

# Generating settings for loading 1d and 2d data from files
# This is useful for instruments that output both types of data in the
# same file
settings_1d, settings_2d = settings_generator.for_general_sizer_1d_2d_load(
    relative_data_folder='SMPS_data',  # Folder where the data files are located
    filename_regex='*.csv',  # Pattern to match filenames (e.g., all CSV files)
    file_min_size_bytes=10,  # Minimum file size in bytes for consideration
    header_row=24,  # Row number containing the header in the data file
    data_checks={  # Checks to ensure data integrity
        "characters": [250],  # Expected character count per line
        "skip_rows": 25,  # Rows to skip at the start of the file
        "skip_end": 0,  # Rows to skip at the end of the file
        "char_counts": {"/": 2, ":": 2}  # Ensuring date formats are consistent
    },
    data_1d_column=[  # Columns for 1d data in the data file
        "Lower Size (nm)", "Upper Size (nm)", "Sample Temp (C)",
        "Sample Pressure (kPa)", "Relative Humidity (%)", "Median (nm)",
        "Mean (nm)", "Geo. Mean (nm)", "Mode (nm)", "Geo. Std. Dev.",
        "Total Conc. (#/cm³)"
    ],
    data_1d_header=[  # Headers for 1d data columns once in the Stream
        "Lower_Size_(nm)", "Upper_Size_(nm)", "Sample_Temp_(C)",
        "Sample_Pressure_(kPa)", "Relative_Humidity_(%)", "Median_(nm)",
        "Mean_(nm)", "Geo_Mean_(nm)", "Mode_(nm)", "Geo_Std_Dev.",
        "Total_Conc_(#/cc)"
    ],
    data_2d_dp_start_keyword="20.72",  # Starting keyword for 2d size bins
    data_2d_dp_end_keyword="784.39",  # Ending keyword for 2d size bins
    data_2d_convert_concentration_from="dw/dlogdp",  # Conversion for 2d concentration
    time_column=[1, 2],  # Columns containing time data
    time_format="%m/%d/%Y %H:%M:%S",  # Format of the time data
    delimiter=",",  # Delimiter used in the data file
    time_shift_seconds=0,  # Time shift, if needed
    timezone_identifier="UTC",  # Timezone for the time data
)

# Printing the generated settings dictionaries for both 1d and 2d data
print('Settings for 1d data:')
for key, value in settings_1d.items():
    print(f'{key}: {value}')

print('\nSettings for 2d data:')
for key, value in settings_2d.items():
    print(f'{key}: {value}')
Settings for 1d data:
relative_data_folder: SMPS_data
filename_regex: *.csv
MIN_SIZE_BYTES: 10
data_loading_function: general_1d_load
header_row: 24
data_checks: {'characters': [250], 'skip_rows': 25, 'skip_end': 0, 'char_counts': {'/': 2, ':': 2}}
data_column: ['Lower Size (nm)', 'Upper Size (nm)', 'Sample Temp (C)', 'Sample Pressure (kPa)', 'Relative Humidity (%)', 'Median (nm)', 'Mean (nm)', 'Geo. Mean (nm)', 'Mode (nm)', 'Geo. Std. Dev.', 'Total Conc. (#/cm³)']
data_header: ['Lower_Size_(nm)', 'Upper_Size_(nm)', 'Sample_Temp_(C)', 'Sample_Pressure_(kPa)', 'Relative_Humidity_(%)', 'Median_(nm)', 'Mean_(nm)', 'Geo_Mean_(nm)', 'Mode_(nm)', 'Geo_Std_Dev.', 'Total_Conc_(#/cc)']
time_column: [1, 2]
time_format: %m/%d/%Y %H:%M:%S
delimiter: ,
time_shift_seconds: 0
timezone_identifier: UTC

Settings for 2d data:
relative_data_folder: SMPS_data
filename_regex: *.csv
MIN_SIZE_BYTES: 10
data_loading_function: general_2d_load
header_row: 24
data_checks: {'characters': [250], 'skip_rows': 25, 'skip_end': 0, 'char_counts': {'/': 2, ':': 2}}
data_sizer_reader: {'Dp_start_keyword': '20.72', 'Dp_end_keyword': '784.39', 'convert_scale_from': 'dw/dlogdp'}
time_column: [1, 2]
time_format: %m/%d/%Y %H:%M:%S
delimiter: ,
time_shift_seconds: 0
timezone_identifier: UTC
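
Because the generated settings are plain dictionaries of strings, numbers, lists, and nested dictionaries, you can persist them for reuse across sessions, for example as JSON (a sketch; the filename is arbitrary):

import json

# Save the 2d settings for a later run; any filename works
with open('smps_2d_settings.json', 'w') as f:
    json.dump(settings_2d, f, indent=2)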

Efficient Data Loading with the Interface#

After configuring our settings dictionaries for 1-dimensional and 2-dimensional data, we’re set to leverage the interface for data loading. This interface, a key component of the Particula package, streamlines the process, making it more efficient and straightforward, especially after you have a good grasp of how the settings work.

Understanding the Interface’s Role#

The interface acts as a facilitator that intelligently uses the settings we’ve established to manage the data loading process. It eliminates the need for manual execution of multiple steps, thereby integrating and automating the data import based on our predefined preferences.

The Advantages of Mastery#

Once you’re comfortable with setting up your data parameters, using the interface offers several key benefits:

  • Enhanced Efficiency: It consolidates several operations into a single action, significantly speeding up the data loading process.

  • Consistent Results: By automating the data import process with predefined settings, it ensures uniformity and accuracy across different datasets.

  • Optimized Workflow: For users who understand the settings, the interface offers a simplified and more effective way to handle data loading. It removes the repetitive task of manually calling functions, allowing you to focus more on data analysis.

In the following section, we’ll demonstrate how to utilize the interface with our prepared settings to efficiently load and process our data.

# Importing the necessary module for the loader interface
from particula.data import loader_interface
from particula.data.tests.example_data.get_example_data import get_data_folder

# Setting the working path to the directory where the data files are located
working_path = get_data_folder()

# Using the settings dictionaries created earlier for 1d and 2d data

# Loading 1-dimensional data using the loader interface
# The interface takes the path and settings for 1d data and loads the data
# accordingly
data_stream_1d = loader_interface.load_files_interface(
    path=working_path,  # The path where data files are stored
    settings=settings_1d,  # Settings dictionary for 1d data
)

# Loading 2-dimensional data using the loader interface
# Similar to the 1d data, but using the settings for 2d data
data_stream_2d = loader_interface.load_files_interface(
    path=working_path,  # The path where data files are stored
    settings=settings_2d,  # Settings dictionary for 2d data
)

# The data_stream_1d and data_stream_2d objects now contain the loaded data
# ready for further analysis and visualization
  Loading file: 2022-07-10_094659_SMPS.csv
  Loading file: 2022-07-07_095151_SMPS.csv
  Loading file: 2022-07-10_094659_SMPS.csv
  Loading file: 2022-07-07_095151_SMPS.csv

Printing Data Stream Summaries#

After loading our data using the loader interface, it’s a good practice to take a moment and review what we have loaded. This helps us confirm that the data is imported correctly and gives us an initial overview of its structure. We’ll do this by printing summaries of the data_stream_1d and data_stream_2d objects.

Understanding the Data Stream Summary#

When we print a data_stream object, it provides us with a summary of its contents. This includes information like the size of the data, the headers (which represent different data types or measurements), and a glimpse into the actual data values. These summaries are especially useful for:

  • Verifying Data Integrity: Ensuring that the data has been loaded as expected and is ready for analysis.

  • Quick Overview: Getting a high-level understanding of the data’s structure, such as the number of data points and the range of measurements.

Code for Printing Summaries#

Here’s how we print the summaries for our 1-dimensional and 2-dimensional data streams:

# Print a blank line for better readability
print('')

# Print the summary of the 1-dimensional data stream
print('Data stream 1d summary:')
print(data_stream_1d)  # This will display a summary of the 1d data

# Print another blank line for separation
print('')

# Print the summary of the 2-dimensional data stream
print('Data stream 2d summary:')
print(data_stream_2d)  # This will display a summary of the 2d data
Data stream 1d summary:
Stream(header=['Lower_Size_(nm)', 'Upper_Size_(nm)', 'Sample_Temp_(C)', 'Sample_Pressure_(kPa)', 'Relative_Humidity_(%)', 'Median_(nm)', 'Mean_(nm)', 'Geo_Mean_(nm)', 'Mode_(nm)', 'Geo_Std_Dev.', 'Total_Conc_(#/cc)'], data=array([[2.05000e+01, 7.91500e+02, 2.37000e+01, ..., 2.07210e+01,
        2.17900e+00, 2.16900e+03],
       [2.05000e+01, 7.91500e+02, 2.36000e+01, ..., 2.52550e+01,
        2.10100e+00, 2.39408e+03],
       [2.05000e+01, 7.91500e+02, 2.37000e+01, ..., 2.18700e+01,
        2.13600e+00, 2.27861e+03],
       ...,
       [2.05000e+01, 7.91500e+02, 2.35000e+01, ..., 2.07210e+01,
        2.31800e+00, 2.08056e+03],
       [2.05000e+01, 7.91500e+02, 2.33000e+01, ..., 2.10970e+01,
        2.31800e+00, 2.10616e+03],
       [2.05000e+01, 7.91500e+02, 2.35000e+01, ..., 2.07210e+01,
        2.24800e+00, 2.45781e+03]]), time=array([1.65718376e+09, 1.65718385e+09, 1.65718394e+09, ...,
       1.65753440e+09, 1.65753450e+09, 1.65753459e+09]), files=[['2022-07-10_094659_SMPS.csv', 2003798], ['2022-07-07_095151_SMPS.csv', 5617925]])

Data stream 2d summary:
Stream(header=['20.72', '21.10', '21.48', '21.87', '22.27', '22.67', '23.08', '23.50', '23.93', '24.36', '24.80', '25.25', '25.71', '26.18', '26.66', '27.14', '27.63', '28.13', '28.64', '29.16', '29.69', '30.23', '30.78', '31.34', '31.91', '32.49', '33.08', '33.68', '34.29', '34.91', '35.55', '36.19', '36.85', '37.52', '38.20', '38.89', '39.60', '40.32', '41.05', '41.79', '42.55', '43.32', '44.11', '44.91', '45.73', '46.56', '47.40', '48.26', '49.14', '50.03', '50.94', '51.86', '52.80', '53.76', '54.74', '55.73', '56.74', '57.77', '58.82', '59.89', '60.98', '62.08', '63.21', '64.36', '65.52', '66.71', '67.93', '69.16', '70.41', '71.69', '72.99', '74.32', '75.67', '77.04', '78.44', '79.86', '81.31', '82.79', '84.29', '85.82', '87.38', '88.96', '90.58', '92.22', '93.90', '95.60', '97.34', '99.10', '100.90', '102.74', '104.60', '106.50', '108.43', '110.40', '112.40', '114.44', '116.52', '118.64', '120.79', '122.98', '125.21', '127.49', '129.80', '132.16', '134.56', '137.00', '139.49', '142.02', '144.60', '147.22', '149.89', '152.61', '155.38', '158.20', '161.08', '164.00', '166.98', '170.01', '173.09', '176.24', '179.43', '182.69', '186.01', '189.38', '192.82', '196.32', '199.89', '203.51', '207.21', '210.97', '214.80', '218.70', '222.67', '226.71', '230.82', '235.01', '239.28', '243.62', '248.05', '252.55', '257.13', '261.80', '266.55', '271.39', '276.32', '281.33', '286.44', '291.64', '296.93', '302.32', '307.81', '313.40', '319.08', '324.88', '330.77', '336.78', '342.89', '349.12', '355.45', '361.90', '368.47', '375.16', '381.97', '388.91', '395.96', '403.15', '410.47', '417.92', '425.51', '433.23', '441.09', '449.10', '457.25', '465.55', '474.00', '482.61', '491.37', '500.29', '509.37', '518.61', '528.03', '537.61', '547.37', '557.31', '567.42', '577.72', '588.21', '598.89', '609.76', '620.82', '632.09', '643.57', '655.25', '667.14', '679.25', '691.58', '704.14', '716.92', '729.93', '743.18', '756.67', '770.40', '784.39'], data=array([[ 6103.186,  2832.655,  4733.553, ...,    93.413,   122.992,
            0.   ],
       [ 5621.118,  5867.747,  6233.403, ...,     0.   ,     0.   ,
           75.377],
       [ 5165.139,  4969.987,  4312.386, ...,     0.   ,   122.992,
          124.085],
       ...,
       [ 9962.036,  7986.823,  8682.258, ...,     0.   ,     0.   ,
          124.153],
       [ 8765.782, 11175.603,  8148.945, ...,     0.   ,     0.   ,
          372.433],
       [14380.528, 11524.35 , 13632.727, ...,     0.   ,     0.   ,
            0.   ]]), time=array([1.65718376e+09, 1.65718385e+09, 1.65718394e+09, ...,
       1.65753440e+09, 1.65753450e+09, 1.65753459e+09]), files=[['2022-07-10_094659_SMPS.csv', 2003798], ['2022-07-07_095151_SMPS.csv', 5617925]])
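
The summaries show that each Stream carries header, data, time, and files attributes. You can work with these directly; for example, pulling a single 1d column by its header name (a sketch using the attribute names visible above):

# Look up a column by header name and slice it out of the data array
conc_index = data_stream_1d.header.index('Total_Conc_(#/cc)')
total_conc = data_stream_1d.data[:, conc_index]
print(total_conc[:5])  # first five total-concentration values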

Plotting Again#

Plotting is an integral part of the data analysis process: it transforms raw data into visual representations that can reveal insights, patterns, and anomalies that might not be immediately apparent in numerical form. With both files now combined into a single stream, let’s recreate the contour plot, this time using the stream’s built-in time and header conversions.

import numpy as np
import matplotlib.pyplot as plt

# Adjusting the concentration data for better visualization in the plot
concentration = data_stream_2d.data
concentration = np.where(
    concentration < 1e-5,
    1e-5,
    concentration)  # Setting a lower limit
concentration = np.where(
    concentration > 10**5,
    10**5,
    concentration)  # Setting an upper limit
# Uncomment the following line to plot the concentration on a logarithmic scale
# concentration = np.log10(concentration)

# Creating a figure and axis for the contour plot
fig, ax = plt.subplots(1, 1)

# Creating the contour plot
# X-axis: Time data in datetime64 format from data_stream_2d
# Y-axis: Particle sizes (diameter in nm) converted from the header strings to floats
# Z-axis: Concentration data
plt.contourf(
    data_stream_2d.datetime64,  # Time data
    data_stream_2d.header_float,  # Particle sizes
    concentration.T,  # Concentration data, transposed for correct orientation
    cmap=plt.cm.PuBu_r,  # Color map for the plot
    levels=50  # Number of contour levels
)

# Setting the y-axis to logarithmic scale for better visualization of size
# distribution
plt.yscale('log')

# Rotating the x-axis labels for better readability
plt.tick_params(rotation=35)

# Setting labels for the x-axis and y-axis
ax.set_xlabel("Time (UTC)")
ax.set_ylabel('Diameter (nm)')

# Adding a color bar to indicate concentration levels
plt.colorbar(label='Concentration dN/dlog(Dp) [#/cm³]', ax=ax)

# Adjusting the layout to ensure all elements of the plot are clearly visible
fig.tight_layout()

# Displaying the plot
plt.show()
[Figure: contour plot of the size distribution over time in UTC]
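
If you want to keep any of these figures, matplotlib’s savefig writes them to disk (the filename here is just an example):

# Save the contour plot; the format is inferred from the extension
fig.savefig('smps_contour.png', dpi=300, bbox_inches='tight')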

Summary of Loading Data Part 2#

In this section, we delved into the process of loading and handling 2-dimensional data, focusing on a size distribution dataset. We walked through several crucial steps, providing a comprehensive guide to managing and visualizing complex data structures. Here’s a recap of what we covered:

  • Setting the Working Path: We began by establishing the working directory for our data, a foundational step in ensuring our scripts access the correct files.

  • Loading the Data: Using the loader module, we demonstrated how to import raw data from a file, setting the stage for further processing.

  • Formatting the Data: We then tackled the challenge of formatting 2-dimensional data, extracting size bins as headers, and preparing the dataset for analysis.

  • Initial Plotting: To get a preliminary understanding of our data, we created initial plots. This step is crucial for visually inspecting the data and confirming its integrity.

  • Generating the Settings Dictionary: We utilized the settings_generator to create settings dictionaries for both 1-dimensional and 2-dimensional data. This streamlines the data loading process, especially for complex datasets.

  • Loading Data with the Interface: We showcased how to use the loader interface to efficiently load data using the predefined settings, emphasizing the ease and efficiency it brings to the data loading process.

  • Advanced Plotting of the Data Stream: Lastly, we explored more advanced data visualization techniques, including creating contour plots. This allowed us to visualize how particle concentration varied across different sizes and times, offering valuable insights into our dataset.

Throughout this section, we focused on making each step clear and accessible, particularly for those new to working with 2-dimensional datasets in Python. By following these steps, you can effectively manage and analyze complex data, gaining deeper insights into your research or projects.