Stream: Averaging and Outliers

This example shows how to clean up a stream of data by removing outliers and averaging the values over time.

Working path#

Set the working path to the folder where the data is stored. For now, we’ll use the provided example data.

The path can be anywhere on your computer, however. For example, if your data is in a folder called “data” on another drive, you could set the path to: path = "U:\\data\\processing\\Campgain2023_of_aswsome\\data"

# all the imports, but we'll go through them one by one as we use them
import os
import matplotlib.pyplot as plt
from particula.data import loader_interface, settings_generator, stream_stats
from particula.data.tests.example_data.get_example_data import get_data_folder

# set the parent directory of the data folder
path = get_data_folder()
print('Path to data folder:')
print(path.rsplit('particula')[-1])

Path to data folder:
/data/tests/example_data
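To point the example at your own data instead, one option is to build the path from its parts so the separators are correct on any operating system. The following is a minimal sketch; the folder name `data` inside the home directory is a hypothetical location used only for illustration.

```python
import os

# Hypothetical location: a folder called "data" in the user's home directory.
home = os.path.expanduser("~")
custom_path = os.path.join(home, "data")

# os.path.join inserts the correct separator for the current platform,
# so backslashes do not need to be escaped by hand.
print(custom_path)
print("Folder exists:", os.path.isdir(custom_path))
```

The resulting string could then be passed as `path` in place of the value returned by `get_data_folder()`.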

Load the data#

For this example we’ll use the provided example data, but you can change the path to any folder on your computer. We can then use the settings generator to describe how the files should be read, and the loader interface to load them.

# This method uses the settings_generator module to generate the settings.
settings = settings_generator.for_general_1d_load(
    relative_data_folder='CPC_3010_data',
    filename_regex='*.csv',
    file_min_size_bytes=10,
    data_checks={
        "characters": [10, 100],
        "char_counts": {",": 4},
        "skip_rows": 0,
        "skip_end": 0,
    },
    data_column=[1, 2],
    data_header=['CPC_count[#/sec]', 'Temperature[degC]'],
    time_column=[0],
    time_format='epoch',
    delimiter=',',
    time_shift_seconds=0,
    timezone_identifier='UTC',
)

# now call the loader interface
data_stream = loader_interface.load_files_interface(
    path=path,
    settings=settings,
)

  Loading file: CPC_3010_data_20220710_Jul.csv

  Loading file: CPC_3010_data_20220709_Jul.csv
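The `data_checks` entry in the settings is easier to picture with a small example. The sketch below is not particula's implementation; it assumes, based only on the key names, that `characters` is an inclusive range for the line length and that `char_counts` gives the exact number of times a character must appear in a line. The helper `passes_checks` and the sample rows are hypothetical.

```python
def passes_checks(line, characters, char_counts):
    """Return True if a raw text line satisfies both checks.

    characters: [min_length, max_length] allowed for the line (assumed inclusive).
    char_counts: mapping of character -> required number of occurrences.
    """
    line = line.strip()
    min_len, max_len = characters
    if not min_len <= len(line) <= max_len:
        return False
    return all(line.count(ch) == n for ch, n in char_counts.items())


# Made-up rows for illustration only (not taken from the example files).
good_row = "1657342800,33510,17.0,0,0"
short_row = "1657,1"
wrong_commas = "1657342800,33510,17.0"

checks = {"characters": [10, 100], "char_counts": {",": 4}}
for row in (good_row, short_row, wrong_commas):
    print(repr(row), passes_checks(row, **checks))
```

Under these assumptions, only the first row would be kept: the second is too short and the third has too few commas.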

# print data stream summary
print('Stream:')
print(data_stream)

Stream:
Stream(header=['CPC_count[#/sec]', 'Temperature[degC]'], data=array([[3.3510e+04, 1.7000e+01],
       [3.3465e+04, 1.7100e+01],
       [3.2171e+04, 1.7000e+01],
       ...,
       [1.9403e+04, 1.6900e+01],
       [2.0230e+04, 1.7000e+01],
       [1.9521e+04, 1.6800e+01]]), time=array([1.65734280e+09, 1.65734281e+09, 1.65734281e+09, ...,
       1.65751559e+09, 1.65751560e+09, 1.65751560e+09]), files=[['CPC_3010_data_20220710_Jul.csv', 1078191], ['CPC_3010_data_20220709_Jul.csv', 1011254]])
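The `time` array in the printout holds epoch seconds, which are hard to read directly. As a rough illustration, the sketch below converts one such value to a UTC datetime using only the standard library. The value `1657342800` is an approximation of the first printed timestamp (the printout is rounded), so treat it as an example rather than the exact first sample.

```python
from datetime import datetime, timezone

# Approximate first timestamp from the printed stream (the printout is rounded).
epoch_seconds = 1657342800

# Interpret the value as seconds since 1970-01-01 00:00:00 UTC.
as_utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
print(as_utc.isoformat())
```

The `datetime64` attribute used in the plots below provides the time axis in a comparable human-readable form.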

# plot the data
fig, ax = plt.subplots()
ax.plot(data_stream.datetime64,
        data_stream.data[:, 0],  # data_stream.data is a 2d array, so we need
                                 # to specify which column we want to plot
        label=data_stream.header[0],
        linestyle="none",
        marker=".",)
plt.xticks(rotation=45)
ax.set_xlabel("Time (UTC)")
ax.set_ylabel(data_stream.header[0])
fig.tight_layout()  # adjust the layout before the figure is displayed
plt.show()

[Figure: raw CPC_count[#/sec] versus Time (UTC), plotted as individual points]

Average the data#

Now that the data is loaded, we can average it over time. We’ll use the ‘particula.data.stream_stats’ module to do this. The module has a function called ‘average_std’ that takes a stream object and returns a new stream object with the averaged data and the standard deviation of the data in each averaging interval. Here the interval is 600, which for epoch-second timestamps corresponds to 10 minutes.

stream_averaged = stream_stats.average_std(
    stream=data_stream,
    average_interval=600,
)
stream_averaged.standard_deviation.shape

(288, 2)
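The shape is consistent with the time span in the stream printout: about 172,800 s of data divided into 600 s intervals gives 288 rows, with one column per data header. To make the idea of interval averaging concrete, here is a minimal sketch using only the standard library. It is not particula's implementation; the function name `average_std_sketch`, the binning convention (bins start at the first timestamp), and the use of the population standard deviation are all assumptions made for illustration.

```python
import math
import statistics


def average_std_sketch(times, values, average_interval):
    """Group samples into fixed-width time bins; return per-bin mean and std."""
    start = times[0]
    bins = {}
    for t, v in zip(times, values):
        index = math.floor((t - start) / average_interval)
        bins.setdefault(index, []).append(v)

    result = []
    for index in sorted(bins):
        samples = bins[index]
        bin_start = start + index * average_interval
        mean = statistics.fmean(samples)
        # Population standard deviation; a single sample gives 0.0.
        std = statistics.pstdev(samples)
        result.append((bin_start, mean, std))
    return result


# Made-up samples every 200 s, averaged over 600 s intervals.
times = [0, 200, 400, 600, 800, 1000]
values = [10.0, 20.0, 30.0, 100.0, 100.0, 100.0]
for bin_start, mean, std in average_std_sketch(times, values, 600):
    print(bin_start, mean, round(std, 3))
```

Each output row summarizes one interval, which is why the averaged stream has far fewer points than the raw one.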

Plot the averaged data#

fig, ax = plt.subplots()
ax.plot(stream_averaged.datetime64,
        stream_averaged.data[:, 0],
        label=stream_averaged.header[0],
        marker=".",)
plt.xticks(rotation=45)
ax.set_xlabel("Time (UTC)")
ax.set_ylabel(stream_averaged.header[0])
fig.tight_layout()  # adjust the layout before the figure is displayed
plt.show()

[Figure: averaged CPC_count[#/sec] versus Time (UTC)]

Clean up the data#

We may now see some outliers in the data. We can use the ‘particula.data.stream_stats’ module to remove them. The module has a function called ‘filtering’ that takes a stream object and returns a new stream object with the outliers removed. Below, it is applied to the original (unaveraged) stream with an upper limit of top=250000 and drop=True.

stream_filtered = stream_stats.filtering(
    stream=data_stream,
    top=250000,
    drop=True,
)
fig, ax = plt.subplots()
ax.plot(stream_filtered.datetime64,
        stream_filtered.data[:, 0],
        label=stream_filtered.header[0],
        marker=".",)
plt.xticks(rotation=45)
ax.set_xlabel("Time (UTC)")
ax.set_ylabel(stream_filtered.header[0])
fig.tight_layout()  # adjust the layout before the figure is displayed
plt.show()

[Figure: filtered CPC_count[#/sec] versus Time (UTC)]
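For readers who want to see the idea behind an upper-threshold filter, the sketch below shows one plausible reading of the `top` and `drop` arguments. It is not particula's implementation: the function `filter_top_sketch` is hypothetical, and the assumed behavior (values above `top` are removed when `drop=True`, and replaced with NaN otherwise) is an interpretation of the argument names, so check the library documentation for the actual semantics.

```python
import math


def filter_top_sketch(times, values, top, drop=True):
    """Flag values above `top`; drop those samples or replace them with NaN."""
    kept_times, kept_values = [], []
    for t, v in zip(times, values):
        if v > top:
            if drop:
                continue  # remove the sample entirely
            v = math.nan  # keep the timestamp, blank the value
        kept_times.append(t)
        kept_values.append(v)
    return kept_times, kept_values


# Made-up counts with one obvious spike.
times = [0, 1, 2, 3]
counts = [33510.0, 900000.0, 32171.0, 19403.0]

print(filter_top_sketch(times, counts, top=250000, drop=True))
print(filter_top_sketch(times, counts, top=250000, drop=False))
```

Dropping samples shortens the series, while replacing them with NaN keeps the time axis intact and leaves a visible gap in a line plot.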

Summary#