February 09, 2024

Smartwatches Should Facilitate the Aggregation of Mental Health Data

By Pavél Llamocca Portella, César Guevara Maldonado, Yury Jiménez Agudelo, and Victoria López López

Psychology and medicine (specifically the prediction, prevention, and treatment of disease) are important research areas around the world. Because two people with the same diagnosis can have vastly different sets of health data, personalized medical treatments are becoming increasingly popular in the healthcare industry. In fact, a 2018 study found that personalized models predict the relapse of certain diseases with 30 percent more accuracy than generic models [1]. Such individualized approaches involve tailoring a treatment regime for a specific person with a specific disease; to do so, practitioners might process real-time signals to detect the patient's medical conditions, body variables, and other related details. As such, reliable data that provide current, accurate, and specific information about patients’ health states are a critical component of personalized care. At present, wearable devices like smartwatches routinely gather health-related data about their users [3]. However, these data are often noisy, incomplete, and heterogeneous, which complicates their collection, processing, and storage.

Our study utilized mobile technology to examine the real-time health signals of individuals with depression. Specifically, we used smartwatches to monitor participants during nocturnal and diurnal periods and gather information that might help prevent states of emotional crisis or other cases of acute depression [4]. In general, manufacturers use aggregation algorithms to transform the raw data from smartwatch sensors into cleaner versions that are ready for analysis and interpretation. Figure 1 depicts a general scheme of this transformation for data from the Apple Watch and the Withings Activité Steel smartwatch.

Figure 1. Raw and aggregated data from the Apple Watch and the Withings Activité Steel. Figure courtesy of the authors.

In 2022, C. Viñals Guitart investigated the variations between data structures from different smartwatches and proposed a procedure to standardize them [5]. For instance, the differences between data from the Apple Watch and the Withings Activité Steel occur in the data types, variable labels, sensor accuracies, and file storage methods (see Figure 2). The application dataset for the Withings Activité Steel—which we used in our study to collect data from each wearer over the course of two years—comprises a file structure with raw, aggregated, and double-aggregated data; the most relevant files in this structure contain the values of a double aggregation by the manufacturer that comprises 460 activity records. The Withings dataset identifies nine different activities throughout the day: swimming, walking, running, Pilates, skiing, hiking, tennis, cycling, and dancing. It also classifies the activity level during sleep into three priority states: light sleep, deep sleep, and awake. We aggregated the raw data directly from the devices’ sensors to obtain a super-aggregated data file; using the resulting information, we then created a data processing aggregation algorithm that analyzes the precision of the information from the device provider to ensure the reliability of the manufacturer’s algorithm and prevent incorrect outputs. To perform proper aggregations, we had to select the parsing algorithm that best fits our data [2].

Our aggregation algorithm added data based on the time series of each activity that the accelerometer detected. These time series are stored as dynamic lists of different lengths because items are appended from the moment the accelerometer detects movement until it detects a stop. The items in a list might correspond to different types of activity, as determined by data from other sensors (such as gyroscopes or GPS position, speed, or movement). Our study generated a file that contained 1,053 observations (grouped by day) and another file that included 63,469 ungrouped observations.
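As a rough illustration of how such variable-length lists can arise, the following Python sketch splits a stream of accelerometer readings into separate lists, each running from detected movement to the next stop. The sample format (timestamp, magnitude), the function name `split_into_bouts`, and the movement threshold are hypothetical choices for this example and are not taken from the study.

```python
from itertools import groupby

def split_into_bouts(samples, threshold=0.1):
    """Group consecutive 'moving' samples into variable-length lists.

    samples   -- iterable of (timestamp, magnitude) pairs (assumed format)
    threshold -- magnitude above which we treat the wearer as moving
                 (hypothetical value, chosen only for illustration)
    """
    bouts = []
    # groupby yields runs of consecutive samples that share the same
    # moving / not-moving flag; we keep only the moving runs.
    for moving, run in groupby(samples, key=lambda s: s[1] > threshold):
        if moving:
            bouts.append(list(run))
    return bouts

# Made-up readings: two bursts of movement separated by rest.
samples = [(0, 0.0), (1, 0.4), (2, 0.6), (3, 0.0), (4, 0.0), (5, 0.3), (6, 0.0)]
bouts = split_into_bouts(samples)
print([len(b) for b in bouts])
```

Each list ends as soon as the readings fall back to the threshold or below, so the lists naturally differ in length.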

Figure 2. Data structure comparison. 2a. Withings Activité Steel. 2b. Apple Watch. Figure courtesy of the authors.

We applied our aggregation algorithm by grouping information from several records that had the same date and computing a simple sum of the recorded values. To verify the differences between our algorithm and the manufacturer’s, we used the same raw data input and compared the aggregated data from the manufacturer with the aggregated data from our algorithm. For the comparative analysis, we chose the variables of interest based on the specific purpose of this work: evaluating emotional state. We only examined three results (for illustrative purposes) and opted to focus on data that related to burned calories, distance traveled, and total steps since we are interested in the daily physical activity of each patient. Our process identified each date on which the user performed any of these three activities and added every value that the device registered.
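The grouping step described above can be sketched in a few lines of Python. This is a minimal illustration under assumed names: the record layout, the field names (`date`, `calories`, `distance`, `steps`), and the function `aggregate_by_day` are hypothetical and do not reflect the actual Withings file format or our production code.

```python
from collections import defaultdict

def aggregate_by_day(records, variables=("calories", "distance", "steps")):
    """Sum each variable over all records that share the same date."""
    # One fresh accumulator per date, with every variable starting at zero.
    totals = defaultdict(lambda: dict.fromkeys(variables, 0.0))
    for rec in records:
        for var in variables:
            # Treat a missing value as zero (an assumption for this sketch).
            totals[rec["date"]][var] += rec.get(var, 0.0)
    return dict(totals)

# Made-up raw records: two activity bouts on one day, one on the next.
raw = [
    {"date": "2021-03-01", "calories": 12.5, "distance": 150.0, "steps": 200},
    {"date": "2021-03-01", "calories": 30.0, "distance": 400.0, "steps": 520},
    {"date": "2021-03-02", "calories": 8.0, "distance": 90.0, "steps": 110},
]
daily = aggregate_by_day(raw)
print(daily["2021-03-01"])
```

Because the sum is applied to the raw records without any filtering, outliers in the sensor data carry through to the daily totals.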

The quantity of records in our data and the manufacturer’s data is the same, which indicates that our processes utilized the same aggregation criteria. We applied a t-test to evaluate the results and analyze the difference in outcomes, executing this method in two different ways; the first considered all of the data (720 observations) and the second eliminated the data that was null in both samples, ultimately obtaining 197 observations. The second approach is more useful because the device’s periods of inactivity and the subsequent absence of sensor data yielded many pairs of zeros.
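The two ways of running the test can be sketched as follows. The text does not specify the variant of the t-test, so this example assumes a paired test on day-by-day differences; the function names and the numbers are invented for illustration. Only the t statistic and degrees of freedom are computed here; the p-value would follow from a Student's t distribution with those degrees of freedom (for instance via `scipy.stats.ttest_rel`).

```python
import math
import statistics

def drop_double_zeros(ours, theirs):
    """Keep only the days on which at least one of the two values is nonzero."""
    return [(a, b) for a, b in zip(ours, theirs) if not (a == 0 and b == 0)]

def paired_t_statistic(pairs):
    """Return (t, degrees of freedom) for a paired t-test (assumed variant)."""
    diffs = [a - b for a, b in pairs]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Invented daily totals; inactive days appear as zeros in both series.
ours = [0, 120, 0, 95, 210, 0]
theirs = [0, 100, 0, 90, 180, 0]

t_all, df_all = paired_t_statistic(list(zip(ours, theirs)))  # first approach
kept = drop_double_zeros(ours, theirs)                       # second approach
t_kept, df_kept = paired_t_statistic(kept)
print(len(kept), df_kept, round(t_kept, 2))
```

Dropping the pairs of zeros removes days that carry no information about how the two aggregations differ, so the test is based only on days with recorded activity.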

Figure 3. Summaries and histograms that compare the consumed calories based on the raw data and the factory aggregated data. 3a. Raw data. 3b. Manufacturer’s data. Figure courtesy of the authors.

Our t-test identified substantial differences between the manufacturer’s results and those that we obtained directly from the raw data with our own aggregation. The p-value is quite significant in both cases, the 95 percent confidence interval does not contain zero, and neither the standard deviation nor the mean is similar (in fact, important variations exist). The percentiles are also different, which implies that the manufacturer filtered the data prior to aggregation. Our standard deviation is also larger than that of the manufacturer, possibly because the manufacturer removed some outliers from the raw data.

Additionally, we studied two other variables (calories and distance) in different time periods as a subset of the data. We applied the same technique and confirmed that the manufacturer's data does not match ours and is distributed very differently for these variables as well. Figures 3 and 4 demonstrate these results. Figure 3a illustrates the results for consumed calories from our algorithm when fed with the device's raw data, and Figure 3b depicts the results of the manufacturer's algorithm. The variations are clearly substantial; a similar behavior for the distance traveled variable is evident in Figure 4. Our aggregated data and the manufacturer’s aggregated data are significantly different.

Figure 4. Summaries and histograms that compare distance traveled based on the raw data and the factory aggregated data. 4a. Raw data. 4b. Factory aggregated data. Figure courtesy of the authors.

Ultimately, we can conclude that the manufacturer uses other sources of information—such as sensors from other devices that are associated with patients’ smartphones—to enrich the users' knowledge. These additional sources of information add noise to the verification process. We propose an optimal alternative that approximates the results with a higher accuracy than the factory applications. Because our process aggregates data directly from the raw input, it could perform certain analysis tasks based on information that is recorded by a specific mobile device.

Activity sensor manufacturers are constantly improving the accuracy and reliability of their products. Nevertheless, practitioners must understand that specific applications (such as the study of mood disorders like depression) require aggregation algorithms that can verify the integrity of data from these activity sensors.

Victoria López delivered a minisymposium presentation on this research at the 2023 SIAM Conference on Computational Science and Engineering, which took place in Amsterdam, the Netherlands, last year.

References
[1] Constantinides, M., Busk, J., Matic, A., Faurholt-Jepsen, M., Kessing, L.V., & Bardram, J.E. (2018). Personalized versus generic mood prediction models in bipolar disorder. In UbiComp ’18: Proceedings of the 2018 ACM international joint conference and 2018 international symposium on pervasive and ubiquitous computing and wearable computers (pp. 1700-1707). New York, NY: Association for Computing Machinery.
[2] Diéz Pérez-Villacastín, C. (2022). Parsing algorithms for the analysis of physical activity and sleep data. [Master’s thesis, CUNEF Universidad] (in Spanish).
[3] Doncel Pedrosa, A. (2022). Exploratory analysis of the correlation and causality of physical activity in sleep. [Master’s thesis, CUNEF Universidad] (in Spanish).
[4] Llamocca, P., López, V., & Čukić, M. (2022). The proposition for bipolar depression forecasting based on wearable data collection. Front. Physiol., 12, 777137.
[5] Viñals Guitart, C. (2022). Comparison and integration of raw data structures of smart sensors: Application to physical activity. [Master’s thesis, CUNEF Universidad] (in Spanish).

	Pavél Llamocca Portella is a faculty member in the Department of Quantitative Methods at CUNEF Universidad in Spain. He holds a Ph.D. in computer science.
	César Guevara Maldonado is a faculty member in the Department of Quantitative Methods at CUNEF Universidad in Spain. He holds a Ph.D. in computer science.
	Yury Jiménez Agudelo is a faculty member in the Department of Quantitative Methods at CUNEF Universidad in Spain. She holds a Ph.D. in telematics.
	Victoria López López is a faculty member in the Department of Quantitative Methods at CUNEF Universidad in Spain. She holds a Ph.D. in computational mathematics and artificial intelligence.