August 10, 2023

Drift Matters: Unsupervised Anomaly Detection and National Security

“A ship in a port is safe, but that’s not what ships are built for.”
– Grace Hopper

Whenever someone asks me about my research, I find that “unsupervised anomaly detection under concept drift for maritime trajectory analytics” is a bit of a mouthful. Instead, I generally prefer to summarize with “I find bad ships.” “Bad ships” is of course a broad term, but a lot of bad things happen on the ocean. Sometimes bad things happen to ships, like a passenger sustaining an injury, a craft grounding or losing power, cargo falling into the sea, or—worst of all—a vessel sinking. And sometimes ships do bad things, such as damaging the environment or facilitating smuggling, illegal fishing, trade sanction dodging, vessel spoofing, and human trafficking. Time is of the essence when crews are responding to these types of events, meaning that we must detect them while they are occurring — or perhaps even before they occur. Unsupervised anomaly detection (UAD) algorithms seek to do just that.

Automatic identification system (AIS) data forms the foundation for this kind of work. Ships with AIS consistently broadcast dynamic data (including their latitude/longitude position, speed over ground, and course over ground) as well as static information like vessel identifiers and vessel type. Laws from the International Maritime Organization and many national governments require that a significant number of vessels (typically almost all medium to large self-propelled vessels) be fitted with AIS. This standard, combined with the fact that any public receiver can record AIS data, makes AIS the go-to system for UAD at sea.

However, the development of UAD algorithms for AIS data can be quite tricky. AIS datasets are large, noisy, and unlabeled, and no datasets with labeled anomalies are publicly available. Furthermore, maritime vessel data evolves over time in multiple simultaneous ways through a phenomenon called concept drift. Although previous research has developed methods for UAD under concept drift—primarily for cybersecurity applications—few existing algorithms can handle the concept drift challenges that are specific to maritime trajectories. Our work focuses on addressing this gap and developing a UAD algorithm that can compensate for multiple forms of drift at once.

What is Concept Drift?

Concept drift describes the evolution of a dataset’s underlying normal distribution over time. There are many types of concept drift, but three have considerable effects on maritime vessel traffic: gradual, seasonal, and abrupt drift. Here, we visualize these drifts with Marine Cadastre data from Hawaii between 2017 and 2020.

When thinking about data evolution, we most often envision gradual drift. This type of drift describes the slow, consistent evolution of data over a period of time; the change in weekly trajectories for fishing vessels in Hawaii is one such example. Animation 1 highlights the latitudinal range that covers 90 percent of the points for each fishing week. As this range shifts north, more weekly trajectories appear in the sea north of Oʻahu; as it shifts south, more weekly trajectories appear south of Oʻahu. Several slow, incremental shifts both northward and southward are evident over the four-year period and demonstrate gradual drift.

Animation 1. Weekly fishing vessel trajectories off the coast of Hawaii between 2017 and 2020. The yellow highlighted area indicates the 90-percent quantile for the latitudinal points. Animation courtesy of Amelia Henriksen.
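The weekly latitude band described above can be approximated with a few lines of Python. This is only a sketch: it assumes that the highlighted range is the central interval between the 5th and 95th percentiles of each week's latitudes, and the helper names (`percentile`, `weekly_latitude_bands`) and the sample points are made up for illustration.

```python
from collections import defaultdict
from datetime import date


def percentile(sorted_vals, q):
    """Linearly interpolated quantile q (0..1) of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def weekly_latitude_bands(points, coverage=0.90):
    """Map (ISO year, ISO week) -> (low, high) latitude band.

    `points` is an iterable of (date, latitude) pairs. The band is the
    central interval holding `coverage` of that week's latitudes, which is
    an assumption about how the highlighted range is defined.
    """
    by_week = defaultdict(list)
    for day, lat in points:
        iso_year, iso_week = day.isocalendar()[:2]
        by_week[(iso_year, iso_week)].append(lat)

    tail = (1.0 - coverage) / 2.0
    bands = {}
    for week, lats in sorted(by_week.items()):
        lats.sort()
        bands[week] = (percentile(lats, tail), percentile(lats, 1.0 - tail))
    return bands


# Made-up points: the second week sits about one degree further north.
week1 = [(date(2017, 1, 2), 20.0 + 0.01 * i) for i in range(101)]
week2 = [(date(2017, 1, 9), 21.0 + 0.01 * i) for i in range(101)]
bands = weekly_latitude_bands(week1 + week2)
```

Tracking how the two band edges move from week to week gives a simple numerical view of the slow northward and southward shifts that characterize gradual drift.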

Because vessel movements are affected by Earth’s seasons, seasonal drift is particularly important when modeling ship tracks. This kind of drift describes patterns that appear repeatedly in the data in a periodic way. Sailing and pleasure craft often have distinct seasons; Animation 2 indicates that sailing ships in Hawaii tend to experience peaks in the summer (typically around July) and lows in the winter.

Animation 2. Weekly pleasure/sailing craft trajectories off the coast of Hawaii between 2017 and 2020. Blue indicates winter and yellow indicates summer. A distinct increase occurs each summer, particularly in July. Animation courtesy of Amelia Henriksen.

We note, however, that the summer 2020 peak in Hawaii is less dense than in previous years. This brings us to the final important form of drift: abrupt drift. Abrupt drift describes a sudden shift in the underlying distribution that usually manifests as an outlier at first, but is characterized as drift when the shift persists. One of the most dramatic, abrupt shifts happened in 2020 when the World Health Organization declared COVID-19 a global pandemic; this worldwide event literally created a new normal distribution. This instance of abrupt drift is readily apparent in Animation 3, which demonstrates the significant effect on passenger vessels.

Animation 3. Weekly passenger vessel trajectories off the coast of Hawaii between 2017 and 2020. From 2017 through 2019, a more subdued data evolution in the form of gradual drift with some seasonal effects is evident. This scenario changes dramatically in 2020 after the advent of COVID-19. Animation courtesy of Amelia Henriksen.

Introducing DB-Drift

One of the most common types of UAD algorithm is centered on density-based clustering for outlier discovery — in particular, density-based spatial clustering of applications with noise (DBSCAN) [3]. Researchers who are investigating anomaly detection on AIS data have generated many algorithmic pipelines that feature DBSCAN or its variants as a key component [4].

DBSCAN is popular because it is easy to implement and understand, automatically identifies outliers, and does not require a set number of clusters (like k-means, for example). However, it is fundamentally designed to accumulate data in a static fashion — which is essentially the opposite of adapting to concept drift. The fact that prior work on density-based clustering under drift rarely incorporates even one type of drift—let alone multiple types—makes our problem even more difficult.
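To make the static nature of DBSCAN concrete, here is a compact, unoptimized sketch of the textbook algorithm in plain Python. The function name, the parameter names `eps` and `min_pts`, and the toy two-dimensional points are illustrative choices and are not taken from any AIS pipeline.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force neighborhood query; includes the point itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Two tight groups and one isolated point (toy feature vectors).
toy = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
toy_labels = dbscan(toy, eps=1.5, min_pts=3)
```

Every call clusters the full point set it is given, and nothing in the procedure forgets old points or reweights recent ones. That is the limitation that a drift-aware method has to work around.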

In response to these challenges, we created DB-Drift: an unsupervised, density-based clustering algorithm that can identify maritime outliers and compensate for gradual and seasonal drift (we are currently extending the algorithm to handle abrupt drift as well). The present two-drift pipeline consists of the following overarching steps (see Figure 1).

First, DB-Drift receives AIS data points from maritime traffic and feeds them as a raw data stream into a trajectory assembler (such as Sandia National Laboratories’ Tracktable module), which filters and processes the data into trajectories or trajectory segments. During step two, we extract \(N\) features (like speed-over-ground quantiles, trajectory shape approximations, and so forth) from the input trajectories and feed them into our outlier detector. We split the model into two outlier scorers: one under gradual drift and one under seasonal drift.
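As a rough illustration of step two, the sketch below turns one trajectory segment into a small feature vector. It does not use Tracktable and is not the DB-Drift feature set: the chosen quantile levels, the "straightness" ratio used as a stand-in for a shape approximation, and all function names are assumptions made for the example.

```python
from math import dist


def percentile(sorted_vals, q):
    """Linearly interpolated quantile q (0..1) of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def sog_quantiles(speeds, probs=(0.1, 0.5, 0.9)):
    """Speed-over-ground quantiles for one trajectory segment."""
    ordered = sorted(speeds)
    return [percentile(ordered, q) for q in probs]


def straightness(track):
    """End-to-end distance divided by path length for planar (x, y) points.

    Hypothetical shape feature: 1.0 is a straight line, values near 0.0
    indicate loitering. A segment with zero path length returns 0.0.
    """
    path = sum(dist(a, b) for a, b in zip(track, track[1:]))
    return dist(track[0], track[-1]) / path if path > 0 else 0.0


def feature_vector(speeds, track):
    """Concatenate the speed quantiles and the shape feature."""
    return sog_quantiles(speeds) + [straightness(track)]


# Made-up segment: a vessel that is barely moving and wanders in place.
drifting = feature_vector(
    speeds=[0.1, 0.2, 0.0, 0.3, 0.1],
    track=[(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.0, 0.0)],
)
```

A vessel in normal transit would show much higher speed quantiles and a straightness near 1.0, so even this short vector separates the two behaviors.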

Our critical contribution occurs during the third and fourth steps of DB-Drift. In step three, we evolve multiple layers of cluster sets that each capture a different kind of drift over time. Every layer uses DenStream [1] as the base clustering algorithm because it can inherently output an outlier score in real time [2]. The simplest version of DB-Drift involves two sets of evolving clusters. The first cluster set evolves quickly and weights recent information more heavily to capture gradual concept drift, while the second layer assigns a specific cluster set to each season in our recurrent drift model. In our initial implementation of this model—which is for actual meteorological seasons—we specifically assign a separate DenStream clustering to each month of the year and report the outlier score for a new trajectory only with respect to the month to which it belongs. Each month needs its own DenStream model since seasons themselves can also evolve over time — albeit typically at a slower rate.
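The layered structure from step three can be sketched as follows. Note that `FadingClusters` is a deliberately simplified stand-in and is not DenStream: it keeps faded cluster centers only, fades once per observation instead of by elapsed time, and scores a point by its distance to the nearest center. The class names, parameters, and default decay rates are all hypothetical.

```python
from math import dist


class FadingClusters:
    """Toy streaming clusterer with exponential forgetting (NOT DenStream)."""

    def __init__(self, radius, decay, min_weight=0.05):
        self.radius = radius
        self.decay = decay            # larger value forgets faster
        self.min_weight = min_weight
        self.clusters = []            # list of [center, weight]

    def score(self, x):
        """Distance to the nearest center, in units of `radius`."""
        if not self.clusters:
            return float("inf")
        return min(dist(c, x) for c, _ in self.clusters) / self.radius

    def learn(self, x):
        # Fade every cluster, then drop those that have faded away.
        fade = 2.0 ** (-self.decay)
        self.clusters = [[c, w * fade] for c, w in self.clusters
                         if w * fade >= self.min_weight]
        if self.clusters:
            nearest = min(self.clusters, key=lambda cw: dist(cw[0], x))
            if dist(nearest[0], x) <= self.radius:
                c, w = nearest
                # Weighted running mean pulls the center toward x.
                nearest[0] = [(ci * w + xi) / (w + 1.0) for ci, xi in zip(c, x)]
                nearest[1] = w + 1.0
                return
        self.clusters.append([list(x), 1.0])


class LayeredScorer:
    """One fast-forgetting layer plus one slow layer per calendar month."""

    def __init__(self, radius, fast_decay=0.1, slow_decay=0.01):
        self.gradual = FadingClusters(radius, fast_decay)
        self.monthly = {m: FadingClusters(radius, slow_decay)
                        for m in range(1, 13)}

    def score_then_learn(self, month, x):
        """Return (gradual score, seasonal score), then update both layers."""
        scores = (self.gradual.score(x), self.monthly[month].score(x))
        self.gradual.learn(x)
        self.monthly[month].learn(x)
        return scores


scorer = LayeredScorer(radius=1.0)
first_scores = scorer.score_then_learn(1, (0.0, 0.0))
```

Only the model for the trajectory's own month is consulted and updated, so a pattern that is routine in July does not mask unusual behavior in January. Each monthly model still forgets slowly, which lets the seasons themselves evolve.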

In step four, each layer outputs an outlier score with respect to the cluster layer. We then combine these scores as a weighted sum and compare it to a threshold that is determined based on the cluster radius—one of DenStream’s key hyperparameters—and the desired percentage of outliers that we wish to capture in the data (generally quite small). In real-world scenarios, a domain expert or an expert algorithm would then review the now-tractably-sized outlier subset for further assessment.
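The score combination in step four amounts to a weighted sum followed by a comparison. The article does not give the exact threshold rule, so the sketch below makes a loud assumption: it picks the threshold as an empirical quantile of past combined scores so that roughly the desired fraction exceeds it, and it ignores the dependence on the cluster radius. The function names and weights are hypothetical.

```python
def combine(scores, weights):
    """Weighted sum of the per-layer outlier scores."""
    return sum(w * s for w, s in zip(weights, scores))


def quantile_threshold(history, outlier_fraction):
    """Value that roughly `outlier_fraction` of `history` exceeds.

    Assumed rule for illustration only; it is not the DB-Drift threshold.
    """
    ranked = sorted(history)
    keep = round(len(ranked) * (1.0 - outlier_fraction))
    keep = min(max(keep, 1), len(ranked))
    return ranked[keep - 1]


def is_outlier(scores, weights, threshold):
    """Flag a trajectory whose combined score exceeds the threshold."""
    return combine(scores, weights) > threshold


# Made-up history of combined scores and a 10 percent outlier budget.
history = [float(v) for v in range(1, 11)]
threshold = quantile_threshold(history, outlier_fraction=0.10)
flagged = [v for v in history if v > threshold]
```

Keeping the outlier fraction small is what makes the flagged subset tractable for a domain expert to review.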

Figure 1. The algorithm workflow for density-based outlier detection on automatic identification system (AIS) data under gradual and seasonal drift. Figure courtesy of Amelia Henriksen.

The Algorithm in Action

Now we consider an example of the algorithm in action for the Hawaii dataset. Based on the U.S. Coast Guard’s incident investigation reports, we know that the operator for the commercial fishing vessel Lady Mocha II noticed a drop in performance on March 16, 2017, and went to investigate. He found water in the fuel filter, so he turned off the engine and bled the water but could not get the engine to restart. He then contacted a sister vessel named Lady Mocha I, which arrived on March 20, 2017, with a new starter and batteries. Lady Mocha II was hence essentially stranded for four days.

Lady Mocha II was out of range for our Hawaii coastline dataset on the day of the incident, but it moved into range on March 19, 2017. Using only speed-over-ground quantile information from the corresponding sporadic trajectory segments, our algorithm immediately identified Lady Mocha II’s AIS data as anomalous. But a standard sliding-window DBSCAN-based pipeline for UAD, which does not incorporate multiple forms of drift, could not detect this anomaly, even over a variety of different parameter combinations. This real-world example highlights the importance of robust anomaly detectors on AIS data.

When it comes to UAD at sea, drift matters. Algorithms must be able to address the multiple forms of concept drift that may be present in data, and DB-Drift provides a framework to do just that — with promising initial results.

Amelia Henriksen delivered a minisymposium presentation on this research at the 2022 SIAM Conference on Mathematics of Data Science, which took place in San Diego, Ca., last year.

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

References
[1] Cao, F., Ester, M., Qian, W., & Zhou, A. (2006). Density-based clustering over an evolving data stream with noise. In Proceedings of the 2006 SIAM international conference on data mining (pp. 328-339). Bethesda, MD: Society for Industrial and Applied Mathematics.
[2] Choudhary, D., Kejariwal, A., & Orsini, F. (2017). On the runtime-efficacy trade-off of anomaly detection techniques for real-time streaming data. Preprint, arXiv:1710.04735.
[3] Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD’96: Proceedings of the second international conference on knowledge discovery and data mining (pp. 226-231). Portland, OR: Association for the Advancement of Artificial Intelligence.
[4] Wolsing, K., Roepert, L., Bauer, J., & Wehrle, K. (2022). Anomaly detection in maritime AIS tracks: A review of recent approaches. J. Mar. Sci. Eng., 10(1), 112.

Amelia Henriksen is a data scientist in the Machine Intelligence and Visualization Department of Sandia National Laboratories. She received her Ph.D. in computational science, engineering, and math from the Oden Institute for Computational Engineering and Sciences at the University of Texas at Austin. Her research interests primarily lie in unsupervised algorithms for challenging real-world datasets, as well as open and FAIR data release.