SIAM News Blog

Data Science for Blockchain

By Cuneyt G. Akcora, Murat Kantarcioglu, and Yulia R. Gel

On October 31, 2008, the pseudonymous Satoshi Nakamoto posted a white paper titled “Bitcoin: A Peer-to-Peer Electronic Cash System” to the Cryptography Mailing List. In eight pages, Nakamoto explained the transactions, network, incentives, and other building blocks for a new digital currency called Bitcoin.

Bitcoin addressed the challenge of sending and receiving digital currency on the World Wide Web. The idea of digital currency is as old as the Internet itself, and traditional banks and Internet companies have created online payment services—such as PayPal, Visa, and Mastercard—for similar purposes. However, these solutions involve a trusted entity that intermediates currency flow and updates user balances as transactions are processed. Blockchain removes this trusted entity while still providing a framework that can correctly and securely process transactions and maintain user balances.

The fundamental blockchain unit in Bitcoin is called a transaction, which lists a set of inputs and a set of newly created outputs. Here, the term “output” refers to a data structure that contains both an address and the amount that is directed to this address during a transaction. 

Bitcoin has an unorthodox view of transactions, in that a single transaction can involve multiple senders and multiple receivers (unlike bank transactions, wherein one sender account transmits fiat currency to one receiver account). During a Bitcoin transaction, each sender and receiver is represented by its Bitcoin address. If there is only one sender, that sender must digitally sign the transaction with the private key that is associated with their address. If there are multiple senders, each sender must sign their own portion before one member of the group forwards the transaction to the network. All senders must therefore coordinate, though in many cases a single blockchain user owns every address and signs each one when creating the transaction. Regardless, addresses that appear together in inputs are more likely to be related to each other.
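The observation that co-occurring input addresses are likely related can be turned into a simple clustering heuristic. The sketch below is a minimal illustration rather than a production tool: it assumes that each transaction is given as a plain list of its input addresses, and it merges addresses with a union-find structure. The function name and the input format are hypothetical.

```python
from collections import defaultdict


def cluster_by_common_inputs(transactions):
    """Group addresses that appear together as inputs of some transaction.

    `transactions` is an iterable of lists; each list holds the input
    addresses of one transaction (an assumed, simplified format).
    """
    parent = {}

    def find(address):
        # Register unseen addresses as their own cluster root.
        parent.setdefault(address, address)
        while parent[address] != address:
            parent[address] = parent[parent[address]]  # path halving
            address = parent[address]
        return address

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for input_addresses in transactions:
        if not input_addresses:
            continue
        first = input_addresses[0]
        find(first)
        for other in input_addresses[1:]:
            union(first, other)

    clusters = defaultdict(set)
    for address in list(parent):
        clusters[find(address)].add(address)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    sample = [["a1", "a2"], ["a2", "a3"], ["b1"], ["c1", "c2"]]
    print(cluster_by_common_inputs(sample))
```

Because multi-sender transactions with genuinely distinct owners also exist, the resulting clusters are evidence of a relationship, not proof of common ownership.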

Any blockchain user can observe the peer-to-peer network and collect a limited number of transactions in a block. Users mine blocks by solving cryptographic puzzles through trial and error. A valid solution is proof that the miner has put in some effort to find the puzzle’s answer. Mining is purposefully difficult to ensure that the blockchain is not littered with blocks.
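The trial-and-error search can be illustrated with a toy puzzle. The sketch below is a simplification and not Bitcoin's actual procedure, which hashes a structured block header twice with SHA-256 and compares the result against a numeric target. Here we assume a single SHA-256 over a string plus a counter (the nonce), and we accept any digest that starts with a chosen number of zero hex digits. The function names are hypothetical.

```python
import hashlib


def mine(block_data, difficulty, max_nonce=10_000_000):
    """Try nonces in order until the digest starts with `difficulty` zeros.

    Returns (nonce, digest), or None if no nonce below max_nonce works.
    """
    prefix = "0" * difficulty
    for nonce in range(max_nonce):
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
    return None


def verify(block_data, nonce, difficulty):
    """Checking a claimed solution takes one hash, unlike finding it."""
    digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    solution = mine("demo block", difficulty=3)
    if solution is not None:
        nonce, digest = solution
        print(nonce, digest, verify("demo block", nonce, 3))
```

Each additional zero digit multiplies the expected number of attempts by 16, while verification stays a single hash. This asymmetry is what makes a valid solution usable as evidence of effort.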

In the following sections, we outline promising research directions that pertain to the challenges and opportunities of data science in the context of blockchains. A more detailed discussion of blockchain models and data repositories is available in [2, 10].

Tokenomics and Decentralized Finance 

Decentralized finance (DeFi) on platform blockchains is an emerging field in which real-life assets—like houses, cars, and investment bonds—are “digitized” (i.e., given unique and unmodifiable asset identifications) and traded on blockchains in a tokenized fashion. Analysis of DeFi mechanisms constitutes an important interdisciplinary research direction with implications in finance and software analysis.

Cryptocurrencies like Bitcoin and Monero involve digital coins, whereas smart contracts utilize tokens; users trade both asset types on online exchanges for fiat currencies. A new asset class called stablecoins pegs prices to a particular asset, such as gold [5]. Some stablecoin prices are designed to fluctuate as little as possible due to mechanisms such as algorithmic buy/sell, while other stablecoins claim to hold real assets (e.g., U.S. treasury bonds) to maintain the peg. However, the recent collapse of the TerraUSD stablecoin demonstrated that algorithmic pegging mechanisms may not be robust under volatile market conditions [13].

Figure 1. The three main chainlet types (split, transition, and merge), as well as extreme (left or right) chainlets. Figure courtesy of the authors.

Transaction Network Analysis

Cryptocurrencies and platforms both store ordinary financial transactions, with minimal differences. Platform transactions (e.g., on Ethereum) are one-to-one: one address sends coins or assets to another address, so the data can be modeled with traditional graph structures. But most cryptocurrencies (e.g., Bitcoin) are unspent transaction output (UTXO)-based blockchains wherein the output stores the amount to be spent. A single address can own multiple outputs, and each output can have a different coin amount.
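A small sketch may help make the UTXO model concrete. It assumes a simplified ledger in which each output is identified by a (transaction ID, index) pair and amounts are integers in the smallest coin unit. Signatures and scripts are omitted, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    tx_id: str
    index: int
    address: str
    amount: int  # smallest coin unit, kept as an integer


def apply_transaction(utxos, tx_id, spent_keys, new_outputs):
    """Return a new UTXO set after spending `spent_keys`.

    `utxos` maps (tx_id, index) to Output; `new_outputs` is a list of
    (address, amount) pairs. Any input amount that is not reassigned to
    an output is implicitly the transaction fee.
    """
    if len(set(spent_keys)) != len(spent_keys):
        raise ValueError("an output is listed twice as input")
    missing = [key for key in spent_keys if key not in utxos]
    if missing:
        raise ValueError(f"unknown or already spent outputs: {missing}")
    total_in = sum(utxos[key].amount for key in spent_keys)
    total_out = sum(amount for _, amount in new_outputs)
    if total_out > total_in:
        raise ValueError("outputs exceed inputs")
    spent = set(spent_keys)
    updated = {key: out for key, out in utxos.items() if key not in spent}
    for i, (address, amount) in enumerate(new_outputs):
        updated[(tx_id, i)] = Output(tx_id, i, address, amount)
    return updated


def balance(utxos, address):
    """An address balance is the sum of the unspent outputs it owns."""
    return sum(out.amount for out in utxos.values() if out.address == address)


if __name__ == "__main__":
    ledger = {
        ("t0", 0): Output("t0", 0, "alice", 50),
        ("t0", 1): Output("t0", 1, "alice", 30),
    }
    ledger = apply_transaction(ledger, "t1", [("t0", 0)], [("bob", 20), ("alice", 29)])
    print(balance(ledger, "alice"), balance(ledger, "bob"))
```

Note that no balance is stored anywhere. An output is consumed whole, and any change returns to the sender as a new output, which is why one address can hold several outputs of different sizes.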

Early blockchain data analytics was largely rooted in the conventional methods of network science and thus approached UTXO data by creating a graph that employed only a single type of node. Such analytic procedures are known as transaction and address graph approaches. The transaction graph approach ignores addresses and creates edges among transaction nodes [8]. The transaction graph is naturally acyclic, and a transaction node cannot have new edges in the future. Conversely, the address graph approach ignores transactions and creates edges among address nodes [11]. However, Bitcoin does not explicitly store input-to-output coin flows; all inputs are gathered at the transaction and immediately directed to output addresses. As a result, every input address must be connected to all output addresses, which may create large cliques if too many addresses are involved in one transaction.
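The difference between the two single-node-type graphs can be sketched in a few lines. The example assumes that each transaction is a dictionary with an `id`, its input and output addresses, and a `spends` list naming the earlier transactions whose outputs it consumes. This record format is hypothetical and chosen only for illustration.

```python
from itertools import product


def address_graph_edges(transactions):
    """Connect every input address to every output address of each transaction."""
    edges = set()
    for tx in transactions:
        edges.update(product(tx["inputs"], tx["outputs"]))
    return edges


def transaction_graph_edges(transactions):
    """Connect each transaction to the later transactions that spend its outputs."""
    return {(prev, tx["id"]) for tx in transactions for prev in tx["spends"]}


if __name__ == "__main__":
    txs = [
        {"id": "t1", "inputs": ["a", "b", "c"], "outputs": ["d", "e"], "spends": []},
        {"id": "t2", "inputs": ["d"], "outputs": ["f"], "spends": ["t1"]},
    ]
    print(len(address_graph_edges(txs)))
    print(transaction_graph_edges(txs))
```

A transaction with m inputs and n outputs contributes m × n edges to the address graph, so a single payment with hundreds of inputs can dominate the edge count, whereas the same transaction is only one node in the transaction graph.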

Single-node-type approaches do not provide faithful representations of blockchain data, and the loss of information about addresses or transactions seems to impact predictive models [4]. In contrast, researchers can model Bitcoin graphs as heterogeneous networks with two node types: addresses and transactions. Figure 1 depicts a sample blockchain substructure called a chainlet [1]. The key rationale is that chainlets (especially extreme ones) may reflect hidden patterns in transaction dynamics that stem from the activity of whales (a small group of players who act on insider information or run pump-and-dump schemes) or from other scenarios.
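As a rough illustration, a chainlet can be summarized by the number of inputs and outputs of its transaction. The sketch below follows the convention in [1] as we understand it: more inputs than outputs is a merge, equal counts is a transition, and fewer inputs than outputs is a split. The cutoff of 20 for "extreme" chainlets is an assumed, illustrative value, and the function names are hypothetical.

```python
from collections import Counter


def chainlet_type(n_inputs, n_outputs):
    """Classify a transaction by comparing its input and output counts."""
    if n_inputs < 1 or n_outputs < 1:
        raise ValueError("a chainlet needs at least one input and one output")
    if n_inputs > n_outputs:
        return "merge"
    if n_inputs < n_outputs:
        return "split"
    return "transition"


def is_extreme(n_inputs, n_outputs, threshold=20):
    """Flag chainlets whose input or output count reaches the assumed cutoff."""
    return n_inputs >= threshold or n_outputs >= threshold


def chainlet_counts(shapes, threshold=20):
    """Count (inputs, outputs) shapes, pooling extreme ones at the cutoff."""
    return Counter((min(x, threshold), min(y, threshold)) for x, y in shapes)


if __name__ == "__main__":
    day = [(1, 2), (1, 2), (3, 1), (2, 2), (30, 1)]
    print([chainlet_type(x, y) for x, y in day])
    print(chainlet_counts(day))
```

Counting shapes over a time window, for example one day, yields a compact feature vector that a predictive model can consume while retaining both address and transaction information.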

Cybersecurity

Cryptocurrencies are ideal for darknet markets, since users can make and receive payments pseudo-anonymously from anywhere in the world. Beginning with the Silk Road in 2011, darknet markets have been processing ever-growing amounts of illicit goods, like fake passports and guns. Tracking the usage of cryptocurrencies to perform illegal or illicit trades—such as human trafficking, child pornography, and ransomware payments—reveals a new research direction that links dark web activity with blockchains [7].

Ransomed entities (e.g., companies, municipalities, or hospitals) tend to behave similarly when paying a ransom [3]. First, the entity uses an online exchange to purchase coins; the exchange facilitates this process by matching buyers to sellers. The purchased coins are then directed into an address \(a_1\) that is created specifically for the entity. Since the ransom is usually high, the inputs for the transaction can comprise hundreds of addresses that each contribute small amounts of coins. Figure 2 depicts this phenomenon as transaction \(t_1\). Usually, \(t_1\) has hundreds of inputs and one or two outputs; in addition to the ransom amount, the output also includes a transaction fee. The transaction \(t_2\) is the ransom payment itself. Any amount remaining after the ransom is directed to \(a_2\). Interestingly, the time difference between \(t_1\) and \(t_2\) is usually around 24 hours, thus suggesting a significant time gap between the moment when entities agree to pay and the time that they actually make the payment.
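The two-step pattern described above suggests a simple rule-based filter, sketched below. This is not the topological method of [3]. It is only a heuristic under stated assumptions: a funding transaction has at least 100 inputs and at most two outputs, and a later transaction spends from one of its output addresses within 36 hours. Both thresholds, the record format, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tx:
    tx_id: str
    timestamp: int  # seconds since some reference time
    inputs: list    # input addresses
    outputs: list   # output addresses


def find_candidate_payments(txs, min_inputs=100, max_outputs=2, max_gap_hours=36):
    """Return (t1, t2) ID pairs that match the funding-then-payment pattern."""
    max_gap = max_gap_hours * 3600
    funding = [
        t for t in txs
        if len(t.inputs) >= min_inputs and len(t.outputs) <= max_outputs
    ]
    candidates = []
    for t1 in funding:
        funded_addresses = set(t1.outputs)
        for t2 in txs:
            if t2 is t1:
                continue
            gap = t2.timestamp - t1.timestamp
            # t2 must come after t1 and spend from an address that t1 funded.
            if 0 < gap <= max_gap and funded_addresses & set(t2.inputs):
                candidates.append((t1.tx_id, t2.tx_id))
    return candidates


if __name__ == "__main__":
    t1 = Tx("t1", 0, [f"x{i}" for i in range(150)], ["a1"])
    t2 = Tx("t2", 24 * 3600, ["a1"], ["a0", "a2"])
    noise = Tx("t3", 10, ["y"], ["z"])
    print(find_candidate_payments([t1, t2, noise]))
```

Such a filter will also match benign behavior, for instance exchange consolidation transactions, so its output is best treated as a candidate list for further analysis.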

Hackers do not control the payment pattern in Figure 2; in fact, the identification of similar patterns can reveal hidden payments that companies silently make to avoid bad publicity. Once the coins reach the ransom address \(a_0\), hackers use money laundering methods to cash out and remove the taint, then sell the previously tainted coins on online exchanges for fiat money.

Figure 2. Anatomy of a ransomware payment to \(a_0\). Usually, \(t_1\) has hundreds of inputs and one or two outputs. Figure courtesy of the authors.

Money Laundering

Since 2009, malicious entities have used three money laundering regimes on blockchain with increasing levels of sophistication. In the first regime, hackers pass a large number of coins through multiple transactions to hide their origins. In this scenario, it is assumed that observers do not have the necessary analytical tools to track the flow of coins in the large blockchain graph. However, increasing law enforcement activities and analytical capabilities have rendered such obfuscation efforts futile. 

Since 2013, users have designed coin mixing schemes to further blur the flow, making coin tracking in the network a painstaking and often fruitless task [9]. Furthermore, mixing repeats over multiple rounds with identical output amounts. At the end of the scheme, multiple addresses hold the laundered coins (minus the transaction and user payout fees), which can then be sold separately. During our analysis, we quickly reached 70 percent of the daily Bitcoin network addresses by starting from known WannaCry ransomware addresses and moving forward by two transactions.
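The reach of a forward trace can be estimated with a breadth-first expansion, as in the sketch below. It assumes that each transaction is an (input addresses, output addresses) pair and, for simplicity, ignores timestamps and coin amounts, so every output of a transaction that spends from a reached address is counted as reached. The function name and data format are hypothetical.

```python
def reachable_addresses(transactions, seeds, max_hops):
    """Addresses reachable from `seeds` by following at most `max_hops` transactions."""
    # Index: address -> output lists of the transactions in which it is an input.
    spends_by_address = {}
    for inputs, outputs in transactions:
        for address in inputs:
            spends_by_address.setdefault(address, []).append(outputs)

    reached = set(seeds)
    frontier = set(seeds)
    for _ in range(max_hops):
        next_frontier = set()
        for address in frontier:
            for outputs in spends_by_address.get(address, []):
                next_frontier.update(outputs)
        next_frontier -= reached
        if not next_frontier:
            break
        reached |= next_frontier
        frontier = next_frontier
    return reached


if __name__ == "__main__":
    txs = [
        (["s"], ["m1", "m2"]),
        (["m1"], ["n1"]),
        (["m2"], ["n2", "n3"]),
        (["n1"], ["far"]),
    ]
    all_addresses = {a for ins, outs in txs for a in ins + outs}
    reached = reachable_addresses(txs, {"s"}, max_hops=2)
    print(len(reached) / len(all_addresses))
```

Because each mixing round fans coins out to many addresses, the reached set grows quickly with the number of hops, which is why naive taint tracking soon covers a large share of the network and loses discriminative power.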

Shapeshifting, which commenced in 2017, is the latest money laundering scheme for cryptocurrency [14]. This technique passes tainted Bitcoins to an online exchange, where they are then sold. The exchange pays the amount in Monero, Zcash, or Dash cryptocurrencies. These anonymous cryptocurrencies provide multiple mechanisms for the creation of transactions that hide input, output, and amount information.

Yet humans err often, and cases of breached anonymity have occurred even with anonymous cryptocurrencies [6]. Mistakes have also transpired at the protocol level, such as when the Zcash wallet exposed the IP of an address upon receiving a malformed transaction [12]. Overall, interdisciplinary methods at the interface of statistical data analysis and protocol software analysis play an increasingly important role in the detection of money laundering schemes.

Conclusions

The Internet allows ordinary people to access and download information from anywhere. Blockchain is a promising technology that permits users to do the same in cases like decentralized finance, cross-country payments, and contractual agreements.

We must develop data analytics capabilities for all blockchains and integrate their data in order to glean a holistic view of the blockchain ecosystem. The most urgent capabilities pertain to e-crime prevention and detection. For example, e-crime goes mostly unreported because victims gain nothing from disclosure. In fact, we have detected potentially as many as 10 times more ransomware payments than those that are officially reported [3]. Because this information may be unknown to law enforcement agencies, addressing e-crime is particularly important.


Acknowledgments: The authors are grateful to Ricky Rambharat (Office of the Comptroller of the Currency within the U.S. Department of the Treasury) for the motivating discussion and invaluable feedback on this paper. The research has been supported in part by NSF OAC 2115094, NSF DMS 1925346, and NSF ECCS 1824716.

References
[1] Akcora, C.G., Dey, A.K., Gel, Y.R., & Kantarcioglu, M. (2018). Forecasting Bitcoin price with graph chainlets. In PAKDD 2018: Advances in knowledge discovery and data mining (pp. 765-776). Lecture notes in computer science (Vol. 10939). Cham, Switzerland: Springer.
[2] Akcora, C.G., Gel, Y.R., & Kantarcioglu, M. (2022). Blockchain networks: Data structures of Bitcoin, Monero, Zcash, Ethereum, Ripple and Iota. WIREs Data Mining Knowl. Discov., 12(1), e1436.
[3] Akcora, C.G., Li, Y., Gel, Y.R., & Kantarcioglu, M. (2020). BitcoinHeist: Topological data analysis for ransomware detection on the Bitcoin blockchain. In Proceedings of the twenty-ninth international joint conference on artificial intelligence (pp. 4439-4445). International Joint Conference on Artificial Intelligence.
[4] Greaves, A., & Au, B. (2015). Using the Bitcoin transaction graph to predict the price of Bitcoin. Retrieved from http://snap.stanford.edu/class/cs224w-2015/projects_2015/Using_the_Bitcoin_Transaction_Graph_to_Predict_the_Price_of_Bitcoin.pdf.
[5] Moin, A., Sirer, E.G., & Sekniqi, K. (2019). A classification framework for stablecoin designs. Preprint, arXiv:1910.10098.
[6] Möser, M., Soska, K., Heilman, E., Lee, K., Heffan, H., Srivastava, S., ... Christin, N. (2018). An empirical analysis of traceability in the Monero blockchain. Proc. Priv. Enh. Technol., 2018(3), 143-163. 
[7] Portnoff, R.S., Huang, D.Y., Doerfler, P., Afroz, S., & McCoy, D. (2017). Backpage and Bitcoin: Uncovering human traffickers. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1595-1604). New York, NY: Association for Computing Machinery.
[8] Ron, D., & Shamir, A. (2013). Quantitative analysis of the full Bitcoin transaction graph. In International conference on financial cryptography 2013: Financial cryptography and data security (pp. 6-24). Lecture notes in computer science (Vol. 7859). Berlin, Heidelberg: Springer.
[9] Ruffing, T., Moreno-Sanchez, P., & Kate, A. (2014). CoinShuffle: Practical decentralized coin mixing for Bitcoin. In Computer security - European symposium on research in computer security 2014 (pp. 345-364). Lecture notes in computer science (Vol. 8713). Cham, Switzerland: Springer.
[10] Shamsi, K., Victor, F., Kantarcioglu, M., Gel, Y., & Akcora, C.G. (2022). Chartalist: Labeled graph datasets for UTXO and account based blockchains. In Proceedings of the 36th conference on neural information processing systems (NeurIPS).
[11] Spagnuolo, M., Maggi, F., & Zanero, S. (2014). Bitiodine: Extracting intelligence from the Bitcoin network. In Proceedings of 18th international conference on financial cryptography and data security (pp. 457-468). Christ Church, Barbados.
[12] Tramèr, F., Boneh, D., & Paterson, K.G. (2020). Remote side-channel attacks on anonymous transactions. Cryptology ePrint Archive, Paper 2020/220.
[13] Van Boom, D. (2022, May 25). Luna crypto crash: How UST broke and what's next for Terra. CNET. Retrieved from https://www.cnet.com/personal-finance/crypto/luna-crypto-crash-how-ust-broke-and-whats-next-for-terra.
[14] Yousaf, H., Kappos, G., & Meiklejohn, S. (2019). Tracing transactions across cryptocurrency ledgers. In SEC'19: Proceedings of the 28th USENIX security symposium (pp. 837-850). Santa Clara, CA: USENIX Association.

Cuneyt G. Akcora is an assistant professor of computer science and statistics at the University of Manitoba in Canada. His primary research interests are data science on complex networks and large-scale graph analysis, with applications in social, biological, Internet of Things, and blockchain networks. 
Murat Kantarcioglu is an Ashbel Smith Professor in the Department of Computer Science at the University of Texas at Dallas (UTD), director of the Data Security and Privacy Lab at UTD, a faculty associate at Harvard University’s Data Privacy Lab, and a visiting scholar at the University of California, Berkeley’s Sky Computing Lab. His research focuses on the integration of cybersecurity, data science, and blockchains for the creation of technologies that can efficiently and securely process and share data. 
Yulia R. Gel is a professor of statistics at the University of Texas at Dallas and is currently on a stint as program director in the Division of Mathematical Sciences at the National Science Foundation. Her research interests are in mathematical foundations of data science and focus particularly on topological and geometric methods in statistics and machine learning. 