Ethereum Transaction: A Machine Learning Approach on Classification

JHN
Sep 12, 2023

1. INTRODUCTION

This document describes the domain, topic, datasets, and research question of this Major Research Project (MRP). We first give a brief background of the topic domain and the sources of the datasets, then define the problem. A literature review and a detailed exploratory analysis of the data follow. Next, we present the classification methodology, from data preprocessing to model training, along with the results on key performance metrics. Finally, we close with a discussion of the implications, limitations, and potential future directions of this project.

A. Background

For the purpose of this Major Research Project, the classification of Ethereum transactions has been selected as a primary focus. This topic has gained increasing relevance with the escalating regulation of the cryptocurrency sector by the Securities and Exchange Commission (SEC) [1]. The agency's chairman, Gary Gensler, expresses a widespread belief that cryptocurrencies are rife with illegal activities. Consequently, it becomes essential to scrutinize this claim meticulously and develop instrumental tools that can guide policy formation in this area.

Bitcoin, the inaugural cryptocurrency, was designed with an emphasis on transparency, leveraging public ledgers to maintain accountability. In keeping with this spirit, the capability to monitor and audit transactions on any cryptocurrency, akin to conventional financial methodologies, could deter potential wrongdoers and offer an accurate depiction of this sector, in particular of Ethereum's activity landscape, which is the focus of this paper. To this end, we review and build on the work of previous researchers in this field.

Several methods have been proposed to deanonymize and trace users’ identities in the blockchain world. On the one hand, much research examines the mechanism of each crypto platform. For instance, A. Kumar et al. took a deep look at the ring signature vulnerability of Monero in "A Traceability Analysis of Monero's Blockchain" (2017) [2]. However, this approach is limited and difficult to scale. On the other hand, there is significant interest in using machine learning (ML) to help regulate the field. A significant contribution in this area was made by the 2019 paper "Regulating Cryptocurrencies: A Supervised Machine Learning Approach to De-Anonymizing the Bitcoin Blockchain" [3], which applied ML in an innovative manner to tackle the issue of cryptocurrency regulation. Additionally, "Blockchain is Watching You: Profiling and Deanonymizing Ethereum Users" (2020) [4] provided valuable insights by establishing a node embedding methodology for deanonymizing Ethereum users.

B. Research Question

This analysis has two main parts: descriptive and predictive analytics. With descriptive analytics, the goal is to find patterns, correlations, or relationships that can be used to generalize or manually differentiate between good and bad transactions. These general patterns or behaviours can then help explain the models' results in the prediction step. The predictive part consists of training and evaluating multiple classifiers that can correctly label the 263 categories listed in Appendix A. Based on classification results and training time, the models are then evaluated to find the best ones for live monitoring and for transaction-based crypto investigation.

2. LITERATURE REVIEW

As mentioned above, machine learning (ML) has been adopted in the emerging field of cryptocurrency regulation. A comprehensive review of data mining techniques used to detect financial fraud from 2009 to 2019 also echoed the call for further research, particularly in emerging financial technologies like cryptocurrencies [9]. In light of this, "Past, present, and future of the application of machine learning in cryptocurrency research" (2022) [10] discussed how reinforcement learning algorithms have been utilized in blockchain transaction design and in identifying illicit transactions, and called for more research on recent events that may impact blockchain. At the same time, it is essential to remember that blockchain and artificial intelligence, while potentially beneficial in detecting fraud, could also assist in committing fraud if manipulated or trained to exhibit biases, as warned by "Fraud in a World of Advanced Technologies" (2019) [11]. Thus, it is important for our datasets to mimic the real-world distribution of legitimate and illegal transactions to minimize bias. According to the 2023 Chainalysis report, the current ratio of legitimate to dodgy transactions is about 97 to 3 [16].

In "Regulating Cryptocurrencies: A Supervised Machine Learning Approach to De-Anonymizing the Bitcoin Blockchain" (2019), the authors applied seven supervised ML techniques to a large dataset sourced from a blockchain audit firm, Chainalysis, and found that the Gradient Boosting algorithm outperforms models like Random Forest or the AdaBoost Classifier [3]. However, these techniques are not guaranteed to be applicable to all cryptocurrencies. "Blockchain is Watching You: Profiling and Deanonymizing Ethereum Users" (2020) demonstrated that, due to architectural differences, Bitcoin techniques do not perform well on Ethereum [4]. This study also showed that the Tornado Cash Mixer, a privacy tool on Ethereum, does not work effectively for transactions above the average transaction value. In addition to these ML-based techniques, other methods have been applied for the same purpose. "Deanonymizing Tor Hidden Service Users Through Bitcoin Transactions Analysis" (2019) leveraged public information on social networks like Twitter and BitcoinTalk, successfully identifying real people behind Tor hidden services [5]. Similarly, "A Traceability Analysis of Monero's Blockchain" (2017) analyzed Monero's blockchain transactions using graph closure analysis, revealing vulnerabilities in the anonymity of the platform [2]. An extensive study, "Deanonymization and linkability of cryptocurrency transactions based on network analysis" (2019), found that even with the use of Tor, users' IP locations can be traced in various cryptocurrencies, and suggested avoiding centralized networks [6].

The continuous advancement of ML has also been used to detect fraudulent transactions within Ethereum. "LGBM: a machine learning approach for Ethereum fraud detection" (2022) introduced the LightGBM machine learning model, which outperformed older models like Random Forest or Logistic Regression [7]. A year later, a more sophisticated hybrid feature fusion model called LBPS (LSTM-FCN) was tested, showing even higher performance than older models [8].

Lastly, some recent work has focused on newer techniques such as Graph Neural Networks for fraud detection. Runnan Tan et al. (2021) proposed a method for detecting fraudulent transactions on the Ethereum network, achieving an accuracy of 95% [12]. Similarly, a model suitable for smart contract anomaly detection was proposed to detect financial fraud on the Ethereum platform, proving to be more effective and stable than traditional models [13]. As cryptocurrencies continue to evolve, newer algorithms will be leveraged. Thus, within the scope of this project, Gradient Boosting algorithms (XGBoost and LightGBM) are tested against neural network models, namely a Multilayer Perceptron (MLP) and an LSTM. Training time and key metrics will also determine each model's suitability for live monitoring, which requires speed and good overall accuracy, or for investigation, which needs to be as precise as possible.

3. METHODOLOGY

Since the dataset contains text, date, and numeric fields, I conducted two different kinds of analysis to learn more about my dataset as mentioned above.

3.1 Data Acquisition

The labelled addresses are obtained from a dataset published on Kaggle [14], which is taken from a trusted service called Etherscan. Google BigQuery is also used to obtain all relevant information, for instance the details of each transaction for each address [15].

Figure 1: Data preparation and analysis process

The detailed code is provided in figure 2, where “kaggle_table” is the table containing the addresses from the Kaggle public dataset.

Figure 2: Code to obtain list of associated transactions to Kaggle dataset’s addresses.
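A query of this kind could look roughly like the sketch below. It is an illustration only, not the project's actual code: it assumes the public `bigquery-public-data.crypto_ethereum.transactions` table, a hypothetical `my_project.my_dataset.kaggle_table` holding the labelled addresses in a column assumed to be named `address`, and an illustrative choice of output columns.

```python
# Hypothetical location of the uploaded Kaggle addresses (assumption).
KAGGLE_TABLE = "my_project.my_dataset.kaggle_table"
# Public Ethereum transactions table hosted on BigQuery.
TRANSACTIONS_TABLE = "bigquery-public-data.crypto_ethereum.transactions"


def build_transactions_query(kaggle_table: str = KAGGLE_TABLE) -> str:
    """Return SQL selecting every transaction sent or received by a labelled address."""
    return f"""
    SELECT
      t.`hash`,
      t.from_address,
      t.to_address,
      t.value,
      t.gas,
      t.gas_price,
      t.block_timestamp
    FROM `{TRANSACTIONS_TABLE}` AS t
    WHERE t.from_address IN (SELECT address FROM `{kaggle_table}`)
       OR t.to_address IN (SELECT address FROM `{kaggle_table}`)
    """


query = build_transactions_query()
print(query)
```

The resulting string can be pasted into the BigQuery console or passed to a BigQuery client; filtering with `IN` rather than a join avoids duplicating a transaction when both its sender and receiver are labelled.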

Due to the limitations of a local machine, I cannot use the full 20,000 addresses, as the file with full transaction data is too big: 10,000 addresses alone already yield over 10GB of data with 17,000,000+ transactions. Because of these local computational constraints, I am limited to only 4,422 addresses for training the Machine Learning models. These 4,422 addresses have a total of 7,660,789 transactions. The transactions were then combined with other feature-engineered parameters (table 1) to create a 5GB csv file. Note that we aim to categorize transactions even for accounts with little or no history, so the Ethereum address column is removed.

Table 1: Parameters after feature engineering.

3.2 Data Processing

Because we are querying a large public dataset, Google BigQuery breaks the returned data into 450MB chunks. Hence, we have a total of 23 files to merge. As mentioned above, it is vital to keep the class ratio as close to real life as possible. Thus, by taking bad activities only from the first 10 files, we retained 211,693 instances of the 'Dodgy' class and 7,449,093 instances of the 'Legit' class, giving a ratio of 2.8 to 97.2, which satisfies this requirement.
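The ratio can be checked directly from the two class counts reported above. The variable names are illustrative.

```python
# Class counts reported in the text after merging the BigQuery chunks.
dodgy = 211_693
legit = 7_449_093

total = dodgy + legit
dodgy_share = 100 * dodgy / total
legit_share = 100 * legit / total

print(f"Dodgy: {dodgy_share:.1f}%  Legit: {legit_share:.1f}%")
# Dodgy: 2.8%  Legit: 97.2%
```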

After these steps, many transactions still had NaN or empty values in many columns, so those transactions were removed. Then, looking at the number of occurrences per tag, there are clearly some noisy tags. While a tag occurs 30,643 times on average, the tags in the bottom 5th percentile did not even occur more than 10 times in this dataset and were therefore treated as noise.
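A minimal sketch of this kind of rare-tag filtering follows. The row layout, the `tag` key, and the exact cut-off of 10 occurrences are assumptions for illustration; the project's own threshold was derived from the 5th percentile of tag counts.

```python
from collections import Counter


def drop_rare_tags(rows, min_count=10, tag_key="tag"):
    """Keep only rows whose tag occurs at least `min_count` times."""
    counts = Counter(row[tag_key] for row in rows)
    return [row for row in rows if counts[row[tag_key]] >= min_count]


# Toy data: two frequent tags and one noisy tag that appears only 3 times.
rows = (
    [{"tag": "exchange", "value": 1.0}] * 12
    + [{"tag": "phishing", "value": 0.1}] * 10
    + [{"tag": "typo_tag", "value": 0.5}] * 3
)
kept = drop_rare_tags(rows)
print(Counter(row["tag"] for row in kept))
```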

Next, as we have mixed types of data, the categorical columns must be converted with an encoder to obtain a standard all-float dataframe. Here, we used one-hot encoding because it treats categories independently and, unlike ordinal encoding, makes no assumption about their order. That said, XGBoost and LightGBM are based on decision trees, which also handle label encoding well. Then, we scaled the columns using the MinMaxScaler class from the scikit-learn package. Two dataframes are saved at this point: one keeps the date format of columns like block timestamp for descriptive analysis, while the other is fully scaled to serve as input for the ML models.
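To make the two transformations concrete, here is a small pure-Python sketch of one-hot encoding and min-max scaling. It only illustrates the idea; the project itself relied on library implementations, and the sample values are made up.

```python
def one_hot(values):
    """Map each category to a 0/1 vector with one position per distinct category."""
    categories = sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return encoded, categories


def min_max_scale(values):
    """Rescale numbers linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


tags = ["exchange", "phishing", "exchange", "defi"]
encoded, categories = one_hot(tags)
scaled = min_max_scale([0.5, 2.0, 3.5, 1.0])

print(categories)   # ['defi', 'exchange', 'phishing']
print(encoded[0])   # [0.0, 1.0, 0.0]
print(scaled[:3])   # [0.0, 0.5, 1.0]
```

No category is given a larger number than another, which is the property that motivates one-hot encoding over ordinal encoding.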

3.3 Descriptive Analysis

The exploratory part of the analysis revolves around the distinct patterns that emerge when comparing legitimate and dodgy transactions. The data, spanning from 2015 to 2023, indicates that the majority of transactions are conducted by smart contracts, with a substantial portion being legitimate. However, due to the limitations of using a free source, I could not get an even time distribution, as the data is skewed towards the 2018-2022 range (figure 4). Furthermore, the illegal activities only run up to 2019 (figure 3), so the dataset misses a large portion of the state-sponsored cyber attack transactions that became prominent from 2022. Given more financial resources, the premium pro plus package from Etherscan could provide a more even distribution and more recent data.

Notably, as mentioned above, the dominance of the "Legit" class is credible, as it is supported by the Chainalysis report. It should also be noted that the majority of transactions are made by smart contract services rather than wallets. While this gives more confidence in detecting dodgy user transactions, it can potentially miss systemic scam crypto projects or schemes like FTX. The timestamps are recorded in UTC-0 (England), providing a global perspective on Ethereum transactions. Thanks to its nature, Ethereum can process cross-border transactions faster and with lower fees than traditional means. Figure 5 and figure 6 show a clear distinction between legit and dodgy transactions.

Figure 5: Class Legit hour of transaction.

Figure 6: Distribution of Dodgy class transaction time.

In fact, there is a significant drop in the volume of dodgy activity before 1:00 UTC-0 (21:00 UTC-4 Toronto time, 8:00 UTC+7 Bangkok time) and after 15:00 UTC-0 (11:00 UTC-4, 22:00 UTC+7). This is not the case for legit transactions, apart from a spike at 8:00 to 9:00 UTC-0 (4:00 to 5:00 UTC-4, 15:00 to 16:00 UTC+7). This suggests a correlation between dodgy transactions and both the working hours of East Asia and the evening hours of North America.
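The hour conversions quoted above can be reproduced with fixed UTC offsets. This sketch assumes the same offsets as the text (UTC-4 for Toronto, UTC+7 for Bangkok) and uses an arbitrary date, since only the hour matters.

```python
from datetime import datetime, timedelta, timezone

TORONTO = timezone(timedelta(hours=-4))  # UTC-4, as used in the text
BANGKOK = timezone(timedelta(hours=7))   # UTC+7


def local_hours(utc_hour):
    """Return the (Toronto, Bangkok) wall-clock hour for a given UTC hour."""
    moment = datetime(2023, 1, 2, utc_hour, tzinfo=timezone.utc)  # arbitrary date
    return moment.astimezone(TORONTO).hour, moment.astimezone(BANGKOK).hour


for hour in (1, 8, 9, 15):
    print(hour, local_hours(hour))
# 1 (21, 8)
# 8 (4, 15)
# 9 (5, 16)
# 15 (11, 22)
```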

Figure 7: Distribution of Dodgy transactions by day in a week

Figure 8: Distribution of Legit transactions by day in a week.

It can also be seen that legit crypto trades are done during the weekend, as Saturday and Sunday volumes take the top two spots (figure 8). On the other hand, bad actors seem to favor Friday, Saturday and, surprisingly, Tuesday (figure 7). This suggests that cryptocurrencies are treated as a pastime investment rather than a full-time job like that of stock or bond traders. Next, we delve into the distribution of transaction value, shown in figure 9 below. Based on the heat map (figure 10), transaction value seems to be less important than the number of deposits and the number of withdrawals.

Figure 9: Distribution of Dodgy transaction value.

Figure 10: Heat map of features correlation.

This visual representation provides a more tangible understanding of the financial implications associated with each class, thereby offering a more comprehensive view of the transaction landscape. In general, the average value of illegal transactions is smaller than that of legit ones. In fact, it is extremely common to see bad actors transact less than 0.000859 ETH, which is less than 2.1 CAD at the time of this writing. This suggests that bad actors frequently use mixing services to break large sums into many smaller ones, which makes the funds harder to trace back and makes it more difficult for automated fraud detectors to function on crypto exchanges, especially those operating in developing nations where this amount falls within day-to-day expenses.
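When working with raw chain data, a threshold such as 0.000859 ETH usually has to be expressed in wei (1 ETH = 10^18 wei), the unit in which transaction values are recorded on chain. The sketch below shows that conversion; the helper names are illustrative, and whether the project's value column was kept in wei is an assumption.

```python
from decimal import Decimal

WEI_PER_ETH = 10 ** 18  # 1 ETH = 10^18 wei


def eth_to_wei(eth: str) -> int:
    """Convert an ETH amount given as a decimal string to an integer number of wei."""
    return int(Decimal(eth) * WEI_PER_ETH)


threshold_wei = eth_to_wei("0.000859")
print(threshold_wei)  # 859000000000000


def is_small_transfer(value_wei: int) -> bool:
    """Flag transfers below the 0.000859 ETH level discussed in the text."""
    return value_wei < threshold_wei
```

Using `Decimal` on a string avoids the rounding error that a float multiplication by 10^18 could introduce.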

As anticipated, the most active actors on the Ethereum platform are crypto exchanges (figure 11). This observation is hardly surprising, given that crypto exchange platforms, especially international ones, have constant inflows and outflows throughout the day. Although the many overlaps below the 10,000-withdrawal threshold make the chart difficult to read, the colors show that most of the light green points are crypto exchanges and smart contracts such as Uniswap or stablecoins.

Figure 11: Most active actors clustered by tags.

An intriguing revelation from the dataset is that dodgy addresses have a much shorter lifespan than legitimate ones (figure 12). This observation suggests that illegal addresses are often used a few times before either being blocked or abandoned by bad actors.

Figure 12: Distribution of lifetime for Legit (1) and Dodgy (0) classes.

The y axis unit is days. As can be seen from the chart above, there is an anomaly at around 1,000 days for the “Dodgy” class, and some instances reach lifetimes of around 2,000 days. This suggests that the nature of the crimes differs: accounts can be hacked or stolen, and once that happens, the label effectively changes from “Legit” to “Dodgy”.
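One plausible way to derive such a lifetime feature is the number of days between an address's first and last observed transaction. That definition, the addresses, and the timestamps below are assumptions made for illustration.

```python
from datetime import datetime


def lifetime_days(timestamps):
    """Days between an address's first and last observed transaction."""
    return (max(timestamps) - min(timestamps)).days


# Hypothetical transaction timestamps for two addresses.
history = {
    "0xlegit": [datetime(2017, 3, 1), datetime(2019, 6, 15), datetime(2022, 1, 10)],
    "0xdodgy": [datetime(2018, 5, 2), datetime(2018, 5, 9)],
}
lifetimes = {address: lifetime_days(stamps) for address, stamps in history.items()}
print(lifetimes)
```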

3.4 Design

Moving to the second part of the analysis, we need to divide the dataset into training, testing and validation sets. In this MRP, we shuffled the dataset before splitting it in the standard 70-15-15 proportions, using the stratify parameter to make sure that tags with few occurrences are present in all three sets. The computer used to train the ML models is a Lenovo Legion Slim 7 Gen 6 with an AMD Ryzen 7 5800H CPU, 24GB of RAM and an NVIDIA GeForce RTX 3060 GPU with 6GB of VRAM. This keeps the training time close to what an average home laptop would achieve.
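The idea behind a stratified 70-15-15 split can be sketched in plain Python as follows. This is a simplified stand-in for a library splitter with a stratify option, and the toy labels and seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return train/test/validation index lists that preserve class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test, valid = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        n_train = round(fractions[0] * len(indices))
        n_test = round(fractions[1] * len(indices))
        train += indices[:n_train]
        test += indices[n_train:n_train + n_test]
        valid += indices[n_train + n_test:]
    return train, test, valid


labels = ["legit"] * 180 + ["dodgy"] * 20
train, test, valid = stratified_split(labels)
print(len(train), len(test), len(valid))  # 140 30 30
```

Because each class is split separately, the minority class keeps the same share in all three sets, which a plain random split does not guarantee for rare tags.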

We will perform two sets of tests: one dataframe has both transaction- and tag-related features, for crypto investigation, while the other has only transaction-related features, for live monitoring. This is because, in the live monitoring case, new transactions will not have tag information yet. Also, since the two scenarios have different needs, we will choose the model for each based on the metrics that suit it.

3.5 Model selection

As mentioned above, XGBoost, LightGBM, MLP and LSTM are chosen to measure and compare effectiveness in predicting transaction labels. Due to physical computational constraints, it was not possible to perform cross-validation for the XGBoost model. Thus, for all 4 models, we compensate for the lack of cross-validation by manually changing random_state when shuffling the data.

For the XGBoost model, we used the default settings, with the only exception being the number of training rounds, which is set at 3 and can be increased to 5. We also logged evaluation metrics to monitor training and testing loss and prevent overfitting. For LightGBM, due to memory constraints, we set subsample to 0.8 and boosting_type to gbdt. Furthermore, the number of rounds is set at 3 only, which will be explained in section 4.

For the multilayer perceptron neural network, we crafted a relatively simple yet effective model. It uses 3 Dense layers of 1024, 512 and 256 units and a Dropout rate of 30%. While a Dropout rate of 20% slightly improves the training time, the prediction precision drops by 1% to 2%, depending on the random_state. Thus, the dropout rate can be customized to suit personal needs. ReLU is used as the activation function for all 3 Dense layers. For the LSTM neural network, we only used 2 layers of 50 neurons and one Flatten layer. ReLU is also chosen as the activation function here. For both neural networks, we train for only 1 round, which will be explained below.
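To give a sense of the size of the MLP described above, the following sketch counts the trainable parameters of a fully connected stack. The input width of 40 features and a final Dense output layer with one unit per category are assumptions for illustration; Dropout layers add no parameters.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network given its layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


N_FEATURES = 40   # hypothetical input width; the real value depends on table 1
N_CLASSES = 263   # number of categories stated in the research question

sizes = [N_FEATURES, 1024, 512, 256, N_CLASSES]
print(dense_param_count(sizes))  # 765703
```

Under these assumptions the network stays below one million parameters.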

For all models, cost-sensitive learning is enabled with scale_pos_weight because our dataset is imbalanced. As mentioned above, a tag occurs 30,643 times on average, but the least frequent tags occur only about 10 times. The primary evaluation metrics used to assess classifier performance were Precision and Recall. While Accuracy was also considered, Precision held greater significance due to the imbalanced nature of the dataset. A high Accuracy can be misleading, since it can be achieved simply by assigning everything to the majority class.
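A toy example makes the last point concrete: with a 97 to 3 class ratio, a classifier that always predicts the majority class reaches 97% accuracy while never detecting a single dodgy transaction. The data below is synthetic.

```python
def recall(y_true, y_pred, cls):
    """Fraction of samples of class `cls` that were predicted as `cls`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in relevant) / len(relevant)


# 97 legit and 3 dodgy samples, mirroring the class ratio discussed in the text.
y_true = ["legit"] * 97 + ["dodgy"] * 3
y_pred = ["legit"] * 100  # a classifier that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_recall = (recall(y_true, y_pred, "legit") + recall(y_true, y_pred, "dodgy")) / 2

print(accuracy)      # 0.97
print(macro_recall)  # 0.5
```

The macro average weights both classes equally, so the complete failure on the minority class is visible in the score.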

4. RESULTS

4.1 Crypto investigation purpose

We first discuss the training time of each model for the crypto investigation case. As can be seen from table 2, MLP performs at the top level, needing only 90s for 1 round. However, it is also important to consider the performance metrics on a separate, never-seen-before validation set, as training time alone provides no real value. As can be seen from table 3, with only 1 round of training, LSTM performs on the same tier as 3-round LightGBM and 3-round XGBoost. However, the most impressive result still belongs to MLP. With only 1 round, it achieved the same macro performance as 5-round LightGBM and was only slightly below 5-round XGBoost.

Table 2: Training time with both transaction and tag features

Table 3: Key performance metrics for each ML model instance on validation set.

The macro average treats all categories the same, making it ideal for investigation because, most of the time, bad actors are under-represented or less common than other cases. From figure 13 and figure 14 above, it can be seen that 5-round XGBoost marginally outperforms 1-round MLP in terms of ROC-AUC. Even so, MLP can be suggested as the preferred choice for crypto investigation if it is trained for at least one more round. This is thanks to its low training time of just 90s per round, which makes it at most 4.8 minutes for 3 rounds, still significantly less than the 58 minutes of 5-round XGBoost.

4.2 Live monitoring purpose

We now turn to the live monitoring scenario, in which only transaction-related features are used to train the models. This accounts for the fact that bad actors can hijack legitimate accounts to make illegal transactions. Since the neural network models completely outperformed the two Gradient Boosting methods, we only tested MLP and LSTM for this case. To stay consistent with the rest of the paper, we again report the macro average, which often yields lower scores than other averaging schemes but has proved more robust to changes in data imbalance.

Table 4: Training time with only transaction features.

Table 5: Key performance metrics with only transaction features for each ML model instance on validation set.

Figure 15 and figure 16 above, together with table 5, show a mixed result. Although the MLP model initially outperformed the LSTM model in round 1, the rate of improvement exhibited by the LSTM model was remarkable. By round 3, the LSTM model slightly surpassed the MLP model across all metrics. However, this was achieved at the cost of triple the learning time. Therefore, the selection of an appropriate model depends on individual needs and considerations.

5. CONCLUSION, LIMITATIONS AND FUTURE RESEARCH

In conclusion, the outcome of this paper is a selection of suitable models for each scenario. For a crypto investigation scenario, Multi-Layer Perceptron (MLP) proves to be a superior choice due to its speed and effective performance. It not only provides robust results with only one round of training, but also outshines other models when trained for a few more rounds. The low training time, in addition to its performance which is on par with, if not superior to, XGBoost with five rounds, further solidifies MLP as the go-to model.

However, when it comes to live monitoring, the situation is slightly more complex. The decision between the MLP and LSTM models hinges on the specific requirements of the scenario. While MLP may lead the way initially, LSTM's rate of improvement is substantial and it pulls ahead by round three. Nevertheless, this performance comes at the cost of increased learning time. Hence, the choice of model for live monitoring in cryptocurrency exchanges needs careful consideration.

The first key limitation encountered in this study relates to the potential omission of systemic scam cryptocurrency projects or schemes, such as those observed in platforms like FTX. The inherent nature of these schemes, often characterized by complex and deceptive strategies touted by charismatic CEOs, might not be readily identifiable by the methods employed in this research. Furthermore, as the data for illegal activities only runs up to 2019, it misses the state-sponsored actors. As mentioned in the Chainalysis 2023 report, the world has witnessed a spike in the use of cryptocurrency by Russian and North Korean state-sponsored hackers.

Secondly, the operation of automated fraud detectors in the crypto exchanges, particularly those functioning in developing nations, poses significant challenges. The socio-economic disparities and varied financial norms across regions make it difficult to define and implement a universal threshold for transaction amounts that may be considered suspicious. In many developing nations, transaction amounts that might be flagged as suspicious in more affluent countries often fall under the realm of day-to-day expenses.

Given these limitations, one potential area to explore is devising more sophisticated machine learning models capable of identifying systemic scam cryptocurrency projects. For example, the paper “Identification of Scams in Initial Coin Offerings With Machine Learning” uses machine learning to predict in advance whether an ICO is a scam based on known features, and to identify the most important characteristics of scam ICOs [17]. However, much more sophistication will be required when dealing with scam brokers.

6. REFERENCES

[1] B. Pisani, “SEC chair Gensler says Crypto Frenzy is rife with ‘hucksters, fraudsters, scam artists,’” CNBC, https://www.cnbc.com/2023/06/08/sec-chair-gensler-says-crypto-frenzy-is-rife-with-hucksters-fraudsters-scam-artists-.html (accessed Jun. 9, 2023).

[2] A. Kumar, C. Fischer, S. Tople, and P. Saxena, “A traceability analysis of Monero's blockchain,” SpringerLink, 2017. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-66399-9_9. [Accessed: 09-Mar-2023].

[3] H. H. S. Yin, K. Langenheldt, M. Harlev, R. R. Mukkamala, and R. Vatrapu, “Regulating cryptocurrencies: A supervised machine learning approach to de-anonymizing the bitcoin blockchain,” CBS Research Portal, 2019. [Online]. Available: https://research.cbs.dk/en/publications/regulating-cryptocurrencies-a-supervised-machine-learning-approac. [Accessed: 09-Mar-2023].

[4] F. Béres, I. A. Seres, A. A. Benczúr, and M. Quintyne-Collins, “Blockchain is watching you: Profiling and Deanonymizing Ethereum Users,” arXiv.org, 13-Oct-2020. [Online]. Available: https://arxiv.org/abs/2005.14051. [Accessed: 09-Mar-2023].

[5] H. A. Jawaheri, M. A. Sabah, Y. Boshmaf, and A. Erbad, “Deanonymizing Tor hidden service users through Bitcoin transactions analysis,” arXiv.org, 10-Jul-2019. [Online]. Available: https://arxiv.org/abs/1801.07501. [Accessed: 09-Mar-2023].

[6] A. Biryukov and S. Tikhomirov, “Deanonymization and linkability of cryptocurrency transactions based on network analysis,” 2019 IEEE European Symposium on Security and Privacy (EuroS&P), 2019. doi:10.1109/eurosp.2019.00022.

[7] R. Aziz, M. Baluch, S. Patel and A. Ganie, "LGBM: a machine learning approach for Ethereum fraud detection," International Journal of Information Technology, vol. 14, Jan. 2022.

[8] T. Wen, Y. Xiao, A. Wang and H. Wang, "A novel hybrid feature fusion model for detecting phishing scam on Ethereum using deep neural network," Expert Systems with Applications, vol. 211, Jan. 2023, doi: 10.1016/j.eswa.2022.118463.

[9] K. Al-Hashedi and P. Magalingam, "Financial fraud detection applying data mining techniques: A comprehensive review from 2009 to 2019," Computer Science Review, vol. 40, May. 2021, doi: 10.1016/j.cosrev.2021.100402.

[10] Y. Ren, C. Ma, X. Kong, K. Baltas and Q. Zureigat, "Past, present, and future of the application of machine learning in cryptocurrency research," Research in International Business and Finance, vol. 63, Dec. 2022, doi: 10.1016/j.ribaf.2022.101799.

[11] M. Nickerson, "Fraud in a World of Advanced Technologies: The Possibilities are (Unfortunately) Endless," The CPA Journal, vol. 89, no. 6, Jun. 2019.

[12] R. Tan, Q. Tan, P. Zhang and Z. Li, "Graph Neural Network for Ethereum Fraud Detection," 2021 IEEE International Conference on Big Knowledge (ICBK), Auckland, New Zealand, 2021, pp. 78-85, doi: 10.1109/ICKG52313.2021.00020.

[13] L. Liu, W. Tsai, Md. Zakirul, H. Peng and M. Liu, "Blockchain-enabled fraud discovery through abnormal smart contract detection on Ethereum," Future Generation Computer Systems, vol. 128, pp. 158-166, Mar. 2022, doi: 10.1016/j.future.2021.08.023.

[14] H. Hall, "Labelled Ethereum Addresses," Kaggle, 2020. https://www.kaggle.com/datasets/hamishhall/labelled-ethereum-addresses (accessed Dec. 12, 2022).

[15] “Ethereum in BigQuery: A public dataset for Smart Contract Analytics | Google Cloud Blog,” Google, 2018. https://cloud.google.com/blog/products/data-analytics/ethereum-bigquery-public-dataset-smart-contract-analytics (accessed Feb. 6, 2023).

[16] “The chainalysis 2023 crypto crime report,” Chainalysis, https://go.chainalysis.com/2023-Crypto-Crime-Report.html (accessed Jun. 6, 2023).

[17] B. Karimov and P. Wójcik, "Identification of Scams in Initial Coin Offerings With Machine Learning," Frontiers, vol. 4, Oct. 2021, doi: 10.3389/frai.2021.718450.