Using Data Mining and Machine Learning to Detect Cryptocurrency Risks

Written by Lai Yan Jean, Lee Jing Xuan, Valary Lim Wan Qian, and Xu Pengtai

Cryptocurrency and the Need to Regulate It

Cryptocurrency is a medium of exchange (an alternative form of payment) that exists in the digital world and relies on encryption to make transactions secure. The technology behind cryptocurrency allows users to send the currency directly to others without going through a third party, such as a bank. To make these transactions, users need a digital wallet, which can be set up without providing personal details such as an identification number or credit score. This allows users to be pseudo-anonymous.

Our Solution

Image retrieved from: https://dribbble.com/shots/2723032-Needle-in-a-Haystack

Data Scraping of Open Source Information

We identified 3 categories of open source data that could provide valuable information to aid in detecting suspicious activities in the cryptocurrency space. These categories are:

  1. Cryptocurrency-specific news sites, e.g. Cryptonews and Cointelegraph, which are more likely to cover news on smaller entities and minor security incidents.
  2. Social media sites, e.g. Twitter and Reddit, where cryptocurrency owners may post about hacks before news of the hack is officially released.

Sentiment Analysis Model

We tried four different Natural Language Processing (NLP) tools for sentiment analysis: VADER, Word2Vec, fastText and BERT models. We evaluated them on selected key metrics (recall, precision and F1). The RoBERTa model, a variant of BERT, performed best and was chosen as the final model.
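For readers unfamiliar with the three metrics, the sketch below shows how precision, recall and F1 are computed for a binary "risky / not risky" classifier. It is an illustration only: the function name `precision_recall_f1` and the sample labels are hypothetical and are not taken from our dataset or evaluation code.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    # True positives: risky articles correctly flagged as risky.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: harmless articles wrongly flagged as risky.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: risky articles that the model missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical labels: 1 = risky article, 0 = not risky.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recall matters most for this use case, because a missed hack is costlier than a false alarm.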

Image retrieved from: https://www.codemotion.com/magazine/dev-hub/machine-learning-dev/bert-how-google-changed-nlp-and-how-to-benefit-from-this/

Risk Scoring

Each article now has an associated source (news, Reddit or Twitter), a probability of risk, and a count, which is the number of times the article was reposted, shared or retweeted. To convert these risk probabilities into a single risk score for a cryptocurrency entity, we first scaled each article's probability to a range of 0 to 100, then computed a weighted average for each source that combines the articles' risk scores and counts. A weighted average places greater importance on articles with a higher count, as the number of shares is a likely indication of an article's relevance or significance.
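The per-source weighted average can be sketched as follows. This is a minimal illustration rather than our production code: it assumes each article is a `(source, probability, count)` tuple, the function name `source_risk_scores` and the sample values are hypothetical, and the fallback to a plain mean when a source has no shares at all is an assumption added so the sketch never divides by zero.

```python
from collections import defaultdict


def source_risk_scores(articles):
    """Return a count-weighted average risk score (0 to 100) per source."""
    weighted_sum = defaultdict(float)  # sum of score * count per source
    total_count = defaultdict(float)   # sum of counts per source
    plain_sum = defaultdict(float)     # sum of scores per source
    n_articles = defaultdict(int)      # number of articles per source

    for source, probability, count in articles:
        score = probability * 100  # scale the probability to 0-100
        weighted_sum[source] += score * count
        total_count[source] += count
        plain_sum[source] += score
        n_articles[source] += 1

    scores = {}
    for source in n_articles:
        if total_count[source] > 0:
            scores[source] = weighted_sum[source] / total_count[source]
        else:
            # Assumption: with no shares at all, use an unweighted mean.
            scores[source] = plain_sum[source] / n_articles[source]
    return scores


# Hypothetical articles for one entity: (source, risk probability, count).
articles = [
    ("news", 0.90, 30),
    ("news", 0.10, 10),
    ("twitter", 0.50, 4),
    ("twitter", 0.25, 12),
]
scores = source_risk_scores(articles)
print(scores)
```

In this example the heavily shared high-risk news article pulls the news score well above the unweighted mean of the two news articles, which is the effect the weighting is meant to achieve.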


Effectiveness of Our Solution

We tested our solution on a list of 174 cryptocurrency entities from 1 January 2020 to 30 October 2020 and compared the results with known hack cases within that period. Our risk scoring method performed fairly well, identifying 32 of the 37 known hack cases (about 86%).

We also analysed the effectiveness of our solution for individual entities. The graph below shows the risk score of Binance over the same period, with dotted red lines marking the known hack cases. Our solution reported increased risk scores for 4 of the 5 known hacks. There are also several spikes that do not coincide with known hack cases. We do not consider this a major problem, since our priority is for the model to identify as many hacks as possible and minimise the number of unidentified hacks.

[Figure: Risk score of Binance from 1 January 2020 to 30 October 2020, with known hack cases marked by dotted red lines]

Interesting Findings

During the risk scoring process, we observed that larger entities tend to record a larger proportion of false positives than smaller ones. This is because larger entities are discussed more and therefore attract more negative posts and false rumours, which contribute to the higher inaccuracy rate.

Limitations

We identified two potential limitations of our solution. The first is the need to maintain the scrapers constantly: website designs may change over time, and the scrapers for those websites would need to be updated to ensure that the relevant information is still retrieved for risk scoring.

Conclusion


Penultimate Year Student at National University of Singapore
