Prediction of fast-fluctuating crypto price movements
Background:
A cryptocurrency is a digital currency, a novel form of payment built on encryption algorithms. It has been a new and fervently discussed topic in the financial and information technology industries since Bitcoin introduced the idea in 2009. Every day, over $40 billion worth of cryptocurrencies is traded in the global market. However, the high volatility of these assets makes the return on this new kind of investment hard to predict. In this project, machine learning techniques were applied to forecast fast-changing crypto prices, for a better understanding of this innovative form of payment and currency.
The data used in this project came from a Kaggle code competition named ‘G-Research Crypto Forecasting’. The dataset contains information (Table 1) on historic trades for 14 crypto assets, such as Bitcoin and Ethereum, dating back to 2018, and was used to train our machine learning model.
Data preprocessing
In this project, several data analytic techniques were applied to build the data model. First, the raw data were processed to handle missing values and to separate the different types of crypto assets. The dataset was then split into a training set and a testing set. The data were also visualized with the plotly package.
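As a rough sketch of this preprocessing step (assuming pandas, the competition's column names such as 'timestamp', 'Asset_ID', 'Close' and 'Target', and an illustrative 80/20 chronological split):

```python
import pandas as pd

# Load the competition training file (path is illustrative).
df = pd.read_csv("train.csv")

# Forward-fill missing values within each asset's own history.
df = df.sort_values(["Asset_ID", "timestamp"])
price_cols = ["Open", "High", "Low", "Close", "Volume"]
df[price_cols] = df.groupby("Asset_ID")[price_cols].ffill()
df = df.dropna(subset=["Close", "Target"])

# Separate the different crypto assets into their own frames.
assets = {aid: g.reset_index(drop=True) for aid, g in df.groupby("Asset_ID")}

# Chronological train/test split (last 20% of rows held out) for one asset.
one_asset = assets[1]                       # e.g. Asset_ID 1 (Bitcoin in this dataset)
cut = int(len(one_asset) * 0.8)
train, test = one_asset.iloc[:cut], one_asset.iloc[cut:]
```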
Feature engineering for time series analysis
The lag method was used for this time series analysis problem instead of a simple linear regression model. A lag is a fixed period of time that has passed: the previous set of observations in a time series is lagged against the next set of observations. Further, not only the original price for a given time period but also new mathematical features, derived according to the lag values, were added for the same time period. The main reason for taking logarithms was that the target and the predicted values of interest are expressed in terms of log returns, so the features should be consistent with them.
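A minimal illustration of the lag idea, assuming a pandas Series of per-minute closing prices (the column name and the lag of 2 are purely illustrative):

```python
import pandas as pd

# One row per minute for a single asset.
prices = pd.DataFrame({"Close": [100.0, 101.5, 99.8, 102.3, 103.1]})

lag = 2                                               # e.g. 60 in the real data
prices["Close_lagged"] = prices["Close"].shift(lag)   # value `lag` minutes earlier
```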
Data model training using LightGBM method
LightGBM was used to train the model, with cross validation controlling the parameters and tuning the model. LightGBM is a gradient boosting framework that uses tree-based learning algorithms. A weighted version of the Pearson correlation coefficient was then used for evaluation: the weighted correlation between the values predicted by our model and the actual values in the dataset was calculated, so a more adequate model scores higher.
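The weighted Pearson correlation coefficient can be computed directly; the sketch below is a generic implementation, with the per-row weights (the competition's asset weights) passed in as an array, and is not the competition's exact scoring code:

```python
import numpy as np

def weighted_pearson(y_true, y_pred, weights):
    """Weighted Pearson correlation between predictions and targets."""
    y_true, y_pred, w = map(np.asarray, (y_true, y_pred, weights))
    w = w / w.sum()                                   # normalise the weights
    mt, mp = np.sum(w * y_true), np.sum(w * y_pred)   # weighted means
    cov = np.sum(w * (y_true - mt) * (y_pred - mp))   # weighted covariance
    var_t = np.sum(w * (y_true - mt) ** 2)
    var_p = np.sum(w * (y_pred - mp) ** 2)
    return cov / np.sqrt(var_t * var_p)
```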
The steps for the RNN models were shown below:
Read Training data, Validation data, Testing data
Text segmentation
Filter out stopwords, and emojis
Padding of sentences
Vectorization of text
Model Training with training data
Prediction on testing data
Validation
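A hedged sketch of these text preprocessing steps, assuming English text, whitespace segmentation, and a Keras-style tokenizer; the stopword list, emoji pattern, and sequence length are illustrative choices rather than the project's exact settings:

```python
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

STOPWORDS = {"the", "a", "an", "is", "are"}                 # illustrative stopword list
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF]", flags=re.UNICODE)

def clean(text: str) -> str:
    # Filter out emojis, then stopwords.
    text = EMOJI_RE.sub("", text.lower())
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)

texts = ["This phone is better than that one \U0001F600", "Great battery"]
cleaned = [clean(t) for t in texts]

# Vectorize: map words to integer ids, then pad every sequence to a fixed length.
tok = Tokenizer(num_words=20000, oov_token="<unk>")
tok.fit_on_texts(cleaned)
padded = pad_sequences(tok.texts_to_sequences(cleaned), maxlen=64)
```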
The steps for pre-trained models were shown below:
Read Training data, Validation data, Testing data
Padding of sentences
Model loading from Hugging Face
Model Training with training data
Prediction on testing data
Validation
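For the pre-trained route, loading a model from Hugging Face typically looks like the sketch below; the checkpoint name is a placeholder rather than the model actually used in the project:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint name is illustrative.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tokenize and pad a batch of review texts.
batch = tokenizer(["This phone is better than that one"],
                  padding="max_length", truncation=True, max_length=64,
                  return_tensors="pt")
logits = model(**batch).logits          # shape: (batch, 2)
```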
Product Comparison Review Detection
Because positive samples were scarce, two different approaches were attempted to increase accuracy.
For the supervised approach, after data preprocessing, tokenization of the text was completed for the RNN models. The majority class was downsampled so that the proportion of positive samples was adjusted to 25%, 50% and 75%, respectively. The model was then trained with the adjusted samples and predictions were made on the testing data. Finally, the predictions were passed to validation.
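A minimal sketch of the majority-class downsampling, assuming a pandas DataFrame with a binary 'label' column (the 25% positive proportion is one of the three settings mentioned above):

```python
import pandas as pd

def downsample_majority(df: pd.DataFrame, positive_ratio: float = 0.25,
                        label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    """Downsize the negative (majority) class so positives make up `positive_ratio`."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    n_neg = int(len(pos) * (1 - positive_ratio) / positive_ratio)
    neg_sampled = neg.sample(n=min(n_neg, len(neg)), random_state=seed)
    # Combine and shuffle the adjusted training set.
    return pd.concat([pos, neg_sampled]).sample(frac=1, random_state=seed)
```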
For the semi-supervised approach, after preprocessing, tokenization of the text was also completed for the RNN models. The models were first trained with labelled samples marked 0 or 1. Predictions were then made on the unlabelled training samples to obtain pseudo labels; these predictions were first returned as floating-point numbers between 0 and 1 representing the certainty of the pseudo labels. Thresholds of 0.9 and 0.1 were set to select samples with a high degree of certainty, and the selected samples were converted to integer labels: samples above 0.9 were relabelled to 1, while samples below 0.1 were relabelled to 0. Model training was then executed again with both the labelled and the pseudo-labelled samples. Finally, predictions on the testing data were made with the retrained model and passed to validation.
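The pseudo-labelling step can be sketched as below; the thresholds of 0.9 and 0.1 follow the description above, while the assumption that the first-stage model returns probabilities as a NumPy array is illustrative:

```python
import numpy as np

def make_pseudo_labels(probs: np.ndarray, hi: float = 0.9, lo: float = 0.1):
    """Keep only confident predictions and convert them to integer labels.

    probs: predicted certainty in [0, 1] for each unlabelled sample.
    Returns (indices of kept samples, their 0/1 pseudo labels).
    """
    keep = np.where((probs > hi) | (probs < lo))[0]
    labels = (probs[keep] > hi).astype(int)     # > 0.9 -> 1, < 0.1 -> 0
    return keep, labels

# Example: probabilities from the first-stage model on unlabelled data.
probs = np.array([0.95, 0.40, 0.03, 0.88])
idx, pseudo = make_pseudo_labels(probs)         # idx = [0, 2], pseudo = [1, 0]
```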
The steps for the supervised approach were shown below:
Read Training data, Validation data, Testing data
Adjustment of training data proportion
Text segmentation
Filter out stopwords, and emojis
Padding of sentences
Vectorization of text
Model Training with training data
Prediction on testing data
Validation
The steps for the semi-supervised approach were shown below:
Read Training data, Validation data, Testing data
Text segmentation
Filter out stopwords, and emojis
Padding of sentences
Vectorization of text
Model Training with labelled training data
Get pseudo labels from unlabelled training data
Retraining with labelled data and pseudo-labelled training data
Prediction on testing data
Validation
Graph 1: Datasets after preprocessing
Graph 2: Data visualization of the cryptocurrency
First, based on the previous discoveries, the initial task was to choose the useful data and features and to handle the missing values in the massive amount of data. However, there were a few rows of null values spanning a long time period, and these were handled during feature engineering.
Then, the dataset was divided into several subsets according to asset ID for model training. Graph 1 describes the dataset after preprocessing.
Graph 3: lag values used initially
Since the choice of lag values determines the features produced, the values were chosen as shown in Graph 3, with reference to published solutions from the competition and the objectives of this project.
A lag indicates a fixed time period that has passed; for example, lag = 60 means that the features are calculated over the past 60 minutes. Taking a larger value also helped to avoid the influence of missing values in the original dataset.
Two features were created to analyze the time series problem, as follows. First, the log ratio of the closing price to its average over the past lag period was named F1, “log(close/mean) in ‘lag’”. Second, the log ratio of the closing price to the price one lag period earlier was named F2, “log(close/close) return ‘lag’”.
Residuals of these lagged features were also of interest, so two further values were calculated (De Gooijer & MacNeill, 1999). For every currency, the difference between its change in F1 and the mean of the F1 changes across all currencies was calculated and named “F1 – F1_mean”. Similarly, the difference between its change in F2 and the mean of the F2 changes across all currencies was calculated and named “F2 – F2_mean”.
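Under these definitions, the features for a single lag might be computed as in the sketch below; the column names follow the competition data, and the pivot-based cross-asset mean is one possible way to implement the residual, not necessarily the project's exact code:

```python
import numpy as np
import pandas as pd

def lag_features(close: pd.Series, lag: int) -> pd.DataFrame:
    """F1 = log(close / rolling mean over `lag`); F2 = log(close / close `lag` ago)."""
    f1 = np.log(close / close.rolling(lag).mean())
    f2 = np.log(close / close.shift(lag))
    return pd.DataFrame({f"F1_{lag}": f1, f"F2_{lag}": f2})

def cross_asset_residual(df: pd.DataFrame, feature: str) -> pd.Series:
    """Residual feature such as "F1 - F1_mean": subtract, at each timestamp,
    the mean of the feature across all currencies."""
    wide = df.pivot(index="timestamp", columns="Asset_ID", values=feature)
    resid = wide.sub(wide.mean(axis=1), axis=0)     # per-timestamp mean over assets
    return resid.stack()                            # back to long format
```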
Graphs 4 and 5: Data model built with 5-fold (left) and 7-fold (right) cross validation
The model was first trained with 5-fold cross validation and 7-fold cross validation.
The results showed that 7-fold cross validation obtained a better training score (0.06893, Graph 5) than 5-fold (0.06819, Graph 4). Because of concerns about overfitting, a local test using the dataset was also run. The local score with 5-fold cross validation was 0.05018, which was smaller than the 7-fold score of 0.05199. This indicated that the 7-fold model was not overfitted to the training dataset, so 7-fold cross validation was chosen.
Graphs 6 and 7: Scores from the local test with 5-fold (left) and 7-fold (right) cross validation
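A sketch of how the two cross-validation settings could be compared with LightGBM and scikit-learn's KFold; the parameters and feature matrix are illustrative, and the scoring reuses the weighted_pearson helper sketched earlier:

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

def cv_score(X, y, w, n_splits):
    """Mean weighted-correlation score over K folds (X, y, w are NumPy arrays)."""
    params = {"objective": "regression", "learning_rate": 0.05, "verbosity": -1}
    scores = []
    for tr_idx, va_idx in KFold(n_splits=n_splits, shuffle=False).split(X):
        booster = lgb.train(params, lgb.Dataset(X[tr_idx], y[tr_idx]),
                            num_boost_round=500)
        pred = booster.predict(X[va_idx])
        # weighted_pearson is the evaluation helper sketched earlier.
        scores.append(weighted_pearson(y[va_idx], pred, w[va_idx]))
    return float(np.mean(scores))

# score_5 = cv_score(X, y, w, n_splits=5)   # compare against
# score_7 = cv_score(X, y, w, n_splits=7)   # the 7-fold setting
```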
The lag values were reconsidered because changing the K value in K-fold cross validation gave only a small improvement to the data model. Considering that the target values are based on price changes within 15 minutes, a trial with lag = 15 was run to check for any possible enhancement of the model. Another trial with lag = 900 was also considered, although there were concerns that a larger time gap would make matters worse. After these tests, the disappointing result was that changing the array of lags alone did not improve the data model.
Table 3: lags and corresponding weighted correlation score
As a result, lags = [15, 60, 300, 900] were used to train the model. However, the processing time was longer because of the expanded feature set.
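With the final set of lags, the per-lag features can be generated in a loop and concatenated; the sketch below reuses the F1/F2 definitions above and downcasts to float32 as one possible way to limit the memory growth mentioned here:

```python
import numpy as np
import pandas as pd

LAGS = [15, 60, 300, 900]

def build_features(close: pd.Series) -> pd.DataFrame:
    """F1/F2 features for every lag in LAGS, downcast to float32 to save memory."""
    frames = {}
    for lag in LAGS:
        frames[f"F1_{lag}"] = np.log(close / close.rolling(lag).mean())
        frames[f"F2_{lag}"] = np.log(close / close.shift(lag))
    return pd.DataFrame(frames).astype(np.float32)
```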
Boxplots were plotted to illustrate the importance of certain features.
Graphs 8, 9 and 10: Boxplots of feature importance
From the boxplots above, both lag = 15 and lag = 900 produced important features, but they behaved differently across cryptocurrencies. Hence, a better score could not be achieved by using either lag = 15 or lag = 900 alone; when the model was trained further with both lag = 15 and lag = 900, a better score was achieved.
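A sketch of how such boxplots could be produced, assuming one trained LightGBM Booster per asset (matplotlib is used here for brevity, although the project used plotly for its other visualizations):

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_feature_importance(models: dict, feature_names: list) -> None:
    """models: {asset_id: trained LightGBM Booster}. One box per feature,
    showing how its gain-based importance varies across assets."""
    rows = []
    for asset_id, booster in models.items():
        imp = booster.feature_importance(importance_type="gain")
        rows.append(pd.Series(imp, index=feature_names, name=asset_id))
    imp_df = pd.DataFrame(rows)            # rows: assets, columns: features
    imp_df.boxplot(rot=90)
    plt.ylabel("Gain importance across assets")
    plt.tight_layout()
    plt.show()
```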
Graph 11: The final result
The resulting training score was 0.07091, better than the baseline of 0.06893, and the local test score was 0.06783, much better than 0.05199.
Through the feature engineering, although only the closing prices were chosen from the raw observations, new features, mainly based on the calculated lag values, were added to capture the price changes over a certain period, not only relative to each currency itself but also relative to the other currencies.
In this time series analysis problem, the lags used to forecast the targets were chosen as multiples of 15 because of the characteristics of the data. The model was trained many times to find the optimal combination of lags. However, the model building had some shortcomings, such as a large time cost and an enormous amount of memory usage. The massive amount of data, with 1,355,915 samples to train and 226,610 samples to validate in every validation round, also meant that every model training phase took a long time. With extra time, more combinations of lags could be tried for a better result, and other approaches, such as the Long Short-Term Memory method and principal component analysis for anomaly detection, could also be explored.
Further, the project shows that machine learning methods can be used to understand the principles of cryptocurrency trading, but they may be of very limited help to traders or amateur investors. The visualization of the data depicted the high volatility of cryptocurrency price movements, as well as the high variance across cryptocurrencies. Model training can be improved further in different directions, for example by switching from the “Close” attribute to the volume-weighted average price for the minute (“VWAP”).
Nevertheless, the massive amount of data and the methodology used in this project can help increase the understanding of many complicated time series analysis problems. Because the massive amount of data is challenging to compute with, the feature selection process is essential and critical to solving the problem. For example, the methods in this project could also be applied in enterprise market research, where many real-world problems are time-dependent.
In this project, machine learning methods were applied to predict the returns on cryptocurrency prices based on real data. The lag method was chosen to analyze the time series problem, and an optimal combination of lag values was found. The LightGBM method, along with cross validation, was used to train the relevant datasets. The corresponding scores, indicating how well the model performed, were also calculated for predicting the prices of crypto assets. However, the massive amount of data led to a large time cost and heavy memory usage during model building. Overall, the project fulfilled its main objective of increasing the understanding and mastery of machine learning methods, though some limitations were encountered.
Hey there! If you're curious, why not swing by my GitHub link below? Check it out for more details!
Link: Cryto-Forecasting