The COVID-19 pandemic has affected the lives and habits of billions of people around the world. One of the newly observed trends is the apparent increase in interest in financial markets, as illustrated in Figure 1.
It is impossible not to notice the fivefold increase in turnover generated by individual investors or, to put it another way, “ordinary Kowalskis”. A similar trend is also observed on cryptocurrency exchanges. There are many reasons for this. One of them may be the desire to protect capital against a potential economic crisis or high inflation.
Leaving aside the reasons, it is worth asking a question: what could be the consequences? One of them is the emergence of a number of inexperienced investors, whose lack of familiarity with market realities may pose a serious threat given the not always legal practices of some market participants.
Pump & Dump (Mechanism Description)
Pump & Dump is the name of the process of manipulating the valuation of an asset in the market. It consists of artificially inducing a price increase by creating the illusion of sudden interest in shares of a given company. When the price reaches a high enough level, the person organizing the scheme sells the previously acquired shares at an inflated price.
Let’s break down the mechanism into stages:
- Generating increased trading activity in the company's stock by buying up shares.
- After two or three days of continuous growth caused by this buying, the company attracts the attention of other market participants who, wanting to take advantage of the price increase, place buy orders.
- When a sufficient number of investors become interested in the company, the organizer buys many shares, “clearing” the order queue. This results in a rapid increase in the stock price. Other players, seeing the dynamics, also want to become owners of the company’s shares, raising the price even higher. The bubble begins to live a life of its own: it propels itself.
- After the price reaches an appropriate ceiling, the organizer gradually gets rid of his shares, selling them at an inflated price.
- The next few days bring a correction, and the price returns to its level from before the operation.
For the operation to be possible, the targeted company (the victim of the attack) must meet several criteria.
First of all, capitalization must be sufficiently low, which reduces the amount of capital required to act. The second criterion is the liquidity of the company. Ideally, it should be low but not nonexistent. This makes it possible to generate movement in phase one while the organizer's trades dominate trading in the company.
Machine learning
The increase in computing power over the past few years and the progressive digitization of data have created conditions conducive to the development of advanced forms of data analysis; artificial intelligence algorithms in particular have benefited. In a nutshell, to reap the benefits of so-called models, we need to train them beforehand, that is, “show” them historical data related to the problem under study. This can be accomplished in several ways. The two most popular are supervised learning and unsupervised learning. With the former, you are required to specify a class for each observation: what is visible in the image, what the text is about, or whether the transaction is fraudulent.
In unsupervised techniques, we are limited to just the data itself, without specifying the class of each observation. As is usually the case, each approach has its pros and cons. Creating the labels necessary to apply supervised techniques is often a labor- and time-intensive activity involving many people with specific skills. Naturally, this increases the cost of the project and the time to complete it.
From this perspective, unsupervised learning seems better. However, this is not the whole truth. Unsupervised algorithms usually work based on similarity. They group data into clusters that combine the most similar observations, or they identify outliers.
Unfortunately, the Data Scientist has limited control over this process, which means that the model may pay attention to different aspects of the data than the author intended. This usually means less accurate predictions, which translates into poorer results compared to supervised methods. On the other hand, the lack of a requirement to label each observation reduces data preparation time.
It may also happen that leaving the interpretation of the data to the model will lead to the discovery of correlations previously overlooked or erroneously ignored by the Data Scientist. Nevertheless, in practice, solutions based on supervised learning are preferable because of the greater control over the learning process, the possibility of adapting the classes of observations to the analyzed problem, and better results.
Implementation
Let us consider the possibility of automatic detection of the Pump & Dump practices described earlier. Undoubtedly, the mechanism is characterized by strong increases in price and volume. At the same time, we do not have an official database of identified manipulation cases.
It would take weeks of work for a person skilled in detecting stock market irregularities to flag all available data manually. Such an undertaking is beyond the scope planned for the current article. In line with the earlier description of machine learning techniques, I will therefore focus on unsupervised algorithms.
These carry the risk of unintentionally flagging legitimate quotes as instances of manipulation. This is due to the high similarity of Pump & Dump to the situation that may occur, for example, after the publication of unexpectedly good financial results of a company. Possible nuances that distinguish these two cases may not be considered by the model, over which the Data Scientist has limited influence. With all the knowledge presented so far, let’s try to build an artificial intelligence model whose task will be to detect unusual events.
GAN
Generative Adversarial Networks, or GANs for short, are a way of training machine learning models in which two networks compete against each other. The task of one of them (the generator) is to generate observations, while the other network (the discriminator) tries to distinguish actual observations from generated ones. As a training set (real observations for the discriminator), we use historical quotations of companies from the NewConnect exchange, divided into sequences of 5 to 90 days.
Companies listed on NewConnect seem to be a better testing ground than those from the more widely known main market of the Warsaw Stock Exchange. This is due primarily to their greater susceptibility to Pump & Dump attacks. Compared to their counterparts from the WSE, they are characterized by lower capitalization, liquidity and popularity.
In turn, the length of the sequence results from the observation of historical quotations. It is impossible to state unequivocally when a manipulation operation begins and when it ends. Nevertheless, 5 consecutive quotes seem to be the minimum value. Longer periods are intended to supplement the data with a broader context.
| Sequence length | Training set size | Test set size |
|---|---|---|
| 5 | 499932 | 11985 |
| 30 | 12026 | 11785 |
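As a rough illustration of how quotations can be divided into fixed-length sequences, here is a minimal sketch. The function name `make_sequences`, the `stride` parameter and the two-column layout (closing price, volume) are assumptions made for this example; the article does not say whether the windows overlapped.

```python
import numpy as np

def make_sequences(quotes, seq_len, stride=1):
    """Cut a (n_days, n_features) array into windows of seq_len consecutive days."""
    quotes = np.asarray(quotes, dtype=float)
    n_days = quotes.shape[0]
    starts = range(0, n_days - seq_len + 1, stride)
    windows = [quotes[s:s + seq_len] for s in starts]
    if not windows:
        # Series shorter than the window: return an empty, correctly shaped array.
        return np.empty((0, seq_len) + quotes.shape[1:])
    return np.stack(windows)

# Hypothetical example: 10 trading days, two columns (closing price, volume).
quotes = np.column_stack([np.linspace(1.0, 1.9, 10), np.arange(100, 1100, 100)])
sequences = make_sequences(quotes, seq_len=5)
print(sequences.shape)  # (6, 5, 2)
```

With `stride=1` the windows overlap heavily, which multiplies the number of training sequences; a larger stride trades training set size for less redundancy.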
The idea behind the proposed solution is quite simple. The stock price series produced by the generator become similar to real ones in order to fool the discriminator. Since non-standard price/volume changes are relatively rare, both networks adapt to the standard series. This means that a trained discriminator should flag a Pump & Dump case as “fake”.
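For reference, the competition between the two networks is commonly written as the minimax objective below. The article does not state which loss was actually used, so this standard formulation is an assumption; $G$ is the generator, $D$ the discriminator, $x$ a real quotation sequence and $z$ the random noise fed to the generator.

$$
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
$$

Under this reading, a sequence $x$ would be treated as suspicious when the trained discriminator's output $D(x)$ falls below some threshold $\tau$. The threshold is a hypothetical parameter, not a value taken from the experiments.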
The solution was optimized iteratively, starting with training on a severely limited amount of data and a smaller network, both of which were gradually increased once the overfitting point was reached. For evaluation purposes, I prepared a small test set in which I manually flagged suspicious cases captured on the NewConnect exchange.
First approach – Fully Connected layers
In the first approach, fully connected layers served as the building blocks of the discriminator and generator within the GAN architecture. Their ability to “remember” observations gave very good results in the initial stage (for a limited number of observations). Unfortunately, they performed poorly when generalizing to larger amounts of data.
Second approach – LSTM
Due to the nature of the analyzed data (time series), it seems natural to use recurrent layers, more specifically LSTM (Long Short-Term Memory). Their main advantage is that they take into account the entire series of points, not just a single observation. This functionality is realized by a system of gates, which combine the incoming data with the information obtained from the previous points of the sequence.
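The gate system mentioned above can be made concrete with the standard LSTM update equations. This is the textbook formulation, given here as an assumption about the layer used; $x_t$ is the input at step $t$, $h_{t-1}$ the previous hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication.

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The forget gate $f_t$ and input gate $i_t$ decide how much of the earlier sequence is kept and how much of the new quotation is written into the cell state, which is how information from previous points reaches the current one.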
Unfortunately, as in the previous case, the results were not satisfactory. I tested both a variant with a Fully Connected generator and an LSTM discriminator and one where both networks were based on LSTM.
Third approach – Convolution
Convolutional networks are commonly used in image processing, NLP and graph networks. Given the topic of this article, we will use them to interpret time series. For the case study, I used the so-called “Causal Convolution”.
The standard convolution operation applied to a one-dimensional time series (in this case called one-dimensional convolution) considers the points located both “before” and “after” the point for which the convolution is calculated. Let’s analyze this using the example in the figure below.
Our series is a sequence of numbers, one per day (green). We apply a convolution with a kernel size of 3, symbolically represented by the gray color and arrows. This means that when calculating a new series, the value for day 11.07.2021 is calculated using the data from days 10.07.2021, 11.07.2021, and 12.07.2021.
As we can see, classical convolution combined with ordered data raises the problem of “looking” into the future. Causal Convolution limits the input data only to “historical” values, thus preserving causality. This is realized through an asymmetric convolution kernel that only considers past data.
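The difference between the two variants can be checked with a small NumPy sketch. The function names and the kernel weights are assumptions made for illustration, and the operation is written as deep learning frameworks usually implement it (a sliding dot product); this is not the network used in the experiments.

```python
import numpy as np

def centered_conv1d(x, kernel):
    """Classical 'same' convolution: for kernel size 3, out[t] uses x[t-1], x[t], x[t+1]."""
    x = np.asarray(x, dtype=float)
    k = len(kernel)
    half = k // 2
    padded = np.pad(x, (half, k - 1 - half))  # zero-pad both sides
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

def causal_conv1d(x, kernel):
    """Causal convolution: for kernel size 3, out[t] uses only x[t-2], x[t-1], x[t]."""
    x = np.asarray(x, dtype=float)
    k = len(kernel)
    padded = np.pad(x, (k - 1, 0))  # zero-pad on the left (the past) only
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

kernel = np.array([0.25, 0.25, 0.5])
series = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Change a single "future" value (day index 3) and compare the outputs for days 0..2.
shocked = series.copy()
shocked[3] = 100.0

causal_unchanged = np.allclose(causal_conv1d(series, kernel)[:3],
                               causal_conv1d(shocked, kernel)[:3])
centered_leaks = not np.allclose(centered_conv1d(series, kernel)[:3],
                                 centered_conv1d(shocked, kernel)[:3])
print(causal_unchanged, centered_leaks)  # True True
```

Changing the value on day 3 alters the centered output for day 2 but leaves the causal output for days 0 to 2 untouched, which is exactly the "no looking into the future" property.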
Of the three approaches, the results obtained in this way looked the most promising.
Results
Obtaining a network that generates realistic-looking stock quotes is not a simple task. Although it is a kind of by-product of discriminator training, it can be treated as a good indicator of the quality of a GAN-type network. Unfortunately, in most of the training runs conducted, the generated stock price series did not resemble real quotations. Additionally, the quality of the results was affected by the so-called “mode collapse” phenomenon, which reduces the diversity of the data coming from the generator.
The results obtained on the test set were characterized by significant scatter. Depending on the training time, the same discriminator model could classify all observations from the test set as false, classify all of them as true, or behave in a completely unexpected way. Therefore, it is hard to talk about any particular metric value. Moreover, any high classification score obtained (e.g., AUC above 0.8) could be due to the discriminator fitting a relatively small test set by chance.
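The concern about chance results can be illustrated with a small simulation. The sketch below computes AUC as the share of (positive, negative) pairs ranked correctly and then scores a tiny test set with purely random numbers. The test set composition (3 flagged cases among 23 observations) and the number of repetitions are hypothetical values chosen for the example, not figures from the experiments.

```python
import numpy as np

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

rng = np.random.default_rng(0)
labels = np.array([1] * 3 + [0] * 20)  # hypothetical: 3 flagged cases, 20 normal ones

# Score the same tiny test set many times with a "model" that outputs random numbers.
random_aucs = np.array([auc(rng.random(labels.size), labels) for _ in range(2000)])
share_above_08 = (random_aucs > 0.8).mean()

print(f"mean AUC of random scores: {random_aucs.mean():.2f}")
print(f"share of random runs with AUC > 0.8: {share_above_08:.1%}")
```

Even though the average stays near 0.5, a noticeable share of random runs exceeds 0.8 when so few positive cases are available, so a single high score on such a set says little on its own.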
Classic methods
Let’s verify the ability to detect unusual events using more established techniques: classical machine learning and rule sets. To do this, I will change the data format slightly. A single observation in the set will be a quotation from a given day together with features describing the change in the value of a given stock. Of course, this will affect the size of the datasets.
| Training set size | Test set size |
|---|---|
| 468663 | 10783 |
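A per-day observation of this kind could be built roughly as follows. This is a minimal sketch: the article does not list the exact features, so the choice of day-to-day differences of closing price and volume, the function names and the sample numbers are all assumptions.

```python
import numpy as np

def daily_features(close, volume):
    """One row per day (from the second day on): day-to-day change in price and in volume."""
    close = np.asarray(close, dtype=float)
    volume = np.asarray(volume, dtype=float)
    return np.column_stack([np.diff(close), np.diff(volume)])

def standardize(features):
    """Rescale each column to zero mean and unit standard deviation."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return (features - mean) / std

# Hypothetical quotations for four consecutive days.
close = [10.0, 11.0, 13.0, 12.0]
volume = [100.0, 150.0, 150.0, 400.0]

raw = daily_features(close, volume)
features = standardize(raw)
print(raw)
```

Standardizing per column keeps price changes and volume changes on a comparable scale, which matters for the distance-based rules and the visualizations that follow.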
Modeling
Let us first take the Isolation Forest algorithm. Its goal is to identify unusual observations in a set. In a nutshell, the algorithm repeatedly splits the data based on the value of a selected variable. Both the feature and the value at which the split takes place are chosen randomly. Outliers need significantly fewer splits to be isolated. Let us verify the effectiveness of Isolation Forest on a group of companies that we suspect are victims of a Pump & Dump attack.
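A minimal sketch of this step using scikit-learn's `IsolationForest` is shown below. The data is synthetic (a cloud of typical days plus three extreme ones), and the `contamination` value and the other settings are assumptions for the example, not the configuration used in the experiments.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for standardized daily features (price change, volume change).
normal_days = rng.normal(0.0, 1.0, size=(500, 2))
suspicious_days = np.array([[8.0, 9.0], [7.5, 10.0], [9.0, 8.5]])  # extreme jumps
X = np.vstack([normal_days, suspicious_days])

# contamination is the assumed share of outliers in the data.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = model.fit_predict(X)    # -1 = outlier, 1 = inlier
scores = model.score_samples(X)  # the lower the score, the more anomalous

outlier_idx = np.where(labels == -1)[0]
print(outlier_idx)
```

The three extreme rows sit far from the rest of the data, so they are isolated after very few random splits and receive the lowest scores.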
In the chart (Figure 11), we see the quotes and their derivatives for one of the companies. The operations to which the original data have been subjected are mainly the standardization of the range and the determination of daily differences (gradients). Cases identified as outliers are marked with red lines at the top of the chart. The model seems to have done quite well at detecting anomalies in the observed data. A similar observation emerges after examining the second graph, i.e. Figure 12.
I used the PCA algorithm to produce it, which reduced the number of dimensions to two new (previously non-existent) ones, allowing for a readable visualization. Each point represents one observation. The colors distinguish whether it was considered normal or abnormal. There is a clear concentration of blue points, while those representing abnormal observations are much further from the center. This supports the effectiveness of the model in identifying abnormal observations.
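The projection onto two dimensions can be sketched with plain NumPy, using the singular value decomposition of the centered data. The function name `pca_2d` and the synthetic five-feature data are assumptions for illustration; the article does not say which PCA implementation was used.

```python
import numpy as np

def pca_2d(X):
    """Project observations onto the two directions of largest variance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 observations with 5 features of different scales.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
projected = pca_2d(X)
print(projected.shape)  # (200, 2)
```

Each row of `projected` gives the coordinates of one observation on a two-dimensional scatter plot, where points can then be colored by the label assigned by the anomaly detector.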
Finally, we plot the confusion matrix and ROC curve for the test set:
Rule set
The methods described so far do not exhaust all available solutions. To make the analysis of the problem more complete, it is worth comparing the results with an approach based on a set of rules. It is conceptually the simplest, which does not necessarily mean worse results.
As a measure of “normality”, I used the sum of the distances of a given observation from the mean, expressed in units of standard deviation. For example, for an observation where the value of the first feature is 5, the mean is 3, and the standard deviation is 2, the result is (5 − 3) / 2 = 1.
The operation is repeated for the remaining features and the absolute values are summed. The two graphs below show the distribution of the obtained results, for the sums and for single columns respectively.
As expected, the distribution is strongly concentrated around low values. As the criterion for belonging to the anomalous group, I adopted a value exceeding the 99th percentile for the sum of distances or for any feature describing a single quotation. Below, we see visualizations analogous to those for the Isolation Forest model. It seems that the results obtained using rules are slightly worse.
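The rule described above can be written in a few lines of NumPy. This is a sketch under assumptions: the function name is hypothetical, the data is synthetic, and the mean, standard deviation and percentile thresholds are estimated on the same data that is being scored, which may differ from how the thresholds were set in the experiments.

```python
import numpy as np

def rule_based_anomalies(X, percentile=99.0):
    """Flag rows whose summed or per-feature distance from the mean exceeds the given percentile."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)               # guard against constant columns
    distances = np.abs((X - X.mean(axis=0)) / std)   # distance in standard deviations
    total = distances.sum(axis=1)

    by_sum = total > np.percentile(total, percentile)
    by_feature = (distances > np.percentile(distances, percentile, axis=0)).any(axis=1)
    return by_sum | by_feature, total

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))  # synthetic stand-in for daily features
X[0] = [10.0, 10.0, 10.0]       # one artificially extreme day

flags, total = rule_based_anomalies(X)
print(flags[0], flags.sum())
```

Because a row is flagged when either the sum or any single feature crosses its threshold, the share of flagged observations ends up somewhat above 1%.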
Conclusions
The work done in this article can be divided into two areas: generation of quotations and detection of unusual cases. The former is not a simple task. Training models based on the GAN architecture involves many problems, such as properly balancing the discriminator and generator, selecting optimal training parameters, etc. Nevertheless, the solution using Causal Convolution networks can serve as a starting point for further research in this direction. The results obtained in this way looked quite realistic.
In detecting Pump & Dump practices, the standard anomaly detection methods, Isolation Forest and the rule set, performed better. On a pre-prepared test set, they achieved results that attest to the usefulness of the models. At the same time, it should be noted that there is no official data on detected cases of the described practices or on the scale of the phenomenon, and thus no well-defined dataset. Not every significant jump in quotations or transaction volume must mean illegal practices, which introduces another degree of difficulty to the overall solution.
In conclusion, applying the described unsupervised machine learning techniques made it possible to track down probable cases of stock market manipulation. As expected, the results obtained are subject to inaccuracies due to the nature of unsupervised learning.