Pump & Dump detection with ML methods - Billennium (2024)

The COVID-19 pandemic has affected the lives and habits of billions of people around the world. One of the newly observed trends is the apparent increase in interest in financial markets, as illustrated in Figure 1.

[Figure 1]

It is impossible not to notice the fivefold increase in turnover generated by individual investors or, to put it another way, “ordinary Kowalskis”. A similar trend is also observed on cryptocurrency exchanges. There are many reasons for this. One of them may be the desire to protect capital against a potential economic crisis or high inflation.

Leaving aside the reasons, it is worth asking a question: what could be the consequences? One of them is the emergence of a number of inexperienced investors, whose lack of familiarity with market realities may pose a serious threat given the not always legal practices of some market participants.

Pump & Dump (Mechanism Description)

Pump & Dump is the name of the process of manipulating the valuation of an asset in the market. It consists of artificially inducing a price increase by creating the illusion of sudden interest in the shares of a given company. When the price reaches a high enough level, the person organizing the scheme sells the previously acquired shares at an inflated price.

Let’s break down the mechanism into stages:

  1. Generating increased trading activity in the company's stock by buying up shares.
  2. After two or three days of continuous growth caused by this buying, the company attracts the attention of other market participants who, wanting to take advantage of the price increase, place buy orders.
  3. When a sufficient number of investors become interested in the company, the organizer buys many shares, “clearing” the order queue. This results in a rapid increase in the stock price. Other players, seeing the dynamics, also want to become owners of the company’s shares, raising the price even higher. The bubble begins to live a life of its own: it propels itself.
  4. After reaching an appropriate price ceiling, the organizer gradually gets rid of his shares, selling them at an inflated price.
  5. The next few days bring a correction and the price returns to the level from before the operation.

For the operation to be possible, the company that is the victim (the target of the attack) must meet several criteria.

First of all, capitalization must be sufficiently low, which reduces the amount of capital required to act. The second criterion is the liquidity of the company's shares. Ideally, it should be low but not nonexistent. This makes it possible to generate movement in phase one while the organizer's trades dominate trading in the company.

[Figure 2]

Machine learning

The increase in computing power over the past few years and the progressive digitization of data have created conditions conducive to the development of advanced forms of data analysis; artificial intelligence algorithms in particular have benefited. In a nutshell, to reap the benefits of so-called models, we need to train them beforehand, that is, “show” them historical data related to the problem under study. This can be accomplished in several ways. The two most popular are so-called supervised learning and unsupervised learning. With the former you are required to specify a class for each observation: what is visible in the image, what the text is about, or whether the transaction is fraudulent.

In unsupervised techniques we are limited to just the data itself, without specifying the class of each observation. As is usually the case, each approach has its pros and cons. Creating the data description necessary to apply supervised techniques is often a labor- and time-intensive activity, frequently involving many people with specific skills. Naturally, this increases the cost of the project and the time to complete it.

From this perspective unsupervised learning seems better. However, this is not the whole truth. Unsupervised algorithms usually work based on similarity. They group data into clusters that combine the most similar observations, or they identify outliers.

Unfortunately, the Data Scientist has limited control over this process, which means that the model may pay attention to different aspects of the data than the author intended. This usually means less accurate predictions, which translates into poorer results than with supervised methods. On the other hand, the lack of a requirement to label each observation reduces data preparation time.

It may also happen that leaving the interpretation of the data to the model will lead to the discovery of correlations previously overlooked or erroneously ignored by the Data Scientist. Nevertheless, in practice, solutions based on supervised learning are preferable because of the greater control over the learning process, the possibility of adapting the classes of observations to the analyzed problem, and better results.

Implementation

Let us consider the possibility of automatically detecting the Pump & Dump practices described earlier. Undoubtedly, the mechanism is characterized by strong increases in price and volume. At the same time, we do not have an official database of identified manipulation cases.

It would take weeks of work for a person skilled in detecting stock market irregularities to flag all available data manually. Such an undertaking is beyond the scope of this article. In line with the earlier description of machine learning techniques, I will focus on unsupervised algorithms.

These carry the risk of unintentionally flagging legitimate quotes as instances of manipulation. This is due to the high similarity of Pump & Dump to situations that may occur, for example, after the publication of unexpectedly good financial results of a company. Possible nuances that distinguish these two cases may not be considered by the model, over which the Data Scientist has limited influence. With all the knowledge presented so far, let’s try to build an artificial intelligence model whose task will be to detect unusual events.

GAN

Generative Adversarial Networks, or GANs for short, are a way of training machine learning models in which two networks compete against each other. The task of one of them (the generator) is to generate observations, while the other network (the discriminator) tries to distinguish actual observations from generated ones. As a training set (real observations for the discriminator), we use historical quotations of companies from the NewConnect exchange, divided into sequences of 5 to 90 days.
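The article does not state which loss function was used. For reference only, the original GAN formulation expresses the competition between the two networks as the minimax objective below, where D is the discriminator, G is the generator, x is a real sequence and z is the generator's random input (all symbols are notation introduced here, not taken from the article):

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right]
```

Here D(x) is the discriminator's estimate of the probability that x is a real sequence: the discriminator tries to push it toward 1 for real data and toward 0 for generated data, while the generator tries to make D(G(z)) large.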

Companies listed on NewConnect seem to be a better testing ground than those from the more widely known main market of the Warsaw Stock Exchange. This is due primarily to their greater susceptibility to Pump & Dump attacks. Compared to their counterparts from the WSE, they are characterized by lower capitalization, liquidity and popularity.

In turn, the length of the sequence results from the observation of historical quotations. It is impossible to state unequivocally when a manipulation operation begins and when it ends. Nevertheless, 5 consecutive quotes seem to be the minimum value. Longer periods are intended to supplement the data with a broader context.

Sequence length    Training set size    Test set size
5                  499932               11985
30                 12026                11785
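Cutting a quotation history into fixed-length sequences can be sketched as follows. This is only an illustration: the article does not say whether the sequences overlap, so the `step` parameter, the `Quote` layout and the sample numbers are all assumptions.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: one (closing price, volume) pair per session.
Quote = Tuple[float, float]


def sliding_windows(quotes: Sequence[Quote], length: int, step: int = 1) -> List[List[Quote]]:
    """Return every run of `length` consecutive quotes, advancing `step` sessions at a time."""
    if length < 1 or step < 1:
        raise ValueError("length and step must be positive")
    last_start = len(quotes) - length
    return [list(quotes[i:i + length]) for i in range(0, last_start + 1, step)]


# Toy series of 8 sessions for a single company (made-up numbers).
series: List[Quote] = [
    (1.00, 100), (1.02, 120), (1.01, 90), (1.05, 300),
    (1.20, 900), (1.45, 2500), (1.10, 1800), (1.00, 400),
]

windows_5 = sliding_windows(series, length=5)    # sessions 0-4, 1-5, 2-6, 3-7
windows_30 = sliding_windows(series, length=30)  # history too short: no windows
print(len(windows_5), len(windows_30))           # prints: 4 0
```

With overlapping windows a history of n sessions yields n - length + 1 sequences, so longer sequences produce slightly fewer samples per company.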

The idea behind the proposed solution is quite simple. The stock price series produced by the generator become similar to real ones in order to fool the discriminator. Since non-standard price or volume changes are relatively rare, both networks adapt to the standard series. This means that a trained discriminator should flag a Pump & Dump case as “fake”.

[Figure 3]

The solution was optimized iteratively, starting with training on a severely limited amount of data and a smaller network, which was gradually enlarged once the overfitting point was reached. I prepared a small test set for evaluation purposes in which I manually flagged suspicious cases observed on the NewConnect exchange.

The first approach – Fully Connected layers

Fully Connected layers served as the building blocks of the discriminator and generator within the GAN architecture. Their ability to “remember” observations gave very good results in the initial stage (for a limited number of observations). Unfortunately, they performed poorly when generalizing to larger amounts of data.

[Figure 4]

Second approach – LSTM

Due to the nature of the analyzed data (time series), it seems natural to use recurrent layers, more specifically LSTM (Long Short-Term Memory). Their main advantage is that they take into account the entire series of points, not just a single observation. This functionality is realized by a system of gates, which combine the incoming data with the information obtained from the previous points of the sequence.
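The gate mechanism can be illustrated with a single-unit LSTM cell written in plain Python. This is a teaching sketch with made-up weights, not the network used in the article; the names `lstm_step`, `run_lstm` and `WEIGHTS` are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

# Per gate: (input weight, recurrent weight, bias). Values are arbitrary.
Weights = Dict[str, Tuple[float, float, float]]
WEIGHTS: Weights = {name: (1.0, 1.0, 0.0)
                    for name in ("forget", "input", "output", "candidate")}


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x: float, h_prev: float, c_prev: float, w: Weights) -> Tuple[float, float]:
    """One time step: combine the new input x with the state carried from earlier points."""
    def pre(gate: str) -> float:
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))         # how much of the old cell state to keep
    i = sigmoid(pre("input"))          # how much of the new candidate to write
    o = sigmoid(pre("output"))         # how much of the cell state to expose
    g = math.tanh(pre("candidate"))    # candidate value built from x and h_prev
    c = f * c_prev + i * g             # updated cell state (long-term memory)
    h = o * math.tanh(c)               # updated hidden state (output)
    return h, c


def run_lstm(xs: Sequence[float], w: Weights = WEIGHTS) -> float:
    """Feed a whole sequence through the cell and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, w)
    return h


# Same last value, different history: the final outputs differ.
quiet_then_jump = run_lstm([0.0, 0.0, 1.0])
jump_then_jump = run_lstm([1.0, 0.0, 1.0])
print(quiet_then_jump, jump_then_jump)
```

The two sequences end with the same value, yet the outputs differ because the cell state still carries information from the first point.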

[Figure 5]

Unfortunately, as in the previous case, the results were not satisfactory. I tested two variants: one with a Fully Connected generator and an LSTM discriminator, and one where both networks were based on LSTM.

[Figure 6]

Third Approach – Convolution

Convolutional networks are commonly used in image processing, NLP and graph networks. Given the topic of this article, we will use them to interpret time series. For the case study I used the so-called “Causal Convolution”.

The standard convolution operation, applied to a one-dimensional time series (and therefore called one-dimensional convolution), considers the points located both “before” and “after” the point for which the convolution is calculated. Let’s analyze this using the example in the figure below.

Our series is a sequence of numbers, one per day (green). We apply a convolution with a kernel size of 3, symbolically represented by the gray color and arrows. This means that when calculating a new series, the value for day 11.07.2021 is calculated using the data from days 10.07.2021, 11.07.2021, and 12.07.2021.

[Figure 7]

As we can see, classical convolution combined with ordered data raises the problem of “looking” into the future. Causal Convolution limits the input data to “historical” values only, thus preserving causality. This is realized through an asymmetric convolution kernel that only considers past data.
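The difference between the two variants can be sketched in a few lines of plain Python. The kernel values, zero padding and function names are assumptions chosen for illustration; deep learning libraries typically obtain the same effect by padding the series on the left only.

```python
from typing import List, Sequence


def centered_conv(series: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Standard convolution with an odd kernel: output[t] uses t-1, t, t+1 for size 3."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(series) + [0.0] * half
    return [sum(w * padded[t + j] for j, w in enumerate(kernel))
            for t in range(len(series))]


def causal_conv(series: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Causal convolution: output[t] uses only t-2, t-1, t for size 3 (left padding only)."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(series)
    return [sum(w * padded[t + j] for j, w in enumerate(kernel))
            for t in range(len(series))]


kernel = [1.0, 1.0, 1.0]
series = [1.0, 2.0, 3.0, 4.0, 5.0]
changed = [1.0, 2.0, 3.0, 4.0, 50.0]   # only the last ("future") value differs

print(centered_conv(series, kernel))   # [3.0, 6.0, 9.0, 12.0, 9.0]
print(causal_conv(series, kernel))     # [1.0, 3.0, 6.0, 9.0, 12.0]

# Output for day index 3 after changing the value of day index 4:
print(centered_conv(changed, kernel)[3])  # 57.0, the future value leaked in
print(causal_conv(changed, kernel)[3])    # 9.0, unaffected by the future
```

Changing the last value alters the standard convolution's output for the day before it, while every earlier output of the causal version stays the same.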

[Figure 8]

Of the three approaches, the results obtained in this way looked the most promising.

[Figure 9]

Results

Obtaining a network that generates realistic-looking stock quotes is not a simple task. Although it is a kind of by-product of discriminator training, it can be treated as a good indicator of the quality of a GAN-type network. Unfortunately, in most of the training runs the generated stock price series did not resemble real quotations. Additionally, the quality of the results was affected by the so-called “mode collapse” phenomenon, which reduces the diversity of the data coming from the generator.

The results obtained on the test set were characterized by significant scatter. Depending on the training time, the same discriminator model could mark all observations from the test set as fake, mark all of them as real, or behave in a completely unexpected way. Therefore, it is hard to talk about any particular metric value. Where a high classification score was obtained (e.g., AUC above 0.8), it could be due to the discriminator fitting a relatively small test set by chance.

[Figure 10]

Classic methods

Let’s verify the ability to detect unusual events using more established techniques: classical machine learning and rule sets. To do this I will change the data format slightly. A single observation in the set will be a quotation from a given day together with features describing the change in the value of a given stock. Of course, this will affect the size of the datasets.

Training set size    Test set size
468663               10783
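One possible way to turn a quotation history into per-day observations is sketched below. It assumes that the features are a range-scaled price and volume plus their day-over-day differences; the exact feature set is not listed in the article, so the function and field names are hypothetical.

```python
from typing import Dict, List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale a series to the range [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def daily_diff(values: Sequence[float]) -> List[float]:
    """Day-over-day change; the first session has no predecessor, so it gets 0.0."""
    return [0.0] + [b - a for a, b in zip(values, values[1:])]


def build_features(close: Sequence[float], volume: Sequence[float]) -> List[Dict[str, float]]:
    """One row per session: scaled levels plus their daily differences."""
    price = min_max_scale(close)
    vol = min_max_scale(volume)
    return [
        {"price": p, "volume": v, "d_price": dp, "d_volume": dv}
        for p, v, dp, dv in zip(price, vol, daily_diff(price), daily_diff(vol))
    ]


# Made-up quotes with a spike on the fourth session.
close = [1.0, 1.0, 1.5, 3.0, 2.0]
volume = [100.0, 100.0, 300.0, 900.0, 500.0]
rows = build_features(close, volume)
print(rows[3])  # the spike day: largest level and largest positive change
```

Scaling per company keeps cheap and expensive stocks comparable, and the differences expose sudden jumps that the raw levels alone would hide.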

Modeling

Let us first take the Isolation Forest algorithm. Its goal is to identify unusual observations in a set. In a nutshell, the algorithm repeatedly splits the data based on the value of a selected variable. Both the feature and the value at which the split takes place are chosen randomly. Outliers need significantly fewer splits to be isolated. Let us verify the effectiveness of Isolation Forest on a group of companies that we suspect are victims of a Pump & Dump attack.
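The claim that outliers need fewer splits can be checked with a simplified pure-Python sketch. It is not a full Isolation Forest: library implementations such as scikit-learn's `IsolationForest` build trees on subsamples and convert path lengths into a score, whereas this version just counts random splits for one point. The data and all names are made up for illustration.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def isolation_depth(point: Point, data: List[Point], rng: random.Random,
                    max_depth: int = 50) -> int:
    """Count random splits until `point` is alone in its partition (or max_depth is hit)."""
    subset = data
    depth = 0
    while len(subset) > 1 and depth < max_depth:
        feature = rng.randrange(len(point))           # random feature
        lo = min(p[feature] for p in subset)
        hi = max(p[feature] for p in subset)
        depth += 1
        if lo == hi:
            continue                                  # nothing to split on here
        threshold = rng.uniform(lo, hi)               # random split value
        # Keep only the side of the split that still contains `point`.
        if point[feature] < threshold:
            subset = [p for p in subset if p[feature] < threshold]
        else:
            subset = [p for p in subset if p[feature] >= threshold]
    return depth


def mean_isolation_depth(point: Point, data: List[Point],
                         n_trees: int = 200, seed: int = 0) -> float:
    """Average the depth over many random trees to smooth out the randomness."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees


# 100 ordinary observations around the origin, one typical point, one outlier.
gen = random.Random(42)
cluster = [(gen.gauss(0.0, 1.0), gen.gauss(0.0, 1.0)) for _ in range(100)]
typical, outlier = (0.0, 0.0), (6.0, 6.0)
data = cluster + [typical, outlier]

depth_typical = mean_isolation_depth(typical, data)
depth_outlier = mean_isolation_depth(outlier, data)
print(depth_typical, depth_outlier)
```

The point far from the cluster is separated after only a few random cuts on average, while the point in the middle of the cluster needs many more.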

[Figure 11]

In the chart above (Figure 11), we see the quotes and their derivatives for one of the companies. The operations to which the original data have been subjected are mainly the standardization of the range and the determination of daily differences (gradients). Cases identified as outliers are marked with red lines at the top of the chart. The model seems to have done quite well at detecting anomalies in the observed data. A similar observation emerges after examining the second graph, i.e. Figure 12.

[Figure 12]

I used the PCA algorithm to produce it, which reduced the number of dimensions to two new, derived ones, allowing for a readable visualization. Each point represents one observation. The colors distinguish whether it was considered normal or abnormal. There is a clear concentration of blue points, while those representing abnormal observations are much further from the center. This supports the effectiveness of the model in identifying abnormal observations.
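How PCA arrives at two derived dimensions can be sketched without external libraries, using power iteration on the covariance matrix. In practice a library routine (for example scikit-learn's `PCA`) would normally be used; the toy data and helper names below are assumptions for illustration only.

```python
import math
import random
from typing import List, Sequence

Matrix = List[List[float]]


def center(rows: Sequence[Sequence[float]]) -> Matrix:
    """Subtract the column means so every feature has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered: Matrix) -> Matrix:
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def mat_vec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def dominant_eigenvector(m: Matrix, rng: random.Random, iterations: int = 500) -> List[float]:
    """Power iteration: repeated multiplication converges to the top eigenvector."""
    v = [rng.gauss(0.0, 1.0) for _ in m]
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows: Sequence[Sequence[float]], seed: int = 0) -> Matrix:
    """Project the rows onto their two directions of largest variance."""
    rng = random.Random(seed)
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components: Matrix = []
    for _ in range(2):
        v = dominant_eigenvector(cov, rng)
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
        components.append(v)
        # Deflation: remove the direction just found so the next pass finds the runner-up.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)] for a in range(d)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centered]


# Toy data with three features (made up): two of them move together.
rows = [(float(i), 2.0 * i + (i % 3), float(i % 5) - 2.0) for i in range(20)]
points = pca_2d(rows)
var_1 = sum(p[0] ** 2 for p in points) / (len(points) - 1)
var_2 = sum(p[1] ** 2 for p in points) / (len(points) - 1)
print(var_1, var_2)  # the first derived axis carries most of the variance
```

Each observation ends up as a pair of coordinates that can be drawn on a scatter plot, which is how a chart like Figure 12 can be produced from higher-dimensional features.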

Finally, we plot the confusion matrix and the ROC curve for the test set:

[Figure 13]

[Figure 14]
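For readers who want to reproduce this kind of evaluation, the two quantities can be computed as in the sketch below. The labels, scores and the 0.5 threshold are invented for illustration and are not the article's results.

```python
from typing import Dict, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count the four outcomes, treating 1 as 'anomalous' and 0 as 'normal'."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the chance that a random anomalous
    observation gets a higher score than a random normal one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 4 normal sessions, 2 manually flagged ones.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]        # higher means more anomalous
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(confusion_matrix(y_true, y_pred))  # {'tp': 2, 'fp': 1, 'tn': 3, 'fn': 0}
print(roc_auc(y_true, scores))           # 0.875
```

The confusion matrix depends on the chosen threshold, while the AUC summarizes the ranking quality across all thresholds.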

Rule set

The methods described so far do not exhaust all available solutions. To make the analysis of the problem more complete, it is worth comparing the results with an approach based on a set of rules. It is conceptually the simplest, which does not necessarily mean worse results.

As a measure of “normality” I used the sum of the distances of a given observation from the mean, expressed in units of standard deviation. For example, for an observation where the value of the first feature is 5, the mean is 3, and the standard deviation is 2, the result is:

|5 - 3| / 2 = 1

The operation is repeated for the remaining features and the absolute values are summed. The two graphs below show the distribution of the results for the sums and for the individual columns, respectively.

[Figure: distribution of the summed distances]

[Figure: distribution of the distances for single columns]

As expected, the distribution is strongly clustered around low values. As the criterion for belonging to the anomalous group, I adopted a value exceeding the 99th percentile, either for the sum of distances or for any single feature describing a quotation. Below, we see visualizations analogous to those shown for the Isolation Forest model. It seems that the results obtained using rules are slightly worse.

[Figures: rule-set results, analogous to Figures 11-14]
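The rule described in this section (sum of absolute distances in units of standard deviation, with a 99th-percentile cutoff on the sum or on any single feature) can be sketched as follows. The population standard deviation and the nearest-rank percentile are assumptions, since the article does not specify either, and the data is synthetic.

```python
import math
import statistics
from typing import List, Sequence


def abs_z_scores(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Distance of every value from its column mean, in units of standard deviation."""
    d = len(rows[0])
    columns = [[r[j] for r in rows] for j in range(d)]
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]   # assumption: population std
    return [[abs(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(d)]
            for r in rows]


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def flag_anomalies(rows: Sequence[Sequence[float]], q: float = 99.0) -> List[bool]:
    """Flag a row if its summed distance, or any single feature, exceeds the q-th percentile."""
    z = abs_z_scores(rows)
    sums = [sum(r) for r in z]
    sum_cut = percentile(sums, q)
    col_cuts = [percentile([r[j] for r in z], q) for j in range(len(z[0]))]
    return [s > sum_cut or any(v > cut for v, cut in zip(r, col_cuts))
            for s, r in zip(sums, z)]


# The worked example from the text: value 5, mean 3, standard deviation 2.
print(abs(5 - 3) / 2)   # 1.0

# Synthetic data: 200 ordinary rows plus one extreme row at the end.
rows = [(float(i % 10), float((i * 7) % 10)) for i in range(200)] + [(100.0, 100.0)]
flags = flag_anomalies(rows)
print(flags[-1], sum(flags))
```

Because the cutoff is a percentile of the data itself, the rule always flags roughly the top 1% of observations, whether or not they are manipulation; the threshold `q` controls how strict it is.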

Conclusions

The work done in this article can be divided into two areas: generation of quotations and detection of unusual cases. The former is not a simple task. Training models based on the GAN architecture involves many problems, such as properly balancing the discriminator and generator, choosing optimal training parameters, etc. Nevertheless, the solution using Causal Convolution networks can serve as a starting point for further research in this direction. The results obtained in this way looked quite realistic.

In detecting Pump & Dump practices, the standard anomaly detection methods, Isolation Forest and the rule set, performed better. On a pre-prepared test set they achieved results that attest to the usefulness of the models. At the same time, it should be noted that there is no official data on the detection of the described practices or on the scale of the phenomenon, and thus no well-defined labeled set. Not every significant jump in quotations or transaction volume must mean illegal practices, which adds another degree of difficulty to the overall solution.

In conclusion, the methods described and the application of unsupervised machine learning techniques have managed to track down probable cases of stock market manipulation. As expected, the results obtained are subject to inaccuracies due to the nature of unsupervised learning.
