# 1. Introduction

Using the latest advancements in deep learning to predict stock price movements

Boris B

Jan 9

Link to the complete notebook: https://github.com/borisbanushev/stockpredictionai

In this notebook I will create a complete process for predicting stock price movements. Follow along and we will achieve some pretty good results. For that purpose we will use a Generative Adversarial Network (GAN) with an LSTM, a type of Recurrent Neural Network, as the generator, and a Convolutional Neural Network (CNN) as the discriminator. We use an LSTM for the obvious reason that we are trying to predict time series data. Why do we use a GAN, and specifically a CNN as the discriminator? That is a good question: there are special sections on that later.

We will go into greater detail for each step, of course, but the most difficult part is the GAN: a very tricky part of successfully training a GAN is getting the right set of hyperparameters. For that reason we will use Bayesian optimisation (along with Gaussian processes) and Deep Reinforcement Learning (DRL) for deciding when and how to change the GAN’s hyperparameters (the exploration vs. exploitation dilemma). In creating the reinforcement learning part I will use the most recent advancements in the field, such as Rainbow and PPO.

We will use a lot of different types of input data. Along with the stock’s historical trading data and technical indicators, we will use the newest advancements in NLP (using ‘Bidirectional Encoder Representations from Transformers’, BERT, sort of a transfer learning for NLP) to create sentiment analysis (as a source for fundamental analysis), Fourier transforms for extracting overall trend directions, stacked autoencoders for identifying other high-level features, Eigen portfolios for finding correlated assets, autoregressive integrated moving average (ARIMA) for the stock function approximation, and many more, in order to capture as much information, patterns, dependencies, etc., as possible about the stock. As we all know, the more (data) the merrier. Predicting stock price movements is an extremely complex task, so the more we know about the stock (from different perspectives) the higher our chances are.

For the purpose of creating all neural nets we will use MXNet and its high-level API — Gluon, and train them on multiple GPUs.

Note: Although I try to get into details of the math and the mechanisms behind almost all algorithms and techniques, this notebook is not explicitly intended to explain how machine/deep learning, or the stock markets, work. The purpose is rather to show how we can use different techniques and algorithms for the purpose of accurately predicting stock price movements, and to also give rationale behind the reason and usefulness of using each technique at each step.


Accurately predicting the stock markets is a complex task as there are millions of events and pre-conditions for a particular stock to move in a particular direction. So we need to be able to capture as many of these pre-conditions as possible. We also need to make several important assumptions: 1) markets are not 100% random, 2) history repeats, 3) markets follow people’s rational behavior, and 4) the markets are ‘perfect’. And, please, do read the Disclaimer at the bottom.

We will try to predict the price movements of Goldman Sachs (NYSE: GS). For the purpose, we will use the daily closing price from January 1st, 2010 to December 31st, 2018 (seven years for training purposes and two years for validation purposes). We will use the terms ‘Goldman Sachs’ and ‘GS’ interchangeably.

We need to understand what affects whether GS’s stock price will move up or down. It is what people as a whole think. Hence, we need to incorporate as much information (depicting the stock from different aspects and angles) as possible. We will use daily data: 1,585 days to train the various algorithms (70% of the data we have) and the next 680 days as test (hold-out) data, against which we will compare the predicted results. Each type of data (we will refer to it as a feature) is explained in greater detail in later sections, but, as a high-level overview, the features we will use are:

Correlated assets — these are other assets (any type, not necessarily stocks, such as commodities, FX, indices, or even fixed income securities). A big company, such as Goldman Sachs, obviously doesn’t ‘live’ in an isolated world — it depends on, and interacts with, many external factors, including its competitors, clients, the global economy, the geo-political situation, fiscal and monetary policies, access to capital, etc. The details are listed later.

Technical indicators — a lot of investors follow technical indicators. We will include the most popular indicators as independent features. Among them — 7 and 21 days moving average, exponential moving average, momentum, Bollinger bands, MACD.

Fundamental analysis — a very important feature indicating whether a stock might move up or down. There are two kinds of input that can be used in fundamental analysis: 1) analysing the company’s performance using 10-K and 10-Q reports, ROE, P/E, etc. (we will not use this), and 2) news, which can indicate upcoming events that might move the stock in a certain direction. We will read all daily news for Goldman Sachs and extract whether the total sentiment about Goldman Sachs on that day is positive, neutral, or negative (as a score from 0 to 1). As many investors closely read the news and make investment decisions based (partially, of course) on it, there is a somewhat high chance that if, say, the news for Goldman Sachs today is extremely positive, the stock will surge tomorrow. One crucial point: we will perform feature importance (meaning how indicative a feature is of the movement of GS) on absolutely every feature (including this one) later on and decide whether we will use it. More on that later. To create accurate sentiment predictions we will use Natural Language Processing (NLP), specifically BERT, Google’s recently announced NLP approach for transfer learning, to classify the sentiment of stock news.

Fourier transforms — Along with the daily closing price, we will create Fourier transforms in order to generalize several long- and short-term trends. Using these transforms we will eliminate a lot of noise (random walks) and create approximations of the real stock movement. Having trend approximations can help the LSTM network pick its prediction trends more accurately.

Autoregressive Integrated Moving Average (ARIMA) — This was one of the most popular techniques for predicting future values of time series data (in the pre-neural networks ages). Let’s add it and see if it comes off as an important predictive feature.

Stacked autoencoders — most of the aforementioned features (fundamental analysis, technical analysis, etc) were found by people after decades of research. But maybe we have missed something. Maybe there are hidden correlations that people cannot comprehend due to the enormous amount of data points, events, assets, charts, etc. With stacked autoencoders (type of neural networks) we can use the power of computers and probably find new types of features that affect stock movements. Even though we will not be able to understand these features in human language, we will use them in the GAN.

Deep unsupervised learning for anomaly detection in options pricing. We will use one more feature: for every day we will add the price of a 90-day call option on Goldman Sachs stock. Options pricing itself combines a lot of data. The price of an options contract depends on the future value of the stock (analysts also try to predict the price in order to come up with the most accurate price for the call option). Using deep unsupervised learning (Self-Organizing Maps) we will try to spot anomalies in every day’s pricing. An anomaly (such as a drastic change in pricing) might indicate an event that could help the LSTM learn the overall stock pattern.

Next, having so many features, we need to perform a couple of important steps:

Perform statistical checks for the ‘quality’ of the data. If the data we create is flawed, then no matter how sophisticated our algorithms are, the results will not be positive. The checks include making sure the data does not suffer from heteroskedasticity, multicollinearity, or serial correlation.

Create feature importance. If a feature (e.g. another stock or a technical indicator) has no explanatory power for the stock we want to predict, then there is no need for us to use it in the training of the neural nets. We will use XGBoost (eXtreme Gradient Boosting), a type of boosted tree regression algorithm.

As a final step of our data preparation, we will also create Eigen portfolios using Principal Component Analysis (PCA) in order to reduce the dimensionality of the features created from the autoencoders.

    print('There are {} number of days in the dataset.'.format(dataset_ex_df.shape[0]))

output >>> There are 2265 number of days in the dataset.

Let’s visualize the stock for the last nine years. The dashed vertical line represents the separation between training and test data.
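A chronological split consistent with the numbers quoted above can be sketched as follows. The helper name `chronological_split` and the stand-in list of day indices are assumptions made for illustration; the notebook’s own split code is not shown in this section.

```python
def chronological_split(rows, train_fraction=0.7):
    """Split time-ordered rows into (train, test) without shuffling."""
    cut = int(len(rows) * train_fraction)  # floor, so the test set gets the remainder
    return rows[:cut], rows[cut:]


days = list(range(2265))  # stand-in for the 2,265 daily rows
train_days, test_days = chronological_split(days)
print(len(train_days), len(test_days))
```

Keeping the split chronological (no shuffling) matters for time series: every test day lies strictly after every training day, so the model is never evaluated on data that precedes what it was trained on.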

As explained earlier we will use other assets as features, not only GS.

So what other assets would affect GS’s stock movements? Good understanding of the company, its lines of businesses, competitive landscape, dependencies, suppliers and client type, etc is very important for picking the right set of correlated assets:

Overall, we have 72 other assets in the dataset — daily price for every asset.

We already covered what are technical indicators and why we use them so let’s jump straight to the code. We will create technical indicators only for GS.

    def get_technical_indicators(dataset):
        """Create the technical indicators for a DataFrame with a 'price' column."""
        # 7- and 21-day moving averages
        dataset['ma7'] = dataset['price'].rolling(window=7).mean()
        dataset['ma21'] = dataset['price'].rolling(window=21).mean()

        # MACD (pd.ewma was removed from pandas; Series.ewm(...).mean() replaces it)
        dataset['26ema'] = dataset['price'].ewm(span=26).mean()
        dataset['12ema'] = dataset['price'].ewm(span=12).mean()
        dataset['MACD'] = dataset['12ema'] - dataset['26ema']

        # Bollinger Bands (pd.stats.moments.rolling_std was removed as well)
        dataset['20sd'] = dataset['price'].rolling(window=20).std()
        dataset['upper_band'] = dataset['ma21'] + (dataset['20sd'] * 2)
        dataset['lower_band'] = dataset['ma21'] - (dataset['20sd'] * 2)

        # Exponential moving average
        dataset['ema'] = dataset['price'].ewm(com=0.5).mean()

        # Momentum, kept as in the original notebook (price minus one);
        # a lagged difference such as dataset['price'].diff() is the more usual definition
        dataset['momentum'] = dataset['price'] - 1

        return dataset

So we have the technical indicators (including MACD, Bollinger bands, etc) for every trading day. We have in total 12 technical indicators.

Let’s visualise the last 400 days for these indicators.

For fundamental analysis we will perform sentiment analysis on all daily news about GS. With a sigmoid at the end, the result will be between 0 and 1: the closer the score is to 0, the more negative the news (closer to 1 indicates positive sentiment). For each day, we will create the average daily score (as a number between 0 and 1) and add it as a feature.
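The daily averaging step can be sketched in a few lines. This is only an illustration: the function name `daily_sentiment` and the `(date, score)` pairs are hypothetical, and the scores stand in for whatever the sentiment classifier outputs.

```python
from collections import defaultdict


def daily_sentiment(scored_articles):
    """Average per-article scores (each in [0, 1]) into one score per day."""
    by_day = defaultdict(list)
    for day, score in scored_articles:
        by_day[day].append(score)
    return {day: sum(scores) / len(scores) for day, scores in by_day.items()}


# Hypothetical (date, score) pairs, as a classifier with a sigmoid output might produce.
articles = [("2018-01-02", 0.9), ("2018-01-02", 0.7), ("2018-01-03", 0.2)]
print(daily_sentiment(articles))
```

Days without any news would be missing from the result; how to fill them (for example with a neutral 0.5) is a separate modelling choice that this sketch leaves open.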

2.3.1. Bidirectional Encoder Representations from Transformers — BERT

For the purpose of classifying news as positive or negative (or neutral) we will use BERT, which is a pre-trained language representation.

Pre-trained BERT models are already available in MXNet/Gluon. We just need to instantiate them and add two (an arbitrary number) Dense layers, ending in a softmax, so the score is from 0 to 1.

import bert

Going into the details of BERT and the NLP part is not in the scope of this notebook, but if you are interested, do let me know: I will create a new repo only for BERT, as it is definitely quite promising when it comes to language processing tasks.

Fourier transforms take a function and create a series of sine waves (with different amplitudes, frequencies, and phases). When combined, these sine waves approximate the original function. Mathematically speaking, the transforms look like this:
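In the usual notation (the symbols $g$, $G$, $x_n$ and $X_k$ are chosen here for illustration and are not taken from the notebook), the continuous transform of a function $g(t)$, and the discrete form that `np.fft.fft` computes for $N$ samples, are:

$$G(f) = \int_{-\infty}^{\infty} g(t)\, e^{-2\pi i f t}\, dt$$

$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i k n / N}, \qquad k = 0, 1, \ldots, N-1$$

Each $X_k$ is a complex number whose magnitude gives the amplitude and whose angle gives the phase of the $k$-th sine wave.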

We will use Fourier transforms to extract global and local trends in the GS stock, and to also denoise it a little. So let’s see how it works.

    """ Code to create the Fourier transform """
    data_FT = dataset_ex_df[['Date', 'GS']]
    close_fft = np.fft.fft(np.asarray(data_FT['GS'].tolist()))
    fft_df = pd.DataFrame({'fft': close_fft})
    fft_df['absolute'] = fft_df['fft'].apply(lambda x: np.abs(x))
    fft_df['angle'] = fft_df['fft'].apply(lambda x: np.angle(x))

    plt.figure(figsize=(14, 7), dpi=100)
    fft_list = np.asarray(fft_df['fft'].tolist())
    for num_ in [3, 6, 9, 100]:
        # Keep num_ components at each end of the spectrum and zero out the rest
        fft_list_m10 = np.copy(fft_list)
        fft_list_m10[num_:-num_] = 0
        # ifft returns complex values; plot the real part
        plt.plot(np.real(np.fft.ifft(fft_list_m10)), label='Fourier transform with {} components'.format(num_))
    plt.plot(data_FT['GS'], label='Real')
    plt.xlabel('Days')
    plt.ylabel('USD')
    plt.title('Figure 3: Goldman Sachs (close) stock prices & Fourier transforms')
    plt.legend()
    plt.show()

As you see in Figure 3 the more components from the Fourier transform we use the closer the approximation function is to the real stock price (the 100 components transform is almost identical to the original function — the red and the purple lines almost overlap). We use Fourier transforms for the purpose of extracting long- and short-term trends so we will use the transforms with 3, 6, and 9 components. You can infer that the transform with 3 components serves as the long term trend.

Another technique used to denoise data is called wavelets. Wavelets and Fourier transform gave similar results so we will only use Fourier transforms.

ARIMA is a technique for predicting time series data. We will show how to use it, and although ARIMA will not serve as our final prediction, we will use it as a technique to denoise the stock a little and to (possibly) extract some new patterns or features.

    error = mean_squared_error(test, predictions)
    print('Test MSE: %.3f' % error)

output >>> Test MSE: 10.151

As we can see from Figure 5, ARIMA gives a very good approximation of the real stock price. We will use the price predicted by ARIMA as an input feature into the LSTM because, as we mentioned before, we want to capture as many features and patterns about Goldman Sachs as possible. We got a test MSE (mean squared error) of 10.151, which by itself is not a bad result (considering we do have a lot of test data), but still, we will only use it as a feature in the LSTM.

Ensuring that the data has good quality is very important for our models. In order to make sure our data is suitable we will perform a couple of simple checks in order to ensure that the results we achieve and observe are indeed real, rather than compromised due to the fact that the underlying data distribution suffers from fundamental errors.

2.6.1. Heteroskedasticity, multicollinearity, serial correlation

We will not go into the code here as it is straightforward and our focus is more on the deep learning parts, but the data is of good quality.

    print('Total dataset has {} samples, and {} features.'.format(dataset_total_df.shape[0],
                                                                  dataset_total_df.shape[1]))

output >>> Total dataset has 2265 samples, and 112 features.

So, after adding all types of data (the correlated assets, technical indicators, fundamental analysis, Fourier, and ARIMA) we have a total of 112 features for the 2,265 days (as mentioned before, however, only 1,585 days are for training data).

We will also have some more features generated from the autoencoders.

2.7.1. Feature importance with XGBoost

Having so many features we have to consider whether all of them are really indicative of the direction GS stock will take. For example, we included USD denominated LIBOR rates in the dataset because we think that changes in LIBOR might indicate changes in the economy, that, in turn, might indicate changes in the GS’s stock behavior. But we need to test. There are many ways to test feature importance, but the one we will apply uses XGBoost, because it gives one of the best results in both classification and regression problems.

Since the features dataset is quite large, for the purpose of the presentation here we’ll use only the technical indicators. During the real features importance testing all selected features proved somewhat important so we won’t exclude anything when training the GAN.

    regressor = xgb.XGBRegressor(gamma=0.0, n_estimators=150, base_score=0.7, colsample_bytree=1, learning_rate=0.05)

    xgbModel = regressor.fit(X_train_FI, y_train_FI,
                             eval_set=[(X_train_FI, y_train_FI), (X_test_FI, y_test_FI)],
                             verbose=False)

    fig = plt.figure(figsize=(8, 8))
    plt.xticks(rotation='vertical')
    plt.bar([i for i in range(len(xgbModel.feature_importances_))],
            xgbModel.feature_importances_.tolist(),
            tick_label=X_test_FI.columns)
    plt.title('Figure 6: Feature importance of the technical indicators.')
    plt.show()

Not surprisingly (for those with experience in stock trading), MA7, MACD, and BB are among the important features.

I followed the same logic for performing feature importance over the whole dataset — just the training took longer and results were a little more difficult to read, as compared with just a handful of features.

Before we proceed to the autoencoders, we’ll explore an alternative activation function.

2.8.1. Activation function — GELU (Gaussian Error)

GELU (Gaussian Error Linear Units) was recently proposed (link). In the paper the authors show several instances in which neural networks using GELU outperform networks using ReLU as an activation. GELU is also used in BERT, the NLP approach we used for news sentiment analysis.

We will use GELU for the autoencoders.

Note: The cell below shows the logic behind the math of GELU. It is not the actual implementation as an activation function. I had to implement GELU inside MXNet. If you follow the code and change act_type=’relu’ to act_type=’gelu’ it will not work, unless you change the implementation of MXNet. Make a pull request on the whole project to access the MXNet implementation of GELU.

Let’s visualize GELU, ReLU, and LeakyReLU (the last one is mainly used in GANs – we also use it).

    def gelu(x):
        return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * math.pow(x, 3))))

    def relu(x):
        return max(x, 0)

    def lrelu(x):
        return max(0.01 * x, x)

    plt.figure(figsize=(15, 5))
    plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=.5, hspace=None)

    ranges_ = (-10, 3, .25)

    plt.subplot(1, 2, 1)
    plt.plot([i for i in np.arange(*ranges_)], [relu(i) for i in np.arange(*ranges_)], label='ReLU', marker='.')
    plt.plot([i for i in np.arange(*ranges_)], [gelu(i) for i in np.arange(*ranges_)], label='GELU')
    plt.hlines(0, -10, 3, colors='gray', linestyles='--', label='0')
    plt.title('Figure 7: GELU as an activation function for autoencoders')
    plt.ylabel('f(x) for GELU and ReLU')
    plt.xlabel('x')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot([i for i in np.arange(*ranges_)], [lrelu(i) for i in np.arange(*ranges_)], label='Leaky ReLU')
    plt.hlines(0, -10, 3, colors='gray', linestyles='--', label='0')
    plt.ylabel('f(x) for Leaky ReLU')
    plt.xlabel('x')
    plt.title('Figure 8: LeakyReLU')
    plt.legend()

    plt.show()
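The `gelu` function above is the tanh approximation. For comparison, here is a small sketch of the exact definition, x · Φ(x) with Φ the standard normal CDF, next to that approximation. The names `gelu_exact` and `gelu_tanh` are introduced here for illustration only and are not part of the notebook.

```python
import math


def gelu_exact(x):
    """GELU defined through the standard normal CDF: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation, the same expression as gelu() above."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


# Largest gap between the two forms on a grid over [-5, 5].
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(max_gap)
```

The gap stays small over the whole grid, which is why the cheaper tanh form is commonly used in place of the exact one.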

Note: In future versions of this notebook I will experiment using U-Net (link), and try to utilize the convolutional layer and extract (and create) even more features about the stock’s underlying movement patterns. For now, we will just use a simple autoencoder made only from Dense layers.

Ok, back to the autoencoders, depicted below (the image is only schematic, it doesn’t represent the real number of layers, units, etc.)

Note: One thing that I will explore in a later version is removing the last layer in the decoder. Normally, in autoencoders the number of encoder layers equals the number of decoder layers. We want, however, to extract higher level features (rather than recreating the same input), so we can skip the last layer in the decoder. We achieve this by creating the encoder and decoder with the same number of layers during training, but when we create the output we use the layer next to the last one, as it would contain the higher level features.

The full code for the autoencoders is available in the accompanying Github — link at top.

We created 112 more features from the autoencoder. As we want to only have high level features (overall patterns) we will create an Eigen portfolio on the newly created 112 features using Principal Component Analysis (PCA). This will reduce the dimension (number of columns) of the data. The descriptive capability of the Eigen portfolio will be the same as the original 112 features.

Note: Once again, this is purely experimental. I am not 100% sure the described logic will hold. Like everything else in AI and deep learning, this is an art and needs experiments.

How do GANs work?

As mentioned before, the purpose of this notebook is not to explain in detail the math behind deep learning but to show its applications. Of course, thorough and very solid understanding from the fundamentals down to the smallest details, in my opinion, is extremely imperative. Hence, we will try to balance and give a high-level overview of how GANs work in order for the reader to fully understand the rationale behind using GANs in predicting stock price movements. Feel free to skip this and the next section if you are experienced with GANs (and do check section 4.2.).

A GAN network consists of two models — a Generator (G) and Discriminator (D). The steps in training a GAN are:

The Generator is, using random data (noise denoted z), trying to ‘generate’ data indistinguishable of, or extremely close to, the real data. Its purpose is to learn the distribution of the real data.

Randomly, real or generated data is fed into the Discriminator, which acts as a classifier and tries to tell whether the data is coming from the Generator or is real. D estimates the probability that the incoming sample comes from the real dataset (more info on comparing two distributions in section 3.2. below).

Then, the losses from G and D are combined and propagated back through the generator. Ergo, the generator’s loss depends on both the generator and the discriminator. This is the step that helps the Generator learn about the real data distribution. If the generator doesn’t do a good job at generating realistic data (having the same distribution), the Discriminator will find it very easy to distinguish generated from real data. Hence, the Discriminator’s loss will be very small. A small discriminator loss results in a bigger generator loss (see the equation below for L(D,G)). This makes creating the discriminator a bit tricky, because too good a discriminator will always result in a huge generator loss, making the generator unable to learn.

The process goes on until the Discriminator can no longer distinguish generated from real data.

Combined, D and G play a sort of minimax game: the Generator tries to fool the Discriminator into assigning a high probability to fake examples, i.e. to minimize $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$, while the Discriminator wants to separate the data coming from the Generator, $D(G(z))$, from the real data by maximizing $\mathbb{E}_{x \sim p_r(x)}[\log D(x)]$. Having separate loss functions, however, it is not clear how both can converge together (that is why we use some advancements over the plain GANs, such as Wasserstein GAN). Overall, the combined loss function looks like:

$$L(D, G) = \mathbb{E}_{x \sim p_r(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
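As a numerical illustration of the two expectation terms above, the sketch below estimates them from a handful of discriminator outputs. The function name `gan_value` and the sample probabilities are made up for illustration; this is not the training code of the notebook.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical discriminator outputs (probability that a sample is real).
confident = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])  # D separates well
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])       # D cannot tell
print(confident, fooled)
```

The value is close to zero when D separates real from generated samples well, and drops to −2·log 2 when D outputs 0.5 for everything, which is the point the generator is pushing toward.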

Note: Really useful tips for training GANs can be found here.

Note: I will not include the complete code behind the GAN and the Reinforcement learning parts in this notebook — only the results from the execution (the cell outputs) will be shown. Make a pull request or contact me for the code.

Generative Adversarial Networks (GANs) have recently been used mainly for creating realistic images, paintings, and video clips. There aren’t many applications of GANs for predicting time-series data, as in our case. The main idea, however, should be the same: we want to predict future stock movements. In the future, the pattern and behavior of GS’s stock should be more or less the same (unless it starts operating in a totally different way, or the economy drastically changes). Hence, we want to ‘generate’ data for the future that will have a similar (not absolutely the same, of course) distribution to the one we already have, the historical trading data. So, in theory, it should work.

In our case, we will use LSTM as a time-series generator, and CNN as a discriminator.

Note: The next couple of sections assume some experience with GANs.

I. Metropolis-Hastings GAN

A recent improvement over the traditional GANs came out from Uber’s engineering team and is called Metropolis-Hastings GAN (MHGAN). The idea behind Uber’s approach is (as they state it) somewhat similar to another approach created by Google and University of California, Berkeley called Discriminator Rejection Sampling (DRS). Basically, when we train a GAN we use the Discriminator (D) for the sole purpose of better training the Generator (G). Often, after training the GAN we do not use D any more. MHGAN and DRS, however, try to use D in order to choose samples generated by G that are close to the real data distribution (the slight difference between them is that MHGAN uses Markov Chain Monte Carlo (MCMC) for sampling).

MHGAN takes K samples generated from the G (created from independent noise inputs to the G — z0 to zK in the figure below). Then it sequentially runs through the K outputs (x′0 to x′K) and following an acceptance rule (created from the Discriminator) decides whether to accept the current sample or keep the last accepted one. The last kept output is the one considered the real output of G.

Note: MHGAN is originally implemented by Uber in pytorch. I only transferred it into MXNet/Gluon.

Note: I will also upload it into Github sometime soon.

Figure 10: Visual representation of MHGAN (from the original Uber post).
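The selection step described above can be sketched as follows. This is a simplified illustration, not the MXNet/Gluon port mentioned in the notes: the function name `mh_select` is hypothetical, the chain is started at the first generated sample (the MH-GAN paper discusses other initialisations and also calibrates the discriminator), and the acceptance ratio is the one given in that paper.

```python
import random


def mh_select(samples, d_scores, rng=random):
    """Return the sample kept at the end of an independence Metropolis-Hastings chain.

    samples  : K candidate outputs from the generator
    d_scores : discriminator probabilities D(x) in (0, 1), one per sample
    """
    current = 0  # start the chain at the first candidate (a simplification)
    for k in range(1, len(samples)):
        # Acceptance ratio: (1/D(x_current) - 1) / (1/D(x_candidate) - 1)
        ratio = (1.0 / d_scores[current] - 1.0) / (1.0 / d_scores[k] - 1.0)
        if rng.random() <= min(1.0, ratio):
            current = k  # accept the candidate
    return samples[current]


# Candidates that look increasingly real to D are always accepted.
print(mh_select(["x0", "x1", "x2"], [0.2, 0.5, 0.9]))
```

A candidate that the discriminator rates as more realistic than the current one is always accepted; a less realistic one is accepted only with probability equal to the ratio, which is what keeps the chain close to the real data distribution.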

II. Wasserstein GAN

Training GANs is quite difficult. Models may never converge and mode collapse can easily happen. We will use a modification of GAN called Wasserstein GAN — WGAN.

Again, we will not go into details, but the most notable points to make are:

Hands down, this was the toughest part of this notebook. Mixing WGAN and MHGAN took me three days.

3.4.1. LSTM or GRU

As mentioned before, the generator is an LSTM network, a type of Recurrent Neural Network (RNN). RNNs are used for time-series data because they keep track of all previous data points and can capture patterns developing through time. Due to their nature, RNNs often suffer from vanishing gradients: the changes the weights receive during training become so small that the weights effectively stop changing, making the network unable to converge to a minimal loss. (The opposite problem can also be observed at times, when gradients become too big. This is called exploding gradients, but the solution is quite simple: clip gradients if they start exceeding some constant number, i.e. gradient clipping.) Two modifications tackle this problem: the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM). The biggest differences between the two are: 1) GRU has 2 gates (update and reset) and LSTM has 4 (update, input, forget, and output), 2) LSTM maintains an internal memory state, while GRU doesn’t, and 3) LSTM applies a nonlinearity (sigmoid) before the output gate, GRU doesn’t.

In most cases, LSTM and GRU give similar results in terms of accuracy, but GRU is much less computationally intensive, as it has far fewer trainable parameters. LSTMs, however, are much more widely used.
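The gradient clipping mentioned above can be sketched in plain Python. This is a minimal illustration that clips by the global L2 norm of a flat list of gradient values; the function name `clip_by_global_norm` is an assumption made here, and in practice one would rely on the framework’s own utility (Gluon provides `gluon.utils.clip_global_norm` for this purpose).

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already small enough, leave untouched
    scale = max_norm / norm
    return [g * scale for g in grads]


print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

Rescaling the whole vector, rather than capping each component separately, keeps the direction of the update and only shortens its length.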

Strictly speaking, the math behind the LSTM cell (the gates) is:
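One standard way to write it, with weight matrices $W$ and biases $b$ as assumed notation ($x_t$ is the input, $h_t$ the hidden state, and $c_t$ the cell state at step $t$):

$$
\begin{aligned}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$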

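where $\odot$ is the element-wise multiplication operator and, for all $x = [x_1, x_2, \ldots, x_k]^\top \in \mathbb{R}^k$, the two activation functions are:

$$\sigma(x) = \left[\frac{1}{1+\exp(-x_1)}, \ldots, \frac{1}{1+\exp(-x_k)}\right]^\top, \qquad \tanh(x) = \left[\frac{1-\exp(-2x_1)}{1+\exp(-2x_1)}, \ldots, \frac{1-\exp(-2x_k)}{1+\exp(-2x_k)}\right]^\top$$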

3.4.2. The LSTM architecture

The LSTM architecture is very simple — one LSTM layer with 112 input units (as we have 112 features in the dataset) and 500 hidden units, and one Dense layer with 1 output – the price for every day. The initializer is Xavier and we will use L1 loss (which is mean absolute error loss with L1 regularization – see section 3.4.5. for more info on regularization).

Note — In the code you can see we use Adam (with learning rate of .01) as an optimizer. Don’t pay too much attention on that now – there is a section specially dedicated to explain what hyperparameters we use (learning rate is excluded as we have learning rate scheduler – section 3.4.3.) and how we optimize these hyperparameters – section 3.6.

    import mxnet as mx
    from mxnet import gluon
    from mxnet.gluon import nn, rnn

    gan_num_features = dataset_total_df.shape[1]
    sequence_length = 17

    class RNNModel(gluon.Block):
        def __init__(self, num_embed, num_hidden, num_layers, bidirectional=False,
                     sequence_length=sequence_length, **kwargs):
            super(RNNModel, self).__init__(**kwargs)
            self.num_hidden = num_hidden
            with self.name_scope():
                # One LSTM layer followed by a Dense layer with a single output
                self.rnn = rnn.LSTM(num_hidden, num_layers, input_size=num_embed,
                                    bidirectional=bidirectional, layout='TNC')
                self.decoder = nn.Dense(1, in_units=num_hidden)

        def forward(self, inputs, hidden):
            output, hidden = self.rnn(inputs, hidden)
            decoded = self.decoder(output.reshape((-1, self.num_hidden)))
            return decoded, hidden

        def begin_state(self, *args, **kwargs):
            return self.rnn.begin_state(*args, **kwargs)

    lstm_model = RNNModel(num_embed=gan_num_features, num_hidden=500, num_layers=1)
    lstm_model.collect_params().initialize(mx.init.Xavier(), ctx=mx.cpu())
    trainer = gluon.Trainer(lstm_model.collect_params(), 'adam', {'learning_rate': .01})
    loss = gluon.loss.L1Loss()

We will use 500 neurons in the LSTM layer and use Xavier initialization. For regularization we’ll use L1. Let’s see what’s inside the LSTM as printed by MXNet.

print(lstm_model)

output >>>

    RNNModel(
      (rnn): LSTM(112 -> 500, TNC)
      (decoder): Dense(500 -> 1, linear)
    )

As we can see, the input of the LSTM is the 112 features (dataset_total_df.shape[1]), which go into 500 neurons in the LSTM layer and are then transformed to a single output: the stock price value.

The logic behind the LSTM is: we take 17 (sequence_length) days of data (again, the data being the stock price for GS every day plus all the other features for that day: correlated assets, sentiment, etc.) and try to predict the 18th day. Then we move the 17-day window forward by one day and again predict the day that follows it. We iterate like this over the whole dataset (of course in batches).
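The windowing logic can be sketched as follows. The helper name `sliding_windows` and the toy series are assumptions made for illustration; in the notebook each row would hold all the features for one day rather than a single number.

```python
def sliding_windows(rows, targets, sequence_length=17):
    """Pair each run of `sequence_length` consecutive rows with the next day's target."""
    pairs = []
    for start in range(len(rows) - sequence_length):
        window = rows[start:start + sequence_length]
        pairs.append((window, targets[start + sequence_length]))
    return pairs


# Toy data: 20 days, one value per day; the target is the same series.
series = list(range(20))
pairs = sliding_windows(series, series)
print(len(pairs), pairs[0][1], pairs[-1][1])
```

With 20 days and a 17-day window there are three (window, target) pairs, whose targets are days 17, 18, and 19 (counting from 0).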

In another post I will explore whether modification over the vanilla LSTM would be more beneficial, such as:

3.4.3. Learning rate scheduler

One of the most important hyperparameters is the learning rate. Setting the learning rate for almost every optimizer (such as SGD, Adam, or RMSProp) is crucially important when training neural networks because it controls both the speed of convergence and the ultimate performance of the network. One of the simplest learning rate strategies is to have a fixed learning rate throughout the training process. Choosing a small learning rate allows the optimizer to find good solutions, but this comes at the expense of limiting the initial speed of convergence. Changing the learning rate over time can overcome this tradeoff.

Recent papers, such as this one, show the benefits of changing the global learning rate during training, in terms of both convergence and time. Let’s plot the learning rates we’ll be using for each epoch.

    schedule = CyclicalSchedule(TriangularSchedule, min_lr=0.5, max_lr=2, cycle_length=500)
    iterations = 1500

    plt.plot([i + 1 for i in range(iterations)], [schedule(i) for i in range(iterations)])
    plt.title('Learning rate for each epoch')
    plt.xlabel("Epoch")
    plt.ylabel("Learning Rate")
    plt.show()
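`CyclicalSchedule` and `TriangularSchedule` are not defined in the snippet above. A minimal sketch of what such classes could look like is given below; it is loosely modelled on the learning-rate schedule examples in the MXNet tutorials, leaves out their decay options, and should be read as an assumption about the notebook’s helpers rather than their actual code.

```python
class TriangularSchedule:
    """Linear ramp from min_lr up to max_lr and back down over one cycle."""

    def __init__(self, min_lr, max_lr, cycle_length, inc_fraction=0.5):
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.cycle_length = cycle_length
        self.inc_fraction = inc_fraction  # share of the cycle spent increasing

    def __call__(self, iteration):
        peak = self.cycle_length * self.inc_fraction
        if iteration <= peak:
            unit = iteration / peak
        elif iteration <= self.cycle_length:
            unit = (self.cycle_length - iteration) / (self.cycle_length - peak)
        else:
            unit = 0.0
        return self.min_lr + unit * (self.max_lr - self.min_lr)


class CyclicalSchedule:
    """Repeat a single-cycle schedule every cycle_length iterations."""

    def __init__(self, schedule_class, cycle_length, **kwargs):
        self.cycle_length = cycle_length
        self.schedule = schedule_class(cycle_length=cycle_length, **kwargs)

    def __call__(self, iteration):
        return self.schedule(iteration % self.cycle_length)


schedule = CyclicalSchedule(TriangularSchedule, min_lr=0.5, max_lr=2, cycle_length=500)
print(schedule(0), schedule(250), schedule(500))
```

With these settings the rate starts at `min_lr`, peaks at `max_lr` halfway through each 500-iteration cycle, and returns to `min_lr` at the start of the next one, so 1,500 iterations trace three triangles.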

3.4.4. How to prevent overfitting and the bias-variance trade-off

Having a lot of features and neural networks we need to make sure we prevent overfitting and be mindful of the total loss.

We use several techniques for preventing overfitting (not only in the LSTM, but also in the CNN and the auto-encoders):