Can Twitter Sentiment Analysis Guide Stock Market Investment?
Many individuals and businesses are interested in predicting the direction and size of stock price movements. The stock market is an important component of a country’s economy and one of the main opportunities for firms and investors to invest. To decide whether to buy or sell a stock, traders must forecast market movements: to make money, they should buy stocks whose prices are expected to rise soon and sell stocks whose prices are likely to fall. Traders who can accurately identify stock trends and patterns may earn a substantial profit. However, financial markets are highly volatile and, as a result, difficult to forecast. External factors such as social media and financial news may have a large impact on stock price movement. As a result, social media is seen as an important input for market forecasts.
To avoid purchasing risky stocks, investors evaluate a company’s performance and its stock before deciding whether to buy its shares. This assessment includes an examination of how the company is discussed on social media networks. Twitter is one such platform, and it is quite important in the financial and stock market worlds. Every day, about one hundred million active Twitter users post roughly 500 million tweets. These tweets allow users to convey their thoughts, choices, sentiments, and forecasts, which can be turned into useful information. However, such a massive volume of social media data cannot be evaluated by investors alone; it is practically impossible for people to process manually. As a result, investors want a computerized analysis system that automatically assesses stock movements from these large data sets.
Previous research on stock prediction has drawn extensively on historical or social media data. Research based on historical data relies on technical analysis, in which mathematical methods are applied to past data in order to predict future stock market patterns and prices. Researchers have applied several machine learning approaches, such as deep learning and regression analysis, to historical stock price data. These studies, however, did not take into account external influences such as social media. Social media data matters because events communicated on social media may have a large impact on stock prices and trends, under the assumption that prices vary with human behavior, which social media can reflect.
Social media sentiment is a rich source of information that may reveal favorable or unfavorable attitudes about stocks and trends. In recent years, there has been a substantial amount of research on sentiment analysis in diverse domains such as movie reviews and Twitter feeds. Agarwal et al. studied sentiment analysis on Twitter data and introduced POS-specific prior polarity features, as well as the use of a tree kernel to reduce the need for tedious feature engineering. Pang and Lee 2004 introduced a novel machine-learning strategy that applies text-categorization algorithms to only the subjective portions of a text. Kim 2014 proposed a simple one-layer Convolutional Neural Network (CNN) that yields strong results across a variety of data sets. The CNN outperformed the other models on four of the seven tasks assessed in the experiment. On movie reviews, for example, the CNN achieved the highest accuracy, at 81.5%. The strong results obtained with this CNN architecture suggest that neural networks may be a superior alternative to well-established baseline models such as Support Vector Machines and Logistic Regression.
Furthermore, there is considerable room for improvement in studies that have used both social media and historical data. Chakraborty et al. 2017 used Twitter and stock market data to predict the stock market with machine learning algorithms, as did Khatri and Srivastava 2016, Chen and Laser 2011, and Khan et al. 2020.
This study expands on the work of Chakraborty et al., who published “Predicting stock movement using sentiment analysis of the Twitter stream.” The researchers found that Twitter data may forecast stock prices quite effectively on steady days in the stock market. They used a boosted regression tree model to forecast the stock price difference for the following day based on the current day’s stock market sentiment. This article uses neural networks, specifically a Multilayer Perceptron (MLP) model, to investigate whether they outperform the boosted tree model. It aims to improve on past work by using an MLP and to investigate how effective Twitter data is for forecasting stock market movements and prices.
A sentiment-tagged Twitter dataset of 1.6 million tweets from Sentiment 140 is used for sentiment classification in this article. The Boosted Regression Tree and Multilayer Perceptron models are then used to forecast the following day’s stock movement using today’s tweets containing the terms “stock market,” “Stock Twits,” and “AAPL.” The research question this work tests is “Can Stock Market Related Tweets Accurately Predict Stock Market Movement?” Furthermore, this article investigates whether neural networks are more successful than standard models in predicting stock market movement.
Materials and Procedures
As in Chakraborty et al., the training data set was taken from Sentiment140, which can be found on Kaggle. The collection includes 1.6 million sentiment-tagged tweets obtained using the Sentiment 140 API. The tweets are labeled “1” and “0” to indicate whether they are “positive” or “negative.” We divide the dataset into a training set and a testing set using a random split. There are 1.52 million tweets in the training set and 80,000 tweets (5%) in the testing set. The data distribution is shown in the figure below.
Distributions of data
As seen here, the data is nicely balanced, with about equal numbers of Positive and Negative tweets in both the Sentiment 140 testing and training sets.
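The random 95/5 split and the balance check described above can be sketched as follows. This is a minimal illustration on toy data, not the code used in the study; the function names, the fixed seed, and the `(text, label)` row layout are assumptions made for the example.

```python
import random
from collections import Counter


def split_dataset(rows, test_fraction=0.05, seed=42):
    """Shuffle the rows and hold out `test_fraction` of them for testing.

    `rows` is assumed to be a sequence of (text, label) pairs, with
    label 1 for positive and 0 for negative tweets.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def label_counts(rows):
    """Count how many rows carry each label, to check class balance."""
    return Counter(label for _, label in rows)


# Toy stand-in for the 1.6 million labelled tweets.
toy = [(f"tweet {i}", i % 2) for i in range(1000)]
train, test = split_dataset(toy)
print(len(train), len(test))
print(label_counts(train), label_counts(test))
```

With 1.6 million rows and a test fraction of 0.05, the same function would hold out 80,000 tweets and keep 1.52 million for training, matching the sizes reported above.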
Next, tweets containing the terms “stock market,” “Stock Twits,” and “AAPL” that were posted between January and December 2016 were gathered in order to forecast the associated stock movement. We collected at most one hundred tweets per day. GetOldTweets3, a Python module for retrieving old tweets, was used to gather the tweets. It lets the user retrieve historical tweets by date, keyword, user, or location. We used GetOldTweets3 because, unlike other APIs, it allows us to retrieve older tweets.
Historical stock price data were obtained from Yahoo Finance. Price data for the selected stocks and indices were downloaded from Yahoo Finance as CSV files for the study period. The downloaded files include seven fields: Date, Open, High, Low, Close, Volume, and Adjusted Close, which give the trading day, the opening price, the highest traded price, the lowest traded price, the closing price, the number of shares traded, and the closing price adjusted for dividends paid to investors. Only the Date and Close price are used in this report.
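As a rough illustration of keeping only the Date and Close fields, the sketch below reads a price file with Python’s standard `csv` module. The header names, the `null` marker for missing values, and all prices in the sample are assumptions made for the example; they are not taken from the study’s data.

```python
import csv
import io
from datetime import date


def load_close_prices(csv_file):
    """Return a date-sorted list of (date, close) tuples.

    Only the "Date" and "Close" columns are read; every other column
    in the file is ignored.
    """
    prices = []
    for row in csv.DictReader(csv_file):
        close = row["Close"]
        if close in ("", "null"):  # assumed markers for a missing price
            continue
        prices.append((date.fromisoformat(row["Date"]), float(close)))
    prices.sort()
    return prices


# Invented sample rows, for illustration only.
sample = io.StringIO(
    "Date,Open,High,Low,Close,Volume,Adj Close\n"
    "2016-01-05,10.5,11.0,10.0,10.75,1200,10.5\n"
    "2016-01-04,10.0,10.5,9.5,10.25,1000,10.0\n"
    "2016-01-06,null,null,null,null,null,null\n"
)
prices = load_close_prices(sample)
print(prices)
```

In practice the function would be given an open file handle, for example `open("AAPL.csv", newline="")`, where the file name is hypothetical.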
As in Chakraborty et al., closing price data for the DJIA and Apple Inc. from January 2016 to December 2016 were gathered. Data were collected only for days on which the stock market was open.
Because the stock-keyword tweets were not annotated with sentiment labels, models trained on the Sentiment 140 dataset were used to estimate their sentiment. An excerpt from the Sentiment 140 dataset is shown in the accompanying figure.
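One way to pair each trading day’s tweet sentiment with the following day’s price difference, the quantity the models are asked to forecast, is sketched below. The aggregation of tweets into a single daily score, the function name, and the toy numbers are assumptions made for illustration; the study does not specify these details here.

```python
from datetime import date


def next_day_targets(daily_sentiment, closes):
    """Pair each day's sentiment with the change in close to the next trading day.

    `daily_sentiment` maps a date to one sentiment score for that day
    (assumed here to be the mean predicted label of the day's tweets).
    `closes` is a list of (date, close) tuples for trading days only, so
    "next day" means the next day on which the market was open.
    """
    closes = sorted(closes)
    pairs = []
    for (day, close), (_, next_close) in zip(closes, closes[1:]):
        if day in daily_sentiment:
            pairs.append((day, daily_sentiment[day], next_close - close))
    return pairs


# Toy values, for illustration only.
closes = [
    (date(2016, 1, 4), 10.0),
    (date(2016, 1, 5), 12.0),
    (date(2016, 1, 6), 11.0),
]
daily_sentiment = {
    date(2016, 1, 4): 0.6,
    date(2016, 1, 5): 0.4,
    date(2016, 1, 6): 0.5,
}
pairs = next_day_targets(daily_sentiment, closes)
print(pairs)
```

The last trading day has no following close, so it produces no training pair.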
Each tweet is preprocessed by applying a single function to its text that performs the steps listed below. This preprocessing differs from the prior study in that it includes lemmatization, stop-word removal, and short-word removal. These steps were introduced because they make the data better suited to sentiment analysis.
1) Lower Case: All texts are changed to lowercase.
2) URL Replacement: Links beginning with “http,” “https,” or “www” are replaced with “URL.”
3) Emoji Replacement: Emojis are replaced using a predefined lexicon that maps each emoji to its meaning. (For example, “:)” to “EMOJI smile”)
4) Usernames Replacement: Replace @Usernames with the word “USER.” (For example, “@Kaggle” to “USER”)
5) Removing Non-Alphanumerics: All characters other than letters and digits are replaced with spaces.
(Figure: A snapshot of the first four samples in the dataset.)
6) Removing Consecutive Letters: Three or more consecutive identical letters are replaced with two letters. (For example, “Hey
7) Short Words Are Removed: Words having a length of less than two are removed.
8) Remove Stop words: Stop words are English words that add little significance to a statement. They may be safely disregarded without affecting the sentence’s meaning. (For example, “the,” “he,” or “have”)
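The eight steps above can be sketched as one function built on Python’s `re` module. This is an illustrative sketch, not the study’s implementation: the emoji lexicon and stop-word list are tiny hypothetical stand-ins, the regular expressions are assumptions, and lemmatization is omitted because the list does not describe it.

```python
import re

# Hypothetical, deliberately tiny lexicons used only for illustration.
EMOJI_MEANINGS = {":)": "smile", ":(": "sad", ";)": "wink"}
STOP_WORDS = {"the", "he", "have", "a", "an", "and", "is", "to", "of"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
USER_PATTERN = re.compile(r"@\w+")
NON_ALNUM_PATTERN = re.compile(r"[^a-zA-Z0-9]")
REPEAT_PATTERN = re.compile(r"([a-zA-Z])\1{2,}")


def preprocess(tweet):
    """Apply the eight preprocessing steps to one tweet and return clean text."""
    text = tweet.lower()                                  # 1) lower case
    text = URL_PATTERN.sub(" URL ", text)                 # 2) URLs -> "URL"
    for emoji, meaning in EMOJI_MEANINGS.items():         # 3) emojis -> "EMOJI <meaning>"
        text = text.replace(emoji, f" EMOJI {meaning} ")
    text = USER_PATTERN.sub(" USER ", text)               # 4) @usernames -> "USER"
    text = NON_ALNUM_PATTERN.sub(" ", text)               # 5) non-alphanumerics -> space
    text = REPEAT_PATTERN.sub(r"\1\1", text)              # 6) 3+ repeated letters -> 2
    words = [w for w in text.split() if len(w) >= 2]      # 7) drop words shorter than 2
    words = [w for w in words if w not in STOP_WORDS]     # 8) drop stop words
    return " ".join(words)


print(preprocess("@Kaggle I loooove the stock market :) https://example.com!!"))
```

The order matters in this sketch: URLs, emojis, and usernames are replaced before non-alphanumeric characters are stripped, since stripping first would destroy the “:)”, “@”, and “://” markers those steps rely on.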
One Last Thought
Following the same procedure as Chakraborty et al., 5% of the data from the Sentiment 140 dataset was held out to test the trained models. Likewise, five distinct models were trained.