Part 0: Initial
Why choose this subject?
First reason: Bitcoin is a popular subject, especially since the buzz of December 2017. Second reason: among several candidate topics (gold, real estate, stock market, bitcoin), gold and bitcoin seemed potentially the « simplest » to predict. Both have quantifiable predictive factors that are easy to collect. I chose to start with bitcoin.
What will be important in this project?
It’s a way to put into practice the data science knowledge I acquired in recent months. The subject is open-ended, so I had to ask the right questions myself.
By the end, I should better understand the stages of a research project, its difficulties, its delays, and how to communicate results, so that I can be more efficient on the next projects.
The most important point is that the data must be retrievable in real time (API, web scraping, package), in case I later want to build a real-time dashboard.
What tools are used?
I use Jupyter Notebook with the Python 3 programming language. The main libraries are pandas, numpy, and matplotlib. I also use bs4 (BeautifulSoup), re, PyTrends (an unofficial Google Trends API), and scikit-learn.
The models used will be Autoregression, ARIMA, Linear Regression, Ridge, and Lasso.
Is there any research on the subject?
Yes, there is a lot. Some studies use ARIMA, others LSTM; some are wrong, and some give little information about their results. I didn’t find a study that inspired my own work.
Economists from Yale found that Google searches for « BTC » could predict a rise in the price of bitcoin. A few months later (early September 2019), Google searches for ‘BTC’ increased by 2400% (because of some mysterious manipulation), making the buzz of December 2017 look like a tiny bump.
Part 1: Project goals
What are the objectives of the research?
I want to get some encouraging predictions (for a version 0.1) of the Bitcoin price in the long term (year +1) and the short term (day +1). I would like to find features that capture the trends and anticipate sharp fluctuations in the price of bitcoin.
How long did the research last?
From the idea of the project to writing this summary, I spent about 100 hours on the subject (over a period of 3 weeks). I had planned 70 hours, so I went over by about 30 hours (roughly 43% more than planned).
What is good to know at the beginning of this project?
I had some investment experience with cryptocurrency in early 2017, so I had some basic notions about bitcoin. It is essential to know how this currency works before trying to predict its price.
The price of bitcoin can be strongly affected by events that can’t be predicted: new regulations (laws), new technologies (Libra), or the behavior of « whales » (people who hold a lot of bitcoins). Bitcoin is a highly volatile currency, and its price can quickly experience large unanticipated fluctuations (unlike gold, for example, which is more stable from day to day).
What are the expectations of the results?
Given the nature of bitcoin and its sudden swings, predicting its price in the long run will be more of a « fun » exercise than a realistic one. Predicting tomorrow’s price seems possible; moreover, several studies say that 7 days is the maximum prediction horizon. So even if the RMSE is high, I hope to predict trends (positive or negative) and strong « predictable » variations.
Many have tried and are still trying to predict the price of bitcoin. I guess that if some people succeed, they either keep it secret or offer their results to companies and investors in exchange for a job or money.
Part 2: Get data
What data will be retrieved?
Initially, I targeted 4 axes of information: Bitcoin price, Google searches, volatility, and news (Twitter posts related to bitcoin).
I am keeping sentiment analysis of Twitter posts for a future version, notably because I found an indicator that already incorporates volatility and Twitter sentiment analysis.
How will the data be obtained?
For the price, I chose to scrape coinmarketcap.com, which is a widely used platform for the valuation of cryptocurrencies.
For Google searches (and given how difficult this data is to scrape), I was happy to find an unofficial API for Google Trends. This Python library is called PyTrends, and it is a great tool!
For volatility and news, I found a Fear & Greed Index offered by alternative.me, which provides the data since February 21, 2018.
How is the Fear & Greed Index calculated?
It is a clever mix of six components: a volatility calculation (25%), the volume bought (25%), a Twitter sentiment score (15%), poll results (15%), Bitcoin market dominance (10%), and a score built from several Google Trends queries (10%).
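As a rough illustration of how such a weighting could work, here is a minimal sketch. The weights are the ones listed above; everything else is an assumption for illustration: the component names, the idea that each component is scored from 0 to 100, and the plain weighted average. It is not alternative.me's actual computation.

```python
# Component weights as listed above, in percent. They sum to 100.
WEIGHTS = {
    "volatility": 25,
    "volume": 25,
    "twitter_sentiment": 15,
    "polls": 15,
    "dominance": 10,
    "google_trends": 10,
}


def fear_greed_score(components):
    """Weighted average of component scores.

    Assumption: every component is already scaled to 0-100.
    """
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return weighted_sum / total_weight


# Hypothetical component scores, for illustration only.
example = {
    "volatility": 40,
    "volume": 60,
    "twitter_sentiment": 50,
    "polls": 50,
    "dominance": 30,
    "google_trends": 70,
}
score = fear_greed_score(example)
```

With a scheme like this, volatility and volume together drive half of the index, which is one reason it can stand in for a separate volatility feature.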
Which of the search terms are most correlated to the price?
From my research, I found two very distinct types. On one side are what I call « curiosity queries », those of the general public who have read or heard about bitcoin and want to know more. On the other side are « buy queries », from people looking for specific terms, investment tips, and trading platforms.
The query most correlated with the bitcoin price that I found is ‘bitcoin trading’.
As mentioned above, there was also a « bug » (or manipulation) on the queries ‘btc’ and ‘binance’, which distorts the Google Trends results for these terms.
Was the data clean?
The data was relatively clean. It would be easy to find messier data.
Regarding the Fear & Greed Index, the API is supposed to provide the data as CSV, but that output didn’t work for me (it wasn’t compatible), so I needed a few extra operations to get the data. I then found (some) missing and duplicate values.
For Google Trends, I noticed that for periods longer than about 8 months, the data returned is weekly rather than daily. I had to split my requests into several shorter periods to work around that.
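A check for duplicate and missing days could look like the sketch below. It is only an illustration, not the notebook's code: the function name, the (date, value) record shape, and the sample values are all hypothetical.

```python
from datetime import date, timedelta


def clean_daily_records(records):
    """Deduplicate (date, value) records and list the calendar days with no value.

    Returns (cleaned, missing): cleaned is sorted by date and keeps the first
    value seen for each day; missing lists the gaps between first and last day.
    """
    by_day = {}
    for day, value in records:
        by_day.setdefault(day, value)  # ignore later duplicates of the same day

    days = sorted(by_day)
    cleaned = [(day, by_day[day]) for day in days]

    missing = []
    if days:
        current = days[0]
        while current <= days[-1]:
            if current not in by_day:
                missing.append(current)
            current += timedelta(days=1)
    return cleaned, missing


# Hypothetical sample: one duplicate day and a three-day gap.
raw = [
    (date(2018, 4, 13), 26),
    (date(2018, 4, 13), 26),
    (date(2018, 4, 17), 31),
    (date(2018, 4, 12), 20),
]
cleaned, missing = clean_daily_records(raw)
```

How to fill the missing days (carry the last value forward, interpolate, or drop them) is a separate choice that this sketch leaves open.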
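Splitting a long period into shorter request windows can be sketched as follows. The 180-day window is an arbitrary assumption, chosen to stay safely under the limit described above, and the helper names are hypothetical. The 'YYYY-MM-DD YYYY-MM-DD' string is, to my knowledge, the format PyTrends accepts for a custom timeframe, but check it against the library's README.

```python
from datetime import date, timedelta


def split_period(start, end, max_days=180):
    """Split [start, end] into consecutive windows of at most max_days days."""
    windows = []
    window_start = start
    while window_start <= end:
        window_end = min(window_start + timedelta(days=max_days - 1), end)
        windows.append((window_start, window_end))
        window_start = window_end + timedelta(days=1)
    return windows


def to_timeframe(window):
    """Format a (start, end) window as 'YYYY-MM-DD YYYY-MM-DD'."""
    return f"{window[0].isoformat()} {window[1].isoformat()}"


windows = split_period(date(2018, 1, 1), date(2018, 12, 31))
timeframes = [to_timeframe(w) for w in windows]
```

One caveat: Google Trends scales each request to 0-100 independently, so values from different windows are not directly comparable and need to be rescaled before being joined (for example by overlapping the windows).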
What will be my target variable?
The bitcoin price data gives several values per day: Open, High, Low, Close. Even though the daily fluctuations can be large, my target variable will be the average of High and Low, that is (High + Low) / 2.
During the project, I will first work on the variable Price, then on the variations of Price, (P1 - P0) / P0, where P0 is one day's price and P1 is the next day's.
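The two target definitions above can be written down in a few lines. This is a plain-Python sketch with made-up numbers and hypothetical function names; with pandas the same thing is usually done with column arithmetic and `pct_change()`.

```python
def mid_price(high, low):
    """Daily price used as target: the average of High and Low."""
    return (high + low) / 2


def relative_changes(prices):
    """(P1 - P0) / P0 for each pair of consecutive days."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]


# Made-up values for three consecutive days.
highs = [110.0, 126.0, 118.0]
lows = [90.0, 114.0, 98.0]

prices = [mid_price(h, l) for h, l in zip(highs, lows)]
changes = relative_changes(prices)
```

Note that the list of variations is one element shorter than the list of prices, since the first day has no previous day to compare with.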
Part 3: Prepare data
How are groups of data named?
« GST » for Google Search Trends. « FGI » for Fear & Greed Index. « BHP » for Bitcoin Historical Price. Some features come from previous days; for those I append (-1), (-2), (-3), and so on, according to the number of days back.
What are the first features I’m going to create?
For each of the 3 groups of data, I take the values of the last 7 days. I also calculate the moving average of the last 3 days. With these features, I can start looking for trends that help predict the next day's price.
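A sketch of these lag and moving-average features is below. It is not the notebook's code. The "(-k)" suffix follows the naming convention described earlier, while the "_MA3" suffix, the "target" key, and the choice to compute the average over the 3 days before the day being predicted are assumptions. In pandas the same features would typically come from `shift()` and `rolling(3).mean()`.

```python
def lag_features(values, n_lags=7, prefix="BHP"):
    """Build one feature row per day that has n_lags days of history.

    Each row holds the previous n_lags values, the mean of the previous
    3 values, and the value of the day itself as the target.
    Assumes n_lags >= 3 so the 3-day window is always available.
    """
    rows = []
    for t in range(n_lags, len(values)):
        row = {f"{prefix}(-{k})": values[t - k] for k in range(1, n_lags + 1)}
        # Only past days enter the average, so the target does not leak in.
        row[f"{prefix}_MA3"] = sum(values[t - 3:t]) / 3
        row["target"] = values[t]
        rows.append(row)
    return rows


# Made-up series of 10 daily values.
series = [float(v) for v in range(1, 11)]
rows = lag_features(series)
```

The first 7 days produce no row because they lack a full history, which is worth remembering when joining the three groups of data by date.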
How to go further in the creation of features?
During my exploratory data analysis, I quickly realized that the features created so far don't help the model, because it can't make sense of them. Are there « borders » (thresholds) in the features? Can trends be found among variations greater than 500$? Can we add investment opportunity features? Many such questions led me to create new features.
How does this translate into new features?
« IsBitcoinQuiet? »: whether bitcoin queries are low (< 30).
« HowBitcoinDay? »: how much bitcoin queries changed (qcut of today - yesterday).
« isGoodDay? » « isGoodPeriod? »: whether people trust bitcoin (FGI > 60).
« HowConfidenceDay? » « HowConfidencePeriod? »: how much the FGI changed.
« isSellOpportunity? » « isBuyOpportunity? »: whether the price has had a big variation that represents an opportunity to buy or sell.
« HowDailyPrice? » « HowStablePrice? »: how much the BHP changed.
*FGI (Fear&Greed Index), BHP (Bitcoin Historical Price)
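The threshold features above can be sketched as follows. The thresholds 30 and 60 and the names IsBitcoinQuiet and isGoodDay come from the list; the function name and the "GST_change" and "FGI_change" columns are hypothetical. The change columns are the raw day-over-day differences that would then be binned (for example with pandas `qcut`) to obtain features like « HowBitcoinDay? ».

```python
def threshold_features(gst, fgi):
    """Build one feature dict per day, starting from the second day.

    gst and fgi are daily lists aligned on the same dates.
    """
    rows = []
    for t in range(1, len(gst)):
        rows.append({
            "IsBitcoinQuiet": gst[t] < 30,       # few Google searches today
            "isGoodDay": fgi[t] > 60,            # index on the greed side
            "GST_change": gst[t] - gst[t - 1],   # today - yesterday
            "FGI_change": fgi[t] - fgi[t - 1],
        })
    return rows


# Made-up values for three consecutive days.
features = threshold_features(gst=[25, 28, 45], fgi=[55, 62, 70])
```

Boolean flags like these give a linear model an explicit « border » that it cannot learn on its own from the raw score.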
Part 4: Results
What were the results obtained for the long-term prediction?
The Autoregression model did not give me satisfactory predictions. On the other hand, I studied the patterns in the price curve, and I noticed the same pattern (over a period of 2 years) at different scales, with a peak in December 2013 (+1.000$) and another in December 2017 (+20.000$). If we believe this series, and many people want it to happen, then in 2021 we will have a Bitcoin price of +300.000$. To remain realistic, I will focus on exceeding the 100.000$ mark.
Which first features influence the most predictions?
The essential data for making predictions is the price itself, the Bitcoin Historical Price. Knowing the price of the previous days is essential (+0.78 correlation). Second, the Fear & Greed Index provides an interesting coefficient (+0.35). Finally, the Google Search Trends score is not significant for the prediction (+0.03). It can even degrade the model if the feature is not processed.
What were the results obtained for the short-term prediction?
I used linear models to make the predictions (Linear Regression, Ridge, Lasso). 70% of the predictions have an error of less than 100$, and 60% of the predictions have the same sign as the target variable. The predictions manage to capture some trends, but I don't consider them satisfactory in this state.
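The two figures quoted above (share of predictions within 100$, share of predictions with the right sign) can be computed as in the sketch below. The function names and sample numbers are made up for illustration, and a variation of exactly zero is treated as non-positive, which is an arbitrary choice.

```python
def share_within(y_true, y_pred, tolerance=100.0):
    """Fraction of predictions whose absolute error is below tolerance."""
    hits = sum(abs(t - p) < tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def sign_agreement(change_true, change_pred):
    """Fraction of predicted variations with the same sign as the actual ones."""
    same = sum((t > 0) == (p > 0) for t, p in zip(change_true, change_pred))
    return same / len(change_true)


# Made-up prices (in $) and variations for four days.
price_true = [8000.0, 8100.0, 7900.0, 8200.0]
price_pred = [8050.0, 8250.0, 7950.0, 8190.0]
within_100 = share_within(price_true, price_pred)

change_true = [0.01, -0.02, 0.03, -0.01]
change_pred = [0.02, 0.01, 0.01, -0.03]
same_sign = sign_agreement(change_true, change_pred)
```

It helps to compute the same two numbers for a naive baseline that predicts yesterday's price, since a model is only useful if it beats that baseline.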
What were the difficulties in price prediction?
The price plays too large a role in the predictions (correlation from +0.78 to +0.99), which makes each prediction land too close to the previous day's price. In addition, despite the new features, I had difficulty anticipating big variations (positive or negative).
The question is: can we really predict these big fluctuations? I hope so. And I will do my best for future releases to improve these predictions.
Part 5: Conclusions
What could be the improvements?
Integrate new features. For example, I tried to retrieve SEO data (number of visitors) for coinmarketcap.com, because people who know the site don’t search for « coin market cap » on Google; they access it directly (bookmark). However, the « SEO check » platforms only show the last 6 months on their free plans. I was also suspicious of some of the SEO results, so I didn’t want to pay for premium access.
I’m also thinking of features like sentiment analysis on Twitter, the trading volume of Bitcoin, the percentage of Bitcoin dominance, and several financial indicators (RSI, MACD, Bollinger Bands).
What did I miss during this project?
First of all, expertise: my lack of experience in financial calculation and in cryptocurrency meant I could not build all the technical indicators. Second, time: to stay consistent with my goals, my time was limited, so the feature engineering and the exploration of results did not go far enough. Still, this is the feedback on my first try (version 0.1).
For version 0.2, here are the next steps:
– Check if sentiment analysis can be an interesting feature
– Create financial indicators features
– Use feature selection, optimize the model parameters, and try more models
– Reduce the range of possible predictions with a classification approach. Ex: positive (1) or negative (0). Ex: no variation (0), small variation (1), big variation (2).
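The classification targets from the last step could be derived from the daily variations as in this sketch. The labels 0, 1, and 2 follow the examples above; the 1% and 5% cut-offs and the function names are hypothetical and would need to be tuned on the data.

```python
def direction_label(change):
    """1 for a positive variation, 0 otherwise."""
    return 1 if change > 0 else 0


def magnitude_label(change, small=0.01, big=0.05):
    """0 = no real variation, 1 = small variation, 2 = big variation.

    The small and big cut-offs are assumed values, not tuned ones.
    """
    size = abs(change)
    if size < small:
        return 0
    if size < big:
        return 1
    return 2


# Made-up daily variations.
changes = [0.002, -0.03, 0.08, -0.12]
directions = [direction_label(c) for c in changes]
magnitudes = [magnitude_label(c) for c in changes]
```

With labels like these, the models become classifiers and are scored with accuracy-style metrics instead of RMSE.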
• Notebook to get BTC data: Github Jupyter Notebook (or on NBviewer)
• Fear & Greed Index: https://alternative.me/crypto/fear-and-greed-index/
• Bitcoin Historical Price: https://coinmarketcap.com/
• Unofficial Google Trends API: https://github.com/GeneralMills/pytrends
• Some SEO checkers: SEMRush, SimilarWeb, Yooda Insight
– – –
And you, have you ever worked on predicting the price of Bitcoin?
– – –