By: Akul Gokaram
For as long as civilization has existed, people have pursued wealth. Opulent depictions of overflowing riches stir a drive in just about anyone. Desire aside, in the modern world wealth is practically synonymous with living, in the sense that the world revolves around money. Today, many people relentlessly strive toward the financial freedom that will allow them to do what they desire. With the development of technology, that goal has become even more reachable. Among the wealth of information online about income stream diversification and other wealth management techniques, one concept has stood out the most: investing.
Investing in real estate, the stock market, and other commodities has long been a common path toward financial stability and building wealth. As technology developed, the next innovation came along and revolutionized the game in many ways: blockchain technology. With the way blockchain transformed how data is held, it was bound for success. A surge of interest grew around this new technology and a market developed with it, transforming old industries and giving birth to new ones. One of the things that came about through blockchain technology is cryptocurrency: a digital currency that completely changed the game. The market for cryptocurrencies grew from its initial start years ago until a recent surge driven by key events. At one point, one of the most popular cryptocurrencies, Bitcoin, nearly reached a staggering peak value of $70,000 USD.
In the stock market, it is common to use computers and algorithms to predict market trends and invest according to those predictions, a practice commonly known as algorithmic trading. The same concept translates naturally to cryptocurrencies, since they trade much like stocks while representing something entirely different. Additionally, cryptocurrencies are not yet as tightly regulated as traditional stocks, which some people would consider a positive. So what better time than now to take advantage of this? With the power of data science and the internet, we can use cryptocurrency to pursue a more financially secure life.
In this tutorial, our goal is to analyze historical OHLCV (open, high, low, close, volume) data for the current top cryptocurrencies, examine correlations in the characteristics of that data, and create a model to predict the OHLCV of those top 10 cryptocurrencies. We can do this by following the steps of the data science lifecycle, which consists of:
Okay, now that we have a basic understanding of what our data is and what our objectives are, we can get started. So let's get to it!
These are all of the libraries that we will be using in this tutorial to carry out various steps of the data science lifecycle.
import requests as r
from bs4 import BeautifulSoup
import json
import pandas as pd
from Historic_Crypto import HistoricalData
from Historic_Crypto import Cryptocurrencies
from datetime import datetime
import datetime as dt
import numpy as np
import plotly.io as pio
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import LSTM, Dense
# Required settings in order for plotly to render properly. Will cover plotly later in the tutorial
pio.renderers.default = "notebook+plotly_mimetype+png+jpeg+svg+pdf"
Okay, now that we have imported all of our relevant libraries, we can go ahead and move onto the first part of the data science lifecycle: data collection.
To gather our data, we are going to do a couple of things. We need to know the top 10 cryptocurrencies by market cap, so we will web scrape to find them. Additionally, we need historical data for those cryptocurrencies. For this, we have a CSV file that contains the majority of the data we are looking for. However, we run into an issue here: the data only runs up to October 2022. To get the rest of the data we need a different source, so we will use a price-tracking service to fill in the remainder. We will handle that part of the collection later, as we need to follow a certain order of operations to make sure we don't gather more data than we need. More on this later.
Okay, to find the current top cryptocurrencies we can scrape CoinMarketCap, a price-tracking service for cryptoassets. We will grab the table of cryptocurrencies ordered by descending market cap from its homepage, which is the criterion we are using to determine the "top" cryptocurrencies. You can view a picture of the homepage and part of the target table below:
In order to do this, we can use the requests library and then use BeautifulSoup to filter through the results and find the current top cryptocurrencies. The requests library is an HTTP library for Python; if you would like to learn more, click here.
After we find the top cryptocurrencies, we can cross-check them against a Kaggle dataset that contains historical data on various cryptocurrencies from 2013 through October 2022. To view the dataset on Kaggle, click here.
Okay, now that we know what we are collecting, we can get started. First, we will get the list of the top cryptocurrencies, where "top" is determined by market cap.
# Defining the URL of the webpage that we would like to make a GET request.
cmp_url = "https://coinmarketcap.com/"
# Executing the GET request using requests library.
result = r.get(cmp_url)
# Printing to see if the request went through and what the status of the request is.
print(result)
<Response [200]>
Okay great! It looks like our request for the HTML was successful, as indicated by the output: a 200 response means that the GET request was OK. If you would like to learn more about response codes and what they mean, check out this [link](https://learning.mlytics.com/the-internet/http-response-status-codes/#:~:text=HTTP%20response%20status%20codes%20(or,responded%20to%20the%20client%27s%20request.).
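As a quick stdlib-only sketch of the status-code families (illustrative, not part of the scraping pipeline), Python's `http.HTTPStatus` enum maps codes to their meanings:

```python
from http import HTTPStatus

# 2xx codes indicate success, 4xx client errors, 5xx server errors.
status = HTTPStatus(200)
print(status.phrase)            # "OK" -- the reason phrase for the code
print(200 <= status < 300)      # True -- falls in the success family
print(HTTPStatus(404).phrase)   # "Not Found" -- a client error
```

A common defensive pattern with requests is `result.raise_for_status()`, which raises an exception for any 4xx/5xx response instead of silently continuing.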
Now that we have the response from our GET request, we can use BeautifulSoup to process the HTML and get the names of the top cryptocurrencies. BeautifulSoup is a library that makes it easy to scrape information from web pages, such as the HTML our request returns. We will be using the find function to locate the table. The find function returns the first tag that meets the criteria specified in its parameters, within the segment of HTML currently being searched. To learn more about BeautifulSoup, click here.
After locating the HTML segment, we want to read it in and turn it into something easier to use for analysis; we will use Pandas for this. Pandas is one of the most popular Python libraries for data analysis. It stores data in a tabular format called a dataframe, on which we can execute a multitude of operations to better analyze the data. If you would like to learn more about Pandas, click here.
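Before running this on the real page, here is a small self-contained sketch of the find-then-dataframe pattern on a toy HTML snippet (the class name and values here are invented for illustration):

```python
from bs4 import BeautifulSoup
import pandas as pd

# A toy HTML snippet standing in for the real page.
html = """
<body>
  <div class="table-wrapper">
    <table>
      <tr><th>Name</th><th>Price</th></tr>
      <tr><td>Bitcoin</td><td>26467.81</td></tr>
      <tr><td>Ethereum</td><td>1788.10</td></tr>
    </table>
  </div>
</body>
"""

root = BeautifulSoup(html, "html.parser")
# find() returns the first matching tag; a dict narrows the match by attributes.
table = root.find("div", {"class": "table-wrapper"}).find("table")

# Pull the header and body cells out of the located table and load them into a dataframe.
header = [th.get_text() for th in table.find_all("th")]
rows = [[td.get_text() for td in tr.find_all("td")] for tr in table.find_all("tr")[1:]]
df = pd.DataFrame(rows, columns=header)
print(df.shape)  # (2, 2)
```

On the real page we will let `pd.read_html` do this cell extraction for us, but the idea is the same: navigate to the table with find, then hand it to pandas.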
# Passing in the raw HTML content (and an explicit parser) to set up BeautifulSoup for data extraction.
root = BeautifulSoup(result.content, "html.parser")
# Getting to the body of the HTML document so that we can start searching for the table
root = root.find("body", {"class": "DAY"})
# Navigating through body to get to the page division that contains the table
root = root.find("div").find("div").find("div").find("div", {"class" : "cmc-body-wrapper"}).find("div") \
.find("div").find("div", {"class" : "sc-beb003d5-2 bkNrIb"}).find("table")
# Utilizing the built in function for reading HTML to process the data into a list of information
top_crypto_df_list = pd.read_html(root.prettify())
# Putting together the elements of the list of information to form a dataframe
top_crypto_df = pd.concat(top_crypto_df_list)
# Viewing the first 10 entries in the dataframe
top_crypto_df.head(10)
Unnamed: 0 | # | Name | Price | 1h % | 24h % | 7d % | Market Cap | Volume(24h) | Circulating Supply | Last 7 Days | Unnamed: 11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | 1.0 | Bitcoin 1 BTC | $26,467.81 | 0.64% | 1.45% | 10.31% | $512.52B $512,517,484,139 | $17,908,091,334 676,863 BTC | 19,371,375 BTC | NaN | NaN |
1 | NaN | 2.0 | Ethereum 2 ETH | $1,788.10 | 0.99% | 0.10% | 10.02% | $220.01B $220,005,830,732 | $8,338,380,767 4,654,292 ETH | 122,802,176 ETH | NaN | NaN |
2 | NaN | 3.0 | Tether 3 USDT | $1.00 | 0.04% | 0.05% | 0.02% | $82.79B $82,788,137,577 | $26,881,856,452 26,861,845,571 USDT | 82,726,510,004 USDT | NaN | NaN |
3 | NaN | 4.0 | BNB 4 BNB | $306.26 | 0.21% | 0.06% | 6.18% | $47.75B $47,752,156,743 | $487,392,775 1,590,821 BNB | 155,860,209 BNB | NaN | NaN |
4 | NaN | 5.0 | USD Coin 5 USDC | $1.00 | 0.04% | 0.06% | 0.02% | $29.97B $29,967,992,391 | $3,962,403,789 3,961,076,343 USDC | 29,957,952,807 USDC | NaN | NaN |
5 | NaN | 6.0 | XRP 6 XRP | $0.4249 | 0.48% | 2.16% | 9.20% | $22.03B $22,026,197,113 | $1,150,113,730 2,706,749,094 XRP | 51,837,820,505 XRP | NaN | NaN |
6 | NaN | 7.0 | Cardano 7 ADA | $0.3608 | 0.23% | 2.00% | 8.48% | $12.57B $12,571,058,654 | $243,247,046 674,180,030 ADA | 34,841,766,154 ADA | NaN | NaN |
7 | NaN | 8.0 | Dogecoin 8 DOGE | $0.07128 | 0.13% | 1.35% | 10.29% | $9.93B $9,930,789,434 | $373,075,835 5,233,844,884 DOGE | 139,318,086,384 DOGE | NaN | NaN |
8 | NaN | 9.0 | Solana 9 SOL | $20.49 | 0.71% | 2.38% | 10.57% | $8.10B $8,102,862,403 | $326,094,987 15,912,803 SOL | 395,403,978 SOL | NaN | NaN |
9 | NaN | 10.0 | Polygon 10 MATIC | $0.8479 | 1.43% | 1.22% | 15.43% | $7.84B $7,842,302,945 | $439,832,282 518,752,605 MATIC | 9,249,469,069 MATIC | NaN | NaN |
Okay, nice! What we see above are the first 10 entries of the CoinMarketCap website table. The table looks somewhat unpleasant, and the name is difficult to extract, but this will be resolved when we clean the table later.
Now that we have our top cryptocurrencies by market cap, the next thing we must do is read in the CSV containing the historical dataset. A CSV is a comma-separated values text file and is a common way of storing data. If you would like to learn more about CSVs, please click here.
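As a tiny illustration of the format (the values here are invented), a CSV is just text with one record per line and commas between fields, which read_csv parses straight into a dataframe. An in-memory file-like object stands in for a file on disk:

```python
import pandas as pd
from io import StringIO

# An in-memory stand-in for a CSV file on disk; the values are made up for illustration.
csv_text = """crypto_name,date,close
Bitcoin,2013-05-05,115.910004
Litecoin,2013-05-05,3.590890
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(StringIO(csv_text))
print(df.shape)  # (2, 3)
```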
I have downloaded the CSV file to the folder that contains this notebook so that I can reference it by relative path, and I would advise you to do the same to follow along with this tutorial. To load our CSV, we can use pandas' built-in read_csv function.
# Reading in CSV of historical data
historical_data_df = pd.read_csv('dataset.csv')
# Viewing the dataframe constructed from the CSV.
historical_data_df
Unnamed: 0 | open | high | low | close | volume | marketCap | timestamp | crypto_name | date | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 | 1.288693e+09 | 2013-05-05T23:59:59.999Z | Bitcoin | 2013-05-05 |
1 | 1 | 3.493130 | 3.692460 | 3.346060 | 3.590890 | 0.000000e+00 | 6.229819e+07 | 2013-05-05T23:59:59.999Z | Litecoin | 2013-05-05 |
2 | 2 | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 | 1.249023e+09 | 2013-05-06T23:59:59.999Z | Bitcoin | 2013-05-06 |
3 | 3 | 3.594220 | 3.781020 | 3.116020 | 3.371250 | 0.000000e+00 | 5.859436e+07 | 2013-05-06T23:59:59.999Z | Litecoin | 2013-05-06 |
4 | 4 | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 | 1.240594e+09 | 2013-05-07T23:59:59.999Z | Bitcoin | 2013-05-07 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
72941 | 72941 | 0.022604 | 0.022988 | 0.022197 | 0.022796 | 4.040134e+07 | 1.652957e+09 | 2022-10-23T23:59:59.999Z | VeChain | 2022-10-23 |
72942 | 72942 | 1.468244 | 1.530464 | 1.435415 | 1.517878 | 2.844351e+07 | 1.572825e+09 | 2022-10-23T23:59:59.999Z | Flow | 2022-10-23 |
72943 | 72943 | 4.950431 | 5.148565 | 4.945280 | 5.117206 | 1.069497e+08 | 1.559551e+09 | 2022-10-23T23:59:59.999Z | Filecoin | 2022-10-23 |
72944 | 72944 | 0.000233 | 0.000243 | 0.000226 | 0.000239 | 2.143268e+08 | 1.576291e+09 | 2022-10-23T23:59:59.999Z | Terra Classic | 2022-10-23 |
72945 | 72945 | 0.465490 | 0.471006 | 0.453438 | 0.469033 | 9.509743e+08 | 2.339868e+10 | 2022-10-23T23:59:59.999Z | XRP | 2022-10-23 |
72946 rows × 10 columns
Nice! We can see that we have successfully loaded our data from the CSV into a dataframe. Now that we have the majority of the data, we need to supplement the rest so that we have more up-to-date information. To do this, we will use a library that fetches it for us.
In this part of the data science lifecycle, what we typically do is scrub the data that we have gathered. This can consist of many different actions and is dependent on what you want to do with your data. Some common actions are reformulating column values, removing missing or null values and combining dataframes into structures that are easier to use.
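The common actions mentioned above can be sketched on a toy dataframe (the values here are invented for illustration):

```python
import pandas as pd
import numpy as np

# Two small dataframes with a missing value, standing in for freshly gathered data.
a = pd.DataFrame({"Name": ["Bitcoin", "Ethereum"], "Price": [26467.81, np.nan]})
b = pd.DataFrame({"Name": ["Tether"], "Price": [1.00]})

# Combining dataframes into one structure that is easier to use...
combined = pd.concat([a, b], ignore_index=True)
# ...removing rows with missing or null values...
cleaned = combined.dropna()
# ...and reformulating a column's values.
cleaned = cleaned.assign(Price=cleaned["Price"].round(0))
print(len(cleaned))  # 2 -- the row with the missing price is gone
```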
In our case, what we need to do is clean up our top 10 cryptocurrencies list and retrieve a proper list of the top 10 cryptocurrencies names so that we can then gather only that data from our historical cryptocurrency dataframe. Additionally, we need to clean up some of the columns in our dataframe so that the values are more understandable.
Here are the specifics for both of the data sources:
Okay, now that we know what we are doing, let's get to it!
Okay, let's get started on cleaning the top 10 dataframe and retrieving our top 10 list.
The first thing we are going to do is clean this dataset a little by getting rid of columns that hold no useful values. We can see from the output above that some columns turned out a bit weird from the scraping, so we will get rid of those. In particular, we will drop the columns "#", "Last 7 Days", "Unnamed: 0", and "Unnamed: 11", using the drop function built into pandas.
top_crypto_df = top_crypto_df.drop(columns=['Last 7 Days', 'Unnamed: 0', 'Unnamed: 11', '#'])
top_crypto_df
Name | Price | 1h % | 24h % | 7d % | Market Cap | Volume(24h) | Circulating Supply | |
---|---|---|---|---|---|---|---|---|
0 | Bitcoin 1 BTC | $26,467.81 | 0.64% | 1.45% | 10.31% | $512.52B $512,517,484,139 | $17,908,091,334 676,863 BTC | 19,371,375 BTC |
1 | Ethereum 2 ETH | $1,788.10 | 0.99% | 0.10% | 10.02% | $220.01B $220,005,830,732 | $8,338,380,767 4,654,292 ETH | 122,802,176 ETH |
2 | Tether 3 USDT | $1.00 | 0.04% | 0.05% | 0.02% | $82.79B $82,788,137,577 | $26,881,856,452 26,861,845,571 USDT | 82,726,510,004 USDT |
3 | BNB 4 BNB | $306.26 | 0.21% | 0.06% | 6.18% | $47.75B $47,752,156,743 | $487,392,775 1,590,821 BNB | 155,860,209 BNB |
4 | USD Coin 5 USDC | $1.00 | 0.04% | 0.06% | 0.02% | $29.97B $29,967,992,391 | $3,962,403,789 3,961,076,343 USDC | 29,957,952,807 USDC |
... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | PancakeSwap CAKE | $ 1.79 | NaN | NaN | NaN | NaN | NaN | NaN |
96 | Convex Finance CVX | $ 4.45 | NaN | NaN | NaN | NaN | NaN | NaN |
97 | FLOKI FLOKI | $ 0.00 | NaN | NaN | NaN | NaN | NaN | NaN |
98 | 1inch Network 1INCH | $ 0.41 | NaN | NaN | NaN | NaN | NaN | NaN |
99 | Enjin Coin ENJ | $ 0.33 | NaN | NaN | NaN | NaN | NaN | NaN |
100 rows × 8 columns
Okay, nice! This is already looking much better. Now that we have gotten rid of the columns that have no value, the next step is to reformulate the name column to contain only the name, rather than extraneous information that would make it hard to process the historical dataframe.
What we are going to do is process the string in the name field and split it by spaces. Then we can keep the first part of the split, which is the name of the coin, and the last part, which is the ticker. We discard the number, since the natural ordering of the dataframe already gives us what we want. We can carry this out as a column operation so that we don't have to loop through the table manually: we cast the column data to type string and then use the split() function to break each string into a list of substrings, split on the separator we specify. After that, we can use indexing and save the data accordingly.
# Carrying out as a column operation.
# Taking the value in name, splitting it, and putting the last element of the resultant list as the ticker
# and making the first value of the resultant list the name
top_crypto_df['Ticker'] = top_crypto_df['Name'].str.split(" ").str[-1] # Gets the last element in the list
top_crypto_df['Name'] = top_crypto_df['Name'].str.split(" ").str[0]
top_crypto_df
Name | Price | 1h % | 24h % | 7d % | Market Cap | Volume(24h) | Circulating Supply | Ticker | |
---|---|---|---|---|---|---|---|---|---|
0 | Bitcoin | $26,467.81 | 0.64% | 1.45% | 10.31% | $512.52B $512,517,484,139 | $17,908,091,334 676,863 BTC | 19,371,375 BTC | Bitcoin |
1 | Ethereum | $1,788.10 | 0.99% | 0.10% | 10.02% | $220.01B $220,005,830,732 | $8,338,380,767 4,654,292 ETH | 122,802,176 ETH | Ethereum |
2 | Tether | $1.00 | 0.04% | 0.05% | 0.02% | $82.79B $82,788,137,577 | $26,881,856,452 26,861,845,571 USDT | 82,726,510,004 USDT | Tether |
3 | BNB | $306.26 | 0.21% | 0.06% | 6.18% | $47.75B $47,752,156,743 | $487,392,775 1,590,821 BNB | 155,860,209 BNB | BNB |
4 | USD | $1.00 | 0.04% | 0.06% | 0.02% | $29.97B $29,967,992,391 | $3,962,403,789 3,961,076,343 USDC | 29,957,952,807 USDC | USD |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | PancakeSwap | $ 1.79 | NaN | NaN | NaN | NaN | NaN | NaN | PancakeSwap |
96 | Convex | $ 4.45 | NaN | NaN | NaN | NaN | NaN | NaN | Convex |
97 | FLOKI | $ 0.00 | NaN | NaN | NaN | NaN | NaN | NaN | FLOKI |
98 | 1inch | $ 0.41 | NaN | NaN | NaN | NaN | NaN | NaN | 1inch |
99 | Enjin | $ 0.33 | NaN | NaN | NaN | NaN | NaN | NaN | Enjin |
100 rows × 9 columns
Okay, now when you look at the name column, you see only the name of the coin, as expected. Additionally, the ticker has been added at the end, just in case. Now we can extract a list of the top 10 cryptocurrencies by market cap, as we will need it for the next dataset we are cleaning.
# Here, we get the top 10 rows and then take the Name column, giving us a Pandas series of the top 10 names.
top_10_list = top_crypto_df.head(10)['Name']
# Now we want the series as a list of strings, so we cast the series to string type and then convert it
# to a list using the built-in function.
top_10_list = top_10_list.astype(str).tolist()
print(top_10_list)
['Bitcoin', 'Ethereum', 'Tether', 'BNB', 'USD', 'XRP', 'Cardano', 'Dogecoin', 'Solana', 'Polygon']
As we can see, we can now retrieve a list of the cryptocurrencies with the 10 largest market caps (note that "USD" above is an artifact of splitting "USD Coin" on spaces). However, we must cross-reference this with our data sources to see whether they cover these 10, and then select the 10 biggest that are available in all of our sources.
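One way to sketch that cross-check (the lists here are illustrative; the column name follows the historical dataframe loaded earlier) is to keep only the scraped names that the historical source actually covers, preserving rank order:

```python
import pandas as pd

# Illustrative stand-ins: names scraped from the site and names present in the historical CSV.
scraped_names = ["Bitcoin", "Ethereum", "Tether", "USD", "XRP"]
historical = pd.DataFrame({"crypto_name": ["Bitcoin", "Ethereum", "Tether", "XRP", "Litecoin"]})

# Keep only the scraped names the historical source covers, preserving market-cap order.
covered = set(historical["crypto_name"])
available_names = [n for n in scraped_names if n in covered]
print(available_names)  # ['Bitcoin', 'Ethereum', 'Tether', 'XRP']
```

In the actual pipeline below we additionally check availability against the Coinbase Pro API, but the filtering idea is the same.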
Okay, now that our top 10 dataframe is cleaned up and we have a list of the data we would like to keep, we can go about cleaning the historical dataframe so that it contains daily data pertaining only to the cryptocurrencies we are interested in, and make sure the data is as recent as possible.
Before we clean, we need to make sure our data is complete and up to date. To do that, we can use a Python library that calls the Coinbase Pro API to retrieve the data for us. API stands for Application Programming Interface; if you would like to learn more, click here. We will only retrieve the missing data for our top 10, as it is pointless to gather data for the rest when we don't need it.
Retrieving our top 10 is not as simple as pulling the 10 entries with the largest market capitalizations. We need to double-check that, for each of the topmost entries, the Coinbase Pro API our library interacts with supports that cryptocurrency. When that criterion is met, the next thing to check is that the information on that coin is currently retrievable. If either condition is not met, we exclude that currency from our top 10 and move further down the list until we find a combination of cryptocurrencies that meets our requirements.
# We need to check if the library can get data on our top 10; if not, we can just shift further down
# the list and get the top 10 that are retrievable.
# The library returns all of the possible tickers if none match the search criteria. In order to figure
# out if this is what happened, we call it once with no search string and save the number of rows returned.
not_possible = (Cryptocurrencies(extended_output=False).find_crypto_pairs()).shape[0]
available = []
count = 10
while len(available) != 10:
    # Since the library function requires the ticker and currency, we can get the tickers for the top
    # entries in top_crypto_df and then append USD to the value, since the rest of our data is in USD as well.
    top_10_ticker_list = top_crypto_df.head(count)['Ticker']
    top_10_name_list = top_crypto_df.head(count)['Name']
    top_10_ticker_list = top_10_ticker_list.astype(str).tolist()
    top_10_name_list = top_10_name_list.astype(str).tolist()
    combined_top_10_list = list(zip(top_10_name_list, top_10_ticker_list))
    # Resetting for this iteration
    available = []
    # Looping through the current options and checking which are available and which are not
    for name, ticker in combined_top_10_list:
        ticker_info = ticker + '-USD'
        options = Cryptocurrencies(coin_search=ticker_info, extended_output=False).find_crypto_pairs()
        # Okay, so we know that the symbol is available; now we need to check that the specific one
        # that we want is online
        if options.shape[0] != not_possible:
            if 'delisted' not in options.loc[options['id'] == ticker_info]['status'].astype('str').tolist():
                available.append((name, ticker))
    count += 1

# Looping through and printing out the results
idx = 1
print("Our list is as follows\n")
for name, ticker in available:
    print(f"{idx}: {name} - {ticker}")
    idx += 1
print("\n")
top_10_list = available
Connected to the CoinBase Pro API. Extended Output is False, returning reduced dataset.
Connected to the CoinBase Pro API. Found 4 instances containing the term BTC-USD. Extended Output is False, returning reduced dataset.
Connected to the CoinBase Pro API. Found 5 instances containing the term ETH-USD. Extended Output is False, returning reduced dataset.
Connected to the CoinBase Pro API. Found 2 instances containing the term USDT-USD. Extended Output is False, returning reduced dataset.
Connected to the CoinBase Pro API. Unable to find specific search term, returning all available data. Extended Output is False, returning reduced dataset.
...
Connected to the CoinBase Pro API. Found 2 instances containing the term SHIB-USD. Extended Output is False, returning reduced dataset.

Our list is as follows

1: Bitcoin - BTC
2: Ethereum - ETH
3: Tether - USDT
4: Cardano - ADA
5: Dogecoin - DOGE
6: Solana - SOL
7: Polygon - MATIC
8: Polkadot - DOT
9: Litecoin - LTC
10: Shiba - SHIB
If we look closely, we can see that at the bottom of the execution output there is a list of 10 cryptocurrencies with their tickers. Now that we have our top 10 that are available to get data for, we can go ahead and retrieve the data for them and save it to our dataframe. We will do this by making a new empty dataframe to hold the consolidated results, then iterating through our top 10 list and using our library to gather the relevant information.
# We will loop through the top 10 list we made and gather the data for each coin in a new dataframe.
# Making an empty dataframe so we can store the data.
top_10_recent = pd.DataFrame()
# Since our start date from when we want to start gathering data doesn't change, we can just hardcode it.
start_date = '2022-10-24-00-00'
for name, ticker in top_10_list:
    ticker_info = ticker + '-USD'
    # This is the interval between datapoints; 86400 seconds is one day, which is what we want.
    granularity = 86400
    request_df = HistoricalData(ticker_info, granularity, start_date).retrieve_data()
    # Adding name and ticker to the resulting df for consistency with the other data source
    # and so that we know what coin the data corresponds to.
    request_df['Name'] = name
    request_df['Ticker'] = ticker
    # Adding the result data to the dataframe holding the results of all the top 10.
    top_10_recent = pd.concat([top_10_recent, request_df])
top_10_recent
Checking input parameters are in the correct format. Formatting Dates. Checking if user supplied is available on the CoinBase Pro API. Connected to the CoinBase Pro API. Ticker 'BTC-USD' found at the CoinBase Pro API, continuing to extraction. Retrieved Data from Coinbase Pro API. Returning data. [The same log sequence repeats for ETH-USD, USDT-USD, ADA-USD, DOGE-USD, SOL-USD, MATIC-USD, DOT-USD, LTC-USD, and SHIB-USD.]
low | high | open | close | volume | Name | Ticker | |
---|---|---|---|---|---|---|---|
time | |||||||
2022-10-24 | 19159.350000 | 19603.570000 | 19572.200000 | 19330.410000 | 2.943258e+04 | Bitcoin | BTC |
2022-10-25 | 19240.760000 | 20420.880000 | 19332.110000 | 20086.280000 | 4.245234e+04 | Bitcoin | BTC |
2022-10-26 | 20055.850000 | 21022.810000 | 20086.270000 | 20775.400000 | 4.361899e+04 | Bitcoin | BTC |
2022-10-27 | 20196.010000 | 20878.170000 | 20773.590000 | 20296.970000 | 3.412358e+04 | Bitcoin | BTC |
2022-10-28 | 20000.000000 | 20755.090000 | 20295.770000 | 20597.910000 | 2.958175e+04 | Bitcoin | BTC |
... | ... | ... | ... | ... | ... | ... | ... |
2023-05-08 | 0.000008 | 0.000009 | 0.000009 | 0.000009 | 1.680438e+12 | Shiba | SHIB |
2023-05-09 | 0.000009 | 0.000009 | 0.000009 | 0.000009 | 9.494224e+11 | Shiba | SHIB |
2023-05-10 | 0.000009 | 0.000009 | 0.000009 | 0.000009 | 9.374366e+11 | Shiba | SHIB |
2023-05-11 | 0.000009 | 0.000009 | 0.000009 | 0.000009 | 9.152456e+11 | Shiba | SHIB |
2023-05-12 | 0.000008 | 0.000009 | 0.000009 | 0.000009 | 7.590283e+11 | Shiba | SHIB |
2010 rows × 7 columns
We have now retrieved the more recent OHLCV data for our top 10 and added the name and ticker as columns as well. What we need to do next is refine both sets and put them together to form our final dataset for analysis.
Now that we have our more recent data, which is from October 24th 2022 to the most recent data, we can now prepare this dataset to join into our main dataset.
The first step in cleaning the recent data is to fix the date column. As we can see in the output above, the date column is labeled time and serves as the index of the dataframe. We will need to change this to make sure that this data joins smoothly with the historical dataframe. After we fix the date column, we will change the types of the name and ticker columns so that they are strings.
top_10_recent.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 2010 entries, 2022-10-24 to 2023-05-12 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 low 2010 non-null float64 1 high 2010 non-null float64 2 open 2010 non-null float64 3 close 2010 non-null float64 4 volume 2010 non-null float64 5 Name 2010 non-null object 6 Ticker 2010 non-null object dtypes: float64(5), object(2) memory usage: 125.6+ KB
As we can see here, the time column does not even show up, as it has been set as the index. To fix this we will call the pandas reset_index
function, which will have the dataframe use a default index and take the old index and append it as a column to the dataframe.
top_10_recent = top_10_recent.reset_index()
top_10_recent
time | low | high | open | close | volume | Name | Ticker | |
---|---|---|---|---|---|---|---|---|
0 | 2022-10-24 | 19159.350000 | 19603.570000 | 19572.200000 | 19330.410000 | 2.943258e+04 | Bitcoin | BTC |
1 | 2022-10-25 | 19240.760000 | 20420.880000 | 19332.110000 | 20086.280000 | 4.245234e+04 | Bitcoin | BTC |
2 | 2022-10-26 | 20055.850000 | 21022.810000 | 20086.270000 | 20775.400000 | 4.361899e+04 | Bitcoin | BTC |
3 | 2022-10-27 | 20196.010000 | 20878.170000 | 20773.590000 | 20296.970000 | 3.412358e+04 | Bitcoin | BTC |
4 | 2022-10-28 | 20000.000000 | 20755.090000 | 20295.770000 | 20597.910000 | 2.958175e+04 | Bitcoin | BTC |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2005 | 2023-05-08 | 0.000008 | 0.000009 | 0.000009 | 0.000009 | 1.680438e+12 | Shiba | SHIB |
2006 | 2023-05-09 | 0.000009 | 0.000009 | 0.000009 | 0.000009 | 9.494224e+11 | Shiba | SHIB |
2007 | 2023-05-10 | 0.000009 | 0.000009 | 0.000009 | 0.000009 | 9.374366e+11 | Shiba | SHIB |
2008 | 2023-05-11 | 0.000009 | 0.000009 | 0.000009 | 0.000009 | 9.152456e+11 | Shiba | SHIB |
2009 | 2023-05-12 | 0.000008 | 0.000009 | 0.000009 | 0.000009 | 7.590283e+11 | Shiba | SHIB |
2010 rows × 8 columns
As we can see, the time column has been shifted over to become the leftmost column of the dataframe, with a default index taking over as the index.
top_10_recent.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2010 entries, 0 to 2009 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 time 2010 non-null datetime64[ns] 1 low 2010 non-null float64 2 high 2010 non-null float64 3 open 2010 non-null float64 4 close 2010 non-null float64 5 volume 2010 non-null float64 6 Name 2010 non-null object 7 Ticker 2010 non-null object dtypes: datetime64[ns](1), float64(5), object(2) memory usage: 125.8+ KB
Now that we have shifted the time back into the dataframe, we can see that it appears as a regular column. Even better, it is already a datetime object, which saves us the trouble of converting it ourselves. Now all we need to do is change the types of the name and ticker columns.
top_10_recent[['Name', 'Ticker']] = top_10_recent[['Name', 'Ticker']].astype('string')
top_10_recent.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2010 entries, 0 to 2009 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 time 2010 non-null datetime64[ns] 1 low 2010 non-null float64 2 high 2010 non-null float64 3 open 2010 non-null float64 4 close 2010 non-null float64 5 volume 2010 non-null float64 6 Name 2010 non-null string 7 Ticker 2010 non-null string dtypes: datetime64[ns](1), float64(5), string(2) memory usage: 125.8 KB
Nice! Now if we look at the types of the dataframe, we can see that the name and ticker columns are of type string as expected and the time column is a datetime object. All that is left now is to fix any inconsistencies in the data and rearrange the dataframe.
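For anyone unfamiliar with pandas' dedicated string dtype, the difference from the default object dtype is easy to see on a toy Series (the values here are just illustrative):

```python
import pandas as pd

# By default pandas stores text in a generic object column; astype('string')
# switches it to the dedicated string dtype, as done above for Name and Ticker.
s = pd.Series(['Bitcoin', 'Litecoin'])
print(s.dtype)                   # object
print(s.astype('string').dtype)  # string
```

The string dtype gives stricter type guarantees (only strings and missing values) than a catch-all object column.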
If you look at the historical dataframe that we read in and compare it with the data gathered for our top 10, there is a slight issue. In the historical dataframe the coin SHIB has its name registered as "Shiba Inu", whereas in our recent data it is just "Shiba". To be able to join our dataframes successfully, and to avoid issues later when accessing the data for this cryptocurrency, we will rename the field here to "Shiba Inu" as well. We will use the replace
function in pandas to do this.
Additionally, the column names do not look consistent since some are capitalized while others are not. To fix this we will rename the columns with capitalization and rename the time column to Date, as that is more accurate. We will then rearrange the columns into a more logical order using the pandas iloc
function.
# Changing instances of Shiba in the dataframe to Shiba Inu
top_10_recent = top_10_recent.replace('Shiba', 'Shiba Inu')
# Reassigning columns attribute of dataframe so that the column names are all capitalized.
top_10_recent.columns = ['Date', 'Low', 'High', 'Open', 'Close', 'Volume', 'Name', 'Ticker']
# Using the iloc function to get the columns and put them in a more logical order. Numbers correspond to the index
# of the column in terms of the list of column names passed in above.
top_10_recent = top_10_recent.iloc[:, [0,6,7,3,2,1,4,5]]
top_10_recent
Date | Name | Ticker | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|---|---|
0 | 2022-10-24 | Bitcoin | BTC | 19572.200000 | 19603.570000 | 19159.350000 | 19330.410000 | 2.943258e+04 |
1 | 2022-10-25 | Bitcoin | BTC | 19332.110000 | 20420.880000 | 19240.760000 | 20086.280000 | 4.245234e+04 |
2 | 2022-10-26 | Bitcoin | BTC | 20086.270000 | 21022.810000 | 20055.850000 | 20775.400000 | 4.361899e+04 |
3 | 2022-10-27 | Bitcoin | BTC | 20773.590000 | 20878.170000 | 20196.010000 | 20296.970000 | 3.412358e+04 |
4 | 2022-10-28 | Bitcoin | BTC | 20295.770000 | 20755.090000 | 20000.000000 | 20597.910000 | 2.958175e+04 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2005 | 2023-05-08 | Shiba Inu | SHIB | 0.000009 | 0.000009 | 0.000008 | 0.000009 | 1.680438e+12 |
2006 | 2023-05-09 | Shiba Inu | SHIB | 0.000009 | 0.000009 | 0.000009 | 0.000009 | 9.494224e+11 |
2007 | 2023-05-10 | Shiba Inu | SHIB | 0.000009 | 0.000009 | 0.000009 | 0.000009 | 9.374366e+11 |
2008 | 2023-05-11 | Shiba Inu | SHIB | 0.000009 | 0.000009 | 0.000009 | 0.000009 | 9.152456e+11 |
2009 | 2023-05-12 | Shiba Inu | SHIB | 0.000009 | 0.000009 | 0.000008 | 0.000009 | 7.590283e+11 |
2010 rows × 8 columns
Much better. Now our recent data is organized and ready for combining into the historical frame, but before we can do that we must clean the historical dataframe.
historical_data_df
Unnamed: 0 | open | high | low | close | volume | marketCap | timestamp | crypto_name | date | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 | 1.288693e+09 | 2013-05-05T23:59:59.999Z | Bitcoin | 2013-05-05 |
1 | 1 | 3.493130 | 3.692460 | 3.346060 | 3.590890 | 0.000000e+00 | 6.229819e+07 | 2013-05-05T23:59:59.999Z | Litecoin | 2013-05-05 |
2 | 2 | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 | 1.249023e+09 | 2013-05-06T23:59:59.999Z | Bitcoin | 2013-05-06 |
3 | 3 | 3.594220 | 3.781020 | 3.116020 | 3.371250 | 0.000000e+00 | 5.859436e+07 | 2013-05-06T23:59:59.999Z | Litecoin | 2013-05-06 |
4 | 4 | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 | 1.240594e+09 | 2013-05-07T23:59:59.999Z | Bitcoin | 2013-05-07 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
72941 | 72941 | 0.022604 | 0.022988 | 0.022197 | 0.022796 | 4.040134e+07 | 1.652957e+09 | 2022-10-23T23:59:59.999Z | VeChain | 2022-10-23 |
72942 | 72942 | 1.468244 | 1.530464 | 1.435415 | 1.517878 | 2.844351e+07 | 1.572825e+09 | 2022-10-23T23:59:59.999Z | Flow | 2022-10-23 |
72943 | 72943 | 4.950431 | 5.148565 | 4.945280 | 5.117206 | 1.069497e+08 | 1.559551e+09 | 2022-10-23T23:59:59.999Z | Filecoin | 2022-10-23 |
72944 | 72944 | 0.000233 | 0.000243 | 0.000226 | 0.000239 | 2.143268e+08 | 1.576291e+09 | 2022-10-23T23:59:59.999Z | Terra Classic | 2022-10-23 |
72945 | 72945 | 0.465490 | 0.471006 | 0.453438 | 0.469033 | 9.509743e+08 | 2.339868e+10 | 2022-10-23T23:59:59.999Z | XRP | 2022-10-23 |
72946 rows × 10 columns
As we can see, there are a couple of things that we need to do to this dataframe before it is ready to be put together with the more recent data. So let's get to it!
If we look at the dataframe, there are a couple of columns that serve no purpose. The Unnamed: 0 column is of no use to us since it holds the exact same values as the index. The timestamp column also adds nothing, as we do not need exact timestamp data for our analysis. We will remove these columns in a similar fashion to how we handled the recent data.
historical_data_df = historical_data_df.drop(columns=['Unnamed: 0', 'timestamp'])
historical_data_df
open | high | low | close | volume | marketCap | crypto_name | date | |
---|---|---|---|---|---|---|---|---|
0 | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 | 1.288693e+09 | Bitcoin | 2013-05-05 |
1 | 3.493130 | 3.692460 | 3.346060 | 3.590890 | 0.000000e+00 | 6.229819e+07 | Litecoin | 2013-05-05 |
2 | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 | 1.249023e+09 | Bitcoin | 2013-05-06 |
3 | 3.594220 | 3.781020 | 3.116020 | 3.371250 | 0.000000e+00 | 5.859436e+07 | Litecoin | 2013-05-06 |
4 | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 | 1.240594e+09 | Bitcoin | 2013-05-07 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
72941 | 0.022604 | 0.022988 | 0.022197 | 0.022796 | 4.040134e+07 | 1.652957e+09 | VeChain | 2022-10-23 |
72942 | 1.468244 | 1.530464 | 1.435415 | 1.517878 | 2.844351e+07 | 1.572825e+09 | Flow | 2022-10-23 |
72943 | 4.950431 | 5.148565 | 4.945280 | 5.117206 | 1.069497e+08 | 1.559551e+09 | Filecoin | 2022-10-23 |
72944 | 0.000233 | 0.000243 | 0.000226 | 0.000239 | 2.143268e+08 | 1.576291e+09 | Terra Classic | 2022-10-23 |
72945 | 0.465490 | 0.471006 | 0.453438 | 0.469033 | 9.509743e+08 | 2.339868e+10 | XRP | 2022-10-23 |
72946 rows × 8 columns
Now that we got rid of the columns, the next thing is to make sure that the columns are of the right type. We can take a look at their types using the info
function.
historical_data_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 72946 entries, 0 to 72945 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 open 72946 non-null float64 1 high 72946 non-null float64 2 low 72946 non-null float64 3 close 72946 non-null float64 4 volume 72946 non-null float64 5 marketCap 72946 non-null float64 6 crypto_name 72946 non-null object 7 date 72946 non-null object dtypes: float64(6), object(2) memory usage: 4.5+ MB
As we can see from the dataframe's information, the date column appears as a pandas object. We would like this to be a datetime object, so we will convert it using the pandas function to_datetime
. Additionally, we would like the crypto_name column to be a string.
historical_data_df['date'] = pd.to_datetime(historical_data_df['date'])
historical_data_df['crypto_name'] = historical_data_df['crypto_name'].astype('string')
historical_data_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 72946 entries, 0 to 72945 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 open 72946 non-null float64 1 high 72946 non-null float64 2 low 72946 non-null float64 3 close 72946 non-null float64 4 volume 72946 non-null float64 5 marketCap 72946 non-null float64 6 crypto_name 72946 non-null string 7 date 72946 non-null datetime64[ns] dtypes: datetime64[ns](1), float64(6), string(1) memory usage: 4.5 MB
As we can see now, the date column is of type datetime and the crypto_name column is of type string. This is all we need to do in terms of fixing the dtypes of our dataframe. Now it is time to get rid of unnecessary data in our dataframe.
Now that we have the types sorted out and the useless columns removed, we can filter the dataframe so that it only contains the data for our top 10 cryptocurrencies. To do this we will get the names of our top 10 cryptocurrencies and then keep only the rows whose name appears in that list. The filtering condition will be handled by the pandas isin
function.
# Using list comprehension to get the names of the top 10
name_list = [name for name,ticker in top_10_list]
# Accounting for the inconsistency in name
name_list[-1] = 'Shiba Inu'
# Using the pandas isin function to keep only the rows belonging to our top 10
historical_data_df = historical_data_df[historical_data_df['crypto_name'].isin(name_list)]
historical_data_df
open | high | low | close | volume | marketCap | crypto_name | date | |
---|---|---|---|---|---|---|---|---|
0 | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 | 1.288693e+09 | Bitcoin | 2013-05-05 |
1 | 3.493130 | 3.692460 | 3.346060 | 3.590890 | 0.000000e+00 | 6.229819e+07 | Litecoin | 2013-05-05 |
2 | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 | 1.249023e+09 | Bitcoin | 2013-05-06 |
3 | 3.594220 | 3.781020 | 3.116020 | 3.371250 | 0.000000e+00 | 5.859436e+07 | Litecoin | 2013-05-06 |
4 | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 | 1.240594e+09 | Bitcoin | 2013-05-07 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
72923 | 28.013676 | 29.331540 | 27.857959 | 29.270071 | 6.331571e+08 | 1.048962e+10 | Solana | 2022-10-23 |
72924 | 0.000010 | 0.000010 | 0.000010 | 0.000010 | 1.138769e+08 | 5.573157e+09 | Shiba Inu | 2022-10-23 |
72928 | 0.832370 | 0.901422 | 0.822506 | 0.899837 | 3.701980e+08 | 7.859466e+09 | Polygon | 2022-10-23 |
72929 | 5.847780 | 5.968046 | 5.776710 | 5.960239 | 2.014905e+08 | 6.731207e+09 | Polkadot | 2022-10-23 |
72935 | 52.181555 | 53.950366 | 51.944883 | 53.950366 | 3.446915e+08 | 3.854806e+09 | Litecoin | 2022-10-23 |
19143 rows × 8 columns
Nice, we have drastically reduced the size of our dataframe, from roughly 73,000 rows down to about 19,000. This will make future operations less resource intensive. Now that we have the content that we would like to focus on, let's move on and carry out the next step in cleaning the data.
Now that our types are correct and we have filtered the data accordingly, we can add the column that we are missing which is the ticker information. After that, we will rename the columns for naming consistency and rearrange the columns to be in a more logical order.
# Fix discrepancy mistake
top_10_list[-1] = ('Shiba Inu', 'SHIB')
# Set up dictionary mapping names to tickers
name_to_ticker_dict = {}
# Inserting into dictionary
for name, ticker in top_10_list:
    name_to_ticker_dict[name] = ticker
# Adding column to dataframe w/ temp values
historical_data_df.insert(0, 'Ticker', np.nan)
# Setting ticker values for the dataframe
for index, row in historical_data_df.iterrows():
    historical_data_df.at[index, 'Ticker'] = name_to_ticker_dict[historical_data_df.at[index, 'crypto_name']]
historical_data_df
Ticker | open | high | low | close | volume | marketCap | crypto_name | date | |
---|---|---|---|---|---|---|---|---|---|
0 | BTC | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 | 1.288693e+09 | Bitcoin | 2013-05-05 |
1 | LTC | 3.493130 | 3.692460 | 3.346060 | 3.590890 | 0.000000e+00 | 6.229819e+07 | Litecoin | 2013-05-05 |
2 | BTC | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 | 1.249023e+09 | Bitcoin | 2013-05-06 |
3 | LTC | 3.594220 | 3.781020 | 3.116020 | 3.371250 | 0.000000e+00 | 5.859436e+07 | Litecoin | 2013-05-06 |
4 | BTC | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 | 1.240594e+09 | Bitcoin | 2013-05-07 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
72923 | SOL | 28.013676 | 29.331540 | 27.857959 | 29.270071 | 6.331571e+08 | 1.048962e+10 | Solana | 2022-10-23 |
72924 | SHIB | 0.000010 | 0.000010 | 0.000010 | 0.000010 | 1.138769e+08 | 5.573157e+09 | Shiba Inu | 2022-10-23 |
72928 | MATIC | 0.832370 | 0.901422 | 0.822506 | 0.899837 | 3.701980e+08 | 7.859466e+09 | Polygon | 2022-10-23 |
72929 | DOT | 5.847780 | 5.968046 | 5.776710 | 5.960239 | 2.014905e+08 | 6.731207e+09 | Polkadot | 2022-10-23 |
72935 | LTC | 52.181555 | 53.950366 | 51.944883 | 53.950366 | 3.446915e+08 | 3.854806e+09 | Litecoin | 2022-10-23 |
19143 rows × 9 columns
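As an aside, the same ticker assignment can be done without an explicit iterrows loop: pandas' Series.map performs the dictionary lookup in one vectorized pass, which is typically much faster on large frames. A minimal sketch on a toy frame (the column name mirrors the real one above):

```python
import pandas as pd

# Toy frame mirroring the crypto_name column of the historical data.
df = pd.DataFrame({'crypto_name': ['Bitcoin', 'Litecoin', 'Bitcoin']})
name_to_ticker = {'Bitcoin': 'BTC', 'Litecoin': 'LTC'}

# Series.map looks every name up in the dictionary at once,
# avoiding row-by-row iteration entirely.
df.insert(0, 'Ticker', df['crypto_name'].map(name_to_ticker))
print(df['Ticker'].tolist())  # ['BTC', 'LTC', 'BTC']
```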
We now have the ticker column, so let's rearrange the columns and change the naming so it looks better.
historical_data_df.columns = ['Ticker', 'Open', 'High', 'Low', 'Close', 'Volume', 'Market Cap.', 'Name', 'Date']
historical_data_df = historical_data_df.iloc[:, [8,7,0,1,2,3,4,5,6]]
historical_data_df
Date | Name | Ticker | Open | High | Low | Close | Volume | Market Cap. | |
---|---|---|---|---|---|---|---|---|---|
0 | 2013-05-05 | Bitcoin | BTC | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 | 1.288693e+09 |
1 | 2013-05-05 | Litecoin | LTC | 3.493130 | 3.692460 | 3.346060 | 3.590890 | 0.000000e+00 | 6.229819e+07 |
2 | 2013-05-06 | Bitcoin | BTC | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 | 1.249023e+09 |
3 | 2013-05-06 | Litecoin | LTC | 3.594220 | 3.781020 | 3.116020 | 3.371250 | 0.000000e+00 | 5.859436e+07 |
4 | 2013-05-07 | Bitcoin | BTC | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 | 1.240594e+09 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
72923 | 2022-10-23 | Solana | SOL | 28.013676 | 29.331540 | 27.857959 | 29.270071 | 6.331571e+08 | 1.048962e+10 |
72924 | 2022-10-23 | Shiba Inu | SHIB | 0.000010 | 0.000010 | 0.000010 | 0.000010 | 1.138769e+08 | 5.573157e+09 |
72928 | 2022-10-23 | Polygon | MATIC | 0.832370 | 0.901422 | 0.822506 | 0.899837 | 3.701980e+08 | 7.859466e+09 |
72929 | 2022-10-23 | Polkadot | DOT | 5.847780 | 5.968046 | 5.776710 | 5.960239 | 2.014905e+08 | 6.731207e+09 |
72935 | 2022-10-23 | Litecoin | LTC | 52.181555 | 53.950366 | 51.944883 | 53.950366 | 3.446915e+08 | 3.854806e+09 |
19143 rows × 9 columns
top_10_recent
Date | Name | Ticker | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|---|---|
0 | 2022-10-24 | Bitcoin | BTC | 19572.200000 | 19603.570000 | 19159.350000 | 19330.410000 | 2.943258e+04 |
1 | 2022-10-25 | Bitcoin | BTC | 19332.110000 | 20420.880000 | 19240.760000 | 20086.280000 | 4.245234e+04 |
2 | 2022-10-26 | Bitcoin | BTC | 20086.270000 | 21022.810000 | 20055.850000 | 20775.400000 | 4.361899e+04 |
3 | 2022-10-27 | Bitcoin | BTC | 20773.590000 | 20878.170000 | 20196.010000 | 20296.970000 | 3.412358e+04 |
4 | 2022-10-28 | Bitcoin | BTC | 20295.770000 | 20755.090000 | 20000.000000 | 20597.910000 | 2.958175e+04 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2005 | 2023-05-08 | Shiba Inu | SHIB | 0.000009 | 0.000009 | 0.000008 | 0.000009 | 1.680438e+12 |
2006 | 2023-05-09 | Shiba Inu | SHIB | 0.000009 | 0.000009 | 0.000009 | 0.000009 | 9.494224e+11 |
2007 | 2023-05-10 | Shiba Inu | SHIB | 0.000009 | 0.000009 | 0.000009 | 0.000009 | 9.374366e+11 |
2008 | 2023-05-11 | Shiba Inu | SHIB | 0.000009 | 0.000009 | 0.000009 | 0.000009 | 9.152456e+11 |
2009 | 2023-05-12 | Shiba Inu | SHIB | 0.000009 | 0.000009 | 0.000008 | 0.000009 | 7.590283e+11 |
2010 rows × 8 columns
Great! As we can see both of our OHLCV data components are prepped and ready to put together. Now we can go ahead and do that.
# Putting together two data sources
final_df = pd.concat([historical_data_df, top_10_recent])
# Sorting the combined dataset first by date and then by name within organized dates
final_df.sort_values(by=['Date', 'Name'], inplace=True)
# Since market capitalization data is missing in our more recent data, we can get rid of the column as it is
# of no use to us if it isn't complete
final_df = final_df.drop(columns=['Market Cap.'])
final_df
Date | Name | Ticker | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|---|---|
0 | 2013-05-05 | Bitcoin | BTC | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 |
1 | 2013-05-05 | Litecoin | LTC | 3.493130 | 3.692460 | 3.346060 | 3.590890 | 0.000000e+00 |
2 | 2013-05-06 | Bitcoin | BTC | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 |
3 | 2013-05-06 | Litecoin | LTC | 3.594220 | 3.781020 | 3.116020 | 3.371250 | 0.000000e+00 |
4 | 2013-05-07 | Bitcoin | BTC | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1607 | 2023-05-12 | Polkadot | DOT | 5.285000 | 5.314000 | 5.120000 | 5.264000 | 7.401615e+05 |
1406 | 2023-05-12 | Polygon | MATIC | 0.841100 | 0.874100 | 0.818800 | 0.850800 | 1.865495e+07 |
2009 | 2023-05-12 | Shiba Inu | SHIB | 0.000009 | 0.000009 | 0.000008 | 0.000009 | 7.590283e+11 |
1205 | 2023-05-12 | Solana | SOL | 20.180000 | 20.750000 | 19.810000 | 20.620000 | 1.044406e+06 |
602 | 2023-05-12 | Tether | USDT | 1.000420 | 1.000670 | 1.000220 | 1.000280 | 8.327807e+07 |
21153 rows × 8 columns
We now have our final dataset! The next step is to take this completed dataset and carry out some analysis and visualization to get a better understanding of the data.
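One thing worth noting: pd.concat keeps each source frame's original index, which is why row labels like 1607 and 2009 reappear out of order in the output above. If a clean sequential index is ever needed, reset_index(drop=True) (or concat's ignore_index=True option) renumbers the rows. A quick sketch with toy frames:

```python
import pandas as pd

# Each source frame keeps its own index through concat, so labels repeat.
a = pd.DataFrame({'x': [1, 2]})  # index 0, 1
b = pd.DataFrame({'x': [3, 4]})  # index 0, 1 again
combined = pd.concat([a, b])
print(combined.index.tolist())    # [0, 1, 0, 1]

# reset_index(drop=True) renumbers the rows into a single 0..n-1 range.
renumbered = combined.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```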
In this part of the data lifecycle, we really focus on manipulating the data to understand how it ties together and the structure that exists within it. To do this we are going to do a couple things. We will focus on calculating quantities that will help us better understand the data and then visualize the data through different representations and formats.
In this portion, we utilize a library called Plotly to generate our plots. To learn more about the library click here.
The first thing that we should do is make a copy and execute our actions on that copy, just in case we want the original version of our final dataset at any point.
# creating a copy just in case dataset gets messed up and original is needed.
final_cp = final_df.copy()
final_cp
Date | Name | Ticker | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|---|---|
0 | 2013-05-05 | Bitcoin | BTC | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 |
1 | 2013-05-05 | Litecoin | LTC | 3.493130 | 3.692460 | 3.346060 | 3.590890 | 0.000000e+00 |
2 | 2013-05-06 | Bitcoin | BTC | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 |
3 | 2013-05-06 | Litecoin | LTC | 3.594220 | 3.781020 | 3.116020 | 3.371250 | 0.000000e+00 |
4 | 2013-05-07 | Bitcoin | BTC | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1607 | 2023-05-12 | Polkadot | DOT | 5.285000 | 5.314000 | 5.120000 | 5.264000 | 7.401615e+05 |
1406 | 2023-05-12 | Polygon | MATIC | 0.841100 | 0.874100 | 0.818800 | 0.850800 | 1.865495e+07 |
2009 | 2023-05-12 | Shiba Inu | SHIB | 0.000009 | 0.000009 | 0.000008 | 0.000009 | 7.590283e+11 |
1205 | 2023-05-12 | Solana | SOL | 20.180000 | 20.750000 | 19.810000 | 20.620000 | 1.044406e+06 |
602 | 2023-05-12 | Tether | USDT | 1.000420 | 1.000670 | 1.000220 | 1.000280 | 8.327807e+07 |
21153 rows × 8 columns
This is referred to as OHLCV data, representing Open, High, Low, Close, and Volume over a set interval for an asset, and the most natural way to visualize it is through what is known as an OHLC chart. We will do this first to get a basic understanding of what this data looks like. We will generate an OHLC chart for each of our 10 cryptocurrencies, since the time interval covered by the data differs from coin to coin and a single aggregated chart would not do justice to each coin's behavior.
Just a warning in advance: this might take some time to render, as there is a lot of data in graphs with many features and sharp display. It is normal for the charts to take a couple of minutes to show up, so we advise waiting up to 5 minutes before checking whether there are any issues.
for name in name_list:
    temp_df = final_cp[final_cp['Name'] == name]
    trace = go.Ohlc(x=temp_df['Date'], open=temp_df['Open'], high=temp_df['High'], low=temp_df['Low'],
                    close=temp_df['Close'])
    fig = go.Figure(data=trace)
    fig.update_layout(
        title=f"OHLC for {name} Over Time",
        xaxis_title='Year', yaxis_title='Price per Coin (USD)')
    fig.show()
Taking a look at these graphs, we can see that they all have a huge peak followed by a sharp drop, and are now holding at lower levels closer to the present day. For all of these currencies, the surge and peak appears to fall somewhere around 2021-2022.
This makes sense given that there was a huge cryptocurrency popularity surge around that time due to a multitude of factors. Here is an article around that time talking about key crypto moments in 2021 by CNBC. Another article to check out around that time is a specific analysis on the trends of Bitcoin in that timeframe by Outlook.
This goes to show that there might be some correlating factors worth examining. To see the general trend with less of this volatility, we can plot each cryptocurrency's average price over time.
We can calculate the average price on a day by just taking the mean of the open and close price for that date:
$$\Large avg = \frac{open + close}{2} $$

final_cp.insert(8, 'Average', np.nan)
final_cp['Average'] = (final_cp['Open'] + final_cp['Close']) / 2
final_cp
Date | Name | Ticker | Open | High | Low | Close | Volume | Average | |
---|---|---|---|---|---|---|---|---|---|
0 | 2013-05-05 | Bitcoin | BTC | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 | 114.405003 |
1 | 2013-05-05 | Litecoin | LTC | 3.493130 | 3.692460 | 3.346060 | 3.590890 | 0.000000e+00 | 3.542010 |
2 | 2013-05-06 | Bitcoin | BTC | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 | 114.140003 |
3 | 2013-05-06 | Litecoin | LTC | 3.594220 | 3.781020 | 3.116020 | 3.371250 | 0.000000e+00 | 3.482735 |
4 | 2013-05-07 | Bitcoin | BTC | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 | 111.875000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1607 | 2023-05-12 | Polkadot | DOT | 5.285000 | 5.314000 | 5.120000 | 5.264000 | 7.401615e+05 | 5.274500 |
1406 | 2023-05-12 | Polygon | MATIC | 0.841100 | 0.874100 | 0.818800 | 0.850800 | 1.865495e+07 | 0.845950 |
2009 | 2023-05-12 | Shiba Inu | SHIB | 0.000009 | 0.000009 | 0.000008 | 0.000009 | 7.590283e+11 | 0.000009 |
1205 | 2023-05-12 | Solana | SOL | 20.180000 | 20.750000 | 19.810000 | 20.620000 | 1.044406e+06 | 20.400000 |
602 | 2023-05-12 | Tether | USDT | 1.000420 | 1.000670 | 1.000220 | 1.000280 | 8.327807e+07 | 1.000350 |
21153 rows × 9 columns
Nice, now we have calculated the average price per day in USD. We can go ahead and plot these values for our 10 cryptocurrencies once again and see what they show us. This time, we will put them all on the same graph so we can view their growth relative to each other.
# Using plotly express to create the graph
fig = px.line(final_cp, x='Date', y='Average', color='Name', symbol='Ticker')
# adding title and labeling axis
fig.update_layout(title=f"Average Price Over Time", xaxis_title='Year', yaxis_title='Price per Coin (USD)')
fig.show()
Okay, it looks like Bitcoin's change relative to the others was immense, which is expected given the values in our dataset. Ethereum's trend is also somewhat visible. To better see any correlation among the rest, it is better to exclude Bitcoin and Ethereum, since their scale dwarfs everything else.
# Coins to exclude
remove = ['Bitcoin', 'Ethereum']
# Keeping rows that do not have information pertaining to Bitcoin or Ethereum
final_cp_2 = final_cp[~final_cp['Name'].isin(remove)]
# Plotting remaining data
fig = px.line(final_cp_2, x='Date', y='Average', color='Name', symbol='Ticker')
# Setting plot title and axis labels
fig.update_layout(title=f"Average Price Over Time", xaxis_title='Year', yaxis_title='Price per Coin (USD)')
fig.show()
Interestingly, with Bitcoin and Ethereum excluded, it appears that Litecoin, Solana, and Polkadot show much more relative growth than Cardano, Dogecoin, Tether, Shiba Inu, and Polygon. This makes sense: Litecoin is based on the Bitcoin protocol, so the mass popularity around Bitcoin would logically spill over to it, while Solana and Polkadot are both successful products built on aspects of blockchain technology, hence the growth in their value.
The results for Dogecoin and Shiba Inu are expected given that they are considered "meme coins". It also makes sense that Tether does not shift much, as it is a stablecoin. The same logic applies to Cardano: it is a proof-of-stake blockchain platform founded back in 2017, so its market share is not as massive as Ethereum's. Finally, Polygon is a tool for Ethereum scaling, which explains why its growth is limited, since its scope is much narrower than Bitcoin's.
Overall, the trend is that massive fluctuations belong to the more popular coins backed by successful technologies, which fits the volatile popularity and shifting economy we have witnessed over the past couple of years in the aftermath of COVID-19. Other coins have held their own without drastic changes simply because they are newer and smaller in scale. Coins whose value swings much more widely introduce a bias that makes the smaller trends hard to read.
To further confirm and corroborate this idea, we can take a look at some simple moving averages to see whether they could be good predictors of cryptocurrency changes. A simple moving average is just the average of a selected range of prices over the number of periods in that range. In our case we will calculate a few of them, using our average price as the value summed over each range. The formula we will be using is the following:
$$\Large SMA = \frac{A_1 + A_2 + A_3 + ... + A_n}{n} $$

Note that in this case n is the number of periods in the range.
We will plot a 25, 50, and 100 day simple moving average and see how it correlates with the change of a cryptocurrency.
# reordering for calculation of SMA's
final_cp = final_cp.sort_values(['Name', 'Date'])
# calculating SMA's
final_cp['SMA25'] = final_cp.groupby('Name')['Average'].rolling(window = 25).mean().reset_index(0, drop=True)
final_cp['SMA50'] = final_cp.groupby('Name')['Average'].rolling(window = 50).mean().reset_index(0, drop=True)
final_cp['SMA100'] = final_cp.groupby('Name')['Average'].rolling(window = 100).mean().reset_index(0, drop=True)
for name in name_list:
    fig = make_subplots(rows=1, cols=1)
    # Filtering to only the data for the coin we are currently looking at
    filtered_df = (final_cp[final_cp['Name'] == name]).copy()
    # Creating graph components for each line
    trace1 = go.Scatter(x=filtered_df['Date'], y=filtered_df['Average'], name="Average")
    trace2 = go.Scatter(x=filtered_df['Date'], y=filtered_df['SMA25'], name="SMA25")
    trace3 = go.Scatter(x=filtered_df['Date'], y=filtered_df['SMA50'], name="SMA50")
    trace4 = go.Scatter(x=filtered_df['Date'], y=filtered_df['SMA100'], name="SMA100")
    # Putting the lines onto the figure
    fig.add_trace(trace1)
    fig.add_trace(trace2)
    fig.add_trace(trace3)
    fig.add_trace(trace4)
    # Setting title and axis labels
    fig.update_layout(title=f"Price of {name} with 25, 50, and 100 day SMAs",
                      xaxis_title="Date", yaxis_title="Price per Coin (USD)")
    fig.show()
Okay, we have now graphed our 10 cryptocurrencies with their 25, 50, and 100 day simple moving averages. A common pattern is that when the averages intersect and start moving together, and the slower averages move more sharply than usual, an extreme shift in either direction tends to follow. There is clearly some correlation between changes in a moving average and shifts in a cryptocurrency's price.
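The crossover behavior described above can also be detected programmatically rather than by eye. Below is a minimal sketch, assuming a frame with SMA25 and SMA50 columns like ours; the values here are made up purely for illustration.

```python
import pandas as pd

# Hypothetical mini-frame standing in for one coin's SMA columns
# (values are made up for illustration).
df = pd.DataFrame({
    'SMA25': [10, 11, 12, 13, 12, 11],
    'SMA50': [12, 12, 12, 12, 12, 12],
})

# A crossover ("golden cross") occurs where the faster average
# moves from below the slower average to above it.
above = df['SMA25'] > df['SMA50']
crosses_up = above & ~above.shift(fill_value=False)
print(crosses_up.tolist())  # → [False, False, False, True, False, False]
```

The same mask with the comparison flipped would flag "death cross" points where the faster average drops below the slower one.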
Another factor that should let us see trends more clearly is growth, which evaluates the change in value relative to itself rather than the absolute value. Cryptocurrencies showing strong growth indicate popularity and good performance without being skewed by price, and that is exactly what we are after: finding assets with good growth so that we can invest in them. To calculate growth, we can use the following formula:
$$\Large Growth = (\frac{Close}{Open} - 1) * 100 $$

# generating new column filled with NaN
final_cp.insert(12, 'Growth', np.nan)
# formula for calculating growth
final_cp['Growth'] = ((final_cp['Close'] / final_cp['Open']) - 1) * 100
final_cp
Date | Name | Ticker | Open | High | Low | Close | Volume | Average | SMA25 | SMA50 | SMA100 | Growth | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-05-05 | Bitcoin | BTC | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 | 114.405003 | NaN | NaN | NaN | 2.666078 |
2 | 2013-05-06 | Bitcoin | BTC | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 | 114.140003 | NaN | NaN | NaN | -3.172961 |
4 | 2013-05-07 | Bitcoin | BTC | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 | 111.875000 | NaN | NaN | NaN | -0.668151 |
7 | 2013-05-08 | Bitcoin | BTC | 109.599998 | 115.779999 | 109.599998 | 113.566002 | 0.000000e+00 | 111.583000 | NaN | NaN | NaN | 3.618616 |
9 | 2013-05-09 | Bitcoin | BTC | 113.199997 | 113.459999 | 109.260002 | 112.669998 | 0.000000e+00 | 112.934998 | NaN | NaN | NaN | -0.468197 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
598 | 2023-05-08 | Tether | USDT | 1.001390 | 1.001660 | 1.000600 | 1.000840 | 1.071410e+08 | 1.001115 | 1.000535 | 1.000755 | 1.000828 | -0.054924 |
599 | 2023-05-09 | Tether | USDT | 1.000840 | 1.001770 | 1.000290 | 1.000540 | 7.751265e+07 | 1.000690 | 1.000522 | 1.000702 | 1.000834 | -0.029975 |
600 | 2023-05-10 | Tether | USDT | 1.000540 | 1.000870 | 1.000200 | 1.000790 | 1.013193e+08 | 1.000665 | 1.000512 | 1.000648 | 1.000840 | 0.024987 |
601 | 2023-05-11 | Tether | USDT | 1.000780 | 1.000790 | 1.000000 | 1.000430 | 9.465740e+07 | 1.000605 | 1.000504 | 1.000602 | 1.000845 | -0.034973 |
602 | 2023-05-12 | Tether | USDT | 1.000420 | 1.000670 | 1.000220 | 1.000280 | 8.327807e+07 | 1.000350 | 1.000489 | 1.000562 | 1.000848 | -0.013994 |
21153 rows × 13 columns
Now that we have growth calculated, let us plot it over time for all of the cryptocurrencies we are looking at, so that we can compare the scale of growth in each rather than the raw price. To do this we can just use plotly express.
# Generating growth over time graph for all of the cryptocurrencies we are looking at
fig = px.line(final_cp, x='Date', y='Growth', color='Name', symbol='Ticker')
fig.update_layout(title="Growth of Cryptocurrencies Over Time",
                  xaxis_title="Date", yaxis_title="Growth (%)")
fig.show()
Looking at this graph, surprisingly, the cryptocurrency with the largest growth during the crypto surge was actually Shiba Inu rather than Bitcoin. Although a single Bitcoin was worth far more, someone who invested enough into Shiba Inu at the right time would have seen profits that dwarfed those made in Bitcoin. In general, each of these currencies hit a growth peak at some point, came back down, and then fluctuated slightly here and there. The more stable growth belongs to coins like Bitcoin or Ethereum. Higher prices seem to indicate more stability, so there might be a correlation between price and growth beyond the obvious connotation linking the two.
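To pin down which coin peaked when, one could pull each coin's single best growth day with a groupby. A minimal sketch on a toy frame shaped like our Name/Date/Growth columns (the values below are made up):

```python
import pandas as pd

# Toy frame shaped like our Name/Date/Growth columns (values made up).
df = pd.DataFrame({
    'Name': ['Bitcoin', 'Bitcoin', 'Shiba Inu', 'Shiba Inu'],
    'Date': pd.to_datetime(['2021-01-01', '2021-11-09', '2021-05-01', '2021-10-27']),
    'Growth': [4.2, -2.0, 30.0, 45.0],
})

# idxmax gives the row index of each coin's largest daily growth,
# and .loc pulls those rows back out of the frame.
peaks = df.loc[df.groupby('Name')['Growth'].idxmax()]
print(peaks['Growth'].tolist())  # → [4.2, 45.0]
```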
Another aspect worth looking at is the relationship between volume and growth. In theory at least, it makes sense that price would shift if a lot of coins are sold or purchased: a spike in volume could mean a coin has lost a lot of profitability and people are rushing to sell, or the reverse, where a cryptocurrency is doing extremely well and people are rushing to buy in.
final_cp['Volume']
0      0.000000e+00
2      0.000000e+00
4      0.000000e+00
7      0.000000e+00
9      0.000000e+00
           ...
598    1.071410e+08
599    7.751265e+07
600    1.013193e+08
601    9.465740e+07
602    8.327807e+07
Name: Volume, Length: 21153, dtype: float64
Looking at the data here, we can see that the volume quantity grows to extremely large sizes. If we want to keep things close in value we need to figure out some sort of scaling transformation or calculation to apply to volume traded so that we can see any possible trend.
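As a hedged sketch of the kinds of scaling one might reach for here, percentage change and a log transform are two common options; the series values below are made up, and this is only an illustration of the idea, not the transformation the analysis settles on.

```python
import numpy as np
import pandas as pd

# Made-up volume figures spanning several orders of magnitude.
vol = pd.Series([1e3, 5e3, 1e8, 2e8], name='Volume')

# Percentage change measures each day relative to the previous one...
pct = vol.pct_change() * 100
# ...while log1p compresses the huge absolute range onto a tame scale.
log_vol = np.log1p(vol)
print(pct.tolist())  # first entry is NaN since there is no prior day
```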
If we think about it, what we really care about is not the raw amount of volume traded but rapid changes in it, and what happens to growth when they occur. In that case, we need the difference between the previous day's volume and the current day's. To do this we can just offset the column and take the difference.
# Resorting dataframe for easier calculation of volume changed
final_cp = final_cp.sort_values(['Name', 'Date'])
final_cp['C_Volume'] = final_cp.groupby('Name')['Volume'].diff()
final_cp
Date | Name | Ticker | Open | High | Low | Close | Volume | Average | SMA25 | SMA50 | SMA100 | Growth | C_Volume | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-05-05 | Bitcoin | BTC | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 | 114.405003 | NaN | NaN | NaN | 2.666078 | NaN |
2 | 2013-05-06 | Bitcoin | BTC | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 | 114.140003 | NaN | NaN | NaN | -3.172961 | 0.00 |
4 | 2013-05-07 | Bitcoin | BTC | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 | 111.875000 | NaN | NaN | NaN | -0.668151 | 0.00 |
7 | 2013-05-08 | Bitcoin | BTC | 109.599998 | 115.779999 | 109.599998 | 113.566002 | 0.000000e+00 | 111.583000 | NaN | NaN | NaN | 3.618616 | 0.00 |
9 | 2013-05-09 | Bitcoin | BTC | 113.199997 | 113.459999 | 109.260002 | 112.669998 | 0.000000e+00 | 112.934998 | NaN | NaN | NaN | -0.468197 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
598 | 2023-05-08 | Tether | USDT | 1.001390 | 1.001660 | 1.000600 | 1.000840 | 1.071410e+08 | 1.001115 | 1.000535 | 1.000755 | 1.000828 | -0.054924 | 62362190.82 |
599 | 2023-05-09 | Tether | USDT | 1.000840 | 1.001770 | 1.000290 | 1.000540 | 7.751265e+07 | 1.000690 | 1.000522 | 1.000702 | 1.000834 | -0.029975 | -29628362.87 |
600 | 2023-05-10 | Tether | USDT | 1.000540 | 1.000870 | 1.000200 | 1.000790 | 1.013193e+08 | 1.000665 | 1.000512 | 1.000648 | 1.000840 | 0.024987 | 23806631.47 |
601 | 2023-05-11 | Tether | USDT | 1.000780 | 1.000790 | 1.000000 | 1.000430 | 9.465740e+07 | 1.000605 | 1.000504 | 1.000602 | 1.000845 | -0.034973 | -6661882.15 |
602 | 2023-05-12 | Tether | USDT | 1.000420 | 1.000670 | 1.000220 | 1.000280 | 8.327807e+07 | 1.000350 | 1.000489 | 1.000562 | 1.000848 | -0.013994 | -11379326.46 |
21153 rows × 14 columns
Okay now that we have the change in the volume of coins traded with respect to the previous day, we can go ahead and make a plot of growth over volume change and see the trend to verify our results. To do this we will make a scatterplot using plotly.
# Generating scatterplot of data points
fig = px.scatter(final_cp, x='C_Volume', y='Growth', color='Name', symbol='Ticker')
# Adding title and axis labels
fig.update_layout(title="Growth of Cryptocurrencies over Volume Traded",
                  xaxis_title="Change in Volume of Coins Traded", yaxis_title="Growth (%)")
fig.show()
Wow, our hypothesis about changes in volume correlating with growth appears to be wrong. Apart from a few outliers that saw some profit on large volume swings, the density of the data sits near the origin of the graph. That is, positive growth most commonly occurs when the change in volume traded is minimal, with the tail leaning toward positive growth.
There may be some correlation between instability and large changes in volume, which makes a bit of sense, but the extremity of such situations seems to outweigh the gain; slow and steady appears to win the growth race.
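To put a number on this impression, one could compute the Pearson correlation between the volume-change and growth columns. A minimal sketch with made-up values standing in for C_Volume and Growth:

```python
import pandas as pd

# Made-up values standing in for the C_Volume and Growth columns.
df = pd.DataFrame({
    'C_Volume': [1.0e6, -2.0e6, 5.0e5, 3.0e6, -1.0e6],
    'Growth': [0.5, -0.3, 1.2, -0.8, 0.1],
})

# Pearson correlation; a value near 0 supports the "no strong
# linear link between volume swings and growth" reading.
r = df['C_Volume'].corr(df['Growth'])
print(round(r, 3))
```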
Another factor that we can calculate is the [Average True Range](https://www.wallstreetmojo.com/average-true-range/#:~:text=The%20ATR%20formula%20is%20%E2%80%9C%5B(,calculated%20on%2014%2Dday%20periods.), or ATR for short. ATR is a technical analysis indicator of market or price volatility. It captures how much an asset's price tends to move, helping an investor pick better times to trade and achieve consistency. In our case, we can use it to see the range in which a value can change and how it correlates with other measures of volatility like the change in volume traded.
We need several formulas to calculate ATR and they are below:
$$\Large TR = max[(H - L), \left\lvert H - C_p\right\rvert, \left\lvert L - C_p\right\rvert]$$

$$\Large ATR = \frac{1}{n}*\sum \limits_{i=1}^{n} TR_{i}$$

To calculate this, we will sort the dataset first by name and then by date, and then carry out the steps in this guide. We will be calculating a 14-day ATR, so our window for the moving average is 14.
# sorting dataset for groupby to work properly
final_cp = final_cp.sort_values(['Name', 'Date'])
# Calculating ranges and saving as column
final_cp['H-L'] = final_cp.groupby('Name').apply(lambda x: x['High'] - x['Low']).reset_index(0, drop=True)
final_cp['H-C1'] = final_cp.groupby('Name').apply(lambda x: np.abs(x['High'] -
x['Close'].shift())).reset_index(0, drop=True)
final_cp['L-C1'] = final_cp.groupby('Name').apply(lambda x: np.abs(x['Low'] -
x['Close'].shift())).reset_index(0, drop=True)
# Calculating the true range, which is the largest of the three ranges calculated before
final_cp['TR'] = final_cp.groupby('Name').apply(lambda x: np.max(pd.concat([x['H-L'], x['H-C1'], x['L-C1']], axis=1), axis=1)).reset_index(0, drop=True)
# We are calculating 14 day ATR
final_cp['ATR'] = final_cp.groupby('Name')['TR'].rolling(14).mean().reset_index(0, drop=True)
final_cp = final_cp.drop(columns=['H-L', 'H-C1', 'L-C1', 'TR'])
final_cp
Date | Name | Ticker | Open | High | Low | Close | Volume | Average | SMA25 | SMA50 | SMA100 | Growth | C_Volume | ATR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-05-05 | Bitcoin | BTC | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 | 114.405003 | NaN | NaN | NaN | 2.666078 | NaN | NaN |
2 | 2013-05-06 | Bitcoin | BTC | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 | 114.140003 | NaN | NaN | NaN | -3.172961 | 0.00 | NaN |
4 | 2013-05-07 | Bitcoin | BTC | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 | 111.875000 | NaN | NaN | NaN | -0.668151 | 0.00 | NaN |
7 | 2013-05-08 | Bitcoin | BTC | 109.599998 | 115.779999 | 109.599998 | 113.566002 | 0.000000e+00 | 111.583000 | NaN | NaN | NaN | 3.618616 | 0.00 | NaN |
9 | 2013-05-09 | Bitcoin | BTC | 113.199997 | 113.459999 | 109.260002 | 112.669998 | 0.000000e+00 | 112.934998 | NaN | NaN | NaN | -0.468197 | 0.00 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
598 | 2023-05-08 | Tether | USDT | 1.001390 | 1.001660 | 1.000600 | 1.000840 | 1.071410e+08 | 1.001115 | 1.000535 | 1.000755 | 1.000828 | -0.054924 | 62362190.82 | 0.001539 |
599 | 2023-05-09 | Tether | USDT | 1.000840 | 1.001770 | 1.000290 | 1.000540 | 7.751265e+07 | 1.000690 | 1.000522 | 1.000702 | 1.000834 | -0.029975 | -29628362.87 | 0.000922 |
600 | 2023-05-10 | Tether | USDT | 1.000540 | 1.000870 | 1.000200 | 1.000790 | 1.013193e+08 | 1.000665 | 1.000512 | 1.000648 | 1.000840 | 0.024987 | 23806631.47 | 0.000914 |
601 | 2023-05-11 | Tether | USDT | 1.000780 | 1.000790 | 1.000000 | 1.000430 | 9.465740e+07 | 1.000605 | 1.000504 | 1.000602 | 1.000845 | -0.034973 | -6661882.15 | 0.000916 |
602 | 2023-05-12 | Tether | USDT | 1.000420 | 1.000670 | 1.000220 | 1.000280 | 8.327807e+07 | 1.000350 | 1.000489 | 1.000562 | 1.000848 | -0.013994 | -11379326.46 | 0.000886 |
21153 rows × 15 columns
Okay, now that we have ATR calculated, we can plot close prices against ATR to see the correlation between the two.
for name in name_list:
    # Filtering to only the data for the coin we are currently looking at
    filtered_df = (final_cp[final_cp['Name'] == name]).copy()
    fig = px.scatter(filtered_df, x='ATR', y='Close', trendline='ols')
    # Setting title and axis labels
    fig.update_layout(title=f"Close Price of {name} over ATR",
                      xaxis_title="ATR (USD)", yaxis_title="Price per Coin (USD)")
    fig.show()
Looking at these plots, there is clearly a strong correlation between ATR and close price. The OLS regression line covers the spread of the data quite well apart from a few outliers, with R^2 values ranging from 0.6 to 0.9, indicating that a good portion of the variance in the data is explained by the model. These graphs establish an important fact: the two are strongly correlated.
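The R^2 statistic that the OLS trendlines report can be recomputed by hand, which makes it clearer what the 0.6-0.9 range means. A minimal sketch with hypothetical ATR and close values (made up for illustration):

```python
import numpy as np

# Hypothetical ATR and close values for one coin (made up).
atr = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
close = np.array([10.0, 14.0, 15.0, 22.0, 24.0])

# Fit a least-squares line, then compute R^2 as the share of
# variance in close explained by the fit.
slope, intercept = np.polyfit(atr, close, 1)
pred = slope * atr + intercept
ss_res = ((close - pred) ** 2).sum()
ss_tot = ((close - close.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # → 0.953
```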
Now that we have calculated ATR, we can calculate another indicator that is related to it and will help us understand our data better.
We will now look at ADX, the Average Directional Index, an extremely popular technical analysis indicator. ADX tells you whether your chart data is trending or ranging; in other words, how strong the trend in the data is. If we are on an upward trend, it tells us whether that trend is still going strong or is weakening toward a plateau or a drop in price. To calculate ADX, we need a couple of formulas.
$$\Large +DM = Current High - Previous High $$

$$\Large -DM = Previous Low - Current Low $$

$$\Large +DI = \frac{Smoothed(+DM)}{ATR}$$

$$\Large -DI = \frac{Smoothed(-DM)}{ATR}$$

$$\Large ADX = 100 * (\frac{\left\lvert +DI - (-DI) \right\rvert}{\left\lvert +DI + (-DI) \right\rvert})$$

Okay, now that we know our formulas and already have ATR, we will calculate ADX, starting with +/-DM and +/-DI since those are all that remain. We will reuse our 14-day ATR as the denominator and smooth the directional movement over a short 5-day window.
interval = 5
# Preparing dataset by sorting for groupby's
final_cp = final_cp.sort_values(['Name', 'Date'])
# Calculating positive and negative changes
final_cp['h_diff'] = final_cp.groupby('Name')['High'].diff()
final_cp['l_diff'] = final_cp.groupby('Name').apply(lambda x: x['Low'].shift() - x['Low']).reset_index(0, drop=True)
# Calculating the positive and negative directional movement
final_cp['+DM'] = final_cp.apply(lambda x: x['h_diff'] if x['h_diff'] > x['l_diff'] else 0, axis=1)
final_cp['-DM'] = final_cp.apply(lambda x: x['l_diff'] if x['l_diff'] > x['h_diff'] else 0, axis=1)
# Calculating Smoothed positive and negative directional movement
final_cp['+DM_s'] = final_cp.groupby('Name')['+DM'].rolling(interval).mean().reset_index(0, drop=True)
final_cp['-DM_s'] = final_cp.groupby('Name')['-DM'].rolling(interval).mean().reset_index(0, drop=True)
# Calculating positive and negative directional index
final_cp['+DI'] = final_cp['+DM_s'] / final_cp['ATR']
final_cp['-DI'] = final_cp['-DM_s'] / final_cp['ATR']
# Using formula, calculating ADX
final_cp['ADX'] = 100 * (np.abs(final_cp['+DI'] - final_cp['-DI']) / np.abs(final_cp['+DI'] + final_cp['-DI']))
final_cp = final_cp.drop(columns=['h_diff', 'l_diff', '+DM', '-DM', '+DM_s', '-DM_s', '+DI', '-DI'])
final_cp
Date | Name | Ticker | Open | High | Low | Close | Volume | Average | SMA25 | SMA50 | SMA100 | Growth | C_Volume | ATR | ADX | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-05-05 | Bitcoin | BTC | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 | 114.405003 | NaN | NaN | NaN | 2.666078 | NaN | NaN | NaN |
2 | 2013-05-06 | Bitcoin | BTC | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 | 114.140003 | NaN | NaN | NaN | -3.172961 | 0.00 | NaN | NaN |
4 | 2013-05-07 | Bitcoin | BTC | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 | 111.875000 | NaN | NaN | NaN | -0.668151 | 0.00 | NaN | NaN |
7 | 2013-05-08 | Bitcoin | BTC | 109.599998 | 115.779999 | 109.599998 | 113.566002 | 0.000000e+00 | 111.583000 | NaN | NaN | NaN | 3.618616 | 0.00 | NaN | NaN |
9 | 2013-05-09 | Bitcoin | BTC | 113.199997 | 113.459999 | 109.260002 | 112.669998 | 0.000000e+00 | 112.934998 | NaN | NaN | NaN | -0.468197 | 0.00 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
598 | 2023-05-08 | Tether | USDT | 1.001390 | 1.001660 | 1.000600 | 1.000840 | 1.071410e+08 | 1.001115 | 1.000535 | 1.000755 | 1.000828 | -0.054924 | 62362190.82 | 0.001539 | 33.720930 |
599 | 2023-05-09 | Tether | USDT | 1.000840 | 1.001770 | 1.000290 | 1.000540 | 7.751265e+07 | 1.000690 | 1.000522 | 1.000702 | 1.000834 | -0.029975 | -29628362.87 | 0.000922 | 9.278351 |
600 | 2023-05-10 | Tether | USDT | 1.000540 | 1.000870 | 1.000200 | 1.000790 | 1.013193e+08 | 1.000665 | 1.000512 | 1.000648 | 1.000840 | 0.024987 | 23806631.47 | 0.000914 | 19.018405 |
601 | 2023-05-11 | Tether | USDT | 1.000780 | 1.000790 | 1.000000 | 1.000430 | 9.465740e+07 | 1.000605 | 1.000504 | 1.000602 | 1.000845 | -0.034973 | -6661882.15 | 0.000916 | 114.678899 |
602 | 2023-05-12 | Tether | USDT | 1.000420 | 1.000670 | 1.000220 | 1.000280 | 8.327807e+07 | 1.000350 | 1.000489 | 1.000562 | 1.000848 | -0.013994 | -11379326.46 | 0.000886 | 122.857143 |
21153 rows × 16 columns
Now that we have calculated our ADX, we should plot it to see how it correlates with the other statistics we have. The most logical general approach is to plot it alongside the price averages, to see exactly how it lines up with whether a chart is trending or not.
for name in name_list:
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    # Filtering to only the data for the coin we are currently looking at
    filtered_df = (final_cp[final_cp['Name'] == name]).copy()
    # Creating graph components for each line
    trace1 = go.Scatter(x=filtered_df['Date'], y=filtered_df['ADX'], name="ADX")
    trace2 = go.Scatter(x=filtered_df['Date'], y=filtered_df['Average'], name="Average")
    trace3 = go.Scatter(x=filtered_df['Date'], y=filtered_df['SMA25'], name="SMA25")
    trace4 = go.Scatter(x=filtered_df['Date'], y=filtered_df['SMA50'], name="SMA50")
    trace5 = go.Scatter(x=filtered_df['Date'], y=filtered_df['SMA100'], name="SMA100")
    # ADX goes on its own secondary axis since its scale differs from price
    fig.add_trace(trace1, secondary_y=True)
    fig.add_trace(trace2, secondary_y=False)
    fig.add_trace(trace3, secondary_y=False)
    fig.add_trace(trace4, secondary_y=False)
    fig.add_trace(trace5, secondary_y=False)
    # Setting title and axis labels
    fig.update_layout(title=f"Averages of {name} with ADX over Time",
                      xaxis_title="Date", yaxis_title="Price per Coin (USD)")
    fig.show()
Looking at the graphs, there appears to be no correlation between ADX and the average price, as the spikes in ADX do not line up with the changes in the price graph. They sparsely do in some cases, but not closely enough to count as a significant correlation. We can conclude that ADX is an important indicator and will be utilized in predictions, but we do not need to worry about correlations between potential independent parameters.
Another indicator that can give us more insight into the relations within our data is RSI, the Relative Strength Index. RSI is an important momentum indicator that is commonly used with OHLC charts. It measures the speed and magnitude of an asset's recent price changes to evaluate overvalued or undervalued conditions in the price of that asset. The formulas for RSI are as follows:
$$\Large RS = \frac{Average Gain}{Average Loss}$$

$$\Large RSI = 100 - \frac{100}{1 + RS}$$

We will calculate this in a similar manner to how we calculated ATR. Just like ATR, we will be calculating a 14-day RSI, so our interval is 14.
# Sorting dataset properly in order for groupby to work properly
final_cp = final_cp.sort_values(['Name', 'Date'])
# Calculating day-over-day close-price differences
final_cp['C_Close'] = final_cp.groupby('Name')['Close'].diff()
# Calculating the closing price difference up an down
final_cp['C_Close_Up'] = final_cp['C_Close'].clip(lower=0)
final_cp['C_Close_Down'] = -1 * final_cp['C_Close'].clip(upper=0)
# Calculating the Average Gain and Average Loss
final_cp['Avg_Up'] = final_cp.groupby('Name')['C_Close_Up'].rolling(14).mean().reset_index(0, drop=True)
final_cp['Avg_Down'] = final_cp.groupby('Name')['C_Close_Down'].rolling(14).mean().reset_index(0, drop=True)
# Using RSI formula and calculating
final_cp['RS'] = final_cp['Avg_Up'] / final_cp['Avg_Down']
final_cp['RSI'] = 100 - ( 100 / (1 + final_cp['RS']) )
final_cp = final_cp.drop(columns=['C_Close', 'C_Close_Up', 'C_Close_Down', 'Avg_Up', 'Avg_Down', 'RS'])
final_cp
Date | Name | Ticker | Open | High | Low | Close | Volume | Average | SMA25 | SMA50 | SMA100 | Growth | C_Volume | ATR | ADX | RSI | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-05-05 | Bitcoin | BTC | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 | 114.405003 | NaN | NaN | NaN | 2.666078 | NaN | NaN | NaN | NaN |
2 | 2013-05-06 | Bitcoin | BTC | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 | 114.140003 | NaN | NaN | NaN | -3.172961 | 0.00 | NaN | NaN | NaN |
4 | 2013-05-07 | Bitcoin | BTC | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 | 111.875000 | NaN | NaN | NaN | -0.668151 | 0.00 | NaN | NaN | NaN |
7 | 2013-05-08 | Bitcoin | BTC | 109.599998 | 115.779999 | 109.599998 | 113.566002 | 0.000000e+00 | 111.583000 | NaN | NaN | NaN | 3.618616 | 0.00 | NaN | NaN | NaN |
9 | 2013-05-09 | Bitcoin | BTC | 113.199997 | 113.459999 | 109.260002 | 112.669998 | 0.000000e+00 | 112.934998 | NaN | NaN | NaN | -0.468197 | 0.00 | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
598 | 2023-05-08 | Tether | USDT | 1.001390 | 1.001660 | 1.000600 | 1.000840 | 1.071410e+08 | 1.001115 | 1.000535 | 1.000755 | 1.000828 | -0.054924 | 62362190.82 | 0.001539 | 33.720930 | 58.581236 |
599 | 2023-05-09 | Tether | USDT | 1.000840 | 1.001770 | 1.000290 | 1.000540 | 7.751265e+07 | 1.000690 | 1.000522 | 1.000702 | 1.000834 | -0.029975 | -29628362.87 | 0.000922 | 9.278351 | 54.623656 |
600 | 2023-05-10 | Tether | USDT | 1.000540 | 1.000870 | 1.000200 | 1.000790 | 1.013193e+08 | 1.000665 | 1.000512 | 1.000648 | 1.000840 | 0.024987 | 23806631.47 | 0.000914 | 19.018405 | 54.329004 |
601 | 2023-05-11 | Tether | USDT | 1.000780 | 1.000790 | 1.000000 | 1.000430 | 9.465740e+07 | 1.000605 | 1.000504 | 1.000602 | 1.000845 | -0.034973 | -6661882.15 | 0.000916 | 114.678899 | 51.120163 |
602 | 2023-05-12 | Tether | USDT | 1.000420 | 1.000670 | 1.000220 | 1.000280 | 8.327807e+07 | 1.000350 | 1.000489 | 1.000562 | 1.000848 | -0.013994 | -11379326.46 | 0.000886 | 122.857143 | 44.078947 |
21153 rows × 17 columns
Interestingly enough, the smaller coins appear to be much more heavily traded and frequently have RSI values below 30 or above 70, which indicate oversold or overbought positions. Big coins like Bitcoin, however, appear to have relatively stable RSIs. There might be some relation between RSI and Growth, as these results resemble those from plotting growth. We can plot both on a graph with scaled axes to see if there is any correlation between the two.
for name in name_list:
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    # Filtering to only the data for the coin we are currently looking at
    filtered_df = (final_cp[final_cp['Name'] == name]).copy()
    # Creating graph components for each line
    trace1 = go.Scatter(x=filtered_df['Date'], y=filtered_df['Average'], name="Average")
    trace2 = go.Scatter(x=filtered_df['Date'], y=filtered_df['RSI'], name="RSI")
    trace3 = go.Scatter(x=filtered_df['Date'], y=filtered_df['High'], name="High")
    trace4 = go.Scatter(x=filtered_df['Date'], y=filtered_df['Low'], name="Low")
    # RSI goes on its own secondary axis since it is bounded between 0 and 100
    fig.add_trace(trace1, secondary_y=False)
    fig.add_trace(trace2, secondary_y=True)
    fig.add_trace(trace3, secondary_y=False)
    fig.add_trace(trace4, secondary_y=False)
    # Setting title and axis labels
    fig.update_layout(title=f"Price of {name} with Low, High, Average, and RSI",
                      xaxis_title="Date", yaxis_title="Price per Coin (USD)")
    fig.show()
It appears that RSI rapidly increases or decreases when more significant trends in prices occur, as is visible in the graphs for some of the cryptocurrencies. Seeing the similar flowing pattern, it is safe to conclude that there is some correlation between RSI and the OHLC data as a whole, and that it can help us understand the momentum of our currency data. Additionally, RSI shows us rapid relative fluctuations in a currency that we sometimes cannot see otherwise: a massive gain stretches the y-axis to a larger scale, which compresses the trends in the lower-value regions.
In this section we spent some time looking at our data, calculating all kinds of indicators, and visualizing them alongside other data to look for trends, which let us hypothesize about and explore relations in the data. Now that we have calculated some new fields in our dataset and identified where correlations between variables do or do not exist, we can move on to the next step in the data science lifecycle.
Now that we have a solid understanding of our data and have generated a few quantities that reveal characteristics that weren't visible before, we can put together a model to build a deeper understanding of the data and apply it to circumstances beyond the data itself. To figure out what we are going to do, let us first take a look at our data with everything calculated.
final_ml = final_cp.copy()
final_ml
 | Date | Name | Ticker | Open | High | Low | Close | Volume | Average | SMA25 | SMA50 | SMA100 | Growth | C_Volume | ATR | ADX | RSI
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2013-05-05 | Bitcoin | BTC | 112.900002 | 118.800003 | 107.142998 | 115.910004 | 0.000000e+00 | 114.405003 | NaN | NaN | NaN | 2.666078 | NaN | NaN | NaN | NaN |
2 | 2013-05-06 | Bitcoin | BTC | 115.980003 | 124.663002 | 106.639999 | 112.300003 | 0.000000e+00 | 114.140003 | NaN | NaN | NaN | -3.172961 | 0.00 | NaN | NaN | NaN |
4 | 2013-05-07 | Bitcoin | BTC | 112.250000 | 113.444000 | 97.699997 | 111.500000 | 0.000000e+00 | 111.875000 | NaN | NaN | NaN | -0.668151 | 0.00 | NaN | NaN | NaN |
7 | 2013-05-08 | Bitcoin | BTC | 109.599998 | 115.779999 | 109.599998 | 113.566002 | 0.000000e+00 | 111.583000 | NaN | NaN | NaN | 3.618616 | 0.00 | NaN | NaN | NaN |
9 | 2013-05-09 | Bitcoin | BTC | 113.199997 | 113.459999 | 109.260002 | 112.669998 | 0.000000e+00 | 112.934998 | NaN | NaN | NaN | -0.468197 | 0.00 | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
598 | 2023-05-08 | Tether | USDT | 1.001390 | 1.001660 | 1.000600 | 1.000840 | 1.071410e+08 | 1.001115 | 1.000535 | 1.000755 | 1.000828 | -0.054924 | 62362190.82 | 0.001539 | 33.720930 | 58.581236 |
599 | 2023-05-09 | Tether | USDT | 1.000840 | 1.001770 | 1.000290 | 1.000540 | 7.751265e+07 | 1.000690 | 1.000522 | 1.000702 | 1.000834 | -0.029975 | -29628362.87 | 0.000922 | 9.278351 | 54.623656 |
600 | 2023-05-10 | Tether | USDT | 1.000540 | 1.000870 | 1.000200 | 1.000790 | 1.013193e+08 | 1.000665 | 1.000512 | 1.000648 | 1.000840 | 0.024987 | 23806631.47 | 0.000914 | 19.018405 | 54.329004 |
601 | 2023-05-11 | Tether | USDT | 1.000780 | 1.000790 | 1.000000 | 1.000430 | 9.465740e+07 | 1.000605 | 1.000504 | 1.000602 | 1.000845 | -0.034973 | -6661882.15 | 0.000916 | 114.678899 | 51.120163 |
602 | 2023-05-12 | Tether | USDT | 1.000420 | 1.000670 | 1.000220 | 1.000280 | 8.327807e+07 | 1.000350 | 1.000489 | 1.000562 | 1.000848 | -0.013994 | -11379326.46 | 0.000886 | 122.857143 | 44.078947 |
21153 rows × 17 columns
If we look at our data, we have a lot of factors that we can take into account. In particular, we saw a correlation between ATR and close price, so we can put those two together in some sort of prediction model. Additionally, we can build the model around OHLCV data, as that is the backbone of the other metrics. We also saw that RSI can indicate whether a price is overbought or oversold, so we can factor that in. Finally, in predicting a trend, we can utilize ADX to tell us the strength of the trend and give some insight into whether a cryptocurrency will continue moving in the direction it is going. We should build a hypothesis and model around these factors.
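Before committing to a model, one quick sanity check is to compute pairwise correlations among the numeric columns and see which indicators move with the average. This is a sketch using pandas' built-in corr() on a small synthetic frame whose column names merely mirror ours (in practice you would call it on the real dataframe):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
close = rng.random(100).cumsum() + 10  # a drifting price-like series

# Illustrative frame: Average tracks Close closely, RSI is unrelated noise here
df = pd.DataFrame({
    "Close": close,
    "Average": close + rng.normal(0, 0.1, 100),
    "RSI": rng.uniform(20, 80, 100),
})

# Pairwise Pearson correlations among the numeric columns
corr = df.corr()
print(corr["Average"].sort_values(ascending=False))
```

Columns with correlations near ±1 against Average are strong linear candidates; values near 0 (like the synthetic RSI here) suggest any relationship, if present, is nonlinear.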
Let us set a hypothesis to validate our claims from the exploratory data analysis and visualization. We will follow the standard assumption that the selected variables have no correlation with the dependent variable. In our case specifically, let us state as a null hypothesis that all of the factors, excluding the non-numeric columns and the average itself, have no correlation with the average. Our alternative hypothesis is that this is not the case and that they do relate to each other.
Okay, with our understanding of the factors reviewed and our hypothesis set, let us now go ahead and figure out the model that will utilize all of those factors and help us test our hypothesis.
Taking all of this into account, what we can try to do for our model is predict the average price of a cryptocurrency on the next day based on the previous day's OHLC data along with our calculated indicators, through a supervised learning model. It would be difficult to figure out and verify the relationship between the dependent and independent variables through a neural network, so we will utilize a regression model. This will allow us to hypothesize about the correlation of those variables and determine whether they really do play some part in indicating how the average price moves.
Now that we know we will utilize a regression model, we need to select which one. A quick look at the data and our graphs indicates that there is not a lot of linearity in the relationships the data exhibits. As such, it would not be a good idea to utilize a linear regression model here. What we can utilize instead is a supervised learning technique that builds a regression model for us: Random Forest Regression.
As stated previously, a simple regression would not cut it, as the movements of the assets are complicated enough that a linear relation cannot express with certainty the trend an asset is following. Random Forest Regression is an application of ensemble learning and is well suited to regression problems. To learn more about Random Forest Regression and see the reference used in putting together this tutorial, check out the article linked here.
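To make the nonlinearity point concrete, here is a minimal sketch on synthetic data (not our price series) where a random forest captures a curved relationship that a straight line cannot:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 300)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 300)  # nonlinear signal with noise

lin = LinearRegression().fit(X, y)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# The straight line can barely explain a sine wave; the forest tracks it closely
print("linear R^2:", r2_score(y, lin.predict(X)))
print("forest R^2:", r2_score(y, rf.predict(X)))
```

The forest achieves this by averaging many decision trees fit on bootstrap samples, each of which partitions the input space into piecewise-constant regions, so no global linear form is ever assumed.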
Okay, we will use the RandomForestRegressor() built into scikit-learn to try and create a model to predict cryptocurrency data. The documentation is linked here. What we will do is build a regression model for each of the 10 cryptocurrencies we decided to look at, and pass in the data points and indicators we have for that asset to create an average-price prediction model through regression.
for name in name_list:
    # Narrow the data down to one coin at a time since the model predicts per asset
    regressor_df = (final_ml[final_ml['Name'] == name]).copy()
    # Sorting by date so that the rows are in proper chronological order
    regressor_df = regressor_df.sort_values(['Date'])
    regressor_df = regressor_df.set_index('Date')
    # Dropping non-numerical columns and quantities we don't want used for regression
    regressor_df = regressor_df.drop(columns=['Volume', 'Name', 'Ticker'])
    # Removing all rows with NaN or inf values
    regressor_df = regressor_df.dropna()
    regressor_df = regressor_df[~regressor_df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
    # Shift the features down 1 so that yesterday's data predicts today's average
    regressor_shifted = pd.concat([regressor_df.iloc[:, 1:4].shift(1),
                                   regressor_df['Average'],
                                   regressor_df.iloc[:, 5:].shift(1)], axis=1)
    # Clean up the NaN row introduced by the shift
    regressor_shifted = regressor_shifted.dropna()
    # Splitting into training and test data
    x_train, x_test, y_train, y_test = train_test_split(regressor_shifted.drop(columns=['Average']),
                                                        regressor_shifted['Average'])
    # Creating a Random Forest Regressor with base parameters and training it
    rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(x_train, y_train)
    # Using the trained model to predict on the test data
    prediction = rf.predict(x_test)
    # Combining the results into a dataframe for visualization
    x_test = x_test.reset_index()
    actual_y = y_test.reset_index().sort_values('Date')
    combined_results = pd.concat([x_test, pd.Series(prediction, name="Predicted")], axis=1)
    combined_results = combined_results.sort_values('Date')
    fig = make_subplots(rows=1, cols=1)
    # Creating traces for the actual and predicted averages
    trace1 = go.Scatter(x=combined_results['Date'], y=combined_results['Predicted'], name="Predicted")
    trace2 = go.Scatter(x=actual_y['Date'], y=actual_y['Average'], name="Actual")
    fig.add_trace(trace1, row=1, col=1)
    fig.add_trace(trace2, row=1, col=1)
    # Setting title and axis labels
    fig.update_layout(title=f"Actual and Predicted Average Price of {name} Over Time",
                      xaxis_title="Date", yaxis_title="Price per Coin (USD)")
    fig.show()
    # Calculating MSE, RMSE, and the R^2 value to see how well the model did
    mse = mean_squared_error(y_test, prediction)
    rmse = mean_squared_error(y_test, prediction, squared=False)
    r_2 = r2_score(y_test, prediction)
    print(f'Random Forest Regressor Model for {name}')
    print('Mean squared error:', mse)
    print('Root mean squared error:', rmse)
    print('R-squared:', r_2)
    print("\n")
Random Forest Regressor Model for Bitcoin
Mean squared error: 286973.5969311636
Root mean squared error: 535.6991664462095
R-squared: 0.9986472716405851

Random Forest Regressor Model for Ethereum
Mean squared error: 2005.4328645104265
Root mean squared error: 44.782059627828936
R-squared: 0.9980645002297991

Random Forest Regressor Model for Tether
Mean squared error: 1.036084692556676e-05
Root mean squared error: 0.0032188269486828213
R-squared: 0.865471259588598

Random Forest Regressor Model for Cardano
Mean squared error: 0.0009526060586103359
Root mean squared error: 0.03086431691468865
R-squared: 0.997058372751413

Random Forest Regressor Model for Dogecoin
Mean squared error: 3.969158945535347e-05
Root mean squared error: 0.006300126145987354
R-squared: 0.9938422023954274

Random Forest Regressor Model for Solana
Mean squared error: 15.736115639046183
Root mean squared error: 3.966877315855153
R-squared: 0.9950806564738562

Random Forest Regressor Model for Polygon
Mean squared error: 0.0037114403102814987
Root mean squared error: 0.06092159149498229
R-squared: 0.9911664148045579

Random Forest Regressor Model for Polkadot
Mean squared error: 1.2741498782012448
Root mean squared error: 1.12878247603391
R-squared: 0.9932244691495107

Random Forest Regressor Model for Litecoin
Mean squared error: 14.233409907159855
Root mean squared error: 3.772719166219486
R-squared: 0.9965445641672661

Random Forest Regressor Model for Shiba Inu
Mean squared error: 2.3202678348126907e-12
Root mean squared error: 1.523242539720018e-06
R-squared: 0.9892966258211765
Amazing! Our R^2 is greater than 0.95 for most of the currencies, with Tether being the exception. We can see this visually as well in how closely our predicted values map to the actual values. Tether's lower score could be due to different behavior shown by that currency, or the default parameters used for the Random Forest Regressor may simply not suit it. A way to fix that would be to fine-tune the model; however, 0.87 is still an acceptable R^2, so we will not delve further into it. To see how fine-tuning utilizing grid search could be done, check out the articles linked earlier.
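For completeness, a grid search over the forest's parameters might look like the following sketch. The parameter grid and synthetic data are illustrative assumptions, not tuned values for our dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.05, 200)

# Candidate values to sweep; cross-validation picks the best combination
param_grid = {"n_estimators": [25, 50], "max_depth": [5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

GridSearchCV refits every parameter combination on each cross-validation fold, so the sweep cost grows multiplicatively with the grid size; keeping the grid small, as here, is usually the practical starting point.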
On another note, we can reject our null hypothesis. Given the very high R^2 values and the very low errors, it is safe to conclude that the factors produce a good fit to the data. As such, we reject the null hypothesis and accept the alternative: there is a relationship between our independent factors and the dependent variable.
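One way to back up that conclusion on a per-feature basis is the forest's built-in feature_importances_ attribute, sketched here on synthetic data with one informative column and one pure-noise column (the column names are hypothetical, not from our dataset):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.random(300), "noise": rng.random(300)})
y = 3 * X["signal"] + rng.normal(0, 0.05, 300)  # only "signal" drives the target

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# Impurity-based importance: how much each column reduces error across the trees
importances = dict(zip(X.columns, rf.feature_importances_))
print(importances)
```

Running the same idea on the trained per-coin models would show which of the OHLC columns and indicators the forest actually leans on when predicting the average.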
Well, we've reached the end of the journey. I hope that after this tutorial you have developed a better understanding of how to analyze cryptoasset data and how to train a machine to make predictions about some aspect of that data. While this model is extremely simple and not designed for high-frequency trading, it still gets the point across that we are able to analyze the data on some level and make a more informed decision than blindly spending money in some ineffective manner. In terms of a strategy you could employ utilizing the model, you could construct a basic "Buy-Low Sell-High" strategy around the movements of the average price of an asset you are interested in. As time progresses and technology develops, it will become even easier to develop these tools for our own use. Time waits for nobody; I hope that you seize the moment and take advantage of the resources around you as you pave the way towards a financially free future.
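As a closing illustration, that "Buy-Low Sell-High" idea could be sketched as a simple signal rule comparing today's average to the model's next-day prediction. The numbers below are hypothetical, not actual model output, and none of this is financial advice:

```python
import pandas as pd

# Hypothetical series: today's average price and the model's next-day prediction
df = pd.DataFrame({
    "today_avg": [100.0, 98.0, 101.0, 105.0],
    "predicted_next_avg": [98.0, 102.0, 104.0, 103.0],
})

# Buy when the model expects a rise, sell when it expects a fall
df["signal"] = df.apply(
    lambda r: "buy" if r["predicted_next_avg"] > r["today_avg"] else "sell", axis=1
)
print(df["signal"].tolist())  # ['sell', 'buy', 'buy', 'sell']
```

A real strategy would layer on transaction costs, position sizing, and a tolerance band around zero so tiny predicted moves don't trigger trades.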