Data Cleaning

Introduction

This page documents the data cleaning work that serves as the foundation for the subsequent EDA and modeling analysis. Note that data cleaning is an iterative process: this section may be revisited many times as the project progresses, with content continually added and refined.

Data Cleaning Process

The data cleaning process is described below. The corresponding visualizations, such as heatmaps and box plots, are generated and displayed in the Code section.

1. Data Type Correction and Formatting

Given the large number of raw datasets (31 in total) and the subsequent need to synthesize the processed datasets required for analysis, we handled data type conversion first. All raw columns are read in as objects; the date columns were converted to datetime format, while all other columns were converted to floating point. This format meets the needs of the subsequent data analysis.
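
For illustration, the conversion amounts to the following minimal sketch on a hypothetical raw frame (the full version appears in Step 2 of the cleaning functions in the Code section):

import pandas as pd

# Hypothetical two-row raw frame; every column is read in as object
raw = pd.DataFrame({'Date': ['2022-04-02', '2022-04-03'],
                    'Adj_Close': ['100.0', '102.0']})
raw['Date'] = pd.to_datetime(raw['Date'], errors='coerce')            # dates -> datetime64
raw['Adj_Close'] = pd.to_numeric(raw['Adj_Close'], errors='coerce')   # prices -> float64
print(raw.dtypes)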

2. Feature Engineering

To ensure consistency in the presentation of the foreign exchange data, all FX values were converted to a U.S. dollar base. Cryptocurrency return rates and FX percentage changes were then calculated from the adjusted close price. Data frames were generated with dates as rows, cryptocurrency or FX names as columns, and return rates or percentage changes as cell values.
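
As a condensed, illustrative sketch of this step on synthetic prices (the full pipeline over the real CSV files appears in the Code section below):

import pandas as pd

# Hypothetical adjusted close prices for two assets over three days
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2022-04-02', '2022-04-03', '2022-04-04'] * 2),
    'Crypto': ['BTC-USD'] * 3 + ['ETH-USD'] * 3,
    'Adj_Close': [100.0, 102.0, 101.0, 50.0, 49.0, 50.5],
})
toy['Return_Rate'] = toy.groupby('Crypto')['Adj_Close'].pct_change()  # per-asset daily returns
wide = toy.pivot_table(index='Date', columns='Crypto', values='Return_Rate')
print(wide)  # dates as rows, assets as columns, returns as cell values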

3. Managing Missing Data

Given the stability and reliability of the data sources, we expected the number of missing values to be minimal. We therefore used heat map visualizations as a quick means of identifying whether missing values are present and where they are located.

  • Crypto Data: A preliminary examination of the datasets revealed that certain cryptocurrencies have data beginning after January 1, 2019. The only anomalous gap was observed for AVAX-USD, whose data were entirely absent between July 15 and September 21, 2020, with price data available only for July 13 and 14, 2020 before the gap. We therefore removed the observations for the 13th and 14th and set the data start date for AVAX-USD to September 22, 2020. The data inception dates for all cryptocurrencies were then documented for later analysis.

  • Forex Data: The heat map revealed only two instances of missing data, so we chose to fill them in using the average percent change of the days immediately preceding and succeeding the missing date (see the sketch after this list).
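
A minimal sketch of this fill rule, assuming a wide date-by-pair frame such as the fx_rates frame built in the Code section (the commented-out fill_missing_values_avg function there implements the same idea):

import pandas as pd

def fill_with_neighbor_mean(df: pd.DataFrame) -> pd.DataFrame:
    # Replace each NaN with the mean of the previous and next day's values
    neighbor_mean = (df.shift(1) + df.shift(-1)) / 2
    return df.fillna(neighbor_mean)

# e.g. fx_rates = fill_with_neighbor_mean(fx_rates)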

4. Outlier Detection and Treatment

In light of the project’s research objectives and the intrinsic characteristics of financial markets, we contend that extreme returns and volatility changes should be retained, as they may reflect the market-volatility effects we are interested in. Accordingly, this section combines z-score and heatmap methods to identify and locate outliers rather than remove them.
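
As a compact illustration of the flagging rule (the heatmap version used in this project appears in the Code section below), an observation is flagged when its absolute column-wise z-score exceeds 3; the hypothetical helper below simply counts the flags per asset:

import pandas as pd
from scipy import stats

def count_outliers(df: pd.DataFrame, threshold: float = 3.0) -> pd.Series:
    # Column-wise z-scores ignoring NaNs; flag |z| > threshold and count per column
    z = pd.DataFrame(stats.zscore(df, nan_policy='omit'),
                     index=df.index, columns=df.columns)
    return (z.abs() > threshold).sum()

# e.g. count_outliers(crypto_returns) gives the number of flagged days per asset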

5. Data Correction & Adjustment

During data validation, we discovered that the adjusted close price of SHIB-USD was 0 until April 16, 2021, most likely because SHIB’s long-term price is below the precision supported by Yahoo Finance. Since all subsequent prices are extremely low and volatile, we considered replacing this cryptocurrency. Its market share ranks only 14th among the acquired cryptocurrencies and does not exert a decisive influence on the dataset, so we replaced it with TON11419, which has a similar market share but more reliable price data. The new dataset has been collected, cleaned, and will be used in the subsequent analysis.

In the subsequent analysis, we found that the prolonged absence of data for some cryptocurrencies forced us to narrow the date range, to avoid affecting analysis techniques that are sensitive to missing values. Since the amount of data remains sufficient, we modified the start date of data collection so that all cryptocurrencies have complete raw data.
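
This truncation is not spelled out in the Code section, so the following is only a minimal sketch of the idea, assuming the wide crypto_returns frame built below:

# Latest first-valid date across all assets becomes the common start date,
# so every cryptocurrency has a complete series from that point onward
common_start = crypto_returns.apply(lambda s: s.first_valid_index()).max()
crypto_returns_trimmed = crypto_returns.loc[crypto_returns.index >= common_start]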

In addition, in our subsequent EDA and unsupervised learning analysis, we found that, given the inherent complexity of financial market data, it was difficult to find significant patterns in daily volatility. We therefore also generated weekly datasets for subsequent analysis, applying similar cleaning and formatting steps (see the weekly aggregation code at the end of the Code section).

Code

# Import needed packages
import pandas as pd
import os
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Suppress SettingWithCopyWarning
warnings.filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

# Define cleaning function of crypto data
def clean_crypto_data(df):
    # Step1: Adjust the structure
    df = df.iloc[2:]
    df.reset_index(drop=True, inplace=True)
    df.rename(columns={'Price': 'Date', 'Adj Close': 'Adj_Close'}, inplace=True)

    # Step2: Convert data types
    numeric_cols = ['Adj_Close', 'Close', 'High', 'Low', 'Open', 'Volume']
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

    # Step3: Get return rate
    df['Return_Rate'] = df['Adj_Close'].pct_change()

    return df

# Process all crypto CSV files
crypto_list = [
    "BTC-USD", "ETH-USD", "XRP-USD", "USDT-USD", "SOL-USD", "BNB-USD", 
    "DOGE-USD", "ADA-USD", "USDC-USD", "STETH-USD", "WTRX-USD", "TRX-USD",
    "AVAX-USD", "WSTETH-USD", "TON11419-USD"
]
crypto_data_combined = pd.DataFrame()
for symbol in crypto_list:
    file_path = f'../../data/raw-data/{symbol}.csv'
    df = pd.read_csv(file_path)
    df_cleaned = clean_crypto_data(df)
    df_cleaned['Crypto'] = symbol
    # Append to combined DataFrame
    crypto_data_combined = pd.concat([crypto_data_combined, df_cleaned], ignore_index=True)
crypto_data_combined.head()

# Subset the combined DataFrame for Return_Rate and reshape it
crypto_returns = crypto_data_combined.pivot_table(
    index='Date', 
    columns='Crypto', 
    values='Return_Rate'
)

# Check missing values if any through visualization
def visualize_missing_values(df):
    plt.figure(figsize=(15, 8))
    sns.heatmap(df.isna(), cbar=False, cmap='viridis')
    plt.title("Missing Values Heatmap")
    plt.show()
visualize_missing_values(crypto_returns)

# # Handle missing values
# crypto_returns.loc['2020-07-14', 'AVAX-USD'] = float('nan')
# visualize_missing_values(crypto_returns)

# # Define each crypto's start date
# def check_start_dates(df, crypto_list):
#     for crypto in crypto_list:
#         crypto_data = df[crypto].dropna()
#         if not crypto_data.empty:
#             start_date = crypto_data.index.min()
#             print(f"{crypto}: Start Date = {start_date}")
#         else:
#             print(f"{crypto}: No data available.")
# check_start_dates(crypto_returns, crypto_list)

# Check data types and convert if needed
crypto_returns_reset = crypto_returns.reset_index()
print(crypto_returns_reset.dtypes)

# Check Outliers by visualization
def visualize_outliers(df):
    plt.figure(figsize=(15, 8))
    sns.boxplot(data=df, palette="Set2")
    plt.title("Boxplot of Cryptocurrency Return Rates")
    plt.xlabel("Cryptocurrency")
    plt.ylabel("Return Rate")
    plt.xticks(rotation=45, ha='right', fontsize=10)    
    plt.show()
visualize_outliers(crypto_returns)

# Visualize Outliers as Heatmap
def visualize_outliers_heatmap(df, threshold=3):
    z_scores = stats.zscore(df, nan_policy='omit')
    outlier_mask = abs(z_scores) > threshold
    
    plt.figure(figsize=(15, 8))
    sns.heatmap(outlier_mask, cmap='Reds', cbar=False)
    plt.title("Outlier Detection Heatmap (Z-score)")
    plt.xlabel("Cryptocurrency")
    plt.ylabel("Date")
    plt.show()
visualize_outliers_heatmap(crypto_returns)

# Save cleaned data
print(crypto_returns.head())
cleaned_filepath = '../../data/processed-data/crypto_returns_cleaned.csv'
crypto_returns.to_csv(cleaned_filepath)

Crypto
Date            datetime64[ns]
ADA-USD                float64
AVAX-USD               float64
BNB-USD                float64
BTC-USD                float64
DOGE-USD               float64
ETH-USD                float64
SOL-USD                float64
STETH-USD              float64
TON11419-USD           float64
TRX-USD                float64
USDC-USD               float64
USDT-USD               float64
WSTETH-USD             float64
WTRX-USD               float64
XRP-USD                float64
dtype: object

Crypto       ADA-USD  AVAX-USD   BNB-USD   BTC-USD  DOGE-USD   ETH-USD  \
Date                                                                     
2022-04-02 -0.008081 -0.007445 -0.021605 -0.008917 -0.016997 -0.001302   
2022-04-03  0.025894  0.021928  0.030652  0.012745  0.054278  0.022576   
2022-04-04  0.022770 -0.026077 -0.006271  0.003640  0.014599 -0.000452   
2022-04-05 -0.034745 -0.047101 -0.005256 -0.022879  0.163644 -0.031082   
2022-04-06 -0.100811 -0.088608 -0.056217 -0.051568 -0.170554 -0.070374   

Crypto       SOL-USD  STETH-USD  TON11419-USD   TRX-USD  USDC-USD  USDT-USD  \
Date                                                                          
2022-04-02 -0.016604   0.002508     -0.021115 -0.021951 -0.000650 -0.000014   
2022-04-03  0.034618   0.021610      0.009449 -0.001256  0.000751 -0.000071   
2022-04-04 -0.031897  -0.004387     -0.031880 -0.032013  0.000028 -0.000045   
2022-04-05 -0.041917  -0.030522     -0.035178 -0.019614 -0.000426 -0.000034   
2022-04-06 -0.110189  -0.069029     -0.041537 -0.088453 -0.000111 -0.000039   

Crypto      WSTETH-USD  WTRX-USD   XRP-USD  
Date                                        
2022-04-02    0.000758 -0.017695 -0.003408  
2022-04-03    0.021470 -0.002618  0.021858  
2022-04-04   -0.001200 -0.031035 -0.018624  
2022-04-05   -0.029230 -0.019782 -0.011616  
2022-04-06   -0.071138 -0.089449 -0.068891  
# Define cleaning function of fx data
def clean_fx_data(df, symbol):
    # Step1: Adjust the structure
    df = df.iloc[2:]
    df.reset_index(drop=True, inplace=True)
    df.rename(columns={'Price': 'Date', 'Adj Close': 'Adj_Close'}, inplace=True)

    # Step2: Convert data types
    numeric_cols = ['Adj_Close', 'Close', 'High', 'Low', 'Open', 'Volume']
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

    # Step3: Convert to USD/ if needed
    if "USD" not in symbol:
        df['Adj_Close'] = 1 / df['Adj_Close']

    # Step4: Get rate
    df['Percent_Change'] = df['Adj_Close'].pct_change()

    return df

# Process all forex CSV files
fx_list = [
    "EURUSD=X", "JPY=X", "GBPUSD=X", "AUDUSD=X", "NZDUSD=X", "CNY=X",
    "HKD=X", "SGD=X", "INR=X", "MXN=X", "PHP=X", "IDR=X",
    "THB=X", "MYR=X", "ZAR=X", "RUB=X"
]
fx_data_combined = pd.DataFrame()
for symbol in fx_list:
    file_path = f'../../data/raw-data/{symbol}.csv'
    df = pd.read_csv(file_path)
    df_cleaned = clean_fx_data(df, symbol)
    df_cleaned['FX'] = symbol
    # Append to combined DataFrame
    fx_data_combined = pd.concat([fx_data_combined, df_cleaned], ignore_index=True)
fx_data_combined.head()

# Subset the combined DataFrame for rate and reshape it
fx_rates = fx_data_combined.pivot(
    index='Date', 
    columns='FX', 
    values='Percent_Change'
)

# Check missing values if any through visualization
def visualize_missing_values(df):
    plt.figure(figsize=(15, 8))
    sns.heatmap(df.isna(), cbar=False, cmap='viridis')
    plt.title("Missing Values Heatmap")
    plt.show()
visualize_missing_values(fx_rates)

# Handle missing values
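# The first date row is all NaN because pct_change has no prior observation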
fx_rates = fx_rates.drop(fx_rates.index[0])
# def fill_missing_values_avg(df):
#     prev_val = df.shift(1)
#     next_val = df.shift(-1)
#     df.fillna((prev_val + next_val) / 2, inplace=True)
#     return df
# fx_rates = fill_missing_values_avg(fx_rates)
# visualize_missing_values(fx_rates)
# fx_rates.head()

# Check data types and convert if needed
fx_rates_reset = fx_rates.reset_index()
print(fx_rates_reset.dtypes)

# Check Outliers by visualization
def visualize_outliers(df):
    plt.figure(figsize=(15, 8))
    sns.boxplot(data=df, palette="Set2")
    plt.title("Boxplot of Forexcurrency Change Rates")
    plt.xlabel("Forexcurrency")
    plt.ylabel("Percentage Change")
    plt.xticks(rotation=45, ha='right', fontsize=10)    
    plt.show()
visualize_outliers(fx_rates)

# Visualize Outliers as Heatmap
def visualize_outliers_heatmap(df, threshold=3):
    z_scores = stats.zscore(df, nan_policy='omit')
    outlier_mask = abs(z_scores) > threshold
    
    plt.figure(figsize=(15, 8))
    sns.heatmap(outlier_mask, cmap='Reds', cbar=False)
    plt.title("Outlier Detection Heatmap (Z-score)")
    plt.xlabel("Forexcurrency")
    plt.ylabel("Date")
    plt.show()
visualize_outliers_heatmap(fx_rates)

# Save cleaned data
print(fx_rates.head())
cleaned_filepath = '../../data/processed-data/fx_rates_cleaned.csv'
fx_rates.to_csv(cleaned_filepath)

FX
Date        datetime64[ns]
AUDUSD=X           float64
CNY=X              float64
EURUSD=X           float64
GBPUSD=X           float64
HKD=X              float64
IDR=X              float64
INR=X              float64
JPY=X              float64
MXN=X              float64
MYR=X              float64
NZDUSD=X           float64
PHP=X              float64
RUB=X              float64
SGD=X              float64
THB=X              float64
ZAR=X              float64
dtype: object

FX          AUDUSD=X     CNY=X  EURUSD=X  GBPUSD=X     HKD=X     IDR=X  \
Date                                                                     
2022-04-04 -0.000726 -0.003647 -0.002265 -0.003289 -0.000143 -0.000696   
2022-04-05  0.007669 -0.000047 -0.006465  0.000826  0.000018  0.002702   
2022-04-06  0.005541 -0.000110 -0.006326 -0.002838  0.000089 -0.003007   
2022-04-07 -0.010627  0.000723 -0.000752 -0.000405 -0.000427 -0.001147   
2022-04-08 -0.003545 -0.000189 -0.003064 -0.000209  0.000082 -0.000209   

FX             INR=X     JPY=X     MXN=X     MYR=X  NZDUSD=X     PHP=X  \
Date                                                                     
2022-04-04 -0.001126 -0.006965  0.000901 -0.001188 -0.003793  0.004343   
2022-04-05  0.006847 -0.001620  0.002422 -0.001661  0.005906  0.003209   
2022-04-06  0.000362 -0.006850 -0.008671  0.001426 -0.000118  0.000525   
2022-04-07 -0.006510 -0.000542 -0.008871 -0.000949 -0.005204 -0.001380   
2022-04-08  0.000420 -0.003006  0.001547 -0.000664 -0.003849 -0.001998   

FX             RUB=X     SGD=X     THB=X     ZAR=X  
Date                                                
2022-04-04 -0.029686 -0.001695 -0.007228 -0.003741  
2022-04-05  0.005973  0.000000  0.001226  0.005215  
2022-04-06 -0.002977 -0.002016 -0.003219 -0.006639  
2022-04-07  0.022841 -0.000720 -0.000715 -0.000382  
2022-04-08  0.036294 -0.001101  0.002508 -0.003057  
import numpy as np

# Function for Weekly Returns
def calculate_weekly_returns(df, value_col, date_col='Date'):
    df[date_col] = pd.to_datetime(df[date_col])
    weekly_returns = (
        df.set_index(date_col)[value_col]
        .resample('W-FRI')
        .last()
        .pct_change()
    )
    return weekly_returns.reset_index()

# Weekly Aggregation for Crypto Data
weekly_crypto = pd.DataFrame()
for symbol in crypto_list:
    crypto_weekly = calculate_weekly_returns(
        crypto_data_combined[crypto_data_combined['Crypto'] == symbol],
        'Adj_Close'
    )
    crypto_weekly['Crypto'] = symbol
    weekly_crypto = pd.concat([weekly_crypto, crypto_weekly], ignore_index=True)

weekly_crypto_returns = weekly_crypto.pivot_table(
    index='Date', 
    columns='Crypto', 
    values='Adj_Close'
)
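# Drop the final resampled week, which is assumed to be an incomplete W-FRI bin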
weekly_crypto_returns = weekly_crypto_returns.drop(weekly_crypto_returns.index[-1])

# Save Weekly Cryptocurrency Data
weekly_crypto_filepath = '../../data/processed-data/weekly_crypto_returns.csv'
weekly_crypto_returns.to_csv(weekly_crypto_filepath)

# Weekly Aggregation for Forex Data
weekly_fx = pd.DataFrame()
for symbol in fx_list:
    fx_weekly = calculate_weekly_returns(
        fx_data_combined[fx_data_combined['FX'] == symbol],
        'Adj_Close'
    )
    fx_weekly['FX'] = symbol
    weekly_fx = pd.concat([weekly_fx, fx_weekly], ignore_index=True)

weekly_fx_rates = weekly_fx.pivot_table(
    index='Date', 
    columns='FX', 
    values='Adj_Close'
)

# Save Weekly Forex Data
weekly_fx_filepath = '../../data/processed-data/weekly_fx_rates.csv'
weekly_fx_rates.to_csv(weekly_fx_filepath)

# Visualization Check
def visualize_weekly_data(df, title):
    plt.figure(figsize=(15, 8))
    sns.heatmap(df.isna(), cmap='coolwarm', cbar=False)
    plt.title(f"Missing Values Heatmap - {title}")
    plt.show()

visualize_weekly_data(weekly_crypto_returns, "Weekly Cryptocurrency Returns")
visualize_weekly_data(weekly_fx_rates, "Weekly Forex Percent Change")

The raw and cleaned datasets are stored locally in the following folders:

  • Raw Data: ../../data/raw-data/
  • Cleaned Data: ../../data/processed-data/

Note: These directories are part of the local project repository and are excluded from version control on GitHub. They are included here for reference only and will not be accessible externally.