Diet and Covid-19

Motto: "Let food be thy medicine, and let medicine be thy food"
Hippocrates

alt text

Outline

  • Introduction and motivation
  • Requirements (machine learning libraries)
  • Dataset
    • Data analysis and description
    • Distribution of the data
    • Data analysis by country
  • Machine learning algorithm
  • Conclusions and discussion

Introduction and motivation

Since the Spring of 2020, the entire world is facing an ongoing pandemic caused by the SARS-CoV-2 virus. The symptoms of COVID-19 are various: some people don't even know that they are infected, while others confront with life-threatening symptoms. As the virus is new, researchers still have a lot of questions to answer: how is it spread, which medicine should be use to treat the symptoms and how can we prevent getting infected.

It is believed that the virus spreads through air and via contaminated surfaces. To prevent the virus, authorities recommend social distancing, wearing face masks in public, frequent hand washing and disinfecting surfaces. Also, multiple vaccines have been developed and distributed to the population. For infected persons, there are some treatments that addreess COVID-19 symptoms, but there are no drugs that inhibit the virus.

Some specialists also recommend taking vitamins (vitamin C and vitamin D3) to improve our imune system. It is known that our eating habits and lifestyle make a great difference for a healthy immune system. Therefore, the main objective of this project is to analyse if there is a correlation between the diet (which is said to influence our immune system) and the COVID cases in different regions around the world.

In this project I want to study the relationship between COVID-19 cases and COVID-19 related deaths, and the diet of different countries. It is said that a healthy diet (lots of fruit and vegetables, reduced consumption of alcohol and animal products) and a right pyhsiological state (not obese nor undernourished), can help boost your immune system and therefore increase one's resistance to the virus.

So, I hope that after finishing this tutorial, the reader will understand the steps required to do a data science project, and will become more aware on how a healty diet can help boost the immune system.

Requirements (machine learning libraries)

I used the following machine learning libraries:

  • pandas 1.1.5 - for loading and manipulating csv files
  • matplotlib 3.3.3 and seaborn - for creating and displaying various charts
  • numpy 1.19.5 - for working with arrays
  • scikit-learn 0.24.1 - for various machine learning models and for metrics to evaluate my models
In [ ]:
!pip install seaborn
!pip install plotly
In [30]:
import seaborn
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as plt_exp
%matplotlib inline

Dataset

The data that I will work with for this project were downloaded from kaggle: https://www.kaggle.com/mariaren/covid19-healthy-diet-dataset
The author of this dataset gathered nutrition data (energy intake, fat and protein quantity etc.), obesity and undernourished rate, as well as the most up to date confirmed/deaths/recovered/active COVID cases around the world.
The data was scraped from different websites:

  • Food and Agriculture Organization of the United Nations FAO website
  • Population Reference Bureau PRB website
  • Johns Hopkins Center for Systems Science and Engineering CSSE website
  • USDA Center for Nutrition Policy and Promotion diet intake

Data analysis and description

The first step is to load the data, handle missing values and perform exploratory data analysis.

As mentioned in the introductory section we are interested in the relationship between a healty diet and an appropriate physiological state, and the COVID-19 cases and deaths.

My assumption of a healty diet is:

  • consuming lots of fruits and vegetables
  • reduced consumption of animal products (meat)
  • reduced consumption of alchool
  • reduced consumption of sugar and sweeteners.

The available data related to the physiological state of the inhabitants is the obese and undernourished rate.

With this in mind, I will only analyse the following colums related to the diet: Fruits, Sweeteners, Alcoohol, Vegetables, Obesity, Undernourished, Meat, Vegetal_products, Animal_products.

I noticed that rows contain missing values, so I will remove those countries from the analysis.

In [32]:
# Load the data into a panda dataframe
df = pd.read_csv("Food_Supply_Quantity_kg_Data.csv")
# rename some columns that I will be using frequently
df.rename(columns = {"Miscellaneous":"Misc","Milk - Excluding Butter":"Dairy",
                    "Sugar & Sweeteners":"Sweeteners", "Fruits - Excluding Wine":"Fruits",
                    "Unit (all except Population)":"Unit",
                     "Alcoholic Beverages": "Alcoohol",
                    "Animal fats": "Animal_fats",
                    "Animal Products": "Animal_products",
                    "Vegetal Products": "Vegetal_products",
                     "Fish, Seafood": "Fish"
                    }, inplace=True)


countries_all = len(df)
# drop missing values
df = df.dropna()
countries_valid = len(df)
print('Removed {} countries from {} as they had missing infromation. Working with data from {} countries. '.format(countries_all - countries_valid, countries_all, countries_valid))

# keep only the columns we are interest in
columns_of_interest = ['Sweeteners', 'Alcoohol', 'Fruits', 'Animal_fats', 'Animal_products', 'Vegetal_products', 'Obesity', 'Undernourished', 'Meat',
                      'Deaths', 'Confirmed', 'Population', 'Vegetables', 'Dairy', 'Fish', 'Country']
df = df[columns_of_interest]

# The Undernourished column has some string values <2.5, and we need to take care of this issue
df['Undernourished'] = df.apply(lambda ft: 2.5 if ft['Undernourished'] == '<2.5' else float(ft['Undernourished']), axis = 1)
Removed 16 countries from 170 as they had missing infromation. Working with data from 154 countries. 

Distribution of the data

I will also visualize the distribution of these numerical columns.

Now, let's look at the statistical properties of the data. First, I will describe the statistical properties of the data:

In [33]:
df.describe()
Out[33]:
Sweeteners Alcoohol Fruits Animal_fats Animal_products Vegetal_products Obesity Undernourished Meat Deaths Confirmed Population Vegetables Dairy Fish
count 154.000000 154.000000 154.000000 154.000000 154.000000 154.000000 154.000000 154.000000 154.000000 154.000000 154.000000 1.540000e+02 154.000000 154.000000 154.000000
mean 2.732413 3.020031 5.534741 0.227173 12.121764 37.874880 18.449351 11.324026 3.202042 0.039882 2.060157 4.796579e+07 5.971525 6.737858 1.283970
std 1.511877 2.404583 3.167282 0.284108 6.039498 6.039735 9.519483 11.771718 1.642639 0.049285 2.396535 1.639258e+08 3.491491 5.118950 1.170999
min 0.366600 0.000000 0.659600 0.002200 1.739100 23.113200 2.100000 2.500000 0.356000 0.000000 0.000312 7.200000e+04 0.857000 0.096300 0.035000
25% 1.704675 0.906350 3.402050 0.040600 6.752650 32.869525 8.250000 2.500000 1.864250 0.002086 0.141688 3.403500e+06 3.480075 2.035425 0.535750
50% 2.539850 2.733450 4.939050 0.116850 11.830350 38.168200 21.300000 7.050000 3.255250 0.012576 1.035083 1.058950e+07 4.956050 5.522500 1.003750
75% 3.630550 4.680600 6.807550 0.255075 17.122875 43.244575 25.700000 15.075000 4.306275 0.069680 3.534770 3.383650e+07 7.789925 10.963900 1.698925
max 9.725900 15.370600 19.302800 1.355900 26.886500 48.258500 45.500000 59.600000 8.092900 0.185428 10.408199 1.402385e+09 16.701900 20.837800 8.795900
In [34]:
# Plot the distribution of the numerical columns that I'm analyzing

features_to_display = set(columns_of_interest).difference(set(['Population', 'Deaths', 'Confirmed', 'Country']))


num_rows = len(features_to_display)//3 
num_cols = 3
fig, _ = plt.subplots(num_rows, 3, figsize=(15,15)) # create a plot with 3 columns 

for idx, feature in enumerate(features_to_display):
    plt.subplot(num_rows, num_cols, idx + 1) # "activate" the plot at index idx + 1 in the grid (num_rows, num_cols)
    plt.hist(df[feature], bins=10)
    plt.title(feature)
    
        
fig.show()
c:\users\diana\appdata\local\programs\python\python36\lib\site-packages\ipykernel_launcher.py:16: UserWarning:

Matplotlib is currently using module://ipykernel.pylab.backend_inline, which is a non-GUI backend, so cannot show the figure.

Based on these plots, we can notice that some features follow a normal distribution, while others have a more "skewed" distribution. These will help us choose the appropriate standardization algorithm for these variables.

Finally, let's look at the correlation between the features that we are analysing:

In [7]:
plt.figure(figsize=(16, 16))
seaborn.heatmap(df.corr(), annot=True, cmap='Blues')
plt.show()

From the correlation matrix above we can notice that COVID cases are obviously related to the consumption of animal products and fats, alcoohol and dairy consumption, and obesity. I would have expected sweetner consumption to have a much more significant impact.

Anyway, it seems that a healthy diet might indeed help your immune system.

Data analysis by country

Finally, let's analyse the distribution of the COVID-19 cases based on their geographical distribution.

First, let's plot the COVID-cases incidence (confirmed cases and deaths) by countries.

In [8]:
df_confirmed_country = df[['Confirmed', 'Deaths', 'Country']]

#sort dataframe by confirmed cases
df_confirmed_country = df_confirmed_country.sort_values(by='Confirmed', ascending=False)

plt.figure(figsize = (12, 32))
seaborn.barplot(x = df_confirmed_country['Confirmed'], y=df_confirmed_country['Country'])

plt.xlabel("Confirmed cases")
plt.ylabel("Country")
plt.title("Confirmed cases per country")
plt.show()

It can be noticed that the countries with the highest number of deaths are localized in Europe. Also USA is also in the top countries with the highest number of cases. This can be do the fact that these countries peform more COVID-19 tests, have a high density of the population and people tend to travel more in these regions.

In [9]:
# Also display a choropleth map to have an idea of how the covid-cases  are distributed geographically
fig = plt_exp.choropleth(df, locations="Country",
                    color="Confirmed", 
                    locationmode='country names',
                    color_continuous_scale=plt_exp.colors.sequential.Plasma)
fig.show()
In [10]:
#sort dataframe by deaths
df_confirmed_country = df_confirmed_country.sort_values(by='Deaths', ascending=False)

plt.figure(figsize = (12, 32))
seaborn.barplot(x = df_confirmed_country['Deaths'], y=df_confirmed_country['Country'])

plt.xlabel("Confirmed cases")
plt.ylabel("Country")
plt.title("Deaths per country")
plt.show()
In [11]:
# Also display a choropleth map to have an idea of how the covid-cases and deaths are distributed geographically
fig = plt_exp.choropleth(df, locations="Country",
                    color="Deaths", 
                    locationmode='country names',
                    color_continuous_scale=plt_exp.colors.sequential.Plasma)
fig.show()

The countries that had high numbers of confirmed COVID-19 cases also have higher death rates related to COVID-19. However, we can notice that in South America, although there aren't so many COVID-19 cases, the death toll is larger. This could be due to the fact perhaps the number of cases is much larger and not enough COVID-19 tests are performed.

Let's now look at the diet for the N countries with the most covid cases and at the diet of N countries with the least covid cases.

In [12]:
N = 15
df_sorted_confirmed = df.sort_values(by='Confirmed', ascending=False)
most_cases = df_sorted_confirmed.head(N)
least_cases = df_sorted_confirmed.tail(N)
In [13]:
fig, _ = plt.subplots(1, 2, figsize=(15, 15))

features_to_display = set(columns_of_interest).difference(set(['Population', 'Deaths', 'Confirmed', 'Country', 'Undernourished', 'Obesity']))

plt.title('Comparison between the diet of countries with highest rates of covid cases (left) and  lowest rates of covid cases (right)', fontsize=15)

plt.subplot(1, 2, 1)
data_highest_cases = most_cases[list(features_to_display)].mean(skipna=False).tolist()
labels = features_to_display
plt.pie(data_highest_cases, explode=[0.1]*len(data_highest_cases), labels=labels, autopct='%1.1f%%')
plt.title('Diet in the {} countries with the highest number of covid cases'.format(N))


plt.subplot(1, 2, 2)
data_highest_cases = least_cases[list(features_to_display)].mean(skipna=False).tolist()
plt.pie(data_highest_cases, explode=[0.1]*len(data_highest_cases), labels=labels, autopct='%1.1f%%')
plt.title('Diet in the {} countries with the loest number of covid cases'.format(N))

plt.show()

It can be noticed that the countries with lower COVID-19 rates consume much more vegetal products, fish, less alchool and less animal products.

Now let's do the same analysis for the number of deaths.

In [14]:
N = 15
df_sorted_confirmed = df.sort_values(by='Deaths', ascending=False)
most_cases = df_sorted_confirmed.head(N)
least_cases = df_sorted_confirmed.tail(N)

fig, _ = plt.subplots(1, 2, figsize=(15, 15))

features_to_display = set(columns_of_interest).difference(set(['Population', 'Deaths', 'Confirmed', 'Country', 'Undernourished', 'Obesity']))

plt.title('Comparison between the diet of countries with highest deaths realted to covid (left) and  lowest deaths related to covid (right)', fontsize=15)

plt.subplot(1, 2, 1)
data_highest_cases = most_cases[list(features_to_display)].mean(skipna=False).tolist()
labels = features_to_display
plt.pie(data_highest_cases, explode=[0.1]*len(data_highest_cases), labels=labels, autopct='%1.1f%%')
plt.title('Diet in the {} countries with the highest deaths related to covid'.format(N))


plt.subplot(1, 2, 2)
data_highest_cases = least_cases[list(features_to_display)].mean(skipna=False).tolist()
plt.pie(data_highest_cases, explode=[0.1]*len(data_highest_cases), labels=labels, autopct='%1.1f%%')
plt.title('Diet in the {} countries with the lowest deaths related to covid'.format(N))

plt.show()

The same observation is valid for the countries where the deaths related to COVID-19 are less numerous: people in these countries seem to consume much more vegetable products and less animal products. It is interesting that the consumption of meat and sweetners does not differ so much between these two groups of countries. Instead, the dairy consumption seems to have a higher impact: countries with a lower death toll consume 3.56 less dairy products.

Finally, let's see the relationship between the vegetal products consumption and number of COVID-19 confirmed cases and deaths.

In [15]:
df1 = df.sort_values(by='Confirmed', ascending=False)
plt.scatter(df['Confirmed'], df['Vegetal_products'], color='salmon')
plt.xlabel('Confirmed cases')
plt.ylabel('Vegetal products consumption')
plt.title("Confirmed COVID-19 cases vs. vegetal products consumption")
Out[15]:
Text(0.5, 1.0, 'Confirmed COVID-19 cases vs. vegetal products consumption')
In [ ]:

Machine learning algorithm

In this section of the tutorial, we'll design a machine learning algorithm to predict the number of covid cases and the number of deaths realted to covid cases based on the the diet of a country.

As the output is a continous variable (the prediction is the estimated number of cases or deaths), we are dealing with a regression problem. More formally, the definition of regression is : "a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables (often called 'predictors', 'covariates', or 'features')." - https://en.wikipedia.org/wiki/Regression_analysis

We will experiment with three different regression classifiers: linear regression (the vanilla regressor that works with least square method), decision tree and random forest.

All the colums from my dataset are numerical (I have no categorical data). To "clean" the dataset I already dropped the rows that contained null values.

Now, to get started, I have to split the data intro training and development (or test) data and to standardize it.

To standardize the data I will use RobustScaler from sklearn library. This scaling method is based on statistics: it removes the median and scales the data based on the interquartile range (the range between the 25th quartile and the 75th quartile). This scaling method is known to be robust to outliers in the data.

To split the data into train and test set, again we will use some functions from sklearn library. The input features for our models will be the diet information, while the output will be the number of COVID-19 cases.

First, let's import all the libraries that we will use for the machine learning task.

In [16]:
# functions to preprocess the data
from sklearn.utils import shuffle
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split

# metrics to evaluate the algorithms
from sklearn.metrics import mean_squared_error, mean_absolute_error

# machine learning models

from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# misc
import seaborn
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

And then let's preprocess the data.

In [26]:
# To allow the user run only the machine learning algorithm, I will reload the data here
df = pd.read_csv("Food_Supply_Quantity_kg_Data.csv")

diet_related_columns = ['Animal fats',  'Eggs', 'Fish, Seafood', 'Meat',
                       'Milk - Excluding Butter', 'Alcoholic Beverages', 'Fruits - Excluding Wine', 
                        'Sugar & Sweeteners', 'Vegetal Products', 'Animal Products',
                        'Vegetables']


df = df.dropna()
# The Undernourished column has some string values <2.5, and we need to take care of this issue
df['Undernourished'] = df.apply(lambda ft: 2.5 if ft['Undernourished'] == '<2.5' else float(ft['Undernourished']), axis = 1)

df_machine_learning = df[diet_related_columns+['Obesity', 'Confirmed', 'Deaths']]

features = diet_related_columns
target = 'Confirmed'

print('Model features: \n\t', features)
print('Model target: \n\t', target)

X = df_machine_learning[features]
rs = RobustScaler()
X = rs.fit_transform(X)
y = df_machine_learning[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, shuffle = False, random_state = 13)
Model features: 
	 ['Animal fats', 'Eggs', 'Fish, Seafood', 'Meat', 'Milk - Excluding Butter', 'Alcoholic Beverages', 'Fruits - Excluding Wine', 'Sugar & Sweeteners', 'Vegetal Products', 'Animal Products', 'Vegetables']
Model target: 
	 Confirmed

We will experiment with several regression algorithms: LinearRegression, DecisionTreeRegressor and RandomForestRegressor.

The interested reader can learn more about the implementation details and the parameters of all these algorithms, from the sklean library documentation:

To evaluate the models we will report:

  • R2 - coefficient of determination which is a statistical measure of how well the regression predictions approximate the real data points;
  • MAE - mean absolute error;
  • RMSE - root mean square error.
In [35]:
# this function trains a regression model lr on the train data (X_train, y_train) and then returns the performance of the model on the test data
def perform_linear_regression(lr, X_train, X_test, y_train, y_test):
    # train the model
    lr.fit(X_train, y_train)

    # get the predictions on the training and test data
    pred_test = lr.predict(X_test)
    pred_train = lr.predict(X_train)

    # compute the performance on the training and test data
    score_train = lr.score(X_train, y_train)
    score_test = lr.score(X_test, y_test)

    # compute mean squared error and mean absolute error on train and test data
    rmse_train = np.sqrt(mean_squared_error(y_train, pred_train))
    rmse_test = np.sqrt(mean_squared_error(y_test, pred_test))
    
    mae_train = np.sqrt(mean_absolute_error(y_train, pred_train))
    mae_test = np.sqrt(mean_absolute_error(y_test, pred_test))
    
    return score_train, score_test, rmse_train, rmse_test, mae_train, mae_test


classifiers = [LinearRegression(), SGDRegressor(), RandomForestRegressor()]
classifier_names = ['LinearRegression', 'SGDRegressor', 'RandomForestRegressor']

acc_test = []
rms_test = []
mae_test = []
acc_train = []
rms_train = []
mae_train = []
for classifier, classifier_name in zip(classifiers, classifier_names):
    score, score_tst, rmse, rmse_tst, mae, mae_tst = perform_linear_regression(classifier, X_train, X_test, y_train, y_test)
    
    print('Classifier {}'.format(classifier_name))
    print("\tAccuracy on training Data: {:.2f}".format(score))
    print("\tAccuracy on test Data: {:.2f}".format(score_tst))
    print('\tThe RMSE of the training set is:', rmse)
    print('\tThe RMSE of the testing set is:', rmse_tst)
    print('\tThe MAE of the training set is:', mae)
    print('\tThe MAE of the testing set is:', mae_tst)
    acc_test.append(score_tst)
    rms_test.append(rmse_tst)
    mae_test.append(mae_tst)
    acc_train.append(score)
    rms_train.append(rmse)
    mae_train.append(mae)
    
classifiers_performance = pd.DataFrame(columns=[])
classifiers_performance['classifier'] = classifier_names
classifiers_performance['R2_score test'] = acc_test
classifiers_performance['R2_score train'] = acc_train
classifiers_performance['RMSE test'] = rms_test
classifiers_performance['RMSE train'] = rms_train
classifiers_performance['MAE test'] = mae_test
classifiers_performance['MAE train'] = mae_train
classifiers_performance
Classifier LinearRegression
	Accuracy on training Data: 0.47
	Accuracy on test Data: 0.56
	The RMSE of the training set is: 1.6893163625284342
	The RMSE of the testing set is: 1.7251019336187505
	The MAE of the training set is: 1.126935035263973
	The MAE of the testing set is: 1.169220201912816
Classifier SGDRegressor
	Accuracy on training Data: 0.46
	Accuracy on test Data: 0.53
	The RMSE of the training set is: 1.716922608395054
	The RMSE of the testing set is: 1.7886746098752644
	The MAE of the training set is: 1.1349282002074317
	The MAE of the testing set is: 1.1932326160849756
Classifier RandomForestRegressor
	Accuracy on training Data: 0.91
	Accuracy on test Data: 0.63
	The RMSE of the training set is: 0.7033823754272875
	The RMSE of the testing set is: 1.584765826985974
	The MAE of the training set is: 0.711771153051628
	The MAE of the testing set is: 1.0975328802167827
Out[35]:
classifier R2_score test R2_score train RMSE test RMSE train MAE test MAE train
0 LinearRegression 0.564375 0.473707 1.725102 1.689316 1.169220 1.126935
1 SGDRegressor 0.531677 0.456366 1.788675 1.716923 1.193233 1.134928
2 RandomForestRegressor 0.632368 0.908759 1.584766 0.703382 1.097533 0.711771
In [36]:
# let's plot the predictions of the best model (random forest classifier)
model = classifiers[-1]

X = df_machine_learning[features]
rs = RobustScaler()
X = rs.fit_transform(X)
y = df_machine_learning[target]

predictions = model.predict(X)
display = pd.DataFrame(X, columns=diet_related_columns)
display['Predicted Cases'] = predictions
display['Cases'] = y
display.head(3)
Out[36]:
Animal fats Eggs Fish, Seafood Meat Milk - Excluding Butter Alcoholic Beverages Fruits - Excluding Wine Sugar & Sweeteners Vegetal Products Animal Products Vegetables Predicted Cases Cases
0 0.375102 -0.507165 -0.832850 -0.840798 0.230756 -0.723866 0.120526 -0.618394 0.230968 -0.231070 0.419539 0.467292 0.142134
1 0.087889 0.270160 -0.680164 -0.561317 1.142278 -0.281261 0.542373 -0.520880 -0.668700 0.669036 1.582248 3.167633 2.967301
2 -0.413335 0.157619 -0.655232 -0.870077 0.234799 -0.652408 0.423154 -0.366405 0.211748 -0.211852 1.552803 1.004575 0.244897

Conclusions and discussion

In this tutorial we've analized the number of COVID-19 cases and the deaths related to this virus in countries all around the world, and their correlation with the diet of the population in those countries. The assumption was that a healthy diet could boost our immunity system and, therefore, our bodies will have more power to fight the virus.

In the first part of the tutorial we performed data analysis and visualization part; we noticed that countires in which people consume more vegetables and less animal products and dairy, indeed have the lowest COVID-19 cases and deaths.

Regarding the geographical distribution of the cases, we noticed that there are more COVID-19 cases in countries from developed regions (from USA and Europe). The countries from these regions perform more COVID-19 tests (so it is normal to have a higher rate of confirmed cases) and people live in urban agglomerations and travel more. Also, the overall age of the population is higher in these regions.

In less developed countries, perhaps a much lower number of tests is performed.

Countries in Asia and Oceania also, have a small number COVID-19 cases. In the case of Australia and New Zeeland it might be the fact that the gouvernment imposed strict rules to fight COVID-19 and basically the countries were isolated during this period.

Finally, in the second part of the tutorial we employed machine learning algorithms to predict the number of COVID-19 cases based on the diet and physiological state (obesity) of the inhabitants of a country. We used several linear regression models for this. The R2 score for this task was ~0.61 for the random forest classifier.

This is perhaps due to the fact that in reality things are much more complex and only a healthy diet cannot ensure our resistance against COVID-19. Moreover, there are much more aspects (which are not related to the diet), that have a higher impact on the COVID-19 cases than the diet: the density of the population, how many tests are performed in the country, what preventive measures did the gouvernment of each country applyed to help stop the pandemic.

To conclude, even if diet alone cannot be used to fight with the coronavirus, I think that we should pay more attention on the things that we eat:

"Take care of your body. It's the only place you have to live.".

In [ ]: