From 998c3a875e1b903dae26e4d0cc21418089c2925b Mon Sep 17 00:00:00 2001 From: Carlos Camara Date: Thu, 14 Dec 2023 17:38:56 +0000 Subject: [PATCH] Add quarto syntax not to stop execution by a forced error --- .../Lab_4/IM939_Lab_4_Exercises/execute-results/html.json | 6 +++--- content/labs/Lab_4/IM939_Lab_4_Exercises.ipynb | 2 ++ 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/_freeze/content/labs/Lab_4/IM939_Lab_4_Exercises/execute-results/html.json b/_freeze/content/labs/Lab_4/IM939_Lab_4_Exercises/execute-results/html.json index 8474b4a..ef8dae3 100644 --- a/_freeze/content/labs/Lab_4/IM939_Lab_4_Exercises/execute-results/html.json +++ b/_freeze/content/labs/Lab_4/IM939_Lab_4_Exercises/execute-results/html.json @@ -1,9 +1,9 @@ { - "hash": "5342bed808529da3378c0bd69d9245c3", + "hash": "8924f51c73c2f0a5abe3d45ac3f510ab", "result": { - "markdown": "# Exercise: Wine dataset\n\nAs with previous exercises, fill in the question marks with the correct code.\n\nLast week you were introduced to the [wine dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality). We have 10 input variables and 1 output variables.\n\nInput variables (based on physicochemical tests):\n\n1. fixed acidity\n2. volatile acidity\n3. citric acid\n4. residual sugar\n5. chlorides\n6. free sulfur dioxide\n7. total sulfur dioxide\n8. density\n9. pH\n10. sulphates\n11. alcohol\n\nOutput variable (based on sensory data):\n\n12. quality (score between 0 and 10)\n\nI suggest we look at two broad questions with this dataset:\n\n1. Will dimension reduction reveal variable groupings? Think back to how we interpreted the loadings in the crime dataset.\n2. What does clustering the wines well us?\n\n## Load data and import libraries\n\n::: {.cell execution_count=1}\n``` {.python .cell-code}\nimport warnings\nwarnings.filterwarnings('ignore')\n\nimport pandas as pd\nfrom sklearn.preprocessing import MinMaxScaler\nimport seaborn as sn?\nfrom sklearn.cluster import KMeans\nfrom sklearn.decomposition import PC?\nfrom sklearn.decomposition import S????ePCA\nfrom sklearn.manifold import TSNE\n\ndf = pd.read_excel('data/winequality-red_v2.xlsx')\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (197875397.py, line 9)\n```\n:::\n:::\n\n\n::: {.cell execution_count=2}\n``` {.python .cell-code}\ndf.h??d()\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (197674394.py, line 1)\n```\n:::\n:::\n\n\n::: {.cell execution_count=3}\n``` {.python .cell-code}\n# May take a while depending on your computer\n# feel free not to run this\nsns.pair????(df)\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (4010129658.py, line 3)\n```\n:::\n:::\n\n\n# Dimension reduction\n\n::: {.cell execution_count=4}\n``` {.python .cell-code}\nfrom sklearn.decomposition import PCA\n\nn_components = 2\n \npca = PCA(n_??????????=n_components)\ndf_pca = pca.fit(df?iloc[:, 0:11])\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (2518476773.py, line 5)\n```\n:::\n:::\n\n\n::: {.cell execution_count=5}\n``` {.python .cell-code}\ndf_pca_vals = df_pca.???_transform(df.iloc[:, 0:11])\ndf['c1'] = [item[0] for item in df_pca_????]\ndf['c2'] = [item[1] for item in df_pca_vals]\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (2292344659.py, line 1)\n```\n:::\n:::\n\n\n::: {.cell execution_count=6}\n``` {.python .cell-code}\nsns.scatterplot(data = df, x = ?, y = ?, hue = 'quality')\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (321227352.py, line 1)\n```\n:::\n:::\n\n\n::: {.cell execution_count=7}\n``` {.python .cell-code}\nprint(df.columns)\ndf_pca.components_\n```\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'df' is not defined\n```\n:::\n:::\n\n\nWhat about other dimension reduction methods?\n\n## SparcePCA\n\n::: {.cell execution_count=8}\n``` {.python .cell-code}\ns_pca = SparsePCA(n_components=n_components)\ndf_s_pca = s_pca.fit(df.????[:, 0:11])\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (2237240520.py, line 2)\n```\n:::\n:::\n\n\n::: {.cell execution_count=9}\n``` {.python .cell-code}\ndf_s_pca_vals = s_pca.fit_?????????(df.iloc[:, 0:11])\ndf['c1 spca'] = [item[0] for item in df_s_pca_vals]\ndf['c2 spca'] = [item[1] for item in df_s_pca_vals]\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (496583237.py, line 1)\n```\n:::\n:::\n\n\n::: {.cell execution_count=10}\n``` {.python .cell-code}\nsns.scatterplot(data = df, x = 'c1 spca', y = 'c2 spca', hue = 'quality')\n```\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'sns' is not defined\n```\n:::\n:::\n\n\n## tSNE\n\n::: {.cell execution_count=11}\n``` {.python .cell-code}\ntsne_model = TSNE(n_components=n_components)\ndf_tsne = tsne_model.fit(df.iloc[:, 0:11])\n```\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'TSNE' is not defined\n```\n:::\n:::\n\n\n::: {.cell execution_count=12}\n``` {.python .cell-code}\ndf_tsne_vals = tsne_model.fit_transform(df.iloc[:, 0:11])\ndf['c1 tsne'] = [item[0] for item in ??_tsne_vals]\ndf['c2 tsne'] = [item[1] for item in df_tsne_vals]\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (3280343393.py, line 2)\n```\n:::\n:::\n\n\n::: {.cell execution_count=13}\n``` {.python .cell-code}\n# This plot does not look right\n# I am not sure why.\nsns.scatterplot(data = ??, x = 'c1 tsne', y = 'c1 tsne', hue = 'quality')\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (847552475.py, line 3)\n```\n:::\n:::\n\n\nThat looks concerning - there is a straight line. It looks like something in the data has caused the model to have issues.\n\nDoes normalising the data sort out the issue?\n\n::: {.cell execution_count=14}\n``` {.python .cell-code}\nfrom sklearn.preprocessing import MinMaxScaler\ncol_names = df.columns\nscaled_df = pd.DataFrame(MinMaxScaler().fit_transform(df))\nscaled_df.columns = col_names\n```\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'df' is not defined\n```\n:::\n:::\n\n\n::: {.cell execution_count=15}\n``` {.python .cell-code}\ntsne_model = TSNE(n_components=n_components)\n\nscaled_df_tsne = tsne_model.fit(scaled_df.iloc[:, 0:11])\nscaled_df_tsne_vals = tsne_model.fit_transform(df.iloc[:, 0:11])\n\nscaled_df['c1 tsne'] = [item[0] for item in scaled_df_tsne_vals]\nscaled_df['c2 tsne'] = [item[1] for item in scaled_df_tsne_vals]\n\nsns.scatterplot(data = scaled_df, x = 'c1 tsne', y = 'c1 tsne', hue = 'quality')\n```\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'TSNE' is not defined\n```\n:::\n:::\n\n\nNormalising the data makes no difference. It could be the model is getting stuck somehow. You could check the various attributes of the tsne fit object (tsne_model.fit), try using only a few columns and search google a lot - this could be a problem other have encountered.\n\nFor now, we will use PCA components.\n\n::: {.cell execution_count=16}\n``` {.python .cell-code}\ndata = {'columns' : df.iloc[:, 0:11].columns,\n 'component 1' : df_pca.components_[0],\n 'component 2' : df_pca.components_[1]}\n\n\nloadings = pd.?????????(data)\nloadings_sorted = loadings.sort_values(by=['component 1'], ascending=False)\nloadings_sorted.iloc[1:10,:]\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (4262587979.py, line 6)\n```\n:::\n:::\n\n\n::: {.cell execution_count=17}\n``` {.python .cell-code}\nloadings_sorted = loadings.sort_values(by=['component 2'], ascending=False)\nloadings_sorted.iloc[1:10,:]\n```\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'loadings' is not defined\n```\n:::\n:::\n\n\n## Clustering\n\n::: {.cell execution_count=18}\n``` {.python .cell-code}\nfrom sklearn.cluster import KMeans\n\nks = range(1, 10)\ninertias = []\nfor k in ks:\n # Create a KMeans instance with k clusters: model\n ????? = KMeans(n_clusters=k)\n \n # Fit model to samples\n model.fit(df[['c1', 'c2']])\n \n # Append the inertia to the list of inertias\n inertias.append(model.inertia_)\n\nimport matplotlib.pyplot as plt\n\nplt.plot(ks, inertias, '-o', color='black')\nplt.xlabel('number of clusters, k')\nplt.ylabel('inertia')\nplt.xticks(ks)\nplt.show()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nObject `??? = KMeans(n_clusters=k)` not found.\n```\n:::\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'model' is not defined\n```\n:::\n:::\n\n\n::: {.cell execution_count=19}\n``` {.python .cell-code}\nk_means_3 = KMeans(n_clusters = 3, init = 'random')\nk_means_3.fit(df[['c1', 'c2']])\ndf['Three clusters'] = pd.Series(k_means_3.???????(df[['c1', 'c2']].values), index = df.index)\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (2729138574.py, line 3)\n```\n:::\n:::\n\n\n::: {.cell execution_count=20}\n``` {.python .cell-code}\nsns.scatterplot(data = df, x = 'c1', y = 'c2', hue = 'Three clusters')\n```\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'sns' is not defined\n```\n:::\n:::\n\n\nConsider:\n\n* Is that userful? \n* What might it mean?\n\nOutside of this session you could try normalising the data (centering around the mean), clustering the raw data (and not the projections from PCA), trying to get tSNE working or using different numbers of components.\n\n", + "markdown": "# Exercise: Wine dataset\n\nAs with previous exercises, fill in the question marks with the correct code.\n\nLast week you were introduced to the [wine dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality). We have 10 input variables and 1 output variables.\n\nInput variables (based on physicochemical tests):\n\n1. fixed acidity\n2. volatile acidity\n3. citric acid\n4. residual sugar\n5. chlorides\n6. free sulfur dioxide\n7. total sulfur dioxide\n8. density\n9. pH\n10. sulphates\n11. alcohol\n\nOutput variable (based on sensory data):\n\n12. quality (score between 0 and 10)\n\nI suggest we look at two broad questions with this dataset:\n\n1. Will dimension reduction reveal variable groupings? Think back to how we interpreted the loadings in the crime dataset.\n2. What does clustering the wines well us?\n\n## Load data and import libraries\n\n::: {.cell execution_count=1}\n``` {.python .cell-code}\nimport warnings\nwarnings.filterwarnings('ignore')\n\nimport pandas as pd\nfrom sklearn.preprocessing import MinMaxScaler\nimport seaborn as sn?\nfrom sklearn.cluster import KMeans\nfrom sklearn.decomposition import PC?\nfrom sklearn.decomposition import S????ePCA\nfrom sklearn.manifold import TSNE\n\ndf = pd.read_excel('data/winequality-red_v2.xlsx')\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (197875397.py, line 9)\n```\n:::\n:::\n\n\n::: {.cell execution_count=2}\n``` {.python .cell-code}\ndf.h??d()\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (197674394.py, line 1)\n```\n:::\n:::\n\n\n::: {.cell execution_count=3}\n``` {.python .cell-code}\n# May take a while depending on your computer\n# feel free not to run this\nsns.pair????(df)\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (4010129658.py, line 3)\n```\n:::\n:::\n\n\n# Normalisation\n\nBefore you carry out any operation, you might want to perform some normalisation. This will ensure that some of the assumptions that the algorithms are making are met and also the results are not biased/determined by the different value ranges and variation ranges inherent in the data. \n\nDo try out the following steps **without** normalisation first and then come back to this, normalise the data and see the differences it makes using a **normalised** copy of the data.\n\n::: {.cell execution_count=4}\n``` {.python .cell-code}\nfrom sklearn.preprocessing import MinMaxScaler\n\n# first save the column names, we will create a new dataset with the scaled data\ncol_names = df.columns\n\n# This is the normalization function. \n# We are using the MinMaxScaler which brings all the data between 0 and 1.\n# Make use of other transformations offered by scikitlearn, experiment, note changes. \n\n# The last column of the data contains the \"quality\" labels/scores, we don't want to normalize them \n# as they is sort of the \"dependent (or \"target\") variable and there is meaning in these scores. \n# Let's normalize the first 11 columns which are our \"independent\" columns.\nscaled_df = pd.DataFrame(MinMaxScaler().fit_transform(df.iloc[:, 0:11]))\n\n# now we want to add the \"quality\" values back in. We'll need them.\nscaled_df = scaled_df.join(df.iloc[:, 11:12])\n\n# now we name the columns with the original column names. We do this because MinMaxScaler \n# produces a data frame with no column names (don't ask me why..)\nscaled_df.columns = col_names\n\n# let's have a look at what the data is looking like:\nscaled_df.head()\n```\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'df' is not defined\n```\n:::\n:::\n\n\n**Important note:** The rest of the code will continue to use the **non-normalised** version of the data. For now, just carry on with that and try running the operations with the non-normalised version. Once you are through and/or somewhere in the middle, try them out with the **normalised** data. See what this will change.\n\n# Dimension reduction\n\n::: {.cell execution_count=5}\n``` {.python .cell-code}\nfrom sklearn.decomposition import PCA\n\nn_components = 2\n \npca = PCA(n_??????????=n_components)\ndf_pca = pca.fit(df?iloc[:, 0:11])\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (2518476773.py, line 5)\n```\n:::\n:::\n\n\n::: {.cell execution_count=6}\n``` {.python .cell-code}\ndf_pca_vals = df_pca.???_transform(df.iloc[:, 0:11])\ndf['c1'] = [item[0] for item in df_pca_????]\ndf['c2'] = [item[1] for item in df_pca_vals]\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (2292344659.py, line 1)\n```\n:::\n:::\n\n\n::: {.cell execution_count=7}\n``` {.python .cell-code}\nsns.scatterplot(data = df, x = ?, y = ?, hue = 'quality')\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (321227352.py, line 1)\n```\n:::\n:::\n\n\n::: {.cell execution_count=8}\n``` {.python .cell-code}\nprint(df.columns)\ndf_pca.components_\n```\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'df' is not defined\n```\n:::\n:::\n\n\nWhat about other dimension reduction methods?\n\n## SparcePCA\n\n::: {.cell execution_count=9}\n``` {.python .cell-code}\ns_pca = SparsePCA(n_components=n_components)\ndf_s_pca = s_pca.fit(df.????[:, 0:11])\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (2237240520.py, line 2)\n```\n:::\n:::\n\n\n::: {.cell execution_count=10}\n``` {.python .cell-code}\ndf_s_pca_vals = s_pca.fit_?????????(df.iloc[:, 0:11])\ndf['c1 spca'] = [item[0] for item in df_s_pca_vals]\ndf['c2 spca'] = [item[1] for item in df_s_pca_vals]\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (496583237.py, line 1)\n```\n:::\n:::\n\n\n::: {.cell execution_count=11}\n``` {.python .cell-code}\nsns.scatterplot(data = df, x = 'c1 spca', y = 'c2 spca', hue = 'quality')\n```\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'sns' is not defined\n```\n:::\n:::\n\n\n## tSNE\n\n::: {.cell execution_count=12}\n``` {.python .cell-code}\ntsne_model = TSNE(n_components=n_components)\ndf_tsne = tsne_model.fit(df.iloc[:, 0:11])\n```\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'TSNE' is not defined\n```\n:::\n:::\n\n\n::: {.cell execution_count=13}\n``` {.python .cell-code}\ndf_tsne_vals = tsne_model.fit_transform(df.iloc[:, 0:11])\ndf['c1 tsne'] = [item[0] for item in ??_tsne_vals]\ndf['c2 tsne'] = [item[1] for item in df_tsne_vals]\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (3280343393.py, line 2)\n```\n:::\n:::\n\n\n::: {.cell execution_count=14}\n``` {.python .cell-code}\n# This plot does not look right\n# I am not sure why.\nsns.scatterplot(data = ??, x = 'c1 tsne', y = 'c1 tsne', hue = 'quality')\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (847552475.py, line 3)\n```\n:::\n:::\n\n\nThat looks concerning - there is a straight line. It looks like something in the above code might not be correct.\n\nCan you find out what that might be? \n\n**Hint:** think about when you would get a straight line in a scatterplot?\n\nOnce you fixed the error above, you will notice a different structure to the ones you observed in the PCA runs. There isn't really a clear next step which of these projections one should adopt. \n\nFor now, we will use PCA components. PCA would be a good choice if the interpretability of the components is important to us. Since PCA is a linear projection method, the components carry the weights of each raw feature which enable us to make inferences about the axes. However, if we are more interested in finding structures and identify groups of similar items, t-SNE might be a better projection to use since it emphasises proximity but the axes don't mean much since the layout is formed stochastically (fancy speak for saying that there is randomness in the algorithm and the layout will be different each time your run it).\n\n::: {.cell execution_count=15}\n``` {.python .cell-code}\ndata = {'columns' : df.iloc[:, 0:11].columns,\n 'component 1' : df_pca.components_[0],\n 'component 2' : df_pca.components_[1]}\n\n\nloadings = pd.?????????(data)\nloadings_sorted = loadings.sort_values(by=['component 1'], ascending=False)\nloadings_sorted.iloc[1:10,:]\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (4262587979.py, line 6)\n```\n:::\n:::\n\n\n::: {.cell execution_count=16}\n``` {.python .cell-code}\nloadings_sorted = loadings.sort_values(by=['component 2'], ascending=False)\nloadings_sorted.iloc[1:10,:]\n```\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'loadings' is not defined\n```\n:::\n:::\n\n\n## Clustering\n\n::: {.cell execution_count=17}\n``` {.python .cell-code}\nfrom sklearn.cluster import KMeans\n\nks = range(1, 10)\ninertias = []\nfor k in ks:\n # Create a KMeans instance with k clusters: model\n ????? = KMeans(n_clusters=k)\n \n # Fit model to samples\n model.fit(df[['c1', 'c2']])\n \n # Append the inertia to the list of inertias\n inertias.append(model.inertia_)\n\nimport matplotlib.pyplot as plt\n\nplt.plot(ks, inertias, '-o', color='black')\nplt.xlabel('number of clusters, k')\nplt.ylabel('inertia')\nplt.xticks(ks)\nplt.show()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nObject `??? = KMeans(n_clusters=k)` not found.\n```\n:::\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'model' is not defined\n```\n:::\n:::\n\n\n::: {.cell execution_count=18}\n``` {.python .cell-code}\nk_means_3 = KMeans(n_clusters = 3, init = 'random')\nk_means_3.fit(df[['c1', 'c2']])\ndf['Three clusters'] = pd.Series(k_means_3.???????(df[['c1', 'c2']].values), index = df.index)\n```\n\n::: {.cell-output .cell-output-error}\n```\nSyntaxError: invalid syntax (2729138574.py, line 3)\n```\n:::\n:::\n\n\n::: {.cell execution_count=19}\n``` {.python .cell-code}\nsns.scatterplot(data = df, x = 'c1', y = 'c2', hue = 'Three clusters')\n```\n\n::: {.cell-output .cell-output-error}\n```\nNameError: name 'sns' is not defined\n```\n:::\n:::\n\n\n\nConsider:\n\n* Is that useful? \n* What might it mean?\n\nOutside of this session go back to normalising the data and try out different methods for normalisation as well (e.g., centering around the mean), clustering the raw data (and not the projections from PCA), trying to get tSNE working or using different numbers of components.\n\n", "supporting": [ - "IM939_Lab_4_Exercises_files" + "IM939_Lab_4_Exercises_files/figure-html" ], "filters": [], "includes": {} diff --git a/content/labs/Lab_4/IM939_Lab_4_Exercises.ipynb b/content/labs/Lab_4/IM939_Lab_4_Exercises.ipynb index 9a4e038..261d0e6 100644 --- a/content/labs/Lab_4/IM939_Lab_4_Exercises.ipynb +++ b/content/labs/Lab_4/IM939_Lab_4_Exercises.ipynb @@ -98,6 +98,8 @@ "metadata": {}, "outputs": [], "source": [ + "#| error: true\n", + "\n", "from sklearn.preprocessing import MinMaxScaler\n", "\n", "# first save the column names, we will create a new dataset with the scaled data\n",