Red Wine Characteristics by Perceived Quality by David Broadwater

Overview: This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Guiding Question: Which chemical properties influence the quality of red wines?


Univariate Plots

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Most of the wines were rated 5 or 6 on a 10 point scale, and about 86% (1,382 of 1,599) were rated 6 or below. The ratings in the dataset ranged from 3 to 8 on a scale that runs from 0 to 10. I wonder if these ratings are typical for red wines, or just this set in particular. I’m surprised that none were rated higher than 8, and that so few were rated as an 8. After reading a bit more about how the ratings were calculated (by taking the median of the ratings from three different wine experts), I can see how there could be fewer ratings at the extremes of the spectrum.

The density values appear to have a small amount of variance, while it looks like there is much more variance in the residual sugar, free sulfur dioxide, and total sulfur dioxide values. I’m interested to see how these all relate to quality and each other. First, I’ll take a look at the distributions of each of the variables to get a feel for each.
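The tables above are the kind of output that `summary()` and `str()` produce. Here is a minimal sketch of that step, assuming the data arrive as a CSV with a row-number column `X` (the report notes later that this column was dropped). The inline text is a stand-in for the real file, whose path is not given here; it holds only the first three wines from the `str()` output and a subset of the columns.

```r
# Stand-in for the real CSV (assumption): first three wines from the
# str() output above, restricted to four columns. X mimics the row id.
csv_text <- "X,fixed.acidity,volatile.acidity,alcohol,quality
1,7.4,0.70,9.4,5
2,7.8,0.88,9.8,5
3,7.8,0.76,9.8,5"

wines <- read.csv(text = csv_text)

# Drop the row identifier, since it carries no chemical information.
wines$X <- NULL

summary(wines)  # min, quartiles, mean, and max for each variable
str(wines)      # column types and the first few values
```

With the real file, `read.csv()` would take a file path instead of the `text` argument.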

Explore Variable Distributions

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
## [1] 7.2

Fixed acidity was slightly skewed toward higher acidities, but otherwise looked somewhat normal, with a long tail extending out to a maximum value of 15.9 g/dm^3. The median value was 7.9 g/dm^3, and 7.2 g/dm^3 was the most frequently occurring value.
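Base R has no built-in function for the most frequent value, so the single number printed under the summary has to come from a small helper. The sketch below is one way to write it; `most_common` is a hypothetical name, not necessarily what was used for this report, and the input is only the first ten fixed acidity values from the `str()` output.

```r
# Return the most frequent value of a numeric vector.
# Ties go to the smallest value, because table() sorts its entries
# and which.max() returns the first maximum.
most_common <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# First ten fixed acidity values from the str() output above.
fixed_acidity_head <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

most_common(fixed_acidity_head)
```

In this small sample 7.4 and 7.8 both occur three times, which is why the tie-breaking rule is worth stating.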

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## [1] 0.6

Volatile acidity has a bimodal distribution with peaks around 0.40 and 0.60 g/dm^3. Since there were a couple outliers, I adjusted the x-axis scale a bit to get a better look at the overall distribution. The median volatile acidity value was 0.52 g/dm^3 with a maximum value (and outlier) of 1.58 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.120   7.680   8.445   8.847   9.740  16.280

I think it could be useful to see how the total acidity (defined here as the sum of fixed and volatile acidities) relates to the other features, so I’ve added it to the dataset as total.acidity. Here is the distribution of total acidities, which looks similar to the fixed acidity distribution, as expected (since the volatile acidity values are much smaller than the fixed acidity values). The median total acidity was 8.445 g/dm^3, an increase of 0.545 g/dm^3 over the median fixed acidity.
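As a minimal sketch of how this derived column can be built, using only the first five wines from the `str()` output (the data frame name `wines` is an assumption):

```r
# First five wines from the str() output above.
wines <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70)
)

# Total acidity as defined in the text: fixed plus volatile acidity.
wines$total.acidity <- wines$fixed.acidity + wines$volatile.acidity

summary(wines$total.acidity)
```

Note that the median of a sum need not equal the sum of the medians: 7.90 + 0.52 = 8.42, while the observed median total acidity is 8.445.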

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01330 0.04219 0.06164 0.06220 0.07903 0.17220

Now that we know the total acidity, we can calculate the ratio of the volatile acidity to the total acidity for each wine. I added this to the dataset as volatile.acidity.ratio and plotted the distribution here. This looks similar to the bimodal distribution seen in volatile.acidity. Volatile acidities make up a small portion of the total acidity (which makes sense, if it can cause unpleasant vinegar-like flavors). The median volatile acidity ratio was 0.062 (6.2%), and 75% of the wines had a ratio below 0.079.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Citric acid also had a unique distribution; the most common value was 0.00 (132 wines), with 0.49 as the next most common value (68 wines). I wonder why 0.49 g/dm^3 is such a common value compared to the rest of the values, along with 0.02 (50 wines) and 0.24 (51 wines). It seems odd that so many wines had the exact same non-zero values, especially compared to the surrounding values. These citric acid concentration values seem to fall in about the same range as the volatile acidity values, although they are more skewed in general toward smaller concentrations. The median citric acid concentration was 0.26 g/dm^3, and 75% of the wines had a concentration below 0.42 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01187 0.03080 0.02925 0.04324 0.13200

Similar to what we did with volatile acidity, let’s look at the ratio of citric acid to the total acidity. I added it to the dataset as citric.acid.ratio and plotted its distribution here. Unsurprisingly, overall it looks very similar to the citric acid distribution (without the distinct peaks, which were lost in calculating the ratio). The citric acid ratios are smaller than the volatile acidity ratios, with 75% of the wines having a citric acid ratio below 0.043 (which is very close to the 1st Quartile value of 0.042 for the volatile acidity ratios). The median citric acid ratio was 0.031.
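Both acidity ratios follow the same recipe: divide the component by the total acidity (fixed plus volatile). A short sketch on the first four wines from the `str()` output, with `wines` and `total_acidity` as assumed names:

```r
# First four wines from the str() output above.
wines <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56)
)

# Denominator shared by both ratios: fixed plus volatile acidity.
total_acidity <- wines$fixed.acidity + wines$volatile.acidity

wines$volatile.acidity.ratio <- wines$volatile.acidity / total_acidity
wines$citric.acid.ratio      <- wines$citric.acid / total_acidity

round(wines[, c("volatile.acidity.ratio", "citric.acid.ratio")], 3)
```

Dividing by a denominator that differs from wine to wine is what smooths out the spikes at repeated citric acid values such as 0.49.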

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Residual sugar had a normal distribution centered around 2.2 g/dm^3 (which also was the median value), with a long tail that extended out to 15.5 g/dm^3. I adjusted the x-axis limits a bit to get a better look at the main distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Similarly, chlorides had a normal distribution centered around 0.07 g/dm^3 with a long tail of values that extended out to 0.61 g/dm^3. I also adjusted the x-limits a bit here to get a better feel for the main distribution of values, which appeared mostly normal. The median chloride value was 0.079 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
## [1] 6
## Source: local data frame [1 x 1]
## 
##     n
## 1 138

The free sulfur dioxide values were skewed toward lower concentrations, with 6.0 mg/dm^3 as the most common value (138 wines) but a maximum value of 72.0 mg/dm^3. The median free sulfur dioxide concentration was 14.0 mg/dm^3, and 75% of the wines had a value less than 21 mg/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
## [1] 28
## Source: local data frame [1 x 1]
## 
##   n
## 1 1

Total sulfur dioxide followed the same pattern, and was heavily skewed toward smaller values. A concentration of 28 mg/dm^3 was most common (43 wines), although the median value was 38 mg/dm^3. There were some big outliers, with 2 values greater than 280 mg/dm^3, while 75% of the wines had values below 62 mg/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02273 0.25930 0.37500 0.38230 0.48480 0.85710
## [1] 0.5
## Source: local data frame [1 x 1]
## 
##    n
## 1 68

I wonder what the ratio of free sulfur dioxide to total sulfur dioxide looks like. I’ve added it to the dataset as free.sulfur.dioxide.ratio and plotted its distribution here. Interestingly, the free sulfur dioxide ratio was skewed toward ratios below 0.500 in general (75% of the wines had values below 0.485), while the most common ratio was actually 0.500 (68 wines). I’m not sure why there were so many wines with a ratio of 0.500, especially compared to the surrounding values. The median ratio was 0.375.
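A sketch of how the ratio and the count at exactly 0.500 can be computed, using the first ten wines from the `str()` output (the vector names are assumptions):

```r
# First ten wines from the str() output above.
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

free_ratio <- free_so2 / total_so2

# Wines whose free sulfur dioxide is exactly half of the total (e.g. 9 of 18).
n_half <- sum(free_ratio == 0.5)

# Wines in a narrow band around 0.5, to judge how isolated the spike is.
n_near_half <- sum(abs(free_ratio - 0.5) < 0.02)

c(exactly_half = n_half, near_half = n_near_half)
```

One possible contributor to the spike, offered as an untested guess: both measurements are mostly recorded as whole numbers, so many different pairs (9 and 18, 10 and 20, and so on) reduce to exactly one half, while neighbouring ratios can be reached by far fewer pairs.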

## Warning: position_stack requires constant width: output may be incorrect

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040
## [1] 0.9972
## Source: local data frame [1 x 1]
## 
##    n
## 1 36

Density had a normal-looking distribution (slightly skewed toward higher densities), with a median value of 0.9968 g/cm^3. The most frequently occurring value was 0.9972 g/cm^3 (36 wines). I’ll be interested to see this broken out by quality rating.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
##           n
## 1 0.9806129

The pH values appear to be pretty normally distributed. The background info for this dataset stated that most wines have a pH between 3 and 4, which holds here, with 98% of the wines falling within that range, and 50% falling between 3.21 and 3.40. The median pH was 3.31.
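The 98% figure is the share of wines inside a closed interval, which in R is the mean of a logical vector. A small sketch, where `share_in_range` is a hypothetical helper and the input is only the first ten pH values from the `str()` output:

```r
# First ten pH values from the str() output above.
ph_head <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)

# Share of values in [lower, upper]; the mean of TRUE/FALSE is a proportion.
share_in_range <- function(x, lower, upper) {
  mean(x >= lower & x <= upper)
}

share_in_range(ph_head, 3, 4)        # the "typical wine" range from the text
share_in_range(ph_head, 3.21, 3.40)  # the interquartile range of the full data
```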

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The sulphate values also had a long tail, probably related to the long tail we saw in the total sulfur dioxide distribution (since our background info states that sulphates contribute to the overall sulfur dioxide levels). After adjusting the x-axis limits, the main distribution of values appears mostly normal, but skewed toward larger values. The median sulphate concentration was 0.620 g/dm^3, with 75% of the values falling below 0.73 g/dm^3. The maximum value (and outlier) was 2.00 g/dm^3. I suspect that value is related to the max-value outlier seen in the total sulfur dioxide values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## [1] 9.5
## Source: local data frame [1 x 1]
## 
##     n
## 1 139

The alcohol values were skewed toward larger percentages, with 50% of the wines between 10.2 and 14.9 % by volume. The most frequently occurring alcohol percentage was 9.5 (139 wines), which was also the 1st Quartile value.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
##           n
## 1 0.8248906

The quality values were somewhat normally distributed, but slightly skewed toward higher ratings. The median rating was 6 (638 wines), but the most frequently occurring rating was 5 (681 wines). 82% of the wines had one of those two ratings. Only 18 wines were rated 8, and the ratings only varied between 3 and 8 (on a 0-10 scale), again likely due to the way the ratings were calculated (the median rating from the three wine experts).
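The figures in this paragraph can be checked directly from the count table printed above. The sketch below rebuilds the ratings from those counts; the variable names are assumptions.

```r
# Counts per quality rating, copied from the table output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

n_wines <- sum(quality_counts)

# Share of wines rated 5 or 6.
share_5_or_6 <- sum(quality_counts[c("5", "6")]) / n_wines

# Expand the counts back into one rating per wine to get the median.
ratings <- rep(as.integer(names(quality_counts)), times = quality_counts)
median_rating <- median(ratings)

# The most frequent rating is the name of the largest count.
modal_rating <- names(quality_counts)[which.max(quality_counts)]
```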

Univariate Analysis

What is the structure of your dataset?

There are 1,599 observations (wines) and 13 variables in the raw file: the observation identifier X, 11 chemical properties, and quality (the rating from 0-10 given by the wine tasters). All of the variables except X and quality were measured, floating-point values.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the quality rating. I’m trying to determine how the other features potentially influence it.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think that the acidity features (volatile acidity, fixed acidity, and citric acid), residual sugar, sulfur dioxide features (free and total sulfur dioxide), pH, and density will have the biggest impacts on overall quality. I’m not much of a wine drinker though, so my unfamiliarity with chlorides and sulphates (in relation to wine) is probably adding bias to my assumptions. Even though there seems to be a small amount of variance in the densities, I think that could impact what is perceived as “mouthfeel”, so I think density could be significant. Sugar, acidity, and pH seem like they would impact how well “balanced” a wine is perceived to be, since something that is very tart or very sweet might be off-putting, if nothing else because it would taste different than what a wine drinker is used to or expecting. Volatile acidity is mentioned as creating a vinegar-like taste in high concentrations, so I suspect that will also affect taste and quality. Lastly, since the dataset description mentions that large concentrations of free sulfur dioxide become apparent in the nose and taste of a wine, I think it is very likely to have an impact on perceived quality, although I’m not sure if it would be in a positive or negative way.

Did you create any new variables from existing variables in the dataset?

I created a few different variables for this dataset. First, I created a variable called total.acidity by summing the volatile and fixed acidities. In order to further explore the relationship between free and total sulfur dioxide, I created a variable called free.sulfur.dioxide.ratio by finding the ratio of free sulfur dioxide to total sulfur dioxide. Similarly, I created volatile.acidity.ratio and citric.acid.ratio by calculating the ratio of volatile acidity and citric acid (respectively) to the total acidity mentioned above.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Since the data was already in a very tidy format, no additional formatting was necessary for the univariate analysis portion. I did remove the X variable from the dataset, since that simply represented the row number and didn’t provide any value to the analysis.

Volatile acidity had a bimodal distribution with peaks around 0.40 and 0.60 g/dm^3. I’m curious to see how that looks separated by quality ratings.

Citric acid also had a unique distribution; the most common value was 0.00 (132 wines), with 0.49 as the next most common value (68 wines). There were a few other values that occurred much more frequently than others, which was odd since they weren’t close together at all.

The total sulfur dioxide and sulphate distributions both had some large outliers, which I suspect are related to one another.

Otherwise, most of the distributions of the variables were fairly normal, or at least not too unusual.

Bivariate Plots Section

Correlation Between Variables

##                           fixed.acidity volatile.acidity citric.acid
## fixed.acidity                     1.000           -0.256       0.672
## volatile.acidity                 -0.256            1.000      -0.552
## citric.acid                       0.672           -0.552       1.000
## residual.sugar                    0.115            0.002       0.144
## chlorides                         0.094            0.061       0.204
## free.sulfur.dioxide              -0.154           -0.011      -0.061
## total.sulfur.dioxide             -0.113            0.076       0.036
## density                           0.668            0.022       0.365
## pH                               -0.683            0.235      -0.542
## sulphates                         0.183           -0.261       0.313
## alcohol                          -0.062           -0.202       0.110
## quality                           0.124           -0.391       0.226
## total.acidity                     0.995           -0.157       0.628
## volatile.acidity.ratio           -0.618            0.898      -0.725
## citric.acid.ratio                 0.419           -0.586       0.943
## free.sulfur.dioxide.ratio        -0.131           -0.073      -0.167
##                           residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                      0.115     0.094              -0.154
## volatile.acidity                   0.002     0.061              -0.011
## citric.acid                        0.144     0.204              -0.061
## residual.sugar                     1.000     0.056               0.187
## chlorides                          0.056     1.000               0.006
## free.sulfur.dioxide                0.187     0.006               1.000
## total.sulfur.dioxide               0.203     0.047               0.668
## density                            0.355     0.201              -0.022
## pH                                -0.086    -0.265               0.070
## sulphates                          0.006     0.371               0.052
## alcohol                            0.042    -0.221              -0.069
## quality                            0.014    -0.129              -0.051
## total.acidity                      0.117     0.102              -0.158
## volatile.acidity.ratio            -0.049    -0.009               0.043
## citric.acid.ratio                  0.138     0.205              -0.003
## free.sulfur.dioxide.ratio         -0.071    -0.105               0.327
##                           total.sulfur.dioxide density     pH sulphates
## fixed.acidity                           -0.113   0.668 -0.683     0.183
## volatile.acidity                         0.076   0.022  0.235    -0.261
## citric.acid                              0.036   0.365 -0.542     0.313
## residual.sugar                           0.203   0.355 -0.086     0.006
## chlorides                                0.047   0.201 -0.265     0.371
## free.sulfur.dioxide                      0.668  -0.022  0.070     0.052
## total.sulfur.dioxide                     1.000   0.071 -0.066     0.043
## density                                  0.071   1.000 -0.342     0.149
## pH                                      -0.066  -0.342  1.000    -0.197
## sulphates                                0.043   0.149 -0.197     1.000
## alcohol                                 -0.206  -0.496  0.206     0.094
## quality                                 -0.185  -0.175 -0.058     0.251
## total.acidity                           -0.108   0.685 -0.673     0.160
## volatile.acidity.ratio                   0.083  -0.278  0.513    -0.277
## citric.acid.ratio                        0.110   0.189 -0.411     0.301
## free.sulfur.dioxide.ratio               -0.371  -0.265  0.185    -0.010
##                           alcohol quality total.acidity
## fixed.acidity              -0.062   0.124         0.995
## volatile.acidity           -0.202  -0.391        -0.157
## citric.acid                 0.110   0.226         0.628
## residual.sugar              0.042   0.014         0.117
## chlorides                  -0.221  -0.129         0.102
## free.sulfur.dioxide        -0.069  -0.051        -0.158
## total.sulfur.dioxide       -0.206  -0.185        -0.108
## density                    -0.496  -0.175         0.685
## pH                          0.206  -0.058        -0.673
## sulphates                   0.094   0.251         0.160
## alcohol                     1.000   0.476        -0.084
## quality                     0.476   1.000         0.086
## total.acidity              -0.084   0.086         1.000
## volatile.acidity.ratio     -0.096  -0.347        -0.537
## citric.acid.ratio           0.128   0.219         0.367
## free.sulfur.dioxide.ratio   0.246   0.194        -0.141
##                           volatile.acidity.ratio citric.acid.ratio
## fixed.acidity                             -0.618             0.419
## volatile.acidity                           0.898            -0.586
## citric.acid                               -0.725             0.943
## residual.sugar                            -0.049             0.138
## chlorides                                 -0.009             0.205
## free.sulfur.dioxide                        0.043            -0.003
## total.sulfur.dioxide                       0.083             0.110
## density                                   -0.278             0.189
## pH                                         0.513            -0.411
## sulphates                                 -0.277             0.301
## alcohol                                   -0.096             0.128
## quality                                   -0.347             0.219
## total.acidity                             -0.537             0.367
## volatile.acidity.ratio                     1.000            -0.664
## citric.acid.ratio                         -0.664             1.000
## free.sulfur.dioxide.ratio                  0.015            -0.166
##                           free.sulfur.dioxide.ratio
## fixed.acidity                                -0.131
## volatile.acidity                             -0.073
## citric.acid                                  -0.167
## residual.sugar                               -0.071
## chlorides                                    -0.105
## free.sulfur.dioxide                           0.327
## total.sulfur.dioxide                         -0.371
## density                                      -0.265
## pH                                            0.185
## sulphates                                    -0.010
## alcohol                                       0.246
## quality                                       0.194
## total.acidity                                -0.141
## volatile.acidity.ratio                        0.015
## citric.acid.ratio                            -0.166
## free.sulfur.dioxide.ratio                     1.000

First, let’s look at the correlation values for each pair of variables to get a feel for potential relationships in the data. There doesn’t seem to be much strong correlation in the dataset, particularly relating to quality, which is a bit surprising to me. The strongest correlation value involving quality was with alcohol (0.476), followed by volatile acidity (-0.391), volatile acidity ratio (-0.347), sulphates (0.251), and citric acid (0.226). There could also be other chemical properties not measured here that could affect wine quality and taste, or the relationships could be more complex than what could be captured by Pearson’s Correlation Coefficient. None of the new variables I created had strong correlations with quality. Three of the four (total.acidity, volatile.acidity.ratio, and citric.acid.ratio) had weaker correlations with quality than at least one of their “parent” variables; the exception was free.sulfur.dioxide.ratio (0.194), which was slightly stronger than both free sulfur dioxide (-0.051) and total sulfur dioxide (-0.185).
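To make the ranking explicit, the quality column of the correlation matrix can be sorted by absolute value. The sketch below copies that column from the table above rather than recomputing it; on the full data frame the matrix itself would presumably come from something like `round(cor(wines), 3)`, which is an assumption about how the table was produced.

```r
# Correlations with quality, copied from the "quality" column above.
quality_cor <- c(
  fixed.acidity             =  0.124,
  volatile.acidity          = -0.391,
  citric.acid               =  0.226,
  residual.sugar            =  0.014,
  chlorides                 = -0.129,
  free.sulfur.dioxide       = -0.051,
  total.sulfur.dioxide      = -0.185,
  density                   = -0.175,
  pH                        = -0.058,
  sulphates                 =  0.251,
  alcohol                   =  0.476,
  total.acidity             =  0.086,
  volatile.acidity.ratio    = -0.347,
  citric.acid.ratio         =  0.219,
  free.sulfur.dioxide.ratio =  0.194
)

# Rank by strength of association, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]

head(ranked, 5)
```

Sorting on the absolute value matters here because the second and third strongest relationships are negative.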

Since we’re most concerned with how the chemical properties of wine affect perceived quality, I’ll focus on relationships involving quality first. For each boxplot I made quality a factored variable.
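The grouped summaries that follow have the layout that `by()` prints when the grouping variable is `factor(quality)`. A minimal sketch of that pattern on the first ten wines from the `str()` output; the object names are assumptions, and `tapply()` is shown as a more compact alternative rather than as the method used here.

```r
# First ten wines from the str() output above.
wines <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Treat the integer rating as a categorical variable.
quality_f <- factor(wines$quality)

# One summary per rating, printed with a "quality_f: <level>" header.
by(wines$alcohol, quality_f, summary)

# Compact per-rating statistics as named vectors.
medians <- tapply(wines$alcohol, quality_f, median)
iqrs    <- tapply(wines$alcohol, quality_f, IQR)
```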

Quality vs Alcohol

Alcohol had the strongest correlation with quality. This isn’t too surprising to me since I’d imagine a higher alcohol content would be related to a higher concentration of flavor. Lower concentrations of alcohol would likely have more of a “watery” mouthfeel and might not be perceived as being of a high quality.

This also matches up with my own experiences with beer (my brother is a professional brewmaster), where I’ve experienced more complex flavors in beers with higher alcohol contents. I’ve also experienced this with bourbon, where alcohol content impacts my own perceived quality in general, with most of my favorite bourbons having higher proofs in general than bourbons I didn’t care for.

I suspect that part of this is because a beer/wine/spirit with a higher alcohol content would need to be fairly “balanced” in its flavors to mask or hide the alcohol a bit more. I’ve experienced some beers and bourbons that had a higher than normal alcohol content, but tasted very harsh and were unpleasant because the alcohol “burn” was overwhelming. I suspect the brewer/distiller/winemaker would also be able to identify that taste before it ever got to market (if they were truly being objective) and adjust their product accordingly, perhaps to water it down a bit more.

## factor(quality): 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## factor(quality): 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## factor(quality): 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## factor(quality): 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## factor(quality): 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## factor(quality): 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00
## factor(quality): 3
## [1] 0.85
## -------------------------------------------------------- 
## factor(quality): 4
## [1] 1.4
## -------------------------------------------------------- 
## factor(quality): 5
## [1] 0.8
## -------------------------------------------------------- 
## factor(quality): 6
## [1] 1.5
## -------------------------------------------------------- 
## factor(quality): 7
## [1] 1.3
## -------------------------------------------------------- 
## factor(quality): 8
## [1] 1.55

## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

I made quality a factored variable to create a boxplot. The limits were adjusted in the second boxplot to remove some of the outliers. Here the relationship between quality and alcohol is much easier to see. It’s important to remember that there aren’t as many wines rated highly, but there definitely appears to be a positive relationship between alcohol and quality. The lowest rated wines (with a quality of 3) all had alcohol values less than or equal to 11%, while roughly 75% of the highly rated (quality of 7 or 8) wines had alcohol values greater than 11% abv. Wines with a quality of 8 had the largest inter-quartile range (1.55 % by volume), while wines rated as 5 had quite a few outliers at high alcohol values, but also had the lowest inter-quartile range (0.8 % by volume). With the exception of wines rated as a 5, there is a clear positive relationship between alcohol and quality.
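The "Removed 14 rows" warning above is what ggplot2 emits when limits are set on the scale: out-of-range rows are dropped before the boxplot statistics are computed, so the boxes in the trimmed plot can differ slightly from the grouped summaries. The sketch below contrasts that with `coord_cartesian()`, which only zooms. It uses the first ten wines from the `str()` output and hypothetical limits; how the limits were actually set for this report is not shown, so treat this as an illustration only.

```r
# First ten wines from the str() output above.
wines <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Hypothetical axis limits, chosen so that one wine falls outside them.
y_limits <- c(9, 10.2)

# Rows that a scale-based limit would discard before computing the boxes.
n_dropped <- sum(wines$alcohol < y_limits[1] | wines$alcohol > y_limits[2])

# The plotting part only runs if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  base_plot <- ggplot(wines, aes(x = factor(quality), y = alcohol)) +
    geom_boxplot()

  # Scale limits: out-of-range rows are removed, with a warning when drawn.
  p_clipped <- base_plot + ylim(y_limits[1], y_limits[2])

  # Coordinate limits: all rows kept, the view is simply zoomed.
  p_zoomed <- base_plot + coord_cartesian(ylim = y_limits)
}
```

If the goal is only to look past the outliers, the zoomed version keeps the medians and quartiles identical to the unrestricted plot.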

Here’s a stacked histogram of alcohol value colored by quality rating. Note that since the counts are stacked on top of one another (i.e., the maximum count value for each bin is the sum of ALL of the counts for each quality rating in that bin), it’s the relative count of each rating within each bin that is important. Put another way, it shows how the quality ratings are distributed for each bin of the histogram created in the Univariate section. It’s very apparent how many more wines are rated as 5 or 6 (82.5%) than all of the other ratings. The wine with the highest alcohol content (14.9 % by volume) had a rating of 5 (perhaps because it was “unbalanced”), while otherwise most of the wines with higher alcohol contents had ratings in the 6-8 range. On the other end of the spectrum, one of the two wines with the lowest alcohol contents (8.4 % by volume) was rated as a 3, while the other was rated as a 6. There definitely seem to be some other factors influencing quality, as I would suspect.
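The numbers behind a stacked histogram are a two-way table of bin by rating: each bar's height is a row total, and the coloured segments are the cells in that row. A sketch on the first ten wines from the `str()` output, with hypothetical bin edges:

```r
# First ten wines from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Hypothetical bin edges; each bin is open on the left, closed on the right.
bins <- cut(alcohol, breaks = seq(9, 10.5, by = 0.5))

# Cell counts: one row per bin, one column per quality rating.
counts <- table(bin = bins, quality = quality)

# Bar heights in the stacked histogram are the row totals.
bar_heights <- rowSums(counts)

# Share of each rating within a bin, which is what the text says matters.
shares <- prop.table(counts, margin = 1)
```

Reading `shares` row by row gives the within-bin mix of ratings without the distraction of differing bar heights.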

Finally, let’s look at some density plots of the alcohol percentages broken out by quality. In general, the same trends we saw in the boxplots hold here. It’s still a bit strange that the wines rated as 5 are the most skewed toward smaller alcohol percentages, but that explains the low median value from the boxplots.

Quality vs Volatile Acidity

## factor(quality): 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## factor(quality): 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## factor(quality): 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## factor(quality): 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## factor(quality): 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## factor(quality): 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

Since volatile acidity had the second strongest correlation with quality, let’s take a look at a scatter plot of volatile acidity vs. quality. I added jitter and transparency to prevent overplotting. It definitely looks like there is a negative correlation between the two, as indicated in the correlation for the two variables (-0.39). I think the boxplot will provide a better feel for the relationship since quality is really more of a factored variable.

## Warning: Removed 15 rows containing non-finite values (stat_boxplot).

Here the inverse relationship is much more apparent. There is a definite trend in lower volatile acidity levels as wine quality increases; the median volatile acidity level drops with each successive increase in quality rating, with the exception of 7 and 8, where the median stays the same. Since we know from the background info that high levels of volatile acidity can cause the wine to taste like vinegar, this inverse relationship between volatile acidity and quality makes sense.
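The claim about the medians can be checked from the grouped summaries printed at the start of this section. The sketch below copies the six medians and looks at the step between consecutive ratings (the vector names are assumptions):

```r
# Median volatile acidity per quality rating, copied from the summaries above.
va_medians <- c("3" = 0.845, "4" = 0.670, "5" = 0.580,
                "6" = 0.490, "7" = 0.370, "8" = 0.370)

# Change in the median from one rating to the next.
steps <- diff(va_medians)

# TRUE if the median never rises as quality increases.
never_rises <- all(steps <= 0)

# Number of steps where the median does not change at all.
n_flat <- sum(steps == 0)
```

The largest single drop is between ratings 3 and 4, though the rating-3 group contains only 10 wines, so its median is the least reliable of the six.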

These density plots of volatile acidity by quality rating explain the bimodal distribution seen in the histogram of volatile acidity values. The lower rated wines (3-6) explain the peak we saw around 0.60 g/dm^3, while the higher rated wines explain the peak around 0.40 g/dm^3. The wines rated in the middle of the scale (5-6) also had slightly bimodal distributions themselves.

Quality vs Sulphates

## factor(quality): 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## factor(quality): 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## factor(quality): 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## factor(quality): 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## factor(quality): 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## factor(quality): 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

The next strongest correlation value for quality was with sulphate concentrations, so let’s take a look at those plotted together. Here I again added jitter and some transparency to reduce overplotting. There does appear to be a trend toward higher sulphate levels in higher rated wines. There is a large amount of variance in the sulphates values for wines rated as 5 or 6. Let’s look at the boxplots for more insight.

## Warning: Removed 79 rows containing non-finite values (stat_boxplot).

Here the positive relationship between sulphates and quality is more apparent, but it is also clear there are a large number of outliers for the wines rated as 5 or 6. This likely drove down the correlation value. The median sulphate concentration increases with each quality rating (again, except for 7 and 8, where it remains the same). We know from the background info that sulphates act as an antimicrobial, so perhaps the microbes they are killing have an adverse effect on the perceived quality.

It’s a bit harder to see the relationship between quality and sulphates in this density plot, but the peak densities mostly increase with rating, matching the trend we saw in the boxplots. The plots for ratings 3-5 and 7-8 almost completely overlap each other, respectively (not counting the outliers).

## factor(quality): 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## factor(quality): 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## factor(quality): 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## factor(quality): 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## factor(quality): 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## factor(quality): 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

Finally, let’s look at quality and citric acid plotted against each other. Wow, there is a lot of variance in these values, but I can see a slight positive trend, which is consistent with the positive correlation between the two (0.226).

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

The boxplot makes this positive relationship much clearer, with the median citric acid concentrations increasing steadily with each successive quality rating, from a median value of 0.0350 g/dm^3 for wines rated 3, up to a median value of 0.420 g/dm^3 for wines rated 8. The median values almost seem to be grouped together in rating pairs (3-4, 5-6, 7-8). I’m surprised by the overall variance across the values.

There are some really unusual distributions in the plots. They are consistent with the observation earlier that certain values (0, 0.24, and 0.49) were more likely than others. This is especially apparent for wines rated 5 and 6, where there are distinct local maxima in the density plots around those values. Despite the unusual peaks, it is clear that higher rated wines tended to have larger citric concentrations than lower rated wines.

Other Relationships: Acidity and pH

Next let’s take a look at some of the other relationships in the dataset. First, let’s look at fixed acidity vs pH, which (unsurprisingly) had one of the strongest overall correlations in the dataset (-0.683).

Going back to my high school chemistry days, I remember that acidity and pH are related. The background info for the dataset confirms this, stating that pH “describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic).” According to Wolfram, the relationship is described by pH = -log10[H+], “where [H+] represents the concentration of hydrogen ions in units of moles per liter of volume”. Now that we know the nature of the relationship, let’s try a log10 transformation on the fixed acidity concentration. After the transformation, the relationship is much clearer. I also added a smoother (using a line of the form y ~ log10(x)), so it’s much easier to see that the relationship between fixed acidity and pH aligns with our knowledge of chemistry.
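One way to see why a log10 transform should straighten the relationship: fixed acidity is a mass concentration rather than a hydrogen ion concentration, so suppose (purely as an illustrative assumption, not something stated in the dataset background) that [H+] scales as a power of the fixed acidity concentration c, with hypothetical constants k and n. Then:

```latex
\begin{align*}
\mathrm{pH} &= -\log_{10}[\mathrm{H^+}] \\
[\mathrm{H^+}] &\approx k\,c^{\,n} \qquad \text{(assumed power law)} \\
\mathrm{pH} &\approx -\log_{10} k \;-\; n \log_{10} c
\end{align*}
```

Under that assumption, pH is a straight line in log10(c) with intercept -log10(k) and slope -n, which is the form y ~ log10(x) used for the smoother.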

Acidity Relationships

I’m curious to see what the relationships between the different types of acidities look like. Are wines with higher fixed acidity more likely to have higher volatile acidity as well? How does citric acid relate to either?

There doesn’t appear to be a very strong relationship between fixed acidity and volatile acidity, which is in line with its relatively weak correlation (-0.256). Perhaps this is because the volatile acidity concentrations are so much smaller than the fixed acidity values (likely because high levels of volatile acidity can make the wine taste like vinegar). I added some jitter and transparency again here to prevent overplotting.

Citric acid and fixed acidity had one of the stronger correlations in the dataset (0.672), which is apparent here. There is definitely a positive relationship between the two, as expected (since citric acid presumably contributes to the fixed acidity concentration).

Density and fixed acidity also had a fairly strong correlation (0.668), so let’s see what they look like plotted against each other. It is clear there is a positive relationship between the two, perhaps because the acids present in the wine have a density greater than water. According to Wolfram|Alpha, acetic acid (or volatile acidity in this dataset) has a density of 1.049 g/cm^3, while citric acid has a density of 1.665 g/cm^3, so at least two of the known acids present in red wine have densities greater than water (1.000 g/cm^3).

Sulfur Dioxide

Finally, let’s look at the relationship between free and total sulfur dioxide. They also had a relatively strong correlation (0.668), as expected (since free sulfur dioxide is a part of the total sulfur dioxide concentration).

Surprisingly, there doesn’t appear to be much of the same relationship (or any relationship at all) when it comes to total sulfur dioxide and sulphates (which also contribute to the total sulfur dioxide concentration). The correlation score between them was a very weak 0.043.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The main feature of interest in the dataset, quality, had relatively strong relationships (based on correlation scores) with four of the original features: alcohol, volatile acidity, sulphates, and citric acid. The strongest correlation value involving quality was with alcohol (0.476), followed by volatile acidity (-0.391), volatile acidity ratio (-0.347), sulphates (0.251), and citric acid (0.226).
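A ranking like this one can be produced by correlating every other column with quality and sorting by absolute value. The sketch below uses simulated toy data (the coefficients and noise levels are made up), so the printed numbers will not match the ones reported here.

```r
# Simulated toy data: two columns related to quality, one unrelated.
set.seed(1)
n <- 200
quality <- sample(3:8, n, replace = TRUE)
toy <- data.frame(
  quality          = quality,
  alcohol          = 9 + 0.5 * quality + rnorm(n),
  volatile.acidity = 0.9 - 0.08 * quality + rnorm(n, sd = 0.15),
  residual.sugar   = rnorm(n, mean = 2.5, sd = 1)
)

# Pearson correlation of each predictor with quality.
predictors <- setdiff(names(toy), "quality")
r <- sapply(toy[predictors], function(x) cor(x, toy$quality))

# Rank by strength, ignoring the sign, strongest first.
ranked <- r[order(abs(r), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting on abs(r) keeps strong negative relationships, such as the one with volatile acidity, near the top of the list instead of at the bottom.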

Alcohol and quality had the strongest correlation score, and there was a clear positive relationship between the two in the boxplots. Other than a slight dip for wines rated as a 5, the median alcohol values steadily increased with each rating. I suspect that higher alcohol wines have more concentrated flavor in general.

Volatile acidity had an inverse relationship with quality, and variance decreased with each increase in rating. Since high levels of volatile acidity can lead to vinegar-like flavors, this decrease in median values and variance as ratings increase isn’t surprising.

Like alcohol, sulphates had a positive relationship with quality. The variance in sulphate concentrations increased with rating in general, as did the median values. The wines rated as 5 or 6 had a lot of outliers. Sulphates act as an antimicrobial and antioxidant, so a positive relationship between quality and sulphates makes sense (assuming that microbes can lead to undesired flavors).

Lastly, citric acid also had a positive relationship with quality. Variance decreased in general as ratings increased, but the variance was surprisingly large across all of the ratings. However, there was a significant increase in median values as ratings increased (by more than a factor of 10 between ratings of 3 and 8). There also weren’t very many outliers.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I noticed a few interesting relationships involving fixed acidity and some of the other variables. Fixed acidity and pH had one of the strongest relationships in the dataset (a correlation of -0.683), which was unsurprising because pH is a way to measure how acidic or basic something is. Some quick research revealed that pH is calculated via the expression pH = -log10[H+], where [H+] is the concentration of hydrogen ions. Transforming the fixed acidity values via a log10 transform yielded a negative linear relationship, as expected.

Fixed acidity and volatile acidity weren’t strongly correlated (-0.256), but fixed acidity and citric acid were (0.672), likely because citric acid contributes to the overall fixed acidity concentrations. Density and fixed acidity were also strongly correlated (0.668), probably due to the fact that some of the acids contributing to the fixed acidity levels (including citric acid) have a larger density than water.

Finally, total sulfur dioxide and sulphates weren’t as strongly correlated (0.043) as I expected them to be, considering that sulphates contribute to the total sulfur dioxide concentration.

What was the strongest relationship you found?

Ignoring the (uninteresting) strong relationships involving the new variables and the original variables used to create them, the strongest relationship was between total acidity and density (0.685), followed by pH and fixed acidity (-0.683), pH and total acidity (-0.673), fixed acidity and citric acid (0.672), density and fixed acidity (0.668), and free sulfur dioxide and total sulfur dioxide (0.668). Unsurprisingly, the strongest overall relationship was between fixed acidity and total acidity, with a correlation of 0.995 between them.

Multivariate Plots Section

Quality and Acidity

Total Acidity vs Quality

## factor(quality): 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.460   8.051   8.882   9.244  10.460  12.180 
## -------------------------------------------------------- 
## factor(quality): 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.120   7.380   8.185   8.473   9.070  12.960 
## -------------------------------------------------------- 
## factor(quality): 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.520   7.735   8.390   8.744   9.490  16.260 
## -------------------------------------------------------- 
## factor(quality): 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.300   7.605   8.400   8.845   9.881  14.610 
## -------------------------------------------------------- 
## factor(quality): 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.320   7.880   9.110   9.276  10.480  16.280 
## -------------------------------------------------------- 
## factor(quality): 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.420   7.625   8.730   8.990  10.530  12.910
## factor(quality): 3
## [1] 2.40875
## -------------------------------------------------------- 
## factor(quality): 4
## [1] 1.69
## -------------------------------------------------------- 
## factor(quality): 5
## [1] 1.755
## -------------------------------------------------------- 
## factor(quality): 6
## [1] 2.27625
## -------------------------------------------------------- 
## factor(quality): 7
## [1] 2.605
## -------------------------------------------------------- 
## factor(quality): 8
## [1] 2.905

## Warning: Removed 32 rows containing non-finite values (stat_boxplot).

Now let’s examine some of the acidity relationships a bit deeper. Total acidity and quality were very weakly correlated overall (0.086), and that’s evident in the boxplots. With the exception of the lowest rated wines, the inter-quartile ranges increased with rating. I suspect the weak correlation is due in part to the opposite effects citric acid and acetic acid have on quality, keeping the overall acidity values roughly level.
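A sketch of how derived variables like these could be built and summarized, under stated assumptions: total acidity is taken as fixed plus volatile acidity (consistent with the later remark that the two differ only by the volatile acidity concentration), and each ratio is taken as the acid concentration divided by total acidity. The ratio column names and all data values below are hypothetical.

```r
# Toy data (made-up values); column names mirror the dataset.
toy <- data.frame(
  quality          = c(5, 5, 5, 5, 7, 7, 7, 7),
  fixed.acidity    = c(7.4, 7.8, 8.1, 9.0, 6.9, 8.5, 9.6, 10.2),
  volatile.acidity = c(0.70, 0.66, 0.58, 0.60, 0.30, 0.36, 0.38, 0.32),
  citric.acid      = c(0.00, 0.04, 0.24, 0.10, 0.40, 0.49, 0.45, 0.52)
)

# Assumed definitions of the derived variables.
toy$total.acidity          <- toy$fixed.acidity + toy$volatile.acidity
toy$volatile.acidity.ratio <- toy$volatile.acidity / toy$total.acidity
toy$citric.acid.ratio      <- toy$citric.acid / toy$total.acidity

# Inter-quartile range of total acidity within each quality rating.
iqr_by_rating <- tapply(toy$total.acidity, factor(toy$quality), IQR)
print(iqr_by_rating)
```

Comparing the IQR across ratings is a quick numeric check on the spread that the boxplots show visually.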

## factor(quality): 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04059 0.06702 0.10820 0.10000 0.11620 0.17210 
## -------------------------------------------------------- 
## factor(quality): 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02205 0.06482 0.08267 0.08486 0.10040 0.16940 
## -------------------------------------------------------- 
## factor(quality): 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01786 0.05172 0.06767 0.06792 0.08280 0.17220 
## -------------------------------------------------------- 
## factor(quality): 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01541 0.03982 0.05787 0.05888 0.07500 0.14830 
## -------------------------------------------------------- 
## factor(quality): 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01330 0.03209 0.03943 0.04631 0.05414 0.13390 
## -------------------------------------------------------- 
## factor(quality): 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02401 0.03167 0.04511 0.05113 0.06099 0.13180

The volatile acidity ratios mostly follow the trends seen in the volatile acidity values earlier, which makes sense since there weren’t any strong trends between total acidity and quality. There was a slight increase in median ratios (0.039 to 0.045) between wines rated 7 and 8, which lines up with the drop in median total acidity between those two ratings (9.11 to 8.73). Most wines had volatile acidity ratios less than 0.100 in general, while the highest rated wines (6-8) had a majority of their ratios below 0.075.

## factor(quality): 3
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0005365 0.0043600 0.0154600 0.0306300 0.0541900 
## -------------------------------------------------------- 
## factor(quality): 4
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.003415 0.011610 0.019380 0.031370 0.102900 
## -------------------------------------------------------- 
## factor(quality): 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01154 0.02577 0.02675 0.03951 0.09371 
## -------------------------------------------------------- 
## factor(quality): 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01157 0.03205 0.02959 0.04357 0.13200 
## -------------------------------------------------------- 
## factor(quality): 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.03332 0.04321 0.03891 0.04920 0.08532 
## -------------------------------------------------------- 
## factor(quality): 8
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.005008 0.040440 0.044050 0.041320 0.052630 0.057730

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

The citric acid ratios are much smaller than the volatile acidity ratios, with most of the ratios falling below 0.05. Otherwise, this looks very similar to the citric acid boxplot seen earlier, except with more outliers.

Total Acidity vs pH

## 
## Call:
## lm(formula = wine$pH ~ log10(wine$fixed.acidity))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50495 -0.06314  0.00164  0.06477  0.47367 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                4.45857    0.02890  154.26   <2e-16 ***
## log10(wine$fixed.acidity) -1.25921    0.03158  -39.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1093 on 1597 degrees of freedom
## Multiple R-squared:  0.4989, Adjusted R-squared:  0.4986 
## F-statistic:  1590 on 1 and 1597 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = wine$pH ~ log10(wine$total.acidity))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50645 -0.06378 -0.00085  0.06601  0.49691 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 4.5665     0.0329  138.79   <2e-16 ***
## log10(wine$total.acidity)  -1.3365     0.0349  -38.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1115 on 1597 degrees of freedom
## Multiple R-squared:  0.4787, Adjusted R-squared:  0.4783 
## F-statistic:  1466 on 1 and 1597 DF,  p-value: < 2.2e-16

Now we can plot the total acidity against pH as we did before with fixed acidity. I wonder how they look side-by-side. Unsurprisingly, they are very similar, since the same physical relationship between acidity and pH holds. Additionally, fixed and total acidity only differ by the volatile acidity concentrations, which are very small compared to both. The R-squared values for the two linear regression models were very close as well; fixed.acidity had a slightly better fit, with an R-squared value of 0.4989, while total.acidity had an R-squared value of 0.4787. However, the fact that both are so much less than 1.0 suggests there are other relationships not captured by the models.
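The two model calls shown in the output above can be reproduced in outline on simulated data. In this sketch the intercept, slope, and noise level used to generate pH are made up, so the R-squared values it prints are not the ones reported here; it only illustrates fitting both models and pulling out R-squared for comparison.

```r
# Simulated toy data (all generating values are made up).
set.seed(7)
n <- 300
fixed.acidity    <- runif(n, min = 5, max = 15)
volatile.acidity <- runif(n, min = 0.2, max = 1.0)
total.acidity    <- fixed.acidity + volatile.acidity
pH <- 4.5 - 1.3 * log10(fixed.acidity) + rnorm(n, sd = 0.1)
toy <- data.frame(pH, fixed.acidity, total.acidity)

# Same model form for both predictors: pH as a linear function of log10(acidity).
fit_fixed <- lm(pH ~ log10(fixed.acidity), data = toy)
fit_total <- lm(pH ~ log10(total.acidity), data = toy)

# Collect the R-squared values side by side.
r2 <- c(fixed = summary(fit_fixed)$r.squared,
        total = summary(fit_total)$r.squared)
print(round(r2, 4))
```

Passing the data frame through the data argument keeps the coefficient names short (log10(fixed.acidity) rather than log10(wine$fixed.acidity)).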

Now let’s look at the two variables with the strongest correlations with quality plotted against each other and colored by quality. I used a color scheme which accentuates the lowest and highest rated wines to help clarify the relationships present. While there are some exceptions, it is easy to see two main regions: the lowest quality wines tended to have lower alcohol percentages and higher volatile acidity concentrations, while the higher quality wines had higher alcohol percentages and lower volatile acidity concentrations, in general.

Finally, we can create a similar plot to examine volatile acidity and citric acid colored by quality. Here there isn’t quite as clear a delineation between the low and high rated wines, but it does look somewhat similar to the previous plot because citric acid and alcohol both have a positive correlation with quality. The highest rated wines tended to have higher citric acid concentrations and lower volatile acidity concentrations, and the lower rated wines tended to have lower citric acid concentrations and higher volatile acidity concentrations.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Most of the relationships from this part of the analysis were consistent with what was seen in the earlier sections. There were significant differences in the distributions of volatile acidity concentrations by quality rating, as seen in the density plot. The “average” (5-6) rated wines also had bimodal distributions, which helped explain the overall bimodal distribution we saw earlier.

The citric acid ratio, volatile acidity ratio, and total acidity variables I created didn’t add much value to determining which chemical characteristics influence wine quality. This was mainly because total acidity didn’t have a strong correlation with quality. Total acidity did have the expected relationship with pH, however.

Looking at alcohol plotted against volatile acidity and colored by quality rating helped to visualize the strongest relationships involving quality, even though alcohol and volatile acidity were weakly correlated (-0.202) themselves. The highest quality wines tended to have high alcohol percentages and low volatile acidity concentrations.

Were there any interesting or surprising interactions between features?

Citric acid had some really interesting density plots when colored by quality rating. There were certain values (mainly 0.00, 0.24, and 0.49) which appeared much more frequently than other values, and it’s not clear why. There were distinct peaks in the density plots at each of those values, especially for the wines rated 5 and 6. Perhaps it could be related to some sort of measurement or rounding error (since they roughly occur at multiples of 0.25).
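A quick way to confirm that a handful of values dominate is to tabulate the raw values and sort the counts. The vector below is a made-up toy sample, not the real citric acid column.

```r
# Toy citric acid values (made up), with a few deliberately repeated values.
citric <- c(0.00, 0.00, 0.00, 0.02, 0.24, 0.24, 0.24, 0.26,
            0.31, 0.49, 0.49, 0.49, 0.49, 0.50, 0.66)

# Count how often each distinct value occurs, most frequent first.
counts <- sort(table(citric), decreasing = TRUE)
print(head(counts, 3))

# Share of wines with no measurable citric acid.
share_zero <- mean(citric == 0)
print(share_zero)
```

On the real data, unusually high counts at a few isolated values would support the measurement or rounding explanation over a genuinely multi-peaked distribution.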

OPTIONAL: Did you create any models with your dataset?

I did not, mainly because none of the relationships seemed strong enough to justify creating a model other than for the sake of creating one.


Final Plots and Summary

Plot One

Description One

Citric acid concentration and quality rating had a positive relationship, with higher rated red wines tending to have higher citric acid concentrations. This is likely due to the fact that citric acid is known to add “freshness” and flavor to wines, which are both desirable. Many wines (8.26%) had no measurable citric acid at all; however, all of the highest rated wines had a citric acid concentration of at least 0.03 g/dm^3. There were also a few values which appeared much more frequently than others (at 0.00, 0.24, and 0.49 g/dm^3), especially among the “average” rated wines (quality ratings of 5-6), and it’s not clear why. This could be due to a measurement or rounding error.

Plot Two

Description Two

Alcohol had the strongest correlation with red wine quality (0.476) among all of the chemical properties measured. The lowest rated wines (with a quality of 3) all had alcohol values less than or equal to 11% abv, while roughly 75% of the highly rated (quality of 7 or 8) wines had alcohol values greater than 11% abv. With the exception of wines rated as a 5, there was a clear positive relationship between alcohol and quality. This makes sense since I’d expect a higher alcohol content would be related to a higher concentration of flavor. Wines with lower concentrations of alcohol would likely have more of a “watery” mouthfeel in comparison and might not be perceived as being of a high quality.
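Threshold statements like these come down to the share of wines in a rating group above a cutoff. A minimal sketch, using a hypothetical helper share_above() and made-up toy values:

```r
# Toy data (made-up values).
toy <- data.frame(
  quality = c(3, 3, 5, 5, 7, 7, 8, 8),
  alcohol = c(9.0, 10.9, 9.5, 11.4, 11.2, 12.5, 10.8, 13.1)
)

# Hypothetical helper: fraction of wines with the given ratings whose
# alcohol content exceeds the threshold (in % abv).
share_above <- function(df, ratings, threshold = 11) {
  mean(df$alcohol[df$quality %in% ratings] > threshold)
}

low  <- share_above(toy, 3)        # lowest rated wines
high <- share_above(toy, c(7, 8))  # highly rated wines
print(c(low = low, high = high))
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is why no explicit counting is needed.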

Plot Three

Description Three

Alcohol by volume and volatile acidity were the two chemical properties most closely related to quality in red wine. Alcohol had a positive relationship with quality, perhaps due to a higher concentration of flavor in wines with higher alcohol percentages. Volatile acidity had a negative relationship with quality rating, due to the fact that higher concentrations can lead to undesirable vinegar-like flavors. As evidenced by the two distinct regions in the plot, the lowest quality wines tended to have lower alcohol percentages and higher volatile acidity concentrations, while the higher quality wines had higher alcohol percentages and lower volatile acidity concentrations, in general.

Reflection

The red wine data set includes chemical property information and blind taste-test quality ratings for 1,599 red wines from Portugal. My goal was to determine which chemical properties had the strongest effect on perceived red wine quality. I started by examining each of the 14 variables in the dataset to look for any interesting distributions and to get a feel for the ranges of values. I also created a few new variables by taking ratios or sums of a few select variables, which I later found didn’t add much value. Then, I calculated the correlation coefficients for each combination of variables in order to determine the strengths of the relationships between the variables, particularly those involving quality.

Alcohol, volatile acidity, sulphates, and citric acid had the strongest correlations with quality. I was surprised that pH, density, and residual sugar didn’t have a big impact on quality. I also noticed a familiar relationship between pH and fixed and total acidity, based on the way that pH is defined. Finally, I used density plots and scatter plots colored by quality rating to better understand the multivariate relationships between the chemical properties and quality. Overall, none of the relationships with quality were particularly strong, so a simple model didn’t seem likely to be useful in this case.

The new variables I created didn’t really add value, and I was perplexed about a few of the citric acid values which appeared more often than others. The biggest difficulty was handling the complexity of the dataset. It was hard to keep track of all of the different relationships at play, and to determine where to focus next. Because there were so many different potential directions to go with the analysis, it was hard for me to balance my desire to be thorough and consistent with trying to focus only on the important information. It was a good taste of the difficulties of dealing with complex datasets. The boxplots and density plots helped the most to visualize the relationships within the data. I definitely gained a new appreciation for them.

There are a number of different ways to expand and improve this analysis. The dataset could be expanded to include more wines of this category, especially among the higher and lower rated wines (since they had relatively small populations). I would also have been interested to see how price factored in, even if the testers didn’t know it ahead of time. I also think the dataset could be expanded to include other varieties of red wine from other regions to see how those results compare or contrast. Lastly, a machine learning algorithm would be interesting to run on the dataset, particularly to help predict quality rating.