So, we can again reframe the data, but in two dimensions. This could be extended to 4, 5, or 10 dimensions with data living in a 2D plane in that higher dimension. As such, we can take data in 10 dimensions or even 100 and reduce it down to two dimensions.

We can reduce dimensions even if there isn't a zero eigenvalue. Consider a slight variation to the example: instead of the oval being on a 2D plane, there is a perturbation of the coordinates above and below the plane. For this type of problem, there would still be three eigenvectors, but none of the eigenvalues would equal zero. The values might be something like 10, 8, and 0.1. The eigenvalue of 0.1 indicates that there is very little variation in the data in the direction of the corresponding eigenvector. Earlier, we saw that all the information existed in two dimensions, which resulted in an eigenvector with an eigenvalue of 0. We can discard this dimension and project the data into two dimensions, making the dataset simpler. The data could have existed in five or 50 dimensions and, if a similar dynamic exists, could have been brought down to two, simplifying the problem and reducing the size of the data.

Our 2D ellipse was centered at (0, 0). If it were centered somewhere else, like (5, 4), then we'd take all the x-coordinates and subtract their mean. Similarly, we'd subtract the mean for the y-coordinates. This centers the ellipse at the origin. For PCA, the data is always preprocessed to subtract the mean in such a way. In Chapter 11, we'll see how to apply this idea to facial recognition.

As we saw, PCA can reduce the dimensions of a problem. In doing so, we approximate data, sometimes reducing the noise in measurement or capturing the nature of the data. In the next chapter, we approximate data by assuming it is linear.
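To make the mean-centering and projection steps concrete, here is a minimal sketch in Python, assuming NumPy is available; the synthetic data, the variable names, and the choice to keep the two largest-eigenvalue directions are illustrative assumptions, not the book's code.

```python
import numpy as np

# Illustrative data: 200 points that lie near a 2D plane embedded in 5 dimensions,
# centered away from the origin (so the mean subtraction matters).
rng = np.random.default_rng(0)
plane = rng.normal(size=(2, 5))                 # spans a hypothetical 2D subspace
points = rng.normal(size=(200, 2)) @ plane      # data living (almost) in that plane
points += 0.05 * rng.normal(size=(200, 5))      # small perturbation off the plane
points += np.array([5.0, 4.0, 1.0, -2.0, 3.0])  # shift the center away from (0, 0, ...)

# Step 1: subtract the mean of each coordinate, centering the data at the origin.
centered = points - points.mean(axis=0)

# Step 2: eigenvalues and eigenvectors of the covariance matrix.
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigenvalues in ascending order

# Step 3: keep the directions with the largest eigenvalues (here, the top two)
# and project, discarding the directions with very little variation.
top_two = eigenvectors[:, -2:]
reduced = centered @ top_two                      # a 200 x 2 dataset

print(np.round(eigenvalues, 3))   # three of the five eigenvalues should be near zero
print(reduced.shape)
```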
Figure 8.9. Points in Figure 8.7 rotated (a). The first and second principal components (b).

This vector, indeed, passes through the data at the same spot as the x-axis in the data in Figure 8.7. Our eigenvectors indicate the amount of variance of the data in that direction. So, the eigenvector with the highest eigenvalue is the principal component. The second eigenvector is perpendicular to the first and reframes the data in a new coordinate system as seen in Figure 8.9 (b).

One of the biggest advantages of this method is in dimension reduction. Suppose we rotate the ellipse into 3D as seen in Figure 8.10. The data is still two dimensional in nature. It has length and width but no height, which is why it all lies in a plane, which is drawn with a dotted line in Figure 8.10. For this type of data, we will find three eigenvectors, but one of them will have an eigenvalue of zero, indicating the data actually is two dimensional.

Figure 8.10. A 2D ellipse rotated in 3D.
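As a quick numerical check of that last claim, the following Python sketch generates its own planar data (a made-up stand-in for Figure 8.10, not the book's points) and finds three eigenvalues, one of which is essentially zero.

```python
import numpy as np

# Hypothetical stand-in for Figure 8.10: an ellipse in the xy-plane, then rotated into 3D.
t = np.linspace(0, 2 * np.pi, 300)
ellipse_2d = np.column_stack([4 * np.cos(t), 2 * np.sin(t), np.zeros_like(t)])

theta = np.pi / 6                      # tilt about the x-axis so the plane sits in 3D
rotation = np.array([[1, 0, 0],
                     [0, np.cos(theta), -np.sin(theta)],
                     [0, np.sin(theta),  np.cos(theta)]])
points_3d = ellipse_2d @ rotation.T

# Eigenvalues of the covariance matrix: two sizable values and one that is numerically
# zero, signaling that the data really lives in a plane.
eigenvalues = np.linalg.eigvalsh(np.cov(points_3d, rowvar=False))
print(np.round(eigenvalues, 6))
```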
Mining for Meaning

From smartphones to tablets to laptops and even to supercomputers, data is being collected and produced. With so many bits and bytes, data analytics and data mining play unprecedented roles in computing. Linear algebra is an important tool in this field. In this chapter, we touch on some tools in data mining that use linear algebra, many built on ideas presented earlier in the book.

Before we start, how much data is a lot of data? Let's look to Facebook. What were you doing 15 minutes ago? In that time, the number of photos uploaded to Facebook is greater than the number of photographs stored in the New York Public Library photo archives. Think about the amount of data produced in the past two hours or since yesterday or last week. Even more impressive is how Facebook can organize the data so it can appear quickly in your news feed.

11.1 Slice and Dice

In Section 8.3, we looked at clustering and saw how to break data into two groups using an eigenvector. As we saw in that section, it can be helpful, and sometimes necessary for larger networks, to plot the adjacency matrix of a graph. In Figure 11.1, we see an example where a black square is placed where there is a nonzero entry in the matrix and a white square is placed otherwise. The goal of clustering is to find maximally intraconnected components and minimally interconnected components. In a plot of a matrix, this results in darker square regions. We saw this for a network of about fifty Facebook friends in Figure 8.5 (b). Now, let's turn to an even larger network. We'll analyze the graph of approximately 500 of my friends on Facebook.
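A plot like the one described can be produced in a few lines of Python; this is a minimal sketch assuming NumPy and Matplotlib are installed, and the small random adjacency matrix is only a stand-in for the Facebook network, which isn't included with the text.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in network: a random symmetric adjacency matrix for 50 "friends".
rng = np.random.default_rng(1)
A = (rng.random((50, 50)) < 0.1).astype(int)
A = np.triu(A, 1)          # keep the upper triangle (no self-loops)...
A = A + A.T                # ...and mirror it so the graph is undirected

# A black square marks a nonzero entry; a white square marks a zero entry.
plt.spy(A, markersize=3)
plt.title("Adjacency matrix of the stand-in network")
plt.show()
```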
Figure 8.6. Looking at the Facebook social network in Figure 8.5 of approximately 50 friends via the adjacency matrix.

the x-axis, illustrated by the dotted lines in Figure 8.8 (a). This equals 600, giving the variance of the data over the y-coordinates. That is, it measures how spread out the data is in the y-direction. Compare this to the variance in the horizontal direction. The square of the distance from every point to the y-axis, illustrated by the dotted lines in Figure 8.8 (b), is 2400. From these two numbers, we find the data to be more spread out over the x-coordinates. You can see this visually since the data points are farther away from each other horizontally than vertically.

Figure 8.7. Points sampled from an ellipse centered at the origin.
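The two numbers can be reproduced in a few lines of Python; the sampled ellipse points below are hypothetical stand-ins for Figure 8.7, so only the method (summing squared distances to each axis) carries over.

```python
import numpy as np

# Hypothetical points standing in for Figure 8.7: an ellipse wider than it is tall.
rng = np.random.default_rng(2)
t = rng.uniform(0, 2 * np.pi, 100)
x = 10 * np.cos(t)     # larger horizontal radius
y = 5 * np.sin(t)      # smaller vertical radius

# Spread in the y-direction: squared distances from each point to the x-axis.
vertical_spread = np.sum(y ** 2)

# Spread in the x-direction: squared distances from each point to the y-axis.
horizontal_spread = np.sum(x ** 2)

print(vertical_spread, horizontal_spread)   # the second number should be larger
```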
This little equation contains a lot of information. It states that, according to this data, under this linear model, for each trillion dollars spent on chocolate, an additional 2,560 doctorates will be awarded in these fields! And, if we spend nothing on chocolate, then there will be 880 doctorates awarded!

To further this argument, let's see how correlated this data is. To compute correlation, we first subtract the mean from each column of data in Table 7.4. If we let c and d equal the mean-subtracted vectors of data, then the correlation equals

$$\mathrm{corr}(c, d) = \frac{c \cdot d}{\|c\|\,\|d\|},$$

which we saw earlier with the dot product. Indeed, we are again measuring an angle, which is helping us find correlation. If corr(c, d) = 1, then the vectors c and d point in the same direction and the data is perfectly correlated: knowing the value of one quantity enables you to exactly compute the value of the other.

For this data, the correlation coefficient equals 0.99. We have found strong correlation between doctorates in science and engineering and chocolate consumption. Ready to ask the government to invest in chocolate so we see more doctorates in science and engineering? Before calling your legislator, keep in mind an important saying in statistics,

CORRELATION DOESN'T MEAN CAUSATION.

For example, it's been shown that shark attacks are correlated with ice cream consumption. Consider a few more correlations found by Harvard law student Tyler Vigen on his website Spurious Correlations [16]. Sour cream consumption correlates with motorcycle accidents. He also found that the number of people who drown after falling out of a fishing boat correlates with the marriage rate.

Did you find yourself trying to think why these events might be connected? That's exactly because we tend to think about correlation being linked with causation, which can give us keen insight but also misleading intuition. Correlations can be real. Causation can lead to correlation. But that doesn't mean that correlation means causation. Two factors may be correlated due to their association with another factor. We may simply be seeing a random connection in data. Even when we see a strong and repeatable
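As a concrete illustration of the formula, here is a short Python sketch; the two data vectors are made-up placeholders, not the chocolate and doctorate figures from Table 7.4, which aren't reproduced here.

```python
import numpy as np

# Placeholder data vectors; in the book these would be the two columns of Table 7.4.
raw_c = np.array([3.0, 3.5, 4.1, 4.6, 5.2, 5.9])
raw_d = np.array([800, 950, 1100, 1300, 1450, 1700], dtype=float)

# Subtract the mean from each column of data.
c = raw_c - raw_c.mean()
d = raw_d - raw_d.mean()

# corr(c, d) = (c . d) / (||c|| ||d||), the cosine of the angle between the vectors.
correlation = np.dot(c, d) / (np.linalg.norm(c) * np.linalg.norm(d))
print(round(correlation, 4))        # close to 1 for strongly correlated data
```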
Table 11.1. Ratings of four people (Sam, Melody, Mike, and Lydia) for six movies (Movie 1 through Movie 6).

recommendations, you had Netflix's dataset of users' ratings, which are integers from 1 to 5, for all movies. Then, your program would give an anticipated rating for films for a user. This isn't recommending a film; it is supplying a prediction for how a film will be rated by a user. For the competition, Netflix supplied data on which you could test your ideas. When you thought you had something, you'd test your recommendation method by predicting ratings of movies and users in another dataset that had actual ratings of movies. If your predictions were at least 10% better than the recommendation system Netflix used, Netflix would write a check for a million dollars!

To do our analysis, we'll store the data, not surprisingly, in a matrix. A column contains one user's ratings for all the movies, with a 0 being a movie that wasn't rated. So, a row contains all the ratings for one movie.

When the Netflix competition was announced, initial work quickly led to improvement over the existing recommendation system. It didn't reach the magic 10% but the strides were impressive, nonetheless. The key was the SVD which we've seen and used earlier. We saw in Chapter 9 how the SVD can be used in data compression where a lot of data is expressed by less. So, let's represent our recommendation data in compressed form and use that to do our analysis.

Suppose we have Sam, Melody, Mike, and Lydia recommending six movies as seen in Table 11.1. We store the data in a matrix with six rows and four columns:

    0 5 5 5
    0 3 4 5
    3 4 0 3
    5 3 0 0
    5 4 4 5
    5 4 5 5
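To show how the SVD compresses such a ratings matrix, here is a brief Python sketch; it uses the six-by-four matrix as laid out above (whose exact entries may differ from the book's table in this copy), and the rank-2 truncation is an illustrative choice.

```python
import numpy as np

# Ratings matrix: rows are movies, columns are Sam, Melody, Mike, Lydia; 0 means "not rated".
M = np.array([[0, 5, 5, 5],
              [0, 3, 4, 5],
              [3, 4, 0, 3],
              [5, 3, 0, 0],
              [5, 4, 4, 5],
              [5, 4, 5, 5]], dtype=float)

# Singular value decomposition: M = U diag(s) Vt.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the two largest singular values: a compressed, rank-2 version of the data.
k = 2
M_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The entries of M_approx can be read as predicted ratings, including for the 0 (unrated) slots.
print(np.round(M_approx, 2))
```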
image's storage requirements. The rank 10 image is a better approximation than the rank 1 approximation, as it should be. Still, it could be characterized as a computationally bleary-eyed image!

Let's crank up the rank of our approximation and jump to a rank 50 matrix. In Figure 9.2, we see the closest rank 50 approximation to A and the original print. Can you tell which is which? It would be even harder if we raised the rank of the approximation. This probably helps you sense that the closest rank 50 matrix is doing pretty well at approximating a rank 509 matrix.

The SVD typically is not the method used in compression algorithms. Other methods exist that can do an even better job. Still, this gives us a sense of how image compression can be done and why it sometimes can lead to a fuzzier picture than the original. In some cases, it is very difficult to visually see a difference between the original and the approximation. If we are sending pictures from the Mars Rover, then using a compressed image that looks essentially the same as the original may be quite helpful.

9.3 In a Blur

Data compression started with a matrix that stored a higher resolution image. Lower rank approximations to the matrix resulted in degradation in the data. When the data in the original matrix has noise, lower rank approximations can actually improve the quality of the data. Noise often occurs in data and measurement. Reducing it can help identify the underlying phenomenon.

We'll create our own data that contains noise. To begin, we'll generate exact data with a sine-based function z of x and y, as adapted from lecture notes of Andrew Schultz of Wellesley College [15]. To construct the matrix, we take the z-values of this function over the grid formed between 0 and 3π/2, in increments of π/30, in both the x- and y-directions. We can then linearly interpolate between the points. This produces the graph seen in Figure 9.3 (a). To introduce noise, we'll add random values between −0.1 and 0.1 to every element of the matrix. This produces the image in Figure 9.3 (b).

As before, we'll construct a rank k approximation to the matrix. But what k should we choose? To aid in this decision, we turn to a graph of the
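The denoising idea can be sketched as follows in Python; the surface sin(x) sin(y) is only an assumed stand-in for the function in the text, and k = 3 is an illustrative choice rather than the book's.

```python
import numpy as np

# Grid from 0 to 3*pi/2 in steps of pi/30 in both x and y, as described in the text.
ticks = np.arange(0, 3 * np.pi / 2 + 1e-9, np.pi / 30)
X, Y = np.meshgrid(ticks, ticks)

# Assumed stand-in for the exact surface (the book's precise formula isn't reproduced here).
Z = np.sin(X) * np.sin(Y)

# Add noise: random values between -0.1 and 0.1 in every entry.
rng = np.random.default_rng(3)
Z_noisy = Z + rng.uniform(-0.1, 0.1, Z.shape)

# Rank-k approximation via the SVD; a small k keeps the smooth structure and drops much of the noise.
U, s, Vt = np.linalg.svd(Z_noisy, full_matrices=False)
k = 3
Z_denoised = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The denoised surface should be closer to the exact one than the noisy surface is.
print(np.linalg.norm(Z_noisy - Z), np.linalg.norm(Z_denoised - Z))
```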
Figure 9.1. The grayscale image (a) of the data contained in the matrix in (9.1). A 648 by 509 pixel image (b) of Albrecht Dürer's print Melencolia I.

in Figure 9.2 (a). An advantage of this low rank approximation can be in the storage savings. The original matrix stored 648(509) = 329,832 numbers, whereas the rank 1 approximation can be stored with 648 numbers from U (the first column), 1 singular value, and 509 numbers from V (the first row). So, it requires storing only 1,158 numbers, or 0.35% of the original data. That's a very small fraction of the original data! Great savings on storage, but not a great representation of the original data.

Let's increase the rank of our approximation. The closest rank 10 approximation to A is depicted in Figure 9.2 (b). Let's look at the storage savings again. This matrix requires storing 10 column vectors of U, 10 singular values, and 10 row vectors of V. This correlates to a storage requirement of 6,480 + 10 + 5,090 numbers, or about 3.5% of the original

Figure 9.2. The closest rank 1 (a) and rank 10 (b) approximations to the Dürer print. In (c) and (d), we see the closest rank 50 approximation and the original print. Can you tell which is which?
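The storage arithmetic generalizes: a rank k approximation of an m by n matrix needs mk + k + nk numbers instead of mn. A small Python sketch, using a random matrix as a stand-in for the 648 by 509 Dürer image (which isn't distributed with the text):

```python
import numpy as np

m, n, k = 648, 509, 10
A = np.random.default_rng(4).random((m, n))   # stand-in for the grayscale image matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # closest rank-k matrix to A

full_storage = m * n                          # 329,832 numbers for the original
compressed_storage = m * k + k + n * k        # k columns of U, k singular values, k rows of Vt
print(compressed_storage, compressed_storage / full_storage)   # 11,580 and about 3.5%
```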
Figure 8.2. An original image is repeatedly scaled via a linear transformation.

This matrix's only eigenvalue is λ, and any 2D column matrix is an eigenvector. Figure 8.2 shows the effect of repeatedly applying this transformation with λ = 1/3 in (a) and with λ = −1/3 in (b). In both figures, we have placed the original image and the images produced by three applications of this transformation on top of each other. Knowing the values of λ, can you determine which of the four images in (a) and in (b) is the starting image?

8.3 Finding Groupies

Clustering is a powerful tool in data mining that sorts data into groups called clusters. Which athletes are similar, which customers might buy similar products, or what movies are similar are all possible topics of clustering. There are many ways to cluster data, and the type of question you ask and data you analyze often influences which clustering method you will use. In this section, we'll cluster undirected graphs using eigenvectors.

We'll see how eigenvectors play a role in clustering by partitioning the undirected graph in Figure 8.3. A graph is a collection of vertices and edges. The vertices in this graph are the circles labeled 1 to 7. Edges exist between pairs of vertices. This graph is undirected since the edges have no
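To see the repeated scaling in code, here is a small Python sketch under the assumption that the transformation is the matrix with λ on the diagonal and zeros elsewhere (consistent with every 2D vector being an eigenvector); the triangle used as the "image" is a made-up example.

```python
import numpy as np

lam = 1 / 3                              # try -1/3 to also flip the image each time
S = np.array([[lam, 0.0],
              [0.0, lam]])               # scaling matrix: its only eigenvalue is lam

# A made-up "image": the corners of a triangle, one point per column.
points = np.array([[0.0, 6.0, 3.0],
                   [0.0, 0.0, 6.0]])

# Apply the transformation three times, keeping each intermediate image.
images = [points]
for _ in range(3):
    images.append(S @ images[-1])

for i, img in enumerate(images):
    print(f"after {i} applications:\n{np.round(img, 3)}")
```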
alumni giving rank. Excluding the columns for the score and name of the college, we will use the remaining columns in Table 7.5 to find a linear combination to create the score. The only column that is changed is that for SAT scores, which is reported as a range like 1310–1530 for Williams College. Since we will want one number per column, we will take the average of the range, which for Williams would be 1420.

Given there are fourteen columns of data, we need only the data from the top fourteen colleges, which corresponds to data from the rows for Williams College to the US Naval Academy. This creates the linear system

$$D_{14\times 14}\, w_{14\times 1} = s_{14\times 1},$$

where each row of D holds the fourteen data columns for one college, w holds the weights, and s holds the overall scores. Easy enough to solve. Keep in mind, solving the system produces weights that compute the exact overall score for the fourteen schools. So, we now apply them to the remaining schools and see the predicted scores.

Here, we see how poorly we've done. For example, Washington and Lee's overall score was 88; our method predicts 73.7. Wesleyan scored 86 although we predict 118. Also, Colby and Colgate both scored 84 and our method gives them scores of 88 and 96.

So, we may seem at a loss. But, as is often the case in mathematics, meeting a dead end may mean that only a slight change in perspective will lead to a fruitful new direction. Like the example with Olympic data, we weren't using all the data. Using least squares enabled us to include more gold medal times and can let us include more schools here. We'll only use the first 20 schools so we have a few left against which to test our method. This would produce the system

$$D_{20\times 14}\, w_{14\times 1} = s_{20\times 1}.$$
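A least squares version of this computation can be sketched as follows in Python; the data matrix and score vector are tiny random stand-ins (the actual Table 7.5 values aren't reproduced), so only the mechanics carry over.

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in data: 20 schools, 14 data columns each, plus their overall scores.
D = rng.random((20, 14)) * 100
true_weights = rng.random(14)
s = D @ true_weights + rng.normal(0, 1.0, 20)   # scores with a little noise

# Solve the overdetermined system D w = s in the least squares sense.
w, residuals, rank, _ = np.linalg.lstsq(D, s, rcond=None)

# Predicted scores for schools held out of the fit would use the same weights: D_new @ w.
print(np.round(w, 3))
```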