where - みる会図書館


Search query: WHEN LIFE IS LINEAR
74 results found.

1. WHEN LIFE IS LINEAR

73 Stretch and Shrink

Figure 8.3. An undirected graph to cluster.

Think of the edges as connections. We see that 1 is connected to 4, and since the graph is undirected, that means 4 is also connected to 1. In a directed graph, there is orientation. So, 1 may be connected to 4, but 4 may not be connected to 1. In such graphs, this is denoted by an arrow. To cluster this graph, we'll use the eigenvector of a matrix. First, we need to form the matrix, which begins by forming the adjacency matrix A of the graph, where a_ij = 1 if there is an edge between nodes i and j and is 0 otherwise. So, for Figure 8.3,

    A = [ 0 0 0 1 0 1 0
          0 0 0 1 1 0 1
          0 0 0 0 1 0 1
          1 1 0 0 0 1 0
          0 1 1 0 0 0 0
          1 0 0 1 0 0 0
          0 1 1 0 0 0 0 ].

The next step is forming the matrix D, which is a diagonal matrix where d_ii equals the row sum of the ith row of the matrix A. Since the first row of A sums to 2, the diagonal element of D in the first row is set to 2. Similarly, since the second row of A sums to 3, the diagonal element in that row of D equals 3. Continuing in this way, we find

    D = [ 2 0 0 0 0 0 0
          0 3 0 0 0 0 0
          0 0 2 0 0 0 0
          0 0 0 3 0 0 0
          0 0 0 0 2 0 0
          0 0 0 0 0 2 0
          0 0 0 0 0 0 2 ].
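A short NumPy sketch of this construction may help; the edge list below is the one read off Figure 8.3 above, and D is formed directly from the row sums:

```python
import numpy as np

# Adjacency matrix A of the graph in Figure 8.3 (a_ij = 1 when nodes i
# and j share an edge) and the diagonal degree matrix D of row sums.
A = np.array([
    [0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
])

D = np.diag(A.sum(axis=1))  # d_ii is the i-th row sum (the degree of node i)
print(np.diag(D))           # -> [2 3 2 3 2 2 2]
```

The first two diagonal entries, 2 and 3, match the row sums computed in the text.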

2. WHEN LIFE IS LINEAR

What Are the Chances? 103

Figure 10.7. Sierpiński's triangle created with the chaos game, visualized after 5,000 (a), 20,000 (b), and 80,000 (c) steps.

the new vector, then one step of our game is captured in

    v_{n+1} = (1/2)(v_n + p),

where p is randomly chosen from p1 = [0 1]^T, p2 = [0 0]^T, and p3 = [1 0]^T. Sierpiński's triangle forms after an infinite number of iterations, but there comes a point after which a larger number of iterations no longer produces visible differences. Such changes are only perceived after zooming into regions. To see this visually, Figures 10.7 (a), (b), and (c) show this process after 5,000, 20,000, and 80,000 iterations. We can also represent the chaos game as

    T(v) = [ 1/2   0  ] v + (1/2) p,
           [  0   1/2 ]

where p is randomly chosen to be p1, p2, or p3. T(v) is a new point that we graph since it is part of the fractal. Then, we let v equal T(v) and perform the transformation again. This looping produces the fractal. This formula becomes more versatile when we add another term:

    T(v) = [ cos θ  -sin θ ] [ 1/2   0  ] v + (1/2) p.
           [ sin θ   cos θ ] [  0   1/2 ]

Let's set θ to 5 degrees. This produces the image in Figure 10.8 (a). An interesting variation is to keep θ at 0 degrees when either point 2 or 3 is chosen in the chaos game. If point 1 is chosen, then we use θ equal to 5 degrees. This produces the image in Figure 10.8 (b).
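The rotated transformation can be sketched in a few lines of NumPy. Assumptions here: θ = 5 degrees applied on every step, and an arbitrary starting point inside the unit square; the text varies these to produce Figures 10.8 (a) and (b).

```python
import numpy as np

# Rotated chaos game: T(v) = R(theta) * (1/2 I) * v + (1/2) p.
theta = np.radians(5)  # theta = 5 degrees, as in the text
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.array([[0.5, 0.0],
              [0.0, 0.5]])
corners = [np.array([0.0, 1.0]),   # p1
           np.array([0.0, 0.0]),   # p2
           np.array([1.0, 0.0])]   # p3

rng = np.random.default_rng(0)
v = np.array([0.3, 0.3])           # arbitrary starting point (an assumption)
points = []
for _ in range(20000):
    p = corners[rng.integers(3)]   # p is randomly chosen from p1, p2, p3
    v = R @ (S @ v) + 0.5 * p      # one step of the game
    points.append(v)
points = np.array(points)
```

Scatter-plotting `points` (for example with matplotlib) reproduces images like Figure 10.8 (a); using θ = 0 except when point 1 is chosen gives the variation in (b).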

3. WHEN LIFE IS LINEAR

25 Fitting the Norm

Table 4.1. Movie ratings.

Oscar's o = [0 5 -3 4] and Emmy's e = [5 2 5 5]. To find the preference similarity between Oscar and me, we compute ||o - m|| = 11.8743. Similarly, the distance between Emmy's and my movie preferences is 12.0830. So, under this norm, my taste is more like Oscar's than Emmy's. The distance between Emmy's and Oscar's movie vectors is 9.9499, so they may enjoy a movie together more, at least if their experience aligns with Euclidean distance.

We just built our vectors from the columns of Table 4.1. Why stop there? Let's create our vectors from the rows of this table, which builds movie vectors. For example, the American Hustle vector is [0 5 -3]. With user vectors, we found similar users. With movie vectors, we can find similar movies, at least according to users' ratings. For example, the Euclidean distance between the American Hustle and Phantom vectors is 4.1231 = sqrt((0 - 4)^2 + (5 - 5)^2 + (-3 - (-4))^2). If you find the Euclidean distance between every pair of vectors, you'd find these to be the closest. But, how helpful is this? Each person's ratings were created randomly. What if we take real ratings? Could we mine useful information?

Rather than random ratings, let's look at 100,000 ratings of over 900 people with the MovieLens dataset collected by the GroupLens Research Project at the University of Minnesota. This dataset consists of 100,000 ratings (where a rating is between 1 and 5) from 943 users on 1,682 movies, where each user rated at least 20 movies. The data was collected from September 1997 through April 1998. Let's create a movie vector where the elements are the ratings of all 943 people. A zero indicates that a movie was not rated.
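The movie-vector distance above is a one-line NumPy computation. A minimal sketch, using the two row vectors printed in the text:

```python
import numpy as np

# Movie vectors built from rows of Table 4.1, as printed in the text.
american_hustle = np.array([0, 5, -3])
phantom = np.array([4, 5, -4])

# Euclidean (2-norm) distance between the two movies' rating vectors.
dist = np.linalg.norm(american_hustle - phantom)
print(round(dist, 4))  # sqrt(17), approximately 4.1231
```

The same call computes the user-to-user distances, such as the 11.8743 between Oscar's preferences and mine.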

4. WHEN LIFE IS LINEAR

Mining for Meaning

From smartphones to tablets to laptops and even to supercomputers, data is being collected and produced. With so many bits and bytes, data analytics and data mining play unprecedented roles in computing. Linear algebra is an important tool in this field. In this chapter, we touch on some tools in data mining that use linear algebra, many built on ideas presented earlier in the book.

Before we start, how much data is a lot of data? Let's look to Facebook. What were you doing 15 minutes ago? In that time, the number of photos uploaded to Facebook is greater than the number of photographs stored in the New York Public Library photo archives. Think about the amount of data produced in the past two hours or since yesterday or last week. Even more impressive is how Facebook can organize the data so it can appear quickly in your news feed.

11.1 Slice and Dice

In Section 8.3, we looked at clustering and saw how to break data into two groups using an eigenvector. As we saw in that section, it can be helpful, and sometimes necessary for larger networks, to plot the adjacency matrix of a graph. In Figure 11.1, we see an example where a black square is placed where there is a nonzero entry in the matrix and a white square is placed otherwise. The goal of clustering is to find maximally intraconnected components and minimally interconnected components. In a plot of a matrix, this results in darker square regions. We saw this for a network of about fifty Facebook friends in Figure 8.5 (b). Now, let's turn to an even larger network. We'll analyze the graph of approximately 500 of my friends on Facebook. 106
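A plot like Figure 11.1 is often called a spy plot. As a rough, text-only sketch of the idea (the 5-node adjacency matrix below is made up, standing in for the friend network):

```python
import numpy as np

# Text-only version of the matrix plot described above: '#' where an
# entry is nonzero, '.' otherwise. The 5-node adjacency matrix is made
# up, standing in for the Facebook friend network.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 1, 1, 0],
])

rows = ["".join("#" if entry else "." for entry in row) for row in A]
print("\n".join(rows))
```

When the nodes are ordered so that clusters sit together, the '#' symbols gather into dark blocks along the diagonal, which is what Figure 11.1 shows at scale.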

5. WHEN LIFE IS LINEAR

108 When Life is Linear

Figure 11.3. An undirected graph to cluster.

lower right. There are still connections outside each cluster. This is due to friendships that cross the groups. But, what are the groups? Clustering won't identify them. The smaller cluster is largely alumni from Davidson College (where I teach) and assorted faculty. The larger cluster contains friends from various other parts of my life: high school, college, graduate school, and such. This isn't entirely true, but overall the pattern is identifiable. So, we'll learn to break the matrix into more groups, using an extension of this eigenvector method. Let's return to the small network from Section 8.3, which is reprinted in Figure 11.3. As we saw earlier, we want to look at the eigenvector corresponding to the second smallest eigenvalue of the Laplacian matrix, which for this problem is

    L = [  2  0  0 -1  0 -1  0
           0  3  0 -1 -1  0 -1
           0  0  2  0 -1  0 -1
          -1 -1  0  3  0 -1  0
           0 -1 -1  0  2  0  0
          -1  0  0 -1  0  2  0
           0 -1 -1  0  0  0  2 ].

The eigenvector of interest is

    [ 0.4801  -0.1471  -0.4244  0.3078  -0.3482  0.4801  -0.3482 ]^T.
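This eigenvector is straightforward to compute with NumPy; a sketch using the Laplacian above (`numpy.linalg.eigh` returns eigenvalues in ascending order, so the second column of eigenvectors pairs with the second smallest eigenvalue):

```python
import numpy as np

# Laplacian L = D - A for the seven-node graph of Figure 11.3.
L = np.array([
    [ 2,  0,  0, -1,  0, -1,  0],
    [ 0,  3,  0, -1, -1,  0, -1],
    [ 0,  0,  2,  0, -1,  0, -1],
    [-1, -1,  0,  3,  0, -1,  0],
    [ 0, -1, -1,  0,  2,  0,  0],
    [-1,  0,  0, -1,  0,  2,  0],
    [ 0, -1, -1,  0,  0,  0,  2],
], dtype=float)

vals, vecs = np.linalg.eigh(L)  # eigenvalues returned in ascending order
fiedler = vecs[:, 1]            # eigenvector of the second smallest eigenvalue
print(np.sign(fiedler))         # the signs split the nodes into two clusters
```

Nodes 1, 4, and 6 share one sign and nodes 2, 3, 5, and 7 the other, matching the eigenvector printed above (up to an overall sign flip, which eigenvector routines are free to choose).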

6. WHEN LIFE IS LINEAR

89 Zombie Math—Decomposing

Figure 9.3. A 3D graph (a), with noise (b), that is reduced (c) using the SVD.

singular values seen in Figure 9.4. Looking at the plot, we see a steep drop-off in values beginning at the seventh largest singular value. That drop-off is the signal of a k to choose. We'll take k = 8 and construct the rank 8 matrix approximation with the SVD. Now, we plot the rank 8 matrix approximation as seen in Figure 9.3 (c). The noise is reduced, though we do not entirely regain the image in Figure 9.3 (a). This technique is used when you may know there is noise but don't know where or how much. So, having improvement in the data like that seen in Figures 9.3 (b) and (c) can be important. The blurring of the image that occurred in compression aids in reducing noise.

Figure 9.4. Singular values of the matrix containing the data graphed in Figure 9.3.
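A sketch of choosing k and building the rank-k approximation with NumPy. The matrix here is synthetic (a random rank-7 matrix plus noise), standing in for the Figure 9.3 data:

```python
import numpy as np

# Rank-k approximation with the SVD. The data is synthetic: a random
# rank-7 matrix plus noise, standing in for the Figure 9.3 data.
rng = np.random.default_rng(1)
clean = rng.standard_normal((50, 7)) @ rng.standard_normal((7, 50)) / 7
noisy = clean + 0.05 * rng.standard_normal((50, 50))

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)

# The singular values drop off sharply once the low-rank signal is
# exhausted; keeping k of them gives the best rank-k approximation
# in the Frobenius norm.
k = 8
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]
print(s[:10])  # the values past the drop-off are much smaller
```

Plotting `s`, as in Figure 9.4, makes the drop-off visible and suggests the k to keep.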

7. WHEN LIFE IS LINEAR

42 When Life is Linear

Figure 5.9. Three three-pixel images.

Finally, let's take three pixel values that do not vary linearly, 50, 30, and 103, which we see visualized in Figure 5.9 (c). Then, g(x) = 50 - 2(30) + 103 = 93. Keep in mind, we could get negative values for g(x). If this happens, we will then replace g(x) by its absolute value.

We applied the formula to one pixel. Now, we will apply it to every pixel in an n by m matrix of pixel values. We will call the pixel in row i and column j, pixel (i, j). We can apply our formula to detect changes in color in the horizontal direction by computing

    P = (value of pixel (i, j+1)) - 2(value of pixel (i, j)) + (value of pixel (i, j-1)).

For instance, suppose the pixel at row i and column j has a grayscale value of 25. The pixel to the left of it has a value of 40 and the pixel to the right of it has a value of 20. Then P = 40 - 2(25) + 20 = 10. We will then replace the pixel in row i and column j by the value of P. Doing this for every pixel in the image forms a new image. If a pixel in the original image had the same value as the pixels to the right and left of it, then the new value will be 0. Similarly, if the color at a pixel changes linearly in the horizontal direction, then it will be colored black in the new image. Again, we take the absolute value of this computation, which is not a linear process but is the only nonlinear step. There is the question of what values to set for the pixels on the top and bottom rows. In computer graphics, an option, which is the one we'll take, is to set these pixels to black.

Where's the linear algebra? Our computation is, in fact, a dot product. For the example, if we are interested in a pixel with values 40, 25, and 20, then the new image would have the pixel in the same location colored with the value of [40 25 20] · [1 -2 1]. More generally, let u = [p_{i,j+1} p_{i,j} p_{i,j-1}] · [1 -2 1], where p_{i,j} equals the value of the pixel (i, j) in the image. Then, we replace pixel p_{i,j} in the image with u.

What does this look like for an image? Applying this technique to the grayscale image of the Mona Lisa in Figure 5.10 (a) produces the image in
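The per-pixel dot product can be sketched directly in NumPy. This is a minimal version that sets all boundary pixels to black, a slight simplification of the text's treatment:

```python
import numpy as np

def horizontal_edges(img):
    """Replace each interior pixel with |p(i,j+1) - 2 p(i,j) + p(i,j-1)|."""
    out = np.zeros_like(img)          # boundary pixels are set to black (0)
    kernel = np.array([1, -2, 1])
    for i in range(img.shape[0]):
        for j in range(1, img.shape[1] - 1):
            # the computation is a dot product with [1 -2 1]
            out[i, j] = abs(np.dot(img[i, j - 1:j + 2], kernel))
    return out

# The text's example: left pixel 40, center 25, right 20 gives P = 10.
print(horizontal_edges(np.array([[40, 25, 20]])))
```

Applied to a full grayscale image (a 2-D array of pixel values), this produces the edge image described for the Mona Lisa.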

8. WHEN LIFE IS LINEAR

120 When Life is Linear

Figure 12.1. A fictional season played between three NCAA basketball teams. A directed edge points from the winning team to the losing team. The weight of an edge indicates the difference between the winning and losing scores. In (a), we see a season where transitivity of scores holds. In (b), scores are given that do not match transitivity, which is often the case in a season of data.

Additional games would add more rows to M and p. For most sports with a season of data, M will have many more rows than columns. There are no values for r1, r2, and r3 that will make all three equations simultaneously true. We turn to least squares, which leads us to multiply both sides of the equation by the transpose of M, in which the rows become columns. So,

    M^T M r = M^T p,

which becomes

    [  2 -1 -1 ]
    [ -1  2 -1 ] r = M^T p.
    [ -1 -1  2 ]

This system has infinitely many solutions. So, we take one more step and replace the last row of the matrix on the lefthand side of the equation with [1 1 1] and the last entry in the vector on the righthand side with a 0. This will enforce that the ratings sum to 0. Finding the ratings corresponds to solving the resulting system. (12.1)
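The whole procedure fits in a few lines of NumPy. The point differentials in p below are made-up numbers, not the values from Figure 12.1:

```python
import numpy as np

# Least-squares ratings for a three-team season. Each row of M has a 1
# for the winner and a -1 for the loser of one game; the point
# differentials in p are made-up numbers, not Figure 12.1's values.
M = np.array([
    [1.0, -1.0,  0.0],   # team 1 beat team 2
    [1.0,  0.0, -1.0],   # team 1 beat team 3
    [0.0,  1.0, -1.0],   # team 2 beat team 3
])
p = np.array([3.0, 10.0, 4.0])

A = M.T @ M          # multiply both sides by the transpose of M
b = M.T @ p
A[-1, :] = 1.0       # replace the last row with ones ...
b[-1] = 0.0          # ... and the last entry with 0: ratings sum to 0

r = np.linalg.solve(A, b)
print(r)             # ratings r1, r2, r3
```

With these assumed differentials, team 1 rates highest and team 3 lowest, and the three ratings sum to zero as the construction requires.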

9. WHEN LIFE IS LINEAR

What Are the Chances?

If a business owner finds the company coming up on the second page of results, there is understandable concern. Understanding the algorithm behind many search engines can help explain why one page is listed before another.

What do we want to have returned from a query to a search engine? We need web pages to be relevant to our query. We also need a sense of the quality of a web page, and this is where we will focus our attention. With billions of web pages out there, how can we possibly determine the quality of a page? Google tackles this issue with the use of the PageRank algorithm, developed by Page and Brin. Google determines the popularity of a web page by modeling Internet activity. If you visit web pages according to Google's model, which pages would you end up at the most? The fraction of time spent on a web page yields that page's PageRank.

What is Google's model of surfing? Is someone tracking your surfing to build the model? Google models everyone as being a random surfer by assuming that you randomly choose links to follow. In this way, the model stays away from making decisions based on preferred content.

The PageRank model assumes that you have an 85% chance of following a hyperlink on a page, and a 15% chance of jumping to any web page in the network (with uniform probability), any time you are on a web page with links on it. If you are on a web page with no links on it, like many PDF files or even video or audio files, then you are equally likely to jump anywhere on the Internet. Such a page is called a dangling node. You could even jump back to your current page! That can seem odd. Why allow someone to jump back to the same page? Why would you ever do that? The better question is why let the model do this. Why? To guarantee an answer. This means no matter how the World Wide Web is organized, we always have a ranking and are never left unable to rank the web pages.

What guarantees this is a theorem based on the linear algebra Google uses for the PageRank algorithm. Let's see this model of surfing in action on the network in Figure 10.4. The graph represents six web pages, with the vertices or nodes of the graph representing web pages. A directed edge from one vertex to another represents a link from one web page to another. So, on web page 1 there is a link that you can click and go to web page 4. On web page 4, there are links to web pages 2 and 5. Web page 6 has no links. By the PageRank model, if you are on web page 6, then you will jump, with equal probability, to any web page in the network. Said another way,
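The random-surfer model can be sketched in a few lines of NumPy. The link structure below keeps only the links the text names (1 to 4; 4 to 2 and 5; page 6 with no links) and fills in the rest arbitrarily, since the full Figure 10.4 graph is not reproduced here:

```python
import numpy as np

# Random-surfer model on a six-page network. Only the links the text
# names are certain (1 -> 4; 4 -> 2 and 4 -> 5; page 6 has no links);
# the remaining links are assumptions for the sake of a runnable example.
n = 6
links = {0: [3], 1: [2], 2: [0, 1], 3: [1, 4], 4: [5], 5: []}  # 0-indexed

S = np.zeros((n, n))             # column-stochastic transition matrix
for page, outs in links.items():
    if outs:
        for target in outs:
            S[target, page] = 1.0 / len(outs)
    else:
        S[:, page] = 1.0 / n     # dangling node: jump anywhere uniformly

G = 0.85 * S + 0.15 / n          # 85% follow a link, 15% teleport

v = np.full(n, 1.0 / n)
for _ in range(100):
    v = G @ v                    # power iteration toward the PageRank vector
print(np.round(v, 4))
```

The resulting vector v holds each page's long-run share of the surfer's time, which is exactly its PageRank; the teleportation term is what guarantees the iteration converges to a unique answer for any link structure.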

10. WHEN LIFE IS LINEAR

102 When Life is Linear

Figure 10.6. Sierpiński's triangle, a fractal, named after Wacław Sierpiński.

Rules
1. Place a dot halfway between squares 1 and 2.
2. Roll a die and place a new dot halfway between where you placed the last dot and square 1 if you roll 1 or 2, square 2 if you roll 3 or 4, or square 3 if you roll 5 or 6.
3. Return to Step 2.

Play a few times! What shape emerges? If you played long and accurately enough, the emerging image is a fractal known as Sierpiński's triangle, seen in Figure 10.6. The image contains three copies of the larger image. There is one at the top and two along the bottom. Magnifying an object and seeing similarities to the whole is an important property of fractals. An object with self-similarity has the property of looking the same as or similar to itself under increasing magnification.

How do we create this shape using linear algebra? Let's look carefully at the rules. Let's represent the squares in the game by the vectors p1 = [0 1]^T, p2 = [0 0]^T, and p3 = [1 0]^T. The game we just played, sometimes called the chaos game, entails taking our current vector v and letting the new vector be the point halfway between the current v and the vector for square 1, 2, or 3, which we choose randomly. If we let vn denote
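The three rules can be played by a computer in a few lines. A minimal sketch in pure Python, placing the squares at the corner vectors used in the text:

```python
import random

# The three rules above, played by a computer: start with a dot, then
# repeatedly move halfway toward a randomly chosen square (corner).
p1, p2, p3 = (0.0, 1.0), (0.0, 0.0), (1.0, 0.0)
squares = [p1, p2, p3]

random.seed(42)
x, y = 0.0, 0.5              # Rule 1: a dot halfway between squares 1 and 2
points = []
for _ in range(10000):       # Rules 2 and 3: roll, move halfway, repeat
    cx, cy = random.choice(squares)
    x, y = (x + cx) / 2, (y + cy) / 2
    points.append((x, y))
```

Scatter-plotting `points` reveals Sierpiński's triangle of Figure 10.6.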