however, you do not need to perform any static analysis at all. You instead generate all valid programs shorter than length(obfuscated program) and apply the dead code insertion obfuscation until you find the original program.

5.3 Provably Secure Obfuscation: It's Possible (Sometimes)!

In the next section we'll show you how the most general form of obfuscation is impossible. However, that doesn't mean that all types of obfuscation are impossible! In this section, we will construct examples of provably secure obfuscation, and we'll do so by limiting ourselves to certain assets. In Section 5.3.1 we will begin with the simplest case by obfuscating a single input-output relation of a program. We'll then extend this result to obfuscating multiple input-output relations, and then we'll use these primitives to obfuscate access control programs and regular expressions. In Section 5.3.2 you'll see how it's possible to obfuscate the relations in a database. In Section 5.3.3, we'll show you how to perform operations on encrypted values in a provably secure fashion. On a related note, in Section 5.3.4 we describe whitebox cryptography and its use to obfuscate DES encryption. Unlike other algorithms in this chapter, whitebox DES is not provably secure; however, we include a discussion of its workings to show one direction being explored to obfuscate encryption. In all these algorithms you are severely restricted as to the assets you protect and the operations you can perform on them. When computing with encrypted data, for example, the OBFPP algorithm can only perform additions and multiplications.

Algorithm 5.1 Overview of algorithm REAA. P is the source of the obfuscated program and F is the obfuscating transformation the obfuscator uses.

DE-OBFUSCATE(P, F):
  1. Guess a source program S.
  2. Guess a key k.
  3. Obfuscate S using transformation F, i.e., P' ← F(S, k).
  4. If P' ≠ P, repeat from 1.
  5. Output S.
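The guess-and-check loop of Algorithm 5.1 can be sketched in C as follows. This is only a toy under loud assumptions: "programs" and keys are single bytes, the nondeterministic guesses are replaced by exhaustive enumeration, and toy_transform is a hypothetical stand-in for the obfuscating transformation F, not any algorithm from this book.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for F(S, k): mixes an 8-bit "source program"
 * with an 8-bit key. Because 167 is odd, the low byte determines the
 * key uniquely, so every output has exactly one (source, key) preimage. */
static uint16_t toy_transform(uint8_t source, uint8_t key) {
    uint8_t high = (uint8_t)(source ^ key);
    uint8_t low  = (uint8_t)(key * 167u + 13u);
    return (uint16_t)(((unsigned)high << 8) | low);
}

/* DE-OBFUSCATE(P, F) with guessing replaced by enumeration. */
static bool de_obfuscate(uint16_t obfuscated, uint8_t *source_out) {
    assert(source_out != 0);
    for (int s = 0; s <= UINT8_MAX; s++) {            /* 1. guess a source S  */
        for (int k = 0; k <= UINT8_MAX; k++) {        /* 2. guess a key k     */
            uint16_t candidate =
                toy_transform((uint8_t)s, (uint8_t)k); /* 3. P' <- F(S, k)    */
            if (candidate == obfuscated) {            /* 4. retry if P' != P  */
                *source_out = (uint8_t)s;             /* 5. output S          */
                return true;
            }
        }
    }
    return false;
}
```

The sketch needs no analysis of the obfuscated value at all, only the ability to run F forward. Its cost grows with the number of candidate programs and keys, which is what makes the approach impractical outside a toy setting.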
4.5 Data Encodings

new into the old. The analogy with the encryption and decryption functions E_key() and D_key() is intentional; in fact, you could imagine using encryption to obfuscate data. You will see examples of this in the next section.

To make this a bit more formal, imagine that you have an abstract data type $T$ with operations $\oplus_T$, $\otimes_T$:

    type $T$
    $\oplus_T : T \times T \to T$
    $\otimes_T : T \times T \to T$

To obfuscate $T$, you construct a new abstract data type $T'$ with operations for converting between $T$ and $T'$, and one new operation operating on $T'$ for every operation operating on $T$:

    type $T'$
    $D_{T'} : T' \to T$
    $E_{T'} : T \to T'$
    $\oplus_{T'} : T' \times T' \to T'$
    $\otimes_{T'} : T' \times T' \to T'$

The simplest way to obfuscate is for every operation on values of type $T'$ to first convert back to $T$, perform the operation, and then again convert to $T'$:

    $x \oplus_{T'} y = E_{T'}(D_{T'}(x) \oplus_T D_{T'}(y))$
    $x \otimes_{T'} y = E_{T'}(D_{T'}(x) \otimes_T D_{T'}(y))$

This really isn't a good idea, since every time you perform an operation on an obfuscated type it reveals the de-obfuscated values of the arguments and the result! You'd much prefer for every operation to be performed on the obfuscated representation directly. In practice, however, you will often have to compromise. You may find that, for a particular obfuscated representation, some operations can be performed directly on the new representation, but for others you will have to perform the operation on the de-obfuscated values.

Ideally, to prevent pattern-matching attacks, you want the obfuscated representation to be parameterized. In other words, you want a family of representations so that you can create a large number of different-looking obfuscated variables even if their underlying representation is based on the same obfuscation algorithm.
Since x and y now have different representations, before any operation can be performed on it, the second statement has to first de-obfuscate x and then obfuscate it in order to bring it into the same representation as y. This reveals x's true value. For some pairs of representations, you will be able to convert directly from one to the other without going through cleartext, but this means you may have to provide such conversion operations for every pair of obfuscated types!

To avoid this problem, an automatic obfuscator has to very carefully choose which obfuscated representations to assign to which variables. You can start the obfuscation by computing a backwards slice from each variable, which will reveal which other variables contribute to its value. To minimize the number of conversions, variables that occur in the same slice should be assigned the same obfuscated representation.

Of course, there's no reason why a variable should have the same representation throughout the execution of the program. On the contrary, an attacker will find that dynamic analysis of the program is much harder if the representation of a particular variable continuously changes over time or as the program chooses different execution paths [62].

In this section, we will show you techniques for obfuscating integers, booleans, strings, and arrays. Booleans are easier than integers to obfuscate since their range is known and small ([0...1]).

4.5.1 Encoding Integers

Integers are the most common data type in most programs and therefore important to obfuscate. Unfortunately, programmers are used to the idea that operations on integers are cheap, so you have to be careful not to transform them into an exotic representation that may be highly obfuscated but carries a huge performance penalty.

In this section, we will illustrate the obfuscated transformations in a very hands-on fashion, by writing functions that replace the built-in operators.
Every representation will look like the set of definitions below. T1 is the data type of the obfuscated representation, E1 is a function that transforms from cleartext integers into the obfuscated representation, D1 transforms obfuscated integers into cleartext, and ADD1, MUL1, and LT1 define how to add, multiply, and compare two obfuscated integers:

    typedef int T1;    /* obfuscated representation (here: the identity) */
    typedef int BOOL;  /* assumed here; the excerpt does not show how BOOL is defined */
    T1 E1(int e) {return e;}                            /* cleartext -> obfuscated */
    int D1(T1 e) {return e;}                            /* obfuscated -> cleartext */
    T1 ADD1(T1 a, T1 b) {return E1(D1(a) + D1(b));}     /* obfuscated addition */
    T1 MUL1(T1 a, T1 b) {return E1(D1(a) * D1(b));}     /* obfuscated multiplication */
    BOOL LT1(T1 a, T1 b) {return D1(a) < D1(b);}        /* obfuscated less-than */
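To see what a non-trivial member of such a family might look like, here is a minimal sketch of an affine encoding over 32-bit unsigned integers, E(x) = a*x + b (mod 2^32) with a odd. It is an illustration under assumptions, not one of the representations defined in this section: the names TAff, AffParams, and the aff_* functions are hypothetical, and cleartext values are restricted to unsigned 32-bit integers so that all arithmetic wraps in a well-defined way. Addition works directly on the encoded values, while multiplication and comparison fall back to decoding, which is exactly the kind of compromise described earlier.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TAff;          /* obfuscated representation */

typedef struct {
    uint32_t a;      /* odd multiplier, so it is invertible mod 2^32 */
    uint32_t a_inv;  /* multiplicative inverse of a mod 2^32 */
    uint32_t b;      /* additive offset */
} AffParams;

/* Newton iteration for the inverse of an odd number modulo 2^32.
 * For odd a, a*a == 1 (mod 8); each step doubles the correct bits. */
static uint32_t inverse_mod_2_32(uint32_t a) {
    assert(a & 1u);
    uint32_t inv = a;
    for (int i = 0; i < 5; i++) {
        inv *= 2u - a * inv;
    }
    return inv;
}

/* Each (multiplier, offset) pair selects one member of the family. */
static AffParams aff_make_params(uint32_t multiplier, uint32_t offset) {
    AffParams p;
    p.a = multiplier | 1u;               /* force the multiplier to be odd */
    p.a_inv = inverse_mod_2_32(p.a);
    p.b = offset;
    return p;
}

static TAff aff_encode(const AffParams *p, uint32_t x) {
    return p->a * x + p->b;
}

static uint32_t aff_decode(const AffParams *p, TAff y) {
    return p->a_inv * (y - p->b);
}

/* Direct addition: (a*x + b) + (a*y + b) - b == a*(x + y) + b. */
static TAff aff_add(const AffParams *p, TAff x, TAff y) {
    return x + y - p->b;
}

/* Compromise: multiply and compare on the de-obfuscated values. */
static TAff aff_mul(const AffParams *p, TAff x, TAff y) {
    return aff_encode(p, aff_decode(p, x) * aff_decode(p, y));
}

static int aff_less(const AffParams *p, TAff x, TAff y) {
    return aff_decode(p, x) < aff_decode(p, y);
}
```

Two variables encoded with different parameters hold different-looking values for the same cleartext, which is the point of parameterizing the representation. The price is that aff_add is only valid when both operands use the same parameters.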
Code Obfuscation

For example, although meow and growl have the same signature, they can be given the same name since they are declared in classes that have no inheritance relationship.

Problem 4.2 For a strongly typed language like Java, write a tool that de-obfuscates obfuscated identifiers, making use of available type information, references to standard libraries with known semantics, and a library of common programming idioms. Evaluate your tool by having subjects compare the readability of original (unobfuscated) programs to your de-obfuscated programs with generated identifiers. How would your tool be different for a weakly typed language like C that relies less heavily than Java on a standard library? Could you use statistical methods to generate identifiers, for example, by training a neural network on a set of well-structured programs? Generating identifiers in "Hungarian Notation" is straightforward, but can you do better? Can you do better than this paper [7…]?

4.13 Obfuscation Executives

If you have multiple obfuscating transformations at your disposal, a question naturally arises: Which transformations should I apply where, and in which order should I apply them? The problem gets even more complex if you not only want to obfuscate the program but also want to watermark and tamperproof it. Should I watermark first, then obfuscate, and then tamperproof, or is some other ordering better?

A similar problem (known as the "phase-ordering problem" [7…,…74]) arises in compilers when there are many optimizing transformations available. Some compilers apply the optimizations in a fixed order, while others iteratively apply the transformations until there are no more changes, or until some time limit has been exceeded. With some exceptions, choosing a good optimization order is an easier problem than choosing a good obfuscation order.
The reason is that most optimizations make the program simpler (for example, by removing redundant computations), or at the very least, don't make the program much more complex. Transformations such as loop unrolling are exceptions. Obfuscating transformations, on the other hand, are designed to make the program more complex! So for every obfuscating transformation you apply, the program gets more complex and you make the job of the next transformation harder.

A related problem is how to decide when to stop obfuscating. With a fixed transformation order, this isn't a problem, of course: Just apply each transformation
Software Tamperproofing

To tamperproof a program is to ensure that it "executes as intended," even in the presence of an adversary who tries to disrupt, monitor, or change the execution. Note that this is different from obfuscation, where the intent is to make it difficult for the attacker to understand the program. In practice, the boundary between tamperproofing and obfuscation is a blurry one: A program that is harder to understand because it's been obfuscated ought also be more difficult to modify! For example, an attacker who can't find the decrypt() function in a DRM media player because it's been thoroughly obfuscated also won't be able to modify it or even monitor it by setting a breakpoint on it.

The dynamic obfuscation algorithms in Chapter 6 (Dynamic Obfuscation), in particular, have often been used to prevent tampering. In this book, we take the view that a pure tamperproofing algorithm not only makes tampering difficult but is also able to detect when tampering has occurred and to respond to the attack by in some way punishing the user. In practice, tamperproofing is always combined with obfuscation:

1. If you both obfuscate and tamperproof your code, an attacker who, in spite of the tamperproofing, is able to extract the code still has to de-obfuscate it in order to understand it.
2. Code that you insert to test for tampering or effect a response to tampering must be obfuscated in order to prevent the attacker from easily discovering it.
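The detect-and-respond view can be sketched in a few lines of C. This is a minimal illustration under assumptions, not an algorithm from this chapter: the protected "code" is modeled as a plain byte array, the checksum is the non-cryptographic FNV-1a hash, and the names region_hash, check_region, and Verdict are hypothetical. A real check would also have to be obfuscated, as point 2 above notes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a over a byte region; used only as a simple checksum. */
static uint32_t region_hash(const uint8_t *region, size_t len) {
    assert(region != NULL);
    uint32_t h = 2166136261u;        /* FNV offset basis (32-bit) */
    for (size_t i = 0; i < len; i++) {
        h ^= region[i];
        h *= 16777619u;              /* FNV prime (32-bit) */
    }
    return h;
}

typedef enum { RUN_NORMALLY, RESPOND_TO_TAMPERING } Verdict;

/* Detect: compare the current hash of the region with the value
 * recorded when the program was protected. The caller decides how
 * to respond (for example, by degrading or terminating). */
static Verdict check_region(const uint8_t *region, size_t len,
                            uint32_t expected) {
    return region_hash(region, len) == expected ? RUN_NORMALLY
                                                : RESPOND_TO_TAMPERING;
}
```

Separating the check from the response is a deliberate choice in this sketch: it lets the response happen later and elsewhere in the program, so that the attacker cannot trivially trace a failure back to the check that triggered it.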
Obfuscation Theory

5.5.1 Overcoming Impossibility

Given the value of provable obfuscation and the proof that the general case is unsolvable, what future directions are possible and promising? The proof suggests that there exist programs that cannot be obfuscated. However, it does not suggest that any specific program is not obfuscatable. One promising possibility is to find a way to restrict the class of programs you are interested in obfuscating so as to exclude Secret. For example, you have already seen that point functions can indeed be obfuscated.

What would be the most useful program one could try to obfuscate? It would be a program capable of providing the most general functionality while remaining outside the domain of the proof. For example, you could choose to obfuscate a particular limited virtual machine. There is nothing directly in the proof you saw that suggests this is impossible. Once you have such an obfuscated virtual machine, your task is "simply" to deliver to this virtual machine the program you wish to execute in a form that it can execute that is nevertheless resilient to analysis. However, since the output of this virtual machine would be usable by an attacker, it is likely that the original impossibility proof could be adapted to show that the obfuscation of even particular virtual machines is impossible.

A more promising approach is to find an alternate definition of obfuscation that remains useful but prevents Theorem 5.1 (Impossibility of Obfuscation) from being applicable.

In the remainder of this section, we will explore both these directions for rescuing provable obfuscation. In Section 5.5.2 you will see secure multi-party computation, which splits a program up and executes each piece by a separate party in such a way that no one party gains complete access to the asset you are trying to protect. In Section 5.5.3 we will explore another approach that transforms an obfuscated program in such a way that it encrypts the output before returning it. Both of these transformations turn the non-interactive definition of obfuscation into an interactive one that requires a server and a client to be present in order to execute an obfuscated program.

5.5.2 Definitions Revisited: Make Obfuscation Interactive

The definitions of obfuscation you saw in Section 5.1 implied that once a program P has been obfuscated, it executes without further interaction with a server. Defined in this way, general provably secure obfuscation is impossible. However, if you allow an obfuscated program to distribute its computation so some of it is not accessible to an attacker, this amounts to providing a blackbox to an obfuscated program where it can securely perform computation and store
5.4 Provably Secure Obfuscation: It's Impossible (Sometimes)!

… that is P' … time(P')) … Pr[…(P') = 1] … Pr[…(1 …

There is another restriction that you must place on obfuscated programs in order to reason about them in this section. Specifically, you need to restrict how much bigger or slower an obfuscated program can be compared to the original. Let's call such programs small and efficient, defined as follows:

Definition 5.9 (Small). O is small if for all programs P, O(P) is at most polynomially larger than P.

Definition 5.10 (Efficient). O is efficient if for all programs P, O(P) is at most polynomially slower than P.

By requiring that there is at most only a polynomial increase in the size of a program and at most only a polynomial slowdown in its speed, you eliminate the degenerate case where an obfuscated program consists solely of an exhaustive list of input and output pairs. In most programs, such a list would be exponentially larger than the program itself. Thus in this section, when we say that a program is obfuscated, we mean the following:

Definition 5.11 (Obfuscated). A program O(P) is an obfuscated version of P if:
  O is correct;
  O is small;
  O is efficient; and
  O is a virtual blackbox.

The virtual blackbox property states that having access to an obfuscated program's source code doesn't give the attacker an advantage over having access to its input-output relation.

For example, given the program in Listing 5.5(a), you can obfuscate the length, comments, and number of variables of the original program by transforming it into the program in Listing 5.5(b). The program is still correct, small, and efficient, and at least the length, comments, and number of variable properties have been obfuscated. However, according to Definition 5.11, Listing 5.5(b) is
way to do this is to introduce new paths between every pair of nodes with a keyed hash that no string maps to, like this:

To build such an automaton, first the labels on transition arcs are replaced with keyed hashes. The new transition arcs labeled ⊥ connect previously unconnected states using labels that nothing hashes to. In addition, new states could also be added to the finite state automaton in which no incoming transitions are ever taken. These additional arcs and states hide the true structure of the automaton.

5.3.2 Algorithm OBFNS: Obfuscating Databases

  Asset: Private data
  Attacker Limits: None
  Input Programs: Arithmetic and relational operations on data

You may want to extend your newfound ability to obfuscate multi-point functions to obfuscate arbitrary databases. Algorithm OBFNS [26…] (see Algorithm 5.2 for details) gives a method for extending obfuscated multi-point functions to perform many of the functions you want from a database. The algorithm starts by building a database that is an obfuscated lookup function. For example, to access a single record in a database you would use a point function that takes as input the field you are querying and a database, and outputs the particular record. Algorithm OBFNS takes advantage of point-function obfuscation by encrypting the data attributes with a key derived from a hash of the query attributes. For example, here's a phone book containing an association between names and phone num-
Listing 5.6 A small self-reproducing program. This program fails to be unobfuscatable because printing its own source makes oracle access to the program equivalent to source-code access.

    class Self {
      public static void main(String[] args) {
        char qq=34, q=39;
        String payload= + qq + payload;
        System.out.println(payload + qq + payload=' + qq + payload.replace(q, qq));
      }
    }
    payload="class Self{public static void main(String[] args){ \
      char qq=34, q=39; String payload= + qq + payload;
      System.out.println(payload + qq + " ; payload=" + qq + payload.replace(q, qq));

not obfuscated because there are many other properties, such as control flow and variable names, that are preserved.

Now that we have a good definition of obfuscation, we will show the following:

Theorem 5.1 (Impossibility of obfuscation). Let 𝒫 be the set of all programs. Given any obfuscating transformation O, there exists a P ∈ 𝒫 such that O(P) is not obfuscated according to Definition 5.11.

We will know that we have succeeded in proving this theorem (1) if we can construct a program that has a property that is always evident from source-code access to any obfuscated version, and (2) if there is a negligible probability that this property can be deduced from just oracle access.

5.4.2 Obfuscating Learnable Functions

One way you might try to build an unobfuscatable program is to make the program so simple that it has only a single output. For example, you may wish to use the Hello World program. The intuition here is to build programs that are so trivial that they simply contain no structure rich enough to obfuscate.

The problem with this function is that its triviality, ironically, makes it simple to obfuscate, given the definition. Simple functions are learnable with just a small number of oracle queries. In the case of Hello World, this is a table of input-output pairs consisting of a single entry mapping an input of the empty string to the output Hello World.
As a result, all differences between oracle access and source-code access disappear. In other words, simple learnable programs like Hello World can
What does it mean to say "obfuscation is impossible"? In the last section, you saw that you can obfuscate point functions, databases, and encryption functions. Aren't we contradicting ourselves if we say that obfuscation is both "provably possible" and "provably impossible"?

The answer is no: all this means is that programs exist that can be obfuscated and other programs exist that cannot be. In the same way, the general uncomputability of HALTING does not prevent us from coming up with classes of programs for which you can compute whether the program halts.

Consider a blackbox program, which is one for which you have no access to its internals. A strongly obfuscated program should reveal no more information than a blackbox program. In practice, of course, compared to blackbox programs, obfuscated programs leak some information. Given a blackbox program, all you can observe is its I/O behavior (and perhaps how long it takes to run). In contrast, given an obfuscated program, you can observe its memory accesses, the instructions that are being executed, procedure calls, and other clues that you may intuitively suspect could lead an adversary to eventually "crack" the obfuscation and force the program to reveal its secrets.

In this section, we will show that this intuition is, in fact, correct and that hiding all properties of all programs is unachievable. To show that this is true, it is sufficient to construct just one program containing a secret property that obfuscation is unable to hide. This is exactly what we will do.

First we will define what a blackbox is and how it relates to our definition of obfuscation. Next we will construct a property that cannot be obfuscated and a program that exhibits this property, and thus we will show that not all properties of all programs can be obfuscated.
Finally, we will discuss what features of obfuscation allowed us to construct such an unobfuscatable program.

5.4.1 A General Obfuscator

We have devised our definition of obfuscation in terms of assets. A general obfuscator is one that is able to hide all assets in a program. A generally obfuscated program is in some sense the most obfuscated version of a program that could be devised. It is one that cannot be analyzed to reveal any property of the original program except the relationship between input and output.

How is such a perfectly obfuscated program (if it existed) different from a blackbox? The only difference is that you have source-code access to an obfuscated program. The only query you can make on a blackbox, on the other hand, is to compute its output on a finite number of inputs. Such access to a program is called oracle access.
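The difference between the two kinds of access can be made concrete with a small C sketch. Everything in it is an assumption made for illustration: the oracle is modeled as a function pointer, hidden_program is a hypothetical stand-in for the blackbox, and learn_table is a hypothetical attacker routine. The attacker code never inspects the body of the program; it can only call it on finitely many inputs and record the answers.

```c
#include <assert.h>
#include <stddef.h>

/* Oracle access: all you can do with the program is call it. */
typedef int (*Oracle)(int input);

/* Hypothetical blackbox. With source-code access you could read
 * this body; with oracle access you only see its outputs. */
static int hidden_program(int x) {
    return (x * x + 1) % 7;
}

/* Query the oracle on a finite list of inputs and record the
 * resulting input-output table. */
static void learn_table(Oracle oracle, const int *inputs, size_t n,
                        int *outputs) {
    assert(oracle != NULL);
    for (size_t i = 0; i < n; i++) {
        outputs[i] = oracle(inputs[i]);
    }
}
```

If the set of meaningful inputs is small enough to enumerate, the recorded table captures the program's entire input-output relation, and source-code access adds nothing that the table does not already reveal about that relation.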