Wednesday 1 August 2007

Restricting the search scope to improve performance

One major issue with DeepWeaver is that any call that results in the body predicate executing results in substantial slowdown, particularly upon a large code base. Predicates that result in body being called include call(x,y,z) and assign(x,y,z) when x is a free variable.

In both cases here, x represents a statement, and without that statement being bound, the body predicate must be called to traverse all Units in the codebase to check whether it matches your criteria. This can be particularly slow if the matching criteria itself is expensive.

Remember DeepWeaver doesn't just analyse the class or package you specified, if you call an unbound assign or call predicate it will dive into the libraries and look for results. This includes looking through classes such as Object and Throwable from which common Java features are derived. Needless to say, it's going to slow you down.

At time of writing there's no artificial predicate to limit the scope based on what is specified in your command line, but there's an easy way to do this manually.
If you want to restrict yourself to a certain class or set of classes, simply specify:
scope(u)=class(c,"org.owasp.webgoat.session.User*"), units(body(sootmethod(c,<-),<-),u);

and then
scope(u), assign(u,y,z), ...

and now the first predicate in assign is bound to be a unit from some class beginning with User in the specified package. This dramatically restricts the scope of possible values for u and thus greatly improves the resulting performance.

Notice the use of wildcards here, you may use wildcards in both the class and package name but you must attempt to specify the class name to some degree. So you could say org.owasp.*.User* to find your classes, or ever org.*.* but that dot is essential to include, or it will assume your package to be a class name and fail to find the result you're looking for.

Tuesday 31 July 2007

Datatypes in DeepWeaver

DeepWeaver introduces limited type safety, as previously described.
To use types, you can use either built in types, or any other Java object by specifying it's full long-form name (eg. java.awt.AbstractListener)
If you're feeling really brave you can add your own datatype (although I've never seen a need to do so), just add it to the list in dw.type.Types and then add your own corresponding class file.

Here's a brief run-down of the built-in types:
  • String: Letters in double-quotes. Usually used for matching, printing or restriction
  • List: A Prolog style list, but shorthand list notation is not yet available
  • Stmt: Corresponds with the soot Unit type. A line or unit of code, which also retains information about the methods and values associated with it
  • Value: Corresponds with the soot Value type. A code value that also stores information about its declaring unit (but may be linked to by more than one unit!)
  • Block: A CodeBlock, or series of units, usually in some order that relates to their exection. In fact, it is a implemented as a chain of units, and the unit(cb,u) predicate can be used to split it into these component units. Also stores information about its parent method, and synchronization
  • Method: An analysis of an entire Method, including information of body, declaring class and synchronization.
  • Class: An analysis of an entire class, corresponding with soot.SootClass.

Tuesday 24 July 2007

Running DeepWeaver from the command line

Calling DeepWeaver from the command line is pretty similar to any other Java program, ie. not very pretty! The basic format is:
java -cp CLASSPATH dw.DW script.dw org.mytarget.Target

where
  • CLASSPATH contains all the JARs required by DeepWeaver (these may be contained in the $CLASSPATH environment variable if you prefer) - should be separated by a colon. For more info, run java -help
  • script.dw is the name of the predicate script you want to run
  • org.mytarget.Target is the name of the class file on which you want to run it, and you can specify as many targets as you like
Commonly used paramaters are
  • -javac "javac -cp ~/workspace/dw/build/classes"
    • Instructs the use custom java compiler parameters, which is nearly always necessary. In this case the java compiler will add my classes directory to its classpath.
  • -addcp /home/user/lib
    • Adds a custom class path for inclusion in Deepweaver execution. All the libraries you need, if they're not directly locatable from your current execution directory need to be added. This should point to either a root folder (eg. looking for org.apache.catalina.Role then it must be in /home/user/lib/org/apache/catalina/Role.class) or a JAR file. The easy way around this, if you have lots of JAR files is to create a library folder and just inflate all your JAR files there, that way you only ever have to specify one -addcp parameter
  • -time
    • displays timing information
  • -v
    • verbose output
You can also specify targets for execution en masse by using the + symbol on pacakages
+org.apache.catalina

Anything linked to code that is analysed will be loaded dynamically if it can be found.

Confused? Here's a sample command for Linux
java -cp jars/antlr-2.7.5.jar:jars/sootclasses-2534-1.jar:jars/jasminclasses-2327.jar:jars/polyglotclasses-1.3.2.jar:build/classes dw.DW -javac "javac -cp ${workspace_loc}/dw/build/classes" -addcp ~/workspace/webgoat_src/build/WEB-INF/classes/ -addcp ~/jars transact.dw +org.owasp.webgoat

This isn't as bad as it may seem!
  • Our classpath requires 4 DW jar files and the build/classes folder, which we've specified using -cp (Eclipse adds this automatically if you run using it).
  • -javac tells the java compiler to look inside /dw/build/classes for compile info
  • The -addcp parameters add the webgoat compiled classes and library of extracted JAR files to the path
  • transact.dw is the script to be called on the source code
  • We want to analyse all code within org.owasp.webgoat

Monday 23 July 2007

How to use the between predicate

The between predicate is an invaluable tool in your troubleshooting arsenal but it can also be quite tricky to use. The format of the predicate is:
between(a,z,x,b)

a and z should be code units that represent the start and end of the between block.
x is the code between a and z. If x is output, it will come in the form of one line of soot analysis at at time, in the form of units. If x is input, it should be a unit (or unit box) of one or more lines of code.
b is a boolean choice that selects between must (true) and may (false) analysis, eg. code in an if block between a and z may not be called between a and z and thus will be included in a may, but not a must analysis.

To use between to find the x variable, the most important thing is to ensure that you specify a and z as two non-equal code units. However, units are not necessarily the most instinctive way to specify a and z. For example, you may want to specify a or z in terms of a call. The wrong way to do this is:
between(call(<-,*,*), z, X, false)

This has a number of faults including:
  • There's nothing to stop the result of your call being the same as z (an arbitrary unit) which could give a null exception
  • Your input is likely to give multiple locations, which will give multiple results from between that may not be easily distinguishable from each other
Between is very powerful but you need to be firm about your input to it. Here is an example of how to use between to find all the code that may be between two method calls, one called begin, the other called commit.
getUnit(call(<-,p,*), y), getUnit(call(<-,q,*), z), methodMatches(p, "* begin(..)), methodMatches(q, "* commit(..)"), between(y, z, X, false )

This works because call's middle parameter returns a method. This method name can then be checked with methodMatches to ensure that it has an expected name. Since the two match parameters are different, this also ensures the two results won't be equal. Then we get the unit from the result so that we are sure between will be receiving a unit as both parameters. X is the result which may be multiple lines of code we can then test.

Thursday 19 July 2007

Predicate Specification

A couple of notes on how to specify your predicates in DeepWeaver, that are worth mentioning because they're different from how they're specified in Prolog.

When you're specifiying a predicate there are a number of reserved keywords, and an few restrictions.
name(fn,sn)=getName(fn,sn);
This is the basic standard specification, fn and sn are bound or unbound variables of any type.
name(in fn, out sn)=getName(fn,sn);
This restricts the standard specification, because fn must be bound and sn must be unbound when this predicate is called, otherwise it will be ignored. You could overload is by following it with the first example above.
name(String fn, out sn)=getName(fn,sn);
Now fn must be a String object, but may be bound or unbound, sn must still be unbound.

Points of note:
  • Unless your variable name is one char long, it should start with a small letter or it will be mistaken for a type binding
  • in and out should not be used as variable names
  • Predicates can be overloaded, but they are overloaded in order (same as Prolog), meaning that if you have a predicate which has no binding or type restrictions, it should be the final specified predicate
  • No error will be thrown if your bindings cause a predicate call that you did not expect

Friday 22 June 2007

Examples of two useful predicates

Two of DW's most useful predicates are assign(A,B,C) and call(A,B,C).
Here's a good rundown I was given on how to use them.

Given the following code example
c = a.lookup()

we have call(A,B,C) where A is a.lookup(), B is lookup() and C is a
and assign(A,B,C) where A is c = X (where X is the return value of a.lookup), B is c and C is a.lookup()

Sunday 17 June 2007

DIY - How to build your own DW predicate

1. Check that you don't already have the tools to write the predicate in terms of other Prolog predicates. Existing predicate wrappings are stored in natives.dw and using existing Prolog predicates is quicker and easier than writing your own (usually).

2. If there isn't you'll need to create a new class file in the dw.predicate package that extends PredicateMatcher, with a suitably descriptive name. You'll need to add a PredicateFactory to the end of this class too, this allows the engine to construct new copies of your predicate as required. Remember that your constructor must reset your predicate to a fresh state, so if you're using any class variables they need to be reset in the constructor. Use PredicateFactory's of other predicate class files as examples.

3. Think about the cases you'll need to cater for. Take for example the predicate imaginary(x,y). You'll need 4 cases, one for when both parameters are bound, two for when only one is bound, and a final one for when both are unbound. Test param(0).isFree() to see whether each paramater is bound or not, and select in an if block. Some cases won't apply, in which case you may either throw an execption or return false.

4. When building each predicate case, you'll need to getValue() on each bound parameter and cast it to an instance of some of the types you expect. The types are listed in dw.type, but generally your incoming variable will be a CodeValue or a CodeUnit. Use instanceof to deal with the various types, and use unify to attempt to resolve unbound predicates, using the boolean result of unify as a return value.

5. Bind the predicate to the classfile natively in natives.dw, to do this simple add
native imaginary(x,y) = "dw.predicate.Imaginary" (or whatever you called your class file)

6. Test the results! If it doesn't work, use the print(x) Prolog predicate as a simple aid to diagnosis, the most usual causes are accidentally leaving a variable unbound that you thought was bound, or sending in an instance of an unexpected type.

Tips:
  • Even though DW's types are currently a wrapping over soot, its usually better to use their built in methods than going straight to soot. They're there for a reason!
  • Throw your exceptions rather than returning false where possible, it's much easier to diagnose unexpected behaviour this way.

Tuesday 29 May 2007

Soot Datatypes and how they relate to each other

Soot is the underlying manipulation mechanism of DeepWeaver and while it is eventually hoped that programmers can be removed from having to think about Soot, we're not quite there yet. For now, its helpful to be aware of Soot's behaviour.

Soot breaks compiled Java code into an intermediate representation called Jimple, which represents the code body in terms of units. These units are 3 address statements which simplifies the number of possible instructions. So for example, the java statment:
x = foo + 3 * bar;
would have a Jimple representation that somewhat resembles
i0 = 3 * bar; x = foo + i0

This is important to understand because it illustrates DeepWeaver's awareness of the code. Since Deepweaver deals with bytecode it's unaware of your original statements, instead it sees this representation and thus the codeblocks may actually consist of a group of units, which are more atomic than the original statements they are derived from.

This then corresponds with a DeepWeaver heirarchy which wraps to the soot types. These are:
  • CodeValue, atoms such as constants, locals, etc...
  • Statements, corresponding with units above, of which the most important ones are assignment statements (illustrated above), if statements, return statements and return-void statements
  • CodeBlock, a set (list?) of statements
  • SootMethods, a callable method within the program. The method itself is a codeblock, but there are some extras to represent parameters etc.
  • SootClasses, corresponding to a Java class.

Monday 28 May 2007

Modifications of standard Prolog

Deepweaver changes a number of standard Prolog syntactical features to bring it closer to Java, and make pattern cuts more familiar to Java developers

Declaring a predicate is no longer done using the Prolog iff term :-, instead the = symbol is used.
Terminating a predicate is no longer performed using the fullstop symbol . instead the semicolon is used ; Since the semicolon in Prolog represents or, this has now been replaced with the pipe character, |
Comments are now of standard Java form, so instead of % we have // for single line comments and /* .. */ for multiline comments
Negation is slightly different from standard Prolog in that \+ is no longer used to reperesent negation, instead use the not(A) predicate.

Finally, its worth noting that in a .dw file, only predicates preceeded by a question mark will be executed. One should declare all the required predicates and then select the one to be executed by writing ?runMe(A,B); at the end of the file.

eg. well_located(A,B) = statement(A,X), not(between(s, X, B, true));

Wednesday 23 May 2007

Predicate Library Summary

A summary of the built-in DeepWeaver predicates and their expected functions.
  • allBoxIter(A)
  • args(A,B) A call to method A has a list of argument objects B
  • assign(A,B,L) Variable A is assigned value B at location L
  • between(a,z,X,B) X is a codeblock between ground/atomic codeblocks a and z if boolean B is true. If B is false, then X may or may not be between a and z.
  • body(a,B) Finds a method body location B in an atom or ground variable a, the code block location may be in only non-abstract, real method bodys which should be included in a.
  • call(A,B,C) Instruction A calls method B which takes some sort of object C.
  • class(A) A is a class. Returns any class if A is unbound
  • defBoxes(a,B) Finds a definition box B within unit a
  • doms(A,B,C)
  • element(A,N,B)
  • end(a,B) B is the soot-generated final piece of code of method a. See start.
  • forall(A,B) Standard Prolog forall, for all cases of A, B is true.
  • jump(A) Returns a statment which causes a jump in the code
  • target(A,L) Returns the Location of a jump in the code at statement A
  • method(A,B)
  • methodMatches(A,B)
  • member(A,B) Standard Prolog member, returns true if A is an element of list B
  • name(A,B)
  • not(X) Standard Prolog negation, if X cannot be proved, this statement returns true.
  • parentMethod(a,B) Gets the entire parent method of atom or ground variable a as a codeblock bound to B.
  • path(A,B,C)
  • precedes(A,B) Code block A occurs at or before Code block B in a method (or logical Soot code flow)
  • pred(A,B)
  • print(A) Prints the type and toString() result of A
  • sameValue(A,B)
  • sootmethod(A,B)
  • start(a,B) B is the soot-generated first code piece of method a. Note that this is soot generated, so the start block includes all code that comes between the first method call, loop block, or other non-trivial piece of code.
  • statement(A,B) A is the location of the statement that results in B.
  • succ(A,B) Same as pred(B,A)
  • type(A,a) Restricts A to be of the same type as a.
  • units(A,B)
  • unused(A)
  • useboxes(A,B)
  • uses(A,L) Code block A is used at location L.
  • value(A,B) B is the value of variable A, returned as a CodeValue. Common usage: value(A,X), value(B,X).

Insertion of return values

Prolog variables don't have return values, but often there's a case where a function only works one way, meaning that an unbound variable essentially acts as a return value. While this detracts from the declarative power of Prolog, it certainly reduces the number of temporary variables hanging around programs.

So, to shorten the pattern cut implementations and reduce the number of temporary variables we use the notation <-
eg. We can rewrite bar(X), foo(X,c) as foo(bar(<-),c)

In useful terms, one nice rewrite we could have is

statement(Ta,CallA), statement(Tb, CallB), precedes(Ta,Tb)
rewrites to a much shorter and intuitve
precedes(statement(<-,CallA), statement(<-,CallB))

More extreme examples that certainly cut down on the number of inserted variables but potentially at the cost of legibility could be
dominates(a,b) = parentMethod(a,Ta), body(Ta,Tb), start(Tb, Tc), between(Tc,b,a,true)
which shortens to
dominates(a,b) = between(start(body(parentMethod(a,<-),<-),<-),b,a,true)

Its the type of predicate that's handy to have in your arsenal but its good to be aware of its impact on readability