Program analysis for quantitative-reachability properties
OA Version
Citation
Abstract
Program analysis studies the execution behaviors of computer programs including programs’ safety behavior, privacy behavior, resource usage, etc. The kind of program analysis on the safety behavior of a program involves analyzing if a particular line of code leaks a secret and how much secret is leaked by this line of code. When studying the resource usage of a program, certain program analysis mainly focuses on analyzing whether a piece of code consumes a certain resource and how much resource is used by this piece of code. Yet another kind of program analysis is studying the program privacy behavior by analyzing whether a specific private data is dependent on other data and how many times they are dependent during multiple executions. We notice that when studying the aforementioned behaviors, there are two dominant program properties that we are analyzing – “How Much” and “Whether”, namely quantitative properties and reachability properties. In other words, we are analyzing the kind of program property that contains two sub-properties – quantitative and reachability. A property is a hyper-property if it has two or more sub-properties. For the class of properties that has quantitative and reachability sub-properties, I refer to them as quantitative-reachability hyperproperties. Most existing program analysis methods can analyze only one subproperty of a program’s quantitative-reachability hyper-property. For example, the reachability analysis methods only tell us whether some code pieces are executed, whether the confidential data is leaked, whether certain data relies on another data, etc., which are only the reachability sub-properties. These analysis methods do not address how many times or how long these properties hold with respect to some particular code or data. Quantitative analysis methods, such as program complexity analysis, resource cost analysis, execution time estimation, etc., only tell us the upper bound on the overall quantity, i.e., the quantitative sub-property. However, these quantities are not associated with a specific piece of code, program location, private data, etc., which are related to the reachability sub-properties. This thesis presents new program analysis methodology for analyzing two representative quantitative-reachability properties. The new methodology mitigates the limitations in both reachability analysis methods and quantitative analysis methods and help to control the program’s execution behaviors in higher granularity. The effectiveness of the new analysis method is validated through prototype implementations and experimental evaluations.
The first noteworthy quantitative-reachability property I look into is the adaptivity in the programs that implement certain adaptive data analyses. Data analyses are usually designed to identify some properties of the population from which the data are drawn, generalizing beyond the specific data sample. For this reason, data analyses are often designed in a way that guarantees that they produce a low generalization error. An adaptive data analysis can be seen as a process composed by multiple queries interrogating some data, where the choice of which query to run next may rely on the results of previous queries. The generalization error of each individual query/analysis can be controlled by using an array of well-established statistical techniques. However, when queries are arbitrarily composed, the different errors can propagate through the chain of different queries and result in high generalization error. To address this issue, data analysts are designing several techniques that not only guarantee bounds on the generalization errors of single queries, but that also guarantee bounds on the generalization error of the composed analyses. The choice of which of these techniques to use, often depends on the chain of queries that an adaptive data analysis can generate, intuitively the adaptivity level in an adaptive data analysis. To help analysts with identifying which technique to use to control their generalization error, we consider adaptive data analyses implemented as while-like programs, and we design a program analysis framework. In this framework, we first formalize the intuitive notion of adaptivity as a quantitative-reachability property, which is a key measure of an adaptive data analysis to choose the appropriate technique. Then we design a program analysis algorithm that estimates a sound upper bound on the adaptivity of the program that implements an adaptive data analysis. We also implement my program analysis and show that it can help to analyze the adaptivity of several concrete data analyses with different adaptivity structures.
As a continuation of the previous work, to get a more precise bound on a program’s adaptivity level, I look at another quantitative-reachability hyper-property – the number of times a given location inside a procedure is visited during the program execution. The upper bound on this hyper-property is referred to as the reachability-bound. It can help to improve the program analysis results when studying other different program features. For example, the reachability-bound on each program location can be used by some resource cost analysis techniques to compute a precise bound on a program’s worst-case resource consumption. When we analyze the adaptivity in an adaptive data analysis program as discussed above, the accuracy of my program analysis result can also be improved through a tight reachability-bound on every program location. Some existing program complexity analysis methods can be repurposed to analyze and estimate the reachability-bound. However, these methods focus only on the overall quantity and ignore the path sensitivity in the program. For this reason, the reachability-bounds of the locations in different sub-procedures are usually over-approximated. As far as we know, there is no general analysis algorithm that computes the reachability-bound for every program location directly and path-sensitively. To this end, I present a pathsensitive reachability-bound algorithm, which exploit the path sensitivity to compute a precise reachability-bound for every program location. We implement this path-sensitive reachability-bound algorithm in a prototype, and report on an experimental comparison with state-of-art tools over four different sets of benchmarks.