1 Introduction

Recent years have seen the emergence of computer-aided personalized education as a new, important research field. Sophisticated techniques relying on formal methods, programming languages, and program synthesis have been designed to assist teachers in grading and providing feedback for introductory programming assignments [24], automata constructions [1], and geometric constructions [10, 13]. In this paper, we propose a novel program repair framework that enhances the state of the art in automated feedback generation for students in introductory programming courses.

The goal of automatic program repair is to identify a set of syntactic changes that can turn a program that is incorrect with respect to a given specification into a correct one. In the context of automated feedback generation, repairing a program corresponds to finding a “fix” to the student’s incorrect solution. The specification can be as simple as a set of test cases.

Existing program repair techniques are typically “best-effort” and aim to find any repair that meets the given specification. Such techniques can end up generating a program that is quite different from the original one. Although this may be acceptable in some settings, in the context of education, the goal should be to slowly guide the students towards a correct solution. In particular, if a student solution is close to a correct one, a teacher wouldn’t point the student to a completely different program, but would rather show how the student solution can be corrected with small changes. Advanced program repair techniques address this problem by computing syntactically minimal program repairs [19, 23, 24].

In this paper, we argue that even the smallest syntactic fix to a program can significantly alter its behaviour. We propose a new approach to program repair based on program distances, which quantify changes not only to the program syntax but also to the program semantics. While syntactic distances capture the number of edits to the program, semantic distances quantify the number of changes to the “behaviour” of a program with respect to a given set of tests. We formalize the quantitative program repair problem in which the “optimal” repair is defined as a correct program that minimizes an objective function over multiple program distances. Although our framework is general, we present two types of syntactic distances, three types of semantic distances, and propose a solution to the quantitative program repair problem with respect to these distances.

We have implemented our techniques in a prototype tool called Qlose (Quantitatively close), which is built on top of the Sketch synthesis system [26]. In Qlose, we encode the functional correctness property of the student solution with respect to a set of tests as a hard constraint and the syntactic and semantic distances with respect to the original solution as soft constraints. The repair generated by Sketch maximizes the number of soft constraints that can be satisfied, while satisfying the hard constraints. We evaluate \(\textsc {Qlose} \) on 11 representative benchmark programs taken from student submissions to the Microsoft CodeHunt platform [29] and the Introduction to Programming course on the edX platform. Our preliminary results show that encoding quantitative program repair using syntactic and semantic distances is practically feasible for small student solutions and leads to more desirable repairs.

Contributions. This paper makes the following key contributions.

  • We define new notions of syntactic and semantic distances between programs with respect to a given set of tests and use these notions to formalize the quantitative program repair problem (Sects. 3, 4).

  • We encode the quantitative repair problem in a prototype tool called Qlose, which is built on top of the Sketch synthesis system (Sect. 5).

  • We evaluate Qlose and the strengths of the different distances on 11 representative student submissions taken from education platforms (Sect. 6).

2 Motivating Example

We use the example in Fig. 1 to show that semantic distances can sometimes yield more intuitive program repairs than syntactic distances (footnote 1). Figure 1 contains a set of tests that is representative of the intended semantics of a desired program. Given this test set, the student has come up with the program FindCBuggy in Fig. 1(a), which fails the first test but passes the other ones. The desired program is one that, given a string s, a character c, and an integer k, outputs whether the character c appears in s at some position \(j\le k\). Apart from a small imprecision, the student solution captures the intended algorithm in the sense that successful executions of the program are not far from those in the correct algorithm.

Fig. 1. A buggy program (a) with two possible syntactic repairs (b) and (c).

Limitation of syntactic distances. To give feedback to the student, one can try fixing the student solution using existing program repair techniques that minimize the number of syntactic changes to an incorrect program. Techniques like the ones presented in [19, 24] would return one of the two programs at the bottom of Fig. 1. Both of these programs differ from the one in Fig. 1(a) by exactly one expression. However, one of the repaired programs is, in some sense, more “disruptive” than the other. In particular, although the program in Fig. 1(b) simply changes the guard of the if-statement, its executions on previously correct tests are now very different: on all tests the loop is now executed only once! On the other hand, for the program in Fig. 1(c), the executions on correct tests are the same as for the original program. Syntactic program distances cannot distinguish between these two candidate repairs and are inadequate for this example.

Semantic distances. One can capture the intuition that the program in Fig. 1(c) is a better repair for the student solution than the program in Fig. 1(b) by examining the execution of these programs on successful tests. For example, if we were to track the locations (lines of code) traversed by the three programs on the second input test with \(s=\) aba?gc we would get the following sequences of locations: (a) 2, 3, 2, 3, 2, 3, 2, 3, 4; (b) 2, 3, 4; and (c) 2, 3, 2, 3, 2, 3, 2, 3, 4. These sequences highlight that the program in Fig. 1(c) is semantically closer to the student solution.

A similar argument can be made for repairing using only a semantic distance as the repaired program may be syntactically very far from the original one. In summary, in order to repair programs in a meaningful way, it is often necessary to take into account multiple quantitative objectives such as the number of syntactic edits and the distance between program behaviours.

3 Program Repair

In this section, we formalize programs, correctness specifications, and permissible program edits. We then use these notions to define the program repair problem.

3.1 Programs

We fix a simple imperative programming language, in which a program \(P\) consists of a function definition \(f(i_1,\ldots , i_q):o\), a set of program variables \(V\), and a sequence of labeled statements \(\sigma =s_1\ldots s_n\). A statement is one of the following: skip, return, assignment, conditional, or loop statement (footnote 2). Each statement in \(\sigma \) is labeled with a unique location identifier from the set \(L= \{\ell _0, \ell _1, \ldots , \ell _p, exit\}\). The function f has a designated set of input variables \(I= \{i_1,\ldots , i_q\}\) and a designated output variable \(o\). The program statements are allowed to use an auxiliary set of variables \(V= \{v_1,\ldots ,v_r\}\). We assume a universe \(\mathcal {U}\) of values. We also assume that all variables are associated with a given type and are only assigned values from \(\mathcal {U}\) with the proper types.

We now define the semantics of our programs. The semantics of program statements is standard. Without loss of generality, we assume execution of a return statement assigns a value to the output variable and transfers control to a designated location \(exit\).

A program configuration \(\eta \) is a pair \((\ell , \nu )\) where \(\ell \in L\) is a location and \(\nu : I\cup \{o\}\cup V\mapsto \mathcal {U} \cup \{nd\}\) is a valuation function that assigns values to all variables. The element nd indicates that a variable has not been assigned a value yet or is out of scope. We write \((\ell , \nu ) \rightarrow (\ell ', \nu ')\) if execution of the statement at location \(\ell \) under variable valuation \(\nu \) transfers control to location \(\ell '\) with variable valuation \(\nu '\).

The execution \(\pi (\nu )\) of program \(P\) on a valuation \(\nu \) is a sequence of configurations \(\eta _0\), \(\eta _1\), ..., where \(\eta _0 = (\ell _0, \nu )\) and for each h, \(\eta _h \rightarrow \eta _{h+1}\). An execution terminates once the location \(exit\) is reached.

Here we are only interested in executions for which the initial valuation \(\nu \) is such that for every input variable \(x\in I\), \(\nu (x)\ne nd\), and for every non-input variable \(y\in V\cup \{o\}\), \(\nu (y)= nd\). Given a partial valuation \(\nu _I:I\mapsto \mathcal {U}\) assigning values to the input variables, let \(\nu _I^+\) be the valuation such that for every input variable \(x\in I\), \(\nu _I^+(x)= \nu _I(x)\), and for every other variable \(y\not \in I\), \(\nu _I^+(y)= nd\). We denote by \([\![{P}]\!]: (I\mapsto \mathcal {U}) \mapsto \mathcal {U}\) the partial function computed by a program \(P\), and define it as \([\![P]\!](\nu _I) = res\) iff \(\pi (\nu _I^+)\) terminates with output valuation \(\nu '\) and \(\nu '(o)=res\).

Example 1

Consider the program FindCBuggy in Fig. 1(a). The input variables \(I\) are \(\{s,c,k\}\) and the designated output variable is o. The set of program variables is the singleton \(\{j\}\). The execution of FindCBuggy on \(\nu \) such that \(\nu _I(s)=\mathtt{ab?}, \nu _I(c)=\mathtt{?}, \nu _I(k)=2\) is illustrated in the following table:

 

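$$\begin{array}{l|ccccccc}
 & \eta _0 & \eta _1 & \eta _2 & \eta _3 & \eta _4 & \eta _5 & \eta _6 \\
\hline
loc & 2 & 3 & 2 & 3 & 2 & 5 & exit \\
s & \mathtt{ab?} & \mathtt{ab?} & \mathtt{ab?} & \mathtt{ab?} & \mathtt{ab?} & \mathtt{ab?} & \mathtt{ab?} \\
c & \mathtt{?} & \mathtt{?} & \mathtt{?} & \mathtt{?} & \mathtt{?} & \mathtt{?} & \mathtt{?} \\
k & 2 & 2 & 2 & 2 & 2 & 2 & 2 \\
j & nd & 0 & 0 & 1 & 1 & 2 & nd \\
o & nd & nd & nd & nd & nd & nd & \mathtt{false}
\end{array}$$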

We thus have \([\![\textsc {FindCBuggy}]\!]\)(ab?, ?, 2) \(=\) false.
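The execution in Example 1 can be replayed with a small interpreter. The following Python sketch rests on assumptions: Fig. 1(a) is not reproduced in the text, so the program shape (a for loop at location 2 whose header initializes and increments j, the test s[j]==c at location 3, return true at location 4, and return false at location 5) is inferred from the table above and from the guards named in Sect. 3.3. The names `run_findc_buggy` and `ND` are hypothetical.

```python
ND = "nd"  # marker for "not yet assigned or out of scope"


def run_findc_buggy(s, c, k):
    """Return the configurations (location, valuation) of FindCBuggy.

    Assumed program shape (a reconstruction, not copied from Fig. 1):
      2: for (j = 0; j < k; j++)
      3:   if (s[j] == c)
      4:     return true
      5: return false
    """
    def config(loc, j, o):
        return (loc, {"s": s, "c": c, "k": k, "j": j, "o": o})

    configs = [config(2, ND, ND)]   # eta_0: only the inputs are defined
    j = 0                           # the loop header initializes j
    while True:
        if j < k:                   # loop guard at location 2
            configs.append(config(3, j, ND))
            if s[j] == c:           # if guard at location 3
                configs.append(config(4, j, ND))
                configs.append(config("exit", ND, True))
                return configs
            configs.append(config(2, j, ND))  # back to the loop header
            j += 1                  # the header increments j before re-testing
        else:
            configs.append(config(5, j, ND))
            configs.append(config("exit", ND, False))
            return configs


execution = run_findc_buggy("ab?", "?", 2)
for h, (loc, nu) in enumerate(execution):
    print(h, loc, nu["j"], nu["o"])
```

Under these assumptions the printed rows match the loc, j, and o rows of the table, and the final configuration carries the output false.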

3.2 Test Sets as Specifications

A test \(t\) is a pair \((\nu _I,res)\), where \(\nu _I:I\mapsto \mathcal {U}\) is a valuation over the input variables, and \(res\in \mathcal {U}\) is the expected output value. A program \(P\) satisfies a test \(t\) if \([\![P]\!](\nu _I)= res\). A program \(P\) satisfies a test set \(T\) if it satisfies all the tests \(t\in T\).

We use \(\dot{\pi }(t)\), read as “execution of a program on a test \(t\)”, to refer to \(\pi (\nu _I^+)\).

Example 2

Consider the test set and the program FindCBuggy in Fig. 1. Clearly, FindCBuggy does not satisfy the test set. In particular, on the first test from Example 1 we have \([\![\textsc {FindCBuggy}]\!]\)(ab?, ?, 2) \(\ne \) true.
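Test satisfaction is straightforward to state in code. The sketch below is illustrative only: `find_c_buggy` is an assumed reconstruction of FindCBuggy (Fig. 1 is not reproduced in the text), `satisfies` and `satisfies_all` are hypothetical helper names, and the test list contains only the two tests whose inputs are spelled out in Examples 1 and 6.

```python
def find_c_buggy(s, c, k):
    # Assumed reconstruction of FindCBuggy: it scans positions j < k only.
    for j in range(k):
        if s[j] == c:
            return True
    return False


def satisfies(program, test):
    """A program satisfies a test (nu_I, res) iff [[P]](nu_I) == res."""
    inputs, expected = test
    return program(**inputs) == expected


def satisfies_all(program, tests):
    """A program satisfies a test set iff it satisfies every test in it."""
    return all(satisfies(program, t) for t in tests)


tests = [
    ({"s": "ab?", "c": "?", "k": 2}, True),     # inputs from Example 1
    ({"s": "aba?gc", "c": "?", "k": 5}, True),  # inputs from Example 6
]
results = [satisfies(find_c_buggy, t) for t in tests]
print(results, satisfies_all(find_c_buggy, tests))
```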

3.3 The Program Repair Problem

In our repair model we permit program expressions to be changed, but not program statements. For example, we permit replacement of loop guards and right-hand sides of assignments and disallow replacement of an assignment with a \(\mathtt {return}\) statement. Formally, a permissible program edit applied to a labeled statement \(\ell : \mathtt{stmt}\) in program \(P\) is any modification of stmt that replaces an expression in stmt with another expression over the same domain, and leaves the label \(\ell \) unchanged.

Given program \(P\) and a subset of locations \(\textsc {loc}\subseteq L\) of \(P\), let \(\mathcal{R}_\textsc {loc}(P)\) be the set of all programs that can be obtained by applying permissible program edits to labeled statements with labels in \(\textsc {loc}\). The following proposition holds trivially.

Proposition 1

Given programs \(P\), \(P'\), with locations \(L\), \(L'\), the following statements are equivalent:

  (i) there exists unique \(\textsc {loc}\subseteq L\) such that \(P'\in \mathcal{R}_\textsc {loc}(P)\)

  (ii) there exists unique \(\textsc {loc}'\subseteq L'\) such that \(P\in \mathcal{R}_{\textsc {loc}'}(P')\)

  (iii) \(L= L'\) and there exists unique \(\textsc {loc}\subseteq L\) such that \(P'\in \mathcal{R}_\textsc {loc}(P)\) and \(P\in \mathcal{R}_\textsc {loc}(P')\).

Example 3

In Fig. 1, \(\textsc {FindCBadFix} \in \mathcal{R}_{\{3\}}(\textsc {FindCBuggy})\) as \(\textsc {FindCBadFix}\) replaces the guard \(s[j]==c\) in location 3 of \(\textsc {FindCBuggy}\) with the guard \(c==?\). Similarly, \(\textsc {FindCGoodFix} \in \mathcal{R}_{\{2\}}(\textsc {FindCBuggy})\) as \(\textsc {FindCGoodFix}\) replaces the loop guard \(j<k\) in location 2 of \(\textsc {FindCBuggy}\) with the guard \(j\le k\) (footnote 3).

Given a program \(P\) and a test set \(T\) such that \(P\) does not satisfy \(T\), the goal of program repair is to compute \(P'\) such that: (1) \(P'\) satisfies \(T\), and (2) there exists \(\textsc {loc}\subseteq L\) such that \(P'\in \mathcal{R}_\textsc {loc}(P)\).

Example 4

Consider the programs in Fig. 1. The programs FindCBadFix and FindCGoodFix are possible repairs of the program FindCBuggy with respect to the test set shown in the figure. They are both correct on the test set and, from Example 3, \(\textsc {FindCBadFix} \in \) \(\mathcal{R}_{\{3\}}(\textsc {FindCBuggy})\) and \(\textsc {FindCGoodFix} \in \mathcal{R}_{\{2\}}(\textsc {FindCBuggy})\).

4 Quantitative Program Repair

In this section, we define program distances and the quantitative program repair problem. Given two programs \(P\), \(P'\) and a test set \(T\), a program distance (footnote 4) is a function over \(P\), \(P'\) and \(T\) that quantifies how close \(P\) and \(P'\) are w.r.t. \(T\). We classify program distances as syntactic and semantic distances. A syntactic program distance simply tracks the syntactic change between \(P\) and \(P'\), independent of the test set \(T\). Hence, a syntactic program distance is a function over \(P\) and \(P'\). A semantic program distance tracks the semantic differences between \(P\) and \(P'\) with respect to executions on the test set \(T\). In particular, a semantic program distance tracks the differences in the executions of \(P\) and \(P'\) on all tests \(t\) such that both \(P\) and \(P'\) satisfy \(t\). In what follows, we define several syntactic and semantic distances. One could easily define more sophisticated distances and we invite the reader to do so. The following distances sufficed for our “proof of concept” experiments with quantitative program repair.

4.1 Syntactic Distances

A syntactic distance between programs \(P\) and \(P'\) is defined modulo an expression distance \(\varepsilon \). An expression distance tracks the syntactic difference between two expressions. In this work, we use two simple expression distances, defined below.

Boolean expression distance, \(\varepsilon _{\textsc {bool}}\), is a Boolean-valued distance that simply tracks if two expressions are equal or not:

$$\begin{aligned} \varepsilon _{\textsc {bool}}(expr, expr')&= {\left\{ \begin{array}{ll} 0 &{} \text {if } expr= expr'\\ 1 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

Expression-size distance, \(\varepsilon _{\textsc {size}}\), tracks the size of the repaired expression:

$$\begin{aligned} \varepsilon _{\textsc {size}}(expr, expr')&= {\left\{ \begin{array}{ll} 0 &{} \text {if } expr= expr'\\ size(expr') &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$

where \(size(expr')\) can be defined in different ways. For example, \(size(expr')\) could be the total number of symbols and operators in \(expr'\). In Sect. 5, we present the definition of \(size(expr')\) used in our implementation. Note that \(\varepsilon _{\textsc {size}}(expr, expr')\) is not a symmetric function.

A syntactic program distance between programs \(P\) and \(P'\) is finite only if \(P\) and \(P'\) can be obtained from each other by applying a set of permissible program edits. Given an expression distance \(\varepsilon \), a syntactic program distance accumulates the expression distance across all expression changes between \(P\) and \(P'\). Formally:

$$\begin{aligned} d_{syn}^{\varepsilon }(P, P')&= {\left\{ \begin{array}{ll} \infty &{} \text {if } \forall \textsc {loc}: P'\not \in \mathcal{R}_\textsc {loc}(P) \\ \underset{\ell \in \textsc {loc}: P'\in \mathcal{R}_\textsc {loc}(P)}{\sum } \varepsilon (expr_\ell , expr'_\ell ) &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

Note that Proposition 1 ensures the uniqueness of \(\textsc {loc}\) in the second case. Here, \(expr_\ell , expr'_\ell \) denote expressions in \(\ell \)-labeled statements of \(P\), \(P'\), respectively. Thus, if \(\varepsilon = \varepsilon _{\textsc {bool}}\), \(d_{syn}^{\varepsilon }(P, P')\) equals the number of permissible program edits required to transform \(P\) to \(P'\). Similarly, if \(\varepsilon = \varepsilon _{\textsc {size}}\), \(d_{syn}^{\varepsilon }(P, P')\) equals the total size of all new expressions in \(P'\).

Example 5

Consider the programs in Fig. 1. For \(\varepsilon = \varepsilon _{\textsc {bool}}\), one can see that \(d_{syn}^{\varepsilon }(\textsc {FindCBuggy}, \textsc {FindCBadFix})\) and \(d_{syn}^{\varepsilon }(\textsc {FindCBuggy},\textsc {FindCGoodFix})\) both equal 1, as there is exactly one permissible program edit in each case. For \(\varepsilon = \varepsilon _{\textsc {size}}\), if expression size is given by the total number of symbols and operators, \(d_{syn}^{\varepsilon }(\textsc {FindCBuggy},\textsc {FindCBadFix})\) and \(d_{syn}^{\varepsilon }(\textsc {FindCBuggy},\textsc {FindCGoodFix})\) both equal 3. Neither syntactic distance can distinguish between \(\textsc {FindCBadFix}\) and \(\textsc {FindCGoodFix}\).
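The two syntactic distances of Example 5 can be computed mechanically. The following Python sketch makes several simplifying assumptions: a program is modeled as a dictionary from location to a tuple of expression tokens, only the two guards named in Example 3 are modeled, the type check on replaced expressions is omitted, and two programs are treated as unrelated by permissible edits exactly when their location sets differ. The function names are hypothetical.

```python
import math


def eps_bool(expr, new_expr):
    """Boolean expression distance: 0 if unchanged, 1 otherwise."""
    return 0 if expr == new_expr else 1


def eps_size(expr, new_expr):
    """Expression-size distance: size of the new expression if it changed.

    Size is taken to be the number of symbols and operators (tokens).
    """
    return 0 if expr == new_expr else len(new_expr)


def d_syn(eps, p, p_new):
    """Accumulate an expression distance over all locations.

    Unchanged locations contribute 0, so summing over every location equals
    summing over the unique set loc of edited locations.
    """
    if p.keys() != p_new.keys():
        return math.inf  # no set of permissible edits relates the programs
    return sum(eps(p[loc], p_new[loc]) for loc in p)


buggy = {2: ("j", "<", "k"), 3: ("s[j]", "==", "c")}
bad_fix = {2: ("j", "<", "k"), 3: ("c", "==", "?")}
good_fix = {2: ("j", "<=", "k"), 3: ("s[j]", "==", "c")}

for name, fix in [("FindCBadFix", bad_fix), ("FindCGoodFix", good_fix)]:
    print(name, d_syn(eps_bool, buggy, fix), d_syn(eps_size, buggy, fix))
```

Both repairs obtain the same pair of values, which is the point of Example 5: neither syntactic distance separates them.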

4.2 Semantic Distances

The semantic distance between programs \(P\) and \(P'\) with respect to a test set \(T\) is defined modulo an execution distance \(\zeta \). An execution distance tracks the differences between two executions. In this paper, we consider three types of execution distances, defined on terminating executions.

Let \(T_{sat}\subseteq T\) consist of all tests \(t\) such that \(P\) and \(P'\) both satisfy \(t\). Given a test \(t\) in \(T_{sat}\), let \(\dot{\pi }(t)\), \(\dot{\pi }'(t)\) denote executions of \(P\), \(P'\), respectively on \(t\). In what follows, we fix \(\dot{\pi }(t) = \eta _0, \eta _1, \ldots , \eta _M\) and \(\dot{\pi }'(t) =\) \(\eta '_0\), \(\eta '_1\), \(\ldots \), \(\eta '_K\). Recall that a configuration \(\eta _h\) is a tuple of the form \((\ell _h, \nu _h)\).

Our execution distances essentially compute the Hamming distance between two executions, using different abstractions of configurations. For executions of equal lengths, this distance equals the minimum number of configuration substitutions required to transform one execution into another. For executions of differing lengths, this distance additionally includes the difference in the execution lengths. All three execution distances can be defined as follows:

$$\begin{aligned} \zeta (\dot{\pi }(t), \dot{\pi }'(t))&= {\left\{ \begin{array}{ll} |M - K| \, + \, \sum \nolimits _{h=0}^{min(M,K)} \; \mathtt {diff}(\eta _h, \eta '_h) &{} \text {if } M, K < \infty \\ \infty &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$

where the definition of \(\mathtt {diff}(\eta _h, \eta '_h)\) varies for each execution distance.

Concrete execution distance, \(\zeta _{\textsc {conc}}\), compares both locations and variable values in two executions: \(\mathtt {diff}_{conc}(\eta _h, \eta '_h) = 0\) if \(\eta _h = \eta '_h\) and 1 otherwise.

Value execution distance, \(\zeta _{\textsc {val}}\), only compares the variable values in two executions: \(\mathtt {diff}_{val}(\eta _h, \eta '_h) = 0\) if \(\nu _h = \nu '_h\) and 1 otherwise.

Location execution distance, \(\zeta _{\textsc {locs}}\), only compares the locations in two executions: \(\mathtt {diff}_{loc}(\eta _h, \eta '_h) = 0\) if \(\ell _h = \ell '_h\) and 1 otherwise.

A semantic program distance between programs \(P\), \(P'\) w.r.t. test set \(T\) is finite only if \(P\) and \(P'\) can be obtained from each other by applying a set of permissible program edits and \(T_{sat}\) is not empty. Given an execution distance \(\zeta \) and the set \(T_{sat}\), a semantic program distance accumulates the execution distance between executions of \(P\) and \(P'\) on tests in \(T_{sat}\). Formally:

$$\begin{aligned} d_{sem}^\zeta (P, P', T)&= {\left\{ \begin{array}{ll} \infty &{} \text {if } \forall \textsc {loc}: P'\not \in \mathcal{R}_\textsc {loc}(P) \text { or } T_{sat}\text { is empty}\\ \underset{t\in T_{sat}}{\sum }\; \zeta (\dot{\pi }(t), \dot{\pi }'(t)) &{} \text {otherwise.} \\ \end{array}\right. } \end{aligned}$$

Fig. 2. Semantic distances between executions. We do not show the input variables s, c, and k as their values are never modified.

Example 6

The executions of FindCBuggy and FindCBadFix from Fig. 1 on \(\nu \) such that \(\nu _I(s)=\mathtt{aba?gc}, \nu _I(c)=\mathtt{?}, \nu _I(k)=5\) are shown in Fig. 2. The last 3 rows of the table show \(\mathtt {diff}(\eta _h, \eta '_h)\) for \(h = 0, 1, 2, 3\). Note that \(\mathtt {diff}_{val}\) doesn’t distinguish between \(\eta _2\) and \(\eta '_2\) as these configurations share the same variable values. The difference in lengths of the given two executions is 6. Thus, for these executions, \(\zeta _{\textsc {conc}}\) = \(\zeta _{\textsc {locs}}\) = 6 + 2 = 8, and \(\zeta _{\textsc {val}}\) = 6 + 1 = 7.

The execution of FindCGoodFix on the same \(\nu \) with \(\nu _I(s)=\mathtt{aba?gc}, \nu _I(c)=\mathtt{?}, \nu _I(k)=5\) is exactly the same as the execution of FindCBuggy shown in Fig. 2. Hence, \(\zeta _{\textsc {conc}}\) = \(\zeta _{\textsc {locs}}\) = \(\zeta _{\textsc {val}}\) = 0 for the executions of FindCBuggy and FindCGoodFix. Our semantic program distances can distinguish between FindCBadFix and FindCGoodFix.
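The three execution distances of Example 6 can be checked with a few lines of Python. This is a sketch under stated assumptions: Fig. 2 is not reproduced in the text, so the two executions below are hand-reconstructed from the location traces in Sect. 2 and the conventions of Example 1 (j is nd before the loop and after exit). As in Fig. 2, the input variables are omitted from the valuation, so a configuration is a pair (location, (j, o)). Only terminating executions are representable as finite lists, so the infinite case of the definition does not arise here.

```python
ND = "nd"


def diff_conc(a, b):
    """Compare both location and variable values."""
    return 0 if a == b else 1


def diff_val(a, b):
    """Compare variable values only."""
    return 0 if a[1] == b[1] else 1


def diff_loc(a, b):
    """Compare locations only."""
    return 0 if a[0] == b[0] else 1


def zeta(diff, pi, pi_new):
    """Hamming-style distance between two terminating executions."""
    m, k = len(pi) - 1, len(pi_new) - 1  # indices of the last configurations
    common = sum(diff(pi[h], pi_new[h]) for h in range(min(m, k) + 1))
    return abs(m - k) + common


# Reconstructed executions on s = aba?gc, c = ?, k = 5 (an assumption).
buggy = [
    (2, (ND, ND)), (3, (0, ND)), (2, (0, ND)), (3, (1, ND)), (2, (1, ND)),
    (3, (2, ND)), (2, (2, ND)), (3, (3, ND)), (4, (3, ND)), ("exit", (ND, True)),
]
bad_fix = [(2, (ND, ND)), (3, (0, ND)), (4, (0, ND)), ("exit", (ND, True))]
good_fix = list(buggy)  # identical to the buggy execution on this test

for name, d in [("conc", diff_conc), ("locs", diff_loc), ("val", diff_val)]:
    print(name, zeta(d, buggy, bad_fix), zeta(d, buggy, good_fix))
```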

4.3 The Quantitative Program Repair Problem

Given a program \(P\) and a test set \(T\) such that \(P\) does not satisfy \(T\), syntactic distance functions \(d_{syn}^1,\ldots ,d_{syn}^x\), semantic distance functions \(d_{sem}^1,\ldots ,d_{sem}^y\), and objective functions \(f_1, \ldots , f_z\) over \(d_{syn}^1,\ldots ,d_{syn}^x,d_{sem}^1,\ldots ,d_{sem}^y\), the goal of quantitative program repair is to compute \(P'\) such that:

  1. (1)

    \(P'\) satisfies \(T\),

  2. (2)

    there exists \(\textsc {loc}\subseteq L\) such that \(P'\in \mathcal{R}_\textsc {loc}(P)\), and

  3. (3)

    \(P'= \underset{\exists \widehat{\textsc {loc}} \subseteq L: \widehat{P}\in \mathcal{R}_{\widehat{\textsc {loc}}}(P)}{\arg \min } \; \underset{1\le i\le z}{\textsc {aggregate}} \{ f_i(d_{syn}^1(P, \widehat{P}),\ldots ,d_{sem}^y(P,\widehat{P}, T))\}\).

Here aggregate allows multiple objective functions to be combined. For example, aggregate could enforce Pareto optimality.

Example 7

Consider the programs in Fig. 1. In Example 4, we showed that both FindCBadFix and FindCGoodFix satisfy conditions (1) and (2) of the quantitative program repair problem for program FindCBuggy and the test set shown in Fig. 1.

In Example 5, we showed that both \(d_{syn}^{\varepsilon }(\textsc {FindCBuggy}, \textsc {FindCBadFix})\) and \(d_{syn}^{\varepsilon }(\textsc {FindCBuggy},\textsc {FindCGoodFix})\) equal 1 for \(\varepsilon = \varepsilon _{\textsc {bool}}\). For \(\varepsilon = \varepsilon _{\textsc {size}}\), \(d_{syn}^{\varepsilon }(\textsc {FindCBuggy},\textsc {FindCBadFix})\) and \(d_{syn}^{\varepsilon }(\textsc {FindCBuggy},\textsc {FindCGoodFix})\) both equal 3.

The set \(T_{sat}\) consists of the last two tests in the test set \(T\) in Fig. 1. Let the test with \(s=\mathtt{aba?gc}\) be denoted \(t_1\) and the test with \(s=\mathtt{?aba}\) be denoted \(t_2\). Let \(\dot{\pi }(t_1)\) and \(\dot{\pi }(t_2)\) denote the executions of FindCBuggy on \(t_1\) and \(t_2\), respectively. Let \(\dot{\pi }'(t_1)\), \(\dot{\pi }'(t_2)\) and \(\dot{\pi }''(t_1)\), \(\dot{\pi }''(t_2)\) denote the executions of \(\textsc {FindCBadFix}\) and \(\textsc {FindCGoodFix}\) on \(t_1\), \(t_2\), respectively.

We have seen in Example 6 that \(\zeta _{\textsc {conc}}(\dot{\pi }(t_1), \dot{\pi }'(t_1)) = 8\) and \(\zeta _{\textsc {conc}}(\dot{\pi }(t_1), \dot{\pi }''(t_1)) = 0\). It’s not hard to see that \(\zeta _{\textsc {conc}}(\dot{\pi }(t_2), \dot{\pi }'(t_2)) = 0\) and \(\zeta _{\textsc {conc}}(\dot{\pi }(t_2), \dot{\pi }''(t_2)) = 0\). Thus, we can compute \(d_{sem}^{\zeta _{\textsc {conc}}}(\textsc {FindCBuggy}, \textsc {FindCBadFix}, T) = 8\) and \(d_{sem}^{\zeta _{\textsc {conc}}}(\textsc {FindCBuggy}, \textsc {FindCGoodFix}, T) = 0\).

If we choose \(d_{syn}^{\varepsilon _{\textsc {bool}}}\), \(d_{syn}^{\varepsilon _{\textsc {size}}}\) as our syntactic distances, \(d_{sem}^{\zeta _{\textsc {conc}}}\) as our semantic distance and our objective function f to simply be the sum of \(d_{syn}^{\varepsilon _{\textsc {bool}}}\), \(d_{syn}^{\varepsilon _{\textsc {size}}}\) and \(d_{sem}^{\zeta _{\textsc {conc}}}\), the value of f is 4 for FindCGoodFix and is 12 for FindCBadFix. Hence, this instance of the quantitative program repair problem will prefer the program FindCGoodFix as a repair candidate.
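Spelling out the arithmetic behind these two values, with the summands taken in the order \(d_{syn}^{\varepsilon _{\textsc {bool}}}\), \(d_{syn}^{\varepsilon _{\textsc {size}}}\), \(d_{sem}^{\zeta _{\textsc {conc}}}\) and the distance values computed in this example:

$$\begin{aligned} f(\textsc {FindCGoodFix})&= 1 + 3 + 0 = 4,\\ f(\textsc {FindCBadFix})&= 1 + 3 + 8 = 12. \end{aligned}$$

The two candidates tie on both syntactic terms, so the semantic term alone decides the preference.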

5 Quantitative Program Repair Using Sketch

In this section, we describe the formulation of the quantitative program repair problem as an instance of the MAX-SMT problem. We encode the program semantics using a symbolic Boolean encoding and specify the functional correctness of the program w.r.t. the given test set T as a hard constraint. The syntactic and semantic distances are encoded using soft constraints. The repair generated by the MAX-SMT solver maximizes the number of soft constraints that can be satisfied while ensuring the satisfaction of the hard constraints. We perform a syntax-directed translation from the source imperative language to Sketch [26], and use the minimization algorithm in Sketch to solve the MAX-SMT constraints. Instead of using a general MAX-SMT solver, we use the Sketch solver because of the ease in translation of the buggy programs into constraints. The Sketch solver allows for optimization constraints similar to MAX-SMT, but uses several algorithmic optimizations before encoding the problem into low-level SMT constraints. We now describe the key ideas in the formulation and translation of the quantitative program repair problem using the Sketch system.

5.1 Background on Sketch

Sketch is a synthesis system for writing partial programs (with holes) together with some high-level specifications of the programs. The synthesis algorithm fills the holes automatically using a constraint-based, counterexample-guided inductive synthesis (CEGIS) algorithm such that the completed program satisfies the given specifications. For example, consider the Sketch program shown in Fig. 3(a). One possible completion synthesized by the Sketch system is shown in Fig. 3(b). The hole expressions ?? can take any constant integer value, and they can further be composed to construct more complex unknown expressions.

Fig. 3. (a) A simple Sketch program and (b) a possible completion.

5.2 Space of Expression Edits

For our quantitative program repair encoding, we restrict the class of expressions that can potentially be modified by the solver to (i) the set of conditional expressions and (ii) the right hand side expressions of assignment statements. Furthermore, to restrict the space of possible repairs, we use an expression template corresponding to a linear combination of constants and all program variables in scope at the program location. In Sketch, the modifiable expressions are replaced by functions that either allow for returning the original unmodified expression in the program or some instantiation of the expression template.

For example, the Sketch translation for the buggy program in Fig. 1(a) is shown in Fig. 4(a). The conditional expressions and the right hand side expressions of the assignment statements are translated to change functions \(f_i\). An example change function \(f_1\) is shown in Fig. 4(b). Each change function \(f_i\) is associated with a Boolean variable \(b_{f_i}\) that indicates if the original expression is selected (\(b_{f_i} = 0\)) or some new expression is selected for the completion of the function \(f_i\) (\(b_{f_i} = 1\)). Each set of possible new expressions is represented as a linear combination of program variables of appropriate types where the coefficients of the variables are denoted using unknown values \(??_{i,j}\). For expressions involving strings, the change function restricts the edit expression to consist of only 1 character from that string. The characters are then interpreted as integers in Sketch.
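The idea of a change function can be mimicked in ordinary Python. The sketch below is not Sketch code and does not reproduce Fig. 4, which is not included in the text. It assumes a simplified setting in which the change function is applied to the integer bound k of the loop guard rather than to the whole Boolean guard, and in which the hole values (the flag b, a constant c0, and one coefficient per variable) are supplied by hand instead of being found by a solver. All names are hypothetical.

```python
def make_change_function(original, holes):
    """Build a change function for one integer expression.

    `original` maps the in-scope variables to the original expression value.
    `holes` is (b, c0, coeffs): b == 0 keeps the original expression, and
    b == 1 selects the linear template c0 + sum(coeffs[v] * env[v]).
    """
    b, c0, coeffs = holes

    def f(env):
        if b == 0:
            return original(env)
        return c0 + sum(coeffs[v] * env[v] for v in coeffs)

    return f


def find_c(s, c, k, bound):
    # FindC with its loop bound routed through a change function
    # (an assumed program shape, reconstructed from the text).
    j = 0
    while j < bound({"j": j, "k": k}):
        if s[j] == c:
            return True
        j += 1
    return False


# b = 0: the original bound k is kept, so the program is still buggy.
keep = make_change_function(lambda env: env["k"], (0, 0, {"j": 0, "k": 0}))
# b = 1: the template is instantiated to 1 + 0*j + 1*k, i.e. j < k + 1.
repair = make_change_function(lambda env: env["k"], (1, 1, {"j": 0, "k": 1}))

print(find_c("ab?", "?", 2, keep), find_c("ab?", "?", 2, repair))
```

With the second set of hole values the guard becomes j < k + 1, which over the integers agrees with the guard \(j\le k\) of FindCGoodFix.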

Fig. 4. The Sketch translation for the FindCBuggy program from Fig. 1(a).

5.3 Encoding Distances

We now describe how the syntactic and semantic distances are encoded as constraints in the Sketch system.

Syntactic Distances. We encode our syntactic distances modulo our two expression distances in Sketch as follows.

  • Boolean expression distance: The syntactic distance \(d_{syn}^{\varepsilon }\) for \(\varepsilon = \varepsilon _{\textsc {bool}}\) computes the number of expression changes that are performed by the solver and is computed as \(\varSigma _i b_{f_i}\). The Boolean variable \(b_{f_i}\) is set to 0 if the expression corresponding to function \(f_i\) remains unchanged in the final solution and is set to 1 otherwise.

  • Expression-size distance: The syntactic distance \(d_{syn}^{\varepsilon }\) for \(\varepsilon = \varepsilon _{\textsc {size}}\) computes the total size of modified expressions, where the size of a modified linear arithmetic expression corresponding to \(f_i\) is computed as the sum of the absolute values \(|??_{i,j}|\) of all of its coefficients. Thus, \(d_{syn}^{\varepsilon _{\textsc {size}}}\) is defined as \(\varSigma _{i} \varSigma _j\, |??_{i,j}|\).

Semantic Distances. We encode our semantic distance modulo the concrete execution distance. The \(\textsc {Sketch}\) translation is instrumented to capture program states at different program locations as shown in Fig. 5(a), where \(S^t_\ell [j]\) denotes the program state for the \(j^\mathtt{th}\) loop iteration at program location \(\ell \) for a test case t. The concrete execution distance \(\zeta _{\textsc {conc}}\) between the original program and the modified program on a test case t in \(T_{sat}\) is computed as \(\varSigma _{\ell ,j} \phi (S^t_\ell [j],S^{\mathtt{orig,t}}_\ell [j])\), where the function \(\phi \) counts the number of variables that do not have equal values across two states \(S^t_\ell [j]\) and \(S^{\mathtt{orig,t}}_\ell [j]\), as shown in Fig. 5(b). Our encoding enforces a bound on the length of program executions by unrolling loops a fixed number of times.
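The state comparison described here can be sketched in Python as follows. This is an illustration rather than the encoding of Fig. 5, which is not reproduced in the text: recorded states are modeled as dictionaries keyed by (location, iteration), and treating a point reached by only one of the two runs as an empty state is an assumption of this sketch. The names `phi` and `concrete_distance` and the sample states are hypothetical.

```python
_MISSING = object()  # sentinel for a variable absent from a recorded state


def phi(state, orig_state):
    """Count the variables whose values differ between two recorded states."""
    variables = set(state) | set(orig_state)
    return sum(
        1
        for v in variables
        if state.get(v, _MISSING) != orig_state.get(v, _MISSING)
    )


def concrete_distance(states, orig_states):
    """Sum phi over all instrumented (location, iteration) points.

    Both arguments map (location, iteration) to a dict of variable values.
    A point recorded in only one run is compared against an empty state
    (an assumption of this sketch).
    """
    points = set(states) | set(orig_states)
    return sum(phi(states.get(p, {}), orig_states.get(p, {})) for p in points)


# Hypothetical recorded states at location 3 for two loop iterations.
orig_run = {(3, 0): {"j": 0}, (3, 1): {"j": 1}}
same_run = {(3, 0): {"j": 0}, (3, 1): {"j": 1}}
other_run = {(3, 0): {"j": 0}, (3, 1): {"j": 2}}

print(concrete_distance(same_run, orig_run), concrete_distance(other_run, orig_run))
```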

Fig. 5. Encoding semantic distance in Sketch.

Quantitative Objective. The final quantitative objective in the Sketch translation is encoded as the following constraint:

$$\mathtt{assert} ~ d_{syn}^{\varepsilon _{\textsc {bool}}} < N \; \wedge \; \mathtt{minimize}~(d_{syn}^{\varepsilon _{\textsc {size}}} + d_{sem}^{\zeta _{\textsc {conc}}})$$

We first use a linear iterative search to find the minimum number of expression changes N that are needed to repair the buggy program. After computing the value N, we then add the minimization constraint to find a repair with minimum semantic distance \(d_{sem}^{\zeta _{\textsc {conc}}}\) and simpler expression modifications \(d_{syn}^{\varepsilon _{\textsc {size}}}\). The Sketch solver uses an incremental search methodology to compute the repair that corresponds to the minimum objective function value [24]. The hard constraints specifying functional correctness w.r.t. a test set T are encoded in a standard way using assert statements in Sketch. If we refer back to the definition of quantitative program repair in Sect. 4.3, the resulting repaired program is

$$P'= \underset{\exists \widehat{\textsc {loc}} \subseteq L: \widehat{P}\in \mathcal{R}_{\widehat{\textsc {loc}}}(P)}{\arg \min } \; \langle d_{syn}^{\varepsilon _{\textsc {bool}}}(P, \widehat{P}), d_{syn}^{\varepsilon _{\textsc {size}}}(P, \widehat{P}) + d_{sem}^{\zeta _{\textsc {conc}}}(P, \widehat{P})\rangle .$$

In this case, the aggregation operator is the one that first minimizes the left element of the pair and then the right one.
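The two-phase search and the lexicographic aggregation can be illustrated on an explicitly enumerated set of candidate repairs. The Python sketch below is not the Sketch encoding: the `Candidate` record, the function `best_repair`, and the assumption that all distances are precomputed are hypothetical and serve only to show the order in which the objectives are minimized.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    """A hypothetical candidate repair with precomputed distances."""
    name: str
    correct: bool  # passes every test in T
    d_bool: int    # number of changed expressions
    d_size: int    # size of the expression modifications
    d_conc: int    # concrete execution distance


def best_repair(cands: List[Candidate], max_changes: int) -> Optional[Candidate]:
    # Phase 1: linear search for the smallest number of expression
    # changes n for which some correct candidate exists.
    for n in range(max_changes + 1):
        feasible = [c for c in cands if c.correct and c.d_bool <= n]
        if feasible:
            # Phase 2: among those, minimize d_size + d_conc.
            return min(feasible, key=lambda c: c.d_size + c.d_conc)
    return None


cands = [
    Candidate("A", True, 1, 3, 4),   # one change, large semantic change
    Candidate("B", True, 1, 2, 1),   # one change, small semantic change
    Candidate("C", True, 2, 0, 0),   # needs two changes
    Candidate("D", False, 0, 0, 0),  # fails the tests
]
chosen = best_repair(cands, max_changes=3)

# The same choice expressed as a single lexicographic minimization.
lexicographic = min(
    (c for c in cands if c.correct),
    key=lambda c: (c.d_bool, c.d_size + c.d_conc),
)
```

Candidate C has the smallest value for the second component, but it is not selected because the first component (the number of changed expressions) takes priority.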

6 Evaluation

We implemented a prototype tool \(\textsc {Qlose} \) that, given a (simplified) C# program, a set of test cases, and the desired types of distances, constructs a Sketch program with the corresponding constraints to encode the quantitative program repair problem. We evaluated \(\textsc {Qlose} \) on 11 representative benchmark programs using the distances presented in Sect. 5. Our preliminary results suggest that \(\textsc {Qlose} \) is practically feasible for small student solutions and generates more desirable repairs when using a combination of syntactic and semantic distances (Footnote 5).

6.1 Benchmarks

Our benchmark set consists of 11 representative buggy programs taken from student submissions to introductory programming courses and recent program repair literature. The LargestGap problem is taken from the Microsoft CodeHunt platform [29] and asks students to write a program to compute the largest difference between any two values in a given input array of integers. The FindC program is the same as FindCBuggy in Fig. 1. The tcas-semfix benchmark is taken from the SemFix [20] system and corresponds to a code excerpt from the Tcas benchmark (Footnote 6). The max3 problem asks students to compute the maximum of 3 integers. The iterPower, epoly, and multIA problems are taken from the Introduction to Programming course taught on the edX platform. The iterPower problem asks students to write an iterative program that, given two integers m and n, computes the value \(m^n\). The epoly problem evaluates a polynomial (defined using an array of integer coefficients) on an integer value, and the multIA problem requires students to write a program to compute the multiplication of two integers using successive additions.

The number of lines of code (LOC), the number of variables (\(\mid \)Vars\(\mid \)), and the number of test input-output pairs (\(|T_{sat}|\)) for each benchmark problem (Footnote 7) are shown in Fig. 6. The number of lines in the benchmarks varied from 4 to 10, whereas the number of variables and the number of test cases varied from 3 to 5. For the CodeHunt benchmarks, we reused the test input-output pairs automatically generated by the CodeHunt engine. For the tcas-semfix benchmark, we used the tests from the SemFix paper [20]. For the benchmarks obtained from the edX class, we manually selected the relevant test cases that exposed different corner case behaviors.

Fig. 6. Solving times and the desiredness of the generated repairs for different distances. TO denotes that the solver timed out (> 20 min). The symbol ✓ (resp. ✗) denotes that the generated repair was (resp. wasn’t) the desired one.

6.2 Desired Repairs

The experimental results obtained by running \(\textsc {Qlose} \) on different benchmarks using different distances are shown in Fig. 6. We manually inspected the repairs generated using different distance metrics and classified them into desired (✓) or not (✗). For performing this classification, we did not inspect the reference code for the problem, but instead inspected the original buggy program and manually inferred the algorithm the student (or programmer) likely intended to implement. We then checked whether the repaired program matched the intended algorithm.

We can observe that using only syntactic or only semantic distances sometimes leads to undesired repairs, whereas combining the two distances always leads to the desired fixes in our benchmark set. For example, for the LargestGap-2 program shown in Fig. 7(a), the syntactic distance encoding causes the solver to come up with a fix that sets the loop initialization variable i to 0 instead of 1. Although this repair is correct on the test cases, it is less desirable than the repair that assigns a[0] to the low variable l, which corresponds to the solution that the student had in mind. \(\textsc {Qlose} \) generates this repair when it uses both syntactic and semantic distances. A similar example of a desirable repair generated by \(\textsc {Qlose} \) using both syntactic and semantic distances is illustrated in Fig. 7 for the ePoly-1 benchmark.

Fig. 7. (a) The original LargestGap-2 and ePoly-1 programs, (b) the repair generated by the syntactic distance, and (c) the repair generated by the combination of syntactic and semantic distances that corresponds to the desired repair.

6.3 Solving Time

The solving times for different combinations of syntactic and semantic distances are shown in Fig. 6. As expected, the syntactic distances take the smallest amount of time to resolve the sketches. For some problems, the semantic distances also resolve within a few seconds, but there are some cases where the solver takes much longer (including a case where the solver times out at 20 min). Our hypothesis for this phenomenon is that the semantic constraints by themselves under-constrain the space of repairs, which causes the solver to search a larger space for finding the optimal solution for the minimization objective. On the other hand, by combining syntactic and semantic distances, \(\textsc {Qlose} \) can solve the sketches with minimization constraints within 20 s for each benchmark.

Fig. 8. Repairs obtained for FindC with syntactic distance for different test sets.

6.4 Repairs with Different Test Sets

In this experiment, we evaluate the effect of using different sets of tests on the repairs generated by \(\textsc {Qlose} \). We empirically observe that the combination of syntactic and semantic distances is more robust with respect to changes in the test set than the individual distances. For example, Fig. 8 shows that when we vary the test set for the FindC benchmark, using only syntactic distances yields different and undesired repairs. In contrast, we obtain the same desired repair using the combined distance for these test sets.

7 Related Work

We review relevant work focussing on sequential, imperative software programs.

The authors in [30] were the first to emphasize the need to look for repaired programs that are semantically close to the original program. But they did not develop a quantitative formulation of the problem and relied on choosing sets of traces of the original program to be preserved exactly. There are several program repair approaches that aim to find repairs that are syntactically close to the original program [16, 19, 23, 24]. As we have discussed in the paper, focussing just on syntactic changes can lead to non-intuitive repairs. The AutoProf system [24] uses the Sketch solver to compute the minimum number of syntactic changes to incorrect student solutions based on a manual error model. \(\textsc {Qlose} \), on the other hand, uses additional syntactic and semantic distances, and generalizes the set of expression modifications using linear combinations of constants with program variables.

There is also a growing and interesting body of work on quantitative notions for verification and synthesis [4, 5, 11], which formalize distances between specifications and systems or between systems themselves. However, these distances mostly apply to reactive systems and temporal logic specifications. There have also been many proposals for scaling program repair and synthesis to large programs. These are based on techniques ranging from constraint-solving [20, 27, 28], winning strategies in games [14], abstractions [9, 18, 22], mutations [7], genetic algorithms [2, 8], using contracts [31], and focusing on data structure manipulations [25, 32]. As we develop \(\textsc {Qlose} \) further, we hope to leverage some of these techniques and improve the scope of our approach.

Many fault localization algorithms are based on analyzing error traces [3, 6, 15, 33]. Some of these techniques can be used as a preprocessing step to improve the efficiency of our algorithm. A recent paper [17] finds the root cause of an equivalence failure in binaries using a notion of semantic similarity between programs. The problem setting is quite different from ours and the notions of similarity mostly refer to the program abstract semantics rather than to concrete executions. We wish to explore whether the distances proposed in [17] can be instantiated in our framework.

A more general question is whether the notions of program distances appearing in quantitative program analysis and program repair can be modeled in \(\textsc {Qlose} \). While simple limits on the number of syntactic edits clearly fall within our framework [19], some complex distances could take into account features that we currently do not model. For example, [23] uses location-specific costs that cannot be captured using our current definitions. Extending \(\textsc {Qlose} \) to more complex distances is an interesting research direction.

In this paper, we use manual code inspection to decide which repair is most natural. Recently, many data-driven techniques have been proposed to reason about code naturalness [12, 21]. These techniques learn language models of source code from a large code corpus and then use these models for several applications, such as learning natural coding conventions, code suggestion and auto-completion, improving code style, and suggesting variable and method names. Using such automatic techniques to classify repairs is an interesting direction.

8 Limitations and Conclusion

We introduced the quantitative program repair problem, informally described as follows: given a set D of syntactic and semantic distances, a program P, and a set of test cases T, find the closest program \(P'\) (with respect to some function over the distances in D) such that \(P'\) is correct on all the tests in T. We differentiated ourselves from previous approaches by showing that, to find “natural” program repairs, both semantic and syntactic distances are necessary. Our techniques have been implemented in a prototype tool Qlose, but some limitations need to be addressed. The most important ones are that the distances are tailored to specifications given as test sets and that Qlose only handles programs with tens of lines of code. Addressing these limitations is part of our research agenda.