
Efficient Test Chip Design via Smart Computation

Published: 22 March 2023


Abstract

Submitted to the Special Issue on Machine Learning for CAD (ML-CAD). Competitive strength in the semiconductor field depends on yield. The challenges associated with designing and manufacturing leading-edge integrated circuits (ICs) have increased, which reduces yield. Test chips, especially full-flow logic test chips, are increasingly employed to investigate the complex interaction between layout features and the process, improving overall process quality before and during initial mass production. However, designing a high-quality full-flow logic test chip can be time-consuming due to the huge design space and the complex search for an optimal result. This work describes a new design flow that significantly accelerates the logic test chip design process. First, we deploy a random forest classification technique to predict synthesis outcomes for test chip design exploration. Next, a new method is described to efficiently solve the integer programming problem involved in the design process. Experiments with industrial designs demonstrate that the two proposed methods greatly improve design efficiency.


1 INTRODUCTION

The reduced size of electronic components has made the semiconductor industry extremely capital-intensive. Subwavelength lithography and local layout effects create dominant layout-geometry dependencies at the 16 nm node and below, making fast yield ramping increasingly challenging. There are thus substantial economic benefits to fast yield ramping at leading-edge technology nodes; in other words, successful IC manufacturing requires aggressive yield-loss reduction.

Conventionally, many types of test chips are used at different stages of the technology development process according to manufacturing maturity [11], ranging from SRAM and short-flow test chips with particular front-end-of-line (FEOL) or back-end-of-line (BEOL) layout geometries to full-flow logic (FFL) test chips. Specifically, Figure 1 describes the development process for a generic technology, divided into six separate stages, along with the types of test chips manufactured at each stage. For example, at the beginning of technology development, simple proof-of-concept structures such as comb drives (structures shaped like two interleaved combs, used to measure the defect density-size distribution) and via arrays are used to evaluate the manufacturing process independently. As the process matures, more complicated layout geometries are incorporated into the test chips, such as larger SRAM blocks and short-flow test chips with either FEOL or BEOL layout geometries. As process defectivity decreases, full-flow test chips are manufactured using the developing process design kit (PDK). The full-flow test chips include the standard automated place-and-route (SAPR) logic test chip and product-representative test chips, including large SAPR logic test chips. The full-flow SAPR logic test chips are intended to catch sources of yield loss that impact the random logic used in product designs. In addition, since full-flow SAPR logic test chips follow the standard automated place-and-route flow, they can be used by fabless companies to identify and correct product yield losses before full-volume manufacturing.

Fig. 1.

Fig. 1. Outline of a generic semiconductor technology development process.

The most common FFL test chips employed in industry are sub-circuits (e.g., a floating-point unit) from existing product designs. While such sub-circuits contain actual design features (e.g., standard-cell usage and complex layout geometries), their primary drawback is low transparency to a large universe of failures, which makes failure analysis (yield learning) on conventional FFL test chips difficult. To address this shortcoming, work in References [5, 6] describes a new type of FFL test chip called the Carnegie-Mellon Logic Characterization Vehicle (CM-LCV). The CM-LCV is designed for maximal testability and diagnosability while being sensitive to the defect mechanisms that affect product designs. It is based on the insight that systematic defects are sensitive to the physical features of a design (i.e., layout geometries) rather than its logic functionality. This provides the freedom to select a logical functionality and structure that maximize testability and diagnosability, and a layout implementation that has product-like physical features. Because test chips are not manufactured in high volume and do not have specific functions, common metrics such as area, power, and so on are not of concern. Specifically, the functionality of the CM-LCV is a two-dimensional array of functional unit blocks (FUBs) that implement one or more information-lossless functions with equal numbers of inputs and outputs. The logical functionality of the CM-LCV maximizes both testability and diagnosability for a variety of defect types [6].

However, designing a highly effective CM-LCV is not trivial. The design process includes multiple stages, some of which are complex and time-consuming. To achieve an optimal CM-LCV design, a huge search space must be explored, and the search itself is difficult. Before introducing the details of the CM-LCV design flow, two main challenges in designing the CM-LCV are described here as motivation for this work.

1.1 Challenges in Creating FUBs

One important step of the CM-LCV design flow is the creation of a FUB library that includes many unique FUB implementations. "Unique" here means the logical structure of a FUB implementation differs from every other implementation within the FUB library. A FUB library full of unique FUBs eases design-reflection objectives such as matching standard-cell usage. In other words, it becomes both easier and more likely to identify a set of unique FUB implementations such that the distribution of standard cells within the set matches a targeted design. Figure 2 illustrates simple examples of matching two targets with several unique full adder (FA) implementations FA\(_1\), FA\(_2\), and FA\(_3\). The first target distribution includes four standard-cell types (i.e., NAND2, AND2, OR3, and XOR3). Two implementations (FA\(_1\) and FA\(_2\)) match the first distribution perfectly. The second distribution includes two new cell types (OR2 and XOR2) and cannot be matched using only FA\(_1\) and FA\(_2\). However, with implementation FA\(_3\), we can perfectly match the second distribution using one instance of FA\(_1\), one instance of FA\(_2\), and four instances of FA\(_3\). The CM-LCV does not use the adder function for a number of reasons, but the example illustrates that a variety of FUB implementations is crucial for matching a given standard-cell distribution and for achieving flexible physical-feature incorporation.

Fig. 2.

Fig. 2. An example of matching two target standard-cell distributions with unique full adder implementations.

Unfortunately, creating a variety of unique FUB implementations is extremely expensive, since a FUB function has to be synthesized millions of times to accurately meet the design requirements. So much synthesis is needed because synthesis cannot guarantee the generation of a new unique FUB implementation that differs from all previously generated ones. For each synthesis run, although a new configuration (e.g., requirements on which standard cells can be used; more details in Section 2) is provided as input, the result can still be a non-unique implementation that satisfies the configuration. Figure 3 gives statistics collected from a FUB library for previous CM-LCVs that required six weeks of generation time using 64 2.2 GHz CPU cores and 1 TB of RAM. The solid line shows that the number of unique implementations grows slowly with more synthesis. On average, 75 synthesis runs are required to produce one unique implementation. Overall, nearly 7.5 million synthesis runs are performed to reduce the mismatch rate of cell usage (as shown by the dashed line) to an acceptable level. Therefore, if we can learn which synthesis configurations are likely to lead to unique implementations before synthesis is executed, then many useless runs can be avoided, and the efficiency of the CM-LCV design process can be significantly improved.

Fig. 3.

Fig. 3. The number of unique FUB implementations (solid line) and the mismatch rate of cell usage (dashed line) as a function of the number of synthesis runs. The dashed line is fitted from six samples; six CM-LCVs were created to measure the mismatch.

Another challenge of FUB library creation is ensuring high testability of the logic-level implementations. Since one objective of the CM-LCV is to achieve high transparency to defects, after the FUB library is established, the testability of each unique FUB implementation has to be measured by an Automatic Test Pattern Generator (ATPG) to obtain coverage for the various fault models of concern. FUB implementations with low testability are disqualified from use within the CM-LCV, so significant compute resources are wasted measuring their poor testability. Figure 4 shows the coverage distribution of the input pattern (IP) fault model [4] for all the unique FUB implementations created in the process illustrated in Figure 3. From the distribution, we observe that only 24.7% of the unique implementations achieve at least 75% IP fault coverage. Therefore, significant resources can be saved if the testability of each synthesized FUB implementation can be accurately predicted as satisfactory or not.

Fig. 4.

Fig. 4. IP fault coverage distribution of various FUB implementations.

The aforementioned observations motivate reducing the number of synthesis/ATPG runs by predicting their outcomes, thereby accelerating the CM-LCV design process. Specifically, the objective is to develop a methodology to predict whether a certain operation (i.e., synthesis with a certain configuration, or testability measurement on a certain FUB implementation) will result in a satisfactory outcome (i.e., a unique or highly testable FUB implementation, respectively). Preliminary analysis reveals no simple correlation either between a synthesis configuration and the uniqueness of its outcome or between circuit structure and testability. Based on the recent successes of machine learning (ML) in uncovering higher-dimensional correlations [12, 13], we deploy random forests (RFs) for predicting FUB uniqueness and FUB testability, as described in Section 3.

1.2 Challenges in Matching Cell Distribution

The next challenge arises after a well-developed FUB library is created. The task of identifying FUB implementations that mimic a given standard-cell usage distribution can be accomplished by solving a constrained under-determined equation for an integer vector solution [5, 17]. Such an integer programming (IP) problem has been proven to be NP-hard [21] if every possible integral solution must be enumerated. The branch-and-bound algorithm [9], adopted by widely used commercial IP solvers [1, 23], reduces the computational effort of pure enumeration by searching branches and discarding unpromising ones. However, the time and space complexity of branch and bound is still exponential in the number of variables. For the LCV implementation task, the optimization problem is extremely large (e.g., with \(\sim\) \(10^6\) variables), which challenges even the best solvers. For example, a server with 64 2.2 GHz CPU cores and 1 TB of RAM is unable to handle an IP problem with 6 \(\times \ 10^6\) variables due to insufficient memory; even for a problem with 3 \(\times \ 10^4\) variables, it took more than one day to reach a solution with satisfactory error. This dilemma has become a severe bottleneck in the overall test-chip design process, especially for fabless companies and foundries that require fast yield ramping (e.g., a foundry may need to fabricate a new product each month). Therefore, it is crucial to develop a more efficient solver with both reduced runtime and reduced compute resources.

Towards this goal, a methodology called IPSA (Integer Programming via Sparse Approximation) is proposed in Section 4 to find a near-optimal solution for the LCV implementation problem that requires shorter runtime and less memory than the original formulation. Our work is motivated by the observation that the original IP problem can be transformed into a sparse-regression problem without the integer constraint followed by a subsequent rounding process. Compared to directly solving the original IP problem, IPSA results in similar distribution-matching error, but requires far less time and memory due to the sparse restriction.
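To make the sparse-approximation idea concrete, a minimal NumPy sketch of a greedy forward-selection variant is shown below. All function and variable names are our own, the constraints \(\mathbf {D}{{\bf x}}\succeq \mathbf {d}\) are reduced to simple non-negativity, and the toy matrix stands in for a real FUB library; this illustrates the general strategy, not the paper's implementation.

```python
import numpy as np

def greedy_sparse_round(A, b, k):
    """Greedy forward selection: repeatedly add the column of A most
    correlated with the residual, re-fit least squares on the support,
    then round the coefficients to non-negative integer counts."""
    n = A.shape[1]
    support, coef = [], np.zeros(0)
    residual = b.astype(float).copy()
    for _ in range(k):
        scores = (A.T @ residual) / np.linalg.norm(A, axis=0)
        j = int(np.argmax(scores))
        if scores[j] <= 1e-12:
            break                      # residual is (nearly) explained
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        residual = b - A[:, support] @ coef
    x = np.zeros(n)
    x[support] = coef
    return np.clip(np.rint(x), 0, None)

# Toy library: 4 cell types (rows), 6 FUB implementations (columns).
A = np.array([[1., 0., 0., 0., 1., 0.],
              [0., 1., 0., 0., 1., 1.],
              [0., 0., 1., 0., 0., 1.],
              [0., 0., 0., 1., 0., 0.]])
b = np.array([3., 3., 0., 2.])         # target cell distribution
x = greedy_sparse_round(A, b, k=3)
print(x)                               # sparse: only two FUBs selected
```

Because each iteration touches only the selected columns, the per-step cost stays small even when \(\mathbf {A}\) has millions of columns, which is the source of the runtime and memory savings.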

The contributions of this work include:

Development of two random forest (RF) classifiers to predict (i) whether a synthesis configuration will result in a unique FUB implementation and (ii) whether a FUB implementation has acceptable testability, to avoid unnecessary synthesis execution and testability analysis.

Deployment of two strategies to efficiently solve the transformed sparse regression problem, namely: (i) greedy forward step-wise regression and (ii) L\(_1\)-regularization. Analyses of error bounds and computational cost are provided, verifying the correctness and efficiency of the two strategies.

Development of a new CM-LCV design flow that has higher efficiency in all stages.

The rest of the article is organized as follows: Section 2 provides the relevant background for CM-LCV design, including details of FUB library creation and the integer programming problem involved. Section 3 describes the ML methodology for predicting FUB creation outcomes. Section 4 describes two strategies to accelerate the solving of the integer programming problem. Experimental results assessing and validating the methodology are presented in Section 5. The final section summarizes the article.


2 LCV DESIGN FLOW

In this section, we introduce a typical CM-LCV design flow and point out the two difficult steps that this article aims to accelerate. We describe the details of synthesis configuration and why it is hard to generate a proper FUB library. The mathematical formulation of matching a desired cell distribution is also discussed.

Figure 5 illustrates the CM-LCV design flow described in References [15, 16, 17], in which CM-LCVs are designed to match standard-cell distributions [17], mimic the neighborhoods of standard cells [16], and incorporate BEOL layout geometries of interest [15]. The design flow begins with the standard-cell library for a given technology, as shown in Step 1 of Figure 5. A typical standard-cell library contains numerous logic functions. Based on the logic library and a FUB function, Step 2 of Figure 5 generates a large variety of unique FUB implementations using the logic functions extracted from the library. This step is achieved through millions of synthesis runs with different configurations, retaining the unique implementations that form the FUB library. Each FUB is analyzed to determine its physical features (PF) and testability characteristics (TB), resulting in a set of profiles (Step 3 of Figure 5). Steps 2 and 3 can be viewed as one large step that creates a proper FUB library.

Fig. 5.

Fig. 5. The CM-LCV design flow; steps in dashed box indicate bottlenecks that are optimized in this work.

The design objectives of the CM-LCV usually include testability requirements (e.g., IP fault coverage) and target physical features (e.g., standard-cell distribution). The latter can be taken from representative industrial designs (Step 4 of Figure 5) or directly specified by the designers (Step 5 of Figure 5). Given the design requirements and the FUB library along with its profiles, the ultimate goal of the design flow is to identify a subset of FUB implementations that satisfies the design requirements. Identifying such a subset can be achieved by solving an optimization problem, as shown in Step 6 of Figure 5. The solution indicates which implementations, and how many instances of each, should be included to form what we call the FUB template. For example, the final FUB template in Figure 5 consists of one instance of FUB\(_1\), one instance of FUB\(_2\), and two instances of FUB\(_4\).

In spite of the numerous steps in the design flow, the main bottleneck lies in the two steps indicated by the dashed boxes, namely, the FUB creation step and the integer programming step. Our work aims to accelerate exactly these two bottleneck steps.

The reason that FUB library generation is so time-consuming is that millions of synthesis runs have to be executed with different configurations to generate a sufficient number of unique FUB implementations. Figure 6 illustrates an example of n synthesis configurations, with each configuration represented by a row vector. A synthesis configuration includes two parts: (i) a binary vector that constrains which logic gates can be used in the resulting implementation, shown in blue, and (ii) a goal performance metric indicator, shown in orange. In the first part, a "1" indicates that a gate is allowed and a "0" that it is not. The length of the binary vector is \(K-2\), where K is the number of logic functions in the logic library, because primitive logic functions such as the 2-input NOR and the inverter are always required by the synthesis tool [14]. The second part, the goal performance metric indicator, is a one-digit number indicating which performance goal synthesis should prioritize. Three goals are used in References [15, 16, 17]: minimal area, minimal delay, and balanced delay and area, indicated by "1" to "3," respectively. For example, the vector \(\mathbf {x}^{(1)}\) in Figure 6 represents a synthesis configuration that is expected to generate a FUB implementation using only the 2-input AND, 2-input NOR, and inverter logic functions, while minimizing the overall circuit area. Note that any change to the vector or the performance goal leads to a new synthesis configuration but does not necessarily result in a unique FUB implementation, because synthesis does not necessarily use all the specified logic functions.

Fig. 6.

Fig. 6. Illustration of n synthesis configurations; each one of them is represented by a vector.

Based on the aforementioned, the number of possible synthesis configurations for a FUB function is estimated to be: (1) \(\begin{equation} \begin{aligned}D = (2^{(K-2)} -1)\times g, \end{aligned} \end{equation}\) where K is the number of logic functions in the logic library, and g is the number of performance metrics. The space for all the possible configurations is extremely large, and thus too time-consuming to exhaustively explore. For example, the standard-cell library used in the example of Figure 3 has 58 logic functions, which translates to \(2.9\times 10^{17}\) synthesis configurations when there are \(g=3\) possible performance metrics. Without ML, an extensive amount of synthesis is required to generate a sufficient number of unique FUB implementations. If instead we can predict which configurations will lead to unique implementations before synthesis, useless synthesis runs can be avoided to save significant compute time.
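For intuition, Equation (1) is easy to evaluate directly; the sketch below uses a small hypothetical library rather than the 58-function library of Figure 3.

```python
def num_configurations(K, g):
    """Size of the synthesis-configuration space, per Equation (1):
    every non-empty subset of the K - 2 optional logic functions
    (the 2-input NOR and inverter are always present), times g
    performance goals."""
    return (2 ** (K - 2) - 1) * g

# A small hypothetical library: 10 logic functions, 3 performance goals.
print(num_configurations(10, 3))  # (2^8 - 1) * 3 = 765
```

Even this toy library yields hundreds of configurations; the count doubles with every additional optional logic function, which is why exhaustive exploration is hopeless at realistic library sizes.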

However, predicting the outcome of synthesis is not trivial, especially given the high dimensionality of a configuration vector. In other words, it is difficult to uncover the correlation between high-dimensional features and the objective, either from experience or through simple model fitting (e.g., polynomials). The same challenge exists for predicting the testability of FUB implementations to accelerate FUB analysis. Therefore, we use ML techniques to learn the complex correlations within the high-dimensional space. This part of the work is described in Section 3.

After FUB creation, another difficult step is to find a proper combination of FUB implementations to match a desired cell distribution (Step 7 of Figure 5). The FUB library produced by the previous step is similar to a box of Legos, with each brick being a specific logic implementation. The goal of the LCV implementation task then becomes selecting a suitable set of Legos to mimic a target (i.e., a standard-cell distribution), which can be formulated as the following optimization problem:

(2) \(\begin{align} \begin{split} \min _{{{\bf x}}} \:& ||{\mathbf {A}}{{\bf x}}-{{\bf b}}||_2^2 \\ \text{s.t. } & \mathbf {D}{{\bf x}}\succeq \mathbf {d}, \; {{\bf x}}\in \mathbb {Z}^n, \end{split} \end{align}\)
where \(\mathbb {Z}\) denotes the integer set, \({\mathbf {A}}\in \mathbb {R}_{+}^{p \times n}\) and \(\mathbf {D}\in \mathbb {R}_{+}^{m \times n}\) are a \(p \times n\) and an \(m \times n\) non-negative matrix, respectively, and \({{\bf b}}\in \mathbb {R}_{+}^p\) and \(\mathbf {d} \in \mathbb {R}_{+}^m\) are non-negative vectors of p and m dimensions, respectively. The symbol \(\succeq\) denotes the element-wise greater or equal.

As illustrated in Figure 7(b), each column of \(\mathbf {A}\) represents a FUB implementation and each row represents a specific type of standard cell. Thus, each entry in \(\mathbf {A}\) indicates the number of cell instances of the corresponding type in the corresponding FUB implementation. The vector \(\mathbf {x}\) represents the numbers of FUB instances selected to form the LCV. The target distribution is denoted by \(\mathbf {b}\), containing the required number of each standard cell. Mathematically, the columns of \(\mathbf {A}\) can also be called the "basis," and the goal is to decompose the target vector \({{\bf b}}\) using the basis with \({{\bf x}}\) being the corresponding coefficients. In Equation (2), the L\(_2\)-norm of the difference between the actual and target histograms is used as the minimization objective. \({{\bf x}}\in \mathbb {Z}^n\) constrains the number of selected instances of a given FUB to be an integer. Other constraints on \(\mathbf {x}\) are included in \(\mathbf {D}{{\bf x}}\succeq \mathbf {d}\). For example, one basic constraint is that the derived integers in \(\mathbf {x}\) must be non-negative, which can be formulated by setting \(\mathbf {D}=\mathbf {I}\) and \(\mathbf {d}=\mathbf {0}\); if, however, the practitioner requires at least a minimum number of instances for a set of standard cells, then the constraint can be reformulated by setting (3) \(\begin{align} \mathbf {D}=\begin{bmatrix}\mathbf {I} \\ \mathbf {A} \end{bmatrix}, \mathbf {d}=\begin{bmatrix}\mathbf {\mathbf {0}} \\ \mathbf {c} \end{bmatrix} , \end{align}\) where \(\mathbf {c}\) has the same size as \(\mathbf {b}\), with each entry representing the required minimum number of instances for a given standard cell. The number of columns in \(\mathbf {A}\) is significantly larger than its row count, since a more diverse FUB library provides more "Lego"s to choose from and, hence, is more flexible for matching a target distribution. This implies that the linear system \(\mathbf {Ax}=\mathbf {b}\) is highly under-determined, i.e., the number of variables is much larger than the number of equations.
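The stacked constraint blocks of Equation (3) can be assembled mechanically; the NumPy sketch below uses toy sizes and illustrative numbers, not a real FUB library.

```python
import numpy as np

# Toy instance of Equation (3): 3 standard-cell types, 4 FUB implementations.
A = np.array([[2., 1., 0., 3.],
              [1., 0., 2., 1.],
              [0., 4., 1., 0.]])
n = A.shape[1]
c = np.array([5., 3., 2.])      # minimum instances required per cell type

# D x >= d encodes both x >= 0 (identity block) and A x >= c.
D = np.vstack([np.eye(n), A])
d = np.concatenate([np.zeros(n), c])

x = np.array([1., 1., 1., 1.])  # a candidate FUB-instance vector
feasible = bool(np.all(D @ x >= d))
print(feasible)
```

Here one instance of each FUB already supplies at least the required minimum of every cell type, so the candidate is feasible.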

Fig. 7.

Fig. 7. Optimization formulation for test-chip (LCV) implementation: (a) a structural illustration of the LCV under design and (b) mathematical formulation of the matching objective. The goal of test-chip implementation is to select a suitable combination of implementations from a FUB library described by matrix \(\mathbf {A}\), so the overall inclusion of standard cells within the test chip mimics the target distribution described by vector \(\mathbf {b}\). The FUB implementations denoted by the colored columns are selected and their corresponding counts can be found in vector \(\mathbf {x}\), which is the objective of the solver.

Figure 8 gives a two-dimensional geometric explanation of the optimization problem described in Equation (2). The points on the same contour have the same value of the objective function in Equation (2). The unshaded area is the feasible region defined by the linear constraints \(\mathbf {D}{{\bf x}}\succeq \mathbf {d}\). Because only an integer solution is allowed due to \({{\bf x}}\in \mathbb {Z}^n\), only the points on the grid nodes make up the feasible solution set. Essentially, solving the optimization problem equates to identifying a point at one of the unshaded grid nodes (denoted by dark-blue dots) that attains the smallest objective value.

Fig. 8.

Fig. 8. A two-dimensional geometric explanation for the optimization problem in Equation (2). The points on the same contour have the same value for the objective function. By solving the original IP problem in Equation (2), we are trying to find a point on the unshaded grid nodes that has the smallest objective value. Also, the solution derived from the relax-round strategy (solving the problem in Equation (4) and then rounding) can be far from the true optimal integer solution.

Solving the IP problem in Equation (2) is, unfortunately, NP-hard [21]. The most direct and naive workaround is a relax-round strategy that first solves a relaxed version of the problem, obtained by eliminating the integer constraint: (4) \(\begin{align} \begin{split} \min _{{{\bf x}}} \:& ||{\mathbf {A}}{{\bf x}}-{{\bf b}}||_2^2 \\ \text{s.t. } & \mathbf {D}{{\bf x}}\succeq \mathbf {d} \end{split} \end{align}\) and then rounds the real-valued solution \({{\bf x}}_\text{real}\) to the final integer solution \({{\bf x}}_{\text{int}}\) via a rounding function \(f_{\text{round}}(\cdot)\): (5) \(\begin{align} {{\bf x}}_{\text{int}} = \: f_{\text{round}}({{\bf x}}_{\text{real}}). \end{align}\)
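A minimal NumPy sketch of this relax-round strategy on a tiny two-variable instance (our own toy numbers; a real implementation would use a constrained solver for \(\mathbf {D}{{\bf x}}\succeq \mathbf {d}\), while here only non-negativity is enforced by clipping):

```python
import numpy as np

def relax_round(A, b):
    """Relax-round: solve the least-squares relaxation of Equation (2)
    without the integer constraint, then map the result to the nearest
    non-negative integer grid node."""
    x_real, *_ = np.linalg.lstsq(A, b, rcond=None)
    x_int = np.clip(np.rint(x_real), 0, None)      # f_round
    return x_int, float(np.sum((A @ x_int - b) ** 2))

A = np.array([[2., 1.],
              [1., 3.]])
b = np.array([3.2, 4.9])
x_int, err = relax_round(A, b)
print(x_int, err)    # relaxed optimum (0.94, 1.32) rounds to (1, 1)
```

In this small instance the rounded point happens to be good, but as the next paragraphs explain, rounding gives no such guarantee in general.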

The function \(f_{\text{round}}(\cdot)\) maps the relaxed solution to the nearest integer grid node. However, the solution resulting from this strategy can be far from optimal, because the nearest mapped integer point can have a less-desirable objective value than the optimal solution, depending on the shape of the objective function. Figure 8 shows an example of such a case, where the mapped point \(\text{P}_{\text{int}}\) (denoted by the blue triangle) is not the optimal integer solution \(\text{P}_{\text{int}}^*\) (denoted by the yellow square). Their difference in objective value can grow as the dimension of the solution space increases.

Most commercial tools (e.g., References [1, 23]), however, are based on the more accurate branch-and-bound algorithm [9, 19, 20], which targets the IP problem in Equation (2) directly. The branch-and-bound algorithm starts with a feasible point \({{\bf x}}^{(0)}\) (i.e., a point that satisfies all the constraints). The corresponding objective function value \(f^{(0)}\) of the best solution so far serves as the upper bound. The n dimensions of \({{\bf x}}\) define a search space over n variables, which is iteratively divided to search for an optimal solution. In each iteration, one variable \(x_j\) (\(j \in \lbrace 1,2, \ldots ,n\rbrace\)) is selected and two "branches" of sub-problems are formed. The two sub-problems have the new constraints \(x_j\le x_j^{(0)}\) and \(x_j\)\(\gt\)\(x_j^{(0)}\), respectively, and a lower bound for each sub-problem is calculated. If the lower bound of a sub-problem is larger than the current upper bound, then the sub-space it defines can be safely discarded, as the optimal value cannot lie in this sub-space. By repeating this process, a search tree keeps growing and being pruned until the optimal leaf is reached (i.e., the subspace contains only one candidate and cannot be further divided). Given sufficient time and memory, the algorithm is guaranteed to find the globally optimal solution. However, this tree-based search has time and memory complexity exponential in the variable dimension n and can easily become impractical for large-scale problems.
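The branch-and-bound loop described above can be sketched compactly. The code below is a simplified illustration with our own naming, using SciPy's bounded least-squares solver for each node's relaxation; production IP solvers add far more sophisticated bounding, branching, and presolve heuristics.

```python
import numpy as np
from scipy.optimize import lsq_linear

def _relax(A, b, lo, up):
    """Box-constrained least-squares relaxation of one node; variables
    with lo == up are fixed first (lsq_linear needs strict lo < up)."""
    x = lo.copy()
    free = lo < up
    if np.any(free):
        res = lsq_linear(A[:, free], b - A[:, ~free] @ x[~free],
                         bounds=(lo[free], up[free]))
        x[free] = res.x
    return x, float(np.sum((A @ x - b) ** 2))

def branch_and_bound(A, b, lb, ub, tol=1e-6):
    """Minimize ||Ax - b||^2 over integer x with lb <= x <= ub."""
    best_x, best_f = None, np.inf
    stack = [(lb.astype(float), ub.astype(float))]
    while stack:
        lo, up = stack.pop()
        x, f_relax = _relax(A, b, lo, up)
        if f_relax >= best_f:
            continue                      # prune: cannot beat incumbent
        frac = np.abs(x - np.rint(x))
        j = int(np.argmax(frac))
        if frac[j] < tol:                 # relaxation already integral
            xi = np.rint(x)
            fi = float(np.sum((A @ xi - b) ** 2))
            if fi < best_f:
                best_x, best_f = xi, fi
            continue
        # Branch on the most fractional variable: x_j <= floor(x_j)
        # in one child, x_j >= floor(x_j) + 1 in the other.
        up1 = up.copy(); up1[j] = np.floor(x[j])
        lo2 = lo.copy(); lo2[j] = np.floor(x[j]) + 1.0
        if lo[j] <= up1[j]:
            stack.append((lo.copy(), up1))
        if lo2[j] <= up[j]:
            stack.append((lo2, up.copy()))
    return best_x, best_f

A = np.array([[2., 1.], [1., 3.]])
b = np.array([3.2, 4.9])
x, f = branch_and_bound(A, b, np.zeros(2), np.full(2, 10.0))
print(x, f)
```

On this two-variable toy the tree stays tiny, but the number of nodes can grow exponentially with n, which is exactly the scaling problem described above.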

Due to the high complexity of branch-and-bound, other works have instead searched for a sub-optimal solution with lower complexity. For example, the authors of Reference [21] proposed a method based on a semidefinite programming (SDP) relaxation of the original IP problem. A randomized algorithm is then applied to find a feasible solution. This algorithm has an overall complexity of \(O(n^3)\), which is much lower than that of branch-and-bound. However, it assumes that the integer programming problem has no other constraints, which is not the case in our application. Adding the inequality constraints of Equation (2) to the SDP problem leads to long runtimes in commercial tools, which is impractical given the scale of our problem.

To accelerate this integer programming step, two strategies are described in Section 4 to solve the problem in an efficient way.


3 MACHINE LEARNING-ASSISTED FUB LIBRARY CREATION

In this section, we describe the details of the proposed methodology for efficient FUB library creation with ML. We first introduce the new design flow and formulate the corresponding mathematical problems. Then, we describe the deployed ML algorithm and the features used for learning. Finally, we illustrate an online learning strategy used to train the ML model starting from limited labeled data.

3.1 Design Flow and Problem Formulation

Figure 9 shows the updated steps (Steps 2 to 4 of Figure 5). The flow charts in blue illustrate the proposed steps to accelerate FUB creation. Instead of feeding configurations randomly into the synthesis tool or feeding the entire FUB library into the ATPG tool, the two loops in blue work as two filters that (i) only select configurations predicted to yield unique implementations and (ii) only select FUB implementations predicted to have high testability. The two filters are realized by two classifiers \(C_1\) and \(C_2\).

Fig. 9.

Fig. 9. The use of RF models in the original CM-LCV FUB creation flow: a classifier \(C_1\) is trained using features derived from a synthesis configuration to predict synthesis outcome; and a classifier \(C_2\) is trained using features derived from a unique implementation to predict its testability.

Corresponding to the two objectives (i.e., predicting uniqueness of the synthesis outcome, and FUB testability), the ML-assisted flow here aims to solve the following two sub-problems:

Problem 1: uniqueness prediction Suppose we have n possible synthesis configurations, from each of which d features are extracted. These data constitute the test set, which can be represented as an \(n\times d\) matrix \(\mathbf {X}= [\mathbf {x}^{(1)}; \mathbf {x}^{(2)}; \ldots ; \mathbf {x}^{(n)}]\). Each row \(\mathbf {x}^{(i)}= [ x_1^{(i)}, x_2^{(i)}, \ldots , x_d^{(i)}]\) is a d-dimensional vector containing the d extracted features of the ith configuration. The objective of this sub-problem is to train a classification model \(C_1\), which takes in each test sample \(\mathbf {x}^{(i)}\) and generates a label \(y^{(i)}\) representing the uniqueness of the corresponding synthesized FUB implementation. \(y^{(i)}\) is a binary variable, which equals one when the implementation resulting from the ith synthesis configuration is unique (i.e., distinguishable from any existing implementation within the FUB library) and equals zero otherwise. For an optimal \(C_1\), the predicted labels \(\mathbf {y}= [ y^{(1)}, y^{(2)}, \ldots , y^{(n)}]\) should be as close to the real labels as possible.

Problem 2: testability prediction After synthesis, suppose there are m unique implementations selected into the FUB library. f features are extracted from each implementation, which form the test set. The test set can be represented as an \(m\times f\) matrix \(\mathbf {X}=[\mathbf {x}^{(1)}; \mathbf {x}^{(2)}; \ldots ; \mathbf {x}^{(m)}]\), where each row \(\mathbf {x}^{(i)}= [ x_1^{(i)}, x_2^{(i)}, \ldots , x_f^{(i)}]\) is an f-dimensional vector containing the f extracted features of the ith FUB implementation. Given a threshold for acceptable testability, the objective of this sub-problem is to train a classification model \(C_2\), which takes in each test sample \(\mathbf {x}^{(i)}\) and generates a label \(y^{(i)}\) representing whether the testability of the FUB implementation is acceptable. \(y^{(i)}\) is again a binary variable that equals one when the testability of the ith FUB implementation is larger than the pre-defined threshold, and equals zero if not. For an optimal \(C_2\), the predicted labels \(\mathbf {y}= [ y^{(1)}, y^{(2)}, \ldots , y^{(m)}]\) should be as close to the real labels as possible.
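Both sub-problems fit the standard supervised-classification mold. A toy sketch of training one such classifier with scikit-learn is shown below, using synthetic stand-in data and an invented labeling rule, since the real labels come from actual synthesis runs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for C1's data: n configurations, each a vector of
# binary gate-selection flags plus a performance-goal digit.
n, d = 500, 20
X = rng.integers(0, 2, size=(n, d)).astype(float)
X[:, -1] = rng.integers(1, 4, size=n)            # goal in {1, 2, 3}

# Invented label rule standing in for "synthesis produced a unique FUB".
y = ((X[:, 0] == 1) & (X[:, 3] == 0)).astype(int)

C1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
train_acc = (C1.predict(X) == y).mean()
print(train_acc)
```

A \(C_2\) model would be trained the same way on the \(K+3\) structural features, with labels obtained by thresholding measured fault coverage.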

3.2 Feature Selection

To achieve optimal classification performance, the features should be carefully selected to best represent the raw data and also incorporate helpful domain knowledge.

The feature selection for classifier \(C_1\) is straightforward. Since a compacted vector shown in Figure 6 already includes all the information of a synthesis configuration, it can be directly used as the feature vector for each configuration. To be specific, given a logic library of K logic functions, \(K-1\) features are extracted from each configuration, where the first \(K-2\) features are binary numbers representing whether a certain logic function (excluding NOR2 and inverter) in the logic library is allowed for synthesis, and the last feature is a one-digit variable that captures the performance goal for synthesis (as explained in Section 2).
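The encoding described above can be sketched as follows; the toy logic library, function names, and performance-goal value are illustrative assumptions, not the libraries used in the paper:

```python
# Hypothetical sketch of the (K-1)-dimensional feature vector for classifier C1.
# LOGIC_LIB and the performance-goal encoding are made-up stand-ins.
LOGIC_LIB = ["INV", "NOR2", "NAND2", "AOI21", "OAI21", "XOR2"]  # K = 6 functions
ALWAYS_ON = {"INV", "NOR2"}  # inverter and NOR2 are always allowed, so not encoded

def encode_configuration(allowed_functions, performance_goal):
    """Encode a synthesis configuration as K-2 binary flags plus one goal digit."""
    optional = [f for f in LOGIC_LIB if f not in ALWAYS_ON]  # the K-2 optional functions
    flags = [1 if f in allowed_functions else 0 for f in optional]
    return flags + [performance_goal]  # total length K-1

# A configuration allowing NAND2 and XOR2, with performance goal "2"
x = encode_configuration({"NAND2", "XOR2"}, performance_goal=2)
```

With K = 6, the resulting vector has 5 entries: four binary flags followed by the goal digit.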

For classifier \(C_2\), the input raw data are implementations in the form of gate-level netlists. Unlike \(C_1\)’s case, where the synthesis configuration can be directly converted into a vector, for \(C_2\) the features must be manually designed with the cost of extraction kept in mind. Circuit testability largely depends on topology, so we extract \(K+3\) features to represent the topology and connections. Figure 10 illustrates an example of the \(K+3\) features for m implementations. The first K features capture the logic functions utilized in a FUB implementation. For example, the feature vector \(\mathbf {x}^{(1)}\) shown in Figure 10 is for a FUB with 12 inverters, two 2-input NOR gates, and so on. In addition, we include three more features that capture circuit structure, namely, the number of nets, the number of fanouts, and the maximum logic depth. Other testability-related features, such as the number of re-convergent fanouts, are excluded because their extraction is too expensive: the time required to extract them exceeds the time needed for ATPG itself, which defeats the purpose of prediction.

Fig. 10.

Fig. 10. Illustration of selected features for m unique implementations for classifier \(C_2\) .
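A minimal sketch of extracting the \(K+3\) features from a toy netlist follows. The netlist representation (gate-to-fanin mapping) and the reading of "number of fanouts" as the count of multi-fanout gates are assumptions for illustration; the paper does not specify the netlist data structure:

```python
# Illustrative extraction of the K+3 features for classifier C2 from a toy
# gate-level netlist represented as a DAG (gate -> list of fanin gates).
from collections import Counter

def extract_features(gates, gate_types, fanins, num_nets, logic_lib):
    """gates: gate names; gate_types: gate -> logic function;
    fanins: gate -> driver gates; num_nets: net count of the netlist."""
    counts = Counter(gate_types[g] for g in gates)
    type_features = [counts.get(f, 0) for f in logic_lib]        # first K features
    fanout = Counter(d for g in gates for d in fanins[g])        # driver -> load count
    num_fanouts = sum(1 for g in gates if fanout.get(g, 0) > 1)  # multi-fanout gates
    depth = {}
    def logic_depth(g):  # longest fanin path, memoized
        if g not in depth:
            depth[g] = 1 + max((logic_depth(d) for d in fanins[g]), default=0)
        return depth[g]
    max_depth = max(logic_depth(g) for g in gates)
    return type_features + [num_nets, num_fanouts, max_depth]

gates = ["g1", "g2", "g3"]
types = {"g1": "INV", "g2": "NOR2", "g3": "INV"}
fanins = {"g1": [], "g2": ["g1"], "g3": ["g1"]}  # g1 drives both g2 and g3
feat = extract_features(gates, types, fanins, num_nets=4, logic_lib=["INV", "NOR2", "NAND2"])
```

All quantities here are linear-time counts over the netlist, consistent with the requirement that extraction stay far cheaper than ATPG.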

3.3 Classification Algorithm

Among the variety of available ML algorithms, the RF has become popular because of its good performance and ease of implementation. In addition, an RF is capable of classifying nonlinearly separable data with a short learning time. Given these advantages, we use RF models to construct predictors \(C_1\) and \(C_2\).

Recently, other ML techniques such as graph neural networks [18] have been used to extract circuit features and make predictions. Although successful in certain tasks, graph neural networks are much harder to deploy: in most cases, high-performance GPUs are required, and the training time is usually long. In contrast, an RF is much easier to implement, with limited requirements on hardware and training time. Therefore, RF is chosen as the classification method in this work. The experiments in Section 5 show that RF performs sufficiently well and saves a large amount of synthesis time.

An RF is an ensemble method based on decision trees. A decision tree learns a tree-structured model from the training samples, with each leaf representing a classification result. For each internal node of the tree, one feature is selected as the optimal split criterion at the current level. While a decision tree is easy to implement and interpret, it is not robust: a small change in the training samples can result in a totally different tree. A single decision tree is also prone to over-fitting the training set. In addition to a binary label of either “0” or “1,” a decision tree can also provide a probabilistic label, i.e., a probability for the label to be “1.” This probability is calculated as the ratio of label-1 training instances within the leaf node.

An RF overcomes the disadvantages with ensemble learning [7]. An RF is an ensemble of decision trees with two degrees of randomness. First, the training samples for each tree are generated from bootstrap sampling (random sampling with replacement) of the entire training set. In this way, each tree has a different set of training data, although drawn from the same distribution. Second, when searching for the optimal split at each node, only a subset of all features is selected. A random forest model performs final classification by taking a majority vote over the trees. From these two degrees of randomness, an RF model achieves much lower variance than a single decision tree, at the expense of slightly increasing the fitting bias. Note that an RF can also produce a probabilistic label, which is the average of the probabilistic labels predicted by all the decision trees.
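The hard and probabilistic labels described above can be obtained directly from scikit-learn (the package used in the paper's experiments); the data below is synthetic and the labeling rule is a made-up stand-in for uniqueness:

```python
# Minimal sketch of an RF producing both hard and probabilistic labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 8)).astype(float)  # toy binary feature vectors
y = (X[:, 0] + X[:, 3] > 1).astype(int)              # toy "uniqueness" labels

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
hard = rf.predict(X[:5])               # majority vote over the trees
prob = rf.predict_proba(X[:5])[:, 1]   # P(label = 1), averaged over the trees
```

`predict` implements the majority vote, while `predict_proba` averages the per-tree leaf probabilities, matching the probabilistic label described in the text.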

Another advantage of RF is that we can control overfitting by tuning hyper-parameters. We reduce the risk of overfitting by controlling the complexity of the prediction model via cross validation. Specifically, optimal values of hyper-parameters such as the “number of trees,” “maximum number of features adopted by each tree,” and “minimum number of samples required to be at a leaf node” are selected, which minimize the validation error. This practice avoids over-complex model architecture options (e.g., each tree becomes too deep) that overfit the training data, because with these options the validation error would be higher. Also, we believe that in this process, feature selection is implicitly applied. Because when model complexity (e.g., tree depth) is restricted, only the most important features will be used for tree construction.
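The cross-validated selection of the three named hyper-parameters might look as follows; the grid values and synthetic data are illustrative choices, not the paper's settings:

```python
# Sketch of hyper-parameter selection by cross-validation for an RF,
# covering the three hyper-parameters named in the text.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # nonlinearly separable toy labels

grid = {
    "n_estimators": [25, 50],        # number of trees
    "max_features": [3, "sqrt"],     # max features considered per split
    "min_samples_leaf": [1, 5],      # min samples required at a leaf
}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3)
search.fit(X, y)
best = search.best_params_  # values minimizing the validation error
```

Restricting `max_features` and raising `min_samples_leaf` are exactly the complexity controls the text argues keep the trees from overfitting.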

Although we choose RF in this work because of all the previous reasons, it is not necessarily the perfect model for this application. The main goal of this work is not about finding a perfect ML model, but more about providing a framework and showing the potential of using ML in assisting test chip design. Practitioners can use RF as a baseline and try other ML models.

3.4 On-line Learning Strategy

An on-line learning strategy is developed specifically to tackle the insufficient-data problem faced in the training process of RFs. Since there is no quicker way to obtain the labels for the training set other than running synthesis/ATPG, we want to minimize the size of the training set to reduce cost. However, a small training set can easily lead to over-fitting of the RF models, which results in greater levels of misprediction.

On-line learning involves iteratively updating the ML model as more training data becomes available. It starts with an insufficient training set, so a model initially learned may be far from optimal. However, as more training data are gradually added, the updated model becomes more accurate compared to the previous versions. In our work, new training data stems from verification of the prediction results. For \(C_1\), after prediction, the configurations predicted with label “1” are synthesized to generate implementations. Such synthesis process provides not only the implementations for the next design stage, but also ground-truth labels after simple circuit structure comparison. Such data can augment the training set for ML model update. For \(C_2\), similarly, the implementations predicted with label “1” are analyzed for testability via ATPG. This process provides ground-truth labels while also verifying the predicted testability.

Two requirements are necessary for an efficient on-line learning process: (i) fast model training in each iteration and (ii) inexpensive creation of additional training data. Without either requirement, ML model update would be too costly. Fortunately, classifiers \(C_1\) and \(C_2\) satisfy both requirements. Specifically, training time for an RF is negligible (usually less than one minute) compared to other operations such as synthesis. In addition, only modest effort is needed to obtain the labels for new training data. For \(C_1\), the synthesis process cannot be skipped, since the resulting implementations are eventually required for creating the FUB library; the only additional effort is the circuit structure comparison that determines uniqueness. Similarly, for \(C_2\), the only additional effort is running ATPG to determine whether coverage exceeds the pre-defined threshold.

Algorithm 1 summarizes the on-line learning strategy for \(C_1\). Given a set of synthesis configurations with features extracted as \(\mathbf {X}\), the goal is to iteratively find labels for them until no more configurations with label “1” are found. This termination condition is because the primary goal of \(C_1\) is to find as many unique implementations as possible, i.e., to achieve high recall. The first part of Algorithm 1 is to set up the initial training and testing sets. We randomly select a small number of samples from \(\mathbf {X}\), denoted as \(\mathbf {X}_\mathrm{init}\) (line 1). Then, synthesis with configurations corresponding to \(\mathbf {X}_\mathrm{init}\) is executed to label \(\mathbf {X}_\mathrm{init}\) (line 2). The labels are denoted as \(\mathbf {Y}_\mathrm{init}\) and the labeling process is denoted as a function named “get_label_by_syn().” The features (i.e., \(\mathbf {X}_\mathrm{init}\)) and their labels (i.e., \(\mathbf {Y}_\mathrm{init}\)) constitute the initial training set \(\mathbf {D}_\mathrm{train}=\lbrace \mathbf {X}_\mathrm{init},\mathbf {Y}_\mathrm{init}\rbrace\) (line 3). The remaining unlabeled portion of \(\mathbf {X}\) forms the initial testing set \(\mathbf {X}_\mathrm{test}\) for \(C_1\) (line 4).

The second part of Algorithm 1 describes on-line learning, where \(C_1\) is updated once during every iteration of the while loop. The update procedure in each iteration is illustrated in Figure 11. First, the classifier \(C_1\) predicts a label for all the data samples in the testing set \(\mathbf {X}_\mathrm{test}\) that results in a vector containing the probabilistic labels \(\mathbf {P}_\mathrm{test}\) (line 6). Here, a probabilistic label refers to a real number \(p \in [0,1]\), indicating the probability for the sample to be labeled with “1.” Then, the samples \(\mathbf {X}_\mathrm{test}\) are sorted according to their probabilistic labels \(\mathbf {P}_\mathrm{test}\) from high to low. A fixed number of top-ranked testing samples are selected and denoted as \(\mathbf {X}_\mathrm{top}\) (line 7). \(\mathbf {X}_\mathrm{top}\) are those synthesis configurations believed to lead to the unique implementations, and the true labels are actually determined by performing synthesis (line 8).

Fig. 11.

Fig. 11. Illustration of one iteration of on-line learning for \(C_1\) . After training based on the updated training set of the previous iteration, \(C_1\) predicts a probability for each testing sample to have label “1.” The top-ranked samples are synthesized for true labels and the bottom-ranked ones are directly labeled as “0.” These two parts of data are added to the training set for next iteration, while the remaining mid-ranked samples are forwarded to the next iteration.

As a complement to \(\mathbf {X}_\mathrm{top}\), the bottom samples in \(\mathbf {X}_\mathrm{test}\) with p less than a threshold (denoted as \(\mathbf {X}_\mathrm{btm}\) in line 9) are each directly assigned the label “0.” When the threshold is very low (e.g., 0.01), directly assigning a label of “0” (denoted as “get_label_by_asg()” in line 10) is quite accurate. \(\mathbf {X}_\mathrm{btm}\) with the all-zero labels \(\mathbf {Y}_\mathrm{btm}\), together with \(\mathbf {X}_\mathrm{top}\) and the labels \(\mathbf {Y}_\mathrm{top}\), are added to the training set \(\mathbf {D}_\mathrm{train}\). Besides enlarging the training set \(\mathbf {D}_\mathrm{train}\), another reason that we include both \(\mathbf {X}_\mathrm{top}\) and \(\mathbf {X}_\mathrm{btm}\) is to ensure class balance in \(\mathbf {D}_\mathrm{train}\) (lines 11–12). Since the performance of \(C_1\) is initially better than flipping a coin and also improves through iterations, label-1 data will dominate in \(\mathbf {Y}_\mathrm{top}\). If only \(\mathbf {Y}_\mathrm{top}\) were added to \(\mathbf {D}_\mathrm{train}\), then it would become significantly imbalanced, which would eventually lead to increasing levels of misprediction by \(C_1\). Finally, the remaining data in \(\mathbf {X}_\mathrm{test}\) apart from \(\mathbf {X}_\mathrm{top}\) and \(\mathbf {X}_\mathrm{btm}\) are used as testing data for the next iteration (line 13). On-line learning continues in this way until few predictions in \(\mathbf {Y}_\mathrm{top}\) are “1.”
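One iteration of this loop can be sketched as below. The `oracle` function is a toy stand-in for synthesis plus uniqueness checking, and `top_k`/`p_low` are illustrative values, not the thresholds used in the paper:

```python
# Sketch of one on-line learning iteration for C1: train, rank by probabilistic
# label, verify the top, auto-label the bottom, and grow the training set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def online_iteration(rf, X_train, y_train, X_test, oracle, top_k=100, p_low=0.01):
    rf.fit(X_train, y_train)
    p = rf.predict_proba(X_test)[:, 1]               # probability of label "1"
    top = np.argsort(-p)[:top_k]                     # sent to "synthesis" (oracle)
    btm = np.setdiff1d(np.where(p < p_low)[0], top)  # directly labeled "0"
    y_top = oracle(X_test[top])                      # ground truth from synthesis
    X_new = np.vstack([X_train, X_test[top], X_test[btm]])
    y_new = np.concatenate([y_train, y_top, np.zeros(len(btm), dtype=int)])
    keep = np.setdiff1d(np.arange(len(X_test)), np.concatenate([top, btm]))
    return X_new, y_new, X_test[keep]                # mid-ranked samples carry over

oracle = lambda X: (X[:, 0] > 0.5).astype(int)  # purely illustrative labeler
rng = np.random.default_rng(0)
X = rng.random((500, 4))
X_tr, y_tr, X_rem = online_iteration(
    RandomForestClassifier(n_estimators=25, random_state=0),
    X[:50], oracle(X[:50]), X[50:], oracle)
```

Adding both the verified top samples and the auto-labeled bottom samples keeps the growing training set class-balanced, as the text argues.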

Similar to the classifier \(C_1\), on-line learning for classifier \(C_2\) can be performed. The on-line learning flow for \(C_2\) is almost identical to Algorithm 1. The major difference is that instead of synthesizing the configurations to obtain the ground truth for the selected synthesis configurations \(\lbrace \mathbf {X}_\mathrm{top}\rbrace\), ATPG tools are used to obtain the ground truth by characterizing the testability for the selected FUB implementations.

Skip 4SMART INTEGER PROGRAMMING FOR FUB SELECTION Section

4 SMART INTEGER PROGRAMMING FOR FUB SELECTION

As discussed in Section 2, after FUB creation, the next step is to solve the integer programming problem in Equation (2), but existing methods are not satisfactory. The relax-round strategy described in Equations (4)–(5) usually cannot provide a sufficiently accurate solution, although the relaxed problem in Equation (4) can be solved by convex solvers (e.g., Reference [10]) in limited compute time compared to branch-and-bound. The method we developed, IPSA, is based on error analysis of the relax-round solution: by transforming Equation (4) into a sparse-regression problem, we can exploit the fast solving speed of the relax-round strategy while decreasing the error to a level comparable to the optimal integer programming solution.

In this section, we first introduce two sources of errors of the relax-round strategy, namely, fitting error and rounding error, and then demonstrate their relationship to the sparsity of the solution. For the transformed sparse-regression problem, we further propose two solving strategies—forward step-wise regression and L\(_1\)-regularization. Finally, detailed analysis concerning the advantages and limitations of the two strategies, as well as their error bound and computation cost, will be provided to verify their correctness and efficiency.

From a statistical view, we assume that there is an underlying optimal solution \({{\bf x}}^*\) such that \({{\bf b}}= {\mathbf {A}}{{\bf x}}^* + \mathbf {\epsilon }\), where \(\mathbf {\epsilon } \in \mathbb {R}^p\) is the Gaussian noise added to each dimension of \({{\bf b}}\), i.e., \(\epsilon _j \sim N(0,\sigma ^2), j = 1,2, \ldots ,p\). For any given real-number vector \({{\bf x}}\), the difference after applying the rounding function \(f_{\text{round}}\) on it is denoted as \(\Delta {{\bf x}}\). For the sake of simplicity, we assume a stochastic rounding strategy2 so both \(\Delta {{\bf x}}\) and \(\mathbf {\epsilon }\) are random variables. The expected error of the rounded solution can be expressed as: (6) \(\begin{align} & {\mathbb {E}}\,\left\Vert {\mathbf {A}}\cdot f_\text{round}({{\bf x}})-{{\bf b}}\right\Vert _2^2 = {\mathbb {E}}_\epsilon {\mathbb {E}}_{\Delta {{\bf x}}}\,\left\Vert {\mathbf {A}}({{\bf x}}+\Delta {{\bf x}})-{{\bf b}}\right\Vert _2^2 \nonumber \nonumber \nonumber\\ = & {\mathbb {E}}_\epsilon {\mathbb {E}}_{\Delta {{\bf x}}} \, \big [\Vert {\mathbf {A}}{{\bf x}}-{{\bf b}}\Vert _2^2 + 2({\mathbf {A}}{{\bf x}}- {{\bf b}})^T{\mathbf {A}}\Delta {{\bf x}}+ \Vert {\mathbf {A}}\Delta {{\bf x}}\Vert _2^2\big ]. \end{align}\)

The expectation of the crossing term in Equation (6) can be re-written as:

(7) \(\begin{align} {\mathbb {E}}_\epsilon {\mathbb {E}}_{\Delta {{\bf x}}} \left[ 2({\mathbf {A}}{{\bf x}}- {{\bf b}})^T{\mathbf {A}}\Delta {{\bf x}}\right] = {\mathbb {E}}_\epsilon \left[ 2({\mathbf {A}}{{\bf x}}- {{\bf b}})^T{\mathbf {A}}\, {\mathbb {E}}_{\Delta {{\bf x}}} [\Delta {{\bf x}}\vert {{\bf x}}] \right] = 0, \end{align}\)
where we use the fact that \({\mathbb {E}}_{\Delta {{\bf x}}} [\Delta {{\bf x}}\vert {{\bf x}}]=0\) as implied by the stochastic rounding assumption. Substituting Equation (7) into Equation (6) yields:
(8) \(\begin{align} {\mathbb {E}}\,\left\Vert {\mathbf {A}}\cdot f_\text{round}({{\bf x}})-{{\bf b}}\right\Vert _2^2 = {\mathbb {E}}_\epsilon \Vert {\mathbf {A}}{{\bf x}}-{{\bf b}}\Vert _2^2 + {\mathbb {E}}_{\Delta {{\bf x}}} \Vert {\mathbf {A}}\Delta {{\bf x}}\Vert _2^2. \end{align}\)

The error is composed of two parts: fitting error, which measures how well \({{\bf x}}\) can reconstruct \({{\bf b}}\), and rounding error, which indicates how far \({{\bf x}}\) is from an integer point. Suppose there are N non-zero elements in \({{\bf x}}\); an upper bound on the rounding error can be calculated by (9) \(\begin{align} \Vert {\mathbf {A}}\Delta {{\bf x}}\Vert _2^2 & = \left\Vert \sum _j {\mathbf {A}}_j \Delta x_j \right\Vert _2^2 \le \sum _j \Vert {\mathbf {A}}_j \Delta x_j \Vert _2^2 \nonumber \nonumber \nonumber\\ & \le N \cdot \max _j \Vert {\mathbf {A}}_j \Vert _2^2, \end{align}\) where we use the fact that the maximum possible stochastic rounding error is 1. Since the upper bound in Equation (9) is proportional to N, decreasing the number of non-zero elements in \({{\bf x}}\) reduces the rounding error. However, a sparser \({{\bf x}}\) implies that fewer types of FUB implementations are used (i.e., fewer basis to decompose \({{\bf b}}\) onto), which increases the fitting error.
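The fitting/rounding decomposition can be illustrated numerically. The data here is synthetic (non-negative random matrix and a dense, mostly-tiny relaxed solution, mimicking the coefficient profile described in Section 4.1), and `stochastic_round` is a straightforward implementation of the stochastic rounding assumption:

```python
# Numeric illustration of the fitting vs. rounding error split: stochastically
# rounding a dense relaxed solution adds error even when the fit is exact.
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 200
A = np.abs(rng.normal(size=(p, n)))
x_relaxed = np.abs(rng.normal(scale=0.1, size=n))  # dense, mostly tiny coefficients
b = A @ x_relaxed                                  # b is fit exactly by x_relaxed

def stochastic_round(x, rng):
    """Round each entry up with probability equal to its fractional part."""
    frac = x - np.floor(x)
    return np.floor(x) + (rng.random(x.shape) < frac)

x_rounded = stochastic_round(x_relaxed, rng)
fitting_error = np.linalg.norm(A @ x_relaxed - b) ** 2   # zero by construction
rounding_error = np.linalg.norm(A @ (x_rounded - x_relaxed)) ** 2  # strictly positive
```

Even though the relaxed solution reconstructs b perfectly, the many small non-zero coefficients each contribute rounding error, which is the effect the sparse-regression reformulation targets.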

4.1 Two Error Types

Despite the tradeoff between rounding and fitting errors with respect to N, the impact of N on rounding error is much higher, especially when N is not very small (e.g., \(N\gt\)100). By studying various cases of the relaxed IP problem in Equation (4), we find the solutions have two important characteristics: (i) they are not sparse, i.e., almost all of the elements in \({{\bf x}}\) are non-zero; (ii) most elements in \({{\bf x}}\) are extremely small (e.g., more than \(99.9\%\) are less than 0.5). Figure 12 shows an example by plotting the values of all the elements in \({{\bf x}}\). Only 74 out of 605,945 coefficients are larger than 0.5, and a fairly large proportion of the coefficients are even smaller than \(10^{-5}\). Studying the contribution of the basis with small coefficients to the fitting/rounding error reveals that such basis bring much higher rounding error than their contribution to reducing the fitting error. If we divide \({{\bf x}}\) into two vectors \({{\bf x}}_\text{L}\) (coefficients larger than 0.5, with corresponding basis \({\mathbf {A}}_\text{L}\)) and \({{\bf x}}_\text{S}\) (smaller than 0.5, with basis \({\mathbf {A}}_\text{S}\)), then the difference in fitting error and rounding error brought by basis in \({\mathbf {A}}_\text{S}\) can be defined as follows, respectively: (10) \(\begin{align} \Delta \epsilon _f &= \left(\Vert {\mathbf {A}}_\text{L}\cdot {{\bf x}}^{\prime }_\text{L}-{{\bf b}}\Vert _2 - \Vert {\mathbf {A}}{{\bf x}}-{{\bf b}}\Vert _2 \right)\:/\:\Vert {{\bf b}}\Vert _2 \end{align}\) (11) \(\begin{align} \Delta \epsilon _r &= \Vert {\mathbf {A}}_\text{S}\cdot \Delta {{\bf x}}_\text{S}\Vert _2\:/\:\Vert {{\bf b}}\Vert _2 , \end{align}\) where \({{\bf x}}^{\prime }_\text{L}\) are the coefficients solved from Equation (4) by using only the basis in \({\mathbf {A}}_\text{L}\), and \(\Delta {{\bf x}}_\text{S}\) is calculated by: (12) \(\begin{align} \Delta {{\bf x}}_\text{S} = {{\bf x}}_\text{S} - f_{\text{round}}({{\bf x}}_\text{S}). \end{align}\)

Fig. 12.

Fig. 12. Solved coefficient values for Equation (4). The solution is non-sparse but more than 99.9% of the coefficients are extremely small. The small coefficients lead to high rounding error as compared to fitting error.

In the case shown in Figure 12, \(\Delta \epsilon _f = 0.0149\) while \(\Delta \epsilon _r = 0.1446\), which means that considering the basis in \({\mathbf {A}}_\text{S}\) brings \(10\times\) more rounding error than their contribution to reducing the fitting error. Thus, if \({\mathbf {A}}_\text{S}\) can be identified beforehand and eliminated from \({\mathbf {A}}\), then the total error of the solution would be significantly reduced.

4.2 Sparse Regression and Solving Strategies

Based on analysis of the two types of errors, we transform Equation (4) to a sparse-regression problem, namely: (13) \(\begin{align} \begin{split} \min _{{{\bf x}}} \:& ||{\mathbf {A}}{{\bf x}}-{{\bf b}}||_2^2 \\ \text{s.t. } & \mathbf {D}{{\bf x}}\succeq \mathbf {d} \\ & \Vert {{\bf x}}\Vert _0 \le \lambda . \end{split} \end{align}\)

The L\(_0\)-norm of \({{\bf x}}\) in Equation (13) restricts the number of non-zero elements below a pre-defined hyper-parameter \(\lambda\). By solving Equation (13), we are expecting to decrease N for small rounding error and at the same time find a suitable set of basis so the fitting error can also be under control.

Note that the regression problem is highly under-determined: the number of basis in \({\mathbf {A}}\) is so large that many of them are very similar. Thus, for Equation (4), many sub-optimal points are likely to exist near the optimal solution. In Equation (13), we aim to find such sub-optimal points that lie on the coordinate axes, which have slightly higher fitting error but far lower rounding error. For example, as illustrated in Figure 13, \(\text{P}_1\)\(\sim\)\(\text{P}_3\) are three solutions of Equation (4). \(\text{P}_1\) is the optimal solution with the least fitting error, but during the rounding process it incurs rounding error in every dimension, so its total error is higher than that of the two sub-optimal solutions, which have smaller rounding error due to sparsity. By solving Equation (13), we are trying to find solutions such as \(\text{P}_2\) and \(\text{P}_3\), which are sub-optimal in terms of fitting error but have much lower rounding and overall errors. Since Equation (13) is NP-hard due to the non-convexity of the L\(_0\)-norm, two alternative strategies are proposed.

Fig. 13.

Fig. 13. \(\text{P}_1\) \(\sim\) \(\text{P}_3\) are three solutions of the relaxed problem in Equation (4). \(\text{P}_1\) has the least fitting error but higher rounding error, because it is not sparse. By sparse-regression in Equation (13), we are trying to find solutions such as \(\text{P}_2\) and \(\text{P}_3\) , both of which have lower overall error.

Solving Strategy 1: forward step-wise regression

When no other constraints are specified besides \({{\bf x}}\succeq \mathbf {0}\), a greedy forward step-wise strategy can be adopted to iteratively select a subset of the basis that significantly decreases the fitting error. As summarized in Algorithm 2, we begin with a residual \(\mathbf {r}^{(0)}={{\bf b}}\) and an empty basis index set V (step 1). In each iteration, the basis most positively correlated with the residual is identified by evaluating the normalized inner product defined in step 2. The inner product must be non-negative to ensure that the coefficients with respect to the selected basis satisfy the constraint \({{\bf x}}\succeq \mathbf {0}\). So, in step 3, if no more basis are positively correlated with the residual, then the iteration terminates. Once a new basis is chosen, its index is added to V and the regression problem is re-solved over the currently selected basis in step 5 (which is easily solvable by convex solvers such as Reference [10]). A new \(\mathbf {r}^{(k)}\) is computed in step 6, indicating the residual that the remaining basis need to fit. If not terminated at step 3, then the iteration stops when the iteration count reaches a pre-defined threshold \(N_{\text{max}}\). Finally, all the coefficients with respect to the un-selected basis are set to zero (step 7). This algorithm is similar to orthogonal matching pursuit (OMP) in that both iteratively project residuals onto basis. The first difference from OMP is that the solution vector \({{\bf x}}\) must be integer-valued. The second difference is that the basis in this algorithm are not orthogonal and cannot be orthogonalized, because each basis represents a FUB implementation and the solution represents the number of each FUB that should be selected.
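A minimal sketch of this greedy loop is below, using SciPy's non-negative least squares for the restricted re-fit in step 5. The data, `n_max` value, and helper name `forward_stepwise` are illustrative, and the per-iteration integer rounding of the actual algorithm is omitted:

```python
# Sketch of the forward step-wise strategy: greedily add the basis most
# positively correlated with the residual, then re-solve with x >= 0.
import numpy as np
from scipy.optimize import nnls

def forward_stepwise(A, b, n_max=20):
    r, V = b.copy(), []                      # residual and selected-basis indices
    x = np.zeros(A.shape[1])
    norms = np.linalg.norm(A, axis=0)
    for _ in range(n_max):
        corr = (A.T @ r) / norms             # normalized inner products (step 2)
        corr[V] = -np.inf                    # never re-select a basis
        j = int(np.argmax(corr))
        if corr[j] <= 0:                     # no positively correlated basis left
            break
        V.append(j)
        coef, _ = nnls(A[:, V], b)           # re-solve with x >= 0 on selected basis
        x[:] = 0.0
        x[V] = coef                          # un-selected coefficients stay zero
        r = b - A @ x                        # update the residual (step 6)
    return x, V

rng = np.random.default_rng(0)
A = np.abs(rng.normal(size=(30, 100)))
x_true = np.zeros(100)
x_true[[5, 40, 77]] = [2.0, 1.0, 3.0]        # a sparse non-negative ground truth
b = A @ x_true
x_hat, V = forward_stepwise(A, b, n_max=20)
```

Because each re-fit minimizes over a strictly larger set of basis, the residual norm is non-increasing across iterations.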

Solving Strategy 2: L\(_1\)-regularization

If additional constraints are required besides \({{\bf x}}\succeq \mathbf {0}\) (e.g., Equation (3)), then Strategy 1 cannot be used, because selecting a basis during iterations makes it difficult to ensure the final solution satisfies the additional constraints. In such cases, L\(_1\)-regularization can be used to solve the constrained sparse-regression problem.

The main idea of L\(_1\)-regularization is to relax Equation (13) by replacing the L\(_0\)-norm term with L\(_1\)-norm: (14) \(\begin{align} \Vert {{\bf x}}\Vert _1 \le \lambda . \end{align}\)

The L\(_1\)-norm of a vector is defined as the summation of the absolute value of all the elements, which is a convex function. Various theoretical studies from the statistics community demonstrate that under some general assumptions both L\(_0\)-norm regularization and L\(_1\)-norm regularization result in the same solution [8].
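A small sketch of Strategy 2 in penalized (Lagrangian) form follows: under the constraint \({{\bf x}}\succeq \mathbf {0}\), the L\(_1\)-norm is simply the sum of the entries, so the penalized objective is smooth and can be minimized by projected gradient descent. The solver, step count, and penalty weight `mu` are illustrative assumptions, not the paper's convex solver or its tuned \(\lambda\):

```python
# Sketch of L1-regularized regression with a non-negativity constraint,
# solved by projected gradient descent on ||Ax-b||^2 + mu * sum(x).
import numpy as np

def l1_nonneg(A, b, mu=1.0, steps=2000):
    n = A.shape[1]
    x = np.zeros(n)
    lr = 0.5 / np.linalg.norm(A, 2) ** 2      # step size 1/L from Lipschitz constant
    for _ in range(steps):
        grad = 2 * A.T @ (A @ x - b) + mu     # gradient of the smooth objective
        x = np.maximum(x - lr * grad, 0.0)    # project back onto x >= 0
    return x

rng = np.random.default_rng(0)
A = np.abs(rng.normal(size=(20, 60)))
x_true = np.zeros(60)
x_true[[3, 30]] = [3.0, 1.0]                  # sparse non-negative ground truth
b = A @ x_true
x_hat = l1_nonneg(A, b, mu=0.1)
```

Larger `mu` (equivalently, smaller \(\lambda\)) drives more coefficients to exactly zero, trading fitting error for sparsity as described in the text; additional linear constraints would be handled by a general convex solver rather than this simple projection.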

The two strategies have their respective advantages and limitations. Strategy 1 is more heuristic but works very well in practice. In each iteration, the optimization problem in step 5 of Algorithm 2 contains only a small number of variables so it can be efficiently solved. In addition, with Strategy 1, selection of the hyper-parameter \(\lambda\) can be avoided, because in practice we can integrate the rounding process and track the error in each iteration by rounding \(\hat{{{\bf x}}}^{(k)}\) and calculating the residual. Finally, we can simply select the best result within all the iterations (an example will be given in Section 5). The downside of Strategy 1, however, is that it can only solve the problem with constraint \({{\bf x}}\succeq \mathbf {0}\), which highlights the strength of Strategy 2—flexibility to deal with any linear constraints. However, Strategy 2 requires careful tuning of the hyper-parameter \(\lambda\), which significantly increases the total runtime if aiming for an optimal \(\lambda\).

4.3 Error Bound and Computational Cost

In this part, we analyze the error bound of the two strategies.

For Strategy 1, without loss of generality, we assume that the object \(\mathbf {b}\) can be constructed through a non-negative linear combination of basis \(\mathcal {D}=\lbrace \mathbf {\psi }_1, \mathbf {\psi }_2, \ldots , \mathbf {\psi }_n\rbrace\), where \(\mathbf {\psi } \in \mathbb {R}^p\) and \(\Vert \mathbf {\psi }\Vert _2=1\) for all \(\mathbf {\psi } \in \mathcal {D}\). This dictionary \(\mathcal {D}\) can be constructed by normalizing each column of matrix \(\mathbf {A}\). In other words, we assume that there exists an underlying set of coefficients \({{\bf x}}^*\) such that \(\mathbf {b} = \sum _j x_j^* \mathbf {\psi }_j\). Based on the proof in the Appendix, we have the following theorem concerning the error bound of the residual \(\mathbf {r}^{(k)}\) of step k:

Theorem 1.

Let \(\Vert \mathbf {b}\Vert _{\mathcal {L}_1} = \sum _j x_j^* \lt \infty\), the residual \(\mathbf {r}^{(k)}\) after k steps satisfies (15) \(\begin{align} \Vert \mathbf {r}^{(k)}\Vert _2 \le \frac{\Vert \mathbf {b}\Vert _{\mathcal {L}_1}}{\sqrt { k+1}} \end{align}\) for all \(k \ge 1\).

Theorem 1 demonstrates that, using Algorithm 2, the residual-error bound decreases at the rate of \(O(k^{-1/2})\), which assures error convergence of Strategy 1 (i.e., of the fitting error). However, after the subsequent rounding process, the rounding error increases approximately at the rate of \(O(k)\) (since its upper bound is proportional to N). Considering the tradeoff between the two parts of the error, there should exist an optimal k that minimizes the overall error.

For Solving Strategy 2, with the same “\(\mathbf {b}=\mathbf {A}{{\bf x}}^*+\epsilon\)” assumption as stated at the beginning of Section 4.1, and that \(\Vert \mathbf {A}_j\Vert _2^2 \le n\), for \(j=1, \ldots , p\), we have the following theorem (see the Appendix for the proof of Theorem 2):

Theorem 2.

Set \(\lambda = \Vert {{\bf x}}^*\Vert _1\) in Equation (14), then with probability at least \(1-\delta\), the expectation of error between the fitted result \(\mathbf {A}\hat{{{\bf x}}}\) and \(\mathbf {A}{{\bf x}}^*\) has the error bound

(16) \(\begin{align} {\mathbb {E}}\,\Vert \mathbf {A}\hat{{{\bf x}}}-\mathbf {A}{{\bf x}}^*\Vert _2^2 \le 4\sigma \Vert {{\bf x}}^*\Vert _1\sqrt {2n\log (2p/\delta)}. \end{align}\)

It shows that the error bound is proportional to the \(\text{L}_1\)-norm of the underlying “real” coefficients \({{\bf x}}^*\). If \({{\bf x}}^*\) has a small \(\text{L}_1\)-norm value, then the solution \(\hat{{{\bf x}}}\) given by Strategy 2 will not have a large fitting error.

Now, we demonstrate that the two strategies have polynomial complexity, which is a significant speedup over branch-and-bound algorithm’s exponential complexity.

For Strategy 1, step 2 in Algorithm 2 involves n vector multiplications of length p, which means \(O(np)\) floating-point operations (FLOPs). Step 5 is the most complex, including solving for a constrained quadratic optimization problem. Conventional commercial tools use an interior-point method that requires \(O(n^2p)\) FLOPs. Step 6 is essentially a matrix-vector product and a vector subtraction, which requires \(O(kp)\) FLOPs. Assuming that iteration terminates after K iterations, the overall complexity of Strategy 1 is \(O(Kn^2p)\). The main calculation of Strategy 2 is to solve the \(\text{L}_1\) regularized problem. This step can also be performed by commercial tools using an interior-point method, which has \(O(n^2p)\) complexity.

Skip 5EXPERIMENTS Section

5 EXPERIMENTS

In this section, we describe the details of design experiments for various CM-LCVs. First, the efficiency of ML-assisted FUB creation is demonstrated with experiments based on two industrial standard-cell libraries. Then, more designs are used to demonstrate the performance of IPSA in solving the integer programming problem compared to conventional methods.

5.1 Machine Learning-assisted FUB Library Creation

To evaluate the efficacy of the ML-aided methodology, we compare the design effort and characteristics of the CM-LCVs created by (i) the ML-aided flow and (ii) the conventional flow described in Reference [17].

For the experiments, we have two standard-cell libraries (Lib0 and Lib1) provided by an industrial partner. Characteristics of the two libraries are listed in Table 1. For each standard-cell library, two FUB functions are deployed, named “Cygnus” and “Hercules.” Specifically, the CM-LCV design tasks are: using Lib0, generate highly testable, unique implementations for FUB functions Cygnus and Hercules to form a FUB library that is sufficient for creating LCVs whose cell distributions match those of industrial design blocks; then repeat the first task using Lib1 and the corresponding industrial designs.

Table 1.
Standard-cell library | No. of standard cells | No. of logic functions | No. of possible synthesis configurations
Lib0 | 7,485 | 58 | 2.9 \(\times\) \(10^{17}\)
Lib1 | 11,981 | 63 | 9.0 \(\times\) \(10^{18}\)

Table 1. Information of the Two Commercial Standard-cell Libraries

Although the ML-aided flow focuses on mitigating the bottlenecks of the conventional flow (the dashed steps of Figure 5), in the experiments we execute the design flow and use characteristics of the resulting LCV for evaluation. All experiments are completed using a server with 64 2.2 GHz CPU cores and 1 TB of RAM. The RF models used in the experiments are implemented using Python package scikit-learn [22]. All the hyper-parameters in RF are picked using cross-validation.

The performance of the ML-aided flow is equated to the individual performances of classifiers \(C_1\) and \(C_2\), as well as the on-line learning strategies shown in Figure 11. To evaluate the performance of each classifier, precision and recall are used to analyze prediction correctness. Precision and recall are calculated based on four statistics: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). TP is the number of samples that are truly positive and predicted as positive. FP is the number of samples that are truly negative but predicted as positive. TN and FN are defined similarly. Precision and recall are computed as: (17) \(\begin{equation} \begin{aligned}Precision = \frac{TP}{TP+FP}, \\ Recall = \frac{TP}{TP+FN}.\\ \end{aligned} \end{equation}\)

In Equation (17), precision represents the probability that a sample predicted positive is truly positive, while recall represents the probability that a truly positive sample is correctly predicted. The precision and recall in Equation (17) are defined for the positive class; those for the negative class are defined similarly. Note that due to the on-line learning strategy, the classifiers make predictions multiple times. As a result, the on-line learning algorithm is evaluated as a whole: precision and recall are calculated when on-line learning terminates, i.e., when the model predicts no configurations in the remaining set as positive. Here, “precision” and “recall” are not exactly what is usually defined in the ML community, i.e., the precision and recall on the testing set. It does not make sense to calculate precision and recall on the remaining set (the test set), because when the on-line learning algorithm terminates, the distributions of the “training set” (the samples that have already been synthesized and verified) and the “testing set” (the samples that have not been synthesized) are no longer the same. Instead, we use the two metrics to demonstrate the ability of the proposed algorithm as a whole to reduce design cost.
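Equation (17) can be implemented directly; the toy label vectors below are illustrative:

```python
# Direct implementation of Equation (17) from binary true/predicted labels.
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# tp=2, fp=1, fn=1: precision and recall are both 2/3
p, r = precision_recall([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```

In this setting, high recall matters most for \(C_1\) (few unique implementations missed), while precision measures how much wasted synthesis the predictor still triggers.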

5.1.1 Uniqueness Prediction.

First, the performance of \(C_1\) is explored, and FUB library creation by the ML-aided flow and the conventional flow are compared. The last column in Table 1 lists the number of possible synthesis configurations of a FUB function according to Equation (1). With the conventional flow, it is extremely time-consuming to exhaustively explore the configuration space; thus, six weeks is used as a constraint on the amount of synthesis time for the conventional flow. In addition, we limit a synthesis configuration to at most eight logic functions.

For the ML-aided flow, the input data samples \(\mathbf {X}\) for classifier \(C_1\) are the same ones used by the conventional flow. Instead of directly running synthesis on each configuration, \(C_1\) predicts which configurations will lead to unique implementations. Following the on-line learning described in Algorithm 1, 20,000 synthesis configurations are randomly selected and labeled after synthesis, forming the initial training data \(\mathbf {D}_\mathrm{train}\). Note that the initial training set contains a substantial number of samples with both label “0” and label “1”: according to Table 2, the chance that the first 20,000 randomly picked configurations all lead to unique implementations is very low. Therefore, there is no data imbalance issue in the initial training data \(\mathbf {D}_\mathrm{train}\). In each iteration of the while loop, the 10,000 top-ranked configurations \(\mathbf {X}_\mathrm{top}\) are synthesized to obtain their true labels \(\mathbf {Y}_\mathrm{top}\). Also, the samples with probabilistic labels less than 0.01 are included in the bottom-ranked data \(\mathbf {X}_\mathrm{btm}\).
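The on-line learning loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's Algorithm 1: the classifier interface mirrors scikit-learn's `fit`/`predict_proba` convention, and the function name, the `labeler` callback (a stand-in for running synthesis), and the simplified stopping rule are all assumptions:

```python
import random

def online_learning(clf, X, labeler, n_init=20000, n_top=10000, p_btm=0.01):
    """Iteratively synthesize only the configurations that the classifier
    ranks as most likely to be positive (label 1); confident negatives are
    added with probabilistic label 0 instead of being synthesized.
    Note: shuffles X in place."""
    random.shuffle(X)
    train = [(x, labeler(x)) for x in X[:n_init]]   # initial random batch
    remaining = X[n_init:]
    while remaining:
        clf.fit([x for x, _ in train], [y for _, y in train])
        scored = sorted(remaining,
                        key=lambda x: clf.predict_proba([x])[0][1],
                        reverse=True)
        top = scored[:n_top]
        if clf.predict_proba([top[0]])[0][1] < 0.5:
            break                                   # nothing predicted positive
        btm = [x for x in scored[n_top:]
               if clf.predict_proba([x])[0][1] < p_btm]
        train += [(x, labeler(x)) for x in top]     # true labels via synthesis
        train += [(x, 0) for x in btm]              # confident pseudo-negatives
        btm_set = set(btm)
        remaining = [x for x in scored[n_top:] if x not in btm_set]
    return train
```

In the actual flow, `clf` would be a `RandomForestClassifier` and `labeler` would invoke synthesis and uniqueness checking.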

| Library | FUB function | Synth. runs (Conv. flow) | Synth. runs (ML-aided flow) | Unique impl. (Conv. flow) | Unique impl. (ML-aided flow) | \(C_1\) precision | \(C_1\) recall | Synthesis reduction |
|---|---|---|---|---|---|---|---|---|
| Lib0 | Cygnus | 7,470,000 | 500,000 | 99,031 | 99,026 | 19.8% | 99.9% | 14.5\(\times\) |
| Lib0 | Hercules | 9,450,000 | 820,000 | 313,262 | 313,255 | 38.2% | 99.9% | 11.2\(\times\) |
| Lib1 | Cygnus | 6,750,000 | 600,000 | 125,779 | 125,766 | 20.9% | 99.9% | 10.8\(\times\) |
| Lib1 | Hercules | 9,240,000 | 1,010,000 | 406,849 | 406,834 | 40.3% | 99.9% | 8.9\(\times\) |

Table 2. Prediction of Synthesis Outcome for Standard-cell Libraries Lib0 and Lib1

Table 2 lists the quantitative results of \(C_1\) and a comparison of the two flows. The third and fourth columns show the number of synthesis runs of the two flows, while the fifth and sixth columns show the number of unique implementations generated by each flow. In the conventional flow, every configuration is used for synthesis. This means the conventional flow generates the true labels for all the input configurations of \(C_1\), which provides the ground truth to evaluate \(C_1\)’s performance. Precision and recall for \(C_1\) are listed in the seventh and eighth columns. The precision and recall of the label-1 class (the configurations that lead to a unique FUB implementation) are calculated according to Equation (17), where precision is the ratio of column 6 to column 4 and recall is the ratio of column 6 to column 5.

The recall of classifier \(C_1\) is almost perfect, which means the ML-aided flow identifies almost all the unique FUB implementations produced by the conventional flow. The low precision does not undermine the efficiency advantage of the ML-aided flow over the conventional flow. Moreover, precision is expected to be low, given the extraordinary amount of imbalance in the data. For example, for the Lib0-Cygnus case, row 1 of Table 2 shows that only 99,031 out of 7,470,000 samples have label “1,” which means that in the conventional flow, for each synthesis run that leads to a unique implementation, roughly 74 other synthesis runs do not. In such imbalanced cases, it is extremely difficult to simultaneously achieve high precision and recall for the minority class. For the ML-aided flow, a 19.8% precision (row 1 of Table 2) means that, on average, only about five synthesis runs are needed to produce a unique FUB implementation. This significant reduction in the amount of synthesis is the reason for the speedup of the ML-aided flow. It is also possible to trade off precision and recall using the strategies mentioned in Reference [12], but in this application, recall is much more important than precision, because it is highly desirable to find every unique FUB implementation. A poor precision only means that we must spend some time and resources verifying samples that are predicted as “1” but actually have label “0.” Although not desired, this cost is acceptable compared to the loss incurred by poor recall. Therefore, we tune the learning to achieve near-perfect recall while still ensuring a precision high enough to maintain efficiency.

With the high performance of \(C_1\) and the on-line learning, a significant speedup for FUB library creation is achieved. The last column in Table 2 shows the synthesis reduction, calculated as the ratio of column 3 to column 4. The speedup is also demonstrated in Figures 14 and 15, which show the number of unique FUB implementations as a function of the number of synthesis runs for libraries Lib0 and Lib1, respectively. The ML-aided curves (squares) in Figures 14 and 15 have a much steeper slope than the conventional curves (circles), and thus reach the same number of unique implementations using significantly fewer synthesis runs.


Fig. 14. Number of unique implementations generated using standard-cell library Lib0 for FUB functions (a) Cygnus and (b) Hercules.


Fig. 15. Number of unique implementations generated using standard-cell library Lib1 for FUB functions (a) Cygnus and (b) Hercules.

5.1.2 Testability Prediction.

After demonstrating the performance of classifier \(C_1\), the performance of \(C_2\) is explored in this part, and the testability analysis of the FUB implementations by the two flows is compared.

Once the FUB library is created, the conventional flow analyzes the testability of every implementation. For the ML-aided flow, the implementations of the library are fed as input to classifier \(C_2\). Instead of directly running ATPG on each FUB implementation as is done in the conventional flow, \(C_2\) predicts which implementations are likely to achieve high testability and then only those implementations are submitted for ATPG. The on-line learning described by Algorithm 1 is also deployed here with different hyper-parameters. To be specific, 10% of the data samples are used as the initial training data for \(C_2\); in each iteration, 5,000 top-ranked implementations are analyzed by running ATPG to obtain their true labels; 0.1 is set as the threshold for collecting the bottom-ranked data samples (i.e., \(\mathbf {X}_\mathrm{btm}\)).

Table 3 lists the quantitative results of \(C_2\) and a comparison of the two flows. The third and fourth columns show the number of ATPG runs for the two flows, while the fifth and sixth columns show the number of FUB implementations with high testability. The conventional flow generates the true labels for all data samples of \(C_2\), which provides the ground truth to evaluate \(C_2\)’s performance. Precision and recall for \(C_2\) are listed in the seventh and eighth columns, respectively. The precision and recall of the label-1 class (the implementations with high testability) are calculated according to Equation (17). Like \(C_1\), \(C_2\) also achieves near-perfect recall, which means the ML-aided flow identifies almost all FUB implementations with high testability from the FUB library. Similar to \(C_1\), we tune \(C_2\) to achieve high recall rather than precision, so that nearly all FUBs with high testability are identified. The last column in Table 3 shows the reduction in ATPG, calculated as the ratio of column 3 to column 4. The reduction in ATPG is not as significant as it was for synthesis. However, deploying \(C_2\) is still very worthwhile for reducing the resources needed for test chip design, especially given the small amount of resources needed to train and use classifiers \(C_1\) and \(C_2\) (e.g., training \(C_2\) takes less than a second). In other words, the amount of ATPG is reduced by 27% to 33% with negligible cost.

| Library | FUB function | Testability analyses (Conv. flow) | Testability analyses (ML-aided flow) | High-testability impl. (Conv. flow) | High-testability impl. (ML-aided flow) | \(C_2\) precision | \(C_2\) recall | Analysis reduction |
|---|---|---|---|---|---|---|---|---|
| Lib0 | Cygnus | 99,031 | 72,336 | 26,137 | 25,940 | 35.9% | 99.3% | 1.37\(\times\) |
| Lib0 | Hercules | 313,262 | 225,962 | 138,644 | 138,577 | 61.3% | 99.9% | 1.39\(\times\) |
| Lib1 | Cygnus | 125,779 | 84,288 | 31,092 | 30,719 | 36.5% | 98.9% | 1.50\(\times\) |
| Lib1 | Hercules | 406,849 | 285,109 | 172,604 | 172,485 | 60.5% | 99.9% | 1.42\(\times\) |

Table 3. Prediction of Testability for Various Unique FUB Implementations

5.1.3 Test Chip Design Evaluation.

We have demonstrated how machine learning can significantly reduce the amount of synthesis and ATPG within the CM-LCV design flow. Here, we complete the remaining steps in the two design flows to obtain the final test-chip designs for comparison. Although a new method to solve the optimization problem is proposed in Section 4, for a fair comparison, the optimization problems here are solved using the conventional method [10]. A comparison of the solving strategies can be found in the next subsection.

For both flows, after synthesis and testability analysis, the unique FUB implementations with high testability for the two FUB functions are combined to form a FUB library. Then, in this experiment, five industrial blocks (Block A to Block E) are used as the targets of cell-usage distribution matching. For each cell-usage distribution corresponding to an industrial block, an optimization problem is formulated to select FUB implementations to embody the test chip, with the aim of minimizing the mismatch rate in cell usage while achieving high testability. As a result, five test chip designs are generated using Lib0 and Lib1 to match the five target cell-usage distributions.

Table 4 lists the quantitative results of the design effort and the characteristics of the LCVs created by the two flows. The third and fourth columns report the amount of mismatch between the LCV designs and the industrial distributions for the two flows. The testability of the designs is evaluated using single stuck-at line (SSL) fault coverage and IP fault coverage [4], shown in columns five through eight. Comparing the characteristics of the designs reveals that three of the five designs created by the ML-aided flow suffer no performance degradation at all; the remaining two have at most 0.1% degradation in cell mismatch and a 0.2% reduction in IP fault coverage. These results demonstrate that only a few unique, highly testable FUB implementations are missed by the ML-aided design flow. Finally, the effort required by both flows is measured in terms of compute time. Specifically, the last two columns of Table 4 report the CPU runtime for creating the corresponding designs with the two flows. The reported times include the CPU runtime for: (i) FUB library creation, (ii) testability analysis, and (iii) solving the optimization problem that forms the test chip. The results show that the ML-aided design flow provides up to an \(11\times\) speedup for a test-chip design with negligible performance degradation.

| Library | Cell distribution | Mismatch rate (Conv.) | Mismatch rate (ML-aided) | SSL fault cov. (Conv.) | SSL fault cov. (ML-aided) | IP fault cov. (Conv.) | IP fault cov. (ML-aided) | Runtime, hours (Conv.) | Runtime, hours (ML-aided) |
|---|---|---|---|---|---|---|---|---|---|
| Lib0 | BlockA | 4.9% | 4.9% | 99.7% | 99.7% | 79.2% | 79.2% | 2,042.7 | 188.8 |
| Lib0 | BlockB | 8.7% | 8.8% | 99.6% | 99.5% | 75.8% | 75.6% | 2,040.2 | 188.5 |
| Lib1 | BlockC | 7.5% | 7.5% | 99.6% | 99.6% | 76.8% | 76.8% | 2,054.6 | 244.7 |
| Lib1 | BlockD | 12.8% | 12.8% | 99.6% | 99.6% | 75.9% | 75.9% | 2,064.5 | 253.8 |
| Lib1 | BlockE | 11.7% | 11.8% | 99.4% | 99.3% | 76.1% | 76.1% | 2,061.7 | 251.0 |

Table 4. Design Effort and Characteristics Comparison of Test Chip Designs Created by the Conventional and ML-aided Flows
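For reference, the mismatch rate of cell usage reported in Table 4 (defined in footnote 1 as \(\sum |\Delta_i|/T\)) can be computed as in the following sketch; the function name and the dictionary-based interface are illustrative assumptions:

```python
def mismatch_rate(target_counts, lcv_counts):
    """Mismatch rate of cell usage: sum over standard cells of the absolute
    difference between the target design's instance count and the CM-LCV's,
    divided by the total number of cells in the target design."""
    cells = set(target_counts) | set(lcv_counts)
    diff = sum(abs(target_counts.get(c, 0) - lcv_counts.get(c, 0))
               for c in cells)
    total = sum(target_counts.values())
    return diff / total
```

For example, a target of 50 inverters and 50 NANDs matched by an LCV with 45 inverters and 60 NANDs gives a mismatch rate of (5 + 10)/100 = 15%.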

5.2 Smart Integer Programming

In this subsection, we apply IPSA to the LCV implementation problems and compare the performance with the naive relax-round method and a commercial integer programming solver. In addition, further analysis is conducted to demonstrate details of the method and illustrate the tradeoff between fitting and rounding errors, as discussed in Section 4.1.

5.2.1 Comparison of Solving Strategies.

We compare the performance of four methods for LCV implementation: (i) IPSA with the forward step-wise strategy, (ii) IPSA with L\(_1\)-regularization, (iii) the naive relax-round method, and (iv) a commercial integer programming solver [23]. Seven examples (denoted Designs 1–7) derived from real industrial test-chip design cases are used to evaluate performance. Each example corresponds to a standard-cell library and a target standard-cell histogram; in other words, Equation (2) is solved with a specific matrix \({\mathbf {A}}\) and vector \({{\bf b}}\) for each design. The \({\mathbf {A}}\) matrices for Designs 1–3 contain 605,945 columns, so the solution \({{\bf x}}\) has a dimension of 605,945; the corresponding numbers are 829,228 for Designs 4–5 and 222,698 for Designs 6–7. A commercial convex optimization tool [10] is used for solving the relaxed problem in Equation (4), forward step-wise regression (step 5 in Algorithm 2), and L\(_1\)-regularization.

Two metrics are used to evaluate the four solution approaches, namely, standard-cell histogram matching and runtime. The histogram matching error for a specific solution \(\hat{{{\bf x}}}\) is defined as follows: (18) \(\begin{align} \text{matching error} = \frac{\Vert {\mathbf {A}}\hat{{{\bf x}}}-{{\bf b}}\Vert _1}{\Vert {{\bf b}}\Vert _1}. \end{align}\)

Here, we use the L\(_1\)-norm because it measures the histogram mismatch more intuitively than the L\(_2\)-norm; in the optimization problem, however, we used the latter because it is smoother and strictly convex, making the problem easier to solve with commercial convex optimization tools.
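A minimal sketch of the matching-error metric in Equation (18), using plain Python lists for \({\mathbf {A}}\), \(\hat{{{\bf x}}}\), and \({{\bf b}}\) (illustrative only; the actual designs involve matrices with over \(10^5\) columns):

```python
def matching_error(A, x, b):
    """Relative L1 histogram-matching error ||A x - b||_1 / ||b||_1,
    with A given as a list of rows."""
    Ax = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
    num = sum(abs(v - b_i) for v, b_i in zip(Ax, b))
    den = sum(abs(b_i) for b_i in b)
    return num / den
```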

First, we consider the scenario where the only linear constraint is \({{\bf x}}\succeq \mathbf {0}\), i.e., the practitioner has no requirement on the minimum number of instances of any specific standard cell(s) in the design. In this case, both the forward step-wise and L\(_1\)-regularization strategies can be applied. The two metrics for the four methods are listed in Table 5, in the “Overall error” and “Time” columns. For each design, the method that gives the smallest error is highlighted in bold. For a more straightforward comparison, the errors for the seven designs are also plotted in Figure 16.


Fig. 16. The comparison of the relative error (i.e., \(\Vert {\mathbf {A}}\hat{{{\bf x}}}-{{\bf b}}\Vert _1/\Vert {{\bf b}}\Vert _1\) ) of the solution \(\hat{{{\bf x}}}\) from each of the four methods for seven designs.

| Design | Relax-round full prec. error | Relax-round overall error | Relax-round time (sec.) | Fwd step-wise full prec. error | Fwd step-wise overall error | Fwd step-wise time (sec.) | L\(_1\) full prec. error | L\(_1\) overall error | L\(_1\) time (sec.) | Solver [23] overall error | Solver [23] time (sec.) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0956 | 0.096 | 404.1 | 0.0564 | **0.0561** | 549.7 | 0.0956 | 0.0951 | 2,595 | 0.086 | 86,000 |
| 2 | 0.0513 | 0.0517 | 403.1 | 0.0312 | **0.0313** | 630 | 0.0514 | 0.0514 | 2,625 | 0.035 | 86,000 |
| 3 | 0.0721 | 0.072 | 345.3 | 0.0466 | **0.0465** | 537.1 | 0.0718 | 0.0718 | 2,750 | 0.087 | 86,000 |
| 4 | 0.0056 | 0.0747 | 347.3 | 0.0058 | 0.0077 | 637.3 | 0.0056 | 0.0091 | 7,520 | **0.006** | 86,000 |
| 5 | 0.0062 | 0.1531 | 350.2 | 0.00089 | 0.0031 | 516.5 | 0.00089 | 0.0059 | 10,885 | **0.001** | 86,000 |
| 6 | 0.1158 | 0.1159 | 67.3 | 0.1158 | **0.1153** | 136.6 | 0.1158 | 0.1159 | 3,948 | 0.1159 | 10,601 |
| 7 | 0.1138 | 0.1147 | 100.5 | 0.1138 | **0.1126** | 182.9 | 0.1138 | 0.1133 | 6,835 | 0.114 | 7,974 |

Table 5. Histogram Mismatch (Error) and Runtime Comparison of Different Methods

As expected, the relax-round method has the highest error in almost all cases. The integer programming solver should theoretically achieve the least error, but for Design 3 it has the highest. This is mainly because (i) we restrict the maximum runtime to one day due to the time limit of the server, and (ii) we only allow the solver to use the basis vectors with the 30,000 largest coefficients derived from solving Equation (4), because the server memory cannot handle a search space with the \(10^5\) dimensions exhibited by Equation (2). Forward step-wise regression significantly reduces the error compared to the relax-round method (by up to roughly \(50\times\), for Design 5), and in five cases it even surpasses the integer programming solver. L\(_1\)-regularization does not exhibit a similar improvement in some cases, probably because the time limit we set does not allow a thorough exploration of the hyper-parameters. If a suitable \(\lambda\) were found, L\(_1\)-regularization would be expected to reduce the error significantly in those cases as well, as it does for Designs 4 and 5. For Designs 6 and 7, because even the least-accurate relax-round method achieves an error similar to the integer programming solver, little improvement can be expected from the two sparse-regression strategies.

In terms of runtime, the relax-round method is the fastest, because it solves the simplest problem. In contrast, the integer programming solver is the slowest due to its high complexity; in some cases, it does not reach a final solution within the 86,000-second time limit. Forward step-wise regression runs slightly longer than, but remains comparable to, the relax-round method, and is more than \(100\times\) faster than the integer programming solver on average. L\(_1\)-regularization has a longer runtime, because the \(\text{L}_1\)-regularized problem must be solved multiple times with different hyper-parameters \(\lambda\) to find the optimal value.

Except for the integer programming solver, the other three methods first solve a relaxed problem without the integer constraint and then round the full-precision solution \({{\bf x}}_{\text{real}}\). So, in addition to the final error, the error of \({{\bf x}}_{\text{real}}\) for each of the three methods is recorded in the columns named “Full prec. error” in Table 5. Comparing these errors with the corresponding values in the “Overall error” columns reveals the effect of the rounding process for each method. The relax-round method is the most susceptible to rounding. When its \({{\bf x}}_{\text{real}}\) is close to an integer grid node, the error introduced by rounding is small (e.g., for Design 6), but in other cases, such as Designs 4 and 5, rounding increases the error by \(13\times\) and \(25\times\), respectively. In contrast, forward step-wise regression and L\(_1\)-regularization are affected much less: the differences between the two columns are relatively small and stable. This demonstrates that finding a sparse solution reduces the rounding error.
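The rounding step itself uses stochastic rounding (see footnote 2). A minimal sketch of unbiased stochastic rounding, with an illustrative function name:

```python
import math
import random

def stochastic_round(x):
    """Round x down to floor(x) or up to floor(x)+1, choosing the upper value
    with probability equal to the fractional part of x, so that the rounding
    is unbiased: E[stochastic_round(x)] = x."""
    lo = math.floor(x)
    return lo + (1 if random.random() < x - lo else 0)
```

Unbiasedness is what makes the decomposition into fitting and rounding errors convenient to analyze; nearest-rounding would bound each per-coordinate error by 0.5 instead.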

We also consider another scenario where the practitioner requires that the LCV implementation include at least 20 instances of each type of standard cell (i.e., the linear constraint takes the form of Equation (3) with the elements of \(\mathbf {c}\) set to 20). In this case, forward step-wise regression is not applicable, so in Table 6 we list the performance of the remaining three methods on Designs 4 and 5. Results similar to those in Table 5 can be observed; that is, the L\(_1\)-regularization method achieves an error similar to the integer programming solver (and much lower than relax-round) with significantly reduced runtime.

| Design | Relax-round full prec. error | Relax-round overall error | Relax-round time (sec.) | L\(_1\) full prec. error | L\(_1\) overall error | L\(_1\) time (sec.) | Solver [23] overall error | Solver [23] time (sec.) |
|---|---|---|---|---|---|---|---|---|
| 4 | 0.0056 | 0.0982 | 547.1 | 0.0056 | 0.0067 | 25,651 | 0.0062 | 86,000 |
| 5 | 0.0009 | 0.1832 | 350.2 | 0.0068 | 0.0172 | 20,198 | 0.0070 | 86,000 |

Table 6. Histogram Mismatch (Error) and Time Comparison of Different Methods with Linear Constraints

From Tables 2 to 6, it can be seen that classifier \(C_1\) contributes the most to the overall design time reduction. The three methods (\(C_1\), \(C_2\), and IPSA) address three independent stages of test chip design and can therefore be used in combination.

5.2.2 Detailed Analysis.

In this part, details of forward step-wise regression, along with an analysis of the relax-round method, are presented to demonstrate the tradeoff between the two sources of error, fitting and rounding.

For forward step-wise regression, the rounding process can be integrated into each iteration so that the final error can be tracked across iterations. In this way, the search for hyper-parameters can be eliminated, since we can simply take the iteration with the least overall error as the final solution. For example, Figure 17 shows the change of (i) the fitting error of the full-precision solution \({{\bf x}}_{\text{real}}\) and (ii) the overall error of the rounded solution \(\hat{{{\bf x}}}\), as the iterations proceed for Design 5. Because each iteration adds a new basis vector to the candidate set onto which \({{\bf b}}\) is decomposed, the fitting error decreases monotonically. Generally, the overall error stays close to the fitting error, because the effect of rounding is small for a sparse solution. However, as the number of non-zero coefficients increases, the gap between the two curves becomes noticeable and the overall error starts fluctuating due to the varying rounding errors. The lowest overall error is achieved at the 65th iteration rather than the last one, so we take the solution at the 65th iteration as the final solution.
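The per-iteration error tracking described above can be illustrated with a simplified, matching-pursuit-style sketch of the greedy loop. Note the assumptions: the paper's Algorithm 2 refits all selected coefficients each iteration, whereas this sketch fits each coefficient once, and it uses nearest-rounding and an illustrative function name for brevity:

```python
def greedy_fit(columns, b, n_iter):
    """Greedily pick the column most correlated with the residual, fit its
    coefficient by 1-D least squares, and record the overall (rounded) error
    after every iteration, so the best iteration can be chosen afterwards."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    residual = list(b)
    coeffs = [0.0] * len(columns)
    errors = []
    for _ in range(n_iter):
        # column with the largest |correlation| to the current residual
        j = max(range(len(columns)),
                key=lambda k: abs(dot(residual, columns[k])))
        a = columns[j]
        c = dot(residual, a) / dot(a, a)          # 1-D least-squares coefficient
        coeffs[j] += c
        residual = [r - c * ai for r, ai in zip(residual, a)]
        # overall error of the *rounded* solution, as tracked in Figure 17
        approx = [sum(round(coeffs[k]) * columns[k][i]
                      for k in range(len(columns)))
                  for i in range(len(b))]
        errors.append(sum(abs(p - t) for p, t in zip(approx, b)) /
                      sum(abs(t) for t in b))
    return coeffs, errors
```

Picking `argmin(errors)` afterwards mirrors choosing the 65th iteration in the Design 5 example.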


Fig. 17. The change of relative fitting error (i.e., \(||{\mathbf {A}}{{\bf x}}_{\text{real}}-{{\bf b}}||_1/||{{\bf b}}||_1\)) and relative overall error (i.e., \(||{\mathbf {A}}\hat{{{\bf x}}}-{{\bf b}}||_1/||{{\bf b}}||_1\)) as the iterations of Algorithm 2 proceed.

Another experiment is conducted to demonstrate the tradeoff between fitting and rounding error as the sparsity of the solution varies. For the relax-round method, we use only a subset of basis vectors, randomly selected from the columns of \({\mathbf {A}}\) for Design 5, and gradually include more of them to match the same target distribution. Figure 18 shows the change of fitting error and overall error with respect to different basis sizes (N). The deviation between the two curves results from the rounding error. When N is small, the effect of rounding is negligible. However, when N grows beyond \(2\times 10^5\), the rounding error begins to dominate the fitting error, demonstrating the tradeoff between the two as a function of solution sparsity. The optimal sparsity is determined by tracking the errors across iterations for forward step-wise regression, and by tuning the hyper-parameter \(\lambda\) for L\(_1\)-regularization.


Fig. 18. The change of relative fitting error (i.e., \(||{\mathbf {A}}{{\bf x}}_{\text{real}}-{{\bf b}}||_1/||{{\bf b}}||_1\)) and relative overall error (i.e., \(||{\mathbf {A}}\hat{{{\bf x}}}-{{\bf b}}||_1/||{{\bf b}}||_1\)) of the relax-round strategy for different basis sizes (N).


6 CONCLUSION

In this work, a new flow to design the CM-LCV is proposed, which significantly increases design efficiency. First, a methodology is developed to improve the efficiency of logic test chip design: two RF classifiers predict (i) whether a synthesis configuration will result in a unique FUB implementation and (ii) whether a unique FUB implementation has an acceptable level of testability, so unnecessary synthesis and ATPG can be avoided. Next, a method called IPSA is proposed to accelerate the step of solving an integer programming problem to match the target cell-usage distribution. IPSA solves the integer programming problem effectively by solving a transformed sparse-regression problem followed by a rounding process. Two strategies, namely forward step-wise regression and L\(_1\)-regularization, are investigated for solving the sparse-regression problem. Various design experiments demonstrate that the two methods individually speed up two separate steps with negligible performance degradation; used in combination, they significantly accelerate the overall design process.

A APPENDIX


A.1 Proof of Theorem 1

Proof.

For proof simplicity, we denote the vector selected in step 2 of Algorithm 2 as \(\mathbf {g}^{(k)}\). In other words: (19) \(\begin{align} \mathbf {g}^{(k)} = \mathop {\arg\!\max}\limits _{\psi \in \mathcal {D}} \langle {\mathbf {r}}^{(k-1)}, \psi \rangle . \end{align}\) Given the facts that (i) \({{\bf b}}^{(k)}\) is the best approximation to \({{\bf b}}\) from Span\((\lbrace \psi _j, j \in V^{(k)}\rbrace)\), and (ii) the best approximation of \(\mathbf {r}^{(k-1)}\) from the set \(\lbrace a \cdot \mathbf {g}^{(k)}: a \in \mathbb {R}\rbrace\) is \(\langle \mathbf {r}^{(k-1)}, \mathbf {g}^{(k)}\rangle \mathbf {g}^{(k)}\), we have \(\Vert \mathbf {b}-\mathbf {b}^{(k)}\Vert ^2 \le \Vert \mathbf {b}-\mathbf {b}^{(k-1)}-\langle \mathbf {r}^{(k-1)}, \mathbf {g}^{(k)}\rangle \mathbf {g}^{(k)}\Vert ^2\). Because \(\mathbf {r}^{(j)} = {{\bf b}}- {{\bf b}}^{(j)}, j = 1, \ldots ,k\), the following holds: (20) \(\begin{align} \Vert \mathbf {r}^{(k)}\Vert ^2_2 \le \Vert \mathbf {r}^{(k-1)}-\langle \mathbf {r}^{(k-1)}, \mathbf {g}^{(k)}\rangle \mathbf {g}^{(k)}\Vert ^2_2 = \Vert \mathbf {r}^{(k-1)}\Vert ^2_2 - \vert \langle \mathbf {r}^{(k-1)}, \mathbf {g}^{(k)}\rangle \vert ^2. \end{align}\) Using the equations \(\mathbf {b} = \mathbf {b}^{(k-1)}+\mathbf {r}^{(k-1)}\) and \(\langle \mathbf {b}^{(k-1)}, \mathbf {r}^{(k-1)} \rangle = 0\), we have (21) \(\begin{align} \Vert \mathbf {r}^{(k-1)}\Vert ^2 & = \langle \mathbf {r}^{(k-1)}, \mathbf {r}^{(k-1)} \rangle = \langle \mathbf {r}^{(k-1)}, \mathbf {b} \rangle - \langle \mathbf {b}^{(k-1)}, \mathbf {r}^{(k-1)} \rangle = \langle \mathbf {r}^{(k-1)}, \mathbf {b} \rangle \\ & = \sum _j x_j \langle \mathbf {r}^{(k-1)}, \psi _j \rangle \le \sup _{\psi \in \mathcal {D}} \langle \mathbf {r}^{(k-1)}, \psi \rangle \sum _j x_j = \sup _{\psi \in \mathcal {D}} \langle \mathbf {r}^{(k-1)}, \psi \rangle \Vert \mathbf {b}\Vert _{\mathcal {L}_1} \\ & = \langle \mathbf {r}^{(k-1)}, \mathbf {g}^{(k)} \rangle \Vert \mathbf {b}\Vert _{\mathcal {L}_1}. \end{align}\) Substituting Equation (21) into Equation (20) yields: (22) \(\begin{align} \Vert \mathbf {r}^{(k)}\Vert ^2 \le \Vert \mathbf {r}^{(k-1)}\Vert ^2 - \frac{\Vert \mathbf {r}^{(k-1)}\Vert ^4}{\Vert \mathbf {b}\Vert _{\mathcal {L}_1}^2} = \Vert \mathbf {r}^{(k-1)}\Vert ^2 \left(1 - \frac{\Vert \mathbf {r}^{(k-1)}\Vert ^2}{ \Vert \mathbf {b}\Vert _{\mathcal {L}_1}^2}\right). \end{align}\)

Assume a series of non-negative numbers \(a^{(0)} \ge a^{(1)} \ge \cdots \ge a^{(k)}\), where \(a^{(0)} \le M\) and \(a^{(k)} \le a^{(k-1)}(1-a^{(k-1)}/M)\); by induction, we can derive \(a^{(k)} \le M/(k+1)\). Applying this to Equation (22) with \(M = \Vert \mathbf {b}\Vert _{\mathcal {L}_1}^2\), we have: (23) \(\begin{align} \Vert \mathbf {r}^{(k)}\Vert ^2 \le \frac{\Vert \mathbf {b}\Vert _{\mathcal {L}_1}^2}{k+1}. \end{align}\) Theorem 1 then follows by taking the square root of both sides of Equation (23).□
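The inductive step invoked above (“by induction, we can derive \(a^{(k)} \le M/(k+1)\)”) can be spelled out as follows:

```latex
% Base case: a^{(0)} \le M = M/(0+1).
% The map f(t) = t(1 - t/M) satisfies f(t) \le M/4 for all t and is
% increasing on [0, M/2].
% For k = 1: a^{(1)} \le f(a^{(0)}) \le M/4 \le M/2 = M/(1+1).
% For k \ge 2: assume a^{(k-1)} \le M/k \le M/2; monotonicity of f gives
\begin{align*}
a^{(k)} \le f\!\left(a^{(k-1)}\right) \le f\!\left(\frac{M}{k}\right)
 = \frac{M}{k}\left(1-\frac{1}{k}\right)
 = M\,\frac{k-1}{k^{2}}
 \le \frac{M}{k+1},
\end{align*}
% where the last inequality holds because (k-1)(k+1) = k^{2}-1 \le k^{2}.
```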


A.2 Proof of Theorem 2

Proof.

Because \(\hat{{{\bf x}}}\) is the solution to Equation (13) with the constraint Equation (14), the following holds: (24) \(\begin{align} \Vert \mathbf {b}-\mathbf {A}\hat{{{\bf x}}}\Vert _2^2 \le \Vert \mathbf {b}-\mathbf {A}{{\bf x}}^*\Vert _2^2. \end{align}\) After rearranging and using Hölder’s inequality [2] and the bound for the \(\text{L}_1\)-regularization \(\lambda =\Vert {{\bf x}}^*\Vert _1\), we have (25) \(\begin{align} \Vert \mathbf {A}\hat{{{\bf x}}}-\mathbf {A}{{\bf x}}^*\Vert _2^2 & \le 2\langle \epsilon ,\mathbf {A}\hat{{{\bf x}}}-\mathbf {A}{{\bf x}}^* \rangle = 2 \langle \mathbf {A}^T\epsilon ,\hat{{{\bf x}}}-{{\bf x}}^* \rangle \\ & \le 2 \Vert \hat{{{\bf x}}}-{{\bf x}}^* \Vert _1 \Vert \mathbf {A}^T \epsilon \Vert _\infty \le 4 \Vert {{\bf x}}^*\Vert _1 \Vert \mathbf {A}^T \epsilon \Vert _\infty . \end{align}\) In Equation (25), \(\Vert {\mathbf {A}}^T\epsilon \Vert _\infty = \max _{j=1, \ldots ,n}\vert {\mathbf {A}}_j^T\epsilon \vert\) is a maximum of n Gaussian random variables. By the standard maximal inequality for Gaussian random variables, for any \(\delta \gt 0\), with probability of at least \(1-\delta\), (26) \(\begin{align} \max _{j=1, \ldots ,n}\vert {\mathbf {A}}_j^T\epsilon \vert \le \sigma \sqrt {2p\log (en/\delta)}. \end{align}\) Substituting Equation (26) into Equation (25), we have (27) \(\begin{align} \frac{1}{p}\Vert \mathbf {A}\hat{{{\bf x}}}-\mathbf {A}{{\bf x}}^*\Vert _2^2 \le 4\sigma \Vert {{\bf x}}^*\Vert _1 \sqrt {\frac{2\log (en/\delta)}{p}}, \end{align}\) which is the same as the rate given in Equation (16).□

Footnotes

  1. Mismatch rate of cell usage is the error between the target cell-usage distribution and that of the CM-LCV. Specifically, it is calculated as \(\frac{\sum |\Delta _i|}{T}\), where \(|\Delta _i|\) is the absolute difference between the number of instances of standard-cell i in the design and in the CM-LCV, and T is the total number of cells in the design.

  2. We use stochastic rounding [3] for simplicity of analysis, but similar arguments about the error decomposing into fitting and rounding errors generally hold when using nearest-rounding.

REFERENCES

  [1] Gurobi Optimization, LLC. 2023. Gurobi Optimizer Reference Manual. https://www.gurobi.com.
  [2] Cvetkovski Z. 2012. Hölder’s Inequality, Minkowski’s Inequality and Their Variants. Springer Berlin Heidelberg, Berlin, Heidelberg, 95–105.
  [3] Croci M., Fasi M., Higham N. J., Mary T., and Mikaitis M. 2022. Stochastic rounding: Implementation, error analysis and applications. Royal Society Open Science 9, 3 (2022), 211631.
  [4] Blanton R. D. and Hayes J. P. 1997. Properties of the input pattern fault model. In IEEE International Conference on Computer Design. 372–380.
  [5] Blanton R. D., Niewenhuis B., and Liu Z. 2015. Design reflection for optimal test chip implementation. In IEEE International Test Conference. 1–10.
  [6] Blanton R. D., Niewenhuis B., and Taylor C. 2014. Logic characterization vehicle design for maximal information extraction for yield learning. In IEEE International Test Conference. 1–10.
  [7] Breiman L. 2001. Random forests. Mach. Learn. 45, 1 (2001), 5–32.
  [8] Candès E. J. et al. 2006. Compressive sampling. In International Congress of Mathematicians. 1433–1452.
  [9] Clausen J. 1999. Branch and bound algorithms: Principles and examples. Department of Computer Science, University of Copenhagen (1999), 1–30.
  [10] Grant M. and Boyd S. 2008. Graph implementations for nonsmooth convex programs. In Recent Advances in Learning and Control (a tribute to M. Vidyasagar), V. Blondel, S. Boyd, and H. Kimura (Eds.). Lecture Notes in Control and Information Sciences, Springer, 95–110. http://stanford.edu/boyd/graph_dcp.html.
  [11] Hess C., Inani A., Joag A., and Zaragoza M. 2017. Stackable short flow characterization vehicle test chip to reduce test chip designs, mask cost and engineering wafers. In IEEE Advanced Semiconductor Manufacturing Conference. 328–333.
  [12] Huang Q., Fang C., Mittal S., and Blanton R. D. 2010. Improving diagnosis efficiency via machine learning. In IEEE International Test Conference. 1–10.
  [13] Kahng A. B., Mallappa U., and Saul L. 2018. Using machine learning to predict path-based slack from graph-based timing analysis. In IEEE 36th International Conference on Computer Design. 603–612.
  [14] Keutzer K. 1987. DAGON: Technology binding and local optimization by DAG matching. In 24th ACM/IEEE Design Automation Conference. 617–623.
  [15] Liu Z. and Blanton R. D. 2018. Back-end layout reflection for test chip design. In IEEE International Conference on Computer Design. 456–463.
  [16] Liu Z., Fynan P., and Blanton R. D. 2017. Front-end layout reflection for test chip design. In IEEE International Test Conference. 1–10.
  [17] Liu Z., Niewenhuis B., Mittal S., and Blanton R. D. 2016. Achieving 100% cell-aware coverage by design. In Design, Automation & Test in Europe Conference. 109–114.
  [18] Ma Y., Ren H., Khailany B., Sikka H., Luo L., Natarajan K., and Yu B. 2019. High performance graph convolutional networks with applications in testability analysis. In Design Automation Conference. 1–6.
  [19] Mitchell J. E. 2002. Branch-and-cut algorithms for combinatorial optimization problems. Handb. Appl. Optimiz. 1 (2002), 65–77.
  [20] Padberg M. and Rinaldi G. 1991. A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. SIAM Rev. 33, 1 (1991), 60–100.
  [21] Park J. and Boyd S. 2018. A semidefinite programming method for integer convex quadratic minimization. Optimiz. Lett. 12, 3 (2018), 499–518.
  [22] Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., and Duchesnay E. 2011. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12 (2011), 2825–2830.
  23. [23] Tawarmalani M. and Sahinidis N. V.. 2005. A polyhedral branch-and-cut approach to global optimization. Mathematical Programming 103, 2 (2005), 225–249.Google ScholarGoogle ScholarDigital LibraryDigital Library


        • Published in

          ACM Transactions on Design Automation of Electronic Systems, Volume 28, Issue 2
          March 2023
          409 pages
          ISSN: 1084-4309
          EISSN: 1557-7309
          DOI: 10.1145/3573314

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 22 March 2023
          • Online AM: 12 September 2022
          • Accepted: 9 August 2022
          • Revised: 24 July 2022
          • Received: 27 February 2022
