1 Introduction

Advanced industrial systems demand ever-higher product performance and, with it, stricter quality control during production [1,2,3]. Defects such as scratches, spots, or holes on the surface of a product adversely affect not only its aesthetics and ease of use but also its performance [4,5,6,7]. Defect detection is an effective means of reducing the adverse impact of such defects [8, 9].

Manual visual inspection is the traditional method of quality control for industrial products [10]. Although it may be superior in some cases, it is inefficient and prone to fatigue, and it is infeasible for applications in which a missed defect has dangerous consequences [11]. Because of its shortcomings, such as a low sampling rate, poor real-time performance, and low detection confidence, manual visual inspection cannot meet the efficiency and quality requirements of modern industrial production lines [12]. Hence, more efficient and reliable visual inspection technologies need to be developed.

Machine vision is one of the key technologies of intelligent manufacturing, and it has become an effective replacement for manual visual inspection [13, 14]. A machine vision system automatically acquires and processes images of a real object through optical devices and noncontact sensors. Vision is one of the highest levels of human perception, and images play a very important role in it [15]. However, human perception is limited to the visible band of the electromagnetic spectrum, whereas machine vision inspection technology can cover the whole spectrum, from gamma rays to radio waves [16]. Through powerful vision sensors, ingeniously designed optical transmission methods, and image processing algorithms, machine vision can accomplish many tasks that human vision cannot. With the development of computer equipment and artificial intelligence, machine vision, as a measurement and judgement technology, has been widely used in industry. Machine vision detection technology can improve detection efficiency and the degree of automation, enhance the real-time performance and accuracy of detection, and reduce manpower requirements, especially in large-scale repetitive industrial production processes. As a non-contact and non-destructive detection method, machine vision readily supports information integration, automation, intelligence, and precise control, and it has become a basic technology in computer integrated manufacturing and intelligent manufacturing. Moreover, machine vision has a wider spectral response range and a greater ability to work for long periods in harsh environments. Thus, its application in manufacturing processes can benefit a large number of industrial activities [17,18,19].

A typical industrial visual inspection system mainly consists of three modules—optical illumination, image acquisition, and image processing and defect detection [11, 20]—as shown in Fig. 1. First, based on the product characteristics and inspection requirements, an optical illumination platform is designed. Next, CCD cameras or other image acquisition hardware are used to convert the target objects placed in the light field into images and transmit them to a computer. As an information carrier, the images that can reflect the features of the objects constitute the core element of visual inspection; hence, their quality is very important. Excellent optical illumination platforms and suitable image acquisition hardware are the prerequisites for obtaining high-quality images. Finally, based on some traditional image processing algorithms or deep learning algorithms, various operations are carried out on the images to extract features and to perform classification, localization, segmentation and other operations. Image processing is a key technology in machine vision. Through image processing and analysis, a computer can automatically understand, analyze, and judge image features, and then control the actuator of the automatic production line for further operation [21].

Fig. 1 Typical industrial visual inspection system architecture and main components

In industry, this architecture can serve as a step-by-step guideline for designing a visual inspection system. For instance, in a visual inspection system for strongly reflective metal surfaces, investigating the surface characteristics was the first step; hence, diffuse bright-field back light illumination was adopted. Light-sensitive components were then used for image acquisition. After image acquisition, wavelet smoothing was used for image preprocessing, and Otsu thresholding was employed to segment the image. Finally, a support vector machine classifier was designed for defect classification [22].

The main evaluation indexes of a visual inspection system are accuracy, efficiency, and robustness, and the goals are high precision, high efficiency, and strong robustness. Achieving these goals requires excellent coordination among optical illumination, image acquisition, and image processing and defect detection.

This study focuses on the current state of development of industrial defect detection utilizing machine vision. The visual inspection modules, namely optical illumination, image acquisition, and image processing and defect detection, are discussed in detail. The light source and illumination system design are discussed in Sect. 2. Section 3 describes the image sensors and image acquisition design for particular scenarios. Then, as the main portion of this study, Sect. 4 focuses on defect detection tasks such as defect classification, localization, and segmentation, and it discusses representative traditional image processing methods and intelligent methods based on deep learning. Finally, insights into future research in defect detection based on machine vision are presented in Sect. 5.

2 Optical Illumination

Visual inspection technology is image-based and encompasses image acquisition and image processing [23]. The key to a successful visual inspection system lies in obtaining high-quality images. In general, image quality is mainly affected by two factors: optical illumination and image acquisition [24, 25]. The main function of an optical illumination platform is to overcome the interference of environmental lighting, ensure the stability of the image, and obtain an image with high contrast. Thus, the main goal of the optical illumination platform is to make the important features of the objects visible while suppressing undesired ones.

Research on optical illumination has a long history. In the 1980s, commercial white light sources for machine vision were not available in the market, and some light sources designed for workbenches could not be easily integrated into vision detection systems. With the transition of vision detection systems from the laboratory to industry, the optimization of optical illumination systems gradually became a research focus, and the importance of optical illumination in vision systems came to be understood at a preliminary level. In 1987, Mersch [26] systematically discussed the importance of optical illumination in vision systems. Based on the technical conditions at that time, he analyzed the application of polarization and color filters and pointed out the advantages of optical fiber lighting for illuminating a small area. Furthermore, he discussed the fluorescent marking lighting method and frequency flash lighting technology. Later, Cowan [27] designed the positioning of a camera and a light source by using their models and surface reflectivity to meet the requirements of a vision system. Sieczka et al. [28] presented a detailed exposition and discussion of some important issues related to light sources, such as light source efficiency, light divergence, spectral content, light source size, and packaging. Combined with mathematical programming, Yi et al. [29] discussed the placement design of sensors and light sources. Kopparapu [30] proposed a design method using multiple light sources to achieve uniform illumination, which treated the optimal positioning of the light sources as a minimization problem, and used simulations to verify the effectiveness and applicability of the method.

Despite the rapid growth of computer digital image processing, optical illumination still plays a significant role in visual inspection systems. For an on-line visual inspection system, compared with the long computation time required to process images with advanced algorithms, specially designed optical illumination for field lighting can achieve higher detection accuracy and can better meet the real-time requirements of production line visual inspection. Therefore, as an important part of machine vision applications, optical illumination deserves further discussion.

2.1 Light Source

Light is a typical energy source for image formation. Common light source devices include LED lamps of various shapes, high frequency fluorescent lamps, optical fiber halogen lamps, etc. Currently, LED lamps have become available for every type of machine vision application [31, 32]. An LED light source can be customized in several array configurations to achieve the desired irradiance [33, 34]. In vision applications, the most popular light source is a circular ring array of LEDs [35]. The circular ring array of LEDs possesses high brightness and can be conveniently installed. It can effectively avoid the shadow phenomenon and highlight the features to be detected. It is often used for IC chip appearance and character detection [36], printed circuit board (PCB) substrate detection [37], microscope illumination [38], etc. In structured lighting, the linear array of LEDs is widely used [35]. Furthermore, it has good heat dissipation and flexibility of usage, and can be used for defect detection of some large structural parts, such as copper strip [39] and steel sheet [40].

Visible light is a common light source. Different wavelengths of light have distinct characteristics and applications; as its wavelength changes, visible light assumes different colors [41, 42]. A white light source is a widely used multi-wavelength compound light, and a high-brightness white light source is suitable for color imaging. Blue light, with wavelengths between 430 and 480 nm, is suitable for sheet metal, machined parts, and other products with a silver-colored background, as well as for metal printing on film. Red light, with wavelengths typically between 600 and 720 nm, is relatively long-wavelength and can pass through dark objects; it is used in applications such as line detection and light-transmitting film thickness detection, and a red light source can significantly improve the contrast of an image. Green light, with wavelengths typically between 510 and 530 nm, lies between red and blue and is mainly used for products with red or silver-colored backgrounds.

Invisible light includes infrared light, ultraviolet light, and X-rays. The wavelength of infrared light is generally 780–1400 nm. Infrared light has a strong propagation ability and is generally used in liquid crystal display (LCD) screen detection and video monitoring industries [43]. The wavelength of ultraviolet light is generally 190–400 nm. Ultraviolet light has a short wavelength and strong penetration and is mainly used in certificate detection, ITO detection of touch screens, scratch detection of metal surfaces [44], etc. X-rays are electromagnetic waves with wavelengths from 0.01 to 10 nm; their short wavelength and good perspective effect make them widely used in various perspective tests in industry [45]. These wavelengths are invisible to the human eye but can be exploited by machine vision, which is another important advantage of machine vision over human vision.

To enhance the visibility of certain features, it is important to consider the interaction between light and objects, including the propagation mode of light when it reaches the surface of objects and the relationship between the wavelength of light and the color of objects [22]. Light propagates differently in different materials, and the defective part of an object also affects its propagation. The common defects in surface inspection fall into two categories: (i) geometric defects, such as pits, scratches, cracks, burrs, bulges, and bumps; (ii) surface strength defects or density defects, such as oxidation, rust, and stains. Geometric defects change the surface reflection, whereas surface strength or density defects change both the reflection and the absorption. Opaque objects, which are common in visual inspection, reflect or absorb color light of different wavelengths; the absorbed light cannot be seen, and only the reflected light can act directly on the image acquisition devices. With a black-and-white camera, reliable and stable detection can be achieved by selecting a light source of a specific wavelength to highlight the grayscale difference between the region to be detected and the rest of the object surface. Therefore, the contrast of the image can be enhanced by effectively selecting the wavelength of light or combining multiple wavelengths of light.

2.2 Fundamental Illumination Modes

With the development of optical illumination technology, various designs of illumination structures have emerged [46]. In the field of machine vision, based on the positional relationship among the light source, object, and camera, illumination can be divided into forward and back illumination. According to the characteristics of the light source itself, it can be further divided into structured light and stroboscopic light.

2.2.1 Forward and Back Illuminations

In forward lighting, the light source and the camera are located on the same side of the object. As the most widely used illumination method, forward lighting is suitable for detecting surface defects, scratches, and the important details of objects, especially surface texture features. The angle between the light beam and the object surface affects the illumination effect. Depending on whether the light is directly reflected onto the camera, forward lighting is divided into bright field forward lighting and dark field forward lighting, as shown in Fig. 2a, b. Reducing the incident angle of dark field forward lighting yields low angle dark field forward lighting, which can highlight the edges and height variations of the surface, enhance the topological structure of the image, and clearly reveal surface concavity and convexity. Coaxial forward lighting is a special forward lighting mode: high-intensity uniform light passes through a half mirror to form light coaxial with the lens, as shown in Fig. 2c. Coaxial forward lighting provides more uniform illumination than traditional lighting modes while avoiding reflections from the object; it therefore improves the accuracy and reproducibility of machine vision and can be used to detect surface defects, cracks, scratches, etc. For a highly reflective object with a smooth surface, the light is first projected onto a rough dome cover to produce non-directional, soft light and then onto the surface of the detected object, which avoids the strong reflection produced by direct lighting, as shown in Fig. 2d. Scattering forward lighting with a dome structure is commonly used in solder joint detection, chip pin detection, etc. In back lighting, the light source is placed behind the object, as shown in Fig. 2e. A significant feature of back lighting is that it can highlight the silhouette of opaque objects or reveal the interior of transparent objects. Its advantage is that it clearly outlines the edge of the object to be measured, so it is often used in object shape detection and dimension detection. Table 1 compares these typical illumination modes.

Fig. 2 Schematic diagram of typical illumination modes: a bright field forward lighting; b dark field forward lighting; c coaxial forward lighting; d scattering forward lighting of dome structure; e back lighting

Table 1 The performances of typical illumination modes

2.2.2 Structured Light Illumination

Structured light illumination shapes the light into a known pattern by specific means so that three-dimensional object information can be detected using two-dimensional vision [47], as shown in Fig. 3. First, the specific light pattern is projected onto the object surface and the background. Then, a camera collects the image containing the changes in the light signal caused by the structure of the object. Finally, the position and depth of the object are calculated by digital image processing, and the whole 3D space is restored [48].
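To make the triangulation step concrete, the sketch below recovers depth along a laser stripe from a simple 2D pinhole model; the geometry (camera at the origin, emitter offset by a known baseline, laser plane tilted by a known angle) and all parameter names are illustrative assumptions rather than the setup of Refs. [47, 48].

```python
import numpy as np

def stripe_depth(u, fx, cx, baseline, laser_angle):
    """Depth from a laser-stripe observation under a 2D pinhole model.

    u           -- pixel column(s) where the stripe is detected
    fx, cx      -- focal length (px) and principal point column of the camera
    baseline    -- lateral offset between laser emitter and camera centre (m)
    laser_angle -- tilt of the laser plane w.r.t. the optical axis (rad)
    """
    x_n = (np.asarray(u, dtype=float) - cx) / fx   # normalized image coordinate
    # Intersect the camera ray x = z * x_n with the laser ray x = b - z * tan(theta)
    return baseline / (x_n + np.tan(laser_angle))

# Example: stripe detected at column 700 with fx = 1200 px, cx = 640 px
# stripe_depth(700, 1200, 640, baseline=0.1, laser_angle=np.radians(20)) -> ~0.24 m
```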

Fig. 3 Schematic diagram of structured light illumination [49]. Copyright 2018, Elsevier

Structured light illumination technology is widely used in visual measurement and inspection. Based on laser structured light vision, Li et al. [50] developed an inspection system for weld bead profile monitoring, measurement, and defect detection with scale calibration. Lilienblum and Al-Hamadi [51] presented a novel technique for optical 3D surface reconstruction that combines line-scan cameras with structured light, using triangulation in a 2D plane. It can measure continuously, whereby a single surface scan is sufficient to calculate a high-quality 3D reconstruction.

2.2.3 Stroboscopic Light Illumination

Stroboscopic light is an illumination technology applied in optical imaging that can freeze the motion of a moving object. An appropriate optical pulse can eliminate motion blur in images of fast-moving objects, which makes it very suitable for on-line high-speed machine vision detection. By increasing the brightness of the stroboscopic light, the exposure time can be reduced, and the whole vision detection system can run faster. In a stroboscopic illumination environment, the aperture can also be reduced to obtain a greater depth of field. To solve the problem of blurred images when high-speed moving objects are photographed under a continuous light source, Chen et al. [52] designed a narrow-pulse, high-current strobe light with a high-illumination LED as the light source; a field-programmable gate array (FPGA) chip generates the pulse signal that controls the timing of the stroboscopic light source.
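The admissible pulse width follows directly from the object speed and the pixel footprint: the object should move less than about one pixel while the light is on. Below is a minimal sketch of that back-of-the-envelope calculation; the function name and the one-pixel blur budget are illustrative assumptions.

```python
def max_strobe_pulse(object_speed, field_of_view, sensor_pixels, max_blur_px=1.0):
    """Longest light pulse keeping motion blur below max_blur_px pixels.

    object_speed  -- line speed along the motion direction (m/s)
    field_of_view -- imaged width along the motion direction (m)
    sensor_pixels -- pixel count along that direction
    """
    pixel_footprint = field_of_view / sensor_pixels  # metres per pixel on the object
    return max_blur_px * pixel_footprint / object_speed

# A strip moving at 2 m/s imaged over 100 mm on a 2048-pixel line sensor:
# max_strobe_pulse(2.0, 0.1, 2048) -> ~2.4e-5 s, i.e., a pulse of about 24 microseconds
```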

2.2.4 Auxiliary Optical Devices for Illumination

In practical applications, production lines and working environments impose different requirements on the brightness, working distance, and irradiation angle of light sources. In some application environments, it is very difficult to obtain a good visual image merely by adjusting the light source type or irradiation angle. In such cases, special auxiliary optical devices are needed.

The common auxiliary optical devices include filters, reflectors, spectroscopes, prisms, polarizers, diffusers, optical fibers, screens, etc. In the image acquisition stage, a filter can eliminate some noise interference and improve the signal-to-noise ratio (SNR) of the image, consequently improving the efficiency of the system. A reflector can change the path and angle of the light, alter the observation distance, enable simultaneous or time-shared observation of multiple targets, and provide more freedom in mounting the light source. In a spectroscope, the ratio of reflected to refracted light can be adjusted by changing the coating parameters; coaxial illumination is a special case of a spectroscope. A prism can separate multi-colored compound light to obtain a single-frequency light source. A polarizer can eliminate the reflection of light from non-metallic surfaces. A diffuser can make light more uniform and reduce unwanted reflections. An optical fiber can gather the light beam into an optical fiber tube for transmission, making the installation of the light source more flexible and convenient. Auxiliary optical devices can be of great help in industrial defect detection. For example, a metal surface has a high reflection coefficient, which makes it difficult to design a proper lighting system for defect enhancement. To suppress this reflected light, Zhang et al. [22] designed diffuse bright-field back light illumination and mounted a polarizing filter in front of the camera, oriented such that the polarized light was suppressed.

2.3 Illumination System Design

A light source can be designed in various shapes and structures so that the emitted light has different characteristics. An effective way to achieve a specific lighting function is through an innovative design that combines various fundamental illumination methods and auxiliary optical devices. For some special occasions, special-purpose illumination methods are available, including point light source illumination, shadowless illumination, parallel light optical unit illumination, microscope illumination, and customized illumination based on customer requirements.

For a visual inspection project that aims to obtain high-quality images, a targeted optical illumination system must be designed. First, according to the specific needs of the project, key factors such as the characteristics and motion state of the objects, the surrounding environment, and the type of camera should be analyzed. Then, the difference between the target and the background is studied to find the optical phenomena that distinguish them. According to the characteristics of the materials and the interaction between the light source and the objects, the type and color of the light source are preliminarily determined. Finally, experiments are carried out, and the illumination system is adjusted based on the test results until it meets the requirements of visual inspection. The following cases illustrate this process for highly reflective surfaces, special-shaped structures, moving objects, and minimally invasive surgery (MIS).

1. Highly reflective surfaces are widely used in the automobile, aviation, life science, and aerospace industries, and these application scenarios place high demands on surface quality. The optical double-pass retro-reflection surface inspection technique is a typical optical detection technique realized by cleverly designing the light reflection path, as shown in Fig. 4. It can detect very small out-of-plane surface distortions, such as indentations and protrusions, on a specularly reflective surface [53]. Its advantage is that a large surface area can be observed in real time, so it can be used for online real-time visual inspection.

2. For a belt condition monitoring system, due to the special shape of the belt, unique design requirements are put forward for the illumination system. Yang et al. [54] arranged high-brightness linear light sources in a vaulted shape. This lighting design can adapt to the structural characteristics of the upper belt and improve the detection efficiency.

3. To cater to the diverse reflection characteristics of the surface of tin steel strips and the different speeds of a tinning line, Peng and He [55] proposed an adaptive illumination light source. This light source was integrated with a time delay integration charge-coupled device to capture the images of the moving objects and facilitate inspection of the surface quality of the tin steel strips.

4. Combining structured light and white light can exploit the advantages of both to achieve the desired effect. Clancy et al. [56] proposed an MIS stroboscopic illumination system in which structured light and white light are interleaved during high-speed camera acquisition. Besides playing its role in the corresponding cycles, the structured light is not perceived, and the white light can be used solely for navigation and visual assessment while the structured light is shielded.

Fig. 4 Schematic diagram of the double-pass retro-reflection illumination system [53]. Copyright 2007, Elsevier

Optical illumination plays an important role in visual inspection. To achieve an appropriate illumination effect for a specific scenario, a suitable light source should be selected by considering the characteristics of the light source and the interactions between the light and the objects. For an innovative optical illumination system design, an effective combination of fundamental illumination modes is the preferred approach, and auxiliary optical devices can also help significantly.

3 Image Acquisition

In an appropriate optical illumination environment, an object surface can be imaged onto a camera sensor by an optical lens. The optical signal is converted into an electrical signal and then into a digital signal that can be processed by a computer, completing the acquisition of the product surface image.

Image acquisition technology focuses on the characteristics of sensor devices and the field-of-view design. The typical photosensitive devices of industrial cameras are mainly based on charge-coupled device (CCD) or complementary metal oxide semiconductor (CMOS) chips [57, 58]. Image acquisition technology for many conventional scenarios has become relatively mature and is not elaborated here. However, for some special detection requirements, a reasonable field-of-view design and an effective photosensitive sensor selection can be very important. Several representative image acquisition schemes for particular scenarios are discussed below.

3.1 CCD and CMOS

CCD and CMOS image sensor technologies are essential for image capture. Both convert optical signals into electrical signals; however, the two types of chips transmit this information in different ways, and their respective designs are totally different.

The CCD, which is a photoelectric converter, originated in the early 1970s and developed to maturity in the 1990s [59, 60]. In 1974, White et al. [61] discussed the image array characteristics of a low illuminance area array CCD. In 1978, Dillon et al. [62] discussed a color imaging system using a single CCD area array. In 1990, Beyer [63] discussed the calibration of CCD for machine vision and robotics. In the CCD chip, the charge of the photosensitive pixel shifts and is converted into a signal. The CCD has a series of advantages, such as small distortion, small volume, low system noise, self-scanning, light weight, small power consumption, long life, wide sensing spectrum range, and high reliability. It can be made into a highly integrated assembly. The CMOS image sensors have been around for almost as long as the CCD; however, it was not until the 1990s that commercial CMOS sensor chips were manufactured [60].

Currently, CCD sensors are widely used in machine vision [64,65,66]. CMOS image sensors are comparatively younger and still maturing [67, 68]; nevertheless, they can achieve image quality similar to that of CCD products and have made great breakthroughs in power consumption and integration.

3.2 Image Acquisition Schemes

This section discusses the state of the art in image acquisition system design from the aspects of multiple views, omnidirectional vision, micro-domain vision, and multispectral imaging.

3.2.1 Multiple Views

In visual inspection, for parts with complex structures, it is difficult to capture all the key information based on a single image. In this case, only a collection of multiple images can show the features to be inspected.

Sun et al. [69] designed a machine vision system to acquire three-view images of one electric contact (EC). For each view, the system incorporated different image pre-processing and feature extraction methods to enhance and detect the surface defects. Chiou and Li [70] proposed a multi-view system for the inspection of PU-packing. Their system consisted of three inspection stations: Station 1 obtains image information of the top and bottom surfaces of the packing; Station 2 uses another camera to check the interior of the packing; and Station 3 uses two line-scan cameras to simultaneously scan the inner and outer cylindrical surfaces. In this way, each inspection station performs its assigned tasks, and multiple-view images of the PU-packing can be collected efficiently on the work line. In bearing inspection, many parts need to be inspected, such as the inner and outer rings. Shen et al. [71] designed a new image acquisition system for bearing cover inspection. To obtain enhanced deformation information, three bearings were captured in one image: the left and right bearings were inspected for deformation defects, while the center bearing was inspected for defects other than deformations. This was an efficient and ingenious image acquisition system.

3.2.2 Omnidirectional Vision

Omnidirectional vision is mainly implemented by installing a fisheye lens [72]. Pipes are used to transport gas, liquid, or fluid with solid particles, and inspecting their integrity often involves visual inspection of the inner pipe wall. For perspective stereo cameras with a limited viewing angle, a ring of cameras must be built. Hansen et al. [73] introduced a visual odometry-based system, using calibrated fisheye imagery and sparse structured lighting, to produce high-resolution 3D textured surface models of the inner pipe wall. The prototype robot with a fisheye lens and the fiberglass pipe network used for testing are shown in Fig. 5. Their results showed that high-precision pipe mapping could be achieved with a single fisheye camera. The advantage of this wide-angle fisheye lens system is that a single camera achieves full pipe coverage, thus avoiding the challenge of multi-camera calibration and keeping the overall size compact. This method is of great significance for improving the efficiency of pipe inspection.

Fig. 5 Prototype robot and fiberglass pipe network used for testing [73]. Copyright 2015, SAGE Publications

Contact lenses are light and convenient, and their quality has a major influence on the human eye. For contact lens detection, Chen et al. [74] presented a fisheye-lens omnidirectional imaging scheme for a contact lens inspection system and demonstrated its feasibility. The optical reflection of an object surface depends on its material and microstructure, so light reflection measurement is an important task in the inspection of industrial parts. Kogumasaka et al. [75] developed a small reflection measurement system using a fisheye camera and demonstrated that the fisheye camera system is an effective means of high-precision surface finish inspection.

3.2.3 Micro-Domain Vision

Quality inspection within mass production of micro-parts is a big challenge [76, 77]. During a micro-manufacturing process, the occurrence of surface imperfections is a critical problem [78]. Nevertheless, conventional detection platforms are often unable to detect micro-defects on micro-parts [79]. In this regard, some researchers have put forth micro-domain vision detection technologies to acquire and analyze 2D textures and 3D shape information, which effectively solve this problem.

For metallic micro components, Weimer et al. [80] proposed an image acquisition technology based on plenoptic cameras. The design of plenoptic cameras is relatively compact and can easily realize integrated manufacturing. Effective 2D and 3D information can be obtained in one measurement step by using plenoptic cameras to acquire images of micro-components. This method meets the requirements of quality detection in a micro-domain. To realize on-line surface detection, Scholz-Reiter et al. [76] designed an image acquisition system for micro-part surface imperfections using confocal laser microscopy and realized automatic detection of defects. Li et al. [81] designed a quality inspection system by using micro-vision technology to get graphic information of the micro-accessory. In these methods, the micro-domain vision technology played a significant role in the task of acquiring high-resolution images.

3.2.4 Multispectral

In some industrial detection scenarios, multiple photosensitive imaging devices must be combined according to the wavelength characteristics of the light, so that the collected images fully represent the characteristics of the objects to be detected. A multispectral imaging system can make up for the shortcomings of traditional CCD photosensitive imaging.

Machine vision has great potential for detecting locomotive and rolling stock condition. Multispectral imaging allows recording of physical and thermal conditions, and their correlations. Combining multispectral imaging with machine vision, Hart et al. [82] proposed a multispectral machine vision technology, in which some visible and infrared (thermal) cameras were placed below the track to capture images. This technology can monitor the physical and thermal state of railway equipment more effectively than the existing methods and technologies.

In addition to the above methods, there are high-dynamic-range imaging [83, 84] and multi-vision imaging [85] systems, among others. In each specific visual detection project, the characteristics and detection requirements of the objects to be inspected should be considered when selecting an appropriate image acquisition method.

4 Image Processing and Defect Detection

Images are the information carriers of machine vision. Image processing and analysis are the key technologies for automatically obtaining an understanding of the images acquired by hardware in vision detection systems [86].

Image processing has a long history of development. In the 1920s, the first digitally coded images were transmitted from London to New York via submarine cable, marking the origin of digital image processing technology [87]. In the early days, simple defect detection could be achieved with primitive filtering methods. For example, in 1973, in an early attempt to apply visual inspection to industrial production, Ejiri et al. [88] described a method that employed two-dimensional nonlinear logical filtering to detect defects in complicated patterns, such as PCBs, in real time. Subsequently, Hara et al. [89] proposed an algorithm for comparing the local features of the patterns to be inspected with those of a reference pattern, with intended applications to an automatic PCB inspection system.

Currently, with the development of computer technology and mathematical theory, image processing and analysis methods have become more abundant and advanced. Flexible configurations in modern manufacturing systems can allow them to quickly switch from one product to another [90, 91]. For conventional machine learning, complex feature extractors need to be designed for particular cases so that the desired features can be retrieved. In addition, new products may present complex texture patterns or intensity changes, and surface defects can be of any size, direction, and shape. Therefore, manually designed features may lead to insufficient or unsatisfactory inspection performance in complex surface scenarios or dynamic processes. Compared with traditional machine learning, the main advantage of deep learning is that these rich features are not designed by human engineers but are learned automatically through convolutional neural networks from raw data [92]. Deep learning has been proven to be very adept at discovering complex structures in high-dimensional data [93]. Therefore, for defect detection by machine vision systems based on image processing technology, deep learning can play an important role in inaugurating the era of intelligent detection with machine vision.

In industrial production, there are three kinds of representative defect detection tasks based on machine vision: classification, localization and segmentation. Some primitive image preprocessing methods can help the subsequent image analysis, and sometimes may deal with a few simple defect detection tasks. For most defect detection scenarios, more image processing methods are needed to extract enough features for understanding defect information. For image feature learning, the main types of deep learning network architecture include convolutional neural networks (CNNs) [94], deep belief networks (DBNs) [95], and stacked auto-encoders (SAEs) [96]. Furthermore, long short-term memory (LSTM) [97] plays an important role in images with time-sequenced characteristics. DBNs and SAEs can help multi-feature fusion detection achieve better effect and accuracy.

4.1 Image Preprocessing

The purpose of image preprocessing is to enable the machine to understand the image better and to prepare for the next step of image analysis [98]. The principle of image preprocessing is to eliminate irrelevant information and recover useful real information. Several factors may introduce image noise, such as the field environment of machine vision, the photoelectric conversion of the CCD, the transmission circuit, and the electronic components. Such noise reduces image quality, which in turn adversely affects image analysis. Therefore, denoising is the main objective of image preprocessing.

Image preprocessing generally comprises spatial domain methods and frequency domain methods [86]. The main preprocessing algorithms include grayscale transformation, histogram equalization, various filtering algorithms based on spatial and frequency domains [99, 100], etc. In addition, mathematical morphology can also be used for image denoising [101].

The basic method for converting from the spatial domain to the frequency domain is the Fourier transform, and a representative tool for image processing in the transform domain is the wavelet transform.

4.1.1 Fourier Transform

The Fourier transform has helped industry and academia prosper in an unprecedented manner [102]. Before the Fourier transform, image processing was confined to spatial domain operations, in which various spatial filtering algorithms convolve the image with various templates. For example, the direct grayscale transformation transforms each pixel of the image according to a certain function to obtain the enhanced image. In general, spatial filtering algorithms are easy to implement and offer high real-time performance; however, they are not suitable for complex image processing.

The Fourier transform can transform the image from the spatial domain to the frequency domain, and its inverse transform can transform the image from the frequency domain back to the spatial domain [103, 104]. For image processing, the two-dimensional discrete Fourier transform (DFT) is represented as:

$$F(u,v) = \sum\limits_{x = 0}^{M - 1} \sum\limits_{y = 0}^{N - 1} f(x,y)\, \mathrm{e}^{-\mathrm{j} 2\pi \left( ux/M + vy/N \right)},$$
(1)

and the inverse discrete Fourier transform (IDFT) is

$$f(x,y) = \frac{1}{MN} \sum\limits_{u = 0}^{M - 1} \sum\limits_{v = 0}^{N - 1} F(u,v)\, \mathrm{e}^{\mathrm{j} 2\pi \left( ux/M + vy/N \right)},$$
(2)

where f(x, y) denotes a digital image of size M × N, whose frequency domain representation F(u, v) is obtained by the DFT of formula (1) [87]. In formulas (1) and (2), u (u = 0, 1, 2, …, M − 1) and v (v = 0, 1, 2, …, N − 1) are the frequency domain variables, and x (x = 0, 1, 2, …, M − 1) and y (y = 0, 1, 2, …, N − 1) are the spatial domain variables. In addition, j is the imaginary unit, equal to the square root of −1.

Through the Fourier transform, the image can be converted to the frequency domain for transformation and manipulation. In the frequency domain, the data reflect the intensity of grayscale changes in the image. Frequency domain filtering modifies the Fourier transform of the image and then computes its inverse transform to obtain the processed result. For example, the moving average window filter and the Wiener linear filter denoise with a low-pass filter, based on the premise that noise energy is concentrated at high frequencies while the image spectrum is distributed over a limited range [87]. For noise removal, Bai and Feng [98] introduced a new class of fractional-order anisotropic diffusion equations using the DFT; their experiments showed that the proposed equations yielded good visual effects and better SNR when denoising real images. However, the frequency domain transformation is complex, and its computational cost is high.
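As a concrete illustration of frequency-domain filtering in the sense of Eqs. (1) and (2), the sketch below attenuates high frequencies with a Gaussian low-pass transfer function using NumPy's FFT; the cutoff sigma is an assumed, application-dependent value.

```python
import numpy as np

def gaussian_lowpass_denoise(image, sigma=30.0):
    """Denoise by DFT -> multiply by Gaussian low-pass H(u, v) -> inverse DFT."""
    F = np.fft.fftshift(np.fft.fft2(image))       # centred spectrum F(u, v)
    M, N = image.shape
    u = np.arange(M) - M // 2
    v = np.arange(N) - N // 2
    D2 = u[:, None] ** 2 + v[None, :] ** 2        # squared distance from the DC term
    H = np.exp(-D2 / (2.0 * sigma ** 2))          # Gaussian low-pass transfer function
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * H)))
```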

4.1.2 Wavelet Transform

In recent years, the wavelet transform has been demonstrated to be a powerful approach for noise reduction and has become a prime field of image processing research [105, 106]. The wavelet transform provides localized analysis in time (or space) and frequency, gradually refining the signal by scaling and translation [107]. It can subdivide time at high frequencies and frequency at low frequencies, thus automatically adapting to the requirements of time–frequency signal analysis.

The wavelet transform plays an important role in image processing. Luisier et al. [108] introduced an inter-scale orthonormal wavelet thresholding algorithm in which the denoising process was parameterized as a sum of basic nonlinear processes with unknown weights, and the mean square error between the denoised image and the clean image was minimized. Jain and Tyagi [109] presented an edge-preserving denoising technique based on wavelet transforms; they decomposed the noisy image and improved the denoising performance by clustering. Yan et al. [110] presented a novel wavelet thresholding procedure to suppress additive Gaussian noise in images, which effectively overcame the discontinuity of the hard threshold function. For the inspection of strongly reflective metal surface defects, Zhang et al. [22] removed noise effectively by setting certain wavelet coefficients to zero (wavelet smoothing). In addition, the wavelet transform has been widely used in image fusion [111, 112], image coding [107, 113], image compression [114], image encryption [115], and image enhancement [116, 117].
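In the same spirit as the thresholding methods above, the following sketch performs soft-threshold wavelet denoising with the PyWavelets package; the db4 wavelet, three decomposition levels, and the universal threshold with a noise estimate from the finest diagonal band are conventional assumptions, not a reproduction of any cited method.

```python
import numpy as np
import pywt

def wavelet_denoise(image, wavelet="db4", level=3):
    """Decompose, soft-threshold the detail coefficients, and reconstruct."""
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    # Robust noise estimate from the finest diagonal detail band
    sigma = np.median(np.abs(coeffs[-1][-1])) / 0.6745
    thresh = sigma * np.sqrt(2.0 * np.log(image.size))   # universal threshold
    denoised = [coeffs[0]] + [
        tuple(pywt.threshold(band, thresh, mode="soft") for band in detail)
        for detail in coeffs[1:]
    ]
    return pywt.waverec2(denoised, wavelet)
```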

4.2 Classification

Defect classification is usually used to detect whether a certain defect exists in an image. The traditional way to solve the problem of computer vision is to classify the preprocessed images according to hand-crafted features. Most of the research has focused on the construction of hand-crafted features and classification algorithms, and some outstanding work has emerged.

Feature extraction derives information that describes the characteristics of the target from the image pixels and then maps the differences between different targets to a lower-dimensional feature space, which compresses the amount of data and improves recognition efficiency. The common defect features used in visual inspection include greyscale features, shape and size features, and texture features. Greyscale features are the most intuitive features of an image, such as greyscale statistics and greyscale difference statistics. Shape and size features are important for identifying various defects: common defects can be detected from shape information, such as lines, curves, ellipses, and rectangles, and size information, such as area and perimeter. Texture is an important image feature that reflects the homogeneity of an image independently of color or brightness; it represents important information about the arrangement of surface structures and their relationships with their surroundings [118, 119].

According to the characteristics of the defects, there are many feature extraction methods that can be used for defect classification.

As simple and effective feature descriptors based on statistical characteristics, histograms are widely used in the field of computer vision. For example, Li et al. [120] proposed a defect classification algorithm based on histogram features for automatically detecting defects in both nonpatterned and patterned fabrics. Common statistical features of histograms include the maximum, minimum, mean, median, range, entropy, variance, L1 norm, L2 norm, Bhattacharyya distance, and normalized correlation coefficient. These features are simple to compute and invariant to translation and rotation. However, they reflect only the probability of the greyscale levels of the image and not the spatial distribution of the pixels [121, 122].
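For illustration, a subset of the histogram statistics listed above can be computed as follows; the bin count and the particular features chosen are arbitrary.

```python
import numpy as np

def histogram_features(gray, bins=256):
    """First-order statistics of the grey-level histogram of an 8-bit image."""
    hist, _ = np.histogram(gray, bins=bins, range=(0, 256), density=True)
    levels = np.arange(bins)
    mean = float((hist * levels).sum())
    variance = float((hist * (levels - mean) ** 2).sum())
    entropy = float(-(hist[hist > 0] * np.log2(hist[hist > 0])).sum())
    return {"mean": mean, "variance": variance, "entropy": entropy,
            "median": float(np.median(gray)),
            "range": float(gray.max() - gray.min())}
```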

The grey-level cooccurrence matrix (GLCM) is a common method of describing a texture by studying the spatial correlation properties of the greyscale. It reflects the comprehensive information from the image grey levels regarding the direction, adjacent interval and change amplitude, which can be used to analyze the image primitives and arrangement structure [123,124,125]. The Gabor transform is a type of windowed short-time Fourier transform. The window function is the Gaussian function. This transform simulates the biological action of human eyes and can extract relevant features in different scales and directions in the frequency domain [126, 127]. Raheja et al. [128] presented a new scheme for an automated fabric defect detection system using the GLCM and Gabor filter method. The experimental results showed that, compared with the Gabor filter method, the GLCM has greater accuracy and computational efficiency in the same environment.
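A brief sketch of GLCM texture features with scikit-image (the functions are spelled greycomatrix/greycoprops in versions before 0.19); the distances, angles, and property set are illustrative choices.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray):
    """Haralick-style descriptors from a grey-level co-occurrence matrix.

    gray -- 8-bit (uint8) greyscale image
    """
    glcm = graycomatrix(gray,
                        distances=[1, 2],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    props = ("contrast", "homogeneity", "energy", "correlation")
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])
```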

The local binary pattern (LBP) expresses the relationship between the local neighborhood point and the center point through binary bits [129]. It has strong robustness to changes in the image greyscale level caused by changes in illumination [127, 130, 131]. For fabric defect classification, Zhang et al. [132] proposed an algorithm that combines the LBP and GLCM. The LBP and GLCM are used to extract the local feature information and overall texture information of the defect images, respectively. However, the LBP algorithm constructs a histogram of the defect images based on spatial neighborhood pixel coding, which may result in losing the discrimination information of the defect images.
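Similarly, a rotation-invariant uniform LBP histogram can be computed with scikit-image; the sampling parameters P and R are typical but assumed values.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray, P=8, R=1.0):
    """Normalized histogram of uniform LBP codes (P points on radius R)."""
    codes = local_binary_pattern(gray, P, R, method="uniform")
    n_bins = P + 2                        # the "uniform" mapping yields P + 2 codes
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist
```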

The scale-invariant feature transform (SIFT) is an image descriptor for image-based matching and recognition [133, 134]. It can achieve reliable feature matching across different perspectives by extracting distinctive invariant features from images. The extracted features are invariant to image scaling, rotation, 3D affine transformations within a certain range, superimposed noise, and illumination changes. Dunderdale et al. [135] used the SIFT descriptor combined with a random forest classifier to identify defective photovoltaic modules. The SIFT descriptor showed good performance and could be used both to detect and to describe local feature points. However, SIFT has high requirements for image quality, which limits its application.

Histogram of oriented gradients (HOG) features are formed by computing statistical histograms of gradient directions in local regions of the image [136]. They maintain good invariance to geometric and photometric deformations of the image. Halfawy and Hengmeechai [137] presented an efficient pattern recognition algorithm that employed the HOG and a support vector machine (SVM) to automate the detection and classification of pipe defects. Compared with the LBP, the HOG can more easily extract edge information and takes the structural information of the image into account. However, the HOG may suffer from high dimensionality and neglect texture information.

Speeded up robust features (SURF) [138], binary robust independent elementary features (BRIEF) [139], and oriented FAST and rotated BRIEF (ORB) [140] are also used in feature extraction. Furthermore, there are many variations of the classical method; for example, the LBP family includes the completed local binary pattern (CLBP) [141], elliptical local binary pattern (ELBP) [142], adjacent evaluation completed local binary pattern (AECLBP) [5], and robust local binary pattern (RLBP) [143]. Based on these classical algorithms, some novel feature-extraction algorithms have also been proposed in recent years; for instance, Zhao et al. [144] proposed a discriminant manifold regularized local descriptor (DMRLD) algorithm for steel surface defect classification. Compared with hand-crafted histograms, DMRLD achieves better robustness by using the structure of a manifold with a learning mechanism to represent the information contained in the image.

There are many kinds of feature extraction methods with their own advantages and disadvantages. For specific visual inspection items, we should consider whether the feature extraction method makes full use of the global information, whether its calculations are convenient, whether it can meet the real-time needs, etc. For many application requirements, using a combination of multiple feature extraction methods is also a good way to increase efficiency and accuracy.

To identify the defect categories of an image, it is necessary that the selected features not only describe the image properly but also distinguish different categories of images. The primary mission of defect classification is to train the classifier according to the extracted feature set and then make it identify the type of each surface defect correctly based on supervised or unsupervised pattern recognition methods.

The support vector machine (SVM) [145] and K nearest neighbor (KNN) [146] are representative classifiers in supervised pattern recognition.

SVMs are suitable for small and medium-sized data samples and for nonlinear, high-dimensional classification problems, and they have been widely used in the field of industrial vision detection. For example, Jia et al. [147] described a real-time machine vision system that uses an SVM to automatically learn complicated defect patterns. Li and Huang [148] proposed a binary defect pattern classification method that combines a supervised SVM classifier with unsupervised self-organizing map clustering, in which the SVM is used to classify and identify manufacturing defects. The results showed that this method could achieve more than 90% classification accuracy, which was better than that of the back-propagation neural network. However, this study focused only on binary map classification. Valavanis and Kosmopoulos [149] proposed a method of multi-class defect detection and classification based on a multi-class SVM and a neural network classifier for weld radiographs. For real-time analysis of spectrum data, Huang et al. [150] established an improved SVM classification model based on a genetic algorithm to accurately estimate different types of porosity defects in an aluminium alloy welding process. Furthermore, the SVM classifier has played a significant role in the inspection of surface defects in copper strips [151, 152], laser welding process monitoring and defect diagnosis [153], defect detection for wheel bearings [154], etc.
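A generic sketch of such an SVM classification stage with scikit-learn follows; the synthetic data stands in for a feature matrix produced by one of the extractors above, and the RBF kernel and hyperparameters are placeholders to be tuned per application.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for (features, labels) from histogram/GLCM/LBP/HOG extraction
X, y = make_classification(n_samples=400, n_features=32, n_classes=3,
                           n_informative=10, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5)        # 5-fold cross-validated accuracy
print(f"mean accuracy: {scores.mean():.3f}")
```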

The KNN algorithm has been proven to be simpler and more stable than neural networks [155, 156]. To detect fabric defects, Yıldız et al. [157] preprocessed images with wavelet, thresholding, and morphological operations and then used the GLCM method to extract features; finally, defect images were classified with a KNN algorithm at an average accuracy of 96%. Cetiner et al. [158] proposed a method of feature extraction based on wavelet moments and defect image classification based on KNN, which can be used in automatic defect classification systems in the forest industry. Das and Jena [159] presented a method combining image texture feature extraction techniques: the LBP and the grey level run length matrix (GLRLM) were combined to extract image features, and then KNN and an SVM were used for classification. The experimental results showed that the combination of LBP and GLRLM can improve the performance of feature extraction and that the SVM outperforms the nearest neighbor approach in texture feature classification. To improve the performance of the KNN algorithm, Lei and Zuo [156] proposed a weighted K nearest neighbor (WKNN) algorithm based on a two-stage feature selection and weighting technique (TFSWT) and successfully applied the WKNN method to identify gear cracks.

An unsupervised algorithm can also be used for defect classification. Based on K-means clustering, Mjahed et al. [160] presented an efficient algorithm for solving a multi-objective fault signal diagnosis problem using a genetic algorithm. Hamdi et al. [161] introduced an unsupervised defect detection algorithm for patterned fabrics. An image filtered by non-extensive standard deviation was divided into a series of blocks, and then the squared difference between each block median and the mean of all block medians was input into K-means clustering to classify the blocks as defective or non-defective, with an overall detection success rate that reached 95%.
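The block-median idea of Hamdi et al. [161] can be sketched roughly as follows; this is a simplified illustration rather than the authors' exact pipeline, and the block size and the assumption that the smaller cluster is the defective one are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def block_median_defects(gray, block=32):
    """Label image blocks defective / non-defective via K-means on the squared
    deviation of each block's median from the mean of all block medians."""
    H, W = gray.shape
    h, w = H // block, W // block
    tiles = gray[:h * block, :w * block].reshape(h, block, w, block)
    medians = np.median(tiles, axis=(1, 3)).ravel()      # one median per block
    feature = (medians - medians.mean()) ** 2            # squared deviation
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(feature[:, None])
    # Assume the rarer cluster corresponds to defective blocks
    defective = labels == np.argmin(np.bincount(labels))
    return defective.reshape(h, w)
```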

Table 2 compares some traditional feature extraction and defect classification methods.

Table 2 The performances of traditional feature extraction and defect classification methods

In recent years, artificial intelligence technology has greatly benefited industrial production. Neural networks are an important branch in the development of artificial intelligence [162]. With the improvement of computing power and the advent of big data, deep learning, with the core idea that machines can automatically learn from data by increasing the number of network layers, has developed rapidly and has significantly impacted the field of machine vision. Deep learning methods can automatically extract and combine the essential feature information of objects, and they are especially adept at image classification.

The CNN is the most popular architecture for image classification. In 1998, the emergence of LeNet opened the era of CNNs [94]. In 2012, the success of AlexNet [163] in the ImageNet competition promoted the application of deep learning in computer vision. After that, a series of CNN models appeared, such as Network-in-network [164], VGGNet [165], GoogLeNet [166,167,168,169], ResNet [170], and DenseNet [171]. There are three main types of neural layers that play different roles in a CNN: convolutional layers, pooling layers, and fully connected layers [172, 173]. The convolutional layers are designed to detect local combinations of features from a previous layer, pooling layers are designed to merge semantically similar features into one, and fully connected layers ultimately convert the feature maps into a feature vector [174], as shown in Fig. 6.

Fig. 6 Architecture of a CNN model [174]. Copyright 2018, Elsevier
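A minimal PyTorch sketch of such a convolution, pooling, and fully connected stack for defect classification follows; the layer widths, input resolution, and six-class output are arbitrary assumptions, not an architecture from the cited works.

```python
import torch
import torch.nn as nn

class DefectCNN(nn.Module):
    """Convolutional layers extract local features, pooling layers merge them,
    and a fully connected layer maps the feature maps to class logits."""
    def __init__(self, n_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(64 * 4 * 4, n_classes)

    def forward(self, x):                       # x: (B, 1, H, W) greyscale patches
        return self.classifier(self.features(x).flatten(1))

model = DefectCNN()
logits = model(torch.randn(8, 1, 128, 128))     # sanity check -> shape (8, 6)
```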

The CNN was originally designed for image analysis; therefore, it is a good fit for automated defect classification in visual inspection [175,176,177]. According to the literature of recent years, deep learning has been applied to industrial defect classification in many fields, such as industrial production and electronic components. For supervised steel defect classification, Masci et al. [178] presented a max-pooling CNN approach; compared to SVM classifiers, the CNN obtained much better results and worked properly with different types of defects. Surface quality affects not only the appearance of products but also their performance. Park et al. [14] proposed a generic CNN-based approach for the automatic visual inspection of dirt, scratches, burrs, and wear on part surfaces; their results showed that a pretrained CNN model works well on small datasets, with improved accuracy for a surface quality visual inspection system. To detect casting defects by X-ray inspection, Lin et al. [179] proposed a robust detection method based on a visual attention mechanism and feature-mapping deep learning: a CNN extracts defect features from potentially defective regions and produces a deep feature vector, from which the similarity of suspicious defective regions is calculated. Their results showed that the method was effective in avoiding false and missed inspections. Nguyen et al. [180] proposed a CNN-based inspection system for defect classification in casting products. However, CNN models perform well only when a large volume of high-quality data is available. Kim et al. [181] proposed an indicator that distinguishes defects from the background area for the classification of defect types in thin-film-transistor liquid-crystal display panels, performing automatic defect classification with a CNN during industrial production.

As one of the representative algorithms of machine vision, the CNN has played an important role in defect classification. However, CNNs are becoming increasingly deep, and they require large-scale datasets and massive computing power for training. In addition, collecting labelled datasets requires great human effort. Thus, as a further exploration, unsupervised learning by a CNN may be a meaningful research direction.

Transfer learning is a method of machine learning in which a pre-trained model is reused in another task. Transfer learning can help solve the problem of a lack of labelled data. Imoto et al. [182] proposed a CNN-based transfer learning method for automatic defect classification. The results showed that this method is robust against a lack of labelled data and can achieve more than 80% accuracy with only a few dozen labelled data points.
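As a hedged sketch of this recipe (not the network of Ref. [182]), the snippet below reuses a torchvision ResNet-18 pre-trained on ImageNet, freezes its backbone, and retrains only a small classification head, which is why a few dozen labelled images can suffice; the three-class head and optimizer settings are our own placeholder choices.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                        # keep the pre-trained features fixed
model.fc = nn.Linear(model.fc.in_features, 3)      # new head, e.g., 3 defect classes (placeholder)

# Only the new head is optimized, so few labelled samples are needed.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```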

4.3 Localization

Defect localization must accurately determine the location of a defect in a given image and mark its category. Generally, defect localization is performed with a series of object detection methods.

The traditional object detection strategies and algorithms include Viola-Jones [183], HOG + SVM, non-maximum suppression (NMS) [184], the deformable part model (DPM) [185], selective search [186, 187], and edge boxes [188]. Ding et al. [189] proposed a detection scheme based on a HOG and SVM. The HOG was used to encode each block-based feature, and the SVM was used to classify the fabric defects. The experimental results showed that this method based on a HOG and SVM is relatively simple and easy to realize in online applications. Dou et al. [190] proposed a fast template matching-based algorithm (FTM) for railway bolt detection and a nearest-neighbor classifier to determine whether a bolt is in the correct position, which achieved a lower false positive rate than previous methods. The DPM is one of the most effective template-based approaches used in object detection. For railway fastener defect detection, He et al. [191] proposed a Gaussian mixture deformable part model (GMDPM) algorithm based on HOG features. Wei et al. [192] proposed an effective express box defect detection algorithm to identify the shape and size of defects, and this method achieved a 95.83% correct rate.
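As a minimal illustration of the HOG + SVM scheme of Ding et al. [189], the sketch below encodes greyscale patches with scikit-image's HOG descriptor and trains a linear SVM to label each patch as normal or defective; the patch size, HOG parameters, and random stand-in data are illustrative assumptions.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(patches):
    # patches: iterable of 2-D greyscale arrays (e.g., 64x64 fabric blocks)
    return np.array([hog(p, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for p in patches])

# Stand-in training data: 0 = normal, 1 = defective
train_patches = [np.random.rand(64, 64) for _ in range(20)]
train_labels = np.array([0, 1] * 10)

clf = LinearSVC().fit(hog_features(train_patches), train_labels)
pred = clf.predict(hog_features([np.random.rand(64, 64)]))
```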

In recent years, after the successful application of CNN-based image classification methods, object detection technology based on deep learning has also made significant progress. The object detection methods based on deep learning can be divided into two major categories. One generates regions and then classifies each region to obtain different object categories. The other regards object detection as a regression or classification problem and uses a unified framework to obtain the final categories and locations directly [193]. The region proposal-based methods mainly include regions with CNN features (R-CNN) [194], spatial pyramid pooling (SPP-net) [195], Fast R-CNN [196], Faster R-CNN [197], region-based fully convolutional networks (R-FCNs) [198], feature pyramid networks (FPNs) [199], and Mask R-CNN [200]. The regression- and classification-based methods mainly include MultiBox [201], AttentionNet [202], G-CNN [203], You Only Look Once (YOLO) [204], the single-shot MultiBox detector (SSD) [205], YOLOv2 [206], RetinaNet [207], YOLOv3 [208], and YOLOv4 [209]. In terms of performance, the region proposal-based methods generally achieve higher accuracy at lower speed, whereas the regression- and classification-based methods run faster at some cost in accuracy.
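Both families typically finish with non-maximum suppression (NMS, cited above [184]) to discard duplicate detections of the same object. The call below uses torchvision's reference implementation on made-up boxes and scores.

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 50., 50.],      # (x1, y1, x2, y2)
                      [12., 12., 52., 52.],      # heavy overlap with the first box
                      [80., 80., 120., 120.]])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.5)     # indices of kept boxes, here [0, 2]
```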

Based on a cascaded mixed FPN, Wu et al. [210] proposed a two-stage fabric defect detector. The end-to-end defect detection architecture is shown in Fig. 7. The feature extraction backbone model of matching parameters with fitting degrees was proposed to solve the problems caused by a small defect feature space and background noise. Stacked feature pyramid networks were set up to integrate cross-scale defect patterns for feature fusion and enhancement in a neck module. Cascaded guided region proposal networks (RPNs) were proposed for refining the anchor centers and the shapes used for anchor generation. The experimental results showed that this method could improve the recognition performance of included and size-variant fabric defects.

Fig. 7 End-to-end fabric defect detection architecture [210]

Faster R-CNN is a state-of-the-art object detection method that approaches real-time operation; it generates regions of interest (ROIs) with an RPN instead of selective search [197, 211]. Lei et al. [211] adopted Faster R-CNN to detect defects in polarizers and to perform rapid detection and effective positioning of the defects. To further improve the detection accuracy and efficiency, the number of network layers and some of the network parameters can be adjusted to optimize the model. Lei and Sui [212] proposed a Faster R-CNN method for intelligent fault detection on high voltage lines. To detect defects in an image, Faster R-CNN generates candidate proposal regions and then, after training, obtains the corresponding category and location of a given component. The experiments showed that the detection method based on the ResNet-101 network model could effectively locate insulator damage and bird nests on a high voltage line. Sun et al. [213] proposed an improved Faster R-CNN method for surface defect recognition in wheel hubs. The last maximum pooling layer was replaced by an ROI pooling layer, as shown in Fig. 8. ROI pooling allows a single feature map, computed in one pass, to serve all the proposals generated by the RPN: the network can accept an input feature map of flexible size and outputs a fixed-size feature map for each proposal. The experimental results showed that the improved Faster R-CNN method has a higher detection accuracy. However, the detection speed of the Faster R-CNN method may not meet the real-time requirements of industrial applications.

Fig. 8 The structure of the improved Faster R-CNN [213]
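The fixed-size-output property that motivates this modification can be seen directly with torchvision's roi_pool operator, used here as a stand-in (not the exact operator of Ref. [213]) on an arbitrary feature map and two proposals of different sizes.

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)           # backbone output for one image
proposals = torch.tensor([[0., 4., 4., 30., 20.],   # (batch_idx, x1, y1, x2, y2)
                          [0., 10., 10., 45., 45.]])

pooled = roi_pool(feature_map, proposals, output_size=(7, 7))
print(pooled.shape)  # torch.Size([2, 256, 7, 7]): fixed size for every proposal
```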

YOLO is an object recognition and localization algorithm based on a deep neural network that performs object detection by using fixed-grid regression [214]. Its primary characteristic is that it runs quickly and can be used in real-time systems. Based on the idea of regression, YOLO takes a whole image as the network input and directly regresses the object bounding box and category at multiple positions of the image. Adibhatla et al. [215] adopted a YOLO/CNN model to detect PCB defects and achieved a defect detection accuracy of 98.79%. However, the range of defect types the method can detect is limited and needs to be expanded. Lv et al. [216] proposed an active learning approach for steel surface defect inspection based on YOLOv2. This model achieves high efficiency but at the expense of precision. Jing et al. [217] proposed an improved YOLOv3 model that uses the K-means algorithm to cluster the annotated data. The experimental results showed that the improved YOLOv3 model achieves better performance in fabric defect detection; however, its real-time performance still needs improvement. As a regression-based detection method, the YOLOv4 network has an excellent detection speed, but its detection accuracy for small targets needs to be improved. To detect cracks in iron materials, Deng et al. [218] proposed a cascaded YOLOv4 (C-YOLOv4) network. The experimental results showed that C-YOLOv4 has better robustness and crack detection accuracy.
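The fixed-grid regression idea can be sketched in a few lines: each cell of an S × S grid predicts box offsets, an objectness confidence, and class scores, all decoded in a single pass. The toy decoder below assumes one box per cell and random predictions; it is a simplification of, not a substitute for, the full YOLO head.

```python
import torch

S, C = 7, 3                                    # grid size, number of classes (placeholders)
pred = torch.rand(S, S, 5 + C)                 # per cell: (x, y, w, h, conf) + class scores

conf = pred[..., 4]
cls = pred[..., 5:].argmax(-1)
ys, xs = torch.where(conf > 0.9)               # keep only confident cells
for y, x in zip(ys.tolist(), xs.tolist()):
    bx = (x + pred[y, x, 0].item()) / S        # cell-relative centre -> image-relative
    by = (y + pred[y, x, 1].item()) / S
    bw, bh = pred[y, x, 2].item(), pred[y, x, 3].item()
    print(f"cell ({y},{x}): class {cls[y, x].item()} "
          f"at ({bx:.2f}, {by:.2f}), size ({bw:.2f}, {bh:.2f})")
```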

SSD combines strategies from YOLO and Faster R-CNN, using multi-scale regional features for regression; this preserves the high speed of the YOLO method while maintaining reasonable accuracy. Zhai et al. [219] proposed a DF-SSD object detection method based on DenseNet and feature fusion. The feature extraction network DenseNet-S-32-1 was designed to replace VGG-16 in SSD. To effectively integrate low-level visual features and high-level semantic features, they also designed a fusion mechanism for multi-scale feature layers. The experimental results showed that the proposed DF-SSD method achieves advanced performance in the detection of small objects and objects with specific relationships.

4.4 Segmentation

Defect classification and localization can provide information on the defect types and their relative positions in images. Furthermore, in intelligent vision detection, defect segmentation, especially pixel-level segmentation, can provide important references for evaluating the defect severity and performing condition assessment.

Image segmentation is a process that divides an image into several specific, unique regions and extracts the objects of interest [220]. The purpose of image segmentation is to predict the category of each pixel in the image. To address image segmentation for different features, researchers have proposed numerous segmentation methods. Table 3 lists some traditional image segmentation methods and their characteristics.

Table 3 Traditional image segmentation methods and their characteristics

These methods are based on different image models, use different characteristics, and have a certain scope of application. Some researchers have also integrated genetic algorithms [233] and wavelet methods [234] into image segmentation and have achieved positive results. Among these methods, the clustering algorithm is widely used for defect segmentation. Clustering is an unsupervised approach that does not require a training set, and it is simple and fast. Image segmentation divides the image into several disjoint regions, which is essentially a pixel clustering process [235]. There are many clustering algorithms, such as fuzzy c-means (FCM) [236], BIRCH [237], CURE [238], CLARANS [239], K-means [240], CLARA [241], CHAMELEON [242], K-medoids [243], DBSCAN [244], K-prototypes [245], and MAPK-means [246]. The choice of clustering algorithm depends on the purpose of clustering and the type of data. Xiong et al. [247] proposed a novel 3D laser profiling system for rail surface defect detection, in which K-means clustering was used to merge candidate defect points into candidate defect regions. Jian et al. [248] designed a surface defect detection system for mobile phone screen glass; in this system, improved FCM clustering was proposed to segment the surface defects more accurately. Melnyk and Tushnytskyy [249] proposed a PCB defect detection and classification system that implemented the K-means clustering algorithm. Li et al. [250] proposed a clustering algorithm that links regions that are close to each other to detect cluster defects composed of many small point defects. A schematic diagram of the process of connecting domains A and B in this clustering method is shown in Fig. 9.

Fig. 9 The process of connecting domains A and B. Two separate domains A and B are shown in (a). The result of the Boolean union of domains A′ and B′ is shown in (b). The result of linking domains A and B is shown in (c) [250]. Copyright 2020, Elsevier
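As a minimal sketch of pixel clustering for segmentation (not the specific algorithm of any reference above), the snippet below runs K-means on the per-pixel intensities of a stand-in greyscale image and treats one cluster as the candidate defect mask; k = 2 and intensity-only features are simplifying assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

image = np.random.rand(100, 100)       # stand-in greyscale image
pixels = image.reshape(-1, 1)          # one feature (intensity) per pixel

labels = KMeans(n_clusters=2, n_init=10).fit_predict(pixels)
mask = labels.reshape(image.shape)     # 0/1 map: background vs. candidate defect pixels
```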

Deep learning has also brought great progress for image segmentation technology. The fully convolutional network (FCN) is a breakthrough semantic segmentation model that has higher accuracy than traditional approaches [251]. FCNs can efficiently learn to make dense predictions for per-pixel tasks, for example, semantic segmentation, as shown in Fig. 10.

Fig. 10 Fully convolutional network [252]
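Dense per-pixel prediction with an FCN can be exercised with torchvision's readily available FCN-ResNet50, used here as a generic stand-in (it is trained on natural scenes, not defect data): the model maps an image to a class-score map of the same spatial size, and an argmax yields per-pixel labels.

```python
import torch
from torchvision.models.segmentation import fcn_resnet50

model = fcn_resnet50(weights="DEFAULT").eval()
image = torch.randn(1, 3, 224, 224)       # placeholder input image

with torch.no_grad():
    out = model(image)["out"]             # (1, num_classes, 224, 224) score map
mask = out.argmax(1)                      # per-pixel class labels
```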

FCN-based segmentation methods also play an important role in industrial applications. Yu et al. [253] presented a novel 2-stage FCN framework for surface defect segmentation. The 2-stage framework improves the generality and reusability of FCNs. Li et al. [254] adopted region-based fully convolutional networks (R-FCNs) to inspect insulator defects. The experimental results showed that the R-FCN algorithm has good robustness and environmental adaptability. In crack inspection, conventional approaches are unable to identify and measure diverse types of cracks concurrently at the pixel level. Yang et al. [255] applied an FCN to study automatic pixel-level crack detection and measurement, and their results showed that the prediction had improved at the pixel level and that the training time was greatly reduced. However, the resolution of the feature maps generated by the FCN was low, and the prediction results were coarse owing to the large amount of spatial information loss during down-sampling. Qiu et al. [256] presented a 3-stage FCN for pixelwise surface defect segmentation. The FCN is a state-of-the-art algorithm for generic object segmentation. However, for small datasets, its performance cannot meet the requirements. The experimental results showed that the slicing method could improve the efficiency of FCNs in small datasets in industrial environments.

In addition to FCNs, common image segmentation algorithms include U-Net [257], SegNet [258], Mask R-CNN, and PSPNet [259]. These models have an encoder-decoder architecture, in which a CNN encoder extracts features and a decoder built from deconvolution layers and skip connections maps the features back to an output image. U-Net was originally proposed to segment greyscale biomedical images. SegNet achieves a good trade-off between efficiency, memory footprint, and precision. Mask R-CNN can be used for instance segmentation. PSPNet adopts a pyramid pooling module, whose small-size pooling levels extract more localized features while the large-size levels capture global information.

Furthermore, for specific visual inspection scenarios, additional defect segmentation methods have been continuously proposed. For example, Yu et al. [260] proposed an adaptive depth and receptive field selection network, in which an adaptive depth selection mechanism extracts features at various depths and an adaptive receptive field block selects the most appropriate receptive field. The experimental results on a casting defect segmentation dataset showed that the proposed method achieved better performance than existing segmentation algorithms. Tabernik et al. [261] proposed a segmentation-based deep-learning architecture for surface defect detection. The network was designed in two stages, as shown in Fig. 11. In the first stage, a segmentation network locates the surface defects accurately at the pixel level; because each pixel is treated as an independent training sample, this stage effectively increases the number of training samples. The second stage is a decision network for binary image classification. The experimental results showed that this method could be trained on a small-scale defect dataset of only 25–30 samples. This is of great significance for industrial application scenarios with limited training samples, and it effectively improves the practicability of deep learning methods.

Fig. 11 Two-stage architecture with segmentation and decision networks [261]. Copyright 2020, Springer Nature

4.5 LSTM-Based Periodic Defect Recognition

As a deep learning architecture designed for sequential data, the RNN shares parameters across all time steps so that it can learn patterns that recur over time [262]. LSTM is one of the representative RNN architectures [97]. In industrial visual inspection, LSTM is an effective method for defects with strong temporal characteristics.

Hu et al. [262] proposed an LSTM recurrent neural network (LSTM-RNN) model to classify common defects in an infrared thermography-based nondestructive testing task for honeycomb materials. Similarly, Wang et al. [263] adopted the LSTM-RNN method to determine the defect depth inside carbon fiber reinforced polymer structures, achieving better performance than a CNN.

Fusing a CNN with LSTM is also a widely used defect detection approach. For a molten pool online monitoring task, Liu et al. [264] proposed a CNN-LSTM algorithm that combines the advantages of both architectures. First, feature vectors were extracted from molten pool images by the CNN; then, LSTM was used for welding defect recognition. The experimental results showed that the accuracy of the CNN-LSTM algorithm could reach 94% in the defect detection task for the CO2 welding molten pool described in the literature and that it was highly efficient (the reported processing time was 0.067 ms per image), fully meeting the industrial requirement of real-time monitoring.
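A hedged sketch of this fusion pattern is given below (our own placeholder shapes and sizes, not those of Ref. [264]): a small CNN turns each frame into a feature vector, and an LSTM classifies the resulting sequence.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4))                      # per-frame feature extractor
        self.lstm = nn.LSTM(input_size=8 * 4 * 4, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, frames):                            # frames: (batch, time, 1, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))            # fold time into the batch for the CNN
        feats = feats.flatten(1).reshape(b, t, -1)        # (batch, time, features)
        _, (h, _) = self.lstm(feats)                      # last hidden state summarizes the sequence
        return self.head(h[-1])                           # defect / no-defect scores

logits = CNNLSTM()(torch.randn(2, 10, 1, 32, 32))         # 2 sequences of 10 frames each
```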

According to the features of periodic roll mark defects in plates, Liu et al. [265] proposed a defect detection method based on a hybrid CNN and LSTM. To improve the detection performance, an attention mechanism algorithm was also integrated into the detection method. The complete network architecture is shown in Fig. 12. As the final output, O represents whether there is a periodic defect in the image sequence. The experimental results showed that the detection method had good performance in identifying periodic defects and that it had an 86.2% detection rate under the experimental conditions described in the literature. However, the integration of the attention mechanism increases the complexity of the algorithm and requires higher computer performance.

Fig. 12 Network architecture of the CNN + LSTM + attention algorithm [265]

4.6 Multi-Feature Fusion Detection Based on a DBN and SAE

The SAE is an unsupervised pre-training method that encodes the input data from a high-dimensional space into a low-dimensional space and then decodes it back into the high-dimensional space, trained one auto-encoder at a time [266, 267]. Seker and Yuksek [268] performed fabric defect detection based on the SAE method; after fine-tuning the hyper-parameters of the deep learning model, they achieved a detection rate of 96% on their own datasets. Yang and Jiang [267] proposed a unified deep neural network with multi-level features for weld defect classification. To detect weld defects in radiographic images, they investigated SAE-based pre-training and fine-tuning strategies. As an unsupervised pre-training algorithm, the SAE can improve generalization performance and reduce the risk of overfitting, as shown in Fig. 13. The results showed that a unified deep neural network can take full advantage of the multi-level features extracted from each hidden layer.

Fig. 13 Stacked auto-encoders for deep neural network pre-training [267]. Copyright 2020, Springer Nature
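The greedy layer-wise procedure in Fig. 13 can be sketched as follows: each layer is first trained, without labels, to reconstruct the output of the previous layer, and the stacked encoders are then fine-tuned with labels. The dimensions, random stand-in data, and fixed 100-step inner loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

dims = [784, 256, 64]                         # input -> hidden -> bottleneck (placeholders)
encoders, data = [], torch.rand(32, dims[0])

x = data
for d_in, d_out in zip(dims[:-1], dims[1:]):
    enc, dec = nn.Linear(d_in, d_out), nn.Linear(d_out, d_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
    for _ in range(100):                      # unsupervised: reconstruct this layer's input
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(torch.relu(enc(x))), x)
        loss.backward()
        opt.step()
    encoders.append(enc)
    x = torch.relu(enc(x)).detach()           # feed the encoded data to the next layer

# The stacked encoders, plus a classifier head, are then fine-tuned with labels.
```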

The DBN utilizes the restricted Boltzmann machine (RBM) as a learning module [269]. In a DBN, the top two layers form an undirected graph, and the remaining layers form a belief network with directed, top-down connections [173]. A graphic depiction of a DBN is shown in Fig. 14.

Fig. 14 Graphic depiction of a DBN [173]
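For concreteness, the learning module itself can be sketched as a restricted Boltzmann machine trained with one step of contrastive divergence (CD-1); the layer sizes, binary units, and omission of bias terms are simplifications. Stacking several such trained RBMs, each fed with the hidden activations of the previous one, yields the DBN of Fig. 14.

```python
import torch

n_vis, n_hid, lr = 64, 32, 0.1                      # placeholder sizes and learning rate
W = torch.randn(n_vis, n_hid) * 0.01                # weight matrix; biases omitted for brevity

v0 = torch.bernoulli(torch.rand(16, n_vis))         # a batch of binary visible vectors
for _ in range(100):
    h0 = torch.sigmoid(v0 @ W)                      # upward pass: hidden probabilities
    v1 = torch.sigmoid(torch.bernoulli(h0) @ W.T)   # downward pass: reconstruction
    h1 = torch.sigmoid(v1 @ W)                      # second upward pass
    W += lr * (v0.T @ h0 - v1.T @ h1) / v0.shape[0] # CD-1 weight update
```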

Chen et al. [270] constructed a DBN-based fault soft-max classifier for bearing fault classification. A DBN can be used to automatically classify raw data into corresponding classes. Furthermore, a new multi-sensor feature fusion method for bearing fault diagnosis based on a DBN and SAE was proposed [271]. In this study, the SAE extracted features from multiple sensors and merged them into one stream. Then, the features fused by the SAE were used to train a DBN for fault diagnosis and classification. The experimental results showed that this SAE-DBN method could effectively identify the machine running conditions.

With the advent of Manufacturing III [272], defect detection will develop from image-based detection into comprehensive detection that combines multiple sensors. Deep learning architectures such as DBNs and SAEs can help multi-feature fusion detection achieve better effectiveness and accuracy, which is worthy of further research.

5 Conclusions and Perspectives

Machine vision has significantly improved the scope, efficiency, quality, and reliability of industrial inspection, yielding achievements that contemporary industry cannot ignore. However, further exploration is needed in the application of machine vision.

First, machine vision must operate as real-time in-line detection, which involves large amounts of data, redundant information, and a high-dimensional feature space. Image processing speed is one of the main bottlenecks limiting the real-time performance of vision systems, and real-time in-line detection remains difficult for objects with complex shape features.

The second issue is the robustness of vision detection systems against interference: visual inspection should become more robust so that it depends less on the image acquisition environment.

The intelligence level of the vision detection system is another bottleneck: a human can recognize a complex interference environment at a glance, whereas a machine finds this difficult and may even make an incorrect judgement.

Although machine vision technology may not be perfect, defect detection based on machine vision is still the main direction for future research and development in this area. Therefore, some important points need to be considered in future development.

5.1 Robust General Algorithm for Balancing Efficiency and Precision

Artificial intelligence, represented by deep learning, has become an important area in industry as a result of the rapid technological developments in recent years. Deep learning marks a significant milestone in visual inspection. Many algorithms can be employed to achieve high accuracy but cannot be used for real-time online detection. In contrast, some algorithms are very fast but cannot reach the ideal accuracy. In addition, some algorithms can work for the detection of products in experimental cases but may not suit practical production. Therefore, it is a meaningful research direction to study a robust algorithm that achieves both efficiency and accuracy.

Moreover, most deep learning algorithms depend heavily on large-scale sample datasets, which has become a major factor limiting their application in some areas. Transfer learning is most effective when the source network has been trained on data similar to that of the target task [273]. Weakly supervised learning, including incomplete, inexact, or inaccurate supervision, and even unsupervised learning, will be effective ways to address the high cost of data acquisition [274,275,276].

5.2 Fusion of Multiple Detection Technologies

Visual inspection is an image-based detection technology that is mainly aimed at the surface of objects. However, in many cases, industrial inspection concerns not only the surface but also the performance of the whole object.

In the era of Industry 4.0, in order to make the machine more intelligent, comprehensive sensing detection technology should be further studied [277]. Visual inspection can be combined with micro-thermal sensors [278], ultrasonic guided waves [279], eddy current detection [280], laser scanning thermography [281], etc., to achieve a full range of inspection and evaluation of objects.

5.3 Real-Time Performance

Machine vision is mainly used on industrial production lines, which require real-time processing capability. The amount of data involved in visual inspection is very large, and image processing takes time, thereby introducing a lag into the entire system. The main difficulty in achieving real-time detection is the speed of image processing.

Image processing and analysis algorithms should be further optimized to improve the speed of visual inspection systems; this is a key direction for future research. Excellent hardware, such as high-performance computers, is also very important. Predictably, with the development of low-latency 5G communication technology, uploading image data over the network to powerful cloud servers for processing is also a worthwhile solution [282]. In addition, the next generation of computing technology, represented by quantum computing [283], is expected to contribute fast computing capabilities to visual inspection.

5.4 Extreme Small-Scale Visual Inspection

Manufacturing III takes atomic and close-to-atomic scale manufacturing (ACSM) as its core technology and has become the primary future development trend in manufacturing [79, 272, 284]. For ACSM, defect detection will be a very important area. As an example, a neural network can look through a microscope at a sample surface and return information about the atomic structure and lattice defects in real time. At the atomic scale, the size of the datasets grows exponentially; therefore, deep learning could be an effective approach to making major breakthroughs.