1 Introduction

Annually, an average woman spends 140 h on 45 trips shopping for clothes and shoes, and 95 h on 84 trips shopping for food (Forbes 2015); such shopping is the main contributor to the $24 trillion in annual global retail sales (https://www.statista.com/statistics/443522/global-retail-sales/). Retail stores such as supermarkets and warehouses allocate the highest percentage of total spend to sophisticated marketing strategies of any industry (21.9%) in order to attract more customers to the store and to keep them browsing longer and feeling happier while there, which in turn leads to more spending (https://www.brafton.com.au/blog/content-marketing/the-ultimate-list-of-marketing-spend-statistics-for-2019-infographic/). Store layout, the design of a store's floor area and the placement of items within the store, is one of the most effective in-store marketing tactics: it can directly influence customer decisions and thereby boost store sales and profitability (Larson et al. 2005). An efficient store layout presents merchandise in a way that attracts customer attention (Rhee and Bell 2002) and encourages customers to walk down more aisles, exposing them to more merchandise, which has been shown to be positively correlated with sales (Martin 2011).

Currently, the conventional approach to designing store layout is based on a passive reaction to customer behavior (Cil 2012; Cil et al. 2009). For example, retailers examine sales data (e.g., by product category), amend the store layout, and introduce in-store displays and promotions to increase or maintain sales. However, these conventional design approaches may not reflect the actual behavior of the customers visiting the stores, which may vary significantly from one residential population to another and from one store to another. Importantly, the conventional design process does not reflect (1) how customers actually navigate through store aisles, (2) how much time customers actually spend in each section, and (3) what visible emotion (e.g., happiness) customers exhibit in response to a product. With current in-store technology, such data can be obtained and analyzed by applying AI to the video surveillance of the Closed-Circuit TeleVision (CCTV) infrastructure that has long and widely been employed in stores as a security measure to reduce shoplifting and employee theft (Fig. 1).

Fig. 1

Widely-employed security CCTV cameras in retail stores can provide in-store customer-behaviour insights to inform and improve store layout design

Recent advances in AI and its sub-fields of computer vision, machine learning and especially deep learning have led to breakthroughs in many tasks, with results that match or surpass human capacity (LeCun et al. 2015). The retail community has leveraged the power of AI for many tasks, particularly those related to the purchase transaction, such as pay-with-your-face, checkout-free grocery stores by Amazon Go, visual and voice search by Walmart, Tesco, Kohl's and Costco, and customer satisfaction tracking and behaviour prediction (https://spd.group/artificial-intelligence/ai-for-retail/?fbclid=IwAR0HM8tP2vQ9MI6jE2lrkD7JnyBP1NMlEAgRWqWWKKlHoctFctHnPC60J9M#Route_Optimization). In 2019, the retail sector led global spending on AI systems, with $5.9 billion invested in automated customer service agents, shopping advisers, and product recommendation platforms (https://spd.group/artificial-intelligence/ai-for-retail/?fbclid=IwAR0HM8tP2vQ9MI6jE2lrkD7JnyBP1NMlEAgRWqWWKKlHoctFctHnPC60J9M#Route_Optimization). However, store layout design still lags behind in the AI era. Although research has highlighted the potential of CCTV to capture consumer movement in stores (Newman and Foxall 2003; Nguyen et al. 2017a, 2017b), a framework showing how AI-derived insights can be used for store design has been lacking. Indeed, current retailing research on AI emphasizes consumer perceptions of AI (Davenport et al. 2020; Grewal et al. 2020; Roggeveen and Sethuraman 2020) rather than how AI can be used to inform store layout design.

This research aims to conduct a comprehensive review of existing approaches to store layout design and of the modern AI techniques that can be utilized in the layout design task. Based on this review, we propose an AI-powered store layout design framework. This framework applies advanced AI and data analysis techniques on top of existing CCTV video surveillance infrastructure to understand, predict and suggest a better store layout. The framework facilitates customer-oriented store layout design by translating visual surveillance data into customer insights and business intelligence. It answers questions such as: How do shoppers really travel through the store? Do they go through every aisle, or do they move from one area to another in a more direct manner? Do they spend much of their time moving around the outer ring of the store, or do they spend most of their time in certain store sections? The rich visual data from the CCTV infrastructure allow us to answer such questions and to optimize store layout design for customers' convenience and satisfaction, thereby increasing retailers' sales.

Scope: Retail is one of the biggest markets for AI. Currently, many forms of AI are used, such as chatbots, price adjustment and prediction, visual search, virtual fitting rooms, and supply chain management and logistics (https://spd.group/artificial-intelligence/ai-for-retail/?fbclid=IwAR0HM8tP2vQ9MI6jE2lrkD7JnyBP1NMlEAgRWqWWKKlHoctFctHnPC60J9M#Route_Optimization). This article focuses on the visual AI and analysis techniques that can be applied on top of the CCTV infrastructure of retail stores to improve customer-oriented store layout design. Representing interdisciplinary research between AI and marketing, this article provides a foundation for academic researchers and practitioners from both fields to collaborate on this problem. In addition, it offers marketers and retail managers insights to optimize store layout design, customer satisfaction and sales.

Our key contributions are twofold:

  • Comprehensively reviewing existing approaches in store layout design and modern AI techniques that can be utilized in the layout design task. Section 2 reviews conventional methodology for supermarket layout design, discussing the design aims, layout types and conventional design approaches. Section 3 reviews how modern visual AI techniques are being used in retailing to analyze customer emotion and behavior while shopping.

  • Proposing a novel AI-powered store layout design framework. Section 4 details how the proposed framework applies advanced AI and data analysis techniques on top of existing CCTV video surveillance infrastructure to understand, predict and suggest a better store layout.

2 Conventional methodology for supermarket layout design

2.1 Layout design aims

The ultimate goal of layout design is to increase store sales by guiding consumer behavior (Vrechopoulos et al. 2004). Store layout should address four factors: perceived usefulness, ease of use, entertainment, and time spent (Hansen et al. 2010). The layout of a retail store is a key element in its success and can increase store sales and profitability (Larson et al. 2005). This goal can be decomposed into the following factors.

Expose customers to more products: to prompt more purchasing decisions. To maximize the number of products picked up, retailers may place essential products at the ends of aisles (Tan et al. 2018). Endcaps are effective because they occupy prominent locations and attract higher shopper traffic (Page et al. 2019). In addition, during special occasions such as Easter, stores can place Easter eggs at the end of the biscuit aisle. Customers then need to walk past a variety of products to reach everyday items. For example, when essential products are placed at the end of an aisle, consumers who need them are likely to walk through the entire aisle to pick them up, which increases the opportunity to sell other products along the way (Page et al. 2019). Designing the shop floor layout so that sales magnets display products of interest enables customers to circulate through many sections and come into contact with a wide variety of products (Ohta and Higuchi 2013).

Fig. 2

Four retail store layout types: a grid, b free form, c racetrack and d circulation

Increase browsing time: shopping trips cluster into distinct configurations for short, medium, and long trips (Larson et al. 2005). The amount of time customers spend in the supermarket broadly indicates how interested they are in shopping there (Kim and Kim 2008; Sorensen 2016). Shopping can be a way to reduce stress and enjoy free time (Guiry et al. 2006; Hart et al. 2007). By planning the store layout, retailers encourage customers to flow through more aisles and see a wider variety of merchandise, which can increase the potential for more sales (Cil 2012) and even lead to compulsive buying (Geetha et al. 2013). For example, the free-form layout shown in Fig. 2 is a free-flowing and asymmetric arrangement of displays and aisles, employing a variety of sizes, shapes, and styles of display. This layout increases the recreational time consumers spend in the store (Larsen et al. 2020; Lindberg et al. 2018). Longer trip durations improve store performance metrics, serving both consumers' needs and retailers' profits.

Easily find related products: to increase customer satisfaction. A store layout arranged in sets of products intended for purposeful buying can increase sales (Hansen et al. 2010). Thus, stores can arrange substitute goods and complementary goods in the same area; tea and coffee, for example, are substitute goods. Grouping products in this way saves consumers from having to find tea in the beverage section, cheese in the fresh cheese section, and cereal in the cereal section separately. This layout leads the customer along specific paths to visit as many store sections as possible (Kim and Kim 2008; Vrechopoulos et al. 2004). In addition, store layouts in industry practice normally arrange related products together, such as in the bakery area (bread, cakes, biscuits, and so on) and the vegetable area (carrots, beans, and so on). Further, exposing consumers to well-presented merchandise in such areas can result in higher sales (Kiran et al. 2012).

Control costs and track stock inventory: setting up a suitable layout and using AI to track images and traffic flows allows retailers to control costs, calculate stock levels in real time, and restock shelves promptly (Yang and Chen 1999). Frontoni et al. (2017) showed that machine learning can predict the available display area in the store and flag shelves on which goods are out of stock. Employees receive the notification and quickly restock the product so that no sales opportunity is missed. For example, Walmart is testing robots that scan shelves for missing items, products that need to be restocked, and price tags or weights that need changing (https://spd.group/artificial-intelligence/ai-for-retail/?fbclid=IwAR0HM8tP2vQ9MI6jE2lrkD7JnyBP1NMlEAgRWqWWKKlHoctFctHnPC60J9M#Route_Optimization).

2.2 Layout types

There are four major store layout types, as shown in Fig. 2: the grid layout, the free-form layout, the racetrack layout, and the circulation spine layout (Cil 2012).

The grid layout is a rectangular arrangement of displays and long aisles that generally run parallel to one another (Worse to come 2020). Fixtures and displays are laid parallel to the walls (Barghash et al. 2017). This type of layout is a popular choice for supermarkets, grocery stores and chain pharmacies (Vrechopoulos et al. 2004). A distinctive feature of the grid layout is the use of end caps, while staple items such as milk and eggs are placed at the back of the store. Thus, when consumers seek out essential products, they walk past a series of other products to get to them. This product exposure increases the potential for sales.

The free-form layout is a free-flowing and asymmetric arrangement of displays and aisles, employing a variety of different sizes, shapes, and styles of display (Cil 2012). This layout can increase the time consumers spend in a store. It also provides a wider view of products to consumers than the grid layout format. This increases the chance of consumers engaging in exploratory behavior where they look at new products.

The racetrack layout offers a major aisle to control customer traffic between the store's multiple entrances. Also known as a loop layout, it is "usually in the shape of a circle, square, or rectangle-and then returns the customer to the front of the store" (American Marketing Association, 2020). This layout leads the customer along a specific path to ensure that they are exposed to as many store sections as possible (Vrechopoulos et al. 2004).

The circulation spine layout is one where there is a traffic loop around the entire store, but the layout also includes a customer path right through the middle of the store (Cil 2012; Langevin et al. 1994).

2.3 Layout design approaches

Conventional retail store layout design approaches mainly rely on three criteria: product categories, cross-elasticity and consumption universes.

Product categories: This layout displays products mainly by manufacturer or product category, following industry convention (Cil 2012). Here, supermarkets cluster product groups to help consumers find items faster and more easily. For example, supermarkets frequently group products in a bakery area (e.g., bread, cakes, and biscuits), a vegetable area (e.g., beans and cabbages), and a fruit area (e.g., apples and oranges). Shoppers are used to finding products on the same shelf or in the same aisle (Mowrey et al. 2018). Jones et al. (2003) indicate that arrangement by product category can enhance consumer impulse buying.

Cross-elasticity: This layout emphasizes how the sales of one product change with price changes in another product (Cil 2012; Hansen et al. 2010; Kamakura and Kang 2007), capturing cross-product interactions in demand via prices (Hwang et al. 2005; Murray et al. 2010). According to Walters and Jamil (2003), product categories are placed side by side in cognitively logical pairs, treated as cross-elasticities; the layout thus captures the use association among categories. With this type of layout, consumers can buy products easily by comparing their strengths and weaknesses during the same store visit (Drèze et al. 1994; Murray et al. 2010).

Consumption universes: This refers to a "consumer-oriented store layout approach through a data mining approach" (Cil 2012). Under this layout, breakfast products including tea, bread, cheese, and cereal are presented in the same place (Cil 2012). This approach saves consumers from having to find tea in the beverage section, cheese in the fresh cheese section, and cereal in the cereal section separately. Other universes include the baby universe or the tableware universe, which can be clustered across different product categories (Borges 2003). Clustering products around consumer buying habits can have significant appeal to busy consumers. In such situations, consumers can feel satisfied because the retailer appears to understand their needs and life circumstances, which in turn can positively affect purchasing decisions. Indeed, store layout design can not only satisfy consumer requirements but can also make consumers more accepting of prices, increase loyalty (Bill and Dale 2001; Koo and Kim 2013), and encourage repeat purchasing (Soars 2003). In addition, store layout affects consumer behavior in terms of in-store traffic patterns, shopping atmosphere, shopping behavior, and operational efficiency (Bill and Dale 2001; Donovan et al. 1994). Store atmosphere improves customers' mood and thus their satisfaction with shopping (Anic et al. 2010; Hussain and Ali 2015), making them willing to buy again (Jalil et al. 2016).

3 How visual AI techniques are being used in retailing

This section discusses how modern visual AI techniques are being used in retailing to understand consumers and their shopping behavior. The overarching aim of this research is to understand consumers and their behaviors while shopping with the current supermarket layout. The analysis can lead to a better understanding of consumer shopping satisfaction, which can be used to optimize store layout design, thereby making the shopping experience more convenient, more productive and more satisfying for consumers. AI and its sub-fields of computer vision, machine learning and deep learning function like the brain that processes and interprets the footage coming from the eyes (the CCTV cameras). Modern AI video analytics can be applied to footage coming from a network of CCTV cameras for insights into customer shopping activity. This includes customer-centric tasks (Gupta and Ramachandran 2021) such as customer detection, shopping cart detection, customer identification, customer tracking, customer emotion recognition and customer action recognition.

3.1 Customer and shopping cart detection

Object detection

Object detection involves AI techniques that locate and classify objects based on a range of predefined categories. In the store design setting, the objects to be located include humans (i.e., shoppers), shopping items, and shopping carts. After object detection has been performed, bounding boxes are returned, where each box represents the spatial location and extent of an object instance in the image or video. For example, Fig. 3 shows how an object detection algorithm detects the customer and the items that have been picked up and placed in the shopping cart.

Fig. 3

Object detection in a supermarket setting

Object detection has been an active area of research (Zou et al. 2019) since it is the foundation for solving complex, higher-level vision tasks such as identification and tracking (discussed later). Object detection is not simple, as objects can vary significantly in pose, scale, resolution and lighting. For example, a person can go shopping wearing any type of clothing, from a work uniform to pajamas. In addition, the person can stand anywhere in the recorded scene, at any distance from the camera and at any angle to it. Recent advances in deep learning, following the breakthrough results of Krizhevsky et al. (2012), have significantly improved object detection performance to a level sufficient for real-life deployment. The modern object detectors currently being deployed are divided into two categories: two-stage detectors and one-stage detectors.

  • Two-stage detectors first extract category-independent regions, then apply classification to the deep features of each region. State-of-the-art examples of this category are Faster RCNN (Ren et al. 2015) and Mask RCNN (He et al. 2017).

  • One-stage detectors unify the two stages by directly predicting class probabilities and bounding box offsets from full images with a single feed-forward CNN in a monolithic setting that involves neither region proposal generation nor post-classification. State-of-the-art examples of this category are YOLO (Bochkovskiy 2020) and EfficientNet (Tan and Le 2019).

In the retail store setting, humans and products are the key objects of interest for detection. Detecting these objects presents unique and/or extreme challenges compared with general object detection, including complex backgrounds, uneven lighting, unusual viewing angles, specularity (Santra and Mukherjee 2019) and severe occlusions among groups of instances of the same category (Cai et al. 2021; Karlinsky et al. 2017). Several works have employed HOG features and SVMs for human detection in a retail store (Kuang et al. 2015; Marder et al. 2015; Ahmed et al. 2017). However, these handcrafted-feature approaches are vulnerable to imaging conditions. Recently, many deep learning based approaches have been proposed. Cai et al. (2021) proposed a cascaded localization and counting network which simultaneously regresses bounding boxes of objects and counts the number of instances. Nogueira et al. (2019) proposed to preprocess surveillance videos with foreground-background classification before feeding the RGB data and the foreground mask to their RetailNet for detection. Kim et al. (2019) investigated the performance of a range of popular deep learning based detectors, including YOLO, SSD, RCNN, R-FCN and SqueezeDet, in the retail setting. They found that Faster RCNN and R-FCN are the most accurate detectors for retail.
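As a concrete illustration of how such detectors could be applied to store footage, the sketch below runs a pretrained two-stage detector (Faster RCNN from torchvision) on a single CCTV frame and prints the bounding boxes of detected people. This is a minimal sketch rather than any system reviewed above; the input file name, score threshold and use of the COCO "person" class are illustrative assumptions.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Load a pretrained two-stage detector (Faster RCNN with a ResNet-50 FPN backbone).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = Image.open("cctv_frame.jpg").convert("RGB")   # hypothetical CCTV frame
with torch.no_grad():
    prediction = model([to_tensor(frame)])[0]

# COCO label 1 is "person"; keep only confident detections of shoppers.
for box, label, score in zip(prediction["boxes"], prediction["labels"], prediction["scores"]):
    if label.item() == 1 and score.item() > 0.8:
        x1, y1, x2, y2 = box.tolist()
        print(f"customer at ({x1:.0f}, {y1:.0f}) - ({x2:.0f}, {y2:.0f}), score {score.item():.2f}")
```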

Pose estimation

Pose estimation aims to recover the posture of the human body from given images or footage. While object detection techniques infer the presence and location of customers, pose estimation techniques infer body posture through body keypoints such as the head, shoulders, elbows, wrists, hips, knees, and ankles (Chen et al. 2020). In the store design setting, pose estimation provides the fine-grained detail needed to understand a customer's actions and interactions with surrounding objects. There are two categories of pose estimation: 2-dimensional (2D) and 3-dimensional (3D). While 2D pose estimation predicts the locations of body joints in the image (in terms of pixel values), 3D pose estimation predicts a three-dimensional spatial arrangement of all the body joints as its final output (Chen et al. 2020). In the 2D pose task, there are two main categories:

  • Regression-based methods: attempt to learn a mapping from the image to kinematic body joint coordinates within an end-to-end framework and generally produce joint coordinates directly. Typical examples of this category are DeepPose (Toshev and Szegedy 2014) and the approach of Luvizon et al. (2019).

  • Detection-based methods: predict approximate locations of body parts or joints and are usually supervised by a sequence of rectangular windows or heatmaps (Chen et al. 2020). Typical examples of this category are PartPoseNet (Tang and Wu 2019) and HRNet (Wang et al. 2020); a minimal sketch in this detection-based style is given after the 3D categories below.

In the 3D pose task, there are two main categories:

  • Model-based methods: employ a parametric body model or template to estimate human pose and shape from images. The two most popular parametric models in the literature are SMPL (Loper et al. 2015) and the kinematic model (Mehta et al. 2017). The pose estimation task is then interpreted as a model parameter estimation task, which requires far fewer parameters.

  • Model-free methods: either directly map an image to a 3D pose or estimate depth on top of a 2D pose produced by intermediate 2D pose estimation methods. Typical examples of this category are MocapNET (Qammaz and Argyros 2019) and 3DMDN (Li and Lee 2019).
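As a minimal illustration of the detection-based 2D style referenced above, the sketch below applies a pretrained Keypoint R-CNN from torchvision to a single frame and prints the 17 COCO body keypoints of each confidently detected person. The input file name and score threshold are illustrative assumptions.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

COCO_KEYPOINTS = ["nose", "left_eye", "right_eye", "left_ear", "right_ear",
                  "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
                  "left_wrist", "right_wrist", "left_hip", "right_hip",
                  "left_knee", "right_knee", "left_ankle", "right_ankle"]

# Pretrained Keypoint R-CNN: detects people and their 17 COCO body keypoints.
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = to_tensor(Image.open("aisle_frame.jpg").convert("RGB"))  # hypothetical frame
with torch.no_grad():
    out = model([frame])[0]

for person_keypoints, score in zip(out["keypoints"], out["scores"]):
    if score.item() > 0.8:                       # keep confident detections only
        for name, (x, y, visible) in zip(COCO_KEYPOINTS, person_keypoints.tolist()):
            if visible > 0:
                print(f"{name}: ({x:.0f}, {y:.0f})")
```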

Heatmap analytics

The output of human detection can be used to construct business heatmaps. A heatmap provides a visual summary of information through a two-dimensional representation of data, in which values are represented by colours, as illustrated in Fig. 4. Heatmaps can be displayed in a number of ways, but they all share one thing in common: they use colours to convey relationships between data values that would be much harder to understand if presented as a sheet of numbers. For example, the heatmap in Fig. 4 employs warmer colours for locations where customers spend more time. Heatmaps are widely used to understand sales and marketing.
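The sketch below illustrates how detected customer positions could be aggregated into such a heatmap with NumPy and Matplotlib. The store dimensions, grid resolution and the randomly generated positions are placeholders for real tracking output.

```python
import numpy as np
import matplotlib.pyplot as plt

# (x, y) floor positions of detected customers, one entry per person per frame.
# Here they are randomly generated for a hypothetical 40 m x 25 m store.
positions = np.random.rand(5000, 2) * np.array([40.0, 25.0])

# Bin the positions into a 2D grid: counts act as a proxy for dwell time.
heatmap, xedges, yedges = np.histogram2d(
    positions[:, 0], positions[:, 1], bins=(80, 50), range=[[0, 40], [0, 25]])

plt.imshow(heatmap.T, origin="lower", extent=[0, 40, 0, 25], cmap="hot")
plt.colorbar(label="observations (dwell-time proxy)")
plt.xlabel("store x (m)")
plt.ylabel("store y (m)")
plt.title("Customer density heatmap")
plt.show()
```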

Fig. 4

Heatmaps in a supermarket setting

3.2 Customer identification

Facial recognition

The facial region is one of the most natural human characteristics for identification because it is how we recognize one another. Facial recognition relies on the science of biometrics, detecting and measuring various characteristics, or feature points, of human faces for identification. When a new face appears in the scene, it is compared with a gallery of previously collected faces to infer its identity. In the supermarket setting, facial recognition can serve as a customer identification system that does not require physical ID cards to track either identity or shopping history, as shown in Fig. 5.
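The sketch below illustrates the gallery-matching step described above: a probe face embedding is compared with stored gallery embeddings using cosine similarity. The embedding dimensionality, threshold and identities are illustrative, and the embeddings themselves would in practice come from a deep face model such as those discussed next.

```python
from typing import Dict
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(probe: np.ndarray, gallery: Dict[str, np.ndarray],
             threshold: float = 0.6) -> str:
    """Return the best-matching gallery identity, or 'unknown' below the threshold."""
    best_id, best_sim = "unknown", threshold
    for identity, reference in gallery.items():
        sim = cosine_similarity(probe, reference)
        if sim > best_sim:
            best_id, best_sim = identity, sim
    return best_id

# Hypothetical usage: in practice the 512-d embeddings would come from a
# pretrained deep face model applied to detected face crops.
gallery = {"customer_001": np.random.rand(512), "customer_002": np.random.rand(512)}
print(identify(np.random.rand(512), gallery))
```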

Fig. 5

Face recognition in a supermarket setting

Facial recognition has a long history and is actively researched in both the research community and industry. Facial recognition is not simple because of facial expressions, which can distort a person's face; poses, in which non-frontal angles can occlude parts of the face; and lighting, which can illuminate the face non-uniformly. Recent advances in deep learning, following the breakthrough results of Krizhevsky et al. (2012), have brought facial recognition performance to a level equal to or even surpassing that of humans. Modern face recognizers are based on deep learning techniques. Compared with conventional feature engineering approaches, deep learning face recognition approaches train deep networks end-to-end without the need for human feature engineering. A prominent approach in the deep learning category is FaceNet, proposed in 2015 by researchers from Google (Schroff et al. 2015). FaceNet achieved state-of-the-art accuracy on the famous LFW benchmark, approaching human performance under unconstrained conditions for the first time (FaceNet: 97.35% vs. human: 97.53%) by training a 9-layer model on 4 million facial images. Recent research has pushed performance even further, to above 99.80% with ArcFace (Deng et al. 2019) in 2019.

While the use of facial recognition in the retail setting remains controversial, especially due to privacy concerns, it is still a possible application of AI in retail. However, performing facial recognition on retail CCTV footage raises challenges that have to be further investigated. The first challenge is the small resolution of the faces in the images, due to the subject-camera distance or the resolution of the CCTV camera itself (Nguyen et al. 2012; Lin et al. 2007). The second challenge is the pose of the face, i.e., non-frontal faces: the unconstrained movement of customers through a shop can make it difficult to capture an ideal frontal face for recognition. Super-resolution can be used to improve the resolution of facial images, either in the pixel domain or the feature domain (Nguyen et al. 2018; Jiang et al. 2021). Generative models can be used to synthesize a face from a multitude of angles to deal with the viewing-angle challenge (Wang and Deng 2021).

Customer characterization

A person’s face contains not only information about identity, but also important demographic attributes, including a person’s age, gender and ethnicity. As shown in Fig. 5, the age, gender and ethnicity of both customers are being extracted from the facial image. In the supermarket setting, this information could be used in many ways for customer-tailored targeting.

Modern face attribute estimation algorithms also rely heavily on deep learning to extract deep features from the facial image (Zheng et al. 2020). Prominent approaches to facial attribute estimation are the Multi-task Convolutional Neural Network (MT-CNN) with local constraints for face attribute learning (Cao et al. 2018) and the Deep Multi-Task Learning (DMTL) approach for heterogeneous face attribute estimation proposed by Han et al. (2018). Experiments on large-scale facial datasets showed promising accuracies of 86%, 96% and 94% for age, gender and race estimation respectively.
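A minimal sketch of the multi-task idea behind such approaches is shown below: a shared CNN backbone feeds three separate heads for age group, gender and ethnicity. The backbone choice, numbers of classes and dimensions are illustrative assumptions, not the architectures of the cited works.

```python
import torch
import torch.nn as nn
import torchvision

class FaceAttributeNet(nn.Module):
    """Shared backbone with one head per attribute (multi-task learning)."""
    def __init__(self, num_age_bins: int = 8, num_ethnicities: int = 5):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()                     # expose the 512-d feature
        self.backbone = backbone
        self.age_head = nn.Linear(512, num_age_bins)    # age-group classification
        self.gender_head = nn.Linear(512, 2)
        self.ethnicity_head = nn.Linear(512, num_ethnicities)

    def forward(self, face_batch: torch.Tensor):
        feat = self.backbone(face_batch)
        return (self.age_head(feat),
                self.gender_head(feat),
                self.ethnicity_head(feat))

model = FaceAttributeNet()
age_logits, gender_logits, ethnicity_logits = model(torch.randn(1, 3, 224, 224))
print(age_logits.shape, gender_logits.shape, ethnicity_logits.shape)
```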

3.3 Customer tracking

Human trajectory tracking

While each camera can detect the location of a person, tracking customers over the whole CCTV camera network is of great interest for business analytics. Human tracking over camera networks is composed of two functional modules: human tracking within a camera and human tracking across non-overlapping cameras. Human tracking within a camera focuses on locating human objects in each frame of a video sequence from one camera, while human tracking across multiple cameras focuses on associating a tracked human object in the Field Of View (FOV) of one camera with that in the FOV of another camera (Fernando et al. 2018) (Fig. 6).

Fig. 6

Object tracking in a supermarket setting

Human tracking within a camera

Human tracking within a camera generates the moving trajectories of human objects over time by locating their positions in each frame of a given video sequence. Based on how correlations among the human objects are handled, tracking within a camera can be categorized into two types: generative trackers and discriminative trackers.

  • Generative trackers: each target's location and correspondence are estimated by iteratively updating the respective location obtained from the previous frame. To avoid an exhaustive search for the new target location and reduce computational cost during this iterative search, the most widely used tracking methods include Kalman filtering (KF) (Kalman 1960), particle filtering (PF), and kernel-based tracking (KT).

  • Discriminative trackers: all human locations in each video frame are first obtained by a human detection algorithm, and the tracker then jointly establishes these human objects' correspondences across frames through a target association technique. The most widely used target association techniques include joint probability data association filtering (JPDAF), multiple-hypothesis tracking (MHT), and the flow network framework (FNF); a minimal frame-to-frame association sketch is given after this list.
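The sketch below (referenced in the discriminative-tracker item) illustrates the simplest form of target association in tracking-by-detection: detections in consecutive frames are linked greedily by Intersection-over-Union. Real trackers combine this with motion models (e.g., Kalman filters) and appearance features; the threshold and boxes used here are illustrative.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks: List[Box], detections: List[Box],
              thr: float = 0.3) -> List[Tuple[int, int]]:
    """Greedy IoU matching: returns (track_index, detection_index) pairs."""
    pairs, used = [], set()
    for ti, track_box in enumerate(tracks):
        best_di, best_iou = None, thr
        for di, det_box in enumerate(detections):
            overlap = iou(track_box, det_box)
            if di not in used and overlap > best_iou:
                best_di, best_iou = di, overlap
        if best_di is not None:
            pairs.append((ti, best_di))
            used.add(best_di)
    return pairs

# One existing track and two new detections: the first detection continues the track.
print(associate([(10, 10, 50, 100)], [(12, 11, 52, 103), (200, 50, 240, 150)]))
```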

There are two major problems in tracking: single-object tracking (SOT) and multiple-object tracking (MOT) (Wu et al. 2021). In a crowded supermarket, MOT is preferred because it can track multiple customers simultaneously. The current state of the art in MOT is based on deep learning, for example ByteTrack (Zhang et al. 2021), which achieved 80.3 MOT accuracy on the MOT17 test set. New deep learning architectures such as transformers have also been shown to achieve top performance on tracking benchmarks, for example TransMOT (Chu et al. 2021).

In the specific setting of a retail store, a number of works have attempted to employ modern trackers for the customer tracking task. Nguyen and Tran (2020) employed Siamese-family trackers for tracking customers in crowded retail scenes. Leykin and Tuceryan (2005) designed a Bayesian jump-diffusion filter to track customers in a store and performed a number of activity analysis tasks. While most modern trackers can be applied to the customer tracking task in a retail store setting, the tradeoff between speed and accuracy has to be considered (Wojke et al. 2017).

Human tracking across cameras

Human tracking across non-overlapping cameras establishes correspondences between the human objects detected or tracked by two non-overlapping cameras, so that labels can be handed off successfully. Based on the approach used for target matching, human tracking across cameras can be divided into three main categories: human re-ID, CLM-based tracking, and GM-based tracking.

In addition, the use of multiple cameras allows the 3D position of subjects to be obtained using triangulation, although some works can obtain the 3D position of humans using a single camera (Neves et al. 2015), which reduces system cost.
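As a small illustration of the triangulation idea, the sketch below recovers a 3D position from two calibrated, overlapping views using OpenCV. The projection matrices and pixel coordinates are placeholder values standing in for real camera calibration and detection output.

```python
import numpy as np
import cv2

# 3x4 projection matrices P = K[R | t] of two cameras (placeholder calibration).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

# Matching image points of the same customer in the two views (2xN arrays, N = 1).
pt1 = np.array([[320.0], [240.0]])
pt2 = np.array([[300.0], [240.0]])

homogeneous = cv2.triangulatePoints(P1, P2, pt1, pt2)   # 4x1 homogeneous coordinates
xyz = (homogeneous[:3] / homogeneous[3]).ravel()
print("estimated 3D position:", xyz)
```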

3.4 Customer emotion recognition

Emotion recognition

Humans express emotion through observable facial expressions such as raising an eyebrow, opening the eyes wider or changing the shape of the mouth (e.g., smiling). Understanding customers' emotions while they browse the store could give marketers and managers a valuable tool for understanding customer reactions to the products they sell (Martin 2003; Martin and Lawson 1998). Given this potential, emotion recognition has grown into a $20 billion industry. Many big names, including Amazon, Microsoft and IBM, now advertise "emotion analysis" as one of their facial recognition products.

Emotion recognition algorithms work by employing computer vision techniques to locate the face and identify key landmarks on it, such as the corners of the eyebrows, the tip of the nose, and the corners of the mouth (Nguyen et al. 2017a, 2017b). Most approaches in the literature classify emotion into one of seven expression classes: fear, anger, joy, sadness, acceptance or disgust, expectancy, and surprise (Li and Deng 2020). Beyond market research, emotion detection technology is now being used to monitor and detect driver impairment, to test user experience for video games and to help medical professionals assess the well-being of patients (Don't look now 2020).
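The sketch below outlines this pipeline in its simplest form: faces are located in a frame and each crop is passed to an expression classifier. The classifier here is a random placeholder standing in for a CNN trained on facial-expression data; the file name and the seven-class list (taken from the text above) are illustrative.

```python
import cv2
import numpy as np

EMOTIONS = ["fear", "anger", "joy", "sadness", "disgust", "expectancy", "surprise"]

# Face detector shipped with OpenCV; a landmark-based or CNN detector could be used instead.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def classify_emotion(face_crop: np.ndarray) -> str:
    """Placeholder classifier: a real system would run a trained expression CNN here."""
    return EMOTIONS[np.random.randint(len(EMOTIONS))]

frame = cv2.imread("cctv_frame.jpg")                    # hypothetical frame
if frame is not None:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.3, 5):
        face = frame[y:y + h, x:x + w]
        print("detected face, predicted expression:", classify_emotion(face))
```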

Multimodal emotion recognition

Humans also express emotion in many other ways, such as raising their voices and pointing their fingers. Studies by Albert Mehrabian in the 1980s (Mehrabian 1981) established the 7–38–55% rule, also known as the "3V rule": 7% of communication is verbal, 38% is vocal and 55% is visual. Multimodal emotion recognition approaches rely on a combination of facial, body and verbal signals to infer the emotion of a subject (Sharma and Dhall 2021). One state-of-the-art example is End-to-End Multimodal Emotion Recognition Using Deep Neural Networks (Tzirakis et al. 2017), in which the network comprises two parts: a multimodal feature extraction part and an RNN part. The multimodal part extracts features from raw speech and visual signals. The extracted features are concatenated and fed into two LSTM layers, which capture the contextual information in the data. Other than LSTMs, a Deep Belief Network can also be employed to further capture the interactions between multiple modalities (Nguyen et al. 2017a, 2017b). Contextual information can be further captured and modelled to improve accuracy (Mittal et al. 2020).
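A minimal sketch of this fusion scheme is given below: per-frame visual and audio feature vectors are concatenated and fed to a two-layer LSTM whose final state is classified into emotion categories. Feature dimensions, sequence length and the number of classes are illustrative assumptions rather than the settings of the cited work.

```python
import torch
import torch.nn as nn

class MultimodalEmotionRNN(nn.Module):
    """Concatenate per-frame visual and audio features, then classify with an LSTM."""
    def __init__(self, visual_dim=512, audio_dim=128, hidden=256, num_emotions=7):
        super().__init__()
        self.lstm = nn.LSTM(visual_dim + audio_dim, hidden,
                            num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, time, visual_dim); audio_feats: (batch, time, audio_dim)
        fused = torch.cat([visual_feats, audio_feats], dim=-1)
        out, _ = self.lstm(fused)
        return self.classifier(out[:, -1])    # predict from the final time step

model = MultimodalEmotionRNN()
logits = model(torch.randn(4, 30, 512), torch.randn(4, 30, 128))
print(logits.shape)   # torch.Size([4, 7])
```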

In the supermarket setting, the three main modalities of interest for multimodal emotion recognition are facial expression, speech and body gesture. Due to the distance between the cameras and the customers, facial expression and body gesture are more practical than speech.

3.5 Customer action recognition

Understanding customer behavior is the ultimate goal of business intelligence. How customers behave or act reveals their interest in a product. Obvious actions such as picking up products, putting products into the trolley, and returning products to the shelf have attracted great interest in smart retail. Other behaviors, such as staring at a product or reading its box, are a gold mine for marketing, revealing a customer's interest in a product. From a marketing point of view, these behaviors provide evidence with which to investigate the elements of the purchase decision-making process that determine a particular consumer choice and how marketing tactics can influence consumers. Empirical research on consumer behaviour is primarily based on the cognitive approach, which makes it possible to predict and define the actions that lead to the conclusion of a purchase and to suggest implications for communication and marketing strategies (Le 2019; Martin and Nuttall 2017; Martin and Strong 2016).

Fig. 7

Action recognition in a supermarket setting

Understanding and recognizing the behaviors of a customer is based on the customer's body gestures in relation to the shelves, the products and the trolley or basket he or she is using (Gammulle et al. 2020). The analysis is based on recognizing head orientation, eye gaze, and 2-dimensional (2D) and 3-dimensional (3D) pose, as illustrated in Fig. 7. Compared with traditional pose estimation approaches, which required motion capture systems or depth cameras, modern approaches only need a monocular RGB video to infer 2D and 3D skeletons in real time.

Modern action recognition algorithms are based on the temporal movement of the skeleton estimated from the pose: they extract motion patterns from a skeleton sequence and classify the action into one of a number of action categories of interest (Herath et al. 2017; Zhang et al. 2019). There are three main approaches to processing a sequence of 2D or 3D skeletons:

  • RNN-based: Recurrent Neural Networks process a sequence of time-series data recursively. The skeleton sequence estimated from an action video is fed into an RNN, which models the temporal relationships within the sequence to classify the observed action into one of the pre-defined action categories (a minimal sketch of this idea is given after this list). One state-of-the-art example of this category is HBRNN-L (Yong et al. 2015).

  • CNN-based: Convolutional Neural Networks process images (2D CNNs) and videos (3D CNNs) using a hierarchy of convolutional layers that gradually learn high-level semantic cues, thanks to their natural ability to extract high-level information. The skeleton sequence estimated from an action video is usually transformed into a pseudo-image to be processed by a CNN. One state-of-the-art example of this category is Caetano et al. (2019).

  • GCN-based: Graph Convolutional Networks perform convolution operations on a graph instead of on an image composed of pixels. Since skeleton data is naturally a topological graph rather than a sequence vector or pseudo-image, GCNs have been adopted for this task. One state-of-the-art example of this category is ST-GCN (Yan et al. 2018).
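The sketch below (referenced in the RNN-based item above) shows the core of a skeleton-based action recognizer: a sequence of 2D skeletons is flattened per frame and classified by an LSTM into retail-relevant actions. The action labels, joint count and network sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

ACTIONS = ["pick_up_product", "return_product", "push_trolley", "browse_shelf"]

class SkeletonActionLSTM(nn.Module):
    """Classify a sequence of 2D skeletons (17 joints per frame) into actions."""
    def __init__(self, num_joints=17, hidden=128, num_actions=len(ACTIONS)):
        super().__init__()
        self.lstm = nn.LSTM(num_joints * 2, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, num_actions)

    def forward(self, skeleton_seq):
        # skeleton_seq: (batch, time, joints, 2) -> flatten the joints of each frame
        b, t, j, c = skeleton_seq.shape
        out, _ = self.lstm(skeleton_seq.reshape(b, t, j * c))
        return self.classifier(out[:, -1])

model = SkeletonActionLSTM()
logits = model(torch.randn(2, 60, 17, 2))          # two clips of 60 frames each
print(ACTIONS[logits.argmax(dim=-1)[0]])
```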

Many researchers have applied these modern action recognizers to the customer action recognition task in a retail store setting. Liu et al. (2015) proposed a combined hand feature (CHF), which includes the hand trajectory, tracking status and the relative position between the hand and the shopping basket, to classify arm motions into several classes of arm actions. Trinh et al. (2011) proposed a hierarchical finite state machine to detect human activities in retail surveillance. Wang (2020) proposed a hierarchy-based system for recognizing customer activity in retail environments. Customer behaviors can also be analysed offline when there is no need for real-time decisions. Researchers have also considered combining conventional RGB cameras with depth sensors, which provide richer information about 3D locations and subject shapes for customer interaction analysis (Frontoni et al. 2013).

4 How AI techniques can be employed to improve layout design

Based on the analysis and review in Sects. 2 and 3, we propose a comprehensive framework for applying visual AI and data analytics techniques to the store layout design task. This framework will be referred to as AI-powered Store Layout Design.

Fig. 8

STAL: cyclic conceptual diagram of the Sense-Think-Act-Learn model for store layout design

Conceptual design

The conceptual diagram of the Sense-Think-Act-Learn (STAL) framework is presented in Fig. 8. The highest-level architecture consists of Sense, Think, Act and Learn modules. Firstly, "Sense" collects raw data, i.e., video footage from a store's CCTV cameras, for subsequent processing and analysis, similar to how humans use their senses. Secondly, "Think" processes the collected data with advanced AI and data analytics, i.e., intelligent video analytics and AI algorithms, similar to how humans use their brains to process incoming data. Thirdly, "Act" uses the knowledge and insights from the second phase to improve and optimize the supermarket layout. The process operates as a continuous learning cycle. An advantage of this framework is that it allows retailers to test store design predictions, such as the traffic flow behavior when customers enter a store or the popularity of store displays placed in different areas of the store (Ferracuti et al. 2019; Underhill).
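The control loop below sketches the STAL cycle in code form. All functions are stubs whose names and outputs are illustrative placeholders for the layers described in the system architecture that follows.

```python
def sense():
    """SENSE: collect raw footage from the store's CCTV cameras (stubbed)."""
    return ["frame_0", "frame_1"]

def think(footage):
    """THINK: run intelligent video analytics to produce customer insights (stubbed)."""
    return {"dwell_time_by_section": {"bakery": 120, "dairy": 45}}

def act(layout, insights):
    """ACT: propose a revised layout based on the insights (stubbed)."""
    dwell = insights["dwell_time_by_section"]
    busiest = max(dwell, key=dwell.get)
    return dict(layout, promoted_section=busiest)

def learn(old_layout, new_layout):
    """LEARN: measure the effect of the change and update the models (stubbed)."""
    print("layout changed:", old_layout, "->", new_layout)

layout = {"promoted_section": None}
for _ in range(3):                                  # three STAL iterations
    insights = think(sense())
    new_layout = act(layout, insights)
    learn(layout, new_layout)
    layout = new_layout
```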

System architecture

The proposed framework has a multi-layer architecture corresponding to the phases of the conceptual diagram, and is presented in Fig. 9.

Fig. 9

AI-powered store layout ecosystem architecture

The data layer in the SENSE phase includes data streams from the CCTV cameras and video recordings, metadata, customer data, market layout data, etc. In addition to dynamically generated information sources (flowing data), there may also be static information sources such as floor data. The collectors in the data layer gather this data continuously, or on demand at certain intervals, and a resource pool is created. Data are filtered and cleaned to improve quality and protect privacy. At this stage, the data can also be transformed into a structured form. Given that privacy is a key concern for customers, data can be de-identified or anonymized, for example by examining customers at an aggregate level (Lewinski et al. 2016).
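The sketch below illustrates one possible de-identification step at this stage: faces are detected in each frame and blurred before the footage is passed on, so that downstream analytics operate on anonymized video. The detector choice and blur strength are illustrative.

```python
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def anonymize_frame(frame):
    """Blur every detected face region so identities never leave the data layer."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.3, 5):
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(
            frame[y:y + h, x:x + w], (51, 51), 0)
    return frame

# Usage: each frame read from a CCTV stream would be passed through
# anonymize_frame() before being stored or analyzed further.
```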

The pre-processing layer in the SENSE phase applies several techniques to improve the quality of captured images and videos, such as cleaning, de-noising, de-blurring and transformation correction. Since there is an intense data flow from the CCTV cameras, a cloud-based system can be considered a suitable approach for processing and storing video data for supermarket layout analysis. Large volumes of video data can be stored reliably on cloud storage servers for extended periods. The proposed architecture is intelligent and fully scalable. Local client-server architectures are not a suitable solution for supermarkets, considering the need for system maintenance, qualified personnel and the associated financial costs.

The intelligent video analytics layer in the THINK phase plays the key role in interpreting the content of images and videos. Intelligent video analysis involves a variety of AI, computer vision, machine learning, and deep learning techniques for detecting, identifying, tracking, analyzing, and extracting meaningful information from images and video streams. These techniques are discussed in detail in Sect. 3. A general deep learning based video analysis process for supermarket layout optimization is summarized in Fig. 10.

The execute layer in the ACT phase employs the analytical results and insights from the analytics layer to take action: improving the layout, measuring the success of the improved layout, evaluating the results obtained, and continuously revising the created layout. Two examples of using the STAL framework would be studying maps of customer density or of time spent in the store (see Ferracuti et al. 2019) to generate optimal layouts. Layout variables that managers can consider include store design variables (e.g., space design, point-of-purchase displays, product placement, placement of cashiers), employees (e.g., number, placement), and customers (e.g., crowding, visit duration, impulse purchases, use of furniture, waiting queue formation, receptivity to product displays).
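As one example of how the execute layer could turn analytics output into a layout decision, the sketch below aggregates dwell time per store section from customer trajectories and ranks the sections, flagging low-dwell sections as candidates for redesign. The section map, frame rate and trajectories are toy placeholders for real tracking output.

```python
from collections import defaultdict

FPS = 10  # hypothetical analysis frame rate (positions sampled 10 times per second)

def section_of(x, y):
    """Map a floor coordinate to a store section (toy two-section layout)."""
    return "front" if y < 12.5 else "back"

# trajectories: {customer_id: [(x, y), ...]}, one position per analysed frame
trajectories = {
    "c1": [(5.0, 3.0)] * 300 + [(5.0, 20.0)] * 30,
    "c2": [(30.0, 4.0)] * 200,
}

dwell = defaultdict(float)
for path in trajectories.values():
    for x, y in path:
        dwell[section_of(x, y)] += 1.0 / FPS        # seconds spent in each section

for section, seconds in sorted(dwell.items(), key=lambda kv: kv[1]):
    print(f"{section}: {seconds:.0f} s total dwell time")
# Sections near the bottom of this ranking are candidates for layout changes.
```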

Fig. 10

The process of intelligent video analytics for supermarket layout optimization

Data flow

The information flow between the components of the architecture designed for AI-supported layout design is shown in Fig. 11. In the Sense stage, the data flow out of the Data Generation and Pre-processing layers in the form of raw videos and processed videos, respectively. In the Think stage, the data flow out of the Intelligent Video Analytics layer in the form of analysis and interpretation such as measurements, statistics, and analytics. The analysis is fed to the Visualization layer to output graphs and diagrams. Ultimately, intelligent decisions are made, suggestions are presented, and inferences are drawn to improve the layout. In this framework, a cyclical process (sense-think-act) is repeated as a result of the machine learning taking place. This iterative process results in changes in store layout that should improve customer satisfaction. For example, research shows that there is often a mismatch between customer expectations and what retailers do regarding shelf placement, and this mismatch has the potential to frustrate customers (Valenzuela et al. 2013). Following the STAL model would reduce such discrepancies between consumer expectations and store design (Valenzuela et al. 2013). In so doing, it provides an example of how AI can be used to enhance business productivity (Kumar et al. 2021). Feedback from the STAL model can also be used to aid managers in the market segmentation of different groups of consumers with different behavior patterns (e.g., time in store, size and frequency of purchase, price and store display promotion sensitivity).
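A minimal sketch of such behaviour-based segmentation is given below: shoppers are clustered on a few behavioural features that the STAL analytics could supply (time in store, basket size, visit frequency). The feature values and the number of segments are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rows: shoppers; columns: minutes in store, items per basket, visits per month.
features = np.array([
    [12,  5,  8], [15,  6,  7], [45, 30,  2],
    [50, 28,  3], [25, 12,  4], [ 8,  3, 10],
], dtype=float)

segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(features))
print("segment per shopper:", segments)
```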

Fig. 11

AI-powered supermarket layout design model and data flow

5 Discussion and conclusions

Improving supermarket layout design is one important tactic for improving customer satisfaction and increasing sales. This paper reviews existing approaches to the layout design task and the AI and big data techniques that can be applied to the layout design problem, and, most importantly, proposes a comprehensive and novel framework for applying AI techniques on top of existing CCTV cameras to interpret and understand customers and their in-store behavior. There are a number of research directions to be further explored:

  • Research the performance of the proposed framework in real stores of different scales and with different types of merchandise;

  • Research the impact of unique retail store settings on the performance of AI components (e.g. detection, tracking, identification and behaviour analysis);

  • Research the impact of the change in marketing strategies caused by the insights generated from the proposed framework.