Left and Right Foot Segmentation Overview

IDM Solutions partnered with Augray to develop a 2D and 3D image segmentation framework that identifies the left and right foot for an augmented reality-based virtual try-on mobile application. The framework superimposes a 3D model of a shoe onto a person’s foot in the correct pose and orientation.

This capability propelled Augray into the augmented reality (AR)-based fashion market space, providing a framework that lets a person visualize how a shoe will look before purchasing.

Augmented Reality Virtual Try-on

One effect the COVID-19 pandemic had on the fashion industry was a paradigm shift in how customers shop online. Trying on clothes in a store became a virtual experience with the help of machine learning-based computer vision and augmented reality. Using the immersive technology of augmented reality, where a computer-generated 3D model is superimposed over cellphone images, customers can see how clothes fit their body from multiple angles.

2D Computer Vision-Based Machine Learning Image Segmentation

A UNET image segmentation model was used to determine the 2D segmentation mask of the left and right foot so that a 3D model of the shoe could be fitted for the virtual try-on experience. IDM Solutions created the entire MLOps pipeline, including data acquisition, data labeling, data augmentation, and model development and training. The UNET model was developed and trained in Google Colab, using a composite loss function that combines categorical cross entropy with the Jaccard index. The architecture of the UNET model is shown below.
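As an illustration of the training objective, the sketch below shows one way such a composite loss could be written, assuming a TensorFlow/Keras implementation with one-hot encoded masks of shape (batch, height, width, classes); the weighting factor alpha and the function names are illustrative assumptions rather than the project’s actual code.

```python
import tensorflow as tf

def jaccard_loss(y_true, y_pred, smooth=1e-6):
    # Soft Jaccard (IoU) loss over the spatial dimensions, averaged
    # across classes and the batch.
    intersection = tf.reduce_sum(y_true * y_pred, axis=[1, 2])
    union = tf.reduce_sum(y_true + y_pred, axis=[1, 2]) - intersection
    iou = (intersection + smooth) / (union + smooth)
    return 1.0 - tf.reduce_mean(iou)

def composite_loss(y_true, y_pred, alpha=0.5):
    # Weighted sum of categorical cross entropy and Jaccard loss;
    # alpha is an illustrative weighting factor, not taken from the project.
    cce = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y_true, y_pred))
    return alpha * cce + (1.0 - alpha) * jaccard_loss(y_true, y_pred)
```

A loss of this form can be passed directly to model.compile(loss=composite_loss) when training the segmentation network.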

The results of the trained model are shown in the figure below. The left column is the ground truth and the right column is the segmented mask predicted by the UNET model.

Once the mask was determined, a bounding box was placed around the segmentation mask of each foot and used to position the 3D model of the shoe. The bounding box must therefore be oriented so that its axis is aligned with the long axis of the foot. Using principal component analysis, the first two principal vectors lie along the long axis (toe to heel) and the short axis (foot width) of the foot. Knowing the principal directions determines how to rotate the bounding box so that it matches the pose and orientation of the foot.
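The sketch below illustrates this step, fitting an oriented bounding box to a binary foot mask with NumPy-based PCA; the function name and return conventions are illustrative assumptions, not the project’s exact implementation.

```python
import numpy as np

def oriented_bounding_box(mask):
    # mask: 2D array where foot pixels are nonzero (one foot per call).
    ys, xs = np.nonzero(mask)
    points = np.column_stack([xs, ys]).astype(float)
    mean = points.mean(axis=0)
    centered = points - mean

    # PCA: eigenvectors of the covariance matrix give the principal axes.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # first axis = long (toe-to-heel) axis
    eigvecs = eigvecs[:, order]

    # Extents of the mask along the principal axes.
    projected = centered @ eigvecs
    mins, maxs = projected.min(axis=0), projected.max(axis=0)

    # Box corners in the principal frame, rotated back into image coordinates.
    corners = np.array([[mins[0], mins[1]],
                        [mins[0], maxs[1]],
                        [maxs[0], maxs[1]],
                        [maxs[0], mins[1]]])
    return corners @ eigvecs.T + mean, eigvecs
```

The returned corners define the rotated bounding box, and the principal directions give the pose used to orient the 3D shoe model.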

In the figure below, the three plots on the left show the original image, the ground-truth label, and the predicted mask from the UNET model. The fourth plot shows the principal directions as vectors on each foot, and the last plot shows the aligned bounding box.

3D Image Segmentation

Using a software application called Polycam, we obtained a 3D point cloud of a person’s feet from a first-person vantage point. The entire process is depicted in the figure below: the image stack on the left is the LiDAR-based cell-phone capture, the image in the middle is the 3D model of the shoe, and the image on the far right is the derived point cloud of the foot placed inside the derived point cloud of the shoe. The Open3D library in Python was used to determine the point clouds for the left and right foot and the shoe. The scaling and transformation matrix from the shoe to the foot was determined, and the two point clouds were superimposed.
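A minimal sketch of this alignment step with Open3D (version 0.10 or later API) is shown below; the file names, the uniform-scale heuristic, and the ICP refinement are assumptions for illustration rather than the project’s exact pipeline.

```python
import numpy as np
import open3d as o3d

# Hypothetical exports of the scanned foot and the 3D shoe model.
foot = o3d.io.read_point_cloud("right_foot.ply")
shoe = o3d.io.read_point_cloud("shoe_model.ply")

# Estimate a uniform scale from the longest bounding-box extents.
foot_extent = foot.get_axis_aligned_bounding_box().get_extent()
shoe_extent = shoe.get_axis_aligned_bounding_box().get_extent()
shoe.scale(np.max(foot_extent) / np.max(shoe_extent), center=shoe.get_center())

# Coarse alignment: move the shoe's centroid onto the foot's centroid.
shoe.translate(foot.get_center() - shoe.get_center())

# Fine alignment: point-to-point ICP refines the rigid transformation.
# The correspondence distance is in the point cloud's units (assumed meters).
result = o3d.pipelines.registration.registration_icp(
    shoe, foot, max_correspondence_distance=0.02,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
shoe.transform(result.transformation)

# Superimpose the two point clouds for visual inspection.
o3d.visualization.draw_geometries([foot, shoe])
```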

Finally, the figure below depicts the 3D model of the shoe on the customer’s right foot. The customer can move the camera around to see how the shoe looks from different angles and with different pants, shorts, dresses, and so on.