Background

SEA E-commerce has been growing rapidly in recent years. Top players leverage data science techniques to strengthen their competitive advantage, drawing on massive amounts of data such as images, text, and user behaviour.

In this blog, we focus on image data. The application targets three types of downstream tasks: same-item price comparison, gap-item supplement, and opportunity-item exploration. All of these applications rely heavily on the quality of a single function: multi-modal item matching.

Items in E-commerce are associated with information of many modalities, and visual information plays a vital role in the accuracy of the matching algorithm. We are going to customize a self-supervised image representation network that can train on massive volumes of images while building robust resistance to common E-commerce disturbances such as orientation shift, multi-object images, and watermarks.
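
To make the end goal concrete, here is a minimal sketch of what embedding-based item matching looks like once such an encoder is trained, assuming cosine similarity over normalized embeddings; the `match_items` helper and the dimensions are illustrative, not our production code.

```python
# Minimal sketch: rank candidate items by cosine similarity of their
# image embeddings. Embedding dim and helper names are illustrative.
import torch
import torch.nn.functional as F

def match_items(query_embs: torch.Tensor, candidate_embs: torch.Tensor, top_k: int = 5):
    """Return indices of the top-k most similar candidates for each query."""
    q = F.normalize(query_embs, dim=-1)      # (B, D) unit-norm query embeddings
    c = F.normalize(candidate_embs, dim=-1)  # (N, D) unit-norm candidate embeddings
    sims = q @ c.T                           # (B, N) cosine similarities
    return sims.topk(top_k, dim=-1).indices  # (B, top_k) candidate indices

# Toy usage; real embeddings come from the vision encoder described below.
queries, candidates = torch.randn(4, 128), torch.randn(1000, 128)
print(match_items(queries, candidates).shape)  # torch.Size([4, 5])
```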

Baseline Model

Good old Siamese Network

The most straightforward solution is to build a Siamese network: a shared CNN encoder trained with a contrastive loss (a minimal sketch follows the list below). The drawbacks are three-fold:

1. Supervised training. Labelling a large dataset is very expensive.
2. Hard-negative mining. Most negative samples are too easy for the model to learn anything meaningful from.
3. The model is prone to many E-commerce disturbances, as shown in the section below.
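
For reference, a minimal PyTorch sketch of this baseline, using a ResNet-18 backbone and the classic margin-based contrastive loss; the backbone, embedding size, and margin are illustrative choices, not the exact baseline configuration.

```python
# Siamese network: one shared CNN encoder applied to both images,
# trained with a margin-based contrastive loss on labelled pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SiameseNet(nn.Module):
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, emb_dim)
        self.encoder = backbone  # weights shared across the two branches

    def forward(self, x1, x2):
        return self.encoder(x1), self.encoder(x2)

def contrastive_loss(z1, z2, label, margin: float = 1.0):
    """label = 1 for matching pairs, 0 for non-matching pairs."""
    dist = F.pairwise_distance(z1, z2)
    pos = label * dist.pow(2)                         # pull matches together
    neg = (1 - label) * F.relu(margin - dist).pow(2)  # push non-matches apart
    return (pos + neg).mean()
```

Drawback 1 is visible right in the loss: `label` must come from human annotation. Drawback 2 shows up as well: randomly drawn non-matching pairs quickly produce near-zero `neg` terms.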

Top 4 bad-case scenarios for the baseline model

1. Sellers might take photos from different perspectives
2. It's common to find shop logos on item images
3. Items come in various colours in E-commerce
4. Sellers might list multiple items in one image

There are other long-tail disturbances hurting the baseline model; the above are the most common cases. Our new model should be trained with a robust strategy to combat each of them and to handle new ones as they appear.

Self-supervised vision model

The model architecture is based on MoCo from Facebook AI Research (FAIR)

Proposed by Kaiming He et al., MoCo adopts a momentum encoder and a dynamic dictionary of negative samples to perform image representation learning on large-scale datasets while minimizing computational overhead compared to SimCLR.
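
The two core pieces are easy to state in code. Below is a simplified sketch of the momentum update and the InfoNCE-style loss over the negative-key queue; the momentum and temperature values follow the MoCo paper's defaults, and everything else is illustrative rather than our production implementation.

```python
# Core of MoCo: a slowly-drifting key encoder plus a queue of negatives.
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m: float = 0.999):
    # Key encoder tracks the query encoder via an exponential moving average.
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.data = pk.data * m + pq.data * (1.0 - m)

def moco_loss(q, k, queue, temperature: float = 0.07):
    """q, k: (B, D) L2-normalized embeddings; queue: (K, D) negative keys."""
    l_pos = (q * k).sum(dim=-1, keepdim=True)  # (B, 1) positive logits
    l_neg = q @ queue.T                        # (B, K) negatives from the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)     # positive sits at index 0
```

Because negatives come from the queue rather than the current batch, the number of negatives is decoupled from batch size, which is where the savings over SimCLR come from.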

1. We train the vision model on 6M images sampled evenly from SG/VN/MY/PH/TH/ID
2. Training was conducted on 8 V100 GPUs over roughly a week, with learning rate annealing (sketched after this list)
3. Customized image augmentations target our weak cases, as shown below
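
As one plausible reading of the annealing step, here is a cosine schedule using PyTorch's built-in scheduler; the optimizer settings and epoch count are assumptions for illustration, not the exact training recipe.

```python
# Cosine learning-rate annealing over the full training run.
import torch

model = torch.nn.Linear(128, 128)  # stand-in for the vision encoder
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    # ... one training epoch over the image dataset ...
    scheduler.step()  # decay the learning rate along a cosine curve
```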

Augmentation for E-commerce

1. Allow images to be transformed to different perspectives
2. First remove all watermarks from the images; then, during training, add a watermark randomly sampled from a scraped pool with 30% probability
3. Increase saturation, intensity, and brightness variation
4. With 10% probability, construct a multi-item image by randomly selecting and combining 1~6 items (a sketch of the combined pipeline follows this list)
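
Here is a sketch of how such a pipeline could be wired up with torchvision; `random_watermark` stands in for our internal watermark-overlay helper, the multi-item collage happens at the dataset level rather than per image, and the probabilities mirror the list above.

```python
# Customized augmentation pipeline targeting the bad cases above.
from torchvision import transforms

def random_watermark(img):
    # Placeholder for the internal helper that pastes a watermark
    # sampled from the scraped pool onto the (already cleaned) image.
    return img

augment = transforms.Compose([
    transforms.RandomPerspective(distortion_scale=0.5, p=0.5),   # 1. perspective shift
    transforms.RandomApply([transforms.Lambda(random_watermark)],
                           p=0.3),                               # 2. watermark injection
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4),                      # 3. colour variation
    transforms.ToTensor(),
])
# 4. The 10% multi-item collage (1~6 items tiled into one canvas) is
#    applied in the dataset's __getitem__, before this per-image pipeline.
```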

Experimental Results

The new model was benchmarked in three experiments with different difficulty levels; a sketch of the offline recall metric follows the list.

1. Hard dataset: Taobao vs. SEA data
2. Local SEA dataset: offline 100k experiment
3. Local SEA dataset: online full-scale data
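
For the offline experiments, matching quality can be summarized as recall@k over an embedding index. This is a hedged sketch assuming ground-truth match labels are available; `recall_at_k` is a hypothetical helper, not the exact evaluation harness.

```python
# Offline metric sketch: recall@k over cosine-similarity retrieval.
import torch
import torch.nn.functional as F

def recall_at_k(query_embs, index_embs, gt_index, k: int = 10):
    """Fraction of queries whose true match appears among the top-k neighbours."""
    q = F.normalize(query_embs, dim=-1)
    x = F.normalize(index_embs, dim=-1)
    topk = (q @ x.T).topk(k, dim=-1).indices           # (B, k) retrieved items
    hits = (topk == gt_index.unsqueeze(1)).any(dim=-1)  # true match retrieved?
    return hits.float().mean().item()
```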

Additional recall from the new vision model

1. Watermark
2. Multi-object
3. Orientation shift
4. Color change

Examples are sampled from the actual production pipeline, where our new MoCo-based model shows better robustness in the areas we targeted for improvement.

Thanks for reading to the end of the post.

I hope the content is interesting and can inspire new ideas. If you have any questions or suggestions, feel free to reach me via the contacts provided below.
Hope you have a great day ahead! ^ ^