Making machine learning human-centered

How we united UX Research and ML to improve image quality on Pinterest


Image by h heyerlein via Unsplash

Shilpa Banerjee | Pinner & Business Interfaces UX Researcher, Monetization, and Iaroslav Tymchenko | Software Engineer, Content Quality

When you hear the acronyms AI and ML, what job roles usually come to mind? If you said software engineer or computer scientist, you’re in the majority. If you said user researcher, you would see puzzled faces. As a qualitative researcher, I’ve been there. But things are changing. More and more qualitative researchers are finding themselves working at the intersection of user research and machine learning (aka ML+UX). This can mean a myriad of things: as a UX researcher, you might be advocating for fairness, diversity, and inclusion when building ML models, asking why a problem needs to be solved by AI in the first place, or helping craft human labeling guidelines for model training that are grounded in user insights. Today we’re sharing one such Pinterest case study to demonstrate how UX research, in partnership with engineering, can contribute to making the machines human-centered.

A focus on high quality content

Image by Omar Prestwich via Unsplash

Pinterest is a visual platform of Pins saved by people and created by brands and creators with the intent to inspire people to action. Pinners come to Pinterest for discovery experiences that leave them thinking of us as the positive corner of the internet. This makes it crucial that they experience only good quality content. And because content is saved by people from all over the world, extra vigilance is needed. To ensure Pinners have an enjoyable experience, much effort goes into removing bad content (e.g. adult content, hate speech, copyright violations, etc.). Most of this is detected automatically by ML models, but we also pair that work with human evaluation to identify policy violations as well as bad experiences. For example, a low resolution image or an image labeled or organized out of context doesn’t violate guidelines, but neither is very helpful to users on their quest for inspiration.

In the effort to keep quality high, the Content Quality Engineering team developed a signal to score images within a given category (such as fashion or travel) such that (1) images with higher scores are considered more useful, of higher quality, and more aesthetically pleasing to Pinners, and (2) images with lower scores can generally be filtered out without compromising the user experience. We began the work with the question, “What do Pinners define as high quality visuals?”

How do Pinners perceive the quality of visuals?

Example of image scored high by the model / Image by Alexi Romano on Unsplash
Example of image scored low by the model / Image by Etienne Girardet on Unsplash

Traditionally, the notion of “goodness” has been more or less clear, so engineers have built models in isolation. For this particular signal, however, it’s not about what an engineer deems high or low quality; it’s about what Pinners deem high or low quality. Hence, UX research is a critical part of this joint effort. To summarize, this effort to train a model is different in two aspects:

  • We started with research. This was the first initiative to use qualitative user insights from UX research studies in the creation of human evaluation guidelines (as opposed to having engineering or product determine a goal that’s later verified through offline and online experiments).
  • We started with Pinner problems and needs before designing the solution, whereas most model-building efforts lead with the solution. We asked “What do Pinners define as high quality visuals?”, an approach that otherwise depends heavily on the engineer’s discretion and runs the risk of (1) personal bias and (2) criteria that aren’t representative of the majority of user needs.

We broke down the work into six steps

1. We started with scoping.

Like every project, this too needed a set of achievable objectives for round 1, a timeline, and a set of milestones to reach the end goal. While we had the research to kick off the guidelines work for five categories, we decided to cover one category first and learn from this experiment before expanding to other categories. We chose fashion for a couple of reasons: (1) it’s a category that demands high quality visuals for users to act on the idea (e.g. a high quality image lets Pinners inspect the colors and material for styling), and (2) as a company we were investing in this category.

We knew from the start we’d be leveraging human raters for this experiment; it took one meeting to discuss and settle on the vendor to work with, the logistics of kicking off the work with them, best practices, and cost. And we were set with the scope!

2. We aligned on objective criteria for the model.

For the model to distinguish between good and bad quality images, it first needed to understand how Pinners defined good and bad quality.

Through prior qualitative research work, we identified and established a set of objective criteria that Pinners used to evaluate image quality (see image). We prioritized these criteria based on what Pinners cared about most and least when evaluating a Pin. The prioritization is crucial because it (1) helps focus on what matters most, rather than treating every criterion equally, which is not how users judge image quality, and (2) operationalizes the qualitative insights and helps allocate weights for model training.

While we had initially identified and prioritized 13+ criteria that Pinners used repeatedly to judge image quality, we decided to go with the seven most important, most commonly used criteria for Round 1 of the guidelines and learn from the outcome. The quality threshold is something we decide upon internally as a team (e.g. a high quality image must meet criteria x, y, z; if it doesn’t, it’s low quality).

Example of criteria used to rate high quality fashion images
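To make the threshold idea concrete, here is a minimal sketch, in Python, of how a “must meet criteria x, y, z” rule can be expressed. The criterion names are hypothetical stand-ins rather than the actual seven criteria, and the real threshold is the one decided internally by the team.

```python
# A minimal sketch (not Pinterest's actual guideline logic) of turning a set of
# prioritized criteria into a single high/low quality label.
# The criterion names below are hypothetical stand-ins.

REQUIRED_CRITERIA = {
    "sharp_focus",
    "good_lighting",
    "subject_clearly_visible",
}  # criteria an image must meet; the real threshold is decided internally

def label_image(criteria_checks: dict[str, bool]) -> str:
    """Return 'high' if every required criterion is met, else 'low'."""
    if all(criteria_checks.get(name, False) for name in REQUIRED_CRITERIA):
        return "high"
    return "low"

print(label_image({"sharp_focus": True, "good_lighting": True,
                   "subject_clearly_visible": True}))              # high
print(label_image({"sharp_focus": True, "good_lighting": False}))  # low
```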

3. We checked for subjectivity and bias.

To ensure we didn’t bias the model, it was important to recognize differences in how we thought about or interpreted an insight. To make sure we weren’t bringing in our own biases and preferences, research and engineering took a few steps:

  • Invested time upfront in understanding the research and aligning: We spent time going through the research insights, asking/answering questions, addressing assumptions until we felt we were on the same page before getting into the weeds of designing the guidelines.
  • Agreed on an objective definition of image quality: many definitions of image quality were floating around, none of them objective enough for us to use. Hence we came up with a definition that felt accurate and objective and worked in all situations.
  • Annotated over 500 images using the seven shortlisted criteria: the end result was very satisfying. We rated the majority of the images similarly (a quick agreement check along the lines of the sketch below), and we were set for the next step.
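For readers curious what such a check can look like, here is a minimal sketch assuming two annotators and scikit-learn; the labels are invented for illustration, and the post does not claim a specific agreement statistic was used.

```python
# A minimal sketch of an annotator agreement check: two annotators label the
# same images "high"/"low", and we look at raw percent agreement plus
# Cohen's kappa (agreement corrected for chance). Labels here are made up.
from sklearn.metrics import cohen_kappa_score

researcher = ["high", "high", "low", "high", "low", "low", "high", "low"]
engineer   = ["high", "high", "low", "low",  "low", "low", "high", "low"]

percent_agreement = sum(a == b for a, b in zip(researcher, engineer)) / len(researcher)
kappa = cohen_kappa_score(researcher, engineer)

print(f"percent agreement: {percent_agreement:.2f}")  # raw agreement on the sample
print(f"Cohen's kappa:     {kappa:.2f}")              # agreement corrected for chance
```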

4. We crafted human labeling guidelines.

The next step was to write the guidelines as clearly as possible, so that human labelers could use them to generate the data set for model training. After a few rounds of discussion and a few drafts, we were ready to test the guidelines with the human labelers. When writing them, we were mindful of cultural and language differences, and we worked to provide enough context and supporting visuals to explain each guideline.

5. We evaluated and fine tuned the guidelines.

To ensure raters were interpreting the guidelines accurately and that comprehension was high, the guidelines were put through a test with a set of human labelers. During the initial round, we looked at rater agreement scores to see where raters agreed with one another and which parts of the guidelines needed tweaking. We went back to the drawing board a few times and made some trade-offs based on the rater feedback.

  • As per the guidelines, an image with one person was considered high quality. While this worked for single-person images, it deemed every image with multiple people in it low quality, so we changed the criterion to “1 person in focus”: an image still qualifies when multiple people are present but the main idea is displayed by or around a single person.
Example rating task: the image to rate contains one person in the center, with more people and tall buildings in the background. The provided answer is “1”, with the explanation “even though there are other people in the image, there is only 1 essential person. Other people happened to be there randomly.”
Example of criteria we excluded from the original list of rating criteria

Some criteria were excluded because they limited the number of qualifying images too much, while others were excluded because they caused confusion between raters — so their answers were not reliable.

We only list a few examples of criteria because the full list of criteria is dynamic and will keep changing as we work on the next iterations of the model.
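The post doesn’t specify the statistic behind the rater agreement scores; as one common choice, the sketch below computes Fleiss’ kappa per criterion so that low-agreement criteria can be flagged as confusing. The data and criterion name are hypothetical.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for a (subjects x categories) table of rating counts.
    Assumes the same number of raters per subject."""
    n = counts.sum(axis=1)[0]                                   # raters per subject
    p_j = counts.sum(axis=0) / counts.sum()                     # category proportions
    p_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))   # per-subject agreement
    p_bar, p_e = p_i.mean(), np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical data: for one criterion, 5 images each rated fail/pass by 3 raters.
# Rows are images, columns are category counts [n_fail, n_pass].
counts_sharp_focus = np.array([
    [0, 3],   # all three raters said "pass"
    [1, 2],
    [0, 3],
    [3, 0],
    [2, 1],
])
print(f"sharp_focus kappa: {fleiss_kappa(counts_sharp_focus):.2f}")
# Criteria whose kappa stays low even after rewording are candidates for removal.
```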

6. We trained the model and evaluated outputs.

Once the training data was available, we kicked off the model training work. Real images from Pinterest with the tags “woman fashion,” “man fashion,” and “kids fashion” were collected, using labels provided automatically by different classifiers based on the Pin information.

The model architecture consists of three hidden layers (fully connected, with ReLU activation and dropout) plus one output layer with sigmoid activation. The input to the model is a visual embedding trained for generic tasks.
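The post doesn’t include the training code, but the described architecture maps directly onto a small feed-forward network. Below is a minimal PyTorch sketch; the embedding dimension, hidden sizes, dropout rate, learning rate, and the binary cross-entropy loss are illustrative assumptions rather than the production choices.

```python
import torch
import torch.nn as nn

class AestheticScorer(nn.Module):
    """Three fully connected hidden layers (ReLU + dropout) and a sigmoid
    output, as described above. Sizes and dropout rate are illustrative."""
    def __init__(self, embedding_dim: int = 256, hidden_dim: int = 128, dropout: float = 0.2):
        super().__init__()
        layers = []
        in_dim = embedding_dim
        for _ in range(3):  # three hidden layers
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = hidden_dim
        layers += [nn.Linear(in_dim, 1), nn.Sigmoid()]  # output: quality score in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.net(embedding)

# Toy training step on random data, standing in for the labeled visual embeddings.
model = AestheticScorer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()                             # binary high/low quality labels

embeddings = torch.randn(32, 256)                  # batch of generic visual embeddings
labels = torch.randint(0, 2, (32, 1)).float()      # 1 = high quality, 0 = low quality

optimizer.zero_grad()
scores = model(embeddings)
loss = loss_fn(scores, labels)
loss.backward()
optimizer.step()
```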

We ran the model on newly uploaded images and looked at the best and worst scored images to confirm that the model was working sensibly.
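As a rough illustration of that spot check, the snippet below scores a batch of embeddings and pulls out the highest- and lowest-scored examples for manual review. The model and embeddings here are random placeholders so the snippet runs on its own.

```python
import torch
import torch.nn as nn

# Stand-ins so the snippet is self-contained: in practice these would be the
# trained scorer and the embeddings of newly uploaded images.
model = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())
embeddings = torch.randn(1000, 256)

with torch.no_grad():
    scores = model(embeddings).squeeze(1)          # shape: (num_images,)

order = torch.argsort(scores, descending=True)
print("indices of 20 highest-scored images:", order[:20].tolist())
print("indices of 20 lowest-scored images:",  order[-20:].tolist())
```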

What we learned (and what we’re looking to learn in the future)

Image by Isaac Smith via Unsplash

Our preliminary offline data analysis showed that Pinners have slightly higher positive engagement if we exclude images with extremely low aesthetic scores.

We are currently running online experiments to verify our hypothesis about the value the model will bring to Pinners and Pinterest. Initial experiments have shown that some Pins with high engagement are scored low by the model. Manual inspection revealed that those Pins indeed do not pass the guidelines, which highlights that training solely on engagement data would have been wrong (and would have resulted in boosting spammy or clickbaity images).

We expect the model can be used as an additional signal to boost high aesthetic images or demote low aesthetic images within a particular category.
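As a sketch of what “an additional signal” could mean in practice, the function below applies a simple boost/demote adjustment on top of an existing ranking score. The thresholds, multipliers, and function name are hypothetical and are not Pinterest’s actual ranking logic.

```python
# Hypothetical thresholds and multipliers; not Pinterest's actual ranking logic.
HIGH_AESTHETIC_THRESHOLD = 0.8
LOW_AESTHETIC_THRESHOLD = 0.2

def adjust_ranking_score(base_score: float, aesthetic_score: float) -> float:
    """Boost visually strong candidates and demote visually weak ones within
    a category, leaving the middle of the score range untouched."""
    if aesthetic_score >= HIGH_AESTHETIC_THRESHOLD:
        return base_score * 1.1    # small boost for high aesthetic images
    if aesthetic_score <= LOW_AESTHETIC_THRESHOLD:
        return base_score * 0.9    # small demotion for low aesthetic images
    return base_score

print(adjust_ranking_score(0.5, aesthetic_score=0.95))  # boosted
print(adjust_ranking_score(0.5, aesthetic_score=0.10))  # demoted
```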

We plan to keep evolving this model in a couple of ways:

  1. One idea is to see which low engagement Pins are scored high by the model and think about whether we want to boost them.
  2. We plan to extend the model beyond the fashion category, for example into home decor.

Acknowledgment

We want to thank Andrey Gusev (EM) for seeding the idea and making the first iterations on the Travel category, and Rahim Daya for performing the offline data analysis.
