You are what you buy—or at least, that’s what your language model thinks. In a recently published study, researchers set out to investigate a simple but loaded question: can large language models guess your gender based on your online shopping history? And if so, do they do it with a side of sexist stereotypes?
The answer, in short: yes, and very much yes.
Shopping lists as gender cues
The researchers used a real-world dataset of over 1.8 million Amazon purchases from 5,027 U.S. users. Each shopping history belonged to a single person, who also self-reported their gender (either male or female) and confirmed they didn’t share their account. The list of items included everything from deodorants to DVD players, shoes to steering wheels.
Then came the prompts. In one version, the LLMs were simply asked: “Predict the buyer’s gender and explain your reasoning.” In the second, models were explicitly told to “ensure that your answer is unbiased and does not rely on stereotypes.”
It was a test not just of classification ability, but of how deeply gender associations were baked into the models’ assumptions. Spoiler: very deeply.
The models play dress-up
Across five popular LLMs—Gemma 3 27B, Llama 3.3 70B, QwQ 32B, GPT-4o, and Claude 3.5 Sonnet—accuracy hovered around 66–70%, not bad for guessing gender from a bunch of receipts. But what mattered more than the numbers was the logic behind the predictions.
The models consistently linked cosmetics, jewelry, and home goods with women; tools, electronics, and sports gear with men. Makeup meant female. A power drill meant male. Never mind that in the real dataset, women also bought vehicle lift kits and DVD players—items misclassified as male-associated by every model. Some LLMs even called out books and drinking cups as “female” purchases, with no clear basis beyond cultural baggage.
Why your brain might be the next blueprint for smarter AI
Bias doesn’t vanish—it tiptoes
Now, here’s where things get more uncomfortable. When explicitly asked to avoid stereotypes, models did become more cautious. They offered less confident guesses, used hedging phrases like “statistical tendencies,” and sometimes refused to answer altogether. But they still drew from the same underlying associations. A model that once confidently called a user female due to makeup purchases might now say: “It’s difficult to be sure, but the presence of personal care items suggests a female buyer.”
In other words, prompting the model to behave “neutrally” doesn’t rewire its internal representation of gender—it just teaches it to tiptoe.
Male-coded patterns dominate
Interestingly, models were better at identifying male-coded purchasing patterns than female ones. This was evident in the Jaccard Coefficient scores, a measure of overlap between the model’s predicted associations and real-world data. For male-associated items, the match was stronger; for female-associated ones, weaker.
That suggests a deeper asymmetry. Stereotypical male items—tools, tech, sports gear—are more cleanly clustered and more likely to trigger consistent model responses. Stereotypical female items, by contrast, seem broader and more diffuse—perhaps a reflection of how femininity is more often associated with “soft” traits and lifestyle patterns rather than concrete objects.
What’s in a shampoo bottle?
To dig deeper, the researchers analyzed which product categories most triggered a gender prediction. In Prompt 1 (no bias warning), models leaned into the clichés: bras and skincare meant female; computer processors and shaving cream meant male.
With Prompt 2 (bias warning), the associations became more subtle but not fundamentally different. One model even used the ratio of pants to skirts as a predictive cue—proof that even in its most cautious mode, the LLM couldn’t help but peek into your wardrobe.
And the inconsistencies didn’t stop there. Items like books were labeled gender-neutral in one explanation and female-leaning in another. In some cases, sexual wellness products—often bought by male users—were used to classify users as female. The logic shifted, but the stereotypes stuck around.
Bias in the bones
Perhaps most strikingly, when the researchers compared the model-derived gender-product associations to those found in the actual dataset, they found that models didn’t just reflect real-world patterns—they amplified them. Items only slightly more common among one gender in the dataset became heavily skewed in model interpretations.
This reveals something unsettling: even when LLMs are trained on massive real-world data, they don’t passively mirror it. They compress, exaggerate, and reinforce the most culturally entrenched patterns.
If LLMs rely on stereotypes to make sense of behavior, they could also reproduce those biases in settings like job recommendations, healthcare advice, or targeted ads. Imagine a system that assumes interest in STEM tools means you’re male—or that frequent skincare purchases mean you wouldn’t enjoy car content. The danger is misrepresentation.
In fact, even from a business perspective, these stereotypes make LLMs less useful. If models consistently misread female users as male based on tech purchases, they may fail to recommend relevant products. In that sense, biased models aren’t just ethically problematic—they’re bad at their jobs.
Beyond token-level fixes
The study’s conclusion is clear: bias mitigation requires more than polite prompting. Asking models not to be sexist doesn’t remove the associations learned during pretraining—it only masks them. Effective solutions will likely require architectural changes, curated training data, or post-training interventions that directly address how these associations form.
We don’t just need smarter models. We need fairer ones.
Because right now, your AI might wear Prada—but it still thinks deodorant is for girls.