The Label Is Only as Good as the Human Behind It

Expert-rate annotation pays whoever holds the account, regardless of who opened it.

Ammar Khan

In late 2024, a seller on a Telegram channel posted an update for his customers. Forty of his Scale AI accounts had been blocked the previous week, which he framed as a minor operational setback. He had plenty more where those came from. The accounts were verified. The people logging in to run the tasks were not the people who had been verified. The rate arbitrage between what a verified expert seat earns and what it costs to rent one is the entire business model, and the seller was running it at scale.

That seller is not working alone. Reporters have mapped an active market where operators buy verified Outlier and Scale accounts for fifty to a hundred dollars apiece, route through residential proxies, and sell the output as if it came from the person who filled out the onboarding form. Follow-up reporting found the same pattern running across WhatsApp and Facebook, with one fraud researcher calling it an accelerating arms race. Academic work has produced a line that should be printed on the wall of every data operations team: quality control algorithms cannot ensure quality under Sybil attacks.

This matters more this year than last because the industry moved upmarket. Surge AI bootstrapped to $1.2 billion in revenue on premium RLHF work. Appen now sells verified experts in medicine, law, and finance for preference ranking. The crowdsourced era of drawing boxes around pedestrians for pennies is closing. The era in front of us is expert judgment at expert prices, and every dollar that rate climbs makes a verified seat more attractive to rent out and harder to keep in the hands of the person who earned it.

The verification layer underneath all of that has barely changed. Device fingerprinting loses to a factory reset. Skill tests prove competence but not uniqueness: the same person can pass the same test three times under three accounts. Document-based checks run one to three dollars each, which breaks the math on any task that pays only a few dollars to complete.

The quality controls every data operations lead tracks rest on an assumption that has quietly stopped holding. Inter-annotator agreement, the canonical measure of label reliability, depends on independent judgments from independent humans. When one operator is running three accounts through three proxies, your IAA is not measuring consensus. It is measuring one brain's agreement with itself, and that number will always look excellent. The same problem applies to redundancy sampling, calibration sets, and spot-check protocols. Each of those tools was designed for a world where a seat on the platform corresponded to a person on the other side of it. That world is gone, and the controls built on top of it keep reporting green long after the underlying signal has collapsed.
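A small simulation makes the point concrete. This is a minimal sketch, not a description of any platform's pipeline: the 80 percent accuracy figure, the binary labels, the sample size, and the assumption that the operator submits identical judgments under all three accounts are all illustrative choices, and the helper names are hypothetical.

```python
import random
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    labels = set(a) | set(b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label frequencies.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations):
    """Average kappa over every pair of accounts."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def noisy_annotator(truth, accuracy, rng):
    """Binary annotator that matches the true label with probability `accuracy`."""
    return [t if rng.random() < accuracy else 1 - t for t in truth]


def accuracy_of(labels, truth):
    return sum(x == t for x, t in zip(labels, truth)) / len(truth)


rng = random.Random(0)
truth = [rng.randint(0, 1) for _ in range(2000)]

# Assumed scenario A: three independent people, each about 80% accurate.
independent = [noisy_annotator(truth, 0.8, rng) for _ in range(3)]

# Assumed scenario B: one operator of the same skill behind three accounts,
# submitting the same judgments three times.
one_brain = noisy_annotator(truth, 0.8, rng)
sybil = [list(one_brain) for _ in range(3)]

kappa_independent = mean_pairwise_kappa(independent)
kappa_sybil = mean_pairwise_kappa(sybil)
sybil_accuracy = accuracy_of(one_brain, truth)

print(f"independent kappa: {kappa_independent:.2f}")
print(f"sybil kappa:       {kappa_sybil:.2f}")
print(f"sybil accuracy:    {sybil_accuracy:.2f}")
```

Under these assumptions the three rented accounts report perfect agreement while the labels themselves are no more accurate than a single annotator's. The agreement metric improves precisely because the independence it depends on has disappeared.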

The downstream cost is what should keep AI leaders awake. A reward model compresses human judgment into a number. If the human behind that judgment is a different person than the expert you hired, the lie goes into the reward model and compounds into every output that model shapes. Recent research keeps confirming that annotation quality is the real bottleneck, not volume. Evaluation suites can run against the output for a year and never surface the problem, because the contamination sits upstream of every metric being measured.

A legal surface has sharpened alongside all of this. U.S. sanctions rules administered by OFAC prohibit payments flowing to sanctioned regions, and credential-laundered accounts are a documented pathway into those regions. The exposure is no longer hypothetical.

The fix is straightforward and it is finally economic. Prove the person at the keyboard is a unique, live human in the jurisdiction they claim to be in, for one cent per check, with no government documents required. At that price, verification runs on every session instead of onboarding alone, and the rental economy collapses because a borrowed seat costs more than it earns.
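The arithmetic behind that claim can be sketched in a few lines. The check prices come from the figures above (one to three dollars for a document check, one cent for the proposed check); the three-dollar task payout is an assumption standing in for "a few dollars," and `verification_overhead` is a hypothetical helper, not part of any product.

```python
def verification_overhead(check_cost, task_payout, checks_per_task=1.0):
    """Verification spend as a fraction of what the task pays."""
    if task_payout <= 0:
        raise ValueError("task_payout must be positive")
    return check_cost * checks_per_task / task_payout


TASK_PAYOUT = 3.00  # assumption: a task that pays "a few dollars"

doc_low = verification_overhead(1.00, TASK_PAYOUT)    # cheapest document check
doc_high = verification_overhead(3.00, TASK_PAYOUT)   # priciest document check
one_cent = verification_overhead(0.01, TASK_PAYOUT)   # one-cent check

for name, share in [("document check (low)", doc_low),
                    ("document check (high)", doc_high),
                    ("one-cent check", one_cent)]:
    print(f"{name}: {share:.1%} of task payout")
```

On that assumed payout, a document check on every task would consume between a third and all of what the task pays, which is why it only ever runs at onboarding. A one-cent check is a fraction of a percent, small enough to repeat every session.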

The model is only as good as the data. The data is only as good as the label. The label is only as good as the human. If the human behind the account is a different person than the one you onboarded, you paid for expertise and received something else entirely.
