WPML’s Private Translation Cloud (PTC) scores 90.0 out of 100 on independent translation review. DeepL — widely regarded as the strongest mainstream AI translation engine — scores 77.7 on the same content. PTC outperforms DeepL by 12.3 points overall, wins every one of the nine quality dimensions our linguists scored, and produces roughly an 18× reduction in mistakes per page.
At a Glance
| Metric | DeepL | PTC |
|---|---|---|
| Avg. Translation Quality score (0–100) | 77.7 | 90.0 |
| Avg. issues per translated page | 1.27 | 0.07 |
| Quality dimensions where PTC scored higher than DeepL | — | 9 of 9 |
| Languages tested where PTC scored higher than DeepL | — | 6 of 6 (Arabic, English, French, German, Italian, Spanish) |
Why We Ran This Study
DeepL is treated by most of the WordPress and software industries as the gold standard for AI translation. When customers ask whether they can publish AI-translated content without human review, DeepL is usually the implicit benchmark.
WPML built its own AI translation engine — the Private Translation Cloud (PTC) — because “good enough” wasn’t. PTC is the default engine inside WPML’s Translate Everything mode and powers the bulk of automatic translation across the 1.5 million+ sites running WPML.
We wanted a measurable answer to a question prospects ask us every week: how does PTC actually compare to DeepL? This study is that answer. The numbers above are the headline; the rest of this article explains how they were produced and what they mean for a real production site.
What We Measured and How
We scored translation quality using a system based on MQM (Multidimensional Quality Metrics), the same framework WPML’s linguistics team uses for the ongoing PTC client-site reviews. Every translated page is scored across nine dimensions:
- Grammar — were there grammatical mistakes?
- Meaning — was the meaning of the source preserved?
- Naturalness — does it sound natural and idiomatic to a native speaker?
- Tone — was the tone of the source captured?
- Formality — was the right register used, both in addressing the reader and in grammatical address forms?
- Terminology — was the right terminology used?
- Consistency — was terminology applied consistently across the page?
- Capitalization — were target-language conventions followed?
- Number, units, and date localization — were target-language conventions followed?
Each dimension is scored 1–10 by a human linguist who is a native speaker of the target language. The final per-page score is the average across the nine dimensions, multiplied by ten. The result is reported on the 0–100 Translation Quality scale, which maps to plain-English quality bands:
| Score | What it means |
|---|---|
| 90–100 | Very good to perfect — ready to publish without review |
| 80–89 | Good — publish-ready in most contexts |
| 70–79 | Acceptable to mediocre — usable but mistakes are visible |
| 60–69 | Mediocre — needs touch-ups before publishing |
| 50–59 | Mistakes obvious to non-native speakers |
| Below 50 | Defects significant enough to require rework |
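The scoring arithmetic above can be sketched in a few lines of Python. This is an illustration, not WPML's internal tooling: the function names, the dimension keys, and the choice to assign a band by its lower bound (so a score of 89.5 counts as "Good") are assumptions made for the example.

```python
from statistics import mean

# The nine dimensions listed above (key names are hypothetical).
DIMENSIONS = (
    "grammar", "meaning", "naturalness", "tone", "formality",
    "terminology", "consistency", "capitalization", "localization",
)

# Lower bound of each band, highest first, with shortened labels from the table.
BANDS = (
    (90, "Very good to perfect"),
    (80, "Good"),
    (70, "Acceptable to mediocre"),
    (60, "Mediocre"),
    (50, "Mistakes obvious to non-native speakers"),
)


def page_score(scores):
    """Average the nine 1-10 dimension scores and multiply by ten."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    values = [scores[d] for d in DIMENSIONS]
    if any(not 1 <= v <= 10 for v in values):
        raise ValueError("each dimension is scored from 1 to 10")
    return mean(values) * 10


def quality_band(score):
    """Map a per-page score to its band (assumption: band chosen by lower bound)."""
    for floor, label in BANDS:
        if score >= floor:
            return label
    return "Defects significant enough to require rework"


# A page scored 9 on every dimension except naturalness (8).
example = {d: 9 for d in DIMENSIONS}
example["naturalness"] = 8
score = page_score(example)
print(round(score, 1), quality_band(score))  # 88.9 Good
```

Note how a single dimension slipping from 9 to 8 is enough to move a page out of the top band, which is why the per-dimension results matter as much as the headline average.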
For context: the published average for generic human translation sits around 76. DeepL’s 77.7 puts it just above that line. PTC’s 90.0 puts it right at the boundary of the “very good — ready to publish without review” band.
Sample Size and Selection
To assess DeepL, we took 800 random sites translated with DeepL in the last 12 months, bucketed them by site age, content volume, and translation activity, and sampled across buckets. Our linguists then manually reviewed the main pages of the sampled sites — 227 webpages in total, evaluated in Arabic, English, French, German, Italian, and Spanish.
To compare DeepL and PTC head-to-head, we re-translated 73 of those same pages with PTC and re-scored them blind, with the same linguists working from the same source content. The PTC sample is smaller than the DeepL sample because each PTC measurement reflects a specific PTC version, and the engine continues to evolve (see the “We’re Not Done” section below).
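For readers who want to reproduce the bucket-then-sample step on their own data, here is a minimal sketch of stratified sampling. It is an assumption-laden illustration: the field names (`age`, `volume`, `activity`), the bucket labels, the number of sites drawn per bucket, and the fixed seed are all hypothetical, since the study does not publish those details.

```python
import random
from collections import defaultdict


def bucket_key(site):
    """Bucket a site by age, content volume, and translation activity.

    The field names are hypothetical; any categorical labels would work.
    """
    return (site["age"], site["volume"], site["activity"])


def stratified_sample(sites, key, per_bucket, seed=0):
    """Draw up to `per_bucket` sites at random from every bucket."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    buckets = defaultdict(list)
    for site in sites:
        buckets[key(site)].append(site)

    sample = []
    for name in sorted(buckets):  # sorted so the result order is stable
        members = buckets[name]
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample


# Synthetic stand-in for the 800 candidate sites: 8 buckets of 100 sites each.
sites = [
    {
        "id": i,
        "age": ("new", "old")[i % 2],
        "volume": ("small", "large")[(i // 2) % 2],
        "activity": ("low", "high")[(i // 4) % 2],
    }
    for i in range(800)
]
picked = stratified_sample(sites, bucket_key, per_bucket=5, seed=42)
print(len(picked))  # 40
```

Sampling per bucket rather than from the whole pool keeps small but important groups, such as old sites with low translation activity, from being crowded out by the most common site type.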
Per-Language: PTC Wins in Every Pair
The aggregate scores are interesting; the per-language picture is what tells a multilingual content team what to expect.
TQ Scores by Language

PTC scores higher than DeepL in every language pair we tested. The widest gap is in Arabic (+22.4 points), where DeepL has historically underperformed and where WPML’s investment in RTL-language quality shows clearly. German (+11.5) and French (+9.9) come next. The narrowest gap is in English (+6.0 points) — and even there, PTC moves the page from “acceptable, visibly imperfect” into the publish-ready band.
The variance also matters. The DeepL distribution has a long lower tail in Arabic, with individual pages scoring well below average, at a quality level that a reader of the German or French output would consider broken. PTC’s distributions are tighter and centred higher, so a content team can plan around an expected quality band rather than budgeting for outliers.
PTC also outperforms DeepL on every one of the nine quality dimensions we score: grammar, meaning, naturalness, tone, formality, terminology, consistency, capitalisation, and locale conventions. The biggest gains are in the dimensions that matter most for a human reading a page on a real website — terminology (+1.5), grammar (+1.4), naturalness (+1.4), and number/units/date localisation (+1.4). There’s no dimension where DeepL keeps up, and no dimension where PTC just edges ahead.
Mistake Rate: The Operational Picture
Average scores are useful; mistake counts are more useful for anyone running a real site. Our reviewers logged every distinct issue they identified on each page — grammar, meaning, terminology, the lot.
Average Issues per Page

DeepL produces an average of 1.27 issues per page. PTC produces 0.07 — roughly an 18× reduction in mistakes per page on the same source content.
For a 200-page site, that translates roughly to:
- DeepL: ~254 issues to find, decide on, and fix before publishing.
- PTC: ~14 issues across the entire site.
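The projection above is a straight multiplication, and it is easy to rerun for a site of any size. The sketch below uses the two per-page averages from this study; the function names are ours, and the result is an expected value, not a guarantee for any individual site.

```python
# Average issues per page measured in this study.
DEEPL_ISSUES_PER_PAGE = 1.27
PTC_ISSUES_PER_PAGE = 0.07


def expected_issues(pages, issues_per_page):
    """Expected number of issues across a site, rounded to a whole issue."""
    return round(pages * issues_per_page)


def reduction_factor(baseline, improved):
    """How many times fewer issues the improved engine produces per page."""
    return baseline / improved


print(expected_issues(200, DEEPL_ISSUES_PER_PAGE))  # 254
print(expected_issues(200, PTC_ISSUES_PER_PAGE))    # 14
print(round(reduction_factor(DEEPL_ISSUES_PER_PAGE, PTC_ISSUES_PER_PAGE), 1))  # 18.1
```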
That’s the difference between AI translation as a first draft and AI translation as a workflow. DeepL is a strong first draft: at an average of 1.27 issues per page, a typical page still has something to catch before it ships. PTC, on this measurement, is a workflow: the mistake count is low enough that the human-review step becomes optional for most content rather than mandatory for all of it.
What “Good Translation” Actually Means
The MQM framework’s grade descriptions are a useful counterweight to the marketing language that surrounds AI translation. A “9/10” page is not a 90% page; it’s a page where a native speaker, reading carefully, finds nothing they would want to change. A “7/10” page is acceptable as-is but visibly imperfect. A “5/10” page is one where a non-native speaker can spot mistakes.
DeepL’s average sits in the high 7s — acceptable, but visibly imperfect to a careful native reader. PTC’s average sits at 90.0, right at the boundary of “very good — ready to publish without review”. That is the quality difference our reviewers measured. It maps to a real distinction: DeepL output is a starting point that requires human review before publication; PTC output is in most cases the publication itself.
Why WPML Can Produce This Quality
We are aware that “we built our own AI engine” is a claim a lot of companies make right now. PTC is unusual for two specific reasons.
Scale and continuous review. WPML serves 1.5 million+ sites and has done for over a decade. WPML’s linguistics team continuously reviews PTC’s translation quality — sampling output, scoring it with the same MQM framework used in this study, and watching for patterns where the engine consistently produces lower-quality results for a particular term, language pair, or content type.
Focus. PTC isn’t a side project at WPML. It has a dedicated team — engineers, MLOps, and human linguists. When the linguists flag a recurring pattern, the engineering team treats it as a bug: extend the glossary, adjust the formality model, add an exception, ship the fix. This loop runs continuously, which is why PTC’s quality has improved over the last 12 months from “as good as average human translation” to the numbers in this study.
The combination — continuous human review at meaningful scale, paired with a team that responds to each finding as an engineering ticket — is what produces the quality numbers above. It is not the model architecture; it is the operating model.
We’re Not Done
90.0 puts PTC right at the boundary of “very good — publish-ready without review”. We’re happy with that result. We don’t think it’s the ceiling.
The QA → engineering loop described above is still running. WPML’s linguistics team continues to analyse the edits customers make to PTC’s output on real sites and to identify the patterns those edits reveal, and the engineering team ships the fixes: glossary extensions, formality adjustments, language-pair-specific tweaks. We expect the next measurement to be higher than this one.
What This Means for Your Workflow
The conventional multilingual content workflow is translate → review → publish. The review step is there because every AI engine — and most human translators — produces work that needs a second pair of eyes before it ships. That review step is also where most of the cost and most of the latency in multilingual content lives.
With DeepL, a review step is necessary. The 1.27 issues per page have to be caught somewhere; if the team doesn’t catch them, the customers will. With PTC, the review step is optional for most content. Some teams still review — for high-stakes pages or regulated industries — but the default workflow becomes translate → publish, with review reserved for the cases that actually need it.
This is the operational difference the numbers describe. It is not a small one.
What This Means for Your Costs
Translation cost is two things: the cost of producing the translation and the cost of reviewing it. Both have to be paid before content goes live.
For most professional contexts, human review costs a few cents per word — whether reviewers charge per word, per page, or per site, that’s roughly where the math lands. Full human translation (translate, then review by a second human) typically runs around €0.10–€0.20 per word for major language pairs, with the review portion alone in the €0.04–€0.08 range.
Here is what each workflow actually costs in those terms.
Human translation alone — the baseline. Around €0.10–€0.20 per word. Both steps are human.
DeepL (or any AI engine) plus human review. AI handles the translation step; a human still has to review every page, because at 1.27 issues per page the issues will reach customers if nobody catches them first. The translation step is essentially free; the review step still costs €0.04–€0.08 per word. Total saving versus human-only: about 60%. This is the standard “AI saves you money on translation” pitch, and it is true, but the saving is capped by the cost of the review step that DeepL’s quality still requires.
PTC, no review needed. PTC costs €0.0012–€0.003 per word — 4 credits per word multiplied by €0.30–€0.75 per 1,000 credits depending on volume, with the first 2,000 credits per month free. (See WPML’s automatic translation pricing for the full tier table.) Because PTC’s measured quality (90.0 TQ score, 0.07 issues per page, +12.3-point gap over DeepL on the same source content) is high enough to skip the review step in most cases, the review cost goes to zero. Total saving versus human-only: roughly 99%.
That is not a marginal improvement on the AI-plus-review workflow. It is a different cost regime.
At a Glance — Cost to Ship a Translated Word
| Workflow | Translation | Review | Total | Saving vs. human |
|---|---|---|---|---|
| Full human translation | €0.10–€0.20 | included | €0.10–€0.20 | — |
| DeepL + human review | ~€0 | €0.04–€0.08 | €0.04–€0.08 | ~60% |
| PTC, no review | €0.0012–€0.003 | €0 | €0.0012–€0.003 | ~99% |
Human translation rates are typical industry estimates for major language pairs. PTC pricing is from WPML’s published rates.
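The per-word figures in the table follow from simple arithmetic, sketched below so you can plug in your own rates. The function names are ours. The monthly estimate assumes the 2,000 free credits are simply deducted before the per-1,000-credit rate is applied and that one rate covers the whole month; the actual tier rules are in WPML’s pricing table and may differ.

```python
CREDITS_PER_WORD = 4           # PTC credits consumed per translated word
FREE_CREDITS_PER_MONTH = 2000  # free monthly allowance


def ptc_cost_per_word(eur_per_1000_credits):
    """Per-word PTC cost in euros, ignoring the free monthly credits."""
    return CREDITS_PER_WORD * eur_per_1000_credits / 1000


def monthly_ptc_cost(words, eur_per_1000_credits):
    """Rough monthly cost (assumption: free credits are deducted first)."""
    billable = max(0, words * CREDITS_PER_WORD - FREE_CREDITS_PER_MONTH)
    return billable * eur_per_1000_credits / 1000


def saving_vs_human(cost_per_word, human_cost_per_word):
    """Fraction of the human-only cost that a workflow saves."""
    return 1 - cost_per_word / human_cost_per_word


low = ptc_cost_per_word(0.30)   # about 0.0012 EUR per word
high = ptc_cost_per_word(0.75)  # about 0.003 EUR per word
print(f"PTC: {low:.4f} to {high:.4f} EUR per word")
print(f"AI + review saving: {saving_vs_human(0.04, 0.10):.0%}")
print(f"PTC saving: {saving_vs_human(low, 0.10):.1%}")
print(f"10,000 words at 0.75 EUR per 1,000 credits: {monthly_ptc_cost(10_000, 0.75):.2f} EUR")
```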
A note on doing review yourself: when a business owner does the review themselves rather than paying a reviewer, the cost isn’t zero — it’s just less visible. The hours spent on review are hours not spent on the rest of the business, and at any reasonable hourly rate the implied cost is usually higher than what an outsourced reviewer would charge, not lower. PTC’s saving applies regardless of who is paying for the review step today.
What this opens up. When translation costs €0.10+ per word, you translate selectively — the homepage, the top product categories, a handful of high-traffic blog posts. Everything else stays in the source language because the math does not work. When translation costs €0.002 per word and produces output that is better than typical human translation, the math works for content that was not viable before:
- Full catalogue translation in eCommerce — every long-tail product description, every variant.
- Complete documentation portals in every language a customer might speak, not just the top three.
- Knowledge bases, support FAQs, blog archives — the long tail that drives search but never justified its translation cost.
- Localised editorial content for publishers operating in markets where audience-per-page revenue could not cover human translation.
Multilingual content has been a “what can we afford to translate” question for decades. With PTC, the question becomes “what is worth translating” — which is a different and more interesting one.
How to Get This Quality on Your Own Site
PTC is the default engine in WPML’s Translate Everything mode, which is a single global setting that translates every page on your site, automatically, in the background. There is no per-page configuration to manage.
To get the quality measured in this study, the only setup you need beyond turning Translate Everything on is to spend about a minute describing your site’s audience and tone in the Translate Everything settings — for example, “a software company addressing developers, formal tone, technical terminology preserved in English”. PTC uses that description to align translations with your brand voice.
That is all the setup. The result is what this study measured.
A Note on What We Did Not Do
We did not compare PTC to Google Translate, Azure Translator, or LLM-based services through their public APIs. The reason: those engines change frequently and a single-point measurement would be obsolete within months. The DeepL comparison is the one we ran because DeepL is the most-cited benchmark for production-grade AI translation, and because DeepL is what most of our competitors use under the hood.
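We also did not measure speed, cost, or developer experience in this study, only quality. The cost figures above are derived from published rates and typical industry estimates, not from measurement. Those comparisons exist on our features page and our comparison pages, and they tell their own story.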
Try It
If you run a multilingual WordPress site and want to see PTC’s output on your own content, install WPML, enable Translate Everything, and compare the result to whatever you’re using today. The quality difference described here is the quality difference you should expect to see.
Methodology questions or data requests: contact the WPML team.
Study conducted by the WPML linguistics team, April 2026. Sample: 227 DeepL-translated webpages reviewed by native-speaker linguists; 73 of those re-translated with the current PTC version and re-scored blind against the same source content. Languages: Arabic, English, French, German, Italian, Spanish. Scoring: MQM-based Translation Quality framework, 9 dimensions, 1–10 per dimension, averaged and scaled to a 0–100 result.