Languages of the Internet

Author: Piotr Patrzyk
Source code: ppatrzyk/common-crawl
Data: Common Crawl (2024-10)

Analysis of language prevalence on the internet. The list of languages was parsed and aggregated by host to determine the primary language (i.e., the most common one on the host) and the secondary language(s) of each website. See ppatrzyk/common-crawl for details on how the data was queried and processed. Known limitations: (i) it is unknown whether the data source constitutes a representative sample of the entire internet, (ii) languages were determined automatically with unknown reliability [details], (iii) the map visualization was prepared with default plotly map data, so depending on your local laws, country borders might be inaccurate.
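The aggregation by host described above can be sketched roughly as follows. This is a minimal illustration and not the code from ppatrzyk/common-crawl: the function names and the input shape (one language code per page on a host) are assumptions made for the example.

```python
from collections import Counter


def rank_languages(page_languages):
    """Order the languages seen on one host from most to least common.

    page_languages: list of language codes, one entry per page
    (hypothetical input shape, assumed for this sketch).
    """
    counts = Counter(page_languages)
    # Sort by count (descending), then by code so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def primary_and_secondary(page_languages):
    """Return (primary language, list of secondary languages) for one host."""
    ranked = rank_languages(page_languages)
    if not ranked:
        return None, []
    primary = ranked[0][0]
    secondary = [lang for lang, _ in ranked[1:]]
    return primary, secondary
```

For example, a host with pages tagged eng, eng, deu, fra, deu, eng would get eng as its primary language and deu and fra as its secondary languages.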


Most common languages

What are the most common languages among all websites? The percentage denotes the share of all websites written in a given language.


Most common languages by TLD

What are the most common languages for each top-level domain (TLD)? Same as above, but only data from a specific TLD is considered.
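A per-TLD breakdown of this kind could be computed along the following lines. The sketch assumes a hypothetical mapping from host name to primary language, assumes the percentage counts hosts by primary language, and takes the TLD naively as the label after the last dot. None of these details are confirmed by the source.

```python
from collections import Counter, defaultdict


def shares_by_tld(primary_by_host):
    """Percentage of hosts per primary language, separately for each TLD.

    primary_by_host: dict mapping host name -> primary language code
    (hypothetical input shape, assumed for this sketch).
    """
    counts = defaultdict(Counter)
    for host, lang in primary_by_host.items():
        # Naive TLD extraction (assumption): the label after the last dot.
        tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
        counts[tld][lang] += 1
    return {
        tld: {lang: 100.0 * n / sum(c.values()) for lang, n in c.most_common()}
        for tld, c in counts.items()
    }
```

Dropping the grouping by TLD and counting over all hosts gives the overall percentages from the previous section.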


Secondary languages ranking

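Average rank as a secondary language. Rank here denotes a language's position when the languages on a given host are ordered from most to least common. Note that, by definition, the best possible rank for a secondary language is 2, because rank 1 is the primary language.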
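One way to read this metric is sketched below. It assumes a hypothetical mapping from host to its languages ordered from most to least common, and averages each language's rank over the hosts where it is not the primary language. The exact definition used for the plot may differ.

```python
from collections import defaultdict


def average_secondary_rank(ranked_by_host):
    """Average rank of each language over hosts where it is secondary.

    ranked_by_host: dict mapping host -> list of language codes ordered
    from most to least common (hypothetical input shape).
    """
    ranks = defaultdict(list)
    for languages in ranked_by_host.values():
        # Ranks start at 1; rank 1 is the primary language and is skipped.
        for rank, lang in enumerate(languages, start=1):
            if rank >= 2:
                ranks[lang].append(rank)
    return {lang: sum(r) / len(r) for lang, r in ranks.items()}
```

A language that is always the second most common one on hosts where it is secondary gets an average of exactly 2, the lowest value possible.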


Most common secondary languages by primary language

For sites with a given primary language, what is the most common secondary language? The percentage denotes the share of all sites with the selected primary language that also contain content in that secondary language.
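A rough sketch of this conditional share, under the assumption that the percentage counts hosts (each secondary language counted once per host) and with the same hypothetical input shape of host mapped to languages ordered from most to least common:

```python
from collections import Counter, defaultdict


def secondary_shares(ranked_by_host):
    """For each primary language, percentage of its hosts that also
    contain a given secondary language (assumed interpretation)."""
    hosts_per_primary = Counter()
    secondary_counts = defaultdict(Counter)
    for languages in ranked_by_host.values():
        if not languages:
            continue
        primary, secondary = languages[0], languages[1:]
        hosts_per_primary[primary] += 1
        # set() so that a language is counted at most once per host.
        for lang in set(secondary):
            secondary_counts[primary][lang] += 1
    return {
        primary: {
            lang: 100.0 * n / hosts_per_primary[primary]
            for lang, n in c.most_common()
        }
        for primary, c in secondary_counts.items()
    }
```

Hosts without any secondary language still count toward the denominator, so the percentages for one primary language need not add up to 100.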


Most common primary languages by secondary language

For each language, this plot shows the primary-language hosts on which that language most commonly appears as a secondary one.


Language prevalence in national domains

This map depicts how common a given language is among sites under national TLDs (ccTLDs).