A major leak of thousands of pages of confidential company records has publicly exposed the inner workings of Google’s search algorithm.
Because the documents appear to contradict what the search giant has said publicly about Search, the leak calls into question how transparent Google has been about its policies over the years.
What Exactly Happened?
On March 13, an automated bot called yoshi-code-bot published thousands of documents on GitHub that appear to come from Google’s internal Content API Warehouse. The documents were shared earlier this month with Rand Fishkin, co-founder of SparkToro.
Key Points Of The Leaked Documents
Google’s search algorithm is perhaps the most influential system on the internet: it determines which websites thrive and shapes the form online content takes. The specific signals Google uses to rank websites have long been a mystery, pieced together slowly by researchers, journalists, and search engine optimization (SEO) experts.
Here are some important points from the leaked documents:
Current:
According to the documentation, this data is correct as of March.
Ranking features:
The API documentation covers 2,596 modules containing 14,014 attributes (features).
Weighting:
The documents only show that these ranking features exist; they do not say how the features are weighted.
Twiddlers:
Twiddlers are re-ranking functions that “may modify the information retrieval score of a document or alter the ranking of a document.”
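To make the idea of a re-ranking step concrete, here is a minimal Python sketch. It is purely illustrative: the leaked documentation describes what twiddlers do but does not include their code, so the names `Result`, `demote_exact_match_domain`, and `apply_twiddlers`, the example URLs, and the 0.5 multiplier are all assumptions invented for this example.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Result:
    url: str
    score: float  # initial retrieval score (hypothetical scale)
    exact_match_domain: bool = False


# In this sketch, a "twiddler" is any function that returns an adjusted result.
Twiddler = Callable[[Result], Result]


def demote_exact_match_domain(result: Result) -> Result:
    # Hypothetical rule: halve the score of exact match domains.
    if result.exact_match_domain:
        return replace(result, score=result.score * 0.5)
    return result


def apply_twiddlers(results: List[Result], twiddlers: List[Twiddler]) -> List[Result]:
    adjusted = []
    for result in results:
        for twiddle in twiddlers:
            result = twiddle(result)
        adjusted.append(result)
    # Re-sort by the adjusted score, highest first.
    return sorted(adjusted, key=lambda r: r.score, reverse=True)


results = [
    Result("https://best-cheap-shoes.example", 0.9, exact_match_domain=True),
    Result("https://shoe-reviews.example", 0.7),
]
reranked = apply_twiddlers(results, [demote_exact_match_domain])
print([r.url for r in reranked])
```

In this toy run, the exact match domain starts with the higher score but ends up second after the adjustment, which is the sense in which a re-ranking step can "alter the ranking of a document" after initial retrieval.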
Demotions:
Content can be demoted for several reasons, including:
- A link that does not match the target site
- SERP signals that indicate user dissatisfaction
- Product reviews
- Pornography
- Exact match domains
- Location
Change history:
Google keeps a copy of every version of every page it has ever indexed, so it can “remember” every change made to a page. When evaluating links, however, Google only considers the last 20 changes to a URL.
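The difference between storing everything and consulting only a recent window can be sketched in a few lines of Python. This is an assumption-laden illustration, not Google's implementation: only the figure of 20 comes from the reporting, while the `UrlHistory` class and its fields are hypothetical.

```python
from collections import deque

MAX_VERSIONS_FOR_LINKS = 20  # figure from the reporting; everything else is hypothetical


class UrlHistory:
    def __init__(self) -> None:
        self.all_versions = []  # every version ever indexed
        # Bounded window: the oldest entry drops out automatically once full.
        self.recent = deque(maxlen=MAX_VERSIONS_FOR_LINKS)

    def record(self, snapshot: str) -> None:
        self.all_versions.append(snapshot)
        self.recent.append(snapshot)


history = UrlHistory()
for i in range(1, 26):
    history.record(f"version-{i}")

print(len(history.all_versions))  # 25
print(len(history.recent))        # 20
print(history.recent[0])          # version-6
```

After 25 recorded changes, the full archive still holds all 25 versions, but the window used for link evaluation in this sketch only reaches back to the sixth.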
The documentation lists more than 14,000 ranking features. Google computes a feature called “siteAuthority”. Navboost has a module devoted entirely to click signals, which treats users as voters and stores their clicks as votes. Google also records which result had the longest click during the session.
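As a rough illustration of clicks stored as votes and of the longest click in a session, consider the Python sketch below. The documents do not describe how Navboost stores or scores clicks, so the `Click` record, the `dwell_seconds` field, and both helper functions are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Click:
    url: str
    dwell_seconds: float  # hypothetical: time spent on the result before returning


def tally_votes(clicks: List[Click]) -> Counter:
    # Each click counts as one "vote" for the clicked result.
    return Counter(click.url for click in clicks)


def longest_click(clicks: List[Click]) -> Click:
    # The click with the longest dwell time in the session.
    return max(clicks, key=lambda click: click.dwell_seconds)


session = [
    Click("a.example", 5),
    Click("b.example", 120),
    Click("a.example", 30),
]
votes = tally_votes(session)
print(votes["a.example"], votes["b.example"])  # 2 1
print(longest_click(session).url)              # b.example
```

The toy session shows why the two signals differ: one result collects more votes, while another holds the longest click.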
Google has a feature called hostAge that is used specifically “to sandbox fresh spam in serving time”. One of the modules related to page quality scores includes a site-level measure of views from Chrome.
Other intriguing findings from Google’s internal documents:
- Google considers dates in the byline (bylineDate), URL (syntaxDate), and on-page content (semanticDate).
- To evaluate whether a document is a core topic of the website, Google vectorizes pages and sites, then compares the page embeddings to the site embeddings (siteRadius and siteFocusScore).
- Google saves domain registration information (RegistrationInfo).
- Page titles still matter. Google has a feature called title match score, which is thought to measure how well a page title matches a query.
- Google computes the average weighted font size of terms in documents (avgTermWeight) and anchor text.
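The comparison of page embeddings with site embeddings mentioned in the list above can be illustrated with a small Python sketch. The leaked documents name the attributes but not the formulas behind them, so the choices here are assumptions: the site embedding is taken to be the centroid of its page embeddings, distance is measured as one minus cosine similarity, and the three-dimensional vectors and page paths are made up.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    # Assumption: the site embedding is the mean of its page embeddings.
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def distance_from_site(page: Sequence[float], site: Sequence[float]) -> float:
    # Larger values mean the page strays further from the site's overall topic.
    return 1.0 - cosine_similarity(page, site)


# Toy embeddings for a hypothetical shoe site with one off-topic page.
pages: Dict[str, List[float]] = {
    "/running-shoes": [0.9, 0.1, 0.0],
    "/trail-shoes": [0.8, 0.2, 0.0],
    "/crypto-news": [0.0, 0.1, 0.9],
}
site_embedding = centroid(list(pages.values()))
distances = {path: distance_from_site(vec, site_embedding) for path, vec in pages.items()}
off_topic = max(distances, key=distances.get)
print(off_topic)  # /crypto-news
```

In this sketch the page about an unrelated subject sits furthest from the site centroid, which is the kind of signal a page-versus-site comparison could provide.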
What’s Google’s Take On Leaked Search Algorithm Documents?
Google’s main points regarding the leaked internal documents are:
- Google confirmed the authenticity of the 2,500 leaked internal documents, which are filled with details about the data the company collects.
- However, the company cautions against making inaccurate assumptions about how Google Search works based on this “out-of-context, outdated, or incomplete” information from the leaked documents.
- Google states that it has shared extensive information about how Search works and the factors its systems weigh, while also working to protect the integrity of its search results from manipulation.
- The leaked documents suggest that Google collects and potentially uses data that company representatives have previously said does not contribute to ranking web pages, such as clicks and Chrome user data. However, it remains unclear which pieces of data are actually used to rank search content.
- Google maintains that the documents do not reveal how different elements are weighted in search rankings, if at all.
Summary
While confirming the authenticity of the leaked documents, Google cautions against drawing definitive conclusions about its search algorithms and ranking factors from this information, which it says could be outdated, incomplete, or taken out of context.