How to Find All Existing and Archived URLs on a Website

There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what you're searching for. For example, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To work around the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
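
If you're comfortable with a little scripting, another way around the missing export button is Archive.org's CDX API, which returns archived capture records directly. Here's a minimal Python sketch; example.com is a placeholder, and you'd add the API's pagination parameters for very large domains:

```python
import requests

# Query the Wayback Machine's CDX API for URLs archived under a domain.
# "example.com" is a placeholder; swap in your own site.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",  # all paths under the domain
        "output": "json",
        "fl": "original",        # return only the originally captured URL
        "collapse": "urlkey",    # deduplicate by normalized URL
    },
    timeout=60,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # first row is the column header
print(len(urls), "archived URLs found")
```

Even with a scripted export, the quality caveat above still applies: filter out obvious resource files (e.g., .js, .png) before merging.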

Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this approach generally works well as a proxy for Googlebot's discoverability.
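
As a rough illustration of what that API export might look like, here's a sketch against Moz's v2 links endpoint. Treat the endpoint, request fields, and response keys as assumptions to verify against Moz's current API documentation:

```python
import requests

# Rough sketch of pulling inbound-link targets from the Moz Links API.
# The endpoint, request fields, and response keys below are assumptions
# based on Moz's v2 API -- verify them against the current docs.
ACCESS_ID = "your-access-id"  # placeholder credentials
SECRET_KEY = "your-secret-key"

resp = requests.post(
    "https://lsapi.seomoz.com/v2/links",
    auth=(ACCESS_ID, SECRET_KEY),
    json={
        "target": "example.com",  # placeholder domain
        "target_scope": "root_domain",
        "limit": 50,
    },
    timeout=60,
)
resp.raise_for_status()
# Each result describes one inbound link; the target is a URL on your site.
targets = {link["target"] for link in resp.json().get("results", [])}
print(sorted(targets))
```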

Google Search Console
Google Search Console offers several valuable tools for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
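
For context, here's a minimal sketch of paging through the Search Analytics endpoint with Google's official Python client. The property URL, date range, and credentials file are placeholders, and it assumes a service account with access to the property:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Page through Search Analytics rows past the UI's export cap.
# "sc-domain:example.com" and "service-account.json" are placeholders.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

urls, start_row = set(), 0
while True:
    body = {
        "startDate": "2024-01-01",
        "endDate": "2024-03-31",
        "dimensions": ["page"],
        "rowLimit": 25000,      # API maximum per request
        "startRow": start_row,  # paginate through the full dataset
    }
    rows = (
        service.searchanalytics()
        .query(siteUrl="sc-domain:example.com", body=body)
        .execute()
        .get("rows", [])
    )
    if not rows:
        break
    urls.update(row["keys"][0] for row in rows)
    start_row += len(rows)

print(len(urls), "pages with impressions")
```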

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights. If you'd rather script the same filter, see the sketch below.
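
The GA4 Data API can apply the same kind of pagePath filter programmatically. A minimal sketch, assuming the google-analytics-data Python client, a placeholder property ID, and credentials supplied via GOOGLE_APPLICATION_CREDENTIALS:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

# List pagePath values containing /blog/, mirroring the segment steps
# above. "properties/123456789" is a placeholder property ID.
client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/123456789",
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,
)
response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(len(paths), "blog paths")
```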

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path queried by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be huge, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process (and if you just need the paths, see the sketch after this list).
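
If all you need is the raw list of requested paths, a few lines of Python go a long way. A minimal sketch, assuming logs in the common Apache/Nginx combined format (adjust the regex for your CDN's layout; "access.log" is a placeholder filename):

```python
import re

# Extract unique request paths from an access log in combined format.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1))

print(len(paths), "unique paths requested")
```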
Merge, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
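
If you go the Jupyter route, the normalization and deduplication step might look something like this sketch. The input filenames and the "url" column are assumptions about how you saved each export:

```python
from urllib.parse import urlsplit, urlunsplit

import pandas as pd

# Combine URL exports, normalize formatting, and deduplicate.
# The CSV filenames and "url" column are placeholders.
frames = [pd.read_csv(f) for f in ["archive_org.csv", "gsc.csv", "ga4.csv", "logs.csv"]]
urls = pd.concat(frames)["url"].dropna()

def normalize(url: str) -> str:
    """Lowercase scheme and host, drop fragments, trim trailing slashes."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )

deduped = sorted({normalize(u) for u in urls})
pd.Series(deduped, name="url").to_csv("all_urls.csv", index=False)
print(len(deduped), "unique URLs")
```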

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
