There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what you're searching for. For example, you might want to:
Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration issues
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.
In this post, I'll walk you through some tools to build your URL list and then deduplicate the data using a spreadsheet or Jupyter Notebook, depending on your site's size.
Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
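If you'd rather skip the scraping plugin, the Wayback Machine's CDX API exposes the same capture data programmatically. Below is a minimal Python sketch; treat the exact parameters as assumptions to verify against the current CDX documentation, and adjust the domain and limit to suit your site.

```python
# Minimal sketch: pull archived URLs for a domain from the Wayback Machine
# CDX API instead of scraping the web UI.
import requests

def wayback_urls(domain: str, limit: int = 10000) -> list[str]:
    params = {
        "url": f"{domain}/*",   # match every path on the domain
        "output": "json",       # JSON rows instead of plain text
        "fl": "original",       # only return the original URL column
        "collapse": "urlkey",   # collapse near-identical captures
        "limit": limit,         # mirrors the UI cap; raise it if needed
    }
    resp = requests.get("https://web.archive.org/cdx/search/cdx", params=params, timeout=60)
    resp.raise_for_status()
    rows = resp.json()
    # First row is the header (["original"]); the rest are single-item rows.
    return [row[0] for row in rows[1:]]

urls = wayback_urls("example.com")
print(len(urls), "archived URLs found")
```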
Moz Pro
Although you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
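As a rough illustration, here's how you might pull the unique target URLs out of a Moz Pro inbound links export with a few lines of Python. The file name and the "Target URL" column name are assumptions; check the header row of your own export and adjust accordingly.

```python
# Minimal sketch: extract the target-URL column from an inbound links CSV export.
import csv

def target_urls_from_export(path: str, column: str = "Target URL") -> set[str]:
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Keep only non-empty values from the target-URL column, deduplicated
        return {row[column].strip() for row in reader if row.get(column)}

urls = target_urls_from_export("moz_inbound_links.csv")
print(len(urls), "unique target URLs")
```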
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.
Performance → Search results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
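For reference, here's a minimal sketch of pulling page-level rows through the Search Console API with the official Python client. It assumes a service account key ("sa.json") with access to the property; the property identifier and date range are placeholders to swap for your own.

```python
# Minimal sketch: page-level impressions via the Search Console API,
# which lifts the 1,000-row UI export cap.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file("sa.json", scopes=SCOPES)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = [], 0
while True:
    resp = service.searchanalytics().query(
        siteUrl="sc-domain:example.com",
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,   # API maximum per request
            "startRow": start_row,
        },
    ).execute()
    rows = resp.get("rows", [])
    pages.extend(r["keys"][0] for r in rows)
    if len(rows) < 25000:        # last page of results reached
        break
    start_row += 25000

print(len(pages), "pages with impressions")
```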
Indexing → Pages report:
This section provides exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps (a programmatic alternative is sketched after the note below):
Step 1: Add a segment to the report
Step 2: Click "Create a new segment."
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
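If you'd rather pull the same filtered list programmatically, here's a minimal sketch using the GA4 Data API Python client. The property ID and the /blog/ filter value are placeholders, and it assumes application-default credentials are already configured for an account with access to the property.

```python
# Minimal sketch: page paths from the GA4 Data API, filtered to /blog/.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # uses GOOGLE_APPLICATION_CREDENTIALS

request = RunReportRequest(
    property="properties/123456789",            # placeholder property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",                  # narrower URL pattern
            ),
        )
    ),
    limit=100000,
)

response = client.run_report(request)
blog_paths = [row.dimension_values[0].value for row in response.rows]
print(len(blog_paths), "blog paths")
```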
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be massive, so many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process.
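As a starting point, here's a small Python sketch that pulls the unique request paths out of a standard Apache or nginx access log. The regex assumes the common/combined log format; CDN logs usually need a different parser, and the file name is a placeholder.

```python
# Minimal sketch: extract unique request paths from an access log.
import re
from urllib.parse import urlsplit

# Matches the request portion of a common/combined log line,
# e.g. ... "GET /some/path?x=1 HTTP/1.1" ...
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

def paths_from_log(path: str) -> set[str]:
    found = set()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = REQUEST_RE.search(line)
            if m:
                # Strip query strings so /page?utm=x and /page collapse together
                found.add(urlsplit(m.group(1)).path)
    return found

paths = paths_from_log("access.log")
print(len(paths), "unique paths")
```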
Combine, and good luck
Once you've gathered URLs from all of these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
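If you go the Jupyter Notebook route, a pandas sketch like the one below covers the combine-and-deduplicate step. The file names, the single "url" column, and the normalization rules are all assumptions to adapt to your own exports.

```python
# Minimal sketch: combine URL lists from the different exports,
# normalize obvious inconsistencies, and deduplicate.
import pandas as pd

frames = [
    pd.read_csv("wayback_urls.csv"),   # each file assumed to have a "url" column
    pd.read_csv("gsc_pages.csv"),
    pd.read_csv("ga4_pages.csv"),
    pd.read_csv("log_paths.csv"),
]
urls = pd.concat(frames, ignore_index=True)["url"]

urls = (
    urls.str.strip()
        .str.replace(r"^http://", "https://", regex=True)  # unify protocol
        .str.rstrip("/")                                    # unify trailing slashes
)

combined = urls.drop_duplicates().sort_values()
combined.to_csv("all_urls.csv", index=False, header=["url"])
print(len(combined), "unique URLs")
```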
And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!