Why Use tldextract?
Many URL-processing tasks depend on splitting a hostname into its subdomain, registered domain, and public suffix. Whether you’re developing a spam filter, analyzing web traffic, or managing SEO projects, tldextract provides a reliable, ready-to-use way to do it.
How Does tldextract Work?
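tldextract uses the Public Suffix List, a published list of the suffixes under which domains can be registered (such as com and co.uk), to split a hostname into subdomain, domain, and suffix. On first use it fetches the current list over HTTP and caches it locally, so multi-part suffixes are recognized without hard-coded rules.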
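To illustrate the idea, here is a minimal sketch of longest-suffix matching against a tiny hand-written suffix set. It is not tldextract's implementation: the `SUFFIXES` set and the `split_host` function are made up for this example, and the real Public Suffix List also contains wildcard and exception rules that are ignored here.

```python
# Hypothetical, hand-picked subset of public suffixes (the real list has thousands).
SUFFIXES = {"com", "uk", "co.uk", "org"}


def split_host(host, suffixes=SUFFIXES):
    """Split a hostname into (subdomain, domain, suffix) using the longest matching suffix."""
    labels = host.lower().split(".")
    # Starting at the leftmost label means the longest candidate suffix is tried first.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[:i - 1]) if i > 1 else ""
            return subdomain, domain, candidate
    # No known suffix: treat the last label as the domain.
    return ".".join(labels[:-1]), labels[-1], ""


print(split_host("forums.bbc.co.uk"))  # ('forums', 'bbc', 'co.uk')
print(split_host("forums.bbc.com"))    # ('forums', 'bbc', 'com')
```

Splitting on the last dot alone would report co.uk as the domain in the first case, which is the mistake a suffix list avoids.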
Basic Example
Here’s a simple example of extracting the registered domain:
import tldextract
ext = tldextract.extract('http://forums.bbc.co.uk')
print(ext.registered_domain)  # Output: bbc.co.uk
Decomposing a URL
You can also extract individual components like subdomains, domains, and suffixes:
import tldextract
url = "https://blog.data-science.example.co.uk/path"
ext = tldextract.extract(url)
print("Subdomain:", ext.subdomain) # Output: blog.data-science
print("Domain:", ext.domain) # Output: example
print("Suffix:", ext.suffix) # Output: co.uk
Advantages of tldextract
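- Accuracy: Correctly handles multi-part suffixes such as co.uk, where splitting on the last dot would give the wrong domain.
- Suffix List Updates: Fetches the Public Suffix List on first use and caches it; the cache can be refreshed to pick up new suffixes.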
- Ease of Use: Provides a straightforward API for extracting domain components.
Use Cases
tldextract can be applied to various real-world scenarios, including:
- Filtering and whitelisting domains for security applications.
- Analyzing web server logs to determine top-level traffic sources.
- Building SEO tools to classify domains and subdomains.
- Detecting malicious domains in anti-phishing systems.
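As a sketch of the log-analysis use case, the helper below counts how often each registered domain appears in a list of URLs. The function name `top_registered_domains` and its `registered_domain_of` parameter are assumptions made for this example. By default it calls `tldextract.extract(url).registered_domain`; a custom callable can be passed instead, for instance a stub in tests.

```python
from collections import Counter


def top_registered_domains(urls, registered_domain_of=None, n=3):
    """Return the n most common registered domains among urls."""
    if registered_domain_of is None:
        # Imported lazily so the helper can also run with a stand-in callable.
        import tldextract

        def registered_domain_of(url):
            return tldextract.extract(url).registered_domain

    counts = Counter()
    for url in urls:
        domain = registered_domain_of(url)
        if domain:  # skip inputs with no recognizable registered domain
            counts[domain] += 1
    return counts.most_common(n)
```

Called on referrer URLs taken from a server log, this would group www.example.co.uk and shop.example.co.uk under example.co.uk.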
Filtering by Registered Domain
You can easily filter URLs by registered domain:
import tldextract
url = "https://malicious.example.com"
ext = tldextract.extract(url)
if ext.registered_domain == "example.com":
    print("URL matches the target domain.")
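Comparing the registered domain for equality, as above, is stricter than a substring test on the raw URL. The dependency-free sketch below shows why. The lookalike URL and the `host_is_under` helper are invented for illustration.

```python
from urllib.parse import urlsplit

TARGET = "example.com"
# Made-up lookalike: the target name appears inside another domain's hostname.
lookalike = "https://example.com.login.example.net/signin"


def host_is_under(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = urlsplit(url).hostname or ""
    return host == domain or host.endswith("." + domain)


# A substring test is fooled because the target appears inside the hostname.
print(TARGET in lookalike)  # True
# Matching on label boundaries rejects the lookalike.
print(host_is_under(lookalike, TARGET))  # False
print(host_is_under("https://malicious.example.com", TARGET))  # True
```

With tldextract, this lookalike's registered domain is example.net, so the comparison against "example.com" fails as intended.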
Optimizing Performance
If you process a large number of URLs, configure tldextract to use a local copy of the Public Suffix List to avoid network delays:
import tldextract
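# An empty suffix_list_urls disables the live HTTP fetch; tldextract then
# falls back to the suffix list snapshot bundled with the package.
extractor = tldextract.TLDExtract(cache_dir='/path/to/local/cache', suffix_list_urls=())
ext = extractor("https://example.co.uk")
print(ext.registered_domain)  # Output: example.co.uk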
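When the same hosts recur many times, as they often do in logs, memoizing results per hostname is another option. This is a sketch rather than a tldextract feature: `make_cached_lookup` is a hypothetical helper that wraps any callable mapping a hostname to its registered domain, for example `lambda host: extractor(host).registered_domain` with the `extractor` defined above.

```python
from functools import lru_cache
from urllib.parse import urlsplit


def make_cached_lookup(registered_domain_of, maxsize=10_000):
    """Wrap a hostname -> registered-domain callable with an LRU cache keyed by hostname."""
    cached = lru_cache(maxsize=maxsize)(registered_domain_of)

    def lookup(url):
        # Different paths on the same host share one cache entry.
        # Assumes URLs include a scheme; otherwise urlsplit finds no hostname.
        host = urlsplit(url).hostname or ""
        return cached(host)

    return lookup
```

Whether this helps depends on how repetitive the input is, so it is worth measuring on real data before adopting it.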
Learn More
For more details, visit the official tldextract GitHub repository.