tldextract: The Best Python Library for Domain Name Extraction

Why Use tldextract?

When working with URLs, extracting key components like subdomains, main domains, and suffixes is essential. Whether you’re developing a spam filter, analyzing web traffic, or managing SEO projects, tldextract provides a reliable and ready-to-use solution.

How Does tldextract Work?

tldextract uses the Public Suffix List to accurately identify domains and their components. This ensures it remains up-to-date with the latest suffix changes.

Basic Example

Here’s a simple example of extracting the registered domain:


import tldextract

ext = tldextract.extract('http://forums.bbc.co.uk')
print(ext.registered_domain)  # Output: bbc.co.uk
    

Decomposing a URL

You can also extract individual components like subdomains, domains, and suffixes:


import tldextract

url = "https://blog.data-science.example.co.uk/path"
ext = tldextract.extract(url)

print("Subdomain:", ext.subdomain)  # Output: blog.data-science
print("Domain:", ext.domain)        # Output: example
print("Suffix:", ext.suffix)        # Output: co.uk
    

Advantages of tldextract

  • Accuracy: Handles complex domain structures and non-standard URLs.
  • Automatic Updates: Can fetch the latest Public Suffix List, so results stay current as suffixes change.
  • Ease of Use: Provides a straightforward API for extracting domain components.

Use Cases

tldextract can be applied to various real-world scenarios, including:

  • Filtering and whitelisting domains for security applications.
  • Analyzing web server logs to determine top-level traffic sources.
  • Building SEO tools to classify domains and subdomains.
  • Detecting malicious domains in anti-phishing systems.

Filtering by Registered Domain

You can easily filter URLs by registered domain:


import tldextract

url = "https://malicious.example.com"
ext = tldextract.extract(url)

if ext.registered_domain == "example.com":
    print("URL matches the target domain.")
    

Optimizing Performance

If you process a large number of URLs, configure tldextract to use a local copy of the Public Suffix List to avoid network delays:


import tldextract

extractor = tldextract.TLDExtract(cache_dir='/path/to/local/cache', suffix_list_urls=None)
ext = extractor("https://example.co.uk")
print(ext.registered_domain)
    

Learn More

For more details, visit the official tldextract GitHub repository.
