Python Scripts – Quentin Adt's Notes

A script that adds a column containing only the domain name to an existing CSV file. It extracts the domain from a column containing a URL.

It works with .co.uk and other country-code top-level domains.

Just replace “5” with the index of the column containing the URL. Indexes start at 0, so 5 means the sixth column.

Also don’t forget to adjust the delimiter. Here it is set up for semicolons for both input and output. Just change the delimiter to the one you need.

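import csv
import tldextract

# Index of the column containing the URL (0-based, so 5 is the sixth column)
URL_COLUMN = 5

# newline='' lets the csv module handle line endings itself
with open('input.csv', 'r', newline='') as csvinput:
    with open('output.csv', 'w', newline='') as csvoutput:
        writer = csv.writer(csvoutput, delimiter=';')
        reader = csv.reader(csvinput, delimiter=';')

        # Copy the header row and add the name of the new column
        header = next(reader)
        header.append('domain_name')
        writer.writerow(header)

        # Write each row as it is read instead of buffering the whole file
        for row in reader:
            ext = tldextract.extract(row[URL_COLUMN])
            row.append(ext.registered_domain)
            writer.writerow(row)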
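
If you are not sure which delimiter a file uses, the standard library's csv.Sniffer can guess it from a sample of the text. This is only a minimal sketch: detect_delimiter and the list of candidate characters are illustrative names and values, not part of the script above.

```python
import csv


def detect_delimiter(sample, candidates=";,\t|"):
    """Guess the delimiter of a CSV text sample, restricted to a few candidates."""
    # sniff() raises csv.Error when it cannot settle on a delimiter
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


# Hypothetical sample; in practice, read the first few KB of the input file
sample = (
    "id;name;url\n"
    "1;BBC;http://forums.bbc.co.uk\n"
    "2;Example;https://example.com\n"
)
print(detect_delimiter(sample))
```

The detected character can then be passed as the delimiter argument of csv.reader and csv.writer. Sniffing is a heuristic, so it is worth checking the result on a file you know before trusting it.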

Why Use tldextract?

When working with URLs, extracting key components like subdomains, main domains, and suffixes is essential. Whether you’re developing a spam filter, analyzing web traffic, or managing SEO projects, tldextract provides a reliable and ready-to-use solution.

How Does tldextract Work?

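tldextract uses the Public Suffix List to accurately identify domains and their components. By default it downloads the current list the first time it is used and caches it locally, and it falls back to a snapshot bundled with the package if the download fails.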
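
To see why a suffix list is needed at all, here is a toy version of the idea: find the longest known suffix at the end of the hostname, then take the label just before it as the domain. This is a simplified illustration, not tldextract's actual implementation. TOY_SUFFIXES and split_host are made-up names, and the real Public Suffix List has thousands of entries plus wildcard and exception rules that this sketch ignores.

```python
# Tiny hard-coded subset of suffixes, for illustration only
TOY_SUFFIXES = {"com", "org", "uk", "co.uk"}


def split_host(host, suffixes=TOY_SUFFIXES):
    """Split a hostname into (subdomain, domain, suffix) using the longest known suffix."""
    labels = host.lower().strip(".").split(".")
    # Candidates go from longest to shortest, so "co.uk" is tried before "uk"
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[:i - 1]) if i > 1 else ""
            return subdomain, domain, candidate
    # No known suffix: treat the last label as the domain
    return ".".join(labels[:-1]), labels[-1], ""


host = "forums.bbc.co.uk"
print(split_host(host))                # ('forums', 'bbc', 'co.uk')
print(".".join(host.split(".")[-2:]))  # naive "last two labels" gives: co.uk
```

The naive "last two labels" rule returns the suffix itself for .co.uk hosts, which is exactly the mistake a suffix list avoids.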

Basic Example

Here’s a simple example of extracting the registered domain:


import tldextract

ext = tldextract.extract('http://forums.bbc.co.uk')
print(ext.registered_domain)  # Output: bbc.co.uk

Decomposing a URL

You can also extract individual components like subdomains, domains, and suffixes:


import tldextract

url = "https://blog.data-science.example.co.uk/path"
ext = tldextract.extract(url)

print("Subdomain:", ext.subdomain)  # Output: blog.data-science
print("Domain:", ext.domain)        # Output: example
print("Suffix:", ext.suffix)        # Output: co.uk

Advantages of tldextract

Accuracy: Handles complex domain structures and non-standard URLs.
Automatic Updates: Keeps track of changes in the Public Suffix List.
Ease of Use: Provides a straightforward API for extracting domain components.

Use Cases

tldextract can be applied to various real-world scenarios, including:

Filtering and whitelisting domains for security applications.
Analyzing web server logs to determine top-level traffic sources.
Building SEO tools to classify domains and subdomains.
Detecting malicious domains in anti-phishing systems.
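
As a rough sketch of the log-analysis use case, the snippet below counts how many URLs fall under each registered domain. top_domains and its to_domain parameter are hypothetical names: to_domain is any function that maps a URL to its registered domain, which keeps the counting logic separate from the extraction itself.

```python
from collections import Counter


def top_domains(urls, to_domain, n=3):
    """Count URLs per registered domain and return the n most common as (domain, count)."""
    counts = Counter()
    for url in urls:
        domain = to_domain(url)
        if domain:  # skip inputs with no recognisable registered domain
            counts[domain] += 1
    return counts.most_common(n)


# With tldextract installed, the extraction function could be passed like this:
#   import tldextract
#   top_domains(urls, lambda u: tldextract.extract(u).registered_domain)
```

Grouping by registered domain means that blog.example.com and shop.example.com are counted together, which is usually what you want when ranking traffic sources.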

Filtering by Registered Domain

You can easily filter URLs by registered domain:


import tldextract

url = "https://malicious.example.com"
ext = tldextract.extract(url)

if ext.registered_domain == "example.com":
    print("URL matches the target domain.")

Optimizing Performance

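If you process a large number of URLs, or run somewhere without network access, configure tldextract to skip the live download of the Public Suffix List and use the snapshot bundled with the package instead:


import tldextract

# An empty suffix_list_urls disables the live HTTP fetch
extractor = tldextract.TLDExtract(cache_dir='/path/to/local/cache', suffix_list_urls=())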
ext = extractor("https://example.co.uk")
print(ext.registered_domain)
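
When the same hostnames come back again and again, as they often do in logs, another option is to memoize the lookup per hostname. The sketch below is an assumption about what may help, not something tldextract requires, so measure before adopting it. cached_by_host and to_domain are illustrative names; to_domain stands for any function that turns a hostname into a registered domain.

```python
from functools import lru_cache
from urllib.parse import urlsplit


def cached_by_host(to_domain, maxsize=10_000):
    """Wrap a host -> domain function so each distinct hostname is resolved only once."""
    resolve = lru_cache(maxsize=maxsize)(to_domain)

    def lookup(url):
        # Fall back to the raw string when there is no scheme to parse a host from
        host = urlsplit(url).hostname or url
        return resolve(host)

    lookup.cache_info = resolve.cache_info  # expose hit/miss statistics
    return lookup


# With tldextract installed, a wrapped lookup could be built like this:
#   import tldextract
#   lookup = cached_by_host(lambda h: tldextract.extract(h).registered_domain)
```

Keying the cache on the hostname rather than the full URL means that two pages on the same site share one cache entry. Note that urlsplit can raise ValueError on badly malformed input, so untrusted data may need a try/except around the call.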

Learn More

For more details, visit the official tldextract GitHub repository.
