Category: Python Scripts

  • Add column with domain name to CSV file with Python

    A script that adds a column containing only the domain name to an existing CSV file. It extracts the domain from a column containing a URL.

    It works with .co.uk and other country-code top-level domains.

    Just replace “5” with the index of the column containing the URL. Indexes start at 0, so 5 means the sixth column.

    Also don’t forget to adjust the delimiter. Here it is set up for semicolons for both input and output. Just replace the delimiter with the one you need.

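    import csv
    import tldextract
    
    # newline='' lets the csv module handle line endings itself
    with open('input.csv', 'r', newline='') as csvinput:
        with open('output.csv', 'w', newline='') as csvoutput:
            writer = csv.writer(csvoutput, delimiter=';')
            reader = csv.reader(csvinput, delimiter=';')
    
            # Copy the header row and add the name of the new column
            rows = []
            header = next(reader)
            header.append('domain_name')
            rows.append(header)
    
            for row in reader:
                # The URL is in column index 5 (0-based)
                ext = tldextract.extract(row[5])
                # registered_domain is domain + suffix, e.g. "example.co.uk"
                row.append(ext.registered_domain)
                rows.append(row)
    
            writer.writerows(rows)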
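
    For large files it may be preferable to write each row as soon as it is read instead of collecting everything in a list. The sketch below is one possible variant, not the original script: the function name add_domain_column and its parameters are hypothetical, it looks the URL column up by header name rather than by index, and it takes the domain extraction as a callable (get_domain) so it does not depend on any particular library.

```python
import csv


def add_domain_column(src_path, dst_path, url_column, get_domain, delimiter=";"):
    """Copy a CSV file row by row, appending a 'domain_name' column.

    url_column is the header name of the column holding the URL.
    get_domain is any callable mapping a URL string to a domain string.
    Both names are illustrative assumptions, not part of the original script.
    """
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter)

        # Locate the URL column from the header (raises ValueError if absent)
        header = next(reader)
        url_index = header.index(url_column)
        writer.writerow(header + ["domain_name"])

        # Stream the remaining rows: nothing is kept in memory
        for row in reader:
            writer.writerow(row + [get_domain(row[url_index])])
```

    With tldextract installed, get_domain could be something like lambda url: tldextract.extract(url).registered_domain.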

  • tldextract: The Best Python Library for Domain Name Extraction

    Why Use tldextract?

    When working with URLs, extracting key components like subdomains, main domains, and suffixes is essential. Whether you’re developing a spam filter, analyzing web traffic, or managing SEO projects, tldextract provides a reliable and ready-to-use solution.

    How Does tldextract Work?

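    tldextract uses the Public Suffix List (PSL) to work out where the suffix of a hostname ends, so multi-part suffixes such as co.uk are handled correctly. On first use it downloads the current list and caches it locally; if the download fails, it falls back to a snapshot bundled with the package.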
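
    To give an idea of the principle, here is a deliberately simplified sketch of suffix matching. It is not tldextract's implementation: TOY_SUFFIXES is a tiny hand-written stand-in for the real list, split_host is a hypothetical helper, and the wildcard and exception rules of the real Public Suffix List are ignored.

```python
# Toy stand-in for the Public Suffix List. The real list has thousands of
# entries plus wildcard and exception rules that this sketch ignores.
TOY_SUFFIXES = {"com", "org", "uk", "co.uk"}


def split_host(host, suffixes=TOY_SUFFIXES):
    """Return (subdomain, domain, suffix) using the longest matching suffix."""
    labels = host.lower().strip(".").split(".")

    # Candidates go from longest (the whole host) to shortest (the last label),
    # so "co.uk" is found before "uk".
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in suffixes:
            domain = labels[start - 1] if start > 0 else ""
            subdomain = ".".join(labels[:start - 1]) if start > 1 else ""
            return subdomain, domain, candidate

    # Assumption for this sketch: with no known suffix, the last label is
    # treated as the domain and the suffix is left empty.
    return ".".join(labels[:-1]), labels[-1], ""


print(split_host("forums.bbc.co.uk"))  # ('forums', 'bbc', 'co.uk')
```

    Splitting on the last dot alone would have returned "uk" as the suffix and "co" as the domain, which is exactly the mistake a suffix list avoids.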

    Basic Example

    Here’s a simple example of extracting the registered domain:

    
    import tldextract
    
    ext = tldextract.extract('http://forums.bbc.co.uk')
    print(ext.registered_domain)  # Output: bbc.co.uk
        

    Decomposing a URL

    You can also extract individual components like subdomains, domains, and suffixes:

    
    import tldextract
    
    url = "https://blog.data-science.example.co.uk/path"
    ext = tldextract.extract(url)
    
    print("Subdomain:", ext.subdomain)  # Output: blog.data-science
    print("Domain:", ext.domain)        # Output: example
    print("Suffix:", ext.suffix)        # Output: co.uk
        

    Advantages of tldextract

    • Accuracy: Handles multi-part suffixes such as co.uk, which naive splitting on dots gets wrong.
    • Automatic Updates: Fetches the current Public Suffix List on first use and caches it locally.
    • Ease of Use: Provides a straightforward API for extracting domain components.

    Use Cases

    tldextract can be applied to various real-world scenarios, including:

    • Filtering and whitelisting domains for security applications.
    • Analyzing web server logs to determine top-level traffic sources.
    • Building SEO tools to classify domains and subdomains.
    • Detecting malicious domains in anti-phishing systems.

    Filtering by Registered Domain

    You can easily filter URLs by registered domain:

    
    import tldextract
    
    url = "https://malicious.example.com"
    ext = tldextract.extract(url)
    
    if ext.registered_domain == "example.com":
        print("URL matches the target domain.")
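
    The same comparison extends to a whole allowlist. The following is a minimal sketch under stated assumptions: filter_by_registered_domain and its registered_domain_of parameter are hypothetical names, and the extraction itself is passed in as a callable so the helper stays independent of the library.

```python
def filter_by_registered_domain(urls, allowed_domains, registered_domain_of):
    """Keep only the URLs whose registered domain is in allowed_domains.

    registered_domain_of is any callable mapping a URL to its registered
    domain (an assumption of this sketch, not a tldextract function).
    """
    # Normalise once so the membership test is case-insensitive
    allowed = {domain.lower() for domain in allowed_domains}
    return [url for url in urls if registered_domain_of(url).lower() in allowed]
```

    With tldextract, registered_domain_of could be lambda url: tldextract.extract(url).registered_domain. A look-alike host such as example.com.evil.net is then rejected, because its registered domain is evil.net, whereas a plain substring test for "example.com" would let it through.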
        

    Optimizing Performance

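    If you process a large number of URLs, create a single TLDExtract instance and reuse it for every URL. Passing no suffix list URLs disables the live HTTP fetch, so the extractor relies on the Public Suffix List snapshot bundled with the package and avoids network delays: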

    
    import tldextract
    
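    # An empty suffix_list_urls means no HTTP request is made
    extractor = tldextract.TLDExtract(cache_dir='/path/to/local/cache', suffix_list_urls=())
    # Reuse the same extractor for every URL
    ext = extractor("https://example.co.uk")
    print(ext.registered_domain)  # Output: example.co.uk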
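
    When the same hosts come up again and again, as in server logs, memoizing the lookup per hostname may save some work as well. This is a complementary idea rather than a tldextract feature: make_cached_lookup is a hypothetical helper, and whether it pays off depends on how repetitive your data is.

```python
from functools import lru_cache
from urllib.parse import urlsplit


def make_cached_lookup(domain_of_host, maxsize=10_000):
    """Wrap a host -> domain callable so each distinct host is resolved once.

    domain_of_host is any callable taking a hostname string; the name and
    the default maxsize are assumptions made for this sketch.
    """
    cached = lru_cache(maxsize=maxsize)(domain_of_host)

    def lookup(url):
        # Cache on the hostname only, so different paths share one entry
        host = urlsplit(url).hostname or ""
        return cached(host)

    return lookup
```

    For example, domain_of_host could be lambda host: extractor(host).registered_domain, using the extractor created above.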
        

    Learn More

    For more details, visit the official tldextract GitHub repository.