OK, this might sound like we are in the spamming business now. Well, we are not. The thing is that an email address is typically the only per-person unique key in CRM data. These few lines of Python extract email addresses from any text file, e.g. an HTML file. The script also de-duplicates the list, so if the same email address appears many times in the original data, it appears only once in the output. Enjoy:
#!/usr/bin/env python
# coding: utf-8
import os
import re
import sys
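
def grab_email(path):
    """Try and grab all email addresses found within a given file."""
    email_pattern = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b',
                               re.IGNORECASE)
    # A set keeps each address only once, however often it appears.
    found = set()
    if os.path.isfile(path):
        # The with-block closes the file even if reading fails.
        with open(path, 'r') as handle:
            for line in handle:
                found.update(email_pattern.findall(line))
    for email_address in found:
        # The parenthesised form works on both Python 2 and Python 3.
        print(email_address)


if __name__ == '__main__':
    grab_email(sys.argv[1])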
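
If the addresses are going to be used as a key, it may help to normalise them before de-duplicating. The set in the script above is case-sensitive, so Jane.Doe@example.com and jane.doe@example.com would both end up in the output. Here is a minimal sketch of one way around that, using the same pattern on a string instead of a file. The names `EMAIL_PATTERN` and `unique_emails` are made up for this example, and lowercasing the whole address is an assumption that suits most CRM data but is not required by the mail standards.

```python
import re

# Same pattern as in the script above.
EMAIL_PATTERN = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b',
                           re.IGNORECASE)


def unique_emails(text):
    """Return the addresses found in text, lowercased, de-duplicated and sorted.

    Hypothetical helper: lowercasing is an assumption, made so that
    differently cased copies of one address collapse into a single entry.
    """
    return sorted({match.lower() for match in EMAIL_PATTERN.findall(text)})


sample = ('<a href="mailto:Jane.Doe@example.com">Jane</a>, '
          'jane.doe@example.com, bob@example.org')
print(unique_emails(sample))
# → ['bob@example.org', 'jane.doe@example.com']
```

Note that the `{2,4}` at the end of the pattern only accepts top-level domains of two to four letters, so an address under a longer one is skipped. Changing it to `{2,}` is one way to loosen that.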