Extracting email addresses from any text file with Python
June 13th, 2009 by Tuomas Rasila

OK, this might sound as if we are in the spamming business now. Well, we are not. The point is that an email address is typically the only per-person unique key in CRM data. These few lines of Python extract email addresses from any text file, e.g. an HTML file. The script also makes the list unique, so an address that appears many times in the original data appears only once in the output. Enjoy:


#!/usr/bin/env python
# coding: utf-8

import os
import re
import sys

def grab_email(file):
    """Try and grab all email addresses found within a given file."""
    email_pattern = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b',
                               re.IGNORECASE)
    found = set()  # a set keeps each address only once
    if os.path.isfile(file):
        # Read line by line so large files do not have to fit in memory.
        with open(file, 'r') as handle:
            for line in handle:
                found.update(email_pattern.findall(line))
    for email_address in found:
        print(email_address)

if __name__ == '__main__':
    grab_email(sys.argv[1])
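
For reuse from other code, the same idea can be sketched as a function that takes a string and returns the addresses instead of printing them. This is a minimal Python 3 sketch, not the script itself: the names EMAIL_PATTERN and extract_emails are made up for illustration, and the pattern is loosened from {2,4} to {2,} on the assumption that top-level domains longer than four letters (such as .museum) should match too. Lowercasing the addresses, so that duplicates differing only in case collapse into one, is also an assumption.

```python
import re

# Hypothetical variant of the pattern: {2,} instead of {2,4} so longer TLDs match.
EMAIL_PATTERN = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b',
                           re.IGNORECASE)

def extract_emails(text):
    """Return the unique email addresses found in text, lowercased and sorted."""
    return sorted({match.lower() for match in EMAIL_PATTERN.findall(text)})

sample = '<a href="mailto:Jane@Example.com">jane@example.com</a>, bob@site.museum'
print(extract_emails(sample))
```

Returning a sorted list rather than printing a set gives a stable order from run to run, which makes the output easier to diff or feed into other tools.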


4 Responses  
garijon writes:
October 1st, 2009 at 11:47 pm

Thanks a lot, it was just what I was looking for.

Carlos writes:
October 2nd, 2009 at 8:36 am

What do I have to do if I want to extract all emails except those that begin with postmaster?

thanks

Tuomas Rasila writes:
October 6th, 2009 at 1:59 am

You can grep the file. Say your file with emails is foo.txt; run the following on the command line:

grep -v postmaster@ foo.txt > new_file.txt
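
If you would rather stay in Python, the same filtering can be sketched roughly as below. Note the difference: the grep command drops every line that contains postmaster@ anywhere, whereas this sketch only drops addresses whose part before the @ starts with postmaster. The function name drop_postmaster is hypothetical, and comparing case-insensitively is an assumption.

```python
def drop_postmaster(addresses, prefix='postmaster'):
    """Return the addresses whose local part does not start with prefix."""
    kept = []
    for address in addresses:
        # The local part is everything before the first '@'.
        local_part = address.split('@', 1)[0]
        if not local_part.lower().startswith(prefix):
            kept.append(address)
    return kept

emails = ['postmaster@example.com', 'jane@example.com', 'Postmaster@site.org']
print(drop_postmaster(emails))
```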

John Kosty writes:
November 14th, 2009 at 1:13 am

Echo Garijon. Needed a good Python how-to example and was lucky enough to find this.

Thanks, big time!

Copyright © 2009 RasilaGarage