Washing emails
Introduction
In this tutorial I’ll show you how to create a simple PHP script to clean up a list of email addresses. As a web developer, you have probably been asked by a manager or marketer to wash a list of emails at some point. Here’s the ultimate solution.
Requirements
There are a few basic requirements when it comes to washing a list of emails:
- Remove duplicates
You don’t want to have multiple occurrences of the same email address.
- Verify that the email address exists
You want to validate the domain of each email so you get rid of spelling mistakes like “htmail.com”, “hotmail.cm”, “gmai.com” etc. Douglas Lovell at IBM Research has written a nice article about email validation in PHP. Email validation is a complex topic if you want to do it right, so we are going to build upon his findings and use his email validation function.
- Use pipes
The script should use STDIN and STDOUT so that it can run without any modification or configuration on various UNIX machines. That means you want to be able to use it like this:
$ cat dirty_emails.txt | php wash.php > clean_emails.txt
- Flexibility
The script should work on as many PHP installations as possible because of privacy issues.
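One way to approach the "verify that the email address exists" requirement is a DNS lookup on the domain part. The following is a minimal sketch under that assumption, not the validation function from the article mentioned above; `email_domain` and `domain_accepts_mail` are hypothetical helper names.

```php
<?php
// Hypothetical helper: return the lowercased part after the last "@",
// or null if there is no usable domain part.
function email_domain($email)
{
    $at = strrpos($email, '@');
    if ($at === false || $at === 0 || $at === strlen($email) - 1) {
        return null;
    }
    return strtolower(substr($email, $at + 1));
}

// Hypothetical helper: true if DNS has an MX or A record for the domain.
// This needs network access, so a DNS outage makes every domain look invalid.
function domain_accepts_mail($domain)
{
    return checkdnsrr($domain, 'MX') || checkdnsrr($domain, 'A');
}
```

A check like this only catches misspellings that do not resolve. A misspelled domain that happens to be registered will still pass.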
The script (wash.php)
Let’s look at the code inside wash.php.
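As a rough sketch of how the requirements above could fit together (this is an illustration, not the original wash.php listing): read one address per line, skip blanks and duplicates, keep only the addresses a validator accepts, and write them out. Here `wash_stream` is a hypothetical name, and `is_valid_email` is a deliberately loose stand-in for a proper validation function such as the one from Douglas Lovell's article. Treating addresses that differ only in letter case as duplicates is also an assumption.

```php
<?php
// Stand-in validator (assumption): something@something.something, no spaces.
// Swap in a real validation function for production use.
function is_valid_email($email)
{
    return preg_match('/^[^@\s]+@[^@\s]+\.[^@\s]+$/', $email) === 1;
}

// Read addresses line by line from $in and write each valid address to $out
// the first time it appears. Returns the number of addresses written.
function wash_stream($in, $out, $isValid = 'is_valid_email')
{
    $seen = array();
    while (($line = fgets($in)) !== false) {
        $email = trim($line);
        $key = strtolower($email); // case-insensitive duplicate check
        if ($email === '' || isset($seen[$key])) {
            continue;
        }
        if (call_user_func($isValid, $email)) {
            $seen[$key] = true;
            fwrite($out, $email . "\n");
        }
    }
    return count($seen);
}
```

Saved as wash.php with a final line `wash_stream(STDIN, STDOUT);`, this could be run with the pipe command shown in the requirements. Passing the validator as a parameter keeps the reading and deduplication logic independent of whichever validation function is used.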
Conclusion
As you can see, this is a very simple script, but I can’t count how many hours of work it has saved me. Always look for simple solutions and think about code reuse.
4 Comments
Žilvinas on November 27th, 2008
You should also validate emails according to the RFC, not only the domain part. Also, a lot of people use easy-to-catch fake emails like a@aol.com or none@gmail.com. You could add some more complex filters to catch these.
Jory on November 27th, 2008
This was already asked by somebody in a comment on Douglas Lovell’s article, but the question stayed unanswered, so I’ll ask again.
Why not just use $checkedEmail = filter_var( $uncheckedEmail, FILTER_VALIDATE_EMAIL );? That should filter out any invalid email addresses and take care of invalid characters at the front or back at the same time, without you having to worry about anything.
Knut Urdalen on November 27th, 2008
Thanks for the feedback, guys.
@RubenV: Sorry, that part was a bit unclear. What I mean is that I want to be able to run the script directly on the server for future processing and destruction, instead of having to download a huge list of email addresses to my local machine. Most privacy policy statements say that account information (like email addresses and other personal details) is handled carefully and not distributed.
@Žilvinas: Douglas Lovell’s script follows RFC 1035, RFC 2234, RFC 2821 and RFC 2822 (that’s why I wanted to use his function; it is the best implementation I’ve found so far). I agree with you that it may be a good idea to try to catch more false positives by applying other filter mechanisms on top. So far I haven’t run into fake emails on the lists I’ve been working with, so it hasn’t been a real issue for me yet. Any suggestions for a real implementation that could be used on top of it?
@Jory: filter_var is only available in PHP >= 5.2 (I still have servers running 5.1). But yes, of course that’s an option. It would also be interesting to compare how filter_var validates email addresses with Douglas Lovell’s function, since there are so many edge cases when it comes to validating emails.
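For comparison, here is a minimal sketch of the filter_var route with a fallback for older PHP versions. The name `check_email` and its `$fallback` parameter are hypothetical and not part of either implementation discussed here.

```php
<?php
// Hypothetical wrapper: prefer filter_var() where it exists (PHP >= 5.2),
// otherwise defer to a fallback validator supplied by the caller.
function check_email($email, $fallback = null)
{
    if (function_exists('filter_var')) {
        // filter_var() returns the address on success and false on failure.
        return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
    }
    if ($fallback !== null) {
        return (bool) call_user_func($fallback, $email);
    }
    return false; // no validator available: reject rather than guess
}
```

Note that FILTER_VALIDATE_EMAIL only checks the syntax of the address. It does no DNS lookup, so on its own it would not catch a misspelled domain like “htmail.com”.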
RubenV on November 27th, 2008
“The script should work on as most PHP installations as possible *because of privacy issues*.” (emphasis added)
Could you elaborate on this? How is running on most PHP installations related to privacy?