Remove duplicate mails

The problem

Sometimes my Evolution email client does a nasty thing: it duplicates some of the mails in my folders. I can think of two causes. Either it loses track of mails downloaded from POP accounts where messages are not deleted immediately, so it fetches them again, or there is a problem with the filters and it forgets to delete the message from the original folder. This happens only once in a while, but in folders containing thousands of mails some duplicates can be found.
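
To check whether a folder is affected at all, something like the following could be used. It is only a minimal sketch, assuming the folder is a plain mbox file that Python's standard mailbox module can read; the function name count_duplicates is made up for this example and is not part of the script below.

```python
import mailbox
from collections import Counter

def count_duplicates(mbox_path):
    """Return {message_id: copies} for every Message-ID seen more than once.

    count_duplicates is a hypothetical helper, not part of the original script.
    """
    box = mailbox.mbox(mbox_path, create=False)
    try:
        counts = Counter(
            str(msg['Message-ID']).strip()
            for msg in box
            if msg['Message-ID'] is not None  # mails without an ID are skipped
        )
    finally:
        box.close()
    return {mid: n for mid, n in counts.items() if n > 1}
```

An empty result means no Message-ID appears twice in that file.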

The solution

The solution is a very old Python script I wrote some years ago, which removes duplicates from an mbox-like file. The usage is:

./cleanupmbox.py -i ~/.evolution/mail/local/Inbox -o ~/.evolution/mail/local/Inbox.ok -h inbox.h

The .h file stores the Message-IDs already seen and can be used to speed up the process later. Of course Evolution has to be stopped before running the script, and afterwards the Inbox.ok file is copied over the Inbox file.
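
Since the last step overwrites the real Inbox, it may be worth keeping a copy of the original first. A minimal sketch of that copy step, assuming Evolution is already stopped and both files live in the same directory; the helper name replace_with_backup and the .bak suffix are assumptions of this example:

```python
import os
import shutil

def replace_with_backup(inbox_path, cleaned_path, backup_suffix='.bak'):
    """Keep a copy of the original mbox, then move the cleaned file over it.

    replace_with_backup and the '.bak' suffix are hypothetical choices.
    """
    backup_path = inbox_path + backup_suffix
    shutil.copy2(inbox_path, backup_path)  # preserve the original first
    os.replace(cleaned_path, inbox_path)   # then swap in the cleaned file
    return backup_path
```

For the command above this would be called with the Inbox path as the first argument and the Inbox.ok path as the second.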

Here is the script:

#!/usr/bin/env python
# author Marilen Corciovei len@len.ro, this code is offered AS IS, use at your own risk

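import sys, email, getopt, marshal

# mbox messages are separated by a line starting with "From " (note the trailing space)
msg_start = 'From '
cleaned = None  # output file, opened in the main block
mids = {}       # Message-IDs seen so far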

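def parse_mbox(file_name):
    # read the mbox line by line and hand each complete message to parse_msg
    file = open(file_name, 'r')
    msg = ''
    lastLine = ''
    while 1:
        line = file.readline()
        if not line: break
        # a new message starts at a separator line preceded by an empty line
        if line.startswith(msg_start) and lastLine == '':
            if len(msg) > 0:
                parse_msg(msg)
            msg = ''
        msg = msg + line
        lastLine = line.strip()
    # the last message has no separator after it, so flush it here
    if len(msg) > 0:
        parse_msg(msg)
    file.close()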

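def parse_msg(smsg):
    m = email.message_from_string(smsg)
    mid = m['message-id']  # None when the header is missing
    if mid is not None and mid in mids:
        print('Duplicate Message-ID: %s' % mid)
        return
    if mid is not None:
        print('New Message-ID: %s' % mid)
        mids[mid] = mid
    # messages without a Message-ID cannot be compared, so they are always kept
    cleaned.write(smsg)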

if __name__=='__main__':
    in_file = ''
    out_file = ''
    hash_file = ''
    try:
        opts, args = getopt.getopt(sys.argv[1:], "i:o:h:")
    except getopt.GetoptError:
        print('Usage: %s -i input -o output [-h hash file]' % sys.argv[0])
        sys.exit(2)
    for o, a in opts:
        if o == "-i":
            in_file = a
        if o == "-o":
            out_file = a
        if o == "-h":
            hash_file = a

    if in_file == '' or out_file == '':
        print('Usage: %s -i input -o output [-h hash file]' % sys.argv[0])
        sys.exit(2)

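    cleaned = open(out_file, 'w')
    if hash_file != '':
        try:
            # marshal data is binary, so the hash file is opened in binary mode
            mids = marshal.load(open(hash_file, 'rb'))
        except (IOError, EOFError, ValueError, TypeError):
            # no usable hash file yet: start with an empty table
            pass

    parse_mbox(in_file)
    cleaned.close()
    if hash_file != '':
        hf = open(hash_file, 'wb')
        marshal.dump(mids, hf)
        hf.close()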
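
For comparison, the same idea can probably be expressed with Python's standard mailbox module instead of splitting the file by hand. This is not the script above but a sketch of an alternative, assuming a well-formed mbox file and an output path that does not exist yet; the function name dedupe_mbox is made up for this example.

```python
import mailbox

def dedupe_mbox(in_path, out_path):
    """Copy in_path to out_path, skipping repeated Message-IDs.

    dedupe_mbox is a hypothetical helper. Returns (kept, dropped).
    """
    seen = set()
    kept = dropped = 0
    src = mailbox.mbox(in_path, create=False)
    dst = mailbox.mbox(out_path)  # created if missing; appends if it exists
    try:
        for msg in src:
            mid = msg['Message-ID']
            key = str(mid).strip() if mid is not None else None
            if key is not None and key in seen:
                dropped += 1
                continue
            if key is not None:
                seen.add(key)
            dst.add(msg)  # messages without a Message-ID are always kept
            kept += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return kept, dropped
```

The sketch keeps no hash file between runs; the set of seen Message-IDs lives only in memory.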

Later edit: this post describes an existing remove-duplicates plugin. Well, at least Python coding does wonders for my optimism.

3 Responses

  1. Thanks for sharing this. I couldn’t get the remove duplicate Plugin to run under my Arch Linux, but with your Python Script I’ve removed the duplicates in my Inbox.

    You’ve spared me a lot of work =)

  2. Hi,

    I am a bit new to this so have a few questions.

    Do I copy the code above (the large chunk) INTO the file named ‘inbox’ (in the relevant evolution folder) or do I create a file named cleanupmbox.py and copy the chunk above into that?

    What I have done is copy the chunk of code into the ‘inbox’ file and then opened a terminal window and typed in the command given in the small black box above (which starts with /cleanupmbox.py…). It came back with this error message ‘bash: /cleanupmbox.py: No such file or directory’ .

    So where is the problem? I know that I have to create a file, according to this message. But what do I do to do that?

    Please give fuller instructions in the future as this will save you and anyone else who comes across this post a lot of trouble (why is it always assumed that those who read these posts are experts at this stuff).

  3. Hi Christopher, you need to create a file called cleanupmbox.py and copy the chunk above into that. However, I recommend you try some of the existing duplicate mail plugins before trying this method. As you said, this post assumes some previous knowledge and you could lose data.
