The problem

Sometimes my Evolution email client does a nasty thing: it duplicates some of the mails in my folders. I can think of two causes. Either it loses track of mails downloaded from POP accounts on which messages are not deleted immediately, so it fetches them again, or there is a problem with the filters, which forget to delete the message from the original folder. This happens only once in a while, but in folders containing thousands of mails some duplicates can be found.

The solution

The solution is this very old Python script I wrote some years ago, which removes duplicates from an mbox-like file. The usage is:

/cleanupmbox.py -i ~/.evolution/mail/local/Inbox  -o ~/.evolution/mail/local/Inbox.ok -h inbox.h

The .h file stores the Message-IDs already seen and can be used to speed up the process later. Of course, Evolution has to be stopped first, and afterwards Inbox.ok has to be copied over the Inbox file.
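
The copy-back step can be sketched as a small helper that keeps a backup of the original Inbox before overwriting it. This is only an illustration and not part of cleanupmbox.py: it is written for Python 3, and the function name `swap_in_cleaned` and the `.bak` suffix are assumptions made for the example.

```python
import os
import shutil


def swap_in_cleaned(inbox, cleaned, backup_suffix='.bak'):
    """Back up `inbox`, then replace it with the `cleaned` file.

    Hypothetical helper: run it only while Evolution is stopped.
    Returns the path of the backup copy.
    """
    backup = inbox + backup_suffix
    # Keep the original mailbox in case the cleaned file turns out to be wrong.
    shutil.copy2(inbox, backup)
    # Move the cleaned file over the original; this fails with OSError
    # if the two paths are on different filesystems.
    os.replace(cleaned, inbox)
    return backup
```

With the paths from the usage line, this would be called as `swap_in_cleaned(os.path.expanduser('~/.evolution/mail/local/Inbox'), os.path.expanduser('~/.evolution/mail/local/Inbox.ok'))`.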

Here is the cleanupmbox.py script:

#!/usr/bin/env python
# author Marilen Corciovei len@len.ro, this code is offered AS IS, use at your own risk

import re, sys, email, getopt, marshal

msg_start = 'From '  # mbox separator lines start with 'From ' (note the trailing space)
cleaned = None
mids = {}

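def parse_mbox(file_name):
    # Split the mbox into messages: a new message starts at a separator
    # line that follows a blank line (or the start of the file)
    file = open(file_name, 'r')
    msg = ''
    lastLine = ''
    while 1:
        line = file.readline()
        if not line: break
        if line.startswith(msg_start) and lastLine == '':
            if len(msg) > 0:
                parse_msg(msg)
            msg = ''
        msg = msg + line
        lastLine = line.strip()
    # The last message has no separator after it, so flush it here
    if len(msg) > 0:
        parse_msg(msg)
    file.close()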

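def parse_msg(smsg):
    m = email.message_from_string(smsg)
    if 'message-id' in m:
        mid = m['message-id']
        if mid in mids:
            # Already seen: skip it so it is not written to the output
            print 'Duplicate Message-ID:', mid
            return
        print 'New Message-ID:', mid
        mids[mid]=mid
    # Messages without a Message-ID cannot be compared, so they are kept
    cleaned.write(smsg)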

if __name__=='__main__':
    in_file = ''
    out_file = ''
    hash_file = ''
    try:
        opts, args = getopt.getopt(sys.argv[1:], "i:o:h:")
    except getopt.GetoptError:
        print 'Usage', sys.argv[0], '-i input -o output [-h hash file]'
        sys.exit(2)
    for o, a in opts:
        if o == "-i":
            in_file = a
        if o == "-o":
            out_file = a
        if o == "-h":
            hash_file = a

    if in_file == '' or out_file == '':
        print 'Usage', sys.argv[0], '-i input -o output [-h hash file]'
        sys.exit(2)

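    # cleaned and mids are module-level names used by parse_msg
    cleaned = open(out_file, 'w')
    if hash_file != '':
        try:
            # Reload the Message-IDs saved by a previous run
            mids = marshal.load(open(hash_file,'r'))
        except (IOError, EOFError, ValueError, TypeError):
            # Missing or unreadable hash file: start with an empty set of IDs
            pass

    parse_mbox(in_file)
    cleaned.close()
    if hash_file != '':
        hash_out = open(hash_file,'w')
        marshal.dump(mids, hash_out)
        hash_out.close()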
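
For readers on Python 3, the same idea (keep the first message for each Message-ID, drop the repeats) can be sketched with the standard library's mailbox module. This is a minimal sketch and not the script above: the function name `dedupe_mbox` is made up for the example, it assumes the output file does not exist yet, it keeps messages that have no Message-ID, and it takes no lock because Evolution is assumed to be stopped.

```python
import mailbox


def dedupe_mbox(in_path, out_path):
    """Copy in_path to out_path, skipping repeated Message-IDs.

    Hypothetical helper. Returns (kept, dropped).
    Messages without a Message-ID are always kept.
    """
    seen = set()
    kept = dropped = 0
    src = mailbox.mbox(in_path, create=False)
    dst = mailbox.mbox(out_path)  # created if missing; appends if it exists
    try:
        for msg in src:
            mid = msg.get('Message-ID')
            if mid is not None:
                # str() also covers headers returned as Header objects
                mid = str(mid).strip()
                if mid in seen:
                    dropped += 1
                    continue
                seen.add(mid)
            dst.add(msg)
            kept += 1
    finally:
        dst.close()  # flushes pending writes
        src.close()
    return kept, dropped
```

Calling `dedupe_mbox('Inbox', 'Inbox.ok')` would return the number of messages kept and dropped. Because the mailbox module does the splitting, the last message in the file is handled like any other.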

Later edit: this post describes an existing remove-duplicates plugin. Well, at least Python coding does wonders for my optimism.

Comments:

Kruste -

Thanks for sharing this. I couldn’t get the remove duplicate Plugin to run under my Arch Linux, but with your Python Script I’ve removed the duplicates in my Inbox. You’ve spared me a lot of work =)


Christopher -

Hi, I am a bit new to this so have a few questions. Do I copy the code above (the large chunk) INTO the file named ‘inbox’ (in the relevant evolution folder) or do I create a file named cleanupmbox.py and copy the chunk above into that? What I have done is copy the chunk of code into the ‘inbox’ file and then opened a terminal window and typed in the command given in the small black box above (which starts with /cleanupmbox.py…). It came back with this error message ‘bash: /cleanupmbox.py: No such file or directory’. So where is the problem? I know that I have to create a file, according to this message. But what do I do to do that? Please give fuller instructions in the future as this will save you and anyone else who comes across this post a lot of trouble (why is it always assumed that those who read these posts are experts at this stuff).


len -

Hi Christopher, you need to create a file called cleanupmbox.py and copy the chunk above into that. However, I recommend you try some of the existing duplicate mail plugins before trying this method. As you said, this post assumes some previous knowledge and you could lose data.


Prad -

len, just to let you know I chose your script over the plugin and it worked wonderfully. In addition, I had a problem with my Virtual folders where the unread count was incorrect (see for example here: bugs.launchpad.net/evolution/+bug/429591 - [use https]). This script seemed to fix this as well. Thank you.


len -

Glad it was helpful :)


Larry -

Len, Your script worked great under maverick with evo 2.30. I had about 3000 dup emails due to an SSL error which would have been a pain to clean up without your script. Thanks


Jacques Malaprade -

Thanks a lot. I have slightly edited the folders to fit with new versions of evo. See: http://ubuntuforums.org/showpost.php?p=11115503&postcount=35


Jose Limas -

I’m afraid this script loses the last message, even if it is not a duplicate. However it is very useful. Thanks.