Collecting Communication Acts (Relationships/Links) from Online Mailing Lists

We have also built a spider that collects messages from the web archives of online mailing lists and parses them into a MySQL database. The name of this database is "online_test".

Users select the root URL of the online mailing list archive and define the HTML syntax of the archive; the details are explained in the example below. The user then clicks "Get URLs" to collect all the messages. In the second step, the user can deselect any messages that are not wanted. Clicking the "Process" button parses the selected messages and stores them in the MySQL database.

 

The description below explains how to fill in the fields, using the archive equinox-dev/maillist.html as an example (taken from the online help).

1) How to fill in the Url keyword:
Take the mailing list http://dev.eclipse.org/mhonarc/lists/equinox-dev/maillist.html as an example. You want to parse the mails listed on this index page, and each mail's URL has the form http://dev.eclipse.org/mhonarc/lists/equinox-dev/msg####.html. So you can enter "msg" as the Url keyword, since "msg" appears in every mail's URL.
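The Url keyword acts as a substring filter on the links of the index page. The following is a minimal sketch of that idea, not the spider's actual code: the names LinkCollector and message_urls and the sample index markup are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def message_urls(index_html, index_url, url_keyword):
    """Return absolute URLs of all links whose href contains url_keyword."""
    collector = LinkCollector()
    collector.feed(index_html)
    urls = []
    for href in collector.hrefs:
        if url_keyword in href:
            absolute = urljoin(index_url, href)
            if absolute not in urls:  # drop duplicates, keep page order
                urls.append(absolute)
    return urls


# Hypothetical excerpt of an index page, for illustration only.
sample_index = """
<ul>
<li><a href="msg00178.html">Re: [equinox-dev] uninstalling plugins</a></li>
<li><a href="msg00177.html">[equinox-dev] uninstalling plugins</a></li>
<li><a href="threads.html">Thread index</a></li>
</ul>
"""
index_url = "http://dev.eclipse.org/mhonarc/lists/equinox-dev/maillist.html"
found = message_urls(sample_index, index_url, "msg")
for url in found:
    print(url)
```

A keyword that is too short could also match navigation links, which is why a string that occurs only in message URLs is the safer choice.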
 
2) How to fill in the field keywords:
Rule: The more precisely you specify the keyword match strings, the more correctly and efficiently the mails are extracted. Pay attention to the characters at the edges of a keyword, such as ':' and white space.
Take the URL http://dev.eclipse.org/mhonarc/lists/equinox-dev/msg00178.html as an example, and look at its HTML source:
Content starts: where you want to start recording the mail. It could be <!--X-Body-of-Message-->
Content ends: where you want to stop recording the mail. It could be <!--X-MsgBody-End--> All the lines between these two markers are recorded as the content of the mail.
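Recording the content amounts to cutting out the text between the two markers. A minimal sketch under that assumption follows; the function name extract_content and the sample page text are hypothetical and not taken from the tool.

```python
def extract_content(page_html, start_marker, end_marker):
    """Return the text between the two markers, or None if either is missing."""
    start = page_html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    # Look for the end marker only after the start marker.
    end = page_html.find(end_marker, start)
    if end == -1:
        return None
    return page_html[start:end].strip()


# Hypothetical page fragment, for illustration only.
sample_page = """<!--X-Body-of-Message-->
<pre>Hypothetical message text.</pre>
<!--X-MsgBody-End-->"""

body = extract_content(
    sample_page, "<!--X-Body-of-Message-->", "<!--X-MsgBody-End-->"
)
print(body)
```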
From: the source file contains:
<!--X-Head-of-Message--> ......
<li><em>From</em>: Jeff McAffer &lt;<A HREF="mailto:Jeff_McAffer@DOMAIN.HIDDEN">Jeff_McAffer@xxxxxxxxxx</A>&gt;</li>
......
So the From keyword is <em>From</em>: and its After field is <!--X-Head-of-Message-->
Subject: the source file contains:
<HEAD> <TITLE>Re: [equinox-dev] uninstalling plugins</TITLE>
......
So the Subject keyword is <TITLE> and its After field is <HEAD>
Date: the source file contains:
<!--X-Head-of-Message--> ......
<li><em>Date</em>: Wed, 11 Feb 2004 08:44:02 -0500</li>
......
So the Date keyword is <em>Date</em>: and its After field is <!--X-Head-of-Message-->
To: the source file contains:
<!--X-Head-of-Message--> ......
<li><em>Delivered-to</em>: equinox-dev@eclipse.org</li>
......
So the To keyword is <em>Delivered-to</em>: and its After field is <!--X-Head-of-Message-->
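Each header field is located by two strings: the After field narrows the search to the part of the page behind that marker, and the keyword marks where the value begins. The sketch below illustrates this lookup with the keyword pairs from the example. The function name extract_field, the rules dictionary, and the assumption that a value ends at the end of its line are illustrative choices, not the tool's documented behaviour.

```python
import html
import re


def extract_field(page_html, after_marker, keyword):
    """Return the text following keyword on its line, searching after after_marker."""
    anchor = page_html.find(after_marker)
    if anchor == -1:
        return None
    pos = page_html.find(keyword, anchor + len(after_marker))
    if pos == -1:
        return None
    start = pos + len(keyword)
    line_end = page_html.find("\n", start)
    if line_end == -1:
        line_end = len(page_html)
    raw = page_html[start:line_end]
    # Remove tags first, then decode entities such as &lt; and &gt;.
    text = re.sub(r"<[^>]*>", "", raw)
    return html.unescape(text).strip()


# Page fragment adapted from the example above.
sample_page = """<HEAD> <TITLE>Re: [equinox-dev] uninstalling plugins</TITLE>
<!--X-Head-of-Message-->
<li><em>From</em>: Jeff McAffer &lt;<A HREF="mailto:Jeff_McAffer@DOMAIN.HIDDEN">Jeff_McAffer@xxxxxxxxxx</A>&gt;</li>
<li><em>Date</em>: Wed, 11 Feb 2004 08:44:02 -0500</li>
<li><em>Delivered-to</em>: equinox-dev@eclipse.org</li>
"""

# field name -> (After field, keyword), as described in the text
rules = {
    "from": ("<!--X-Head-of-Message-->", "<em>From</em>:"),
    "subject": ("<HEAD>", "<TITLE>"),
    "date": ("<!--X-Head-of-Message-->", "<em>Date</em>:"),
    "to": ("<!--X-Head-of-Message-->", "<em>Delivered-to</em>:"),
}
fields = {
    name: extract_field(sample_page, after, keyword)
    for name, (after, keyword) in rules.items()
}
for name, value in fields.items():
    print(name, "=", value)
```

Including the trailing ':' in a keyword keeps the colon out of the extracted value, which is one reason the edge characters of a keyword matter.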
If you treat the reply relationship as a From-To relationship instead, then the From-To pair is "Jeff McAffer"-"Peter Kriens" (since "Jeff McAffer" replies to the message sent by "Peter Kriens"). In that case the source file contains:
<!--X-References--> ......
<ul><li><em>From:</em> Peter Kriens</li></ul></li>
......
<!--X-References-End-->
So the To keyword is <em>From:</em> and its After field is <!--X-References-->
Cc: there is no Cc field in this message, so leave it blank. Alternatively, you can use <em>Delivered-to</em>: as the Cc keyword, with After field <!--X-Head-of-Message-->
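When the reply relationship is used as the From-To relationship, the sender comes from the message header and the recipient from the references block. The following sketch builds such a pair from the example. The helper names text_after, display_name and reply_edge are hypothetical, and trimming the address part from the sender is an assumption made so that the pair contains plain names.

```python
import html
import re


def text_after(page_html, after_marker, keyword):
    """Return the tag-free text that follows keyword on its line, after after_marker."""
    anchor = page_html.find(after_marker)
    if anchor == -1:
        return None
    match = re.search(re.escape(keyword) + r"([^\n]*)", page_html[anchor:])
    if match is None:
        return None
    without_tags = re.sub(r"<[^>]*>", "", match.group(1))
    return html.unescape(without_tags).strip()


def display_name(sender):
    """Drop a trailing '<address>' part and keep only the display name."""
    return sender.split("<", 1)[0].strip()


def reply_edge(page_html):
    """Return (from, to), treating the author of the referenced message as recipient."""
    sender = text_after(page_html, "<!--X-Head-of-Message-->", "<em>From</em>:")
    recipient = text_after(page_html, "<!--X-References-->", "<em>From:</em>")
    if sender is None or recipient is None:
        return None  # not a reply, or the page layout differs
    return (display_name(sender), display_name(recipient))


# Page fragment adapted from the example above.
sample_page = """<!--X-Head-of-Message-->
<li><em>From</em>: Jeff McAffer &lt;<A HREF="mailto:Jeff_McAffer@DOMAIN.HIDDEN">Jeff_McAffer@xxxxxxxxxx</A>&gt;</li>
<!--X-References-->
<ul><li><em>From:</em> Peter Kriens</li></ul></li>
<!--X-References-End-->
"""

edge = reply_edge(sample_page)
print(edge)
```

The two keywords differ only in the position of the colon (<em>From</em>: in the header, <em>From:</em> in the references block), so an exact keyword match is what keeps the two lookups apart.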