[Isis-users] Old CDS-ISIS archives

Luciano Ramalho luciano.ramalho at bireme.org
Mon Apr 11 22:45:14 CEST 2011


2011/4/11 De Smet Egbert <egbert.desmet at ua.ac.be>:
> I have in fact already done so. I run a Perl-script over the texts,
> which puts the typical mail-tags (e.g. Reply-To: etc.) as 'tagged-text'
> with the values of the fields.

Great, so you've already done most of the work, Egbert!

I don't understand what you mean by tagged-text in the context of
ISIS. Maybe using the header names as prefixes?


> Then I run a text-to-ISIS converter over it to convert in ISO2709.

Which text-to-ISIS converter would you use for this job? My plan was
to write one myself, or perhaps a text-to-ID converter (ID is an ASCII
format supported by BIREME's mx tool); then I could use mx to
convert from ID to ISO or directly to MST.
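
A minimal sketch of what such a text-to-ID step might look like in
Python. It assumes the ID layout is an "!ID" line carrying the MFN,
followed by one "!vNNN!" line per field occurrence; the zero-padding
width and the tag numbers are illustrative assumptions, not something
settled in this thread, and should be checked against the mx
documentation.

```python
def to_id_record(mfn, fields):
    """Render one record as ID-format text.

    `fields` is a list of (tag, value) pairs. The tag numbers are
    hypothetical; no field mapping has been agreed on yet.
    """
    # Assumed layout: "!ID <mfn>" then "!vNNN!<content>" per occurrence.
    lines = [f"!ID {mfn:06d}"]
    for tag, value in fields:
        # Collapse line breaks so each occurrence stays on a single line.
        text = " ".join(str(value).split())
        lines.append(f"!v{tag:03d}!{text}")
    return "\n".join(lines) + "\n"


# Hypothetical mapping: 1 = From, 2 = Subject.
record = to_id_record(1, [(1, "someone at example.org"),
                          (2, "Old CDS-ISIS archives")])
```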

> It worked rather well on a sample set of the ISIS archives, after I found a way to keep the quoted e-mails in the body of a message (which carry the same tags as the real ones) from being treated as new e-mails. This was months ago however, so I will now (if I find some time) refresh my memory about it and send you some sample data.

Time is precisely what I am offering, if Henk can provide the raw
files, or at least a sample of them so that I can start.
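
On the quoted-header problem, one possible approach is to let a mail
parser decide where the headers end: in the standard message layout
the header block stops at the first blank line, so anything that
looks like a header further down is just body text. A sketch using
Python's standard email package; the function name and the list of
wanted headers are assumptions for illustration, and it handles a
single message, not the splitting of a monthly file into messages.

```python
from email import message_from_string


def header_fields(raw_message, wanted=("From", "Date", "Subject", "Reply-To")):
    """Return the real headers of one message as a dict.

    Only the header block before the first blank line is parsed as
    headers, so quoted "From:" or "Subject:" lines in the body are
    left alone.
    """
    msg = message_from_string(raw_message)
    return {name: msg[name] for name in wanted if msg[name] is not None}


sample = (
    "From: alice at example.org\n"
    "Subject: Old archives\n"
    "\n"
    "> From: bob at example.org\n"
    "> Subject: something quoted\n"
    "From: not a header, just body text\n"
)
fields = header_fields(sample)
```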

> Then the thing is to get the whole bunch of files (they are many as they are per month for each year at least since 1994) over here and to run the conversion on it.

I can write a small script to automate this transfer. If Henk can
provide a directory listing using the dir command (in DOS) or ls -la
(in Linux), I can send him a script that will upload the files one by
one via FTP to a public FTP site that I can create.
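
A rough sketch of what that upload script could look like, using
Python's standard ftplib. The host name and login shown in the
comments are placeholders, since the FTP site does not exist yet, and
the function name is made up for this example.

```python
import os


def upload_files(ftp, paths):
    """Upload each local file in binary mode, one by one.

    `ftp` is expected to be a connected and logged-in ftplib.FTP
    object (or anything with a compatible storbinary method).
    Returns the list of remote names that were sent.
    """
    uploaded = []
    for path in paths:
        name = os.path.basename(path)
        with open(path, "rb") as fh:
            # STOR stores the file under the same name on the server.
            ftp.storbinary(f"STOR {name}", fh)
        uploaded.append(name)
    return uploaded


# Intended use (placeholder host and credentials):
#   from ftplib import FTP
#   with FTP("ftp.example.org") as ftp:
#       ftp.login("user", "password")
#       upload_files(ftp, list_of_archive_paths)
```

The list of paths would be built from the directory listing, once its
exact format is known.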

> If the structure of the messages (I suppose that is what you mean by RFC-2822) is sufficiently constant, we could go for the database-approach (i.e. ABCD), if not we could go for your approach keeping it at a flat-text level with full-text indexing like Google Desktop's.

Yes, RFC-2822 is the Internet standard that defines the structure of
e-mail messages. In 2001 it superseded the older and better-known
RFC-822, which had been in effect since 1982.

http://tools.ietf.org/html/rfc2822

> The advantage of the database solution of course being one could search on 'From' fields (to do a statistic on who are the most active senders ;-) ) as well as 'Subject' searches on top of the full text of the messages themselves. I remember I succeeded with that with my first tests.

I agree that loading the messages into a database would allow for more
structured searches, but I also think the most important task, from a
digital preservation point of view, is to provide public, hyperlinked
HTML/HTTP access to the original messages. Doing so without a database
allows anyone with a commodity Web hosting account to mirror the
archive.
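
To illustrate the database-free approach, here is a small sketch that
renders one message as a static HTML page with previous/next links.
The function name, the file naming scheme (0001.html, 0002.html, ...)
and the page layout are assumptions made for this example only.

```python
from html import escape


def message_page(subject, sender, body, prev_href=None, next_href=None):
    """Return a complete static HTML page for one archived message."""
    links = []
    if prev_href:
        links.append(f'<a href="{escape(prev_href)}">previous</a>')
    if next_href:
        links.append(f'<a href="{escape(next_href)}">next</a>')
    nav = " | ".join(links)
    # Escape everything taken from the message so that stray "<" or
    # "&" characters cannot break the page.
    return (
        "<html><head><title>{s}</title></head><body>\n"
        "<h1>{s}</h1>\n<p>From: {f}</p>\n<pre>{b}</pre>\n<p>{n}</p>\n"
        "</body></html>\n"
    ).format(s=escape(subject), f=escape(sender), b=escape(body), n=nav)


page = message_page("Old CDS-ISIS archives", "someone <at> example.org",
                    "Message text here.", next_href="0002.html")
```

Pages like this are plain files, so mirroring the archive only
requires copying a directory.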

Thanks for your support, Egbert!


-- 
Luciano Ramalho
supervisor de desenvolvimento || software development lead
BIREME/OPAS/OMS || BIREME/PAHO/WHO

