
run-fetch: a Perl frontend for fetch



This file describes the theory and functionality of the Perl frontend
"run-fetch" written for the leafnode "fetch" program.  No guarantee is
provided that it will work for you or that it will not inflict damage
upon your system.  Use at your own risk.  You're free to use this Perl
source in any way you see fit.

I began using leafnode when I acquired a subscription to one of the
major commercial Usenet services.  I did this because I intend to use
this service solely for the purpose of acquiring binary postings from the
alt.binaries.sounds.mp3 group and its sub-groups.  I have no need to use
it for anything else because I also run INN with a UUCP link for all the
various text groups that I am interested in.  I chose leafnode because
it would allow me to continue to use tin as my newsreader.  I found that
using tin alone was impossible due to the way tin is designed: it always
pulls every header for the selected group.  With my commercial news
provider this was entirely impractical because of my modem based PPP
connection and the retention maintained by this server.  My server
typically maintains better than 120,000 articles in the
alt.binaries.sounds.mp3 group and it would simply take hours for tin to
download those headers every time I wanted to look through the group.

With leafnode, I can effectively have the same "offline" functionality
with tin that I might get with another news reader like Agent.  But
since I generally prefer to avoid using Windows, Agent was not an
option for me. :)

After getting leafnode v1.8.1 and setting it up to work in the
delaybody mode, I noticed that fetch has no fault tolerance.  This is
very problematic for my situation because I will typically select about
200-400 articles from alt.binaries.sounds.mp3 at a time.  This typically
takes 10 or more hours over my PPP modem link, managed by diald.
From observation I noticed that when there was a problem, fetch would
simply lose all the articles I had selected for download.  Given the
inherent problems with modems, fetch would invariably get interrupted
several times a night for one reason or another.  This was effectively
rendering leafnode and fetch useless for my purposes.

To provide some fault tolerance to fetch, I figured that the best way
to implement this would be by creating a back-up of each of the
various files kept under the interesting.groups directory.  Then, if
fetch was interrupted, I would simply restore the back-up.

The way run-fetch actually works is like this:

1)	Because fetch also has a tendency to get "hung", I wrote
	some logic to test whether there was already a running
	fetch process and, if so, to detect whether that process
	was hung.

2)	To handle the archive of the interesting.groups directory and
	associated files I created a secondary directory called
	interesting.groups.bak.

3)	run-fetch, prior to launching fetch, does a comparison of
	the contents in interesting.groups and interesting.groups.bak
	and effectively merges the back-up file and whatever might
	exist in interesting.groups.  

4)	This merge avoids attempting to pull the same article body more
	than once by taking advantage of the way fetch will remove the
	original local article (which only has the headers in it).
	run-fetch simply looks through the merged group list and does
	a stat on each of the files listed.  If a file in the group
	list no longer exists, it is not written into the new group
	list.  If the file exists (i.e., the article with just the
	headers is still in the spool) then it is written back to the
	group list.

5)	Prior to finally launching fetch, this new updated copy of the
	group list is then archived into the interesting.groups.bak
	directory.

With my modifications, interesting.groups.bak actually becomes the source
of the "original" group file now.  To handle this change so that one
could still read and select new articles, a small change was made to
the nntpd (leafnode) source.  I basically went in and changed two lines
which were referencing "interesting.groups" to "interesting.groups.bak".
In this manner, if I select new articles for download, even while fetch
is running, they will simply be merged in the next time run-fetch runs.
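
For the record, I made that edit by hand, but something along these
lines ought to catch the same two references in the leafnode 1.8.1
nntpd.c (check the diff against the saved nntpd.c.orig before
recompiling, just in case the substitution touches more than the two
lines described above):

	perl -pi.orig -e 's/interesting\.groups(?!\.bak)/interesting.groups.bak/g' nntpd.c

The (?!\.bak) lookahead only keeps the command from tacking ".bak" on a
second time if it happens to be run twice.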

To install run-fetch, the following needs to happen:

1)	The aforementioned change to nntpd.c needs to be implemented
	(i.e., change the two references for "interesting.groups" to
	"interesting.groups.bak"), the source recompiled and the new
	copy of leafnode installed.

2)	The Perl program run-fetch needs to be installed within the
	leafnode directory hierarchy.  I simply put it under sbin.

3)	User news' crontab entry needs to be modified to call run-fetch
	instead of fetch directly (see the example entry after this list).

4)	The run-fetch Perl program needs to have its variables
	customized to match the needs of the local installation of
	leafnode.  All of these configuration variables are stored at
	the very top of the Perl source.  Customization basically
	amounts to getting the proper pathnames set up for each of the
	various programs and files and adjusting the name of the network
	interface referenced in the variable $interface.

5)	Create the directory interesting.groups.bak under the same
	directory as interesting.groups and adjust the ownership and
	permissions to match that of interesting.groups.
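
For example, with run-fetch installed under /misc/leafnode/sbin as
described above, user news' crontab entry might end up looking
something like this (the hourly schedule is only an example; pick
whatever interval suits your connection):

	# Run run-fetch (which in turn runs fetch) once an hour
	0 * * * *	/misc/leafnode/sbin/run-fetch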
	

Several assumptions are made for this program:

1)	leafnode is configured to be used in a delaybody mode.  I've
	never used it any other way and have no idea how or if it
	would work if that was not the case.

2)	fetch is doing its work over a dynamic dial-up type
	connection.  In my case this is a PPP connection managed by
	diald with an idle timeout of about 10 minutes.  This idle
	timeout is integral to the logic for deciding whether a
	particular fetch process might be hung or not.  It becomes a
	simple matter to decide that fetch is hung when there's an
	active fetch process, but no PPP connection is nailed up.
	When that happens it becomes a safe bet to kill off the
	running fetch process.

3)	The OS hosting leafnode and fetch is Linux.  In essence
	run-fetch performs most of the advisory lockfile detection
	and manipulation before fetch actually has to.	Part of the
	way it verifies if a lockfile is stale or not is by making use
	of the way Linux maintains the /proc filesystem.  With Linux,
	each process has a corresponding directory under /proc.  It then
	becomes a simple matter to know if a PID number in a lockfile is
	active or not by simply looking in /proc for a matching directory.
	If the directory exists, the process is active.  If the directory
	does not exist, the process is dead.

With run-fetch in place, fetch now has fault tolerance where article
body retrieval is concerned.  One thing run-fetch won't do, though,
is provide fault tolerance for article header retrieval.  If fetch is
interrupted, it will still go back to using the article numbers in the
leaf.node/newserver file.

###############################################################################

#!/usr/bin/perl

# run-fetch a Perl frontend to provide some fault tolerance to the
# leafnode fetch program.

# Michael Faurot <mfaurot@xxxxxxxxxxxxxxx>
# Tue Feb 16 15:03:46 EST 1999

# Location for the log file
$logfile="/var/log/news/fetch.log";

# Location for the fetch lock file
$lockfile="/var/lock/news/fetch.lck";

# Location of the fetch executable
$fetch="/misc/leafnode/sbin/fetch";

# Location of the directory interesting.groups
$interestingdir="/misc/leafnode/news/interesting.groups";

# Location of the backup directory for interesting.groups
$backupdir="/misc/leafnode/news/interesting.groups.bak";

# Location of the parent directory for the news articles
$parentdir="/misc/leafnode/news";

# Location of the netstat program
$netstat="/bin/netstat";

# Which interface will fetch be making its connection through?
# E.G., ppp0, sl0.  NOTE: It's assumed the connection is
# one that is initiated on demand.  I don't think this program will
# work properly if the interface is nailed up all the time.
$interface="ppp0";

# Location of the date program
$date="/bin/date";

# Location of the mv program
$mv="/bin/mv";

# Location of the cp program
$cp="/bin/cp";

# Location of the cmp program and how to call it "quietly"
$cmp="/usr/bin/cmp -s";

# Location of the sort program
$sort="/bin/sort";

# Location of the uniq program
$uniq="/usr/bin/uniq";

###############################################################################
# Main section of program
###############################################################################

# Find out what the status of the lock file (if any) for fetch is
$lock = &check_lock;

# If the lock is good (ie, not stale or missing), then check to see 
# if a PPP link is up
if ($lock != 0) {
	
	$ppp = &check_ppp;

	if ($ppp == 1) {
		# There's already a "good" process of fetch active, so we
		# will simply exit and let it go on about its work
		exit (0);

	} else {
		# There's already a fetch process running, but it is
		# presumably "hung" so we'll kill that one off.  We'll
		# sleep for 20 seconds after that just to be sure
		# everything has had time to settle out.
		kill (15, $lock);
		sleep (20);
	}
}

# Check on the status of interesting.groups and interesting.groups.bak
# directories.
&check_interesting;

# Make a back-up copy of the files in interesting.groups that have
# more than zero bytes.
&copy_interesting;

$fetch_retval = &run_fetch;

exit (0);

###############################################################################
# Check to see if there's already a running fetch process
###############################################################################

sub	check_lock {

	local	($pid)	= 0;

	# Check to see if there's even a lock file out there.
	if ( ! -f $lockfile ) {
		return (0);	
	} 

	unless (open (LOCKFILE, $lockfile)) {
		die ("\nERROR: Could not open [$lockfile].\n\n");
	}

	$pid = <LOCKFILE>;
	close (LOCKFILE);

	# Strip the trailing newline (if any) from the PID so it can
	# be used to build the /proc pathname below.
	chomp ($pid);

	# Test to see if the process is still active, by looking
	# for a directory name under /proc with the corresponding PID.
	if ( -d "/proc/$pid" ) {
		return ($pid);
	
	} else {
		return (0);
	}
}

###############################################################################
# Check to see if there's a PPP process nailed up or not
###############################################################################

sub	check_ppp {
	
	local	(@info) = ();
	local	($line) = "";

	unless (open (NETSTAT, "$netstat -i |")) {
		die ("\nERROR: Could not open pipe from [$netstat].\n\n");
	}

	@info = <NETSTAT>;
	close (NETSTAT);

	# Look for a line, from the output of netstat, that begins
	# with $interface.  This will tell us if there's a valid link up
	# or not.
	foreach $line (@info) {

		if ($line =~ /^$interface/) {
			return (1);
		}
	}

	return (0);
}

###############################################################################
# Get down to the meat of running the fetch program.
###############################################################################

sub	run_fetch {

	local	($now)  	= "";
	local	(@args)  	= ();
	local	($retval)	= 0;
	local	($header)	= "";

	$now = `$date`;
	chomp ($now);

	$header = <<"END";


Started at $now
=======================================

END

	unless (open (LOGFILE, ">> $logfile")) {
		die ("\nERROR: Could not open [$logfile] for write.\n\n");
	}

	print LOGFILE $header;
	close (LOGFILE);

	@args = ("$fetch -v >> $logfile 2>&1");	
	$retval = system (@args);

	# Sleep for about a minute after this has completed to allow
	# fetch to finish writing its output into the logfile.
	sleep (60);

	$now = `$date`;
	chomp ($now);

	$header = <<"END";

Exit code from fetch: [$retval]

Finished at $now
========================================

END

	open (LOGFILE, ">> $logfile");
	print LOGFILE $header;
	close (LOGFILE);

	return ($retval);	
}

###############################################################################
# Check to see if there's any interesting files in $backupdir
###############################################################################

sub	check_interesting {

	local ($filename) 	= "";
	local ($pathname) 	= "";
	local ($original) 	= "";
	local ($length) 	= 0;
	local (@array);
	local (@args);

	unless (opendir (BACKUPDIR, $backupdir)) {
		die ("\nERROR: Could not open directory [$backupdir].\n\n");
	}
	
	while ($filename = readdir (BACKUPDIR)) {

		$pathname = "$backupdir/$filename";

		# Only consider actual files (i.e., not directories)
		if ( -f $pathname ) {
			push (@array, $filename);
		}
	}

	closedir (BACKUPDIR);

	# Find out the length of the array.
	$length = @array;

	# If there are no elements in the array (i.e., no backup files
	# were located), then simply return and let execution continue
	# normally.
	if ($length == 0) {
		return (0);
	}

	foreach $filename (@array) {

		# Test to see if the original file has zero bytes.
		# If the original has zero bytes, then we'll simply
		# move the back-up on top of the zero byte file.
		if (-z "$interestingdir/$filename") {
			@args = ($mv, "$backupdir/$filename",
				"$interestingdir/$filename");
			system (@args);

		# If the original filename is not empty (has > zero
		# bytes) then call &merge_files to merge the two
		# together.
		} elsif (-s "$interestingdir/$filename") {
			&merge_files ("$backupdir/$filename",
				"$interestingdir/$filename");
		}

		# After having done one of the two above, we need to validate
		# the new $interestingdir/$filename, but only if it has more
		# than zero bytes.
		#
		# Call &validate to verify that the contents of the new
		# original file are consistent with the contents of the
		# corresponding article directory.
		if (-s "$interestingdir/$filename") {
			&validate ($filename);
		}
	}
}

###############################################################################
# Merge a backup file with an original file
###############################################################################

sub	merge_files {

	local	($backup, $original) 	= @_;
	local	($args)			= "";
	local	($retval)		= 0;
	local	($tmpfile)		= "/tmp/merge.$$";

	# Compare the backup with the original.  If they are both the
	# same then simply remove the backup.
	$args = "$cmp $backup $original";
	$retval = system ($args);

	# print ("Merging [$backup]\n with [$original]\n");
	# print ("Return value from cmp was [$retval]\n\n");

	# If cmp returns 0, then they're both the same and we can
	# simply remove the backup file and then return to the calling
	# function.
	if ($retval == 0) {
		unlink ($backup);
		return;
	}

	# Everything from here on, is to deal with the situation when
	# the original and backup differ.

	# To actually merge the original and the backup together,
	# we'll run both files (together) through a pipeline of sort
	# and uniq.
	$args = "$sort $backup $original | $uniq > $tmpfile";
	system ($args);

	# Now we'll remove the backup file.
	unlink ($backup);

	# Now we'll move the temporary file over top of the original
	$args = "$mv $tmpfile $original";
	system ($args);
}


###############################################################################
# Validate the contents of the group file
###############################################################################

sub	validate {

	local	($filename) 		= @_;
	local	($pathname)		= "";
	local	(@filelist);		
	local	($line)			= "";

	# print ("Validating [$filename]\n");

	# Derive the full pathname of the article directory by
	# changing the dots in the groupname to slashes and then
	# prepending the pathname with $parentdir.
	$pathname = $filename;
	$pathname =~ s!\.!/!g;
	$pathname = "$parentdir/$pathname";

	# print ("Directory pathname: [$pathname]\n");

	# Load the file list into an array.	
	unless (open (GROUPFILE, "$interestingdir/$filename")) {
	      die ("\nERROR: Could not open [$interestingdir/$filename].\n\n");
	}

	@filelist = <GROUPFILE>;
	close (GROUPFILE);

	# Now rewrite the groupfile with only those entries that still
	# exist in the corresponding group directory.
	unless (open (GROUPFILE, "> $interestingdir/$filename")) {
	      die ("\nERROR: Could write to [$interestingdir/$filename].\n\n");
	}

	foreach $line (@filelist) {

		chomp $line;

		# Check to see if this file actually exists.  If it
		# does then write its entry back into the groupfile.
		if ( -e "$pathname/$line" ) {
			print GROUPFILE ("$line\n");
		} 
	}

	close (GROUPFILE);
}

###############################################################################
# Make a copy of files in interesting.groups that have more than 0 bytes
###############################################################################

sub	copy_interesting {

	local ($filename)	= "";
	local ($pathname)	= "";
	local ($args);

	unless (opendir (INTERESTINGDIR, $interestingdir)) {
	      die ("\nERROR: Could not open directory [$interestingdir].\n\n");
	}
	
	while ($filename = readdir (INTERESTINGDIR)) {

		$pathname = "$interestingdir/$filename";
		# print ("pathname [$pathname]\n");

		# Only consider pathnames that have more than 0 bytes
		# and are actually files.
		if ( -s $pathname && -f $pathname ) {
			$args = "$cp $pathname $backupdir/$filename";
			# print ("args [$args]\n");
			system ($args);
		}
	}

	closedir (INTERESTINGDIR);
}
