
Re: [leafnode-list] Filtering revisited.

On Fri, Oct 01, 1999 at 04:29:05AM +0200, Joerg Dietrich wrote:
> On Thu, Sep 30, 1999 at 04:55:40PM -0400, Lloyd Zusman wrote:
> > Filtering is very useful for those of us who use leafnode to manage a
> > small news spool behind a relatively slow connection to an upstream
> > server.  However, the currently available filtering capabilities in
> > leafnode don't allow me to take full advantage of the filtering that
> > would be ideal for me.

> Already here I disagree. Leafnode is a news-server. As such it provides
> service to one or more clients (users). Leafnode is designed to be most
> effective for small leaf sites. These sites are admittedly often dial-up
> boxes used by a single person. This is the only case in which filtering
> by the package providing the news service can be useful. In all other cases
> different users will have different criteria to unselect articles.

I can imagine that in a lot of other cases sufficient communication
between the users is possible to work out some mutually acceptable
arrangement ("You don't read alt.sex.hamsters, so leave it alone." or
"Joe is a real idiot." "Sure is - let's plonk him.").

> 	IMHO filtering should be done with the newsreader and not with
> any tool belonging to the server. The only exception may be the
> aforementioned single-user dial-up box with a *slow* newsfeed.

Slow sounds like most dialup lines :-) .

> 	This brings me to the next point. The lack of speed of fetchnews
> is almost *never* due to a slow internet connection (unless you have a
> 14.4 modem or worse, maybe), but is a consequence of

> (1) the NNTP,
> (2) the way the fetch process is implemented.

The main problem at the minute is in the way the connection is managed in 
Leafnode - it could do very much better than it does.

> I believe that if Leafnode was able to either open multiple connections,
> and/or fetch asynchronously, or use UUCP, filtering would be
> superfluous because the filtering process would consume more time than
> the fetching of the article itself.
> 	Unfortunately we are now at a point where filtering is implemented
> and it has to be supported in future versions of Leafnode.

Unless you're doing seriously heavy filtering, the compute time is
unlikely to cause significant slowdowns.

The main problem in Leafnode is that it doesn't preload the connection.
At present, it sends a command to the server, processes the response,
sends the next command and so on.  Even in the simple, non-filtering
case this is suboptimal - there is a pause between each article being
sent where the connection idles.  A faster system, which several other
news pullers implement, is to send commands in advance, before the server
is ready to read them.  That way you minimise the time the server takes to
respond - it can be processing a command while the previous results are
still being transmitted over the network.

This would produce a speed improvement for all users - it is possible to
saturate the modem with this method.  It would also mean that instead of
making the connection idle any filtering can take place while waiting
for the next batch of data to download which (given the typical relative
costs of bandwidth and CPU) should give plenty of time for even a very
complex set of filters to run.
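
To put rough numbers on this, here is a minimal sketch in Python (not
Leafnode code).  It assumes the upstream server accepts commands queued
ahead of its replies and answers them in order, that server think time is
folded into the round trip, and that every article takes the same time to
transfer.  The function names and the figures used below are made up for
illustration.

```python
def build_pipeline(message_ids):
    """Join several NNTP ARTICLE commands into one write.

    Assumption: the server reads queued commands and replies in order.
    """
    return "".join(f"ARTICLE {mid}\r\n" for mid in message_ids).encode("ascii")


def sequential_time(n_articles, rtt, transfer):
    """Send a command, wait, read the article, repeat.

    Every article pays a full round trip during which the line idles.
    """
    return n_articles * (rtt + transfer)


def pipelined_time(n_articles, rtt, transfer):
    """Commands are sent ahead, so only the first round trip is exposed."""
    if n_articles == 0:
        return 0.0
    return rtt + n_articles * transfer


if __name__ == "__main__":
    # Illustrative figures only: 100 articles, 0.3 s round trip, 1 s each.
    print(sequential_time(100, 0.3, 1.0))
    print(pipelined_time(100, 0.3, 1.0))
```

With those made-up figures the model gives about 130 s fetching one
article at a time against about 100.3 s with commands sent ahead; the
difference is time the line spends idle.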

[I'm probably going to repeat myself rather a lot from here on]

> > (2)  Allow for the current regular-expression matching to be wrapped
> >      by some sort of new filtering language that implements nesting,
> >      boolean logic, etc.

> Sounds very slow. Exactly the opposite of what you want if you filter
> with fetchnews.

With the usual scoring mechanisms (actually, with Gnus' which is the
one I use) it's a trivial computation - match on the regexp and then add
an appropriate value to the score.  If after applying all filters the
score is lower than a certain value, discard the article.  It's not
really very much worse than the current method, and given the relative
speeds of most dialups and most CPUs the CPU has plenty of otherwise
idle time while the data is downloaded.  If the compute time does become
a problem, use less complex filters.  If you don't use any filters the
effect of any filtering code should be marginal as it stands and
completely unnoticeable with prefetching.
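
The scoring scheme described above can be sketched roughly as follows.
This is an illustration in Python, not Gnus' or Leafnode's actual code;
the rules, the addresses and the threshold are all hypothetical.

```python
import re

# Hypothetical rules: (header name, compiled pattern, score adjustment).
RULES = [
    ("Subject", re.compile(r"make money fast", re.IGNORECASE), -1000),
    ("From", re.compile(r"joe@example\.invalid"), -500),
    ("Subject", re.compile(r"leafnode", re.IGNORECASE), 200),
]

# Hypothetical cut-off: articles scoring below this are discarded.
THRESHOLD = 0


def score(headers, rules=RULES):
    """Add up the adjustment of every rule whose regexp matches."""
    total = 0
    for name, pattern, delta in rules:
        if pattern.search(headers.get(name, "")):
            total += delta
    return total


def keep(headers, threshold=THRESHOLD, rules=RULES):
    """Keep the article unless its total score is below the threshold."""
    return score(headers, rules) >= threshold
```

Each article costs one regexp search per rule plus an addition, which is
why the work stays small unless the rule list gets very long.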

If you're not using a dialup you probably don't care that much either
way - if it takes so long to download with leafnode that it's a problem,
you probably want to be using something more heavy-duty anyway.

> > (3)  Allow for pluggable, optional filtering modules that could
> >      be written in C and dynamically linked with the leafnode
> >      executables.  A library of convenience routines could be
> >      supplied to aid the writers of these optional modules
> >      to make it easier to do things like locating and extracting
> >      information about headers, article size, newsgroups, etc.  This
> >      would allow the leafnode administrators to create arbitrarily
> >      sophisticated filtering mechanisms that run relatively quickly.

> This sounds interesting. It would leave the true server/fetchnews code
> intact and would allow everybody to do what he wants to do. If any
> filtering at all, then this is IMHO the method of choice. The existing
> filtering could be used as a module and everybody would be happy :-)

You'd probably want a standard set and some idiot-proof way of
configuring them - one of the really nice things about Leafnode right
now is that it's utterly simple to get working.  You would probably also
end up shipping a standard module doing whatever filtering people want
anyway, so the only benefit would be to make it slightly easier to add
new filtering mechanisms - that is, to expose more of an internal interface.

> > (4)  Allow for some sort of embedded scripting language to be
> >      optionally built into leafnode ... python comes to mind, as does
> >      perl.  This would allow for scripts to be written that serve the
> >      same function as the pluggable, dynamically linked modules I
> >      described above in (3).  This would run a bit slower than the
> >      approach that uses the dynamically linked C modules, but it would
> >      probably be easier to use.

> Again, filtering is for slow connections. If the method of filtering is
> slow using this filter is pointless.

Perl is the standard language for writing filters with INN - it's what
cleanfeed uses, for example.  There's been a fair amount of work put
into making it really fast, but most of this is not really needed for
most sites, particularly not those dealing with the data rate of a
dialup.  Remember that Perl is compiled at startup, so if you keep the
same copy of the script running throughout the download the overhead is
pretty low.

I've no idea how easy it would be, but it might be nice to be able to
interface with INN filtering modules (I speak from a position of almost
total ignorance of implementation details).

In any case, filtering doesn't need to be fast - it just needs to be
faster than the download.
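
As a back-of-envelope check of that claim, here is a small Python sketch.
It assumes the only cost that matters is raw line rate, ignoring modem
compression, framing overhead and latency; the function names and the
example figures are invented for illustration.

```python
def download_seconds(article_bytes, line_bits_per_second):
    """Time to pull one article over the line at its raw bit rate."""
    return article_bytes * 8 / line_bits_per_second


def filter_keeps_up(filter_seconds, article_bytes, line_bits_per_second):
    """True if filtering one article takes no longer than downloading one."""
    return filter_seconds <= download_seconds(article_bytes, line_bits_per_second)


if __name__ == "__main__":
    # Illustrative figures only: a 3000-byte article on a 33600 bit/s modem.
    print(download_seconds(3000, 33600))
    print(filter_keeps_up(0.01, 3000, 33600))
```

Under those assumptions a 3000-byte article takes roughly 0.7 s to arrive
at 33600 bit/s, so a filter that needs a hundredth of a second per article
has plenty of slack.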

> > At any rate, these four things come to mind, but I'm sure that there
> > also are other possibilities which could be just as useful.

> To emphasize it again: If you ask me increasing the speed of fetchnews
> is the top priority task in leafnode development. Most users simply
> won't need filtering at all anymore, if fetchnews is as fast as UUCP or
> suck with multiple connections. 

I agree with the prioritising, although I do think that filtering is a
very useful feature - I've had my disk flooded by junk which was
immediately killed by Gnus' automatic scoring before now, and filtering
is the sort of feature that gives me the warm and fuzzies.

> other parts of the Leafnode code (IIRC). My suggestion, therefore, would
> be to have a new beta version ASAP that includes rnews and whatever
> Cornelius has done since then.

> 	Cornelius, could you give a brief summary of what you've been
> doing and what the current stage of leafnode development is? I hope I
> don't ask for too much. I know that you have life besides leafnode and
> RfC's and I guess you're buried alive in patches.

Heh.  I have to agree a snapshot would be nice, though - if nothing else,
it would probably mean that those patches you do get resemble more
closely the code you're working with currently and would hopefully be
easier to evaluate.

You know you want to make an anonymous CVS repository :-) .

Mark Brown  mailto:broonie@xxxxxxxxxxxxxxx   (Trying to avoid grumpiness)
EUFS        http://www.eusa.ed.ac.uk/societies/filmsoc/
