
[ANNOUNCE] Scoring capabilities within the filter file.



I have patched the leafnode-1.9.2 source code to add some simple scoring
capabilities to the filter file processing.  I will send these patches
in a second piece of email which follows this one.

I have added the capability to recognize and process scoring
lines within the filter file (the syntax and usage of these scoring
lines is described below).  This is backwards compatible: if no
such scoring lines exist, then the filter file will be interpreted
exactly as it is in previous leafnode versions.
    
However, if one or more scoring lines appear within the filter file,
they will be interpreted as follows:
    
  [score:N]
  [score:=N]
  [score::N]
  [score::=N]
  normal regexp entries
  ...
  ... etc. ...
  ...   
            
... where N is a positive or negative integer.  When a regexp entry
under a scoring line matches, N is either added to the current score
(in the forms without the equal sign) or the current score is set to N
(in the forms with the equal sign).
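
To make the four forms concrete, here is a rough Python sketch of how
such a line could be recognized.  This is not the patched leafnode code
(that is C); the names SCORE_LINE, ScoreRule and parse_score_line are
made up for illustration, and the tolerance for spaces is an assumption
based on forms like "[score: +6]" in the example further down.

```python
import re
from typing import NamedTuple, Optional

# Assumed grammar: "[score", one or two colons, optional "=", signed
# integer, "]".  Optional spaces are tolerated (an assumption).
SCORE_LINE = re.compile(r"^\s*\[score(::?)\s*(=?)\s*([+-]?\d+)\s*\]\s*$")


class ScoreRule(NamedTuple):
    final: bool   # True for "::": stop all testing after a match
    assign: bool  # True for "=": set the score instead of adding to it
    value: int    # the integer N


def parse_score_line(line: str) -> Optional[ScoreRule]:
    """Return a ScoreRule for a scoring line, or None for any other line."""
    m = SCORE_LINE.match(line)
    if m is None:
        return None  # not a scoring line; treat it as a normal regexp entry
    colons, eq, num = m.groups()
    return ScoreRule(final=(colons == "::"), assign=(eq == "="), value=int(num))
```

Anything that does not parse as a scoring line falls through as an
ordinary regexp entry, which is what keeps old filter files working.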
                        
With a single colon, a matched regexp causes the rest of the regexps
in the current scoring block to be ignored, and testing continues with
the next scoring block.  With a double colon, once a regexp matches, no
more regexps are tested and the current score becomes the final score
of the article being tested.

Each scoring line governs the regexp entries that follow it, up to the
next scoring line, whose rules then supersede it.
        
Any article whose score is >= 0 will be accepted, and any article with
a score < 0 will be rejected.  Each article's score is set to 0 by
default before scoring begins.

If regular expressions appear before any scoring line, they are treated
as if they were preceded by [score:=-1].  This is equivalent to the
earlier filter file behavior, in which any match rejects the article,
and it is what makes this algorithm backwards compatible.
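
The rules above can be sketched in a few lines of Python.  Again, this
is only an illustration and not the patch itself: score_article,
accepted and the tuple layout are hypothetical, Python's re module
stands in for the regexp library leafnode uses, and I am assuming that
each regexp is tried against every header line of the article.

```python
import re


def score_article(headers, blocks):
    """Return the score of one article.

    headers: the article's header text, one header per line.
    blocks:  list of (final, assign, value, patterns) tuples, one per
             scoring line, in filter file order.
    """
    score = 0  # every article starts at 0
    for final, assign, value, patterns in blocks:
        # The first matching regexp decides the block; the rest are skipped.
        if any(re.search(p, headers, re.MULTILINE) for p in patterns):
            score = value if assign else score + value
            if final:  # "::" form: stop testing entirely
                break
    return score


def accepted(headers, blocks):
    """An article is accepted when its score is >= 0."""
    return score_article(headers, blocks) >= 0


# A filter file with no scoring lines behaves like one [score:=-1] block.
legacy = [(False, True, -1, [r"^Subject:.*MAKE MONEY FAST"])]
print(accepted("Subject: hello\n", legacy))            # True
print(accepted("Subject: MAKE MONEY FAST\n", legacy))  # False
```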
    
For example, consider the following hypothetical filter file:

   [score:: =1] 
   ^Newsgroups:.*alt.hippo.potamus

   [score: -10]
   .

   [score: +6]
   ^Subject:.*food
   [score: +6]
   ^Subject:.*hippopotamus
   [score: +6]
   ^Subject:.*telekinesis
  
The first section gives any article posted to alt.hippo.potamus a score
of 1 and stops all further processing of that article.  This guarantees
that we will always download all articles posted to this newsgroup,
since 1 is greater than or equal to zero.
  
The second section causes all remaining articles to have -10 added to
their score.  Since these articles all start out with a default score
of 0, this causes all of them to have a -10 score.

Because we didn't use the '::' form of scoring line here, all of these
articles are then processed further by the third through fifth
sections.  Here, 6 points are added to the score for each of the
following words appearing in the subject:  "food", "hippopotamus", and
"telekinesis".  As a result, we download all articles that have at
least two of these words in their subjects, since only then does the
score reach a value greater than or equal to zero (-10 + 6 + 6 = 2).
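
The arithmetic for this example can be checked with a tiny Python
sketch.  The function example_score is hypothetical and hard-codes the
five sections of the example file rather than parsing anything; it
takes as input whether the first section matched and how many of the
three subject regexps matched.

```python
def example_score(in_hippo_group: bool, subject_hits: int) -> int:
    """Score of an article under the example filter file.

    in_hippo_group: True if the first section's regexp matched.
    subject_hits:   how many of the three Subject regexps matched (0-3).
    """
    if in_hippo_group:
        return 1                   # score set to 1, processing stops
    return -10 + 6 * subject_hits  # "." always matches, then +6 per hit


for hits in range(4):
    score = example_score(False, hits)
    print(hits, score, "accept" if score >= 0 else "reject")
# 0 -10 reject
# 1 -4 reject
# 2 2 accept
# 3 8 accept
```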

This is a simple and rather primitive scoring algorithm, but it's
powerful enough to handle many useful cases.

I've done some preliminary testing on this, and it seems to work.  But
please test this and hack on it yourselves, and please let me know any
problems you might encounter or feedback you may have.

Thanks!

The appropriate patches appear in my next email message.

-- 
 Lloyd Zusman
 ljz@xxxxxxxxxx

-- 
leafnode-list@xxxxxxxxxxxxxxxxxxxxxxxxxxxx -- mailing list for leafnode
To unsubscribe, send mail with "unsubscribe" in the subject to the list