Hi,
Just joined this site and plan on using your API for my forum. Thanks for providing this amazing service!
Is the API working correctly?
Example IP: 78.157.143.242
When I check this page:
http://www.stopforumspam.com/search?q=78.157.143.242
It says:
Found 42 entries for 78.157.143.242
When I check via the API using this syntax:
http://www.stopforumspam.com/api?ip=78.157.143.242
I get back:
<response success="true">
<type>ip</type>
<appears>yes</appears>
<lastseen>2008-08-24 11:04:29</lastseen>
<frequency>1</frequency>
</response>
Shouldn't frequency in the XML above be 42?
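For anyone wanting to read those fields programmatically, here is a minimal sketch that pulls appears, lastseen and frequency out of the response shown above. It uses a plain regex only to stay dependency-free, on the assumption that the response stays flat like this sample; the helper name field() is made up for illustration, and a real client would more likely use a proper XML parser.

```perl
use strict;
use warnings;

# Sample response copied from the post above.
my $xml_text = <<'XML';
<response success="true">
<type>ip</type>
<appears>yes</appears>
<lastseen>2008-08-24 11:04:29</lastseen>
<frequency>1</frequency>
</response>
XML

# Hypothetical helper: return the text of one flat <tag>value</tag>
# element, or undef if the tag is not present.
sub field {
    my ( $text, $tag ) = @_;
    my ($value) = $text =~ m{<\Q$tag\E>([^<]*)</\Q$tag\E>};
    return $value;
}

my %result = (
    appears   => ( field( $xml_text, 'appears' ) || '' ) eq 'yes' ? 1 : 0,
    lastseen  => field( $xml_text, 'lastseen' ),
    frequency => field( $xml_text, 'frequency' ) || 0,
);

printf "appears=%d lastseen=%s frequency=%d\n",
    @result{qw(appears lastseen frequency)};
```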
Offline
I was wondering the same thing actually! I am developing a Perl interface for the API and would like to make use of these figures but I am waiting to hear back as to the relation and how they are calculated.
I must admit, I was expecting to see the frequency match the number of entries in the database, but maybe this is not the intended case?
Hopefully someone will enlighten me soon?
Offline
I tracked down the bug to an error in the MySQL query. The frequency should be correct now, sorry for the trouble.
Offline
Thanks Russ - confirmed - frequency is correct now. Great job!
Offline
andypatmore wrote:
I was wondering the same thing actually! I am developing a Perl interface for the API and would like to make use of these figures but I am waiting to hear back as to the relation and how they are calculated.
I didn't (yet) get as far as including frequency or age of last report, but I did bring up a rudimentary PERL interface for the API a few days ago that's up and running on my YABB 2.1 board. Here's the code -- perhaps it will save you a step or two.
#! /usr/bin/perl
#
# ----------------------------------------------------------------------
#
# SpammerChk -- A PERL module for querying the StopForumSpam API
# KEG -- 21 August 2008 -- Version 1.0.0
#
# Usage:
# use SpammerChk;
# $result = isSpammerIP(<IP address>);
# $result = isSpammerEmail(<email address>);
# $result = isSpammerUser(<username>);
#
# $result is either true or false
#
# Purpose:
# This was written to allow dynamic checking of attempted YABB
# Bulletin Board registrations against spammer reports in the
# StopForumSpam database. The YABB software includes separate
# sections for checking against the local ban lists for IP, email,
# and username. This module was written to allow a final remote
# check against reported spammers for each section.
#
# ----------------------------------------------------------------------
package SpammerChk;

use strict;
use warnings;

require Exporter;
our @ISA    = qw(Exporter);
our @EXPORT = qw(isSpammerIP isSpammerEmail isSpammerUser);

use LWP::UserAgent;
use URI::Escape qw(uri_escape);
use XML::Simple;

my $url = 'http://www.stopforumspam.com/api';

# Each check returns true if the value appears in the database.
sub isSpammerIP    { return querySpammer( 'ip',       shift ); }
sub isSpammerUser  { return querySpammer( 'username', shift ); }
sub isSpammerEmail { return querySpammer( 'email',    shift ); }

sub querySpammer {
    my ( $field, $value ) = @_;

    # Escape the value so that characters such as '+', '&' or spaces in
    # an email address or username cannot corrupt the query string.
    my $reqURL = $url . '?' . $field . '=' . uri_escape($value);

    my $ua = LWP::UserAgent->new;
    $ua->agent("SpammerChk/1.0.0");
    $ua->from('webmaster@yourdomain');
    $ua->max_size(8192);

    # Treat network or HTTP errors as "not a known spammer".
    my $response = $ua->get($reqURL);
    return "" unless $response->is_success;

    # XMLin dies on malformed XML, so trap that and report "no match".
    my $data = eval { XML::Simple->new->XMLin( $response->content ) };
    return "" unless ref($data) eq 'HASH';

    return defined $data->{'appears'} && $data->{'appears'} eq 'yes';
}

1;
Last edited by keg (2008-08-24 7:40 pm)
Offline
keg wrote:
I didn't (yet) get as far as including frequency or age of last report, but I did bring up a rudimentary PERL interface for the API a few days ago that's up and running on my YABB 2.1 board. Here's the code -- perhaps it will save you a step or two.
As an afterthought, I might as well toss in a small bit of PERL code to take Russ's banned IP CSV file and write it to standard output as a set of one-per-line "deny from" statements suitable for inclusion in a .htaccess file. This is a rudimentary filter for offline use, not something intended as forum code.
One weakness is that the IP file contains only the 12,000+ IPs as a single line. There's no simple way of filtering the IPs in the file based on time of last observation. It would be possible to go through the list of IPs querying the API (with some suitable pause between queries), but that would still be a lot of queries.
The motivation has already been brought up by others. A lot of IPs are dynamic and will be reassigned after a couple of weeks. Thus, apart from some IP sections that one can afford to "waste", IP bans via .htaccess need to be refreshed periodically.
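#! /usr/bin/perl
use strict;
use warnings;

my $inpfile = "bannedips.csv";
open my $inp, '<', $inpfile or die "\n *** Can't open $inpfile: $! *** \n";
my $ipstr = <$inp>;
close $inp;
defined $ipstr or die "\n *** $inpfile is empty *** \n";

# The file is a single comma-separated line. Strip the trailing line
# ending so the last address does not carry it into the output.
$ipstr =~ s/\s+\z//;
my @ips = grep { length } split /\s*,\s*/, $ipstr;
@ips = sort cmpip @ips;

print "deny from $_\n" for @ips;

# Compare two dotted-quad addresses octet by octet, numerically.
sub cmpip {
    my @ip1 = split /\./, $a, 4;
    my @ip2 = split /\./, $b, 4;
    for my $i ( 0 .. 3 ) {
        my $test = $ip1[$i] <=> $ip2[$i];
        return $test if $test;
    }
    return 0;
}
Offline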
Hi Keg
Thanks! My module is working and can check for either single or multiple entries (well, 1, 2 or 3! LOL!), so far so good. My final touches would be to add a "severity" marker to the output, but I am currently trying to come up with a reasonable scoring system.
The way I see it, an old (6 months or more), unique (frequency of 1) entry is less severe than a new (within 24 hours), common (more than 1) entry.
Offline
andypatmore wrote:
Hi Keg
The way I see it, an old (6 months or more), unique (frequency of 1) entry is less severe than a new (within 24 hours), common (more than 1) entry.
Okay, here's a long-winded statistical approach.
Let's assume that you have a block of IP addresses in which some can be randomly considered to be spammers. Let's assume that you know that IP address A is a spammer at t=0. What you're in effect asking is the probability that A is still a spammer at some later time t1. Your information about A is lost with some relaxation time tau, related to reassignment of IP addresses in the dynamic block. Then you would say that your information decays exponentially to the background probability as exp(-t/tau). When it reaches some threshold, you consider it to be zero.
You might actually know something about how many IPs in the total block are spammers. Thus, given any random, individual IP, it would have this background probability of being a spammer. Statistically, you should ban random IPs with this probability, but that would be somewhat impolite to the innocent. Thus it's safer to err on the polite side and assume that all IPs are not spammers until proven otherwise. Thus you assume the background probability is zero. The IP A will decay to this background over some multiple of tau.
If you can estimate how often dynamic IPs are reset, then you have a means of estimating tau. If an IP is static, tau may be a very long time, more closely tied to system replacement. Tau could certainly vary by top-level domain, or because you simply want a factor of caution for some top-level domains. You could also apply the concept to entire blocks of IPs for blocks where you can afford to punish the innocent based on the behavior of some within the block.
Frequency at time of report (which actually isn't frequency in the sense of a rate of reporting, but a count integrated over some time) may not be a significant consideration six months later. If the compromised machine is assigned a different IP in the interim, it doesn't matter how prolific its spamming was on the old IP. What the count does help with is the certainty (i.e. reliability) of determination. One report could be a typo or a fluke (i.e. reports have some finite level of error). A dozen reports are pretty definite. The other thing the count might help with, however, is the consequences of making the wrong estimate and thus how cautious you want to be in your rating. You could, for example, increase tau by the log of the number of reports (ln or log_10).
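To make the decay idea concrete, here is a rough sketch of exp(-t/tau) and of how long it takes to fall to a chosen threshold. The 14-day tau and the 0.01 threshold are placeholder assumptions for illustration only, and the function names are made up.

```perl
use strict;
use warnings;

# Fraction of the original information left after $t days, given a
# relaxation time of $tau days: exp(-t/tau).
sub decay {
    my ( $t, $tau ) = @_;
    return exp( -$t / $tau );
}

# Days until the decayed value falls to $threshold (0 < threshold < 1).
# Solving exp(-t/tau) = threshold gives t = -tau * ln(threshold).
sub days_to_threshold {
    my ( $tau, $threshold ) = @_;
    return -$tau * log($threshold);
}

my $tau       = 14;      # assumed turnover time in days
my $threshold = 0.01;    # assumed cut-off below which we treat it as zero

printf "%.3f\n", decay( 14, $tau );                        # 0.368
printf "%.1f\n", days_to_threshold( $tau, $threshold );    # 64.5
```

A static IP would simply get a much larger tau, which stretches both numbers proportionally.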
Don't know if the above helps or simply (randomly) stirs the mud.
Last edited by keg (2008-08-24 9:35 pm)
Offline
Blimey! Heavy stuff! I kinda lost it after you said "Let's assume..." LOL!
One factor that I did think about was this. If I "found" the following in the database...
username: andy
email: me@me.com
ip: 10.20.30.40
...they could be, but are probably not from the same record! Unless we fire multiple queries at the API or are fed back not only the frequency but the username, email and ip of every entry of each one, we have to use, as you pointed out, probability.
Offline
One could mine various sorts of statistics from the database. One would be the distribution for the time-span over which a U.S.-based IP gets used to spam.
The simultaneous query for IP, email, and username would be hard to scale for time, unless Russ is returning essentially three query results that have the time last seen and the frequency for each part. I haven't looked at this, but it was one reason I stuck with separate queries to start. In general, I don't do all three items, but stop with the first part that scores a hit.
Here's an example of "frequency" and time scaling I was talking about. Let's assume an IP address had 6 reports, was last seen 30 days ago, and I use a baseline value of 14 days for exponential decay. I might assign the following probability, if I assumed that a single spam report had 75% reliability or (1-0.75) error probability.
prob(spammer) = (1.-(1.-.75)^6)*exp(-30./(ln(1.+6.)*14.))
The uncertainty in the pre-exponential term washes out after a few reports (i.e. it rapidly goes to one). The frequency term also scales the decay constant in the exponential part from 14 days to around 27 days. Thus prob(spammer) = 0.33 at 30 days and you'd still want to ban it. At 180 days prob(spammer) = 0.0013.
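As a sanity check on the numbers above, here is the same equation as a small Perl function. The function and parameter names are made up for this sketch; the 0.75 reliability and the 14-day baseline are the values from the example. Note that Perl uses ** for exponentiation and log() for the natural log.

```perl
use strict;
use warnings;

# prob(spammer) = (1 - (1 - reliability)**reports)
#                 * exp(-age / (ln(1 + reports) * base_tau))
sub prob_spammer {
    my ( $reports, $age_days, $reliability, $base_tau ) = @_;
    return 0 if $reports < 1;    # no reports: nothing to score

    my $certainty = 1 - ( 1 - $reliability )**$reports;
    my $tau       = $base_tau * log( 1 + $reports );
    return $certainty * exp( -$age_days / $tau );
}

printf "%.2f\n", prob_spammer( 6, 30,  0.75, 14 );    # 0.33
printf "%.5f\n", prob_spammer( 6, 180, 0.75, 14 );    # 0.00135
```

The second value is the 180-day figure quoted above, carried to one more digit.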
Offline
Well, so far so good! I have implemented the above formula in Perl (other languages should be similar) and have taken it to 10 decimal places.
(1-(1-0.75)^$dbfrequency)*exp(-$diff/(log(1+$dbfrequency)/log(10)*14)))
Where $dbfrequency is the frequency value for the found entry and $diff is the "age" of the entry in days (shown as a decimal if less than 24 hours to 3 decimal places). With a frequency of 1, the severity drifts off to zero after around 96 days and I have just checked a new entry added about 2 minutes ago and it gave me a severity of 2.0374741833.
I am worried about false values due to any time difference between the user's server and the Stop Forum Spam server. For example, this server is in GMT -6:00 and my server is in GMT -8:00. So a new entry added at 01:00 GMT will show 20:00 GMT -6:00 on this server, but my server will be showing 18:00 GMT -8:00. If it proves to be a problem, I can add in a server offset value to bring the user's server time in line with this one.
If more than one entry is found (username & IP for example) it will return an average of the severity values. Also, the "call" to this module receives both a true/false AND the severity, so either may be used. The preferred method would be to request a "tri-test" (all three) though single or any pair may also be requested.
Once I have carried out a few more tests and written some instructions, I will give Russ the details of where it may be found for people to download.
Offline
A few comments:
I was writing in pseudo-code. To be correct for Perl, replace the caret by ** (the exponentiation operator); in Perl the caret is bitwise XOR. The number returned by the equation should always be between zero and one, inclusive.
So, within the exponential it looks like you decided to scale the "frequency" using log_10 rather than log_e (the latter being just log()). That simply means that the increase in decay time with reports will be smaller by a factor of 2.303 -- i.e. having fewer than 9 reports would reduce your decay time to less than 14 days. Since dividing by log(10) is a constant factor, you might just as well adjust the base decay time instead. There's nothing sacred about 14 days; that was just an off-the-top-of-my-head estimate of a reasonable turnover rate. Or, you may just want to drop the division by log(10) and keep the 14.
With a decay time of 14 days, you can probably ignore time shifts as long as you ensure that you don't let $diff be less than zero. e.g.
$diff = ($diff >= 0.0) ? $diff : 0.0;
At worst, you're then simply letting the decay to threshold happen a day early.
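Here is one way the age calculation might look with both the server offset idea and the clamp folded in. This is only a sketch: it assumes the lastseen string is in the reporting server's local time and that you know that server's offset from GMT, neither of which has been confirmed here. The function name and parameters are made up for illustration.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Age of a report in days, never negative.
#   $lastseen       "YYYY-MM-DD HH:MM:SS" as returned by the API
#   $server_offset  assumed offset of the reporting server from GMT, in
#                   hours (e.g. -6 for GMT -6:00)
#   $now            current time in epoch seconds (defaults to time())
sub age_in_days {
    my ( $lastseen, $server_offset, $now ) = @_;
    $now = time() unless defined $now;

    my ( $y, $mo, $d, $h, $mi, $s ) =
        $lastseen =~ /^(\d{4})-(\d\d)-(\d\d) (\d\d):(\d\d):(\d\d)$/
        or return undef;

    # Read the fields as if they were GMT, then remove the server's
    # offset to get the true epoch time of the report.
    my $seen = timegm( $s, $mi, $h, $d, $mo - 1, $y ) - $server_offset * 3600;

    my $diff = ( $now - $seen ) / 86400;
    return $diff >= 0.0 ? $diff : 0.0;
}

# Example with a fixed "now" so the result is reproducible.
my $now = timegm( 29, 4, 11, 25, 7, 2008 );    # 2008-08-25 11:04:29 GMT
printf "%.3f\n", age_in_days( '2008-08-24 11:04:29', 0, $now );    # 1.000
```

Because everything is converted to epoch seconds before subtracting, the local timezone of the machine running the check does not enter into it.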
If you have multiple reports, the probability way of merging them would be to subtract the result of the equation from one for each report (i.e. calculate the probability of the person not being a spammer from each report), multiply those together, then subtract that product from one to get a merged probability of being a spammer.
For example, if the equation gave the probability of being a spammer as 0.2, 0.7, and 0.1 for IP, email, and user respectively, the merged probability of being a spammer would be
prob(spammer,ip,email,user) = 1.0 - (1.0-0.2)*(1.0-0.7)*(1.0-0.1) = 0.784.
Note that each nonzero report reduces the probability of the person not being a spammer. The smaller that becomes, the larger the probability that they are a spammer.
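In Perl the merge could be sketched like this. The function name is made up, and multiplying the "not a spammer" probabilities together assumes the three checks are independent of one another.

```perl
use strict;
use warnings;

# Merge per-field probabilities: 1 - product of (1 - p) over all fields.
sub merge_probabilities {
    my @probs = @_;
    my $not_spammer = 1.0;
    $not_spammer *= ( 1.0 - $_ ) for @probs;
    return 1.0 - $not_spammer;
}

printf "%.3f\n", merge_probabilities( 0.2, 0.7, 0.1 );    # 0.784
```

For comparison, a plain average of the same three values would give 0.333, so averaging lets a weak hit dilute a strong one, while the merge can only go up as hits are added.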
Offline
Doh! Of course it is! Silly me!
Right, I have changed that and incorporated your "probability merge" idea too. The more I look at it the clearer it is becoming. When I first read your original reply, I was wondering where the "translate" button was! LOL!
As far as the 14 day delay threshold is concerned, am I reading this right that the probability reaches zero after 2 weeks? If this is the case, I may make this a little longer or I may pass the "type" to the severity calculation routine and set different values according to type with username at 14 days, email at 28 and IP at 48.
Anyway, thanks for the advice so far. If you have any further comments feel free!
Offline
andypatmore wrote:
Right, I have changed that and incorporated your "probability merge" idea too. The more I look at it the clearer it is becoming. When I first read your original reply, I was wondering where the "translate" button was! LOL!
The translate button is there, it's just masquerading under the label of "iterate a few times".
andypatmore wrote:
As far as the 14 day delay threshold is concerned, am I reading this right that the probability reaches zero after 2 weeks? If this is the case, I may make this a little longer or I may pass the "type" to the severity calculation routine and set different values according to type with username at 14 days, email at 28 and IP at 48.
Anyway, thanks for the advice so far. If you have any further comments feel free!
What the 14 days implies is that the probability will decrease by a factor of (1/e)=0.368 every 14 days, ignoring the frequency scaling. The log(1+$dbfrequency) just makes that a bit longer depending on the original severity of the spamming. So it's the "e-folding time", not the time to effective zero. Four e-folding times (56 days) brings you to about 0.018 of the original probability; six e-folding times (84 days) to about 0.0025. Six spam reports would give an increase factor of ln(1+6) = 1.95, making a 27 day e-folding time. Having 100 spam reports would give an increase factor of 4.6; 1000 would be a factor of 6.9. Using the log keeps it within reasonable bounds while still factoring in a caution extension based on how heavily the ip, email, or username was initially used.
And you're right -- if you assume that different types of reports have different time distributions of use, then you could use different base decay times for each type.
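A sketch of how per-type base decay times might plug in, using the 14/28/48 day values you suggested. The hash and function names are made up for illustration, and the numbers are only the tentative ones from this thread.

```perl
use strict;
use warnings;

# Base decay (e-folding) times in days per report type, as proposed above.
my %base_tau = ( username => 14, email => 28, ip => 48 );

# Effective e-folding time: base time stretched by ln(1 + reports).
# Returns undef for an unknown type or for fewer than one report.
sub efolding_time {
    my ( $type, $reports ) = @_;
    return undef unless exists $base_tau{$type} && $reports >= 1;
    return $base_tau{$type} * log( 1 + $reports );
}

printf "%.1f\n", efolding_time( 'username', 6 );    # 27.2

# Fraction of the original probability left after n e-folding times.
printf "%.3f\n", exp(-4);    # 0.018
printf "%.4f\n", exp(-6);    # 0.0025
```

The guard on the report count matters because ln(1 + 0) is zero, which would otherwise lead to a division by zero in the exponent.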
Last edited by keg (2008-08-26 11:34 pm)
Offline
Sorry to get back on topic, but...
If I submit invalid data to the checker functions of the API (say, an email address of "no", an IP address of 9, or a blank username), I get what looks like broken XML (namely, an incomplete tag composed of "false"> after a <response success="true"> tag). I'm assuming that this is a bug?
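In the meantime, a client could avoid the problem by not sending obviously malformed values in the first place. The checks below are a deliberately loose sketch with made-up function names, not a statement of what the API itself accepts.

```perl
use strict;
use warnings;

# Four dot-separated numbers, each in the range 0..255.
sub looks_like_ipv4 {
    my ($ip) = @_;
    return 0 unless defined $ip;
    my @octets = $ip =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/
        or return 0;
    my $too_big = grep { $_ > 255 } @octets;
    return $too_big ? 0 : 1;
}

# Very loose: something@something.something with no whitespace.
sub looks_like_email {
    my ($email) = @_;
    return defined $email
        && $email =~ /^[^\s\@]+\@[^\s\@]+\.[^\s\@]+$/ ? 1 : 0;
}

# Anything containing at least one non-whitespace character.
sub looks_like_username {
    my ($user) = @_;
    return defined $user && $user =~ /\S/ ? 1 : 0;
}

# The three invalid inputs mentioned above are all rejected.
print looks_like_email('no'),     "\n";    # 0
print looks_like_ipv4('9'),       "\n";    # 0
print looks_like_username(''),    "\n";    # 0
```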
Offline
@electrictdustcart - yes I experienced the same thing. Should we start a new thread to discuss this as this one seems to have been thread-jacked?
Offline