#1 2012-02-27 9:07 pm
- pedigree
- uıɐbɐ ʎɐqǝ ɯoɹɟ pɹɐoqʎǝʞ ɐ buıʎnq ɹǝʌǝu ɯ,ı
- From: New Zealand
- Registered: 2008-04-16
- Posts: 7,104
New "Confidence" Feature is Now Live
I've just added a new feature to the API, called confidence. Using some serious maths, the API will now report a score, as a floating-point percentage, for how likely any field is to belong to a spammer. It's not an exact science, but it will give you a better idea of the data and how to handle it.
It is only supported with the serial/XML/JSON output formats and is not available in the standard API call. If you append &f=json (or XML/serial etc.) to each call, you will see a new field called 'confidence'.
eg
http://www.stopforumspam.com/api?ip=1.2.3.4&f=json
will show
{"success":1,"ip":{"lastseen":"2012-01-31 16:17:22","frequency":35,"appears":1,"confidence":54.12}}
This example shows that the chance of this IP being used to spam or post unsolicited commercial adverts, based on reported values vs days since we last saw it, is 54%. It's an estimate based on the Wilson scoring function.
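A minimal sketch of reading the new field in Python. The URL and the response body are the ones quoted above; the helper names `build_url` and `extract_confidence` are made up for illustration, and the parsing assumes the response always has the shape shown in the example.

```python
import json
from urllib.parse import urlencode

API_BASE = "http://www.stopforumspam.com/api"  # endpoint from the example above


def build_url(ip):
    """Build a lookup URL that asks for JSON so 'confidence' is included."""
    return API_BASE + "?" + urlencode({"ip": ip, "f": "json"})


def extract_confidence(body, field="ip"):
    """Return the confidence percentage for `field`, or None if it is absent."""
    data = json.loads(body)
    if not data.get("success"):
        return None
    record = data.get(field)
    if not isinstance(record, dict):
        return None
    return record.get("confidence")


# Response copied from the example above, so no network access is needed here.
SAMPLE = ('{"success":1,"ip":{"lastseen":"2012-01-31 16:17:22",'
          '"frequency":35,"appears":1,"confidence":54.12}}')

print(build_url("1.2.3.4"))
print(extract_confidence(SAMPLE))
```

Fetching the URL is left out on purpose; any HTTP client can pass the response text to `extract_confidence`.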
I hope that people start using it and use it to help decide what to do with those entries that you could otherwise either reject or accept.
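One way to act on the score is a pair of cut-offs, sketched below. The function name `decide`, the thresholds (90 and 40), and the choice to accept entries with no score are all assumptions for illustration, not recommended values; tune them to your own forum.

```python
def decide(confidence, reject_at=90.0, review_at=40.0):
    """Map a confidence percentage to an action; the cut-offs are made up."""
    if confidence is None:
        return "accept"  # assumption: no score reported means nothing is known
    if confidence >= reject_at:
        return "reject"
    if confidence >= review_at:
        return "review"  # borderline: hold for moderation instead of a hard yes/no
    return "accept"


print(decide(54.12))
```

With these cut-offs the 54.12 from the example above lands in the middle band, which is the case a binary yes/no cannot express.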
As usual, for bugs, please post here, use the contact form, or, if you can keep the swearing to under 140 characters, Twitter.
Offline
#2 2012-02-27 9:36 pm
- TheVisitors
- Member
- Registered: 2010-10-15
- Posts: 13
- Website
Re: New "Confidence" Feature is Now Live
Such a cool little feature.
12 GB of ram, huh? (as you said on Twitter)
Maybe something that would be worth contributing for.
Offline
#3 2012-02-27 10:51 pm
- Katana
- Member
- Registered: 2009-08-18
- Posts: 1,886
Re: New "Confidence" Feature is Now Live
Squee, my scoring code's live! :3
うるさいうるさいうるさい!
Offline
#4 2012-02-27 10:53 pm
- pedigree
- uıɐbɐ ʎɐqǝ ɯoɹɟ pɹɐoqʎǝʞ ɐ buıʎnq ɹǝʌǝu ɯ,ı
- From: New Zealand
- Registered: 2008-04-16
- Posts: 7,104
Re: New "Confidence" Feature is Now Live
The server has 24 GB now, which is enough to move forward with reducing the MySQL footprint and moving a massive chunk of processing into MongoDB.
Offline
#5 2012-02-27 11:15 pm
- kpatz
- Member
- Registered: 2008-10-09
- Posts: 1,437
Re: New "Confidence" Feature is Now Live
I think Ped is trying to get the entire SFS database loaded into RAM.
Spam happens when greed meets stupidity.
Offline
#6 2012-02-27 11:33 pm
- pedigree
- uıɐbɐ ʎɐqǝ ɯoɹɟ pɹɐoqʎǝʞ ɐ buıʎnq ɹǝʌǝu ɯ,ı
- From: New Zealand
- Registered: 2008-04-16
- Posts: 7,104
Re: New "Confidence" Feature is Now Live
Would need more RAM for that: 48 GB.
Offline
#7 2012-02-27 11:39 pm
- Katana
- Member
- Registered: 2009-08-18
- Posts: 1,886
Re: New "Confidence" Feature is Now Live
Would need more RAM for that
48gb
100+ considering how it's constantly growing.
うるさいうるさいうるさい!
Offline
#8 2012-02-28 1:27 pm
- kpatz
- Member
- Registered: 2008-10-09
- Posts: 1,437
Re: New "Confidence" Feature is Now Live
Wow... 48 GB? That's a lot of spammers.
Well, with 7 billion "humans" on this planet, 16,402,099 complete retards is probably a conservative estimate.
Spam happens when greed meets stupidity.
Offline
#9 2013-07-16 3:10 pm
- banp
- Member
- Registered: 2013-07-16
- Posts: 2
Re: New "Confidence" Feature is Now Live
Hey,
Thanks for this feature, I believe this could emerge into something really useful. Although to really assess the confidence and use the score I think it is necessary to understand how it is derived. Could you share some details on it? Do you fit it to some standard distribution or do you use some nonparametric methods?
Offline
#10 2013-07-16 3:15 pm
- Alex Kemp
- Moderator
- From: Nottingham, England
- Registered: 2009-12-02
- Posts: 2,457
- Website
Re: New "Confidence" Feature is Now Live
Hi banp, welcome to SFS.
Search for `Wilson scoring'.
Offline
#11 2013-07-16 8:36 pm
- pedigree
- uıɐbɐ ʎɐqǝ ɯoɹɟ pɹɐoqʎǝʞ ɐ buıʎnq ɹǝʌǝu ɯ,ı
- From: New Zealand
- Registered: 2008-04-16
- Posts: 7,104
Re: New "Confidence" Feature is Now Live
It's not the perfect usage of the scoring system, but it's better than the binary yes/no.
Offline
#12 2013-07-24 9:45 am
- banp
- Member
- Registered: 2013-07-16
- Posts: 2
Re: New "Confidence" Feature is Now Live
Hello guys, and thanks for your quick answer!
I have now had time to have a look at Wilson Score and I have to say I do not fully understand how it is used here.
As I understand it, the Wilson score is (like any binomial proportion confidence interval) a confidence interval for a proportion. So it should be like "With 95% probability the real proportion belongs to the interval (0.4;0.6)", whereas you supply just one number like "54.12%".
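For reference, the textbook Wilson score interval for a binomial proportion can be sketched as follows. This is the standard formula only; how SFS maps its data onto "successes" and "trials", and how it reduces the interval to a single number, is not stated in this thread. The function name `wilson_interval` and the default z = 1.96 (roughly 95%) are choices made for this sketch.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Return (lower, upper) Wilson score bounds for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials                      # observed proportion
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom  # interval midpoint, pulled toward 0.5
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


low, high = wilson_interval(50, 100)
print(round(low, 4), round(high, 4))
```

A single reported figure could be the lower bound, the midpoint, or something else derived from these two numbers; the thread does not say which.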
In the "API usage" one can find that the score is "based on the last seen date and the number of sightings". So is the proportion for a given email defined as "unclean sightings / (clean sightings + unclean sightings)" or "time periods when the email was unclean / all time periods checked"? I am also not sure if the distribution is binomial. It seems to me that it is quite autoregressive: once an email becomes a spammer, it is more probable that it will stay that way. And from the moment it becomes a spammer, almost all posts from it will be spam posts...
If it is not secret, please reveal a little more.
Kind regards,
banp
Offline
#13 2013-07-24 2:53 pm
- Alex Kemp
- Moderator
- From: Nottingham, England
- Registered: 2009-12-02
- Posts: 2,457
- Website
Re: New "Confidence" Feature is Now Live
As pedigree says, it is far from perfect; we wanted a way to try to express the reliability of those db results numerically, and a former contributor suggested `Wilson scoring' since algorithms were available for that. You are not the first to cast doubts on its suitability but, at this moment, it is all that we have.
The discussions at the point of its consideration & incorporation are all available to Registered members like yourself; just search to discover them. Then, offer something better. ped's time is *very* limited, so do not expect him to do all your work for you. Put together a better algorithm & it will be snapped up, with grateful thanks.
Offline
#14 2013-07-24 3:13 pm
- pedigree
- uıɐbɐ ʎɐqǝ ɯoɹɟ pɹɐoqʎǝʞ ɐ buıʎnq ɹǝʌǝu ɯ,ı
- From: New Zealand
- Registered: 2008-04-16
- Posts: 7,104
Re: New "Confidence" Feature is Now Live
Yes, it's not perfect; it is weighted a lot towards the number of listings and leans away from the lastseen date. If you do have other suggestions, I would love to look into implementing something more suitable... with credit given on the site.
The data you have to work with is:
1. Number of times we've seen the listing
2. The lastseen date for the listing.
That's all that is stored in the caches. The MySQL database provides a lot more, but the API does not access that.
Considerations
1. It has to be quick
2. It cannot access mysql
3. It has to change with time as active records become older.
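The two inputs and three considerations above can be sketched as a candidate function. This is a toy for discussion, not the algorithm SFS runs: the name `toy_confidence`, the 90-day half-life, and the idea of decaying the sighting count before scoring are all assumptions. It treats every sighting as a spam report, in which case the Wilson lower bound for n out of n simplifies to n / (n + z^2).

```python
from datetime import datetime


def toy_confidence(frequency, lastseen, now, half_life_days=90.0, z=1.96):
    """Hypothetical score in [0, 100] from a sighting count and a last-seen time."""
    if frequency <= 0:
        return 0.0
    age_days = max((now - lastseen).total_seconds() / 86400.0, 0.0)
    # Exponential decay: the count is worth half as much every half_life_days,
    # so the score changes with time as active records become older.
    effective = frequency * 0.5 ** (age_days / half_life_days)
    # Wilson lower bound when all 'effective' trials are successes.
    return 100.0 * effective / (effective + z * z)


seen = datetime(2012, 1, 31, 16, 17, 22)
print(toy_confidence(35, seen, now=datetime(2012, 2, 27)))
print(toy_confidence(35, seen, now=datetime(2013, 2, 27)))
```

It is constant-time arithmetic on the two cached values, so it needs no database access; whether the resulting numbers are any more meaningful than the current ones would have to be checked against real data.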
I can expand the data in the caches to include other data, but cache size needs to be considered. Currently the API data is running at 1 GB of RAM, about 16 bytes per record and chunked into slabs. The API cannot use stupid amounts of memory, as the mirror nodes have minimal RAM.
Offline
#15 2017-11-09 3:27 am
- mcserverstore
- Member
- Registered: 2017-11-08
- Posts: 6
Re: New "Confidence" Feature is Now Live
This confidence feature is a great idea; I just developed an API client using it.
Offline