#1 2016-02-15 10:37 pm

pedigree
uıɐbɐ ʎɐqǝ ɯoɹɟ pɹɐoqʎǝʞ ɐ buıʎnq ɹǝʌǝu ɯ,ı
From: New Zealand
Registered: 2008-04-16
Posts: 7,056

downloadable email files

At the moment the downloadable email files don't contain any of the non-normalised email addresses, only the normalised data.

I've had a request to include the non-normalised email addresses to help cleanup scripts that don't handle normalised addresses. Putting aside the endless permutations possible with a gmail address, I don't see any reason not to include them.

What do you guys think?

Offline

#2 2016-02-15 11:00 pm

Alex Kemp
Moderator
From: Nottingham, England
Registered: 2009-12-02
Posts: 2,423
Website

Re: downloadable email files

That sounds like someone asking you to do extra work because they cannot be bothered to do any extra work at their end. However, your choice.

Offline

#3 2016-02-16 12:52 am

NeoFox
Member
From: WI, USA, Earth
Registered: 2013-09-26
Posts: 830
Website

Re: downloadable email files

Alex Kemp wrote:

That sounds like someone asking you to do extra work because they cannot be bothered to do any extra work at their end. However, your choice.

As rare as it is, me and Alex agree on this LOL.

Offline

#4 2016-02-16 1:49 am

zero-tolerance
Member
Registered: 2013-02-25
Posts: 339

Re: downloadable email files

I see arguments both for and against:

If it was just a question of whether the normalisation is centralised or distributed, I would argue for doing it in one place, because the logic changes over time and having it consistently applied is a Good Thing. This would argue for reporting the non-normalised addresses.

But since reporting the non-normalised addresses would only report the variants that have actually been reported here, a cleanup script that is naive about normalisation will miss any variants that it has seen locally but that are not reported here.
So while reporting non-normalised addresses would improve the success rate of a naive cleanup script, it will not do as well as applying the algorithm locally.

An alternative would be to make the normalisation algorithm downloadable, either abstractly or as a PHP function.
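
To illustrate what a downloadable normalisation function might look like, here is a minimal sketch in Python rather than PHP. It is not the algorithm actually used here, which has not been published in this thread. The function name `normalise_email` and the rule set are assumptions: it only applies the widely known gmail behaviours (case is ignored, dots in the local part are ignored, anything after a "+" is ignored, and googlemail.com is an alias of gmail.com).

```python
# Hypothetical sketch only: NOT the normalisation rules this site uses.
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def normalise_email(address: str) -> str:
    """Return a canonical form of an email address for comparison."""
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")

    # Case-fold both parts, as most comparison code already does.
    local = local.lower()
    domain = domain.lower()

    if domain in GMAIL_DOMAINS:
        local = local.split("+", 1)[0]  # drop any "+tag" suffix
        local = local.replace(".", "")  # gmail ignores dots in the local part
        domain = "gmail.com"            # fold the googlemail.com alias

    return f"{local}@{domain}"


if __name__ == "__main__":
    print(normalise_email("Som.E.Address+forum@GoogleMail.com"))
    print(normalise_email("first.last@example.org"))
```

Addresses at other providers are only case-folded here, since their rules differ and are not discussed in this thread.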

Currently, normalising gmail addresses would alter about 20% of them on my forum. This is probably also true of spammers who are not deliberately de-normalising their addresses to evade matching. If using it for evasion is relatively rare, it may be better to report the non-normalised ones: if it's not practicable for users of the data to keep up with your algorithm over time, normalising addresses centrally may actually cause more misses than it saves. Having said that, the normalisation algorithm is apparently moving towards standardisation, so that problem may go away over time: any code that compares email addresses may be expected to be able to normalise them, just as we currently do with case-folding.


On balance, I would agree with Alex and BlueSage.

But suggest publishing the normalisation algorithm/rules that are currently used.

Offline

#5 2016-02-16 3:06 am

Maikuolan
Member
From: Perth, Western Australia
Registered: 2011-08-09
Posts: 799
Website

Re: downloadable email files

+1 to all of the above; it sounds like someone is trying to dump responsibility for how -they- should be handling the SFS data onto -you-.

Additionally, though, I would ask - especially in the case of regularly well-permuted gmail addresses - if a local handler isn't capable of correctly handling normalised data, how effective, exactly, would bothering to even actually make use of the SFS database be, when it comes to preemptively dealing with spammers?

If a spammer registers with some.address@gmail.com, and this is then blacklisted by the forum at which they're registering, they could simply re-register with something like som.e.address@gmail.com to completely bypass that blacklisting; one of the very points and purposes of normalisation is to prevent spammers from being able to do this sort of thing. So, if a "naive cleanup script" (one way to put it; adequately accurate and succinct, although I'd probably go as far as referring to it as a stupid or useless cleanup script, too) isn't capable of doing this, I don't know how effective using SFS would actually even be for them (or how effective the use of -any- sort of database for the prevention of spam would be, for that matter).
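
The bypass described above can be shown in a few lines of Python. This is only a sketch: `gmail_key` is a hypothetical helper, not anything SFS provides, and it assumes only the well-known gmail rules (dots and "+tag" suffixes in the local part are ignored, and googlemail.com is the same mailbox as gmail.com).

```python
def gmail_key(address: str) -> str:
    """Hypothetical helper: collapse gmail variants to one comparison key."""
    local, _, domain = address.lower().rpartition("@")
    if domain in ("gmail.com", "googlemail.com"):
        local = local.split("+", 1)[0].replace(".", "")
        domain = "gmail.com"
    return f"{local}@{domain}"


blacklist = {"some.address@gmail.com"}
attempt = "som.e.address@gmail.com"

# A naive exact-match lookup misses the re-registration.
naive_hit = attempt in blacklist

# Comparing normalised keys on both sides catches it.
normalised_hit = gmail_key(attempt) in {gmail_key(a) for a in blacklist}

print(naive_hit, normalised_hit)  # prints: False True
```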

The only way to really get around this would be perhaps to expand (or reverse-normalise) data prior to release (ie, pre-populate a non-normalised email database download file with all potential permutations prior to release), but this would probably be a really bad idea anyway, for multiple reasons (the total size of such a file would likely be something on the order of 10 times that of its non-expanded non-normalised counterpart, and, if a "naive cleanup script" already can't handle normalised data, I wonder what problems could be posed to it by needing to handle a file of that size, with so many potential entries, etc; plus, the additional processing required to generate it in the first place could potentially create a horrendous nightmare of a problem for the SFS servers themselves). This idea, also, isn't very practical, because permutations don't always follow the same strict pattern, and so, even for a single email address, you can have a ridiculously large number of permutations.
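
As a rough illustration of how quickly the permutations grow for a single address, here is a Python sketch that looks at dot placement alone. The function names are made up for this example, and it assumes one optional dot between each pair of adjacent characters in a gmail local part. It ignores "+tag" suffixes and the googlemail.com alias, which only add more variants.

```python
from itertools import product


def count_dot_variants(local: str) -> int:
    """Number of ways to place single dots between the characters of local."""
    letters = local.replace(".", "")
    # Each of the len - 1 gaps between characters either has a dot or not.
    return 2 ** max(len(letters) - 1, 0)


def dot_variants(local: str):
    """Yield every dotted spelling of local (hypothetical helper)."""
    letters = local.replace(".", "")
    if not letters:
        return
    for mask in product((False, True), repeat=len(letters) - 1):
        out = [letters[0]]
        for ch, dot in zip(letters[1:], mask):
            out.append("." + ch if dot else ch)
        yield "".join(out)


if __name__ == "__main__":
    print(sorted(dot_variants("abc")))
    print(count_dot_variants("some.address"))
```

For a local part like "some.address" (11 characters once the dot is removed) that is already 2 to the power of 10 dotted spellings, before counting any other kind of variant.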

In short: if someone is seriously trying to combat spammers, and is seriously telling you that they can't normalise data on their end, I would be inclined to ask WTF they are actually doing and thinking.

Offline
