Suggestions for a permanent solution to the Arabic spam threads

Started by caffeine, July 21, 2014, 07:33:40 AM

caffeine

Nearly every morning for the last couple of months, HD has gotten bombarded with spam threads. A few hours later, one of the west coast mods wakes up and deletes them. Does anyone have any ideas for a more permanent solution?

Blitz suggested automatically hiding threads with Arabic in the topic title. I remember in the TC days, I straight up banned Russian emails from signing up (not a good solution). I think what solved the problem on TC for me was a sign-up question that asked, "How many different Tetris pieces are there?" or something like that.

djackallstar

Quote from: caffeine
Nearly every morning for the last couple of months, HD has gotten bombarded with spam threads. A few hours later, one of the west coast mods wakes up and deletes them. Does anyone have any ideas for a more permanent solution?

Blitz suggested automatically hiding threads with Arabic in the topic title. I remember in the TC days, I straight up banned Russian emails from signing up (not a good solution). I think what solved the problem on TC for me was a sign-up question that asked, "How many different Tetris pieces are there?" or something like that.

A spammer is either a bot or a human.

For a bot spammer, an easy and effective way is to use CAPTCHAs to make it hard to register an account.

For a registered bot, mods can just delete its account after seeing its spam.

For a human spammer, CAPTCHAs are useless for preventing registration.
The following imperfect heuristic methods are often used in some forums I hang around:
1. A username is invalid if it contains certain patterns, hence can't be used to register an account. (Username check)
2. An email is invalid if it is not from any specified ISP. (ISP check)
3. One cannot register an account in certain geographic locations where most spammers come from. (IP check)
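
A minimal sketch of the first two checks, assuming they run at registration time. The patterns, the domain list, and every function name below are placeholders for illustration, not rules any forum actually uses. The IP check is left out because it would need a geolocation database.

```javascript
// Hypothetical registration filter. The patterns and domains are placeholders.
var BLOCKED_USERNAME_PATTERNS = [
  /\d{6,}$/,   // username ends in six or more digits
  /(.)\1{4,}/  // same character five or more times in a row
];
var ALLOWED_EMAIL_DOMAINS = ["gmail.com", "yahoo.com", "outlook.com"];

// Username check: reject names that match any blocked pattern.
function isUsernameAllowed(username) {
  for (var i = 0; i < BLOCKED_USERNAME_PATTERNS.length; i++) {
    if (BLOCKED_USERNAME_PATTERNS[i].test(username)) { return false; }
  }
  return true;
}

// ISP check: accept only emails whose domain is on the allow list.
function isEmailAllowed(email) {
  var at = email.lastIndexOf("@");
  if (at < 1) { return false; } // no "@" or nothing before it
  var domain = email.slice(at + 1).toLowerCase();
  return ALLOWED_EMAIL_DOMAINS.indexOf(domain) !== -1;
}

function canRegister(username, email) {
  return isUsernameAllowed(username) && isEmailAllowed(email);
}
```

An allow list like this also rejects legitimate users with less common email providers, which is part of why these heuristics are imperfect.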

For a registered human spammer, HD should have a mechanism that detects whether a thread is spam when it is posted. Once a thread is judged to be spam, the thread is trashed and the thread owner is banned for one week, automatically. If the thread owner does not explain, or fails to explain, his behavior to a mod, his account is then deleted automatically.

As for how to judge whether a thread is spam, some heuristic regular expressions will do.
The spam patterns on HD are quite regular, from what I can see.
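
The trash-and-ban flow described above could be sketched roughly as follows. This is only an illustration: the `thread` and `user` objects, their field names, and both function names are assumptions, and `isSpam` stands in for whatever heuristic the forum settles on.

```javascript
var ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Called when a thread is posted. `isSpam` is a hypothetical detector
// function; `now` is a timestamp in milliseconds.
function handleNewThread(thread, user, isSpam, now) {
  if (!isSpam(thread.title + "\n" + thread.body)) {
    return "posted";
  }
  thread.trashed = true;                 // trash the thread
  user.bannedUntil = now + ONE_WEEK_MS;  // automatic one-week ban
  user.explained = false;                // a mod sets this after an explanation
  return "trashed";
}

// Called when a ban is reviewed, for example by a scheduled job.
function reviewBan(user, now) {
  if (typeof user.bannedUntil !== "number") { return "not banned"; }
  if (now < user.bannedUntil) { return "still banned"; }
  if (user.explained) {
    delete user.bannedUntil;
    return "reinstated";
  }
  user.deleted = true;                   // no explanation given: delete the account
  return "deleted";
}
```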

That being said, the main problem is whether Blink and his brother have time to implement these features.

ohitsstef

Quote from: djackallstar
That being said, the main problem is whether Blink and his brother have time to implement these features.


This.

And thank Aaron for deleting all of the spam we've been getting the past month or so.

We do not forgive. We do not forget.

Paul676

Set every user's first post to manual approval from an admin.
               Tetris Belts!

officegunner

Solution to Arabic Spam on HD:

In response to 403 (and incrementing) posts of Arabic spam, officegunner has decided to write a script that sieves out the pestering spam once and for all.

http://jsbin.com/depopoqo/3/edit

Background on the spam:

A quick Google search shows that this spam is common on other sites, such as Adobe forums, Android Central, Amazon Kindle forums, Autohotkey...

Given the mass distribution of the posts, one can assume that the spam is merely automated, and not done by any intelligent lifeform. This means that if we have a way of preventing such spam from being posted, the bot/person behind it would not bother with a workaround, hence no more spam.

What this code does:

This code does two things: it checks for Arabic characters and it checks for repetitive keywords.

1) Arabic Characters

function checkArabic(string){
  // True if the string contains any character in the ranges below.
  var arabic = /[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufc3f]|[\ufe70-\ufefc]/;
  return arabic.test(string);
}


This regular expression checks whether characters within the given ranges are present in the string (the post). More specifically, the regex draws its ranges from the Arabic, Arabic Supplement, Arabic Presentation Forms-A and Arabic Presentation Forms-B blocks.

If any of these characters are present in the string, then the function returns true.

2) Checking for repeats

function checkRepeat(string){
  var repeat = false;
  var rows = string.split("\n");
  for(var i=0; i<rows.length; i++){
    // Start at -1 so the first word matching itself is not counted as a repeat.
    var repeatCount = -1;
    var words = rows[i].split(" ");
    for(var j=0; j<words.length; j++){
      if(words[0] === words[j]){
        repeatCount++;
      }
    }
    if(repeatCount>7){
      repeat = true;
      break; // one repetitive row is enough
    }
  }
  return repeat;
}


This code parses the string, first row by row, then word by word. Basically, if the first word of a row is repeated more than seven times within that row, then the function returns true.

For the post to be considered spam, it has to satisfy BOTH conditions. I have not seen a non-spam post that has triggered both of these conditions.
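
Combining the two conditions might look like the sketch below. Both checks are restated so the block runs on its own; the wrapper name `isSpam` is a hypothetical name for illustration and does not come from the linked script.

```javascript
function checkArabic(text) {
  return /[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufc3f]|[\ufe70-\ufefc]/.test(text);
}

function checkRepeat(text) {
  var rows = text.split("\n");
  for (var i = 0; i < rows.length; i++) {
    var words = rows[i].split(" ");
    var repeatCount = -1; // the first word always matches itself once
    for (var j = 0; j < words.length; j++) {
      if (words[0] === words[j]) { repeatCount++; }
    }
    if (repeatCount > 7) { return true; }
  }
  return false;
}

// A post is flagged only when BOTH checks fire.
function isSpam(text) {
  return checkArabic(text) && checkRepeat(text);
}
```

Requiring both conditions means an ordinary post written in Arabic, or an English post that merely repeats a word, is not flagged by this check.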

Since the spammer is supposedly not intelligent, we can simply implement this on the client side.

Even if the spammer does try to work around this code, he needs to remove all Arabic characters and repeating words, which is the quintessence of the spam anyway.

-officegunner


djackallstar

Quote from: officegunner
1) Arabic Characters
function checkArabic(string){
var arabic = /[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufc3f]|[\ufe70-\ufefc]/;
if(arabic.test(string) === true){return true}else{return false}
}

This Regular Expression checks whether characters within the range are present in the string (the post). More specifically, the Regex includes Arabic, Arabic Supplement, Arabic Presentation Forms-A and Arabic Presentation Forms-B.

https://en.wikipedia.org/wiki/Basic_Multili...tilingual_Plane

Arabic Extended-A (08A0–08FF) should also be included.


var containsArabicChar = function(s){
  return /[\u0600-\u06ff]|[\u0750-\u077f]|[\u08a0-\u08ff]|[\ufb50-\ufc3f]|[\ufe70-\ufefc]/.test(s);
};
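
For comparison, here is a sketch that follows the Unicode block boundaries more closely. Arabic Presentation Forms-A runs from U+FB50 to U+FDFF, so the \ufb50-\ufc3f range in the thread covers only part of it. Arabic Presentation Forms-B runs from U+FE70 to U+FEFF, but U+FEFF is the byte order mark, so stopping at U+FEFC as the thread does avoids flagging text that merely starts with a BOM. The function name is hypothetical, and whether the wider Forms-A range is wanted is a judgment call.

```javascript
// One character class instead of several alternatives; behavior is the
// same as joining the ranges with "|".
var ARABIC_RANGES = new RegExp(
  "[" +
  "\\u0600-\\u06ff" + // Arabic
  "\\u0750-\\u077f" + // Arabic Supplement
  "\\u08a0-\\u08ff" + // Arabic Extended-A
  "\\ufb50-\\ufdff" + // Arabic Presentation Forms-A (whole block)
  "\\ufe70-\\ufefc" + // Arabic Presentation Forms-B, stopping before U+FEFF (BOM)
  "]"
);

function containsArabicBlockChar(s) {
  return ARABIC_RANGES.test(s);
}
```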

officegunner

yes it should, making the full regex


var arabic = /[\u0600-\u06ff]|[\u0750-\u077f]|[\u08A0-\u08FF]|[\ufb50-\ufc3f]|[\ufe70-\ufefc]/;


I'm more concerned about posts that are caught by this by accident, and the actual implementation. In the end, it's all about Blink's opinion and willingness to use this.

-officegunner

Kitaru

Quote from: Paul676
Set every user's first post to manual approval from an admin.
Especially if it includes a link or image. First post may be legitimate otherwise, but signing up and immediately link dropping increases reason for automated suspicion.
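
A rough sketch of that rule, assuming the forum tracks a per-user post count and that images arrive as BBCode `[img]` tags or raw `<img>` tags. The field name `postCount`, the function name, and the pattern itself are assumptions for illustration.

```javascript
// Hypothetical check: hold a post for review only when it is the user's
// first post AND it contains a link or an image.
var LINK_OR_IMAGE = /https?:\/\/|www\.|\[img\]|<img\b/i;

function needsManualApproval(user, postBody) {
  return user.postCount === 0 && LINK_OR_IMAGE.test(postBody);
}
```

A first post without links or images goes straight through, so ordinary newcomers would not have to wait.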

officegunner

Quote from: Kitaru
Quote from: Paul676
Set every user's first post to manual approval from an admin.
Especially if it includes a link or image. First post may be legitimate otherwise, but signing up and immediately link dropping increases reason for automated suspicion.

http://harddrop.com/forums/index.php?showtopic=6158

That would have little effect, since the Arabic spam posts contain neither links nor images.

-officegunner

caffeine

I don't like the idea of admin-approved first topics. I definitely don't like the idea of having to approve a post. It's difficult enough to get lurkers to join in. If you make them wait, they're liable to not even bother.

benmullen

How many Tetris pieces are there?

I would answer either 7, or infinite

officegunner

Quote from: caffeine
I don't like the idea of admin-approved first topics. I definitely don't like the idea of having to approve a post. It's difficult enough to get lurkers to join in. If you make them wait, they're liable to not even bother.

http://harddrop.com/forums/index.php?showt...=6585&st=4#

InkofDeath

Use CAPTCHA3 on registration.
Ask an abstract question on registration, such as "How many T-spins are there?" or "What are the blocks called in Tetris?". Accept all variations of the answers, with caps, spaces, or underscores.

This will stop 100% of automated bots. It will also stop a large majority of non-English speaking human spammers (they can't Google, or easily translate).

If they start getting through, change the question. Or rotate between 4-10 questions.
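
One way to sketch the question rotation and the tolerant answer matching is shown below. The question list, the accepted answers, and all names are placeholders for illustration. Answers are compared after stripping case, spaces, underscores, and hyphens, so each accepted answer only needs to be listed once in its normalized form.

```javascript
// Placeholder question pool; accepted answers are stored already normalized.
var QUESTIONS = [
  { prompt: "How many different Tetris pieces are there?", answers: ["7", "seven"] },
  { prompt: "What are the blocks called in Tetris?", answers: ["tetrominoes", "tetrominos", "tetriminos"] }
];

// Lowercase the answer and drop whitespace, underscores, and hyphens.
function normalizeAnswer(answer) {
  return String(answer).toLowerCase().replace(/[\s_\-]+/g, "");
}

// Rotate through the pool; `seed` is any non-negative integer,
// for example a registration counter.
function pickQuestion(seed) {
  return QUESTIONS[seed % QUESTIONS.length];
}

function isCorrect(question, answer) {
  return question.answers.indexOf(normalizeAnswer(answer)) !== -1;
}
```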

exchliore

There's not really a permanent solution. But if you made the signup process a little more dynamic, most bots won't be able to figure it out.

1. Qualitative questions. Show a picture of a cat and ask "What animal is this?" and accept any answer that is cat-like (cat, kitten, lion, tiger). Other questions can be "What direction is this arrow pointing?" or "What do you drink water out of?". Rotate these questions.

2. Have a multi-page sign up process. The register page says steps 1-4, but if the account is made after step 1, then it is too easy to make an account. If the registration is separated out, it makes writing a bot less worthwhile. For example, if the first page is account details and the second page is captcha/verification, then someone would need to be devoted to writing a bot specifically for that process.

3. Make a JavaScript timer that delays the registration request (hit the server with a session token once, and then the registration payload). Most bots will not obey timers. The timer doesn't have to be long, 1-2 seconds. If a registration comes in too fast, then you kick it out.

4. Do not use a static registration URL (e.g. /public/register). Also, do not reveal the URL to the user. A form/post is too easy to code to. Have AJAX pull down a dynamic registration URL when a user clicks "Join Hard Drop". Things like this break most of the bots out there and confuse most script kiddies (it helps if you obfuscate the JavaScript as well).
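
The timer idea in step 3 could be enforced on the server roughly like this. It is a sketch under assumptions: tokens live in an in-memory object, the one-second threshold and all names are made up for illustration, and `Math.random` is not a secure token source for production use.

```javascript
var MIN_DELAY_MS = 1000;  // assumed minimum time between token and payload
var issuedTokens = {};    // token -> time issued (in-memory, illustration only)

// First request: hand out a session token and remember when it was issued.
function issueToken(now) {
  // Math.random is NOT cryptographically secure; placeholder only.
  var token = Math.random().toString(36).slice(2) + now.toString(36);
  issuedTokens[token] = now;
  return token;
}

// Second request: accept the registration payload only if the token is
// known, unused, and old enough.
function acceptRegistration(token, now) {
  if (!Object.prototype.hasOwnProperty.call(issuedTokens, token)) {
    return false;
  }
  var issuedAt = issuedTokens[token];
  delete issuedTokens[token]; // tokens are single-use
  return now - issuedAt >= MIN_DELAY_MS;
}
```

Because the server checks the elapsed time, a bot that skips the client-side timer and posts immediately is rejected.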

Keep in mind that if someone is truly dedicated, he/she will write a bot for your registration process.