Talk:Spamwords/Archive 1
Bob: I just posted a copy of the banned words list that the wiki currently uses. I will set it up so that a script automatically fetches this page at a reasonable interval -- although I haven't done this yet. Don't go crazy on banned words, since we really only want to ban real spammers, and not people who, for example, are talking about pharmacy-related RPGs (Pill Quest!). Also, don't link to this page from anywhere obvious, so the spammers themselves are less likely to notice it.
Mike: Well, it'll appear on the recent changes any time we edit it (or talk about it).
Anyway, so you liked my idea, eh? Cool.
Bob: This is still not active yet; I'm just updating it to reflect some manual changes I made today. Will look at the problem more seriously when I really get back from vacation (still in Tennessee currently). TTYL
Mike: I wish those evil spam-bots of doom would at least handle a literal quote properly. *rolls eyes*
Mike: Hmm, you know what would be a very effective spam rule?
http\:\/\/ #filter hyperlinks (which of those escaped symbols really need to be?)
If someone really wants to post a link, they should at least log in to do so.
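To answer the parenthetical question: in PCRE, `:` never needs escaping, and `/` only needs escaping when the pattern is delimited by slashes, as PHP patterns conventionally are. Outside slash-delimited patterns, neither escape is required. A quick illustration in Python (used here only because it's easy to test; the wiki's filter itself is PHP):

```python
import re

# ':' and '/' are ordinary regex characters; escaping them is harmless
# but only necessary when '/' doubles as the pattern delimiter (PHP PCRE).
plain = re.compile(r"http://")
escaped = re.compile(r"http\:\/\/")  # same meaning, redundant escapes

text = "spam: visit http://example.com now"
print(bool(plain.search(text)), bool(escaped.search(text)))  # True True
```

Both patterns match identically, so the escapes in the rule above are cosmetic except inside PHP's `preg_match('/.../')` delimiters.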
Bob: The problem with that is, as currently implemented, the filter uses the full-text of the edited page, not the diff between old and new versions. So if we blocked all external links, then an anonymous user who tried to fix a spelling error on a page that already contained a link would be blocked.
Mike: Eh, that also means that if (for whatever reason) we had a page with viagra on it, and said anonymous user tried to edit it, he'd be nuked as well... Hmm, perhaps it should apply to the diffs? Surely it can't be that difficult... Ok, that was bad, I know. But my point still stands.
Bob: Yeah. So far, that has not happened (the full text of all failed attempts goes to my e-mail so I can check for false positives), but checking the diff would be much better, provided that I took the time to implement it. I am using MediaWiki's $wgFilterCallback, and theoretically I should have access to the whole system, but I currently only understand it well enough to retrieve the post text and the user name/IP.
Bob: Wow. The spams just keep coming faster and faster. I am disturbed. The latest wave included 30 defacings spaced about 5 seconds apart, coming from multiple IP addresses (definitely either a botnet or an anonymizing proxy) and using several throwaway accounts. The filters are holding really well so far, but if the attacks evolve much further, I may have to implement some kind of captcha... and I hope it doesn't come to that, 'cause I hate captchas :(
Mike C. I dunno if that's a good idea. Ysoft's ohrgfx program is hosted on geocities...
Bob the Hamster I'd almost rather host it myself than un-block geocities URLs
Mike C. I think my casino rules will take care of the rest of the geocities spam.
Of course, once you get it to filter based on the diff, you'll be able to use the rule I posted above:
(https?|ftp):\/\/.*
(actually, that's a better version of it, filtering multiple protocols)
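For what it's worth, the multi-protocol rule behaves as intended. Here it is exercised in Python (illustrative only; the redundant `\/` escapes are carried over from the slash-delimited PCRE original):

```python
import re

# Mike's rule: match http, https, or ftp links anywhere in the text.
# Python treats the \/ escapes as plain '/'.
rule = re.compile(r"(https?|ftp):\/\/.*")

for sample in ("http://a.example", "https://b.example", "ftp://c.example"):
    assert rule.search(sample)
assert not rule.search("gopher://d.example")  # other schemes pass through
```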
Mike C. Ok, I'm working on improving the filter. MediaWiki has a nice DifferenceEngine class which would seem ideal for this. Alas, it only returns the formatted HTML output you see on the Diff pages. So, I copied the method of that class that does the work, and I'm tweaking it. Alas, again, I need to install MediaWiki so I have somewhere to test this thing. Will report later.
Edit: By the way, MediaWiki 1.6 was released, like, ages ago. And yet, we're still running this dinosaur version of 1.5! :(
Bob the Hamster I update to security releases very quickly, but 1.6 is a big release, and I need to do a little testing first. (Besides, you noticed how quickly they had to release 1.6.1 and 1.6.2; I am always careful about *.*.0 releases of anything.) I am really pleased about their new development model, and pretty soon I will start pulling directly from their svn repository, rather than downloading the tarball like I do now. Anyway, hopefully I will have a chance to do it before the end of the week.
Mike C. Ah, speaking from personal experience, eh? Heh. Anyway, I'll be setting up a test wiki shortly, running 1.6.2. It'll basically be a giant Sandbox... If I ever get it downloaded... :(
Mike C. All done. The filter now does diff-based filtering. As soon as you install it, we can rejoice!
Bob the Hamster No rejoicing yet. The diff changes work in the sense that they do not crash or produce any error messages. They do not work in the sense that they block any spam at all :(
The most obvious error I found was that you are returning the diff as an array of strings, when you should be returning a single multiline string. The second problem is that even when I fix that first problem, it still blocks nothing, and I don't have time to figure out why yet. The third problem is that it is actually producing the diff between the previous revision and the new revision, not the current revision and the new revision. That is to say, it is giving you the last 2 changes instead of the last 1 change... at least that is what it appeared to be doing in my tests. I was only testing by adding lines to the end of a page, so it could also have been a bug in the diffing that was giving a couple of lines above the changed one.
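The behaviour Bob is after (diff the current revision against the submitted text, join the additions into one multiline string, and run the spam rules only on that) can be sketched in Python with difflib. This is a conceptual model under those assumptions, not the actual MediaWiki filter code:

```python
import difflib
import re

SPAM_RULES = [re.compile(r"(https?|ftp)://")]  # example rule set

def added_lines(old_text, new_text):
    """Return only the lines the edit adds (current vs. submitted revision)."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]

def is_spam(old_text, new_text):
    # Join the additions into a single multiline string, as the filter
    # expects, then check the rules against just that text.
    addition = "\n".join(added_lines(old_text, new_text))
    return any(rule.search(addition) for rule in SPAM_RULES)

old = "A page about pharmacy-themed RPGs."
assert not is_spam(old, old + "\nfixed a typo")               # innocent edit passes
assert is_spam(old, old + "\nbuy pills at http://spam.example")  # added link blocked
```

Note that the pre-existing text never reaches the rules, so an anonymous user fixing a typo on a page that already contains a link is no longer blocked.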
Mike C. Using this snippet:
// Create a diff, for better filtering
$old_page = new Article($title);
$old = $old_page->fetchContent();
$diff = getDiff($old, $body);
$diff = implode("\n", $diff);
unset($old_page);
The first problem isn't a problem. I implode the string immediately after the function call, as you can see. getDiff is documented to return an array. Look at yonder comment.
The second problem, I've tested. I know I tested it. Check on my test wiki: http://ohrdev.com/wiki/ I just grabbed static copies of the filter lists, so if you register with Bob the Hamster, it will treat you as a not-evil user.
The third problem can't be. According to the class definition for Article, fetchContent (http://wikipedia.sourceforge.net/doc/MediaWiki/Article.html#fetchContent) is documented to return the current revision given no parameters. And, in my tests, that's what happened.
What I did to test was have it send me an email right after it created the diff. The email had (in this order) the diff on top, then the old revision, then the new revision. That way, it tells you what diff it's producing, and with which revisions.
Edit: Damnit, you uninstalled it, didn't you? How am I supposed to prove you wrong unless I can test it? :(
Bob the Hamster Yes, I switched back to the older version. I see the $diff = implode("\n", $diff); line now; I missed it. We can test this out more later on.
Mike C. Error report on the "try to trick the spammers" page:
Warning: preg_match(): Compilation failed: range out of order in character class at offset 5 in /home/james/src/ohrrpgce-web/mediawiki-spamcallback.php on line 28
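That compilation error usually means a character class contains a descending range, e.g. `[z-a]`, or an unintended range created by an unescaped `-` between two literals. The same mistake fails analogously in Python's regex engine (illustrative; the actual offending rule is whatever sits at that offset in the filter list):

```python
import re

# A descending range inside a character class is invalid, just like
# PCRE's "range out of order in character class".
try:
    re.compile(r"[z-a]")
    compiled = True
except re.error as exc:
    compiled = False
    print(exc)  # message like "bad character range z-a at position 1"

assert not compiled

# Two possible fixes: reorder the range, or escape the hyphen so it
# matches a literal '-'.
assert re.search(r"[a-z]", "spam")
assert re.search(r"[z\-a]", "z")  # matches literal 'z', '-', or 'a'
```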
And, if you figured out why my diff filter wasn't working, then why isn't it installed?